🎈 Optimize RSS push content (#259)

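* 🧪 test(tests): add unit tests for RSS

* 🎈 perf(rss and test): fix duplicated title and body text for some RSS sources

  Some RSS sources (for example RSSHub's Twitter feed) use the body text as the title, so the pushed message shows the same text twice. The Jaccard similarity coefficient is now used to decide whether deduplication is needed.

* Update nonebot_bison/platform/rss.py

  Co-authored-by: AzideCupric <57004769+AzideCupric@users.noreply.github.com>

* Update nonebot_bison/platform/rss.py

  Co-authored-by: AzideCupric <57004769+AzideCupric@users.noreply.github.com>

* 🐞 fix(platform/rss): fix similar text being missed when it appears at the end

* 🐞 fix(rss): fix a bug where the time of some feeds was not recognized correctly

  Some feeds only carry an updated tag, or no time tag at all; the previous code could only parse the time from the published tag. felinae98#275

* 🎈 perf(rss): change the string similarity comparison from the Jaccard similarity coefficient to the longest common subsequence

* 🦄 refactor(rss): refactor the string similarity comparison to use the standard library difflib instead of the hand-written LCS

* Update nonebot_bison/utils/__init__.py

  Co-authored-by: felinae98 <731499577@qq.com>

* Update nonebot_bison/platform/rss.py

* Update nonebot_bison/platform/rss.py

---------

Co-authored-by: AzideCupric <57004769+AzideCupric@users.noreply.github.com>
Co-authored-by: felinae98 <731499577@qq.com>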
2026-05-09 18:27:56 +08:00 · 2023-07-18 11:54:49 +08:00
parent 1db15ffc75
commit 9838e25bad
11 changed files with 843 additions and 5 deletions
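
The commit message above describes deduplicating the title and body with a difflib-based similarity score. The body of `text_similarity` is not part of the hunks shown here, so the following is only a minimal sketch of how such a helper might look, assuming the score is the total size of the matching blocks divided by the length of the shorter string. The helper `merge_title_and_desc` is a hypothetical name that mirrors the branching added to `parse`; the 0.8 threshold is taken from the diff.

```python
import difflib


def text_similarity(a: str, b: str) -> float:
    """Sketch only: share of the shorter string covered by matching blocks.

    The normalization by min(len(a), len(b)) is an assumption, not
    something confirmed by the diff.
    """
    if not a or not b:
        raise ValueError("both strings must be non-empty")
    matcher = difflib.SequenceMatcher(None, a, b)
    # get_matching_blocks() ends with a zero-size sentinel, which adds nothing.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def merge_title_and_desc(title: str, desc: str, threshold: float = 0.8) -> str:
    """Hypothetical helper mirroring the title/description logic in parse()."""
    if not title or not desc:
        # Only one side is present (or neither): use whichever exists.
        return title or desc
    if text_similarity(desc, title) > threshold:
        # Near-duplicates: keep the longer of the two texts.
        return desc if len(desc) > len(title) else title
    # Distinct texts: keep both, separated by a blank line.
    return f"{title}\n\n{desc}"


if __name__ == "__main__":
    print(merge_title_and_desc("Hello world", "Hello world, this is the full text"))
    print(merge_title_and_desc("Weekly update", "1234 5678"))
```

Dividing by the shorter length means a title that is a truncated copy of the body still scores high, which fits the RSSHub Twitter case the commit message mentions; a symmetric measure such as `SequenceMatcher.ratio()` would score that case lower.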
@@ -1,4 +1,5 @@
 import calendar
+import time
 from typing import Any, Optional

 import feedparser
@@ -7,10 +8,17 @@ from httpx import AsyncClient

 from ..post import Post
 from ..types import RawPost, Target
-from ..utils import scheduler
+from ..utils import SchedulerConfig, text_similarity
 from .platform import NewMessage


+class RssSchedConf(SchedulerConfig):
+
+    name = "rss"
+    schedule_type = "interval"
+    schedule_setting = {"seconds": 30}
+
+
 class Rss(NewMessage):

    categories = {}
@@ -19,7 +27,7 @@ class Rss(NewMessage):
    name = "Rss"
    enabled = True
    is_common = True
-    scheduler = scheduler("interval", {"seconds": 30})
+    scheduler = RssSchedConf
    has_target = True

    @classmethod
@@ -31,7 +39,12 @@ class Rss(NewMessage):
        return feed["feed"]["title"]

    def get_date(self, post: RawPost) -> int:
-        return calendar.timegm(post.published_parsed)
+        if hasattr(post, "published_parsed"):
+            return calendar.timegm(post.published_parsed)
+        elif hasattr(post, "updated_parsed"):
+            return calendar.timegm(post.updated_parsed)
+        else:
+            return calendar.timegm(time.gmtime())

    def get_id(self, post: RawPost) -> Any:
        return post.id
@@ -45,9 +58,17 @@ class Rss(NewMessage):
        return feed.entries

    async def parse(self, raw_post: RawPost) -> Post:
-        text = raw_post.get("title", "") + "\n" if raw_post.get("title") else ""
+        title = raw_post.get("title", "")
        soup = bs(raw_post.description, "html.parser")
-        text += soup.text.strip()
+        desc = soup.text.strip()
+        if not title or not desc:
+            text = title or desc
+        else:
+            if text_similarity(desc, title) > 0.8:
+                text = desc if len(desc) > len(title) else title
+            else:
+                text = f"{title}\n\n{desc}"
+
        pics = list(map(lambda x: x.attrs["src"], soup("img")))
        if raw_post.get("media_content"):
            for media in raw_post["media_content"]: