UKM 9838e25bad
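🎈 Optimize RSS push content (#259)
* 🧪 test(tests): add unit tests for RSS

* 🎈 perf(rss and test): fix duplicated title and body text for some RSS sources

Some RSS sources (such as RSSHub's Twitter feeds) use the body text as the title, so a push shows the same text twice. The Jaccard similarity coefficient is now used to decide whether the two need deduplicating.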

* Update nonebot_bison/platform/rss.py

Co-authored-by: AzideCupric <57004769+AzideCupric@users.noreply.github.com>

* Update nonebot_bison/platform/rss.py

Co-authored-by: AzideCupric <57004769+AzideCupric@users.noreply.github.com>

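* 🐞 fix(platform/rss): fix missed matches when the similar text sits at the end

* 🐞 fix(rss): fix a bug where the time of some feeds was not recognized correctly

Some feeds carry their time only in an updated tag, or have no time at all; the previous code could only parse times from the published tag.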

felinae98#275

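* 🎈 perf(rss): change the string similarity comparison method

Switch from the Jaccard similarity coefficient to a longest-common-subsequence comparison.

* 🦄 refactor(rss): refactor the string similarity comparison implementation

Use the standard library difflib instead of the original hand-written LCS.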

* Update nonebot_bison/utils/__init__.py

Co-authored-by: felinae98 <731499577@qq.com>

* Update nonebot_bison/platform/rss.py

* Update nonebot_bison/platform/rss.py

---------

Co-authored-by: AzideCupric <57004769+AzideCupric@users.noreply.github.com>
Co-authored-by: felinae98 <731499577@qq.com>
2023-07-18 11:54:49 +08:00
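
The timestamp fix described in the commit message (prefer the published time, fall back to the updated time, otherwise use the current time) can be sketched in isolation. This is a minimal illustration, not the project's code: `entry_timestamp` is a hypothetical helper name, and `SimpleNamespace` objects stand in for parsed feed entries.

```python
import calendar
import time
from types import SimpleNamespace


def entry_timestamp(entry) -> int:
    """Return a UTC Unix timestamp for a feed entry (hypothetical helper)."""
    # Try the published time first, then the updated time.
    for attr in ("published_parsed", "updated_parsed"):
        parsed = getattr(entry, attr, None)
        if parsed is not None:
            # The parsed value is a UTC struct_time, so timegm is the
            # right inverse (time.mktime would assume local time).
            return calendar.timegm(parsed)
    # No usable time on the entry: fall back to "now".
    return calendar.timegm(time.gmtime())


# Stand-ins for feed entries (assumption: real entries expose the same attributes).
published = SimpleNamespace(published_parsed=time.gmtime(0))
updated_only = SimpleNamespace(updated_parsed=time.gmtime(86400))
undated = SimpleNamespace()

print(entry_timestamp(published))
print(entry_timestamp(updated_only))
print(entry_timestamp(undated))
```

Falling back to the current time means an undated entry is treated as new at the moment it is first fetched.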

84 lines
2.4 KiB
Python

import calendar
import time
from typing import Any, Optional

import feedparser
from bs4 import BeautifulSoup as bs
from httpx import AsyncClient

from ..post import Post
from ..types import RawPost, Target
from ..utils import SchedulerConfig, text_similarity
from .platform import NewMessage


class RssSchedConf(SchedulerConfig):
    # Poll RSS targets on a fixed 30-second interval.
    name = "rss"
    schedule_type = "interval"
    schedule_setting = {"seconds": 30}


class Rss(NewMessage):
    categories = {}
    enable_tag = False
    platform_name = "rss"
    name = "Rss"
    enabled = True
    is_common = True
    scheduler = RssSchedConf
    has_target = True

    @classmethod
    async def get_target_name(
        cls, client: AsyncClient, target: Target
    ) -> Optional[str]:
        # The target is the feed URL; the feed title serves as the display name.
        res = await client.get(target, timeout=10.0)
        feed = feedparser.parse(res.text)
        return feed["feed"]["title"]

    def get_date(self, post: RawPost) -> int:
        # Prefer the published time, fall back to the updated time,
        # and use the current time when the entry carries neither.
        if hasattr(post, "published_parsed"):
            return calendar.timegm(post.published_parsed)
        elif hasattr(post, "updated_parsed"):
            return calendar.timegm(post.updated_parsed)
        else:
            return calendar.timegm(time.gmtime())

    def get_id(self, post: RawPost) -> Any:
        return post.id

    async def get_sub_list(self, target: Target) -> list[RawPost]:
        res = await self.client.get(target, timeout=10.0)
        feed = feedparser.parse(res)
        entries = feed.entries
        # Attach the feed title to every entry so parse() can use it later.
        for entry in entries:
            entry["_target_name"] = feed.feed.title
        return feed.entries

    async def parse(self, raw_post: RawPost) -> Post:
        title = raw_post.get("title", "")
        soup = bs(raw_post.description, "html.parser")
        desc = soup.text.strip()
        if not title or not desc:
            # Only one of the two is present: use whichever is non-empty.
            text = title or desc
        else:
            if text_similarity(desc, title) > 0.8:
                # Title and body are near-duplicates: keep the longer one.
                text = desc if len(desc) > len(title) else title
            else:
                text = f"{title}\n\n{desc}"

        # Collect images embedded in the description HTML.
        pics = list(map(lambda x: x.attrs["src"], soup("img")))
        # Also collect images declared through media_content entries.
        if raw_post.get("media_content"):
            for media in raw_post["media_content"]:
                if media.get("medium") == "image" and media.get("url"):
                    pics.append(media.get("url"))
        return Post(
            "rss",
            text=text,
            url=raw_post.link,
            pics=pics,
            target_name=raw_post["_target_name"],
        )
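
The file imports `text_similarity` from `..utils` without showing its body. As a rough sketch of how a `difflib`-based similarity score could drive the title/body deduplication in `parse`, the example below defines two hypothetical helpers, `similarity_sketch` and `merge_title_and_body`. The scoring rule (matched characters divided by the length of the shorter string) is an assumption chosen so that a title which is a truncated copy of the body still scores high; it is not confirmed to match the project's implementation.

```python
import difflib


def similarity_sketch(a: str, b: str) -> float:
    """Score in [0, 1]: matched characters over the shorter string's length.

    Hypothetical stand-in for the project's text_similarity helper.
    """
    if not a or not b:
        raise ValueError("both strings must be non-empty")
    matcher = difflib.SequenceMatcher(None, a, b)
    # Sum the sizes of all matching blocks found by SequenceMatcher.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def merge_title_and_body(title: str, desc: str, threshold: float = 0.8) -> str:
    """Mirror the branching in parse(): drop the shorter near-duplicate."""
    if not title or not desc:
        return title or desc
    if similarity_sketch(desc, title) > threshold:
        return desc if len(desc) > len(title) else title
    return f"{title}\n\n{desc}"


print(merge_title_and_body("hello world", "hello world, this is longer"))
print(merge_title_and_body("Release notes", "xyz 123 qqq"))
```

Dividing by the shorter length rather than using `SequenceMatcher.ratio()` keeps the score high when one string is a prefix or suffix of the other, which is the case the commit message describes.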