mirror of
https://github.com/suyiiyii/nonebot-bison.git
synced 2026-05-09 18:27:56 +08:00
🎈 Optimize the content of RSS pushes (#259)
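* 🧪 test(tests): add unit tests for RSS
* 🎈 perf(rss and test): fix duplicated title and body text for some RSS sources
  Some RSS sources (RSSHub's Twitter feed) use the body text as the title, so the pushed message shows the same text twice. The Jaccard similarity coefficient is now used to decide whether deduplication is needed.
* Update nonebot_bison/platform/rss.py
  Co-authored-by: AzideCupric <57004769+AzideCupric@users.noreply.github.com>
* Update nonebot_bison/platform/rss.py
  Co-authored-by: AzideCupric <57004769+AzideCupric@users.noreply.github.com>
* 🐞 fix(platform/rss): fix similar text being missed when it appears at the end
* 🐞 fix(rss): fix a bug where the timestamp of some feeds was not recognized correctly
  Some feeds carry their time only in an updated tag, or have no time at all; the previous code could only parse the time from the published tag. felinae98#275
* 🎈 perf(rss): change the string similarity comparison method
  Switch from the Jaccard similarity coefficient to a comparison based on the longest common subsequence.
* 🦄 refactor(rss): refactor the string similarity implementation
  Use the standard library difflib instead of the hand-rolled LCS.
* Update nonebot_bison/utils/__init__.py
  Co-authored-by: felinae98 <731499577@qq.com>
* Update nonebot_bison/platform/rss.py
* Update nonebot_bison/platform/rss.py
---------
Co-authored-by: AzideCupric <57004769+AzideCupric@users.noreply.github.com>
Co-authored-by: felinae98 <731499577@qq.com>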
@@ -1,3 +1,4 @@
+import difflib
 import re
 import sys
 from typing import Union
@@ -109,3 +110,9 @@ def jaccard_text_similarity(str1: str, str2: str) -> float:
     set1 = set(str1)
     set2 = set(str2)
     return len(set1 & set2) / len(set1 | set2)
+
+
+def text_similarity(str1, str2) -> float:
+    matcher = difflib.SequenceMatcher(None, str1, str2)
+    t = sum(temp.size for temp in matcher.get_matching_blocks())
+    return t / min(len(str1), len(str2))
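The following is a minimal, self-contained sketch of how the two similarity measures in this diff might behave on the title/body duplication case described in the commit message. The two similarity functions mirror the diff. The empty-string guard, the helper `merge_title_and_body`, and its `threshold` value are assumptions added for illustration and are not part of the commit.

```python
import difflib


def jaccard_text_similarity(str1: str, str2: str) -> float:
    # Same as in the diff: compares the *sets* of characters, so order is ignored.
    set1 = set(str1)
    set2 = set(str2)
    return len(set1 & set2) / len(set1 | set2)


def text_similarity(str1: str, str2: str) -> float:
    # Assumption: return 0.0 for empty input to avoid dividing by zero;
    # the function in the diff has no such guard.
    if not str1 or not str2:
        return 0.0
    matcher = difflib.SequenceMatcher(None, str1, str2)
    # Total number of characters covered by the matching blocks.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Normalize by the shorter string, so a title fully contained in the
    # body scores 1.0 regardless of how long the body is.
    return matched / min(len(str1), len(str2))


def merge_title_and_body(title: str, body: str, threshold: float = 0.8) -> str:
    # Hypothetical helper: drop the title when it largely repeats the body.
    if text_similarity(title, body) >= threshold:
        return body
    return f"{title}\n\n{body}"


if __name__ == "__main__":
    title = "Hello world"
    body = "Hello world, this is the full post"
    print(text_similarity(title, body))
    print(merge_title_and_body(title, body))
    # Character-set Jaccard cannot tell reordered text apart:
    print(jaccard_text_similarity("abc", "cba"), text_similarity("abc", "cba"))
```

Dividing by the length of the shorter string is what lets a short title that is a prefix of a long body count as a duplicate. The character-set Jaccard coefficient ignores order and repetition, which is one plausible reason the commit moved away from it.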