
Sensitive-word filtering and anti-spam text: a collection of resources (worth bookmarking)

Published: 2023/12/20


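First, some sensitive-word lexicons:

1. funNLP

2. chat-censorship
Data from research into censorship on chat clients. This repository contains keyword blacklists and lists of other content, such as URLs or images, that trigger censorship in applications used in China (the apps covered include Weibo, WeChat, Line and Skype).

3. A sensitive-word lexicon compiled from the web, with a Java implementation

See GitHub.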

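Algorithms for sensitive-word filtering:

1. Use a sensitive-word filtering system.
Content review is carried out on a review platform. The site's moderation system is preloaded with a keyword lexicon, expands the entries into permutations and combinations, and groups them by sensitivity level. The system either stops users from posting sensitive words or deletes posts that contain them outright. Posts containing less sensitive words are not deleted immediately; they are passed to human reviewers for a second check.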
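The tiered handling described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular review platform works: the `LEXICON` entries, the tier names `high` and `low`, and the `moderate` function are all hypothetical, and plain substring matching stands in for a real matcher.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    REVIEW = "review"  # keep the post, but queue it for a human reviewer
    BLOCK = "block"    # refuse or delete the post immediately


# Hypothetical lexicon: every keyword is tagged with a sensitivity tier.
LEXICON = {
    "forbidden-term": "high",
    "borderline-term": "low",
}

# Hypothetical policy: which action each tier triggers.
TIER_ACTION = {"high": Action.BLOCK, "low": Action.REVIEW}


def moderate(text, lexicon=LEXICON):
    """Return the strictest action triggered by any keyword found in text."""
    lowered = text.lower()
    action = Action.ALLOW
    for word, tier in lexicon.items():
        if word in lowered:
            candidate = TIER_ACTION[tier]
            if candidate is Action.BLOCK:
                return Action.BLOCK  # nothing is stricter, stop early
            action = candidate
    return action


if __name__ == "__main__":
    print(moderate("an ordinary sentence"))
    print(moderate("contains a borderline-term"))
    print(moderate("contains a forbidden-term"))
```

Keeping the tier-to-action mapping in its own table means the policy can change without touching the lexicon itself.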
AC automaton (Aho-Corasick) algorithm (principle)

# Python implementation
# -*- coding: utf-8 -*-
import time
from collections import deque


# AC automaton (Aho-Corasick) algorithm
class node(object):
    def __init__(self):
        self.next = {}       # child nodes keyed by character
        self.fail = None     # failure pointer
        self.isWord = False  # True if a keyword ends at this node
        self.word = ""       # the keyword that ends here, if any


class ac_automation(object):
    def __init__(self):
        self.root = node()

    # Add a sensitive word to the trie
    def addword(self, word):
        if not word:
            return
        temp_root = self.root
        for char in word:
            if char not in temp_root.next:
                temp_root.next[char] = node()
            temp_root = temp_root.next[char]
        temp_root.isWord = True
        temp_root.word = word

    # Build the failure pointers with a breadth-first traversal.
    # Must be called after all words are added and before search().
    def make_fail(self):
        temp_que = deque([self.root])
        while temp_que:
            temp = temp_que.popleft()
            for key, value in temp.next.items():
                if temp is self.root:
                    value.fail = self.root
                else:
                    p = temp.fail
                    while p is not None:
                        if key in p.next:
                            value.fail = p.next[key]
                            break
                        p = p.fail
                    if p is None:
                        value.fail = self.root
                temp_que.append(value)

    # Find every sensitive word that occurs in content
    def search(self, content):
        p = self.root
        result = []
        for char in content:
            while char not in p.next and p is not self.root:
                p = p.fail
            p = p.next.get(char, self.root)
            # Follow the failure pointers to report every keyword ending here
            q = p
            while q is not self.root:
                if q.isWord:
                    result.append(q.word)
                q = q.fail
        return result

    # Load the sensitive-word lexicon, one word per line
    def parse(self, path):
        with open(path, encoding='gbk') as f:
            for keyword in f:
                self.addword(keyword.strip())
        self.make_fail()

    # Replace sensitive words in the text
    def words_replace(self, text):
        """
        :param text: input text
        :return: the text with every sensitive word masked by asterisks
        """
        # Mask longer words first so a shorter word cannot split a longer match
        for x in sorted(set(self.search(text)), key=len, reverse=True):
            text = text.replace(x, '*' * len(x))
        return text


if __name__ == '__main__':
    ah = ac_automation()
    path = 'keywords.txt'
    ah.parse(path)
    text1 = input('Enter text: ')
    time1 = time.time()  # time only the filtering, not the wait for input
    text2 = ah.words_replace(text1)
    print(text2)
    time2 = time.time()
    print('Total time: ' + str(time2 - time1) + 's')

DFA algorithm (principle)

# Python implementation
# -*- coding: utf-8 -*-
import time


# DFA algorithm
class DFAFilter():
    def __init__(self):
        self.keyword_chains = {}  # nested dicts forming the keyword trie
        self.delimit = '\x00'     # marker key meaning "a keyword ends here"

    def add(self, keyword):
        chars = keyword.lower().strip()
        if not chars:
            return
        level = self.keyword_chains
        for char in chars:
            level = level.setdefault(char, {})
        level[self.delimit] = 0

    def parse(self, path):
        with open(path, encoding='gbk') as f:
            for keyword in f:
                self.add(keyword.strip())

    def filter(self, message, repl="*"):
        ret = []
        start = 0
        while start < len(message):
            level = self.keyword_chains
            step_ins = 0
            for char in message[start:]:
                # Compare case-insensitively but keep the original text in the output
                char = char.lower()
                if char in level:
                    step_ins += 1
                    if self.delimit not in level[char]:
                        level = level[char]
                    else:
                        # The shortest keyword starting here matched: mask it
                        ret.append(repl * step_ins)
                        start += step_ins - 1
                        break
                else:
                    ret.append(message[start])
                    break
            else:
                # The text ended in the middle of a keyword
                ret.append(message[start])
            start += 1
        return ''.join(ret)


if __name__ == "__main__":
    gfw = DFAFilter()
    path = "keywords.txt"
    gfw.parse(path)
    text = input("Enter text: ")
    time1 = time.time()  # time only the filtering, not the wait for input
    result = gfw.filter(text)
    print(result)
    time2 = time.time()
    print('Total time: ' + str(time2 - time1) + 's')
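As a rough illustration of the data structure behind the DFA approach, the sketch below builds a nested-dictionary trie from a keyword list and walks it to test whether a text contains any keyword. The helper names `build_trie` and `contains_keyword` and the sample keywords are assumptions made for this example.

```python
END = "\x00"  # end-of-keyword marker, the same idea as the filter's delimiter


def build_trie(keywords):
    """Build a nested-dict trie; the END key marks where a keyword finishes."""
    root = {}
    for keyword in keywords:
        level = root
        for char in keyword.lower().strip():
            level = level.setdefault(char, {})
        level[END] = 0
    return root


def contains_keyword(trie, text):
    """Return True if any keyword occurs anywhere in text."""
    text = text.lower()
    for start in range(len(text)):
        level = trie
        for char in text[start:]:
            # The marker itself is never a valid transition
            if char == END or char not in level:
                break
            level = level[char]
            if END in level:
                return True
    return False


trie = build_trie(["bad", "bar"])  # hypothetical sample keywords

if __name__ == "__main__":
    print(trie)
    print(contains_keyword(trie, "a BAD day"))
    print(contains_keyword(trie, "ba"))
```

For the two sample keywords the trie is `{'b': {'a': {'d': {'\x00': 0}, 'r': {'\x00': 0}}}}`: the shared prefix `ba` is stored once, and each character read from the text moves the matcher one level deeper, which is the state transition of the automaton.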

3. TTMP, an algorithm devised by a netizen (principle, code)

Building an anti-spam mechanism:

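We constantly run into spam: junk mail in our inboxes, zombie followers on Sina Weibo, the endless advertising posts on forums, and so on. Some people keep probing a site for loopholes and gaps in its rules, then use bots to post spam ads for profit. Anti-spam mainly means using technical measures to filter and screen data, removing what we judge to be unacceptable and flagging and classifying what the system finds suspicious. Anti-spam and manual review complement each other.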
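The remove-or-flag flow described above might look like the following rule-based sketch. Everything in it is an assumption chosen for illustration: the three signals, their weights, the thresholds and the names `spam_score` and `classify`. Production systems usually rely on trained classifiers and far richer features.

```python
import re

URL_RE = re.compile(r"https?://\S+")
REPEAT_RE = re.compile(r"(.)\1{5,}")  # the same character six or more times in a row


def spam_score(text, recent_texts=()):
    """Score a post using a few hand-picked signals (all weights hypothetical)."""
    score = 0.0
    # Links: 0.4 each, counting at most two
    score += 0.4 * min(len(URL_RE.findall(text)), 2)
    # Long runs of one character, such as "!!!!!!!"
    if REPEAT_RE.search(text):
        score += 0.3
    # Verbatim repost of something seen recently
    if text in recent_texts:
        score += 0.5
    return score


def classify(text, recent_texts=(), reject_at=1.0, flag_at=0.5):
    """Map the score to one of three outcomes: ok, suspicious or reject."""
    score = spam_score(text, recent_texts)
    if score >= reject_at:
        return "reject"      # judged unacceptable: remove
    if score >= flag_at:
        return "suspicious"  # flag for a human reviewer
    return "ok"


if __name__ == "__main__":
    print(classify("nice article, thanks"))
    print(classify("hello", recent_texts=["hello"]))
    print(classify("http://a.example http://b.example !!!!!!!"))
```

The middle "suspicious" band is where the automated filter hands over to the manual review described earlier.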
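A few examples to start with:

Facebook's anti-spam practice
Zhihu anti-cheating: spam text detection
An overview of text anti-spam at Huajiao Live
[NLP text classification] A collection of text classification algorithms, from beginner to expert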

GitHub projects related to sensitive words:

1. ToolGood.Words

2. text-antispam

3. textfilter

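A curated collection of Chinese NLP resources:
It includes language detection, carrier and home-region lookup for Chinese and foreign mobile and landline numbers, gender inference from names, phone number extraction, ID card number extraction, email extraction, BERT-related resources, and more.
https://github.com/fighting41love/funNLP

Open it and you will find the treasure you need!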
