A Python swear-word program: 4 ways to implement sensitive-word filtering in Python

Published: 2023/12/19 python 28 豆豆

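In everyday life there are many settings where certain sensitive words should not appear, and we usually mask them with *, for example: 尼瑪 -> **. Swear words and politically sensitive terms should not show up in public places, so we need some way to block them. Below I introduce a few simple approaches to sensitive-word masking.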

(I have turned the swear words into images wherever I could; otherwise the article could not be published.)

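Method 1: replace filtering

replace is the simplest form of string substitution: when a string may contain sensitive words, we simply call the replace method to substitute * for each of them.

Drawback:

It works acceptably when the text and the word list are small, but becomes inefficient as either grows, because every word requires its own pass over the text.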

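import datetime

now = datetime.datetime.now()
# Stand-in text and word: the original values were shown as images
speak = 'barfoothefoobarman'
filter_sentence = speak.replace('foo', '*')
print(filter_sentence, " | ", now)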

For multiple sensitive words, keep them in a list and replace them one by one:

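# Stand-in word list and text: the original values were shown as images
dirty = ['foo', 'bar']
speak = 'barfoothefoobarman'
for i in dirty:
    speak = speak.replace(i, '*')
print(speak, " | ", now)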
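
The loop above replaces each hit with a single `*`, whereas the example in the introduction masks every character of the word. A minimal sketch of a length-preserving variant follows; the helper name `mask_words` and the sample strings are assumptions for illustration, not part of the original article.

```python
def mask_words(text, words, mask_char="*"):
    """Replace each occurrence of every word with mask characters of equal length."""
    # Process longer words first so a short word cannot break up a longer one.
    for word in sorted(words, key=len, reverse=True):
        if word:  # skip empty strings: replacing "" would touch every position
            text = text.replace(word, mask_char * len(word))
    return text


if __name__ == "__main__":
    # Stand-in data, not from the original article.
    print(mask_words("barfoothefoobarman", ["foo", "bar"]))
```

Like the plain loop, this still makes one pass over the text per word, so the drawback noted above applies unchanged.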

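Method 2: regular-expression filtering

Regular expressions are a solid matching tool: almost every everyday search task uses them, and so do our web crawlers. Here we mainly rely on "|", which matches any one of several alternative target strings. A simple example: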

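import re


def sentence_filter(keywords, text):
    # re.escape keeps any regex metacharacters in the keywords literal
    return re.sub("|".join(map(re.escape, keywords)), "***", text)


# Stand-in word list and text: the original values were shown as images
dirty = ['foo', 'bar']
speak = 'barfoothefoobarman'
print(sentence_filter(dirty, speak))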
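
When the same word list is applied to many texts, it may help to compile the alternation once and reuse it. The sketch below illustrates that idea under stated assumptions: `build_pattern` and `mask_with_regex` are hypothetical helper names, and the replacement masks each match with as many characters as it had, which differs from the fixed "***" used above.

```python
import re


def build_pattern(keywords):
    """Compile one alternation pattern from a list of literal keywords."""
    # re.escape keeps characters such as "*" or "." literal; sorting by length
    # makes the alternation try longer keywords before their prefixes.
    ordered = sorted((k for k in keywords if k), key=len, reverse=True)
    return re.compile("|".join(re.escape(k) for k in ordered))


def mask_with_regex(pattern, text, mask_char="*"):
    """Replace each match with mask characters of the same length."""
    return pattern.sub(lambda m: mask_char * len(m.group()), text)


if __name__ == "__main__":
    # Stand-in data, not from the original article.
    pattern = build_pattern(["foo", "bar"])
    print(mask_with_regex(pattern, "barfoothefoobarman"))
```

Passing a function as the replacement argument is what allows the mask length to follow the length of each match.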

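Method 3: the DFA filtering algorithm

DFA stands for Deterministic Finite Automaton. The basic idea is to look up sensitive words through state transitions, so that all sensitive words are checked in the same scan of the text instead of one scan per word. (See the code comments for the implementation.)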
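
Before the full implementation, the state-transition idea can be sketched in a few lines. This is only an illustration under assumptions: `build_dfa` and `find_words` are hypothetical names, the states are plain nested dicts, and the scan restarts at each position and stops at the shortest word found there (the minimum-match rule).

```python
END = "is_end"  # marker key: a keyword ends at this node


def build_dfa(words):
    """Build a nested-dict state machine: each character maps to the next state."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        if word:
            node[END] = True
    return root


def find_words(root, text):
    """Return (start, word) for the shortest keyword found at each start position."""
    hits = []
    for start in range(len(text)):
        node = root
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no transition for this character
            if node.get(END):
                hits.append((start, text[start:end + 1]))
                break  # minimum-match rule: stop at the first complete word
    return hits


if __name__ == "__main__":
    # Stand-in data, not from the original article.
    dfa = build_dfa(["foo", "bar"])
    print(find_words(dfa, "barfoothefoobarman"))
```

Adding a keyword only adds transitions to the dict, so the cost of checking one position depends on the length of the longest keyword rather than on the number of keywords.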

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Time: 2020/4/15 11:40
# @Software: PyCharm
# article_add: https://www.cnblogs.com/JentZhang/p/12718092.html
__author__ = "JentZhang"

import json

MinMatchType = 1  # minimum-match rule
MaxMatchType = 2  # maximum-match rule


class DFAUtils(object):
    """
    DFA algorithm
    """

    def __init__(self, word_warehouse):
        """
        Initialize the algorithm
        :param word_warehouse: word list
        """
        # Word tree
        self.root = dict()
        # Meaningless characters that are skipped during detection (in practice this list
        # should be maintained in one dedicated place, such as a database or other storage)
        self.skip_root = [' ', '&', '!', '！', '@', '#', '$', '￥', '*', '^', '%', '?', '？', '<', '>', "《", '》']
        # Build the word tree
        for word in word_warehouse:
            self.add_word(word)

    def add_word(self, word):
        """
        Add a word to the word tree
        :param word: the sensitive word to add
        :return:
        """
        now_node = self.root
        word_count = len(word)
        for i in range(word_count):
            char_str = word[i]
            if char_str in now_node:
                # Key exists: step into it for the next iteration
                # (its is_end flag is left alone so earlier words are not lost)
                now_node = now_node[char_str]
            else:
                # Key missing: create a new dict node
                new_node = dict()
                new_node['is_end'] = False
                now_node[char_str] = new_node
                now_node = new_node
            if i == word_count - 1:  # last character: mark the end of a word
                now_node['is_end'] = True

    def check_match_word(self, txt, begin_index, match_type=MinMatchType):
        """
        Check whether a sensitive word starts at begin_index
        :param txt: text to check
        :param begin_index: index in txt at which matching starts (passed in by get_match_word)
        :param match_type: matching rule, 1: minimum match, 2: maximum match
        :return: length of the match including skipped characters, or 0 if there is none
        """
        now_map = self.root
        tmp_flag = 0  # characters consumed so far, including skipped special characters
        match_length = 0  # value of tmp_flag at the end of the last complete word
        for i in range(begin_index, len(txt)):
            word = txt[i]
            # Skip special characters, but only after the start of a word has been found
            if word in self.skip_root and tmp_flag > 0:
                tmp_flag += 1
                continue
            # Follow the transition for this character
            now_map = now_map.get(word)
            if not now_map:  # no transition: stop
                break
            tmp_flag += 1
            if now_map.get("is_end"):
                # A complete word ends here
                match_length = tmp_flag
                # Minimum rule returns at once; maximum rule keeps looking
                if match_type == MinMatchType:
                    break
        if match_length < 2:  # a match must span at least 2 characters to count as a word
            match_length = 0
        return match_length

    def get_match_word(self, txt, match_type=MinMatchType):
        """
        Get the matched words
        :param txt: text to check
        :param match_type: matching rule, 1: minimum match, 2: maximum match
        :return: list of matched words found in the text
        """
        matched_word_list = list()
        for i in range(len(txt)):
            length = self.check_match_word(txt, i, match_type)
            if length > 0:
                word = txt[i:i + length]
                matched_word_list.append(word)
        return matched_word_list

    def is_contain(self, txt, match_type=MinMatchType):
        """
        Check whether the text contains a sensitive word
        :param txt: text to check
        :param match_type: matching rule, 1: minimum match, 2: maximum match
        :return: True if it does, otherwise False
        """
        for i in range(len(txt)):
            if self.check_match_word(txt, i, match_type) > 0:
                return True
        return False

    def replace_match_word(self, txt, replace_char='*', match_type=MinMatchType):
        """
        Replace the matched words
        :param txt: text to check
        :param replace_char: replacement character; each character of a match is replaced,
            e.g. text "你是大王八", sensitive word "王八", replace_char * gives "你是大**"
        :param match_type: matching rule, 1: minimum match, 2: maximum match
        :return: the text with sensitive words replaced
        """
        result_txt = txt
        # Replace every detected word; if none is detected the original text is returned
        for word in self.get_match_word(txt, match_type):
            result_txt = result_txt.replace(word, len(word) * replace_char)
        return result_txt


if __name__ == '__main__':
    # The original word list and message were shown as images; these sample values
    # are taken from the replace_match_word docstring example
    word_warehouse = ['王八']
    dfa = DFAUtils(word_warehouse=word_warehouse)
    print('Word tree:', json.dumps(dfa.root, ensure_ascii=False))
    # Text to check
    msg = '你是大王八'
    print('Contains sensitive words:', dfa.is_contain(msg))
    print('Matched words:', dfa.get_match_word(msg))
    print('After replacement:', dfa.replace_match_word(msg))

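Method 4: the AC automaton

The AC automaton requires some background: the Trie (briefly: also called a prefix tree or dictionary tree, it is a structure for fast string processing that lets you quickly look up information about a set of strings).

For details see: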

https://www.luogu.com.cn/blog/juruohyfhaha/trie-xue-xi-zong-jie
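
For readers who have not met the structure before, here is a minimal Trie sketch. The class and method names (`Trie`, `insert`, `contains`, `starts_with`) are assumptions chosen for illustration and do not come from the linked article.

```python
class Trie:
    """A minimal prefix tree over characters."""

    def __init__(self):
        self.children = {}    # character -> child Trie node
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, word):
        """Add a word, creating one node per character as needed."""
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def _walk(self, prefix):
        """Follow the prefix from the root; return the final node or None."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None


if __name__ == "__main__":
    # Stand-in data, not from the original article.
    trie = Trie()
    for w in ["foo", "foobar"]:
        trie.insert(w)
    print(trie.contains("foo"), trie.contains("fo"), trie.starts_with("fo"))
```

Words that share a prefix share the nodes for that prefix, which is why a lookup costs time proportional to the length of the query rather than to the number of stored words.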

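The AC automaton adds a fail pointer to each node of the Trie. When matching fails at the current node, the pointer moves to the node that fail points to, so matching can continue straight on without backtracking.

I won't go into the matching mechanism in detail here; for more on the AC automaton, see this article: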

http://www.zzvips.com/article/128711.htm

In Python, the ahocorasick module gives a quick implementation:

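# python3 -m pip install pyahocorasick
import ahocorasick


def build_actree(wordlist):
    actree = ahocorasick.Automaton()
    for index, word in enumerate(wordlist):
        actree.add_word(word, (index, word))
    actree.make_automaton()
    return actree


if __name__ == '__main__':
    # Stand-in data: the original word list and sentence were shown as images
    wordlist = ['foo', 'bar']
    sent = 'barfoothefoobarman'
    actree = build_actree(wordlist=wordlist)
    sent_cp = sent
    # iter() yields (end_index, value); value is the (index, word) tuple stored above
    for end_index, (index, word) in actree.iter(sent):
        sent_cp = sent_cp.replace(word, "**")
        print("Masked word:", word)
    print("Masked result:", sent_cp)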

Of course, we can also write an AC automaton by hand, for example:

from collections import deque


class TrieNode(object):
    __slots__ = ['value', 'next', 'fail', 'emit']

    def __init__(self, value):
        self.value = value
        self.next = dict()  # character -> child node
        self.fail = None    # node to fall back to when matching fails
        self.emit = None    # set of words that end at this node


class AhoCorasic(object):
    __slots__ = ['_root']

    def __init__(self, words):
        self._root = AhoCorasic._build_trie(words)

    @staticmethod
    def _build_trie(words):
        assert isinstance(words, list) and words
        root = TrieNode('root')
        # Step 1: insert every word into the trie
        for word in words:
            node = root
            for c in word:
                if c not in node.next:
                    node.next[c] = TrieNode(c)
                node = node.next[c]
            if not node.emit:
                node.emit = {word}
            else:
                node.emit.add(word)
        # Step 2: set the fail pointers level by level (breadth-first)
        queue = deque([(root, None)])
        while queue:
            curr, parent = queue.popleft()
            for sub in curr.next.values():
                queue.append((sub, curr))
            if parent is None:
                continue
            elif parent is root:
                curr.fail = root
            else:
                fail = parent.fail
                while fail and curr.value not in fail.next:
                    fail = fail.fail
                if fail:
                    curr.fail = fail.next[curr.value]
                else:
                    curr.fail = root
        return root

    def search(self, s):
        seq_list = []
        node = self._root
        for i, c in enumerate(s):
            matched = True
            # Follow fail pointers until some node has a transition for c
            while c not in node.next:
                if not node.fail:
                    matched = False
                    node = self._root
                    break
                node = node.fail
            if not matched:
                continue
            node = node.next[c]
            if node.emit:
                for word in node.emit:
                    from_index = i + 1 - len(word)
                    match_info = (from_index, word)
                    seq_list.append(match_info)
                # Restart from the root after a hit, so overlapping matches are not reported
                node = self._root
        return seq_list


if __name__ == '__main__':
    aho = AhoCorasic(['foo', 'bar'])
    print(aho.search('barfoothefoobarman'))

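Those are the four ways to implement sensitive-word filtering in Python. The first two are fairly simple; the last two lean toward algorithms, and the code becomes easy to follow once you understand how the algorithms work. (DFA is a commonly used filtering technique, so it is worth learning.)

Finally, a sensitive-word list: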

https://github.com/qloog/sensitive_words


Original article: https://cloud.tencent.com/developer/article/1625101
