Text Duplicate Detection: difflib.SequenceMatcher

Published: 2025/3/21
Contents

1. SequenceMatcher FlowChart

1.1 get_matching_blocks()

1.2 find_longest_match()

1.3 ratio()

2. Worked example

3. Modifying the functions for project requirements


References

  • SequenceMatcher in Python
  • difflib

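The basic idea behind SequenceMatcher is to find the longest contiguous matching subsequence that contains no "junk" elements. (Despite the similar name, this is not the classic longest common subsequence, LCS, which does not require the match to be contiguous.) This does not yield a minimal edit sequence, but it does tend to yield matches that "look right" to people. "Junk" is anything you do not want the algorithm to match on, such as blank lines in ordinary text files or "<P>" lines in HTML files. The output of SequenceMatcher is easy for a person to read, for example: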

Input strings: "my stackoverflow mysteries" and "mystery"

SequenceMatcher returns the match that looks natural to any reader, "myster", aligned as follows:

my stackoverflow mysteries
.................mystery..

The longest common subsequence (LCS), by contrast, is "mystery", scattered across the first string:

my stackoverflow mysteries
my.st.....er......y.......
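
The contrast above can be checked with a short script. This is a minimal sketch: lcs_length is a helper written for this illustration (it is not part of difflib), and the contiguous match comes from the standard find_longest_match() method.

```python
from difflib import SequenceMatcher


def lcs_length(x, y):
    """Length of the classic (non-contiguous) longest common subsequence.

    Illustrative helper, not part of difflib.
    """
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


a = "my stackoverflow mysteries"
b = "mystery"

sm = SequenceMatcher(None, a, b)
# Longest contiguous block shared by all of a and all of b.
match = sm.find_longest_match(0, len(a), 0, len(b))
contiguous = a[match.a:match.a + match.size]

print("contiguous match:", repr(contiguous), "size", match.size)
print("LCS length      :", lcs_length(a, b))
```

The contiguous match has 6 characters ("myster"), while the LCS has 7 ("mystery"), because the LCS is allowed to pick its characters from far-apart positions.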

1. SequenceMatcher FlowChart

Given two input strings a and b,

  • ratio() returns the similarity score (a float in [0, 1]) between the input strings. It sums the sizes of all matched blocks returned by get_matching_blocks() and computes ratio = 2.0*M / T, where M is the number of matched elements and T is the total number of elements in both sequences.
  • get_matching_blocks() returns a list of triples describing the matching subsequences. The last triple is a dummy, (len(a), len(b), 0). It works by repeated application of find_longest_match().
  • find_longest_match() returns a triple describing the longest matching block in a[aLow:aHigh] and b[bLow:bHigh].
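
To make the three calls concrete, here is a small sketch that runs them on one pair of strings. The input strings are chosen only for illustration; the methods are the standard difflib ones.

```python
from difflib import SequenceMatcher

a, b = "abxcd", "abcd"
sm = SequenceMatcher(None, a, b)

# Longest matching block within the full ranges of a and b.
longest = sm.find_longest_match(0, len(a), 0, len(b))

# All matching blocks, ending with the dummy (len(a), len(b), 0).
blocks = sm.get_matching_blocks()

# Similarity score 2.0*M / T.
score = sm.ratio()

print(longest)
print(blocks)
print(round(score, 3))
```

Here two blocks of size 2 match ("ab" and "cd"), so M = 4, T = 9 and the score is 8/9.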

1.1 get_matching_blocks()

This function can be modified to suit the needs of your own project:

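def get_matching_blocks(self):
    # Return the cached result if it has already been computed.
    if self.matching_blocks is not None:
        return self.matching_blocks
    la, lb = len(self.a), len(self.b)

    # Each queue entry is a window (alo, ahi, blo, bhi) still to be compared.
    queue = [(0, la, 0, lb)]
    matching_blocks = []
    while queue:
        alo, ahi, blo, bhi = queue.pop()
        i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
        # a[alo:i] vs b[blo:j] unknown
        # a[i:i+k] same as b[j:j+k]
        # a[i+k:ahi] vs b[j+k:bhi] unknown
        if k:   # if k is 0, there was no matching block
            matching_blocks.append(x)
            if alo < i and blo < j:
                queue.append((alo, i, blo, j))
            if i+k < ahi and j+k < bhi:
                queue.append((i+k, ahi, j+k, bhi))
    matching_blocks.sort()
    # Excerpt ends here. The full difflib method goes on to merge adjacent
    # blocks, append the dummy triple (la, lb, 0), cache the list in
    # self.matching_blocks and return it.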

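A queue is created and initialized with a 4-tuple holding the lower and upper index bounds of the two input strings. While the queue is not empty, a 4-tuple is popped and passed to find_longest_match(), which returns a triple describing the matching subsequence. The triple is appended to the matching_blocks list. (The list is popped from its end, so it behaves as a stack; the order does not matter because the result is sorted afterwards.)

Each triple has the form (i, j, n), meaning a[i:i+n] == b[j:j+n].

The pieces of the two sequences to the left and to the right of the matching block are then added to the queue. The process repeats until the queue is empty.

Finally, the matching_blocks list is sorted and returned as the output.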

1.2 find_longest_match()

def find_longest_match(self, alo=0, ahi=None, blo=0, bhi=None):
    a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.bjunk.__contains__
    # Default the upper bounds to the full length of each sequence.
    if ahi is None:
        ahi = len(a)
    if bhi is None:
        bhi = len(b)
    besti, bestj, bestsize = alo, blo, 0
    # find longest junk-free match
    # during an iteration of the loop, j2len[j] = length of longest
    # junk-free match ending with a[i-1] and b[j]
    j2len = {}  # {index j in b: length of the longest match ending there}
    nothing = []
    for i in range(alo, ahi):
        # look at all instances of a[i] in b; note that because
        # b2j has no junk keys, the loop is skipped if a[i] is junk
        j2lenget = j2len.get
        newj2len = {}  # match lengths for matches ending at the current a[i]
        for j in b2j.get(a[i], nothing):
            # a[i] matches b[j]
            if j < blo:   # j lies before the current window of b
                continue
            if j >= bhi:  # j lies past the window, and so do all later j
                break
            # Extend the match that ended at b[j-1] by one element.
            k = newj2len[j] = j2lenget(j-1, 0) + 1
            if k > bestsize:
                # Longer than the best so far: record its start and size.
                besti, bestj, bestsize = i-k+1, j-k+1, k
        j2len = newj2len
    # Excerpt ends here. The full difflib method then extends the match
    # with adjacent junk elements and returns Match(besti, bestj, bestsize).

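The input is a 4-tuple of lower and upper index bounds for the two strings; the output is a triple describing the longest matching block.

First, a dictionary b2j is built: for each element x in string b, b2j[x] is the list of indices in b at which x occurs.

The outer loop scans the first string, a, character by character, and b2j is used to look up where the current character occurs in b. When there is a match, a second dictionary, newj2len, is updated; it records the length of the match ending at each position of b so far. The variables besti, bestj and bestsize are then updated whenever a block longer than the best one so far is found.

Among all maximal matching blocks, the algorithm returns the one that starts earliest in a; among all maximal blocks that start earliest in a, it returns the one that starts earliest in b.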
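
The dictionary-based scan can be tried in isolation. The function below is a simplified sketch, assuming no junk handling at all; longest_match is a name made up for this illustration and is not difflib's method.

```python
def longest_match(a, b, alo=0, ahi=None, blo=0, bhi=None):
    """Junk-free simplification of find_longest_match (illustrative only)."""
    ahi = len(a) if ahi is None else ahi
    bhi = len(b) if bhi is None else bhi

    # b2j[x] lists every index j with b[j] == x, in increasing order.
    b2j = {}
    for j, elt in enumerate(b):
        b2j.setdefault(elt, []).append(j)

    besti, bestj, bestsize = alo, blo, 0
    j2len = {}  # j -> length of the match ending at a[i-1] and b[j]
    for i in range(alo, ahi):
        newj2len = {}
        for j in b2j.get(a[i], ()):
            if j < blo:
                continue
            if j >= bhi:
                break
            # Extend the match that ended at a[i-1], b[j-1] by one element.
            k = newj2len[j] = j2len.get(j - 1, 0) + 1
            if k > bestsize:  # strict '>' keeps the earliest best match
                besti, bestj, bestsize = i - k + 1, j - k + 1, k
        j2len = newj2len
    return besti, bestj, bestsize


print(longest_match("my stackoverflow mysteries", "mystery"))
# Two candidate blocks of size 2 ("ab" and "cd"): the one starting
# earliest in a is returned.
print(longest_match("ab cd", "cd ab"))
```

The strict comparison k > bestsize is what produces the tie-breaking rule described above: a later block of equal length never replaces an earlier one.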

1.3 ratio()

def ratio(self):
    """Return a measure of the sequences' similarity (float in [0,1]).

    Where T is the total number of elements in both sequences, and
    M is the number of matches, this is 2.0*M / T.
    Note that this is 1 if the sequences are identical, and 0 if
    they have nothing in common.
    """
    # The last element of each triple is the size of that matching block.
    matches = sum(triple[-1] for triple in self.get_matching_blocks())
    return _calculate_ratio(matches, len(self.a) + len(self.b))

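The ratio() function computes the similarity of sequences a and b as ratio = 2.0*M / T, where M is the number of matched characters and T is the total number of characters in both sequences. The similarity calculation can be adapted to the situation at hand.

As a rule of thumb, a ratio() value above 0.6 means the sequences are close matches. For the example input pair, the calculation gives a similarity score of 0.8, so the two sequences are considered similar.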
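
The formula can be reproduced by hand from the matching blocks. This is a small sketch with input strings chosen for illustration (they are not the pair that produced the 0.8 score mentioned above); the 0.6 cut-off is the rule of thumb from the text.

```python
from difflib import SequenceMatcher

a, b = "abcd", "bcde"
sm = SequenceMatcher(None, a, b)

# M: total size of all matching blocks (the dummy block contributes 0).
matched = sum(block.size for block in sm.get_matching_blocks())
# T: total number of elements in both sequences.
total = len(a) + len(b)

by_hand = 2.0 * matched / total if total else 1.0
is_close = by_hand > 0.6  # rule-of-thumb threshold for a close match

print(matched, total, by_hand, sm.ratio(), is_close)
```

The only matching block is "bcd", so M = 3, T = 8 and the ratio is 0.75, which is above the 0.6 threshold.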

2. Worked example

Let's walk through the algorithm step by step, using a pair of input strings.
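
One way to follow the steps is to log every window that get_matching_blocks() hands to find_longest_match(). The sketch below assumes a hypothetical TracingMatcher subclass written for this purpose, and the input strings are chosen for illustration rather than taken from the original figures.

```python
from difflib import SequenceMatcher


class TracingMatcher(SequenceMatcher):
    """Hypothetical helper: records each call to find_longest_match."""

    def __init__(self, isjunk=None, a="", b="", autojunk=True):
        self.trace = []  # created before the base class sets the sequences
        super().__init__(isjunk, a, b, autojunk)

    def find_longest_match(self, alo=0, ahi=None, blo=0, bhi=None):
        match = super().find_longest_match(alo, ahi, blo, bhi)
        self.trace.append(((alo, ahi, blo, bhi), tuple(match)))
        return match


a, b = "qabxcd", "abycdf"
sm = TracingMatcher(None, a, b)
blocks = [tuple(m) for m in sm.get_matching_blocks()]

for step, (window, (i, j, k)) in enumerate(sm.trace, start=1):
    print(f"step {step}: window={window} -> ({i}, {j}, {k}) {a[i:i + k]!r}")
print("blocks:", blocks)
print("ratio :", round(sm.ratio(), 3))
```

The first window covers both strings and finds "ab"; the piece to its right finds "cd"; the remaining window compares "x" with "y" and finds nothing. With M = 4 and T = 12 the ratio is about 0.667.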

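3. Modifying the functions for project requirements

get_matching_blocks():

  • Compute how much of a is copied from b, that is, the degree to which sequence a is duplicated in sequence b. b stays fixed, and each remaining piece of a is compared against all of b, using the windows (alo, i, 0, lb) and (i+k, ahi, 0, lb).
  • A match counts as copying only when its length k reaches a threshold, that is, k >= duplicateNumber.
  • Similarity ratio: ratio = M / len(a), that is, duplication rate = number of duplicated characters / length of a.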
def get_matching_blocks(self, duplicateNumber=6):
    """Return list of triples describing matching subsequences.

    Parameters:
        duplicateNumber: minimum number of repeated characters for a
            match to count as copying
    """
    if self.matching_blocks is not None:
        return self.matching_blocks
    la, lb = len(self.a), len(self.b)
    queue = [(0, la, 0, lb)]
    matching_blocks = []
    while queue:
        alo, ahi, blo, bhi = queue.pop()
        i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
        # a[alo:i] vs b[blo:j] unknown
        # a[i:i+k] same as b[j:j+k]
        # a[i+k:ahi] vs b[j+k:bhi] unknown
        if k:   # if k is 0, there was no matching block
            # Keep the block only if k >= duplicateNumber; choose the
            # value of duplicateNumber to suit the application.
            if k >= duplicateNumber:
                matching_blocks.append(x)
            # b stays fixed: each remaining piece of a is compared with all of b.
            if alo < i:
                queue.append((alo, i, 0, lb))
            if i+k < ahi:
                queue.append((i+k, ahi, 0, lb))
    matching_blocks.sort()
    # Excerpt ends here, as in the original listing.

def ratio(self, only_a=True):
    # only_a=True: fraction of a that is duplicated in b, M / len(a).
    # only_a=False: the standard difflib ratio, 2.0*M / T.
    matches = sum(triple[-1] for triple in self.get_matching_blocks())
    if only_a:
        return matches / len(self.a)
    return _calculate_ratio(matches, len(self.a) + len(self.b))
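
The same idea can be tried without editing difflib itself by overriding the two methods in a subclass. This is a sketch under assumptions: the class name CopyRateMatcher and the min_block parameter (standing in for duplicateNumber) are invented here, and the merging of adjacent blocks is left out.

```python
from difflib import SequenceMatcher


class CopyRateMatcher(SequenceMatcher):
    """Hypothetical subclass: measures how much of a reappears in b."""

    def __init__(self, isjunk=None, a="", b="", autojunk=True, min_block=6):
        self.min_block = min_block  # plays the role of duplicateNumber
        super().__init__(isjunk, a, b, autojunk)

    def get_matching_blocks(self):
        if self.matching_blocks is not None:
            return self.matching_blocks
        la, lb = len(self.a), len(self.b)
        queue = [(0, la)]  # pieces of a still to be searched for in all of b
        blocks = []
        while queue:
            alo, ahi = queue.pop()
            i, j, k = self.find_longest_match(alo, ahi, 0, lb)
            if k:
                if k >= self.min_block:  # long enough to count as copied
                    blocks.append((i, j, k))
                if alo < i:
                    queue.append((alo, i))
                if i + k < ahi:
                    queue.append((i + k, ahi))
        blocks.sort()
        blocks.append((la, lb, 0))  # dummy terminator, as in difflib
        self.matching_blocks = blocks
        return blocks

    def ratio(self, only_a=True):
        matches = sum(size for _, _, size in self.get_matching_blocks())
        if only_a:
            return matches / len(self.a) if self.a else 0.0
        total = len(self.a) + len(self.b)
        return 2.0 * matches / total if total else 1.0


a = "the quick brown fox"
b = "a quick brown dog and the lazy fox"

strict = CopyRateMatcher(None, a, b, min_block=6)
loose = CopyRateMatcher(None, a, b, min_block=3)
print(strict.get_matching_blocks(), strict.ratio())
print(loose.get_matching_blocks(), loose.ratio())
```

With min_block=6 only " quick brown " (13 characters) counts, giving 13/19. With min_block=3 the pieces "the" and "fox" are also found elsewhere in b, so the whole of a is reported as duplicated. Because the blocks are no longer ordered in b, other SequenceMatcher methods that rely on get_matching_blocks(), such as get_opcodes(), should not be used on this subclass.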
