日韩性视频-久久久蜜桃-www中文字幕-在线中文字幕av-亚洲欧美一区二区三区四区-撸久久-香蕉视频一区-久久无码精品丰满人妻-国产高潮av-激情福利社-日韩av网址大全-国产精品久久999-日本五十路在线-性欧美在线-久久99精品波多结衣一区-男女午夜免费视频-黑人极品ⅴideos精品欧美棵-人人妻人人澡人人爽精品欧美一区-日韩一区在线看-欧美a级在线免费观看

歡迎訪問 生活随笔!

生活随笔

當前位置: 首頁 > 编程资源 > 编程问答 >内容正文

编程问答

simhash

發(fā)布時間:2024/3/26 编程问答 34 豆豆
生活随笔 收集整理的這篇文章主要介紹了 simhash 小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

聽聞SimHash很強,對海量文檔相似度的計算有很高的效率。查了查文檔,大致的流程如下:

大致流程就是:分詞, 配合詞頻計算哈希串(每個分出來的詞最終會計算處同樣的長度), 降維,計算海明距離。

#coding:utf8 import math import jieba import jieba.analyseclass SimHash(object):def __init__(self):passdef getBinStr(self, source):if source == "":return 0else:x = ord(source[0]) << 7m = 1000003mask = 2 ** 128 - 1for c in source:x = ((x * m) ^ ord(c)) & maskx ^= len(source)if x == -1:x = -2x = bin(x).replace('0b', '').zfill(64)[-64:]print(source, x)return str(x)def getWeight(self, source):# fake weight with keywordreturn ord(source)def unwrap_weight(self, arr):ret = ""for item in arr:tmp = 0if int(item) > 0:tmp = 1ret += str(tmp)return retdef simHash(self, rawstr):seg = jieba.cut(rawstr, cut_all=True)keywords = jieba.analyse.extract_tags("|".join(seg), topK=100, withWeight=True)print(keywords)ret = []for keyword, weight in keywords:binstr = self.getBinStr(keyword)keylist = []for c in binstr:weight = math.ceil(weight)if c == "1":keylist.append(int(weight))else:keylist.append(-int(weight))ret.append(keylist)# 對列表進行"降維"rows = len(ret)cols = len(ret[0])result = []for i in range(cols):tmp = 0for j in range(rows):tmp += int(ret[j][i])if tmp > 0:tmp = "1"elif tmp <= 0:tmp = "0"result.append(tmp)return "".join(result)def getDistince(self, hashstr1, hashstr2):length = 0for index, char in enumerate(hashstr1):if char == hashstr2[index]:continueelse:length += 1return lengthif __name__ == "__main__":simhash = SimHash()s1 = "100元=38萬星幣,加微信"s2 = "38萬星幣100元,加VX"with open("a.txt", "r") as file:s1 = "".join(file.readlines())file.close()with open("b.txt", "r") as file:s2 = "".join(file.readlines())file.close()# s1 = "this is just test for simhash, here is the difference"# s2 = "this is a test for simhash, here is the difference"# print(simhash.getBinStr(s1))# print(simhash.getBinStr(s2))hash1 = simhash.simHash(s1)hash2 = simhash.simHash(s2)distince = simhash.getDistince(hash1, hash2)# value = math.sqrt(len(s1)**2 + len(s2)**2)value = 5print("海明距離:", distince, "判定距離:", value, "是否相似:", distince<=value)

經(jīng)計算發(fā)現(xiàn),對大文本有較強的驗證性,對小短文本相似度計算略有偏差,海明距離的計算會有不準。

Building prefix dict from the default dictionary ... Loading model from cache /var/folders/d0/d4zzr4n51m7_vj9ryfb633pc0000gn/T/jieba.cache Loading model cost 0.764 seconds. Prefix dict has been built succesfully. 海明距離: 1 判定距離: 5 是否相似: True

參考鏈接:

  • https://blog.csdn.net/gzt940726/article/details/80460419
  • https://blog.csdn.net/madujin/article/details/53152619

總結

以上是生活随笔為你收集整理的simhash的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯,歡迎將生活随笔推薦給好友。