Several Methods for Computing Text Similarity, with Python Implementations

Published: 2024/9/30 python 34 豆豆

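Text similarity computation is widely used in information retrieval, search engines, detection of copied documents, and similar applications.
Different situations and tasks therefore call for different ways of computing text similarity.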

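Method 1: Edit distance
Edit distance, also called Levenshtein distance, is the minimum number of character edits needed to turn one string into another. Three operations are allowed:
Insert - insert a character at any position
Delete - remove any one character
Replace - replace any one character with another
Edit distance can be used to measure how similar two strings are. It has many applications, one of which is spelling correction. Formally, given two strings str1 and str2, we want the minimum total cost needed to convert str1 into str2.
For example:
Input: str1 = "geek", str2 = "gesek"
Output: 1
Inserting 's' converts str1 into str2.
Input: str1 = "cat", str2 = "cut"
Output: 1
Replacing 'a' with 'u' gives str2.
Input: str1 = "sunday", str2 = "saturday"
Output: 3
Inserting 'a' and 't' after the 's' and then replacing 'n' with 'r' gives str2.
We assume three operations: 1. insert a new character, 2. replace a character, 3. delete a character. Each operation costs 1.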
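With unit costs for all three operations, the minimum cost can be written as a recurrence over prefixes. The symbol D(i, j) is notation introduced for this sketch (it does not appear in the original post): it stands for the edit distance between the first i characters of str1 and the first j characters of str2, with characters indexed from 1.

```latex
\[
\begin{aligned}
D(i, 0) &= i, \qquad D(0, j) = j, \\
D(i, j) &=
\begin{cases}
D(i-1,\, j-1), & \text{if } \mathrm{str1}[i] = \mathrm{str2}[j], \\[4pt]
1 + \min\bigl( D(i,\, j-1),\; D(i-1,\, j),\; D(i-1,\, j-1) \bigr), & \text{otherwise.}
\end{cases}
\end{aligned}
\]
```

The three terms inside the minimum correspond to insert, delete, and replace, and the final answer is D(m, n), where m and n are the lengths of str1 and str2.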

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Author: yudengwu
# @Date : 2020/5/28


def edit_dist(str1, str2):
    # m and n are the lengths of str1 and str2
    m, n = len(str1), len(str2)
    # 2-D table holding the answers to the sub-problems:
    # dp[i][j] is the cost of turning str1[:i] into str2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    # Fill the table with dynamic programming
    for i in range(m + 1):
        for j in range(n + 1):
            # If the first string is empty, the cost is j (j insertions)
            if i == 0:
                dp[i][j] = j
            # If the second string is empty, the cost is i (i deletions)
            elif j == 0:
                dp[i][j] = i
            # Equal last characters add no cost
            elif str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            # Otherwise take the cheapest of the three operations
            else:
                dp[i][j] = 1 + min(dp[i][j - 1],      # Insert
                                   dp[i - 1][j],      # Remove
                                   dp[i - 1][j - 1])  # Replace
    return dp[m][n]


str1 = "重慶是一個好地方"
str2 = "重慶好吃的在哪里"
str3 = "重慶是好地方"
c = edit_dist(str1, str2)
c1 = edit_dist(str1, str3)
print("c：", c)
print("c1：", c1)
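Edit distance is a count of operations, not a score in a fixed range. One common way to turn it into a similarity is to divide by the length of the longer string; that normalization is an assumption of this sketch, not something the original post specifies. The helper names `levenshtein` and `edit_similarity` are hypothetical, and the sketch keeps only two rows of the table to save memory.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))  # cost of building b[:j] from an empty string
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1])  # matching characters add no cost
            else:
                curr.append(1 + min(curr[j - 1],   # insert
                                    prev[j],       # delete
                                    prev[j - 1]))  # replace
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    for s, t in [("geek", "gesek"), ("cat", "cut"), ("sunday", "saturday")]:
        print(s, t, levenshtein(s, t), round(edit_similarity(s, t), 3))
```

Dividing by the longer length keeps the score between 0 and 1 because the distance can never exceed that length.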

Result:
c： 6
c1： 2

Method 2: Cosine similarity
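Cosine similarity treats each sentence as a word-count vector over the combined vocabulary and measures the angle between the two vectors. The symbols A and B are notation introduced here for the two count vectors:

```latex
\[
\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
           = \frac{\sum_{k} A_k B_k}{\sqrt{\sum_{k} A_k^{2}} \, \sqrt{\sum_{k} B_k^{2}}},
\qquad
\mathrm{sim} = 0.5 + 0.5 \cos(A, B).
\]
```

The final rescaling corresponds to the `sim = 0.5 + 0.5 * cos` line in the code that follows. Because word counts are never negative, the cosine lies in [0, 1] here, so sim lies in [0.5, 1].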

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Author: yudengwu
# @Date : 2020/5/28
import numpy as np
import jieba


# Read the stop-word list, one word per line
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='gbk') as f:
        return [line.strip() for line in f]


# Load the stop words
stopwords = stopwordslist("停用詞.txt")


def cosine_similarity(sentence1: str, sentence2: str) -> float:
    """
    :param sentence1: first sentence
    :param sentence2: second sentence
    :return: similarity of the two sentences
    """
    seg1 = [word for word in jieba.cut(sentence1) if word not in stopwords]
    seg2 = [word for word in jieba.cut(sentence2) if word not in stopwords]
    # Build the shared vocabulary
    word_list = list(set(seg1 + seg2))
    # Count how often each vocabulary word occurs in each sentence
    vec_1 = np.array([seg1.count(word) for word in word_list])
    vec_2 = np.array([seg2.count(word) for word in word_list])
    # Cosine formula
    num = vec_1.dot(vec_2)
    denom = np.linalg.norm(vec_1) * np.linalg.norm(vec_2)
    # Guard against division by zero when a sentence has no words left
    cos = num / denom if denom else 0.0
    # Map cos from [-1, 1] onto [0, 1]
    sim = 0.5 + 0.5 * cos
    return sim


str1 = "重慶是一個好地方"
str2 = "重慶好吃的在哪里"
str3 = "重慶是好地方"
sim1 = cosine_similarity(str1, str2)
sim2 = cosine_similarity(str1, str3)
print("sim1 ：", sim1)
print("sim2:", sim2)
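To see the vector arithmetic without the jieba and stop-word dependencies, here is a minimal sketch that works on lists that are already tokenized. The function name `cosine_from_tokens` and the sample tokens are made up for illustration, and it returns the raw cosine rather than the rescaled 0.5 + 0.5 * cos value.

```python
import math
from collections import Counter
from typing import Sequence


def cosine_from_tokens(tokens1: Sequence[str], tokens2: Sequence[str]) -> float:
    """Raw cosine similarity between two bags of words."""
    counts1, counts2 = Counter(tokens1), Counter(tokens2)
    # Only words present in both bags contribute to the dot product.
    dot = sum(counts1[w] * counts2[w] for w in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty bag is similar to nothing
    return dot / (norm1 * norm2)


if __name__ == "__main__":
    a = ["重慶", "好", "地方"]  # hypothetical tokenization
    b = ["重慶", "好吃"]        # hypothetical tokenization
    print(cosine_from_tokens(a, b))
```

In this toy input the two bags share one word, so the dot product is 1 and the norms are √3 and √2, giving 1/√6, about 0.408.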

Result: [output image not preserved in the source]

Method 3: Document similarity with the gensim package

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Author: yudengwu
# @Date : 2020/5/28
import jieba
from gensim import corpora, models, similarities


# Read the stop-word list, one word per line
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='gbk') as f:
        return [line.strip() for line in f]


# Load the stop words
stopwords = stopwordslist("停用詞.txt")

str1 = "重慶是一個好地方"
str2 = "重慶好吃的在哪里"
str3 = "重慶是好地方"


def gensimSimilarities(str1, str2, str3):
    all_doc = [str1, str2, str3]
    # Tokenize each document and collect the results in all_doc_list
    all_doc_list = []
    for doc in all_doc:
        doc_list = [word for word in jieba.cut(doc) if word not in stopwords]
        all_doc_list.append(doc_list)
    # Build the dictionary that maps each word to an id (bag-of-words)
    dictionary = corpora.Dictionary(all_doc_list)
    # Convert every document to a bag-of-words vector with doc2bow
    corpus = [dictionary.doc2bow(doc) for doc in all_doc_list]
    # Fit a TF-IDF model on the corpus
    tfidf = models.TfidfModel(corpus)
    # Index the TF-IDF vectors, then query the index with the same documents
    index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary.keys()))
    sim = index[tfidf[corpus]]
    return sim


sim = gensimSimilarities(str1, str2, str3)
print(sim)
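For intuition about what the TF-IDF step contributes, the following is a pure-Python sketch of one common weighting: term frequency times log2(N / document frequency), followed by L2 normalization and dot products. It is an illustration under those stated assumptions, not gensim's implementation, and gensim's exact defaults and output may differ. The names `tfidf_vectors` and `similarity_matrix` and the sample tokens are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one L2-normalized TF-IDF vector (word -> weight) per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        # A word found in every document gets weight 0 because log2(1) == 0.
        weights = {
            word: count * math.log2(n_docs / doc_freq[word])
            for word, count in Counter(doc).items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {word: w / norm for word, w in weights.items()}
        vectors.append(weights)
    return vectors


def similarity_matrix(docs: List[List[str]]) -> List[List[float]]:
    """Pairwise cosine similarity between the TF-IDF vectors of all documents."""
    vecs = tfidf_vectors(docs)
    return [[sum(u[w] * v.get(w, 0.0) for w in u) for v in vecs] for u in vecs]


if __name__ == "__main__":
    # Hypothetical pre-tokenized documents, standing in for jieba output.
    sample = [["重慶", "好", "地方"], ["重慶", "好吃"], ["重慶", "好", "地方"]]
    for row in similarity_matrix(sample):
        print([round(x, 3) for x in row])
```

Note how a word shared by every document, such as the city name in the sample, stops contributing to the similarity, which is the main difference from the plain count vectors of Method 2.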

Result: [output image not preserved in the source]
