當(dāng)前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

Levenshtein Distance算法（编辑距离算法）

發(fā)布時間：2024/6/21 编程问答 33 豆豆

生活随笔收集整理的這篇文章主要介紹了 Levenshtein Distance算法（编辑距离算法）小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

編輯距離

編輯距離（Edit Distance），又稱Levenshtein距離，是指兩個字串之間，由一個轉(zhuǎn)成另一個所需的最少編輯操作次數(shù)。許可的編輯操作包括將一個字符替換成另一個字符，插入一個字符，刪除一個字符。一般來說，編輯距離越小，兩個串的相似度越大。
例如將kitten一字轉(zhuǎn)成sitting：
sitten （k→s）
sittin （e→i）
sitting （→g）
俄羅斯科學(xué)家Vladimir Levenshtein在1965年提出這個概念。

應(yīng)用

最小編輯距離通常作為一種相似度計算函數(shù)被用于多種實際應(yīng)用中，詳細如下：（特別的，對于中文自然語言處理，一般以詞為基本處理單元）

DNA分析：基因?qū)W的一個主要主題就是比較 DNA 序列并嘗試找出兩個序列的公共部分。如果兩個 DNA 序列有類似的公共子序列，那么這些兩個序列很可能是同源的。在比對兩個序列時，不僅要考慮完全匹配的字符，還要考慮一個序列中的空格或間隙（或者，相反地，要考慮另一個序列中的插入部分）和不匹配，這兩個方面都可能意味著突變（mutation）。在序列比對中，需要找到最優(yōu)的比對（最優(yōu)比對大致是指要將匹配的數(shù)量最大化，將空格和不匹配的數(shù)量最小化）。如果要更正式些，可以確定一個分?jǐn)?shù)，為匹配的字符添加分?jǐn)?shù)、為空格和不匹配的字符減去分?jǐn)?shù)。

全局序列比對嘗試找到兩個完整的序列 S1和 S2之間的最佳比對。以下面兩個 DNA 序列為例：
S1= GCCCTAGCG
S2= GCGCAATG
如果為每個匹配字符一分，一個空格扣兩分，一個不匹配字符扣一分，那么下面的比對就是全局最優(yōu)比對：
S1'= GCCCTAGCG
S2'= GCGC-AATG
連字符（-）代表空格。在 S2'中有五個匹配字符，一個空格（或者反過來說，在 S1'中有一個插入項），有三個不匹配字符。這樣得到的分?jǐn)?shù)是 (5 * 1) + (1 * -2) + (3 * -1) = 0，這是能夠?qū)崿F(xiàn)的最佳結(jié)果。
使用局部序列比對，不必對兩個完整的序列進行比對，可以在每個序列中使用某些部分來獲得最大得分。使用同樣的序列 S1和 S2，以及同樣的得分方案，可以得到以下局部最優(yōu)比對 S1''和 S2''：
S1 = GCCCTAGCG
S1''= GCG
S2''= GCG
S2 = GCGCAATG
這個局部比對的得分是 (3 * 1) + (0 * -2) + (0 * -1) = 3。

拼寫糾錯（Spell Correction）：又拼寫檢查（Spell Checker），將每個詞與詞典中的詞條比較，英文單詞往往需要做詞干提取等規(guī)范化處理，如果一個詞在詞典中不存在，就被認為是一個錯誤，然后試圖提示N個最可能要輸入的詞——拼寫建議。常用的提示單詞的算法就是列出詞典中與原詞具有最小編輯距離的詞條。

這里肯定有人有疑問：對每個不在詞典中的詞（假如長度為len）都與詞典中的詞條計算最小編輯距離，時間復(fù)雜度是不是太高了？的確，所以一般需要加一些剪枝策略，如：

因為一般拼寫檢查應(yīng)用只需要給出Top-N的糾正建議即可（N一般取10），那么我們可以從詞典中按照長度依次為len、len-1、len+1、len-2、len-3、...的詞條比較；

限定拼寫建議詞條與當(dāng)前詞條的最小編輯距離不能大于某個閾值；

如果最小編輯距離為1的候選詞條超過N后，終止處理；

緩存常見的拼寫錯誤和建議，提高性能。

命名實體抽取（Named Entity Extraction）：由于實體的命名往往沒有規(guī)律，如品牌名，且可能存在多種變形、拼寫形式，如“IBM”和“IBM Inc.”，這樣導(dǎo)致基于詞典完全匹配的命名實體識別方法召回率較低，為此，我們可以使用編輯距離由完全匹配泛化到模糊匹配，先抽取實體名候選詞。

具體的，可以將候選文本串與詞典中的每個實體名進行編輯距離計算，當(dāng)發(fā)現(xiàn)文本中的某一字符串的編輯距離值小于給定閾值時，將其作為實體名候選詞；獲取實體名候選詞后，根據(jù)所處上下文使用啟發(fā)式規(guī)則或者分類的方法判定候選詞是否的確為實體名。

實體共指（Entity Coreference）：通過計算任意兩個實體名之間的最小編輯距離判定是否存在共指關(guān)系？如“IBM”和“IBM Inc.”, "Stanford President John Hennessy "和"Stanford University President John Hennessy"。
機器翻譯（Machine Translation）：

識別平行網(wǎng)頁對：由于平行網(wǎng)頁通常有相同或類似的界面結(jié)構(gòu)，因此平行網(wǎng)頁在HTML結(jié)構(gòu)上應(yīng)該有很大近似度。首先將網(wǎng)頁的HTML標(biāo)簽抽取出來，連接成一個字符串，然后用最小編輯距離考察兩個字符串的近似度。實際中，此策略一般與文檔長度比例、句對齊翻譯模型等方法結(jié)合使用，以識別最終的平行網(wǎng)頁對。

自動評測：首先存儲機器翻譯原文和不同質(zhì)量級別的多個參考譯文，評測時把自動翻譯的譯文對應(yīng)到與其編輯距離最小的參考譯文上，間接估算自動譯文的質(zhì)量，如下圖所示：

字符串核函數(shù)（String Kernel）：最小編輯距離作為字符串之間的相似度計算函數(shù)，用作核函數(shù)，集成在SVM中使用。

問題：找出字符串的編輯距離，即把一個字符串s1最少經(jīng)過多少步操作變成編程字符串s2，操作有三種，添加一個字符，刪除一個字符，修改一個字符

解析：

首先定義這樣一個函數(shù)——edit(i, j)，它表示第一個字符串的長度為i的子串到第二個字符串的長度為j的子串的編輯距離。

顯然可以有如下動態(tài)規(guī)劃公式：

if i == 0 且 j == 0，edit(i, j) = 0
if i == 0 且 j > 0，edit(i, j) = j
if i > 0 且j == 0，edit(i, j) = i
if?i ≥ 1? 且?j ≥ 1 ，edit(i, j) == min{ edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + f(i, j) }，當(dāng)?shù)谝粋€字符串的第i個字符不等于第二個字符串的第j個字符時，f(i, j) = 1；否則，f(i, j) = 0。

?	0	f	a	i	l	i	n	g
0	?	?	?	?	?	?	?	?
s	?	?	?	?	?	?	?	?
a	?	?	?	?	?	?	?	?
i	?	?	?	?	?	?	?	?
l	?	?	?	?	?	?	?	?
n	?	?	?	?	?	?	?	?

?	0	f	a	i	l	i	n	g
0	0	1	2	3	4	5	6	7
s	1	?	?	?	?	?	?	?
a	2	?	?	?	?	?	?	?
i	3	?	?	?	?	?	?	?
l	4	?	?	?	?	?	?	?
n	5	?	?	?	?	?	?	?

?計算edit(1, 1)，edit(0, 1) + 1 == 2，edit(1, 0) + 1 == 2，edit(0, 0) + f(1, 1) == 0 + 1 == 1，min(edit(0, 1)，edit(1, 0)，edit(0, 0) + f(1, 1))==1，因此edit(1, 1) == 1。依次類推：

?	0	f	a	i	l	i	n	g
0	0	1	2	3	4	5	6	7
s	1	1	2	3	4	5	6	7
a	2	2	?	?	?	?	?	?
i	3	?	?	?	?	?	?	?
l	4	?	?	?	?	?	?	?
n	5	?	?	?	?	?	?	?

edit(2, 1) + 1 == 3，edit(1, 2) + 1 == 3，edit(1, 1) + f(2, 2) == 1 + 0 == 1，其中s1[2] == 'a' 而 s2[1] == 'f'‘，兩者不相同，所以交換相鄰字符的操作不計入比較最小數(shù)中計算。以此計算，得出最后矩陣為：

?	0	f	a	i	l	i	n	g
0	0	1	2	3	4	5	6	7
s	1	1	2	3	4	5	6	7
a	2	2	1	2	3	4	5	6
i	3	3	2	1	2	3	4	5
l	4	4	3	2	1	2	3	4
n	5	5	4	3	2	2	2	3

應(yīng)用：

（1）編輯距離是測量一個字符串轉(zhuǎn)換成另外一個字符串需要操作（操作包括：插入??? 刪除??? 置換）的最小次數(shù)。
????????? 編輯距離可以用來計算兩字符串的相似度，另外也可以通過余弦方法來計算兩字符串的相似度
（2）算法實現(xiàn)采用動態(tài)規(guī)劃算法，其求解過程類似于求兩字符串的最長公共序列（LCS）
?????????? 下面是算法實現(xiàn)：

1 public class Distance 2 { 3 public static int getDistance(String s1, String s2) 4 { 5 int len1 = s1.length(); 6 int len2 = s2.length(); 7 8 int[][] d = new int[len1+1][len2+1]; 9 int i=0, j=0; 10 for(i=0; i<=len1; i++) 11 d[i][0] = i; 12 for(j=0; j<=len2; j++) 13 d[0][j] = j; 14 for (i = 1; i < len1+1; i++) 15 for (j = 1; j < len2+1; j++) 16 { 17 int cost = 1; 18 if(s1.charAt(i-1) == s2.charAt(j-1)) 19 { 20 cost = 0; 21 } 22 int delete = d[i - 1][j] + 1; 23 int insert = d[i][j - 1] + 1; 24 int substitution = d[i - 1][j - 1] + cost; 25 d[i][j] = min(delete, insert, substitution); 26 } 27 return (d[len1][len2]); 28 } 29 30 public static int min(int d,int i,int s) 31 { 32 int temp = 0; 33 if(d>i) 34 temp = i; 35 else 36 temp = d; 37 return s<temp?s:temp; 38 } 39 40 public static void main(String args[]) 41 { 42 String s1= "kitten"; 43 String s2 = "sitting"; 44 System.out.println(Distance.getDistance(s1, s2)); 45 } 46 }

?編輯距離（計算兩個字符串的相似度）算法 .Net語言實例：

using System; using System.Collections.Generic;/** 作者：熊仔其人* 時間：2014年4月22日*/ namespace DataTool {/// <summary>/// 相似度/// 熊仔其人/// 2014年4月22日/// </summary>public static class LevenshteinDistance{#region Levenshtein Distance算法（編輯距離算法）/// <summary>/// 三個數(shù)字中取最小的一個數(shù)字/// </summary>/// <param name="first"></param>/// <param name="second"></param>/// <param name="third"></param>/// <returns></returns>private static int LowerOfThree(int first, int second, int third){int min = first;if (second < min)min = second;if (third < min)min = third;return min;}/// <summary>/// 根據(jù)Levenshtein Distance算法（編輯距離算法）計算兩個字符串的相似度/// </summary>/// <param name="text1"></param>/// <param name="text2"></param>/// <returns></returns>private static int Levenshtein_Distance(string text1, string text2){int[,] Matrix;int n = text1.Length;int m = text2.Length;int temp = 0;char ch1;char ch2;int i = 0;int j = 0;if (n == 0){return m;}if (m == 0){return n;}Matrix = new int[n + 1, m + 1];for (i = 0; i <= n; i++){//初始化第一列Matrix[i, 0] = i;}for (j = 0; j <= m; j++){//初始化第一行Matrix[0, j] = j;}for (i = 1; i <= n; i++){ch1 = text1[i - 1];for (j = 1; j <= m; j++){ch2 = text2[j - 1];if (ch1.Equals(ch2)){temp = 0;}else{temp = 1;}Matrix[i, j] = LowerOfThree(Matrix[i - 1, j] + 1, Matrix[i, j - 1] + 1, Matrix[i - 1, j - 1] + temp);}}//for (i = 0; i <= n; i++)//{// for (j = 0; j <= m; j++)// {// Console.Write(" {0} ", Matrix[i, j]);// }// Console.WriteLine("");//}return Matrix[n, m];}/// <summary>/// 根據(jù)Levenshtein Distance算法（編輯距離算法）計算兩個字符串的相似度（百分比）/// </summary>/// <param name="text1">第一個字符串</param>/// <param name="text2">第二個字符串</param>/// <returns>相似度（百分比）</returns>public static decimal LevenshteinDistancePercent(string text1, string text2){if (string.IsNullOrEmpty(text1) && string.IsNullOrEmpty(text2))return 1;else if (string.IsNullOrEmpty(text1) || string.IsNullOrEmpty(text2))return 0;int maxLenth = text1.Length > text2.Length ? text1.Length : text2.Length;int val = Levenshtein_Distance(text1, text2);return 1 - (decimal)val / maxLenth;}#endregion#region 計算兩個字符串的相似度（百分比）/// <summary>/// 計算兩個字符串的相似度（百分比），比較每一個字符組成，返回結(jié)果相似度與字符順序有關(guān)，但是并不需要順序完全一致/// </summary>/// <param name="text1">第一個字符串</param>/// <param name="text2">第二個字符串</param>/// <returns>相似度（百分比）</returns>public static decimal SimilarByStringPercent(string text1, string text2){if (string.IsNullOrEmpty(text1) && string.IsNullOrEmpty(text2))return 1;else if (string.IsNullOrEmpty(text1) || string.IsNullOrEmpty(text2))return 0;decimal returnValue = 0;int maxLength;int i, l;List<string> tb1 = new List<string>();List<string> tb2 = new List<string>();i = 0;l = 1;maxLength = text1.Length;if (text1.Length < text2.Length)maxLength = text2.Length;while (l <= text1.Length){while (i < text1.Length - 1){if (i + l > text1.Length)break;tb1.Add(text1.Substring(i, l));i++;}i = 0;l++;}i = 0;l = 1;while (l <= text2.Length){while (i < text2.Length - 1){if (i + l > text2.Length)break;tb2.Add(text2.Substring(i, l));i++;}i = 0;l++;}foreach (string subStr in tb1){decimal tempRe = 0;if (tb2.Contains(subStr)){tempRe = (decimal)subStr.Length / maxLength;if (tempRe > returnValue)returnValue = tempRe;if (tempRe == 1)break;}}return returnValue;}#endregion} }

轉(zhuǎn)載于:https://www.cnblogs.com/xiongzaiqiren/p/4997947.html

總結(jié)

以上是生活随笔為你收集整理的Levenshtein Distance算法（编辑距离算法）的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇：【读书笔记】iOS-iCloud编程
下一篇：行为驱动开发BDD概要

?	0	f	a	i	l	i	n	g
0	?	?	?	?	?	?	?	?
s	?	?	?	?	?	?	?	?
a	?	?	?	?	?	?	?	?
i	?	?	?	?	?	?	?	?
l	?	?	?	?	?	?	?	?
n	?	?	?	?	?	?	?	?

?	0	f	a	i	l	i	n	g
0	0	1	2	3	4	5	6	7
s	1	?	?	?	?	?	?	?
a	2	?	?	?	?	?	?	?
i	3	?	?	?	?	?	?	?
l	4	?	?	?	?	?	?	?
n	5	?	?	?	?	?	?	?

?	0	f	a	i	l	i	n	g
0	0	1	2	3	4	5	6	7
s	1	1	2	3	4	5	6	7
a	2	2	?	?	?	?	?	?
i	3	?	?	?	?	?	?	?
l	4	?	?	?	?	?	?	?
n	5	?	?	?	?	?	?	?

?	0	f	a	i	l	i	n	g
0	0	1	2	3	4	5	6	7
s	1	1	2	3	4	5	6	7
a	2	2	1	2	3	4	5	6
i	3	3	2	1	2	3	4	5
l	4	4	3	2	1	2	3	4
n	5	5	4	3	2	2	2	3

?	0	f	a	i	l	i	n	g
0	?	?	?	?	?	?	?	?
s	?	?	?	?	?	?	?	?
a	?	?	?	?	?	?	?	?
i	?	?	?	?	?	?	?	?
l	?	?	?	?	?	?	?	?
n	?	?	?	?	?	?	?	?

?	0	f	a	i	l	i	n	g
0	0	1	2	3	4	5	6	7
s	1	?	?	?	?	?	?	?
a	2	?	?	?	?	?	?	?
i	3	?	?	?	?	?	?	?
l	4	?	?	?	?	?	?	?
n	5	?	?	?	?	?	?	?

?	0	f	a	i	l	i	n	g
0	0	1	2	3	4	5	6	7
s	1	1	2	3	4	5	6	7
a	2	2	?	?	?	?	?	?
i	3	?	?	?	?	?	?	?
l	4	?	?	?	?	?	?	?
n	5	?	?	?	?	?	?	?

?	0	f	a	i	l	i	n	g
0	0	1	2	3	4	5	6	7
s	1	1	2	3	4	5	6	7
a	2	2	1	2	3	4	5	6
i	3	3	2	1	2	3	4	5
l	4	4	3	2	1	2	3	4
n	5	5	4	3	2	2	2	3

?	0	f	a	i	l	i	n	g
0	?	?	?	?	?	?	?	?
s	?	?	?	?	?	?	?	?
a	?	?	?	?	?	?	?	?
i	?	?	?	?	?	?	?	?
l	?	?	?	?	?	?	?	?
n	?	?	?	?	?	?	?	?

?	0	f	a	i	l	i	n	g
0	0	1	2	3	4	5	6	7
s	1	?	?	?	?	?	?	?
a	2	?	?	?	?	?	?	?
i	3	?	?	?	?	?	?	?
l	4	?	?	?	?	?	?	?
n	5	?	?	?	?	?	?	?

?	0	f	a	i	l	i	n	g
0	0	1	2	3	4	5	6	7
s	1	1	2	3	4	5	6	7
a	2	2	?	?	?	?	?	?
i	3	?	?	?	?	?	?	?
l	4	?	?	?	?	?	?	?
n	5	?	?	?	?	?	?	?

?	0	f	a	i	l	i	n	g
0	0	1	2	3	4	5	6	7
s	1	1	2	3	4	5	6	7
a	2	2	1	2	3	4	5	6
i	3	3	2	1	2	3	4	5
l	4	4	3	2	1	2	3	4
n	5	5	4	3	2	2	2	3