Knowledge Graph Study Notes: Unstructured Data Processing

Published: 2025/4/5

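From Unstructured Data to a Knowledge Graph

Unstructured data -> information extraction (named entity recognition, relation extraction) -> graph construction (entity disambiguation, link prediction) -> graph analysis algorithms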

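1. Key Techniques in Text Analysis

Spelling correction
Word segmentation
Stemming
Word filtering
Text representation
Text similarity
Word vectors
Sentence vectors
Named entity recognition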

2. Spelling Correction

input -> correction

天起 -> 天氣 ("weather")

theris -> theirs

機器學系 -> 機器學習 ("machine learning")

Pick the candidate with the smallest edit distance to the input:

input     candidate     edit distance
therr     there         1
          thesis        3
          theirs        2
          the           2
          their         1

Computing the edit distance:

Given str s, str t => editDist(s,t)

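How the algorithm works (dynamic programming over prefixes of the two strings):

"""Implementation"""
def edit_dist(str1, str2):
    m, n = len(str1), len(str2)
    # dp[i][j] = edit distance between str1[:i] and str2[:j]
    dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                dp[i][j] = j  # str1 prefix is empty: insert j characters
            elif j == 0:
                dp[i][j] = i  # str2 prefix is empty: delete i characters
            elif str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # last characters match: no cost
            else:
                dp[i][j] = 1 + min(dp[i][j - 1],      # insert
                                   dp[i - 1][j - 1],  # replace
                                   dp[i - 1][j])      # delete
    return dp[m][n]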
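The recurrence that the implementation above fills in can be written out explicitly. The symbol D is introduced here only for illustration: D(i, j) stands for the edit distance between the first i characters of s and the first j characters of t, with m and n the lengths of s and t.

```latex
\begin{align*}
D(i, 0) &= i, \qquad D(0, j) = j, \\
D(i, j) &=
\begin{cases}
D(i-1,\, j-1) & \text{if } s_i = t_j, \\
1 + \min\bigl( D(i,\, j-1),\; D(i-1,\, j-1),\; D(i-1,\, j) \bigr) & \text{otherwise,}
\end{cases} \\
\operatorname{editDist}(s, t) &= D(m, n).
\end{align*}
```

The three terms inside the minimum correspond to an insertion, a replacement, and a deletion, each with cost 1. For the table entry therr -> their, the first three characters match at no cost and only one replacement (r -> i) is needed, giving a distance of 1.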

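The idea so far can be summarized as a pipeline: user input -> find the Top K dictionary words with the smallest edit distance -> rank

Drawback: the edit distance has to be computed against every word in the dictionary, so the time cost is too high.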
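As a rough sketch of this baseline, the snippet below scans a whole vocabulary and keeps the K closest words. The function name `top_k_candidates` and the toy vocabulary (taken from the candidates table earlier in this section) are assumptions for illustration; the distance function is a memory-saving variant of the same recurrence used in these notes.

```python
import heapq

def edit_dist(s, t):
    """Levenshtein distance (insert, delete, replace), keeping one DP row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            if cs == ct:
                curr.append(prev[j - 1])
            else:
                curr.append(1 + min(curr[j - 1], prev[j - 1], prev[j]))
        prev = curr
    return prev[-1]

def top_k_candidates(word, vocabulary, k=3):
    """Compare `word` against every vocabulary entry and keep the k closest."""
    # Ties on distance are broken alphabetically so the result is deterministic.
    return heapq.nsmallest(k, vocabulary, key=lambda w: (edit_dist(word, w), w))

vocabulary = ["there", "thesis", "theirs", "the", "their"]  # toy dictionary
print(top_k_candidates("therr", vocabulary, k=2))  # ['their', 'there']
```

Every query costs one full edit-distance computation per dictionary word, which is exactly the drawback noted above.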

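Improved approach:

User input -> generate all strings within edit distance 1 or 2 -> filter with the dictionary -> rank

The candidate set left after filtering is far smaller than the dictionary, and each generated string needs only a dictionary lookup instead of an edit distance computation.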
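A minimal sketch of the generate-and-filter step, assuming lowercase English input and the same three edit operations as the edit distance above (insert, delete, replace). The names `edits1`, `edits2`, and `candidates` are illustrative, not taken from the notes.

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings exactly one insert, delete, or replace away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    replaces = [left + ch + right[1:] for left, right in splits if right for ch in alphabet]
    return set(deletes + inserts + replaces) - {word}

def edits2(word):
    """All strings reachable with two edits (may include distance-1 strings)."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)} - {word}

def candidates(word, vocabulary):
    """Dictionary words at edit distance 1 and 2, found by set lookups only."""
    vocabulary = set(vocabulary)
    known1 = edits1(word) & vocabulary
    known2 = (edits2(word) & vocabulary) - known1
    return {1: known1, 2: known2}

vocabulary = {"there", "thesis", "theirs", "the", "their"}  # toy dictionary
print(candidates("therr", vocabulary))
```

With the toy dictionary, "there" and "their" come back at distance 1, "theirs" and "the" at distance 2, and "thesis" (distance 3) is never generated.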

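For the ranking step, P(c) and P(s|c), where s is the input string and c is a candidate correction, can be estimated statistically from how often they occurred in historical data.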
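The notes name only P(c) and P(s|c). Assuming they are combined in the usual noisy-channel way, the ranking criterion would follow from Bayes' rule, as sketched below (the hat notation and the set name "candidates" are introduced here for illustration):

```latex
\begin{align*}
\hat{c} &= \arg\max_{c \,\in\, \text{candidates}} P(c \mid s) \\
        &= \arg\max_{c \,\in\, \text{candidates}} \frac{P(s \mid c)\, P(c)}{P(s)} \\
        &= \arg\max_{c \,\in\, \text{candidates}} P(s \mid c)\, P(c)
\end{align*}
```

P(s) can be dropped because it is the same for every candidate. P(c) reflects how common the word c is, and P(s|c) reflects how likely it is that someone who meant c typed s.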

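3. Word Segmentation

In many languages, such as Chinese, word segmentation is the most important first step. The jieba segmentation tool is commonly used for Chinese.

How would you write a word segmenter?

Two simple steps: sentence -> candidate segmentations (dictionary + DP) -> choose the best one (language model)

Drawbacks: slow, and errors from the first step propagate to the second.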

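Improvement: segmentation + language model -> joint optimization

"""Segmentation with jieba"""
import jieba

# cut_all=False selects precise mode; cut() returns a generator of tokens
seg_list = jieba.cut("小王專注于人工智能", cut_all=False)
print(" ".join(seg_list))

"""Add a word to the dictionary"""
jieba.add_word("小王專注")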
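One way to read the joint optimization idea above is a single dynamic-programming pass that scores dictionary words with a unigram language model while it segments, instead of enumerating candidates first and scoring them afterwards. The sketch below is written under that assumption; the probabilities in `WORD_PROB` are hypothetical, and this is not how jieba is implemented.

```python
import math

# Hypothetical unigram probabilities; a real system would estimate them from a corpus.
WORD_PROB = {
    "我們": 0.2, "今天": 0.2, "去": 0.2, "爬山": 0.1,
    "爬": 0.05, "山": 0.05, "我": 0.1, "們": 0.1,
}
UNKNOWN_LOG_PROB = math.log(1e-8)  # fallback score for an unseen single character
MAX_WORD_LEN = max(len(w) for w in WORD_PROB)

def segment(sentence):
    """Return the segmentation with the highest unigram log-probability."""
    n = len(sentence)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for sentence[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in that split
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = sentence[start:end]
            if word in WORD_PROB:
                log_p = math.log(WORD_PROB[word])
            elif len(word) == 1:
                log_p = UNKNOWN_LOG_PROB
            else:
                continue  # unknown multi-character strings are not considered words
            score = best[start] + log_p
            if score > best[end]:
                best[end], back[end] = score, start
    # Walk the back-pointers from the end of the sentence to recover the words.
    words, end = [], n
    while end > 0:
        words.append(sentence[back[end]:end])
        end = back[end]
    return words[::-1]

print(segment("我們今天去爬山"))  # ['我們', '今天', '去', '爬山']
```

Because the score is computed while the split is being chosen, a poor candidate list can no longer limit the language model, which is the error-propagation problem of the two-step approach.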

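4. Word Filtering

Stop words and very low-frequency words are usually filtered out first.

Benefits: higher accuracy and lower time cost.

Stop words

In English, words such as "the", "an", and "their" can be treated as stop words, although the right list depends on the application.

Words that occur very rarely contribute little to the analysis, so they are usually removed as well.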
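A small sketch of both filters applied to tokenized documents. The stop-word set contains only the three examples named above, and the `MIN_COUNT` threshold of 2 is a hypothetical value; real lists and thresholds depend on the task.

```python
from collections import Counter

STOP_WORDS = {"the", "an", "their"}  # only the examples from the notes
MIN_COUNT = 2                        # hypothetical low-frequency threshold

def filter_tokens(documents, stop_words=STOP_WORDS, min_count=MIN_COUNT):
    """Drop stop words and words seen fewer than `min_count` times in the corpus."""
    counts = Counter(tok for doc in documents for tok in doc)
    return [
        [tok for tok in doc if tok not in stop_words and counts[tok] >= min_count]
        for doc in documents
    ]

docs = [
    ["the", "phone", "was", "in", "the", "car"],
    ["their", "phone", "was", "lost"],
]
print(filter_tokens(docs))  # [['phone', 'was'], ['phone', 'was']]
```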

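5. Stemming (word conversion)

Words with similar meanings are merged into a single form:

went, go, going

fly, flies

fast, faster, fastest

Merging with a stemming algorithm. A rule-based stemmer such as Porter only strips suffixes, so irregular forms like "went" need a lemmatizer instead.

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
test_strs = ['caresses', 'dies', 'flies', 'mules', 'denied',
             'died', 'agreed', 'owned', 'humbled', 'sized',
             'meeting', 'stating']
# Reduce each word to its stem
singles = [stemmer.stem(word) for word in test_strs]
print(' '.join(singles))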

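6. Text Representation

Word representation: the most common approach is the bag-of-words model.

Suppose the dictionary contains 7 words: [我們, 去, 爬山, 今天, 你們, 昨天, 運動] (we, go, hiking, today, you, yesterday, exercise)

Each word is represented by a vector whose dimension equals the dictionary size. The vector is sparse (one-hot): a single 1, with 0 everywhere else.

我們: [1, 0, 0, 0, 0, 0, 0]
爬山: [0, 0, 1, 0, 0, 0, 0]
運動: [0, 0, 0, 0, 0, 0, 1]
昨天: [0, 0, 0, 0, 0, 1, 0]

Bag-of-words model (vector dimension = dictionary size)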
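The one-hot vectors above can be reproduced with a few lines of Python. The helper names `one_hot` and `count_vector` are illustrative; `count_vector` extends the same idea to a tokenized sentence by counting words per dictionary position.

```python
vocabulary = ["我們", "去", "爬山", "今天", "你們", "昨天", "運動"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Vector of dictionary size with a single 1 at the word's position."""
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1
    return vec

def count_vector(tokens):
    """Bag-of-words vector for a tokenized sentence: one count per dictionary word."""
    vec = [0] * len(vocabulary)
    for tok in tokens:
        vec[index[tok]] += 1
    return vec

print(one_hot("爬山"))                               # [0, 0, 1, 0, 0, 0, 0]
print(count_vector(["我們", "今天", "去", "爬山"]))  # [1, 1, 1, 1, 0, 0, 0]
```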

Sentence representation:

corpus = [
    'He is going from Beijing to Shanghai.',
    'He denied my request, but he actually lied.',
    'Mike lost the phone, and phone was in the car.',
]

Count matrix (one row per sentence, one column per word of the 21-word vocabulary in alphabetical order):

[[0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 1 0 0 2 0 0 1 0 0 1 0 1 0 0 0 0]
 [0 1 0 0 1 0 0 0 0 1 0 0 1 1 0 2 0 0 2 0 1]]

tf-idf representation

tf-idf(d, w) = tf(d, w) * idf(w)

Here tf(d, w) is the frequency of word w in document d, and idf(w) measures how rare w is across all documents.

"""Convert text to vectors"""
corpus = [
    'He is going from Beijing to Shanghai.',
    'He denied my request, but he actually lied.',
    'Mike lost the phone, and phone was in the car.',
]

# Method 1: term counts only
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X)            # sparse representation
print(X.toarray())  # dense count matrix

# Method 2: term counts weighted by word importance (tf-idf)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
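To make the tf-idf formula concrete, the sketch below computes it by hand under one common textbook definition, which is an assumption here: tf(d, w) is the raw count of w in d, and idf(w) = log(N / N(w)), where N is the number of documents and N(w) is the number of documents containing w. By default scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return weights

docs = [
    ["he", "is", "going"],
    ["he", "denied", "my", "request"],
    ["mike", "lost", "the", "phone", "phone"],
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "he" also appears in the second document, so it gets a lower weight (log(3/2)) than "is" or "going" (log 3), even though all three occur once.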

How can the similarity of two sentences be measured? Method: d = (s1 · s2) / (|s1| * |s2|) (cosine similarity)
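As a quick illustration, the cosine similarity formula can be applied to the three count vectors from the sentence-representation matrix earlier in this section. The function name and the zero-vector convention are choices made for this sketch.

```python
import math

def cosine_similarity(s1, s2):
    """(s1 . s2) / (|s1| * |s2|) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(s1, s2))
    norm1 = math.sqrt(sum(a * a for a in s1))
    norm2 = math.sqrt(sum(b * b for b in s2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm1 * norm2)

# Rows of the count matrix for the three corpus sentences.
row1 = [0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
row2 = [1, 0, 0, 1, 0, 1, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0]
row3 = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 2, 0, 0, 2, 0, 1]

print(round(cosine_similarity(row1, row2), 3))  # 0.239
print(cosine_similarity(row1, row3))            # 0.0
```

Sentences 1 and 2 share only the word "he", so their similarity is small but non-zero; sentences 1 and 3 share no words, so it is 0.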

我們: [0.1, 0.2, 0.4, 0.2]
爬山: [0.2, 0.3, 0.7, 0.1]
運動: [0.2, 0.3, 0.6, 0.2]
昨天: [0.5, 0.9, 0.1, 0.3]

Distributed representation (advantages: 1. low dimensionality; 2. every position holds a meaningful, non-zero floating-point value)

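Using Euclidean distance:

dist(我們, 爬山) = sqrt(0.12)

dist(爬山, 運動) = sqrt(0.02)

Therefore 爬山 (hiking) and 運動 (exercise) are more similar to each other than 我們 (we) and 爬山 (hiking).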
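The two distances can be checked directly from the example vectors. The dictionary name `embeddings` and the helper `euclidean` are illustrative.

```python
import math

embeddings = {
    "我們": [0.1, 0.2, 0.4, 0.2],
    "爬山": [0.2, 0.3, 0.7, 0.1],
    "運動": [0.2, 0.3, 0.6, 0.2],
    "昨天": [0.5, 0.9, 0.1, 0.3],
}

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_we_hiking = euclidean(embeddings["我們"], embeddings["爬山"])
d_hiking_exercise = euclidean(embeddings["爬山"], embeddings["運動"])

# Print squared distances to compare with sqrt(0.12) and sqrt(0.02).
print(round(d_we_hiking ** 2, 2), round(d_hiking_exercise ** 2, 2))  # 0.12 0.02
```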

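Distributed representations are learned with neural models (e.g. word2vec's Skip-Gram).

Sentence vector: 我們 | 昨天 | 爬山 = ( ? ). Methods: 1. average each dimension across the word vectors; 2. use a sequence model (LSTM, RNN).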
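Method 1 can be sketched with the example word vectors from these notes. The function name `average_vectors` is illustrative.

```python
word_vectors = {
    "我們": [0.1, 0.2, 0.4, 0.2],
    "昨天": [0.5, 0.9, 0.1, 0.3],
    "爬山": [0.2, 0.3, 0.7, 0.1],
}

def average_vectors(vectors):
    """Average a list of equal-length vectors dimension by dimension."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

sentence = ["我們", "昨天", "爬山"]
sentence_vec = average_vectors([word_vectors[w] for w in sentence])
print([round(x, 3) for x in sentence_vec])  # [0.267, 0.467, 0.4, 0.2]
```

Averaging ignores word order, which is the gap that the sequence models in method 2 are meant to address.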
