The Road to Machine Learning: Practicing word2vec Word Vectors in Python

Published: 2024/9/5 | python | 29 | 豆豆

git: https://github.com/linyi0604/MachineLearning

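Word vector technique: Word2Vec. Each contiguous span of words constrains the words that follow it; this constraint is called context. Word2Vec learns from these contexts to uncover semantic-level relationships in the text.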
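To make the idea of context concrete, here is a minimal sketch of how a context window turns one sentence into (center word, context word) pairs. The helper name `context_pairs` and the sample sentence are made up for illustration. This is the skip-gram way of looking at the data, not gensim's internal code, which among other things shrinks the window randomly during training.

```python
from typing import List, Tuple


def context_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for all tokens within `window` positions.

    Hypothetical helper for illustration only.
    """
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the window to the sentence boundaries
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the college offers evening classes".split()
pairs = context_pairs(sentence, window=1)
for center, ctx in pairs:
    print(center, "->", ctx)
```

With `window=1` each word is paired only with its direct neighbours. A larger window, such as the 5 used in the training script, lets more distant words influence each vector.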

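from sklearn.datasets import fetch_20newsgroups
from bs4 import BeautifulSoup
import nltk, re
from gensim.models import word2vec

# Run once to fetch the Punkt sentence tokenizer data
# (recent NLTK releases may ask for 'punkt_tab' instead).
# nltk.download('punkt')

'''
Word vector technique: Word2Vec
Each contiguous span of words constrains the words that follow it; this is called context.
The goal is to uncover semantic-level relationships in the text.
'''

# Download the 20 Newsgroups data (requires network access)
news = fetch_20newsgroups(subset="all")
x, y = news.data, news.target

# Split one news post into sentences and return them as a list of word lists
def news_to_sentences(news):
    # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    news_text = BeautifulSoup(news, "html.parser").get_text()
    # sent_tokenize uses the English Punkt model, the same tokenizer the
    # original loaded from tokenizers/punkt/english.pickle
    raw_sentences = nltk.sent_tokenize(news_text)
    sentences = []
    for sent in raw_sentences:
        # Lowercase, replace every non-letter with a space, then split into words
        temp = re.sub("[^a-zA-Z]", " ", sent.lower().strip()).split()
        sentences.append(temp)

    return sentences

# Pull the sentences out of every news post to build the training corpus
sentences = []
for i in x:
    sentence_list = news_to_sentences(i)
    sentences += sentence_list


# Dimensionality of the word vectors
num_features = 300
# Minimum frequency a word needs to be kept in the vocabulary
min_word_count = 20
# Number of CPU cores used for parallel training
num_workers = 2
# Size of the context window used during training
context = 5
# Downsampling threshold for very frequent words
downsampling = 1e-3

# Train the word vector model
# (gensim >= 4.0 names the dimension parameter vector_size; gensim 3.x called it size)
model = word2vec.Word2Vec(sentences,
                          workers=num_workers,
                          vector_size=num_features,
                          min_count=min_word_count,
                          window=context,
                          sample=downsampling)
# The original called model.init_sims(replace=True) here to mark the vectors as final.
# That call keeps only the normalized vectors to save memory; it does not speed up training.
# gensim 4 deprecates it as unnecessary, so it is left out.

# Use the trained model to find the ten words most related to "college"
# (gensim 4 exposes the lookup on model.wv; gensim 3.x also accepted model.most_similar)
print(model.wv.most_similar("college"))
'''
Output from the original run (exact values vary between runs):
[('wisconsin', 0.7664438486099243),
 ('osteopathic', 0.7474539279937744),
 ('madison', 0.7433826923370361),
 ('univ', 0.7296794652938843),
 ('melbourne', 0.7212647199630737),
 ('walla', 0.7068545818328857),
 ('maryland', 0.7038443088531494),
 ('carnegie', 0.7038302421569824),
 ('institute', 0.7003713846206665),
 ('informatics', 0.6968873143196106)]
'''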
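The scores printed next to each neighbour of "college" are cosine similarities between word vectors. As a rough sketch of what that ranking does, the snippet below recomputes it in plain Python. The three-dimensional `toy_vectors` are invented for illustration, not taken from the trained model, and `most_similar_toy` is a hypothetical stand-in rather than gensim's implementation.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_toy(word: str,
                     vectors: Dict[str, List[float]],
                     topn: int = 10) -> List[Tuple[str, float]]:
    """Rank every other word by cosine similarity to `word` (illustrative only)."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors: the first three words point in similar directions
toy_vectors = {
    "college": [1.0, 1.0, 0.0],
    "univ": [1.0, 0.9, 0.1],
    "institute": [0.8, 1.0, 0.3],
    "banana": [0.0, 0.1, 1.0],
}

ranking = most_similar_toy("college", toy_vectors, topn=3)
for word, score in ranking:
    print(word, round(score, 4))
```

Here "univ" and "institute" rank far above "banana" because their vectors point in nearly the same direction as "college". The real model performs the same comparison in 300 dimensions over the whole vocabulary.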

Reposted from: https://www.cnblogs.com/Lin-Yi/p/9007259.html
