自然语言处理:网购商品评论情感判定


Contents

1. Project Background

2. Dataset

3. Data Preprocessing

4. SVM-Based Sentiment Classification Model

5. Unsupervised Classification Model Based on doc2vec


Natural Language Processing (NLP) provides enterprises and developers with core tools for text analysis and mining. It aims to help users process text efficiently and has been widely applied, with good results, across e-commerce, entertainment, justice, public security, finance, healthcare, electric power, and other industries.

1. Project Background

In any industry, user feedback on a product matters a great deal. User reviews can be used to determine the sentiment polarity of users' opinions.

Take online shopping, currently the most common scenario: for buyers, reading reviews supports better purchase decisions; for sellers, classifying reviews by sentiment and then clustering the text to surface commonly mentioned strengths and weaknesses gives clear direction for improving the product.

This case study focuses on how to determine the sentiment polarity of product reviews. The figure below shows some reviews of a particular phone on an e-commerce platform:

2. Dataset

This dataset of reviews for the phone contains 2 attributes and 8,187 samples in total.

Use pandas' read_excel function to read the xls dataset file; note that the file encoding must be set to gb18030. The code is as follows:

import pandas as pd

# Load the dataset
data = pd.read_excel("data.xls", encoding='gb18030')
print(data.head())

The first few rows of the dataset are shown below:

Inspect basic information about the dataset, including its shape, column names, and the number of samples per class. The code is as follows:

# Shape of the dataset
print(data.shape)

# Column names
print(data.columns.values)

# Number of records per class
print(data['Class'].value_counts())

The output is as follows:

(8186, 2)
array([u'Comment', u'Class'], dtype=object)
 1    3042
-1    2657
 0    2487
Name: Class, dtype: int64

3. Data Preprocessing

We now need to convert the text in the Comment column into a numeric matrix representation, i.e., map the text into a feature space.

First, segment the Chinese text with jieba, which uses an HMM to handle words outside its dictionary. The code is as follows:

# Import the Chinese word segmentation library jieba
import jieba
import numpy as np
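As a quick sanity check, the segmenter can be tried on a single sentence. A minimal sketch with a made-up review string follows; the exact split can vary with jieba's dictionary version:

# Try jieba on one hypothetical review; HMM=True (the default) enables
# HMM-based discovery of words that are not in jieba's dictionary.
sample = u"这款手机屏幕很漂亮，运行也流畅"
print(" ".join(jieba.cut(sample, HMM=True)))
# Expected output, roughly: 这款 手机 屏幕 很 漂亮 ， 运行 也 流畅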

Next, segment the text of every sample in the dataset; when a missing value is encountered, fill it with the neutral placeholder "還行 一般吧" ("okay, so-so"). The code is as follows:

cutted = []
for row in data.values:
    try:
        raw_words = " ".join(jieba.cut(row[0]))
        cutted.append(raw_words)
    except AttributeError:
        # Non-string (missing) comments raise AttributeError; fill with a neutral placeholder
        print(row[0])
        cutted.append(u"還行 一般吧")
cutted_array = np.array(cutted)

# Build a new DataFrame whose Comment field holds the segmented text
data_cutted = pd.DataFrame({
    'Comment': cutted_array,
    'Class': data['Class']
})

Read back and inspect the preprocessed data:

print(data_cutted.head())

The first few rows of the resulting dataset are shown below:

To get a more intuitive view of the high-frequency words, we visualize the text with the third-party library wordcloud. Import it as follows:

# Import the third-party library wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

For the positive, neutral, and negative review texts, create a WordCloud object and draw a word cloud. The positive-review word cloud is generated as follows:

# Positive reviews
wc = WordCloud(font_path='Courier.ttf')
wc.generate(''.join(data_cutted['Comment'][data_cutted['Class'] == 1]))
plt.axis('off')
plt.imshow(wc)
plt.show()

The positive-review word cloud looks like this:

The neutral-review word cloud is generated as follows:

# Neutral reviews
wc = WordCloud(font_path='Courier.ttf')
wc.generate(''.join(data_cutted['Comment'][data_cutted['Class'] == 0]))
plt.axis('off')
plt.imshow(wc)
plt.show()

The neutral-review word cloud looks like this:

The negative-review word cloud is generated as follows:

# Negative reviews
wc = WordCloud(font_path='Courier.ttf')
wc.generate(''.join(data_cutted['Comment'][data_cutted['Class'] == -1]))
plt.axis('off')
plt.imshow(wc)
plt.show()

The negative-review word cloud looks like this:

Judging from the word clouds' frequency statistics, words such as "手机" (phone), "就是" (just), "屏幕" (screen), and "收到" (received) contribute nothing to class discrimination and can bias the model. We therefore collect these uninformative words into the stop-word file stopwords.txt. The code to load it is as follows:

# Load the stop-word file
import codecs

with codecs.open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = [item.strip() for item in f]
for item in stopwords[0:200]:
    print(item)

The first stop words print as follows:

Use jieba's extract_tags function to compute the top-20 keywords in the positive, neutral, and negative texts.

# Register the stop-word file so that keyword extraction filters out stop words
import jieba.analyse

jieba.analyse.set_stop_words('stopwords.txt')

Positive-review keyword analysis:

# Positive-review keywords
keywords_pos = jieba.analyse.extract_tags(''.join(data_cutted['Comment'][data_cutted['Class'] == 1]), topK=20)
for item in keywords_pos:
    print(item)

The top-20 positive keywords are:

不錯 正品 贈品 五分 發貨 東西 滿意 機子 喜歡 收到 很漂亮 充電 好評 很快 賣家 速度 評價 流暢 快遞 物流

Neutral-review keyword analysis:

# Neutral-review keywords
keywords_med = jieba.analyse.extract_tags(''.join(data_cutted['Comment'][data_cutted['Class'] == 0]), topK=20)
for item in keywords_med:
    print(item)

The top-20 neutral keywords are:

充電 不錯 發熱 外觀 感覺 電池 機子 問題 贈品 有點 無線 發燙 換貨 軟件 快遞 安卓 內存 退貨 知道 售后

Negative-review keyword analysis:

# Negative-review keywords
keywords_neg = jieba.analyse.extract_tags(''.join(data_cutted['Comment'][data_cutted['Class'] == -1]), topK=20)
for item in keywords_neg:
    print(item)

The top-20 negative keywords are:

差評 售后 垃圾 贈品 退貨 問題 換貨 充電 降價 發票 充電器 東西 剛買 發熱 無線 機子 死機 收到 質量 15

With these steps done, preprocessing of the dataset is complete. For Chinese text and sentiment analysis, preprocessing consists mainly of word segmentation: only a segmented text dataset can proceed to vectorization and meet the input requirements of a model.

4. SVM-Based Sentiment Classification Model

The segmented text must first be vectorized before it can be fed into a classification model.

We use sklearn's vectorizers, remove the stop words, and map the text into the feature space via tf-idf.

Here, tf is the term frequency, i.e., the number of times a term occurs in a given review; df is the number of reviews containing that term; and N is the total number of reviews. A logarithm is applied to appropriately damp the influence of the raw tf and df values, giving the familiar weight tf × log(N / df).
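To make the weighting concrete, here is a small sketch with toy numbers and toy strings (not the project data). Note that sklearn's TfidfVectorizer applies a smoothed idf, ln((1 + N) / (1 + df)) + 1, followed by L2 normalization, so its values differ slightly from the textbook formula:

import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Textbook tf-idf with toy numbers: a term occurring tf=3 times in one review
# and appearing in df=20 of N=8186 reviews.
tf, df, N = 3, 20, 8186
print(tf * math.log(N / float(df)))  # about 18.0; the log damps raw counts

# The same mapping on two toy segmented comments via sklearn
docs = [u"手机 很 好用", u"屏幕 很 漂亮 好用"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names())  # learned vocabulary
print(matrix.toarray())                # each row: one comment in feature space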

We implement the SVM algorithm directly with sklearn's built-in classes, selecting the following SVM variants: SVC with a linear kernel, LinearSVC, and SGDClassifier.

For convenience, we create a text sentiment classification class, CommentClassifier, to encapsulate the modeling process:

  • __init__ is the class initializer; its parameters classifier_type and vector_type select the classification model and the vectorization method, respectively.
  • fit() performs vectorization and model fitting.

The implementation is as follows:

# Vectorization methods
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# SVM models
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

# Train/test split and cross-validation
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score

# Evaluation metrics
from sklearn import metrics

# Text sentiment classification class: CommentClassifier
class CommentClassifier:
    def __init__(self, classifier_type, vector_type):
        self.classifier_type = classifier_type  # classifier: one of the three SVM variants below
        self.vector_type = vector_type          # vectorization: 0/1 counts, TF, or TF-IDF

    def fit(self, train_x, train_y, max_df):
        list_text = list(train_x)
        # Vectorization method: 0 - 0/1, 1 - TF, 2 - TF-IDF
        if self.vector_type == 0:
            self.vectorizer = CountVectorizer(max_df=max_df, stop_words=stopwords,
                                              ngram_range=(1, 3)).fit(list_text)
        elif self.vector_type == 1:
            self.vectorizer = TfidfVectorizer(max_df=max_df, stop_words=stopwords,
                                              ngram_range=(1, 3), use_idf=False).fit(list_text)
        else:
            self.vectorizer = TfidfVectorizer(max_df=max_df, stop_words=stopwords,
                                              ngram_range=(1, 3)).fit(list_text)
        self.array_trainx = self.vectorizer.transform(list_text)
        self.array_trainy = train_y
        # Model selection: 1 - SVC, 2 - LinearSVC, 3 - SGDClassifier, three SVM variants
        if self.classifier_type == 1:
            self.model = SVC(kernel='linear', gamma=10 ** -5, C=1).fit(self.array_trainx, self.array_trainy)
        elif self.classifier_type == 2:
            self.model = LinearSVC().fit(self.array_trainx, self.array_trainy)
        else:
            self.model = SGDClassifier().fit(self.array_trainx, self.array_trainy)

    def predict_value(self, test_x):
        list_text = list(test_x)
        self.array_testx = self.vectorizer.transform(list_text)
        array_predict = self.model.predict(self.array_testx)
        return array_predict

    def predict_proba(self, test_x):
        list_text = list(test_x)
        self.array_testx = self.vectorizer.transform(list_text)
        array_score = self.model.predict_proba(self.array_testx)
        return array_score
  • Use train_test_split() to divide the data into a training set (80%) and a test set (20%).
  • Build value lists for the two parameters classifier_type and vector_type to enumerate every combination of vectorization method and classification model.
  • For each combination, print the evaluation results: the confusion matrix and a report with the Precision, Recall, and F1-score metrics.

The implementation is as follows:

# Split into training and test sets
train_x, test_x, train_y, test_y = train_test_split(
    data_cutted['Comment'].ravel().astype('U'),
    data_cutted['Class'].ravel(),
    test_size=0.2, random_state=4)

classifier_list = [1, 2, 3]
vector_list = [0, 1, 2]
for classifier_type in classifier_list:
    for vector_type in vector_list:
        commentCls = CommentClassifier(classifier_type, vector_type)
        # max_df is set to 0.98
        commentCls.fit(train_x, train_y, 0.98)
        if classifier_type == 0:
            # classifier_type 0 is not in classifier_list, so this branch is effectively unused
            value_result = commentCls.predict_value(test_x)
            proba_result = commentCls.predict_proba(test_x)
            print(classifier_type, vector_type)
            print('classification report')
            print(metrics.classification_report(test_y, value_result, labels=[-1, 0, 1]))
            print('confusion matrix')
            print(metrics.confusion_matrix(test_y, value_result, labels=[-1, 0, 1]))
        else:
            value_result = commentCls.predict_value(test_x)
            print(classifier_type, vector_type)
            print('classification report')
            print(metrics.classification_report(test_y, value_result, labels=[-1, 0, 1]))
            print('confusion matrix')
            print(metrics.confusion_matrix(test_y, value_result, labels=[-1, 0, 1]))

The output is as follows:

1 0
classification report
             precision    recall  f1-score   support

         -1       0.68      0.62      0.65       519
          0       0.55      0.49      0.52       485
          1       0.75      0.86      0.80       634

avg / total       0.67      0.68      0.67      1638

confusion matrix
[[324 130  65]
 [131 236 118]
 [ 24  64 546]]

1 1
classification report
             precision    recall  f1-score   support

         -1       0.71      0.74      0.72       519
          0       0.58      0.54      0.56       485
          1       0.84      0.85      0.85       634

avg / total       0.72      0.72      0.72      1638

confusion matrix
[[385 109  25]
 [145 263  77]
 [ 15  80 539]]

1 2
classification report
             precision    recall  f1-score   support

         -1       0.70      0.74      0.72       519
          0       0.58      0.52      0.55       485
          1       0.84      0.86      0.85       634

avg / total       0.72      0.72      0.72      1638

confusion matrix
[[386 106  27]
 [151 254  80]
 [ 14  76 544]]

2 0
classification report
             precision    recall  f1-score   support

         -1       0.70      0.62      0.66       519
          0       0.56      0.51      0.54       485
          1       0.76      0.88      0.82       634

avg / total       0.68      0.69      0.68      1638

confusion matrix
[[320 135  64]
 [122 248 115]
 [ 16  57 561]]

2 1
classification report
             precision    recall  f1-score   support

         -1       0.69      0.73      0.71       519
          0       0.61      0.48      0.54       485
          1       0.81      0.91      0.86       634

avg / total       0.71      0.73      0.72      1638

confusion matrix
[[377 108  34]
 [154 233  98]
 [ 12  44 578]]

2 2
classification report
             precision    recall  f1-score   support

         -1       0.70      0.74      0.72       519
          0       0.61      0.50      0.55       485
          1       0.83      0.91      0.87       634

avg / total       0.72      0.73      0.73      1638

confusion matrix
[[383 108  28]
 [154 241  90]
 [ 13  43 578]]

3 0
classification report
             precision    recall  f1-score   support

         -1       0.69      0.69      0.69       519
          0       0.58      0.47      0.52       485
          1       0.79      0.90      0.84       634

avg / total       0.70      0.71      0.70      1638

confusion matrix
[[359 118  42]
 [148 228 109]
 [ 14  47 573]]

3 1
classification report
             precision    recall  f1-score   support

         -1       0.70      0.74      0.72       519
          0       0.60      0.49      0.54       485
          1       0.81      0.88      0.84       634

avg / total       0.71      0.72      0.71      1638

confusion matrix
[[386  96  37]
 [152 240  93]
 [ 13  66 555]]

3 2
classification report
             precision    recall  f1-score   support

         -1       0.65      0.75      0.69       519
          0       0.63      0.49      0.55       485
          1       0.83      0.86      0.85       634

avg / total       0.71      0.72      0.71      1638

confusion matrix
[[389  98  32]
 [169 236  80]
 [ 45  41 548]]

Judging by these results, the combination of tf-idf vectorization with the LinearSVC model works best, with an f1-score of 0.73.

The confusion matrices show that most misclassifications involve the neutral and negative classes. We can therefore drop the neutral reviews from the original dataset. The code is as follows:

data_bi = data_cutted[data_cutted['Class'] != 0]
data_bi['Class'].value_counts()

The result:

 1    3042
-1    2658
Name: Class, dtype: int64

Run the classification models again and inspect the results:

1 0
classification report
             precision    recall  f1-score   support

         -1       0.90      0.79      0.84       537
          1       0.83      0.92      0.87       603

avg / total       0.86      0.86      0.86      1140

confusion matrix
[[425 112]
 [ 48 555]]

1 1
classification report
             precision    recall  f1-score   support

         -1       0.87      0.92      0.90       537
          1       0.93      0.88      0.90       603

avg / total       0.90      0.90      0.90      1140

confusion matrix
[[496  41]
 [ 71 532]]

1 2
classification report
             precision    recall  f1-score   support

         -1       0.88      0.93      0.90       537
          1       0.93      0.88      0.91       603

avg / total       0.90      0.90      0.90      1140

confusion matrix
[[497  40]
 [ 70 533]]

2 0
classification report
             precision    recall  f1-score   support

         -1       0.90      0.80      0.85       537
          1       0.84      0.92      0.88       603

avg / total       0.87      0.86      0.86      1140

confusion matrix
[[431 106]
 [ 48 555]]

2 1
classification report
             precision    recall  f1-score   support

         -1       0.92      0.91      0.91       537
          1       0.92      0.93      0.92       603

avg / total       0.92      0.92      0.92      1140

confusion matrix
[[486  51]
 [ 43 560]]

2 2
classification report
             precision    recall  f1-score   support

         -1       0.93      0.91      0.92       537
          1       0.92      0.94      0.93       603

avg / total       0.92      0.92      0.92      1140

confusion matrix
[[488  49]
 [ 39 564]]

3 0
classification report
             precision    recall  f1-score   support

         -1       0.92      0.82      0.87       537
          1       0.86      0.94      0.90       603

avg / total       0.89      0.88      0.88      1140

confusion matrix
[[443  94]
 [ 38 565]]

3 1
classification report
             precision    recall  f1-score   support

         -1       0.92      0.91      0.91       537
          1       0.92      0.93      0.92       603

avg / total       0.92      0.92      0.92      1140

confusion matrix
[[486  51]
 [ 41 562]]

3 2
classification report
             precision    recall  f1-score   support

         -1       0.88      0.93      0.90       537
          1       0.93      0.89      0.91       603

avg / total       0.91      0.91      0.91      1140

confusion matrix
[[497  40]
 [ 67 536]]

After removing the neutral reviews, every combination of vectorizer and classifier improves markedly. This also shows that the classification models can effectively separate positive reviews from the rest.

The dataset also suffers from inaccurate labels, concentrated in the neutral class. People generally leave a positive review unless something went wrong; a neutral rating usually signals dissatisfaction, so the text of a neutral review tends to express negative sentiment. Reviews are also highly subjective: many reviews a reader would judge negative are labeled neutral in the dataset. Splitting reviews into positive, neutral, and negative is therefore not entirely objective, and the boundary between neutral and negative is blurry, which makes the recognition rate hard to improve.

5. Unsupervised Classification Model Based on doc2vec

The open-source text vectorization tool word2vec can learn deeper feature representations of text. Word vectors support arithmetic:

w2v(woman) - w2v(man) + w2v(king) ≈ w2v(queen)
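With gensim's word2vec API this analogy can be queried directly. A sketch, assuming an already-trained model w2v_model (a placeholder name, not defined in this article):

# king - man + woman should land near "queen" for a well-trained English model.
# w2v_model is an assumed, pre-trained gensim Word2Vec model.
result = w2v_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(result)  # typically something like [('queen', 0.7...)]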

doc2vec, built on word2vec, represents each document as a single vector, and the similarity of two documents can be measured by cosine distance. We can therefore compute the distance from any review to an extremely positive review, and the distance from that review to an extremely negative one.
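A minimal sketch of the scoring idea, assuming the Doc2Vec model trained in the code later in this section, where the tags TRAIN_0 and TRAIN_1 happen to be the extremely positive and extremely negative anchor reviews:

# Score a review by how much closer it sits to the positive anchor than to the
# negative one; model and the TRAIN_* tags come from the training code below.
def polarity_score(model, tag):
    pos_sim = model.docvecs.similarity("TRAIN_0", tag)  # cosine similarity to the positive anchor
    neg_sim = model.docvecs.similarity("TRAIN_1", tag)  # cosine similarity to the negative anchor
    return pos_sim - neg_sim  # a positive score suggests positive sentiment

print(polarity_score(model, "TRAIN_42"))  # hypothetical review tag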

In this case study's dataset, the two anchor reviews are:

  • Positive anchor: 快 就是 手感 滿意 也好 喜歡 也 流暢 很 服務態度 實用 超快 挺快 用著 速度 禮品 也不錯 非常好 挺好 感覺 才來 還行 好看 也快 不錯的 送了 非常不錯 超級 贊 好多東西 很實用 各方面 挺好的 很多 漂亮 配件 還不錯 也多 特意 慢 滿分 好用 非常漂亮......
  • Negative anchor: 不多說 上當 差差 剛用 服務差 一點也不 不要 簡直 還是去 實體店 大家 保證 不肯 生氣 開發票 磨損 后悔 印記 網 什么破 爛爛 左邊 失效 太 騙 掉價 走下坡路 不說了 徹底 三星手機 自營 幾次 真心 別的 看完 簡單說 機會 這是 生氣了 觸動 縫隙 沖動了 失望......

We implement the doc2vec model with the third-party library gensim.

The implementation is as follows:

import pandas as pd
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

train_x = data_bi['Comment'].ravel()
train_y = data_bi['Class'].ravel()

# Tag every document in train_x with the label prefix "TRAIN"
def labelizeReviews(reviews, label_type):
    labelized = []
    for i, v in enumerate(reviews):
        label = '%s_%s' % (label_type, i)
        labelized.append(TaggedDocument(v.split(" "), [label]))
    return labelized

train_x = labelizeReviews(train_x, "TRAIN")

# Build the Doc2Vec model
size = 300
all_data = []
all_data.extend(train_x)
model = Doc2Vec(min_count=1, window=8, size=size, sample=1e-4, negative=5, hs=0, iter=5, workers=8)
model.build_vocab(all_data)

# Train for 10 epochs
for epoch in range(10):
    model.train(train_x)

# For every review, compute the cosine similarity to the extremely positive
# anchor review (stored in pos) and to the extremely negative anchor (stored in neg)
pos = []
neg = []
for i in range(0, len(train_x)):
    pos.append(model.docvecs.similarity("TRAIN_0", "TRAIN_{}".format(i)))
    neg.append(model.docvecs.similarity("TRAIN_1", "TRAIN_{}".format(i)))

# Write pos and neg back to the data as the fields PosSim and NegSim
data_bi[u'PosSim'] = pos
data_bi[u'NegSim'] = neg

The training log looks like this:

2017-05-27 14:30:28,393 : INFO : collecting all words and their counts
2017-05-27 14:30:28,394 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2017-05-27 14:30:28,593 : INFO : collected 10545 word types and 5700 unique tags from a corpus of 5700 examples and 482148 words
2017-05-27 14:30:28,595 : INFO : Loading a fresh vocabulary
2017-05-27 14:30:28,649 : INFO : min_count=1 retains 10545 unique words (100% of original 10545, drops 0)
2017-05-27 14:30:28,650 : INFO : min_count=1 leaves 482148 word corpus (100% of original 482148, drops 0)
2017-05-27 14:30:28,705 : INFO : deleting the raw counts dictionary of 10545 items
2017-05-27 14:30:28,706 : INFO : sample=0.0001 downsamples 217 most-common words
2017-05-27 14:30:28,707 : INFO : downsampling leaves estimated 108356 word corpus (22.5% of prior 482148)
2017-05-27 14:30:28,709 : INFO : estimated required memory for 10545 words and 300 dimensions: 38560500 bytes
2017-05-27 14:30:28,784 : INFO : resetting layer weights
2017-05-27 14:30:29,120 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:29,121 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:30,176 : INFO : PROGRESS: at 10.24% examples, 72316 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:31,211 : INFO : PROGRESS: at 29.96% examples, 91057 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:32,218 : INFO : PROGRESS: at 66.30% examples, 126742 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:33,231 : INFO : PROGRESS: at 86.00% examples, 122698 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:33,571 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:33,573 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:33,605 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:33,647 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:33,678 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:33,696 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:33,711 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:33,722 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:33,724 : INFO : training on 2410740 raw words (570332 effective words) took 4.6s, 124032 effective words/s
2017-05-27 14:30:33,727 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:33,731 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:34,753 : INFO : PROGRESS: at 36.38% examples, 212225 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:35,762 : INFO : PROGRESS: at 75.24% examples, 216859 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:36,243 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:36,244 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:36,264 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:36,306 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:36,311 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:36,320 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:36,330 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:36,336 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:36,338 : INFO : training on 2410740 raw words (570008 effective words) took 2.6s, 219523 effective words/s
2017-05-27 14:30:36,339 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:36,341 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:37,353 : INFO : PROGRESS: at 28.23% examples, 177496 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:38,372 : INFO : PROGRESS: at 66.30% examples, 193880 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:39,061 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:39,062 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:39,074 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:39,115 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:39,122 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:39,132 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:39,147 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:39,154 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:39,155 : INFO : training on 2410740 raw words (570746 effective words) took 2.8s, 203312 effective words/s
2017-05-27 14:30:39,158 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:39,159 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:40,168 : INFO : PROGRESS: at 37.74% examples, 222816 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:41,177 : INFO : PROGRESS: at 77.55% examples, 223202 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:41,605 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:41,610 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:41,614 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:41,645 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:41,670 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:41,674 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:41,682 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:41,690 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:41,692 : INFO : training on 2410740 raw words (569889 effective words) took 2.5s, 225457 effective words/s
2017-05-27 14:30:41,694 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:41,696 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:42,712 : INFO : PROGRESS: at 29.16% examples, 183182 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:43,754 : INFO : PROGRESS: at 69.96% examples, 203560 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:44,804 : INFO : PROGRESS: at 91.97% examples, 173787 words/s, in_qsize 14, out_qsize 0
2017-05-27 14:30:44,973 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:44,989 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:45,028 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:45,061 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:45,097 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:45,101 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:45,121 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:45,125 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:45,128 : INFO : training on 2410740 raw words (569903 effective words) took 3.4s, 166370 effective words/s
2017-05-27 14:30:45,131 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:45,132 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:46,152 : INFO : PROGRESS: at 11.26% examples, 79348 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:47,153 : INFO : PROGRESS: at 27.52% examples, 85992 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:48,166 : INFO : PROGRESS: at 66.47% examples, 130273 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:49,061 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:49,076 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:49,088 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:49,123 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:49,144 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:49,147 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:49,152 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:49,159 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:49,160 : INFO : training on 2410740 raw words (570333 effective words) took 4.0s, 141860 effective words/s
2017-05-27 14:30:49,161 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:49,163 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:50,185 : INFO : PROGRESS: at 31.78% examples, 193530 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:51,244 : INFO : PROGRESS: at 48.51% examples, 141817 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:52,278 : INFO : PROGRESS: at 69.96% examples, 134399 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:30:52,918 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:52,936 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:52,945 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:52,976 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:52,979 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:52,984 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:52,995 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:52,998 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:52,999 : INFO : training on 2410740 raw words (570031 effective words) took 3.8s, 148864 effective words/s
2017-05-27 14:30:53,000 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:53,002 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:54,024 : INFO : PROGRESS: at 34.48% examples, 202424 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:55,035 : INFO : PROGRESS: at 68.58% examples, 201499 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:56,010 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:56,017 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:56,048 : INFO : PROGRESS: at 96.89% examples, 183861 words/s, in_qsize 5, out_qsize 1
2017-05-27 14:30:56,049 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:56,071 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:56,084 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:56,099 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:56,101 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:56,104 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:56,104 : INFO : training on 2410740 raw words (570328 effective words) took 3.1s, 184129 effective words/s
2017-05-27 14:30:56,105 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:56,107 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:57,134 : INFO : PROGRESS: at 33.13% examples, 197730 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:58,140 : INFO : PROGRESS: at 69.96% examples, 206423 words/s, in_qsize 15, out_qsize 0
2017-05-27 14:30:58,876 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:30:58,883 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:30:58,889 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:30:58,937 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:30:58,949 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:30:58,953 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:30:58,960 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:30:58,967 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:30:58,968 : INFO : training on 2410740 raw words (570312 effective words) took 2.9s, 199922 effective words/s
2017-05-27 14:30:58,969 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
2017-05-27 14:30:58,970 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey
2017-05-27 14:30:59,991 : INFO : PROGRESS: at 32.86% examples, 198045 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:31:00,993 : INFO : PROGRESS: at 68.23% examples, 201443 words/s, in_qsize 16, out_qsize 0
2017-05-27 14:31:01,881 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-05-27 14:31:01,888 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-05-27 14:31:01,907 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-05-27 14:31:01,922 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-05-27 14:31:01,941 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-05-27 14:31:01,948 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-05-27 14:31:01,955 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-05-27 14:31:01,961 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-05-27 14:31:01,962 : INFO : training on 2410740 raw words (570826 effective words) took 3.0s, 191072 effective words/s

Finally, visualize the classification result:

from matplotlib import pyplot as plt

label = data_bi['Class'].ravel()
values = data_bi[['PosSim', 'NegSim']].values

plt.scatter(values[:, 0], values[:, 1], c=label, alpha=0.4)
plt.show()

The result:

The scatter plot shows that positive and negative reviews can basically be separated by a straight line (blue for negative, red for positive).
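Since the two classes are nearly linearly separable in this 2-D space, even a crude rule, calling a review positive when it is closer to the positive anchor, gives a usable classifier. A sketch (no accuracy figure is claimed here for the real data):

import numpy as np

# Classify each review by whichever anchor it is more similar to.
pred = np.where(data_bi['PosSim'].values > data_bi['NegSim'].values, 1, -1)
accuracy = (pred == data_bi['Class'].values).mean()
print("simple-rule accuracy: {:.3f}".format(accuracy))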

This approach differs completely from the traditional pipeline: it uses no word-frequency features or sentiment lexicons. Its advantages are:

  • It maps the dataset into an extremely low-dimensional space: just two dimensions.
  • It is an unsupervised learning method, so the original training data needs no labeling.
  • It generalizes to other domains: simply pick an extremely positive and an extremely negative example in that domain, convert them and all the texts to be classified into vectors with doc2vec, and compute the distances.
