[NLP] Text Sentiment Analysis

Published: 2023/12/20


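It got too late last night before the code finished running, and on top of that I could not reproduce the PSO-LSTM accuracy, which was miserable /(ㄒoㄒ)/. The details are filled in today.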

Text Sentiment Analysis

- 1. Introduction to sentiment analysis
- 2. The text and the corpus
- 3. Dataset analysis
- 4. The LSTM model
- 5. Key functions explained
  - plot_model
  - np_utils.to_categorical
  - model.summary()
- Special thanks

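1. Introduction to sentiment analysis

Sentiment analysis is the computational study of people's opinions, sentiments, emotions, appraisals, and attitudes toward products, services, organizations, individuals, issues, events, topics, and their attributes. Text sentiment analysis is a common application of natural language processing (NLP) and an interesting basic task, especially as a classification problem whose goal is to extract the emotional content of a text. It is the process of analyzing, processing, summarizing, and reasoning about subjective text that carries emotional coloring.
This article covers sentiment polarity (orientation) analysis, which means judging whether a text is positive, negative, or neutral. In most application scenarios only two classes are used. For example, the words "like" and "loathe" belong to opposite sentiment orientations.
The article explains in detail how to preprocess the text data and how to use an LSTM, a deep learning model, to perform the sentiment analysis.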

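2. The text and the corpus

This project uses the reviews of one product on an e-commerce website as its corpus (corpus.csv). The dataset contains 4310 reviews, and each review has one of two sentiment labels: "positive" or "negative". The first few rows of the dataset are as follows:

[Figure: first rows of the dataset]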

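3. Dataset analysis

Sentiment distribution in the dataset
Distribution of review sentence lengths in the dataset

The following code computes the sentiment distribution and the distribution of review sentence lengths: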

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import font_manager
from itertools import accumulate

# Font used for matplotlib text (SimHei can render CJK characters; Windows path)
my_font = font_manager.FontProperties(fname=r"C:\Windows\Fonts\simhei.ttf")

# Count sentence lengths and how often each length occurs
df = pd.read_csv('data/data_single.csv')
print(df.groupby('label')['label'].count())

df['length'] = df['evaluation'].apply(len)
len_df = df.groupby('length').count()
sent_length = len_df.index.tolist()
sent_freq = len_df['evaluation'].tolist()

# Bar chart of sentence length against frequency
plt.bar(sent_length, sent_freq)
plt.title('Sentence length and frequency', fontproperties=my_font)
plt.xlabel('Sentence length', fontproperties=my_font)
plt.ylabel('Frequency of sentence length', fontproperties=my_font)
plt.show()
plt.close()

# Cumulative distribution function (CDF) of sentence length
total = sum(sent_freq)
sent_pentage_list = [count / total for count in accumulate(sent_freq)]

# Plot the CDF
plt.plot(sent_length, sent_pentage_list)

# Find the sentence length at the chosen quantile
quantile = 0.91
print(sent_pentage_list)
index = sent_length[-1]  # fallback so that index is always defined
for length, per in zip(sent_length, sent_pentage_list):
    # first length whose rounded cumulative frequency reaches the quantile
    if round(per, 2) >= quantile:
        index = length
        break
print('\nSentence length at quantile %s: %d.' % (quantile, index))

plt.show()
plt.close()

# CDF plot with the quantile marked by dashed lines
plt.plot(sent_length, sent_pentage_list)
plt.hlines(quantile, 0, index, colors='c', linestyles='dashed')
plt.vlines(index, 0, quantile, colors='c', linestyles='dashed')
plt.text(0, quantile, str(quantile))
plt.text(index, 0, str(index))
plt.title('Cumulative distribution of sentence length', fontproperties=my_font)
plt.xlabel('Sentence length', fontproperties=my_font)
plt.ylabel('Cumulative frequency of sentence length', fontproperties=my_font)
plt.show()
plt.close()

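The output is as follows:

[Figure: printed label counts]

The chart of sentence length against frequency:

[Figure: sentence length and frequency]

The cumulative distribution of sentence length:

[Figure: cumulative distribution of sentence length]

The figures show that most samples have a sentence length between 1 and 200, and that the 0.91 quantile of the cumulative frequency corresponds to a length of about 183.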
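The quantile lookup can also be checked without plotting. The following is a minimal standard-library sketch, assuming the cutoff is defined as the first length whose cumulative frequency reaches the quantile. The function name `length_at_quantile` and the sample lengths are made up for illustration and are not taken from the corpus.

```python
from collections import Counter
from itertools import accumulate


def length_at_quantile(lengths, quantile):
    """Return the smallest length whose cumulative frequency reaches `quantile`."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    freq = Counter(lengths)
    sorted_lengths = sorted(freq)
    total = len(lengths)
    cumulative = accumulate(freq[n] for n in sorted_lengths)
    for n, running in zip(sorted_lengths, cumulative):
        if running / total >= quantile:
            return n
    return sorted_lengths[-1]


# Toy data: ten made-up review lengths (not from the real dataset).
sample_lengths = [5, 8, 8, 12, 15, 15, 15, 30, 60, 200]
cutoff = length_at_quantile(sample_lengths, 0.9)
print(cutoff)
```

A cutoff chosen this way is a natural candidate for the padding length used by the model, since it covers most reviews without padding everything to the longest one.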

4. The LSTM model

The model architecture is as follows:

[Figure: model architecture]

The code:

import pickle
import numpy as np
import pandas as pd
from keras.utils import np_utils
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers import LSTM, Dense, Embedding, Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# Load the dataset
# ['evaluation'] is the feature, ['label'] is the label
def load_data(filepath, input_shape=20):
    df = pd.read_csv(filepath)

    # Labels and vocabulary
    labels, vocabulary = list(df['label'].unique()), list(df['evaluation'].unique())

    # Build character-level features
    string = ''
    for word in vocabulary:
        string += word
    vocabulary = set(string)

    # Lookup dictionaries (index 0 is reserved for padding)
    word_dictionary = {word: i + 1 for i, word in enumerate(vocabulary)}
    with open('word_dict.pk', 'wb') as f:
        pickle.dump(word_dictionary, f)
    inverse_word_dictionary = {i + 1: word for i, word in enumerate(vocabulary)}
    label_dictionary = {label: i for i, label in enumerate(labels)}
    with open('label_dict.pk', 'wb') as f:
        pickle.dump(label_dictionary, f)
    output_dictionary = {i: label for i, label in enumerate(labels)}

    vocab_size = len(word_dictionary.keys())   # vocabulary size
    label_size = len(label_dictionary.keys())  # number of label classes

    # Pad every sequence to input_shape; shorter ones are filled with 0
    x = [[word_dictionary[word] for word in sent] for sent in df['evaluation']]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
    y = [[label_dictionary[sent]] for sent in df['label']]

    # np_utils.to_categorical converts labels into a binary matrix of shape
    # (nb_samples, nb_classes). Suppose num_classes = 10.
    # Then [1, 2, 3, ..., 4] becomes:
    # [[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    #  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    #  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
    #  ...
    #  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]]
    y = [np_utils.to_categorical(label, num_classes=label_size) for label in y]
    y = np.array([list(_[0]) for _ in y])

    return x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary


# Build the deep learning model: Embedding + LSTM + Softmax
def create_LSTM(n_units, input_shape, output_dim, filepath):
    # Pass input_shape through so the data is padded to the same length as in training
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = load_data(filepath, input_shape)
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size + 1, output_dim=output_dim,
                        input_length=input_shape, mask_zero=True))
    # The LSTM takes its input shape from the Embedding layer, so none is given here
    model.add(LSTM(n_units))
    model.add(Dropout(0.2))
    model.add(Dense(label_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    # error: ImportError: ('You must install pydot (`pip install pydot`) and install graphviz
    # (see instructions at https://graphviz.gitlab.io/download/) ',
    # 'for plot_model/model_to_dot to work.')
    # Version issue: from keras.utils.vis_utils import plot_model
    # What actually fixed it: https://www.pianshen.com/article/6746984081/
    plot_model(model, to_file='./model_lstm.png', show_shapes=True)

    # Print the model information
    model.summary()

    return model


# Train the model
def model_train(input_shape, filepath, model_save_path):
    # Split the data into training and test sets at a 9:1 ratio
    # input_shape=100
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = load_data(filepath, input_shape)
    train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.1, random_state=42)

    # Model hyperparameters; adjust them to your own needs
    n_units = 100
    batch_size = 32
    epochs = 5
    output_dim = 20

    # Training
    lstm_model = create_LSTM(n_units, input_shape, output_dim, filepath)
    lstm_model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=1)

    # Save the model
    lstm_model.save(model_save_path)

    # Number of test samples
    N = test_x.shape[0]
    predict = []
    label = []
    for start, end in zip(range(0, N, 1), range(1, N + 1, 1)):
        print(f'start:{start}, end:{end}')
        sentence = [inverse_word_dictionary[i] for i in test_x[start] if i != 0]
        y_predict = lstm_model.predict(test_x[start:end])
        print('y_predict:', y_predict)
        label_predict = output_dictionary[np.argmax(y_predict[0])]
        label_true = output_dictionary[np.argmax(test_y[start:end])]
        print(f'label_predict:{label_predict}, label_true:{label_true}')
        # Print the prediction
        print(''.join(sentence), label_true, label_predict)
        predict.append(label_predict)
        label.append(label_true)

    # Prediction accuracy (true labels first, predictions second)
    acc = accuracy_score(label, predict)
    print('Accuracy of the model on the test set: %s' % acc)


if __name__ == '__main__':
    filepath = 'data/data_single.csv'
    input_shape = 180
    model_save_path = 'data/corpus_model.h5'
    model_train(input_shape, filepath, model_save_path)
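To make the preprocessing step easier to follow, here is a small pure-Python sketch of what the character-level encoding and padding do. It is not the Keras implementation: `build_char_index` and `encode_and_pad` are hypothetical helpers, and the characters are sorted only to make the mapping reproducible (the script above iterates over a set). The truncation rule mirrors the pad_sequences defaults as I understand them, where padding='post' appends zeros but over-long sequences are still cut from the front (truncating='pre').

```python
def build_char_index(texts):
    """Map each distinct character to an integer id starting at 1 (0 is padding)."""
    chars = sorted(set("".join(texts)))  # sorted for a reproducible mapping
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode_and_pad(text, char_index, maxlen, pad_value=0):
    """Encode `text` as character ids, then force the result to length `maxlen`."""
    ids = [char_index[ch] for ch in text]
    if len(ids) > maxlen:
        ids = ids[-maxlen:]  # keep the last `maxlen` ids, dropping the front
    return ids + [pad_value] * (maxlen - len(ids))  # pad at the end


texts = ["good", "bad!"]
char_index = build_char_index(texts)
short_ids = encode_and_pad("bad", char_index, maxlen=5)
long_ids = encode_and_pad("goodbad!", char_index, maxlen=5)
print(char_index)
print(short_ids)
print(long_ids)
```

With input_shape=180 this means a review longer than 180 characters keeps only its last 180 characters, which is worth knowing when the decisive words sit at the start of a review.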

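5. Key functions explained

plot_model

If from keras.utils import plot_model raises an error, change it to from keras.utils.vis_utils import plot_model.
Even after that change I still got an error: ImportError: ('You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) ', 'for plot_model/model_to_dot to work.')
The fix was as follows:

(1) pip install pydot_ng
(2) pip install graphviz. I recommend not installing this one with pip directly; download it from the official website instead. I downloaded the following version:

[Figure: the downloaded Graphviz version]

After extracting it, put it into the site-packages directory of the corresponding Anaconda environment, then copy the path of its bin directory.
(3) Edit site-packages\pydot_ng\__init__.py and, under Method 3, add: path = r"D:\App\tech\Anaconda3\envs\nlp\Lib\site-packages\Graphviz\bin" (this is the path copied in the previous step), as shown:

[Figure: the edited pydot_ng source]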

np_utils.to_categorical

np_utils.to_categorical converts labels into a binary (one-hot) matrix of shape (nb_samples, nb_classes).
Suppose num_classes = 10.
Then [1, 2, 3, ..., 4] is converted to:
[[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
...
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]]
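The conversion itself is simple enough to sketch in plain Python. The `one_hot` function below is a hypothetical stand-in that illustrates the idea, not the Keras implementation; Keras returns a NumPy array rather than nested lists.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a single 1 at the label's index."""
    rows = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside [0, {num_classes})")
        row = [0] * num_classes
        row[label] = 1
        rows.append(row)
    return rows


encoded = one_hot([1, 2, 3, 4], num_classes=10)
for row in encoded:
    print(row)
```

Each row has exactly num_classes entries, so in the two-class sentiment task every label becomes a vector of length 2, which matches the two-unit softmax output layer.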

model.summary()

model.summary() prints the parameters of each layer of the model, as shown:

[Figure: output of model.summary()]
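The parameter counts in such a summary can be reproduced by hand, which is a useful sanity check. The sketch below uses the standard formulas for Embedding, LSTM (with bias), and Dense layers together with the hyperparameters from the training code (output_dim=20, n_units=100, two labels). The vocabulary size of 2000 is a made-up value, since the real one depends on the corpus, and all function names are hypothetical.

```python
def embedding_params(input_dim, output_dim):
    """One output_dim-sized vector per index."""
    return input_dim * output_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


def count_model_params(vocab_size, embed_dim, n_units, n_labels):
    return {
        "embedding": embedding_params(vocab_size + 1, embed_dim),  # +1 for padding index 0
        "lstm": lstm_params(embed_dim, n_units),
        "dropout": 0,  # dropout has no trainable parameters
        "dense": dense_params(n_units, n_labels),
    }


# vocab_size=2000 is an assumed value for illustration only.
counts = count_model_params(vocab_size=2000, embed_dim=20, n_units=100, n_labels=2)
for layer, n in counts.items():
    print(layer, n)
print("total", sum(counts.values()))
```

Only the embedding count changes with the corpus; the LSTM and Dense counts are fixed by the hyperparameters, so they should match the summary regardless of the data.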

Special thanks

This article draws on the blog of 农夫三拳有点疼 and on a reference link for fixing the error.
