

[Python-ML] Text Mining on a Movie Review Dataset

Published: 2025/4/16
This post, collected and organized by 生活随笔, covers text mining on a movie review dataset with Python; it is shared here for reference.
# -*- coding: utf-8 -*-
'''
Created on 2018-01-22
@author: Jason.F
@summary: Text mining: extract content from movie reviews, vectorize the features,
          and train a model to predict sentiment.
Movie review data: http://ai.stanford.edu/~amaas/data/sentiment/
'''
import os
import re
import time

import nltk
import numpy as np
import pandas as pd
import pyprind
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.pipeline import Pipeline

start = time.perf_counter()  # time.clock() was removed in Python 3.8
homedir = os.getcwd()  # current working directory

# Step 1: load the raw reviews and write them to movie_data.csv
# (kept disabled inside a string; run it once to build the CSV)
'''
pbar = pyprind.ProgBar(50000)
labels = {'pos': 1, 'neg': 0}  # positive and negative review labels
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = homedir + '/aclImdb/%s/%s' % (s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])  # DataFrame.append was removed in pandas 2.0
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))  # shuffle so positive and negative samples are mixed
df.to_csv(homedir + '/movie_data.csv', index=False)
'''

# Step 2: clean the text and prepare the tokenizers used for feature vectorization
df = pd.read_csv(homedir + '/movie_data.csv')

def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  # remove HTML tags, including everything between < and >
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)  # collect emoticons before punctuation is stripped
    # Lowercase, replace runs of non-word characters with a space, then append the
    # emoticons without the '-' nose, space-separated so each stays its own token.
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text

# print(preprocessor(df.loc[0, 'review'][-50:]))  # last 50 characters of the first review
# print(preprocessor("</a>This :) is :( a test :-)!"))
df['review'] = df['review'].apply(preprocessor)

def tokenizer(text):  # split the text into words
    return text.split()

porter = PorterStemmer()

def tokenizer_porter(text):  # split the text into words and reduce each word to its stem
    return [porter.stem(word) for word in text.split()]

# Stop-word removal: stop words are common words that carry little discriminative information.
nltk.download('stopwords')
stop = stopwords.words('english')  # English stop-word list
# print([w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop])

# Step 3: train the model
# .loc slicing is label-based and inclusive, so stop at 24999 to keep row 25000 out of the training set.
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)
param_grid = [
    # tf-idf weighted features
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': [stop, None],
     'vect__tokenizer': [tokenizer, tokenizer_porter],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.1, 100.0]},
    # raw term frequencies (no idf weighting, no normalization)
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': [stop, None],
     'vect__tokenizer': [tokenizer, tokenizer_porter],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.1, 100.0]},
]
# solver='liblinear' supports both the 'l1' and 'l2' penalties searched above.
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)
print('Best parameter set: %s' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))
end = time.perf_counter()
print('finish all in %s' % str(end - start))
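The cleaning step in the script is easiest to follow on a single string. Below is a minimal, standalone sketch of the same `preprocessor` idea using only the standard library. It takes the test sentence from the commented-out `print` call in the script. One assumption is made here: the emoticons are joined with single spaces so that each one remains a separate token, which may differ from how the script itself concatenates them.

```python
import re

def preprocessor(text):
    # Strip HTML tags such as <br /> or </a>.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons like :) ;-( =D before punctuation is removed.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase and collapse every run of non-word characters into one space.
    text = re.sub(r'[\W]+', ' ', text.lower())
    # Append the emoticons without the '-' nose, space-separated (assumption).
    return text + ' '.join(emoticons).replace('-', '')

cleaned = preprocessor("</a>This :) is :( a test :-)!")
print(cleaned)  # this is a test :) :( :)
```

Because the emoticons are appended at the end, a whitespace tokenizer such as `text.split()` sees them as ordinary tokens alongside the words.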

Results:

