58. An Approach to 2020 MCM/ICM Problem C and My Own Python Solution

Published: 2024/10/8

@Author：Runsen

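This is Problem C from the 2020 MCM/ICM. A friend asked me to work on it back in March, and I came across the files today while cleaning up my disk, so I am writing it down here. This was not my own assignment; my major is chemical engineering, which has nothing to do with any of this.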

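Sunshine Company plans to introduce and sell three new products in the online marketplace: a microwave oven, a baby pacifier, and a hair dryer. They have hired your team as consultants to identify key patterns, relationships, measures, and parameters in past customer-supplied ratings and reviews of competing products, in order to:

inform their online sales strategy;
identify potentially important design features that would enhance product desirability.
Sunshine Company has used data to inform sales strategies in the past, but they have never before used this particular combination and type of data.
Sunshine Company is particularly interested in time-based patterns in these data, and in whether they interact in ways that will help the company craft successful products.

To assist you, Sunshine Company's data center has provided three data files for this project: hair_dryer.tsv, microwave.tsv, and pacifier.tsv. These data represent customer-supplied ratings and reviews for microwave ovens, baby pacifiers, and hair dryers sold in the Amazon marketplace over the time periods indicated in the data. A glossary of data label definitions is provided as well. The data files provided contain the only data you should use.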

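Data set definitions: each row represents data partitioned into the following columns

marketplace (string)	2-letter country code of the marketplace where the review was written.
customer_id (string)	Random identifier that can be used to aggregate reviews written by a single author.
review_id (string)	The unique ID of the review.
product_id (string)	The unique product ID the review pertains to.
product_parent (string)	Random identifier that can be used to aggregate reviews for the same product.
product_title (string)	Title of the product.
product_category (string)	The major consumer category for the product.
star_rating (int)	The 1-5 star rating of the review.
helpful_votes (int)	Number of helpful votes.
total_votes (int)	Number of total votes the review received.
vine (string)	Customers are invited to become Amazon Vine Voices based on the trust they have earned in the Amazon community for writing accurate and insightful reviews. Amazon provides Amazon Vine members with free copies of products that vendors have submitted to the program. Amazon does not influence the opinions of Amazon Vine members, nor does it modify or edit reviews.
verified_purchase (string)	"Y" indicates that Amazon verified that the person writing the review purchased the product on Amazon and did not receive it at a deep discount.
review_headline (string)	The title of the review.
review_body (string)	The review text.
review_date (bigint)	The date the review was written.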
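As a minimal sketch of how files with these columns could be loaded, the snippet below reads tab-separated text with pandas. The in-memory sample rows are made up, only a subset of the columns is shown, and the month/day/year date layout is an assumption about the files, not something stated above. `load_reviews` is a hypothetical helper name.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for hair_dryer.tsv (made-up rows, subset of the columns).
sample_tsv = (
    "marketplace\tcustomer_id\treview_id\tstar_rating\thelpful_votes\ttotal_votes\t"
    "verified_purchase\treview_body\treview_date\n"
    "US\t111\tR1\t5\t2\t3\tY\tDries hair quickly\t8/31/2015\n"
    "US\t222\tR2\t1\t0\t0\tN\tStopped working\t8/30/2015\n"
)


def load_reviews(source):
    """Read a tab-separated review file and parse review_date as a datetime."""
    # Keep identifiers as strings so leading zeros are not lost.
    frame = pd.read_csv(source, sep="\t", dtype={"customer_id": str})
    # Assumption: dates are written month/day/year.
    frame["review_date"] = pd.to_datetime(frame["review_date"])
    return frame


data = load_reviews(io.StringIO(sample_tsv))
print(data.dtypes)
```

With the real files, the same function could be called as `load_reviews('hair_dryer.tsv')`.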

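Requirements:

1. Analyze the three product data sets provided to identify, describe, and support with mathematical evidence meaningful quantitative and/or qualitative patterns, relationships, measures, and parameters within and between star ratings, reviews, and helpfulness ratings that will help Sunshine Company succeed in their three new online marketplace product offerings.

2. Use your analysis to address the following specific questions and requests from the Sunshine Company Marketing Director:

Identify data measures based on ratings and reviews that are most informative for Sunshine Company to track once their three products are on sale in the online marketplace.
Identify and discuss time-based measures and patterns within each data set that might suggest a product's reputation is increasing or decreasing in the online marketplace.
Determine combinations of text-based measures and ratings-based measures that best indicate a potentially successful or failing product.
Do specific star ratings incite more reviews? For example, are customers more likely to write some type of review after seeing a series of low star ratings?
Are specific quality descriptors in text-based reviews, such as "enthusiastic", "disappointed", and others, strongly associated with rating levels?

3. Write a one- to two-page letter to the Marketing Director of Sunshine Company summarizing your team's analysis and results. Include specific justification for the results that your team most confidently recommends to the Marketing Director.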

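Approach

Data cleaning: The data contain duplicates as well as a lot of junk, such as reviews whose star rating and text are wildly at odds (one positive, the other negative), which should be deleted. We also delete reviews whose helpful-vote count and total-vote count are 0, and we keep only reviewers whom Amazon verified as having actually purchased the product. (I did not handle the rating/text mismatch case, because I did not know how to remove those reviews.)
Judging product quality: In each data set, identify and discuss time-based measures and patterns that indicate whether a product's reputation in the online marketplace is rising or falling, so that we can promote the products with the better reputation. (This requires knowing how to build a time-series model in Keras using a look-back window.)
Building a product service system: We use an LDA topic model to extract keywords from the negative reviews, and from these build a more complete product service system.
We need to determine whether customers are more likely to write a certain type of review after seeing a series of low star ratings. To do so we examine whether the star ratings before a point in time are related to the reviews after it, which correlation analysis handles neatly. For example, consider only the relationship between each review and the ratings of the preceding week, extract those samples, preprocess them, and run a correlation analysis in SPSS. (I did not use SPSS.)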
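The correlation idea in the last point was not implemented in this write-up. A minimal sketch of one way it might look in pandas follows, assuming a date-indexed frame with a `star_rating` column; the ratings are made up, and Pearson correlation is my choice here, not something the post prescribes.

```python
import pandas as pd

# Made-up ratings, one row per review; the real data would come from the .tsv files.
reviews = pd.DataFrame(
    {
        "review_date": pd.to_datetime(
            ["2015-01-01", "2015-01-02", "2015-01-03",
             "2015-01-05", "2015-01-11", "2015-01-20"]
        ),
        "star_rating": [5, 4, 5, 1, 2, 3],
    }
).sort_values("review_date").set_index("review_date")

# Mean rating over the 7 days before each review.
# closed="left" keeps the review's own rating out of its window.
prior_week_mean = reviews["star_rating"].rolling("7D", closed="left").mean()

# Reviews with no rating in the preceding week have no pair and are dropped.
paired = pd.DataFrame(
    {"prior_week_mean": prior_week_mean, "star_rating": reviews["star_rating"]}
).dropna()

correlation = paired["prior_week_mean"].corr(paired["star_rating"])
print(paired)
print("Pearson correlation:", round(correlation, 3))
```

A clearly positive value would suggest that ratings tend to follow the recent average; with only a handful of rows, as here, the number itself means little.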

This write-up uses Python: Keras for the time-series model, and an LDA topic model to extract keywords from the reviews.

Data cleaning

1. Remove duplicate data

The data contain duplicate rows and NA values, which we drop.

data.drop_duplicates(inplace=True)
data.dropna(inplace=True)

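2. Remove reviews without a verified purchase

We consider rows with verified_purchase = N not worth studying, because the reviewer did not actually buy the product, so we keep only the data from real customers.

data = data[(data['verified_purchase'] == 'y') | (data['verified_purchase'] == 'Y')]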

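3. Remove unhelpful reviews

We treat a review as unhelpful when helpful_votes or total_votes is 0, so we keep only the rows where both counts are non-zero.

# Drop reviews whose helpful-vote count or total-vote count is 0
data = data[(data['helpful_votes'] != 0) & (data['total_votes'] != 0)]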
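The three cleaning steps so far can be sketched as a single function, which makes them easy to check on toy data. `clean_reviews` is a hypothetical name and the rows below are made up; the vote filter mirrors the filter in the code above (both counts non-zero).

```python
import pandas as pd


def clean_reviews(frame):
    """Drop duplicates and NA rows, unverified purchases, and reviews without votes."""
    cleaned = frame.drop_duplicates().dropna()
    verified = cleaned["verified_purchase"].str.upper() == "Y"
    voted = (cleaned["helpful_votes"] != 0) & (cleaned["total_votes"] != 0)
    return cleaned[verified & voted].reset_index(drop=True)


# Made-up rows covering each case the cleaning steps target.
raw = pd.DataFrame(
    {
        "review_id": ["R1", "R1", "R2", "R3", "R4", "R5"],
        "verified_purchase": ["Y", "Y", "N", "y", "Y", "y"],
        "helpful_votes": [2, 2, 1, 0, 1, 4],
        "total_votes": [3, 3, 1, 0, 2, 5],
        "review_body": ["good", "good", "bad", "ok", None, "fine"],
    }
)

cleaned = clean_reviews(raw)
print(cleaned["review_id"].tolist())
```

Only R1 (once) and R5 survive: the second R1 is a duplicate, R2 is unverified, R3 has no votes, and R4 has a missing review body.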

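4. Separate reviews by rating level

We need to remove the stop words from the reviews.

# Read the stop words, one per line, into a set
stop = set()
with open('stopwords.txt', 'r', encoding='utf-8', errors='ignore') as s:
    for line in s:
        line = line.strip()
        if line:
            stop.add(line)

We use dataList to store the tokenized text of every review and tagList to store the rating label of every review.

import jieba
from nltk.corpus import stopwords

# Build the English stop-word set once instead of once per word
english_stop = set(stopwords.words('english'))

dataList = []
tagList = []
for i in data.values:
    # Column 7 is star_rating: 4-5 -> 1 (positive), 3 -> 2 (neutral), 1-2 -> 3 (negative)
    if int(i[7]) >= 4:
        flag = 1
    elif int(i[7]) >= 3:
        flag = 2
    else:
        flag = 3
    # Tokenize review_body (the second-to-last column)
    wordList = jieba.cut(i[-2], cut_all=True)
    # Remove stop words and empty tokens
    termsAll = list(set([term for term in wordList if term.strip() and term not in stop]))
    filtered = [w for w in termsAll if w not in english_stop]
    dataList.append(filtered)
    tagList.append(str(flag))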

Based on star_rating we split the reviews by level. By convention, 4 and 5 stars are positive, 3 is neutral, and 1 and 2 are negative.

data['tag'] = tagList
pos_data = data[data["tag"] == '1']
sec_data = data[data["tag"] == '2']
neg_data = data[data["tag"] == '3']

def tokenize_reviews(bodies):
    # Tokenize each review_body and drop stop words;
    # returns one tuple of unique terms per review
    reviews = []
    for body in bodies:
        wordList = jieba.cut(body, cut_all=True)
        termsAll = tuple(set([term for term in wordList if term.strip() and term not in stop]))
        reviews.append(termsAll)
    return reviews

pos_review = tokenize_reviews(pos_data.review_body.values)
sec_review = tokenize_reviews(sec_data.review_body.values)
neg_review = tokenize_reviews(neg_data.review_body.values)
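The star-to-label rule can also be written without an explicit loop. The sketch below uses `pd.cut` on made-up ratings; it is an alternative illustration, not the code used in this post, and keeps the same label strings ('1' positive, '2' neutral, '3' negative).

```python
import pandas as pd

# Made-up star ratings covering every level.
stars = pd.Series([1, 2, 3, 4, 5])

# Intervals are right-closed: (0, 2] -> '3', (2, 3] -> '2', (3, 5] -> '1'.
tags = pd.cut(stars, bins=[0, 2, 3, 5], labels=["3", "2", "1"]).astype(str)

print(tags.tolist())
```

On a real frame this would read `data['tag'] = pd.cut(data['star_rating'], ...)`, after which the three subsets can be selected exactly as above.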

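Judging product quality

Time-series forecasting is a relatively hard class of prediction problem. Unlike ordinary regression models, the "sequence dependence" among the input variables adds complexity. A recurrent neural network (RNN) is designed specifically to handle sequence dependence, and LSTM is a kind of recurrent neural network that is widely used in deep learning.

We create a new df to hold the data['star_rating'] ratings.

import pandas as pd

# set_index returns a new frame, so the result has to be assigned back
data = data.set_index(pd.to_datetime(data.review_date))
data.sort_index(inplace=True)
df = pd.DataFrame(index=data.index)
df['star_rating'] = data['star_rating']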

We use a look-back window of 10: the 10 preceding ratings (rows sorted by review date) are used to predict the next customer rating.

import numpy as np

look_back = 10

def create_dataset(dataset):
    # Slide a window of length look_back over the series:
    # X holds look_back consecutive ratings, Y the rating that follows them
    dataX = []
    dataY = []
    for i in range(len(dataset) - look_back - 1):
        x = dataset[i: i + look_back, 0]
        dataX.append(x)
        y = dataset[i + look_back, 0]
        dataY.append(y)
        print('X: %s, Y: %s' % (x, y))
    return np.array(dataX), np.array(dataY)

# Convert the DataFrame to an ndarray
dataset = df.values.astype('float32')
# Split into training and validation sets
train_size = int(len(dataset) * 0.67)
validation_size = len(dataset) - train_size
train, validation = dataset[0: train_size, :], dataset[train_size: len(dataset), :]
X_train, y_train = create_dataset(train)
X_validation, y_validation = create_dataset(validation)
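To see what the windowing produces, here is a small self-contained sketch on a made-up series with a window of 3. `make_windows` is a hypothetical helper; unlike the loop above, which stops one step early, it uses every available window.

```python
import numpy as np


def make_windows(series, look_back):
    """Return (X, y): each row of X holds look_back consecutive values, y the value right after."""
    X, y = [], []
    for start in range(len(series) - look_back):
        X.append(series[start:start + look_back])
        y.append(series[start + look_back])
    return np.array(X), np.array(y)


# Made-up ratings in time order.
ratings = np.array([5, 4, 5, 3, 1, 4], dtype="float32")
X, y = make_windows(ratings, look_back=3)

print(X)
print(y)
```

Six ratings with a window of 3 give three samples: [5, 4, 5] predicts 3, [4, 5, 3] predicts 1, and [5, 3, 1] predicts 4.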

Next we use Keras, with the previous 10 ratings as the input variables and the rating that follows them as the output variable.

import math
import numpy as np
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['KaiTi']
mpl.rcParams['font.serif'] = ['KaiTi']
from matplotlib import pyplot as plt
from keras.models import Sequential
from keras.layers import Dense

def build_model():
    # A small fully connected network: look_back inputs -> 12 -> 8 -> 1 output
    model = Sequential()
    model.add(Dense(units=12, input_dim=look_back, activation='relu'))
    model.add(Dense(units=8, activation='relu'))
    model.add(Dense(units=1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

model = build_model()
model.fit(X_train, y_train, epochs=30, batch_size=2, verbose=1)
train_score = model.evaluate(X_train, y_train, verbose=0)
print('Train Score: %.2f MSE (%.2f RMSE)' % (train_score, math.sqrt(train_score)))
validation_score = model.evaluate(X_validation, y_validation, verbose=0)
print('Validation Score: %.2f MSE (%.2f RMSE)' % (validation_score, math.sqrt(validation_score)))
predict_train = model.predict(X_train)
predict_validation = model.predict(X_validation)

# Build the plot data for the predictions on the training set.
# np.empty_like returns a new uninitialized array with the same shape and dtype as dataset.
predict_train_plot = np.empty_like(dataset)
predict_train_plot[:, :] = np.nan
predict_train_plot[look_back:len(predict_train) + look_back, :] = predict_train
# Build the plot data for the predictions on the validation set
predict_validation_plot = np.empty_like(dataset)
predict_validation_plot[:, :] = np.nan
predict_validation_plot[len(predict_train) + look_back * 2 + 1: len(dataset) - 1, :] = predict_validation
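When reading the RMSE values printed above, a naive baseline gives a useful point of comparison. The sketch below, which is not part of the original solution, predicts each rating as the plain mean of its look-back window and reports the RMSE; the arrays are made up and `window_mean_rmse` is a hypothetical name.

```python
import numpy as np


def window_mean_rmse(X, y):
    """RMSE of a baseline that predicts the mean of each look-back window."""
    predictions = X.mean(axis=1)
    return float(np.sqrt(np.mean((predictions - y) ** 2)))


# Made-up windows and targets with the same layout as X_train / y_train.
X_demo = np.array([[5, 4, 5], [4, 5, 3], [5, 3, 1]], dtype="float32")
y_demo = np.array([3, 1, 4], dtype="float32")

baseline_rmse = window_mean_rmse(X_demo, y_demo)
print("Baseline RMSE: %.4f" % baseline_rmse)
```

If the network's validation RMSE is not clearly below this kind of baseline, the model has probably learned little beyond the average rating.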

Next we draw a scatter plot to look at how the ratings are distributed.

df["year"] = df.index.year
plt.figure(figsize=(20, 10))
plt.scatter(df.year, dataset, color='blue', label="Actual rating")
plt.scatter(df.year, predict_train_plot, color='green', label="Predicted rating (training set)")
plt.scatter(df.year, predict_validation_plot, color='red', label="Predicted rating (validation set)")
plt.legend(fontsize='x-large')
plt.show()

[Figure: predicted rating distribution for the hair_dryer product]

[Figure: predicted rating distribution for the microwave product]

[Figure: predicted rating distribution for the pacifier product]

From the plots, the predictions for all three products currently fall between 3 and 5 stars, and no seriously low ratings are predicted.
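A simpler time-based measure of whether reputation is rising or falling is the monthly mean rating and the slope of a straight line fitted through it. This is a sketch on made-up ratings, offered as a complement to the neural network above rather than something the post computed.

```python
import numpy as np
import pandas as pd

# Made-up ratings indexed by review date.
ratings = pd.Series(
    [5, 4, 4, 3, 3, 2],
    index=pd.to_datetime(
        ["2015-01-05", "2015-01-20", "2015-02-03",
         "2015-02-18", "2015-03-02", "2015-03-25"]
    ),
)

# Mean rating per calendar month ("MS" labels each bin by its month start).
monthly_mean = ratings.resample("MS").mean()

# Slope of a least-squares line through the monthly means, in stars per month.
slope = np.polyfit(np.arange(len(monthly_mean)), monthly_mean.to_numpy(), deg=1)[0]

print(monthly_mean)
print("Trend: %.2f stars per month" % slope)
```

A negative slope, as in this toy series, would point to a declining reputation; months without any review would show up as NaN and need to be dropped or filled first.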

Building a product service system

We use an LDA topic model to extract keywords from the negative reviews, and from these build a more complete product service system.

from gensim import corpora, models

dictionary = corpora.Dictionary(neg_review)
corpus = [dictionary.doc2bow(sentence) for sentence in neg_review]
# num_topics plays the same role as K in K-means
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
# For each topic (ids 0 to 19), print its 10 most representative terms
for topic_id in range(20):
    print(lda.print_topic(topic_id, topn=10))

Across the 20 topics, we look for the keywords that appear repeatedly.
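Finding the keywords that recur across topics can be automated with a counter. The sketch below works on hypothetical word lists; with gensim, each list could be obtained as `[word for word, _ in lda.show_topic(topic_id, topn=10)]`. `recurring_keywords` is an illustrative name.

```python
from collections import Counter

# Hypothetical top words for three topics (stand-ins for real LDA output).
topics = [
    ["product", "hair", "money"],
    ["hair", "months", "work"],
    ["product", "disappointed", "hair"],
]


def recurring_keywords(topic_words, min_topics=2):
    """Return (word, topic_count) pairs for words found in at least min_topics topics."""
    # set(words) makes sure a word is counted at most once per topic.
    counts = Counter(word for words in topic_words for word in set(words))
    return [(word, n) for word, n in counts.most_common() if n >= min_topics]


recurring = recurring_keywords(topics)
print(recurring)
```

Here "hair" appears in all three topics and "product" in two, so those are the two keywords reported.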

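Keywords in the negative reviews of the hair_dryer product:

product, dry, money, problem, disappointed, months, work, hair, bad.

We therefore think the reasons customers gave the hair_dryer product negative reviews may relate to money, or to problems the customer ran into during use.

Keywords in the negative reviews of the microwave product:

product, return, door, service, back, microwave, dangerous, power, issues

We therefore think the reasons customers gave the microwave product negative reviews may be poor service and a feeling that the product was dangerous to use.

Keywords in the negative reviews of the pacifier product:

small, product, pacifiers, disappointed, months, quality, size

We therefore think the reasons customers gave the pacifier product negative reviews may be the product size (customers found it too small) and quality problems.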

In the same way we can extract the keywords from the positive reviews.

from gensim import corpora, models

dictionary = corpora.Dictionary(pos_review)
corpus = [dictionary.doc2bow(sentence) for sentence in pos_review]
# num_topics plays the same role as K in K-means
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
# For each topic (ids 0 to 19), print its 10 most representative terms
for topic_id in range(20):
    print(lda.print_topic(topic_id, topn=10))

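Keywords in the positive reviews of the hair_dryer product:

price, dry, money, quickly, powerful, recommend, heat, hair, easy.

We therefore think the reasons customers gave the hair_dryer product positive reviews may relate to money, or to customers finding the product powerful, quick, and easy to use, to the point that they might recommend it to others.

Keywords in the positive reviews of the microwave product:

microwave, small, great, power, happy, ordered, space, size

We therefore think the reason customers gave the microwave product positive reviews may be that they found it worked well and were happy with it.

Keywords in the positive reviews of the pacifier product:

great, safe, constantly, love, sanitize, automatic, effort, excellent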

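We therefore think the reasons customers gave the pacifier product positive reviews may be that the product is safe and that they were happy with it.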
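One rough way to check whether a descriptor such as "great" or "disappointed" goes with high or low ratings is to average the star rating of the reviews that contain it. This sketch uses made-up reviews and a hypothetical helper `mean_rating_for`; it was not part of the original analysis.

```python
import pandas as pd

# Made-up reviews with the two columns the check needs.
reviews = pd.DataFrame(
    {
        "star_rating": [5, 4, 1, 2, 5],
        "review_body": [
            "Love it, great dryer",
            "Great value",
            "Very disappointed",
            "Disappointed, broke fast",
            "Excellent and safe",
        ],
    }
)


def mean_rating_for(word, frame):
    """Average star rating of the reviews whose text contains word (case-insensitive)."""
    mask = frame["review_body"].str.contains(word, case=False, regex=False)
    return frame.loc[mask, "star_rating"].mean()


great_mean = mean_rating_for("great", reviews)
disappointed_mean = mean_rating_for("disappointed", reviews)
print(great_mean, disappointed_mean)
```

A large gap between the two averages, as in this toy data, would support the reading that these descriptors track the rating level.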

Code and data set: https://pan.baidu.com/s/1vRCVsFZdiQ9Fvz95yNrYcw

Extraction code: rl2t

Postscript: there was actually no need to write Python code at all; Stata could have done the job in a few dozen minutes.
