Topic Modelling of Google Play Store App Reviews using LDA
App ratings and reviews in the Google Play Store provide a decent idea of the entire customer experience. It is very important to understand what customers like and dislike about the product or service offered through the app. Thanks to machine learning, we can analyse millions of such reviews and find out where the product/service/app is failing to provide a good customer experience.
This article will help you extract reviews of “any” app from the Google Play Store, understand the top customer concerns associated with the app by extracting topics from reviews with bad ratings, and analyse these concerns in different ways.
Text analytics, Python, NLP, LDA, Tableau, Data science
Let’s start!!!
1. Libraries installation
We have google_play_scraper to extract app reviews from the Google Play Store and sklearn to take care of the NLP segment.
!pip install google_play_scraper
!pip install sklearn

import pandas as pd
from google_play_scraper.features.reviews import Sort, reviews_all, reviews
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
2. Reviews data extraction
Fortunately, we have the google_play_scraper library to extract all reviews and provide the output in JSON format. We just need to provide the app id to the reviews_all function (refer to the code below).
The app id can be found in the Play Store URL: once you select a particular app, look at the string assigned to id (refer to the highlighted part of the URL in the image below).
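If you would rather pull the app id out of the URL programmatically, a minimal sketch using Python's standard library could look like this (the example URL is illustrative):

```python
from urllib.parse import urlparse, parse_qs

# Example Play Store URL; the `id` query parameter holds the app id.
url = 'https://play.google.com/store/apps/details?id=com.bt.bms&hl=en'

# Parse the query string and pull out the app id.
app_id = parse_qs(urlparse(url).query)['id'][0]
print(app_id)  # com.bt.bms
```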
result = reviews_all('com.bt.bms', sleep_milliseconds=0, lang='en', country='us')

3. Create dataframe of the reviews
Here, I am converting the JSON output to a pandas dataframe to get stats like total reviews, reviews by unknown users, reviews by unique users, etc. You might see that this total doesn't match the total reviews shown on the Google Play Store page. The reason for this difference is that the total considered by the Play Store includes blank reviews as well (most customers don't write reviews, they just give a numerical rating).
One important point about the review data: reviewId is a unique key and the content column contains the textual reviews. Later in this article, you will see joins with 'reviewId' and 'content' as the joining keys.
df = pd.DataFrame(result)

unique_users = len(df['userName'].unique())
unknown_users = len(df[df['userName']=='A Google user'])
total_reviews = len(df)
mean = df['score'].mean()

print(f'Total textual reviews: {len(result)} \n')
print(f'Total unique users : {unique_users}')
print(f'Total unknown users: {unknown_users}')
print(f'Total users who gave multiple reviews: {total_reviews - unique_users - unknown_users}\n')
print(f'Average rating for this app based on the textual reviews: {round(mean,2)} \n')
Output:
Total textual reviews: 233202
Total unique users : 179630
Total unknown users : 28231
Total users who gave multiple reviews: 25341
Average rating for this app based on the textual reviews: 3.99
4. Extract all reviews with rating below 4
We are considering reviews with bad ratings only, i.e. textual reviews with ratings below 4. We are assuming here that customers will write a bad review if they rate the app below 4.
df_tm = df[df['score']<=3]
df_tm = df_tm[df_tm.content.str.len()>=30]

print(f'Remaining textual reviews: {len(df_tm)} \n')

Remaining textual reviews: 37996
5. Get the relevant columns for topic modelling
For topic modelling we just need the content column. reviewId is the unique key.
df_tm = df_tm[['reviewId','content']].drop_duplicates()
df_tm.dropna(inplace=True)
df_tm = df_tm.reset_index().drop(columns='index')

print(f'Remaining textual reviews: {len(df_tm)} \n')

Remaining textual reviews: 37996
6. Create document term matrix of the reviews
We are using CountVectorizer to create the document term matrix; below are the two important parameters.
i. max_df : discard words that occur in more than 95% of documents
ii. min_df : include only those words that occur in at least 2 documents
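To see how these two filters behave, here is a tiny illustrative example on a made-up three-review corpus (not the real data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: 'app' appears in all 3 docs, 'crashes' in 2, everything else in 1.
docs = [
    'app crashes on payment',
    'app crashes after update',
    'great app for booking tickets',
]

# min_df=2 keeps only terms appearing in at least 2 documents.
cv = CountVectorizer(min_df=2).fit(docs)
print(sorted(cv.vocabulary_))  # ['app', 'crashes']

# Adding max_df=0.95 also drops terms present in more than 95% of
# documents, so 'app' (3 of 3 docs) is removed as well.
cv2 = CountVectorizer(min_df=2, max_df=0.95).fit(docs)
print(sorted(cv2.vocabulary_))  # ['crashes']
```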
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(df_tm['content'])
dtm  # shows 8839 terms and 37996 reviews
7. Using LDA for topic modelling
In LDA, the main parameter is n_components; it decides how many topics are to be created. By default, n_components is 10. When we try to create more topics, there is always a possibility of the same type of words being present across multiple topics. I finalised n_components as 5 after iterating over this process and analysing the lists of words across the topics.
LDA = LatentDirichletAllocation(n_components=5, random_state=1)
LDA.fit(dtm)
Output:
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=5, n_jobs=None,
                          perp_tol=0.1, random_state=1, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)
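One way to make the "same words across topics" check described above more systematic is to score the overlap between each pair of topics' top-word sets for different values of n_components. The helper below is a hypothetical sketch (not from the original article), demonstrated on a tiny made-up corpus:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_overlap(lda, n_top=20):
    """Mean pairwise overlap (0..1) between the top-n_top word sets of each
    pair of topics; lower values suggest more distinct topics."""
    tops = [set(np.argsort(comp)[-n_top:]) for comp in lda.components_]
    pairs = [(i, j) for i in range(len(tops)) for j in range(i + 1, len(tops))]
    return sum(len(tops[i] & tops[j]) for i, j in pairs) / (n_top * len(pairs))

# Tiny illustrative corpus (not the real review data).
docs = ['payment failed error card', 'app crash update version',
        'refund money deducted', 'seats booking theatre movie'] * 5
dtm_demo = CountVectorizer().fit_transform(docs)

# Compare candidate topic counts; pick the largest k that keeps overlap low.
for k in (2, 3, 4):
    lda_k = LatentDirichletAllocation(n_components=k, random_state=1).fit(dtm_demo)
    print(k, round(topic_overlap(lda_k, n_top=5), 2))
```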
Extract the topics and corresponding top 20 (high frequency) words
# note: cv.get_feature_names() was removed in scikit-learn 1.2;
# use cv.get_feature_names_out() there instead
for index, topic in enumerate(LDA.components_):
    print(f'topic #{index} : ')
    print([cv.get_feature_names()[i] for i in topic.argsort()[-20:]])
The output from the model is 5 topics, each containing a list of the top 20 words. The function argsort() helps us get the index positions of the high frequency words within a particular topic.
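The argsort() trick can be seen on a tiny made-up weight vector:

```python
import numpy as np

# argsort() returns the indices that would sort the array ascending,
# so the last k entries point at the k highest-weight words.
weights = np.array([0.1, 0.7, 0.05, 0.9])
top2 = weights.argsort()[-2:]
print(top2)  # [1 3] -> indices of the two largest weights
```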
A name is given to each topic after interpreting the words within it; the LDA model itself doesn't assign names to the topics.
topic #0 : Internet Charges related
['rating', 'convenience', 'charging', 'people', 'book', 'movie', 'extra', 'fees', 'fee', 'tickets', 'offers', 'good', 'charge', 'ticket', 'booking', 'high', 'app', 'handling', 'charges', 'internet']

topic #1 : Payment/Offers related
['able', 'card', 'fix', 'problem', 'offers', 'unable', 'shows', 'tried', 'doesn', 'work', 'times', 'book', 'payment', 'try', 'offer', 'open', 'working', 'error', 'time', 'app']

topic #2 : App issue related
['download', 'don', 'like', 'updated', 'slow', 'previous', 'bad', 'phone', 'old', 'user', 'hai', 'need', 'latest', 'good', 'better', 'worst', 'new', 'version', 'update', 'app']

topic #3 : Booking-Refund/Ticket
['care', 'deducted', 'didn', 'account', 'transaction', 'time', 'movie', 'service', 'payment', 'got', 'refund', 'worst', 'customer', 'booking', 'book', 'app', 'booked', 'ticket', 'money', 'tickets']

topic #4 : Booking-Location/language
['add', 'cancellation', 'cinema', 'able', 'theatre', 'good', 'seat', 'ticket', 'location', 'shows', 'movies', 'showing', 'booking', 'available', 'seats', 'option', 'movie', 'book', 'tickets', 'app']
8. Combine the topic modelling results with the base dataset
topic_results = LDA.transform(dtm)

df_topic_results = pd.DataFrame(topic_results,
                                columns=['0_InternetCharges',
                                         '1_Payment/Offers',
                                         '2_App',
                                         '3_Booking-Refund/Ticket',
                                         '4_Booking-Location/language'])

df_result = pd.merge(df_tm, df_topic_results, how='inner', left_index=True, right_index=True)
df_output = pd.merge(df, df_result, how='left', on=['reviewId', 'content'])
df_output.to_csv('app_reviews_bms.csv')
Output dataset:
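For downstream analysis it is often handy to reduce the per-topic probabilities to a single dominant-topic label per review. A hypothetical sketch (the topic names follow the columns above; the probability values here are made up):

```python
import pandas as pd

topic_names = ['0_InternetCharges', '1_Payment/Offers', '2_App',
               '3_Booking-Refund/Ticket', '4_Booking-Location/language']

# Made-up per-review topic probabilities; in practice this would be
# the LDA.transform(dtm) output, whose rows sum to 1.
probs = pd.DataFrame([[0.10, 0.60, 0.10, 0.10, 0.10],
                      [0.70, 0.10, 0.10, 0.05, 0.05]], columns=topic_names)

# argmax over each row gives the column index of the strongest topic.
dominant = [topic_names[i] for i in probs.values.argmax(axis=1)]
probs['dominant_topic'] = dominant
print(dominant)  # ['1_Payment/Offers', '0_InternetCharges']
```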
9. Visualising the output using Tableau dashboard
I have used tableau to analyse and visualise the output dataset.
You can go to this link: https://public.tableau.com/views/AppReviews_v2/GooglePlayStoreappreviewanalysis?:language=en&:display_count=y&:origin=viz_share_link and analyse the results; you can also download the Tableau workbook once you have finished creating your output dataset.
For a better experience of the Tableau visualisation, switch to desktop mode (if you are viewing on a phone).
Next steps: we can optimise the content by removing non-English words, emojis, and common words. That might help us improve the accuracy of the model.
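One simple way to sketch that cleanup step, assuming stripping non-ASCII characters is an acceptable proxy for removing emojis and non-English script (a rough heuristic, not the article's implementation):

```python
import re

def clean_review(text):
    """Rough cleanup sketch: strip non-ASCII runs (emojis, non-English
    scripts) and collapse the leftover whitespace."""
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # drop non-ASCII runs
    return re.sub(r'\s+', ' ', text).strip()    # normalise spacing

print(clean_review('worst app 😡 paisa wapas karo'))  # worst app paisa wapas karo
```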
I would love to get valuable comments on any sort of improvements to the entire process.
Translated from: https://medium.com/analytics-vidhya/play-store-app-reviews-textual-data-topic-modelling-using-lda-f24bdbd2910d