Recommendation System for Anime Data
A simple TfidfVectorizer and CountVectorizer recommendation system for beginners.
The Goal
Recommendation systems are widely used in many industries to suggest items to customers. For example, a radio station may use a recommendation system to build a top-100 chart of the month to suggest to its audience, or to identify songs of a genre similar to one a listener has requested. Because recommendation systems are so widely used, we are going to create one for anime data. It would be nice if anime fans could see an updated top-100 anime chart every time they walk into an anime store, or receive an email suggesting anime based on the genres they like.
With the anime data, we will apply two different recommendation models, a simple recommendation system and a content-based recommendation system, to analyse the data and create recommendations.
Overview
For the simple recommendation system, we need to calculate a weighted rating so that the same average score carries a different weight depending on the number of votes behind it. For example, an average rating of 9.0 from 10 people carries less weight than an average rating of 9.0 from 1,000 people. After we calculate the weighted rating, we can produce a top chart of anime.
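As a rough illustration of the 10-versus-1,000-votes example, the sketch below applies the same weighted rating formula that the WR function uses in the implementation section. The threshold m and the global mean C are made-up values chosen only for this example, and the helper name weighted_rating is hypothetical.

```python
def weighted_rating(v, R, m, C):
    """Blend an item's own rating R with the global mean C, weighted by votes v."""
    return (v / (v + m)) * R + (m / (v + m)) * C


# Hypothetical values, chosen only for illustration.
m = 100   # assumed minimum number of votes to be listed
C = 6.5   # assumed mean rating across the whole dataset

few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)
many_votes = weighted_rating(v=1000, R=9.0, m=m, C=C)

print(round(few_votes, 3))   # 6.727
print(round(many_votes, 3))  # 8.773
```

With only 10 votes the score is pulled most of the way towards the global mean, while with 1,000 votes it stays close to the anime's own 9.0.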
For the content-based recommendation system, we need to identify which features will be used in the analysis. We will use sklearn to measure the similarity between anime based on those features and generate suggestions.
Data Overview
The anime data contains 12,294 anime and 7 columns: anime_id, name, genre, type, episodes, rating, and members.
Implementation
1. Import Data
We need to import pandas, as this will let us load the data neatly into a DataFrame.
import pandas as pd
anime = pd.read_csv('…/anime.csv')
anime.head(5)
anime.info()
anime.describe()
We can see that the minimum rating score is 1.67 and the maximum rating score is 10. The minimum number of members is 5 and the maximum is 1,013,917.
anime_dup = anime[anime.duplicated()]
print(anime_dup)
There is no duplicated data that needs to be cleaned.
type_values = anime['type'].value_counts()
print(type_values)
Most anime are broadcast on TV, followed by OVA.
2. Simple Recommendation System
First, we need to know how the weighted rating (WR) is calculated.
v is the number of votes for the anime; m is the minimum number of votes required to be listed in the chart; R is the average rating of the anime; C is the mean rating across the whole dataset. In this dataset, the members column is used as the number of votes.
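Written out in terms of these symbols, and consistent with the WR function used in this section's code, the weighted rating is:

```latex
\[
\mathrm{WR} = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C
\]
```

The two weights sum to 1, so WR always lies between the anime's own rating R and the global mean C. As v grows relative to m, WR moves towards R.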
We need to determine which data will be used in this calculation.
m = anime['members'].quantile(0.75)
print(m)
From the result, we will use only the anime that have more than 9,437 members to create the recommendation system.
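To make the cutoff step concrete, here is a minimal sketch of how a 75th-percentile threshold filters a column. The member counts are invented for illustration and are not taken from the anime data.

```python
import pandas as pd

# Hypothetical member counts, for illustration only.
members = pd.Series([5, 120, 800, 3000, 9000, 15000, 40000, 250000])

# 75th percentile; pandas interpolates linearly between neighbours by default.
m = members.quantile(0.75)

# Keep only the entries strictly above the cutoff.
qualified = members[members > m]

print(m)
print(qualified.tolist())
```

Roughly the top quarter of entries survives the filter, which is the same effect the 0.75 quantile has on the real members column.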
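qualified_anime = anime.copy().loc[anime['members'] > m]
C = anime['rating'].mean()

def WR(x, C=C, m=m):
    # v: number of votes (the members column is used here); R: average rating
    v = x['members']
    R = x['rating']
    return (v / (v + m) * R) + (m / (v + m) * C)

qualified_anime['score'] = WR(qualified_anime)
# sort_values returns a new DataFrame, so assign the result back
qualified_anime = qualified_anime.sort_values('score', ascending=False)
qualified_anime.head(15)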
This is the list of the top 15 anime based on the weighted rating calculation.
3. Genre-Based Recommendation System
For the genre-based recommendation, we will use the sklearn package to help us analyse the genre text. We need to compute the similarity between the genres of different anime. The two methods we are going to use are TfidfVectorizer and CountVectorizer.
TfidfVectorizer weights the frequency of each word in a document by how often the word occurs across all documents, so words that appear in many documents count for less. CountVectorizer is simpler: it only counts how many times each word occurs.
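The difference between the two vectorizers can be seen on a handful of genre strings. The sketch below uses invented genre strings and default vectorizer settings, so treat it as an illustration rather than the article's exact setup.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical genre strings, for illustration only.
genres = [
    "Action, Comedy",
    "Action, Drama",
    "Action, Romance",
    "Drama, Romance",
]

count = CountVectorizer()
count_matrix = count.fit_transform(genres)

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(genres)

# Look at the first document, "Action, Comedy".
first_counts = count_matrix[0].toarray()[0]
first_tfidf = tfidf_matrix[0].toarray()[0]

count_action = first_counts[count.vocabulary_["action"]]
count_comedy = first_counts[count.vocabulary_["comedy"]]
tfidf_action = first_tfidf[tfidf.vocabulary_["action"]]
tfidf_comedy = first_tfidf[tfidf.vocabulary_["comedy"]]

print(count_action, count_comedy)   # both words are counted once
print(tfidf_action < tfidf_comedy)  # the common word gets the smaller weight
```

CountVectorizer treats "action" and "comedy" equally, while TfidfVectorizer gives "action" a lower weight because it appears in three of the four documents.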
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf = TfidfVectorizer(lowercase=True, stop_words='english')
anime['genre'] = anime['genre'].fillna('')
tf_idf_matrix = tf_idf.fit_transform(anime['genre'])
tf_idf_matrix.shape
We can see that there are 46 different words across the 12,294 anime.
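Note that these "words" are tokens, not necessarily whole genre names. With the default tokenizer, multi-word or hyphenated genres are split into separate tokens and English stop words are dropped. The sketch below shows this on a few genre strings picked for illustration. The vocabulary_ attribute is a standard part of fitted sklearn vectorizers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A few multi-word genre strings, chosen for illustration.
genres = ["Sci-Fi, Slice of Life", "Martial Arts, Super Power"]

tf_idf = TfidfVectorizer(lowercase=True, stop_words="english")
tf_idf.fit(genres)

# vocabulary_ maps each learned token to its column index.
tokens = sorted(tf_idf.vocabulary_)
print(tokens)
```

Here "Sci-Fi" becomes the two tokens "sci" and "fi", and "Slice of Life" becomes "slice" and "life" because "of" is a stop word. This means the number of columns in the matrix does not have to equal the number of distinct genres.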
from sklearn.metrics.pairwise import linear_kernel
cosine_sim = linear_kernel(tf_idf_matrix, tf_idf_matrix)
indices = pd.Series(anime.index, index=anime['name'])
# keep the first row for each name so that indices[name] returns a single position
indices = indices[~indices.index.duplicated(keep='first')]

def recommendations(name, cosine_sim=cosine_sim):
    similarity_scores = list(enumerate(cosine_sim[indices[name]]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # skip the first entry, which is expected to be the anime itself
    similarity_scores = similarity_scores[1:21]
    anime_indices = [i[0] for i in similarity_scores]
    return anime['name'].iloc[anime_indices]

recommendations('Kimi no Na wa.')
Based on the TF-IDF calculation, these are the top 20 anime recommendations most similar to Kimi no Na wa.
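One detail worth knowing: slicing off the first sorted entry assumes the queried anime always ranks first. When another title has an identical genre string, both score the same and the queried title may not be in first position. The sketch below is a hypothetical variant on an invented toy table that removes the queried title by name instead. The names toy, positions, and recommend are made up for this example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical titles and genres, for illustration only.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "genre": ["Drama, Romance", "Drama, Romance", "Action, Comedy", "Drama, School"],
})

matrix = TfidfVectorizer().fit_transform(toy["genre"])
sim = linear_kernel(matrix, matrix)

# Map each title to its row position in the similarity matrix.
positions = pd.Series(range(len(toy)), index=toy["name"])


def recommend(name, top_n=2):
    """Return the top_n most similar titles, excluding the queried title itself."""
    scores = pd.Series(sim[positions[name]], index=toy["name"])
    scores = scores.drop(name)  # remove the query explicitly instead of slicing [1:]
    return scores.sort_values(ascending=False).head(top_n).index.tolist()


print(recommend("B"))
```

For "B", the identical-genre title "A" is kept as the best match, whereas positional slicing on a stable sort would have discarded "A" and kept "B" itself.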
Next, we are going to look at another model, CountVectorizer(), and compare the results from cosine_similarity and linear_kernel.
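Before comparing the two, it helps to know when they can differ. linear_kernel computes a plain dot product, while cosine_similarity divides that dot product by the vector lengths. TfidfVectorizer L2-normalises its rows by default, so the two agree on TF-IDF output, but raw counts are not normalised. A small sketch on invented genre strings:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical genre strings, for illustration only.
genres = ["Action, Comedy, Drama", "Action", "Comedy, Drama"]

counts = CountVectorizer().fit_transform(genres)
tfidf = TfidfVectorizer().fit_transform(genres)

# Similarity of the second title ("Action") to every title, on raw counts.
print(linear_kernel(counts, counts)[1])      # dot products
print(cosine_similarity(counts, counts)[1])  # length-normalised

same_on_tfidf = np.allclose(linear_kernel(tfidf, tfidf),
                            cosine_similarity(tfidf, tfidf))
same_on_counts = np.allclose(linear_kernel(counts, counts),
                             cosine_similarity(counts, counts))
print(same_on_tfidf, same_on_counts)
```

On raw counts, the dot product scores the three-genre title as high as the exact match for "Action", while cosine similarity penalises the extra genres. This is one reason the two recommendation lists can differ for CountVectorizer.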
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(anime['genre'])
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
recommendations('Kimi no Na wa.', cosine_sim2)
cosine_sim2 = linear_kernel(count_matrix, count_matrix)
recommendations('Kimi no Na wa.', cosine_sim2)
Summary
In this article, we looked at the anime data and built two types of recommendation systems. The simple recommendation system lets us see the top chart of anime. We did this by applying the weighted rating calculation to the rating and the number of members. Then we built a recommendation system based on the anime's genre feature, applying both TfidfVectorizer and CountVectorizer to see the differences in their recommendations.
I hope you enjoyed this article!
Translated from: https://medium.com/analytics-vidhya/recommendation-system-for-anime-data-784c78952ba5