Machine Learning for Building Recommender System in Python
Recommender systems are widely used to recommend products such as music, movies, books, news, research articles, and restaurants [1][5][9][10].
There are two popular methods for building recommender systems:
- Collaborative filtering [3][4][5][10]
- Content-based filtering [6][9]
The collaborative filtering method [5][10] predicts (filters) a user's interest in a product by collecting preference information from many other users (collaborating). The assumption behind collaborative filtering is that if a person P1 has the same opinion as another person P2 on one issue, P1 is more likely to share P2's opinion on a different issue than the opinion of a randomly chosen person [5].
The content-based filtering method [6][9] uses product features/attributes to recommend other products similar to what the user likes, based on that user's previous actions or explicit feedback such as product ratings.
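The content-based idea can be illustrated with a minimal sketch. It assumes made-up genre feature vectors and hypothetical names (`item_features`, `recommend_similar`), and ranks unseen items by cosine similarity to a profile built from the items the user liked. It is an illustration only, not the method used later in this article.

```python
import math

# Hypothetical item features, ordered as [action, comedy, romance]
item_features = {
    "Movie A": [1.0, 0.0, 0.0],
    "Movie B": [0.9, 0.1, 0.0],
    "Movie C": [0.0, 1.0, 0.5],
    "Movie D": [0.0, 0.2, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend_similar(liked, features, top_n=2):
    """Rank items the user has not liked yet by similarity to the user's profile."""
    dims = len(next(iter(features.values())))
    # The user profile is the mean feature vector of the liked items
    profile = [sum(features[i][d] for i in liked) / len(liked) for d in range(dims)]
    candidates = [i for i in features if i not in liked]
    ranked = sorted(candidates, key=lambda i: cosine(profile, features[i]), reverse=True)
    return ranked[:top_n]

recommendations = recommend_similar(["Movie A"], item_features)
print(recommendations)
```

Only the user's own feedback and the item attributes are needed here; no other users' ratings are involved, which is the main contrast with collaborative filtering.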
A recommender system may use either or both of these two methods.
In this article, I use the Kaggle Netflix Prize data [2] to demonstrate how to use a model-based collaborative filtering method to build a recommender system in Python.
The rest of the article is arranged as follows:
- Overview of collaborative filtering
- Build recommender system in Python
- Summary
1. Overview of Collaborative Filtering
As described in [5], the main idea behind collaborative filtering is that a person often gets the best recommendations from others with similar interests. Collaborative filtering uses various techniques to match people with similar interests and make recommendations based on shared interests.
The high-level workflow of a collaborative filtering system can be described as follows:
- A user rates items (e.g., movies, books) to express his or her preferences
- The system treats the ratings as an approximate representation of the user's interest in items
- The system matches this user's ratings against other users' ratings and finds the people with the most similar ratings
- The system recommends items that the similar users have rated highly but that this user has not yet rated
Typically, a collaborative filtering system recommends products to a given user in two steps [5]:
- Step 1: Look for people who share the same rating patterns as the given user
- Step 2: Use the ratings from the people found in step 1 to calculate a prediction of the given user's rating of a product
This is called user-based collaborative filtering. One specific implementation of this method is the user-based nearest neighbor algorithm.
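The two steps of user-based collaborative filtering can be sketched in a few lines. This is a minimal illustration under stated assumptions: the ratings are made up, cosine similarity over co-rated items is just one possible similarity measure, and the names `similarity` and `predict_rating` are hypothetical.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m1": 5, "m2": 5, "m3": 1, "m4": 5},
    "carol": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}

def similarity(a, b):
    """Cosine similarity over the items that both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict_rating(user, item, ratings, k=2):
    """Predict a rating as the similarity-weighted mean of the k nearest neighbours."""
    # Step 1: score every other user who has rated the item
    scored = []
    for other, other_ratings in ratings.items():
        if other != user and item in other_ratings:
            scored.append((similarity(ratings[user], other_ratings), other_ratings[item]))
    neighbours = sorted(scored, reverse=True)[:k]
    # Step 2: combine the neighbours' ratings, weighted by similarity
    total_sim = sum(sim for sim, _ in neighbours)
    if total_sim == 0:
        return None
    return sum(sim * r for sim, r in neighbours) / total_sim

estimate = predict_rating("alice", "m4", ratings)
print(round(estimate, 2))
```

In this toy data alice rates like bob and unlike carol, so the estimate for m4 is pulled towards bob's rating of 5.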
As an alternative, item-based collaborative filtering (e.g., users who are interested in x are also interested in y) works in an item-centric manner:
- Step 1: Build an item-item matrix of the rating relationships between pairs of items
- Step 2: Predict the current user's rating of a product by examining the matrix and matching that user's rating data
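The item-centric steps above can be sketched in the same spirit. The example below assumes made-up ratings, uses cosine similarity over co-rating users as the "rating relationship" between two items (an assumption; other measures are common), and stores the item-item matrix as a dictionary keyed by item pairs. The function names are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m1": 5, "m2": 5, "m3": 1, "m4": 5},
    "carol": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}

def build_item_similarity(ratings):
    """Step 1: build a symmetric item-item similarity matrix."""
    # Invert the data to item -> {user: rating}
    by_item = {}
    for user, user_ratings in ratings.items():
        for item, r in user_ratings.items():
            by_item.setdefault(item, {})[user] = r
    sim = {}
    for a, b in combinations(sorted(by_item), 2):
        common = set(by_item[a]) & set(by_item[b])
        if not common:
            continue
        dot = sum(by_item[a][u] * by_item[b][u] for u in common)
        norm_a = math.sqrt(sum(by_item[a][u] ** 2 for u in common))
        norm_b = math.sqrt(sum(by_item[b][u] ** 2 for u in common))
        sim[(a, b)] = sim[(b, a)] = dot / (norm_a * norm_b)
    return sim

def predict_rating(user, item, ratings, sim):
    """Step 2: weight the user's own ratings by each rated item's similarity to the target."""
    weights = [(sim.get((item, other), 0.0), r)
               for other, r in ratings[user].items() if other != item]
    total = sum(w for w, _ in weights)
    if total == 0:
        return None
    return sum(w * r for w, r in weights) / total

item_sim = build_item_similarity(ratings)
estimate = predict_rating("alice", "m4", ratings, item_sim)
print(round(estimate, 2))
```

Because the item-item matrix depends only on the items, it can be computed once and reused for every user, which is one practical reason to prefer the item-based variant when there are many more users than items.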
There are two types of collaborative filtering systems:
- Model-based
- Memory-based
In a model-based system, we develop models using different machine learning algorithms to predict users' ratings of unrated items [5]. There are many model-based collaborative filtering algorithms, such as matrix factorization algorithms (e.g., singular value decomposition (SVD) and the Alternating Least Squares (ALS) algorithm [8]), Bayesian networks, and clustering models [5].
A memory-based system uses users' rating data to compute the similarity between users or items. Typical examples of this type of system are neighbourhood-based methods and item-based/user-based top-N recommendations [5].
This article describes how to build a model-based collaborative filtering system using the SVD model.
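To give a feel for what a matrix factorization model learns, here is a simplified sketch that fits user and item biases plus latent factors by stochastic gradient descent. It is written from scratch for illustration and is not the Surprise implementation; the data, the function name `train_mf`, and the hyperparameter values are all assumptions.

```python
import math
import random

# Hypothetical (user, item, rating) triples
ratings = [
    ("alice", "m1", 5), ("alice", "m2", 4), ("alice", "m3", 1),
    ("bob", "m1", 5), ("bob", "m2", 5), ("bob", "m3", 1), ("bob", "m4", 5),
    ("carol", "m1", 1), ("carol", "m2", 2), ("carol", "m3", 5), ("carol", "m4", 1),
]

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i by SGD and return a predict function."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Regularized gradient steps on the biases and the latent factors
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return predict, mu

predict, mu = train_mf(ratings)
baseline_rmse = math.sqrt(sum((r - mu) ** 2 for _, _, r in ratings) / len(ratings))
train_rmse = math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))
print(round(baseline_rmse, 3), round(train_rmse, 3))
print(round(predict("alice", "m4"), 2))  # estimate for a rating not in the data
```

The training error is measured on the same data the model was fitted on, so it only shows that the model fits; it says nothing about how well it generalizes.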
2. Build Recommender System in Python
This section describes how to build a recommender system in Python.
2.1 Installing Library
There are multiple Python libraries available for building recommender systems (e.g., Python scikit Surprise [7], Spark RDD-based API for collaborative filtering [8]). I use the Python scikit Surprise library in this article for demonstration purposes.
The Surprise library can be installed as follows:

    pip install scikit-surprise

2.2 Loading Data
As mentioned earlier, I use the Kaggle Netflix Prize data [2] in this article. There are multiple data files for different purposes. The following data files are used in this article:
Training data files:
- combined_data_1.txt
- combined_data_2.txt
- combined_data_3.txt
- combined_data_4.txt
Movie titles data file:
- movie_titles.csv
The training dataset is too big to be handled on a laptop. Thus, I only load the first 100,000 records from each of the training data files for demonstration purposes.
Once the training data files have been downloaded onto a local machine, the first 100,000 records from each of them can be loaded into memory as Pandas DataFrames as follows:

    import numpy as np
    import pandas as pd

    def readFile(file_path, rows=100000):
        """Load the first `rows` lines of a training data file into a DataFrame."""
        data_dict = {'Cust_Id': [], 'Movie_Id': [], 'Rating': [], 'Date': []}
        movieId = None
        with open(file_path, "r") as f:
            for count, line in enumerate(f, start=1):
                if count > rows:
                    break
                line = line.strip()
                if line.endswith(':'):
                    # A line such as "1:" starts the ratings of a new movie
                    movieId = int(line[:-1])
                else:
                    customerID, rating, date = line.split(',')
                    # Store IDs as integers so that they match the IDs passed to predict() later
                    data_dict['Cust_Id'].append(int(customerID))
                    data_dict['Movie_Id'].append(movieId)
                    data_dict['Rating'].append(float(rating))
                    data_dict['Date'].append(date)
        return pd.DataFrame(data_dict)

    df1 = readFile('./data/netflix/combined_data_1.txt', rows=100000)
    df2 = readFile('./data/netflix/combined_data_2.txt', rows=100000)
    df3 = readFile('./data/netflix/combined_data_3.txt', rows=100000)
    df4 = readFile('./data/netflix/combined_data_4.txt', rows=100000)

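The file layout that the loader relies on is easier to see on a tiny in-memory sample. The sketch below assumes the layout implied by the code above (a "movie id:" line followed by "customer id,rating,date" lines); the sample values and the helper name `parse_ratings` are made up.

```python
import io

# Made-up sample in the same layout as the training data files
sample = io.StringIO(
    "1:\n"
    "101,3,2005-09-06\n"
    "102,5,2005-05-13\n"
    "2:\n"
    "101,4,2005-10-01\n"
)

def parse_ratings(lines):
    """Turn the grouped layout into flat (customer, movie, rating, date) rows."""
    rows = []
    movie_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Header line: all following ratings belong to this movie
            movie_id = int(line[:-1])
        else:
            cust_id, rating, date = line.split(",")
            rows.append((int(cust_id), movie_id, float(rating), date))
    return rows

rows = parse_ratings(sample)
for row in rows:
    print(row)
```

Each rating line inherits the movie ID of the most recent header line, which is why the loader has to carry that ID across iterations.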
The DataFrames for the different portions of the training data are then combined into one as follows:

    # DataFrame.append() was removed in pandas 2.0, so pd.concat() is used instead
    df = pd.concat([df1, df2, df3, df4])
    df.index = np.arange(0, len(df))
    df.head(10)

The movie titles file can be loaded into memory as a Pandas DataFrame:

    df_title = pd.read_csv('./data/netflix/movie_titles.csv', encoding="ISO-8859-1", header=None, names=['Movie_Id', 'Year', 'Name'])
    df_title.head(10)

2.3 Training and Evaluating Model
The Dataset module in Surprise provides different methods for loading data from files, Pandas DataFrames, or built-in datasets such as ml-100k (MovieLens 100k) [4]:
- Dataset.load_builtin()
- Dataset.load_from_file()
- Dataset.load_from_df()
I use the load_from_df() method to load data from a Pandas DataFrame in this article.
The Reader class in Surprise is used to parse a file containing users, items, and users' ratings of items. By default, each rating is stored on a separate line, with the fields separated by spaces in the following order: user item rating
This order and the separator are configurable using the following parameters:
- line_format is a string such as "item user rating" that gives the order of the fields, with field names separated by spaces
- sep specifies the separator between fields, such as a space or ','
- rating_scale specifies the rating scale; the default value is (1, 5)
- skip_lines is the number of lines to skip at the beginning of the file; the default is 0
I use the default settings in this article. The user, item, and rating fields correspond to the Cust_Id, Movie_Id, and Rating columns of the DataFrame respectively.
The Surprise library [7] contains implementations of multiple models/algorithms for building recommender systems, such as SVD, Probabilistic Matrix Factorization (PMF), and Non-negative Matrix Factorization (NMF). The SVD model is used in this article.
The following code loads data from the Pandas DataFrame and creates an SVD model instance:

    from surprise import Reader, Dataset, SVD
    from surprise.model_selection import cross_validate

    reader = Reader()
    data = Dataset.load_from_df(df[['Cust_Id', 'Movie_Id', 'Rating']], reader)
    svd = SVD()

Once the data and model for product recommendation are ready, the model can be evaluated using cross-validation as follows:

    # Run 5-fold cross-validation and print results
    cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

The following are the results of the cross-validation of the SVD model:
[Figure: cross-validation results of the SVD model (RMSE and MAE for each of the 5 folds)]
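For readers unfamiliar with the two measures, the sketch below shows how RMSE and MAE are computed and how a dataset can be split into k folds. It uses made-up numbers and hypothetical helper names, and the fold split is deliberately simple (contiguous and unshuffled); libraries usually shuffle the data first.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if f < n % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

# Made-up true and predicted ratings
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.5, 3.0, 4.0, 2.0]
print(rmse(actual, predicted), mae(actual, predicted))

folds = list(k_fold_indices(10, 5))
print(folds[0])
```

In k-fold cross-validation the model is trained k times, each time holding out a different fold for testing, and the measures are reported per fold and on average.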
Once the model has been evaluated to our satisfaction, we can re-train it using the entire training dataset:

    trainset = data.build_full_trainset()
    svd.fit(trainset)

2.4 Recommending Products
After a recommendation model has been trained appropriately, it can be used for prediction.
For example, given a user (e.g., Customer Id 785314), we can use the trained model to predict the ratings that the user would give to different products (i.e., movie titles):

    titles = df_title.copy()
    # The user ID must have the same type as the Cust_Id values used for training
    titles['Estimate_Score'] = titles['Movie_Id'].apply(lambda x: svd.predict(785314, x).est)

To recommend products (i.e., movies) to the given user, we can sort the list of movies in decreasing order of predicted rating and take the top N movies as recommendations:

    titles = titles.sort_values(by=['Estimate_Score'], ascending=False)
    titles.head(10)

The following are the top 10 movies to be recommended to the user with Customer Id 785314:
[Figure: top 10 recommended movies for Customer Id 785314, sorted by Estimate_Score]
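As an aside on the ranking step, sorting the whole catalogue is not required when only the top N items are needed. The sketch below uses the standard-library `heapq.nlargest` on made-up scores; the dictionary `predicted_scores` and the helper `top_n` are hypothetical and stand in for the Estimate_Score column.

```python
import heapq

# Made-up predicted ratings for one user
predicted_scores = {
    "Movie A": 4.1,
    "Movie B": 2.7,
    "Movie C": 4.6,
    "Movie D": 3.9,
}

def top_n(scores, n):
    """Return the n (title, score) pairs with the highest predicted score."""
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])

best = top_n(predicted_scores, 2)
print(best)
```

For a catalogue the size of the movie titles file, a full sort is perfectly adequate; the heap-based variant mainly matters when the candidate list is very large.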
3. Summary
In this article, I used the scikit Surprise library [7] and the Kaggle Netflix Prize data [2] to demonstrate how to use a model-based collaborative filtering method to build a recommender system in Python.
As mentioned earlier in this article, the dataset is too big to be handled on a laptop or any typical single personal computer. Thus, I only loaded the first 100,000 records from each of the training data files for demonstration purposes.
In real applications, I would recommend using Surprise with Koalas, or using the ALS algorithm in Spark MLlib, to implement the collaborative filtering system and run it on a Spark cluster [8].
The Jupyter notebook with all of the source code used in this article is available on GitHub [11].
Translated from: https://towardsdatascience.com/machine-learning-for-building-recommender-system-in-python-9e4922dd7e97