Build a Recommendation System with PySpark Using Alternating Least Squares (ALS) Matrix Factorisation
Hands-on recommender system on pyspark
A recommender system is an information filtering tool that seeks to predict which products a user will like and, based on that, recommends a few products to the user. For example, Amazon can recommend new shopping items to buy, Netflix can recommend new movies to watch, and Google can recommend news that a user might be interested in. The two widely used approaches for building a recommender system are content-based filtering (CBF) and collaborative filtering (CF).
To understand the concept of recommender systems, let us look at an example. The table below shows the user-item utility matrix Y, where the value R(u,i) denotes how item i has been rated by user u on a scale of 1–5. The missing entries (shown by ? in the table) are items that have not been rated by the respective user.
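To make the setup concrete, a toy utility matrix can be written down directly in Python, with None marking the unrated "?" entries (the users, items and ratings here are made up for illustration):

```python
# Rows = users, columns = items; None marks a missing ("?") rating
Y = [
    [5,    3,    None, 1   ],
    [4,    None, None, 1   ],
    [1,    1,    None, 5   ],
    [None, 1,    5,    4   ],
]

# The recommender's job is to predict ratings for exactly these cells
missing = [(u, i) for u, row in enumerate(Y)
                  for i, r in enumerate(row) if r is None]
print(missing)  # → [(0, 2), (1, 1), (1, 2), (2, 2), (3, 0)]
```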
Figure: the user-item utility matrix.

The objective of the recommender system is to predict the ratings for these items; the highest-rated items can then be recommended to the respective users. In real-world problems the utility matrix is expected to be very sparse, as each user encounters only a small fraction of the vast pool of available items. The code for this project can be found here.
Explicit vs. Implicit Ratings
There are two ways to gather user-preference data for recommending items. The first is to ask users for explicit ratings, typically on a concrete scale (such as rating a movie from one to five stars), which makes it easier to extrapolate from the data and predict future ratings. The drawback of explicit data is that it puts the burden of data collection on the user, who may not want to take the time to enter ratings. Implicit data, on the other hand, is easy to collect in large quantities without any extra effort from the user. Unfortunately, it is much harder to work with.
Data Sparsity and Cold Start
In real-world problems the utility matrix is expected to be very sparse, as each user encounters only a small fraction of the vast pool of available items. The cold-start problem arises when a new user or a new item is added, since neither has any rating history. Sparsity can be calculated using the function below.
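To build intuition first, here is the same sparsity calculation in plain Python on a tiny made-up dataset (the user and movie ids are hypothetical); the Spark version follows below:

```python
# Toy data: 5 observed ratings as (userId, movieId, rating) triples
ratings_triples = [(1, 10, 4.0), (1, 11, 3.5), (2, 12, 5.0), (3, 13, 2.0), (4, 14, 4.5)]

count_nonzero = len(ratings_triples)
n_users = len({u for u, _, _ in ratings_triples})    # 4 distinct users
n_movies = len({m for _, m, _ in ratings_triples})   # 5 distinct movies
total_elements = n_users * n_movies                  # 20 possible cells

sparsity = (1.0 - count_nonzero * 1.0 / total_elements) * 100
print("The ratings data is %.2f%% sparse." % sparsity)  # → 75.00% sparse
```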
```python
def get_mat_sparsity(ratings):
    # Count the total number of ratings in the dataset
    count_nonzero = ratings.select("rating").count()

    # Count the number of distinct userIds and distinct movieIds
    total_elements = ratings.select("userId").distinct().count() * ratings.select("movieId").distinct().count()

    # Divide the numerator by the denominator
    sparsity = (1.0 - (count_nonzero * 1.0) / total_elements) * 100
    print("The ratings dataframe is ", "%.2f" % sparsity + "% sparse.")

get_mat_sparsity(ratings)
```

1. Dataset with Explicit Ratings (MovieLens)
MovieLens is a recommender system and virtual community website that recommends movies for its users to watch based on their film preferences, using collaborative filtering. The MovieLens 100M dataset is taken from the MovieLens website, which customizes user recommendations based on the ratings given by users. To understand the concept of a recommendation system better, we will work with this dataset. The dataset can be downloaded from here.
There are two files, movies and ratings, which contain fields of the form MovieID::Genre::Title and UserID::MovieID::Rating::Timestamp respectively.
Let's load and explore the data. To load the data as a Spark dataframe, import pyspark and instantiate a Spark session.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Recommendations').getOrCreate()

movies = spark.read.csv("movies.csv", header=True)
ratings = spark.read.csv("ratings.csv", header=True)
ratings.show()
```

Figure: MovieLens data with explicit ratings given by the user for movies watched.

```python
# Join both the data frames to add movie data into ratings
movie_ratings = ratings.join(movies, ['movieId'], 'left')
movie_ratings.show()
```

Figure: ratings and movie data combined for better visualisation.
Let's calculate the sparsity of the data. Please use the function we built at the beginning of this article to get the sparsity. The MovieLens data is 98.36% sparse.
```python
get_mat_sparsity(ratings)
```

Before moving on to recommendations, split the dataset into train and test sets. Set the seed to reproduce results. We will look at different recommendation techniques in detail in the sections below.
```python
# Create test and train set
(train, test) = ratings.randomSplit([0.8, 0.2], seed=2020)
```
2. Dataset with Binary Ratings (MovieLens)
With some datasets we don't have the luxury of explicit ratings, and must infer ratings from the given information. In the MovieLens dataset, let us derive implicit ratings from the explicit ones by assigning 1 to movies that were watched and 0 to movies that were not. We want the model to give high predictions for watched movies.
Please note this is not an ideal dataset for implicit ratings, since the "not watched" set may contain movies the user has actually watched but simply not rated.
```python
import pyspark.sql.functions as fn

def get_binary_data(ratings):
    ratings = ratings.withColumn('binary', fn.lit(1))
    userIds = ratings.select("userId").distinct()
    movieIds = ratings.select("movieId").distinct()

    user_movie = userIds.crossJoin(movieIds).join(ratings, ['userId', 'movieId'], "left")
    user_movie = user_movie.select(['userId', 'movieId', 'binary']).fillna(0)
    return user_movie

user_movie = get_binary_data(ratings)
user_movie.show()
```

Approaches to Recommendation
The two widely used approaches for building a recommender system are content-based filtering (CBF) and collaborative filtering (CF), of which CBF is the most widely used.
The figure below illustrates the concepts of CF and CBF. The primary difference between the two approaches is that CF looks for similar users to recommend items, while CBF looks for similar content to recommend items.
Collaborative Filtering (CF)
Collaborative filtering aggregates the past behaviour of all users. It recommends items to a user based on the items liked by another set of users whose likes (and dislikes) are similar to those of the user under consideration. This approach is also called user-user based CF.
Item-item based CF became popular later: to recommend an item to a user, the similarity between the items the user has liked and other items is calculated. Both user-user CF and item-item CF can be implemented in two different ways, memory-based (the neighbourhood approach) and model-based (the latent factor model approach).
1. The memory-based (neighbourhood) approach
To recommend items to user u1 in the user-user neighbourhood approach, a set of users whose likes and dislikes are similar to those of user u1 is first found using a similarity metric, one that captures the intuition that sim(u1, u2) > sim(u1, u3) when users u1 and u2 are similar and users u1 and u3 are dissimilar. This set of similar users is called the neighbourhood of user u1.
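One common choice for such a similarity metric is cosine similarity between user rating vectors (with 0 standing for unrated items); the three users below are made up to show sim(u1, u2) > sim(u1, u3):

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two rating vectors (0 = unrated)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

u1 = [5, 4, 0, 1, 0]   # hypothetical ratings over 5 items
u2 = [4, 5, 0, 2, 0]   # tastes similar to u1
u3 = [1, 0, 5, 5, 4]   # tastes dissimilar to u1

print(cosine_sim(u1, u2))  # high: u2 is in u1's neighbourhood
print(cosine_sim(u1, u3))  # low: u3 is not
```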
Image source: https://link.springer.com/article/10.1007/s11227-020-03266-2

Neighbourhood approaches are most effective at detecting very localized relationships (neighbours), while ignoring other users. The downsides are that, first, the data is sparse, which hinders scalability, and second, they perform poorly at reducing the RMSE (root mean squared error) compared to more complex methods.
2. The model-based (latent factor model) approach
Latent factor model based collaborative filtering learns (latent) user and item profiles, both of dimension K, through matrix factorization, by minimizing the RMSE (root mean squared error) between the available ratings y and their predicted values ŷ. Here each item i is associated with a latent (feature) vector x_i, each user u is associated with a latent (profile) vector θ_u, and the predicted rating is ŷ_ui = x_iᵀ θ_u.
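In code, a single predicted rating is just the dot product of the two latent vectors; the K = 3 vectors below are made-up numbers, not learned values:

```python
import numpy as np

x_i = np.array([0.8, 0.1, 0.5])      # hypothetical latent vector for item i
theta_u = np.array([1.2, 0.3, 0.9])  # hypothetical latent vector for user u

y_hat_ui = float(x_i @ theta_u)      # ŷ_ui = x_i . θ_u
print(round(y_hat_ui, 2))            # → 1.44
```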
Image source: https://developers.google.com/machine-learning/recommendation/collaborative/matrix

Latent factor methods deliver prediction accuracy superior to other published CF techniques. They also address the sparsity issue faced by neighbourhood models in CF. Their memory efficiency and ease of implementation via gradient-based matrix factorization models (SVD) made this the method of choice in the Netflix Prize competition. However, latent factor models are only effective at estimating the association between all items at once, and fail to identify strong associations among small sets of closely related items.
Recommendation using Alternating Least Squares (ALS)
Alternating Least Squares (ALS) matrix factorisation attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called ‘factor’ matrices. The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.
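The alternation described above can be sketched in a few lines of NumPy. This is a toy, dense-matrix illustration with a fixed regularisation term, not what Spark's ALS does internally (Spark masks missing entries and parallelises the per-row solves):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(6, 5)).astype(float)  # toy 6x5 ratings matrix
k, reg = 2, 0.1                                    # rank and regularisation
X = rng.normal(size=(6, k))                        # user factor matrix
Y = rng.normal(size=(5, k))                        # item factor matrix

for _ in range(20):
    # Hold Y fixed, solve for all user factors by regularised least squares
    X = np.linalg.solve(Y.T @ Y + reg * np.eye(k), Y.T @ R.T).T
    # Hold X fixed, solve for all item factors the same way
    Y = np.linalg.solve(X.T @ X + reg * np.eye(k), X.T @ R).T

rmse = np.sqrt(np.mean((R - X @ Y.T) ** 2))
print(round(rmse, 3))  # reconstruction error after 20 alternations
```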
In the section below we will instantiate an ALS model, run hyperparameter tuning with cross-validation, and fit the model.
1. Build out an ALS model
To build the model, explicitly specify the columns. Set nonnegative to 'True', since we are looking at ratings greater than 0. The model also gives an option for implicit ratings; since we are working with explicit ratings, set implicitPrefs to 'False'.
When using simple random splits, as in Spark's CrossValidator or TrainValidationSplit, it is actually very common to encounter users and/or items in the evaluation set that are not in the training set. By default, Spark assigns NaN predictions during ALSModel.transform when a user and/or item factor is not present in the model. We set the cold-start strategy to 'drop' to ensure we don't get NaN evaluation metrics.
```python
# Import the required functions
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create ALS model
als = ALS(
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    nonnegative=True,
    implicitPrefs=False,
    coldStartStrategy="drop"
)
```
2. Hyperparameter tuning and cross validation
```python
# Import the requisite packages
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
```
ParamGridBuilder: We will first define the tuning parameters in param_grid; please feel free to experiment with the grid parameters. I have chosen only 4 values for each of the two hyperparameters, which results in 16 models to train.
```python
# Add hyperparameters and their respective values to param_grid
param_grid = ParamGridBuilder() \
    .addGrid(als.rank, [10, 50, 100, 150]) \
    .addGrid(als.regParam, [.01, .05, .1, .15]) \
    .build()
```
RegressionEvaluator: Then define the evaluator, selecting rmse as the metricName.
```python
# Define evaluator as RMSE and print the number of models to be tested
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="rating",
    predictionCol="prediction")

print("Num models to be tested: ", len(param_grid))
```
CrossValidator: Now feed both param_grid and evaluator into the CrossValidator, together with the ALS model. I have chosen 5 folds; feel free to experiment with the parameters.
```python
# Build cross validation using CrossValidator
cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)
```
3. Check the best model parameters
Let us check which of the 16 parameter combinations fed into the CrossValidator resulted in the best model. (best_model is extracted after fitting the CrossValidator, as shown in the next section.)
```python
print("**Best Model**")

# Print "Rank"
print("  Rank:", best_model._java_obj.parent().getRank())

# Print "MaxIter"
print("  MaxIter:", best_model._java_obj.parent().getMaxIter())

# Print "RegParam"
print("  RegParam:", best_model._java_obj.parent().getRegParam())
```
4. Fit the best model and evaluate predictions
Now fit the model and make predictions on the test dataset. As discussed earlier, based on the range of parameters chosen, we are testing 16 models, so this might take a while.
```python
# Fit cross validator to the 'train' dataset
model = cv.fit(train)

# Extract best model from the cv model above
best_model = model.bestModel

# View the predictions
test_predictions = best_model.transform(test)
RMSE = evaluator.evaluate(test_predictions)
print(RMSE)
```
The RMSE for the best model is 0.866, which means that on average the model's predictions are 0.866 above or below the values in the original ratings matrix. Please note that matrix factorisation unravels patterns that humans cannot, so you may find that the ratings for a few users are a bit off in comparison to others.
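For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings; a quick sketch with made-up numbers:

```python
import math

actual    = [4.0, 3.0, 5.0, 2.0]   # hypothetical held-out ratings
predicted = [3.5, 3.4, 4.6, 2.8]   # hypothetical model predictions

rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
print(round(rmse, 3))  # → 0.55
```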
5. Make Recommendations
Let's go ahead and make recommendations based on our best model. The recommendForAllUsers(n) function in ALS generates the top n recommendations. Let's go with 5 recommendations for all users.
```python
# Generate n recommendations for all users
nrecommendations = best_model.recommendForAllUsers(5)
nrecommendations.show()
```
6. Convert recommendations into an interpretable format
The recommendations are generated in a format that is easy to use in pyspark. As seen in the output above, the recommendations are saved in an array column of movie ids and ratings. To make them easy to read and compare, and to check whether the recommendations make sense, we will add more information such as movie name and genre, then explode the array to get one row per recommendation.
```python
from pyspark.sql.functions import col, explode

nrecommendations = nrecommendations\
    .withColumn("rec_exp", explode("recommendations"))\
    .select('userId', col("rec_exp.movieId"), col("rec_exp.rating"))

nrecommendations.limit(10).show()
```
Do the recommendations make sense?
To check whether the recommendations make sense, join the movie name and genre to the table above. Let's pick the 100th user and check whether their recommendations make sense.
100th User's ALS Recommendations:
```python
nrecommendations.join(movies, on='movieId').filter('userId = 100').show()
```

100th User's Actual Preference:

```python
ratings.join(movies, on='movieId').filter('userId = 100').sort('rating', ascending=False).limit(10).show()
```

The movies recommended to the 100th user primarily belong to the comedy, drama, war and romance genres, and the movies the user actually preferred, as seen in the table above, match these genres very closely.
I hope you enjoyed reading. Please find the detailed code in the GitHub repository.
Translated from: https://medium.com/@snehal.1409/build-recommendation-system-with-pyspark-using-alternating-least-squares-als-matrix-factorisation-ebe1ad2e7679