python surprise
The surprise library
Contents
- The surprise library
- Contents
- Reference documentation
- Installation
- Prediction algorithms
- The prediction_algorithms package
- The algorithm base class
- Baseline estimates configuration
- Tuning algorithm parameters with GridSearchCV
- Similarity measure configuration
- Trainset class
- Source code notes
- The Reader class
- The Dataset class
- The Trainset class
- The KNNBaseline class
- The KNNBasic class
- Example
- Data splitting (split)
- The ShuffleSplit class
Reference documentation
Surprise documentation
GitHub — Surprise
Installation
With pip (you’ll need numpy, and a C compiler. Windows users might prefer using conda):
$ pip install numpy
$ pip install scikit-surprise
With conda:
$ conda install -c conda-forge scikit-surprise
For the latest version, you can also clone the repo and build the source (you’ll first need Cython and numpy):
$ pip install numpy cython
$ git clone https://github.com/NicolasHug/surprise.git
$ cd surprise
$ python setup.py install
Prediction algorithms
Every algorithm lives in the surprise library's global namespace and can be called directly.
The prediction_algorithms package
| Algorithm | Description |
| --- | --- |
| random_pred.NormalPredictor | Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal. |
| baseline_only.BaselineOnly | Algorithm predicting the baseline estimate for given user and item. |
| knns.KNNBasic | A basic collaborative filtering algorithm. |
| knns.KNNWithMeans | A basic collaborative filtering algorithm, taking into account the mean ratings of each user. |
| knns.KNNBaseline | A basic collaborative filtering algorithm taking into account a baseline rating. |
| matrix_factorization.SVD | The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. |
| matrix_factorization.SVDpp | The SVD++ algorithm, an extension of SVD taking into account implicit ratings. |
| matrix_factorization.NMF | A collaborative filtering algorithm based on Non-negative Matrix Factorization. |
| slope_one.SlopeOne | A simple yet accurate collaborative filtering algorithm. |
| co_clustering.CoClustering | A collaborative filtering algorithm based on co-clustering. |
The algorithm base class
The surprise.prediction_algorithms.algo_base module defines the base class AlgoBase from which every single prediction algorithm has to inherit.
fit(trainset)
Train an algorithm on a given training set.
This method is called by every derived class as the first basic step for training an algorithm. It basically just initializes some internal structures and sets the self.trainset attribute.
Parameters: trainset (Trainset) – A training set, as returned by the folds method.
Returns: self
Baseline estimates configuration
The algorithms that minimize a squared-error objective (including the baseline method and similarity computations) all take configuration parameters; different parameters lead to different performance, different algorithms use different baselines, and the configuration differs accordingly.
The default baseline parameters already give reasonable performance.
Note that some similarity measures use a baseline, so the related parameters must be configured whether or not the prediction algorithm itself uses one.
For the specific parameter configuration, refer to the paper
Factor in the Neighbors: Scalable and Accurate Collaborative Filtering
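As a rough illustration of what a baseline estimate is, here is a plain-Python sketch with made-up ratings (not the ALS/SGD solvers surprise actually uses): the estimate is b_ui = mu + b_u + b_i, where mu is the global mean and b_u, b_i are user and item biases.

```python
from collections import defaultdict

# Hypothetical (user, item, rating) triples for illustration only.
ratings = [
    ("u1", "i1", 5), ("u1", "i2", 3),
    ("u2", "i1", 4), ("u2", "i2", 2),
]

mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating

# b_u: how far each user's ratings deviate from the global mean, on average.
user_devs = defaultdict(list)
for u, _, r in ratings:
    user_devs[u].append(r - mu)
b_u = {u: sum(d) / len(d) for u, d in user_devs.items()}

# b_i: how far each item's ratings deviate once the user bias is removed.
item_devs = defaultdict(list)
for u, i, r in ratings:
    item_devs[i].append(r - mu - b_u[u])
b_i = {i: sum(d) / len(d) for i, d in item_devs.items()}

def baseline(u, i):
    """Baseline estimate b_ui = mu + b_u + b_i (0 bias for unknowns)."""
    return mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)
```

This one-pass estimate is only a sketch; the library fits the biases by minimizing a regularized squared error instead.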
Tuning algorithm parameters with GridSearchCV
The cross_validate() function reports cross-validation accuracy metrics for a given set of parameters. If you want to know which parameter combination yields the best results, the GridSearchCV class does the job: given a dict of parameters, it exhaustively tries every combination and reports the best parameters for any accuracy measure (averaged over the different splits). It is inspired by scikit-learn's GridSearchCV.
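To make the "exhaustively tries every combination" part concrete, here is a plain-Python sketch; the score function is a hypothetical stand-in (surprise's GridSearchCV cross-validates an algorithm for each combination instead).

```python
from itertools import product

# A parameter grid in the same dict-of-lists shape GridSearchCV accepts.
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005]}

def score(params):
    """Hypothetical stand-in for a cross-validated RMSE (lower is better)."""
    return params["n_epochs"] * 0.01 + params["lr_all"]

keys = sorted(param_grid)
best_params, best_score = None, float("inf")
# Enumerate the Cartesian product of all parameter values.
for values in product(*(param_grid[k] for k in keys)):
    params = dict(zip(keys, values))
    s = score(params)
    if s < best_score:
        best_params, best_score = params, s
```

With the real class you would pass the grid and an algorithm class, call fit on a Dataset, and read best_params / best_score per metric.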
Similarity measure configuration
Many algorithms use a similarity measure to estimate a rating. The way they can be configured is done in a similar fashion as for baseline ratings: you just need to pass a sim_options argument at the creation of an algorithm. This argument is a dictionary with the following (all optional) keys:
The similarity measure is configured through the sim_options dict:
1. name: the name of the similarity measure to use; the similarities module uses MSD by default
2. user_based: whether similarities are computed between users or between items; this choice has a huge impact on the performance of a prediction algorithm. Default: True
3. min_support: the minimum number of common users (or common items) for the similarity not to be zero
4. shrinkage: shrinkage parameter, used only with the pearson_baseline similarity
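To illustrate how min_support interacts with a similarity measure, here is a plain-Python sketch of an MSD-style similarity between two users (not the library's actual Cython implementation; the exact formula in the similarities module may differ in detail).

```python
def msd_sim(ratings_a, ratings_b, min_support=1):
    """MSD-style similarity; ratings_a/ratings_b map item -> rating."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < min_support:
        return 0.0  # too few common items: similarity is forced to zero
    msd = sum((ratings_a[i] - ratings_b[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)  # +1 avoids division by zero; identical -> 1.0
```

Identical rating vectors give 1.0, and the similarity decays as the mean squared difference grows.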
Trainset class
It is used by the fit() method of every prediction algorithm. You should not try to build such an object on your own but rather use the Dataset.folds() method or the DatasetAutoFolds.build_full_trainset() method.
Source code notes
The Reader class
def __init__(self, name=None, line_format='user item rating', sep=None, rating_scale=(1, 5), skip_lines=0):
Sets up the reader format; ratings are automatically clamped to the rating_scale interval.
self.offset: self.offset = -lower_bound + 1 if lower_bound <= 0 else 0
parse_line: parses one line and returns the data in the required format.
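The offset logic quoted above can be isolated into a tiny sketch: when the scale's lower bound is at or below zero, ratings are shifted so they become strictly positive internally.

```python
def rating_offset(rating_scale):
    """Offset applied to ratings, per the expression quoted above."""
    lower_bound, _upper_bound = rating_scale
    return -lower_bound + 1 if lower_bound <= 0 else 0
```

So the default (1, 5) scale needs no offset, while a (0, 4) or (-5, 5) scale gets shifted.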
The Dataset class
def build_full_trainset(self):
"""Do not split the dataset into folds and just return a trainset as is, built from the whole dataset. Users can then query for predictions, as shown in the User Guide. Returns: the Trainset."""
Uses all of the data to build the training set.
def construct_trainset(self, raw_trainset):
Builds the `raw_id` to `inner_id` mappings and derives:
- ur: the user → ratings dict
- ir: the item → ratings dict
- n_users: number of users
- n_items: number of items
- n_ratings: number of rating records
and then builds the training set.
The Trainset class
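A plain-Python sketch of what construct_trainset derives (hypothetical raw ids; the real method does more bookkeeping, e.g. around the rating scale): raw ids are mapped to contiguous inner ids, and the ur / ir dicts are filled in.

```python
from collections import defaultdict

# Hypothetical raw (user, item, rating) triples.
raw_ratings = [("u9", "i7", 4.0), ("u9", "i3", 2.0), ("u5", "i7", 5.0)]

raw2inner_uid, raw2inner_iid = {}, {}
ur, ir = defaultdict(list), defaultdict(list)
for ruid, riid, r in raw_ratings:
    # First time a raw id is seen, assign it the next free inner id.
    uid = raw2inner_uid.setdefault(ruid, len(raw2inner_uid))
    iid = raw2inner_iid.setdefault(riid, len(raw2inner_iid))
    ur[uid].append((iid, r))  # user -> list of (item, rating)
    ir[iid].append((uid, r))  # item -> list of (user, rating)

n_users, n_items, n_ratings = len(ur), len(ir), len(raw_ratings)
```

These are exactly the structures (ur, ir, n_users, n_items, n_ratings) that the resulting Trainset carries.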
The KNNBaseline class
Collaborative filtering that takes a baseline rating into account.
It works best with the pearson_baseline similarity.
For what the baseline method contributes, see this article:
推薦系統(tǒng)的協(xié)同過濾算法實(shí)現(xiàn)和淺析 pdf
The KNNBasic class
def fit(self, trainset):
"""Computes the similarity matrix; see the relevant references for the derivation of the similarity formula."""
def estimate(self, u, i):
"""Estimates user u's rating of item i: find the k nearest neighbours that have rated item i, compute the prediction with the corresponding formula, and return est, details({'actual_k': actual_k})."""
Example
In the library, predictions are made for every item a user has not yet rated.
It is not well suited to Top-N recommendation.
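A plain-Python sketch of the estimate step described above (not the library's actual implementation; sim and ir here are hand-built stand-ins for the fitted similarity matrix and the item → ratings dict):

```python
import heapq

def estimate(u, i, sim, ir, k=2):
    """Predict user u's rating of item i as a similarity-weighted average
    over the k most similar users who rated i.

    sim[(a, b)]: similarity of users a and b; ir[i]: [(user, rating), ...].
    """
    neighbors = [(sim.get((u, v), 0.0), r) for v, r in ir.get(i, [])]
    top_k = heapq.nlargest(k, neighbors)  # the k nearest neighbours
    num = sum(s * r for s, r in top_k if s > 0)
    den = sum(s for s, _ in top_k if s > 0)
    if den == 0:
        raise ValueError("prediction impossible")  # no usable neighbour
    actual_k = sum(1 for s, _ in top_k if s > 0)
    return num / den, {"actual_k": actual_k}
```

The returned details dict mirrors the {'actual_k': actual_k} shape mentioned in the docstring above.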
Data splitting (split)
The splitting operations in the split module act on the whole data set; they suit rating-prediction applications but not Top-N recommendation (to be verified).
The ShuffleSplit class
Randomly splits all the training data into k folds.
Splits the data into a training set and a test set.
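A minimal sketch of the shuffle-split idea in plain Python (surprise's ShuffleSplit works on a Dataset and has more options, e.g. train_size and random_state): each of n_splits iterations shuffles the rating indices and cuts off a test part.

```python
import random

def shuffle_split(n_ratings, n_splits=3, test_size=0.2, seed=0):
    """Yield (train_indices, test_indices) pairs, one per split."""
    rng = random.Random(seed)  # fixed seed for reproducible splits
    n_test = int(n_ratings * test_size)
    for _ in range(n_splits):
        idx = list(range(n_ratings))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]
```

Note that each split re-shuffles independently, so unlike k-fold cross-validation the test sets of different splits may overlap.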