Python Surprise library: a translation of the Surprise documentation

Published: 2023/12/8 | python | 豆豆


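The formatting here has only been lightly cleaned up; see the OneNote notes link for reference.

Because OneNote has removed single-page sharing, leave your email address if you need the notes and I will send a PDF version by email. I will find a proper fix for this later.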

Installing the Surprise recommender library

pip install scikit-surprise

Basic usage

● Automatic cross-validation

from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# We'll use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

load_builtin offers to download the movielens-100k dataset if it is not already present and stores it under the ~/.surprise_data directory.

● Using a custom dataset

import os

from surprise import BaselineOnly, Dataset, Reader
from surprise.model_selection import cross_validate

# Path to the dataset file.
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)

# We can now use this dataset as we please, e.g. calling cross_validate.
cross_validate(BaselineOnly(), data, verbose=True)

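Cross-validation

○ cross_validate(algo, data, measures=[...], cv=...) takes the algorithm, the dataset, the list of accuracy measures and the number of folds.

○ For finer control, iterate over the folds yourself with KFold and call the algorithm's fit and test methods; LeaveOneOut and ShuffleSplit can be used the same way.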

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import KFold

# Load the movielens-100k dataset.
data = Dataset.load_builtin('ml-100k')

# Define a cross-validation iterator.
kf = KFold(n_splits=3)

algo = SVD()

for trainset, testset in kf.split(data):

    # Train and test the algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print the Root Mean Squared Error.
    accuracy.rmse(predictions, verbose=True)
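To make explicit what a k-fold iterator yields, here is a minimal pure-Python sketch. It is not Surprise's implementation: kfold_split is a hypothetical helper and the rating tuples are made-up sample data. It shuffles the ratings once and lets every rating appear in exactly one testset.

```python
import random


def kfold_split(ratings, n_splits=3, seed=0):
    """Yield (trainset, testset) pairs; each rating is tested exactly once."""
    if not 2 <= n_splits <= len(ratings):
        raise ValueError("n_splits must be between 2 and len(ratings)")
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    for fold in range(n_splits):
        # Every n_splits-th rating, starting at `fold`, goes to the testset.
        testset = shuffled[fold::n_splits]
        trainset = [r for i, r in enumerate(shuffled) if i % n_splits != fold]
        yield trainset, testset


# Made-up (user, item, rating) tuples, for illustration only.
ratings = [("u%d" % (i % 4), "i%d" % i, float(1 + i % 5)) for i in range(10)]
folds = list(kfold_split(ratings, n_splits=3))
for trainset, testset in folds:
    print(len(trainset), len(testset))
```

With 10 ratings and 3 splits the testsets cannot all have the same size, so one fold holds 4 ratings and the other two hold 3.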

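Tuning algorithm parameters with GridSearchCV

When you need to compare different values of an algorithm's parameters, the GridSearchCV class runs cross-validation for every combination and reports the best one.

For example, to try different values for the parameters of SVD: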

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

# Use movielens-100k.
data = Dataset.load_builtin('ml-100k')

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# Best RMSE score.
print(gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score.
print(gs.best_params['rmse'])

# We can now use the algorithm that yields the best RMSE:
algo = gs.best_estimator['rmse']
algo.fit(data.build_full_trainset())
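The number of combinations grows multiplicatively with each parameter list. The sketch below only enumerates the combinations of the param_grid used above with itertools.product; it illustrates the idea of a grid and is not GridSearchCV's own code.

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}

names = list(param_grid)
# One dict per point of the grid: 2 * 2 * 2 = 8 combinations.
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[n] for n in names))]

for params in combinations:
    print(params)
```

With cv=3, each of the 8 combinations is fitted on 3 splits, i.e. 24 fits in total, which is worth keeping in mind before enlarging the grid.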

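Using prediction algorithms

○ Baseline estimates configuration

§ Parameters when using Alternating Least Squares (ALS):

1) reg_i: regularization parameter for items, default 10

2) reg_u: regularization parameter for users, default 15

3) n_epochs: number of iterations of the ALS procedure, default 10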

from surprise import BaselineOnly

print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)

§ Parameters when using Stochastic Gradient Descent (SGD):

1) reg: regularization parameter of the cost function being optimized, default 0.02

2) learning_rate: learning rate of SGD, default 0.005

3) n_epochs: number of iterations of the SGD procedure, default 20

from surprise import BaselineOnly

print('Using SGD')
bsl_options = {'method': 'sgd',
               'learning_rate': .00005,
               }
algo = BaselineOnly(bsl_options=bsl_options)

§ Passing baseline options when creating a KNN algorithm (they are used by the pearson_baseline similarity):

from surprise import KNNBasic

bsl_options = {'method': 'als',
               'n_epochs': 20,
               }
sim_options = {'name': 'pearson_baseline'}
algo = KNNBasic(bsl_options=bsl_options, sim_options=sim_options)
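To show what reg_u, reg_i and n_epochs control in the ALS configuration above, here is a minimal pure-Python sketch of baseline estimation, where the baseline for a pair is the global mean plus a user baseline plus an item baseline. It follows the usual alternating scheme (solve for the item baselines with the user baselines fixed, then the reverse). The function name and the sample ratings are made up, and Surprise's own implementation may differ in details.

```python
from collections import defaultdict


def als_baselines(ratings, n_epochs=10, reg_u=15, reg_i=10):
    """Return (global mean, user baselines, item baselines)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = defaultdict(float)
    bi = defaultdict(float)
    for _ in range(n_epochs):
        # Fix the user baselines and solve for each item baseline.
        num, cnt = defaultdict(float), defaultdict(int)
        for u, i, r in ratings:
            num[i] += r - mu - bu[u]
            cnt[i] += 1
        for i in num:
            bi[i] = num[i] / (reg_i + cnt[i])
        # Fix the item baselines and solve for each user baseline.
        num, cnt = defaultdict(float), defaultdict(int)
        for u, i, r in ratings:
            num[u] += r - mu - bi[i]
            cnt[u] += 1
        for u in num:
            bu[u] = num[u] / (reg_u + cnt[u])
    return mu, dict(bu), dict(bi)


# Made-up ratings, for illustration only.
ratings = [("alice", "m1", 5.0), ("alice", "m2", 3.0),
           ("bob", "m1", 4.0), ("bob", "m2", 2.0)]
mu, bu, bi = als_baselines(ratings, n_epochs=5, reg_u=12, reg_i=5)
print(mu, bu, bi)
```

Larger regularization values shrink the baselines towards zero, so predictions stay closer to the global mean for users and items with few ratings.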

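○ Similarity measure configuration

§ name: name of the similarity measure to use, default MSD

§ user_based: whether similarities are computed between users (True) or between items (False), default True

§ min_support: minimum number of common items (when user_based is True) or common users (when user_based is False) required; if the count is below min_support, the similarity is 0

§ shrinkage: shrinkage parameter, only relevant for the pearson_baseline similarity, default 100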

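i. Cosine similarity between items:

from surprise import KNNBasic

sim_options = {'name': 'cosine',
               'user_based': False  # compute similarities between items
               }
algo = KNNBasic(sim_options=sim_options)

ii. pearson_baseline similarity without shrinkage:

from surprise import KNNBasic

sim_options = {'name': 'pearson_baseline',
               'shrinkage': 0  # no shrinkage
               }
algo = KNNBasic(sim_options=sim_options)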
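As an illustration of how min_support interacts with a similarity measure, here is a small pure-Python sketch of the MSD (mean squared difference) similarity between two users. The helper name and the ratings are hypothetical; the sketch assumes the common definition where the similarity is 1 / (msd + 1).

```python
def msd_similarity(ratings_u, ratings_v, min_support=1):
    """MSD similarity between two users given {item: rating} dicts."""
    common = set(ratings_u) & set(ratings_v)
    if not common or len(common) < min_support:
        # Too few items in common: the similarity is defined as 0.
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    # The +1 avoids a division by zero when the ratings are identical.
    return 1.0 / (msd + 1.0)


alice = {"m1": 5.0, "m2": 3.0, "m3": 4.0}
bob = {"m1": 4.0, "m2": 2.0}
print(msd_similarity(alice, bob))
print(msd_similarity(alice, bob, min_support=3))
```

The two users share two items, so raising min_support to 3 forces the similarity to 0 even though their ratings are close.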

● Other questions (FAQ)

○ How to get the top-N recommendations for each user

from collections import defaultdict

from surprise import SVD
from surprise import Dataset


def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
        A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user.
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

○ How to compute precision@k and recall@k

from collections import defaultdict

from surprise import Dataset
from surprise import SVD
from surprise.model_selection import KFold


def precision_recall_at_k(predictions, k=10, threshold=3.5):
    '''Return precision and recall at k metrics for each user.'''

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value.
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items.
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k.
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k.
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: proportion of recommended items that are relevant.
        # Undefined when nothing is recommended; this example falls back to 1.
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1

        # Recall@K: proportion of relevant items that are recommended.
        # Undefined when nothing is relevant; this example falls back to 1.
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1

    return precisions, recalls


data = Dataset.load_builtin('ml-100k')
kf = KFold(n_splits=5)
algo = SVD()

for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users.
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print(sum(rec for rec in recalls.values()) / len(recalls))

○ How to get the k nearest neighbors of a user (or item)

import io  # needed because of weird encoding of u.item file

from surprise import KNNBaseline
from surprise import Dataset
from surprise import get_dataset_dir


def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """

    file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid


# First, train the algorithm to compute the similarities between items.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

# Read the mappings raw id <-> movie name.
rid_to_name, name_to_rid = read_item_names()

# Retrieve the inner id of the movie Toy Story.
toy_story_raw_id = name_to_rid['Toy Story (1995)']
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

# Retrieve the inner ids of the nearest neighbors of Toy Story.
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)

# Convert the inner ids of the neighbors into names.
toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_to_name[rid]
                       for rid in toy_story_neighbors)

print()
print('The 10 nearest neighbors of Toy Story are:')
for movie in toy_story_neighbors:
    print(movie)

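○ What are raw ids and inner ids?

i. Users and items each have a raw id and an inner id. Raw ids are the ids defined in the ratings file or in the pandas DataFrame. The key point is that predict() and the other user-facing methods take raw ids. Note that raw ids read from a file are strings.

ii. When a trainset is created, each raw id is mapped to an inner id, a unique integer that is more convenient for Surprise to manipulate. Convert between the two with the trainset's to_inner_uid(), to_inner_iid(), to_raw_uid() and to_raw_iid() methods.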
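The raw-to-inner mapping can be pictured with a few lines of plain Python. This is only a plausible sketch: build_id_maps is a hypothetical helper, and assigning inner ids in order of first appearance is an assumption about the ordering, not a documented guarantee of Surprise.

```python
def build_id_maps(raw_ids):
    """Assign consecutive integers to raw ids in order of first appearance."""
    raw_to_inner = {}
    for raw in raw_ids:
        if raw not in raw_to_inner:
            raw_to_inner[raw] = len(raw_to_inner)
    inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}
    return raw_to_inner, inner_to_raw


# Raw ids read from a ratings file are strings, even when they look numeric.
raw_user_ids = ["196", "186", "22", "196", "244"]
raw_to_inner, inner_to_raw = build_id_maps(raw_user_ids)
print(raw_to_inner)
print(inner_to_raw[2])
```

Because inner ids are dense integers starting at 0, they can index rows of a similarity matrix directly, which is why methods such as get_neighbors work with inner ids.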

○ Where are the built-in datasets downloaded, and how can this location be changed?

i. By default, datasets are downloaded to the "~/.surprise_data" folder.

ii. To change the location, set the "SURPRISE_DATA_FOLDER" environment variable.
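The lookup rule described above can be sketched as follows. dataset_dir is a hypothetical helper written for this note (Surprise exposes its own get_dataset_dir), and the example path is made up.

```python
import os


def dataset_dir(environ=None):
    """Return the dataset folder, honoring SURPRISE_DATA_FOLDER when set."""
    environ = os.environ if environ is None else environ
    default = os.path.join(os.path.expanduser("~"), ".surprise_data")
    return environ.get("SURPRISE_DATA_FOLDER", default)


# Pass a dict instead of os.environ to see both cases without side effects.
print(dataset_dir({}))
print(dataset_dir({"SURPRISE_DATA_FOLDER": "/tmp/surprise_data"}))
```

In practice the variable should be set before any dataset is loaded, for example with export in the shell or by assigning os.environ['SURPRISE_DATA_FOLDER'] at the top of the script.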

● API reference

○ The prediction_algorithms package

random_pred.NormalPredictor Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.

baseline_only.BaselineOnly Algorithm predicting the baseline estimate for given user and item.

knns.KNNBasic A basic collaborative filtering algorithm.

knns.KNNWithMeans A basic collaborative filtering algorithm, taking into account the mean ratings of each user.

knns.KNNWithZScore A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

knns.KNNBaseline A basic collaborative filtering algorithm taking into account a baseline rating.

matrix_factorization.SVD The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.

matrix_factorization.SVDpp The SVD++ algorithm, an extension of SVD taking into account implicit ratings.

matrix_factorization.NMF A collaborative filtering algorithm based on Non-negative Matrix Factorization.

slope_one.SlopeOne A simple yet accurate collaborative filtering algorithm.

co_clustering.CoClustering A collaborative filtering algorithm based on co-clustering.

○ The algorithm base class

§ class surprise.prediction_algorithms.algo_base.AlgoBase(**kwargs)

§ If the algorithm needs to compute baseline estimates, the baseline_options argument (passed as bsl_options in the examples above) configures how they are computed

§ Methods:

1) compute_baselines(): computes the user and item baselines. It is only relevant for algorithms that use the pearson_baseline similarity or for the BaselineOnly algorithm. Returns a tuple containing the user baselines and the item baselines

2) compute_similarities(): builds the similarity matrix. How it is computed depends on the sim_options argument passed when the algorithm was created. Returns the similarity matrix

3) default_prediction(): the prediction used when an exception occurs during estimation. By default it is the mean of all ratings in the trainset (override it in a subclass to change this). Returns a float

4) fit(trainset): trains the algorithm on the given trainset. Every derived class calls this method as the first step of training; it initializes some internal structures and sets the self.trainset attribute. Returns self

5) get_neighbors(iid, k): returns the k nearest neighbors of the given inner id, which refers to a user or to an item depending on whether user_based in sim_options is True or False. Returns a list of the inner ids of the k nearest neighbors

6) predict(uid, iid, r_ui=None, clip=True, verbose=False): computes the rating prediction for the given user and item. The method converts the raw ids to inner ids and then calls the estimate method defined in each derived class. If the prediction is impossible, the value of default_prediction() is used instead

clip controls whether the estimate is clipped to the rating scale. For example, if the estimate is 5.5 and the rating scale is [1, 5], the estimate is set to 5; if the estimate is below 1, it is set to 1. Default is True

verbose controls whether the details of each prediction are printed. Default is False

Returns a Prediction object containing:

a) the raw user id

b) the raw item id

c) the true rating

d) the estimated rating

e) additional details about the prediction that may be useful for later analysis

7) test(testset, verbose=False): tests the algorithm on the given testset, i.e. estimates every rating in it. Returns a list of Prediction objects
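The clip behaviour of predict() described above amounts to clamping the estimate to the rating scale. A minimal sketch, where clip_estimate is a hypothetical helper and not part of Surprise:

```python
def clip_estimate(est, rating_scale=(1, 5)):
    """Clamp an estimated rating to the [lower, upper] rating scale."""
    lower, upper = rating_scale
    return min(upper, max(lower, est))


print(clip_estimate(5.5))  # above the scale, clamped to the upper bound
print(clip_estimate(0.2))  # below the scale, clamped to the lower bound
print(clip_estimate(3.7))  # already inside the scale, left unchanged
```

Passing clip=False to predict() is useful when you want to inspect how far outside the scale the raw estimates fall.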

○ The predictions module

§ The surprise.prediction_algorithms.predictions module defines the Prediction named tuple and the PredictionImpossible exception

§ Prediction

□ A named tuple for storing the result of a prediction

□ It is wrapped in a class only for documentation and printing purposes

□ Fields:

uid: the raw user id

iid: the raw item id

r_ui: the true rating, a float

est: the estimated rating, a float

details: additional details about the prediction

§ surprise.prediction_algorithms.predictions.PredictionImpossible

□ Raised when a prediction is impossible

□ When it is raised, the estimate is set to the default value (the global mean of all ratings)
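The interplay between the two can be sketched in plain Python. The Prediction and PredictionImpossible defined here are local stand-ins that mirror the fields listed above, not imports from Surprise; toy_estimate, the default of 3.5 and the sample ids are made up for illustration.

```python
from collections import namedtuple

# Local stand-in with the same five fields as the named tuple described above.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


class PredictionImpossible(Exception):
    """Raised by an estimator that cannot produce a rating."""


def predict(estimate, uid, iid, r_ui=None, default=3.5):
    """Call estimate() and fall back to `default` when it is impossible."""
    details = {}
    try:
        est = estimate(uid, iid)
        details["was_impossible"] = False
    except PredictionImpossible as exc:
        est = default
        details["was_impossible"] = True
        details["reason"] = str(exc)
    return Prediction(uid, iid, r_ui, est, details)


def toy_estimate(uid, iid):
    """Hypothetical estimator that only knows one (user, item) pair."""
    known = {("alice", "m1"): 4.2}
    if (uid, iid) not in known:
        raise PredictionImpossible("User and/or item is unknown.")
    return known[(uid, iid)]


print(predict(toy_estimate, "alice", "m1", r_ui=4.0))
print(predict(toy_estimate, "carol", "m9"))
```

Checking the details field after a test run is a quick way to count how many estimates were fallbacks rather than real predictions.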

○ The model_selection package

§ Cross-validation iterators

□ The module contains several cross-validation iterators:

KFold: a basic cross-validation iterator

RepeatedKFold: a repeated KFold cross-validation iterator

ShuffleSplit: a basic cross-validation iterator with random trainsets and testsets

LeaveOneOut: a cross-validation iterator where each user has exactly one rating in the testset

PredefinedKFold: a cross-validation iterator for datasets loaded with the load_from_folds method

□ The module also contains a function that splits a dataset into a trainset and a testset:

train_test_split(data, test_size=0.2, train_size=None, random_state=None, shuffle=True)

data: the dataset to split

test_size: if a float, the proportion of ratings to include in the testset; if an int, the absolute number of ratings in the testset; if None, the complement of the trainset size. Default is 0.2

train_size: if a float, the proportion of ratings to include in the trainset; if an int, the absolute number of ratings in the trainset; if None, the complement of the testset size. Default is None

random_state: an int used as the random seed (a numpy RandomState instance or None is also accepted). Set it to get the same split on every call

shuffle: bool, whether to shuffle the ratings before splitting. Default is True
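How test_size, the seed and shuffle interact can be sketched without Surprise. split_ratings is a hypothetical helper; rounding a fractional test size up with ceil is an assumption made for this sketch, and train_size is left out for brevity.

```python
import math
import random


def split_ratings(ratings, test_size=0.2, seed=None, shuffle=True):
    """Return (train, test); test_size is a proportion or an absolute count."""
    ratings = list(ratings)  # copy, so the caller's list is not shuffled
    if isinstance(test_size, float):
        n_test = math.ceil(test_size * len(ratings))
    else:
        n_test = int(test_size)
    if shuffle:
        random.Random(seed).shuffle(ratings)
    return ratings[n_test:], ratings[:n_test]


ratings = list(range(10))  # stand-ins for rating tuples
train, test = split_ratings(ratings, test_size=0.2, seed=42)
print(len(train), len(test))
```

Calling the function twice with the same seed returns the same split, which is the purpose of random_state.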

§ Cross-validation

surprise.model_selection.validation.cross_validate(algo, data, measures=[u'rmse', u'mae'], cv=None, return_train_measures=False, n_jobs=1, pre_dispatch=u'2 * n_jobs', verbose=False)

• algo: the algorithm to evaluate

• data: the dataset

• measures: list of strings, the accuracy measures to compute

• cv: a cross-validation iterator, an int or None. If an iterator, it is used as given; if an int, KFold is used with that number of splits; if None, KFold is used with the default of 5 splits

• return_train_measures: whether to also compute the measures on the trainsets. Default is False

• n_jobs: int, the maximum number of folds evaluated in parallel. If -1, all CPUs are used; if 1, no parallel code is used at all (useful for debugging); for values below -1, (number of CPUs + 1 + n_jobs) CPUs are used. Default is 1

• pre_dispatch: int or string, controls the number of jobs dispatched during parallel execution. Reducing it helps avoid excessive memory use when more jobs are dispatched than the CPUs can process. It can be:

None: all jobs are created and spawned immediately

an int: the exact total number of jobs to spawn

a string: an expression as a function of n_jobs, such as "2*n_jobs"

Default is "2*n_jobs"

Returns a dict with the keys:

• test_*: where * is the measure, e.g. "test_rmse"

• train_*: where * is the measure, e.g. "train_rmse". Only present when return_train_measures is True

• fit_time: array with the time, in seconds, spent fitting the algorithm on each split

• test_time: array with the time, in seconds, spent testing the algorithm on each split
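The n_jobs rule above is simple arithmetic. The sketch below restates it as a function; effective_workers is a hypothetical helper for illustration only, since the actual scheduling is done by joblib.

```python
import os


def effective_workers(n_jobs, n_cpus=None):
    """Number of CPUs used for a given n_jobs, following the rule above."""
    n_cpus = n_cpus or os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 means all CPUs, -2 all but one, and so on (never below 1).
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs


for n_jobs in (1, -1, -2):
    print(n_jobs, effective_workers(n_jobs, n_cpus=8))
```

On an 8-CPU machine, n_jobs=-2 therefore leaves one CPU free, which can keep the machine responsive during a long evaluation.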

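§ Parameter search

□ class surprise.model_selection.search.GridSearchCV(algo_class, param_grid, measures=[u'rmse', u'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=1, pre_dispatch=u'2 * n_jobs', joblib_verbose=0)

• The parameters are similar to those of cross_validate above; algo_class is the algorithm class (not an instance) and param_grid is a dict mapping parameter names to the lists of values to try

• refit: bool or str. If True, the algorithm is refitted on the whole dataset using the parameters that gave the best average performance for the first measure in measures; pass a string to select another measure. Default is False

• joblib_verbose: controls the verbosity of joblib; the higher the integer, the more messages

□ Attributes and methods:

a) best_estimator: dict keyed by measure, giving the algorithm that achieved the best accuracy for that measure, averaged over all splits

b) best_score: dict of floats keyed by measure, the best average score achieved for that measure

c) best_params: dict keyed by measure, the parameter combination that gave the best score for that measure

d) best_index: dict of ints keyed by measure, the index into cv_results of the combination with the best average accuracy for that measure

e) cv_results: dict of arrays holding the accuracy measures over all splits as well as the fit and test times for every parameter combination

f) fit(data): runs the algorithm's fit() for every parameter combination over the splits given by the cv parameter

g) predict(): calls predict() on the estimator with the best parameters found; only available when refit is not False; arguments as for predict() above

h) test(): calls test() on the estimator with the best parameters found; only available when refit is not False; arguments as for test() above

□ class surprise.model_selection.search.RandomizedSearchCV(algo_class, param_distributions, n_iter=10, measures=[u'rmse', u'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=1, pre_dispatch=u'2 * n_jobs', random_state=None, joblib_verbose=0)

Samples n_iter parameter combinations at random instead of exhaustively enumerating all of them as GridSearchCV does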

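○ The similarities module

§ The similarities module contains tools for computing similarities between users or items:

1) cosine

2) msd (mean squared difference)

3) pearson

4) pearson_baseline

○ The accuracy module

§ The surprise.accuracy module provides tools for computing accuracy metrics on a set of predictions:

1) rmse (root mean squared error)

2) mae (mean absolute error)

3) fcp (fraction of concordant pairs)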
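For reference, the first two metrics can be written in a few lines of plain Python. This is a sketch on made-up (true rating, estimate) pairs, not the code of surprise.accuracy.

```python
from math import sqrt


def rmse(pairs):
    """Root mean squared error over (true rating, estimate) pairs."""
    return sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true rating, estimate) pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)


# Made-up (true rating, estimate) pairs, for illustration only.
pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0), (3.0, 3.0)]
print(rmse(pairs))
print(mae(pairs))
```

RMSE penalizes the single error of 1.0 more heavily than MAE does, which is why it comes out larger on this sample.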

○ The dataset module

§ The dataset module defines the Dataset class and its subclasses, which are used to manage datasets

§ class surprise.dataset.Dataset(reader)

§ Methods:

1) load_builtin(name=u'ml-100k'): loads a built-in dataset and returns a Dataset object

2) load_from_df(df, reader): loads a dataset from a pandas DataFrame. df must have three columns, in this order: raw user id, raw item id, rating. reader is the Reader that describes the data

3) load_from_file(file_path, reader): loads a dataset from a file; takes the file path and a Reader

4) load_from_folds(folds_files, reader): handles the special case where the trainsets and testsets are already defined in files, as with the predefined folds of the movielens-100k dataset

○ The Trainset class

§ class surprise.Trainset(ur, ir, n_users, n_items, n_ratings, rating_scale, offset, raw2inner_id_users, raw2inner_id_items)

§ Attributes:

1) ur: the users' ratings, a dict of lists of (item_inner_id, rating) tuples keyed by user inner id

2) ir: the items' ratings, a dict of lists of (user_inner_id, rating) tuples keyed by item inner id

3) n_users: the number of users

4) n_items: the number of items

5) n_ratings: the total number of ratings

6) rating_scale: tuple with the lowest and highest possible ratings

7) global_mean: the mean of all ratings

§ Methods:

1) all_items(): generator over all items; yields the inner id of each item

2) all_ratings(): generator over all ratings; yields (uid, iid, rating) tuples

3) all_users(): generator over all users; yields the inner id of each user

4) build_anti_testset(fill=None): returns a list of ratings that can be used as a testset in the test() method. It contains every (user, item) pair where both are known but the rating is not in the trainset. fill is the value assigned to these unknown ratings; if None, global_mean is used

5) knows_item(iid): indicates whether the item is part of the trainset

6) knows_user(uid): indicates whether the user is part of the trainset

7) to_inner_iid(riid): converts an item raw id to an inner id

8) to_inner_uid(ruid): converts a user raw id to an inner id

9) to_raw_iid(iiid): converts an item inner id to a raw id

10) to_raw_uid(iuid): converts a user inner id to a raw id
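What build_anti_testset produces can be illustrated with a small pure-Python sketch. The function below is a stand-in written for this note and works on raw (user, item, rating) tuples rather than on a Trainset; the ratings are made up.

```python
def build_anti_testset(ratings, fill=None):
    """List (user, item, fill) for every known pair without a rating."""
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    fill = global_mean if fill is None else fill
    rated = {(u, i) for u, i, _ in ratings}
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    return [(u, i, fill) for u in users for i in items if (u, i) not in rated]


# Made-up ratings: bob has not rated m2, so that is the only missing pair.
ratings = [("alice", "m1", 5.0), ("alice", "m2", 3.0), ("bob", "m1", 4.0)]
anti = build_anti_testset(ratings)
print(anti)
```

The anti-testset has n_users * n_items - n_ratings entries. For movielens-100k (943 users, 1682 items, 100000 ratings) that is 1486126 pairs, so predicting on it takes noticeably longer than on a regular testset.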

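○ The Reader class

§ class surprise.reader.Reader(name=None, line_format=u'user item rating', sep=None, rating_scale=(1, 5), skip_lines=0)

The Reader class is used to parse a file containing ratings. Each line of the file must hold exactly one rating and follow the structure: user ; item ; rating ; [timestamp]. The fields may appear in any order, but the order must be declared through line_format, and the separator through sep

§ Parameters:

1) name: if specified, a Reader for one of the built-in datasets is returned and the other parameters are ignored. Accepted values are "ml-100k", "ml-1m" and "jester". Default is None

2) line_format: string, the field names in the order in which they appear in the file, separated by spaces. Default is "user item rating"

3) sep: char, the separator between fields

4) rating_scale: tuple, the rating scale. Default is (1, 5)

5) skip_lines: int, the number of lines to skip at the beginning of the file. Default is 0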
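To show how line_format and sep work together, here is a minimal parsing sketch. parse_line is a hypothetical helper, not the Reader's actual method, and the sample lines are illustrative.

```python
def parse_line(line, line_format="user item rating", sep=None):
    """Return (user, item, rating, timestamp or None) from one line."""
    names = line_format.split()
    values = line.strip().split(sep)
    if len(values) != len(names):
        raise ValueError("line does not match line_format: %r" % line)
    fields = dict(zip(names, values))
    # Ids stay strings (raw ids); only the rating is converted to a float.
    return (fields["user"], fields["item"], float(fields["rating"]),
            fields.get("timestamp"))


# Tab-separated line in 'user item rating timestamp' order.
print(parse_line("196\t242\t3\t881250949",
                 line_format="user item rating timestamp", sep="\t"))
# Comma-separated line where the item comes first and there is no timestamp.
print(parse_line("242,196,3", line_format="item user rating", sep=","))
```

Both calls return the same user, item and rating, which shows that the field order in the file does not matter as long as line_format declares it.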

○ The dump module

§ surprise.dump.dump(file_name, predictions=None, algo=None, verbose=0)

□ A basic wrapper around pickle for serializing a list of predictions and/or an algorithm

□ Parameters:

a) file_name: str, where to write the dump

b) predictions: list of Prediction, the predictions to dump

c) algo: the algorithm to dump

d) verbose: verbosity level, 0 or 1

§ surprise.dump.load(file_name)

□ Loads a dump file

□ Returns a tuple (predictions, algo), where either element may be None
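Since dump is described as a thin wrapper around pickle, the round trip can be sketched with the standard library alone. The dump and load functions below are local stand-ins, not the Surprise functions, and the dict layout with 'predictions' and 'algo' keys is an assumption made for illustration.

```python
import os
import pickle
import tempfile


def dump(file_name, predictions=None, algo=None):
    """Pickle the predictions and/or the algorithm into one file."""
    with open(file_name, "wb") as f:
        pickle.dump({"predictions": predictions, "algo": algo}, f,
                    protocol=pickle.HIGHEST_PROTOCOL)


def load(file_name):
    """Return the (predictions, algo) tuple stored by dump()."""
    with open(file_name, "rb") as f:
        obj = pickle.load(f)
    return obj["predictions"], obj["algo"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dump_file")
    # Only predictions are dumped here, so algo comes back as None.
    dump(path, predictions=[("alice", "m1", 4.0, 4.2)])
    predictions, algo = load(path)

print(predictions, algo)
```

As with any pickle file, only load dumps that come from a source you trust.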

Source: segmentfault. Author: Wildcard.
