
Model Fusion (Stacking & Blending)

Published: 2024/1/23

1. Blending

Blending first learns a weight for each model's predictions, then combines the predictions linearly.
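As a minimal illustration of the idea (synthetic data; the base models and split ratio here are my own choices, not from the original post), hold-out blending can be sketched as:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
# hold out part of the training data for fitting the meta-model
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

base_models = [RandomForestClassifier(n_estimators=50, random_state=0),
               GradientBoostingClassifier(n_estimators=50, random_state=0)]

# base models are fit on the training split only
val_preds = []
for model in base_models:
    model.fit(X_train, y_train)
    val_preds.append(model.predict_proba(X_val)[:, 1])

# the meta-model learns the linear combination weights on the hold-out predictions
meta = LogisticRegression()
meta.fit(np.column_stack(val_preds), y_val)
print(meta.coef_)  # learned blending weights, one per base model
```

At prediction time, each base model scores the new data and the meta-model combines those scores with the learned weights.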

"""Kaggle competition: Predicting a Biological Response. Blending {RandomForests, ExtraTrees, GradientBoosting} + stretching to [0,1]. The blending scheme is related to the idea Jose H. Solorzano presented here: http://www.kaggle.com/c/bioresponse/forums/t/1889/question-about-the-process-of-ensemble-learning/10950#post10950 '''You can try this: In one of the 5 folds, train the models, then use the results of the models as 'variables' in logistic regression over the validation data of that fold'''. Or at least this is the implementation of my understanding of that idea :-) The predictions are saved in test.csv. The code below created my best submission to the competition: - public score (25%): 0.43464 - private score (75%): 0.37751 - final rank on the private leaderboard: 17th over 711 teams :-) Note: if you increase the number of estimators of the classifiers, e.g. n_estimators=1000, you get a better score/rank on the private test set. Copyright 2012, Emanuele Olivetti. BSD license, 3 clauses. """from __future__ import division import numpy as np import load_data from sklearn.cross_validation import StratifiedKFold from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier from sklearn.ensemble import GradientBoostingClassifier from sklearn.linear_model import LogisticRegressiondef logloss(attempt, actual, epsilon=1.0e-15):"""Logloss, i.e. 
the score of the bioresponse competition."""attempt = np.clip(attempt, epsilon, 1.0-epsilon)return - np.mean(actual * np.log(attempt) +(1.0 - actual) * np.log(1.0 - attempt))if __name__ == '__main__':np.random.seed(0) # seed to shuffle the train setn_folds = 10verbose = Trueshuffle = FalseX, y, X_submission = load_data.load()if shuffle:idx = np.random.permutation(y.size)X = X[idx]y = y[idx]skf = list(StratifiedKFold(y, n_folds))clfs = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50)]print "Creating train and test sets for blending."dataset_blend_train = np.zeros((X.shape[0], len(clfs)))dataset_blend_test = np.zeros((X_submission.shape[0], len(clfs)))for j, clf in enumerate(clfs):print j, clfdataset_blend_test_j = np.zeros((X_submission.shape[0], len(skf)))for i, (train, test) in enumerate(skf):print "Fold", iX_train = X[train]y_train = y[train]X_test = X[test]y_test = y[test]clf.fit(X_train, y_train)y_submission = clf.predict_proba(X_test)[:, 1]dataset_blend_train[test, j] = y_submissiondataset_blend_test_j[:, i] = clf.predict_proba(X_submission)[:, 1]dataset_blend_test[:, j] = dataset_blend_test_j.mean(1)printprint "Blending."clf = LogisticRegression()clf.fit(dataset_blend_train, y)y_submission = clf.predict_proba(dataset_blend_test)[:, 1]print "Linear stretch of predictions to [0,1]"y_submission = (y_submission - y_submission.min()) / (y_submission.max() - y_submission.min())print "Saving Results."tmp = np.vstack([range(1, len(y_submission)+1), y_submission]).Tnp.savetxt(fname='submission.csv', X=tmp, fmt='%d,%0.9f', header='MoleculeId,PredictedProbability', comments='') 2.stacking

The core of stacking: predict on the training set itself, and use those predictions to build a higher-level learner.

The stacking training procedure:

1) Split the training set. Randomly divide the training data into m roughly equal parts.

2) Train models on the split training set while also predicting on the test set. Train on m-1 of the parts and predict the remaining part; at the same time, use the model trained on those same m-1 parts to predict the real test set. Repeat this m times: stack the m out-of-fold training-set predictions into one column, and fuse the m test-set predictions into one column by averaging.

3) Repeat step 2 with k classifiers. This yields k columns of training-set predictions and k columns of test-set predictions.

4) Train on the data produced in step 3. Fit a second-level model on the k columns of training-set predictions against the true training labels, and use the k columns of test-set predictions as its test set.
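The four steps above are what scikit-learn's `StackingClassifier` automates: `cv` controls the m-fold out-of-fold scheme, `estimators` are the k first-level classifiers, and `final_estimator` is the second-level learner. A minimal sketch on synthetic data (the base learners here are my own choice, not the post's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# cv=5 performs the out-of-fold prediction scheme described above (m=5)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
                ("knn", KNeighborsClassifier(n_neighbors=15))],
    final_estimator=LogisticRegression(),  # second-level learner
    cv=5)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```

The manual implementation below does the same thing explicitly, which makes the bookkeeping of the two intermediate matrices easier to follow.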

```python
# -*- coding: utf-8 -*-
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression


def load_data():
    pass


def stacking(train_x, train_y, test):
    """Stacking.
    input:  train_x, train_y, test
    output: predictions for test
    clfs: 5 first-level classifiers
    dataset_blend_train: first-level predictions, i.e. train_x for the second level
    dataset_blend_test:  test set for the second level
    """
    # 5 first-level classifiers
    clfs = [SVC(C=3, kernel="rbf"),
            RandomForestClassifier(n_estimators=100, max_features="log2", max_depth=10,
                                   min_samples_leaf=1, bootstrap=True, n_jobs=-1,
                                   random_state=1),
            KNeighborsClassifier(n_neighbors=15, n_jobs=-1),
            xgb.XGBClassifier(n_estimators=100, objective="binary:logistic", gamma=1,
                              max_depth=10, subsample=0.8, n_jobs=-1, random_state=1),
            ExtraTreesClassifier(n_estimators=100, criterion="gini", max_features="log2",
                                 max_depth=10, min_samples_split=2, min_samples_leaf=1,
                                 bootstrap=True, n_jobs=-1, random_state=1)]

    # train_x and test for the second-level classifier
    dataset_blend_train = np.zeros((train_x.shape[0], len(clfs)), dtype=int)
    dataset_blend_test = np.zeros((test.shape[0], len(clfs)), dtype=int)

    # 8-fold predictions with each of the 5 classifiers
    n_folds = 8
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=1)
    for i, clf in enumerate(clfs):
        # per-fold predictions of this classifier on the test set
        dataset_blend_test_j = np.zeros((test.shape[0], n_folds))
        for j, (train_index, test_index) in enumerate(skf.split(train_x, train_y)):
            tr_x = train_x[train_index]
            tr_y = train_y[train_index]
            clf.fit(tr_x, tr_y)
            dataset_blend_train[test_index, i] = clf.predict(train_x[test_index])
            dataset_blend_test_j[:, j] = clf.predict(test)
        # majority vote over the folds: 1 iff at least 5 of the 8 folds predict 1
        dataset_blend_test[:, i] = dataset_blend_test_j.sum(axis=1) // (n_folds // 2 + 1)

    # second-level classifier
    # solver="liblinear" is required for penalty="l1" in current scikit-learn
    clf = LogisticRegression(penalty="l1", solver="liblinear", tol=1e-6, C=1.0,
                             random_state=1)
    clf.fit(dataset_blend_train, train_y)
    prediction = clf.predict(dataset_blend_test)
    return prediction


def main():
    (train_x, train_y, test) = load_data()
    prediction = stacking(train_x, train_y, test)
    return prediction


if __name__ == "__main__":
    prediction = main()
```

