[Algorithm Competition Study] Digital China Innovation Contest, Smart Ocean Construction: Task 5 Model Fusion

Published: 2023/12/15

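Smart Ocean Construction - Task 5: Model Fusion

5.1 Learning Objectives

Learn model fusion strategies

Complete the corresponding check-in tasks

5.2 Overview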

https://mlwave.com/kaggle-ensembling-guide/
https://github.com/MLWave/Kaggle-Ensemble-Guide

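Model fusion (ensembling) is an important step in the late stage of a competition. Broadly, the approaches fall into the following types.

Simple weighted fusion:

Regression (or class probabilities): arithmetic mean, geometric mean;
Classification: voting

boosting/bagging (already used inside xgboost, AdaBoost, and GBDT):

Boosting methods built from many trees

stacking/blending:

Build a multi-layer model and fit a further model on the predictions of the previous layer.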

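5.3 Theory

5.3.1 Simple weighted fusion

Averaging

For regression problems, a simple and direct approach is to average: take the mean of several models' outputs as the final prediction, thereby fusing several weak learners into a stronger one.

A slight improvement is the weighted average, where the weights can be set by ranking. For example, take three base models A, B, and C and rank them by performance. If their ranks are 1, 2, and 3, the weights assigned to them are 3/6, 2/6, and 1/6 respectively.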
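The rank-based weighting described above can be sketched in a few lines. The function names `rank_weights` and `weighted_average` and the three sample predictions are hypothetical, chosen only for illustration.

```python
def rank_weights(ranks):
    """Turn ranks (1 = best) into weights that sum to 1.

    A model ranked r out of n gets weight (n - r + 1) / (1 + 2 + ... + n).
    """
    n = len(ranks)
    total = n * (n + 1) / 2
    return [(n - r + 1) / total for r in ranks]


def weighted_average(predictions, weights):
    """Weighted average of per-model predictions for one sample."""
    return sum(p * w for p, w in zip(predictions, weights))


# Models A, B, C ranked 1, 2, 3 on a validation set.
weights = rank_weights([1, 2, 3])  # 3/6, 2/6, 1/6
# Hypothetical predictions from A, B, C for a single sample.
blended = weighted_average([10.0, 12.0, 16.0], weights)
print(weights, blended)
```

The better a model ranks, the closer the blended value sits to that model's prediction.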

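Averaging and weighted averaging look simple, but the more advanced algorithms can be said to build on them: Bagging and Boosting both rest on the idea of combining many weak learners into a strong one.

Averaging can also be applied to the predicted probabilities in classification problems.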
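As a minimal sketch of averaging class probabilities, the snippet below compares the arithmetic and geometric means mentioned in the overview. The probability values are made up for illustration, and the geometric mean assumes that no probability is exactly zero (otherwise the values would need clipping before taking the logarithm).

```python
import numpy as np

# Hypothetical class probabilities from three models for two samples.
# Axis 0 = model, axis 1 = sample, axis 2 = class (0/1/2).
probs = np.array([
    [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]],   # model A
    [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]],   # model B
    [[0.8, 0.1, 0.1], [0.3, 0.3, 0.4]],   # model C
])

# Arithmetic mean over the model axis.
arith = probs.mean(axis=0)

# Geometric mean over the model axis, renormalised so each row sums to 1.
geo = np.exp(np.log(probs).mean(axis=0))
geo = geo / geo.sum(axis=1, keepdims=True)

print(arith.argmax(axis=1), geo.argmax(axis=1))
```

The geometric mean penalises a class more strongly when any single model gives it a low probability, so the two averages can disagree on borderline samples.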

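Voting

For a binary classification problem with 3 base models, we can build a voting classifier on top of these base learners and take the class with the most votes as the predicted class.

Voting comes in two forms: hard voting and soft voting.

Hard voting: each model casts one vote for its predicted class, with no regard for the relative importance of the models; the class with the most votes is the final prediction.

Soft voting: the models' predicted class probabilities are averaged and the class with the highest average wins. Different weights can be set for different models to reflect their relative importance.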
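The difference between the two voting schemes shows up when one model is very confident and the others are not. The following is a small sketch with made-up probabilities; the equal `weights` vector is an assumption and can be changed to favour stronger models.

```python
import numpy as np

# Hypothetical probabilities of classes [0, 1] from three models for one sample.
probs = np.array([
    [0.45, 0.55],   # model 1 votes class 1
    [0.40, 0.60],   # model 2 votes class 1
    [0.95, 0.05],   # model 3 votes class 0, with high confidence
])

# Hard voting: each model casts one vote for its most likely class.
votes = probs.argmax(axis=1)
hard = np.bincount(votes, minlength=probs.shape[1]).argmax()

# Soft voting: average the probabilities (optionally weighted), then take the argmax.
weights = np.array([1.0, 1.0, 1.0])
soft = np.average(probs, axis=0, weights=weights).argmax()

print(hard, soft)
```

Here hard voting picks class 1 (two votes against one), while soft voting picks class 0 because the averaged probability of class 0 is 0.6.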

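5.3.2 stacking/blending

Stacking

Basic idea: train several base learners on the original training data (first layer), then use their predictions as a new training set to train another learner (second layer).

Background: to make the mechanics easier to follow, let us first fix a data setting.

The training set has size 10000*100 and the test set 3000*100. That is, the training set has 10000 rows and 100 features; the test set has 3000 rows and 100 features. The task is a regression problem.

The first layer uses three algorithms: XGB, LGB, and NN. The second layer uses GBDT.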

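Algorithm walkthrough

Stacking, first layer

XGB: corresponds to the "model 1" part of the figure

- Input: the training set, split into 5 folds.
- Processing:
  - Train an XGB model on folds 1, 2, 3, 4 and predict fold 5 and the test set; call the results **XGB-pred-tran5** (shape `2000*1`) and **XGB-pred-test1** (shape `3000*1`).
  - Train an XGB model on folds 1, 2, 3, 5 and predict fold 4 and the test set; call the results **XGB-pred-tran4** (shape `2000*1`) and **XGB-pred-test2** (shape `3000*1`).
  - Train an XGB model on folds 1, 2, 4, 5 and predict fold 3 and the test set; call the results **XGB-pred-tran3** (shape `2000*1`) and **XGB-pred-test3** (shape `3000*1`).
  - Train an XGB model on folds 1, 3, 4, 5 and predict fold 2 and the test set; call the results **XGB-pred-tran2** (shape `2000*1`) and **XGB-pred-test4** (shape `3000*1`).
  - Train an XGB model on folds 2, 3, 4, 5 and predict fold 1 and the test set; call the results **XGB-pred-tran1** (shape `2000*1`) and **XGB-pred-test5** (shape `3000*1`).
- Output:
  - Concatenate the XGB predictions for folds 1, 2, 3, 4, 5 to obtain **XGB-pred-tran** (shape `10000*1`). By construction of the 5-fold split, each prediction lines up with one row of the original data, which is why the figure calls it NEW FEATURE.
  - Average XGB-pred-test1 through XGB-pred-test5 (Averaging) to obtain **XGB-pred-test** (shape `3000*1`).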

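LGB: also corresponds to the "model 1" part of the figure

- Input: same as for XGB.
- Processing: same as for XGB; only the names of the predictions change, e.g. **LGB-pred-tran5** and **LGB-pred-test1**.
- Output:
  - Concatenate the LGB predictions for folds 1, 2, 3, 4, 5 to obtain **LGB-pred-tran** (shape `10000*1`).
  - Average LGB-pred-test1 through LGB-pred-test5 to obtain **LGB-pred-test** (shape `3000*1`).

NN: also corresponds to the "model 1" part of the figure

- Input: same as for XGB.
- Processing: same as for XGB; only the names of the predictions change, e.g. **NN-pred-tran5** and **NN-pred-test1**.
- Output:
  - Concatenate the NN predictions for folds 1, 2, 3, 4, 5 to obtain **NN-pred-tran** (shape `10000*1`).
  - Average NN-pred-test1 through NN-pred-test5 to obtain **NN-pred-test** (shape `3000*1`).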

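Stacking, second layer

Training set: concatenate the three new features XGB-pred-tran, LGB-pred-tran, and NN-pred-tran into a new training set (shape 10000*3).
Test set: concatenate the three new test predictions XGB-pred-test, LGB-pred-test, and NN-pred-test into a new test set (shape 3000*3).
Fit the second-layer predictor, i.e. the GBDT model, on the new training set and use it to predict on the new test set.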
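The two-layer procedure above can be sketched end to end with scikit-learn. This is an illustration under stated assumptions, not the original pipeline: the data is synthetic and scaled down to 1000*10 and 300*10, the helper `out_of_fold` is a hypothetical name, and `Ridge`, `DecisionTreeRegressor`, and `KNeighborsRegressor` stand in for XGB, LGB, and NN.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumption), scaled down from 10000*100 / 3000*100.
X_train = rng.normal(size=(1000, 10))
y_train = 2.0 * X_train[:, 0] + X_train[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)
X_test = rng.normal(size=(300, 10))


def out_of_fold(model, X, y, X_new, n_splits=5):
    """Return out-of-fold predictions for X and fold-averaged predictions for X_new."""
    oof = np.zeros(X.shape[0])
    new_pred = np.zeros((n_splits, X_new.shape[0]))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for i, (fit_idx, hold_idx) in enumerate(kf.split(X)):
        model.fit(X[fit_idx], y[fit_idx])
        oof[hold_idx] = model.predict(X[hold_idx])  # one "pred-tran" piece
        new_pred[i] = model.predict(X_new)          # one "pred-test" copy
    return oof, new_pred.mean(axis=0)


# Stand-ins for the XGB / LGB / NN first-layer models.
base_models = [
    Ridge(),
    DecisionTreeRegressor(max_depth=5, random_state=0),
    KNeighborsRegressor(),
]
columns = [out_of_fold(m, X_train, y_train, X_test) for m in base_models]
new_train = np.column_stack([c[0] for c in columns])  # one column per base model
new_test = np.column_stack([c[1] for c in columns])

# Second layer: a GBDT fitted on the new features.
meta = GradientBoostingRegressor(random_state=0).fit(new_train, y_train)
final_pred = meta.predict(new_test)
print(new_train.shape, new_test.shape, final_pred.shape)
```

Every row of `new_train` is predicted by a model that never saw that row during fitting, which is what keeps the second layer from simply memorising the first layer's training fit.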

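Blending

Blending is broadly the same as stacking. The main difference is that the second-stage features are not obtained from K-fold CV predictions on the training set; instead, a hold-out set is set aside. Put simply, blending trains different layers on disjoint subsets of the data.

Using the same data setting as above, let us build a two-layer blending model.

First split the training set into two parts (d1, d2): for example, d1 holds 4000 rows used for the first layer, and d2 holds 6000 rows used for the second layer.

First layer: train several models on d1 and use their predictions on d2 and on test as the New Features for the second layer. With the same three models as above, this yields a 6000*3 feature matrix for d2 and a 3000*3 feature matrix for test.

Second layer: train a new model on d2's New Features and labels, then feed it test's New Features. Its predictions on test are the final fused result.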
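A minimal blending sketch follows, again under assumptions: synthetic data scaled down to 1000 training rows and 300 test rows, the same stand-in base models as a plausible substitute for XGB/LGB/NN, and `LinearRegression` as the second-layer model purely for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# Synthetic stand-in data (assumption).
X = rng.normal(size=(1000, 10))
y = X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.1, size=1000)
X_test = rng.normal(size=(300, 10))

# d1 (40%) trains the first layer, d2 (60%) trains the second layer.
X_d1, X_d2, y_d1, y_d2 = train_test_split(X, y, train_size=0.4, random_state=0)

# First layer: fit every base model on d1 only.
base_models = [
    Ridge(),
    DecisionTreeRegressor(max_depth=5, random_state=0),
    KNeighborsRegressor(),
]
for model in base_models:
    model.fit(X_d1, y_d1)

# New Features: first-layer predictions on d2 and on test.
d2_features = np.column_stack([m.predict(X_d2) for m in base_models])
test_features = np.column_stack([m.predict(X_test) for m in base_models])

# Second layer: fit the blender on d2's New Features and labels.
blender = LinearRegression().fit(d2_features, y_d2)
final_pred = blender.predict(test_features)
print(d2_features.shape, test_features.shape, final_pred.shape)
```

Each base model is fitted exactly once, which is why blending is cheaper than stacking, at the cost of the first layer seeing only d1.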

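Pros and cons

Advantages of blending:

Simpler than stacking (no need to run k rounds of cross-validation to obtain the stacker features)

Avoids an information-leakage problem: the generalizers and the stacker are trained on different data

When modeling as a team, there is no need to share random seeds with teammates

Disadvantages:

Uses less data (each layer sees only part of the training set, because a hold-out split is used instead of CV)

The blender may overfit (most likely as a consequence of the first point)

Stacking, which relies on multiple CV folds, tends to be more robust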

5.4 Code Implementation

import itertools
import warnings

import lightgbm as lgb
import matplotlib
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              RandomForestRegressor, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier, MLPRegressor
# from mlxtend.classifier import StackingClassifier
# from mlxtend.plotting import plot_learning_curves
# from mlxtend.plotting import plot_decision_regions

warnings.filterwarnings('ignore')
# Jupyter/IPython magic; remove this line when running as a plain script.
%matplotlib inline

5.4.1 load data

import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import StratifiedKFold, KFold, train_test_split


def reduce_mem_usage(df):
    # Downcast each numeric column to the smallest dtype that holds its range;
    # convert object columns to 'category'.
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df


all_df = pd.read_csv('data/group_df.csv', index_col=0)
all_df = reduce_mem_usage(all_df)
all_df = all_df.fillna(99)

Output:
Memory usage of dataframe is 30.28 MB
Memory usage after optimization is: 7.59 MB
Decreased by 74.9%

all_df.shape

Output:
(9000, 440)

all_df['label'].value_counts()

Output:
 2    4361
-1    2000
 0    1621
 1    1018
Name: label, dtype: int64

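Rows of all_df with label 0/1/2 form the training set, 7000 rows in total; rows with label -1 form the test set, 2000 rows in total.

The test rows (label -1) have no true label; this part of the data simulates what would be submitted in a real competition.

All train rows are labeled. We split off 30% of them as a validation set and keep the rest for training. Models are compared on the validation set, always using the F1 score (macro-averaged in the code).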

train = all_df[all_df['label'] != -1]
test = all_df[all_df['label'] == -1]
feats = [c for c in train.columns if c not in ['ID', 'label']]

# Split the labeled data 7:3 into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(
    train[feats], train['label'], test_size=0.3, random_state=0)

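5.4.2 Single models and weighted fusion

Here we train three single models: one RF and two LGB models with different hyperparameters. In practice, model fusion needs base classifiers that differ from one another, so the same model family is usually not reused; this is for demonstration only.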

# Single-model builders
def build_model_rf(X_train, y_train):
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    return model


def build_model_lgb(X_train, y_train):
    model = lgb.LGBMClassifier(num_leaves=127, learning_rate=0.1, n_estimators=200)
    model.fit(X_train, y_train)
    return model


def build_model_lgb2(X_train, y_train):
    model = lgb.LGBMClassifier(num_leaves=63, learning_rate=0.05, n_estimators=400)
    model.fit(X_train, y_train)
    return model


# Train the three single models; subA_rf / subA_lgb / subA_lgb2 are each
# predictions that could be submitted on their own.
# The single models are not tuned, so they are weak classifiers and may not score well.
print('predict rf ...')
model_rf = build_model_rf(X_train, y_train)
val_rf = model_rf.predict(X_val)
subA_rf = model_rf.predict(test[feats])
rf_f1_score = f1_score(y_val, val_rf, average='macro')
print(rf_f1_score)

print('predict lgb...')
model_lgb = build_model_lgb(X_train, y_train)
val_lgb = model_lgb.predict(X_val)
subA_lgb = model_lgb.predict(test[feats])
lgb_f1_score = f1_score(y_val, val_lgb, average='macro')
print(lgb_f1_score)

print('predict lgb 2...')
model_lgb2 = build_model_lgb2(X_train, y_train)
val_lgb2 = model_lgb2.predict(X_val)
subA_lgb2 = model_lgb2.predict(test[feats])
lgb2_f1_score = f1_score(y_val, val_lgb2, average='macro')
print(lgb2_f1_score)

Output:
predict rf ...
0.8987051046527208
predict lgb...
0.9144414270113281
predict lgb 2...
0.9183965870229657

# Hard voting over the three models (VotingClassifier refits clones of them in fit)
voting_clf = VotingClassifier(
    estimators=[('rf', model_rf), ('lgb', model_lgb), ('lgb2', model_lgb2)],
    voting='hard')
voting_clf.fit(X_train, y_train)
val_voting = voting_clf.predict(X_val)
subA_voting = voting_clf.predict(test[feats])
voting_f1_score = f1_score(y_val, val_voting, average='macro')
print(voting_f1_score)

Output:
0.9142736444973326

5.4.3 Stacking fusion

_N_FOLDS = 5  # use 5-fold cross-validation
# scikit-learn's splitter for the folds. random_state is omitted because it only
# applies with shuffle=True; recent scikit-learn versions raise an error otherwise.
kf = KFold(n_splits=_N_FOLDS)


def get_oof(clf, X_train, y_train, X_test):
    oof_train = np.zeros((X_train.shape[0], 1))
    oof_test_skf = np.empty((_N_FOLDS, X_test.shape[0], 1))
    for i, (train_index, test_index) in enumerate(kf.split(X_train)):
        # Split into this fold's training part and held-out part
        kf_X_train = X_train.iloc[train_index]
        kf_y_train = y_train.iloc[train_index]
        kf_X_val = X_train.iloc[test_index]
        clf.fit(kf_X_train, kf_y_train)
        oof_train[test_index] = clf.predict(kf_X_val).reshape(-1, 1)
        oof_test_skf[i, :] = clf.predict(X_test).reshape(-1, 1)
    oof_test = oof_test_skf.mean(axis=0)  # average the predictions of the 5 fold models
    # Predictions of this classifier for the training set and for X_test
    return oof_train, oof_test


# Call get_oof for every classifier and concatenate the results to obtain
# the new training and test data, new_train and new_test
new_train, new_test = [], []

model1 = RandomForestClassifier(n_estimators=100)
model2 = lgb.LGBMClassifier(num_leaves=127, learning_rate=0.1, n_estimators=200)
model3 = lgb.LGBMClassifier(num_leaves=63, learning_rate=0.05, n_estimators=400)

for clf in [model1, model2, model3]:
    oof_train, oof_test = get_oof(clf, X_train, y_train, X_val)
    new_train.append(oof_train)
    new_test.append(oof_test)

new_train = np.concatenate(new_train, axis=1)
new_test = np.concatenate(new_test, axis=1)

# Stacking second layer: fit a new model on new_train.
# LogisticRegression is used as the second layer to limit overfitting.
# The models used here still need tuning, so the fusion result is not very good.
clf = LogisticRegression()
clf.fit(new_train, y_train)
result = clf.predict(new_test)

stacking_f1_score = f1_score(y_val, result, average='macro')
print(stacking_f1_score)

Output:
0.8816601744239989
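The stacking code above passes hard class labels to the second layer and averages those labels across folds. A common variant feeds class probabilities instead. The sketch below shows one way to try that with scikit-learn's `StackingClassifier`. It is an illustration under assumptions rather than the original pipeline: the data comes from `make_classification`, and `RandomForestClassifier` plus `ExtraTreesClassifier` stand in for the RF/LGB models so that the example needs only scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the competition features (assumption).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,                          # out-of-fold predictions feed the second layer
    stack_method="predict_proba",  # use class probabilities as second-layer features
)
stack.fit(X_tr, y_tr)

val_pred = stack.predict(X_va)
score = f1_score(y_va, val_pred, average="macro")
meta_features = stack.transform(X_va)  # 2 models * 3 classes = 6 columns
print(meta_features.shape, round(score, 4))
```

Note one difference from the hand-written loop: `StackingClassifier` refits each base model on the full training data for prediction, instead of averaging the five fold models.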

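5.5 Questions to Consider

How can blending be derived from stacking? Hint: stacking uses k-fold CV, while blending uses a hold-out set.

What further optimizations could raise the F1 score of stacking? Hint: consider the number of first-layer models and the diversity among them.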
