[Kaggle] Intermediate Machine Learning (Pipelines + Cross-Validation)

Published: 2024/7/5

Contents

- 4. Pipelines
- 5. Cross-Validation

Previous: [Kaggle] Intermediate Machine Learning (Missing Values + Categorical Features)
Next: [Kaggle] Intermediate Machine Learning (XGBoost + Data Leakage)

4. Pipelines

A pipeline bundles data preprocessing and modeling into a single object.

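Benefits:

Cleaner code: keeping track of the data at each preprocessing step can get messy. With a pipeline, you don't need to manually track the training and validation data at each step.

Fewer bugs: there are fewer opportunities to misapply a step or to forget a preprocessing step.

Easier to deploy to production.

It also helps with model validation, such as the cross-validation covered in section 5.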

Step 1: Define the preprocessing steps

Impute missing values in the numerical data
Impute and one-hot encode the categorical (text) features

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# numerical_cols and categorical_cols are lists of column names,
# assumed to have been defined earlier.

# Preprocessing for numerical data: impute missing values with a constant
# (the default fill value for numeric data is 0)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data: impute, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
# into one complete data-processing step
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])
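To see what this preprocessor actually produces, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration and are not the competition data; the transformers are the same as above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numerical and one categorical column, each with a gap.
toy = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9600.0],
    "Street": ["Pave", "Pave", "Grvl", np.nan],
})
numerical_cols = ["LotArea"]
categorical_cols = ["Street"]

preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), numerical_cols),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

transformed = preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density; make it dense for printing.
dense = transformed.toarray() if hasattr(transformed, "toarray") else transformed

# One imputed numerical column followed by one column per category (Grvl, Pave).
print(dense)
```

The missing LotArea is filled with the constant 0, and the missing Street is filled with the most frequent value before encoding.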

Step 2: Define the model

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

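Step 3: Create and evaluate the pipeline

We use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps.

The pipeline automatically preprocesses the data before generating predictions. Without a pipeline, we would have to preprocess the data ourselves before every call to predict.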

import pandas as pd

# Bundle preprocessing and modeling code in a pipeline:
# the preprocessing step and the model are stacked into one new pipeline
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Predict on the test set with the same pipeline for submission;
# the code stays short and is hard to get wrong
preds_test = my_pipeline.predict(X_test)

# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index, 'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

You advanced 5,020 places on the leaderboard!
Your submission scored 16459.13640, which is an improvement of your previous score of 16619.07644. Great job!
The error went down a little, haha. Keep going! 🚀
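To make the "without a pipeline" case concrete, the sketch below does the same job twice on synthetic data (an assumption, not the competition data): once by hand, once with a Pipeline. With the same random_state the two routes should give the same predictions; the pipeline just removes the bookkeeping.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic regression data with roughly 10% of the entries missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=120)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_valid = X[:90], X[90:]
y_train, y_valid = y[:90], y[90:]

# Manual route: fit the imputer on the training data only,
# then remember to apply it again to the validation data.
imputer = SimpleImputer()
X_train_imp = imputer.fit_transform(X_train)
X_valid_imp = imputer.transform(X_valid)
manual_model = RandomForestRegressor(n_estimators=20, random_state=0)
manual_model.fit(X_train_imp, y_train)
manual_preds = manual_model.predict(X_valid_imp)

# Pipeline route: the same steps, but fit/predict handle the preprocessing.
pipe = Pipeline(steps=[
    ("preprocessor", SimpleImputer()),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])
pipe.fit(X_train, y_train)
pipe_preds = pipe.predict(X_valid)

same = np.allclose(manual_preds, pipe_preds)
print("Predictions match:", same)
```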

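5. Cross-Validation

Cross-validation gives a more reliable measure of model quality. The data is split into several parts (folds); each fold takes a turn as the validation set while the remaining folds are used for training. Because the model is trained once per fold, cross-validation takes correspondingly longer to run.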
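The splitting itself can be sketched in a few lines of plain Python. This is a hand-written illustration, not scikit-learn's implementation, and the helper name k_fold_indices is hypothetical; it splits consecutive indices without shuffling.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

Every sample lands in the validation set of exactly one fold, which is why the model has to be trained once per fold.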

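When to use it:

For small datasets, where the extra computation is not a burden, you should run cross-validation
For larger datasets, a single validation set is sufficient: there is already plenty of data, and the time cost of cross-validation grows
There is no simple threshold, but if the model takes a couple of minutes or less to run, go ahead and use cross-validation
Alternatively, run cross-validation and see whether the scores of the individual experiments are close. If every experiment gives roughly the same result, a single validation set is probably sufficient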
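One way to act on the last point is to look at how much the fold scores vary. The sketch below uses synthetic data, and the "relative spread" measure is just one reasonable choice (an assumption, not a rule from the course).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for a real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.2, size=200)

model = RandomForestRegressor(n_estimators=20, random_state=0)
# sklearn returns negative MAE, so flip the sign to get one MAE per fold.
scores = -1 * cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_absolute_error")

# Standard deviation of the fold scores relative to their mean.
relative_spread = scores.std() / scores.mean()
print("MAE per fold:", scores)
print("Relative spread:", relative_spread)
```

A small relative spread suggests the folds agree with each other, in which case a single validation split may tell you nearly as much.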

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))
])

from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)
print("Average MAE score (across experiments):")
print(scores.mean())

# Average cross-validation score for different numbers of trees
def get_score(n_estimators):
    """Return the average MAE over 3 CV folds of random forest model.

    Keyword argument:
    n_estimators -- the number of trees in the forest
    """
    my_pipeline = Pipeline(steps=[
        ('preprocessing', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))
    ])
    scores = -1 * cross_val_score(my_pipeline, X, y,
                                  cv=3,
                                  scoring='neg_mean_absolute_error')
    return scores.mean()

results = {}
for i in range(1, 9):
    # Evaluate the model with 50, 100, ..., 400 trees
    results[50*i] = get_score(50*i)

# Visualize the model's score for each parameter value
import matplotlib.pyplot as plt
# %matplotlib inline  # uncomment when running in a Jupyter notebook

plt.plot(list(results.keys()), list(results.values()))
plt.show()

n_estimators_best = min(results, key=results.get)  # the best parameter value (lowest MAE)

You can also use sklearn.model_selection.GridSearchCV to run a grid search for the best parameters.
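As a rough sketch of what that could look like for the same kind of pipeline: the data here is synthetic and the grid values are arbitrary assumptions. Parameters of a pipeline step are addressed as step name, double underscore, parameter name.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data standing in for X and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.2, size=150)

pipe = Pipeline(steps=[
    ("preprocessor", SimpleImputer()),
    ("model", RandomForestRegressor(random_state=0)),
])

# "model__n_estimators" targets the n_estimators parameter of the "model" step.
param_grid = {"model__n_estimators": [10, 30, 50]}
search = GridSearchCV(pipe, param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)

best_n = search.best_params_["model__n_estimators"]
best_mae = -search.best_score_  # flip the sign back to a plain MAE
print("Best n_estimators:", best_n)
print("Cross-validated MAE:", best_mae)
```

This replaces the manual loop over get_score with a single fit call, and the grid can include several parameters at once.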