sklearn Machine Learning Pipeline Template
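Contents
- 1. Import packages
- 2. Read the data
- 3. Separate numeric and text features
- 4. Data-processing Pipeline
- 5. Try different models
- 6. Parameter search
- 7. Feature-importance selection
- 8. Final complete Pipeline
This post builds a machine learning workflow with sklearn's Pipeline.
The example is the [Kesci] newcomer competition: Employee Satisfaction Prediction.
Reference: [Hands On ML] 2. An end-to-end machine learning project (California housing price prediction)
1. Import packages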
    import numpy as np
    import pandas as pd
    # Jupyter/IPython magic; remove this line when running as a plain script
    %matplotlib inline
    import matplotlib.pyplot as plt
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import (GridSearchCV, StratifiedShuffleSplit,
                                         cross_val_score, train_test_split)
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.preprocessing import (LabelBinarizer, LabelEncoder,
                                       OneHotEncoder, StandardScaler)

2. Read the data
    data = pd.read_csv("../competition/Employee_Satisfaction/train.csv")
    test = pd.read_csv("../competition/Employee_Satisfaction/test.csv")
    data.columns

Output:

    Index(['id', 'last_evaluation', 'number_project', 'average_monthly_hours',
           'time_spend_company', 'Work_accident', 'package',
           'promotion_last_5years', 'division', 'salary', 'satisfaction_level'],
          dtype='object')

- Separate the training data from the labels
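The code for the label-separation step is missing from this copy of the post. A minimal sketch follows, assuming `satisfaction_level` is the target (it is the last column printed above and the quantity the competition asks for). The small DataFrame and its values are made up so the snippet runs on its own; with the real data you would skip that part and use the `data` frame read above.

```python
import pandas as pd

# Stand-in frame with a few of the columns listed above (values are made up).
data = pd.DataFrame({
    "id": [1, 2, 3],
    "last_evaluation": [0.53, 0.86, 0.88],
    "salary": ["low", "medium", "low"],
    "satisfaction_level": [0.38, 0.80, 0.11],
})

# Assumption: 'satisfaction_level' is the label to predict.
y = data["satisfaction_level"]
X = data.drop(columns=["satisfaction_level"])

print(X.columns.tolist())
print(y.shape)
```

`X` and `y` are the names the later sections use for the features and the labels.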
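3. Separate numeric and text features

    def num_cat_splitor(X):
        # Columns with dtype 'object' are treated as text (categorical) features.
        s = (X.dtypes == 'object')
        object_cols = list(s[s].index)
        # object_cols: ['package', 'division', 'salary']
        # A list comprehension (rather than a set difference) keeps the
        # original column order, so the result is the same on every run.
        num_cols = [col for col in X.columns if col not in object_cols]
        # num_cols: ['id', 'last_evaluation', 'number_project',
        #            'average_monthly_hours', 'time_spend_company',
        #            'Work_accident', 'promotion_last_5years']
        return num_cols, object_cols

    num_cols, object_cols = num_cat_splitor(X)
    # print(num_cols)
    # print(object_cols)
    # X[object_cols].values

- Feature column selector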
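The selector's code did not survive in this copy. Since the imports include `BaseEstimator` and `TransformerMixin`, it was presumably a small custom transformer like the one in the Hands On ML reference. The sketch below is written under that assumption; the class name `DataFrameSelector` and the demo frame are illustrative, not taken from the post.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Pick the given DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        # Nothing to learn; the column names are fixed at construction time.
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Made-up frame to show the selector in isolation.
demo = pd.DataFrame({"number_project": [2, 5], "salary": ["low", "high"]})
selected = DataFrameSelector(["number_project"]).fit_transform(demo)
print(selected.shape)
```

Wrapping column selection in a transformer lets the numeric and text branches each start from the full DataFrame inside a Pipeline.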
4. Data-processing Pipeline
- Numeric features
- Text features
- Combine the numeric and text features
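The three steps above have no code in this copy. One way they might look, using the classes imported in section 1, is sketched below. The imputation strategy, the encoder options, the step names, and the toy data are all assumptions; `DataFrameSelector` is the hypothetical column selector, redefined here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: select DataFrame columns as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Toy data standing in for the competition features (values are made up).
X = pd.DataFrame({
    "last_evaluation": [0.5, np.nan, 0.9, 0.7],
    "number_project": [2, 5, 7, 3],
    "salary": ["low", "medium", "high", "low"],
})
num_cols = ["last_evaluation", "number_project"]
object_cols = ["salary"]

# Numeric features: fill missing values with the median, then standardize.
num_pipeline = Pipeline([
    ("selector", DataFrameSelector(num_cols)),
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Text features: one-hot encode each category.
cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(object_cols)),
    ("cat_encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Combine both branches side by side into one feature matrix.
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

X_prepared = full_pipeline.fit_transform(X)
print(X_prepared.shape)
```

The one-hot branch returns a sparse matrix by default, so `X_prepared` is sparse here; the tree models used later accept that directly. In recent scikit-learn versions `ColumnTransformer` covers the same ground without a custom selector, but the imports in section 1 suggest the post used `FeatureUnion`.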
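5. Try different models

    from sklearn.ensemble import RandomForestRegressor

    forest_reg = RandomForestRegressor()
    # scikit-learn maximizes scores, so MSE is reported as a negative number.
    forest_scores = cross_val_score(forest_reg, X_prepared, y,
                                    scoring='neg_mean_squared_error', cv=3)
    # Flip the sign and take the square root to get the RMSE of each fold.
    forest_rmse_scores = np.sqrt(-forest_scores)
    print(forest_rmse_scores)
    print(forest_rmse_scores.mean())
    print(forest_rmse_scores.std())

Other models can be tried in the same way.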
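As a sketch of trying other models, the same cross-validation call can be wrapped in a small helper and applied to several candidates. The helper name `rmse_cv`, the choice of candidate models, and the synthetic data from `make_regression` are assumptions for illustration; with the real data you would pass the prepared feature matrix and labels instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared features and labels.
X_prepared, y = make_regression(n_samples=200, n_features=8,
                                noise=10.0, random_state=42)


def rmse_cv(model, X, y, cv=3):
    """Return the per-fold RMSE of `model` under k-fold cross-validation."""
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=cv)
    return np.sqrt(-scores)


candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
}
results = {name: rmse_cv(model, X_prepared, y)
           for name, model in candidates.items()}

for name, rmse in results.items():
    print(name, rmse.mean(), rmse.std())
```

Comparing the mean and spread of the fold RMSEs gives a rough ranking before any tuning is spent on a model.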
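6. Parameter search

    param_grid = [
        # 5 x 4 = 20 combinations with the default bootstrap=True
        {'n_estimators': [3, 10, 30, 50, 80], 'max_features': [2, 4, 6, 8]},
        # 2 x 3 = 6 combinations with bootstrapping turned off
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
    ]
    forest_reg = RandomForestRegressor()
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error')
    grid_search.fit(X_prepared, y)

- Best parameters
- Best model
- Search results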
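The three items above correspond to attributes of a fitted `GridSearchCV`: `best_params_`, `best_estimator_`, and `cv_results_`. A minimal sketch of reading them follows, using a reduced grid and synthetic data (both assumptions, chosen so the snippet runs quickly on its own).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared features and labels.
X_prepared, y = make_regression(n_samples=120, n_features=8,
                                noise=10.0, random_state=42)

param_grid = [{"n_estimators": [3, 10], "max_features": [2, 4]}]
grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           param_grid, cv=3,
                           scoring="neg_mean_squared_error")
grid_search.fit(X_prepared, y)

# Best parameters: the combination with the highest mean CV score.
print(grid_search.best_params_)

# Best model: refit on all of the data with those parameters.
best_model = grid_search.best_estimator_

# Search results: mean RMSE for every combination that was tried.
cv_results = grid_search.cv_results_
for mean_score, params in zip(cv_results["mean_test_score"],
                              cv_results["params"]):
    print(np.sqrt(-mean_score), params)
```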
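7. Feature-importance selection

    # Importance score of each input feature, taken from the best forest
    feature_importances = grid_search.best_estimator_.feature_importances_

- Select the top k most important features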
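The selection code is missing from this copy, although section 8 refers to a `TopFeatureSelector(feature_importances, k)`. The sketch below follows the pattern used in the Hands On ML reference and is an assumption about what the post's class looked like; the helper `indices_of_top_k` and the demo arrays are illustrative.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


def indices_of_top_k(arr, k):
    """Return the (sorted) column indices of the k largest values in arr."""
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])


class TopFeatureSelector(BaseEstimator, TransformerMixin):
    """Keep only the k columns with the highest importance scores."""

    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k

    def fit(self, X, y=None):
        self.feature_indices_ = indices_of_top_k(self.feature_importances,
                                                 self.k)
        return self

    def transform(self, X):
        return X[:, self.feature_indices_]


# Made-up importances: columns 1 and 3 are the two most important.
feature_importances = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
X_demo = np.arange(10).reshape(2, 5)
X_top = TopFeatureSelector(feature_importances, k=2).fit_transform(X_demo)
print(X_top)
```

Because `k` is a constructor argument, it can later be tuned by a grid search like any other hyperparameter.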
8. Final complete Pipeline

    # k is the number of top features to keep (see section 7)
    prepare_select_and_predict_pipeline = Pipeline([
        ('preparation', full_pipeline),
        ('feature_selection', TopFeatureSelector(feature_importances, k)),
        ('forest_reg', RandomForestRegressor()),
    ])

- Parameter search
- Training
- Prediction
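The code for these three steps is not present in this copy. As a rough sketch, a grid search can be run over the whole pipeline by addressing each step's parameters as `<step name>__<parameter>`; fitting the search trains the pipeline, and the refitted best pipeline then predicts. Everything below is an assumption for illustration: the data is synthetic, `StandardScaler` stands in for the preparation pipeline, the final step is named `forest_reg`, and `TopFeatureSelector` is the hypothetical selector.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class TopFeatureSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep the k most important feature columns."""

    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k

    def fit(self, X, y=None):
        top = np.argpartition(np.array(self.feature_importances), -self.k)
        self.feature_indices_ = np.sort(top[-self.k:])
        return self

    def transform(self, X):
        return X[:, self.feature_indices_]


# Synthetic stand-in for the competition data.
X, y = make_regression(n_samples=150, n_features=8, n_informative=4,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Importances from a preliminary forest fitted on the training split.
feature_importances = RandomForestRegressor(
    n_estimators=20, random_state=42).fit(X_train, y_train).feature_importances_

pipeline = Pipeline([
    ("preparation", StandardScaler()),  # stand-in for the real preparation step
    ("feature_selection", TopFeatureSelector(feature_importances, k=4)),
    ("forest_reg", RandomForestRegressor(random_state=42)),
])

# Parameter search: step parameters are addressed as <step>__<param>.
param_grid = {
    "feature_selection__k": [2, 4, 6],
    "forest_reg__n_estimators": [10, 30],
}
grid_search = GridSearchCV(pipeline, param_grid, cv=3,
                           scoring="neg_mean_squared_error")

# Training: fit the search; the best pipeline is refit on all training data.
grid_search.fit(X_train, y_train)

# Prediction: the fitted search delegates to the best pipeline.
y_pred = grid_search.predict(X_test)
print(grid_search.best_params_)
print(y_pred[:5])
```

Searching over the pipeline rather than the bare model means preparation and feature selection are refit inside every cross-validation fold, which avoids leaking validation data into those steps.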
The above is only a rough overall framework and leaves out many details. Comments and corrections are welcome!
My CSDN blog: https://michael.blog.csdn.net/