Building a Loan Overdue Prediction Model, Part 4: Model Tuning

Published: 2025/3/19

Contents

- I. Task
- II. Overview
  - 1. Parameters
  - 2. Common methods and attributes
- III. Implementation
  - 1. Imports
  - 2. Model evaluation function
  - 3. Loading the data
  - 4. Logistic Regression
    - (1) Tuning
    - (2) Evaluation
  - 5. SVM
    - (1) Tuning
    - (2) Evaluation
  - 6. Decision Tree
    - (1) Tuning
    - (2) Evaluation
  - 7. Random Forest
  - 8. GBDT
  - 9. XGBoost
  - 10. LightGBM
- IV. Problems encountered
  - 1. UnboundLocalError: local variable 'xxx' referenced before assignment
  - 2. ImportError: [joblib] Attempting to do parallel computing without protecting
  - 3. Recall
I. Task

Tune seven models with grid search (using 5-fold cross-validation during tuning), evaluate each tuned model, and show the output of the code.

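II. Overview

Almost every machine learning model needs its parameters tuned, and different parameter combinations give different results:

If the dataset is not very large (so a single fit does not take long), GridSearchCV can pick the best combination of the supplied parameter values automatically.
If the dataset is very large and each fit consumes a lot of compute and time, GridSearchCV may be too expensive. In that case you need a deeper understanding of the model, or more practical experience, and tune the parameters by hand.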
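
To get a feel for why an exhaustive search becomes expensive, the sketch below enumerates a grid by hand using only the standard library. The grid is the logistic regression grid used later in this post; `expand_grid` and `count_fits` are hypothetical helpers written for this illustration, not scikit-learn functions.

```python
from itertools import product

# Grid used for logistic regression later in this post.
param_grid = {
    'C': [1e-3, 0.01, 0.1, 1, 10, 100, 1e3],
    'penalty': ['l1', 'l2'],
}


def expand_grid(grid):
    """Return every parameter combination in the grid as a list of dicts."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def count_fits(grid, n_folds, refit=True):
    """Hypothetical helper: number of model fits a full grid search needs."""
    n_candidates = len(expand_grid(grid))
    return n_candidates * n_folds + (1 if refit else 0)


candidates = expand_grid(param_grid)
total_fits = count_fits(param_grid, n_folds=5)
print(len(candidates), 'candidates,', total_fits, 'fits')
```

Every extra parameter multiplies the number of candidates, and every candidate is fitted once per fold, which is why the cost grows quickly on large datasets.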

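1. Parameters

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')

This is the signature from the older scikit-learn release used in this post. Later releases removed fit_params and iid and changed some defaults, so check the documentation for the version you have installed.

(1) estimator
The classifier to tune, for example estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10). Pass in every parameter other than the ones being searched. The estimator must provide a score method, or a scoring argument must be given.
(2) param_grid
A dict or a list of dicts giving the candidate values of the parameters to optimize, for example param_grid=param_test1 with param_test1 = {'n_estimators': range(10, 71, 10)}.
(3) scoring
The evaluation metric. The default is None, in which case the estimator's own score method is used. It can also be a string naming a metric, such as scoring='roc_auc', or a callable with the signature scorer(estimator, X, y). Which metric is appropriate depends on the model. The available options are listed at:
http://scikit-learn.org/stable/modules/model_evaluation.html
(4) cv
The cross-validation strategy. The default None means 3-fold cross-validation in this version (scikit-learn 0.22 and later default to 5-fold). An integer sets the number of folds; cv can also be an iterable that yields train/test splits.
(5) refit
Default True. Once the search has finished, the estimator is refit on the entire dataset passed to fit, using the best parameters found during cross-validation. This refit model is the one used for the final evaluation.
(6) iid
Default True. When True, the data is assumed to be identically distributed across folds, and the score is aggregated over all samples instead of being the plain average of the per-fold scores.
(7) verbose
Verbosity level (int). 0: no output during training; 1: occasional output; greater than 1: output for every sub-model.
(8) n_jobs
Number of parallel jobs (int). -1 uses as many jobs as there are CPU cores; 1 is the default in this version.
(9) pre_dispatch
Controls how many jobs are dispatched during parallel execution. When n_jobs is greater than 1, the data is copied for every dispatched job, which can cause out-of-memory (OOM) errors. Setting pre_dispatch caps the number of jobs dispatched ahead of time, so the data is copied at most pre_dispatch times.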
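
As a minimal sketch of the scoring and cv options described above, the example below passes a callable scorer and an explicit splitter object to GridSearchCV. The synthetic dataset from `make_classification` and the `auc_scorer` function are assumptions made for illustration; they are not part of the original post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the loan data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)


def auc_scorer(estimator, X, y):
    """Callable scorer with the signature scorer(estimator, X, y)."""
    return roc_auc_score(y, estimator.predict_proba(X)[:, 1])


# An explicit splitter object instead of an integer fold count.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

grid = GridSearchCV(
    estimator=LogisticRegression(solver='liblinear'),
    param_grid={'C': [0.01, 0.1, 1]},
    scoring=auc_scorer,
    cv=cv,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Passing a splitter with a fixed random_state keeps the folds reproducible between runs, which makes scores from different searches easier to compare.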

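2. Common methods and attributes

grid.fit(): runs the grid search.
grid_scores_: the evaluation results for each parameter combination. This attribute was removed in scikit-learn 0.20; newer versions expose the same information through cv_results_.
best_params_: the parameter combination that achieved the best result.
best_score_: the best score observed during the search.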
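
The following sketch shows one way to read these attributes on a current scikit-learn release, using cv_results_ where older code used grid_scores_. The synthetic data and the small decision tree grid are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 4, 6]},
    scoring='roc_auc',
    cv=5,
)
grid.fit(X, y)

# cv_results_ holds one entry per candidate parameter combination.
results = grid.cv_results_
for params, mean, std in zip(results['params'],
                             results['mean_test_score'],
                             results['std_test_score']):
    print(params, round(mean, 4), round(std, 4))

# With refit=True (the default), the best model is refit and exposed here.
best_model = grid.best_estimator_
print(best_model.get_params()['max_depth'] == grid.best_params_['max_depth'])
```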

III. Implementation

1. Imports

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
import xgboost as xgb
import numpy as np
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

warnings.filterwarnings(action='ignore', category=DeprecationWarning)

2. Model evaluation function

## Model evaluation
def model_metrics(clf, y_target, y_predict):
    # clf is accepted for convenience but is not used in the calculations.
    accuracy = accuracy_score(y_target, y_predict)
    print('The accuracy is ', accuracy)
    precision = precision_score(y_target, y_predict)
    print('The precision is ', precision)
    recall = recall_score(y_target, y_predict)
    print('The recall is ', recall)
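
For reference, the three metrics printed by model_metrics can be worked out from confusion-matrix counts. The sketch below uses hypothetical counts (they are not taken from the loan data) and a hypothetical helper `metrics_from_counts`.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Hypothetical counts: 100 actual positives, of which only 30 are caught.
accuracy, precision, recall = metrics_from_counts(tp=30, fp=10, fn=70, tn=290)
print(accuracy, precision, recall)
```

With these counts accuracy is 0.8 and precision 0.75 while recall is only 0.3, which shows how a model can look accurate and still miss most positive cases when negatives dominate.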

3. Loading the data

## Load the data
data = pd.read_csv("data_all.csv")
x = data.drop(labels='status', axis=1)
y = data['status']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2018)

## Standardize the features (fit on the training set only)
scaler = StandardScaler()
scaler.fit(x_train)
x_train_stand = scaler.transform(x_train)
x_test_stand = scaler.transform(x_test)
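
The standardization step can be sketched by hand to show what fitting on the training split and transforming both splits means. The feature values below are hypothetical, and the sketch uses the population standard deviation, which is what StandardScaler computes.

```python
from statistics import fmean, pstdev

# Hypothetical single feature, already split into train and test parts.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

# The statistics come from the training split only.
mean = fmean(train)
std = pstdev(train)  # population standard deviation (ddof=0)

# The same mean and std are then applied to both splits.
train_stand = [(v - mean) / std for v in train]
test_stand = [(v - mean) / std for v in test]
print(mean, std, test_stand)
```

Reusing the training statistics on the test split keeps information about the test data out of the preprocessing step.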

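4. Logistic Regression

(1) Tuning

# liblinear supports both l1 and l2. It is set explicitly because newer
# scikit-learn releases default to lbfgs, which rejects penalty='l1'.
lr = LogisticRegression(solver='liblinear')
# Parameters to tune
param = {'C': [1e-3, 0.01, 0.1, 1, 10, 100, 1e3], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(estimator=lr, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best cross-validation score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

==> Best parameters: {'C': 0.1, 'penalty': 'l1'}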
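
Because param_grid also accepts a list of dicts, solver and penalty can be paired so that unsupported combinations are never tried. The sketch below only enumerates the candidates with ParameterGrid; the lbfgs sub-grid is an assumption added for illustration and was not part of the search above.

```python
from sklearn.model_selection import ParameterGrid

c_values = [1e-3, 0.01, 0.1, 1, 10, 100, 1e3]

# Each dict is its own sub-grid, so invalid solver/penalty pairs never occur.
param_grid = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': c_values},
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': c_values},
]

candidates = list(ParameterGrid(param_grid))
print(len(candidates))
```

The same list can be passed directly as param_grid to GridSearchCV.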

(2) Evaluation

lr = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
lr.fit(x_train_stand, y_train)
y_pre_lr = lr.predict(x_test_stand)
model_metrics(lr, y_test, y_pre_lr)

Output

The accuracy is 0.7890679747722494
The precision is 0.6746987951807228
The recall is 0.31197771587743733

5. SVM

(1) Tuning

svm = SVC(random_state=2018, probability=True)
param = {'C': [0.01, 0.1, 1]}
grid = GridSearchCV(estimator=svm, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best cross-validation score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

==> Best parameters: {'C': 0.1}

(2) Evaluation

svm = SVC(C=0.1, random_state=2018, probability=True)
svm.fit(x_train_stand, y_train)
y_pre_svm = svm.predict(x_test_stand)
model_metrics(svm, y_test, y_pre_svm)

Output

The accuracy is 0.7575332866152769
The precision is 0.8823529411764706
The recall is 0.04178272980501393

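6. Decision Tree

(1) Tuning

The parameters were searched in three stages, one grid at a time. As written, each assignment to param overwrites the previous one, so only the last grid is active; to rerun an earlier stage, comment out the grids that follow it.

dt = DecisionTreeClassifier(max_depth=9, min_samples_split=50, min_samples_leaf=90, max_features='sqrt', random_state=2018)
# Stage 1
param = {'max_depth': range(3, 14, 2), 'min_samples_split': range(100, 801, 200)}
# Best parameters: {'max_depth': 9, 'min_samples_split': 300}
# Stage 2
param = {'min_samples_split': range(50, 1000, 100), 'min_samples_leaf': range(60, 101, 10)}
# Best parameters: {'min_samples_leaf': 90, 'min_samples_split': 50}
# Stage 3
param = {'max_features': range(7, 20, 2)}
# Best parameters: {'max_features': 9}
grid = GridSearchCV(estimator=dt, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best cross-validation score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))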
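
Searching the decision tree parameters in stages keeps the number of candidates small. The sketch below counts the candidates in the three grids above and compares them with a single joint grid; the joint grid is a hypothetical alternative (taking the wider min_samples_split range), and `grid_size` is a helper written for this illustration.

```python
from math import prod

# The three grids searched one after another in the tuning code.
stages = [
    {'max_depth': range(3, 14, 2), 'min_samples_split': range(100, 801, 200)},
    {'min_samples_split': range(50, 1000, 100),
     'min_samples_leaf': range(60, 101, 10)},
    {'max_features': range(7, 20, 2)},
]


def grid_size(grid):
    """Number of parameter combinations in one grid."""
    return prod(len(values) for values in grid.values())


staged_candidates = sum(grid_size(stage) for stage in stages)

# Hypothetical joint grid over all four parameters at once.
joint_grid = {
    'max_depth': range(3, 14, 2),
    'min_samples_split': range(50, 1000, 100),
    'min_samples_leaf': range(60, 101, 10),
    'max_features': range(7, 20, 2),
}
joint_candidates = grid_size(joint_grid)
print(staged_candidates, joint_candidates)
```

The staged search evaluates far fewer candidates, at the price of possibly missing combinations that only work well together.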

(2) Evaluation

dt = DecisionTreeClassifier(max_depth=9, min_samples_split=50, min_samples_leaf=90, max_features=9, random_state=2018)
dt.fit(x_train_stand, y_train)
y_pre_dt = dt.predict(x_test_stand)
model_metrics(dt, y_test, y_pre_dt)

Output

The accuracy is 0.7561317449194114
The precision is 0.5578947368421052
The recall is 0.14763231197771587

7. Random Forest

## Random Forest
# param = {'n_estimators': range(1, 200, 5), 'max_features': ['log2', 'sqrt', 'auto']}
# Best parameters: {'max_features': 'sqrt', 'n_estimators': 171}
rf = RandomForestClassifier(n_estimators=171, max_features='sqrt', random_state=2018)
rf.fit(x_train_stand, y_train)
y_pre_rf = rf.predict(x_test_stand)
model_metrics(rf, y_test, y_pre_rf)

Output

The accuracy is 0.7848633496846531
The precision is 0.6857142857142857
The recall is 0.26740947075208915

8. GBDT

# gbdt = GradientBoostingClassifier(random_state=2018)
# param = {'n_estimators': range(1, 100, 10), 'learning_rate': np.arange(0.1, 1, 0.1)}
# grid = GridSearchCV(estimator=gbdt, param_grid=param, scoring='roc_auc', cv=5)
# grid.fit(x_train_stand, y_train)
# print('Best parameters:', grid.best_params_)
# print('Best cross-validation score on the training set:', grid.best_score_)
# print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'learning_rate': 0.1, 'n_estimators': 41}
gbdt = GradientBoostingClassifier(learning_rate=0.1, n_estimators=41, random_state=2018)
gbdt.fit(x_train_stand, y_train)
y_pre_gbdt = gbdt.predict(x_test_stand)
model_metrics(gbdt, y_test, y_pre_gbdt)

9. XGBoost

## Tuning
param = {'n_estimators': range(20, 200, 20)}
# param = {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 12, 2)}
# param = {'gamma': [i / 10 for i in range(1, 6)]}
# param = {'subsample': [i / 10 for i in range(5, 10)], 'colsample_bytree': [i / 10 for i in range(5, 10)]}
# param = {'reg_alpha': [1e-5, 1e-2, 0.1, 0, 1, 100]}
# param = {'n_estimators': range(20, 200, 20)}
xgboost = xgb.XGBClassifier(learning_rate=0.1, n_estimators=40, max_depth=3, min_child_weight=11,
                            reg_alpha=0.01, gamma=0.1, subsample=0.7, colsample_bytree=0.7,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=2018)
# Pass the classifier instance, not the xgb module, as the estimator.
# iid exists only in older scikit-learn (removed in 0.24); drop it on newer versions.
grid = GridSearchCV(estimator=xgboost, param_grid=param, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best cross-validation score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'n_estimators': 40}
# Best cross-validation score on the training set: 0.8028110571725202
# Score on the test set: 0.7770857458817146

## Evaluation
xgboost = xgb.XGBClassifier(learning_rate=0.1, n_estimators=40, max_depth=3, min_child_weight=11,
                            reg_alpha=0.01, gamma=0.1, subsample=0.7, colsample_bytree=0.7,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=2018)
xgboost.fit(x_train_stand, y_train)
y_pre_xgb = xgboost.predict(x_test_stand)
model_metrics(xgboost, y_test, y_pre_xgb)

Output

The accuracy is 0.7876664330763841
The precision is 0.6521739130434783
The recall is 0.3342618384401114

10. LightGBM

## Tuning
gbm = lgb.LGBMClassifier(seed=2018)
param = {'learning_rate': np.arange(0.1, 0.5, 0.1), 'max_depth': range(1, 6, 1), 'n_estimators': range(30, 50, 5)}
# iid exists only in older scikit-learn (removed in 0.24); drop it on newer versions.
grid = GridSearchCV(estimator=gbm, param_grid=param, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best cross-validation score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 40}
# Best cross-validation score on the training set: 0.8007228827289531
# Score on the test set: 0.7729296422647178

## Evaluation
gbm = lgb.LGBMClassifier(learning_rate=0.1, max_depth=3, n_estimators=40, seed=2018)
gbm.fit(x_train_stand, y_train)
y_pre_gbm = gbm.predict(x_test_stand)
model_metrics(gbm, y_test, y_pre_gbm)

Output

The accuracy is 0.7932725998598459
The precision is 0.6839080459770115
The recall is 0.33147632311977715

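IV. Problems encountered

1. UnboundLocalError: local variable 'xxx' referenced before assignment

Error:
UnboundLocalError: local variable 'xxx' referenced before assignment

This error appears when a variable n is already defined outside a function and the function body then assigns to a variable of the same name while also reading it.

The cause is that the interpreter has not been told whether the variable is global or local. Because of the assignment, it treats n as local throughout the function, so the read happens before the local variable has a value.

Solution: rename the variable so that the two names no longer conflict.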
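
A minimal reproduction of the error and of the renaming fix might look like this. The function and variable names are hypothetical and chosen only for this illustration.

```python
n = 10  # variable defined outside the function


def broken():
    # The assignment below makes `n` local to this function, so the
    # read on the right-hand side happens before the local has a value.
    n = n + 1
    return n


def fixed_with_rename():
    # Use a different local name and only read the outer `n`.
    total = n + 1
    return total


try:
    broken()
except UnboundLocalError as exc:
    print('UnboundLocalError:', exc)

print(fixed_with_rename())
```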

2. ImportError: [joblib] Attempting to do parallel computing without protecting

Error:
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information

Solution: put the script's entry code inside an if __name__ == '__main__': block.
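
One way to structure a script so that a parallel grid search is protected by the guard is sketched below. The `main` function, the synthetic data and n_jobs=2 are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def main():
    # Synthetic stand-in for the loan data (an assumption for illustration).
    X, y = make_classification(n_samples=200, n_features=6, random_state=0)
    grid = GridSearchCV(
        estimator=LogisticRegression(solver='liblinear'),
        param_grid={'C': [0.1, 1, 10]},
        scoring='roc_auc',
        cv=5,
        n_jobs=2,  # parallel workers are started from inside main()
    )
    grid.fit(X, y)
    print('Best parameters:', grid.best_params_)
    return grid


# Worker processes that re-import this file skip the block below,
# so they do not start another grid search themselves.
if __name__ == '__main__':
    main()
```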

3. Recall

Why is recall so low across almost all of the models?
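
The post leaves this question open. One possible contributor, offered here as an assumption and not as the author's conclusion, is the fixed decision cutoff: a classifier such as logistic regression labels a sample positive only when its predicted probability exceeds 0.5, which can miss many positives when they are the minority class. The sketch below uses hypothetical scores and labels to show how recall changes with the cutoff.

```python
# Hypothetical predicted probabilities and true labels (1 = overdue).
scores = [0.9, 0.7, 0.45, 0.4, 0.35, 0.3, 0.2, 0.15, 0.1, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]


def recall_at(threshold, scores, labels):
    """Recall when every score >= threshold is predicted as positive."""
    true_positives = sum(1 for s, t in zip(scores, labels)
                         if s >= threshold and t == 1)
    actual_positives = sum(labels)
    return true_positives / actual_positives


recall_default = recall_at(0.5, scores, labels)
recall_lowered = recall_at(0.3, scores, labels)
print(recall_default, recall_lowered)
```

Lowering the cutoff can only keep or raise recall, usually at the cost of precision, so whether it helps depends on how costly a missed overdue loan is compared with a false alarm.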
