Building a Financial Loan Overdue Model, Part 4: Model Tuning

Published: 2025/3/19

Contents

    • I. Task
    • II. Overview
      • 1. Parameter description
      • 2. Common methods
    • III. Implementation
      • 1. Imports
      • 2. Model evaluation function
      • 3. Reading the data
      • 4. Logistic Regression
        • (1) Tuning
        • (2) Model evaluation
      • 5. SVM
        • (1) Tuning
        • (2) Model evaluation
      • 6. Decision Tree
        • (1) Tuning
        • (2) Model evaluation
      • 7. Random Forest
      • 8. GBDT
      • 9. XGBoost
      • 10. LightGBM
    • IV. Problems encountered
      • 1. UnboundLocalError: local variable 'xxx' referenced before assignment
      • 2. ImportError: [joblib] Attempting to do parallel computing without protecting
      • 3. Recall

I. Task

Use grid search to tune 7 models (with five-fold cross-validation during tuning), evaluate each model, and show the results of running the code.

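II. Overview

Almost every machine learning model needs hyperparameter tuning, and different parameter combinations produce different results:

  • If the dataset is not very large (so the run time is short), GridSearchCV can pick the best combination of the supplied parameter values automatically.
  • If the dataset is very large and training is expensive in compute and time, GridSearchCV may cost too much. In that case you need a deeper understanding of the model, or more practical experience, and tune by hand.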
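
To make the cost argument concrete, the sketch below counts how many model fits a grid search performs: the number of parameter combinations multiplied by the number of cross-validation folds. The helper name count_fits and the example grid are illustrative assumptions; the grid is shaped like the LightGBM grid used later in this article.

```python
def count_fits(param_grid, n_folds):
    """Return (number of parameter combinations, number of model fits)."""
    n_combinations = 1
    for values in param_grid.values():
        n_combinations *= len(list(values))
    return n_combinations, n_combinations * n_folds


# Hypothetical grid for illustration: 4 x 5 x 4 candidate values.
example_grid = {
    "learning_rate": [0.1, 0.2, 0.3, 0.4],
    "max_depth": range(1, 6),
    "n_estimators": range(30, 50, 5),
}

combos, fits = count_fits(example_grid, n_folds=5)
print("combinations:", combos)
print("fits with 5-fold CV:", fits)
```

With refit=True, GridSearchCV performs one more fit on the full training set after the search, so the real total is one higher than the count above.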

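1. Parameter description

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')

This is the signature from the scikit-learn release the article was written against. Later releases removed fit_params and iid and changed some of the defaults.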

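(1) estimator
The classifier to tune, for example estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10). Pass in every parameter other than the ones being searched. The estimator must either provide a score method or be used together with the scoring argument.
(2) param_grid
A dictionary, or a list of dictionaries, giving the candidate values of the parameters to optimize, for example param_grid=param_test1 with param_test1 = {'n_estimators': range(10, 71, 10)}.
(3) scoring
The evaluation metric. The default is None, in which case the estimator's own score method is used. Otherwise pass a string such as scoring='roc_auc' (the appropriate metric depends on the model), or a callable with the signature scorer(estimator, X, y). The available options are listed at:
http://scikit-learn.org/stable/modules/model_evaluation.html
(4) cv
The cross-validation strategy. The default None means 3-fold cross-validation in the version documented here (newer scikit-learn releases default to 5-fold). An integer sets the number of folds; an iterable that yields train/test splits is also accepted.
(5) refit
Defaults to True. Once the search has finished, the estimator is refit on the entire dataset passed to fit, using the best parameters found by cross-validation. That refit model is the one used for final evaluation and prediction.
(6) iid
Defaults to True. When True, the data is assumed to be identically distributed across folds, and the loss minimized is the total loss per sample rather than the mean loss across folds.
(7) verbose
Logging verbosity (int). 0: no output during training; 1: occasional output; >1: output for every sub-model.
(8) n_jobs
Number of parallel jobs (int). -1 uses as many jobs as there are CPU cores; 1 is the default.
(9) pre_dispatch
Controls how many jobs are dispatched in total during parallel execution. When n_jobs is greater than 1, the data is copied for each job, which can cause out-of-memory errors. Setting pre_dispatch caps the number of dispatched jobs so that the data is copied at most pre_dispatch times.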

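2. Common methods

grid.fit(): runs the grid search;
grid_scores_: the evaluation results for each parameter combination (replaced by cv_results_ in newer scikit-learn releases);
best_params_: the parameter combination that achieved the best result;
best_score_: the best score observed during the search.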
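
The attributes listed above can be seen together in a small end-to-end run. This is a minimal sketch, not the article's experiment: it uses synthetic data from make_classification in place of data_all.csv, tunes only C for a logistic regression, and reads results from cv_results_, which is the attribute available in current scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: the article's data_all.csv is not used here).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)

print("best params:", grid.best_params_)
print("best mean CV score:", grid.best_score_)

# cv_results_ holds one entry per parameter combination.
results = grid.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, round(float(mean_score), 4))

# With refit=True (the default), best_estimator_ has been refit on all of X, y.
print(grid.best_estimator_)
```

Note that best_score_ is the mean cross-validated score of the best combination, not a score on held-out test data.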

III. Implementation

1. Imports

import warnings

import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

warnings.filterwarnings(action='ignore', category=DeprecationWarning)

2. Model evaluation function

## Model evaluation
def model_metrics(clf, y_target, y_predict):
    # Print accuracy, precision and recall for a set of predictions.
    accuracy = accuracy_score(y_target, y_predict)
    print('The accuracy is ', accuracy)
    precision = precision_score(y_target, y_predict)
    print('The precision is ', precision)
    recall = recall_score(y_target, y_predict)
    print('The recall is ', recall)
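
For readers who want to see what the three printed metrics measure, here is a small sketch that computes them by hand from confusion-matrix counts for a binary problem. The function names and the toy labels are illustrative assumptions, not part of the original code.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def basic_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels 0/1."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels: 4 positives, 6 negatives; the model finds only one positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy, precision, recall = basic_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

In this toy case accuracy stays moderate while recall is low, because most of the positives are missed even though most of the negatives are classified correctly.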

3. Reading the data

## Read the data
data = pd.read_csv("data_all.csv")
x = data.drop(labels='status', axis=1)
y = data['status']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2018)

## Standardize the data (fit the scaler on the training set only)
scaler = StandardScaler()
scaler.fit(x_train)
x_train_stand = scaler.transform(x_train)
x_test_stand = scaler.transform(x_test)
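
The standardization step learns the mean and standard deviation from the training split and reuses them on the test split. The sketch below shows that idea on a single made-up feature column using only the standard library. The helper names fit_scaler and transform are hypothetical, and the population standard deviation is used because that is what StandardScaler computes.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        std = 1.0  # avoid dividing by zero for a constant column
    return mean, std


def transform(column, mean, std):
    """Apply previously learned statistics to any column of values."""
    return [(value - mean) / std for value in column]


# Made-up feature values for illustration.
train_column = [10.0, 20.0, 30.0]
test_column = [20.0, 40.0]

mean, std = fit_scaler(train_column)            # statistics come from train only
train_scaled = transform(train_column, mean, std)
test_scaled = transform(test_column, mean, std)  # test reuses the train statistics

print(train_scaled)
print(test_scaled)
```

Because the test values are scaled with the training statistics, they need not have zero mean or unit variance themselves.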

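4. Logistic Regression

(1) Tuning

# liblinear supports both 'l1' and 'l2'; the default solver in newer
# scikit-learn releases does not accept 'l1'.
lr = LogisticRegression(solver='liblinear')
# Parameters to tune
param = {'C': [1e-3, 0.01, 0.1, 1, 10, 100, 1e3], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(estimator=lr, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best cross-validation score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

==> Best parameters: {'C': 0.1, 'penalty': 'l1'}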

(2) Model evaluation

lr = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
lr.fit(x_train_stand, y_train)
y_pre_lr = lr.predict(x_test_stand)
model_metrics(lr, y_test, y_pre_lr)

Output

The accuracy is 0.7890679747722494
The precision is 0.6746987951807228
The recall is 0.31197771587743733

5. SVM

(1) Tuning

svm = SVC(random_state=2018, probability=True)
param = {'C': [0.01, 0.1, 1]}
grid = GridSearchCV(estimator=svm, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best cross-validation score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

==> Best parameters: {'C': 0.1}

(2) Model evaluation

svm = SVC(C=0.1, random_state=2018, probability=True)
svm.fit(x_train_stand, y_train)
y_pre_svm = svm.predict(x_test_stand)
model_metrics(svm, y_test, y_pre_svm)

Output

The accuracy is 0.7575332866152769
The precision is 0.8823529411764706
The recall is 0.04178272980501393

6. Decision Tree

(1) Tuning

dt = DecisionTreeClassifier(max_depth=9, min_samples_split=50, min_samples_leaf=90,
                            max_features='sqrt', random_state=2018)
# The parameters were tuned in stages, one grid per run. As written, only the
# last assignment to param is active; the comments record each stage's result.
param = {'max_depth': range(3, 14, 2), 'min_samples_split': range(100, 801, 200)}
# Best parameters: {'max_depth': 9, 'min_samples_split': 300}
param = {'min_samples_split': range(50, 1000, 100), 'min_samples_leaf': range(60, 101, 10)}
# Best parameters: {'min_samples_leaf': 90, 'min_samples_split': 50}
param = {'max_features': range(7, 20, 2)}
# Best parameters: {'max_features': 9}
grid = GridSearchCV(estimator=dt, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best cross-validation score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
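
The decision tree is tuned in stages: a small grid is searched, its best values are fixed, and the next grid is searched on top of them. The sketch below shows that pattern in isolation. It is a toy under stated assumptions: grid_search is a hypothetical helper, and toy_score is a made-up function standing in for a cross-validated AUC, so no model is trained.

```python
from itertools import product


def grid_search(score_fn, fixed, grid):
    """Try every combination in grid on top of fixed; return the best params and score."""
    best_params, best_score = None, float("-inf")
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = {**fixed, **dict(zip(names, values))}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical stand-in for a cross-validated score; it peaks at
    # max_depth=9 and min_samples_leaf=90.
    depth_term = (params["max_depth"] - 9) ** 2
    leaf_term = ((params["min_samples_leaf"] - 90) / 10) ** 2
    return -depth_term - leaf_term


# Starting values, then one small grid per stage.
params = {"max_depth": 3, "min_samples_leaf": 60}
stages = [
    {"max_depth": range(3, 14, 2)},
    {"min_samples_leaf": range(60, 101, 10)},
]
for stage in stages:
    params, score = grid_search(toy_score, params, stage)

print(params, score)
```

Here the two stages evaluate 6 + 5 = 11 candidates instead of the 30 a joint grid would need. The trade-off is that staged search can miss the best joint setting when parameters interact, which the toy score deliberately does not model.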

(2) Model evaluation

dt = DecisionTreeClassifier(max_depth=9, min_samples_split=50, min_samples_leaf=90,
                            max_features=9, random_state=2018)
dt.fit(x_train_stand, y_train)
y_pre_dt = dt.predict(x_test_stand)
model_metrics(dt, y_test, y_pre_dt)

Output

The accuracy is 0.7561317449194114
The precision is 0.5578947368421052
The recall is 0.14763231197771587

7. Random Forest

## Random Forest
# param = {'n_estimators': range(1, 200, 5), 'max_features': ['log2', 'sqrt', 'auto']}
# Best parameters: {'max_features': 'sqrt', 'n_estimators': 171}
rf = RandomForestClassifier(n_estimators=171, max_features='sqrt', random_state=2018)
rf.fit(x_train_stand, y_train)
y_pre_rf = rf.predict(x_test_stand)
model_metrics(rf, y_test, y_pre_rf)

Output

The accuracy is 0.7848633496846531
The precision is 0.6857142857142857
The recall is 0.26740947075208915

8. GBDT

# gbdt = GradientBoostingClassifier(random_state=2018)
# param = {'n_estimators': range(1, 100, 10), 'learning_rate': np.arange(0.1, 1, 0.1)}
# grid = GridSearchCV(estimator=gbdt, param_grid=param, scoring='roc_auc', cv=5)
# grid.fit(x_train_stand, y_train)
# print('Best parameters:', grid.best_params_)
# print('Best cross-validation score on the training set:', grid.best_score_)
# print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'learning_rate': 0.1, 'n_estimators': 41}
gbdt = GradientBoostingClassifier(learning_rate=0.1, n_estimators=41, random_state=2018)
gbdt.fit(x_train_stand, y_train)
y_pre_gbdt = gbdt.predict(x_test_stand)
model_metrics(gbdt, y_test, y_pre_gbdt)

9. XGBoost

## Tuning
param = {'n_estimators': range(20, 200, 20)}
# param = {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 12, 2)}
# param = {'gamma': [i / 10 for i in range(1, 6)]}
# param = {'subsample': [i / 10 for i in range(5, 10)], 'colsample_bytree': [i / 10 for i in range(5, 10)]}
# param = {'reg_alpha': [1e-5, 1e-2, 0.1, 0, 1, 100]}
# param = {'n_estimators': range(20, 200, 20)}
xgboost = xgb.XGBClassifier(learning_rate=0.1, n_estimators=40, max_depth=3, min_child_weight=11,
                            reg_alpha=0.01, gamma=0.1, subsample=0.7, colsample_bytree=0.7,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=2018)
# Pass the classifier instance, not the xgb module. The iid=False argument of
# the original call is dropped: it was removed from newer scikit-learn, whose
# behaviour matches iid=False.
grid = GridSearchCV(estimator=xgboost, param_grid=param, scoring='roc_auc', n_jobs=4, cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best cross-validation score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'n_estimators': 40}
# Best cross-validation score on the training set: 0.8028110571725202
# Score on the test set: 0.7770857458817146

## Model evaluation
xgboost = xgb.XGBClassifier(learning_rate=0.1, n_estimators=40, max_depth=3, min_child_weight=11,
                            reg_alpha=0.01, gamma=0.1, subsample=0.7, colsample_bytree=0.7,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=2018)
xgboost.fit(x_train_stand, y_train)
y_pre_xgb = xgboost.predict(x_test_stand)
model_metrics(xgboost, y_test, y_pre_xgb)

Output

The accuracy is 0.7876664330763841
The precision is 0.6521739130434783
The recall is 0.3342618384401114

10. LightGBM

## Tuning
gbm = lgb.LGBMClassifier(seed=2018)
param = {'learning_rate': np.arange(0.1, 0.5, 0.1), 'max_depth': range(1, 6, 1),
         'n_estimators': range(30, 50, 5)}
# The iid=False argument of the original call is dropped: it was removed from
# newer scikit-learn, whose behaviour matches iid=False.
grid = GridSearchCV(estimator=gbm, param_grid=param, scoring='roc_auc', n_jobs=4, cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best cross-validation score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 40}
# Best cross-validation score on the training set: 0.8007228827289531
# Score on the test set: 0.7729296422647178

## Model evaluation
gbm = lgb.LGBMClassifier(learning_rate=0.1, max_depth=3, n_estimators=40, seed=2018)
gbm.fit(x_train_stand, y_train)
y_pre_gbm = gbm.predict(x_test_stand)
model_metrics(gbm, y_test, y_pre_gbm)

Output

The accuracy is 0.7932725998598459
The precision is 0.6839080459770115
The recall is 0.33147632311977715

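IV. Problems encountered

1. UnboundLocalError: local variable 'xxx' referenced before assignment

Error
UnboundLocalError: local variable 'xxx' referenced before assignment

This error appears when a variable n is defined outside a function and the function then assigns to that same name as part of a calculation.

The cause is that the interpreter is not told whether the name is global or local. Assigning to a name anywhere inside a function makes it local to that function, so reading it before the assignment fails.

Solution: rename the variable so that the two names no longer clash.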

2. ImportError: [joblib] Attempting to do parallel computing without protecting

Error
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information

Solution: put the code that starts the parallel work (for example the GridSearchCV fit call with n_jobs greater than 1) under if __name__ == '__main__':.

3. Recall

Why is recall low across all the models?
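
The article leaves this question open. One possible contributor, offered here as an assumption rather than a finding, is the decision threshold: the models are tuned on roc_auc, which does not depend on a threshold, while predict applies a fixed cutoff (0.5 on the predicted probability for most of these classifiers). If positives are the minority class, many of them can score below that cutoff. The sketch below uses made-up probabilities to show how recall changes with the cutoff; recall_at is a hypothetical helper.

```python
def recall_at(threshold, y_true, scores):
    """Recall of the positive class when predicting 1 for scores >= threshold."""
    predicted = [1 if score >= threshold else 0 for score in scores]
    pairs = list(zip(y_true, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp / (tp + fn)


# Made-up labels and predicted probabilities of the positive (overdue) class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.8, 0.45, 0.35, 0.2, 0.6, 0.4, 0.3, 0.2, 0.15, 0.1, 0.1, 0.05]

for threshold in (0.5, 0.3):
    print(threshold, recall_at(threshold, y_true, scores))
```

Lowering the cutoff raises recall at the cost of more false positives, so whether it helps depends on how the two kinds of error are weighed. Checking the class balance of status and the predicted probabilities from predict_proba would show whether this explanation applies to the real data.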
