Building a Loan Delinquency Model, Part 6: Feature Selection

Published: 2025/3/19

Contents

    • I. IV (Information Value)
      • 1. Overview
      • 2. Computing IV
        • (1) WOE
        • (2) IV calculation
    • II. Implementation
      • 0. Required modules
      • 1. IV
      • 2. Random Forest
      • 3. Merging features
      • 4. Building the models
      • 5. Evaluating the models

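Data download (data.csv): https://pan.baidu.com/s/1G1b2QJjYkkk7LDfGorbj5Q
Goal: the dataset is financial data (not anonymized), and the aim is to predict whether a loan customer will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.

Task: select features with IV and with a random forest, then build models (logistic regression, SVM, decision tree, random forest, GBDT, XGBoost, and LightGBM) and evaluate them.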

I. IV (Information Value)

1. Overview

IV (Information Value) measures a variable's predictive power: the larger a feature's IV, the more that feature helps separate the two classes.

Applicability: supervised models with a binary target.

Commonly used IV ranges are interpreted as follows:

  • IV ≤ 0.02: no predictive power
  • 0.02 < IV ≤ 0.1: weak predictive power
  • IV > 0.1: usable predictive power. In practice, variables with IV greater than 0.1 are the ones kept during selection.
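The thresholds above can be written as a small helper. This is only a sketch: the function name `iv_strength` and the example IV values are hypothetical and do not come from the dataset.

```python
def iv_strength(iv: float) -> str:
    """Map an IV value to the rule-of-thumb bands listed above."""
    if iv <= 0.02:
        return "no predictive power"
    if iv <= 0.1:
        return "weak predictive power"
    return "usable predictive power"


# Hypothetical IV values, for illustration only.
example_ivs = {"feature_a": 0.01, "feature_b": 0.07, "feature_c": 0.35}

# Keep only the features whose IV exceeds 0.1, as in the selection rule.
selected = [name for name, iv in example_ivs.items() if iv > 0.1]
print(selected)
```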


2. Computing IV

WOE is the basis for computing IV.

(1) WOE

WOE (Weight of Evidence) is a way of encoding the original independent variable.

  • First, split the feature into groups (also called discretization or binning).
  • Then, for group $i$, compute $WOE_i$:
    $$WOE_i = \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right)$$
    Here $p_{y_i}$ is the share of all responding customers in the sample (in a risk model, the defaulting customers) that fall in this group, and $p_{n_i}$ is the share of all non-responding customers that fall in this group. $\#y_i$ is the number of responding customers in the group, $\#n_i$ the number of non-responding customers in the group, $\#y_T$ the number of responding customers in the whole sample, and $\#n_T$ the number of non-responding customers in the whole sample.
    ==> $WOE_i$ expresses the difference between "the group's share of all responding customers" and "the group's share of all non-responding customers".
  • Rearranging the formula:
    $$WOE_i = \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right) = \ln\left(\frac{\#y_i/\#n_i}{\#y_T/\#n_T}\right)$$
    ==> $WOE_i$ compares the ratio of responding to non-responding customers inside the group with the same ratio over the whole sample.
    ==> The larger the WOE, the more likely samples in the group are to respond; the smaller (more negative) the WOE, the less likely they are to respond. A WOE of 0 means the group's ratio equals the overall ratio.
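As a quick numeric check of the two equivalent forms of the formula, the sketch below computes WOE for three bins. The bin names and counts are made up for illustration and are not taken from the dataset.

```python
from math import log

# Hypothetical bin counts: (responders, non_responders) per group.
bins = {"low": (10, 90), "mid": (20, 80), "high": (70, 30)}

y_total = sum(y for y, _ in bins.values())  # all responders, #y_T
n_total = sum(n for _, n in bins.values())  # all non-responders, #n_T


def woe(y_i: int, n_i: int) -> float:
    """WOE_i = ln((#y_i / #y_T) / (#n_i / #n_T))."""
    return log((y_i / y_total) / (n_i / n_total))


def woe_odds_form(y_i: int, n_i: int) -> float:
    """Rearranged form: ln((#y_i / #n_i) / (#y_T / #n_T))."""
    return log((y_i / n_i) / (y_total / n_total))


woe_by_bin = {name: woe(y, n) for name, (y, n) in bins.items()}
for name, value in woe_by_bin.items():
    print(f"{name}: WOE = {value:+.4f}")
```

In this toy sample the "high" bin holds most of the responders, so its WOE is positive, while the other two bins come out negative.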

(2) IV calculation

$$IV_i = (p_{y_i} - p_{n_i}) \cdot WOE_i = (p_{y_i} - p_{n_i}) \cdot \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \left(\#y_i/\#y_T - \#n_i/\#n_T\right) \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right)$$
$$IV = \sum_{i=1}^{n} IV_i$$
where $n$ is the number of groups of the feature.

II. Implementation

0. Required modules

    import pandas as pd
    from pandas import DataFrame as df
    from numpy import log
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    import xgboost as xgb
    import lightgbm as lgb
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_score, f1_score
    from sklearn.metrics import roc_auc_score, recall_score, roc_curve, auc
    import matplotlib.pyplot as plt

1. IV

    def calcWOE(dataset, col, target):
        # Group by each distinct value of the feature and count the samples per group
        subdata = df(dataset.groupby(col)[col].count())
        # Number of responding (overdue) customers in each group
        suby = df(dataset.groupby(col)[target].sum())
        # Join the two tables on the group index
        data = df(pd.merge(subdata, suby, how='left', left_index=True, right_index=True))
        # Totals: all samples (total), responders (b_total), non-responders (g_total)
        b_total = data[target].sum()
        total = data[col].sum()
        g_total = total - b_total
        # WOE formula
        data["bad"] = data[target] / b_total
        data["good"] = (data[col] - data[target]) / g_total
        data["WOE"] = log(data["bad"] / data["good"])
        return data.loc[:, ["bad", "good", "WOE"]]

    def calcIV(dataset):
        # IV is the sum over groups of (bad - good) * WOE
        return ((dataset["bad"] - dataset["good"]) * dataset["WOE"]).sum()

    file_name = '1.csv'
    data = pd.read_csv(file_name, encoding='gbk')
    X = data.drop(labels="status", axis=1)
    print(X.shape)
    y = data["status"]
    col_list = [col for col in data.drop(labels=['Unnamed: 0', 'status'], axis=1)]
    data_IV = df()
    fea_iv = []
    for col in col_list:
        col_WOE = calcWOE(data, col, "status")
        # Drop groups whose values are nan, inf, or -inf
        col_WOE = col_WOE[~col_WOE.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
        col_IV = calcIV(col_WOE)
        # Keep features with IV above 0.1
        if col_IV > 0.1:
            data_IV[col] = [col_IV]
            fea_iv.append(col)
    data_IV.to_csv('data_IV.csv', index=0)
    print(fea_iv)

Output

['trans_amount_increase_rate_lately', 'trans_activity_day', 'repayment_capability', 'first_transaction_time', 'historical_trans_day', 'rank_trad_1_month', 'trans_amount_3_month', 'abs', 'avg_price_last_12_month', 'trans_fail_top_count_enum_last_1_month', 'trans_fail_top_count_enum_last_6_month', 'trans_fail_top_count_enum_last_12_month', 'max_cumulative_consume_later_1_month', 'pawns_auctions_trusts_consume_last_1_month', 'pawns_auctions_trusts_consume_last_6_month', 'first_transaction_day', 'trans_day_last_12_month', 'apply_score', 'loans_score', 'loans_count', 'loans_overdue_count', 'history_suc_fee', 'history_fail_fee', 'latest_one_month_suc', 'latest_one_month_fail', 'loans_avg_limit', 'consfin_credit_limit', 'consfin_max_limit', 'consfin_avg_limit', 'loans_latest_day']
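The implementation above groups by each distinct raw value. For a continuous feature this produces many tiny groups, some with no responders or no non-responders, and their infinite WOE rows are then dropped. The binning step described in section I could be added before the WOE calculation. The following is a minimal sketch of equal-frequency binning in plain Python, where `equal_frequency_bins` and the sample values are hypothetical. With pandas, `pd.qcut` provides quantile-based binning and would be the more natural choice inside the code above.

```python
def equal_frequency_bins(values, n_bins=5):
    """Assign each value a bin index in [0, n_bins) based on its rank.

    Ties are split by position, so equal values may land in adjacent bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_bins // len(values)
    return labels


# Hypothetical continuous feature values, for illustration only.
amounts = [3.2, 1.5, 9.9, 4.4, 7.1, 0.3, 5.6, 8.8, 2.0, 6.3]
bin_labels = equal_frequency_bins(amounts, n_bins=5)
print(bin_labels)  # [1, 0, 4, 2, 3, 0, 2, 4, 1, 3]
```

The resulting labels could then replace the raw column before grouping, so that each group holds roughly the same number of samples.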

2. Random Forest

    rfc = RandomForestClassifier()
    rfc.fit(X, y)
    # Rank features by impurity-based importance and keep the top 20
    rfc_impc = pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)
    fea_gini = rfc_impc[:20].index.tolist()
    print(fea_gini)

Output

['trans_fail_top_count_enum_last_1_month', 'history_fail_fee', 'loans_score', 'apply_score', 'latest_one_month_fail', 'trans_fail_top_count_enum_last_12_month', 'Unnamed: 0', 'trans_amount_3_month', 'trans_activity_day', 'max_cumulative_consume_later_1_month', 'repayment_capability', 'historical_trans_amount', 'consfin_credit_limit', 'latest_query_day', 'pawns_auctions_trusts_consume_last_6_month', 'first_transaction_time', 'loans_overdue_count', 'history_suc_fee', 'trans_days_interval', 'number_of_trans_from_2011']

3. Merging features

    # Union of the features selected by the random forest and by IV
    features = list(set(fea_gini) | set(fea_iv))
    X_final = X[features]
    print(X_final.shape)

(4754, 35)
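Analysis: feature selection reduces the data from (4754, 92) to (4754, 35), dropping 57 of the 92 columns. Note that the random-forest list includes 'Unnamed: 0', which is typically a saved row index rather than a real feature, so it is worth dropping it from X before fitting.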
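One caveat about `list(set(...))`: the iteration order of a set of strings can change between interpreter runs, so the column order of `X_final` is not reproducible. This may matter for models that sample features at random. The sketch below shows an order-preserving union. `ordered_union` is a hypothetical helper, and the two lists are shortened for illustration.

```python
def ordered_union(first, second):
    """Union of two feature lists, keeping first-seen order and dropping duplicates."""
    return list(dict.fromkeys([*first, *second]))


# Shortened, illustrative lists (not the full selections shown above).
fea_gini_demo = ["loans_score", "history_fail_fee", "apply_score"]
fea_iv_demo = ["apply_score", "loans_score", "loans_count"]

features_demo = ordered_union(fea_gini_demo, fea_iv_demo)
print(features_demo)
# ['loans_score', 'history_fail_fee', 'apply_score', 'loans_count']
```

The size of the union also explains the shape above: 30 IV features plus 20 random-forest features, with 15 shared between the two lists, gives 30 + 20 - 15 = 35 columns.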

4. Building the models

    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.3, random_state=2019)

    # Model 1: Logistic Regression
    lr = LogisticRegression()
    lr.fit(X_train, y_train)

    # Model 2: SVM
    svm = SVC(kernel='linear', probability=True)
    svm.fit(X_train, y_train)

    # Model 3: Decision Tree
    dtc = DecisionTreeClassifier(max_depth=8)
    dtc.fit(X_train, y_train)

    # Model 4: Random Forest
    rfc = RandomForestClassifier()
    rfc.fit(X_train, y_train)

    # Model 5: GBDT
    gbdt = GradientBoostingClassifier()
    gbdt.fit(X_train, y_train)

    # Model 6: XGBoost
    xgbc = xgb.XGBClassifier()
    xgbc.fit(X_train, y_train)

    # Model 7: LightGBM
    lgbc = lgb.LGBMClassifier()
    lgbc.fit(X_train, y_train)

5. Evaluating the models

    # Model evaluation
    def model_metrics(clf, X_train, X_test, y_train, y_test):
        y_train_pred = clf.predict(X_train)
        y_test_pred = clf.predict(X_test)
        y_train_prob = clf.predict_proba(X_train)[:, 1]
        y_test_prob = clf.predict_proba(X_test)[:, 1]

        # Accuracy
        print('Accuracy: ', end=' ')
        print('Train: ', '%.4f' % accuracy_score(y_train, y_train_pred), end=' ')
        print('Test: ', '%.4f' % accuracy_score(y_test, y_test_pred))

        # Precision
        print('Precision:', end=' ')
        print('Train: ', '%.4f' % precision_score(y_train, y_train_pred), end=' ')
        print('Test: ', '%.4f' % precision_score(y_test, y_test_pred))

        # Recall
        print('Recall:', end=' ')
        print('Train: ', '%.4f' % recall_score(y_train, y_train_pred), end=' ')
        print('Test: ', '%.4f' % recall_score(y_test, y_test_pred))

        # F1 score
        print('f1-score:', end=' ')
        print('Train: ', '%.4f' % f1_score(y_train, y_train_pred), end=' ')
        print('Test: ', '%.4f' % f1_score(y_test, y_test_pred))

        # AUC
        print('auc:', end=' ')
        print('Train: ', '%.4f' % roc_auc_score(y_train, y_train_prob), end=' ')
        print('Test: ', '%.4f' % roc_auc_score(y_test, y_test_prob))

        # ROC curve
        fpr_train, tpr_train, thred_train = roc_curve(y_train, y_train_prob, pos_label=1)
        fpr_test, tpr_test, thred_test = roc_curve(y_test, y_test_prob, pos_label=1)
        label = ['Train - AUC:{:.4f}'.format(auc(fpr_train, tpr_train)),
                 'Test - AUC:{:.4f}'.format(auc(fpr_test, tpr_test))]
        plt.plot(fpr_train, tpr_train)
        plt.plot(fpr_test, tpr_test)
        plt.plot([0, 1], [0, 1], 'k--')  # diagonal reference line (random classifier)
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.legend(label, loc=4)
        plt.title('ROC Curve')
        plt.show()  # one figure per model instead of overlaying all curves

    model_metrics(lr, X_train, X_test, y_train, y_test)
    model_metrics(svm, X_train, X_test, y_train, y_test)
    model_metrics(dtc, X_train, X_test, y_train, y_test)
    model_metrics(rfc, X_train, X_test, y_train, y_test)
    model_metrics(gbdt, X_train, X_test, y_train, y_test)
    model_metrics(xgbc, X_train, X_test, y_train, y_test)
    model_metrics(lgbc, X_train, X_test, y_train, y_test)
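Accuracy, precision, recall, and F1 all follow from the four confusion-matrix counts. The sketch below recomputes them in plain Python so that the edge case of a model predicting no positive samples is visible. The function `binary_metrics` and the label vectors are hypothetical. The `zero_division` argument here only mirrors the idea of the parameter of the same name that scikit-learn's `precision_score` accepts in recent versions.

```python
def binary_metrics(y_true, y_pred, zero_division=0.0):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def safe_div(num, den):
        # Fall back to `zero_division` when the ratio is 0/0 or x/0.
        return num / den if den else zero_division

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels and predictions, for illustration only.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true_demo, y_pred_demo))

# A model that never predicts the positive class: precision is 0/0.
print(binary_metrics(y_true_demo, [0] * 8))
```

In the second call the accuracy is still 0.5 although precision, recall, and F1 all fall back to 0, which is why accuracy alone can be misleading on imbalanced loan data.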

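Problems encountered:

  • TypeError: 'list' object is not callable
    Cause: the name list had been rebound to a list object earlier in the code, so the built-in list() could no longer be called (as in the list(set(...)) call of the feature-merging step). Tip: do not give an object the same name as a built-in, a keyword, or an imported function.

  • UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. 'precision', 'predicted', average, warn_for)
    This warning appeared at prediction time, and some metrics for the same model came out as 0. It means the model predicted no positive samples, so precision is 0/0 and is reported as 0. The issue had not been resolved at the time of writing.

Reference:
https://blog.csdn.net/kevin7658/article/details/50780391/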
