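[Blog] https://blog.csdn.net/sunyaowu315
[Blog outline] https://blog.csdn.net/sunyaowu315/article/details/82905347
For the project data and code, join QQ group 102755159, or leave a comment and the author will send them by email.
If you are interested in financial risk control, machine learning, data science, or big data analytics, you are welcome to connect on WeChat (mention it in your email and I will include my WeChat ID).
Without further ado, here is the code.
Python: Auto Finance Scorecard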
# coding: utf-8
# Table of Contents
# 1 Reject inference
#   1.1 Step 1: prepare the datasets, separating explanatory variables from the target (required by the KNN function)
#   1.2 Step 2: impute missing values and normalize (also required by the KNN function)
#   1.3 Step 3: fit the model and predict
#   1.4 Step 4: merge approved and rejected applicants
# 2 Build the default prediction model
#   2.1 Coarse variable screening
#   2.2 Fine variable screening and data cleaning
#   2.3 Variable binning and WOE transformation
#   2.4 Build the classification model
#   2.5 Model validation
#   2.6 Scorecard development
# In[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# get_ipython().magic('matplotlib inline')
# In[2]:
os.chdir(r'F:\script\0\script_credit')
# In[3]:
accepts = pd.read_csv('accepts.csv')
rejects = pd.read_csv('rejects.csv')
# In[ ]:
'''
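#Credit risk modeling case study
##Data description: this dataset contains auto loan default records
##name---meaning
##application_id---applicant ID
##account_number---account number
##bad_ind---default indicator
##vehicle_year---year the vehicle was purchased
##vehicle_make---vehicle manufacturer
##bankruptcy_ind---prior bankruptcy indicator
##tot_derog---number of derogatory credit events in the past five years (e.g. a phone number cancelled for unpaid bills)
##tot_tr---total number of accounts
##age_oldest_tr---age of the oldest account (months)
##tot_open_tr---number of open accounts
##tot_rev_tr---number of open revolving credit accounts (e.g. credit cards)
##tot_rev_debt---balance on open revolving credit accounts (e.g. credit card debt)
##tot_rev_line---revolving credit limit (credit card limit)
##rev_util---revolving credit utilization (balance / limit)
##fico_score---FICO score
##purch_price---vehicle purchase price (yuan)
##msrp---manufacturer's suggested retail price
##down_pyt---down payment on the installment plan
##loan_term---loan term (months)
##loan_amt---loan amount
##ltv---loan amount / suggested retail price * 100
##tot_income---average monthly income (yuan)
##veh_mileage---vehicle mileage (miles)
##used_ind---used-vehicle indicator
##weight---sample weight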
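'''
##################################################################################################################
# ## 1. Reject inference
# ### Step 1: prepare the datasets, separating explanatory variables from the target (required by the KNN function)
# In[4]:
# Select a subset of variables for KNN: because the KNN algorithm requires continuous variables, only a few important continuous variables are used for the KNN model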
accepts_x = accepts[["tot_derog","age_oldest_tr","rev_util","fico_score","ltv"]]
# In[5]:
accepts_y = accepts['bad_ind']
# In[6]:
rejects_x = rejects[["tot_derog","age_oldest_tr","rev_util","fico_score","ltv"]]
# In[ ]:
# ### Step 2: impute missing values and normalize (also required by the KNN function)
# In[ ]:
# Inspect the datasets
rejects_x.info()
# In[ ]:
accepts_x.info()
# In[ ]:
# Use the KNN method from the fancyimpute package to impute missing values
# I normally impute with the mean or the mode when modeling; this is shown only because readers asked about multiple imputation. Two points to keep in mind:
# 1. Never include Y in the dataset being imputed
# 2. Avoid multiple imputation for variables with more than 30% missing values, because collinearity becomes severe
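import fancyimpute as fimp
# Newer fancyimpute releases expose fit_transform() (older ones used complete()),
# and DataFrame.as_matrix() has been removed from pandas in favor of to_numpy()
accepts_x_filled = pd.DataFrame(fimp.KNN(k=3).fit_transform(accepts_x.to_numpy()))
accepts_x_filled.columns = accepts_x.columns
rejects_x_filled = pd.DataFrame(fimp.KNN(k=3).fit_transform(rejects_x.to_numpy()))
rejects_x_filled.columns = rejects_x.columns
# In[8]:
# Normalize the data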
from sklearn.preprocessing import Normalizer
# Note: Normalizer rescales each row (sample) to unit L2 norm; it does not standardize columns
accepts_x_norm = pd.DataFrame(Normalizer().fit_transform(accepts_x_filled))
accepts_x_norm.columns = accepts_x_filled.columns
rejects_x_norm = pd.DataFrame(Normalizer().fit_transform(rejects_x_filled))
rejects_x_norm.columns = rejects_x_filled.columns
# ### Step 3: fit the model and predict
# In[9]:
# Use a KNN model to predict labels for the rejected applicants (reject inference)
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=5, weights='distance')
neigh.fit(accepts_x_norm, accepts_y)
# In[10]:
rejects['bad_ind'] = neigh.predict(rejects_x_norm)
# ### Step 4: merge approved and rejected applicants
# In[ ]:
# The accepts data oversamples defaulted customers,
# so rejects must be sampled in the same proportion
# In[11]:
rejects_res = rejects[rejects['bad_ind'] == 0].sample(1340)
rejects_res = pd.concat([rejects_res, rejects[rejects['bad_ind'] == 1]], axis = 0)
# In[12]:
data = pd.concat([accepts.iloc[:, 2:-1], rejects_res.iloc[:,1:]], axis = 0)
##################################################################################################################
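The normalization step above uses `Normalizer`, which rescales each row to unit length and is fitted separately on accepts and rejects. For a distance-based model such as KNN, a more common choice is to standardize each column with statistics taken from the accepted sample and reuse them for the rejected sample. The sketch below illustrates the difference with NumPy on made-up numbers; `standardize_with_reference` and the demo arrays are hypothetical and not part of the original notebook. scikit-learn's `StandardScaler`, fitted on accepts and applied to rejects, would do the same job.

```python
import numpy as np

def standardize_with_reference(reference, other):
    """Z-score both arrays using the column means/stds of `reference` only."""
    mean = reference.mean(axis=0)
    std = reference.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (reference - mean) / std, (other - mean) / std

# Made-up rows: [tot_derog, fico_score]
accepts_demo = np.array([[0.0, 720.0], [2.0, 650.0], [4.0, 580.0]])
rejects_demo = np.array([[6.0, 540.0]])

acc_z, rej_z = standardize_with_reference(accepts_demo, rejects_demo)

# Row-wise L2 normalization (what Normalizer does): fico_score dominates every row
row_unit = accepts_demo / np.linalg.norm(accepts_demo, axis=1, keepdims=True)

print(acc_z.round(2))
print(rej_z.round(2))
print(row_unit.round(4))
```

After column-wise standardization both features contribute on a comparable scale, whereas in the row-normalized version the second component is close to 1 for every applicant, so distances carry little information.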
# ## 2. Build the default prediction model
# ### Coarse variable screening
# In[13]:
# Encode the categorical variable
bankruptcy_dict = {'N':0, 'Y':1}
data.bankruptcy_ind = data.bankruptcy_ind.map(bankruptcy_dict)
# In[14]:
# Cap outliers in the vehicle year, then convert the year to the vehicle's age
# This is only an example; every continuous variable should be treated the same way
year_min = data.vehicle_year.quantile(0.1)
year_max = data.vehicle_year.quantile(0.99)
data.vehicle_year = data.vehicle_year.clip(lower=year_min, upper=year_max)
data.vehicle_year = 2018 - data.vehicle_year  # age relative to 2018, the reference year used in this post
# In[15]:
data.drop(['vehicle_make'], axis = 1, inplace = True)
# In[ ]:
data_filled = pd.DataFrame(fimp.KNN(k=3).fit_transform(data.to_numpy()))
data_filled.columns = data.columns
# In[17]:
X = data_filled[['age_oldest_tr', 'bankruptcy_ind', 'down_pyt', 'fico_score','loan_amt', 'loan_term', 'ltv', 'msrp', 'purch_price', 'rev_util','tot_derog', 'tot_income', 'tot_open_tr', 'tot_rev_debt','tot_rev_line', 'tot_rev_tr', 'tot_tr', 'used_ind', 'veh_mileage','vehicle_year']]
y = data_filled['bad_ind']
# In[18]:
# Use random forest feature importances to screen variables
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=5, random_state=0)
clf.fit(X,y)
# In[19]:
# Rank features by importance and keep the top 9 as (importance, name) pairs
importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
col_top = [(importance, name) for name, importance in importances.head(9).items()]
col_top
# In[20]:
col = [i[1] for i in col_top]
# ### Fine variable screening and data cleaning
# In[21]:
from PyWoE import WoE
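The `WoE` class imported above is used in the following cells to bin each variable and to compute its weight of evidence (WOE) and information value (IV). As a rough sketch of what those two quantities mean, the example below computes them with plain pandas for a feature that is already binned. It assumes the convention WOE = ln(share of goods / share of bads); PyWoE's own sign convention and its bin optimization are not reproduced here, and `woe_iv_table` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd

def woe_iv_table(bins, bad):
    """Return a per-bin WOE table and the total IV for a binned feature.

    Assumed convention: WOE = ln(share of goods / share of bads).
    Bins with zero goods or zero bads would need smoothing first.
    """
    df = pd.DataFrame({"bin": bins, "bad": bad})
    table = df.groupby("bin")["bad"].agg(["sum", "count"])
    table = table.rename(columns={"sum": "bad", "count": "total"})
    table["good"] = table["total"] - table["bad"]
    good_share = table["good"] / table["good"].sum()
    bad_share = table["bad"] / table["bad"].sum()
    table["woe"] = np.log(good_share / bad_share)
    iv = float(((good_share - bad_share) * table["woe"]).sum())
    return table, iv

# Made-up example: 20 loans in two pre-defined bins
bins_demo = ["low"] * 10 + ["high"] * 10
bad_demo = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8
table_demo, iv_demo = woe_iv_table(bins_demo, bad_demo)
print(table_demo)
print("IV:", round(iv_demo, 4))
```

Under this convention a bin with relatively more defaults gets a negative WOE, and a larger IV indicates a variable that separates goods from bads more strongly.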
import warnings
warnings.filterwarnings("ignore")
# In[22]:
data_filled.head()
# In[23]:
# Information value (IV) of each selected variable; variables that fail are printed
iv_c = {}
for i in col:
    try:
        iv_c[i] = WoE(v_type='c').fit(data_filled[i],data_filled['bad_ind']).optimize().iv
    except Exception:
        print(i)
pd.Series(iv_c).sort_values(ascending=False)
# ### Variable binning and WOE transformation
# In[24]:
WOE_c = data_filled[col].apply(lambda s: WoE(v_type='c').fit(s,data_filled['bad_ind']).optimize().fit_transform(s,data_filled['bad_ind']))
# In[25]:
WOE_c.head()
# ### Build the classification model
# In[26]:
# Split the dataset
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X = WOE_c
y = data_filled['bad_ind']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)
# In[27]:
import itertools

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    """Plot the confusion matrix with one labelled cell per (true, predicted) pair."""
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# In[28]:
# Build a logistic regression model to predict the probability of default
import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,recall_score,classification_report
# The L1 penalty needs a solver that supports it; liblinear does
lr = LogisticRegression(C = 1, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train,y_train.values.ravel())
y_pred = lr.predict(X_test.values)
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
# In[46]:
# Add a cost-sensitive parameter (class weights) and recompute
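The cost-sensitive refit in this cell relies on `class_weight='balanced'`. According to the scikit-learn documentation, that setting weights each class by n_samples / (n_classes * count of that class), so the rarer class counts for more in the loss. The sketch below reproduces that formula with the standard library on a made-up label vector; `balanced_class_weights` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Made-up labels: 8 non-defaults and 2 defaults
y_demo = [0] * 8 + [1] * 2
weights = balanced_class_weights(y_demo)
print(weights)
```

With 8 non-defaults and 2 defaults, each default is weighted four times as heavily as each non-default, which usually raises recall on the default class at the cost of more false alarms.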
import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,recall_score,classification_report
# The L1 penalty needs a solver that supports it; liblinear does
lr = LogisticRegression(C = 1, penalty = 'l1', solver = 'liblinear', class_weight='balanced')
lr.fit(X_train,y_train.values.ravel())
y_pred = lr.predict(X_test.values)
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
# ### Model validation
# In[47]:
from sklearn.metrics import roc_curve, auc
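The validation code in this section draws an ROC curve and a KS curve. The KS statistic itself is the largest gap between the cumulative share of defaults and the cumulative share of non-defaults when applicants are ranked from highest to lowest score. A minimal sketch with NumPy, assuming continuous scores such as predicted default probabilities; `ks_statistic` and the demo arrays are hypothetical and not part of the original notebook.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """Largest gap between cumulative bad rate and cumulative good rate."""
    y_true = np.asarray(y_true)
    scores = np.asarray(y_score, dtype=float)
    order = np.argsort(-scores, kind="mergesort")  # highest score first
    y_sorted = y_true[order]
    s_sorted = scores[order]
    # Evaluate only at the last position of each run of tied scores
    last_of_run = np.append(s_sorted[1:] != s_sorted[:-1], True)
    tpr = np.cumsum(y_sorted)[last_of_run] / y_sorted.sum()
    fpr = np.cumsum(1 - y_sorted)[last_of_run] / (1 - y_sorted).sum()
    return float(np.max(np.abs(tpr - fpr)))

# Made-up labels (1 = default) and predicted default probabilities
y_demo = [1, 1, 0, 1, 0, 0]
score_demo = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
ks_demo = ks_statistic(y_demo, score_demo)
print(round(ks_demo, 4))
```

The same value can be read off as the maximum of `tpr - fpr` from `roc_curve`, which is what the KS curve in this section plots.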
# ROC and KS need continuous scores, so use the predicted probability of default rather than the hard labels in y_pred
y_score = lr.predict_proba(X_test.values)[:, 1]
fpr, tpr, threshold = roc_curve(y_test, y_score, drop_intermediate=False)  # true positive rate and false positive rate
roc_auc = auc(fpr, tpr)  # area under the ROC curve
lw = 2
plt.figure(figsize=(10,10))
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)  # FPR on the x-axis, TPR on the y-axis
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
# In[31]:
# Plot the KS curve from the tpr and fpr returned by sklearn.metrics.roc_curve
fig, ax = plt.subplots()
# The first threshold returned by roc_curve is a sentinel above every score, so skip it
ks_x = 1 - threshold[1:]  # the KS curve is ordered by descending predicted probability, hence the 1 - threshold mirror
ax.plot(ks_x, tpr[1:], label='tpr')
ax.plot(ks_x, fpr[1:], label='fpr')
ax.plot(ks_x, tpr[1:] - fpr[1:], label='KS')
plt.xlabel('score')
plt.title('KS Curve')
legend = ax.legend(loc='upper left', shadow=True, fontsize='x-large')
plt.show()
# ### Scorecard development
# In[149]:
# Compute the score for each level of each variable
# Collect the optimized bins of every model variable into one scorecard table
bins_list = []
for i in X.columns:
    temp = WoE(v_type='c').fit(data_filled[i],data_filled['bad_ind']).optimize().bins
    temp['name'] = [i]*len(temp)
    bins_list.append(temp)
scorecard = pd.concat(bins_list, axis = 0)
scorecard['score'] = scorecard['woe'].map(lambda x: -int(np.ceil(28.8539*x)))
# In[151]:
# Base score
print('base score is {}'.format(int(np.ceil(28.8539*lr.intercept_[0]+513.561))))
# In[153]:
scorecard
# In[154]:
# Compute the score of every sample in the original data
def fico_score_cnvnt(x):
    if x < 6.657176e+02:
        return -21
    else:
        return 16

def age_oldest_tr_cnvnt(x):
    if x < 1.618624e+02:
        return -9
    else:
        return 20

def rev_util_cnvnt(x):
    if x < 7.050000e+01:
        return 7
    else:
        return -19

def ltv_cnvnt(x):
    if x < 9.450000e+01:
        return 16
    else:
        return -8

def tot_tr_cnvnt(x):
    if x < 1.085218e+01:
        return -13
    elif x < 1.330865e+01:
        return -4
    elif x < 1.798767e+01:
        return 3
    else:
        return 11

def tot_rev_line_cnvnt(x):
    if x < 1.201000e+04:
        return -12
    else:
        return 19

def tot_derog_cnvnt(x):
    if x < 1.072596e+00:
        return 8
    else:
        return -13

def purch_price_cnvnt(x):
    if x < 1.569685e+04:
        return -5
    else:
        return 3

def tot_rev_debt_cnvnt(x):
    if x < 1.024000e+04:
        return -2
    else:
        return 8
# In[155]:
# Pair each variable with its scoring function by name, so the mapping does not depend on column order
X_score_dict = {
    'fico_score': fico_score_cnvnt,
    'age_oldest_tr': age_oldest_tr_cnvnt,
    'rev_util': rev_util_cnvnt,
    'ltv': ltv_cnvnt,
    'tot_tr': tot_tr_cnvnt,
    'tot_rev_line': tot_rev_line_cnvnt,
    'tot_derog': tot_derog_cnvnt,
    'purch_price': purch_price_cnvnt,
    'tot_rev_debt': tot_rev_debt_cnvnt,
}
# In[157]:
score_cols = list(X_score_dict)
X_score = data_filled[score_cols].copy()
for i in score_cols:
    X_score[i] = X_score[i].map(X_score_dict[i])
# In[158]:
X_score['SCORE'] = X_score[score_cols].sum(axis=1) + 513
# In[159]:
X_score_label = pd.concat([X_score, data_filled['bad_ind']], axis = 1)
# In[160]:
X_score_label.head()
# In[161]:
# Compare the score distributions of defaulted and non-defaulted samples
fig, ax = plt.subplots()
sns.kdeplot(X_score_label[X_score_label['bad_ind'] == 1]['SCORE'], label='1', ax=ax)
sns.kdeplot(X_score_label[X_score_label['bad_ind'] == 0]['SCORE'], label='0', ax=ax)
ax.legend()
plt.show()
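The constants 28.8539 and 513.561 in the scorecard code above are not explained in the post. They are consistent with the common scaling Score = Offset + Factor * ln(odds), with 20 points to double the odds (PDO) and a base score of 600 at odds of 20:1. That reading is an assumption inferred from the numbers, not something the author states; the sketch below simply checks the arithmetic, and `woe_to_points` is a hypothetical helper mirroring the rule used in the code.

```python
import math

# Assumed scaling parameters (inferred from the constants, not stated in the post)
PDO = 20          # points needed to double the odds
BASE_SCORE = 600  # score assigned at the reference odds
BASE_ODDS = 20    # reference odds

factor = PDO / math.log(2)                          # points per unit of log-odds
offset = BASE_SCORE - factor * math.log(BASE_ODDS)  # score at odds of 1:1

def woe_to_points(woe):
    """Points for one bin, mirroring the rule in the code above: -ceil(factor * woe)."""
    return -int(math.ceil(factor * woe))

print(round(factor, 4), round(offset, 3))
print(woe_to_points(0.5), woe_to_points(-0.5))
```

Note that the rule above turns WOE into points directly. Many scorecards also multiply each WOE by the fitted logistic regression coefficient before scaling, which the post does not do.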