Lending Club Data: Data Analysis & Scorecard

Published: 2023/12/15

I. Project Goal

Study the risk characteristics of Lending Club loans and propose a modeling approach.

II. Data Acquisition

The dataset consists of loan records from the Lending Club platform for the first quarter of 2017. It can be downloaded from the Lending Club website after registering an account with an email address.

III. Data Exploration

1. Import the required tools

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')  # plot style
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif'] = ['SimHei']  # default font (needed for Chinese labels)
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign '-' from rendering as a box in saved figures

2. Load the data

data=pd.read_csv("LoanStats3d_securev1.csv",encoding='latin-1',skiprows = 1)
data.head()

An explanation of each variable is also available on the Lending Club website; it downloads as an Excel file.

# Look at the target feature
data.loan_status.value_counts()

Fully Paid            307831
Charged Off            77884
Current                33584
Late (31-120 days)      1006
In Grace Period          436
Late (16-30 days)        287
Default                   67
Name: loan_status, dtype: int64

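Fully Paid: paid off in full; Charged Off: written off as bad debt; Current: repayments are up to date; Late (31-120 days): 31 to 120 days overdue

In Grace Period: overdue but still within the grace period; Default: more than 90 days overdue

Reference: https://help.bitbond.com/article/20-the-10-loan-status-variants-explained

The counts show a severe imbalance between good and bad loans, which has to be handled before modeling.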

3. Encode the label first

# Wrap the replacement in a helper function
def coding(col, codeDict):
    colCoded = pd.Series(col, copy=True)  # work on a copy of the loan_status Series
    for key, value in codeDict.items():  # iterate over the (key, value) pairs
        colCoded.replace(key, value, inplace=True)  # replace each status with its code
    return colCoded

# Encode loan_status as default = 1, normal = 0
dict1={'Current':0,'Fully Paid':0,'Charged Off':1,'Late (31-120 days)':1,'Late (16-30 days)':1,'In Grace Period':1,"Default":1}
data["loan_status_class"]=coding(data["loan_status"],dict1)

data.loan_status_class.value_counts()

0    341415
1     79680
Name: loan_status_class, dtype: int64
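
The same 0/1 encoding can also be expressed as a set-membership test. The snippet below is a minimal sketch on a toy Series; `status` and `bad_statuses` are illustrative names that do not appear in the original code.

```python
import pandas as pd

# Toy statuses standing in for data["loan_status"] (illustrative values only).
status = pd.Series(["Fully Paid", "Charged Off", "Current", "Default"])

# The statuses that dict1 maps to 1 (default); everything else becomes 0.
bad_statuses = {"Charged Off", "Late (31-120 days)", "Late (16-30 days)",
                "In Grace Period", "Default"}

# isin() gives a boolean mask; casting to int yields the 0/1 label.
label = status.isin(bad_statuses).astype(int)
print(label.tolist())
```

Unlike a dictionary replace, this maps any status that is not listed to 0, so it is only a good fit when the list of bad statuses is known to be complete.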
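
3. Handle missing values

# Count the missing values in each column
for i in data.columns:
    miss=data[i].isnull().sum()
    print(i,"\t",miss)

(The screenshot of the output is incomplete.)

Many variables turn out to be entirely empty. Such columns carry no information, so they are dropped first.

# Keep only columns that are at least 80% non-null (columns with more than 20% missing are dropped)
half_count = int(len(data)*0.8)  # threshold: minimum number of non-null values
data = data.dropna(thresh = half_count, axis = 1 )  # a column with fewer non-null values than the threshold is dropped
data.shape

#data = data.drop(['desc', 'url', 'id'], axis = 1)  # drop some useless columns

(421095, 93)
# 93 columns remain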
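
Because `thresh` counts non-null values rather than missing ones, it is easy to read the threshold the wrong way round. A small sketch on a made-up frame (`toy` and its column names are assumptions for illustration) shows which columns survive:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "full": [1, 2, 3, 4, 5],                               # 100% non-null
    "mostly_full": [1, 2, 3, 4, np.nan],                   # 80% non-null
    "mostly_empty": [1, np.nan, np.nan, np.nan, np.nan],   # 20% non-null
})

# thresh is the minimum number of non-null values a column needs to be kept.
min_non_null = int(len(toy) * 0.8)
kept = toy.dropna(thresh=min_non_null, axis=1)
print(list(kept.columns))
```

Only the column that is mostly empty is removed; a column sitting exactly at the threshold is kept.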

4. Describe the data
Because there are many features, a subset of variables is selected for description.

col=["loan_amnt","term","int_rate","grade","emp_length","annual_inc","verification_status","loan_status","purpose","dti","delinq_2yrs","inq_last_6mths",'open_acc',"pub_rec","revol_bal","total_acc","total_rev_hi_lim","addr_state","home_ownership","emp_title","loan_status_class"]
df=data[col].copy()

# df keeps the original column names; df1 is a copy with readable labels for plotting
df1=df.copy()
df1.columns=["loan amount","term","interest rate","grade","employment length","annual income","income verification","loan status","purpose","debt-to-income ratio","30+ day delinquencies in the past 2 years","credit inquiries in the past 6 months","open accounts","public records","revolving balance","total accounts","total revolving credit limit","state","home ownership","job title","class"]

# Plot the counts of good and bad samples for each categorical attribute
cla=["term","grade","employment length","income verification","purpose","home ownership"]
for i in cla:
    pvt=pd.crosstab(df1[i],df1["class"])  # rows: category values, columns: class 0/1
    pvt.plot(kind="bar")

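From the charts:
1. (Top left) Most borrowers choose a 36-month term and fewer choose 60 months, but the overdue share of 60-month loans is clearly higher, reaching 37%. The longer the term, the greater the risk.
2. (Top right) LC's own grade is tied to the interest rate: as the grade falls, risk and interest both rise, and the overdue rate rises with them. The chart also reflects how well the LC grade discriminates.
3. (Bottom left) Notably, borrowers with 10+ years of employment are comparatively numerous, which runs against common intuition. Apart from that, among people with 1 to 9 years of employment, borrowing demand falls as tenure grows, possibly because income becomes more stable.
   (An aside: if that hypothesis holds, why do I get poorer the longer I work? Is it my ability? That's rough =.=)
4. (Bottom right) Most income sources are verified, and verified borrowers default relatively less often.

5. By purpose, debt consolidation, home improvement, and credit card repayment are the three most common (borrowers are fairly honest, which also suggests this feature may not be very useful).

6. By home ownership, mortgage and rent are the most common, and renters default relatively more often (consistent with typical household circumstances).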
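
The percentages quoted above are easier to read from a table than from bar heights. As a minimal sketch, the bad rate per category is just the mean of the 0/1 label within each group; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy data: loan term and a 0/1 label (1 = bad), illustrative values only.
toy = pd.DataFrame({
    "term": ["36", "36", "36", "60", "60", "36", "60", "60"],
    "label": [0, 0, 1, 1, 0, 0, 1, 0],
})

# count = number of loans in the group, mean of a 0/1 label = bad rate.
summary = toy.groupby("term")["label"].agg(n="count", bad_rate="mean")
print(summary)
```

Sorting `summary` by `bad_rate` gives a quick ranking of the riskiest categories.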

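Describe the numeric data

cel=[i for i in df1.columns if df1[i].dtypes =="float"]
plt.figure(figsize=(8,5*len(cel)))  # one figure that holds all the subplots
for i ,j in enumerate(cel):
    ax=plt.subplot(len(cel),1,i+1)
    # sns.distplot is deprecated; histplot with kde=True draws the same kind of plot
    sns.histplot(df1.loc[df1["class"]==0,j],color="b",kde=True,stat="density",ax=ax)  # good loans
    sns.histplot(df1.loc[df1["class"]==1,j],color="r",kde=True,stat="density",ax=ax)  # bad loans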

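From the charts:

1. (Top left) The loan amount is roughly normally distributed with the peak shifted slightly to the left, showing that business is concentrated in small and medium amounts; the overdue rate rises somewhat as the amount grows.

2. (Top right) Annual income is concentrated below 150,000, with a few extremely high incomes reaching 4,000,000.

3. (Bottom left) The debt-to-income ratio is roughly normally distributed and mostly below 40%.

4. (Bottom right) Number of 30+ day delinquencies in the past two years: even customers without a single delinquency on record can still go overdue.

From the charts:

1. Credit inquiries: the more inquiries, the more likely the borrower is to go overdue.

2. Borrowers without public records also go overdue, but the overdue rate is much higher for those who have them.

Use word clouds to look at the frequency of job titles and states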

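text=["job title","state"]
# Join the values with spaces up front, so every entry is already a separate token
str_list=[" ".join(df1[j].astype(str)) for j in text]
print([s[:200] for s in str_list])  # preview only; the full strings are very long

# Set the background color, width, and height
from wordcloud import WordCloud
wordcloud=WordCloud(background_color="white",width=1000, height=860, margin=2).generate(str_list[0])
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")

wordcloud=WordCloud(background_color="white",width=1000, height=860, margin=2).generate(str_list[1])
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
# When building a word cloud, separating the entries with spaces in advance makes plotting much faster; it amounts to doing the tokenization yourself

1. Most borrowers are ordinary company employees.

2. Borrowers are concentrated in California, New York, and Texas (the company is headquartered in California).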

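Explore the relationship between loan purpose and interest rate

df['int_rate_num'] = df['int_rate'].str.rstrip("%").astype("float")  # strip the trailing percent sign and convert to float
sns.boxplot(y="purpose",x="int_rate_num",data=df)

Loans whose purpose is small_business carry the highest interest rates.

Explore the relationship between interest rate, income, employment length, and loan status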

# A second way to replace values
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}

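df = df.replace(mapping_dict)  # apply the mapping
df["annual_inc"]=df["annual_inc"].astype("float")  # convert income from object to float
sns.pairplot(df, vars=["int_rate_num","annual_inc", "emp_length"],hue="loan_status_class", diag_kind="kde" ,kind="reg", height = 3)

One reading: the longer the employment and the higher the income, the lower the default rate, and correspondingly the lower the interest rate.

A quick look at correlations

sns.heatmap(df1.corr(numeric_only=True))  # numeric_only needs pandas 1.5 or later

Apart from the diagonal, the lighter the color, the higher the correlation.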

5. Preparation for modeling

1. Inspect the missing values in detail and decide on a filling strategy

# Check the missing ratio per column and decide which columns to drop
data_defect=[i for i in data.columns if (data[i].isnull().sum())/data.shape[0] != 0]
for i in data_defect:
    defect=data[i].isnull().sum()/data.shape[0]
    print( i , defect)

data=data.drop(["mths_since_recent_bc","mths_since_recent_inq"],axis=1)

# Fill with the mode
fil=["emp_title","emp_length","title","dti","num_rev_accts","num_tl_120dpd_2m","percent_bc_gt_75"]
for i in fil:
    data[i]=data[i].fillna(data[i].mode()[0])  # Series.mode() ignores NaN by default

# Check the missing values again

objectcolumns=[i for i in data.columns if data[i].dtype=="object"]
data[objectcolumns].isnull().sum().sort_values(ascending=False)

data[objectcolumns].head()

# int_rate and revol_util are really numeric but contain '%', so they were read as strings; the loan term and employment length also need processing
data.int_rate= data.int_rate.str.rstrip('%').astype('float')
data.revol_util= data.revol_util.str.rstrip('%').astype('float')  # strip the trailing character and convert to numeric
data["term"]=data["term"].str.replace("months","",regex=False).str.strip().astype("float")
objectcolumns=[i for i in data.columns if data[i].dtypes=="object"]
data[objectcolumns].isnull().sum().sort_values()

# Data filtering: frequency counts for each object column
var = data[objectcolumns].columns
for v in var:
    print('\nFrequency count for variable {0}'.format(v))
    print(data[v].value_counts())
data[objectcolumns].shape

drop_list=["sub_grade","title","zip_code","last_pymnt_d","last_credit_pull_d"]
data.drop(drop_list,axis=1,inplace=True)

# Fill missing values with a new placeholder category
objectcolumns=[i for i in data.columns if data[i].dtype=="object"]
data[objectcolumns]=data[objectcolumns].fillna("vacanct")

import missingno as msno   # missing-value visualization
msno.matrix(data[objectcolumns])

# Check missing values in the float columns
floatcolumns=[i for i in data.columns if data[i].dtype=="float"]
data[floatcolumns].isnull().sum().sort_values(ascending=False)

For numeric data we start with mean imputation.

from sklearn.impute import SimpleImputer  # successor of the removed sklearn.preprocessing.Imputer
imp = SimpleImputer(missing_values=np.nan , strategy='mean',copy=False)
imp=imp.fit(data[floatcolumns])
data[floatcolumns]=imp.transform(data[floatcolumns])

Outliers are left alone for now: since this is a scorecard, the variables are binned later.

Filter the object columns once more to see whether any need to be dropped

objectColumns = [i for i in data.columns if data[i].dtype=="object"]
var = data[objectColumns].columns
for v in var:
    print('\nFrequency count for variable {0}'.format(v))
    print(data[v].value_counts())

drop_cols=["sub_grade","pymnt_plan","title","last_pymnt_d","last_credit_pull_d","application_type","hardship_flag",
           "debt_settlement_flag"]
data=data.drop(columns=drop_cols,errors="ignore")  # some of these columns were already dropped earlier

2. Feature abstraction

Here we use label encoding first rather than dummy variables; dummy variables can be tried later depending on how the model performs.

data_list={
    "grade":{"A":1,"B":2,"C":3,"D":4,"E":5,"F":6,"G":7},
    "emp_length":{"10+ years":11,"2 years":2,"< 1 year":0,"3 years":3,"1 year":1,"5 years":5,"4 years":4,"vacanct":0,
                  "8 years":8,"7 years":7,"6 years":6,"9 years":9 },
    "home_ownership":{"MORTGAGE":1,"RENT":2,"OWN":3,"ANY":4 },
    "verification_status":{"Source Verified":1,"Verified":2,"Not Verified":3},
    "loan_status":{'Current':0,'Fully Paid':0,'Charged Off':1,'Late (31-120 days)':1,'Late (16-30 days)':1,'In Grace Period':1,"Default":1},
    "purpose":{"debt_consolidation":1,"credit_card":2,"home_improvement":3,"other":4,"major_purchase":5,"medical":6,"car":7,
               "small_business":8,"moving":9,"vacation":10,"house":11,"renewable_energy":12,"wedding":13,"educational":14},
    "initial_list_status":{"w":1,"f":2},
    "term":{36.0:1,60.0:2}
}
data=data.replace(data_list)  # apply the mapping

n_columns = [c for c in ["home_ownership","verification_status","purpose","application_type"] if c in data.columns]
dummy_df = pd.get_dummies(data[n_columns],columns=n_columns)  # one-hot encode with get_dummies
loans = pd.concat([data, dummy_df], axis=1)  # with axis=1, concat aligns on the row index and places the columns side by side
loans = loans.drop(n_columns, axis=1)  # remove the original categorical columns

Dummy-variable encoding

Handling near-constant features

equ_fea=[]
for i in data.columns:
    counts=data[i].value_counts(dropna=False)
    mode_value=counts.index[0]  # most frequent value
    mode_rate=counts.iloc[0]/data.shape[0]  # share of rows taking that value
    if mode_rate >0.9:
        equ_fea.append([i,mode_value,mode_rate])
dt=pd.DataFrame(equ_fea,columns=["name","value","equi"])
dt.sort_values(by="equi")

Next, remove attributes that leak information

drop_data_leakage=["recoveries","last_pymnt_amnt","funded_amnt","funded_amnt_inv","total_pymnt","total_pymnt_inv","total_rec_prncp",
                   "total_rec_int"]
data.drop(drop_data_leakage,axis=1,inplace=True)

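Feature derivation

# Divide annual_inc by 12 to get monthly income, then divide installment by it to get the ratio of monthly payment to monthly income; the larger the value, the heavier the repayment burden
data["installment_feat"]=data["installment"] / (data["annual_inc"]/12)

# Turn the date columns into a length in months: loan issue date minus earliest credit line date gives a new variable for credit history
issue=pd.to_datetime(data["issue_d"],format="%b-%Y",errors="coerce")  # assumes values such as "Dec-2015"
earliest=pd.to_datetime(data["earliest_cr_line"],format="%b-%Y",errors="coerce")
data["cre_hist"]=(issue-earliest).dt.days/30
data.drop(["issue_d","earliest_cr_line"],axis=1,inplace=True)

data.to_csv("2017q1_2.csv",index=False)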
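
Dividing a day count by 30 only approximates months. If whole calendar months are wanted, they can be computed from the year and month parts directly. This is a sketch on toy dates, assuming the `Mon-YYYY` format and an English locale for the month abbreviations; `issue`, `earliest`, and `months` are illustrative names.

```python
import pandas as pd

# Toy dates in the "Mon-YYYY" form (assumed format, illustrative values).
issue = pd.to_datetime(pd.Series(["Dec-2015", "Jan-2015"]), format="%b-%Y")
earliest = pd.to_datetime(pd.Series(["Dec-2005", "Jan-2000"]), format="%b-%Y")

# Whole calendar months between the two dates.
months = (issue.dt.year - earliest.dt.year) * 12 + (issue.dt.month - earliest.dt.month)
print(months.tolist())
```

For binning purposes the two versions will rarely differ in a way that matters, so this is a matter of taste rather than a correction.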

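Binning continuous variables:

Binning methods include supervised ones (chi-square binning, KS binning, and decision-tree binning) and unsupervised ones (equal-width and equal-frequency binning).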
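
To make the two unsupervised methods concrete, here is a small sketch with `pd.cut` (equal width) and `pd.qcut` (equal frequency) on a toy series with one outlier. The values and variable names are made up for illustration.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal width: 3 bins of the same width across the value range.
equal_width = pd.cut(values, bins=3, labels=False)

# Equal frequency: 2 bins holding (roughly) the same number of samples.
equal_freq = pd.qcut(values, q=2, labels=False)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

The outlier stretches the equal-width bins so that almost everything lands in the first one, while equal-frequency bins stay balanced. Neither method looks at the target, which is why the supervised methods are usually preferred for scorecards.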

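At first I planned to use chi-square binning, but on some columns it failed for no obvious reason or ran all night without finishing. I suspected the normality of the distributions but could not resolve it after a long time, so I switched to decision-tree binning.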

def Chi2(df, total_col, bad_col,overallRate):
    '''
    Compute the chi-square statistic.
    :df DataFrame
    :total_col column holding the total count for each value
    :bad_col column holding the bad count for each value
    :overallRate overall share of bad samples
    :return chi-square value
    '''
    df2=df.copy()
    df2['expected']=df[total_col].apply(lambda x: x*overallRate)
    combined=zip(df2['expected'], df2[bad_col])
    chi=[(i[0]-i[1])**2/i[0] for i in combined]
    chi2=sum(chi)
    return chi2
# Binning with a maximum number of bins
def ChiMerge_MaxInterval_Original(df, col, target,max_interval=5):
    '''
    :df DataFrame
    :col feature to be binned
    :target target column with 0/1 values (1 = bad, 0 = good)
    :max_interval maximum number of bins
    :return list of cut-off points
    '''
    colLevels=set(df[col])
    colLevels=sorted(list(colLevels))
    N_distinct=len(colLevels)
    if N_distinct <= max_interval:
        print ("The number of distinct values does not exceed max_interval; no merging is needed")
        return colLevels[:-1]
    else:
        total=df.groupby([col])[target].count()
        total=pd.DataFrame({'total':total})
        bad=df.groupby([col])[target].sum()
        bad=pd.DataFrame({'bad':bad})
        regroup=total.merge(bad, left_index=True, right_index=True, how='left')
        regroup.reset_index(level=0, inplace=True)
        N=sum(regroup['total'])
        B=sum(regroup['bad'])
        overallRate=B*1.0/N
        groupIntervals=[[i] for i in colLevels]
        groupNum=len(groupIntervals)
        while(len(groupIntervals)>max_interval):
            chisqList=[]
            for interval in groupIntervals:
                df2=regroup.loc[regroup[col].isin(interval)]
                chisq=Chi2(df2,'total','bad',overallRate)
                chisqList.append(chisq)
            min_position=chisqList.index(min(chisqList))
            if min_position==0:
                combinedPosition=1
            elif min_position==groupNum-1:
                combinedPosition=min_position-1
            else:
                if chisqList[min_position-1]<=chisqList[min_position + 1]:
                    combinedPosition=min_position-1
                else:
                    combinedPosition=min_position+1
            # Merge the two bins
            groupIntervals[min_position]=groupIntervals[min_position]+groupIntervals[combinedPosition]
            groupIntervals.remove(groupIntervals[combinedPosition])
            groupNum=len(groupIntervals)
        groupIntervals=[sorted(i) for i in groupIntervals]
        print (groupIntervals)
        cutOffPoints=[i[-1] for i in groupIntervals[:-1]]
        return cutOffPoints
# Returns the list of best cut-off points

Chi-square binning code

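import numpy as np
from scipy.stats import kstest
# b is the 1-D numeric sample to test; 'norm' is the standard normal, so b should be standardized first
kstest(b, 'norm')  # normality test: a p-value above 0.05 means normality cannot be rejected

Normality test code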

# First slice out the data to be binned
x_data=data["open_acc"]
x1_data=x_data.values[:,np.newaxis]  # sklearn requires X to be at least 2-D, so add an axis; the position of np.newaxis decides where the new axis goes
x1_data

# Fit a single-variable decision tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier( max_depth=3,min_samples_leaf=21054).fit(x1_data,data["loan_status"])

# Display the tree
from sklearn import tree
import graphviz
dot_data = tree.export_graphviz(model, out_file=None)
graphviz.Source(dot_data)

(Tree figure omitted.)
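
The split thresholds of a fitted tree can also be read programmatically rather than from the rendered figure. The following is a minimal sketch on synthetic data; the data, `tree_model`, and the parameter values are assumptions for illustration and not the settings used on the real dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.uniform(0, 30, size=2000)

# Synthetic target: the bad rate steps up when x passes 10 and again at 20.
p_bad = np.where(x < 10, 0.05, np.where(x < 20, 0.3, 0.7))
y = (rng.uniform(size=x.size) < p_bad).astype(int)

tree_model = DecisionTreeClassifier(max_depth=2, min_samples_leaf=100, random_state=0)
tree_model.fit(x.reshape(-1, 1), y)

# Internal (split) nodes have a feature index >= 0; leaves are marked with -2.
is_split = tree_model.tree_.feature >= 0
cut_points = sorted(tree_model.tree_.threshold[is_split].tolist())
print(cut_points)
```

`min_samples_leaf` plays the same role as in the single-variable tree above: it keeps every resulting bin from becoming too small.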

# Collect all cut points obtained from the decision trees and turn them into a dict
num_box=["loan_amnt","int_rate","dti","fico_range_low","installment","annual_inc","fico_range_high","open_acc"]
cut_list=[
    [4012,5987,7012,9012,10012,19987,20012,23987,28112],
    [6.905,8.045,10.565,11.76,12.49,13.665,15.88,17.915,19.94],
    [7.445,10.075,12.625,14.855,20.195,21.795,25.135,30.115,34.275],
    [667.5,677,683,687,692,697,707,727,747],
    [161,197,251,503,602,880],
    [42800,55101,65732,85085,104499,120287,150486],
    [671.5,686.5,691.5,701.5,711.5,731.5,751.5],
    [5.5,7.5,8.5,10.5,17.5,22.5]]
cut_dict={}
for i in range(len(num_box)):
    cut_dict[num_box[i]]=cut_list[i]

# Split the data with pd.cut()
def box_col_to_df(to_box,col,num_b):  # dataset, column to convert, list of cut points
    bins=[-100.0]+num_b+[1000000000.0]  # pd.cut() needs outer edges, so widen the range at both ends of bins
    to_box[col]=pd.cut(to_box[col],bins=bins,include_lowest=True,labels=range(len(bins)-1))

box_col_to_df(data,"open_acc",cut_dict["open_acc"])
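
To see what the function produces, here is a sketch that applies the `open_acc` cut points from `cut_dict` to a few made-up values. It uses infinite outer edges instead of fixed sentinels, which is an alternative choice and not what the function above does.

```python
import pandas as pd

cuts = [5.5, 7.5, 8.5, 10.5, 17.5, 22.5]  # open_acc cut points from cut_dict
bins = [-float("inf")] + cuts + [float("inf")]

# Toy open_acc values, one per bin (illustrative only).
open_acc = pd.Series([3, 6, 8, 10, 15, 20, 40])

# Each value is replaced by the index of the bin it falls into.
codes = pd.cut(open_acc, bins=bins, labels=list(range(len(bins) - 1)))
print(codes.tolist())
```

With infinite edges no value can fall outside the bins, whereas fixed sentinels silently produce NaN for anything beyond them.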

The data after binning looks as follows (figure omitted)

Rank variable importance with a random forest

X and y have already been split beforehand.

from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier().fit(x,y)

At first this raised an error saying that x contains null or infinite values.

# Count the infinite values in each numeric column
num_data=data.select_dtypes(include=[np.number])
inf_counts=np.isinf(num_data).sum()
inf_counts.sum()  # 0 means there are no infinite values; anything else means there are

# Locate the columns that contain infinite values
print(inf_counts[inf_counts != 0].index.tolist())

There are only 2, so deleting the corresponding rows is enough.
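
Deleting the affected rows can be done with a boolean mask. The sketch below also shows one plausible source of such values: a derived ratio divided by a zero income. That cause is an assumption, since the post does not say which column held the infinities, and the frame `toy` is made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "installment": [300.0, 250.0, 400.0],
    "annual_inc": [60000.0, 0.0, 48000.0],  # a zero income makes the ratio infinite
})
toy["installment_feat"] = toy["installment"] / (toy["annual_inc"] / 12)

# Flag rows that contain an infinite value in any numeric column.
num = toy.select_dtypes(include=[np.number])
bad_rows = np.isinf(num).any(axis=1)

clean = toy[~bad_rows]
print(clean)
```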

Run the random forest again

Then output the variable importances

# Output the variables ranked by importance
importance = clf.feature_importances_
indices = np.argsort(importance)[::-1]
features = x.columns
name=[]
degree=[]
for f in range(x.shape[1]):
    name.append(features[indices[f]])  # keep the names aligned with the sorted importances
    degree.append(importance[indices[f]])
zy=pd.DataFrame({"name":name,"degree":degree})
print(zy)

Take the top 15 variables first and see how they do

degree_list=zy["name"].head(15).tolist()
df=data[degree_list]

Compute WOE and IV values

# Wrap the WOE and IV computation in a function
def Calcwoe(data,col,target):
    total=data.groupby([col])[target].count()
    total=pd.DataFrame({"total":total})
    bad=data.groupby([col])[target].sum()
    bad=pd.DataFrame({"bad":bad})
    regroup=total.merge(bad,left_index=True,right_index=True,how="left")
    regroup.reset_index(level=0,inplace=True)
    n=sum(regroup["total"])
    b=sum(regroup["bad"])
    regroup["good"]=regroup["total"]-regroup["bad"]
    g=n-b
    regroup["bad_pcnt"]=regroup["bad"].map(lambda x: x*1.0/b)
    regroup["good_pcnt"]=regroup["good"].map(lambda x : x*1.0/g)
    regroup["woe"]=regroup.apply(lambda x: np.log(x.good_pcnt*1.0/x.bad_pcnt),axis=1)
    woe_dict=regroup[[col,"woe"]].set_index(col).to_dict()
    IV=regroup.apply(lambda x:(x.good_pcnt-x.bad_pcnt)*np.log(x.good_pcnt*1.0/x.bad_pcnt),axis=1)
    IV_SUM=sum(IV)
    return {"woe":woe_dict,"IV_SUM":IV_SUM,"IV":IV}

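# Compute the IV values
df=data.copy()
woe_dict={}
IV_list=[]
iv_names=[]
for i in df.columns.drop("loan_status"):  # skip the target itself
    iv_dict=Calcwoe(df,i,"loan_status")
    iv_names.append(i)
    IV_list.append(iv_dict["IV_SUM"])
    woe_dict[i]=iv_dict["woe"]
DF_IV=pd.DataFrame({"iv_name":iv_names,"IV":IV_list})
DF_IV.sort_values(by="IV",ascending=False)

An infinite IV means that some bin of the feature contains no samples of one class. In that case the feature has to be re-binned and bins merged.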
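
A tiny worked example shows where the infinity comes from. The helper `woe_iv` below is a hypothetical pure-Python sketch of the same formulas as `Calcwoe`, WOE = ln(good% / bad%) and IV = sum((good% - bad%) * WOE), and is not the function used above.

```python
import math

def woe_iv(bins):
    """bins: list of (good_count, bad_count) tuples, one per bin."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        good_pct, bad_pct = g / total_good, b / total_bad
        if good_pct == 0 or bad_pct == 0:
            # A bin with no goods or no bads has an unbounded WOE.
            woes.append(math.inf if bad_pct == 0 else -math.inf)
            iv = math.inf
            continue
        w = math.log(good_pct / bad_pct)
        woes.append(w)
        iv += (good_pct - bad_pct) * w
    return woes, iv

# Every bin holds both classes: the IV is finite.
woes_ok, iv_ok = woe_iv([(80, 20), (20, 80)])
# The first bin holds no bad samples: the IV is infinite.
woes_inf, iv_inf = woe_iv([(50, 0), (50, 100)])
print(iv_ok, iv_inf)
```

Merging the empty-class bin with a neighbor restores a finite value, which is what re-binning achieves.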

Check the IV values again by rerunning the IV computation above after re-binning.

After that, keep the variables with an IV above 0.015, or only those above 0.02, depending on the situation.

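Next, use the VIF (variance inflation factor) to test for multicollinearity: each feature is regressed on the other features, and if they explain it well, the features are collinear.

# Test for multicollinearity with the VIF (variance inflation factor)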
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
VIF_ls=[]
n=df.columns
for i in range(len(n)):
    VIF_ls.append([n[i],int(VIF(df.values,i))])
df_vif=pd.DataFrame(VIF_ls,columns=["name","vif"])
print(df_vif)
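
The idea behind the number is VIF = 1 / (1 - R^2), where R^2 comes from regressing one feature on all the others. A minimal NumPy sketch on synthetic data follows; the helper `vif` is hypothetical and, unlike the statsmodels function, adds an intercept to the auxiliary regression itself.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + 0.1 * rng.normal(size=1000)  # nearly a copy of x1
x3 = rng.normal(size=1000)             # independent of the others
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, j) for j in range(3)]
print(vifs)
```

The two nearly identical columns get a large VIF and the independent one stays close to 1, which matches the rule of thumb of treating values above 10 as a sign of collinearity.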

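# Check linear correlation with the correlation matrix
cor=data[num_box].corr()
cor.iloc[:,:]=np.tril(cor.values,k=-1)  # keep the lower triangle only, so each pair appears once
cor=cor.stack()
cor[np.abs(cor)>0.7]

# A VIF above 10 or a correlation above 0.7 signals correlated variables. Remove them one at a time: for example, if the VIF falls below 10 after dropping installment, then installment and loan_amnt were collinear.
# Of such a pair, keep the one with the larger IV.
df.drop("fico_range_high",axis=1,inplace=True )
df.drop("installment",axis=1,inplace=True)
df.drop("grade",axis=1,inplace=True)

valid_feas=[c for c in DF_IV[DF_IV.IV > 0.015].iv_name.tolist() if c in df.columns]
if "loan_status" not in valid_feas:
    valid_feas.append("loan_status")  # keep the target column for the modeling steps
valid_feas

df=df[valid_feas]
df.head()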

# Look at the correlations with a heatmap
colormap = plt.cm.viridis
plt.figure(figsize=(12,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(df.corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

There are several ways to deal with class imbalance, including oversampling, undersampling, and random sampling with replacement. Here oversampling is used to balance the two classes.

# Use oversampling to fix the class imbalance
# Split into x and y
x_list=list(df.columns)
x_list.remove("loan_status")  # remove the loan_status variable from x_list
x=df[x_list]
y=df["loan_status"]

n_sample=y.shape[0]
n_pos_sample=y[y==0].shape[0]
n_neg_sample=y[y==1].shape[0]
print("Samples: {}, positive share: {:.2%}, negative share: {:.2%}".format(n_sample,
                                         n_pos_sample/n_sample,
                                        n_neg_sample/n_sample))

from imblearn.over_sampling import SMOTE  # import the SMOTE algorithm
# Handle the imbalanced data
sm = SMOTE(random_state=42)    # oversampling method
x, y = sm.fit_resample(x, y)  # fit_resample is the current name of fit_sample
print('After balancing the classes with SMOTE')
n_sample = y.shape[0]
n_pos_sample = y[y == 0].shape[0]
n_neg_sample = y[y == 1].shape[0]
print('Samples: {}; positive share: {:.2%}; negative share: {:.2%}'.format(n_sample,
                                                   n_pos_sample / n_sample,
                                                   n_neg_sample / n_sample))
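
SMOTE creates new minority samples by interpolating between existing ones. For comparison, plain random oversampling, which simply redraws minority rows with replacement, can be sketched in a few lines of NumPy. The function `random_oversample` is a hypothetical helper for illustration and is not part of imbalanced-learn.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Redraw rows of the smaller classes (with replacement) up to the largest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        extra = rng.choice(members, size=target - n, replace=True)
        idx.extend(members.tolist())
        idx.extend(extra.tolist())
    idx = np.array(idx)
    return X[idx], y[idx]

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X_toy, y_toy)
print((y_bal == 0).sum(), (y_bal == 1).sum())
```

Random oversampling only duplicates rows, so it never produces values outside the original bins, while SMOTE can create interpolated values between them.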

df.to_csv("2017q1_df.csv",index=False)

# Replace the original bin values with their WOE encoding, which puts the coefficients on a comparable scale
for col in x.columns:
    x[col]=x[col].replace(woe_dict[col]["woe"])

Start training the model

# Add a column of ones to x so that the equation has an intercept term
import statsmodels.api as sm
from sklearn.model_selection import train_test_split  # train/test split
x1=sm.add_constant(x)

x_train,x_text,y_train,y_test=train_test_split(x1,y,test_size=0.2,random_state=1991)  # 80/20 split with a fixed random seed

# Cross-validation and grid search
from sklearn.model_selection import GridSearchCV  # grid search
from sklearn.linear_model import LogisticRegression  # logistic regression

# Build the parameter grid
param_test1={"C":[0.01,0.1,1.0,10.0,20.0,30.0,100.0,200.0,300.0,1000.0],  # inverse regularization strength
            "penalty":["l1","l2"],  # type of regularization
            "max_iter":[100,200,300,400,500]}  # maximum number of iterations until convergence
gsearch1=GridSearchCV(LogisticRegression(solver="liblinear"),param_grid=param_test1,cv=10)  # liblinear supports both l1 and l2
gsearch1.fit(x_train,y_train)  # fit the model

gsearch1.best_params_, gsearch1.best_score_   # best parameter combination and best score

gsearch1.best_estimator_  # estimator with the best parameters

Train the model with the best parameters found by the grid search

from sklearn.linear_model import LogisticRegression
flt= LogisticRegression(penalty='l2',C=0.01)
flt.fit(x_train,y_train)

Check model performance on the validation set

from sklearn.metrics import roc_auc_score, roc_curve
proba=flt.predict_proba(x_text)[:,1]  # predicted probability of default
auc=roc_auc_score(y_test,proba)  # y_true comes first, then the scores
fpr,tpr,thre=roc_curve(y_test,proba)
ks=max(tpr-fpr)
print("auc:{}       ks:{}".format(auc,ks))

# Check accuracy
from sklearn.metrics import accuracy_score
print("Accuracy: {:.4%}".format(accuracy_score(y_test,flt.predict(x_text))))

flt.coef_  # inspect the coefficients

A KS value above 0.3 indicates a model that is basically usable.
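
KS is the largest gap between the cumulative share of bad and of good samples when the samples are ordered by score. The sketch below computes it directly on toy data; `ks_statistic` is a hypothetical helper that ignores tied scores for simplicity.

```python
import numpy as np

def ks_statistic(y_true, score):
    """Max gap between cumulative bad and good shares, ordered by descending score."""
    order = np.argsort(-score)
    y = y_true[order]
    cum_bad = np.cumsum(y) / y.sum()
    cum_good = np.cumsum(1 - y) / (1 - y).sum()
    return np.max(np.abs(cum_bad - cum_good))

# Perfect separation: every bad sample scores above every good one.
ks_perfect = ks_statistic(np.array([1, 1, 0, 0]), np.array([0.9, 0.8, 0.2, 0.1]))
# Partial separation: bad and good samples are interleaved.
ks_partial = ks_statistic(np.array([1, 0, 1, 0]), np.array([0.9, 0.8, 0.7, 0.1]))
print(ks_perfect, ks_partial)
```

For scores without ties this should agree with `max(tpr - fpr)` from the ROC curve, because TPR and FPR are exactly these cumulative shares.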

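Output the scorecard:

# Output the scorecard
# Assume the score is 500 when the odds are 1/20, and each doubling of the odds changes the score by 20 points
B=20/np.log(2)
A=500+B*np.log(1/20)
basescore=round(A-B*(flt.intercept_[0]+flt.coef_[0][0]),0)  # base score from the fitted intercept plus the constant column's coefficient, rounded to an integer
scorecard={}
for i,j in enumerate(x.columns):
    woe=woe_dict[j]["woe"]
    interval=[]
    scores=[]
    for key,value in woe.items():
        score=round(-(value*flt.coef_[0][i+1]*B))
        scores.append(score)
        interval.append(key)
    card=(pd.DataFrame({"interval":interval,"scores":scores})).set_index("interval").to_dict()  # named card so the data DataFrame is not overwritten
    scorecard[j]=card
print(scorecard)

After tidying up, we get the scorecard.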
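
The scaling constants A and B follow from two choices: a base score at a base odds, and the number of points per doubling of the odds. The sketch below uses the values from the code (500 points at odds 1/20, 20 points per doubling); the names `PDO`, `BASE_SCORE`, `BASE_ODDS`, and `score_from_odds` are assumptions introduced for illustration.

```python
import math

PDO = 20            # points per doubling of the odds
BASE_SCORE = 500    # score assigned at the base odds
BASE_ODDS = 1 / 20  # bad-to-good odds at the base score

B = PDO / math.log(2)
A = BASE_SCORE + B * math.log(BASE_ODDS)

def score_from_odds(odds):
    """Score = A - B * ln(odds), with odds = bad / good."""
    return A - B * math.log(odds)

print(score_from_odds(1 / 20), score_from_odds(1 / 40), score_from_odds(1 / 10))
```

Halving the bad-to-good odds adds 20 points and doubling them removes 20 points, so a higher score means lower risk.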

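With the scorecard in hand, we usually need to find the best score cut-off. One option is to take the inflection point of the ROC curve and plug its value into the scorecard equation; the result is the best cut-off score.

Alternatively, bin the scores with chi-square binning or a decision tree, compute the overdue rate of each bin, and choose the cut-off according to business needs.

Summary

1. No method is simply good or bad, only suitable or unsuitable. More often than not we need to try them all to learn which algorithm suits which data.

2. Building a model is an iterative search for the best result. If a model performs poorly, we have to start over from data cleaning: should missing values be filled with the mean or the mode? Should categorical variables be label-encoded or turned into dummy variables? Each of these has to be tried in turn, iterating until the final model emerges.

3. Variable selection matters. People often say that data sets the ceiling for a model and algorithms merely approach it, so feature selection should stay as close to the business reality as possible. A good model ultimately comes from effort spent on the data, which is why data cleaning and preparation are among the most important parts of the whole modeling workflow.

References: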

https://zhuanlan.zhihu.com/p/39780207

https://blog.csdn.net/zs15321583801/article/details/89485951
