
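Financial Risk Control -- Application Scorecard Model -- Feature Engineering (Feature Binning, WOE Encoding)
Tags: finance, feature binning, WOE encoding. Posted 2017-07-16 21:26. 4086 reads, 2 comments. Category: financial risk control.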

Published: 2025/3/21



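This post covers feature engineering methods commonly used in application scorecard models. The model most often used for application scorecards is still logistic regression.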

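Let's start with the data. We have three tables.

Information that is already processed:

Master table
idx: unique key of each loan; it can be matched against idx in the other two files.
UserInfo_*: borrower attribute fields
WeblogInfo_*: web behavior fields
Education_Info*: education and enrollment fields
ThirdParty_Info_PeriodN_*: third-party data fields for time period N
SocialNetwork_*: social network fields
ListingInfo: loan listing (deal) time
Target: default label (1 = loan defaulted, 0 = repaid normally)

Information that needs to be derived:

Borrower login information table
ListingInfo: loan listing (deal) time
LogInfo1: operation code
LogInfo2: operation category
LogInfo3: login time
idx: unique key of each loan

Customers perform different operations in different periods, so it is best to cut time into slices and compute statistics within each slice, deriving new features from them.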

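Time slice:

The span between two points in time.

Examples:
Number of logins in the 30 days before the application date
Number of logins from day 30 to day 59 before the application date

Derivation based on time slices:

Average number of logins per month (30 days) in the 180 days before the application date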
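The three example features above can be sketched on a toy login log. This is only an illustration: the column names `login_date`, `apply_date` and `gap`, the helper `count_in_slice` and all the values are made up for the example and are not taken from the data set.

```python
import pandas as pd

# Toy login log: one row per login event (all values are made up).
logs = pd.DataFrame({
    "Idx": [1, 1, 1, 2, 2],
    "login_date": pd.to_datetime(
        ["2014-03-01", "2014-02-10", "2013-12-01", "2014-02-25", "2013-10-05"]),
    "apply_date": pd.to_datetime(["2014-03-05"] * 5),
})
# Days between each login and the application date.
logs["gap"] = (logs["apply_date"] - logs["login_date"]).dt.days


def count_in_slice(df, low, high):
    """Logins per loan with low <= gap < high (days before application)."""
    in_slice = df[(df["gap"] >= low) & (df["gap"] < high)]
    return in_slice.groupby("Idx").size()


features = pd.DataFrame(index=pd.Index(sorted(logs["Idx"].unique()), name="Idx"))
# Logins in the 30 days before the application date.
features["login_0_30"] = count_in_slice(logs, 0, 30)
# Logins from day 30 to day 59 before the application date.
features["login_30_60"] = count_in_slice(logs, 30, 60)
# Average logins per 30 days over the last 180 days (6 periods).
features["login_180_monthly_avg"] = count_in_slice(logs, 0, 180) / 6.0
# Loans with no login in a slice get 0 instead of NaN.
features = features.fillna(0)
print(features)
```

Each slice is a half-open interval of days before the application date, so adjacent slices never count the same login twice.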

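Commonly used time slices:

1 or 2 months, 1 or 2 quarters, half a year, 1 year, 1.5 years, 2 years

Choosing the time slices:

Not too long: most samples should still be covered.
Not too short: otherwise information is lost.

We want the largest time slice to be no longer than necessary while still capturing most of the information. So how large should the largest slice be?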

# coding: utf-8
import datetime
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def TimeWindowSelection(df, daysCol, time_windows):
    '''
    :param df: the dataset containing the variable of days
    :param daysCol: the column of days
    :param time_windows: the list of time windows, e.g. 30, 60, 90, ..., 360
    :return: dict mapping each time window to the share of records it covers
    '''
    freq_tw = {}
    for tw in time_windows:
        # number of operations that fall within the tw-day time slice
        freq = (df[daysCol] <= tw).sum()
        # share of all operations covered by the tw-day time slice
        freq_tw[tw] = freq / float(len(df))
    return freq_tw


data1 = pd.read_csv('PPD_LogInfo_3_1_Training_Set.csv', header=0)
# Parse the login date and the listing (application) date of each record
data1['logInfo'] = pd.to_datetime(data1['LogInfo3'], format='%Y-%m-%d')
data1['Listinginfo'] = pd.to_datetime(data1['Listinginfo1'], format='%Y-%m-%d')
# Number of days between the login and the listing date
data1['ListingGap'] = (data1['Listinginfo'] - data1['logInfo']).dt.days
timeWindows = TimeWindowSelection(data1, 'ListingGap', range(30, 361, 30))
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(list(timeWindows.keys()), list(timeWindows.values()), marker='o')
ax.set_xticks([0, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 360])
ax.grid()
plt.show()

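As the resulting plot shows, operations within the 0-180 day slice account for 95% of all operations, and coverage grows very slowly beyond 180 days. We therefore choose 180 days as the largest time slice. Any slice no longer than 180 days can be used to derive features.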

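We take [7, 30, 60, 90, 120, 150, 180] as the slices for deriving variables.

Next, which useful features should we extract?

m1: the number of operations recorded in LogInfo1 and in LogInfo2 within each time slice.
m2: the number of distinct operations in LogInfo1 and in LogInfo2 within each time slice.
The ratio m1/m2 for LogInfo1 and for LogInfo2 within each time slice.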

time_window = [7, 30, 60, 90, 120, 150, 180]
var_list = ['LogInfo1', 'LogInfo2']
data1GroupbyIdx = pd.DataFrame({'Idx': data1['Idx'].drop_duplicates()})
for tw in time_window:
    # keep only the log records within tw days before the listing date
    data1['TruncatedLogInfo'] = data1['Listinginfo'].map(lambda x: x + datetime.timedelta(-tw))
    temp = data1.loc[data1['logInfo'] >= data1['TruncatedLogInfo']]
    for var in var_list:
        count_col = str(var) + '_' + str(tw) + '_count'
        unique_col = str(var) + '_' + str(tw) + '_unique'
        avg_col = str(var) + '_' + str(tw) + '_avg_count'
        # count the frequencies of LogInfo1 and LogInfo2 (m1)
        count_stats = temp.groupby(['Idx'])[var].count().to_dict()
        data1GroupbyIdx[count_col] = data1GroupbyIdx['Idx'].map(lambda x: count_stats.get(x, 0))
        # count the distinct values of LogInfo1 and LogInfo2 (m2)
        Idx_UserupdateInfo1 = temp[['Idx', var]].drop_duplicates()
        uniq_stats = Idx_UserupdateInfo1.groupby(['Idx'])[var].count().to_dict()
        data1GroupbyIdx[unique_col] = data1GroupbyIdx['Idx'].map(lambda x: uniq_stats.get(x, 0))
        # average count of each value in LogInfo1 and LogInfo2 (m1/m2);
        # the result is NaN when a loan has no record in the window
        data1GroupbyIdx[avg_col] = data1GroupbyIdx[count_col] * 1.0 / data1GroupbyIdx[unique_col].replace(0, np.nan)

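Data cleaning

For categorical variables:

Drop variables whose missing rate exceeds 50%.
Treat missing values in the remaining variables as a state of their own.

For continuous variables:

Drop variables whose missing rate exceeds 30%.
Fill missing values in the remaining variables by random sampling.

Note: missing values in continuous variables can also be treated as a separate state.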
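The cleaning rules above can be sketched as follows. This is a minimal illustration, not the code used in the post: the helper names `drop_high_missing` and `fill_by_random_sampling`, the `MISSING` label and the toy data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def drop_high_missing(df, cols, max_missing_rate):
    """Return the columns in `cols` whose missing rate is at most the limit."""
    missing_rate = df[cols].isnull().mean()
    return [c for c in cols if missing_rate[c] <= max_missing_rate]


def fill_by_random_sampling(series, seed=0):
    """Fill NaNs with values drawn at random from the observed values."""
    rng = np.random.default_rng(seed)
    filled = series.copy()
    missing = filled.isnull()
    observed = filled[~missing].to_numpy()
    filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled


df = pd.DataFrame({
    "city": ["A", None, "B", "A", None, "B"],                     # 2 of 6 missing
    "income": [3000.0, np.nan, 5200.0, 4100.0, 3900.0, 6100.0],   # 1 of 6 missing
    "rare": [np.nan, np.nan, np.nan, np.nan, 1.0, 2.0],           # 4 of 6 missing
})

kept_cat = drop_high_missing(df, ["city"], 0.5)             # categorical: 50% limit
kept_num = drop_high_missing(df, ["income", "rare"], 0.3)   # continuous: 30% limit
df["city"] = df["city"].fillna("MISSING")                   # missing as its own state
df["income"] = fill_by_random_sampling(df["income"])
print(kept_cat, kept_num)
```

Sampling from the observed values keeps the filled column's distribution close to the original one, which a constant fill such as the mean would not.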

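Feature binning (discretizing continuous variables, or reducing the number of categories of a categorical variable)
Definition of binning

Discretize a continuous variable.
Merge a discrete variable with many states into one with fewer states.

Why binning matters and its advantages

Discrete features are easy to add and remove, which makes fast model iteration easier.

Inner products of sparse vectors are fast to compute, and the results are easy to store and scale.

Discretized features are very robust to abnormal data. For example, a feature is 1 when age > 30 and 0 otherwise. Without discretization, an abnormal record such as "age = 300" would disturb the model badly.

Logistic regression is a generalized linear model with limited expressive power. After a single variable is discretized into N values, each value gets its own weight, which introduces nonlinearity into the model and improves its expressive power and fit.

After discretization, features can be crossed, turning M+N variables into M*N variables, which introduces further nonlinearity and expressive power.

After discretization the model is more stable. For example, if user age is discretized with 20-30 as one interval, a user does not become a completely different person just by growing one year older. Samples near an interval boundary behave in exactly the opposite way, of course, so choosing the intervals is a craft of its own.

Discretization simplifies the logistic regression model and lowers the risk of overfitting.

Missing values can enter the model as a separate class.

All variables are transformed to a similar scale.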

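Binning methods

Here we focus on the supervised chi-square binning method (ChiMerge).

ChiMerge is a bottom-up (merge-based) discretization method. It relies on the chi-square test: the adjacent intervals with the smallest chi-square value are merged, until a given stopping criterion is met.
Basic idea: for an accurate discretization, the relative class frequencies should be consistent within an interval. So if two adjacent intervals have very similar class distributions, they can be merged; otherwise they should stay separate. A low chi-square value indicates similar class distributions.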

Binning steps:

Note that the instances must be sorted at initialization; merging is then done on the sorted values.
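One merge decision of ChiMerge can be sketched like this: compute the chi-square value of every pair of adjacent intervals from their (good, bad) counts and merge the pair with the smallest value. The helper name `pair_chi2` and the counts are made up for illustration. This is the textbook pairwise statistic, which is not necessarily identical to the per-interval computation used by the code in this post.

```python
def pair_chi2(interval_a, interval_b):
    """Chi-square of the 2 x 2 table: two adjacent intervals by (good, bad)."""
    rows = [interval_a, interval_b]
    total = float(sum(sum(r) for r in rows))
    col_totals = [sum(r[j] for r in rows) for j in range(2)]
    chi2 = 0.0
    for r in rows:
        row_total = sum(r)
        for j in range(2):
            # expected count if class were independent of the interval
            expected = row_total * col_totals[j] / total
            if expected > 0:
                chi2 += (r[j] - expected) ** 2 / expected
    return chi2


# (good, bad) counts per interval, already sorted by the variable's value
intervals = [(90, 10), (80, 20), (82, 18), (40, 60)]
chi2_values = [pair_chi2(intervals[i], intervals[i + 1])
               for i in range(len(intervals) - 1)]
# the pair with the smallest chi-square has the most similar class distribution
merge_at = chi2_values.index(min(chi2_values))
print(chi2_values, merge_at)
```

Here the second and third intervals have bad rates of 20% and 18%, so their chi-square value is the smallest and they are the pair to merge first.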

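Choosing the chi-square threshold:

The chi-square threshold is obtained from the significance level and the degrees of freedom.
The degrees of freedom are one less than the number of classes. For example, with 3 classes the degrees of freedom are 2, and at 90% confidence (10% significance level) the chi-square value is 4.6.

Meaning of the threshold

When class and attribute are independent, there is a 90% probability that the computed chi-square value is below 4.6.
A chi-square value above the threshold of 4.6 indicates that attribute and class are not independent, so the intervals cannot be merged. A larger threshold leads to many more merges, leaving fewer and wider intervals after discretization.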
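The thresholds quoted in this post can be checked with the standard library alone. The sketch below covers only 1 and 2 degrees of freedom, where closed forms exist; the function name `chi2_threshold` is made up for the example. For other degrees of freedom a statistics package is the usual route, for instance `scipy.stats.chi2.ppf(confidence, dof)` if SciPy is available.

```python
import math
from statistics import NormalDist


def chi2_threshold(confidence, dof):
    """Chi-square critical value for dof = 1 or 2, standard library only."""
    if dof == 1:
        # a chi-square(1) variable is the square of a standard normal variable
        z = NormalDist().inv_cdf((1.0 + confidence) / 2.0)
        return z * z
    if dof == 2:
        # chi-square(2) has the closed-form CDF 1 - exp(-x / 2)
        return -2.0 * math.log(1.0 - confidence)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")


print(round(chi2_threshold(0.90, 2), 3))  # 4.605
print(round(chi2_threshold(0.95, 1), 3))  # 3.841
```

The first value is the 4.6 used in the example above (3 classes, 90% confidence); the second is the threshold for a binary target at 95% confidence.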
　　
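Notes:
1. ChiMerge is usually recommended with a confidence level of 0.90, 0.95 or 0.99 and a maximum of 10 to 15 intervals.
2. The chi-square threshold can also be ignored, in which case a minimum or maximum number of intervals is used instead: specify the upper and lower limits on the number of intervals.
3. A categorical variable must first be ordered in some way before it can be binned.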

Binning code with a maximum number of intervals:

def Chi2(df, total_col, bad_col, overallRate):
    '''
    :param df: the dataset containing the total count and bad count
    :param total_col: total count of each value in the variable
    :param bad_col: bad count of each value in the variable
    :param overallRate: the overall bad rate of the training set
    :return: the chi-square value
    '''
    df2 = df.copy()
    df2['expected'] = df[total_col].apply(lambda x: x * overallRate)
    combined = zip(df2['expected'], df2[bad_col])
    chi = [(i[0] - i[1]) ** 2 / i[0] for i in combined]
    chi2 = sum(chi)
    return chi2


### ChiMerge_MaxInterval: split the continuous variable using Chi-square value by specifying the max number of intervals
def ChiMerge_MaxInterval_Original(df, col, target, max_interval=5):
    '''
    :param df: the dataframe containing splitted column, and target column with 1-0
    :param col: splitted column
    :param target: target column with 1-0
    :param max_interval: the maximum number of intervals. If the raw column has attributes less than this parameter, the function will not work
    :return: the combined bins
    '''
    colLevels = set(df[col])
    # since we always combine neighbouring intervals, we need to sort the attributes
    colLevels = sorted(list(colLevels))  # sort the values first, then compute the bins
    N_distinct = len(colLevels)
    if N_distinct <= max_interval:  # If the raw column has attributes less than this parameter, the function will not work
        print("The number of original levels for {} is less than or equal to max intervals".format(col))
        return colLevels[:-1]
    else:
        # Step 1: group the dataset by col and work out the total count & bad count in each level of the raw column
        total = df.groupby([col])[target].count()
        total = pd.DataFrame({'total': total})
        bad = df.groupby([col])[target].sum()
        bad = pd.DataFrame({'bad': bad})
        regroup = total.merge(bad, left_index=True, right_index=True, how='left')  # join on the indexes of both sides
        regroup.reset_index(level=0, inplace=True)
        N = sum(regroup['total'])
        B = sum(regroup['bad'])
        # the overall bad rate will be used in calculating expected bad count
        overallRate = B * 1.0 / N
        # initially, each single attribute forms a single interval
        groupIntervals = [[i] for i in colLevels]  # e.g. [[1], [2], [3, 4]], where each inner list is one bin
        groupNum = len(groupIntervals)
        while len(groupIntervals) > max_interval:  # terminate once the number of intervals equals the pre-specified threshold
            # in each iteration, calculate the chi-square value of each interval
            chisqList = []
            for interval in groupIntervals:
                df2 = regroup.loc[regroup[col].isin(interval)]
                chisq = Chi2(df2, 'total', 'bad', overallRate)
                chisqList.append(chisq)
            # find the interval with the minimum chi-square, and combine it with the neighbour that has the smaller chi-square
            min_position = chisqList.index(min(chisqList))
            if min_position == 0:  # the first interval can only be combined with position 1
                combinedPosition = 1
            elif min_position == groupNum - 1:
                combinedPosition = min_position - 1
            else:  # in the middle: pick the neighbour with the smaller chi-square
                if chisqList[min_position - 1] <= chisqList[min_position + 1]:
                    combinedPosition = min_position - 1
                else:
                    combinedPosition = min_position + 1
            groupIntervals[min_position] = groupIntervals[min_position] + groupIntervals[combinedPosition]
            # after combining two intervals, we need to remove one of them
            groupIntervals.pop(combinedPosition)
            groupNum = len(groupIntervals)
        groupIntervals = [sorted(i) for i in groupIntervals]  # sort the values of each group in ascending order
        cutOffPoints = [i[-1] for i in groupIntervals[:-1]]  # the maximum of each group is a cut-off point
        return cutOffPoints

Binning with the chi-square threshold as the stopping condition:

def ChiMerge_MinChisq(df, col, target, confidenceVal=3.841):
    '''
    :param df: the dataframe containing splitted column, and target column with 1-0
    :param col: splitted column
    :param target: target column with 1-0
    :param confidenceVal: the specified chi-square threshold, by default the degree of freedom is 1 and using confidence level as 0.95
    :return: the splitted bins
    '''
    colLevels = set(df[col])
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    N = sum(regroup['total'])
    B = sum(regroup['bad'])
    overallRate = B * 1.0 / N
    colLevels = sorted(list(colLevels))
    groupIntervals = [[i] for i in colLevels]
    groupNum = len(groupIntervals)
    while True:
        # the termination condition: all the attributes form a single interval; or all the chi-squares are above the threshold
        if len(groupIntervals) == 1:
            break
        chisqList = []
        for interval in groupIntervals:
            df2 = regroup.loc[regroup[col].isin(interval)]
            chisq = Chi2(df2, 'total', 'bad', overallRate)
            chisqList.append(chisq)
        min_position = chisqList.index(min(chisqList))
        if min(chisqList) >= confidenceVal:
            break
        if min_position == 0:
            combinedPosition = 1
        elif min_position == groupNum - 1:
            combinedPosition = min_position - 1
        else:
            if chisqList[min_position - 1] <= chisqList[min_position + 1]:
                combinedPosition = min_position - 1
            else:
                combinedPosition = min_position + 1
        groupIntervals[min_position] = groupIntervals[min_position] + groupIntervals[combinedPosition]
        groupIntervals.pop(combinedPosition)
        groupNum = len(groupIntervals)
    return groupIntervals

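Unsupervised binning:

Equal-width binning and equal-frequency binning.

Equal-width binning
The range from the minimum to the maximum is divided into N equal parts. If A and B are the minimum and maximum, each interval has width W = (B - A)/N and the interval boundaries are A + W, A + 2W, ..., A + (N - 1)W. Only the boundaries are considered here; the number of instances in each part may differ.

Equal-frequency binning
The boundaries are chosen so that each interval contains roughly the same number of instances. With N = 10, for example, each interval should contain about 10% of the instances.

Drawbacks of both methods
Take equal-width binning into 5 intervals with a highest salary of 50000: everyone earning less than 10000 falls into the same interval. Equal-frequency binning may do the opposite: everyone with a salary above 50000 falls into the 50000 interval. Both methods ignore the class of each instance, so whether an instance lands in the right interval is largely a matter of chance.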
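Both unsupervised methods are available in pandas as `pd.cut` (equal width) and `pd.qcut` (equal frequency). The toy salary series below is made up to reproduce the drawback just described; it is a sketch, not part of the original post.

```python
import pandas as pd

# Nine ordinary salaries and one very high one (values are made up).
salary = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 50000])

# Equal-width: 5 intervals of the same width between the minimum and maximum.
equal_width = pd.cut(salary, bins=5, labels=False)
# Equal-frequency: 5 intervals holding about 20% of the instances each.
equal_freq = pd.qcut(salary, q=5, labels=False)

print(equal_width.value_counts().sort_index().to_dict())
print(equal_freq.value_counts().sort_index().to_dict())
```

With equal-width binning the nine ordinary salaries all share the first bin and the outlier sits alone in the last one. With equal-frequency binning every bin holds two salaries, so 9000 and 50000 end up in the same bin.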

After binning a feature, each group (bin) must be WOE-encoded before it can be fed into the model for training.

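WOE encoding

WOE (weight of evidence)

A supervised encoding that uses the concentration of the predicted class within each bin as the encoded value.

Advantages
It brings feature values onto a similar scale.
(Empirically, the absolute value of WOE ranges between 0.1 and 3.)
It has a business interpretation.

Drawback
Every bin must contain both good and bad samples.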

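Feature information value

IV (Information Value) measures how strongly a feature concentrates information about the target variable.

Decomposition of the information value:

Here Gi and Bi are the shares of all good samples and of all bad samples that fall into bin i.
WOE expresses how much the distributions of the two classes differ.
(Gi - Bi) weights the importance of that difference.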
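The formula that accompanied this decomposition is not preserved in the text. From the definitions of Gi and Bi above and from the CalcWOE code in this post, it can presumably be written as follows, where the number of bins n is an assumed notation:

```latex
\mathrm{WOE}_i = \ln\frac{G_i}{B_i}, \qquad
\mathrm{IV} = \sum_{i=1}^{n} \left(G_i - B_i\right)\mathrm{WOE}_i
            = \sum_{i=1}^{n} \left(G_i - B_i\right)\ln\frac{G_i}{B_i}
```

Every term of the sum is non-negative, because G_i - B_i and ln(G_i / B_i) always have the same sign. The logarithm is undefined when a bin has no good or no bad samples, which is the drawback of WOE noted earlier.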

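What the information value is used for
Variable selection:

It is a non-negative measure.
A high IV means the feature is strongly associated with the target variable.
The target variable must be binary.
An excessively high IV may signal a potential risk.
The finer the binning, the higher the IV.
Common thresholds:
<= 0.02: not predictive, unusable
0.02 to 0.1: weakly predictive
0.1 to 0.2: moderately predictive
0.2+: highly predictive

Note that the IV discussed above is the sum of the IVs of all bins of a variable.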
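A small worked example may help here. It computes the WOE of each bin and the total IV by hand, using WOE = ln(good share / bad share) as in the CalcWOE code of this post. The bin counts are made up for illustration.

```python
import math

# (good, bad) counts per bin of one binned variable; the numbers are made up
bins = {"Bin 0": (400, 10), "Bin 1": (350, 30), "Bin 2": (250, 60)}
total_good = sum(good for good, bad in bins.values())  # 1000
total_bad = sum(bad for good, bad in bins.values())    # 100

woe = {}
iv = 0.0
for name, (good, bad) in bins.items():
    good_share = good / float(total_good)  # share of all good samples in this bin
    bad_share = bad / float(total_bad)     # share of all bad samples in this bin
    woe[name] = math.log(good_share / bad_share)
    # each bin contributes (good share - bad share) * WOE to the total IV
    iv += (good_share - bad_share) * woe[name]

print(woe)
print(iv)
```

Bin 0 holds 40% of the good samples but only 10% of the bad ones, so its WOE is ln(4) and positive; Bin 2 is the opposite case and gets a negative WOE. The three contributions add up to an IV of about 0.73.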

Code for computing WOE and IV:

def CalcWOE(df, col, target):
    '''
    :param df: dataframe containing feature and target
    :param col: the column must already be binned; this function computes the WOE of each bin and the total IV
    :param target: good/bad indicator
    :return: the WOE of each bin (as a dictionary) and the total IV
    '''
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    N = sum(regroup['total'])
    B = sum(regroup['bad'])
    regroup['good'] = regroup['total'] - regroup['bad']
    G = N - B
    regroup['bad_pcnt'] = regroup['bad'].map(lambda x: x * 1.0 / B)
    regroup['good_pcnt'] = regroup['good'].map(lambda x: x * 1.0 / G)
    regroup['WOE'] = regroup.apply(lambda x: np.log(x.good_pcnt * 1.0 / x.bad_pcnt), axis=1)
    WOE_dict = regroup[[col, 'WOE']].set_index(col).to_dict(orient='index')
    IV = regroup.apply(lambda x: (x.good_pcnt - x.bad_pcnt) * np.log(x.good_pcnt * 1.0 / x.bad_pcnt), axis=1)
    IV = sum(IV)
    return {"WOE": WOE_dict, 'IV': IV}

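Some readers may ask: everything above is supervised binning and supervised WOE encoding, so how can these supervised methods be applied to the prediction set?

Looking at the formulas for supervised chi-square binning and supervised WOE encoding, it is easy to see that both results take the form of a ratio (for chi-square binning, a ratio of the form (bad count - expected bad count) / expected bad count; WOE is similar). Suppose, for example, that good and bad samples in the prediction set are imbalanced and we undersample the bad samples or oversample the good samples. As long as the sampling is uniform, the ratio behind chi-square binning is in theory unchanged, and so is the ratio behind WOE. In other words, the chi-square grouping and the WOE encoding on the prediction set are the same as on the training set.

So in the training set, once a continuous variable has been binned, each value of the variable is looked up: whichever bin it falls into, the WOE code of that bin replaces the value when training the model.

The prediction set is handled similarly, but a value there may fall outside every bin. Suppose that in the training set [x1, x2] forms one bin and [x3, x4] another; a value in the prediction set could be (x2 + x3)/2, which lies in neither. The bins are therefore defined by their cut-off points: the first bin is x <= x2, the second is x2 < x <= x4, and so on. In general, a value that is greater than the maximum of bin i-1 and no greater than the maximum of bin i belongs to bin i. This way the bins also cover values that never appeared in the training samples, so every value in the prediction set has a bin and a corresponding WOE code.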
　　

def AssignBin(x, cutOffPoints):
    '''
    :param x: the value of variable
    :param cutOffPoints: the maximum of each group, i.e. the cut-off points
    :return: bin number, indexing from 0
    for example, if cutOffPoints = [10,20,30], if x = 7, return Bin 0. If x = 35, return Bin 3
    '''
    numBin = len(cutOffPoints) + 1
    if x <= cutOffPoints[0]:
        return 'Bin 0'
    elif x > cutOffPoints[-1]:
        return 'Bin {}'.format(numBin - 1)
    else:
        # only adjacent pairs of cut-off points are compared, so the index never runs past the list
        for i in range(0, len(cutOffPoints) - 1):
            if cutOffPoints[i] < x <= cutOffPoints[i + 1]:
                return 'Bin {}'.format(i + 1)
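For a whole column, the same assignment can be done in one vectorized call. The sketch below uses `numpy.searchsorted` as an alternative to calling AssignBin row by row; it is a suggested variant rather than the author's code, and the sample values are made up.

```python
import numpy as np

cutOffPoints = [10, 20, 30]
values = np.array([7, 10, 11, 30, 35])

# side='left' returns the number of cut-off points strictly below each value,
# which is the index of the right-closed bin (c[i-1], c[i]] that contains it
bin_index = np.searchsorted(cutOffPoints, values, side='left')
labels = ['Bin {}'.format(i) for i in bin_index]
print(labels)
```

A value equal to a cut-off point stays in the lower bin, and a value above the last cut-off point lands in the last bin, so values never seen in training still get a bin.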

　　
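If binning turns out to separate good and bad samples completely, be careful: the continuous variable may be a post-event variable, that is, one recorded only after the outcome is known.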

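Points to note when binning

For continuous variables:

Bin with ChiMerge.
If there are special values, give each its own bin; for example, put -1 in a bin by itself.
Work out which bin each value of the variable belongs to (the AssignBin function above does this) and replace the original value with the bin number.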

Check that the bad rate is monotone across the bins. If it is not, keep merging adjacent bins until the bad rate is monotone. (This can be relaxed to a U shape.)

## determine whether the bad rate is monotone along the sortByVar
def BadRateMonotone(df, sortByVar, target):
    # the df[sortByVar] column has already been binned
    df2 = df.sort_values(by=[sortByVar])
    total = df2.groupby([sortByVar])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df2.groupby([sortByVar])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    combined = zip(regroup['total'], regroup['bad'])
    badRate = [x[1] * 1.0 / x[0] for x in combined]
    # compare each bin with the next one; monotone means all comparisons agree
    badRateMonotone = [badRate[i] < badRate[i + 1] for i in range(len(badRate) - 1)]
    Monotone = len(set(badRateMonotone))
    # a single bin (no comparisons at all) is treated as monotone
    return Monotone <= 1

The procedure above converges, because with only 2 bins the bad rate is monotone by definition.
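The merging loop itself is not shown in the post. A minimal sketch is given below, working on per-bin (total, bad) counts. The rule for choosing which adjacent pair to merge (the pair with the closest bad rates) is an assumption made for this example, as are the function names `is_monotone` and `merge_until_monotone` and the counts.

```python
def is_monotone(rates):
    """True if the rates are strictly increasing or strictly decreasing."""
    ups = [rates[i] < rates[i + 1] for i in range(len(rates) - 1)]
    return len(set(ups)) <= 1


def merge_until_monotone(bins):
    """bins: list of (total, bad) per bin, in bin order. Returns the merged list."""
    bins = [tuple(b) for b in bins]
    while len(bins) > 2:
        rates = [bad / float(total) for total, bad in bins]
        if is_monotone(rates):
            break
        # assumed rule: merge the adjacent pair whose bad rates are closest
        gaps = [abs(rates[i] - rates[i + 1]) for i in range(len(rates) - 1)]
        i = gaps.index(min(gaps))
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i:i + 2] = [merged]
    return bins


# bad rates 5%, 12%, 10%, 20%: the dip from 12% to 10% breaks monotonicity
merged = merge_until_monotone([(100, 5), (100, 12), (100, 10), (100, 20)])
print(merged)
```

The loop stops at two bins at the latest, which matches the convergence argument above.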

Check the largest bin: if it holds more than 90% of all records, discard the variable.

def MaximumBinPcnt(df, col):
    # share of all records that falls into the largest bin of col
    N = df.shape[0]
    total = df.groupby([col])[col].count()
    pcnt = total * 1.0 / N
    return max(pcnt)

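For categorical variables:

When there are few categories, binning is in principle unnecessary.
Otherwise, when there are many categories, replace each value with its bad rate, turning the variable into a continuous one, and then bin it.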

def BadRateEncoding(df, col, target):
    '''
    :param df: dataframe containing feature and target
    :param col: the feature that needs to be encoded with bad rate, usually categorical type
    :param target: good/bad indicator
    :return: the assigned bad rate to encode the categorical feature
    '''
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    regroup['bad_rate'] = regroup.apply(lambda x: x.bad * 1.0 / x.total, axis=1)
    br_dict = regroup[[col, 'bad_rate']].set_index([col]).to_dict(orient='index')
    badRateEnconding = df[col].map(lambda x: br_dict[x]['bad_rate'])
    return {'encoding': badRateEnconding, 'br_rate': br_dict}

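Otherwise, check the largest bin: if it holds more than 90% of all records, discard the variable.
When one or more categories have a bad rate of 0, they must be merged with the bin that has the smallest non-zero bad rate.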

### If we find any categories with 0 bad, then we combine these categories with that having smallest non-zero bad rate
def MergeBad0(df, col, target):
    '''
    :param df: dataframe containing feature and target
    :param col: the feature whose zero-bad categories need to be merged, usually categorical type
    :param target: good/bad indicator
    :return: a dictionary mapping each category to its new bin label
    '''
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    regroup['bad_rate'] = regroup['bad'] * 1.0 / regroup['total']
    # sort by bad rate and reset the index, so that label i is also position i
    regroup = regroup.sort_values(by='bad_rate').reset_index(drop=True)
    col_regroup = [[i] for i in regroup[col]]
    for i in range(regroup.shape[0] - 1):
        # merge the group with the lowest bad rate into its neighbour
        col_regroup[1] = col_regroup[0] + col_regroup[1]
        col_regroup.pop(0)
        # stop once the merged group contains a category with a non-zero bad rate
        if regroup['bad_rate'][i + 1] > 0:
            break
    newGroup = {}
    for i in range(len(col_regroup)):
        for g2 in col_regroup[i]:
            newGroup[g2] = 'Bin ' + str(i)
    return newGroup

When the variable separates the target variable completely, its validity must be checked carefully (it may be a post-event variable).

Single-variable analysis

Use IV to check that the variable is useful (the usual threshold range is (0.02, 0.8)).

iv_threshold = 0.02
# k, v are a column name and the IV of that column
varByIV = [k for k, v in var_IV.items() if v > iv_threshold]
# WOE_dict is a dictionary of dictionaries
WOE_encoding = []
for k in varByIV:
    if k in trainData.columns:
        trainData[str(k) + '_WOE'] = trainData[k].map(lambda x: WOE_dict[k][x]['WOE'])
        WOE_encoding.append(str(k) + '_WOE')
    elif k + str('_Bin') in trainData.columns:
        k2 = k + str('_Bin')
        trainData[str(k) + '_WOE'] = trainData[k2].map(lambda x: WOE_dict[k][x]['WOE'])
        WOE_encoding.append(str(k) + '_WOE')
    else:
        print("{} cannot be found in trainData".format(k))

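For continuous variables, check that the bad rate is monotone (this can be relaxed to a U shape).
No single interval should hold too large a share (generally no more than 90%; discard the variable if it does).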

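Multivariate analysis

Pairwise correlation between variables. When two variables are highly correlated, only one of them can be kept:

Keep the one with the higher IV,
or keep the one whose bins are more balanced (scores will then be more evenly spread later on).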

from itertools import combinations
from pandas.plotting import scatter_matrix

#### we can check the correlation matrix plot
col_to_index = {WOE_encoding[i]: 'var' + str(i) for i in range(len(WOE_encoding))}
# sample from the list of columns, since too many columns cannot be displayed in a single plot
corrCols = random.sample(WOE_encoding, min(15, len(WOE_encoding)))
sampleDf = trainData[corrCols].rename(columns=col_to_index)
scatter_matrix(sampleDf, alpha=0.2, figsize=(6, 6), diagonal='kde')

# alternatively, we check each pair of independent variables, and select the variable with higher IV if they are highly correlated
compare = list(combinations(varByIV, 2))  # every pair of variables in varByIV
removed_var = []
rho_threshold = 0.8
for pair in compare:
    (x1, x2) = pair
    rho = np.corrcoef([trainData[str(x1) + "_WOE"], trainData[str(x2) + "_WOE"]])[0, 1]
    if abs(rho) >= rho_threshold:
        if var_IV[x1] > var_IV[x2]:  # keep the one with the larger IV
            removed_var.append(x2)
        else:
            removed_var.append(x1)

Multivariate analysis: multicollinearity among variables
It is usually measured with VIF, and VIF < 10 is required:
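The formula is not preserved at this point in the text. The definition used by the VIF code in this post, with R_i^2 the coefficient of determination obtained by regressing variable x_i on all the other variables, is the following (the subscript notation is assumed):

```latex
\mathrm{VIF}_i = \frac{1}{1 - R_i^2}
```

For instance, R_i^2 = 0.9 gives VIF_i = 10, which is exactly the cut-off mentioned above: a variable crosses it when the other variables explain 90% or more of its variance.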
　

import numpy as np
from sklearn.linear_model import LinearRegression

selected_by_corr = [i for i in varByIV if i not in removed_var]
for i in range(len(selected_by_corr)):
    # regress the i-th WOE variable on all the other selected WOE variables
    x0 = trainData[selected_by_corr[i] + '_WOE']
    x0 = np.array(x0)
    X_Col = [k + '_WOE' for k in selected_by_corr if k != selected_by_corr[i]]
    X = trainData[X_Col]
    X = np.array(X)
    regr = LinearRegression()
    clr = regr.fit(X, x0)
    x_pred = clr.predict(X)
    R2 = 1 - ((x_pred - x0) ** 2).sum() / ((x0 - x0.mean()) ** 2).sum()
    vif = 1 / (1 - R2)
    print("The vif for {0} is {1}".format(selected_by_corr[i], vif))

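When the VIF of some variable Xi exceeds 10, remove the other variables one at a time. If removing variable Xk brings the VIF below 10, drop whichever of {Xi, Xk} has the smaller IV.
Computing VIF is usually not a required step. After the single-variable processing, fit the logistic regression model and evaluate its predictions; only when the results are very poor is multivariate analysis needed to remove multicollinearity.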

Summary of this post:
　
