Building a Scorecard from 0 to 1: Variable Binning
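Variable binning is a key step in scorecard modeling, arguably the core of the whole process. Well-chosen bins remove the effect of a variable's scale, dampen noise such as outliers, and help keep the model from overfitting. Binning also gives the model an interpretation that the business side can follow.
The rest of this article implements the binning step of building a scorecard.
First, split the variables into numerical and categorical ones. The points to watch when binning each type are as follows: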
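- If a categorical variable has no more than 5 categories, it does not need to be binned.
- If it has more than 5 categories, there are two options. First, if there are very many categories, apply bad_rate encoding and treat the result as a numerical variable. Second, if there are not that many, reduce the cardinality by merging categories until at most 5 remain.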
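The second option can be sketched in a few lines. This is only one possible way to reduce cardinality (keep the most frequent categories and lump the rest together); the helper name reduce_cardinality, the "OTHER" label, and the sample data are assumptions made for illustration and do not come from the article.

```python
import pandas as pd


def reduce_cardinality(series, max_levels=5, other_label="OTHER"):
    """Keep the most frequent categories and lump the rest into one level.

    Hypothetical helper for illustration only.
    """
    counts = series.value_counts()          # sorted from most to least frequent
    if len(counts) <= max_levels:
        return series.copy()
    keep = counts.index[: max_levels - 1]   # leave one slot for other_label
    return series.where(series.isin(keep), other_label)


# Made-up loan purposes with 7 distinct categories
purpose = pd.Series(
    ["car"] * 5 + ["home"] * 4 + ["edu"] * 3 + ["medical"] * 2
    + ["moving", "wedding", "vacation"]
)
reduced = reduce_cardinality(purpose, max_levels=5)
print(reduced.value_counts())
```

In practice the categories to merge are often chosen by business meaning or by similar bad rates rather than by frequency alone, so treat the frequency rule here as a placeholder.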
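Numerical variables can be binned with unsupervised or supervised methods. Unsupervised methods include equal-frequency binning, equal-width binning, and cluster-based binning. Supervised methods include chi-square binning and optimal binning, among others.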
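For comparison with the supervised approach used later, here is a minimal sketch of two unsupervised methods using pandas' pd.cut (equal width) and pd.qcut (equal frequency). The income series is randomly generated for illustration and is not part of the article's dataset.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed data standing in for a numerical variable
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=1000), name="income")

equal_width = pd.cut(income, bins=4)   # 4 intervals spanning equal value ranges
equal_freq = pd.qcut(income, q=4)      # 4 intervals holding about the same number of rows

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

On skewed data the equal-width bins end up very unevenly filled, while the equal-frequency bins stay balanced. Neither method looks at the target, which is the main difference from chi-square binning.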
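The dataset has 14 numerical variables and 6 categorical variables. Two of the categorical variables, 'zip_code' and 'addr_state', have very many categories, so they are bad_rate encoded and then treated as numerical variables. The other 4 categorical variables are binned on their own.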
The function below treats each category as one bin and computes the WOE and IV of each variable.

    import numpy as np
    import pandas as pd


    def binning_cate(df, col_list, target):
        """
        df: dataset
        col_list: list of variable names
        target: name of the target column (1 = bad, 0 = good)
        return:
            bin_df: list holding the binning result of each variable
            iv_value: list holding the IV of each variable
        """
        total = df[target].count()
        bad = df[target].sum()
        good = total - bad
        all_odds = good * 1.0 / bad
        bin_df = []
        iv_value = []
        for col in col_list:
            d1 = df.groupby(col)  # one bin per category
            d2 = pd.DataFrame()
            d2['min_bin'] = d1[col].min()
            d2['max_bin'] = d1[col].max()
            d2['total'] = d1[target].count()
            d2['totalrate'] = d2['total'] / total
            d2['bad'] = d1[target].sum()
            d2['badrate'] = d2['bad'] / d2['total']
            d2['good'] = d2['total'] - d2['bad']
            d2['goodrate'] = d2['good'] / d2['total']
            d2['badattr'] = d2['bad'] / bad     # share of all bad samples in this bin
            d2['goodattr'] = d2['good'] / good  # share of all good samples in this bin
            d2['odds'] = d2['good'] / d2['bad']
            # Good/bad index: bin odds relative to overall odds, suffixed G or B
            GB_list = []
            for i in d2.odds:
                if i >= all_odds:
                    GB_index = str(round((i / all_odds) * 100, 0)) + 'G'
                else:
                    GB_index = str(round((all_odds / i) * 100, 0)) + 'B'
                GB_list.append(GB_index)
            d2['GB_index'] = GB_list
            d2['woe'] = np.log(d2['badattr'] / d2['goodattr'])
            d2['bin_iv'] = (d2['badattr'] - d2['goodattr']) * d2['woe']
            d2['IV'] = d2['bin_iv'].sum()
            iv = round(d2['bin_iv'].sum(), 3)
            print('Variable: {}'.format(col))
            print('IV: {}'.format(iv))
            print('\t')
            bin_df.append(d2)
            iv_value.append(iv)
        return bin_df, iv_value

Note: if one bin of a categorical variable contains only good samples or only bad samples, that bin's WOE is -inf or inf and the variable's IV becomes infinite. In that case, reduce the variable's cardinality or re-bin it.
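The infinite-WOE case is easy to reproduce. The sketch below uses a made-up 8-row table in which grade "A" contains no bad samples, and follows the same convention as binning_cate, WOE = ln(badattr / goodattr); the column names and data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "grade": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "y":     [0,   0,   0,   0,   1,   1,   0,   1],   # 1 = bad, 0 = good
})
bad_total = toy["y"].sum()
good_total = len(toy) - bad_total

stats = toy.groupby("grade")["y"].agg(total="count", bad="sum")
stats["good"] = stats["total"] - stats["bad"]
stats["badattr"] = stats["bad"] / bad_total
stats["goodattr"] = stats["good"] / good_total
with np.errstate(divide="ignore"):   # silence the log(0) warning
    stats["woe"] = np.log(stats["badattr"] / stats["goodattr"])
stats["bin_iv"] = (stats["badattr"] - stats["goodattr"]) * stats["woe"]

print(stats[["bad", "good", "woe", "bin_iv"]])
print("IV:", stats["bin_iv"].sum())
```

Bin "A" has badattr = 0, so its WOE is -inf and its IV contribution is +inf, which makes the total IV infinite. Merging "A" with a neighbouring category removes the problem.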
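Next, look at the details of each bin.
As a rule of thumb, a variable with an IV above 0.01 can enter the model. The IV should not be too high either: a very high IV means the variable is so predictive on its own that it is better pulled out and used as a standalone strategy rule, and a scorecard is best built from weak variables. Likewise, the WOE of a single bin should not exceed 1. Since WOE = ln(badattr / goodattr), a WOE above 1 means the bin's share of all bad samples is more than e ≈ 2.72 times its share of all good samples. The rule of thumb used here reads this as the bin being at least 65% good or 65% bad, although the exact share depends on the overall bad rate; such a bin could serve as a rule by itself.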
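To make the WOE threshold concrete, the sketch below converts a WOE value back into the bad rate inside the bin. It assumes the convention WOE = ln(badattr / goodattr) used in this article's code, under which the bin's bad-to-good odds equal exp(WOE) times the overall bad-to-good odds. The function name and the overall bad rates tried are illustrative assumptions.

```python
import math


def bin_bad_rate_from_woe(woe, overall_bad_rate):
    """Bad rate inside a bin implied by its WOE, with WOE = ln(badattr / goodattr)."""
    overall_odds = overall_bad_rate / (1 - overall_bad_rate)   # bad : good, whole sample
    bin_odds = math.exp(woe) * overall_odds                    # bad : good, inside the bin
    return bin_odds / (1 + bin_odds)


# How "bad" is a bin with WOE = 1 under different overall bad rates?
for p in (0.05, 0.20, 0.50):
    print(p, round(bin_bad_rate_from_woe(1.0, p), 3))
```

A WOE of 0 returns the overall bad rate unchanged, and the same WOE of 1 corresponds to very different bin bad rates depending on how rare bad samples are overall, so the threshold is best read as a relative measure.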
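The binning results can then be visualized as bar charts.
Next, apply bad_rate encoding to zip_code and addr_state: map each category of the variable to the bad rate (the share of bad samples) within that category. This turns a categorical variable into a numerical one.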
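A toy illustration of the mapping, as a minimal sketch: the data is made up, and groupby(...).transform("mean") is used as a shortcut here rather than the helper functions that follow in this article.

```python
import pandas as pd

toy = pd.DataFrame({
    "addr_state": ["CA", "CA", "CA", "NY", "NY", "TX", "TX", "TX", "TX"],
    "y":          [1,    0,    0,    1,    1,    0,    0,    0,    1],   # 1 = bad
})
# Each row receives the bad rate of its own category
toy["addr_state_br_encoding"] = toy.groupby("addr_state")["y"].transform("mean")
print(toy.drop_duplicates("addr_state"))
```

Because y is 0/1, the group mean is exactly the bad rate of that category. The mapping learned on training data has to be stored so that the same values can be applied to new data, which is why the article's implementation also returns the dictionary.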

    import pandas as pd


    def BinBadRate(df, col, target, grantRateIndicator=0):
        '''
        :param df: dataset on which to compute the bad rate
        :param col: feature whose bad rate is computed
        :param target: good/bad label (1 = bad, 0 = good)
        :param grantRateIndicator: 1 to also return the overall bad rate, 0 otherwise
        :return: bad rate of each bin, plus the overall bad rate when grantRateIndicator == 1
        '''
        total = df.groupby(col)[target].count()
        total = pd.DataFrame({'total': total})
        bad = df.groupby(col)[target].sum()
        bad = pd.DataFrame({'bad': bad})
        # Total and bad counts per bin
        regroup = total.merge(bad, left_index=True, right_index=True, how='left')
        regroup = regroup.reset_index()
        regroup['bad_rate'] = regroup['bad'] * 1.0 / regroup['total']  # bad rate per bin
        dicts = dict(zip(regroup[col], regroup['bad_rate']))  # bin -> bad rate
        if grantRateIndicator == 0:
            return (dicts, regroup)
        N = sum(regroup['total'])
        B = sum(regroup['bad'])
        overallRate = B * 1.0 / N
        return (dicts, regroup, overallRate)


    def BadRateEncoding(df, col, target):
        '''
        :param df: dataframe containing feature and target
        :param col: the feature that needs to be encoded with bad rate, usually categorical type
        :param target: good/bad indicator
        :return: the assigned bad rate to encode the categorical feature
        '''
        br_dict = BinBadRate(df, col, target, grantRateIndicator=0)[0]
        badRateEncoding = df[col].map(br_dict)
        return {'encoding': badRateEncoding, 'bad_rate': br_dict}


    # Apply bad_rate encoding to zip_code and addr_state.
    # trainData (the training DataFrame) and num_features (the list of numerical
    # feature names) are assumed to be defined before this point.
    br_encoding_dict = {}
    more_value_features = ['zip_code', 'addr_state']
    for col in more_value_features:
        br_encoding = BadRateEncoding(trainData, col, 'y')
        trainData[col + '_br_encoding'] = br_encoding['encoding']
        br_encoding_dict[col] = br_encoding['bad_rate']
        num_features.append(col + '_br_encoding')

bad_rate encoding produces two new columns. Both are added to the numerical variables and go through chi-square binning together with them.
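Chi-square binning starts from many fine-grained bins and repeatedly merges the adjacent pair whose bad rates are most alike, that is, the pair with the smallest chi-square value. The sketch below computes the textbook Pearson statistic for adjacent bins on made-up counts. pair_chi2 is a hypothetical helper written for this illustration; the article's own cal_chi2 uses a simplified form of the statistic.

```python
import numpy as np


def pair_chi2(bad, total):
    """Pearson chi-square for a table of bins x {bad, good}.

    bad, total: per-bin counts for the bins being compared.
    """
    bad = np.asarray(bad, dtype=float)
    total = np.asarray(total, dtype=float)
    observed = np.column_stack([bad, total - bad])
    pooled_rate = observed.sum(axis=0) / observed.sum()   # [bad share, good share]
    expected = np.outer(total, pooled_rate)
    mask = expected > 0                                    # skip cells with zero expectation
    return float((((observed - expected) ** 2)[mask] / expected[mask]).sum())


# Three adjacent bins with bad rates of 10%, 12% and 40%
bad = [10, 12, 40]
total = [100, 100, 100]
chi_01 = pair_chi2(bad[0:2], total[0:2])   # bins 0 and 1
chi_12 = pair_chi2(bad[1:3], total[1:3])   # bins 1 and 2
print(chi_01, chi_12)
```

Bins 0 and 1 have nearly the same bad rate, so their chi-square value is much smaller than that of bins 1 and 2, and they would be merged first.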

    import math

    import numpy as np
    import pandas as pd


    # Binning of numerical variables.
    # Step 1: use chi-square binning to produce the cut points of a variable.
    def split_data(df, col, split_num):
        """
        df: original dataset
        col: variable to bin
        split_num: number of equal-sized groups (gives split_num - 1 split points
                   before de-duplication)
        """
        df2 = df.copy()
        count = df2.shape[0]  # total number of samples
        n = math.floor(count / split_num)  # samples per group after an even split
        split_index = [i * n for i in range(1, split_num)]  # positions of the split points
        values = sorted(list(df2[col]))  # values in ascending order
        split_value = [values[i] for i in split_index]  # value at each split position
        split_value = sorted(list(set(split_value)))  # de-duplicate and sort
        return split_value


    def assign_group(x, split_bin):
        """
        x: value of the variable
        split_bin: list of split points produced by split_data
        """
        n = len(split_bin)
        if x <= min(split_bin):
            return min(split_bin)  # at or below the smallest split point
        elif x > max(split_bin):
            return 10e10  # above the largest split point: map to a large sentinel
        else:
            for i in range(n - 1):
                if split_bin[i] < x <= split_bin[i + 1]:
                    return split_bin[i + 1]  # map to the upper of the two split points


    def bin_bad_rate(df, col, target, grantRateIndicator=0):
        """
        df: original dataset
        col: original variable, or the column it was mapped to
        target: name of the target column
        grantRateIndicator: whether to also return the overall bad rate
        """
        total = df.groupby(col)[target].count()
        bad = df.groupby(col)[target].sum()
        total_df = pd.DataFrame({'total': total})
        bad_df = pd.DataFrame({'bad': bad})
        regroup = pd.merge(total_df, bad_df, left_index=True, right_index=True, how='left')
        regroup = regroup.reset_index()
        regroup['bad_rate'] = regroup['bad'] / regroup['total']  # bad rate per group
        dict_bad = dict(zip(regroup[col], regroup['bad_rate']))
        if grantRateIndicator == 0:
            return (dict_bad, regroup)
        total_all = df.shape[0]
        bad_all = df[target].sum()
        all_bad_rate = bad_all / total_all  # overall bad rate
        return (dict_bad, regroup, all_bad_rate)


    def cal_chi2(df, all_bad_rate):
        """
        df: regroup produced by bin_bad_rate
        all_bad_rate: overall bad rate produced by bin_bad_rate

        Only the bad-count term is used, with the expectation taken from the
        overall bad rate.
        """
        df2 = df.copy()
        df2['expected'] = df2['total'] * all_bad_rate  # expected bad count per group
        combined = zip(df2['expected'], df2['bad'])
        chi = [(i[0] - i[1]) ** 2 / i[0] for i in combined]  # chi-square term per group
        chi2 = sum(chi)
        return chi2


    def assign_bin(x, cutoffpoints):
        """
        x: value of the variable
        cutoffpoints: cut points of the bins
        """
        bin_num = len(cutoffpoints) + 1  # number of bins
        if x <= cutoffpoints[0]:
            return 'Bin 0'
        elif x > cutoffpoints[-1]:
            return 'Bin {}'.format(bin_num - 1)
        else:
            for i in range(len(cutoffpoints) - 1):
                if cutoffpoints[i] < x <= cutoffpoints[i + 1]:
                    return 'Bin {}'.format(i + 1)


    def ChiMerge(df, col, target, max_bin=5, min_binpct=0):
        col_unique = sorted(list(set(df[col])))  # sorted unique values
        n = len(col_unique)
        df2 = df.copy()
        if n > 100:
            # More than 100 unique values: coarsen them to at most 100 groups first
            split_col = split_data(df2, col, 100)
            df2['col_map'] = df2[col].map(lambda x: assign_group(x, split_col))
        else:
            df2['col_map'] = df2[col]
        (dict_bad, regroup, all_bad_rate) = bin_bad_rate(
            df2, 'col_map', target, grantRateIndicator=1)
        col_map_unique = sorted(list(set(df2['col_map'])))
        group_interval = [[i] for i in col_map_unique]  # start with one group per value

        # Merge the adjacent pair with the smallest chi-square until max_bin groups remain
        while len(group_interval) > max_bin:
            chi_list = []
            for i in range(len(group_interval) - 1):
                temp_group = group_interval[i] + group_interval[i + 1]
                chi_df = regroup[regroup['col_map'].isin(temp_group)]
                chi_list.append(cal_chi2(chi_df, all_bad_rate))
            best_combined = chi_list.index(min(chi_list))
            group_interval[best_combined] = (
                group_interval[best_combined] + group_interval[best_combined + 1])
            del group_interval[best_combined + 1]  # drop the right-hand group after merging
        group_interval = [sorted(i) for i in group_interval]
        cutoffpoints = [max(i) for i in group_interval[:-1]]  # cut point = max of each group

        # Check for bins that contain only good or only bad samples
        df2['col_map_bin'] = df2['col_map'].apply(lambda x: assign_bin(x, cutoffpoints))
        (dict_bad, regroup) = bin_bad_rate(df2, 'col_map_bin', target)
        [min_bad_rate, max_bad_rate] = [min(dict_bad.values()), max(dict_bad.values())]
        # A bad rate of 0 means only good samples; a bad rate of 1 means only bad samples
        while min_bad_rate == 0 or max_bad_rate == 1:
            bad01_index = regroup[regroup['bad_rate'].isin([0, 1])].col_map_bin.tolist()
            bad01_bin = bad01_index[0]
            if bad01_bin == max(regroup.col_map_bin):
                cutoffpoints = cutoffpoints[:-1]  # last bin: drop the largest cut point
            elif bad01_bin == min(regroup.col_map_bin):
                cutoffpoints = cutoffpoints[1:]  # first bin: drop the smallest cut point
            else:
                bad01_bin_index = list(regroup.col_map_bin).index(bad01_bin)
                prev_bin = list(regroup.col_map_bin)[bad01_bin_index - 1]
                df3 = df2[df2.col_map_bin.isin([prev_bin, bad01_bin])]
                (dict_bad, regroup1) = bin_bad_rate(df3, 'col_map_bin', target)
                chi1 = cal_chi2(regroup1, all_bad_rate)  # with the previous bin
                later_bin = list(regroup.col_map_bin)[bad01_bin_index + 1]
                df4 = df2[df2.col_map_bin.isin([later_bin, bad01_bin])]
                (dict_bad, regroup2) = bin_bad_rate(df4, 'col_map_bin', target)
                chi2 = cal_chi2(regroup2, all_bad_rate)  # with the next bin
                if chi1 < chi2:
                    cutoffpoints.remove(cutoffpoints[bad01_bin_index - 1])  # merge with previous
                else:
                    cutoffpoints.remove(cutoffpoints[bad01_bin_index])  # merge with next
            # Re-map to bins and recompute until no bin has a bad rate of 0 or 1
            df2['col_map_bin'] = df2['col_map'].apply(lambda x: assign_bin(x, cutoffpoints))
            (dict_bad, regroup) = bin_bad_rate(df2, 'col_map_bin', target)
            [min_bad_rate, max_bad_rate] = [min(dict_bad.values()), max(dict_bad.values())]

        # Check the minimum share of samples per bin
        if min_binpct > 0:
            df2['col_map_bin'] = df2['col_map'].apply(lambda x: assign_bin(x, cutoffpoints))
            # Share of all rows in each bin, ordered by bin label
            bin_pct = df2['col_map_bin'].value_counts(normalize=True).sort_index()
            while bin_pct.min() < min_binpct and len(cutoffpoints) > 2:
                # Same merging logic as the only-good / only-bad check above
                bins = list(bin_pct.index)
                min_pct_bin = bin_pct.idxmin()
                if min_pct_bin == bins[-1]:
                    cutoffpoints = cutoffpoints[:-1]
                elif min_pct_bin == bins[0]:
                    cutoffpoints = cutoffpoints[1:]
                else:
                    minpct_bin_index = bins.index(min_pct_bin)
                    prev_pct_bin = bins[minpct_bin_index - 1]
                    df5 = df2[df2['col_map_bin'].isin([min_pct_bin, prev_pct_bin])]
                    (dict_bad, regroup3) = bin_bad_rate(df5, 'col_map_bin', target)
                    chi3 = cal_chi2(regroup3, all_bad_rate)
                    later_pct_bin = bins[minpct_bin_index + 1]
                    df6 = df2[df2['col_map_bin'].isin([min_pct_bin, later_pct_bin])]
                    (dict_bad, regroup4) = bin_bad_rate(df6, 'col_map_bin', target)
                    chi4 = cal_chi2(regroup4, all_bad_rate)
                    if chi3 < chi4:
                        cutoffpoints.remove(cutoffpoints[minpct_bin_index - 1])
                    else:
                        cutoffpoints.remove(cutoffpoints[minpct_bin_index])
                # Recompute the bin shares after each merge
                df2['col_map_bin'] = df2['col_map'].apply(
                    lambda x: assign_bin(x, cutoffpoints))
                bin_pct = df2['col_map_bin'].value_counts(normalize=True).sort_index()
        return cutoffpoints


    # Step 2: bin the numerical variables (chi-square binning)
    def binning_num(df, target, col_list, max_bin=5, min_binpct=0):
        """
        df: dataset
        target: name of the target column
        col_list: list of variable names
        max_bin: maximum number of bins
        min_binpct: minimum share of all samples that a bin must hold
        return:
            bin_df: list holding the binning result of each variable
            iv_value: list holding the IV of each variable
        """
        total = df[target].count()
        bad = df[target].sum()
        good = total - bad
        all_odds = good / bad
        inf = float('inf')
        ninf = float('-inf')
        bin_df = []
        iv_value = []
        for col in col_list:
            cut = ChiMerge(df, col, target, max_bin=max_bin, min_binpct=min_binpct)
            cut.insert(0, ninf)
            cut.append(inf)
            bucket = pd.cut(df[col], cut)
            d1 = df.groupby(bucket, observed=False)
            d2 = pd.DataFrame()
            d2['min_bin'] = d1[col].min()
            d2['max_bin'] = d1[col].max()
            d2['total'] = d1[target].count()
            d2['totalrate'] = d2['total'] / total
            d2['bad'] = d1[target].sum()
            d2['badrate'] = d2['bad'] / d2['total']
            d2['good'] = d2['total'] - d2['bad']
            d2['goodrate'] = d2['good'] / d2['total']
            d2['badattr'] = d2['bad'] / bad
            d2['goodattr'] = d2['good'] / good
            d2['odds'] = d2['good'] / d2['bad']
            GB_list = []
            for i in d2.odds:
                if i >= all_odds:
                    GB_index = str(round((i / all_odds) * 100, 0)) + 'G'
                else:
                    GB_index = str(round((all_odds / i) * 100, 0)) + 'B'
                GB_list.append(GB_index)
            d2['GB_index'] = GB_list
            d2['woe'] = np.log(d2['badattr'] / d2['goodattr'])
            d2['bin_iv'] = (d2['badattr'] - d2['goodattr']) * d2['woe']
            d2['IV'] = d2['bin_iv'].sum()
            iv = round(d2['bin_iv'].sum(), 3)
            print('Variable: {}'.format(col))
            print('IV: {}'.format(iv))
            print('\t')
            bin_df.append(d2)
            iv_value.append(iv)
        return bin_df, iv_value

Next, look at the WOE plots.
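
    import matplotlib.pyplot as plt
    import seaborn as sns


    # WOE visualization
    def plot_woe(bin_df, hspace=0.4, wspace=0.4, plt_size=None, plt_num=None, x=None, y=None):
        """
        bin_df: list holding the binning result of each variable
        hspace: vertical spacing between subplots
        wspace: horizontal spacing between subplots
        plt_size: figure size
        plt_num: number of subplots
        x: number of rows in the subplot grid
        y: number of columns in the subplot grid
        return: WOE trend plot of each variable
        """
        plt.figure(figsize=plt_size)
        plt.subplots_adjust(hspace=hspace, wspace=wspace)
        for i, df in zip(range(1, plt_num + 1), bin_df):
            col_name = df.index.name
            df = df.reset_index()
            plt.subplot(x, y, i)
            plt.title(col_name)
            sns.pointplot(data=df, x=col_name, y='woe')
            plt.xlabel('')
            plt.xticks(rotation=30)
        return plt.show()


    # bin_df_num is assumed to be the bin_df list returned by binning_num
    # for the 16 numerical variables.
    plot_woe(bin_df_num, hspace=0.6, wspace=0.4, plt_size=(15, 15), plt_num=16, x=4, y=4)

A scorecard has to be interpretable, so the WOE should ideally be monotonic across bins. For example, int_rate_clean is split into 4 bins whose WOE rises monotonically, so the scores it maps to are monotonic as well, which makes the scorecard's business logic easy to explain. A variable whose WOE is not monotonic is still acceptable if the business logic can explain it, for instance a U-shaped curve, but a curve that zigzags up and down is not.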
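The visual check can be backed by a simple programmatic one. The sketch below classifies a sequence of per-bin WOE values by counting how often the direction changes; the function name woe_trend, its labels, and the sample values are assumptions for illustration, not part of the article's code.

```python
def woe_trend(woe_values):
    """Classify a WOE sequence as 'monotonic', 'u_shaped', or 'irregular'.

    'u_shaped' covers both a U and an inverted U (exactly one change of direction).
    """
    diffs = [b - a for a, b in zip(woe_values, woe_values[1:])]
    signs = [1 if d > 0 else -1 for d in diffs if d != 0]   # ignore flat steps
    turns = sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    if turns == 0:
        return "monotonic"
    if turns == 1:
        return "u_shaped"
    return "irregular"


print(woe_trend([-0.8, -0.2, 0.3, 0.9]))   # rising across all bins
print(woe_trend([0.5, -0.1, -0.4, 0.6]))   # falls, then rises
print(woe_trend([0.2, -0.3, 0.4, -0.1]))   # changes direction twice
```

Variables flagged as irregular are candidates for re-binning with fewer bins, while a single change of direction still needs a business explanation before it is accepted.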
Summary: variable binning comes down to examining how each feature's values relate to the bad rate. There are many binning methods, and the right one has to be chosen with the business logic in mind.
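[Author]: Labryant
[Original WeChat official account]: 风控猎人
[About]: Strategy analyst at a startup, motivated and working hard to improve. Nothing is settled yet, and any of us could be the dark horse.
[Reposting]: Please credit the source when reposting. Thank you!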