

[Risk Control Models] Building a Credit Scorecard Model with a DNN


[Blog]: https://blog.csdn.net/sunyaowu315
[Blog outline]: https://blog.csdn.net/sunyaowu315/article/details/82905347


Dataset overview:

The data used in this case study come from a competition dataset released by Paipaidai (拍拍贷) for credit application review. It consists of three files:

  • PPD_Training_Master_GBK_3_1_Training_Set.csv: the information borrowers declared on Paipaidai, some third-party data, and the target variable to be predicted.
  • PPD_LogInfo_3_1_Training_Set: borrowers' login records.
  • PPD_Userupdate_Info_3_1_Training_Set: information-update actions of some borrowers.

The modeling task is to process the data in these three files, extract features, and build a suitable model to predict post-loan performance.
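All three files share the application key Idx: the master file is one row per application, while the other two are event logs with many rows per Idx, so every derived feature has to be aggregated back to Idx before it can be joined onto the master table. A minimal sketch of that join pattern follows (file names are from the source; the login_count feature is purely illustrative):

import pandas as pd

# one row per application; contains the target variable (GBK-encoded file)
master = pd.read_csv('PPD_Training_Master_GBK_3_1_Training_Set.csv', encoding='gbk')
# event logs: many rows per application key 'Idx'
log_info = pd.read_csv('PPD_LogInfo_3_1_Training_Set.csv')
update_info = pd.read_csv('PPD_Userupdate_Info_3_1_Training_Set.csv')

# behavioural tables must be aggregated to one row per 'Idx' before joining;
# 'login_count' is an illustrative feature, not part of the original feature set
logins_per_user = log_info.groupby('Idx').size().rename('login_count')
modeling_table = master.set_index('Idx').join(logins_per_user)
modeling_table['login_count'] = modeling_table['login_count'].fillna(0)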

Key fields (shown as a figure in the original post)



Main program

# -*- coding: utf-8 -*-
import pandas as pd
import datetime
import collections
import numpy as np
import numbers
import random
import sys
import pickle
import operator
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot as plt

_path = r'C:\Users\A3\Desktop\DNN_scorecard'
sys.path.append(_path)


### For a set of time windows, count the cumulative frequencies ###
def TimeWindowSelection(df, daysCol, time_windows):
    '''
    :param df: the dataset containing the days variable
    :param daysCol: the column of days
    :param time_windows: the list of time windows
    :return: dict mapping each time window to the number of records inside it
    '''
    freq_tw = {}
    for tw in time_windows:
        freq = sum(df[daysCol].apply(lambda x: int(x <= tw)))
        freq_tw[tw] = freq
    return freq_tw


def DividedByZero(numerator, denominator):
    '''Return 0 when the denominator is 0; otherwise return the normal ratio.'''
    if denominator == 0:
        return 0
    else:
        return numerator * 1.0 / denominator


# normalize field names that mean the same thing
def ChangeContent(x):
    y = x.upper()
    if y == '_MOBILEPHONE':
        y = '_PHONE'
    return y


def MissingCategorial(df, x):
    # NaN != NaN, so v != v flags missing values
    missing_vals = df[x].map(lambda v: int(v != v))
    return sum(missing_vals) * 1.0 / df.shape[0]


def MissingContinuous(df, x):
    missing_vals = df[x].map(lambda v: int(np.isnan(v)))
    return sum(missing_vals) * 1.0 / df.shape[0]


def MakeupRandom(x, sampledList):
    # keep non-missing values; replace missing values by a random sampled value
    if x == x:
        return x
    else:
        randIndex = random.randint(0, len(sampledList) - 1)
        return sampledList[randIndex]


def Outlier_Detection(df, x):
    '''Truncate values of column x outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].'''
    p25, p75 = np.percentile(df[x], 25), np.percentile(df[x], 75)
    d = p75 - p25
    upper, lower = p75 + 1.5 * d, p25 - 1.5 * d
    truncation = df[x].map(lambda v: max(min(upper, v), lower))
    return truncation


##############################################################################
# Step 0: initial work: read the data files, check consistency of user Ids   #
##############################################################################
folderOfData = 'C:/Users/A3/Desktop/DNN_scorecard/'
data1 = pd.read_csv(folderOfData + 'PPD_LogInfo_3_1_Training_Set.csv', header=0)
data2 = pd.read_csv(folderOfData + 'PPD_Training_Master_GBK_3_1_Training_Set.csv', header=0, encoding='gbk')
data3 = pd.read_csv(folderOfData + 'PPD_Userupdate_Info_3_1_Training_Set.csv', header=0)

# split the dataset into a training set and a test set
all_ids = data2['Idx']
train_ids, test_ids = train_test_split(all_ids, test_size=0.3)
train_ids = pd.DataFrame(train_ids)
test_ids = pd.DataFrame(test_ids)

data1_train = pd.merge(left=train_ids, right=data1, on='Idx', how='inner')
data2_train = pd.merge(left=train_ids, right=data2, on='Idx', how='inner')
data3_train = pd.merge(left=train_ids, right=data3, on='Idx', how='inner')

data1_test = pd.merge(left=test_ids, right=data1, on='Idx', how='inner')
data2_test = pd.merge(left=test_ids, right=data2, on='Idx', how='inner')
data3_test = pd.merge(left=test_ids, right=data3, on='Idx', how='inner')

#####################################################################################################
# Step 1: derive features from PPD_LogInfo_3_1_Training_Set & PPD_Userupdate_Info_3_1_Training_Set  #
#####################################################################################################
# compare whether the four city variables match
data2_train['city_match'] = data2_train.apply(
    lambda x: int(x.UserInfo_2 == x.UserInfo_4 == x.UserInfo_8 == x.UserInfo_20), axis=1)
del data2_train['UserInfo_2']
del data2_train['UserInfo_4']
del data2_train['UserInfo_8']
del data2_train['UserInfo_20']

### extract the application date, compute the day gap and inspect its distribution
data1_train['logInfo'] = data1_train['LogInfo3'].map(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
data1_train['Listinginfo'] = data1_train['Listinginfo1'].map(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
data1_train['ListingGap'] = data1_train[['logInfo', 'Listinginfo']].apply(lambda x: (x[1] - x[0]).days, axis=1)

'''
Use 180 days as the largest time window to build the new features.
The available windows are 7, 30, 60, 90, 120, 150 and 180 days.
Within each window, compute the total number of logins, the number of distinct
login types, and the average count per login type.
'''
time_window = [7, 30, 60, 90, 120, 150, 180]
var_list = ['LogInfo1', 'LogInfo2']
data1GroupbyIdx = pd.DataFrame({'Idx': data1_train['Idx'].drop_duplicates()})

for tw in time_window:
    data1_train['TruncatedLogInfo'] = data1_train['Listinginfo'].map(lambda x: x + datetime.timedelta(-tw))
    temp = data1_train.loc[data1_train['logInfo'] >= data1_train['TruncatedLogInfo']]
    for var in var_list:
        # count the frequencies of LogInfo1 and LogInfo2
        count_stats = temp.groupby(['Idx'])[var].count().to_dict()
        data1GroupbyIdx[str(var) + '_' + str(tw) + '_count'] = data1GroupbyIdx['Idx'].map(lambda x: count_stats.get(x, 0))
        # count the distinct values of LogInfo1 and LogInfo2
        Idx_UserupdateInfo1 = temp[['Idx', var]].drop_duplicates()
        uniq_stats = Idx_UserupdateInfo1.groupby(['Idx'])[var].count().to_dict()
        data1GroupbyIdx[str(var) + '_' + str(tw) + '_unique'] = data1GroupbyIdx['Idx'].map(lambda x: uniq_stats.get(x, 0))
        # calculate the average count of each value in LogInfo1 and LogInfo2
        data1GroupbyIdx[str(var) + '_' + str(tw) + '_avg_count'] = data1GroupbyIdx[
            [str(var) + '_' + str(tw) + '_count', str(var) + '_' + str(tw) + '_unique']].\
            apply(lambda x: DividedByZero(x[0], x[1]), axis=1)

data3_train['ListingInfo'] = data3_train['ListingInfo1'].map(lambda x: datetime.datetime.strptime(x, '%Y/%m/%d'))
data3_train['UserupdateInfo'] = data3_train['UserupdateInfo2'].map(lambda x: datetime.datetime.strptime(x, '%Y/%m/%d'))
data3_train['ListingGap'] = data3_train[['UserupdateInfo', 'ListingInfo']].apply(lambda x: (x[1] - x[0]).days, axis=1)
collections.Counter(data3_train['ListingGap'])
hist_ListingGap = np.histogram(data3_train['ListingGap'])
hist_ListingGap = pd.DataFrame({'Freq': hist_ListingGap[0], 'gap': hist_ListingGap[1][1:]})
hist_ListingGap['CumFreq'] = hist_ListingGap['Freq'].cumsum()
hist_ListingGap['CumPercent'] = hist_ListingGap['CumFreq'].map(lambda x: x * 1.0 / hist_ListingGap.iloc[-1]['CumFreq'])

'''
Unify QQ and qQ, Idnumber and idNumber, MOBILEPHONE and PHONE.
Within each time slice, compute (1) the update frequency, (2) the number of
distinct updated items, (3) whether important items such as IDNUMBER,
HASBUYCAR, MARRIAGESTATUSID and PHONE were updated.
'''
data3_train['UserupdateInfo1'] = data3_train['UserupdateInfo1'].map(ChangeContent)
data3GroupbyIdx = pd.DataFrame({'Idx': data3_train['Idx'].drop_duplicates()})

time_window = [7, 30, 60, 90, 120, 150, 180]
for tw in time_window:
    data3_train['TruncatedLogInfo'] = data3_train['ListingInfo'].map(lambda x: x + datetime.timedelta(-tw))
    temp = data3_train.loc[data3_train['UserupdateInfo'] >= data3_train['TruncatedLogInfo']]

    # frequency of updating
    freq_stats = temp.groupby(['Idx'])['UserupdateInfo1'].count().to_dict()
    data3GroupbyIdx['UserupdateInfo_' + str(tw) + '_freq'] = data3GroupbyIdx['Idx'].map(lambda x: freq_stats.get(x, 0))

    # number of updated types (default 0 for an Idx with no updates in the window)
    Idx_UserupdateInfo1 = temp[['Idx', 'UserupdateInfo1']].drop_duplicates()
    uniq_stats = Idx_UserupdateInfo1.groupby(['Idx'])['UserupdateInfo1'].count().to_dict()
    data3GroupbyIdx['UserupdateInfo_' + str(tw) + '_unique'] = data3GroupbyIdx['Idx'].map(lambda x: uniq_stats.get(x, 0))

    # average count of each type; use the zero-safe division since both counts default to 0
    data3GroupbyIdx['UserupdateInfo_' + str(tw) + '_avg_count'] = data3GroupbyIdx[
        ['UserupdateInfo_' + str(tw) + '_freq', 'UserupdateInfo_' + str(tw) + '_unique']].\
        apply(lambda x: DividedByZero(x[0], x[1]), axis=1)

    # whether the applicant changed items like IDNUMBER, HASBUYCAR, MARRIAGESTATUSID, PHONE
    Idx_UserupdateInfo1['UserupdateInfo1'] = Idx_UserupdateInfo1['UserupdateInfo1'].map(lambda x: [x])
    Idx_UserupdateInfo1_V2 = Idx_UserupdateInfo1.groupby(['Idx'])['UserupdateInfo1'].sum()
    for item in ['_IDNUMBER', '_HASBUYCAR', '_MARRIAGESTATUSID', '_PHONE']:
        item_dict = Idx_UserupdateInfo1_V2.map(lambda x: int(item in x)).to_dict()
        data3GroupbyIdx['UserupdateInfo_' + str(tw) + str(item)] = data3GroupbyIdx['Idx'].map(lambda x: item_dict.get(x, 0))

# combine the above features with the raw features in PPD_Training_Master_GBK_3_1_Training_Set
allData = pd.concat([data2_train.set_index('Idx'), data3GroupbyIdx.set_index('Idx'),
                     data1GroupbyIdx.set_index('Idx')], axis=1)
allData.to_csv(folderOfData + 'allData_0.csv', encoding='gbk')

###############################################################
# Step 2: preprocess the categorical and numerical variables  #
###############################################################
allData = pd.read_csv(folderOfData + 'allData_0.csv', header=0, encoding='gbk')
allFeatures = list(allData.columns)
allFeatures.remove('target')
if 'Idx' in allFeatures:
    allFeatures.remove('Idx')
allFeatures.remove('ListingInfo')

# check for constant variables, and classify the rest as categorical or numerical;
# iterate over a copy because elements are removed from allFeatures inside the loop
numerical_var = []
for col in list(allFeatures):
    if len(set(allData[col])) == 1:
        print('delete {} from the dataset because it is a constant'.format(col))
        del allData[col]
        allFeatures.remove(col)
    else:
        uniq_valid_vals = [i for i in allData[col] if i == i]
        uniq_valid_vals = list(set(uniq_valid_vals))
        if len(uniq_valid_vals) >= 10 and isinstance(uniq_valid_vals[0], numbers.Real):
            numerical_var.append(col)

categorical_var = [i for i in allFeatures if i not in numerical_var]

# for each variable, check the share of its most common value and record that value
records_count = allData.shape[0]
col_most_values, col_large_value = {}, {}
for col in allFeatures:
    value_count = allData[col].groupby(allData[col]).count()
    col_most_values[col] = max(value_count) / records_count
    large_value = value_count[value_count == max(value_count)].index[0]
    col_large_value[col] = large_value

col_most_values_df = pd.DataFrame.from_dict(col_most_values, orient='index')
col_most_values_df.columns = ['max percent']
col_most_values_df = col_most_values_df.sort_values(by='max percent', ascending=False)
pcnt = list(col_most_values_df[:500]['max percent'])
vars = list(col_most_values_df[:500].index)
plt.bar(range(len(pcnt)), height=pcnt)
plt.title('Largest Percentage of Single Value in Each Variable')

# for fields where the majority value covers >= 90% of the samples, check whether the
# bad rate of the minority values is significantly higher than that of the majority value
large_percent_cols = list(col_most_values_df[col_most_values_df['max percent'] >= 0.9].index)
bad_rate_diff = {}
for col in large_percent_cols:
    large_value = col_large_value[col]
    temp = allData[[col, 'target']]
    temp[col] = temp.apply(lambda x: int(x[col] == large_value), axis=1)
    bad_rate = temp.groupby(col).mean()
    if bad_rate.iloc[0]['target'] == 0:
        bad_rate_diff[col] = 0
        continue
    bad_rate_diff[col] = np.log(bad_rate.iloc[0]['target'] / bad_rate.iloc[1]['target'])

bad_rate_diff_sorted = sorted(bad_rate_diff.items(), key=lambda x: x[1], reverse=True)
bad_rate_diff_sorted_values = [x[1] for x in bad_rate_diff_sorted]
plt.bar(x=range(len(bad_rate_diff_sorted_values)), height=bad_rate_diff_sorted_values)

# since none of the minority values show a significantly higher bad rate than the
# majority value, these variables can simply be dropped
for col in large_percent_cols:
    if col in numerical_var:
        numerical_var.remove(col)
    else:
        categorical_var.remove(col)
    del allData[col]

'''
For categorical variables: delete those with more than 80% missing, keep the rest.
'''
missing_pcnt_threshould_1 = 0.8
# iterate over a copy because elements are removed inside the loop
for col in list(categorical_var):
    missingRate = MissingCategorial(allData, col)
    print('{0} has missing rate as {1}'.format(col, missingRate))
    if missingRate > missing_pcnt_threshould_1:
        categorical_var.remove(col)
        del allData[col]

allData_bk = allData.copy()

'''
One-hot encode the categorical variables.
'''
dummy_map = {}
dummy_columns = []
for raw_col in categorical_var:
    dummies = pd.get_dummies(allData.loc[:, raw_col], prefix=raw_col)
    col_onehot = pd.concat([allData[raw_col], dummies], axis=1)
    col_onehot = col_onehot.drop_duplicates()
    allData = pd.concat([allData, dummies], axis=1)
    del allData[raw_col]
    dummy_map[raw_col] = col_onehot
    dummy_columns = dummy_columns + list(dummies)

with open(folderOfData + 'dummy_map.pkl', "wb") as f:
    f.write(pickle.dumps(dummy_map))
with open(folderOfData + 'dummy_columns.pkl', "wb") as f:
    f.write(pickle.dumps(dummy_columns))

'''
Check the numerical variables.
'''
missing_pcnt_threshould_2 = 0.8
deleted_var = []
for col in numerical_var:
    missingRate = MissingContinuous(allData, col)
    print('{0} has missing rate as {1}'.format(col, missingRate))
    if missingRate > missing_pcnt_threshould_2:
        deleted_var.append(col)
        print('we delete variable {} because of its high missing rate'.format(col))
    else:
        if missingRate > 0:
            # impute missing values by sampling randomly from the non-missing values
            not_missing = allData.loc[allData[col] == allData[col]][col]
            missing_position = allData.loc[allData[col] != allData[col]][col].index
            not_missing_sample = random.sample(list(not_missing), len(missing_position))
            allData.loc[missing_position, col] = not_missing_sample
            missingRate2 = MissingContinuous(allData, col)
            print('missing rate after making up is:{}'.format(str(missingRate2)))

if deleted_var != []:
    for col in deleted_var:
        numerical_var.remove(col)
        del allData[col]

'''
Truncate extreme values, then min-max scale.
'''
max_min_standardized = {}
# iterate over a copy because elements are removed inside the loop
for col in list(numerical_var):
    truncation = Outlier_Detection(allData, col)
    upper, lower = max(truncation), min(truncation)
    d = upper - lower
    if d == 0:
        print("{} is almost a constant".format(col))
        numerical_var.remove(col)
        continue
    # note: (upper - x)/d is a *reversed* min-max scaling; it still maps into [0, 1]
    allData[col] = truncation.map(lambda x: (upper - x) / d)
    max_min_standardized[col] = [lower, upper]

with open(folderOfData + 'max_min_standardized.pkl', "wb") as f:
    f.write(pickle.dumps(max_min_standardized))

allData.to_csv(folderOfData + 'allData_1_DNN.csv', header=True, encoding='gbk',
               columns=allData.columns, index=False)

allData = pd.read_csv(folderOfData + 'allData_1_DNN.csv', header=0, encoding='gbk')

#####################################################
# Step 3: build the neural network with TensorFlow  #
#####################################################
allFeatures = list(allData.columns)
allFeatures.remove('target')

with open(folderOfData + 'allFeatures.pkl', "wb") as f:
    f.write(pickle.dumps(allFeatures))

x_train = np.matrix(allData[allFeatures])
y_train = np.matrix(allData['target']).T

# further split the training set into a training part and a validation part;
# parameters are estimated on the new training part, and the best hyperparameters
# are chosen on the validation part
x_train, x_validation, y_train, y_validation = train_test_split(x_train, y_train, test_size=0.4, random_state=9)

# Example: select the best number of units in the first hidden layer.
# Note: tf.contrib.learn was removed in TensorFlow 2.x; this code requires TensorFlow 1.x.
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.estimators import SKCompat

no_hidden_units_selection = {}
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=x_train.shape[1])]
for no_hidden_units in range(50, 101, 10):
    print("the current choice of hidden units number is {}".format(no_hidden_units))
    clf0 = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                          hidden_units=[no_hidden_units, no_hidden_units - 10, no_hidden_units - 20],
                                          n_classes=2,
                                          dropout=0.5)
    clf = SKCompat(clf0)
    clf.fit(x_train, y_train, batch_size=256, steps=100000)
    # monitor the performance of the model using the AUC score
    clf_pred_proba = clf._estimator.predict_proba(x_validation)
    pred_proba = [i[1] for i in clf_pred_proba]
    auc_score = roc_auc_score(y_validation.getA(), pred_proba)
    no_hidden_units_selection[no_hidden_units] = auc_score

best_hidden_units = max(no_hidden_units_selection.items(), key=operator.itemgetter(1))[0]  # 80

# Example: check the effect of dropout
dropout_selection = {}
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=x_train.shape[1])]
for dropout_prob in np.linspace(0, 0.99, 20):
    print("the current choice of dropout rate is {}".format(dropout_prob))
    clf0 = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                          hidden_units=[best_hidden_units, best_hidden_units - 10, best_hidden_units - 20],
                                          n_classes=2,
                                          dropout=dropout_prob)
    clf = SKCompat(clf0)
    clf.fit(x_train, y_train, batch_size=256, steps=100000)
    # monitor the performance of the model using the AUC score
    clf_pred_proba = clf._estimator.predict_proba(x_validation)
    pred_proba = [i[1] for i in clf_pred_proba]
    auc_score = roc_auc_score(y_validation.getA(), pred_proba)
    dropout_selection[dropout_prob] = auc_score

best_dropout_prob = max(dropout_selection.items(), key=operator.itemgetter(1))[0]  # 0.0

# the best model is
clf1 = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                      hidden_units=[best_hidden_units, best_hidden_units - 10, best_hidden_units - 20],
                                      n_classes=2,
                                      dropout=best_dropout_prob)
clf1.fit(x_train, y_train, batch_size=256, steps=100000)
clf_pred_proba = clf1.predict_proba(x_train)
pred_proba = [i[1] for i in clf_pred_proba]
auc_score = roc_auc_score(y_train.getA(), pred_proba)  # 0.773
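The script pickles dummy_map, dummy_columns, max_min_standardized and allFeatures but never applies them; they are evidently meant for scoring the held-out split (test_ids). A hedged sketch of that last step, assuming a frame allData_test that has already been through the same Step 1 feature derivation and one-hot encoding (the names allData_test, x_test and test_auc are illustrative, not from the original post):

import pickle
import numpy as np
from sklearn.metrics import roc_auc_score

# reload the artifacts saved during training
with open(folderOfData + 'allFeatures.pkl', 'rb') as f:
    allFeatures = pickle.load(f)
with open(folderOfData + 'max_min_standardized.pkl', 'rb') as f:
    max_min_standardized = pickle.load(f)

# apply the stored truncation bounds and the same reversed min-max transform
# that was used on the training data
for col, (lower, upper) in max_min_standardized.items():
    d = upper - lower
    allData_test[col] = allData_test[col].map(
        lambda x: (upper - min(max(x, lower), upper)) / d)

x_test = np.matrix(allData_test[allFeatures])
clf_pred_proba_test = clf1.predict_proba(x_test)
pred_proba_test = [i[1] for i in clf_pred_proba_test]
test_auc = roc_auc_score(allData_test['target'], pred_proba_test)
print('test AUC: {}'.format(test_auc))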

Utility module

import numpy as np
import pandas as pd


def SplitData(df, col, numOfSplit, special_attribute=[]):
    '''
    :param df: dataset sorted by col
    :param col: the variable to be split
    :param numOfSplit: the number of groups to split into
    :param special_attribute: special values to exclude when splitting
    :return: split points that map the fine-grained values of col into
             coarse-grained groups, to ease the merging step of binning
    '''
    df2 = df.copy()
    if special_attribute != []:
        df2 = df.loc[~df[col].isin(special_attribute)]
    N = df2.shape[0]
    n = int(N / numOfSplit)
    splitPointIndex = [i * n for i in range(1, numOfSplit)]
    rawValues = sorted(list(df2[col]))
    splitPoint = [rawValues[i] for i in splitPointIndex]
    splitPoint = sorted(list(set(splitPoint)))
    return splitPoint


def MaximumBinPcnt(df, col):
    '''
    :return: the largest share of any single value of col in df
    '''
    N = df.shape[0]
    total = df.groupby([col])[col].count()
    pcnt = total * 1.0 / N
    return max(pcnt)


def Chi2(df, total_col, bad_col):
    '''
    :param df: data frame containing the total counts and bad counts per group
    :param total_col: column with the total number of samples
    :param bad_col: column with the number of bad samples
    :return: the chi-square statistic
    '''
    df2 = df.copy()
    # overall bad rate and good rate of df
    badRate = sum(df2[bad_col]) * 1.0 / sum(df2[total_col])
    # if the sample contains only good or only bad cases, the chi-square value is 0
    if badRate in [0, 1]:
        return 0
    df2['good'] = df2.apply(lambda x: x[total_col] - x[bad_col], axis=1)
    goodRate = sum(df2['good']) * 1.0 / sum(df2[total_col])
    # expected bad (good) count = total count * average bad (good) rate
    df2['badExpected'] = df2[total_col].apply(lambda x: x * badRate)
    df2['goodExpected'] = df2[total_col].apply(lambda x: x * goodRate)
    badCombined = zip(df2['badExpected'], df2[bad_col])
    goodCombined = zip(df2['goodExpected'], df2['good'])
    badChi = [(i[0] - i[1]) ** 2 / i[0] for i in badCombined]
    goodChi = [(i[0] - i[1]) ** 2 / i[0] for i in goodCombined]
    chi2 = sum(badChi) + sum(goodChi)
    return chi2


def BinBadRate(df, col, target, grantRateIndicator=0):
    '''
    :param df: dataset for which bad rates are computed
    :param col: the feature to group by
    :param target: the good/bad label
    :param grantRateIndicator: 1 to also return the overall bad rate, 0 otherwise
    :return: bad rate per bin, the regrouped frame, and (optionally) the overall bad rate
    '''
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    regroup['bad_rate'] = regroup.apply(lambda x: x.bad * 1.0 / x.total, axis=1)
    dicts = dict(zip(regroup[col], regroup['bad_rate']))
    if grantRateIndicator == 0:
        return (dicts, regroup)
    N = sum(regroup['total'])
    B = sum(regroup['bad'])
    overallRate = B * 1.0 / N
    return (dicts, regroup, overallRate)


def AssignGroup(x, bin):
    '''
    :return: the mapping of value x under the interval list bin. For example,
             with x=2 and bin=[0,3,5]: since 0 < x <= 3, x maps to 3.
    '''
    N = len(bin)
    if x <= min(bin):
        return min(bin)
    elif x > max(bin):
        return 10e10
    else:
        for i in range(N - 1):
            if bin[i] < x <= bin[i + 1]:
                return bin[i + 1]


def ChiMerge(df, col, target, max_interval=5, special_attribute=[], minBinPcnt=0):
    '''
    :param df: data frame containing the target variable and the attribute to be binned
    :param col: the attribute to be binned
    :param target: the target variable, taking values 0 or 1
    :param max_interval: maximum number of bins; if the attribute has no more distinct
                         values than this, the merging logic is skipped
    :param special_attribute: attribute values excluded from binning
    :param minBinPcnt: minimum share per bin, 0 by default
    :return: the binning result, as a list of cut-off points
    '''
    colLevels = sorted(list(set(df[col])))
    N_distinct = len(colLevels)
    if N_distinct <= max_interval:
        # if the attribute has no more distinct values than max_interval, skip the merging
        print("The number of original levels for {} is less than or equal to max intervals".format(col))
        return colLevels[:-1]
    else:
        if len(special_attribute) >= 1:
            df1 = df.loc[df[col].isin(special_attribute)]
            df2 = df.loc[~df[col].isin(special_attribute)]
        else:
            df2 = df.copy()
        N_distinct = len(list(set(df2[col])))

        # Step 1: group the dataset by col and compute the total and bad counts per group
        if N_distinct > 100:
            split_x = SplitData(df2, col, 100)
            df2['temp'] = df2[col].map(lambda x: AssignGroup(x, split_x))
        else:
            df2['temp'] = df2[col]
        # the overall bad rate is used to compute the expected bad count
        (binBadRate, regroup, overallRate) = BinBadRate(df2, 'temp', target, grantRateIndicator=1)

        # initially, each distinct value forms its own group;
        # sort the values, then merge adjacent groups pairwise
        colLevels = sorted(list(set(df2['temp'])))
        groupIntervals = [[i] for i in colLevels]

        # Step 2: loop and keep merging the best pair of adjacent groups until:
        # 1. the number of bins <= the preset maximum number of bins
        # 2. each bin's share is not below the preset minimum (optional)
        # 3. each bin contains both good and bad samples
        # With special attributes, the final number of bins equals
        # max_interval minus the number of special attributes.
        split_intervals = max_interval - len(special_attribute)
        while (len(groupIntervals) > split_intervals):  # stop when the bin count reaches the preset number
            # in each iteration, compute the chi-square of merging each pair of adjacent
            # groups; the merge with the smallest chi-square value is the best one
            chisqList = []
            for k in range(len(groupIntervals) - 1):
                temp_group = groupIntervals[k] + groupIntervals[k + 1]
                df2b = regroup.loc[regroup['temp'].isin(temp_group)]
                chisq = Chi2(df2b, 'total', 'bad')
                chisqList.append(chisq)
            best_combined = chisqList.index(min(chisqList))
            groupIntervals[best_combined] = groupIntervals[best_combined] + groupIntervals[best_combined + 1]
            # after merging the best pair, remove the absorbed group from the list;
            # e.g. after merging [3,4,5] and [6,7] into [3,4,5,6,7], remove [6,7]
            groupIntervals.remove(groupIntervals[best_combined + 1])
        groupIntervals = [sorted(i) for i in groupIntervals]
        cutOffPoints = [max(i) for i in groupIntervals[:-1]]

        # check whether any bin lacks good or bad samples; if so, merge it with an
        # adjacent bin until every bin contains both
        groupedvalues = df2['temp'].apply(lambda x: AssignBin(x, cutOffPoints))
        df2['temp_Bin'] = groupedvalues
        (binBadRate, regroup) = BinBadRate(df2, 'temp_Bin', target)
        [minBadRate, maxBadRate] = [min(binBadRate.values()), max(binBadRate.values())]
        while minBadRate == 0 or maxBadRate == 1:
            # find the bins that are entirely good / entirely bad
            indexForBad01 = regroup[regroup['bad_rate'].isin([0, 1])].temp_Bin.tolist()
            bin = indexForBad01[0]
            # if it is the last bin, merge it with the previous one,
            # i.e. drop the last cut-off point
            if bin == max(regroup.temp_Bin):
                cutOffPoints = cutOffPoints[:-1]
            # if it is the first bin, merge it with the next one,
            # i.e. drop the first cut-off point
            elif bin == min(regroup.temp_Bin):
                cutOffPoints = cutOffPoints[1:]
            # if it is a middle bin, merge it with the neighbour giving the smaller chi-square
            else:
                # merge with the previous bin and compute the chi-square
                currentIndex = list(regroup.temp_Bin).index(bin)
                prevIndex = list(regroup.temp_Bin)[currentIndex - 1]
                df3 = df2.loc[df2['temp_Bin'].isin([prevIndex, bin])]
                (binBadRate, df2b) = BinBadRate(df3, 'temp_Bin', target)
                chisq1 = Chi2(df2b, 'total', 'bad')
                # merge with the next bin and compute the chi-square
                laterIndex = list(regroup.temp_Bin)[currentIndex + 1]
                df3b = df2.loc[df2['temp_Bin'].isin([laterIndex, bin])]
                (binBadRate, df2b) = BinBadRate(df3b, 'temp_Bin', target)
                chisq2 = Chi2(df2b, 'total', 'bad')
                if chisq1 < chisq2:
                    cutOffPoints.remove(cutOffPoints[currentIndex - 1])
                else:
                    cutOffPoints.remove(cutOffPoints[currentIndex])
            # after merging, re-check whether every bin contains both good and bad samples
            groupedvalues = df2['temp'].apply(lambda x: AssignBin(x, cutOffPoints))
            df2['temp_Bin'] = groupedvalues
            (binBadRate, regroup) = BinBadRate(df2, 'temp_Bin', target)
            [minBadRate, maxBadRate] = [min(binBadRate.values()), max(binBadRate.values())]

        # check the minimum bin share if required
        if minBinPcnt > 0:
            groupedvalues = df2['temp'].apply(lambda x: AssignBin(x, cutOffPoints))
            df2['temp_Bin'] = groupedvalues
            valueCounts = groupedvalues.value_counts().to_frame()
            N = sum(valueCounts['temp'])
            valueCounts['pcnt'] = valueCounts['temp'].apply(lambda x: x * 1.0 / N)
            valueCounts = valueCounts.sort_index()
            minPcnt = min(valueCounts['pcnt'])
            while minPcnt < minBinPcnt and len(cutOffPoints) > 2:
                # find the bin with the smallest share
                indexForMinPcnt = valueCounts[valueCounts['pcnt'] == minPcnt].index.tolist()[0]
                # if it is the last bin, merge with the previous one (drop the last cut-off point)
                if indexForMinPcnt == max(valueCounts.index):
                    cutOffPoints = cutOffPoints[:-1]
                # if it is the first bin, merge with the next one (drop the first cut-off point)
                elif indexForMinPcnt == min(valueCounts.index):
                    cutOffPoints = cutOffPoints[1:]
                # if it is a middle bin, merge with the neighbour giving the smaller chi-square
                else:
                    # merge with the previous bin and compute the chi-square
                    currentIndex = list(valueCounts.index).index(indexForMinPcnt)
                    prevIndex = list(valueCounts.index)[currentIndex - 1]
                    df3 = df2.loc[df2['temp_Bin'].isin([prevIndex, indexForMinPcnt])]
                    (binBadRate, df2b) = BinBadRate(df3, 'temp_Bin', target)
                    chisq1 = Chi2(df2b, 'total', 'bad')
                    # merge with the next bin and compute the chi-square
                    laterIndex = list(valueCounts.index)[currentIndex + 1]
                    df3b = df2.loc[df2['temp_Bin'].isin([laterIndex, indexForMinPcnt])]
                    (binBadRate, df2b) = BinBadRate(df3b, 'temp_Bin', target)
                    chisq2 = Chi2(df2b, 'total', 'bad')
                    if chisq1 < chisq2:
                        cutOffPoints.remove(cutOffPoints[currentIndex - 1])
                    else:
                        cutOffPoints.remove(cutOffPoints[currentIndex])
                groupedvalues = df2['temp'].apply(lambda x: AssignBin(x, cutOffPoints))
                df2['temp_Bin'] = groupedvalues
                valueCounts = groupedvalues.value_counts().to_frame()
                valueCounts['pcnt'] = valueCounts['temp'].apply(lambda x: x * 1.0 / N)
                valueCounts = valueCounts.sort_index()
                minPcnt = min(valueCounts['pcnt'])
        cutOffPoints = special_attribute + cutOffPoints
        return cutOffPoints


def BadRateEncoding(df, col, target):
    '''
    :return: encode col in df by its bad rate; target is the bad-sample label
    '''
    regroup = BinBadRate(df, col, target, grantRateIndicator=0)[1]
    br_dict = regroup[[col, 'bad_rate']].set_index([col]).to_dict(orient='index')
    for k, v in br_dict.items():
        br_dict[k] = v['bad_rate']
    badRateEncoding = df[col].map(lambda x: br_dict[x])
    return {'encoding': badRateEncoding, 'bad_rate': br_dict}


def AssignBin(x, cutOffPoints, special_attribute=[]):
    '''
    :param x: a value of some variable
    :param cutOffPoints: the binning result of that variable, given as cut-off points
    :param special_attribute: special values excluded from binning
    :return: the bin that x falls into, counting from 0.
             For example, with cutOffPoints = [10, 20, 30]: x = 7 returns Bin 0,
             x = 23 returns Bin 2, x = 35 returns Bin 3.
             Special values get a negative bin index.
    '''
    cutOffPoints2 = [i for i in cutOffPoints if i not in special_attribute]
    numBin = len(cutOffPoints2)
    if x in special_attribute:
        i = special_attribute.index(x) + 1
        return 'Bin {}'.format(0 - i)
    if x <= cutOffPoints2[0]:
        return 'Bin 0'
    elif x > cutOffPoints2[-1]:
        return 'Bin {}'.format(numBin)
    else:
        for i in range(0, numBin):
            if cutOffPoints2[i] < x <= cutOffPoints2[i + 1]:
                return 'Bin {}'.format(i + 1)


def CalcWOE(df, col, target):
    '''
    :param df: data frame containing the variable and the target
    :param col: the variable for which WOE and IV are computed; it must already be
                binned, or be a categorical variable that needs no binning
    :param target: the target variable; 0 = good, 1 = bad
    :return: the WOE and IV
    '''
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    N = sum(regroup['total'])
    B = sum(regroup['bad'])
    regroup['good'] = regroup['total'] - regroup['bad']
    G = N - B
    regroup['bad_pcnt'] = regroup['bad'].map(lambda x: x * 1.0 / B)
    regroup['good_pcnt'] = regroup['good'].map(lambda x: x * 1.0 / G)
    regroup['WOE'] = regroup.apply(lambda x: np.log(x.good_pcnt * 1.0 / x.bad_pcnt), axis=1)
    WOE_dict = regroup[[col, 'WOE']].set_index(col).to_dict(orient='index')
    for k, v in WOE_dict.items():
        WOE_dict[k] = v['WOE']
    IV = regroup.apply(lambda x: (x.good_pcnt - x.bad_pcnt) * np.log(x.good_pcnt * 1.0 / x.bad_pcnt), axis=1)
    IV = sum(IV)
    return {"WOE": WOE_dict, 'IV': IV}


def FeatureMonotone(x):
    '''
    :return: the number of elements in sequence x that break monotonicity, and their
             positions. For example, in x = [1, 3, 2, 5], element 3 is larger than
             both neighbours and element 2 is smaller than both neighbours, so the
             count is 2 and the positions are 1 and 2.
    '''
    monotone = [x[i] < x[i + 1] and x[i] < x[i - 1] or x[i] > x[i + 1] and x[i] > x[i - 1]
                for i in range(1, len(x) - 1)]
    index_of_nonmonotone = [i + 1 for i in range(len(monotone)) if monotone[i]]
    return {'count_of_nonmonotone': monotone.count(True), 'index_of_nonmonotone': index_of_nonmonotone}


## check whether the bad rate of a variable is monotonic
def BadRateMonotone(df, sortByVar, target, special_attribute=[]):
    '''
    :param df: data frame containing the variable to check and the target
    :param sortByVar: the variable whose bad rate is checked
    :param target: the target variable; 0 = good, 1 = bad
    :param special_attribute: special values excluded from the check
    :return: whether the bad rate is monotonic
    '''
    df2 = df.loc[~df[sortByVar].isin(special_attribute)]
    if len(set(df2[sortByVar])) <= 2:
        return True
    regroup = BinBadRate(df2, sortByVar, target)[1]
    combined = zip(regroup['total'], regroup['bad'])
    badRate = [x[1] * 1.0 / x[0] for x in combined]
    badRateNotMonotone = FeatureMonotone(badRate)['count_of_nonmonotone']
    if badRateNotMonotone > 0:
        return False
    else:
        return True


def MergeBad0(df, col, target, direction='bad'):
    '''
    :param df: data frame in which some groups may have a 0% or 100% bad rate
    :param col: the binned variable or a categorical variable; if any of its groups
                contains no bad samples (or no good samples), they need to be merged
    :param target: the target variable; 0 = good, 1 = bad
    :return: a merge scheme under which every group contains both good and bad samples
    '''
    regroup = BinBadRate(df, col, target)[1]
    if direction == 'bad':
        # when merging groups with a 0% bad rate, merge towards the smallest non-zero bad rate
        regroup = regroup.sort_values(by='bad_rate')
    else:
        # when merging groups with a 0% good rate, merge towards the smallest non-zero good rate
        regroup = regroup.sort_values(by='bad_rate', ascending=False)
    regroup.index = range(regroup.shape[0])
    col_regroup = [[i] for i in regroup[col]]
    del_index = []
    for i in range(regroup.shape[0] - 1):
        col_regroup[i + 1] = col_regroup[i] + col_regroup[i + 1]
        del_index.append(i)
        if direction == 'bad':
            if regroup['bad_rate'][i + 1] > 0:
                break
        else:
            if regroup['bad_rate'][i + 1] < 1:
                break
    col_regroup2 = [col_regroup[i] for i in range(len(col_regroup)) if i not in del_index]
    newGroup = {}
    for i in range(len(col_regroup2)):
        for g2 in col_regroup2[i]:
            newGroup[g2] = 'Bin ' + str(i)
    return newGroup


def Monotone_Merge(df, target, col):
    '''
    :return: merge the bins of variable col in df that violate bad-rate monotonicity,
             so that the merged variable has a monotonic bad rate, and output the
             merge scheme. For example, if col = [Bin 0, Bin 1, Bin 2, Bin 3, Bin 4]
             violates monotonicity, the merged col might be
             [Bin 0&Bin 1, Bin 2, Bin 3, Bin 4].
             Only adjacent bins may be merged. The optimal merge scheme is searched
             iteratively: in each iteration, every non-monotonic bin is tentatively
             merged with its previous and next bins, and the results are compared.
    '''
    def MergeMatrix(m, i, j, k):
        '''
        :param m: the matrix whose rows are merged
        :param i,j: merge rows i and j
        :param k: delete row k
        :return: the merged matrix
        '''
        m[i, :] = m[i, :] + m[j, :]
        m = np.delete(m, k, axis=0)
        return m

    def Merge_adjacent_Rows(i, bad_by_bin_current, bins_list_current, not_monotone_count_current):
        '''
        :param i: row i is merged with the previous and with the next row, and the two
                  schemes are compared. The criterion: the merge should reduce
                  non-monotonicity and keep the bins balanced.
        :param bad_by_bin_current: binning matrix before merging (sample count, bad
                                   count and bad rate per bin)
        :param bins_list_current: binning scheme before merging
        :param not_monotone_count_current: number of non-monotonic elements before merging
        :return: binning matrix, binning scheme, number of non-monotonic elements and
                 the balance measure after merging
        '''
        i_prev = i - 1
        i_next = i + 1
        bins_list = bins_list_current.copy()
        bad_by_bin = bad_by_bin_current.copy()
        not_monotone_count = not_monotone_count_current
        # scheme a: merge bin i with the previous bin
        bad_by_bin2a = MergeMatrix(bad_by_bin.copy(), i_prev, i, i)
        bad_by_bin2a[i_prev, -1] = bad_by_bin2a[i_prev, -2] / bad_by_bin2a[i_prev, -3]
        not_monotone_count2a = FeatureMonotone(bad_by_bin2a[:, -1])['count_of_nonmonotone']
        # scheme b: merge bin i with the next bin
        bad_by_bin2b = MergeMatrix(bad_by_bin.copy(), i, i_next, i_next)
        bad_by_bin2b[i, -1] = bad_by_bin2b[i, -2] / bad_by_bin2b[i, -3]
        not_monotone_count2b = FeatureMonotone(bad_by_bin2b[:, -1])['count_of_nonmonotone']
        balance = ((bad_by_bin[:, 1] / N).T * (bad_by_bin[:, 1] / N))[0, 0]
        balance_a = ((bad_by_bin2a[:, 1] / N).T * (bad_by_bin2a[:, 1] / N))[0, 0]
        balance_b = ((bad_by_bin2b[:, 1] / N).T * (bad_by_bin2b[:, 1] / N))[0, 0]
        # apply scheme a in two cases: (1) scheme a reduces non-monotonicity and scheme b
        # does not; (2) both reduce non-monotonicity, but scheme a gives better balance
        if not_monotone_count2a < not_monotone_count_current and not_monotone_count2b >= not_monotone_count_current or \
                not_monotone_count2a < not_monotone_count_current and not_monotone_count2b < not_monotone_count_current and balance_a < balance_b:
            bins_list[i_prev] = bins_list[i_prev] + bins_list[i]
            bins_list.remove(bins_list[i])
            bad_by_bin = bad_by_bin2a
            not_monotone_count = not_monotone_count2a
            balance = balance_a
        # likewise, apply scheme b in two cases: (1) scheme b reduces non-monotonicity and
        # scheme a does not; (2) both reduce non-monotonicity, but scheme b gives better balance
        elif not_monotone_count2a >= not_monotone_count_current and not_monotone_count2b < not_monotone_count_current or \
                not_monotone_count2a < not_monotone_count_current and not_monotone_count2b < not_monotone_count_current and balance_a > balance_b:
            bins_list[i] = bins_list[i] + bins_list[i_next]
            bins_list.remove(bins_list[i_next])
            bad_by_bin = bad_by_bin2b
            not_monotone_count = not_monotone_count2b
            balance = balance_b
        # if neither scheme reduces non-monotonicity, apply the better-balanced merge
        else:
            if balance_a < balance_b:
                bins_list[i_prev] = bins_list[i_prev] + bins_list[i]
                bins_list.remove(bins_list[i])
                bad_by_bin = bad_by_bin2a
                not_monotone_count = not_monotone_count2a
                balance = balance_a
            else:
                bins_list[i] = bins_list[i] + bins_list[i_next]
                bins_list.remove(bins_list[i_next])
                bad_by_bin = bad_by_bin2b
                not_monotone_count = not_monotone_count2b
                balance = balance_b
        return {'bins_list': bins_list, 'bad_by_bin': bad_by_bin,
                'not_monotone_count': not_monotone_count, 'balance': balance}

    N = df.shape[0]
    [badrate_bin, bad_by_bin] = BinBadRate(df, col, target)
    bins = list(bad_by_bin[col])
    bins_list = [[i] for i in bins]
    badRate = sorted(badrate_bin.items(), key=lambda x: x[0])
    badRate = [i[1] for i in badRate]
    not_monotone_count, not_monotone_position = FeatureMonotone(badRate)['count_of_nonmonotone'], \
                                                FeatureMonotone(badRate)['index_of_nonmonotone']
    # iteratively search for the best merge scheme; stop when the bad rate is already
    # monotonic, or when only 2 bins are left
    while (not_monotone_count > 0 and len(bins_list) > 2):
        # when more than one bin is non-monotonic, try the best merge for each of them
        all_possible_merging = []
        for i in not_monotone_position:
            merge_adjacent_rows = Merge_adjacent_Rows(i, np.mat(bad_by_bin), bins_list, not_monotone_count)
            all_possible_merging.append(merge_adjacent_rows)
        balance_list = [i['balance'] for i in all_possible_merging]
        not_monotone_count_new = [i['not_monotone_count'] for i in all_possible_merging]
        # if no merge reduces the current non-monotonicity, pick the better-balanced merge
        if min(not_monotone_count_new) >= not_monotone_count:
            best_merging_position = balance_list.index(min(balance_list))
        # if several merges reduce the current non-monotonicity, also pick the better-balanced one
        else:
            better_merging_index = [i for i in range(len(not_monotone_count_new))
                                    if not_monotone_count_new[i] < not_monotone_count]
            better_balance = [balance_list[i] for i in better_merging_index]
            best_balance_index = better_balance.index(min(better_balance))
            best_merging_position = better_merging_index[best_balance_index]
        bins_list = all_possible_merging[best_merging_position]['bins_list']
        bad_by_bin = all_possible_merging[best_merging_position]['bad_by_bin']
        not_monotone_count = all_possible_merging[best_merging_position]['not_monotone_count']
        not_monotone_position = FeatureMonotone(bad_by_bin[:, 3])['index_of_nonmonotone']
    return bins_list


def Prob2Score(prob, basePoint, PDO):
    # convert a probability into a positive integer score
    y = np.log(prob / (1 - prob))
    return (basePoint + PDO / np.log(2) * (-y)).map(lambda x: int(x))


### compute the KS statistic
def KS(df, score, target):
    '''
    :param df: dataset containing the target variable and the prediction
    :param score: the score or probability
    :param target: the target variable
    :return: the KS statistic
    '''
    total = df.groupby([score])[target].count()
    bad = df.groupby([score])[target].sum()
    all_df = pd.DataFrame({'total': total, 'bad': bad})
    all_df['good'] = all_df['total'] - all_df['bad']
    all_df[score] = all_df.index
    all_df = all_df.sort_values(by=score, ascending=False)
    all_df.index = range(len(all_df))
    all_df['badCumRate'] = all_df['bad'].cumsum() / all_df['bad'].sum()
    all_df['goodCumRate'] = all_df['good'].cumsum() / all_df['good'].sum()
    KS_series = all_df.apply(lambda x: x.badCumRate - x.goodCumRate, axis=1)
    return max(KS_series)
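The DNN pipeline above never calls these utilities (they belong to the classic WOE/IV scorecard workflow), so here is a small self-contained usage sketch on synthetic data; the toy frame and its column names are illustrative assumptions, not part of the original post:

import numpy as np
import pandas as pd

np.random.seed(0)
toy = pd.DataFrame({'score_raw': np.random.randn(5000)})
# make the bad rate increase with score_raw so ChiMerge has signal to find
toy['target'] = (np.random.rand(5000) < 1 / (1 + np.exp(-toy['score_raw']))).astype(int)

cuts = ChiMerge(toy, 'score_raw', 'target', max_interval=5)    # chi-square binning
toy['score_bin'] = toy['score_raw'].map(lambda x: AssignBin(x, cuts))
woe_iv = CalcWOE(toy, 'score_bin', 'target')                   # WOE per bin and total IV
print(cuts)
print(woe_iv['IV'])
print(BadRateMonotone(toy, 'score_bin', 'target'))             # check bad-rate monotonicity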
