
风控-数据分析 (Risk Control: Data Analysis)

Published: 2024/1/18

Data overview:

  • Load the dataset and check its size and original feature dimensions;
  • Use info() to get familiar with the data types;
  • Take a quick look at the basic statistics of each feature.

Missing values and unique values:

  • Check the missing values;
  • Check features with only a single unique value.

Digging into the data: data types

  • Categorical features
  • Numerical features
    • Discrete numerical features
    • Continuous numerical features

Correlations in the data:

  • Relationships between features
  • Relationships between features and the target variable

Generate a data report with pandas_profiling.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
warnings.filterwarnings('ignore')
```

1. Load the files

```python
# Load the files
data_train = pd.read_csv("./train.csv")
data_test_a = pd.read_csv('./testA.csv')
```

2. Overall information

```python
# Check the sample counts and original feature dimensions
data_test_a.shape
# (200000, 48)
data_train.shape
# (800000, 47)
data_train.columns
# Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
#        'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
#        'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
#        'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
#        'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
#        'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
#        'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
#        'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
#        'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
#       dtype='object')

data_train.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
id                    800000 non-null int64
loanAmnt              800000 non-null float64
term                  800000 non-null int64
interestRate          800000 non-null float64
installment           800000 non-null float64
grade                 800000 non-null object
subGrade              800000 non-null object
employmentTitle       799999 non-null float64
employmentLength      753201 non-null object
homeOwnership         800000 non-null int64
annualIncome          800000 non-null float64
verificationStatus    800000 non-null int64
issueDate             800000 non-null object
isDefault             800000 non-null int64
purpose               800000 non-null int64
postCode              799999 non-null float64
regionCode            800000 non-null int64
dti                   799761 non-null float64
delinquency_2years    800000 non-null float64
ficoRangeLow          800000 non-null float64
ficoRangeHigh         800000 non-null float64
openAcc               800000 non-null float64
pubRec                800000 non-null float64
pubRecBankruptcies    799595 non-null float64
revolBal              800000 non-null float64
revolUtil             799469 non-null float64
totalAcc              800000 non-null float64
initialListStatus     800000 non-null int64
applicationType       800000 non-null int64
earliesCreditLine     800000 non-null object
title                 799999 non-null float64
policyCode            800000 non-null float64
n0                    759730 non-null float64
n1                    759730 non-null float64
n2                    759730 non-null float64
n2.1                  759730 non-null float64
n4                    766761 non-null float64
n5                    759730 non-null float64
n6                    759730 non-null float64
n7                    759730 non-null float64
n8                    759729 non-null float64
n9                    759730 non-null float64
n10                   766761 non-null float64
n11                   730248 non-null float64
n12                   759730 non-null float64
n13                   759730 non-null float64
n14                   759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
```

```python
data_train.describe()
```

```
                  id       loanAmnt           term   interestRate  ...            n13            n14
count  800000.000000  800000.000000  800000.000000  800000.000000  ...  759730.000000  759730.000000
mean   399999.500000   14416.818875       3.482745      13.238391  ...       0.089366       2.178606
std    230940.252013    8716.086178       0.855832       4.765757  ...       0.509069       1.844377
min         0.000000     500.000000       3.000000       5.310000  ...       0.000000       0.000000
25%    199999.750000    8000.000000       3.000000       9.750000  ...       0.000000       1.000000
50%    399999.500000   12000.000000       3.000000      12.740000  ...       0.000000       2.000000
75%    599999.250000   20000.000000       3.000000      15.990000  ...       0.000000       3.000000
max    799999.000000   40000.000000       5.000000      30.990000  ...      39.000000      30.000000

[8 rows x 42 columns]
```

```python
data_train.head(3).append(data_train.tail(3))
```

```
            id  loanAmnt  term  interestRate  installment grade subGrade  employmentTitle employmentLength  homeOwnership  ...   n5    n6    n7    n8   n9   n10  n11  n12  n13  n14
0            0   35000.0     5         19.52       917.97     E       E2            320.0          2 years              2  ...  9.0   8.0   4.0  12.0  2.0   7.0  0.0  0.0  0.0  2.0
1            1   18000.0     5         18.49       461.90     D       D2         219843.0          5 years              0  ...  NaN   NaN   NaN   NaN  NaN  13.0  NaN  NaN  NaN  NaN
2            2   12000.0     5         16.99       298.17     D       D3          31698.0          8 years              0  ...  0.0  21.0   4.0   5.0  3.0  11.0  0.0  0.0  0.0  4.0
799997  799997    6000.0     3         13.33       203.12     C       C3           2582.0        10+ years              1  ...  4.0  26.0   4.0  10.0  4.0   5.0  0.0  0.0  1.0  4.0
799998  799998   19200.0     3          6.92       592.14     A       A4            151.0        10+ years              0  ... 10.0   6.0  12.0  22.0  8.0  16.0  0.0  0.0  0.0  5.0
799999  799999    9000.0     3         11.06       294.91     B       B3             13.0          5 years              0  ...  3.0   4.0   4.0   8.0  3.0   7.0  0.0  0.0  0.0  2.0

6 rows × 47 columns
```

```python
# Check missing values, single-valued features, etc.
print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')
# There are 22 columns in train dataset with missing values.

# The train set has 22 features with missing values; now check whether any
# feature has a missing rate above 50%
have_null_feature_dict = (data_train.isnull().sum() / len(data_train)).to_dict()
fea_null_moreThanHalf = {}
for key, value in have_null_feature_dict.items():
    if value > 0.5:
        fea_null_moreThanHalf[key] = value
fea_null_moreThanHalf
# {}
have_null_feature_dict
# {'id': 0.0, 'loanAmnt': 0.0, 'term': 0.0, 'interestRate': 0.0, 'installment': 0.0,
#  'grade': 0.0, 'subGrade': 0.0, 'employmentTitle': 1.25e-06, 'employmentLength': 0.05849875,
#  'homeOwnership': 0.0, 'annualIncome': 0.0, 'verificationStatus': 0.0, 'issueDate': 0.0,
#  'isDefault': 0.0, 'purpose': 0.0, 'postCode': 1.25e-06, 'regionCode': 0.0,
#  'dti': 0.00029875, 'delinquency_2years': 0.0, 'ficoRangeLow': 0.0, 'ficoRangeHigh': 0.0,
#  'openAcc': 0.0, 'pubRec': 0.0, 'pubRecBankruptcies': 0.00050625, 'revolBal': 0.0,
#  'revolUtil': 0.00066375, 'totalAcc': 0.0, 'initialListStatus': 0.0, 'applicationType': 0.0,
#  'earliesCreditLine': 0.0, 'title': 1.25e-06, 'policyCode': 0.0,
#  'n0': 0.0503375, 'n1': 0.0503375, 'n2': 0.0503375, 'n2.1': 0.0503375, 'n4': 0.04154875,
#  'n5': 0.0503375, 'n6': 0.0503375, 'n7': 0.0503375, 'n8': 0.05033875, 'n9': 0.0503375,
#  'n10': 0.04154875, 'n11': 0.08719, 'n12': 0.0503375, 'n13': 0.0503375, 'n14': 0.0503375}

# Inspect the missing features and their missing rates in detail
# NaN visualization
missing = data_train.isnull().sum() / len(data_train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
# <AxesSubplot:>
```

  • Column-wise, check which columns contain NaN and print the NaN counts. The main goal is to see whether a column has a very large number of NaNs: if so, that column carries almost no signal for the label and can be considered for removal. If the missing rate is small, filling the values is usually a reasonable option.
  • You can also compare row-wise: if most columns of some samples are missing and there are enough samples overall, deleting those rows can be considered.
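The column-wise and row-wise checks above can be sketched on a toy DataFrame; the 0.5 thresholds are illustrative choices, not rules taken from this dataset:

```python
import pandas as pd
import numpy as np

# Toy DataFrame standing in for data_train (hypothetical values)
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})

# Column-wise: missing rate per column; consider dropping columns above a threshold
col_missing = df.isnull().mean()
cols_to_drop = col_missing[col_missing > 0.5].index.tolist()

# Row-wise: fraction of missing cells per sample; flag heavily missing rows
row_missing = df.isnull().mean(axis=1)
rows_to_drop = df.index[row_missing > 0.5].tolist()
```

Here `cols_to_drop` would contain only `'b'` (3 of 4 values missing), and `rows_to_drop` only the second row (2 of its 3 cells missing).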

Tips: LightGBM, the competition workhorse, can handle missing values automatically. Task 4 looks at models in detail!

```python
# Features with only a single value in the train / test sets
one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1]
one_value_fea_test = [col for col in data_test_a.columns if data_test_a[col].nunique() <= 1]
one_value_fea
# ['policyCode']
one_value_fea_test
# ['policyCode']
```

Summary: 22 of the 47 columns have missing data, which is quite normal in the real world. 'policyCode' has a single unique value (or is entirely missing). There are many continuous variables and a few categorical ones.

3. Numerical features and object-type features

  • Features generally consist of categorical features and numerical features, and numerical features split further into continuous and discrete ones.
  • Categorical features sometimes carry no numeric relationship and sometimes do. For the levels A, B, C, ... in 'grade', whether they are plain categories or A ranks above the others must be judged with business knowledge.
  • Numerical features could go straight into a model, but risk practitioners usually bin them and convert them to WOE encodings to build standard scorecards. In terms of model performance, binning mainly reduces variable complexity and the impact of variable noise, and raises the correlation between the independent and dependent variables, making the model more stable.
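A minimal sketch of the binning-to-WOE idea described above, on a toy feature and label; the number of bins (5), the equal-frequency binning, and the +0.5 smoothing are arbitrary illustrative choices, not prescriptions from this dataset:

```python
import pandas as pd
import numpy as np

# Toy data: a numeric feature and a binary default label (hypothetical)
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=1000))
y = pd.Series((rng.random(1000) < 0.2).astype(int))

# 1. Bin the numeric feature (equal-frequency bins)
bins = pd.qcut(x, q=5, duplicates='drop')

# 2. Bad/good counts per bin, with +0.5 smoothing to avoid log(0)
grouped = y.groupby(bins).agg(bad='sum', total='count')
grouped['good'] = grouped['total'] - grouped['bad']
bad_rate = (grouped['bad'] + 0.5) / (grouped['bad'].sum() + 0.5)
good_rate = (grouped['good'] + 0.5) / (grouped['good'].sum() + 0.5)

# 3. WOE per bin = ln(bad distribution / good distribution)
grouped['woe'] = np.log(bad_rate / good_rate)
```

The binned feature would then be replaced by its per-bin WOE value before fitting a scorecard model.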
```python
num_fea = list(data_train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in num_fea, list(data_train.columns)))
len(num_fea)
# 42
category_fea
# ['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
data_train.grade
# 0         E
# 1         D
# 2         D
# 3         A
# 4         C
#          ..
# 799997    C
# 799998    A
# 799999    B
# Name: grade, Length: 800000, dtype: object
```

3.1 Numerical variable analysis: numerical variables include both continuous and discrete ones, so find them

  • Split the numerical variables into continuous and discrete ones
```python
# Separate the discrete (category-like) features from the numerical ones
def get_num_serialFea(data, feas):
    num_seralFea = []
    num_noseralFea = []
    for fea in feas:
        temp = data[fea].nunique()
        if temp <= 10:
            num_noseralFea.append(fea)
            continue
        num_seralFea.append(fea)
    return num_seralFea, num_noseralFea

num_seralFea, num_noseralFea = get_num_serialFea(data_train, num_fea)
num_seralFea
# ['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle', 'annualIncome',
#  'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow',
#  'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil',
#  'totalAcc', 'title', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9',
#  'n10', 'n13', 'n14']
num_noseralFea
# ['term', 'homeOwnership', 'verificationStatus', 'isDefault', 'initialListStatus',
#  'applicationType', 'policyCode', 'n11', 'n12']
```

Discrete numerical variable analysis

```python
data_train['term'].value_counts()  # discrete variable
# 3    606902
# 5    193098
# Name: term, dtype: int64

data_train['homeOwnership'].value_counts()  # discrete variable
# 0    395732
# 1    317660
# 2     86309
# 3       185
# 5        81
# 4        33
# Name: homeOwnership, dtype: int64

data_train['verificationStatus'].value_counts()  # discrete variable
# 1    309810
# 2    248968
# 0    241222
# Name: verificationStatus, dtype: int64

data_train['initialListStatus'].value_counts()  # discrete variable
# 0    466438
# 1    333562
# Name: initialListStatus, dtype: int64

data_train['applicationType'].value_counts()  # discrete variable
# 0    784586
# 1     15414
# Name: applicationType, dtype: int64

data_train['policyCode'].value_counts()  # discrete variable; useless, a single value throughout
# 1.0    800000
# Name: policyCode, dtype: int64

data_train['n11'].value_counts()  # discrete variable; extremely imbalanced, analyze further before using
# 0.0    729682
# 1.0       540
# 2.0        24
# 4.0         1
# 3.0         1
# Name: n11, dtype: int64

data_train['n12'].value_counts()  # discrete variable; extremely imbalanced, analyze further before using
# 0.0    757315
# 1.0      2281
# 2.0       115
# 3.0        16
# 4.0         3
# Name: n12, dtype: int64
```

Continuous numerical variable analysis

```python
# Visualize the distribution of every numerical feature
f = pd.melt(data_train, value_vars=num_seralFea)
g = sns.FacetGrid(f, col='variable', col_wrap=4, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
```

  • Look at the distribution of a numerical variable to check whether it is normally distributed. If it is not, you can apply a log transform and then check again whether the transformed variable looks normal.
  • If you want to standardize a batch of variables in one pass, the ones that have already been normalized must be excluded first.
  • Why normalize: in some cases normality lets the model converge faster, and some models require normally distributed data (e.g. GMM, KNN). Mostly, keep the data from being too skewed, since heavy skew can hurt model predictions.
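A quick way to flag heavily skewed columns before deciding on a log transform, sketched on toy data; the skew threshold of 1 is an arbitrary choice, and log1p only makes sense for non-negative columns:

```python
import pandas as pd
import numpy as np

# Toy data: one roughly symmetric column, one heavily right-skewed column (hypothetical)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'symmetric': rng.normal(size=2000),
    'skewed': rng.lognormal(mean=0.0, sigma=1.0, size=2000),
})

# Flag columns whose absolute skewness exceeds the chosen threshold
skew = df.skew()
to_log = skew[skew.abs() > 1].index.tolist()

# log1p handles zeros safely; apply it only to the flagged (non-negative) columns
df_logged = df.copy()
for col in to_log:
    df_logged[col] = np.log1p(df_logged[col])
```

After the transform the flagged column should be much closer to symmetric, which is easy to verify by recomputing its skewness.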
```python
# Plotting transaction amount values distribution
plt.figure(figsize=(16, 12))
plt.suptitle('Transaction Values Distribution', fontsize=22)

plt.subplot(221)
sub_plot_1 = sns.distplot(data_train['loanAmnt'])
sub_plot_1.set_title("loanAmnt Distribuition", fontsize=18)
sub_plot_1.set_xlabel("")
sub_plot_1.set_ylabel("Probability", fontsize=15)

plt.subplot(222)
sub_plot_2 = sns.distplot(np.log(data_train['loanAmnt']))
sub_plot_2.set_title("loanAmnt (Log) Distribuition", fontsize=18)
sub_plot_2.set_xlabel("")
sub_plot_2.set_ylabel("Probability", fontsize=15)
# Text(0, 0.5, 'Probability')
```

  • Non-numerical categorical variable analysis
```python
data_train['grade'].value_counts()
# B    233690
# C    227118
# A    139661
# D    119453
# E     55661
# F     19053
# G      5364
# Name: grade, dtype: int64

data_train['subGrade'].value_counts()
# C1    50763
# B4    49516
# B5    48965
# B3    48600
# C2    47068
# C3    44751
# C4    44272
# B2    44227
# B1    42382
# C5    40264
# A5    38045
# A4    30928
# D1    30538
# D2    26528
# A1    25909
# D3    23410
# A3    22655
# A2    22124
# D4    21139
# D5    17838
# E1    14064
# E2    12746
# E3    10925
# E4     9273
# E5     8653
# F1     5925
# F2     4340
# F3     3577
# F4     2859
# F5     2352
# G1     1759
# G2     1231
# G3      978
# G4      751
# G5      645
# Name: subGrade, dtype: int64

data_train['employmentLength'].value_counts()
# 10+ years    262753
# 2 years       72358
# < 1 year      64237
# 3 years       64152
# 1 year        52489
# 5 years       50102
# 4 years       47985
# 6 years       37254
# 8 years       36192
# 7 years       35407
# 9 years       30272
# Name: employmentLength, dtype: int64

data_train['issueDate'].value_counts()
# 2016-03-01    29066
# 2015-10-01    25525
# 2015-07-01    24496
# 2015-12-01    23245
# 2014-10-01    21461
# 2016-02-01    20571
# 2015-11-01    19453
# 2015-01-01    19254
# 2015-04-01    18929
# 2015-08-01    18750
#               ...
# 2007-08-01       23
# 2007-07-01       21
# 2008-09-01       19
# 2007-09-01        7
# 2007-06-01        1
# Name: issueDate, Length: 139, dtype: int64

data_train['earliesCreditLine'].value_counts()
# Aug-2001    5567
# Sep-2003    5403
# Aug-2002    5403
# Oct-2001    5258
# Aug-2000    5246
# Sep-2004    5219
# Sep-2002    5170
# Aug-2003    5116
# Oct-2000    5034
# Oct-2002    5034
#             ...
# Jul-1955       1
# Nov-1953       1
# Sep-1953       1
# Oct-1957       1
# Name: earliesCreditLine, Length: 720, dtype: int64

data_train['isDefault'].value_counts()
# 0    640390
# 1    159610
# Name: isDefault, dtype: int64
```
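Ordered string columns like 'grade' and 'employmentLength' are usually mapped to numbers before modeling. A sketch of one possible mapping on toy samples; the conventions here ('10+ years' → 10, '< 1 year' → 0, A..G → 1..7) are illustrative choices, not prescribed by the dataset:

```python
import pandas as pd

# Toy samples of the string-valued columns (values follow the formats shown above)
s_grade = pd.Series(['E', 'D', 'A', 'C'])
s_emp = pd.Series(['10+ years', '< 1 year', '5 years', '2 years'])

# 'grade': A..G maps naturally to 1..7
grade_map = {g: i + 1 for i, g in enumerate('ABCDEFG')}
grade_num = s_grade.map(grade_map)

# 'employmentLength': keep the leading number, treating '< 1 year' as 0
def emp_to_num(s):
    if s == '< 1 year':
        return 0
    return int(s.split()[0].rstrip('+'))

emp_num = s_emp.map(emp_to_num)
```

Whether a purely ordinal mapping is appropriate (versus one-hot or WOE encoding) depends on the business meaning of the levels, as noted in section 3 above.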

Summary:

  • Above we examined the feature distributions with value_counts() and similar functions, but charts are the most convenient way to summarize raw information.
  • Numbers without a visual form give little intuition.
  • The same dataset reveals different patterns depending on the scale at which it is plotted. Python turns the data into charts, but whether the conclusions drawn from them are correct is up to you.

3.2 Visualizing variable distributions

Visualizing the distribution of a single variable

```python
plt.figure(figsize=(8, 8))
sns.barplot(data_train['employmentLength'].value_counts(dropna=False)[:20],
            data_train['employmentLength'].value_counts(dropna=False).keys()[:20])
plt.show()
```

Visualizing the distribution of a feature x for different values of y

  • First, look at how the categorical variables are distributed over the different y values
```python
train_loan_fr = data_train.loc[data_train['isDefault'] == 1]
train_loan_nofr = data_train.loc[data_train['isDefault'] == 0]

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20, 8))
train_loan_fr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax1, title='Count of grade fraud')
train_loan_nofr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax2, title='Count of grade non-fraud')
train_loan_fr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax3, title='Count of employmentLength fraud')
train_loan_nofr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax4, title='Count of employmentLength non-fraud')
plt.show()
```

  • Then, look at how the continuous variables are distributed over the different y values
```python
fig, ((ax1, ax2)) = plt.subplots(1, 2, figsize=(15, 6))
data_train.loc[data_train['isDefault'] == 1]['loanAmnt'].apply(np.log).plot(
    kind='hist', bins=100, title='Log Loan Amt - Fraud', color='r', xlim=(-3, 10), ax=ax1)
data_train.loc[data_train['isDefault'] == 0]['loanAmnt'].apply(np.log).plot(
    kind='hist', bins=100, title='Log Loan Amt - Not Fraud', color='b', xlim=(-3, 10), ax=ax2)
# <AxesSubplot:title={'center':'Log Loan Amt - Not Fraud'}, ylabel='Frequency'>
```

```python
total = len(data_train)
total_amt = data_train.groupby(['isDefault'])['loanAmnt'].sum().sum()

plt.figure(figsize=(12, 5))
plt.subplot(121)  # 1 row, 2 columns; this draws the first of the two plots
plot_tr = sns.countplot(x='isDefault', data=data_train)  # count of each class of 'isDefault'
plot_tr.set_title("Fraud Loan Distribution \n 0: good user | 1: bad user", fontsize=14)
plot_tr.set_xlabel("Is fraud by count", fontsize=16)
plot_tr.set_ylabel('Count', fontsize=16)
for p in plot_tr.patches:
    height = p.get_height()
    plot_tr.text(p.get_x() + p.get_width() / 2., height + 3,
                 '{:1.2f}%'.format(height / total * 100),
                 ha="center", fontsize=15)

percent_amt = (data_train.groupby(['isDefault'])['loanAmnt'].sum())
percent_amt = percent_amt.reset_index()
plt.subplot(122)
plot_tr_2 = sns.barplot(x='isDefault', y='loanAmnt', dodge=True, data=percent_amt)
plot_tr_2.set_title("Total Amount in loanAmnt \n 0: good user | 1: bad user", fontsize=14)
plot_tr_2.set_xlabel("Is fraud by percent", fontsize=16)
plot_tr_2.set_ylabel('Total Loan Amount Scalar', fontsize=16)
for p in plot_tr_2.patches:
    height = p.get_height()
    plot_tr_2.text(p.get_x() + p.get_width() / 2., height + 3,
                   '{:1.2f}%'.format(height / total_amt * 100),
                   ha="center", fontsize=15)
```

3.3 Processing and inspecting time-format data

```python
# Convert to datetime; issueDateDT is the number of days between a record's issueDate
# and the earliest date in the dataset (2007-06-01)
data_train['issueDate'] = pd.to_datetime(data_train['issueDate'], format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_train['issueDateDT'] = data_train['issueDate'].apply(lambda x: x - startdate).dt.days

# Same conversion for the test set
data_test_a['issueDate'] = pd.to_datetime(data_test_a['issueDate'], format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_test_a['issueDateDT'] = data_test_a['issueDate'].apply(lambda x: x - startdate).dt.days

plt.hist(data_train['issueDateDT'], label='train')
plt.hist(data_test_a['issueDateDT'], label='test')
plt.legend()
plt.title('Distribution of issueDateDT dates')
# train and test overlap in issueDateDT, so a time-based split for validation is unwise
```

3.4 Pivot tables give a better view of the data

```python
# Pivot table: multiple index columns are allowed, "columns" is optional, and the
# aggregation function aggfunc is applied to the fields listed in "values"
pivot = pd.pivot_table(data_train, index=['grade'], columns=['issueDateDT'],
                       values=['loanAmnt'], aggfunc=np.sum)
pivot
```

```
              loanAmnt
issueDateDT         0        30       61  ...       4140       4171       4201
grade
A                 NaN   53650.0  42000.0  ...  3919275.0  2694025.0  2245625.0
B                 NaN   13000.0  24000.0  ...  4329400.0  3922575.0  3257100.0
C                 NaN   68750.0   8175.0  ...  4552600.0  2870050.0  2246250.0
D                 NaN       NaN   5500.0  ...  3038500.0  2452375.0  1771750.0
E              7500.0       NaN  10000.0  ...  1131625.0   883950.0   802425.0
F                 NaN       NaN  31250.0  ...   315075.0    72300.0        NaN
G                 NaN       NaN      NaN  ...    23750.0    25100.0     1000.0

7 rows × 139 columns
```

3.5 Generate a data report with pandas_profiling

```python
import pandas_profiling

pfr = pandas_profiling.ProfileReport(data_train)
pfr.to_file("./example.html")
```
