
Everyone Do this at the Beginning!! - A Kaggle Data Preprocessing Walkthrough

Published: 2025/3/15

Link to the original English notebook

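In data analysis, how the raw data is preprocessed, and what comes out of that preprocessing, has a major effect on the results. Today's machine learning algorithms also need sufficiently high-quality training data to produce a high-quality model.

Here I walk through a preprocessing approach for a malware feature dataset from Kaggle.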


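We can remove 17 features from this dataset!

  • (M) mostly-missing features: features whose missing values account for more than 99% of the rows.

  • (S) too-skewed features: the most frequent value accounts for more than 99% of all the values of the feature.

  • (C) highly-correlated features: compute the correlation between features, pick out the pairs with correlation >= 0.99, compare the number of distinct values of the two features in each pair, and drop the one with fewer distinct values.

As listed below, the three criteria above let us remove 17 features in total: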

  • (M) PuaMode
  • (M) Census_ProcessorClass
  • (S) Census_IsWIMBootEnabled
  • (S) IsBeta
  • (S) Census_IsFlightsDisabled
  • (S) Census_IsFlightingInternal
  • (S) AutoSampleOptIn
  • (S) Census_ThresholdOptIn
  • (S) SMode
  • (S) Census_IsPortableOperatingSystem
  • (S) Census_DeviceFamily
  • (S) UacLuaenable
  • (S) Census_IsVirtualDevice
  • (C) Platform
  • (C) Census_OSSkuName
  • (C) Census_OSInstallLanguageIdentifier
  • (C) Processor

    Note: this article only shows the processing of train; the result is the same as processing train and test together.

    I. Load Data

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    Custom dtypes help speed up loading (the file is over 3 GB). The dtype mapping below is borrowed from another Kaggle kernel.

    # referred https://www.kaggle.com/theoviel/load-the-totality-of-the-data
    dtypes = {
        'MachineIdentifier': 'category',
        'ProductName': 'category',
        'EngineVersion': 'category',
        'AppVersion': 'category',
        'AvSigVersion': 'category',
        'IsBeta': 'int8',
        'RtpStateBitfield': 'float16',
        'IsSxsPassiveMode': 'int8',
        'DefaultBrowsersIdentifier': 'float32',
        'AVProductStatesIdentifier': 'float32',
        'AVProductsInstalled': 'float16',
        'AVProductsEnabled': 'float16',
        'HasTpm': 'int8',
        'CountryIdentifier': 'int16',
        'CityIdentifier': 'float32',
        'OrganizationIdentifier': 'float16',
        'GeoNameIdentifier': 'float16',
        'LocaleEnglishNameIdentifier': 'int16',
        'Platform': 'category',
        'Processor': 'category',
        'OsVer': 'category',
        'OsBuild': 'int16',
        'OsSuite': 'int16',
        'OsPlatformSubRelease': 'category',
        'OsBuildLab': 'category',
        'SkuEdition': 'category',
        'IsProtected': 'float16',
        'AutoSampleOptIn': 'int8',
        'PuaMode': 'category',
        'SMode': 'float16',
        'IeVerIdentifier': 'float16',
        'SmartScreen': 'category',
        'Firewall': 'float16',
        'UacLuaenable': 'float64',  # was 'float32'
        'Census_MDC2FormFactor': 'category',
        'Census_DeviceFamily': 'category',
        'Census_OEMNameIdentifier': 'float32',  # was 'float16'
        'Census_OEMModelIdentifier': 'float32',
        'Census_ProcessorCoreCount': 'float16',
        'Census_ProcessorManufacturerIdentifier': 'float16',
        'Census_ProcessorModelIdentifier': 'float32',  # was 'float16'
        'Census_ProcessorClass': 'category',
        'Census_PrimaryDiskTotalCapacity': 'float64',  # was 'float32'
        'Census_PrimaryDiskTypeName': 'category',
        'Census_SystemVolumeTotalCapacity': 'float64',  # was 'float32'
        'Census_HasOpticalDiskDrive': 'int8',
        'Census_TotalPhysicalRAM': 'float32',
        'Census_ChassisTypeName': 'category',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float32',  # was 'float16'
        'Census_InternalPrimaryDisplayResolutionHorizontal': 'float32',  # was 'float16'
        'Census_InternalPrimaryDisplayResolutionVertical': 'float32',  # was 'float16'
        'Census_PowerPlatformRoleName': 'category',
        'Census_InternalBatteryType': 'category',
        'Census_InternalBatteryNumberOfCharges': 'float64',  # was 'float32'
        'Census_OSVersion': 'category',
        'Census_OSArchitecture': 'category',
        'Census_OSBranch': 'category',
        'Census_OSBuildNumber': 'int16',
        'Census_OSBuildRevision': 'int32',
        'Census_OSEdition': 'category',
        'Census_OSSkuName': 'category',
        'Census_OSInstallTypeName': 'category',
        'Census_OSInstallLanguageIdentifier': 'float16',
        'Census_OSUILocaleIdentifier': 'int16',
        'Census_OSWUAutoUpdateOptionsName': 'category',
        'Census_IsPortableOperatingSystem': 'int8',
        'Census_GenuineStateName': 'category',
        'Census_ActivationChannel': 'category',
        'Census_IsFlightingInternal': 'float16',
        'Census_IsFlightsDisabled': 'float16',
        'Census_FlightRing': 'category',
        'Census_ThresholdOptIn': 'float16',
        'Census_FirmwareManufacturerIdentifier': 'float16',
        'Census_FirmwareVersionIdentifier': 'float32',
        'Census_IsSecureBootEnabled': 'int8',
        'Census_IsWIMBootEnabled': 'float16',
        'Census_IsVirtualDevice': 'float16',
        'Census_IsTouchEnabled': 'int8',
        'Census_IsPenCapable': 'int8',
        'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',
        'Wdft_IsGamer': 'float16',
        'Wdft_RegionIdentifier': 'float16',
        'HasDetections': 'int8'
    }

    train = pd.read_csv('../input/train.csv', dtype=dtypes)
    train.shape

    (8921483, 83)
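To see why an explicit dtype mapping is worth the typing, here is a minimal sketch on a made-up in-memory CSV. The three column names are borrowed from the dataset, but the rows are invented and the exact byte counts will vary by pandas version; only the direction of the difference matters.

```python
import io
import pandas as pd

# Toy CSV standing in for train.csv (assumption: invented rows, 3 of the real column names).
csv_text = "ProductName,IsBeta,CountryIdentifier\n" + "\n".join(
    f"win8defender,{i % 2},{i % 200}" for i in range(1000)
)

# Default parsing: strings stay as Python objects, integers become int64.
default_df = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes: strings become categories, integers use the narrowest fitting width.
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"ProductName": "category", "IsBeta": "int8", "CountryIdentifier": "int16"},
)

default_bytes = default_df.memory_usage(deep=True).sum()
typed_bytes = typed_df.memory_usage(deep=True).sum()
print(f"default: {default_bytes} bytes, typed: {typed_bytes} bytes")
```

On a file of several gigabytes the same idea decides whether the frame fits in memory at all.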

    II. Feature Engineering

    Define an empty list to hold the names of the features to be removed:

    droppable_features = []

    (1)

    • mostly-missing Columns

      # Fraction of missing values for each feature
      (train.isnull().sum()/train.shape[0]).sort_values(ascending=False)

      PuaMode    0.999741
      Census_ProcessorClass    0.995894
      DefaultBrowsersIdentifier    0.951416
      Census_IsFlightingInternal    0.830440
      Census_InternalBatteryType    0.710468
      Census_ThresholdOptIn    0.635245
      Census_IsWIMBootEnabled    0.634390
      SmartScreen    0.356108
      OrganizationIdentifier    0.308415
      SMode    0.060277
      CityIdentifier    0.036475
      Wdft_IsGamer    0.034014
      Wdft_RegionIdentifier    0.034014
      Census_InternalBatteryNumberOfCharges    0.030124
      Census_FirmwareManufacturerIdentifier    0.020541
      Census_IsFlightsDisabled    0.017993
      Census_FirmwareVersionIdentifier    0.017949
      Census_OEMModelIdentifier    0.011459
      Census_OEMNameIdentifier    0.010702
      Firewall    0.010239
      Census_TotalPhysicalRAM    0.009027
      Census_IsAlwaysOnAlwaysConnectedCapable    0.007997
      Census_OSInstallLanguageIdentifier    0.006735
      IeVerIdentifier    0.006601
      Census_PrimaryDiskTotalCapacity    0.005943
      Census_SystemVolumeTotalCapacity    0.005941
      Census_InternalPrimaryDiagonalDisplaySizeInInches    0.005283
      Census_InternalPrimaryDisplayResolutionHorizontal    0.005267
      Census_InternalPrimaryDisplayResolutionVertical    0.005267
      Census_ProcessorModelIdentifier    0.004634
      ...
      ProductName    0.000000
      HasTpm    0.000000
      OsBuild    0.000000
      IsBeta    0.000000
      OsSuite    0.000000
      IsSxsPassiveMode    0.000000
      HasDetections    0.000000
      SkuEdition    0.000000
      Census_OSInstallTypeName    0.000000
      Census_IsPenCapable    0.000000
      Census_IsTouchEnabled    0.000000
      Census_IsSecureBootEnabled    0.000000
      Census_FlightRing    0.000000
      Census_ActivationChannel    0.000000
      Census_GenuineStateName    0.000000
      Census_IsPortableOperatingSystem    0.000000
      Census_OSWUAutoUpdateOptionsName    0.000000
      Census_OSUILocaleIdentifier    0.000000
      Census_OSSkuName    0.000000
      AutoSampleOptIn    0.000000
      Census_OSEdition    0.000000
      Census_OSBuildRevision    0.000000
      Census_OSBuildNumber    0.000000
      Census_OSBranch    0.000000
      Census_OSArchitecture    0.000000
      Census_OSVersion    0.000000
      Census_HasOpticalDiskDrive    0.000000
      Census_DeviceFamily    0.000000
      Census_MDC2FormFactor    0.000000
      MachineIdentifier    0.000000
      Length: 83, dtype: float64

    As we can see, 2 features have more than 99% missing values, so we remove them:

    # Put them into the empty list defined earlier
    droppable_features.append('PuaMode')
    droppable_features.append('Census_ProcessorClass')
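The two names above are typed in by hand after reading the printout. As a hedged alternative, the same selection can be derived from the missing-value ratios directly. The sketch below uses a made-up three-column frame (the column names are hypothetical) and the same 99% threshold.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption: invented data, hypothetical column names).
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 199 + [1.0],  # 99.5% missing
    "half_missing": [np.nan, 1.0] * 100,       # 50% missing
    "complete": np.arange(200),                # nothing missing
})

# isnull().mean() is the same quantity as isnull().sum() / number_of_rows.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)

threshold = 0.99
mostly_missing_cols = missing_ratio[missing_ratio > threshold].index.tolist()
print(mostly_missing_cols)
```

Selecting by threshold keeps the list in sync with the data if the cut-off is later changed.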
    • Too skewed columns

      # pd.options.display lets you customise how values are printed.
      # '{:,.4f}' keeps 4 decimal places (with a thousands separator);
      # '{:100,.4f}' also keeps 4 decimal places, so the number after the decimal point sets the precision.
      # train[c].nunique(): the number of distinct values in the feature.
      # value_counts(): how many times each value occurs.
      # value_counts(normalize=True): each value's share of the total, sorted in descending order by default.
      # value_counts(normalize=True).values[0]: the share of the most frequent value.
      pd.options.display.float_format = '{:,.4f}'.format
      sk_df = pd.DataFrame([{'column': c,
                             'uniq': train[c].nunique(),
                             'skewness': train[c].value_counts(normalize=True).values[0] * 100}
                            for c in train.columns])
      sk_df = sk_df.sort_values('skewness', ascending=False)
      sk_df

      column    skewness    uniq
      Census_IsWIMBootEnabled    100.0000    2
      IsBeta    99.9992    2
      Census_IsFlightsDisabled    99.9990    2
      Census_IsFlightingInternal    99.9986    2
      AutoSampleOptIn    99.9971    2
      Census_ThresholdOptIn    99.9749    2
      SMode    99.9537    2
      Census_IsPortableOperatingSystem    99.9455    2
      PuaMode    99.9134    2
      Census_DeviceFamily    99.8383    3
      UacLuaenable    99.3925    11
      Census_IsVirtualDevice    99.2961    2
      ProductName    98.9356    6
      HasTpm    98.7971    2
      IsSxsPassiveMode    98.2666    2
      Firewall    97.8583    2
      AVProductsEnabled    97.3984    6
      RtpStateBitfield    97.3262    7
      OsVer    96.7613    58
      Platform    96.6063    4
      Census_IsPenCapable    96.1929    2
      IsProtected    94.5624    2
      Census_IsAlwaysOnAlwaysConnectedCapable    94.2581    2
      Census_FlightRing    93.6580    10
      Census_HasOpticalDiskDrive    92.2813    2
      Census_OSArchitecture    90.8580    3
      Processor    90.8530    3
      Census_GenuineStateName    88.2992    5
      Census_ProcessorManufacturerIdentifier    88.2789    7
      Census_IsTouchEnabled    87.4457    2
      ...
      Census_OSBuildNumber    44.9351    165
      Census_OSWUAutoUpdateOptionsName    44.3256    6
      OsPlatformSubRelease    43.8887    9
      OsBuild    43.8887    76
      IeVerIdentifier    43.8454    303
      EngineVersion    43.0990    70
      OsBuildLab    41.0045    663
      Census_OSEdition    38.8948    33
      Census_OSSkuName    38.8934    30
      Census_OSInstallLanguageIdentifier    35.8777    39
      Census_OSUILocaleIdentifier    35.5414    147
      Census_InternalPrimaryDiagonalDisplaySizeInInches    34.3398    785
      Census_PrimaryDiskTotalCapacity    32.0408    5735
      Census_FirmwareManufacturerIdentifier    30.8882    712
      Census_OSInstallTypeName    29.2332    9
      LocaleEnglishNameIdentifier    23.4780    276
      Wdft_RegionIdentifier    20.8877    15
      GeoNameIdentifier    17.1716    292
      Census_OSBuildRevision    15.8453    285
      Census_OSVersion    15.8452    469
      Census_OEMNameIdentifier    14.5850    3832
      DefaultBrowsersIdentifier    10.6257    2017
      CountryIdentifier    4.4519    222
      Census_OEMModelIdentifier    3.4559    175365
      Census_ProcessorModelIdentifier    3.2576    3428
      AvSigVersion    1.1469    8531
      CityIdentifier    1.1030    107366
      Census_FirmwareVersionIdentifier    1.0228    50494
      Census_SystemVolumeTotalCapacity    0.5863    536848
      MachineIdentifier    0.0000    8921483

    83 rows × 3 columns

    As we can see, 12 features have a most frequent value that covers more than 99% of their values, so we remove them:

    droppable_features.extend(sk_df[sk_df.skewness > 99].column.tolist())
    droppable_features

    ['PuaMode',
     'Census_ProcessorClass',
     'Census_IsWIMBootEnabled',
     'IsBeta',
     'Census_IsFlightsDisabled',
     'Census_IsFlightingInternal',
     'AutoSampleOptIn',
     'Census_ThresholdOptIn',
     'SMode',
     'Census_IsPortableOperatingSystem',
     'PuaMode',
     'Census_DeviceFamily',
     'UacLuaenable',
     'Census_IsVirtualDevice']
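The "skewness" used here is not the statistical moment; it is the share of rows taken by the single most frequent value. A minimal sketch on invented data (hypothetical column names, and a helper `top_value_share` that is not in the original) makes the computation explicit.

```python
import pandas as pd

# Toy frame (assumption: invented data, hypothetical column names).
toy = pd.DataFrame({
    "almost_constant": [0] * 995 + [1] * 5,  # top value covers 99.5% of rows
    "balanced": [0, 1] * 500,                # top value covers 50% of rows
})

def top_value_share(series):
    """Percentage of non-missing rows taken by the most frequent value."""
    return series.value_counts(normalize=True).iloc[0] * 100

sk_toy = pd.DataFrame(
    [
        {"column": c, "uniq": toy[c].nunique(), "skewness": top_value_share(toy[c])}
        for c in toy.columns
    ]
).sort_values("skewness", ascending=False)

too_skewed = sk_toy.loc[sk_toy["skewness"] > 99, "column"].tolist()
print(too_skewed)
```

Note that `value_counts` ignores NaN by default, so the share is computed over non-missing rows only. That is how a column that is almost entirely missing, such as PuaMode, can still show up as too skewed.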

    'PuaMode' turns up twice in the list of features to remove, so we delete one of the two entries:

    # PuaMode is duplicated in the two categories.
    droppable_features.remove('PuaMode')

    # Drop these columns.
    # axis=1: operate on columns
    # inplace=True: modify the original data instead of creating a new object
    train.drop(droppable_features, axis=1, inplace=True)
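Removing a duplicate by name works when you already know which entry repeats. A small sketch of a general alternative, assuming nothing beyond the standard library: `dict.fromkeys` drops every repeat while keeping first-seen order. The list below is a shortened, illustrative version of the real one.

```python
# Shortened illustrative list with one repeated entry.
candidates = ["PuaMode", "Census_ProcessorClass", "IsBeta", "PuaMode", "SMode"]

# dict keys are unique and keep insertion order, so this de-duplicates in one pass.
unique_candidates = list(dict.fromkeys(candidates))
print(unique_candidates)

# list.remove, as used in the article, deletes only the FIRST matching entry.
after_remove = candidates.copy()
after_remove.remove("PuaMode")
print(after_remove)
```

The two approaches keep different copies of the duplicate, which is why 'PuaMode' appears in the middle of the article's final list rather than at the front.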

    So far we have removed 2 + (12 - 1) = 13 features.

    (2)

    In addition, the remaining features still contain many missing values (NaN), which we need to handle.

    # Fraction of missing values for each feature
    # .isnull().sum(): the number of missing values in each feature
    null_counts = train.isnull().sum()
    null_counts = null_counts / train.shape[0]
    null_counts[null_counts > 0.1]

    DefaultBrowsersIdentifier    0.9514
    OrganizationIdentifier    0.3084
    SmartScreen    0.3561
    Census_InternalBatteryType    0.7105
    dtype: float64

    As we can see, 4 features contain a large share of missing values (NaN).

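    1. DefaultBrowsersIdentifier

    train.DefaultBrowsersIdentifier.value_counts().head(5)

    239.0000    46056
    3,195.0000    42692
    1,632.0000    28751
    3,176.0000    24220
    146.0000    20756
    Name: DefaultBrowsersIdentifier, dtype: int64

    # .fillna(0, inplace=True): fill missing values with 0 and modify the original data, so every missing value becomes 0.
    # .fillna(0, inplace=False): return a filled copy that can be printed and inspected; the original data is unchanged and the missing values stay missing.
    # Assigning the result back is the safer form: calling fillna(..., inplace=True) on train.DefaultBrowsersIdentifier may not update train in recent pandas versions.
    train['DefaultBrowsersIdentifier'] = train['DefaultBrowsersIdentifier'].fillna(0)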
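The difference between previewing a fill and committing it is easy to check on a tiny frame. The sketch below uses four invented values under the real column name and assigns the result back, which behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption: invented values under the real column name).
toy = pd.DataFrame({"DefaultBrowsersIdentifier": [239.0, np.nan, 3195.0, np.nan]})

# fillna without inplace returns a new Series; the frame itself is untouched.
preview = toy["DefaultBrowsersIdentifier"].fillna(0)
missing_before = int(toy["DefaultBrowsersIdentifier"].isnull().sum())

# Assigning the result back is what actually changes the frame.
toy["DefaultBrowsersIdentifier"] = toy["DefaultBrowsersIdentifier"].fillna(0)
missing_after = int(toy["DefaultBrowsersIdentifier"].isnull().sum())

print(missing_before, missing_after)
```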

    2. SmartScreen

    # .value_counts(): how many times each value of the feature occurs
    train.SmartScreen.value_counts()

    RequireAdmin    4316183
    ExistsNotSet    1046183
    Off    186553
    Warn    135483
    Prompt    34533
    Block    22533
    off    1350
    On    731
    &#x02;    416
    &#x01;    335
    on    147
    requireadmin    10
    OFF    4
    0    3
    Promt    2
    requireAdmin    1
    Enabled    1
    prompt    1
    warn    1
    00000000    1
    &#x03;    1
    Name: SmartScreen, dtype: int64

    The values of 'SmartScreen' are too messy, so we map them to more regular strings:

    # Assumption: the three keys whose text was lost in this copy are the escaped
    # control characters &#x02;, &#x01; and &#x03;, matching their targets '2', '1' and '3'.
    trans_dict = {
        'off': 'Off', '&#x02;': '2', '&#x01;': '1', 'on': 'On', 'requireadmin': 'RequireAdmin', 'OFF': 'Off',
        'Promt': 'Prompt', 'requireAdmin': 'RequireAdmin', 'prompt': 'Prompt', 'warn': 'Warn',
        '00000000': '0', '&#x03;': '3', np.nan: 'NoExist'
    }
    # .replace(): substitutes values according to the mapping
    train.replace({'SmartScreen': trans_dict}, inplace=True)
    train.SmartScreen.isnull().sum()

    0

    Why is the result 0? Because every missing value has now been set to 'NoExist'.
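The clean-up above boils down to a dictionary that maps each stray spelling to its canonical label, plus a placeholder for missing values. A minimal sketch on a six-element Series, using a shortened mapping (an illustrative subset, not the full dictionary):

```python
import numpy as np
import pandas as pd

# Toy Series with inconsistent spellings and one missing value (invented data).
smart = pd.Series(
    ["RequireAdmin", "requireadmin", "off", "OFF", "Promt", np.nan], dtype="object"
)

# Illustrative subset of the spelling fixes.
mapping = {"requireadmin": "RequireAdmin", "off": "Off", "OFF": "Off", "Promt": "Prompt"}

# Fix the spellings first, then give missing values an explicit label.
cleaned = smart.replace(mapping).fillna("NoExist")
print(cleaned.value_counts())
```

Splitting the spelling fixes from the `fillna` step keeps NaN handling out of the mapping dictionary, which some readers may find easier to audit.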

    3. OrganizationIdentifier

    train.OrganizationIdentifier.value_counts()

    27.0000    4196457
    18.0000    1764175
    48.0000    63845
    50.0000    45502
    11.0000    19436
    37.0000    19398
    49.0000    13627
    46.0000    10974
    14.0000    4713
    32.0000    4045
    36.0000    3909
    52.0000    3043
    33.0000    2896
    2.0000    2595
    5.0000    1990
    40.0000    1648
    28.0000    1591
    4.0000    1385
    10.0000    1083
    51.0000    917
    20.0000    915
    1.0000    893
    8.0000    723
    22.0000    418
    39.0000    413
    6.0000    412
    31.0000    398
    21.0000    397
    47.0000    385
    3.0000    331
    16.0000    242
    19.0000    172
    26.0000    160
    44.0000    150
    29.0000    135
    42.0000    132
    7.0000    98
    41.0000    77
    45.0000    73
    30.0000    64
    43.0000    60
    35.0000    32
    23.0000    20
    15.0000    13
    25.0000    12
    12.0000    7
    34.0000    2
    38.0000    1
    17.0000    1
    Name: OrganizationIdentifier, dtype: int64

    This feature is an ID-like value, and 0 is not used by any existing ID, so we can fill the missing values with 0:

    train.replace({'OrganizationIdentifier': {np.nan: 0}}, inplace=True)

    4. Census_InternalBatteryType

    pd.options.display.max_rows = 99
    train.Census_InternalBatteryType.value_counts()

    lion    2028256
    li-i    245617
    #    183998
    lip    62099
    liio    32635
    li p    8383
    li    6708
    nimh    4614
    real    2744
    bq20    2302
    pbac    2274
    vbox    1454
    unkn    533
    lgi0    399
    lipo    198
    lhp0    182
    4cel    170
    lipp    83
    ithi    79
    batt    60
    ram    35
    bad    33
    virt    33
    pad0    22
    lit    16
    ca48    16
    a132    10
    ots0    9
    lai0    8
    ????    8
    lio    5
    4lio    4
    lio    4
    asmb    4
    li-p    4
    0x0b    3
    lgs0    3
    icp3    3
    3ion    2
    a140    2
    h00j    2
    5nm1    2
    lhpo    2
    a138    2
    lilo    1
    li-h    1
    lp    1
    li?    1
    ion    1
    pbso    1
    3500    1
    6ion    1
    @i    1
    li    1
    sams    1
    ip    1
    8    1
    #TAB#    1
    l&#TAB#    1
    lio    1
    ˙˙˙    1
    l    1
    cl53    1
    li??    1
    pa50    1
    í-i    1
    ÷?ó?    1
    li-l    1
    h4°s    1
    d    1
    lgl0    1
    4ion    1
    0ts0    1
    sail    1
    p-sn    1
    a130    1
    2337    1
    l???    1
    Name: Census_InternalBatteryType, dtype: int64

    In this feature, missing values, '˙˙˙' and 'unkn' all mean "unknown", so we map all three to 'unknown':

    trans_dict = {
        '˙˙˙': 'unknown', 'unkn': 'unknown', np.nan: 'unknown'
    }
    train.replace({'Census_InternalBatteryType': trans_dict}, inplace=True)

    (3)

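    Note: the 4 features above have more than 10% missing values, so we cannot delete the rows in which they are missing; we fill them with other values instead. Many other features have a missing ratio between 0 and 10%, and for those we do remove the missing values (in practice, we drop the samples, that is the rows, that contain them).

    train.shape

    (8921483, 70)

    # .dropna(inplace=True): drop every row that contains NaN; the remaining rows keep their original index values
    train.dropna(inplace=True)
    train.shape

    (7667789, 70)

    In the end, about 14% of the samples were dropped.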
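The 14% figure follows directly from the two shapes printed above. The sketch below redoes that arithmetic and, on an invented three-row frame, shows that `dropna` keeps the original index labels of the surviving rows.

```python
import numpy as np
import pandas as pd

# Row counts taken from the two train.shape outputs above.
rows_before = 8921483
rows_after = 7667789
dropped = rows_before - rows_after
dropped_pct = 100 * dropped / rows_before
print(f"{dropped} rows dropped ({dropped_pct:.2f}%)")

# Toy frame (assumption: invented data) to show what dropna does to the index.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})
kept = toy.dropna()
print(kept.index.tolist())  # surviving rows keep their original labels
```

If later code relies on positional indexing, a `reset_index(drop=True)` after the drop may be worth considering.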

    There is also the feature 'MachineIdentifier', which is of no use for malware detection, so we drop it as well:

    train.drop('MachineIdentifier', axis=1, inplace=True)

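    At this point we have dropped 13 + 1 = 14 columns (13 of the 17 features listed at the top, plus the MachineIdentifier ID column).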

    (4)

    To make the data usable for machine learning, we need to convert some columns to the category type. For the reason, see: Category type (link).

    # Convert 'SmartScreen' and 'Census_InternalBatteryType' to the category type
    train['SmartScreen'] = train.SmartScreen.astype('category')
    train['Census_InternalBatteryType'] = train.Census_InternalBatteryType.astype('category')

    # cate_cols: names of the features whose dtype is category
    cate_cols = train.select_dtypes(include='category').columns.tolist()

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    for col in cate_cols:
        train[col] = le.fit_transform(train[col])
    # After le.fit_transform(), columns such as train['SmartScreen'] and train['Census_InternalBatteryType'] have dtype int64
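To see what this kind of label encoding produces without installing scikit-learn, here is a sketch that swaps in pandas' own category codes. This is a substitute technique, not the article's LabelEncoder call; for string labels both assign integers in sorted label order. The four values are invented.

```python
import pandas as pd

# Toy frame (assumption: invented values under the real column name).
toy = pd.DataFrame({"SmartScreen": ["Off", "RequireAdmin", "NoExist", "Off"]})
toy["SmartScreen"] = toy["SmartScreen"].astype("category")

# Categories are stored sorted; each value is replaced by its position in that list.
categories = toy["SmartScreen"].cat.categories.tolist()
codes = toy["SmartScreen"].cat.codes.tolist()
print(categories)
print(codes)
```

One difference to keep in mind: `cat.codes` encodes missing values as -1, so it is worth filling or dropping NaN first, as the article does.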

    For more on LabelEncoder, see: LabelEncoder (link).

    (5)

    Next, use a routine to shrink the size of train. I call it the "memory reduction algorithm". For details, see: memory reduction algorithm (link).

    The code is as follows:

    def reduce_mem_usage(df):
        """ iterate through all the columns of a dataframe and modify the data type
            to reduce memory usage.
        """
        # .memory_usage(): memory used by each column, in bytes
        start_mem = df.memory_usage().sum() / 1024**2
        print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

        for col in df.columns:
            col_type = df[col].dtype

            if col_type != object:
                c_min = df[col].min()
                c_max = df[col].max()
                if str(col_type)[:3] == 'int':
                    # np.iinfo() is explained below the code
                    if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)
                else:
                    if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                        df[col] = df[col].astype(np.float16)
                    elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                        df[col] = df[col].astype(np.float32)
                    else:
                        df[col] = df[col].astype(np.float64)
            # Anything that is neither integer nor float (for example strings) becomes category
            else:
                df[col] = df[col].astype('category')

        end_mem = df.memory_usage().sum() / 1024**2
        print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
        print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

        return df

    %time train = reduce_mem_usage(train)

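    Output:

    CPU times: user 0 ns, sys: 0 ns, total: 0 ns
    Wall time: 5.48 μs
    Memory usage of dataframe is 2464.34 MB
    Memory usage after optimization is: 965.26 MB
    Decreased by 60.8%

    np.iinfo(): returns the machine limits of an integer type, such as the smallest and largest values it can represent. np.finfo() does the same for floating-point types.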
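np.iinfo() and np.finfo(), which reduce_mem_usage relies on, can be inspected directly. The sketch below prints a few limits and mirrors the function's strict comparisons in a small helper; `smallest_int_dtype` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

# Representable range of an 8-bit signed integer.
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)

# Largest finite float16 value.
float16_max = float(np.finfo(np.float16).max)
print(float16_max)

def smallest_int_dtype(c_min, c_max):
    """Mirror of the strict '>' and '<' comparisons used in reduce_mem_usage."""
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if c_min > info.min and c_max < info.max:
            return candidate
    return np.int64

print(smallest_int_dtype(0, 100))  # fits int8
print(smallest_int_dtype(0, 127))  # 127 is the int8 maximum, but '<' is strict, so int16

# float16 has an 11-bit significand: integers above 2048 are no longer exact.
rounded = float(np.float16(2049))
print(rounded)
```

The last line is worth keeping in mind for identifier-like float columns: a value range that fits in float16 does not guarantee that every distinct ID survives the cast, so checking `nunique()` before and after the conversion may be prudent.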

    (6)

    • Highly correlated features.

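    Because there are so many features, we build one correlation (coefficient) matrix for every 10 features. For background on correlation matrices, see: correlation matrix (link).

    cols = train.columns.tolist()
    corr_remove = []  # holds the features to be removed

    import seaborn as sns

    plt.figure(figsize=(10,10))
    co_cols = cols[:10]
    co_cols.append('HasDetections')
    # sns.heatmap(): draw the correlation matrix as a heatmap
    #   .corr(): pairwise correlation
    #   cmap: colour map of the heatmap
    #   annot=True: print every correlation coefficient in its cell
    #   center=0.0: the middle of the colour scale corresponds to a coefficient of 0.0
    sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
    plt.title('Correlation between 1 ~ 10th columns')
    plt.show()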

    No correlation coefficient >= 0.99 appears. Let's continue.

    co_cols = cols[10:20]
    co_cols.append('HasDetections')
    plt.figure(figsize=(10,10))
    sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
    plt.title('Correlation between 11 ~ 20th columns')
    plt.show()

    Here is one. Remove the feature with fewer distinct values:

    print(train.Platform.nunique())
    print(train.OsVer.nunique())

    3
    45

    Platform vs OsVer: 3 < 45, so remove Platform.

    # Remember: corr_remove is the empty list defined above for the features to be removed
    corr_remove.append('Platform')

    OK, let's continue.

    co_cols = cols[20:30]
    co_cols.append('HasDetections')
    plt.figure(figsize=(10,10))
    sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
    plt.title('Correlation between 21 ~ 30th columns')
    plt.show()

    No correlation coefficient >= 0.99 this time. Don't lose heart, keep going.

    co_cols = cols[30:40]
    co_cols.append('HasDetections')
    plt.figure(figsize=(10,10))
    sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
    plt.title('Correlation between 31 ~ 40th columns')
    plt.show()

    Still nothing. Keep going.

    co_cols = cols[40:50]
    co_cols.append('HasDetections')
    plt.figure(figsize=(10,10))
    sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
    plt.title('Correlation between 41 ~ 50th columns')
    plt.show()

    Still nothing. Once more.

    co_cols = cols[50:60]
    co_cols.append('HasDetections')
    plt.figure(figsize=(10,10))
    sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0)
    plt.title('Correlation between 51 ~ 60th columns')
    plt.show()

    Handle this the same way as the last time a correlation coefficient >= 0.99 turned up:

    print(train.Census_OSEdition.nunique())
    print(train.Census_OSSkuName.nunique(), '\n')
    print(train.Census_OSInstallLanguageIdentifier.nunique())
    print(train.Census_OSUILocaleIdentifier.nunique())

    29
    25

    39
    144

    Census_OSEdition vs Census_OSSkuName: 29 > 25, so remove Census_OSSkuName.

    Census_OSInstallLanguageIdentifier vs Census_OSUILocaleIdentifier: 39 < 144, so remove Census_OSInstallLanguageIdentifier.

    corr_remove.append('Census_OSSkuName')
    corr_remove.append('Census_OSInstallLanguageIdentifier')

    Let's finish what we started and check the last group.

    co_cols = cols[60:]
    # co_cols.append('HasDetections')  # not needed: HasDetections is already in this last slice
    plt.figure(figsize=(10,10))
    sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0)
    plt.title('Correlation from the 61st to the last column')
    plt.show()
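The seven hand-written cells above differ only in their slice bounds, so they can be folded into one loop. The sketch below does that without plotting: it scans each chunk's correlation matrix and collects the pairs at or above the threshold. The helper name `high_corr_pairs_by_chunk` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption: random data; "a_copy" is a linear function of "a").
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "a": base,
    "a_copy": base * 2.0 + 1.0,  # perfectly correlated with "a"
    "b": rng.normal(size=500),
    "c": rng.normal(size=500),
    "HasDetections": rng.integers(0, 2, size=500),
})

def high_corr_pairs_by_chunk(df, chunk_size=10, threshold=0.99):
    """Return (left, right) column pairs with correlation >= threshold, per chunk."""
    cols = df.columns.tolist()
    found = []
    for start in range(0, len(cols), chunk_size):
        chunk = cols[start:start + chunk_size]
        corr = df[chunk].corr()
        for i, left in enumerate(chunk):
            for right in chunk[i + 1:]:
                if corr.loc[left, right] >= threshold:
                    found.append((left, right))
    return found

print(high_corr_pairs_by_chunk(toy))

# Same data, but the correlated columns now fall into different chunks of size 2.
reordered = toy[["a", "b", "a_copy", "c", "HasDetections"]]
print(high_corr_pairs_by_chunk(reordered, chunk_size=2))
```

The second call comes back empty, which illustrates the limit of the chunked approach: a pair that straddles two chunks is never compared, so a final pass over the full matrix is still needed.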

    The group-by-group correlation analysis is complete.

    Across the groups we found 3 features to remove:

    corr_remove

    ['Platform', 'Census_OSSkuName', 'Census_OSInstallLanguageIdentifier']

    The code to remove these 3 features:

    train.drop(corr_remove, axis=1, inplace=True)

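    So far we have removed 13 + 3 = 16 features (the MachineIdentifier ID column, also dropped along the way, is not counted here).

    Now build a correlation matrix over all the remaining data:

    corr = train.corr()
    high_corr = (corr >= 0.99).astype('uint8')
    plt.figure(figsize=(15,15))
    sns.heatmap(high_corr, cmap='RdBu_r', annot=True, center=0.0)
    plt.show()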

    A pair of features with correlation >= 0.99 shows up.

    print(train.Census_OSArchitecture.nunique())
    print(train.Processor.nunique())

    3
    3

    Census_OSArchitecture vs Processor: 3 = 3, a tie. Let's look at how each correlates with 'HasDetections':

    train[['Census_OSArchitecture', 'Processor', 'HasDetections']].corr()

    Both have a correlation coefficient of -0.0758 with 'HasDetections', so either one can be removed. I choose to remove 'Processor':

    corr_remove.append('Processor')
    # droppable_features is the list we defined at the very beginning
    droppable_features.extend(corr_remove)
    print(len(droppable_features))
    droppable_features

    17
    ['Census_ProcessorClass',
     'Census_IsWIMBootEnabled',
     'IsBeta',
     'Census_IsFlightsDisabled',
     'Census_IsFlightingInternal',
     'AutoSampleOptIn',
     'Census_ThresholdOptIn',
     'SMode',
     'Census_IsPortableOperatingSystem',
     'PuaMode',
     'Census_DeviceFamily',
     'UacLuaenable',
     'Census_IsVirtualDevice',
     'Platform',
     'Census_OSSkuName',
     'Census_OSInstallLanguageIdentifier',
     'Processor']

    All done. After analysing the data, we found 17 features that can be removed.
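The rule applied by hand in step (6), finding pairs with correlation >= 0.99 and dropping the member with fewer distinct values, can be sketched as a single function. The function name, the toy data, and the tie-break (drop the later column) are assumptions for illustration; the article instead settles its one tie by comparing correlations with HasDetections.

```python
import pandas as pd

def correlated_features_to_drop(df, threshold=0.99):
    """For each pair with correlation >= threshold, drop the less diverse column."""
    corr = df.corr()
    cols = corr.columns.tolist()
    to_drop = []
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            # Skip pairs where one member is already scheduled for removal.
            if left in to_drop or right in to_drop:
                continue
            if corr.loc[left, right] >= threshold:
                # Assumption: on a tie in nunique(), the later column is dropped.
                loser = left if df[left].nunique() < df[right].nunique() else right
                to_drop.append(loser)
    return to_drop

# Toy frame (assumption: invented data; "fine" is "coarse" plus small offsets).
toy = pd.DataFrame({
    "coarse": [0, 0, 0, 0, 1, 1, 1, 1],
    "fine": [0.0, 0.01, 0.02, 0.03, 1.0, 1.01, 1.02, 1.03],
    "other": [3, 1, 4, 1, 5, 9, 2, 6],
})

print(correlated_features_to_drop(toy))
```

Like the article's check, this only looks at strongly positive correlations; a pair with correlation near -1 would need `corr.abs()` instead.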
