Code source: "7th-place-solution-microsoft-malware-prediction" (Kaggle)
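Preface
Reading other people's excellent code helps improve your own coding: you pick up programming knowledge, good programming habits, and individual tricks. This post collects my personal notes and opinions on the seventh-place solution of the 2019 Microsoft malware detection competition. Some parts of the code are already commented, and I add extra comments of my own. My ability is limited, so mistakes are inevitable; corrections from fellow enthusiasts are very welcome.
Main text
Overview
As is well known, building a machine learning classification model consists of two main parts: 1. data preprocessing (data cleaning, feature engineering, and so on) and 2. model building (training and hyperparameter tuning). Preprocessing comes first, and the quality of the training data largely determines the quality of the final model, so in a typical machine learning project most of the code deals with data. This code is no exception. In my opinion its data processing is not great, but it is passable (a more interesting preprocessing example is covered in another post of mine). The machine learning algorithm used is LightGBM.
Code walkthrough
Note: I walk through the code in pieces, but for hardware reasons I cannot show the output of every step; if you have the resources, copy the code and run it yourself. The data files used by the code can be downloaded from the official competition site. Although the explanation is step by step, joining the pieces from top to bottom gives the complete code.
Data preprocessing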
import numpy as np
import pandas as pd
import gc
import time
import random
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plot
import seaborn as sb

dataFolder = '../input/'
submissionFileName = 'submission'
trainFile = 'train.csv'
testFile = 'test.csv'
numberOfRows = 4000000
seed = 6001
np.random.seed(seed)
random.seed(seed)

def displayImportances(featureImportanceDf, submissionFileName):
    # Order the features by their mean importance across folds
    cols = featureImportanceDf[["feature", "importance"]].groupby("feature").mean().sort_values(by="importance", ascending=False).index
    bestFeatures = featureImportanceDf.loc[featureImportanceDf.feature.isin(cols)]
    plot.figure(figsize=(14, 14))
    sb.barplot(x="importance", y="feature", data=bestFeatures.sort_values(by="importance", ascending=False))
    plot.title('LightGBM Features')
    plot.tight_layout()
    plot.savefig(submissionFileName + '.png')
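I don't think the paths really need to be split across several variables (perhaps that is just the code author's habit). The role of numberOfRows = 4000000 only becomes clear from the rest of the code: it is passed as nrows to pd.read_csv, so only the first 4,000,000 rows of train.csv are read as training data (later divided into training and validation folds), and those rows are then concatenated with the test set so that both are preprocessed together. seed = 6001 and the two calls below it fix the random seeds. I wondered why random.seed(seed) is needed in addition to np.random.seed(seed): NumPy and Python's standard random module keep separate generator states, so seeding one has no effect on the other, and code that may draw from both has to seed both. The custom function plots the feature importances and saves the figure as a PNG at the end.
Setting dtypes for the features in the official files. The raw data contains only the feature values; the organizers do not say what type each column is, so we have to set the types ourselves.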
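The point about the two seeds can be checked with a small experiment. This is only a sketch: the variable names are made up for the illustration, and the seed value is simply the one used in the code above.

```python
import random

import numpy as np

seed = 6001

# Seeding NumPy makes NumPy's draws reproducible.
np.random.seed(seed)
numpy_first = np.random.rand(3)
np.random.seed(seed)
numpy_second = np.random.rand(3)

# Seeding the standard-library generator does the same for `random`.
random.seed(seed)
stdlib_first = [random.random() for _ in range(3)]
random.seed(seed)
stdlib_second = [random.random() for _ in range(3)]

# Reseeding NumPy does not rewind the standard-library stream:
random.seed(seed)
_ = random.random()                          # consume the first stdlib draw
np.random.seed(seed)                         # reseed NumPy only
stdlib_after_numpy_reseed = random.random()  # this is still the second stdlib draw

print(np.allclose(numpy_first, numpy_second))
print(stdlib_first == stdlib_second)
print(stdlib_after_numpy_reseed == stdlib_first[1])
```

If the two generators shared state, the last draw would have restarted from the beginning of the sequence; instead it continues where the standard-library stream left off.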
dtypes = {
    'MachineIdentifier': 'category',
    'ProductName': 'category',
    'EngineVersion': 'category',
    'AppVersion': 'category',
    'AvSigVersion': 'category',
    'IsBeta': 'int8',
    'RtpStateBitfield': 'float16',
    'IsSxsPassiveMode': 'int8',
    'DefaultBrowsersIdentifier': 'float16',
    'AVProductStatesIdentifier': 'float32',
    'AVProductsInstalled': 'float16',
    'AVProductsEnabled': 'float16',
    'HasTpm': 'int8',
    'CountryIdentifier': 'int16',
    'CityIdentifier': 'float32',
    'OrganizationIdentifier': 'float16',
    'GeoNameIdentifier': 'float16',
    'LocaleEnglishNameIdentifier': 'int8',
    'Platform': 'category',
    'Processor': 'category',
    'OsVer': 'category',
    'OsBuild': 'int16',
    'OsSuite': 'int16',
    'OsPlatformSubRelease': 'category',
    'OsBuildLab': 'category',
    'SkuEdition': 'category',
    'IsProtected': 'float16',
    'AutoSampleOptIn': 'int8',
    'PuaMode': 'category',
    'SMode': 'float16',
    'IeVerIdentifier': 'float16',
    'SmartScreen': 'category',
    'Firewall': 'float16',
    'UacLuaenable': 'float32',
    'Census_MDC2FormFactor': 'category',
    'Census_DeviceFamily': 'category',
    'Census_OEMNameIdentifier': 'float16',
    'Census_OEMModelIdentifier': 'float32',
    'Census_ProcessorCoreCount': 'float16',
    'Census_ProcessorManufacturerIdentifier': 'float16',
    'Census_ProcessorModelIdentifier': 'float16',
    'Census_ProcessorClass': 'category',
    'Census_PrimaryDiskTotalCapacity': 'float32',
    'Census_PrimaryDiskTypeName': 'category',
    'Census_SystemVolumeTotalCapacity': 'float32',
    'Census_HasOpticalDiskDrive': 'int8',
    'Census_TotalPhysicalRAM': 'float32',
    'Census_ChassisTypeName': 'category',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float16',
    'Census_InternalPrimaryDisplayResolutionHorizontal': 'float16',
    'Census_InternalPrimaryDisplayResolutionVertical': 'float16',
    'Census_PowerPlatformRoleName': 'category',
    'Census_InternalBatteryType': 'category',
    'Census_InternalBatteryNumberOfCharges': 'float32',
    'Census_OSVersion': 'category',
    'Census_OSArchitecture': 'category',
    'Census_OSBranch': 'category',
    'Census_OSBuildNumber': 'int16',
    'Census_OSBuildRevision': 'int32',
    'Census_OSEdition': 'category',
    'Census_OSSkuName': 'category',
    'Census_OSInstallTypeName': 'category',
    'Census_OSInstallLanguageIdentifier': 'float16',
    'Census_OSUILocaleIdentifier': 'int16',
    'Census_OSWUAutoUpdateOptionsName': 'category',
    'Census_IsPortableOperatingSystem': 'int8',
    'Census_GenuineStateName': 'category',
    'Census_ActivationChannel': 'category',
    'Census_IsFlightingInternal': 'float16',
    'Census_IsFlightsDisabled': 'float16',
    'Census_FlightRing': 'category',
    'Census_ThresholdOptIn': 'float16',
    'Census_FirmwareManufacturerIdentifier': 'float16',
    'Census_FirmwareVersionIdentifier': 'float32',
    'Census_IsSecureBootEnabled': 'int8',
    'Census_IsWIMBootEnabled': 'float16',
    'Census_IsVirtualDevice': 'float16',
    'Census_IsTouchEnabled': 'int8',
    'Census_IsPenCapable': 'int8',
    'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',
    'Wdft_IsGamer': 'float16',
    'Wdft_RegionIdentifier': 'float16',
    'HasDetections': 'int8',
}

# Note: 'IsProtected' appears twice in the original list
selectedFeatures = [
    'AVProductStatesIdentifier',
    'AVProductsEnabled',
    'IsProtected',
    'Processor',
    'OsSuite',
    'IsProtected',
    'RtpStateBitfield',
    'AVProductsInstalled',
    'Wdft_IsGamer',
    'DefaultBrowsersIdentifier',
    'OsBuild',
    'Wdft_RegionIdentifier',
    'SmartScreen',
    'CityIdentifier',
    'AppVersion',
    'Census_IsSecureBootEnabled',
    'Census_PrimaryDiskTypeName',
    'Census_SystemVolumeTotalCapacity',
    'Census_HasOpticalDiskDrive',
    'Census_IsWIMBootEnabled',
    'Census_IsVirtualDevice',
    'Census_IsTouchEnabled',
    'Census_FirmwareVersionIdentifier',
    'GeoNameIdentifier',
    'IeVerIdentifier',
    'Census_FirmwareManufacturerIdentifier',
    'Census_InternalPrimaryDisplayResolutionHorizontal',
    'Census_InternalPrimaryDisplayResolutionVertical',
    'Census_OEMModelIdentifier',
    'Census_ProcessorModelIdentifier',
    'Census_OSVersion',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches',
    'Census_OEMNameIdentifier',
    'Census_ChassisTypeName',
    'Census_OSInstallLanguageIdentifier',
    'EngineVersion',
    'OrganizationIdentifier',
    'CountryIdentifier',
    'Census_ActivationChannel',
    'Census_ProcessorCoreCount',
    'Census_OSWUAutoUpdateOptionsName',
    'Census_InternalBatteryType',
]
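The code author evidently has a deep background in data processing; he probably picked these features directly, based on earlier experience with malware data. The lesson is that features should not be chosen arbitrarily: without that kind of experience, it is better to borrow an established preprocessing or feature selection method.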
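How a dtype map and a column list work together in pd.read_csv can be tried on a tiny in-memory CSV. This is a minimal sketch: the column names come from the competition data, but the rows are invented and demo_dtypes and demo_features are hypothetical stand-ins for dtypes and selectedFeatures.

```python
import io

import pandas as pd

# Tiny stand-in for train.csv; the values are invented for illustration.
csv_text = """MachineIdentifier,Processor,OsBuild,CityIdentifier,HasDetections
a1,x64,17134,1024,0
b2,x86,16299,,1
c3,x64,17134,2048,1
"""

demo_dtypes = {
    'MachineIdentifier': 'category',
    'Processor': 'category',
    'OsBuild': 'int16',
    'CityIdentifier': 'float32',
    'HasDetections': 'int8',
}
demo_features = ['Processor', 'OsBuild', 'CityIdentifier']

# usecols restricts which columns are read, dtype fixes their types,
# and nrows stops after the first two data rows.
demo_df = pd.read_csv(io.StringIO(csv_text), dtype=demo_dtypes,
                      usecols=demo_features, nrows=2)

print(demo_df.dtypes)
print(demo_df.shape)
```

Compact types such as int16 and category are mainly about memory: the full files have millions of rows, so halving the width of a column matters.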
trainDf = pd.read_csv(dataFolder + trainFile, dtype=dtypes, usecols=selectedFeatures, low_memory=True, nrows=numberOfRows)
labels = pd.read_csv(dataFolder + trainFile, usecols=['HasDetections'], nrows=numberOfRows)
testDf = pd.read_csv(dataFolder + testFile, dtype=dtypes, usecols=selectedFeatures, low_memory=True)
print('== Dataset Shapes ==')
print('Train : ' + str(trainDf.shape))
print('Labels : ' + str(labels.shape))
print('Test : ' + str(testDf.shape))
# The original used trainDf.append(testDf), which was removed in pandas 2.0;
# pd.concat gives the same result
df = pd.concat([trainDf, testDf]).reset_index()
del trainDf, testDf
gc.collect()
df is the new DataFrame obtained by concatenating train and test.
Cleaning the values of the 'SmartScreen' feature
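A toy version of the concatenation shows what reset_index() does to the combined frame and how the two parts can later be separated again. The frames and names here (train_part, test_part, combined) are invented for the illustration.

```python
import pandas as pd

train_part = pd.DataFrame({'OsBuild': [17134, 16299, 17134]})
test_part = pd.DataFrame({'OsBuild': [15063, 17134]})

# Stack train on top of test; reset_index() stores each frame's old row
# labels in a new column named 'index'.
combined = pd.concat([train_part, test_part]).reset_index()
print(combined.columns.tolist())
print(combined['index'].tolist())

# Train rows come first, so the two parts can be recovered by position.
n_train = len(train_part)
train_again = combined.iloc[:n_train]
test_again = combined.iloc[n_train:]
print(len(train_again), len(test_again))
```

The extra 'index' column is not a real feature, which is why the walkthrough has to exclude it later.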
df.loc[df.SmartScreen == 'off', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'of', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'OFF', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == '00000000', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == '0', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'ON', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'on', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'Enabled', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'BLOCK', 'SmartScreen'] = 'Block'
df.loc[df.SmartScreen == 'requireadmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'requireAdmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'RequiredAdmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'Promt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'Promprt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'prompt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'warn', 'SmartScreen'] = 'Warn'
df.loc[df.SmartScreen == 'Deny', 'SmartScreen'] = 'Block'
df.loc[df.SmartScreen == '', 'SmartScreen'] = 'Off'
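Here we can learn a way to pick out specific values of a feature: build a boolean condition on the column and use it with .loc to select (and overwrite) the matching values.
Count how many times each value of every feature occurs, then generate a new DataFrame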
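The boolean-mask pattern is easy to try on a few rows. The sketch below first uses the same .loc form as the code above and then, as an alternative that is not in the original code, collapses several spellings at once with a lookup dictionary; demo and canonical are hypothetical names.

```python
import pandas as pd

demo = pd.DataFrame({'SmartScreen': ['off', 'OFF', 'on', 'Promt', 'Warn', 'RequireAdmin']})

# Boolean-mask assignment: rows where the condition is True get the new value.
demo.loc[demo.SmartScreen == 'off', 'SmartScreen'] = 'Off'

# Alternative (not the author's approach): map several spellings in one call.
canonical = {'OFF': 'Off', 'on': 'On', 'Promt': 'Prompt'}
demo['SmartScreen'] = demo['SmartScreen'].replace(canonical)

print(demo['SmartScreen'].tolist())
```

Values that do not appear in the dictionary are left untouched, so already-clean labels such as 'Warn' survive unchanged.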
for col in [f for f in df.columns if f not in ['index', 'HasDetections', 'Census_SystemVolumeTotalCapacity']]:
    # Replace every value by the number of times it occurs in the column
    df[col] = df[col].map(df[col].value_counts())
dfDummy = pd.get_dummies(df, dummy_na=True)
print('Dummy: ' + str(dfDummy.shape))
del df
gc.collect()
# Assumption: the listing never defines train and test. Since the combined frame
# holds the training rows first, split it back by the number of training labels.
train = dfDummy.iloc[:labels.shape[0]]
test = dfDummy.iloc[labels.shape[0]:]
print('== Dataset Shapes ==')
print('Train: ' + str(train.shape))
print('Test: ' + str(test.shape))
print('== Dataset Columns ==')
features = [f for f in train.columns if f not in ['index']]
for feature in features:
    print(feature)
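df[col].map(df[col].value_counts()) uses .map() to put each value's occurrence count in the position where the value itself used to be (if the functions are unfamiliar, I suggest looking them up; here I only explain what the code means). It is a neat trick: a single line replaces every value in a column by the number of times that value occurs. This is usually called count encoding or frequency encoding (strictly speaking, the frequency is the count divided by the total number of rows, not the raw count). Why do this? Gradient-boosted trees such as LightGBM cannot split on raw strings, and counts turn a categorical column into a single numeric column the trees can split on; how common a value is can also carry signal by itself.
features: when we concatenated train and test above we called .reset_index(), which adds a new column 'index' holding the old row labels, so here it is excluded with not in ['index'].
df[col] = df[col].map(df[col].value_counts()) is the hardest line in this part; the easiest way to understand it is to try it on a small Series.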
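A small Series makes the count encoding concrete. This is a sketch with invented values; the names s, counts, encoded and freq are only for the illustration.

```python
import pandas as pd

s = pd.Series(['x64', 'x86', 'x64', 'arm64', 'x64', 'x86'], name='Processor')

# value_counts() returns a Series indexed by value: x64 -> 3, x86 -> 2, arm64 -> 1
counts = s.value_counts()

# map() looks each element up in that Series and returns its count.
encoded = s.map(counts)
print(encoded.tolist())  # [3, 2, 3, 1, 3, 2]

# A true frequency divides the count by the number of rows.
freq = s.map(s.value_counts(normalize=True))
print(freq.round(3).tolist())
```

One detail worth knowing: value_counts() skips missing values by default, so a NaN in the column has no entry to look up and stays NaN after the mapping.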
Building the machine learning model
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
oofPreds = np.zeros(train.shape[0])
subPreds = np.zeros(test.shape[0])
featureImportanceDf = pd.DataFrame()
for n_fold, (trainXId, validXId) in enumerate(folds.split(train[features], labels)):
    trainX, trainY = train[features].iloc[trainXId], labels.iloc[trainXId]
    validX, validY = train[features].iloc[validXId], labels.iloc[validXId]
    print('== Fold: ' + str(n_fold))
    lgbm = LGBMClassifier(objective='binary', boosting_type='gbdt', n_estimators=2500,
                          learning_rate=0.05, num_leaves=250, min_data_in_leaf=125,
                          bagging_fraction=0.901, max_depth=13, reg_alpha=2.5, reg_lambda=2.5,
                          min_split_gain=0.0001, min_child_weight=25, feature_fraction=0.5,
                          silent=-1, verbose=-1, n_jobs=-1)
    # Note: LightGBM 4.x removed verbose and early_stopping_rounds from fit(); there, pass
    # callbacks=[lightgbm.early_stopping(100), lightgbm.log_evaluation(250)] instead
    lgbm.fit(trainX, trainY, eval_set=[(trainX, trainY), (validX, validY)],
             eval_metric='auc', verbose=250, early_stopping_rounds=100)
    # Out-of-fold predictions: probability of the positive class for the validation rows
    oofPreds[validXId] = lgbm.predict_proba(validX, num_iteration=lgbm.best_iteration_)[:, 1]
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(validY, oofPreds[validXId])))
    print('Cleanup')
    del trainX, trainY, validX, validY
    gc.collect()
    # Test predictions are averaged over the folds
    subPreds += lgbm.predict_proba(test[features], num_iteration=lgbm.best_iteration_)[:, 1] / folds.n_splits
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = lgbm.feature_importances_
    fold_importance_df["fold"] = n_fold + 1
    featureImportanceDf = pd.concat([featureImportanceDf, fold_importance_df], axis=0)
    print('Cleanup. Post-Fold')
    del lgbm
    gc.collect()
print('Full AUC score %.6f' % roc_auc_score(labels, oofPreds))
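1. oofPreds = np.zeros(train.shape[0]) creates an all-zero array with one element per row of train; subPreds = np.zeros(test.shape[0]) creates an all-zero array with one element per row of test.
2. Both oofPreds and subPreds are of type numpy.ndarray, which suits roc_auc_score(), since it takes array-like arguments.
3. After training, we compute the AUC to check how well the classifier performs.
oofPreds[validXId] = lgbm.predict_proba(validX, num_iteration = lgbm.best_iteration_)[:, 1] stores, for each sample in the validation fold, the predicted probability of class 1 (the positive class). roc_auc_score(validY, oofPreds[validXId]) then computes the AUC from the validation labels and those predicted probabilities.
Saving the output files and the visualization (the plotting function is defined at the top of the code)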
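The out-of-fold bookkeeping described in these notes can be reproduced on a small synthetic problem. This is a sketch under clear assumptions: the data come from scikit-learn's make_classification rather than the competition files, and LogisticRegression is swapped in for LGBMClassifier only to keep the example light; the loop structure is the part that matters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the train/test features (not the competition data).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=6001)
X_train, y_train = X[:500], y[:500]
X_test = X[500:]

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=6001)
oof_preds = np.zeros(X_train.shape[0])  # one out-of-fold prediction per training row
sub_preds = np.zeros(X_test.shape[0])   # fold-averaged prediction per test row

for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X_train, y_train)):
    # LogisticRegression is a lightweight stand-in for LGBMClassifier here.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])
    # Column 1 of predict_proba is the probability of the positive class.
    oof_preds[valid_idx] = model.predict_proba(X_train[valid_idx])[:, 1]
    sub_preds += model.predict_proba(X_test)[:, 1] / folds.n_splits
    fold_auc = roc_auc_score(y_train[valid_idx], oof_preds[valid_idx])
    print('Fold %2d AUC : %.6f' % (n_fold + 1, fold_auc))

full_auc = roc_auc_score(y_train, oof_preds)
print('Full AUC score %.6f' % full_auc)
```

Every training row is predicted exactly once, by a model that never saw it, so the full AUC over oof_preds is an honest estimate; the test predictions are the plain average of the five fold models.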
displayImportances(featureImportanceDf, submissionFileName)
kaggleSubmission = pd.read_csv(dataFolder + 'sample_submission.csv')
kaggleSubmission['HasDetections'] = subPreds
kaggleSubmission.to_csv(submissionFileName + '.csv', index=False)
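What displayImportances does with the per-fold importances before plotting can be seen on a toy table. This is a sketch with invented numbers; importance_df and mean_importance are hypothetical names, and only the groupby step is shown, not the plot.

```python
import pandas as pd

# Invented importances for two features over two folds.
importance_df = pd.concat([
    pd.DataFrame({'feature': ['AppVersion', 'SmartScreen'],
                  'importance': [120, 300], 'fold': 1}),
    pd.DataFrame({'feature': ['AppVersion', 'SmartScreen'],
                  'importance': [100, 340], 'fold': 2}),
], axis=0)

# Average each feature's importance over the folds, largest first.
mean_importance = (importance_df[['feature', 'importance']]
                   .groupby('feature')
                   .mean()
                   .sort_values(by='importance', ascending=False))
print(mean_importance)
```

Averaging over folds smooths out the fold-to-fold noise in the importance of any single model.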
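Summary
That is the whole of "7th-place-solution-microsoft-malware-prediction", the seventh-place code from Kaggle's Microsoft malware detection competition. I hope the post helps you solve the problems you ran into.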