
"7th-place-solution-microsoft-malware-prediction": the 7th-place code from Kaggle's Microsoft Malware Prediction competition

Published: 2025/3/15


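Code source: (link missing)

Preface

Reading other people's excellent code is a good way to improve your own. You pick up programming knowledge, borrow good habits, and learn techniques you would not have come up with yourself. This post collects my personal notes and opinions on the 7th-place entry in Microsoft's 2019 malware detection competition. Some of the code already carries my comments, and I add further explanation alongside it. My own ability is limited, so mistakes are inevitable; corrections from fellow enthusiasts are welcome.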

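Main text

Overview

As is well known, building a machine learning classifier consists of two main parts: (1) data preprocessing (data cleaning, feature engineering, and so on) and (2) model building (training and hyperparameter tuning). Preprocessing comes first, and the quality of the training data largely determines the quality of the final model, so most of the code in a typical machine learning project deals with data. This code is no exception. In my opinion its data processing is not outstanding, but it is acceptable (for more interesting preprocessing code, see another of my posts). The machine learning algorithm used here is LightGBM.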

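Code walkthrough

Notes:
I explain the code piece by piece. Because of hardware limits I cannot show the output of each step; readers who have the resources can copy the code and run it themselves. The data files can be downloaded from the official competition site. Although the explanation proceeds step by step, joining the pieces from top to bottom gives the complete program.

Data preprocessing

Importing libraries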

# imports
import numpy as np
import pandas as pd
import gc      # Python's garbage collection interface
import time    # apparently unused in this code
import random  # random numbers
from lightgbm import LGBMClassifier  # LightGBM classifier
from sklearn.metrics import roc_auc_score, roc_curve  # AUC / ROC: measures of a classifier's ranking ability
from sklearn.model_selection import StratifiedKFold  # stratified split into training and validation folds
import matplotlib.pyplot as plot  # visualization
import seaborn as sb  # visualization

Preparation before the main logic

# vars
dataFolder = '../input/'
submissionFileName = 'submission'
trainFile = 'train.csv'
testFile = 'test.csv'

# used 4000000 nr of rows instead of 8000000 because of Kernel memory issue
numberOfRows = 4000000

seed = 6001
np.random.seed(seed)
random.seed(seed)

def displayImportances(featureImportanceDf, submissionFileName):
    # Average each feature's importance over the folds, sort in descending order,
    # and keep the feature names (the index of the result) in cols
    cols = featureImportanceDf[["feature", "importance"]].groupby("feature").mean().sort_values(by = "importance", ascending = False).index
    # .loc accepts a boolean mask as well as labels; isin() takes a list-like
    # and returns True for each row whose feature name is in it
    bestFeatures = featureImportanceDf.loc[featureImportanceDf.feature.isin(cols)]
    plot.figure(figsize = (14, 14))
    sb.barplot(x = "importance", y = "feature", data = bestFeatures.sort_values(by = "importance", ascending = False))
    plot.title('LightGBM Features')
    plot.tight_layout()
    plot.savefig(submissionFileName + '.png')

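In this block I don't think the paths really need to be split across several variables (perhaps that is just the author's habit). The role of numberOfRows = 4000000 only becomes clear from the rest of the code: the author reads just the first 4,000,000 rows of the official train.csv as training data (later divided into training and validation folds), then concatenates them with the whole of test.csv so that both are preprocessed together. seed = 6001 and the two calls after it fix the random seed. Both calls are needed because np.random.seed(seed) seeds NumPy's global generator, while random.seed(seed) seeds the separate generator in Python's standard random module; seeding one has no effect on the other. The helper function is used at the very end to plot the feature importances and save the chart as a PNG file.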
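To illustrate why both seeding calls appear, here is a minimal sketch showing that the two generators keep independent state. The helper draw() is a hypothetical name introduced only for this illustration.

```python
import random
import numpy as np

seed = 6001

def draw():
    """Return one draw from each generator (hypothetical helper)."""
    return random.random(), float(np.random.rand())

# Seed both generators, draw, then reseed both and draw again
random.seed(seed)
np.random.seed(seed)
first = draw()

random.seed(seed)
np.random.seed(seed)
second = draw()
print(first == second)  # both streams were reset, so the draws repeat

# Reseed only NumPy: the standard-library stream simply keeps advancing
np.random.seed(seed)
third = draw()
print(third[1] == first[1])  # NumPy's stream was reset
print(third[0] == first[0])  # random's stream was not
```

Any library that draws from only one of the two generators is covered by only one of the calls, which is why reproducible scripts usually seed both.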

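Setting a type for each feature in the official files
The raw data contains only the feature values; the organizers do not state what type each column is, so we have to specify the types ourselves. Choosing compact types such as int8, float16, and category also keeps memory usage down.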

dtypes = {
    'MachineIdentifier': 'category',
    'ProductName': 'category',
    'EngineVersion': 'category',
    'AppVersion': 'category',
    'AvSigVersion': 'category',
    'IsBeta': 'int8',
    'RtpStateBitfield': 'float16',
    'IsSxsPassiveMode': 'int8',
    'DefaultBrowsersIdentifier': 'float16',
    'AVProductStatesIdentifier': 'float32',
    'AVProductsInstalled': 'float16',
    'AVProductsEnabled': 'float16',
    'HasTpm': 'int8',
    'CountryIdentifier': 'int16',
    'CityIdentifier': 'float32',
    'OrganizationIdentifier': 'float16',
    'GeoNameIdentifier': 'float16',
    'LocaleEnglishNameIdentifier': 'int8',
    'Platform': 'category',
    'Processor': 'category',
    'OsVer': 'category',
    'OsBuild': 'int16',
    'OsSuite': 'int16',
    'OsPlatformSubRelease': 'category',
    'OsBuildLab': 'category',
    'SkuEdition': 'category',
    'IsProtected': 'float16',
    'AutoSampleOptIn': 'int8',
    'PuaMode': 'category',
    'SMode': 'float16',
    'IeVerIdentifier': 'float16',
    'SmartScreen': 'category',
    'Firewall': 'float16',
    'UacLuaenable': 'float32',
    'Census_MDC2FormFactor': 'category',
    'Census_DeviceFamily': 'category',
    'Census_OEMNameIdentifier': 'float16',
    'Census_OEMModelIdentifier': 'float32',
    'Census_ProcessorCoreCount': 'float16',
    'Census_ProcessorManufacturerIdentifier': 'float16',
    'Census_ProcessorModelIdentifier': 'float16',
    'Census_ProcessorClass': 'category',
    'Census_PrimaryDiskTotalCapacity': 'float32',
    'Census_PrimaryDiskTypeName': 'category',
    'Census_SystemVolumeTotalCapacity': 'float32',
    'Census_HasOpticalDiskDrive': 'int8',
    'Census_TotalPhysicalRAM': 'float32',
    'Census_ChassisTypeName': 'category',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float16',
    'Census_InternalPrimaryDisplayResolutionHorizontal': 'float16',
    'Census_InternalPrimaryDisplayResolutionVertical': 'float16',
    'Census_PowerPlatformRoleName': 'category',
    'Census_InternalBatteryType': 'category',
    'Census_InternalBatteryNumberOfCharges': 'float32',
    'Census_OSVersion': 'category',
    'Census_OSArchitecture': 'category',
    'Census_OSBranch': 'category',
    'Census_OSBuildNumber': 'int16',
    'Census_OSBuildRevision': 'int32',
    'Census_OSEdition': 'category',
    'Census_OSSkuName': 'category',
    'Census_OSInstallTypeName': 'category',
    'Census_OSInstallLanguageIdentifier': 'float16',
    'Census_OSUILocaleIdentifier': 'int16',
    'Census_OSWUAutoUpdateOptionsName': 'category',
    'Census_IsPortableOperatingSystem': 'int8',
    'Census_GenuineStateName': 'category',
    'Census_ActivationChannel': 'category',
    'Census_IsFlightingInternal': 'float16',
    'Census_IsFlightsDisabled': 'float16',
    'Census_FlightRing': 'category',
    'Census_ThresholdOptIn': 'float16',
    'Census_FirmwareManufacturerIdentifier': 'float16',
    'Census_FirmwareVersionIdentifier': 'float32',
    'Census_IsSecureBootEnabled': 'int8',
    'Census_IsWIMBootEnabled': 'float16',
    'Census_IsVirtualDevice': 'float16',
    'Census_IsTouchEnabled': 'int8',
    'Census_IsPenCapable': 'int8',
    'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',
    'Wdft_IsGamer': 'float16',
    'Wdft_RegionIdentifier': 'float16',
    'HasDetections': 'int8',
}
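To show what an explicit dtype mapping buys, the following sketch reads the same small CSV twice, once with inferred types and once with compact ones, and compares memory use. The data is generated on the spot and is not the competition file; toy_dtypes is a hypothetical three-column stand-in for the full mapping.

```python
import io
import numpy as np
import pandas as pd

# Build a small CSV in memory (toy data, not the competition file)
rows = 1000
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "IsBeta": rng.integers(0, 2, rows),
    "OsBuild": rng.integers(10000, 20000, rows),
    "Platform": rng.choice(["windows10", "windows8", "windows7"], rows),
})
csv_text = toy.to_csv(index=False)

# First let pandas infer the types, then read again with explicit compact types
inferred = pd.read_csv(io.StringIO(csv_text))
toy_dtypes = {"IsBeta": "int8", "OsBuild": "int16", "Platform": "category"}
typed = pd.read_csv(io.StringIO(csv_text), dtype=toy_dtypes)

mem_inferred = inferred.memory_usage(deep=True).sum()
mem_typed = typed.memory_usage(deep=True).sum()
print(inferred.dtypes.to_dict())
print(typed.dtypes.to_dict())
print(mem_inferred, mem_typed)  # the typed frame needs noticeably fewer bytes
```

With millions of rows and dozens of columns the same effect decides whether the data fits in a kernel's memory at all.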

Feature selection

selectedFeatures = [
    'AVProductStatesIdentifier',
    'AVProductsEnabled',
    'IsProtected',
    'Processor',
    'OsSuite',
    'RtpStateBitfield',
    'AVProductsInstalled',
    'Wdft_IsGamer',
    'DefaultBrowsersIdentifier',
    'OsBuild',
    'Wdft_RegionIdentifier',
    'SmartScreen',
    'CityIdentifier',
    'AppVersion',
    'Census_IsSecureBootEnabled',
    'Census_PrimaryDiskTypeName',
    'Census_SystemVolumeTotalCapacity',
    'Census_HasOpticalDiskDrive',
    'Census_IsWIMBootEnabled',
    'Census_IsVirtualDevice',
    'Census_IsTouchEnabled',
    'Census_FirmwareVersionIdentifier',
    'GeoNameIdentifier',
    'IeVerIdentifier',
    'Census_FirmwareManufacturerIdentifier',
    'Census_InternalPrimaryDisplayResolutionHorizontal',
    'Census_InternalPrimaryDisplayResolutionVertical',
    'Census_OEMModelIdentifier',
    'Census_ProcessorModelIdentifier',
    'Census_OSVersion',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches',
    'Census_OEMNameIdentifier',
    'Census_ChassisTypeName',
    'Census_OSInstallLanguageIdentifier',
    'EngineVersion',
    'OrganizationIdentifier',
    'CountryIdentifier',
    'Census_ActivationChannel',
    'Census_ProcessorCoreCount',
    'Census_OSWUAutoUpdateOptionsName',
    'Census_InternalBatteryType',
]

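The author presumably has very deep experience in data processing and probably chose these features directly, drawing on earlier work with malware data. The lesson is that features should not be picked at random; without that level of experience, it is better to borrow an established preprocessing approach for feature selection.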

Loading the data

# Load Data with selected features
trainDf = pd.read_csv(dataFolder + trainFile, dtype=dtypes, usecols=selectedFeatures, low_memory=True, nrows = numberOfRows)  # training set
labels = pd.read_csv(dataFolder + trainFile, usecols = ['HasDetections'], nrows = numberOfRows)  # labels
testDf = pd.read_csv(dataFolder + testFile, dtype=dtypes, usecols=selectedFeatures, low_memory=True)  # test set

print('== Dataset Shapes ==')
print('Train : ' + str(trainDf.shape))  # trainDf.shape is a tuple
print('Labels : ' + str(labels.shape))
print('Test : ' + str(testDf.shape))

# Append Datasets and Cleanup
# Stack train on top of test. reset_index() adds an 'index' column that keeps each row's original index.
# The original kernel called trainDf.append(testDf); DataFrame.append was removed in pandas 2.0,
# and pd.concat does the same vertical concatenation.
df = pd.concat([trainDf, testDf]).reset_index()
del trainDf, testDf  # drop the two source frames to save memory
gc.collect()

df is the new DataFrame obtained by concatenating train and test.
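A minimal sketch of this concatenation step on toy data may help. The frames train_part and test_part and their values are hypothetical; the point is the extra 'index' column and the fact that row order is preserved.

```python
import pandas as pd

# Toy stand-ins for the training and test frames (hypothetical values)
train_part = pd.DataFrame({"OsBuild": [17134, 16299]})
test_part = pd.DataFrame({"OsBuild": [14393]})

# Stack vertically; reset_index() moves the old row labels into an 'index' column
combined = pd.concat([train_part, test_part]).reset_index()
print(combined)

# Rows keep their order, so the first len(train_part) rows are the training rows
train_again = combined.iloc[:len(train_part)]
test_again = combined.iloc[len(train_part):]
print(train_again.shape, test_again.shape)
```

Because the training rows come first, a positional slice is enough to separate the two parts again after preprocessing.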

Cleaning the values of the 'SmartScreen' feature

# Modify SmartScreen Feature
df.loc[df.SmartScreen == 'off', 'SmartScreen'] = 'Off'  # df.SmartScreen == 'off' is the row condition
df.loc[df.SmartScreen == 'of', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'OFF', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == '00000000', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == '0', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'ON', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'on', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'Enabled', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'BLOCK', 'SmartScreen'] = 'Block'
df.loc[df.SmartScreen == 'requireadmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'requireAdmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'RequiredAdmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'Promt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'Promprt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'prompt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'warn', 'SmartScreen'] = 'Warn'
df.loc[df.SmartScreen == 'Deny', 'SmartScreen'] = 'Block'
df.loc[df.SmartScreen == '', 'SmartScreen'] = 'Off'

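Here we learn a way of selecting particular values within a feature: write a condition, and use the resulting boolean mask to pick out and overwrite the target values.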
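The sketch below shows the boolean-mask assignment on a toy column, followed by one possible alternative: passing a dictionary to Series.replace. The dictionary approach is my own illustration and not what the kernel does, and the fixes mapping covers only a few of the spellings.

```python
import pandas as pd

s = pd.DataFrame({"SmartScreen": ["off", "OFF", "on", "Promt", "Warn", "RequireAdmin"]})

# Boolean-mask assignment: the mask selects the rows, the column label selects the cell
mask = s.SmartScreen == "off"
s.loc[mask, "SmartScreen"] = "Off"

# Alternative (illustrative): a dictionary handles several spellings in one call
fixes = {"OFF": "Off", "on": "On", "Promt": "Prompt"}  # only a subset of the mappings
s["SmartScreen"] = s["SmartScreen"].replace(fixes)
print(s["SmartScreen"].tolist())
```

Values that are not keys of the dictionary, such as 'Warn' here, pass through unchanged.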

Counting how often each value of each feature occurs, then building a new DataFrame

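# Count Encoding (with exceptions)
for col in [f for f in df.columns if f not in ['index','HasDetections','Census_SystemVolumeTotalCapacity']]:
    df[col] = df[col].map(df[col].value_counts())  # replace each value in col with the number of times it occurs in col

dfDummy = pd.get_dummies(df, dummy_na=True)  # one-hot encode df; dummy_na=True adds an indicator column for missing values (NaN)
print('Dummy: ' + str(dfDummy.shape))

# Cleanup
del df
gc.collect()

# Split back into train and test.
# This step is missing from the listing, although train and test are used below.
# It is reconstructed here on the assumption that the training rows come first,
# as they do after the concatenation above; labels has one row per training row.
train = dfDummy.iloc[:len(labels)]
test = dfDummy.iloc[len(labels):]

# Summary Shape
print('== Dataset Shapes ==')
print('Train: ' + str(train.shape))
print('Test: ' + str(test.shape))

# Summary Columns
print('== Dataset Columns ==')
features = [f for f in train.columns if f not in ['index']]
for feature in features:
    print(feature)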

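df[col].map(df[col].value_counts()) uses .map() to put each value's occurrence count back in the position where the value itself was stored (if the functions are unfamiliar, I suggest looking them up; only the meaning of the code is given here). It is a neat line: a single statement replaces every value of a feature with the number of times that value occurs. Strictly speaking this is a count, not a frequency; the frequency would be the count divided by the total number of rows. Why encode the values this way? Count encoding turns every column, including categorical ones with many distinct values, into plain numbers that LightGBM's decision trees can split on, without creating one column per category.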

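About features: when we concatenated train and test above we called .reset_index(), which adds a new 'index' column holding the original index. That is why 'index' is excluded here with not in ['index'].

df[col]=df[col].map(df[col].value_counts())
This line is the trickiest one in the preprocessing step, so it is worth working through on a small example.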
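A small worked example of count encoding, on a made-up Processor column, could look like this. The column values and the extra Processor_count and Processor_freq column names are hypothetical; the kernel overwrites the original column instead of adding new ones.

```python
import pandas as pd

toy = pd.DataFrame({"Processor": ["x64", "x86", "x64", "arm64", "x64", "x86"]})

# value_counts() returns a Series indexed by value: x64 -> 3, x86 -> 2, arm64 -> 1
counts = toy["Processor"].value_counts()

# map() looks each cell up in that Series and returns the matching count
toy["Processor_count"] = toy["Processor"].map(counts)

# Dividing by the number of rows gives the frequency instead of the count
freqs = toy["Processor"].value_counts(normalize=True)
toy["Processor_freq"] = toy["Processor"].map(freqs)
print(toy)
```

Here 'x64' appears three times in six rows, so every 'x64' cell becomes 3 in the count column and 0.5 in the frequency column.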

Building the machine learning model

Training module

# CV Folds
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = seed)

# Create arrays and dataframes to store results
oofPreds = np.zeros(train.shape[0])  # numpy.ndarray
subPreds = np.zeros(test.shape[0])   # numpy.ndarray
featureImportanceDf = pd.DataFrame()

# Loop through all Folds.
# enumerate pairs each split with its position; there are 5 (position, split) tuples because n_splits = 5
for n_fold, (trainXId, validXId) in enumerate(folds.split(train[features], labels)):
    # Create TrainXY and ValidationXY set based on fold-indexes
    trainX, trainY = train[features].iloc[trainXId], labels.iloc[trainXId]
    validX, validY = train[features].iloc[validXId], labels.iloc[validXId]

    print('== Fold: ' + str(n_fold))  # str() is required because + cannot join a str and an int

    # LightGBM parameters
    lgbm = LGBMClassifier(
        objective = 'binary',
        boosting_type = 'gbdt',
        n_estimators = 2500,
        learning_rate = 0.05,
        num_leaves = 250,
        min_data_in_leaf = 125,
        bagging_fraction = 0.901,
        max_depth = 13,
        reg_alpha = 2.5,
        reg_lambda = 2.5,
        min_split_gain = 0.0001,
        min_child_weight = 25,
        feature_fraction = 0.5,
        silent = -1,
        verbose = -1,
        #n_jobs is set to -1 instead of 4 otherwise the kernel will time out
        n_jobs = -1)

    # The verbose and early_stopping_rounds arguments of fit() belong to the LightGBM release
    # the kernel ran on; LightGBM 4.0 and later expect callbacks for logging and early stopping.
    lgbm.fit(trainX, trainY,
             eval_set = [(trainX, trainY), (validX, validY)],
             eval_metric = 'auc',
             verbose = 250,
             early_stopping_rounds = 100)

    # Score the fold: predicted probability of the positive class on the validation rows,
    # compared against the true validation labels through the AUC
    oofPreds[validXId] = lgbm.predict_proba(validX, num_iteration = lgbm.best_iteration_)[:, 1]
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(validY, oofPreds[validXId])))

    # cleanup
    print('Cleanup')
    del trainX, trainY, validX, validY
    gc.collect()

    # Predict the test set and accumulate the positive-class probability;
    # dividing by folds.n_splits (5) makes the final value the average over the folds
    subPreds += lgbm.predict_proba(test[features], num_iteration = lgbm.best_iteration_)[:, 1] / folds.n_splits

    # Feature Importance
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = lgbm.feature_importances_  # the larger the value, the more important the feature
    fold_importance_df["fold"] = n_fold + 1
    featureImportanceDf = pd.concat([featureImportanceDf, fold_importance_df], axis=0)  # stack vertically, keeping the original index

    # cleanup
    print('Cleanup. Post-Fold')
    del lgbm
    gc.collect()

print('Full AUC score %.6f' % roc_auc_score(labels, oofPreds))  # AUC over all training rows

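1. oofPreds = np.zeros(train.shape[0]) creates an array of zeros with one element per row of train.
subPreds = np.zeros(test.shape[0]) creates an array of zeros with one element per row of test.

2. Both oofPreds and subPreds are numpy.ndarray objects. That lets each fold write its predictions into place with an index array (oofPreds[validXId] = ...), and roc_auc_score() accepts such arrays directly.

3. After training, we compute the AUC to check the quality of the classifier.

oofPreds[validXId] = lgbm.predict_proba(validX, num_iteration = lgbm.best_iteration_)[:, 1] stores, for each validation sample, the predicted probability of class 1 (the positive class).
roc_auc_score(validY, oofPreds[validXId]) computes the AUC from the validation labels and those predicted probabilities.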
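The out-of-fold pattern described in these notes can be sketched on synthetic data as follows. To keep the sketch small, scikit-learn's LogisticRegression stands in for LGBMClassifier, and the data comes from make_classification; neither is part of the kernel, and the snake_case variable names are my own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins for train, labels and test (assumption: not competition data)
X, y = make_classification(n_samples=600, n_features=8, random_state=6001)
X_train, y_train, X_test = X[:500], y[:500], X[500:]

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=6001)
oof_preds = np.zeros(len(X_train))  # one slot per training row
sub_preds = np.zeros(len(X_test))   # one slot per test row

for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X_train, y_train)):
    model = LogisticRegression(max_iter=1000)  # stand-in for LGBMClassifier
    model.fit(X_train[train_idx], y_train[train_idx])

    # Each training row is predicted exactly once, by the model that did not see it
    oof_preds[valid_idx] = model.predict_proba(X_train[valid_idx])[:, 1]

    # Test predictions are averaged over the folds
    sub_preds += model.predict_proba(X_test)[:, 1] / folds.n_splits

    fold_auc = roc_auc_score(y_train[valid_idx], oof_preds[valid_idx])
    print('Fold %d AUC: %.4f' % (n_fold + 1, fold_auc))

full_auc = roc_auc_score(y_train, oof_preds)
print('Full AUC: %.4f' % full_auc)
```

The full AUC is computed once over all out-of-fold predictions, which is why the array has to be filled by index rather than appended to.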

Saving files and visualization (the plotting function is defined at the top of the code)

# Feature Importance
displayImportances(featureImportanceDf, submissionFileName)

# Generate Submission
kaggleSubmission = pd.read_csv(dataFolder + 'sample_submission.csv')
kaggleSubmission['HasDetections'] = subPreds
kaggleSubmission.to_csv(submissionFileName + '.csv', index = False)
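The ranking that displayImportances builds before plotting can be sketched without any plotting library. The importance values below are invented, and the final slice to the top two features is my own illustration; the function in this walkthrough keeps every feature.

```python
import pandas as pd

# Hypothetical importances collected from two folds
importance_df = pd.DataFrame({
    "feature": ["SmartScreen", "AppVersion", "OsBuild",
                "SmartScreen", "AppVersion", "OsBuild"],
    "importance": [120, 80, 30, 100, 90, 50],
    "fold": [1, 1, 1, 2, 2, 2],
})

# Average over folds, then sort so the most important feature comes first
mean_importance = (
    importance_df[["feature", "importance"]]
    .groupby("feature")
    .mean()
    .sort_values(by="importance", ascending=False)
)
ranked = mean_importance.index.tolist()
print(ranked)

# Illustrative only: keep the rows of the two best features for a shorter chart
top2 = importance_df.loc[importance_df.feature.isin(ranked[:2])]
print(top2)
```

Restricting the plot to the leading features keeps the chart readable when one-hot encoding has produced many columns.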

