
Kaggle Santander Customer Transaction Prediction: Summary

Published: 2025/3/21 · 豆豆


1 Plotting

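sns.kdeplot(): draws a kernel density estimate (KDE) plot.
sns.distplot(): combines matplotlib's hist() with the kernel density estimate of kdeplot. (distplot has been deprecated since seaborn 0.11; histplot and displot are its replacements.)
See also: "Seaborn入門系列之kdeplot和distplot" (Seaborn introductory series: kdeplot and distplot)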
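To make clear what a KDE plot actually draws, here is a minimal NumPy sketch of a Gaussian kernel density estimate. It is not seaborn's implementation: the function name `gaussian_kde_1d`, the fixed bandwidth of 0.3, and the synthetic feature values are all assumptions made for illustration (seaborn chooses its bandwidth automatically).

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One standardized distance per (grid point, sample) pair
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and rescale by the bandwidth so the curve integrates to 1
    return kernels.mean(axis=1) / bandwidth


rng = np.random.default_rng(0)
# Hypothetical values of one feature, split by target class
feature_target0 = rng.normal(loc=0.0, scale=1.0, size=500)
feature_target1 = rng.normal(loc=2.0, scale=1.0, size=500)

grid = np.linspace(-6.0, 8.0, 400)
density0 = gaussian_kde_1d(feature_target0, grid, bandwidth=0.3)
density1 = gaussian_kde_1d(feature_target1, grid, bandwidth=0.3)
```

Plotting `density0` and `density1` against `grid` gives the kind of per-class comparison that kdeplot is typically used for in this competition: a feature whose two curves differ visibly carries some signal about the target.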

2 Permutation Importance

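When building tree-based models (XGBoost, LightGBM, and so on), we often want to know which variables matter most. The model's feature_importances_ attribute provides this. In LightGBM, for example, feature_importances_ can be measured either by the number of times a feature is used to split or by the total gain of the splits on that feature (the importance_type setting, 'split' or 'gain'). Different criteria generally rank features differently, so I usually cross-check several criteria when selecting features. If a feature ranks as important under every criterion, it is likely to have good predictive power for the label.
Permutation importance applies a different test: randomly shuffle the values of one feature and measure the model again. If performance drops sharply, the feature is important; if it barely changes, the feature is not.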

    import eli5
    import lightgbm as lgb
    import pandas as pd
    from eli5.sklearn import PermutationImportance
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression


    def PermutationImportance_(clf, X_train, y_train, X_valid, X_test):
        # Estimate permutation importance with 5-fold CV and 5 shuffles per feature
        perm = PermutationImportance(clf, n_iter=5, random_state=1024, cv=5)
        perm.fit(X_train, y_train)

        result_ = {'var': X_train.columns.values,
                   'feature_importances_': perm.feature_importances_,
                   'feature_importances_std_': perm.feature_importances_std_}
        feature_importances_ = pd.DataFrame(
            result_, columns=['var', 'feature_importances_', 'feature_importances_std_'])
        feature_importances_ = feature_importances_.sort_values('feature_importances_', ascending=False)
        # eli5.show_weights(perm, feature_names=X_train.columns.tolist(), top=500)  # visualize the result

        # Keep the features whose importance is at least the threshold
        sel = SelectFromModel(perm, threshold=0.00, prefit=True)
        X_train_ = sel.transform(X_train)
        X_valid_ = sel.transform(X_valid)
        X_test_ = sel.transform(X_test)
        # The original returned X_test here, discarding the selected X_test_
        return feature_importances_, X_train_, X_valid_, X_test_


    # Run permutation importance with three different model families
    model_1 = RandomForestClassifier(random_state=1024)
    feature_importances_1, X_train_1, X_valid_1, X_test_1 = PermutationImportance_(model_1, X_train, y_train, X_valid, X_test)

    model_2 = lgb.LGBMClassifier(objective='binary', random_state=1024)
    feature_importances_2, X_train_2, X_valid_2, X_test_2 = PermutationImportance_(model_2, X_train, y_train, X_valid, X_test)

    model_3 = LogisticRegression(random_state=1024)
    feature_importances_3, X_train_3, X_valid_3, X_test_3 = PermutationImportance_(model_3, X_train, y_train, X_valid, X_test)
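The shuffling idea itself needs no library. The following is a minimal sketch, not eli5's implementation: the function `permutation_importance`, the toy data, and the stand-in `toy_predict` are all hypothetical, and the score is plain accuracy measured on the same data rather than on cross-validation folds.

```python
import numpy as np


def permutation_importance(predict, X, y, n_iter=5, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_iter):
            X_shuffled = X.copy()
            # Shuffling breaks the link between this feature and the label
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops.append(baseline - np.mean(predict(X_shuffled) == y))
        importances[col] = np.mean(drops)
    return importances


# Toy data: the label depends only on column 0; column 1 is pure noise
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)


def toy_predict(X):
    # Hypothetical stand-in for a fitted model's predict()
    return (X[:, 0] > 0).astype(int)


importances = permutation_importance(toy_predict, X, y)
print(importances)
```

Shuffling column 0 costs the toy model roughly half of its accuracy, while shuffling column 1 costs nothing, because the prediction never reads that column.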

3 Partial Dependence Plots

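A partial dependence plot shows how each variable (predictor) affects the model's predictions. It is useful for questions such as:

How much of the wage difference between men and women is due solely to gender, rather than to differences in education or work experience?

Controlling for house characteristics, how do longitude and latitude affect house prices? To restate: we want to know how houses of the same size would be priced in different areas, even though the houses actually found in those areas differ in size.

Is a health difference between two groups due to differences in diet, or to some other factor?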

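    # Plot partial dependence to see how the target y relates to each variable.
    # The original import, sklearn.ensemble.partial_dependence.plot_partial_dependence,
    # has been removed from scikit-learn; recent releases provide PartialDependenceDisplay.
    from sklearn.inspection import PartialDependenceDisplay

    # my_model, clo_to_use and imputed_X are defined elsewhere (not shown in this post)
    my_plots = PartialDependenceDisplay.from_estimator(
        my_model, imputed_X, features=[0, 2], feature_names=clo_to_use)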
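What such a plot computes can be sketched by hand: force one feature to a fixed value for every row, average the model's predictions, and repeat over a grid of values. The code below is an illustrative sketch only; `partial_dependence_1d` and `toy_predict` are hypothetical names, and the "model" is a known formula so the result can be checked.

```python
import numpy as np


def partial_dependence_1d(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every row
        averages.append(predict(X_mod).mean())
    return np.array(averages)


def toy_predict(X):
    # Hypothetical stand-in for a fitted model: y = 3 * x0 + x1 ** 2
    return 3.0 * X[:, 0] + X[:, 1] ** 2


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
grid = np.linspace(-1.0, 1.0, 5)
pd_x0 = partial_dependence_1d(toy_predict, X, feature=0, grid=grid)
print(pd_x0)
```

Because the toy model is linear in x0 with slope 3, the resulting curve is a straight line with slope 3; the contribution of x1 is averaged out into a constant offset.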

4 tqdm

from tqdm import tqdm_notebook as tqdm

tqdm is a fast, extensible progress bar for Python. It adds a progress indicator to long-running loops; you only need to wrap any iterable: tqdm(iterator).
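A minimal usage sketch, assuming the plain console variant of tqdm rather than the notebook widget (in current tqdm releases the notebook variant is imported with `from tqdm.notebook import tqdm`). The loop body and the `desc` label are placeholders.

```python
from tqdm import tqdm

total = 0
# Wrapping the iterable is the only change needed to get a progress bar
for i in tqdm(range(1000), desc="summing"):
    total += i
```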

5 Feature Engineering

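For each column of the test set, find the values that occur exactly once in that column and mark them with 1.
If a sample contains at least one such unique value, treat it as a real sample; if none of its feature values is unique, treat it as a fake (synthetic) sample.
Concatenate the real test samples with the actual training samples.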

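    import numpy as np

    # df_test is assumed to be a 2-D NumPy array of the test features
    # (for example the test DataFrame's .values with the ID column dropped)
    unique_samples = []
    unique_count = np.zeros_like(df_test)
    for feature in range(df_test.shape[1]):
        _, index_, count_ = np.unique(df_test[:, feature], return_counts=True, return_index=True)
        # Mark the rows holding values that appear exactly once in this column
        unique_count[index_[count_ == 1], feature] += 1

    # Samples which have unique values are real, the others are fake
    real_samples_indexes = np.argwhere(np.sum(unique_count, axis=1) > 0)[:, 0]
    synthetic_samples_indexes = np.argwhere(np.sum(unique_count, axis=1) == 0)[:, 0]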
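A tiny worked example may make the rule easier to follow. The 4 x 2 array `toy_test` below is hypothetical; the loop is the same as in the code above.

```python
import numpy as np

# Hypothetical test set: 4 samples, 2 features
toy_test = np.array([
    [1.0, 5.0],
    [1.0, 6.0],
    [2.0, 5.0],
    [1.0, 5.0],
])

unique_count = np.zeros_like(toy_test)
for feature in range(toy_test.shape[1]):
    _, index_, count_ = np.unique(toy_test[:, feature], return_counts=True, return_index=True)
    # index_ holds the first occurrence of each distinct value;
    # keep only the values that were seen exactly once
    unique_count[index_[count_ == 1], feature] += 1

real_rows = np.argwhere(unique_count.sum(axis=1) > 0)[:, 0]
synthetic_rows = np.argwhere(unique_count.sum(axis=1) == 0)[:, 0]
print(real_rows, synthetic_rows)  # [1 2] [0 3]
```

Row 1 holds the only 6.0 in the second column and row 2 holds the only 2.0 in the first column, so both are flagged as real. Rows 0 and 3 contain only values that also appear elsewhere, so they are flagged as synthetic.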
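
"vc" column: the number of times the value occurs in that feature, capped at 10.
"sum" column: for values that occur more than once, the original value minus the feature mean; 0 for values that occur only once. ("sum2" and "sum3" keep the raw value when the count exceeds 2 and 4 respectively.)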

    import numpy as np
    import pandas as pd

    # df is assumed to hold the training rows plus the real test rows, and
    # feats the list of raw feature columns; neither is defined in this post.
    for feat in feats:
        temp = df[feat].value_counts(dropna=True)
        # "vc": occurrence count of each value, capped at 10
        df_train[feat + "vc"] = df_train[feat].map(temp).map(lambda x: min(10, x)).astype(np.uint8)
        df_test[feat + "vc"] = df_test[feat].map(temp).map(lambda x: min(10, x)).astype(np.uint8)
        print(feat, temp.shape[0],
              df_train[feat + "vc"].map(lambda x: int(x > 2)).sum(),
              df_train[feat + "vc"].map(lambda x: int(x > 3)).sum())
        # "sum": centered value where the value occurs more than once, else 0
        df_train[feat + "sum"] = ((df_train[feat] - df[feat].mean()) * df_train[feat + "vc"].map(lambda x: int(x > 1))).astype(np.float32)
        df_test[feat + "sum"] = ((df_test[feat] - df[feat].mean()) * df_test[feat + "vc"].map(lambda x: int(x > 1))).astype(np.float32)
        # "sum2" / "sum3": raw value where the count exceeds 2 / 4, else 0
        df_train[feat + "sum2"] = (df_train[feat] * df_train[feat + "vc"].map(lambda x: int(x > 2))).astype(np.float32)
        df_test[feat + "sum2"] = (df_test[feat] * df_test[feat + "vc"].map(lambda x: int(x > 2))).astype(np.float32)
        df_train[feat + "sum3"] = (df_train[feat] * df_train[feat + "vc"].map(lambda x: int(x > 4))).astype(np.float32)
        df_test[feat + "sum3"] = (df_test[feat] * df_test[feat + "vc"].map(lambda x: int(x > 4))).astype(np.float32)

    # FREQUENCY ENCODE
    def encode_FE(df, col, test):
        cv = df[col].value_counts()
        nm = col + '_FE'
        df[nm] = df[col].map(cv)
        # Test values never seen in df get a count of 0
        test[nm] = test[col].map(cv).fillna(0)
        if cv.max() <= 255:
            df[nm] = df[nm].astype('uint8')
            test[nm] = test[nm].astype('uint8')
        else:
            df[nm] = df[nm].astype('uint16')
            test[nm] = test[nm].astype('uint16')
        return


    test['target'] = -1
    # Count frequencies over the training rows plus the real test rows only
    comb = pd.concat([train, test.loc[real_samples_indexes]], axis=0, sort=True)
    for i in range(200):
        encode_FE(comb, 'var_' + str(i), test)
    train = comb[:len(train)]
    del comb
    print('Added 200 new magic features!')
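The "vc" and "sum" features can be checked on a toy column. This is a sketch under assumptions: the DataFrame `combined` and its values are made up, and `clip(upper=10)` is used as an equivalent of the `min(10, x)` lambda above.

```python
import numpy as np
import pandas as pd

# Hypothetical single feature taken from train plus the real test rows
combined = pd.DataFrame({"var_0": [0.5, 0.5, 1.2, 0.5, 3.4, 1.2]})

counts = combined["var_0"].value_counts()
# "vc": how many times each value occurs, capped at 10
combined["var_0vc"] = combined["var_0"].map(counts).clip(upper=10).astype(np.uint8)
# "sum": centered value for repeated values, 0 for values seen only once
repeated = (combined["var_0vc"] > 1).astype(int)
combined["var_0sum"] = ((combined["var_0"] - combined["var_0"].mean()) * repeated).astype(np.float32)
print(combined)
```

The value 3.4 occurs once, so its "vc" is 1 and its "sum" is 0; the repeated values 0.5 and 1.2 keep their distance from the column mean. The count itself is the signal here, which is why it has to be computed without the synthetic test rows.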
