Data Mining Competition Template

Published: 2025/3/21

Santander ML Explainability

  • 1 Preparation
    • 1-1 Imports
    • 1-2 Settings
    • 1-3 Versions
  • 2 Problem description
  • 3 EDA
    • 3-1 Data collection
      • Reducing memory usage
      • 3-1-1 Dataset fields
      • 3-2-2 Numerical description
    • 3-2 Visualization
      • 3-2-1 Histogram
      • 3-2-2 Mean frequency
      • 3-2-3 countplot
      • 3-2-4 hist
      • 3-2-5 distplot
      • 3-2-6 Scatter plots
    • 3-3 Data preprocessing
    • 3-3-1 Missing-value check
    • 3-3-2 Binarization
    • 3-3-3 Class balance
    • 3-3-4 Skewness and kurtosis
  • 4 Feature engineering
    • 4-1 Permutation Importance
    • 4-2 How to compute and display importance
    • 4-4 Partial Dependence Plots
    • 4-5 Partial Dependence Plot
    • 4-6 Chart analysis
    • 4-7 SHAP Values
    • 4-7 Correlation between features
  • 5 Models
    • Preparation: Augment
    • 5-1 lightgbm
    • 5-2 RandomForestClassifier
    • 5-3 DecisionTreeClassifier
    • 5-4 Logistic Regression
  • 6 Submission

1 Preparation

1-1 Imports

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier, Pool
from IPython.display import display
import matplotlib.patches as patch
import matplotlib.pyplot as plt
from sklearn.svm import NuSVR
from scipy.stats import norm
from sklearn import svm
import lightgbm as lgb
import xgboost as xgb
import seaborn as sns
import pandas as pd
import numpy as np
import warnings
import time
import glob
import sys
import os
import gc

1-2 Settings

# Use fold_n = 5 for better results
fold_n = 5
folds = StratifiedKFold(n_splits=fold_n, shuffle=True, random_state=10)
%matplotlib inline
%precision 4
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
np.set_printoptions(suppress=True)
pd.set_option("display.precision", 15)

1-3 Versions

print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))

2 Problem description

3 EDA

3-1 Data collection

print(os.listdir("../input/"))
train = pd.read_csv("../input/train.csv")
test = pd.read_csv('../input/test.csv')
sample_submission = pd.read_csv('../input/sample_submission.csv')
sample_submission.head()
train.shape, test.shape, sample_submission.shape
train.head(5)

Reducing memory usage

def reduce_mem_usage(df):
    start_mem_usg = df.memory_usage().sum() / 1024**2
    print("Memory usage of properties dataframe is :", start_mem_usg, " MB")
    NAlist = []  # Keeps track of columns that have missing values filled in.
    for col in df.columns:
        if df[col].dtype != object:  # Exclude strings
            # Print current column type
            print("******************************")
            print("Column: ", col)
            print("dtype before: ", df[col].dtype)
            # Make variables for Int, max and min
            IsInt = False
            mx = df[col].max()
            mn = df[col].min()
            # Integer does not support NA, therefore NA needs to be filled
            if not np.isfinite(df[col]).all():
                NAlist.append(col)
                df[col] = df[col].fillna(mn - 1)
                mn = mn - 1  # the fill value is now the column minimum
            # Test if the column can be converted to an integer;
            # absolute differences are summed so that errors cannot cancel out
            asint = df[col].fillna(0).astype(np.int64)
            result = (df[col] - asint).abs().sum()
            if result < 0.01:
                IsInt = True
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        df[col] = df[col].astype(np.uint8)
                    elif mx < 65535:
                        df[col] = df[col].astype(np.uint16)
                    elif mx < 4294967295:
                        df[col] = df[col].astype(np.uint32)
                    else:
                        df[col] = df[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)
            # Make float datatypes 32 bit
            else:
                df[col] = df[col].astype(np.float32)
            # Print new column type
            print("dtype after: ", df[col].dtype)
            print("******************************")
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = df.memory_usage().sum() / 1024**2
    print("Memory usage is: ", mem_usg, " MB")
    print("This is ", 100 * mem_usg / start_mem_usg, "% of the initial size")
    return df, NAlist

# Training set
train, NAlist = reduce_mem_usage(train)
print("_________________")
print("")
print("Warning: the following columns have missing values filled with 'df['column_name'].min() -1': ")
print("_________________")
print("")
print(NAlist)

# Test set
test, NAlist = reduce_mem_usage(test)
print("_________________")
print("")
print("Warning: the following columns have missing values filled with 'df['column_name'].min() -1': ")
print("_________________")
print("")
print(NAlist)
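The idea behind the memory reduction step (pick the smallest dtype that still holds every value in a column) can be illustrated on a toy frame. The sketch below is not the function above: it swaps in pandas' built-in `pd.to_numeric(..., downcast=...)`, and the frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
toy = pd.DataFrame({
    "small_int": np.array([0, 1, 2, 250], dtype=np.int64),
    "signed_int": np.array([-5, 0, 5, 100], dtype=np.int64),
    "real": np.array([0.5, 1.25, -3.0, 8.0], dtype=np.float64),
})

before = toy.memory_usage(index=False).sum()

# Ask pandas for the smallest dtype that can hold each column's values.
toy["small_int"] = pd.to_numeric(toy["small_int"], downcast="unsigned")
toy["signed_int"] = pd.to_numeric(toy["signed_int"], downcast="integer")
toy["real"] = pd.to_numeric(toy["real"], downcast="float")

after = toy.memory_usage(index=False).sum()

print(toy.dtypes)
print(before, "->", after, "bytes")
```

Unlike `reduce_mem_usage`, this variant does not fill missing values or report which columns were touched, so it is only a way to see the effect of downcasting in isolation.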

3-1-1 Dataset fields

train.columns
print(len(train.columns))
print(train.info())

3-2-2 Numerical description

train.describe()

3-2 Visualization

3-2-1 Histogram

train['target'].value_counts().plot.bar();
f, ax = plt.subplots(1, 2, figsize=(20, 10))
train[train['target'] == 0].var_0.plot.hist(ax=ax[0], bins=20, edgecolor='black', color='red')
ax[0].set_title('target= 0')
x1 = list(range(0, 85, 5))
ax[0].set_xticks(x1)
train[train['target'] == 1].var_0.plot.hist(ax=ax[1], color='green', bins=20, edgecolor='black')
ax[1].set_title('target= 1')
x2 = list(range(0, 85, 5))
ax[1].set_xticks(x2)
plt.show()

3-2-2 Mean frequency

# Series.plot takes the plot type as a keyword argument
train[train.columns[2:]].mean().plot(kind='hist');
plt.title('Mean Frequency');

3-2-3 countplot

f, ax = plt.subplots(1, 2, figsize=(18, 8))
train['target'].value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('target')
ax[0].set_ylabel('')
# Pass the column as x=...; recent seaborn releases reject it as a positional argument
sns.countplot(x='target', data=train, ax=ax[1])
ax[1].set_title('target')
plt.show()

3-2-4 hist

train["var_0"].hist();

3-2-5 distplot

sns.set(rc={'figure.figsize': (9, 7)})
# distplot is deprecated in recent seaborn releases (histplot/displot replace it)
sns.distplot(train['target']);

3-2-6 Scatter plots

def plot_feature_scatter(df1, df2, features):
    i = 0
    sns.set_style('whitegrid')
    plt.figure()
    fig, ax = plt.subplots(4, 4, figsize=(14, 14))
    for feature in features:
        i += 1
        plt.subplot(4, 4, i)
        plt.scatter(df1[feature], df2[feature], marker='+')
        plt.xlabel(feature, fontsize=9)
    plt.show();

features = ['var_0', 'var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'var_6', 'var_7',
            'var_8', 'var_9', 'var_10', 'var_11', 'var_12', 'var_13', 'var_14', 'var_15']
# The frames loaded earlier are named train and test
plot_feature_scatter(train[::20], test[::20], features)

def plot_new_feature_distribution(df1, df2, label1, label2, features):
    i = 0
    sns.set_style('whitegrid')
    plt.figure()
    fig, ax = plt.subplots(2, 4, figsize=(18, 8))
    for feature in features:
        i += 1
        plt.subplot(2, 4, i)
        sns.kdeplot(df1[feature], bw_method=0.5, label=label1)
        sns.kdeplot(df2[feature], bw_method=0.5, label=label2)
        plt.xlabel(feature, fontsize=11)
        locs, labels = plt.xticks()
        plt.tick_params(axis='x', which='major', labelsize=8)
        plt.tick_params(axis='y', which='major', labelsize=8)
    plt.show();

t0 = train.loc[train['target'] == 0]
t1 = train.loc[train['target'] == 1]
# Columns from index 202 onward are engineered features; the list is empty
# unless new columns have been added to train (up to 8 fit the 2x4 grid)
features = train.columns.values[202:]
plot_new_feature_distribution(t0, t1, 'target: 0', 'target: 1', features)

3-3 Data preprocessing

3-3-1 Missing-value check

def check_missing_data(df):
    flag = df.isna().sum().any()
    if flag == True:
        total = df.isnull().sum()
        # Percentage of missing values per column
        percent = df.isnull().sum() / df.isnull().count() * 100
        output = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
        data_type = []
        # written by MJ Bahmani
        for col in df.columns:
            dtype = str(df[col].dtype)
            data_type.append(dtype)
        output['Types'] = data_type
        return(np.transpose(output))
    else:
        return(False)

# Another version
def missing_data(data):
    total = data.isnull().sum()
    percent = (data.isnull().sum() / data.isnull().count() * 100)
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    types = []
    for col in data.columns:
        dtype = str(data[col].dtype)
        types.append(dtype)
    tt['Types'] = types
    return(np.transpose(tt))

check_missing_data(train)
check_missing_data(test)

3-3-2 Binarization

train['target'].unique()

3-3-3 Class balance

train['target'].value_counts()

def check_balance(df, target):
    # written by MJ Bahmani for binary target
    print('size of data is:', df.shape[0])
    for i in [0, 1]:
        print('for target {} ='.format(i))
        print(df[target].value_counts()[i] / df.shape[0] * 100, '%')

check_balance(train, 'target')

3-3-4 Skewness and kurtosis

print("Skewness: %f" % train['target'].skew())
print("Kurtosis: %f" % train['target'].kurt())

4 Feature engineering

4-1 Permutation Importance

1. Which features have the biggest impact on predictions?
2. How can we extract that information from a model?

cols = ["target", "ID_code"]
X = train.drop(cols, axis=1)
y = train["target"]
X_test = test.drop("ID_code", axis=1)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
rfc_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)

4-2 How to compute and display importance

import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(rfc_model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names=val_X.columns.tolist(), top=150)
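For intuition, permutation importance can be sketched by hand: shuffle one column at a time and record how much a score drops. The snippet below is a minimal NumPy illustration under stated assumptions, not eli5's implementation; `permutation_importance_sketch`, `toy_predict`, and the toy data are hypothetical names introduced here.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, metric, n_repeats=5, seed=0):
    """Mean drop in `metric` when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between this column and the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - metric(y, predict(X_perm)))
        importances[col] = np.mean(drops)
    return importances

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

# Toy data: only column 0 determines the label.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(500, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

def toy_predict(X):
    # Stand-in for a fitted model that uses only column 0
    return (X[:, 0] > 0).astype(int)

imp = permutation_importance_sketch(toy_predict, X_toy, y_toy, accuracy)
print(imp)
```

Shuffling column 0 destroys the signal, so its importance is large; the other two columns are never read by `toy_predict`, so shuffling them changes nothing.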

4-4 Partial Dependence Plots

Feature importance shows which variables affect predictions the most; partial dependence plots show how a feature affects predictions. Partial dependence plots are computed after a model has been fit.
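As a rough illustration of what a partial dependence curve measures, the sketch below fixes one feature at each grid value for every row and averages the model output. It assumes a generic `predict` callable; `partial_dependence_sketch` and `toy_predict` are hypothetical helpers, not part of pdpbox or scikit-learn.

```python
import numpy as np

def partial_dependence_sketch(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every row
        averages.append(predict(X_mod).mean())
    return np.array(averages)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))

def toy_predict(X):
    # Stand-in for a fitted model: linear in column 0, quadratic in column 1
    return 2.0 * X[:, 0] + X[:, 1] ** 2

grid = np.linspace(-1, 1, 5)
pd_f0 = partial_dependence_sketch(toy_predict, X_toy, 0, grid)
pd_f1 = partial_dependence_sketch(toy_predict, X_toy, 1, grid)
print(pd_f0)
print(pd_f1)
```

For this toy model the curve for column 0 is a straight line with slope 2 and the curve for column 1 is a parabola, which is the shape information a partial dependence plot is meant to expose.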

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
tree_model = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_split=5).fit(train_X, train_y)
features = [c for c in train.columns if c not in ['ID_code', 'target']]

from sklearn import tree
import graphviz

tree_graph = tree.export_graphviz(tree_model, out_file=None, feature_names=features)
graphviz.Source(tree_graph)

4-5 Partial Dependence Plot

from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=tree_model, dataset=val_X, model_features=features, feature='var_81')

# Plot it
pdp.pdp_plot(pdp_goals, 'var_81')
plt.show()

4-6 Chart analysis

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=tree_model, dataset=val_X, model_features=features, feature='var_82')

# Plot it
pdp.pdp_plot(pdp_goals, 'var_82')
plt.show()

4-7 SHAP Values

row_to_show = 5
data_for_prediction = val_X.iloc[row_to_show]  # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
rfc_model.predict_proba(data_for_prediction_array);

import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(rfc_model)

# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1], data_for_prediction)

4-7 Correlation between features

%%time
# The training frame loaded earlier is named train
correlations = train[features].corr().abs().unstack().sort_values(kind="quicksort").reset_index()
correlations = correlations[correlations['level_0'] != correlations['level_1']]
correlations.head(10)

5 Models

Preparation: Augment

def augment(x, y, t=2):
    xs, xn = [], []
    # Synthetic positives: shuffle each column independently among the positive rows
    for i in range(t):
        mask = y > 0
        x1 = x[mask].copy()
        ids = np.arange(x1.shape[0])
        for c in range(x1.shape[1]):
            np.random.shuffle(ids)
            x1[:, c] = x1[ids][:, c]
        xs.append(x1)
    # Synthetic negatives: same idea, repeated t // 2 times
    for i in range(t // 2):
        mask = y == 0
        x1 = x[mask].copy()
        ids = np.arange(x1.shape[0])
        for c in range(x1.shape[1]):
            np.random.shuffle(ids)
            x1[:, c] = x1[ids][:, c]
        xn.append(x1)
    xs = np.vstack(xs)
    xn = np.vstack(xn)
    ys = np.ones(xs.shape[0])
    yn = np.zeros(xn.shape[0])
    x = np.vstack([x, xs, xn])
    y = np.concatenate([y, ys, yn])
    return x, y
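To make the effect of the augmentation concrete, here is a compact restatement on a toy array that checks the resulting shapes. It is a sketch under assumptions: `augment_sketch` and `shuffle_columns` are hypothetical names, and a seeded NumPy `Generator` is used instead of the global random state for reproducibility.

```python
import numpy as np

def shuffle_columns(block, rng):
    """Return a copy of `block` with every column permuted independently."""
    out = block.copy()
    for c in range(out.shape[1]):
        out[:, c] = rng.permutation(out[:, c])
    return out

def augment_sketch(x, y, t=2, seed=0):
    # Needs t >= 2 so that at least one negative block exists (t // 2 >= 1)
    rng = np.random.default_rng(seed)
    pos = [shuffle_columns(x[y > 0], rng) for _ in range(t)]
    neg = [shuffle_columns(x[y == 0], rng) for _ in range(t // 2)]
    xs, xn = np.vstack(pos), np.vstack(neg)
    x_out = np.vstack([x, xs, xn])
    y_out = np.concatenate([y, np.ones(len(xs)), np.zeros(len(xn))])
    return x_out, y_out

# 8 rows, 3 features, 3 positives and 5 negatives
x_toy = np.arange(24, dtype=float).reshape(8, 3)
y_toy = np.array([1, 0, 0, 0, 1, 0, 0, 1])

x_aug, y_aug = augment_sketch(x_toy, y_toy, t=2)
print(x_aug.shape, y_aug.sum())
```

With `t=2` the positives are duplicated twice and the negatives once, so the positive class grows faster than the negative class. Each synthetic row mixes column values from different real rows of the same class, which implicitly treats the features as independent within a class.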

5-1 lightgbm

params = {
    'objective': "binary",
    'boost': "gbdt",
    'metric': "auc",
    'boost_from_average': "false",
    'num_threads': 8,
    'learning_rate': 0.01,
    'num_leaves': 13,
    'max_depth': -1,
    'tree_learner': "serial",
    'feature_fraction': 0.05,
    'bagging_freq': 5,
    'bagging_fraction': 0.4,
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'verbosity': 1,
}

%%time
y_pred_lgb = np.zeros(len(X_test))
num_round = 1000000
for fold_idx, (train_index, valid_index) in enumerate(folds.split(X, y)):
    print('Fold', fold_idx, 'started at', time.ctime())
    X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
    train_data = lgb.Dataset(X_train, label=y_train)
    valid_data = lgb.Dataset(X_valid, label=y_valid)
    # verbose_eval and early_stopping_rounds are accepted by lgb.train in LightGBM < 4.0;
    # newer releases expect the lgb.log_evaluation and lgb.early_stopping callbacks instead
    lgb_model = lgb.train(params, train_data, num_round,
                          valid_sets=[train_data, valid_data],
                          verbose_eval=1000,
                          early_stopping_rounds=3500)
    # Average the test predictions over the folds
    y_pred_lgb += lgb_model.predict(X_test, num_iteration=lgb_model.best_iteration) / folds.n_splits

5-2 RandomForestClassifier

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
rfc_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
y_pred_rfc = rfc_model.predict(X_test)

5-3 DecisionTreeClassifier

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
tree_model = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_split=5).fit(train_X, train_y)
y_pred_tree = tree_model.predict(X_test)

5-4 Logistic Regression

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def augment(x, y, t=2):
    if t == 0:
        return x, y
    xs, xn = [], []
    for i in range(t):
        mask = y > 0
        x1 = x[mask].copy()
        ids = np.arange(x1.shape[0])
        for c in range(x1.shape[1]):
            np.random.shuffle(ids)
            x1[:, c] = x1[ids][:, c]
        xs.append(x1)
        del x1
        gc.collect()
    for i in range(t // 2):
        mask = y == 0
        x1 = x[mask].copy()
        ids = np.arange(x1.shape[0])
        for c in range(x1.shape[1]):
            np.random.shuffle(ids)
            x1[:, c] = x1[ids][:, c]
        xn.append(x1)
        del x1
        gc.collect()
    print("The sizes of x, xn, and xs are {}, {}, {}, respectively.".format(sys.getsizeof(x), sys.getsizeof(xn), sys.getsizeof(xs)))
    xs = np.vstack(xs)
    xn = np.vstack(xn)
    print("The sizes of x, xn, and xs are {}, {}, {}, respectively.".format(sys.getsizeof(x) / 1024**3, sys.getsizeof(xn), sys.getsizeof(xs)))
    ys = np.ones(xs.shape[0])
    yn = np.zeros(xn.shape[0])
    y = np.concatenate([y, ys, yn])
    print("The sizes of y, yn, and ys are {}, {}, {}, respectively.".format(sys.getsizeof(y), sys.getsizeof(yn), sys.getsizeof(ys)))
    gc.collect()
    return np.vstack([x, xs, xn]), y

# The loop below indexes with NumPy arrays, so convert the DataFrame and Series first
X_arr, y_arr = X.values, y.values
# Accumulators used by the loop (they were not defined in the original listing)
NFOLDS = folds.n_splits
roc_cv = []
oof_preds = np.zeros((len(X_arr), 1))
test_preds = np.zeros((len(X_test), 1))

for fold_, (trn_, val_) in enumerate(folds.split(y_arr, y_arr)):
    print("Current Fold: {}".format(fold_))
    trn_x, trn_y = X_arr[trn_, :], y_arr[trn_]
    val_x, val_y = X_arr[val_, :], y_arr[val_]
    NAUGMENTATIONS = 1  # 5
    NSHUFFLES = 0  # 2  # turning off the augmentation by shuffling since it did not help
    val_pred, test_fold_pred = 0, 0
    for i in range(NAUGMENTATIONS):
        print("\nFold {}, Augmentation {}".format(fold_, i + 1))
        trn_aug_x, trn_aug_y = augment(trn_x, trn_y, NSHUFFLES)
        trn_aug_x = pd.DataFrame(trn_aug_x)
        trn_aug_x = trn_aug_x.add_prefix('var_')
        clf = Pipeline([
            # ('scaler', StandardScaler()),
            # ('qt', QuantileTransformer(output_distribution='normal')),
            ('lr_clf', LogisticRegression(solver='lbfgs', max_iter=1500, C=10))
        ])
        clf.fit(trn_aug_x, trn_aug_y)
        print("Making predictions for the validation data")
        val_pred += clf.predict_proba(val_x)[:, 1]
        print("Making predictions for the test data")
        test_fold_pred += clf.predict_proba(X_test)[:, 1]
    val_pred /= NAUGMENTATIONS
    test_fold_pred /= NAUGMENTATIONS
    roc_cv.append(roc_auc_score(val_y, val_pred))
    print("AUC = {}".format(roc_auc_score(val_y, val_pred)))
    oof_preds[val_, :] = val_pred.reshape((-1, 1))
    test_preds += test_fold_pred.reshape((-1, 1))

test_preds /= NFOLDS
roc_score_1 = round(roc_auc_score(y_arr, oof_preds.ravel()), 5)
roc_score = round(sum(roc_cv) / len(roc_cv), 5)
st_dev = round(np.array(roc_cv).std(), 5)

print("Average of the folds' AUCs = {}".format(roc_score))
print("Combined folds' AUC = {}".format(roc_score_1))
print("The standard deviation = {}".format(st_dev))

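6 Submission

# y_pred_cat is assumed to come from a CatBoost model that this template does not show;
# define it before running this cell
submission_rfc_cat = pd.DataFrame({
    "ID_code": test["ID_code"],
    "target": (y_pred_rfc + y_pred_cat) / 2,
})
submission_rfc_cat.to_csv('submission_rfc_cat.csv', index=False)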
