Basic Usage of XGBoost and a Kaggle Store Sales Prediction
Basic Usage of XGBoost
Import XGBoost and the related packages:
```python
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
```

Load the data and split it into a feature matrix and labels:
```python
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:, 0:8]
y = dataset[:, 8]
dataset
# array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
#        [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
#        [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
#        ...,
#        [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
#        [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
#        [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])
```

Split the data into training and test sets:
```python
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# ((514, 8), (254, 8), (514,), (254,))
```

Create and train the model:
```python
model = XGBClassifier(n_jobs=-1)
model.fit(X_train, y_train)
# XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
#        colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
#        max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
#        n_jobs=-1, nthread=None, objective='binary:logistic',
#        random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
#        seed=None, silent=True, subsample=1)
```

Predict on the test set with the trained model, and compute the accuracy of the predictions against the true labels:
```python
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Accuracy: 77.95%
```

Predict on the test set again, this time obtaining the predicted probability of each class:
```python
y_pred = model.predict(X_test)
y_pred
# array([0., 1., 1., 0., 1., 1., 0., 0., 1., 0., 1., 0., 1., 1., 0., 0., 0.,
#        1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,
#        0., 1., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1., 0., 0., 1., 0.,
#        . . .
#        0., 0., 1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0.,
#        0., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1.,
#        0., 0., 1., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 1.])
y_pred_proba = model.predict_proba(X_test)
y_pred_proba
# array([[0.9545844 , 0.04541559],
#        [0.05245447, 0.9475455 ],
#        [0.41897488, 0.5810251 ],
#        ...
#        [0.42821795, 0.57178205],
#        [0.2364142 , 0.7635858 ],
#        [0.05780089, 0.9421991 ]], dtype=float32)
```

Monitoring model performance: during training, XGBoost can evaluate the model on a held-out set and print the score after every boosting round.
```python
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10,
          eval_metric="logloss", eval_set=eval_set, verbose=True)
# Training stops once the validation score has not improved for 10 rounds,
# and the logloss is printed after each tree is added:
# [0]  validation_0-logloss:0.60491
# [1]  validation_0-logloss:0.55934
# [2]  validation_0-logloss:0.53068
# [3]  validation_0-logloss:0.51795
# [4]  validation_0-logloss:0.51153
# [5]  validation_0-logloss:0.50935
# [6]  validation_0-logloss:0.50818
# [7]  validation_0-logloss:0.51097
# [8]  validation_0-logloss:0.51760
# [9]  validation_0-logloss:0.51912
# [10] validation_0-logloss:0.52503
# [11] validation_0-logloss:0.52697
# [12] validation_0-logloss:0.53335
# [13] validation_0-logloss:0.53905
# [14] validation_0-logloss:0.54546
# [15] validation_0-logloss:0.54613
# [16] validation_0-logloss:0.54982
```

Plot the importance of each feature:
```python
from xgboost import plot_importance
from matplotlib import pyplot
%matplotlib inline

plot_importance(model)
pyplot.show()
```
XGBoost chooses split features by the gain in the structure score, and a feature's importance here is the total number of times it appears as a split point across all trees. In other words, the more often an attribute is used to build the decision trees in the model, the higher its relative importance.
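`plot_importance` uses this split count ('weight') by default, but XGBoost also exposes gain- and cover-based definitions. A minimal sketch, assuming the installed xgboost version exposes `get_booster()` and the standard `importance_type` values:

```python
# 'weight': number of splits on the feature (what the text above describes);
# 'gain': average loss reduction from those splits;
# 'cover': average number of samples affected by them.
booster = model.get_booster()
for imp_type in ('weight', 'gain', 'cover'):
    scores = booster.get_score(importance_type=imp_type)
    top3 = sorted(scores.items(), key=lambda kv: -kv[1])[:3]
    print(imp_type, top3)
```

A feature that ranks high on 'weight' but low on 'gain' is split on often yet contributes little, so it is worth checking more than one view.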
Hyperparameter Tuning
How do we tune? Below are practical starting values for three key hyperparameters. Set them within these ranges first, plot learning curves (see the sketch after this list), and then adjust the parameters to find the best model:
- learning_rate = 0.1 or smaller; the smaller it is, the more weak learners (trees) you will need to add
- tree depth (max_depth) = 2~8
- subsample = 30%~80% of the training set
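A minimal sketch of the learning-curve step, reusing this post's diabetes split and the same fit-time `eval_metric` API as the early-stopping example above (newer xgboost versions move `eval_metric` into the constructor):

```python
from matplotlib import pyplot

# Track logloss on both splits at every boosting round.
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train,
          eval_set=[(X_train, y_train), (X_test, y_test)],
          eval_metric="logloss", verbose=False)

# evals_result() holds one list of scores per eval_set entry.
history = model.evals_result()
pyplot.plot(history['validation_0']['logloss'], label='train')
pyplot.plot(history['validation_1']['logloss'], label='test')
pyplot.xlabel('boosting round')
pyplot.ylabel('logloss')
pyplot.legend()
pyplot.show()
```

A widening gap between the two curves signals overfitting: lower learning_rate or max_depth, or stop earlier.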
接下來(lái)我們用GridSearchCV來(lái)進(jìn)行調(diào)參會(huì)更加方便一些
Parameter combinations worth searching include:
樹的個(gè)數(shù)和大小(n_estimators and max_depth)學(xué)習(xí)率和樹的個(gè)數(shù)(learning_rate and n_estimators).行列的subsampling rates(subsample,colsample_bytree and colsample_bylevel )
Import the tuning-related packages:
```python
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
```

Create the model and the parameter search space:
```python
model_GS = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
max_depth = [1, 2, 3, 4, 5]
param_grid = dict(learning_rate=learning_rate, max_depth=max_depth)
```

Set up stratified k-fold cross-validation and create the search object:
```python
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
grid_search = GridSearchCV(model_GS, param_grid=param_grid,
                           scoring='neg_log_loss', n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, y)

y_pred = grid_result.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Accuracy: 81.10%
grid_result.best_score_, grid_result.best_params_
# (-0.47171179660714796, {'learning_rate': 0.2, 'max_depth': 1})
```

Note that the search here is fit on the full dataset (X, y), so X_test has been seen during the search and the accuracy above is optimistic.

Rossmann Store Sales Prediction
Description
Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of the results can vary widely.

In their first Kaggle competition, Rossmann challenges you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what matters most to them: their customers and their teams!

We are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.
Evaluation
Submissions are evaluated on the Root Mean Square Percentage Error (RMSPE). The RMSPE is calculated as
$$\operatorname{RMSPE}=\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(\frac{y_{i}-\hat{y}_{i}}{y_{i}}\right)^{2}}$$
where $y_i$ denotes the sales of a single store on a single day and $\hat{y}_i$ denotes the corresponding prediction. Any day and store with 0 sales is ignored in scoring.
Writing a custom loss function here would be too involved: a custom objective in XGBoost must supply the first and second derivatives (gradient and Hessian) of the loss, not just the loss value itself. Instead, we can complete this prediction with a custom evaluation metric.
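To make the contrast concrete, here is a hedged sketch of the two signatures in xgboost's native API. The squared-error objective is a textbook illustration, not this competition's loss; the metric mirrors the RMSPE definition above:

```python
import numpy as np

def squared_error_obj(preds, dtrain):
    """A custom objective must return per-sample gradient and Hessian."""
    labels = dtrain.get_label()
    grad = preds - labels          # first derivative of 0.5 * (pred - label)^2
    hess = np.ones_like(preds)     # second derivative
    return grad, hess

def rmspe_feval(preds, dtrain):
    """A custom eval metric only returns a (name, value) pair."""
    labels = dtrain.get_label()
    mask = labels != 0             # days with 0 sales are ignored in scoring
    err = np.sqrt(np.mean(((labels[mask] - preds[mask]) / labels[mask]) ** 2))
    return 'rmspe', err

# These would be passed as obj=squared_error_obj or feval=rmspe_feval
# to xgb.train; only the feval route is needed below.
```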
Files
- train.csv - historical data including Sales
- test.csv - historical data excluding Sales
- sample_submission.csv - a sample submission file in the correct format
- store.csv - supplemental information about the stores
Data fields
Most of the fields are self-explanatory. The following are descriptions for those that aren't.
- Id - an Id that represents a (Store, Date) duple within the test set
- Store - a unique Id for each store
- Sales - the turnover for any given day (this is what you are predicting)
- Customers - the number of customers on a given day
- Open - an indicator for whether the store was open: 0 = closed, 1 = open
- StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
- SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
- StoreType - differentiates between 4 different store models: a, b, c, d
- Assortment - describes an assortment level: a = basic, b = extra, c = extended
- CompetitionDistance - distance in meters to the nearest competitor store
- CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
- Promo - indicates whether a store is running a promo on that day
- Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
- Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
- PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
Notes on the promo fields:
- Promo: whether a promotion is running that day
- Promo2: whether the store participates in the long-running, continuing promotion
- Promo2Since[Year/Week]: the year and calendar week the long-running promotion started
- PromoInterval: the months in which each round of Promo2 starts anew
Import the required libraries
```python
import pandas as pd
import datetime
import csv
import numpy as np
import os
import scipy as sp
import xgboost as xgb
import itertools
import operator
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split  # the original used the removed sklearn.cross_validation
from sklearn.base import TransformerMixin
from matplotlib import pylab as plt

plot = True
goal = 'Sales'
myid = 'Id'
```

When your eval metric and your loss function are not the same:
Early stopping
按照原來(lái)的loss function去優(yōu)化,一顆一顆樹生長(zhǎng)和添加,但是在驗(yàn)證集上,盯著eval metric去看,在驗(yàn)證集上評(píng)估指標(biāo)不再優(yōu)化的時(shí)候,停止集成模型的生長(zhǎng)。
有標(biāo)簽的數(shù)據(jù)部分(訓(xùn)練集) + 無(wú)標(biāo)簽/需要做預(yù)估的部分(測(cè)試集)
訓(xùn)練集 = 真正的訓(xùn)練集 + 驗(yàn)證集(利用它去完成模型選擇和調(diào)參)
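Reusing the earlier diabetes split: the training loss is still binary logloss, but early stopping watches AUC on the validation set (same fit-time API as the earlier early-stopping example; `best_iteration` is assumed to be populated after stopping):

```python
# The objective stays binary:logistic; only the watched metric changes.
model = XGBClassifier(n_estimators=1000, learning_rate=0.1)
model.fit(X_train, y_train,
          eval_set=[(X_test, y_test)],
          eval_metric="auc",            # eval metric != training loss
          early_stopping_rounds=10,     # stop when AUC stops improving
          verbose=False)
print("best iteration:", model.best_iteration)
```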
Define some transformations and evaluation criteria
Pay special attention to this when using a different evaluation function.
```python
def ToWeight(y):
    # y is an np.array; zero-sales days get weight 0 and are thus ignored
    w = np.zeros(y.shape, dtype=float)
    ind = y != 0
    w[ind] = 1. / (y[ind] ** 2)
    return w

def rmspe(yhat, y):
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y - yhat) ** 2))
    return rmspe

def rmspe_xg(yhat, y):
    # y = y.values
    y = y.get_label()
    y = np.exp(y) - 1       # undo the log(y + 1) transform applied to the labels
    yhat = np.exp(yhat) - 1
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y - yhat) ** 2))
    return "rmspe", rmspe

store = pd.read_csv('store.csv')
store.head()
train_df = pd.read_csv('train.csv')
train_df.head()
test_df = pd.read_csv('test.csv')
test_df.head()
```

Load the data
```python
def load_data():
    """Load the data and separate numeric from non-numeric columns."""
    store = pd.read_csv('store.csv')
    train_org = pd.read_csv('train.csv', dtype={'StateHoliday': str})
    test_org = pd.read_csv('test.csv', dtype={'StateHoliday': str})
    train = pd.merge(train_org, store, on='Store', how='left')
    test = pd.merge(test_org, store, on='Store', how='left')
    features = test.columns.tolist()
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    features_numeric = test.select_dtypes(include=numerics).columns.tolist()
    features_non_numeric = [f for f in features if f not in features_numeric]
    return (train, test, features, features_non_numeric)
```

Data and feature processing
```python
def process_data(train, test, features, features_non_numeric):
    """Feature engineering and selection."""
    # FEATURE ENGINEERING
    train = train[train['Sales'] > 0]

    for data in [train, test]:
        # year, month, day parsed from the 'YYYY-MM-DD' Date string
        data['year'] = data.Date.apply(lambda x: x.split('-')[0]).astype(float)
        data['month'] = data.Date.apply(lambda x: x.split('-')[1]).astype(float)
        data['day'] = data.Date.apply(lambda x: x.split('-')[2]).astype(float)
        # PromoInterval looks like "Jan,Apr,Jul,Oct"; missing values are NaN,
        # which is a float, hence the isinstance(x, float) guard (otherwise:
        # TypeError: argument of type 'float' is not iterable).
        for mon in ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']:
            data['promo' + mon.lower()] = data.PromoInterval.apply(
                lambda x: 0 if isinstance(x, float) else 1 if mon in x else 0)

    # Feature set (note: the promo-month indicators above are created but
    # never added to `features`, as the training log below confirms)
    noisy_features = [myid, 'Date']
    features = [c for c in features if c not in noisy_features]
    features_non_numeric = [c for c in features_non_numeric if c not in noisy_features]
    features.extend(['year', 'month', 'day'])

    # Fill NA
    class DataFrameImputer(TransformerMixin):
        # http://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn
        def __init__(self):
            """Impute missing values.

            Columns of dtype object are imputed with the most frequent value
            in the column; columns of other types are imputed with the mean.
            """
        def fit(self, X, y=None):
            self.fill = pd.Series(
                [X[c].value_counts().index[0]   # mode
                 if X[c].dtype == np.dtype('O')
                 else X[c].mean()               # mean
                 for c in X],
                index=X.columns)
            return self
        def transform(self, X, y=None):
            return X.fillna(self.fill)

    train = DataFrameImputer().fit_transform(train)
    test = DataFrameImputer().fit_transform(test)

    # Pre-process non-numeric values
    le = LabelEncoder()
    for col in features_non_numeric:
        le.fit(list(train[col]) + list(test[col]))
        train[col] = le.transform(train[col])
        test[col] = le.transform(test[col])

    # Models such as LR and neural networks are extremely sensitive to the
    # scale of their inputs, so normalize first (recent scikit-learn expects
    # 2-D input, hence the reshape).
    scaler = StandardScaler()
    for col in set(features) - set(features_non_numeric) - set([]):  # TODO: add what not to scale
        scaler.fit(np.array(list(train[col]) + list(test[col]), dtype=float).reshape(-1, 1))
        train[col] = scaler.transform(train[col].values.reshape(-1, 1)).ravel()
        test[col] = scaler.transform(test[col].values.reshape(-1, 1)).ravel()
    return (train, test, features, features_non_numeric)
```

Training and analysis
We fit on log-transformed sales, `predict_result = log(y + 1)`, and map predictions back with `y = e^(predict_result) - 1`:

```python
def XGB_native(train, test, features, features_non_numeric):
    depth = 6
    eta = 0.01
    ntrees = 8000
    mcw = 3
    params = {"objective": "reg:linear",
              "booster": "gbtree",
              "eta": eta,
              "max_depth": depth,
              "min_child_weight": mcw,
              "subsample": 0.7,
              "colsample_bytree": 0.7,
              "silent": 1}
    print("Running with params: " + str(params))
    print("Running with ntrees: " + str(ntrees))
    print("Running with features: " + str(features))

    # Train model with a local split
    tsize = 0.05
    X_train, X_test = train_test_split(train, test_size=tsize)
    dtrain = xgb.DMatrix(X_train[features], np.log(X_train[goal] + 1))
    dvalid = xgb.DMatrix(X_test[features], np.log(X_test[goal] + 1))
    watchlist = [(dvalid, 'eval'), (dtrain, 'train')]
    gbm = xgb.train(params, dtrain, ntrees, evals=watchlist,
                    early_stopping_rounds=100, feval=rmspe_xg, verbose_eval=True)
    train_probs = gbm.predict(xgb.DMatrix(X_test[features]))
    indices = train_probs < 0
    train_probs[indices] = 0
    error = rmspe(np.exp(train_probs) - 1, X_test[goal].values)
    print(error)

    # Predict and export
    test_probs = gbm.predict(xgb.DMatrix(test[features]))
    indices = test_probs < 0
    test_probs[indices] = 0
    submission = pd.DataFrame({myid: test[myid], goal: np.exp(test_probs) - 1})
    if not os.path.exists('result/'):
        os.makedirs('result/')
    submission.to_csv("./result/dat-xgb_d%s_eta%s_ntree%s_mcw%s_tsize%s.csv" % (
        str(depth), str(eta), str(ntrees), str(mcw), str(tsize)), index=False)

    # Feature importance
    if plot:
        outfile = open('xgb.fmap', 'w')
        i = 0
        for feat in features:
            outfile.write('{0}\t{1}\tq\n'.format(i, feat))
            i = i + 1
        outfile.close()
        importance = gbm.get_fscore(fmap='xgb.fmap')
        importance = sorted(importance.items(), key=operator.itemgetter(1))
        df = pd.DataFrame(importance, columns=['feature', 'fscore'])
        df['fscore'] = df['fscore'] / df['fscore'].sum()
        # Plot it up
        plt.figure()
        df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(25, 15))
        plt.title('XGBoost Feature Importance')
        plt.xlabel('relative importance')
        plt.gcf().savefig('Feature_Importance_xgb_d%s_eta%s_ntree%s_mcw%s_tsize%s.png' % (
            str(depth), str(eta), str(ntrees), str(mcw), str(tsize)))

print("=> Loading data...")
train, test, features, features_non_numeric = load_data()
print("=> Processing data and engineering features...")
train, test, features, features_non_numeric = process_data(train, test, features, features_non_numeric)
print("=> Training with XGBoost...")
XGB_native(train, test, features, features_non_numeric)
train.head()
# => Loading data...
# => Processing data and engineering features...
# => Training with XGBoost...
# Running with params: {'subsample': 0.7, 'eta': 0.01, 'colsample_bytree': 0.7, 'silent': 1, 'objective': 'reg:linear', 'max_depth': 6, 'min_child_weight': 3, 'booster': 'gbtree'}
# Running with ntrees: 8000
# Running with features: ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'year', 'month', 'day']
# [0]   eval-rmspe:0.999864  train-rmspe:0.999864
# Multiple eval metrics have been passed: 'train-rmspe' will be used for early stopping.
# Will train until train-rmspe hasn't improved in 100 rounds.
# [1]   eval-rmspe:0.999838  train-rmspe:0.999837
# [2]   eval-rmspe:0.99981   train-rmspe:0.999809
# [3]   eval-rmspe:0.999779  train-rmspe:0.999779
# . . .
# [503] eval-rmspe:0.314933  train-rmspe:0.342737
# [504] eval-rmspe:0.315016  train-rmspe:0.342834
# [505] eval-rmspe:0.31512   train-rmspe:0.342928
# Stopping. Best iteration:
# [405] eval-rmspe:0.312829  train-rmspe:0.33589
# 0.315119522982
```