Basic XGBoost Usage, Applied to the Kaggle Store Sales Forecast
Basic usage of XGBoost
Import XGBoost and the related packages:
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Load the data and extract the feature set and the labels:

dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:, 0:8]   # the first eight columns are the features
y = dataset[:, 8]     # the last column is the 0/1 label
dataset
#array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
#       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
#       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
#       ...,
#       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
#       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
#       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

Split the data into a training set and a test set:

seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
#((514, 8), (254, 8), (514,), (254,))

Create and train the model:
model = XGBClassifier(n_jobs=-1)
model.fit(X_train, y_train)
#XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
#       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
#       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
#       n_jobs=-1, nthread=None, objective='binary:logistic',
#       random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
#       seed=None, silent=True, subsample=1)

Use the trained model to predict on the test set and compute the accuracy of the predictions against the true labels:

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
#Accuracy: 77.95%

Use the trained model to predict on the test set and get the predicted probability of each class:

y_pred = model.predict(X_test)
y_pred
#array([0., 1., 1., 0., 1., 1., 0., 0., 1., 0., 1., 0., 1., 1., 0., 0., 0.,
#       1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,
#       0., 1., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1., 0., 0., 1., 0.,
#       . . .
#       0., 0., 1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0.,
#       0., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1.,
#       0., 0., 1., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 1.])
y_pred_proba = model.predict_proba(X_test)
y_pred_proba
#array([[0.9545844 , 0.04541559],
#       [0.05245447, 0.9475455 ],
#       [0.41897488, 0.5810251 ],
#       . . .
#       [0.42821795, 0.57178205],
#       [0.2364142 , 0.7635858 ],
#       [0.05780089, 0.9421991 ]], dtype=float32)

Monitoring model performance: during training, XGBoost can evaluate the model on a held-out set (here the test set) and print the score at every step.
model = XGBClassifier()
eval_set = [(X_test, y_test)]
# Stop when the score on the evaluation set has not improved for 10 rounds;
# the logloss is printed after each tree is added.
# Newer XGBoost releases expect early_stopping_rounds and eval_metric as
# XGBClassifier(...) constructor arguments rather than fit() arguments.
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
#[0]	validation_0-logloss:0.60491
#[1]	validation_0-logloss:0.55934
#[2]	validation_0-logloss:0.53068
#[3]	validation_0-logloss:0.51795
#[4]	validation_0-logloss:0.51153
#[5]	validation_0-logloss:0.50935
#[6]	validation_0-logloss:0.50818
#[7]	validation_0-logloss:0.51097
#[8]	validation_0-logloss:0.51760
#[9]	validation_0-logloss:0.51912
#[10]	validation_0-logloss:0.52503
#[11]	validation_0-logloss:0.52697
#[12]	validation_0-logloss:0.53335
#[13]	validation_0-logloss:0.53905
#[14]	validation_0-logloss:0.54546
#[15]	validation_0-logloss:0.54613
#[16]	validation_0-logloss:0.54982

Plot the importance of each feature:

from xgboost import plot_importance
from matplotlib import pyplot
%matplotlib inline

plot_importance(model)
pyplot.show()
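XGBoost chooses which feature to split on from the gain in the structure score. The importance plotted here is the total number of times a feature appears as a split across all trees. In other words, the more often an attribute is used to build the decision trees in the model, the higher its importance.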
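As a rough illustration of this counting rule, the sketch below tallies how often each feature is used as a split across a small ensemble. The trees and the feature names (`toy_trees`, `plas`, `mass`, `age`) are made up for the example and are not taken from the model above; in XGBoost this count corresponds to the `weight` importance type.

```python
from collections import Counter

# Hypothetical ensemble: each tree is listed as the features used at its split nodes.
toy_trees = [
    ["plas", "mass", "age"],
    ["plas", "age"],
    ["plas", "mass", "plas"],
]

def split_count_importance(trees):
    """Count how many times each feature appears as a split over all trees."""
    counts = Counter()
    for tree in trees:
        counts.update(tree)
    return counts

importance = split_count_importance(toy_trees)
for feature, count in importance.most_common():
    print(feature, count)
```

A feature that is used several times inside one tree is counted each time, which is why `plas` scores higher than the number of trees.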
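Parameter tuning
How should the parameters be tuned? Below are commonly used starting values for three hyperparameters. Set them within these ranges first, plot the learning curves, and then adjust the parameters further to find the best model: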
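- learning_rate = 0.1 or smaller; the smaller it is, the more weak learners have to be added
- tree_depth = 2 to 8 (the max_depth parameter in XGBoost)
- subsample = 30% to 80% of the training set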
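The trade-off in the first bullet can be seen in a toy model. Assume, purely for illustration, that every weak learner fits the current residual perfectly and is then shrunk by the learning rate, so the residual is multiplied by (1 - learning_rate) each round. The function name `rounds_needed` and the tolerance are assumptions for this sketch; real boosting is far less tidy.

```python
def rounds_needed(learning_rate, tolerance=0.01):
    """Rounds until the residual of the toy model drops to `tolerance` or below."""
    residual = 1.0
    rounds = 0
    while residual > tolerance:
        residual *= (1.0 - learning_rate)  # each shrunken learner removes a fixed share
        rounds += 1
    return rounds

for lr in (0.3, 0.1, 0.01):
    print(lr, rounds_needed(lr))
```

In this toy setting, cutting the learning rate by a factor of ten raises the number of rounds by roughly the same factor, which is why a small learning_rate is usually paired with a large n_estimators.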
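Next we use GridSearchCV, which makes the tuning more convenient.
Parameter combinations that can be tuned together:
- the number and size of the trees (n_estimators and max_depth)
- the learning rate and the number of trees (learning_rate and n_estimators)
- the row and column subsampling rates (subsample, colsample_bytree and colsample_bylevel)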
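A grid search fits one model per parameter combination and per fold, so the cost grows multiplicatively with every parameter added. The sketch below enumerates a grid with `itertools.product`; the candidate values and the fold count are assumptions chosen for illustration.

```python
from itertools import product

# Illustrative candidate values (assumed for this sketch).
param_grid = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
    "max_depth": [1, 2, 3, 4, 5],
}
n_folds = 10

names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))            # number of parameter combinations
print(len(candidates) * n_folds)  # model fits during cross-validation (excluding any final refit)
print(candidates[0])
```

Tuning two parameters at a time, as in the pairs listed above, keeps this product manageable.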
Import the packages needed for tuning:

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

Create the model and the parameter search space:

model_GS = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
max_depth = [1, 2, 3, 4, 5]
param_grid = dict(learning_rate=learning_rate, max_depth=max_depth)

Set up stratified k-fold validation and create the search object:

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
grid_search = GridSearchCV(model_GS, param_grid=param_grid, scoring='neg_log_loss', n_jobs=-1, cv=kfold)
# Note: the search is fitted on the full data (X, y), which includes X_test,
# so the accuracy reported below is optimistic.
grid_result = grid_search.fit(X, y)
y_pred = grid_result.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
#Accuracy: 81.10%
grid_result.best_score_, grid_result.best_params_
#(-0.47171179660714796, {'learning_rate': 0.2, 'max_depth': 1})

Store sales prediction
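Description
Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their own circumstances, the accuracy of the results can vary widely.
In their first Kaggle competition, Rossmann challenges you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what matters most to them: their customers and their teams!
You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.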
Evaluation
Submissions are evaluated on the Root Mean Square Percentage Error (RMSPE). The RMSPE is calculated as
$$\operatorname{RMSPE}=\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(\frac{y_{i}-\hat{y}_{i}}{y_{i}}\right)^{2}}$$
where y_i denotes the sales of a single store on a single day and yhat_i denotes the corresponding prediction. Any day and store with 0 sales is ignored in scoring.
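A small worked example may help here. The sketch below computes the RMSPE in plain Python and skips entries whose true value is 0, as the scoring rule describes; the sales figures are made up.

```python
import math

def rmspe(y_true, y_pred):
    """Root mean square percentage error, skipping entries whose true value is 0."""
    terms = [((y - p) / y) ** 2 for y, p in zip(y_true, y_pred) if y != 0]
    return math.sqrt(sum(terms) / len(terms))

# Made-up sales figures: the third entry has 0 sales and is ignored.
actual = [100.0, 200.0, 0.0]
predicted = [110.0, 180.0, 50.0]
score = rmspe(actual, predicted)
print(round(score, 6))
```

Both scored predictions are off by 10% of the true value, so the RMSPE comes out at 0.1 even though the absolute errors (10 and 20) differ.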
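Writing a custom loss function is rather involved: XGBoost needs the first and second derivatives of a custom objective, so having the loss itself is not enough. Here we skip those derivatives, keep a standard objective, and handle this prediction task with a custom evaluation metric instead.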
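To make that requirement concrete: a custom objective has to return, for every sample, the first derivative (gradient) and the second derivative (hessian) of the loss with respect to the prediction. The sketch below does this for a plain squared error loss, used here only as a stand-in (it is not the RMSPE loss), and checks the gradient against a finite difference. The function names are hypothetical.

```python
def squared_error(pred, label):
    """Loss for a single sample: 0.5 * (pred - label)^2."""
    return 0.5 * (pred - label) ** 2

def squared_error_grad_hess(preds, labels):
    """Per-sample gradient and hessian of the loss with respect to the prediction."""
    grad = [p - y for p, y in zip(preds, labels)]   # d/dp of 0.5 * (p - y)^2
    hess = [1.0 for _ in preds]                     # second derivative is constant
    return grad, hess

preds = [2.0, 0.5, 3.0]
labels = [1.0, 1.0, 3.5]
grad, hess = squared_error_grad_hess(preds, labels)

# Finite-difference check of the first derivative.
eps = 1e-6
numeric_grad = [
    (squared_error(p + eps, y) - squared_error(p - eps, y)) / (2 * eps)
    for p, y in zip(preds, labels)
]
print(grad)
print(numeric_grad)
```

For a metric such as RMSPE both derivatives would have to be worked out by hand, which is the effort the custom evaluation metric avoids.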
Files
- train.csv - historical data including Sales
- test.csv - historical data excluding Sales
- sample_submission.csv - a sample submission file in the correct format
- store.csv - supplemental information about the stores
Data fields
Most of the fields are self-explanatory. The following are descriptions for those that aren't.
- Id - an Id that represents a (Store, Date) duple within the test set
- Store - a unique Id for each store
- Sales - the turnover for any given day (this is what you are predicting)
- Customers - the number of customers on a given day
- Open - an indicator for whether the store was open: 0 = closed, 1 = open
- StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
- SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
- StoreType - differentiates between 4 different store models: a, b, c, d
- Assortment - describes an assortment level: a = basic, b = extra, c = extended
- CompetitionDistance - distance in meters to the nearest competitor store
- CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
- Promo - indicates whether a store is running a promo on that day
- Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
- Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
- PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
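In short, the promotion fields mean:
- Promo - whether a promotion is running that day
- Promo2 - whether the store takes part in the long-running promotion
- Promo2Since[Year/Week] - the year and calendar week in which the long-running promotion started
- PromoInterval - the schedule of the promotion rounds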
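PromoInterval is a comma-separated string, so a model cannot use it directly. One common option is to expand it into twelve month flags. The sketch below is a minimal version under the assumption that a missing interval arrives as a non-string value such as float NaN; the helper name `promo_month_flags` is hypothetical.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def promo_month_flags(promo_interval):
    """Map a PromoInterval string to 12 month flags; missing values give all zeros."""
    if not isinstance(promo_interval, str):  # e.g. NaN, which is a float
        return {month: 0 for month in MONTHS}
    # Substring test, so a month abbreviation is found anywhere in the string.
    return {month: int(month in promo_interval) for month in MONTHS}

flags = promo_month_flags("Feb,May,Aug,Nov")
print([month for month, on in flags.items() if on])
print(sum(promo_month_flags(float("nan")).values()))
```

Checking the type first matters because `"Jan" in x` raises a TypeError when `x` is a float.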
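Import the required libraries

import pandas as pd
import datetime
import csv
import numpy as np
import os
import scipy as sp
import xgboost as xgb
import itertools
import operator
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.base import TransformerMixin
from sklearn.model_selection import train_test_split  # used for the local validation split
from matplotlib import pylab as plt

plot = True      # whether to plot the feature importances
goal = 'Sales'   # target column
myid = 'Id'      # id column of the test set

When your eval metric and loss function are not the same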
Early stopping
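The model keeps optimizing the original loss function, growing and adding one tree at a time, while the eval metric is watched on the validation set. Once the metric on the validation set stops improving, the ensemble stops growing.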
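The stopping rule itself is simple enough to sketch in plain Python. The function below, a hypothetical helper and not XGBoost's implementation, scans a list of per-round validation scores (lower is better) and stops once `patience` rounds pass without a new best. The scores are the logloss values from the classifier log earlier in this article.

```python
def early_stopping(scores, patience):
    """Return (best_round, stop_round) for a metric where lower is better."""
    best_round = 0
    for current_round, score in enumerate(scores):
        if score < scores[best_round]:
            best_round = current_round          # new best, reset the wait
        elif current_round - best_round >= patience:
            return best_round, current_round    # waited long enough, stop here
    return best_round, len(scores) - 1          # ran out of rounds without stopping

logloss = [0.60491, 0.55934, 0.53068, 0.51795, 0.51153, 0.50935, 0.50818,
           0.51097, 0.51760, 0.51912, 0.52503, 0.52697, 0.53335, 0.53905,
           0.54546, 0.54613, 0.54982]

best_round, stop_round = early_stopping(logloss, patience=10)
print(best_round, stop_round)
```

With a patience of 10 the best score is at round 6 and training stops at round 16, which matches where that log ends.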
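The data consist of a labeled part (the training set) and an unlabeled part that needs predictions (the test set).
Training set = the actual training set + a validation set (used for model selection and parameter tuning).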
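A minimal sketch of carving a validation set out of the labeled data, using only the standard library. The helper name `holdout_split`, the 5% validation share, and the seed are assumptions for illustration; with scikit-learn, `train_test_split` does the same job.

```python
import random

def holdout_split(rows, valid_fraction=0.05, seed=7):
    """Shuffle row indices and split them into (train, validation) lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_valid = int(round(len(rows) * valid_fraction))
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    return [rows[i] for i in train_idx], [rows[i] for i in valid_idx]

labeled = list(range(1000))   # stand-in for the labeled training rows
train_rows, valid_rows = holdout_split(labeled)
print(len(train_rows), len(valid_rows))
```

The validation rows never reach the training step, so the metric computed on them can guide early stopping and parameter choices.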
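Define some transforms and evaluation criteria
Take particular care with these when you use a different evaluation function.

def ToWeight(y):
    # y is an np.array; the weight is 1/y^2, and 0 where y == 0
    w = np.zeros(y.shape, dtype=float)
    ind = y != 0
    w[ind] = 1./(y[ind]**2)
    return w

def rmspe(yhat, y):
    # RMSPE on the original sales scale
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y - yhat)**2))
    return rmspe

def rmspe_xg(yhat, y):
    # Evaluation function for xgb.train: y is a DMatrix, and both the labels
    # and the predictions are on the log(y + 1) scale, so undo the transform first.
    # y = y.values
    y = y.get_label()
    y = np.exp(y) - 1
    yhat = np.exp(yhat) - 1
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y - yhat)**2))
    return "rmspe", rmspe

store = pd.read_csv('store.csv')
store.head()
train_df = pd.read_csv('train.csv')
train_df.head()
test_df = pd.read_csv('test.csv')
test_df.head()

Load the data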
def load_data():
    """Load the data and separate numeric from non-numeric columns."""
    store = pd.read_csv('store.csv')
    # pd.np is no longer available in recent pandas; read StateHoliday as a string
    train_org = pd.read_csv('train.csv', dtype={'StateHoliday': str})
    test_org = pd.read_csv('test.csv', dtype={'StateHoliday': str})
    train = pd.merge(train_org, store, on='Store', how='left')
    test = pd.merge(test_org, store, on='Store', how='left')
    features = test.columns.tolist()
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    features_numeric = test.select_dtypes(include=numerics).columns.tolist()
    features_non_numeric = [f for f in features if f not in features_numeric]
    return (train, test, features, features_non_numeric)

Data and feature processing
def process_data(train, test, features, features_non_numeric):
    """Feature engineering and selection."""
    # FEATURE ENGINEERING
    # Keep only rows with sales; copy so the column assignments below do not act on a view
    train = train[train['Sales'] > 0].copy()
    months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    for data in [train, test]:
        # year month day
        data['year'] = data.Date.apply(lambda x: x.split('-')[0]).astype(float)
        data['month'] = data.Date.apply(lambda x: x.split('-')[1]).astype(float)
        data['day'] = data.Date.apply(lambda x: x.split('-')[2]).astype(float)
        # promo interval "Jan,Apr,Jul,Oct" -> promojan ... promodec
        # Why isinstance(x, float)? A missing PromoInterval is NaN, which is a float,
        # and `"Jan" in x` would raise "TypeError: argument of type 'float' is not iterable".
        for m in months:
            data['promo' + m.lower()] = data.PromoInterval.apply(
                lambda x, m=m: 0 if isinstance(x, float) else 1 if m in x else 0)
    # Features set.
    noisy_features = [myid, 'Date']
    features = [c for c in features if c not in noisy_features]
    features_non_numeric = [c for c in features_non_numeric if c not in noisy_features]
    features.extend(['year', 'month', 'day'])
    # Fill NA
    class DataFrameImputer(TransformerMixin):
        # http://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn
        def __init__(self):
            """Impute missing values.
            Columns of dtype object are imputed with the most frequent value in column.
            Columns of other types are imputed with mean of column.
            """
        def fit(self, X, y=None):
            self.fill = pd.Series([X[c].value_counts().index[0]  # mode
                                   if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],  # mean
                                  index=X.columns)
            return self
        def transform(self, X, y=None):
            return X.fillna(self.fill)
    train = DataFrameImputer().fit_transform(train)
    test = DataFrameImputer().fit_transform(test)
    # Pre-processing non-numeric values
    le = LabelEncoder()
    for col in features_non_numeric:
        le.fit(list(train[col]) + list(test[col]))
        train[col] = le.transform(train[col])
        test[col] = le.transform(test[col])
    # Models such as LR and neural networks are very sensitive to the scale of
    # their inputs, so standardize the numeric columns first
    scaler = StandardScaler()
    for col in set(features) - set(features_non_numeric) - set([]):  # TODO: add what not to scale
        # StandardScaler expects 2-D input, hence the reshape
        scaler.fit(np.concatenate([train[col].values, test[col].values]).reshape(-1, 1))
        train[col] = scaler.transform(train[col].values.reshape(-1, 1)).ravel()
        test[col] = scaler.transform(test[col].values.reshape(-1, 1)).ravel()
    return (train, test, features, features_non_numeric)

Training and analysis
The model is trained on the transformed target predict_result = log(y + 1); the sales are recovered with y = e^(predict_result) - 1.

def XGB_native(train, test, features, features_non_numeric):
    depth = 6
    eta = 0.01
    ntrees = 8000
    mcw = 3
    params = {"objective": "reg:linear",  # named "reg:squarederror" in newer XGBoost releases
              "booster": "gbtree",
              "eta": eta,
              "max_depth": depth,
              "min_child_weight": mcw,
              "subsample": 0.7,
              "colsample_bytree": 0.7,
              "silent": 1}
    print("Running with params: " + str(params))
    print("Running with ntrees: " + str(ntrees))
    print("Running with features: " + str(features))
    # Train model with local split
    tsize = 0.05
    X_train, X_test = train_test_split(train, test_size=tsize)
    dtrain = xgb.DMatrix(X_train[features], np.log(X_train[goal] + 1))
    dvalid = xgb.DMatrix(X_test[features], np.log(X_test[goal] + 1))
    # XGBoost uses the last entry of evals for early stopping, so with this order it
    # monitors train-rmspe (see the log below); put dvalid last to stop on the validation set
    watchlist = [(dvalid, 'eval'), (dtrain, 'train')]
    gbm = xgb.train(params, dtrain, ntrees, evals=watchlist, early_stopping_rounds=100, feval=rmspe_xg, verbose_eval=True)
    train_probs = gbm.predict(xgb.DMatrix(X_test[features]))
    indices = train_probs < 0
    train_probs[indices] = 0
    error = rmspe(np.exp(train_probs) - 1, X_test[goal].values)
    print(error)
    # Predict and Export
    test_probs = gbm.predict(xgb.DMatrix(test[features]))
    indices = test_probs < 0
    test_probs[indices] = 0
    submission = pd.DataFrame({myid: test[myid], goal: np.exp(test_probs) - 1})
    if not os.path.exists('result/'):
        os.makedirs('result/')
    submission.to_csv("./result/dat-xgb_d%s_eta%s_ntree%s_mcw%s_tsize%s.csv" % (str(depth), str(eta), str(ntrees), str(mcw), str(tsize)), index=False)
    # Feature importance
    if plot:
        outfile = open('xgb.fmap', 'w')
        i = 0
        for feat in features:
            outfile.write('{0}\t{1}\tq\n'.format(i, feat))
            i = i + 1
        outfile.close()
        importance = gbm.get_fscore(fmap='xgb.fmap')
        importance = sorted(importance.items(), key=operator.itemgetter(1))
        df = pd.DataFrame(importance, columns=['feature', 'fscore'])
        df['fscore'] = df['fscore'] / df['fscore'].sum()
        # Plotitup
        plt.figure()
        df.plot()
        df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(25, 15))
        plt.title('XGBoost Feature Importance')
        plt.xlabel('relative importance')
        plt.gcf().savefig('Feature_Importance_xgb_d%s_eta%s_ntree%s_mcw%s_tsize%s.png' % (str(depth), str(eta), str(ntrees), str(mcw), str(tsize)))

print("=> Loading data...")
train, test, features, features_non_numeric = load_data()
print("=> Processing data and engineering features...")
train, test, features, features_non_numeric = process_data(train, test, features, features_non_numeric)
print("=> Modeling with XGBoost...")
XGB_native(train, test, features, features_non_numeric)
train.head()
# => Loading data...
# => Processing data and engineering features...
# => Modeling with XGBoost...
# Running with params: {'subsample': 0.7, 'eta': 0.01, 'colsample_bytree': 0.7, 'silent': 1, 'objective': 'reg:linear', 'max_depth': 6, 'min_child_weight': 3, 'booster': 'gbtree'}
# Running with ntrees: 8000
# Running with features: ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'year', 'month', 'day']
# [0]	eval-rmspe:0.999864	train-rmspe:0.999864
# Multiple eval metrics have been passed: 'train-rmspe' will be used for early stopping.
# Will train until train-rmspe hasn't improved in 100 rounds.
# [1]	eval-rmspe:0.999838	train-rmspe:0.999837
# [2]	eval-rmspe:0.99981	train-rmspe:0.999809
# [3]	eval-rmspe:0.999779	train-rmspe:0.999779
# . . .
# [503]	eval-rmspe:0.314933	train-rmspe:0.342737
# [504]	eval-rmspe:0.315016	train-rmspe:0.342834
# [505]	eval-rmspe:0.31512	train-rmspe:0.342928
# Stopping. Best iteration:
# [405]	eval-rmspe:0.312829	train-rmspe:0.33589
# 0.315119522982