Kaggle Tabular Playground Series - Jan 2022 Study Notes 2 (Linear Regression with Time Series)
In the previous post we did a simple exploratory analysis of TPSJAN22 (see: Kaggle Tabular Playground Series - Jan 2022 Study Notes 1 (Data Analysis)). Now we try to train a model using time-series features and linear regression.
Competition: Tabular Playground Series - Jan 2022
This post follows: TPSJAN22-03 Linear Model
import pandas as pd
import numpy as np
import pickle
import math
import matplotlib.pyplot as plt
import dateutil.easter as easter
import matplotlib.dates as mdates
from matplotlib.ticker import MaxNLocator
from datetime import datetime, date, timedelta
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LinearRegression, HuberRegressor, Ridge, Lasso

original_train_df = pd.read_csv('../datas/train.csv')
original_test_df = pd.read_csv('../datas/test.csv')
gdp_df = pd.read_csv('../datas/GDP_data_2015_to_2019_Finland_Norway_Sweden.csv')
gdp_df.set_index('year', inplace=True)

for df in [original_train_df, original_test_df]:
    df['date'] = pd.to_datetime(df.date)

original_train_df.head(2)
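The feature engineering below looks GDP up as gdp_df.loc[year, 'GDP_' + country], so the GDP file is assumed to be indexed by year (2015 to 2019) with one column per country; a quick check (expected output inferred from that usage, not shown in the original post):

print(gdp_df.columns.tolist())  # expected: ['GDP_Finland', 'GDP_Norway', 'GDP_Sweden']
print(gdp_df.index.tolist())    # expected: [2015, 2016, 2017, 2018, 2019]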
TPSJAN22 is scored with SMAPE, which scikit-learn does not provide out of the box, so we first define a custom loss function.
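The definition itself is not preserved in this post; a minimal sketch consistent with how smape_loss is called later (elementwise values that are then averaged with np.mean) would be:

def smape_loss(y_true, y_pred):
    """Elementwise SMAPE in percent: 200 * |pred - true| / (|true| + |pred|)."""
    # num_sold is strictly positive in this competition, so the denominator
    # cannot vanish here; a fully general version would have to handle 0/0.
    return 200 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred))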
Next, we engineer the features according to the conjectures from our earlier data analysis.
def engineer(df):
    def get_gdp(row):
        country = 'GDP_' + row.country
        return gdp_df.loc[row.date.year, country]

    # GDP information plus weekly seasonal indicators (Monday is the baseline)
    new_df = pd.DataFrame({
        'gdp': np.log(df.apply(get_gdp, axis=1)),
        'wd2': df.date.dt.weekday == 1,
        'wd3': df.date.dt.weekday == 2,
        'wd4': df.date.dt.weekday == 3,
        'wd5': df.date.dt.weekday == 4,
        'wd6': df.date.dt.weekday == 5,
        'wd7': df.date.dt.weekday == 6,
    })

    # One-hot encode country, store and product (the last category of each is the baseline)
    for country in ['Finland', 'Norway']:
        new_df[country] = df.country == country
    new_df['KaggleRama'] = df.store == 'KaggleRama'
    for product in ['Kaggle Mug', 'Kaggle Hat']:
        new_df[product] = df['product'] == product

    # Fourier features: sin/cos pairs of order k = 1, 2 for each product
    dayofyear = df.date.dt.dayofyear
    for k in range(1, 3):
        new_df[f'sin{k}'] = np.sin(dayofyear / 365 * 2 * math.pi * k)
        new_df[f'cos{k}'] = np.cos(dayofyear / 365 * 2 * math.pi * k)
        new_df[f'mug_sin{k}'] = new_df[f'sin{k}'] * new_df['Kaggle Mug']
        new_df[f'mug_cos{k}'] = new_df[f'cos{k}'] * new_df['Kaggle Mug']
        new_df[f'hat_sin{k}'] = new_df[f'sin{k}'] * new_df['Kaggle Hat']
        new_df[f'hat_cos{k}'] = new_df[f'cos{k}'] * new_df['Kaggle Hat']
    return new_df

Let's take a look at the engineered features:
train_df = engineer(original_train_df)
train_df['date'] = original_train_df.date
train_df['num_sold'] = original_train_df.num_sold.astype(np.float32)

test_df = engineer(original_test_df)
features = test_df.columns
for df in [train_df, test_df]:
    df[features] = df[features].astype(np.float32)

train_df
list(train_df)

Output:

['gdp', 'wd2', 'wd3', 'wd4', 'wd5', 'wd6', 'wd7', 'Finland', 'Norway', 'KaggleRama', 'Kaggle Mug', 'Kaggle Hat', 'sin1', 'cos1', 'mug_sin1', 'mug_cos1', 'hat_sin1', 'hat_cos1', 'sin2', 'cos2', 'mug_sin2', 'mug_cos2', 'hat_sin2', 'hat_cos2', 'date', 'num_sold']
Now let's train the model:
def fit_model(X_tr):
    # Preprocess the data
    X_tr_f = X_tr[features]
    preproc = StandardScaler()
    X_tr_f = preproc.fit_transform(X_tr_f)
    y_tr = X_tr.num_sold.values.reshape(-1, 1)

    model = LinearRegression()
    # scikit-learn has no SMAPE loss, so we take the log of the target and let
    # LinearRegression minimize its default least-squares loss on log(y), which
    # approximates optimizing SMAPE. For details see:
    # https://www.kaggle.com/code/ambrosm/tpsjan22-03-linear-model/notebook#Training-the-simple-model-(without-holidays)
    # and this discussion:
    # https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/298473
    model.fit(X_tr_f, np.log(y_tr).ravel())
    return preproc, model

preproc, model = fit_model(train_df)

train_pred_df = original_train_df.copy()
# The model predicts log(num_sold), so exponentiate to recover the sales counts
train_pred_df['pred'] = np.exp(model.predict(preproc.transform(train_df[features])))
train_pred_df
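Why does the log trick approximate SMAPE? Roughly (my summary of the linked discussion, not a quote from it): write r = y_pred / y_true. Then SMAPE = 200 * |r - 1| / (r + 1), and for r close to 1 this is approximately 100 * |r - 1|, which in turn is approximately 100 * |ln r| = 100 * |log(y_pred) - log(y_true)|. So driving the errors on the log targets toward zero also drives SMAPE toward zero, which makes least squares on log(num_sold) a reasonable surrogate.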
Next, let's pick Finland and look at the residuals between predictions and actual sales for each product in each store.
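The plots that follow are produced by two helper functions, plot_all_residuals and plot_around, whose definitions the original post does not show. A minimal sketch of what they might look like, assuming residuals is a pandas Series of daily SMAPE-style errors indexed by date (it is computed exactly that way in the training cells below):

def plot_all_residuals(residuals):
    """Scatter the daily residuals over the full five years."""
    plt.figure(figsize=(20, 6))
    plt.scatter(residuals.index, residuals.values, s=3)
    plt.axhline(0, color='k', linewidth=0.5)
    plt.gca().xaxis.set_major_locator(mdates.YearLocator())
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    plt.ylabel('residual (SMAPE points)')
    plt.show()

def plot_around(residuals, month, day, days):
    """Overlay each year's residuals in a window of +/- `days` days around month/day."""
    plt.figure(figsize=(20, 6))
    for year in range(2015, 2020):
        center = pd.Timestamp(year, month, day)
        window = residuals[(residuals.index >= center - pd.Timedelta(days=days)) &
                           (residuals.index <= center + pd.Timedelta(days=days))]
        plt.plot((window.index - center).days, window.values, label=str(year))
    plt.axhline(0, color='k', linewidth=0.5)
    plt.xlabel(f'days relative to {month}/{day}')
    plt.ylabel('residual (SMAPE points)')
    plt.legend()
    plt.show()

For example, plot_around(residuals, 12, 1, 30) overlays each year's residuals in a 30-day window around December 1, which makes year-to-year holiday shifts easy to spot.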
The residual distributions are very similar for every store and every product, so we can simply look at the daily residuals over all of Finland's products.
There are large errors every January and December, which matches our conjecture from the earlier feature analysis: we probably need to mark the dates around the turn of the year as features. Zooming in on the December and January errors:
The residuals swing wildly between December 24 and January 4, so let's flag the dates from December 24 through January 4 of each year as features and retrain the model:
# Feature engineering
def engineer(df):
    """Return a new dataframe with the engineered features"""
    def get_gdp(row):
        country = 'GDP_' + row.country
        return gdp_df.loc[row.date.year, country]

    new_df = pd.DataFrame({
        'gdp': np.log(df.apply(get_gdp, axis=1)),
        'wd2': df.date.dt.weekday == 1,
        'wd3': df.date.dt.weekday == 2,
        'wd4': df.date.dt.weekday == 3,
        'wd5': df.date.dt.weekday == 4,
        'wd6': df.date.dt.weekday == 5,
        'wd7': df.date.dt.weekday == 6,
    })

    # One-hot encoding (no need to encode the last categories)
    for country in ['Finland', 'Norway']:
        new_df[country] = df.country == country
    new_df['KaggleRama'] = df.store == 'KaggleRama'
    for product in ['Kaggle Mug', 'Kaggle Hat']:
        new_df[product] = df['product'] == product

    # Seasonal variations (Fourier series); the three products have different seasonal patterns
    dayofyear = df.date.dt.dayofyear
    for k in range(1, 3):
        temp_sin = np.sin(dayofyear / 365 * 2 * math.pi * k)
        temp_cos = np.cos(dayofyear / 365 * 2 * math.pi * k)
        new_df[f'mug_sin{k}'] = temp_sin * new_df['Kaggle Mug']
        new_df[f'mug_cos{k}'] = temp_cos * new_df['Kaggle Mug']
        new_df[f'hat_sin{k}'] = temp_sin * new_df['Kaggle Hat']
        new_df[f'hat_cos{k}'] = temp_cos * new_df['Kaggle Hat']

    # End-of-year dummies: December 24-31 and January 1-4
    new_df = pd.concat([new_df,
        pd.DataFrame({f"dec{d}": (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Finland')
                      for d in range(24, 32)}),
        pd.DataFrame({f"jan{d}": (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Finland')
                      for d in range(1, 5)})],
        axis=1)
    return new_df

train_df = engineer(original_train_df)
train_df['date'] = original_train_df.date
train_df['num_sold'] = original_train_df.num_sold.astype(np.float32)
test_df = engineer(original_test_df)
features = test_df.columns
for df in [train_df, test_df]:
    df[features] = df[features].astype(np.float32)

preproc, model = fit_model(train_df)

train_pred_df = original_train_df.copy()
train_pred_df['pred'] = np.exp(model.predict(preproc.transform(train_df[features])))

by_date = train_pred_df[train_pred_df.country == 'Finland'].groupby(train_pred_df['date'])
residuals = (by_date.pred.mean() - by_date.num_sold.mean()) / (by_date.pred.mean() + by_date.num_sold.mean()) * 200

plot_all_residuals(residuals)
plot_around(residuals, 1, 1, 30)
There still seem to be noticeable errors between January 4 and 15, and the start of December also looks off; let's first extend the January feature range to the 15th.
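Concretely, only the upper bound of the Finland January dummies changes; everything else in engineer() stays as above (sketch of the changed expression):

pd.DataFrame({f"jan{d}": (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Finland')
              for d in range(1, 16)})   # was range(1, 5)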
The errors at the end of December and the beginning of January are now much improved. But the scatter plot above still shows clear errors around April, May, June, July, November and December every year. Let's zoom in on those intervals.
In April a similar fluctuation appears every year, but in some years it comes earlier and in others later; the end of June and the beginning of November show the same behavior. Starting May 1 each year there is a fluctuation lasting about ten days, and early December looks similar. Recalling the late-December and early-January pattern, a plausible conjecture is that these error spikes are tied to public holidays. The official holiday lists for the three countries, shared by a Kaggle user, offer some hints:
Holidays_Finland_Norway_Sweden_2015-2019
New Year's Day (January) falls on a fixed date; Easter (April) moves from year to year; International Workers' Day (May 1) is fixed; and so on. Comparing the holiday list against the residuals above, we can revise our feature engineering.
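Easter is the one movable feast here, but its date need not be hard-coded: dateutil's easter module (imported at the top) computes it, and the revised engineer() below relies on exactly that. For example:

# Western Easter Sunday for the training years
for year in range(2015, 2020):
    print(year, easter.easter(year))
# 2015 2015-04-05
# 2016 2016-03-27
# 2017 2017-04-16
# 2018 2018-04-01
# 2019 2019-04-21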
# Feature engineering
def engineer(df):
    def get_gdp(row):
        country = 'GDP_' + row.country
        return gdp_df.loc[row.date.year, country]

    new_df = pd.DataFrame({
        'gdp': np.log(df.apply(get_gdp, axis=1)),
        'wd2': df.date.dt.weekday == 1,
        'wd3': df.date.dt.weekday == 2,
        'wd4': df.date.dt.weekday == 3,
        'wd5': df.date.dt.weekday == 4,
        'wd6': df.date.dt.weekday == 5,
        'wd7': df.date.dt.weekday == 6,
    })

    # One-hot encoding (no need to encode the last categories)
    for country in ['Finland', 'Norway']:
        new_df[country] = df.country == country
    new_df['KaggleRama'] = df.store == 'KaggleRama'
    for product in ['Kaggle Mug', 'Kaggle Hat']:
        new_df[product] = df['product'] == product

    dayofyear = df.date.dt.dayofyear
    for k in range(1, 3):
        temp_sin = np.sin(dayofyear / 365 * 2 * math.pi * k)
        temp_cos = np.cos(dayofyear / 365 * 2 * math.pi * k)
        new_df[f'mug_sin{k}'] = temp_sin * new_df['Kaggle Mug']
        new_df[f'mug_cos{k}'] = temp_cos * new_df['Kaggle Mug']
        new_df[f'hat_sin{k}'] = temp_sin * new_df['Kaggle Hat']
        new_df[f'hat_cos{k}'] = temp_cos * new_df['Kaggle Hat']

    # Christmas and New Year's Day
    new_df = pd.concat([new_df,
        pd.DataFrame({f"dec{d}": (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Finland')
                      for d in range(24, 32)}),
        pd.DataFrame({f"jan{d}": (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Finland')
                      for d in range(1, 16)}),
        pd.DataFrame({f"n-dec{d}": (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Norway')
                      for d in range(24, 32)}),
        pd.DataFrame({f"n-jan{d}": (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Norway')
                      for d in range(1, 10)}),
        pd.DataFrame({f"s-dec{d}": (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Sweden')
                      for d in range(24, 32)}),
        pd.DataFrame({f"s-jan{d}": (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Sweden')
                      for d in range(1, 16)})],
        axis=1)

    # Early May (Labour Day); for Norway also mid-May around Constitution Day (May 17)
    new_df = pd.concat([new_df,
        pd.DataFrame({f"may{d}": (df.date.dt.month == 5) & (df.date.dt.day == d)
                      for d in list(range(1, 11))}),  # + list(range(17, 25))
        pd.DataFrame({f"may{d}": (df.date.dt.month == 5) & (df.date.dt.day == d) & (df.country == 'Norway')
                      for d in list(range(16, 28))})],
        axis=1)

    # Early June for Sweden
    new_df = pd.concat([new_df,
        pd.DataFrame({f"june{d}": (df.date.dt.month == 6) & (df.date.dt.day == d) & (df.country == 'Sweden')
                      for d in list(range(7, 15))})],
        axis=1)

    # Midsummer Day
    midsummer_day_date = df.date.dt.year.map({
        2015: pd.Timestamp('2015-06-20'),
        2016: pd.Timestamp('2016-06-25'),
        2017: pd.Timestamp('2017-06-24'),
        2018: pd.Timestamp('2018-06-23'),
        2019: pd.Timestamp('2019-06-22')})
    new_df = pd.concat([new_df,
        pd.DataFrame({f"midsummer_day{d}": (df.date - midsummer_day_date == np.timedelta64(d, "D")) & (df.country != 'Norway')
                      for d in list(range(-2, 11))})],
        axis=1)

    # First Sunday of November
    sun_nov_date = df.date.dt.year.map({
        2015: pd.Timestamp('2015-11-1'),
        2016: pd.Timestamp('2016-11-6'),
        2017: pd.Timestamp('2017-11-5'),
        2018: pd.Timestamp('2018-11-4'),
        2019: pd.Timestamp('2019-11-3')})
    new_df = pd.concat([new_df,
        pd.DataFrame({f"sun_nov{d}": (df.date - sun_nov_date == np.timedelta64(d, "D")) & (df.country != 'Norway')
                      for d in list(range(-1, 10))})],
        axis=1)

    # First half of December (Independence Day of Finland, 6th of December)
    new_df = pd.concat([new_df,
        pd.DataFrame({f"dec{d}": (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Finland')
                      for d in list(range(6, 14))})],
        axis=1)

    # Easter, plus the windows around Ascension (~39 days after) and Pentecost (~49 days after)
    easter_date = df.date.apply(lambda date: pd.Timestamp(easter.easter(date.year)))
    new_df = pd.concat([new_df,
        pd.DataFrame({f"easter{d}": (df.date - easter_date == np.timedelta64(d, "D")) & (df.country == 'Finland')
                      for d in list(range(-2, 11)) + list(range(40, 48)) + list(range(50, 59))})],
        axis=1)
    new_df = pd.concat([new_df,
        pd.DataFrame({f"n-easter{d}": (df.date - easter_date == np.timedelta64(d, "D")) & (df.country == 'Norway')
                      for d in list(range(-2, 11)) + list(range(40, 48)) + list(range(50, 59))})],
        axis=1)
    new_df = pd.concat([new_df,
        pd.DataFrame({f"s-easter{d}": (df.date - easter_date == np.timedelta64(d, "D")) & (df.country == 'Sweden')
                      for d in list(range(-2, 11)) + list(range(40, 48)) + list(range(50, 59))})],
        axis=1)
    return new_df

train_df = engineer(original_train_df)
train_df['date'] = original_train_df.date
train_df['num_sold'] = original_train_df.num_sold.astype(np.float32)
test_df = engineer(original_test_df)
features = test_df.columns
for df in [train_df, test_df]:
    df[features] = df[features].astype(np.float32)

preproc, model = fit_model(train_df)

train_pred_df = original_train_df.copy()
train_pred_df['pred'] = np.exp(model.predict(preproc.transform(train_df[features])))

by_date = train_pred_df[train_pred_df.country == 'Norway'].groupby(train_pred_df['date'])
residuals = (by_date.pred.sum() - by_date.num_sold.sum()) / (by_date.pred.sum() + by_date.num_sold.sum()) * 200

plot_all_residuals(residuals)
plot_around(residuals, 1, 1, 15)
plot_around(residuals, 4, 1, 15)
plot_around(residuals, 5, 1, 30)
plot_around(residuals, 6, 1, 30)
plot_around(residuals, 7, 1, 30)
plot_around(residuals, 11, 1, 15)
plot_around(residuals, 12, 1, 30)

After adding the yearly holidays as features, the residuals improve considerably. Next, let's look at the evaluation score; for that we extend fit_model with a validation path and run a grouped cross-validation:
def fit_model(X_tr, X_va=None, outliers=False):
    """Scale the data, fit the model and, if given, evaluate on the validation set"""
    start_time = datetime.now()

    # Preprocess the data
    X_tr_f = X_tr[features]
    preproc = StandardScaler()
    X_tr_f = preproc.fit_transform(X_tr_f)
    y_tr = X_tr.num_sold.values.reshape(-1, 1)

    model = LinearRegression()
    model.fit(X_tr_f, np.log(y_tr).ravel())

    if X_va is not None:
        # Preprocess the validation data
        X_va_f = X_va[features]
        X_va_f = preproc.transform(X_va_f)
        y_va = X_va.num_sold.values.reshape(-1, 1)

        # Inference for validation (run, fold, oof and score_list are
        # globals set up by the cross-validation loop below)
        y_va_pred = np.exp(model.predict(X_va_f)).reshape(-1, 1)
        oof.update(pd.Series(y_va_pred.ravel(), index=X_va.index))

        # Evaluation: execution time and SMAPE
        smape_before_correction = np.mean(smape_loss(y_va, y_va_pred))
        #y_va_pred *= LOSS_CORRECTION
        smape = np.mean(smape_loss(y_va, y_va_pred))
        print(f"Fold {run}.{fold} | {str(datetime.now() - start_time)[-12:-7]}"
              f" | SMAPE: {smape:.5f} (before correction: {smape_before_correction:.5f})")
        score_list.append(smape)

        # Plot y_true vs. y_pred
        if fold == 0:
            plt.figure(figsize=(10, 10))
            plt.scatter(y_va, y_va_pred, s=1, color='r')
            #plt.scatter(np.log(y_va), np.log(y_va_pred), s=1, color='g')
            plt.plot([plt.xlim()[0], plt.xlim()[1]], [plt.xlim()[0], plt.xlim()[1]], '--', color='k')
            plt.gca().set_aspect('equal')
            plt.xlabel('y_true')
            plt.ylabel('y_pred')
            plt.title('OOF Predictions')
            plt.show()

    return preproc, model

#%%time
RUNS = 1  # should be 1; increase only to see how the result depends on the random seed
OUTLIERS = True
TRAIN_VAL_CUT = datetime(2018, 1, 1)
LOSS_CORRECTION = 1

# Make the results reproducible
np.random.seed(202100)

total_start_time = datetime.now()
oof = pd.Series(0.0, index=train_df.index)
score_list = []
for run in range(RUNS):
    kf = GroupKFold(n_splits=4)
    # Group by year so every validation fold consists of whole, unseen years
    for fold, (train_idx, val_idx) in enumerate(kf.split(train_df, groups=train_df.date.dt.year)):
        X_tr = train_df.iloc[train_idx]
        X_va = train_df.iloc[val_idx]
        print(f"Fold {run}.{fold}")
        preproc, model = fit_model(X_tr, X_va)
print(f"Average SMAPE: {sum(score_list) / len(score_list):.5f}")

with open('oof.pickle', 'wb') as handle:
    pickle.dump(oof, handle)

# Plot num_sold (true and predicted, five years) for one country-store-product combination
def plot_five_years_combination(engineer, country='Norway', store='KaggleMart', product='Kaggle Hat'):
    demo_df = pd.DataFrame({
        'row_id': 0,
        'date': pd.date_range('2015-01-01', '2019-12-31', freq='D'),
        'country': country,
        'store': store,
        'product': product})
    demo_df.set_index('date', inplace=True, drop=False)
    demo_df = engineer(demo_df)
    demo_df['num_sold'] = np.exp(model.predict(preproc.transform(demo_df[features])))

    plt.figure(figsize=(20, 6))
    plt.plot(np.arange(len(demo_df)), demo_df.num_sold, label='prediction')
    train_subset = train_df[(original_train_df.country == country) &
                            (original_train_df.store == store) &
                            (original_train_df['product'] == product)]
    #plt.scatter(np.arange(len(train_subset)), train_subset.num_sold, label='true', alpha=0.5, color='red', s=3)
    plt.plot(np.arange(len(train_subset)), train_subset.num_sold, label='true', color='red')
    plt.legend()
    plt.title('Predictions and true num_sold for five years')
    plt.show()

plot_five_years_combination(engineer)

Generating the submission data:
# Fit the model on the complete training data
train_idx = np.arange(len(train_df))
X_tr = train_df.iloc[train_idx]
preproc, model = fit_model(X_tr, None)
plot_five_years_combination(engineer)  # Quick check for debugging

# Inference for test
test_pred_list = []
test_pred_list.append(np.exp(model.predict(preproc.transform(test_df[features]))) * LOSS_CORRECTION)

# Create the submission file
sub = original_test_df[['row_id']].copy()
sub['num_sold'] = sum(test_pred_list) / len(test_pred_list)
sub.to_csv('submission_linear_model.csv', index=False)

# Plot the distribution of the test predictions
plt.figure(figsize=(16, 3))
plt.hist(train_df['num_sold'], bins=np.linspace(0, 3000, 201), density=True, label='Training')
plt.hist(sub['num_sold'], bins=np.linspace(0, 3000, 201), density=True, rwidth=0.5, label='Test predictions')
plt.xlabel('num_sold')
plt.ylabel('Frequency')
plt.legend()
plt.show()

sub
Since the number of units sold is an integer, let's also round the predictions and submit both files to see which scores better.
# Create a rounded submission file
sub_rounded = sub.copy()
sub_rounded['num_sold'] = sub_rounded['num_sold'].round()
sub_rounded.to_csv('submission_linear_model_rounded.csv', index=False)
sub_rounded

Score without rounding:
Score with rounding:
Related links:
Kaggle Tabular Playground Series - Jan 2022 Study Notes 1 (Data Analysis)