
Kaggle Tabular Playground Series - Jan 2022 Study Notes 1 (Data Analysis)

Published: 2023/12/20 43 豆豆


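Competition page: Tabular Playground Series - Jan 2022

Overview: the dataset gives the daily sales of three products at two stores in three countries from 2015 to 2018. The task is to predict the sales for 2019.

This post follows the notebook "TPSFEB22-01 EDA which makes sense".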

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

Load the data

train_data = pd.read_csv('../datas/train.csv')
test_data = pd.read_csv('../datas/test.csv')

for df in [train_data, test_data]:
    # Parse the dates and use them as the index, keeping 'date' as a column too
    df['date'] = pd.to_datetime(df.date)
    df.set_index('date', inplace=True, drop=False)

# Shape and preview
print('Training data df shape:', train_data.shape)
print('Test data df shape:', test_data.shape)
train_data.head()

Check for missing values

print('Number of missing values in training set:', train_data.isna().sum().sum())
print('')
print('Number of missing values in test set:', test_data.isna().sum().sum())

Check the number of distinct values in each column

print('Training cardinalities: \n', train_data.nunique())
print('')
print('Test cardinalities: \n', test_data.nunique())

Check the date range

print('Training data:')
print('Min date', train_data['date'].min())
print('Max date', train_data['date'].max())
print('')
print('Test data:')
print('Min date', test_data['date'].min())
print('Max date', test_data['date'].max())

To sum up: there are three countries, two stores and three products, which gives 3 × 2 × 3 = 18 combinations. The training data covers 2015 to 2018, and the test data asks us to predict 2019. Neither the training data nor the test data has missing values.
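A quick way to confirm both the combination count and that every combination has one row per day is to count rows per group. The following is only a sketch on a synthetic frame: the store labels and the third product label are placeholders, and `toy` stands in for `train_data`.

```python
import itertools

import pandas as pd

# Placeholder labels: only the structure (3 x 2 x 3) follows the text above.
countries = ["Finland", "Norway", "Sweden"]
stores = ["store_A", "store_B"]
products = ["Kaggle Hat", "Kaggle Mug", "product_C"]
dates = pd.date_range("2015-01-01", "2018-12-31", freq="D")

# Synthetic stand-in for train_data: one row per date and combination.
toy = pd.DataFrame(
    list(itertools.product(dates, countries, stores, products)),
    columns=["date", "country", "store", "product"],
)

# Number of rows in each (country, store, product) group.
rows_per_combo = toy.groupby(["country", "store", "product"]).size()
n_combos = len(rows_per_combo)

# True only if every combination has exactly one row per calendar day.
complete = bool((rows_per_combo == len(dates)).all())
print(n_combos, complete)
```

Running the same `groupby(...).size()` on the real training frame would show whether any combination is missing days.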

Let's plot the daily sales of each combination

plt.figure(figsize=(18, 15))
for i, (combi, df) in enumerate(train_data.groupby(['country', 'store', 'product'])):
    # One subplot per (country, store, product) combination
    ax = plt.subplot(6, 3, i+1, ymargin=0.5)
    ax.plot(df.index, df.num_sold)
    ax.set_title(combi)
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%y/%m/%d'))
    ax.xaxis.set_minor_locator(mdates.MonthLocator())
plt.tight_layout(h_pad=3.0)
plt.suptitle('Daily sales for 2015-2018', y=1.03)
plt.show()

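The plots show that sales of every product at the end of each year are far higher than usual, so the year-end dates may need to be extracted as separate features. Kaggle Hat and Kaggle Mug also appear to be seasonal, so consider adding Fourier features.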
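Fourier features encode a yearly cycle as pairs of sine and cosine terms that a linear model can weight. Here is a minimal sketch of the idea. The helper name `fourier_features`, the 365.25-day period, the 2015-01-01 origin and `order=2` are all assumptions made for illustration, not values taken from the referenced notebook.

```python
import numpy as np
import pandas as pd


def fourier_features(dates, period=365.25, order=2):
    """Hypothetical helper: sin/cos pairs for an annual cycle."""
    dates = pd.DatetimeIndex(dates)
    # Days elapsed since an arbitrary fixed origin (assumed: 2015-01-01).
    t = (dates - pd.Timestamp("2015-01-01")).days.to_numpy()
    features = {}
    for k in range(1, order + 1):
        # k-th harmonic: k full cycles per period.
        angle = 2 * np.pi * k * t / period
        features[f"sin_{k}"] = np.sin(angle)
        features[f"cos_{k}"] = np.cos(angle)
    return pd.DataFrame(features, index=dates)


dates = pd.date_range("2015-01-01", "2018-12-31", freq="D")
ff = fourier_features(dates, order=2)
print(ff.shape)
```

A higher `order` lets the model follow sharper seasonal shapes at the cost of more columns. Whether two harmonics are enough for these products is something to check against the plots.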

plt.figure(figsize=(18, 20))
for i, (combi, df) in enumerate(train_data.groupby(['country', 'store', 'product'])):
    ax = plt.subplot(6, 3, i+1, ymargin=0.5)
    # Monthly totals of num_sold only; summing the whole frame would also
    # try to sum the text and date columns, which fails in recent pandas
    resampled = df['num_sold'].resample('MS').sum()
    # One line per year, months 1-12 on the x-axis
    for year, monthly in resampled.groupby(resampled.index.year):
        ax.plot(range(1, 13), monthly, label=year)
    ax.legend()
    ax.set_xticks(ticks=range(1, 13))
    # ax.set_xticklabels('JFMAMJJASOND')
    ax.set_title(combi)
plt.suptitle('Monthly sales for 2015-2018', y=1.03)
plt.tight_layout(h_pad=3.0)
plt.show()

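The month-to-month pattern is very similar from year to year. However, sales do not simply grow every year: most noticeably, Norway's monthly sales in 2016 are lower than in 2015! So other factors are probably involved.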

Next, let's check whether there is a weekly seasonal pattern

plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(train_data.groupby(['country', 'store', 'product'])):
    ax = plt.subplot(6, 3, i+1, ymargin=0.5)
    # Mean sales for each day of the week (Monday=0 ... Sunday=6);
    # only num_sold is averaged, since the other columns are not numeric
    resampled = df.groupby(df.index.dayofweek)['num_sold'].mean()
    ax.bar(range(7), resampled, color=['b']*4 + ['g'] + ['orange']*2)
    ax.set_title(combi)
    ax.set_xticks(range(7))
    ax.set_xticklabels(['M', 'T', 'W', 'T', 'F', 'S', 'S'])
    ax.set_ylim(0, resampled.max())
plt.suptitle('Sales per day of the week', y=1.03)
plt.tight_layout(h_pad=3.0)
plt.show()

Sales rise noticeably on weekends, so consider adding weekly seasonal indicators.
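Weekly seasonal indicators are usually just one-hot columns for the day of the week. The sketch below builds them with `pd.get_dummies`. The column prefix `dow` and the extra `is_weekend` flag are illustrative choices, not something the original notebook prescribes.

```python
import pandas as pd

# Two weeks of dates; 2015-01-01 is a Thursday.
dates = pd.date_range("2015-01-01", "2015-01-14", freq="D")

# dayofweek: Monday=0 ... Sunday=6
dow = pd.Series(dates.dayofweek, index=dates, name="dow")

# One indicator column per weekday; drop_first removes Monday so the
# columns are not perfectly collinear with an intercept term.
indicators = pd.get_dummies(dow, prefix="dow", drop_first=True).astype(int)

# Optional coarser feature: a single Saturday/Sunday flag.
indicators["is_weekend"] = (dow >= 5).astype(int)
print(indicators.head(7))
```

Per-day indicators let a model treat Friday differently from Monday to Thursday, while the single weekend flag is the cheaper alternative if only the Saturday/Sunday lift matters.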

Next, let's look at average sales by day in December and January

plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(train_data.groupby(['country', 'store', 'product'])):
    ax = plt.subplot(6, 3, i+1, ymargin=0.5)
    # Mean sales for each day of December, averaged over the four years
    ax.bar(range(1, 32),
           df.num_sold[df.date.dt.month==12].groupby(df.date.dt.day).mean(),
           color=['b'] * 25 + ['orange'] * 6)
    ax.set_title(combi)
    ax.set_xticks(ticks=range(5, 31, 5))
plt.tight_layout(h_pad=3.0)
plt.suptitle('Daily sales for December', y=1.03)
plt.show()

plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(train_data.groupby(['country', 'store', 'product'])):
    ax = plt.subplot(6, 3, i+1, ymargin=0.5)
    # Mean sales for each day of January, averaged over the four years
    ax.bar(range(1, 32),
           df.num_sold[df.date.dt.month==1].groupby(df.date.dt.day).mean(),
           color=['b'] * 5 + ['orange'] * 26)
    ax.set_title(combi)
    ax.set_xticks(ticks=range(5, 31, 5))
plt.tight_layout(h_pad=3.0)
plt.suptitle('Daily sales for January', y=1.03)
plt.show()

Sales start to climb around December 25 and are largely back to normal by January 5.
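One simple way to turn this observation into a feature is a 0/1 flag for dates inside the year-end window. This is only a sketch: the helper name `year_end_flag` is hypothetical, and the exact boundaries (December 25 through January 5, inclusive) are an assumption read off the approximate dates above.

```python
import pandas as pd


def year_end_flag(dates):
    """Hypothetical helper: 1 for Dec 25 - Jan 5 (inclusive), else 0."""
    dates = pd.DatetimeIndex(dates)
    in_december = (dates.month == 12) & (dates.day >= 25)
    in_january = (dates.month == 1) & (dates.day <= 5)
    return pd.Series(
        (in_december | in_january).astype(int), index=dates, name="year_end"
    )


dates = pd.date_range("2018-12-20", "2019-01-10", freq="D")
flag = year_end_flag(dates)
print(flag[flag == 1].index.min(), flag[flag == 1].index.max())
```

Because the bar charts show the lift varying from day to day inside the window, a separate indicator per day of the window may fit better than a single flag. The flag is the simplest starting point.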

Earlier we saw that Norway's monthly sales in 2016 were lower than in 2015. It turns out this may be related to GDP.

Reference discussion 1

Reference discussion 2

gdp_df = pd.read_csv('../datas/GDP_data_2015_to_2019_Finland_Norway_Sweden.csv')
gdp_df.set_index('year', inplace=True)
gdp_df
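To use GDP as a feature, each sales row needs the GDP value for its own country and year. The sketch below shows one way to attach it. The wide layout (one column per country, indexed by year) is an assumption about the CSV, and every number in it is a placeholder, not a real GDP figure or a real sales value.

```python
import pandas as pd

# ASSUMED layout: index = year, one column per country. Placeholder values.
gdp_wide = pd.DataFrame(
    {"Finland": [1.0, 1.1], "Norway": [2.0, 1.9], "Sweden": [3.0, 3.1]},
    index=pd.Index([2015, 2016], name="year"),
)

# Wide -> long: one row per (year, country) pair.
gdp_long = gdp_wide.reset_index().melt(
    id_vars="year", var_name="country", value_name="gdp"
)

# Tiny stand-in for the sales data (placeholder values).
sales = pd.DataFrame(
    {
        "date": pd.to_datetime(["2015-03-01", "2016-03-01", "2016-07-15"]),
        "country": ["Norway", "Norway", "Sweden"],
        "num_sold": [100, 90, 120],
    }
)
sales["year"] = sales["date"].dt.year

# A left join keeps every sales row, even if a GDP value is missing.
merged = sales.merge(gdp_long, on=["year", "country"], how="left")
print(merged)
```

If the real file names its columns differently, only the `melt` step needs adjusting. The merge on `year` and `country` stays the same.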

This completes the preliminary data analysis. Next, we will try to fit the data using time series features and linear regression. Next part: Kaggle Tabular Playground Series - Jan 2022 Study Notes 2 (Linear Regression with Time Series)
