
Kaggle Bike Sharing Data Analysis and Prediction (Random Forest)

Published: 2024/7/5

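Contents

1. Data Collection
- 1.1 Project Description
- 1.2 Data Contents and Variable Description
2. Data Processing
- 2.1 Importing the Data
- 2.2 Handling Missing Values
- 2.3 Handling Outliers in the Label (count)
- 2.4 Handling Outliers in Other Columns
- 2.5 Processing Datetime Data
3. Data Analysis
- 3.1 Descriptive Analysis
- 3.2 Exploratory Analysis
  - 3.2.1 Overall Analysis
  - 3.2.2 Correlation Analysis
  - 3.2.3 Analysis of Influencing Factors
    - 3.2.3.1 Effect of Hour on Rentals
    - 3.2.3.2 Effect of Temperature on Rentals
    - 3.2.3.3 Effect of Humidity on Rentals
    - 3.2.3.4 Effect of Year and Month on Rentals
    - 3.2.3.5 Effect of Season on Ridership
    - 3.2.3.6 Effect of Weather on Ridership
    - 3.2.3.7 Effect of Wind Speed on Ridership
    - 3.2.3.8 Effect of Date on Ridership
- 3.3 Predictive Analysis
  - 3.3.1 Feature Selection
  - 3.3.2 Separating the Training and Test Sets
  - 3.3.3 Dropping Redundant Features
  - 3.3.4 Choosing and Training the Model
  - 3.3.5 Predicting the Test Set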

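1. Data Collection

1.1 Project Description

A bike-sharing system is a way of renting bicycles in which membership registration, rental, and return are all handled automatically through a network of stations across a city. With such a system, people can rent a bike at one location as needed and return it at their destination. In this competition, participants are asked to combine historical usage patterns with weather data to forecast bike rental demand for the Capital Bikeshare program in Washington, D.C.

1.2 Data Contents and Variable Description

The competition provides hourly rental data spanning two years, including weather and date information. The training set consists of the first 19 days of each month; the test set covers the 20th through the end of each month.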
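The day-of-month split described above can be illustrated on made-up timestamps. This is only a sketch: the `hours` frame is hypothetical and is not the competition data, which already ships as separate train and test files.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly timestamps covering January 2011 (not the competition data).
stamps = pd.Timestamp("2011-01-01") + pd.to_timedelta(np.arange(31 * 24), unit="h")
hours = pd.DataFrame({"datetime": stamps})

# Days 1-19 belong to the training period, day 20 onward to the test period.
is_train = hours["datetime"].dt.day <= 19
train_part = hours[is_train]
test_part = hours[~is_train]

print(len(train_part), len(test_part))
```

Because the test period always follows the training period within each month, the two sets never overlap in time.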

2. Data Processing

2.1 Importing the Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='whitegrid', palette='tab10')

train = pd.read_csv(r'D:\A\Data\ufo\train.csv', encoding='utf-8')
train.info()
test = pd.read_csv(r'D:\A\Data\ufo\test.csv', encoding='utf-8')
test.info()

2.2 Handling Missing Values

# Visualize missing values
import missingno as msno
msno.matrix(train, figsize=(12, 5))
msno.matrix(test, figsize=(12, 5))

This dataset has no missing values, so no missing-value handling is needed.
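A numeric check can complement the visual one. The sketch below counts missing values per column on a small made-up frame; `sample` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing humidity reading (illustration only).
sample = pd.DataFrame({
    "temp": [9.84, 9.02, 9.02],
    "humidity": [81.0, np.nan, 80.0],
})

# Number of missing values per column; all zeros would mean nothing to impute.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Running the same `isnull().sum()` call on `train` and `test` should print zeros for every column if the matrix plots show no gaps.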

2.3 Handling Outliers in the Label (count)

# Inspect descriptive statistics of the training set
train.describe().T

Starting with the numeric columns, the rental count (count) varies widely, so look at its density distribution:

# Density distribution of the rental count
# (sns.distplot is deprecated in recent seaborn releases; histplot/displot replace it)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
fig.set_size_inches(6, 5)
sns.distplot(train['count'])
ax.set(xlabel='count', title='Distribution of count')

The distribution is heavily skewed with a long tail, so the tail of this column should be trimmed. As a first attempt, drop values more than three standard deviations from the mean and see whether that is enough:

train_WithoutOutliers = train[np.abs(train['count'] - train['count'].mean()) <= (3 * train['count'].std())]
print(train_WithoutOutliers.shape)
train_WithoutOutliers['count'].describe()
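The three-sigma filter can be tried on synthetic data to see how little it removes from a right-skewed column. This is a sketch under assumptions: `counts` is randomly generated and only stands in for `train['count']`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts standing in for train['count'] (an assumption).
rng = np.random.default_rng(0)
counts = pd.Series(rng.exponential(scale=180.0, size=10_000).round())

# Keep values whose distance from the mean is at most three standard deviations.
within_3_sigma = (counts - counts.mean()).abs() <= 3 * counts.std()
filtered = counts[within_3_sigma]

print(f"dropped {len(counts) - len(filtered)} of {len(counts)} rows")
```

For a non-negative, right-skewed column the lower bound (mean minus three standard deviations) is usually negative, so in practice only the upper tail is trimmed.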

The difference from the unfiltered data is not obvious, so compare the two visually:

fig = plt.figure()
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
fig.set_size_inches(12, 5)
sns.distplot(train_WithoutOutliers['count'], ax=ax1)
sns.distplot(train['count'], ax=ax2)
ax1.set(xlabel='count', title='Distribution of count without outliers')
ax2.set(xlabel='count', title='Distribution of count')

The values still vary widely. A more stable target is preferable because large swings make overfitting more likely, so a log transform is applied to stabilize the data.

yLabels = train_WithoutOutliers['count']
yLabels_log = np.log(yLabels)
sns.distplot(yLabels_log)

After the log transform the distribution is more even and the spread is smaller, which makes the label better suited to model training. Next come the remaining numeric columns. Because these columns appear in both datasets, the two datasets are merged first to simplify processing.
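The effect of the log transform can be quantified with the skewness before and after. The sketch below uses randomly generated positive values, not the rental data, and also shows the inverse transform needed when converting predictions back to counts.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed positive values (illustration only, not the Kaggle data).
rng = np.random.default_rng(42)
counts = pd.Series(rng.lognormal(mean=4.5, sigma=1.0, size=5_000))

log_counts = np.log(counts)      # same transform as in the article; needs counts > 0
restored = np.exp(log_counts)    # inverse transform, applied to predictions later

print(f"skew before: {counts.skew():.2f}, after: {log_counts.skew():.2f}")
```

`np.log` requires strictly positive values. If a target can contain zeros, `np.log1p` with `np.expm1` as its inverse is a common alternative.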

Bike_data = pd.concat([train_WithoutOutliers, test], ignore_index=True)
# Check the size of the combined dataset
Bike_data.shape

2.4 Handling Outliers in Other Columns

fig, axes = plt.subplots(2, 2)
fig.set_size_inches(12, 10)
sns.distplot(Bike_data['temp'], ax=axes[0, 0])
sns.distplot(Bike_data['atemp'], ax=axes[0, 1])
sns.distplot(Bike_data['humidity'], ax=axes[1, 0])
sns.distplot(Bike_data['windspeed'], ax=axes[1, 1])
axes[0, 0].set(xlabel='temp', title='Distribution of temp')
axes[0, 1].set(xlabel='atemp', title='Distribution of atemp')
axes[1, 0].set(xlabel='humidity', title='Distribution of humidity')
axes[1, 1].set(xlabel='windspeed', title='Distribution of windspeed')

These distributions reveal some problems. For example, why are there so many wind speed values of 0? The descriptive statistics also show that no values fall between 1 and 6. This suggests that the data may originally have had missing wind speed readings that were filled with 0. Because these zero readings would interfere with prediction, a random forest is used to fill them in from features such as year, month, season, temperature, and humidity. Before filling, look at the descriptive statistics of the nonzero data.

Bike_data[Bike_data['windspeed'] != 0]['windspeed'].describe()

from sklearn.ensemble import RandomForestRegressor

# Note: 'year' and 'month' are derived in section 2.5, so that step must run first
Bike_data["windspeed_rfr"] = Bike_data["windspeed"]
# Split the data into rows where wind speed is 0 and rows where it is not
dataWind0 = Bike_data[Bike_data["windspeed_rfr"] == 0].copy()
dataWindNot0 = Bike_data[Bike_data["windspeed_rfr"] != 0]
# Choose the model
rfModel_wind = RandomForestRegressor(n_estimators=1000, random_state=42)
# Choose the features
windColumns = ["season", "weather", "humidity", "month", "temp", "year", "atemp"]
# Fit the model on the rows with nonzero wind speed
rfModel_wind.fit(dataWindNot0[windColumns], dataWindNot0["windspeed_rfr"])
# Predict wind speed with the fitted model
wind0Values = rfModel_wind.predict(X=dataWind0[windColumns])
# Write the predictions into the rows where wind speed was 0
dataWind0.loc[:, "windspeed_rfr"] = wind0Values
# Recombine the two parts (DataFrame.append was removed in pandas 2.0)
Bike_data = pd.concat([dataWindNot0, dataWind0])
Bike_data.reset_index(inplace=True)
Bike_data.drop('index', inplace=True, axis=1)
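The same imputation idea can be written with a boolean mask, which avoids splitting and re-joining the frame and keeps the original row order. This is a sketch on synthetic data: the frame `df`, its values, and the reduced feature list are assumptions, and only the column names follow the article.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic weather table; values are made up for illustration.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "temp": rng.uniform(0, 40, n),
    "humidity": rng.uniform(20, 100, n),
    "month": rng.integers(1, 13, n),
})
df["windspeed"] = 6 + 0.3 * df["temp"] + rng.normal(0, 2, n).clip(-5, 5)
df.loc[rng.random(n) < 0.15, "windspeed"] = 0.0  # zeros standing in for missing readings

features = ["temp", "humidity", "month"]
is_zero = df["windspeed"] == 0

# Fit on rows with a recorded wind speed, then fill the zero rows in place.
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(df.loc[~is_zero, features], df.loc[~is_zero, "windspeed"])
df["windspeed_rfr"] = df["windspeed"]
df.loc[is_zero, "windspeed_rfr"] = model.predict(df.loc[is_zero, features])

print((df["windspeed_rfr"] == 0).sum())
```

A random forest prediction is an average of training targets, so with only nonzero training values the filled column contains no zeros.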

Check the density distributions after the random forest fill:

fig, axes = plt.subplots(2, 2)
fig.set_size_inches(12, 10)
sns.distplot(Bike_data['temp'], ax=axes[0, 0])
sns.distplot(Bike_data['atemp'], ax=axes[0, 1])
sns.distplot(Bike_data['humidity'], ax=axes[1, 0])
sns.distplot(Bike_data['windspeed_rfr'], ax=axes[1, 1])
axes[0, 0].set(xlabel='temp', title='Distribution of temp')
axes[0, 1].set(xlabel='atemp', title='Distribution of atemp')
axes[1, 0].set(xlabel='humidity', title='Distribution of humidity')
axes[1, 1].set(xlabel='windspeed', title='Distribution of windspeed')

2.5 Processing Datetime Data

from datetime import datetime

Bike_data['date'] = Bike_data.datetime.apply(lambda c: c.split()[0])
Bike_data['hour'] = Bike_data.datetime.apply(lambda c: c.split()[1].split(':')[0]).astype('int')
Bike_data['year'] = Bike_data.datetime.apply(lambda c: c.split()[0].split('/')[0]).astype('int')
Bike_data['month'] = Bike_data.datetime.apply(lambda c: c.split()[0].split('/')[1]).astype('int')
Bike_data['weekday'] = Bike_data.date.apply(lambda c: datetime.strptime(c, '%Y/%m/%d').isoweekday())
Bike_data.head()
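The same fields can also be derived with `pd.to_datetime` and the `.dt` accessor instead of string splitting. The sketch below assumes timestamps in a 'YYYY-MM-DD HH:MM:SS' layout, which differs from the slash-separated layout the code above expects; the two sample rows are made up.

```python
import pandas as pd

# Hypothetical timestamps in 'YYYY-MM-DD HH:MM:SS' layout (an assumption).
df = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-03 17:00:00"]})

ts = pd.to_datetime(df["datetime"])
df["date"] = ts.dt.date
df["hour"] = ts.dt.hour
df["year"] = ts.dt.year
df["month"] = ts.dt.month
df["weekday"] = ts.dt.dayofweek + 1  # dayofweek is Monday=0; +1 matches isoweekday()

print(df)
```

Parsing once and reading attributes avoids depending on the exact separator and on zero padding in the raw strings.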

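3. Data Analysis

3.1 Descriptive Analysis

train.describe().T

Temperature (temp), feels-like temperature (atemp), relative humidity, and wind speed are all roughly symmetrically distributed, while casual users, registered users, and the total count are all right-skewed.

for i in range(5, 12):
    name = train.columns[i]
    print('{0}: skewness {1}, kurtosis {2}'.format(name, train[name].skew(), train[name].kurt()))

temp, atemp, and humidity show low skew, windspeed shows moderate skew, and casual, registered, and count show high skew;
temp, atemp, and humidity have flat (platykurtic) distributions, while windspeed, casual, registered, and count have peaked (leptokurtic) distributions.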
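The low, moderate, and high labels above can be made explicit with a small helper. The cutoffs of 0.5 and 1.0 are a common rule of thumb and an assumption here, since the article does not state which thresholds it used; `describe_shape` and the sample inputs are hypothetical.

```python
def describe_shape(skew: float, excess_kurtosis: float) -> tuple[str, str]:
    """Label skewness and kurtosis using rule-of-thumb cutoffs (assumed)."""
    magnitude = abs(skew)
    if magnitude < 0.5:
        skew_label = "low skew"
    elif magnitude <= 1.0:
        skew_label = "moderate skew"
    else:
        skew_label = "high skew"
    # pandas' .kurt() reports excess kurtosis, so a normal distribution scores 0.
    kurt_label = "peaked" if excess_kurtosis > 0 else "flat"
    return skew_label, kurt_label


# Made-up inputs, not values computed from the dataset.
print(describe_shape(0.0, -0.9))
print(describe_shape(1.2, 1.3))
```

The helper can be applied to the output of `train[name].skew()` and `train[name].kurt()` inside the loop above.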

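3.2 Exploratory Analysis

3.2.1 Overall Analysis

sns.pairplot(Bike_data,
             x_vars=['holiday', 'workingday', 'weather', 'season', 'weekday', 'hour', 'windspeed_rfr', 'humidity', 'temp', 'atemp'],
             y_vars=['casual', 'registered', 'count'],
             plot_kws={'alpha': 0.1})

Broadly, the plots show:

Registered users ride more on working days and less on holidays; casual users show the opposite pattern;

Rentals are generally lower in the first quarter;

Rentals decrease as the weather level rises;

The hour of day has a clear effect: registered users show two peaks, while casual users show a single bell-shaped curve;

Rentals decrease as wind speed increases;

Temperature and humidity affect casual users strongly and registered users less.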

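3.2.2 Correlation Analysis

Check the correlation between each feature and the total hourly rental count (count):

# numeric_only (pandas 1.5+) skips string columns such as datetime and date
correlation = Bike_data.corr(numeric_only=True)
# Hide the upper triangle so each pair is shown only once
mask = np.array(correlation)
mask[np.tril_indices_from(mask)] = False
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
sns.heatmap(correlation, mask=mask, vmax=.8, square=True, annot=True)
plt.show()

count is strongly positively correlated with casual and registered, with correlation coefficients of 0.7 and 0.97 respectively. Because count = casual + registered, this is expected. count is positively correlated with temp (0.39); in general people are reluctant to ride when it is too cold. count is negatively correlated with humidity: very humid weather is not suited to riding, although humidity should be considered together with temperature. windspeed appears to have little effect on rentals (0.1), but extreme winds are probably rare, and fluctuations within the normal range likely have little impact on rentals. By strength of influence on the rental count, the features rank as: hour > temperature > humidity > year > month > season > weather level > wind speed > weekday > working day > holiday.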
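A ranking like the one above can be read off the correlation matrix by sorting absolute correlations with the target. The sketch below uses a synthetic frame whose columns and coefficients are assumptions chosen only to make the ordering visible.

```python
import numpy as np
import pandas as pd

# Synthetic features with decreasing influence on a made-up target.
rng = np.random.default_rng(1)
n = 2_000
hour_effect = rng.normal(size=n)
temp = rng.normal(size=n)
wind = rng.normal(size=n)
toy = pd.DataFrame({
    "hour_effect": hour_effect,
    "temp": temp,
    "windspeed": wind,
    "count": 3.0 * hour_effect + 1.0 * temp + 0.1 * wind + rng.normal(size=n),
})

# Absolute Pearson correlation with the target, strongest first.
ranking = toy.corr()["count"].drop("count").abs().sort_values(ascending=False)
print(ranking)
```

Pearson correlation only measures linear association, so a feature with a strong but non-monotonic effect, such as hour of day with its two commuting peaks, can rank lower than its real influence suggests.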

3.2.3 Analysis of Influencing Factors

3.2.3.1 Effect of Hour on Rentals

workingday_df = Bike_data[Bike_data['workingday'] == 1]
workingday_df = workingday_df.groupby(['hour'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})

nworkingday_df = Bike_data[Bike_data['workingday'] == 0]
nworkingday_df = nworkingday_df.groupby(['hour'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})

fig, axes = plt.subplots(1, 2, sharey=True)
workingday_df.plot(figsize=(15, 5), title='The average number of rentals initiated per hour in the working day', ax=axes[0])
nworkingday_df.plot(figsize=(15, 5), title='The average number of rentals initiated per hour in the nonworkdays', ax=axes[1])

The comparison shows:

On working days, registered users show two peaks at commuting hours, plus a smaller peak around noon, possibly from people going out for lunch;

Casual users show a flatter curve, peaking around 17:00;

Registered users rent far more bikes than casual users.

On non-working days, rentals follow a bell-shaped curve over the day, peaking around 14:00 with a trough around 04:00, and the distribution is fairly even.
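Peak hours can also be extracted numerically rather than read from the plots. This sketch pivots a tiny made-up table; the numbers in `toy` are assumptions shaped to mimic the pattern described, not values from the dataset.

```python
import pandas as pd

# Tiny made-up table: four hours on a working day and on a non-working day.
toy = pd.DataFrame({
    "workingday": [1, 1, 1, 1, 0, 0, 0, 0],
    "hour":       [8, 14, 17, 23, 8, 14, 17, 23],
    "registered": [300, 120, 400, 40, 60, 200, 150, 50],
})

# Mean registered rentals per hour, one column per workingday value.
by_hour = toy.pivot_table(index="hour", columns="workingday",
                          values="registered", aggfunc="mean")

# For each column, the hour at which the mean is highest.
peak_hour = by_hour.idxmax()
print(peak_hour)
```

On the real data the same pivot could be built from `Bike_data` with `hour` as the index.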

3.2.3.2 Effect of Temperature on Rentals

First look at the temperature trend:

# Hourly data is too dense to plot, so aggregate by day and take the daily median temperature
temp_df = Bike_data.groupby(['date', 'weekday'], as_index=False).agg({'year': 'mean', 'month': 'mean', 'temp': 'median'})
# The test set has no rental information, which would break the line plot, so drop rows with missing values
temp_df.dropna(axis=0, how='any', inplace=True)
# Daily values are still expected to fluctuate a lot, so take the monthly median of the daily values
temp_month = temp_df.groupby(['year', 'month'], as_index=False).agg({'weekday': 'min', 'temp': 'median'})
# Convert the dates of the daily data to datetime
temp_df['date'] = pd.to_datetime(temp_df['date'])
# Build a date column for the monthly data
temp_month.rename(columns={'weekday': 'day'}, inplace=True)
temp_month['date'] = pd.to_datetime(temp_month[['year', 'month', 'day']])
# Set the figure size
fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
# Plot the temperature trend over time as line charts
plt.plot(temp_df['date'], temp_df['temp'], linewidth=1.3, label='Daily average')
ax.set_title('Change trend of average temperature per day in two years')
plt.plot(temp_month['date'], temp_month['temp'], marker='o', linewidth=1.3, label='Monthly average')
ax.legend()

Temperature follows the same seasonal pattern each year, highest in July and lowest in January. Next, look at how the average hourly rental count changes with temperature.

# Average rentals by temperature
temp_rentals = Bike_data.groupby(['temp'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})
temp_rentals.plot(title='The average number of rentals initiated per hour changes with the temperature')

Rentals generally rise with temperature but start to fall once the temperature exceeds 35, and they reach their lowest point at 4 degrees.

3.2.3.3 Effect of Humidity on Rentals

First look at the humidity trend:

humidity_df = Bike_data.groupby('date', as_index=False).agg({'humidity': 'mean'})
humidity_df['date'] = pd.to_datetime(humidity_df['date'])
# Use the date as a time index
humidity_df = humidity_df.set_index('date')

humidity_month = Bike_data.groupby(['year', 'month'], as_index=False).agg({'weekday': 'min', 'humidity': 'mean'})
humidity_month.rename(columns={'weekday': 'day'}, inplace=True)
humidity_month['date'] = pd.to_datetime(humidity_month[['year', 'month', 'day']])

fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
plt.plot(humidity_df.index, humidity_df['humidity'], linewidth=1.3, label='Daily average')
plt.plot(humidity_month['date'], humidity_month['humidity'], marker='o', linewidth=1.3, label='Monthly average')
ax.legend()
ax.set_title('Change trend of average humidity per day in two years')

Humidity does not vary much; it mostly fluctuates around 60, with a peak of 80 within this dataset.

humidity_rentals = Bike_data.groupby(['humidity'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})
humidity_rentals.plot(title='Average number of rentals initiated per hour in different humidity')

Rentals rise quickly to a peak at a humidity of around 20 and then decline slowly.

3.2.3.4 Effect of Year and Month on Rentals

Look at how the total rental count changes over the two years:

# Hourly data is too dense to plot, so aggregate by day
count_df = Bike_data.groupby(['date', 'weekday'], as_index=False).agg({'year': 'mean', 'month': 'mean', 'casual': 'sum', 'registered': 'sum', 'count': 'sum'})
# The test set has no rental information, which would break the line plot, so drop rows with missing values
count_df.dropna(axis=0, how='any', inplace=True)
# Daily values are still expected to fluctuate a lot, so take the monthly mean of the daily totals
count_month = count_df.groupby(['year', 'month'], as_index=False).agg({'weekday': 'min', 'casual': 'mean', 'registered': 'mean', 'count': 'mean'})
# Convert the dates of the daily totals to datetime
count_df['date'] = pd.to_datetime(count_df['date'])
# Build a date column for the monthly data
count_month.rename(columns={'weekday': 'day'}, inplace=True)
count_month['date'] = pd.to_datetime(count_month[['year', 'month', 'day']])
# Set the figure size
fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
# Plot the overall rental count (count) over time as line charts
plt.plot(count_df['date'], count_df['count'], linewidth=1.3, label='Daily average')
ax.set_title('Change trend of average number of rentals initiated per day in two years')
plt.plot(count_month['date'], count_month['count'], marker='o', linewidth=1.3, label='Monthly average')
ax.legend()

The plot shows:

Bike rentals in 2012 were higher overall than in 2011;

Rentals vary clearly by month;

The data fluctuate sharply from September to December 2011 and from March to September 2012;

There are many local troughs.

3.2.3.5 Effect of Season on Ridership

The year and month plots show many local troughs, so daily rentals are averaged by season, alongside the seasonal temperature trend.

day_df = Bike_data.groupby('date').agg({'year': 'mean', 'season': 'mean', 'casual': 'sum', 'registered': 'sum', 'count': 'sum', 'temp': 'mean', 'atemp': 'mean'})
season_df = day_df.groupby(['year', 'season'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})
season_df.plot(figsize=(18, 6), title='The trend of average number of rentals initiated per day changes with season')

temp_df = day_df.groupby(['year', 'season'], as_index=True).agg({'temp': 'mean', 'atemp': 'mean'})
temp_df.plot(figsize=(18, 6), title='The trend of average temperature per day changes with season')

Both casual and registered rentals peak in autumn, while spring has the fewest rentals.

3.2.3.6 Effect of Weather on Ridership

Different weather levels occur with different frequency; for example, very bad weather (4) is rare. So first check the number of records per weather level, then take the hourly average of rentals by weather level.

count_weather = Bike_data.groupby('weather')
count_weather[['casual', 'registered', 'count']].count()

weather_df = Bike_data.groupby('weather', as_index=True).agg({'casual': 'mean', 'registered': 'mean'})
weather_df.plot.bar(stacked=True, title='Average number of rentals initiated per hour in different weather')

There is an implausible result here: rentals under weather level 4 are not low, and registered rentals are even higher than the level 2 average. Level 4 should logically have the fewest rentals, so print the level 4 records to find the cause:

Bike_data[Bike_data['weather'] == 4]

These records fall in commuting rush hours, so they are anomalous and not representative.
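Reporting the group size next to the group mean makes this kind of anomaly easier to spot before plotting. The sketch below uses a made-up table, and the minimum group size of 2 is an arbitrary threshold chosen for illustration.

```python
import pandas as pd

# Made-up rentals per weather level; level 4 has a single record.
toy = pd.DataFrame({
    "weather": [1, 1, 1, 2, 2, 3, 4],
    "count":   [200, 180, 220, 150, 170, 60, 164],
})

# Mean and number of records per weather level.
summary = toy.groupby("weather")["count"].agg(["mean", "count"])

# Keep only levels with enough records for the mean to be meaningful
# (the cutoff of 2 is a hypothetical choice).
reliable = summary[summary["count"] >= 2]
print(summary)
```

A mean computed from one or two rows mostly reflects when those rows happened to be recorded, which matches the rush-hour explanation above.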

3.2.3.7 Effect of Wind Speed on Ridership

Wind speed trend over the two years:

windspeed_df = Bike_data.groupby('date', as_index=False).agg({'windspeed_rfr': 'mean'})
windspeed_df['date'] = pd.to_datetime(windspeed_df['date'])
# Use the date as a time index
windspeed_df = windspeed_df.set_index('date')

windspeed_month = Bike_data.groupby(['year', 'month'], as_index=False).agg({'weekday': 'min', 'windspeed_rfr': 'mean'})
windspeed_month.rename(columns={'weekday': 'day'}, inplace=True)
windspeed_month['date'] = pd.to_datetime(windspeed_month[['year', 'month', 'day']])

fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
plt.plot(windspeed_df.index, windspeed_df['windspeed_rfr'], linewidth=1.3, label='Daily average')
plt.plot(windspeed_month['date'], windspeed_month['windspeed_rfr'], marker='o', linewidth=1.3, label='Monthly average')
ax.legend()
ax.set_title('Change trend of average number of windspeed per day in two years')

Wind speed fluctuates strongly in September 2011 and from December 2011 to March 2012. Next, look at how rentals change with wind speed. Very high wind speeds are rare, so averages would be distorted; the maximum rental count per wind speed is used instead.

windspeed_rentals = Bike_data.groupby(['windspeed'], as_index=True).agg({'casual': 'max', 'registered': 'max', 'count': 'max'})
windspeed_rentals.plot(title='Max number of rentals initiated per hour in different windspeed')

Rentals fall as wind speed rises, dropping markedly above a wind speed of 30, but there is a rebound at around 40. Print the data to find the cause:

df2 = Bike_data[Bike_data['windspeed'] > 40]
df2 = df2[df2['count'] > 400]
df2

This record falls in a commuting rush hour, so it is also an outlier and not representative.

3.2.3.8 Effect of Date on Ridership

For a given date, attributes such as working day, weekday, and year are identical across rows, so rentals are summed by day and the other date attributes are averaged.

day_df = Bike_data.groupby(['date'], as_index=False).agg({'casual': 'sum', 'registered': 'sum', 'count': 'sum', 'workingday': 'mean', 'weekday': 'mean', 'holiday': 'mean', 'year': 'mean'})
day_df.head()

number_pei = day_df[['casual', 'registered']].mean()
number_pei

plt.axes(aspect='equal')
plt.pie(number_pei, labels=['casual', 'registered'], autopct='%1.1f%%', pctdistance=0.6, labeldistance=1.05, radius=1)
plt.title('Casual or registered in the total lease')

Working days
Because the numbers of working and non-working days differ, rentals are averaged for each group, and rentals are also averaged for each day of the week.

workingday_df = day_df.groupby(['workingday'], as_index=True).agg({'casual': 'mean', 'registered': 'mean'})
workingday_df_0 = workingday_df.loc[0]
workingday_df_1 = workingday_df.loc[1]

# plt.axes(aspect='equal')
fig = plt.figure(figsize=(8, 6))
plt.subplots_adjust(hspace=0.5, wspace=0.2)  # spacing between subplots
grid = plt.GridSpec(2, 2, wspace=0.5, hspace=0.5)  # align subplot axes

plt.subplot2grid((2, 2), (1, 0), rowspan=2)
width = 0.3  # bar width
p1 = plt.bar(workingday_df.index, workingday_df['casual'], width)
p2 = plt.bar(workingday_df.index, workingday_df['registered'], width, bottom=workingday_df['casual'])
plt.title('Average number of rentals initiated per day')
plt.xticks([0, 1], ('nonworking day', 'working day'), rotation=20)
plt.legend((p1[0], p2[0]), ('casual', 'registered'))

plt.subplot2grid((2, 2), (0, 0))
plt.pie(workingday_df_0, labels=['casual', 'registered'], autopct='%1.1f%%', pctdistance=0.6, labeldistance=1.35, radius=1.3)
plt.axis('equal')
plt.title('nonworking day')

plt.subplot2grid((2, 2), (0, 1))
plt.pie(workingday_df_1, labels=['casual', 'registered'], autopct='%1.1f%%', pctdistance=0.6, labeldistance=1.35, radius=1.3)
plt.title('working day')
plt.axis('equal')

weekday_df = day_df.groupby(['weekday'], as_index=True).agg({'casual': 'mean', 'registered': 'mean'})
weekday_df.plot.bar(stacked=True, title='Average number of rentals initiated per day by weekday')

Comparing the plots shows:

On working days, registered users rent more and casual users rent less;

On weekends, registered rentals fall and casual rentals rise.

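Holidays
Holidays make up a very small share of the year, so first check how many holidays there are in each year:

holiday_coun = day_df.groupby('year', as_index=True).agg({'holiday': 'sum'})
holiday_coun

Holidays are a very small fraction of the days in a year, so daily averages are taken for holidays and non-holidays:

holiday_df = day_df.groupby('holiday', as_index=True).agg({'casual': 'mean', 'registered': 'mean'})
holiday_df.plot.bar(stacked=True, title='Average number of rentals initiated per day by holiday or not')

Both registered and casual users rent more on holidays than on non-holidays, which is consistent with the expected pattern.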

3.3 Predictive Analysis

3.3.1 Feature Selection

Based on the observations above, 11 features are selected: hour, temperature (temp), humidity, year, month, season, weather level (weather), wind speed (windspeed_rfr), weekday, working day (workingday), and holiday. Because CART decision trees use binary splits, the multi-category variables are one-hot encoded into multiple binary variables.

dummies_month = pd.get_dummies(Bike_data['month'], prefix='month')
dummies_season = pd.get_dummies(Bike_data['season'], prefix='season')
dummies_weather = pd.get_dummies(Bike_data['weather'], prefix='weather')
dummies_year = pd.get_dummies(Bike_data['year'], prefix='year')
# Concatenate the 4 new DataFrames with the original table
Bike_data = pd.concat([Bike_data, dummies_month, dummies_season, dummies_weather, dummies_year], axis=1)
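What `pd.get_dummies` produces is easiest to see on a few rows. The frame below is made up; only the `season` column name and the `prefix` usage follow the article.

```python
import pandas as pd

# Made-up rows with a four-level categorical column.
toy = pd.DataFrame({"season": [1, 2, 3, 4, 1], "temp": [9.8, 18.0, 28.7, 15.6, 8.2]})

# One indicator column per distinct season value.
season_dummies = pd.get_dummies(toy["season"], prefix="season")

# Replace the original column with its indicator columns.
encoded = pd.concat([toy.drop(columns="season"), season_dummies], axis=1)
print(encoded.columns.tolist())
```

Each row has exactly one active indicator, so the original integer column becomes redundant once the dummies are attached, which is why the article drops it in section 3.3.3.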

3.3.2 Separating the Training and Test Sets

dataTrain = Bike_data[pd.notnull(Bike_data['count'])]
dataTest = Bike_data[~pd.notnull(Bike_data['count'])].sort_values(by=['datetime'])
datetimecol = dataTest['datetime']
yLabels = dataTrain['count']
yLabels_log = np.log(yLabels)
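The split works because concatenating two frames with different columns fills the missing column with NaN, so the test rows are exactly the rows where `count` is null. A minimal sketch with made-up rows (`train_toy` and `test_toy` are hypothetical):

```python
import pandas as pd

# Made-up training rows (with a label) and a test row (without one).
train_toy = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"],
    "temp": [9.84, 9.02],
    "count": [16, 40],
})
test_toy = pd.DataFrame({"datetime": ["2011-01-20 00:00:00"], "temp": [10.66]})

# concat aligns columns by name; 'count' becomes NaN for the test row.
combined = pd.concat([train_toy, test_toy], ignore_index=True)

train_rows = combined[combined["count"].notnull()]
test_rows = combined[combined["count"].isnull()]
print(len(train_rows), len(test_rows))
```

This only holds if the label has no genuinely missing values in the training data, which section 2.2 confirmed for this dataset.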

3.3.3 Dropping Redundant Features

dropFeatures = ['casual', 'count', 'datetime', 'date', 'registered', 'windspeed', 'atemp', 'month', 'season', 'weather', 'year']
dataTrain = dataTrain.drop(dropFeatures, axis=1)
dataTest = dataTest.drop(dropFeatures, axis=1)

3.3.4 Choosing and Training the Model

rfModel = RandomForestRegressor(n_estimators=1000, random_state=42)
rfModel.fit(dataTrain, yLabels_log)
preds = rfModel.predict(X=dataTrain)
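The predictions on the training set are computed above but not scored. One way to score them is the root mean squared logarithmic error (RMSLE), the metric this competition uses. The sketch below is an assumption-laden illustration: `rmsle` is a helper written here, and the arrays are made-up stand-ins for real labels and model output.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, computed with log1p."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Made-up counts and log-scale predictions standing in for yLabels and preds.
actual = np.array([16.0, 40.0, 32.0])
log_preds = np.log(np.array([16.0, 36.0, 40.0]))

# The model was trained on log(count), so predictions go through exp first.
score = rmsle(actual, np.exp(log_preds))
print(score)
```

A score computed on the training rows is optimistic for a random forest; holding out part of `dataTrain` before fitting would give a more realistic estimate.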

3.3.5 Predicting the Test Set

predsTest = rfModel.predict(X=dataTest)
# The model was trained on log(count), so convert predictions back with exp
submission = pd.DataFrame({'datetime': datetimecol, 'count': [max(0, x) for x in np.exp(predsTest)]})
submission.to_csv(r'D:\A\Data\ufo\bike_predictions.csv', index=False)
