Getting Started with Kaggle: Bike Sharing Demand (Bicycle Demand Prediction)

Published: 2024/1/23

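I have been studying machine learning on and off for about a year without ever really building anything. Today I finally decided to start working through Kaggle problems, to gradually become familiar with machine learning and deep learning.

The first problem is a fairly basic one, Bike Sharing Demand: given features such as date and time, weather, and temperature, predict the number of bike rentals. The training and test data sets look roughly like this: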

// train
datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0,3,13,16
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0,8,32,40

// test
datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed
2011-01-20 00:00:00,1,0,1,1,10.66,11.365,56,26.0027
2011-01-20 01:00:00,1,0,1,1,10.66,13.635,56,

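Looking at the data above, we can see that the total rental count equals the rentals by non-registered users plus the rentals by registered users, that is, count = casual + registered. The evaluation metric is the loss function RMSLE (Root Mean Squared Logarithmic Error):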

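$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}$$

where $p_i$ is the predicted rental count, $a_i$ is the actual rental count, and $n$ is the number of samples. RMSLE is an error measure, so lower is better; because it compares logarithms, it penalizes relative rather than absolute differences.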
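The metric can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the competition's official scorer; the function name `rmsle` and the example values are assumptions made for this sketch.

```python
import math


def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error of two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one sample is required")
    # log1p(x) computes log(x + 1), matching the +1 inside the formula.
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# A perfect prediction of the two sample training rows gives zero error.
print(rmsle([16, 40], [16, 40]))

# Both predictions are roughly twice the actual value; the scores are close
# even though the absolute errors (10 vs. 1000) differ by a factor of 100.
print(rmsle([20], [10]))
print(rmsle([2000], [1000]))
```

Because the error is measured between logarithms, fitting a model to log(count + 1) with an ordinary squared-error objective targets the same quantity.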

The data fields are described below:

Data Fields

datetime - hourly date + timestamp
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather - 1: Clear, Few clouds, Partly cloudy
          2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
          3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
          4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
count - number of total rentals
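As a quick sanity check on the field descriptions, the sketch below parses the two sample training rows, confirms that count = casual + registered, and derives hour, weekday, month, and year from the datetime field. It uses only the standard library; `SAMPLE_TRAIN` and `parse_row` are names made up for this illustration, and the real train.csv would be read from disk instead.

```python
import csv
import io
from datetime import datetime

# The two training rows shown earlier, embedded so the sketch is self-contained.
SAMPLE_TRAIN = """datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0,3,13,16
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0,8,32,40
"""


def parse_row(row):
    """Turn one CSV row (a dict of strings) into time features and targets."""
    ts = datetime.strptime(row["datetime"], "%Y-%m-%d %H:%M:%S")
    return {
        "hour": ts.hour,
        "week": ts.weekday(),  # Monday = 0 ... Sunday = 6
        "month": ts.month,
        "year": ts.year,
        "casual": int(row["casual"]),
        "registered": int(row["registered"]),
        "count": int(row["count"]),
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_TRAIN))]
for r in rows:
    # The total is the sum of the two user groups.
    assert r["casual"] + r["registered"] == r["count"]
    print(r)
```

Parsing the timestamp once and reading its attributes avoids depending on fixed string offsets such as `[11:13]`, which only work while the format never changes.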


The whole process:

# coding: utf-8

# In[54]:
import numpy as np
import pandas as pd
get_ipython().run_line_magic('matplotlib', 'inline')
from datetime import datetime
# sklearn.cross_validation was removed in scikit-learn 0.20 and is not used
# below, so it is no longer imported.
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor

# In[4]:
df_origin = pd.read_csv("train.csv", sep=",")
df_origin.head()

# ### View a full 24 hours of data
# In[5]:
df_origin.head(24)
# In[6]:
df_origin.tail(24)

# ### View descriptive information
# In[7]:
df_origin.info()
# In[9]:
df_origin.describe()
# In[10]:
df_origin.columns
# In[12]:
df_origin.shape
# In[11]:
df_test = pd.read_csv("test.csv", sep=",")
df_test.head()
# In[13]:
df_test.shape

# ### Check for missing values
# In[14]:
df_origin.isnull().sum()
# In[18]:
#df_test.isnull().sum()

# ## Feature engineering
# ### Discretize the time
# In[25]:
df_origin['hour'] = df_origin['datetime'].str[11:13].astype(int)
df_origin.head()
# In[42]:
# Day of the week: Monday = 0 ... Sunday = 6
week = [datetime.strptime(day, '%Y-%m-%d').weekday() for day in df_origin['datetime'].str[:10]]
df_origin['week'] = week
df_origin.head()
# In[43]:
df_origin['month'] = df_origin['datetime'].str[5:7].astype(int)
df_origin['year'] = df_origin['datetime'].str[0:4].astype(int)
df_origin.head()
# In[45]:
df_origin.columns.values
# In[46]:
# .copy() so that columns can be added later without a SettingWithCopyWarning
df_clean = df_origin.loc[:, ['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp',
                             'humidity', 'windspeed', 'casual', 'registered', 'count',
                             'hour', 'week', 'year', 'month']].copy()
df_clean.head()

# #### Process the test data in the same way
# In[47]:
# Alternative using pd.DatetimeIndex:
#temp = pd.DatetimeIndex(train['datetime'])
#train['year'] = temp.year
#train['month'] = temp.month
#train['hour'] = temp.hour
#train['weekday'] = temp.weekday
df_test['hour'] = df_test['datetime'].str[11:13].astype(int)
week1 = [datetime.strptime(day, '%Y-%m-%d').weekday() for day in df_test['datetime'].str[:10]]
df_test['week'] = week1
df_test['month'] = df_test['datetime'].str[5:7].astype(int)
df_test['year'] = df_test['datetime'].str[0:4].astype(int)
# The test set has no casual/registered/count columns, so they are not selected here.
df_clean_test = df_test.loc[:, ['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp',
                                'humidity', 'windspeed', 'hour', 'week', 'year', 'month']]
df_test.head()

# ## Check the distribution of the targets
# ### Take the log of casual and registered, then add the predictions back together
# In[51]:
df_origin['casual'].hist()
# In[52]:
df_origin['registered'].hist()
# In[57]:
df_clean['log_cas'] = np.log(df_origin['casual'] + 1)
df_clean['log_reg'] = np.log(df_origin['registered'] + 1)
df_clean.head()

# ### Feature selection with a random forest
# In[58]:
df_clean.head(10)
# In[59]:
fea_cols = ['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp',
            'humidity', 'windspeed', 'hour', 'week', 'year']

# ### Many features are strongly correlated with each other
#
# #### season and month: pick one
# #### temp and atemp: pick one
# #### humidity, weather, windspeed: decide from the random forest feature importances
# #### week and workingday
# In[60]:
df_clean[fea_cols].corr()

# ### Drop features whose importance is < 0.01
# In[62]:
clf_cal = RandomForestRegressor(n_estimators=1000, min_samples_split=11, oob_score=True)
clf_cal
# In[63]:
clf_cal.fit(df_clean[fea_cols].values, df_clean['log_cas'].values)
pd.DataFrame(clf_cal.feature_importances_).plot(kind='bar')
clf_cal.oob_score_
# In[64]:
clf_cal.feature_importances_
# In[65]:
fea_cas = ['season', 'workingday', 'weather', 'temp', 'humidity', 'windspeed', 'hour', 'week', 'year']
# In[66]:
clf_cal.fit(df_clean[fea_cas].values, df_clean['log_cas'].values)
pd.DataFrame(clf_cal.feature_importances_).plot(kind='bar')
clf_cal.oob_score_
# In[67]:
clf_reg = RandomForestRegressor(n_estimators=1000, min_samples_split=11, oob_score=True)
# In[68]:
clf_reg.fit(df_clean[fea_cols].values, df_clean['log_reg'].values)
pd.DataFrame(clf_reg.feature_importances_).plot(kind='bar')
clf_reg.oob_score_
# In[69]:
clf_reg.feature_importances_
# In[70]:
fea_regs = ['season', 'workingday', 'weather', 'temp', 'humidity', 'hour', 'week', 'year']
# In[71]:
clf_reg.fit(df_clean[fea_regs].values, df_clean['log_reg'].values)
pd.DataFrame(clf_reg.feature_importances_).plot(kind='bar')
clf_reg.oob_score_
# In[73]:
# Undo log(x + 1) for each model: exp(a) - 1 + exp(b) - 1
y_pred7 = (np.exp(clf_cal.predict(df_clean_test[fea_cas].values))
           + np.exp(clf_reg.predict(df_clean_test[fea_regs].values)) - 2)
y_pred7[:40]

# ### Round the results
# In[74]:
y_pred7 = [round(x) for x in y_pred7]
df_test['count'] = y_pred7
df_test['count'] = df_test['count'].astype(int)
df_test.head()
# In[75]:
df_test.shape
# In[77]:
df_test.to_csv('result.csv', sep=',', columns=['datetime', 'count'], header=['datetime', 'count'], index=False)
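The notebook trains one model on log(casual + 1) and another on log(registered + 1), then maps both predictions back and adds them. The sketch below isolates that final step in plain Python so the arithmetic is easy to check. The helper names `to_log_target` and `combine_log_predictions` are hypothetical, and the "predictions" are simply the log-transformed sample values rather than real model output.

```python
import math


def to_log_target(count):
    """Training target used for each user group: log(count + 1)."""
    return math.log1p(count)


def combine_log_predictions(log_casual, log_registered):
    """Invert log(x + 1) for both models, add the groups, and round to integers."""
    totals = []
    for lc, lr in zip(log_casual, log_registered):
        # expm1(v) = exp(v) - 1, so this equals exp(lc) + exp(lr) - 2.
        total = math.expm1(lc) + math.expm1(lr)
        totals.append(int(round(total)))
    return totals


# Stand-in predictions: the exact log targets of the two sample training rows
# (casual = 3 and 8, registered = 13 and 32).
log_cas = [to_log_target(c) for c in (3, 8)]
log_reg = [to_log_target(r) for r in (13, 32)]
print(combine_log_predictions(log_cas, log_reg))
```

Using `expm1` and `log1p` (or their NumPy equivalents `np.expm1` and `np.log1p`) expresses the same transformation as `np.exp(...) - 1` and `np.log(... + 1)` while staying accurate for values close to zero.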

References:

1. http://www.cnblogs.com/en-heng/p/6907839.html

2. http://efavdb.com/bike-share-forecasting/

3. http://nbviewer.jupyter.org/gist/whbzju/ff06fce9fd738dcf8096#%E6%97%B6%E9%97%B4%E7%A6%BB%E6%95%A3%E5%8C%96

