Tang Yudi Machine Learning Course Notes: Random Forest

Published: 2023/12/10


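Maximum Temperature Prediction Task

We will complete three tasks:

1. Build a baseline model with the random forest algorithm

The baseline task requires us to process the data, examine the features, build the model, and visualize and analyze the results.

2. Observe how the amount of data and the number of features affect the result

Keeping the algorithm unchanged, increase the number of samples and observe how the result changes. Then revisit feature engineering and observe the trend after introducing new features.

3. Tune the random forest to find the most suitable parameters

Learn two classic hyperparameter tuning methods in machine learning and apply them to the current model.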

    # Load the data
    import pandas as pd

    features = pd.read_csv('data/temps.csv')
    features.head()

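Columns in the data table:

year, month, day, week: the date and the day of the week
temp_2: the maximum temperature two days earlier
temp_1: the maximum temperature one day earlier
average: the historical average maximum temperature for this calendar day
actual: the label, i.e. the true maximum temperature on that day
friend: probably just noise, a friend's guess at the value; we can ignore it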

    print('Data dimensions:', features.shape)
    features.describe()

Data dimensions: (348, 9)

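The describe() summary includes the count for each column. A column with missing values would show a smaller count; since no values are missing here, every column has a count of 348. The mean, standard deviation, minimum, maximum and other statistics are shown as well.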
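To make the link between missing values and the count row concrete, here is a minimal sketch on a made-up three-row table (the `toy` DataFrame and its values are assumptions for illustration, not rows from temps.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: three rows, with one missing temp_1 value
toy = pd.DataFrame({
    'temp_1': [45.0, np.nan, 47.0],
    'actual': [44, 45, 46],
})

# describe() counts only the non-missing entries in each column
counts = toy.describe().loc['count']
print(counts)

# isna().sum() reports the number of missing entries per column directly
missing = toy.isna().sum()
print(missing)
```

If every count equals the number of rows, as in the 348-row table here, no column contains missing values.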
We can also convert the date columns, because some plotting and computation tools expect a standard datetime format:

    # Process the date information
    import datetime

    # Extract year, month and day
    years = features['year']
    months = features['month']
    days = features['day']

    # Convert to datetime objects
    dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day))
             for year, month, day in zip(years, months, days)]
    dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]
    print(dates[:5])

[datetime.datetime(2016, 1, 1, 0, 0), datetime.datetime(2016, 1, 2, 0, 0), datetime.datetime(2016, 1, 3, 0, 0), datetime.datetime(2016, 1, 4, 0, 0), datetime.datetime(2016, 1, 5, 0, 0)]
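As an alternative to building date strings by hand, pandas can assemble dates directly from columns named year, month and day. The sketch below is not the approach used in the course notes; the `sample` DataFrame is a hypothetical stand-in that only reuses the dataset's column names:

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset
sample = pd.DataFrame({
    'year': [2016, 2016, 2016],
    'month': [1, 1, 12],
    'day': [1, 2, 31],
})

# pd.to_datetime assembles one timestamp per row from year/month/day columns
dates = pd.to_datetime(sample[['year', 'month', 'day']])
print(dates.tolist())
```

The result is a Series of timestamps, which matplotlib accepts on the x axis in the same way as a list of datetime objects.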

Data Visualization

    # Set up plotting
    import matplotlib.pyplot as plt
    # Jupyter magic: render figures inline (notebook only)
    %matplotlib inline

    # Choose a default style
    plt.style.use('fivethirtyeight')

    # Create the subplots and set the layout
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
    fig.autofmt_xdate(rotation=45)

    # Label values
    ax1.plot(dates, features['actual'])
    ax1.set_xlabel('')
    ax1.set_ylabel('Temperature')
    ax1.set_title('Max Temp')

    # One day earlier (the original plotted 'actual' here by mistake)
    ax2.plot(dates, features['temp_1'])
    ax2.set_xlabel('')
    ax2.set_ylabel('Temperature')
    ax2.set_title('Previous Max Temp')

    # Two days earlier
    ax3.plot(dates, features['temp_2'])
    ax3.set_xlabel('Date')
    ax3.set_ylabel('Temperature')
    ax3.set_title('Two Days Prior Max Temp')

    # The friend's guess
    ax4.plot(dates, features['friend'])
    ax4.set_xlabel('Date')
    ax4.set_ylabel('Temperature')
    ax4.set_title('Donkey Estimate')

Data Preprocessing

One-Hot Encoding

Original data:

week
Mon
Tue
Wed
Thu
Fri

After encoding:

Mon	Tue	Wed	Thu	Fri
1	0	0	0	0
0	1	0	0	0
0	0	1	0	0
0	0	0	1	0
0	0	0	0	1

    # One-hot encoding
    print(features.dtypes)
    features = pd.get_dummies(features)
    print('Shape of features after one-hot encoding:', features.shape)
    features.head(5)

year int64
month int64
day int64
week object
temp_2 int64
temp_1 int64
average float64
actual int64
friend int64
dtype: object
Shape of features after one-hot encoding: (348, 15)
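The jump from 9 to 15 columns follows from the encoding: the single week column is replaced by one indicator column per weekday, so 9 - 1 + 7 = 15. A minimal sketch on a hypothetical three-row table (the `toy` data is made up; only the column names mirror the dataset) shows the same mechanism:

```python
import pandas as pd

# Hypothetical toy table with one numeric and one text column
toy = pd.DataFrame({
    'temp_1': [45, 44, 41],
    'week': ['Mon', 'Tues', 'Wed'],
})

# get_dummies leaves numeric columns alone and expands text columns
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
print(encoded.shape)
```

Here one text column with three distinct values becomes three indicator columns, so the table grows from 2 to 4 columns.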

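Labels and Data Format Conversion

    # Features and labels
    import numpy as np

    # Labels
    labels = np.array(features['actual'])

    # Remove the label column from the features
    features = features.drop('actual', axis=1)

    # Keep a separate copy of the column names for later use
    features_list = list(features.columns)

    # Convert to a NumPy array
    features = np.array(features)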

Training and Test Sets

    # Split the dataset
    from sklearn.model_selection import train_test_split

    train_features, test_features, train_labels, test_labels = train_test_split(
        features, labels, test_size=0.25, random_state=42)

    print('Training features shape:', train_features.shape)
    print('Training labels shape:', train_labels.shape)
    print('Test features shape:', test_features.shape)
    print('Test labels shape:', test_labels.shape)

Training features shape: (261, 14)
Training labels shape: (261,)
Test features shape: (87, 14)
Test labels shape: (87,)
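The shapes above follow directly from test_size=0.25: a quarter of 348 rows is 87, leaving 261 for training. The sketch below reproduces only the shapes, using placeholder arrays of zeros rather than the real data (the names `X` and `y` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same dimensions as the encoded dataset
X = np.zeros((348, 14))
y = np.zeros(348)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, so later experiments are evaluated on the same 87 test rows.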

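Building a Baseline Random Forest Model

Everything is ready, so we can build the random forest model. First import the tool, start with 1000 trees, and leave the other parameters at their defaults; we will look at parameter tuning in depth later:

    # Import the algorithm
    from sklearn.ensemble import RandomForestRegressor

    # Build the model
    rf = RandomForestRegressor(n_estimators=1000, random_state=42)

    # Train
    rf.fit(train_features, train_labels)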
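To see what n_estimators controls, it helps to check that the forest's regression output is the average of its individual trees' predictions. The sketch below uses synthetic data (the arrays, coefficients and the choice of 20 trees are assumptions for illustration, not the temperature dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: two features and a noisy linear target
rng = np.random.default_rng(0)
X = rng.uniform(30, 90, size=(200, 2))
y = 0.7 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 2, size=200)

rf_demo = RandomForestRegressor(n_estimators=20, random_state=42)
rf_demo.fit(X, y)

# Prediction of the whole forest for the first five rows
forest_pred = rf_demo.predict(X[:5])

# Average of the predictions made by each individual tree
tree_preds = np.array([tree.predict(X[:5]) for tree in rf_demo.estimators_])
mean_of_trees = tree_preds.mean(axis=0)

print(len(rf_demo.estimators_))
print(np.allclose(forest_pred, mean_of_trees))
```

More trees mean the average is taken over more fitted estimators, at the cost of longer training and prediction time.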

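Testing

    # Predict on the test set
    predictions = rf.predict(test_features)

    # Absolute errors
    errors = abs(predictions - test_labels)

    # Mean absolute percentage error (MAPE)
    mape = 100 * (errors / test_labels)
    print('MAPE:', np.mean(mape))

The MAPE metric (mean absolute percentage error) averages the absolute errors expressed as a percentage of the true values.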
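A small worked example makes the metric concrete. The helper below is a sketch (the function name `mape_score` and the zero check are additions for illustration, not part of the course code); it computes the mean of |prediction - truth| / |truth| times 100:

```python
import numpy as np

def mape_score(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # The ratio is undefined when a true value is zero
    if np.any(y_true == 0):
        raise ValueError('MAPE is undefined when a true value is 0')
    return float(np.mean(np.abs(y_pred - y_true) / np.abs(y_true)) * 100)

# Errors of 10 on a true value of 100 and 5 on a true value of 50
# are both 10% errors, so the average is 10%
example = mape_score([100, 50], [90, 55])
print(example)
```

Because the error is divided by the true value, the metric only makes sense when the labels stay away from zero.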

Visualizing a Tree

Install graphviz first (the code below also imports the pydot package):

    # Import the required tools
    from sklearn.tree import export_graphviz
    import pydot

    # Take one tree from the forest
    tree = rf.estimators_[5]

    # Export it to a dot file
    export_graphviz(tree, out_file='tree.dot', feature_names=features_list,
                    rounded=True, precision=1)

    # Build the graph
    (graph,) = pydot.graph_from_dot_file('tree.dot')

    # Write it out as an image
    graph.write_png('tree.png')

The full tree is hard to read, so let's make it smaller.

    # Restrict the size of the trees
    rf_small = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42)
    rf_small.fit(train_features, train_labels)

    # Extract one tree
    tree_small = rf_small.estimators_[5]

    # Save it (the original passed an undefined name, feature_list)
    export_graphviz(tree_small, out_file='small_tree.dot', feature_names=features_list,
                    rounded=True, precision=1)
    (graph,) = pydot.graph_from_dot_file('small_tree.dot')
    graph.write_png('small_tree.png')
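If graphviz is not available, scikit-learn can also print a tree as plain text with export_text. This is an alternative to the approach above, not the one used in the course notes; the data and the column names 'temp_1' and 'noise' below are synthetic assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic data: the target depends only on the first column
rng = np.random.default_rng(0)
X = rng.uniform(30, 90, size=(200, 2))
y = 2.0 * X[:, 0]

# A small, shallow forest keeps each tree readable
rf_text = RandomForestRegressor(n_estimators=10, max_depth=2, random_state=42)
rf_text.fit(X, y)

# Render one tree as indented text rules
tree_text = export_text(rf_text.estimators_[5], feature_names=['temp_1', 'noise'])
print(tree_text)
```

Each indented line is a split condition or a leaf value, which is often enough to inspect a shallow tree without producing an image.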

Feature Importance

    # Get the feature importances
    importances = list(rf.feature_importances_)

    # Pair each feature name with its rounded importance
    feature_importances = [(feature, round(importance, 2))
                           for feature, importance in zip(features_list, importances)]

    # Sort from most to least important
    feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)

    # Print each pair
    for pair in feature_importances:
        print('Variable:{:20} Importance:{}'.format(*pair))

Variable:temp_1 Importance:0.7
Variable:average Importance:0.19
Variable:day Importance:0.03
Variable:temp_2 Importance:0.02
Variable:friend Importance:0.02
Variable:month Importance:0.01
Variable:year Importance:0.0
Variable:week_Fri Importance:0.0
Variable:week_Mon Importance:0.0
Variable:week_Sat Importance:0.0
Variable:week_Sun Importance:0.0
Variable:week_Thurs Importance:0.0
Variable:week_Tues Importance:0.0
Variable:week_Wed Importance:0.0
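One way to decide how many features to keep is to accumulate the sorted importances until they pass a chosen threshold. The sketch below uses the rounded values printed above; the 0.90 threshold is an arbitrary assumption for illustration, and this selection rule is not part of the course notes:

```python
import numpy as np

# Rounded importances copied from the printed output (non-zero entries only)
names = ['temp_1', 'average', 'day', 'temp_2', 'friend', 'month']
importances = np.array([0.70, 0.19, 0.03, 0.02, 0.02, 0.01])

# Running total of importance, from most to least important
cumulative = np.cumsum(importances)

# Assumed threshold: keep features until 90% of the importance is covered
threshold = 0.90
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative)
print(names[:n_needed])
```

With these values, temp_1 and average alone cover about 0.89 of the total, which is why the two of them are a natural pair to test on their own.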

Trying Again with the Most Important Features

    # Try a model that uses only the two most important features
    rf_most_important = RandomForestRegressor(n_estimators=1000, random_state=42)

    # Select the two features
    important_indices = [features_list.index('temp_1'), features_list.index('average')]
    train_important = train_features[:, important_indices]
    test_important = test_features[:, important_indices]

    # Retrain the model
    rf_most_important.fit(train_important, train_labels)

    # Predict
    predictions = rf_most_important.predict(test_important)
    errors = abs(predictions - test_labels)

    # Evaluate
    mape = np.mean(100 * (errors / test_labels))
    print('mape', mape)

mape 6.229055723613811

    # x positions for the bars
    x_values = list(range(len(importances)))

    # Draw the bar chart
    plt.bar(x_values, importances)

    # Label the x axis ticks with the feature names
    plt.xticks(x_values, features_list, rotation='vertical')

    # Axis labels and title
    plt.ylabel('Importance')
    plt.xlabel('Variable')
    plt.title('Variable Importances')

Differences Between Predicted and Actual Values

    # Date columns for the full dataset
    months = features[:, features_list.index('month')]
    days = features[:, features_list.index('day')]
    years = features[:, features_list.index('year')]

    # Convert to datetime objects
    dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day))
             for year, month, day in zip(years, months, days)]
    dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]

    # Table of dates and the corresponding true values
    true_data = pd.DataFrame(data={'date': dates, 'actual': labels})

    # Likewise, a table of test dates and the model's predictions
    # (predictions holds the output of the most recently fitted model)
    months = test_features[:, features_list.index('month')]
    days = test_features[:, features_list.index('day')]
    years = test_features[:, features_list.index('year')]
    test_dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day))
                  for year, month, day in zip(years, months, days)]
    test_dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in test_dates]
    predictions_data = pd.DataFrame(data={'date': test_dates, 'prediction': predictions})

    # True values
    plt.plot(true_data['date'], true_data['actual'], 'b-', label='actual')

    # Predicted values
    plt.plot(predictions_data['date'], predictions_data['prediction'], 'ro', label='prediction')
    plt.xticks(rotation=60)
    plt.legend()

    # Axis labels and title
    plt.xlabel('Date')
    plt.ylabel('Maximum Temperature (F)')
    plt.title('Actual and Predicted Values')

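The result looks reasonable: the model has largely captured the overall trend. Next we will dig deeper into the data and consider two questions:
1. How does the result change if more data is available?
2. Does adding new features improve the model, and what happens to the running time?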
