
RandomForest: Random Forests

Published: 2025/3/21


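Random Forest (RF)

A random forest is a Bagging algorithm that uses decision trees as its base learners. What sets it apart is that RF also introduces random attribute selection (subsampling of the features) while each decision tree is being trained.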

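A traditional decision tree picks the best attribute among all attributes of the current node when choosing a split.
RF proceeds in two steps:

First, it randomly selects K attributes from the attributes of the node to form a random subset (similar to Random Subspaces in Bagging; a common choice is K = log2(n), where n is the number of attributes).

Then it picks the best attribute within this subset and splits on it.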
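The two steps above can be sketched in plain Python. This is a minimal illustration under assumptions, not sklearn's implementation: the helper names (gini, best_split_on_feature, random_forest_split), the Gini criterion, and the toy data are all made up for this example.

```python
import math
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split_on_feature(rows, labels, feature):
    """Return (weighted_gini, threshold) of the best split on one feature."""
    best = (float("inf"), None)
    values = sorted({row[feature] for row in rows})
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[0]:
            best = (score, threshold)
    return best


def random_forest_split(rows, labels, rng):
    """Step 1: draw K = log2(n) random features. Step 2: best split among them."""
    n_features = len(rows[0])
    k = max(1, int(math.log2(n_features)))
    candidates = rng.sample(range(n_features), k)
    scored = [(best_split_on_feature(rows, labels, f), f) for f in candidates]
    (score, threshold), feature = min(scored, key=lambda item: item[0][0])
    return candidates, feature, threshold, score


# Hypothetical toy data: 6 samples, 4 features; only feature 0 separates the classes.
rows = [
    [1.0, 5.0, 0.30, 7.0],
    [2.0, 4.0, 0.10, 7.5],
    [3.0, 6.0, 0.20, 6.5],
    [6.0, 5.5, 0.25, 7.2],
    [7.0, 4.5, 0.15, 6.8],
    [8.0, 5.2, 0.35, 7.1],
]
labels = [0, 0, 0, 1, 1, 1]

candidates, feature, threshold, score = random_forest_split(rows, labels, random.Random(0))
print("candidate features:", candidates)
print("chosen feature:", feature, "threshold:", threshold, "weighted Gini:", score)
```

Whether the perfectly separating feature 0 is used depends on whether it lands in the random subset, which is exactly the extra randomness RF adds on top of Bagging.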

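Parameters

The main parameters to tune when using these methods are n_estimators and max_features. The former (n_estimators) is the number of trees in the forest. More trees usually give better results, but computation time grows with them, and beyond a critical number of trees the results stop improving noticeably. The latter (max_features) is the size of the random subset of features considered when splitting a node. The lower this value, the greater the reduction in variance, but also the greater the increase in bias. Empirically, good default values are max_features = n_features for regression problems and max_features = sqrt(n_features) for classification problems (where n_features is the number of features). Combining max_depth = None with min_samples_split = 2 (that is, growing full trees) often works well. Keep in mind that these (default) values are usually not optimal and may consume a lot of memory; the best parameter values should be found by cross-validation. Also note that random forests use bootstrap sampling by default (bootstrap = True), whereas the default strategy of extra-trees is to use the whole dataset (bootstrap = False). When bootstrap sampling is used, generalization accuracy can be estimated on the left-out, or out-of-bag, samples by setting oob_score = True.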
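As a small sketch of the advice above, the following fits a forest with the out-of-bag estimate enabled and then searches n_estimators and max_features by cross-validation. The dataset, the grid values, and the number of folds are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, assumed here only for illustration.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

# Out-of-bag estimate: requires bootstrap=True (the default for random forests).
rf = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                            bootstrap=True, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy estimate:", rf.oob_score_)

# Cross-validated search over the two main parameters.
param_grid = {"n_estimators": [25, 50], "max_features": ["sqrt", None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print("Best parameters:", search.best_params_)
```

Here max_features=None means all features are considered at each split, which corresponds to max_features = n_features in the text.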

Note:

With default parameters the complexity (size) of the model is O(M*N*log(N)), where M is the number of trees and N is the number of samples. It can be reduced by setting the following parameters: min_samples_split, min_samples_leaf, max_leaf_nodes and max_depth.
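To make the effect of these parameters concrete, the sketch below counts the nodes of a forest grown with default settings and of one restricted by max_depth and min_samples_leaf. The helper total_nodes and the particular values are assumptions chosen for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)


def total_nodes(forest):
    """Sum of node counts over all trees, a rough proxy for model size."""
    return sum(tree.tree_.node_count for tree in forest.estimators_)


# Fully grown trees (default parameters).
full = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Restricted trees: shallow and with a minimum leaf size.
small = RandomForestClassifier(n_estimators=20, max_depth=3,
                               min_samples_leaf=5, random_state=0).fit(X, y)

print("nodes with default parameters:", total_nodes(full))
print("nodes with max_depth=3, min_samples_leaf=5:", total_nodes(small))
```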

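Bias and Variance

Theory

Compared with an ordinary decision tree, RF subsamples the features, which makes the model more random. This increases the bias, but the ensemble effect reduces the variance at the same time, so the overall model is usually better.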
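The trade-off can be written with the standard decomposition of the expected squared error for regression. The notation is an assumption made for this note: f is the true function, y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and ŷ_LS is the model trained on a learning set LS.

```latex
\mathbb{E}_{LS,\varepsilon}\!\left[\big(y - \hat{y}_{LS}(x)\big)^2\right]
  = \underbrace{\big(f(x) - \mathbb{E}_{LS}[\hat{y}_{LS}(x)]\big)^2}_{\text{bias}^2(x)}
  + \underbrace{\mathbb{E}_{LS}\!\left[\big(\hat{y}_{LS}(x) - \mathbb{E}_{LS}[\hat{y}_{LS}(x)]\big)^2\right]}_{\text{variance}(x)}
  + \underbrace{\sigma^2}_{\text{noise}(x)}
```

Feature subsampling tends to raise the first term slightly, while averaging many trees lowers the second; the third term does not depend on the model.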

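Besides the ordinary random forest, an extremely randomized forest can be built from extremely randomized trees (extra-trees). The difference from the trees of an ordinary random forest is that an extra-tree does not search for the optimal split but chooses it at random. In sklearn this is implemented by drawing a random threshold for each candidate attribute and then picking the best of these random thresholds. Note that extremely randomized forests do not use bootstrap sampling by default (bootstrap = False).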
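The random-threshold idea can be sketched as follows. This is a simplified plain-Python illustration under assumptions (Gini criterion, uniform thresholds between the column minimum and maximum, hypothetical helper names and toy data); sklearn's real implementation differs in its details.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(column, labels, threshold):
    """Impurity after splitting one feature column at the given threshold."""
    left = [y for v, y in zip(column, labels) if v <= threshold]
    right = [y for v, y in zip(column, labels) if v > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)


def extra_trees_split(rows, labels, features, rng):
    """Draw ONE random threshold per candidate feature, keep the best of them."""
    best = None
    for f in features:
        column = [row[f] for row in rows]
        lo, hi = min(column), max(column)
        if lo == hi:
            continue  # constant feature, nothing to split on
        threshold = rng.uniform(lo, hi)
        score = weighted_gini(column, labels, threshold)
        if best is None or score < best[0]:
            best = (score, f, threshold)
    return best


# Hypothetical toy data: 6 samples, 3 features.
rows = [
    [1.0, 5.0, 0.30],
    [2.0, 4.0, 0.10],
    [3.0, 6.0, 0.20],
    [6.0, 5.5, 0.25],
    [7.0, 4.5, 0.15],
    [8.0, 5.2, 0.35],
]
labels = [0, 0, 0, 1, 1, 1]

result = extra_trees_split(rows, labels, range(3), random.Random(0))
print("score, feature, threshold:", result)
```

Because no exhaustive threshold search is done, each split is cheaper and noisier than in an ordinary random forest.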

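How the final prediction is produced: in the original RF paper, the final prediction is a simple vote over the predictions of all trees, whereas sklearn, the machine learning library we commonly use, averages the predicted probabilities of the individual classifiers.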
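The two rules can disagree. The toy example below uses made-up per-tree class probabilities for a single sample (three hypothetical trees, two classes) and compares a simple majority vote with averaging the probabilities.

```python
from collections import Counter

# Hypothetical probabilities [P(class 0), P(class 1)] from three trees.
tree_probas = [
    [0.90, 0.10],
    [0.45, 0.55],
    [0.45, 0.55],
]


def hard_vote(probas):
    """Each tree votes for its most probable class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probas):
    """Average the probabilities over trees, then take the most probable class."""
    n_classes = len(probas[0])
    mean = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__), mean


hard = hard_vote(tree_probas)
soft, mean_proba = soft_vote(tree_probas)
print("majority vote:", hard)           # class 1 (two of three trees)
print("probability averaging:", soft)   # class 0 (mean is about [0.6, 0.4])
```

One very confident tree outweighs two barely decided ones under averaging, which is why the two rules give different answers here.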

Experiment

Here we slightly modify the sklearn example "Single estimator versus bagging: bias-variance decomposition" to show the bias-variance decomposition of an extremely randomized forest, a random forest, and an ordinary decision tree on the same dataset.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

plt.figure(figsize=(20, 10))

# Settings
n_repeat = 50   # Number of iterations for computing expectations
n_train = 50    # Size of the training set
n_test = 1000   # Size of the test set
noise = 0.1     # Standard deviation of the noise
np.random.seed(0)

estimators = [
    ("Tree", DecisionTreeRegressor()),
    ("RandomForestRegressor", RandomForestRegressor(random_state=100, bootstrap=True)),
    # Label kept to match the output below; the estimator is ExtraTreesRegressor.
    ("ExtraTreesClassifier", ExtraTreesRegressor(random_state=100, bootstrap=True)),
]
n_estimators = len(estimators)


# Generate data
def f(x):
    x = x.ravel()
    return np.exp(-x ** 2) + 1.5 * np.exp(-(x - 2) ** 2)


def generate(n_samples, noise, n_repeat=1):
    X = np.random.rand(n_samples) * 10 - 5
    X = np.sort(X)
    if n_repeat == 1:
        y = f(X) + np.random.normal(0.0, noise, n_samples)
    else:
        y = np.zeros((n_samples, n_repeat))
        for i in range(n_repeat):
            y[:, i] = f(X) + np.random.normal(0.0, noise, n_samples)
    X = X.reshape((n_samples, 1))
    return X, y


X_train = []
y_train = []
for i in range(n_repeat):
    X, y = generate(n_samples=n_train, noise=noise)
    X_train.append(X)
    y_train.append(y)

X_test, y_test = generate(n_samples=n_test, noise=noise, n_repeat=n_repeat)

# Loop over estimators to compare
for n, (name, estimator) in enumerate(estimators):
    # Compute predictions
    y_predict = np.zeros((n_test, n_repeat))
    for i in range(n_repeat):
        estimator.fit(X_train[i], y_train[i])
        y_predict[:, i] = estimator.predict(X_test)

    # Bias^2 + Variance + Noise decomposition of the mean squared error
    y_error = np.zeros(n_test)
    for i in range(n_repeat):
        for j in range(n_repeat):
            y_error += (y_test[:, j] - y_predict[:, i]) ** 2
    y_error /= (n_repeat * n_repeat)

    y_noise = np.var(y_test, axis=1)
    y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2
    y_var = np.var(y_predict, axis=1)

    print("{0}: {1:.4f} (error) = {2:.4f} (bias^2) "
          " + {3:.4f} (var) + {4:.4f} (noise)".format(name,
                                                      np.mean(y_error),
                                                      np.mean(y_bias),
                                                      np.mean(y_var),
                                                      np.mean(y_noise)))

    # Plot figures: top row shows the fits, bottom row the decomposition
    plt.subplot(2, n_estimators, n + 1)
    plt.plot(X_test, f(X_test), "b", label="$f(x)$")
    plt.plot(X_train[0], y_train[0], ".b", label="LS ~ $y = f(x)+noise$")
    for i in range(n_repeat):
        if i == 0:
            plt.plot(X_test, y_predict[:, i], "r", label=r"$\^y(x)$")
        else:
            plt.plot(X_test, y_predict[:, i], "r", alpha=0.05)
    plt.plot(X_test, np.mean(y_predict, axis=1), "c",
             label=r"$\mathbb{E}_{LS} \^y(x)$")
    plt.xlim([-5, 5])
    plt.title(name)
    if n == 0:
        plt.legend(loc="upper left", prop={"size": 11})

    plt.subplot(2, n_estimators, n_estimators + n + 1)
    plt.plot(X_test, y_error, "r", label="$error(x)$")
    plt.plot(X_test, y_bias, "b", label="$bias^2(x)$")
    plt.plot(X_test, y_var, "g", label="$variance(x)$")
    plt.plot(X_test, y_noise, "c", label="$noise(x)$")
    plt.xlim([-5, 5])
    plt.ylim([0, 0.1])
    if n == 0:
        plt.legend(loc="upper left", prop={"size": 11})

plt.show()

Output:

Tree: 0.0255 (error) = 0.0003 (bias^2) + 0.0152 (var) + 0.0098 (noise)
RandomForestRegressor: 0.0202 (error) = 0.0004 (bias^2) + 0.0098 (var) + 0.0098 (noise)
ExtraTreesClassifier: 0.0175 (error) = 0.0011 (bias^2) + 0.0065 (var) + 0.0098 (noise)

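The results show this clearly: compared with a single decision tree, the random forest slightly increases the bias (0.0003 to 0.0004) but greatly reduces the variance (0.0152 to 0.0098), so the overall error is lower. Even so, the variance of RF in this experiment is still far larger than its bias. This is the situation in which an extremely randomized forest is worth trying: in general it increases the bias further and reduces the variance further relative to a random forest, so in this experiment it should do better than RF, and it does (0.0175 versus 0.0202 error). This trend is not guaranteed in every case, though.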

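Feature Importance Evaluation

The relative importance of a feature with respect to the target variable can be assessed from the relative rank (depth) at which the feature is used in the trees. Features used near the top of a decision tree contribute to the final prediction of a larger fraction of the input samples, so the expected fraction of samples a feature contributes to can be used as an estimate of its relative importance.

In RF, averaging these estimates over many randomized trees reduces their variance, so they can be used for feature selection. Note, however, that a random forest and an extremely randomized forest will not necessarily produce the same importances for the same dataset, and even a single model can give different results under different parameters.

Because of the special nature of extremely randomized forests, please do not use them to rank feature importance; RF is recommended instead.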
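The averaging described above can be checked directly. The sketch below, on a synthetic dataset chosen only for illustration, recomputes the forest-level importances as the normalized mean of the per-tree importances and reports their spread across trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False the 3 informative features are the first 3 columns.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0, shuffle=False)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances of every individual tree, shape (n_trees, n_features).
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])
mean_importance = per_tree.mean(axis=0)
spread = per_tree.std(axis=0)

# Features ordered from most to least important according to the forest.
ranking = np.argsort(rf.feature_importances_)[::-1]
print("ranking:", ranking)
print("forest importances:", np.round(rf.feature_importances_, 3))
print("std across trees:  ", np.round(spread, 3))
```

The spread across trees gives a rough sense of how stable each importance estimate is.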

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

plt.figure(figsize=(8, 4))

estimators = [
    ("RandomForest", RandomForestClassifier(random_state=100)),
    ("ExtraTrees", ExtraTreesClassifier(random_state=100)),
]
n_estimators = len(estimators)

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

for n, (name, estimator) in enumerate(estimators):
    # Fit the ensemble and compute its feature importances
    estimator.fit(X, y)
    importances = estimator.feature_importances_
    # Spread of the importances across the trees of this ensemble (error bars)
    std = np.std([tree.feature_importances_ for tree in estimator.estimators_],
                 axis=0)
    indices = np.argsort(importances)[::-1]

    # Print the feature ranking
    # print(name + " Feature ranking:")
    # for f in range(X.shape[1]):
    #     print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

    # Plot the feature importances of the ensemble
    plt.subplot(1, n_estimators, n + 1)
    plt.title(name + " Feature importances")
    plt.bar(range(X.shape[1]), importances[indices],
            color="r", yerr=std[indices], align="center")
    plt.xticks(range(X.shape[1]), indices)
    plt.xlim([-1, X.shape[1]])

plt.show()

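Totally Random Trees Embedding

sklearn also implements a special use of random forests, the totally random trees embedding (RandomTreesEmbedding). RandomTreesEmbedding implements an unsupervised transformation of the data. Using a forest of totally random trees, it encodes each data point by the index of the leaf the point ends up in. This index is then encoded in a one-of-K manner, giving a high-dimensional, sparse binary code. The code can be computed very efficiently and can serve as the basis for other learning tasks. The size and sparsity of the code can be controlled by choosing the number of trees and the maximum depth of each tree. For each tree in the ensemble, the code contains exactly one entry equal to one (the leaf the sample falls into). The size (dimension) of the code is at most n_estimators * 2 ** max_depth, the maximum number of leaves in the forest.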
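A short check of the properties just described, on a small synthetic dataset assumed for illustration: the code is sparse, has at most n_estimators * 2 ** max_depth columns, and every row contains exactly one 1 per tree.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding

X, _ = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)

n_estimators, max_depth = 10, 3
embedder = RandomTreesEmbedding(n_estimators=n_estimators, max_depth=max_depth,
                                random_state=0)
X_sparse = embedder.fit_transform(X)  # sparse matrix, one column per leaf

# Number of non-zero entries in each row: one leaf per tree.
ones_per_row = np.asarray(X_sparse.sum(axis=1)).ravel()

print("encoded shape:", X_sparse.shape)
print("upper bound on columns:", n_estimators * 2 ** max_depth)
print("ones per sample:", np.unique(ones_per_row))
```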

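It serves two purposes:

Non-linear dimensionality reduction

Generating new features (the effect is similar to that of the GBT family, although judging from the experiment later on, the new features seem to work less well than those from the GBT family)

The following two examples illustrate what totally random trees embedding does, starting with the first purpose: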

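Example 1: Hashing feature transformation using totally random trees

RandomTreesEmbedding provides a way to map data to a very high-dimensional, sparse representation, which may help classification. The mapping is completely unsupervised and very efficient.

This example shows the partitions given by several trees and how the transformation can also be used for non-linear dimensionality reduction or non-linear classification.

Neighboring points often share the same leaf of a tree and therefore share large parts of their hashed representation. This allows truncated singular value decomposition (truncated SVD) to separate the two concentric circles in the transformed data.

In high-dimensional spaces, linear classifiers often achieve excellent accuracy. For sparse binary data, BernoulliNB is particularly well suited. The bottom row compares the decision boundary obtained by BernoulliNB in the transformed space with an ExtraTreesClassifier forest learned on the original data.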

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding, ExtraTreesClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import BernoulliNB

# make a synthetic dataset
X, y = make_circles(factor=0.5, random_state=0, noise=0.05)

# use RandomTreesEmbedding to transform data
hasher = RandomTreesEmbedding(n_estimators=10, random_state=0, max_depth=3)
X_transformed = hasher.fit_transform(X)

# Visualize result after dimensionality reduction using truncated SVD
svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X_transformed)

# Learn a Naive Bayes classifier on the transformed data
nb = BernoulliNB()
nb.fit(X_transformed, y)

# Learn an ExtraTreesClassifier for comparison
trees = ExtraTreesClassifier(max_depth=3, n_estimators=10, random_state=0)
trees.fit(X, y)

# scatter plot of original and reduced data
fig = plt.figure(figsize=(9, 8))

ax = plt.subplot(221)
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, edgecolor='k')
ax.set_title("Original Data (2d)")
ax.set_xticks(())
ax.set_yticks(())

ax = plt.subplot(222)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, s=50, edgecolor='k')
ax.set_title("Truncated SVD reduction (2d) of transformed data (%dd)" %
             X_transformed.shape[1])
ax.set_xticks(())
ax.set_yticks(())

# Plot the decision in original space. For that, we will assign a color
# to each point in the mesh [x_min, x_max]x[y_min, y_max].
h = .01
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# transform grid using RandomTreesEmbedding
transformed_grid = hasher.transform(np.c_[xx.ravel(), yy.ravel()])
y_grid_pred = nb.predict_proba(transformed_grid)[:, 1]

ax = plt.subplot(223)
ax.set_title("Naive Bayes on Transformed data")
ax.pcolormesh(xx, yy, y_grid_pred.reshape(xx.shape))
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, edgecolor='k')
ax.set_ylim(-1.4, 1.4)
ax.set_xlim(-1.4, 1.4)
ax.set_xticks(())
ax.set_yticks(())

# predict on the grid using ExtraTreesClassifier
y_grid_pred = trees.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

ax = plt.subplot(224)
ax.set_title("ExtraTrees predictions")
ax.pcolormesh(xx, yy, y_grid_pred.reshape(xx.shape))
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, edgecolor='k')
ax.set_ylim(-1.4, 1.4)
ax.set_xlim(-1.4, 1.4)
ax.set_xticks(())
ax.set_yticks(())

plt.tight_layout()
plt.show()

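Experiment on generating new features (based on the sklearn example "Feature transformations with ensembles of trees")

A tree-based embedding can map the features into a higher-dimensional, sparser space. First, train a model on the dataset (an extremely randomized forest, a random forest, or a model from the GBT family all work). Each leaf of each tree is then assigned a fixed feature index in the new feature space, and all leaves are one-hot encoded: a sample is encoded by setting the features of the leaves it falls into to 1 and all other features to 0, which maps it into a sparse, high-dimensional space.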
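The encoding step described above can be sketched on a small scale with a random forest. The dataset size and forest settings are arbitrary illustration values; apply returns the leaf index of each sample in each tree, and OneHotEncoder turns those indices into the sparse 0/1 features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=400, random_state=0)

rf = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0).fit(X, y)

# Leaf index reached by every sample in every tree: shape (n_samples, n_trees).
leaves = rf.apply(X)

# One column per (tree, leaf) pair; unseen leaves at transform time are ignored.
encoder = OneHotEncoder(handle_unknown="ignore")
X_leaves = encoder.fit_transform(leaves)

ones_per_row = np.asarray(X_leaves.sum(axis=1)).ravel()
print("leaf index matrix:", leaves.shape)
print("one-hot encoded features:", X_leaves.shape)
print("ones per sample:", np.unique(ones_per_row))
```

Each sample activates exactly one leaf per tree, so every row of the encoded matrix has as many ones as there are trees.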

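The code below shows how well logistic regression (LR) classifies the features produced by the different transformation models. The second figure zooms in on the top-left corner of the first. On this dataset the GBT family still seems to give the better transformation. (You can also use the sklearn interfaces of lightGBM and XGBoost to get an effect similar to GradientBoostingClassifier in the code.)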

import numpy as np
np.random.seed(10)

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.pipeline import make_pipeline

n_estimator = 10
X, y = make_classification(n_samples=80000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
# It is important to train the ensemble of trees on a different subset
# of the training data than the logistic regression model to avoid
# overfitting, in particular if the total number of leaves is
# similar to the number of training samples
X_train, X_train_lr, y_train, y_train_lr = train_test_split(X_train,
                                                            y_train,
                                                            test_size=0.5)

# Unsupervised transformation based on totally random trees
rt = RandomTreesEmbedding(max_depth=3, n_estimators=n_estimator,
                          random_state=0)
rt_lm = LogisticRegression()
pipeline = make_pipeline(rt, rt_lm)
pipeline.fit(X_train, y_train)
y_pred_rt = pipeline.predict_proba(X_test)[:, 1]
fpr_rt_lm, tpr_rt_lm, _ = roc_curve(y_test, y_pred_rt)

# Supervised transformation based on random forests
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
# handle_unknown="ignore": leaves not seen when fitting the encoder encode as zeros
rf_enc = OneHotEncoder(handle_unknown="ignore")
rf_lm = LogisticRegression()
rf.fit(X_train, y_train)
rf_enc.fit(rf.apply(X_train))
rf_lm.fit(rf_enc.transform(rf.apply(X_train_lr)), y_train_lr)

y_pred_rf_lm = rf_lm.predict_proba(rf_enc.transform(rf.apply(X_test)))[:, 1]
fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)

# Supervised transformation based on gradient boosted trees
grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder(handle_unknown="ignore")
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)
grd_enc.fit(grd.apply(X_train)[:, :, 0])
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)

y_pred_grd_lm = grd_lm.predict_proba(
    grd_enc.transform(grd.apply(X_test)[:, :, 0]))[:, 1]
fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)

# The gradient boosted model by itself
y_pred_grd = grd.predict_proba(X_test)[:, 1]
fpr_grd, tpr_grd, _ = roc_curve(y_test, y_pred_grd)

# The random forest model by itself
y_pred_rf = rf.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)

plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()

plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.8, 1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve (zoomed in at top left)')
plt.legend(loc='best')
plt.show()

Beyond this, random forests can also be used for anomaly detection; sklearn implements this algorithm as IsolationForest. For more, please read my blog.
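As a brief, hedged illustration of IsolationForest, the sketch below fits it on made-up data: a tight cluster of inliers plus a few uniformly scattered points. The data and all parameter values are assumptions for this example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_inliers = 0.3 * rng.randn(200, 2)                      # tight cluster near 0
X_outliers = rng.uniform(low=-4, high=4, size=(10, 2))   # scattered points
X = np.vstack([X_inliers, X_outliers])

detector = IsolationForest(n_estimators=100, random_state=0).fit(X)

labels = detector.predict(X)        # +1 for inliers, -1 for outliers
scores = detector.score_samples(X)  # lower score means more abnormal

print("points flagged as outliers:", int((labels == -1).sum()))
print("mean score, cluster points:  ", scores[:200].mean())
print("mean score, scattered points:", scores[200:].mean())
```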

References

sklearn official documentation: ensemble
sklearn official documentation: Hashing feature transformation using Totally Random Trees
sklearn ApacheCN Chinese documentation: ensemble methods
