
The Randomness of the Random Forest Algorithm: A Pictorial Guide to Understanding Random Forest

Published: 2023/12/15


本文是關(guān)于什么的 (What this article is about)

In this article, we will see how the Random Forest algorithm works internally. To truly appreciate it, it might be helpful to understand a bit about Decision-Tree Classifiers, but that is not strictly required.


👉 Note: We are not covering the pre-processing or feature creation steps involved in modelling; we only look at what happens within the algorithm when we call the .fit() and .predict() methods of sklearn's RandomForestClassifier.


Random Forest in one paragraph

Random Forest (RF) is a tree-based algorithm. It is an ensemble of multiple decision trees, each built from a different random sample of rows and features. The final value of the model is the aggregate of the predictions/estimates created by each individual tree (a majority vote for classification, or the average for regression).


The Package

We will be basing our article on sklearn's RandomForestClassifier module:


sklearn.ensemble.RandomForestClassifier

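As a quick orientation before the walkthrough, here is a minimal sketch of fitting this classifier end to end. The data set and parameter values are made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for a table like the one described below
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

my_rf = RandomForestClassifier(n_estimators=100, random_state=42)
my_rf.fit(X, y)               # builds the forest (steps 1-6 of this article)
preds = my_rf.predict(X[:5])  # inference on 5 records (the final step)
```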

The Data

For illustration, we will be using training data similar to the one below.

為了便于說明,我們將使用與以下數(shù)據(jù)類似的訓(xùn)練數(shù)據(jù)。

(Image by Author)

👉 Note: age, glucose_level, weight, gender, smoking, … f98, f99 are all the independent variables, or features.


Diabetic is the y-variable / dependent variable that we have to predict.


內(nèi)部真正發(fā)生了什么 (What really happens internally)

With this basic information, let's get started and understand what happens when we pass this training set to the algorithm…


Step 1 — Bootstrapping

(Image by Author)

Once we provide the training data to the RandomForestClassifier model, the algorithm selects a bunch of rows randomly. This process is called bootstrapping (random sampling with replacement). For our example, let's assume that it selects m records.


Note 👉 The number of rows to be selected can be provided by the user in the hyper-parameter max_samples.


Note 👉 One row might get selected more than once.

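The bootstrap step can be sketched with NumPy (the sizes are hypothetical; sklearn performs this sampling internally):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000   # rows in the training set (a made-up number)
m = 1000        # sample size; sklearn's max_samples defaults to the full row count

# sampling WITH replacement: the same row index can appear more than once
boot_idx = rng.choice(n_rows, size=m, replace=True)
print(len(np.unique(boot_idx)))  # < 1000: some rows repeat, others are left out
```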

Step 2 — Selecting features for sub-trees

Choose the features for the mini decision tree

Now, RF randomly selects a subset of features/columns. Here, for the sake of simplicity and for the example, we are choosing 3 random features.


Note 👉 You can control this number with the hyper-parameter max_features, similar to the code below.


from sklearn.ensemble import RandomForestClassifier

my_rf = RandomForestClassifier(max_features=8)

Step 3 — Selecting the root node

Once the 3 random features are selected, the algorithm tries splitting the m records (from step 1) and does a quick calculation of the before and after values of a metric.


This metric could be either the gini impurity or the entropy. It is based on the criterion — gini or entropy — that you have provided in your hyper-parameter.


criterion='gini' (or 'entropy'; default='gini')


Whichever of the random features gives the minimum gini impurity / entropy value is selected as the root node.

選擇哪個隨機特征給出最大的基尼雜質(zhì)/熵值最小的根節(jié)點。

The records are split at this node based on the best splitting point.

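To make the split metric concrete, here is a small hand-rolled gini-impurity calculation (a sketch of the formula, not sklearn's internal code):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = np.array([0, 0, 0, 1, 1, 1])                   # node before the split
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])  # a perfect split

# "after" value: impurity of the children, weighted by their sizes
weighted_after = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(gini(parent), weighted_after)  # 0.5 -> 0.0: this split would be chosen
```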

Step 4 — Selecting the child nodes

Select the features randomly

The algorithm performs the same process as in Steps 2 and 3 and selects another set of 3 random features. (3 is the number we have specified; you can choose what you like, or leave it to the algorithm to choose the best number.)


Based on the criterion (gini/entropy), it selects which feature will go into the next node / child node, and further splitting of the records happens here.


Step 5 — Further splits and creating child nodes

Continue selecting the features (columns) to select the further child nodes

This process of selecting the random features and splitting the nodes (Steps 2 and 3) continues until either of the following conditions happens:

繼續(xù)選擇隨機特征并分裂節(jié)點的過程(步驟2、4),直到發(fā)生以下任一情況

  • a) you have run out of rows to split (or hit the threshold: the minimum number of rows to be present in each child node)
  • b) the gini impurity / entropy does not decrease after splitting
Now we have the first level of child nodes

You now have your first “mini-decision tree ”.


The first mini-decision tree created using the randomly selected rows (records) & columns (features) (Image by Author)

Step 6 — Create more mini-decision trees

The algorithm goes back to your data and repeats steps 1–5 to create the 2nd "mini-tree".


The second mini-tree that we created using another set of randomly chosen rows & columns

Step 7 — Build the forest of trees

Once the default value of 100 trees is reached (you now have 100 mini decision trees), the model is said to have completed its fit() process.


2 trees from the list of 100 trees

Note 👉 You can specify the number of trees you want to generate with the hyper-parameter n_estimators.


from sklearn.ensemble import RandomForestClassifier

my_rf = RandomForestClassifier(n_estimators=300)

(The number of trees is set by the n_estimators variable, or takes a default value of 100 if not specified.) (Image by Author)

Now you have a forest of randomly created mini-trees (hence the name Random Forest).


Step 8 — Inferencing

Now let's predict the values in an unseen data set (the test data set).


For inferencing (more commonly referred to as predicting/scoring) the test data, the algorithm passes each record through each mini-tree.


(Image by Author)

The values from the record traverse the mini-tree based on the variables that each node represents, and ultimately reach a leaf node. Based on the predetermined value of the leaf node (set during training) where this record ends up, that mini-tree is assigned one prediction output.

記錄中的值基于每個節(jié)點表示的變量遍歷迷你樹,并最終到達(dá)葉節(jié)點。 根據(jù)該記錄最終到達(dá)的葉節(jié)點的預(yù)定值(在訓(xùn)練過程中),為該小樹分配一個預(yù)測輸出。

(Image by Author)

Similarly, the same record goes through all the 100 mini-decision trees, and each of the 100 trees has a prediction output. The final prediction value for this record is calculated by taking a simple vote across these 100 mini-trees.


Now we have the prediction for a single record.

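The per-tree aggregation can be checked against sklearn directly. One caveat worth hedging: sklearn's RandomForestClassifier actually averages the per-tree class probabilities ("soft" voting) rather than counting hard votes, though for fully grown trees the two usually agree. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

record = X[:1]
# ask every mini-tree for its class-probability estimate
per_tree = np.stack([t.predict_proba(record) for t in rf.estimators_])
avg = per_tree.mean(axis=0)               # average across the 100 trees
manual = rf.classes_[avg.argmax(axis=1)]  # class with the highest averaged probability
print(manual[0] == rf.predict(record)[0])
```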

The algorithm iterates through all the records of the test set following the same process and calculates the overall accuracy!


Iterate the process of obtaining the prediction for each row of the test set to arrive at the final accuracy.
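This per-record loop and accuracy calculation map onto sklearn's score method. A minimal sketch with a synthetic train/test split (data and sizes are made up):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
acc = rf.score(X_test, y_test)  # fraction of test rows predicted correctly
print(acc)
```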

Translated from: https://towardsdatascience.com/a-pictorial-guide-to-understanding-random-forest-algorithm-fbf570a0ae0d
