
Machine Learning Algorithms --- Decision Tree Algorithms

Published: 2025/6/16


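1. Introduction to Decision Tree Algorithms

Decision tree algorithms are among the most popular machine learning algorithms today. They can handle both classification and regression (unlike the other learning algorithms introduced previously). This article shows how to use them to classify data.

This article is based on "Decision Trees Algorithms" by Madhu Sanjeevi ( Mady ); readers who are able to should read the original.

Note: several passages below quote the original directly. This is not for lack of willingness to translate; the translated versions never read as clearly as the original does. (Personal view: every translation loses some information, so stay with the original wherever possible.)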

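2. Why decision trees?

With so many learning algorithms already available, why invent decision trees as well? What advantages do they have over other algorithms?

There are many reasons. Two are covered here, which in my view are the biggest advantages over other algorithms.

First, the way a decision tree reasons closely resembles how people think when making a decision. Compared with other algorithms, it does not need to compute a large number of parameters; like a person, it weighs several considerations together and arrives at a good choice (though not necessarily the best one ㄟ(▔,▔)ㄏ).

Second, the logic behind its decisions can be visualized (unlike SVMs or neural networks (NN), which are black boxes to the user). The figure below, for example, shows how a bank might use a decision tree to decide whether to grant a customer a loan.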

3. What is a decision tree?

A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule) and each leaf represents an outcome (categorical or continuous value).

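Take data like that on the left of the figure below; we classify it using the procedure on the right:

step 1: test Age. If Age < 27.5, then Class = High; otherwise go to step 2.

step 2: test CarType. If CarType ∈ Sports, then Class = High; otherwise Class = Low.

For any record, just follow the branches of the decision tree step by step to reach the final result, much like a multi-branch selection structure in programming.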
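The two steps above map directly onto nested conditionals. The following is only a minimal sketch: the function name `classify` and the non-Sports car types used in the calls ("Family", "Truck") are hypothetical values chosen for illustration, not data from the original figure.

```python
def classify(age: float, car_type: str) -> str:
    """Walk the two-level tree: test Age at the root, then CarType."""
    # Step 1: the root node tests Age.
    if age < 27.5:
        return "High"
    # Step 2: reached only when Age >= 27.5; this node tests CarType.
    if car_type in {"Sports"}:
        return "High"
    return "Low"


# Hypothetical records, one per path through the tree.
print(classify(23, "Family"))  # young driver: decided at step 1
print(classify(40, "Sports"))  # older driver, sports car: decided at step 2
print(classify(40, "Truck"))   # older driver, other car: falls through to Low
```

Each path from the root to a leaf corresponds to one `return` statement, which is why a trained tree is easy to read back as rules.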

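4. How to build this?

When learning something new, the three main questions are why, what, and how. The first two have been answered above, which leaves how: how do we build a decision tree?

There are many algorithms for building decision trees. This article covers the following two: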

ID3 (Iterative Dichotomiser 3) → uses Entropy function and Information gain as metrics.

CART (Classification and Regression Trees) → uses Gini Index (Classification) as metric.

----------------------------------------
First, we use the first algorithm to build a decision tree for a classic classification problem:

Let's just take a famous dataset in the machine learning world, the weather dataset (playing game Y or N based on weather conditions).

We have four X values (outlook, temp, humidity and windy) being categorical and one y value (play Y or N) also being categorical.

So we need to learn the mapping (what machine learning always does) between X and y.

This is a binary classification problem, so let's build the tree using the ID3 algorithm.

To begin with, a decision tree is a tree. In computer science, a tree is a data structure with a root node, branches, and leaf nodes.

In a decision tree, each node represents a feature (attribute), so first we need to choose the root node from (outlook, temp, humidity, windy). How should we choose?

Answer: Determine the attribute that best classifies the training data; use this attribute at the root of the tree. Repeat this process for each branch.

This means we perform a top-down greedy search through the space of possible decision trees.

That raises the next question: how do we choose the best attribute?

Answer: use the attribute with the highest information gain in ID3.

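"In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, that characterizes the impurity of an arbitrary collection of examples."

So what is entropy? (The figure below is the definition given by Wikipedia.)

From the formula above we can see that, for a binary classification problem, entropy = 0 means the samples are either all positive or all negative (in theory the samples should fall into two classes, but in practice they all belong to one). If entropy = 1, positive and negative samples each make up half.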
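These two boundary cases can be checked numerically. The sketch below assumes the usual Shannon entropy with a base-2 logarithm, i.e. the sum of -p * log2(p) over the class proportions p; the function name `entropy` and the sample label lists are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    # Counter only holds classes that actually occur, so log2(0) never arises.
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


print(entropy(["Y", "Y", "Y", "Y"]))  # one class only: no impurity
print(entropy(["Y", "Y", "N", "N"]))  # half and half: maximum impurity
print(entropy(["Y", "Y", "Y", "N"]))  # in between the two extremes
```

With base 2 the value for two classes always lies between 0 and 1, which is why the half-and-half case comes out as exactly 1.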

With the concept of entropy in hand, we can define information gain:
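For reference, the standard textbook forms of both definitions can be written as follows. The symbols are assumptions introduced here for illustration rather than notation taken from the original post: S is the set of examples, c the number of classes, p_i the fraction of S belonging to class i, and S_v the subset of S in which attribute A takes the value v.

```latex
H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
\qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```

The second term of the gain is the entropy left over after splitting on A, averaged over the subsets with each subset weighted by its size.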

With these two concepts, we can build the decision tree. The steps are:

1. compute the entropy for the data-set
2. for every attribute/feature:
   1. calculate entropy for all categorical values
   2. take the average information entropy (weighted by subset size) for the current attribute
   3. calculate gain for the current attribute
3. pick the highest gain attribute.
4. Repeat until we get the tree we desired.
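Steps 1 to 3 can be sketched in a few lines of Python. The data table is not reproduced in the text, so the sketch assumes the standard 14-row version of this weather dataset (9 Y, 5 N); if the rows in the original figure differ, the numbers will differ too. `ROWS`, `FEATURES`, `entropy` and `information_gain` are names chosen here for illustration.

```python
import math
from collections import Counter, defaultdict

# Assumed data: the standard 14-row weather table.
# Columns: outlook, temp, humidity, windy, play
ROWS = [
    ("sunny", "hot", "high", "false", "N"),
    ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "Y"),
    ("rainy", "mild", "high", "false", "Y"),
    ("rainy", "cool", "normal", "false", "Y"),
    ("rainy", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "Y"),
    ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "Y"),
    ("rainy", "mild", "normal", "false", "Y"),
    ("sunny", "mild", "normal", "true", "Y"),
    ("overcast", "mild", "high", "true", "Y"),
    ("overcast", "hot", "normal", "false", "Y"),
    ("rainy", "mild", "high", "true", "N"),
]
FEATURES = {"outlook": 0, "temp": 1, "humidity": 2, "windy": 3}


def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def information_gain(rows, col):
    # Step 2.1: group the class labels by the value the attribute takes.
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    # Step 2.2: average the subset entropies, weighted by subset size.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    # Step 2.3: gain = entropy before the split minus entropy after it.
    return entropy([row[-1] for row in rows]) - remainder


# Step 1: entropy of the whole data-set.
print(f"entropy(S) = {entropy([row[-1] for row in ROWS]):.3f}")

# Steps 2 and 3: gain per attribute, then pick the highest.
gains = {name: information_gain(ROWS, col) for name, col in FEATURES.items()}
for name, value in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:9s} gain = {value:.3f}")
best = max(gains, key=gains.get)
print("root attribute:", best)
```

On this assumed table the highest gain belongs to outlook, which agrees with the root node chosen in the walkthrough.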

Let's apply this to the example:

step 1 (compute the entropy of the whole dataset):

step 2 (compute the entropy and information gain of each feature):

Only two of them are worked out here; the other two are computed the same way.

step 3 (pick the attribute with the highest information gain):

The table above lists the entropy and information gain of each feature; we can see that Outlook is the attribute we are looking for.

So our root node is Outlook:

Next, for the unknown node on the left of the figure, we take the rows that came down the sunny branch as the dataset and, following the same steps, choose one of the other three attributes for this node. The node on the right is handled the same way:

The resulting decision tree is as follows:
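Repeating the selection on each branch is naturally recursive. The sketch below is one possible way to write it, not the original author's code: it again assumes the standard 14-row weather table, represents the tree as nested dictionaries, and uses the hypothetical names `gain` and `id3`.

```python
import math
from collections import Counter

# Assumed data: the standard 14-row weather table.
# Columns: outlook, temp, humidity, windy, play
ROWS = [
    ("sunny", "hot", "high", "false", "N"),
    ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "Y"),
    ("rainy", "mild", "high", "false", "Y"),
    ("rainy", "cool", "normal", "false", "Y"),
    ("rainy", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "Y"),
    ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "Y"),
    ("rainy", "mild", "normal", "false", "Y"),
    ("sunny", "mild", "normal", "true", "Y"),
    ("overcast", "mild", "high", "true", "Y"),
    ("overcast", "hot", "normal", "false", "Y"),
    ("rainy", "mild", "high", "true", "N"),
]
FEATURES = {"outlook": 0, "temp": 1, "humidity": 2, "windy": 3}


def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def gain(rows, col):
    remainder = 0.0
    for value in {row[col] for row in rows}:
        subset = [row[-1] for row in rows if row[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([row[-1] for row in rows]) - remainder


def id3(rows, features):
    labels = [row[-1] for row in rows]
    # Leaf: every example agrees, or no attribute is left to split on
    # (in the latter case fall back to the majority class).
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain.
    best = max(features, key=lambda name: gain(rows, features[name]))
    col = features[best]
    remaining = {n: c for n, c in features.items() if n != best}
    # One branch per observed value, built from the rows that take that value.
    return {best: {value: id3([r for r in rows if r[col] == value], remaining)
                   for value in sorted({r[col] for r in rows})}}


tree = id3(ROWS, FEATURES)
print(tree)
```

On the assumed table the sunny branch is split on humidity, the rainy branch on windy, and the overcast branch is already a pure leaf. Ties between attributes are broken by dictionary order, which is an arbitrary choice of this sketch.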

----------------------------------------
Next, we use the second algorithm to build the decision tree (classification using the CART algorithm):

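The CART algorithm is in fact very similar to ID3; only the metric used at each selection differs. In ID3 we use entropy to compute information gain, while in CART we use the Gini index to compute Gini gain.

Likewise, for a binary classification problem (Yes or No) there are four combinations: 1 1, 1 0, 0 1, 0 0. Since P(Target=1) + P(Target=0) = 1, we have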

P(Target=1)*P(Target=1) + P(Target=1)*P(Target=0) + P(Target=0)*P(Target=1) + P(Target=0)*P(Target=0) = 1

P(Target=1)*P(Target=0) + P(Target=0)*P(Target=1) = 1 - P^2(Target=0) - P^2(Target=1)

The Gini index for a binary classification problem is then defined as follows:

Gini index = 1 - P^2(Target=0) - P^2(Target=1)

A Gini score gives an idea of how good a split is by how mixed the classes are in the two groups created by the split. A perfect separation results in a Gini score of 0, whereas the worst case is a split that results in 50/50 classes in each group.

So, for a binary classification problem, the maximum Gini index is:

= 1 - (1/2)^2 - (1/2)^2
= 1 - 2*(1/2)^2
= 1 - 2*(1/4)
= 1 - 0.5
= 0.5

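As in the binary case, we can define the formula for the Gini index in the multi-class case:

Gini index = 1 - (p_1^2 + p_2^2 + ... + p_k^2), where p_i is the proportion of samples in class i

Maximum value of Gini Index could be when all target values are equally distributed.

Likewise, the maximum Gini index (with k classes in total, each of equal size) can be written as: 1 - k*(1/k)^2 = 1 - 1/k

When all samples belong to the same class, the Gini index is 0.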
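The three values just stated (0.5 for two balanced classes, 1 - 1/k for k balanced classes, and 0 for a single class) are easy to confirm. This is a small sketch with a hypothetical helper `gini_index` and made-up label lists.

```python
from collections import Counter


def gini_index(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


print(gini_index(["Y", "N"]))            # two balanced classes
print(gini_index(["a", "b", "c", "d"]))  # k = 4 balanced classes
print(gini_index(["Y", "Y", "Y"]))       # a single class
```

Unlike base-2 entropy, whose maximum for two classes is 1, the Gini index for two classes never exceeds 0.5, so the two metrics are not on the same scale even though they rank splits in a similar way.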

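Now we can select each node according to the Gini gain. The formula for Gini gain (analogous to that for information gain) is as follows:

GiniGain(A,S) = Gini(S) - sum over the values v of A of (|S_v|/|S|) * Gini(S_v)

A decision tree can then be built in the same spirit as ID3. The steps are:

1. compute the gini index for the data-set
2. for every attribute/feature:
   1. calculate gini index for all categorical values
   2. take the weighted average gini index (the right-hand part of GiniGain(A,S); unlike in ID3, this is not an entropy) for the current attribute
   3. calculate the gini gain
3. pick the best gini gain attribute.
4. Repeat until we get the tree we desired.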
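The same selection step, with the Gini index swapped in for entropy, might look like the sketch below. It assumes the standard 14-row weather table again and follows the multiway-split procedure described in this article; `gini_index` and `gini_gain` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Assumed data: the standard 14-row weather table.
# Columns: outlook, temp, humidity, windy, play
ROWS = [
    ("sunny", "hot", "high", "false", "N"),
    ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "Y"),
    ("rainy", "mild", "high", "false", "Y"),
    ("rainy", "cool", "normal", "false", "Y"),
    ("rainy", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "Y"),
    ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "Y"),
    ("rainy", "mild", "normal", "false", "Y"),
    ("sunny", "mild", "normal", "true", "Y"),
    ("overcast", "mild", "high", "true", "Y"),
    ("overcast", "hot", "normal", "false", "Y"),
    ("rainy", "mild", "high", "true", "N"),
]
FEATURES = {"outlook": 0, "temp": 1, "humidity": 2, "windy": 3}


def gini_index(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_gain(rows, col):
    # Step 2.1: group the class labels by attribute value.
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    # Step 2.2: weighted average Gini index of the subsets
    # (the right-hand part of GiniGain(A,S)).
    weighted = sum(len(g) / len(rows) * gini_index(g) for g in groups.values())
    # Step 2.3: Gini index before the split minus the weighted average after.
    return gini_index([row[-1] for row in rows]) - weighted


# Step 1: Gini index of the whole data-set.
print(f"gini(S) = {gini_index([row[-1] for row in ROWS]):.3f}")

# Steps 2 and 3: Gini gain per attribute, then pick the best.
gains = {name: gini_gain(ROWS, col) for name, col in FEATURES.items()}
for name, value in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:9s} gini gain = {value:.3f}")
best = max(gains, key=gains.get)
print("root attribute:", best)
```

On this assumed table the Gini gain also ranks outlook first, so the root node comes out the same as with ID3.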

The resulting decision tree is as follows:

As presented here, the two algorithms follow the same procedure and differ only in the metric (expression) used to select each node.

Reposted from: https://www.cnblogs.com/God-Li/p/9179039.html
