Understanding Decision Trees Through an Example
Xiao Wang is the manager of a well-known golf club, but the question of how many employees to schedule has been giving him headaches. On some days it seems as if everyone comes to play golf, and even with the whole staff rushed off their feet they cannot keep up; on other days, for no apparent reason, nobody shows up at all, and the club wastes a good deal of money on idle employees.
Xiao Wang's goal is to use the coming week's weather forecast to work out when people will play golf, so that he can adjust staffing accordingly. To do that, he first has to understand what makes people decide whether or not to play.
Over a two-week period the following records were collected:
the weather outlook (sunny, overcast, or rainy), the temperature in degrees Fahrenheit, the relative humidity as a percentage, whether it was windy, and, of course, whether customers came to the club on each of those days. The result is a data table with 14 rows and 5 columns.
A decision tree model was then built to solve the problem.
A decision tree is a directed acyclic graph. The root node represents all of the data. The classification-tree algorithm finds that the variable outlook best explains the dependent variable play (whether people come to play golf), and splits the values of outlook into three groups:
sunny, overcast, and rainy.
This gives us a first conclusion: when the weather is overcast, people always play golf, and only a few devoted players will play even in the rain.
Next, the sunny group is split in two, and we find that customers do not like humidity above 70%. Finally, we also find that if it is rainy and windy, nobody plays at all.
This is the solution the classification tree provides. The manager can let most of the staff go on humid sunny days and on windy rainy days, because almost nobody plays golf in that weather; on the other days many people will play, so temporary workers can be hired to help out.
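Written out as code, the rules the tree arrives at are very compact. The sketch below is only an illustration of those rules; the class and method names are made up for this example and are not taken from the article.

```java
// Minimal sketch of the rules read off the golf decision tree above
// (illustrative names; not part of the article's code).
public class GolfRules {

    /** outlook is "sunny", "overcast" or "rainy"; humidity is a percentage. */
    static boolean expectPlayers(String outlook, int humidity, boolean windy) {
        switch (outlook) {
            case "overcast": return true;            // overcast: people always play
            case "sunny":    return humidity <= 70;  // sunny: only if humidity is at most 70%
            case "rainy":    return !windy;          // rainy: only if it is not windy
            default:         return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(expectPlayers("sunny", 65, false)); // true
        System.out.println(expectPlayers("rainy", 80, true));  // false
    }
}
```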
The conclusion: the decision tree turns a complex data representation into a much simpler and more intuitive structure.
Formula
Entropy
The tree-generation algorithms ID3, C4.5, and C5.0 use entropy. This measure is based on the concept of entropy from information theory.
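In standard notation (these are the textbook definitions, not formulas taken from this article), the entropy of a sample set $S$ with $c$ classes is

$$
H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i ,
$$

where $p_i$ is the proportion of samples in $S$ belonging to class $i$. Splitting $S$ on an attribute $A$ yields the information gain

$$
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v) .
$$

ID3 selects the attribute with the largest gain, which is the same as selecting the split with the smallest weighted entropy; that weighted entropy is exactly the quantity minimized by the chooseBestTestAttribute method in the Java listing further below.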
Compared with other data-mining algorithms, decision trees have several advantages:
- Decision trees are easy to understand and implement. After a short explanation, people are able to grasp what a decision tree expresses.
- Data preparation is often simple or unnecessary. Other techniques frequently require the data to be normalized first, for example by removing redundant or empty attributes.
- They can handle both numerical and categorical attributes, whereas many other techniques can only deal with a single attribute type.
- They are a white-box model: given an observed case, the corresponding logical expression is easy to read off the generated tree.
- The model is easy to evaluate with statistical tests, which makes it possible to measure how trustworthy it is.
- They can produce feasible and good results on large data sources in a relatively short time.
From decision trees to decision graphs
In a decision tree, every path from the root to a leaf joins its conditions with AND; a decision graph additionally allows OR to combine more than one path. In the golf example above, for instance, the "nobody plays" outcome is the disjunction (sunny AND humidity > 70%) OR (rainy AND windy).
The decision-tree algorithm is a method for approximating discrete-valued functions. It is a classic classification technique: the data are first preprocessed, an inductive algorithm is used to generate readable rules and a decision tree, and the tree is then used to classify new data. In essence, a decision tree classifies data by applying a sequence of rules.
Decision-tree construction proceeds in two steps. The first step is tree generation: building the tree from a training sample set. The training set is usually historical data, aggregated to some degree according to the needs of the analysis. The second step is pruning: the tree produced in the first step is checked, corrected, and trimmed. Mainly, a separate sample set (called the test set) is used to validate the preliminary rules produced during tree generation, and branches that hurt prediction accuracy are cut off.
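To make the pruning step concrete, below is a minimal sketch of reduced-error pruning in Java. It is not part of the original article: the Node and Example types and the accuracy/majorityLabel helpers are assumptions introduced for illustration, and it does not use the Tree class from the full listing further below. The idea it shows is simply to collapse a subtree into a leaf whenever doing so does not lower accuracy on the test set.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of reduced-error pruning (illustrative only; Node, Example and the
// helpers below are assumptions, not part of the article's listing).
public class ReducedErrorPruning {

    static class Node {
        String attribute;                      // branching attribute; null for a leaf
        String label;                          // class label when the node is a leaf
        String majorityLabel;                  // most common training label at this node
        Map<String, Node> children = new HashMap<>();
    }

    static class Example {
        Map<String, String> attributes = new HashMap<>();
        String label;
    }

    /** Classify one example by walking the tree from the root. */
    static String classify(Node node, Example e) {
        while (node.attribute != null) {
            Node child = node.children.get(e.attributes.get(node.attribute));
            if (child == null) return node.majorityLabel;   // unseen value: fall back
            node = child;
        }
        return node.label;
    }

    /** Fraction of test examples classified correctly. */
    static double accuracy(Node root, List<Example> test) {
        if (test.isEmpty()) return 1.0;
        long correct = test.stream()
                .filter(e -> e.label.equals(classify(root, e)))
                .count();
        return (double) correct / test.size();
    }

    /** Bottom-up: replace a subtree by a leaf if test accuracy does not drop. */
    static void prune(Node root, Node node, List<Example> test) {
        if (node.attribute == null) return;                 // already a leaf
        for (Node child : node.children.values()) prune(root, child, test);
        double before = accuracy(root, test);
        String savedAttribute = node.attribute;
        node.attribute = null;                              // tentatively collapse to a leaf
        node.label = node.majorityLabel;
        if (accuracy(root, test) < before) {                // the collapse hurt: undo it
            node.attribute = savedAttribute;
            node.label = null;
        }
    }
}
```

Pruning bottom-up like this keeps the tree small wherever the extra branches were only fitting noise in the training data.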
A Java implementation follows:
```java
package demo;

import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

public class DicisionTree {

  public static void main(String[] args) throws Exception {
    System.out.print("腳本之家測(cè)試結(jié)果:");
    String[] attrNames = new String[] { "AGE", "INCOME", "STUDENT",
        "CREDIT_RATING" };
    // Read the sample set
    Map<Object, List<Sample>> samples = readSamples(attrNames);
    // Generate the decision tree
    Object decisionTree = generateDecisionTree(samples, attrNames);
    // Print the decision tree
    outputDecisionTree(decisionTree, 0, null);
  }

  /**
   * Reads the pre-classified sample set and returns a Map: category -> list of
   * samples belonging to that category.
   */
  static Map<Object, List<Sample>> readSamples(String[] attrNames) {
    // Sample attributes and their category (the last element of each row is the category)
    Object[][] rawData = new Object[][] {
        { "<30",   "High",   "No",  "Fair",      "0" },
        { "<30",   "High",   "No",  "Excellent", "0" },
        { "30-40", "High",   "No",  "Fair",      "1" },
        { ">40",   "Medium", "No",  "Fair",      "1" },
        { ">40",   "Low",    "Yes", "Fair",      "1" },
        { ">40",   "Low",    "Yes", "Excellent", "0" },
        { "30-40", "Low",    "Yes", "Excellent", "1" },
        { "<30",   "Medium", "No",  "Fair",      "0" },
        { "<30",   "Low",    "Yes", "Fair",      "1" },
        { ">40",   "Medium", "Yes", "Fair",      "1" },
        { "<30",   "Medium", "Yes", "Excellent", "1" },
        { "30-40", "Medium", "No",  "Excellent", "1" },
        { "30-40", "High",   "Yes", "Fair",      "1" },
        { ">40",   "Medium", "No",  "Excellent", "0" } };
    // Build a Sample object for each row and group the samples by category
    Map<Object, List<Sample>> ret = new HashMap<Object, List<Sample>>();
    for (Object[] row : rawData) {
      Sample sample = new Sample();
      int i = 0;
      for (int n = row.length - 1; i < n; i++)
        sample.setAttribute(attrNames[i], row[i]);
      sample.setCategory(row[i]);
      List<Sample> samples = ret.get(row[i]);
      if (samples == null) {
        samples = new LinkedList<Sample>();
        ret.put(row[i], samples);
      }
      samples.add(sample);
    }
    return ret;
  }

  /**
   * Builds the decision tree.
   */
  static Object generateDecisionTree(
      Map<Object, List<Sample>> categoryToSamples, String[] attrNames) {
    // If all samples belong to one category, that category classifies new samples
    if (categoryToSamples.size() == 1)
      return categoryToSamples.keySet().iterator().next();
    // If no attributes are left to test, use the category with the most samples,
    // i.e. classify new samples by majority vote
    if (attrNames.length == 0) {
      int max = 0;
      Object maxCategory = null;
      for (Entry<Object, List<Sample>> entry : categoryToSamples.entrySet()) {
        int cur = entry.getValue().size();
        if (cur > max) {
          max = cur;
          maxCategory = entry.getKey();
        }
      }
      return maxCategory;
    }
    // Choose the test attribute
    Object[] rst = chooseBestTestAttribute(categoryToSamples, attrNames);
    // Root of this subtree; its branching attribute is the chosen test attribute
    Tree tree = new Tree(attrNames[(Integer) rst[0]]);
    // An attribute that has already been used must not be chosen again
    String[] subA = new String[attrNames.length - 1];
    for (int i = 0, j = 0; i < attrNames.length; i++)
      if (i != (Integer) rst[0])
        subA[j++] = attrNames[i];
    // Create one branch for each value of the branching attribute
    @SuppressWarnings("unchecked")
    Map<Object, Map<Object, List<Sample>>> splits =
        (Map<Object, Map<Object, List<Sample>>>) rst[2];
    for (Entry<Object, Map<Object, List<Sample>>> entry : splits.entrySet()) {
      Object attrValue = entry.getKey();
      Map<Object, List<Sample>> split = entry.getValue();
      Object child = generateDecisionTree(split, subA);
      tree.setChild(attrValue, child);
    }
    return tree;
  }

  /**
   * Chooses the best test attribute. "Best" means that branching on the chosen
   * attribute minimizes the total information still needed to classify a new
   * sample within each branch, which is equivalent to maximizing the
   * information gain of the attribute.
   * Returns an array: chosen attribute index, total information,
   * Map(attribute value -> (category -> sample list)).
   */
  static Object[] chooseBestTestAttribute(
      Map<Object, List<Sample>> categoryToSamples, String[] attrNames) {
    int minIndex = -1; // index of the best attribute
    double minValue = Double.MAX_VALUE; // smallest total information
    Map<Object, Map<Object, List<Sample>>> minSplits = null; // best split
    // For every attribute, compute the total information needed to classify a new
    // sample in each branch when that attribute is the test attribute; smallest wins
    for (int attrIndex = 0; attrIndex < attrNames.length; attrIndex++) {
      int allCount = 0; // counter for the total number of samples
      // Build a Map for the current attribute: attribute value -> (category -> sample list)
      Map<Object, Map<Object, List<Sample>>> curSplits =
          new HashMap<Object, Map<Object, List<Sample>>>();
      for (Entry<Object, List<Sample>> entry : categoryToSamples.entrySet()) {
        Object category = entry.getKey();
        List<Sample> samples = entry.getValue();
        for (Sample sample : samples) {
          Object attrValue = sample.getAttribute(attrNames[attrIndex]);
          Map<Object, List<Sample>> split = curSplits.get(attrValue);
          if (split == null) {
            split = new HashMap<Object, List<Sample>>();
            curSplits.put(attrValue, split);
          }
          List<Sample> splitSamples = split.get(category);
          if (splitSamples == null) {
            splitSamples = new LinkedList<Sample>();
            split.put(category, splitSamples);
          }
          splitSamples.add(sample);
        }
        allCount += samples.size();
      }
      // Total information needed in the branches when the current attribute is the
      // test attribute (the entropy of each branch, weighted by its share of samples)
      double curValue = 0.0; // accumulator over all branches
      for (Map<Object, List<Sample>> splits : curSplits.values()) {
        double perSplitCount = 0;
        for (List<Sample> list : splits.values())
          perSplitCount += list.size(); // number of samples in this branch
        double perSplitValue = 0.0; // entropy of this branch
        for (List<Sample> list : splits.values()) {
          double p = list.size() / perSplitCount;
          perSplitValue -= p * (Math.log(p) / Math.log(2));
        }
        curValue += (perSplitCount / allCount) * perSplitValue;
      }
      // Keep the smallest total as the best
      if (minValue > curValue) {
        minIndex = attrIndex;
        minValue = curValue;
        minSplits = curSplits;
      }
    }
    return new Object[] { minIndex, minValue, minSplits };
  }

  /**
   * Prints the decision tree to standard output.
   */
  static void outputDecisionTree(Object obj, int level, Object from) {
    for (int i = 0; i < level; i++)
      System.out.print("|-----");
    if (from != null)
      System.out.printf("(%s):", from);
    if (obj instanceof Tree) {
      Tree tree = (Tree) obj;
      String attrName = tree.getAttribute();
      System.out.printf("[%s = ?]\n", attrName);
      for (Object attrValue : tree.getAttributeValues()) {
        Object child = tree.getChild(attrValue);
        outputDecisionTree(child, level + 1, attrName + " = " + attrValue);
      }
    } else {
      System.out.printf("[CATEGORY = %s]\n", obj);
    }
  }

  /**
   * A sample: a set of attributes plus a category value indicating the class
   * the sample belongs to.
   */
  static class Sample {

    private Map<String, Object> attributes = new HashMap<String, Object>();

    private Object category;

    public Object getAttribute(String name) {
      return attributes.get(name);
    }

    public void setAttribute(String name, Object value) {
      attributes.put(name, value);
    }

    public Object getCategory() {
      return category;
    }

    public void setCategory(Object category) {
      this.category = category;
    }

    public String toString() {
      return attributes.toString();
    }
  }

  /**
   * A decision (sub)tree rooted at a non-leaf node. Each non-leaf node has one
   * branching attribute and several branches; every value of the branching
   * attribute corresponds to one branch, which leads to a subtree.
   */
  static class Tree {

    private String attribute;

    private Map<Object, Object> children = new HashMap<Object, Object>();

    public Tree(String attribute) {
      this.attribute = attribute;
    }

    public String getAttribute() {
      return attribute;
    }

    public Object getChild(Object attrValue) {
      return children.get(attrValue);
    }

    public void setChild(Object attrValue, Object child) {
      children.put(attrValue, child);
    }

    public Set<Object> getAttributeValues() {
      return children.keySet();
    }
  }
}
```
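The listing builds and prints the tree but does not show how to use it on new data. A classification helper along the following lines could be added inside the DicisionTree class; this sketch is not part of the original code and only assumes the Tree and Sample classes defined above.

```java
// Sketch only (not in the original listing): classify a new sample by walking
// the tree from the root, following the branch whose attribute value matches
// the sample, until a leaf (the category object) is reached.
static Object classify(Object node, Sample sample) {
    while (node instanceof Tree) {
        Tree tree = (Tree) node;
        Object child = tree.getChild(sample.getAttribute(tree.getAttribute()));
        if (child == null)
            return null;          // attribute value never seen during training
        node = child;
    }
    return node;                  // anything that is not a Tree is a category label
}
```

For example, after generateDecisionTree returns in main, calling classify(decisionTree, sample) on a Sample whose attributes were filled in with setAttribute would return the predicted category ("0" or "1").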
Output:
Summary