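Efficiently Finding Frequent Itemsets with the FP-growth Algorithm
When using a search engine, we notice that after typing part of a word, the engine auto-completes the query term. The idea behind this is to look at the words used on the Internet and find pairs of words that often appear together, which requires an efficient way to find frequent itemsets.
The FP-growth algorithm builds on Apriori but uses different techniques to accomplish the same task. The task is to store the dataset in a special structure called an FP-tree and then discover frequent itemsets or frequent item pairs, that is, sets of items that often occur together. This makes the algorithm faster than Apriori, usually by more than two orders of magnitude.
Note:
Although this algorithm finds frequent itemsets more efficiently, it cannot be used to discover association rules.
FP-growth needs only two scans of the database, whereas Apriori scans the dataset for every candidate frequent itemset to decide whether the given pattern is frequent. That is why FP-growth is faster than Apriori.
FP-growth finds frequent itemsets in two stages:
- Build the FP-tree
- Mine frequent itemsets from the FP-tree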
1. Building the FP-tree
1.1 The FP-tree deserves special attention: it is important!
- FP-growth stores the data in a compact data structure called an FP-tree. FP stands for Frequent Pattern. An FP-tree looks like other tree structures in computer science, but it uses links to connect similar items, and the linked items can be viewed as a linked list.
- Unlike a search tree, an item can appear more than once in an FP-tree. The FP-tree stores how often itemsets occur, and each itemset is stored as a path in the tree. Sets that share items share part of the tree; the tree forks only where the sets differ. Each node holds a single item of the set together with the number of times it occurs in that sequence, and a path gives the number of times the sequence occurs.
- The links between similar items are called node links; they are used to locate similar items quickly.
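To make the path sharing described above concrete, here is a minimal sketch in Python 3. It is not the article's `treeNode` class: the helper name `insert_path` and the nested-dictionary layout are assumptions made for illustration, and node links are left out.

```python
def insert_path(children, items):
    """Insert one ordered transaction into a nested-dict prefix tree (hypothetical helper)."""
    for item in items:
        node = children.setdefault(item, {"count": 0, "children": {}})
        node["count"] += 1           # one more transaction passes through this node
        children = node["children"]  # descend one level

root = {}  # children of the implicit root
for transaction in (["z", "x", "y"], ["z", "x", "t"], ["x", "s"]):
    insert_path(root, transaction)

print(root["z"]["count"])                   # z is shared by two transactions
print(sorted(root["z"]["children"]["x"]["children"]))  # the tree forks below z -> x
print(root["x"]["count"])                   # x also appears a second time, directly under the root
```

The first two transactions share the prefix z, x and split only at the third item, while x shows up in two different places in the tree, which is exactly what distinguishes an FP-tree from a search tree.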
1.2 The FP-growth workflow
First build the FP-tree, then use it to mine frequent itemsets. Building the FP-tree takes two scans of the original dataset. The first scan counts how often every item occurs; the second scan considers only the frequent items.
FP-growth also needs a data structure called the header table. It is simple: it records the total number of occurrences of each item, along with a pointer to the first node for that item in the FP-tree. In this way each item forms a singly linked list.
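The two scans can be sketched in a few lines of Python 3. This is an illustration under assumptions, not the article's `createTree`: the names `frequent` and `order_items` are made up here, and ties between equally frequent items are broken alphabetically, so the order of tied items can differ from the recorded run in this article.

```python
from collections import Counter

transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['z'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]
min_sup = 3

# Scan 1: count every item, then keep only those meeting the minimum support.
counts = Counter(item for t in transactions for item in t)
frequent = {item: n for item, n in counts.items() if n >= min_sup}

# Scan 2: drop infrequent items and order the rest by descending global count.
def order_items(transaction):
    kept = [item for item in transaction if item in frequent]
    return sorted(kept, key=lambda item: (-frequent[item], item))  # assumed tie-break: alphabetical

ordered = [order_items(t) for t in transactions]
print(frequent)
print(ordered)
```

Ordering every transaction by the same global ranking is what lets transactions with common items share a prefix when they are inserted into the tree.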
1.3 Sample transaction data
Code (Python 2):
# -*- coding: utf-8 -*-

# Return a list of transactions
def loadSimpDat():
    simpDat = [['r', 'z', 'h', 'j', 'p'],
               ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
               ['z'],
               ['r', 'x', 'n', 'o', 's'],
               ['y', 'r', 'x', 'z', 'q', 't', 'p'],
               ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
    return simpDat

# Convert the list of transactions into a dictionary
def createInitSet(dataSet):  # each transaction becomes a frozenset key whose value is 1
    retDict = {}
    for trans in dataSet:
        retDict[frozenset(trans)] = 1
    return retDict

# Class definition for FP-tree nodes
class treeNode:
    def __init__(self, nameValue, numOccur, parentNode):
        self.name = nameValue
        self.count = numOccur
        self.nodeLink = None      # links to the next node holding the same item
        self.parent = parentNode  # parent of the current node
        self.children = {}

    def inc(self, numOccur):
        self.count += numOccur

    def disp(self, ind=1):  # display the tree as text
        print ' ' * ind, self.name, ' ', self.count  # ' ' * ind indents by depth to show the structure
        for child in self.children.values():  # children are treeNode objects too
            child.disp(ind + 1)  # recursive call

# FP-tree construction function
def createTree(dataSet, minSup=1):  # minSup: minimum support
    headerTable = {}
    # two passes over the dataset
    for trans in dataSet:  # first pass: count how often each item occurs
        for item in trans:
            # neat expression: accumulates the item's count over all transactions, giving the header table
            headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
    print 'headerTable_1:', headerTable
    for k in headerTable.keys():  # remove items that occur fewer than minSup times
        if headerTable[k] < minSup:
            del(headerTable[k])
    print 'headerTable_2:', headerTable
    freqItemSet = set(headerTable.keys())  # the frequent items, i.e. the dictionary keys
    print 'freqItemSet: ', freqItemSet
    if len(freqItemSet) == 0:
        return None, None  # no item meets the minimum support: exit
    for k in headerTable:  # iterate over the filtered header table
        headerTable[k] = [headerTable[k], None]  # each value is [count, pointer to the item's first node]
    print 'headerTable_3: ', headerTable
    retTree = treeNode('Null Set', 1, None)  # create the root node
    for tranSet, count in dataSet.items():  # second pass: tranSet is a transaction, count its count (1)
        localD = {}
        print 'tranSet and count:', tranSet, '-->', count
        for item in tranSet:
            if item in freqItemSet:  # keep only the frequent items
                localD[item] = headerTable[item][0]
        print 'localD:', localD
        if len(localD) > 0:
            # list comprehension: the filtered transaction sorted by descending global count
            orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)]
            print 'orderedItems:', orderedItems
            updateTree(orderedItems, retTree, headerTable, count)
    return retTree, headerTable

# Update the tree
def updateTree(items, inTree, headerTable, count):  # inTree: a tree node; count = 1; items: the filtered transaction
    print 'children:', inTree.children.keys()
    if items[0] in inTree.children:  # is the first item already a child?
        print 'here...0'
        inTree.children[items[0]].inc(count)  # yes: increase that node's count
    else:  # no: add it to the tree as a new child
        inTree.children[items[0]] = treeNode(items[0], count, inTree)  # inTree is the parent, items[0] the node name
        print 'here....1'
        inTree.disp()
        if headerTable[items[0]][1] == None:  # the header table has no node for this item yet
            print 'here...2'
            headerTable[items[0]][1] = inTree.children[items[0]]  # point the header table at the new node
            #print 'headerTable_4:',headerTable
        else:  # extend the item's node-link chain to the new node
            print 'here....3'
            updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
    if len(items) > 1:  # recurse on the remaining items, below the child just visited
        print 'len(items):', len(items)
        updateTree(items[1::], inTree.children[items[0]], headerTable, count)

# Update the header table's linked list
def updateHeader(nodeToTest, targetNode):  # both arguments are node objects
    print 'updateHeader:', nodeToTest.name, targetNode.name
    while (nodeToTest.nodeLink != None):
        print 'gaga...'
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode
    print 'nodeToTest.nodeLink-->', nodeToTest.nodeLink.name

def ascendTree(leafNode, prefixPath):  # ascends from leaf node to root
    if leafNode.parent != None:
        prefixPath.append(leafNode.name)
        ascendTree(leafNode.parent, prefixPath)

# Main
# Test
rootNode = treeNode('pyramid', 9, None)
rootNode.disp()
rootNode.children['eye'] = treeNode('pyramid', 13, None)
rootNode.disp()
rootNode.children['phoenix'] = treeNode('phoenix', 3, None)
rootNode.disp()

# Build the FP-tree
simDat = loadSimpDat()
initSet = createInitSet(simDat)
print 'initSet:', initSet
myFPtree, myHeaderTab = createTree(initSet, 3)
print 'complete tree:', myFPtree.disp()
#print 'myHeaderTab:',myHeaderTab

Output:
  pyramid   9
  pyramid   9
   pyramid   13
  pyramid   9
   pyramid   13
   phoenix   3
initSet: {frozenset(['e', 'm', 'q', 's', 't', 'y', 'x', 'z']): 1, frozenset(['x', 's', 'r', 'o', 'n']): 1, frozenset(['s', 'u', 't', 'w', 'v', 'y', 'x', 'z']): 1, frozenset(['q', 'p', 'r', 't', 'y', 'x', 'z']): 1, frozenset(['h', 'r', 'z', 'p', 'j']): 1, frozenset(['z']): 1}
headerTable_1: {'e': 1, 'h': 1, 'j': 1, 'm': 1, 'o': 1, 'n': 1, 'q': 2, 'p': 2, 's': 3, 'r': 3, 'u': 1, 't': 3, 'w': 1, 'v': 1, 'y': 3, 'x': 4, 'z': 5}
headerTable_2: {'s': 3, 'r': 3, 't': 3, 'y': 3, 'x': 4, 'z': 5}
freqItemSet: set(['s', 'r', 't', 'y', 'x', 'z'])
headerTable_3: {'s': [3, None], 'r': [3, None], 't': [3, None], 'y': [3, None], 'x': [4, None], 'z': [5, None]}
tranSet and count: frozenset(['e', 'm', 'q', 's', 't', 'y', 'x', 'z']) --> 1
localD: {'y': 3, 'x': 4, 's': 3, 'z': 5, 't': 3}
orderedItems: ['z', 'x', 'y', 's', 't']
children: []
here....1
  Null Set   1
   z   1
here...2
len(items): 5
children: []
here....1
  z   1
   x   1
here...2
len(items): 4
children: []
here....1
  x   1
   y   1
here...2
len(items): 3
children: []
here....1
  y   1
   s   1
here...2
len(items): 2
children: []
here....1
  s   1
   t   1
here...2
tranSet and count: frozenset(['x', 's', 'r', 'o', 'n']) --> 1
localD: {'x': 4, 's': 3, 'r': 3}
orderedItems: ['x', 's', 'r']
children: ['z']
here....1
  Null Set   1
   x   1
   z   1
    x   1
     y   1
      s   1
       t   1
here....3
updateHeader: x x
nodeToTest.nodeLink--> x
len(items): 3
children: []
here....1
  x   1
   s   1
here....3
updateHeader: s s
nodeToTest.nodeLink--> s
len(items): 2
children: []
here....1
  s   1
   r   1
here...2
tranSet and count: frozenset(['s', 'u', 't', 'w', 'v', 'y', 'x', 'z']) --> 1
localD: {'y': 3, 'x': 4, 's': 3, 'z': 5, 't': 3}
orderedItems: ['z', 'x', 'y', 's', 't']
children: ['x', 'z']
here...0
len(items): 5
children: ['x']
here...0
len(items): 4
children: ['y']
here...0
len(items): 3
children: ['s']
here...0
len(items): 2
children: ['t']
here...0
tranSet and count: frozenset(['q', 'p', 'r', 't', 'y', 'x', 'z']) --> 1
localD: {'y': 3, 'x': 4, 'r': 3, 't': 3, 'z': 5}
orderedItems: ['z', 'x', 'y', 'r', 't']
children: ['x', 'z']
here...0
len(items): 5
children: ['x']
here...0
len(items): 4
children: ['y']
here...0
len(items): 3
children: ['s']
here....1
  y   3
   s   2
    t   2
   r   1
here....3
updateHeader: r r
nodeToTest.nodeLink--> r
len(items): 2
children: []
here....1
  r   1
   t   1
here....3
updateHeader: t t
nodeToTest.nodeLink--> t
tranSet and count: frozenset(['h', 'r', 'z', 'p', 'j']) --> 1
localD: {'r': 3, 'z': 5}
orderedItems: ['z', 'r']
children: ['x', 'z']
here...0
len(items): 2
children: ['x']
here....1
  z   4
   x   3
    y   3
     s   2
      t   2
     r   1
      t   1
   r   1
here....3
updateHeader: r r
gaga...
nodeToTest.nodeLink--> r
tranSet and count: frozenset(['z']) --> 1
localD: {'z': 5}
orderedItems: ['z']
children: ['x', 'z']
here...0
complete tree:   Null Set   1
   x   1
    s   1
     r   1
   z   5
    x   3
     y   3
      s   2
       t   2
      r   1
       t   1
    r   1
None
myHeaderTab: {'s': [3, <__main__.treeNode instance at 0x000000000B905188>], 'r': [3, <__main__.treeNode instance at 0x000000000B9B0E08>], 't': [3, <__main__.treeNode instance at 0x000000000B905F08>], 'y': [3, <__main__.treeNode instance at 0x000000000B9051C8>], 'x': [4, <__main__.treeNode instance at 0x000000000B9E5688>], 'z': [5, <__main__.treeNode instance at 0x000000000B9E59C8>]}

That is the whole FP-tree construction process. Every step has been printed out; following the trace one step at a time against the figure with the header table above makes the details clear. For a full explanation, see Machine Learning in Action.
The first two steps of building the FP-tree:
All I want to say here is: data structures matter! Data structures matter! Data structures matter!
Usage of frozenset() in Python
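Why the transactions are stored as `frozenset` keys can be shown with a short Python 3 sketch. The variable names here are made up for illustration: a `frozenset` is hashable and ignores item order, so it can serve as a dictionary key, whereas a plain `set` cannot.

```python
a = frozenset(['z', 'r'])
b = frozenset(['r', 'z'])   # same items, different order

support = {a: 1}
support[b] = support.get(b, 0) + 1   # b hashes to the same key as a

print(a == b)     # the two frozensets compare equal
print(support)    # a single key with an accumulated count

try:
    {set(['z', 'r']): 1}   # a mutable set is unhashable
    set_key_failed = False
except TypeError:
    set_key_failed = True
print(set_key_failed)
```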
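2. Mining frequent itemsets from the FP-tree
With the FP-tree in hand, we can extract the frequent itemsets. The approach is broadly similar to Apriori: start with single-item sets and build progressively larger sets from them.
The three basic steps for extracting frequent itemsets from the FP-tree are:
- Get the conditional pattern bases from the FP-tree;
- Use each conditional pattern base to build a conditional FP-tree;
- Repeat steps 1 and 2 until the tree contains a single item.
The key part is finding the conditional pattern bases; after that, a conditional FP-tree is created for each conditional pattern base.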
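2.1 Extracting conditional pattern bases
Start with each frequent item in the header table and, for each one, obtain its conditional pattern base. A conditional pattern base is the collection of paths that end with the item being looked up. Each of these paths is a prefix path. In short, a prefix path is everything between the item being looked up and the root of the tree.
From Figure 1.1, the prefix paths (conditional pattern bases) of each frequent item, as also printed in the run output in section 2.3, are:
- z: none (z sits directly under the root)
- x: {z}: 3
- y: {z, x}: 3
- s: {z, x, y}: 2, {x}: 1
- r: {x, s}: 1, {z, x, y}: 1, {z}: 1
- t: {z, x, y, s}: 2, {z, x, y, r}: 1
The prefix paths are used in the next step to build conditional FP-trees; set that aside for now. How do we find the paths that contain a given frequent item? With the header table created earlier and the node links between similar items in the FP-tree, we already have a singly linked list for every item, so the paths can be read off directly.
In the code, the conditional pattern base (prefix paths) of a given item is generated by visiting every node in the tree that contains that item.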
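The walk along the node links can be sketched as follows in Python 3. This is a compact stand-in for the article's `treeNode`, `updateTree` and `findPrefixPath`, with hypothetical names (`Node`, `build_tree`, `prefix_paths`). It assumes the transactions are already filtered and ordered, and the `ordered` list simply copies the `orderedItems` values printed in the run above, so that the tree has the same shape.

```python
class Node:
    def __init__(self, name, parent):
        self.name, self.parent = name, parent
        self.count = 0
        self.children = {}
        self.link = None  # next node holding the same item (node link)

def build_tree(ordered_transactions):
    root = Node(None, None)
    header = {}  # item -> first node for that item
    for items in ordered_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                if item not in header:
                    header[item] = child
                else:  # append the new node to the item's node-link chain
                    last = header[item]
                    while last.link is not None:
                        last = last.link
                    last.link = child
            child.count += 1
            node = child
    return root, header

def prefix_paths(item, header):
    """Conditional pattern base of `item`: {frozenset(prefix path): count}."""
    paths = {}
    node = header.get(item)
    while node is not None:           # follow the node links
        path = []
        parent = node.parent
        while parent.name is not None:  # climb until the root
            path.append(parent.name)
            parent = parent.parent
        if path:
            key = frozenset(path)
            paths[key] = paths.get(key, 0) + node.count
        node = node.link
    return paths

# Ordered, filtered transactions as printed by the article's run (minSup = 3).
ordered = [
    ['z', 'x', 'y', 's', 't'],
    ['x', 's', 'r'],
    ['z', 'x', 'y', 's', 't'],
    ['z', 'x', 'y', 'r', 't'],
    ['z', 'r'],
    ['z'],
]
root, header = build_tree(ordered)
print(prefix_paths('r', header))
print(prefix_paths('t', header))
```

Each node reached through the chain contributes one prefix path, weighted by that node's own count, which is why the path through s for t carries a count of 2.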
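2.2 Creating conditional FP-trees
A conditional FP-tree is created for every frequent item. The conditional pattern bases just found serve as the input data, and the same tree-building code constructs these trees. For example, for r the input is "{x, s}: 1, {z, x, y}: 1, {z}: 1", which is passed to createTree() to obtain the conditional FP-tree of r; for t the input is the corresponding conditional pattern base "{z, x, y, s}: 2, {z, x, y, r}: 1". The process then continues recursively: find frequent itemsets, find conditional pattern bases, and find further conditional trees.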
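A quick way to see which items survive in a conditional FP-tree is to count the items of the conditional pattern base, weighted by each path's count. The Python 3 sketch below does only that counting step (the helper name `frequent_in_base` is an assumption; `createTree` does the same filtering internally before it builds the tree).

```python
from collections import Counter

def frequent_in_base(pattern_base, min_sup):
    """Weighted item counts of a conditional pattern base, filtered by min_sup."""
    counts = Counter()
    for path, n in pattern_base.items():
        for item in path:
            counts[item] += n
    return {item: c for item, c in counts.items() if c >= min_sup}

t_base = {frozenset('zxys'): 2, frozenset('zxyr'): 1}
r_base = {frozenset('xs'): 1, frozenset('zxy'): 1, frozenset('z'): 1}

print(frequent_in_base(t_base, 3))  # z, x and y each reach 3; s (2) and r (1) are dropped
print(frequent_in_base(r_base, 3))  # nothing reaches 3, so r has no conditional tree
```

With a minimum support of 3, the conditional tree for t therefore contains only z, x and y, and r gets no conditional tree at all.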
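Figure:
2.3 Recursively finding frequent itemsets
With the FP-tree and the conditional FP-trees, we can find frequent itemsets recursively on the basis of the previous two steps.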
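The recursion itself can be sketched without the tree. The Python 3 function below is a simplification, not the article's `mineTree`: it keeps each conditional pattern base as a dictionary of ordered tuples instead of compressing it into an FP-tree, and the name `mine` and the alphabetical tie-break are assumptions. The pattern-growth logic (pick an item, record prefix plus item, recurse on that item's conditional pattern base) is the same.

```python
from collections import Counter

def mine(pattern_base, min_sup, prefix=frozenset()):
    """Return {frozenset(itemset): support} for a base of {tuple(items): count}."""
    counts = Counter()
    for items, n in pattern_base.items():
        for item in items:
            counts[item] += n
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    # Drop infrequent items and put every transaction in the same global order.
    ordered = Counter()
    for items, n in pattern_base.items():
        kept = tuple(sorted((i for i in items if i in rank), key=rank.get))
        if kept:
            ordered[kept] += n

    found = {}
    for item in order:
        itemset = prefix | {item}
        found[itemset] = frequent[item]
        # Conditional pattern base: whatever precedes `item` in each transaction.
        cond = Counter()
        for items, n in ordered.items():
            if item in items:
                head = items[:items.index(item)]
                if head:
                    cond[head] += n
        found.update(mine(cond, min_sup, itemset))
    return found

transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['z'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]
freq = mine(Counter(tuple(t) for t in transactions), 3)
print(len(freq))
for itemset, support in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Because every itemset is recorded together with the count of the conditional base it came from, the result also carries each itemset's support, which the article's `freqItemList` does not keep.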
Complete code (Python 2):
# -*- coding: utf-8 -*-

# Return a list of transactions
def loadSimpDat():
    simpDat = [['r', 'z', 'h', 'j', 'p'],
               ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
               ['z'],
               ['r', 'x', 'n', 'o', 's'],
               ['y', 'r', 'x', 'z', 'q', 't', 'p'],
               ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
    return simpDat

# Convert the list of transactions into a dictionary
def createInitSet(dataSet):  # each transaction becomes a frozenset key whose value is 1
    retDict = {}
    for trans in dataSet:
        retDict[frozenset(trans)] = 1
    return retDict

# Class definition for FP-tree nodes
class treeNode:
    def __init__(self, nameValue, numOccur, parentNode):
        self.name = nameValue
        self.count = numOccur
        self.nodeLink = None      # links to the next node holding the same item
        self.parent = parentNode  # parent of the current node
        self.children = {}

    def inc(self, numOccur):
        self.count += numOccur

    def disp(self, ind=1):  # display the tree as text
        print ' ' * ind, self.name, ' ', self.count  # ' ' * ind indents by depth to show the structure
        for child in self.children.values():  # children are treeNode objects too
            child.disp(ind + 1)  # recursive call

# FP-tree construction function
def createTree(dataSet, minSup=1):  # minSup: minimum support
    headerTable = {}
    for trans in dataSet:  # first pass: count how often each item occurs
        for item in trans:
            headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
    for k in headerTable.keys():  # remove items that occur fewer than minSup times
        if headerTable[k] < minSup:
            del(headerTable[k])
    freqItemSet = set(headerTable.keys())  # the frequent items, i.e. the dictionary keys
    if len(freqItemSet) == 0:
        return None, None  # no item meets the minimum support: exit
    for k in headerTable:  # iterate over the filtered header table
        headerTable[k] = [headerTable[k], None]  # each value is [count, pointer to the item's first node]
    retTree = treeNode('Null Set', 1, None)  # create the root node
    for tranSet, count in dataSet.items():  # second pass: tranSet is a transaction, count its count
        localD = {}
        for item in tranSet:
            if item in freqItemSet:  # keep only the frequent items
                localD[item] = headerTable[item][0]
        if len(localD) > 0:
            orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)]
            updateTree(orderedItems, retTree, headerTable, count)
    return retTree, headerTable

# Update the tree
def updateTree(items, inTree, headerTable, count):  # inTree: a tree node; items: the filtered transaction
    if items[0] in inTree.children:  # is the first item already a child?
        inTree.children[items[0]].inc(count)  # yes: increase that node's count
    else:  # no: add it to the tree as a new child
        inTree.children[items[0]] = treeNode(items[0], count, inTree)  # inTree is the parent, items[0] the node name
        inTree.disp()
        if headerTable[items[0]][1] == None:  # the header table has no node for this item yet
            headerTable[items[0]][1] = inTree.children[items[0]]  # point the header table at the new node
        else:  # extend the item's node-link chain to the new node
            updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
    if len(items) > 1:  # recurse on the remaining items, below the child just visited
        updateTree(items[1::], inTree.children[items[0]], headerTable, count)

# Update the header table's linked list
def updateHeader(nodeToTest, targetNode):  # both arguments are node objects
    while (nodeToTest.nodeLink != None):
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode

# Collect the path that ends at a given node (ascend the FP-tree)
def ascendTree(leafNode, prefixPath):
    if leafNode.parent != None:  # climb until the root, the only node whose parent is None
        prefixPath.append(leafNode.name)
        ascendTree(leafNode.parent, prefixPath)

# Build the conditional pattern base (walk the linked list to its end, calling ascendTree() at every node)
def findPrefixPath(basePat, treeNode):  # basePat: the item; treeNode: the item's first node from the header table
    condPats = {}  # conditional pattern base dictionary
    while treeNode != None:
        prefixPath = []  # path collected while ascending
        ascendTree(treeNode, prefixPath)
        print 'prefixPath:', prefixPath
        if len(prefixPath) > 1:
            condPats[frozenset(prefixPath[1:])] = treeNode.count
            print 'condPats:', condPats
        treeNode = treeNode.nodeLink
    return condPats  # return the conditional pattern base

# Recursively find frequent itemsets
# (myFPtree, myHeaderTab, 3, set([]), freqItems=[])
def mineTree(inTree, headerTable, minSup, preFix, freqItemList):
    bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1])]  # sort the header table in ascending order
    print 'bigL:', bigL  # items of the header table
    for basePat in bigL:  # start from the least frequent item
        newFreqSet = preFix.copy()
        newFreqSet.add(basePat)  # sets use add()
        print 'finalFrequent Item: ', newFreqSet  # a frequent itemset
        freqItemList.append(newFreqSet)  # lists use append()
        #print 'treenood:',headerTable[basePat][1].name
        condPattBases = findPrefixPath(basePat, headerTable[basePat][1])  # second argument: first node of basePat
        print 'condPattBases :', basePat, '-->', condPattBases  # the conditional pattern base dictionary
        # build a conditional FP-tree from the conditional pattern base
        myCondTree, myHead = createTree(condPattBases, minSup)  # myCondTree: the conditional FP-tree
        print 'head from conditional tree: ', myHead
        if myHead != None:
            print 'conditional tree for: ', newFreqSet
            myCondTree.disp(1)
            mineTree(myCondTree, myHead, minSup, newFreqSet, freqItemList)

# Main
# Build the FP-tree
simDat = loadSimpDat()
initSet = createInitSet(simDat)
print 'initSet:', initSet
myFPtree, myHeaderTab = createTree(initSet, 3)
myFPtree.disp()
print 'myHeaderTab:', myHeaderTab
freqItems = []
mineTree(myFPtree, myHeaderTab, 3, set([]), freqItems)
print 'freqItems:', freqItems

Output:
initSet: {frozenset(['e', 'm', 'q', 's', 't', 'y', 'x', 'z']): 1, frozenset(['x', 's', 'r', 'o', 'n']): 1, frozenset(['s', 'u', 't', 'w', 'v', 'y', 'x', 'z']): 1, frozenset(['q', 'p', 'r', 't', 'y', 'x', 'z']): 1, frozenset(['h', 'r', 'z', 'p', 'j']): 1, frozenset(['z']): 1}
  Null Set   1
   z   1
  z   1
   x   1
  x   1
   y   1
  y   1
   s   1
  s   1
   t   1
  Null Set   1
   x   1
   z   1
    x   1
     y   1
      s   1
       t   1
  x   1
   s   1
  s   1
   r   1
  y   3
   s   2
    t   2
   r   1
  r   1
   t   1
  z   4
   x   3
    y   3
     s   2
      t   2
     r   1
      t   1
   r   1
  Null Set   1
   x   1
    s   1
     r   1
   z   5
    x   3
     y   3
      s   2
       t   2
      r   1
       t   1
    r   1
myHeaderTab: {'s': [3, <__main__.treeNode instance at 0x000000000B8E9808>], 'r': [3, <__main__.treeNode instance at 0x000000000B8E9608>], 't': [3, <__main__.treeNode instance at 0x000000000B8E9788>], 'y': [3, <__main__.treeNode instance at 0x000000000B8E9848>], 'x': [4, <__main__.treeNode instance at 0x000000000B8E97C8>], 'z': [5, <__main__.treeNode instance at 0x000000000B8E9508>]}
bigL: ['r', 't', 's', 'y', 'x', 'z']
finalFrequent Item: set(['r'])
prefixPath: ['r', 's', 'x']
condPats: {frozenset(['x', 's']): 1}
prefixPath: ['r', 'y', 'x', 'z']
condPats: {frozenset(['x', 's']): 1, frozenset(['y', 'x', 'z']): 1}
prefixPath: ['r', 'z']
condPats: {frozenset(['x', 's']): 1, frozenset(['z']): 1, frozenset(['y', 'x', 'z']): 1}
condPattBases : r --> {frozenset(['x', 's']): 1, frozenset(['z']): 1, frozenset(['y', 'x', 'z']): 1}
head from conditional tree: None
finalFrequent Item: set(['t'])
prefixPath: ['t', 's', 'y', 'x', 'z']
condPats: {frozenset(['y', 'x', 's', 'z']): 2}
prefixPath: ['t', 'r', 'y', 'x', 'z']
condPats: {frozenset(['y', 'x', 's', 'z']): 2, frozenset(['y', 'x', 'r', 'z']): 1}
condPattBases : t --> {frozenset(['y', 'x', 's', 'z']): 2, frozenset(['y', 'x', 'r', 'z']): 1}
  Null Set   1
   y   2
  y   2
   x   2
  x   2
   z   2
head from conditional tree: {'y': [3, <__main__.treeNode instance at 0x000000000B8E9888>], 'x': [3, <__main__.treeNode instance at 0x000000000B8E9448>], 'z': [3, <__main__.treeNode instance at 0x000000000B8E9408>]}
conditional tree for: set(['t'])
  Null Set   1
   y   3
    x   3
     z   3
bigL: ['z', 'x', 'y']
finalFrequent Item: set(['z', 't'])
prefixPath: ['z', 'x', 'y']
condPats: {frozenset(['y', 'x']): 3}
condPattBases : z --> {frozenset(['y', 'x']): 3}
  Null Set   1
   y   3
  y   3
   x   3
head from conditional tree: {'y': [3, <__main__.treeNode instance at 0x000000000B8E9488>], 'x': [3, <__main__.treeNode instance at 0x000000000B8E93C8>]}
conditional tree for: set(['z', 't'])
  Null Set   1
   y   3
    x   3
bigL: ['x', 'y']
finalFrequent Item: set(['x', 'z', 't'])
prefixPath: ['x', 'y']
condPats: {frozenset(['y']): 3}
condPattBases : x --> {frozenset(['y']): 3}
  Null Set   1
   y   3
head from conditional tree: {'y': [3, <__main__.treeNode instance at 0x000000000B8E92C8>]}
conditional tree for: set(['x', 'z', 't'])
  Null Set   1
   y   3
bigL: ['y']
finalFrequent Item: set(['y', 'x', 'z', 't'])
prefixPath: ['y']
condPattBases : y --> {}
head from conditional tree: None
finalFrequent Item: set(['y', 'z', 't'])
prefixPath: ['y']
condPattBases : y --> {}
head from conditional tree: None
finalFrequent Item: set(['x', 't'])
prefixPath: ['x', 'y']
condPats: {frozenset(['y']): 3}
condPattBases : x --> {frozenset(['y']): 3}
  Null Set   1
   y   3
head from conditional tree: {'y': [3, <__main__.treeNode instance at 0x000000000B8E9188>]}
conditional tree for: set(['x', 't'])
  Null Set   1
   y   3
bigL: ['y']
finalFrequent Item: set(['y', 'x', 't'])
prefixPath: ['y']
condPattBases : y --> {}
head from conditional tree: None
finalFrequent Item: set(['y', 't'])
prefixPath: ['y']
condPattBases : y --> {}
head from conditional tree: None
finalFrequent Item: set(['s'])
prefixPath: ['s', 'y', 'x', 'z']
condPats: {frozenset(['y', 'x', 'z']): 2}
prefixPath: ['s', 'x']
condPats: {frozenset(['y', 'x', 'z']): 2, frozenset(['x']): 1}
condPattBases : s --> {frozenset(['y', 'x', 'z']): 2, frozenset(['x']): 1}
  Null Set   1
   x   2
head from conditional tree: {'x': [3, <__main__.treeNode instance at 0x000000000B69B788>]}
conditional tree for: set(['s'])
  Null Set   1
   x   3
bigL: ['x']
finalFrequent Item: set(['x', 's'])
prefixPath: ['x']
condPattBases : x --> {}
head from conditional tree: None
finalFrequent Item: set(['y'])
prefixPath: ['y', 'x', 'z']
condPats: {frozenset(['x', 'z']): 3}
condPattBases : y --> {frozenset(['x', 'z']): 3}
  Null Set   1
   x   3
  x   3
   z   3
head from conditional tree: {'x': [3, <__main__.treeNode instance at 0x000000000B6A26C8>], 'z': [3, <__main__.treeNode instance at 0x000000000B84B248>]}
conditional tree for: set(['y'])
  Null Set   1
   x   3
    z   3
bigL: ['x', 'z']
finalFrequent Item: set(['y', 'x'])
prefixPath: ['x']
condPattBases : x --> {}
head from conditional tree: None
finalFrequent Item: set(['y', 'z'])
prefixPath: ['z', 'x']
condPats: {frozenset(['x']): 3}
condPattBases : z --> {frozenset(['x']): 3}
  Null Set   1
   x   3
head from conditional tree: {'x': [3, <__main__.treeNode instance at 0x000000000B6A10C8>]}
conditional tree for: set(['y', 'z'])
  Null Set   1
   x   3
bigL: ['x']
finalFrequent Item: set(['y', 'x', 'z'])
prefixPath: ['x']
condPattBases : x --> {}
head from conditional tree: None
finalFrequent Item: set(['x'])
prefixPath: ['x', 'z']
condPats: {frozenset(['z']): 3}
prefixPath: ['x']
condPattBases : x --> {frozenset(['z']): 3}
  Null Set   1
   z   3
head from conditional tree: {'z': [3, <__main__.treeNode instance at 0x000000000B823908>]}
conditional tree for: set(['x'])
  Null Set   1
   z   3
bigL: ['z']
finalFrequent Item: set(['x', 'z'])
prefixPath: ['z']
condPattBases : z --> {}
head from conditional tree: None
finalFrequent Item: set(['z'])
prefixPath: ['z']
condPattBases : z --> {}
head from conditional tree: None
freqItems: [set(['r']), set(['t']), set(['z', 't']), set(['x', 'z', 't']), set(['y', 'x', 'z', 't']), set(['y', 'z', 't']), set(['x', 't']), set(['y', 'x', 't']), set(['y', 't']), set(['s']), set(['x', 's']), set(['y']), set(['y', 'x']), set(['y', 'z']), set(['y', 'x', 'z']), set(['x']), set(['x', 'z']), set(['z'])]

The trace above shows the detailed process.
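Supplement: a lot of recursion is involved, so the detailed process is tedious to follow. Here is one example.
The process for the line `for basePat in bigL:` when basePat is 't':
Checking it against the run output of the code above helps with the analysis. There is nothing more to it than data structures.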
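3. Mining news stories from a news-site clickstream
These two chapters of the book contain quite a few good examples; here I pick just one representative case: mining popular news stories from the clickstream of a news site. This is a large dataset with nearly one million records (see further reading: kosarak). The source data is stored in the file kosarak.dat. Each line of the file contains the news stories viewed by one user. The stories are encoded as integers, and we can use Apriori or FP-growth to mine the frequent itemsets and see which news IDs were viewed by large numbers of users.
Change the main part of the code in section 2 to the following: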
parsedDat = [line.split() for line in open('kosarak.dat').readlines()]  # load the dataset into a list
initSet = createInitSet(parsedDat)  # format the initial set
# Build the FP-tree and look for the news stories viewed by at least 100,000 people
myFPtree, myHeaderTab = createTree(initSet, 100000)
myFreqList = []  # an empty list to hold the frequent itemsets
mineTree(myFPtree, myHeaderTab, 100000, set([]), myFreqList)
print 'length:', len(myFreqList)  # how many stories or sets of stories were viewed by 100,000 or more people
print 'myFreqList', myFreqList  # the itemsets themselves

Output:
...
condPattBases : 6 --> {}
head from conditional tree: None
finalFrequent Item: set(['6'])
prefixPath: ['6']
condPattBases : 6 --> {}
head from conditional tree: None
length: 9
myFreqList [set(['1']), set(['1', '6']), set(['3']), set(['11', '3']), set(['11', '3', '6']), set(['3', '6']), set(['11']), set(['11', '6']), set(['6'])]

You can also try other settings and compare the results, for example by lowering the minimum support threshold (the 100000 passed to createTree() and mineTree()).
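One thing to keep in mind with a log file: `createInitSet()` assigns the value 1 to every key, so identical transactions collapse into a single entry with count 1. If the same set of stories occurs on several lines, that could undercount support. The Python 3 sketch below is a hypothetical variant (the name `create_init_set_counted` is not from the article or the book) that accumulates the number of occurrences instead; its result has the same `{frozenset: count}` shape that `createTree()` reads.

```python
from collections import Counter

def create_init_set_counted(transactions):
    """Like createInitSet, but identical transactions add up instead of collapsing."""
    counted = Counter()
    for trans in transactions:
        counted[frozenset(trans)] += 1
    return dict(counted)

sample = [['1', '6'], ['6', '1'], ['3']]
print(create_init_set_counted(sample))  # the two sessions with stories 1 and 6 count as 2
```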
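Summary:
- FP-growth is an effective method for finding frequent patterns in a dataset. It applies the Apriori principle but runs faster.
- There is also a map-reduce implementation of FP-growth, which works well and scales to multiple machines. Google has used the algorithm to find frequently co-occurring words by traversing large amounts of text, in a way very similar to the example just described.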
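4. Notes
(1) The Python dictionary get() method:
The dictionary get() method returns the value for the given key; if the key is not in the dictionary, it returns the default value.
get() syntax: dict.get(key, default=None)
key – the key to look up in the dictionary
default – the value returned when the given key does not exist.
Example: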
>>> dict = {'Name': 'Zara', 'Age': 27}
>>> dict.get('Age')
27
>>> dict.get('Sex', 0)
0
>>>

(2) Usage of initSet=createInitSet(simDat):
In [13]: m=['e', 'm', 'q', 's', 't', 'y', 'x', 'z']
In [14]: mm=frozenset(m)
In [15]: initSet
Out[15]:
{frozenset({'e', 'm', 'q', 's', 't', 'x', 'y', 'z'}): 1,
 frozenset({'n', 'o', 'r', 's', 'x'}): 1,
 frozenset({'z'}): 1,
 frozenset({'s', 't', 'u', 'v', 'w', 'x', 'y', 'z'}): 1,
 frozenset({'p', 'q', 'r', 't', 'x', 'y', 'z'}): 1,
 frozenset({'h', 'j', 'p', 'r', 'z'}): 1}
In [16]: initSet[mm]
Out[16]: 1

(3) Usage of orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)]:
In [31]: localD={'y': 3, 'x': 4, 's': 3, 'z': 5, 't': 3}
In [32]: [v[0] for v in sorted(localD.items(), \
    ...: key=lambda p: p[1], reverse=True)]
Out[32]: ['z', 'x', 'y', 's', 't']
# rr is localD sorted in descending order by value (p[1])
In [33]: rr=sorted(localD.items(), \
    ...: key=lambda p: p[1], reverse=True)
In [34]: rr
Out[34]: [('z', 5), ('x', 4), ('y', 3), ('s', 3), ('t', 3)]
# rr is localD sorted in descending order by key (p[0])
In [35]: rr=sorted(localD.items(), \
    ...: key=lambda p: p[0], reverse=True)
In [36]: rr
Out[36]: [('z', 5), ('y', 3), ('x', 4), ('t', 3), ('s', 3)]
# take the first element of every tuple in rr
In [37]: [v[0] for v in rr]
Out[37]: ['z', 'y', 'x', 't', 's']
In [38]: [v[1] for v in rr]
Out[38]: [5, 3, 4, 3, 3]
In [39]: rr[0]
Out[39]: ('z', 5)
In [40]: type(rr[0])
Out[40]: tuple

(4) Usage of updateTree(items[1::], inTree.children[items[0]], headerTable, count):
>>> items=['z', 'x', 'y', 's', 't']
>>> items[1::]
['x', 'y', 's', 't']
>>> items[1:]
['x', 'y', 's', 't']
>>> items[2::]
['y', 's', 't']
>>>

Reference: https://www.cnblogs.com/qwertWZ/p/4510857.html