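Finding frequent itemsets efficiently with the FP-growth algorithm
When we type part of a word into a search engine, the engine completes the query for us. The idea behind this is to look through the words used on the internet and find pairs of words that often appear together, which calls for an efficient way to find frequent itemsets.
FP-growth builds on Apriori but uses different techniques to do the same job. It first stores the dataset in a special structure called an FP-tree and then finds frequent itemsets or frequent pairs, that is, sets of items that commonly occur together. This makes it run faster than Apriori, often by two orders of magnitude or more.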
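Note:
FP-growth finds frequent itemsets more efficiently, but it does not produce association rules by itself; rules have to be generated from the frequent itemsets in a separate step.
FP-growth only needs to scan the database twice, whereas Apriori scans the dataset for every candidate itemset to decide whether that pattern is frequent. This is why FP-growth is faster than Apriori.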
FP-growth finds frequent itemsets in two stages:
- Build the FP-tree
- Mine frequent itemsets from the FP-tree
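1. Building the FP-tree
1.1 A closer look at the FP-tree (this part is important!)
- FP-growth stores the data in a compact structure called an FP-tree. FP stands for Frequent Pattern. An FP-tree looks like other trees in computer science, but it also has links that connect similar items; the linked items can be viewed as a linked list.
- Unlike a search tree, an item can appear several times in one FP-tree. The FP-tree stores how often itemsets occur, and each itemset is stored as a path in the tree. Sets that share items share part of the tree, and the tree only forks where the sets differ. Each node holds a single item from a set together with the number of times it occurs in that sequence, and a path gives the number of times the whole sequence occurs.
- The links between similar items are called node links. They are used to locate similar items quickly.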
1.2 How FP-growth works
First build the FP-tree, then mine it for frequent itemsets. Building the tree takes two passes over the original dataset. The first pass counts how often every item occurs; the second pass only considers the items that turned out to be frequent.
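The two passes can be sketched on their own, before any tree is built. This is a minimal Python 3 illustration, not the listing used in this article: the names `count_items` and `filter_and_order` are made up for the example, the data is the six sample transactions from section 1.3, and ties between equally frequent items are broken alphabetically (an assumption chosen only to make the result deterministic).

```python
from collections import Counter

# The six sample transactions used throughout this article.
transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['z'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]
MIN_SUP = 3


def count_items(transactions):
    """Pass 1: count in how many transactions each item occurs."""
    counts = Counter()
    for trans in transactions:
        counts.update(set(trans))
    return counts


def filter_and_order(trans, counts, min_sup):
    """Pass 2, per transaction: keep frequent items, most frequent first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    frequent = [item for item in set(trans) if counts[item] >= min_sup]
    return sorted(frequent, key=lambda item: (-counts[item], item))


counts = count_items(transactions)
frequent_counts = {i: c for i, c in counts.items() if c >= MIN_SUP}
ordered = [filter_and_order(t, counts, MIN_SUP) for t in transactions]

print(frequent_counts)
print(ordered)
```

Only the ordered, filtered transactions are inserted into the tree, which is what lets transactions with common frequent items share a path.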
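FP-growth also needs a data structure called the header table. It is simple: for each item it records the total number of occurrences, together with a pointer to the first node for that item in the FP-tree. Following the node links from that pointer, the nodes of each item form a singly linked list.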
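As a rough Python 3 sketch of how the header table and node links fit together (the names `FPNode`, `insert` and `nodes_for` are hypothetical and differ from the listing in this article), the code below inserts transactions that are already filtered and ordered, then walks each item's linked list. The per-node counts along a list should add up to the item's total in the header table.

```python
class FPNode:
    """One FP-tree node: an item, its count, and a link to the next node
    holding the same item."""

    def __init__(self, name, count, parent):
        self.name = name
        self.count = count
        self.parent = parent
        self.children = {}
        self.node_link = None


def insert(items, root, header, count=1):
    """Insert one ordered transaction, extending node links as needed."""
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, count, node)
            node.children[item] = child
            if header[item][1] is None:
                header[item][1] = child      # first node for this item
            else:
                last = header[item][1]
                while last.node_link is not None:
                    last = last.node_link
                last.node_link = child       # append to the linked list
        else:
            child.count += count
        node = child


def nodes_for(item, header):
    """Yield every tree node for item by following its node links."""
    node = header[item][1]
    while node is not None:
        yield node
        node = node.node_link


# Ordered transactions as they appear in this article's trace (minSup = 3).
ordered = [
    ['z', 'x', 'y', 's', 't'], ['x', 's', 'r'], ['z', 'x', 'y', 's', 't'],
    ['z', 'x', 'y', 'r', 't'], ['z', 'r'], ['z'],
]
# Header table: item -> [total count, pointer to first node]
header = {'z': [5, None], 'x': [4, None], 'y': [3, None],
          's': [3, None], 'r': [3, None], 't': [3, None]}
root = FPNode('Null Set', 1, None)
for trans in ordered:
    insert(trans, root, header)

chain_totals = {item: sum(n.count for n in nodes_for(item, header))
                for item in header}
print(chain_totals)
```

Appending by walking to the end of the list mirrors the approach used in this article; keeping a tail pointer per item would avoid the walk, at the cost of one more field in the header table.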
1.3 Sample transaction data
Code:
    # -*- coding: utf-8 -*-
    # Python 2 code (print statements)

    # Return a list of transactions
    def loadSimpDat():
        simpDat = [['r', 'z', 'h', 'j', 'p'],
                   ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
                   ['z'],
                   ['r', 'x', 'n', 'o', 's'],
                   ['y', 'r', 'x', 'z', 'q', 't', 'p'],
                   ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
        return simpDat

    # Convert the list of transactions into a dict
    def createInitSet(dataSet):
        # Each transaction becomes a frozenset key whose value is 1
        retDict = {}
        for trans in dataSet:
            retDict[frozenset(trans)] = 1
        return retDict

    # FP-tree node class
    class treeNode:
        def __init__(self, nameValue, numOccur, parentNode):
            self.name = nameValue
            self.count = numOccur
            self.nodeLink = None      # link to the next node holding the same item
            self.parent = parentNode  # parent of this node
            self.children = {}

        def inc(self, numOccur):
            self.count += numOccur

        def disp(self, ind=1):  # print the tree as text
            print ' ' * ind, self.name, ' ', self.count  # ind spaces of indentation show the depth
            for child in self.children.values():  # children are treeNode objects too
                child.disp(ind + 1)  # recurse

    # Build the FP-tree
    def createTree(dataSet, minSup=1):  # minSup: minimum support
        headerTable = {}
        # Two passes over the dataset
        for trans in dataSet:  # first pass: count how often each item occurs
            for item in trans:
                # accumulate the count of item over all transactions; this gives the header table counts
                headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
        print 'headerTable_1:', headerTable
        for k in headerTable.keys():  # drop items seen fewer than minSup times (keys() is a list in Python 2)
            if headerTable[k] < minSup:
                del(headerTable[k])
        print 'headerTable_2:', headerTable
        freqItemSet = set(headerTable.keys())  # the frequent items, i.e. the dict keys
        print 'freqItemSet: ', freqItemSet
        if len(freqItemSet) == 0:
            return None, None  # nothing meets the minimum support
        for k in headerTable:  # walk the filtered header table
            headerTable[k] = [headerTable[k], None]  # value becomes [count, pointer to first node]
        print 'headerTable_3: ', headerTable
        retTree = treeNode('Null Set', 1, None)  # root node
        for tranSet, count in dataSet.items():  # second pass: tranSet is the itemset, count is 1
            localD = {}
            print 'tranSet and count:', tranSet, '-->', count
            for item in tranSet:
                if item in freqItemSet:  # keep frequent items only
                    localD[item] = headerTable[item][0]
            print 'localD:', localD
            if len(localD) > 0:
                # list comprehension: the filtered items of this transaction, by descending count
                orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)]
                print 'orderedItems:', orderedItems
                updateTree(orderedItems, retTree, headerTable, count)
        return retTree, headerTable

    # Grow the tree with one ordered transaction
    def updateTree(items, inTree, headerTable, count):  # inTree: a tree node; count = 1; items: filtered transaction
        print 'children:', inTree.children.keys()
        if items[0] in inTree.children:  # is the first item already a child?
            print 'here...0'
            inTree.children[items[0]].inc(count)  # yes: increment its count
        else:  # no: add it as a new child
            inTree.children[items[0]] = treeNode(items[0], count, inTree)  # inTree is the parent
            print 'here....1'
            inTree.disp()
            if headerTable[items[0]][1] is None:  # header table has no node for this item yet
                print 'here...2'
                headerTable[items[0]][1] = inTree.children[items[0]]
            else:  # otherwise append the new node to the item's node-link list
                print 'here....3'
                updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
        if len(items) > 1:  # recurse on the remaining items below the child
            print 'len(items):', len(items)
            updateTree(items[1::], inTree.children[items[0]], headerTable, count)

    # Append targetNode to the end of a node-link list
    def updateHeader(nodeToTest, targetNode):  # both arguments are tree nodes
        print 'updateHeader:', nodeToTest.name, targetNode.name
        while nodeToTest.nodeLink is not None:
            print 'gaga...'
            nodeToTest = nodeToTest.nodeLink
        nodeToTest.nodeLink = targetNode
        print 'nodeToTest.nodeLink-->', nodeToTest.nodeLink.name

    def ascendTree(leafNode, prefixPath):  # ascend from a leaf node to the root
        if leafNode.parent is not None:
            prefixPath.append(leafNode.name)
            ascendTree(leafNode.parent, prefixPath)

    # Main
    # Test the node class
    rootNode = treeNode('pyramid', 9, None)
    rootNode.disp()
    rootNode.children['eye'] = treeNode('pyramid', 13, None)  # key 'eye', node name 'pyramid' (see the output)
    rootNode.disp()
    rootNode.children['phoenix'] = treeNode('phoenix', 3, None)
    rootNode.disp()

    # Build the FP-tree
    simDat = loadSimpDat()
    initSet = createInitSet(simDat)
    print 'initSet:', initSet
    myFPtree, myHeaderTab = createTree(initSet, 3)
    print 'complete tree:', myFPtree.disp()  # disp() returns None, hence the trailing None in the output
    print 'myHeaderTab:', myHeaderTab

Output:
      pyramid   9
      pyramid   9
       pyramid   13
      pyramid   9
       pyramid   13
       phoenix   3
    initSet: {frozenset(['e', 'm', 'q', 's', 't', 'y', 'x', 'z']): 1, frozenset(['x', 's', 'r', 'o', 'n']): 1, frozenset(['s', 'u', 't', 'w', 'v', 'y', 'x', 'z']): 1, frozenset(['q', 'p', 'r', 't', 'y', 'x', 'z']): 1, frozenset(['h', 'r', 'z', 'p', 'j']): 1, frozenset(['z']): 1}
    headerTable_1: {'e': 1, 'h': 1, 'j': 1, 'm': 1, 'o': 1, 'n': 1, 'q': 2, 'p': 2, 's': 3, 'r': 3, 'u': 1, 't': 3, 'w': 1, 'v': 1, 'y': 3, 'x': 4, 'z': 5}
    headerTable_2: {'s': 3, 'r': 3, 't': 3, 'y': 3, 'x': 4, 'z': 5}
    freqItemSet: set(['s', 'r', 't', 'y', 'x', 'z'])
    headerTable_3: {'s': [3, None], 'r': [3, None], 't': [3, None], 'y': [3, None], 'x': [4, None], 'z': [5, None]}
    tranSet and count: frozenset(['e', 'm', 'q', 's', 't', 'y', 'x', 'z']) --> 1
    localD: {'y': 3, 'x': 4, 's': 3, 'z': 5, 't': 3}
    orderedItems: ['z', 'x', 'y', 's', 't']
    children: []
    here....1
      Null Set   1
       z   1
    here...2
    len(items): 5
    children: []
    here....1
      z   1
       x   1
    here...2
    len(items): 4
    children: []
    here....1
      x   1
       y   1
    here...2
    len(items): 3
    children: []
    here....1
      y   1
       s   1
    here...2
    len(items): 2
    children: []
    here....1
      s   1
       t   1
    here...2
    tranSet and count: frozenset(['x', 's', 'r', 'o', 'n']) --> 1
    localD: {'x': 4, 's': 3, 'r': 3}
    orderedItems: ['x', 's', 'r']
    children: ['z']
    here....1
      Null Set   1
       x   1
       z   1
        x   1
         y   1
          s   1
           t   1
    here....3
    updateHeader: x x
    nodeToTest.nodeLink--> x
    len(items): 3
    children: []
    here....1
      x   1
       s   1
    here....3
    updateHeader: s s
    nodeToTest.nodeLink--> s
    len(items): 2
    children: []
    here....1
      s   1
       r   1
    here...2
    tranSet and count: frozenset(['s', 'u', 't', 'w', 'v', 'y', 'x', 'z']) --> 1
    localD: {'y': 3, 'x': 4, 's': 3, 'z': 5, 't': 3}
    orderedItems: ['z', 'x', 'y', 's', 't']
    children: ['x', 'z']
    here...0
    len(items): 5
    children: ['x']
    here...0
    len(items): 4
    children: ['y']
    here...0
    len(items): 3
    children: ['s']
    here...0
    len(items): 2
    children: ['t']
    here...0
    tranSet and count: frozenset(['q', 'p', 'r', 't', 'y', 'x', 'z']) --> 1
    localD: {'y': 3, 'x': 4, 'r': 3, 't': 3, 'z': 5}
    orderedItems: ['z', 'x', 'y', 'r', 't']
    children: ['x', 'z']
    here...0
    len(items): 5
    children: ['x']
    here...0
    len(items): 4
    children: ['y']
    here...0
    len(items): 3
    children: ['s']
    here....1
      y   3
       s   2
        t   2
       r   1
    here....3
    updateHeader: r r
    nodeToTest.nodeLink--> r
    len(items): 2
    children: []
    here....1
      r   1
       t   1
    here....3
    updateHeader: t t
    nodeToTest.nodeLink--> t
    tranSet and count: frozenset(['h', 'r', 'z', 'p', 'j']) --> 1
    localD: {'r': 3, 'z': 5}
    orderedItems: ['z', 'r']
    children: ['x', 'z']
    here...0
    len(items): 2
    children: ['x']
    here....1
      z   4
       x   3
        y   3
         s   2
          t   2
         r   1
          t   1
       r   1
    here....3
    updateHeader: r r
    gaga...
    nodeToTest.nodeLink--> r
    tranSet and count: frozenset(['z']) --> 1
    localD: {'z': 5}
    orderedItems: ['z']
    children: ['x', 'z']
    here...0
    complete tree:   Null Set   1
       x   1
        s   1
         r   1
       z   5
        x   3
         y   3
          s   2
           t   2
          r   1
           t   1
        r   1
    None
    myHeaderTab: {'s': [3, <__main__.treeNode instance at 0x000000000B905188>], 'r': [3, <__main__.treeNode instance at 0x000000000B9B0E08>], 't': [3, <__main__.treeNode instance at 0x000000000B905F08>], 'y': [3, <__main__.treeNode instance at 0x000000000B9051C8>], 'x': [4, <__main__.treeNode instance at 0x000000000B9E5688>], 'z': [5, <__main__.treeNode instance at 0x000000000B9E59C8>]}

That is the whole FP-tree construction. Every step has been printed, so matching the trace step by step against the figure of the tree with its header table makes the details clear. For a full explanation see the book Machine Learning in Action.
Figure: the first two steps of building the FP-tree.
One thing worth stressing here: data structures matter!
See also: how frozenset() is used in Python.
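2. Mining frequent itemsets from the FP-tree
Once the FP-tree exists, frequent itemsets can be extracted from it. The idea is broadly similar to Apriori: start with single-item sets and build larger sets on top of them.
The three basic steps for extracting frequent itemsets from an FP-tree are:
- Get the conditional pattern bases from the FP-tree;
- Use each conditional pattern base to build a conditional FP-tree;
- Repeat steps 1 and 2 until the tree contains a single item.
The key part is finding the conditional pattern bases; a conditional FP-tree is then created for each of them.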
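2.1 Extracting conditional pattern bases
Start from each frequent item in the header table and get its conditional pattern base. A conditional pattern base is the collection of paths that end with the item being looked up. Each of these paths is a prefix path. In short, a prefix path is everything between the item being looked up and the root of the tree.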
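From the FP-tree in Figure 1.1, the prefix paths (conditional pattern bases) of each frequent item, as also shown by the tree and the trace printed in this article, are:
- z: {}: 5
- r: {x, s}: 1, {z, x, y}: 1, {z}: 1
- x: {z}: 3, {}: 1
- y: {z, x}: 3
- s: {z, x, y}: 2, {x}: 1
- t: {z, x, y, s}: 2, {z, x, y, r}: 1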
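The prefix paths are used in the next step to build conditional FP-trees, so set that aside for now. How do we find the paths that contain a given frequent item? The header table and the node links between similar items in the FP-tree already give us a singly linked list for every item, so the paths can be read off directly.
In the code, the conditional pattern base (the prefix paths) of a given item is produced by visiting every node in the tree that holds that item and ascending from it to the root.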
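The same conditional pattern bases can be cross-checked without a tree, because a node's prefix path is just the items that precede it in each ordered transaction. The Python 3 sketch below is a shortcut for checking results, not the node-link traversal described above. The function name `conditional_pattern_base` is made up for the example, and the transactions are given already filtered and ordered, using the order z, x, y, s, r, t seen in this article's trace.

```python
from collections import Counter

# Filtered, ordered transactions (minSup = 3), as in this article's trace.
ordered_transactions = [
    ['z', 'x', 'y', 's', 't'],
    ['x', 's', 'r'],
    ['z', 'x', 'y', 's', 't'],
    ['z', 'x', 'y', 'r', 't'],
    ['z', 'r'],
    ['z'],
]


def conditional_pattern_base(item, ordered_transactions):
    """Collect the non-empty prefixes that precede item, with counts."""
    base = Counter()
    for trans in ordered_transactions:
        if item in trans:
            prefix = frozenset(trans[:trans.index(item)])
            if prefix:  # an empty prefix contributes nothing
                base[prefix] += 1
    return dict(base)


r_base = conditional_pattern_base('r', ordered_transactions)
t_base = conditional_pattern_base('t', ordered_transactions)
x_base = conditional_pattern_base('x', ordered_transactions)
z_base = conditional_pattern_base('z', ordered_transactions)
print(r_base)
print(t_base)
```

The tree-based version gives the same counts while touching only the nodes on each item's linked list, which is where the speed-up over rescanning the transactions comes from.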
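2.2 Building conditional FP-trees
A conditional FP-tree is built for every frequent item. The conditional pattern bases found in the previous step serve as the input data, and the trees are built with the same tree-building code. For example, for r the input is "{x, s}: 1, {z, x, y}: 1, {z}: 1", and calling createTree() on it gives the conditional FP-tree for r; for t the input is its conditional pattern base "{z, x, y, s}: 2, {z, x, y, r}: 1". The process then recurses: find frequent itemsets, find conditional pattern bases, and build further conditional trees.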
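The first thing createTree() does with a conditional pattern base is count the items in it, weighted by each path's count, and drop those below the minimum support. A small Python 3 sketch of just that step (the helper name `frequent_in_base` is hypothetical) shows which items survive into the conditional FP-trees of t and r when the minimum support is 3:

```python
def frequent_in_base(cond_base, min_sup):
    """Weighted item counts in a conditional pattern base, filtered by min_sup."""
    counts = {}
    for path, count in cond_base.items():
        for item in path:
            counts[item] = counts.get(item, 0) + count
    return {item: c for item, c in counts.items() if c >= min_sup}


# Conditional pattern bases taken from the text above.
t_base = {frozenset(['z', 'x', 'y', 's']): 2, frozenset(['z', 'x', 'y', 'r']): 1}
r_base = {frozenset(['x', 's']): 1, frozenset(['z', 'x', 'y']): 1, frozenset(['z']): 1}

t_items = frequent_in_base(t_base, 3)
r_items = frequent_in_base(r_base, 3)
print(t_items)
print(r_items)
```

For t, only z, x and y reach a count of 3, so s and r are left out of its conditional tree. For r nothing survives, so there is no conditional tree and the recursion stops at {r}.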
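2.3 Finding frequent itemsets recursively
With the FP-tree and the conditional FP-trees in hand, frequent itemsets can be found recursively on top of the previous two steps.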
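The recursion itself can be sketched without the tree structure by working directly on pattern bases, stored as ordered tuples with counts. This Python 3 sketch is a simplification for illustration only: it is not the mineTree() listing in this article, the name `mine` is made up, and it rescans the paths at every level instead of following node links. It assumes every path lists its items in the same global order (z, x, y, s, r, t here).

```python
def mine(pattern_base, min_sup, suffix=frozenset(), found=None):
    """Recursively grow frequent itemsets from a pattern base.

    pattern_base maps a tuple of items (all in one global order) to a count.
    Returns a dict mapping each frequent itemset to its support.
    """
    if found is None:
        found = {}
    # Count the items in this pattern base, weighted by path count.
    counts = {}
    for path, count in pattern_base.items():
        for item in path:
            counts[item] = counts.get(item, 0) + count
    for item, support in counts.items():
        if support < min_sup:
            continue
        itemset = suffix | {item}
        found[itemset] = support
        # Conditional pattern base of item: whatever precedes it on each path.
        cond_base = {}
        for path, count in pattern_base.items():
            if item in path:
                prefix = path[:path.index(item)]
                if prefix:
                    cond_base[prefix] = cond_base.get(prefix, 0) + count
        mine(cond_base, min_sup, itemset, found)
    return found


# The six sample transactions, filtered and ordered, with multiplicities.
base = {
    ('z', 'x', 'y', 's', 't'): 2,
    ('x', 's', 'r'): 1,
    ('z', 'x', 'y', 'r', 't'): 1,
    ('z', 'r'): 1,
    ('z',): 1,
}
result = mine(base, 3)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Because each conditional base only contains items that come earlier in the global order, every itemset is generated exactly once, by adding its items from last to first.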
Full code:
    # -*- coding: utf-8 -*-
    # Python 2 code (print statements)

    # Return a list of transactions
    def loadSimpDat():
        simpDat = [['r', 'z', 'h', 'j', 'p'],
                   ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
                   ['z'],
                   ['r', 'x', 'n', 'o', 's'],
                   ['y', 'r', 'x', 'z', 'q', 't', 'p'],
                   ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
        return simpDat

    # Convert the list of transactions into a dict
    def createInitSet(dataSet):
        # Each transaction becomes a frozenset key whose value is 1
        retDict = {}
        for trans in dataSet:
            retDict[frozenset(trans)] = 1
        return retDict

    # FP-tree node class
    class treeNode:
        def __init__(self, nameValue, numOccur, parentNode):
            self.name = nameValue
            self.count = numOccur
            self.nodeLink = None      # link to the next node holding the same item
            self.parent = parentNode  # parent of this node
            self.children = {}

        def inc(self, numOccur):
            self.count += numOccur

        def disp(self, ind=1):  # print the tree as text
            print ' ' * ind, self.name, ' ', self.count  # ind spaces of indentation show the depth
            for child in self.children.values():  # children are treeNode objects too
                child.disp(ind + 1)  # recurse

    # Build the FP-tree
    def createTree(dataSet, minSup=1):  # minSup: minimum support
        headerTable = {}
        for trans in dataSet:  # first pass: count how often each item occurs
            for item in trans:
                headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
        for k in headerTable.keys():  # drop items seen fewer than minSup times (keys() is a list in Python 2)
            if headerTable[k] < minSup:
                del(headerTable[k])
        freqItemSet = set(headerTable.keys())  # the frequent items, i.e. the dict keys
        if len(freqItemSet) == 0:
            return None, None  # nothing meets the minimum support
        for k in headerTable:  # walk the filtered header table
            headerTable[k] = [headerTable[k], None]  # value becomes [count, pointer to first node]
        retTree = treeNode('Null Set', 1, None)  # root node
        for tranSet, count in dataSet.items():  # second pass: tranSet is the itemset, count its multiplicity
            localD = {}
            for item in tranSet:
                if item in freqItemSet:  # keep frequent items only
                    localD[item] = headerTable[item][0]
            if len(localD) > 0:
                orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)]
                updateTree(orderedItems, retTree, headerTable, count)
        return retTree, headerTable

    # Grow the tree with one ordered transaction
    def updateTree(items, inTree, headerTable, count):  # inTree: a tree node; items: filtered transaction
        if items[0] in inTree.children:  # is the first item already a child?
            inTree.children[items[0]].inc(count)  # yes: increment its count
        else:  # no: add it as a new child
            inTree.children[items[0]] = treeNode(items[0], count, inTree)  # inTree is the parent
            inTree.disp()  # debug output: the unlabeled tree dumps in the run below
            if headerTable[items[0]][1] is None:  # header table has no node for this item yet
                headerTable[items[0]][1] = inTree.children[items[0]]
            else:  # otherwise append the new node to the item's node-link list
                updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
        if len(items) > 1:  # recurse on the remaining items below the child
            updateTree(items[1::], inTree.children[items[0]], headerTable, count)

    # Append targetNode to the end of a node-link list
    def updateHeader(nodeToTest, targetNode):  # both arguments are tree nodes
        while nodeToTest.nodeLink is not None:
            nodeToTest = nodeToTest.nodeLink
        nodeToTest.nodeLink = targetNode

    # Collect the path from a node up to the root (ascend the FP-tree)
    def ascendTree(leafNode, prefixPath):
        if leafNode.parent is not None:  # only the root has no parent
            prefixPath.append(leafNode.name)
            ascendTree(leafNode.parent, prefixPath)

    # Build a conditional pattern base: walk the node-link list to its end,
    # calling ascendTree() on every node met along the way
    def findPrefixPath(basePat, treeNode):  # basePat: the item; treeNode: its first node from the header table
        condPats = {}  # conditional pattern base
        while treeNode is not None:
            prefixPath = []  # path collected while ascending
            ascendTree(treeNode, prefixPath)
            print 'prefixPath:', prefixPath
            if len(prefixPath) > 1:
                condPats[frozenset(prefixPath[1:])] = treeNode.count
                print 'condPats:', condPats
            treeNode = treeNode.nodeLink
        return condPats

    # Recursively find frequent itemsets
    # Called as mineTree(myFPtree, myHeaderTab, 3, set([]), freqItems)
    def mineTree(inTree, headerTable, minSup, preFix, freqItemList):
        # sort the header table in ascending order
        bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1])]
        print 'bigL:', bigL
        for basePat in bigL:  # start from the least frequent item
            newFreqSet = preFix.copy()
            newFreqSet.add(basePat)  # sets use add()
            print 'finalFrequent Item: ', newFreqSet  # a frequent itemset
            freqItemList.append(newFreqSet)  # lists use append()
            # second argument: the node the header table points to for basePat
            condPattBases = findPrefixPath(basePat, headerTable[basePat][1])
            print 'condPattBases :', basePat, '-->', condPattBases
            # build a conditional FP-tree from the conditional pattern base
            myCondTree, myHead = createTree(condPattBases, minSup)
            print 'head from conditional tree: ', myHead
            if myHead is not None:
                print 'conditional tree for: ', newFreqSet
                myCondTree.disp(1)
                mineTree(myCondTree, myHead, minSup, newFreqSet, freqItemList)

    # Main
    # Build the FP-tree
    simDat = loadSimpDat()
    initSet = createInitSet(simDat)
    print 'initSet:', initSet
    myFPtree, myHeaderTab = createTree(initSet, 3)
    myFPtree.disp()
    print 'myHeaderTab:', myHeaderTab
    freqItems = []
    mineTree(myFPtree, myHeaderTab, 3, set([]), freqItems)
    print 'freqItems:', freqItems

Output:
    initSet: {frozenset(['e', 'm', 'q', 's', 't', 'y', 'x', 'z']): 1, frozenset(['x', 's', 'r', 'o', 'n']): 1, frozenset(['s', 'u', 't', 'w', 'v', 'y', 'x', 'z']): 1, frozenset(['q', 'p', 'r', 't', 'y', 'x', 'z']): 1, frozenset(['h', 'r', 'z', 'p', 'j']): 1, frozenset(['z']): 1}
      Null Set   1
       z   1
      z   1
       x   1
      x   1
       y   1
      y   1
       s   1
      s   1
       t   1
      Null Set   1
       x   1
       z   1
        x   1
         y   1
          s   1
           t   1
      x   1
       s   1
      s   1
       r   1
      y   3
       s   2
        t   2
       r   1
      r   1
       t   1
      z   4
       x   3
        y   3
         s   2
          t   2
         r   1
          t   1
       r   1
      Null Set   1
       x   1
        s   1
         r   1
       z   5
        x   3
         y   3
          s   2
           t   2
          r   1
           t   1
        r   1
    myHeaderTab: {'s': [3, <__main__.treeNode instance at 0x000000000B8E9808>], 'r': [3, <__main__.treeNode instance at 0x000000000B8E9608>], 't': [3, <__main__.treeNode instance at 0x000000000B8E9788>], 'y': [3, <__main__.treeNode instance at 0x000000000B8E9848>], 'x': [4, <__main__.treeNode instance at 0x000000000B8E97C8>], 'z': [5, <__main__.treeNode instance at 0x000000000B8E9508>]}
    bigL: ['r', 't', 's', 'y', 'x', 'z']
    finalFrequent Item: set(['r'])
    prefixPath: ['r', 's', 'x']
    condPats: {frozenset(['x', 's']): 1}
    prefixPath: ['r', 'y', 'x', 'z']
    condPats: {frozenset(['x', 's']): 1, frozenset(['y', 'x', 'z']): 1}
    prefixPath: ['r', 'z']
    condPats: {frozenset(['x', 's']): 1, frozenset(['z']): 1, frozenset(['y', 'x', 'z']): 1}
    condPattBases : r --> {frozenset(['x', 's']): 1, frozenset(['z']): 1, frozenset(['y', 'x', 'z']): 1}
    head from conditional tree: None
    finalFrequent Item: set(['t'])
    prefixPath: ['t', 's', 'y', 'x', 'z']
    condPats: {frozenset(['y', 'x', 's', 'z']): 2}
    prefixPath: ['t', 'r', 'y', 'x', 'z']
    condPats: {frozenset(['y', 'x', 's', 'z']): 2, frozenset(['y', 'x', 'r', 'z']): 1}
    condPattBases : t --> {frozenset(['y', 'x', 's', 'z']): 2, frozenset(['y', 'x', 'r', 'z']): 1}
      Null Set   1
       y   2
      y   2
       x   2
      x   2
       z   2
    head from conditional tree: {'y': [3, <__main__.treeNode instance at 0x000000000B8E9888>], 'x': [3, <__main__.treeNode instance at 0x000000000B8E9448>], 'z': [3, <__main__.treeNode instance at 0x000000000B8E9408>]}
    conditional tree for: set(['t'])
      Null Set   1
       y   3
        x   3
         z   3
    bigL: ['z', 'x', 'y']
    finalFrequent Item: set(['z', 't'])
    prefixPath: ['z', 'x', 'y']
    condPats: {frozenset(['y', 'x']): 3}
    condPattBases : z --> {frozenset(['y', 'x']): 3}
      Null Set   1
       y   3
      y   3
       x   3
    head from conditional tree: {'y': [3, <__main__.treeNode instance at 0x000000000B8E9488>], 'x': [3, <__main__.treeNode instance at 0x000000000B8E93C8>]}
    conditional tree for: set(['z', 't'])
      Null Set   1
       y   3
        x   3
    bigL: ['x', 'y']
    finalFrequent Item: set(['x', 'z', 't'])
    prefixPath: ['x', 'y']
    condPats: {frozenset(['y']): 3}
    condPattBases : x --> {frozenset(['y']): 3}
      Null Set   1
       y   3
    head from conditional tree: {'y': [3, <__main__.treeNode instance at 0x000000000B8E92C8>]}
    conditional tree for: set(['x', 'z', 't'])
      Null Set   1
       y   3
    bigL: ['y']
    finalFrequent Item: set(['y', 'x', 'z', 't'])
    prefixPath: ['y']
    condPattBases : y --> {}
    head from conditional tree: None
    finalFrequent Item: set(['y', 'z', 't'])
    prefixPath: ['y']
    condPattBases : y --> {}
    head from conditional tree: None
    finalFrequent Item: set(['x', 't'])
    prefixPath: ['x', 'y']
    condPats: {frozenset(['y']): 3}
    condPattBases : x --> {frozenset(['y']): 3}
      Null Set   1
       y   3
    head from conditional tree: {'y': [3, <__main__.treeNode instance at 0x000000000B8E9188>]}
    conditional tree for: set(['x', 't'])
      Null Set   1
       y   3
    bigL: ['y']
    finalFrequent Item: set(['y', 'x', 't'])
    prefixPath: ['y']
    condPattBases : y --> {}
    head from conditional tree: None
    finalFrequent Item: set(['y', 't'])
    prefixPath: ['y']
    condPattBases : y --> {}
    head from conditional tree: None
    finalFrequent Item: set(['s'])
    prefixPath: ['s', 'y', 'x', 'z']
    condPats: {frozenset(['y', 'x', 'z']): 2}
    prefixPath: ['s', 'x']
    condPats: {frozenset(['y', 'x', 'z']): 2, frozenset(['x']): 1}
    condPattBases : s --> {frozenset(['y', 'x', 'z']): 2, frozenset(['x']): 1}
      Null Set   1
       x   2
    head from conditional tree: {'x': [3, <__main__.treeNode instance at 0x000000000B69B788>]}
    conditional tree for: set(['s'])
      Null Set   1
       x   3
    bigL: ['x']
    finalFrequent Item: set(['x', 's'])
    prefixPath: ['x']
    condPattBases : x --> {}
    head from conditional tree: None
    finalFrequent Item: set(['y'])
    prefixPath: ['y', 'x', 'z']
    condPats: {frozenset(['x', 'z']): 3}
    condPattBases : y --> {frozenset(['x', 'z']): 3}
      Null Set   1
       x   3
      x   3
       z   3
    head from conditional tree: {'x': [3, <__main__.treeNode instance at 0x000000000B6A26C8>], 'z': [3, <__main__.treeNode instance at 0x000000000B84B248>]}
    conditional tree for: set(['y'])
      Null Set   1
       x   3
        z   3
    bigL: ['x', 'z']
    finalFrequent Item: set(['y', 'x'])
    prefixPath: ['x']
    condPattBases : x --> {}
    head from conditional tree: None
    finalFrequent Item: set(['y', 'z'])
    prefixPath: ['z', 'x']
    condPats: {frozenset(['x']): 3}
    condPattBases : z --> {frozenset(['x']): 3}
      Null Set   1
       x   3
    head from conditional tree: {'x': [3, <__main__.treeNode instance at 0x000000000B6A10C8>]}
    conditional tree for: set(['y', 'z'])
      Null Set   1
       x   3
    bigL: ['x']
    finalFrequent Item: set(['y', 'x', 'z'])
    prefixPath: ['x']
    condPattBases : x --> {}
    head from conditional tree: None
    finalFrequent Item: set(['x'])
    prefixPath: ['x', 'z']
    condPats: {frozenset(['z']): 3}
    prefixPath: ['x']
    condPattBases : x --> {frozenset(['z']): 3}
      Null Set   1
       z   3
    head from conditional tree: {'z': [3, <__main__.treeNode instance at 0x000000000B823908>]}
    conditional tree for: set(['x'])
      Null Set   1
       z   3
    bigL: ['z']
    finalFrequent Item: set(['x', 'z'])
    prefixPath: ['z']
    condPattBases : z --> {}
    head from conditional tree: None
    finalFrequent Item: set(['z'])
    prefixPath: ['z']
    condPattBases : z --> {}
    head from conditional tree: None
    freqItems: [set(['r']), set(['t']), set(['z', 't']), set(['x', 'z', 't']), set(['y', 'x', 'z', 't']), set(['y', 'z', 't']), set(['x', 't']), set(['y', 'x', 't']), set(['y', 't']), set(['s']), set(['x', 's']), set(['y']), set(['y', 'x']), set(['y', 'z']), set(['y', 'x', 'z']), set(['x']), set(['x', 'z']), set(['z'])]

That is the detailed process.
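Supplement: the mining involves a lot of recursion, so the full trace is tedious to follow. Here is one example.
This is what happens on the line for basePat in bigL: when basePat is 't'. First {t} is recorded as a frequent itemset. Its conditional pattern base is {y, x, s, z}: 2 and {y, x, r, z}: 1, so in the conditional FP-tree only y, x and z reach the minimum support of 3, while s and r are dropped. mineTree() is then called recursively on that tree with the prefix {t}, which yields {z, t}, {x, z, t}, {y, x, z, t}, {y, z, t}, {x, t}, {y, x, t} and {y, t}.
Checking these steps against the output above makes the process easier to follow. There is nothing more to it than data structures.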
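3. Mining news stories from a news site clickstream
The two chapters of the book have quite a few good examples. Only one representative example is used here: mining popular news stories from the clickstream of a news site. This is a large dataset with close to one million records (see further reading: kosarak). The source data is stored in the file kosarak.dat. Each line of the file holds the news stories one user viewed. The stories are encoded as integers, and either Apriori or FP-growth can be used to mine the frequent itemsets and see which news IDs were viewed by a large number of users.
Change the main part of the code from section 2 to the following: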
    # Load the dataset into a list of transactions
    parsedDat = [line.split() for line in open('kosarak.dat').readlines()]
    initSet = createInitSet(parsedDat)  # format the initial set
    # Build the FP-tree and look for stories viewed by at least 100,000 people
    myFPtree, myHeaderTab = createTree(initSet, 100000)
    myFreqList = []  # empty list that will hold the frequent itemsets
    mineTree(myFPtree, myHeaderTab, 100000, set([]), myFreqList)
    print 'length:', len(myFreqList)  # how many stories or sets of stories had 100,000 or more viewers
    print 'myFreqList', myFreqList  # the itemsets themselves

Output:

    ...
    condPattBases : 6 --> {}
    head from conditional tree: None
    finalFrequent Item: set(['6'])
    prefixPath: ['6']
    condPattBases : 6 --> {}
    head from conditional tree: None
    length: 9
    myFreqList [set(['1']), set(['1', '6']), set(['3']), set(['11', '3']), set(['11', '3', '6']), set(['3', '6']), set(['11']), set(['11', '6']), set(['6'])]

Other settings can be tried as well to see how the result changes, for example a lower minimum support threshold.
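One detail to watch on a large file: createInitSet() sets the value of every transaction to 1, so two users with exactly the same set of stories collapse into a single entry with count 1. Whether that matters depends on how many repeated lines the file contains. A hedged Python 3 variant that accumulates duplicates instead (the name `create_init_set` is made up, and this is a suggested alternative rather than the code that produced the output above) could look like this:

```python
def create_init_set(transactions):
    """Map each distinct transaction (as a frozenset) to how often it occurs."""
    init_set = {}
    for trans in transactions:
        key = frozenset(trans)
        init_set[key] = init_set.get(key, 0) + 1  # accumulate duplicates
    return init_set


# Tiny made-up sample in the same format as kosarak.dat lines.
sample_lines = ["1 6", "6 1", "3"]
parsed = [line.split() for line in sample_lines]
init_set = create_init_set(parsed)
print(init_set)
```

Since createTree() already adds dataSet[trans] to each item's count, it can take such a dictionary without changes; the itemsets found on kosarak.dat may then differ from the nine listed above.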
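Summary:
- FP-growth is an effective way to find frequent patterns in a dataset. It relies on the Apriori principle but runs faster than the Apriori algorithm.
- There is also a map-reduce implementation of FP-growth, which works well and scales out to multiple machines. Google has reportedly used the algorithm to find frequently co-occurring words by going through large amounts of text, in a way very similar to the example just described.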
4. Notes
(1) The Python dictionary get() method:
The get() method of a Python dictionary returns the value for the given key, or a default value if the key is not in the dictionary.
Syntax: dict.get(key, default=None)
key: the key to look up in the dictionary
default: the value returned when the key does not exist.
Example:

    >>> d = {'Name': 'Zara', 'Age': 27}
    >>> d.get('Age')
    27
    >>> d.get('Sex', 0)
    0
    >>>

(2) How initSet=createInitSet(simDat) is used:

    In [13]: m=['e', 'm', 'q', 's', 't', 'y', 'x', 'z']

    In [14]: mm=frozenset(m)

    In [15]: initSet
    Out[15]:
    {frozenset({'e', 'm', 'q', 's', 't', 'x', 'y', 'z'}): 1,
     frozenset({'n', 'o', 'r', 's', 'x'}): 1,
     frozenset({'z'}): 1,
     frozenset({'s', 't', 'u', 'v', 'w', 'x', 'y', 'z'}): 1,
     frozenset({'p', 'q', 'r', 't', 'x', 'y', 'z'}): 1,
     frozenset({'h', 'j', 'p', 'r', 'z'}): 1}

    In [16]: initSet[mm]
    Out[16]: 1

(3) How orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)] is used:

    In [31]: localD={'y': 3, 'x': 4, 's': 3, 'z': 5, 't': 3}

    In [32]: [v[0] for v in sorted(localD.items(), \
        ...:     key=lambda p: p[1], reverse=True)]
    Out[32]: ['z', 'x', 'y', 's', 't']

    # rr: the items of localD sorted by value (p[1]) in descending order
    In [33]: rr=sorted(localD.items(), \
        ...:     key=lambda p: p[1], reverse=True)

    In [34]: rr
    Out[34]: [('z', 5), ('x', 4), ('y', 3), ('s', 3), ('t', 3)]

    # rr: the items of localD sorted by key (p[0]) in descending order
    In [35]: rr=sorted(localD.items(), \
        ...:     key=lambda p: p[0], reverse=True)

    In [36]: rr
    Out[36]: [('z', 5), ('y', 3), ('x', 4), ('t', 3), ('s', 3)]

    # the first element of every tuple in rr
    In [37]: [v[0] for v in rr]
    Out[37]: ['z', 'y', 'x', 't', 's']

    In [38]: [v[1] for v in rr]
    Out[38]: [5, 3, 4, 3, 3]

    In [39]: rr[0]
    Out[39]: ('z', 5)

    In [40]: type(rr[0])
    Out[40]: tuple

(4) How updateTree(items[1::], inTree.children[items[0]], headerTable, count) is used:

    >>> items=['z', 'x', 'y', 's', 't']
    >>> items[1::]
    ['x', 'y', 's', 't']
    >>> items[1:]
    ['x', 'y', 's', 't']
    >>> items[2::]
    ['y', 's', 't']
    >>>

Reference: https://www.cnblogs.com/qwertWZ/p/4510857.html