Efficiently Finding Frequent Itemsets with the FP-growth Algorithm

Published: 2024/9/20 32 豆豆

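When using a search engine, we notice that after typing part of a word the engine automatically completes the query. The idea is that it examines words used across the internet to find pairs of words that frequently occur together, and that requires an efficient method for finding frequent itemsets.

FP-growth builds on Apriori but uses different techniques to accomplish the same task. It stores the dataset in a special structure called an FP tree and then mines that tree for frequent itemsets or frequent pairs, that is, sets of items that often occur together. This makes the algorithm faster than Apriori, typically by two orders of magnitude or more.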

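Notes:

  • Although this algorithm finds frequent itemsets more efficiently, it cannot be used to discover association rules.

  • FP-growth only needs to scan the database twice, whereas Apriori scans the dataset for every potential frequent itemset to decide whether the given pattern is frequent. This is why FP-growth is faster than Apriori.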
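To make the cost of rescanning concrete, here is a minimal brute-force baseline written for Python 3. It is neither Apriori nor FP-growth: it simply rescans the data for every candidate itemset, using the six sample transactions that appear later in this article. The function name `brute_force_frequent` is hypothetical and only serves this illustration.

```python
from itertools import combinations

# The six sample transactions used later in this article.
transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['z'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]


def brute_force_frequent(transactions, min_sup):
    """Return {itemset: support}, rescanning the data for every candidate."""
    tx_sets = [set(t) for t in transactions]
    items = sorted(set().union(*tx_sets))
    frequent = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            # One full pass over the dataset per candidate: this is the cost
            # that FP-growth avoids.
            support = sum(1 for t in tx_sets if t.issuperset(candidate))
            if support >= min_sup:
                frequent[frozenset(candidate)] = support
                found_any = True
        if not found_any:
            # No frequent itemset of this size means none of a larger size.
            break
    return frequent


result = brute_force_frequent(transactions, 3)
print(len(result), 'frequent itemsets')
```

With a minimum support of 3 this finds the same 18 itemsets as the `freqItems` list printed by the complete code later on, which makes it a handy cross-check on small data.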

FP-growth finds frequent itemsets in two steps:

  • Build the FP tree
  • Mine frequent itemsets from the FP tree

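1. Building the FP tree

1.1 A closer look at the FP tree, which is important!

  • FP-growth stores data in a compact data structure called an FP tree. FP stands for Frequent Pattern. An FP tree looks like other trees in computer science, but it uses links to connect similar items; the linked items can be viewed as a linked list.
  • Unlike a search tree, an item can appear multiple times in an FP tree. The FP tree stores how often itemsets occur, and each itemset is stored as a path in the tree. Sets with similar items share part of the tree, and the tree branches only where the sets differ. A tree node holds a single item of the set together with the number of times it occurs in that sequence, and the path gives the number of occurrences of the sequence.
  • The links between similar items are called node links; they are used to quickly find the locations of similar items.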
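The prefix sharing described above can be sketched in a few lines of Python 3. This is an illustrative sketch only: the `Node` class and the `insert` and `count_nodes` helpers are hypothetical names rather than the article's own code, and node links are left out to keep the focus on shared paths.

```python
class Node:
    """One FP-tree node: an item name, a count, a parent and children."""

    def __init__(self, name, count, parent):
        self.name = name
        self.count = count
        self.parent = parent
        self.children = {}


def insert(root, items, count=1):
    """Insert one ordered transaction, reusing any existing prefix."""
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            # The tree branches only where the paths differ.
            child = Node(item, 0, node)
            node.children[item] = child
        child.count += count
        node = child


def count_nodes(node):
    """Count the nodes in the subtree rooted at node (including node)."""
    return 1 + sum(count_nodes(c) for c in node.children.values())


root = Node('Null Set', 1, None)
insert(root, ['z', 'x', 'y', 's', 't'])
insert(root, ['z', 'x', 'y', 'r', 't'])  # shares the prefix z -> x -> y
insert(root, ['x', 's', 'r'])            # starts a new branch at the root

print(root.children['z'].count)  # 2: two transactions pass through z
print(count_nodes(root))         # 11 nodes, root included
```

Note that `x`, `s`, `r` and `t` each appear in more than one place in the tree, which is exactly why node links are needed to find all occurrences of an item.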

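1.2 How the FP-growth algorithm works

First build the FP tree, then use it to mine frequent itemsets. Building the tree requires two scans of the original dataset. The first scan counts how often each item occurs; the second scan considers only the frequent items.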
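The two scans can be sketched as follows (Python 3). The function names are hypothetical, and breaking ties alphabetically is an assumption made here to get a deterministic order; the article's own code relies on dictionary order and sort stability instead, so its ordering of equally frequent items can differ.

```python
from collections import Counter

transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['z'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]


def frequent_counts(transactions, min_sup):
    """Scan 1: count every item, then keep only the frequent ones."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c >= min_sup}


def order_transaction(transaction, freq):
    """Scan 2 (per transaction): drop infrequent items and sort the rest by
    descending global count, ties broken alphabetically (an assumption)."""
    kept = [item for item in set(transaction) if item in freq]
    return sorted(kept, key=lambda item: (-freq[item], item))


freq = frequent_counts(transactions, 3)
ordered = [order_transaction(t, freq) for t in transactions]
print(freq)
print(ordered[0])  # ['z', 'r']
```

Sorting every transaction by the same global order is what lets transactions with common frequent items share a prefix in the tree.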

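FP-growth also needs a data structure called the header table. It is simple: it records the total count of each item, together with a pointer to the first node of that item in the FP tree. In this way the nodes of each item form a singly linked list.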
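A small Python 3 sketch of the header table and its node links follows. The header counts and the ordered transactions are copied from the run output shown further below (`headerTable_3` and the `orderedItems` lines); the names `Node`, `insert` and `linked_nodes` are hypothetical.

```python
class Node:
    def __init__(self, name, parent):
        self.name = name
        self.count = 0
        self.parent = parent
        self.children = {}
        self.node_link = None  # next node in the tree holding the same item


def insert(root, items, header, count=1):
    """Insert an ordered transaction and keep the node links up to date."""
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = Node(item, node)
            node.children[item] = child
            if header[item][1] is None:
                header[item][1] = child      # first node for this item
            else:
                last = header[item][1]
                while last.node_link is not None:
                    last = last.node_link
                last.node_link = child       # append to the linked list
        child.count += count
        node = child


def linked_nodes(header, item):
    """Walk the singly linked list of all nodes that hold item."""
    node = header[item][1]
    while node is not None:
        yield node
        node = node.node_link


# item -> [total count, pointer to the first node]
header = {'z': [5, None], 'x': [4, None], 'y': [3, None],
          's': [3, None], 't': [3, None], 'r': [3, None]}
ordered = [['z', 'x', 'y', 's', 't'], ['x', 's', 'r'],
           ['z', 'x', 'y', 's', 't'], ['z', 'x', 'y', 'r', 't'],
           ['z', 'r'], ['z']]

root = Node('Null Set', None)
for items in ordered:
    insert(root, items, header)

r_nodes = list(linked_nodes(header, 'r'))
print(len(r_nodes), sum(n.count for n in r_nodes))  # 3 nodes, total count 3
```

For every item, the counts along its linked list add up to the total stored in the header table, which is a useful sanity check when debugging a tree.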

1.3 Sample transaction data

Code:

    # -*- coding: utf-8 -*-
    # Note: Python 2 syntax (print statements)

    # Return a list of transactions
    def loadSimpDat():
        simpDat = [['r', 'z', 'h', 'j', 'p'],
                   ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
                   ['z'],
                   ['r', 'x', 'n', 'o', 's'],
                   ['y', 'r', 'x', 'z', 'q', 't', 'p'],
                   ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
        return simpDat

    # Convert the list of transactions into a dict
    def createInitSet(dataSet):  # each transaction becomes a frozenset key with value 1
        retDict = {}
        for trans in dataSet:
            retDict[frozenset(trans)] = 1
        return retDict

    # Node class for the FP tree
    class treeNode:
        def __init__(self, nameValue, numOccur, parentNode):
            self.name = nameValue
            self.count = numOccur
            self.nodeLink = None      # links to the next node holding the same item
            self.parent = parentNode  # parent of the current node
            self.children = {}

        def inc(self, numOccur):
            self.count += numOccur

        def disp(self, ind=1):  # print the tree as text
            print ' ' * ind, self.name, ' ', self.count  # ' ' * ind indents by depth
            for child in self.children.values():  # children are treeNode objects too
                child.disp(ind + 1)  # recursive call

    # Build the FP tree
    def createTree(dataSet, minSup=1):  # minSup: minimum support
        headerTable = {}
        # Two passes over the dataset
        for trans in dataSet:  # first pass: count how often each item occurs
            for item in trans:
                # accumulate each item's count over all transactions;
                # the result becomes the header table
                headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
        print 'headerTable_1:', headerTable
        for k in headerTable.keys():  # drop items that occur fewer than minSup times
            if headerTable[k] < minSup:
                del(headerTable[k])
        print 'headerTable_2:', headerTable
        freqItemSet = set(headerTable.keys())  # the frequent items, i.e. the dict keys
        print 'freqItemSet: ', freqItemSet
        if len(freqItemSet) == 0:
            return None, None  # no item meets the threshold, so stop
        for k in headerTable:  # walk the filtered header table
            headerTable[k] = [headerTable[k], None]  # value is [count, pointer to first node]
        print 'headerTable_3: ', headerTable
        retTree = treeNode('Null Set', 1, None)  # create the root node
        for tranSet, count in dataSet.items():  # second pass: (transaction, count) pairs
            localD = {}
            print 'tranSet and count:', tranSet, '-->', count
            for item in tranSet:
                if item in freqItemSet:  # keep only the frequent items
                    localD[item] = headerTable[item][0]
            print 'localD:', localD
            if len(localD) > 0:
                # sort the filtered transaction by descending global count
                orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)]
                print 'orderedItems:', orderedItems
                updateTree(orderedItems, retTree, headerTable, count)
        return retTree, headerTable

    # Update the tree
    def updateTree(items, inTree, headerTable, count):  # inTree: a tree node; items: filtered transaction
        print 'children:', inTree.children.keys()
        if items[0] in inTree.children:  # is the first item already a child?
            print 'here...0'
            inTree.children[items[0]].inc(count)  # yes: increase its count
        else:  # no: add it to the tree as a new child
            inTree.children[items[0]] = treeNode(items[0], count, inTree)  # inTree is the parent
            print 'here....1'
            inTree.disp()
            if headerTable[items[0]][1] == None:  # header table has no pointer for this item yet
                print 'here...2'
                headerTable[items[0]][1] = inTree.children[items[0]]  # point it at the new node
                #print 'headerTable_4:',headerTable
            else:  # otherwise append the new node to the item's linked list
                print 'here....3'
                updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
        if len(items) > 1:  # recurse on the remaining items, below the current child
            print 'len(items):', len(items)
            updateTree(items[1::], inTree.children[items[0]], headerTable, count)

    # Update the header table's linked list
    def updateHeader(nodeToTest, targetNode):  # both arguments are node objects
        print 'updateHeader:', nodeToTest.name, targetNode.name
        while (nodeToTest.nodeLink != None):
            print 'gaga...'
            nodeToTest = nodeToTest.nodeLink
        nodeToTest.nodeLink = targetNode
        print 'nodeToTest.nodeLink-->', nodeToTest.nodeLink.name

    def ascendTree(leafNode, prefixPath):  # ascends from leaf node to root
        if leafNode.parent != None:
            prefixPath.append(leafNode.name)
            ascendTree(leafNode.parent, prefixPath)

    # Main
    # Test
    rootNode = treeNode('pyramid', 9, None)
    rootNode.disp()
    rootNode.children['eye'] = treeNode('pyramid', 13, None)
    rootNode.disp()
    rootNode.children['phoenix'] = treeNode('phoenix', 3, None)
    rootNode.disp()

    # Build the FP tree
    simDat = loadSimpDat()
    initSet = createInitSet(simDat)
    print 'initSet:', initSet
    myFPtree, myHeaderTab = createTree(initSet, 3)
    print 'complete tree:', myFPtree.disp()
    print 'myHeaderTab:', myHeaderTab

Output:

      pyramid   9
      pyramid   9
       pyramid   13
      pyramid   9
       pyramid   13
       phoenix   3
    initSet: {frozenset(['e', 'm', 'q', 's', 't', 'y', 'x', 'z']): 1, frozenset(['x', 's', 'r', 'o', 'n']): 1, frozenset(['s', 'u', 't', 'w', 'v', 'y', 'x', 'z']): 1, frozenset(['q', 'p', 'r', 't', 'y', 'x', 'z']): 1, frozenset(['h', 'r', 'z', 'p', 'j']): 1, frozenset(['z']): 1}
    headerTable_1: {'e': 1, 'h': 1, 'j': 1, 'm': 1, 'o': 1, 'n': 1, 'q': 2, 'p': 2, 's': 3, 'r': 3, 'u': 1, 't': 3, 'w': 1, 'v': 1, 'y': 3, 'x': 4, 'z': 5}
    headerTable_2: {'s': 3, 'r': 3, 't': 3, 'y': 3, 'x': 4, 'z': 5}
    freqItemSet: set(['s', 'r', 't', 'y', 'x', 'z'])
    headerTable_3: {'s': [3, None], 'r': [3, None], 't': [3, None], 'y': [3, None], 'x': [4, None], 'z': [5, None]}
    tranSet and count: frozenset(['e', 'm', 'q', 's', 't', 'y', 'x', 'z']) --> 1
    localD: {'y': 3, 'x': 4, 's': 3, 'z': 5, 't': 3}
    orderedItems: ['z', 'x', 'y', 's', 't']
    children: []
    here....1
      Null Set   1
       z   1
    here...2
    len(items): 5
    children: []
    here....1
      z   1
       x   1
    here...2
    len(items): 4
    children: []
    here....1
      x   1
       y   1
    here...2
    len(items): 3
    children: []
    here....1
      y   1
       s   1
    here...2
    len(items): 2
    children: []
    here....1
      s   1
       t   1
    here...2
    tranSet and count: frozenset(['x', 's', 'r', 'o', 'n']) --> 1
    localD: {'x': 4, 's': 3, 'r': 3}
    orderedItems: ['x', 's', 'r']
    children: ['z']
    here....1
      Null Set   1
       x   1
       z   1
        x   1
         y   1
          s   1
           t   1
    here....3
    updateHeader: x x
    nodeToTest.nodeLink--> x
    len(items): 3
    children: []
    here....1
      x   1
       s   1
    here....3
    updateHeader: s s
    nodeToTest.nodeLink--> s
    len(items): 2
    children: []
    here....1
      s   1
       r   1
    here...2
    tranSet and count: frozenset(['s', 'u', 't', 'w', 'v', 'y', 'x', 'z']) --> 1
    localD: {'y': 3, 'x': 4, 's': 3, 'z': 5, 't': 3}
    orderedItems: ['z', 'x', 'y', 's', 't']
    children: ['x', 'z']
    here...0
    len(items): 5
    children: ['x']
    here...0
    len(items): 4
    children: ['y']
    here...0
    len(items): 3
    children: ['s']
    here...0
    len(items): 2
    children: ['t']
    here...0
    tranSet and count: frozenset(['q', 'p', 'r', 't', 'y', 'x', 'z']) --> 1
    localD: {'y': 3, 'x': 4, 'r': 3, 't': 3, 'z': 5}
    orderedItems: ['z', 'x', 'y', 'r', 't']
    children: ['x', 'z']
    here...0
    len(items): 5
    children: ['x']
    here...0
    len(items): 4
    children: ['y']
    here...0
    len(items): 3
    children: ['s']
    here....1
      y   3
       s   2
        t   2
       r   1
    here....3
    updateHeader: r r
    nodeToTest.nodeLink--> r
    len(items): 2
    children: []
    here....1
      r   1
       t   1
    here....3
    updateHeader: t t
    nodeToTest.nodeLink--> t
    tranSet and count: frozenset(['h', 'r', 'z', 'p', 'j']) --> 1
    localD: {'r': 3, 'z': 5}
    orderedItems: ['z', 'r']
    children: ['x', 'z']
    here...0
    len(items): 2
    children: ['x']
    here....1
      z   4
       x   3
        y   3
         s   2
          t   2
         r   1
          t   1
       r   1
    here....3
    updateHeader: r r
    gaga...
    nodeToTest.nodeLink--> r
    tranSet and count: frozenset(['z']) --> 1
    localD: {'z': 5}
    orderedItems: ['z']
    children: ['x', 'z']
    here...0
    complete tree:   Null Set   1
       x   1
        s   1
         r   1
       z   5
        x   3
         y   3
          s   2
           t   2
          r   1
           t   1
        r   1
    None
    myHeaderTab: {'s': [3, <__main__.treeNode instance at 0x000000000B905188>], 'r': [3, <__main__.treeNode instance at 0x000000000B9B0E08>], 't': [3, <__main__.treeNode instance at 0x000000000B905F08>], 'y': [3, <__main__.treeNode instance at 0x000000000B9051C8>], 'x': [4, <__main__.treeNode instance at 0x000000000B9E5688>], 'z': [5, <__main__.treeNode instance at 0x000000000B9E59C8>]}

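That is how the FP tree is built. The run above prints every step; following it alongside the FP tree figure with its header table makes the details clear. For a full explanation, see Machine Learning in Action.

[Figure: the first two steps of building the FP tree]

All I want to say here is: data structures matter! Data structures matter! Data structures matter!

Usage of frozenset() in Python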

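2. Mining frequent itemsets from the FP tree

Once the FP tree is built, frequent itemsets can be extracted. The approach is broadly similar to Apriori: start with single-item sets and build larger sets from them.

The three basic steps for extracting frequent itemsets from an FP tree are:

  • Get the conditional pattern bases from the FP tree;
  • Use each conditional pattern base to build a conditional FP tree;
  • Repeat steps 1 and 2 until the tree contains a single item.

The key part is finding the conditional pattern bases; after that, a conditional FP tree is created for each conditional pattern base.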

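2.1 Extracting conditional pattern bases

Start with each frequent item in the header table and, for each one, get its conditional pattern base. A conditional pattern base is the collection of paths that end with the item being looked up. Each of these paths is a prefix path. In short, a prefix path is everything between the item being looked up and the root of the tree.

From Figure 1.1, the prefix paths (conditional pattern bases) of each frequent item are as follows; the same values appear in the run output of the complete code below:

  • z: none
  • r: {x, s}: 1, {z, x, y}: 1, {z}: 1
  • x: {z}: 3
  • y: {z, x}: 3
  • s: {z, x, y}: 2, {x}: 1
  • t: {z, x, y, s}: 2, {z, x, y, r}: 1

The prefix paths are used in the next step to build conditional FP trees, so set that aside for now. How do we find the paths that contain a given frequent item? Using the header table created earlier and the node links between similar items in the FP tree, we already have a singly linked list for each item, so the paths can be obtained directly.
In the code, the conditional pattern base (prefix paths) for a given item is generated by visiting every node in the tree that contains that item.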
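The idea can be sketched in Python 3 as below. This is a simplified illustration, not the article's `findPrefixPath`: a plain list per item stands in for the node-link chain, the names `Node`, `build` and `prefix_paths` are hypothetical, and the ordered transactions are copied from the `orderedItems` lines of the first run output.

```python
class Node:
    def __init__(self, name, parent):
        self.name = name
        self.parent = parent
        self.count = 0
        self.children = {}


def build(ordered_transactions):
    """Build an FP tree; 'links' maps each item to all nodes holding it
    (a plain list standing in for the node-link chain)."""
    root = Node(None, None)
    links = {}
    for items in ordered_transactions:
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
                links.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, links


def prefix_paths(item, links):
    """Conditional pattern base: for each node holding item, climb to the
    root and record the items on the way, weighted by the node's count."""
    paths = {}
    for node in links.get(item, []):
        path = []
        parent = node.parent
        while parent is not None and parent.name is not None:
            path.append(parent.name)
            parent = parent.parent
        if path:  # a node directly under the root has no prefix path
            key = frozenset(path)
            paths[key] = paths.get(key, 0) + node.count
    return paths


ordered = [['z', 'x', 'y', 's', 't'], ['x', 's', 'r'],
           ['z', 'x', 'y', 's', 't'], ['z', 'x', 'y', 'r', 't'],
           ['z', 'r'], ['z']]
root, links = build(ordered)
r_base = prefix_paths('r', links)
t_base = prefix_paths('t', links)
print(r_base)
print(t_base)
```

Storing each path as a frozenset discards the order of its items, which is harmless here because the tree-building step re-sorts items by frequency anyway.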

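2.2 Building conditional FP trees

A conditional FP tree is built for each frequent item. The conditional pattern bases just found serve as the input data, and the same tree-building code constructs the trees. For example, for r, createTree() is called with "{x, s}: 1, {z, x, y}: 1, {z}: 1" as input to obtain the conditional FP tree of r; for t, the input is the corresponding conditional pattern base "{z, x, y, s}: 2, {z, x, y, r}: 1". The process then recurses: find frequent itemsets, find conditional pattern bases, and find further conditional trees.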
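Before a conditional tree is built, the items inside a conditional pattern base are counted again and filtered by the minimum support. A minimal Python 3 sketch of just that counting step, using the two pattern bases quoted above (the function name `conditional_item_counts` is hypothetical):

```python
from collections import Counter


def conditional_item_counts(pattern_base, min_sup):
    """Count items inside a conditional pattern base, weighting each path
    by its count, and keep only the items that reach min_sup."""
    counts = Counter()
    for path, count in pattern_base.items():
        for item in path:
            counts[item] += count
    return {item: c for item, c in counts.items() if c >= min_sup}


t_base = {frozenset(['z', 'x', 'y', 's']): 2, frozenset(['z', 'x', 'y', 'r']): 1}
r_base = {frozenset(['x', 's']): 1, frozenset(['z', 'x', 'y']): 1, frozenset(['z']): 1}

print(conditional_item_counts(t_base, 3))  # z, x and y survive with count 3
print(conditional_item_counts(r_base, 3))  # nothing survives
```

This matches the run output of the complete code: the conditional tree for t has the header items y, x and z, each with count 3, while for r the header is None because no item in its pattern base reaches the threshold.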


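2.3 Recursively finding frequent itemsets

With the FP tree and the conditional FP trees in place, frequent itemsets can be found recursively on top of the previous two steps.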
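The recursion itself can be shown without the tree. The Python 3 sketch below is a deliberate simplification and not the article's `mineTree`: it skips the FP tree and recurses directly on conditional pattern bases kept as lists of `(ordered items, count)` pairs. The name `pattern_growth` and the alphabetical tie-break are assumptions made for this illustration.

```python
from collections import Counter

transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['z'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]


def pattern_growth(base, min_sup, prefix=frozenset(), found=None):
    """base: list of (items, count) where every items tuple follows the same
    global order. Returns {frequent itemset: support}."""
    if found is None:
        found = {}
    counts = Counter()
    for items, count in base:
        for item in items:
            counts[item] += count
    for item, support in counts.items():
        if support < min_sup:
            continue
        itemset = prefix | {item}
        found[itemset] = support
        # Conditional pattern base of item: whatever precedes it in each path.
        cond = []
        for items, count in base:
            if item in items:
                head = items[:items.index(item)]
                if head:
                    cond.append((head, count))
        pattern_growth(cond, min_sup, itemset, found)
    return found


global_counts = Counter(i for t in transactions for i in set(t))
base = [(tuple(sorted(set(t), key=lambda i: (-global_counts[i], i))), 1)
        for t in transactions]
result = pattern_growth(base, 3)
print(len(result), 'frequent itemsets')
```

What the FP tree adds on top of this is compression: transactions that share a prefix are stored once with a count, so each recursion level works on fewer and shorter paths.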

Complete code:

    # -*- coding: utf-8 -*-
    # Note: Python 2 syntax (print statements)

    # Return a list of transactions
    def loadSimpDat():
        simpDat = [['r', 'z', 'h', 'j', 'p'],
                   ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
                   ['z'],
                   ['r', 'x', 'n', 'o', 's'],
                   ['y', 'r', 'x', 'z', 'q', 't', 'p'],
                   ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
        return simpDat

    # Convert the list of transactions into a dict
    def createInitSet(dataSet):  # each transaction becomes a frozenset key with value 1
        retDict = {}
        for trans in dataSet:
            retDict[frozenset(trans)] = 1
        return retDict

    # Node class for the FP tree
    class treeNode:
        def __init__(self, nameValue, numOccur, parentNode):
            self.name = nameValue
            self.count = numOccur
            self.nodeLink = None      # links to the next node holding the same item
            self.parent = parentNode  # parent of the current node
            self.children = {}

        def inc(self, numOccur):
            self.count += numOccur

        def disp(self, ind=1):  # print the tree as text
            print ' ' * ind, self.name, ' ', self.count  # ' ' * ind indents by depth
            for child in self.children.values():  # children are treeNode objects too
                child.disp(ind + 1)  # recursive call

    # Build the FP tree
    def createTree(dataSet, minSup=1):  # minSup: minimum support
        headerTable = {}
        for trans in dataSet:  # first pass: count how often each item occurs
            for item in trans:
                headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
        for k in headerTable.keys():  # drop items that occur fewer than minSup times
            if headerTable[k] < minSup:
                del(headerTable[k])
        freqItemSet = set(headerTable.keys())  # the frequent items, i.e. the dict keys
        if len(freqItemSet) == 0:
            return None, None  # no item meets the threshold, so stop
        for k in headerTable:  # walk the filtered header table
            headerTable[k] = [headerTable[k], None]  # value is [count, pointer to first node]
        retTree = treeNode('Null Set', 1, None)  # create the root node
        for tranSet, count in dataSet.items():  # second pass: (transaction, count) pairs
            localD = {}
            for item in tranSet:
                if item in freqItemSet:  # keep only the frequent items
                    localD[item] = headerTable[item][0]
            if len(localD) > 0:
                orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)]
                updateTree(orderedItems, retTree, headerTable, count)
        return retTree, headerTable

    # Update the tree
    def updateTree(items, inTree, headerTable, count):  # inTree: a tree node; items: filtered transaction
        if items[0] in inTree.children:  # is the first item already a child?
            inTree.children[items[0]].inc(count)  # yes: increase its count
        else:  # no: add it to the tree as a new child
            inTree.children[items[0]] = treeNode(items[0], count, inTree)  # inTree is the parent
            inTree.disp()
            if headerTable[items[0]][1] == None:  # header table has no pointer for this item yet
                headerTable[items[0]][1] = inTree.children[items[0]]  # point it at the new node
            else:  # otherwise append the new node to the item's linked list
                updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
        if len(items) > 1:  # recurse on the remaining items, below the current child
            updateTree(items[1::], inTree.children[items[0]], headerTable, count)

    # Update the header table's linked list
    def updateHeader(nodeToTest, targetNode):  # both arguments are node objects
        while (nodeToTest.nodeLink != None):
            nodeToTest = nodeToTest.nodeLink
        nodeToTest.nodeLink = targetNode

    # Collect the path from a given node up to the root (ascend the FP tree)
    def ascendTree(leafNode, prefixPath):
        if leafNode.parent != None:  # keep climbing; only the root has parent None
            prefixPath.append(leafNode.name)
            ascendTree(leafNode.parent, prefixPath)

    # Build the conditional pattern base: walk the linked list to its end,
    # calling ascendTree() for every node met along the way
    def findPrefixPath(basePat, treeNode):  # the given item and the first node holding it
        condPats = {}  # conditional pattern base
        while treeNode != None:
            prefixPath = []  # path collected while ascending
            ascendTree(treeNode, prefixPath)
            print 'prefixPath:', prefixPath
            if len(prefixPath) > 1:
                condPats[frozenset(prefixPath[1:])] = treeNode.count
                print 'condPats:', condPats
            treeNode = treeNode.nodeLink
        return condPats  # the conditional pattern base for basePat

    # Recursively find frequent itemsets
    # called as mineTree(myFPtree, myHeaderTab, 3, set([]), freqItems)
    def mineTree(inTree, headerTable, minSup, preFix, freqItemList):
        bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1])]  # header items in ascending order
        print 'bigL:', bigL
        for basePat in bigL:  # start from the bottom of the header table
            newFreqSet = preFix.copy()
            newFreqSet.add(basePat)  # sets use add()
            print 'finalFrequent Item: ', newFreqSet  # a frequent itemset
            freqItemList.append(newFreqSet)  # lists use append()
            #print 'treenood:',headerTable[basePat][1].name
            condPattBases = findPrefixPath(basePat, headerTable[basePat][1])
            print 'condPattBases :', basePat, '-->', condPattBases  # the pattern base dict
            # build a conditional FP tree for each conditional pattern base
            myCondTree, myHead = createTree(condPattBases, minSup)
            print 'head from conditional tree: ', myHead
            if myHead != None:
                print 'conditional tree for: ', newFreqSet
                myCondTree.disp(1)
                mineTree(myCondTree, myHead, minSup, newFreqSet, freqItemList)

    # Main
    # Build the FP tree
    simDat = loadSimpDat()
    initSet = createInitSet(simDat)
    print 'initSet:', initSet
    myFPtree, myHeaderTab = createTree(initSet, 3)
    myFPtree.disp()
    print 'myHeaderTab:', myHeaderTab
    freqItems = []
    mineTree(myFPtree, myHeaderTab, 3, set([]), freqItems)
    print 'freqItems:', freqItems

Output:

    initSet: {frozenset(['e', 'm', 'q', 's', 't', 'y', 'x', 'z']): 1, frozenset(['x', 's', 'r', 'o', 'n']): 1, frozenset(['s', 'u', 't', 'w', 'v', 'y', 'x', 'z']): 1, frozenset(['q', 'p', 'r', 't', 'y', 'x', 'z']): 1, frozenset(['h', 'r', 'z', 'p', 'j']): 1, frozenset(['z']): 1}
      Null Set   1
       z   1
      z   1
       x   1
      x   1
       y   1
      y   1
       s   1
      s   1
       t   1
      Null Set   1
       x   1
       z   1
        x   1
         y   1
          s   1
           t   1
      x   1
       s   1
      s   1
       r   1
      y   3
       s   2
        t   2
       r   1
      r   1
       t   1
      z   4
       x   3
        y   3
         s   2
          t   2
         r   1
          t   1
       r   1
      Null Set   1
       x   1
        s   1
         r   1
       z   5
        x   3
         y   3
          s   2
           t   2
          r   1
           t   1
        r   1
    myHeaderTab: {'s': [3, <__main__.treeNode instance at 0x000000000B8E9808>], 'r': [3, <__main__.treeNode instance at 0x000000000B8E9608>], 't': [3, <__main__.treeNode instance at 0x000000000B8E9788>], 'y': [3, <__main__.treeNode instance at 0x000000000B8E9848>], 'x': [4, <__main__.treeNode instance at 0x000000000B8E97C8>], 'z': [5, <__main__.treeNode instance at 0x000000000B8E9508>]}
    bigL: ['r', 't', 's', 'y', 'x', 'z']
    finalFrequent Item: set(['r'])
    prefixPath: ['r', 's', 'x']
    condPats: {frozenset(['x', 's']): 1}
    prefixPath: ['r', 'y', 'x', 'z']
    condPats: {frozenset(['x', 's']): 1, frozenset(['y', 'x', 'z']): 1}
    prefixPath: ['r', 'z']
    condPats: {frozenset(['x', 's']): 1, frozenset(['z']): 1, frozenset(['y', 'x', 'z']): 1}
    condPattBases : r --> {frozenset(['x', 's']): 1, frozenset(['z']): 1, frozenset(['y', 'x', 'z']): 1}
    head from conditional tree: None
    finalFrequent Item: set(['t'])
    prefixPath: ['t', 's', 'y', 'x', 'z']
    condPats: {frozenset(['y', 'x', 's', 'z']): 2}
    prefixPath: ['t', 'r', 'y', 'x', 'z']
    condPats: {frozenset(['y', 'x', 's', 'z']): 2, frozenset(['y', 'x', 'r', 'z']): 1}
    condPattBases : t --> {frozenset(['y', 'x', 's', 'z']): 2, frozenset(['y', 'x', 'r', 'z']): 1}
      Null Set   1
       y   2
      y   2
       x   2
      x   2
       z   2
    head from conditional tree: {'y': [3, <__main__.treeNode instance at 0x000000000B8E9888>], 'x': [3, <__main__.treeNode instance at 0x000000000B8E9448>], 'z': [3, <__main__.treeNode instance at 0x000000000B8E9408>]}
    conditional tree for: set(['t'])
      Null Set   1
       y   3
        x   3
         z   3
    bigL: ['z', 'x', 'y']
    finalFrequent Item: set(['z', 't'])
    prefixPath: ['z', 'x', 'y']
    condPats: {frozenset(['y', 'x']): 3}
    condPattBases : z --> {frozenset(['y', 'x']): 3}
      Null Set   1
       y   3
      y   3
       x   3
    head from conditional tree: {'y': [3, <__main__.treeNode instance at 0x000000000B8E9488>], 'x': [3, <__main__.treeNode instance at 0x000000000B8E93C8>]}
    conditional tree for: set(['z', 't'])
      Null Set   1
       y   3
        x   3
    bigL: ['x', 'y']
    finalFrequent Item: set(['x', 'z', 't'])
    prefixPath: ['x', 'y']
    condPats: {frozenset(['y']): 3}
    condPattBases : x --> {frozenset(['y']): 3}
      Null Set   1
       y   3
    head from conditional tree: {'y': [3, <__main__.treeNode instance at 0x000000000B8E92C8>]}
    conditional tree for: set(['x', 'z', 't'])
      Null Set   1
       y   3
    bigL: ['y']
    finalFrequent Item: set(['y', 'x', 'z', 't'])
    prefixPath: ['y']
    condPattBases : y --> {}
    head from conditional tree: None
    finalFrequent Item: set(['y', 'z', 't'])
    prefixPath: ['y']
    condPattBases : y --> {}
    head from conditional tree: None
    finalFrequent Item: set(['x', 't'])
    prefixPath: ['x', 'y']
    condPats: {frozenset(['y']): 3}
    condPattBases : x --> {frozenset(['y']): 3}
      Null Set   1
       y   3
    head from conditional tree: {'y': [3, <__main__.treeNode instance at 0x000000000B8E9188>]}
    conditional tree for: set(['x', 't'])
      Null Set   1
       y   3
    bigL: ['y']
    finalFrequent Item: set(['y', 'x', 't'])
    prefixPath: ['y']
    condPattBases : y --> {}
    head from conditional tree: None
    finalFrequent Item: set(['y', 't'])
    prefixPath: ['y']
    condPattBases : y --> {}
    head from conditional tree: None
    finalFrequent Item: set(['s'])
    prefixPath: ['s', 'y', 'x', 'z']
    condPats: {frozenset(['y', 'x', 'z']): 2}
    prefixPath: ['s', 'x']
    condPats: {frozenset(['y', 'x', 'z']): 2, frozenset(['x']): 1}
    condPattBases : s --> {frozenset(['y', 'x', 'z']): 2, frozenset(['x']): 1}
      Null Set   1
       x   2
    head from conditional tree: {'x': [3, <__main__.treeNode instance at 0x000000000B69B788>]}
    conditional tree for: set(['s'])
      Null Set   1
       x   3
    bigL: ['x']
    finalFrequent Item: set(['x', 's'])
    prefixPath: ['x']
    condPattBases : x --> {}
    head from conditional tree: None
    finalFrequent Item: set(['y'])
    prefixPath: ['y', 'x', 'z']
    condPats: {frozenset(['x', 'z']): 3}
    condPattBases : y --> {frozenset(['x', 'z']): 3}
      Null Set   1
       x   3
      x   3
       z   3
    head from conditional tree: {'x': [3, <__main__.treeNode instance at 0x000000000B6A26C8>], 'z': [3, <__main__.treeNode instance at 0x000000000B84B248>]}
    conditional tree for: set(['y'])
      Null Set   1
       x   3
        z   3
    bigL: ['x', 'z']
    finalFrequent Item: set(['y', 'x'])
    prefixPath: ['x']
    condPattBases : x --> {}
    head from conditional tree: None
    finalFrequent Item: set(['y', 'z'])
    prefixPath: ['z', 'x']
    condPats: {frozenset(['x']): 3}
    condPattBases : z --> {frozenset(['x']): 3}
      Null Set   1
       x   3
    head from conditional tree: {'x': [3, <__main__.treeNode instance at 0x000000000B6A10C8>]}
    conditional tree for: set(['y', 'z'])
      Null Set   1
       x   3
    bigL: ['x']
    finalFrequent Item: set(['y', 'x', 'z'])
    prefixPath: ['x']
    condPattBases : x --> {}
    head from conditional tree: None
    finalFrequent Item: set(['x'])
    prefixPath: ['x', 'z']
    condPats: {frozenset(['z']): 3}
    prefixPath: ['x']
    condPattBases : x --> {frozenset(['z']): 3}
      Null Set   1
       z   3
    head from conditional tree: {'z': [3, <__main__.treeNode instance at 0x000000000B823908>]}
    conditional tree for: set(['x'])
      Null Set   1
       z   3
    bigL: ['z']
    finalFrequent Item: set(['x', 'z'])
    prefixPath: ['z']
    condPattBases : z --> {}
    head from conditional tree: None
    finalFrequent Item: set(['z'])
    prefixPath: ['z']
    condPattBases : z --> {}
    head from conditional tree: None
    freqItems: [set(['r']), set(['t']), set(['z', 't']), set(['x', 'z', 't']), set(['y', 'x', 'z', 't']), set(['y', 'z', 't']), set(['x', 't']), set(['y', 'x', 't']), set(['y', 't']), set(['s']), set(['x', 's']), set(['y']), set(['y', 'x']), set(['y', 'z']), set(['y', 'x', 'z']), set(['x']), set(['x', 'z']), set(['z'])]

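The output above shows the detailed process.

Supplement: because a lot of recursion is involved, the process is hard to follow, so here is an example.
This is what happens on the line for basePat in bigL: when basePat is 't':

Comparing this against the output of the code above helps with the analysis. It all comes down to data structures.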

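3. Mining news stories from a news site clickstream

These two chapters of the book contain many nice examples; only one representative example is covered here: mining popular news stories from the clickstream of a news site. It is a large dataset with nearly 1 million records (see further reading: kosarak). The source data is stored in the file kosarak.dat. Each line of the file contains the news stories viewed by one user. Stories are encoded as integers, and Apriori or FP-growth can be used to mine the frequent itemsets and see which news IDs were viewed by many users.

Change the main part of the code in section 2 to the following: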

    parsedDat = [line.split() for line in open('kosarak.dat').readlines()]  # load the dataset into a list
    initSet = createInitSet(parsedDat)  # format the initial set
    # then build the FP tree and look for stories viewed by at least 100,000 people
    myFPtree, myHeaderTab = createTree(initSet, 100000)
    myFreqList = []  # empty list to hold the frequent itemsets
    mineTree(myFPtree, myHeaderTab, 100000, set([]), myFreqList)
    print 'length:', len(myFreqList)  # how many stories or sets of stories were viewed by 100,000 or more people
    print 'myFreqList', myFreqList  # the itemsets themselves

Output:

    ...
    condPattBases : 6 --> {}
    head from conditional tree: None
    finalFrequent Item: set(['6'])
    prefixPath: ['6']
    condPattBases : 6 --> {}
    head from conditional tree: None
    length: 9
    myFreqList [set(['1']), set(['1', '6']), set(['3']), set(['11', '3']), set(['11', '3', '6']), set(['3', '6']), set(['11']), set(['11', '6']), set(['6'])]

You can also try other settings and compare the results, for example lowering the minimum support threshold.
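One thing to watch on real data: createInitSet in the listing above assigns 1 to every distinct transaction, so identical transactions that occur several times are counted only once. A hedged Python 3 variant that accumulates the counts is sketched below. The name `create_init_set` and the four sample lines are made up for illustration and are not taken from kosarak.dat.

```python
import io


def create_init_set(transactions):
    """Map each distinct transaction (as a frozenset) to how often it occurs."""
    init = {}
    for trans in transactions:
        key = frozenset(trans)
        init[key] = init.get(key, 0) + 1  # accumulate instead of overwriting with 1
    return init


# Hypothetical lines in the same space-separated format as kosarak.dat.
sample = io.StringIO("1 2 3\n1\n3 2 1\n6 11\n")
parsed = [line.split() for line in sample]
init_set = create_init_set(parsed)
print(init_set)
```

The results printed above were produced with the original function, so with this variant the supports, and possibly the number of itemsets above a given threshold, may differ.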

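Summary:

  • FP-growth is an efficient method for finding frequent patterns in a dataset. It makes use of the Apriori principle and runs faster.
  • There is also a map-reduce version of FP-growth, which works well and can scale across multiple machines. Google uses the algorithm to find frequently co-occurring words by going through large amounts of text, in a way very similar to the example just presented.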

4. Notes

(1) The Python dictionary get() method:

get() returns the value for the given key, or a default value if the key is not in the dictionary.
Syntax: dict.get(key, default=None)
key – the key to look up in the dictionary
default – the value returned when the key does not exist.

Example:

    >>> dict = {'Name': 'Zara', 'Age': 27}
    >>> dict.get('Age')
    27
    >>> dict.get('Sex', 0)
    0
    >>>

(2) Usage of initSet=createInitSet(simDat):

    In [13]: m=['e', 'm', 'q', 's', 't', 'y', 'x', 'z']

    In [14]: mm=frozenset(m)

    In [15]: initSet
    Out[15]:
    {frozenset({'e', 'm', 'q', 's', 't', 'x', 'y', 'z'}): 1,
     frozenset({'n', 'o', 'r', 's', 'x'}): 1,
     frozenset({'z'}): 1,
     frozenset({'s', 't', 'u', 'v', 'w', 'x', 'y', 'z'}): 1,
     frozenset({'p', 'q', 'r', 't', 'x', 'y', 'z'}): 1,
     frozenset({'h', 'j', 'p', 'r', 'z'}): 1}

    In [16]: initSet[mm]
    Out[16]: 1

(3) Usage of orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)]:

    In [31]: localD={'y': 3, 'x': 4, 's': 3, 'z': 5, 't': 3}

    In [32]: [v[0] for v in sorted(localD.items(), \
        ...: key=lambda p: p[1], reverse=True)]
    Out[32]: ['z', 'x', 'y', 's', 't']

    # rr is localD sorted in descending order by value (p[1])
    In [33]: rr=sorted(localD.items(), \
        ...: key=lambda p: p[1], reverse=True)

    In [34]: rr
    Out[34]: [('z', 5), ('x', 4), ('y', 3), ('s', 3), ('t', 3)]

    # rr is localD sorted in descending order by key (p[0])
    In [35]: rr=sorted(localD.items(), \
        ...: key=lambda p: p[0], reverse=True)

    In [36]: rr
    Out[36]: [('z', 5), ('y', 3), ('x', 4), ('t', 3), ('s', 3)]

    # take the first element of each tuple in rr
    In [37]: [v[0] for v in rr]
    Out[37]: ['z', 'y', 'x', 't', 's']

    In [38]: [v[1] for v in rr]
    Out[38]: [5, 3, 4, 3, 3]

    In [39]: rr[0]
    Out[39]: ('z', 5)

    In [40]: type(rr[0])
    Out[40]: tuple

(4) Usage of updateTree(items[1::], inTree.children[items[0]], headerTable, count):

    >>> items=['z', 'x', 'y', 's', 't']
    >>> items[1::]
    ['x', 'y', 's', 't']
    >>> items[1:]
    ['x', 'y', 's', 't']
    >>> items[2::]
    ['y', 's', 't']
    >>>

Reference: https://www.cnblogs.com/qwertWZ/p/4510857.html
