Chapter 12: Efficiently Finding Frequent Itemsets with the FP-growth Algorithm

北溟有渔夫 · 2022-05-03

This post covers the following:

  • Finding common patterns in transaction data
  • The FP-growth algorithm
  • Finding co-occurring words in a Twitter feed

The FP-growth algorithm builds on the Apriori algorithm: it performs the same task of finding frequent itemsets (sets of items that commonly appear together), but it uses different techniques, storing the dataset in a special structure called an FP-tree before mining it. As a result it runs faster than Apriori, often by more than two orders of magnitude.

Of the ways to extract interesting information from a dataset, the two most commonly used are frequent itemsets and association rules.

FP-growth scans the database only twice, whereas Apriori scans the dataset once for every candidate itemset to decide whether a given pattern is frequent; this is why FP-growth is faster. On a small dataset this hardly matters, but it becomes a real problem on larger datasets. FP-growth finds frequent itemsets in two basic steps (a small illustration of the goal follows the list):

  1. Build the FP-tree.
  2. Mine frequent itemsets from the FP-tree.
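
To make the goal concrete, here is a purely illustrative brute-force count on a tiny three-transaction dataset (this snippet is not part of fpGrowth.py): every candidate itemset is checked against the data, which is exactly the kind of repeated scanning that FP-growth avoids.

# Brute-force frequent-itemset counting on a toy dataset (illustration only)
from itertools import combinations

transactions = [['a', 'b'], ['b', 'c'], ['a', 'b', 'c']]
minSup = 2
items = sorted({i for t in transactions for i in t})
for size in range(1, len(items) + 1):
    for cand in combinations(items, size):
        # support = number of transactions containing every item of cand
        support = sum(1 for t in transactions if set(cand) <= set(t))
        if support >= minSup:
            print(set(cand), support)
# frequent at minSup=2: {a}, {b}, {c}, {a, b}, {b, c}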

12.1 FP-trees: an efficient way to encode a dataset

The FP-growth algorithm stores data in a compact structure called an FP-tree, where FP stands for frequent pattern.

Links between similar items, called node links, are used to quickly locate all occurrences of a given item in the tree.

 

The FP-growth workflow is as follows: first build the FP-tree, then use it to mine frequent itemsets. Building the FP-tree requires two passes over the original dataset. The first pass counts the occurrences of every item. Recall the Apriori principle from Chapter 11: if an item is infrequent, then any superset containing it is also infrequent, so those supersets never need to be considered. The first pass over the database collects the counts, and the second pass considers only the frequent items.
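
Before looking at the real code, here is a minimal sketch of the first scan only, assuming the list-of-transactions format returned later by loadSimpDat(); the helper name firstScan is made up for this illustration and is not part of fpGrowth.py.

# Sketch of the first pass: count every item, keep only items meeting minSup
from collections import Counter

def firstScan(transactions, minSup):
    counts = Counter(item for trans in transactions for item in trans)
    return {item: cnt for item, cnt in counts.items() if cnt >= minSup}

# e.g. firstScan(loadSimpDat(), 3) would keep only z, x, r, y, s and t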

12.2 Building an FP-tree

12.2.1 The FP-tree data structure

'''
Author: Maxwell Pan
Date: 2022-05-03 04:53:59
LastEditTime: 2022-05-03 05:09:31
FilePath: \cp12\fpGrowth.py
Description: Efficiently finding frequent itemsets with FP-Growth
Software:VSCode,env:
'''
# Because the FP-tree is more involved than the other trees in this book,
# we create a class to hold each node of the tree.
class treeNode:
    def __init__(self, nameValue, numOccur,parentNode):
        self.name = nameValue  # hold the name of the node
        self.count = numOccur  # hold the count of the node
        self.nodeLink = None   # the nodeLink variable will be used to link similar items
        self.parent = parentNode # refer to the parent of this node in the tree
        self.children = {} # the node contains an empty dictionary for the children of this node
    
    # the method inc() increments the count variable by a given amount
    def inc(self, numOccur):
        self.count += numOccur
    # The last method,disp(), is used to display the tree in text.It isn't needed to create the tree,but it's useful for debugging.
    def disp(self, ind=1):
        print(' '*ind, self.name, ' ', self.count)
        for child in self.children.values():
            child.disp(ind+1)

# An interactive session to exercise the treeNode class:

import fpGrowth
rootNode = fpGrowth.treeNode('pyramid',9,None)

# the line above created a single tree node; now add a child node to it
rootNode.children['eye'] = fpGrowth.treeNode('eye', 13, None)

# display the tree (the root and its child)
rootNode.disp()


# Add another node to see how two child nodes are displayed
rootNode.children['phoenix']=fpGrowth.treeNode('phoenix', 3, None)

rootNode.disp()

12.2.2 Constructing the FP-tree

A header table is needed to point to the first instance of each item type. With the header table, all occurrences of a given item type in the FP-tree can be accessed quickly.

A dictionary is used as the data structure for the header table. Besides storing these pointers, the header table also keeps the total count of each item type in the FP-tree.
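
For orientation, the header table ends up looking like the dictionary below: each entry maps an item to a two-element list holding the item's total count and a link to the first treeNode containing that item. The counts shown correspond to the simple dataset used later; the links start out as None and are filled in while the tree is built.

# Shape of the header table (illustrative values)
headerTable = {
    'z': [5, None],   # None is later replaced by a reference to a treeNode
    'x': [4, None],
    'y': [3, None],
}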

  1. The first pass over the dataset counts the frequency of each item.
  2. Items that do not meet the minimum support are removed.
  3. The FP-tree is built. Each itemset is read in and added to an existing path if one exists; otherwise a new path is created. Each transaction is an unordered set and identical items are represented only once, so before a set is added to the tree it is sorted by the absolute frequency of its items (a small sketch of this filtering and reordering follows the list). Once the transactions are filtered and sorted, the FP-tree can be built.
  4. Starting from the empty set ∅, frequent itemsets are added to the tree one transaction at a time. Each filtered, sorted transaction is added in turn: if an element already exists in the tree, its count is incremented; otherwise a new branch is added.
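
Here is a small sketch of the filtering and reordering from steps 2 and 3, applied to a single transaction. filterAndSort is a made-up helper for illustration, not part of fpGrowth.py; createTree() below does the same thing inline.

# Drop infrequent items, then reorder by descending global frequency
def filterAndSort(transaction, globalCounts, minSup):
    kept = [item for item in transaction if globalCounts.get(item, 0) >= minSup]
    return sorted(kept, key=lambda item: globalCounts[item], reverse=True)

counts = {'z': 5, 'x': 4, 'r': 3, 'h': 1, 'j': 1, 'p': 2}
print(filterAndSort(['r', 'z', 'h', 'j', 'p'], counts, 3))   # ['z', 'r']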

# FP-tree creation code
# createTree() takes the dataset and the minimum support as
# arguments and builds the FP-tree.
def createTree(dataSet, minSup=1):
    headerTable = {}
    # the first pass goes through everything in the dataset and counts the frequency of each term.
    for trans in dataSet:
        for item in trans:
            headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
    headerTableCopy = headerTable.copy()
    # Remove item not meeting min support
    for k in headerTableCopy.keys():
        if headerTable[k] < minSup:
            del(headerTable[k])
    freqItemSet = set(headerTable.keys())
    # if no items meet min support, exit
    if len(freqItemSet) == 0:
        return None, None
    for k in headerTable:
        headerTable[k] = [headerTable[k],None]
    retTree = treeNode('Null Set', 1, None)
    for tranSet, count in dataSet.items():
        localD = {}
        # Sort transaction by global frequency
        for item in tranSet:
            if item in freqItemSet:
                localD[item] = headerTable[item][0]
        if len(localD) > 0:
            orderedItems = [v[0] for v in sorted(localD.items(),key=lambda p: p[1],reverse=True)]
            # Populate tree with ordered freq itemset
            updateTree(orderedItems,retTree,headerTable,count)
    return retTree, headerTable

def updateTree(items, inTree, headerTable, count):
    if items[0] in inTree.children:
        inTree.children[items[0]].inc(count)
    else:
        inTree.children[items[0]] = treeNode(items[0], count, inTree)
        if headerTable[items[0]][1] == None: # update header table
            headerTable[items[0]][1] = inTree.children[items[0]]
        else:
            updateHeader(headerTable[items[0]][1],inTree.children[items[0]])
    # Recursively call updateTree on remaining items
    if len(items) > 1:
        updateTree(items[1::], inTree.children[items[0]],headerTable,count)

def updateHeader(nodeToTest, targetNode):
    while(nodeToTest.nodeLink != None):
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode
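
Note that updateHeader() deliberately walks to the end of the node-link chain before attaching the new node; simply overwriting headerTable[item][1] would lose track of the earlier occurrences, and this chain is exactly what findPrefixPath() later follows to collect every path containing a given item.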

# Simple dataset and data wrapper
# the loadSimpDat() function will return a list of transactions.
def loadSimpDat():
    simpDat = [['r', 'z', 'h', 'j', 'p'],
               ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
               ['z'],
               ['r', 'x', 'n', 'o', 's'],
               ['y', 'r', 'x', 'z', 'q', 't', 'p'],
               ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
    return simpDat

def createInitSet(dataSet):
    retDict = {}
    for trans in dataSet:
        # accumulate counts so duplicate transactions are not lost
        retDict[frozenset(trans)] = retDict.get(frozenset(trans), 0) + 1
    return retDict

import fpGrowth

import importlib

importlib.reload(fpGrowth)


# load the example dataset


simpDat = fpGrowth.loadSimpDat()


simpDat

# need to format this for createTree()

initSet = fpGrowth.createInitSet(simpDat)
initSet


# create the FP-tree

myFPtree, myHeaderTab = fpGrowth.createTree(initSet, 3)

myFPtree.disp()


# Each item and its count are displayed with indentation representing its depth in the tree.



Now that the FP-tree has been built, the next step is to use it to mine frequent itemsets.

12.3 Mining frequent itemsets from an FP-tree

There are three basic steps for extracting frequent itemsets from an FP-tree:

(1) Get the conditional pattern base from the FP-tree.

(2) From the conditional pattern base, construct a conditional FP-tree.

(3) Recursively repeat steps (1) and (2) until the tree contains a single item.

We first focus on step (1), finding conditional pattern bases. Then, for each conditional pattern base, a corresponding conditional FP-tree is created. Finally, a little code is needed to wrap these two functions together and extract the frequent itemsets from the FP-tree.

12.3.1 Extracting conditional pattern bases

Starting with the individual frequent items already stored in the header table, we obtain a conditional pattern base for each item. The conditional pattern base is the collection of paths that end with the item being looked up. Each of these paths is a prefix path: everything between the item of interest and the root of the tree.

The prefix paths are used to build the conditional FP-tree. An exhaustive search would eventually find the frequent items we want, but a more efficient approach uses the header table built earlier: it holds the start of a linked list of all nodes of the same item, and from any such node we can ascend the tree all the way up to the root.

# A function to find all paths ending with a given item.

def ascendTree(leafNode, prefixPath):
    if leafNode.parent != None:
        prefixPath.append(leafNode.name)
        ascendTree(leafNode.parent,prefixPath)

def findPrefixPath(basePat, treeNode):
    condPats = {}
    while treeNode != None:
        prefixPath = []
        ascendTree(treeNode, prefixPath)
        if len(prefixPath) > 1:
            condPats[frozenset(prefixPath[1:])] = treeNode.count
        treeNode = treeNode.nodeLink
    return condPats
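
With the tree and header table from section 12.2 still in memory, the conditional pattern base of a single item can be pulled out directly. The exact dictionaries returned depend on how ties were broken when the transactions were sorted, so treat the calls below as a usage example rather than fixed output:

# Conditional pattern bases for individual items (uses myHeaderTab from 12.2)
fpGrowth.findPrefixPath('x', myHeaderTab['x'][1])
fpGrowth.findPrefixPath('z', myHeaderTab['z'][1])
fpGrowth.findPrefixPath('r', myHeaderTab['r'][1])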

12.3.2 Creating conditional FP-trees

For each frequent item, a conditional FP-tree is created.

def mineTree(inTree, headerTable, minSup, preFix, freqItemList):
    # Start from the bottom of the header table (least frequent items first)
    bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1][0])]
    for basePat in bigL:
        newFreqSet = preFix.copy()
        newFreqSet.add(basePat)
        freqItemList.append(newFreqSet)
        # Construct cond. FP-tree from cond. pattern base
        condPattBases = findPrefixPath(basePat, headerTable[basePat][1])
        myCondTree,myHead = createTree(condPattBases,minSup)
        # Mine cond. FP-tree
        if myHead != None:
            print('conditional tree for: ',newFreqSet)
            myCondTree.disp(1)
            mineTree(myCondTree, myHead, minSup, newFreqSet, freqItemList)
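
Before running mineTree() on the whole tree, a single conditional FP-tree can be built by hand from one item's conditional pattern base. This is only an illustration that reuses the session objects from earlier (myHeaderTab and the fpGrowth module); whether the resulting conditional tree is empty depends on the item and on minSup.

# Build and display the conditional FP-tree for one item, e.g. 't'
condPats = fpGrowth.findPrefixPath('t', myHeaderTab['t'][1])
condTree, condHead = fpGrowth.createTree(condPats, 3)
if condTree is not None:
    condTree.disp()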

 

import fpGrowth

import importlib

importlib.reload(fpGrowth)

freqItems = []

fpGrowth.mineTree(myFPtree, myHeaderTab, 3, set([]), freqItems)

freqItems

# The itemsets match the conditional FP-trees, which is what you’d expect.

The complete FP-growth implementation is now runnable. The full fpGrowth.py module is as follows:

'''
Author: Maxwell Pan
Date: 2022-05-03 04:53:59
LastEditTime: 2022-05-03 13:00:39
FilePath: \cp12\fpGrowth.py
Description: Efficiently finding frequent itemsets with FP-Growth
Software:VSCode,env:
'''
# Because the FP-tree is more involved than the other trees in this book,
# we create a class to hold each node of the tree.
class treeNode:
    def __init__(self, nameValue, numOccur,parentNode):
        self.name = nameValue  # hold the name of the node
        self.count = numOccur  # hold the count of the node
        self.nodeLink = None   # the nodeLink variable will be used to link similar items
        self.parent = parentNode # refer to the parent of this node in the tree
        self.children = {} # the node contains an empty dictionary for the children of this node
    
    # the method inc() increments the count variable by a given amount
    def inc(self, numOccur):
        self.count += numOccur
    # The last method,disp(), is used to display the tree in text.It isn't needed to create the tree,but it's useful for debugging.
    def disp(self, ind=1):
        print(' '*ind, self.name, ' ', self.count)
        for child in self.children.values():
            child.disp(ind+1)

# FP-tree creation code
# createTree() takes the dataset and the minimum support as
# arguments and builds the FP-tree.
def createTree(dataSet, minSup=1):
    headerTable = {}
    # the first pass goes through everything in the dataset and counts the frequency of each term.
    for trans in dataSet:
        for item in trans:
            headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
    headerTableCopy = headerTable.copy()
    # Remove item not meeting min support
    for k in headerTableCopy.keys():
        if headerTable[k] < minSup:
            del(headerTable[k])
    freqItemSet = set(headerTable.keys())
    # if no items meet min support, exit
    if len(freqItemSet) == 0:
        return None, None
    for k in headerTable:
        headerTable[k] = [headerTable[k],None]
    retTree = treeNode('Null Set', 1, None)
    for tranSet, count in dataSet.items():
        localD = {}
        # Sort transaction by global frequency
        for item in tranSet:
            if item in freqItemSet:
                localD[item] = headerTable[item][0]
        if len(localD) > 0:
            orderedItems = [v[0] for v in sorted(localD.items(),key=lambda p: p[1],reverse=True)]
            # Populate tree with ordered freq itemset
            updateTree(orderedItems,retTree,headerTable,count)
    return retTree, headerTable

def updateTree(items, inTree, headerTable, count):
    if items[0] in inTree.children:
        inTree.children[items[0]].inc(count)
    else:
        inTree.children[items[0]] = treeNode(items[0], count, inTree)
        if headerTable[items[0]][1] == None: # update header table
            headerTable[items[0]][1] = inTree.children[items[0]]
        else:
            updateHeader(headerTable[items[0]][1],inTree.children[items[0]])
    # Recursively call updateTree on remaining items
    if len(items) > 1:
        updateTree(items[1::], inTree.children[items[0]],headerTable,count)

def updateHeader(nodeToTest, targetNode):
    while(nodeToTest.nodeLink != None):
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode

# Simple dataset and data wrapper
# the loadSimpDat() function will return a list of transactions.
def loadSimpDat():
    simpDat = [['r', 'z', 'h', 'j', 'p'],
               ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
               ['z'],
               ['r', 'x', 'n', 'o', 's'],
               ['y', 'r', 'x', 'z', 'q', 't', 'p'],
               ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
    return simpDat

def createInitSet(dataSet):
    retDict = {}
    for trans in dataSet:
        # accumulate counts so duplicate transactions are not lost
        retDict[frozenset(trans)] = retDict.get(frozenset(trans), 0) + 1
    return retDict

# A function to find all paths ending with a given item.

def ascendTree(leafNode, prefixPath):
    if leafNode.parent != None:
        prefixPath.append(leafNode.name)
        ascendTree(leafNode.parent,prefixPath)

def findPrefixPath(basePat, treeNode):
    condPats = {}
    while treeNode != None:
        prefixPath = []
        ascendTree(treeNode, prefixPath)
        if len(prefixPath) > 1:
            condPats[frozenset(prefixPath[1:])] = treeNode.count
        treeNode = treeNode.nodeLink
    return condPats

def mineTree(inTree, headerTable, minSup, preFix, freqItemList):
    # Start from the bottom of the header table (least frequent items first)
    bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1][0])]
    for basePat in bigL:
        newFreqSet = preFix.copy()
        newFreqSet.add(basePat)
        freqItemList.append(newFreqSet)
        # Construct cond. FP-tree from cond. pattern base
        condPattBases = findPrefixPath(basePat, headerTable[basePat][1])
        myCondTree,myHead = createTree(condPattBases,minSup)
        # Mine cond. FP-tree
        if myHead != None:
            print('conditional tree for: ',newFreqSet)
            myCondTree.disp(1)
            mineTree(myCondTree, myHead, minSup, newFreqSet, freqItemList)

12.4 Example: finding co-occurring words in a Twitter feed

This example uses a Python library called python-twitter, which can be installed with pip install python-twitter.

The functions below are appended to the fpGrowth.py module listed in full above.


# Listing 12.6: code to access the Twitter Python library

import twitter

from time import sleep
import re

def getLotsOfTweets(searchStr):
    CONSUMER_KEY = 'get when you create an app'
    CONSUMER_SECRET = 'get when you create an app'
    ACCESS_TOKEN_KEY = 'get from Oauth, specific to a user'
    ACCESS_TOKEN_SECRET = 'get from Oauth, specific to a user'
    api = twitter.Api(consumer_key=CONSUMER_KEY, consumer_secret=CONSUMER_SECRET,
                      access_token_key=ACCESS_TOKEN_KEY, 
                      access_token_secret=ACCESS_TOKEN_SECRET)
    
    # fetch up to 1,400 results: 14 pages * 100 results per page
    resultPages = []
    for i in range(1,15):
        print("fetching page %d" % i)
        searchResults = api.GetSearch(searchStr, per_page=100, page=i)
        resultPages.append(searchResults)
        sleep(6)
    return resultPages
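
The listing above only fetches tweets. A minimal sketch of the remaining step, parsing the tweet text and feeding it into createTree() and mineTree(), could look like the following. textParse() and mineTweets() are assumed helper names written for this module, the URL-stripping regular expression and the minimum-support values are illustrative choices, and each search result is assumed to expose the tweet text as a .text attribute.

# Parse the fetched tweets and mine frequent word sets (sketch)
def textParse(bigString):
    # strip URLs, then keep lowercase tokens longer than two characters
    urlsRemoved = re.sub(r'(http[s]?:[/][/]|www\.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*', '', bigString)
    listOfTokens = re.split(r'\W+', urlsRemoved)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def mineTweets(tweetArr, minSup=5):
    parsedList = []
    for resultPage in tweetArr:
        for tweet in resultPage:
            # assumes each result exposes its text as a .text attribute
            parsedList.append(textParse(tweet.text))
    initSet = createInitSet(parsedList)
    myFPtree, myHeaderTab = createTree(initSet, minSup)
    if myFPtree is None:   # nothing met the minimum support
        return []
    freqItems = []
    mineTree(myFPtree, myHeaderTab, minSup, set([]), freqItems)
    return freqItems

# Usage (illustrative): tweets = getLotsOfTweets('some search term'); freqSets = mineTweets(tweets, 20)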

 

12.6 Summary

The FP-growth algorithm is an efficient way of finding frequent patterns in a dataset. It exploits the Apriori principle but executes faster: Apriori generates candidate itemsets and then scans the dataset to check whether they are frequent, whereas FP-growth scans the dataset only twice. In FP-growth the dataset is stored in a structure called an FP-tree. Once the FP-tree is built, frequent itemsets are found by looking up an item's conditional pattern base and building a conditional FP-tree. The process is repeated, conditioning on more and more items, until a conditional FP-tree contains only a single item.

FP-growth can be used to find frequent words in many kinds of text documents. Twitter provides a number of APIs for developers, and the Python module python-twitter makes it easy to access them. Applying FP-growth to a Twitter feed on a given topic yields a summary of the frequent terms around that topic. Frequent itemset generation has other applications as well, such as shopping transactions, medical diagnosis, and atmospheric studies.
