Chapter 12: Using the FP-growth algorithm to find frequent itemsets efficiently

This post covers the following:

Finding common patterns in transaction data
The FP-growth algorithm
Finding co-occurring words in a Twitter feed

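The FP-growth algorithm builds on Apriori but uses different techniques to accomplish the same task: it first stores the dataset in a special structure called an FP-tree and then mines that tree for frequent itemsets or frequent pairs, that is, sets of items that commonly occur together. It runs faster than Apriori, commonly by two or more orders of magnitude.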

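Of the methods for extracting interesting information from a dataset, the two most commonly used are frequent itemsets and association rules.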

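FP-growth needs only two scans of the database, whereas Apriori scans the dataset for every candidate itemset to decide whether that pattern is frequent. On small datasets the repeated scans are not a problem, but on larger datasets they become a serious cost, which is why FP-growth is faster than Apriori. With its two scans, FP-growth finds frequent itemsets in two basic steps:

Build the FP-tree
Mine frequent itemsets from the FP-tree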

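12.1 FP-trees: an efficient way to encode a dataset

The FP-growth algorithm stores data in a compact data structure called an FP-tree. FP stands for frequent pattern. Each transaction is stored as a path in the tree, and transactions that begin with the same items share that part of the tree.

Links between similar items, known as node links, are used to quickly find the locations of similar items.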
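To make the compactness claim concrete, here is a minimal sketch that stores transactions in a nested-dictionary prefix tree and compares the number of tree nodes with the number of stored items. The `insert` and `count_nodes` helpers are hypothetical names, and the transactions are assumed to have already been filtered to frequent items and sorted by descending frequency (ties broken alphabetically, which is an assumption of this sketch).

```python
def insert(tree, items):
    """Add one ordered transaction to a nested-dict prefix tree."""
    node = tree
    for item in items:
        child = node.setdefault(item, {"count": 0, "children": {}})
        child["count"] += 1          # one more transaction passes through here
        node = child["children"]

def count_nodes(tree):
    """Count every node in the prefix tree (the root is not counted)."""
    return sum(1 + count_nodes(child["children"]) for child in tree.values())

# Assumed input: transactions already filtered and sorted by frequency
ordered = [
    ['z', 'r'],
    ['z', 'x', 's', 't', 'y'],
    ['z'],
    ['x', 'r', 's'],
    ['z', 'x', 'r', 't', 'y'],
    ['z', 'x', 's', 't', 'y'],
]

tree = {}
for transaction in ordered:
    insert(tree, transaction)

total_items = sum(len(t) for t in ordered)
print(total_items, count_nodes(tree))
```

Because transactions with a common prefix share nodes, the tree needs fewer nodes than there are items in the transactions. This sketch leaves out node links and parent pointers, which a real FP-tree also needs.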

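The FP-growth algorithm works as follows. First the FP-tree is built, and then it is mined for frequent itemsets. Building the FP-tree requires two scans of the original dataset. The first scan counts the occurrences of every item. Recall the Apriori principle from Chapter 11: if an item is infrequent, then every superset containing it is also infrequent, so those supersets need not be considered. The first scan therefore collects the frequency counts, and the second scan considers only the frequent items.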
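The first scan can be sketched in a few lines. This is a minimal illustration using `collections.Counter`, not the chapter's own code; the variable names are hypothetical, the transactions are the sample data used later in this chapter, and the minimum support of 3 matches the value used there.

```python
from collections import Counter

transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['z'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]
min_sup = 3

# First scan: count in how many transactions each item occurs
counts = Counter(item for t in transactions for item in set(t))

# Keep only the items that meet the minimum support; by the Apriori
# principle, no itemset containing a dropped item can be frequent
frequent = {item: n for item, n in counts.items() if n >= min_sup}
print(frequent)
```

Only six of the seventeen distinct items survive, so the second scan has far less to insert into the tree.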

12.2 Building an FP-tree

12.2.1 Creating the FP-tree data structure

'''
Author: Maxwell Pan
Date: 2022-05-03 04:53:59
LastEditTime: 2022-05-03 05:09:31
FilePath: \cp12\fpGrowth.py
Description: Efficiently finding frequent itemsets with FP-Growth
Software:VSCode,env:
'''
# The FP-tree is more involved than the other trees in this book,
# so we create a class to hold each node of the tree.
class treeNode:
    def __init__(self, nameValue, numOccur, parentNode):
        self.name = nameValue     # name of the item stored in this node
        self.count = numOccur     # number of occurrences counted at this node
        self.nodeLink = None      # link to the next node holding the same item
        self.parent = parentNode  # parent of this node in the tree
        self.children = {}        # child nodes, keyed by item name

    # inc() increments the count by a given amount
    def inc(self, numOccur):
        self.count += numOccur

    # disp() prints the tree as indented text. It isn't needed to build the
    # tree, but it is useful for debugging.
    def disp(self, ind=1):
        print(' ' * ind, self.name, ' ', self.count)
        for child in self.children.values():
            child.disp(ind + 1)

Usage example:

import fpGrowth

# create a single tree node
rootNode = fpGrowth.treeNode('pyramid', 9, None)

# add a child node to it
rootNode.children['eye'] = fpGrowth.treeNode('eye', 13, None)

# display the node and its child
rootNode.disp()

# add another node to see how two child nodes are displayed
rootNode.children['phoenix'] = fpGrowth.treeNode('phoenix', 3, None)

rootNode.disp()

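12.2.2 Constructing the FP-tree

In addition to the tree itself, a header table is needed to point to the first instance of each item type. With the header table, all the nodes of a given type in the FP-tree can be reached quickly.

A dictionary is used here as the data structure for the header table. Besides holding the pointers, the header table also stores the total count of each type of item in the FP-tree.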
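The header table and node links can be illustrated with a small hand-built tree fragment. This is a simplified sketch: `Node`, `link`, and `nodes_of` are hypothetical names, the fragment is a toy rather than the tree for the sample data, and the counts are accumulated while linking, whereas the chapter's createTree() fills them in from the first pass.

```python
class Node:
    """A stripped-down tree node: item name, count, parent, and node link."""
    def __init__(self, name, count, parent):
        self.name = name
        self.count = count
        self.parent = parent
        self.node_link = None   # next node in the tree holding the same item

def link(header, node):
    """Register a node in the header table: {item: [total count, first node]}."""
    entry = header.setdefault(node.name, [0, None])
    entry[0] += node.count
    if entry[1] is None:
        entry[1] = node                      # first node of this item type
    else:
        current = entry[1]
        while current.node_link is not None:
            current = current.node_link      # walk to the end of the chain
        current.node_link = node

def nodes_of(header, name):
    """Yield every node of one item type by following the node links."""
    node = header[name][1]
    while node is not None:
        yield node
        node = node.node_link

# Toy fragment: 'r' occurs under two different parents
root = Node('Null Set', 1, None)
z = Node('z', 5, root)
r_under_z = Node('r', 1, z)
x = Node('x', 1, root)
r_under_x = Node('r', 1, x)

header = {}
for n in (z, r_under_z, x, r_under_x):
    link(header, n)

print(header['r'][0], [n.parent.name for n in nodes_of(header, 'r')])
```

Starting from the header entry for 'r', both 'r' nodes are reached without searching the rest of the tree.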

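The first pass over the dataset counts the frequency of occurrence of each item.
Items that do not meet the minimum support are removed.
The FP-tree is then built. Each itemset is read in and added to an existing path; if no matching path exists, a new one is created. Each transaction is a set, and sets are unordered, so two transactions that share items could be read in different orders and end up on different paths, even though identical items should be represented only once in the tree. To solve this, each set is sorted before it is added to the tree, and the sort is based on the absolute frequency of the items. Once the transactions have been filtered and sorted, the FP-tree can be built.
Start from the null set (symbol $\emptyset$) and add the frequent itemsets to it one after another. The filtered, sorted transactions are added to the tree in turn: if an item already exists in the tree at that position, its count is incremented; if it does not, a new branch is added to the tree.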
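The filter-and-sort step described above can be sketched on its own. `order_transaction` is a hypothetical helper, the `frequent` counts are those of the chapter's sample data at a minimum support of 3, and breaking ties alphabetically is an assumption of this sketch that keeps the order identical across transactions.

```python
# Global counts of the frequent items (sample data, minimum support 3)
frequent = {'z': 5, 'x': 4, 'r': 3, 's': 3, 't': 3, 'y': 3}

def order_transaction(transaction, frequent):
    """Drop infrequent items, then sort by descending global frequency.

    Ties are broken by item name so that equally frequent items always
    appear in the same relative order, whatever the input order was.
    """
    kept = [item for item in set(transaction) if item in frequent]
    return sorted(kept, key=lambda item: (-frequent[item], item))

print(order_transaction(['y', 'r', 'x', 'z', 'q', 't', 'p'], frequent))
print(order_transaction(['r', 'x', 'n', 'o', 's'], frequent))
```

With a consistent order, transactions that share their most frequent items are inserted along the same path from the root.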

# FP-tree creation code
# createTree() takes the dataset (a dict mapping frozenset -> count) and the
# minimum support as arguments and builds the FP-tree.
def createTree(dataSet, minSup=1):
    headerTable = {}
    # First pass: count the frequency of each item across all transactions.
    for trans in dataSet:
        for item in trans:
            headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
    # Remove items not meeting min support (iterate over a copy of the keys)
    for k in list(headerTable.keys()):
        if headerTable[k] < minSup:
            del headerTable[k]
    freqItemSet = set(headerTable.keys())
    # If no items meet min support, exit
    if len(freqItemSet) == 0:
        return None, None
    # Reformat the header table to hold [count, pointer to first node]
    for k in headerTable:
        headerTable[k] = [headerTable[k], None]
    retTree = treeNode('Null Set', 1, None)
    # Second pass: insert each filtered, sorted transaction into the tree
    for tranSet, count in dataSet.items():
        localD = {}
        for item in tranSet:
            if item in freqItemSet:
                localD[item] = headerTable[item][0]
        if len(localD) > 0:
            # Sort by global frequency (descending); break ties by item name so
            # that every transaction orders equally frequent items the same way
            orderedItems = [v[0] for v in sorted(localD.items(),
                                                 key=lambda p: (-p[1], p[0]))]
            # Populate tree with ordered freq itemset
            updateTree(orderedItems, retTree, headerTable, count)
    return retTree, headerTable

def updateTree(items, inTree, headerTable, count):
    if items[0] in inTree.children:
        # The item is already a child here: just increment its count
        inTree.children[items[0]].inc(count)
    else:
        # Otherwise create a new child node and hook it into the header table
        inTree.children[items[0]] = treeNode(items[0], count, inTree)
        if headerTable[items[0]][1] is None:
            headerTable[items[0]][1] = inTree.children[items[0]]
        else:
            updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
    # Recursively call updateTree on the remaining items
    if len(items) > 1:
        updateTree(items[1:], inTree.children[items[0]], headerTable, count)

# Follow the node links to the end of the list and append targetNode there
def updateHeader(nodeToTest, targetNode):
    while nodeToTest.nodeLink is not None:
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode

# Simple dataset and data wrapper
# loadSimpDat() returns a list of transactions.
def loadSimpDat():
    simpDat = [['r', 'z', 'h', 'j', 'p'],
               ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
               ['z'],
               ['r', 'x', 'n', 'o', 's'],
               ['y', 'r', 'x', 'z', 'q', 't', 'p'],
               ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
    return simpDat

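# createInitSet() converts the list of transactions into the
# {frozenset: count} dictionary that createTree() expects.
def createInitSet(dataSet):
    retDict = {}
    for trans in dataSet:
        # Accumulate the count so that repeated transactions are not lost
        retDict[frozenset(trans)] = retDict.get(frozenset(trans), 0) + 1
    return retDict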

import importlib

import fpGrowth

# reload the module after editing fpGrowth.py
importlib.reload(fpGrowth)

# load the example dataset
simpDat = fpGrowth.loadSimpDat()
simpDat

# format the data as a dictionary for createTree()
initSet = fpGrowth.createInitSet(simpDat)
initSet

# create the FP-tree
myFPtree, myHeaderTab = fpGrowth.createTree(initSet, 3)
myFPtree.disp()

# Each item and its frequency count are displayed, with the indentation representing the depth in the tree.

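Now that the FP-tree has been built, it can be used to mine frequent itemsets.

12.3 Mining frequent itemsets from an FP-tree

The three basic steps for extracting frequent itemsets from the FP-tree are as follows:

(1) Get conditional pattern bases from the FP-tree;

(2) From each conditional pattern base, construct a conditional FP-tree;

(3) Recursively repeat steps (1) and (2) until the tree contains a single item.

The focus first falls on step (1), finding the conditional pattern bases. After that, a conditional FP-tree is created from each conditional pattern base. Finally, a little code is needed to wrap these two functions together and obtain the frequent itemsets from the FP-tree.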
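The shape of this recursion can be seen without any tree at all. The sketch below is not the chapter's implementation: it keeps each conditional pattern base as a plain dictionary of {frozenset: count} and recurses on it directly, so the dictionary stands in for the conditional FP-tree. `mine_patterns` is a hypothetical name, and the alphabetical tie-breaking is an assumption of the sketch.

```python
from collections import Counter

def mine_patterns(patterns, min_sup, prefix=frozenset(), found=None):
    """Pattern-growth recursion on a {frozenset: count} pattern base."""
    if found is None:
        found = {}
    # Count the items in this (conditional) pattern base
    counts = Counter()
    for items, n in patterns.items():
        for item in items:
            counts[item] += n
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    # Fix one order: descending frequency, ties broken by item name
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}
    for item in order:
        new_prefix = prefix | {item}
        found[new_prefix] = frequent[item]
        # Step (1): conditional pattern base = the frequent items ranked
        # ahead of `item` in every transaction that contains it
        cond = Counter()
        for items, n in patterns.items():
            if item in items:
                before = frozenset(i for i in items
                                   if i in rank and rank[i] < rank[item])
                if before:
                    cond[before] += n
        # Steps (2) and (3): recurse on the conditional base
        mine_patterns(cond, min_sup, new_prefix, found)
    return found

simp_dat = [['r', 'z', 'h', 'j', 'p'],
            ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
            ['z'],
            ['r', 'x', 'n', 'o', 's'],
            ['y', 'r', 'x', 'z', 'q', 't', 'p'],
            ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
init = Counter(frozenset(t) for t in simp_dat)
result = mine_patterns(init, 3)
print(len(result))
```

Re-scanning a dictionary at every level is slower than walking an FP-tree through its header table, but the recursion has the same structure: condition on one item, build the smaller problem, and repeat until nothing frequent remains.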

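12.3.1 Extracting conditional pattern bases

Start with the single frequent items already stored in the header table. For each of these items, get its conditional pattern base. The conditional pattern base is the collection of paths that end with the item being looked for. Each of those paths is a prefix path: everything that lies between the item being looked for and the root of the tree. Each prefix path carries a count, namely the count of the node for that item at the end of the path.

The prefix paths are used to build the conditional FP-tree. They could be found by searching the tree exhaustively until the desired frequent item is hit, but the header table created earlier gives a more efficient method. The header table holds the starting pointer of the linked list of nodes of each item type, so once a node of the item is reached, the tree can be ascended from there until the root is reached.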
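A conditional pattern base can also be read straight off the ordered transactions, which makes a handy cross-check for the tree-based code. In this sketch `conditional_pattern_base` is a hypothetical helper, and the transactions are the sample data filtered at a minimum support of 3 and sorted by descending frequency with alphabetical tie-breaking (an assumption, so the paths may differ from a tree built with another tie order).

```python
from collections import Counter

# Sample transactions after filtering and sorting (assumed order: z, x, r, s, t, y)
ordered = [
    ['z', 'r'],
    ['z', 'x', 's', 't', 'y'],
    ['z'],
    ['x', 'r', 's'],
    ['z', 'x', 'r', 't', 'y'],
    ['z', 'x', 's', 't', 'y'],
]

def conditional_pattern_base(ordered_transactions, item):
    """Collect the prefix that precedes `item` in each transaction."""
    base = Counter()
    for transaction in ordered_transactions:
        if item in transaction:
            prefix = frozenset(transaction[:transaction.index(item)])
            if prefix:                  # an empty prefix carries no information
                base[prefix] += 1
    return dict(base)

print(conditional_pattern_base(ordered, 'r'))
print(conditional_pattern_base(ordered, 't'))
```

Each key is one prefix path and each value is how many transactions followed it, which is the same information the node count provides when the path is collected by ascending the tree.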

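# Functions to find all prefix paths ending with a given item.

# Ascend from leafNode to the root, collecting the item names along the way
def ascendTree(leafNode, prefixPath):
    if leafNode.parent is not None:
        prefixPath.append(leafNode.name)
        ascendTree(leafNode.parent, prefixPath)

# basePat is the item being looked for; treeNode is the first node for that
# item, as stored in the header table
def findPrefixPath(basePat, treeNode):
    condPats = {}
    while treeNode is not None:
        prefixPath = []
        ascendTree(treeNode, prefixPath)
        # prefixPath[0] is the item itself, so a real prefix needs length > 1
        if len(prefixPath) > 1:
            condPats[frozenset(prefixPath[1:])] = treeNode.count
        treeNode = treeNode.nodeLink
    return condPats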

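12.3.2 Creating conditional FP-trees

For each frequent item, a conditional FP-tree is created. It is built from that item's conditional pattern base with the same createTree() code used for the full tree.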

def mineTree(inTree, headerTable, minSup, preFix, freqItemList):
    # Start from the bottom of the header table (least frequent items first)
    bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1][0])]
    for basePat in bigL:
        newFreqSet = preFix.copy()
        newFreqSet.add(basePat)
        freqItemList.append(newFreqSet)
        # Construct cond. FP-tree from cond. pattern base
        condPattBases = findPrefixPath(basePat, headerTable[basePat][1])
        myCondTree, myHead = createTree(condPattBases, minSup)
        # Mine cond. FP-tree
        if myHead is not None:
            print('conditional tree for: ', newFreqSet)
            myCondTree.disp(1)
            mineTree(myCondTree, myHead, minSup, newFreqSet, freqItemList)

import importlib

import fpGrowth

importlib.reload(fpGrowth)

# list that collects the frequent itemsets
freqItems = []

fpGrowth.mineTree(myFPtree, myHeaderTab, 3, set([]), freqItems)

freqItems

# The itemsets match the conditional FP-trees, which is what you'd expect.

The complete FP-growth algorithm is now ready to run. The full fpGrowth.py module consists of the treeNode class and the functions createTree, updateTree, updateHeader, loadSimpDat, createInitSet, ascendTree, findPrefixPath, and mineTree listed in the preceding sections.

12.4 Example: finding co-occurring words in a Twitter feed

This example uses a Python library called python-twitter.

# Listing 12.6: code to access the Twitter Python library

import twitter

from time import sleep
import re

def getLotsOfTweets(searchStr):
    CONSUMER_KEY = 'get when you create an app'
    CONSUMER_SECRET = 'get when you create an app'
    ACCESS_TOKEN_KEY = 'get from Oauth, specific to a user'
    ACCESS_TOKEN_SECRET = 'get from Oauth, specific to a user'
    api = twitter.Api(consumer_key=CONSUMER_KEY, consumer_secret=CONSUMER_SECRET,
                      access_token_key=ACCESS_TOKEN_KEY,
                      access_token_secret=ACCESS_TOKEN_SECRET)

    # you can get 1500 results: 15 pages * 100 per page
    resultPages = []
    for i in range(1, 16):
        print("fetching page %d" % i)
        # per_page and page follow the python-twitter release this example was
        # written against; newer releases may expose different parameters
        searchResults = api.GetSearch(searchStr, per_page=100, page=i)
        resultPages.append(searchResults)
        sleep(6)  # pause between requests
    return resultPages
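The listing above only fetches pages of results. Before FP-growth can be applied, the text of each tweet has to be turned into a transaction of words. The following is a minimal sketch of that step on plain strings; `text_parse` is a hypothetical helper, the URL pattern is a simplification, and the sample strings are made up for illustration rather than real tweets.

```python
import re

def text_parse(text):
    """Split a piece of text into lowercase word tokens.

    URLs are removed first, then the text is split on non-word characters,
    and tokens of two characters or fewer are dropped.
    """
    no_urls = re.sub(r'https?://\S+', '', text)
    tokens = re.split(r'\W+', no_urls)
    return [tok.lower() for tok in tokens if len(tok) > 2]

# Made-up sample texts, for illustration only
samples = [
    "RT @user: FP-growth is FAST http://t.co/abc",
    "Mining frequent itemsets with FP-growth",
]

# Build the {frozenset: count} dictionary in the format createTree() expects
init_set = {}
for sample in samples:
    words = frozenset(text_parse(sample))
    init_set[words] = init_set.get(words, 0) + 1

print(text_parse(samples[0]))
```

The resulting dictionary has the same shape as the output of createInitSet(), so it could be passed to createTree() and mineTree() with a suitable minimum support.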

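12.6 Summary

The FP-growth algorithm is an efficient way of finding frequent patterns in a dataset. It relies on the Apriori principle but runs faster. Apriori generates candidate itemsets and then scans the dataset to check whether they are frequent; FP-growth is faster because it scans the dataset only twice. In FP-growth the dataset is stored in a structure called an FP-tree. Once the FP-tree is built, frequent itemsets can be found by looking up the conditional pattern bases of items and building conditional FP-trees. This process is repeated, conditioning on more and more items, until the conditional FP-tree contains only one item. The FP-growth algorithm can be used to find frequent words in many kinds of text documents. Twitter provides developers with a number of APIs for using its services, and the Python module python-twitter makes Twitter easy to access. Applying FP-growth to a Twitter feed on some topic yields summary information about that topic. Frequent itemset generation has a number of other uses as well, such as shopping transactions, medical diagnosis, and atmospheric research.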