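This post covers the following:
- Finding common patterns in transaction data
- The FP-growth algorithm
- Finding co-occurring words in a Twitter feed
The FP-growth algorithm builds on Apriori but uses different techniques to accomplish the same task. It first stores the dataset in a special structure called an FP-tree and then discovers frequent itemsets or frequent item pairs, that is, sets of items that commonly occur together. It runs faster than Apriori, often by two orders of magnitude or more.
Among the methods for extracting interesting information from a dataset, the two most commonly used are frequent itemsets and association rules.
FP-growth is faster because it scans the dataset only twice, whereas Apriori rescans the dataset to decide whether each candidate itemset is frequent. On small datasets the repeated scanning is not a problem, but on larger datasets it becomes a serious cost. The basic approach FP-growth uses to find frequent itemsets is as follows:
- Build the FP-tree
- Mine frequent itemsets from the FP-tree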
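12.1 FP-trees: an efficient way to encode a dataset
The FP-growth algorithm stores data in a compact data structure called an FP-tree. FP stands for Frequent Pattern.
Links between similar items, known as node links, are used to quickly find the location of similar items.
The FP-growth algorithm works as follows. First the FP-tree is built, and then it is mined for frequent itemsets. Building the FP-tree requires two scans of the original dataset. The first scan counts the occurrences of every item. Recall the Apriori principle from chapter 11: if an item is infrequent, every superset containing that item is also infrequent, so those supersets need not be considered. The first scan of the database therefore collects the frequencies, and the second scan considers only the frequent items.
12.2 Building the FP-tree
12.2.1 Creating the FP-tree data structure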
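The first scan can be sketched on its own, before any tree exists. This is a minimal illustration, not part of fpGrowth.py: `count_frequent_items` is a hypothetical helper name, and the transactions are the small example dataset used later in this post.

```python
from collections import Counter

def count_frequent_items(transactions, min_support):
    """First scan: count items and keep those meeting min_support."""
    counts = Counter()
    for transaction in transactions:
        # set() so an item repeated inside one transaction counts once
        counts.update(set(transaction))
    return {item: n for item, n in counts.items() if n >= min_support}

transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['z'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]
frequent = count_frequent_items(transactions, min_support=3)
print(sorted(frequent.items()))
# [('r', 3), ('s', 3), ('t', 3), ('x', 4), ('y', 3), ('z', 5)]
```

Only these six items need to be considered in the second scan; every other item, and every itemset containing one, is pruned by the Apriori principle.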
r'''
Author: Maxwell Pan
Date: 2022-05-03 04:53:59
LastEditTime: 2022-05-03 05:09:31
FilePath: \cp12\fpGrowth.py
Description: Efficiently finding frequent itemsets with FP-Growth
Software:VSCode,env:
'''
# The FP-tree is more involved than the other trees in this book,
# so we create a class to hold each node of the tree.
class treeNode:
    def __init__(self, nameValue, numOccur, parentNode):
        self.name = nameValue     # name of the item held by this node
        self.count = numOccur     # count of the item along this path
        self.nodeLink = None      # link to the next node holding the same item
        self.parent = parentNode  # parent of this node in the tree
        self.children = {}        # child nodes, keyed by item name

    # inc() increments the count variable by a given amount.
    def inc(self, numOccur):
        self.count += numOccur

    # disp() displays the tree as text. It isn't needed to create the tree,
    # but it's useful for debugging.
    def disp(self, ind=1):
        print(' ' * ind, self.name, ' ', self.count)
        for child in self.children.values():
            child.disp(ind + 1)
Trying out the node class:
import fpGrowth
# create a single tree node
rootNode = fpGrowth.treeNode('pyramid', 9, None)
# add a child node to it (rootNode is passed as the child's parent)
rootNode.children['eye'] = fpGrowth.treeNode('eye', 13, rootNode)
# display the tree with its child node
rootNode.disp()
# add another node to see how two child nodes are displayed
rootNode.children['phoenix'] = fpGrowth.treeNode('phoenix', 3, rootNode)
rootNode.disp()
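12.2.2 Constructing the FP-tree
In addition to the tree, a header table is needed that points to the first instance of each item type. With the header table, all elements of a given type in the FP-tree can be accessed quickly.
A dictionary serves as the data structure for the header table. Besides storing the pointers, the header table can also hold the total count of each item type in the FP-tree.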
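The shape of the header table can be illustrated in isolation. The sketch below is an assumption-laden toy, not the post's implementation: `Node`, `node_link` and `linked_nodes` are hypothetical names, and the two nodes are wired by hand rather than produced by a real tree build.

```python
class Node:
    """Hypothetical minimal node: just enough fields to show node links."""
    def __init__(self, name, count):
        self.name = name
        self.count = count
        self.node_link = None   # next node in the tree holding the same item

# Two separate tree nodes that both hold item 'x' (on different branches).
x_first = Node('x', 3)
x_second = Node('x', 1)
x_first.node_link = x_second

# Header table: item -> [total count, first node of the linked list]
header_table = {'x': [4, x_first]}

def linked_nodes(header_table, item):
    """Yield every node for `item` by following node links."""
    node = header_table[item][1]
    while node is not None:
        yield node
        node = node.node_link

counts = [node.count for node in linked_nodes(header_table, 'x')]
print(counts, sum(counts) == header_table['x'][0])
# [3, 1] True
```

The per-node counts along the linked list add up to the total stored in the header table, which is a handy invariant to check when debugging a tree.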
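Building the tree proceeds as follows:
- The first pass over the dataset obtains the occurrence frequency of each item.
- Items that do not meet the minimum support are removed.
- The FP-tree is built. Each itemset is read in and added to an existing path; if no such path exists, a new one is created. Each transaction is an unordered set, so the same items may arrive in different orders (for example {z, x, y} and {y, z, x}), yet identical itemsets should be represented only once in the tree. To solve this, each set is sorted before it is added to the tree, based on the absolute occurrence frequency of the items. Once the transactions have been filtered and sorted, the FP-tree can be built.
- Starting from the empty set (denoted ∅), frequent itemsets are added one after another. The filtered and sorted transactions are added to the tree in turn: if an item already exists in the tree, its count is incremented; if it does not, a new branch is added to the tree.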
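The filter-and-sort step for a single transaction can be sketched as follows. `order_transaction` is a hypothetical helper, and breaking ties alphabetically is an assumption made here so that items with equal counts always come out in the same order; the counts are those of the frequent items (minimum support 3) in the small example dataset used in this post.

```python
def order_transaction(transaction, item_counts):
    """Drop infrequent items, then sort by descending global count.

    Ties are broken alphabetically so every transaction uses the same order.
    """
    kept = [item for item in set(transaction) if item in item_counts]
    return sorted(kept, key=lambda item: (-item_counts[item], item))

# Global counts of the frequent items in the example dataset
item_counts = {'z': 5, 'x': 4, 'r': 3, 's': 3, 't': 3, 'y': 3}

first = order_transaction(['r', 'z', 'h', 'j', 'p'], item_counts)
second = order_transaction(['y', 'z', 'x', 'e', 'q', 's', 't', 'm'], item_counts)
print(first)   # ['z', 'r']
print(second)  # ['z', 'x', 's', 't', 'y']
```

Because both transactions start with 'z', they share the first node of their path in the tree, which is what makes the structure compact.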
# FP-tree creation code
# createTree() takes the dataset and the minimum support as arguments
# and builds the FP-tree.
def createTree(dataSet, minSup=1):
    headerTable = {}
    # First pass: count the frequency of each item across the dataset.
    for trans in dataSet:
        for item in trans:
            headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
    # Remove items not meeting min support
    # (iterate over a copy of the keys so the dict can be modified).
    for k in list(headerTable.keys()):
        if headerTable[k] < minSup:
            del headerTable[k]
    freqItemSet = set(headerTable.keys())
    # If no items meet min support, exit.
    if len(freqItemSet) == 0:
        return None, None
    # Extend each entry to [count, link to the first node for that item].
    for k in headerTable:
        headerTable[k] = [headerTable[k], None]
    retTree = treeNode('Null Set', 1, None)
    # Second pass: insert each filtered, sorted transaction into the tree.
    for tranSet, count in dataSet.items():
        localD = {}
        for item in tranSet:
            if item in freqItemSet:
                localD[item] = headerTable[item][0]
        if len(localD) > 0:
            # Sort by descending global frequency. Ties are broken by item
            # name so equally frequent items are ordered the same way in
            # every transaction.
            orderedItems = [v[0] for v in sorted(localD.items(),
                                                 key=lambda p: (-p[1], p[0]))]
            # Populate the tree with the ordered frequent itemset.
            updateTree(orderedItems, retTree, headerTable, count)
    return retTree, headerTable

def updateTree(items, inTree, headerTable, count):
    if items[0] in inTree.children:
        # The first item is already a child: increment its count.
        inTree.children[items[0]].inc(count)
    else:
        # Otherwise create a new child node and link it into the header table.
        inTree.children[items[0]] = treeNode(items[0], count, inTree)
        if headerTable[items[0]][1] is None:
            headerTable[items[0]][1] = inTree.children[items[0]]
        else:
            updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
    # Recursively call updateTree on the remaining items.
    if len(items) > 1:
        updateTree(items[1:], inTree.children[items[0]], headerTable, count)

def updateHeader(nodeToTest, targetNode):
    # Walk to the end of the linked list and append the new node.
    while nodeToTest.nodeLink is not None:
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode

# Simple dataset and data wrapper
# loadSimpDat() returns a list of transactions.
def loadSimpDat():
    simpDat = [['r', 'z', 'h', 'j', 'p'],
               ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
               ['z'],
               ['r', 'x', 'n', 'o', 's'],
               ['y', 'r', 'x', 'z', 'q', 't', 'p'],
               ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
    return simpDat

# createInitSet() converts the list into a dict of frozenset -> count,
# the input format createTree() expects.
def createInitSet(dataSet):
    retDict = {}
    for trans in dataSet:
        key = frozenset(trans)
        # Accumulate so that identical transactions are counted, not overwritten.
        retDict[key] = retDict.get(key, 0) + 1
    return retDict
import importlib
import fpGrowth
# reload picks up edits to fpGrowth.py (importlib replaces the deprecated imp module)
importlib.reload(fpGrowth)
# load the example dataset
simpDat = fpGrowth.loadSimpDat()
simpDat
# format it for createTree()
initSet = fpGrowth.createInitSet(simpDat)
initSet
# create the FP-tree
myFPtree, myHeaderTab = fpGrowth.createTree(initSet, 3)
myFPtree.disp()
# Each item and its frequency count are displayed, with indentation representing the depth in the tree.
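Now that the FP-tree is built, it can be used to mine frequent itemsets.
12.3 Mining frequent itemsets from an FP-tree
Extracting frequent itemsets from the FP-tree takes three basic steps:
(1) Get the conditional pattern bases from the FP-tree;
(2) From each conditional pattern base, build a conditional FP-tree;
(3) Repeat steps (1) and (2) recursively until the tree contains a single item.
The focus first is on step (1), finding the conditional pattern bases. After that, a conditional FP-tree is created for each conditional pattern base. Finally, a little code is needed to wrap these two functions and extract the frequent itemsets from the FP-tree.
12.3.1 Extracting conditional pattern bases
Start from the individual frequent items already stored in the header table. For each item, obtain its conditional pattern base, which is the set of paths that end with the item being looked up. Each of these paths is a prefix path: everything between the item being looked up and the root of the tree.
Prefix paths are used to build conditional FP-trees. They could be found by exhaustively searching the tree until the desired frequent item is reached, but the header table offers a more efficient method. It holds the starting pointer of the linked list for each item type, and from each node on that list the tree can be climbed up to the root.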
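# Functions to find all paths ending with a given item.
def ascendTree(leafNode, prefixPath):
    # Climb from leafNode towards the root, collecting item names on the way.
    if leafNode.parent is not None:
        prefixPath.append(leafNode.name)
        ascendTree(leafNode.parent, prefixPath)

def findPrefixPath(basePat, treeNode):
    # treeNode is the first node for basePat, taken from the header table.
    condPats = {}
    while treeNode is not None:
        prefixPath = []
        ascendTree(treeNode, prefixPath)
        # prefixPath[0] is the item itself, so the prefix path is the rest.
        if len(prefixPath) > 1:
            condPats[frozenset(prefixPath[1:])] = treeNode.count
        # Follow the node link to the next node holding the same item.
        treeNode = treeNode.nodeLink
    return condPats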
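To see what a prefix path looks like without building a full tree, here is a self-contained sketch on a hand-built branch. `Node` and `prefix_path` are hypothetical names, and the loop is an iterative variant of the climb rather than the recursive function used in this post.

```python
class Node:
    """Hypothetical minimal tree node with a parent pointer."""
    def __init__(self, name, count, parent):
        self.name = name
        self.count = count
        self.parent = parent

def prefix_path(node):
    """Return the items strictly between `node` and the root, nearest first."""
    path = []
    current = node.parent
    # Stop at the root, which is the only node whose parent is None.
    while current is not None and current.parent is not None:
        path.append(current.name)
        current = current.parent
    return path

# Hand-built branch: root -> z -> x -> y
root = Node('Null Set', 1, None)
z = Node('z', 5, root)
x = Node('x', 3, z)
y = Node('y', 3, x)

print(prefix_path(y))  # ['x', 'z']
print(prefix_path(z))  # []
```

A node directly under the root has an empty prefix path, which is why such paths contribute nothing to a conditional pattern base.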
12.3.2 Creating conditional FP-trees
A conditional FP-tree has to be created for every frequent item.
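Before a conditional FP-tree is built, the items in the conditional pattern base are counted and filtered by minimum support, just as in the first scan of the full dataset. The sketch below shows only that counting step; `conditional_item_counts` is a hypothetical helper and the pattern base for 't' is made-up illustrative input.

```python
from collections import Counter

def conditional_item_counts(cond_pattern_base, min_support):
    """Count items across weighted prefix paths and keep the frequent ones."""
    counts = Counter()
    for path, weight in cond_pattern_base.items():
        for item in path:
            counts[item] += weight   # each path counts `weight` times
    return {item: n for item, n in counts.items() if n >= min_support}

# Hypothetical conditional pattern base for an item 't':
# two prefix paths, each with the count of the 't' node that ends it.
cond_pattern_base = {
    frozenset(['z', 'x', 'y', 's']): 2,
    frozenset(['z', 'x', 'y', 'r']): 1,
}
kept = conditional_item_counts(cond_pattern_base, 3)
print(sorted(kept.items()))
# [('x', 3), ('y', 3), ('z', 3)]
```

Here 's' and 'r' fall below the minimum support of 3 in the context of 't', so the conditional tree for 't' would contain only 'x', 'y' and 'z'.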
def mineTree(inTree, headerTable, minSup, preFix, freqItemList):
    # Start from the bottom of the header table: sort items by ascending count.
    # Each header entry is [count, firstNode], so the sort key is p[1][0].
    bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1][0])]
    for basePat in bigL:
        newFreqSet = preFix.copy()
        newFreqSet.add(basePat)
        freqItemList.append(newFreqSet)
        # Construct the conditional FP-tree from the conditional pattern base.
        condPattBases = findPrefixPath(basePat, headerTable[basePat][1])
        myCondTree, myHead = createTree(condPattBases, minSup)
        # Mine the conditional FP-tree if it still holds any frequent item.
        if myHead is not None:
            print('conditional tree for: ', newFreqSet)
            myCondTree.disp(1)
            mineTree(myCondTree, myHead, minSup, newFreqSet, freqItemList)
import importlib
import fpGrowth
importlib.reload(fpGrowth)
# mine the FP-tree built earlier, collecting the frequent itemsets in freqItems
freqItems = []
fpGrowth.mineTree(myFPtree, myHeaderTab, 3, set([]), freqItems)
freqItems
# The itemsets match the conditional FP-trees, which is what you'd expect.
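One way to sanity-check the mined result is a brute-force count on the same small dataset. This sketch is not part of fpGrowth.py; `brute_force_frequent` is a hypothetical helper that is only practical for tiny inputs, since it enumerates combinations of the frequent items directly.

```python
from itertools import combinations

def brute_force_frequent(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    sets = [frozenset(t) for t in transactions]

    def support(candidate):
        return sum(1 for s in sets if s.issuperset(candidate))

    # Only frequent single items can appear in a frequent itemset.
    items = sorted(i for i in set().union(*sets) if support([i]) >= min_support)
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            n = support(candidate)
            if n >= min_support:
                frequent[frozenset(candidate)] = n
                found = True
        if not found:
            break   # no frequent itemset of this size, so none larger either
    return frequent

simpDat = [['r', 'z', 'h', 'j', 'p'],
           ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
           ['z'],
           ['r', 'x', 'n', 'o', 's'],
           ['y', 'r', 'x', 'z', 'q', 't', 'p'],
           ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
expected = brute_force_frequent(simpDat, 3)
print(len(expected))                      # 18
print(expected[frozenset('zxyt')])        # 3
```

With a minimum support of 3 the brute force finds 18 frequent itemsets, the largest being {z, x, y, t}; comparing `set(expected)` with `{frozenset(s) for s in freqItems}` is a quick consistency check on the FP-growth output.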
The complete FP-growth algorithm is now runnable. The full fpGrowth.py is the code from the sections above collected in one file: treeNode, createTree, updateTree, updateHeader, loadSimpDat, createInitSet, ascendTree, findPrefixPath and mineTree.
12.4 Example: finding co-occurring words in a Twitter feed
This example uses a Python library called python-twitter.
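# 12.6 code to access the Twitter Python library
import twitter
from time import sleep
import re

def getLotsOfTweets(searchStr):
    # Replace the placeholders with the credentials of your own Twitter app.
    CONSUMER_KEY = 'get when you create an app'
    CONSUMER_SECRET = 'get when you create an app'
    ACCESS_TOKEN_KEY = 'get from Oauth, specific to a user'
    ACCESS_TOKEN_SECRET = 'get from Oauth, specific to a user'
    api = twitter.Api(consumer_key=CONSUMER_KEY, consumer_secret=CONSUMER_SECRET,
                      access_token_key=ACCESS_TOKEN_KEY,
                      access_token_secret=ACCESS_TOKEN_SECRET)
    # you can get 1500 results: 15 pages * 100 per page
    resultPages = []
    for i in range(1, 16):
        print("fetching page %d" % i)
        # Assumption: per_page and page are the keyword arguments accepted by
        # GetSearch in the python-twitter release this code was written for.
        # Other releases may expect different arguments, so check the
        # documentation of the installed version.
        searchResults = api.GetSearch(searchStr, per_page=100, page=i)
        resultPages.append(searchResults)
        # pause between requests to avoid making too many calls too quickly
        sleep(6)
    return resultPages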
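Before FP-growth can run on tweets, each tweet's text has to be turned into a list of words, which then plays the role of one transaction. The following is a minimal sketch of such a step under stated assumptions: `parse_tweet_text` is a hypothetical helper, and the rules used here (drop URLs, split on non-word characters, keep lowercase tokens longer than two characters) are one simple choice rather than a prescribed method.

```python
import re

def parse_tweet_text(text):
    """Strip URLs, split on non-word characters, keep lowercase tokens > 2 chars."""
    no_urls = re.sub(r'https?://\S+|www\.\S+', '', text)
    tokens = re.split(r'\W+', no_urls)
    return [tok.lower() for tok in tokens if len(tok) > 2]

sample = "FP-growth is fast! Details at http://example.com/fp #DataMining"
words = parse_tweet_text(sample)
print(words)
# ['growth', 'fast', 'details', 'datamining']
```

A list of such word lists has the same shape as the output of loadSimpDat(), so it can be passed through createInitSet() and createTree() in the same way.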
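12.6 Summary
The FP-growth algorithm is an efficient way of finding frequent patterns in a dataset. It relies on the Apriori principle but runs faster: Apriori generates candidate itemsets and then scans the dataset to check whether they are frequent, whereas FP-growth scans the dataset only twice. In FP-growth the dataset is stored in a structure called an FP-tree. Once the FP-tree is built, frequent itemsets are found by extracting the conditional pattern base of each item and building conditional FP-trees. This process repeats, conditioning on more and more items, until the FP-tree contains only a single item. FP-growth can be used to find frequent words in many kinds of text documents. Twitter offers developers a large set of APIs for using its services, and the Python module python-twitter makes it easy to access Twitter. Applying FP-growth to a Twitter feed on a given topic yields some summary information about that topic. Frequent itemset generation has other applications as well, such as shopping transactions, medical diagnosis and atmospheric research.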