NLTK: Language Processing and Text

----Language Processing and Text----

Importing texts from NLTK

from nltk.book import *
texts()
help(Text)

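Text operations

1. Concordance search

concordance() takes a word as a string and prints each occurrence of that word in the text together with its surrounding context. Looking at a word's contexts helps us work out its part of speech. The optional second and third arguments set the display width in characters and the number of lines to print.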

text1.concordance('have',30,10)
text1.concordance('have')
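
To make the idea concrete, here is a minimal pure-Python sketch of a keyword-in-context search. It is not NLTK's implementation: the helper name kwic and its token-based window parameter are assumptions made for illustration (NLTK's concordance measures the width in characters instead).

```python
def kwic(tokens, word, window=3):
    """Return (left, match, right) context tuples for each hit of `word`."""
    hits = []
    target = word.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:          # case-insensitive match
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

tokens = "I have a dream that we have all been waiting for".split()
for left, match, right in kwic(tokens, "have", window=2):
    print(f"{left:>12} [{match}] {right}")
```

Each hit is shown with up to two tokens on either side, which is the same kind of view that concordance() prints for a real text.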

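2. Finding words by context
similar() prints other words that occur in the same contexts as the input word, that is, words whose usage and meaning are similar to it. The second argument is the number of words to show and defaults to 20.

text1.similar('have', 5)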
text1.similar('time')

common_contexts() finds the contexts shared by two or more given words:

text1.common_contexts(['have','time'])
text2.common_contexts(['monstrous','very'])
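
The notion of a "context" can be sketched in plain Python as the pair (previous word, next word). The helper names below are hypothetical and the sentence is made up; NLTK's own similar() ranks candidates with a more elaborate scoring, so treat this only as an illustration of the idea.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each word to the set of (previous, next) pairs it occurs between."""
    ctx = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word].add((prev, nxt))
    return ctx

def shared_contexts(tokens, words):
    """Contexts in which every word of `words` has been seen."""
    ctx = contexts_by_word(tokens)
    return set.intersection(*(ctx[w] for w in words))

tokens = "a very pretty girl and a monstrous pretty girl met a very old man".split()
print(shared_contexts(tokens, ["very", "monstrous"]))
```

Here "very" and "monstrous" both appear between "a" and "pretty", so that pair is their only shared context.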

3. Finding collocations
Collocations are bigrams (pairs of adjacent words) that frequently occur together:

text4.collocations()
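
As a rough illustration, frequent bigrams can be counted with the standard library alone. This is a simplification: NLTK's collocations() also filters out stopwords and ranks pairs with an association measure rather than raw counts. The function name top_bigrams and the sample sentence are hypothetical.

```python
from collections import Counter

def top_bigrams(tokens, n=3):
    """Return the n most frequent adjacent word pairs with their counts."""
    pairs = zip(tokens, tokens[1:])
    return Counter(pairs).most_common(n)

tokens = "the United States and the United Nations met in the United States".split()
print(top_bigrams(tokens, n=2))
```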

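4. Drawing a lexical dispersion plot
dispersion_plot() draws a scatter plot showing where each of the given words occurs across the text (this requires matplotlib):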

text4.dispersion_plot(["citizens","democracy","freedom","duties","American"])

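5. Counting and sorting the vocabulary of a text

len(text3)            # number of tokens in text3
len(set(text3))       # number of distinct tokens
sorted(set(text3))    # distinct tokens in sorted order

Exercise: counting and sorting vocabulary
1. Count the total number of tokens in text4.
2. Count the number of distinct tokens in text4.
3. Print the first 100 tokens of text4.
4. Print the first 100 tokens of text4 after removing duplicates.
5. Remove duplicates from text4, sort the result, and print the first 20 tokens.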

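print(len(text4))
print(len(set(text4)))
print(text4[:100])
a = list(set(text4))      # set order is arbitrary, so these 100 can vary between runs
print(a[:100])
b = sorted(set(text4))    # sorted() already returns a list
print(b[:20])

Note: a set cannot be sliced; convert it to a list first and then slice.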
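
The following short sketch shows why the conversion is needed, using a toy word list instead of text4. Because a set has no defined order, sorting before slicing is what gives a reproducible "first n" selection.

```python
words = ["the", "cat", "sat", "on", "the", "mat"]
unique = set(words)

try:
    unique[:2]                    # sets do not support indexing or slicing
except TypeError as err:
    print("TypeError:", err)

first_two = sorted(unique)[:2]    # sorted() returns a list, so slicing works
print(first_two)                  # ['cat', 'mat']
```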

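6. Measuring lexical richness
Average number of times each word is used in a text: len(x) / len(set(x))
Number of times a given word occurs in a text: x.count("y")
Its share of the text as a percentage: 100 * x.count("y") / len(x)

def lexical_diversity(text):
    # average number of occurrences per distinct word in the text
    return len(text) / len(set(text))

def percentage(count, total):
    # share of the text taken up by a word, as a percentage
    return 100 * count / total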
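
A usage sketch for the two measures on a toy token list. The sample sentence is made up for illustration; with NLTK loaded, text3 or any other text could be passed instead. The functions are defined again here so that the snippet runs on its own.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))

def percentage(count, total):
    """Share of `count` in `total`, as a percentage."""
    return 100 * count / total

tokens = "to be or not to be that is the question".split()
print(lexical_diversity(tokens))                      # 10 tokens / 8 types = 1.25
print(percentage(tokens.count("be"), len(tokens)))    # 2 of 10 tokens = 20.0
```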

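Simple text statistics

1. Frequency distributions
Use FreqDist to compute a frequency distribution. Note that the result is a dictionary-like structure, so its keys must be converted to a list before they can be sliced.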

fdist1 = FreqDist(text1)

vocabulary1=list(fdist1.keys())
vocabulary1[0:10]

fdist1.plot(50,cumulative=True)

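FreqDist is a subclass of dict (in NLTK 3, via collections.Counter); by default the keys are the words and the values are their counts.
➢http://www.nltk.org/_modules/nltk/probability.html#ConditionalFreqDist
To print the dictionary: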

for key in fdist1:
    print(key, fdist1[key])

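fdist = FreqDist(samples)     # create a frequency distribution from the given samples
fdist[sample] += 1            # add one occurrence of a sample (NLTK 3; replaces the old fdist.inc(sample))
fdist['sample']               # number of times the given sample occurred
fdist.freq('sample')          # relative frequency of the given sample
fdist.N()                     # total number of samples
fdist.keys()                  # the distinct samples (sorted by decreasing frequency only in NLTK 2)
for sample in fdist:          # iterate over the samples (in decreasing frequency only in NLTK 2)
fdist.max()                   # the sample with the highest count
fdist.plot(n)                 # plot the n most frequent samples
fdist.tabulate(n)             # print the n most frequent samples as a table
fdist.most_common(n)          # list of the n most frequent (sample, count) pairs
fdist1 < fdist2               # test whether samples in fdist1 occur less frequently than in fdist2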

2. Selecting long words

V = set(text1)
long_words = [w for w in V if len(w)>15]
sorted(long_words)

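Exercise: selecting long words

Find the words in text1 that are 12 to 14 characters long.
Find the words in text2 that are 12 to 14 characters long.
Find the words that appear in both lists.

V = set(text1)
U = set(text2)
long_words1 = [w for w in V if 12 <= len(w) <= 14]   # chained comparison, bounds inclusive
print(long_words1)
long_words2 = [w for w in U if 12 <= len(w) <= 14]
print(long_words2)
common = set(long_words1) & set(long_words2)         # set intersection: words in both lists
print(sorted(common))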
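
One pitfall when writing a length-range filter for an exercise like this: & is the bitwise AND operator and binds more tightly than < and >, so it does not combine two comparisons the way "and" does. The sketch below contrasts it with a chained comparison, using made-up words.

```python
words = ["cat", "horse", "extraordinary", "uncharacteristically"]

# Parsed as: len(w) > (12 & len(w)) < 14, which is not a range test.
buggy = [w for w in words if len(w) > 12 & len(w) < 14]

# Chained comparison: keeps words whose length is 12, 13 or 14.
correct = [w for w in words if 12 <= len(w) <= 14]

print(buggy)      # every word passes, including 'cat'
print(correct)    # ['extraordinary']
```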

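3. Frequency distribution of word lengths in a text

def AndWordFreq():
    fdist = FreqDist([len(w) for w in text2])
    print(fdist.keys())     # the word lengths that occur (use fdist.most_common() for decreasing-frequency order)
    print(fdist.items())    # (length, count) pairs
    print(fdist.max())      # the most common word length
    print(fdist[3])         # number of three-letter words
AndWordFreq()

Note that the keys are different here: because the input is [len(w) for w in text2], each key is a word length rather than a word.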

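Class exercise:

Find the collocations in text5.
Find all four-letter words in text5 (the chat corpus). Using FreqDist, display them in order of frequency from lowest to highest.

text5.collocations()          # prints the collocations itself and returns None
fdist1 = FreqDist([w for w in text5 if len(w) == 4])    # count over all tokens, not the de-duplicated set
print(sorted(fdist1.items(), key=lambda item: item[1]))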

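----Accessing and Processing Text Corpora----

Classic NLTK corpora and how to import them

➢ A selection of texts from the Project Gutenberg electronic text archive
Name: gutenberg

➢ Web text collection
Name: webtext
Contents include: movie scripts, personal advertisements, wine reviews

➢ Instant-messaging chat corpus
Name: nps_chat

➢ Brown Corpus
Name: brown
Description: a resource for studying systematic differences between genres.

➢ Reuters Corpus
Name: reuters
Description: the categories overlap with one another; mainly used for detecting the topic of a text.

➢ Inaugural Address Corpus
Name: inaugural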

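Basic corpus methods defined by NLTK

fileids()                   # the files of the corpus
fileids([categories])       # the files belonging to the given categories
categories()                # the categories of the corpus
categories([fileids])       # the categories of the given files
raw()                       # the raw content of the corpus
raw(fileids=[f1,f2])        # the raw content of the given files (or categories)
words()                     # the words of the whole corpus
words(fileids=[f1,f2])      # the words of the given files (or categories)
sents()                     # the sentences of the corpus; files can be specified as well
abspath(fileid)             # the location of the given file on disk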

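Basic steps for working with a corpus in NLTK:

Import the corpus: from nltk.corpus import gutenberg as gb
Use the imported corpus reader object to work with the texts of the corpus
➢ How many files does the corpus contain? files = gb.fileids()
➢ How many words does a given file contain? words = gb.words(fileids='austen-emma.txt')

Using NLTK corpora, part 1
Goals:
➢ Import the Project Gutenberg corpus, gutenberg
➢ Find out how many files the corpus contains
➢ Find out and print how many words and how many sentences the file 'blake-poems.txt' contains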

from nltk.corpus import gutenberg as gb
print(len(gb.fileids()))
words = gb.words('blake-poems.txt')
print(len(words))
sents = gb.sents('blake-poems.txt')
print(len(sents))

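Using NLTK corpora, part 1: exercise
➢ How many categories does the Brown Corpus have? How many files?
➢ How many news texts are there in the category 'news'?
➢ How many words and how many sentences does the news text 'ca01' contain?
➢ Print the raw text of the news text 'ca02'.

from nltk.corpus import brown
print(len(brown.categories()))          # number of categories
print(len(brown.fileids()))             # number of files in the whole corpus
print(len(brown.fileids(['news'])))     # number of files in the 'news' category
words = brown.words(fileids='ca01')
sents = brown.sents(fileids='ca01')
print(len(words))
print(len(sents))
print(brown.raw(fileids='ca02'))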

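Using NLTK corpora, part 2
➢ Read every text of the Project Gutenberg corpus, gutenberg, and compute for each text the average word length, the average sentence length, and the average number of times each word occurs

from nltk.corpus import gutenberg as gb
for fileid in gb.fileids():
    num_chars = len(gb.raw(fileid))         # number of characters in the raw text
    num_words = len(gb.words(fileid))       # number of tokens in the word list
    num_sents = len(gb.sents(fileid))       # number of sentences
    num_vocab = len(set(w.lower() for w in gb.words(fileid)))   # number of distinct words, ignoring case
    print(fileid, "==========")
    print("Average word length:", num_chars / num_words)
    print("Average sentence length:", num_words / num_sents)
    print("Average occurrences per word:", num_words / num_vocab)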
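
The three ratios in this loop can be checked by hand on a tiny made-up "corpus". The sketch below assumes the text has already been split into words and sentences (in the loop, NLTK's corpus reader does that), and the variable names simply mirror the ones used there. Note that the average word length is computed from the raw character count, so it includes spaces and punctuation.

```python
raw = "The cat sat. The dog sat too."
sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "too", "."]]
words = [w for sent in sents for w in sent]

num_chars = len(raw)                           # 29 characters, spaces included
num_words = len(words)                         # 9 tokens, punctuation included
num_sents = len(sents)                         # 2 sentences
num_vocab = len({w.lower() for w in words})    # 6 distinct lowercase tokens

print("Average word length:", num_chars / num_words)
print("Average sentence length:", num_words / num_sents)       # 4.5
print("Average occurrences per word:", num_words / num_vocab)  # 1.5
```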

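Using NLTK corpora, part 2: exercise
➢ Use the gutenberg corpus module to process 'austen-persuasion.txt'. How many word tokens does the book contain? How many word types (unique words)?
Tokens: counted with duplicates
Word types (unique words): counted after removing duplicates
Use a set to remove the duplicates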

from nltk.corpus import gutenberg as gb
token = gb.words(fileids='austen-persuasion.txt')
print(len(token))
unique = set(token)
print(len(unique))
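
Whether "The" and "the" count as one type depends on how the tokens are normalised. As a hedged illustration on made-up tokens, the sketch below counts types in three ways; which one is appropriate depends on the question being asked.

```python
tokens = ["The", "cat", "saw", "the", "other", "cat", "."]

n_tokens = len(tokens)                                          # 7
n_types = len(set(tokens))                                      # 6: 'The' and 'the' differ
n_types_lower = len({t.lower() for t in tokens})                # 5: case folded
n_word_types = len({t.lower() for t in tokens if t.isalpha()})  # 4: punctuation dropped

print(n_tokens, n_types, n_types_lower, n_word_types)
```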

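Using NLTK corpora, part 3
Examining how words are distributed in a file of the corpus

The FreqDist class (see above)
The ConditionalFreqDist class

➢ Import the Project Gutenberg corpus, gutenberg, and open the text 'austen-emma.txt'
➢ Build the dictionary of words and their frequencies for this text
➢ Plot the words of this text against their frequencies
➢ Find the most frequent item in this text

import nltk
from nltk.corpus import gutenberg as gb
words = gb.words('austen-emma.txt')
fdist1 = nltk.FreqDist(words)
print(list(fdist1.items())[100:150])
fdist1.plot()
print(fdist1.max())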

Compare the frequencies of the words 'can' and 'could' (ignoring case) in every text of the 'news' category of the brown corpus:

import nltk
from nltk.corpus import brown
files = brown.fileids('news')
for file in files:
    words_file = brown.words(file)
    fdis = nltk.FreqDist(w.lower() for w in words_file)
    print(file, 'can:' + str(fdis['can']), 'could:' + str(fdis['could']))

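NLTK has a dedicated class for conditional frequency counts, ConditionalFreqDist, which records the frequencies and distribution of words under several conditions at once.
➢http://www.nltk.org/_modules/nltk/probability.html#ConditionalFreqDist
It is a subclass of defaultdict that maps each condition to a FreqDist, and its methods resemble those of FreqDist. With its data and plotting functions you can draw a comparison of the frequencies of different words across different corpus texts.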
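
A minimal sketch of how the class is fed and queried, assuming NLTK 3 is installed. The (condition, sample) pairs are made up for illustration, so no corpus data is needed.

```python
from nltk import ConditionalFreqDist

# (condition, sample) pairs; here the condition is a made-up genre label.
pairs = [
    ("news", "can"), ("news", "will"), ("news", "will"),
    ("romance", "could"), ("romance", "could"), ("romance", "can"),
]
cfd = ConditionalFreqDist(pairs)

print(sorted(cfd.conditions()))        # ['news', 'romance']
print(cfd["news"]["will"])             # 2
print(cfd["romance"].most_common(1))   # [('could', 2)]
print(cfd["news"]["might"])            # 0, unseen samples count as zero
```

Indexing by a condition returns an ordinary FreqDist, so everything listed for FreqDist applies to each condition separately.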

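➢ Compare the frequencies of the words 'can', 'could', 'may', 'might', 'must' and 'will' across the topics 'news', 'religion', 'hobbies', 'science_fiction', 'romance' and 'humor' of the brown corpus.

import nltk
from nltk.corpus import brown
cfg = nltk.ConditionalFreqDist(
    (title, word)
    # condition: the category (one line per category in the plot)
    for title in ["news", "religion", "hobbies", "science_fiction", "romance", "humor"]
    # sample: every word of that category (the words run along the x-axis)
    for word in brown.words(categories=title)
)
modals = ["can", "could", "may", "might", "must", "will"]
cfg.plot(samples=modals)    # restrict the plot to the modal verbs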

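➢ Using the names corpus, compare how often each final letter occurs in male names and in female names.

# How the frequency of the final letter differs between male and female names
import nltk
from nltk.corpus import names
cfg = nltk.ConditionalFreqDist(
    (sex, word[-1])
    # condition: the file ID, 'female.txt' or 'male.txt'
    for sex in names.fileids()
    # sample: the last letter of each name in that file only
    for word in names.words(sex)
)
cfg.plot()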

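➢ Define a conditional frequency distribution over the names corpus to see which initial letters are more common in male names than in female names.

# Which initial letters are more common in male names than in female names
import nltk
from nltk.corpus import names
cfg = nltk.ConditionalFreqDist(
    (sex, word[0])
    # condition: the file ID, 'female.txt' or 'male.txt'
    for sex in names.fileids()
    # sample: the first letter of each name in that file only
    for word in names.words(sex)
)
cfg.plot()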

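➢ The Inaugural Address Corpus: inaugural
Find out how many texts the corpus contains
Look at the text from 1797
Plot how the usage of the words america and citizen compares over time in the first 20 texts

import nltk
from nltk.corpus import inaugural as ig
print(len(ig.fileids()))
print([f for f in ig.fileids() if f.startswith('1797')])    # file ID(s) of the 1797 address; pass one to ig.raw() to read it
cfg = nltk.ConditionalFreqDist(
    (target, fileid[:4])                    # condition: target word; sample: year taken from the file ID
    for fileid in ig.fileids()[0:20]
    for w in ig.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target)         # also matches American, citizens, ...
)
cfg.plot()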
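
The generator expression above packs several steps into one: the first four characters of each file ID are taken as the year, and each word is matched against the target prefixes. A plain-Python sketch of the same pairing logic may make this easier to follow. The file IDs and tokens below are made up and are not the real corpus, and a Counter stands in for ConditionalFreqDist.

```python
from collections import Counter

# Hypothetical miniature corpus: file ID -> tokens.
docs = {
    "1789-Washington.txt": ["Citizens", "of", "America", "and", "American", "ideals"],
    "1793-Washington.txt": ["Fellow", "citizens", "of", "the", "nation"],
}
targets = ["america", "citizen"]

counts = Counter(
    (target, fileid[:4])                 # (condition, sample) = (prefix, year)
    for fileid, words in docs.items()
    for w in words
    for target in targets
    if w.lower().startswith(target)      # prefix match, case-insensitive
)
print(counts)
```

Because the match is a prefix test, "America" and "American" are both counted under "america".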

➢ Access the State of the Union corpus (name: state_union). Count the occurrences of men, women and people in each document. How has the usage of these words changed over time?

# men, women and people in state_union over time
import nltk
from nltk.corpus import state_union as su

cfg = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in su.fileids()[0:30]        # the first 30 documents only
    for w in su.words(fileid)
    for target in ['men', 'women', 'people']
    if w.lower() == target
)
cfg.plot()

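Loading and analysing your own corpus with NLTK

import nltk
path = "gw-news"    # directory containing the corpus files
gw = nltk.corpus.PlaintextCorpusReader(path, ".*")    # read every file in that directory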
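
The gw-news directory is not available here, so the following sketch builds a throwaway directory with two made-up text files and reads it the same way. It assumes NLTK 3 is installed; the file names and contents are purely illustrative.

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

# Build a temporary corpus directory with two small text files.
root = tempfile.mkdtemp()
for name, text in [("a.txt", "Hello world."), ("b.txt", "Good morning, world!")]:
    with open(os.path.join(root, name), "w", encoding="utf-8") as fh:
        fh.write(text)

corpus = PlaintextCorpusReader(root, r".*\.txt")
print(corpus.fileids())              # ['a.txt', 'b.txt']
print(list(corpus.words("a.txt")))   # ['Hello', 'world', '.']
print(len(corpus.words()))           # 3 + 5 = 8 tokens in total
```

Once loaded, the reader supports the same fileids(), raw() and words() calls as the built-in corpora; sents() additionally needs NLTK's sentence tokenizer data to be installed.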