Preparation

Install the NLTK packages
Download and install the NLTK data

import nltk
nltk.download()

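After running the commands above, the download failed, but the installation path was shown.
Workaround: download the packages manually and place them at that path. Alternatively, create an nltk_data directory on the D: drive (NLTK looks for its data in several default locations) and place the packages under D:/nltk_data.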
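
The manual workaround can also be made explicit in code. The following is a minimal sketch, assuming the data has already been copied into a directory of your choice; the helper name `register_nltk_data_dir` is hypothetical. It relies on the `NLTK_DATA` environment variable, which NLTK reads when it is first imported, and on the `nltk.data.path` list of search directories. The demo uses a temporary directory so that it runs anywhere; on your own machine you would pass `D:/nltk_data` instead.

```python
import os
import tempfile


def register_nltk_data_dir(directory):
    """Create `directory` if needed and add it to NLTK's search locations.

    Hypothetical helper. Returns the directories now listed in NLTK_DATA.
    """
    os.makedirs(directory, exist_ok=True)

    # NLTK reads the NLTK_DATA environment variable when it is first imported.
    existing = os.environ.get("NLTK_DATA", "")
    parts = [p for p in existing.split(os.pathsep) if p]
    if directory not in parts:
        parts.append(directory)
    os.environ["NLTK_DATA"] = os.pathsep.join(parts)

    # If NLTK is installed, also extend its search list directly,
    # which covers the case where nltk was imported earlier.
    try:
        import nltk
    except ImportError:
        return parts
    if directory not in nltk.data.path:
        nltk.data.path.append(directory)
    return parts


# Demo with a throwaway directory; use "D:/nltk_data" in practice.
demo_dir = os.path.join(tempfile.mkdtemp(), "nltk_data")
search_dirs = register_nltk_data_dir(demo_dir)
print(demo_dir in search_dirs)
```

Calling the helper before `from nltk.book import *` should let NLTK find the manually copied data.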

Run

from nltk.book import *

If text1 through text9 are listed, the installation succeeded and NLTK is ready to use.

Getting started

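NLTK (run as a script in PyCharm): a value must be passed to print to be displayed. Methods such as concordance and similar print their own output and return None, so they do not need the print wrapper.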

Import the packages and related data

import nltk
from nltk.book import *

Searching text

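1. text1.concordance: find where a word occurs in the text

text1.concordance("monstrous")     # show every occurrence of monstrous in text1 (concordance prints its own output)
text2.concordance("affection")     # show every occurrence of affection in text2

A concordance shows each occurrence of a word together with its surrounding context.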
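
To make the idea of a concordance concrete, here is a minimal pure-Python sketch on a toy token list. It is not NLTK's implementation; the function name `concordance_lines` and the `width` parameter are assumptions chosen for illustration, and matching is case-insensitive by choice.

```python
def concordance_lines(tokens, word, width=3):
    """Return (left context, match, right context) for each occurrence of `word`."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            # Take up to `width` tokens on each side of the match.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


tokens = "It was a monstrous big whale and a most monstrous size".split()
for left, match, right in concordance_lines(tokens, "monstrous"):
    # Right-align the left context so the matches line up in a column.
    print(f"{left:>15}  {match}  {right}")
```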

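2. text1.similar: find words that appear in contexts similar to those of a given word (here, monstrous)

text1.similar("monstrous")        # words in text1 that occur in contexts similar to monstrous (similar prints its own output)
text2.similar("monstrous")        # words in text2 that occur in contexts similar to monstrous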

3. common_contexts: examine the contexts shared by two or more words

text2.common_contexts(["monstrous", "very"])    # contexts in text2 shared by monstrous and very (common_contexts prints its own output)
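
What "shared context" means can be sketched in a few lines of plain Python. The example below treats a context as the pair (previous token, next token) and intersects the context sets of the given words. It is an illustrative sketch on a toy sentence, not NLTK's code; `contexts` and `shared_contexts` are hypothetical names, and occurrences at the very first or last position are ignored for simplicity.

```python
def contexts(tokens, word):
    """Set of (previous token, next token) pairs surrounding `word`."""
    found = set()
    # Skip the first and last positions, which lack one neighbour.
    for i in range(1, len(tokens) - 1):
        if tokens[i] == word:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found


def shared_contexts(tokens, words):
    """Contexts in which every word in `words` occurs."""
    sets = [contexts(tokens, w) for w in words]
    return set.intersection(*sets) if sets else set()


tokens = "she is a monstrous pretty girl and a very pretty singer , be very glad".split()
shared = shared_contexts(tokens, ["monstrous", "very"])
# Print each context in a left_right form.
print(" ".join(f"{left}_{right}" for left, right in sorted(shared)))
```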

4. dispersion_plot: plot where the given words occur across a text (requires NumPy and Matplotlib).

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])    # plot where these words occur across text4
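
A dispersion plot is essentially a picture of word offsets, that is, the token positions at which each word occurs. The sketch below computes those offsets for a toy token list without plotting anything; `word_offsets` is a hypothetical helper name used only for illustration.

```python
def word_offsets(tokens, words):
    """Map each word to the list of token positions where it occurs."""
    offsets = {word: [] for word in words}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


tokens = "we the citizens value freedom and the citizens defend freedom".split()
offsets = word_offsets(tokens, ["citizens", "democracy", "freedom"])
for word, positions in offsets.items():
    print(word, positions)
```

A word that never occurs, such as democracy here, simply gets an empty list, which corresponds to an empty row in the plot.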

5. generate(): produce random text in the style of a given text

text3.generate()        # random text in the style of text3 (generate prints the text itself)
text4.generate()        # random text in the style of text4
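
To give a feel for how random text can imitate a source, here is a small sketch of a random walk over word pairs. This is not how NLTK implements generate(); it is a simplified stand-in under stated assumptions, and the names `build_bigram_model` and `generate`, as well as the `length` and `seed` parameters, are hypothetical.

```python
import random
from collections import defaultdict


def build_bigram_model(tokens):
    """Map each token to the list of tokens observed immediately after it."""
    followers = defaultdict(list)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current].append(nxt)
    return followers


def generate(tokens, length=10, seed=0):
    """Random walk over the bigram model, starting from the first token."""
    rng = random.Random(seed)
    model = build_bigram_model(tokens)
    word = tokens[0]
    output = [word]
    while len(output) < length:
        choices = model.get(word)
        if not choices:  # dead end: this word was never followed by anything
            break
        word = rng.choice(choices)
        output.append(word)
    return output


tokens = "in the beginning god created the heaven and the earth".split()
sample = generate(tokens, length=8)
print(" ".join(sample))
```

Every adjacent pair in the output was seen in the source, which is why the result tends to resemble the source's style.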

Counting vocabulary

1. len: the length of a text from start to finish

print(len(text3))       # number of tokens (words and punctuation) in text3

2. sorted: sort the tokens of a text (duplicates included)

print(sorted(text3))    # sort all tokens of text3

3. set: obtain the vocabulary (distinct tokens only; if "the" occurs 10 times, it is counted once)

print(set(text3))       # the vocabulary of text3

4. Combining the above

print(sorted(set(text3)))
print(len(sorted(set(text3))))


print(len(set(text3)) / len(text3))   # lexical richness: distinct tokens divided by total tokens

5. count: frequency of a specific word

print(text3.count("smote"))

print(100 * text4.count('a') / len(text4))  # percentage of the text taken up by a specific word

Defining functions

Sometimes we need to compute these figures for several texts or words; defining functions makes this more convenient.

def lexical_diversity(text):  # ❶
    return len(set(text)) / len(text)  # ❷

def percentage(count, total):  # ❸
    return 100 * count / total

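In the definition of lexical_diversity() ❶, we specify a parameter named text. This parameter is a "placeholder" for the actual text whose lexical diversity we want to compute, and it reappears in the block of code that runs when the function is used ❷. Similarly, percentage() is defined with two parameters, count and total ❸. Once the functions are defined, they can be used as follows: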

print(lexical_diversity(text3))
print(lexical_diversity(text5))
print(percentage(4, 5))
print(percentage(text4.count('a'), len(text4)))


0.06230453042623537
0.13477005109975562
80.0
1.4643016433938312
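
One detail worth noting is that both functions divide by a length, so an empty text raises ZeroDivisionError. The following sketch shows guarded variants on a toy token list. Returning 0.0 for empty input is an assumption made here for illustration, not something the original notes specify; raising an error may be preferable in other settings.

```python
def lexical_diversity(text):
    """Ratio of distinct tokens to total tokens; 0.0 for an empty text (assumed convention)."""
    if len(text) == 0:
        return 0.0
    return len(set(text)) / len(text)


def percentage(count, total):
    """Share of `count` in `total` as a percentage; 0.0 when total is zero (assumed convention)."""
    if total == 0:
        return 0.0
    return 100 * count / total


tokens = ["the", "cat", "saw", "the", "dog"]
print(lexical_diversity(tokens))                       # 4 distinct tokens out of 5
print(percentage(tokens.count("the"), len(tokens)))    # "the" occurs 2 times out of 5
print(lexical_diversity([]))                           # guarded: no exception
```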

Indexing lists

print(text4[173])             # the token at index 173 of text4
print(text4.index('awaken'))  # the index at which awaken first occurs
print(text5[16715:16735])     # tokens 16715 to 16734 of text5; Python counts from 0, not 1, and m:n means elements m…n-1
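
The same indexing and slicing rules apply to any Python list, so they can be checked on a short toy list without loading a corpus. The token list below is made up for illustration.

```python
tokens = ["to", "awaken", "the", "people", "of", "the", "nation"]

second = tokens[1]                 # index 1 is the second element, since counting starts at 0
first_the = tokens.index("the")    # index() reports the first occurrence only
middle = tokens[2:5]               # elements at indexes 2, 3 and 4; index 5 is excluded

print(second)
print(first_the)
print(middle)
print(len(middle))                 # a slice m:n always has n - m elements when in range
```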

Frequency distributions

1. FreqDist: count word frequencies

import nltk
from nltk.book import *

fdist1 = FreqDist(text1)
print(fdist1)

print(fdist1.most_common(50))     # the 50 most frequent word types in the text
print(fdist1.most_common(30))     # the 30 most frequent word types in the text

print(fdist1.max())               # the most frequent token in text1

print(fdist1['whale'])            # number of occurrences of the word whale

fdist1.plot(50, cumulative=True)  # cumulative frequency plot of the 50 most frequent tokens

print(fdist1.hapaxes())           # words that occur only once
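
In NLTK 3, FreqDist is a subclass of Python's collections.Counter, so the core operations can be tried on a toy token list with the standard library alone. The sketch below is a stand-in for illustration: `most_frequent` and `hapaxes` are computed by hand here rather than with FreqDist's own max() and hapaxes() methods.

```python
from collections import Counter

tokens = "the whale the sea the ship a whale a storm".split()
counts = Counter(tokens)

top_two = counts.most_common(2)                  # the two most frequent tokens with their counts
most_frequent = counts.most_common(1)[0][0]      # plays the role of fdist.max()
hapaxes = sorted(w for w, n in counts.items() if n == 1)   # tokens that occur exactly once
total = sum(counts.values())                     # total number of tokens counted

print(top_two)
print(most_frequent)
print(counts["whale"])
print(hapaxes)
print(total)
```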

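2. Fine-grained selection of words (words of a given length)
Suppose we want to find the words in the text's vocabulary that are longer than 15 characters. Call this property P, so that P(w) is true if and only if the word w is more than 15 characters long. The words of interest can then be written in mathematical set notation as {w | w ∈ V & P(w)}, which reads: the set of all w such that w is an element of V (the vocabulary) and w has property P.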

V = set(text1)
long_words = [w for w in V if len(w) > 15]
print(sorted(long_words))

3. Combining the two points above
fdist5 = FreqDist(text5)
print(sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7))  # longer than 7 characters and occurring more than 7 times


Adding conditions
print(sorted(w for w in set(text1) if w.endswith('ableness')))        # condition on the word's suffix
print(sorted(term for term in set(text4) if 'gnt' in term))           # condition: the word contains gnt
print(sorted(item for item in set(text6) if item.istitle()))          # condition: the word is title-cased

>>> sorted(w for w in set(text7) if '-' in w and 'index' in w)
>>> sorted(wd for wd in set(text3) if wd.istitle() and len(wd) > 10)
>>> sorted(t for t in set(text2) if 'cie' in t or 'cei' in t)

>>> sorted(w for w in set(sent7) if not w.islower())     # sent7 is one of the sample sentences (sent1 to sent9) that nltk.book defines
>>> [len(w) for w in text1]
>>> [w.upper() for w in text1]

>>> len(text1)
260819
>>> len(set(text1))
19317
>>> len(set(word.lower() for word in text1))
17231
>>> len(set(word.lower() for word in text1 if word.isalpha()))
16948
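
The shrinking counts above come from two normalization steps: lowercasing merges case variants, and isalpha() drops punctuation and numbers. The sketch below reproduces the same four measurements on a toy token list made up for illustration, so each step can be checked by hand.

```python
tokens = ["The", "whale", ",", "the", "Whale", "!", "and", "the", "sea", "."]

n_tokens = len(tokens)                                         # every token, duplicates included
n_types = len(set(tokens))                                     # distinct tokens, case-sensitive
n_lower = len(set(t.lower() for t in tokens))                  # "The"/"the" and "Whale"/"whale" merge
n_alpha = len(set(t.lower() for t in tokens if t.isalpha()))   # punctuation removed as well

print(n_tokens, n_types, n_lower, n_alpha)
```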

Collocations and bigrams

text4.collocations() # bigrams that frequently occur together in text4 (collocations prints its own output)
text8.collocations() # bigrams that frequently occur together in text8
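
A bigram is simply a pair of adjacent tokens, and a collocation is a bigram that occurs unusually often. The sketch below extracts bigrams from a toy token list and counts them by raw frequency. This is a deliberate simplification for illustration and should not be read as the scoring that collocations() applies; the helper name `bigrams` mirrors the concept, and NLTK also ships its own nltk.bigrams function.

```python
from collections import Counter


def bigrams(tokens):
    """List of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


tokens = "united states of america and the united states army".split()
pairs = bigrams(tokens)
pair_counts = Counter(pairs)

# Keep only the pairs seen more than once, as a crude collocation candidate list.
repeated = [pair for pair, n in pair_counts.items() if n > 1]

print(len(pairs))
print(repeated)
```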

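Distribution of word lengths in a text
This is done by building a FreqDist from a long list of numbers, where each number is the length of the corresponding word in the text.

print([len(w) for w in text1])               # start from the list of the lengths of the words in text1

fdist = FreqDist(len(w) for w in text1)      # FreqDist counts how many times each number occurs

print(fdist)

print(fdist.most_common())     # number of occurrences of each word length

print(fdist.max())             # the most frequent word length; the result is 3
print(fdist[3])                # number of words of length 3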
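
The same counting can be verified by hand on a short toy sentence. The sketch below uses collections.Counter as a stand-in for FreqDist and also derives the share of tokens that have the most common length; the variable names are chosen for illustration only.

```python
from collections import Counter

tokens = "call me ishmael some years ago never mind how long".split()

lengths = [len(w) for w in tokens]           # one number per token
length_counts = Counter(lengths)             # how many tokens have each length

most_common_length, occurrences = length_counts.most_common(1)[0]
share = occurrences / len(tokens)            # proportion of tokens with that length

print(lengths)
print(most_common_length, occurrences, share)
```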
