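1. Purpose
Transform documents from one vector representation into another. This process serves two goals:
- To bring out hidden structure in the corpus: discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.
- To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, which reduces noise).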
2. Building the corpus
from collections import defaultdict
from gensim import corpora
# Corpus: nine short documents
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
# Build the stopword list
stoplist = set('for a of the and to in'.split())
# Tokenize: lowercase each document, split on whitespace, drop stopwords
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]
# Count how often each token occurs across the whole corpus
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
print(frequency)
# Remove words that appear only once
texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]
print('\n', texts)
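# Map each remaining token to an integer id
dictionary = corpora.Dictionary(texts)
print('\n', dictionary)
# Convert each document into a bag-of-words vector of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]
print('\n', corpus)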
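To make the bag-of-words format concrete, here is a minimal hand-rolled sketch of the same idea in plain Python. It is a stand-in for illustration, not gensim's implementation: the names `toy_texts`, `token2id` and `to_bow` are hypothetical, and assigning ids in sorted order is an arbitrary choice that need not match the ids gensim assigns.

```python
from collections import Counter

# Toy tokenized documents (hypothetical, much smaller than the corpus above).
toy_texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "minors", "survey"],
]

# Give every distinct token an integer id. Sorted order is an arbitrary choice.
vocabulary = sorted({token for text in toy_texts for token in text})
token2id = {token: idx for idx, token in enumerate(vocabulary)}


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens without an id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


toy_corpus = [to_bow(text, token2id) for text in toy_texts]
print(token2id)
print(toy_corpus)
```

Each document becomes a sparse vector: only tokens that actually occur are listed, each with its count.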
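3. Creating a transformation (model)
from gensim import models
# Initialize the model: this pass over the corpus collects the statistics tf-idf needs
tfidf = models.TfidfModel(corpus)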
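As a rough picture of what the initialization pass has to collect, the sketch below counts document frequencies over a toy bag-of-words corpus and derives inverse document frequencies from them. It assumes the weighting idf = log2(N / df), which is my understanding of the gensim default; treat that formula, and the helper names `document_frequencies` and `inverse_document_frequencies`, as assumptions for illustration rather than the library's code.

```python
import math

# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs.
toy_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 2)],
    [(2, 1), (3, 1)],
    [(3, 1)],
]


def document_frequencies(corpus):
    """For each token id, count the number of documents that contain it."""
    dfs = {}
    for doc in corpus:
        for token_id, _count in doc:
            dfs[token_id] = dfs.get(token_id, 0) + 1
    return dfs


def inverse_document_frequencies(dfs, num_docs):
    """Assumed weighting: idf = log2(num_docs / df)."""
    return {token_id: math.log2(num_docs / df) for token_id, df in dfs.items()}


dfs = document_frequencies(toy_corpus)
idfs = inverse_document_frequencies(dfs, len(toy_corpus))
print(dfs)
print(idfs)
```

Tokens that occur in fewer documents get a larger idf, so they weigh more in the transformed vectors than tokens that occur almost everywhere.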
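4. Transforming vectors with the model
From this point on, the tfidf object is treated as read-only. It can be used to convert any vector from the old representation (bag-of-words integer counts) into the new representation (TfIdf real-valued weights).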
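# Transform a single bag-of-words vector
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow])
# Transform the entire corpus
print('\n')
corpus_tfidf = tfidf[corpus]
# tfidf[corpus] returns a lazy wrapper, so this prints the wrapper object, not the vectors
print(corpus_tfidf)
# Iterating over the wrapper yields the transformed documents one at a time
for doc in corpus_tfidf:
    print(doc)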
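The conversion from integer counts to real-valued weights can be sketched by hand as follows. This is a simplified illustration under stated assumptions, not gensim's code: each count is multiplied by an idf weight and the result is scaled to unit L2 length, which I believe matches the default behavior. The function `tfidf_transform` and the values in `toy_idfs` are hypothetical and are not derived from the corpus above.

```python
import math


def tfidf_transform(doc_bow, idfs):
    """Weight each count by its idf, then scale the vector to unit L2 length."""
    # Token ids with no idf (or a zero idf) contribute nothing and are dropped.
    weighted = [
        (token_id, count * idfs[token_id])
        for token_id, count in doc_bow
        if idfs.get(token_id, 0.0) > 0.0
    ]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0.0:
        return []
    return [(token_id, w / norm) for token_id, w in weighted]


# Hypothetical idf values for three token ids.
toy_idfs = {0: 1.0, 1: 1.0, 2: 2.0}

print(tfidf_transform([(0, 1), (1, 1)], toy_idfs))
print(tfidf_transform([(0, 1), (2, 1)], toy_idfs))
```

When two tokens have equal counts and equal idf, they end up with equal weights after normalization; a token with a higher idf takes a larger share of the unit-length vector.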