WMD implementation in Python

Reference: this linked tutorial, "using Gensim's implementation of the WMD".

  1. Download the NLTK stopword list and remove stop words:
# Import and download stopwords from NLTK
from nltk.corpus import stopwords
from nltk import download
download('stopwords')  # Download stopwords list.
stop_words = stopwords.words('english')

def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]

# The two example sentences compared throughout this post
# (defined again in the complete script below).
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'

sentence_obama = preprocess(sentence_obama)
sentence_president = preprocess(sentence_president)
  2. Download the word2vec model pre-trained on Google News (an alternative that loads a locally downloaded copy is sketched after step 4):
import gensim.downloader as api
model = api.load('word2vec-google-news-300')  # large download on first use, cached afterwards
  3. Compute the WMD distance:
distance = model.wmdistance(sentence_obama, sentence_president)
print('distance = %.4f' % distance)
  4. Compute the distance for another sentence pair (a larger distance means less similar):
sentence_orange = preprocess('Oranges are my favorite fruit')
distance = model.wmdistance(sentence_obama, sentence_orange)
print('distance = %.4f' % distance)
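
If the downloader is slow, the Google News vectors from step 2 can also be loaded from a locally downloaded copy. A minimal sketch, assuming the standard GoogleNews-vectors-negative300.bin.gz file has already been saved locally (the path is just an example):

# Alternative to gensim.downloader: load a local copy of the Google News vectors.
from gensim.models import KeyedVectors

# The path below is an assumption; point it at wherever the file was saved.
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

# wmdistance works the same way on the loaded KeyedVectors.
distance = model.wmdistance(sentence_obama, sentence_president)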


(You can also download it from this link.) The complete code:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'
print(sentence_obama)

# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download
download('stopwords')  # Download stopwords list.
stop_words = stopwords.words('english')

def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]

sentence_obama = preprocess(sentence_obama)
print(sentence_obama)
sentence_president = preprocess(sentence_president)

import gensim.downloader as api
model = api.load('word2vec-google-news-300')
print('api over')  # marker that the model has finished loading
distance = model.wmdistance(sentence_obama, sentence_president)
print('distance = %.4f' % distance)

sentence_orange = preprocess('Oranges are my favorite fruit')
distance = model.wmdistance(sentence_obama, sentence_orange)
print('distance = %.4f' % distance)

But what I really want to understand is the internal computation behind WMD... a topic to dig into in a later post.
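
For reference while digging into that: WMD (Kusner et al., 2015) treats each document's normalized bag of words as a probability distribution and solves a transportation problem whose ground cost is the Euclidean distance between word embeddings. The sketch below spells out that linear program with scipy; it is illustrative only, not how Gensim actually computes it (Gensim calls an optimized EMD solver), and the embed argument is a hypothetical token-to-vector mapping.

import numpy as np
from scipy.optimize import linprog

def wmd_sketch(tokens_a, tokens_b, embed):
    # embed: dict mapping each token to its embedding vector (hypothetical input).
    vocab = sorted(set(tokens_a) | set(tokens_b))
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)

    def nbow(tokens):
        # Normalized bag-of-words weights: the "mass" each word carries.
        d = np.zeros(n)
        for w in tokens:
            d[idx[w]] += 1.0
        return d / d.sum()

    d1, d2 = nbow(tokens_a), nbow(tokens_b)

    # Ground cost: Euclidean distance between the embeddings of every word pair.
    C = np.array([[np.linalg.norm(embed[u] - embed[v]) for v in vocab] for u in vocab])

    # Minimize sum_ij T_ij * C_ij  s.t.  row sums of T = d1, column sums of T = d2, T >= 0.
    A_eq, b_eq = [], []
    for i in range(n):
        row = np.zeros((n, n)); row[i, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(d1[i])
    for j in range(n):
        col = np.zeros((n, n)); col[:, j] = 1.0
        A_eq.append(col.ravel()); b_eq.append(d2[j])

    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None), method='highs')
    return res.fun  # minimum total "travel cost" = the WMD value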


Analyzing Chinese short-text similarity: the WMD pipeline

  1. Chinese word segmentation:
    Segmentation tools: see the linked articles 中文分词工具比较 and NLP笔记:中文分词工具简介.
    Well-known tools include jieba and HIT's pyLTP (see the pyLTP user manual).

  2. Word vectors
    Download pre-trained word vectors, or train your own with gensim. (When working with domain-specific data, many domain terms are missing from public corpora, so being able to train your own word vectors becomes necessary for handling that data.)

  • 最全中文词向量数据下载 (a collection of high-quality pre-trained Chinese word vectors), also available via its GitHub repository

  • 中文词向量的下载与使用探索 (downloading and using Chinese word vectors; loading word vectors with TensorFlow)

  • A reference link for training word vectors with gensim

  3. Compute the similarity with the WmdSimilarity class provided by gensim, as shown in the end-to-end sketch below.
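
Putting the three steps together, here is a minimal end-to-end sketch. Parameter names follow gensim 4; the toy corpus, query sentence, and training settings are illustrative assumptions, and for real data you would load the pre-trained Chinese vectors linked above rather than train on a tiny corpus.

import jieba
from gensim.models import Word2Vec
from gensim.similarities import WmdSimilarity

# A toy corpus of Chinese short texts (illustrative only).
docs = ['我喜欢吃苹果', '他今天去了北京', '苹果是我最爱的水果']

# Step 1: Chinese word segmentation with jieba.
tokenized_docs = [jieba.lcut(doc) for doc in docs]

# Step 2: word vectors, trained here on the toy corpus itself just to keep the
# sketch self-contained; normally you would load pre-trained Chinese vectors.
w2v = Word2Vec(tokenized_docs, vector_size=50, min_count=1, epochs=50)

# Step 3: build a WMD similarity index over the corpus and query it.
index = WmdSimilarity(tokenized_docs, w2v.wv, num_best=3)
query = jieba.lcut('我最喜欢的水果是苹果')
for doc_id, score in index[query]:
    print(docs[doc_id], score)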