To extract and highlight keywords from an English article with `jieba` and `wordfreq`, and to provide each highlighted word's pronunciation and the definition that best fits its context, you can follow the steps below. The complete Python example that follows covers keyword extraction, highlighting, pronunciation lookup, and definition lookup.
Step Overview
- Install the required libraries
- Load and preprocess the English text
- Tokenize the text with `jieba`
- Compute word frequencies with `wordfreq` and extract keywords
- Highlight the keywords
- Look up pronunciations and definitions
1. Install the Required Libraries
First, make sure the following Python libraries are installed:
```bash
pip install jieba wordfreq requests pronouncing
```
- `jieba`: a Chinese word-segmentation tool that can also tokenize English text.
- `wordfreq`: provides word-frequency data.
- `requests`: sends HTTP requests, used here to call a dictionary API for definitions.
- `pronouncing`: looks up word pronunciations (based on the CMU Pronouncing Dictionary).
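As an optional sanity check after installation, you can query each library once. This is only a sketch; the exact frequency value and phone strings you see depend on the data versions installed on your machine.

```python
from wordfreq import word_frequency
import pronouncing

# A common word should have a non-zero frequency in wordfreq's English list.
print(word_frequency("intelligence", "en"))

# pronouncing returns ARPAbet phone strings from the CMU Pronouncing Dictionary.
print(pronouncing.phones_for_word("intelligence"))
```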
2. Load and Preprocess the English Text
Assume the English article is stored in a string variable.
```python
import jieba
from wordfreq import word_frequency
import pronouncing
import requests
import re

# Sample English article
text = """
Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals.
Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions
that maximize its chance of successfully achieving its goals.
"""

# Convert to lowercase for uniform processing
text_lower = text.lower()
```
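If the article lives on disk instead of in the script, you can read it into the same `text` variable. This is a minimal sketch; `article.txt` is a hypothetical file name and UTF-8 encoding is an assumption about your data.

```python
# Hypothetical input file; adjust the path and encoding to your data.
with open("article.txt", encoding="utf-8") as f:
    text = f.read()

text_lower = text.lower()
```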
3. Tokenize with jieba
Although `jieba` is designed primarily for Chinese word segmentation, it can also perform simple tokenization of English text.
```python
# Tokenize the English text with jieba
words = list(jieba.cut(text_lower, cut_all=False))

# Keep only alphanumeric tokens (drops punctuation and whitespace)
words = [word.strip() for word in words if word.strip().isalnum()]
```
4. Compute Word Frequencies with wordfreq and Extract Keywords
We select keywords by frequency: the lower a word's frequency, the more information it usually carries (an optional refinement is sketched after the snippet below).
```python
# Look up each word's frequency in English
frequency = {word: word_frequency(word, 'en') for word in set(words)}

# Sort by frequency, lowest first (rare words are favored)
sorted_words = sorted(frequency.items(), key=lambda item: item[1])

# Take the top N keywords; adjust N as needed
N = 10
keywords = [word for word, freq in sorted_words[:N]]
```
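One caveat with this ranking: `word_frequency` returns 0.0 for tokens it does not know (typos, rare abbreviations, tokenization artifacts), and those would all sort to the very top. The optional refinement below separates unknown tokens from genuinely rare words; if you adopt it, it replaces the `keywords` assignment above.

```python
# Split tokens wordfreq has never seen (frequency 0.0) from known rare words.
known = [(w, f) for w, f in frequency.items() if f > 0]
unknown = [w for w, f in frequency.items() if f == 0]

# Rank only the known words by rarity; unknown tokens can be reviewed separately.
sorted_known = sorted(known, key=lambda item: item[1])
keywords = [w for w, f in sorted_known[:N]]
```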
5. Highlight the Keywords
Highlight the keywords in the original text. Here we use the HTML `<mark>` tag.
```python
# Highlight keywords via regex replacement
def highlight_keywords(text, keywords):
    # Sort by length, longest first, so longer keywords are wrapped before shorter ones
    keywords_sorted = sorted(keywords, key=len, reverse=True)
    for word in keywords_sorted:
        # Match whole words case-insensitively and keep the original casing
        pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
        text = pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", text)
    return text

highlighted_text = highlight_keywords(text, keywords)
```
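Outside a notebook, one simple way to inspect the result is to write it to an HTML file and open it in a browser. This is an optional sketch; `highlighted.html` is a hypothetical output name.

```python
# Hypothetical output file; wrap the fragment in minimal HTML so <mark> renders.
with open("highlighted.html", "w", encoding="utf-8") as f:
    f.write(f"<html><body><p>{highlighted_text}</p></body></html>")
```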
6. Look Up Pronunciations and Definitions
Use the `pronouncing` library for pronunciations and a dictionary API for definitions. Here we use DictionaryAPI (dictionaryapi.dev), a free dictionary API, as an example.
```python
def get_pronunciation(word):
    pronunciations = pronouncing.phones_for_word(word)
    if pronunciations:
        return pronunciations[0]  # Return the first pronunciation
    else:
        return "N/A"

def get_definition(word):
    url = f"https://api.dictionaryapi.dev/api/v2/entries/en/{word}"
    response = requests.get(url)
    if response.status_code == 200:
        data = response.json()
        try:
            # Take the first definition
            definition = data[0]['meanings'][0]['definitions'][0]['definition']
            return definition
        except (IndexError, KeyError):
            return "Definition not found."
    else:
        return "Definition not found."

# Collect the pronunciation and definition for each keyword
keyword_info = {}
for word in keywords:
    pronunciation = get_pronunciation(word)
    definition = get_definition(word)
    keyword_info[word] = {
        'pronunciation': pronunciation,
        'definition': definition
    }
```
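DictionaryAPI is an external service, so network errors and slow responses are worth handling explicitly if you move beyond a quick experiment. A hedged sketch of a more defensive variant, assuming a 5-second timeout is acceptable:

```python
def get_definition_safe(word, timeout=5):
    """Like get_definition, but tolerant of network failures and slow responses."""
    url = f"https://api.dictionaryapi.dev/api/v2/entries/en/{word}"
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        data = response.json()
        return data[0]['meanings'][0]['definitions'][0]['definition']
    except (requests.RequestException, ValueError, IndexError, KeyError):
        return "Definition not found."
```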
7. Display the Results
Display the highlighted text together with each keyword's pronunciation and definition.
```python
from IPython.display import display, HTML

# Render the highlighted text (in a Jupyter notebook)
display(HTML(highlighted_text))

# Print each keyword's pronunciation and definition
for word, info in keyword_info.items():
    print(f"**{word.capitalize()}**")
    print(f"- Pronunciation: {info['pronunciation']}")
    print(f"- Definition: {info['definition']}\n")
```
Sample output:
```html
Artificial intelligence (<mark>AI</mark>) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals.
Leading <mark>AI</mark> textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions
that maximize its chance of successfully achieving its goals.
```
**Ai**
- Pronunciation: EY1
- Definition: Definition not found.
**Agents**
- Pronunciation: AE1 JH AH0 N T S
- Definition: A person who acts on behalf of another.
**Actions**
- Pronunciation: AE1 K SH AH0 N S
- Definition: The fact or process of doing something, typically to achieve an aim.
... (info for the remaining keywords)
Notes
- Limited English support in `jieba`: `jieba` is built for Chinese segmentation; for English it is less effective than dedicated tools such as `nltk` or `spaCy`. If tokenization accuracy matters, prefer those (see the tokenization sketch after this list).
- Frequency-based keyword selection: picking low-frequency words is a deliberately simple extraction strategy. More advanced methods such as TF-IDF or TextRank usually yield more accurate keywords.
- Pronunciation coverage: `pronouncing` is based on the CMU Pronouncing Dictionary, which mainly covers American English pronunciations; words outside that dictionary return no result.
- Dictionary API dependency: calling an external API such as DictionaryAPI requires network access and may be rate-limited, so handle failures gracefully in production (a more defensive variant is sketched in step 6).
- Highlighting medium: the example uses the HTML `<mark>` tag. In other environments, such as a terminal, you need a different highlighting mechanism (see the ANSI sketch after this list).
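If you prefer a dedicated English tokenizer, a drop-in replacement for the `jieba.cut` step using `nltk` might look like the sketch below. It assumes `nltk` is installed and that the Punkt tokenizer data has been downloaded.

```python
import nltk
from nltk.tokenize import word_tokenize

# One-time download of the Punkt tokenizer data; newer nltk releases
# may ask for 'punkt_tab' instead.
nltk.download('punkt')

# English-aware tokenization in place of jieba.cut
words = [w for w in word_tokenize(text_lower) if w.isalnum()]
```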
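For terminal output, ANSI escape codes can stand in for the `<mark>` tag. A minimal sketch, assuming a terminal that supports ANSI sequences:

```python
import re

# ANSI escape codes: reverse video on, then reset all attributes.
HIGHLIGHT_ON = "\033[7m"
HIGHLIGHT_OFF = "\033[0m"

def highlight_keywords_ansi(text, keywords):
    # Same whole-word, case-insensitive matching as the HTML version.
    for word in sorted(keywords, key=len, reverse=True):
        pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
        text = pattern.sub(lambda m: f"{HIGHLIGHT_ON}{m.group(0)}{HIGHLIGHT_OFF}", text)
    return text

print(highlight_keywords_ansi(text, keywords))
```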
Complete Code Example
```python
import jieba
from wordfreq import word_frequency
import pronouncing
import requests
import re
from IPython.display import display, HTML

# Sample English article
text = """
Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals.
Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions
that maximize its chance of successfully achieving its goals.
"""

# Convert to lowercase for uniform processing
text_lower = text.lower()

# Tokenize the English text with jieba
words = list(jieba.cut(text_lower, cut_all=False))

# Keep only alphanumeric tokens (drops punctuation and whitespace)
words = [word.strip() for word in words if word.strip().isalnum()]

# Look up each word's frequency in English
frequency = {word: word_frequency(word, 'en') for word in set(words)}

# Sort by frequency, lowest first (rare words are favored)
sorted_words = sorted(frequency.items(), key=lambda item: item[1])

# Take the top N keywords; adjust N as needed
N = 10
keywords = [word for word, freq in sorted_words[:N]]

# Highlight keywords via regex replacement
def highlight_keywords(text, keywords):
    # Sort by length, longest first, so longer keywords are wrapped before shorter ones
    keywords_sorted = sorted(keywords, key=len, reverse=True)
    for word in keywords_sorted:
        # Match whole words case-insensitively and keep the original casing
        pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
        text = pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", text)
    return text

highlighted_text = highlight_keywords(text, keywords)

# Look up pronunciations and definitions
def get_pronunciation(word):
    pronunciations = pronouncing.phones_for_word(word)
    if pronunciations:
        return pronunciations[0]  # Return the first pronunciation
    else:
        return "N/A"

def get_definition(word):
    url = f"https://api.dictionaryapi.dev/api/v2/entries/en/{word}"
    response = requests.get(url)
    if response.status_code == 200:
        data = response.json()
        try:
            # Take the first definition
            definition = data[0]['meanings'][0]['definitions'][0]['definition']
            return definition
        except (IndexError, KeyError):
            return "Definition not found."
    else:
        return "Definition not found."

keyword_info = {}
for word in keywords:
    pronunciation = get_pronunciation(word)
    definition = get_definition(word)
    keyword_info[word] = {
        'pronunciation': pronunciation,
        'definition': definition
    }

# Render the highlighted text (in a Jupyter notebook)
display(HTML(highlighted_text))

# Print each keyword's pronunciation and definition
for word, info in keyword_info.items():
    print(f"**{word.capitalize()}**")
    print(f"- Pronunciation: {info['pronunciation']}")
    print(f"- Definition: {info['definition']}\n")
```
Running the code above produces the English article with its keywords highlighted, along with each keyword's pronunciation and definition, which makes it easier to understand and learn the important vocabulary in the article.