Extracting and Highlighting Keywords in English Articles with jieba and wordfreq - CFANZ Programming Community

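To extract and highlight keywords in an English article with jieba and wordfreq, and to attach a pronunciation and a dictionary definition to each highlighted word, you can follow the steps below. The complete Python example covers keyword extraction, highlighting, pronunciation lookup, and definition lookup. Note that the example takes the first definition the dictionary returns; it does not pick a sense based on the surrounding context.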

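Overview of the steps

Install the required libraries
Load and preprocess the English text
Tokenize with jieba
Compute word frequencies with wordfreq and extract keywords
Highlight the keywords
Look up pronunciations and definitions
Output the results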

1. Install the required libraries

First, make sure the following Python libraries are installed:

pip install jieba wordfreq requests pronouncing

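jieba: a Chinese word segmentation tool. It does not analyze English, but it passes runs of Latin letters through as single tokens, which is enough for simple splitting.
wordfreq: looks up how frequent a word is in a given language.
requests: sends HTTP requests; used here to call a dictionary API for definitions.
pronouncing: looks up word pronunciations (based on the CMU Pronouncing Dictionary).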

2. Load and preprocess the English text

Assume the English article is stored in a string variable.

import jieba
from wordfreq import word_frequency
import pronouncing
import requests
import re

# Sample English article
text = """
Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals.
Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions
that maximize its chance of successfully achieving its goals.
"""

# Lowercase the text so that "AI" and "ai" count as the same word
text_lower = text.lower()

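3. Tokenize with `jieba`

jieba is built for Chinese word segmentation, but it can also split English text in a simple way.

# Tokenize the English text with jieba
words = list(jieba.cut(text_lower, cut_all=False))
# Drop punctuation and whitespace tokens
words = [word.strip() for word in words if word.strip().isalnum()]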
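For comparison, here is a minimal sketch of a tokenizer that uses only the standard library. It is not part of the original example; the function name `tokenize_english` and the choice to keep inner apostrophes and hyphens are assumptions made for illustration.

```python
import re

def tokenize_english(text):
    """Lowercase the text and return runs of letters, allowing inner apostrophes or hyphens."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

sample = 'Leading AI textbooks define the field as the study of "intelligent agents".'
print(tokenize_english(sample))
```

Unlike the `isalnum()` filter above, this pattern keeps forms such as "it's" and "well-known" together and drops tokens that contain digits, which may or may not be what a given article needs.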

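4. Compute word frequencies with `wordfreq` and extract keywords

We can pick keywords by frequency: words that are rare in general English usually carry more topic-specific information than common ones. Keep in mind that word_frequency returns 0 for words it has no data for, so unknown tokens and typos sort to the front of the list.

# Look up each distinct word's frequency in general English
frequency = {word: word_frequency(word, 'en') for word in set(words)}

# Sort by frequency in ascending order (rarest words first)
sorted_words = sorted(frequency.items(), key=lambda item: item[1])

# Take the first N words as keywords; adjust N as needed
N = 10
keywords = [word for word, freq in sorted_words[:N]]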
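One possible refinement, sketched below under stated assumptions, is to skip very short tokens and words the frequency list does not know before ranking. The helper `pick_keywords`, its `min_len` parameter, and the demo frequencies are all hypothetical; the frequency lookup is passed in as a function so the sketch runs without wordfreq installed.

```python
def pick_keywords(words, freq_fn, n=10, min_len=3):
    """Return up to n distinct words, rarest first, skipping short or unknown ones."""
    scored = {}
    for word in set(words):
        if len(word) < min_len:
            continue
        freq = freq_fn(word)
        if freq > 0:  # a frequency of 0 means the reference list has no data
            scored[word] = freq
    # Sort by frequency, then alphabetically so that ties are deterministic
    ranked = sorted(scored.items(), key=lambda item: (item[1], item[0]))
    return [word for word, _ in ranked[:n]]

# Stand-in frequencies invented for this demo, not real wordfreq values
demo_freq = {"the": 5e-2, "study": 2e-4, "agents": 3e-5, "perceives": 4e-7, "ai": 1e-5}
tokens = ["the", "study", "agents", "perceives", "ai", "qzxv"]
print(pick_keywords(tokens, lambda w: demo_freq.get(w, 0.0), n=3))
# → ['perceives', 'agents', 'study']
```

With wordfreq available, the lookup could be `lambda w: word_frequency(w, 'en')`. Setting `min_len=1` keeps short terms such as "ai" in the candidate list.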

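5. Highlight the keywords

Highlight the keywords in the original text. HTML <mark> tags are used here.

# Highlight by regular-expression substitution
def highlight_keywords(text, keywords):
    if not keywords:
        return text
    # Longest first, so a longer keyword wins over a shorter one it contains
    keywords_sorted = sorted(keywords, key=len, reverse=True)
    # One combined, case-insensitive pattern; \b restricts matches to whole words
    alternatives = "|".join(re.escape(word) for word in keywords_sorted)
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    # Wrap the matched text itself so the original capitalization is preserved
    return pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", text)

highlighted_text = highlight_keywords(text, keywords)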
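If the article text may contain characters such as `<` or `&`, it needs HTML escaping before it is rendered. The sketch below is one way to combine escaping with highlighting; the function name `highlight_html_safe` is hypothetical and the approach is an illustration rather than part of the original example.

```python
import html
import re

def highlight_html_safe(text, keywords):
    """Escape text for HTML and wrap whole-word keyword matches in <mark> tags."""
    if not keywords:
        return html.escape(text)
    alternatives = "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    # With one capturing group, re.split returns non-matching text at even
    # indexes and the matched keywords at odd indexes.
    parts = pattern.split(text)
    out = []
    for i, part in enumerate(parts):
        escaped = html.escape(part)
        out.append(f"<mark>{escaped}</mark>" if i % 2 else escaped)
    return "".join(out)

print(highlight_html_safe("AI agents score <b>high</b> & act.", ["agents", "ai"]))
# → <mark>AI</mark> <mark>agents</mark> score &lt;b&gt;high&lt;/b&gt; &amp; act.
```

Splitting first and escaping each piece afterwards avoids a subtle problem: escaping the whole text first would introduce entity names such as `amp` that a keyword pattern could then match.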

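6. Look up pronunciations and definitions

Use the pronouncing library for pronunciations and a dictionary API for definitions. DictionaryAPI, a free dictionary API, is used here as an example.

def get_pronunciation(word):
    # phones_for_word returns a list of ARPAbet strings from the CMU dictionary
    pronunciations = pronouncing.phones_for_word(word)
    if pronunciations:
        return pronunciations[0]  # Return the first pronunciation
    else:
        return "N/A"

def get_definition(word):
    url = f"https://api.dictionaryapi.dev/api/v2/entries/en/{word}"
    try:
        # A timeout keeps the script from hanging on a slow connection
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return "Definition not found."
    if response.status_code != 200:
        return "Definition not found."
    try:
        # Take the first listed definition (this is not context-aware)
        return response.json()[0]['meanings'][0]['definitions'][0]['definition']
    except (IndexError, KeyError, TypeError, ValueError):
        return "Definition not found."

# Look up the pronunciation and definition of each keyword
keyword_info = {}
for word in keywords:
    pronunciation = get_pronunciation(word)
    definition = get_definition(word)
    keyword_info[word] = {
        'pronunciation': pronunciation,
        'definition': definition
    }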
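Because the code above always takes the first definition, it cannot tell senses apart. A rough way to move toward context-aware definitions is a simplified Lesk-style overlap heuristic: prefer the candidate definition that shares the most words with the sentence where the keyword appears. The sketch below assumes the candidate definitions have already been collected into a list of strings; the names `pick_definition`, `content_words`, and `STOPWORDS`, as well as the two candidate definitions, are invented for illustration.

```python
import re

# A tiny stop-word list for the demo; a real one would be longer
STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "or", "and",
             "is", "that", "by", "for", "as", "its", "it"}

def content_words(text):
    """Return the set of lowercase words in text, minus stop words."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def pick_definition(definitions, context):
    """Return the definition sharing the most content words with the context."""
    if not definitions:
        return "Definition not found."
    context_words = content_words(context)
    # max() keeps the first item on ties, so the dictionary's order is the fallback
    return max(definitions, key=lambda d: len(content_words(d) & context_words))

candidates = [
    "A person who acts on behalf of another.",
    "A device or program that perceives its environment and acts on it.",
]
sentence = "any device that perceives its environment and takes actions"
print(pick_definition(candidates, sentence))
# → A device or program that perceives its environment and acts on it.
```

Word overlap is a crude signal and fails when a definition and its context share no vocabulary, so this is best treated as a starting point rather than reliable sense disambiguation.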

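7. Output the results

Display the highlighted text together with each keyword's pronunciation and definition.

# display and HTML live in IPython.display; this step assumes a Jupyter/IPython environment
from IPython.display import display, HTML

# Show the highlighted text
display(HTML(highlighted_text))

# Show the keyword information
for word, info in keyword_info.items():
    print(f"**{word.capitalize()}**")
    print(f"- Pronunciation: {info['pronunciation']}")
    print(f"- Definition: {info['definition']}\n")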

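Sample output (illustrative only: the actual keywords, pronunciations, and definitions depend on the wordfreq data and on the API response, and with N = 10 more words would be highlighted than shown here):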

Artificial intelligence (<mark>AI</mark>) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals.
Leading <mark>AI</mark> textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions
that maximize its chance of successfully achieving its goals.

**Ai**
- Pronunciation: EY1
- Definition: Definition not found.

**Agents**
- Pronunciation: AE1 JH AH0 N T S
- Definition: A person who acts on behalf of another.

**Actions**
- Pronunciation: AE1 K SH AH0 N S
- Definition: The fact or process of doing something, typically to achieve an aim.

... (information for the remaining keywords)

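Notes

Limited English support in jieba: jieba targets Chinese word segmentation, and on English it performs worse than dedicated English tokenizers such as nltk or spaCy. If tokenization accuracy matters, use one of those tools instead.
Frequency-based keyword selection: the method above simply treats the rarest words as keywords. More advanced algorithms such as TF-IDF or TextRank can produce more accurate keywords.
Limits of the pronunciation lookup: the pronouncing library is based on the CMU Pronouncing Dictionary, which mainly covers American English; other accents may not be available.
Dependence on a dictionary API: an external API such as DictionaryAPI requires a network connection and may be subject to usage limits. Handle failures properly in production.
How the highlighting is rendered: the example uses HTML <mark> tags. In other environments, such as a command line, a different highlighting method is needed.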
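For the command-line case mentioned in the last note, one option is ANSI escape sequences, which most modern terminals interpret as colors. The following is a minimal sketch; the function name `highlight_ansi` and the choice of bold text on a yellow background are assumptions, and terminals without ANSI support will print the escape codes literally.

```python
import re

ANSI_HIGHLIGHT = "\033[1;43m"  # bold text on a yellow background
ANSI_RESET = "\033[0m"         # restore the terminal's default style

def highlight_ansi(text, keywords):
    """Wrap whole-word keyword matches in ANSI color codes for terminal output."""
    if not keywords:
        return text
    alternatives = "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{ANSI_HIGHLIGHT}{m.group(0)}{ANSI_RESET}", text)

print(highlight_ansi("Leading AI textbooks define intelligent agents.", ["agents", "ai"]))
```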

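Complete code example

import jieba
from wordfreq import word_frequency
import pronouncing
import requests
import re
# display and HTML live in IPython.display; this assumes a Jupyter/IPython environment
from IPython.display import display, HTML

# Sample English article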
text = """
Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals.
Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions
that maximize its chance of successfully achieving its goals.
"""

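# Lowercase the text so that "AI" and "ai" count as the same word
text_lower = text.lower()

# Tokenize the English text with jieba
words = list(jieba.cut(text_lower, cut_all=False))
# Drop punctuation and whitespace tokens
words = [word.strip() for word in words if word.strip().isalnum()]

# Look up each distinct word's frequency in general English
frequency = {word: word_frequency(word, 'en') for word in set(words)}

# Sort by frequency in ascending order (rarest words first)
sorted_words = sorted(frequency.items(), key=lambda item: item[1])

# Take the first N words as keywords; adjust N as needed
N = 10
keywords = [word for word, freq in sorted_words[:N]]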

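# Highlight the keywords
def highlight_keywords(text, keywords):
    if not keywords:
        return text
    # Longest first, so a longer keyword wins over a shorter one it contains
    keywords_sorted = sorted(keywords, key=len, reverse=True)
    # One combined, case-insensitive pattern; \b restricts matches to whole words
    alternatives = "|".join(re.escape(word) for word in keywords_sorted)
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    # Wrap the matched text itself so the original capitalization is preserved
    return pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", text)

highlighted_text = highlight_keywords(text, keywords)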

# Look up pronunciations and definitions
def get_pronunciation(word):
    # phones_for_word returns a list of ARPAbet strings from the CMU dictionary
    pronunciations = pronouncing.phones_for_word(word)
    if pronunciations:
        return pronunciations[0]  # Return the first pronunciation
    else:
        return "N/A"

def get_definition(word):
    url = f"https://api.dictionaryapi.dev/api/v2/entries/en/{word}"
    try:
        # A timeout keeps the script from hanging on a slow connection
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return "Definition not found."
    if response.status_code != 200:
        return "Definition not found."
    try:
        # Take the first listed definition (this is not context-aware)
        return response.json()[0]['meanings'][0]['definitions'][0]['definition']
    except (IndexError, KeyError, TypeError, ValueError):
        return "Definition not found."

keyword_info = {}
for word in keywords:
    pronunciation = get_pronunciation(word)
    definition = get_definition(word)
    keyword_info[word] = {
        'pronunciation': pronunciation,
        'definition': definition
    }

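# Show the highlighted text
display(HTML(highlighted_text))

# Show the keyword information
for word, info in keyword_info.items():
    print(f"**{word.capitalize()}**")
    print(f"- Pronunciation: {info['pronunciation']}")
    print(f"- Definition: {info['definition']}\n")

Running the code above gives you the English article with its keywords highlighted, plus a pronunciation and a definition for each keyword, which helps when studying the important vocabulary in an English text.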