使用大型语言模型（如 OpenAI 的 GPT-4）实现上下文感知的词义消歧（Word Sense Disambiguation, WSD）-CFANZ编程社区

要为高亮的关键词提供最符合上下文的解释，需要不仅仅依赖于通用的词典定义，而是根据关键词在具体文本中的使用情况来生成定义。这通常涉及到上下文感知的词义消歧（Word Sense Disambiguation, WSD），以确保提供的解释与文章中的用法一致。

实现这一目标的常见方法是使用大型语言模型（如 OpenAI 的 GPT-4），因为它们能够理解上下文并生成符合特定语境的定义。以下是实现这一功能的详细步骤和代码示例。

实现步骤概述

安装必要的库
加载和预处理英文文本
使用适合英文的分词工具
利用 wordfreq 计算词频并提取关键词
高亮关键词
获取上下文感知的读音和定义
输出结果

1. 安装必要的库

首先，确保安装了以下 Python 库：

pip install jieba wordfreq requests pronouncing openai nltk

jieba: 中文分词工具，但也支持简单的英文分词。
wordfreq: 用于获取单词频率。
requests: 发送 HTTP 请求，用于调用词典 API 获取定义。
pronouncing: 获取单词的发音（基于 CMU 发音词典）。
openai: 调用 OpenAI 的 GPT-4 模型以生成上下文感知的定义。
nltk: 自然语言处理工具，用于句子分割。

注意：使用 OpenAI 的 API 需要一个有效的 API 密钥。请确保你已经注册并获取了 API 密钥。

2. 加载和预处理英文文本

假设我们有一段英文文章存储在一个字符串变量中。

import jieba
from wordfreq import word_frequency
import pronouncing
import requests
import re
import openai
import nltk
from nltk.tokenize import sent_tokenize

# 下载 NLTK 的句子分割模型
nltk.download('punkt')

# 示例英文文章
text = """
Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals.
Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions
that maximize its chance of successfully achieving its goals.
"""

# 转换为小写以统一处理
text_lower = text.lower()

3. 使用适合英文的分词工具

虽然 jieba 也可以用于英文分词，但对于更准确的英文分词，建议使用 nltk 或 spaCy。这里我们继续使用 jieba 进行简单的分词。

# 使用 jieba 进行英文分词
words = list(jieba.cut(text_lower, cut_all=False))
# 移除标点符号和空格
words = [word.strip() for word in words if word.strip().isalnum()]

4. 利用 `wordfreq` 计算词频并提取关键词

根据词频选择关键词，词频越低的词通常信息量越大。

# 计算每个词的频率（以英语为参考）
frequency = {word: word_frequency(word, 'en') for word in set(words)}

# 排序词频，从低到高（低频词优先）
sorted_words = sorted(frequency.items(), key=lambda item: item[1])

# 选择前N个关键词，N根据需要调整
N = 10
keywords = [word for word, freq in sorted_words[:N]]

5. 高亮关键词

在原文中高亮显示关键词。这里使用 HTML 的 <mark> 标签来高亮。

# 为了高亮，使用正则表达式进行替换
def highlight_keywords(text, keywords):
    # 按照长度从长到短排序，避免部分匹配
    keywords_sorted = sorted(keywords, key=len, reverse=True)
    for word in keywords_sorted:
        # 使用正则忽略大小写进行替换
        pattern = re.compile(re.escape(word), re.IGNORECASE)
        replacement = f"<mark>{word}</mark>"
        text = pattern.sub(replacement, text)
    return text

highlighted_text = highlight_keywords(text, keywords)

6. 获取上下文感知的读音和定义

为了提供最符合上下文的解释，我们将使用 OpenAI 的 GPT-4 模型。具体步骤包括：

提取关键词所在的句子：为每个关键词找到其在文本中出现的句子。
调用 OpenAI API 生成定义：将句子和关键词作为上下文，生成相应的定义。

注意：使用 OpenAI 的 API 会产生费用，请确保了解相关费用并妥善管理 API 密钥。

6.1 设置 OpenAI API 密钥

# 设置 OpenAI API 密钥
openai.api_key = 'your-openai-api-key'  # 请替换为你的 OpenAI API 密钥

6.2 提取关键词所在的句子

# 使用 NLTK 分割文本为句子
sentences = sent_tokenize(text)

# 创建一个字典，将每个关键词映射到包含它的句子
keyword_sentences = {word: [] for word in keywords}

for sentence in sentences:
    sentence_lower = sentence.lower()
    for word in keywords:
        if re.search(r'\b' + re.escape(word) + r'\b', sentence_lower):
            keyword_sentences[word].append(sentence.strip())

# 为每个关键词选择第一个出现的句子作为上下文
keyword_context = {}
for word, sents in keyword_sentences.items():
    if sents:
        keyword_context[word] = sents[0]
    else:
        keyword_context[word] = ""

6.3 调用 OpenAI API 生成上下文感知的定义

def get_contextual_definition(word, context):
    if not context:
        return "Definition not found in context."
    
    prompt = f"""
Provide a clear and concise definition of the word "{word}" based on its usage in the following sentence:

"{context}"

Definition:
"""
    try:
        response = openai.Completion.create(
            engine="text-davinci-003",  # 或者使用最新的模型，如 gpt-4
            prompt=prompt,
            max_tokens=60,
            temperature=0.3,
            n=1,
            stop=None
        )
        definition = response.choices[0].text.strip()
        return definition
    except Exception as e:
        print(f"Error fetching definition for {word}: {e}")
        return "Definition not available."

# 为每个关键词获取上下文感知的定义
keyword_info = {}
for word in keywords:
    pronunciation = pronouncing.phones_for_word(word)
    pronunciation = pronunciation[0] if pronunciation else "N/A"
    
    context = keyword_context[word]
    definition = get_contextual_definition(word, context)
    
    keyword_info[word] = {
        'pronunciation': pronunciation,
        'definition': definition
    }

注意：

模型选择：text-davinci-003 是一个强大的 GPT-3 模型，如果你有 GPT-4 的访问权限，可以将 engine 参数更改为 gpt-4。
API 调用限制：请注意 API 的速率限制和费用。对于大量关键词或频繁调用，可能需要考虑优化调用频率或批量处理。

7. 输出结果

将高亮后的文本和关键词的读音及上下文感知的定义展示出来。

from IPython.core.display import display, HTML

# 显示高亮文本
display(HTML(highlighted_text))

# 显示关键词信息
for word, info in keyword_info.items():
    print(f"**{word.capitalize()}**")
    print(f"- Pronunciation: {info['pronunciation']}")
    print(f"- Definition: {info['definition']}\n")

示例输出：

Artificial intelligence (<mark>ai</mark>) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals.
Leading <mark>ai</mark> textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions
that maximize its chance of successfully achieving its goals.

**Ai**
- Pronunciation: EY1
- Definition: In this context, "AI" refers to artificial intelligence, which is the simulation of human intelligence processes by machines, especially computer systems.

**Agents**
- Pronunciation: AE1 JH AH0 N T S
- Definition: In this context, "agents" are entities or devices that can perceive their environment and take actions to achieve specific goals.

**Actions**
- Pronunciation: AE1 K SH AH0 N S
- Definition: In this context, "actions" refer to the operations or steps taken by agents to interact with their environment and accomplish objectives.

...

完整代码示例

以下是整合上述步骤的完整代码示例：

import jieba
from wordfreq import word_frequency
import pronouncing
import requests
import re
import openai
import nltk
from nltk.tokenize import sent_tokenize
from IPython.core.display import display, HTML

# 下载 NLTK 的句子分割模型
nltk.download('punkt')

# 设置 OpenAI API 密钥
openai.api_key = 'your-openai-api-key'  # 请替换为你的 OpenAI API 密钥

# 示例英文文章
text = """
Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals.
Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions
that maximize its chance of successfully achieving its goals.
"""

# 转换为小写以统一处理
text_lower = text.lower()

# 使用 jieba 进行英文分词
words = list(jieba.cut(text_lower, cut_all=False))
# 移除标点符号和空格
words = [word.strip() for word in words if word.strip().isalnum()]

# 计算每个词的频率（以英语为参考）
frequency = {word: word_frequency(word, 'en') for word in set(words)}

# 排序词频，从低到高（低频词优先）
sorted_words = sorted(frequency.items(), key=lambda item: item[1])

# 选择前N个关键词，N根据需要调整
N = 10
keywords = [word for word, freq in sorted_words[:N]]

# 高亮关键词
def highlight_keywords(text, keywords):
    # 按照长度从长到短排序，避免部分匹配
    keywords_sorted = sorted(keywords, key=len, reverse=True)
    for word in keywords_sorted:
        # 使用正则忽略大小写进行替换
        pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
        replacement = f"<mark>{word}</mark>"
        text = pattern.sub(replacement, text)
    return text

highlighted_text = highlight_keywords(text, keywords)

# 使用 NLTK 分割文本为句子
sentences = sent_tokenize(text)

# 创建一个字典，将每个关键词映射到包含它的句子
keyword_sentences = {word: [] for word in keywords}

for sentence in sentences:
    sentence_lower = sentence.lower()
    for word in keywords:
        if re.search(r'\b' + re.escape(word) + r'\b', sentence_lower):
            keyword_sentences[word].append(sentence.strip())

# 为每个关键词选择第一个出现的句子作为上下文
keyword_context = {}
for word, sents in keyword_sentences.items():
    if sents:
        keyword_context[word] = sents[0]
    else:
        keyword_context[word] = ""

# 定义获取上下文感知定义的函数
def get_contextual_definition(word, context):
    if not context:
        return "Definition not found in context."
    
    prompt = f"""
Provide a clear and concise definition of the word "{word}" based on its usage in the following sentence:

"{context}"

Definition:
"""
    try:
        response = openai.Completion.create(
            engine="text-davinci-003",  # 或者使用最新的模型，如 gpt-4
            prompt=prompt,
            max_tokens=60,
            temperature=0.3,
            n=1,
            stop=None
        )
        definition = response.choices[0].text.strip()
        return definition
    except Exception as e:
        print(f"Error fetching definition for {word}: {e}")
        return "Definition not available."

# 为每个关键词获取上下文感知的定义
keyword_info = {}
for word in keywords:
    pronunciation = pronouncing.phones_for_word(word)
    pronunciation = pronunciation[0] if pronunciation else "N/A"
    
    context = keyword_context[word]
    definition = get_contextual_definition(word, context)
    
    keyword_info[word] = {
        'pronunciation': pronunciation,
        'definition': definition
    }

# 显示高亮文本
display(HTML(highlighted_text))

# 显示关键词信息
for word, info in keyword_info.items():
    print(f"**{word.capitalize()}**")
    print(f"- Pronunciation: {info['pronunciation']}")
    print(f"- Definition: {info['definition']}\n")