27.MATLAB NL Text Analytics Toolbox: Text Preprocessing

陆佃 2023-06-08


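Text Analytics Toolbox official documentation

I used MATLAB R2019b and have verified that the code below works there.

During the 2020 MCM/ICM contest I was on R2019a without this toolbox, so I had to learn Python on the spot, which cost me a lot.

1.What text preprocessing may need to handle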

  • Variations in case, for example "new" and "New"
  • Variations in word forms, for example "walk" and "walking"
  • Words which add noise, for example "stop words" such as "the" and "of"
  • Punctuation and special characters
  • HTML and XML tags
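
As a minimal sketch of two items from the list above (case variations and HTML tags), the snippet below strips tags and lowercases the text before tokenizing. The input string and the variable names `raw` and `txt` are made up for illustration; `eraseTags`, `lower`, and `tokenizedDocument` are used here on the assumption that Text Analytics Toolbox is installed.

```matlab
% Hypothetical input containing HTML tags and mixed case.
raw = "<p>The NEW tree is <b>down</b>.</p>";

% Strip HTML/XML tags, then normalize case.
txt = eraseTags(raw);
txt = lower(txt);

% Tokenize the cleaned text.
documents = tokenizedDocument(txt)
```

Cleaning the raw string first keeps tag fragments from ending up as tokens.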

2.Official example code

textData = [
    "A large tree is downed and blocking traffic outside Apple Hill."
    "There is lots of damage to many car windshields in the parking lot."];
documents = preprocessTextData(textData)

% Local function. In a script file it must come after all other code.
function documents = preprocessTextData(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Lemmatize the words. To improve lemmatization, first use 
% addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end

3.documents = tokenizedDocument(textData);

Tokenizes the text. The result looks like this:

[Screenshots: output of tokenizedDocument]
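
Since the screenshots are not reproduced here, the following sketch shows one way to inspect the tokenized result directly. The variable names `counts`, `vocab`, and `firstDocTokens` are illustrative only.

```matlab
textData = [
    "A large tree is downed and blocking traffic outside Apple Hill."
    "There is lots of damage to many car windshields in the parking lot."];
documents = tokenizedDocument(textData);

% Number of tokens in each document (punctuation marks are tokens too).
counts = doclength(documents)

% Unique tokens across all documents.
vocab = documents.Vocabulary;

% Tokens of the first document as a string array.
firstDocTokens = string(documents(1))
```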

4.documents = addPartOfSpeechDetails(documents);

Adds part-of-speech details to the documents.

To print the details, use:

tdetails = tokenDetails(documents);
head(tdetails)

Before:

[Screenshot: tokenDetails output before addPartOfSpeechDetails]

After:

[Screenshot: tokenDetails output after addPartOfSpeechDetails]
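
The before/after difference can also be checked in code. This is a rough sketch that compares the columns of the two details tables and then filters tokens by tag; the names `before`, `after`, `newColumns`, and `nouns` are illustrative.

```matlab
documents = tokenizedDocument("A large tree is downed and blocking traffic outside Apple Hill.");

before = tokenDetails(documents);
documents = addPartOfSpeechDetails(documents);
after = tokenDetails(documents);

% Columns that addPartOfSpeechDetails added to the details table.
newColumns = setdiff(after.Properties.VariableNames, before.Properties.VariableNames)

% Tokens tagged as nouns; PartOfSpeech is a categorical column.
nouns = after.Token(after.PartOfSpeech == 'noun')
```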

5.Reducing words to their base form (lemmatization)

documents = normalizeWords(documents,'Style','lemma');

[Screenshot: documents after lemmatization]
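
To see what the 'lemma' style does, it can help to compare it with stemming, which normalizeWords also supports through 'Style','stem'. The sentence below is invented for the comparison, and no particular output is claimed here because the results depend on the tagger and on the toolbox version.

```matlab
doc = tokenizedDocument("The trees were blocking the walking paths.");

% Part-of-speech details help the lemmatizer pick the right base form.
doc = addPartOfSpeechDetails(doc);

% Dictionary-based base forms.
lemmas = normalizeWords(doc,'Style','lemma')

% Rule-based stems, which need not be real words.
stems = normalizeWords(doc,'Style','stem')
```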

 

6.documents = erasePunctuation(documents);

Removes punctuation.
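
A quick way to confirm the effect is to count tokens before and after the call. This sketch uses an invented sentence; `nBefore` and `nAfter` are illustrative names.

```matlab
doc = tokenizedDocument("Wait, what? The tree fell!");

% Punctuation marks are separate tokens, so they count toward the length.
nBefore = doclength(doc);

doc = erasePunctuation(doc);
nAfter = doclength(doc);

fprintf("%d tokens before, %d after\n", nBefore, nAfter);
```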

7.documents = removeStopWords(documents);

Removes stop words such as "to" and "the".

To list the stop words:

words = stopWords;
reshape(words,[25 9])
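
The reshape call above only works when the list has exactly 25*9 = 225 entries, which may not hold in every release. As a sketch that does not depend on the list length, the words can be inspected like this; `candidates` and `isStop` are illustrative names.

```matlab
words = stopWords;

% How many stop words this release ships with.
numel(words)

% Check whether specific words are on the list.
candidates = ["the" "of" "tree"];
isStop = ismember(candidates, words)
```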

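Custom stop words:

% Append extra words to the built-in stop word list.
customStopWords = [stopWords "thy" "thee" "thou" "dost" "doth"];
documents = removeWords(documents,customStopWords);
% Show at most the first five documents (the example above has only two).
documents(1:min(5,numel(documents)))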

8.Removing words with 2 or fewer characters and words with 15 or more characters

documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);
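
Both length limits are inclusive, which is easy to get wrong. The sketch below uses an invented sentence to show which words survive; `kept` is an illustrative name.

```matlab
doc = tokenizedDocument("an ox saw a hippopotamus and an electroencephalograph");

% Drop words with 2 or fewer characters: "an", "ox", "a".
doc = removeShortWords(doc,2);

% Drop words with 15 or more characters: "electroencephalograph".
doc = removeLongWords(doc,15);

% Only words with 3 to 14 characters are kept.
kept = string(doc)
```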

 
