27. MATLAB NL Text Analytics Toolbox: Text Preprocessing

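Text Analytics Toolbox official documentation

The version used here is R2019b; I have tested it myself and it works.

For the 2020 MCM/ICM contest I was using R2019a without this toolbox, and having to pick up Python on the spot cost me a lot.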

1. Text preprocessing may need to deal with the following

Variations in case, for example "new" and "New"
Variations in word forms, for example "walk" and "walking"
Words which add noise, for example "stop words" such as "the" and "of"
Punctuation and special characters
HTML and XML tags
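
The official example in this post does not touch the first and last items in the list above. As a minimal sketch, assuming `eraseTags` and `lower` from Text Analytics Toolbox are available in your release, markup and case variations could be handled roughly like this (the input string is made up for illustration):

```matlab
% Hypothetical input with HTML tags and mixed case.
rawText = "<p>A <b>New</b> tree and a new car.</p>";

% Strip HTML/XML tags from the raw string before tokenizing.
cleanText = eraseTags(rawText);

% Tokenize, then convert every token to lowercase so that
% "New" and "new" are counted as the same word.
documents = tokenizedDocument(cleanText);
documents = lower(documents)
```

Removing tags before tokenizing keeps the markup from being split into stray tokens.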

2. Official example code

textData = [
    "A large tree is downed and blocking traffic outside Apple Hill."
    "There is lots of damage to many car windshields in the parking lot."];
documents = preprocessTextData(textData)

% In a script file, local functions must come after all other code.
function documents = preprocessTextData(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Lemmatize the words. To improve lemmatization, first use 
% addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end

3. documents = tokenizedDocument(textData);

Tokenize the text, splitting each string into word and punctuation tokens. The result looks like this:

[Screenshots: tokenized documents]
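
To look at the tokens directly, something like the following may help. It is only a sketch: it assumes `doclength` and the `string` conversion of a single `tokenizedDocument`, and the variable names `numTokens` and `firstTokens` are chosen here for illustration.

```matlab
textData = [
    "A large tree is downed and blocking traffic outside Apple Hill."
    "There is lots of damage to many car windshields in the parking lot."];
documents = tokenizedDocument(textData);

% Number of tokens in each document (punctuation counts as a token).
numTokens = doclength(documents)

% Tokens of the first document as a string array.
firstTokens = string(documents(1))
```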

4. documents = addPartOfSpeechDetails(documents);

Add part-of-speech details to the documents.

Code to print the token details:

tdetails = tokenDetails(documents);
head(tdetails)

Before adding the details:

[Screenshot: token details table before addPartOfSpeechDetails]

After adding the details:

[Screenshot: token details table after addPartOfSpeechDetails]
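
Once the part-of-speech column is present, the details table can be filtered like any other MATLAB table. The sketch below assumes the column is named `PartOfSpeech` and that "noun" is one of its categories, as in the toolbox documentation; the variable `nouns` is illustrative.

```matlab
textData = [
    "A large tree is downed and blocking traffic outside Apple Hill."
    "There is lots of damage to many car windshields in the parking lot."];
documents = tokenizedDocument(textData);
documents = addPartOfSpeechDetails(documents);

% One row per token, including the PartOfSpeech column.
tdetails = tokenDetails(documents);

% Keep only the rows tagged as nouns.
nouns = tdetails(tdetails.PartOfSpeech == "noun", :)
```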

5. Reduce words to their base form (lemmatization)

documents = normalizeWords(documents,'Style','lemma');
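
For comparison, `normalizeWords` also supports stemming through the 'stem' style. The following sketch runs both styles on a made-up sentence so the outputs can be compared side by side; no particular output is claimed here, since it depends on the stemmer and lemmatizer in your release.

```matlab
% Hypothetical input with several inflected word forms.
documents = tokenizedDocument("The builders were walking past older buildings");

% Stemming strips suffixes by rule and may produce non-words.
stemmed = normalizeWords(documents,'Style','stem')

% Lemmatization maps words to dictionary forms; adding
% part-of-speech details first can improve the result.
documents = addPartOfSpeechDetails(documents);
lemmatized = normalizeWords(documents,'Style','lemma')
```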

6. documents = erasePunctuation(documents);

Remove punctuation.
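
A small sketch of the effect on token counts, using a made-up sentence. It assumes that tokens consisting only of punctuation disappear from the document once erased, which is how the function is described in the toolbox documentation.

```matlab
% Hypothetical input with punctuation tokens.
documents = tokenizedDocument("Hello, world! This is a test.");
before = doclength(documents);

% Erase punctuation from the tokenized document.
cleaned = erasePunctuation(documents);
after = doclength(cleaned);

% Compare the token counts before and after.
fprintf("Tokens before: %d, after: %d\n", before, after);
```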

7. documents = removeStopWords(documents);

Remove stop words such as "to" and "the".

List the built-in stop words:

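words = stopWords;
% reshape assumes the list has exactly 225 entries (25 x 9);
% check numel(words) first if your release differs.
reshape(words,[25 9])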

Define custom stop words:

customStopWords = [stopWords "thy" "thee" "thou" "dost" "doth"];
documents = removeWords(documents,customStopWords);
documents   % display the result (this example has only two documents)

8. Remove words with 2 or fewer characters and words with 15 or more characters

documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);
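
The thresholds are inclusive on both sides: `removeShortWords(documents,2)` drops words of length 2 or less, and `removeLongWords(documents,15)` drops words of length 15 or more. A minimal sketch with a made-up sentence (the variable `remaining` is illustrative):

```matlab
% Hypothetical input: two 2-letter words, two mid-length words,
% and one 20-letter word.
documents = tokenizedDocument("an is the tree internationalization");

% Drop words with 2 or fewer characters ("an", "is").
documents = removeShortWords(documents,2);

% Drop words with 15 or more characters ("internationalization").
documents = removeLongWords(documents,15);

% Remaining tokens as a string array.
remaining = string(documents)
```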