1. Download the data
The Chinese training data ($PATH/NERdata/) comes from: https://github.com/zjy-ucas/ChineseNER
Link: https://pan.baidu.com/s/1JBnda5rgUsZjgYR5W7u-Fg
Extraction code: x16l
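2. About the data
2.1 Annotation scheme
The data uses BIO tagging (also called IOB2). For background, see "[NLP] Introduction to BIO sequence labeling (also called IOB2)" on mjiansun's CSDN blog.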
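To make the scheme concrete, here is a minimal sketch of how BIO tags can be grouped back into entity spans. In BIO, `B-X` marks the first token of an entity of type `X`, `I-X` marks its following tokens, and `O` marks tokens outside any entity. The function name `bio_to_spans` and the sample sentence are illustrative assumptions, not part of the dataset or of any library.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, start, end, text) spans; end is exclusive."""
    spans = []
    start, etype = None, None
    # A trailing sentinel "O" flushes an entity that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        # Close the open entity unless this tag continues it.
        if start is not None and tag != "I-" + etype:
            spans.append((etype, start, i, "".join(tokens[start:i])))
            start, etype = None, None
        # A B- tag always opens a new entity.
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # An I- tag with no open entity is ignored (strict IOB2 reading).
    return spans


# Hypothetical sample sentence, tagged character by character.
tokens = list("张三在北京工作")
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]
print(bio_to_spans(tokens, tags))
```

Each Chinese character is treated as one token here, which matches character-level tagging; a word-level corpus would need a different join than `"".join`.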
2.2 Number of categories
There are four categories in total: persons, locations, organizations, and miscellaneous entities.
Their abbreviations are PER, LOC, ORG, and MISC.
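Combined with BIO tagging, the four categories above give a tag set of nine labels: a `B-` and an `I-` tag per category, plus `O`. The sketch below builds that set and a tag-to-index mapping; the names `ENTITY_TYPES`, `build_tag_set`, and `tag2id` are hypothetical helpers chosen for illustration.

```python
# The four entity categories listed above.
ENTITY_TYPES = ["PER", "LOC", "ORG", "MISC"]


def build_tag_set(entity_types):
    """Return the BIO tag list: "O" first, then B-/I- pairs per entity type."""
    tags = ["O"]
    for etype in entity_types:
        tags += ["B-" + etype, "I-" + etype]
    return tags


TAGS = build_tag_set(ENTITY_TYPES)
# Map each tag to an integer id, as a sequence model typically expects.
tag2id = {tag: i for i, tag in enumerate(TAGS)}

print(TAGS)
print(len(TAGS))  # 2 tags per category * 4 categories + "O" = 9
```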
3.3 Import the data
3.3.1 Convert the Chinese data to the format described above
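if __name__ == "__main__":
    # Input: one whitespace-separated "token tag" pair per line;
    # blank lines separate sentences.
    rootPath = "/data2/PrivateExperiment/bilstm-crf-ner/NERdata/train.txt"
    savePath = "/data2/PrivateExperiment/bilstm-crf-ner/NERdata/process/toformat.txt"
    collectData = []
    with open(rootPath, "r", encoding="utf-8") as f:
        for line in f:
            lineStrs = line.strip().split()
            if len(lineStrs) >= 2:
                collectData.append([lineStrs[0], lineStrs[1] + "\n"])
            elif not lineStrs:
                # Keep the blank line that separates sentences.
                collectData.append(["\n"])
            # Lines with a single field are skipped instead of raising IndexError.
    # Output: "token<TAB>tag" per line, with sentence breaks preserved.
    with open(savePath, "w", encoding="utf-8") as f:
        for line in collectData:
            f.write("\t".join(line))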
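The result is a file with one token and its tag per line, separated by a tab, with a blank line between sentences.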
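Before importing, it can help to check that the converted file parses cleanly. The following is a rough sketch, assuming the tab-separated layout produced by the script above; `read_sentences` and the sample lines are hypothetical and are not taken from the real training file.

```python
def read_sentences(lines):
    """Parse "token<TAB>tag" lines into a list of (tokens, tags) sentences."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        # Raises ValueError if a line does not have exactly two tab-separated fields.
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)
    if tokens:
        # Flush the last sentence when the file has no trailing blank line.
        sentences.append((tokens, tags))
    return sentences


# Made-up sample in the converted format.
sample = ["海\tB-LOC\n", "钓\tO\n", "\n", "比\tO\n", "赛\tO\n"]
for tokens, tags in read_sentences(sample):
    print("".join(tokens), tags)
```

To run it on the real output, pass an open file object instead of the sample list, for example `read_sentences(open(savePath, encoding="utf-8"))`.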
3.3.2 Import the file by drag and drop
3.4 Export the data
admin.jsonl contains the normal data; unknown.jsonl contains the abnormal data.
admin.jsonl
unknown.jsonl
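The exported files are in JSON Lines format, meaning one JSON object per line. A minimal sketch for loading such a file follows. The field names `text` and `label` in the sample record are assumptions made for illustration; the actual keys depend on the tool and version that produced the export, so inspect one record first.

```python
import json


def load_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of Python objects."""
    records = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as e:
            raise ValueError(f"line {n} is not valid JSON: {e}") from e
    return records


# Hypothetical record; the real export may use different keys.
sample = ['{"text": "张三在北京", "label": [[0, 2, "PER"], [3, 5, "LOC"]]}', ""]
records = load_jsonl(sample)
print(len(records), sorted(records[0].keys()))
```

For the real files, pass an open file object, for example `load_jsonl(open("admin.jsonl", encoding="utf-8"))`.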