Searching Word and PDF documents in ES: pipeline configuration and document mapping for the ingest attachment plugin


I. Install the ingest attachment plugin

Installation method:

II. Define the text extraction pipeline

1. Single attachment (example)

PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "data"
      }
    }
  ]
}


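The remove processor deletes the attachment itself once the pipeline has processed it: only the text extracted from the attachment is stored in ES, and the attachment's own base64 data is discarded.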
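The pipeline reads the file from the `data` field, so the client has to send the file content base64-encoded. A minimal sketch of preparing such a document is shown here; the helper name `build_single_attachment_doc`, the sample title, and the document id in the comment are assumptions for illustration, not part of the original project.

```python
import base64
import json


def build_single_attachment_doc(title: str, file_bytes: bytes) -> dict:
    """Build a document whose `data` field holds the base64-encoded file."""
    return {
        "title": title,
        # The attachment processor expects base64 text in the configured field.
        "data": base64.b64encode(file_bytes).decode("ascii"),
    }


doc = build_single_attachment_doc("Sample notice", b"hello attachment")

# The body would be sent with the pipeline named in the request, for example:
#   PUT newdoc_dispatch/_doc/1?pipeline=attachment   (the id 1 is hypothetical)
print(json.dumps(doc, ensure_ascii=False, indent=2))
```

After the pipeline runs, the stored document would carry the extracted text under `attachment.content` and no longer contain `data`.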


2. Multiple attachments (example)

PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.data",
            "target_field": "_ingest._value.attachment"
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "remove": {
            "field": "_ingest._value.data"
          }
        }
      }
    }
  ]
}


Note that with multiple attachments, field and target_field must be written as _ingest._value.*; otherwise the processors cannot match the correct fields.
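To make the role of `_ingest._value` concrete, the sketch below imitates in plain Python what the two foreach processors do to a document: each element of `attachments` is visited in turn, and `_ingest._value` refers to that element. This is only a simulation under stated assumptions: `fake_extract_text` is a hypothetical stand-in that decodes base64 as UTF-8 (the real plugin extracts text with Apache Tika), and the `filename` field is illustrative.

```python
import base64


def fake_extract_text(data_b64: str) -> str:
    """Stand-in for real text extraction: decode the base64 payload as UTF-8."""
    return base64.b64decode(data_b64).decode("utf-8")


def simulate_foreach(doc: dict) -> dict:
    """Imitate the two foreach processors over the `attachments` array."""
    for value in doc["attachments"]:  # `value` plays the role of _ingest._value
        # attachment processor: field=_ingest._value.data,
        # target_field=_ingest._value.attachment
        value["attachment"] = {"content": fake_extract_text(value["data"])}
        # remove processor: field=_ingest._value.data
        del value["data"]
    return doc


doc = {
    "title": "Sample notice",
    "attachments": [
        {"filename": "a.txt", "data": base64.b64encode(b"first file").decode("ascii")},
        {"filename": "b.txt", "data": base64.b64encode(b"second file").decode("ascii")},
    ],
}

result = simulate_foreach(doc)
print(result["attachments"][0]["attachment"]["content"])  # prints "first file"
```

Writing the field as plain `data` would address a top-level field of the document rather than the current array element, which is why the prefix is required.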
Starting with ES 8.0, deleting the binary file content only requires setting remove_binary to true on the attachment processor, so a separate remove processor like the one above is no longer needed.
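The two variants of the multi-attachment pipeline can be compared side by side. The sketch below builds the request body for `PUT _ingest/pipeline/attachment` either with `remove_binary` or with the separate remove processor; the function name `build_multi_attachment_pipeline` is an assumption, and which variant applies depends on the ES version in use.

```python
import json


def build_multi_attachment_pipeline(remove_binary: bool = True) -> dict:
    """Build the pipeline body for documents with an `attachments` array."""
    attachment = {
        "field": "_ingest._value.data",
        "target_field": "_ingest._value.attachment",
    }
    if remove_binary:
        # Newer variant: the attachment processor drops the binary itself.
        attachment["remove_binary"] = True

    processors = [
        {"foreach": {"field": "attachments", "processor": {"attachment": attachment}}}
    ]
    if not remove_binary:
        # Older variant: a second foreach removes the base64 data afterwards.
        processors.append(
            {
                "foreach": {
                    "field": "attachments",
                    "processor": {"remove": {"field": "_ingest._value.data"}},
                }
            }
        )
    return {"description": "Extract attachment information", "processors": processors}


print(json.dumps(build_multi_attachment_pipeline(remove_binary=True), indent=2))
```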


III. Create the document mapping

1. Single attachment (example)

PUT newdoc_dispatch
{
  "mappings": {
    "properties": {
      "businessId": {
        "type": "keyword"
      },
      "title": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "fullDocNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "drafterUser": {
        "type": "keyword"
      },
      "dispatchNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "dispatchDept": {
        "type": "keyword"
      },
      "dispatchTime": {
        "type": "date"
      },
      "abolish": {
        "type": "keyword"
      },
      "tenantId": {
        "type": "keyword"
      },
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "ik_smart"
          }
        }
      }
    }
  }
}

2. Multiple attachments (example)

PUT newdoc_dispatch
{
  "mappings": {
    "properties": {
      "businessId": {
        "type": "keyword"
      },
      "title": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "fullDocNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "drafterUser": {
        "type": "keyword"
      },
      "dispatchNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "dispatchDept": {
        "type": "keyword"
      },
      "dispatchTime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      },
      "abolish": {
        "type": "keyword"
      },
      "tenantId": {
        "type": "keyword"
      },
      "attachments": {
        "properties": {
          "attachment": {
            "properties": {
              "content": {
                "type": "text",
                "analyzer": "ik_smart"
              }
            }
          }
        }
      }
    }
  }
}
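With these mappings, the extracted text lives at `attachment.content` (single attachment) or `attachments.attachment.content` (multiple attachments), so a full-text search targets that path. The following is a minimal sketch of building such a search body; the function name `build_attachment_search` and the sample keyword are assumptions, and the highlight section is an optional addition not taken from the original project.

```python
import json


def build_attachment_search(keyword: str, multi: bool = True) -> dict:
    """Build a match query against the extracted attachment text."""
    field = "attachments.attachment.content" if multi else "attachment.content"
    return {
        # Full-text match on the field analyzed with ik_smart in the mapping.
        "query": {"match": {field: keyword}},
        # Ask ES to return highlighted fragments from the same field.
        "highlight": {"fields": {field: {}}},
    }


body = build_attachment_search("通知")

# The body would be sent as: GET newdoc_dispatch/_search
print(json.dumps(body, ensure_ascii=False, indent=2))
```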


The code in the project is the multi-attachment example; for the object that the mapping corresponds to, see ESDispatchDocumentVo.


Official reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

Code project reference: https://gitee.com/catoop/es-attachment
