ElasticSearch

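1. ElasticSearch Study Notes: Basic Introduction
2. ElasticSearch Study Notes: Simple Operations
3. ElasticSearch Study Notes: Java API Operations
4. ElasticSearch Study Notes: SpringBoot Starter Operations
5. ElasticSearch Study Notes: Nested Operations
6. ElasticSearch Study Notes: Tokenization Algorithms
7. ElasticSearch Study Notes: Advanced Search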

ElasticSearch was created by Shay Banon.
This post walks through advanced ElasticSearch search techniques for complex business scenarios, again using Kibana to run the requests.

Contents

ElasticSearch
Preface
I. Boosting (controlling relevance)
II. Bool queries
III. Multi-field queries
Summary

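Preface

This post covers some of ES's advanced search features. Simple searches handle roughly 80% of business scenarios, but more complex ones, such as public-opinion monitoring, smart recommendations, or "guess what you want to search" features, call for the advanced features: sometimes a search needs to raise the relevance of certain results, and sometimes it needs to lower it.
Less talk, show me the code…

Let's prepare some data for testing advanced searches. Generating sample data by hand is a real headache, and this is where ChatGPT shines: it makes the job remarkably easy.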

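I. Boosting (controlling relevance)

boost > 1: the score weight is relatively increased
0 < boost < 1: the score weight is relatively decreased
boost < 0: the clause contributes a negative score (note that Elasticsearch 7.0 and later reject negative boost values)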

GET electronics/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "desc": "手机"
        }
      },
      "negative": {
        "match": {
          "desc": "相机"
        }
      },
      "negative_boost": 0.2
    }
  },
  "from": 0,
  "size": 10
}

Use case: results containing a given keyword should not be excluded outright, just ranked at the bottom.
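
To make the effect of negative_boost concrete, here is a minimal Python sketch of how a boosting query adjusts scores. It is a simplified model rather than Elasticsearch's implementation: the document ids and raw scores are made up, and boosting_score is a hypothetical helper. The one rule it encodes is that a hit matching the negative query has its score multiplied by negative_boost, which must lie between 0 and 1.0.

```python
def boosting_score(positive_score: float, matches_negative: bool,
                   negative_boost: float = 0.2) -> float:
    """Simplified model: a hit keeps its positive-query score unless it also
    matches the negative query, in which case the score is scaled down."""
    if not 0.0 <= negative_boost <= 1.0:
        raise ValueError("negative_boost must be between 0 and 1.0")
    return positive_score * negative_boost if matches_negative else positive_score


# Hypothetical hits: (doc id, score from the positive query, matches "相机"?)
hits = [("phone_a", 1.8, False), ("camera_phone", 2.4, True), ("phone_b", 1.1, False)]

ranked = sorted(
    ((doc_id, boosting_score(score, negative)) for doc_id, score, negative in hits),
    key=lambda pair: pair[1],
    reverse=True,
)
for doc_id, score in ranked:
    print(doc_id, round(score, 2))
```

In this toy data the camera document has the highest raw score, yet it still ends up last instead of disappearing, which is exactly the use case described above.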

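II. Bool queries

2.1 Compound queries

must: like Java's && operator; the clause must match and contributes to the score.
should: like Java's || operator; the clause is optional and contributes to the score when it matches.
must_not: like Java's ! operator; the clause must not match and does not contribute to the score.
filter: a plain filter; the clause must match but does not contribute to the score.

A point that may come up in interviews:

Context	Description
Query	Computes a relevance score; slower
Filter	Does not compute a relevance score; faster

Note: when several sub-queries are combined into one compound query, such as bool, the score computed by each sub-query is merged into the overall relevance score.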

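About brand.keyword: I did not define a mapping for this index, so brand was dynamically mapped as a text field with a keyword sub-field. A term query matches exactly against un-analyzed values, so it has to target brand.keyword rather than brand.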

POST electronics/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "苹果"
          }
        }
      ],
      "should": [
        {
          "match": {
            "desc": "手机"
          }
        },
        {
          "match": {
            "desc": "笔记本"
          }
        }
      ],
      "filter": {
        "term": {
          "brand.keyword": "Apple"
        }
      },
      "must_not": [
        {
          "term": {
            "color": "black"
          }
        }
      ]
    }
  }
}
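
The clause semantics used in the request above can be modelled in a few lines of Python. This is only a sketch under simplifying assumptions: each clause is represented by the score it produced (or None when it did not match), bool_score is a hypothetical helper, and real scores come from BM25 rather than the made-up numbers here.

```python
from typing import Optional, Sequence

# A clause result is the sub-query's score if it matched, or None if it did not.
Clause = Optional[float]


def bool_score(must: Sequence[Clause] = (), should: Sequence[Clause] = (),
               filter_: Sequence[Clause] = (), must_not: Sequence[Clause] = (),
               minimum_should_match: Optional[int] = None) -> Optional[float]:
    """Return the combined score, or None if the document is not a hit."""
    if any(c is None for c in must) or any(c is None for c in filter_):
        return None  # every must and filter clause has to match
    if any(c is not None for c in must_not):
        return None  # any matching must_not clause excludes the document
    matched_should = [c for c in should if c is not None]
    if minimum_should_match is None:
        # should becomes optional as soon as a must or filter clause is present
        minimum_should_match = 0 if (must or filter_) else min(1, len(should))
    if len(matched_should) < minimum_should_match:
        return None
    # filter and must_not contribute nothing to the score
    return sum(must) + sum(matched_should)


# title matches "苹果" (1.5), desc matches "手机" (0.7) but not "笔记本",
# the brand filter matches, and the colour is not black.
print(bool_score(must=[1.5], should=[0.7, None], filter_=[0.0], must_not=[None]))
```

Only the must and should clauses add to the final score; the filter and must_not clauses merely decide whether the document is a hit at all.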

2.2 Contains vs. equals queries

POST electronics/_search
{
  "query": {
    "match": {
      "product_agency": "打电话"
    }
  },
  "from": 0,
  "size": 20
}

POST electronics/_search
{
  "query": {
    "term": {
      "product_agency.keyword": "打电话"
    }
  },
  "from": 0,
  "size": 20
}

Of course, replacing must with filter in the following query would be more efficient, since exact matches need no scoring!

POST electronics/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "product_agency.keyword": {
              "value": "打电话"
            }
          }
        },
        {
          "term": {
            "product_agency_count": {
              "value": 1
            }
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 20
}
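
The difference between "contains" and "equals" is easier to see on a small in-memory dataset. The sketch below rests on an assumption the post does not state outright: that product_agency can hold several values and that product_agency_count stores how many it holds. The documents and helper names are hypothetical.

```python
# Hypothetical documents; the field values are illustrative only.
docs = [
    {"id": 1, "product_agency": ["打电话"], "product_agency_count": 1},
    {"id": 2, "product_agency": ["打电话", "看电影"], "product_agency_count": 2},
    {"id": 3, "product_agency": ["办公"], "product_agency_count": 1},
]


def term_contains(doc: dict, value: str) -> bool:
    """A term query on a multi-valued field matches if any value equals the term."""
    return value in doc["product_agency"]


def term_equals(doc: dict, value: str) -> bool:
    """A second term clause on the count field narrows 'contains' to 'equals'."""
    return term_contains(doc, value) and doc["product_agency_count"] == 1


contains_ids = [d["id"] for d in docs if term_contains(d, "打电话")]
equals_ids = [d["id"] for d in docs if term_equals(d, "打电话")]
print(contains_ids, equals_ids)
```

Under that assumption, the count clause is what turns a "contains 打电话" match into an "is exactly 打电话" match.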

To be honest, now that it's written, this example doesn't strike me as anything special.

2.3 Implementing a should_not query

POST electronics/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "product_agency.keyword": {
              "value": "打电话"
            }
          }
        }
      ],
      "should": [
        {
          "bool": {
            "must_not": [
              {
                "term": {
                  "price": {
                    "value": 699.99
                  }
                }
              }
            ]
          }
        }
      ],
      "minimum_should_match": 1
    }
  },
  "from": 0,
  "size": 20
}

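Note: the minimum_should_match parameter is required here. When a bool query also contains a must clause, its should clauses are optional by default, so setting the parameter to 1 forces at least one should clause to match and makes the nested must_not take effect.
Honestly, this is probably only useful in fairly marginal business cases.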
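
Here is a small Python sketch of why minimum_should_match matters in the request above. The three documents are invented for illustration, and matches is a hypothetical helper that mirrors the query: one must clause on product_agency, plus one should clause wrapping a must_not on price.

```python
# Hypothetical documents, used only to illustrate the matching logic.
docs = [
    {"id": 1, "product_agency": "打电话", "price": 699.99},
    {"id": 2, "product_agency": "打电话", "price": 899.0},
    {"id": 3, "product_agency": "办公", "price": 899.0},
]


def matches(doc: dict, minimum_should_match: int) -> bool:
    """Mirror of: must term(product_agency) + should bool(must_not term(price))."""
    must_ok = doc["product_agency"] == "打电话"
    # The nested bool/must_not matches every document whose price is NOT 699.99.
    should_hits = 1 if doc["price"] != 699.99 else 0
    return must_ok and should_hits >= minimum_should_match


with_msm = [d["id"] for d in docs if matches(d, minimum_should_match=1)]
without_msm = [d["id"] for d in docs if matches(d, minimum_should_match=0)]
print(with_msm, without_msm)
```

With the parameter set to 0, the 699.99 document still comes back, because the should clause then only influences scoring and no longer filters anything.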

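2.4 Controlling relevance scoring

III. Multi-field queries

There are three typical scenarios:

Best Fields: the fields compete with each other while being related, for example title and content; the score comes from the best-matching field.
Most Fields: a common technique for English content. The main field (English analyzer) extracts word stems and adds synonyms so that more documents match; the same text is also indexed in a sub-field (Standard analyzer) for more precise matching.
Cross Fields: some information, such as names, addresses, or book details, takes several fields to identify, each field being only one part of the whole; the goal is to find as many of the query terms as possible across those fields.

3.1 Best Fields matching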

POST electronics/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": "苹果"
          }
        },
        {
          "match": {
            "desc": "苹果"
          }
        }
      ]
    }
  }
}

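However, this query does not return the best results. Why?
A bool should clause is scored by running each sub-query and adding up the scores of those that match (older versions then also multiplied the sum by the number of matching clauses and divided it by the total number of clauses). That is clearly not ideal here.
title and desc compete with each other, so the score should not be a plain sum; it should come from the single best-matching field.

3.1.1 Dis Max Query

See the official documentation for the full explanation!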

POST electronics/_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.2,
      "queries": [
        {
          "match": {
            "title": "苹果"
          }
        },
        {
          "match": {
            "desc": "苹果"
          }
        }
      ]
    }
  }
}
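
The scoring rule behind dis_max is short enough to sketch in Python: take the highest score among the matching clauses, then add the scores of the other matching clauses multiplied by tie_breaker. The per-field scores below are made up, and both helper functions are hypothetical; the sketch only contrasts this rule with the plain sum that a bool should produces.

```python
from typing import Sequence


def bool_should_score(scores: Sequence[float]) -> float:
    """bool/should style: scores of all matching clauses are added up."""
    return sum(scores)


def dis_max_score(scores: Sequence[float], tie_breaker: float = 0.0,
                  boost: float = 1.0) -> float:
    """dis_max style: best clause score plus tie_breaker times the other scores."""
    if not scores:
        return 0.0
    best = max(scores)
    return boost * (best + tie_breaker * (sum(scores) - best))


# Hypothetical per-field scores for the query "苹果".
weak_in_both = [0.6, 0.6]   # mediocre match in title AND desc
strong_in_one = [1.0]       # strong match in title only

print("sum:     ", bool_should_score(weak_in_both), bool_should_score(strong_in_one))
print("dis_max: ", dis_max_score(weak_in_both), dis_max_score(strong_in_one))
print("tie 0.2: ", dis_max_score(weak_in_both, tie_breaker=0.2))
```

In this toy data the plain sum ranks the two weak matches first, while dis_max prefers the single strong match. A large tie_breaker moves the result back towards the sum, so the value is worth tuning against real data.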

3.1.2 Multi Match Query

POST electronics/_search
{
  "query": {
    "multi_match": {
      "query": "苹果",
      "fields": ["title", "desc"],
      "type": "best_fields",
      "tie_breaker": 0.2
    }
  }
}
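
A multi_match of type best_fields is essentially shorthand for a dis_max over one match query per field. The Python sketch below builds that equivalent request body as a plain dict. best_fields_as_dis_max is a hypothetical helper, and it deliberately ignores extras such as per-field boosts (title^2) or operator.

```python
import json
from typing import List


def best_fields_as_dis_max(query: str, fields: List[str],
                           tie_breaker: float = 0.0) -> dict:
    """Expand a best_fields multi_match into the equivalent dis_max query."""
    return {
        "dis_max": {
            "queries": [{"match": {field: query}} for field in fields],
            "tie_breaker": tie_breaker,
        }
    }


body = {"query": best_fields_as_dis_max("苹果", ["title", "desc"], tie_breaker=0.2)}
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed body can be pasted into Kibana as the payload of POST electronics/_search to compare its hits with those of the multi_match version.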

3.2 Most Fields matching

PUT /blogs
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "std": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}

POST blogs/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }
{ "index": { "_id": 3 }}
{ "title": "I have a dog named bark" }
{ "index": { "_id": 4 }}
{ "title": "I see barking dogs on the street, they are running and playing " }

GET blogs/_search
{
  "query": {
    "match": {
      "title": "barking dogs"
    }
  }
}

GET blogs/_search
{
  "query": {
    "multi_match": {
      "query": "barking dogs",
      "fields": ["title", "title.std"],
      "type": "most_fields"
    }
  }
}

GET blogs/_search
{
  "query": {
    "multi_match": {
      "query": "barking dogs",
      "fields": ["title^10", "title.std"],
      "type": "most_fields"
    }
  }
}
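
With most_fields, the scores of all matching fields are combined, much like a bool query with one should clause per field, and a suffix such as title^10 multiplies that field's contribution. The Python sketch below parses the field^boost notation and builds the roughly equivalent bool query. parse_field and most_fields_as_bool are hypothetical helpers, not part of any Elasticsearch client.

```python
from typing import List, Tuple


def parse_field(spec: str) -> Tuple[str, float]:
    """Split 'title^10' into ('title', 10.0); the boost defaults to 1.0."""
    name, _, boost = spec.partition("^")
    return name, float(boost) if boost else 1.0


def most_fields_as_bool(query: str, fields: List[str]) -> dict:
    """Expand a most_fields multi_match into a bool query of match clauses."""
    should = []
    for spec in fields:
        name, boost = parse_field(spec)
        should.append({"match": {name: {"query": query, "boost": boost}}})
    return {"bool": {"should": should}}


expanded = most_fields_as_bool("barking dogs", ["title^10", "title.std"])
print(expanded)
```

Seen this way, the boost decides which of the two title fields dominates the combined score: the stemmed title field or the more literal title.std sub-field.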

3.3 Cross Fields matching

PUT /address
{
    "settings" : {
        "index" : {
            "analysis.analyzer.default.type": "ik_max_word"
        }
    }
}

PUT /address/_bulk
{ "index": { "_id": "1"} }
{"province": "甘肃","city": "酒泉"}
{ "index": { "_id": "2"} }
{"province": "湖南","city": "常德"}
{ "index": { "_id": "3"} }
{"province": "陕西","city": "西安"}
{ "index": { "_id": "4"} }
{"province": "湖南","city": "邵阳"}
{ "index": { "_id": "5"} }
{"province": "甘肃","city": "武威"}
{ "index": { "_id": "6"} }
{"province": "甘肃","city": "玉门"}

GET address/_search
{
  "query": {
    "multi_match": {
      "query": "甘肃酒泉",
      "fields": ["province", "city"],
      "type": "most_fields"
    }
  }
}

GET address/_search
{
  "query": {
    "multi_match": {
      "query": "甘肃酒泉",
      "fields": ["province","city"],
      "type": "cross_fields",
      "operator": "and"
    }
  }
}
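
The difference between the two requests above comes down to where the query terms are allowed to match. The Python sketch below works on pre-tokenized data: it assumes the analyzer splits "甘肃酒泉" into 甘肃 and 酒泉, which depends on the IK dictionary, and it uses only three of the documents. The three helper functions are hypothetical and ignore scoring entirely.

```python
from typing import Dict, List

Doc = Dict[str, List[str]]

# Field values already split into tokens (assumed analyzer output).
docs: Dict[str, Doc] = {
    "1": {"province": ["甘肃"], "city": ["酒泉"]},
    "2": {"province": ["湖南"], "city": ["常德"]},
    "5": {"province": ["甘肃"], "city": ["武威"]},
}
query_terms = ["甘肃", "酒泉"]  # assumed tokens for the query "甘肃酒泉"


def any_term(doc: Doc, terms: List[str]) -> bool:
    """Default operator (or): one term found in any field is enough."""
    return any(t in tokens for t in terms for tokens in doc.values())


def field_centric_and(doc: Doc, terms: List[str]) -> bool:
    """operator=and per field: a single field must contain every term."""
    return any(all(t in tokens for t in terms) for tokens in doc.values())


def cross_fields_and(doc: Doc, terms: List[str]) -> bool:
    """cross_fields + and: every term must appear in at least one field."""
    return all(any(t in tokens for tokens in doc.values()) for t in terms)


for name, rule in [("or", any_term), ("per-field and", field_centric_and),
                   ("cross_fields and", cross_fields_and)]:
    print(name, [doc_id for doc_id, doc in docs.items() if rule(doc, query_terms)])
```

Only the cross-field rule isolates the one document whose province and city together cover both terms. The or rule also returns other Gansu documents, and a per-field and finds nothing, because no single field holds both terms.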

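# The address index already exists from the previous step: delete it first,
# then re-run the _bulk request above once the index has been recreated.
DELETE /address

PUT /address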
{
  "mappings" : {
      "properties" : {
        "province" : {
          "type" : "keyword",
          "copy_to": "full_address"
        },
        "city" : {
          "type" : "text",
          "copy_to": "full_address"
        }
      }
    },
    "settings" : {
        "index" : {
            "analysis.analyzer.default.type": "ik_max_word"
        }
    }
}

GET /address/_search
{
  "query": {
    "match": {
      "full_address": {
        "query": "湖南常德",
        "operator": "and"
      }
    }
  }
}

3.4 Terms Set queries

GET electronics/_search
{
  "query": {
    "terms_set": {
      "product_agency.keyword":{
        "terms":["打电话","王者","看电影"],
        "minimum_should_match_field":"product_count"
      }
    }
  }
}

[Screenshot: search results]

GET electronics2/_search
{
  "query": {
    "terms_set": {
      "product_agency":{
        "terms":["打电话","办公","看大片"],
        "minimum_should_match_script":{
          "source":"doc['product_count'].value * 0.7"
        }
      }
    }
  }
}

[Screenshot: search results]
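
The matching rule of terms_set with minimum_should_match_field can be sketched in Python as follows: count how many of the query terms a document contains, and compare that count with the number stored in the document's own field. The documents are invented for illustration, terms_set_match is a hypothetical helper, and the script variant (minimum_should_match_script) is not modelled.

```python
from typing import List

# Hypothetical documents; product_count is the per-document threshold.
docs = [
    {"id": 1, "product_agency": ["打电话", "王者"], "product_count": 2},
    {"id": 2, "product_agency": ["打电话", "办公"], "product_count": 2},
    {"id": 3, "product_agency": ["看电影"], "product_count": 1},
]


def terms_set_match(doc: dict, field: str, terms: List[str],
                    minimum_should_match_field: str) -> bool:
    """A document is a hit if it contains at least as many of the query terms
    as the number stored in its own minimum_should_match_field."""
    matched = len(set(terms) & set(doc[field]))
    required = doc[minimum_should_match_field]
    return matched >= required


terms = ["打电话", "王者", "看电影"]
hit_ids = [d["id"] for d in docs
           if terms_set_match(d, "product_agency", terms, "product_count")]
print(hit_ids)
```

Because the threshold is read from each document, two documents can be held to different minimums within the same search.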

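Summary

ES currently sits at the top of the full-text search engine rankings and keeps evolving; it has now reached version 8.x. Few companies still use the Solr search engine: it searches fast but cannot do real-time search, which is awkward, and once you need custom relational-style queries it becomes clumsy to work with.
ES offers many advanced search features that handle complex business scenarios well, and queries can be nested, so you no longer have to keep adding fields, a rather crude way of solving business problems.
[Figure: ES ranked first in full-text search]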