Elasticsearch Term & Phrase Suggester: Search Suggestions

Table of contents

1. What are search suggestions
2. Elasticsearch Suggester API
3. Term Suggester

3.1 missing mode
3.2 popular mode
3.3 always mode

4. Phrase Suggester

1. What are search suggestions

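Modern search engines generally offer a suggest-as-you-type feature.
It helps users by auto-completing or correcting their input as they type. Guiding users toward more precise keywords improves how well documents match in the search that follows.
On Google, for example, the search box auto-completes at first. Once the input reaches a certain length and can no longer be completed, for instance because a word is misspelled, it starts suggesting similar words or sentences instead.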

2. Elasticsearch Suggester API

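Elasticsearch implements this kind of feature through the Suggester API.
How it works: the input text is broken into tokens, and similar terms are then looked up in the indexed field and returned.
Elasticsearch provides four kinds of suggesters for different use cases: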

Term & Phrase Suggester
Completion & Context Suggester
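
The principle described above (split the input into tokens, then look up similar terms in the indexed field) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `INDEXED_TERMS` list and the `suggest_terms` helper are hypothetical, and similarity is measured with the standard library's `difflib` ratio, not with the measure Elasticsearch uses internally.

```python
import difflib

# Hypothetical vocabulary standing in for the terms of an indexed field.
INDEXED_TERMS = ["lucene", "elasticsearch", "rocks", "rock", "elk", "stack"]


def suggest_terms(text, vocabulary=INDEXED_TERMS, limit=3):
    """Split the input into tokens and look up similar terms for each one."""
    suggestions = {}
    for token in text.lower().split():
        # get_close_matches ranks candidates by difflib's similarity ratio.
        # This is an assumption made for illustration, not Elasticsearch's measure.
        matches = difflib.get_close_matches(token, vocabulary, n=limit, cutoff=0.7)
        # Drop the token itself so that only alternative terms remain.
        suggestions[token] = [m for m in matches if m != token]
    return suggestions


print(suggest_terms("lucen rock"))
```

A real suggester works against the terms of the index rather than a hand-written list, but the two steps are the same.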

A basic example (a Completion Suggester on the title_completion field):

DELETE articles
PUT articles
{
  "mappings": {
    "properties": {
      "title_completion":{
        "type": "completion"
      }
    }
  }
}

POST articles/_bulk
{ "index" : { } }
{ "title_completion": "lucene is very cool"}
{ "index" : { } }
{ "title_completion": "Elasticsearch builds on top of lucene"}
{ "index" : { } }
{ "title_completion": "Elasticsearch rocks"}
{ "index" : { } }
{ "title_completion": "elastic is the company behind ELK stack"}
{ "index" : { } }
{ "title_completion": "Elk stack rocks"}


POST articles/_search?pretty
{
  "size": 0,
  "suggest": {
    "article-suggester": {
      "prefix": "elk ",
      "completion": {
        "field": "title_completion"  // return documents whose title_completion value starts with "elk "
      }
    }
  }
}

Response:
{
  "took" : 996,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "article-suggester" : [
      {
        "text" : "elk ",
        "offset" : 0,
        "length" : 4,
        "options" : [
          {
            "text" : "Elk stack rocks",
            "_index" : "articles",
            "_type" : "_doc",
            "_id" : "cTmozXcBVWBR55DpWY-3",
            "_score" : 1.0,
            "_source" : {
              "title_completion" : "Elk stack rocks"
            }
          },
          {
            "text" : "Elk stack rocks",
            "_index" : "articles",
            "_type" : "_doc",
            "_id" : "KjmrzXcBVWBR55DpHJCd",
            "_score" : 1.0,
            "_source" : {
              "title_completion" : "Elk stack rocks"
            }
          }
        ]
      }
    ]
  }
}

3. Term Suggester

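A suggester is a special kind of search. The "text" parameter holds the text supplied by the caller, usually whatever the user typed in the UI.
In this example the user input "lucen" is a misspelling.
The suggester searches the specified field, "body"; when a term cannot be found there (missing), it returns suggested terms.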

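Searching for "lucen rock":
Each suggestion carries a score. Similarity is computed with the Levenshtein edit distance algorithm, whose core idea is to count how many characters must be changed in one word to turn it into another. Many optional parameters are available to control how fuzzy the match may be.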
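
To make the edit-distance idea concrete, here is a small Python sketch of the classic dynamic-programming Levenshtein distance. The `similarity` normalization (1 minus the distance divided by the length of the shorter word) is an assumption: it was chosen because it reproduces the scores in this article's sample responses (0.8 for "lucen" to "lucene", 0.75 for "rock" to "rocks"), and it is not taken from the Elasticsearch documentation.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the shorter word."""
    if not a or not b:
        return 0.0
    return 1 - levenshtein(a, b) / min(len(a), len(b))


for typed, indexed in [("lucen", "lucene"), ("rock", "rocks"), ("hocks", "rocks")]:
    print(typed, indexed, levenshtein(typed, indexed), round(similarity(typed, indexed), 2))
```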
Suggestion modes:

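Missing - suggest only for terms that are not already in the index
Popular - suggest terms that occur more frequently than the input term
Always - suggest regardless of whether the term exists in the index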
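
The three modes differ only in when a candidate is let through. A minimal sketch, assuming the similar-term lookup has already produced candidates: `DOC_FREQ` and `CANDIDATES` are hypothetical stand-ins (the frequencies mirror the sample documents indexed in section 3.1), and the rules follow the descriptions above rather than the actual Elasticsearch implementation.

```python
# Hypothetical document frequencies for a few terms of the "body" field.
DOC_FREQ = {"lucene": 2, "rock": 1, "rocks": 2}

# Hypothetical output of the similar-term lookup for each input term.
CANDIDATES = {"lucen": ["lucene"], "rock": ["rocks"], "rocks": ["rock"]}


def suggest(term, mode):
    """Return the candidates that the given suggest mode lets through."""
    if mode not in ("missing", "popular", "always"):
        raise ValueError(f"unknown suggest mode: {mode}")
    term_freq = DOC_FREQ.get(term, 0)
    # missing: a term that already exists in the index gets no suggestions.
    if mode == "missing" and term_freq > 0:
        return []
    options = CANDIDATES.get(term, [])
    # popular: keep only candidates found in more documents than the input term.
    if mode == "popular":
        options = [c for c in options if DOC_FREQ.get(c, 0) > term_freq]
    # always: every candidate is returned.
    return options


for mode in ("missing", "popular", "always"):
    print(mode, {term: suggest(term, mode) for term in ("lucen", "rock", "rocks")})
```

With these inputs, "rock" gets no suggestion in missing mode but gets "rocks" in popular and always mode, which matches the responses in the following sections.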

3.1 missing mode

DELETE articles

POST articles/_bulk
{ "index" : { } }
{ "body": "lucene is very cool"}
{ "index" : { } }
{ "body": "Elasticsearch builds on top of lucene"}
{ "index" : { } }
{ "body": "Elasticsearch rocks"}
{ "index" : { } }
{ "body": "elastic is the company behind ELK stack"}
{ "index" : { } }
{ "body": "Elk stack rocks"}
{ "index" : {} }
{  "body": "elasticsearch is rock solid"}


POST _analyze
{
  "analyzer": "standard",
  "text": ["Elk stack  rocks rock"]
}
Response:
{
  "tokens" : [
    {
      "token" : "elk",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "stack",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "rocks",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "rock",
      "start_offset" : 17,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

--
POST /articles/_search
{
  "size": 1,
  "query": {
    "match": {
      "body": "lucen rock"
    }
  },
  "suggest": {
    "term-suggestion": {
      "text": "lucen rock",
      "term": {
        "suggest_mode": "missing",
        "field": "body"
      }
    }
  }
}

Response:
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.5904956,
    "hits" : [
      {
        "_index" : "articles",
        "_type" : "_doc",
        "_id" : "UDmvzXcBVWBR55DpdZHm",
        "_score" : 1.5904956,
        "_source" : {
          "body" : "elasticsearch is rock solid"
        }
      }
    ]
  },
  "suggest" : {
    "term-suggestion" : [
      {
        "text" : "lucen",
        "offset" : 0,
        "length" : 5,
        "options" : [
          {
            "text" : "lucene",
            "score" : 0.8,
            "freq" : 2
          }
        ]
      },
      {
        "text" : "rock",
        "offset" : 6,
        "length" : 4,
        "options" : [ ]
      }
    ]
  }
}
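
On the client side, a response like the one above can be turned into a "did you mean" string by replacing each span that has options with its first option. The helper below is a hypothetical sketch, not part of any Elasticsearch client; it relies only on the "offset", "length", and "options" fields shown in the response.

```python
def apply_term_suggestions(text, entries):
    """Replace each suggested span in text with its first option, if any."""
    corrected = text
    # Work from right to left so earlier offsets stay valid after a replacement.
    for entry in sorted(entries, key=lambda e: e["offset"], reverse=True):
        if entry["options"]:
            start = entry["offset"]
            end = start + entry["length"]
            corrected = corrected[:start] + entry["options"][0]["text"] + corrected[end:]
    return corrected


# The "term-suggestion" array copied from the response above.
entries = [
    {"text": "lucen", "offset": 0, "length": 5,
     "options": [{"text": "lucene", "score": 0.8, "freq": 2}]},
    {"text": "rock", "offset": 6, "length": 4, "options": []},
]

print(apply_term_suggestions("lucen rock", entries))  # lucene rock
```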

3.2 popular mode

POST /articles/_search
{

  "suggest": {
    "term-suggestion": {
      "text": "lucen rock",
      "term": {
        "suggest_mode": "popular",
        "field": "body"
      }
    }
  }
}
Response:

{
  "took" : 16,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "term-suggestion" : [
      {
        "text" : "lucen",
        "offset" : 0,
        "length" : 5,
        "options" : [
          {
            "text" : "lucene",
            "score" : 0.8,
            "freq" : 2
          }
        ]
      },
      {
        "text" : "rock",
        "offset" : 6,
        "length" : 4,
        "options" : [
          {
            "text" : "rocks",
            "score" : 0.75,
            "freq" : 2
          }
        ]
      }
    ]
  }
}

3.3 always mode

POST /articles/_search
{

  "suggest": {
    "term-suggestion": {
      "text": "lucen rock",
      "term": {
        "suggest_mode": "always",
        "field": "body",
      }
    }
  }
}
Response:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "term-suggestion" : [
      {
        "text" : "lucen",
        "offset" : 0,
        "length" : 5,
        "options" : [
          {
            "text" : "lucene",
            "score" : 0.8,
            "freq" : 2
          }
        ]
      },
      {
        "text" : "rock",
        "offset" : 6,
        "length" : 4,
        "options" : [
          {
            "text" : "rocks",
            "score" : 0.75,
            "freq" : 2
          }
        ]
      }
    ]
  }
}

POST /articles/_search
{

  "suggest": {
    "term-suggestion": {
      "text": "lucen hocks",
      "term": {
        "suggest_mode": "always",
        "field": "body",
        "prefix_length":0,
        "sort": "frequency"
      }
    }
  }
}

Response:
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "term-suggestion" : [
      {
        "text" : "lucen",
        "offset" : 0,
        "length" : 5,
        "options" : [
          {
            "text" : "lucene",
            "score" : 0.8,
            "freq" : 2
          }
        ]
      },
      {
        "text" : "hocks",
        "offset" : 6,
        "length" : 5,
        "options" : [
          {
            "text" : "rocks",
            "score" : 0.8,
            "freq" : 2
          }
        ]
      }
    ]
  }
}

4. Phrase Suggester

The Phrase Suggester adds some extra logic on top of the Term Suggester.
Some of its parameters:

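Suggest Mode: missing, popular, always
Max Errors: the maximum number of terms that may be misspelled
Confidence: a factor applied to the input phrase's score to set a threshold; only candidates scoring above the threshold are returned. The default is 1; with 0, the top candidates are returned regardless of score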
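
The effect of the confidence parameter can be sketched as a simple threshold filter. This is an illustration of the rule as described in the parameter list, not the Elasticsearch implementation: `filter_by_confidence` is a hypothetical helper, and `INPUT_SCORE` is an invented value, because the score of the input phrase itself does not appear in the response. The candidate scores are taken from the sample response in this section.

```python
def filter_by_confidence(input_score, candidates, confidence=1.0):
    """Keep candidates scoring above confidence * input_score, best first."""
    threshold = confidence * input_score
    kept = [c for c in candidates if c["score"] > threshold]
    return sorted(kept, key=lambda c: c["score"], reverse=True)


# Hypothetical score of the phrase exactly as the user typed it.
INPUT_SCORE = 1.0e-4

candidates = [
    {"text": "lucene and elasticsearch rock hello world", "score": 1.5788074e-4},
    {"text": "lucne and elasticsearch rocks hello world", "score": 1.136111e-4},
    {"text": "lucene and elasticsear rock hello world", "score": 9.2263974e-5},
]

# confidence=1: only candidates that beat the input phrase survive.
print([c["text"] for c in filter_by_confidence(INPUT_SCORE, candidates, confidence=1.0)])
# confidence=0: the threshold is 0, so every candidate is kept.
print([c["text"] for c in filter_by_confidence(INPUT_SCORE, candidates, confidence=0)])
```

This is why the request in this section sets "confidence" to 0: it lets every candidate through, including ones that may score lower than the original input.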

POST /articles/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "lucne and elasticsear rock hello world ",
      "phrase": {
        "field": "body",
        "max_errors":2,
        "confidence":0,
        "direct_generator":[{
          "field":"body",
          "suggest_mode":"always"
        }],
        "highlight": {
          "pre_tag": "",
          "post_tag": ""
        }
      }
    }
  }
}


Response:
{
  "took" : 47,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "my-suggestion" : [
      {
        "text" : "lucne and elasticsear rock hello world ",
        "offset" : 0,
        "length" : 39,
        "options" : [
          {
            "text" : "lucene and elasticsearch rock hello world",
            "highlighted" : "lucene and elasticsearch rock hello world",
            "score" : 1.5788074E-4
          },
          {
            "text" : "lucne and elasticsearch rocks hello world",
            "highlighted" : "lucne and elasticsearch rocks hello world",
            "score" : 1.136111E-4
          },
          {
            "text" : "lucne and elasticsearch rock hello world",
            "highlighted" : "lucne and elasticsearch rock hello world",
            "score" : 1.05567684E-4
          },
          {
            "text" : "lucene and elasticsear rocks hello world",
            "highlighted" : "lucene and elasticsear rocks hello world",
            "score" : 9.929376E-5
          },
          {
            "text" : "lucene and elasticsear rock hello world",
            "highlighted" : "lucene and elasticsear rock hello world",
            "score" : 9.2263974E-5
          }
        ]
      }
    ]
  }
}