ES is suited to large-scale search and is near real-time. It runs as a cluster: the number of primary shards is fixed when the index is created and cannot be changed afterwards, while replicas can be kept in multiple copies to share the read load.
ES is built on Lucene and uses an inverted index: document content is split into terms, and each term maps to the ids of the documents containing it, so a later search for a term directly yields the associated document ids.
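The inverted-index idea can be sketched in a few lines of Python (a toy illustration only, not Lucene's actual data structure; a naive whitespace split stands in for a real analyzer):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # naive tokenization: lowercase + whitespace split
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "elasticsearch is a search engine",
    2: "lucene powers elasticsearch",
    3: "a search engine indexes documents",
}
index = build_inverted_index(docs)
print(sorted(index["search"]))         # doc ids containing "search"
print(sorted(index["elasticsearch"]))  # doc ids containing "elasticsearch"
```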
The first step is to create an index. Since version 6, an index may contain only a single type, _doc. Fields in a type can be added dynamically, but dynamic additions are usually disabled in the mapping.
Creating an index
When creating an index you can specify the number of primary shards and replicas, and define analyzers, tokenizers, filters, and so on. For example:
{
    "settings": {
        "index": {
            "max_ngram_diff": 7,
            "max_result_window": 1000000
        },
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "tokenizer": "ngram_tokenizer",
                    "filter": ["lowercase", "cjk_width"]
                }
            },
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 1,
                    "max_gram": 8
                }
            },
            "normalizer": {
                "lowercase": {
                    "type": "custom",
                    "filter": ["lowercase", "cjk_width"]
                }
            }
        },
        "number_of_replicas": 1
    },
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "clientClassify": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256,
                        "normalizer": "lowercase"
                    }
                },
                "analyzer": "ngram_analyzer"
            },
            "clientClassifyPinYin": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256,
                        "normalizer": "lowercase"
                    }
                },
                "analyzer": "ngram_analyzer"
            },
            "clientClassifyPY": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256,
                        "normalizer": "lowercase"
                    }
                },
                "analyzer": "ngram_analyzer"
            },
            "clientClassifyId": {
                "type": "keyword"
            }
        }
    }
}
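The ngram tokenizer configured above emits every substring whose length is between min_gram and max_gram (which is why the index sets max_ngram_diff to 7: with min_gram 1 and max_gram 8 the spread exceeds the default limit of 1). A rough Python sketch of that behavior, for illustration only:

```python
def ngram_tokens(text, min_gram=1, max_gram=8):
    """Emit every substring of length min_gram..max_gram, by start offset."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                tokens.append(text[start:start + size])
    return tokens

# with min_gram=1, max_gram=2 the text "abc" yields a, ab, b, bc, c
print(ngram_tokens("abc", 1, 2))
```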
Common ES data types
- text: string, usually analyzed (split into terms)
- keyword: string, not analyzed; suitable for things like Java enums
- date: date type
- double: floating-point number
- integer: integer
and so on.
Importing data
The ETL imports data through the ES REST API; the payload is JSON. Details omitted.
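One common import path is the _bulk endpoint, which takes newline-delimited JSON: an action metadata line followed by the document source line for each document, with a trailing newline. A minimal sketch of building such a payload (the index name and documents here are made up for illustration):

```python
import json

def build_bulk_body(index_name, docs):
    """Build an NDJSON body for POST /_bulk: action line + source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

body = build_bulk_body("sales_order_prod_0007", [
    {"clientClassifyId": "1", "clientClassify": "wholesale"},
    {"clientClassifyId": "2", "clientClassify": "retail"},
])
print(body)
# POST this body to /_bulk with Content-Type: application/x-ndjson
```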
ES search
Search via the _search REST endpoint, usually building the request body with the query DSL. For example:
{
    "from": 0, // pagination offset
    "size": 1000, // page size
    "timeout": "60s", // search timeout
    "query": {
        "bool": {
            "filter": [ // filters the data without scoring; the other bool clauses are must, should and must_not
                {
                    "term": { // a term query does not analyze the search input, while match does; others include match_phrase, prefix and range queries
                        "ownerId": {
                            "value": "8", // filter by ownerId
                            "boost": 1.0
                        }
                    }
                },
                {
                    "bool": {
                        "should": [
                            { // additionally filter by orderNumber and productName; ANDed with the ownerId filter
                                "term": {
                                    "orderNumber": {
                                        "value": "zengl",
                                        "boost": 1.0
                                    }
                                }
                            },
                            {
                                "term": {
                                    "productName": {
                                        "value": "袜子",
                                        "boost": 1.0
                                    }
                                }
                            }
                        ],
                        "adjust_pure_negative": true,
                        "minimum_should_match": "1", // matching either orderNumber or productName is enough
                        "boost": 1.0
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1.0
        }
    },
    "_source": {
        "includes": [ // fields to return
            "*"
        ],
        "excludes": []
    },
    "sort": [ // sort order
        {
            "orderDate": {
                "order": "desc"
            }
        },
        {
            "id": {
                "order": "desc"
            }
        }
    ],
    "track_scores": true, // still compute scores even though results are sorted
    "track_total_hits": 2147483647 // count total hits exactly up to this threshold
}
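A request like the one above can also be assembled programmatically. The helper below and its parameters are hypothetical; it only mirrors the filter + should structure of the query shown:

```python
import json

def bool_filter_query(term_filters, should_terms, minimum_should_match=1):
    """Build a bool query: ANDed term filters, plus an ORed should sub-clause."""
    filters = [{"term": {f: {"value": v}}} for f, v in term_filters]
    if should_terms:
        filters.append({
            "bool": {
                "should": [{"term": {f: {"value": v}}} for f, v in should_terms],
                "minimum_should_match": str(minimum_should_match),
            }
        })
    return {"query": {"bool": {"filter": filters}}}

q = bool_filter_query(
    term_filters=[("ownerId", "8")],
    should_terms=[("orderNumber", "zengl"), ("productName", "袜子")],
)
print(json.dumps(q, ensure_ascii=False, indent=2))
```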
The response JSON looks like:
{
    "took": 89, // time taken in ms
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 65569, // number of matching documents
            "relation": "eq"
        },
        "max_score": 0.0,
        "hits": [ // the actual results
            {
                "_index": "sales_order_prod_0007",
                "_type": "_doc",
                "_id": "QFqrG3UBTeK-WjKzYaLy",
                "_score": 0.0,
                "_source": {
                    "colorName": [
                        "增量颜色"
                    ]
                }
            }
        ]
    }
}
ES aggregations
Not used in our system so far; omitted.
ES data updates
_update_by_query updates all documents matching a query. The following example uses a painless script to perform the update:
{
    "query": {
        "bool": {
            "filter": [
                {
                    "term": {
                        "ownerId": "2366"
                    }
                },
                {
                    "term": {
                        "productId": "82327"
                    }
                }
            ]
        }
    },
    "script": {
        "lang": "painless",
        // "inline": "Debug.explain(ctx._source.productId[0])",
        // note: newer ES versions use "source" instead of the deprecated "inline"
        "inline": "for(int i=0;i<ctx._source.productId.size();i++){if(ctx._source.productId[i] == params.productId){ctx._source.productName[i] = params.productName; ctx._source.productNamePY[i] = params.productNamePY; ctx._source.productNamePinYin = params.productNamePinYin;}}",
        "params": {
            "ownerId": "2366",
            "productId": "82327",
            "productName": "围巾3",
            "productNamePY": "wj3",
            "productNamePinYin": "weijin3"
        }
    }
}
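What the painless script does can be restated in Python. This is a sketch only: it assumes productId, productName and productNamePY are parallel arrays in _source, and it mirrors the original script exactly, including the fact that productNamePinYin is overwritten as a whole field rather than at index i. The sample document is invented for illustration:

```python
def apply_update(source, params):
    """Mimic the painless loop: update parallel arrays where productId matches."""
    for i in range(len(source["productId"])):
        if source["productId"][i] == params["productId"]:
            source["productName"][i] = params["productName"]
            source["productNamePY"][i] = params["productNamePY"]
            # the original script assigns the whole field here, not element i
            source["productNamePinYin"] = params["productNamePinYin"]
    return source

doc = {
    "productId": ["82327", "90001"],
    "productName": ["围巾", "帽子"],
    "productNamePY": ["wj", "mz"],
    "productNamePinYin": ["weijin", "maozi"],
}
params = {"productId": "82327", "productName": "围巾3",
          "productNamePY": "wj3", "productNamePinYin": "weijin3"}
apply_update(doc, params)
print(doc["productName"])
```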
By default an update only becomes visible to search after the next refresh (1s by default). Adding refresh=true to the request forces an immediate refresh so the change is searchable right away, but this hurts performance.
ES data deletion
_delete_by_query deletes all documents matching a query.
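A minimal _delete_by_query body sketch, reusing the term-filter shape from the update example above (the field and value are illustrative); it is POSTed to /&lt;index&gt;/_delete_by_query:

```python
import json

# every document matching this query gets deleted
delete_body = {
    "query": {
        "term": {
            "ownerId": "2366"
        }
    }
}
print(json.dumps(delete_body, indent=2))
```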
ES update conflicts
When an update reports a version conflict, you can retry based on the returned result. Our system currently serializes its updates, so this problem does not arise.