书籍学习-一本书讲透Elasticsearch：原理、进阶和工程实践（二）

本文是个人学习书籍《一本书讲透Elasticsearch：原理、进阶和工程实践》过程中所记录的一些笔记，内容来源于书籍

Elasticsearch文档

新增文档
- 单条写入（PUT /_doc/[id]）
- 批量写入（POST [index]/_bulk）
删除文档
- 单个删除（DELETE /_doc/）
- 批量删除（POST /_delete_by_query）
修改/更新文档
- 前置条件：mapping内的_source字段必须设置为true，这也是默认设置。如果将其手动设置为false，执行update会报错
- 单个文档部分更新
  - 在原有基础上新增字段（POST /_update/，“doc”）
  - 在原有字段基础上部分修改字段值（POST /_update/，“script”）
  - 存在则更新，不存在则插入给定值（POST /_update/，“upsert”）
  - 全部文档更新（PUT /_doc/）
  PS：Mapping不会因为文档的写入或更新操作而导致收缩，除非通过reindex操作将数据迁移到新的索引上
- 批量文档更新
  - 基于Painless脚本的批量更新（POST /_update_by_query，”script”）
  - 基于ingest预处理管道的批量更新（POST /_update_by_query?pipeline=）
  - 取消更新
    - 获取更新任务的任务ID（GET tasks?detailed=true&actions=*by_query）
    - 查看任务ID，了解任务详情（GET /_tasks/）
    - 取消更新任务（POST tasks//cancel）

迁移文档（reindex）

同集群索引之间的全量数据迁移

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
POST _reindex
{
	"conflicts": "proceed",
	"source": {
		"index": "source_index"
	},
	"dest": {
		"index": "dest_index"
	}
}

同集群索引之间基于特定条件的数据迁移

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
POST _reindex
{
	"source": {
		"index": "source_index",
		"query": {
			"term": {
				"user": "horin"
			}
		}
	},
	"dest": {
		"index": "dest_index"
	}
}

基于script脚本的索引迁移

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
POST _reindex
{
	"source": {
		"index": "source_index"
	},
	"dest": {
		"index": "dest_index",
		"version_type": "external"
	},
	"script": {
		"lang": "painless",
		"source": "if (ctx._source.user == 'horin') { ctx._source.remove('user') }"
	}
}

基于预处理管道的数据迁移

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
POST _reindex
{
	"source": {
		"index": "source_index"
	},
	"dest": {
		"index": "dest_index",
		"pipeline": "my_pipeline",
	}
}

不同集群索引之间的数据迁移

elasticsearch.yml配置文件配置白名单
1
reindex.remote.whitelist: "xxx"

同普通的跨索引数据迁移的实现方式一致

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
POST _reindex
{
	"source": {
		"remote": {
			"host": "http://otherhost:9200"
		},
		"index": "source_index"
	},
	"dest": {
		"index": "dest_index"
	}
}

查看及取消reindex任务

1
2
3
4
5
6
# 获取reindex相关任务
GET _tasks?detailed=true&actions=*reindex
# 获取任务ID及相关信息
GET /_tasks/<taskId>
# 取消任务
POST /_tasks/<taskId>/_cancel

业务零掉线情况下的数据迁移：reindex + 别名

Elasticsearch脚本

应用场景

自定义字段

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# 截取返回日期格式中的年份
POST my_index/_search
{
	"script_fields": {
		"insert_year": {
			"script": {
				"source": "doc['insert_year'].value.getYear()"
			}
		}
	}
}

自定义评分

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
# 自定义评分检索
POST my_index/_search
{
	"function_score": {
		"script_score": {
			"script": {
				"lang": "expression",
				"source": "_score * doc['popularity']"
			}
		}
	}
}

自定义更新

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
# 已有字段更新为其他字段
POST my_index/_update/1
{
	"script": {
		"lang": "painless",
		"source": """
			ctx._source.theatre = params.theatre
		""",
		"params": {
			"theatre": "jingju"
		}
	}
}

自定义reindex

自定义聚合

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
POST my_index/_search
{
	"aggs": {
		"terms_aggs": {
			"terms": {
				"script": {
					"lang": "painless",
					"source": "doc['popularity'].value"
				}
			}
		}
	}
}

其他自定义操作

Elasticsearch检索

常用检索

全文检索
- match
  - 应用于高召回率和结果精准度要求较低的场景
    1 2 3 4 5 6 7 8
    POST my_index/_search { "query": { "match": { "title": "乌兰新闻" } } }
- match_phrase（短语检索）
  - 适用于注重精准度的召回场景，match_phrase检索要求查询的词条顺序和文档中的词条顺序保持一致，强调短语的完整性和顺序
    1 2 3 4 5 6 7 8 9 10
    POST my_index/_search { "query": { "match_phrase": { "title": { "query": "乌兰新闻" } } } }
- match_phrase_prefix（短语前缀检索）
  - 查询词语需要按顺序匹配文档中的内容，同时允许最后一个词语只匹配其前缀
    1 2 3 4 5 6 7 8
    POST my_index/_search { "query": { "match_phrase_prefix": { "title": "乌兰新" } } }
- multi_match（多字段检索）
  - 适用于在多个字段上执行match检索的场景
    1 2 3 4 5 6 7 8 9
    POST my_index/_search { "query": { "multi_match": { "query": "乌兰", "fields": ["title^3", "message"] } } }
- query_string（支持与或非表达式的检索）
  - 允许用户使用Lucene查询语法直接编写复杂的查询表达式
    1 2 3 4 5 6 7 8 9
    POST my_index/_search { "query": { "query_string": { "default_field": "title", "query": "乌兰 AND 新闻" } } }
- simple_query_string
  - simple_query_string和query_string的区别
    - simple_query_string对语法的核查并不严格
    - simple_query_string是一种简单的查询语法，只支持单词查询、短语查询或者包含查询，不支持使用通配符和正则表达式
- match boolean prefix
- intervals
- combined fields

精确匹配

term（单字段精确匹配）
- 应用于单字段精准匹配的场景
- term检索针对的是非text类型，用于text类型时并不会报错，但检索结果一般会达不到预期
  1 2 3 4 5 6 7 8 9 10
  POST my_index/_search { "query": { "term": { "name": { "value": "horin" } } } }

terms（多字段精确匹配）

应用于多值精准匹配场景，例如筛选出具有多个特定标签或状态的项目

1
2
3
4
5
6
7
8
POST my_index/_search
{
	"query": {
		"terms": {
			"tags": ["weibo", "wechat"]
		}
	}
}

range（范围检索）

适合对数字、日期或其他可排序数据类型的字段进行范围筛选

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
POST my_index/_search
{
	"query": {
		"range": {
			"popular_degree": {
				"gte": 10,
				"lte": 100
			}
		}
	}
}

exists（是否存在检索）

适用于检查文档中是否存在某个字段，或者该字段是否包含非空值

1
2
3
4
5
6
7
8
POST my_index/_search
{
	"query": {
		"exists": {
			"field": "title.keyword"
		}
	}
}

wildcard（通配符检索）

适用于对部分已知内容的文本字段进行模糊检索

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
POST my_index/_search
{
	"query": {
		"wildcard": {
			"title.keyword": {
				"value": "*乌兰*"
			}
		}
	}
}

prefix（前缀匹配检索）

用于检索以特定字符或字符串作为名称开头的文档

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
POST my_index/_search
{
	"query": {
		"prefix": {
			"title.keyword": {
				"value": "乌兰"
			}
		}
	}
}

terms set

可以检索匹配一定数量给定词项的文档，其中匹配的数量可以是固定值，也可以是基于另一个字段的动态值

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
POST my_index/_search
{
	"query": {
		"terms_set": {
			"tags": {
				"terms": ["喜剧", "动作", "科幻"],
				"minimum_should_match_field": "tags_count"
			}
		}
	}
}

fuzzy（支持编辑距离的模糊查询）

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
POST my_index/_search
{
	"query": {
		"fuzzy": {
			"title": {
				"value": "language"
			}
		}
	}
}

IDs

基于给定的ID组快速召回相关数据

1
2
3
4
5
6
7
8
POST my_index/_search
{
	"query": {
		"ids": {
			"values": ["1", "2", "3"]
		}
	}
}

regexp（正则匹配检索）

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
POST my_index/_search
{
	"query": {
		"regexp": {
			"product_name.keyword": {
				"value": "Lap.."
			}
		}
	}
}

多表关联检索
- Nested（Nested嵌套类型检索）
- Has child（子文档查询父文档）
- Has parent（父文档查询子文档）
- Parent ID
组合检索
- bool（组合检索）
  - must：查询结果必须满足指定条件
  - must_not：查询结果必须不满足指定条件
  - filter：过滤条件
  - should：查询结果可以满足的部分条件，具体满足条件的最小数量由minimum_should_match参数控制
- boosting（提升评分）
- constant score
- disjunction max
- function_score（自定义评分）

相对不常用检索
- 经纬度检索
- 形状类型检索
- 跨度检索
- 特定检索
  - script（脚本检索）
  - script_score（脚本评分检索）
  - more like this（相似度检索）
  - percolate
  - rank feature
  - wrapper
  - pinned query
  - distance feature
query和filter的区别
- query用于评估文档相关性，并对结果进行评分，通常用于搜索场景
- filter用于筛选文档，不会对文档评分，通常用于过滤场景

高亮、排序和分页

高亮语法

fragment_size：每个高亮片段的字符数
number_of_fragments：高亮最大片段数，如果片段数设置为0，则不返回任何高亮片段，而是将整个字段内容突出显示并返回，同时fragment_size将被忽略。默认值为5

fields：待高亮字段

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
POST kibana_sample_data_ecommerce/_search
{
"_source": ["products.product_name"], 
"query": {
	"match": {
	"products.product_name": "dress"
	}
},
"highlight": {
	"number_of_fragments": 0,
	"fragment_size": 150,
	"fields": {
	"products.product_name": {
		"pre_tags": ["<em>"],
		"post_tags": ["</em>"]
	}
	}
}
}

排序语法

_score：按分数排序

_doc：按索引顺序排序（通常用在scroll遍历上）

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
POST my_index/_search
{
	"query": {
		"match": {
			"title": "乌兰"
		}
	},
	"sort": [
		{
			"popular_degree": {
				"order": "desc"
			}
		},
		{
			"_score": {
				"order": "asc"
			}
		}
	] 
}

分页语法
- from：表示结果集的起始位置，从0开始，默认值为0
- size：表示每页返回的文档数量，默认值为10
  1 2 3 4 5 6 7 8
  POST my_index/_search { "from": 0, "size": 10, "query": { "match_all": {} } }

自定义评分

Index Boost：在索引层面修改相关度

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
POST my_index_*/_search
{
	"indices_boost": [
		{
			my_index_1: 1.5
		},
		{
			my_index_2: 1.2
		}
	],
	"query": {
		"term": {
			"subject.keyword": {
				"value": "subject"
			}
		}
	}
}

boosting：修改文档相关度

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
POST my_index/_search
{
	"query": {
		"bool": {
			"must": [
				{
					"match": {
						"title": {
							"query": "新闻",
							"boost": 3
						}
					}
				}
			]
		}
	}
}

negative_boost：降低相关度

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
POST my_index/_search
{
	"query": {
		"boosting": {
			"positive": {
				"match": {
					"title": "乌兰"
				}
			},
			"negative": {
				"match": {
					"title": "新闻"
				}
			},
			"negative_boost": 0.1
		}
	}
}

function_score：自定义评分

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
POST my_index/_search
{
	"query": {
		"function_score": {
			"query": {
				"match_all": {}
			},
			"script_score": {
				"script": {
					"source": "_score * (doc['sales'].value + doc['vistors'].value)"
				}
			}
		}
	}
}

rescore_query：查询后二次打分

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
POST my_index/_search
{
	"query": {
		"match": {
			"title": "乌兰"
		}
	},
	"rescore": {
		"window_size": 50,
		"query": {
			"rescore_query": {
				"function_socre": {
					"script_score": {
						"script": {
							"source": "doc['popular_degree'].value"
						}
					}
				}
			}
		}
	}
}

检索模板

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
# 定义检索模板
POST _scripts/cur_search_template
{
	"script": {
		"lang": "mustache",
		"source": {
			"query": {
				"match": {
					"{{cur_field}}": "{{cur_value}}"
				}
			},
			"size": "{{cur_size}}"
		}
	}
}
# 基于检索模板进行检索
POST my_index/_search/template
{
	"id": "cur_search_template",
	"params": {
		"cur_field": "item_id",
		"cur_value": 1,
		"cur_size": 50
	}
}

分页查询

from+size查询
- 优缺点
  - 优点
    - 支持随机翻页
  - 缺点
    - 限于max_result_window设置，不能无限制翻页
    - 存在深度翻页问题，越往后翻页越慢
- 适用场景
  - 小型数据集或者从大数据集中返回Top N(N≤10000)结果集的业务场景
  - 主流PC搜索引擎中支持随机跳转分页的业务场景
- 不适用场景
  - 过度分页或一次请求太多结果
    - 原因：搜索请求通常会跨多个分片，每个分片必须将其请求的命中内容以及先前页面的命中内容加载到内存中，会显著增加内存和CPU使用率，导致性能下降

search_after查询

执行步骤

创建PIT视图

1
POST kibana_sample_data_logs/_pit?keep_alive=5m

创建基础查询语句

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
GET /_search
{
	"size": 10,
	"query": {
		"match": {
			"host": "elastic"
		}
	},
	"pit": {
		# 上一步请求返回的id
		"id": "xxx"
	},
	"sort": {
		"response.keyword": "asc"
	}
}

实现后续翻页

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
GET /_search
{
	"size": 10,
	"query": {
		"match": {
			"host": "elastic"
		}
	},
	"pit": {
		"id": "xxx"
	},
	"sort": {
		"response.keyword": "asc"
	},
	# 上一步返回的sort结果
	"search_after": [
		"200", 4
	]
}

优缺点
- 优点
  - 不严格受制于max_result_window（单次请求值不能超过max_result_window），可以无限地往后翻页
- 缺点
  - 只支持向后翻页，不支持随机翻页
适用场景
- 适合在手机端应用的场景中使用

scroll查询
- 执行步骤
  - 指定检索语句的同时设置scroll上下文保留时间
    1 2 3 4 5 6 7 8 9
    POST kibana_sample_data_logs/_search?scroll=3m { "size": 100, "query": { "match": { "host": "elastic" } } }
  - 向后翻页，继续获取数据，直到没有要返回的结果为止
    1 2 3 4 5 6
    POST _search/scroll { "sroll": "3m", # 上一步请求返回的id "scroll_id": "xxx" }
- 优缺点
  - 优点
    - 支持全量遍历，是检索大量文档的重要方法，但单次遍历的size值不能超过max_result_window的大小
  - 缺点
    - 响应是非实时的；保留上下文需要具有足够的堆内存空间；需要通过更多的网络请求才能获取所有结果
- 适用场景
  - 大量文档检索：当要检索的文档数量很大，甚至需要全量召回数据时，scroll查询是一个很好的选择
  - 大量文档的数据处理：滚动API适合对大量文档进行数据处理，例如索引迁移或将数据导入其他技术栈

Elasticsearch聚合

聚合分类

分桶聚合：用于将数据分组
- Terms分桶聚合（分组聚合结果）
- Range范围聚合（分区间聚合）
- Histogram直方图聚合（间隔聚合）
- Date histogram日期聚合（时间间隔聚合）
- Date range日期范围聚合（自定义日期范围聚合）
- Composite组合聚合（支持聚合后分页）
- Filters过滤聚合（满足给定过滤条件的聚合）
  1 2 3 4 5 6 7 8 9 10 11
  POST my_index/_search { "size": 0, "aggs": { "color_terms_agg": { "terms": { "field": "color" } } } }
指标聚合：用于计算数据的指标
- Avg平均值聚合（求平均值）
- Sum汇总聚合（求汇总之和）
- Max最大值聚合（求最大值）
- Min最小值聚合（求最小值）
- Stats统计聚合（求统计结果值）
- Top hits详情聚合（求各外层桶的详情）
- Cardinality去重聚合（去重）
- Value count计数聚合（计数）
  1 2 3 4 5 6 7 8 9 10 11
  POST my_index/_search { "size": 0, "aggs": { "max_agg": { "max": { "field": "size" } } } }

管道子聚合：对其他聚合的结果进行再次计算和分析，用于对数据进行复杂的分析

Bucket selector选择子聚合
Bucket sort排序子聚合
Max bucket最大值子聚合
Min bucket最小值子聚合
Stats bucket统计子聚合

Sum bucket求和子聚合

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
POST my_index/_search
{
"size": 0, 
"aggs": {
	"hole_terms_agg": {
	"terms": {
		"field": "has_hole"
	},
		"aggs": {
			"max_value_agg": {
				"max": {
					"field": "size"
				}
			}
		}
	},
	"max_hole_color_agg": {
		"max_bucket": {
			"buckets_path": "hole_terms_agg>max_value_agg"
		}
	}
}
}

书籍学习-一本书讲透Elasticsearch：原理、进阶和工程实践（二）

Elasticsearch文档

Elasticsearch脚本

Elasticsearch检索

Elasticsearch聚合

相关文章：