analyzer（分析器） · Elasticsearch 5.4 中文文档

# analyzer（分析器）原文链接 : [https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer.html) 译文链接 : [http://www.apache.wiki/pages/editpage.action?pageId=9405573](http://www.apache.wiki/pages/editpage.action?pageId=9405573) 贡献者 : [程威](/display/~chengwei)，[ApacheCN](/display/~apachecn)，[Apache中文网](/display/~apachechina) `[analyzed](https://www.elastic.co/guide/en/elasticsearch/reference/5.3/mapping-index.html)（`被分析）的 **string** **fields**（字符串字段）的值通过 [`analyzer`](https://www.elastic.co/guide/en/elasticsearch/reference/5.3/analysis.html)（分析器）来传递，将字符串转换为一串 **`tokens`**（标记）标记或者 **`terms`**（词条）。例如，基于某种分析器，字符串 "**The quick Brown Foxes**" 被解析为 : **`quick `**`，`**`brown`，`fox `**`。`这些是索引该字段的实际 **`terms`**（词条），可以用来有效地搜索大块文本内的单个单词。这样的分析过程不仅发生在索引的时候，而且在查询时也需要 : 查询字符串需要通过相同（或类似的）**`analyzer `**分析器传递，以便尝试查找那些存在于索引的相同格式的 **`terms`**（词条）。 **Elasticsearch **内置了许多 [`pre-defined analyzers`](https://www.elastic.co/guide/en/elasticsearch/reference/5.3/analysis-analyzers.html)（预定义的分析器），可以在不进一步配置的情况下使用。它还附带许多 [`character filters`](https://www.elastic.co/guide/en/elasticsearch/reference/5.3/analysis-charfilters.html)（字符过滤器），[`tokenizers`](https://www.elastic.co/guide/en/elasticsearch/reference/5.3/analysis-tokenizers.html)（分词器）和[`Token Filters`](https://www.elastic.co/guide/en/elasticsearch/reference/5.3/analysis-tokenfilters.html)（标记过滤器）。可以用来组合配置每个索引的自定义`analyzer`（分析器）。每一个查询，每一个字段或索引都可以指定分析器，在索引的时候，**Elasticsearch **将按以下顺序查找 **`analyzer`**（分析器）: * 定义在字段映射中的 **`analyzer`**（分析器）。 * 索引设置中 **`default`**（默认）的 **`analyzer`**（分析器）。 * [`standard`](https://www.elastic.co/guide/en/elasticsearch/reference/5.3/analysis-standard-analyzer.html)（标准的）**`analyzer`**（分析器）。在查询时，还有几层 : * 在 [`full-text query`](https://www.elastic.co/guide/en/elasticsearch/reference/5.3/full-text-queries.html)（全文查找）中定义的 **`analyzer`**（分析器）。 * 在字段映射中定义的 **`search_analyzer`**（搜索分析器）。 * 在字段映射中定义的 **`analyzer`**（分析器）。 * 在索引配置中 **`default_search`**（默认搜索的）**`analyzer`**（分析器）。 * 索引设置中 **`default`**（默认）的 **`analyzer`**（分析器）。 * [`standard`](https://www.elastic.co/guide/en/elasticsearch/reference/5.3/analysis-standard-analyzer.html)（标准的）**`analyzer`**（分析器）。为特定字段指定分析器的最简单的方法是在字段映射中进行定义，如下所示 : ``` curl -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d' { "mappings": { "my_type": { "properties": { "text": { # 1 "type": "text", "fields": { "english": { # 2 "type": "text", "analyzer": "english" } } } } } } } ' curl -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d' # 3 { "field": "text", "text": "The quick Brown Foxes." } ' curl -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d' # 4 { "field": "text.english", "text": "The quick Brown Foxes." } ' ``` | 1 | `**text**` 字段使用默认的 **`standard`**（标准的）分析器。 | | 2 | **`text.english` **多字段使用 **`english` **分词器，可以删除 **`stop words`**（停用词）并应用于 **`stemming` **词干。 | | 3 | 返回 **`tokens`**（标记）: [**`the`**，**`quick`**，**`brown`**，**`foxes`**]。 | | 4 | 返回 **`tokens`**（标记）: [**`quick`**，**`brown`**，**`fox`**]。 | ## search_quote_analyzer（搜索引用分析器） `该 **search_quote_analyzer **`设置允许你为短语指定 **`analyzer`**（分析器），这在处理禁用短语的 **`stop words`**（停用词）时特别有用。要使用三个 **`analyzer`**（分析器）设置来禁用短语的停用词 : 1. 一个 **`analyzer`**（分析器）设置成索引所有的 **`terms`**（词条）包括 **`stop words`**（停用词）。 2. 一个 **`search_analyzer `**设置成将移除 **`stop words`**（停用词）的非短语查询。 3. 一个 `**search_quote_analyzer** `设置不会移除 **`stop words`**（停用词）的短语查询。 ``` curl -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d' { "settings":{ "analysis":{ "analyzer":{ "my_analyzer":{ # 1 "type":"custom", "tokenizer":"standard", "filter":[ "lowercase" ] }, "my_stop_analyzer":{ # 2 "type":"custom", "tokenizer":"standard", "filter":[ "lowercase", "english_stop" ] } }, "filter":{ "english_stop":{ "type":"stop", "stopwords":"_english_" } } } }, "mappings":{ "my_type":{ "properties":{ "title": { "type":"text", "analyzer":"my_analyzer", # 3 "search_analyzer":"my_stop_analyzer", # 4 "search_quote_analyzer":"my_analyzer" # 5 } } } } } ' ``` ``` PUT my_index/my_type/1 { "title":"The Quick Brown Fox" } PUT my_index/my_type/2 { "title":"A Quick Brown Fox" } GET my_index/my_type/_search { "query":{ "query_string":{ "query":"\"the quick brown fox\"" # 1 } } } ``` | 1 | **`my_analyzer` **分析器，用于标识所有 `terms`（词条）包括 **`stop words`**（停用词）。 | | 2 | 移除 `**stop** **words**`（停用词）的 **`my_stop_analyzer` **分析器。 | | 3 | **`analyzer`**（分析器）设置指向将在索引时使用的 **`my_analyzer` **分析器。 | | 4 | **`search_analyzer` **设置指向 **`my_stop_analyzer`**，并移除非短语查询的 **`stop words`**（停用词）。 | | 5 | **`search_quote_analyzer` **设置指向 **`my_analyzer` **分析器，并确保 **`stop words`**（停用词）不会从短语查询中移除。 | | 1 | 由于查询时用括号括起来的,因此它被检测为短语查询。因此 **`search_quote_analyzer` **会启动并确保停用词不会从查询中移除。**`my_analyzer` **分析器将返回与其中一个文档相匹配的 **`terms`**（词条）[`**the**,``**quick**,``**brown**,`**`fox`**]。同时，将通过 **`my_stop_analyzer` **分析器分析 **`terms`**（词条）查询，该分析器将过滤掉 **`stop words`**（停用词）。因此，搜索 **`The quick brown fox`** 或 **`A quick brown fox`** 将返回两个文档，因为这两个文档都包含以下 **`tokens`**（词元）[`**quick**,``**brown**,`**`fox`**]。没有 **`search_quote_analyzer`**，将不可能对 **phrase** **queries**（短语查询）做到精确匹配，因为短语查询时 **`stop words`**（停用词）会被删除，从而导致两个文档都会被匹配到。 |