Specifying an Analyzer · Elasticsearch 7.x

When Elasticsearch detects a new string field in your documents, it automatically maps it as a full-text string field and analyzes it with the <mark>standard analyzer</mark>. You will not always want this. You may want a different analyzer that suits the language of your data. Sometimes you want a string field to be just a string field: no analysis, with the exact value you passed in indexed as-is, for example a user ID, an internal status field, or a tag. To achieve this, we have to specify the mapping for these fields manually.

[TOC]

# 1. IK Analyzer

The default ES analyzer cannot recognize Chinese words; it simply splits each character into its own token.

```json
GET /_analyze
{
  "text": "测试单词"
}

# The response is as follows:
{
  "tokens" : [
    {
      "token" : "测",              # token: the term actually stored in the index
      "start_offset" : 0,          # start_offset and end_offset: character positions in the original string
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0               # position: the order in which the term appears in the original text
    },
    {
      "token" : "试",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "单",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "词",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    }
  ]
}
```

This result clearly does not meet our needs, so we need to download the Chinese analyzer whose version matches our ES version. The steps are as follows:

**1. Download the IK analyzer**

https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.8.0

![](https://img.kancloud.cn/84/fa/84fa9f7d04c9bb8bd8cd34488c2c1b1a_1291x268.png)

**2. Unzip it into the `%ES_HOME%/plugins/` directory**

![](https://img.kancloud.cn/1a/d6/1ad69744db8da28f7f3bba6210bbbe39_1458x167.png)

**3. Restart ES**

**4. Specify the IK analyzer**

* `ik_max_word`: splits the text at the finest granularity.
* `ik_smart`: splits the text at the coarsest granularity.

```json
GET /_analyze
{
  "text": "测试单词",
  "analyzer": "ik_max_word"
}

# The response is as follows:
{
  "tokens" : [
    {
      "token" : "测试",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "单词",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}
```

<br/>

The analyzer's vocabulary can also be extended. The query below returns only one token per character; what we want is for the analyzer to recognize 弗雷尔卓德 as a single word.

```json
GET /_analyze
{
  "text": "弗雷尔卓德",
  "analyzer": "ik_max_word"
}

# The response is as follows:
{
  "tokens" : [
    {
      "token" : "弗",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "雷",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "尔",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "卓",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "德",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 4
    }
  ]
}
```

To make the analyzer recognize 弗雷尔卓德 as a word, do the following.

**1. Create a `%ES_HOME%/plugins/<IK analyzer directory>/config/**.dic` file**

Create a file named `custom.dic` (the file name is up to you) and write the strings that should be treated as Chinese words into it, one word per line.

![](https://img.kancloud.cn/f0/81/f0815c0c9337db266cb17b0681b4681c_1254x273.png)

```
弗雷尔卓德
测试单词
```

**2. Register `custom.dic` in the file `%ES_HOME%/plugins/<IK analyzer directory>/config/IKAnalyzer.cfg.xml`**

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">custom.dic</entry>
    <entry key="ext_stopwords"></entry>
</properties>
```

**3. Restart ES**

**4. Test**

```json
GET /_analyze
{
  "text": "弗雷尔卓德",
  "analyzer": "ik_max_word"
}

# The response is as follows:
{
  "tokens" : [
    {
      "token" : "弗雷尔卓德",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 0
    }
  ]
}

GET /_analyze
{
  "text": "测试单词",
  "analyzer": "ik_max_word"
}

# The response is as follows:
{
  "tokens" : [
    {
      "token" : "测试单词",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "测试",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "单词",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
```
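
The introduction says that the mapping for such fields has to be specified manually, but the walkthrough only exercises the analyzer through `_analyze`. The following is a minimal sketch of what an explicit mapping might look like, assuming the IK plugin is installed as described above. The index name `my_index` and the field names `title` and `user_id` are hypothetical and do not come from the original. Pairing `ik_max_word` at index time with `ik_smart` at search time is one common choice, not a requirement.

```json
# my_index, title and user_id are hypothetical names used only for illustration
PUT /my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "user_id": {
        "type": "keyword"
      }
    }
  }
}
```

With a mapping like this, `title` would be analyzed by IK, while `user_id` is a `keyword` field and is indexed as the exact value passed in, without analysis.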