Chinese Word Segmentation · FA+Ecloud Internet Application Development Platform: Comprehensive Technical Primer (READ)

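## 1. Tokenization

When a document is stored, ElasticSearch uses an analyzer to extract a number of tokens from it, and these tokens support index storage and search. ElasticSearch ships with many built-in analyzers, but they handle Chinese poorly. As an example, run the analyze command:

```
curl --user elastic:'ray!@#333' -H "Content-Type: application/json" -X POST "localhost:9200/_analyze?pretty" -d '{"text":"ray elasticsearch"}'
```

This is equivalent to:

```
curl --user elastic:'ray!@#333' -H "Content-Type: application/json" -X POST "localhost:9200/_analyze?pretty" -d '{"analyzer": "standard","text":"ray elasticsearch"}'
```

![](https://img.kancloud.cn/ff/64/ff64cdcb78a5eafa3dce9e05d79d7464_1419x424.png)

The result shows that the phrase "ray elasticsearch" is split into two words. English words are naturally separated by spaces, so splitting on spaces works without any problem. Now take a Chinese example:

```
curl --user elastic:'ray!@#333' -H "Content-Type: application/json" -X POST "localhost:9200/_analyze?pretty" -d '{"text":"全文检索网"}'
```

This is equivalent to:

```
curl --user elastic:'ray!@#333' -H "Content-Type: application/json" -X POST "localhost:9200/_analyze?pretty" -d '{"analyzer": "standard","text":"全文检索网"}'
```

![](https://img.kancloud.cn/1a/2e/1a2eb5ec97623e6f46c65e3774a74bad_1416x835.png)

As the result shows, this analyzer emits every Chinese character as a separate token, which makes the segmentation meaningless for Chinese. ElasticSearch's default analyzer is therefore unsuitable for Chinese text. The default analyzer used above is named standard. To use a different analyzer, simply set the "analyzer" field to the name of that analyzer.

ES supports third-party analyzers through plugins.

## 2. Chinese Word Segmentation

The commonly used Chinese analyzers are smartcn, which is based on ICTCLAS from the Chinese Academy of Sciences, and IKAnalyzer. We use IKAnalyzer.

### **Installation**

Go to the ${elasticsearch}/plugins directory, create an ik subdirectory, and enter it. Choose the plugin release that matches your Elasticsearch version; this guide uses 7.15.1.

Download:

```
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.15.1/elasticsearch-analysis-ik-7.15.1.zip
```

Unzip:

```
unzip elasticsearch-analysis-ik-7.15.1.zip
```

Restart the elasticsearch process, and the IK analyzer is enabled.

### **Testing**

```
curl --user elastic:'ray!@#333' -H "Content-Type: application/json" -X POST "localhost:9200/_analyze?pretty" -d '{"analyzer": "ik_max_word","text":"全文检索网"}'
```

![](https://img.kancloud.cn/e6/bc/e6bc56f64321d03e1e791905681cf389_1410x565.png)

Compared with the standard analyzer, the IK segmentation is clearly more reasonable.

>[danger] IK includes two analyzers, ik_max_word and ik_smart.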
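
Since IK provides both ik_max_word and ik_smart, it can help to run the same text through each and compare the tokens. The sketch below is one possible way to do that, not an official IK script. The helper names `analyze_body` and `analyze` and the `RUN_COMPARE` switch are made up for illustration; the credentials and host simply mirror the examples in this guide. In the IK plugin, ik_max_word produces the finest-grained split and ik_smart the coarsest.

```shell
#!/usr/bin/env bash

# Build the JSON body for the _analyze API.
# Hypothetical helper: it does no JSON escaping, so keep quotes and
# backslashes out of the text argument.
analyze_body() {
  printf '{"analyzer": "%s","text":"%s"}' "$1" "$2"
}

# Send one text through one analyzer.
# Assumes ES is listening on localhost:9200 with the credentials used above.
analyze() {
  curl --user elastic:'ray!@#333' -H "Content-Type: application/json" \
    -X POST "localhost:9200/_analyze?pretty" -d "$(analyze_body "$1" "$2")"
}

# Only contact the server when explicitly requested.
if [ "${RUN_COMPARE:-0}" = "1" ]; then
  for a in ik_max_word ik_smart; do
    echo "== $a =="
    analyze "$a" "全文检索网"
  done
fi
```

Running it as `RUN_COMPARE=1 bash compare.sh` (the file name is arbitrary) prints the token list for each analyzer in turn. A common choice is ik_max_word at index time and ik_smart at search time, but which one fits depends on the data.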