[TOC]

## IK Chinese Analyzer

1. What is an analyzer?

> Splitting text into terms, plus normalization (to improve recall).
> Given a sentence, an analyzer breaks it into individual terms and normalizes each one (tense conversion, singular/plural conversion, and so on).
> recall: the more relevant results a search can find, the higher the recall.
> * What an analyzer does:
> character filter: pre-processes the text before tokenization, most commonly stripping HTML tags (`<span>hello</span>` --> `hello`) or expanding entities (`I&you` --> `I and you`)
> tokenizer: tokenization, e.g. `hello you and me` --> `hello`, `you`, `and`, `me`
> token filter: lowercase, stop words, synonyms, e.g. `dogs` --> `dog`, `liked` --> `like`, `Tom` --> `tom`, `a/the/an` --> dropped, `mother` --> `mom`, `small` --> `little`
> The analyzer matters: it runs a piece of text through all of these stages, and only the final result is used to build the inverted index.

2. Built-in analyzers

~~~
Set the shape to semi-transparent by calling set_trans(5)

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5  (the default)
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (language-specific, e.g. the english analyzer): set, shape, semi, transpar, call, set_tran, 5
~~~

* Installation

1. `mkdir /usr/share/elasticsearch/plugins/ik`, then unzip the IK plugin into /usr/share/elasticsearch/plugins/ik

1. Query string analysis

> A query string must be analyzed with the same analyzer that was used when the index was built (search text and index share the same analyzer).
> A query string treats exact values and full text differently:

~~~
date: exact value
_all: full text   # searches that do not specify a field
~~~

> Say we have a document with a field whose value is `hello you and me`, and an inverted index built from it.
> We search that document's index with the text `hell me`; this search text is the query string.
> By default, ES analyzes the query string with the same analyzer that was used when building the inverted index for the corresponding field (tokenization and normalization); only then can the search match correctly.
> If the index mapped `dogs` --> `dog` but the search still sends `dogs`, nothing matches; at search time `dogs` must likewise become `dog` for the lookup to succeed.

> Key point:
> fields differ by type; some are full text, some are exact value

~~~
post_date, date: exact value             # exact value
_all: full text, analyzed, normalized    # full-text index
~~~

2. Solving the question left over from the mapping example

`GET /_search?q=2017`

`This searches the _all field: all of a document's fields are concatenated into one big string, which is then analyzed.`

~~~
Example document: 2017-01-02 my second article this is my second article in this website 11400

       doc1   doc2   doc3
2017    *      *      *
01      *
02             *
03                    *
~~~

> Searching _all for 2017 naturally matches all three documents.

`GET /_search?q=2017-01-01`

~~~
_all, 2017-01-01: the query string is analyzed with the same analyzer used to build the inverted index
2017
01
01
~~~

`GET /_search?q=post_date:2017-01-01`

> The date is indexed as an exact value, and the query string is analyzed the same way as the index

~~~
             doc1   doc2   doc3
2017-01-01    *
2017-01-02           *
2017-01-03                  *

post_date:2017-01-01 matches 2017-01-01 exactly: a single document, doc1
~~~

`GET /_search?q=post_date:2017` is not covered here, because it relies on an optimization introduced in ES 5.2.

3. Testing an analyzer

~~~
GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}
~~~
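To make the character filter --> tokenizer --> token filter pipeline concrete, here is a minimal sketch of a custom analyzer that chains all three stages. The index name `my_index` and analyzer name `my_analyzer` are placeholders; `html_strip`, `standard`, `lowercase` and `stop` are standard Elasticsearch building blocks:

~~~
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<span>Set the shape to semi-transparent</span>"
}
~~~

`html_strip` removes the `<span>` tags before tokenization, the `standard` tokenizer splits the text, and the `lowercase`/`stop` token filters normalize case and drop words like `the` and `to`.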
### 1. Testing analyzer output

> * IK ships two analyzers: ik_smart and ik_max_word

ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” becomes “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌”, exhausting every possible combination.

ik_smart: splits at the coarsest granularity. For example, “中华人民共和国国歌” becomes “中华人民共和国, 国歌”.

* * * * *

#### 1.1 Tokenization test

* ik_smart test

~~~
GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}
~~~

This yields the two terms `中华人民共和国 国歌`:

~~~
{
  "token": "中华人民共和国",
  "start_offset": 0,
  "end_offset": 7,
  "type": "CN_WORD",
  "position": 0
},
{
  "token": "国歌",
  "start_offset": 7,
  "end_offset": 9,
  "type": "CN_WORD",
  "position": 1
}
~~~

* ik_max_word test

~~~
GET _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}
~~~

This yields `中华人民共和国 中华人民 中华 华人 人民共和国 人民 共和国 国 国歌`:

~~~
{
  "token": "中华人民共和国",
  "start_offset": 0,
  "end_offset": 7,
  "type": "CN_WORD",
  "position": 0
},
{
  "token": "中华人民",
  "start_offset": 0,
  "end_offset": 4,
  "type": "CN_WORD",
  "position": 1
},
{
  "token": "中华",
  "start_offset": 0,
  "end_offset": 2,
  "type": "CN_WORD",
  "position": 2
},
...
~~~

* Conclusion: both analyzers start from the largest chunks, but ik_max_word keeps re-analyzing within each chunk, and so on recursively. ik_max_word produces the more fine-grained segmentation.
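A configuration that follows naturally from this comparison (not part of the original test, so treat it as a sketch; the index name `demo` and the type/field names are placeholders): index with `ik_max_word` for maximum recall, and search with `ik_smart` so queries are not shredded into too many tiny terms:

~~~
PUT /demo
{
  "mappings": {
    "article": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
      }
    }
  }
}
~~~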
* * * * *

### 1.2 Hot-updating the dictionary from MySQL

Test the tokenization:

~~~
GET _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "王者荣耀是最好玩的游戏"
}
~~~

> This yields the terms `王者 荣耀 是 最好 好玩 的 游戏`, but what if we want `王者荣耀` to come out as a single term? We need to hot-update the dictionary with trending words.

#### 1.2.1 Modify the IK source code

1. Define a custom thread class `HotDictReloadThread` whose job is to keep reloading the dictionary:

~~~
public class HotDictReloadThread implements Runnable {

    private static final Logger logger = ESLoggerFactory.getLogger(HotDictReloadThread.class.getName());

    @Override
    public void run() {
        logger.info("==========reload hot dic from mysql.......");
        while (true) {
            // reload the dictionary over and over; the loop is throttled by the
            // Thread.sleep(jdbc.reload.interval) inside loadMySQLExtDict (see step 3)
            Dictionary.getSingleton().reLoadMainDict();
        }
    }
}
~~~

2. Modify the `initial` method of the `Dictionary` class to start that thread:

~~~
public static synchronized Dictionary initial(Configuration cfg) {
    if (singleton == null) {
        synchronized (Dictionary.class) {
            if (singleton == null) {

                singleton = new Dictionary(cfg);
                singleton.loadMainDict();
                singleton.loadSurnameDict();
                singleton.loadQuantifierDict();
                singleton.loadSuffixDict();
                singleton.loadPrepDict();
                singleton.loadStopWordDict();

                // our custom thread: keeps reloading the dictionary
                new Thread(new HotDictReloadThread()).start();

                if (cfg.isEnableRemoteDict()) {
                    // start the monitor threads
                    for (String location : singleton.getRemoteExtDictionarys()) {
                        // 10 s is the initial delay (configurable); 60 is the interval in seconds
                        pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
                    }
                    for (String location : singleton.getRemoteExtStopWordDictionarys()) {
                        pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
                    }
                }

                return singleton;
            }
        }
    }
    return singleton;
}
~~~

3. Define a custom `loadMySQLExtDict` method that loads trending words from MySQL:

~~~
private static Properties prop = new Properties();

static {
    try {
        Class.forName("com.mysql.jdbc.Driver");
    } catch (ClassNotFoundException e) {
        logger.error("error", e);
    }
}

private void loadMySQLExtDict() {
    try {
        Connection connection = null;
        Statement statement = null;
        ResultSet resultSet = null;

        Path file = PathUtils.get(getDictRoot(), "mysql.properties");
        prop.load(new FileInputStream(file.toFile()));

        logger.info("============JDBC reload properties");
        for (Object key : prop.keySet()) {
            logger.info("[==========] query hot dict from mysql," + prop.getProperty(String.valueOf(key)));
        }

        connection = DriverManager.getConnection(
                prop.getProperty("jdbc.url"),
                prop.getProperty("jdbc.user"),
                prop.getProperty("jdbc.password"));
        statement = connection.createStatement();
        resultSet = statement.executeQuery(prop.getProperty("jdbc.reload.sql"));

        while (resultSet.next()) {
            String theWord = resultSet.getString("word");
            logger.info("[==========] hot word from mysql: " + theWord);
            // add the word to the in-memory main dictionary
            _MainDict.fillSegment(theWord.trim().toCharArray());
        }

        // throttle the reload loop (jdbc.reload.interval is in milliseconds)
        Thread.sleep(Integer.valueOf(prop.getProperty("jdbc.reload.interval")));
    } catch (Exception e) {
        e.printStackTrace();
    }
}
~~~

4. Define a custom `loadMySQLStopwordDict` method that loads stopwords:

~~~
private void loadMySQLStopwordDict() {
    try {
        Connection connection = null;
        Statement statement = null;
        ResultSet resultSet = null;

        Path file = PathUtils.get(getDictRoot(), "mysql.properties");
        prop.load(new FileInputStream(file.toFile()));

        logger.info("============JDBC reload properties");
        for (Object key : prop.keySet()) {
            logger.info("[==========] query hot stopwords from mysql," + prop.getProperty(String.valueOf(key)));
        }

        connection = DriverManager.getConnection(
                prop.getProperty("jdbc.url"),
                prop.getProperty("jdbc.user"),
                prop.getProperty("jdbc.password"));
        statement = connection.createStatement();
        resultSet = statement.executeQuery(prop.getProperty("jdbc.reload.stopword.sql"));

        while (resultSet.next()) {
            String theWord = resultSet.getString("word");
            logger.info("[==========] hot stopword from mysql: " + theWord);
            // add the word to the in-memory stopword dictionary
            _StopWords.fillSegment(theWord.trim().toCharArray());
        }

        Thread.sleep(Integer.valueOf(prop.getProperty("jdbc.reload.interval")));
    } catch (Exception e) {
        e.printStackTrace();
    }
}
~~~

5. In the `loadMainDict` method of the `Dictionary` class, call `loadMySQLExtDict` to load the trending words:

~~~
private void loadMainDict() {
    // build the main dictionary instance
    _MainDict = new DictSegment((char) 0);

    // read the main dictionary file
    Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_MAIN);

    InputStream is = null;
    try {
        is = new FileInputStream(file.toFile());
    } catch (FileNotFoundException e) {
        logger.error(e.getMessage(), e);
    }

    try {
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"), 512);
        String theWord = null;
        do {
            theWord = br.readLine();
            if (theWord != null && !"".equals(theWord.trim())) {
                _MainDict.fillSegment(theWord.trim().toCharArray());
            }
        } while (theWord != null);

    } catch (IOException e) {
        logger.error("ik-analyzer", e);

    } finally {
        try {
            if (is != null) {
                is.close();
                is = null;
            }
        } catch (IOException e) {
            logger.error("ik-analyzer", e);
        }
    }

    // load the extension dictionary
    this.loadExtDict();
    // load the remote custom dictionary
    this.loadRemoteExtDict();
    // load hot words from MySQL
    this.loadMySQLExtDict();
}
~~~

6. In the `loadStopWordDict` method of the `Dictionary` class, call `loadMySQLStopwordDict`:

~~~
private void loadStopWordDict() {
    // build the stopword dictionary instance
    _StopWords = new DictSegment((char) 0);

    // read the stopword dictionary file
    Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_STOP);

    InputStream is = null;
    try {
        is = new FileInputStream(file.toFile());
    } catch (FileNotFoundException e) {
        logger.error(e.getMessage(), e);
    }

    try {
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"), 512);
        String theWord = null;
        do {
            theWord = br.readLine();
            if (theWord != null && !"".equals(theWord.trim())) {
                _StopWords.fillSegment(theWord.trim().toCharArray());
            }
        } while (theWord != null);

    } catch (IOException e) {
        logger.error("ik-analyzer", e);

    } finally {
        try {
            if (is != null) {
                is.close();
                is = null;
            }
        } catch (IOException e) {
            logger.error("ik-analyzer", e);
        }
    }

    // load stopwords from MySQL
    this.loadMySQLStopwordDict();
}
~~~

7. Add the MySQL configuration file mysql.properties:

~~~
jdbc.url=jdbc:mysql://localhost:3306/es?serverTimezone=GMT
jdbc.user=root
jdbc.password=tuna
jdbc.reload.sql=select word from hot_words
jdbc.reload.stopword.sql=select stopword as word from hot_stopwords
jdbc.reload.interval=30000
~~~

* Package the modified IK plugin as a jar and overwrite the original

![](https://box.kancloud.cn/deb577e6ed1dca638e4f12e54f2eb2f0_1656x50.png)

* Add the MySQL connector jar

![](https://box.kancloud.cn/4d8cfbc11bb9b2dc7f429e3992ecbf9e_1639x198.png)

* Restart elasticsearch

Trending words in MySQL:

![](https://box.kancloud.cn/1dcccc6af3dfacc5e17d8d9c39c3255b_444x177.png)

Result:

~~~
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "王者荣耀很好玩"
}
~~~

yields

~~~
{
  "tokens": [
    {
      "token": "王者荣耀",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "王者",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "荣耀",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "很好",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "好玩",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    }
  ]
}
~~~

`王者荣耀` now comes out as a single term: the hot-word update works.
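The stopword path can be verified the same way. As a sketch (assuming you have inserted 的 into the `hot_stopwords` table and waited at least one `jdbc.reload.interval`):

~~~
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "最好玩的游戏"
}
~~~

Before the reload the output contains a `的` token, as in the earlier test; afterwards it should no longer appear.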
### 1.3 Modifying the index configuration

~~~
PUT http://192.168.159.159:9200/index1
{
  "settings": {
    "refresh_interval": "5s",
    "number_of_shards" : 1,   // one primary shard
    "number_of_replicas" : 0  // no replicas for now; they can be added later
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false }  // disable the _all field, since we only search title
    },
    "resource": {
      "dynamic": false,  // disable dynamic mapping
      "properties": {
        "title": {
          "type": "string",
          "index": "analyzed",
          "fields": {
            "cn": {
              "type": "string",
              "analyzer": "ik"
            },
            "en": {
              "type": "string",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
~~~

~~~
GET index/_search
{
  "query": {
    "match": {
      "content": "中国渔船"
    }
  }
}
~~~

~~~
"hits": {
  "total": 2,
  "max_score": 0.6099695,
  "hits": [
    {
      "_index": "index",
      "_type": "fulltext",
      "_id": "4",
      "_score": 0.6099695,
      "_source": {
        "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
      }
    },
    {
      "_index": "index",
      "_type": "fulltext",
      "_id": "3",
      "_score": 0.54359555,
      "_source": {
        "content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
      }
    }
  ]
}
~~~

Set a field's analyzers:

~~~
POST index/fulltext/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}
~~~
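With the multi-field mapping on index1 above, a query can target one sub-field, and therefore one analyzer, explicitly. A sketch (the search text is just an example):

~~~
GET index1/_search
{
  "query": {
    "match": {
      "title.cn": "中国渔船"
    }
  }
}
~~~

`title.cn` is analyzed with IK while `title.en` uses the english analyzer, so Chinese and English queries can each hit the sub-field suited to that language.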
"doc_count": 5 }, { "key": "在", "doc_count": 3 }, { "key": "人", "doc_count": 2 }, { "key": "冲突", "doc_count": 2 }, ~~~ 中国出现在五篇文档中,在出现在三篇文档中