Index Principles · wml_code

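## Index Principles

### Inverted Index

An `inverted index` is also called a reverse index; where there is a reverse index, there is also a forward index. Put simply, `a forward index looks up a value by its key, while an inverted index looks up the key by its value. Elasticsearch (ES) uses an inverted index under the hood for search.`

### Index Model

Suppose the following index and mapping already exist:

```json
{
  "products" : {
    "mappings" : {
      "properties" : {
        "description" : {
          "type" : "text"
        },
        "price" : {
          "type" : "float"
        },
        "title" : {
          "type" : "keyword"
        }
      }
    }
  }
}
```

First, index the following documents, each with three fields: title, price, and description.

| _id | title | price | description |
| ---- | ------------ | ------ | -------------------- |
| 1 | 蓝月亮洗衣液 | `19.9` | 蓝月亮洗衣液`很`高效 |
| 2 | iphone13 | `19.9` | `很`不错的手机 |
| 3 | 小浣熊干脆面 | 1.5 | 小浣熊`很`好吃 |

In ES, only `text` fields are analyzed (tokenized); all other types are indexed as whole values. The tables below are a simplified model (Lucene actually stores numeric fields such as `price` in a BKD tree rather than a term dictionary), but the lookup idea is the same. The index built for each field looks like this:

- **title field:**

| term | _id (document ID) |
| ------------ | ----------- |
| 蓝月亮洗衣液 | 1 |
| iphone13 | 2 |
| 小浣熊干脆面 | 3 |

- **price field:**

| term | _id (document ID) |
| ---- | ----------- |
| 19.9 | [1,2] |
| 1.5 | 3 |

- **description field:**

| term | _id | term | _id | term | _id |
| ---- | ------------------- | ---- | ---- | ---- | ---- |
| 蓝 | 1 | 不 | 2 | 小 | 3 |
| 月 | 1 | 错 | 2 | 浣 | 3 |
| 亮 | 1 | 的 | 2 | 熊 | 3 |
| 洗 | 1 | 手 | 2 | 好 | 3 |
| 衣 | 1 | 机 | 2 | 吃 | 3 |
| 液 | 1 | | | | |
| 很 | [1:1:9,2:1:6,3:1:6] | | | | |
| 高 | 1 | | | | |
| 效 | 1 | | | | |

The term 很 occurs in all three documents, so its entry is written out in full as `document ID:term frequency:field length`. For example, `1:1:9` means that in document 1 the term appears once in a description of 9 terms. The other rows show only the document ID for brevity.

**`Note: Elasticsearch builds a separate inverted index for each field. A query looks up its term in the index of the queried field to get the matching document IDs, and from those it can quickly locate the documents.`**

## Analyzers

### Analysis and Analyzer

`Analysis`: text analysis is the process of converting full text into a series of terms (term/token); it is also called tokenization. **Analysis is performed by an Analyzer.** `Tokenization means using an Analyzer to split a document into individual terms (the keywords that queries match against), where each term points to the documents that contain it.`

### Anatomy of an Analyzer

- Note: by default, ES uses the standard analyzer (`StandardAnalyzer`).
  - Characteristics: Chinese text is split into single characters; English text is split into words.
  - Example: `我是中国人 this is good man` → analyzer → `我`, `是`, `中`, `国`, `人`, `this`, `is`, `good`, `man`

> Every analyzer is built from three kinds of components: `character filters`, `tokenizers`, and `token filters`.

- `character filter`
  - Preprocesses the text before it is tokenized. The most common examples are stripping HTML tags (`<span>hello</span>` --> `hello`) and replacing characters such as `&` --> `and` (`I&you` --> `I and you`).
- `tokenizer`
  - Splits the text into terms. English can be split on whitespace; Chinese is harder to segment, and machine-learning algorithms can be used for it.
- `token filter`
  - **Post-processes the terms produced by the tokenizer**: lowercasing (for example, turning "Quick" into "quick"), removing stop words (such as "a", "and", "the"), and adding synonyms (such as "jump" and "leap").

`Note:`

- Order of the three: Character Filters ---> Tokenizer ---> Token Filters
- Number of each: Character Filters (zero or more) + Tokenizer (exactly one) + Token Filters (zero or more)

### Built-in Analyzers

- Standard Analyzer
  - The default analyzer; splits English text into words and lowercases them
- Simple Analyzer
  - Splits on non-letter characters (symbols are dropped) and lowercases
- Stop Analyzer
  - Lowercases and removes stop words (the, a, is)
- Whitespace Analyzer
  - Splits on whitespace; does not lowercase
- Keyword Analyzer
  - Does not tokenize; the input is emitted unchanged as a single term

### Testing the Built-in Analyzers

- Standard analyzer
  - Characteristics: splits English into words, lowercases English, removes punctuation, and splits Chinese into single characters

```http
POST /_analyze
{
  "analyzer": "standard",
  "text": "this is a , good Man 中华人民共和国"
}
```

- Simple analyzer
  - Characteristics: splits English into words, lowercases English, removes symbols, and splits Chinese only at spaces (a run of Chinese characters stays one term)

```http
POST /_analyze
{
  "analyzer": "simple",
  "text": "this is a , good Man 中华人民共和国"
}
```

- Whitespace analyzer
  - Characteristics: splits both Chinese and English on whitespace, does not lowercase English, and keeps punctuation

```http
POST /_analyze
{
  "analyzer": "whitespace",
  "text": "this is a , good Man"
}
```

### Setting the Analyzer When Creating an Index

The `analyzer` parameter on a `text` field explicitly specifies which analyzer that field uses:

```json
PUT /index_name
{
  "settings": {},
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}
```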
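To tie the two halves of this post together, here is a minimal Python sketch of the pipeline described above (character filter, then tokenizer, then token filter) feeding a per-field inverted index. It is only an illustration under stated assumptions, not how Elasticsearch or Lucene is implemented: the function names (`char_filter`, `tokenize`, `token_filter`, `analyze`, `build_index`) are hypothetical, and the regex tokenizer only roughly imitates the standard analyzer by treating each Chinese character as its own term.

```python
import re
from collections import defaultdict


def char_filter(text):
    """Character filter: strip HTML tags and spell out '&' (illustrative rules only)."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenize(text):
    """Tokenizer: one term per Chinese character, one term per run of ASCII letters/digits.

    This is a rough stand-in for the standard analyzer's tokenizer, not its real algorithm.
    """
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9]+", text)


def token_filter(tokens):
    """Token filter: lowercase every term."""
    return [t.lower() for t in tokens]


def analyze(text):
    """Run the three stages in order: character filter -> tokenizer -> token filter."""
    return token_filter(tokenize(char_filter(text)))


def build_index(docs, field, analyzed):
    """Build an inverted index for one field.

    Each posting is (doc_id, term_frequency, field_length), matching the
    `id:frequency:length` notation used in the description table.
    """
    index = defaultdict(list)
    for doc_id, doc in docs.items():
        # Only analyzed (text) fields are tokenized; other fields are one whole term.
        terms = analyze(doc[field]) if analyzed else [doc[field]]
        for term in dict.fromkeys(terms):  # unique terms, original order
            index[term].append((doc_id, terms.count(term), len(terms)))
    return dict(index)


docs = {
    1: {"title": "蓝月亮洗衣液", "price": 19.9, "description": "蓝月亮洗衣液很高效"},
    2: {"title": "iphone13", "price": 19.9, "description": "很不错的手机"},
    3: {"title": "小浣熊干脆面", "price": 1.5, "description": "小浣熊很好吃"},
}

title_index = build_index(docs, "title", analyzed=False)
price_index = build_index(docs, "price", analyzed=False)
description_index = build_index(docs, "description", analyzed=True)

print(analyze("我是中国人 this is good Man"))
print(description_index["很"])
```

Looking up `description_index["很"]` returns one posting per document that contains the term, which is the "find the key by its value" direction of the inverted index; `title_index` and `price_index` keep each value whole because those fields are not analyzed.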