Elasticsearch Basics 1 · TUNA-daily

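[TOC]

## 1. What is Elasticsearch

> * Elasticsearch is a real-time distributed search and analytics engine.
> * It can scale out to hundreds of servers and handle petabytes of structured or unstructured data.

* * * * *

## 2. Use cases

> * Wikipedia uses Elasticsearch for full-text search with highlighted keywords, and for search suggestions such as search-as-you-type and did-you-mean.
> * The Guardian uses Elasticsearch to process visitor logs, so that public reaction to each article can be fed back to its editors in real time.
> * StackOverflow combines full-text search with geolocation and related information to surface more-like-this related questions.
> * GitHub uses Elasticsearch to search more than 130 billion lines of code.
> * Goldman Sachs uses it to index 5 TB of data every day, and many other investment banks use it to analyze movements in the stock market.

* * * * *

## 3. Terminology

1. Cluster health status
> green: all primary shards and all replica shards are available
> yellow: all primary shards are available, but not all replica shards are, which means some replicas have not been assigned to another node
> red: not all primary shards are available

2. Shards
> * Shards are either primary shards or replica shards.
> Primary shards: their number cannot be changed once the index has been created.
> Replica shards: a replica is just a copy of a primary shard. It protects against data loss caused by hardware failure and can also serve read requests, such as searching or retrieving a document from another shard.
> When scaling out leaves you with more machines than shards in total, you can raise the number of replica shards to improve performance.

3. Document metadata
> _index: where the document is stored
> _type: the document type, representing the class of the object
> _id: the unique identifier of the document

4. Document updates
> Documents in Elasticsearch are immutable. To change one, you have to reindex it, that is, replace the stored document with a new one, which increments `_version`.

5. Query results
> hits:

## 4. Full-text search vs. exact matching

1. exact value
> If 2017-01-01 is indexed as an exact value, a search finds it only when you enter exactly 2017-01-01.
> Entering just 01 finds nothing.

2. full text

Full text supports the following kinds of matching:

~~~
(1) Abbreviation vs. full form: cn vs. china
(2) Word-form conversion: like liked likes
(3) Case: Tom vs tom
(4) Synonyms: like vs love
~~~

2017-01-01 is split into 2017 01 01, so searching for 2017 or for 01 finds it.

~~~
china: searching for cn also finds china     # abbreviation matching
likes: searching for like also finds likes   # word-form matching
Tom: searching for tom also finds Tom        # case-insensitive matching
like: searching for love also finds like     # synonym matching
~~~

> In other words, full-text search does not simply match one complete value. The value can be split into terms (tokenization) and matched term by term, and it can also be matched through abbreviations, tenses, letter case, synonyms and so on.

## 5. Inverted index

doc1: I really liked my small dogs, and I think my mom also liked them.

doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.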
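As a rough illustration of the inverted index idea, the sketch below splits the two sample documents into words and records which documents contain each word. The helper name `build_inverted_index` and the regex tokenizer are assumptions made for this example only; they are not how Elasticsearch implements analysis, and no normalization is applied yet.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        # Naive tokenizer (assumption): runs of ASCII letters, case preserved.
        for word in re.findall(r"[A-Za-z]+", text):
            index[word].add(name)
    return index


docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

index = build_inverted_index(docs)

# Look up a few words; .get() avoids creating empty entries in the defaultdict.
for word in ("liked", "small", "mom", "like"):
    print(word, sorted(index.get(word, set())))
```

Looking up `like` returns nothing here, because only the exact token `liked` was stored.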
Tokenize the documents and build an initial inverted index:

~~~
word      doc1    doc2
I          *       *
really     *
liked      *       *
my         *       *
small      *
dogs       *       *
and        *
think      *
mom        *       *
also       *
them       *
He                 *
never              *
any                *
so                 *
hope               *
that               *
will               *
not                *
expect             *
me                 *
to                 *
him                *
~~~

This shows the simplest way an inverted index can be built. Searching this index for mother like little dog cannot return any result, because none of the four query words appears in it.

Is that the search result we want? Definitely not. To us, mother and mom are synonyms. like and liked mean the same thing, one in the present tense and one in the past tense. little and small are synonyms. dog and dogs differ only in being singular and plural.

> normalization:
> When the inverted index is built, each token produced by tokenization is processed further, to raise the chance that a later search finds the related documents: tense conversion, singular/plural conversion, synonym conversion and case conversion.

~~~
mother --> mom
liked  --> like
small  --> little
dogs   --> dog
~~~

Rebuild the inverted index with normalization applied, search again for mother like little dog, and the documents are found:

~~~
word      doc1    doc2    normalization
I          *       *
really     *
like       *       *      liked --> like
my         *       *
little     *              small --> little
dog        *       *      dogs --> dog
and        *
think      *
mom        *       *
also       *
them       *
He                 *
never              *
any                *
so                 *
hope               *
that               *
will               *
not                *
expect             *
me                 *
to                 *
him                *
~~~

~~~
mother like little dog: tokenize, then normalize
mother --> mom
like   --> like
little --> little
dog    --> dog
~~~

Both doc1 and doc2 are returned:

doc1: I really liked my small dogs, and I think my mom also liked them.

doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

## 6. _mapping

### 6.1 Core data types

1. Built-in types

~~~
string                        # string type
byte, short, integer, long    # numeric types
float, double
boolean                       # boolean type
date                          # date type
~~~

2. dynamic mapping

~~~
true or false --> boolean
123           --> long
123.45        --> double
2017-01-01    --> date
"hello world" --> string/text
~~~

3. View the mapping

`GET /index/_mapping/type`

4. Create a _mapping

A mapping can only be defined manually when the index is created, or extended with a mapping for a new field. The mapping of an existing field cannot be modified (update field mapping).

~~~
PUT /website
{
  "mappings": {
    "article": {
      "properties": {
        "author_id": { "type": "long" },
        "title": { "type": "text", "analyzer": "english" },
        "content": { "type": "text" },
        "post_date": { "type": "date" },
        "publisher_id": { "type": "text", "index": "not_analyzed" }
      }
    }
  }
}
~~~

Or add a mapping for a new field to an existing type:

~~~
PUT /website/_mapping/article
{
  "properties": {
    "new_field": {
      "type": "string",
      "index": "not_analyzed"   # not analyzed, exact match
    }
  }
}
~~~

### type=keyword

* In ES 5.x, dynamic mapping of a string value creates two fields by default. One is the field itself, for example articleID, with type=text, which is analyzed. The other is field.keyword, for example articleID.keyword, which is not analyzed by default; values longer than 256 characters are not indexed in it (`ignore_above: 256`).

For example, insert data with bulk without creating the index first, so that the mapping is generated automatically:

~~~
POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
~~~

View the mapping:

~~~
GET forum/_mapping/article

{
  "forum": {
    "mappings": {
      "article": {
        "properties": {
          "articleID": {
            "type": "text",              # articleID is analyzed
            "fields": {
              "keyword": {
                "type": "keyword",       # articleID.keyword is not analyzed
                "ignore_above": 256
              }
            }
          },
          "hidden": { "type": "boolean" },
          "postDate": { "type": "date" },
          "userID": { "type": "long" }
        }
      }
    }
  }
}
~~~

In a mapping, type=keyword means the field is not analyzed.
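To make the difference between the analyzed articleID field and the non-analyzed articleID.keyword sub-field more concrete, here is a small simulation. Everything in it is an assumption for illustration: `analyze` is a crude regex stand-in for a real analyzer, `index_field` is a hypothetical helper, and nothing here calls Elasticsearch itself.

```python
import re


def analyze(value):
    """Crude stand-in for an analyzer: split on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", value)]


def index_field(value, ignore_above=256):
    """Return the terms stored for the text field and its .keyword sub-field."""
    text_terms = set(analyze(value))
    # The keyword sub-field stores the whole value as one term,
    # and skips values longer than ignore_above characters.
    keyword_terms = {value} if len(value) <= ignore_above else set()
    return {"articleID": text_terms, "articleID.keyword": keyword_terms}


terms = index_field("XHDK-A-1293-#fJ3")
query = "XHDK-A-1293-#fJ3"

# An exact-term lookup of the full ID only succeeds on the keyword sub-field.
print("articleID        :", query in terms["articleID"])
print("articleID.keyword:", query in terms["articleID.keyword"])
print("analyzed terms   :", sorted(terms["articleID"]))
```

Under these assumptions the full ID is not among the analyzed terms, which is why exact matching on an ID-like value is better done against the keyword sub-field.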