term vector · TUNA-daily

课程大纲 ## 1. term vector介绍 > 作用： > 获取document中的某个field内的各个term的统计信息 > term information: term frequency in the field, term positions, start and end offsets, term payloads > term statistics: 设置term_statistics=true; > ttf（total term frequency）, 一个term在所有document中出现的频率; > document frequency，有多少document包含这个term > field statistics: > document count：有多少document包含这个field; > sum of document frequency：一个field中所有term的df之和; > sum of total term frequency：一个field中的所有term的tf之和 > 用例： > 比如说，你想要看到某个term，某个词条，大话西游，这个词条，在多少个document中出现了。或者说某个field，film_desc，电影的说明信息，有多少个doc包含了这个说明信息。 ## 2. index-iime term vector实验 > term vector，涉及了很多的term和field相关的统计信息，有两种方式可以采集到这个统计信息 > （1）index-time，你在mapping里配置一下，然后建立索引的时候，就直接给你生成这些term和field的统计信息了 > （2）query-time，你之前没有生成过任何的Term vector信息，然后在查看term vector的时候，直接就可以看到了，会on the fly，现场计算出各种统计信息，然后返回给你 ~~~ PUT /my_index { "mappings": { "my_type": { "properties": { "text": { "type": "text", "term_vector": "with_positions_offsets_payloads", "store" : true, "analyzer" : "fulltext_analyzer" }, "fullname": { "type": "text", "analyzer" : "fulltext_analyzer" } } } }, "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }, "analysis": { "analyzer": { "fulltext_analyzer": { "type": "custom", "tokenizer": "whitespace", "filter": [ "lowercase", "type_as_payload" ] } } } } } ~~~ ~~~ PUT /my_index/my_type/1 { "fullname" : "Leo Li", "text" : "hello test test test " } PUT /my_index/my_type/2 { "fullname" : "Leo Li", "text" : "other hello test ..." } ~~~ ### 2.1 term 统计 ~~~ GET /my_index/my_type/1/_termvectors { "fields" : ["text"], "offsets" : true, "payloads" : true, "positions" : true, "term_statistics" : true, "field_statistics" : true } ~~~ * 得到 ~~~ "term_vectors": { "text": { "field_statistics": { "sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8 }, "terms": { "hello": { "doc_freq": 2, # 在2个doc中出现了 "ttf": 2, # 在所有doc中出现了2次 "term_freq": 1, # 在当前doc中出现了多少次 "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 5, "payload": "d29yZA==" } ] }, "test": { "doc_freq": 2, "ttf": 4, "term_freq": 3, "tokens": [ # test这个单词，出现了三次，这三个的offset { "position": 1, "start_offset": 6, "end_offset": 10, "payload": "d29yZA==" }, { "position": 2, "start_offset": 11, "end_offset": 15, "payload": "d29yZA==" }, { "position": 3, "start_offset": 16, "end_offset": 20, "payload": "d29yZA==" } ] ~~~ ### 2.2 query-time term vector实验 fullname这个field没有在创建的时候产生词条信息 ~~~ GET /my_index/my_type/1/_termvectors { "fields" : ["fullname"], "offsets" : true, "positions" : true, "term_statistics" : true, "field_statistics" : true } ~~~ 一般来说，如果条件允许，你就用query time的term vector就可以了，你要探查什么数据，现场去探查一下就好了 ### 2.3 查找指定词语在所有doc某一field出现的次数这里对text进行搜索，查找找test在所有doc出现的频率 ~~~ GET /my_index/my_type/_termvectors { "doc" : { "fullname":"li xiao long", "text" : "test" }, "fields" : ["text"], "offsets" : true, "payloads" : true, "positions" : true, "term_statistics" : true, "field_statistics" : true } ~~~ 得到 ~~~ "terms": { "test": { "doc_freq": 2, "ttf": 5, # test在所有doc的text中出现的次数 "term_freq": 1, "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 4 } ~~~ * 查找lihong长在full的频率 ~~~ GET /my_index/my_type/_termvectors { "doc" : { "fullname":"li hong zhang", "text" : "test" }, "fields" : ["fullname"], "offsets" : true, "payloads" : true, "positions" : true, "term_statistics" : true, "field_statistics" : true } ~~~ 手动指定一个doc，实际上不是要指定doc，而是要指定你想要安插的词条，hello test，那么就可以放在一个field中将这些term分词，然后对每个term，都去计算它在现有的所有doc中的一些统计信息这个挺有用的，可以让你手动指定要探查的term的数据情况，你就可以指定探查“大话西游”这个词条的统计信息 ### 2.4 手动指定analyzer来生成term vector ~~~ GET /my_index/my_type/_termvectors { "doc" : { "fullname" : "Leo Li", "text" : "hello test test test" }, "fields" : ["text"], "offsets" : true, "payloads" : true, "positions" : true, "term_statistics" : true, "field_statistics" : true, "per_field_analyzer" : { "text": "standard" } } ~~~ ### 2.5 terms filter ~~~ GET /my_index/my_type/_termvectors { "doc" : { "fullname" : "Leo Li", "text" : "hello test test test" }, "fields" : ["text"], "offsets" : true, "payloads" : true, "positions" : true, "term_statistics" : true, "field_statistics" : true, "filter" : { "max_num_terms" : 3, # 控制term出现的最大次数 "min_term_freq" : 1, # 控制term出现的最小次数 "min_doc_freq" : 1 # 最少在多少个doc中出现 } } ~~~ 这个就是说，根据term统计信息，过滤出你想要看到的term vector统计结果也挺有用的，比如你探查数据把，可以过滤掉一些出现频率过低的term，就不考虑了 ### 2.6 multi term vector ~~~ GET _mtermvectors { "docs": [ { "_index": "my_index", "_type": "my_type", "_id": "2", "term_statistics": true }, { "_index": "my_index", "_type": "my_type", "_id": "1", "fields": [ "text" ] } ] } ~~~ ~~~ GET /my_index/_mtermvectors { "docs": [ { "_type": "test", "_id": "2", "fields": [ "text" ], "term_statistics": true }, { "_type": "test", "_id": "1" } ] } ~~~ ~~~ GET /my_index/my_type/_mtermvectors { "docs": [ { "_id": "2", "fields": [ "text" ], "term_statistics": true }, { "_id": "1" } ] } ~~~ ~~~ GET /_mtermvectors { "docs": [ { "_index": "my_index", "_type": "my_type", "doc" : { "fullname" : "Leo Li", "text" : "hello test test test" } }, { "_index": "my_index", "_type": "my_type", "doc" : { "fullname" : "Leo Li", "text" : "other hello test ..." } } ] } ~~~