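#### Most Fields
Full-text search is a battle between recall (returning all the relevant documents) and precision (not returning irrelevant documents). The goal is to present the user with the most relevant documents on the first page of results.
To improve recall, we cast the net wide: we include not only documents that match the user's search terms exactly, but also documents that we believe to be pertinent to the query. If a user searches for "quick brown fox", a document that contains `fast foxes` may well be a reasonable result to return.
If the only pertinent document we have is the one containing `fast foxes`, it will appear at the top of the results list. But if we have 100 documents that contain `quick brown fox`, the `fast foxes` document is comparatively less relevant, and we want to push it further down the list. After including many potential matches, we need to ensure that the best ones rise to the top.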
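A common technique for fine-tuning full-text relevance is to index the same text in multiple ways, each of which provides a different relevance signal. The main field contains terms in their broadest-matching form, to match as many documents as possible. For instance, we could do the following:
* Use a stemmer to index `jumps`, `jumping`, and `jumped` as their root form: `jump`. Then, when the user searches for `jumped`, we can still match documents containing `jumping`.
* Include synonyms such as `jump`, `leap`, and `hop`.
* Remove diacritics, or accents: for example, `ésta`, `está`, and `esta` would all be indexed as `esta`.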
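However, if we have two documents, one of which contains `jumped` and the other `jumping`, the user would probably expect the first document to rank higher, because it contains exactly what was typed in.
We can achieve this by indexing the same text in other fields to provide more precise matching. One field may contain the unstemmed version, another the original word with diacritics, and a third might use shingles to provide information about [word proximity](https://www.elastic.co/guide/en/elasticsearch/guide/current/proximity-matching.html). These other fields act as signals that increase the relevance score of each matching document. The more fields that match, the higher the score.
A document is included in the results list if it matches the broad-matching main field. If it also matches the signal fields, it gets extra points that push it up the results list.
We discuss synonyms, word proximity, partial matching, and other potential signals later in the book, but here we use the simple example of stemmed and unstemmed fields to illustrate the technique.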
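The idea of a broad main field plus a stricter signal field can be sketched in a few lines of Python. This is a toy illustration only and not how Elasticsearch computes scores: the `ROOTS` table, the field names `main` and `exact`, and the functions `to_root`, `index_document`, and `score` are all hypothetical, and the "score" is a plain count of matching terms.

```python
# Toy model: NOT Elasticsearch's scoring. All names here are hypothetical.

# Hand-written root forms, standing in for a real stemmer.
ROOTS = {"jumps": "jump", "jumping": "jump", "jumped": "jump"}


def to_root(word):
    """Return the root form of a word, or the word itself if unknown."""
    return ROOTS.get(word, word)


def index_document(text):
    """Index the same text twice: stemmed (main) and unstemmed (signal)."""
    words = text.lower().split()
    return {
        "main": {to_root(w) for w in words},  # broad-matching field
        "exact": set(words),                  # signal field
    }


def score(query, fields):
    """Count matches in the main field, plus a bonus for exact matches."""
    words = query.lower().split()
    main_hits = sum(to_root(w) in fields["main"] for w in words)
    if main_hits == 0:
        return 0  # no match in the main field: not in the results at all
    exact_hits = sum(w in fields["exact"] for w in words)
    return main_hits + exact_hits


for text in ("she jumped", "she is jumping", "she sleeps"):
    print(text, "->", score("jumped", index_document(text)))
```

In this sketch both `she jumped` and `she is jumping` match the query `jumped` through the main field, but only the first also matches the signal field, so it ends up with the higher count.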
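#### Multifield Mapping
The first thing to do is to set up our field to be indexed twice: once in a stemmed form and once in an unstemmed form. To do this, we use multifields, which we introduced in String Sorting and Multifields: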
```Javascript
DELETE /my_index
PUT /my_index
{
    "settings": { "number_of_shards": 1 }, <1>
    "mappings": {
        "my_type": {
            "properties": {
                "title": { <2>
                    "type":     "string",
                    "analyzer": "english",
                    "fields": {
                        "std": { <3>
                            "type":     "string",
                            "analyzer": "standard"
                        }
                    }
                }
            }
        }
    }
}
```
<1> See [Relevance Is Broken](/100_Full_Text_Search/35_Relevance_is_broken.md).
<2> The `title` field is stemmed by the `english` analyzer.
<3> The `title.std` field uses the `standard` analyzer and so is not stemmed.
Next, we index some documents:
```Javascript
PUT /my_index/my_type/1
{ "title": "My rabbit jumps" }
PUT /my_index/my_type/2
{ "title": "Jumping jack rabbits" }
```
Here is a simple `match` query on the `title` field for `jumping rabbits`:
```Javascript
GET /my_index/_search
{
    "query": {
        "match": {
            "title": "jumping rabbits"
        }
    }
}
```
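Thanks to the `english` analyzer, this becomes a query for the two stemmed terms `jump` and `rabbit`. The `title` field of both documents contains both of those terms, so both documents receive the same score: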
```Javascript
{
    "hits": [
        {
            "_id": "1",
            "_score": 0.42039964,
            "_source": {
                "title": "My rabbit jumps"
            }
        },
        {
            "_id": "2",
            "_score": 0.42039964,
            "_source": {
                "title": "Jumping jack rabbits"
            }
        }
    ]
}
```
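To make the difference between the two fields concrete, here is a rough Python sketch of what each analyzer does to the text. It is a deliberate simplification and not the real analyzers: `standard_like` only lowercases and splits on whitespace, `english_like` additionally strips two suffixes as a crude stand-in for stemming, and `matching_docs` is a hypothetical helper that treats a document as a match when it shares any term with the query.

```python
# Simplified stand-ins for the analyzers; all function names are hypothetical.

def standard_like(text):
    """Lowercase and split on whitespace; no stemming."""
    return text.lower().split()


def english_like(text):
    """Like standard_like, then strip a suffix (crude stand-in for stemming)."""
    tokens = []
    for word in standard_like(text):
        for suffix in ("ing", "s"):
            # Keep at least three characters of the word after stripping.
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens


docs = {1: "My rabbit jumps", 2: "Jumping jack rabbits"}


def matching_docs(query, analyzer):
    """IDs of documents that share at least one analyzed term with the query."""
    terms = set(analyzer(query))
    return [doc_id for doc_id, title in docs.items()
            if terms & set(analyzer(title))]


print("title     matches:", matching_docs("jumping rabbits", english_like))
print("title.std matches:", matching_docs("jumping rabbits", standard_like))
```

With the stemmed form, both titles reduce to terms that include `jump` and `rabbit`. With the unstemmed form, only the second title contains `jumping` or `rabbits` as written.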
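If we were to query just the `title.std` field, only document 2 would match. However, if we query both fields and combine their scores by using the `bool` query, both documents match (thanks to the `title` field) and document 2 scores higher (thanks to the `title.std` field):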
```Javascript
GET /my_index/_search
{
    "query": {
        "multi_match": {
            "query":  "jumping rabbits",
            "type":   "most_fields", <1>
            "fields": [ "title", "title.std" ]
        }
    }
}
```
<1> We want to combine the scores from all matching fields, so we use the `most_fields` type. This causes the `multi_match` query to wrap the two field clauses in a `bool` query instead of a `dis_max` query.
```Javascript
{
    "hits": [
        {
            "_id": "2",
            "_score": 0.8226396, <1>
            "_source": {
                "title": "Jumping jack rabbits"
            }
        },
        {
            "_id": "1",
            "_score": 0.10741998, <1>
            "_source": {
                "title": "My rabbit jumps"
            }
        }
    ]
}
```
<1> Document 2 now scores much higher than document 1.
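As a rough sketch of what that wrapping means, the Python below builds the two query bodies side by side: a `bool` query with one `match` clause per field in `should`, and a `dis_max` query over the same clauses. The helper names `most_fields_as_bool` and `same_clauses_as_dis_max` are hypothetical, and the output is only an approximation of what `multi_match` generates internally.

```python
import json

# Hypothetical helpers: they only build request bodies as Python dicts.

def match_clauses(query, fields):
    """One match clause per field, all using the same query string."""
    return [{"match": {field: query}} for field in fields]


def most_fields_as_bool(query, fields):
    """Approximate most_fields: scores of all matching clauses are combined."""
    return {"bool": {"should": match_clauses(query, fields)}}


def same_clauses_as_dis_max(query, fields):
    """For contrast: dis_max uses the score of the best matching clause."""
    return {"dis_max": {"queries": match_clauses(query, fields)}}


fields = ["title", "title.std"]
print(json.dumps({"query": most_fields_as_bool("jumping rabbits", fields)},
                 indent=2))
print(json.dumps({"query": same_clauses_as_dis_max("jumping rabbits", fields)},
                 indent=2))
```

Because the `bool` form combines the clause scores, a document that matches both `title` and `title.std` scores higher than one that matches `title` alone, which is the behavior this technique relies on.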
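We use the broad-matching `title` field to include as many documents as possible, which increases recall, and we use the `title.std` field as a signal to push the most relevant results to the top. (Translator's note: this increases precision.)
The contribution of each field to the final score can be controlled by specifying custom `boost` values. For instance, we could boost the `title` field to make it the most important field, which also reduces the effect of any other signal fields: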
```Javascript
GET /my_index/_search
{
    "query": {
        "multi_match": {
            "query":  "jumping rabbits",
            "type":   "most_fields",
            "fields": [ "title^10", "title.std" ] <1>
        }
    }
}
```
<1> The `boost` value of `10` on the `title` field makes that field relatively much more important than the `title.std` field.
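The `field^boost` shorthand can be read as a per-clause boost. The Python sketch below parses that shorthand and builds a roughly equivalent `bool` query in which each `match` clause carries its own `boost`. The helpers `parse_field` and `boosted_bool` are hypothetical, and the default boost of `1.0` for fields without a caret is an assumption of this sketch.

```python
import json

# Hypothetical helpers that expand the "field^boost" shorthand.

def parse_field(spec):
    """Split 'title^10' into ('title', 10.0); no caret means boost 1.0."""
    name, caret, boost = spec.partition("^")
    return name, (float(boost) if caret else 1.0)


def boosted_bool(query, field_specs):
    """Build a bool query with one boosted match clause per field."""
    should = []
    for spec in field_specs:
        name, boost = parse_field(spec)
        should.append({"match": {name: {"query": query, "boost": boost}}})
    return {"bool": {"should": should}}


body = {"query": boosted_bool("jumping rabbits", ["title^10", "title.std"])}
print(json.dumps(body, indent=2))
```

Seen this way, raising the boost on `title` increases the weight of the broad match relative to the `title.std` signal, so the signal field has less influence on the final ordering.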