[[practical-scoring-function]]
=== Lucene's Practical Scoring Function
For multiterm queries, Lucene takes((("relevance", "controlling", "Lucene's practical scoring function", id="ix_relcontPCF", range="startofrange")))((("Boolean Model"))) the <<boolean-model,Boolean model>>,
<<tfidf,TF/IDF>>, and the <<vector-space-model,vector space model>> and
combines ((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm")))((("Vector Space Model"))) them in a single efficient package that collects matching
documents and scores them as it goes.
A multiterm query like
[source,json]
------------------------------
GET /my_index/doc/_search
{
  "query": {
    "match": {
      "text": "quick fox"
    }
  }
}
------------------------------
is rewritten internally to look like this:
[source,json]
------------------------------
GET /my_index/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": { "text": "quick" }},
        {"term": { "text": "fox"   }}
      ]
    }
  }
}
------------------------------
The `bool` query implements the Boolean model and, in this example, will
include only documents that contain either the term `quick` or the term `fox` or
both.
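The rewrite can be pictured with a small Python sketch. The helper name `rewrite_match` is hypothetical, and splitting on whitespace after lowercasing is only a stand-in for the real analysis chain, which depends on the field's analyzer:

```python
def rewrite_match(field, text):
    """Build a bool/should query with one term clause per query term.

    Hypothetical helper: lowercasing and whitespace splitting stand in
    for the field's analyzer, which is not modeled here.
    """
    terms = text.lower().split()
    return {
        "bool": {
            "should": [{"term": {field: term}} for term in terms]
        }
    }


rewritten = rewrite_match("text", "quick fox")
```

Under these assumptions, `rewritten` has the same shape as the `bool` query shown above: one `term` clause per word, all placed in `should`.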
As soon as a document matches a query, Lucene calculates its score for that
query, combining the scores of each matching term. The formula used for
scoring is called the _practical scoring function_.((("practical scoring function"))) It looks intimidating, but
don't be put off--most of the components you already know. It introduces a
few new elements that we discuss next.
................................
score(q,d)  =  <1>
            queryNorm(q)  <2>
          · coord(q,d)    <3>
          · ∑ (           <4>
                tf(t in d)   <5>
              · idf(t)²      <6>
              · t.getBoost() <7>
              · norm(t,d)    <8>
            ) (t in q)    <4>
................................
<1> `score(q,d)` is the relevance score of document `d` for query `q`.
<2> `queryNorm(q)` is the <<query-norm,_query normalization_ factor>> (new).
<3> `coord(q,d)` is the <<coord,_coordination_ factor>> (new).
<4> The sum of the weights for each term `t` in the query `q` for document `d`.
<5> `tf(t in d)` is the <<tf,term frequency>> for term `t` in document `d`.
<6> `idf(t)` is the <<idf,inverse document frequency>> for term `t`.
<7> `t.getBoost()` is the <<query-time-boosting,_boost_>> that has been
applied to the query (new).
<8> `norm(t,d)` is the <<field-norm,field-length norm>>, combined with the
<<index-boost,index-time field-level boost>>, if any (new).
You should recognize `score`, `tf`, and `idf`. The `queryNorm`, `coord`,
`t.getBoost()`, and `norm` factors are new.
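To see how the pieces fit together, here is a minimal Python sketch of the formula for a single field. It is an illustration, not Lucene's implementation. The function name `practical_score` is hypothetical, and the sketch assumes `tf` is the square root of the term's frequency, `norm` is `1 / √numTerms` with no index-time boost, and the IDF of each term is supplied by the caller:

```python
import math


def practical_score(query_terms, doc_tokens, idf, boosts=None):
    """Score one field of one document against a list of query terms.

    idf maps each query term to its inverse document frequency.
    boosts optionally maps a term to its query-time boost (default 1.0).
    """
    boosts = boosts or {}
    matching = [t for t in query_terms if t in doc_tokens]
    if not matching:
        return 0.0  # the Boolean model excludes this document

    # queryNorm(q): 1 / sqrt(sum of the squared IDF of each query term)
    query_norm = 1.0 / math.sqrt(sum(idf[t] ** 2 for t in query_terms))
    # coord(q,d): fraction of the query terms found in the document
    coord = len(matching) / len(query_terms)
    # norm(t,d): assumed field-length norm, without an index-time boost
    norm = 1.0 / math.sqrt(len(doc_tokens))

    total = 0.0
    for t in matching:
        tf = math.sqrt(doc_tokens.count(t))  # assumed tf = sqrt(frequency)
        total += tf * idf[t] ** 2 * boosts.get(t, 1.0) * norm
    return query_norm * coord * total


idf = {"quick": 1.0, "fox": 1.0}
both = practical_score(["quick", "fox"], ["quick", "fox"], idf)
one = practical_score(["quick", "fox"], ["fox"], idf)
```

With these made-up IDF values, the document containing both terms scores higher than the document containing only `fox`, because it matches more terms and so also receives a higher coordination factor.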
We will talk more about <<query-time-boosting,query-time boosting>> later in
this chapter, but first let's get query normalization, coordination, and
index-time field-level boosting out of the way.
[[query-norm]]
==== Query Normalization Factor
The _query normalization factor_ (`queryNorm`) is ((("practical scoring function", "query normalization factor")))((("query normalization factor")))((("normalization", "query normalization factor")))an attempt to _normalize_ a
query so that the results from one query may be compared with the results of
another.
[TIP]
==================================================
Even though the intent of the query norm is to make results from different
queries comparable, it doesn't work very well. The only purpose of
the relevance `_score` is to sort the results of the current query in the
correct order. You should not try to compare the relevance scores from
different queries.
==================================================
This factor is calculated at the beginning of the query. The actual
calculation depends on the queries involved, but a typical implementation is as follows:
..........................
queryNorm = 1 / √sumOfSquaredWeights <1>
..........................
<1> The `sumOfSquaredWeights` is calculated by squaring the IDF of each term
in the query and adding the squares together.
TIP: The same query normalization factor is applied to every document, and you
have no way of changing it. For all intents and purposes, it can be ignored.
[[coord]]
==== Query Coordination
The _coordination factor_ (`coord`) is used to((("coordination factor (coord)")))((("query coordination")))((("practical scoring function", "coordination factor"))) reward documents that contain a
higher percentage of the query terms. The more query terms that appear in
the document, the greater the chances that the document is a good match for
the query.
Imagine that we have a query for `quick brown fox`, and that the
weight for each term is 1.5. Without the coordination factor, the score would
just be the sum of the weights of the terms in a document. For instance:
* Document with `fox` -> score: 1.5
* Document with `quick fox` -> score: 3.0
* Document with `quick brown fox` -> score: 4.5
The coordination factor multiplies the score by the number of matching terms
in the document, and divides it by the total number of terms in the query.
With the coordination factor, the scores would be as follows:
* Document with `fox` -> score: `1.5 * 1 / 3` = 0.5
* Document with `quick fox` -> score: `3.0 * 2 / 3` = 2.0
* Document with `quick brown fox` -> score: `4.5 * 3 / 3` = 4.5
The coordination factor makes the document that contains all three terms
score much higher than the document that contains just two of them.
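The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation as described here; the function name `score_with_coord` is hypothetical, and the weight of 1.5 per term is the same made-up value used in the example:

```python
def score_with_coord(matched_weights, total_query_terms):
    """Sum the weights of the matching terms, then apply the coordination factor.

    The coordination factor is (number of matching terms) / (terms in the query).
    """
    raw_score = sum(matched_weights)
    return raw_score * len(matched_weights) / total_query_terms


WEIGHT = 1.5  # illustrative weight for every term
scores = {
    "fox": score_with_coord([WEIGHT], 3),
    "quick fox": score_with_coord([WEIGHT, WEIGHT], 3),
    "quick brown fox": score_with_coord([WEIGHT, WEIGHT, WEIGHT], 3),
}
```

The gap between two and three matching terms grows from 1.5 without coordination to 2.5 with it.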
Remember that the query for `quick brown fox` is rewritten into a `bool` query
like this:
[source,json]
-------------------------------
GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "quick" }},
        { "term": { "text": "brown" }},
        { "term": { "text": "fox"   }}
      ]
    }
  }
}
-------------------------------
The `bool` query uses query coordination by default for all `should` clauses,
but it does allow you to disable coordination. Why might you want to do this?
Well, usually the answer is, you don't. Query coordination is usually a good
thing. When you use a `bool` query to wrap several high-level queries like
the `match` query, it also makes sense to leave coordination enabled. The more
clauses that match, the higher the degree of overlap between your search
request and the documents that are returned.
However, in some advanced use cases, it might make sense to disable
coordination. Imagine that you are looking for the synonyms `jump`, `leap`, and
`hop`. You don't care how many of these synonyms are present, as they all
represent the same concept. In fact, only one of the synonyms is likely to be
present. This would be a good case for disabling the coordination factor:
[source,json]
-------------------------------
GET /_search
{
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        { "term": { "text": "jump" }},
        { "term": { "text": "hop"  }},
        { "term": { "text": "leap" }}
      ]
    }
  }
}
-------------------------------
When you use synonyms (see <<synonyms>>), this is exactly what
happens internally: the rewritten query disables coordination for the
synonyms. ((("synonyms", "query coordination and"))) Most use cases for disabling coordination are handled
automatically; you don't need to worry about it.
[[index-boost]]
==== Index-Time Field-Level Boosting
We will talk about _boosting_ a field--making it ((("indexing", "field-level index time boosts")))((("boosting", "index time field-level boosting")))((("practical scoring function", "index time field-level boosting")))more important than other
fields--at query time in <<query-time-boosting>>. It is also possible
to apply a boost to a field at index time. Actually, this boost is applied to
every term in the field, rather than to the field itself.
To store this boost value in the index without using more space
than necessary, this field-level index-time boost is combined with the ((("field-length norm")))field-length norm (see <<field-norm>>) and stored in the index as a single byte.
This is the value returned by `norm(t,d)` in the preceding formula.
[WARNING]
=========================================
We strongly recommend against using field-level index-time boosts for a few
reasons:
* Combining the boost with the field-length norm and storing it in a single
byte means that the field-length norm loses precision. The result is that
Elasticsearch is unable to distinguish between a field containing three words
and a field containing five words.
* To change an index-time boost, you have to reindex all your documents.
A query-time boost, on the other hand, can be changed with every query.
* If a field with an index-time boost has multiple values, the boost is
multiplied by itself for every value, dramatically increasing
the weight for that field.
<<query-time-boosting,Query-time boosting>> is a much simpler, cleaner, more
flexible option.
=========================================
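To get a feel for how coarse a single byte is, the following Python sketch encodes a norm into one byte and decodes it again. It is modeled on our understanding of Lucene's `SmallFloat.floatToByte315` encoding, so treat the bit layout as an assumption rather than a reference. The helper `stored_norm` is hypothetical and assumes that the norm is the boost multiplied by `1 / √numTerms`:

```python
import math
import struct

FLOOR = (63 - 15) << 3  # smallest encodable exponent/mantissa combination


def float_to_byte315(value):
    """Squeeze a float into one byte (assumed Lucene-style 3.15 layout)."""
    bits = struct.unpack(">i", struct.pack(">f", value))[0]
    small = bits >> 21  # keep the exponent and the top mantissa bits
    if small <= FLOOR:
        return 0 if bits <= 0 else 1  # underflow
    if small >= FLOOR + 0x100:
        return 255  # overflow
    return small - FLOOR


def byte315_to_float(encoded):
    """Expand the single byte back into a float."""
    if encoded == 0:
        return 0.0
    bits = ((encoded & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def stored_norm(num_terms, boost=1.0):
    """Norm as it would be read back from the index after the round trip."""
    return byte315_to_float(float_to_byte315(boost / math.sqrt(num_terms)))


norms = {n: stored_norm(n) for n in (1, 2, 3, 4)}
```

Under these assumptions, fields of three and four terms already decode to the same value even without a boost. Folding a boost into the same byte leaves even less room to tell field lengths apart.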
With query normalization, coordination, and index-time boosting out of the way,
we can now move on to the most useful tool for influencing the relevance
calculation: query-time boosting.((("relevance", "controlling", "Lucene's practical scoring function", range="endofrange", startref="ix_relcontPCF")))