[[scoring-theory]]
=== Theory Behind Relevance Scoring
Lucene (and thus Elasticsearch) uses the
http://en.wikipedia.org/wiki/Standard_Boolean_model[_Boolean model_]
to find matching documents,((("relevance scores", "theory behind", id="ix_relscore", range="startofrange")))((("Boolean Model"))) and a formula called the
<<practical-scoring-function,_practical scoring function_>>
to calculate relevance. This formula borrows concepts from
http://en.wikipedia.org/wiki/Tfidf[_term frequency/inverse document frequency_] and the
http://en.wikipedia.org/wiki/Vector_space_model[_vector space model_]
but adds more-modern features like a coordination factor, field length
normalization, and term or query clause boosting.
[NOTE]
====
Don't be alarmed! These concepts are not as complicated as the names make
them appear. While this section mentions algorithms, formulae, and mathematical
models, it is intended for consumption by mere humans. Understanding the
algorithms themselves is not as important as understanding the factors that
influence the outcome.
====
[[boolean-model]]
==== Boolean Model
The _Boolean model_ simply applies the `AND`, `OR`, and `NOT` conditions
expressed in the query to find all the documents that match.((("and operator")))((("not operator")))((("or operator"))) A query for
    full AND text AND search AND (elasticsearch OR lucene)
will include only documents that contain all of the terms `full`, `text`, and
`search`, and either `elasticsearch` or `lucene`.
This process is simple and fast. It is used to exclude any documents that
cannot possibly match the query.
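To illustrate, the same query could be expressed with the `bool` query from the query DSL. This is just a sketch, assuming the terms are matched against a field called `text`:
[source,json]
--------------------------
GET /_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "text": "full"   }},
        { "match": { "text": "text"   }},
        { "match": { "text": "search" }},
        { "bool": {
            "should": [
              { "match": { "text": "elasticsearch" }},
              { "match": { "text": "lucene" }}
            ]
        }}
      ]
    }
  }
}
--------------------------
The Boolean model cares only about whether these clauses are satisfied; how well each matching document satisfies them is the job of the scoring factors described next.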
[[tfidf]]
==== Term Frequency/Inverse Document Frequency (TF/IDF)
Once we have a list of matching documents, they need to be ranked by
relevance.((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm"))) Not all documents will contain all the terms, and some terms are
more important than others. The relevance score of the whole document
depends (in part) on the _weight_ of each query term that appears in
that document.
The weight of a term is determined by three factors, which we already
introduced in <<relevance-intro>>. The formulae are included for interest's
sake, but you are not required to remember them.
[[tf]]
===== Term frequency
How often does the term appear in this document?((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "term frequency"))) The more often, the
_higher_ the weight. A field containing five mentions of the same term is
more likely to be relevant than a field containing just one mention.
The term frequency is calculated as follows:
..........................
tf(t in d) = √frequency <1>
..........................
<1> The term frequency (`tf`) for term `t` in document `d` is the square root
of the number of times the term appears in the document.
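For example, under this formula a term that appears four times in a field receives only twice the term-frequency weight of a term that appears once:
..........................
tf(freq=1) = √1 = 1.0
tf(freq=4) = √4 = 2.0
..........................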
If you don't care about how often a term appears in a field, and all you care
about is that the term is present, then you can disable term frequencies in
the field mapping:
[source,json]
--------------------------
PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type":          "string",
          "index_options": "docs" <1>
        }
      }
    }
  }
}
--------------------------
<1> Setting `index_options` to `docs` will disable term frequencies and term
positions. A field with this mapping will not count how many times a term
appears, and will not be usable for phrase or proximity queries.
Exact-value `not_analyzed` string fields use this setting by default.
[[idf]]
===== Inverse document frequency
How often does the term appear in all documents in the collection? The more
often, the _lower_ the weight.((("inverse document frequency")))((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "inverse document frequency"))) Common terms like `and` or `the` contribute
little to relevance, as they appear in most documents, while uncommon terms
like `elastic` or `hippopotamus` help us zoom in on the most interesting
documents. The inverse document frequency is calculated as follows:
..........................
idf(t) = 1 + log ( numDocs / (docFreq + 1)) <1>
..........................
<1> The inverse document frequency (`idf`) of term `t` is the
logarithm of the number of documents in the index, divided by
the number of documents that contain the term.
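As a rough worked example (assuming the natural logarithm, which Lucene's default similarity uses here): in an index of 100 documents, a term that appears in 9 of them is weighted much more heavily than a term that appears in 99 of them:
..........................
idf = 1 + log( 100 / (9  + 1) ) = 1 + log(10) ≈ 3.3
idf = 1 + log( 100 / (99 + 1) ) = 1 + log(1)  = 1.0
..........................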
[[field-norm]]
===== Field-length norm
How long is the field? ((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "field-length norm")))((("field-length norm")))The shorter the field, the _higher_ the weight. If a
term appears in a short field, such as a `title` field, it is more likely that
the content of that field is _about_ the term than if the same term appears
in a much bigger `body` field. The field-length norm is calculated as follows:
..........................
norm(d) = 1 / √numTerms <1>
..........................
<1> The field-length norm (`norm`) is the inverse square root of the number of terms
in the field.
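For example, a term found in a 4-term `title` field carries five times the field-length norm of the same term found in a 100-term `body` field:
..........................
norm = 1 / √4   = 0.5    (a title field with 4 terms)
norm = 1 / √100 = 0.1    (a body field with 100 terms)
..........................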
While the field-length ((("string fields", "field-length norm")))norm is important for full-text search, many other
fields don't need norms. Norms consume approximately 1 byte per `string` field
per document in the index, whether or not a document contains the field. Exact-value `not_analyzed` string fields have norms disabled by default,
but you can use the field mapping to disable norms on `analyzed` fields as
well:
[source,json]
--------------------------
PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type":  "string",
          "norms": { "enabled": false } <1>
        }
      }
    }
  }
}
--------------------------
<1> This field will not take the field-length norm into account. A long field
and a short field will be scored as if they were the same length.
For use cases such as logging, norms are not useful. All you care about is
whether a field contains a particular error code or a particular browser
identifier. The length of the field does not affect the outcome. Disabling
norms can save a significant amount of memory.
===== Putting it together
These three factors--term frequency, inverse document frequency, and field-length norm--are calculated and stored at index time.((("weight", "calculation of"))) Together, they are
used to calculate the _weight_ of a single term in a particular document.
[TIP]
==================================================
When we refer to _documents_ in the preceding formulae, we are actually talking about
a field within a document. Each field has its own inverted index and thus,
for TF/IDF purposes, the value of the field is the value of the document.
==================================================
When we run a simple `term` query with `explain` set to `true` (see
<<explain>>), we can see that the only factors involved in calculating the
score are the ones explained in the preceding sections:
[role="pagebreak-before"]
[source,json]
----------------------------
PUT /my_index/doc/1
{ "text" : "quick brown fox" }

GET /my_index/doc/_search?explain
{
  "query": {
    "term": {
      "text": "fox"
    }
  }
}
----------------------------
The (abbreviated) `explanation` from the preceding request is as follows:
.......................................................
weight(text:fox in 0) [PerFieldSimilarity]: 0.15342641 <1>
  result of:
    fieldWeight in 0                 0.15342641
    product of:
      tf(freq=1.0), with freq of 1:  1.0 <2>
      idf(docFreq=1, maxDocs=1):     0.30685282 <3>
      fieldNorm(doc=0):              0.5 <4>
.......................................................
<1> The final `score` for term `fox` in field `text` in the document with internal
Lucene doc ID `0`.
<2> The term `fox` appears once in the `text` field in this document.
<3> The inverse document frequency of `fox` in the `text` field in all
documents in this index.
<4> The field-length normalization factor for this field.
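Multiplying these three factors together gives the final weight reported on the first line of the explanation:
..........................
weight = tf * idf * fieldNorm
       = 1.0 * 0.30685282 * 0.5
       = 0.15342641
..........................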
Of course, queries usually consist of more than one term, so we need a
way of combining the weights of multiple terms. For this, we turn to the
vector space model.
[[vector-space-model]]
==== Vector Space Model
The _vector space model_ provides a way of ((("Vector Space Model")))comparing a multiterm query
against a document. The output is a single score that represents how well the
document matches the query. In order to do this, the model represents both the document
and the query as _vectors_.
A vector is really just a one-dimensional array containing numbers, for example:
    [1,2,5,22,3,8]
In the vector space((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "in Vector Space Model"))) model, each number in the vector is((("weight", "calculation of", "in Vector Space Model"))) the _weight_ of a term,
as calculated with <<tfidf,term frequency/inverse document frequency>>.
[TIP]
==================================================
While TF/IDF is the default way of calculating term weights for the vector
space model, it is not the only way. Other models like Okapi-BM25 exist and
are available in Elasticsearch. TF/IDF is the default because it is a
simple, efficient algorithm that produces high-quality search results and
has stood the test of time.
==================================================
Imagine that we have a query for ``happy hippopotamus.'' A common word like
`happy` will have a low weight, while an uncommon term like `hippopotamus`
will have a high weight. Let's assume that `happy` has a weight of 2 and
`hippopotamus` has a weight of 5. We can plot this simple two-dimensional
vector—`[2,5]`—as a line on a graph starting at point (0,0) and
ending at point (2,5), as shown in <<img-vector-query>>.
[[img-vector-query]]
.A two-dimensional query vector for ``happy hippopotamus'' represented
image::images/elas_17in01.png["The query vector plotted on a graph"]
Now, imagine we have three documents:
1. I am _happy_ in summer.
2. After Christmas I'm a _hippopotamus_.
3. The _happy hippopotamus_ helped Harry.
We can create a similar vector for each document, consisting of the weight of
each query term—`happy` and `hippopotamus`—that appears in the
document, and plot these vectors on the same graph, as shown in <<img-vector-docs>>:
* Document 1: `(happy,____________)`—`[2,0]`
* Document 2: `( ___ ,hippopotamus)`—`[0,5]`
* Document 3: `(happy,hippopotamus)`—`[2,5]`
[[img-vector-docs]]
.Query and document vectors for ``happy hippopotamus''
image::images/elas_17in02.png["The query and document vectors plotted on a graph"]
The nice thing about vectors is that they can be compared. By measuring the
angle between the query vector and the document vector, it is possible to
assign a relevance score to each document. The angle between document 1 and
the query is large, so it is of low relevance. Document 2 is closer to the
query, meaning that it is reasonably relevant, and document 3 is a perfect
match.
[TIP]
==================================================
In practice, only two-dimensional vectors (queries with two terms) can be
plotted easily on a graph. Fortunately, _linear algebra_—the branch of
mathematics that deals with vectors--provides tools to compare the
angle between multidimensional vectors, which means that we can apply the
same principles explained above to queries that consist of many terms.
You can read more about how to compare two vectors by using http://en.wikipedia.org/wiki/Cosine_similarity[_cosine similarity_].
==================================================
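To make the comparison concrete, here is the cosine similarity between the query vector `[2,5]` and each of the three document vectors. This is only a sketch of the vector comparison; it is not the exact formula that Lucene uses to combine weights:
..........................
cos(q,d) = (q · d) / (|q| × |d|)

Document 1: (2×2 + 5×0) / (√29 × 2)   =  4 / 10.8 ≈ 0.37
Document 2: (2×0 + 5×5) / (√29 × 5)   = 25 / 26.9 ≈ 0.93
Document 3: (2×2 + 5×5) / (√29 × √29) = 29 / 29   = 1.00
..........................
As expected, document 1 scores lowest, document 2 is reasonably close to the query, and document 3 is a perfect match.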
Now that we have talked about the theoretical basis of scoring, we can move on
to see how scoring is implemented in Lucene.((("relevance scores", "theory behind", range="endofrange", startref="ix_relscore")))