[[using-stopwords]]
=== Using Stopwords
The removal of stopwords is ((("stopwords", "removal of")))handled by the
http://bit.ly/1INX4tN[`stop` token filter] which can be used
when ((("stop token filter")))creating a `custom` analyzer (see <<stop-token-filter>>).
However, some out-of-the-box analyzers((("analyzers", "stop filter pre-integrated")))((("pattern analyzer", "stopwords and")))((("standard analyzer", "stop filter")))((("language analyzers", "stop filter pre-integrated"))) come with the `stop` filter pre-integrated:
http://bit.ly/1xtdoJV[Language analyzers]::
Each language analyzer defaults to using the appropriate stopwords list
for that language. For instance, the `english` analyzer uses the
`_english_` stopwords list.
http://bit.ly/14EpXv3[`standard` analyzer]::
Defaults to the empty stopwords list: `_none_`, essentially disabling
stopwords.
http://bit.ly/1u9OVct[`pattern` analyzer]::
Defaults to `_none_`, like the `standard` analyzer.
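For example, we can pass a sentence through the `english` analyzer with the `analyze` API (using the same query-string form of the API that appears later in this chapter) and see its default `_english_` stopwords, such as `and` and `the`, disappear from the token output. This is just a quick sketch for experimentation:
[source,json]
---------------------------------
GET /_analyze?analyzer=english
The quick and the dead
---------------------------------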
==== Stopwords and the Standard Analyzer
To use custom stopwords in conjunction with ((("standard analyzer", "stopwords and")))((("stopwords", "using with standard analyzer")))the `standard` analyzer, all we
need to do is to create a configured version of the analyzer and pass in the
list of `stopwords` that we require:
[source,json]
---------------------------------
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": { <1>
"type": "standard", <2>
"stopwords": [ "and", "the" ] <3>
}
}
}
}
}
---------------------------------
<1> This is a custom analyzer called `my_analyzer`.
<2> This analyzer is the `standard` analyzer with some custom configuration.
<3> The stopwords to filter out are `and` and `the`.
TIP: This same technique can be used to configure custom stopword lists for
any of the language analyzers.
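For instance, here is a minimal sketch that gives the `english` analyzer a custom stopwords list instead of its default `_english_` list (the analyzer name `my_english` is just illustrative):
[source,json]
---------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english": {
                    "type": "english",
                    "stopwords": [ "and", "the" ]
                }
            }
        }
    }
}
---------------------------------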
[[maintaining-positions]]
==== Maintaining Positions
The output from the `analyze` API((("stopwords", "maintaining position of terms and"))) is quite interesting:
[source,json]
---------------------------------
GET /my_index/_analyze?analyzer=my_analyzer
The quick and the dead
---------------------------------
[source,json]
---------------------------------
{
"tokens": [
{
"token": "quick",
"start_offset": 4,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 2 <1>
},
{
"token": "dead",
"start_offset": 18,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 5 <1>
}
]
}
---------------------------------
<1> Note the `position` of each token.
The stopwords have been filtered out, as expected, but the interesting part is
that the `position` of the((("phrase matching", "stopwords and", "positions data"))) two remaining terms is unchanged: `quick` is the
second word in the original sentence, and `dead` is the fifth. This is
important for phrase queries--if the positions of each term had been
adjusted, a phrase query for `quick dead` would have matched the preceding
example incorrectly.
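To make this concrete, here is a sketch of a `match_phrase` query against a hypothetical `title` field indexed with `my_analyzer` (the field name is illustrative, not part of the preceding example). Because the positions 2 and 5 are preserved, `quick` and `dead` are not adjacent, and the phrase does not match the example sentence, which is exactly what we want:
[source,json]
---------------------------------
GET /my_index/_search
{
    "query": {
        "match_phrase": {
            "title": "quick dead"
        }
    }
}
---------------------------------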
[[specifying-stopwords]]
==== Specifying Stopwords
Stopwords can be passed inline, as we did in ((("stopwords", "specifying")))the previous example, by
specifying an array:
[source,json]
---------------------------------
"stopwords": [ "and", "the" ]
---------------------------------
The default stopword list for a particular language can be specified using the
`_lang_` notation:
[source,json]
---------------------------------
"stopwords": "_english_"
---------------------------------
TIP: The predefined language-specific stopword((("languages", "predefined stopword lists for"))) lists available in
Elasticsearch can be found in the
http://bit.ly/157YLFy[`stop` token filter] documentation.
Stopwords can be disabled by ((("stopwords", "disabling")))specifying the special list: `_none_`. For
instance, to use the `english` analyzer((("english analyzer", "using without stopwords"))) without stopwords, you can do the
following:
[source,json]
---------------------------------
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_english": {
"type": "english", <1>
"stopwords": "_none_" <2>
}
}
}
}
}
---------------------------------
<1> The `my_english` analyzer is based on the `english` analyzer.
<2> But stopwords are disabled.
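As a quick check, we could run the earlier test sentence through this analyzer; with stopwords disabled, tokens for `the` and `and` should now appear in the output alongside `quick` and `dead`:
[source,json]
---------------------------------
GET /my_index/_analyze?analyzer=my_english
The quick and the dead
---------------------------------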
Finally, stopwords can also be listed in a file with one word per line. The
file must be present on all nodes in the cluster, and the path can be
specified((("stopwords_path parameter"))) with the `stopwords_path` parameter:
[source,json]
---------------------------------
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_english": {
"type": "english",
"stopwords_path": "stopwords/english.txt" <1>
}
}
}
}
}
---------------------------------
<1> The path to the stopwords file, relative to the Elasticsearch `config`
directory.
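The file itself is plain text with one word per line. An illustrative sketch of what `config/stopwords/english.txt` might contain:
[source,text]
---------------------------------
and
the
or
---------------------------------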
[[stop-token-filter]]
==== Using the stop Token Filter
The http://bit.ly/1AUzDNI[`stop` token filter] can be combined
with a tokenizer((("stopwords", "using stop token filter")))((("stop token filter", "using in custom analyzer"))) and other token filters when you need to create a `custom`
analyzer. For instance, let's say that we wanted to ((("Spanish", "custom analyzer for")))((("light_spanish stemmer")))create a Spanish analyzer
with the following:
* A custom stopwords list
* The `light_spanish` stemmer
* The <<asciifolding-token-filter,`asciifolding` filter>> to remove diacritics
We could set that up as follows:
[source,json]
---------------------------------
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"spanish_stop": {
"type": "stop",
"stopwords": [ "si", "esta", "el", "la" ] <1>
},
"light_spanish": { <2>
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"my_spanish": {
"tokenizer": "spanish",
"filter": [ <3>
"lowercase",
"asciifolding",
"spanish_stop",
"light_spanish"
]
}
}
}
}
}
---------------------------------
<1> The `stop` token filter takes the same `stopwords` and `stopwords_path`
parameters as the `standard` analyzer.
<2> See <<algorithmic-stemmers>>.
<3> The order of token filters is important, as explained next.
We have placed the `spanish_stop` filter after the `asciifolding` filter.((("asciifolding token filter", "in custom Spanish analyzer"))) This
means that `esta`, `ésta`, and ++está++ will first have their diacritics
removed to become just `esta`, which will then be removed as a stopword. If,
instead, we wanted to remove `esta` and `ésta`, but not ++está++, we
would have to put the `spanish_stop` filter _before_ the `asciifolding`
filter, and specify both words in the stopwords list.
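That alternative would look something like the following sketch, in which only the filter order and the stopwords list change from the previous example:
[source,json]
---------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "spanish_stop": {
                    "type": "stop",
                    "stopwords": [ "si", "esta", "ésta", "el", "la" ] <1>
                },
                "light_spanish": {
                    "type": "stemmer",
                    "language": "light_spanish"
                }
            },
            "analyzer": {
                "my_spanish": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "spanish_stop", <2>
                        "asciifolding",
                        "light_spanish"
                    ]
                }
            }
        }
    }
}
---------------------------------
<1> Both `esta` and `ésta` are listed explicitly.
<2> The `spanish_stop` filter now runs before `asciifolding`, so ++está++ keeps its accent and is not removed as a stopword.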
[[updating-stopwords]]
==== Updating Stopwords
A few techniques can be used to update the list of stopwords
used by an analyzer.((("analyzers", "stopwords list, updating")))((("stopwords", "updating list used by analyzers"))) Analyzers are instantiated at index creation time, when a
node is restarted, or when a closed index is reopened.
If you specify stopwords inline with the `stopwords` parameter, your
only option is to close the index and update the analyzer configuration with the
http://bit.ly/1zijFPx[update index settings API], then reopen
the index.
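For the `my_analyzer` example from earlier in this chapter, that round trip might look like the following sketch (the extra stopword `or` is purely illustrative):
[source,json]
---------------------------------
POST /my_index/_close

PUT /my_index/_settings
{
    "analysis": {
        "analyzer": {
            "my_analyzer": {
                "type": "standard",
                "stopwords": [ "and", "the", "or" ]
            }
        }
    }
}

POST /my_index/_open
---------------------------------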
Updating stopwords is easier if you specify them in a file with the
`stopwords_path` parameter.((("stopwords_path parameter"))) You can just update the file (on every node in
the cluster) and then force the analyzers to be re-created by either of these actions:
* Closing and reopening the index
(see http://bit.ly/1B6s0WY[open/close index]), or
* Restarting each node in the cluster, one by one
Of course, updating the stopwords list will not change any documents that have
already been indexed. It will apply only to searches and to new or updated
documents. To apply the changes to existing documents, you will need to
reindex your data. See <<reindex>>.