Analysis · Elasticsearch: The Definitive Guide

[[analysis-intro]]
=== Analysis and analyzers

_Analysis_ is the process of:

- first, tokenizing a block of text into individual _terms_ suitable for use
  in an inverted index,
- then normalizing these terms into a standard form to improve their
  ``searchability'' or _recall_.

This job is performed by _analyzers_. An _analyzer_ is really just a wrapper
which combines three functions into a single package:

Character filters::

First, the string is passed through any _character filters_ in turn. Their
job is to tidy up the string before tokenization. A character filter could
be used to strip out HTML, or to convert `"&"` characters to the word
`"and"`.

Tokenizer::

Next, the string is tokenized into individual terms by a _tokenizer_. A
simple tokenizer might split the text up into terms whenever it encounters
whitespace or punctuation.

Token filters::

Last, each term is passed through any _token filters_ in turn, which can
change terms (e.g. lowercasing `"Quick"`), remove terms (e.g. stopwords like
`"a"`, `"and"`, `"the"` etc.) or add terms (e.g. synonyms like `"jump"` and
`"leap"`).

Elasticsearch provides many character filters, tokenizers and token filters
out of the box. These can be combined to create custom analyzers suitable
for different purposes. We will discuss these in detail in <>.

==== Built-in analyzers

However, Elasticsearch also ships with a number of pre-packaged analyzers that
you can use directly. We list the most important ones below and, to demonstrate
the difference in behaviour, we show what terms each would produce
from this string:

    "Set the shape to semi-transparent by calling set_trans(5)"

Standard analyzer::

The standard analyzer is the default analyzer that Elasticsearch uses. It is
the best general choice for analyzing text which may be in any language. It
splits the text on _word boundaries_, as defined by the
http://www.unicode.org/reports/tr29/[Unicode Consortium], and removes most
punctuation.
Finally, it lowercases all terms. It would produce:
+
    set, the, shape, to, semi, transparent, by, calling, set_trans, 5

Simple analyzer::

The simple analyzer splits the text on anything that isn't a letter,
and lowercases the terms. It would produce:
+
    set, the, shape, to, semi, transparent, by, calling, set, trans

Whitespace analyzer::

The whitespace analyzer splits the text on whitespace. It doesn't
lowercase. It would produce:
+
    Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

Language analyzers::

Language-specific analyzers are available for many languages. They are able to
take the peculiarities of the specified language into account. For instance,
the `english` analyzer comes with a set of English stopwords -- common words
like `and` or `the` which don't have much impact on relevance -- which it
removes, and it is able to _stem_ English words because it understands the
rules of English grammar.
+
The `english` analyzer would produce the following:
+
    set, shape, semi, transpar, call, set_tran, 5
+
Note how `"transparent"`, `"calling"`, and `"set_trans"` have been stemmed to
their root form.

==== When analyzers are used

When we _index_ a document, its full text fields are analyzed into terms which
are used to create the inverted index. However, when we _search_ on a full
text field, we need to pass the query string through the _same analysis
process_, to ensure that we are searching for terms in the same form as those
that exist in the index.

Full text queries, which we will discuss later, understand how each field is
defined, and so they can do the right thing:

- When you query a _full text_ field, the query will apply the same analyzer
  to the query string to produce the correct list of terms to search for.
- When you query an _exact value_ field, the query will not analyze the
  query string, but instead search for the exact value that you have
  specified.
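
The three stages described above (character filters, tokenizer, token filters), and the rule that a full text query string goes through the same analysis as the indexed text, can be illustrated with a small Python sketch. This is a toy model, not how Elasticsearch implements any of its built-in analyzers: the function names, the sample documents, and the regular expression used for splitting are all assumptions made for illustration, and the stopword set contains only the three examples named in the text.

```python
import re

# The three stopword examples named in the text; real analyzers use longer lists.
STOPWORDS = {"a", "and", "the"}

def char_filter(text):
    """Character filter: tidy the raw string, here turning "&" into "and"."""
    return text.replace("&", " and ")

def tokenize(text):
    """Tokenizer: split wherever something other than a letter or digit appears."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def token_filters(terms):
    """Token filters: lowercase every term, then drop stopwords."""
    lowered = [t.lower() for t in terms]
    return [t for t in lowered if t not in STOPWORDS]

def analyze(text):
    """A toy analyzer: the three stages wrapped into a single function."""
    return token_filters(tokenize(char_filter(text)))

def build_index(docs):
    """Build an inverted index mapping each term to the set of document IDs."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Analyze the query the same way, then match documents with any term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

# Hypothetical sample documents.
docs = {1: "The Quick & the Brown", 2: "quick-thinking foxes"}
index = build_index(docs)

print(analyze(docs[1]))                # ['quick', 'brown']
print(sorted(search(index, "QUICK")))  # [1, 2]
print("QUICK" in index)                # False
```

The last line shows why the query has to be analyzed: the index only holds the normalized term `quick`, so looking up the raw string `QUICK` would find nothing.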
Now you can understand why the queries that we demonstrated at the
<> return what they do:

- The `date` field contains an exact value: the single term `"2014-09-15"`.
- The `_all` field is a full text field, so the analysis process has
  converted the date into the three terms: `"2014"`, `"09"` and `"15"`.

When we query the `_all` field for `2014`, it matches all 12 tweets, because
all of them contain the term `2014`:

[source,sh]
--------------------------------------------------
GET /_search?q=2014              # 12 results
--------------------------------------------------
// SENSE: 052_Mapping_Analysis/25_Data_type_differences.json

When we query the `_all` field for `2014-09-15`, it first analyzes the query
string to produce a query which matches _any_ of the terms `2014`, `09` or
`15`. This also matches all 12 tweets, because all of them contain the term
`2014`:

[source,sh]
--------------------------------------------------
GET /_search?q=2014-09-15        # 12 results !
--------------------------------------------------
// SENSE: 052_Mapping_Analysis/25_Data_type_differences.json

When we query the `date` field for `2014-09-15`, it looks for that _exact_
date, and finds one tweet only:

[source,sh]
--------------------------------------------------
GET /_search?q=date:2014-09-15   # 1 result
--------------------------------------------------
// SENSE: 052_Mapping_Analysis/25_Data_type_differences.json

When we query the `date` field for `2014`, it finds no documents
because none contain that exact date:

[source,sh]
--------------------------------------------------
GET /_search?q=date:2014         # 0 results !
--------------------------------------------------
// SENSE: 052_Mapping_Analysis/25_Data_type_differences.json

[[analyze-api]]
==== Testing analyzers

Especially when you are new to Elasticsearch, it is sometimes difficult to
understand what is actually being tokenized and stored into your index. To
better understand what is going on, you can use the `analyze` API to see how
text is analyzed.
Specify which analyzer to use in the query string
parameters, and the text to analyze in the body:

[source,js]
--------------------------------------------------
GET /_analyze?analyzer=standard
Text to analyze
--------------------------------------------------
// SENSE: 052_Mapping_Analysis/40_Analyze.json

Each element in the result represents a single term:

[source,js]
--------------------------------------------------
{
   "tokens": [
      {
         "token":        "text",
         "start_offset": 0,
         "end_offset":   4,
         "type":         "<ALPHANUM>",
         "position":     1
      },
      {
         "token":        "to",
         "start_offset": 5,
         "end_offset":   7,
         "type":         "<ALPHANUM>",
         "position":     2
      },
      {
         "token":        "analyze",
         "start_offset": 8,
         "end_offset":   15,
         "type":         "<ALPHANUM>",
         "position":     3
      }
   ]
}
--------------------------------------------------

The `token` is the actual term that will be stored in the index. The
`position` indicates the order in which the terms appeared in the original
text. The `start_offset` and `end_offset` indicate the character positions
that the original word occupied in the original string.

The `analyze` API is a really useful tool for understanding what is happening
inside Elasticsearch indices, and we will talk more about it as we progress.

==== Specifying analyzers

When Elasticsearch detects a new string field in your documents, it
automatically configures it as a full text `string` field and analyzes it with
the `standard` analyzer.

You don't always want this. Perhaps you want to apply a different analyzer
which suits the language your data is in. And sometimes you want a
string field to be just a string field -- to index the exact value that
you pass in, without any analysis, such as a string user ID or an
internal status field or tag.

In order to achieve this, we have to configure these fields manually
by specifying the _mapping_.
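
As a rough illustration of what such a manual configuration might look like, the Python sketch below builds a mapping body as a plain dictionary and reports which analyzer each field would end up with. It assumes the `string` field type described in this section, together with the `analyzer` and `index` settings (with `index` set to `not_analyzed` for exact values) from the Elasticsearch versions that have that type. The field names and the helper `effective_analyzer` are hypothetical, and nothing here talks to a running cluster.

```python
import json

# Hypothetical field names. The "string" type and the "index"/"analyzer"
# settings assume the mapping format of the Elasticsearch versions that
# still have a "string" field type, as described in this section.
properties = {
    "body":    {"type": "string", "analyzer": "english"},    # full text, English analysis
    "user_id": {"type": "string", "index": "not_analyzed"},  # exact value, no analysis
    "title":   {"type": "string"},                           # left at the defaults
}

def effective_analyzer(field):
    """Return the analyzer a string field would use, or None for exact values."""
    if field.get("index", "analyzed") != "analyzed":
        return None
    # Per the text, string fields fall back to the `standard` analyzer.
    return field.get("analyzer", "standard")

# Serialize the mapping body as JSON, the form in which it would be sent.
mapping_body = json.dumps({"properties": properties}, indent=2)
print(mapping_body)

for name, field in properties.items():
    print(name, "->", effective_analyzer(field))
```

The helper only mirrors the defaults stated in the text: an unconfigured string field is analyzed with `standard`, a field with an explicit analyzer uses that analyzer, and a field marked as an exact value is not analyzed at all.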