[[char-filters]]
=== Tidying Up Input Text

Tokenizers produce the best results when the input text is clean, valid
text, where _valid_ means that it follows the punctuation rules that the
Unicode algorithm expects.((("text", "tidying up text input for tokenizers")))((("words", "identifying", "tidying up text input"))) Quite often, though, the text we need to
process is anything but clean. Cleaning it up before tokenization improves
the quality of the output.

==== Tokenizing HTML

Passing HTML through the `standard` tokenizer or the `icu_tokenizer`
produces poor results.((("HTML, tokenizing"))) These tokenizers just don't know what to do with
the HTML tags. For example:

[source,js]
--------------------------------------------------
GET /_analyze?tokenizer=standard
<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com">website</a>
--------------------------------------------------

The `standard` tokenizer((("standard tokenizer", "tokenizing HTML"))) confuses HTML tags and entities, and emits the
following tokens: `p`, `Some`, `d`, `eacute`, `j`, `agrave`, `vu`, `a`,
`href`, `http`, `somedomain.com`, `website`, `a`. Clearly not what was
intended!

_Character filters_ can be added to an analyzer to ((("character filters")))preprocess the text
_before_ it is passed to the tokenizer. In this case, we can use the
`html_strip` character filter((("analyzers", "adding character filters to")))((("html_strip character filter"))) to remove HTML tags and to decode HTML
entities such as `&eacute;` into the corresponding Unicode characters.

Character filters can be tested out via the `analyze` API by specifying
them in the query string:

[source,js]
--------------------------------------------------
GET /_analyze?tokenizer=standard&char_filters=html_strip
<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com">website</a>
--------------------------------------------------

To use them as part of the analyzer, they should be added to a
`custom` analyzer definition:

[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_html_analyzer": {
                    "tokenizer":   "standard",
                    "char_filter": [ "html_strip" ]
                }
            }
        }
    }
}
--------------------------------------------------

Once created, our new `my_html_analyzer` can be tested with the `analyze`
API:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?analyzer=my_html_analyzer
<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com">website</a>
--------------------------------------------------

This emits the tokens that we expect: `Some`, ++déjà++, `vu`, `website`.

==== Tidying Up Punctuation

The `standard` tokenizer and `icu_tokenizer` both understand that an
apostrophe _within_ a word should be treated as part of the word, while
single quotes that _surround_ a word should not.((("standard tokenizer", "handling of punctuation")))((("icu_tokenizer", "handling of punctuation")))((("punctuation", "tokenizers&#x27; handling of"))) Tokenizing the text
`You're my 'favorite'` would correctly emit the tokens `You're`, `my`,
`favorite`.
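If you want to verify this behavior yourself, the same style of `analyze`
request used above will show it (the exact response format varies a little
between Elasticsearch versions, but the emitted tokens are what matter):

[source,js]
--------------------------------------------------
GET /_analyze?tokenizer=standard
You're my 'favorite'
--------------------------------------------------

The single quotes surrounding `favorite` are treated as word boundaries and
dropped, while the apostrophe inside `You're` survives as part of the token.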
Unfortunately,((("apostrophes"))) Unicode lists a few characters that are
sometimes used as apostrophes:

`U+0027`::
    Apostrophe (`'`)&#x2014;the original ASCII character

`U+2018`::
    Left single-quotation mark (`‘`)&#x2014;opening quote when single-quoting

`U+2019`::
    Right single-quotation mark (`’`)&#x2014;closing quote when single-quoting,
    but also the preferred character to use as an apostrophe

Both tokenizers treat these three characters as an apostrophe (and thus as
part of the word) when they appear within a word.

Then there are another three apostrophe-like characters:

`U+201B`::
    Single high-reversed-9 quotation mark (`‛`)&#x2014;same as `U+2018` but
    differs in appearance

`U+0091`::
    Left single-quotation mark in Windows-1252&#x2014;should not be used in Unicode

`U+0092`::
    Right single-quotation mark in Windows-1252&#x2014;should not be used in Unicode

Both tokenizers treat these three characters as word boundaries--a place to
break text into tokens.((("quotation marks"))) Unfortunately, some publishers use `U+201B` as
a stylized way to write names like `M‛coy`, and the latter two characters may
well be produced by your word processor, depending on its age.

Even when using the ``acceptable'' quotation marks, a word written with a
single right quotation mark&#x2014;`You’re`&#x2014;is not the same as the word written
with an apostrophe&#x2014;`You're`&#x2014;which means that a query for one variant will
not find the other.

Fortunately, it is possible to sort out this mess with the `mapping`
character filter,((("character filters", "mapping character filter")))((("mapping character filter"))) which allows us to replace all instances of one character
with another. In this case, we will replace all apostrophe variants with
the simple `U+0027` apostrophe:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": { <1>
        "quotes": {
          "type": "mapping",
          "mappings": [ <2>
            "\\u0091=>\\u0027",
            "\\u0092=>\\u0027",
            "\\u2018=>\\u0027",
            "\\u2019=>\\u0027",
            "\\u201B=>\\u0027"
          ]
        }
      },
      "analyzer": {
        "quotes_analyzer": {
          "tokenizer":   "standard",
          "char_filter": [ "quotes" ] <3>
        }
      }
    }
  }
}
--------------------------------------------------
<1> We define a custom `char_filter` called `quotes` that maps all apostrophe
    variants to a simple apostrophe.
<2> For clarity, we have used the JSON Unicode escape syntax for each
    character, but we could just have used the characters themselves:
    `"‘=>'"`.
<3> We use our custom `quotes` character filter to create a new analyzer
    called `quotes_analyzer`.

As always, we test the analyzer after creating it:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?analyzer=quotes_analyzer
You’re my ‘favorite’ M‛Coy
--------------------------------------------------

This example returns the following tokens, with all of the in-word quotation
marks replaced by apostrophes: `You're`, `my`, `favorite`, `M'Coy`.

The more effort that you put into ensuring that the tokenizer receives
good-quality input, the better your search results will be.
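Note that a custom analyzer such as `quotes_analyzer` takes effect only once
a field is mapped to use it. A minimal sketch of such a mapping follows; the
`my_type` type and `title` field names are purely illustrative, and the
`string` field type reflects the Elasticsearch 1.x era of this guide (newer
versions use `text` instead):

[source,js]
--------------------------------------------------
PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type":     "string",
      "analyzer": "quotes_analyzer"
    }
  }
}
--------------------------------------------------

With this mapping in place, text indexed into `title` and, by default, query
strings run against it both pass through the `quotes` character filter, so
`You’re` and `You're` will match each other.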