Word Delimiter Token Filter · my-elasticsearch-cn

# Word Delimiter Token Filter（Word Delimiter 词元过滤器）命名为 word_delimiter，它将单词分解为子词，并对子词组进行可选的转换。词被分为以下规则的子词： * 拆分字内分隔符（默认情况下，所有非字母数字字符）。. * "Wi-Fi" → "Wi", "Fi" * 按大小写转换拆分: "PowerShot" → "Power", "Shot" * 按字母数字转换拆分: "SD500" → "SD", "500" * 每个子词的前导和尾随的词内分隔符都被忽略: "//hello---there, *dude*" → "hello", "there", "dude" * 每个子词都删除尾随的“s”: "O’Neil’s" → "O", "Neil" 参数包括： **generate_word_parts** **true** 导致生成单词部分："PowerShot" ⇒ "Power" "Shot"。默认 **true** **generate_number_parts** **true **导致生成数字子词："500-42" ⇒ "500" "42"。默认 **true** **catenate_numbers** ****true ****导致单词最大程度的链接到一起："wi-fi" ⇒ "wifi"。默认 **false** **catenate_numbers** ******true ******导致数字最大程度的连接到一起："500-42" ⇒ "50042"。默认 **false** **catenate_all** ********true ********导致所有的子词可以连接："wi-fi-4000" ⇒ "wifi4000"。默认 **false** **split_on_case_change** **true **导致 "PowerShot" 作为两个词元（"Power-Shot" 作为两部分看待）。默认 **true** **preserve_original** **true **在子词中保留原始词： "500-42" ⇒ "500-42" "500" "42"。默认 **false** **split_on_numerics** **true **导致 "j2se" 成为三个词元： "j" "2" "se"。默认 **true** **stem_english_possessive** **true **导致每个子词中的 "'s" 都会被移除："O’Neil’s" ⇒ "O", "Neil"。默认 **true** 高级设置： **protected_words** 被分隔时的受保护词列表。一个数组，或者也可以将**protected_words_path** 设置为配置有保护字的文件（每行一个）。如果存在，则自动解析为基于 **config/** 位置的位置。 **type_table** 例如，自定义类型映射表（使用 **type_table_path** 配置时）： | `# Map the $, %, ``'.'``, and ``','` `characters to DIGIT` `# This might be useful ``for` `financial data.` `$ => DIGIT` `% => DIGIT` `. => DIGIT` `\\u002C => DIGIT` `# in some cases you might not want to split on ZWJ` `# ``this` `also tests the ``case` `where we need a bigger ``byte``[]` `# see http:``//en.wikipedia.org/wiki/Zero-width_joiner` `\\u200D => ALPHANUM` | **提示：** 使用像 **`standard`**** tokenizer** 的 **tokenizer** 可能会干扰 **catenate_ *** 和**preserve_original** 参数，因为原始字符串在分词期间可能已经丢失了标点符号。相反，您可能需要使用 **`whitespace `tokenizer**。