Pattern Replace Character Filter · Elasticsearch 5.4 中文文档

# Pattern Replace Character Filter 原文链接 : [https://www.elastic.co/guide/en/elasticsearch/reference/5.3/analysis-pattern-replace-charfilter.html](https://www.elastic.co/guide/en/elasticsearch/reference/5.0/analysis-pattern-replace-charfilter.html) 译文链接 :[http://www.apache.wiki/display/Elasticsearch/Pattern+Replace+Character+Filter](http://www.apache.wiki/display/Elasticsearch/Pattern+Replace+Character+Filter) 贡献者 : [谢雄](/display/~xiexiong)，[ApacheCN](/display/~apachecn)，[Apache中文网](/display/~apachechina) **Pattern Replace Character Filter** 使用正则表达式来匹配应替换为指定替换字符串的字符。替换字符串可以引用正则表达式中的捕获组。 ## 小心"病态"正则表达式 **Pattern Replace Character Filter**使用[Java正则表达式](http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html)。一个“病态”的正则表达式可能会运行得非常慢，甚至会抛出一个**StackOverflowError**，还会导致其运行的节点突然退出。更多信息请查看 [pathological regular expressions and how to avoid them](http://www.regular-expressions.info/catastrophic.html)[.](http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) ## 配置 **Pattern Replace Character Filter** 会接收以下参数： | 参数名称 | 参数说明 | | --- | --- | | `pattern` | 一个[Java正则表达式](http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). 必填 | | `replacement` | 替换字符串，可以使用$ 1 .. $ 9语法来引用捕获组，可以参考[这里](http://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#appendReplacement-java.lang.StringBuffer-java.lang.String-)。 | | `flags` | Java正则表达式[标志](http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary)。标志应该用“|”进行分割，例如“CASE_INSENSITIVE | COMMENTS”。 | 在下面例子中，我们使用**Pattern Replace Character Filter**实现用下划线替代任何嵌入的破折号，即123-456-789→123_456_789： **案例** ``` PUT my_index { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "standard", "char_filter": [ "my_char_filter" ] } }, "char_filter": { "my_char_filter": { "type": "pattern_replace", "pattern": "(\\d+)-(?=\\d)", "replacement": "$1_" } } } } } POST my_index/_analyze { "analyzer": "my_analyzer", "text": "My credit card is 123-456-789" } ``` 上面案例将返回如下结果： ``` [ My, credit, card, is 123_456_789 ] ``` 出于搜索目的使用替换字符串，会引起原始文本长度的更改，进而导致不正确的高亮显示。案例如下所示。这个案例在遇到小写字母后跟大写字母（即fooBarBaz→foo Bar Baz）时插入空格，允许单独查询camelCase字词： **案例** ``` PUT my_index { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "standard", "char_filter": [ "my_char_filter" ], "filter": [ "lowercase" ] } }, "char_filter": { "my_char_filter": { "type": "pattern_replace", "pattern": "(?<=\\p{Lower})(?=\\p{Upper})", "replacement": " " } } } }, "mappings": { "my_type": { "properties": { "text": { "type": "text", "analyzer": "my_analyzer" } } } } } POST my_index/_analyze { "analyzer": "my_analyzer", "text": "The fooBarBaz method" } ``` 案例返回的结果如下： ``` [ the, foo, bar, baz, method ] ``` 查询 **bar** 可以正确找到文档，但突出显示结果将产生不正确的高光，因为我们的字符过滤器更改了原始文本的长度： ``` PUT my_index/my_doc/1?refresh { "text": "The fooBarBaz method" } GET my_index/_search { "query": { "match": { "text": "bar" } }, "highlight": { "fields": { "text": {} } } } ``` 以上的输出结果是： ``` { "timed_out": false, "took": $body.took, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.2824934, "hits": [ { "_index": "my_index", "_type": "my_doc", "_id": "1", "_score": 0.2824934, "_source": { "text": "The fooBarBaz method" }, "highlight": { "text": [ "The foo<em>Ba</em>rBaz method" ] } } ] } } ```