Grok Processor（Grok 处理器） · Elasticsearch 5.4 中文文档

# Grok Processor（Grok 处理器）原文链接 : [https://www.elastic.co/guide/en/elasticsearch/reference/5.3/grok-processor.html](https://www.elastic.co/guide/en/elasticsearch/reference/5.3/grok-processor.html) 译文链接 : [http://www.apache.wiki/pages/viewpage.action?pageId=10027802](http://www.apache.wiki/pages/viewpage.action?pageId=10027802) 贡献者 : [那伊抹微笑](/display/~wangyangting)，[ApacheCN](/display/~apachecn)，[Apache中文网](/display/~apachechina) 从 **document**（文档）中的单个 **text** **filed**（文本字段）提取 **structured** **fields**（结构化字段）。您可以选择从哪个字段来提取所匹配的字段，以及您想要匹配的 **grok ****pattern**。**grok pattern** 就像正则表达式，并且支持可以重用的 **aliased** **expressions**（别名表达式）。此工具非常适用于 **syslog** **logs**，**apache** 和其它的 **webserver** **logs**，**mysql** **logs**，以及一般情况下，用于人类而不是计算机使用的任何的 **log** **format**（日志格式）。该 **processor**（处理器）包含超过 [120 种可重用的 patterns](https://github.com/elastic/elasticsearch/tree/master/modules/ingest-common/src/main/resources/patterns)。如果您需要工具来帮助 **building** **patterns**（构建模式）以匹配 **log**（日志），您将会发现 [http://grokdebug.herokuapp.com](http://grokdebug.herokuapp.com/) 和 [http://grokconstructor.appspot.com/](http://grokconstructor.appspot.com/) 应用程序是相当有用的。 ### Grok Basics（Grok 基础） **Grok** 以 **regular** **expressions**（正则表达式）为基础，所以在 **Grok** 中的任何正则表达式也是有效的。正则表达式库是 **Oniguruma**。您可以在[ Onigiruma 网站上](https://github.com/kkos/oniguruma/blob/master/doc/RE) 查看所支持的完整的 **r****egexp syntax**（正在表达式语法）。 **Grok** 通过利用这种正则表达式语言来工作，允许命名现有的 **pattern**（模式），并将它们组合成与您的字段相匹配的更复杂的 **pattern**（模式）。对于重用 **grok pattern**（**grok** 模式）的语法有三种形式 : **`%{SYNTAX:SEMANTIC}`**，**`%{SYNTAX}`**，`**%{SYNTAX:SEMANTIC:TYPE}**。` 该 **SYNTAX**（语法）是将要匹配您的文本的 **pattern**（模式）的名称。例如，**3.44** 将会被 **NUMBER** 模式匹配并且 **`55.3.244.1`** 将会被 **IP** 模式匹配。该语法是告诉你如何匹配的。`**NUMBER** `和 `**IP** `都是在 **default patterns set**（默认模式集）中提供的 **pattern**（模式）。该 **SEMANTIC**（语义）是您给一段被匹配的文本的标识符。例如，**3.44** 可以是事件的持续时间，所以你可以简单的称之为 **duration**。此外，字符串 **55.3.244.1** 可能会标识 **client** 发出的请求。该 **TYPE**（类型）是您希望转换您命名的 **field**（字段）的 **type**（类型）。**int** 和 **float** 是目前唯一所支持的强制类型。例如，您可能想要去匹配以下文本 : ``` 3.44 55.3.244.1 ``` 您可能知道该示例中的消息是一个 **number**（数字），后跟一个 **IP address**（**IP** 地址）。您可以通过使用下列的 **Grok ****expression**（**Grok** 表达式）来匹配这个文本。 ``` %{NUMBER:duration} %{IP:client} ``` ### Using the Grok Processor in a Pipeline（在管道中使用 Grok 表达式） #### Table 20. Grok Options（表 20\. Grok 选项） | Name（名称） | Required（必要的） | Default（默认值） | Description（描述） | | --- | --- | --- | --- | | **`field`** | **yes** | **-** | The field to use for grok expression parsing | | **`patterns`** | **yes** | **-** | An ordered list of grok expression to match and extract named captures with. Returns on the first expression in the list that matches. | | **`pattern_definitions`** | **no** | **-** | A map of pattern-name and pattern tuples defining custom patterns to be used by the current processor. Patterns matching existing names will override the pre-existing definition. | | **`trace_match`** | **no** | **false** | when true, `_ingest._grok_match_index` will be inserted into your matched document’s metadata with the index into the pattern found in `patterns` that matched. | | **`ignore_missing`** | **no** | **false** | If `true` and `field` does not exist or is `null`, the processor quietly exits without modifying the document | 以下是使用提供的 **pattern**（模式）从 **document**（文档）中的 **string** **field**（字符串字段）中提取和命名结构化字段的示例。 ``` { "message": "55.3.244.1 GET /index.html 15824 0.043" } ``` 这个 **pattern**（模式）可以是 : ``` %{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration} ``` 以下是一个使用 **Grok** 处理上述 **document**（文档）的示例 **pipeline**（管道）: ``` { "description" : "...", "processors": [ { "grok": { "field": "message", "patterns": ["%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}"] } } ] } ``` 此 **pipeline**（管道）将这些 **named captures**（命名捕获）作为文档中的新字段插入，如下所示 : ``` { "message": "55.3.244.1 GET /index.html 15824 0.043", "client": "55.3.244.1", "method": "GET", "request": "/index.html", "bytes": 15824, "duration": "0.043" } ``` ### Custom Patterns and Pattern Files（自定义模式和模式文件）该 **Grok** **processor** 采用基本的 **pattern**（模式）进行预包装。这些 **pattern**（模式）可能并不总是有你想要的。**Pattern** 有一个非常基本的格式。每个 **entry** 描述有一个 **name**（名称）和 **pattern**（模式）本身。您也可以在 **pattern_definitions** 选项下添加您自己的 **pattern**（模式）到 **processor** **definition**（处理器定义）中。以下是一个指定自定义 **pattern** **definitions**（模式定义）的 **pipeline**（管道）: ``` { "description" : "...", "processors": [ { "grok": { "field": "message", "patterns": ["my %{FAVORITE_DOG:dog} is colored %{RGB:color}"] "pattern_definitions" : { "FAVORITE_DOG" : "beagle", "RGB" : "RED|GREEN|BLUE" } } } ] } ``` ### Providing Multiple Match Patterns（提供多个匹配模式）有时一种 **pattern**（模式）不足以捕捉一个 **field**（字段）的潜在结构。假设我们要匹配包含您最喜欢的猫或狗宠物品种的所有 **message**（消息）。实现这一点的一个方法是提供两个不同的 **pattern**（模式），而不是一个真正复杂的表达式所捕获相同的 `**or** `行为。以下是针对 **simulate** **API**（模拟 **API**）执行的这种配置的示例 : ``` curl -XPOST 'localhost:9200/_ingest/pipeline/_simulate?pretty' -H 'Content-Type: application/json' -d' { "pipeline": { "description" : "parse multiple patterns", "processors": [ { "grok": { "field": "message", "patterns": ["%{FAVORITE_DOG:pet}", "%{FAVORITE_CAT:pet}"], "pattern_definitions" : { "FAVORITE_DOG" : "beagle", "FAVORITE_CAT" : "burmese" } } } ] }, "docs":[ { "_source": { "message": "I love burmese cats!" } } ] } ' ``` 响应如下 : ``` { "docs": [ { "doc": { "_type": "_type", "_index": "_index", "_id": "_id", "_source": { "message": "I love burmese cats!", "pet": "burmese" }, "_ingest": { "timestamp": "2016-11-08T19:43:03.850+0000" } } } ] } ``` 两种 **pattern**（模式）都将使用适当的匹配来设置该字段 **pet**。但是如果要跟踪是哪以个模式匹配并且填充了字段，该怎么办呢？我们可以通过使用 **`trace_match`**参数来做到这一点。以下是一个一样 **pipeline**（管道）的输出，但是使用的是 "**trace_match**": **true **的配置 : ``` { "docs": [ { "doc": { "_type": "_type", "_index": "_index", "_id": "_id", "_source": { "message": "I love burmese cats!", "pet": "burmese" }, "_ingest": { "_grok_match_index": "1", "timestamp": "2016-11-08T19:43:03.850+0000" } } } ] } ``` 在上述响应中，您可以看到匹配的 **pattern**（模式）的 **index**（索引）为 `"**1**"`。这就是说，它是在 **patterns** 中用于匹配的第二个（索引从零开始）模式。这些所跟踪的元数据可以调试哪些 **patterns**（模式）被匹配到了。这些信息存储在 **ingest** **metadata**（元数据）中，并且不会被索引。