Getting started · Chinese translation of the official Prometheus documentation

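# **Getting started**

This guide is a "Hello World"-style tutorial that shows how to install, configure, and use Prometheus in a simple example setup. You will download and run Prometheus locally, configure it to scrape itself and an example application, and then work with queries, rules, and graphs to make use of the collected time series data.

## **Downloading and running Prometheus**

Download the [latest release of Prometheus](https://prometheus.io/download) for your platform, then extract it and change into the extracted directory:

~~~
tar xvfz prometheus-*.tar.gz
cd prometheus-*
~~~

Before starting Prometheus, let's configure it.

## **Configuring Prometheus to monitor itself**

Prometheus collects metrics from monitored targets by scraping HTTP endpoints on those targets. Since Prometheus exposes data about itself in the same way, it can also scrape and monitor its own health. Collecting only the Prometheus server's own metrics is not very useful in practice, but it makes a good starting example. Save the following basic Prometheus configuration as a file named `prometheus.yml`:

~~~
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (e.g. federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# here it is Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any time series scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']
~~~

For a complete specification of configuration options, see the [configuration documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/).

## **Starting Prometheus**

To start Prometheus with the newly created configuration file, change to the directory containing the Prometheus binary and run:

~~~
# Start Prometheus.
# By default, Prometheus stores its database in ./data (flag --storage.tsdb.path).
./prometheus --config.file=prometheus.yml
~~~

Prometheus should now be running. You can browse its status page at `localhost:9090`. Give it a few seconds to collect data about itself from its own HTTP metrics endpoint. You can also verify that Prometheus is serving metrics about itself by navigating to its metrics endpoint: `localhost:9090/metrics`

## **Using the expression browser**

Let's look at some of the data Prometheus has collected about itself. To use Prometheus's built-in expression browser, navigate to `localhost:9090/graph` and choose the "Console" view within the "Graph" tab. As you can see at `localhost:9090/metrics`, one metric that Prometheus exports about itself is called `prometheus_target_interval_length_seconds` (the actual amount of time between target scrapes). Enter it into the expression console:

~~~
prometheus_target_interval_length_seconds
~~~

This returns a number of different time series (along with the latest value recorded for each), all with the metric name `prometheus_target_interval_length_seconds` but with different labels. These labels designate different latency percentiles and target group intervals. If we are only interested in the 99th percentile latencies, we can use this query to retrieve that information:

~~~
prometheus_target_interval_length_seconds{quantile="0.99"}
~~~

To count the number of returned time series, you could write:

~~~
count(prometheus_target_interval_length_seconds)
~~~

For more about the expression language, see the [expression language documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/).

## **Using the graphing interface**

To graph expressions, use the "Graph" tab. For example, enter the following expression to graph the per-second rate of chunks being created in the self-scraped Prometheus:

~~~
rate(prometheus_tsdb_head_chunks_created_total[1m])
~~~

Experiment with the graph range parameters and other settings.

## **Starting up some sample targets**

Let's make this more interesting and start some sample targets for Prometheus to scrape. The Go client library includes an example that exports fictional RPC latencies for three services with different latency distributions. Make sure you have the Go compiler installed and a working Go build environment set up (with a correct GOPATH). Download the Go client library for Prometheus and run three of these example processes:

~~~
# Download and compile.
git clone https://github.com/prometheus/client_golang.git
cd client_golang/examples/random
go get -d
go build

# Start the following 3 example targets in separate terminals.
./random -listen-address=:8080
./random -listen-address=:8081
./random -listen-address=:8082
~~~

You should now have example targets listening on `http://localhost:8080/metrics`, `http://localhost:8081/metrics`, and `http://localhost:8082/metrics`.

## **Configuring Prometheus to monitor the sample targets**

Now we will configure Prometheus to scrape these new targets. Let's group all three endpoints into one job called `example-random`. However, imagine that the first two endpoints are production targets, while the third one represents a canary instance. To model this in Prometheus, we can add several groups of endpoints to a single job and attach extra labels to each group of targets. In this example, we add the `group="production"` label to the first group of targets and `group="canary"` to the second. To do this, add the following job definition to the `scrape_configs` section of your `prometheus.yml` and restart your Prometheus instance:

~~~
scrape_configs:
  - job_name: 'example-random'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'canary'
~~~

Go to the expression browser and verify that Prometheus now has information about the time series these example endpoints expose, such as `rpc_durations_seconds`.

## **Configure rules for aggregating scraped data into new time series**

Though it is not a problem in our example, queries that aggregate over thousands of time series can get slow when computed ad hoc. To make this more efficient, Prometheus lets you prerecord expressions into completely new persisted time series via configured recording rules. Suppose we are interested in the per-second rate of example RPCs (`rpc_durations_seconds_count`), measured over a window of 5 minutes and averaged over all instances while preserving the `job` and `service` dimensions. We could write this as:

~~~
avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
~~~

Try graphing this expression. To record the time series resulting from this expression into a new metric called `job_service:rpc_durations_seconds_count:avg_rate5m`, create a file with the following recording rule and save it as `prometheus.rules.yml`:

~~~
groups:
  - name: example
    rules:
      - record: job_service:rpc_durations_seconds_count:avg_rate5m
        expr: avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
~~~

To make Prometheus pick up this new rule, add a `rule_files` statement to your `prometheus.yml`. The configuration should now look like this:

~~~
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.

  # Attach these extra labels to all timeseries collected by this Prometheus instance.
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - 'prometheus.rules.yml'

scrape_configs:
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'example-random'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'canary'
~~~

Restart Prometheus with the new configuration, then query or graph it in the expression browser to verify that a new time series with the metric name `job_service:rpc_durations_seconds_count:avg_rate5m` is now available.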
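
The same check can also be scripted rather than done in the expression browser. The following is a minimal sketch, not part of the original tutorial: it assumes Prometheus is reachable at `http://localhost:9090` as configured above and uses the server's HTTP API instant-query endpoint (`/api/v1/query`). The helper names `build_query_url`, `parse_instant_vector`, and `fetch_series` are hypothetical and chosen only for illustration.

```python
"""Sketch: check that the recorded series is queryable over the HTTP API."""
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus listens on its default port, as in this guide.
PROMETHEUS_URL = "http://localhost:9090"
RECORDED_METRIC = "job_service:rpc_durations_seconds_count:avg_rate5m"


def build_query_url(base_url, expr):
    """Return the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_instant_vector(payload):
    """Turn an instant-query response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error")))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got " + data["resultType"])
    # Each sample looks like {"metric": {labels}, "value": [unix_ts, "value"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def fetch_series(base_url, expr, timeout=5.0):
    """Run the query against a live server (Prometheus must be running)."""
    url = build_query_url(base_url, expr)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


if __name__ == "__main__":
    series = fetch_series(PROMETHEUS_URL, RECORDED_METRIC)
    if not series:
        print("no series yet; wait for a rule evaluation and retry")
    for labels, value in series:
        print(labels.get("job"), labels.get("service"), value)
```

An empty result right after a restart is expected: with `evaluation_interval: 15s` and a 5-minute rate window, the recorded series only appears once the rule has been evaluated against enough scraped samples.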