Installing Prometheus · Kubernetes

[TOC]

# Introduction

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It is now a standalone open-source project and joined the CNCF in 2016 as the second hosted project after Kubernetes. Compared with traditional monitoring tools, Prometheus has the following main features:

* A multi-dimensional data model, with time series identified by a metric name and key/value pairs
* A flexible query language
* No reliance on distributed storage; it depends only on local disk
* Time series are collected by pulling over HTTP
* Pushing time series is also supported
* Targets are found through service discovery or static configuration
* Multiple modes of graphing and dashboard support

Prometheus consists of several components, some of which are optional:

* `Prometheus Server`: scrapes metrics and stores time series data
* `exporter`: exposes metrics for scrape jobs to collect
* `pushgateway`: a gateway that metrics are pushed to
* `alertmanager`: the component that handles alerts
* `adhoc`: used for querying data

Most Prometheus components are written in Go, so they are easy to build and deploy as static binaries. The diagram below, from the official Prometheus documentation, shows the architecture and some related ecosystem components:

![Prometheus architecture diagram](https://prometheus.io/assets/architecture.png)

The overall flow is fairly simple. Prometheus scrapes metrics from targets directly, or from an intermediary Pushgateway that jobs push their metrics to. It stores all scraped data locally and runs rules over it to produce aggregated data or alerts. Grafana or other tools are then used to visualize the data.

# Installing Prometheus

## RBAC permissions

```yaml
cat <<'EOF' | kubectl apply -f -
# Create the cluster-wide permissions
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  - nodes/metrics
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  - networking.k8s.io
  resources:
  - ingresses
  - ingresses/status
  verbs:
  - get
  - list
  - watch
- nonResourceURLs:
  - /metrics
  verbs:
  - get
# Create the ServiceAccount
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
# Bind the ServiceAccount to the ClusterRole
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system
EOF
```

## Configuration files

```yaml
cat <<'EOF' | kubectl apply -f -
# Main configuration file
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: kube-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    rule_files:
    - /etc/prometheus/rule/*.rules
    scrape_config_files:
    - /etc/prometheus/target/*.targets
# Scrape target configuration
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-target
  namespace: kube-system
data:
  prometheus.targets: |
    scrape_configs:
    - job_name: 'prometheus'
      # Metrics path; Prometheus is served under a path prefix, so the prefix must be included
      metrics_path: /prometheus/metrics
      static_configs:
      - targets: ['localhost:9090']
# Rule configuration
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rule
  namespace: kube-system
data:
EOF
```

## Creating Prometheus

```yaml
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    app: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      # Init container that works around the "lock DB directory" error at startup
      initContainers:
      - name: prom-prefix
        image: jiaxzeng/client:v1.1
        command:
        - "bash"
        - "-c"
        # Use the explicit owner:group form; the trailing-dot form ("65534.") is deprecated
        - "chown -R 65534:65534 /prometheus && rm -f /prometheus/data/lock"
        volumeMounts:
        - mountPath: "/prometheus/data"
          name: data
      containers:
      - image: prom/prometheus:v2.45.4
        name: prometheus
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--web.console.libraries=/usr/share/prometheus/console_libraries"
        - "--web.console.templates=/usr/share/prometheus/consoles"
        # How long to retain monitoring data
        - "--storage.tsdb.retention.time=24h"
        # Enable the admin HTTP API, which includes operations such as deleting time series
        - "--web.enable-admin-api"
        # Enable hot reload; an HTTP POST to localhost:9090/prometheus/-/reload takes effect immediately
        - "--web.enable-lifecycle"
        # Serve under a path prefix (the default is /); the health checks must be changed to match
        # [Note] This flag affects the health checks and the path Prometheus scrapes its own metrics from
        - "--web.external-url=/prometheus"
        ports:
        - containerPort: 9090
          name: http
        startupProbe:
          httpGet:
            path: /prometheus/-/healthy
            port: 9090
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          failureThreshold: 24
        livenessProbe:
          httpGet:
            path: /prometheus/-/healthy
            port: 9090
          periodSeconds: 5
          failureThreshold: 3
          successThreshold: 1
        readinessProbe:
          httpGet:
            path: /prometheus/-/ready
            port: 9090
          periodSeconds: 5
          failureThreshold: 3
          successThreshold: 1
        volumeMounts:
        - mountPath: "/prometheus/data"
          name: data
        - mountPath: "/etc/prometheus"
          name: config
        - mountPath: "/etc/prometheus/target"
          name: target
        - mountPath: "/etc/prometheus/rule"
          name: rule
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
          limits:
            cpu: 2
            memory: 2048Mi
      # ServiceAccount used to access cluster resources
      serviceAccountName: prometheus
      nodeSelector:
        kubernetes.io/node: monitor
      volumes:
      - name: data
        hostPath:
          path: /data/prometheus/
      - configMap:
          name: prometheus
        name: config
      - configMap:
          name: prometheus-target
        name: target
      - configMap:
          name: prometheus-rule
        name: rule
EOF
```

For performance and data persistence, the data is persisted with a hostPath volume: the host directory `/data/prometheus` is mounted at `/prometheus/data` in the container. The manifest does not set `--storage.tsdb.path`, so Prometheus writes to its default `data/` directory, which resolves to `/prometheus/data` under the image's working directory. To keep the Pod from drifting to another node, `nodeSelector` pins it to a node that carries the `kubernetes.io/node=monitor` label. If no node has this label yet, add it to your target node:

```shell
$ kubectl label node <k8s_name> kubernetes.io/node=monitor
```

## Creating the Service

```yaml
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    app: prometheus
spec:
  type: ClusterIP
  selector:
    app: prometheus
  ports:
  - name: web
    port: 9090
    targetPort: http
EOF
```

## Configuring the Ingress

```shell
cat <<EOF | sudo tee ingress.yml > /dev/null
# extensions/v1beta1 was removed in Kubernetes 1.22; newer clusters need networking.k8s.io/v1
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus
  namespace: kube-system
spec:
  ingressClassName: nginx
  rules:
  - host: www.ecloud.com
    http:
      paths:
      - path: /prometheus
        backend:
          serviceName: prometheus
          servicePort: 9090
EOF

# Apply the manifest that was just written
kubectl apply -f ingress.yml
```

# Verification

![Prometheus UI](https://img.kancloud.cn/b8/d4/b8d49b5f5c72559000b359bd0da09292_1920x584.png)

> If a warning about unsynchronized time appears, for example:
> Warning: Error fetching server time: Detected 32.164000034332275 seconds time difference between your browser and the server. Prometheus relies on accurate time and time drift might cause unexpected query results.

Solution: this is usually caused by the server clock and the client clock being out of sync. The server synchronizes its time with Alibaba Cloud, so configuring the client to synchronize with Alibaba Cloud as well resolves it.

![Prometheus time out of sync 1](https://img.kancloud.cn/bf/8d/bf8db522e4012fb8548fd6efb00a1e18_1280x1000.png)
![Prometheus time out of sync 2](https://img.kancloud.cn/1b/37/1b377a921abaeee2f61866d3c7479c9c_1280x1000.png)
![Prometheus time out of sync 3](https://img.kancloud.cn/d0/c9/d0c96775ea8458c2e013baa25cf3feef_577x637.png)
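
The time-drift warning compares the browser clock with the Prometheus server clock. The same comparison can be approximated from a terminal with the sketch below. It assumes `curl` and `jq` are installed on the client and that Prometheus is reachable under the `/prometheus` prefix configured earlier. The helper names `abs_drift` and `check_drift` are made up for illustration and are not part of Prometheus.

```shell
# abs_drift SERVER_TS CLIENT_TS
# Print the absolute difference between two Unix timestamps in whole seconds.
# Fractional parts (for example 1700000032.164) are truncated before comparing.
abs_drift() {
  server_s=${1%.*}
  client_s=${2%.*}
  diff=$(( server_s - client_s ))
  echo "${diff#-}"
}

# check_drift BASE_URL
# Ask Prometheus to evaluate time() and compare the evaluation timestamp
# with the local clock. Assumes jq is available to parse the JSON response.
check_drift() {
  base_url=$1
  server_ts=$(curl -fsS "${base_url}/api/v1/query?query=time()" | jq -r '.data.result[0]')
  [ -n "$server_ts" ] || return 1
  abs_drift "$server_ts" "$(date +%s)"
}

# Example call, using the Ingress host from this guide:
# check_drift http://www.ecloud.com/prometheus
```

A result of a few seconds or less suggests the two clocks agree. A value like the 32 seconds in the warning points to the client or the server needing time synchronization.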