[TOC]

# Introduction

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It is a standalone open-source project that joined the CNCF in 2016 as the second hosted project after Kubernetes. Compared with traditional monitoring tools, Prometheus has the following main features:

* A multi-dimensional data model, with time series identified by metric name and key/value pairs
* A flexible query language (PromQL)
* No reliance on distributed storage; each server depends only on its local disk
* Time series collection through a pull model over HTTP
* Pushing time series is also supported, through an intermediary gateway
* Targets are found through service discovery or static configuration
* Multiple modes of graphing and dashboarding support

Prometheus consists of several components, some of which are optional:

* `Prometheus Server`: scrapes metrics and stores time series data
* `exporter`: exposes metrics for scrape jobs to collect
* `pushgateway`: a gateway to which metrics are pushed
* `alertmanager`: the component that handles alerts
* `adhoc`: used for ad hoc data queries

Most Prometheus components are written in Go, so they are easy to build and deploy as static binaries. The diagram below shows the architecture published by the Prometheus project, together with some related ecosystem components:

![Prometheus architecture diagram](https://prometheus.io/assets/architecture.png)

The overall flow is fairly simple. Prometheus scrapes metrics either directly from targets or through an intermediate Pushgateway, stores all scraped samples locally, and runs rules over that data to produce aggregated series or alerts. Grafana or other tools are then used to visualize the data.

# Installing Prometheus

## RBAC permissions

```yaml
cat <<'EOF' | kubectl apply -f -
# Cluster-wide permissions
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  - nodes/metrics
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  - networking.k8s.io
  resources:
  - ingresses
  - ingresses/status
  verbs:
  - get
  - list
  - watch
- nonResourceURLs:
  - /metrics
  verbs:
  - get
# ServiceAccount
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
# Bind the ServiceAccount to the ClusterRole
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system
EOF
```

## Configuration files

```yaml
cat <<'EOF' | kubectl apply -f -
# Main configuration file
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: kube-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s

    rule_files:
    - /etc/prometheus/rule/*.rules

    scrape_config_files:
    - /etc/prometheus/target/*.targets
# Scrape target configuration
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-target
  namespace: kube-system
data:
  prometheus.targets: |
    scrape_configs:
    - job_name: 'prometheus'
      # Metrics path; it must include the path prefix when Prometheus is served under one
      metrics_path: /prometheus/metrics
      static_configs:
      - targets: ['localhost:9090']
# Rule configuration (no rules defined yet)
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rule
  namespace: kube-system
data: {}
EOF
```

## Creating Prometheus

```yaml
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    app: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      # Init container: fixes ownership of the data directory and removes a stale
      # lock file, which avoids the "lock DB directory" error at startup
      initContainers:
      - name: prom-prefix
        image: jiaxzeng/client:v1.1
        command:
        - "bash"
        - "-c"
        - "chown -R 65534:65534 /prometheus && rm -f /prometheus/data/lock"
        volumeMounts:
        - mountPath: "/prometheus/data"
          name: data
      containers:
      - image: prom/prometheus:v2.45.4
        name: prometheus
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        # Data directory; matches the data volume mounted below
        - "--storage.tsdb.path=/prometheus/data"
        - "--web.console.libraries=/usr/share/prometheus/console_libraries"
        - "--web.console.templates=/usr/share/prometheus/consoles"
        # Retention period for monitoring data
        - "--storage.tsdb.retention.time=24h"
        # Enables the admin HTTP API, which includes operations such as deleting time series
        - "--web.enable-admin-api"
        # Enables hot reload: a POST to localhost:9090/prometheus/-/reload takes effect immediately
        - "--web.enable-lifecycle"
        # Serves Prometheus under a path prefix (default: /)
        # NOTE: this affects the health-check paths and the metrics path used to scrape Prometheus itself
        - "--web.external-url=/prometheus"
        ports:
        - containerPort: 9090
          name: http
        startupProbe:
          httpGet:
            path: /prometheus/-/healthy
            port: 9090
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          failureThreshold: 24
        livenessProbe:
          httpGet:
            path: /prometheus/-/healthy
            port: 9090
          periodSeconds: 5
          failureThreshold: 3
          successThreshold: 1
        readinessProbe:
          httpGet:
            path: /prometheus/-/ready
            port: 9090
          periodSeconds: 5
          failureThreshold: 3
          successThreshold: 1
        volumeMounts:
        - mountPath: "/prometheus/data"
          name: data
        - mountPath: "/etc/prometheus"
          name: config
        - mountPath: "/etc/prometheus/target"
          name: target
        - mountPath: "/etc/prometheus/rule"
          name: rule
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
          limits:
            cpu: 2
            memory: 2048Mi
      # Service account used to access cluster resources
      serviceAccountName: prometheus
      nodeSelector:
        kubernetes.io/node: monitor
      volumes:
      - name: data
        hostPath:
          path: /data/prometheus/
      - configMap:
          name: prometheus
        name: config
      - configMap:
          name: prometheus-target
        name: target
      - configMap:
          name: prometheus-rule
        name: rule
EOF
```

For performance and data persistence, Prometheus data is stored on a `hostPath` volume: `--storage.tsdb.path=/prometheus/data` sets the data directory inside the container, and that directory is mounted from the host directory `/data/prometheus/`. Because the data lives on one node, the Pod must not drift to another node, so `nodeSelector` pins it to a node labeled `kubernetes.io/node=monitor`. If no node carries that label yet, add it to your target node:

```shell
$ kubectl label node <k8s_name> kubernetes.io/node=monitor
```

## Creating the Service

```yaml
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    app: prometheus
spec:
  type: ClusterIP
  selector:
    app: prometheus
  ports:
  - name: web
    port: 9090
    targetPort: http
EOF
```

## Configuring the Ingress

The manifest below uses the `networking.k8s.io/v1` Ingress API, because `extensions/v1beta1` is no longer served as of Kubernetes 1.22. On clusters older than 1.19, keep `apiVersion: extensions/v1beta1` with the `serviceName`/`servicePort` backend fields instead.

```shell
cat <<'EOF' > ingress.yml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus
  namespace: kube-system
spec:
  ingressClassName: nginx
  rules:
  - host: www.ecloud.com
    http:
      paths:
      - path: /prometheus
        pathType: Prefix
        backend:
          service:
            name: prometheus
            port:
              number: 9090
EOF
kubectl apply -f ingress.yml
```

# Verification

![Prometheus UI](https://img.kancloud.cn/b8/d4/b8d49b5f5c72559000b359bd0da09292_1920x584.png)

> If the UI warns that the clocks are out of sync, for example:
>
> Warning: Error fetching server time: Detected 32.164000034332275 seconds time difference between your browser and the server. Prometheus relies on accurate time and time drift might cause unexpected query results.
>
> Fix: this usually means the server clock and the client clock are not synchronized. The server here syncs its time from Alibaba Cloud, so pointing the client at the same Alibaba Cloud time source resolves the warning.

![Prometheus time out of sync 1](https://img.kancloud.cn/bf/8d/bf8db522e4012fb8548fd6efb00a1e18_1280x1000.png)
![Prometheus time out of sync 2](https://img.kancloud.cn/1b/37/1b377a921abaeee2f61866d3c7479c9c_1280x1000.png)
![Prometheus time out of sync 3](https://img.kancloud.cn/d0/c9/d0c96775ea8458c2e013baa25cf3feef_577x637.png)
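
To confirm the clock drift from a terminal rather than from the browser warning, the following is a minimal sketch, not part of the original setup. It assumes `curl`, `jq`, and `awk` are installed on the client and that Prometheus is reachable under the `/prometheus` prefix configured above. The helper names `drift_seconds` and `prometheus_time` are hypothetical. The server time is read by evaluating the PromQL `time()` function through the `/api/v1/query` HTTP API.

```shell
#!/usr/bin/env bash

# Hypothetical helper: absolute difference between two Unix timestamps, in seconds.
drift_seconds() {
  awk -v a="$1" -v b="$2" 'BEGIN { d = a - b; if (d < 0) d = -d; printf "%.3f\n", d }'
}

# Hypothetical helper: ask a Prometheus server for its current time.
# $1 is the base URL including the path prefix, e.g. http://www.ecloud.com/prometheus
# The query time() returns a scalar result of the form [<unix_ts>, "<value>"].
prometheus_time() {
  curl -fsS -G "$1/api/v1/query" --data-urlencode 'query=time()' \
    | jq -r '.data.result[0]'
}

# Example (requires network access to the server, so it is left commented out):
# drift_seconds "$(prometheus_time http://www.ecloud.com/prometheus)" "$(date +%s)"
```

A result of more than a few seconds suggests the client and server use different time sources, which matches the situation behind the warning shown above.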