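**Installation**
When Prometheus runs inside a container, its data is not persistent.
When node-exporter runs inside a container and collects data about the physical node, the data is inaccurate.
We therefore use federation: one Prometheus server runs inside the container and scrapes the data there, and a second Prometheus server runs outside and scrapes both the physical nodes and the data already collected by the in-container Prometheus.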
![pastedGraphic.png](blob:https://www.kancloud.cn/8254d902-7513-43b0-941d-74c6bfa174fd)
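The in-container installation is not covered here; only the external setup is described.
The installation packages can be downloaded from prometheus.io: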
node-exporter
Prometheus
Alertmanager
**Installing node-exporter**
Install node-exporter on every node that needs to be monitored, placing it under /usr/local.
\# tar -xf node\_exporter-0.16.0.linux-amd64.tar.gz
\# mv node\_exporter-0.16.0.linux-amd64 node-exporter
\# cd node-exporter
\# ./node\_exporter &
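
Once node\_exporter is running, its `/metrics` page on port 9100 (the port used in the scrape configuration on this page) can be fetched to confirm that it is serving data. The following is a minimal Python sketch of such a check, not part of the original setup. The helper names `parse_samples` and `fetch_metrics` are hypothetical, and the parser is simplified: it assumes sample lines carry no timestamps.

```python
import urllib.request


def parse_samples(text):
    """Parse Prometheus text-format lines into (series, value) pairs.

    Simplified: skips comments and blank lines and assumes no timestamps.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples.append((series, float(value)))
    return samples


def fetch_metrics(host="localhost", port=9100, timeout=5):
    """Fetch the raw /metrics page from a node_exporter instance."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

On a monitored node, `parse_samples(fetch_metrics())` should return a non-empty list that includes `node_cpu_seconds_total` series.
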
**Installing Prometheus**
\# tar -xf prometheus-2.4.2.linux-amd64.tar.gz
\# mv prometheus-2.4.2.linux-amd64 prometheus
\# cd prometheus
Add jobs that scrape the external targets, and a federation job that pulls the data from the internal Prometheus.
\# vim prometheus.yml

    global:
      scrape_interval: 15s      # scrape targets every 15 seconds
      evaluation_interval: 15s  # evaluate rules every 15 seconds
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["localhost:9093"]
    rule_files:
      - "rule/*.yml"            # alerting rule files
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'node-exporter'
        static_configs:
          - targets: ['192.168.11.212:9100',
                      '192.168.11.213:9100',
                      '192.168.11.214:9100',
                      '192.168.11.215:9100',
                      '192.168.11.216:9100']
      - job_name: 'federate'
        scrape_interval: 15s
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - '{job=~"kubernetes.*"}'
        static_configs:
          - targets:
            - 'prometheus.pkbeta.com'
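
The `federate` job asks the in-container Prometheus for every series matching `{job=~"kubernetes.*"}` by requesting its `/federate` endpoint with a repeated `match[]` query parameter. As a rough sketch, the same URL can be rebuilt in Python and then opened in a browser or with curl to see what the inner server would return. The helper name `federate_url` is hypothetical, and `http` is assumed as the scheme because the job does not set one.

```python
from urllib.parse import urlencode


def federate_url(target, matchers, scheme="http"):
    """Build the /federate URL that a federation scrape job requests.

    Each matcher becomes one repeated match[] query parameter.
    """
    query = urlencode([("match[]", m) for m in matchers])
    return f"{scheme}://{target}/federate?{query}"


url = federate_url("prometheus.pkbeta.com", ['{job=~"kubernetes.*"}'])
print(url)
```

If the printed URL returns no series, the `match[]` selector probably does not match any job name on the inner server.
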
**Installing Alertmanager**
\# tar -xf alertmanager-0.15.2.linux-amd64.tar.gz
\# mv alertmanager-0.15.2.linux-amd64 alertmanager
\# cd alertmanager
\# vim alertmanager.yml

    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.163.com:25'   # a 163.com mailbox is used here
      smtp_from: 'XXXXX@163.com'
      smtp_auth_username: 'XXXXX@163.com'
      smtp_auth_password: 'XXXXX'
      smtp_require_tls: false
    route:
      group_by: ['NODE']
      group_wait: 10s       # wait before sending the first notification for a new group
      group_interval: 10s   # wait before notifying about changes to an existing group
      repeat_interval: 1h   # wait before re-sending a notification that is still firing
      receiver: 'node'
    receivers:
      - name: 'node'
        email_configs:
          - to: 'XXXXX@163.com'
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
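
The inhibit rule mutes notifications for `warning` alerts while a `critical` alert with the same `alertname`, `dev` and `instance` labels is firing. The sketch below models that matching logic in Python to make the behaviour concrete. It is a simplified model, not Alertmanager's own code: the function names and the sample alerts are hypothetical, and a label missing on both sides is treated as an equal (empty) value.

```python
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}


def matches(labels, wanted):
    """True if every wanted label has exactly the wanted value."""
    return all(labels.get(k, "") == v for k, v in wanted.items())


def is_inhibited(target, firing, rule=RULE):
    """Return True if some firing source alert mutes the target alert."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if not matches(source, rule["source_match"]):
            continue
        # A missing label is treated like an empty value on both sides.
        if all(source.get(k, "") == target.get(k, "") for k in rule["equal"]):
            return True
    return False


# Hypothetical alerts for illustration.
critical = {"alertname": "NodeCPUUsage", "instance": "192.168.11.212:9100", "severity": "critical"}
warning = {"alertname": "NodeCPUUsage", "instance": "192.168.11.212:9100", "severity": "warning"}
other = {"alertname": "NodeCPUUsage", "instance": "192.168.11.213:9100", "severity": "warning"}

print(is_inhibited(warning, [critical]))  # True: same alertname and instance
print(is_inhibited(other, [critical]))    # False: different instance
```

The alert rule defined on this page carries `severity: page`, so this inhibit rule only has an effect once rules labelled `critical` and `warning` exist.
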
Start Alertmanager:
\# ./alertmanager &
**Writing Prometheus alerting rules**
\# mkdir -p prometheus/rule
\# cd prometheus/rule
\# vim test.yml

    groups:
      - name: NODE                # group name
        rules:
          - alert: NodeCPUUsage   # alert name (CPU usage above 75%)
            # alert condition
            expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 75
            for: 1m               # fire once the condition has held for 1 minute
            labels:
              severity: page
            annotations:          # text included in the notification
              summary: "{{$labels.instance}}: High CPU usage detected"
              description: "{{$labels.instance}}: CPU usage is above 75% (current value is: {{ $value }})"
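
The rule expression takes the per-second growth of each CPU's idle counter (`irate` uses the two most recent samples in the window), averages it over all CPUs of an instance, and subtracts the resulting idle percentage from 100. The arithmetic can be sketched in Python as follows. The sample values are made up for illustration, the function names are hypothetical, and counter resets are ignored.

```python
def irate(prev, last):
    """Per-second increase between the two most recent (timestamp, value) samples."""
    (t0, v0), (t1, v1) = prev, last
    return (v1 - v0) / (t1 - t0)


def cpu_usage_percent(idle_samples_by_cpu):
    """100 - avg(irate(idle seconds)) * 100, averaged over all CPUs of one instance."""
    rates = [irate(prev, last) for prev, last in idle_samples_by_cpu.values()]
    return 100 - (sum(rates) / len(rates)) * 100


# Hypothetical idle-counter samples taken 15 s apart on a two-CPU node.
samples = {
    "cpu0": ((0, 1000.0), (15, 1003.0)),  # 3 s idle in 15 s -> 20% idle
    "cpu1": ((0, 2000.0), (15, 2001.5)),  # 1.5 s idle in 15 s -> 10% idle
}

usage = cpu_usage_percent(samples)
print(round(usage, 1), usage > 75)  # 85.0 True
```

With these numbers the node is 85% busy, so the condition `> 75` holds; the alert fires only if it keeps holding for the `for: 1m` period.
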
Start Prometheus (run it from the directory that contains prometheus.yml):
\# ./prometheus
Open Prometheus in a browser; it listens on port 9090 by default.
![pastedGraphic_1.png](blob:https://www.kancloud.cn/d48377de-b88a-46b1-8fa7-46900dc68c64)
![pastedGraphic_2.png](blob:https://www.kancloud.cn/d3cb2b60-3ca6-475c-ad82-1a5e0dc909d3)
![pastedGraphic_3.png](blob:https://www.kancloud.cn/e86ad604-9c80-40c7-a9fd-ca3431076f02)