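White-box metrics: the metrics that Prometheus scrapes through exporters are generally white-box metrics.
Black-box metrics: metrics obtained by directly simulating user access, to verify that a service is visible and usable from the outside.
The distinction mirrors white-box versus black-box testing: black-box testing looks only at the result, not at the internal implementation. Likewise, black-box monitoring checks only whether the returned result matches what is expected, and does not care how the service arrives at that result internally.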
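Black-box monitoring and white-box monitoring:
- Black-box monitoring focuses on real-time state, usually events that are happening right now, such as a website being unreachable or a disk refusing writes. Its main purpose is to alert on failures in progress. Common black-box checks include HTTP probes and TCP probes, which test whether a site or service is reachable and how quickly it responds.
- White-box monitoring focuses on causes, that is, runtime metrics from inside the system, such as nginx response time or storage I/O load.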
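blackbox-exporter is the official Prometheus solution for black-box monitoring. It can probe target instances over HTTP, HTTPS, DNS, ICMP, TCP, and gRPC. Typical use cases:
- HTTP/HTTPS: URL/API availability checks
- ICMP: host liveness checks
- TCP: port liveness checks
- DNS: domain name resolution checks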
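To make the idea of a black-box check concrete, here is a minimal sketch of a TCP port liveness probe in Python. It is not how blackbox-exporter is implemented; the function name `tcp_probe` and the returned keys are assumptions chosen to resemble the exporter's metric names.

```python
import socket
import time


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> dict:
    """Try to open a TCP connection and report success and elapsed time."""
    start = time.monotonic()
    try:
        # Only the outside-visible result matters: can we connect or not?
        with socket.create_connection((host, port), timeout=timeout):
            success = 1
    except OSError:
        success = 0
    return {
        "probe_success": success,
        "probe_duration_seconds": time.monotonic() - start,
    }


if __name__ == "__main__":
    # Hypothetical target: replace with a host and port you are allowed to probe.
    print(tcp_probe("127.0.0.1", 9115))
```

The probe knows nothing about the service behind the port, which is what makes it a black-box check.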
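A monitoring system should support both white-box and black-box monitoring effectively. White-box monitoring reveals the actual internal state of the system, and watching its metrics makes it possible to anticipate potential problems and handle them before they cause an outage. Black-box monitoring makes it possible to notify the responsible people quickly once a system or service has failed.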
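Deploying black-box monitoring
Reference: blackbox-exporter deployment steps
How it works: the server running blackbox-exporter sends out the checks defined in its configuration (websites, ports, and so on), and Prometheus scrapes the results for display. It is therefore usually deployed on the same server as Prometheus, but it can also run on a dedicated server, in which case all probe requests originate from that server.
Installing blackbox-exporter
Download link 1
Download link 2
Installing blackbox with a script
blackbox_exporter.sh is a one-step install script. Download the release archive beforehand, or let the script download it.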

    #!/bin/bash

    echo "download blackbox_exporter"
    sleep 2
    wget -N -P /root/ https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz

    echo "tar blackbox_exporter"
    sleep 2
    tar -zxvf /root/blackbox_exporter-0.24.0.linux-amd64.tar.gz -C /opt/ && mv /opt/blackbox_exporter-0.24.0.linux-amd64 /usr/local/blackbox_exporter

    echo "delete blackbox_exporter***tar.gz"
    rm -rf /root/blackbox_exporter-0.24.0.linux-amd64.tar.gz

    echo "firewall blackbox_exporter port 9115"
    sleep 2
    firewall-cmd --zone=public --add-port=9115/tcp --permanent && firewall-cmd --reload

    echo "add blackbox_exporter.service"
    sleep 2
    cat << EOF > /usr/lib/systemd/system/blackbox_exporter.service
    [Unit]
    Description=Prometheus Blackbox Exporter
    After=network.target

    [Service]
    Restart=on-failure
    ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml --web.listen-address=:9115 --timeout-offset=2

    [Install]
    WantedBy=multi-user.target
    EOF

    echo "start blackbox_exporter.service"
    sleep 2
    # enable --now both starts the service and enables it at boot
    systemctl daemon-reload && systemctl enable --now blackbox_exporter

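Installing blackbox from the release archive
Install blackbox-exporter manually. It listens on port 9115 by default (change it with --web.listen-address). It does not collect data on its own schedule: a probe runs each time Prometheus scrapes the /probe endpoint. The 0.5 in the defaults is the --timeout-offset value in seconds, which is subtracted from the scrape timeout to get the probe timeout. It is not a collection interval.
The default configuration file (/usr/local/blackbox_exporter/blackbox.yml) can be used as is.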

    # Download the release archive
    wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz

    # Unpack it into /usr/local/blackbox_exporter
    mkdir /usr/local/blackbox_exporter
    tar -zxvf blackbox_exporter-0.24.0.linux-amd64.tar.gz -C /usr/local/blackbox_exporter

    # Move the files up one level and remove the versioned directory
    cd /usr/local/blackbox_exporter
    cp blackbox_exporter-0.24.0.linux-amd64/* . && rm -rf blackbox_exporter-0.24.0.linux-amd64

    # Create the systemd unit file
    vi /usr/lib/systemd/system/blackbox_exporter.service

Contents of blackbox_exporter.service:

    [Unit]
    Description=Prometheus Blackbox Exporter
    After=network.target

    [Service]
    Restart=on-failure
    ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml --web.listen-address=:9115 --timeout-offset=2

    [Install]
    WantedBy=multi-user.target

Then reload systemd, start the service, check its status, and enable it at boot:

    systemctl daemon-reload
    systemctl start blackbox_exporter
    systemctl status blackbox_exporter
    systemctl enable blackbox_exporter

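The configuration file (/usr/local/blackbox_exporter/blackbox.yml) defines the modules and module parameters used when probing. Which targets to probe is defined in the Prometheus job configuration.
The default file ships with the following modules; add your own as needed:
- http_2xx: HTTP probe using GET
- http_post_2xx: HTTP probe using POST
- tcp_connect: TCP port probe
- pop3s_banner: TCP probe over TLS that expects a POP3 +OK banner
- grpc: gRPC probe over TLS
- grpc_plain: gRPC probe without TLS, for the service named service1
- ssh_banner: TCP probe that expects an SSH-2.0- banner
- irc_banner: TCP probe that performs an IRC handshake
- icmp: ICMP (ping) probe
- icmp_ttl5: ICMP probe with a TTL of 5 and a 5s timeout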

    modules:
      http_2xx:
        prober: http
        http:
          preferred_ip_protocol: "ip4"
      http_post_2xx:
        prober: http
        http:
          method: POST
      tcp_connect:
        prober: tcp
      pop3s_banner:
        prober: tcp
        tcp:
          query_response:
          - expect: "^+OK"
          tls: true
          tls_config:
            insecure_skip_verify: false
      grpc:
        prober: grpc
        grpc:
          tls: true
          preferred_ip_protocol: "ip4"
      grpc_plain:
        prober: grpc
        grpc:
          tls: false
          service: "service1"
      ssh_banner:
        prober: tcp
        tcp:
          query_response:
          - expect: "^SSH-2.0-"
          - send: "SSH-2.0-blackbox-ssh-check"
      irc_banner:
        prober: tcp
        tcp:
          query_response:
          - send: "NICK prober"
          - send: "USER prober prober prober :prober"
          - expect: "PING :([^ ]+)"
            send: "PONG ${1}"
          - expect: "^:[^ ]+ 001"
      icmp:
        prober: icmp
      icmp_ttl5:
        prober: icmp
        timeout: 5s
        icmp:
          ttl: 5

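The expect/send pairs in query_response can be hard to read at first, in particular the `${1}` placeholder in irc_banner. The sketch below imitates one such step in Python: match a received line against the expect pattern and fill the captured group into the reply. It is an illustration only, not the exporter's code, and the helper name `answer` is an assumption.

```python
import re


def answer(expect: str, send_template: str, line: str):
    """Return the reply for one expect/send step, or None if the line does not match."""
    match = re.search(expect, line)
    if match is None:
        return None
    # Replace ${1}, ${2}, ... with the corresponding capture group.
    return re.sub(
        r"\$\{(\d+)\}",
        lambda m: match.group(int(m.group(1))),
        send_template,
    )


if __name__ == "__main__":
    # The IRC server sends a PING with a token; the probe echoes the token back.
    print(answer(r"PING :([^ ]+)", "PONG ${1}", "PING :irc.example.net"))
```

With the sample input the reply is `PONG irc.example.net`: the token captured by `([^ ]+)` is substituted for `${1}`.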
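Running blackbox-exporter
Open http://10.11.8.108:9115 in a browser.
Adding it to Prometheus
Log in to the server running Prometheus, append the job configuration to the end of the Prometheus configuration file (prometheus.yml), and restart Prometheus.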

      - job_name: http-status
        metrics_path: /probe
        scrape_interval: 5s
        params:
          module: [http_2xx]

        static_configs:
          - targets:
              - http://www.baidu.com
            labels:
              group: web

        file_sd_configs:
          - files:
              - "./device/http_device.yml"
            refresh_interval: 5s

        relabel_configs:
          # Pass the original target to the exporter as ?target=...
          - source_labels: [__address__]
            target_label: __param_target
          # Show the probed target as the instance label
          - source_labels: [__param_target]
            target_label: instance
          # Send the scrape itself to blackbox-exporter
          - target_label: __address__
            replacement: 10.11.8.108:9115

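The three relabel_configs steps are the least obvious part of the job, so here is a small Python sketch that applies them to one target by hand. It only imitates the effect of these three rules on a label set. The function name `relabel` is an assumption, and real Prometheus relabeling supports regexes and actions that are not modelled here.

```python
def relabel(target: str, exporter: str = "10.11.8.108:9115") -> dict:
    """Apply the three relabel steps of the probe job to one target."""
    labels = {"__address__": target}
    # Step 1: copy the original address into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: use the probed target as the instance label.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the scrape at blackbox-exporter instead of the target.
    labels["__address__"] = exporter
    return labels


if __name__ == "__main__":
    for name, value in relabel("http://www.baidu.com").items():
        print(f"{name} = {value}")
```

After relabeling, Prometheus contacts the exporter address, while the instance label still names the site being probed.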
http_2xx monitoring
This job checks whether http://www.baidu.com is reachable. A single blackbox-exporter can probe multiple target sites.
The resulting scrape URL for the target is http://10.11.8.108:9115/probe?module=http_2xx&target=http://www.baidu.com
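The scrape URL is simply the exporter address plus the module and target as query parameters. A short sketch that builds it with the Python standard library follows; the helper name `probe_url` is an assumption.

```python
from urllib.parse import urlencode


def probe_url(exporter: str, module: str, target: str) -> str:
    """Build the /probe URL that Prometheus requests for one target."""
    # safe=":/" keeps the target readable; a percent-encoded target also works.
    query = urlencode({"module": module, "target": target}, safe=":/")
    return f"http://{exporter}/probe?{query}"


if __name__ == "__main__":
    print(probe_url("10.11.8.108:9115", "http_2xx", "http://www.baidu.com"))
```

Opening the printed URL in a browser or with curl runs a single probe and returns its metrics, which is a convenient way to test a module before adding it to Prometheus.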
Static configuration
In the Prometheus configuration file:

    ······
      - job_name: 'Http_2xx_get'
        scrape_interval: 5s
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets: ['https://pig.yurun.com','https://mpig.yurun.com']
            labels:
              # Note: this instance value is overwritten by the relabel rule below
              instance: web_status
              group: 'web'
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 10.11.8.108:9115
    ······

Excerpt of the probe output:

    ·········
    probe_ip_protocol 4
    probe_success 1

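The probe output is plain text in the Prometheus exposition format, so a check such as "is probe_success 1?" is easy to script. The sketch below parses simple `name value` lines. It assumes no trailing timestamps, and the sample text is a shortened, made-up response built around the two lines shown above.

```python
def parse_metrics(text: str) -> dict:
    """Parse 'name value' lines from a /probe response, skipping comments."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the part after the last space; labels stay in the name.
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics


SAMPLE = """\
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
"""

if __name__ == "__main__":
    metrics = parse_metrics(SAMPLE)
    print("probe ok" if metrics["probe_success"] == 1 else "probe failed")
```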
File-based configuration
When a new node is added, only the corresponding yml or json file needs to be changed.

Edit /usr/local/prometheus/prometheus.yml:

    ······
      - job_name: 'Http_2xx_get'
        scrape_interval: 5s
        metrics_path: /probe
        file_sd_configs:
          - files:
              - "./device/http_device.yml"
            refresh_interval: 5s
        params:
          module: [http_2xx]
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          # Let the module label from the target file select the probe module
          - source_labels: [module]
            target_label: __param_module
          - target_label: __address__
            replacement: 10.11.8.108:9115
    ······

Edit /usr/local/prometheus/device/http_device.yml:

    - labels:
        group: web
        module: http_2xx
      targets:
        - 'https://pig.yurun.com'
        - 'https://mpig.yurun.com'

The same content in flow style:

    - labels:
        group: web
        module: http_2xx
      targets: ['https://pig.yurun.com','https://mpig.yurun.com']

Or as JSON in /usr/local/prometheus/device/http_device.json (reference this file in file_sd_configs instead of the yml file):

    [{
      "labels": {
        "group": "web",
        "module": "http_2xx"
      },
      "targets": [
        "https://pig.yurun.com",
        "https://mpig.yurun.com"
      ]
    }]

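Because the target file is plain JSON (or YAML), it can be generated from an inventory rather than edited by hand. The following is a minimal sketch using only the standard library; the helper names `file_sd_entry` and `render_file_sd` are assumptions, and the targets and labels are the ones used in this article.

```python
import json


def file_sd_entry(targets, **labels) -> dict:
    """Build one target group in the file_sd format: labels plus a target list."""
    return {"labels": labels, "targets": list(targets)}


def render_file_sd(entries) -> str:
    """Serialize target groups as the JSON document Prometheus reads."""
    return json.dumps(entries, indent=2)


if __name__ == "__main__":
    entry = file_sd_entry(
        ["https://pig.yurun.com", "https://mpig.yurun.com"],
        group="web",
        module="http_2xx",
    )
    # Write the output to the path referenced in file_sd_configs.
    print(render_file_sd([entry]))
```

Prometheus re-reads the file on its own, so writing a new version of the file is enough to add or remove targets.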
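Open the blackbox-exporter server at http://10.11.8.108:9115/ to see the recent probes and their results.
icmp monitoring
This job checks whether 112.4.152.2 and 36.152.156.108 respond to ping. A single blackbox-exporter probes multiple IPs.
The resulting scrape URL for a target is http://10.11.8.108:9115/probe?module=icmp&target=36.152.156.108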
Static configuration
In the Prometheus configuration file:

    ······
      - job_name: 'Icmp_ping'
        scrape_interval: 5s
        metrics_path: /probe
        params:
          module: [icmp]
        static_configs:
          - targets: ['112.4.152.2','36.152.156.108']
            labels:
              instance: icmp_status
              group: 'icmp'
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 10.11.8.108:9115
    ······

File-based configuration
When a new node is added, only the corresponding yml or json file needs to be changed.

Edit /usr/local/prometheus/prometheus.yml:

    ······
      - job_name: 'Icmp_ping'
        scrape_interval: 5s
        metrics_path: /probe
        file_sd_configs:
          - files:
              - "./device/icmp_device.yml"
            refresh_interval: 5s
        params:
          module:
            - icmp
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - source_labels: [__param_module]
            target_label: module
          - target_label: __address__
            replacement: 10.11.8.108:9115
    ······

Edit /usr/local/prometheus/device/icmp_device.yml:

    - labels:
        instance: icmp_status
        group: icmp
        module: icmp
      targets:
        - '112.4.152.2'
        - '36.152.156.108'

The same content in flow style:

    - labels:
        instance: icmp_status
        group: icmp
        module: icmp
      targets: ['112.4.152.2','36.152.156.108']

Or as JSON in /usr/local/prometheus/device/icmp_device.json:

    [{
      "labels": {
        "instance": "icmp_status",
        "group": "icmp",
        "module": "icmp"
      },
      "targets": [
        "112.4.152.2",
        "36.152.156.108"
      ]
    }]

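Open the blackbox-exporter server at http://10.11.8.108:9115/ to see the recent probes and their results.
tcp_connect monitoring
This job checks the ports 10.11.7.216:7190 and 10.11.7.216:52089. A single blackbox-exporter probes multiple ports.
The resulting scrape URL for a target is http://10.11.8.108:9115/probe?module=tcp_connect&target=10.11.7.216:7190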
Static configuration
In the Prometheus configuration file:

    ······
      - job_name: 'Tcp_connect'
        scrape_interval: 5s
        metrics_path: /probe
        params:
          module: [tcp_connect]
        static_configs:
          - targets: ['10.11.7.216:7190','10.11.7.216:52089']
            labels:
              instance: tcp_status
              group: 'tcp'
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 10.11.8.108:9115
    ······

File-based configuration
When a new node is added, only the corresponding yml or json file needs to be changed.

Edit /usr/local/prometheus/prometheus.yml:

    ······
      - job_name: 'Tcp_connect'
        scrape_interval: 5s
        metrics_path: /probe
        file_sd_configs:
          - files:
              - "./device/tcp_device.yml"
            refresh_interval: 5s
        params:
          module:
            - tcp_connect
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - source_labels: [__param_module]
            target_label: module
          - target_label: __address__
            replacement: 10.11.8.108:9115
    ······

Edit /usr/local/prometheus/device/tcp_device.yml:

    - labels:
        instance: tcp_status
        group: tcp
        module: tcp_connect
      targets:
        - '10.11.7.216:7190'
        - '10.11.7.216:52089'

The same content in flow style:

    - labels:
        instance: tcp_status
        group: tcp
        module: tcp_connect
      targets: ['10.11.7.216:7190','10.11.7.216:52089']

Or as JSON in /usr/local/prometheus/device/tcp_device.json:

    [{
      "labels": {
        "instance": "tcp_status",
        "group": "tcp",
        "module": "tcp_connect"
      },
      "targets": [
        "10.11.7.216:7190",
        "10.11.7.216:52089"
      ]
    }]

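Open the blackbox-exporter server at http://10.11.8.108:9115/ to see the recent probes and their results.
Grafana dashboards
Dashboard gallery
Recommended dashboard ID: 13659
Dashboard ID 11175: website status
Dashboard ID 12275
Dashboard ID 9965
Dashboard ID 13230: SSL certificate status
Adding alert rules
Add the relevant alert rules. Reference expressions: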

    # TCP port unreachable
    expr: probe_success{job="Tcp_connect"} == 0
    # ICMP target unreachable
    expr: probe_success{job="Icmp_ping"} == 0
    # HTTP status code other than 200
    expr: probe_http_status_code{job="Http_2xx_get"} != 200
    # SSL certificate expires in less than 30 days
    expr: probe_ssl_earliest_cert_expiry{job="Http_2xx_get"} - time() < 86400 * 30
    # Total HTTP request time above 3 seconds
    expr: sum(probe_http_duration_seconds) by (instance) > 3

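The certificate expression compares two Unix timestamps: probe_ssl_earliest_cert_expiry is the expiry time in seconds, time() is the current time, and 86400 is the number of seconds in a day. The same arithmetic in Python, as a sketch with assumed names:

```python
SECONDS_PER_DAY = 86400


def cert_expires_soon(expiry_ts: float, now_ts: float, days: int = 30) -> bool:
    """Mirror of: probe_ssl_earliest_cert_expiry - time() < 86400 * days."""
    return expiry_ts - now_ts < SECONDS_PER_DAY * days


if __name__ == "__main__":
    now = 1_700_000_000  # arbitrary fixed "current time" for the example
    print(cert_expires_soon(now + 29 * SECONDS_PER_DAY, now))
    print(cert_expires_soon(now + 31 * SECONDS_PER_DAY, now))
```

A certificate with 29 days left triggers the condition, one with 31 days left does not.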
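Write the alert rules. Reference: BlackBox-alert-rules.yml
Copying the rules verbatim is recommended, because YAML indentation matters. The configuration, including the rule files it lists under rule_files, can be validated from the command line: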
/usr/local/prometheus/promtool check config prometheus.yml

    groups:
    - name: blackbox-probe-alert
      rules:
      - alert: probeHttpStatus
        expr: probe_success{job="Icmp_ping"}==0
        for: 1m
        labels:
          severity: red
        annotations:
          summary: 'Node unreachable, please check promptly'
          description: '{{$labels.instance}} node unreachable, please check promptly'
          resolvetion: "{{$labels.instance}} node has recovered."

    - name: blackbox-port-alert
      rules:
      - alert: probeHttpPort
        expr: probe_success{job="Tcp_connect"}==0
        for: 1m
        labels:
          severity: red
        annotations:
          summary: 'Node port unreachable, please check promptly'
          description: '{{$labels.instance}} node port unreachable, please check promptly'
          resolvetion: "{{$labels.instance}} node port has recovered."

    - name: blackbox-http-alert
      rules:
      - alert: curlHttpStatus
        expr: probe_http_status_code{job="Http_2xx_get"} >=400 and probe_success{job="Http_2xx_get"}==0
        for: 1m
        labels:
          severity: red
        annotations:
          summary: 'Web endpoint returned an abnormal status code (>= 400)'
          description: '{{$labels.instance}} is not accessible, please check promptly; current status code is {{$value}}'
          resolvetion: "{{$labels.instance}} is accessible again."

    - name: blackbox-http-time-alert
      rules:
      - alert: curlHttpStatus
        expr: sum(probe_http_duration_seconds) by (instance) > 3
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: 'Web endpoint total request time above 3 seconds'
          description: '{{$labels.instance}} total request time above 3 seconds'
          resolvetion: "{{$labels.instance}} request time has recovered and is below 3 seconds."

    - name: blackbox-ssl_expiry
      rules:
      - alert: Ssl Cert Will Expire in 30 days
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Domain certificate is about to expire (instance {{ $labels.instance }})"
          description: "Domain certificate expires within 30 days \n LABELS: {{ $labels }}"
          resolvetion: "Domain certificate has been renewed."

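The `for: 1m` clause means an alert does not fire on the first failed probe: it stays pending until the expression has been true at every evaluation for at least one minute. The sketch below imitates that state machine for a rule of the form `probe_success == 0`. It is a simplification under stated assumptions (one sample per rule evaluation, no stale-data handling), and the function name `alert_states` is an assumption.

```python
def alert_states(samples, for_seconds=60):
    """Return (timestamp, state) pairs for a rule `probe_success == 0` with a for clause.

    samples: list of (timestamp_seconds, probe_success), one per rule evaluation.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for ts, value in samples:
        if value == 0:
            if active_since is None:
                active_since = ts
            # Fire only once the condition has held for the whole duration.
            state = "firing" if ts - active_since >= for_seconds else "pending"
        else:
            # Any successful probe resets the timer.
            active_since = None
            state = "inactive"
        states.append((ts, state))
    return states


if __name__ == "__main__":
    # Evaluations every 15 s: the probe fails from t=15 until it recovers at t=90.
    samples = [(0, 1), (15, 0), (30, 0), (45, 0), (60, 0), (75, 0), (90, 1)]
    for ts, state in alert_states(samples):
        print(ts, state)
```

In this run the alert is pending from t=15 to t=60, fires at t=75, and returns to inactive at t=90, so a brief blip shorter than one minute never pages anyone.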