k8s(九)、监控--Prometheus扩展篇(mysqld-exporter、服务发现、监控项、联邦、relabel)

前言

上一篇主要介绍prometheus基础告警相关,本篇再进行扩展,加入mysql监控,整理出一些监控告警表达式及配置文件,以及部署prometheus 2.0版之后支持的联邦job,多prometheus分支数据聚合,在多集群环境中很适用。

mysqld-exporter部署:

在公司的环境中,大部分的DB已迁移至K8S内运行,因此再以常见的节点二进制安装部署mysqld-exporter的方式不再适用。在本地k8s环境中,采取单pod包含多容器镜像的方式,为原本的DB pod实例多注入一个mysql-exporter的方式,对两者进行整合为一体部署,类似此前文章里istio的注入。
注入后的deployment文件的spec.template.spec.containers变更如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
spec:
containers:
- name: mydbtm
env:
- name: DB_NAME
value: mydbtm
image: registry.xxx.com:5000/mysql:56v6
imagePullPolicy: Always
livenessProbe:
failureThreshold: 3
initialDelaySeconds: 30
periodSeconds: 10
successThreshold: 1
tcpSocket:
port: 3306
timeoutSeconds: 30
ports:
- containerPort: 3306
protocol: TCP
resources:
limits:
memory: 4Gi
requests:
memory: 512Mi
terminationMessagePath: /dev/termination-log
volumeMounts:
- mountPath: /var/lib/mysql
name: datadir

- name: mysql-exporter
env:
- name: DATA_SOURCE_NAME
value: $username:$passwrod@(127.0.0.1:3306)/
image: prom/mysqld-exporter
imagePullPolicy: Always
name: mysql-exporter
ports:
- containerPort: 9104
protocol: TCP

额外多增加一个mysql-exporter的容器,并提供环境变量DATA_SOURCE_NAME,注意,pod运行起来后,需要保证DATA_SOURCE_NAME的账号有数据库的读写权限。否则权限问题会导致无法获取数据。

进入pod查看端口3306/9104是否监听正常:
这里写图片描述

部署完mysql实例以及mysqld-exporter后,需要在prometheus的配置文件中增加相应的job。因为容器内POD IP是可变的,为了保证连接的持续可靠性,这里为该deployment创建了一个svc,通过svc的cluster IP与prometheus进行绑定,这样就基本不用担心POD IP变化了。

首先创建svc,捆绑2个服务端口:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
apiVersion: v1
kind: Service
metadata:
labels:
app: mydbtm
name: mydbtm
namespace: default
spec:
ports:
- port: 3306
protocol: TCP
targetPort: 3306
name: mysql
- port: 9104
protocol: TCP
targetPort: 9104
name: mysql-exporter
selector:
app: mydbtm
sessionAffinity: None
type: ClusterIP

kubectl get svc 获取刚创建的svc 的cluster IP,绑定至prometheus的confimap:
在prometheus-configmap.yaml下添加一个job,targets对应的列表内填写刚获取到的cluster IP和9104端口,后续增加实例可直接在列表中添加元素:

1
2
3
4
5
6
7
8
9
10
11
- job_name: 'mysql performance'
scrape_interval: 1m
static_configs:
- targets:
['10.102.20.179:9104']
params:
collect[]:
- global_status
- perf_schema.tableiowaits
- perf_schema.indexiowaits
- perf_schema.tablelocks

更新配置文件后滚动更新pod加载配置文件:

1
kubectl patch deployment prometheus --patch '{"spec": {"template": {"metadata": {"annotations": {"update-time": "2018-06-25 17:50" }}}}}' -n kube-system

查看prometheus的target,可以看到mysql-exporter已经up了,如果有异常,检查pod志:
这里写图片描述

打开grafana查看mysql模板里的监控图:
这里写图片描述

服务发现

在k8s环境中,部署的应用可能是海量的,如果像上方一样手动在static_configs.targets中增加对象,然后再触发prometheus实例的更新,未免显得太过麻烦和笨重,好在promethues提供SD(service discovery)功能,支持基于k8s的自动发现服务,可以自动发现k8s的node/service/pod/endpoints/ingress这些类型的资源。

其实在第一部分部署时,已经指定了kubernetes-service的target job,但是可以看到在web /target 路径页面上没有看到任何相关的job,连kubernetes-service的标题都没有:
这里写图片描述

在查阅了官方文档后,发现其中的说明并不详尽,没有具体的配置样例,网上关于此类场景的文章也没有找到,为此,特意去github上提交了issue向项目member请教,最后得知了不生效的原因为标签不匹配。

github issue链接:https://github.com/prometheus/prometheus/issues/4326
官方文档配置说明链接:prometheus配置

作出如下修改后,配置生效,web上可见:
1.job:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
- job_name: 'kube_svc_mysql-exporter'
kubernetes_sd_configs:
- role: service
metrics_path: /metrics
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
# 匹配svc.metadata.annotation字段包含key/value(prometheus.io/probe: "true")
- source_labels: [__meta_kubernetes_service_port_name]
action: keep
regex: "mysql-exporter"
# 匹配svc.spec.ports.name 是否包含mysql-exporter
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
target_label: kubernetes_name

svc样例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
apiVersion: v1
kind: Service
metadata:
labels:
app: testdbtm
annotations:
# 必须包含这个kv以匹配上方job规则
prometheus.io/probe: true
name: testdbtm
namespace: default
spec:
ports:
- port: 3306
protocol: TCP
targetPort: 3306
name: mysql
- port: 9104
protocol: TCP
targetPort: 9104
# 必须包含这个name以匹配job规则
name: mysql-exporter
selector:
app: orderdealdbtm
sessionAffinity: None
type: ClusterIP

更新configmap以及prometheus实例后,即可在web /targets页面中查看到自动发现的包含mysql-exporter的svc,通过其集群FQDN域名+9104端口获取mysql-exporter metrics,实现自动发现。其他的各种应用,也可以类似定制。
这里写图片描述

监控项

经过这几天的验证,整理了一些基础的监控项表达式,根据exporter进行了分类,当然这还不够,需要根据环境持续补充,调整.注意,metric的名称在不同的版本有些许不同,我们的环境中有两套prometheus,其中一套对应的exporter版本较低,注意点已注释出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
targets:
cadvisor:
#包含pod(container)相关指标
url: http://$NODE_IP:10255

metric expr:
### 容器内存使用率超过limit值80%,部分pod没有limit限制,所以值为+Inf,需要排除
- container_memory_usage_bytes{kubernetes_container_name!=""} / container_spec_memory_limit_bytes{kubernetes_container_name!=""} *100 != +Inf > 80 #低版本
- container_memory_usage_bytes{container_name!=""} / container_spec_memory_limit_bytes{container_name!=""} *100 != +Inf > 80 #新版本

### 计算pod的内存使用情况,单位为MB,可以按需设定阈值,有两个表达式都可以实现,感觉第一个结果更清晰。
- sum (container_memory_working_set_bytes{image!="",name=~"^k8s_.*"}) by (pod_name) /1024 /1024 #低版本pod_name改成kubernetes_pod_name
- sort_desc(sum(container_memory_usage_bytes{image!=""}) by (io_kubernetes_container_name, image)) / 1024 / 1024 #通用


### 根据相同的pod_name来计算过去一分钟内pod 的 cpu使用率,metric名称版本不同也有些不一样:
- sum by (pod_name)( rate(container_cpu_usage_seconds_total{image!=""}[1m] ) ) * 100 > 70 #高版本
- sum by (kubernetes_pod_name)( rate(container_cpu_usage_seconds_total{image!=""}[1m] ) ) * 100 > 70 #低版本


### 计算pod的网络IO情况,单位为Mbps
# rx方向
- sort_desc(sum by (kubernetes_pod_name) (rate (container_network_receive_bytes_total{name!=""}[1m]) )) /1024 /1024 /60 * 8 > 100 #低版本
- sort_desc(sum by (pod_name) (rate (container_network_receive_bytes_total{name!=""}[1m]) )) /1024 /1024 /60 *8 > 100 #高版本
# tx方向
- sort_desc(sum by (kubernetes_pod_name) (rate (container_network_transmit_bytes_total{name!=""}[1m]) )) /1024 /1024 /60 * 8 > 100 #低版本
- sort_desc(sum by (pod_name) (rate (container_network_transmit_bytes_total{name!=""}[1m]) )) /1024 /1024 /60 *8 > 100 #高版本


node-exporter:
### 主要包含node相关的指标
# url: http://$NODE_IP:31672/metrics

metric expr:
### node磁盘使用率
- (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 80

### node内存使用率
- (node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes * 100 > 80


### 计算node cpu使用率,关于此项计算,参考了许多个模板的表达式,但感觉都不太精确,下面两个是模板里找到的:
# 这些模板里计算node cpu使用率的表达式,都是拿容器cpu使用情况除以节点cpu总数,这样计算出来的结果感觉有些偏差,因为node本身 \
# 还有一些额外开销,虽然在大集群中,node本身的开销相比较于这么多容器开销来说占比较小,但是还是不可忽视:
- sum(sum by (io_kubernetes_container_name)( rate(container_cpu_usage_seconds_total{image!=""}[1m] ) )) / count(node_cpu{mode="system"}) * 100
- sum (rate (container_cpu_usage_seconds_total{id="/"}[1m])) / sum (machine_cpu_cores) * 100

# 因此经过对集群cpu使用率的实际观察,我觉得下面这个表达式计算出的数值更为准确:
- (sum(rate(node_cpu_seconds_total[1m])) - sum(rate(node_cpu_seconds_total{mode="idle"}[1m]))) / sum(rate(node_cpu_seconds_total[1m])) * 100 > 80

# 在老版本的exporter中`node_cpu_seconds_total`这个metric值叫`node_cpu`,因此老版本使用这个表达式:
- (sum(rate(node_cpu[1m])) - sum(rate(node_cpu{mode="idle"}[1m]))) / sum(rate(node_cpu[1m])) * 100 > 80


mysql-exporter:
### 主要包含mysql相关指标
## url: http://$Mysql_IP:9104/metrics
# innodb各类型缓存缓存大小:
- InnoDB Buffer Pool Data size : mysql_global_status_innodb_page_size * on (instance) mysql_global_status_buffer_pool_pages{state="data"}
- InnoDB Log Buffer Size: mysql_global_variables_innodb_log_buffer_size
- InnoDB Additional Memory Pool Size: mysql_global_variables_innodb_additional_mem_pool_size
- InnoDB Dictionary Size: mysql_global_status_innodb_mem_dictionary
- Key Buffer Size: mysql_global_variables_key_buffer_size
- Query Cache Size: mysql_global_variables_query_cache_size
- Adaptive Hash Index Size: mysql_global_status_innodb_mem_adaptive_hash


metric expr:
# 实例启动时间,单位s,三分钟内有重启记录则告警
- mysql_global_status_uptime < 180

# 每秒查询次数指标
- rate(mysql_global_status_questions[5m]) > 500

# 连接数指标
- rate(mysql_global_status_connections[5m]) > 200

# mysql接收速率,单位Mbps
- rate(mysql_global_status_bytes_received[3m]) * 1024 * 1024 * 8 > 50

# mysql传输速率,单位Mbps
- rate(mysql_global_status_bytes_sent[3m]) * 1024 * 1024 * 8 > 100

# 慢查询
- rate(mysql_global_status_slow_queries[30m]) > 3

# 死锁
- rate(mysql_global_status_innodb_deadlocks[3m]) > 1

# 活跃线程小于30%
- mysql_global_status_threads_running / mysql_global_status_threads_connected * 100 < 30


# innodb缓存占用缓存池大小超过80%
- (mysql_global_status_innodb_page_size * on (instance) mysql_global_status_buffer_pool_pages{state="data"} + on (instance) mysql_global_variables_innodb_log_buffer_size + on (instance) mysql_global_variables_innodb_additional_mem_pool_size + on (instance) mysql_global_status_innodb_mem_dictionary + on (instance) mysql_global_variables_key_buffer_size + on (instance) mysql_global_variables_query_cache_size + on (instance) mysql_global_status_innodb_mem_adaptive_hash ) / on (instance) mysql_global_variables_innodb_buffer_pool_size * 100 > 80

根据上方的监控项expr,整理出如下rules规则模板:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
rules.yml: |
groups:
- name: alert-rule
rules:
### Node监控
- alert: NodeFilesystemUsage-high
expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 80
for: 2m
labels:
team: node
severity: warning
annotations:
summary: "{{$labels.instance}}: High Node Filesystem usage detected"
description: "{{$labels.instance}}: Node Filesystem usage is above 80% ,(current value is: {{ $value }})"
- alert: NodeMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes * 100 > 80
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: High Node Memory usage detected"
description: "{{$labels.instance}}: Node Memory usage is above 80% ,(current value is: {{ $value }})"
- alert: NodeCPUUsage
expr: (sum(rate(node_cpu_seconds_total[1m])) - sum(rate(node_cpu_seconds_total{mode="idle"}[1m]))) / sum(rate(node_cpu_seconds_total[1m])) * 100 > 80
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Node High CPU usage detected"
description: "{{$labels.instance}}: Node CPU usage is above 80% ,(current value is: {{ $value }})"
- alert: NodeCPUUsage_
expr: (sum(rate(node_cpu[1m])) - sum(rate(node_cpu{mode="idle"}[1m]))) / sum(rate(node_cpu[1m])) * 100 > 80
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Node High CPU usage detected"
description: "{{$labels.instance}}: Node CPU usage is above 80% ,(current value is: {{ $value }})"
### Pod监控
- alert: PodMemUsage
expr: container_memory_usage_bytes{container_name!=""} / container_spec_memory_limit_bytes{container_name!=""} *100 != +Inf > 80
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Pod High Mem usage detected"
description: "{{$labels.instance}}: Pod Mem is above 80% ,(current value is: {{ $value }})"
- alert: PodMemUsage_
expr: container_memory_usage_bytes{kubernetes_container_name!=""} / container_spec_memory_limit_bytes{kubernetes_container_name!=""} *100 != +Inf > 90
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Pod High Mem usage detected"
description: "{{$labels.instance}}: Pod Mem is above 80% ,(current value is: {{ $value }})"
- alert: PodCpuUsage
expr: sum by (pod_name)( rate(container_cpu_usage_seconds_total{image!=""}[1m] ) ) * 100 > 70
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Pod High CPU usage detected"
description: "{{$labels.instance}}: Pod CPU is above 80% ,(current value is: {{ $value }})"
- alert: PodCpuUsage_
expr: sum by (kubernetes_pod_name)( rate(container_cpu_usage_seconds_total{image!=""}[1m] ) ) * 100 > 70
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Pod High CPU usage detected"
description: "{{$labels.instance}}: Pod CPU is above 80% ,(current value is: {{ $value }})"
- alert: NetI/O_RX
expr: sort_desc(sum by (kubernetes_pod_name) (rate (container_network_receive_bytes_total{name!=""}[1m]) )) /1024 /1024 /60 * 8 > 500
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Pod High NetI/O_RX detected"
description: "{{$labels.instance}}: Pod NetI/O_RX is more than 500Mbps ,(current value is: {{ $value }})"
- alert: NetI/O_RX_
expr: sort_desc(sum by (pod_name) (rate (container_network_receive_bytes_total{name!=""}[1m]) )) /1024 /1024 /60 *8 > 500
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Pod High NetI/O_RX detected"
description: "{{$labels.instance}}: Pod NetI/O_RX is more than 500Mbps ,(current value is: {{ $value }})"
- alert: NetI/O_TX
expr: sort_desc(sum by (kubernetes_pod_name) (rate (container_network_transmit_bytes_total{name!=""}[1m]) )) /1024 /1024 /60 * 8 > 500
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Pod High NetI/O_TX detected"
description: "{{$labels.instance}}: Pod NetI/O_TX is more than 100Mbps ,(current value is: {{ $value }})"
- alert: NetI/O_TX_
expr: sort_desc(sum by (pod_name) (rate (container_network_transmit_bytes_total{name!=""}[1m]) )) /1024 /1024 /60 *8 > 500
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Pod High NetI/O_TX detected"
description: "{{$labels.instance}}: Pod NetI/O_TX is more than 100Mbps ,(current value is: {{ $value }})"
### Mysql监控
- alert: Mysql_Instance_Reboot
expr: mysql_global_status_uptime < 180
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Mysql_Instance_Reboot detected"
description: "{{$labels.instance}}: Mysql_Instance_Reboot in 3 minute (up to now is: {{ $value }} seconds"
- alert: Mysql_High_QPS
expr: rate(mysql_global_status_questions[5m]) > 500
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Mysql_High_QPS detected"
description: "{{$labels.instance}}: Mysql opreation is more than 500 per second ,(current value is: {{ $value }})"
- alert: Mysql_Too_Many_Connections
expr: rate(mysql_global_status_connections[5m]) > 100
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Mysql Too Many Connections detected"
description: "{{$labels.instance}}: Mysql Connections is more than 100 per second ,(current value is: {{ $value }})"
- alert: Mysql_High_Recv_Rate
expr: rate(mysql_global_status_bytes_received[3m]) * 1024 * 1024 * 8 > 100
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Mysql_High_Recv_Rate detected"
description: "{{$labels.instance}}: Mysql_Receive_Rate is more than 100Mbps ,(current value is: {{ $value }})"
- alert: Mysql_High_Send_Rate
expr: rate(mysql_global_status_bytes_sent[3m]) * 1024 * 1024 * 8 > 100
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Mysql_High_Send_Rate detected"
description: "{{$labels.instance}}: Mysql data Send Rate is more than 100Mbps ,(current value is: {{ $value }})"
- alert: Mysql_Too_Many_Slow_Query
expr: rate(mysql_global_status_slow_queries[30m]) > 3
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Mysql_Too_Many_Slow_Query detected"
description: "{{$labels.instance}}: Mysql current Slow_Query Sql is more than 3 ,(current value is: {{ $value }})"
- alert: Mysql_Deadlock
expr: mysql_global_status_innodb_deadlocks > 0
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Mysql_Deadlock detected"
description: "{{$labels.instance}}: Mysql Deadlock was found ,(current value is: {{ $value }})"
- alert: Mysql_Too_Many_sleep_threads
expr: mysql_global_status_threads_running / mysql_global_status_threads_connected * 100 < 30
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Mysql_Too_Many_sleep_threads detected"
description: "{{$labels.instance}}: Mysql_sleep_threads percent is more than {{ $value }}, please clean the sleeping threads"
- alert: Mysql_innodb_Cache_insufficient
expr: (mysql_global_status_innodb_page_size * on (instance) mysql_global_status_buffer_pool_pages{state="data"} + on (instance) mysql_global_variables_innodb_log_buffer_size + on (instance) mysql_global_variables_innodb_additional_mem_pool_size + on (instance) mysql_global_status_innodb_mem_dictionary + on (instance) mysql_global_variables_key_buffer_size + on (instance) mysql_global_variables_query_cache_size + on (instance) mysql_global_status_innodb_mem_adaptive_hash ) / on (instance) mysql_global_variables_innodb_buffer_pool_size * 100 > 80
for: 2m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: Mysql_innodb_Cache_insufficient detected"
description: "{{$labels.instance}}: Mysql innodb_Cache was used more than 80% ,(current value is: {{ $value }})"

将这些规则更新进prometheus的configmap中,重新apply config,更新prometheus,查看prometheus的告警页面,可以查看多项告警选项,则配置成功。
这里写图片描述

联邦

prometheus 2.0以上版本的新特性,联邦机制,支持从其他的prometheus扩展合并数据,比较类似zabbix的server端与proxy端,适用于跨集群/机房使用,但不同的是prometheus联邦配置非常简单,简单的增加job指定其他prometheus实例服务IP端口,即可获取数据。我们的环境中有两套集群,两个prometheus server实例,在prometheus的configmap配置增加一个如下job配置即可合并数据统一展示:

1
2
3
4
5
6
7
8
9
10
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job=~"kubernetes-.*"}'
static_configs:
- targets:
- '192.168.9.26:9090'

更新配置,更新prometheus pod,重新打开grafana的集群监控模板页面,可以看到数据已经进行了整合:

整合前:
这里写图片描述

整合后:
这里写图片描述

Relabel

Relabel机制可以在Prometheus采集数据存入TSDB之前,通过Target实例的Metadata信息,动态地为Label赋值,可以是metadata中的label,也可以赋值给新label。除此之外,还能根据Target实例的Metadata信息选择是否采集或者忽略该Target实例。
例如这里自定义了如下target:

1
2
3
4
5
6
7
- job_name: 'kubernetes-cadvisor'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

在web上查看target的原始标签:
这里写图片描述
可以看到instance标签的值为主机名,鼠标悬停展开的黑色框中的标签为metadata标签,form表单里的标签是放入TSDB最终展示的标签

下面给这段配置增加一个自定义标签:

1
2
3
4
5
6
7
8
9
10
11
12
13
- job_name: 'kubernetes-cadvisor'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__address__]
regex: (.*):(.*)
action: replace
replacement: $1
target_label: instance

这里写图片描述
可以看到instance标签的值变成了host的IP地址
relabel配置解析:

1
2
3
4
5
source_labels: [__address__]  #值的来源标签(metadata),__address__是标签名
regex: (.*):(.*) #正则匹配表达式
action: replace #动作为替换,默认动作
replacement: $1 #替换的字段,即将正则表达式第一段匹配到的值赋值给target_label
target_label: instance #被赋值的标签,可以为metadata中的标签,也可以是自定义的新标签名

通过relabel,可以做一些自定义过滤规则,实现特殊的需求订制

配置热更新

监控配置项调整是非常常见的操作,如果每次调整配置后,都需要重建pod,过程还是比较繁琐漫长的,好在prometheus提供热重载的api,访问方式为:

1
2
curl -X POST  http://${IP/DOMAIN}/-/reload
# 注意,2.0版本以后默认重载api没有开启,需要在prometheus的启动参数里添加指定--web.enable-lifecycle 参数

总结:

以上就是在上一篇的基础上对prometheus适应生产环境做的一些扩展,当然,需要做的监控项、应用还很多,配置文件、修改后的grafana模板文件随后打包附上,欢迎交流讨论,一起扩展。

赏一瓶快乐回宅水吧~
-------------本文结束感谢您的阅读-------------