Commit 943db16

Merge pull request #143 from icey-yu/fix-pro
fix: add prometheus config edit illustrate
2 parents 0a7dc79 + eaeb91f commit 943db16

1 file changed

Lines changed: 118 additions & 115 deletions

docs/guides/gettingStarted/admin.mdx

@@ -43,7 +43,124 @@ import Image4 from './assets/admin.jpg';
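 
 <img src={Image4} width="700" alt="admin " />
 
-
+## Configuration Files and Alerting
+
+1. `prometheus.yml`: configures the alert rule file paths, the Alertmanager service address, and the IP addresses that monitoring data is scraped from. Replace every occurrence of `internal_ip` in this file with your own private (internal) IP address, as follows:
+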
+```yaml
+# Alertmanager configuration
+alerting:
+  alertmanagers:
+    - static_configs:
+        - targets: ['192.168.0.1:19093']
+
+...
+```
+
+To load additional alert rule files, list them under `rule_files`. The default rule file is `instance-down-rules.yml`.
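
As a rough sketch of what such an entry could look like, the fragment below lists the default rule file plus one extra file under `rule_files`. The second file name, `my-extra-rules.yml`, is hypothetical and only illustrates the shape of the entry; it is assumed to sit in the same directory as `prometheus.yml`.

```yaml
# prometheus.yml (fragment): rule files that Prometheus loads and evaluates
rule_files:
  - "instance-down-rules.yml"   # default rule file named in this guide
  - "my-extra-rules.yml"        # hypothetical additional rule file (assumption)
```

Prometheus reads `rule_files` when it loads its configuration, so a change like this typically takes effect only after the configuration is reloaded or the service is restarted.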
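+
+2. Email alerting architecture: Prometheus loads the alert rules from `instance-down-rules.yml` and sends the alerts that match them to Alertmanager. Alertmanager loads `alertmanager.yml` and `email.tmpl`, then sends email using the configured mailbox settings and the email template.
+
+![PC Web Interface](./assets/alert2.png)
+
+3. Alert rules file `instance-down-rules.yml`: two email alert rules are implemented by default (`instance_down` and `database_insert_failure_alerts`). To add more alert rules, add them to `instance-down-rules.yml`.
+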
+```yaml
+groups:
+  - name: instance_down # Rule 1: fire an alert when a monitored module has been down for more than one minute
+    rules:
+      - alert: InstanceDown
+        expr: up == 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Instance {{ $labels.instance }} down"
+          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
+
+  - name: database_insert_failure_alerts # Rule 2: fire an alert when msg_insert_redis_failed_total or msg_insert_mongo_failed_total increases
+    rules:
+      - alert: DatabaseInsertFailed
+        expr: (increase(msg_insert_redis_failed_total[5m]) > 0) or (increase(msg_insert_mongo_failed_total[5m]) > 0)
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Increase in MsgInsertRedisFailedCounter or MsgInsertMongoFailedCounter detected"
+          description: "Either MsgInsertRedisFailedCounter or MsgInsertMongoFailedCounter has increased in the last 5 minutes, indicating failures in message insert operations to Redis or MongoDB. Redis or MongoDB may have crashed."
+```
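
To illustrate how an extra rule might be added to `instance-down-rules.yml`, here is a minimal sketch of a third group appended under the existing `groups:` key. The group name, alert name, `warning` severity, and 10-minute duration are all assumptions chosen for illustration; only the `up` metric and the overall rule layout come from the file above.

```yaml
groups:
  # The existing instance_down and database_insert_failure_alerts groups stay unchanged above this point.

  - name: instance_down_prolonged        # hypothetical group name
    rules:
      - alert: InstanceDownProlonged     # hypothetical alert name
        expr: up == 0                    # same expression as the default InstanceDown rule
        for: 10m                         # assumed duration; tune to your environment
        labels:
          severity: warning              # assumed severity label value
        annotations:
          summary: "Instance {{ $labels.instance }} still down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 10 minutes."
```

Before reloading Prometheus, the edited file can be validated with `promtool check rules instance-down-rules.yml`, which reports syntax errors in rule files.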
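+
+4. Alertmanager configuration file `alertmanager.yml`: update the sender and recipient mailbox settings to start receiving alerts. To deliver notifications through DingTalk, WeCom, or similar channels, you need to rewrite `alertmanager.yml` yourself; see the official Alertmanager documentation: https://prometheus.io/docs/alerting/latest/alertmanager/
+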
+```yaml
+global:
+  resolve_timeout: 5m
+  smtp_from: alert@openim.io # sender address for alert emails
+  smtp_smarthost: smtp.163.com:465 # SMTP server address of the sending mailbox
+  smtp_auth_username: alert@openim.io # SMTP auth username, usually the same as smtp_from
+  smtp_auth_password: YOURAUTHPASSWORD # authorization code for the sending mailbox
+  smtp_require_tls: false
+  smtp_hello: openim alert
+
+templates:
+  - /etc/alertmanager/email.tmpl # email template
+
+route:
+  group_by: ['alertname'] # labels to group alerts by; alerts with the same label values are merged into one notification
+  group_wait: 5s # how long to wait before sending the first notification for a group
+  group_interval: 5s # interval between notifications for a group
+  repeat_interval: 5m # interval before a notification for the same alert is re-sent; periodically reminds recipients of alerts that are still firing
+  receiver: email # name of the default receiver
+receivers:
+  - name: email # receiver name
+    email_configs:
+      - to: 'alert@example.com' # recipient address for alerts
+        html: '{{ template "email.to.html" . }}'
+        headers: { Subject: "[OPENIM-SERVER]Alarm" } # email subject
+        send_resolved: true # whether to send a notification when an alert is resolved
+```
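
For channels other than email, one possible starting point is Alertmanager's generic `webhook_configs` receiver, which sends alerts as HTTP POST requests to a URL you supply. The sketch below assumes a separately deployed bridge service that forwards those requests to DingTalk or WeCom; the receiver name `chat-webhook` and the URL are placeholders, not part of the OpenIM deployment.

```yaml
# alertmanager.yml (fragment): route alerts to a webhook instead of email
route:
  group_by: ['alertname']
  receiver: chat-webhook          # hypothetical receiver name, used here as the default receiver

receivers:
  - name: chat-webhook
    webhook_configs:
      - url: 'http://127.0.0.1:8060/hypothetical/alert-hook'  # placeholder endpoint of an assumed bridge service
        send_resolved: true       # also notify when an alert is resolved
```

The result can be checked with `amtool check-config alertmanager.yml` before restarting Alertmanager. The exact fields each channel needs are described in the official documentation linked above.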
+
+5. Email template file `email.tmpl`: this file is an HTML template. Alertmanager fills in its variables, renders the result as HTML, and sends it as the email. Adapt it to your needs:
+
+```tmpl
+{{ define "email.to.html" }}
+{{ if eq .Status "firing" }}
+    {{ range .Alerts }}
+    <!-- Begin of OpenIM Alert -->
+    <div style="border:1px solid #ccc; padding:10px; margin-bottom:10px;">
+        <h3>OpenIM Alert</h3>
+        <p><strong>Alert Status:</strong> firing</p>
+        <p><strong>Alert Program:</strong> Prometheus Alert</p>
+        <p><strong>Severity Level:</strong> {{ .Labels.severity }}</p>
+        <p><strong>Alert Type:</strong> {{ .Labels.alertname }}</p>
+        <p><strong>Affected Host:</strong> {{ .Labels.instance }}</p>
+        <p><strong>Affected Service:</strong> {{ .Labels.job }}</p>
+        <p><strong>Alert Subject:</strong> {{ .Annotations.summary }}</p>
+        <p><strong>Trigger Time:</strong> {{ .StartsAt.Format "2006-01-02 15:04:05" }}</p>
+    </div>
+    {{ end }}
+
+
+{{ else if eq .Status "resolved" }}
+    {{ range .Alerts }}
+    <!-- Begin of OpenIM Alert -->
+    <div style="border:1px solid #ccc; padding:10px; margin-bottom:10px;">
+        <h3>OpenIM Alert</h3>
+        <p><strong>Alert Status:</strong> resolved</p>
+        <p><strong>Alert Program:</strong> Prometheus Alert</p>
+        <p><strong>Severity Level:</strong> {{ .Labels.severity }}</p>
+        <p><strong>Alert Type:</strong> {{ .Labels.alertname }}</p>
+        <p><strong>Affected Host:</strong> {{ .Labels.instance }}</p>
+        <p><strong>Affected Service:</strong> {{ .Labels.job }}</p>
+        <p><strong>Alert Subject:</strong> {{ .Annotations.summary }}</p>
+        <p><strong>Trigger Time:</strong> {{ .StartsAt.Format "2006-01-02 15:04:05" }}</p>
+    </div>
+    {{ end }}
+    <!-- End of OpenIM Alert -->
+{{ end }}
+{{ end }}
+
+```
 
 ## Log in to Grafana
 Log in to the admin console first, then click the data monitoring menu on the left and sign in to Grafana with the default username (admin) and password (admin).
@@ -104,120 +221,6 @@ node-exporter metrics, as shown in the figure below
 
 
 
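-## Alert Configuration Files
-
-1, Email alerting architecture: Prometheus loads the alert rules from `instance-down-rules.yml` and sends the alerts that match them to Alertmanager. Alertmanager loads `alertmanager.yml` and `email.tmpl`, then sends email using the configured mailbox settings and the email template.
-![PC Web Interface](./assets/alert2.png)
-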
-2, `prometheus.yml`: configures the alert rule file paths, the Alertmanager service address, and the IP addresses that monitoring data is scraped from. No changes are needed by default.
-```
-
-# Alertmanager configuration
-alerting:
-  alertmanagers:
-    - static_configs:
-        - targets: ['172.28.0.1:19093']
-
-# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
-rule_files:
-  - "instance-down-rules.yml"
-
-```
-3, Alert rules file `instance-down-rules.yml`: two email alert rules are implemented by default (`instance_down` and `database_insert_failure_alerts`). To add more alert rules, add them to `instance-down-rules.yml`:
-```
-groups:
-  - name: instance_down # Rule 1: fire an alert when a monitored module has been down for more than one minute
-    rules:
-      - alert: InstanceDown
-        expr: up == 0
-        for: 1m
-        labels:
-          severity: critical
-        annotations:
-          summary: "Instance {{ $labels.instance }} down"
-          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
-
-  - name: database_insert_failure_alerts # Rule 2: fire an alert when msg_insert_redis_failed_total or msg_insert_mongo_failed_total increases
-    rules:
-      - alert: DatabaseInsertFailed
-        expr: (increase(msg_insert_redis_failed_total[5m]) > 0) or (increase(msg_insert_mongo_failed_total[5m]) > 0)
-        for: 1m
-        labels:
-          severity: critical
-        annotations:
-          summary: "Increase in MsgInsertRedisFailedCounter or MsgInsertMongoFailedCounter detected"
-          description: "Either MsgInsertRedisFailedCounter or MsgInsertMongoFailedCounter has increased in the last 5 minutes, indicating failures in message insert operations to Redis or MongoDB. Redis or MongoDB may have crashed."
-```
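-
-4, Alertmanager configuration file `alertmanager.yml`: update the sender and recipient mailbox settings to start receiving alerts. To deliver notifications through DingTalk, WeCom, or similar channels, you need to rewrite `alertmanager.yml` yourself; see the official Alertmanager documentation: https://prometheus.io/docs/alerting/latest/alertmanager/
-```
-global:
-  resolve_timeout: 5m
-  smtp_from: alert@openim.io # sender address for alert emails
-  smtp_smarthost: smtp.163.com:465 # SMTP server address of the sending mailbox
-  smtp_auth_username: alert@openim.io # SMTP auth username, usually the same as smtp_from
-  smtp_auth_password: YOURAUTHPASSWORD # authorization code for the sending mailbox
-  smtp_require_tls: false
-  smtp_hello: openim alert
-
-templates:
-  - /etc/alertmanager/email.tmpl # email template
-
-route:
-  group_by: ['alertname'] # labels to group alerts by; alerts with the same label values are merged into one notification
-  group_wait: 5s # how long to wait before sending the first notification for a group
-  group_interval: 5s # interval between notifications for a group
-  repeat_interval: 5m # interval before a notification for the same alert is re-sent; periodically reminds recipients of alerts that are still firing
-  receiver: email # name of the default receiver
-receivers:
-  - name: email # receiver name
-    email_configs:
-      - to: 'alert@example.com' # recipient address for alerts
-        html: '{{ template "email.to.html" . }}'
-        headers: { Subject: "[OPENIM-SERVER]Alarm" } # email subject
-        send_resolved: true # whether to send a notification when an alert is resolved
-```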
-5, Email template file `email.tmpl`: this file is an HTML template. Alertmanager fills in its variables, renders the result as HTML, and sends it as the email. Adapt it to your needs:
-```
-{{ define "email.to.html" }}
-{{ if eq .Status "firing" }}
-    {{ range .Alerts }}
-    <!-- Begin of OpenIM Alert -->
-    <div style="border:1px solid #ccc; padding:10px; margin-bottom:10px;">
-        <h3>OpenIM Alert</h3>
-        <p><strong>Alert Status:</strong> firing</p>
-        <p><strong>Alert Program:</strong> Prometheus Alert</p>
-        <p><strong>Severity Level:</strong> {{ .Labels.severity }}</p>
-        <p><strong>Alert Type:</strong> {{ .Labels.alertname }}</p>
-        <p><strong>Affected Host:</strong> {{ .Labels.instance }}</p>
-        <p><strong>Affected Service:</strong> {{ .Labels.job }}</p>
-        <p><strong>Alert Subject:</strong> {{ .Annotations.summary }}</p>
-        <p><strong>Trigger Time:</strong> {{ .StartsAt.Format "2006-01-02 15:04:05" }}</p>
-    </div>
-    {{ end }}
-
-
-{{ else if eq .Status "resolved" }}
-    {{ range .Alerts }}
-    <!-- Begin of OpenIM Alert -->
-    <div style="border:1px solid #ccc; padding:10px; margin-bottom:10px;">
-        <h3>OpenIM Alert</h3>
-        <p><strong>Alert Status:</strong> resolved</p>
-        <p><strong>Alert Program:</strong> Prometheus Alert</p>
-        <p><strong>Severity Level:</strong> {{ .Labels.severity }}</p>
-        <p><strong>Alert Type:</strong> {{ .Labels.alertname }}</p>
-        <p><strong>Affected Host:</strong> {{ .Labels.instance }}</p>
-        <p><strong>Affected Service:</strong> {{ .Labels.job }}</p>
-        <p><strong>Alert Subject:</strong> {{ .Annotations.summary }}</p>
-        <p><strong>Trigger Time:</strong> {{ .StartsAt.Format "2006-01-02 15:04:05" }}</p>
-    </div>
-    {{ end }}
-    <!-- End of OpenIM Alert -->
-{{ end }}
-{{ end }}
-
-```
-
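 
 ## Trying Out Alerts
 You can trigger the `InstanceDown` alert rule manually. If OpenIM was deployed from source, run `make stop` to stop the openim-server service and wait at least 5 minutes; you will then receive an alert email with the following content: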
