@@ -43,7 +43,124 @@ import Image4 from './assets/admin.jpg';
4343
4444<img src={Image4} width="700" alt="admin" />
4545
46-
46+ ## Configuration Files and Alerting
47+
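48+ 1. `prometheus.yml`: this file configures the path to the alert rule files, the address of the Alertmanager service, and the IP addresses of the scrape targets. Replace every `internal_ip` in it with your own private IP address, for example: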
49+
50+ ```yaml
51+ # Alertmanager configuration
52+ alerting:
53+   alertmanagers:
54+     - static_configs:
55+         - targets: ['192.168.0.1:19093']
56+
57+ ...
58+ ```
59+
60+ To add another alert rule file, list it under `rule_files`. The default rule file is `instance-down-rules.yml`.
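
As an illustration of adding a rule file, the `rule_files` section might look like the sketch below. The second entry, `custom-rules.yml`, is a hypothetical file name used only for this example and is not part of the default deployment.

```yaml
# Sketch only: custom-rules.yml is an assumed, illustrative file name.
rule_files:
  - "instance-down-rules.yml"   # default rule file shipped with the deployment
  - "custom-rules.yml"          # hypothetical additional rule file
```

Prometheus needs to reload its configuration (or be restarted) before a newly listed file takes effect.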
61+
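62+ 2. Email alerting architecture diagram: Prometheus loads the alert rules from `instance-down-rules.yml` and sends the alerts that match to Alertmanager. Alertmanager loads `alertmanager.yml` and `email.tmpl`, then sends the email using the configured mailbox settings and the email template.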
63+
64+ ![PC Web Interface](./assets/alert2.png)
65+
66+ 3. Alert rule file `instance-down-rules.yml`: two email alert rules are implemented by default (`instance_down` and `database_insert_failure_alerts`). To add more alert rules, add them to `instance-down-rules.yml`.
67+
68+ ```yaml
69+ groups:
70+   - name: instance_down # Rule 1: fire an alert when a monitored module has been down for more than one minute
71+     rules:
72+       - alert: InstanceDown
73+         expr: up == 0
74+         for: 1m
75+         labels:
76+           severity: critical
77+         annotations:
78+           summary: "Instance {{ $labels.instance }} down"
79+           description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."
80+
81+   - name: database_insert_failure_alerts # Rule 2: fire an alert when msg_insert_redis_failed_total or msg_insert_mongo_failed_total increases
82+     rules:
83+       - alert: DatabaseInsertFailed
84+         expr: (increase(msg_insert_redis_failed_total[5m]) > 0) or (increase(msg_insert_mongo_failed_total[5m]) > 0)
85+         for: 1m
86+         labels:
87+           severity: critical
88+         annotations:
89+           summary: "Increase in MsgInsertRedisFailedCounter or MsgInsertMongoFailedCounter detected"
90+           description: "Either MsgInsertRedisFailedCounter or MsgInsertMongoFailedCounter has increased in the last 5 minutes, indicating failures in message insert operations to Redis or MongoDB,maybe the redis or mongodb is crash."
91+ ```
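
To show what adding a rule to `instance-down-rules.yml` could look like, here is a minimal sketch of one extra group entry. The group name, alert name, and the `0.1` threshold are all assumptions chosen for illustration; only the metric `msg_insert_mongo_failed_total` comes from the rules above.

```yaml
# Sketch only: names and threshold below are hypothetical.
# In a real file this entry is appended under the existing top-level `groups:` key.
groups:
  - name: mongo_insert_failure_rate            # hypothetical group name
    rules:
      - alert: MongoInsertFailureSustained     # hypothetical alert name
        # Per-second failure rate averaged over 5 minutes; 0.1 is an assumed example value.
        expr: rate(msg_insert_mongo_failed_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sustained MongoDB insert failures on {{ $labels.instance }}"
```

Using `rate` with a threshold and a longer `for` duration trades faster detection for fewer notifications on brief spikes; whether that trade-off fits depends on the deployment.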
92+
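93+ 4. Alertmanager configuration file `alertmanager.yml`: update the sender and recipient email settings to start receiving alerts. To send notifications through DingTalk, WeCom, or other channels, you need to rewrite `alertmanager.yml` yourself; see the official Alertmanager documentation: https://prometheus.io/docs/alerting/latest/alertmanager/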
94+
95+ ```yaml
96+ global:
97+   resolve_timeout: 5m
98+   smtp_from: alert@openim.io # sender address for alert emails
99+   smtp_smarthost: smtp.163.com:465 # SMTP server address of the sender mailbox
100+   smtp_auth_username: alert@openim.io # SMTP auth username, usually the same as smtp_from
101+   smtp_auth_password: YOURAUTHPASSWORD # SMTP authorization code of the sender mailbox
102+   smtp_require_tls: false
103+   smtp_hello: openim alert
104+
105+ templates:
106+   - /etc/alertmanager/email.tmpl # email template
107+
108+ route:
109+   group_by: ['alertname'] # labels to group by; alerts with the same label values are merged into one notification
110+   group_wait: 5s # how long to wait before sending the first notification for a group
111+   group_interval: 5s # interval between notifications for a group
112+   repeat_interval: 5m # interval between repeated notifications for the same alert; reminds recipients that it is still active
113+   receiver: email # default receiver name
114+ receivers:
115+   - name: email # receiver name
116+     email_configs:
117+       - to: 'alert@example.com' # recipient address for alert emails
118+         html: '{{ template "email.to.html" . }}'
119+         headers: { Subject: "[OPENIM-SERVER]Alarm" } # email subject
120+         send_resolved: true # whether to send a notification when the alert is resolved
121+ ```
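
For notification channels other than email, one option is Alertmanager's generic webhook receiver. The sketch below is an illustration under stated assumptions, not the project's configuration: the receiver name `chat-webhook` and the URL are hypothetical, and it assumes a relay service is running at that address to convert Alertmanager's webhook payload into the format the chat tool expects.

```yaml
# Sketch only: receiver name and URL are hypothetical.
route:
  group_by: ['alertname']
  receiver: chat-webhook          # route all alerts to the webhook receiver
receivers:
  - name: chat-webhook
    webhook_configs:
      - url: 'http://192.168.0.1:8060/alert'   # assumed relay endpoint, replace with your own
        send_resolved: true                    # also notify when the alert is resolved
```

The `global` SMTP settings and the `templates` entry are only needed for email receivers, so they are left out of this fragment.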
122+
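123+ 5. Email template file `email.tmpl`: this file is an HTML template. Alertmanager fills in its variables, renders it to HTML, and sends the result as the email body. Adapt it to your needs: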
124+
125+ ```tmpl
126+ {{ define "email.to.html" }}
127+ {{ if eq .Status "firing" }}
128+ {{ range .Alerts }}
129+ <!-- Begin of OpenIM Alert -->
130+ <div style="border:1px solid #ccc; padding:10px; margin-bottom:10px;">
131+ <h3>OpenIM Alert</h3>
132+ <p><strong>Alert Status:</strong> firing</p>
133+ <p><strong>Alert Program:</strong> Prometheus Alert</p>
134+ <p><strong>Severity Level:</strong> {{ .Labels.severity }}</p>
135+ <p><strong>Alert Type:</strong> {{ .Labels.alertname }}</p>
136+ <p><strong>Affected Host:</strong> {{ .Labels.instance }}</p>
137+ <p><strong>Affected Service:</strong> {{ .Labels.job }}</p>
138+ <p><strong>Alert Subject:</strong> {{ .Annotations.summary }}</p>
139+ <p><strong>Trigger Time:</strong> {{ .StartsAt.Format "2006-01-02 15:04:05" }}</p>
140+ </div>
141+ {{ end }}
142+
143+
144+ {{ else if eq .Status "resolved" }}
145+ {{ range .Alerts }}
146+ <!-- Begin of OpenIM Alert -->
147+ <div style="border:1px solid #ccc; padding:10px; margin-bottom:10px;">
148+ <h3>OpenIM Alert</h3>
149+ <p><strong>Alert Status:</strong> resolved</p>
150+ <p><strong>Alert Program:</strong> Prometheus Alert</p>
151+ <p><strong>Severity Level:</strong> {{ .Labels.severity }}</p>
152+ <p><strong>Alert Type:</strong> {{ .Labels.alertname }}</p>
153+ <p><strong>Affected Host:</strong> {{ .Labels.instance }}</p>
154+ <p><strong>Affected Service:</strong> {{ .Labels.job }}</p>
155+ <p><strong>Alert Subject:</strong> {{ .Annotations.summary }}</p>
156+ <p><strong>Trigger Time:</strong> {{ .StartsAt.Format "2006-01-02 15:04:05" }}</p>
157+ </div>
158+ {{ end }}
159+ <!-- End of OpenIM Alert -->
160+ {{ end }}
161+ {{ end }}
162+
163+ ```
47164
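48165## Log in to Grafana
49166Log in to the admin console first, then click the Data Monitoring menu on the left and enter the default username (admin) and password (admin) to log in to Grafana.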
@@ -104,120 +221,6 @@ node-exporter metrics, as shown in the figure below
104221
105222
106223
107- ## Alert Configuration Files
108-
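109- 1, Email alerting architecture diagram: Prometheus loads the alert rules from `instance-down-rules.yml` and sends the alerts that match to Alertmanager. Alertmanager loads `alertmanager.yml` and `email.tmpl`, then sends the email using the configured mailbox settings and the email template.
110- ![PC Web Interface](./assets/alert2.png)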
111-
112- 2, `prometheus.yml`: this file configures the path to the alert rule files, the address of the Alertmanager service, and the IP addresses of the scrape targets. No changes are needed by default.
113- ```
114-
115- # Alertmanager configuration
116- alerting:
117-   alertmanagers:
118-     - static_configs:
119-         - targets: ['172.28.0.1:19093']
120-
121- # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
122- rule_files:
123-   - "instance-down-rules.yml"
124-
125- ```
126- 3, Alert rule file `instance-down-rules.yml`: two email alert rules are implemented by default (`instance_down` and `database_insert_failure_alerts`). To add more alert rules, add them to `instance-down-rules.yml`:
127- ```
128- groups:
129-   - name: instance_down # Rule 1: fire an alert when a monitored module has been down for more than one minute
130-     rules:
131-       - alert: InstanceDown
132-         expr: up == 0
133-         for: 1m
134-         labels:
135-           severity: critical
136-         annotations:
137-           summary: "Instance {{ $labels.instance }} down"
138-           description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."
139-
140-   - name: database_insert_failure_alerts # Rule 2: fire an alert when msg_insert_redis_failed_total or msg_insert_mongo_failed_total increases
141-     rules:
142-       - alert: DatabaseInsertFailed
143-         expr: (increase(msg_insert_redis_failed_total[5m]) > 0) or (increase(msg_insert_mongo_failed_total[5m]) > 0)
144-         for: 1m
145-         labels:
146-           severity: critical
147-         annotations:
148-           summary: "Increase in MsgInsertRedisFailedCounter or MsgInsertMongoFailedCounter detected"
149-           description: "Either MsgInsertRedisFailedCounter or MsgInsertMongoFailedCounter has increased in the last 5 minutes, indicating failures in message insert operations to Redis or MongoDB,maybe the redis or mongodb is crash."
150- ```
151-
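152- 4, Alertmanager configuration file `alertmanager.yml`: update the sender and recipient email settings to start receiving alerts. To send notifications through DingTalk, WeCom, or other channels, you need to rewrite `alertmanager.yml` yourself; see the official Alertmanager documentation: https://prometheus.io/docs/alerting/latest/alertmanager/
153- ```
154- global:
155-   resolve_timeout: 5m
156-   smtp_from: alert@openim.io # sender address for alert emails
157-   smtp_smarthost: smtp.163.com:465 # SMTP server address of the sender mailbox
158-   smtp_auth_username: alert@openim.io # SMTP auth username, usually the same as smtp_from
159-   smtp_auth_password: YOURAUTHPASSWORD # SMTP authorization code of the sender mailbox
160-   smtp_require_tls: false
161-   smtp_hello: openim alert
162-
163- templates:
164-   - /etc/alertmanager/email.tmpl # email template
165-
166- route:
167-   group_by: ['alertname'] # labels to group by; alerts with the same label values are merged into one notification
168-   group_wait: 5s # how long to wait before sending the first notification for a group
169-   group_interval: 5s # interval between notifications for a group
170-   repeat_interval: 5m # interval between repeated notifications for the same alert; reminds recipients that it is still active
171-   receiver: email # default receiver name
172- receivers:
173-   - name: email # receiver name
174-     email_configs:
175-       - to: 'alert@example.com' # recipient address for alert emails
176-         html: '{{ template "email.to.html" . }}'
177-         headers: { Subject: "[OPENIM-SERVER]Alarm" } # email subject
178-         send_resolved: true # whether to send a notification when the alert is resolved
179- ```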
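180- 5, Email template file `email.tmpl`: this file is an HTML template. Alertmanager fills in its variables, renders it to HTML, and sends the result as the email body. Adapt it to your needs: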
181- ```
182- {{ define "email.to.html" }}
183- {{ if eq .Status "firing" }}
184- {{ range .Alerts }}
185- <!-- Begin of OpenIM Alert -->
186- <div style="border:1px solid #ccc; padding:10px; margin-bottom:10px;">
187- <h3>OpenIM Alert</h3>
188- <p><strong>Alert Status:</strong> firing</p>
189- <p><strong>Alert Program:</strong> Prometheus Alert</p>
190- <p><strong>Severity Level:</strong> {{ .Labels.severity }}</p>
191- <p><strong>Alert Type:</strong> {{ .Labels.alertname }}</p>
192- <p><strong>Affected Host:</strong> {{ .Labels.instance }}</p>
193- <p><strong>Affected Service:</strong> {{ .Labels.job }}</p>
194- <p><strong>Alert Subject:</strong> {{ .Annotations.summary }}</p>
195- <p><strong>Trigger Time:</strong> {{ .StartsAt.Format "2006-01-02 15:04:05" }}</p>
196- </div>
197- {{ end }}
198-
199-
200- {{ else if eq .Status "resolved" }}
201- {{ range .Alerts }}
202- <!-- Begin of OpenIM Alert -->
203- <div style="border:1px solid #ccc; padding:10px; margin-bottom:10px;">
204- <h3>OpenIM Alert</h3>
205- <p><strong>Alert Status:</strong> resolved</p>
206- <p><strong>Alert Program:</strong> Prometheus Alert</p>
207- <p><strong>Severity Level:</strong> {{ .Labels.severity }}</p>
208- <p><strong>Alert Type:</strong> {{ .Labels.alertname }}</p>
209- <p><strong>Affected Host:</strong> {{ .Labels.instance }}</p>
210- <p><strong>Affected Service:</strong> {{ .Labels.job }}</p>
211- <p><strong>Alert Subject:</strong> {{ .Annotations.summary }}</p>
212- <p><strong>Trigger Time:</strong> {{ .StartsAt.Format "2006-01-02 15:04:05" }}</p>
213- </div>
214- {{ end }}
215- <!-- End of OpenIM Alert -->
216- {{ end }}
217- {{ end }}
218-
219- ```
220-
221224
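222225## Trying Out Alerts
223226You can trigger the InstanceDown alert rule manually. If OpenIM was deployed from source, run `make stop` to stop the openim-server service and wait more than 5 minutes; you should then receive an alert email like the following: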