Configuring Webhook Alerts for TiDB in Practice

Community submission, 2024-05-01



prometheus-webhook is an extension for Alertmanager alerting. It supports DingTalk, WeChat, and email notifications, as well as custom alert templates.

1. Configure alerts

1. Download and extract the release package

# Download
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz

# Extract
tar -zxvf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz

[tidb@vm172-16-201-64 prometheus-webhook-dingtalk-2.1.0.linux-amd64]$ ll
total 18744
-rw-r--r-- 1 tidb tidb     1299 Apr 21 16:20 config.example.yml
drwxr-xr-x 4 tidb tidb     4096 Apr 21 16:20 contrib
-rw-r--r-- 1 tidb tidb    11358 Apr 21 16:20 LICENSE
-rwxr-xr-x 1 tidb tidb 19172733 Apr 21 16:19 prometheus-webhook-dingtalk
[tidb@vm172-16-201-64 prometheus-webhook-dingtalk-2.1.0.linux-amd64]$

2. Create the webhook startup script

more /data/webhook-dingtalk/webhook-dingtalk.sh

#!/bin/bash
set -e

WEBHOOK_BIN=/data/webhook-dingtalk/prometheus-webhook-dingtalk

# exec replaces the shell with the webhook process, so systemd tracks it directly.
exec "$WEBHOOK_BIN" \
    --web.listen-address=":8060" \
    --config.file="/data/webhook-dingtalk/jms_config.yml" \
    --log.level="info" \
    --log.format="logfmt" \
    --web.enable-lifecycle \
    --web.enable-ui
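The startup script passes --web.enable-lifecycle, which the project's flag help describes as enabling reload via an HTTP request. The following is a minimal sketch of triggering such a reload after editing the configuration file. The Prometheus-style `/-/reload` path is an assumption that should be checked against the installed version, and `reload_webhook` and `run` are hypothetical helper names. Setting DRY_RUN=1 prints the command instead of executing it.

```shell
#!/bin/bash
# run: print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# reload_webhook is a hypothetical helper.
# Assumption: the webhook exposes a Prometheus-style /-/reload endpoint
# when started with --web.enable-lifecycle.
reload_webhook() {
  local host="${1:-127.0.0.1}" port="${2:-8060}"
  run curl -sS -X POST "http://${host}:${port}/-/reload"
}
```

For example, `DRY_RUN=1 reload_webhook` only prints the curl command, while `reload_webhook` sends the request to port 8060 on the local machine.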

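3. Create the webhook configuration file

more /data/webhook-dingtalk_config.yml

Note: the webhook only reads the file passed to --config.file in the startup script (/data/webhook-dingtalk/jms_config.yml there), so the two paths must refer to the same file.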

## Request timeout
# timeout: 5s

## Uncomment following line in order to write template from scratch (be careful!)
#no_builtin_template: true

## Customizable templates path
#templates:
#  - contrib/templates/legacy/template.tmpl

## You can also override default template using `default_message`
## The following example to use the legacy template from v0.3.0
#default_message:
#  title: '{{ template "legacy.title" . }}'
#  text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
## Each target also needs a `url:` entry holding its DingTalk robot webhook URL.
targets:
  webhook1:
    # secret for signature
    secret: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  #webhook2:
  webhook_legacy:
    secret: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    # Customize template content
    message:
      # Use legacy template
      title: '{{ template "legacy.title" . }}'
      text: '{{ template "legacy.content" . }}'
  #webhook_mention_all:
  #  mention:
  #    all: true
  webhook_mention_users:
    secret: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    mention:
      mobiles: [XXXXXXXXXXXX]
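The `secret` in each target is the "secret for signature" that the webhook uses to sign its requests to DingTalk. As a rough illustration of what that signing involves, the sketch below follows DingTalk's documented custom-robot scheme: an HMAC-SHA256 over the string "timestamp, newline, secret", keyed with the secret and then Base64-encoded. The helper name `dingtalk_sign` and the sample values are hypothetical, and the sketch assumes `openssl` is installed. The webhook performs this step itself, so the code is only useful for understanding or debugging.

```shell
#!/bin/bash
# dingtalk_sign is a hypothetical helper, not part of prometheus-webhook-dingtalk.
# Arguments: timestamp in milliseconds, robot secret.
dingtalk_sign() {
  local timestamp_ms="$1" secret="$2"
  # The string to sign is "<timestamp>\n<secret>", keyed with the secret itself.
  printf '%s\n%s' "$timestamp_ms" "$secret" \
    | openssl dgst -sha256 -hmac "$secret" -binary \
    | openssl base64
}
```

A call such as `dingtalk_sign 1700000000000 "SECexample"` prints the Base64 signature. Before it is appended to the robot URL as the `sign` query parameter, it still has to be URL-encoded, which this sketch leaves out.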

4. Configure alertmanager.yml

more /data/dm-deploy/alertmanager-9093/conf/alertmanager.yml

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: "localhost:25"
  smtp_from: "alertmanager@example.org"
  smtp_auth_username: "alertmanager"
  smtp_auth_password: "password"
  # smtp_require_tls: true

  # The Slack webhook URL.
  # slack_api_url:

route:
  # A default receiver
  receiver: "webhook"

  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ["env", "instance", "alertname", "type", "group", "job"]

  # When a new group of alerts is created by an incoming alert, wait at
  # least group_wait to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait group_interval to send a batch
  # of new alerts that started firing for that group.
  group_interval: 3m

  # If an alert has successfully been sent, wait repeat_interval to
  # resend them.
  repeat_interval: 3m

  routes:
  # - match:
  #   receiver: webhook-kafka-adapter
  #   continue: true
  # - match:
  #     env: test-cluster
  #   receiver: db-alert-slack
  # - match:
  #     env: test-cluster
  #   receiver: db-alert-email

# The IP address in the URL below is the address of the machine where the webhook is deployed.
receivers:
- name: webhook
  webhook_configs:
  - send_resolved: true
    url: http://XX.XX.XX.:8060/dingtalk/webhook1/send

#- name: db-alert-slack
#  slack_configs:
#  - channel: '#alerts'
#    username: 'db-alert'
#    icon_emoji: ':bell:'
#    title: '{{ .CommonLabels.alertname }}'
#    text: '{{ .CommonAnnotations.summary }} {{ .CommonAnnotations.description }} expr: {{ .CommonLabels.expr }} http://172.0.0.1:9093/#/alerts'

# - name: "db-alert-email"
#   email_configs:
#   - send_resolved: true
#     to: "example@example.com"

# This doesn't alert anything, please configure your own receiver
#- name: "blackhole"
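Before relying on real alerts, it can help to post a hand-made notification to the webhook and confirm that a message reaches DingTalk. The sketch below builds a minimal payload in the shape of Alertmanager's webhook format (version "4") and posts it to the `webhook1` target path used in the receiver above. The helper names, the placeholder hostnames ending in `.invalid`, and the alert contents are all hypothetical, and the chosen template may expect additional labels or annotations.

```shell
#!/bin/bash
# build_test_payload and send_test_alert are hypothetical helper names.

# Print a minimal Alertmanager-style webhook payload.
# Arguments: alert name, optional status (firing or resolved).
build_test_payload() {
  local alertname="$1" status="${2:-firing}"
  cat <<EOF
{
  "version": "4",
  "status": "${status}",
  "receiver": "webhook",
  "groupLabels": {"alertname": "${alertname}"},
  "commonLabels": {"alertname": "${alertname}"},
  "commonAnnotations": {"summary": "manual test alert"},
  "externalURL": "http://alertmanager.invalid:9093",
  "alerts": [
    {
      "status": "${status}",
      "labels": {"alertname": "${alertname}"},
      "annotations": {"summary": "manual test alert"},
      "startsAt": "2024-01-01T00:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus.invalid:9090/graph"
    }
  ]
}
EOF
}

# Post the payload to the webhook1 target on the given host (port 8060).
send_test_alert() {
  local host="$1"
  build_test_payload "ManualWebhookTest" \
    | curl -sS -X POST -H 'Content-Type: application/json' \
        --data-binary @- "http://${host}:8060/dingtalk/webhook1/send"
}
```

Running `send_test_alert <webhook-host>` from the Alertmanager machine also checks that port 8060 on the webhook machine is reachable from there.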

5. Create the systemd unit so the webhook starts at boot

more /etc/systemd/system/prometheus-webhook.service

[Unit]
Description=prometheus-webhook service
After=syslog.target network.target remote-fs.target nss-lookup.target

[Service]
LimitNOFILE=1000000
LimitSTACK=10485760
User=tidb
ExecStart=/data/webhook-dingtalk/webhook-dingtalk.sh
Restart=always
RestartSec=15s

[Install]
WantedBy=multi-user.target

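6. Start the webhook

# Reload systemd so it picks up the new unit file, then enable the service at boot
sudo systemctl daemon-reload
sudo systemctl enable prometheus-webhook.service

# Start the webhook
sudo systemctl start prometheus-webhook.service

# Stop the webhook
sudo systemctl stop prometheus-webhook.service

# Check the service status
sudo systemctl status -l prometheus-webhook.service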

7. Restart Alertmanager so the alert configuration takes effect

tiup cluster stop tidb-test -N x:9093
tiup cluster start tidb-test -N x:9093

# Check the status after the restart
tiup cluster display tidb-test -N x:9093

8. Example alert message

[FIRING:1] tidb_tikvclient_backoff_seconds_count

Alerts Firing

TiDB tikvclient_backoff_count error

Description: cluster: tidb-test, instance: xxxx:10081, values:253.33333333333331

Graph:

Details:

alertname: tidb_tikvclient_backoff_seconds_count

cluster: tidb-test

env: tidb-test

expr: increase( tidb_tikvclient_backoff_seconds_count[10m] ) > 10

instance: xxxx:10081

job: tidb

level: warning

monitor: prometheus

type: regionMiss

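9. Notes

Note that TiUP overwrites the configuration of the monitoring components with its own parameters. If you edit a monitoring component's configuration file directly, the change may be overwritten by TiUP during operations such as deploy, scale-out, scale-in, and reload, so it will not take effect.

To make a custom Alertmanager configuration persist, set config_file under alertmanager_servers in the cluster topology.

config_file: this field specifies a local file that is transferred to the target machine during the cluster configuration initialization phase and used as the Alertmanager configuration.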

alertmanager_servers:
  - host: 172.16.201.64
    ssh_port: 22
    web_port: 9093
    cluster_port: 9094
    deploy_dir: /data1/tidb-deploy/alertmanager-9093
    data_dir: /data1/tidb-data/alertmanager-9093
    log_dir: /data1/tidb-deploy/alertmanager-9093/log
    arch: amd64
    os: linux
    config_file: /data1/tidb-deploy/alertmanager-9093/conf/alertmanager_test.yml
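One way to apply a topology change like this is to edit the cluster configuration through TiUP and then reload only the Alertmanager role. The sketch below wraps the two commands in a hypothetical helper, `apply_alertmanager_config`, with a DRY_RUN switch for previewing them. It assumes that `tiup cluster reload` with `-R alertmanager` regenerates and pushes the Alertmanager configuration from `config_file`, which is worth confirming on a test cluster first.

```shell
#!/bin/bash
# run: print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# apply_alertmanager_config is a hypothetical helper.
# Argument: the TiUP cluster name.
apply_alertmanager_config() {
  local cluster="$1"
  # Opens the topology in an editor; add config_file under alertmanager_servers there.
  run tiup cluster edit-config "$cluster"
  # Assumption: reloading the alertmanager role pushes the file named by config_file.
  run tiup cluster reload "$cluster" -R alertmanager
}
```

For example, `DRY_RUN=1 apply_alertmanager_config tidb-test` prints both commands without touching the cluster.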

2. Modify alert rules

1. Find the relevant alert rule in the Prometheus conf directory

[tidb@vm172-16-201-64 ~]$ cd /data/tidb-deploy/prometheus-9090/conf/
[tidb@vm172-16-201-64 conf]$ ll
total 96
-rw-r--r-- 1 tidb tidb  3500 Jun 28 15:34 binlog.rules.yml
-rw-r--r-- 1 tidb tidb  4492 Jun 28 15:34 blacker.rules.yml
-rw-r--r-- 1 tidb tidb    37 Jun 28 15:34 bypass.rules.yml
-rw-r--r-- 1 tidb tidb  1964 Jun 28 15:34 kafka.rules.yml
-rw-r--r-- 1 tidb tidb   459 Jun 28 15:34 lightning.rules.yml
-rw-r--r-- 1 tidb tidb   507 Jun 28 15:34 ngmonitoring.toml
-rw-r--r-- 1 tidb tidb  5214 Jun 28 15:34 node.rules.yml
-rw-r--r-- 1 tidb tidb  7920 Jun 28 15:34 pd.rules.yml
-rw-r--r-- 1 tidb tidb  6199 Jun 28 15:34 prometheus.yml
-rw-r--r-- 1 tidb tidb  6507 Jun 28 15:34 ticdc.rules.yml
-rw-r--r-- 1 tidb tidb  6271 Jun 28 15:34 tidb.rules.yml
-rw-r--r-- 1 tidb tidb  3112 Jun 28 15:34 tiflash.rules.yml
-rw-r--r-- 1 tidb tidb  4685 Jun 28 15:34 tikv.accelerate.rules.yml
-rw-r--r-- 1 tidb tidb 13977 Jun 28 15:34 tikv.rules.yml
[tidb@vm172-16-201-64 conf]$

2. Back up the file, then edit the alert rule

cp tidb.rules.yml tidb.rules.yml_20220628

vi tidb.rules.yml
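The backup and edit steps can be scripted so that a broken rules file is caught before Prometheus is restarted. The sketch below dates the backup automatically and then validates the edited file with `promtool check rules`. The helper names are hypothetical, and it assumes a `promtool` binary matching the deployed Prometheus version is available on the PATH, which a TiUP deployment does not guarantee.

```shell
#!/bin/bash
# backup_rules_file and check_rules_file are hypothetical helper names.

# Copy the rules file to <file>_<YYYYMMDD>, keeping mode and timestamps,
# and print the backup path.
backup_rules_file() {
  local file="$1"
  local backup
  backup="${file}_$(date +%Y%m%d)"
  cp -p "$file" "$backup" && echo "$backup"
}

# Validate the rules file syntax.
# Assumption: promtool is installed and on the PATH.
check_rules_file() {
  promtool check rules "$1"
}
```

A typical sequence would be `backup_rules_file tidb.rules.yml`, then editing the file, then `check_rules_file tidb.rules.yml` before the restart in the next step.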

3. Restart Prometheus so the change takes effect

tiup cluster stop tidb-jms -N x:9090
tiup cluster start tidb-jms -N x:9090

# Check the status after the restart
tiup cluster display tidb-jms -N x:9090

4. Temporary silencing

https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/alert/alert-manager-inhibit

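Users or administrators can temporarily mute specific alert notifications directly from the Alertmanager UI. A silence is defined by label matchers (exact strings or regular expressions). If a new alert notification matches a silence rule, Alertmanager stops sending it to the receiver.

In the Alertmanager UI, click "New Silence" to open the form for creating a silence rule.

1. Create a silence rule

In this form you can set the start time and duration of the new silence rule and add one or more matchers (string or regular-expression matches) in the Matchers section. Fill in the creator and the reason for the silence, then click "Create".

"Preview Alerts" shows the alerts that the current matchers would match. Once the silence rule has been created, Alertmanager loads it and sets its state to Pending. When the rule takes effect, it moves to the Active state.

(Figure: active silence rules)

While a silence rule is active, the alerts it matches no longer appear on the Alertmanager Alerts page.

(Figure: alert information)

For a rule that is already in effect, you can click the "Expire" button to make it expire manually.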
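The same kind of silence can also be created without the UI, which is convenient for planned maintenance windows. The sketch below builds the JSON body for Alertmanager's v2 API and posts it to `/api/v2/silences`. The helper names are hypothetical, the matcher is a plain (non-regex) match on `alertname`, and the arguments are inserted into the JSON without escaping, so they must not contain double quotes or backslashes.

```shell
#!/bin/bash
# build_silence_json and create_silence are hypothetical helper names.

# Print the JSON body for a silence that matches one alert name exactly.
# Arguments: alert name, start time, end time (both RFC 3339), author, comment.
# The values are not JSON-escaped; avoid double quotes and backslashes.
build_silence_json() {
  local alertname="$1" starts_at="$2" ends_at="$3" author="$4" comment="$5"
  cat <<EOF
{
  "matchers": [
    {"name": "alertname", "value": "${alertname}", "isRegex": false}
  ],
  "startsAt": "${starts_at}",
  "endsAt": "${ends_at}",
  "createdBy": "${author}",
  "comment": "${comment}"
}
EOF
}

# Post the silence to Alertmanager.
# First argument: base URL such as http://<alertmanager-host>:9093;
# the remaining arguments are passed to build_silence_json.
create_silence() {
  local am_url="$1"
  shift
  build_silence_json "$@" \
    | curl -sS -X POST -H 'Content-Type: application/json' \
        --data-binary @- "${am_url}/api/v2/silences"
}
```

On a Linux host with GNU date, a two-hour window starting now could be expressed by passing `"$(date -u +%Y-%m-%dT%H:%M:%SZ)"` as the start time and `"$(date -u -d '+2 hours' +%Y-%m-%dT%H:%M:%SZ)"` as the end time.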
