High-Availability Monitoring Components for TiDB Across Two Same-City Data Centers

Reader submission, 2023-11-24

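Background

When a TiDB DR Auto-Sync cluster is deployed across two data centers, the monitoring stack needs to be highly available as well. To achieve this, an independent set of Prometheus + Alertmanager + Grafana is deployed in each of the two physically separate data centers, so that either set can be used on its own.

This architecture has to address two issues: keeping the data collected by the two monitoring stacks consistent, and preventing the same alert from being sent more than once.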

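Implementation Approach

The two Prometheus instances each scrape and store the cluster's monitoring data independently;

Each Grafana instance uses the Prometheus in its own data center as its data source;

Alertmanager is configured as a cluster. Through its gossip mechanism, when several Alertmanager instances receive the same alert event, only one of them sends the notification.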

Simulated Implementation

Test environment

TiDB v7.1.0 LTS

Deploying two monitoring stacks for a single cluster

# # Server configs are used to specify the configuration of Prometheus Server.
monitoring_servers:
  - host: 30.0.100.40
    port: 9091
    deploy_dir: "/tidb/tidb-deploy/prometheus-8249"
    data_dir: "/data/tidb-data/prometheus-8249"
    log_dir: "/data/tidb-deploy/prometheus-8249/log"
  - host: 30.0.100.42
    port: 9091
    deploy_dir: "/tidb/tidb-deploy/prometheus-8249"
    data_dir: "/data/tidb-data/prometheus-8249"
    log_dir: "/data/tidb-deploy/prometheus-8249/log"
# # Server configs are used to specify the configuration of Grafana Servers.
grafana_servers:
  - host: 30.0.100.40
    deploy_dir: /data/tidb-deploy/grafana-3000
  - host: 30.0.100.42
    deploy_dir: /data/tidb-deploy/grafana-3000
# # Server configs are used to specify the configuration of Alertmanager Servers.
alertmanager_servers:
  - host: 30.0.100.40
    deploy_dir: "/data/tidb-deploy/alertmanager-9093"
    data_dir: "/data/tidb-data/alertmanager-9093"
    log_dir: "/data/tidb-deploy/alertmanager-9093/log"
  - host: 30.0.100.42
    deploy_dir: "/data/tidb-deploy/alertmanager-9093"
    data_dir: "/data/tidb-data/alertmanager-9093"
    log_dir: "/data/tidb-deploy/alertmanager-9093/log"

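Adjusting the monitoring data path

Adjust the Grafana datasource

Check the Prometheus configuration and confirm that the Alertmanager targets are set

Log in to Alertmanager and confirm that the Alertmanager instances have formed a cluster (TiDB configures this automatically)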
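
Besides checking in the web UI, cluster membership can also be checked programmatically. The sketch below parses a response shaped like Alertmanager's `GET /api/v2/status` and verifies that the cluster is ready with the expected number of peers. The function name `checkCluster` and the peer names in the sample body are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clusterStatus mirrors only the part of the /api/v2/status response used here.
type clusterStatus struct {
	Cluster struct {
		Status string `json:"status"`
		Peers  []struct {
			Name    string `json:"name"`
			Address string `json:"address"`
		} `json:"peers"`
	} `json:"cluster"`
}

// checkCluster reports whether the status body shows a ready cluster
// with at least wantPeers members.
func checkCluster(body []byte, wantPeers int) (bool, error) {
	var s clusterStatus
	if err := json.Unmarshal(body, &s); err != nil {
		return false, err
	}
	return s.Cluster.Status == "ready" && len(s.Cluster.Peers) >= wantPeers, nil
}

func main() {
	// Sample body for illustration; a live check would fetch it from
	// http://30.0.100.40:9093/api/v2/status instead.
	sample := []byte(`{"cluster":{"status":"ready","peers":[
		{"name":"peer-a","address":"30.0.100.40:9094"},
		{"name":"peer-b","address":"30.0.100.42:9094"}]}}`)
	ok, err := checkCluster(sample, 2)
	fmt.Println(ok, err)
}
```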

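The existing HAProxy + Keepalived setup should be reused as a reverse proxy in front of the Prometheus instances, and the dashboard's Prometheus data source changed to point at the proxy, so that the failure of a single Prometheus does not affect the dashboard.

HAProxy configuration omitted.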
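
Since the HAProxy configuration is not shown, the following Go sketch only illustrates the failover logic such a proxy applies. It is not a substitute for HAProxy. It probes each Prometheus on its `/-/healthy` endpoint and routes to the first backend that responds. The function names `healthy` and `pickBackend` are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// healthy probes Prometheus's /-/healthy endpoint and reports whether it returned 200.
func healthy(baseURL string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(baseURL + "/-/healthy")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// pickBackend returns the first backend, in list order, for which probe succeeds.
func pickBackend(backends []string, probe func(string) bool) (string, bool) {
	for _, b := range backends {
		if probe(b) {
			return b, true
		}
	}
	return "", false
}

func main() {
	backends := []string{"http://30.0.100.40:9091", "http://30.0.100.42:9091"}
	// Stub probe simulating an outage of the first data center;
	// pass healthy instead to probe the real endpoints.
	firstDown := func(u string) bool { return u != backends[0] }
	if b, ok := pickBackend(backends, firstDown); ok {
		fmt.Println("route queries to", b)
	}
}
```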

The dashboard configuration is as follows (screenshot omitted).

Webhook implementation

Write a Go program that converts Alertmanager webhook calls into Feishu API requests

(Omitted)

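Test with an HTTP API testing tool to confirm that the Feishu webhook adapter receives and parses the alert event. The request body is shown first, followed by the adapter's log output: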

{
  "version": "4",
  "groupKey": "123333",
  "status": "firing",
  "receiver": "target",
  "groupLabels": {"group":"group1"},
  "commonLabels": {"server":"test"},
  "commonAnnotations": {"server":"test"},
  "externalURL": "http://30.0.100.40:3000",
  "alerts": [
    {
      "labels": {"server":"test"},
      "annotations": {"server":"test"},
      "startsAt": "2023-08-12T07:20:50.52Z",
      "endsAt": "2023-08-12T09:20:50.52Z"
    }
  ]
}

2023/08/20 10:40:20 172.31.0.4 - {"version":"4","groupKey":"123333","status":"firing","Receiver":"target","GroupLabels":{"group":"group1"},"CommonLabels":{"server":"test"},"CommonAnnotations":{"server":"test"},"ExternalURL":"http://30.0.100.40:3000","Alerts":[{"labels":{"server":"test"},"annotations":{"server":"test"},"startsAt":"2023-08-12T07:20:50.52Z","endsAt":"2023-08-12T09:20:50.52Z"}]}
[GIN] 2023/08/20 - 10:40:20 | 200 | 621.879µs | 172.31.0.4 | POST "/alert-feishu"

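Configuring the Alertmanager webhook

Write an Alertmanager configuration file template that adds the receiver and webhook definitions, and store it at a path on the tiup control machine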

  # nested under the top-level route: block
  routes:
  - match:
    receiver: webhook-feishu-adapter
    continue: true
receivers:
- name: webhook-feishu-adapter
  webhook_configs:
  - send_resolved: true
    url: http://30.0.100.42:9999/alert-feishu

Use tiup cluster edit-config to add config_file under alertmanager_servers, pointing to the Alertmanager configuration file written in the previous step

alertmanager_servers:
  - host: 30.0.100.40
    ssh_port: 22
    web_port: 9093
    cluster_port: 9094
    deploy_dir: /data/tidb-deploy/alertmanager-9093
    data_dir: /data/tidb-data/alertmanager-9093
    log_dir: /data/tidb-deploy/alertmanager-9093/log
    arch: arm64
    os: linux
    config_file: /home/tidb/monitor-template/alert_config_40.yaml
  - host: 30.0.100.42
    ssh_port: 22
    web_port: 9093
    cluster_port: 9094
    deploy_dir: /data/tidb-deploy/alertmanager-9093
    data_dir: /data/tidb-data/alertmanager-9093
    log_dir: /data/tidb-deploy/alertmanager-9093/log
    arch: arm64
    os: linux
    config_file: /home/tidb/monitor-template/alert_config_42.yaml

Trigger an alert and confirm that it does not produce duplicate notifications

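Shut down the monitoring components in one of the data centers and confirm that alerts are still delivered

Restart the TiDB component that was stopped in the previous step and confirm that the alert resolution is delivered

(The recovery time is missing here because of a bug in the webhook code, which does not reference it)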

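Conclusion

In a multi-data-center environment, the monitoring components need high availability just as the cluster itself does. This article approaches the problem from the perspective of using monitoring across data centers and consolidating alerts, and builds a highly available deployment of cluster monitoring across two centers.

Questions and discussion are welcome.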

References:

https://www.prometheus.wang/ha/alertmanager-high-availability.html

https://prometheus.io/docs/alerting/latest/overview/
