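Scaling out Prometheus and migrating its data
Background:
In day-to-day TiDB cluster operations, scaling the cluster out or in is very convenient. For the Prometheus monitoring component, however, the official documentation does not describe a way to migrate it together with its historical monitoring data. The steps below are summarized from hands-on practice; if you know a better approach, feel free to share it.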
1. Scale out a new Prometheus
tiup cluster scale-out tidb-dev ./scale-out-prom.yaml
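The scale-out file itself is not shown above. A minimal sketch of what it might contain follows, assuming the new instance uses port 9690 and the data directory /data1/tidb-data/prometheus-9690 that appear in the later listings. The host IP is a hypothetical placeholder, not a value from this environment.

```shell
# Sketch only: writes a TiUP scale-out topology for one extra Prometheus.
# host is a HYPOTHETICAL placeholder; port and data_dir are taken from the
# directory listings later in this post. Adjust all three before use.
cat > ./scale-out-prom.yaml <<'EOF'
monitoring_servers:
  - host: 10.0.1.15
    port: 9690
    data_dir: /data1/tidb-data/prometheus-9690
EOF

# The file is then passed to the scale-out command shown above.
```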
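2. Change the Prometheus source used by the monitoring component (Grafana) and Dashboard
Grafana reports an error
Solution:
Modify the Grafana data source
Select the current data source
Change it to the address of the newly scaled-out Prometheus
Dashboard reports an error
Solution:
Click to modify the Prometheus source and select the custom source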
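Before repointing Grafana and Dashboard, it may help to confirm that the new Prometheus actually answers. The sketch below polls Prometheus's readiness endpoint (/-/ready). The function name is made up for illustration, and the host and port you pass are whatever your scale-out used.

```shell
# Sketch only: returns 0 when the Prometheus at host:port reports ready.
# check_prometheus_ready is a hypothetical helper name, not a TiUP command.
check_prometheus_ready() {
  host=$1
  port=$2
  # -f makes curl fail on HTTP errors; --max-time avoids hanging forever.
  curl -fsS --max-time 5 "http://${host}:${port}/-/ready" >/dev/null
}

# Usage (host is a placeholder):
#   check_prometheus_ready <new-prometheus-host> 9690 && echo "ready"
```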
3. Stop the old Prometheus
tiup cluster stop tidb-prd -N 10.2.103.116:9590
4. Back up the old Prometheus data
[tidb@vm116 tidb-data]$ mv prometheus-9590/ prometheus-9590_bak
[tidb@vm116 tidb-data]$ mkdir prometheus-9590
[tidb@vm116 tidb-data]$ ll
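The rename-and-recreate step above can be wrapped with a couple of guards so that an existing backup is never overwritten. This is a sketch under the same directory naming as above (the _bak suffix); the function name is hypothetical.

```shell
# Sketch only: move a Prometheus data dir to <dir>_bak and recreate it empty.
# backup_prom_dir is a hypothetical helper; run it only after Prometheus is stopped.
backup_prom_dir() {
  data_dir=${1%/}
  bak="${data_dir}_bak"
  if [ ! -d "$data_dir" ]; then
    echo "data dir not found: $data_dir" >&2
    return 1
  fi
  if [ -e "$bak" ]; then
    echo "backup already exists, refusing to overwrite: $bak" >&2
    return 1
  fi
  mv "$data_dir" "$bak" && mkdir "$data_dir"
}

# Usage:
#   backup_prom_dir /data1/tidb-data/prometheus-9590
```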
5. Copy the old monitoring data to the new Prometheus
drwxr-xr-x 3 tidb tidb  4096 Apr 13 19:00 01GXX4BEMERFC450HMF20P43XZ
drwxr-xr-x 3 tidb tidb  4096 Apr 14 11:00 01GXYV98210PZBRFFDBY5H8GEC
drwxr-xr-x 3 tidb tidb  4096 Apr 15 05:00 01GY0S2SMKP8R40QTY9TZ2E4Z1
drwxr-xr-x 3 tidb tidb  4096 Apr 15 11:00 01GY1DNYYKXGA1SSQD5Q31T3Y6
drwxr-xr-x 3 tidb tidb  4096 Apr 15 13:00 01GY1MHNTZAXMFFA9C8TRK2ZX0
drwxr-xr-x 3 tidb tidb  4096 Apr 15 13:00 01GY1MHP8RCK1J80EGF0DGF9R4
drwxr-xr-x 2 tidb tidb  4096 Apr 15 13:00 chunks_head
drwx------ 2 tidb tidb  4096 Apr 15 13:57 docdb
-rw-r--r-- 1 tidb tidb     0 Apr 13 10:05 lock
-rw-r--r-- 1 tidb tidb 20001 Apr 15 13:56 queries.active
drwxr-xr-x 8 tidb tidb  4096 Apr 15 13:57 tsdb
drwxr-xr-x 3 tidb tidb  4096 Apr 15 13:00 wal
[tidb@vm115 prometheus-9590]$ cp -r 01G* /data1/tidb-data/prometheus-9690/
[tidb@vm115 prometheus-9590]$ pwd
/data1/tidb-data/prometheus-9590
[tidb@vm115 prometheus-9590]$ ll
total 24
drwxr-xr-x 3 tidb tidb 4096 Apr 13 09:51 01GXV6HXACPXR3S4MSKR273MKR
drwxr-xr-x 3 tidb tidb 4096 Apr 13 09:51 01GXVM9BKBDRW0D6WYGZC2RQZV
drwxr-xr-x 3 tidb tidb 4096 Apr 13 09:51 01GXVM9CJPYDCXY8Q2HDAJH985
drwxr-xr-x 3 tidb tidb 4096 Apr 13 09:51 01GXVV52V0AH0AG7S4CVYF2SBD
drwxr-xr-x 3 tidb tidb 4096 Apr 13 09:51 01GXW20T31GE5JF2S2H8FH05XF
-rw-r--r-- 1 tidb tidb    0 Apr 13 09:51 lock
drwxr-xr-x 3 tidb tidb 4096 Apr 13 09:51 wal
[tidb@vm115 prometheus-9590]$
[tidb@vm115 prometheus-9690]$ pwd
/data1/tidb-data/prometheus-9690
[tidb@vm115 prometheus-9690]$ ll
total 60
drwxr-xr-x 3 tidb tidb  4096 Apr 15 14:14 01GXX4BEMERFC450HMF20P43XZ
drwxr-xr-x 3 tidb tidb  4096 Apr 15 14:14 01GXYV98210PZBRFFDBY5H8GEC
drwxr-xr-x 3 tidb tidb  4096 Apr 15 14:14 01GY0S2SMKP8R40QTY9TZ2E4Z1
drwxr-xr-x 3 tidb tidb  4096 Apr 15 14:14 01GY1DNYYKXGA1SSQD5Q31T3Y6
drwxr-xr-x 3 tidb tidb  4096 Apr 15 14:14 01GY1MHNTZAXMFFA9C8TRK2ZX0
drwxr-xr-x 3 tidb tidb  4096 Apr 15 14:14 01GY1MHP8RCK1J80EGF0DGF9R4
drwxr-xr-x 2 tidb tidb  4096 Apr 15 14:09 chunks_head
drwx------ 2 tidb tidb  4096 Apr 15 14:09 docdb
-rw-r--r-- 1 tidb tidb     0 Apr 15 13:56 lock
-rw-r--r-- 1 tidb tidb 20001 Apr 15 14:16 queries.active
drwxr-xr-x 8 tidb tidb  4096 Apr 15 14:09 tsdb
drwxr-xr-x 2 tidb tidb  4096 Apr 15 14:09 wal
[tidb@vm115 prometheus-9690]$
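The cp -r 01G* above relies on the block directory names starting with 01G, which will not hold for blocks created later. A sketch of an alternative is below: it copies every subdirectory that contains a top-level meta.json, which is how a completed TSDB block can be recognised, and skips anything already present in the target. The function name is hypothetical, and it assumes that wal, chunks_head, tsdb and docdb carry no top-level meta.json.

```shell
# Sketch only: copy completed TSDB blocks (directories holding meta.json)
# from an old Prometheus data dir into a new one, preserving timestamps.
# copy_blocks is a hypothetical helper name.
copy_blocks() {
  src=${1%/}
  dst=${2%/}
  for block in "$src"/*/; do
    block=${block%/}
    # Not a completed block (wal, chunks_head, ...): leave it alone.
    [ -f "$block/meta.json" ] || continue
    name=$(basename "$block")
    if [ -e "$dst/$name" ]; then
      echo "skip $name: already present in $dst" >&2
      continue
    fi
    cp -a "$block" "$dst/$name" && echo "copied $name"
  done
}

# Usage:
#   copy_blocks /data1/tidb-data/prometheus-9590 /data1/tidb-data/prometheus-9690
```

Only completed blocks are copied here, so samples that were still in the old instance's head (not yet written out as a block) are not carried over.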
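Check the result in Grafana
Problem: Prometheus keeps some of the most recent data in memory, so even after the old monitoring data is imported, a stretch of data is still missing. No solution has been found for this yet.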
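One direction that might narrow this gap, not tried in this environment: Prometheus has a TSDB admin snapshot endpoint that writes all current data, including what is only in the head, into a snapshots/ subdirectory of the data dir. It has to be called while the old Prometheus is still running, and it only works when Prometheus was started with --web.enable-admin-api. Whether a TiUP-deployed Prometheus enables that flag is an assumption you would need to check. The function name below is hypothetical.

```shell
# Sketch only: ask a running Prometheus to snapshot its TSDB, head included.
# Requires --web.enable-admin-api on the target (ASSUMPTION: verify this on
# your deployment). snapshot_prometheus is a hypothetical helper name.
snapshot_prometheus() {
  host=$1
  port=$2
  # skip_head=false asks for data that has not been compacted to disk yet.
  curl -fsS -X POST \
    "http://${host}:${port}/api/v1/admin/tsdb/snapshot?skip_head=false"
}

# Usage, before stopping the old instance:
#   snapshot_prometheus 10.2.103.116 9590
# On success the JSON response names a directory under <data_dir>/snapshots/.
```

The block directories inside that snapshot could then be copied to the new instance in the same way as in step 5.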
If restarting Prometheus fails with the following error
level=info ts=2023-04-13T01:59:58.473663032Z caller=main.go:640 msg="Starting TSDB ..."
level=info ts=2023-04-13T01:59:58.473702757Z caller=web.go:418 component=web msg="Start listening for connections" address=:9590
level=info ts=2023-04-13T01:59:58.473915597Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1681288029295 maxt=1681300800000 ulid=01GXV6HXACPXR3S4MSKR273MKR
level=info ts=2023-04-13T01:59:58.473956652Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1681322400000 maxt=1681329600000 ulid=01GXVM9BKBDRW0D6WYGZC2RQZV
level=info ts=2023-04-13T01:59:58.473993214Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1681300800000 maxt=1681322400000 ulid=01GXVM9CJPYDCXY8Q2HDAJH985
level=info ts=2023-04-13T01:59:58.474019427Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1681329600000 maxt=1681336800000 ulid=01GXVV52V0AH0AG7S4CVYF2SBD
level=info ts=2023-04-13T01:59:58.474186922Z caller=main.go:509 msg="Stopping scrape discovery manager..."
level=info ts=2023-04-13T01:59:58.474200027Z caller=main.go:523 msg="Stopping notify discovery manager..."
level=info ts=2023-04-13T01:59:58.474205361Z caller=main.go:545 msg="Stopping scrape manager..."
level=info ts=2023-04-13T01:59:58.474212946Z caller=main.go:539 msg="Scrape manager stopped"
level=info ts=2023-04-13T01:59:58.474224191Z caller=main.go:505 msg="Scrape discovery manager stopped"
level=info ts=2023-04-13T01:59:58.474232013Z caller=main.go:519 msg="Notify discovery manager stopped"
level=info ts=2023-04-13T01:59:58.474241846Z caller=manager.go:736 component="rule manager" msg="Stopping rule manager..."
level=info ts=2023-04-13T01:59:58.474253972Z caller=manager.go:742 component="rule manager" msg="Rule manager stopped"
level=info ts=2023-04-13T01:59:58.474264935Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
level=info ts=2023-04-13T01:59:58.474272305Z caller=main.go:708 msg="Notifier manager stopped"
level=error ts=2023-04-13T01:59:58.474455533Z caller=main.go:717 err="opening storage failed: get segment range: segments are not sequential"
^C
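The "segments are not sequential" error refers to the WAL segment files, which are named with zero-padded 8-digit sequence numbers (00000000, 00000001, ...). A small sketch for spotting where the numbering breaks is below. The function name is hypothetical, and it only looks at segment files, not at checkpoint directories.

```shell
# Sketch only: report gaps in WAL segment numbering under a wal directory.
# Returns 0 when the segments are consecutive, non-zero otherwise.
# find_wal_gaps is a hypothetical helper name.
find_wal_gaps() {
  wal_dir=$1
  prev=""
  prev_name=""
  gaps=0
  for seg in $(ls "$wal_dir" | grep -E '^[0-9]{8}$' | sort); do
    # Strip leading zeros so the shell does not read the name as octal.
    n=$(echo "$seg" | sed 's/^0*//')
    n=${n:-0}
    if [ -n "$prev" ] && [ "$n" -ne $((prev + 1)) ]; then
      echo "gap: segment $seg follows $prev_name"
      gaps=$((gaps + 1))
    fi
    prev=$n
    prev_name=$seg
  done
  [ "$gaps" -eq 0 ]
}

# Usage:
#   find_wal_gaps /data1/tidb-data/prometheus-9590/wal
```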
# Solution
rm -rf /data1/tidb-data/prometheus-9590/wal
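Deleting the WAL also discards any samples in it that have not yet been written out as a block. A slightly more cautious sketch is to move the directory aside instead, so it can still be inspected or restored; Prometheus creates a fresh wal directory on the next start. The function name and the backup suffix are made up for illustration.

```shell
# Sketch only: move the wal directory aside instead of deleting it.
# Prints the path of the moved copy. set_aside_wal is a hypothetical helper;
# run it only while Prometheus is stopped.
set_aside_wal() {
  data_dir=${1%/}
  wal="$data_dir/wal"
  if [ ! -d "$wal" ]; then
    echo "no WAL directory at $wal" >&2
    return 1
  fi
  target="${wal}_bak_$(date +%Y%m%d%H%M%S)"
  mv "$wal" "$target" && echo "$target"
}

# Usage:
#   set_aside_wal /data1/tidb-data/prometheus-9590
```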
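Summary:
The overall procedure is fairly simple; the main work is repointing the Grafana and Dashboard sources to the new Prometheus. Hopefully the official documentation will offer a better way to do this.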