Migrating PD and Resolving the Challenge of Stopped CDC Tasks

User submission 400 2024-04-29



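Note: the test environment runs v4.0.15, which is a very old version as far as cdc is concerned and may have quite a few problems. For a production environment, upgrade to a newer version where possible, such as v6.1.6 or v6.5.1; these versions bring large improvements in both performance and features. The problem described below did not occur when tested on v5.4.1, so a recent stable LTS version is recommended.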


Basic cdc commands

# Create a cdc task
tiup ctl:v4.0.15 cdc changefeed create --pd=http://10.2.103.115:32379 --sink-uri="tidb://root:tidb@10.2.103.116:34000/" --changefeed-id="simple-replication-task" --config=cdc.toml
# List cdc tasks and their state
tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
# Query the state of a specific task
tiup ctl:v4.0.15 cdc changefeed query -c simple-replication-task --pd=http://10.2.103.115:32379
# Remove a task
tiup ctl:v4.0.15 cdc changefeed remove -c simple-replication-task --pd=http://10.2.103.115:32379
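Every command above repeats the same ctl version and PD address. A minimal sketch of a wrapper that factors them out follows. The names `PD_ADDR`, `CDC_VERSION`, `TIUP_BIN` and `cdc_changefeed` are illustrative and not part of tiup; only the `tiup ctl:<version> cdc changefeed <subcommand> --pd=<addr>` form is taken from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the variable and function names are assumptions made for
# this sketch. The defaults are the values used in this article.
PD_ADDR="${PD_ADDR:-http://10.2.103.115:32379}"
CDC_VERSION="${CDC_VERSION:-v4.0.15}"
TIUP_BIN="${TIUP_BIN:-tiup}"

# Usage: cdc_changefeed <subcommand> [extra args...]
# Runs the given changefeed subcommand against the configured PD endpoint.
cdc_changefeed() {
  local sub="$1"
  shift
  "$TIUP_BIN" "ctl:${CDC_VERSION}" cdc changefeed "$sub" "$@" --pd="$PD_ADDR"
}
```

With this in place, `cdc_changefeed list` and `cdc_changefeed query -c simple-replication-task` expand to the list and query commands shown above.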

PD status

[tidb@vm115 ~]$ tiup cluster display tidb-dev
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.0
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-dev
Cluster type:       tidb
Cluster name:       tidb-dev
Cluster version:    v4.0.15
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://10.2.103.115:32379/dashboard
Grafana URL:        http://10.2.103.115:7000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------   --------                            ----------
10.2.103.115:9793   alertmanager  10.2.103.115  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.115:9893   alertmanager  10.2.103.115  9893/9894    linux/x86_64  Up       /data1/tidb-data/alertmanager-9893  /data1/tidb-deploy/alertmanager-9893
10.2.103.115:8400   cdc           10.2.103.115  8400         linux/x86_64  Up       /data1/tidb-data/cdc-8400           /data1/tidb-deploy/cdc-8400
10.2.103.115:7000   grafana       10.2.103.115  7000         linux/x86_64  Up       -                                   /data1/tidb-deploy/grafana-7000
10.2.103.115:32379  pd            10.2.103.115  32379/3380   linux/x86_64  Up|L|UI  /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.115:35379  pd            10.2.103.115  35379/3580   linux/x86_64  Up       /data1/tidb-data/pd-35379           /data1/tidb-deploy/pd-35379
10.2.103.115:36379  pd            10.2.103.115  36379/3680   linux/x86_64  Up       /data1/tidb-data/pd-36379           /data1/tidb-deploy/pd-36379
10.2.103.115:9590   prometheus    10.2.103.115  9590/35020   linux/x86_64  Up       /data1/tidb-data/prometheus-9590    /data1/tidb-deploy/prometheus-9590
10.2.103.115:43000  tidb          10.2.103.115  43000/20080  linux/x86_64  Up       -                                   /data1/tidb-deploy/tidb-34000
10.2.103.115:30160  tikv          10.2.103.115  30160/30180  linux/x86_64  Up       /data1/tidb-data/tikv-30160         /data1/tidb-deploy/tikv-30160
Total nodes: 10

cdc task status

[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "normal",
      "tso": 440757324149424131,
      "checkpoint": "2023-04-13 11:15:59.237",
      "error": null
    }
  }
]

Transferring the PD leader

[tidb@vm115 ~]$ tiup ctl:v4.0.15 pd -u http://10.2.103.115:32379 member leader transfer pd-10.2.103.115-35379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl pd -u http://10.2.103.115:32379 member leader transfer pd-10.2.103.115-35379
Success!
[tidb@vm115 ~]$ tiup cluster display tidb-dev
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.0
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-dev
Cluster type:       tidb
Cluster name:       tidb-dev
Cluster version:    v4.0.15
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://10.2.103.115:32379/dashboard
Grafana URL:        http://10.2.103.115:7000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------   --------                            ----------
10.2.103.115:9793   alertmanager  10.2.103.115  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.115:9893   alertmanager  10.2.103.115  9893/9894    linux/x86_64  Up       /data1/tidb-data/alertmanager-9893  /data1/tidb-deploy/alertmanager-9893
10.2.103.115:8400   cdc           10.2.103.115  8400         linux/x86_64  Up       /data1/tidb-data/cdc-8400           /data1/tidb-deploy/cdc-8400
10.2.103.115:7000   grafana       10.2.103.115  7000         linux/x86_64  Up       -                                   /data1/tidb-deploy/grafana-7000
10.2.103.115:32379  pd            10.2.103.115  32379/3380   linux/x86_64  Up|UI    /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.115:35379  pd            10.2.103.115  35379/3580   linux/x86_64  Up|L     /data1/tidb-data/pd-35379           /data1/tidb-deploy/pd-35379
10.2.103.115:36379  pd            10.2.103.115  36379/3680   linux/x86_64  Up       /data1/tidb-data/pd-36379           /data1/tidb-deploy/pd-36379
10.2.103.115:9590   prometheus    10.2.103.115  9590/35020   linux/x86_64  Up       /data1/tidb-data/prometheus-9590    /data1/tidb-deploy/prometheus-9590
10.2.103.115:43000  tidb          10.2.103.115  43000/20080  linux/x86_64  Up       -                                   /data1/tidb-deploy/tidb-34000
10.2.103.115:30160  tikv          10.2.103.115  30160/30180  linux/x86_64  Up       /data1/tidb-data/tikv-30160         /data1/tidb-deploy/tikv-30160
Total nodes: 10
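In the two display listings the `L` flag in the Status column moves from 10.2.103.115:32379 to 10.2.103.115:35379, which is also the node scaled in later in this article. The article does not establish whether removing the current leader is what triggers the cdc error, but it can be useful to know which PD instance is the leader before touching any PD node. The following is a rough sketch that reads the display output and prints the leader's ID. The function name `pd_leader_from_display` is made up for this example, and it assumes the column order shown above (ID, Role, Host, Ports, OS/Arch, Status, ...).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `tiup cluster display <cluster>` output on stdin and
# prints the ID (host:port) of every pd row whose Status column carries the L flag.
# Assumes Status is the 6th whitespace-separated column, as in the listing above.
pd_leader_from_display() {
  awk '$2 == "pd" {
         n = split($6, flags, "|")        # e.g. "Up|L|UI" -> Up, L, UI
         for (i = 1; i <= n; i++)
           if (flags[i] == "L") print $1
       }'
}
```

A possible use is `tiup cluster display tidb-dev | pd_leader_from_display`, comparing the result with the node passed to `-N` before a scale-in.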

Transferring the PD leader has no effect on cdc

[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "normal",
      "tso": 440757380830461955,
      "checkpoint": "2023-04-13 11:19:35.458",
      "error": null
    }
  }
]
[tidb@vm115 ~]$
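The check above is done by eye. A small sketch of the same check as a script follows: it reads the `changefeed list` output and succeeds only when every listed changefeed reports `"state": "normal"`. The function name `all_changefeeds_normal` is hypothetical, and the text matching assumes the `"id"` and `"state"` fields appear as in the output above; a real JSON parser would be more robust.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `cdc changefeed list` output on stdin.
# Returns 0 only if at least one changefeed is listed and each one has
# "state": "normal". A changefeed whose summary is null has no state field
# and is therefore counted as not normal.
all_changefeeds_normal() {
  local out total normal
  out="$(cat)"
  total="$(printf '%s\n' "$out" | grep -o '"id": *"[^"]*"' | wc -l | tr -d ' ')"
  normal="$(printf '%s\n' "$out" | grep -o '"state": *"normal"' | wc -l | tr -d ' ')"
  [ "$total" -gt 0 ] && [ "$total" -eq "$normal" ]
}
```

For example, `tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379 | all_changefeeds_normal && echo "all normal"` could be run right after a leader transfer.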

Scaling in a PD node

After this operation the cdc task reports an error

[tidb@vm115 ~]$ tiup cluster scale-in tidb-dev -N 10.2.103.115:35379
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.1
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-dev -N 10.2.103.115:35379
This operation will delete the 10.2.103.115:35379 nodes in `tidb-dev` and all their data.
Do you want to continue [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.115:35379] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
Stopping component pd
        Stopping instance 10.2.103.115
        Stop pd 10.2.103.115:35379 success
Destroying component pd
        Destroying instance 10.2.103.115
Destroy 10.2.103.115 success
- Destroy pd paths: [/data1/tidb-data/pd-35379 /data1/tidb-deploy/pd-35379/log /data1/tidb-deploy/pd-35379 /etc/systemd/system/pd-35379.service]
+ [ Serial ] - UpdateMeta: cluster=tidb-dev, deleted=`10.2.103.115:35379`
+ [ Serial ] - UpdateTopology: cluster=tidb-dev
+ Refresh instance configs
  - Generate config pd -> 10.2.103.115:32379 ... Done
  - Generate config pd -> 10.2.103.115:36379 ... Done
  - Generate config tikv -> 10.2.103.115:30160 ... Done
  - Generate config tidb -> 10.2.103.115:43000 ... Done
  - Generate config cdc -> 10.2.103.115:8400 ... Done
  - Generate config prometheus -> 10.2.103.115:9590 ... Done
  - Generate config grafana -> 10.2.103.115:7000 ... Done
  - Generate config alertmanager -> 10.2.103.115:9793 ... Done
  - Generate config alertmanager -> 10.2.103.115:9893 ... Done
+ Reload prometheus and grafana
  - Reload prometheus -> 10.2.103.115:9590 ... Done
  - Reload grafana -> 10.2.103.115:7000 ... Done
Scaled cluster `tidb-dev` in successfully

cdc task error

[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[2023/04/14 09:18:52.714 +08:00] [WARN] [client_changefeed.go:170] ["query changefeed info failed"] [error="owner not found"]
[
  {
    "id": "simple-replication-task2",
    "summary": null
  }
]
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[2023/04/14 09:19:00.720 +08:00] [WARN] [client_changefeed.go:170] ["query changefeed info failed"] [error="owner not found"]
[
  {
    "id": "simple-replication-task2",
    "summary": null
  }
]
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "stopped",
      "tso": 440778127778512897,
      "checkpoint": "2023-04-14 09:18:38.784",
      "error": {
        "addr": "10.2.103.115:8400",
        "code": "CDC:ErrProcessorUnknown",
        "message": "failed to update info: [CDC:ErrReachMaxTry]reach maximum try: 3"
      }
    }
  }
]
[tidb@vm115 ~]$

cdc error log

[2023/04/14 09:18:40.348 +08:00] [ERROR] [processor.go:497] ["failed to flush task position"] [changefeed=simple-replication-task2] [error="[CDC:ErrPDEtcdAPIError]rpc error: code = Unknown desc = raft: stopped"] [errorVerbose="[CDC:ErrPDEtcdAPIError]rpc error: code = Unknown desc = raft: stopped\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:279\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc@/pkg/errors/helper.go:30\ngithub.com/pingcap/ticdc/cdc/kv.CDCEtcdClient.PutTaskPositionOnChange\n\tgithub.com/pingcap/ticdc@/cdc/kv/etcd.go:739\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).flushTaskPosition\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:494\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).flushTaskStatusAndPosition\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:560\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func1.1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:318\ngithub.com/pingcap/ticdc/pkg/retry.run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:54\ngithub.com/pingcap/ticdc/pkg/retry.Do\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:32\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:317\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func2\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:349\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:418\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).Run.func1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:251\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
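Entries like the one above are long, and the part that matters for triage is usually the bracketed error code, here `CDC:ErrPDEtcdAPIError`. The following is a minimal sketch for pulling the distinct codes out of ERROR-level lines. The function name `cdc_error_codes` is invented for this example, and the `CDC:Err...` pattern is inferred from the codes that appear in this article rather than from a documented format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads cdc log lines on stdin and prints each distinct
# CDC:Err* code found on lines logged at ERROR level.
cdc_error_codes() {
  grep '\[ERROR\]' | grep -o 'CDC:Err[A-Za-z0-9]*' | sort -u
}
```

It could be fed from the cdc log file, for example `cdc_error_codes < /data1/tidb-deploy/cdc-8400/log/cdc.log`; that path is a guess based on the deploy directory shown earlier and may differ on a given deployment.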

Solution

Resume the cdc task

[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed resume -c simple-replication-task2 --pd=http://10.2.103.115:32379
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "normal",
      "tso": 440778162669879297,
      "checkpoint": "2023-04-14 09:20:51.884",
      "error": null
    }
  }
]
[tidb@vm115 ~]$
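With more than one changefeed, resuming each stopped task by hand gets tedious. A sketch of how the list and resume steps above might be combined follows. The function and variable names are hypothetical; only the `changefeed list` and `changefeed resume -c <id>` commands come from this article. The text-based parsing assumes each changefeed's `"id"` is followed by its `"state"` as in the output shown here, and a changefeed whose summary is still null is skipped.

```shell
#!/usr/bin/env bash
# Hypothetical defaults taken from the environment used in this article.
PD_ADDR="${PD_ADDR:-http://10.2.103.115:32379}"
CDC_VERSION="${CDC_VERSION:-v4.0.15}"

# Reads `cdc changefeed list` output on stdin and prints "<id> <state>" for
# every changefeed that has a state field.
changefeed_states() {
  tr -d '\n' \
    | grep -o '"id": *"[^"]*"[^}]*"state": *"[^"]*"' \
    | sed 's/^"id": *"\([^"]*\)".*"state": *"\([^"]*\)"$/\1 \2/'
}

# Resumes every changefeed whose state is "stopped".
resume_stopped_changefeeds() {
  tiup "ctl:${CDC_VERSION}" cdc changefeed list --pd="$PD_ADDR" \
    | changefeed_states \
    | awk '$2 == "stopped" { print $1 }' \
    | while read -r id; do
        echo "resuming ${id}"
        # </dev/null keeps the command from consuming the loop's input.
        tiup "ctl:${CDC_VERSION}" cdc changefeed resume -c "$id" --pd="$PD_ADDR" </dev/null
      done
}
```

Calling `resume_stopped_changefeeds` once the scale-in has finished would issue the same resume command as above for each stopped task. It does not address why the task stopped, so checking the state again afterwards is still worthwhile.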

Testing after upgrading to v5.4.1

Scaling in PD

With the same operation, the cdc task does not report an error

[tidb@vm115 ~]$ tiup cluster scale-in tidb-dev -N 10.2.103.115:35379
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.1
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-dev -N 10.2
