Note: the test environment runs v4.0.15, which is a very old version as far as TiCDC is concerned and may carry quite a few issues. For production, upgrade to a newer version such as v6.1.6 or v6.5.1; those releases bring major improvements in both performance and functionality. The problem described below does not reproduce in v5.4.1, so a recent, stable LTS version is recommended.
Basic cdc commands
# Create a changefeed
tiup ctl:v4.0.15 cdc changefeed create --pd=http://10.2.103.115:32379 --sink-uri="tidb://root:tidb@10.2.103.116:34000/" --changefeed-id="simple-replication-task" --config=cdc.toml
# List changefeeds and their status
tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
# Query the status of a specific changefeed
tiup ctl:v4.0.15 cdc changefeed query -c simple-replication-task --pd=http://10.2.103.115:32379
# Remove a changefeed
tiup ctl:v4.0.15 cdc changefeed remove -c simple-replication-task --pd=http://10.2.103.115:32379
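The create command above passes a changefeed configuration file via --config=cdc.toml, whose contents are not shown here. A minimal sketch of such a file, assuming the standard TiCDC changefeed options and a hypothetical rule that only replicates the test database, could look like this (adjust the filter rules to your own schema):
# cdc.toml -- minimal changefeed configuration (sketch)
# Whether database/table names in the filter are matched case-sensitively
case-sensitive = true
[filter]
# Hypothetical rule for illustration: replicate every table in the `test` database
rules = ['test.*']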
PD status
[tidb@vm115 ~]$ tiup cluster display tidb-dev
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.0
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-dev
Cluster type:       tidb
Cluster name:       tidb-dev
Cluster version:    v4.0.15
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://10.2.103.115:32379/dashboard
Grafana URL:        http://10.2.103.115:7000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------   --------                            ----------
10.2.103.115:9793   alertmanager  10.2.103.115  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.115:9893   alertmanager  10.2.103.115  9893/9894    linux/x86_64  Up       /data1/tidb-data/alertmanager-9893  /data1/tidb-deploy/alertmanager-9893
10.2.103.115:8400   cdc           10.2.103.115  8400         linux/x86_64  Up       /data1/tidb-data/cdc-8400           /data1/tidb-deploy/cdc-8400
10.2.103.115:7000   grafana       10.2.103.115  7000         linux/x86_64  Up       -                                   /data1/tidb-deploy/grafana-7000
10.2.103.115:32379  pd            10.2.103.115  32379/3380   linux/x86_64  Up|L|UI  /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.115:35379  pd            10.2.103.115  35379/3580   linux/x86_64  Up       /data1/tidb-data/pd-35379           /data1/tidb-deploy/pd-35379
10.2.103.115:36379  pd            10.2.103.115  36379/3680   linux/x86_64  Up       /data1/tidb-data/pd-36379           /data1/tidb-deploy/pd-36379
10.2.103.115:9590   prometheus    10.2.103.115  9590/35020   linux/x86_64  Up       /data1/tidb-data/prometheus-9590    /data1/tidb-deploy/prometheus-9590
10.2.103.115:43000  tidb          10.2.103.115  43000/20080  linux/x86_64  Up       -                                   /data1/tidb-deploy/tidb-34000
10.2.103.115:30160  tikv          10.2.103.115  30160/30180  linux/x86_64  Up       /data1/tidb-data/tikv-30160         /data1/tidb-deploy/tikv-30160
Total nodes: 10
cdc changefeed status
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
{
"id": "simple-replication-task2",
"summary": {
"state": "normal",
"tso": 440757324149424131,
"checkpoint": "2023-04-13 11:15:59.237",
"error": null
}
}
]
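The tso field in this listing is the changefeed checkpoint expressed as a TiDB timestamp. If you want to cross-check it against the human-readable checkpoint time, pd-ctl can decode it; a quick sketch, reusing the tso value from the listing above and assuming the pd-ctl tso subcommand is available in this ctl version:
# Decode the checkpoint TSO into a physical timestamp (sketch)
tiup ctl:v4.0.15 pd -u http://10.2.103.115:32379 tso 440757324149424131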
Transfer the PD leader
[tidb@vm115 ~]$ tiup ctl:v4.0.15 pd -u http://10.2.103.115:32379 member leader transfer pd-10.2.103.115-35379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl pd -u http://10.2.103.115:32379 member leader transfer pd-10.2.103.115-35379
Success!
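Besides re-running tiup cluster display as shown below, the new leader can also be confirmed directly from pd-ctl; a quick check, assuming the member leader show subcommand is available in this ctl version:
# Show the current PD leader (sketch)
tiup ctl:v4.0.15 pd -u http://10.2.103.115:32379 member leader show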
[tidb@vm115 ~]$ tiup cluster display tidb-dev
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.0
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-dev
Cluster type:       tidb
Cluster name:       tidb-dev
Cluster version:    v4.0.15
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://10.2.103.115:32379/dashboard
Grafana URL:        http://10.2.103.115:7000
ID                  Role          Host          Ports        OS/Arch       Status  Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------  --------                            ----------
10.2.103.115:9793   alertmanager  10.2.103.115  9793/9794    linux/x86_64  Up      /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.115:9893   alertmanager  10.2.103.115  9893/9894    linux/x86_64  Up      /data1/tidb-data/alertmanager-9893  /data1/tidb-deploy/alertmanager-9893
10.2.103.115:8400   cdc           10.2.103.115  8400         linux/x86_64  Up      /data1/tidb-data/cdc-8400           /data1/tidb-deploy/cdc-8400
10.2.103.115:7000   grafana       10.2.103.115  7000         linux/x86_64  Up      -                                   /data1/tidb-deploy/grafana-7000
10.2.103.115:32379  pd            10.2.103.115  32379/3380   linux/x86_64  Up|UI   /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.115:35379  pd            10.2.103.115  35379/3580   linux/x86_64  Up|L    /data1/tidb-data/pd-35379           /data1/tidb-deploy/pd-35379
10.2.103.115:36379  pd            10.2.103.115  36379/3680   linux/x86_64  Up      /data1/tidb-data/pd-36379           /data1/tidb-deploy/pd-36379
10.2.103.115:9590   prometheus    10.2.103.115  9590/35020   linux/x86_64  Up      /data1/tidb-data/prometheus-9590    /data1/tidb-deploy/prometheus-9590
10.2.103.115:43000  tidb          10.2.103.115  43000/20080  linux/x86_64  Up      -                                   /data1/tidb-deploy/tidb-34000
10.2.103.115:30160  tikv          10.2.103.115  30160/30180  linux/x86_64  Up      /data1/tidb-data/tikv-30160         /data1/tidb-deploy/tikv-30160
Total nodes: 10
Transferring the PD leader has no impact on cdc
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
{
"id": "simple-replication-task2",
"summary": {
"state": "normal",
"tso": 440757380830461955,
"checkpoint": "2023-04-13 11:19:35.458",
"error": null
}
}
]
[tidb@vm115 ~]$
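As an extra sanity check after the leader transfer, the cdc captures (including the owner) can be listed as well; this is a sketch and assumes the capture list subcommand is available in this cdc ctl version:
# List cdc capture processes and the current owner (sketch)
tiup ctl:v4.0.15 cdc capture list --pd=http://10.2.103.115:32379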
Scale in a PD node
The cdc changefeed reports an error
[tidb@vm115 ~]$ tiup cluster scale-in tidb-dev -N 10.2.103.115:35379
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.1
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-dev -N 10.2.103.115:35379
This operation will delete the 10.2.103.115:35379 nodes in `tidb-dev` and all their data.
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.115:35379] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
Stopping component pd
  Stopping instance 10.2.103.115
  Stop pd 10.2.103.115:35379 success
Destroying component pd
  Destroying instance 10.2.103.115
Destroy 10.2.103.115 success
- Destroy pd paths: [/data1/tidb-data/pd-35379 /data1/tidb-deploy/pd-35379/log /data1/tidb-deploy/pd-35379 /etc/systemd/system/pd-35379.service]
+ [ Serial ] - UpdateMeta: cluster=tidb-dev, deleted=`10.2.103.115:35379`
+ [ Serial ] - UpdateTopology: cluster=tidb-dev
+ Refresh instance configs
  - Generate config pd -> 10.2.103.115:32379 ... Done
  - Generate config pd -> 10.2.103.115:36379 ... Done
  - Generate config tikv -> 10.2.103.115:30160 ... Done
  - Generate config tidb -> 10.2.103.115:43000 ... Done
  - Generate config cdc -> 10.2.103.115:8400 ... Done
  - Generate config prometheus -> 10.2.103.115:9590 ... Done
  - Generate config grafana -> 10.2.103.115:7000 ... Done
  - Generate config alertmanager -> 10.2.103.115:9793 ... Done
  - Generate config alertmanager -> 10.2.103.115:9893 ... Done
+ Reload prometheus and grafana
  - Reload prometheus -> 10.2.103.115:9590 ... Done
  - Reload grafana -> 10.2.103.115:7000 ... Done
Scaled cluster `tidb-dev` in successfully
The cdc changefeed reports an error
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[2023/04/14 09:18:52.714 +08:00] [WARN] [client_changefeed.go:170] ["query changefeed info failed"] [error="owner not found"]
[
{
"id": "simple-replication-task2",
"summary": null
}
]
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[2023/04/14 09:19:00.720 +08:00] [WARN] [client_changefeed.go:170] ["query changefeed info failed"] [error="owner not found"]
[
{
"id": "simple-replication-task2",
"summary": null
}
]
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
{
"id": "simple-replication-task2",
"summary": {
"state": "stopped",
"tso": 440778127778512897,
"checkpoint": "2023-04-14 09:18:38.784",
"error": {
"addr": "10.2.103.115:8400",
"code": "CDC:ErrProcessorUnknown",
"message": "failed to update info: [CDC:ErrReachMaxTry]reach maximum try: 3"
}
}
}
]
[tidb@vm115 ~]$
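The changefeed is now in the stopped state with an error attached. Before resuming it, the full task detail and error information can be pulled with the query command from the basic command list above:
# Inspect the stopped changefeed in detail
tiup ctl:v4.0.15 cdc changefeed query -c simple-replication-task2 --pd=http://10.2.103.115:32379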
cdc error log
[2023/04/14 09:18:40.348 +08:00] [ERROR] [processor.go:497] ["failed to flush task position"] [changefeed=simple-replication-task2] [error="[CDC:ErrPDEtcdAPIError]rpc error: code = Unknown desc = raft: stopped"] [errorVerbose
="[CDC:ErrPDEtcdAPIError]rpc error: code = Unknown desc = raft: stopped\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:279\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc@/pkg/errors/helper.go:30\ngithub.com/pingcap/ticdc/cdc/kv.CDCEtcdClient.PutTaskPositionOnChange\n\tgithub.com/pingcap/ticdc@/cdc/kv/etcd.go:739\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).flushTaskPosition\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:494\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).flushTaskStatusAndPosition\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:560\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func1.1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:318\ngithub.com/pingcap/ticdc/pkg/retry.run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:54\ngithub.com/pingcap/ticdc/pkg/retry.Do\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:32\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:317\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func2\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:349\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:418\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).Run.func1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:251\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
Solution
Resume the cdc changefeed
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed resume -c simple-replication-task2 --pd=http://10.2.103.115:32379
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "normal",
      "tso": 440778162669879297,
      "checkpoint": "2023-04-14 09:20:51.884",
      "error": null
    }
  }
]
[tidb@vm115 ~]$
Testing again after upgrading to v5.4.1
Scale in PD
With the same operation, the cdc changefeed no longer reports an error
[tidb@vm115 ~]$ tiup cluster scale-in tidb-dev -N 10.2.103.115:35379
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.1
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-dev -N 10.2.103.115:35379
This operation will delete the 10.2.103.115:35379 nodes in `tidb-dev` and all their data.
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.115:35379] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
Stopping component pd
  Stopping instance 10.2.103.115
  Stop pd 10.2.103.115:35379 success
Destroying component pd
  Destroying instance 10.2.103.115
Destroy 10.2.103.115 success
- Destroy pd paths: [/data1/tidb-data/pd-35379 /data1/tidb-deploy/pd-35379/log /data1/tidb-deploy/pd-35379 /etc/systemd/system/pd-35379.service]
+ [ Serial ] - UpdateMeta: cluster=tidb-dev, deleted=`10.2.103.115:35379`
+ [ Serial ] - UpdateTopology: cluster=tidb-dev
+ Refresh instance configs
  - Generate config pd -> 10.2.103.115:32379 ... Done
  - Generate config pd -> 10.2.103.115:36379 ... Done
  - Generate config tikv -> 10.2.103.115:30160 ... Done
  - Generate config tidb -> 10.2.103.115:43000 ... Done
  - Generate config cdc -> 10.2.103.115:8400 ... Done
  - Generate config prometheus -> 10.2.103.115:9590 ... Done
  - Generate config grafana -> 10.2.103.115:7000 ... Done
  - Generate config alertmanager -> 10.2.103.115:9793 ... Done
  - Generate config alertmanager -> 10.2.103.115:9893 ... Done
+ Reload prometheus and grafana
  - Reload prometheus -> 10.2.103.115:9590 ... Done
  - Reload grafana -> 10.2.103.115:7000 ... Done
Scaled cluster `tidb-dev` in successfully
[tidb@vm115 ~]$
cdc status is normal
[tidb@vm115 ~]$ tiup ctl:v5.4.1 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.4.1/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
{
"id": "simple-replication-task2",
"summary": {
"state": "normal",
"tso": 440778284890324994,
"checkpoint": "2023-04-14 09:28:38.118",
"error": null
}
}
]
[tidb@vm115 ~]$
Summary:
1. If at all possible, simulate and test any production change in a test environment first.
2. Keep production clusters on mainstream, stable versions; very old versions may carry bugs.
3. The latest LTS releases bring a qualitative leap in cdc functionality and performance, so a recent LTS version is recommended.