Recovering a 3-replica cluster that lost 2 replicas with unsafe-recover

Community submission 966 2023-03-31

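In TiDB, depending on the replica rules the user defines, a piece of data may be stored on several nodes at the same time, so that reads and writes are unaffected when a single node or a minority of nodes is temporarily offline or damaged. However, when a majority or all of a Region's replicas go offline within a short period, the Region becomes temporarily unavailable and cannot serve reads or writes.

If a majority of the replicas of a piece of data are permanently damaged (for example, by a disk failure) and the nodes cannot be brought back online, that data stays unavailable. In this situation, if the user wants the cluster to return to normal use and can tolerate data rollback or data loss, the user can in principle remove the unavailable replicas manually so that the Region forms a majority again. The application layer can then write to and read from this data shard, although the data it reads may be stale or empty.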
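The majority rule behind this behavior can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not TiKV code; the function names `quorum` and `region_available` are made up for this example.

```python
def quorum(replicas: int) -> int:
    """Smallest number of peers that forms a majority of `replicas`."""
    return replicas // 2 + 1


def region_available(replicas: int, lost: int) -> bool:
    """A Region can keep a leader only if the surviving peers reach quorum."""
    return replicas - lost >= quorum(replicas)


if __name__ == "__main__":
    # With 3 replicas, losing 1 is tolerated, losing 2 is not.
    for lost in range(4):
        print(f"3 replicas, {lost} lost -> available: {region_available(3, lost)}")
```

With 3 replicas the quorum is 2, so the single surviving replica in this article cannot elect a leader on its own, which is why the Region stays unavailable until the failed peers are removed from its configuration.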

Cluster information

[tidb@vm116 ~]$ tiup cluster display tidb-prd
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-prd
Cluster type:       tidb
Cluster name:       tidb-prd
Cluster version:    v5.4.3
Deploy user:        tidb
SSH type:           builtin
TLS encryption:     enabled
CA certificate:     /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt
Client private key: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem
Client certificate: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt
Dashboard URL:      https://10.2.103.116:32379/dashboard
Grafana URL:        http://10.2.103.116:5000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------   --------                            ----------
10.2.103.116:9793   alertmanager  10.2.103.116  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.116:5000   grafana       10.2.103.116  5000         linux/x86_64  Up       -                                   /data1/tidb-deploy/grafana-5000
10.2.103.116:32379  pd            10.2.103.116  32379/3380   linux/x86_64  Up|L|UI  /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.116:9390   prometheus    10.2.103.116  9390/32020   linux/x86_64  Up       /data1/tidb-data/prometheus-9390    /data1/tidb-deploy/prometheus-9390
10.2.103.116:43000  tidb          10.2.103.116  43000/20080  linux/x86_64  Up       -                                   /data1/tidb-deploy/tidb-34000
10.2.103.116:30160  tikv          10.2.103.116  30160/30180  linux/x86_64  Up       /data1/tidb-data/tikv-30160         /data1/tidb-deploy/tikv-30160
10.2.103.116:30162  tikv          10.2.103.116  30162/30182  linux/x86_64  Up       /data1/tidb-data/tikv-30162         /data1/tidb-deploy/tikv-30162
10.2.103.116:30163  tikv          10.2.103.116  30163/30183  linux/x86_64  Up       /data1/tidb-data/tikv-30163         /data1/tidb-deploy/tikv-30163
Total nodes: 8

Query data

MySQL [(none)]> use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MySQL [test]> select count(*) from t3;
+----------+
| count(*) |
+----------+
|  3271488 |
+----------+
1 row in set (0.00 sec)

Simulate a TiKV outage by force scaling in 2 TiKV nodes at the same time

[tidb@vm116 ~]$ tiup cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force
[ASCII-art warning banner]
Forcing scale in is unsafe and may result in data loss for stateful components.
DO NOT use `--force` if you have any component in Pending Offline status.
The process is irreversible and could NOT be cancelled.
Only use `--force` when some of the servers are already permanently offline.
Are you sure to continue?
(Type "Yes, I know my data might be lost." to continue)
: Yes, I know my data might be lost.
This operation will delete the 10.2.103.116:30160,10.2.103.116:30162 nodes in `tidb-prd` and all their data.
Do you want to continue? [y/N]:(default=N) y
The component `[tikv]` will become tombstone, maybe exists in several minutes or hours, after that you can use the prune command to clean it
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.116:30160 10.2.103.116:30162] Force:true SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
Stopping component tikv
Stopping instance 10.2.103.116
Stop tikv 10.2.103.116:30160 success
Destroying component tikv
Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/data1/tidb-data/tikv-30160 /data1/tidb-deploy/tikv-30160/log /data1/tidb-deploy/tikv-30160 /etc/systemd/system/tikv-30160.service]
Stopping component tikv
Stopping instance 10.2.103.116
Stop tikv 10.2.103.116:30162 success
Destroying component tikv
Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/data1/tidb-data/tikv-30162 /data1/tidb-deploy/tikv-30162/log /data1/tidb-deploy/tikv-30162 /etc/systemd/system/tikv-30162.service]
+ [ Serial ] - UpdateMeta: cluster=tidb-prd, deleted=`10.2.103.116:30160,10.2.103.116:30162`
+ [ Serial ] - UpdateTopology: cluster=tidb-prd
+ Refresh instance configs
- Generate config pd -> 10.2.103.116:32379 ... Done
- Generate config tikv -> 10.2.103.116:30163 ... Done
- Generate config tidb -> 10.2.103.116:43000 ... Done
- Generate config prometheus -> 10.2.103.116:9390 ... Done
- Generate config grafana -> 10.2.103.116:5000 ... Done
- Generate config alertmanager -> 10.2.103.116:9793 ... Done
+ Reload prometheus and grafana
- Reload prometheus -> 10.2.103.116:9390 ... Done
- Reload grafana -> 10.2.103.116:5000 ... Done
Scaled cluster `tidb-prd` in successfully
[tidb@vm116 ~]$

The query now fails

MySQL [test]> select count(*) from t3;
ERROR 9005 (HY000): Region is unavailable
MySQL [test]>

Error log

[2023/03/24 11:08:45.587 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8e640: Retry in 1000 milliseconds"]
[2023/03/24 11:08:45.587 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=1]
[2023/03/24 11:08:45.587 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:46.180 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:46.180 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoint"] [store_id=5002] [addr=10.2.103.116:30162]
[2023/03/24 11:08:46.181 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1679627326.181128634\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30162\"}"]
[2023/03/24 11:08:46.181 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8ed40: Retry in 1000 milliseconds"]
[2023/03/24 11:08:46.181 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30162] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=5002]
[2023/03/24 11:08:46.181 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:48.162 +08:00] [WARN] [endpoint.rs:606] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7001, leader may None\" not_leader { region_id: 7001 }"]
[2023/03/24 11:08:48.174 +08:00] [WARN] [endpoint.rs:606] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7001, leader may None\" not_leader { region_id: 7001 }"]
[2023/03/24 11:08:50.587 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:50.587 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoint"] [store_id=1] [addr=10.2.103.116:30160]
[2023/03/24 11:08:50.588 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1679627330.588107444\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30160\"}"]
[2023/03/24 11:08:50.588 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8f440: Retry in 1000 milliseconds"]
[2023/03/24 11:08:50.588 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=1]
[2023/03/24 11:08:50.588 +08:00] [INFO] [store.rs:2580] ["broadcasting unreachable"] [unreachable_store_id=1] [store_id=5001]
[2023/03/24 11:08:50.588 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:51.181 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:51.181 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoint"] [store_id=5002] [addr=10.2.103.116:30162]
[2023/03/24 11:08:51.182 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1679627331.182361851\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30162\"}"]
[2023/03/24 11:08:51.182 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8fb40: Retry in 999 milliseconds"]
[2023/03/24 11:08:51.182 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30162] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=5002]
[2023/03/24 11:08:51.182 +08:00] [INFO] [store.rs:2580] ["broadcasting unreachable"] [unreachable_store_id=5002] [store_id=5001]
[2023/03/24 11:08:51.182 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30162] [store_id=5002]
^C
[tidb@vm116 log]$

Preparing the repair (versions before v6.1)

1. Pause PD scheduling
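The article does not show the commands used for this step. One common way to pause scheduling is to set PD's scheduling limits to 0 with pd-ctl `config set` and to restore the saved values in step 8. The sketch below only builds those command strings; the choice of these four limits and the sample values are assumptions, not something recorded in this walkthrough.

```python
# Assumption: these four PD scheduling limits are the ones to zero out.
SCHEDULE_LIMITS = (
    "region-schedule-limit",
    "replica-schedule-limit",
    "leader-schedule-limit",
    "merge-schedule-limit",
)


def pause_commands():
    """pd-ctl commands that set every scheduling limit to 0."""
    return [f"config set {name} 0" for name in SCHEDULE_LIMITS]


def resume_commands(saved):
    """pd-ctl commands that restore the limits recorded before the pause."""
    return [f"config set {name} {saved[name]}" for name in SCHEDULE_LIMITS]


if __name__ == "__main__":
    # Hypothetical values: read the real ones with `config show` before pausing.
    saved = {
        "region-schedule-limit": 2048,
        "replica-schedule-limit": 64,
        "leader-schedule-limit": 4,
        "merge-schedule-limit": 8,
    }
    print("\n".join(pause_commands()))
    print("\n".join(resume_commands(saved)))
```

Recording the current values before pausing matters because step 8 has to put back whatever the cluster was actually using, which may differ from the defaults.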

2. Check replicas

Use pd-ctl to find the Regions that have at least half of their replicas on the failed nodes. Requirement: PD is running.

» region --jq=.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,5002) then . else empty end) | length>=$total-length) }
{"id":2003,"peer_stores":[1,5002,5001]}
{"id":7001,"peer_stores":[1,5002,5001]}
{"id":7005,"peer_stores":[1,5002,5001]}
{"id":7009,"peer_stores":[1,5002,5001]}
»
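The jq filter is dense, so here is a rough Python equivalent of its logic. It is a sketch that assumes the Region list has the shape implied by the filter (each Region has an `id` and a `peers` list with `store_id`); the function name and the second sample Region (id 9999) are made up for illustration.

```python
import json


def regions_losing_majority(regions, failed_stores):
    """Select Regions whose peers on failed stores are at least half of all peers."""
    failed = set(failed_stores)
    hits = []
    for region in regions:
        peer_stores = [peer["store_id"] for peer in region["peers"]]
        on_failed = sum(1 for store in peer_stores if store in failed)
        # Same condition as the jq filter: failed peers >= surviving peers.
        if on_failed >= len(peer_stores) - on_failed:
            hits.append({"id": region["id"], "peer_stores": peer_stores})
    return hits


if __name__ == "__main__":
    sample = {"regions": [
        {"id": 2003, "peers": [{"store_id": 1}, {"store_id": 5002}, {"store_id": 5001}]},
        # Hypothetical Region with only one peer on a failed store.
        {"id": 9999, "peers": [{"store_id": 1}, {"store_id": 5001}, {"store_id": 6000}]},
    ]}
    for hit in regions_losing_majority(sample["regions"], [1, 5002]):
        print(json.dumps(hit, separators=(",", ":")))
```

Region 2003 is selected because two of its three peers sit on the failed stores 1 and 5002, while the hypothetical Region 9999 still has a majority and is skipped.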

3. Stop the TiKV that needs to be repaired

[tidb@vm116 ~]$ tiup cluster display tidb-prd
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-prd
Cluster type:       tidb
Cluster name:       tidb-prd
Cluster version:    v5.4.3
Deploy user:        tidb
SSH type:           builtin
TLS encryption:     enabled
CA certificate:     /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt
Client private key: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem
Client certificate: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt
Dashboard URL:      https://10.2.103.116:32379/dashboard
Grafana URL:        http://10.2.103.116:5000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------   --------                            ----------
10.2.103.116:9793   alertmanager  10.2.103.116  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.116:5000   grafana       10.2.103.116  5000         linux/x86_64  Up       -                                   /data1/tidb-deploy/grafana-5000
10.2.103.116:32379  pd            10.2.103.116  32379/3380   linux/x86_64  Up|L|UI  /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.116:9390   prometheus    10.2.103.116  9390/32020   linux/x86_64  Up       /data1/tidb-data/prometheus-9390    /data1/tidb-deploy/prometheus-9390
10.2.103.116:43000  tidb          10.2.103.116  43000/20080  linux/x86_64  Up       -                                   /data1/tidb-deploy/tidb-34000
10.2.103.116:30163  tikv          10.2.103.116  30163/30183  linux/x86_64  Up       /data1/tidb-data/tikv-30163         /data1/tidb-deploy/tikv-30163
Total nodes: 6
[tidb@vm116 ~]$ tiup cluster stop tidb-prd -N 10.2.103.116:30163
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster stop tidb-prd -N 10.2.103.116:30163
Will stop the cluster tidb-prd with nodes: 10.2.103.116:30163, roles: .
Do you want to continue? [y/N]:(default=N) y
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [ Serial ] - StopCluster
Stopping component tikv
Stopping instance 10.2.103.116
Stop tikv 10.2.103.116:30163 success
Stopping component node_exporter
Stopping component blackbox_exporter
Stopped cluster `tidb-prd` successfully

4. Run unsafe-recover

On every instance that did not suffer the power-loss failure, remove from all Regions every Peer that is located on the failed nodes. Requirement: run this on all machines that did not suffer the failure, with the TiKV node shut down.

[tidb@vm116 v5.4.3]$ ./tikv-ctl --data-dir /data1/tidb-data/tikv-30163 unsafe-recover remove-fail-stores -s 1,5002 --all-regions
[2023/03/24 11:25:47.978 +08:00] [WARN] [config.rs:612] ["compaction guard is disabled due to region info provider not available"]
[2023/03/24 11:25:47.978 +08:00] [WARN] [config.rs:715] ["compaction guard is disabled due to region info provider not available"]
removing stores [1, 5002] from configurations...
success

5. Start the repaired TiKV

[tidb@vm116 v5.4.3]$ tiup cluster start tidb-prd -N 10.2.103.116:30163
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster start tidb-prd -N 10.2.103.116:30163
Starting cluster tidb-prd...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [ Serial ] - StartCluster
Starting component tikv
Starting instance 10.2.103.116:30163
Start instance 10.2.103.116:30163 success
Starting component node_exporter
Starting instance 10.2.103.116
Start 10.2.103.116 success
Starting component blackbox_exporter
Starting instance 10.2.103.116
Start 10.2.103.116 success
+ [ Serial ] - UpdateTopology: cluster=tidb-prd
Started cluster `tidb-prd` successfully

6. Check Region Leaders

Use pd-ctl to check for Regions that have no Leader. Requirement: PD is running.

[tidb@vm116 v5.4.3]$ tiup ctl:v5.4.3 pd -u "https://10.2.103.116:32379" --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.4.3/ctl pd -u https://10.2.103.116:32379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
» region --jq .regions[]|select(has("leader")|not)|{id: .id, peer_stores: [.peers[].store_id]}
»
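The empty result above means every Region has a Leader again. The same check can be sketched in Python, assuming the Region list has the shape implied by the jq filter (a `leader` key is present only when the Region has a Leader); the function name and sample data are illustrative only.

```python
def regions_without_leader(regions):
    """Mirror of the jq filter: keep Regions that have no "leader" key."""
    return [
        {"id": region["id"], "peer_stores": [p["store_id"] for p in region["peers"]]}
        for region in regions
        if "leader" not in region
    ]


if __name__ == "__main__":
    # Hypothetical sample: Region 7001 has a Leader, Region 7005 does not.
    sample = [
        {"id": 7001, "leader": {"store_id": 5001}, "peers": [{"store_id": 5001}]},
        {"id": 7005, "peers": [{"store_id": 5001}]},
    ]
    print(regions_without_leader(sample))
```

An empty list is the expected outcome after a successful repair; any Region that shows up here still needs attention before scheduling is resumed.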

7. Check data consistency

Check that data and indexes are consistent. Requirement: PD, TiKV, and TiDB are running.

MySQL [test]> select count(*) from t3;
+----------+
| count(*) |
+----------+
|  3271488 |
+----------+
1 row in set (0.55 sec)

MySQL [test]> admin check table t3;
Query OK, 0 rows affected (0.00 sec)

MySQL [test]>

8. Resume scheduling

Exceptional cases

1. No such region

./tikv-ctl --data-dir /data3/tidb/data unsafe-recover remove-fail-stores -s 1 -r 50377
[INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[INFO] [mod.rs:479] ["encryption is disabled."]
[WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
removing stores [1] from configurations...
Debugger::remove_fail_stores: "No such region 50377 on the store"

2. Create an empty Region to resolve the Unavailable error

Requirement: PD is running, and the TiKV targeted by the command is shut down.

./tikv-ctl --ca-path /data3/tidb/deploy/tls/ca.crt --key-path /data3/tidb/deploy/tls/tikv.pem --cert-path /data3/tidb/deploy/tls/tikv.crt --data-dir /data3/tidb/data recreate-region -p https://10.2.103.116:32379 -r 50377
[INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[INFO] [mod.rs:479] ["encryption is disabled."]
initing empty region with peer_id ...
success

Repair on v6.1

1. Cluster information

2. Force scale in 2 TiKV nodes

[tidb@vm116 ~]$ tiup cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force
[ASCII-art warning banner]
Forcing scale in is unsafe and may result in data loss for stateful components.
DO NOT use `--force` if you have any component in Pending Offline status.
The process is irreversible and could NOT be cancelled.
Only use `--force` when some of the servers are already permanently offline.
Are you sure to continue?
(Type "Yes, I know my data might be lost." to continue)
: Yes, I know my data might be lost.
This operation will delete the 10.2.103.116:30160,10.2.103.116:30162 nodes in `tidb-prd` and all their data.
Do you want to continue? [y/N]:(default=N) y
The component `[tikv]` will become tombstone, maybe exists in several minutes or hours, after that you can use the prune command to clean it
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.116:30160 10.2.103.116:30162] Force:true SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
failed to delete tikv: error requesting https://10.2.103.116:32379/pd/api/v1/store/7014, response: "[PD:core:ErrStoresNotEnough]can not remove store 7014 since the number of up stores would be 2 while need 3" , code 400
Stopping component tikv
Stopping instance 10.2.103.116
Stop tikv 10.2.103.116:30162 success
Destroying component tikv
Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/data1/tidb-data/tikv-30162 /data1/tidb-deploy/tikv-30162/log /data1/tidb-deploy/tikv-30162 /etc/systemd/system/tikv-30162.service]
failed to delete tikv: error requesting https://10.2.103.116:32379/pd/api/v1/store/7013, response: "[PD:core:ErrStoresNotEnough]can not remove store 7013 since the number of up stores would be 2 while need 3" , code 400
Stopping component tikv
Stopping instance 10.2.103.116
Stop tikv 10.2.103.116:30160 success
Destroying component tikv
Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/etc/systemd/system/tikv-30160.service /data1/tidb-data/tikv-30160 /data1/tidb-deploy/tikv-30160/log /data1/tidb-deploy/tikv-30160]
+ [ Serial ] - UpdateMeta: cluster=tidb-prd, deleted=`10.2.103.116:30160,10.2.103.116:30162`
+ [ Serial ] - UpdateTopology: cluster=tidb-prd
+ Refresh instance configs
- Generate config pd -> 10.2.103.116:32379 ... Done
- Generate config tikv -> 10.2.103.116:30163 ... Done
- Generate config tidb -> 10.2.103.116:43000 ... Done
- Generate config prometheus -> 10.2.103.116:9390 ... Done
- Generate config grafana -> 10.2.103.116:5000 ... Done
- Generate config alertmanager -> 10.2.103.116:9793 ... Done
+ Reload prometheus and grafana
- Reload prometheus -> 10.2.103.116:9390 ... Done
- Reload grafana -> 10.2.103.116:5000 ... Done
Scaled cluster `tidb-prd` in successfully

3. Query store information

[tidb@vm116 ~]$ tiup ctl:v5.4.3 pd -u "https://10.2.103.116:32379" --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.4.3/ctl pd -u https://10.2.103.116:32379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
» store
{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 7014,
        "address": "10.2.103.116:30162",
        "version": "6.1.5",
        "peer_address": "10.2.103.116:30162",
        "status_address": "10.2.103.116:30182",
        "git_hash": "e554126f6e83a6ddc944ddc51746b6def303ec1a",
        "start_timestamp": 1679629145,
        "deploy_path": "/data1/tidb-deploy/tikv-30162/bin",
        "last_heartbeat": 1679629565041246521,
        "node_state": 1,
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "492GiB",
        "available": "424GiB",
        "used_size": "324.6MiB",
        "leader_count": 2,
        "leader_weight": 1,
        "leader_score": 2,
        "leader_size": 178,
        "region_count": 7,
        "region_weight": 1,
        "region_score": 410.17607715843184,
        "region_size": 251,
        "slow_score": 1,
        "start_ts": "2023-03-24T11:39:05+08:00",
        "last_heartbeat_ts": "2023-03-24T11:46:05.041246521+08:00",
        "uptime": "7m0.041246521s"
      }
    },
    {
      "store": {
        "id": 5001,
        "address": "10.2.103.116:30163",
        "version": "6.1.5",
        "peer_address": "10.2.103.116:30163",
        "status_address": "10.2.103.116:30183",
        "git_hash": "e554126f6e83a6ddc944ddc51746b6def303ec1a",
        "start_timestamp": 1679629139,
        "deploy_path": "/data1/tidb-deploy/tikv-30163/bin",
        "last_heartbeat": 1679629599338071366,
        "node_state": 1,
        "state_name": "Up"
      },
      "status": {
        "capacity": "492GiB",
        "available": "435GiB",
        "used_size": "342MiB",
        "leader_count": 5,
        "leader_weight": 1,
        "leader_score": 5,
        "leader_size": 73,
        "region_count": 7,
        "region_weight": 1,
        "region_score": 409.8657305188558,
        "region_size": 251,
        "slow_score": 1,
        "start_ts": "2023-03-24T11:38:59+08:00",
        "last_heartbeat_ts": "2023-03-24T11:46:39.338071366+08:00",
        "uptime": "7m40.338071366s"
      }
    },
    {
      "store": {
        "id": 7013,
        "address": "10.2.103.116:30160",
        "version": "6.1.5",
        "peer_address": "10.2.103.116:30160",
        "status_address": "10.2.103.116:30180",
        "git_hash": "e554126f6e83a6ddc944ddc51746b6def303ec1a",
        "start_timestamp": 1679629155,
        "deploy_path": "/data1/tidb-deploy/tikv-30160/bin",
        "last_heartbeat": 1679629565148211763,
        "node_state": 1,
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "492GiB",
        "available": "424GiB",
        "used_size": "324.6MiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 7,
        "region_weight": 1,
        "region_score": 410.17510527918483,
        "region_size": 251,
        "slow_score": 1,
        "start_ts": "2023-03-24T11:39:15+08:00",
        "last_heartbeat_ts": "2023-03-24T11:46:05.148211763+08:00",
        "uptime": "6m50.148211763s"
      }
    }
  ]
}

4. The query fails

MySQL [test]> select count(*) from t3;
ERROR 9002 (HY000): TiKV server timeout
MySQL [test]>

5. TiKV error log

[2023/03/24 11:48:47.549 +08:00] [ERROR] [raft_client.rs:824] ["connection abort"] [addr=10.2.103.116:30160] [store_id=7013]
[2023/03/24 11:48:47.813 +08:00] [INFO] [raft.rs:1550] ["starting a new election"] [term=9] [raft_id=7004] [region_id=7001]
[2023/03/24 11:48:47.813 +08:00] [INFO] [raft.rs:1170] ["became pre-candidate at term 9"] [term=9] [raft_id=7004] [region_id=7001]
[2023/03/24 11:48:47.813 +08:00] [INFO] [raft.rs:1299] ["broadcasting vote request"] [to="[250648, 250650]"] [log_index=1863] [log_term=9] [term=9] [type=MsgRequestPreVote] [raft_id=7004] [region_id=7001]
[2023/03/24 11:48:47.927 +08:00] [INFO] [<unknown>] ["subchannel 0x7fae72e89c00 {address=ipv4:10.2.103.116:30162, args=grpc.client_channel_factory=0x7faea78591a8, grpc.default_authority=10.2.103.116:30162, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x7fae00a22260, grpc.internal.security_connector=0x7fae72e0ad80, grpc.internal.subchannel_pool=0x7faea7832d60, grpc.primary_user_agent=grpc-rust/0.10.2, grpc.resource_quota=0x7faea799c630, grpc.server_uri=dns:///10.2.103.116:30162, random id=324}: connect failed: {\"created\":\"@1679629727.927814392\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.1+1.44.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30162\"}"]
[2023/03/24 11:48:47.928 +08:00] [INFO] [<unknown>] ["subchannel 0x7fae72e89c00 {address=ipv4:10.2.103.116:30162, args=grpc.client_channel_factory=0x7faea78591a8, grpc.default_authority=10.2.103.116:30162, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x7fae00a22260, grpc.internal.security_connector=0x7fae72e0ad80, grpc.internal.subchannel_pool=0x7faea7832d60, grpc.primary_user_agent=grpc-rust/0.10.2, grpc.resource_quota=0x7faea799c630, grpc.server_uri=dns:///10.2.103.116:30162, random id=324}: Retry in 999 milliseconds"]
[2023/03/24 11:48:47.928 +08:00] [INFO] [advance.rs:296] ["check leader failed"] [to_store=7014] [error="\"[rpc failed] RpcFailure: 14-UNAVAILABLE failed to connect to all addresses\""]
[2023/03/24 11:48:47.928 +08:00] [INFO] [<unknown>] ["subchannel 0x7fae72eb5c00 {address=ipv4:10.2.103.116:30160, args=grpc.client_channel_factory=0x7faea78591a8, grpc.default_authority=10.2.103.116:30160, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x7fae00a22ee0, grpc.internal.security_connector=0x7fae72fbd640, grpc.internal.subchannel_pool=0x7faea7832d60, grpc.primary_user_agent=grpc-rust/0.10.2, grpc.resource_quota=0x7faea799c630, grpc.server_uri=dns:///10.2.103.116:30160, random id=325}: connect failed: {\"created\":\"@1679629727.928184608\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.1+1.44.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30160\"}"]
[2023/03/24 11:48:47.928 +08:00] [INFO] [<unknown>] ["subchannel 0x7fae72eb5c00 {address=ipv4:10.2.103.116:30160, args=grpc.client_channel_factory=0x7faea78591a8, grpc.default_authority=10.2.103.116:30160, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x7fae00a22ee0, grpc.internal.security_connector=0x7fae72fbd640, grpc.internal.subchannel_pool=0x7faea7832d60, grpc.primary_user_agent=grpc-rust/0.10.2, grpc.resource_quota=0x7faea799c630, grpc.server_uri=dns:///10.2.103.116:30160, random id=325}: Retry in 1000 milliseconds"]
[2023/03/24 11:48:47.928 +08:00] [INFO] [advance.rs:296] ["check leader failed"] [to_store=7013] [error="\"[rpc failed] RpcFailure: 14-UNAVAILABLE failed to connect to all addresses\""]
[2023/03/24 11:48:48.037 +08:00] [WARN] [endpoint.rs:621] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7009, leader may None\" not_leader { region_id: 7009 }"]
[2023/03/24 11:48:48.089 +08:00] [WARN] [endpoint.rs:621] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7001, leader may None\" not_leader { region_id: 7001 }"]
[2023/03/24 11:48:48.409 +08:00] [WARN] [endpoint.rs:621] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 252001, leader may None\" not_leader { region_id: 252001 }"]
^C
[tidb@vm116 log]$

6. PD unsafe recovery

1. Run the recovery command

[tidb@vm116 ctl]$ tiup ctl:v6.1.5 pd -u "https://10.2.103.116:32379" --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v6.1.5/ctl pd -u https://10.2.103.116:32379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
» unsafe remove-failed-stores 7013,7014
Success!
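The store IDs 7013 and 7014 passed to the command come from the `store` output in step 3, where both stores report `"state_name": "Disconnected"`. A small Python sketch of that lookup follows; it assumes the JSON shape shown in this article, and the helper names are made up. A store that is not `Up` is not necessarily lost for good, so the resulting list should be confirmed by hand before running anything.

```python
def stores_not_up(store_output):
    """Return the sorted IDs of stores whose state_name is not "Up"."""
    return sorted(
        item["store"]["id"]
        for item in store_output["stores"]
        if item["store"]["state_name"] != "Up"
    )


def remove_failed_stores_command(store_ids):
    """Build the pd-ctl command line used in this article."""
    return "unsafe remove-failed-stores " + ",".join(str(i) for i in store_ids)


if __name__ == "__main__":
    # Trimmed copy of the `store` output from step 3.
    sample = {"count": 3, "stores": [
        {"store": {"id": 7014, "state_name": "Disconnected"}},
        {"store": {"id": 5001, "state_name": "Up"}},
        {"store": {"id": 7013, "state_name": "Disconnected"}},
    ]}
    print(remove_failed_stores_command(stores_not_up(sample)))
```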

2. Check recovery progress

3. Recovery completed

» unsafe remove-failed-stores show
[
  {
    "info": "Unsafe recovery enters collect report stage: failed stores 7013, 7014",
    "time": "2023-03-24 11:56:44.910"
  },
  {
    "info": "Unsafe recovery enters force leader stage",
    "time": "2023-03-24 11:56:49.390",
    "actions": {
      "store 5001": [
        "force leader on regions: 7001, 7005, 7009, 252001, 252005, 252009, 2003"
      ]
    }
  },
  {
    "info": "Unsafe recovery enters demote failed voter stage",
    "time": "2023-03-24 11:57:20.434",
    "actions": {
      "store 5001": [
        "region 7001 demotes peers { id:250648 store_id:7014 }, { id:250650 store_id:7013 }",
        "region 7005 demotes peers { id:250647 store_id:7013 }, { id:250649 store_id:7014 }",
        "region 7009 demotes peers { id:250644 store_id:7014 }, { id:250646 store_id:7013 }",
        "region 252001 demotes peers { id:252003 store_id:7013 }, { id:252004 store_id:7014 }",
        "region 252005 demotes peers { id:252007 store_id:7013 }, { id:252008 store_id:7014 }",
        "region 252009 demotes peers { id:252011 store_id:7013 }, { id:252012 store_id:7014 }",
        "region 2003 demotes peers { id:250643 store_id:7013 }, { id:250645 store_id:7014 }"
      ]
    }
  },
  {
    "info": "Unsafe recovery finished",
    "time": "2023-03-24 11:57:22.443",
    "details": [
      "affected table ids: 73, 77, 68, 70"
    ]
  }
]
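When polling the progress from a script, the `show` output can be inspected programmatically. The sketch below assumes the output has exactly the shape shown in this article and only recognizes the "Unsafe recovery finished" message seen here; the function names are illustrative, and other versions may word their stages differently.

```python
def recovery_finished(stages):
    """True when the last reported stage is the 'finished' message seen above."""
    return bool(stages) and stages[-1]["info"].startswith("Unsafe recovery finished")


def affected_table_ids(stages):
    """Collect the table IDs listed in any 'affected table ids: ...' detail line."""
    prefix = "affected table ids: "
    ids = []
    for stage in stages:
        for detail in stage.get("details", []):
            if detail.startswith(prefix):
                ids += [int(part) for part in detail[len(prefix):].split(",")]
    return ids


if __name__ == "__main__":
    # Trimmed copy of the output above.
    stages = [
        {"info": "Unsafe recovery enters force leader stage"},
        {"info": "Unsafe recovery finished",
         "details": ["affected table ids: 73, 77, 68, 70"]},
    ]
    print(recovery_finished(stages), affected_table_ids(stages))
```

The affected table IDs are a useful starting list for the consistency checks in the next step, since those are the tables whose Regions were forced to recover from a single replica.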

7. Query data

MySQL [test]> select count(*) from t3;
+----------+
| count(*) |
+----------+
|  3271488 |
+----------+
1 row in set (0.50 sec)

MySQL [test]> admin check table t3;
Query OK, 0 rows affected (0.00 sec)

MySQL [test]>

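Summary:

1. Configure PD scheduling so that data stays highly available through unexpected outages. Using multiple label levels, such as data center, rack, and host, reduces the risk of losing data.

2. Before v6.1, recovering from the loss of multiple replicas is relatively tedious and needs a lot of manual intervention. From v6.1 on, recovery is comparatively simple. If possible, upgrade to v6.1 so that the cluster can be recovered quickly.

Tags: data consistency, TiDB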
