Recovering a 3-Replica Cluster That Has Lost 2 Replicas with unsafe-recover

Community contribution · 2024-03-11



In TiDB, depending on the replica rules the user has defined, a piece of data may be stored on several nodes at once, so that reads and writes are unaffected when a single node or a minority of nodes goes offline or is damaged. However, when a majority or all of a Region's replicas go offline within a short window, that Region becomes temporarily unavailable and can no longer serve reads or writes.


If the majority of replicas of a piece of data suffer permanent damage (such as disk failure) and the nodes can never come back online, that data stays unavailable indefinitely. In this situation, if the user wants the cluster back in service and can tolerate data rollback or data loss, the user can in principle manually remove the unavailable replicas so that the Region re-forms a majority, after which the upper layers can once again write to and read from this data shard (the data may be stale, or empty).
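On v5.3 and later (the cluster below runs v5.4.3), this manual removal can also be driven entirely from PD via Online Unsafe Recovery, which is still experimental before v6.1. The following is a minimal sketch rather than part of the original walkthrough: it assumes pd-ctl is invoked through tiup, that the TLS certificate paths are the ones shown in the cluster display below, and that the failed store IDs have already been identified.

# Mark the lost stores as unrecoverable; PD then strips their peers from
# every affected Region so the surviving replicas can re-form a majority.
# Replace <failed-store-ids> with a comma-separated list, e.g. 1,5002.
tiup ctl:v5.4.3 pd -u https://10.2.103.116:32379 \
    --cacert /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt \
    --cert /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt \
    --key /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem \
    unsafe remove-failed-stores <failed-store-ids>
# Progress can then be polled (same flags) with: unsafe remove-failed-stores show

The rest of this article demonstrates the failure itself; the tikv-ctl variant of the repair is sketched at the end.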

Cluster information

[tidb@vm116 ~]$ tiup cluster display tidb-prd
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-prd
Cluster type:       tidb
Cluster name:       tidb-prd
Cluster version:    v5.4.3
Deploy user:        tidb
SSH type:           builtin
TLS encryption:     enabled
CA certificate:     /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt
Client private key: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem
Client certificate: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt
Dashboard URL:      https://10.2.103.116:32379/dashboard
Grafana URL:        http://10.2.103.116:5000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------   --------                            ----------
10.2.103.116:9793   alertmanager  10.2.103.116  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.116:5000   grafana       10.2.103.116  5000         linux/x86_64  Up       -                                   /data1/tidb-deploy/grafana-5000
10.2.103.116:32379  pd            10.2.103.116  32379/3380   linux/x86_64  Up|L|UI  /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.116:9390   prometheus    10.2.103.116  9390/32020   linux/x86_64  Up       /data1/tidb-data/prometheus-9390    /data1/tidb-deploy/prometheus-9390
10.2.103.116:43000  tidb          10.2.103.116  43000/20080  linux/x86_64  Up       -                                   /data1/tidb-deploy/tidb-34000
10.2.103.116:30160  tikv          10.2.103.116  30160/30180  linux/x86_64  Up       /data1/tidb-data/tikv-30160         /data1/tidb-deploy/tikv-30160
10.2.103.116:30162  tikv          10.2.103.116  30162/30182  linux/x86_64  Up       /data1/tidb-data/tikv-30162         /data1/tidb-deploy/tikv-30162
10.2.103.116:30163  tikv          10.2.103.116  30163/30183  linux/x86_64  Up       /data1/tidb-data/tikv-30163         /data1/tidb-deploy/tikv-30163
Total nodes: 8
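Every recovery tool used later identifies TiKV instances by store ID rather than by address, so it helps to record the mapping while the cluster is still healthy. A sketch using pd-ctl's built-in jq filter (version tag and TLS paths taken from the display output above):

# List each store's ID, address and state.
tiup ctl:v5.4.3 pd -u https://10.2.103.116:32379 \
    --cacert /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt \
    --cert /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt \
    --key /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem \
    store --jq='.stores[].store | {id, address, state_name}'

In this walkthrough the mapping is store 1 -> 10.2.103.116:30160, store 5002 -> 10.2.103.116:30162 and store 5001 -> 10.2.103.116:30163, as the TiKV log further down suggests.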

Query the data

MySQL [(none)]> use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MySQL [test]> select count(*) from t3;
+----------+
| count(*) |
+----------+
|  3271488 |
+----------+
1 row in set (0.00 sec)

Simulate a TiKV outage: force scale-in of two TiKV nodes

[tidb@vm116 ~]$ tiup cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force

Forcing scale in is unsafe and may result in data loss for stateful components.
DO NOT use `--force` if you have any component in Pending Offline status.
The process is irreversible and could NOT be cancelled.
Only use `--force` when some of the servers are already permanently offline.
Are you sure to continue? (Type "Yes, I know my data might be lost." to continue): Yes, I know my data might be lost.
This operation will delete the 10.2.103.116:30160,10.2.103.116:30162 nodes in `tidb-prd` and all their data.
Do you want to continue? [y/N]:(default=N) y
The component `[tikv]` will become tombstone, maybe exists in several minutes or hours, after that you can use the prune command to clean it
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.116:30160 10.2.103.116:30162] Force:true SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
Stopping component tikv
        Stopping instance 10.2.103.116
        Stop tikv 10.2.103.116:30160 success
Destroying component tikv
Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/data1/tidb-data/tikv-30160 /data1/tidb-deploy/tikv-30160/log /data1/tidb-deploy/tikv-30160 /etc/systemd/system/tikv-30160.service]
Stopping component tikv
        Stopping instance 10.2.103.116
        Stop tikv 10.2.103.116:30162 success
Destroying component tikv
Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/data1/tidb-data/tikv-30162 /data1/tidb-deploy/tikv-30162/log /data1/tidb-deploy/tikv-30162 /etc/systemd/system/tikv-30162.service]
+ [ Serial ] - UpdateMeta: cluster=tidb-prd, deleted=`10.2.103.116:30160,10.2.103.116:30162`
+ [ Serial ] - UpdateTopology: cluster=tidb-prd
+ Refresh instance configs
  - Generate config pd -> 10.2.103.116:32379 ... Done
  - Generate config tikv -> 10.2.103.116:30163 ... Done
  - Generate config tidb -> 10.2.103.116:43000 ... Done
  - Generate config prometheus -> 10.2.103.116:9390 ... Done
  - Generate config grafana -> 10.2.103.116:5000 ... Done
  - Generate config alertmanager -> 10.2.103.116:9793 ... Done
+ Reload prometheus and grafana
  - Reload prometheus -> 10.2.103.116:9390 ... Done
  - Reload grafana -> 10.2.103.116:5000 ... Done
Scaled cluster `tidb-prd` in successfully
[tidb@vm116 ~]$
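With two of the three TiKV stores destroyed, any Region that kept two of its three replicas on those stores has lost its Raft majority. The TiDB disaster-recovery runbook's pd-ctl check lists exactly those Regions; a sketch with this cluster's store IDs substituted in (1 and 5002, confirmed by the TiKV log below; the TLS flags shown earlier are omitted here for brevity but are still required):

# List Regions where at least a majority of peers sat on the failed stores.
tiup ctl:v5.4.3 pd -u https://10.2.103.116:32379 \
    region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,5002) then . else empty end) | length>=$total-length)}'

Every Region this prints will stay unavailable until its dead peers are removed.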

Querying the data now fails

MySQL [test]> select count(*) from t3;
ERROR 9005 (HY000): Region is unavailable
MySQL [test]>
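ERROR 9005 means the query touched at least one Region that no longer has a live leader. Which Regions back the table can be checked with standard TiDB SQL; an illustrative check, not from the original transcript:

MySQL [test]> SHOW TABLE t3 REGIONS;

The PEERS and LEADER_STORE_ID columns show where each Region's replicas and leader live, which makes it easy to see which Regions depended on the destroyed stores.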

Error log

[2023/03/24 11:08:45.587 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8e640: Retry in 1000 milliseconds"]
[2023/03/24 11:08:45.587 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=1]
[2023/03/24 11:08:45.587 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:46.180 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:46.180 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoint"] [store_id=5002] [addr=10.2.103.116:30162]
[2023/03/24 11:08:46.181 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1679627326.181128634\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30162\"}"]
[2023/03/24 11:08:46.181 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8ed40: Retry in 1000 milliseconds"]
[2023/03/24 11:08:46.181 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30162] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=5002]
[2023/03/24 11:08:46.181 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:48.162 +08:00] [WARN] [endpoint.rs:606] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7001, leader may None\" not_leader { region_id: 7001 }"]
[2023/03/24 11:08:48.174 +08:00] [WARN] [endpoint.rs:606] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7001, leader may None\" not_leader { region_id: 7001 }"]
[2023/03/24 11:08:50.587 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:50.587 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoint"] [store_id=1] [addr=10.2.103.116:30160]
[2023/03/24 11:08:50.588 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1679627330.588107444\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30160\"}"]
[2023/03/24 11:08:50.588 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8f440: Retry in 1000 milliseconds"]
[2023/03/24 11:08:50.588 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=1]
[2023/03/24 11:08:50.588 +08:00] [INFO] [store.rs:2580] ["broadcasting unreachable"] [unreachable_store_id=1] [store_id=5001]
[2023/03/24 11:08:50.588 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:51.181 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:51.181 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoin
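The log confirms the two dead stores are store 1 (port 30160) and store 5002 (port 30162), while the surviving store 5001 keeps retrying connections to them. From here the tikv-ctl unsafe-recover route named in the title can repair the cluster. Since the original transcript of this step is cut off above, the following is a hedged reconstruction based on the cluster layout shown earlier, not the author's verbatim commands; back up /data1/tidb-data/tikv-30163 before attempting it.

# 1. Stop the surviving TiKV so tikv-ctl can open its data directory locally.
tiup cluster stop tidb-prd -N 10.2.103.116:30163

# 2. For every Region held by this store, drop the peers that lived on the
#    lost stores 1 and 5002, so the surviving replica forms a majority of
#    the shrunken peer list.
tiup ctl:v5.4.3 tikv --data-dir /data1/tidb-data/tikv-30163 \
    unsafe-recover remove-fail-stores -s 1,5002 --all-regions

# 3. Bring the store back and re-run the query.
tiup cluster start tidb-prd -N 10.2.103.116:30163

If select count(*) from t3 afterwards returns fewer than the original 3,271,488 rows, that gap is exactly the data loss the scale-in warning referred to.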

