Recovering a 3-Replica Cluster That Has Lost 2 Replicas with unsafe-recover - PingCAP


Community contribution 656 2024-03-11

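In TiDB, depending on the replica rules defined by the user, a piece of data may be stored on several nodes at the same time, so that reads and writes are unaffected when a single node or a minority of nodes is temporarily offline or damaged. However, when a majority or all of a Region's replicas go offline within a short period, the Region becomes temporarily unavailable and cannot serve reads or writes.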


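If a majority of the replicas of a range of data are permanently damaged (for example, by a disk failure) so that their nodes cannot come back online, that data stays unavailable. In this case, if the user wants the cluster to return to normal use and can tolerate data rollback or data loss, the user can in principle manually remove the unavailable replicas so that the Region forms a majority again. The application layer can then write to and read from this data shard, although the data it reads may be stale or empty.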
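The majority rule described above can be sketched in a few lines of Python. This is only an illustrative model, not TiKV's implementation: the function names `majority`, `has_quorum`, and `remove_failed_peers` are hypothetical, and the peer addresses are the three TiKV instances of the test cluster used in this article.

```python
def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a Raft majority."""
    return replica_count // 2 + 1


def has_quorum(peers: set, alive: set) -> bool:
    """A Region can elect a leader only if a majority of its peers are alive."""
    return len(peers & alive) >= majority(len(peers))


def remove_failed_peers(peers: set, alive: set) -> set:
    """Hypothetical model of the manual repair: drop peers that will never return."""
    return peers & alive


# Three replicas, two of them permanently lost.
peers = {"10.2.103.116:30160", "10.2.103.116:30162", "10.2.103.116:30163"}
alive = {"10.2.103.116:30163"}

before = has_quorum(peers, alive)           # 1 of 3 alive, 2 needed
repaired = remove_failed_peers(peers, alive)
after = has_quorum(repaired, alive)         # 1 of 1 alive, 1 needed

print(before, after)  # prints: False True
```

The model shows why the repair trades safety for availability: the surviving replica becomes a majority of one, but nothing guarantees that it holds the latest writes.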

Cluster information

[tidb@vm116 ~]$ tiup cluster display tidb-prd
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-prd
Cluster type:       tidb
Cluster name:       tidb-prd
Cluster version:    v5.4.3
Deploy user:        tidb
SSH type:           builtin
TLS encryption:     enabled
CA certificate:     /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt
Client private key: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem
Client certificate: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt
Dashboard URL:      https://10.2.103.116:32379/dashboard
Grafana URL:        http://10.2.103.116:5000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------   --------                            ----------
10.2.103.116:9793   alertmanager  10.2.103.116  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.116:5000   grafana       10.2.103.116  5000         linux/x86_64  Up       -                                   /data1/tidb-deploy/grafana-5000
10.2.103.116:32379  pd            10.2.103.116  32379/3380   linux/x86_64  Up|L|UI  /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.116:9390   prometheus    10.2.103.116  9390/32020   linux/x86_64  Up       /data1/tidb-data/prometheus-9390    /data1/tidb-deploy/prometheus-9390
10.2.103.116:43000  tidb          10.2.103.116  43000/20080  linux/x86_64  Up       -                                   /data1/tidb-deploy/tidb-34000
10.2.103.116:30160  tikv          10.2.103.116  30160/30180  linux/x86_64  Up       /data1/tidb-data/tikv-30160         /data1/tidb-deploy/tikv-30160
10.2.103.116:30162  tikv          10.2.103.116  30162/30182  linux/x86_64  Up       /data1/tidb-data/tikv-30162         /data1/tidb-deploy/tikv-30162
10.2.103.116:30163  tikv          10.2.103.116  30163/30183  linux/x86_64  Up       /data1/tidb-data/tikv-30163         /data1/tidb-deploy/tikv-30163
Total nodes: 8

Query the data

MySQL [(none)]> use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
MySQL [test]> select count(*) from t3;
+----------+
| count(*) |
+----------+
|  3271488 |
+----------+
1 row in set (0.00 sec)

Simulate a TiKV outage by forcibly scaling in two TiKV nodes

[tidb@vm116 ~]$ tiup cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force
[block-character banner]
Forcing scalein is unsafe and may result in data loss for stateful components.
DO NOT use `--force` if you have any component in Pending Offline status.
The process is irreversible and could NOT be cancelled.
Only use `--force` when some of the servers are already permanently offline.
Are you sure to continue
(Type "Yes, I know my data might be lost." to continue)
: Yes, I know my data might be lost.
This operation will delete the 10.2.103.116:30160,10.2.103.116:30162 nodes in `tidb-prd` and all their data.
Do you want to continue[y/N]:(default=N) y
The component `[tikv]` will become tombstone, maybe exists in several minutes or hours, after that you can use the prune command to clean it
Do you want to continue[y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.116:30160 10.2.103.116:30162] Force:true SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
Stopping component tikv
  Stopping instance 10.2.103.116
  Stop tikv 10.2.103.116:30160 success
Destroying component tikv
  Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/data1/tidb-data/tikv-30160 /data1/tidb-deploy/tikv-30160/log /data1/tidb-deploy/tikv-30160 /etc/systemd/system/tikv-30160.service]
Stopping component tikv
  Stopping instance 10.2.103.116
  Stop tikv 10.2.103.116:30162 success
Destroying component tikv
  Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/data1/tidb-data/tikv-30162 /data1/tidb-deploy/tikv-30162/log /data1/tidb-deploy/tikv-30162 /etc/systemd/system/tikv-30162.service]
+ [ Serial ] - UpdateMeta: cluster=tidb-prd, deleted=`10.2.103.116:30160,10.2.103.116:30162`
+ [ Serial ] - UpdateTopology: cluster=tidb-prd
+ Refresh instance configs
  - Generate config pd -> 10.2.103.116:32379 ... Done
  - Generate config tikv -> 10.2.103.116:30163 ... Done
  - Generate config tidb -> 10.2.103.116:43000 ... Done
  - Generate config prometheus -> 10.2.103.116:9390 ... Done
  - Generate config grafana -> 10.2.103.116:5000 ... Done
  - Generate config alertmanager -> 10.2.103.116:9793 ... Done
+ Reload prometheus and grafana
  - Reload prometheus -> 10.2.103.116:9390 ... Done
  - Reload grafana -> 10.2.103.116:5000 ... Done
Scaled cluster `tidb-prd` in successfully
[tidb@vm116 ~]$

The query now fails

MySQL [test]> select count(*) from t3;
ERROR 9005 (HY000): Region is unavailable
MySQL [test]>

Error log

[2023/03/24 11:08:45.587 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8e640: Retry in 1000 milliseconds"]
[2023/03/24 11:08:45.587 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=1]
[2023/03/24 11:08:45.587 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:46.180 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:46.180 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoint"] [store_id=5002] [addr=10.2.103.116:30162]
[2023/03/24 11:08:46.181 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1679627326.181128634\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30162\"}"]
[2023/03/24 11:08:46.181 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8ed40: Retry in 1000 milliseconds"]
[2023/03/24 11:08:46.181 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30162] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=5002]
[2023/03/24 11:08:46.181 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:48.162 +08:00] [WARN] [endpoint.rs:606] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7001, leader may None\" not_leader { region_id: 7001 }"]
[2023/03/24 11:08:48.174 +08:00] [WARN] [endpoint.rs:606] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7001, leader may None\" not_leader { region_id: 7001 }"]
[2023/03/24 11:08:50.587 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:50.587 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoint"] [store_id=1] [addr=10.2.103.116:30160]
[2023/03/24 11:08:50.588 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1679627330.588107444\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30160\"}"]
[2023/03/24 11:08:50.588 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8f440: Retry in 1000 milliseconds"]
[2023/03/24 11:08:50.588 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=1]
[2023/03/24 11:08:50.588 +08:00] [INFO] [store.rs:2580] ["broadcasting unreachable"] [unreachable_store_id=1] [store_id=5001]
[2023/03/24 11:08:50.588 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:51.181 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:51.181 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoin
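Any recovery of this kind presumably starts by working out which stores are gone. As a rough sketch, the `connection abort` entries in the log above can be reduced to a map of store ID to address with a regular expression. The helper `failed_stores` and the pattern `ABORT_RE` are hypothetical names introduced here, and the sample text is copied from three of the log entries above; the pattern assumes the entry layout shown there and is not a general TiKV log parser.

```python
import re

# Matches the address and store ID in raft_client "connection abort" entries.
# The longer "connection aborted" entries do not match, because the pattern
# requires the closing quote directly after "abort".
ABORT_RE = re.compile(r'\["connection abort"\] \[addr=([^\]]+)\] \[store_id=(\d+)\]')


def failed_stores(log_text: str) -> dict:
    """Map store ID -> address for every store the log reports as unreachable."""
    return {int(store_id): addr for addr, store_id in ABORT_RE.findall(log_text)}


sample = (
    '[2023/03/24 11:08:45.587 +08:00] [ERROR] [raft_client.rs:776] '
    '["connection abort"] [addr=10.2.103.116:30160] [store_id=1]\n'
    '[2023/03/24 11:08:46.181 +08:00] [ERROR] [raft_client.rs:776] '
    '["connection abort"] [addr=10.2.103.116:30162] [store_id=5002]\n'
    '[2023/03/24 11:08:50.588 +08:00] [ERROR] [raft_client.rs:776] '
    '["connection abort"] [addr=10.2.103.116:30160] [store_id=1]\n'
)

stores = failed_stores(sample)
print(stores)  # prints: {1: '10.2.103.116:30160', 5002: '10.2.103.116:30162'}
```

For this log the result is stores 1 and 5002, which correspond to the two TiKV instances that were forcibly scaled in; the `store_id=5001` field in the `broadcasting unreachable` entry appears to identify the surviving store that wrote the log.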
