In TiDB, a piece of data may be stored on several nodes at the same time according to the user-defined replica rules, so that reads and writes are unaffected when a single node, or a minority of nodes, goes offline or is damaged temporarily. However, when the majority or all of a Region's replicas go offline within a short period of time, that Region becomes temporarily unavailable and can no longer serve reads or writes.
If the majority of the replicas of a piece of data suffer permanent damage (for example, disk failure) and the nodes cannot be brought back online, the data stays unavailable. In that case, if the user wants to bring the cluster back into normal service and can tolerate data rollback or data loss, the user can in theory manually remove the unavailable replicas so that the Region re-forms a majority, which lets the upper-layer application read and write this data shard again (the data may be stale, or even empty).
Cluster information
[tidb@vm116 ~]$
[tidb@vm116 ~]$ tiup cluster display tidb-prd
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-prd
Cluster type:       tidb
Cluster name:       tidb-prd
Cluster version:    v5.4.3
Deploy user:        tidb
SSH type:           builtin
TLS encryption:     enabled
CA certificate:     /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt
Client private key: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem
Client certificate: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt
Dashboard URL:      https://10.2.103.116:32379/dashboard
Grafana URL:        http://10.2.103.116:5000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                             Deploy Dir
--                  ----          ----          -----        -------       ------   --------                             ----------
10.2.103.116:9793   alertmanager  10.2.103.116  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793   /data1/tidb-deploy/alertmanager-9793
10.2.103.116:5000   grafana       10.2.103.116  5000         linux/x86_64  Up       -                                    /data1/tidb-deploy/grafana-5000
10.2.103.116:32379  pd            10.2.103.116  32379/3380   linux/x86_64  Up|L|UI  /data1/tidb-data/pd-32379            /data1/tidb-deploy/pd-32379
10.2.103.116:9390   prometheus    10.2.103.116  9390/32020   linux/x86_64  Up       /data1/tidb-data/prometheus-9390     /data1/tidb-deploy/prometheus-9390
10.2.103.116:43000  tidb          10.2.103.116  43000/20080  linux/x86_64  Up       -                                    /data1/tidb-deploy/tidb-34000
10.2.103.116:30160  tikv          10.2.103.116  30160/30180  linux/x86_64  Up       /data1/tidb-data/tikv-30160          /data1/tidb-deploy/tikv-30160
10.2.103.116:30162  tikv          10.2.103.116  30162/30182  linux/x86_64  Up       /data1/tidb-data/tikv-30162          /data1/tidb-deploy/tikv-30162
10.2.103.116:30163  tikv          10.2.103.116  30163/30183  linux/x86_64  Up       /data1/tidb-data/tikv-30163          /data1/tidb-deploy/tikv-30163
Total nodes: 8
Query the data
MySQL [(none)]> use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MySQL [test]> select count(*) from t3;
+----------+
| count(*) |
+----------+
|  3271488 |
+----------+
1 row in set (0.00 sec)
Simulate a TiKV outage: force scale-in of two TiKV nodes at the same time
[tidb@vm116 ~]$ tiup cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force
██ ██ █████ ██████ ███ ██ ██ ███ ██ ██████
██ ██ ██ ██ ██ ██ ████ ██ ██ ████ ██ ██
██ █ ██ ███████ ██████ ██ ██ ██ ██ ██ ██ ██ ██ ███
██ ███ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██
███ ███ ██ ██ ██ ██ ██ ████ ██ ██ ████ ██████
Forcing scale in is unsafe and may result in data loss for stateful components.
DO NOT use `--force` if you have any component in Pending Offline status.
The process is irreversible and could NOT be cancelled.
Only use `--force` when some of the servers are already permanently offline.
Are you sure to continue?
(Type "Yes, I know my data might be lost." to continue): Yes, I know my data might be lost.
This operation will delete the 10.2.103.116:30160,10.2.103.116:30162 nodes in `tidb-prd` and all their data.
Do you want to continue? [y/N]:(default=N) y
The component `[tikv]` will become tombstone, maybe exists in several minutes or hours, after that you can use the prune command to clean it
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.116:30160 10.2.103.116:30162] Force:true SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
Stopping component tikv
        Stopping instance 10.2.103.116
        Stop tikv 10.2.103.116:30160 success
Destroying component tikv
        Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/data1/tidb-data/tikv-30160 /data1/tidb-deploy/tikv-30160/log /data1/tidb-deploy/tikv-30160 /etc/systemd/system/tikv-30160.service]
Stopping component tikv
        Stopping instance 10.2.103.116
        Stop tikv 10.2.103.116:30162 success
Destroying component tikv
        Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/data1/tidb-data/tikv-30162 /data1/tidb-deploy/tikv-30162/log /data1/tidb-deploy/tikv-30162 /etc/systemd/system/tikv-30162.service]
+ [ Serial ] - UpdateMeta: cluster=tidb-prd, deleted=`10.2.103.116:30160,10.2.103.116:30162`
+ [ Serial ] - UpdateTopology: cluster=tidb-prd
+ Refresh instance configs
  - Generate config pd -> 10.2.103.116:32379 ... Done
  - Generate config tikv -> 10.2.103.116:30163 ... Done
  - Generate config tidb -> 10.2.103.116:43000 ... Done
  - Generate config prometheus -> 10.2.103.116:9390 ... Done
  - Generate config grafana -> 10.2.103.116:5000 ... Done
  - Generate config alertmanager -> 10.2.103.116:9793 ... Done
+ Reload prometheus and grafana
  - Reload prometheus -> 10.2.103.116:9390 ... Done
  - Reload grafana -> 10.2.103.116:5000 ... Done
Scaled cluster `tidb-prd` in successfully
[tidb@vm116 ~]$
Querying the data now fails
MySQL [test]> select count(*) from t3;
ERROR 9005 (HY000): Region is unavailable
MySQL [test]>
Error logs
[2023/03/24 11:08:45.587 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8e640: Retry in 1000 milliseconds"]
[2023/03/24 11:08:45.587 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=1]
[2023/03/24 11:08:45.587 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:46.180 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:46.180 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoint"] [store_id=5002] [addr=10.2.103.116:30162]
[2023/03/24 11:08:46.181 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1679627326.181128634\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30162\"}"]
[2023/03/24 11:08:46.181 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8ed40: Retry in 1000 milliseconds"]
[2023/03/24 11:08:46.181 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30162] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=5002]
[2023/03/24 11:08:46.181 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:48.162 +08:00] [WARN] [endpoint.rs:606] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7001, leader may None\" not_leader { region_id: 7001 }"]
[2023/03/24 11:08:48.174 +08:00] [WARN] [endpoint.rs:606] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7001, leader may None\" not_leader { region_id: 7001 }"]
[2023/03/24 11:08:50.587 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:50.587 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoint"] [store_id=1] [addr=10.2.103.116:30160]
[2023/03/24 11:08:50.588 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1679627330.588107444\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30160\"}"]
[2023/03/24 11:08:50.588 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8f440: Retry in 1000 milliseconds"]
[2023/03/24 11:08:50.588 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=1]
[2023/03/24 11:08:50.588 +08:00] [INFO] [store.rs:2580] ["broadcasting unreachable"] [unreachable_store_id=1] [store_id=5001]
[2023/03/24 11:08:50.588 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:51.181 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:51.181 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoint"] [store_id=5002] [addr=10.2.103.116:30162]
[2023/03/24 11:08:51.182 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1679627331.182361851\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30162\"}"]
[2023/03/24 11:08:51.182 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8fb40: Retry in 999 milliseconds"]
[2023/03/24 11:08:51.182 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30162] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=5002]
[2023/03/24 11:08:51.182 +08:00] [INFO] [store.rs:2580] ["broadcasting unreachable"] [unreachable_store_id=5002] [store_id=5001]
[2023/03/24 11:08:51.182 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30162] [store_id=5002]
^C
[tidb@vm116 log]$
Preparing the repair (versions before v6.1)
1. Pause PD scheduling
[tidb@vm116 ~]$ tiup ctl:v5.4.3 pd -u "https://10.2.103.116:32379" --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.4.3/ctl pd -u https://10.2.103.116:32379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
» config set region-schedule-limit 0
Success!
» config set replica-schedule-limit 0
Success!
» config set leader-schedule-limit 0
Success!
» config set merge-schedule-limit 0
Success!
» config set hot-region-schedule-limit 0
Success!
»
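The same limits can also be set non-interactively, which makes it easier to script the pause and to restore the previous values later. A minimal sketch, assuming pd-ctl accepts its subcommands as one-shot arguments the same way it does at the interactive prompt; /tmp/pd-schedule-config.before.json is just an illustrative file name:

PD="https://10.2.103.116:32379"
TLS_DIR=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls
PDCTL="tiup ctl:v5.4.3 pd -u $PD --cacert=$TLS_DIR/ca.crt --key=$TLS_DIR/client.pem --cert=$TLS_DIR/client.crt"

# Record the current scheduling config so it can be restored in step 8
$PDCTL config show > /tmp/pd-schedule-config.before.json

# Disable all scheduling for the duration of the recovery
for limit in region-schedule-limit replica-schedule-limit leader-schedule-limit \
             merge-schedule-limit hot-region-schedule-limit; do
  $PDCTL config set "$limit" 0
done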
2. Check replicas
Use pd-ctl to find the Regions that have at least half of their replicas on the failed stores. Requirement: PD must be running.
» region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,5002) then . else empty end) | length>=$total-length) }"
{"id":2003,"peer_stores":[1,5002,5001]}
{"id":7001,"peer_stores":[1,5002,5001]}
{"id":7005,"peer_stores":[1,5002,5001]}
{"id":7009,"peer_stores":[1,5002,5001]}
»
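The same check can be run from the shell, which makes it easy to re-run with different failed store IDs. A sketch, assuming pd-ctl takes the --jq flag in one-shot mode just as it does at the interactive prompt; store IDs 1 and 5002 are the failed stores of this example:

PD="https://10.2.103.116:32379"
TLS_DIR=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls
FAILED_STORES='(1,5002)'   # failed store IDs, in jq's alternative-value syntax

# Regions with at least half of their replicas on the failed stores
JQ_FILTER=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as \$total | map(if .==${FAILED_STORES} then . else empty end) | length>=\$total-length) }"
tiup ctl:v5.4.3 pd -u "$PD" --cacert=$TLS_DIR/ca.crt --key=$TLS_DIR/client.pem --cert=$TLS_DIR/client.crt \
  region --jq="$JQ_FILTER"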
3. Stop the TiKV that needs to be repaired
[tidb@vm116 ~]$ tiup cluster display tidb-prd
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-prd
Cluster type:       tidb
Cluster name:       tidb-prd
Cluster version:    v5.4.3
Deploy user:        tidb
SSH type:           builtin
TLS encryption:     enabled
CA certificate:     /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt
Client private key: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem
Client certificate: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt
Dashboard URL:      https://10.2.103.116:32379/dashboard
Grafana URL:        http://10.2.103.116:5000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                             Deploy Dir
--                  ----          ----          -----        -------       ------   --------                             ----------
10.2.103.116:9793   alertmanager  10.2.103.116  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793   /data1/tidb-deploy/alertmanager-9793
10.2.103.116:5000   grafana       10.2.103.116  5000         linux/x86_64  Up       -                                    /data1/tidb-deploy/grafana-5000
10.2.103.116:32379  pd            10.2.103.116  32379/3380   linux/x86_64  Up|L|UI  /data1/tidb-data/pd-32379            /data1/tidb-deploy/pd-32379
10.2.103.116:9390   prometheus    10.2.103.116  9390/32020   linux/x86_64  Up       /data1/tidb-data/prometheus-9390     /data1/tidb-deploy/prometheus-9390
10.2.103.116:43000  tidb          10.2.103.116  43000/20080  linux/x86_64  Up       -                                    /data1/tidb-deploy/tidb-34000
10.2.103.116:30163  tikv          10.2.103.116  30163/30183  linux/x86_64  Up       /data1/tidb-data/tikv-30163          /data1/tidb-deploy/tikv-30163
Total nodes: 6
[tidb@vm116 ~]$ tiup cluster stop tidb-prd -N 10.2.103.116:30163
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster stop tidb-prd -N 10.2.103.116:30163
Will stop the cluster tidb-prd with nodes: 10.2.103.116:30163, roles: .
Do you want to continue? [y/N]:(default=N) y
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [ Serial ] - StopCluster
Stopping component tikv
        Stopping instance 10.2.103.116
        Stop tikv 10.2.103.116:30163 success
Stopping component node_exporter
Stopping component blackbox_exporter
Stopped cluster `tidb-prd` successfully
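In this test cluster only one TiKV instance survives, so a single -N target is enough. On a cluster with several surviving TiKV nodes, they all need to be stopped before unsafe-recover is run on each of them; a sketch using tiup's role filter, assuming the failed instances have already been removed from the topology by the forced scale-in so that -R tikv matches only the survivors:

# Stop every TiKV instance still present in the topology
tiup cluster stop tidb-prd -R tikv
# Confirm they are Down before running tikv-ctl
tiup cluster display tidb-prd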
4. Run unsafe-recover
On every instance that was not hit by the failure, remove from all Regions every Peer located on the failed stores. Requirements: run this on every surviving machine, and the TiKV processes must be stopped.
[tidb@vm116 v5.4.3]$ ./tikv-ctl --data-dir /data1/tidb-data/tikv-30163 unsafe-recover remove-fail-stores -s 1,5002 --all-regions
[2023/03/24 11:25:47.978 +08:00] [WARN] [config.rs:612] ["compaction guard is disabled due to region info provider not available"]
[2023/03/24 11:25:47.978 +08:00] [WARN] [config.rs:715] ["compaction guard is disabled due to region info provider not available"]
removing stores [1, 5002] from configurations...
success
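On a multi-node cluster the same remove-fail-stores command has to be run against the data directory of every surviving TiKV. A minimal sketch; the host list, the tikv-ctl path and the data directories below are placeholders for illustration, and the store IDs are the failed stores of this example:

# Placeholders: adjust hosts, tikv-ctl path and data dirs to your deployment
FAILED_STORES="1,5002"
TIKV_CTL=/home/tidb/tikv-ctl
for target in "10.2.103.116:/data1/tidb-data/tikv-30163"; do
  host=${target%%:*}
  dir=${target#*:}
  ssh tidb@"$host" "$TIKV_CTL --data-dir $dir unsafe-recover remove-fail-stores -s $FAILED_STORES --all-regions"
done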
5. Start the repaired TiKV
[tidb@vm116 v5.4.3]$ tiup cluster start tidb-prd -N 10.2.103.116:30163
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster start tidb-prd -N 10.2.103.116:30163
Starting cluster tidb-prd...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [ Serial ] - StartCluster
Starting component tikv
        Starting instance 10.2.103.116:30163
        Start instance 10.2.103.116:30163 success
Starting component node_exporter
        Starting instance 10.2.103.116
        Start 10.2.103.116 success
Starting component blackbox_exporter
        Starting instance 10.2.103.116
        Start 10.2.103.116 success
+ [ Serial ] - UpdateTopology: cluster=tidb-prd
Started cluster `tidb-prd` successfully
6. Check Region leaders
Use pd-ctl to check for Regions that have no leader. Requirement: PD must be running.
[tidb@vm116 v5.4.3]$ tiup ctl:v5.4.3 pd -u "https://10.2.103.116:32379" --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.4.3/ctl pd -u https://10.2.103.116:32379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
» region --jq '.regions[]|select(has("leader")|not)|{id: .id, peer_stores: [.peers[].store_id]}'
»
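The next step should only be attempted once this query comes back empty. A small polling sketch over the same filter, run from the shell (grep '"id"' is just a cheap way to tell whether any leaderless Region was printed):

PD="https://10.2.103.116:32379"
TLS_DIR=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls
PDCTL="tiup ctl:v5.4.3 pd -u $PD --cacert=$TLS_DIR/ca.crt --key=$TLS_DIR/client.pem --cert=$TLS_DIR/client.crt"

while true; do
  out=$($PDCTL region --jq '.regions[]|select(has("leader")|not)|{id: .id, peer_stores: [.peers[].store_id]}' | grep '"id"' || true)
  if [ -z "$out" ]; then
    echo "no leaderless regions left"
    break
  fi
  echo "still leaderless:"; echo "$out"; sleep 5
done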
7. Check data and index consistency. Requirement: PD, TiKV, and TiDB must be running.
MySQL [test]> select count(*) from t3;
+----------+
| count(*) |
+----------+
|  3271488 |
+----------+
1 row in set (0.55 sec)

MySQL [test]>
MySQL [test]> admin check table t3;
Query OK, 0 rows affected (0.00 sec)

MySQL [test]>
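In a real incident more than one table is usually affected, so it is worth running ADMIN CHECK TABLE over every table in the impacted schemas rather than just t3. A sketch; the connection parameters (root user, no password) are assumptions for this lab setup:

# Check every table in the `test` schema (adjust host/port/user to your environment)
MYSQL="mysql -h 10.2.103.116 -P 43000 -u root"
for t in $($MYSQL -N -B -e "select table_name from information_schema.tables where table_schema='test'"); do
  echo "admin check table test.$t"
  $MYSQL -e "admin check table test.$t"
done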
8. Restore scheduling
[tidb@vm116 ~]$ tiup ctl:v5.4.3 pd -u "https://10.2.103.116:32379" --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.4.3/ctl pd -u https://10.2.103.116:32379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
» config set region-schedule-limit 2000
Success!
» config set replica-schedule-limit 32
Success!
» config set leader-schedule-limit 8
Success!
» config set merge-schedule-limit 16
Success!
» config set hot-region-schedule-limit 2
Success!
»
Exceptional cases
1. No such region
./tikv-ctl --data-dir /data3/tidb/data unsafe-recover remove-fail-stores -s 1 -r 50377
[INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[INFO] [mod.rs:479] ["encryption is disabled."]
[WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
removing stores [1] from configurations...
Debugger::remove_fail_stores: "No such region 50377 on the store"
2. Create an empty Region to resolve the Unavailable error
Requirements: PD must be running, and the target TiKV of the command must be stopped.
./tikv-ctl --ca-path /data3/tidb/deploy/tls/ca.crt --key-path /data3/tidb/deploy/tls/tikv.pem --cert-path /data3/tidb/deploy/tls/tikv.crt --data-dir /data3/tidb/data recreate-region -p https://10.2.103.116:32379 -r 50377
[INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[INFO] [mod.rs:479] ["encryption is disabled."]
initing empty region with peer_id ...
success
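recreate-region is only needed for Regions that lost every replica, because unsafe-recover has no surviving peer to promote for them. A sketch of the companion pd-ctl filter that lists such Regions, using the failed stores of this example (1 and 5002) and the same one-shot pd-ctl invocation assumed earlier:

# Regions whose peers were ALL on the failed stores; these are the recreate-region candidates
PD="https://10.2.103.116:32379"
TLS_DIR=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls
tiup ctl:v5.4.3 pd -u "$PD" --cacert=$TLS_DIR/ca.crt --key=$TLS_DIR/client.pem --cert=$TLS_DIR/client.crt \
  region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as \$total | map(if .==(1,5002) then . else empty end) | length>=\$total) }"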
Repair on v6.1
1. Cluster information
[tidb@vm116 ~]$ tiup cluster display tidb-prd
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-prd
Cluster type:       tidb
Cluster name:       tidb-prd
Cluster version:    v6.1.5
Deploy user:        tidb
SSH type:           builtin
TLS encryption:     enabled
CA certificate:     /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt
Client private key: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem
Client certificate: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt
Dashboard URL:      https://10.2.103.116:32379/dashboard
Grafana URL:        http://10.2.103.116:5000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                             Deploy Dir
--                  ----          ----          -----        -------       ------   --------                             ----------
10.2.103.116:9793   alertmanager  10.2.103.116  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793   /data1/tidb-deploy/alertmanager-9793
10.2.103.116:5000   grafana       10.2.103.116  5000         linux/x86_64  Up       -                                    /data1/tidb-deploy/grafana-5000
10.2.103.116:32379  pd            10.2.103.116  32379/3380   linux/x86_64  Up|L|UI  /data1/tidb-data/pd-32379            /data1/tidb-deploy/pd-32379
10.2.103.116:9390   prometheus    10.2.103.116  9390/32020   linux/x86_64  Up       /data1/tidb-data/prometheus-9390     /data1/tidb-deploy/prometheus-9390
10.2.103.116:43000  tidb          10.2.103.116  43000/20080  linux/x86_64  Up       -                                    /data1/tidb-deploy/tidb-34000
10.2.103.116:30160  tikv          10.2.103.116  30160/30180  linux/x86_64  Up       /data1/tidb-data/tikv-30160          /data1/tidb-deploy/tikv-30160
10.2.103.116:30162  tikv          10.2.103.116  30162/30182  linux/x86_64  Up       /data1/tidb-data/tikv-30162          /data1/tidb-deploy/tikv-30162
10.2.103.116:30163  tikv          10.2.103.116  30163/30183  linux/x86_64  Up       /data1/tidb-data/tikv-30163          /data1/tidb-deploy/tikv-30163
Total nodes: 8
2. Force scale-in of two TiKV nodes
[tidb@vm116 ~]$ tiup cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force
██ ██ █████ ██████ ███ ██ ██ ███ ██ ██████
██ ██ ██ ██ ██ ██ ████ ██ ██ ████ ██ ██
██ █ ██ ███████ ██████ ██ ██ ██ ██ ██ ██ ██ ██ ███
██ ███ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██
███ ███ ██ ██ ██ ██ ██ ████ ██ ██ ████ ██████
Forcing scale in is unsafe and may result in data loss for stateful components.
DO NOT use `--force` if you have any component in Pending Offline status.
The process is irreversible and could NOT be cancelled.
Only use `--force` when some of the servers are already permanently offline.
Are you sure to continue?
(Type "Yes, I know my data might be lost." to continue): Yes, I know my data might be lost.
This operation will delete the 10.2.103.116:30160,10.2.103.116:30162 nodes in `tidb-prd` and all their data.
Do you want to continue? [y/N]:(default=N) y
The component `[tikv]` will become tombstone, maybe exists in several minutes or hours, after that you can use the prune command to clean it
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.116:30160 10.2.103.116:30162] Force:true SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
failed to delete tikv: error requesting https://10.2.103.116:32379/pd/api/v1/store/7014, response: "[PD:core:ErrStoresNotEnough]can not remove store 7014 since the number of up stores would be 2 while need 3", code 400
Stopping component tikv
        Stopping instance 10.2.103.116
        Stop tikv 10.2.103.116:30162 success
Destroying component tikv
        Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/data1/tidb-data/tikv-30162 /data1/tidb-deploy/tikv-30162/log /data1/tidb-deploy/tikv-30162 /etc/systemd/system/tikv-30162.service]
failed to delete tikv: error requesting https://10.2.103.116:32379/pd/api/v1/store/7013, response: "[PD:core:ErrStoresNotEnough]can not remove store 7013 since the number of up stores would be 2 while need 3", code 400
Stopping component tikv
        Stopping instance 10.2.103.116
        Stop tikv 10.2.103.116:30160 success
Destroying component tikv
        Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/etc/systemd/system/tikv-30160.service /data1/tidb-data/tikv-30160 /data1/tidb-deploy/tikv-30160/log /data1/tidb-deploy/tikv-30160]
+ [ Serial ] - UpdateMeta: cluster=tidb-prd, deleted=`10.2.103.116:30160,10.2.103.116:30162`
+ [ Serial ] - UpdateTopology: cluster=tidb-prd
+ Refresh instance configs
  - Generate config pd -> 10.2.103.116:32379 ... Done
  - Generate config tikv -> 10.2.103.116:30163 ... Done
  - Generate config tidb -> 10.2.103.116:43000 ... Done
  - Generate config prometheus -> 10.2.103.116:9390 ... Done
  - Generate config grafana -> 10.2.103.116:5000 ... Done
  - Generate config alertmanager -> 10.2.103.116:9793 ... Done
+ Reload prometheus and grafana
  - Reload prometheus -> 10.2.103.116:9390 ... Done
  - Reload grafana -> 10.2.103.116:5000 ... Done
Scaled cluster `tidb-prd` in successfully
3. Query store information
[tidb@vm116 ~]$ tiup ctl:v5.4.3 pd -u "https://10.2.103.116:32379" --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.4.3/ctl pd -u https://10.2.103.116:32379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
» store
{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 7014,
        "address": "10.2.103.116:30162",
        "version": "6.1.5",
        "peer_address": "10.2.103.116:30162",
        "status_address": "10.2.103.116:30182",
        "git_hash": "e554126f6e83a6ddc944ddc51746b6def303ec1a",
        "start_timestamp": 1679629145,
        "deploy_path": "/data1/tidb-deploy/tikv-30162/bin",
        "last_heartbeat": 1679629565041246521,
        "node_state": 1,
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "492GiB",
        "available": "424GiB",
        "used_size": "324.6MiB",
        "leader_count": 2,
        "leader_weight": 1,
        "leader_score": 2,
        "leader_size": 178,
        "region_count": 7,
        "region_weight": 1,
        "region_score": 410.17607715843184,
        "region_size": 251,
        "slow_score": 1,
        "start_ts": "2023-03-24T11:39:05+08:00",
        "last_heartbeat_ts": "2023-03-24T11:46:05.041246521+08:00",
        "uptime": "7m0.041246521s"
      }
    },
    {
      "store": {
        "id": 5001,
        "address": "10.2.103.116:30163",
        "version": "6.1.5",
        "peer_address": "10.2.103.116:30163",
        "status_address": "10.2.103.116:30183",
        "git_hash": "e554126f6e83a6ddc944ddc51746b6def303ec1a",
        "start_timestamp": 1679629139,
        "deploy_path": "/data1/tidb-deploy/tikv-30163/bin",
        "last_heartbeat": 1679629599338071366,
        "node_state": 1,
        "state_name": "Up"
      },
      "status": {
        "capacity": "492GiB",
        "available": "435GiB",
        "used_size": "342MiB",
        "leader_count": 5,
        "leader_weight": 1,
        "leader_score": 5,
        "leader_size": 73,
        "region_count": 7,
        "region_weight": 1,
        "region_score": 409.8657305188558,
        "region_size": 251,
        "slow_score": 1,
        "start_ts": "2023-03-24T11:38:59+08:00",
        "last_heartbeat_ts": "2023-03-24T11:46:39.338071366+08:00",
        "uptime": "7m40.338071366s"
      }
    },
    {
      "store": {
        "id": 7013,
        "address": "10.2.103.116:30160",
        "version": "6.1.5",
        "peer_address": "10.2.103.116:30160",
        "status_address": "10.2.103.116:30180",
        "git_hash": "e554126f6e83a6ddc944ddc51746b6def303ec1a",
        "start_timestamp": 1679629155,
        "deploy_path": "/data1/tidb-deploy/tikv-30160/bin",
        "last_heartbeat": 1679629565148211763,
        "node_state": 1,
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "492GiB",
        "available": "424GiB",
        "used_size": "324.6MiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 7,
        "region_weight": 1,
        "region_score": 410.17510527918483,
        "region_size": 251,
        "slow_score": 1,
        "start_ts": "2023-03-24T11:39:15+08:00",
        "last_heartbeat_ts": "2023-03-24T11:46:05.148211763+08:00",
        "uptime": "6m50.148211763s"
      }
    }
  ]
}
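Instead of reading through the whole JSON, the unhealthy stores can be filtered out directly. A sketch against PD's HTTP API (the /pd/api/v1/stores endpoint is the plural form of the /pd/api/v1/store/<id> endpoint seen in the scale-in error above; jq must be installed on the host):

# List every store that is not in the "Up" state
TLS_DIR=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls
curl -s --cacert $TLS_DIR/ca.crt --cert $TLS_DIR/client.crt --key $TLS_DIR/client.pem \
  https://10.2.103.116:32379/pd/api/v1/stores |
  jq '.stores[] | select(.store.state_name != "Up") | {id: .store.id, address: .store.address, state: .store.state_name}'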
4. Querying the data fails
MySQL [test]> select count(*) from t3;
ERROR 9002 (HY000): TiKV server timeout
MySQL [test]>
5. TiKV error logs
[2023/03/24 11:48:47.549 +08:00] [ERROR] [raft_client.rs:824] ["connection abort"] [addr=10.2.103.116:30160] [store_id=7013]
[2023/03/24 11:48:47.813 +08:00] [INFO] [raft.rs:1550] ["starting a new election"] [term=9] [raft_id=7004] [region_id=7001]
[2023/03/24 11:48:47.813 +08:00] [INFO] [raft.rs:1170] ["became pre-candidate at term 9"] [term=9] [raft_id=7004] [region_id=7001]
[2023/03/24 11:48:47.813 +08:00] [INFO] [raft.rs:1299] ["broadcasting vote request"] [to="[250648, 250650]"] [log_index=1863] [log_term=9] [term=9] [type=MsgRequestPreVote] [raft_id=7004] [region_id=7001]
[2023/03/24 11:48:47.927 +08:00] [INFO] [<unknown>] ["subchannel 0x7fae72e89c00 {address=ipv4:10.2.103.116:30162, args=grpc.client_channel_factory=0x7faea78591a8, grpc.default_authority=10.2.103.116:30162, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x7fae00a22260, grpc.internal.security_connector=0x7fae72e0ad80, grpc.internal.subchannel_pool=0x7faea7832d60, grpc.primary_user_agent=grpc-rust/0.10.2, grpc.resource_quota=0x7faea799c630, grpc.server_uri=dns:///10.2.103.116:30162, random id=324}: connect failed: {\"created\":\"@1679629727.927814392\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.1+1.44.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30162\"}"]
[2023/03/24 11:48:47.928 +08:00] [INFO] [<unknown>] ["subchannel 0x7fae72e89c00 {address=ipv4:10.2.103.116:30162, args=grpc.client_channel_factory=0x7faea78591a8, grpc.default_authority=10.2.103.116:30162, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x7fae00a22260, grpc.internal.security_connector=0x7fae72e0ad80, grpc.internal.subchannel_pool=0x7faea7832d60, grpc.primary_user_agent=grpc-rust/0.10.2, grpc.resource_quota=0x7faea799c630, grpc.server_uri=dns:///10.2.103.116:30162, random id=324}: Retry in 999 milliseconds"]
[2023/03/24 11:48:47.928 +08:00] [INFO] [advance.rs:296] ["check leader failed"] [to_store=7014] [error="\"[rpc failed] RpcFailure: 14-UNAVAILABLE failed to connect to all addresses\""]
[2023/03/24 11:48:47.928 +08:00] [INFO] [<unknown>] ["subchannel 0x7fae72eb5c00 {address=ipv4:10.2.103.116:30160, args=grpc.client_channel_factory=0x7faea78591a8, grpc.default_authority=10.2.103.116:30160, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x7fae00a22ee0, grpc.internal.security_connector=0x7fae72fbd640, grpc.internal.subchannel_pool=0x7faea7832d60, grpc.primary_user_agent=grpc-rust/0.10.2, grpc.resource_quota=0x7faea799c630, grpc.server_uri=dns:///10.2.103.116:30160, random id=325}: connect failed: {\"created\":\"@1679629727.928184608\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.1+1.44.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30160\"}"]
[2023/03/24 11:48:47.928 +08:00] [INFO] [<unknown>] ["subchannel 0x7fae72eb5c00 {address=ipv4:10.2.103.116:30160, args=grpc.client_channel_factory=0x7faea78591a8, grpc.default_authority=10.2.103.116:30160, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x7fae00a22ee0, grpc.internal.security_connector=0x7fae72fbd640, grpc.internal.subchannel_pool=0x7faea7832d60, grpc.primary_user_agent=grpc-rust/0.10.2, grpc.resource_quota=0x7faea799c630, grpc.server_uri=dns:///10.2.103.116:30160, random id=325}: Retry in 1000 milliseconds"]
[2023/03/24 11:48:47.928 +08:00] [INFO] [advance.rs:296] ["check leader failed"] [to_store=7013] [error="\"[rpc failed] RpcFailure: 14-UNAVAILABLE failed to connect to all addresses\""]
[2023/03/24 11:48:48.037 +08:00] [WARN] [endpoint.rs:621] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7009, leader may None\" not_leader { region_id: 7009 }"]
[2023/03/24 11:48:48.089 +08:00] [WARN] [endpoint.rs:621] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7001, leader may None\" not_leader { region_id: 7001 }"]
[2023/03/24 11:48:48.409 +08:00] [WARN] [endpoint.rs:621] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 252001, leader may None\" not_leader { region_id: 252001 }"]
^C
[tidb@vm116 log]$
6. PD unsafe recovery
(1) Run the recovery command
[tidb@vm116 ctl]$ tiup ctl:v6.1.5 pd -u "https://10.2.103.116:32379" --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v6.1.5/ctl pd -u https://10.2.103.116:32379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
» unsafe remove-failed-stores 7013,7014
Success!
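The show subcommand used in the next step can also be polled from the shell until PD reports completion; a small sketch:

PD="https://10.2.103.116:32379"
TLS_DIR=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls
PDCTL="tiup ctl:v6.1.5 pd -u $PD --cacert=$TLS_DIR/ca.crt --key=$TLS_DIR/client.pem --cert=$TLS_DIR/client.crt"

# Wait until PD reports "Unsafe recovery finished"
until $PDCTL unsafe remove-failed-stores show | grep -q "Unsafe recovery finished"; do
  echo "unsafe recovery still in progress..."; sleep 10
done
echo "unsafe recovery finished"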
(2) Check recovery progress
» unsafe remove-failed-stores show
[
  {
    "info": "Unsafe recovery enters collect report stage: failed stores 7013, 7014",
    "time": "2023-03-24 11:56:44.910"
  },
  {
    "info": "Unsafe recovery enters force leader stage",
    "time": "2023-03-24 11:56:49.390",
    "actions": {
      "store 5001": [
        "force leader on regions: 7001, 7005, 7009, 252001, 252005, 252009, 2003"
      ]
    }
  },
  {
    "info": "Collecting reports from alive stores(0/1)",
    "time": "2023-03-24 11:57:02.286",
    "details": [
      "Stores that have not dispatched plan: ",
      "Stores that have reported to PD: ",
      "Stores that have not reported to PD: 5001"
    ]
  }
]
(3) Recovery complete
» unsafe remove-failed-stores show
[
  {
    "info": "Unsafe recovery enters collect report stage: failed stores 7013, 7014",
    "time": "2023-03-24 11:56:44.910"
  },
  {
    "info": "Unsafe recovery enters force leader stage",
    "time": "2023-03-24 11:56:49.390",
    "actions": {
      "store 5001": [
        "force leader on regions: 7001, 7005, 7009, 252001, 252005, 252009, 2003"
      ]
    }
  },
  {
    "info": "Unsafe recovery enters demote failed voter stage",
    "time": "2023-03-24 11:57:20.434",
    "actions": {
      "store 5001": [
        "region 7001 demotes peers { id:250648 store_id:7014 }, { id:250650 store_id:7013 }",
        "region 7005 demotes peers { id:250647 store_id:7013 }, { id:250649 store_id:7014 }",
        "region 7009 demotes peers { id:250644 store_id:7014 }, { id:250646 store_id:7013 }",
        "region 252001 demotes peers { id:252003 store_id:7013 }, { id:252004 store_id:7014 }",
        "region 252005 demotes peers { id:252007 store_id:7013 }, { id:252008 store_id:7014 }",
        "region 252009 demotes peers { id:252011 store_id:7013 }, { id:252012 store_id:7014 }",
        "region 2003 demotes peers { id:250643 store_id:7013 }, { id:250645 store_id:7014 }"
      ]
    }
  },
  {
    "info": "Unsafe recovery finished",
    "time": "2023-03-24 11:57:22.443",
    "details": [
      "affected table ids: 73, 77, 68, 70"
    ]
  }
]
7. Query the data
MySQL [test]> select count(*) from t3;
+----------+
| count(*) |
+----------+
|  3271488 |
+----------+
1 row in set (0.50 sec)

MySQL [test]>
MySQL [test]> admin check table t3;
Query OK, 0 rows affected (0.00 sec)

MySQL [test]>
Summary:
1. Try to configure PD scheduling for high availability against unexpected outages: use multiple location labels, for example data center, rack, and host, so that replicas are spread across failure domains and the risk of losing data is reduced (a configuration sketch follows below).
2. Before v6.1, recovering from the loss of multiple replicas was fairly tedious and required a lot of manual intervention; since v6.1, recovery is much simpler. If possible, upgrade to v6.1 or later so that you can recover quickly.
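For point 1, the label hierarchy has to be declared both to PD and on each TiKV instance. A sketch, assuming example label names zone/rack/host; the TiKV side is only indicated in a comment because it is set in the tiup topology file rather than through pd-ctl:

# Tell PD which label levels to use when placing replicas (example label names)
tiup ctl:v6.1.5 pd -u "https://10.2.103.116:32379" \
  --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt \
  --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem \
  --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt \
  config set location-labels zone,rack,host

# Each TiKV instance then needs matching labels, e.g. in the tiup topology file:
#   tikv_servers:
#     - host: 10.2.103.116
#       config:
#         server.labels: { zone: "z1", rack: "r1", host: "h116" }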