记三次升级TiDB集群到v6.1.5的案例与收益

网友投稿 673 2024-03-11

团队升级TiDB版本的事情是规划很久了，迟迟没操作还是因为很多预期意外的问题是否有能力覆盖解决。

本文写的时间是8月底，今天刚好总结的时候看到了分享给大家以作为版本升级的参考。

我们的业务集群TiDB数据量本身是很大，单集群数十TB规模，加之业务的重要性，本着非必要不升级的原则。

随着TiDB版本迭代，很多新的功能特性和性能提升以及我们遇到的一些bug修复，不断吸引着我们作出升级的决定。

文中记录的三则为计划升级16套TiDB集群中遇到的问题，将性能压力大且符合升级需求的TiDB集群升级到v6.1.5。

问题一：升级完成后其中一台TiFlash无法正常重启，一直oom

问题分析过程：

研发侧报错： 1、uncategorized SQLException; SQL state [HY000]; error code [1105]; other error for mpp stream: DB::Exception: Exchange receiver meet error : DB::Exception: Exchange receiver meet error : send async stream request fail; nested exception is java.sql.SQLException: other error for mpp stream: DB::Exception: Exchange receiver meet error : DB::Exception: Exchange receiver meet error : send async stream request fail/span> tidb节点上错误日志： 2、[ERROR] 2023-08-23 ****** [http-nio-8080-exec-6] [com.*****.man.web.framework.exception.GlobalErrHandler] [7c25fb69df2800c5] - 通用异常信息:[ ### Error querying database. Cause: java.sql.SQLException: TiFlash server timeout ### The error may exist in URL [jar:file:/web/servers/tomcat/webapps/ROOT/WEB-INF/lib/***-man-dao-2.0.0-RELEASE.jar!/mapper/tidb/SheetListTidbMapper.xml] ### The error may involve*****.man.dao.tidb.SheetListTidbMapper.selectTidbPageByQuery-Inline ### The error occurred while setting parameters ### SQL: SELECT ****** ; ###Cause: java.sql.SQLException: TiFlash server timeout ; uncategorized SQLException; SQL state [HY000]; error code [9012]; TiFlash server timeout; nested exception is java.sql.SQLException: TiFlash server timeout] org.springframework.jdbc.UncategorizedSQLException: ###Error querying database. Cause: java.sql.SQLException: TiFlash server timeout ; uncategorized SQLException; SQL state [HY000]; error code [9012]; TiFlash server timeout; nested exception is java.sql.SQLException: TiFlash server timeout

其查询计划为，没有任何合适的索引能走到tikv查询数据，只有走TiFlash。

第一个错误是MPP老版本的一个bug错误已经修复了，按理不会报这个错误。

后面确认是TiFlash升级完成后挂掉不能正常启动，确认了虽然有两个TiFlash节点实际上还是单副本运行，所以一直报错。

问题分析及处理建议：

v5.1的TiFlash升级到v6.1.5不能保证升级过程中会不会有其他问题，这台机器升级过程是恢复临时占用内存超过机器内存16G，不能在原地升级完成。

大家以后在执行版本升级时，第一硬件资源要跟上，第二升级的组件要是多副本的。

这样在升级过程中不用等待重新扩缩TiFlash同步数据带来的数据迁移时间。

问题二：使用--offline参数升级集群无法继续进行，cluster is running and cannot be upgraded offline

ll upgrade component "tiflash" to "v6.1.5", will upgrade component "cdc" to "v6.1.5", will upgrade component "pd" to "v6.1.5", will upgrade component "tikv" to "v6.1.5", will upgrade component "tidb" to "v6.1.5", will upgrade component "prometheus" to "v6.1.5", y Upgrading cluster... Error: cluster is running and cannot be upgraded offline

至于为什么会发生这个问题，没去排查了。最后确认的是：tiup中关闭了TiFlash组件，状态也显示已关闭，但服务器上systemctl status tiflash-9000显示running，实际上组件并未正常停止，这个问题也好理解，毕竟是异步关闭的组件。

遇到这个问题处理办法就是去TiFlash组件服务器上关闭一下。

问题三：升级期间有一个超大DDL在执行，导致升级操作完成后tidb计算节点不能提供服务

这个服务直接就不可用了，我们来看个组件的日志情况

排查过程，我们一般都是grep err日志，发现错误日志集中在PD上，开始在怀疑PD有问题

PD的相关日志：

[2023/08/24 22:51:27.992 +08:00] [ERROR] [client.go:847] ["[pd] getTS error"] [dc-location=global] [stream-addr=http://10.16.57.62:2379][error="[PD:client:ErrClientGetTSO]rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"][2023/08/24 22:51:27.993 +08:00] [INFO] [base_client.go:275] ["[pd]cannot update member from this address"][address=http://10.16.57.106:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address dont exist error"] [2023/08/24 22:51:27.993 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.106:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address dont exist error"] [2023/08/24 22:51:27.994 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address dont exist error"][2023/08/24 22:51:27.994 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address dont exist error"] [2023/08/24 22:51:27.995 +08:00] [INFO] [base_client.go:275] ["[pd]cannot update member from this address"][address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address dont exist error"] [2023/08/24 22:51:27.995 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address dont exist error"] [2023/08/24 22:51:27.995 +08:00] [ERROR] [base_client.go:144] ["[pd]failed updateMember"] [error="[PD:client:ErrClientGetLeader]get leader from [http://10.16.57.106:2379 http://10.16.57.62:2379 http://10.16.57.82:2379] error"][2023/08/24 22:51:28.265 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address dont exist error"] [2023/08/24 22:51:28.266 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address dont exist error"] [2023/08/24 22:51:28.266 +08:00] [ERROR] [base_client.go:144] ["[pd]failed updateMember"] [error="[PD:client:ErrClientGetLeader]get leader from [http://10.16.57.106:2379 http://10.16.57.62:2379 http://10.16.57.82:2379] error"][2023/08/24 22:51:28.266 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address dont exist error"][2023/08/24 22:51:28.476 +08:00] [ERROR] [client.go:847] ["[pd] getTS error"] [dc-location=global] [stream-addr=http://10.16.57.62:2379][error="[PD:client:ErrClientGetTSO]rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]

大致情况就是执行任何操作请求都提示没有pd leader能处理。

TiDB的日志：

[2023/08/24 23:16:29.989 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.106:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.106:2379 status:READY: error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.106:2379 status:READY"][2023/08/24 23:16:29.990 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.62:2379 status:READY: error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.62:2379 status:READY"][2023/08/24 23:16:29.990 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.82:2379 status:READY: error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.82:2379 status:READY"] [2023/08/25 01:07:49.023 +08:00] [INFO] [printer.go:34] ["Welcome to TiDB."] ["Release Version"=v6.1.5] [Edition=Community] ["Git Commit Hash"=73d82e330b02a39f74073d98daefbadd7deab9b9] ["Git Branch"=heads/refs/tags/v6.1.5] ["UTC Build Time"="2023-02-22 06:21:24"] [GoVersion=go1.19.5] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306] [2023/08/25 01:07:49.024 +08:00] [INFO] [printer.go:48] ["loaded config"][config="{\"host\":\"0.0.0.0\",\"advertise-address\":\"10.16.57.82\",\"port\":4000,\"cors\":\"\",\"store\":\"tikv\",\"path\":\"10.16.57.82:2379,10.16.57.106:2379,10.16.57.62:2379\",\"socket\":\"/tmp/tidb-4000.sock\",\"lease\":\"45s\",\"run-ddl\":true,\"split-table\":true,\"token-limit\":1000,\"oom-use-tmp-storage\":true,\"tmp-storage-path\":\"/tmp/0_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage\",\"tmp-storage-quota\":-1,\"server-version\":\"\",\"version-comment\":\"\",\"tidb-edition\":\"\",\"tidb-release-version\":\"\",\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":null,\"enable-timestamp\":null,\"disable-error-stack\":null,\"enable-error-stack\":null,\"file\":{\"filename\":\"/data/tidb-deploy/tidb-4000/log/tidb.log\",\"max-size\":300,\"max-days\":0,\"max-backups\":0},\"enable-slow-log\":true,\"slow-query-file\":\"/data/tidb-deploy/tidb-4000/log/tidb_slow_query.log\",\"slow-threshold\":1000,\"expensive-threshold\":10000,\"record-plan-in-slow-log\":1,\"query-log-max-len\":4096},\"instance\":{\"tidb_general_log\":false,\"tidb_pprof_sql_cpu\":false,\"ddl_slow_threshold\":300,\"tidb_expensive_query_time_threshold\":60,\"tidb_enable_slow_log\":true,\"tidb_slow_log_threshold\":1000,\"tidb_record_plan_in_slow_log\":1,\"tidb_check_mb4_value_in_utf8\":true,\"tidb_force_priority\":\"NO_PRIORITY\",\"tidb_memory_usage_alarm_ratio\":0.8,\"tidb_enable_collect_execution_info\":true,\"plugin_dir\":\"/data/deploy/plugin\",\"plugin_load\":\"\"},\"security\":{\"skip-grant-table\":false,\"ssl-ca\":\"\",\"ssl-cert\":\"\",\"ssl-key\":\"\",\"cluster-ssl-ca\":\"\",\"cluster-ssl-cert\":\"\",\"cluster-ssl-key\":\"\",\"cluster-verify-cn\":null,\"spilled-file-encryption-method\":\"plaintext\",\"enable-sem\":false,\"auto-tls\":false,\"tls-version\":\"\",\"rsa-key-size\":4096,\"secure-bootstrap\":false},\"status\":{\"status-host\":\"0.0.0.0\",\"metrics-addr\":\"\",\"status-port\":10080,\"metrics-interval\":15,\"report-status\":true,\"record-db-qps\":false,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"grpc-concurrent-streams\":1024,\"grpc-initial-window-size\":2097152,\"grpc-max-send-msg-size\":2147483647},\"performance\":{\"max-procs\":0,\"max-memory\":0,\"server-memory-quota\":0,\"memory-usage-alarm-ratio\":0.8,\"stats-lease\":\"3s\",\"stmt-count-limit\":5000,\"feedback-probability\":0,\"query-feedback-limit\":512,\"pseudo-estimate-ratio\":0.8,\"force-priority\":\"NO_PRIORITY\",\"bind-info-lease\":\"3s\",\"txn-entry-size-limit\":6291456,\"txn-total-size-limit\":104857600,\"tcp-keep-alive\":true,\"tcp-no-delay\":true,\"cross-join\":true,\"distinct-agg-push-down\":false,\"projection-push-down\":false,\"max-txn-ttl\":3600000,\"index-usage-sync-lease\":\"0s\",\"plan-replayer-gc-lease\":\"10m\",\"gogc\":100,\"enforce-mpp\":false,\"stats-load-concurrency\":5,\"stats-load-queue-size\":1000,\"enable-stats-cache-mem-quota\":false,\"committer-concurrency\":128,\"run-auto-analyze\":true},\"prepared-plan-cache\":{\"enabled\":true,\"capacity\":100,\"memory-guard-ratio\":0.1},\"opentracing\":{\"enable\":false,\"rpc-metrics\":false,\"sampler\":{\"type\":\"const\",\"param\":1,\"sampling-server-url\":\"\",\"max-operations\":0,\"sampling-refresh-interval\":0},\"reporter\":{\"queue-size\":0,\"buffer-flush-interval\":0,\"log-spans\":false,\"local-agent-host-port\":\"\"}},\"proxy-protocol\":{\"networks\":\"\",\"header-timeout\":5},\"pd-client\":{\"pd-server-timeout\":3},\"tikv-client\":{\"grpc-connection-count\":4,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"grpc-compression-type\":\"none\",\"commit-timeout\":\"41s\",\"async-commit\":{\"keys-limit\":256,\"total-key-size-limit\":4096,\"safe-window\":2000000000,\"allowed-clock-drift\":500000000},\"max-batch-size\":128,\"overload-threshold\":200,\"max-batch-wait-time\":0,\"batch-wait-size\":8,\"enable-chunk-rpc\":true,\"region-cache-ttl\":600,\"store-limit\":0,\"store-liveness-timeout\":\"1s\",\"copr-cache\":{\"capacity-mb\":1000},\"ttl-refreshed-txn-size\":33554432,\"resolve-lock-lite-threshold\":16},\"binlog\":{\"enable\":false,\"ignore-error\":false,\"write-timeout\":\"15s\",\"binlog-socket\":\"\",\"strategy\":\"range\"},\"compatible-kill-query\":false,\"plugin\":{\"dir\":\"/data/deploy/plugin\",\"load\":\"\"},\"pessimistic-txn\":{\"max-retry-count\":256,\"deadlock-history-capacity\":10,\"deadlock-history-collect-retryable\":false,\"pessimistic-auto-commit\":false},\"check-mb4-value-in-utf8\":true,\"max-index-length\":3072,\"index-limit\":64,\"table-column-count-limit\":1017,\"graceful-wait-before-shutdown\":0,\"alter-primary-key\":false,\"treat-old-version-utf8-as-utf8mb4\":true,\"enable-table-lock\":false,\"delay-clean-table-lock\":0,\"split-region-max-num\":1000,\"top-sql\":{\"receiver-address\":\"\"},\"repair-mode\":false,\"repair-table-list\":[],\"isolation-read\":{\"engines\":[\"tikv\",\"tiflash\",\"tidb\"]},\"max-server-connections\":1000,\"new_collations_enabled_on_first_bootstrap\":true,\"experimental\":{\"allow-expression-index\":false},\"enable-collect-execution-info\":true,\"skip-register-to-dashboard\":false,\"enable-telemetry\":false,\"labels\":{},\"enable-global-index\":false,\"deprecate-integer-display-length\":false,\"enable-enum-length-limit\":true,\"stores-refresh-interval\":60,\"enable-tcp4-only\":false,\"enable-forwarding\":false,\"max-ballast-object-size\":0,\"ballast-object-size\":0,\"enable-global-kill\":true,\"enable-batch-dml\":false,\"mem-quota-query\":2147483648,\"oom-action\":\"cancel\"}"][2023/08/25 01:07:49.024 +08:00] [INFO] [main.go:343] ["disable Prometheus push client"] 这个地方的日志也显示TiDB节点的配置正常加载正常启动，但是就是未正常打开4000端口对外提供服务 [2023/08/25 01:07:51.410 +08:00] [INFO] [session.go:3311] ["CRUCIAL OPERATION"] [conn=0] [schemaVersion=3242] [cur_db=] [sql="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"] [user=] [2023/08/25 01:07:51.442 +08:00] [INFO] [ddl_worker.go:339] ["[ddl] add DDL jobs"] ["batch count"=1] [jobs="ID:4591, Type:add column, State:queueing, SchemaState:none, SchemaID:3, TableID:21, RowCount:0, ArgLen:3, start time: 2023-08-25 01:07:51.419 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0; "] [2023/08/25 01:07:51.444 +08:00] [INFO] [ddl.go:632] ["[ddl] start DDL job"] [job="ID:4591, Type:add column, State:queueing, SchemaState:none, SchemaID:3, TableID:21, RowCount:0, ArgLen:3, start time: 2023-08-25 01:07:51.419 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [query="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"] 怀疑是未正常执行升级

因为是按计划升级，也没有在TiDB上执行DDL，从未怀疑这一点是问题的关键。然而事实的关键往往就在你没发现的可能忽略地方，我们排查了tikv都是正常的，间接也证明了PD应该也是正常的，否则多副本的存储服务肯定也不行。

回过头来排查tidb本身，突然发现一个从MySQL同步到这个TiDB集群的一个物理加索引DDL。

问题逐渐清晰，因为MySQL有分库分表，并且只保存着近半年的热数据执行很快就完成。TiDB里面有着全量的数据聚合到一张表，这个加索引的物理DDL也自动同步到了TiDB执行，并且执行时间很长。

因为有了整体问题思路，再回头看TiDB日志

[2023/08/24 22:00:27.008 +08:00] [WARN] [session.go:1533] ["run statement failed"] [conn=5897899] [schemaVersion=3240] [error="[ddl:1061]index already exist idx_create; a background job is trying to add the same index, please check by `ADMIN SHOW DDL JOBS`"] [session="{\n \"currDBName\": \"***_voucher\",\n \"id\": 5897899,\n \"status\": 2,\n \"strictMode\": true,\n \"user\": {\n \"Username\": \"tidb_cold_ w\",\n \"Hostname\": \"10.16.56.178\",\n \"CurrentUser\": false,\n \"AuthUsername\": \"tidb_cold_w\",\n \"AuthHostname\": \"10.%.%.%\"\n }\n}"][2023/08/24 22:00:27.008 +08:00] [INFO] [conn.go:877] ["command dispatched failed"] [conn=5897899] [connInfo="id:5897899, addr:10.16.56.178:28898 status:10, collation:utf8_general_ci, user:tidb_cold_w"] [command =Query] [status="inTxn:0, autocommit:1"] [sql="ALTER TABLE `voucher_detail` ADD INDEX idx_create(created, vender_id)"] [txn_mode=PESSIMISTIC] [err="[ddl:1061]index already exist idx_create; a background job is t rying to add the same index, please check by `ADMIN SHOW DDL JOBS`\ngithub.com/pingcap/errors.AddStack\n\t/nfs/cache/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/# cat tidb.log |grep "run reorg job wait timeout" [2023/08/24 22:09:51.308 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7104213] [doneKey=6774522551193595979] [] [2023/08/24 22:09:56.356 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7183573] [doneKey=6774522551193595979] [] [2023/08/24 22:10:01.391 +08:00] [INFO] [reorg.go:284] ["[ddl]run reorg job wait timeout"][waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7262165] [doneKey=6774522551193595979] [] [2023/08/24 22:10:06.481 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7336405][doneKey=6774522551193595979] [][2023/08/24 22:10:11.570 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7419143] [doneKey=6774522551193595979] [] [2023/08/24 22:10:16.640 +08:00] [INFO] [reorg.go:284] ["[ddl]run reorg job wait timeout"][waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7464902] [doneKey=6774522551193595979] [] [2023/08/24 22:10:21.804 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7544549][doneKey=6774692334375633788.next] [][2023/08/24 22:11:43.519 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8656519] [doneKey=6775048577556492528.next] [] [2023/08/24 22:11:48.593 +08:00] [INFO] [reorg.go:284] ["[ddl]run reorg job wait timeout"][waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8737159] [doneKey=6775048577556492528.next] [] [2023/08/24 22:11:53.648 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8792281][doneKey=6775048577556492528.next] [][2023/08/24 22:11:58.740 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8826499] [doneKey=6775048577556492528.next] [] [2023/08/25 04:37:44.221 +08:00] [INFO] [reorg.go:301] ["[ddl]run reorg job wait timeout"][waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256292839] [doneKey=6858402545246070082.next] [] [2023/08/25 04:37:49.308 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256353399] [doneKey=6858402545246070082.next] [] [2023/08/25 04:37:54.366 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256386509] [doneKey=6858402545246070082.next] [] [2023/08/25 04:37:59.438 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256442291] [doneKey=6858463469667393591.next] []

根因分析：

[2023/08/24 22:00:27.008 +08:00] [INFO] [conn.go:877] ["command dispatched failed"] [conn=5897899] [connInfo="id:5897899, addr:10.16.56.178:28898 status:10, collation:utf8_general_ci, user:tidb_cold_w"] [command =Query] [status="inTxn:0, autocommit:1"] [sql="ALTER TABLE `voucher_detail` ADD INDEX idx_create(created, vender_id)"] [txn_mode=PESSIMISTIC] [err="[ddl:1061]index already exist idx_create; a background job is t rying to add the same index, please check by `ADMIN SHOW DDL JOBS`

[2023/08/25 01:07:51.410 +08:00] [INFO] [session.go:3311] ["CRUCIAL OPERATION"] [conn=0] [schemaVersion=3242] [cur_db=] [sql="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"] [user=]

[2023/08/25 01:07:51.444 +08:00] [INFO] [ddl.go:632] ["[ddl] start DDL job"] [job="ID:4591, Type:add column,State:queueing, SchemaState:none, SchemaID:3, TableID:21, RowCount:0, ArgLen:3, start time: 2023-08-25 01:07:51.419 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [query="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"]

因升级导致内部执行的"ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"

在DDL执行队列中排在了物理DDL "ALTER TABLE `voucher_detail` ADD INDEX idx_create(created, vender_id)"后面，导致升级并未真正完成，卡住了打开4000端口的流程。

改进：在进行大版本升级时，操作集群不要有大DDL执行，提前检查【ADMIN SHOW DDL JOBS】

最奇怪的事情是：物理DDL执行完成后，tidb计算节点打开服务后，进数据库看那个索引并没有加上。

标签：TiDB 数据库数据迁移 pingcap

记三次升级TiDB集群到v6.1.5的案例与收益

问题一：升级完成后其中一台TiFlash无法正常重启，一直oom

问题二：使用--offline参数升级集群无法继续进行，cluster is running and cannot be upgraded offline

问题三：升级期间有一个超大DDL在执行，导致升级操作完成后tidb计算节点不能提供服务

根因分析：

改进：在进行大版本升级时，操作集群不要有大DDL执行，提前检查【ADMIN SHOW DDL JOBS】

黄东旭解析 TiDB 的核心优势

麒麟v10 上部署 TiDB v5.1.2 生产环境优化实践

高成本云服务？TiDB 帮你省钱

推荐文章

HTAP 还可以这么玩？丨TiDB 在 IoT 智慧园区的应用

新特性解析丨TiDB 资源管控的设计思路与场景解析

TiDB赋能保险业-首个全栈自主核心保单系统成功投产

首个云原生、分布式、全栈国产化银行核心业务系统投产上线丨TiDB × 杭州银行

TiDB 在社交场景的解决方案实践

电商数据技术栈，在海量数据增长下如何实现实时与全量兼得？

金融行业数据库的选择

TiDB 在智能制造中的应用实践

TiDB 在全球头部物流企业计费管理系统的应用实践

PingCAP与教育部教育管理信息中心合作，推动普惠教育数字化转型

友情链接

热评文章

TiDB 中标杭州银行核心系统数据库项目

TiDB 首批通过信通院 HTAP 数据库基础能力评

PingCAP 与 Wisconsin-Madiso

PingCAP 成为中国唯一入选 Forrester

TiDB 走进东软集团，共建医疗数字化基石

共享开源技术，共建开放生态丨平凯星辰余梦杰出席 20

记三次升级TiDB集群到v6.1.5的案例与收益

问题一：升级完成后其中一台TiFlash无法正常重启，一直oom

问题二：使用--offline参数升级集群无法继续进行，cluster is running and cannot be upgraded offline

问题三：升级期间有一个超大DDL在执行，导致升级操作完成后tidb计算节点不能提供服务

根因分析：

改进：在进行大版本升级时，操作集群不要有大DDL执行，提前检查【ADMIN SHOW DDL JOBS】

微信扫一扫：分享

推荐文章

友情链接

热评文章