[TiDB Enterprise Practice - 麦谷科技] A Detailed Upgrade Best Practice: From TiDB v5.3.0 to v7.5.2

2024-11-02




I. Overview

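  Our production TiDB cluster currently runs v5.3.0. We need to upgrade it to v7.5.2 to fix problems we have hit in production and to use new features that improve cluster performance and reduce maintenance cost.

  Because the upgrade spans many versions, before touching production we read every release note from v5.3.0 to v7.5.2 (this is important), recorded the changes that affect us and the new features that are valuable to us, and then validated them in a virtual machine environment we built ourselves.

  The database is the lowest-level core component, so stability is critical. We therefore defined the upgrade path as: [personal environment (virtual machines on a work computer) upgrade test] -> [development environment upgrade verification] -> [test environment upgrade verification] -> [production environment upgrade].

  After upgrading the development environment, we let developers use it for about a week to make sure nothing was wrong, and only then upgraded the test environment. The test environment is the closest to production, so problems are most likely to surface there. It was at this step that we found the TiCDC maxwell format differs from the old version's format, which saved us from a serious incident in which the upgrade would have made the business unavailable. After upgrading the test environment we observed it for about two weeks before upgrading production.

  Note that, for the same version, a cluster that reached that version through upgrades has different default parameters from a cluster deployed directly at that version. To expose every problem before the production upgrade, we chose to start from the version we first deployed. For example, we started using TiDB at v4.0.4, upgraded to v4.0.10, and then to the current v5.3.0. So for verification in non-production environments we first deployed v4.0.4, upgraded step by step to v5.3.0, and then took a virtual machine snapshot so that the v5.3.0 to v7.5.2 upgrade could be tested repeatedly.

  We recommend saving the full output of show global variables and show config before the upgrade. If the upgrade has any negative effect, comparing the variables and configuration before and after lets you locate the problem fairly quickly. (For example, backups became noticeably slower after the upgrade; the comparison showed that backup.num-threads had been changed automatically from 8 to 2.)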
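
The before/after comparison described above can be sketched as a small helper. This is only an illustration, not the procedure we actually ran: the function name diff_settings, the file names, and the mysql connection options in the comments are assumptions. It expects two tab-separated dumps and treats the last column as the value and all earlier columns together as the key, so it works for both show global variables (2 columns) and show config (4 columns).

```shell
#!/usr/bin/env bash
# Sketch: report settings that differ between two tab-separated dumps.
# The dumps are assumed to come from the mysql client in batch mode, e.g.
#   mysql -h <tidb-host> -P 4000 -u root -p -N -B -e "show global variables" > vars_before.tsv
#   mysql -h <tidb-host> -P 4000 -u root -p -N -B -e "show config"           > config_before.tsv
# (host, port and user are placeholders, not values from this article).
diff_settings() {
    local before="$1" after="$2"
    awk -F'\t' '
        # Build the key from every column except the last one.
        function key(   i, k) {
            k = $1
            for (i = 2; i < NF; i++) k = k "|" $i
            return k
        }
        # First file: remember the old value of every key.
        NR == FNR { old[key()] = $NF; next }
        # Second file: report keys that are new or whose value changed
        # (values are compared as strings).
        {
            k = key(); seen[k] = 1
            if (!(k in old))                  printf "added\t%s\t\t%s\n", k, $NF
            else if ((old[k] "") != ($NF "")) printf "changed\t%s\t%s\t%s\n", k, old[k], $NF
        }
        # Keys that disappeared after the upgrade.
        END {
            for (k in old) if (!(k in seen)) printf "removed\t%s\t%s\t\n", k, old[k]
        }
    ' "$before" "$after"
}
```

Calling diff_settings vars_before.tsv vars_after.tsv prints one line per difference in the form status, key, old value, new value, which makes a change such as backup.num-threads going from 8 to 2 easy to spot.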

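   Backups are always a DBA's last trump card. Before the real production upgrade we defined a full plus incremental backup strategy and applied it to a separate standby cluster. We never had to switch to the standby cluster, but because it existed, problems during the upgrade did not throw us into disarray and we still had the courage to try things.

  Through this upgrade we mainly wanted to achieve the following: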



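1.1 Fix bugs in the current version

(1) Fix overly long backup times and slow incremental backup and restore (restoring with TiDB Binlog is currently very slow).

(2) Fix TiCDC hanging (v6.5.8 fixed a large number of TiCDC issues).

(3) Fix the "Failed to get user record" error that is logged on every login for users restricted to a specific IP range, which produces a large volume of logs.

(4) Fix slow statistics loading after a tidb-server OOM, which prevents the business from recovering.

  During TiDB startup, SQL statements executed before the initial statistics are fully loaded may get suboptimal execution plans, which affects performance. To avoid this, TiDB v7.1.0 introduced the configuration item force-init-stats, which controls whether TiDB waits for statistics initialization to finish before serving requests at startup. This item is enabled by default starting from v7.2.0.

(5) Fix incorrect query results after changing a FLOAT column to a DOUBLE column (fixed in v5.3.1).

(6) Fix the TiDB server OOM caused by querying the INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY table, which can be triggered when viewing slow query records in the Grafana dashboard, #33893 (an important bug fix in v5.3.2; note that this version is officially not recommended).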



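1.2 New features that improve performance and reduce maintenance cost

(1) v6.3.0 supports automatic partitioning.

(2) v7.1.0 supports partition reorganization (splitting and merging partitions).

(3) TiDB introduced cached tables in v6.0.0.

(4) v6.2.0 supports point-in-time recovery (PITR), which restores a snapshot of the backed-up cluster at any point in its history.

(5) Starting from v7.4.0, TiDB supports the WITH ROLLUP modifier and the GROUPING function in the GROUP BY clause.

(6) Statistics collection settings can be persisted with tidb_persist_analyze_options (new in v5.4.0).

(7) Backups have less impact on the cluster.

(8) v6.0.0 optimizes in-memory pessimistic locking, which can reduce latency by 10% and raise QPS by 10%.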



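II. Performance after the upgrade

(1) Cluster stability improved: after upgrading to v7.5.2, the number of TiDB cluster OOMs dropped sharply.

(2) TiKV now performs GC automatically, so TiKV nodes no longer need to be restarted to help GC reclaim Regions.

(3) Table TTL (Time To Live) cleans up expired data, which reduces the database load previously caused by cleanup scripts.

(4) As the figures below show, both database response time and TiDB server memory usage improved after this upgrade.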

[Figures: database response time and TiDB server memory after the upgrade (images not preserved)]


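III. Upgrade rehearsal (non-production environments)


3.1 Pre-upgrade checks


3.1.1 Stop related scheduled jobs

Stop all scheduled jobs that perform backup, restore, or DDL operations.


3.1.2 Check server-version

Set server-version to an empty value or to the real version of the current TiDB to avoid unexpected behavior.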

mysql> show config where name like '%server-version%';
+------+---------------------+----------------+-------+
| Type | Instance            | Name           | Value |
+------+---------------------+----------------+-------+
| tidb | 192.168.68.129:4000 | server-version |       |
| tidb | 192.168.68.128:4000 | server-version |       |
+------+---------------------+----------------+-------+
2 rows in set, 1 warning (0.07 sec)


3.1.3 System architecture check

When deploying TiFlash on Linux AMD64 hardware, the CPU must support the AVX2 instruction set. The following command must produce output:

cat /proc/cpuinfo | grep avx2

When deploying TiFlash on Linux ARM64 hardware, the CPU must support the ARMv8 architecture. The following command must produce output:

cat /proc/cpuinfo | grep 'crc32' | grep 'asimd'
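
The two checks above can be wrapped in one function that picks the right test for the architecture. This is a minimal sketch; the function name check_tiflash_cpu and the optional second argument (a cpuinfo file, which makes the check easy to try against a copy taken from another host) are assumptions for illustration.

```shell
#!/usr/bin/env bash
# check_tiflash_cpu ARCH [CPUINFO_FILE]
# Returns 0 when the CPU flags TiFlash requires for ARCH are present,
# 1 when they are missing, and 2 for an architecture this sketch does not know.
check_tiflash_cpu() {
    local arch="$1" cpuinfo="${2:-/proc/cpuinfo}"
    case "$arch" in
        x86_64|amd64)
            # AMD64: the avx2 flag must be listed
            grep -qw 'avx2' "$cpuinfo" ;;
        aarch64|arm64)
            # ARM64: both crc32 and asimd must appear on the same Features line
            grep -w 'crc32' "$cpuinfo" | grep -qw 'asimd' ;;
        *)
            echo "unsupported architecture: $arch" >&2
            return 2 ;;
    esac
}

# Example use on the local machine:
#   check_tiflash_cpu "$(uname -m)" && echo "CPU is suitable for TiFlash"
```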

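Note: in a self-test environment built on virtual machines that do not support AVX2, you can bypass the check by modifying the TiFlash startup script (production must support AVX2). However, the upgrade overwrites and restores this file, which then makes the upgrade fail. You therefore have to keep checking during the upgrade whether the file has been overwritten and, if so, change it back promptly. The following script can be copied to and run on every TiFlash node to monitor and patch the file in real time (be sure to start it before you start the upgrade):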

function update_tiflash_script()
{
        # Directory that contains run_tiflash.sh [adjust this path to your environment]
        scripts_path='/data/tidb-deploy/tiflash-9000/scripts'
        script="${scripts_path}/run_tiflash.sh"

        now=$(date +%Y%m%d%H%M%S)
        # Back up the original script
        cp "${script}" "${script}.${now}"
        while true; do
                echo "Monitoring '${script}'; press Ctrl+C to stop this script after the upgrade completes"
                # Patch only when the script checks CPU flags and has not been patched yet
                if grep -q 'required_cpu_flags' "${script}" && ! grep -q 'required_cpu_flags="avx"' "${script}"; then
                        echo "Not found 'required_cpu_flags=\"avx\"', try modify..."
                        # First comment out the original assignment
                        sed -i 's/required_cpu_flags=/# &/g' "${script}"
                        # Then insert the relaxed requirement just before it
                        sed -i '/# required_cpu_flags=/i\    required_cpu_flags="avx"' "${script}"
                fi
                sleep 0.5
        done
}
update_tiflash_script

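In addition, although this change keeps a TiFlash upgrade failure from failing the whole cluster upgrade, TiFlash still fails to start after the cluster has been upgraded because AVX2 is not supported.

Reference: https://asktug.com/t/topic/1021704/22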

 


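3.1.4 Prometheus issue

   When a cluster earlier than v5.3 is upgraded to v5.3 or later, the default Prometheus deployment is upgraded from v2.8.1 to v2.27.1. v2.27.1 provides more features and fixes security risks. Compared with v2.8.1, Prometheus v2.27.1 changes the Alert time format; see the Prometheus commit for details.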

 


3.2 Start the upgrade


3.2.1 Upgrade TiUP

The tiup version must be 1.11.3 or later.

tiup update --self
tiup --version


3.2.2 Upgrade TiUP Cluster

The tiup cluster version must be 1.11.3 or later.

tiup update cluster
tiup cluster --version
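
If you want a script to enforce the minimum version rather than reading the output by eye, a comparison helper along these lines may be useful. It is a sketch under two assumptions: GNU sort with -V is available, and you extract the version string (for example 1.15.2) from the command output yourself; version_ge is a name made up for this example.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed when version A is greater than or equal to version B.
# A leading "v" is ignored. Relies on GNU sort -V (version sort).
version_ge() {
    local a="${1#v}" b="${2#v}"
    # The smaller of the two versions sorts first; A >= B when that one is B.
    [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n1)" = "$b" ]
}

# Example:
#   version_ge "1.15.2" "1.11.3" || echo "tiup cluster is too old, run: tiup update cluster"
```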


3.2.3 Make sure no DDL is running

mysql> admin show ddl;
+------------+--------------------------------------+---------------------+--------------+--------------------------------------+-------+
| SCHEMA_VER | OWNER_ID                             | OWNER_ADDRESS       | RUNNING_JOBS | SELF_ID                              | QUERY |
+------------+--------------------------------------+---------------------+--------------+--------------------------------------+-------+
|        100 | cad28f9e-dcda-4782-8e14-c792604d4275 | 192.168.68.128:4000 |              | cad28f9e-dcda-4782-8e14-c792604d4275 |       |
+------------+--------------------------------------+---------------------+--------------+--------------------------------------+-------+
1 row in set (0.01 sec)

Note: do not run DDL operations during the upgrade.

 


3.2.4 Make sure no backup or restore is running

mysql> show backups;
Empty set (0.00 sec)
mysql> show restores;
Empty set (0.00 sec)


3.2.5 Check the current health of the cluster

[root@localhost ~]# tiup cluster check tidb-test --cluster
Checking updates for component cluster... Timedout (after 2s)
+ Download necessary tools
...... <some log lines omitted>
Checking region status of the cluster tidb-test...
All regions are healthy.
[root@localhost ~]#

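When the check finishes, it prints the region status result at the end. If the result is "All regions are healthy", all regions in the cluster are healthy and the upgrade can proceed. If the result is "Regions are not fully healthy: m miss-peer, n pending-peer" together with the prompt "Please fix unhealthy regions before other operations.", some regions are in an abnormal state; resolve the problem first and continue the upgrade only after a re-check reports "All regions are healthy".

If there are errors, you can first try automatic repair: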

tiup cluster check tidb-test --cluster --apply


3.2.6 Upgrade the TiDB cluster

tiup cluster upgrade tidb-test v7.5.2

The components are restarted at this point:

[root@localhost ~]# tiup cluster upgrade tidb-test v7.5.2
Before the upgrade, it is recommended to read the upgrade guide at https://docs.pingcap.com/tidb/stable/upgrade-tidb-using-tiup and finish the preparation steps.
This operation will upgrade tidb v5.3.0 cluster tidb-test to v7.5.2:
will upgrade and restart component "            tiflash" to "v7.5.2",
will upgrade and restart component "                cdc" to "v7.5.2",
will upgrade and restart component "                 pd" to "v7.5.2",
will upgrade and restart component "               tikv" to "v7.5.2",
will upgrade and restart component "               pump" to "v7.5.2",
will upgrade and restart component "               tidb" to "v7.5.2",
will upgrade and restart component "            drainer" to "v7.5.2",
will upgrade and restart component "         prometheus" to "v7.5.2",
will upgrade and restart component "            grafana" to "v7.5.2",
will upgrade component     "node-exporter" to "",
will upgrade component "blackbox-exporter" to "".
Do you want to continue? [y/N]:(default=N) y


3.2.7 Upgrade br

(1) Backup before the upgrade

br backup of the development environment before the upgrade (that is, taken before step 3.2.6)

[tidb@localhost ~]$ br backup full --pd "192.168.100.164:2379" -s "local:///nfs/full_20240701" --log-file backup_full.log
Detail BR log in backup_full.log
Full backup <---------------------------------------------------------------------------------------------------------------------\..............................................................................................> 55.57%{"level":"warn","ts":"2024-07-01T16:19:11.501+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-1a1b68b6-4fbb-423c-806b-a471b994fbad/192.168.100.164:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Full backup <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2024/07/01 16:37:57.134 +08:00] [INFO] [collector.go:65] ["Full backup success summary"] [total-ranges=8105] [ranges-succeed=8105] [ranges-failed=0] [backup-checksum=5m34.229454455s] [backup-fast-checksum=578.178902ms] [backup-total-ranges=12445] [total-take=30m12.00797445s] [BackupTS=450840825721257986] [total-kv=2812160710] [total-kv-size=221.7GB] [average-speed=122.3MB/s] [backup-data-size(after-compressed)=47.51GB] [Size=47506313613]
[tidb@localhost ~]$

Total time 30m12s; the backup is 47.5G.

(2) Upgrade br

# Download:
wget https://download.pingcap.org/tidb-community-toolkit-v7.5.2-linux-amd64.tar.gz
tar xvzf tidb-community-toolkit-v7.5.2-linux-amd64.tar.gz
cd tidb-community-toolkit-v7.5.2-linux-amd64
tar xvzf br-v7.5.2-linux-amd64.tar.gz
cp br /usr/bin
# Try a backup
su - tidb
br backup full --pd "192.168.100.164:2379" -s "local:///nfs/full_20240701_2" --log-file backup_full_2.log

(3) Backups after upgrading br

# With backup.num-threads=2
[tidb@localhost ~]$ time br backup full --pd "192.168.100.164:2379" -s "local:///nfs/full_20240701_2" --log-file backup_full_2.log
Detail BR log in backup_full_2.log
Full Backup <----.................................................................................................................................................................................................................> 1.68%{"level":"warn","ts":"2024-07-01T20:02:02.873855+0800","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001e6700/192.168.100.164:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Full Backup <---/.................................................................................................................................................................................................................> 1.69%{"level":"warn","ts":"2024-07-01T20:19:33.890588+0800","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001e6700/192.168.100.164:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Full Backup <----\................................................................................................................................................................................................................> 1.98%
Full Backup <----/................................................................................................................................................................................................................> 1.98%
Full Backup <-----................................................................................................................................................................................................................> 1.98%
Full Backup <----\................................................................................................................................................................................................................> 1.99%
Full Backup <----|................................................................................................................................................................................................................> 1.99%
Full Backup <---------------------------------------------\......................................................................................................................................................................> 21.39%{"level":"warn","ts":"2024-07-01T20:59:41.485154+0800","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001e6700/192.168.100.164:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Full Backup <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2024/07/01 22:12:29.231 +08:00] [INFO] [collector.go:77] ["Full Backup success summary"] [total-ranges=8111] [ranges-succeed=8111] [ranges-failed=0] [backup-fast-checksum=1.012687562s] [backup-checksum=7m47.96059744s] [backup-total-ranges=12509] [total-take=2h12m22.072926235s] [BackupTS=450844481438089217] [total-kv=2812160811] [total-kv-size=221.7GB] [average-speed=27.91MB/s] [backup-data-size(after-compressed)=50.03GB] [Size=50026772216]
 
real    132m31.496s
user    1m27.695s
sys     1m2.156s
 
 
# With backup.num-threads=4
[tidb@localhost ~]$ time br backup full --pd "192.168.100.164:2379" -s "local:///nfs/full_20240701_3" --log-file backup_full_3.log
Detail BR log in backup_full_3.log
Full Backup <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2024/07/02 00:09:55.013 +08:00] [INFO] [collector.go:77] ["Full Backup success summary"] [total-ranges=8104] [ranges-succeed=8104] [ranges-failed=0] [backup-checksum=7m5.51865244s] [backup-fast-checksum=932.893636ms] [backup-total-ranges=12509] [total-take=40m7.219175349s] [backup-data-size(after-compressed)=50.03GB] [Size=50026756234] [BackupTS=450847779061760014] [total-kv=2812160813] [total-kv-size=221.7GB] [average-speed=92.09MB/s]
 
real    40m7.655s
user    0m46.710s
sys     0m34.209s

 

# With the ratelimit parameter
[tidb@localhost ~]$ time br backup full --pd "192.168.100.164:2379" -s "local:///nfs/full_20240701_4" --ratelimit 128 --log-file backup_full_4.log
Detail BR log in backup_full_4.log
[2024/07/02 00:11:52.901 +08:00] [WARN] [backup.go:312] ["setting `--ratelimit` and `--concurrency` at the same time, ignoring `--concurrency`: `--ratelimit` forces sequential (i.e. concurrency = 1) backup"] [ratelimit=134.2MB/s] [concurrency-specified=4]
Full Backup <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2024/07/02 01:28:02.552 +08:00] [INFO] [collector.go:77] ["Full Backup success summary"] [total-ranges=8104] [ranges-succeed=8104] [ranges-failed=0] [backup-checksum=7m9.190860539s] [backup-fast-checksum=1.025710248s] [backup-total-ranges=12509] [total-take=1h16m9.669375594s] [total-kv=2812160813] [total-kv-size=221.7GB] [average-speed=48.51MB/s] [backup-data-size(after-compressed)=50.03GB] [Size=50026756230] [BackupTS=450848440988729349]
 
real    76m9.965s
user    1m3.750s
sys     0m49.402s

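(4) Backup comparison before and after the upgrade

+---------------------------------------+-------------+------------------+
| Scenario                              | Backup time | Backup file size |
+---------------------------------------+-------------+------------------+
| Before upgrade (backup.num-threads=6) | 30m12s      | 47.5G            |
| After upgrade (backup.num-threads=2)  | 2h12m22s    | 50.0G            |
| After upgrade (backup.num-threads=4)  | 40m7s       | 50.0G            |
+---------------------------------------+-------------+------------------+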
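
The numbers in this comparison come from the "success summary" line that br prints at the end of each run. If you repeat the test several times, a small helper can pull the two interesting fields out of saved output. This is a sketch: br_summary is a hypothetical name, and it assumes the summary line has the [total-take=...] and [average-speed=...] fields shown in the runs above.

```shell
#!/usr/bin/env bash
# br_summary FILE: for every BR "success summary" line in FILE, print
# "<total-take><TAB><average-speed>". A missing field is printed as "?".
br_summary() {
    awk '
        /success summary/ {
            take = speed = "?"
            # "[total-take=" is 12 characters long, "[average-speed=" is 15;
            # the closing "]" is excluded from the extracted value.
            if (match($0, /\[total-take=[^]]*\]/))    take  = substr($0, RSTART + 12, RLENGTH - 13)
            if (match($0, /\[average-speed=[^]]*\]/)) speed = substr($0, RSTART + 15, RLENGTH - 16)
            print take "\t" speed
        }
    ' "$1"
}

# Example: save the terminal output of each run, then
#   br_summary backup_run.txt
```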

 


3.3 Error handling


3.3.1 Error 1

Error: init config failed: 192.168.68.128:9093: transfer from /data/tidb-deploy/alertmanager-9093/conf/alertmanager.yml to /data/tidb-deploy/alertmanager-9093/conf/alertmanager.yml failed: failed to scp /data/tidb-deploy/alertmanager-9093/conf/alertmanager.yml to tidb@192.168.68.128:/data/tidb-deploy/alertmanager-9093/conf/alertmanager.yml: Process exited with status 1

Solution: remove alertmanager by scaling it in first, then upgrade.

tiup cluster scale-in tidb-test --node 192.168.68.128:9093

 


3.3.2 Error 2

Error: failed to get leader count 192.168.68.129: metric tikv_raftstore_region_count{type="leader"} not found

First check manually whether this metric can be fetched:

[root@localhost ~]# curl -sl 192.168.68.129:20180/metrics | grep 'tikv_raftstore_region_count{type="leader"}'
tikv_raftstore_region_count{type="leader"} 3

The output shows that the metric is available. Resume the upgrade from where it stopped; first find out how far it got:

[root@localhost ~]# tiup cluster audit
ID           Time                       Command
--           ----                       -------
gqmyWYvCwwx  2024-05-23T18:46:36+08:00  /root/.tiup/components/cluster/v1.15.1/tiup-cluster check ./topology.yaml
gqmz0wgydxj  2024-05-23T18:47:39+08:00  /root/.tiup/components/cluster/v1.15.1/tiup-cluster check ./topology.yaml
......
grCGsxGpBfj  2024-06-26T17:48:19+08:00  /root/.tiup/components/cluster/v1.15.2/tiup-cluster display tidb-test
grCGG9LP9dp  2024-06-26T17:51:27+08:00  /root/.tiup/components/cluster/v1.15.2/tiup-cluster scale-in tidb-test --node 192.168.68.128:9093
grCGT8Nzntq  2024-06-26T17:54:40+08:00  /root/.tiup/components/cluster/v1.15.2/tiup-cluster display tidb-test
grCGYsZn30J  2024-06-26T17:55:56+08:00  /root/.tiup/components/cluster/v1.15.2/tiup-cluster upgrade tidb-test v7.5.2
grCHxJRwQ3s  2024-06-26T18:04:30+08:00  /root/.tiup/components/cluster/v1.15.2/tiup-cluster audit
The run stopped at the command with ID grCGYsZn30J, so we replay it:
[root@localhost ~]# tiup cluster replay grCGYsZn30J
Will replay the command `tiup cluster upgrade tidb-test v7.5.2`
Do you want to continue? [y/N]: (default=N) y
......
Upgraded cluster `tidb-test` successfully

Reference: https://docs.pingcap.com/zh/tidb/v7.5/upgrade-tidb-using-tiup#41-%E5%8D%87%E7%BA%A7%E6%97%B6%E6%8A%A5%E9%94%99%E4%B8%AD%E6%96%AD%E5%A4%84%E7%90%86%E5%AE%8C%E6%8A%A5%E9%94%99%E5%90%8E%E5%A6%82%E4%BD%95%E7%BB%A7%E7%BB%AD%E5%8D%87%E7%BA%A7
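
Picking the ID to replay out of a long audit list can also be scripted. The sketch below assumes the column layout shown in the audit output above (ID, time, then the tiup-cluster command line) and that entries are listed oldest first; last_upgrade_id is a name made up for this example.

```shell
#!/usr/bin/env bash
# last_upgrade_id: read `tiup cluster audit` output on stdin and print the ID
# of the last listed "upgrade" command (nothing is printed when there is none).
last_upgrade_id() {
    awk '$3 ~ /tiup-cluster$/ && $4 == "upgrade" { id = $1 }
         END { if (id != "") print id }'
}

# Example:
#   id=$(tiup cluster audit | last_upgrade_id)
#   [ -n "$id" ] && tiup cluster replay "$id"
```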

 

3.3.3 Error 3

[root@localhost ~]# tiup cluster upgrade tidb-test v7.5.2
Before the upgrade, it is recommended to read the upgrade guide at https://docs.pingcap.com/tidb/stable/upgrade-tidb-using-tiup and finish the preparation steps.
This operation will upgrade tidb v5.3.0 cluster tidb-test to v7.5.2:
will upgrade and restart component "            tiflash" to "v7.5.2",
will upgrade and restart component "                cdc" to "v7.5.2",
will upgrade and restart component "                 pd" to "v7.5.2",
will upgrade and restart component "               tikv" to "v7.5.2",
will upgrade and restart component "               pump" to "v7.5.2",
will upgrade and restart component "               tidb" to "v7.5.2",
will upgrade and restart component "            drainer" to "v7.5.2",
will upgrade and restart component "         prometheus" to "v7.5.2",
will upgrade and restart component "            grafana" to "v7.5.2",
will upgrade component     "node-exporter" to "",
will upgrade component "blackbox-exporter" to "".
Do you want to continue? [y/N]:(default=N) y
Upgrading cluster...
 
...... <some log lines omitted>
 
  - Generate config blackbox_exporter -> 192.168.68.128 ... Done
+ [ Serial ] - UpgradeCluster
Upgrading component tiflash
        Restarting instance 192.168.68.132:9000
 
Error: failed to restart: 192.168.68.132 tiflash-9000.service, please check the instance's log(/data/tidb-deploy/tiflash-9000/log) for more detail.: timed out waiting for port 3930 to be started after 2m0s
 
Verbose debug logs has been written to /root/.tiup/logs/tiup-cluster-debug-2024-06-28-21-40-11.log.
[root@localhost ~]#

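The TiFlash log shows that AVX2 is not supported.

[Figure: TiFlash log reporting that AVX2 is not supported (image not preserved)]

After restarting the cluster, many nodes failed to start.

[Figure: cluster status with many nodes not started (image not preserved)]

Apart from TiFlash (which does not support AVX2), the other nodes came up normally when started one by one.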

 


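IV. Production environment upgrade


4.1 Decide the upgrade plan

[Figure: upgrade plan (image not preserved)]

4.2 Early preparation

(1) System architecture check: when deploying TiFlash on Linux AMD64 hardware, the CPU must support the AVX2 instruction set.

# When deploying TiFlash on Linux AMD64 hardware, the CPU must support the AVX2 instruction set. The following command must produce output:

cat /proc/cpuinfo | grep avx2

# When deploying TiFlash on Linux ARM64 hardware, the CPU must support the ARMv8 architecture. The following command must produce output:

cat /proc/cpuinfo | grep 'crc32' | grep 'asimd'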

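(2) Slim down TiDB (purge expired historical data to shorten the time spent migrating Leaders when TiDB is restarted).

(3) Restart TiKV to release space (TiKV v5.3.0 cannot release the space on its own and needs a restart).

(4) Rebuild Prometheus (fixes tiup losing control of Prometheus because the Prometheus instance was shared).

(5) Modify the DDL statements in the statistics jobs (a DDL statement executed during the upgrade makes the upgrade fail).

(6) Export all grant statements and apply them to the new cluster (the linked attachment contains the source code; the password is tidb).

(7) Provide the CPU, memory, and disk specifications of every node in the new cluster (for operations to request servers).

(8) Write the TiCDC toml file to use after the upgrade (the old maxwell format is incompatible with the new version's maxwell format, so the TiCDC changefeeds have to be recreated).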

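4.3 Preparation before the upgrade

Because this upgrade spans major versions, we prepared for the worst case in advance: a standby cluster with the same configuration as production, onto which the production full backup is restored.

(1) Start the cluster backup at noon.

(2) Request servers for the new cluster.

(3) Comment out all maintenance jobs (including archiving, backup, and statistics processing).

(4) Deploy the new cluster.

(5) Install br for the new cluster and run it on one of the PD servers.

(6) Mount OSS on the new cluster (for backup and restore).

(7) Restore the full backup on the new cluster.

For reference: during validation testing, the full restore took 149 minutes.

# br restore full --pd <new cluster PD IP>:2379 -s local://<backup path> --log-file restorefull.log
# [Remember to replace the backup directory name]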
br restore full --pd host_ip:2379 -s local:///dbbak/tidbFullBak/mg_tidb_full_20240717130001 --log-file restorefull.log

(8) Take the first incremental backup on the old cluster.

# Get the TS of the previous backup [remember to change the backup directory]
LAST_BACKUP_TS=`br validate decode --field="end-version" -s local:///dbbak/tidbFullBak/mg_tidb_full_20240717130001 | tail -n1`
echo $LAST_BACKUP_TS
# Start the incremental backup
br backup full \
    --pd host_ip:2379 \
    --ratelimit 128 \
    -s local:///dbbak/tidbFullBak/mg_tidb_incr_20240717_1 \
    --lastbackupts ${LAST_BACKUP_TS}
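
Because LAST_BACKUP_TS is taken from the last line of command output, an error message could end up in the variable if the decode step fails. A guard like the following can stop the incremental backup from starting with a bad value. It is a sketch, and is_valid_ts is a hypothetical helper; it only checks that the value is a positive integer, as a TSO such as 450840825721257986 is.

```shell
#!/usr/bin/env bash
# is_valid_ts VALUE: succeed only when VALUE is a positive integer with no
# leading zero (an empty string or an error message fails the check).
is_valid_ts() {
    [[ "$1" =~ ^[1-9][0-9]*$ ]]
}

# Example: run this check between computing LAST_BACKUP_TS and starting the backup.
#   if ! is_valid_ts "${LAST_BACKUP_TS}"; then
#       echo "unexpected LAST_BACKUP_TS: '${LAST_BACKUP_TS}'" >&2
#       exit 1
#   fi
```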

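(9) Perform the first incremental restore on the new cluster.

Note: if the first incremental backup finishes quickly, this restore can be skipped; instead, take one incremental backup based on the full backup right before the real upgrade.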

br restore full --pd host_ip:2379 -s local:///dbbak/tidbFullBak/mg_tidb_incr_20240717_1 --log-file restoreincr.log

 


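4.4 Upgrade the cluster and components


4.4.1 Revoke all DDL privileges


4.4.2 Take another incremental backup

# Get the TS of the previous backup [remember to change the backup directory]

# Base this incremental backup on the previous incremental backup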
LAST_BACKUP_TS=`br validate decode --field="end-version" -s local:///dbbak/tidbFullBak/mg_tidb_incr_20240717_1 | tail -n1`
echo $LAST_BACKUP_TS

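# [Choose only one base for LAST_BACKUP_TS, the full backup or the previous incremental backup, depending on the situation]

# Start the incremental backup
br backup full \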
    --pd host_ip:2379 \
    --ratelimit 128 \
    -s local:///dbbak/tidbFullBak/mg_tidb_incr_20240717_2 \
    --lastbackupts ${LAST_BACKUP_TS}


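4.4.3 Upgrade the cluster


4.4.3.1 System checks

(1) Stop all scheduled jobs that perform backup, restore, or DDL operations.

(2) Set server-version to an empty value or to the real version of the current TiDB to avoid unexpected behavior.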

mysql> show config where name like '%server-version%';
+------+---------------------+----------------+-------+
| Type | Instance            | Name           | Value |
+------+---------------------+----------------+-------+
| tidb | 192.168.68.129:4000 | server-version |       |
| tidb | 192.168.68.128:4000 | server-version |       |
+------+---------------------+----------------+-------+
2 rows in set, 1 warning (0.07 sec)

4.4.3.2 Upgrade TiUP

The tiup version must be 1.11.3 or later.

tiup update --self
tiup --version

4.4.3.3 Upgrade TiUP Cluster

The tiup cluster version must be 1.11.3 or later.

tiup update cluster
tiup cluster --version

4.4.3.4 Make sure no DDL is running

mysql> admin show ddl;
+------------+--------------------------------------+---------------------+--------------+--------------------------------------+-------+
| SCHEMA_VER | OWNER_ID                             | OWNER_ADDRESS       | RUNNING_JOBS | SELF_ID                              | QUERY |
+------------+--------------------------------------+---------------------+--------------+--------------------------------------+-------+
|        100 | cad28f9e-dcda-4782-8e14-c792604d4275 | 192.168.68.128:4000 |              | cad28f9e-dcda-4782-8e14-c792604d4275 |       |
+------------+--------------------------------------+---------------------+--------------+--------------------------------------+-------+
1 row in set (0.01 sec)

Make sure RUNNING_JOBS is empty.

Note: do not run DDL operations during the upgrade.

4.4.3.5 Make sure no backup or restore is running

mysql> show backups;
Empty set (0.00 sec)

mysql> show restores;
Empty set (0.00 sec)

4.4.3.6 Check the current health of the cluster

[root@localhost ~]# tiup cluster check mg-tidb --cluster
Checking updates for component cluster... Timedout (after 2s)
+ Download necessary tools
......
Checking region status of the cluster tidb-test...
All regions are healthy.
[root@localhost ~]#

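When the check finishes, it prints the region status result at the end. If the result is "All regions are healthy", all regions in the cluster are healthy and the upgrade can proceed. If the result is "Regions are not fully healthy: m miss-peer, n pending-peer" together with the prompt "Please fix unhealthy regions before other operations.", some regions are in an abnormal state; resolve the problem first and continue the upgrade only after a re-check reports "All regions are healthy".

If there are errors, you can first try automatic repair: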

tiup cluster check mg-tidb --cluster --apply

4.4.3.7 Remove the TiCDC changefeeds and record the current time

Time at that moment: xxxx:xxx:xxx

# Remove the changefeed for table xxx_cmdinfo (other tables likewise)
cdc cli changefeed remove --changefeed-id cmdinfo-kafka --pd=http://host_ip:2379,http://host_ip:2379,http://host_ip:2379 --force

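# After a changefeed is removed, its replication status (mainly the checkpoint) is kept for 24 hours, during which a changefeed with the same name cannot be created. To remove it completely, specify --force or -f; all changefeed information is then cleaned up and a changefeed with the same name can be created immediately.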


4.4.3.8 Upgrade the TiDB cluster

tiup cluster upgrade mg-tidb v7.5.2

The components need to be restarted:

[root@localhost ~]# tiup cluster upgrade mg-tidb v7.5.2
Before the upgrade, it is recommended to read the upgrade guide at https://docs.pingcap.com/tidb/stable/upgrade-tidb-using-tiup and finish the preparation steps.
This operation will upgrade tidb v5.3.0 cluster tidb-test to v7.5.2:
will upgrade and restart component "            tiflash" to "v7.5.2",
will upgrade and restart component "                cdc" to "v7.5.2",
will upgrade and restart component "                 pd" to "v7.5.2",
will upgrade and restart component "               tikv" to "v7.5.2",
will upgrade and restart component "               pump" to "v7.5.2",
will upgrade and restart component "               tidb" to "v7.5.2",
will upgrade and restart component "            drainer" to "v7.5.2",
will upgrade and restart component "         prometheus" to "v7.5.2",
will upgrade and restart component "            grafana" to "v7.5.2",
will upgrade component     "node-exporter" to "",
will upgrade component "blackbox-exporter" to "".
Do you want to continue? [y/N]:(default=N) y


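4.4 Upgrade TiCDC

The original TiCDC changefeeds used the maxwell format, and each record written to Kafka occupied its own line. After upgrading to v7.5.2, multiple records end up on one line, so the changefeeds have to be recreated using the canal-json format.

4.4.1 Upgrade TiCDC

# List changefeeds (v5.3.0 command) to confirm the old changefeeds have been removed
tiup cdc cli changefeed list --pd=host_ip:2379
# An output of [] means all of them have been removed
# Upgrade cdc to v7.5.2
tiup update cdc:v7.5.2
# List changefeeds (v7.5.2 command)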
tiup cdc cli changefeed list --server=host_ip:8300

4.4.2 Create the changefeed

# Recreate the changefeed for table xxx_cmdinfo
tiup cdc cli changefeed create \
    --server=172.16.5.9:8300  \
    --sink-uri="kafka://xxx.xxx.xxx.xxx:9092/ticdc_xxx_cmdinfo?protocol=canal-json&kafka-version=2.4.1&partition-num=6&max-message-bytes=67108864&replication-factor=1" \
    --changefeed-id="xxx-cmdinfo-kafka" \
    --config=xxx_cmdinfo.toml

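4.4.3 Update data to recover the TiCDC changes lost during the upgrade

Take the time recorded in step 4.4.3.7, move it back by about 10 minutes, and then update the rows of the affected tables created at or after that time (for example, update an unimportant column; here we add 1 second to create_time).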

update xxx_cmdinfo set create_time = date_add(create_time, interval +1 second) where create_time > 'xxx:xx:xx';
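
Moving the recorded time back by 10 minutes is easy to get wrong by hand around hour or day boundaries. A sketch of doing it with GNU date follows; rewind_minutes is a hypothetical helper, and it assumes the recorded time is in the form YYYY-MM-DD HH:MM:SS and in the same time zone as the shell.

```shell
#!/usr/bin/env bash
# rewind_minutes "YYYY-MM-DD HH:MM:SS" N: print the timestamp N minutes earlier.
# Requires GNU date; the timestamp is converted to epoch seconds, shifted, and
# formatted again in the shell's current time zone.
rewind_minutes() {
    local epoch
    epoch=$(date -d "$1" +%s) || return 1
    date -d "@$(( epoch - $2 * 60 ))" '+%Y-%m-%d %H:%M:%S'
}

# Example: compute the lower bound to put into the WHERE clause of the update.
#   since=$(rewind_minutes "<time recorded in step 4.4.3.7>" 10)
```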

4.4.4 Modify the Kafka consumer code

Because the format TiCDC writes to Kafka has also changed, the Kafka consumer code must be modified accordingly. For the format differences see section 5.1.

 


4.5 Upgrade br

# Download:

wget https://download.pingcap.org/tidb-community-toolkit-v7.5.2-linux-amd64.tar.gz
tar xvzf tidb-community-toolkit-v7.5.2-linux-amd64.tar.gz
cd tidb-community-toolkit-v7.5.2-linux-amd64
tar xvzf br-v7.5.2-linux-amd64.tar.gz
cp br /usr/bin
# Try a backup
# su - tidb
# br backup full --pd "host_ip:2379" -s "local:///nfs/full_20240701_2" --log-file backup_full_2.log


4.6 Restore the revoked DDL privileges

Run the privilege statements backed up earlier to restore the revoked DDL privileges.

 


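V. Problems encountered during the upgrade


5.1 Abnormal maxwell-format JSON records from TiCDC to Kafka after upgrading to v7.5.2

After the upgrade, with TiCDC replicating to Kafka in the maxwell format, the messages consumed from Kafka differ: v5.3.0 puts each record on its own line, whereas v7.5.2 puts multiple records on one line, with no separator between the records on a line.

To get one record per line on the consumer side, the output format of the upgraded v7.5.2 TiCDC changefeed has to be changed to canal-json, a format that v5.3.0 does not have.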

==========insert==============
# maxwell insert format (before the upgrade)
{"database":"test","table":"t","type":"insert","ts":1637823163,"data":{"create_time":"2018-01-01 00:00:00","dept":1,"id":1,"last_login_time":"2018-03-01 12:00:00","name":"user_1"}}
# canal-json insert format (after the upgrade)
{"id":0,"database":"test","table":"t","pkNames":["id"],"isDdl":false,"type":"INSERT","es":1721044339113,"ts":1721044339899,"sql":"","sqlType":{"id":4,"dept":-6,"name":12,"create_time":93,"last_login_time":93},"mysqlType":{"dept":"tinyint","name":"varchar","create_time":"datetime","last_login_time":"datetime","id":"int"},"old":null,"data":[{"id":"1","dept":"1","name":"user_1","create_time":"2018-01-01 00:00:00","last_login_time":"2018-03-01 12:00:00"}]}
 
==========update==============
# maxwell update format (before the upgrade)
{"database":"test","table":"t","type":"update","ts":1637824161,"data":{"create_time":"2021-11-25 15:09:21","dept":1,"id":1,"last_login_time":"2018-03-01 12:00:00","name":"user_1"},"old":{"create_time":"2018-01-01 00:00:00"}}
# canal-json update format (after the upgrade)
{"database": "test", "table": "t", "type": "update","ts":1637824161,"data": {"create_time":"2021-11-25 15:09:21","dept":1,"id":1,"last_login_time":"2018-03-01 12:00:00","name":"user_1"},"old":{"create_time":"2018-01-01 00:00:00"}}
 
==========delete==============
# maxwell delete format (before the upgrade)
{"database":"test","table":"t","type":"delete","ts":1637824320,"old":{"create_time":"2021-11-25 15:10:46","dept":1,"id":1,"last_login_time":"2018-03-01 12:00:00","name":"user_1"}}
# maxwell delete format (after the upgrade)
{"database": "test", "table": "t", "type": "delete","ts":1637824161,"data": {"create_time":"2021-11-25 15:09:21","dept":1,"id":1,"last_login_time":"2018-03-01 12:00:00","name":"user_1"},"old":{"create_time":"2018-01-01 00:00:00"}}

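Solution: our developers modified the Kafka consumer code. Before the upgrade we stopped TiCDC replication and waited until the Kafka topics were fully consumed and the data was confirmed to be up to date; then we upgraded the TiDB cluster and released the new consumer code after the upgrade.

Reference: https://asktug.com/t/topic/1005840?replies_to_post_number=2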

 


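5.2 A single PD node failed to start

Following advice from TiDB community mentors and after reading the source code, we learned that PD startup depends on environment variables: before starting, a PD node first checks whether certain environment variables are set. We had earlier modified the environment variables to make DM more convenient to use, and that change is what prevented PD from starting.

Figure 1 below shows the server's environment variables and Figure 2 shows the PD source code.

Figure 1: [server environment variables (image not preserved)]

Figure 2: [PD source code (image not preserved)]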


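5.3 Backup and restore

A newly deployed cluster defaults new_collations_enabled_on_first_bootstrap to true, whereas our cluster, upgraded from v5.3.0 to v7.5.2, has new_collations_enabled_on_first_bootstrap set to false. Because the values do not match, the restore fails.

The new cluster must be created with new_collations_enabled_on_first_bootstrap set to false for the full restore to succeed. (This parameter takes effect only when it is set at cluster creation.)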

 


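VI. Issues remaining after the upgrade

Traffic visualization in TiDB Dashboard on our v7.5.2 cluster currently has a problem: once a table has produced traffic, it keeps being shown as having traffic even when there are no further reads or writes. We have thus lost what used to be our most powerful tool for locating problems, and we are still looking for a solution.

[Figure: TiDB Dashboard traffic visualization (image not preserved)]

For details see AskTUG: https://asktug.com/t/topic/1029492/1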


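VII. Summary

This TiDB cluster upgrade was an adventure full of challenges and rewards. Along the way I came to truly appreciate the enormous value of careful planning and thorough preparation. The exchanges with and feedback from the TiDB community were like opening a window onto the future, showing me how much room this product has to keep improving. I firmly believe that TiDB will keep delivering a more stable, efficient, and easy-to-use database to its users. I also look forward to working side by side with the community to find more effective ways to improve performance and reduce cost.

Special thanks to @升级导师-军军, @升级导师-刘培梁, @表妹, and everyone in the group for their strong support.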

 
