PostgreSQL通用分布式数据库架构-citus-PingCAP

PostgreSQL通用分布式数据库架构-citus

网友投稿 1974 2023-04-03

***通用分布式数据库架构-citus

为什么需要分布式数据库

有很多原因数据库需要扩展性。1、请求需要访问的数据量过大（单纯的数据量大不是理由，例如从不访问，归档即可）；2、服务器CPU、内存、网络、IO到了瓶颈，响应时间大大下降；3、MPP中，集中式数据库在设计时通常为了开发人员使用更加顺畅和丝滑，尽可能的让数据库设计和SQL非常简单，比如不需要指定某些表实际上是存在主外键关系，从而导致并行执行效果打折；或者并行执行在一开始并不包含，后面逐渐增强，导致并行执行有天然的缺陷，分区亦如此。这三通常是根本原因。

这三种模式在内核的可优化空间上也有较大差异：原生按照分布式数据库的架构通常具有最大优化空间、SQL无限制，但是受限于硬件、在ms甚至更低级别的时延上表现较差；中间件架构受限于侵入能力，在事务能力上比较弱、在物理优化策略上比较弱、SQL限制多、尤其是管理特性、系统函数和序列等、唯一约束、分析函数等限制更多，但是对于OLTP的扩展提升效果佳，绑定于postgresql citus算是介于两者之间（对于中间件架构上容易陷入的一个误区是轻量，不管是中间件本身还是资源要求，实际上是完整的CN角色，只要需要deparse，需要的资源和单个DN节点可以认为是一样的，本质上也是一个数据库，无非是少了存储而已，内存和CPU甚至要求更高也不奇怪）；带GTM的分布式数据库侵入性最强，因为要实现Global MVCC，理论上需要统一所有节点的LSN以及事务号，但实际实现是可以妥协的，比如仅实现可重复度以上的隔离级别就不需要统一所有的lsn和事务号，但是节点过多比如16节点甚至8节点以上时就可能出现严重的扩展性问题。

分布式数据库中大家最关心的一致性问题

1、对于统计分析而言，如果需要精准统计的场景，比如清结算、统计报表，通常跑批期间库是只读的，所以相差10ms级（因为NTP时间差异在10ms数量级）的时间差异不是关键问题。

2、对于OLTP而言，通常分片会以客户id为维度，所有的操作都会带上客户id分片，比如扣钱、加持仓等，所以路由到单个分片，也不存在不一致性问题。

2.1、按照产品为维度，给持仓表所有的产品=10001的记录持仓数量增加10%，然后再加1。此时操作多个分片，要确保即使发生不可控的交叉执行，也不会发生丢失更新。也就是只要A和B有序达到CN节点，所有记录的结果总是1，而不会出现2。

1、所有shard中所有产品=10001的记录的持仓数量为原持仓数量*1.1+1

2、部分shard中产品=10001的记录的持仓数量为（原持仓数量+1）*1.1

2.2、2.1的并发应该尽可能的高。

2.1和2.2是分布式数据库应该解决的问题。

Citus介绍

实际上很多系统数据量积累的比较大、同时业务流程又很复杂，但是TPS不高，这些系统要应用于分布式数据库，物理模型的设计是很重要的，不是简单的库能套上去的，例如数据中心、历史库的包含就很重要了，尤其是冗余、本地查找表、colocation。亦或者上去了就下不来了，维护成本会非常高。

注：citus架构的优点在于，它认为分布式是一个特性，而不是属性。这一点LZ在所有场合都是这么坚信，95%+的系统永远都不需要微服务架构，数据库也不需要分布式，因为其到不了那个容量，所以理论上可以扩展使得应用能够同时运行于单实例和分布式，而其它一开始就设计为分布式的数据库是很难的。

因为没有做单独GTM节点的概念，citus的协调者无法实现多活，这种情况下容易出现协调者单点（citus 11.0.2支持MX，真正的多CN，lightdb采用同样的多CN架构），如下：

对此，Citus还提供了两个参数use_secondary_node和writable_standby_coordinator以支持写入能力扩展及数据节点读写分离。这样standby cn也可以执行查询和DML操作。如下所示：

由此可见，可靠的分布式数据库架构是非常复杂的，如果没有非常一体化的监控管理平台，其维护难度可想而知。

对于我们上面提到的分布式一致性问题，citus解决了2.1，但是没有解决2.2。

Citus安装

[zjh@lightdb1 usr]$ rpm -ql postgresql13-13.3-1PGDG.rhel7.x86_64/usr/pgsql-13/bin/clusterdb/usr/pgsql-13/bin/createdb/usr/pgsql-13/bin/createuser/usr/pgsql-13/bin/dropdb/usr/pgsql-13/bin/dropuser/usr/pgsql-13/bin/pg_basebackup/usr/pgsql-13/bin/pg_config/usr/pgsql-13/bin/pg_dump/usr/pgsql-13/bin/pg_dumpall[zjh@lightdb1 usr]$ rpm -qa | grep cituscitus_13-10.0.3-1.rhel7.x86_64r[zjh@lightdb1 usr]$ rpm -ql citus_13-10.0.3-1.rhel7.x86_64/usr/pgsql-13/doc/extension/README-citus.md/usr/pgsql-13/lib/citus.so/usr/pgsql-13/share/extension/citus--10.0-1--10.0-2.sql/usr/pgsql-13/share/extension/citus--10.0-2--10.0-3.sql/usr/pgsql-13/share/extension/citus--8.0-1--8.0-2.sql/usr/pgsql-13/share/extension/citus--8.0-1.sql/usr/pgsql-13/share/extension/citus--8.0-10--8.0-11.sql/usr/pgsql-13/share/extension/citus--8.0-11--8.0-12.sql/usr/pgsql-13/share/extension/citus--8.0-12--8.0-13.sql/usr/pgsql-13/share/extension/citus--8.0-13--8.1-1.sql/usr/pgsql-13/share/extension/citus--8.0-2--8.0-3.sql/usr/pgsql-13/share/extension/citus--8.0-3--8.0-4.sql/usr/pgsql-13/share/extension/citus--8.0-4--8.0-5.sql/usr/pgsql-13/share/extension/citus--8.0-5--8.0-6.sql/usr/pgsql-13/share/extension/citus--8.0-6--8.0-7.sql/usr/pgsql-13/share/extension/citus--8.0-7--8.0-8.sql/usr/pgsql-13/share/extension/citus--8.0-8--8.0-9.sql/usr/pgsql-13/share/extension/citus--8.0-9--8.0-10.sql/usr/pgsql-13/share/extension/citus--8.1-1--8.2-1.sql/usr/pgsql-13/share/extension/citus--8.2-1--8.2-2.sql/usr/pgsql-13/share/extension/citus--8.2-2--8.2-3.sql/usr/pgsql-13/share/extension/citus--8.2-3--8.2-4.sql/usr/pgsql-13/share/extension/citus--8.2-4--8.3-1.sql/usr/pgsql-13/share/extension/citus--8.3-1--9.0-1.sql/usr/pgsql-13/share/extension/citus--9.0-1--9.0-2.sql/usr/pgsql-13/share/extension/citus--9.0-2--9.1-1.sql/usr/pgsql-13/share/extension/citus--9.1-1--9.2-1.sql/usr/pgsql-13/share/extension/citus--9.2-1--9.2-2.sql/usr/pgsql-13/share/extension/citus--9.2-2--9.2-4.sql/usr/pgsql-13/share/extension/citus--9.2-4--9.3-2.sql/usr/pgsql-13/share/extension/citus--9.3-1--9.2-4.sql/usr/pgsql-13/share/extension/citus--9.3-2--9.4-1.sql/usr/pgsql-13/share/extension/citus--9.4-1--9.5-1.sql/usr/pgsql-13/share/extension/citus--9.5-1--10.0-1.sql/usr/pgsql-13/share/extension/citus.control/usr/share/doc/citus_13-10.0.3/usr/share/doc/citus_13-10.0.3/CHANGELOG.md/usr/share/licenses/citus_13-10.0.3/usr/share/licenses/citus_13-10.0.3/LICENSE

然后正常通过initdb创建postgresql数据库，1个CN，2个DN。

如下：

安装citus插件：

-- CN和DN都要配置shared_preload_libraries='citus' -- 第一个插件必须是citusCREATE EXTENSION citus; -- 安装在postgres用户下即可

SELECT * from citus_add_node('10.0.0.1', 13588);SELECT * from citus_add_node('10.0.0.1', 23588);

查询DN列表：

postgres=# SELECT * FROM citus_get_active_worker_nodes(); node_name | node_port --------------+----------- 10.0.0.1 | 23588 10.0.0.1 | 13588(2 rows)

概念

在citus中，分片和节点不是一对一关系，这一点不同于greenplum，更接近nosql如couchbase的设计，一定程度上这么做也避免了使用了citus之后还需要分区（因为postgresql的分区优化是远不如***的，到目前为止PG15唯一的用途就是AP里面的pruning）的必要性（这是个优点、也是个缺点，平衡的结果）。

Citus表类型

citus中表分三种类型，1：分库表（每个DN n个分片，分片数量可配置，一般是订单表和客户表）；2：广播表（每个DN一份，CN不包括，一般是字典表、产品表、费率表、机构表、权限表等）；3：全局表（仅存在于CN，一些系统参数表，统计表，也可能广播存储，看情况），全局表一般不会和广播表、分库表进行关联，默认CN创建表的时候就是local表，也可以通过SELECT undistribute_table('github_events');将分库表切换回local表（此时会数据先迁移回来，也是缩容的一种方式）。

广播表和分库表，广播表和广播表之间关联会很多。

同时会存在多种业务存在于同一个数据库中的情况，例如库存和客户，操作日志和订单，小二和菜单、功能、客户，并且同时有从菜单维度查，也有从小二维度查。所以citus支持对表进行分组，相关分组的表，citus在生成分布式执行计划的时候就知道那些是相关的，哪些是无关的。如下：

SELECT create_distributed_table('event', 'tenant_id');　　SELECT create_distributed_table('page', 'tenant_id', colocate_with => 'event');

分组的前提是两个表使用相同字段作为分片字段。分组可以使得SQL的优化更加进一步。

总有一会儿，你会发现库存和客户表进行关联，通过订单进行的。这个时候库存是根据产品分片的，客户是通过客户id分片的。此时效果会怎么样呢？

不同于greenplum支持distributed by语法，citus因为采用extension实现，没有扩展pg本身的语法，所以采用函数的方式来指定表是否为分布式表。

CREATE TABLE companies (id bigserial PRIMARY KEY,name text NOT NULL,image_url text,created_at timestamp without time zone NOT NULL,updated_at timestamp without time zone NOT NULL);SELECT create_distributed_table('companies', 'id'); -- companies表为分布式表，id是用于分片的字段

需要注意的是，citus分片数量和worker数量不是一一对应，这和gp不同，但类似于现在tidb、oceanbase的做法。如下：

要创建广播表，可以使用create_reference_table函数：

SELECT create_reference_table('geo_ips'); -- 所有worker节点广播，不包含CN

大多数的DDL语句citus都支持，会负责分布式调用所有worker。

分布式数据库性能优化关键

co-location是分布式数据库性能好坏和扩展性的关键。

如果实在无法实现colocation，哪怕通过lightdb逻辑复制，尽可能避免下列操作：1、agg(distinct value2)，带distinct的聚合函数2、分析函数() OVER (PARTITION BY <...> ORDER BY <...> )3、任意字段排序和分页查询3、对于这三种情况，gp的interconnect也无能为力，CN必然是瓶颈；4、流式计算也不能解决to B端的问题，只能靠更加智能的算法和空间换时间；

自定义数据分布算法、副本数、分片数

citus原生支持自定义副本数（原生是通过分布式事务来实现的，我们lightdb建议采用流复制实现模式）和分片数，可通过全局配置或函数控制。Lightdb扩展了数据分布算法，支持多列、同时也将提供基于UDF自定义，22.3。

Citus函数类型

不管用户是否承认，相同的功能，存储过程和函数实现的效率就是要比应用发送SQL过来效率更高。所以citus支持了分布式函数的概念。当通过create_distributed_function将函数指定为分布式函数之后，CN在调用函数时会直接以函数为单位派发到DN，而不是在CN执行的时候将SQL解析出来通过SPI执行（存储过程执行SQL就是通过SPI调用存储引擎）。

新增节点

新增节点后，默认不会启用，需要调用rebalance_table_shards让citus对数据进行迁移，然后才会被访问。

SELECT rebalance_table_shards('companies');

执行计划分析

explain(analyze,verbose,buffers) select count(*) as low_stockfrom ( select s_w_id, s_i_id, s_quantity from bmsql_stock where s_w_id = 975 and s_quantity < 12 and s_i_id in ( select ol_i_id from bmsql_district join bmsql_order_line on ol_w_id = d_w_id and ol_d_id = d_id and ol_o_id >= d_next_o_id - 20 and ol_o_id < d_next_o_id where d_w_id = 975 and d_id = 9 ) ) as LQUERY PLAN |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=0 width=0) (actual time=9.781..9.782 rows=1 loops=1) | Output: remote_scan.low_stock | Task Count: 1 | Tuple data received from nodes: 1 bytes | Tasks Shown: All | -> Task | Query: SELECT count(*) AS low_stock FROM (SELECT bmsql_stock.s_w_id, bmsql_stock.s_i_id, bmsql_stock.s_quantity FROM public.bmsql_stock_103384 bmsql_stock WHERE ((bmsql_stock.s_w_id OPERATOR(pg_catalog.=) 975) AND (bmsql_stock.s_quantity OPERATOR(| Tuple data received from node: 1 bytes | Node: host=127.0.0.1 port=13588 dbname=postgres | -> Aggregate (cost=25597.32..25597.33 rows=1 width=8) (actual time=1.276..1.277 rows=1 loops=1) | Output: count(*) | Buffers: shared hit=810 | -> Nested Loop (cost=7612.59..25597.14 rows=73 width=0) (actual time=0.389..1.272 rows=4 loops=1) | Inner Unique: true | Buffers: shared hit=810 | -> HashAggregate (cost=7612.16..7646.24 rows=3408 width=4) (actual time=0.163..0.206 rows=186 loops=1) | Output: bmsql_order_line.ol_i_id | Group Key: bmsql_order_line.ol_i_id | Batches: 1 Memory Usage: 129kB | Buffers: shared hit=42 | -> Nested Loop (cost=0.71..7603.64 rows=3408 width=4) (actual time=0.055..0.131 rows=189 loops=1) | Output: bmsql_order_line.ol_i_id | Buffers: shared hit=42 | -> Index Scan using bmsql_district_pkey_103191 on public.bmsql_district_103191 bmsql_district (cost=0.27..8.30 rows=1 width=12) (actual time=0.014..0.014 rows=1 loops=1) | Output: bmsql_district.d_w_id, bmsql_district.d_id, bmsql_district.d_ytd, bmsql_district.d_tax, bmsql_district.d_next_o_id, bmsql_district.d_name, bmsql_district.d_street_1, bmsql_district.d_street_2, bmsql_district.d| Index Cond: ((bmsql_district.d_w_id = 975) AND (bmsql_district.d_id = 9)) | Buffers: shared hit=3 | -> Index Scan using bmsql_order_line_pkey_103351 on public.bmsql_order_line_103351 bmsql_order_line (cost=0.44..7561.26 rows=3408 width=16) (actual time=0.022..0.081 rows=189 loops=1) | Output: bmsql_order_line.ol_w_id, bmsql_order_line.ol_d_id, bmsql_order_line.ol_o_id, bmsql_order_line.ol_number, bmsql_order_line.ol_i_id, bmsql_order_line.ol_delivery_d, bmsql_order_line.ol_amount, bmsql_order_line.| Index Cond: ((bmsql_order_line.ol_w_id = 975) AND (bmsql_order_line.ol_d_id = 9) AND (bmsql_order_line.ol_o_id >= (bmsql_district.d_next_o_id - 20)) AND (bmsql_order_line.ol_o_id < bmsql_district.d_next_o_id)) | Buffers: shared hit=39 | -> Index Scan using bmsql_stock_pkey_103384 on public.bmsql_stock_103384 bmsql_stock (cost=0.43..5.27 rows=1 width=4) (actual time=0.006..0.006 rows=0 loops=186) | Output: bmsql_stock.s_w_id, bmsql_stock.s_i_id, bmsql_stock.s_quantity, bmsql_stock.s_ytd, bmsql_stock.s_order_cnt, bmsql_stock.s_remote_cnt, bmsql_stock.s_data, bmsql_stock.s_dist_01, bmsql_stock.s_dist_02, bmsql_stock.s_dist_03| Index Cond: ((bmsql_stock.s_w_id = 975) AND (bmsql_stock.s_i_id = bmsql_order_line.ol_i_id)) | Filter: (bmsql_stock.s_quantity < 12) | Rows Removed by Filter: 1 | Buffers: shared hit=768 | Planning Time: 0.755 ms | Execution Time: 1.498 ms |Planning: | Buffers: shared hit=3 |Planning Time: 0.324 ms |Execution Time: 9.796 ms |

一般SQL，失真不算很严重。

性能分析

lightdb@test_for_fdw=# explain select * from t1; QUERY PLAN ----------------------------------------------------------------------------- Custom Scan (Canopy Adaptive) (cost=0.00..0.00 rows=0 width=0) Task Count: 1 Tasks Shown: All -> Task Node: host=127.0.0.1 port=20001 dbname=test_for_fdw -> Seq Scan on t1_102040 t1 (cost=0.00..22.70 rows=1270 width=36)(6 rows)Time: 2.884 mslightdb@test_for_fdw=# select * from t1; id | value ----+-------- 1 | value1 2 | value2 3 | value3(3 rows)Time: 1.609 mslightdb@test_for_fdw=# select * from t1; id | value ----+-------- 1 | value1 2 | value2 3 | value3(3 rows)Time: 1.439 mslightdb@test_for_fdw=# select * from t1; id | value ----+-------- 1 | value1 2 | value2 3 | value3(3 rows)Time: 2.031 mslightdb@test_for_fdw=# select * from t1; id | value ----+-------- 1 | value1 2 | value2 3 | value3(3 rows)Time: 1.809 ms=================lightdb@test_for_fdw=# select undistribute_table('t1');NOTICE: creating a new table for public.t1NOTICE: moving the data of public.t1NOTICE: dropping the old public.t1NOTICE: renaming the new table to public.t1 undistribute_table -------------------- (1 row)Time: 423.286 mslightdb@test_for_fdw=# select * from t1; id | value ----+-------- 1 | value1 2 | value2 3 | value3(3 rows)Time: 0.681 mslightdb@test_for_fdw=# select * from t1; id | value ----+-------- 1 | value1 2 | value2 3 | value3(3 rows)Time: 0.774 mslightdb@test_for_fdw=# select * from t1; id | value ----+-------- 1 | value1 2 | value2 3 | value3(3 rows)Time: 0.478 ms

由上可知，即使只有一个DN节点，查询一张极小的表，只要分布式，就会有差不多1ms的固定开销（因为多了一次解析、序列化、反序列化）。

高可用

流复制模式即可，HA既可以使用keepalived做切换、也可以使用citus自身机制，使用keepalived的话需要N个VIP、还有公有云slb的问题。

promote可以使用patroni或repmgr。

lightdb分布式模式标准使用类repmgr和citus自身机制（因为DN主备采用交叉部署）。k8s下使用patroni+keepalived（DN主备各自独享pod）。

CN成为瓶颈

citus 11.0.2也支持MX，但是性能总体一般，且全开源。

bypass-CN（direct-DN）模式

存储过程、函数与视图、序列、自增、标识列等特性

见下文。

主键

zjh@postgres=# create table dt(id int primary key, v text);CREATE TABLEzjh@postgres=# select create_distributed_table('dt','v');ERROR: cannot create constraint on "dt"DETAIL: Distributed relations cannot have UNIQUE, EXCLUDE, or PRIMARY KEY constraints that do not include the partition column (with an equality operator if EXCLUDE).zjh@postgres=# alter table dt add v1 text;ALTER TABLEzjh@postgres=# alter table dt add v2 text;ALTER TABLEzjh@postgres=# alter table dt add dist_col int;ALTER TABLEzjh@postgres=# select create_distributed_table('dt','dist_col');ERROR: cannot create constraint on "dt"DETAIL: Distributed relations cannot have UNIQUE, EXCLUDE, or PRIMARY KEY constraints that do not include the partition column (with an equality operator if EXCLUDE).zjh@postgres=# \dS+ dt; Table "public.dt" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description ----------+---------+-----------+----------+---------+----------+--------------+------------- id | integer | | not null | | plain | | v | text | | | | extended | | v1 | text | | | | extended | | v2 | text | | | | extended | | dist_col | integer | | | | plain | | Indexes: "dt_pkey" PRIMARY KEY, btree (id)Access method: heapzjh@postgres=# alter table dt drop constraint dt_pkey;ALTER TABLEzjh@postgres=# alter table dt add primary key(id,dt_pkey);ERROR: column "dt_pkey" of relation "dt" does not existzjh@postgres=# alter table dt add primary key(id,dist_col);ALTER TABLEzjh@postgres=# select create_distributed_table('dt','dist_col'); create_distributed_table -------------------------- (1 row)

*** SQL特性兼容性与限制

分非分布式事务和分布式事务。lazy模式的分布式事务启动机制（源码需要再确认下）。但是事务不是延迟的。

使用benchmarksql进行TPC-C测试

因为TPC-C所有的表都co-location到warehouse_id了，所以跑TPCC是没有问题的。只不过citus的重写着实有点蠢（当然也可以认为是过于形式化（formal））。如下：

2021-10-07 21:21:47.037945T [239675] LOG: duration: 97782.322 ms execute : SELECT count(*) AS low_stock FROM (SELECT bmsql_stock.s_w_id, bmsql_stock.s_i_id, bmsql_stock.s_quantity FROM public.bmsql_stock_103379 bmsql_stock WHERE ((bmsql_stock.s_w_id OPERATOR(pg_catalog.=) $1) AND (bmsql_stock.s_quantity OPERATOR(pg_catalog.<) $2) AND (bmsql_stock.s_i_id OPERATOR(pg_catalog.=) ANY (SELECT bmsql_order_line.ol_i_id FROM (public.bmsql_district_103186 bmsql_district JOIN public.bmsql_order_line_103346 bmsql_order_line ON (((bmsql_order_line.ol_w_id OPERATOR(pg_catalog.=) bmsql_district.d_w_id) AND (bmsql_order_line.ol_d_id OPERATOR(pg_catalog.=) bmsql_district.d_id) AND (bmsql_order_line.ol_o_id OPERATOR(pg_catalog.>=) (bmsql_district.d_next_o_id OPERATOR(pg_catalog.-) 20)) AND (bmsql_order_line.ol_o_id OPERATOR(pg_catalog.<) bmsql_district.d_next_o_id)))) WHERE ((bmsql_district.d_w_id OPERATOR(pg_catalog.=) $3) AND (bmsql_district.d_id OPERATOR(pg_catalog.=) $4)))))) l2021-10-07 21:21:47.037945T [239675] DETAIL: parameters: $1 = '974', $2 = '13', $3 = '974', $4 = '10'

同时，citus到worker节点后，执行计划的效果很不理想。有些select count(1)执行居然要几十秒，在单机时只要及时毫秒。tpmC从20万掉下到6万。在单机下不需要创建索引的，在分布式下可能需要创建索引。

管理接口与管理工具

除了标准的建表功能外，分布式数据库至少要支持：

显示的广播接口，包括：到每个主worker节点，到每个主副worker节点，到每个主分片，到每个主副分片。

-- List the work_mem setting of each worker databaseSELECT run_command_on_workers($cmd$ SHOW work_mem; $cmd$);

-- Get the estimated row count for a distributed table by summing the-- estimated counts of rows for each shard.SELECT sum(result::bigint) AS estimated_count FROM run_command_on_shards( 'my_distributed_table', $cmd$ SELECT reltuples FROM pg_class c JOIN pg_catalog.pg_namespace n on n.oid=c.relnamespace WHERE (n.nspname || '.' || relname)::regclass = '%s'::regclass AND n.nspname NOT IN ('citus', 'pg_toast', 'pg_catalog') $cmd$ );

显示的单播可用接口，包括：到任一worker节点，到任一分片。

事务

不同于fdw中对于远程节点的事务使用可重复度隔离级别，citus中custom scan远程事务基于用户设置的事务级别，比如读提交，这块性能影响不大。

运维需要解决的问题

采用基于citus架构的分布式PG，需要首先解决运维管理上的一致性，包括物理备份和恢复，逻辑备份和恢复（特别是线性扩展，都经过CN节点就失去了意义），DN节点主备自动切换和资源浪费问题，导入导出（简单copy in/out），MX架构还需要考虑各个节点顺序问题。前面3点lightdb已经解决。

查询数据分布情况

select * from pg_catalog.pg_dist_shard; -- 查询所有表的分片名select * from pg_catalog.pg_dist_partition; -- 查询所有select * from citus_tables ct; -- 所有citus管理的表

TPC-H测试

citus对TPC-H的支持不太好，准确的是说复杂关联支持不好。但凡涉及到关联字段不包含分片键、没有co-location的几乎都不支持。如下：

Vuser 1:Query Failed : select o_year, sum(case when nation = 'MOZAMBIQUE' then volume else 0 end) / sum(volume) as mkt_share from ( select extract(year from o_orderdate) as o_year, l_extendedprice * (1 - l_discount) as volume, n2.n_name as nation from part, supplier, lineitem, orders, customer, nation n1, nation n2, region where p_partkey = l_partkey and s_suppkey = l_suppkey and l_orderkey = o_orderkey and o_custkey = c_custkey and c_nationkey = n1.n_nationkey and n1.n_regionkey = r_regionkey and r_name = 'AFRICA' and s_nationkey = n2.n_nationkey and o_orderdate between date '1995-01-01' and date '1996-12-31' and p_type = 'STANDARD POLISHED STEEL') all_nations group by o_year order by o_year : ERROR: complex joins are only supported when all distributed tables are co-located and joined on their distribution columns

Vuser 1:Query Failed : select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbal from ( select substr(c_phone, 1, 2) as cntrycode, c_acctbal from customer where substr(c_phone, 1, 2) in ('23', '32', '17', '18', '16', '20', '25') and c_acctbal > ( select avg(c_acctbal) from customer where c_acctbal > 0.00 and substr(c_phone, 1, 2) in ('23', '32', '17', '18', '16', '20', '25')) and not exists ( select * from orders where o_custkey = c_custkey)) custsale group by cntrycode order by cntrycode : ERROR: direct joins between distributed and local tables are not supported

因为citus是插件化，注定了不可能和原生GP一样默认为分布式MPP而生。开启citus.enable_repartition_joins后，有10个语句默认跑不通。

citus注意点

postgres=# create table t_batch(id int primary key generated always as identity,d1 bigint,d2 bigint,d3 bigint);CREATE TABLEpostgres=# SELECT create_distributed_table('t_batch','id');ERROR: cannot distribute relation: t_batchDETAIL: Distributed relations must not use GENERATED ... AS IDENTITY.

但是支持bigserial

postgres=# create table t_batch(id bigserial primary key,d1 bigint,d2 bigint,d3 bigint);CREATE TABLEpostgres=# SELECT create_distributed_table('t_batch','id'); create_distributed_table -------------------------- (1 row)

序列及序列作为默认值支持

postgres=# alter table bmsql_history postgres-# alter column hist_id set default nextval('bmsql_hist_id_seq');ALTER TABLEpostgres=# alter table bmsql_history add primary key (hist_id); -- 约束必须加名字ERROR: cannot create constraint without a name on a distributed table

alter table bmsql_history add constraint bmsql_history_pkey primary key (hist_id);ERROR: cannot create constraint on "bmsql_history" Detail: Distributed relations cannot have UNIQUE, EXCLUDE, or PRIMARY KEY constraints that do not include the partition column (with an equality operator if EXCLUDE).

postgres=# select pg_size_pretty(citus_relation_size('search_doc_new_ic')); pg_size_pretty ---------------- 10045 MB(1 row)Time: 1.367 mspostgres=# select pg_size_pretty(citus_table_size('search_doc_new_ic')); -- 不应该差这么多 pg_size_pretty ---------------- 216 GB(1 row)Time: 14.957 mspostgres=# select pg_size_pretty(citus_total_relation_size('search_doc_new_ic')); pg_size_pretty ---------------- 243 GB(1 row)

主外键限制

tpch=# SELECT create_distributed_table('orders', 'o_orderkey');NOTICE: Copying data from local table...NOTICE: copying the data has completedDETAIL: The local data in the table is no longer visible, but is still on disk.HINT: To remove the local data, run: SELECT truncate_local_data_after_distributing_table($$public.orders$$)ERROR: cannot create foreign key constraint since relations are not colocated or not referencing a reference tableDETAIL: A distributed table can only have foreign keys if it is referencing another colocated hash distributed table or a reference tabletpch=# \dS+ orders Table "public.orders" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description -----------------+-----------------------------+-----------+----------+---------+----------+--------------+------------- o_orderdate | timestamp without time zone | | | | plain | | o_orderkey | numeric | | not null | | main | | o_custkey | numeric | | not null | | main | | o_orderpriority | character(15) | | | | extended | | o_shippriority | numeric | | | | main | | o_clerk | character(15) | | | | extended | | o_orderstatus | character(1) | | | | extended | | o_totalprice | numeric | | | | main | | o_comment | character varying(79) | | | | extended | | Indexes: "orders_pk" PRIMARY KEY, btree (o_orderkey) "order_customer_fkidx" btree (o_custkey)Foreign-key constraints: "order_customer_fk" FOREIGN KEY (o_custkey) REFERENCES customer(c_custkey)Referenced by: TABLE "lineitem" CONSTRAINT "lineitem_order_fk" FOREIGN KEY (l_orderkey) REFERENCES orders(o_orderkey)Access method: heapNOTICE: removing table public.lineitem from metadata as it is not connected to any reference tables via foreign keystpch=# SELECT create_distributed_table('part', 'p_partkey');NOTICE: Copying data from local table...NOTICE: copying the data has completedDETAIL: The local data in the table is no longer visible, but is still on disk.HINT: To remove the local data, run: SELECT truncate_local_data_after_distributing_table($$public.part$$)ERROR: cannot create foreign key constraint since foreign keys from reference tables and local tables to distributed tables are not supportedDETAIL: Reference tables and local tables can only have foreign keys to reference tables and local tables

参数优化建议

主要的系统表

主要的管理函数

citus_add_nodecitus_remove_nodecitus_get_active_worker_nodesmaster_get_table_metadataget_shard_id_for_distribution_columncitus_relation_sizecitus_table_sizecitus_total_relation_sizecitus_move_shard_placementrebalance_table_shardsThe default strategy is appropriate under these circumstances:The shards are roughly the same sizeThe shards get roughly the same amount of trafficWorker nodes are all the same size/typeShards haven’t been pinned to particular workers　　If any of these assumptions don’t hold, then the default rebalancing can result in a bad plan. In this case you may customize the strategy, using the rebalance_strategy parameter.It’s advisable to call get_rebalance_table_shards_plan before running rebalance_table_shards, to see and verify the actions to be performed.get_rebalance_table_shards_planget_rebalance_progresscitus_add_rebalance_strategycitus_set_default_rebalance_strategycitus_remote_connection_statscitus_drain_node

LightDB 22.x计划增强

高可用主备支持cn可写 22.3已支持哈希算法外置实现direct dn多字段分片，先机构id，再客户id；先客户id，再账户id，依赖于hash算法的二次开发支持优化器提示。lightdb 22.2已支持，后续版本持续增强支持set广播

常见问题

DN节点[lightdb@hs-10-20-30-195 log]$ grep 133029 lightdb-2022-01-09_052610.log 2022-01-09 05:41:43.628964T citus lightdb@findptlocaldis 10.20.137.41(50559) client backend ROLLBACK PREPARED 42704[2022-01-09 05:36:14 CST] 0 [221460] ERROR: prepared transaction with identifier "citus_0_133029_1096882410_5065" does not exist2022-01-09 05:41:43.628964T citus lightdb@findptlocaldis 10.20.137.41(50559) client backend ROLLBACK PREPARED 42704[2022-01-09 05:36:14 CST] 0 [221460] STATEMENT: ROLLBACK PREPARED 'citus_0133029_1096882410_5065'CN节点CN节点上也有分布式事务ID找不到的日志，但是日志级别是告警。分析：两个可能性，1、没问题。因为2PC有一个失败的话，就会去回滚。所以CN是warning，DN是error。2、真的错了，citus 2PC实现有缺陷，citus误判。2pc日志本身保存在pg_dist_transaction中，理论上不会出现pg_dist_transaction有，但是DN没有的情况（验证逻辑是需要分析和测试下有一个参与者失败的时候，发了几次回滚）。citus maintainer负责定期会去轮训清理、如果没收到应答会重发。

DN节点2022-01-15 11:38:26.049486T citus lightdb@findptlocaldis 192.168.1.113(51432) client backend ROLLBACK PREPARED 55000[2022-01-15 11:33:25 CST] 0 [22567] ERROR: prepared transaction with identifier "citus_0_53889_14465378_116756" is busy

DN节点lightdb-2022-01-14_135814.log:2022-01-14 20:39:17.000449T citus lightdb@findptlocaldis 192.168.1.112(45070) client backend ROLLBACK PREPARED 0A000[2022-01-14 20:30:15 CST] 0 [21215] ERROR: prepared transaction belongs to another databaselightdb-2022-01-14_135814.log-2022-01-14 20:39:17.000449T citus lightdb@findptlocaldis 192.168.1.112(45070) client backend ROLLBACK PREPARED 0A000[2022-01-14 20:30:15 CST] 0 [21215] HINT: Connect to the database where the transaction was prepared to finish it.lightdb-2022-01-14_135814.log-2022-01-14 20:39:17.000449T citus lightdb@findptlocaldis 192.168.1.112(45070) client backend ROLLBACK PREPARED 0A000[2022-01-14 20:30:15 CST] 0 [21215] STATEMENT: ROLLBACK PREPARED 'citus_0_14522_30628126_136054'

非分片字段关联报错，如下：

SQL 错误 [XX000]: ERROR: the query contains a join that requires repartitioning Hint: Set canopy.enable_repartition_joins to on to enable repartitioning

解决方法，CN节点设置set canopy.enable_repartition_joins=on;。其执行计划如下：

CN节点报错

任何SQL都报错，如下：

配置shared_preload_libraries= 'citus,auto_explain,pg_stat_statements,pg_prewarm,pg_wait_sampling,pg_stat_kcache,pg_cron,pgaudit,pg_qualstats'。跟auto_explain无关，不配置也发生空指针，methods为NULL。开发环境代码不一致，最新lightdb代码已解决。

CN节点cpu高，压不上去

arm+kylin linux下cn节点系统调用很高，user 10%，system 60%。需优化citus。lightdb 22.4版本解决。

下图为初步优化后：

citus与auto_explain冲突导致crash问题

pg_cron对citus的支持

lightdb版lt_cron支持citus。

pg_hint_plan对citus的支持

lightdb版lt_hint_plan支持citus。

lt_ope保序加密对citus的支持

lightdb保序加密特性支持citus。

citus不支持for update，即使是单分片操作或者加了分片字段

delete/update等DML非分片字段会在CN节点加分片锁，其实加的就是建议锁，所以一致性没问题。但是并发会比较弱。

mx架构下DN节点自增主键报错"nextval(sequence) calls in worker nodes are not supported for column defaults of type int or smallint"

原因：dn节点不支持nextval特性，需要使用人工生成的主键，比如雪花ID。

/* * worker_nextval calculates nextval() in worker nodes * for int and smallint column default types * TODO: not error out but get the proper nextval() */Datumworker_nextval(PG_FUNCTION_ARGS){ ereport(ERROR, (errmsg( "nextval(sequence) calls in worker nodes are not supported" " for column defaults of type int or smallint"))); PG_RETURN_INT32(0);}

参考

麒麟v10 上部署 TiDB v5.1.2 生产环境优化实践

1974 2023-04-03

PostgreSQL通用分布式数据库架构-citus

黄东旭解析 TiDB 的核心优势

麒麟v10 上部署 TiDB v5.1.2 生产环境优化实践

高成本云服务？TiDB 帮你省钱

推荐文章

HTAP 还可以这么玩？丨TiDB 在 IoT 智慧园区的应用

新特性解析丨TiDB 资源管控的设计思路与场景解析

TiDB赋能保险业-首个全栈自主核心保单系统成功投产

首个云原生、分布式、全栈国产化银行核心业务系统投产上线丨TiDB × 杭州银行

TiDB 在社交场景的解决方案实践

电商数据技术栈，在海量数据增长下如何实现实时与全量兼得？

金融行业数据库的选择

TiDB 在智能制造中的应用实践

TiDB 在全球头部物流企业计费管理系统的应用实践

PingCAP与教育部教育管理信息中心合作，推动普惠教育数字化转型

友情链接

热评文章

TiDB 中标杭州银行核心系统数据库项目

TiDB 首批通过信通院 HTAP 数据库基础能力评

PingCAP 与 Wisconsin-Madiso

PingCAP 成为中国唯一入选 Forrester

TiDB 走进东软集团，共建医疗数字化基石

共享开源技术，共建开放生态丨平凯星辰余梦杰出席 20

PostgreSQL通用分布式数据库架构-citus

微信扫一扫：分享

推荐文章

友情链接

热评文章