在分布式数据库中怎么计算count

网友投稿 569 2024-02-26

在分布式数据库中怎么计算count

这篇文章主要介绍“在分布式数据库中怎么计算count”，在日常操作中，相信很多人在在分布式数据库中怎么计算count问题上存在疑惑，小编查阅了各式资料，整理出简单好用的操作方法，希望对大家解答”在分布式数据库中怎么计算count”的疑惑有所帮助！接下来，请跟着小编一起来学习吧！

在分布式数据库中怎么计算count

背景

在分布式数据库中，计算count(distinct xxx)，需要对distinct 的字段，

1、去重，

2、重分布去重后的数据，（这一步，如果distinct值特别多，那么就会比较耗时）

3、然后再去重，

4、最后count (xxx)，

5、求所有节点的count SUM。

例如，以下是Greenplum的执行计划例子

postgres=# explain analyze select count(distinct c_acctbal) from customer; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------Aggregate (cost=182242.41..182242.42 rows=1 width=8) Rows out: 1 rowswith 0.006 ms to first row, 69 ms to end, start offset by 23ms. -> Gather Motion16:1(slice2; segments: 16) (cost=53392.85..173982.82 rows=660767 width=8) Rows out: 818834 rows at destinationwith 3.416 ms to first row, 447 ms to end, start offset by 23 ms. -> HashAggregate (cost=53392.85..61652.43 rows=41298 width=8) Group By: customer.c_acctbalRows out: Avg 51177.1 rows x 16 workers. Max 51362 rows (seg3) with 0.004 ms to first row, 33 ms to end, start offset by 25 ms. -> Redistribute Motion 16:16(slice1; segments: 16) (cost=30266.00..43481.34 rows=41298 width=8) Hash Key: customer.c_acctbal Rows out: Avg 89865.6 rows x 16 workers at destination. Max 90305 rows (seg3)with 18 ms to first row, 120 ms to end, start offset by 25ms. -> HashAggregate (cost=30266.00..30266.00 rows=41298 width=8) Group By: customer.c_acctbal Rows out: Avg 89865.6 rows x 16 workers. Max 89929 rows (seg2) with 0.007 ms to first row, 33 ms to end, start offset by 26ms. -> Append-only Columnar Scan on customer (cost=0.00..22766.00 rows=93750 width=8) Rows out: Avg 93750.0 rows x 16 workers. Max 93751 rows (seg4) with 20 ms to first row, 30 ms to end, start offset by 26ms. Slicestatistics: (slice0) Executor memory: 387K bytes. (slice1) Executormemory: 6527K bytes avg x 16 workers, 6527K bytes max (seg0). (slice2) Executor memory:371K bytes avg x 16 workers, 371K bytes max (seg0). Statement statistics: Memory used: 1280000K bytes Optimizer status: legacy query optimizer Total runtime: 723.143 ms (23 rows)

以下是citus的例子

postgres=# explain analyzeselectcount(distinct bid) from pgbench_accounts ; QUERY PLAN------------------------------------------------------------------------------------------------------------------------------------------------------------------ Aggregate (cost=0.00..0.00 rows=0 width=0) (actual time=31.748..31.749 rows=1loops=1) -> Custom Scan (Citus Real-Time) (cost=0.00..0.00 rows=0 width=0) (actual time=31.382..31.510 rows=1280 loops=1) Task Count: 128 Tasks Shown: One of 128-> Task Node: host=172.24.211.224 port=1921dbname=postgres -> HashAggregate (cost=231.85..231.95 rows=10 width=4) (actual time=3.700..3.702 rows=10 loops=1) Group Key: bid -> Seq Scan on pgbench_accounts_106812 pgbench_accounts (cost=0.00..212.48 rows=7748 width=4) (actual time=0.017..2.180 rows=7748 loops=1) Planningtime: 0.445 ms Execution time: 3.781ms Planningtime: 1.399 ms Execution time: 32.159 ms (13 rows)

对于可估值计算的场景，即不需要精确distinct值的场景，***提供了一个名为hll的插件，可以用来估算distinct元素个数。

citus 结合hll，可以实现超高速的count(distinct xxx)，即使distinct值非常非常多，也不慢。

SET citus.count_distinct_error_rate to 0.005; 0.005表示失真度

hll加速citus count(distinct xxx)使用举例

部署

1、所有节点（coordinator 与 worker节点），安装hll软件

yum install -y gcc-c++ cd ~/ git clone https://github.com/citusdata/postgresql-hll cd postgresql-hll . /var/lib/pgsql/.bash_profile USE_PGXS=1 makeUSE_PGXS=1 make install

2、所有节点（coordinator 与 worker节点），在需要用到HLL的DB中增加插件

su - postgres -c "psql -d postgres -c create extension hll;" su - postgres -c "psql -d newdb -c create extension hll;"使用举例1、创建测试表，128 shardcreate table test (id int primary key, a int, b int, c int); set citus.shard_count =128; selectcreate_distributed_table(test, id);2、写入10亿测试数据，a字段10唯一值，b字段100唯一值，c字段100万唯一值insert into test select id, random()*9, random()*99, random()*999999 from generate_series(1,1000000000) t(id);3、（coordinator节点）设置全局或当前会话级参数，指定失真度，越小失真度越小SET citus.count_distinct_error_rate to0.005; newdb=# explain select count(distinct bid) from pgbench_accounts group by bid;QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=0.00..0.00 rows=0 width=0) Group Key: remote_scan.worker_column_2 -> Custom Scan (Citus Real-Time) (cost=0.00..0.00 rows=0 width=0) Task Count: 128 Tasks Shown: One of 128 -> Task Node: host=172.24.211.224 port=8001dbname=newdb -> GroupAggregate (cost=97272.79..105102.29 rows=1000 width=36) Group Key: bid -> Sort (cost=97272.79..99227.04 rows=781700 width=4) Sort Key: bid -> Seq Scan on pgbench_accounts_102008 pgbench_accounts (cost=0.00..20759.00 rows=781700 width=4) (12 rows)4、对比是否使用HLL加速(少量唯一值，HLL没有性能提升，因为本身就不存在瓶颈)4.1、未使用hllnewdb=# set citus.count_distinct_error_rate to 0; newdb=# select count(distinct bid) from pgbench_accounts; count ------- 1000 (1 row)Time: 423.364 ms postgres=# set citus.count_distinct_error_rate to 0; postgres=# select count(distinct a) from test; count ------- 10 (1 row) Time: 2392.709 ms (00:02.393)4.2、使用hllnewdb=# set citus.count_distinct_error_rate to 0.005; newdb=# select count(distinct bid) from pgbench_accounts;count ------- 1000 (1 row)Time: 444.287 ms postgres=# set citus.count_distinct_error_rate to 0.005; postgres=# select count(distinct a) from test; count ------- 10 (1 row) Time: 2375.473 ms (00:02.375)5、对比是否使用HLL加速(大量唯一值，HLL性能提升显著)5.1、未使用hllpostgres=# set citus.count_distinct_error_rate to 0; count ---------- 10000000 (1 row) Time: 5826241.205 ms (01:37:06.241)

128个节点，每个节点最多发送10亿/128条数据给coordinator，慢是可以理解的。另一方面，coordinator可以边接收边去重(postgresql 11增加了parallel gather, merge sort等能力，citus coordinator可以借鉴)，没必要等所有数据都收完再去重。

5.2、使用hllpostgres=# set citus.count_distinct_error_rate to 0.005; postgres=# select count(distinct (a,c)) from test; count --------- 9999995 (1 row) Time: 4468.749 ms (00:04.469)6、设置不同的精度参数，性能对比newdb=# set citus.count_distinct_error_rate to 0.1; newdb=# select count(distinct (aid,bid)) from pgbench_accounts ;count ---------- 94778491 (1 row)Time: 545.301 ms newdb=# set citus.count_distinct_error_rate to 0.01; newdb=# select count(distinct (aid,bid)) from pgbench_accounts ;count ----------- 100293937 (1 row)Time: 554.333 ms -- 推荐设置0.005 newdb=# set citus.count_distinct_error_rate to 0.005; newdb=# select count(distinct (aid,bid)) from pgbench_accounts ; count ----------- 100136086 (1 row)Time: 1053.070 ms (00:01.053) newdb=# set citus.count_distinct_error_rate to 0.001; newdb=# select count(distinct (aid,bid)) from pgbench_accounts ; count ----------- 100422107 (1 row) Time: 9287.934 ms (00:09.288)

到此，关于“在分布式数据库中怎么计算count”的学习就结束了，希望能够解决大家的疑惑。理论与实践的搭配能更好的帮助大家学习，快去试试吧！若想继续学习更多相关知识，请继续关注网站，小编会继续努力为大家带来更多实用的文章！

标签：数据库分布式数据库

在分布式数据库中怎么计算count

在分布式数据库中怎么计算count

背景

hll加速citus count(distinct xxx)使用举例

麒麟v10 上部署 TiDB v5.1.2 生产环境优化实践

高成本云服务？TiDB 帮你省钱

零售业数据库选型与迁移ToC系统实践大规模场景应用

推荐文章

HTAP 还可以这么玩？丨TiDB 在 IoT 智慧园区的应用

新特性解析丨TiDB 资源管控的设计思路与场景解析

TiDB赋能保险业-首个全栈自主核心保单系统成功投产

首个云原生、分布式、全栈国产化银行核心业务系统投产上线丨TiDB × 杭州银行

TiDB 在社交场景的解决方案实践

电商数据技术栈，在海量数据增长下如何实现实时与全量兼得？

金融行业数据库的选择

TiDB 在智能制造中的应用实践

TiDB 在全球头部物流企业计费管理系统的应用实践

PingCAP与教育部教育管理信息中心合作，推动普惠教育数字化转型

友情链接

热评文章

TiDB 中标杭州银行核心系统数据库项目

TiDB 首批通过信通院 HTAP 数据库基础能力评

PingCAP 与 Wisconsin-Madiso

PingCAP 成为中国唯一入选 Forrester

TiDB 走进东软集团，共建医疗数字化基石

共享开源技术，共建开放生态丨平凯星辰余梦杰出席 20

在分布式数据库中怎么计算count

在分布式数据库中怎么计算count

背景

hll加速citus count(distinct xxx)使用举例

微信扫一扫：分享

推荐文章

友情链接

热评文章