TiDB 优化器逻辑优化之 OR 表达式条件消除

网友投稿 629 2023-12-26

第一章背景介绍

好久没发文章了, 发篇曾经研究过的一篇水文.

通常来说, 永假的 OR 谓词条件是可以被消除的, 并且在通用数据库上可以见到对应逻辑.

如果不消除, 会引入额外的筛选机制, 导致大量计算资源被消耗, 引发 SQL 性能降低的现象.

第二章理论概括

以下述代码块为例, 演示具体恒假析取表达式怎么被消除:

create table jan(id int, name int); explain select * from jan where id=123 or 1=2; select version();

*** 实验

可以看到 Filter: (id = 123) 已经直接将 or 1=2 的永假析取条件消除了.

test=# explain select * from jan where id=123 or 1=2; QUERY PLAN ----------------------------------------------------- Seq Scan on jan (cost=0.00..21.25 rows=4 width=62) Filter: (id = 123) (2 rows) test=# select version(); version ------------------------------------------------------------------------------------------------------------------- *** 13.12 on arm-apple-darwin22.4.0, compiled by Apple clang version 14.0.3 (clang-1403.0.22.14.1), 64-bit (1 row)

*** 实验

可以看到 1 = 2 result is FALSE 并在 Output & filters 中没有出现相关筛选, 说明也消除了永假析取条件.

并且 OB 官网有对该种情况进行描述.

第三章现状分析

上面介绍了 OB 和 PG 该功能的实现, 下面演示当前 TiDB 的实现情况.

截止至 v7.1.1 TiDB 在单表查询时, or(eq(test.jan.id, 123), 0) 的恒假析取条件直接下推至 TiKV, 而不是直接在执行计划中消除.

第四章问题分析

其实, 单表查询的使用方法并不一定会造成问题, 但是如果在多表 join 的情况下, 便会加剧该情况. 如 Issue-45785 中提到的, 在 TiDB instance 侧仍需要 CARTESIAN inner join 算子过滤 or(and(eq(edmp.t1.a, edmp.t2.a), eq(edmp.t1.b, 1)), 0). 比较理想的情况, 应该是从 CARTESIAN inner join 变为 inner join.

CREATE TABLE `t1` ( `a` int(11) DEFAULT NULL, `b` int(11) DEFAULT NULL, `c` int(11) DEFAULT NULL, KEY `idx_a` (`a`) ); CREATE TABLE `t2` ( `a` int(11) DEFAULT NULL, `b` int(11) DEFAULT NULL, `c` int(11) DEFAULT NULL, KEY `idx_a` (`a`) ); mysql> explain select /*+ INL_JOIN(t2)*/ * from t1,t2 where t1.a=t2.a and t1.b = 1 or 1=2; +------------------------------+-----------+-----------+---------------+-----------------------------------------------------------------------------------------+ | id | estRows | task | access object | operator info | +------------------------------+-----------+-----------+---------------+-----------------------------------------------------------------------------------------+ | HashJoin_9 | 100000.00 | root | | CARTESIAN inner join, other cond:or(and(eq(edmp.t1.a, edmp.t2.a), eq(edmp.t1.b, 1)), 0) | | ├─TableReader_12(Build) | 10.00 | root | | data:Selection_11 | | │ └─Selection_11 | 10.00 | cop[tikv] | | or(eq(edmp.t1.b, 1), 0) | | │ └─TableFullScan_10 | 10000.00 | cop[tikv] | table:t1 | keep order:false, stats:pseudo | | └─TableReader_14(Probe) | 10000.00 | root | | data:TableFullScan_13 | | └─TableFullScan_13 | 10000.00 | cop[tikv] | table:t2 | keep order:false, stats:pseudo | +------------------------------+-----------+-----------+---------------+-----------------------------------------------------------------------------------------+ 6 rows in set, 1 warning (0.00 sec)

第五章解决方案

从 PG 的实现方式来看, 会有个 canonicalize_qual 函数试图将表达式强制转换为规范的 and-of-or 或 or-of-and形式. 在这个过程中, 会调用到 find_duplicate_ors 函数(如下所示部分), /* Get rid of any constant inputs */ 注释部分会将永假析取条件直接消除.

/* * find_duplicate_ors * Given a qualification tree with the NOTs pushed down, search for * OR clauses to which the inverse OR distributive law might apply. * Only the top-level AND/OR structure is searched. * * While at it, we remove any NULL constants within the top-level AND/OR * structure, eg in a WHERE clause, "x OR NULL::boolean" is reduced to "x". * In general that would change the result, so eval_const_expressions cant * do it; but at top level of WHERE, we dont need to distinguish between * FALSE and NULL results, so its valid to treat NULL::boolean the same * as FALSE and then simplify AND/OR accordingly. Conversely, in a top-level * CHECK constraint, we may treat a NULL the same as TRUE. * * Returns the modified qualification. AND/OR flatness is preserved. */ static Expr * find_duplicate_ors(Expr *qual, bool is_check) { if (is_orclause(qual)) { List *orlist = NIL; ListCell *temp; /* Recurse */ foreach(temp, ((BoolExpr *) qual)->args) { Expr *arg = (Expr *) lfirst(temp); arg = find_duplicate_ors(arg, is_check); /* Get rid of any constant inputs */ if (arg && IsA(arg, Const)) { Const *carg = (Const *) arg; if (is_check) { /* Within OR in CHECK, drop constant FALSE */ if (!carg->constisnull && !DatumGetBool(carg->constvalue)) continue; /* Constant TRUE or NULL, so OR reduces to TRUE */ return (Expr *) makeBoolConst(true, false); } else { /* Within OR in WHERE, drop constant FALSE or NULL */ if (carg->constisnull || !DatumGetBool(carg->constvalue)) continue; /* Constant TRUE, so OR reduces to TRUE */ return arg; } } orlist = lappend(orlist, arg); } /* Flatten any ORs pulled up to just below here */ orlist = pull_ors(orlist); /* Now we can look for duplicate ORs */ return process_duplicate_ors(orlist); } else if (is_andclause(qual)) { List *andlist = NIL; ListCell *temp; ......

因此, TiDB 其实可以仿制类似逻辑, 在 buildSelection 处理 where 条件时, 利用递归的方法实现这个功能.

期待大佬们解决这个简单的问题了, 望产品逐渐完善!

第六章文献参考

PG 源代码

PG 源代码解读-*** 源码解读（33）- 查询语句#18（查询优化-表达式预处理#3

TiDB 源代码

标签：TiDB 数据库

TiDB 优化器逻辑优化之 OR 表达式条件消除

第一章背景介绍

第二章理论概括

第三章现状分析

第四章问题分析

第五章解决方案

第六章文献参考

黄东旭解析 TiDB 的核心优势

麒麟v10 上部署 TiDB v5.1.2 生产环境优化实践

高成本云服务？TiDB 帮你省钱

推荐文章

HTAP 还可以这么玩？丨TiDB 在 IoT 智慧园区的应用

新特性解析丨TiDB 资源管控的设计思路与场景解析

TiDB赋能保险业-首个全栈自主核心保单系统成功投产

首个云原生、分布式、全栈国产化银行核心业务系统投产上线丨TiDB × 杭州银行

TiDB 在社交场景的解决方案实践

电商数据技术栈，在海量数据增长下如何实现实时与全量兼得？

金融行业数据库的选择

TiDB 在智能制造中的应用实践

TiDB 在全球头部物流企业计费管理系统的应用实践

PingCAP与教育部教育管理信息中心合作，推动普惠教育数字化转型

友情链接

热评文章

TiDB 中标杭州银行核心系统数据库项目

TiDB 首批通过信通院 HTAP 数据库基础能力评

PingCAP 与 Wisconsin-Madiso

PingCAP 成为中国唯一入选 Forrester

TiDB 走进东软集团，共建医疗数字化基石

共享开源技术，共建开放生态丨平凯星辰余梦杰出席 20

TiDB 优化器逻辑优化之 OR 表达式条件消除

第一章 背景介绍

第二章 理论概括

第三章 现状分析

第四章 问题分析

第五章 解决方案

第六章 文献参考

微信扫一扫：分享

推荐文章

友情链接

热评文章

第一章背景介绍

第二章理论概括

第三章现状分析

第四章问题分析

第五章解决方案

第六章文献参考