TiDB 优化器逻辑优化之 OR 表达式条件消除

网友投稿 514 2023-12-26

第一章 背景介绍

好久没发文章了, 发篇曾经研究过的一篇水文.

通常来说, 永假的 OR 谓词条件是可以被消除的, 并且在通用数据库上可以见到对应逻辑.

如果不消除, 会引入额外的筛选机制, 导致大量计算资源被消耗, 引发 SQL 性能降低的现象.

第二章 理论概括

以下述代码块为例, 演示具体恒假析取表达式怎么被消除:

create table jan(id int, name int); explain select * from jan where id=123 or 1=2; select version();

*** 实验

可以看到 Filter: (id = 123) 已经直接将 or 1=2 的永假析取条件消除了.

test=# explain select * from jan where id=123 or 1=2; QUERY PLAN ----------------------------------------------------- Seq Scan on jan (cost=0.00..21.25 rows=4 width=62) Filter: (id = 123) (2 rows) test=# select version(); version ------------------------------------------------------------------------------------------------------------------- *** 13.12 on arm-apple-darwin22.4.0, compiled by Apple clang version 14.0.3 (clang-1403.0.22.14.1), 64-bit (1 row)

*** 实验

可以看到 1 = 2 result is FALSE 并在 Output & filters 中没有出现相关筛选, 说明也消除了永假析取条件.

并且 OB 官网 有对该种情况进行描述.

第三章 现状分析

上面介绍了 OB 和 PG 该功能的实现, 下面演示当前 TiDB 的实现情况.

截止至 v7.1.1 TiDB 在单表查询时, or(eq(test.jan.id, 123), 0) 的恒假析取条件直接下推至 TiKV, 而不是直接在执行计划中消除.

mysql> explain select * from jan where id=123 or 1=2; +-------------------------+----------+-----------+---------------+--------------------------------+ | id | estRows | task | access object | operator info | +-------------------------+----------+-----------+---------------+--------------------------------+ | TableReader_7 | 10.00 | root | | data:Selection_6 | | └─Selection_6 | 10.00 | cop[tikv] | | or(eq(test.jan.id, 123), 0) | | └─TableFullScan_5 | 10000.00 | cop[tikv] | table:jan | keep order:false, stats:pseudo | +-------------------------+----------+-----------+---------------+--------------------------------+ 3 rows in set (0.00 sec) mysql> select version(); +--------------------+ | version() | +--------------------+ | 5.7.25-TiDB-v7.1.1 | +--------------------+ 1 row in set (0.00 sec)

第四章 问题分析

其实, 单表查询的使用方法并不一定会造成问题, 但是如果在多表 join 的情况下, 便会加剧该情况. 如 Issue-45785 中提到的, 在 TiDB instance 侧仍需要 CARTESIAN inner join 算子过滤 or(and(eq(edmp.t1.a, edmp.t2.a), eq(edmp.t1.b, 1)), 0). 比较理想的情况, 应该是从 CARTESIAN inner join 变为 inner join.

CREATE TABLE `t1` ( `a` int(11) DEFAULT NULL, `b` int(11) DEFAULT NULL, `c` int(11) DEFAULT NULL, KEY `idx_a` (`a`) ); CREATE TABLE `t2` ( `a` int(11) DEFAULT NULL, `b` int(11) DEFAULT NULL, `c` int(11) DEFAULT NULL, KEY `idx_a` (`a`) ); mysql> explain select /*+ INL_JOIN(t2)*/ * from t1,t2 where t1.a=t2.a and t1.b = 1 or 1=2; +------------------------------+-----------+-----------+---------------+-----------------------------------------------------------------------------------------+ | id | estRows | task | access object | operator info | +------------------------------+-----------+-----------+---------------+-----------------------------------------------------------------------------------------+ | HashJoin_9 | 100000.00 | root | | CARTESIAN inner join, other cond:or(and(eq(edmp.t1.a, edmp.t2.a), eq(edmp.t1.b, 1)), 0) | | ├─TableReader_12(Build) | 10.00 | root | | data:Selection_11 | | │ └─Selection_11 | 10.00 | cop[tikv] | | or(eq(edmp.t1.b, 1), 0) | | │ └─TableFullScan_10 | 10000.00 | cop[tikv] | table:t1 | keep order:false, stats:pseudo | | └─TableReader_14(Probe) | 10000.00 | root | | data:TableFullScan_13 | | └─TableFullScan_13 | 10000.00 | cop[tikv] | table:t2 | keep order:false, stats:pseudo | +------------------------------+-----------+-----------+---------------+-----------------------------------------------------------------------------------------+ 6 rows in set, 1 warning (0.00 sec)

第五章 解决方案

从 PG 的实现方式来看, 会有个 canonicalize_qual 函数试图将表达式强制转换为规范的 and-of-or 或 or-of-and形式. 在这个过程中, 会调用到 find_duplicate_ors 函数(如下所示部分), /* Get rid of any constant inputs */ 注释部分会将永假析取条件直接消除.

/* * find_duplicate_ors * Given a qualification tree with the NOTs pushed down, search for * OR clauses to which the inverse OR distributive law might apply. * Only the top-level AND/OR structure is searched. * * While at it, we remove any NULL constants within the top-level AND/OR * structure, eg in a WHERE clause, "x OR NULL::boolean" is reduced to "x". * In general that would change the result, so eval_const_expressions cant * do it; but at top level of WHERE, we dont need to distinguish between * FALSE and NULL results, so its valid to treat NULL::boolean the same * as FALSE and then simplify AND/OR accordingly. Conversely, in a top-level * CHECK constraint, we may treat a NULL the same as TRUE. * * Returns the modified qualification. AND/OR flatness is preserved. */ static Expr * find_duplicate_ors(Expr *qual, bool is_check) { if (is_orclause(qual)) { List *orlist = NIL; ListCell *temp; /* Recurse */ foreach(temp, ((BoolExpr *) qual)->args) { Expr *arg = (Expr *) lfirst(temp); arg = find_duplicate_ors(arg, is_check); /* Get rid of any constant inputs */ if (arg && IsA(arg, Const)) { Const *carg = (Const *) arg; if (is_check) { /* Within OR in CHECK, drop constant FALSE */ if (!carg->constisnull && !DatumGetBool(carg->constvalue)) continue; /* Constant TRUE or NULL, so OR reduces to TRUE */ return (Expr *) makeBoolConst(true, false); } else { /* Within OR in WHERE, drop constant FALSE or NULL */ if (carg->constisnull || !DatumGetBool(carg->constvalue)) continue; /* Constant TRUE, so OR reduces to TRUE */ return arg; } } orlist = lappend(orlist, arg); } /* Flatten any ORs pulled up to just below here */ orlist = pull_ors(orlist); /* Now we can look for duplicate ORs */ return process_duplicate_ors(orlist); } else if (is_andclause(qual)) { List *andlist = NIL; ListCell *temp; ......

因此, TiDB 其实可以仿制类似逻辑, 在 buildSelection 处理 where 条件时, 利用递归的方法实现这个功能.

期待大佬们解决这个简单的问题了, 望产品逐渐完善!

第六章 文献参考

PG 源代码

PG 源代码解读-*** 源码解读(33)- 查询语句#18(查询优化-表达式预处理#3

TiDB 源代码

版权声明:本文内容由网络用户投稿,版权归原作者所有,本站不拥有其著作权,亦不承担相应法律责任。如果您发现本站中有涉嫌抄袭或描述失实的内容,请联系我们jiasou666@gmail.com 处理,核实后本网站将在24小时内删除侵权内容。

上一篇:TiDB 7.x 源码编译之 TiDB Server 篇,及新特性详解
下一篇:Null-Aware 问题对 TiDB 优化器的影响(OOM)
相关文章