搞清楚 MySQL 派生表、物化表、临时表

2022-02-23 · 预计阅读 3 分钟 · MySQL

以前只知道 MySQL 里有临时表，后来随着花在 MySQL 上的时间越来越多，奇怪的知识增加了不少，比如：物化、半连接、首次匹配……

刚开始看到一些新名词的时候，充满了疑惑，也傻傻的分不清楚，经过一段时间的折腾，对这些概念多了一些了解。

今天先来说说 3 种表：派生表、物化表、临时表，刚开始看到派生表、物化表的时候，虽然官方文档和一些书籍上都有介绍，但并不十分清楚它们都是干嘛的，会用在什么地方？

派生表、物化表，看起来高大上，实际上它们两个也是临时表，只是官方给某些场景下使用的临时表取了个名字而已。

1. 派生表

派生表，是用于存储子查询产生的结果的临时表，这个子查询特指 FROM 子句 里的子查询，如果是出现在其它地方的子查询，就不叫这个名字了，所以本质上来说，派生表也是临时表。

1explain select * from t1 inner join (
2    select distinct i1 from t3 where id in (3, 666, 990)
3) as a on t1.i1 = a.i1

1+----+-------------+------------+------------+-------+---------------+---------+---------+--------+------+----------+----------------------------------------------------+
2| id | select_type | table      | partitions | type  | possible_keys | key     | key_len | ref    | rows | filtered | Extra                                              |
3+----+-------------+------------+------------+-------+---------------+---------+---------+--------+------+----------+----------------------------------------------------+
4| 1  | PRIMARY     | <derived2> | <null>     | ALL   | <null>        | <null>  | <null>  | <null> | 3    | 100.0    | <null>                                             |
5| 1  | PRIMARY     | t1         | <null>     | ALL   | <null>        | <null>  | <null>  | <null> | 8    |  12.5    | Using where; Using join buffer (Block Nested Loop) |
6| 2  | DERIVED     | t3         | <null>     | range | PRIMARY       | PRIMARY | 4       | <null> | 3    | 100.0    | Using where; Using temporary                       |
7+----+-------------+------------+------------+-------+---------------+---------+---------+--------+------+----------+----------------------------------------------------+

通过上面的 explain 结果可以看到，select 中的子查询，会产生一个派生表，存储子查询的查询结果，然后用 t3 表和派生表（derived2）进行连接操作

2. 物化表

物化表，也是用于存储子查询产生的结果的临时表，这个子查询特指 WHERE 子句中查询条件里的子查询。

物化表有两种使用场景：

对子查询进行半连接优化时，使用物化策略
IN 子查询转换为 SUBQUERY、UNCACHEABLE SUBQUERY 的 exists 相关子查询时，把子查询的结果物化，避免对于主查询符合条件的每一条记录，子查询都要执行一次从原表里读取数据

1explain select * from t1 where t1.i1 in (
2    select i1 from t3 where id in (3, 666, 990, 887, 76) and i2 > 16384
3)

1+----+--------------+-------------+------------+-------+---------------+---------+---------+--------+--------+----------+----------------------------------------------------+
2| id | select_type  | table       | partitions | type  | possible_keys | key     | key_len | ref    | rows   | filtered | Extra                                              |
3+----+--------------+-------------+------------+-------+---------------+---------+---------+--------+--------+----------+----------------------------------------------------+
4| 1  | SIMPLE       | <subquery2> | <null>     | ALL   | <null>        | <null>  | <null>  | <null> | <null> | 100.0    | <null>                                             |
5| 1  | SIMPLE       | t1          | <null>     | ALL   | <null>        | <null>  | <null>  | <null> | 8      |  12.5    | Using where; Using join buffer (Block Nested Loop) |
6| 2  | MATERIALIZED | t3          | <null>     | range | PRIMARY       | PRIMARY | 4       | <null> | 5      |  33.33   | Using where                                        |
7+----+--------------+-------------+------------+-------+---------------+---------+---------+--------+--------+----------+----------------------------------------------------+

通过上面的 explain 结果可以看到，IN 子查询的结果物化之后（select_type = MATERIALIZED）和 t1 表进行连接操作

子查询有 5 种优化策略：子查询表上拉（table pullout）、重复值消除（duplicate weedout）、首次匹配（first match）、松散扫描（loose scan）、物化连接（materialization），我在执行上面的 select 语句的 explain 时，通过计算成本选择的是重复值消除策略，为了演示，我用 set optimizer_switch='duplicateweedout=off' 禁用了重复值策略

1explain 
2select * from t1 where t1.i1 in (
3    select i1 from t3 where id in (3, 666, 990, 887, 76) and i2 > rand() * 100
4) and t1.str1 > 'abc';

1+----+----------------------+-------+------------+-------+---------------+---------+---------+--------+------+----------+-------------+
2| id | select_type          | table | partitions | type  | possible_keys | key     | key_len | ref    | rows | filtered | Extra       |
3+----+----------------------+-------+------------+-------+---------------+---------+---------+--------+------+----------+-------------+
4| 1  | PRIMARY              | t1    | <null>     | ALL   | <null>        | <null>  | <null>  | <null> | 8    | 33.33    | Using where |
5| 2  | UNCACHEABLE SUBQUERY | t3    | <null>     | range | PRIMARY       | PRIMARY | 4       | <null> | 5    | 33.33    | Using where |
6+----+----------------------+-------+------------+-------+---------------+---------+---------+--------+------+----------+-------------+

通过上面的 explain 结果，实际上没有体现出来使用了物化表，这需要用上另一个杀手锏了（optimizer_trace）

先执行以下命令，开启 optimizer_trace

1set optimizer_trace="enabled=on";
2set optimizer_trace_max_mem_size=1048576;

然后执行 SQL 语句

1-- 注意：
2-- 查询 information_schema.optimizer_trace 表的 SQL 要和前面的 SQL 一起执行
3-- 不然查不到执行过程的信息
4select * from t1 where t1.i1 in (
5    select i1 from t3 where id in (3, 666, 990, 887, 76) and i2 > rand() * 100
6) and t1.str1 > 'abc';
7select * from information_schema.optimizer_trace;

 1{
 2    "transformation": {
 3        "select#": 2,
 4        "from": "IN (SELECT)",
 5        "to": "materialization",
 6        "chosen": true,
 7        "unknown_key_1": {
 8            "creating_tmp_table": {
 9                "tmp_table_info": {
10                    "row_length": 5,
11                    "key_length": 4,
12                    "unique_constraint": false,
13                    "location": "memory (heap)",
14                    "row_limit_estimate": 3355443
15                }
16            }
17        }
18    }
19}

上面的 JSON 是摘取了 trace 结果中的一段，是把 IN 子查询的结果集物化到临时表了，并且使用的是 Memory 引擎的临时表

3. 临时表

临时表，除了派生表、物化表之外，其它会用到临时表的地方，都是为了用空间换时间的，主要有以下使用场景：

group by 不能使用索引时
distinct 不能使用索引时
unioin
...

欢迎扫码关注公众号，我们一起学习更多 MySQL 知识：