一 案例背景
从生产控制台上看到一条下述SQL:
格式化之后SQL语句为:
select '01' status, count(1) from t_ai_prd_item where status = '01' and deleted = false union all select '02' status, count(1) from t_ai_prd_item where status = '02' and deleted = false
一个union all的联合查询,每次执行耗时1秒。有没有优化余地呢?
二 优化分析
下述通过copy原表t_ai_prd_item为t1,来进行分析优化。
1 看SQL执行计划及数据量分布:
ai=> explain analyze select '01' status, count(1) from t1 where status = '01' and deleted = false ai-> union all ai-> select '02' status, count(1) from t1 where status = '02' and deleted = false; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Append (cost=160483.13..320974.20 rows=2 width=40) (actual time=5211.374..6250.940 rows=2 loops=1) -> Aggregate (cost=160483.13..160483.14 rows=1 width=0) (actual time=5211.374..5211.374 rows=1 loops=1) -> Seq Scan on t1 (cost=0.00..160483.12 rows=1 width=0) (actual time=5211.369..5211.369 rows=0 loops=1) Filter: ((NOT deleted) AND ((status)::text = '01'::text)) Rows Removed by Filter: 1395189 -> Aggregate (cost=160491.03..160491.04 rows=1 width=0) (actual time=1039.563..1039.563 rows=1 loops=1) -> Seq Scan on t1 t1_1 (cost=0.00..160483.12 rows=3163 width=0) (actual time=901.577..1039.307 rows=2835 loops=1) Filter: ((NOT deleted) AND ((status)::text = '02'::text)) Rows Removed by Filter: 1392354 Planning time: 0.417 ms Execution time: 6251.024 ms (11 rows) ai=> select count(*) from t_ai_prd_item; count --------- 1395189 (1 row) ai=> select count(*),status from t_ai_prd_item group by status; count | status ---------+-------- 364 | 04 25 | 05 2835 | 02 1391965 | 06 (4 rows) ai=>
分析发现,表1中有1395189条记录,status=’01’的记录为0,status=’02’的记录为2835条。
2 那么在status字段上建立一个btree index,效果会怎么样呢?
ai=> create index idx_status on t1(status); CREATE INDEX ai=> explain analyze select '01' status, count(1) from t1 where status = '01' and deleted = false ai-> union all ai-> select '02' status, count(1) from t1 where status = '02' and deleted = false; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------- Append (cost=8.45..730.24 rows=2 width=40) (actual time=0.037..3.999 rows=2 loops=1) -> Aggregate (cost=8.45..8.46 rows=1 width=0) (actual time=0.036..0.036 rows=1 loops=1) -> Index Scan using idx_status on t1 (cost=0.43..8.45 rows=1 width=0) (actual time=0.033..0.033 rows=0 loops=1) Index Cond: ((status)::text = '01'::text) Filter: (NOT deleted) -> Aggregate (cost=721.76..721.77 rows=1 width=0) (actual time=3.962..3.962 rows=1 loops=1) -> Index Scan using idx_status on t1 t1_1 (cost=0.43..714.10 rows=3063 width=0) (actual time=0.029..3.673 rows=2835 loops=1) Index Cond: ((status)::text = '02'::text) Filter: (NOT deleted) Planning time: 0.291 ms Execution time: 4.067 ms (11 rows) ai=>
效果不错,但是结合到实际业务场景,此处只是针对status=’01’和status=’02’的情况来做统计。那么有没有更好的方法来解决这个场景呢?
3 创建1个partial index来测试
ai=> create index idx_partial on t1(status) where (status='01' or status='02'); CREATE INDEX ai=> explain analyze select '01' status, count(1) from t1 where status = '01' and deleted = false ai-> union all ai-> select '02' status, count(1) from t1 where status = '02' and deleted = false; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------- Append (cost=8.30..732.19 rows=2 width=40) (actual time=0.019..3.916 rows=2 loops=1) -> Aggregate (cost=8.30..8.31 rows=1 width=0) (actual time=0.018..0.018 rows=1 loops=1) -> Index Scan using idx_partial on t1 (cost=0.28..8.30 rows=1 width=0) (actual time=0.015..0.015 rows=0 loops=1) Index Cond: ((status)::text = '01'::text) Filter: (NOT deleted) -> Aggregate (cost=723.85..723.86 rows=1 width=0) (actual time=3.897..3.897 rows=1 loops=1) -> Index Scan using idx_partial on t1 t1_1 (cost=0.28..716.20 rows=3063 width=0) (actual time=0.030..3.599 rows=2835 loops=1) Index Cond: ((status)::text = '02'::text) Filter: (NOT deleted) Planning time: 0.234 ms Execution time: 3.992 ms (11 rows) ai=>
对比,发现SQL的执行效率几乎没有差别。但是:
ai=> \di+ idx_status List of relations Schema | Name | Type | Owner | Table | Size | Description --------+------------+-------+-------+-------+-------+------------- ai | idx_status | index | ai | t1 | 30 MB | (1 row) ai=> \di+ idx_partial List of relations Schema | Name | Type | Owner | Table | Size | Description --------+-------------+-------+-------+-------+-------+------------- ai | idx_partial | index | ai | t1 | 80 kB | (1 row) ai=>
4 小结:
在确保SQL执行效率的同时,这个partial index所占的存储空间是b-tree index的1/384,大大降低了存储空间的开销。
三 关于partial index
1 什么是partial index?
分区索引是基于表中部分数据建立的,被索引的部分数据是原数据的一个子集。这个子集数据是基于某些条件可以明确计算出来的。
2 partial index适用场景?
对于表中数据分布“不均匀”的场景,比如某些数据占据绝大多数,而其它一些数据只占少量,且要满足查询条件针是对那些小数据量的查询情形。此例正好满足这种情况,总数据量为140万左右,而状态为01和02的数据只占极少的比例,且查询是针对状态为01和02的查询。
通过partial index可以排除大量普通数据或业务上“不感兴趣”的数据。如:一个在线订单系统,可以针对那些不在经常访问的客户端IP范围之外的IP进行创建分区索引,或者针对已下单还未支付的订单进行分区索引的创建。这样,当查询那些不在常用IP范围内的订单,或者那些已下单未支付的订单信息时,可以保证查询效率。
3 partial index的优势?
由于其是针对部分数据进行索引,显然同传统b-tree索引相比,其所占磁盘空间将大大缩小。当重建维护这些分区索引时,速度也非常快。
4 参考
partial index官方文档地址:
https://www.postgresql.org/docs/9.3/static/indexes-partial.html