如何在PostgreSQL中使用partial index来优化SQL?

一 案例背景

从生产控制台上看到一条下述SQL:

格式化之后SQL语句为:

select '01' status, count(1) from t_ai_prd_item where status = '01' and deleted = false	
union all 
select '02' status, count(1) from t_ai_prd_item where status = '02' and deleted = false

一个union all的联合查询,每次执行耗时1秒。有没有优化余地呢?

 二 优化分析

下述通过copy原表t_ai_prd_item为t1,来进行分析优化。

1 看SQL执行计划及数据量分布:

ai=> explain analyze select '01' status, count(1) from t1 where status = '01' and deleted = false 
ai-> union all 
ai-> select '02' status, count(1) from t1 where status = '02' and deleted = false;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Append  (cost=160483.13..320974.20 rows=2 width=40) (actual time=5211.374..6250.940 rows=2 loops=1)
   ->  Aggregate  (cost=160483.13..160483.14 rows=1 width=0) (actual time=5211.374..5211.374 rows=1 loops=1)
         ->  Seq Scan on t1  (cost=0.00..160483.12 rows=1 width=0) (actual time=5211.369..5211.369 rows=0 loops=1)
               Filter: ((NOT deleted) AND ((status)::text = '01'::text))
               Rows Removed by Filter: 1395189
   ->  Aggregate  (cost=160491.03..160491.04 rows=1 width=0) (actual time=1039.563..1039.563 rows=1 loops=1)
         ->  Seq Scan on t1 t1_1  (cost=0.00..160483.12 rows=3163 width=0) (actual time=901.577..1039.307 rows=2835 loops=1)
               Filter: ((NOT deleted) AND ((status)::text = '02'::text))
               Rows Removed by Filter: 1392354
 Planning time: 0.417 ms
 Execution time: 6251.024 ms
(11 rows)

ai=> select count(*) from t_ai_prd_item;                       
  count  
---------
 1395189
(1 row)

ai=> select count(*),status from t_ai_prd_item group by status;
  count  | status 
---------+--------
     364 | 04
      25 | 05
    2835 | 02
 1391965 | 06
(4 rows)

ai=>

分析发现,表1中有1395189条记录,status=’01’的记录为0,status=’02’的记录为2835条。

2 那么在status字段上建立一个btree index,效果会怎么样呢?

ai=> create index idx_status on t1(status);
CREATE INDEX
ai=> explain analyze select '01' status, count(1) from t1 where status = '01' and deleted = false 
ai-> union all 
ai-> select '02' status, count(1) from t1 where status = '02' and deleted = false;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=8.45..730.24 rows=2 width=40) (actual time=0.037..3.999 rows=2 loops=1)
   ->  Aggregate  (cost=8.45..8.46 rows=1 width=0) (actual time=0.036..0.036 rows=1 loops=1)
         ->  Index Scan using idx_status on t1  (cost=0.43..8.45 rows=1 width=0) (actual time=0.033..0.033 rows=0 loops=1)
               Index Cond: ((status)::text = '01'::text)
               Filter: (NOT deleted)
   ->  Aggregate  (cost=721.76..721.77 rows=1 width=0) (actual time=3.962..3.962 rows=1 loops=1)
         ->  Index Scan using idx_status on t1 t1_1  (cost=0.43..714.10 rows=3063 width=0) (actual time=0.029..3.673 rows=2835 loops=1)
               Index Cond: ((status)::text = '02'::text)
               Filter: (NOT deleted)
 Planning time: 0.291 ms
 Execution time: 4.067 ms
(11 rows)

ai=>

效果不错,但是结合到实际业务场景,此处只是针对status=’01’和status=’02’的情况来做统计。那么有没有更好的方法来解决这个场景呢?

3 创建1个partial index来测试

ai=> create index idx_partial on t1(status) where (status='01' or status='02');
CREATE INDEX
ai=> explain analyze select '01' status, count(1) from t1 where status = '01' and deleted = false 
ai-> union all 
ai-> select '02' status, count(1) from t1 where status = '02' and deleted = false;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=8.30..732.19 rows=2 width=40) (actual time=0.019..3.916 rows=2 loops=1)
   ->  Aggregate  (cost=8.30..8.31 rows=1 width=0) (actual time=0.018..0.018 rows=1 loops=1)
         ->  Index Scan using idx_partial on t1  (cost=0.28..8.30 rows=1 width=0) (actual time=0.015..0.015 rows=0 loops=1)
               Index Cond: ((status)::text = '01'::text)
               Filter: (NOT deleted)
   ->  Aggregate  (cost=723.85..723.86 rows=1 width=0) (actual time=3.897..3.897 rows=1 loops=1)
         ->  Index Scan using idx_partial on t1 t1_1  (cost=0.28..716.20 rows=3063 width=0) (actual time=0.030..3.599 rows=2835 loops=1)
               Index Cond: ((status)::text = '02'::text)
               Filter: (NOT deleted)
 Planning time: 0.234 ms
 Execution time: 3.992 ms
(11 rows)

ai=>

对比,发现SQL的执行效率几乎没有差别。但是:

ai=> \di+ idx_status 
                         List of relations
 Schema |    Name    | Type  | Owner | Table | Size  | Description 
--------+------------+-------+-------+-------+-------+-------------
 ai     | idx_status | index | ai    | t1    | 30 MB | 
(1 row)

ai=> \di+ idx_partial 
                         List of relations
 Schema |    Name     | Type  | Owner | Table | Size  | Description 
--------+-------------+-------+-------+-------+-------+-------------
 ai     | idx_partial | index | ai    | t1    | 80 kB | 
(1 row)

ai=>

4 小结:

在确保SQL执行效率的同时,这个partial index所占的存储空间是b-tree index的1/384,大大降低了存储空间的开销。

三 关于partial index

1 什么是partial index?

分区索引是基于表中部分数据建立的,被索引的部分数据是原数据的一个子集。这个子集数据是基于某些条件可以明确计算出来的。

2 partial index适用场景?

对于表中数据分布“不均匀”的场景,比如某些数据占据绝大多数,而其它一些数据只占少量,且要满足查询条件针是对那些小数据量的查询情形。此例正好满足这种情况,总数据量为140万左右,而状态为01和02的数据只占极少的比例,且查询是针对状态为01和02的查询。

通过partial index可以排除大量普通数据或业务上“不感兴趣”的数据。如:一个在线订单系统,可以针对那些不在经常访问的客户端IP范围之外的IP进行创建分区索引,或者针对已下单还未支付的订单进行分区索引的创建。这样,当查询那些不在常用IP范围内的订单,或者那些已下单未支付的订单信息时,可以保证查询效率。

3 partial index的优势?

由于其是针对部分数据进行索引,显然同传统b-tree索引相比,其所占磁盘空间将大大缩小。当重建维护这些分区索引时,速度也非常快。

4 参考

partial index官方文档地址:
https://www.postgresql.org/docs/9.3/static/indexes-partial.html

PostgreSQL中hash索引的小结

一 关于PostgreSQL中使用hash index的小结如下:

  1. 只适合于等值查询;
  2. 不受WAL-logged保护(pre-PG 10),如果数据库崩溃,则需要重建该类型索引;
  3. 索引上的改变不会经由streaming或file-based复制到备库;
  4. 不支持在多列上创建联合hash index;
  5. 通常,不建议使用hash index。

二 PostgreSQL中使用hash index导致的一则错误案例

 
### Error updating database.  Cause: org.postgresql.util.PSQLException: ERROR: could not read block 0 in file "base/82560/1801225": read only 0 of 8192 bytes
### The error may involve com.onlyou.platform.form.entity.FormDetailEntityMapper.updateByPrimaryKeySelective-Inline
### The error occurred while setting parameters
### SQL: UPDATE t_onlyou_form_detail SET form_id=?,label=?,base_type=?,is_show=?,sort=?,create_ts=?,create_user_id=?,update_ts=?,update_user_id=? WHERE (id=?)
### Cause: org.postgresql.util.PSQLException: ERROR: could not read block 0 in file "base/82560/1801225": read only 0 of 8192 bytes
; uncategorized SQLException for SQL []; SQL state [XX001]; error code [0]; ERROR: could not read block 0 in file "base/82560/1801225": read only 0 of 8192 bytes; nested exception is org.postgresql.util.PSQLException: ERROR: could not read block 0 in file "base/82560/1801225": read only 0 of 8192 bytes

原因是数据库配置了主备结构,并且期间执行过切换。在新的主库(原备库)上,发现了上述错误,经排查该表上有1个hash index,解决办法就是重建该hash index,或者改为其它类型的index。

三 小结

这是之前,一个项目在阿里云生产环境RDS上遇到的错误。

ERROR: could not read block 0 in file "base/82560/1801225": read only 0 of 8192 bytes;很有可能把我们指向错误的方向,误以为数据库存储是不是出问题,或者数据块上出现了错误?