一 OS环境
[root@localhost ~]# cat /etc/redhat-release
CentOS release 6.5 (Final)
[root@localhost ~]# uname -rm
2.6.32-431.el6.x86_64 x86_64
[root@localhost ~]# free -m
total used free shared buffers cached
Mem: 48216 47094 1122 0 286 44795
-/+ buffers/cache: 2012 46204
Swap: 24175 1712 22463
[root@localhost ~]#
二 数据库环境
ai=> select version();
version
----------------------------------------------------------------------------------------------------------
PostgreSQL 9.6.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-18), 64-bit
(1 row)
ai=> \dt+ t_btree
List of relations
Schema | Name | Type | Owner | Size | Description
--------+---------+-------+-------+--------+-------------
public | t_btree | table | ai | 346 MB |
(1 row)
ai=> select count(*) from t_btree ;
count
----------
10000000
(1 row)
ai=>
三 external merge
ai=> show shared_buffers ;
shared_buffers
----------------
128MB
(1 row)
ai=> show work_mem ;
work_mem
----------
4MB
(1 row)
ai=> explain (analyze,verbose,buffers,costs,timing) select * from t_btree order by id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1580363.83..1605363.83 rows=10000000 width=4) (actual time=8397.774..10610.457 rows=10000000 loops=1)
Output: id
Sort Key: t_btree.id
Sort Method: external merge Disk: 136936kB
Buffers: shared hit=15871 read=28377, temp read=50518 written=50518
-> Seq Scan on public.t_btree (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.028..861.348 rows=10000000 loops=1)
Output: id
Buffers: shared hit=15871 read=28377
Planning time: 2.056 ms
Execution time: 11135.532 ms
(10 rows)
ai=>
这里的排序方法采用的external mege,意味着数据量太大了,在内存里排序排不下(work_mem=4MB对于当前场景的排序,不够用),只好交换到磁盘上排序,磁盘排序完成之后,再把排序结果交换到内存中。效率最低。
shared_buffers相当于Oracle数据库的database buffer cache,单纯加大该参数,即使所有的数据都能从shared buffers命中,对该排序依然无效,如下:
ai=> explain (analyze,verbose,buffers,costs,timing) select id from t_btree order by id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1580363.83..1605363.83 rows=10000000 width=4) (actual time=8032.685..10126.444 rows=10000000 loops=1)
Output: id
Sort Key: t_btree.id
Sort Method: external merge Disk: 136936kB
Buffers: shared hit=44248, temp read=50518 written=50518
-> Seq Scan on public.t_btree (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.021..792.516 rows=10000000 loops=1)
Output: id
Buffers: shared hit=44248
Planning time: 0.062 ms
Execution time: 10655.508 ms
(10 rows)
ai=> select 44248*8;
?column?
----------
353984
(1 row)
ai=> select 44248*8/1024.0;
?column?
----------------------
345.6875000000000000
(1 row)
ai=> show shared_buffers ;
shared_buffers
----------------
4GB
(1 row)
ai=> show work_mem ;
work_mem
----------
4MB
(1 row)
ai=>
shared_buffers=4GB,work_mem=4MB时。执行计划中,Buffers: shared hit=44248,内存命中44248个内存块儿,每个内存页8KB,换算之后,约为346MB,足够缓存该表的全部数据,即所有数据都从内存中读取。但是,排序方法依然是external merge。
四 quicksort排序
ai=> show shared_buffers ;
shared_buffers
----------------
512MB
(1 row)
ai=> show work_mem ;
work_mem
----------
4MB
(1 row)
ai=> set work_mem ='800MB';
SET
ai=> show work_mem ;
work_mem
----------
800MB
(1 row)
ai=> explain (analyze,verbose,buffers,costs,timing) select * from t_btree order by id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1306922.83..1331922.83 rows=10000000 width=4) (actual time=15001.049..17213.924 rows=10000000 loops=1)
Output: id
Sort Key: t_btree.id
Sort Method: quicksort Memory: 741817kB
Buffers: shared hit=15903 read=28345
-> Seq Scan on public.t_btree (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.060..920.913 rows=10000000 loops=1)
Output: id
Buffers: shared hit=15903 read=28345
Planning time: 0.054 ms
Execution time: 17756.761 ms
(10 rows)
ai=> explain (analyze,verbose,buffers,costs,timing) select * from t_btree order by id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1306922.83..1331922.83 rows=10000000 width=4) (actual time=6075.205..7945.550 rows=10000000 loops=1)
Output: id
Sort Key: t_btree.id
Sort Method: quicksort Memory: 741817kB
Buffers: shared hit=16063 read=28185
-> Seq Scan on public.t_btree (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.060..879.729 rows=10000000 loops=1)
Output: id
Buffers: shared hit=16063 read=28185
Planning time: 0.070 ms
Execution time: 8466.822 ms
(10 rows)
ai=> explain (analyze,verbose,buffers,costs,timing) select * from t_btree order by id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1306922.83..1331922.83 rows=10000000 width=4) (actual time=5400.587..7208.614 rows=10000000 loops=1)
Output: id
Sort Key: t_btree.id
Sort Method: quicksort Memory: 741817kB
Buffers: shared hit=16095 read=28153
-> Seq Scan on public.t_btree (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.059..847.957 rows=10000000 loops=1)
Output: id
Buffers: shared hit=16095 read=28153
Planning time: 0.068 ms
Execution time: 7724.861 ms
(10 rows)
ai=>
说明:
1 默认情况下work_mem=4MB,不满足在内存中完成该排序操作,通过向上调整work_mem=800MB ,使其能够在内存中完成该排序,这里的800MB是逐渐上调得到的一个值。为了学习研究用,实际生产环境,调整该参数值时,要注意该值work_mem*max_connections不能超过总物理内存大小;
2 调整完work_mem=800MB后,多次观察该SQL的执行计划,可以发现随着Buffers: shared hit=16095 read=28153的变化,其Execution time逐渐降低。
This will definitely be faster than external merge, since all of the data is brought into memory and then the sorting is done in memory itself.
五 top-N heapsort
ai=> show work_mem ;
work_mem
----------
4MB
(1 row)
ai=> show shared_buffers ;
shared_buffers
----------------
128MB
(1 row)
ai=> explain (analyze,verbose,buffers,costs,timing) select id from t_btree order by id limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=360344.40..360344.43 rows=10 width=4) (actual time=1784.116..1784.117 rows=10 loops=1)
Output: id
Buffers: shared hit=3 read=44248
-> Sort (cost=360344.40..385344.40 rows=10000000 width=4) (actual time=1784.113..1784.113 rows=10 loops=1)
Output: id
Sort Key: t_btree.id
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=3 read=44248
-> Seq Scan on public.t_btree (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.023..856.626 rows=10000000 loops=1)
Output: id
Buffers: shared read=44248
Planning time: 0.396 ms
Execution time: 1784.149 ms
(13 rows)
ai=>
说明:
1 这里SQL加上了limit 10,则执行计划选择了top-N heapsort,同时,内存只用了25kB。
2 PostgreSQL数据库维护了一个heap的内存结构,该heap有一个上限大小。排序过程大致如下:PostgreSQL先根据limit大小,顺序的把数据放入heap中,等到heap被填满之后,再去读取下一个数据,判断其值是否小于heap中已有的最大值?如果大于最大值,则直接丢弃这个数据,返回heap中的结果集,排序完成。如果小于最大值,则把这个值放入heap中的最大值处,并把heap中之前的那个最大值移除heap,继续读取下一个值,重复该过程,直到读取的新值不再大于heap中的最大值为止。
3 由于SQL中有order by id,才会涉及到排序操作,如果没有这个order by从句的话,那么无需涉及排序,直接返回表中读取的前10条记录;
4 这里无需对整个表进行排序,只需获取最小的10条数据,并对其进行排序即可。
六 index 排序
ai=> create index idx_t_btree_id on t_btree(id);
CREATE INDEX
ai=> show work_mem ;
work_mem
----------
4MB
(1 row)
ai=> show shared_buffers ;
shared_buffers
----------------
128MB
(1 row)
ai=> explain (analyze,verbose,buffers,costs,timing) select * from t_btree order by id limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..0.87 rows=10 width=4) (actual time=0.030..0.045 rows=10 loops=1)
Output: id
Buffers: shared hit=13
-> Index Only Scan using idx_t_btree_id on public.t_btree (cost=0.43..436680.67 rows=10000033 width=4) (actual time=0.028..0.039 rows=10 loops=1)
Output: id
Heap Fetches: 10
Buffers: shared hit=13
Planning time: 0.120 ms
Execution time: 0.066 ms
(9 rows)
ai=> explain (analyze,verbose,buffers,costs,timing) select * from t_btree order by id desc limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..0.87 rows=10 width=4) (actual time=0.024..0.038 rows=10 loops=1)
Output: id
Buffers: shared hit=12 read=1
-> Index Only Scan Backward using idx_t_btree_id on public.t_btree (cost=0.43..436680.67 rows=10000033 width=4) (actual time=0.022..0.035 rows=10 loops=1)
Output: id
Heap Fetches: 10
Buffers: shared hit=12 read=1
Planning time: 0.213 ms
Execution time: 0.061 ms
(9 rows)
ai=>
说明:
实际情况下,通过index来访问数据,并不是严格意义的排序,只不过是index本身已经是排好序的数据集,数据库只需直接从index读取数据即可,并不需要执行额外的排序操作。这里,id字段上有一个B-tree index,无论是按照id的升序、降序排序,都可以直接从index读取数据。
七 小结
1 此前,并没有认真的关注过external merge排序操作,原来是因为内存空间不够,work_mem内存区不足以用于排序的操作,故而数据库不得不采用将数据交换到磁盘排序,然后把排序结果交换回内存;
2 quick sort是一种成熟的排序方法,大学的数据结构课程学过,早已还给老师了,找时间回顾一下该知识点;
3 heap sort方法对于自己是一种新的排序,通过小结,也大致清楚了其排序的实现方式。
八 参考
https://madusudanan.com/blog/all-you-need-to-know-about-sorting-in-postgres/