一 OS环境
[root@localhost ~]# cat /etc/redhat-release CentOS release 6.5 (Final) [root@localhost ~]# uname -rm 2.6.32-431.el6.x86_64 x86_64 [root@localhost ~]# free -m total used free shared buffers cached Mem: 48216 47094 1122 0 286 44795 -/+ buffers/cache: 2012 46204 Swap: 24175 1712 22463 [root@localhost ~]#
二 数据库环境
ai=> select version(); version ---------------------------------------------------------------------------------------------------------- PostgreSQL 9.6.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-18), 64-bit (1 row) ai=> \dt+ t_btree List of relations Schema | Name | Type | Owner | Size | Description --------+---------+-------+-------+--------+------------- public | t_btree | table | ai | 346 MB | (1 row) ai=> select count(*) from t_btree ; count ---------- 10000000 (1 row) ai=>
三 external merge
ai=> show shared_buffers ; shared_buffers ---------------- 128MB (1 row) ai=> show work_mem ; work_mem ---------- 4MB (1 row) ai=> explain (analyze,verbose,buffers,costs,timing) select * from t_btree order by id; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------- Sort (cost=1580363.83..1605363.83 rows=10000000 width=4) (actual time=8397.774..10610.457 rows=10000000 loops=1) Output: id Sort Key: t_btree.id Sort Method: external merge Disk: 136936kB Buffers: shared hit=15871 read=28377, temp read=50518 written=50518 -> Seq Scan on public.t_btree (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.028..861.348 rows=10000000 loops=1) Output: id Buffers: shared hit=15871 read=28377 Planning time: 2.056 ms Execution time: 11135.532 ms (10 rows) ai=>
这里的排序方法采用的external mege,意味着数据量太大了,在内存里排序排不下(work_mem=4MB对于当前场景的排序,不够用),只好交换到磁盘上排序,磁盘排序完成之后,再把排序结果交换到内存中。效率最低。
shared_buffers相当于Oracle数据库的database buffer cache,单纯加大该参数,即使所有的数据都能从shared buffers命中,对该排序依然无效,如下:
ai=> explain (analyze,verbose,buffers,costs,timing) select id from t_btree order by id; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------- Sort (cost=1580363.83..1605363.83 rows=10000000 width=4) (actual time=8032.685..10126.444 rows=10000000 loops=1) Output: id Sort Key: t_btree.id Sort Method: external merge Disk: 136936kB Buffers: shared hit=44248, temp read=50518 written=50518 -> Seq Scan on public.t_btree (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.021..792.516 rows=10000000 loops=1) Output: id Buffers: shared hit=44248 Planning time: 0.062 ms Execution time: 10655.508 ms (10 rows) ai=> select 44248*8; ?column? ---------- 353984 (1 row) ai=> select 44248*8/1024.0; ?column? ---------------------- 345.6875000000000000 (1 row) ai=> show shared_buffers ; shared_buffers ---------------- 4GB (1 row) ai=> show work_mem ; work_mem ---------- 4MB (1 row) ai=>
shared_buffers=4GB,work_mem=4MB时。执行计划中,Buffers: shared hit=44248,内存命中44248个内存块儿,每个内存页8KB,换算之后,约为346MB,足够缓存该表的全部数据,即所有数据都从内存中读取。但是,排序方法依然是external merge。
四 quicksort排序
ai=> show shared_buffers ; shared_buffers ---------------- 512MB (1 row) ai=> show work_mem ; work_mem ---------- 4MB (1 row) ai=> set work_mem ='800MB'; SET ai=> show work_mem ; work_mem ---------- 800MB (1 row) ai=> explain (analyze,verbose,buffers,costs,timing) select * from t_btree order by id; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------- Sort (cost=1306922.83..1331922.83 rows=10000000 width=4) (actual time=15001.049..17213.924 rows=10000000 loops=1) Output: id Sort Key: t_btree.id Sort Method: quicksort Memory: 741817kB Buffers: shared hit=15903 read=28345 -> Seq Scan on public.t_btree (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.060..920.913 rows=10000000 loops=1) Output: id Buffers: shared hit=15903 read=28345 Planning time: 0.054 ms Execution time: 17756.761 ms (10 rows) ai=> explain (analyze,verbose,buffers,costs,timing) select * from t_btree order by id; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------- Sort (cost=1306922.83..1331922.83 rows=10000000 width=4) (actual time=6075.205..7945.550 rows=10000000 loops=1) Output: id Sort Key: t_btree.id Sort Method: quicksort Memory: 741817kB Buffers: shared hit=16063 read=28185 -> Seq Scan on public.t_btree (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.060..879.729 rows=10000000 loops=1) Output: id Buffers: shared hit=16063 read=28185 Planning time: 0.070 ms Execution time: 8466.822 ms (10 rows) ai=> explain (analyze,verbose,buffers,costs,timing) select * from t_btree order by id; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------- Sort (cost=1306922.83..1331922.83 rows=10000000 width=4) (actual time=5400.587..7208.614 rows=10000000 loops=1) Output: id Sort Key: t_btree.id Sort Method: quicksort Memory: 741817kB Buffers: shared hit=16095 read=28153 -> Seq Scan on public.t_btree (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.059..847.957 rows=10000000 loops=1) Output: id Buffers: shared hit=16095 read=28153 Planning time: 0.068 ms Execution time: 7724.861 ms (10 rows) ai=>
说明:
1 默认情况下work_mem=4MB,不满足在内存中完成该排序操作,通过向上调整work_mem=800MB ,使其能够在内存中完成该排序,这里的800MB是逐渐上调得到的一个值。为了学习研究用,实际生产环境,调整该参数值时,要注意该值work_mem*max_connections不能超过总物理内存大小;
2 调整完work_mem=800MB后,多次观察该SQL的执行计划,可以发现随着Buffers: shared hit=16095 read=28153的变化,其Execution time逐渐降低。
This will definitely be faster than external merge, since all of the data is brought into memory and then the sorting is done in memory itself.
五 top-N heapsort
ai=> show work_mem ; work_mem ---------- 4MB (1 row) ai=> show shared_buffers ; shared_buffers ---------------- 128MB (1 row) ai=> explain (analyze,verbose,buffers,costs,timing) select id from t_btree order by id limit 10; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=360344.40..360344.43 rows=10 width=4) (actual time=1784.116..1784.117 rows=10 loops=1) Output: id Buffers: shared hit=3 read=44248 -> Sort (cost=360344.40..385344.40 rows=10000000 width=4) (actual time=1784.113..1784.113 rows=10 loops=1) Output: id Sort Key: t_btree.id Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=3 read=44248 -> Seq Scan on public.t_btree (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.023..856.626 rows=10000000 loops=1) Output: id Buffers: shared read=44248 Planning time: 0.396 ms Execution time: 1784.149 ms (13 rows) ai=>
说明:
1 这里SQL加上了limit 10,则执行计划选择了top-N heapsort,同时,内存只用了25kB。
2 PostgreSQL数据库维护了一个heap的内存结构,该heap有一个上限大小。排序过程大致如下:PostgreSQL先根据limit大小,顺序的把数据放入heap中,等到heap被填满之后,再去读取下一个数据,判断其值是否小于heap中已有的最大值?如果大于最大值,则直接丢弃这个数据,返回heap中的结果集,排序完成。如果小于最大值,则把这个值放入heap中的最大值处,并把heap中之前的那个最大值移除heap,继续读取下一个值,重复该过程,直到读取的新值不再大于heap中的最大值为止。
3 由于SQL中有order by id,才会涉及到排序操作,如果没有这个order by从句的话,那么无需涉及排序,直接返回表中读取的前10条记录;
4 这里无需对整个表进行排序,只需获取最小的10条数据,并对其进行排序即可。
六 index 排序
ai=> create index idx_t_btree_id on t_btree(id); CREATE INDEX ai=> show work_mem ; work_mem ---------- 4MB (1 row) ai=> show shared_buffers ; shared_buffers ---------------- 128MB (1 row) ai=> explain (analyze,verbose,buffers,costs,timing) select * from t_btree order by id limit 10; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.43..0.87 rows=10 width=4) (actual time=0.030..0.045 rows=10 loops=1) Output: id Buffers: shared hit=13 -> Index Only Scan using idx_t_btree_id on public.t_btree (cost=0.43..436680.67 rows=10000033 width=4) (actual time=0.028..0.039 rows=10 loops=1) Output: id Heap Fetches: 10 Buffers: shared hit=13 Planning time: 0.120 ms Execution time: 0.066 ms (9 rows) ai=> explain (analyze,verbose,buffers,costs,timing) select * from t_btree order by id desc limit 10; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.43..0.87 rows=10 width=4) (actual time=0.024..0.038 rows=10 loops=1) Output: id Buffers: shared hit=12 read=1 -> Index Only Scan Backward using idx_t_btree_id on public.t_btree (cost=0.43..436680.67 rows=10000033 width=4) (actual time=0.022..0.035 rows=10 loops=1) Output: id Heap Fetches: 10 Buffers: shared hit=12 read=1 Planning time: 0.213 ms Execution time: 0.061 ms (9 rows) ai=>
说明:
实际情况下,通过index来访问数据,并不是严格意义的排序,只不过是index本身已经是排好序的数据集,数据库只需直接从index读取数据即可,并不需要执行额外的排序操作。这里,id字段上有一个B-tree index,无论是按照id的升序、降序排序,都可以直接从index读取数据。
七 小结
1 此前,并没有认真的关注过external merge排序操作,原来是因为内存空间不够,work_mem内存区不足以用于排序的操作,故而数据库不得不采用将数据交换到磁盘排序,然后把排序结果交换回内存;
2 quick sort是一种成熟的排序方法,大学的数据结构课程学过,早已还给老师了,找时间回顾一下该知识点;
3 heap sort方法对于自己是一种新的排序,通过小结,也大致清楚了其排序的实现方式。
八 参考
https://madusudanan.com/blog/all-you-need-to-know-about-sorting-in-postgres/