PostgreSQL 哈希链接和哈希聚合-CFANZ编程社区

PostgreSQL 哈希链接和哈希聚合_Memory

在PostgreSQL中，表和表之间进行关联关系的情况下，在等值链接中，两个表如果一个是大表一个是小表，PostgreSQL 更倾向与使用 hash join 的方式来解决问题。主要的原因在于通过hash join 会利用内存来进行等值链接的对比针对这种链接的方式，效率更高，

SELECT customer.first_name, customer.last_name, SUM(rental.return_date - rental.rental_date) AS total_rental_duration
FROM customer
JOIN rental ON customer.customer_id = rental.customer_id
GROUP BY customer.customer_id
ORDER BY total_rental_duration DESC
LIMIT 5;

我们用一个示例的语句来说明这个问题，上面的语句是针对两个表进行Join操作，并且针对小表的中的客户来进行分组，找到在dvdrental 中租赁最多的前五。

dvdrental=# select count(*) from rental;
 count 
-------
 16044
(1 row)

dvdrental=# select count(*) from customer;
 count 
-------
   599
(1 row)n Time: 9.835 ms
(15 rows)

Limit  (cost=511.60..511.62 rows=5 width=33) (actual time=9.752..9.755 rows=5 loops=1)
   ->  Sort  (cost=511.60..513.10 rows=599 width=33) (actual time=9.751..9.753 rows=5 loops=1)
         Sort Key: (sum((rental.return_date - rental.rental_date))) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  HashAggregate  (cost=495.66..501.65 rows=599 width=33) (actual time=9.555..9.645 rows=599 loops=1)
               Group Key: customer.customer_id
               Batches: 1  Memory Usage: 169kB
               ->  Hash Join  (cost=22.48..375.33 rows=16044 width=33) (actual time=0.186..5.451 rows=16044 loops=1)
                     Hash Cond: (rental.customer_id = customer.customer_id)
                     ->  Seq Scan on rental  (cost=0.00..310.44 rows=16044 width=18) (actual time=0.006..1.233 rows=16044 loops=1)
                     ->  Hash  (cost=14.99..14.99 rows=599 width=17) (actual time=0.160..0.161 rows=599 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 39kB
                           ->  Seq Scan on customer  (cost=0.00..14.99 rows=599 width=17) (actual time=0.004..0.073 rows=599 loops=1)
 Planning Time: 0.545 ms
 Execution Time: 9.835 ms
(15 rows)

在实际运行中，通过执行计划，可以很清晰的看到在执行中，先对小表customer 表进行hash buckets 操作，然后在对rental 表进行hash buckets 操作，最后进行了hash join 的表连接，hash buckets 主要的作用是存储具有相同哈希值的键值连接条件。当进行hash 链接，系统将遍历每个hash buckets，搜索具有匹配hash 值的连接，最后返回匹配行。

PostgreSQL 哈希链接和哈希聚合_Memory_02

hash 连接在使用中需要注意，在使用中两个数据集合都需要加载到内存中，来构建hash 表进行hash 操作，并且在使用hash 桶的情况下，需要注意值的倾斜的问题，如果表中的大部分值都是一致的则使用这样的算法会导致一个hash 桶的数据量远远大于其他的桶。其中优化的方式一般通过充足的内存，优化hash 函数，让值更加的分散到hash 桶中等，此为hash 连接。

hash 聚合，哈希聚合是种常用的数据处理算法，他会对如sum, avg max, min 等group by 操作进行数据的分组和聚合计算，在处理的过程中，会将数据分成多个组，每个组具有相同的分组键，聚合计算会对该组中的数据进行合并计算。

hash 聚合的优点减少了磁盘的IO 消耗，将大部分聚合计算都在内存中进行，同时基于hash聚合可以使用并行的能力，充分利用多核心的CPU 来进行计算加速数据的处理。

在PostgreSQL中有四个参数与这部分有关

postgres=# select name,setting from pg_settings where name like '%hash%' or name like 'hash%' or name like '%hash';
         name         | setting 
----------------------+---------
 enable_hashagg       | on
 enable_hashjoin      | on
 enable_parallel_hash | on
 hash_mem_multiplier  | 1

这里有是哪个部分，hash 聚合，hash join ，hash 并行三个部分都可以进行开关，默认是开启的，另一个部分hash_mem_multiplier ，这个参数定义了操作中hash 表所需要的内存倍数，其中指定了 hash表使用内存大小与work_mem 的比例，其中这里有两种设置参数值。

hash_mem_multiplier 可以设置的值为 1 OR 2 ，1 为产生了hash内存值 = work_mem , 2 为 hash内存值 = work_mem /2 , 这里需要注意，如果work_mem 设置的较大，可以采用 2 ,如果 work_mem 本身不大则默认为1 即可。

PostgreSQL 哈希链接和哈希聚合_postgresql_03

QUERY PLAN                                                     
                 
-------------------------------------------------------------------------------------------------------------------------------------
-----------------
 Limit  (cost=1847.14..1847.15 rows=5 width=33) (actual time=17.283..17.287 rows=5 loops=1)
   ->  Sort  (cost=1847.14..1848.64 rows=599 width=33) (actual time=17.281..17.284 rows=5 loops=1)
         Sort Key: (sum((rental.return_date - rental.rental_date))) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  GroupAggregate  (cost=1431.37..1837.19 rows=599 width=33) (actual time=9.231..17.123 rows=599 loops=1)
               Group Key: customer.customer_id
               ->  Merge Join  (cost=1431.37..1710.87 rows=16044 width=33) (actual time=9.197..14.684 rows=16044 loops=1)
                     Merge Cond: (customer.customer_id = rental.customer_id)
                     ->  Index Scan using customer_pkey on customer  (cost=0.28..37.63 rows=599 width=17) (actual time=0.008..1.470 r
ows=599 loops=1)
                     ->  Sort  (cost=1431.09..1471.20 rows=16044 width=18) (actual time=9.178..10.859 rows=16044 loops=1)
                           Sort Key: rental.customer_id
                           Sort Method: quicksort  Memory: 1632kB
                           ->  Seq Scan on rental  (cost=0.00..310.44 rows=16044 width=18) (actual time=0.020..2.945 rows=16044 loops
=1)
 Planning Time: 1.963 ms
 Execution Time: 17.963 ms
(15 rows)

dvdrental=# select name,setting from pg_settings where name like '%hash%' or name like 'hash%' or name like '%hash';
         name         | setting 
----------------------+---------
 enable_hashagg       | off
 enable_hashjoin      | off
 enable_parallel_hash | on
 hash_mem_multiplier  | 1
(4 rows)

dvdrental=#

从这个结果我们可以看到在hash join ,hash 聚合关闭的情况下，对我们整体的执行计划至少慢了1倍。

PostgreSQL 哈希链接和哈希聚合_哈希算法_04