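Topic: Oracle Database histograms explained: frequency, height-balanced, top frequency, and hybrid histograms.
1. Theory
Oracle Database uses the following variables to decide which type of histogram to create:
NDV: the number of distinct values in a column. For example, if a column contains only the values 100, 200, and 300, its NDV is 3;
n: the number of histogram buckets; the default is 254;
p: a percentage threshold, p=(1-(1/n))*100. For example, if n=254, then p=99.6.
The choice also depends on whether estimate_percent is set to AUTO_SAMPLE_SIZE (the default) when DBMS_STATS gathers statistics.
The following figure shows the decision flow for choosing a histogram type: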
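The decision flow can also be written as a small function. The following Python sketch is only an illustration built from the creation conditions listed for each histogram type in this article; the function names and the top_n_pct parameter are hypothetical and are not part of any Oracle API.

```python
def threshold_p(n: int) -> float:
    """Percentage threshold p = (1 - 1/n) * 100 for n histogram buckets."""
    return (1 - 1 / n) * 100


def choose_histogram(ndv: int, n: int, auto_sample_size: bool, top_n_pct: float) -> str:
    """Return the histogram type implied by NDV, bucket count n,
    the sampling mode, and the share of rows held by the top n values."""
    if ndv <= n:
        return "FREQUENCY"            # every distinct value gets its own bucket
    if not auto_sample_size:
        return "HEIGHT BALANCED"      # user-specified sampling percentage
    if top_n_pct >= threshold_p(n):
        return "TOP-FREQUENCY"        # top n values cover at least p percent of rows
    return "HYBRID"


print(round(threshold_p(254), 1))  # 99.6
print(choose_histogram(ndv=8, n=254, auto_sample_size=True, top_n_pct=100.0))  # FREQUENCY
```

The order of the checks mirrors the conditions: NDV against n first, then the sampling mode, then the threshold p.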

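Histogram cardinality algorithms
The histogram cardinality algorithm depends on the endpoint numbers, the endpoint values, and whether column values are popular or nonpopular.
endpoint number
An endpoint number is a number that uniquely identifies a bucket. In frequency and hybrid histograms, the endpoint number is the cumulative frequency of all values in the current bucket and all previous buckets. For example, a bucket with endpoint number 100 means that the values in the current and previous buckets occur 100 times in total.
In height-balanced histograms, the optimizer numbers buckets sequentially, starting at 0 or 1. In general, the endpoint number is the bucket number.
endpoint value
An endpoint value is the highest value in a bucket. For example, if a bucket contains only the values 52794 and 52795, its endpoint value is 52795.
popular and nonpopular values
Whether a value is popular affects the cardinality estimation algorithm as follows:
Popular values
A popular value occurs as the endpoint value of multiple buckets. To determine whether a value is popular, the optimizer first checks whether it is the endpoint value of a bucket. If so, then for frequency histograms the optimizer subtracts the endpoint number of the previous bucket from the endpoint number of the current bucket. Hybrid histograms already store this information for each endpoint individually. If the result is greater than 1, the value is popular.
The optimizer estimates the cardinality of a popular value with the following formula.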
cardinality of popular value =
(num of rows in table) *
(num of endpoints spanned by this value / total num of endpoints)
Nonpopular values
Any value that is not popular is a nonpopular value.
The optimizer estimates the cardinality of a nonpopular value with the following formula.
cardinality of nonpopular value =
(num of rows in table) * density
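Note: the optimizer calculates density using an internal algorithm based on factors such as the number of buckets and the NDV. Density is a decimal number between 0 and 1. A value close to 1 means the optimizer expects many rows to be returned by a query that references this column in its predicate list; a value close to 0 means it expects few rows.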
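The two formulas translate directly into code. The sketch below is a minimal illustration; the function names and the input numbers (1000 rows, 25 of 100 endpoints, density 0.002) are hypothetical, and density is treated as a given input because the article describes it as the result of an internal algorithm.

```python
def popular_cardinality(num_rows: int, endpoints_spanned: int, total_endpoints: int) -> float:
    """cardinality = rows * (endpoints spanned by the value / total endpoints)."""
    return num_rows * (endpoints_spanned / total_endpoints)


def nonpopular_cardinality(num_rows: int, density: float) -> float:
    """cardinality = rows * density, with density between 0 and 1."""
    return num_rows * density


# Hypothetical inputs, chosen only to show the arithmetic.
print(popular_cardinality(1000, 25, 100))            # 250.0
print(round(nonpopular_cardinality(1000, 0.002), 2))  # 2.0
```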
Bucket Compression
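In some cases, to reduce the total number of buckets, the optimizer compresses multiple buckets into a single bucket. For example, in the following frequency histogram the first bucket number is 1 and the last bucket number is 23.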
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
1 52792
6 52793
8 52794
9 52795
10 52796
12 52797
14 52798
23 52799
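Note: some buckets in the example above have been compressed. Originally, buckets 2 through 6 each contained one instance of the value 52793. The optimizer compressed these buckets into the bucket with the highest endpoint number (bucket 6), which now contains 5 instances of 52793. The value is popular because the difference between the endpoint number of the current bucket (6) and that of the previous bucket (1) is 5. So before compression, the value 52793 was the endpoint of 5 buckets.
The following annotations show which buckets were compressed and which values are popular.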
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
1 52792 -> nonpopular
6 52793 -> buckets 2-6 compressed into 6; popular
8 52794 -> buckets 7-8 compressed into 8; popular
9 52795 -> nonpopular
10 52796 -> nonpopular
12 52797 -> buckets 11-12 compressed into 12; popular
14 52798 -> buckets 13-14 compressed into 14; popular
23 52799 -> buckets 15-23 compressed into 23; popular
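The annotations above can be reproduced by subtracting each endpoint number from the next. The following Python sketch uses the endpoint pairs from the listing; the function name is hypothetical, and the "greater than 1" test is the popularity rule stated earlier in this section.

```python
# (endpoint_number, endpoint_value) pairs copied from the listing above.
histogram = [(1, 52792), (6, 52793), (8, 52794), (9, 52795),
             (10, 52796), (12, 52797), (14, 52798), (23, 52799)]


def per_value_frequency(buckets):
    """Turn cumulative endpoint numbers into (value, frequency, is_popular)."""
    result = []
    previous = 0
    for endpoint_number, endpoint_value in buckets:
        frequency = endpoint_number - previous   # rows added by this bucket
        result.append((endpoint_value, frequency, frequency > 1))
        previous = endpoint_number
    return result


for value, frequency, popular in per_value_frequency(histogram):
    print(value, frequency, "popular" if popular else "nonpopular")
```

The frequency of a popular value equals the number of buckets that were compressed into its bucket, for example 5 for 52793.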
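Frequency histograms
In a frequency histogram, each distinct column value corresponds to a single bucket of the histogram. Because each value has its own dedicated bucket, some buckets may contain many values while others contain few.
The database creates a frequency histogram when the following conditions are met:
A. NDV is less than or equal to n, where n is the number of histogram buckets (default 254);
B. estimate_percent is set to either a user-specified value or AUTO_SAMPLE_SIZE when statistics are gathered.
Note: starting with Oracle Database 12c, if the sampling size is the default AUTO_SAMPLE_SIZE, the database creates frequency histograms from a full table scan. For any other sampling percentage, the database derives frequency histograms from a sample. Before Oracle Database 12c, the database gathered histograms from small samples, so low-frequency values might not appear in the sample. Using density in that situation could lead the optimizer to overestimate selectivity.
Note: the histogram tests below were run on Oracle 19.12.
2. Frequency histogram
2.1. Prepare the test data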
[oracle@dbserver ~]$ sqlplus / as sysdba
SQL*Plus: Release 19.0.0.0.0 - Production on Sat Jul 15 22:18:41 2023
Version 19.12.0.0.0
Copyright (c) 1982, 2021, Oracle. All rights reserved.
Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.12.0.0.0
SYS@rmlis> conn leo/leo;
Connected.
LEO@rmlis> create table countries(country_subregion_id number);
insert into countries values(52792);
insert into countries values(52793);
insert into countries values(52793);
insert into countries values(52793);
insert into countries values(52793);
insert into countries values(52793);
insert into countries values(52794);
insert into countries values(52794);
insert into countries values(52795);
insert into countries values(52796);
insert into countries values(52797);
insert into countries values(52797);
insert into countries values(52798);
insert into countries values(52798);
insert into countries values(52799);
insert into countries values(52799);
insert into countries values(52799);
insert into countries values(52799);
insert into countries values(52799);
insert into countries values(52799);
insert into countries values(52799);
insert into countries values(52799);
insert into countries values(52799);
SYS@rmlis> select country_subregion_id,count(*) from leo.countries group by country_subregion_id order by 1;
COUNTRY_SUBREGION_ID COUNT(*)
-------------------- ----------
52792 1
52793 5
52794 2
52795 1
52796 1
52797 2
52798 2
52799 9
8 rows selected.
2.2. Generate the frequency histogram
--Gather statistics for table leo.countries and column country_subregion_id, using the default bucket count of 254.
SYS@rmlis> begin
2 dbms_stats.gather_table_stats(ownname => 'LEO',
3 tabname => 'COUNTRIES',
4 method_opt => 'FOR COLUMNS COUNTRY_SUBREGION_ID');
5 end;
6 /
PL/SQL procedure successfully completed.
2.3. Related queries
--Query the histogram information
LEO@rmlis> col table_name for a20
LEO@rmlis> col column_name for a20
LEO@rmlis> select table_name, column_name, num_distinct, histogram
2 from user_tab_col_statistics
3 where table_name = 'COUNTRIES'
4* and column_name = 'COUNTRY_SUBREGION_ID';
TABLE_NAME COLUMN_NAME NUM_DISTINCT HISTOGRAM
-------------------- -------------------- ------------ ---------------
COUNTRIES COUNTRY_SUBREGION_ID 8 FREQUENCY
--Query the endpoint numbers and endpoint values
LEO@rmlis> select endpoint_number, endpoint_value
2 from user_histograms
3 where table_name = 'COUNTRIES'
4 and column_name = 'COUNTRY_SUBREGION_ID';
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
1 52792
6 52793
8 52794
9 52795
10 52796
12 52797
14 52798
23 52799
8 rows selected.
2.4. Diagram walkthrough
The following diagram shows the 8 buckets in detail.

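Note: as the diagram shows, each distinct value has its own bucket. Because this is a frequency histogram, the endpoint number is the cumulative frequency of the endpoints. For 52793, the endpoint number 6 means the value occurs 5 times (6-1). For 52794, the endpoint number 8 means the value occurs 2 times (8-6).
Every bucket whose endpoint number is at least 2 greater than the previous endpoint number contains a popular value. So buckets 6, 8, 12, 14, and 23 contain popular values, and the optimizer calculates their cardinality from the endpoint numbers. For example, the optimizer calculates the cardinality (C) of the value 52799 with the following formula, where the table has 23 rows.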
C=23*(9/23)
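Buckets 1, 9, and 10 contain nonpopular values; the optimizer estimates their cardinality based on density.
3. Height-balanced histogram
3.1. Theory
In a height-balanced histogram, column values are divided into buckets so that each bucket contains approximately the same number of rows. For example, if 99 coins are distributed among 4 buckets, each bucket contains about 25 coins. The histogram shows where the endpoints fall in the range of values.
Before Oracle 12c, the database created a height-balanced histogram whenever NDV was greater than n. This type of histogram is useful for range predicates, and for equality predicates on values that appear as endpoints in at least two buckets.
The database creates a height-balanced histogram when the following conditions are met:
A. NDV is greater than n, where n is the number of histogram buckets (default 254);
B. estimate_percent is not set to AUTO_SAMPLE_SIZE when statistics are gathered.
It follows that if Oracle 12c creates a new histogram with the sampling percentage set to AUTO_SAMPLE_SIZE, the histogram can only be top frequency or hybrid, not height-balanced.
After an upgrade from 11g to 12c, height-balanced histograms created before the upgrade remain in use. However, if statistics are gathered again, the existing height-balanced histograms on the table are replaced. The type of the replacement histogram depends on the NDV and on the following criteria:
A. If the sampling percentage is AUTO_SAMPLE_SIZE, the database creates a hybrid or frequency histogram.
B. If the sampling percentage is not AUTO_SAMPLE_SIZE, the database creates a height-balanced or frequency histogram.
3.2. Generate the height-balanced histogram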
LEO@rmlis> select country_subregion_id, count(*)
2 from leo.countries
3 group by country_subregion_id
4 order by 1;
COUNTRY_SUBREGION_ID COUNT(*)
-------------------- ----------
52792 1
52793 5
52794 2
52795 1
52796 1
52797 2
52798 2
52799 9
8 rows selected.
LEO@rmlis> begin
2 dbms_stats.gather_table_stats(ownname => 'LEO',
3 tabname => 'COUNTRIES',
4 method_opt => 'FOR COLUMNS COUNTRY_SUBREGION_ID SIZE 7',
5 estimate_percent => 100);
6 end;
7 /
PL/SQL procedure successfully completed.
3.3. Related queries
--Query the histogram information for the country_subregion_id column
LEO@rmlis> select a.column_name,
2 a.table_name,
3 b.num_rows,
4 a.num_distinct Cardinality,
5 round(a.num_distinct / b.num_rows * 100, 2) selectivity,
6 a.histogram,
7 a.num_buckets
8 from dba_tab_col_statistics a, dba_tables b
9 where a.owner = b.owner
10 and a.table_name = b.table_name
11 and a.owner = 'LEO'
12 and a.table_name = 'COUNTRIES';
COLUMN_NAME TABLE_NAME NUM_ROWS CARDINALITY SELECTIVITY HISTOGRAM NUM_BUCKETS
-------------------- -------------------- ---------- ----------- ----------- --------------- -----------
COUNTRY_SUBREGION_ID COUNTRIES 23 8 34.78 HEIGHT BALANCED 7
--Query the endpoint numbers and endpoint values
LEO@rmlis> select endpoint_number, endpoint_value
2 from user_histograms
3 where table_name = 'COUNTRIES'
4 and column_name = 'COUNTRY_SUBREGION_ID';
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
0 52792
2 52793
3 52795
4 52798
7 52799
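3.4. Diagram walkthrough
Note: the bucket number is the same as the endpoint number. The optimizer records the value of the last row in each bucket as the endpoint value, and then checks that the minimum value is the endpoint value of the first bucket and the maximum value is the endpoint value of the last bucket. In this example, the optimizer adds bucket 0 so that the minimum value 52792 is the endpoint of a bucket.
The optimizer distributes the 23 rows evenly across the 7 specified histogram buckets, so each bucket contains approximately 3 rows. However, the optimizer compresses buckets that have the same endpoint. So instead of bucket 1 containing 2 instances of 52793 and bucket 2 containing 3, the optimizer puts all 5 instances of 52793 into bucket 2. Similarly, instead of buckets 5, 6, and 7 each containing 3 instances of 52799, the optimizer puts all 9 instances into bucket 7.
In this example, buckets 3 and 4 contain nonpopular values, because the difference between the current and the previous endpoint number is 1. The optimizer estimates the cardinality of these values based on density. The remaining buckets contain popular values, whose cardinality the optimizer calculates from the endpoint numbers.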
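As an illustration, the following Python sketch applies the popular-value formula from section 1 to the height-balanced endpoints listed in 3.3. It assumes that "total num of endpoints" is the number of buckets (7, the highest endpoint number) and that bucket 0 only marks the minimum value; the function name is hypothetical.

```python
NUM_ROWS = 23
# (endpoint_number, endpoint_value) pairs from the user_histograms output in 3.3.
endpoints = [(0, 52792), (2, 52793), (3, 52795), (4, 52798), (7, 52799)]


def popular_estimates(endpoints, num_rows):
    """Estimate cardinality for values that span more than one bucket."""
    total_buckets = endpoints[-1][0]          # assumption: highest endpoint number
    estimates = {}
    for (prev_number, _), (number, value) in zip(endpoints, endpoints[1:]):
        spanned = number - prev_number        # buckets compressed into this one
        if spanned > 1:                       # popular value
            estimates[value] = num_rows * spanned / total_buckets
    return estimates


for value, estimate in popular_estimates(endpoints, NUM_ROWS).items():
    print(value, round(estimate, 2))
# 52793 6.57
# 52799 9.86
```

Under these assumptions the estimates (6.57 and 9.86) differ from the actual row counts (5 and 9), which shows that a height-balanced histogram only approximates the frequency of popular values.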

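4. Top frequency histogram
4.1. Theory
A top frequency histogram is a variation of the frequency histogram that ignores values that occur rarely. For example, if 1000 coins contain only a single penny, the penny can be ignored when sorting the coins into buckets. A top frequency histogram can produce a better histogram for highly popular values.
The database creates a top frequency histogram when the following conditions are met:
A. NDV is greater than n, where n is the number of histogram buckets (default 254);
B. The percentage of rows occupied by the top n most frequent values is greater than or equal to the threshold p, where p=(1-(1/n))*100;
C. estimate_percent is set to AUTO_SAMPLE_SIZE when statistics are gathered.
4.2. Generate the top frequency histogram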
LEO@rmlis> select country_subregion_id,count(*) from leo.countries group by country_subregion_id order by 1;
COUNTRY_SUBREGION_ID COUNT(*)
-------------------- ----------
52792 1
52793 5
52794 2
52795 1
52796 1
52797 2
52798 2
52799 9
8 rows selected.
Gather statistics, specifying 7 buckets
LEO@rmlis> begin
2 dbms_stats.gather_table_stats(ownname => 'LEO',
3 tabname => 'COUNTRIES',
4 method_opt => 'FOR COLUMNS COUNTRY_SUBREGION_ID SIZE 7');
5 end;
6 /
PL/SQL procedure successfully completed.
4.3. Related queries
--Query the histogram information
LEO@rmlis> select a.column_name,
2 a.table_name,
3 b.num_rows,
4 a.num_distinct Cardinality,
5 round(a.num_distinct / b.num_rows * 100, 2) selectivity,
6 a.histogram,
7 a.num_buckets
8 from dba_tab_col_statistics a, dba_tables b
9 where a.owner = b.owner
10 and a.table_name = b.table_name
11 and a.owner = 'LEO'
12 and a.table_name = 'COUNTRIES';
COLUMN_NAME TABLE_NAME NUM_ROWS CARDINALITY SELECTIVITY HISTOGRAM NUM_BUCKETS
-------------------- -------------------- ---------- ----------- ----------- --------------- -----------
COUNTRY_SUBREGION_ID COUNTRIES 23 8 34.78 TOP-FREQUENCY 7
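Note: the countries.country_subregion_id column contains 8 distinct values, but the histogram has only 7 buckets and estimate_percent defaults to AUTO_SAMPLE_SIZE. Under these conditions the database can only create a top frequency or hybrid histogram. The 7 most frequent values in the country_subregion_id column account for 95.7% (22/23) of the rows, which exceeds the 85.7% threshold ((1-1/7)*100), so a top frequency histogram is created.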
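The threshold check can be verified with a few lines of Python. This is a sketch only: the counts are copied from the GROUP BY output in 4.2, and the function name is hypothetical.

```python
# value -> number of rows, from the GROUP BY output in 4.2.
counts = {52792: 1, 52793: 5, 52794: 2, 52795: 1,
          52796: 1, 52797: 2, 52798: 2, 52799: 9}


def top_n_coverage(counts, n):
    """Percentage of all rows held by the n most frequent values."""
    total = sum(counts.values())
    top = sorted(counts.values(), reverse=True)[:n]
    return sum(top) / total * 100


n = 7
coverage = top_n_coverage(counts, n)   # 22 of 23 rows
p = (1 - 1 / n) * 100                  # threshold for n buckets
print(round(coverage, 2), round(p, 2), coverage >= p)  # 95.65 85.71 True
```

Because the coverage is at least p, the top frequency conditions are satisfied for this column.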
--Query the endpoint numbers and endpoint values
LEO@rmlis> select endpoint_number, endpoint_value
2 from user_histograms
3 where table_name = 'COUNTRIES'
4 and column_name = 'COUNTRY_SUBREGION_ID';
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
1 52792
6 52793
8 52794
9 52796
11 52797
13 52798
22 52799
7 rows selected.
4.4. Diagram walkthrough

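Note: every distinct value except 52795 has its own bucket. 52795 is excluded from the histogram because it is nonpopular. As in a standard frequency histogram, the endpoint number represents the cumulative frequency of values.
5. Hybrid histogram
5.1. Theory
A hybrid histogram combines the advantages of height-balanced and frequency histograms, which lets the optimizer obtain better execution plans in some situations.
Height-balanced histograms sometimes produce inaccurate estimates for values that are almost popular. For example, a value that occurs as the endpoint of only one bucket but almost occupies two buckets is treated as nonpopular by the optimizer.
To solve this problem, a hybrid histogram distributes values so that no value occupies more than one bucket, and then stores the endpoint repeat count, that is, the number of times the endpoint value is repeated. Using the endpoint repeat count, the optimizer can obtain accurate estimates for almost popular values.
How endpoint repeat counts work
The analogy of distributing coins into buckets illustrates how endpoint repeat counts work.
The following figure shows a coins column, with values sorted from low to high.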

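Statistics are gathered for this table with the method_opt argument of DBMS_STATS.GATHER_TABLE_STATS set to FOR ALL COLUMNS SIZE 3. In this example, the optimizer initially groups the values in the coins column into three buckets, as shown below.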

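If a bucket boundary splits a value so that some occurrences fall in one bucket and the rest in another, the optimizer shifts the bucket boundary (and all subsequent bucket boundaries) forward to include all occurrences of the value. For example, the optimizer shifts the value 5 so that it is entirely in the first bucket, and the value 25 is now entirely in the second bucket.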

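The endpoint repeat count measures the number of times that the corresponding bucket endpoint (the value at the right bucket boundary) repeats itself. For example, in the first bucket the value 5 repeats 3 times, so its endpoint repeat count is 3.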

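Note: a height-balanced histogram stores less information than a hybrid histogram. Using the endpoint repeat count, the optimizer can determine exactly how many times the endpoint value occurs. For example, the optimizer knows that the value 5 occurs 3 times, the value 25 occurs 4 times, and the value 100 occurs 2 times. This information helps the optimizer produce better cardinality estimates.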
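The boundary shifting and the repeat counts can be imitated with a short Python sketch. Two things here are assumptions: the coin values are a hypothetical data set chosen to be consistent with the counts quoted in the text (5 occurs 3 times, 25 occurs 4 times, 100 occurs 2 times), and the bucketing logic is a simplified illustration, not Oracle's actual implementation.

```python
# Hypothetical sorted coin values (assumption; chosen to match the counts in the text).
coins = [1, 1, 1, 5, 5, 5, 10, 10, 25, 25, 25, 25, 50, 100, 100]


def hybrid_buckets(sorted_values, num_buckets):
    """Split sorted values into buckets without splitting one value across two buckets."""
    target = len(sorted_values) / num_buckets      # ideal rows per bucket
    buckets, current = [], []
    for i, value in enumerate(sorted_values):
        current.append(value)
        is_last = i == len(sorted_values) - 1
        value_complete = is_last or sorted_values[i + 1] != value
        # Close the bucket only once the boundary value is fully included.
        if value_complete and (len(current) >= target or is_last):
            buckets.append(current)
            current = []
    return buckets


def summarize(buckets):
    """Return (endpoint_number, endpoint_value, endpoint_repeat_count) per bucket."""
    rows, cumulative = [], 0
    for bucket in buckets:
        cumulative += len(bucket)
        endpoint = bucket[-1]
        rows.append((cumulative, endpoint, bucket.count(endpoint)))
    return rows


print(summarize(hybrid_buckets(coins, 3)))
# [(6, 5, 3), (12, 25, 4), (15, 100, 2)]
```

With these inputs, the first boundary moves forward to take in the third 5 and the second boundary moves forward to take in all four 25s, which is the shifting behaviour described above.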
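Compared with a top frequency histogram, the only difference is that the share of rows covered by the top n values is below the threshold p. The database creates a hybrid histogram when the following conditions are met:
A. NDV is greater than n, where n is the number of histogram buckets (default 254);
B. The percentage of rows occupied by the top n most frequent values is less than the threshold p, where p=(1-(1/n))*100;
C. estimate_percent is set to AUTO_SAMPLE_SIZE when DBMS_STATS gathers statistics.
Note: if the user specifies a sampling percentage, the database creates a frequency or height-balanced histogram instead.
5.2. Prepare the test data
--Create the test data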
LEO@rmlis> create table products(prod_subcategory_id number);
Table created.
insert into products values(2014);
insert into products values(2014);
insert into products values(2014);
insert into products values(2014);
insert into products values(2014);
insert into products values(2014);
insert into products values(2014);
insert into products values(2014);
insert into products values(2055);
insert into products values(2055);
insert into products values(2055);
insert into products values(2055);
insert into products values(2055);
insert into products values(2055);
insert into products values(2055);
insert into products values(2032);
insert into products values(2032);
insert into products values(2032);
insert into products values(2032);
insert into products values(2032);
insert into products values(2032);
insert into products values(2054);
insert into products values(2054);
insert into products values(2054);
insert into products values(2054);
insert into products values(2054);
insert into products values(2054);
insert into products values(2056);
insert into products values(2056);
insert into products values(2056);
insert into products values(2056);
insert into products values(2056);
insert into products values(2031);
insert into products values(2031);
insert into products values(2031);
insert into products values(2031);
insert into products values(2031);
insert into products values(2042);
insert into products values(2042);
insert into products values(2042);
insert into products values(2042);
insert into products values(2042);
insert into products values(2051);
insert into products values(2051);
insert into products values(2051);
insert into products values(2051);
insert into products values(2051);
insert into products values(2036);
insert into products values(2036);
insert into products values(2036);
insert into products values(2036);
insert into products values(2043);
insert into products values(2043);
insert into products values(2043);
insert into products values(2033);
insert into products values(2033);
insert into products values(2034);
insert into products values(2034);
insert into products values(2013);
insert into products values(2013);
insert into products values(2012);
insert into products values(2012);
insert into products values(2053);
insert into products values(2053);
insert into products values(2035);
insert into products values(2035);
insert into products values(2022);
insert into products values(2041);
insert into products values(2044);
insert into products values(2011);
insert into products values(2021);
insert into products values(2052);
5.3. Generate the hybrid histogram
LEO@rmlis> begin
2 dbms_stats.gather_table_stats(ownname => 'LEO',
3 tabname => 'PRODUCTS',
4 method_opt => 'FOR COLUMNS PROD_SUBCATEGORY_ID SIZE 10');
5 end;
6 /
PL/SQL procedure successfully completed.
5.4. Related queries
--Query the number of occurrences of each distinct value
LEO@rmlis> r
1 select count(prod_subcategory_id) as num_of_rows, prod_subcategory_id
2 from products
3 group by prod_subcategory_id
4* order by 1 desc
NUM_OF_ROWS PROD_SUBCATEGORY_ID
----------- -------------------
8 2014
7 2055
6 2054
6 2032
5 2042
5 2051
5 2031
5 2056
4 2036
3 2043
2 2034
2 2035
2 2013
2 2053
2 2033
2 2012
1 2022
1 2052
1 2021
1 2044
1 2011
1 2041
22 rows selected.
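Note: the column contains 22 distinct values. Because the number of buckets (10) is less than 22, the optimizer cannot create a frequency histogram. The optimizer considers both hybrid and top frequency histograms. To qualify for a top frequency histogram, the percentage of rows occupied by the 10 most frequent values must be greater than or equal to p, where p=(1-(1/10))*100, that is, 90%. Here the 10 most frequent values account for 54 of the 72 rows, or 75%, which is below 90%. The optimizer therefore chooses a hybrid histogram.
--Query the histogram information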
LEO@rmlis> select a.column_name,
2 a.table_name,
3 b.num_rows,
4 a.num_distinct Cardinality,
5 round(a.num_distinct / b.num_rows * 100, 2) selectivity,
6 a.histogram,
7 a.num_buckets
8 from dba_tab_col_statistics a, dba_tables b
9 where a.owner = b.owner
10 and a.table_name = b.table_name
11 and a.owner = 'LEO'
12 and a.table_name = 'PRODUCTS';
COLUMN_NAME TABLE_NAME NUM_ROWS CARDINALITY SELECTIVITY HISTOGRAM NUM_BUCKETS
-------------------- -------------------- ---------- ----------- ----------- --------------- -----------
PROD_SUBCATEGORY_ID PRODUCTS 72 22 30.56 HYBRID 10
--Query the endpoint numbers, endpoint values, and endpoint repeat counts
LEO@rmlis> select endpoint_number, endpoint_value, endpoint_repeat_count
2 from user_histograms
3 where table_name = 'PRODUCTS'
4 and column_name = 'PROD_SUBCATEGORY_ID'
5 order by 1;
ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_REPEAT_COUNT
--------------- -------------- ---------------------
1 2011 1
13 2014 8
26 2032 6
36 2036 4
45 2043 3
52 2052 1
54 2053 2
60 2054 6
67 2055 7
72 2056 5
10 rows selected.
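Summary: in a height-balanced histogram, the optimizer would distribute the 72 rows evenly across the 10 specified histogram buckets, so that each bucket contains approximately 7 rows. Because this is a hybrid histogram, the optimizer distributes the values so that no value occupies more than one bucket. For example, the optimizer does not put some instances of the value 2036 into one bucket and the rest into another: all instances of 2036 are in the bucket with endpoint number 36.
The endpoint repeat count shows the number of times the highest value in the bucket is repeated. Using the endpoint number and the repeat count, the optimizer can estimate cardinality. For example, bucket 36 contains instances of the values 2033, 2034, 2035, and 2036. The endpoint value 2036 has an endpoint repeat count of 4, so the optimizer knows that 4 instances of this value exist. For values that are not endpoints, such as 2033, the optimizer estimates cardinality using density.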
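The bucket contents can be checked against the user_histograms output with a short Python sketch. The rows are copied from the query result in 5.4; the function name and the tuple layout are hypothetical.

```python
# (endpoint_number, endpoint_value, endpoint_repeat_count) from the output in 5.4.
hybrid = [(1, 2011, 1), (13, 2014, 8), (26, 2032, 6), (36, 2036, 4), (45, 2043, 3),
          (52, 2052, 1), (54, 2053, 2), (60, 2054, 6), (67, 2055, 7), (72, 2056, 5)]


def bucket_breakdown(rows):
    """Return (endpoint_value, rows_in_bucket, endpoint_rows, other_rows) per bucket."""
    breakdown = []
    previous = 0
    for endpoint_number, endpoint_value, repeat_count in rows:
        bucket_rows = endpoint_number - previous     # endpoint numbers are cumulative
        breakdown.append((endpoint_value, bucket_rows,
                          repeat_count, bucket_rows - repeat_count))
        previous = endpoint_number
    return breakdown


for row in bucket_breakdown(hybrid):
    print(row)
```

For the bucket with endpoint value 2036 this gives 10 rows in total, 4 of them for 2036 itself and 6 for the non-endpoint values 2033, 2034, and 2035 (2 rows each in the test data).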
Reference: https://docs.oracle.com/database/121/TGSQL/tgsql_histo.htm#TGSQL95039