Oracle Database Histograms Explained: Frequency, Height-Balanced, Top Frequency, and Hybrid Histograms

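Topic: how Oracle Database histograms work, covering frequency, height-balanced, top frequency, and hybrid histograms.
1. Theory
Oracle Database uses the following criteria to decide which type of histogram to create:
NDV: the number of distinct values in a column. For example, if a column contains only the values 100, 200, and 300, its NDV is 3.
n: the number of histogram buckets. The default is 254.
p: an internal percentage threshold, p = (1 - (1/n)) * 100. For example, if n = 254, then p is about 99.6.
Whether the estimate_percent parameter used by DBMS_STATS when gathering statistics is set to AUTO_SAMPLE_SIZE (the default).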

The following flowchart shows how the database chooses a histogram type:

[Figure: decision flow for choosing the histogram type]
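
The selection rules that this article lists for each histogram type can be condensed into one small function. The following Python sketch is only an illustration of those rules; the function name `choose_histogram` and its parameters are hypothetical, and the real decision is made inside DBMS_STATS and may take other factors into account.

```python
def choose_histogram(ndv, n, auto_sample_size, top_n_pct):
    """Return the histogram type suggested by the rules in this article.

    ndv              -- number of distinct values in the column
    n                -- requested number of buckets (Oracle default: 254)
    auto_sample_size -- True if estimate_percent is AUTO_SAMPLE_SIZE
    top_n_pct        -- percentage of rows covered by the n most frequent values
    """
    p = (1 - 1 / n) * 100  # percentage threshold p
    if ndv <= n:
        return "FREQUENCY"
    if not auto_sample_size:
        return "HEIGHT BALANCED"
    if top_n_pct >= p:
        return "TOP-FREQUENCY"
    return "HYBRID"


# The four test cases used later in this article
print(choose_histogram(8, 254, True, 100.0))
print(choose_histogram(8, 7, False, 22 / 23 * 100))
print(choose_histogram(8, 7, True, 22 / 23 * 100))
print(choose_histogram(22, 10, True, 54 / 72 * 100))
```

The four calls correspond to the frequency, height-balanced, top frequency, and hybrid tests in sections 2 through 5.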

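Cardinality algorithms for histograms
The cardinality algorithm for a histogram depends on the endpoint numbers, the endpoint values, and whether a column value is popular or nonpopular.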

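Endpoint number
An endpoint number is a number that uniquely identifies a bucket. In frequency and hybrid histograms, the endpoint number is the cumulative frequency of all values in the current bucket and all previous buckets. For example, a bucket with endpoint number 100 means that the values in the current and previous buckets occur 100 times in total.
In height-balanced histograms, the optimizer numbers the buckets sequentially, starting at 0 or 1. In general, the endpoint number is the bucket number.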

Endpoint value
The endpoint value is the highest value in a bucket. For example, if a bucket contains only the values 52794 and 52795, its endpoint value is 52795.

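Popular and nonpopular values
Whether a value is popular affects the cardinality estimate as follows:

Popular values
A popular value occurs as the endpoint value of multiple buckets. The optimizer first checks whether a value is the endpoint value of a bucket. If it is, then for a frequency histogram the optimizer subtracts the endpoint number of the previous bucket from the endpoint number of the current bucket. A hybrid histogram already stores this information for each endpoint individually. If the result is greater than 1, the value is popular.

The optimizer estimates the cardinality of a popular value with the following formula: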
cardinality of popular value = 
  (num of rows in table) * 
  (num of endpoints spanned by this value / total num of endpoints)	

Nonpopular values
Any value that is not popular is nonpopular.
The optimizer estimates the cardinality of a nonpopular value with the following formula:
cardinality of nonpopular value =
  (num of rows in table) * density

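Note: The optimizer calculates density with an internal algorithm based on factors such as the number of buckets and the NDV. Density is a decimal number between 0 and 1. A value close to 1 means the optimizer expects a query that references the column in its predicate list to return many rows. A value close to 0 means it expects few rows.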
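
The two formulas above are simple enough to try out directly. The Python sketch below just restates them; the function names are hypothetical, and the density value of 0.05 is an invented number for illustration, because the article does not show how the optimizer derives density.

```python
def popular_cardinality(num_rows, endpoints_spanned, total_endpoints):
    """Cardinality of a popular value: rows * (endpoints spanned / total endpoints)."""
    return num_rows * endpoints_spanned / total_endpoints


def nonpopular_cardinality(num_rows, density):
    """Cardinality of a nonpopular value: rows * density."""
    return num_rows * density


# A value spanning 9 of 23 endpoints in a 23-row table
print(popular_cardinality(23, 9, 23))

# A nonpopular value, using a made-up density of 0.05
print(nonpopular_cardinality(23, 0.05))
```

The first call mirrors the calculation for the value 52799 in the frequency histogram example later in the article.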

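Bucket compression
In some cases, to reduce the total number of buckets, the optimizer compresses multiple buckets into a single bucket. In the following frequency histogram, for example, the first bucket number is 1 and the last bucket number is 23.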

ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
              1          52792
              6          52793
              8          52794 
              9          52795
             10          52796
             12          52797
             14          52798
             23          52799

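Note: Several buckets in this example have been compressed. Originally, buckets 2 through 6 each contained one instance of the value 52793. The optimizer compressed these buckets into the bucket with the highest endpoint number (bucket 6), which now contains 5 instances of 52793. The value is popular because the difference between the endpoint number of the current bucket (6) and that of the previous bucket (1) is 5. In other words, before compression the value 52793 was the endpoint value of 5 buckets.

The following listing marks the compressed buckets and the popular values.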

ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
              1          52792 -> nonpopular
              6          52793 -> buckets 2-6 compressed into 6; popular
              8          52794 -> buckets 7-8 compressed into 8; popular
              9          52795 -> nonpopular
             10          52796 -> nonpopular
             12          52797 -> buckets 11-12 compressed into 12; popular
             14          52798 -> buckets 13-14 compressed into 14; popular
             23          52799 -> buckets 15-23 compressed into 23; popular
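
The annotations in this listing can be derived mechanically from the cumulative endpoint numbers. The following Python sketch does so under the rule stated above (a difference greater than 1 means popular); the `classify` helper is a hypothetical name used only for this illustration.

```python
# (endpoint_number, endpoint_value) rows of the compressed frequency histogram
histogram = [(1, 52792), (6, 52793), (8, 52794), (9, 52795),
             (10, 52796), (12, 52797), (14, 52798), (23, 52799)]


def classify(rows):
    """Return (value, occurrences, label) for each histogram row."""
    result = []
    previous = 0
    for endpoint_number, endpoint_value in rows:
        # The gap to the previous endpoint number is the number of
        # original buckets that were compressed into this one.
        occurrences = endpoint_number - previous
        label = "popular" if occurrences > 1 else "nonpopular"
        result.append((endpoint_value, occurrences, label))
        previous = endpoint_number
    return result


for value, occurrences, label in classify(histogram):
    print(value, occurrences, label)
```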

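Frequency histograms
In a frequency histogram, each distinct column value corresponds to a single bucket. Because every value has its own dedicated bucket, some buckets hold many rows and others hold only a few.
The database creates a frequency histogram when both of the following conditions are met:
A. NDV is less than or equal to n, where n is the number of histogram buckets (default 254).
B. The estimate_percent parameter used when gathering statistics is set to either a user-specified value or AUTO_SAMPLE_SIZE.

Note: Starting in Oracle Database 12c, if the sampling size is the default AUTO_SAMPLE_SIZE, the database creates frequency histograms from a full table scan. For any other sampling percentage, the database derives frequency histograms from a sample. Before Oracle Database 12c, the database gathered histograms from a small sample, so low-frequency values might not appear in the sample. Using density in that situation could lead the optimizer to overestimate selectivity.

Note: The tests of each histogram type below were run on Oracle 19.12.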
2. Frequency histogram
2.1 Prepare the test data
[oracle@dbserver ~]$ sqlplus / as sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Sat Jul 15 22:18:41 2023
Version 19.12.0.0.0

Copyright (c) 1982, 2021, Oracle.  All rights reserved.


Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.12.0.0.0

SYS@rmlis> conn leo/leo;
Connected.

LEO@rmlis> create table countries(country_subregion_id number);

insert into countries values(52792);
insert into countries values(52793);
insert into countries values(52793);
insert into countries values(52793);
insert into countries values(52793);
insert into countries values(52793);
insert into countries values(52794);
insert into countries values(52794);
insert into countries values(52795);
insert into countries values(52796);
insert into countries values(52797);
insert into countries values(52797);
insert into countries values(52798);
insert into countries values(52798);
insert into countries values(52799);
insert into countries values(52799);
insert into countries values(52799);
insert into countries values(52799);
insert into countries values(52799);
insert into countries values(52799);
insert into countries values(52799);
insert into countries values(52799);
insert into countries values(52799);

SYS@rmlis> select country_subregion_id,count(*) from leo.countries group by country_subregion_id order by 1;

COUNTRY_SUBREGION_ID   COUNT(*)
-------------------- ----------
               52792          1
               52793          5
               52794          2
               52795          1
               52796          1
               52797          2
               52798          2
               52799          9

8 rows selected.

2.2 Generate the frequency histogram
-- Gather statistics for table leo.countries and column country_subregion_id, using the default of 254 buckets.
SYS@rmlis> begin
  2    dbms_stats.gather_table_stats(ownname    => 'LEO',
  3                                  tabname    => 'COUNTRIES',
  4                                  method_opt => 'FOR COLUMNS COUNTRY_SUBREGION_ID');
  5  end;
  6  /

PL/SQL procedure successfully completed.

2.3 Queries
-- Check the histogram information
LEO@rmlis> col table_name for a20
LEO@rmlis> col column_name for a20
LEO@rmlis> select table_name, column_name, num_distinct, histogram
  2    from user_tab_col_statistics
  3   where table_name = 'COUNTRIES'
  4*    and column_name = 'COUNTRY_SUBREGION_ID';

TABLE_NAME           COLUMN_NAME          NUM_DISTINCT HISTOGRAM
-------------------- -------------------- ------------ ---------------
COUNTRIES            COUNTRY_SUBREGION_ID            8 FREQUENCY

-- Check the endpoint numbers and endpoint values
LEO@rmlis> select endpoint_number, endpoint_value
  2    from user_histograms
  3   where table_name = 'COUNTRIES'
  4     and column_name = 'COUNTRY_SUBREGION_ID';

ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
              1          52792
              6          52793
              8          52794
              9          52795
             10          52796
             12          52797
             14          52798
             23          52799

8 rows selected.
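2.4 Diagram walkthrough
The following figure illustrates the 8 buckets in detail.

[Figure: the 8 buckets of the frequency histogram on country_subregion_id]

Note: As the figure shows, each distinct value has its own bucket. Because this is a frequency histogram, the endpoint number is the cumulative frequency of the endpoints. For 52793, the endpoint number 6 means the value occurs 5 times (6 - 1). For 52794, the endpoint number 8 means the value occurs 2 times (8 - 6).
Every bucket whose endpoint number is at least 2 greater than the previous endpoint number contains a popular value. Buckets 6, 8, 12, 14, and 23 therefore contain popular values, and the optimizer calculates their cardinality from the endpoint numbers. For example, with 23 rows in the table, the optimizer calculates the cardinality (C) of the value 52799 as follows:
C = 23 * (9/23) = 9

Buckets 1, 9, and 10 contain nonpopular values, for which the optimizer estimates cardinality from the density.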
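
To see where the endpoint numbers in section 2.3 come from, the test rows can be turned into a frequency histogram by hand. The following Python sketch is a simplified model, not Oracle's implementation: it counts each distinct value and accumulates the counts in value order. The helper name `frequency_histogram` is hypothetical.

```python
from collections import Counter
from itertools import accumulate

# The 23 rows inserted into the countries table in section 2.1
rows = ([52792] * 1 + [52793] * 5 + [52794] * 2 + [52795] * 1 +
        [52796] * 1 + [52797] * 2 + [52798] * 2 + [52799] * 9)


def frequency_histogram(values):
    """Return (endpoint_number, endpoint_value) pairs, one per distinct value."""
    counts = sorted(Counter(values).items())            # (value, count) by value
    running = accumulate(count for _, count in counts)  # cumulative row counts
    return [(endpoint, value) for endpoint, (value, _) in zip(running, counts)]


for endpoint_number, endpoint_value in frequency_histogram(rows):
    print(endpoint_number, endpoint_value)
```

The pairs it prints match the ENDPOINT_NUMBER and ENDPOINT_VALUE columns returned from user_histograms.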

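3. Height-balanced histogram
3.1 Theory
In a height-balanced histogram, column values are divided into buckets so that each bucket contains approximately the same number of rows. For example, if 99 coins are distributed among 4 buckets, each bucket holds about 25 coins. The histogram shows where the endpoints fall in the range of values.

Conditions for creating a height-balanced histogram:
Before Oracle Database 12c, the database created a height-balanced histogram when NDV was greater than n. This type of histogram is useful for range predicates, and for equality predicates on values that appear as endpoints in at least two buckets.

A. NDV is greater than n, where n is the number of histogram buckets (default 254).
B. The estimate_percent parameter used when gathering statistics is not set to AUTO_SAMPLE_SIZE.

It follows that if Oracle Database 12c creates a new histogram with the sampling percentage set to AUTO_SAMPLE_SIZE and NDV greater than n, the histogram is either top frequency or hybrid, never height-balanced.
If you upgrade from 11g to 12c, height-balanced histograms created before the upgrade remain in use. If you regather statistics, however, the existing height-balanced histogram on the table is replaced. The type of the replacement histogram depends on the NDV and the following criteria:
A. If the sampling percentage is AUTO_SAMPLE_SIZE, the database creates a hybrid or frequency histogram.
B. If the sampling percentage is not AUTO_SAMPLE_SIZE, the database creates a height-balanced or frequency histogram.
3.2 Generate the height-balanced histogram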
LEO@rmlis> select country_subregion_id, count(*)
  2    from leo.countries
  3   group by country_subregion_id
  4   order by 1;

COUNTRY_SUBREGION_ID   COUNT(*)
-------------------- ----------
               52792          1
               52793          5
               52794          2
               52795          1
               52796          1
               52797          2
               52798          2
               52799          9

8 rows selected.

LEO@rmlis> begin
  2    dbms_stats.gather_table_stats(ownname          => 'LEO',
  3                                  tabname          => 'COUNTRIES',
  4                                  method_opt       => 'FOR COLUMNS COUNTRY_SUBREGION_ID SIZE 7',
  5                                  estimate_percent => 100);
  6  end;
  7  /

PL/SQL procedure successfully completed.
3.3 Queries
-- Check the histogram information for the country_subregion_id column
LEO@rmlis> select a.column_name,
  2         a.table_name,
  3         b.num_rows,
  4         a.num_distinct Cardinality,
  5         round(a.num_distinct / b.num_rows * 100, 2) selectivity,
  6         a.histogram,
  7         a.num_buckets
  8    from dba_tab_col_statistics a, dba_tables b
  9   where a.owner = b.owner
 10     and a.table_name = b.table_name
 11     and a.owner = 'LEO'
 12     and a.table_name = 'COUNTRIES';

COLUMN_NAME          TABLE_NAME             NUM_ROWS CARDINALITY SELECTIVITY HISTOGRAM       NUM_BUCKETS
-------------------- -------------------- ---------- ----------- ----------- --------------- -----------
COUNTRY_SUBREGION_ID COUNTRIES                    23           8       34.78 HEIGHT BALANCED           7

-- Check the endpoint numbers and endpoint values
LEO@rmlis> select endpoint_number, endpoint_value
  2    from user_histograms
  3   where table_name = 'COUNTRIES'
  4     and column_name = 'COUNTRY_SUBREGION_ID';

ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
              0          52792
              2          52793
              3          52795
              4          52798
              7          52799
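3.4 Diagram walkthrough
Note: The bucket number is the same as the endpoint number. The optimizer records the value of the last row in each bucket as the endpoint value, then checks that the minimum value is the endpoint value of the first bucket and the maximum value is the endpoint value of the last bucket. In this example, the optimizer adds bucket 0 so that the minimum value 52792 becomes the endpoint of a bucket.
The optimizer distributes the 23 rows evenly across the 7 requested histogram buckets, so each bucket holds roughly 3 rows. However, the optimizer compresses buckets that share the same endpoint. Instead of bucket 1 holding 2 instances of 52793 and bucket 2 holding 3, the optimizer puts all 5 instances of 52793 into bucket 2. Similarly, instead of buckets 5, 6, and 7 each holding 3 instances of 52799, the optimizer puts all 9 instances into bucket 7.

In this example, buckets 3 and 4 contain nonpopular values, because the difference between the current endpoint number and the previous endpoint number is 1. The optimizer estimates the cardinality of these values from the density. The remaining buckets contain popular values, and the optimizer calculates their cardinality from the endpoint numbers.

[Figure: the height-balanced histogram buckets for country_subregion_id]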
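
The steps described in this walkthrough (split the sorted rows into equal buckets, record each bucket's last value, compress duplicate endpoints, add bucket 0 for the minimum) can be sketched in Python. This is a simplified model, not Oracle's actual algorithm. In particular, placing the last row of bucket k at position floor(k * rows / buckets) is an assumption; it happens to reproduce the output of section 3.3 for this data set. The sketch also assumes there are at least as many rows as buckets.

```python
def height_balanced_histogram(values, num_buckets):
    """Return (endpoint_number, endpoint_value) pairs after compression."""
    ordered = sorted(values)
    total = len(ordered)
    endpoints = {}
    for bucket in range(1, num_buckets + 1):
        # 1-based position of the bucket's last row (assumed boundary rule)
        last_row = bucket * total // num_buckets
        # A later bucket with the same endpoint value overwrites the earlier
        # one, which models the compression into the highest bucket number.
        endpoints[ordered[last_row - 1]] = bucket
    histogram = sorted((bucket, value) for value, bucket in endpoints.items())
    # Add bucket 0 if the minimum value is not already an endpoint.
    if histogram[0][1] != ordered[0]:
        histogram.insert(0, (0, ordered[0]))
    return histogram


rows = ([52792] * 1 + [52793] * 5 + [52794] * 2 + [52795] * 1 +
        [52796] * 1 + [52797] * 2 + [52798] * 2 + [52799] * 9)

for endpoint_number, endpoint_value in height_balanced_histogram(rows, 7):
    print(endpoint_number, endpoint_value)
```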

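4. Top frequency histogram
4.1 Theory
A top frequency histogram is a variation on a frequency histogram that ignores values that occur rarely. For example, if 1000 coins include only a single penny, the penny can be ignored when sorting the coins into buckets. A top frequency histogram can produce a better histogram for highly popular values.

The database creates a top frequency histogram when all of the following conditions are met:
A. NDV is greater than n, where n is the number of histogram buckets (default 254).
B. The percentage of rows occupied by the top n most frequent values is greater than or equal to the threshold p, where p = (1 - (1/n)) * 100.
C. The estimate_percent parameter used when gathering statistics is set to AUTO_SAMPLE_SIZE.

4.2 Generate the top frequency histogram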
LEO@rmlis> select country_subregion_id,count(*) from leo.countries group by country_subregion_id order by 1;

COUNTRY_SUBREGION_ID   COUNT(*)
-------------------- ----------
               52792          1
               52793          5
               52794          2
               52795          1
               52796          1
               52797          2
               52798          2
               52799          9

8 rows selected.
Gather statistics, specifying 7 buckets:
LEO@rmlis> begin
  2    dbms_stats.gather_table_stats(ownname    => 'LEO',
  3                                  tabname    => 'COUNTRIES',
  4                                  method_opt => 'FOR COLUMNS COUNTRY_SUBREGION_ID SIZE 7');
  5  end;
  6  /

PL/SQL procedure successfully completed.

4.3 Queries
-- Check the histogram information
LEO@rmlis> select a.column_name,
  2         a.table_name,
  3         b.num_rows,
  4         a.num_distinct Cardinality,
  5         round(a.num_distinct / b.num_rows * 100, 2) selectivity,
  6         a.histogram,
  7         a.num_buckets
  8    from dba_tab_col_statistics a, dba_tables b
  9   where a.owner = b.owner
 10     and a.table_name = b.table_name
 11     and a.owner = 'LEO'
 12     and a.table_name = 'COUNTRIES';

COLUMN_NAME          TABLE_NAME             NUM_ROWS CARDINALITY SELECTIVITY HISTOGRAM       NUM_BUCKETS
-------------------- -------------------- ---------- ----------- ----------- --------------- -----------
COUNTRY_SUBREGION_ID COUNTRIES                    23           8       34.78 TOP-FREQUENCY             7

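Note: The countries.country_subregion_id column contains 8 distinct values, but the histogram has only 7 buckets, and estimate_percent defaults to AUTO_SAMPLE_SIZE. Under these conditions the database can create only a top frequency or a hybrid histogram. The 7 most frequent values in country_subregion_id occupy 95.7% (22/23) of the rows, which exceeds the threshold of 85.7% ((1 - 1/7) * 100), so the database creates a top frequency histogram.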
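
The percentages in this note can be checked with a few lines of Python. The sketch below uses the value counts from section 4.2; the helper names `top_n_coverage` and `threshold` are hypothetical and only mirror the arithmetic, not any DBMS_STATS API.

```python
# Occurrences of each distinct country_subregion_id value
counts = {52792: 1, 52793: 5, 52794: 2, 52795: 1,
          52796: 1, 52797: 2, 52798: 2, 52799: 9}


def top_n_coverage(value_counts, n):
    """Percentage of rows covered by the n most frequent values."""
    top = sorted(value_counts.values(), reverse=True)[:n]
    return sum(top) / sum(value_counts.values()) * 100


def threshold(n):
    """Percentage threshold p = (1 - 1/n) * 100."""
    return (1 - 1 / n) * 100


coverage = top_n_coverage(counts, 7)
p = threshold(7)
print(round(coverage, 2), round(p, 2), coverage >= p)  # 95.65 85.71 True
```

Because the coverage is at least p, the top frequency condition holds. With the products data of section 5 and n = 10, the same functions give 75% against a threshold of 90%, which is why that test produces a hybrid histogram.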

-- Check the endpoint numbers and endpoint values
LEO@rmlis> select endpoint_number, endpoint_value
  2    from user_histograms
  3   where table_name = 'COUNTRIES'
  4     and column_name = 'COUNTRY_SUBREGION_ID';

ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
              1          52792
              6          52793
              8          52794
              9          52796
             11          52797
             13          52798
             22          52799

7 rows selected.

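4.4 Diagram walkthrough

[Figure: the top frequency histogram buckets for country_subregion_id]

Note: Every distinct value except 52795 has its own bucket. The value 52795 is nonpopular (it occurs only once) and is excluded from the histogram. As in a standard frequency histogram, the endpoint number represents the cumulative frequency of the values.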

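5. Hybrid histogram
5.1 Theory
A hybrid histogram combines the advantages of height-balanced and frequency histograms, which in some situations lets the optimizer obtain better execution plans.
A height-balanced histogram sometimes produces inaccurate estimates for values that are close to popular. For example, a value that occurs as the endpoint value of only one bucket but almost occupies two buckets is treated as nonpopular by the optimizer.
To solve this problem, a hybrid histogram distributes values so that no value occupies more than one bucket, and then stores the endpoint repeat count, which is the number of times the endpoint value is repeated. Using the endpoint repeat count, the optimizer can obtain accurate estimates for most popular values.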
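How endpoint repeat counts work
The analogy of distributing coins among buckets illustrates how endpoint repeat counts work.
The following figure shows a column of coins, sorted from lowest to highest value.

[Figure: a coins column sorted from low to high]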

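Statistics for this table are gathered by calling DBMS_STATS.GATHER_TABLE_STATS with the method_opt parameter set to FOR ALL COLUMNS SIZE 3. In this example, the optimizer initially groups the values in the coins column into three buckets, as shown below.

[Figure: initial distribution of the coins into three buckets]

If a bucket boundary splits a value so that some occurrences fall in one bucket and the rest in another, the optimizer moves the bucket boundary (and all subsequent bucket boundaries) forward to include all occurrences of the value. For example, the optimizer moves the value 5 so that it lies entirely in the first bucket, and the value 25 now lies entirely in the second bucket.

[Figure: bucket boundaries after shifting so that no value is split]

The endpoint repeat count measures the number of times the corresponding bucket endpoint (the value at the right-hand bucket boundary) repeats itself. For example, the value 5 is repeated 3 times in the first bucket, so its endpoint repeat count is 3.

[Figure: endpoint repeat counts for each bucket]

Note: A height-balanced histogram stores less information than a hybrid histogram. By using the endpoint repeat count, the optimizer can determine exactly how many times an endpoint value occurs. For example, the optimizer knows that the value 5 occurs 3 times, the value 25 occurs 4 times, and the value 100 occurs 2 times. This information helps the optimizer generate better cardinality estimates.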
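
The boundary-shifting idea can be illustrated with a short Python sketch. The coin list below is an assumption: it was chosen to be consistent with the repeat counts stated above (5 three times, 25 four times, 100 twice) because the original figures are not available here. The `hybrid_buckets` function is a simplified model of the description, not Oracle's actual algorithm, and for other data it may not yield exactly the requested number of buckets.

```python
def hybrid_buckets(values, num_buckets):
    """Split sorted values into buckets without splitting any value."""
    ordered = sorted(values)
    size = max(1, len(ordered) // num_buckets)  # target rows per bucket
    buckets = []
    start = 0
    while start < len(ordered):
        end = min(start + size, len(ordered))   # tentative boundary
        # Move the boundary forward until all occurrences of the
        # boundary value are inside the current bucket.
        while end < len(ordered) and ordered[end] == ordered[end - 1]:
            end += 1
        buckets.append(ordered[start:end])
        start = end                             # later boundaries shift too
    return buckets


def endpoint_repeat_counts(buckets):
    """Return (endpoint_value, repeat_count) for each bucket."""
    return [(bucket[-1], bucket.count(bucket[-1])) for bucket in buckets]


# Hypothetical coins column (assumed data, see the lead-in)
coins = [1, 1, 1, 5, 5, 5, 10, 10, 25, 25, 25, 25, 50, 100, 100]

buckets = hybrid_buckets(coins, 3)
print(buckets)
print(endpoint_repeat_counts(buckets))
```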

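The database creates a hybrid histogram when all of the following conditions are met.
Compared with a top frequency histogram, the only difference is that the top n values account for less than the threshold p.
A. NDV is greater than n, where n is the number of histogram buckets (default 254).
B. The percentage of rows occupied by the top n most frequent values is less than the threshold p, where p = (1 - (1/n)) * 100.
C. The estimate_percent parameter used when gathering statistics with DBMS_STATS is set to AUTO_SAMPLE_SIZE.
Note: If users specify their own sampling percentage, the database creates a frequency or height-balanced histogram.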

5.2 Prepare the test data
-- Create the test data
LEO@rmlis> create table products(prod_subcategory_id number);

Table created.

insert into products values(2014);
insert into products values(2014);
insert into products values(2014);
insert into products values(2014);
insert into products values(2014);
insert into products values(2014);
insert into products values(2014);
insert into products values(2014);
insert into products values(2055);
insert into products values(2055);
insert into products values(2055);
insert into products values(2055);
insert into products values(2055);
insert into products values(2055);
insert into products values(2055);
insert into products values(2032);
insert into products values(2032);
insert into products values(2032);
insert into products values(2032);
insert into products values(2032);
insert into products values(2032);
insert into products values(2054);
insert into products values(2054);
insert into products values(2054);
insert into products values(2054);
insert into products values(2054);
insert into products values(2054);
insert into products values(2056);
insert into products values(2056);
insert into products values(2056);
insert into products values(2056);
insert into products values(2056);
insert into products values(2031);
insert into products values(2031);
insert into products values(2031);
insert into products values(2031);
insert into products values(2031);
insert into products values(2042);
insert into products values(2042);
insert into products values(2042);
insert into products values(2042);
insert into products values(2042);
insert into products values(2051);
insert into products values(2051);
insert into products values(2051);
insert into products values(2051);
insert into products values(2051);
insert into products values(2036);
insert into products values(2036);
insert into products values(2036);
insert into products values(2036);
insert into products values(2043);
insert into products values(2043);
insert into products values(2043);
insert into products values(2033);
insert into products values(2033);
insert into products values(2034);
insert into products values(2034);
insert into products values(2013);
insert into products values(2013);
insert into products values(2012);
insert into products values(2012);
insert into products values(2053);
insert into products values(2053);
insert into products values(2035);
insert into products values(2035);
insert into products values(2022);
insert into products values(2041);
insert into products values(2044);
insert into products values(2011);
insert into products values(2021);
insert into products values(2052);

5.3 Generate the hybrid histogram
LEO@rmlis> begin
  2    dbms_stats.gather_table_stats(ownname    => 'LEO',
  3                                  tabname    => 'PRODUCTS',
  4                                  method_opt => 'FOR COLUMNS PROD_SUBCATEGORY_ID SIZE 10');
  5  end;
  6  /

PL/SQL procedure successfully completed.

5.4 Queries
-- Count the occurrences of each distinct value
LEO@rmlis> r
  1  select count(prod_subcategory_id) as num_of_rows, prod_subcategory_id
  2    from products
  3   group by prod_subcategory_id
  4*  order by 1 desc

NUM_OF_ROWS PROD_SUBCATEGORY_ID
----------- -------------------
          8                2014
          7                2055
          6                2054
          6                2032
          5                2042
          5                2051
          5                2031
          5                2056
          4                2036
          3                2043
          2                2034
          2                2035
          2                2013
          2                2053
          2                2033
          2                2012
          1                2022
          1                2052
          1                2021
          1                2044
          1                2011
          1                2041

22 rows selected.

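Note: The column contains 22 distinct values. Because the number of buckets (10) is less than 22, the optimizer cannot create a frequency histogram. The optimizer considers both hybrid and top frequency histograms. To qualify for a top frequency histogram, the percentage of rows occupied by the 10 most frequent values must be greater than or equal to p, where p = (1 - (1/10)) * 100 = 90%. Here the 10 most frequent values occupy 54 of the 72 rows, or 75%, which is less than 90%. The optimizer therefore chooses a hybrid histogram.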

-- Check the histogram information
LEO@rmlis> select a.column_name,
  2         a.table_name,
  3         b.num_rows,
  4         a.num_distinct Cardinality,
  5         round(a.num_distinct / b.num_rows * 100, 2) selectivity,
  6         a.histogram,
  7         a.num_buckets
  8    from dba_tab_col_statistics a, dba_tables b
  9   where a.owner = b.owner
 10     and a.table_name = b.table_name
 11     and a.owner = 'LEO'
 12     and a.table_name = 'PRODUCTS';

COLUMN_NAME          TABLE_NAME             NUM_ROWS CARDINALITY SELECTIVITY HISTOGRAM       NUM_BUCKETS
-------------------- -------------------- ---------- ----------- ----------- --------------- -----------
PROD_SUBCATEGORY_ID  PRODUCTS                     72          22       30.56 HYBRID                   10

-- Check the endpoint numbers, endpoint values, and endpoint repeat counts
LEO@rmlis> select endpoint_number, endpoint_value, endpoint_repeat_count
  2    from user_histograms
  3   where table_name = 'PRODUCTS'
  4     and column_name = 'PROD_SUBCATEGORY_ID'
  5   order by 1;

ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_REPEAT_COUNT
--------------- -------------- ---------------------
              1           2011                     1
             13           2014                     8
             26           2032                     6
             36           2036                     4
             45           2043                     3
             52           2052                     1
             54           2053                     2
             60           2054                     6
             67           2055                     7
             72           2056                     5

10 rows selected.

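Summary: In a height-balanced histogram, the optimizer would distribute the 72 rows evenly across the 10 requested histogram buckets, so that each bucket holds about 7 rows. Because this is a hybrid histogram, the optimizer distributes the values so that no value occupies more than one bucket. For example, the optimizer does not put some instances of the value 2036 into one bucket and the rest into another; all instances of 2036 are in bucket 36.
The endpoint repeat count shows how many times the highest value in the bucket is repeated. Using the endpoint number and the repeat count, the optimizer can calculate cardinality. For example, bucket 36 contains instances of the values 2033, 2034, 2035, and 2036. The endpoint value 2036 has an endpoint repeat count of 4, so the optimizer knows that 4 instances of this value exist. For a value that is not an endpoint, such as 2033, the optimizer estimates cardinality using density.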
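
The bucket contents described in this summary can be read off the user_histograms output with simple arithmetic. The following Python sketch is only an illustration; the helper `bucket_summary` is a hypothetical name. For each bucket it derives the total number of rows, the rows that belong to the endpoint value, and the rows that belong to other, non-endpoint values.

```python
# (endpoint_number, endpoint_value, endpoint_repeat_count) from section 5.4
hybrid = [(1, 2011, 1), (13, 2014, 8), (26, 2032, 6), (36, 2036, 4),
          (45, 2043, 3), (52, 2052, 1), (54, 2053, 2), (60, 2054, 6),
          (67, 2055, 7), (72, 2056, 5)]


def bucket_summary(rows):
    """Map endpoint_value -> (rows_in_bucket, endpoint_rows, other_rows)."""
    summary = {}
    previous = 0
    for endpoint_number, endpoint_value, repeat_count in rows:
        bucket_rows = endpoint_number - previous  # endpoint numbers are cumulative
        summary[endpoint_value] = (bucket_rows, repeat_count,
                                   bucket_rows - repeat_count)
        previous = endpoint_number
    return summary


summary = bucket_summary(hybrid)
print(summary[2036])  # (10, 4, 6)
```

For bucket 36 this gives 10 rows in total: 4 instances of the endpoint value 2036 and 6 rows for the other values, which agrees with the two instances each of 2033, 2034, and 2035 in the test data.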

Reference: https://docs.oracle.com/database/121/TGSQL/tgsql_histo.htm#TGSQL95039