Histograms – What is Wrong with this Quote?

4 07 2012

July 4, 2012

It has been several months since I posted my review of the first half of the “Oracle Database 11gR2 Performance Tuning Cookbook” book.  I had planned to post the review for the second half of the book a couple of weeks later, however four months have passed by since that time, and the second half of the review has yet to materialize.  Someone recently left the following comment attached to one of my reviews on Amazon:

“This is the most comprehensive feedback I have ever read!!”

I appeciate the comment.  Keep in mind that particular review was only roughly 9.5 typewritten pages in length; it takes a lot of time to put together an 18, 24, or 35 typewritten page review that is just as comprehensive as the one that elicited the above comment.

On to the topic of this article…

While reading the “Oracle Database 11gR2 Performance Tuning Cookbook” book, I noticed an interesting set of statements about the use of histograms (bolded sections appeared bolded in the book):

“Histograms are useful only on indexed columns containing skewed values, because they help the optimizer to choose whether to use the index or not to access values.  Obviously, if the frequency for a specific value is very high, using the index won’t be the best choice….”

“Don’t use histograms in situations where:

  • The column is not used in the WHERE clauses of the queries
  • The data in the column is uniformly distributed
  • Bind variables are used when comparing against the column”

What, if anything, is wrong with the above quote from the book?  If possible, provide test cases to support or refute the above quote from the book (see the tips at the bottom of the blue section at the right of this blog page regarding how to post code sections).

The point of blog articles like this one is not to insult authors who have spent thousands of hours carefully constructing an accurate and helpful book, but instead to suggest that readers investigate when something stated does not exactly match what one believes to be true. It could be that the author “took a long walk down a short pier”, or that the author is revealing accurate information which simply cannot be found through other resources (and may in the process be directly contradicting information sources you have used in the past). If you do not investigate in such cases, you may lose an important opportunity to learn something that could prove to be extremely valuable.


Actions

Information

10 responses

4 07 2012
Mladen Gogala

Well, there are several things here. First, histogram can also be used on a non-indexed column, in a nester loop type join, to determine the access path to the inner table. Also, it may be used to decide between hash join and sort/merge join,
Second, the phrase “do not use histograms in the situations where columns is not used in the where clauses” strikes me as odd. Histograms are not used in the situations, they’re objects, created on columns or, in version 11G, groups of columns or column expressions.
Third, the advice about the bind variables. Oracle has bind variable peeking since version 9i and the first value used is put through the CBO paces, including histograms, to determine the execution plan. Some newer versions have something called “adaptive cursor sharing”, which will cause the statement to be re-parsed if the bind variable values have different characteristics. Again, ACS and bind aware cursors rely on histograms.
About the only thing I concur with is the statement about histograms being unnecessary in case of uniform distribution. That is also my pet peeve with Oracle: it should not create histograms on columns with unique constraints and declared not null. Single column primary keys should not have histograms, period.
Finally, happy 4th of July to everybody. Charles, you should ignore histograms today and devote some attention to BBQ. Ah yes: histograms should not be used on 4th of July.

4 07 2012
Charles Hooper

Mladen,

Histograms on the 4th of July, probably not the best combination. 😉

Nice answers to start the dialog regarding this quote from the book. In the book, the third bullet point item mentioned that adaptive cursor sharing with bind peeking would be covered in a later recipe, but the bullet point item did not mention whether or not adaptive cursor sharing makes the bullet pointed statement stronger or weaker.

I found something related to the book quote, from the Oracle Database 8.1.7 documentation:
http://docs.oracle.com/cd/A87860_01/doc/server.817/a76992/stats.htm#27069

“Histograms are not useful for columns with the following characteristics:
* All predicates on the column use bind variables.
* The column data is uniformly distributed.
* The column is not used in WHERE clauses of queries.
* The column is unique and is used only with equality predicates.”

Equality predicates… that must be the tie in with the 4th of July.

Since “skewed values” is in bold in the book, I wonder if the definition of “skewed” is important. To me, that term means that some of the distinct values in a column have significant more (or less) rows with that value, when compared to other distinct values in the same column. Uniformly distributed could mean that all of the distinct values in a column have roughly the same number of rows sharing those distinct values.

Let’s set up a simple test case. Since I read somewhere that NULL values cause problems, I will use an “out of bound” value of 10,000 in place of any NULL that might naturally appear in a column. The table built will contain the numbers 0 through 100 in column C2 and 1 through 100 in column C3 with every 101st row considered to have a NULL value in that column (thus the value 10,000):

DROP TABLE T2 PURGE;
 
CREATE TABLE T2 AS
SELECT
  ROWNUM C1,
  MOD(ROWNUM,101) C2,
  DECODE(MOD(ROWNUM,101),0,10000,MOD(ROWNUM,101)) C3,
  RPAD('A',200,'A') C4
FROM
  DUAL
CONNECT BY
  LEVEL<=100000;
 
COMMIT;
 
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T2',METHOD_OPT=>'FOR ALL COLUMNS SIZE 1')

In the above, statistics were collected on the table without histograms. Let’s try a couple of tests. First, I want to see the rows with either the maximum value in column C3, or the “out of bound” value representing a NULL.

 
SELECT /*+ GATHER_PLAN_STATISTICS */
  C1,
  C2,
  C3
FROM
  T2
WHERE
  C3 IN (100,10000);
 
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
 
SQL_ID  a7htzq376rch4, child number 0
-------------------------------------
SELECT /*+ GATHER_PLAN_STATISTICS */   C1,   C2,   C3 FROM   T2 WHERE
C3 IN (100,10000)
 
Plan hash value: 1513984157
 
------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |   1980 |00:00:00.04 |    3168 |
|*  1 |  TABLE ACCESS FULL| T2   |      1 |   1980 |   1980 |00:00:00.04 |    3168 |
------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(("C3"=100 OR "C3"=10000))

That worked well without a histogram, the cardinality estimate is exactly correct. If table T2 were then joined to another table, the accurate cardinality estimate could help the optimizer determine if table T2 should be the driving table, and which type of join method should be most efficient.

Let’s try again with a SQL statement that produces the same resultset as the above SQL statement:

SELECT /*+ GATHER_PLAN_STATISTICS */
  C1,
  C2,
  C3
FROM
  T2
WHERE
  C3 BETWEEN 100 AND 10000;
 
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
 
SQL_ID  g04bcqbxbut9p, child number 0
-------------------------------------
SELECT /*+ GATHER_PLAN_STATISTICS */   C1,   C2,   C3 FROM   T2 WHERE
C3 BETWEEN 100 AND 10000
 
Plan hash value: 1513984157
 
------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |   1980 |00:00:00.02 |    3168 |
|*  1 |  TABLE ACCESS FULL| T2   |      1 |    100K|   1980 |00:00:00.02 |    3168 |
------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(("C3">=100 AND "C3"<=10000))

That did not work so well. There are only 100,000 rows in the table, and the optimizer is predicting that roughly 100,000 rows will be returned by the query. If the resultset from table T2 were then joined with another table, the performance could be less than optimal due to the error in the cardinality calculation.

Let’s try again, this time trying to retrieve the values without a “NULL” in column C3:

SELECT /*+ GATHER_PLAN_STATISTICS */
  C1,
  C2,
  C3
FROM
  T2
WHERE
  C3 BETWEEN 1 AND 100;
 
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
 
SQL_ID  6z9pxyd4pguz8, child number 0
-------------------------------------
SELECT /*+ GATHER_PLAN_STATISTICS */   C1,   C2,   C3 FROM   T2 WHERE
C3 BETWEEN 1 AND 100
 
Plan hash value: 1513984157
 
------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |  99010 |00:00:00.12 |    9457 |
|*  1 |  TABLE ACCESS FULL| T2   |      1 |   1980 |  99010 |00:00:00.12 |    9457 |
------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
--------------------------------------------------
   1 - filter(("C3"<=100 AND "C3">=1))

The optimizer also had difficulty with the cardinality calculation for this query. Not a big deal for this query because there is no join, and no index on column C3. But, what if there were a join or an index on column C3?

Let’s try again with a histogram on column C3:

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T2',METHOD_OPT=>'FOR COLUMNS C3 SIZE 254',NO_INVALIDATE=>FALSE)

The first query:

SELECT /*+ GATHER_PLAN_STATISTICS */
  C1,
  C2,
  C3
FROM
  T2
WHERE
  C3 IN (100,10000);
 
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
 
SQL_ID  a7htzq376rch4, child number 0
-------------------------------------
SELECT /*+ GATHER_PLAN_STATISTICS */   C1,   C2,   C3 FROM   T2 WHERE
C3 IN (100,10000)
 
Plan hash value: 1513984157
 
------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |   1980 |00:00:00.04 |    3168 |
|*  1 |  TABLE ACCESS FULL| T2   |      1 |   2265 |   1980 |00:00:00.04 |    3168 |
------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(("C3"=100 OR "C3"=10000))

The cardinality estimate is less accurate than before the histogram was built, but the estimate is still reasonably close.

The second query:

SELECT /*+ GATHER_PLAN_STATISTICS */
  C1,
  C2,
  C3
FROM
  T2
WHERE
  C3 BETWEEN 100 AND 10000;
 
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
 
SQL_ID  g04bcqbxbut9p, child number 0
-------------------------------------
SELECT /*+ GATHER_PLAN_STATISTICS */   C1,   C2,   C3 FROM   T2 WHERE
C3 BETWEEN 100 AND 10000
 
Plan hash value: 1513984157
 
------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |   1980 |00:00:00.03 |    3168 |
|*  1 |  TABLE ACCESS FULL| T2   |      1 |   2265 |   1980 |00:00:00.03 |    3168 |
------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(("C3">=100 AND "C3"<=10000))

The cardinality estimate is significantly better than we saw before the histogram was built.

The third query:

SELECT /*+ GATHER_PLAN_STATISTICS */
  C1,
  C2,
  C3
FROM
  T2
WHERE
  C3 BETWEEN 1 AND 100;
  
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
 
SQL_ID  6z9pxyd4pguz8, child number 0
-------------------------------------
SELECT /*+ GATHER_PLAN_STATISTICS */   C1,   C2,   C3 FROM   T2 WHERE
C3 BETWEEN 1 AND 100
 
Plan hash value: 1513984157
 
------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |  99010 |00:00:00.12 |    9457 |
|*  1 |  TABLE ACCESS FULL| T2   |      1 |  98863 |  99010 |00:00:00.12 |    9457 |
------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(("C3"<=100 AND "C3">=1))

Again, the cardinality estimate is significantly better than we saw before the histogram was built.

Sorry, I might have skewed too far off topic. Any other comments? “What, if anything, is wrong with the above quote from the book?” I will withhold my comments in this article for at least 24 hours.

5 07 2012
Wolfgang Breitling

“About the only thing I concur with is the statement about histograms being unnecessary in case of uniform distribution. That is also my pet peeve with Oracle: it should not create histograms on columns with unique constraints and declared not null. Single column primary keys should not have histograms, period.”
Oracle 11 does not gather a histogram for such columns – unless the column values have huge outliers and in that case it makes sense to have a histogram if you have a range predicate on that column. Where it still gathers a (frequency ) histogram is for columns with a single value for all rows and there I can not see any rationale for having a histogram. I’m not sure right now, would have to test it, but it may be that the optimizer ignores the statistics ( density ) in this case and uses a more sensible density calculated at run time.

5 07 2012
Mladen Gogala

Thanks Wolfgang, this is very good to know.

5 07 2012
Mladen Gogala

On the other hand, I just checked and here is my result:
SQL> analyze table emp delete statistics;

Table analyzed.

Elapsed: 00:00:00.03
SQL> select column_name,count(*) from user_tab_histograms
2 where table_name=’EMP’
3 group by column_name;

no rows selected

SQL> get /tmp/1
1 begin
2 dbms_stats.gather_table_stats(
3 ownname => user,
4 tabname => ‘EMP’,
5 method_opt => ‘FOR ALL INDEXED COLUMNS SIZE 254′);
6* end;
SQL> /

colum column_name format a15
SQL> select column_name,count(*) from user_tab_histograms
2 where table_name=’EMP’
3 group by column_name;

COLUMN_NAME COUNT(*)
————— ———-
SYS_NC00009$ 12
ENAME 14
EMPNO 14

Elapsed: 00:00:00.02
SQL>

In other words, Oracle has duly collected stats on the single-column primary key (EMPNO). The version is 11G:

SQL> select * from v$version;

BANNER
——————————————————————————–
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 – 64bit Production
PL/SQL Release 11.2.0.3.0 – Production
CORE 11.2.0.3.0 Production
TNS for Linux: Version 11.2.0.3.0 – Production
NLSRTL Version 11.2.0.3.0 – Production

Elapsed: 00:00:00.01
SQL>

January and April PSU’s are applied. This is as new as it gets.

5 07 2012
Wolfgang Breitling

But you asked it to do that with your method_opt of ‘… size 254′. Then, of course it does what you asked. It will do the right thing if you use method_opt=>’for all columns size auto’

6 07 2012
Mladen Gogala

Wolfgang, why would I do “size auto” when I don’t know what size of a histogram will Oracle create? I am usually using size 254 or size 127 but never “size auto”. In some releases, SIZE AUTO means that there will be only 2 buckets: minimum and maximum, which is about the same as no histograms at all.

8 07 2012
Wolfgang Breitling

This is a reply to Mladen’s last reply. For some reason there is no reply “button” on his reply.

If you do a size nnn then that is exactly what gather_stats will do. It could be fewer buckets if there are fewer than nnn distinct values of the columns. If there are more distinct values than nnn then you end up with a height-balanced histogram. In any case, yo WILL get a histogram because that is what you requested. If you ask for a histogram with method_opt=>’… size nnn’ then I there is no reason to go with anything but size 254 – unless you deliberately want to force a height-balanced histogram for a column with fewer than 255 distinct values. But in that case you better know exactly what you are doing and you may actually be better off to not gather the histogram in that case but construct it the way you want manually.
If “size auto” gives you only two buckets then that is either because of one of the cases we discussed, e.g. the column of a single-column unique or primary key, where gather_stats determined that a histogram makes no sense. Or the column has never been used in a where clause.

5 07 2012
Reiner Kuehl

I agree with Mladen: The first statement seems definitely wrong: “Histograms are useful only on indexed columns…” That’s at least what I have learned from Wolfgang Breitling in HISTOGRAMS – MYTHS AND FACTS (http://www.centrexcc.com/Histograms%20-%20Myths%20and%20Facts.pdf). Wolfgang shows clearly that Histograms on non-indexed columns are very useful when evaluating the cost of different join methods and orders.
The second point refers to bind variables and histograms: It’s true that in the past histograms and bind variables on the same column have been a very dangerous combination. With Adaptive Cursor Sharing (ACS) I would recommend on having a histogram on such columns. In other words: I would regard upon the “Don’t use histogram” for columns with Bind Variables as misleading to say the least.

6 07 2012
Charles Hooper

There are a lot of excellent comments attached to this blog article. The linked article written by Wolfgang Breitling is excellent – I am sure that I read the article at least once a couple of years ago (a few changes were made to the query optimizer in Oracle Database over the years, so there are a couple of minor items in the article that no longer work as described in the latest Oracle Database releases). I thought that I would share my review comments for this section of the book (the following review comments echo many of the comments provided by other readers of this blog article).

The quoted section from the book seems to be closely derived from a page of the Oracle Database 8.1.7 documentation which was modified and finally removed from later releases of the documentation.

Histograms are potentially useful when created on non-indexed columns, with the goal of improving cardinality estimates, which are used to determine join order and join type when multiple tables are referenced in a SQL statement.

Histograms are also potentially useful when the data is not skewed, helping to improve the cardinality estimates of range scans when outlier values are present (such as an out of range value taking the place of NULL values). See the test case in one of my comments above.

Regarding the statement, “Obviously, if the frequency for a specific value is very high, using the index won’t be the best choice,” the frequency of a value in a column is but one of several factors that the optimizer considers, but a high frequency of a value is not necessarily a problem if the use of the index avoids the need to visit the related table.

Regarding the bullet points that describes when histograms should not be used, “The column is not used in the WHERE clauses of the queries,” with the introduction of ANSI join syntax in Oracle Database 9.0.1, a column that only appears in JOIN clauses could potentially benefit from a histogram.

Prior to the introduction of adaptive cursor sharing in Oracle Database 11.1, bind variables when combined with bind variable peeking and histograms could generate execution plans that, while efficient for the first execution, were potentially considerably inefficient when different bind variable values were submitted during later executions of the SQL statement. Adaptive cursor sharing attempts to address the issues caused by the combination of bind variable peeking and histograms. In cases where the same value is always submitted for a comparison with a submitted bind variable, a histogram on the comparison column could significantly improve the optimizer’s cardinality estimates such that a more efficient execution plan is developed (consider a case when an index on a STATUS column is present, and that STATUS column contains the value of either PROCESSED or NOT PROCESSED). The book only mentioned that adaptive cursor sharing would be discussed in a later recipe – there was no comment in this section of the book regarding whether adaptive cursor sharing helped or further hindered the bullet-pointed items in the book.

Leave a reply to Mladen Gogala Cancel reply