Histograms – What is Wrong with this Quote?

4 07 2012

July 4, 2012

It has been several months since I posted my review of the first half of the “Oracle Database 11gR2 Performance Tuning Cookbook” book.  I had planned to post the review for the second half of the book a couple of weeks later, however four months have passed by since that time, and the second half of the review has yet to materialize.  Someone recently left the following comment attached to one of my reviews on Amazon:

“This is the most comprehensive feedback I have ever read!!”

I appeciate the comment.  Keep in mind that particular review was only roughly 9.5 typewritten pages in length; it takes a lot of time to put together an 18, 24, or 35 typewritten page review that is just as comprehensive as the one that elicited the above comment.

On to the topic of this article…

While reading the “Oracle Database 11gR2 Performance Tuning Cookbook” book, I noticed an interesting set of statements about the use of histograms (bolded sections appeared bolded in the book):

“Histograms are useful only on indexed columns containing skewed values, because they help the optimizer to choose whether to use the index or not to access values.  Obviously, if the frequency for a specific value is very high, using the index won’t be the best choice….”

“Don’t use histograms in situations where:

  • The column is not used in the WHERE clauses of the queries
  • The data in the column is uniformly distributed
  • Bind variables are used when comparing against the column”

What, if anything, is wrong with the above quote from the book?  If possible, provide test cases to support or refute the above quote from the book (see the tips at the bottom of the blue section at the right of this blog page regarding how to post code sections).

The point of blog articles like this one is not to insult authors who have spent thousands of hours carefully constructing an accurate and helpful book, but instead to suggest that readers investigate when something stated does not exactly match what one believes to be true. It could be that the author “took a long walk down a short pier”, or that the author is revealing accurate information which simply cannot be found through other resources (and may in the process be directly contradicting information sources you have used in the past). If you do not investigate in such cases, you may lose an important opportunity to learn something that could prove to be extremely valuable.





Histograms and Bind Variables, But Why?

29 01 2011

January 29, 2011

In a recent OTN thread a person asked a couple of good questions about why histograms and the use of bind variables sometimes cause problems.  The questions did not ask whether or not one should use histograms on the table columns where those columns often appear in WHERE clauses, with the columns compared to  bind variables.  Instead, the original poster (OP) asked the all important question WHY.  In a previous article I provided my response to an OTN thread where the OP of that thread wanted to use histograms to fix bind peeking problems.

The specific questions asked in the recent OTN thread include:

When a SQL is using bind variables how histograms affect the excution plan?

Why histograms can’t work well with bind variables?

I remember a document mentioned that “do not use histograms when using bind variables”. But why? 

The answers to these questions have been answered many times in articles written by a number of authors, for example:

Rather than point the OP to one of the above articles, I decided instead to create a test case to demonstrate what could happen on Oracle Database 10.2.0.4 (simulated) and 11.2.0.2 when columns that are compared to bind variables in the WHERE clause also have histograms.  Below is my response, slightly reworded:

—-

Histograms can work with bind variables, but the end result is typically not the desired outcome. Bind variables are used to reduce the number of different execution plans. Histograms are used to help the optimizer find what is supposed to be the best execution plan for the supplied predicates, and in the case of bind variables, those are the peeked values of the bind variables. So, if you have a histogram on a column, and for the initial hard parse of the SQL statement the most common value in that column is submitted in the bind variable – the generated execution plan is considered by the optimizer to be the “best” execution plan for the supplied bind variable values. Now assume that instead, the least popular value in the column is specified – the optimizer could produce a very different execution plan for the same SQL statement, one that is optimized for the least popular value (this might be an index range scan, rather than a full table scan). Assume that the execution plan cannot change when the bind variable values change during future executions – if the table column contains a single popular value and many unpopular values, if the initial hard parse is performed with the single popular value, you could find that all future executions of that SQL statement perform full table scans, even when only a couple of rows from the table are selected.

Here is a quick test case on Oracle Database 11.2.0.2 to demonstrate:

CREATE TABLE T1 (
  C1 NUMBER,
  C2 NUMBER,
  C3 VARCHAR2(300));

INSERT INTO
  T1
SELECT
  *
FROM
  (SELECT
    ROWNUM C1,
    DECODE(MOD(ROWNUM,100),99,99,1) C2,
    RPAD('A',300,'A') C3
  FROM
    DUAL
  CONNECT BY
    LEVEL <= 1000000)
ORDER BY
  C2;

CREATE INDEX IND_T1_C2 ON T1(C2);

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,METHOD_OPT=>'FOR ALL INDEXED COLUMNS SIZE 254') 

The above created a table with 1,000,000 rows where 99% of the rows have a value of 1 in column C2 and 1% have a value of 99, and the rows are inserted with a perfect clustering factor due to the ORDER BY clause. A histogram was created on the indexed column.

Let’s try a test, we will pick an unpopular value of 2 for the bind variable when the query is initially hard parsed:

VARIABLE N1 NUMBER
EXEC :N1:=2

SELECT /*+ GATHER_PLAN_STATISTICS */
  C1,
  C2
FROM
  T1
WHERE
  C2 = :N1;

no rows selected

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SQL_ID  c7su63uw7nch6, child number 0
-------------------------------------
SELECT /*+ GATHER_PLAN_STATISTICS */   C1,   C2 FROM   T1 WHERE   C2 =
:N1

Plan hash value: 236868917

------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       3 |      1 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   5957 |      0 |00:00:00.01 |       3 |      1 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C2 |      1 |   5957 |      0 |00:00:00.01 |       3 |      1 |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C2"=:N1) 

So, there were no rows selected, the optimizer predicted that 5,957 rows would be returned, and an index access path was selected for data retrieval. Would this index access path also be appropriate for a bind variable value of 1? Let’s continue the test, this time picking the value 99 for the bind variable:

EXEC :N1:=99
SET TIMING ON

SELECT /*+ GATHER_PLAN_STATISTICS */
  C1,
  C2
FROM
  T1
WHERE
  C2 = :N1;

...
10000 rows selected.

Elapsed: 00:00:05.35

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SQL_ID  c7su63uw7nch6, child number 0
-------------------------------------
SELECT /*+ GATHER_PLAN_STATISTICS */   C1,   C2 FROM   T1 WHERE   C2 =
:N1

Plan hash value: 236868917

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |  10000 |00:00:00.02 |    1783 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   5957 |  10000 |00:00:00.02 |    1783 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C2 |      1 |   5957 |  10000 |00:00:00.01 |     690 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C2"=:N1) 

Once again, the execution plan shows that the optimizer predicted 5,957 rows would be retrieved even though 10,000 rows were actually retrieved. Notice also that the child number is still shown as 0, indicating that a hard parse was not performed. Let’s continue the test, this time with a bind variable value of 1:

EXEC :N1:=1

SET AUTOTRACE TRACEONLY STATISTICS

SELECT /*+ GATHER_PLAN_STATISTICS */
  C1,
  C2
FROM
  T1
WHERE
  C2 = :N1;

990000 rows selected.

Elapsed: 00:00:18.78

Statistics
---------------------------------------------------
          1  recursive calls
          1  db block gets
     108571  consistent gets
          0  physical reads
         96  redo size
   21958348  bytes sent via SQL*Net to client
     726508  bytes received via SQL*Net from client
      66001  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     990000  rows processed

SET AUTOTRACE OFF 

Because I used AUTOTRACE to prevent the 990,000 rows from scrolling on screen, I have to specify the SQL_ID and CHILD_NUMBER to retrieve the execution plan:

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('c7su63uw7nch6',0,'ALLSTATS LAST'));

SQL_ID  c7su63uw7nch6, child number 0
-------------------------------------
SELECT /*+ GATHER_PLAN_STATISTICS */   C1,   C2 FROM   T1 WHERE   C2 =
:N1

Plan hash value: 236868917

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |  10000 |00:00:00.02 |    1783 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   5957 |  10000 |00:00:00.02 |    1783 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C2 |      1 |   5957 |  10000 |00:00:00.01 |     690 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C2"=:N1) 

That cannot be the execution plan that was used because it still shows that 10,000 rows were retrieved during the last execution, where the AUTOTRACE statistics showed that 990,000 rows were actually retrieved. Let’s try again, this time retrieving the execution plan for CHILD_NUMBER 1:

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('c7su63uw7nch6',1,'ALLSTATS LAST'));

SQL_ID  c7su63uw7nch6, child number 1
-------------------------------------
SELECT /*+ GATHER_PLAN_STATISTICS */   C1,   C2 FROM   T1 WHERE   C2 =
:N1

Plan hash value: 3617692013

------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |    990K|00:00:00.83 |     108K|
|*  1 |  TABLE ACCESS FULL| T1   |      1 |    988K|    990K|00:00:00.83 |     108K|
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("C2"=:N1) 

The above shows the actual execution plan that was used (sse the article Explain Plan Lies, Autotrace Lies, TKPROF Lies, What is the Plan?  to see why we cannot use AUTOTRACE or EXPLAIN PLAN to see the actual execution plan). Adaptive cursor sharing (first available with Oracle Database 11.1) stepped in and forced the re-evaluation of the execution plan to prevent a very slow retrieval through the index – that re-evaluation will not happen prior to Oracle Database 11.1 (CURSOR_SHARING=’SIMILAR’ might have the same effect in older Oracle Database releases when literal values are used in the SQL statement).

Just to demonstrate:

ALTER SESSION SET OPTIMIZER_FEATURES_ENABLE='10.2.0.4';

VARIABLE N1 NUMBER
EXEC :N1:=2

SELECT /*+ GATHER_PLAN_STATISTICS */
  C1,
  C2
FROM
  T1
WHERE
  C2 = :N1;

no rows selected

Elapsed: 00:00:00.00

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SQL_ID  c7su63uw7nch6, child number 2
-------------------------------------
SELECT /*+ GATHER_PLAN_STATISTICS */   C1,   C2 FROM   T1 WHERE   C2 =
:N1

Plan hash value: 236868917

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       3 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   5957 |      0 |00:00:00.01 |       3 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C2 |      1 |   5957 |      0 |00:00:00.01 |       3 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C2"=:N1) 

Note in the above that the CHILD_NUMBER is now 2 because we changed the optimizer’s execution environment (see the articles How to Determine which First Rows OPTIMIZER_MODE was SpecifiedSELECT Statement is Fast, INSERT INTO Using the SELECT Statement is Brutally Slow 3Reviewing Session-Level Parameters to better understand what might trigger a change in the optimizer’s execution environment).

Continuing:

EXEC :N1:=99
SET TIMING ON

SELECT /*+ GATHER_PLAN_STATISTICS */
  C1,
  C2
FROM
  T1
WHERE
  C2 = :N1;

10000 rows selected.

Elapsed: 00:00:05.31

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SQL_ID  c7su63uw7nch6, child number 2
-------------------------------------
SELECT /*+ GATHER_PLAN_STATISTICS */   C1,   C2 FROM   T1 WHERE   C2 =
:N1

Plan hash value: 236868917

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |  10000 |00:00:00.02 |    1783 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   5957 |  10000 |00:00:00.02 |    1783 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C2 |      1 |   5957 |  10000 |00:00:00.01 |     690 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C2"=:N1)

The CHILD_NUMBER is still 2, so there was no hard parse.

Continuing:

EXEC :N1:=1

SET AUTOTRACE TRACEONLY STATISTICS

SELECT /*+ GATHER_PLAN_STATISTICS */
  C1,
  C2
FROM
  T1
WHERE
  C2 = :N1;

990000 rows selected.

Elapsed: 00:00:16.91

Statistics
---------------------------------------------------
          0  recursive calls
          0  db block gets
     175927  consistent gets
          0  physical reads
          0  redo size
   21958348  bytes sent via SQL*Net to client
     726508  bytes received via SQL*Net from client
      66001  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     990000  rows processed

SET AUTOTRACE OFF

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('c7su63uw7nch6',2,'ALLSTATS LAST'));

SQL_ID  c7su63uw7nch6, child number 2
-------------------------------------
SELECT /*+ GATHER_PLAN_STATISTICS */   C1,   C2 FROM   T1 WHERE   C2 =
:N1

Plan hash value: 236868917

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |    990K|00:00:01.63 |     175K|
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   5957 |    990K|00:00:01.63 |     175K|
|*  2 |   INDEX RANGE SCAN          | IND_T1_C2 |      1 |   5957 |    990K|00:00:00.68 |   67932 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C2"=:N1) 

The above is the execution plan for CHILD_NUMBER 2 – notice that this time it is reporting 990,000 rows retrieved, so this IS the execution plan that was used for the bind variable value that exists in 99% of the table rows.  Adaptive cursor sharing did not take effect and force the re-evaluation of the execution plan – the execution plan was NOT changed to a full table scan. That is the risk that you take if you allow histograms to exist on columns that have unequal distributions of values, bind variables are used in the WHERE clause that references the column, and bind variable peeking is enabled (enabled by default in Oracle Database 9i and above, bind variable peeking is controlled by the hidden parameter _OPTIM_PEEK_USER_BINDS, which defaults to TRUE).





Plan Cardinality Estimates Problem with 11.1.0.7 and 11.2.0.1

15 02 2010

February 15, 2010

Here is a fun test on Oracle Database 11.1.0.7 and 11.2.0.1 (all posted tests are from 11.2.0.1).

First, we will create a table and collect statistics for the table with indexes on a couple of the columns and histograms on a couple of the columns:

CREATE TABLE T1 (
  C1 NUMBER NOT NULL,
  C2 NUMBER NOT NULL,
  C3 NUMBER NOT NULL,
  C4 NUMBER NOT NULL,
  C5 VARCHAR2(30) NOT NULL,
  C6 VARCHAR2(30) NOT NULL,
  FILLER VARCHAR2(200),
  PRIMARY KEY (C1));

INSERT INTO T1
SELECT
  ROWNUM,
  ROWNUM,
  TRUNC(ROWNUM/100+1),
  TRUNC(ROWNUM/100+1),
  CHR(65+TRUNC(ROWNUM/10000))||TRUNC(ROWNUM/100+1),
  CHR(65+TRUNC(ROWNUM/10000))||TRUNC(ROWNUM/100+1),
  LPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=100000;

COMMIT;

CREATE INDEX IND_T1_C3 ON T1(C3);
CREATE INDEX IND_T1_C4 ON T1(C4);
CREATE INDEX IND_T1_C5 ON T1(C5);

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>100,METHOD_OPT=>'FOR COLUMNS SIZE 254 C2, C4, C6')

Just for a review of that GATHER_TABLE_STATS procedure, from the documentation:

When gathering statistics on a table, DBMS_STATS gathers information about the data distribution of the columns within the table. The most basic information about the data distribution is the maximum value and minimum value of the column. However, this level of statistics may be insufficient for the optimizer’s needs if the data within the column is skewed. For skewed data distributions, histograms can also be created as part of the column statistics to describe the data distribution of a given column. Histograms are described in more details in “Viewing Histograms“.

Histograms are specified using the METHOD_OPT argument of the DBMS_STATS gathering procedures. Oracle recommends setting the METHOD_OPT to FOR ALL COLUMNS SIZE AUTO. With this setting, Oracle Database automatically determines which columns require histograms and the number of buckets (size) of each histogram. You can also manually specify which columns should have histograms and the size of each histogram.

The table created by the above script will have 100,000 rows with indexes on columns C3, C4, and C5.  Columns C2, C4, and C6 will have histograms with 254 buckets.  Let’s see the maximum values for the table columns (column C2 values are identical to C1, column C4 values are identical to C3, column C6 values are identical to C5)):

SELECT
  MAX(C1) MAX_C1,
  MAX(C3) MAX_C3,
  MAX(C5) MAX_C5
FROM
  T1;

MAX_C1     MAX_C3 MAX_C5
------ ---------- ------
100000       1001 K1001

The maximum value for column C3 is 1001 – most of the distinct values in that column have 100 matching rows, except for the value 1001 which has a single row.  We will try a couple of tests with my VBS tool for automatically generating DBMS_XPLANs, with the DBMS_XPLAN Type set to “ALLSTATS LAST” and Stats Level set to “ALL“.  First, we will try the maximum value for column C3 (and also C4):

SELECT
  *
FROM
  T1
WHERE
  C3=1001;

The returned execution plan is displayed below (after this point the execution plan will be displayed directly below the SQL statement):

SQL_ID  0vcvak7bgbdzt, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C3=1001

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      1 |00:00:00.01 |       3 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   1000 |      1 |00:00:00.01 |       3 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    400 |      1 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C3"=1001)

OK, it was more or less expected that the cardinality estimate would be incorrect here because there is no histogram on column C3.  It is a bit odd that the optimizer predicts that the index will return 400 rows that then causes 1000 rows to be returned from the table.  Now let’s try the same SQL statement referencing column C4:

SELECT
  *
FROM
  T1
WHERE
  C4=1001;

SQL_ID  d3zfa447tsjrz, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C4=1001

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      1 |00:00:00.01 |       3 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |    100 |      1 |00:00:00.01 |       3 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |    100 |      1 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C4"=1001)

100 is closer to the actual number of rows than were predicted for the same SQL statement using column C3, so the histogram probably helped.  What happens when we specify values for C3 and C4 that exceed the maximum values in those columns?

SELECT
  *
FROM
  T1
WHERE
  C3=1101;

SQL_ID  7hy399svng33n, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C3=1101

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   1000 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    400 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C3"=1101)

-

SELECT
  *
FROM
  T1
WHERE
  C4=1101;

SQL_ID  at676unwj7uk1, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C4=1101

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |     90 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |     90 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C4"=1101)

In the above, the SQL statement that accesses column C3 continued to estimate the same number of rows would be returned when the value exceeded the maximum value for the column by roughly 10%.  When the same restriction was applied to column C4, the optimizer predicted that 10% less rows would be returned.  Interesting…

SELECT
  *
FROM
  T1
WHERE
  C3=1201;

SQL_ID  f94b21zwvsn11, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C3=1201

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   1000 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    400 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C3"=1201)

-

SELECT
  *
FROM
  T1
WHERE
  C4=1201;

SQL_ID  4sf3hjx44u1sn, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C4=1201

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |     80 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |     80 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C4"=1201)

Exceeding the maximum value by 20% returned the same cardinality estimate for the SQL statement using column C3 as seen earlier, while the cardinality estimate decreased by 20% for the SQL statement accessing column C4.

SELECT
  *
FROM
  T1
WHERE
  C3=1901;

SQL_ID  86z7nbb5u2p26, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C3=1901

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   1000 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    400 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C3"=1901)

-

SELECT
  *
FROM
  T1
WHERE
  C4=1901;

SQL_ID  08rg4uf562h4x, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C4=1901

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |     10 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |     10 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C4"=1901)

When exceeding the maximum value by 90% we see the same pattern, the cardinality estimates for the first SQL statement were unaffected, while the cardinality estimate for the second SQL statement decreased by 90%.  With a value of 2001 specified for column C4 the optimizer’s predicted cardinality decreased to 1 row.

What about value ranges?

SELECT
  *
FROM
  T1
WHERE
  C3 BETWEEN 1101 AND 1201;

SQL_ID  95zzq8vb523gf, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C3 BETWEEN 1101 AND 1201

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |    250 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    450 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C3">=1101 AND "C3"<=1201)

Notice that the optimizer is estimating that 250 rows will be returned, which is less than the sum of the estimated 1,000 rows that would be returned for the value 1101 and the estimated 1,000 rows for value 1201.  Let’s try again specifying column C4:

SELECT
  *
FROM
  T1
WHERE
  C4 BETWEEN 1101 AND 1201;

SQL_ID  9t2fv8dcz7jsv, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C4 BETWEEN 1101 AND 1201

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |    100 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |     90 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C4">=1101 AND "C4"<=1201)

Interesting – the index is predicted to return 90 rows which then causes the table to return 100 rows (edit: Oracle 10.2.0.2 predicts 90 rows for both the index and the table).  Let’s try again specifying ranges that exceed the maximum values by 90% to 100%:

SELECT
  *
FROM
  T1
WHERE
  C3 BETWEEN 1901 AND 2001;

SQL_ID  99btm350uvvbp, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C3 BETWEEN 1901 AND 2001

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |    250 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    450 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C3">=1901 AND "C3"<=2001)

-

SELECT
  *
FROM
  T1
WHERE
  C4 BETWEEN 1901 AND 2001;

SQL_ID  bnwyf98m74q26, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C4 BETWEEN 1901 AND 2001

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |    100 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |     10 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C4">=1901 AND "C4"<=2001)

The first query is predicting the same number of rows will be returned as seen earlier.  The second query is again predicting that 100 rows will be returned from the table as a result of the 10 predicted rows that will be returned from the index (edit: Oracle 10.2.0.2 predicts 10 rows for both the index and the table).  When the range was changed to 1101 through 2001, the same cardinality estimates displayed for the range of 1101 to 1201 were again returned for both SQL statements.

If the above accurately depicts what happens when the maximum recorded value for a column is exceeded, what might happen when statistics are not gathered on a regular basis?  Hold that thought.

Let’s try again using a value range for C3 and C4 that are not beyond the maximum values for the columns:

SELECT
  *
FROM
  T1
WHERE
  C3 BETWEEN 101 AND 201;

SQL_ID  8jd9h693mbkvc, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C3 BETWEEN 101 AND 201

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |  10100 |00:00:00.02 |     547 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |    250 |  10100 |00:00:00.02 |     547 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    450 |  10100 |00:00:00.01 |     124 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C3">=101 AND "C3"<=201)

-

SELECT
  *
FROM
  T1
WHERE
  C4 BETWEEN 101 AND 201;

SQL_ID  2bk0njs0atfpw, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C4 BETWEEN 101 AND 201

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |  10100 |00:00:00.02 |     547 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |  10039 |  10100 |00:00:00.02 |     547 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |  10039 |  10100 |00:00:00.01 |     124 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C4">=101 AND "C4"<=201)

Once again, the first SQL statement is estimating that the index range scan will return 450 rows which will then cause the table to return 250 rows.  In actuality, both the index and the table return 10,100 rows (do we possibly have a statistics problem here?).  The second query returns cardinality estimates that are closer to the actual number of rows returned – the histogram helped here.

Let’s try again with a wider value range:

SELECT
  *
FROM
  T1
WHERE
  C3 BETWEEN 101 AND 1401;

SQL_ID  0bapwrbn3ch8j, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C3 BETWEEN 101 AND 1401

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |  90001 |00:00:00.13 |    4865 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |    250 |  90001 |00:00:00.13 |    4865 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    450 |  90001 |00:00:00.04 |    1090 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C3">=101 AND "C3"<=1401)

-

SELECT
  *
FROM
  T1
WHERE
  C4 BETWEEN 101 AND 1401;

SQL_ID  3ppn43z5ukur6, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C4 BETWEEN 101 AND 1401

Plan hash value: 3617692013

------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |  90001 |00:00:00.05 |    4147 |
|*  1 |  TABLE ACCESS FULL| T1   |      1 |  89961 |  90001 |00:00:00.05 |    4147 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(("C4">=101 AND "C4"<=1401))

Again, the cardinality estimates and execution plan are the same as before for the query using column C3, even though we are selecting 90% of the rows in the table.  The query using column C4 switched to a full table scan, and has a much more accurate cardinality estimate.  This same pattern holds true regardless of the values specified for the low and high ends of the range.

Let’s switch back to SQL*Plus and try a couple of experiments with bind variables (note that I am actually submitting the SQL statements using my Toy Project so that I do not have to watch all of the rows scroll on the screen):

VARIABLE N1 NUMBER
VARIABLE N2 NUMBER

ALTER SESSION SET STATISTICS_LEVEL='ALL';

EXEC :N1:=101
EXEC :N2:=1401

SET ARRAYSIZE 100

SELECT
  *
FROM
  T1
WHERE
  C3 BETWEEN :N1 AND :N2;

...

SELECT
  *
FROM
  TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SQL_ID  2ksn64btq6fx4, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C3 BETWEEN :1 AND :2

Plan hash value: 108045900

----------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |      1 |        |  90001 |00:00:00.18 |    4865 |
|*  1 |  FILTER                      |           |      1 |        |  90001 |00:00:00.18 |    4865 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1        |      1 |    250 |  90001 |00:00:00.13 |    4865 |
|*  3 |    INDEX RANGE SCAN          | IND_T1_C3 |      1 |    450 |  90001 |00:00:00.04 |    1090 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(:1<=:2)
   3 - access("C3">=:1 AND "C3"<=:2)

The same execution plan as returned when we used literals (constants), so bind peeking is probably working.  Now let’s try the query that accesses column C4:

SELECT
  *
FROM
  T1
WHERE
  C4 BETWEEN :N1 AND :N2;

...

SELECT
  *
FROM
  TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SQL_ID  gwgk5h8u3k4tp, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C4 BETWEEN :1 AND :2

Plan hash value: 3332582666

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |  90001 |00:00:00.10 |    4147 |
|*  1 |  FILTER            |      |      1 |        |  90001 |00:00:00.10 |    4147 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |  89961 |  90001 |00:00:00.05 |    4147 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(:1<=:2)
   2 - filter(("C4">=:1 AND "C4"<=:2))

Bind variable peeking was used to obtain a close estimate for the cardinalities.  So, what happens when we change the value of bind variable N2?

EXEC :N2:=101

SELECT
  *
FROM
  T1
WHERE
  C3 BETWEEN :N1 AND :N2;

SELECT
  *
FROM
  TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SQL_ID  2ksn64btq6fx4, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C3 BETWEEN :1 AND :2

Plan hash value: 108045900

----------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |      1 |        |    100 |00:00:00.01 |       7 |
|*  1 |  FILTER                      |           |      1 |        |    100 |00:00:00.01 |       7 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1        |      1 |    250 |    100 |00:00:00.01 |       7 |
|*  3 |    INDEX RANGE SCAN          | IND_T1_C3 |      1 |    450 |    100 |00:00:00.01 |       3 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(:1<=:2)
   3 - access("C3">=:1 AND "C3"<=:2)

-

SELECT
  *
FROM
  T1
WHERE
  C3 BETWEEN :N1 AND :N2;

SELECT
  *
FROM
  TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SQL_ID  gwgk5h8u3k4tp, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C4 BETWEEN :1 AND :2

Plan hash value: 3332582666

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |    100 |00:00:00.01 |    3263 |
|*  1 |  FILTER            |      |      1 |        |    100 |00:00:00.01 |    3263 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |  89961 |    100 |00:00:00.01 |    3263 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(:1<=:2)
   2 - filter(("C4">=:1 AND "C4"<=:2))

The bind variable in the presence of bind peeking with a histogram came back to bit us – ideally Oracle would have used an index range scan for this set of bind variables.

Let’s add another 10,000 rows to the table without gathering statistics and see what happens to the cardinality estimates in the plans:

INSERT INTO T1
SELECT
  (100000+ROWNUM),
  (100000+ROWNUM),
  TRUNC((100000+ROWNUM)/100+1),
  TRUNC((100000+ROWNUM)/100+1),
  CHR(65+TRUNC((100000+ROWNUM)/10000))||TRUNC((100000+ROWNUM)/100+1),
  CHR(65+TRUNC((100000+ROWNUM)/10000))||TRUNC((100000+ROWNUM)/100+1),
  LPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=100000;

COMMIT;

ALTER SYSTEM FLUSH SHARED_POOL;

Here are the execution plans:

SELECT
  *
FROM
  T1
WHERE
  C3=1901;

SQL_ID  86z7nbb5u2p26, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C3=1901

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |    100 |00:00:00.01 |       8 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   1000 |    100 |00:00:00.01 |       8 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    400 |    100 |00:00:00.01 |       4 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C3"=1901)

-

SELECT
  *
FROM
  T1
WHERE
  C4=1901;

SQL_ID  08rg4uf562h4x, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C4=1901

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |    100 |00:00:00.01 |       8 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |     10 |    100 |00:00:00.01 |       8 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |     10 |    100 |00:00:00.01 |       4 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C4"=1901)

The cardinality estimates are unchanged from what we saw earlier.  The first query is estimating 10 times too many rows will be returned, and the second query is estimating 10 times too few rows will be returned.  Let’s delete the rows that were just added and check a couple of statistics:

DELETE FROM
  T1
WHERE
  C1>100000;

100000 rows deleted.

COMMIT;

SELECT
  SUBSTR(COLUMN_NAME,1,3) COL,
  DENSITY,
  NUM_BUCKETS,
  LAST_ANALYZED
FROM
  DBA_TAB_COLUMNS
WHERE
  TABLE_NAME='T1'
ORDER BY
  1;

COL    DENSITY NUM_BUCKETS LAST_ANAL
--- ---------- ----------- ---------
C1
C2      .00001         254 14-FEB-10
C3
C4   .00099998         254 14-FEB-10
C5
C6   .00099998         254 14-FEB-10
FIL

No density values for columns C1, C3, or C5 and the LAST_ANALYZED column is NULL for those same entries (edit: same results on Oracle 10.2.0.2).  Let’s try collecting statistics again without specifying the columns for which histograms should be created:

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>100,NO_INVALIDATE=>FALSE)

SELECT
  SUBSTR(COLUMN_NAME,1,3) COL,
  DENSITY,
  NUM_BUCKETS,
  LAST_ANALYZED
FROM
  DBA_TAB_COLUMNS
WHERE
  TABLE_NAME='T1'
ORDER BY
  1;

COL    DENSITY NUM_BUCKETS LAST_ANAL
--- ---------- ----------- ---------
C1      .00001           1 14-FEB-10
C2      .00001           1 14-FEB-10
C3  .000999001           1 14-FEB-10
C4  .000999001           1 14-FEB-10
C5  .000999001           1 14-FEB-10
C6  .000999001           1 14-FEB-10
FIL          1           1 14-FEB-10

Now we have no histograms, but the density and LAST_ANALYZED columns are populated for all rows returned by the above query (edit: same results on Oracle 10.2.0.2).  For fun let’s retry one of the queries that returned odd cardinality estimates earlier:

SELECT
  *
FROM
  T1
WHERE
  C3 BETWEEN 101 AND 201;

SQL_ID  8jd9h693mbkvc, child number 0
-------------------------------------
SELECT    *  FROM    T1  WHERE    C3 BETWEEN 101 AND 201

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |  10100 |00:00:00.02 |     547 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |  10200 |  10100 |00:00:00.02 |     547 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |  10200 |  10100 |00:00:00.01 |     124 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C3">=101 AND "C3"<=201)

The estimated cardinality is just about as accurate as it was with the query that accessed column C4 which had a histogram.

What are your thoughts?  Is the METHOD_OPT parameter of DBMS_STATS.GATHER_TABLE_STATS only used to specify how histograms will be collected for columns, or does that parameter also specify for which columns column-level statistics should be gathered?  If you again collect statistics using the DBMS_STATS.GATHER_TABLE_STATS command found near the start of this article, the query of DBA_TAB_COLUMNS returns the following, which is what was originally expected (edit: same results on Oracle 10.2.0.2):

COL    DENSITY NUM_BUCKETS LAST_ANAL
--- ---------- ----------- ---------
C1      .00001           1 14-FEB-10
C2      .00001         254 14-FEB-10
C3  .000999001           1 14-FEB-10
C4   .00099998         254 14-FEB-10
C5  .000999001           1 14-FEB-10
C6   .00099998         254 14-FEB-10
FIL          1           1 14-FEB-10

——-

Follow-up (8 hours after this article was scheduled to appear):

It is important at times to check more than one source in the Oracle documentation (bold/italic emphasis is mine):

“METHOD_OPT – The value controls column statistics collection and histogram creation.”

Here is an example where Christian Antognini tried to drive that point home:

“Since the syntax is FOR ALL INDEXED COLUMNS, you are gathering statistics for all indexed columns only. I.e. not for all columns. FOR ALL COLUMNS should be used for that…”

“When the option FOR ALL INDEXED COLUMNS is specified, columns statistics are gathered only for the indexed columns.”

“My point was that the parameter method_opt not only impacts histograms, but also column statistics.”

Greg Rahn also made the point in this blog article.

“The METHOD_OPT parameter of DBMS_STATS controls two things:

  1. on which columns statistics will be collected
  2. on which columns histograms will be collected (and how many buckets)”

The message of this blog article is to make certain that you know what the GATHER_TABLE_STATS procedure is actually doing when you use the METHOD_OPT parameter (or alter the default value for the parameter).  The apparent benefit from having the histogram in place might actually be a false benefit – it might be that previously you were not updating the column statistics for that column, until you started creating a histogram on that column (assuming that you were previously collecting histograms on other columns in the table, and you assumed that column statistics were updated for the remaining columns).

Incidentally, if instead of using this to collect statistics for the table:

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>100,METHOD_OPT=>'FOR COLUMNS SIZE 254 C2, C4, C6')

You used this:

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>100,METHOD_OPT=>'FOR ALL INDEXED COLUMNS size skewonly')

You would end up with the following when executing the query against DBA_TAB_COLUMNS:

COL    DENSITY NUM_BUCKETS LAST_ANAL
--- ---------- ----------- ---------
C1      .00001           1 15-FEB-10
C2
C3  .000999001           1 15-FEB-10
C4  .000999001           1 15-FEB-10
C5   .00099998         254 15-FEB-10
C6
FIL

The above output confirms Christian Antognini’s comment in the OTN thread – column-level statistics were not collected for columns C2, C6, and FILLER.  There is effectively no histogram on columns C3 and C4 (a single bucket, therefore no histogram), but the cardinality estimates in the plan will be close because column-level statistics are present for columns C3 and C4.





DATE Datatype Or NUMBER Datatype – Which Should be Used?

6 01 2010

January 6, 2010

In a recent discussion thread on the Oracle forums, the following question was asked:
http://forums.oracle.com/forums/thread.jspa?threadID=1007653

I have a scenario where I need to store the data in the format YYYYMM (e.g. 201001 which means January, 2010).  I am trying to evaluate what is the most appropriate datatype to store this kind of data. I am comparing 2 options, NUMBER and DATE.  As the data is essentially a component of oracle date datatype and experts like Tom Kyte have proved (with examples) that using right datatype is better for optimizer. So I was expecting that using DATE datatype will yield (at least) similar (if not better) cardinality estimates than using NUMBER datatype. However, my tests show that when using DATE the cardinality estimates are way off from actuals whereas sing NUMBER the cardinality estimates are much closer to actuals.
My questions are:
1) What should be the most appropriate datatype used to store YYYYMM data?
2) Why does using DATE datatype yield estimates that are way off from actuals than using NUMBER datatype?

Test case (update Jan 7, 2010 : there was a copy-paste error in the line for collecting statistics on table B – the original version of the script posted here collected statistics on table A twice):

create table a nologging as select to_number(to_char(add_months(to_date('200101','YYYYMM'),level - 1), 'YYYYMM')) id from dual connect by level <= 289;

create table b (id number) ;

begin
  for i in 1..8192
  loop
    insert into b select * from a ;
  end loop;
  commit;
end;
/ 

alter table a add dt date;

alter table b add dt date;

update a set dt = to_date(id, 'YYYYMM');

update b set dt = to_date(id, 'YYYYMM');

commit;

exec dbms_stats.gather_table_stats(user, 'A', estimate_percent=>NULL);

exec dbms_stats.gather_table_stats(user, 'B', estimate_percent=>NULL);

explain plan for select count(*) from b where id between 200810 and 200903;
select * from table(dbms_xplan.display);

explain plan for select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');
select * from table(dbms_xplan.display);

This is an interesting problem, why would using the NUMBER datatype yield better cardinality estimates than the example with the DATE datatype?  When the NUMBER datatype was used, the optimizer predicted that the full table scan operation would return 46,604 rows, while the optimizer predicted that the full table scan would return 5,919 rows when the DATE datatype was used – the actual number of rows returned is 49,152.

The person who posted the above test case later stated that he believes that the DATE datatype is the correct choice, but he would have a difficult time justifying that opinion when confronted by someone suggesting the use of the NUMBER data type.

I posted the test results from my run with Oracle 11.1.0.7:

SQL> set autotrace traceonly explain
SQL> select count(*) from b where id between 200810 and 200903 ;

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |  4715   (1)| 00:00:57 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS FULL| B    |   108K|   527K|  4715   (1)| 00:00:57 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("ID"<=200903 AND "ID">=200810)

SQL> select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM') ;

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |  4718   (2)| 00:00:57 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 57166 |   446K|  4718   (2)| 00:00:57 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DT"<=TO_DATE(' 2009-03-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND "DT">=TO_DATE(' 2008-10-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))

SQL> set autotrace off
SQL> select count(*) from b where id between 200810 and 200903 ;

  COUNT(*)
----------
     49152

SQL> select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM') ;

  COUNT(*)
----------
     49152 

Well, it seems that Oracle 11.1.0.7 predicted that when the NUMBER datatype was used, the full table scan would return roughly 108,000 rows.  Oracle 11.1.0.7 predicted that when the DATE datatype was used, the full table scan would return 57,166 rows – significantly closer to the actual number of 49,152.  If there were an index on that column, how would the different cardinality estimates affect the possibility that the optimizer might select to use that index rather than a full table scan?  What if the data volume were increased by a factor of, say 1,000 or 1,000,000?

I also captured a 10053 trace during the test run, and found this in the trace file:

******************************************
----- Current SQL Statement for this session (sql_id=7uk18xj0z9uxf) -----
select count(*) from b where id between 200810 and 200903
*******************************************
...
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for B[B]

  Table: B  Alias: B
    Card: Original: 2367488.000000  Rounded: 108124  Computed: 108124.16  Non Adjusted: 108124.16
  Access Path: TableScan
    Cost:  4714.53  Resp: 4714.53  Degree: 0
      Cost_io: 4670.00  Cost_cpu: 636216240
      Resp_io: 4670.00  Resp_cpu: 636216240
  Best:: AccessPath: TableScan
         Cost: 4714.53  Degree: 1  Resp: 4714.53  Card: 108124.16  Bytes: 0

***************************************
...
...
...
******************************************
----- Current SQL Statement for this session (sql_id=2ac0k15zjdg5x) -----
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM')
*******************************************
...

***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for B[B]

  Table: B  Alias: B
    Card: Original: 2367488.000000  Rounded: 57166  Computed: 57165.51  Non Adjusted: 57165.51
  Access Path: TableScan
    Cost:  4717.89  Resp: 4717.89  Degree: 0
      Cost_io: 4670.00  Cost_cpu: 684264079
      Resp_io: 4670.00  Resp_cpu: 684264079
  Best:: AccessPath: TableScan
         Cost: 4717.89  Degree: 1  Resp: 4717.89  Card: 57165.51  Bytes: 0

***************************************

Notice in the above that Oracle’s statistics gathering process did not create histograms when I collected statistics for the tables.  The calculated cost is the same for either datatype, but what would happen if that table were then joined to another table?  Is the optimizer seeing histograms in some of the original poster’s test cases?

The original poster is running Oracle Database 10.2.0.1, so I ran a test on Oracle 10.2.0.2 with OPTIMIZER_FEATURES_ENABLE set to 10.2.0.1, using the data created by the OP’s data creation script:

ALTER SESSION SET TRACEFILE_IDENTIFIER = 'DateTest';
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';

select count(*) from b where id between 200810 and 200903 ;

select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM') ;

ALTER SESSION SET EVENTS '10053 trace name context off';

set autotrace traceonly explain

select count(*) from b where id between 200810 and 200903 ;

select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM') ;

set autotrace off

select count(*) from b where id between 200810 and 200903;

The output from the above script, when run on Oracle 10.2.0.2 follows:

SQL> select count(*) from b where id between 200810 and 200903 ;

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |   919  (11)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS FULL| B    |   108K|   527K|   919  (11)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("ID"<=200903 AND "ID">=200810)

SQL> select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM') ;

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |   926  (12)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 57166 |   446K|   926  (12)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DT"<=TO_DATE('2009-03-01 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss') AND "DT">=TO_DATE('2008-10-01 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss'))

SQL> set autotrace off

SQL> select count(*) from b where id between 200810 and 200903 ;

  COUNT(*)
----------
     49152 

The estimated cardinalities appear to be identical to that of Oracle 11.1.0.7, so why was the original poster seeing different cardinality estimates?  Here is the output from the 10053 trace file:

******************************************
Current SQL statement for this session:
select count(*) from b where id between 200810 and 200903
*******************************************
...
  PARAMETERS WITH ALTERED VALUES
  ******************************
  optimizer_features_enable           = 10.2.0.1
  *********************************
...
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table:  B  Alias:  B
    #Rows: 2367488  #Blks:  16725  AvgRowLen:  13.00
***************************************
SINGLE TABLE ACCESS PATH
  Column (#1): ID(NUMBER)
    AvgLen: 5.00 NDV: 289 Nulls: 0 Density: 0.0034602 Min: 200101 Max: 202501
  Table:  B  Alias: B    
    Card: Original: 2367488  Rounded: 108124  Computed: 108124.16  Non Adjusted: 108124.16
  Access Path: TableScan
    Cost:  918.67  Resp: 918.67  Degree: 0
      Cost_io: 819.00  Cost_cpu: 632570063
      Resp_io: 819.00  Resp_cpu: 632570063
  Best:: AccessPath: TableScan
         Cost: 918.67  Degree: 1  Resp: 918.67  Card: 108124.16  Bytes: 0
***************************************
...
...
...
******************************************
Current SQL statement for this session:
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM')
*******************************************
...
  *************************************
  PARAMETERS WITH ALTERED VALUES
  ******************************
  optimizer_features_enable           = 10.2.0.1
  *********************************
...
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table:  B  Alias:  B
    #Rows: 2367488  #Blks:  16725  AvgRowLen:  13.00
***************************************
SINGLE TABLE ACCESS PATH
  Column (#2): DT(DATE)
    AvgLen: 8.00 NDV: 289 Nulls: 0 Density: 0.0034602 Min: 2451911 Max: 2460677
  Table:  B  Alias: B    
    Card: Original: 2367488  Rounded: 57166  Computed: 57165.51  Non Adjusted: 57165.51
  Access Path: TableScan
    Cost:  926.24  Resp: 926.24  Degree: 0
      Cost_io: 819.00  Cost_cpu: 680617902
      Resp_io: 819.00  Resp_cpu: 680617902
  Best:: AccessPath: TableScan
         Cost: 926.24  Degree: 1  Resp: 926.24  Card: 57165.51  Bytes: 0

Notice that no histograms were collected based on the 10053 trace file.

Now a second test, this time we will instruct Oracle to create histograms, and also force the optimizer to hard parse the SQL statements that reference table B when those SQL statements are re-executed:

exec dbms_stats.gather_table_stats(user, 'B', estimate_percent=>NULL,method_opt=>'FOR ALL COLUMNS SIZE 254',no_invalidate=>false);

ALTER SESSION SET TRACEFILE_IDENTIFIER = 'DateTest2';
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';

select count(*) from b where id between 200810 and 200903 ;

select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');

ALTER SESSION SET EVENTS '10053 trace name context off';

set autotrace traceonly explain

select count(*) from b where id between 200810 and 200903;

select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');

set autotrace off

So, what is the output of the above?

SQL> select count(*) from b where id between 200810 and 200903;

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |   919  (11)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 46604 |   227K|   919  (11)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("ID"<=200903 AND "ID">=200810)

SQL> select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |   926  (12)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 46604 |   364K|   926  (12)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DT"<=TO_DATE('2009-03-01 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss') AND "DT">=TO_DATE('2008-10-01 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss')) 

Interesting, both queries estimate that the full table scan operation will return 46,604 rows – interesting.  That cardinality estimate exactly matches the cardinality estimate in the OP’s plan for the SQL statement that accessed the NUMBER datatype…

For fun, let’s look in the 10053 trace file:

******************************************
Current SQL statement for this session:
select count(*) from b where id between 200810 and 200903
*******************************************
...
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table:  B  Alias:  B
    #Rows: 2367488  #Blks:  16725  AvgRowLen:  13.00
***************************************
SINGLE TABLE ACCESS PATH
  Column (#1): ID(NUMBER)
    AvgLen: 5.00 NDV: 289 Nulls: 0 Density: 0.0034602 Min: 200101 Max: 202501
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 255
  Table:  B  Alias: B    
    Card: Original: 2367488  Rounded: 46604  Computed: 46604.09  Non Adjusted: 46604.09
  Access Path: TableScan
    Cost:  918.76  Resp: 918.76  Degree: 0
      Cost_io: 819.00  Cost_cpu: 633149246
      Resp_io: 819.00  Resp_cpu: 633149246
  Best:: AccessPath: TableScan
         Cost: 918.76  Degree: 1  Resp: 918.76  Card: 46604.09  Bytes: 0
...
...
...
******************************************
Current SQL statement for this session:
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM')
*******************************************
...
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table:  B  Alias:  B
    #Rows: 2367488  #Blks:  16725  AvgRowLen:  13.00
***************************************
SINGLE TABLE ACCESS PATH
  Column (#2): DT(DATE)
    AvgLen: 8.00 NDV: 289 Nulls: 0 Density: 0.0034602 Min: 2451911 Max: 2460677
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 255
  Table:  B  Alias: B    
    Card: Original: 2367488  Rounded: 46604  Computed: 46604.09  Non Adjusted: 46604.09
  Access Path: TableScan
    Cost:  926.22  Resp: 926.22  Degree: 0
      Cost_io: 819.00  Cost_cpu: 680499006
      Resp_io: 819.00  Resp_cpu: 680499006
  Best:: AccessPath: TableScan
         Cost: 926.22  Degree: 1  Resp: 926.22  Card: 46604.09  Bytes: 0

The 10053 trace file shows that in both cases a height balanced histogram with 254 buckets was created.  But, how accurate would the estimate be if there were 1,000 or 1,000,000 times as many rows?  What if the time interval were changed to something else?  What if each of the 289 distinct values for the ID and DT columns did not have an equal distribution of values?

So, why select a DATE datatype rather than a NUMBER datatype?  These are the reasons that I proposed in the discussion thread:

One of the problems with putting date values in number columns is this – if you select the range from 200810 to 200903, the optimizer will likely make the assumption that 200810 is just as likely of a number as 200808, 200812, 200814, 200816, 200820, 200890, 200900, etc. Some of those year/month combinations are simply not possible. In such a case, the optimizer should over-estimate the number of rows returned from that range when the column data type is NUMBER, and should be reasonably close when the column data type is DATE, since the optimizer knows that 200814 (14/1/2008), 200816 (16/1/2008), 200820 (20/1/2008), 200890 (90/1/2008), 200900 (0/1/2009), etc. could never be dates (and would be completely out of the serial sequence of dates). By putting the date type data into a DATE column, you have essentially added a constraint to the database to prevent invalid dates from being added. Additionally, date math, such as finding the number of days between 200802 and 200803 (compared to 200702 and 200703) is very simple – the answer is not 1 in both cases, but rather 29 and 28, respectively.

Any other comments?

OK, enough guessing, let’s try a couple tests.  Here is the test table, with 10,000,000 rows with an uneven distribution of rows for each value:

DROP TABLE B PURGE;

CREATE TABLE B AS
SELECT
  TO_NUMBER(TO_CHAR(TRUNC(TO_DATE('200101','YYYYDD')+SQRT(ROWNUM),'MM'),'YYYYMM')) ID,
  TRUNC(TO_DATE('200101','YYYYDD')+SQRT(ROWNUM),'MM') DT
FROM
  (SELECT
    ROWNUM RN
  FROM
    DUAL
  CONNECT BY
    LEVEL<=1000) V1,
  (SELECT
    ROWNUM RN
  FROM
    DUAL
  CONNECT BY
    LEVEL<=10000) V2;

CREATE INDEX IND_B_ID ON B(ID);
CREATE INDEX IND_B_DT ON B(DT);

SET LINESIZE 130
SET PAGESIZE 1000

The first test script with no histograms, supplying various ranges for year-month combinations while gathering execution statistics, and displaying the actual execution plans:

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'B',CASCADE=>TRUE,METHOD_OPT=>'FOR ALL COLUMNS SIZE 1',NO_INVALIDATE=>FALSE);

SPOOL HISTOGRAMTEST.TXT

SELECT COUNT(*) FROM B;

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200810 AND 200903;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200810,'YYYYMM') AND TO_DATE(200903,'YYYYMM');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200110 AND 200203;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200110,'YYYYMM') AND TO_DATE(200203,'YYYYMM');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200210 AND 200303;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200210,'YYYYMM') AND TO_DATE(200303,'YYYYMM');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200812 AND 200901;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200812,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200112 AND 200201;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200112,'YYYYMM') AND TO_DATE(200201,'YYYYMM');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200612 AND 200901;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200612,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SPOOL OFF

The trimmed output (from Oracle 11.1.0.7) of the first date range follows:

SQL> SELECT COUNT(*) FROM B;

  COUNT(*)                                                                                                                       
----------                                                                                                                       
  10000000                                                                                                                       

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200810 AND 200903;

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.06 |    2371 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.06 |    2371 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1310K|   1063K|00:00:00.01 |    2371 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("ID">=200810 AND "ID"<=200903)                                                                                     

SQL>                                                                                                                                   
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200810,'YYYYMM') AND TO_DATE(200903,'YYYYMM');

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.06 |    2816 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.06 |    2816 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    674K|   1063K|00:00:00.01 |    2816 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("DT">=TO_DATE(' 2008-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')                                                    
              AND "DT"<=TO_DATE(' 2009-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))                                               

In the above, the estimated number of rows with the comparison on the numeric column is about 250,000 rows above the actual, while the comparison on the date column is about 400,000 rows below the actual – could this be enough of a difference to change the execution plan if the clustering factor of the indexes were high?  What if the tables had a larger average row length?  What if this table were joined with another table?  Note that the number of logical blocks accessed is less with the index on the numeric column.

The trimmed output of the second date range follows:

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200110 AND 200203;

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     299 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     299 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1344K|    132K|00:00:00.01 |     299 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("ID">=200110 AND "ID"<=200203)                                                                                     

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200110,'YYYYMM') AND TO_DATE(200203,'YYYYMM');

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     353 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     353 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    674K|    132K|00:00:00.01 |     353 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("DT">=TO_DATE(' 2001-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')                                                    
              AND "DT"<=TO_DATE(' 2002-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))                                               

In the above, the estimated number of rows with the comparison on the numeric column is about 10 times as high as the actual number of rows, while the estimated number of rows with the comparison on the date column is about 5 times as high.  Could this be enough to trigger a different execution plan for the queries – where one uses an index access, while the other uses a full table scan?

The trimmed output of the third date range follows:

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200210 AND 200303;

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     594 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     594 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1344K|    265K|00:00:00.01 |     594 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("ID">=200210 AND "ID"<=200303)                                                                                     

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200210,'YYYYMM') AND TO_DATE(200303,'YYYYMM');

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     705 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     705 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    674K|    265K|00:00:00.01 |     705 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("DT">=TO_DATE(' 2002-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')                                                    
              AND "DT"<=TO_DATE(' 2003-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))                                               

In this case the estimates are about the same as the previous test, but the actual number of rows has doubled.  The optimizer’s estimates are again in favor of the date datatype.

The trimmed output of the fourth date range follows:

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200812 AND 200901;

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     810 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     810 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1285K|    362K|00:00:00.01 |     810 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("ID">=200812 AND "ID"<=200901)                                                                                     

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200812,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     962 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     962 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    291K|    362K|00:00:00.01 |     962 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("DT">=TO_DATE(' 2008-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')                                                    
              AND "DT"<=TO_DATE(' 2009-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))                                               

This time, the optimizer’s estimate when the date datatype was used is very close, while the optimizer’s estimate when the numeric datatype was used is about 4 times greater than the actual number of rows.

The trimmed output of the fifth date range follows:

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200112 AND 200201;

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     104 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     104 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1295K|  45260 |00:00:00.01 |     104 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("ID">=200112 AND "ID"<=200201)                                                                                     

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200112,'YYYYMM') AND TO_DATE(200201,'YYYYMM');

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     122 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     122 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    291K|  45260 |00:00:00.01 |     122 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("DT">=TO_DATE(' 2001-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')                                                    
              AND "DT"<=TO_DATE(' 2002-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))                                               

This time with the numeric datatype, the optimizer is estimating 1,295,000 rows when in fact only 45,260 are returned during the index range scan.  The estimate with the date datatype is also quite high, but it is 4 to 5 times lower (thus closer to the actual) than with the numeric datatype.

Now, the final trimmed output:

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200612 AND 200901;

--------------------------------------------------------------------------------------------                                     
| Id  | Operation             | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                     
--------------------------------------------------------------------------------------------                                     
|   0 | SELECT STATEMENT      |          |      1 |        |      1 |00:00:00.78 |   22345 |                                     
|   1 |  SORT AGGREGATE       |          |      1 |      1 |      1 |00:00:00.78 |   22345 |                                     
|*  2 |   INDEX FAST FULL SCAN| IND_B_ID |      1 |   3764K|   4054K|00:00:00.22 |   22345 |                                     
--------------------------------------------------------------------------------------------                                     

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - filter(("ID">=200612 AND "ID"<=200901))                                                                                   

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200612,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

-------------------------------------------------------------------------------------                                            
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                            
-------------------------------------------------------------------------------------                                            
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.47 |   24950 |                                            
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.47 |   24950 |                                            
|*  2 |   TABLE ACCESS FULL| B    |      1 |   2623K|   4054K|00:00:00.12 |   24950 |                                            
-------------------------------------------------------------------------------------                                            

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - filter(("DT">=TO_DATE(' 2006-12-01 00:00:00', 'syyyy-mm-dd                                                                
              hh24:mi:ss') AND "DT"<=TO_DATE(' 2009-01-01 00:00:00', 'syyyy-mm-dd                                                
              hh24:mi:ss'))) 

Here, with the wider date range, the optimizer is closer with the numeric data type, and has selected to perform a fast full scan of the index, which would use multi-block reads if disk accesses were required.  Note that the optimizer selected to perform a full table scan when the date datatype was used, even though the estimated number of rows was less.  Note too that this in-memory operation completed about twice as fast as the in-memory index fast full scan operation.

Now let’s take a look at what happens when a histogram is present on each of the columns.  The script is identical to the previous script, except for the first two lines:

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'B',CASCADE=>TRUE,METHOD_OPT=>'FOR ALL COLUMNS SIZE 254',NO_INVALIDATE=>FALSE);

SPOOL HISTOGRAMTEST2.TXT

Below are the results from this test run:

SQL> SELECT COUNT(*) FROM B;

  COUNT(*)     
----------     
  10000000     

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200810 AND 200903;

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.09 |    2371 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.09 |    2371 |   
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1083K|   1063K|00:00:00.01 |    2371 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("ID">=200810 AND "ID"<=200903)         

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200810,'YYYYMM') AND TO_DATE(200903,'YYYYMM');

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.09 |    2816 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.09 |    2816 |   
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |   1083K|   1063K|00:00:00.01 |    2816 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("DT">=TO_DATE(' 2008-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')              
              AND "DT"<=TO_DATE(' 2009-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))         

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200110 AND 200203;

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     299 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     299 |   
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |    130K|    132K|00:00:00.01 |     299 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("ID">=200110 AND "ID"<=200203)         

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200110,'YYYYMM') AND TO_DATE(200203,'YYYYMM');

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     353 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     353 |   
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    130K|    132K|00:00:00.01 |     353 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("DT">=TO_DATE(' 2001-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')              
              AND "DT"<=TO_DATE(' 2002-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))         

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200210 AND 200303;

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     594 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     594 |   
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |    284K|    265K|00:00:00.01 |     594 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("ID">=200210 AND "ID"<=200303)         

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200210,'YYYYMM') AND TO_DATE(200303,'YYYYMM');

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     705 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     705 |   
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    284K|    265K|00:00:00.01 |     705 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("DT">=TO_DATE(' 2002-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')              
              AND "DT"<=TO_DATE(' 2003-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))         

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200812 AND 200901;

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     810 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     810 |   
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |    383K|    362K|00:00:00.01 |     810 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("ID">=200812 AND "ID"<=200901)         

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200812,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     962 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     962 |   
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    383K|    362K|00:00:00.01 |     962 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("DT">=TO_DATE(' 2008-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')              
              AND "DT"<=TO_DATE(' 2009-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))         

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200112 AND 200201;

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     104 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     104 |   
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |  39391 |  45260 |00:00:00.01 |     104 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("ID">=200112 AND "ID"<=200201)         

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200112,'YYYYMM') AND TO_DATE(200201,'YYYYMM');

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     122 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     122 |   
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |  39391 |  45260 |00:00:00.01 |     122 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("DT">=TO_DATE(' 2001-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')              
              AND "DT"<=TO_DATE(' 2002-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))         

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200612 AND 200901;

--------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |      1 |        |      1 |00:00:00.75 |   22345 |
|   1 |  SORT AGGREGATE       |          |      1 |      1 |      1 |00:00:00.75 |   22345 |
|*  2 |   INDEX FAST FULL SCAN| IND_B_ID |      1 |   4053K|   4054K|00:00:00.22 |   22345 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - filter(("ID">=200612 AND "ID"<=200901))       

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200612,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

-------------------------------------------------------------------------------------      
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |      
-------------------------------------------------------------------------------------      
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.47 |   24950 |      
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.47 |   24950 |      
|*  2 |   TABLE ACCESS FULL| B    |      1 |   4053K|   4054K|00:00:00.12 |   24950 |      
-------------------------------------------------------------------------------------      

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - filter(("DT">=TO_DATE(' 2006-12-01 00:00:00', 'syyyy-mm-dd                          
              hh24:mi:ss') AND "DT"<=TO_DATE(' 2009-01-01 00:00:00', 'syyyy-mm-dd          
              hh24:mi:ss')))                         

As the above indicates, with a 254 bucket histogram on both of the columns the optimizer calculates estimated row counts that are typically very close to the actual row counts for both datatypes – essentially the only difference is the number of logical reads.  So, adding the histogram helps, but then what if the OP implements good coding standards and uses bind variables rather than constants (literals)?

So, which datatype would you choose, and why?





Using Histograms to Fix Bind Peeking Problems?

30 11 2009

November 30, 2009

The following question appeared on the OTN forums (http://forums.oracle.com/forums/thread.jspa?threadID=993929):

What is the solution to [bind] variable peeking without going to 11g?
i got answer before as stored outline , but i don’t think this will fix it as stored outlines will stablise the plan which we don’t want , i think histogram is a better solution?

Consider the following test case, which might leave you wondering if creating a histogram on a column used by bind variables is a good idea.
The set up:

SHOW PARAMETER OPTIMIZER

NAME                                 TYPE        VALUE
------------------------------------ ----------- --------
optimizer_dynamic_sampling           integer     2
optimizer_features_enable            string      10.2.0.4
optimizer_index_caching              integer     0
optimizer_index_cost_adj             integer     100
optimizer_mode                       string      ALL_ROWS
optimizer_secure_view_merging        boolean     TRUE

CREATE TABLE T10 AS
SELECT
  ROWNUM COL1,
  DECODE(MOD(ROWNUM,1000),1,1,2,2,3,3,DECODE(MOD(ROWNUM,25),10,10,11,11,25)) COL2,
  LPAD('A',255,'A') COL3
FROM
  (SELECT
    ROWNUM RN
  FROM
    DUAL
  CONNECT BY
    LEVEL<=10000) V1,
  (SELECT
    ROWNUM RN
  FROM
    DUAL
  CONNECT BY
    LEVEL<=1000) V2;

CREATE INDEX IND_T10_1 ON T10(COL1);
CREATE INDEX IND_T10_2 ON T10(COL2);

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T10',CASCADE=>TRUE,METHOD_OPT=>'FOR COLUMNS SIZE 254 COL2')

SELECT
  COL2,
  COUNT(*) NUM,
  COUNT(*)/10000000*100 PERCENT
FROM
  T10
GROUP BY
  COL2
ORDER BY
  COL2;

COL2        NUM    PERCENT
---- ---------- ----------
   1      10000         .1
   2      10000         .1
   3      10000         .1
  10     400000          4
  11     400000          4
  25    9170000       91.7 

The above created a 10,000,000 row table with 6 distinct values in COL2. 0.1% of the rows have a value of 1 in COL2, and 91.7% of the rows have a value of 25 in COL2. There is an index with a histogram on COL2. Obviously (or not) if we have only COL2=1 in the WHERE clause, we probably would want to use the index on the COL2 column to retrieve rows. Obviously (or not) if we have only COL2=25 in the WHERE clause, we probably would want to use a full table scan to retrieve rows. So, what happens when bind variable peeking takes place when a histogram is present on COL2? Ignore for a moment the elapsed time that is output in the following (note that I flush the buffer cache to force physical reads for consistency – direct I/O is enabled):

VARIABLE N1 NUMBER
EXEC :N1:=1

ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;
ALTER SYSTEM FLUSH BUFFER_CACHE;

SET TIMING ON

SELECT /*+ GATHER_PLAN_STATISTICS */
  COL2,
  COUNT(COL1) C1
FROM
  T10
WHERE
  COL2= :N1
GROUP BY
  COL2;

      COL2         C1
---------- ----------
         1      10000

Elapsed: 00:00:42.72

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

-------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Starts | E-Rows | A-Rows |A-Time      | Buffers | Reads  |
-------------------------------------------------------------------------------------------------------------
|   1 |  SORT GROUP BY NOSORT        |           |      1 |      1 |      1 |00:00:42.29 |   10022 |  10022 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T10       |      1 |   8856 |  10000 |00:00:39.03 |   10022 |  10022 |
|*  3 |    INDEX RANGE SCAN          | IND_T10_2 |      1 |   8856 |  10000 |00:00:00.06 |      22 |     22 |
-------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("COL2"=:N1)

EXEC :N1:=25

ALTER SYSTEM FLUSH BUFFER_CACHE;
ALTER SYSTEM FLUSH BUFFER_CACHE;

SELECT /*+ GATHER_PLAN_STATISTICS */
  COL2,
  COUNT(COL1) C1
FROM
  T10
WHERE
  COL2= :N1
GROUP BY
  COL2;

      COL2         C1
---------- ----------
        25    9170000

Elapsed: 00:00:32.37

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

-------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Starts | E-Rows | A-Rows |A-Time      | Buffers | Reads  |
-------------------------------------------------------------------------------------------------------------
|   1 |  SORT GROUP BY NOSORT        |           |      1 |      1 |      1 |00:00:32.35 |     402K|    402K|
|   2 |   TABLE ACCESS BY INDEX ROWID| T10       |      1 |   8856 |   9170K|00:00:27.57 |     402K|    402K|
|*  3 |    INDEX RANGE SCAN          | IND_T10_2 |      1 |   8856 |   9170K|00:00:09.22 |   17879 |  17879 |
-------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("COL2"=:N1)

EXEC :N1:=25

ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;
ALTER SYSTEM FLUSH BUFFER_CACHE;

SELECT /*+ GATHER_PLAN_STATISTICS */
  COL2,
  COUNT(COL1) C1
FROM
  T10
WHERE
  COL2= :N1
GROUP BY
  COL2;

      COL2         C1
---------- ----------
        25    9170000

Elapsed: 00:00:20.76

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------
|   1 |  SORT GROUP BY NOSORT|      |      1 |      1 |      1 |00:00:20.57 |     384K|    384K|
|*  2 |   TABLE ACCESS FULL  | T10  |      1 |   9234K|   9170K|00:00:27.54 |     384K|    384K|
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("COL2"=:N1)

EXEC :N1:=1

ALTER SYSTEM FLUSH BUFFER_CACHE;
ALTER SYSTEM FLUSH BUFFER_CACHE;

SELECT /*+ GATHER_PLAN_STATISTICS */
  COL2,
  COUNT(COL1) C1
FROM
  T10
WHERE
  COL2= :N1
GROUP BY
  COL2;

      COL2         C1
---------- ----------
         1      10000

Elapsed: 00:00:20.20

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------
|   1 |  SORT GROUP BY NOSORT|      |      1 |      1 |      1 |00:00:20.19 |     384K|    384K|
|*  2 |   TABLE ACCESS FULL  | T10  |      1 |   9234K|  10000 |00:00:28.73 |     384K|    384K|
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("COL2"=:N1)
 

The above shows that the first time the SQL statement is hard parsed, a bind variable value of 1 is set, which causes an index range scan regardless if the query will select 0.1% of the rows or 91.7% of the rows. OK, if we then flush the shared pool and first set the bind variable value to 25, a full table scan is used regardless if we select 91.7% of the rows or 0.1% of the rows. You will note that when the full table scan is used when the bind variable was set to 25 the query completed in 20.76 seconds, and when an index range scan was used with the same bind variable value the query completed in 32.37 seconds.

OK so far, now the potentially confusing part. When an index range scan was used for both bind variable values, Oracle counted the 0.1% of the matching rows (10000) in 42.72 seconds, while counting 91.7% of the rows (9,170,000) in just 32.37 seconds. You might be wondering why Oracle is able to return the result of counting 91.7% of the rows by the index range scan faster than it is able to count 0.1% of the rows – I will leave that for your investigation.

Now, reviewing the above, what is better?:
* Allow the bind variable values submitted during the hard parse to determine the execution plan.
* Use a stored outline to lock the execution plan to always use an index range scan.
* Use a stored outline to lock the execution plan to always use a full table scan.
* Disable bind variable peeking.
* Not enough information is available.