DATE Datatype Or NUMBER Datatype – Which Should be Used?

6 01 2010

January 6, 2010

In a recent discussion thread on the Oracle forums, the following question was asked:
http://forums.oracle.com/forums/thread.jspa?threadID=1007653

I have a scenario where I need to store the data in the format YYYYMM (e.g. 201001 which means January, 2010).  I am trying to evaluate what is the most appropriate datatype to store this kind of data. I am comparing 2 options, NUMBER and DATE.  As the data is essentially a component of oracle date datatype and experts like Tom Kyte have proved (with examples) that using right datatype is better for optimizer. So I was expecting that using DATE datatype will yield (at least) similar (if not better) cardinality estimates than using NUMBER datatype. However, my tests show that when using DATE the cardinality estimates are way off from actuals whereas sing NUMBER the cardinality estimates are much closer to actuals.
My questions are:
1) What should be the most appropriate datatype used to store YYYYMM data?
2) Why does using DATE datatype yield estimates that are way off from actuals than using NUMBER datatype?

Test case (update Jan 7, 2010 : there was a copy-paste error in the line for collecting statistics on table B – the original version of the script posted here collected statistics on table A twice):

create table a nologging as select to_number(to_char(add_months(to_date('200101','YYYYMM'),level - 1), 'YYYYMM')) id from dual connect by level <= 289;

create table b (id number) ;

begin
  for i in 1..8192
  loop
    insert into b select * from a ;
  end loop;
  commit;
end;
/ 

alter table a add dt date;

alter table b add dt date;

update a set dt = to_date(id, 'YYYYMM');

update b set dt = to_date(id, 'YYYYMM');

commit;

exec dbms_stats.gather_table_stats(user, 'A', estimate_percent=>NULL);

exec dbms_stats.gather_table_stats(user, 'B', estimate_percent=>NULL);

explain plan for select count(*) from b where id between 200810 and 200903;
select * from table(dbms_xplan.display);

explain plan for select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');
select * from table(dbms_xplan.display);

This is an interesting problem, why would using the NUMBER datatype yield better cardinality estimates than the example with the DATE datatype?  When the NUMBER datatype was used, the optimizer predicted that the full table scan operation would return 46,604 rows, while the optimizer predicted that the full table scan would return 5,919 rows when the DATE datatype was used – the actual number of rows returned is 49,152.

The person who posted the above test case later stated that he believes that the DATE datatype is the correct choice, but he would have a difficult time justifying that opinion when confronted by someone suggesting the use of the NUMBER data type.

I posted the test results from my run with Oracle 11.1.0.7:

SQL> set autotrace traceonly explain
SQL> select count(*) from b where id between 200810 and 200903 ;

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |  4715   (1)| 00:00:57 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS FULL| B    |   108K|   527K|  4715   (1)| 00:00:57 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("ID"<=200903 AND "ID">=200810)

SQL> select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM') ;

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |  4718   (2)| 00:00:57 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 57166 |   446K|  4718   (2)| 00:00:57 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DT"<=TO_DATE(' 2009-03-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND "DT">=TO_DATE(' 2008-10-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))

SQL> set autotrace off
SQL> select count(*) from b where id between 200810 and 200903 ;

  COUNT(*)
----------
     49152

SQL> select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM') ;

  COUNT(*)
----------
     49152 

Well, it seems that Oracle 11.1.0.7 predicted that when the NUMBER datatype was used, the full table scan would return roughly 108,000 rows.  Oracle 11.1.0.7 predicted that when the DATE datatype was used, the full table scan would return 57,166 rows – significantly closer to the actual number of 49,152.  If there were an index on that column, how would the different cardinality estimates affect the possibility that the optimizer might select to use that index rather than a full table scan?  What if the data volume were increased by a factor of, say 1,000 or 1,000,000?

I also captured a 10053 trace during the test run, and found this in the trace file:

******************************************
----- Current SQL Statement for this session (sql_id=7uk18xj0z9uxf) -----
select count(*) from b where id between 200810 and 200903
*******************************************
...
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for B[B]

  Table: B  Alias: B
    Card: Original: 2367488.000000  Rounded: 108124  Computed: 108124.16  Non Adjusted: 108124.16
  Access Path: TableScan
    Cost:  4714.53  Resp: 4714.53  Degree: 0
      Cost_io: 4670.00  Cost_cpu: 636216240
      Resp_io: 4670.00  Resp_cpu: 636216240
  Best:: AccessPath: TableScan
         Cost: 4714.53  Degree: 1  Resp: 4714.53  Card: 108124.16  Bytes: 0

***************************************
...
...
...
******************************************
----- Current SQL Statement for this session (sql_id=2ac0k15zjdg5x) -----
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM')
*******************************************
...

***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for B[B]

  Table: B  Alias: B
    Card: Original: 2367488.000000  Rounded: 57166  Computed: 57165.51  Non Adjusted: 57165.51
  Access Path: TableScan
    Cost:  4717.89  Resp: 4717.89  Degree: 0
      Cost_io: 4670.00  Cost_cpu: 684264079
      Resp_io: 4670.00  Resp_cpu: 684264079
  Best:: AccessPath: TableScan
         Cost: 4717.89  Degree: 1  Resp: 4717.89  Card: 57165.51  Bytes: 0

***************************************

Notice in the above that Oracle’s statistics gathering process did not create histograms when I collected statistics for the tables.  The calculated cost is the same for either datatype, but what would happen if that table were then joined to another table?  Is the optimizer seeing histograms in some of the original poster’s test cases?

The original poster is running Oracle Database 10.2.0.1, so I ran a test on Oracle 10.2.0.2 with OPTIMIZER_FEATURES_ENABLE set to 10.2.0.1, using the data created by the OP’s data creation script:

ALTER SESSION SET TRACEFILE_IDENTIFIER = 'DateTest';
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';

select count(*) from b where id between 200810 and 200903 ;

select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM') ;

ALTER SESSION SET EVENTS '10053 trace name context off';

set autotrace traceonly explain

select count(*) from b where id between 200810 and 200903 ;

select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM') ;

set autotrace off

select count(*) from b where id between 200810 and 200903;

The output from the above script, when run on Oracle 10.2.0.2 follows:

SQL> select count(*) from b where id between 200810 and 200903 ;

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |   919  (11)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS FULL| B    |   108K|   527K|   919  (11)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("ID"<=200903 AND "ID">=200810)

SQL> select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM') ;

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |   926  (12)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 57166 |   446K|   926  (12)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DT"<=TO_DATE('2009-03-01 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss') AND "DT">=TO_DATE('2008-10-01 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss'))

SQL> set autotrace off

SQL> select count(*) from b where id between 200810 and 200903 ;

  COUNT(*)
----------
     49152 

The estimated cardinalities appear to be identical to that of Oracle 11.1.0.7, so why was the original poster seeing different cardinality estimates?  Here is the output from the 10053 trace file:

******************************************
Current SQL statement for this session:
select count(*) from b where id between 200810 and 200903
*******************************************
...
  PARAMETERS WITH ALTERED VALUES
  ******************************
  optimizer_features_enable           = 10.2.0.1
  *********************************
...
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table:  B  Alias:  B
    #Rows: 2367488  #Blks:  16725  AvgRowLen:  13.00
***************************************
SINGLE TABLE ACCESS PATH
  Column (#1): ID(NUMBER)
    AvgLen: 5.00 NDV: 289 Nulls: 0 Density: 0.0034602 Min: 200101 Max: 202501
  Table:  B  Alias: B    
    Card: Original: 2367488  Rounded: 108124  Computed: 108124.16  Non Adjusted: 108124.16
  Access Path: TableScan
    Cost:  918.67  Resp: 918.67  Degree: 0
      Cost_io: 819.00  Cost_cpu: 632570063
      Resp_io: 819.00  Resp_cpu: 632570063
  Best:: AccessPath: TableScan
         Cost: 918.67  Degree: 1  Resp: 918.67  Card: 108124.16  Bytes: 0
***************************************
...
...
...
******************************************
Current SQL statement for this session:
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM')
*******************************************
...
  *************************************
  PARAMETERS WITH ALTERED VALUES
  ******************************
  optimizer_features_enable           = 10.2.0.1
  *********************************
...
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table:  B  Alias:  B
    #Rows: 2367488  #Blks:  16725  AvgRowLen:  13.00
***************************************
SINGLE TABLE ACCESS PATH
  Column (#2): DT(DATE)
    AvgLen: 8.00 NDV: 289 Nulls: 0 Density: 0.0034602 Min: 2451911 Max: 2460677
  Table:  B  Alias: B    
    Card: Original: 2367488  Rounded: 57166  Computed: 57165.51  Non Adjusted: 57165.51
  Access Path: TableScan
    Cost:  926.24  Resp: 926.24  Degree: 0
      Cost_io: 819.00  Cost_cpu: 680617902
      Resp_io: 819.00  Resp_cpu: 680617902
  Best:: AccessPath: TableScan
         Cost: 926.24  Degree: 1  Resp: 926.24  Card: 57165.51  Bytes: 0

Notice that no histograms were collected based on the 10053 trace file.

Now a second test, this time we will instruct Oracle to create histograms, and also force the optimizer to hard parse the SQL statements that reference table B when those SQL statements are re-executed:

exec dbms_stats.gather_table_stats(user, 'B', estimate_percent=>NULL,method_opt=>'FOR ALL COLUMNS SIZE 254',no_invalidate=>false);

ALTER SESSION SET TRACEFILE_IDENTIFIER = 'DateTest2';
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';

select count(*) from b where id between 200810 and 200903 ;

select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');

ALTER SESSION SET EVENTS '10053 trace name context off';

set autotrace traceonly explain

select count(*) from b where id between 200810 and 200903;

select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');

set autotrace off

So, what is the output of the above?

SQL> select count(*) from b where id between 200810 and 200903;

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |   919  (11)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 46604 |   227K|   919  (11)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("ID"<=200903 AND "ID">=200810)

SQL> select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |   926  (12)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 46604 |   364K|   926  (12)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DT"<=TO_DATE('2009-03-01 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss') AND "DT">=TO_DATE('2008-10-01 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss')) 

Interesting, both queries estimate that the full table scan operation will return 46,604 rows – interesting.  That cardinality estimate exactly matches the cardinality estimate in the OP’s plan for the SQL statement that accessed the NUMBER datatype…

For fun, let’s look in the 10053 trace file:

******************************************
Current SQL statement for this session:
select count(*) from b where id between 200810 and 200903
*******************************************
...
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table:  B  Alias:  B
    #Rows: 2367488  #Blks:  16725  AvgRowLen:  13.00
***************************************
SINGLE TABLE ACCESS PATH
  Column (#1): ID(NUMBER)
    AvgLen: 5.00 NDV: 289 Nulls: 0 Density: 0.0034602 Min: 200101 Max: 202501
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 255
  Table:  B  Alias: B    
    Card: Original: 2367488  Rounded: 46604  Computed: 46604.09  Non Adjusted: 46604.09
  Access Path: TableScan
    Cost:  918.76  Resp: 918.76  Degree: 0
      Cost_io: 819.00  Cost_cpu: 633149246
      Resp_io: 819.00  Resp_cpu: 633149246
  Best:: AccessPath: TableScan
         Cost: 918.76  Degree: 1  Resp: 918.76  Card: 46604.09  Bytes: 0
...
...
...
******************************************
Current SQL statement for this session:
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM')
*******************************************
...
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table:  B  Alias:  B
    #Rows: 2367488  #Blks:  16725  AvgRowLen:  13.00
***************************************
SINGLE TABLE ACCESS PATH
  Column (#2): DT(DATE)
    AvgLen: 8.00 NDV: 289 Nulls: 0 Density: 0.0034602 Min: 2451911 Max: 2460677
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 255
  Table:  B  Alias: B    
    Card: Original: 2367488  Rounded: 46604  Computed: 46604.09  Non Adjusted: 46604.09
  Access Path: TableScan
    Cost:  926.22  Resp: 926.22  Degree: 0
      Cost_io: 819.00  Cost_cpu: 680499006
      Resp_io: 819.00  Resp_cpu: 680499006
  Best:: AccessPath: TableScan
         Cost: 926.22  Degree: 1  Resp: 926.22  Card: 46604.09  Bytes: 0

The 10053 trace file shows that in both cases a height balanced histogram with 254 buckets was created.  But, how accurate would the estimate be if there were 1,000 or 1,000,000 times as many rows?  What if the time interval were changed to something else?  What if each of the 289 distinct values for the ID and DT columns did not have an equal distribution of values?

So, why select a DATE datatype rather than a NUMBER datatype?  These are the reasons that I proposed in the discussion thread:

One of the problems with putting date values in number columns is this – if you select the range from 200810 to 200903, the optimizer will likely make the assumption that 200810 is just as likely of a number as 200808, 200812, 200814, 200816, 200820, 200890, 200900, etc. Some of those year/month combinations are simply not possible. In such a case, the optimizer should over-estimate the number of rows returned from that range when the column data type is NUMBER, and should be reasonably close when the column data type is DATE, since the optimizer knows that 200814 (14/1/2008), 200816 (16/1/2008), 200820 (20/1/2008), 200890 (90/1/2008), 200900 (0/1/2009), etc. could never be dates (and would be completely out of the serial sequence of dates). By putting the date type data into a DATE column, you have essentially added a constraint to the database to prevent invalid dates from being added. Additionally, date math, such as finding the number of days between 200802 and 200803 (compared to 200702 and 200703) is very simple – the answer is not 1 in both cases, but rather 29 and 28, respectively.

Any other comments?

OK, enough guessing, let’s try a couple tests.  Here is the test table, with 10,000,000 rows with an uneven distribution of rows for each value:

DROP TABLE B PURGE;

CREATE TABLE B AS
SELECT
  TO_NUMBER(TO_CHAR(TRUNC(TO_DATE('200101','YYYYDD')+SQRT(ROWNUM),'MM'),'YYYYMM')) ID,
  TRUNC(TO_DATE('200101','YYYYDD')+SQRT(ROWNUM),'MM') DT
FROM
  (SELECT
    ROWNUM RN
  FROM
    DUAL
  CONNECT BY
    LEVEL<=1000) V1,
  (SELECT
    ROWNUM RN
  FROM
    DUAL
  CONNECT BY
    LEVEL<=10000) V2;

CREATE INDEX IND_B_ID ON B(ID);
CREATE INDEX IND_B_DT ON B(DT);

SET LINESIZE 130
SET PAGESIZE 1000

The first test script with no histograms, supplying various ranges for year-month combinations while gathering execution statistics, and displaying the actual execution plans:

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'B',CASCADE=>TRUE,METHOD_OPT=>'FOR ALL COLUMNS SIZE 1',NO_INVALIDATE=>FALSE);

SPOOL HISTOGRAMTEST.TXT

SELECT COUNT(*) FROM B;

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200810 AND 200903;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200810,'YYYYMM') AND TO_DATE(200903,'YYYYMM');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200110 AND 200203;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200110,'YYYYMM') AND TO_DATE(200203,'YYYYMM');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200210 AND 200303;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200210,'YYYYMM') AND TO_DATE(200303,'YYYYMM');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200812 AND 200901;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200812,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200112 AND 200201;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200112,'YYYYMM') AND TO_DATE(200201,'YYYYMM');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200612 AND 200901;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200612,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SPOOL OFF

The trimmed output (from Oracle 11.1.0.7) of the first date range follows:

SQL> SELECT COUNT(*) FROM B;

  COUNT(*)                                                                                                                       
----------                                                                                                                       
  10000000                                                                                                                       

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200810 AND 200903;

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.06 |    2371 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.06 |    2371 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1310K|   1063K|00:00:00.01 |    2371 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("ID">=200810 AND "ID"<=200903)                                                                                     

SQL>                                                                                                                                   
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200810,'YYYYMM') AND TO_DATE(200903,'YYYYMM');

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.06 |    2816 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.06 |    2816 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    674K|   1063K|00:00:00.01 |    2816 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("DT">=TO_DATE(' 2008-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')                                                    
              AND "DT"<=TO_DATE(' 2009-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))                                               

In the above, the estimated number of rows with the comparison on the numeric column is about 250,000 rows above the actual, while the comparison on the date column is about 400,000 rows below the actual – could this be enough of a difference to change the execution plan if the clustering factor of the indexes were high?  What if the tables had a larger average row length?  What if this table were joined with another table?  Note that the number of logical blocks accessed is less with the index on the numeric column.

The trimmed output of the second date range follows:

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200110 AND 200203;

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     299 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     299 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1344K|    132K|00:00:00.01 |     299 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("ID">=200110 AND "ID"<=200203)                                                                                     

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200110,'YYYYMM') AND TO_DATE(200203,'YYYYMM');

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     353 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     353 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    674K|    132K|00:00:00.01 |     353 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("DT">=TO_DATE(' 2001-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')                                                    
              AND "DT"<=TO_DATE(' 2002-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))                                               

In the above, the estimated number of rows with the comparison on the numeric column is about 10 times as high as the actual number of rows, while the estimated number of rows with the comparison on the date column is about 5 times as high.  Could this be enough to trigger a different execution plan for the queries – where one uses an index access, while the other uses a full table scan?

The trimmed output of the third date range follows:

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200210 AND 200303;

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     594 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     594 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1344K|    265K|00:00:00.01 |     594 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("ID">=200210 AND "ID"<=200303)                                                                                     

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200210,'YYYYMM') AND TO_DATE(200303,'YYYYMM');

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     705 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     705 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    674K|    265K|00:00:00.01 |     705 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("DT">=TO_DATE(' 2002-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')                                                    
              AND "DT"<=TO_DATE(' 2003-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))                                               

In this case the estimates are about the same as the previous test, but the actual number of rows has doubled.  The optimizer’s estimates are again in favor of the date datatype.

The trimmed output of the fourth date range follows:

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200812 AND 200901;

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     810 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     810 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1285K|    362K|00:00:00.01 |     810 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("ID">=200812 AND "ID"<=200901)                                                                                     

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200812,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     962 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     962 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    291K|    362K|00:00:00.01 |     962 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("DT">=TO_DATE(' 2008-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')                                                    
              AND "DT"<=TO_DATE(' 2009-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))                                               

This time, the optimizer’s estimate when the date datatype was used is very close, while the optimizer’s estimate when the numeric datatype was used is about 4 times greater than the actual number of rows.

The trimmed output of the fifth date range follows:

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200112 AND 200201;

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     104 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     104 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1295K|  45260 |00:00:00.01 |     104 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("ID">=200112 AND "ID"<=200201)                                                                                     

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200112,'YYYYMM') AND TO_DATE(200201,'YYYYMM');

----------------------------------------------------------------------------------------                                         
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                         
----------------------------------------------------------------------------------------                                         
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     122 |                                         
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     122 |                                         
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    291K|  45260 |00:00:00.01 |     122 |                                         
----------------------------------------------------------------------------------------                                         

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - access("DT">=TO_DATE(' 2001-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')                                                    
              AND "DT"<=TO_DATE(' 2002-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))                                               

This time with the numeric datatype, the optimizer is estimating 1,295,000 rows when in fact only 45,260 are returned during the index range scan.  The estimate with the date datatype is also quite high, but it is 4 to 5 times lower (thus closer to the actual) than with the numeric datatype.

Now, the final trimmed output:

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200612 AND 200901;

--------------------------------------------------------------------------------------------                                     
| Id  | Operation             | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                     
--------------------------------------------------------------------------------------------                                     
|   0 | SELECT STATEMENT      |          |      1 |        |      1 |00:00:00.78 |   22345 |                                     
|   1 |  SORT AGGREGATE       |          |      1 |      1 |      1 |00:00:00.78 |   22345 |                                     
|*  2 |   INDEX FAST FULL SCAN| IND_B_ID |      1 |   3764K|   4054K|00:00:00.22 |   22345 |                                     
--------------------------------------------------------------------------------------------                                     

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - filter(("ID">=200612 AND "ID"<=200901))                                                                                   

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200612,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

-------------------------------------------------------------------------------------                                            
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |                                            
-------------------------------------------------------------------------------------                                            
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.47 |   24950 |                                            
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.47 |   24950 |                                            
|*  2 |   TABLE ACCESS FULL| B    |      1 |   2623K|   4054K|00:00:00.12 |   24950 |                                            
-------------------------------------------------------------------------------------                                            

Predicate Information (identified by operation id):                                                                              
---------------------------------------------------                                                                              
   2 - filter(("DT">=TO_DATE(' 2006-12-01 00:00:00', 'syyyy-mm-dd                                                                
              hh24:mi:ss') AND "DT"<=TO_DATE(' 2009-01-01 00:00:00', 'syyyy-mm-dd                                                
              hh24:mi:ss'))) 

Here, with the wider date range, the optimizer is closer with the numeric data type, and has selected to perform a fast full scan of the index, which would use multi-block reads if disk accesses were required.  Note that the optimizer selected to perform a full table scan when the date datatype was used, even though the estimated number of rows was less.  Note too that this in-memory operation completed about twice as fast as the in-memory index fast full scan operation.

Now let’s take a look at what happens when a histogram is present on each of the columns.  The script is identical to the previous script, except for the first two lines:

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'B',CASCADE=>TRUE,METHOD_OPT=>'FOR ALL COLUMNS SIZE 254',NO_INVALIDATE=>FALSE);

SPOOL HISTOGRAMTEST2.TXT

Below are the results from this test run:

SQL> SELECT COUNT(*) FROM B;

  COUNT(*)     
----------     
  10000000     

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200810 AND 200903;

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.09 |    2371 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.09 |    2371 |   
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1083K|   1063K|00:00:00.01 |    2371 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("ID">=200810 AND "ID"<=200903)         

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200810,'YYYYMM') AND TO_DATE(200903,'YYYYMM');

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.09 |    2816 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.09 |    2816 |   
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |   1083K|   1063K|00:00:00.01 |    2816 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("DT">=TO_DATE(' 2008-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')              
              AND "DT"<=TO_DATE(' 2009-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))         

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200110 AND 200203;

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     299 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     299 |   
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |    130K|    132K|00:00:00.01 |     299 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("ID">=200110 AND "ID"<=200203)         

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200110,'YYYYMM') AND TO_DATE(200203,'YYYYMM');

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     353 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     353 |   
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    130K|    132K|00:00:00.01 |     353 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("DT">=TO_DATE(' 2001-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')              
              AND "DT"<=TO_DATE(' 2002-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))         

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200210 AND 200303;

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     594 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     594 |   
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |    284K|    265K|00:00:00.01 |     594 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("ID">=200210 AND "ID"<=200303)         

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200210,'YYYYMM') AND TO_DATE(200303,'YYYYMM');

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     705 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     705 |   
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    284K|    265K|00:00:00.01 |     705 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("DT">=TO_DATE(' 2002-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')              
              AND "DT"<=TO_DATE(' 2003-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))         

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200812 AND 200901;

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     810 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     810 |   
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |    383K|    362K|00:00:00.01 |     810 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("ID">=200812 AND "ID"<=200901)         

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200812,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     962 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     962 |   
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    383K|    362K|00:00:00.01 |     962 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("DT">=TO_DATE(' 2008-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')              
              AND "DT"<=TO_DATE(' 2009-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))         

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200112 AND 200201;

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     104 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     104 |   
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |  39391 |  45260 |00:00:00.01 |     104 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("ID">=200112 AND "ID"<=200201)         

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200112,'YYYYMM') AND TO_DATE(200201,'YYYYMM');

----------------------------------------------------------------------------------------   
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |   
----------------------------------------------------------------------------------------   
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     122 |   
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     122 |   
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |  39391 |  45260 |00:00:00.01 |     122 |   
----------------------------------------------------------------------------------------   

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - access("DT">=TO_DATE(' 2001-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')              
              AND "DT"<=TO_DATE(' 2002-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))         

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200612 AND 200901;

--------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |      1 |        |      1 |00:00:00.75 |   22345 |
|   1 |  SORT AGGREGATE       |          |      1 |      1 |      1 |00:00:00.75 |   22345 |
|*  2 |   INDEX FAST FULL SCAN| IND_B_ID |      1 |   4053K|   4054K|00:00:00.22 |   22345 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - filter(("ID">=200612 AND "ID"<=200901))       

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200612,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

-------------------------------------------------------------------------------------      
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |      
-------------------------------------------------------------------------------------      
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.47 |   24950 |      
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.47 |   24950 |      
|*  2 |   TABLE ACCESS FULL| B    |      1 |   4053K|   4054K|00:00:00.12 |   24950 |      
-------------------------------------------------------------------------------------      

Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - filter(("DT">=TO_DATE(' 2006-12-01 00:00:00', 'syyyy-mm-dd                          
              hh24:mi:ss') AND "DT"<=TO_DATE(' 2009-01-01 00:00:00', 'syyyy-mm-dd          
              hh24:mi:ss')))                         

As the above indicates, with a 254 bucket histogram on both of the columns the optimizer calculates estimated row counts that are typically very close to the actual row counts for both datatypes – essentially the only difference is the number of logical reads.  So, adding the histogram helps, but then what if the OP implements good coding standards and uses bind variables rather than constants (literals)?

So, which datatype would you choose, and why?


Actions

Information

7 responses

6 01 2010
Narendra

Charles,

Thanks a lot for dedicating a blog post and more importantly providing exhaustive test cases to elaborate the issue.
If you had read the OP’s response, it appears that OP is not in a position to influence the “implementing good coding standards”. 🙂
As for usage of bind variables, what will happen when bind-peeking and adaptive cursor sharing kicks in ? 🙂
BTW, I vaguly remember reading a blog post by Jonathan Lewis (I guess), the morale of which was “Usage of bind variables and presense of histograms generally do not go together”. I agreed with it. So are we now saying this kind of data pattern is exception to that statement?

p.s. BTW, I am the “OP” of the forum post and my personal opinion is still to use DATE datatype instead of NUMBER. 🙂

6 01 2010
Narendra

Charles,

Here is the Jonathan Lewis’s post I was referring to
http://jonathanlewis.wordpress.com/2009/05/06/philosophy-1/

6 01 2010
Gary

“when the column data type is DATE, since the optimizer knows that 200814 … etc. could never be dates ”
I disagree with the assumption that ‘possible’ values play any part in the optimization process. A decimal value, such as 200810.32 is just as likely to be in a number field as 200811 or 200815. Similarly a DATE value can be ‘2008/10/15 12:31:33’ as easily as ‘2008/10/15’. I don’t believe there is any built in assumption that an integer or ‘day’ value is more likely to be in a field than a decimal or datetime. Instead it is the number of distinct values that plays a part, not possible values.

Say a column contains 150 rows, with 5 distinct values in a number column, with a low of 1 and a high of 150. The assumption Oracle makes is that any one value in that range occurs 30 times. If you do a select from that table for any specific value the estimated number of rows will always be 30, whether the value is 10 or 1.3443.
If you specify a range (eg 1 to 50) it determines what proportion of the entire range (1 to 150) is included in that requested range (eg 33%) and gives an estimate based on 33% of the table being returned. Except it doesn’t seem quite that simple as it assumes some sort of bell curve rather than an even distribution.

In my XE tests, I found the cardinality estimates for number and dates to be pretty close. I wonder if the closeness in distinct values (289) and max bucket size (254) for histograms meant there were some weird edge cases where the estimates were widely off for specific buckets.

6 01 2010
Charles Hooper

Gary,

Thanks for providing your detailed comment.

I believe that my remark that you quoted is not as clear as it probably should be – in the OP’s system 200814 would be the 14th month of the year 2008, which would be an impossible date, but not an impossible number between 200812 and 200901, and without a histogram, the optimizer should assume that some of the 289 distinct values fall into that number range – that is the point I was trying to make. Again, without a histogram, would the optimizer assume that the same percentage of those 289 distinct values fall between 200811 and 200812 as it assumes fall between 200812 and 200901?

The optimizer’s estimated cardinality, without the histogram, would probably swing wildly if the date range was less than a couple years, for instance, sometimes including part of one year and part of the next.

For example (using the OP’s sample data), wrapping around the end of the year:

SELECT
  COUNT(*)
FROM
  B
WHERE
  ID BETWEEN 200112 AND 200201;
 
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |  4712   (1)| 00:00:57 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS FULL| B    |   104K|   508K|  4712   (1)| 00:00:57 |
---------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("ID"<=200201 AND "ID">=200112)

Compare to the last 2 months of the year:

SELECT
  COUNT(*)
FROM
  B
WHERE
  ID BETWEEN 200111 AND 200112;
 
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |  4712   (1)| 00:00:57 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 17370 | 86850 |  4712   (1)| 00:00:57 |
---------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("ID"<=200112 AND "ID">=200111)

Notice the difference in the cardinality estimates.

Now if we repeat the test using the date column, wrapping around the end of the year:

SELECT
  COUNT(*)
FROM
  B
WHERE
  DT BETWEEN TO_DATE('200112','YYYYMM') AND TO_DATE('200201','YYYYMM');
 
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |  4715   (1)| 00:00:57 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 24756 |   193K|  4715   (1)| 00:00:57 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DT"<=TO_DATE(' 2002-01-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND "DT">=TO_DATE(' 2001-12-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))

Compare again with the last 2 months of a year:

SELECT
  COUNT(*)
FROM
  B
WHERE
  DT BETWEEN TO_DATE('200111','YYYYMM') AND TO_DATE('200112','YYYYMM');
 
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |  4715   (1)| 00:00:57 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 24486 |   191K|  4715   (1)| 00:00:57 |
---------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DT"<=TO_DATE(' 2001-12-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND "DT">=TO_DATE(' 2001-11-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))

Now, the next question is does this cardinality estimate really matter?

SELECT
  MAX(DT),
  MIN(DT),
  MAX(DT)-MIN(DT)
FROM
  B;
 
MAX(DT)   MIN(DT)   MAX(DT)-MIN(DT)
--------- --------- ---------------
01-JAN-25 01-JAN-01            8766

The OP has 24 years of data in the sample table. Will the cardinality difference really change the execution plan? Is the decrease in the number of consistent gets with the NUMBER datatype an advantage? How does the performance change when the average row length is much longer for the table? How does the performance change if there is an index? Does the order in which the rows are inserted matter? What about the CPU cycles burned converting the NUMBER column to a DATE for some comparison, or converting a string to a DATE value for inserting into the table or for use in the WHERE clause? What if bind variables are used, rather than constants?

I would like to encourage people to continue sharing their thoughts on this topic.

6 01 2010
Gary

I see where you are coming from but still disagree. I reproduced the situation in XE. With a the basic DBMS_STATS call, with just estimate_percent=>NULL, it creates 1 bucket for the date and 254 for the number. I think the DBMS_STATS gathering is recognizing the ‘lumpiness’ of numbers. It gets 8000 for each of 200101 to 200112 and doesn’t get any for 200113 to 200199, so it sees that as skew and creates the histograms. It doesn’t see any skew for dates, so doesn’t create histograms.
If you force the stats package to a certain number of histograms for both columns, the cardinality estimates are pretty much the same.
My contention is that the datatype is affecting the SKEWONLY aspect of stats gathering and doesn’t directly affect the optimizer.

7 01 2010
Charles Hooper

Gary, thanks again for your comments.

For the 3 people who have viewed this article (OK, there are probably more people than that), Gary has also posted a blog article on his site:
http://igor-db.blogspot.com/2010/01/cardinality-estimates-are-not.html

It is good to see that this article has sparked some degree of discussion. Gary, while your copy of XE and the OP’s copy of Oracle 10.2.0.1 created a histogram on the NUMBER column and not on the DATE column when this command was executed:

exec dbms_stats.gather_table_stats(user, 'B', estimate_percent=>NULL);

No histograms were created during my tests with Oracle 10.2.0.2, 10.2.0.4, or 11.1.0.7 when using the above command with the sample data from the OP’s test case (this was after correcting the copy-paste error as noted above). If a histogram were automatically created on the NUMBER column and the OP switches from using constants to bind variables and bind peeking is left enabled, there could be a significant price to pay for having that histogram present when the bind variable values change. Unfortunately, I do not have the book “Troubleshooting Oracle Performance” with me at the moment, but I seem to remember that the default behavior for histogram collection (important when METHOD_OPT is not specified in the DBMS_STATS call) could be adjusted (edit: pages 147-150 of “Troubleshooting Oracle Performance” describe how to adjust the defaults on 10g and 11g).

I do, however, have a copy of “Cost-Based Oracle Fundamentals” sitting next to me. Page 130 of that book shows why the following example on your blog article predicts such a low cardinality value:

select count(*) from test_150 where id < -5;

-5 is below the minimum value for the ID column, while 200814 (as used in my example) falls between the minimum value and the maximum value.

This is a good discussion.

Here is the explain plan output from 10.2.0.2 using the script at the top of this article:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |   919  (11)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS FULL| B    |   108K|   527K|   919  (11)| 00:00:05 |
---------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
   2 - filter("ID"<=200903 AND "ID">=200810)
 
-
 
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |   926  (12)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 57166 |   446K|   926  (12)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
   2 - filter("DT"<=TO_DATE('2009-03-01 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss') AND "DT">=TO_DATE('2008-10-01 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss'))

Here is the explain plan output from 10.2.0.4 using the script at the top of this article:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |  4713   (1)| 00:00:57 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS FULL| B    |   108K|   527K|  4713   (1)| 00:00:57 |
---------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("ID"<=200903 AND "ID">=200810)
 
-
 
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |  4716   (1)| 00:00:57 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 57166 |   446K|  4716   (1)| 00:00:57 |
---------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DT"<=TO_DATE(' 2009-03-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND "DT">=TO_DATE(' 2008-10-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))

Here is the explain plan output from 11.1.0.7 using the script at the top of this article:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |   919  (11)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS FULL| B    |   108K|   527K|   919  (11)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
   2 - filter("ID"<=200903 AND "ID">=200810)
 
-
 
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |   926  (12)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 57166 |   446K|   926  (12)| 00:00:05 |
---------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
   2 - filter("DT"<=TO_DATE('2009-03-01 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss') AND "DT">=TO_DATE('2008-10-01 00:00:00', 'yyyy-mm-dd
              hh24:mi:ss'))
8 01 2010
Gary

Good discussion. Okay, I accept that it isn’t just the way the stats are gathered, and the optimizer is involved.
Now, with a low value of 200101 and a maximum of 202501, that is a (numeric) span of 2400. The range 200810 to 200903 is 94 (or about 4% of the total)
With a low of 2001-01-01 and high of 2025-01-01 we have a span of 8766 days. The range 2008-10-01 to 2009-03-01 is 151 days (or about 1.7% of the total)
We can measure the size of date ranges in weeks (1252 and 21.5 respectively) but the ratio of the ‘range of interest’ to ‘range of existing values’ remains the same at around 1.7%
With 2.3 million values and 8000 rows per distinct value, 4% would be about 93000 from the range, add in the 8000 for the top of range and 8000 for the bottom and it comes close to the 108K estimate.
Similarly 1.7% would be around 41,000 plus the 8000 for the top and 8000 for the bottom comes close to the 57K estimate.

If you use “id in (200110,200111,200112,200201)” and “dt in (date ‘2001-10-01’, date ‘2001-11-01’, date ‘2001-12-01’, date ‘2002-01-01′)”, then they both come out with consistent estimates.

So with either dates or numbers, the range estimates aren’t great. With dates it over-estimates by 8000 (number of records per distinct value). With numbers, the estimate is off by a variable amount depending on how many times you span a chunk representing the missing ’13-99’ months. It would start off by significantly under-estimating for small ranges not including a chunk, over-estimating where it does include a chunk and getting less incorrect the larger the range. At various points the incorrect estimates would balance out. Sort of like a stopped clock being right twice a day, I’d prefer not to rely on it.

Leave a reply to Narendra Cancel reply