January 6, 2010
In a recent discussion thread on the Oracle forums, the following question was asked:
http://forums.oracle.com/forums/thread.jspa?threadID=1007653
I have a scenario where I need to store the data in the format YYYYMM (e.g. 201001, which means January, 2010). I am trying to evaluate which is the most appropriate datatype to store this kind of data. I am comparing 2 options, NUMBER and DATE. As the data is essentially a component of the Oracle DATE datatype, and experts like Tom Kyte have proved (with examples) that using the right datatype is better for the optimizer, I was expecting that using the DATE datatype would yield (at least) similar (if not better) cardinality estimates than using the NUMBER datatype. However, my tests show that when using DATE the cardinality estimates are way off from actuals, whereas using NUMBER the cardinality estimates are much closer to actuals.
My questions are:
1) What should be the most appropriate datatype used to store YYYYMM data?
2) Why does using the DATE datatype yield estimates that are farther from the actuals than using the NUMBER datatype?

Test case (update Jan 7, 2010: there was a copy-paste error in the line for collecting statistics on table B – the original version of the script posted here collected statistics on table A twice):
create table a nologging as
select
  to_number(to_char(add_months(to_date('200101','YYYYMM'),level - 1), 'YYYYMM')) id
from
  dual
connect by
  level <= 289;

create table b (id number);

begin
  for i in 1..8192 loop
    insert into b select * from a;
  end loop;
  commit;
end;
/

alter table a add dt date;
alter table b add dt date;

update a set dt = to_date(id, 'YYYYMM');
update b set dt = to_date(id, 'YYYYMM');
commit;

exec dbms_stats.gather_table_stats(user, 'A', estimate_percent=>NULL);
exec dbms_stats.gather_table_stats(user, 'B', estimate_percent=>NULL);

explain plan for select count(*) from b where id between 200810 and 200903;
select * from table(dbms_xplan.display);

explain plan for select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');
select * from table(dbms_xplan.display);
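Before looking at the optimizer's numbers, it may help to pin down exactly what the script builds. A quick cross-check in plain Python (an addition of mine, not part of the original thread – pure arithmetic, no Oracle required):

```python
# Sanity check of the test data set's shape.
# The script generates 289 consecutive month values starting at 2001-01:
months = []
y, m = 2001, 1
for _ in range(289):
    months.append(y * 100 + m)   # YYYYMM encoded as a number
    m += 1
    if m > 12:
        y, m = y + 1, 1

print(months[0], months[-1])     # 200101 202501 (the Min/Max seen later in the 10053 trace)
print(len(months) * 8192)        # 2367488 rows in table B (289 values x 8192 copies each)

# Month values falling inside the queried range 200810..200903:
in_range = [v for v in months if 200810 <= v <= 200903]
print(len(in_range), len(in_range) * 8192)   # 6 months -> 49152 rows, the actual COUNT(*)
```

So a perfect estimate for either predicate would be 49,152 rows; the question is how close each datatype lets the optimizer get.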
This is an interesting problem: why would using the NUMBER datatype yield better cardinality estimates than using the DATE datatype? When the NUMBER datatype was used, the optimizer predicted that the full table scan operation would return 46,604 rows, while when the DATE datatype was used the optimizer predicted that the full table scan would return 5,919 rows – the actual number of rows returned is 49,152.
The person who posted the above test case later stated that he believes that the DATE datatype is the correct choice, but he would have a difficult time justifying that opinion when confronted by someone suggesting the use of the NUMBER data type.
I posted the test results from my run with Oracle 11.1.0.7:
SQL> set autotrace traceonly explain
SQL> select count(*) from b where id between 200810 and 200903;

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |  4715   (1)| 00:00:57 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS FULL| B    |   108K|   527K|  4715   (1)| 00:00:57 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("ID"<=200903 AND "ID">=200810)

SQL> select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |  4718   (2)| 00:00:57 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 57166 |   446K|  4718   (2)| 00:00:57 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DT"<=TO_DATE(' 2009-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')
              AND "DT">=TO_DATE(' 2008-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

SQL> set autotrace off
SQL> select count(*) from b where id between 200810 and 200903;

  COUNT(*)
----------
     49152

SQL> select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');

  COUNT(*)
----------
     49152
Well, it seems that Oracle 11.1.0.7 predicted that the full table scan would return roughly 108,000 rows when the NUMBER datatype was used, and 57,166 rows when the DATE datatype was used – significantly closer to the actual number of 49,152. If there were an index on that column, how would the different cardinality estimates affect the possibility that the optimizer might use that index rather than a full table scan? What if the data volume were increased by a factor of, say, 1,000 or 1,000,000?
I also captured a 10053 trace during the test run, and found this in the trace file:
******************************************
----- Current SQL Statement for this session (sql_id=7uk18xj0z9uxf) -----
select count(*) from b where id between 200810 and 200903
*******************************************
...
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for B[B]
  Table: B  Alias: B
    Card: Original: 2367488.000000  Rounded: 108124  Computed: 108124.16  Non Adjusted: 108124.16
  Access Path: TableScan
    Cost:  4714.53  Resp: 4714.53  Degree: 0
      Cost_io: 4670.00  Cost_cpu: 636216240
      Resp_io: 4670.00  Resp_cpu: 636216240
  Best:: AccessPath: TableScan
         Cost: 4714.53  Degree: 1  Resp: 4714.53  Card: 108124.16  Bytes: 0
***************************************
...
******************************************
----- Current SQL Statement for this session (sql_id=2ac0k15zjdg5x) -----
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM')
*******************************************
...
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for B[B]
  Table: B  Alias: B
    Card: Original: 2367488.000000  Rounded: 57166  Computed: 57165.51  Non Adjusted: 57165.51
  Access Path: TableScan
    Cost:  4717.89  Resp: 4717.89  Degree: 0
      Cost_io: 4670.00  Cost_cpu: 684264079
      Resp_io: 4670.00  Resp_cpu: 684264079
  Best:: AccessPath: TableScan
         Cost: 4717.89  Degree: 1  Resp: 4717.89  Card: 57165.51  Bytes: 0
***************************************
Notice in the above that Oracle’s statistics gathering process did not create histograms when I collected statistics for the tables. The calculated cost is nearly the same for either datatype, but what would happen if this table were then joined to another table? Is the optimizer seeing histograms in some of the original poster’s test cases?
The original poster is running Oracle Database 10.2.0.1, so I ran a test on Oracle 10.2.0.2 with OPTIMIZER_FEATURES_ENABLE set to 10.2.0.1, using the data created by the OP’s data creation script:
ALTER SESSION SET TRACEFILE_IDENTIFIER = 'DateTest';
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';

select count(*) from b where id between 200810 and 200903;
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');

ALTER SESSION SET EVENTS '10053 trace name context off';

set autotrace traceonly explain

select count(*) from b where id between 200810 and 200903;
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');

set autotrace off

select count(*) from b where id between 200810 and 200903;
The output from the above script, when run on Oracle 10.2.0.2 follows:
SQL> select count(*) from b where id between 200810 and 200903;

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |   919  (11)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS FULL| B    |   108K|   527K|   919  (11)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("ID"<=200903 AND "ID">=200810)

SQL> select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |   926  (12)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 57166 |   446K|   926  (12)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DT"<=TO_DATE('2009-03-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss')
              AND "DT">=TO_DATE('2008-10-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss'))

SQL> set autotrace off
SQL> select count(*) from b where id between 200810 and 200903;

  COUNT(*)
----------
     49152
The estimated cardinalities appear to be identical to those of Oracle 11.1.0.7, so why was the original poster seeing different cardinality estimates? Here is the output from the 10053 trace file:
******************************************
Current SQL statement for this session:
select count(*) from b where id between 200810 and 200903
*******************************************
...
PARAMETERS WITH ALTERED VALUES
******************************
  optimizer_features_enable = 10.2.0.1
*********************************
...
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table: B  Alias: B
    #Rows: 2367488  #Blks: 16725  AvgRowLen: 13.00
***************************************
SINGLE TABLE ACCESS PATH
  Column (#1): ID(NUMBER)
    AvgLen: 5.00  NDV: 289  Nulls: 0  Density: 0.0034602  Min: 200101  Max: 202501
  Table: B  Alias: B
    Card: Original: 2367488  Rounded: 108124  Computed: 108124.16  Non Adjusted: 108124.16
  Access Path: TableScan
    Cost:  918.67  Resp: 918.67  Degree: 0
      Cost_io: 819.00  Cost_cpu: 632570063
      Resp_io: 819.00  Resp_cpu: 632570063
  Best:: AccessPath: TableScan
         Cost: 918.67  Degree: 1  Resp: 918.67  Card: 108124.16  Bytes: 0
***************************************
...
******************************************
Current SQL statement for this session:
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM')
*******************************************
...
*************************************
PARAMETERS WITH ALTERED VALUES
******************************
  optimizer_features_enable = 10.2.0.1
*********************************
...
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table: B  Alias: B
    #Rows: 2367488  #Blks: 16725  AvgRowLen: 13.00
***************************************
SINGLE TABLE ACCESS PATH
  Column (#2): DT(DATE)
    AvgLen: 8.00  NDV: 289  Nulls: 0  Density: 0.0034602  Min: 2451911  Max: 2460677
  Table: B  Alias: B
    Card: Original: 2367488  Rounded: 57166  Computed: 57165.51  Non Adjusted: 57165.51
  Access Path: TableScan
    Cost:  926.24  Resp: 926.24  Degree: 0
      Cost_io: 819.00  Cost_cpu: 680617902
      Resp_io: 819.00  Resp_cpu: 680617902
  Best:: AccessPath: TableScan
         Cost: 926.24  Degree: 1  Resp: 926.24  Card: 57165.51  Bytes: 0
Notice that, according to the 10053 trace file, no histograms were collected.
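Both cardinality figures can be reproduced by hand from the column statistics in the trace. For a range predicate with no histogram, the optimizer's selectivity for `col BETWEEN lo AND hi` is commonly described as (hi − lo) / (max − min) + 2/NDV (1/NDV for each closed endpoint). A quick sketch (my own cross-check, not part of the original test; the DATE endpoints below are the Julian day numbers for 2008-10-01 and 2009-03-01, matching the format of the Min/Max values in the trace):

```python
# Reproducing the two cardinality estimates from the 10053 trace statistics,
# using the textbook no-histogram selectivity formula for a closed range:
#   sel = (hi - lo) / (max - min) + 2/NDV
def between_card(rows, lo, hi, col_min, col_max, ndv):
    sel = (hi - lo) / (col_max - col_min) + 2.0 / ndv
    return rows * sel

ROWS, NDV = 2367488, 289

# NUMBER column: Min: 200101, Max: 202501 (from the trace)
print(between_card(ROWS, 200810, 200903, 200101, 202501, NDV))
# -> 108124.16, matching "Computed: 108124.16"

# DATE column: Min: 2451911 (2001-01-01), Max: 2460677 (2025-01-01) as Julian days;
# 2008-10-01 and 2009-03-01 are Julian days 2454741 and 2454892
print(between_card(ROWS, 2454741, 2454892, 2451911, 2460677, NDV))
# -> ~57165.51, matching "Computed: 57165.51"
```

The NUMBER estimate is inflated because the predicate spans 93 units of a 2,400-unit range, even though most of the integers inside that span (200813 through 200899, 200900, and so on) can never occur; the DATE range spans 151 days of an 8,766-day range with no impossible values in between.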
Now for a second test; this time we will instruct Oracle to create histograms, and also force the optimizer to hard parse the SQL statements that reference table B when those SQL statements are re-executed:
exec dbms_stats.gather_table_stats(user, 'B', estimate_percent=>NULL, method_opt=>'FOR ALL COLUMNS SIZE 254', no_invalidate=>false);

ALTER SESSION SET TRACEFILE_IDENTIFIER = 'DateTest2';
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';

select count(*) from b where id between 200810 and 200903;
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');

ALTER SESSION SET EVENTS '10053 trace name context off';

set autotrace traceonly explain

select count(*) from b where id between 200810 and 200903;
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');

set autotrace off
So, what is the output of the above?
SQL> select count(*) from b where id between 200810 and 200903;

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |   919  (11)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 46604 |   227K|   919  (11)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("ID"<=200903 AND "ID">=200810)

SQL> select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM');

Execution Plan
----------------------------------------------------------
Plan hash value: 749587668

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     8 |   926  (12)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| B    | 46604 |   364K|   926  (12)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DT"<=TO_DATE('2009-03-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss')
              AND "DT">=TO_DATE('2008-10-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss'))
Interesting – both queries now estimate that the full table scan operation will return 46,604 rows. That cardinality estimate exactly matches the cardinality estimate in the OP’s plan for the SQL statement that accessed the NUMBER column…
For fun, let’s look in the 10053 trace file:
******************************************
Current SQL statement for this session:
select count(*) from b where id between 200810 and 200903
*******************************************
...
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table: B  Alias: B
    #Rows: 2367488  #Blks: 16725  AvgRowLen: 13.00
***************************************
SINGLE TABLE ACCESS PATH
  Column (#1): ID(NUMBER)
    AvgLen: 5.00  NDV: 289  Nulls: 0  Density: 0.0034602  Min: 200101  Max: 202501
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 255
  Table: B  Alias: B
    Card: Original: 2367488  Rounded: 46604  Computed: 46604.09  Non Adjusted: 46604.09
  Access Path: TableScan
    Cost:  918.76  Resp: 918.76  Degree: 0
      Cost_io: 819.00  Cost_cpu: 633149246
      Resp_io: 819.00  Resp_cpu: 633149246
  Best:: AccessPath: TableScan
         Cost: 918.76  Degree: 1  Resp: 918.76  Card: 46604.09  Bytes: 0
...
******************************************
Current SQL statement for this session:
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM')
*******************************************
...
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table: B  Alias: B
    #Rows: 2367488  #Blks: 16725  AvgRowLen: 13.00
***************************************
SINGLE TABLE ACCESS PATH
  Column (#2): DT(DATE)
    AvgLen: 8.00  NDV: 289  Nulls: 0  Density: 0.0034602  Min: 2451911  Max: 2460677
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 255
  Table: B  Alias: B
    Card: Original: 2367488  Rounded: 46604  Computed: 46604.09  Non Adjusted: 46604.09
  Access Path: TableScan
    Cost:  926.22  Resp: 926.22  Degree: 0
      Cost_io: 819.00  Cost_cpu: 680499006
      Resp_io: 819.00  Resp_cpu: 680499006
  Best:: AccessPath: TableScan
         Cost: 926.22  Degree: 1  Resp: 926.22  Card: 46604.09  Bytes: 0
The 10053 trace file shows that in both cases a height balanced histogram with 254 buckets was created. But, how accurate would the estimate be if there were 1,000 or 1,000,000 times as many rows? What if the time interval were changed to something else? What if each of the 289 distinct values for the ID and DT columns did not have an equal distribution of values?
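One way to read the common 46,604-row figure (this is my interpretation of the trace arithmetic, not Oracle's documented height-balanced histogram formula) is in units of buckets: with 254 buckets over 2,367,488 rows, each bucket covers about 9,321 rows, and the computed cardinality is exactly five buckets' worth, while the true result set spans slightly more than five buckets:

```python
# Back-of-the-envelope reading of the height-balanced histogram estimate.
ROWS, BUCKETS = 2367488, 254
rows_per_bucket = ROWS / BUCKETS      # ~9320.8 rows covered by each bucket
print(rows_per_bucket)

# Both traces show Computed: 46604.09, which is exactly five buckets' worth of rows:
print(ROWS * 5 / BUCKETS)             # ~46604.09

# The true result set (6 months x 8192 rows = 49152) spans a bit more than 5 buckets:
print(49152 / rows_per_bucket)        # ~5.27 buckets
```

With only 289 distinct values spread over 254 buckets, the histogram's resolution is coarse, which is why both columns now receive the same slightly-low estimate.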
So, why select a DATE datatype rather than a NUMBER datatype? These are the reasons that I proposed in the discussion thread:
One of the problems with putting date values in number columns is this: if you select the range from 200810 to 200903, the optimizer will likely assume that 200810 is just as likely a value as 200808, 200812, 200814, 200816, 200820, 200890, 200900, etc. Some of those year/month combinations are simply not possible. In such a case, the optimizer should over-estimate the number of rows returned from that range when the column datatype is NUMBER, and should be reasonably close when the column datatype is DATE, since the optimizer knows that 200814 (month 14 of 2008), 200816 (month 16 of 2008), 200820 (month 20 of 2008), 200890 (month 90 of 2008), 200900 (month 0 of 2009), etc. could never be dates (and would be completely out of the serial sequence of dates). By putting the date type data into a DATE column, you have essentially added a constraint to the database to prevent invalid dates from being added. Additionally, date math, such as finding the number of days between 200802 and 200803 (compared to 200702 and 200703), is very simple – the answer is not 1 in both cases, but rather 29 and 28, respectively.
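Both points above are easy to demonstrate outside the database (a Python sketch of my own, added for illustration):

```python
from datetime import date

# 1) Most integers in a YYYYMM-coded range are not valid months.  The range
#    200810..200903 contains 94 integers, but only 6 of them are real months:
candidates = range(200810, 200904)
valid = [v for v in candidates if 1 <= v % 100 <= 12]
print(len(candidates), len(valid))   # 94 candidates, only 6 valid months

# 2) Date math requires real dates: the number of days between the starts of
#    consecutive months differs from year to year (leap-year February):
feb_2008 = (date(2008, 3, 1) - date(2008, 2, 1)).days   # 29
feb_2007 = (date(2007, 3, 1) - date(2007, 2, 1)).days   # 28
print(feb_2008, feb_2007)
```

A uniform-assumption optimizer working on the NUMBER column treats all 94 integers as possible values, which is exactly the over-estimation mechanism described above.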
Any other comments?
OK, enough guessing – let’s try a couple of tests. Here is the test table, with 10,000,000 rows and an uneven distribution of rows for each value:
DROP TABLE B PURGE;

CREATE TABLE B AS
SELECT
  TO_NUMBER(TO_CHAR(TRUNC(TO_DATE('200101','YYYYDD')+SQRT(ROWNUM),'MM'),'YYYYMM')) ID,
  TRUNC(TO_DATE('200101','YYYYDD')+SQRT(ROWNUM),'MM') DT
FROM
  (SELECT ROWNUM RN FROM DUAL CONNECT BY LEVEL<=1000) V1,
  (SELECT ROWNUM RN FROM DUAL CONNECT BY LEVEL<=10000) V2;

CREATE INDEX IND_B_ID ON B(ID);
CREATE INDEX IND_B_DT ON B(DT);

SET LINESIZE 130
SET PAGESIZE 1000
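The SQRT(ROWNUM) trick is what makes the distribution uneven: the number of rows whose day offset lands in any fixed-width window grows linearly with the distance from the start date. A quick arithmetic sketch (my own illustration, not part of the script; a 30-day window is used as a rough stand-in for a month, since real months vary from 28 to 31 days):

```python
N = 10_000_000   # rows generated by the two CONNECT BY row sources (1000 x 10000)

def rows_in_window(d0, width=30):
    """Count rows n (1..N) whose SQRT(n) day offset lands in [d0, d0 + width)."""
    lo = max(1, d0 * d0)                  # smallest n with sqrt(n) >= d0
    hi = min(N + 1, (d0 + width) ** 2)    # smallest n with sqrt(n) >= d0 + width
    return max(0, hi - lo)

# Early windows receive few rows; later windows receive many more:
print(rows_in_window(100))    # 6900 rows near the start of the date range
print(rows_in_window(1000))   # 60900 rows
print(rows_in_window(3000))   # 180900 rows near the end
```

So the later year-month values hold tens of times more rows than the earlier ones, which is exactly the kind of skew that makes the histogram versus no-histogram comparison interesting.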
The first test script uses no histograms, supplies various ranges of year-month combinations while gathering execution statistics, and displays the actual execution plans:
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'B',CASCADE=>TRUE,METHOD_OPT=>'FOR ALL COLUMNS SIZE 1',NO_INVALIDATE=>FALSE);

SPOOL HISTOGRAMTEST.TXT

SELECT COUNT(*) FROM B;

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200810 AND 200903;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200810,'YYYYMM') AND TO_DATE(200903,'YYYYMM');
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200110 AND 200203;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200110,'YYYYMM') AND TO_DATE(200203,'YYYYMM');
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200210 AND 200303;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200210,'YYYYMM') AND TO_DATE(200303,'YYYYMM');
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200812 AND 200901;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200812,'YYYYMM') AND TO_DATE(200901,'YYYYMM');
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200112 AND 200201;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200112,'YYYYMM') AND TO_DATE(200201,'YYYYMM');
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200612 AND 200901;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200612,'YYYYMM') AND TO_DATE(200901,'YYYYMM');
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SPOOL OFF
The trimmed output (from Oracle 11.1.0.7) of the first date range follows:
SQL> SELECT COUNT(*) FROM B;

  COUNT(*)
----------
  10000000

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200810 AND 200903;

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.06 |    2371 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.06 |    2371 |
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1310K|   1063K|00:00:00.01 |    2371 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ID">=200810 AND "ID"<=200903)

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200810,'YYYYMM') AND TO_DATE(200903,'YYYYMM');

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.06 |    2816 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.06 |    2816 |
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    674K|   1063K|00:00:00.01 |    2816 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("DT">=TO_DATE(' 2008-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT"<=TO_DATE(' 2009-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
In the above, the estimated number of rows with the comparison on the numeric column is about 250,000 rows above the actual, while the estimate with the comparison on the date column is about 400,000 rows below the actual – could this be enough of a difference to change the execution plan if the clustering factor of the indexes were high? What if the table had a larger average row length? What if this table were joined with another table? Note that the number of logical blocks accessed is smaller with the index on the numeric column.
The trimmed output of the second date range follows:
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200110 AND 200203;

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     299 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     299 |
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1344K|    132K|00:00:00.01 |     299 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ID">=200110 AND "ID"<=200203)

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200110,'YYYYMM') AND TO_DATE(200203,'YYYYMM');

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     353 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     353 |
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    674K|    132K|00:00:00.01 |     353 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("DT">=TO_DATE(' 2001-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT"<=TO_DATE(' 2002-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
In the above, the estimated number of rows with the comparison on the numeric column is about 10 times as high as the actual number of rows, while the estimated number of rows with the comparison on the date column is about 5 times as high. Could this be enough to trigger a different execution plan for the queries – where one uses an index access, while the other uses a full table scan?
The trimmed output of the third date range follows:
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200210 AND 200303;

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     594 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     594 |
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1344K|    265K|00:00:00.01 |     594 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ID">=200210 AND "ID"<=200303)

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200210,'YYYYMM') AND TO_DATE(200303,'YYYYMM');

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     705 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     705 |
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    674K|    265K|00:00:00.01 |     705 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("DT">=TO_DATE(' 2002-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT"<=TO_DATE(' 2003-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
In this case the estimates are about the same as the previous test, but the actual number of rows has doubled. The optimizer’s estimates are again in favor of the date datatype.
The trimmed output of the fourth date range follows:
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200812 AND 200901;

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     810 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     810 |
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1285K|    362K|00:00:00.01 |     810 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ID">=200812 AND "ID"<=200901)

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200812,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     962 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     962 |
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    291K|    362K|00:00:00.01 |     962 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("DT">=TO_DATE(' 2008-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT"<=TO_DATE(' 2009-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
This time, the optimizer’s estimate when the date datatype was used is very close, while the optimizer’s estimate when the numeric datatype was used is roughly 3.5 times the actual number of rows.
The trimmed output of the fifth date range follows:
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200112 AND 200201;

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     104 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     104 |
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1295K|  45260 |00:00:00.01 |     104 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ID">=200112 AND "ID"<=200201)

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200112,'YYYYMM') AND TO_DATE(200201,'YYYYMM');

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     122 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     122 |
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    291K|  45260 |00:00:00.01 |     122 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("DT">=TO_DATE(' 2001-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT"<=TO_DATE(' 2002-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
This time with the numeric datatype, the optimizer is estimating 1,295,000 rows when in fact only 45,260 are returned during the index range scan. The estimate with the date datatype is also quite high, but it is 4 to 5 times lower (thus closer to the actual) than with the numeric datatype.
Now, the final trimmed output:
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200612 AND 200901;

--------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |      1 |        |      1 |00:00:00.78 |   22345 |
|   1 |  SORT AGGREGATE       |          |      1 |      1 |      1 |00:00:00.78 |   22345 |
|*  2 |   INDEX FAST FULL SCAN| IND_B_ID |      1 |   3764K|   4054K|00:00:00.22 |   22345 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("ID">=200612 AND "ID"<=200901))

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200612,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.47 |   24950 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.47 |   24950 |
|*  2 |   TABLE ACCESS FULL| B    |      1 |   2623K|   4054K|00:00:00.12 |   24950 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("DT">=TO_DATE(' 2006-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT"<=TO_DATE(' 2009-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))
Here, with the wider date range, the optimizer's estimate is closer with the numeric datatype, and it chose to perform a fast full scan of the index, which would use multi-block reads if disk accesses were required. Note that the optimizer chose to perform a full table scan when the date datatype was used, even though the estimated number of rows was lower. Note too that this in-memory full table scan completed in roughly half the time of the in-memory index fast full scan.
Now let’s take a look at what happens when a histogram is present on each of the columns. The script is identical to the previous script, except for the first two lines:
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'B',CASCADE=>TRUE,METHOD_OPT=>'FOR ALL COLUMNS SIZE 254',NO_INVALIDATE=>FALSE);
SPOOL HISTOGRAMTEST2.TXT
Below are the results from this test run:
SQL> SELECT COUNT(*) FROM B;

  COUNT(*)
----------
  10000000

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200810 AND 200903;

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.09 |    2371 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.09 |    2371 |
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |   1083K|   1063K|00:00:00.01 |    2371 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID">=200810 AND "ID"<=200903)

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200810,'YYYYMM') AND TO_DATE(200903,'YYYYMM');

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.09 |    2816 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.09 |    2816 |
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |   1083K|   1063K|00:00:00.01 |    2816 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DT">=TO_DATE(' 2008-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT"<=TO_DATE(' 2009-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200110 AND 200203;

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     299 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     299 |
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |    130K|    132K|00:00:00.01 |     299 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID">=200110 AND "ID"<=200203)

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200110,'YYYYMM') AND TO_DATE(200203,'YYYYMM');

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     353 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     353 |
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    130K|    132K|00:00:00.01 |     353 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DT">=TO_DATE(' 2001-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT"<=TO_DATE(' 2002-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200210 AND 200303;

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     594 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     594 |
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |    284K|    265K|00:00:00.01 |     594 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID">=200210 AND "ID"<=200303)

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200210,'YYYYMM') AND TO_DATE(200303,'YYYYMM');

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     705 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     705 |
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    284K|    265K|00:00:00.01 |     705 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DT">=TO_DATE(' 2002-10-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT"<=TO_DATE(' 2003-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200812 AND 200901;

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     810 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     810 |
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |    383K|    362K|00:00:00.01 |     810 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID">=200812 AND "ID"<=200901)

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200812,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.03 |     962 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.03 |     962 |
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |    383K|    362K|00:00:00.01 |     962 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DT">=TO_DATE(' 2008-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT"<=TO_DATE(' 2009-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200112 AND 200201;

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     104 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     104 |
|*  2 |   INDEX RANGE SCAN| IND_B_ID |      1 |  39391 |  45260 |00:00:00.01 |     104 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID">=200112 AND "ID"<=200201)

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200112,'YYYYMM') AND TO_DATE(200201,'YYYYMM');

----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |      1 |        |      1 |00:00:00.01 |     122 |
|   1 |  SORT AGGREGATE   |          |      1 |      1 |      1 |00:00:00.01 |     122 |
|*  2 |   INDEX RANGE SCAN| IND_B_DT |      1 |  39391 |  45260 |00:00:00.01 |     122 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DT">=TO_DATE(' 2001-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT"<=TO_DATE(' 2002-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE ID BETWEEN 200612 AND 200901;

--------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |      1 |        |      1 |00:00:00.75 |   22345 |
|   1 |  SORT AGGREGATE       |          |      1 |      1 |      1 |00:00:00.75 |   22345 |
|*  2 |   INDEX FAST FULL SCAN| IND_B_ID |      1 |   4053K|   4054K|00:00:00.22 |   22345 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("ID">=200612 AND "ID"<=200901))

SQL>
SQL> SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*) FROM B WHERE DT BETWEEN TO_DATE(200612,'YYYYMM') AND TO_DATE(200901,'YYYYMM');

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.47 |   24950 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.47 |   24950 |
|*  2 |   TABLE ACCESS FULL| B    |      1 |   4053K|   4054K|00:00:00.12 |   24950 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("DT">=TO_DATE(' 2006-12-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DT"<=TO_DATE(' 2009-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))
As the above indicates, with a 254-bucket histogram on both columns the optimizer calculates estimated row counts that are typically very close to the actual row counts for both datatypes – essentially the only difference is the number of logical reads. So, adding the histogram helps, but what happens if the OP implements good coding standards and uses bind variables rather than literals (constants)?
—
So, which datatype would you choose, and why?
Charles,
Thanks a lot for dedicating a blog post and more importantly providing exhaustive test cases to elaborate the issue.
If you read the OP’s response, it appears that the OP is not in a position to influence the implementation of “good coding standards”. 🙂
As for usage of bind variables, what will happen when bind-peeking and adaptive cursor sharing kicks in ? 🙂
BTW, I vaguely remember reading a blog post by Jonathan Lewis (I guess), the moral of which was “Usage of bind variables and presence of histograms generally do not go together”. I agreed with it. So are we now saying that this kind of data pattern is an exception to that statement?
p.s. BTW, I am the “OP” of the forum post and my personal opinion is still to use DATE datatype instead of NUMBER. 🙂
Charles,
Here is the Jonathan Lewis post I was referring to:
http://jonathanlewis.wordpress.com/2009/05/06/philosophy-1/
“when the column data type is DATE, since the optimizer knows that 200814 … etc. could never be dates ”
I disagree with the assumption that ‘possible’ values play any part in the optimization process. A decimal value, such as 200810.32, is just as likely to be in a number field as 200811 or 200815. Similarly, a DATE value can be ‘2008/10/15 12:31:33’ as easily as ‘2008/10/15’. I don’t believe there is any built-in assumption that an integer or ‘day’ value is more likely to be in a field than a decimal or datetime value. Instead, it is the number of distinct values that plays a part, not the set of possible values.
Say a column contains 150 rows, with 5 distinct values in a number column, with a low of 1 and a high of 150. The assumption Oracle makes is that any one value in that range occurs 30 times. If you do a select from that table for any specific value the estimated number of rows will always be 30, whether the value is 10 or 1.3443.
If you specify a range (eg 1 to 50) it determines what proportion of the entire range (1 to 150) is included in that requested range (eg 33%) and gives an estimate based on 33% of the table being returned. Except it doesn’t seem quite that simple as it assumes some sort of bell curve rather than an even distribution.
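Gary's description of the optimizer's arithmetic can be sketched in a few lines. The formulas below are the commonly cited textbook approximations of the no-histogram case, not a reproduction of the optimizer's internal code, so treat them as an illustration only:

```python
# Sketch of the cardinality arithmetic Gary describes (no histogram).
# These are the commonly cited textbook approximations, not the
# optimizer's exact internal logic.

def equality_cardinality(num_rows, num_distinct):
    # Every value between the column low and high gets the same estimate.
    return num_rows / num_distinct

def range_cardinality(num_rows, low, high, col_min, col_max,
                      num_distinct, closed=True):
    # Fraction of the min..max span covered by the predicate, plus
    # 1/num_distinct for each closed (>=, <=) endpoint of a BETWEEN.
    selectivity = (high - low) / (col_max - col_min)
    if closed:
        selectivity += 2.0 / num_distinct
    return num_rows * selectivity

# Gary's example: 150 rows, 5 distinct values, low 1, high 150.
print(equality_cardinality(150, 5))                            # 30.0
print(range_cardinality(150, 1, 50, 1, 150, 5, closed=False))  # ~49, about a third
```

With closed BETWEEN endpoints the 2/num_distinct adjustment adds another 60 rows to the toy example, which may be part of why, as Gary notes, "it doesn't seem quite that simple" as a pure proportion of the range.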
In my XE tests, I found the cardinality estimates for numbers and dates to be pretty close. I wonder if the closeness of the number of distinct values (289) to the maximum histogram bucket count (254) meant there were some weird edge cases where the estimates were wildly off for specific buckets.
Gary,
Thanks for providing your detailed comment.
I believe that my remark that you quoted is not as clear as it probably should be – in the OP’s system 200814 would be the 14th month of the year 2008, which would be an impossible date, but not an impossible number between 200812 and 200901, and without a histogram, the optimizer should assume that some of the 289 distinct values fall into that number range – that is the point I was trying to make. Again, without a histogram, would the optimizer assume that the same percentage of those 289 distinct values fall between 200811 and 200812 as it assumes fall between 200812 and 200901?
The optimizer’s estimated cardinality, without the histogram, would probably swing wildly if the date range was less than a couple years, for instance, sometimes including part of one year and part of the next.
For example (using the OP’s sample data), wrapping around the end of the year:
Compare to the last 2 months of the year:
Notice the difference in the cardinality estimates.
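The swing can be illustrated with the textbook no-histogram range-selectivity formula (a simplification of the optimizer's actual arithmetic). The statistics below are taken from the sample script at the top of this article (289 distinct months, min 200101, max 202501, 8192 rows per month), so the absolute numbers are illustrative:

```python
# Why YYYYMM-stored-as-NUMBER estimates swing across a year boundary.
# Textbook no-histogram range selectivity (an approximation):
#   (high - low) / (col_max - col_min) + 2 / num_distinct

NUM_ROWS = 289 * 8192            # rows in the sample table B
NDV = 289                        # distinct YYYYMM values
COL_MIN, COL_MAX = 200101, 202501

def between_estimate(low, high):
    sel = (high - low) / (COL_MAX - COL_MIN) + 2.0 / NDV
    return round(NUM_ROWS * sel)

# Both predicates cover exactly two months (2 * 8192 = 16,384 actual rows),
# but the numeric span differs wildly:
print(between_estimate(200812, 200901))  # spans 89 "units" across the year end
print(between_estimate(200811, 200812))  # spans 1 "unit" within the same year
```

Under these assumptions the wrap-around range is estimated at roughly six times the same-year range, even though both return the same number of rows; a DATE column has no 13-to-99 gap between years, so its estimates stay consistent.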
Now if we repeat the test using the date column, wrapping around the end of the year:
Compare again with the last 2 months of a year:
Now, the next question is does this cardinality estimate really matter?
The OP has 24 years of data in the sample table. Will the cardinality difference really change the execution plan? Is the decrease in the number of consistent gets with the NUMBER datatype an advantage? How does the performance change when the average row length is much longer for the table? How does the performance change if there is an index? Does the order in which the rows are inserted matter? What about the CPU cycles burned converting the NUMBER column to a DATE for some comparison, or converting a string to a DATE value for inserting into the table or for use in the WHERE clause? What if bind variables are used, rather than constants?
I would like to encourage people to continue sharing their thoughts on this topic.
I see where you are coming from but still disagree. I reproduced the situation in XE. With the basic DBMS_STATS call, with just estimate_percent=>NULL, it creates 1 bucket for the date and 254 for the number. I think the DBMS_STATS gathering is recognizing the ‘lumpiness’ of the numbers. It gets 8000 for each of 200101 to 200112 and doesn’t get any for 200113 to 200199, so it sees that as skew and creates the histograms. It doesn’t see any skew for dates, so it doesn’t create histograms.
If you force the stats package to a certain number of histograms for both columns, the cardinality estimates are pretty much the same.
My contention is that the datatype is affecting the SKEWONLY aspect of stats gathering and doesn’t directly affect the optimizer.
Gary, thanks again for your comments.
For the 3 people who have viewed this article (OK, there are probably more people than that), Gary has also posted a blog article on his site:
http://igor-db.blogspot.com/2010/01/cardinality-estimates-are-not.html
It is good to see that this article has sparked some degree of discussion. Gary, while your copy of XE and the OP’s copy of Oracle 10.2.0.1 created a histogram on the NUMBER column and not on the DATE column when this command was executed:
No histograms were created during my tests with Oracle 10.2.0.2, 10.2.0.4, or 11.1.0.7 when using the above command with the sample data from the OP’s test case (this was after correcting the copy-paste error as noted above). If a histogram were automatically created on the NUMBER column and the OP switches from using constants to bind variables and bind peeking is left enabled, there could be a significant price to pay for having that histogram present when the bind variable values change. Unfortunately, I do not have the book “Troubleshooting Oracle Performance” with me at the moment, but I seem to remember that the default behavior for histogram collection (important when METHOD_OPT is not specified in the DBMS_STATS call) could be adjusted (edit: pages 147-150 of “Troubleshooting Oracle Performance” describe how to adjust the defaults on 10g and 11g).
I do, however, have a copy of “Cost-Based Oracle Fundamentals” sitting next to me. Page 130 of that book shows why the following example on your blog article predicts such a low cardinality value:
-5 is below the minimum value for the ID column, while 200814 (as used in my example) falls between the minimum value and the maximum value.
This is a good discussion.
Here is the explain plan output from 10.2.0.2 using the script at the top of this article:
Here is the explain plan output from 10.2.0.4 using the script at the top of this article:
Here is the explain plan output from 11.1.0.7 using the script at the top of this article:
Good discussion. Okay, I accept that it isn’t just the way the stats are gathered, and the optimizer is involved.
Now, with a low value of 200101 and a maximum of 202501, that is a (numeric) span of 2400. The range 200810 to 200903 is 94 (or about 4% of the total)
With a low of 2001-01-01 and high of 2025-01-01 we have a span of 8766 days. The range 2008-10-01 to 2009-03-01 is 151 days (or about 1.7% of the total)
We can measure the size of date ranges in weeks (1252 and 21.5 respectively) but the ratio of the ‘range of interest’ to ‘range of existing values’ remains the same at around 1.7%
With 2.3 million values and 8000 rows per distinct value, 4% would be about 93000 from the range, add in the 8000 for the top of range and 8000 for the bottom and it comes close to the 108K estimate.
Similarly 1.7% would be around 41,000 plus the 8000 for the top and 8000 for the bottom comes close to the 57K estimate.
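Gary's arithmetic lines up with the textbook formula. A quick hedged check (assuming the sample data's 289 distinct months at 8192 rows each, roughly 2.37 million rows, and a 1/NDV adjustment per closed endpoint):

```python
# Reproducing the ~108K (NUMBER) and ~57K (DATE) estimates Gary quotes,
# using the textbook no-histogram range selectivity:
#   (range of interest / range of existing values) + 2 / num_distinct
# Statistics assumed from the sample data: 289 distinct months,
# 8192 rows per month.

NUM_ROWS = 289 * 8192   # about 2.37 million rows
NDV = 289

def between_estimate(range_span, total_span):
    sel = range_span / total_span + 2.0 / NDV
    return round(NUM_ROWS * sel)

# NUMBER column: 200101..202501 spans 2400 "units";
# 200903 - 200810 = 93 (Gary rounds this to ~94, about 4%).
print(between_estimate(93, 2400))    # roughly 108,000

# DATE column: 2001-01-01..2025-01-01 spans 8766 days;
# 2008-10-01 to 2009-03-01 is 151 days (about 1.7%).
print(between_estimate(151, 8766))   # roughly 57,000
```

The 2/NDV term contributes the "8000 for the top and 8000 for the bottom" Gary mentions, since 2,367,488 / 289 = 8192 rows per endpoint.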
If you use "id in (200110,200111,200112,200201)" and "dt in (date '2001-10-01', date '2001-11-01', date '2001-12-01', date '2002-01-01')", then they both come out with consistent estimates.
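The IN-list consistency follows directly from the equality formula: each listed value is estimated at num_rows / num_distinct, so the numeric "distance" between the values never enters the calculation. A sketch, again assuming the sample data's statistics:

```python
# Why IN-lists give consistent estimates for both datatypes: each listed
# value is estimated at num_rows / num_distinct, regardless of how far
# apart the values are numerically or how many "impossible" months lie
# between them.

NUM_ROWS = 289 * 8192   # rows in the sample table B
NDV = 289               # distinct YYYYMM values

def in_list_estimate(num_values):
    return round(num_values * NUM_ROWS / NDV)

# Four months, whether expressed as numbers or as dates:
print(in_list_estimate(4))  # 32768 = 4 * 8192, matching the actual row count
```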
So with either dates or numbers, the range estimates aren’t great. With dates it over-estimates by 8000 (number of records per distinct value). With numbers, the estimate is off by a variable amount depending on how many times you span a chunk representing the missing ’13-99’ months. It would start off by significantly under-estimating for small ranges not including a chunk, over-estimating where it does include a chunk and getting less incorrect the larger the range. At various points the incorrect estimates would balance out. Sort of like a stopped clock being right twice a day, I’d prefer not to rely on it.