February 15, 2010
Here is a fun test on Oracle Database 11.1.0.7 and 11.2.0.1 (all posted tests are from 11.2.0.1).
First, we will create a table with indexes on a couple of the columns, and then collect statistics for the table with histograms on a couple of the columns:
CREATE TABLE T1 (
  C1 NUMBER NOT NULL,
  C2 NUMBER NOT NULL,
  C3 NUMBER NOT NULL,
  C4 NUMBER NOT NULL,
  C5 VARCHAR2(30) NOT NULL,
  C6 VARCHAR2(30) NOT NULL,
  FILLER VARCHAR2(200),
  PRIMARY KEY (C1));

INSERT INTO T1
SELECT
  ROWNUM,
  ROWNUM,
  TRUNC(ROWNUM/100+1),
  TRUNC(ROWNUM/100+1),
  CHR(65+TRUNC(ROWNUM/10000))||TRUNC(ROWNUM/100+1),
  CHR(65+TRUNC(ROWNUM/10000))||TRUNC(ROWNUM/100+1),
  LPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=100000;

COMMIT;

CREATE INDEX IND_T1_C3 ON T1(C3);
CREATE INDEX IND_T1_C4 ON T1(C4);
CREATE INDEX IND_T1_C5 ON T1(C5);

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>100,METHOD_OPT=>'FOR COLUMNS SIZE 254 C2, C4, C6')
Just for a review of that GATHER_TABLE_STATS procedure, from the documentation:
When gathering statistics on a table, DBMS_STATS gathers information about the data distribution of the columns within the table. The most basic information about the data distribution is the maximum value and minimum value of the column. However, this level of statistics may be insufficient for the optimizer’s needs if the data within the column is skewed. For skewed data distributions, histograms can also be created as part of the column statistics to describe the data distribution of a given column. Histograms are described in more detail in “Viewing Histograms”.
Histograms are specified using the METHOD_OPT argument of the DBMS_STATS gathering procedures. Oracle recommends setting the METHOD_OPT to FOR ALL COLUMNS SIZE AUTO. With this setting, Oracle Database automatically determines which columns require histograms and the number of buckets (size) of each histogram. You can also manually specify which columns should have histograms and the size of each histogram.
The table created by the above script will have 100,000 rows with indexes on columns C3, C4, and C5. Columns C2, C4, and C6 will have histograms with 254 buckets. Let’s see the maximum values for the table columns (column C2 values are identical to C1, column C4 values are identical to C3, column C6 values are identical to C5):
SELECT
  MAX(C1) MAX_C1,
  MAX(C3) MAX_C3,
  MAX(C5) MAX_C5
FROM
  T1;

MAX_C1     MAX_C3 MAX_C5
------ ---------- ------
100000       1001 K1001
The maximum value for column C3 is 1001 – most of the distinct values in that column have 100 matching rows, except for the value 1001 which has a single row. We will try a couple of tests with my VBS tool for automatically generating DBMS_XPLANs, with the DBMS_XPLAN Type set to “ALLSTATS LAST” and Stats Level set to “ALL”. First, we will try the maximum value for column C3 (and also C4):
SELECT * FROM T1 WHERE C3=1001;
The returned execution plan is displayed below (after this point the execution plan will be displayed directly below the SQL statement):
SQL_ID  0vcvak7bgbdzt, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C3=1001

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      1 |00:00:00.01 |       3 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   1000 |      1 |00:00:00.01 |       3 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    400 |      1 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C3"=1001)
OK, it was more or less expected that the cardinality estimate would be incorrect here because there is no histogram on column C3. It is a bit odd that the optimizer predicts that the index will return 400 rows, yet predicts that 1,000 rows will then be returned from the table. Now let’s try the same SQL statement referencing column C4:
SELECT * FROM T1 WHERE C4=1001;

SQL_ID  d3zfa447tsjrz, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C4=1001

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      1 |00:00:00.01 |       3 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |    100 |      1 |00:00:00.01 |       3 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |    100 |      1 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C4"=1001)
The estimate of 100 rows is closer to the actual number of rows returned than the estimate produced for the same SQL statement using column C3, so the histogram probably helped. What happens when we specify values for C3 and C4 that exceed the maximum values in those columns?
SELECT * FROM T1 WHERE C3=1101;

SQL_ID  7hy399svng33n, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C3=1101

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   1000 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    400 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C3"=1101)

SELECT * FROM T1 WHERE C4=1101;

SQL_ID  at676unwj7uk1, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C4=1101

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |     90 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |     90 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C4"=1101)
In the above, the SQL statement that accesses column C3 produced the same cardinality estimates even though the specified value exceeded the maximum value for the column by roughly 10%. When the same restriction was applied to column C4, the optimizer predicted that 10% fewer rows would be returned. Interesting…
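As a side note, one way to see which columns actually have histograms (and how many buckets) is to query the data dictionary. The following is a minimal sketch using the standard USER_TAB_COL_STATISTICS view (the HISTOGRAM column is available in release 10.1 and later); the comment describes what I would expect to see, not verified output:

SELECT
  COLUMN_NAME,
  NUM_DISTINCT,
  NUM_BUCKETS,
  HISTOGRAM
FROM
  USER_TAB_COL_STATISTICS
WHERE
  TABLE_NAME='T1'
ORDER BY
  COLUMN_NAME;

-- Expected (hedged): C2, C4, and C6 likely show 254 buckets with a HEIGHT
-- BALANCED histogram, while columns that have no column-level statistics may
-- not appear in this view at all.  DBA_TAB_COLUMNS, which lists every column,
-- is used later in this article.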
SELECT * FROM T1 WHERE C3=1201;

SQL_ID  f94b21zwvsn11, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C3=1201

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   1000 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    400 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C3"=1201)

SELECT * FROM T1 WHERE C4=1201;

SQL_ID  4sf3hjx44u1sn, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C4=1201

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |     80 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |     80 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C4"=1201)
Exceeding the maximum value by 20% returned the same cardinality estimate for the SQL statement using column C3 as seen earlier, while the cardinality estimate decreased by 20% for the SQL statement accessing column C4.
SELECT * FROM T1 WHERE C3=1901;

SQL_ID  86z7nbb5u2p26, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C3=1901

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   1000 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    400 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C3"=1901)

SELECT * FROM T1 WHERE C4=1901;

SQL_ID  08rg4uf562h4x, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C4=1901

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |     10 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |     10 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C4"=1901)
When exceeding the maximum value by 90% we see the same pattern: the cardinality estimates for the first SQL statement were unaffected, while the cardinality estimate for the second SQL statement decreased by 90%. With a value of 2001 specified for column C4, the optimizer’s predicted cardinality decreased to 1 row.
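The decreasing estimates for column C4 appear to follow a simple linear decay: the roughly 100 rows-per-value estimate seems to be scaled by how far the requested value falls beyond the column’s high value, relative to the low value to high value span (1 through 1001 here, a span of 1,000). The following SQL is only a sketch of that apparent arithmetic, not a statement of the optimizer’s internal formula (the 100000/1000 rows-per-value figure and the floor at one row are my assumptions):

SELECT
  VAL,
  GREATEST(ROUND((100000/1000) * GREATEST(1 - (VAL - 1001)/(1001 - 1), 0)), 1) SCALED_EST
FROM
  (SELECT 1101 VAL FROM DUAL UNION ALL
   SELECT 1201 FROM DUAL UNION ALL
   SELECT 1901 FROM DUAL UNION ALL
   SELECT 2001 FROM DUAL);

The sketch reproduces the estimates observed above (90, 80, 10, and 1 rows), which suggests, but does not prove, that this is the scaling in play.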
What about value ranges?
SELECT * FROM T1 WHERE C3 BETWEEN 1101 AND 1201;

SQL_ID  95zzq8vb523gf, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C3 BETWEEN 1101 AND 1201

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |    250 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    450 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C3">=1101 AND "C3"<=1201)
Notice that the optimizer is estimating that 250 rows will be returned, which is less than the sum of the estimated 1,000 rows that would be returned for the value 1101 and the estimated 1,000 rows for value 1201. Let’s try again specifying column C4:
SELECT * FROM T1 WHERE C4 BETWEEN 1101 AND 1201;

SQL_ID  9t2fv8dcz7jsv, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C4 BETWEEN 1101 AND 1201

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |    100 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |     90 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C4">=1101 AND "C4"<=1201)
Interesting – the index range scan is predicted to return 90 rows, yet the table access is predicted to return 100 rows (edit: Oracle 10.2.0.2 predicts 90 rows for both the index and the table). Let’s try again specifying ranges that exceed the maximum values by 90% to 100%:
SELECT * FROM T1 WHERE C3 BETWEEN 1901 AND 2001;

SQL_ID  99btm350uvvbp, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C3 BETWEEN 1901 AND 2001

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |    250 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    450 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C3">=1901 AND "C3"<=2001)

SELECT * FROM T1 WHERE C4 BETWEEN 1901 AND 2001;

SQL_ID  bnwyf98m74q26, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C4 BETWEEN 1901 AND 2001

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      0 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |    100 |      0 |00:00:00.01 |       2 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |     10 |      0 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C4">=1901 AND "C4"<=2001)
The first query is predicting the same number of rows will be returned as seen earlier. The second query is again predicting that 100 rows will be returned from the table, fed by the 10 rows that are predicted to be returned from the index (edit: Oracle 10.2.0.2 predicts 10 rows for both the index and the table). When the range was changed to 1101 through 2001, the same cardinality estimates displayed for the range of 1101 to 1201 were again returned for both SQL statements.
If the above accurately depicts what happens when the maximum recorded value for a column is exceeded, what might happen when statistics are not gathered on a regular basis? Hold that thought.
Let’s try again using value ranges for C3 and C4 that do not exceed the maximum values for those columns:
SELECT * FROM T1 WHERE C3 BETWEEN 101 AND 201;

SQL_ID  8jd9h693mbkvc, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C3 BETWEEN 101 AND 201

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |  10100 |00:00:00.02 |     547 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |    250 |  10100 |00:00:00.02 |     547 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    450 |  10100 |00:00:00.01 |     124 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C3">=101 AND "C3"<=201)

SELECT * FROM T1 WHERE C4 BETWEEN 101 AND 201;

SQL_ID  2bk0njs0atfpw, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C4 BETWEEN 101 AND 201

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |  10100 |00:00:00.02 |     547 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |  10039 |  10100 |00:00:00.02 |     547 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |  10039 |  10100 |00:00:00.01 |     124 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C4">=101 AND "C4"<=201)
Once again, the first SQL statement is estimating that the index range scan will return 450 rows which will then cause the table to return 250 rows. In actuality, both the index and the table return 10,100 rows (do we possibly have a statistics problem here?). The second query returns cardinality estimates that are closer to the actual number of rows returned – the histogram helped here.
Let’s try again with a wider value range:
SELECT * FROM T1 WHERE C3 BETWEEN 101 AND 1401;

SQL_ID  0bapwrbn3ch8j, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C3 BETWEEN 101 AND 1401

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |  90001 |00:00:00.13 |    4865 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |    250 |  90001 |00:00:00.13 |    4865 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    450 |  90001 |00:00:00.04 |    1090 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C3">=101 AND "C3"<=1401)

SELECT * FROM T1 WHERE C4 BETWEEN 101 AND 1401;

SQL_ID  3ppn43z5ukur6, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C4 BETWEEN 101 AND 1401

Plan hash value: 3617692013

------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |  90001 |00:00:00.05 |    4147 |
|*  1 |  TABLE ACCESS FULL| T1   |      1 |  89961 |  90001 |00:00:00.05 |    4147 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(("C4">=101 AND "C4"<=1401))
Again, the cardinality estimates and execution plan are the same as before for the query using column C3, even though we are selecting 90% of the rows in the table. The query using column C4 switched to a full table scan, and has a much more accurate cardinality estimate. This same pattern holds true regardless of the values specified for the low and high ends of the range.
Let’s switch back to SQL*Plus and try a couple of experiments with bind variables (note that I am actually submitting the SQL statements using my Toy Project so that I do not have to watch all of the rows scroll on the screen):
VARIABLE N1 NUMBER
VARIABLE N2 NUMBER

ALTER SESSION SET STATISTICS_LEVEL='ALL';

EXEC :N1:=101
EXEC :N2:=1401

SET ARRAYSIZE 100

SELECT * FROM T1 WHERE C3 BETWEEN :N1 AND :N2;
...
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SQL_ID  2ksn64btq6fx4, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C3 BETWEEN :1 AND :2

Plan hash value: 108045900

----------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |      1 |        |  90001 |00:00:00.18 |    4865 |
|*  1 |  FILTER                      |           |      1 |        |  90001 |00:00:00.18 |    4865 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1        |      1 |    250 |  90001 |00:00:00.13 |    4865 |
|*  3 |    INDEX RANGE SCAN          | IND_T1_C3 |      1 |    450 |  90001 |00:00:00.04 |    1090 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(:1<=:2)
   3 - access("C3">=:1 AND "C3"<=:2)
The same execution plan is returned as when we used literals (constants), so bind peeking is probably working. Now let’s try the query that accesses column C4:
SELECT * FROM T1 WHERE C4 BETWEEN :N1 AND :N2;
...
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SQL_ID  gwgk5h8u3k4tp, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C4 BETWEEN :1 AND :2

Plan hash value: 3332582666

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |  90001 |00:00:00.10 |    4147 |
|*  1 |  FILTER            |      |      1 |        |  90001 |00:00:00.10 |    4147 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |  89961 |  90001 |00:00:00.05 |    4147 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(:1<=:2)
   2 - filter(("C4">=:1 AND "C4"<=:2))
Bind variable peeking was used to obtain a close estimate for the cardinalities. So, what happens when we change the value of bind variable N2?
EXEC :N2:=101

SELECT * FROM T1 WHERE C3 BETWEEN :N1 AND :N2;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SQL_ID  2ksn64btq6fx4, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C3 BETWEEN :1 AND :2

Plan hash value: 108045900

----------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |      1 |        |    100 |00:00:00.01 |       7 |
|*  1 |  FILTER                      |           |      1 |        |    100 |00:00:00.01 |       7 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1        |      1 |    250 |    100 |00:00:00.01 |       7 |
|*  3 |    INDEX RANGE SCAN          | IND_T1_C3 |      1 |    450 |    100 |00:00:00.01 |       3 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(:1<=:2)
   3 - access("C3">=:1 AND "C3"<=:2)

SELECT * FROM T1 WHERE C4 BETWEEN :N1 AND :N2;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SQL_ID  gwgk5h8u3k4tp, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C4 BETWEEN :1 AND :2

Plan hash value: 3332582666

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |    100 |00:00:00.01 |    3263 |
|*  1 |  FILTER            |      |      1 |        |    100 |00:00:00.01 |    3263 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |  89961 |    100 |00:00:00.01 |    3263 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(:1<=:2)
   2 - filter(("C4">=:1 AND "C4"<=:2))
Bind peeking in the presence of a histogram came back to bite us – ideally, Oracle would have used an index range scan for this set of bind variable values.
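Because these tests were performed on Oracle 11.1 and 11.2, adaptive cursor sharing could eventually help here: after enough executions with noticeably different selectivities, the cursor may be marked bind aware and a new child cursor with the index plan may be created. That did not happen within the single re-execution shown above, but a quick way to watch for it is a query like the following sketch against V$SQL (the SQL_ID is the one displayed in the plan output above):

SELECT
  SQL_ID,
  CHILD_NUMBER,
  IS_BIND_SENSITIVE,
  IS_BIND_AWARE,
  EXECUTIONS,
  BUFFER_GETS
FROM
  V$SQL
WHERE
  SQL_ID='gwgk5h8u3k4tp';

The IS_BIND_SENSITIVE and IS_BIND_AWARE columns exist in 11.1 and later; whether a second, bind aware child cursor ever appears depends on how many times the statement is executed with the different bind values.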
Let’s add another 100,000 rows to the table without gathering statistics and see what happens to the cardinality estimates in the plans:
INSERT INTO T1
SELECT
  (100000+ROWNUM),
  (100000+ROWNUM),
  TRUNC((100000+ROWNUM)/100+1),
  TRUNC((100000+ROWNUM)/100+1),
  CHR(65+TRUNC((100000+ROWNUM)/10000))||TRUNC((100000+ROWNUM)/100+1),
  CHR(65+TRUNC((100000+ROWNUM)/10000))||TRUNC((100000+ROWNUM)/100+1),
  LPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=100000;

COMMIT;

ALTER SYSTEM FLUSH SHARED_POOL;
Here are the execution plans:
SELECT * FROM T1 WHERE C3=1901;

SQL_ID  86z7nbb5u2p26, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C3=1901

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |    100 |00:00:00.01 |       8 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   1000 |    100 |00:00:00.01 |       8 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |    400 |    100 |00:00:00.01 |       4 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C3"=1901)

SELECT * FROM T1 WHERE C4=1901;

SQL_ID  08rg4uf562h4x, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C4=1901

Plan hash value: 7035821

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |    100 |00:00:00.01 |       8 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |     10 |    100 |00:00:00.01 |       8 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C4 |      1 |     10 |    100 |00:00:00.01 |       4 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C4"=1901)
The cardinality estimates are unchanged from what we saw earlier. The first query is estimating 10 times too many rows will be returned, and the second query is estimating 10 times too few rows will be returned. Let’s delete the rows that were just added and check a couple of statistics:
DELETE FROM T1 WHERE C1>100000;

100000 rows deleted.

COMMIT;

SELECT
  SUBSTR(COLUMN_NAME,1,3) COL,
  DENSITY,
  NUM_BUCKETS,
  LAST_ANALYZED
FROM
  DBA_TAB_COLUMNS
WHERE
  TABLE_NAME='T1'
ORDER BY
  1;

COL    DENSITY NUM_BUCKETS LAST_ANAL
--- ---------- ----------- ---------
C1
C2      .00001         254 14-FEB-10
C3
C4   .00099998         254 14-FEB-10
C5
C6   .00099998         254 14-FEB-10
FIL
No density values for columns C1, C3, or C5 and the LAST_ANALYZED column is NULL for those same entries (edit: same results on Oracle 10.2.0.2). Let’s try collecting statistics again without specifying the columns for which histograms should be created:
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>100,NO_INVALIDATE=>FALSE)

SELECT
  SUBSTR(COLUMN_NAME,1,3) COL,
  DENSITY,
  NUM_BUCKETS,
  LAST_ANALYZED
FROM
  DBA_TAB_COLUMNS
WHERE
  TABLE_NAME='T1'
ORDER BY
  1;

COL    DENSITY NUM_BUCKETS LAST_ANAL
--- ---------- ----------- ---------
C1      .00001           1 14-FEB-10
C2      .00001           1 14-FEB-10
C3  .000999001           1 14-FEB-10
C4  .000999001           1 14-FEB-10
C5  .000999001           1 14-FEB-10
C6  .000999001           1 14-FEB-10
FIL          1           1 14-FEB-10
Now we have no histograms, but the density and LAST_ANALYZED columns are populated for all rows returned by the above query (edit: same results on Oracle 10.2.0.2). For fun let’s retry one of the queries that returned odd cardinality estimates earlier:
SELECT * FROM T1 WHERE C3 BETWEEN 101 AND 201;

SQL_ID  8jd9h693mbkvc, child number 0
-------------------------------------
SELECT * FROM T1 WHERE C3 BETWEEN 101 AND 201

Plan hash value: 1220227203

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |  10100 |00:00:00.02 |     547 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |  10200 |  10100 |00:00:00.02 |     547 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C3 |      1 |  10200 |  10100 |00:00:00.01 |     124 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C3">=101 AND "C3"<=201)
The estimated cardinality is just about as accurate as it was with the query that accessed column C4 which had a histogram.
What are your thoughts? Is the METHOD_OPT parameter of DBMS_STATS.GATHER_TABLE_STATS only used to specify how histograms will be collected for columns, or does that parameter also specify for which columns column-level statistics should be gathered? If you again collect statistics using the DBMS_STATS.GATHER_TABLE_STATS command found near the start of this article, the query of DBA_TAB_COLUMNS returns the following, which is what was originally expected (edit: same results on Oracle 10.2.0.2):
COL    DENSITY NUM_BUCKETS LAST_ANAL
--- ---------- ----------- ---------
C1      .00001           1 14-FEB-10
C2      .00001         254 14-FEB-10
C3  .000999001           1 14-FEB-10
C4   .00099998         254 14-FEB-10
C5  .000999001           1 14-FEB-10
C6   .00099998         254 14-FEB-10
FIL          1           1 14-FEB-10
——-
Follow-up (8 hours after this article was scheduled to appear):
It is important at times to check more than one source in the Oracle documentation (bold/italic emphasis is mine):
“METHOD_OPT – The value controls column statistics collection and histogram creation.”
Here is an example where Christian Antognini tried to drive that point home:
“Since the syntax is FOR ALL INDEXED COLUMNS, you are gathering statistics for all indexed columns only. I.e. not for all columns. FOR ALL COLUMNS should be used for that…”
“When the option FOR ALL INDEXED COLUMNS is specified, columns statistics are gathered only for the indexed columns.”
“My point was that the parameter method_opt not only impacts histograms, but also column statistics.”
Greg Rahn also made the point in this blog article.
“The METHOD_OPT parameter of DBMS_STATS controls two things:
- on which columns statistics will be collected
- on which columns histograms will be collected (and how many buckets)”
The message of this blog article is to make certain that you know what the GATHER_TABLE_STATS procedure is actually doing when you use the METHOD_OPT parameter (or alter the default value for that parameter). The apparent benefit from having the histogram in place might actually be a false benefit – it might be that the column statistics for that column were never being updated at all until you started creating a histogram on the column (assuming that you were previously collecting histograms only on other columns in the table, while believing that column statistics were still being updated for the remaining columns).
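If histograms really are wanted on just columns C2, C4, and C6, while basic column statistics (low value, high value, number of distinct values, density) are still maintained for every column, one commonly suggested approach is to combine the two METHOD_OPT clauses. The following is a sketch to experiment with, not a recommendation; verify the resulting DBA_TAB_COLUMNS output on your own release:

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>100,METHOD_OPT=>'FOR ALL COLUMNS SIZE 1 FOR COLUMNS SIZE 254 C2, C4, C6')

The 'FOR ALL COLUMNS SIZE 1' portion should gather column-level statistics (without histograms) for every column, while the trailing 'FOR COLUMNS SIZE 254 ...' portion requests histograms on the three listed columns.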
Incidentally, if instead of using this to collect statistics for the table:
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>100,METHOD_OPT=>'FOR COLUMNS SIZE 254 C2, C4, C6')
You used this:
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>100,METHOD_OPT=>'FOR ALL INDEXED COLUMNS size skewonly')
You would end up with the following when executing the query against DBA_TAB_COLUMNS:
COL    DENSITY NUM_BUCKETS LAST_ANAL
--- ---------- ----------- ---------
C1      .00001           1 15-FEB-10
C2
C3  .000999001           1 15-FEB-10
C4  .000999001           1 15-FEB-10
C5   .00099998         254 15-FEB-10
C6
FIL
The above output confirms Christian Antognini’s comment in the OTN thread – column-level statistics were not collected for columns C2, C6, and FILLER. There is effectively no histogram on columns C3 and C4 (a single bucket, therefore no histogram), but the cardinality estimates in the plan will be close because column-level statistics are present for columns C3 and C4.
What are your thoughts? Is the METHOD_OPT parameter of DBMS_STATS.GATHER_TABLE_STATS only used to specify how histograms will be collected for columns, or does that parameter also specify for which columns column-level statistics should be gathered?
Is the “intelligence” in statistics collection that checks SYS.COL_USAGE$ the reason?
Narendra,
I am not sure that I understand your question. Please see the follow-up to the blog article (at the end of the blog article) to see if your question is answered in that section.
I think Narendra tried to explain the issue Hemant covered here:
http://hemantoracledba.blogspot.com/2008/08/more-tests-of-colusage.html
Basically, when “for all columns size auto” is used, the data distribution and the workload of the columns (column usage) are considered, as mentioned in the doc.
Charles,
Apologies. My followup point was completely misguided.
Coskan,
Thanks for filling in the missing details, Narendra’s comment makes a little more sense with the links that you provided.
Incidentally, if you do not want to delete from the data dictionary as Hemant did (with appropriate cautions mentioned), you can see whether or not SYS.COL_USAGE$ is used during a statistics collection with an approach like this (using one of the tables created for the March 18, 2010 blog article):
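(A sketch only, using table T1 from this article as a stand-in for that table; the 10046 trace level and the TRACEFILE_IDENTIFIER value are arbitrary choices, and METHOD_OPT is left at FOR ALL COLUMNS SIZE AUTO so that the automatic histogram decision is exercised.)

ALTER SESSION SET TRACEFILE_IDENTIFIER = 'COL_USAGE_TEST';
ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT FOREVER, LEVEL 4';

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,METHOD_OPT=>'FOR ALL COLUMNS SIZE AUTO')

ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT OFF';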
If you pass the resulting trace file through a program, such as TKPROF, you will see a variety of SQL statements with dep= set to 1 or greater. One of those SQL statements looks something like this: