Full Table Scan when Selecting Null Values

29 07 2010

July 29, 2010

Another interesting thread appeared on the OTN forums this week.  The original poster in the thread stated that Oracle Database was performing a full table scan for a SQL statement like this:

SELECT
  COUNT(*)
FROM
  T1
WHERE
  COL1 IS NULL;

While a SQL statement like this following did not perform a full table scan:

SELECT
  COUNT(*)
FROM
  T1;

The original poster stated that there is an index on the column COL1 (COL1 is actually ID_NUMBER in the SQL statement posted by the OP).  Full table scans are not always the worst execution path, however, the OP stated that only 0.1% (average of 1 in 1000) of the table’s rows contain a NULL value in the column.  For a standard B*Tree index, NULL values are not included in the index structure if the values for all columns in the index definition are NULL – thus Oracle Database could not be able to use a standard single column B*Tree index to locate all of the rows where that column contains a NULL value.

Several suggestions were offered in the thread:

  • Sybrand Bakker suggested creating a function based index similar to NVL(COL1, <somevalue>) and then modifying the WHERE clause to specify NVL(COL1, <somevalue>) = <somevalue>  – This is a good suggestion if the original query may be modified.
  • SomeoneElse suggested creating a bipmap index for COL1 since bitmap indexes do store NULL values for rows where all columns in the index structure contain NULL values.  – This is a good suggestion if the table is not subject to many changes.
  • I suggested creating a standard, composite B*Tree index on COL1, ‘ ‘, which is a technique I found in one of Richard Foote’s blog articles some time ago.  This approach requires roughly 2 additional bytes per index entry.
  • David Aldridge suggested creating a function based index that only includes the NULL values contained in COL1 using the following for the index definition CASE WHEN COL1 IS NULL THEN 1 END and then modifying the WHERE clause in the SQL statement to CASE WHEN COL1 IS NULL THEN 1 END = 1  – this is a good suggestion that will create a very small index structure, useful when it is possible to modify the original SQL statement.

A lot of great ideas in the thread (Richard Foote’s blog articles demonstrate several of these approaches).  Hemant K Chitale pointed out that there are potential problems with the suggestion that I offered, where a constant is used for the second column in the index so that NULL values in the first column are included in the index structure.  He referenced Metalink (My Oracle Support) Doc ID 551754.1 that described Bug 6737251 and one of his blog articles.  That Metalink article, as well as another mentioned by Amardeep Sidhu which caused problems for DBMS_STATS (Metalink Doc ID 559389.1, not mentioned in the thread, suggests a work around for the bug 6737251), essentially state to NOT use a constant for the second column in the composite index, but instead to use another column in the table.  I reviewed Metalink Doc ID 551754.1, and found it a bit lacking in detail (maybe it is just me?).  Below are my comments from the thread regarding that Metalink document:

I see a couple of problems with that Metalink (My Oracle Support) article… it lacks a little clarity (then again I could be seeing things that do not exist). First, the suggestion to use a pair of columns rather than a constant has at least four potential problems that I am able to identify:

  1. The second column must have a NOT NULL constraint (not mentioned in the article) – it cannot be just any secondary column.
  2. The secondary column will likely increase the index size a bit more than a single character used for the second column in the index.
  3. The secondary column will likely affect the clustering factor calculation for the index.
  4. The secondary column could affect the cardinality estimates for the index access path.

The second problem with the Metalink article is that, while it does demonstrate a bug, it does not show why the bug affects the results, nor does it explore other possibilities – like the one that Richard Foote used in his blog article. Here is a quick test case, loosely based on the Metalink test case, to demonstrate (note that I have not copied the statistics output from AUTOTRACE so that I may improve the clarity of the output):

First, the setup:

DROP TABLE T1 PURGE;

CREATE TABLE T1 (C1 NUMBER, C2 NUMBER);

INSERT INTO T1 VALUES (NULL, 1);

COMMIT;

CREATE INDEX IND_T1_1 ON T1(C1,1);
CREATE INDEX IND_T1_2 ON T1(C1,' ');

We now have a table containing a single row and two indexes – the first index uses a numeric constant for the second column in the index, while the second index uses a blank space for the second column in the index. Now continuing:

SET LINESIZE 140
SET PAGESIZE 1000
SET TRIMSPOOL ON
SET AUTOTRACE ON

SELECT /*+ INDEX(T1 IND_T1_1) */
  *
FROM
  T1
WHERE
  C1 IS NULL;

        C1         C2
---------- ----------
                    1

Execution Plan
----------------------------------------------------------
Plan hash value: 2805969644

----------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |     1 |    26 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1       |     1 |    26 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_1 |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C1" IS NULL)

Note
-----
   - dynamic sampling used for this statement

The above worked as expected. Continuing:

SELECT /*+ INDEX(T1 IND_T1_1) */
  *
FROM
  T1
WHERE
  C1 IS NULL
  AND ROWNUM<=1;

no rows selected

Execution Plan
----------------------------------------------------------
Plan hash value: 404994253

-----------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     1 |    26 |     1   (0)| 00:00:01 |
|*  1 |  COUNT STOPKEY               |          |       |       |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1       |     1 |    26 |     1   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | IND_T1_1 |     1 |       |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=1)
   3 - access("C1" IS NULL AND ROWNUM<=1)
       filter(ROWNUM<=1)

Note
-----
   - dynamic sampling used for this statement

Note that this time we encountered the bug – take a close look at the Predicate Information section of the execution plan to see why.

Now the test continues with the suggestion from Richard’s blog:

SELECT /*+ INDEX(T1 IND_T1_2) */
  *
FROM
  T1
WHERE
  C1 IS NULL;

        C1         C2
---------- ----------
                    1

Execution Plan
----------------------------------------------------------
Plan hash value: 348287884

----------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |     1 |    26 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1       |     1 |    26 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_2 |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C1" IS NULL)

Note
-----
   - dynamic sampling used for this statement

We obtained the same result as before [the correct results], continuing:

SELECT /*+ INDEX(T1 IND_T1_2) */
  *
FROM
  T1
WHERE
  C1 IS NULL
  AND ROWNUM<=1;

        C1         C2
---------- ----------
                    1

Execution Plan
----------------------------------------------------------
Plan hash value: 2383334138

-----------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     1 |    26 |     2   (0)| 00:00:01 |
|*  1 |  COUNT STOPKEY               |          |       |       |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1       |     1 |    26 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | IND_T1_2 |     1 |       |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=1)
   3 - access("C1" IS NULL)

Note
-----
   - dynamic sampling used for this statement

Note that this time we did not encounter the bug [we received the correct results], and a row was returned. Compare the Predicate Information section of the execution plan with the one that failed to produce the correct result.

Let’s remove the “dynamic sampling used for this statement” note by gathering statistics:

SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,NO_INVALIDATE=>FALSE)
BEGIN DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,NO_INVALIDATE=>FALSE); END;

*
ERROR at line 1:
ORA-03001: unimplemented feature
ORA-06512: at "SYS.DBMS_STATS", line 13159
ORA-06512: at "SYS.DBMS_STATS", line 13179
ORA-06512: at line 1

SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE)
BEGIN DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE); END;

*
ERROR at line 1:
ORA-03001: unimplemented feature
ORA-06512: at "SYS.DBMS_STATS", line 13159
ORA-06512: at "SYS.DBMS_STATS", line 13179
ORA-06512: at line 1

SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1')
BEGIN DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1'); END;

*
ERROR at line 1:
ORA-03001: unimplemented feature
ORA-06512: at "SYS.DBMS_STATS", line 13159
ORA-06512: at "SYS.DBMS_STATS", line 13179
ORA-06512: at line 1

SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T2')

PL/SQL procedure successfully completed.

Well, it appears that we hit another bug [likely the one mentioned by Amardeep Sidhu]. Note that I successfully gathered statistics on another table just to demonstrate that there was not a problem with my statistics gathering syntax. Let’s fix that problem [removing the problem that triggered the bug in the Oracle Database software]:

SQL> DROP INDEX IND_T1_1;

SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,NO_INVALIDATE=>FALSE)

PL/SQL procedure successfully completed.

That is better. Apparently, Oracle Database (at least 10.2.0.2) has problems when the second column in an index definition is a number constant, but not when the second column is a character constant.

Mohamed Houri asked a very good question about the impact on the calculated clustering factor for an index when the index definition is changed to have a blank space for the second column, which will allow the rows containing NULL values to be indexed.  A change in the clustering factor may affect the optimizer’s decision to select an index access path using that index.

Without performing a test, I would estimate that each NULL value to be included in the index structure will increase the clustering factor by a value of 1, unless all of those rows containing the NULL values are contained within a couple of table blocks, in which case the clustering factor should increase by less than 1 per NULL value – such a case would likely assume that the rows were inserted at roughly the same time.

Confusing? The NULL values in the index structure will be logically grouped together in logically adjacent index leaf blocks, when this is considered as well as the algorithm for calculating the clustering factor, the maximum increase in the clustering for this approach should be fairly easy to guess. See the clustering factor chapter in the book “Cost-Based Oracle Fundamentals”.

Let’s create a test case (note that this test case differs slightly from the test case in the thread) that hopefully explores some of the possibilities, and tries to determine the potential impact on the clustering factor.  The setup:

DROP TABLE T1 PURGE;

CREATE TABLE T1 (
  COL1 NUMBER NOT NULL,
  COL2 NUMBER,
  COL3 NUMBER,
  EMPLOYEE_ID VARCHAR2(20),
  SHIFT_DATE DATE NOT NULL,
  ROLLING_DATE DATE NOT NULL,
  INDIRECT_ID VARCHAR2(20));

INSERT INTO
  T1
SELECT
  ROWNUM COL1,
  DECODE(MOD(ROWNUM,1000),0,NULL,ROWNUM) COL2,
  DECODE(SIGN(999001-ROWNUM),1,ROWNUM,NULL) COL3,
  DECODE(TRUNC(DBMS_RANDOM.VALUE(0,5)),
          0,'MIKE',
          1,'ROB',
          2,'SAM',
          3,'JOE',
          4,'ERIC') EMPLOYEE_ID,
  TRUNC(SYSDATE)-ROUND(DBMS_RANDOM.VALUE(0,1000)) SHIFT_DATE,
  TRUNC(SYSDATE) + (1/1000)*ROWNUM ROLLING_DATE,
  DECODE(TRUNC(DBMS_RANDOM.VALUE(0,10)),
          0,'VAC',
          1,'HOL',
          2,'BEREAVE',
          3,'JURY',
          4,'ABS',
          5,'EXCUSE',
          6,'MIL',
          'OTHER') INDIRECT_ID
FROM
  DUAL
CONNECT BY
  LEVEL<=1000000;

COMMIT;

We now have 1,000,000 rows.  Columns COL2 and COL3 each contain 0.1% NULL values (COL2 has the NULL values evenly dispursed, while COL3 has the NULL values in the last 1,000 rows inserted).  Now, let’s demonstrate an example of over-indexing – do not do this in a production environment.

CREATE INDEX IND_T1_COL1 ON T1(COL1);
CREATE INDEX IND_T1_COL2 ON T1(COL2);
CREATE INDEX IND_T1_EMPL ON T1(EMPLOYEE_ID);
CREATE INDEX IND_T1_SHIFT_DATE ON T1(SHIFT_DATE);
CREATE INDEX IND_T1_ROLL_DATE ON T1(ROLLING_DATE);

CREATE INDEX IND_T1_COL2_NOT_NULL ON T1(COL2,' ');
CREATE INDEX IND_T1_COL3_NOT_NULL ON T1(COL3,' ');
CREATE INDEX IND_T1_EMPL_NOT_NULL ON T1(EMPLOYEE_ID,' ');

CREATE INDEX IND_T1_COL2_NOT_NULL2 ON T1(COL2,COL1);
CREATE INDEX IND_T1_COL2_NOT_NULL3 ON T1(COL2,SHIFT_DATE);
CREATE INDEX IND_T1_COL2_NOT_NULL4 ON T1(COL2,ROLLING_DATE);
CREATE INDEX IND_T1_COL2_NOT_NULL5 ON T1(COL2,EMPLOYEE_ID);
CREATE INDEX IND_T1_COL2_NOT_NULL6 ON T1(NVL(COL2,-1));
CREATE INDEX IND_T1_COL2_NOT_NULL7 ON T1(CASE WHEN COL2 IS NULL THEN 1 END);

CREATE INDEX IND_T1_COL3_NOT_NULL2 ON T1(COL3,COL1);
CREATE INDEX IND_T1_COL3_NOT_NULL3 ON T1(COL3,SHIFT_DATE);
CREATE INDEX IND_T1_COL3_NOT_NULL4 ON T1(COL3,ROLLING_DATE);
CREATE INDEX IND_T1_COL3_NOT_NULL5 ON T1(COL3,EMPLOYEE_ID);
CREATE INDEX IND_T1_COL3_NOT_NULL6 ON T1(NVL(COL3,-1));
CREATE INDEX IND_T1_COL3_NOT_NULL7 ON T1(CASE WHEN COL3 IS NULL THEN 1 END);

CREATE INDEX IND_T1_EMPL_NOT_NULL2 ON T1(EMPLOYEE_ID,COL1);
CREATE INDEX IND_T1_EMPL_NOT_NULL3 ON T1(NVL(EMPLOYEE_ID,'UNKNOWN'));
CREATE INDEX IND_T1_EMPL_NOT_NULL4 ON T1(CASE WHEN EMPLOYEE_ID IS NULL THEN 'UNKNOWN' END);

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>NULL)

The statistics collection script was specified to sample 100% of the table and indexes.  For my test run on Oracle Database 11.1.0.7, I received the following results (I added the index definition to the end of each line of output):

SELECT
  INDEX_NAME,
  DISTINCT_KEYS,
  BLEVEL,
  LEAF_BLOCKS,
  CLUSTERING_FACTOR,
  SAMPLE_SIZE
FROM
  USER_INDEXES
WHERE
  TABLE_NAME='T1'
ORDER BY
  INDEX_NAME;

INDEX_NAME            DISTINCT_KEYS BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR SAMPLE_SIZE
--------------------- ------------- ------ ----------- ----------------- -----------
IND_T1_COL1                 1000000      2        2226              6330     1000000  (COL1)
IND_T1_COL2                  999000      2        2224              6330      999000  (COL2)
IND_T1_COL2_NOT_NULL         999001      2        2505              7330     1000000  (COL2,' ')
IND_T1_COL2_NOT_NULL2       1000000      2        2921              7330     1000000  (COL2,COL1)
IND_T1_COL2_NOT_NULL3        999640      2        3343              7330     1000000  (COL2,SHIFT_DATE)
IND_T1_COL2_NOT_NULL4       1000000      2        3343              7330     1000000  (COL2,ROLLING_DATE)
IND_T1_COL2_NOT_NULL5        999005      2        2841              7330     1000000  (COL2,EMPLOYEE_ID)
IND_T1_COL2_NOT_NULL6        999001      2        2226              7330     1000000  (NVL(COL2,-1))
IND_T1_COL2_NOT_NULL7             1      1           2              1000        1000  (CASE WHEN COL2 IS NULL THEN 1 END)
IND_T1_COL3_NOT_NULL         999001      2        2505              6331     1000000  (COL3,' ')
IND_T1_COL3_NOT_NULL2       1000000      2        2921              6330     1000000  (COL3,COL1)
IND_T1_COL3_NOT_NULL3        999638      2        3343              7150     1000000  (COL3,SHIFT_DATE)
IND_T1_COL3_NOT_NULL4       1000000      2        3343              6330     1000000  (COL3,ROLLING_DATE)
IND_T1_COL3_NOT_NULL5        999005      2        2841              6359     1000000  (COL3,EMPLOYEE_ID)
IND_T1_COL3_NOT_NULL6        999001      2        2226              6331     1000000  (NVL(COL3,-1))
IND_T1_COL3_NOT_NULL7             1      1           2                 7        1000  (CASE WHEN COL3 IS NULL THEN 1 END)
IND_T1_EMPL                       5      2        2147             31585     1000000  (EMPLOYEE_ID)
IND_T1_EMPL_NOT_NULL              5      2        2425             31585     1000000  (EMPLOYEE_ID,' ')
IND_T1_EMPL_NOT_NULL2       1000000      2        2840             31598     1000000  (EMPLOYEE_ID,COL1)
IND_T1_EMPL_NOT_NULL3             5      2        2147             31585     1000000  (NVL(EMPLOYEE_ID,'UNKNOWN'))
IND_T1_EMPL_NOT_NULL4             0      0           0                 0           0  (CASE WHEN EMPLOYEE_ID IS NULL THEN 'UNKNOWN' END)
IND_T1_ROLL_DATE            1000000      2        2646              6330     1000000  (ROLLING_DATE)
IND_T1_SHIFT_DATE              1001      2        2646            925286     1000000  (SHIFT_DATE)

SELECT
  BLOCKS
FROM
  USER_TABLES
WHERE
  TABLE_NAME='T1';

BLOCKS
------
  6418

For the 1,000,000 row table, 1 of every 1,000 rows contains a NULL value in column COL2, and the clustering factor increased by exactly 1,000 (1,000,000/1,000) for the index on that column that used a single space for the second column in the index (IND_T1_COL2_NOT_NULL). The results were different when the majority of the NULL values were grouped together, as was the case for column COL3, where the clustering factor increased by a value of 1. In this particular test case, with the NULL values evenly dispursed, the clustering factor is about the same regardless of whether a blank space is used for the second column, or an actual table column.  The clustering factor calculation for the index definition NVL(COL2, -1) and NVL(COL3, -1) was identical to that for the index definition COL2, ‘ ‘ and COL3, ‘ ‘, respectively.  The index structure using the CASE syntax was extremely small when 0.1% of the rows in the table contained NULL values, and thus produced a very small clustering factor because there were few index entries.  The test case did not include a bitmap index, because the clustering factor statistic for such an index has a different meaning.  A clustering factor of 6,330 is the lowest possible value for the indexes not using the CASE syntax (I expected it to be closer to the number of blocks in the table). 

Your results in a production environment may be very different from those in my simple test case, so be certain to test your solution.  So, the question: What would you do if you were faced with the problem identified by the original poster in the OTN thread?