SQL – Experimenting with Case Insensitive Searches

4 06 2010

June 4, 2010

Have you ever read about something, or heard about something, and wanted to be able to reproduce it?  Have you ever been warned that doing something is not a good idea because a specific problem is certain to happen, yet you do it anyway just to see if the problem can be avoided?  I recently read (again) about performing case insensitive searches in Oracle… as is the default behavior on SQL Server.  So, let’s try a couple of experiments.

First, we need a test table with a primary key index:

CREATE TABLE T9 (
  C1 VARCHAR2(20),
  C2 VARCHAR2(200),
  PRIMARY KEY (C1));

INSERT INTO
  T9
SELECT
  CHR(65+MOD(ROWNUM-1,26))||
    CHR(65+MOD(CEIL(ROWNUM/26)-1,26))||
    CHR(65+MOD(CEIL(ROWNUM/676)-1,26)),
  RPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=17576
UNION ALL
SELECT
  CHR(97+MOD(ROWNUM-1,26))||
    CHR(97+MOD(CEIL(ROWNUM/26)-1,26))||
    CHR(97+MOD(CEIL(ROWNUM/676)-1,26)),
  RPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=17576
UNION ALL
SELECT
  CHR(65+MOD(ROWNUM-1,26))||
    CHR(65+MOD(CEIL(ROWNUM/26)-1,26))||
    CHR(97+MOD(CEIL(ROWNUM/676)-1,26)),
  RPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=17576
UNION ALL
SELECT
  CHR(65+MOD(ROWNUM-1,26))||
    CHR(97+MOD(CEIL(ROWNUM/26)-1,26))||
    CHR(97+MOD(CEIL(ROWNUM/676)-1,26)),
  RPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=17576
UNION ALL
SELECT
  CHR(97+MOD(ROWNUM-1,26))||
    CHR(97+MOD(CEIL(ROWNUM/26)-1,26))||
    CHR(65+MOD(CEIL(ROWNUM/676)-1,26)),
  RPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=17576
UNION ALL
SELECT
  CHR(97+MOD(ROWNUM-1,26))||
    CHR(65+MOD(CEIL(ROWNUM/26)-1,26))||
    CHR(65+MOD(CEIL(ROWNUM/676)-1,26)),
  RPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=17576;

COMMIT;

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T9',CASCADE=>TRUE)

The above test table is filled with one column having a three letter sequence (AAA, BAA, CAA,… ABA, BBA,… AZZ, … ) with a couple variations of upper and lowercase letters.  The other column is padded to 200 characters to intentionally discourage Oracle’s optimizer from using a full table scan when a suitable index is available.  Now the test (on Oracle Database 11.1.0.7).

To begin, we will explicitly list the upper and lowercase versions of the letters that are of interest:

SET AUTOTRACE TRACEONLY EXPLAIN

SELECT
  C1,
  C2
FROM
  T9
WHERE
  C1 IN ('ABC','abc','ABc','Abc','abC','aBC');

Plan hash value: 2861409042

---------------------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |     6 |  1230 |    13   (0)| 00:00:01 |
|   1 |  INLIST ITERATOR             |              |       |       |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T9           |     6 |  1230 |    13   (0)| 00:00:01 |
|*  3 |    INDEX UNIQUE SCAN         | SYS_C0016172 |     6 |       |     7   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("C1"='ABC' OR "C1"='ABc' OR "C1"='Abc' OR "C1"='aBC' OR "C1"='abC' OR
              "C1"='abc')

As can be seen, an index unique scan was performed for each of the values in the IN list.  The calculated cost is 13, and the optimizer is correctly predicting that 6 rows will be returned.

Next, we will try the brute force method, using the UPPER function on the C1 column:

SELECT
  C1,
  C2
FROM
  T9
WHERE
  UPPER(C1) = 'ABC';

Execution Plan
----------------------------------------------------------
Plan hash value: 3973213776

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1055 |   211K|   380   (2)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| T9   |  1055 |   211K|   380   (2)| 00:00:02 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(UPPER("C1")='ABC')

The calculated cost is now 380, compared to the earlier value of 13, and Oracle is predicting that 1,055 rows will be returned – the increased cost likely explains why the optimizer did not use a full table scan for the IN list version of the SQL statement.

Next, let’s try something that is not supposed to work, using an index when a function is applied to that index’s column in the WHERE clause:

SELECT /*+ INDEX(T9) */
  C1,
  C2
FROM
  T9
WHERE
  UPPER(C1) = 'ABC';

Plan hash value: 1084614729

--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |  1055 |   211K|  1332   (1)| 00:00:06 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T9           |  1055 |   211K|  1332   (1)| 00:00:06 |
|*  2 |   INDEX FULL SCAN           | SYS_C0016172 |  1055 |       |   277   (2)| 00:00:02 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(UPPER("C1")='ABC')

The primary key index was used, but note that this access is an index full scan where every block in the index is accessed (using single block reads), rather than an index unique scan as happened with the first SQL statement.  The calculated cost also increased again, this time to 1,332.

Now what?  Well, we can tell Oracle to perform case insensitive matches:

ALTER SESSION SET NLS_SORT=BINARY_CI;
ALTER SESSION SET NLS_COMP=LINGUISTIC;

SELECT
  C1,
  C2
FROM
  T9
WHERE
  C1 = 'ABC';

Plan hash value: 3973213776

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |   205 |   382   (3)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| T9   |     1 |   205 |   382   (3)| 00:00:02 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NLSSORT("C1",'nls_sort=''BINARY_CI''')=HEXTORAW('61626300') )

But now we are back to a full table scan and the calculated cost of that full table scan is 382, rather than the 380 that we saw earlier.  However, Oracle is expecting to retrieve only a single row now, not the actual 6 rows nor the 1,055 rows when a full table scan appeared earlier in this article.

So let’s force an index access path with a hint just to see what happens:

SELECT /*+ INDEX(T9) */
  C1,
  C2
FROM
  T9
WHERE
  C1 = 'ABC';

Plan hash value: 1084614729

--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |     1 |   205 |   279   (3)| 00:00:02 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T9           |     1 |   205 |   279   (3)| 00:00:02 |
|*  2 |   INDEX FULL SCAN           | SYS_C0016172 |     1 |       |   278   (3)| 00:00:02 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(NLSSORT("C1",'nls_sort=''BINARY_CI''')=HEXTORAW('61626300') )

There is the index access path, and notice that the calculated cost for this access path is less than that of the full table scan, yet Oracle did not automatically select that access path.  We saw this behavior in an earlier article too.  The predicted cardinality still shows only a single row is expected to be returned.

We still have not tried a function based index, so we will switch back to case sensitive matches and try again:

ALTER SESSION SET NLS_COMP=BINARY;

CREATE INDEX IND_T9_C1_UPPER ON T9(UPPER(C1));
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T9',CASCADE=>TRUE)

SELECT
  C1,
  C2
FROM
  T9
WHERE
  UPPER(C1) = 'ABC';

Execution Plan
----------------------------------------------------------
Plan hash value: 1260941705

-----------------------------------------------------------------------------------------------
| Id  | Operation                   | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                 |     6 |  1254 |     7   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T9              |     6 |  1254 |     7   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IND_T9_C1_UPPER |     6 |       |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access(UPPER("C1")='ABC')

The function based index access path produced a lower cost plan than the IN list plan at the start of this article, and the cardinality estimate is correct.

Now let’s change back to a case insensitive search of column C1 to see what happens:

ALTER SESSION SET NLS_SORT=BINARY_CI;
ALTER SESSION SET NLS_COMP=LINGUISTIC;

SELECT
  C1,
  C2
FROM
  T9
WHERE
  UPPER(C1) = 'ABC';

Execution Plan
----------------------------------------------------------
Plan hash value: 3973213776

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     6 |  1254 |   383   (3)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| T9   |     6 |  1254 |   383   (3)| 00:00:02 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NLSSORT(UPPER("C1"),'nls_sort=''BINARY_CI''')=HEXTORAW('61
              626300') )

The optimizer did not use our function based index, so a full table scan was performed.

Trying again:

SELECT
  C1,
  C2
FROM
  T9
WHERE
  C1 = 'ABC';

Plan hash value: 3973213776

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |   205 |   382   (3)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| T9   |     1 |   205 |   382   (3)| 00:00:02 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NLSSORT("C1",'nls_sort=''BINARY_CI''')=HEXTORAW('61626300') )

Well, the WHERE clause did not match the function based index definition, so of course that index was not used.

One more time:

SELECT /*+ INDEX(T9) */
  C1,
  C2
FROM
  T9
WHERE
  C1 = 'ABC';

Plan hash value: 1084614729

--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |     1 |   205 |   279   (3)| 00:00:02 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T9           |     1 |   205 |   279   (3)| 00:00:02 |
|*  2 |   INDEX FULL SCAN           | SYS_C0016172 |     1 |       |   278   (3)| 00:00:02 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(NLSSORT("C1",'nls_sort=''BINARY_CI''')=HEXTORAW('61626300') )

I think that we saw that plan earlier in this article.

Now who wanted Oracle to behave like SQL Server anyway?





Measuring Numbers – Is this a Valid Comparison?

4 06 2010

June 4, 2010

I encountered an interesting test case in the “Oracle SQL Recipes” book, but I fear that my degree in mathematics is causing me to fail to fully comprehend the test case.  I developed a parallel test case that possibly answers the questions that are left unanswered.  Here is my test case:

CREATE TABLE T8 (
  NUMBER_DIV NUMBER,
  BIN_DBL_DIV BINARY_DOUBLE,
  NUMBER_VALUE NUMBER,
  BIN_DBL_VALUE BINARY_DOUBLE,
  NUMBER_VALUE2 NUMBER(7,2));

INSERT INTO
  T8
SELECT
  ROWNUM,
  ROWNUM,
  1000/ROWNUM,
  1000/ROWNUM,
  1000/ROWNUM
FROM
  DUAL
CONNECT BY
  LEVEL<=10000;

COMMIT;

COLUMN NUMBER_DIV FORMAT 99999
COLUMN BIN_DBL_DIV FORMAT 99999
COLUMN NUMBER_VALUE FORMAT 99990.00000000000000000000
COLUMN BIN_DBL_VALUE FORMAT 99990.00000000000000000000
COLUMN NUMBER_VALUE2 FORMAT 99990.000
COLUMN VND FORMAT 999
COLUMN VBDD FORMAT 9999
COLUMN VNV FORMAT 999
COLUMN VBDV FORMAT 9999
COLUMN VNV2 FORMAT 9999

SET LINESIZE 140
SET TRIMSPOOL ON
SET PAGESIZE 1000
SPOOL NUMBERTEST.TXT

SELECT
  NUMBER_DIV,
  BIN_DBL_DIV,
  NUMBER_VALUE,
  BIN_DBL_VALUE,
  NUMBER_VALUE2,
  VSIZE(NUMBER_DIV) VND,
  VSIZE(BIN_DBL_DIV) VBDD,
  VSIZE(NUMBER_VALUE) VNV,
  VSIZE(BIN_DBL_VALUE) VBDV,
  VSIZE(NUMBER_VALUE2) VNV2
FROM
  T8
ORDER BY
  NUMBER_DIV;

SPOOL OFF

Quoting from page 190 of the book:

With the value one-third stored in each column, we can use the VSIZE function to show it was much more complicated to store this [the value of 1/3] with  decimal precision [using the NUMBER datatype], taking nearly three times the space [when compared to the BINARY_DOUBLE datatype].

Here is the output from my script for the row containing the value one-third:

NUMBER_DIV BIN_DBL_DIV            NUMBER_VALUE           BIN_DBL_VALUE NUMBER_VALUE2  VND  VBDD  VNV  VBDV  VNV2
---------- ----------- ----------------------- ----------------------- ------------- ---- ----- ---- ----- -----
      3000        3000      0.3333333333333333      0.3333333333333333         0.330    2     8   21     8     2

As the book states, the column with the NUMBER datatype requires 21 bytes, while the column with the BINARY_DOUBLE datatype requires just 8 bytes to store the value one-third in the table.  What, if anything,  is wrong with the comparison?

Hint: To conserve space, the column format for the NUMBER_VALUE and BIN_DBL_VALUE columns in the above output was changed from:

99990.00000000000000000000
to:
99990.0000000000000000

There is an interesting description of the NUMBER and BINARY_DOUBLE (or other similar datatypes) datatypes in the book “Troubleshooting Oracle Performance“.








Follow

Get every new post delivered to your Inbox.

Join 139 other followers