SQL – Experimenting with Case Insensitive Searches

4 06 2010

June 4, 2010

Have you ever read about something, or heard about something, and wanted to be able to reproduce it?  Have you ever been warned that doing something is not a good idea because a specific problem is certain to happen, yet you do it anyway just to see if the problem can be avoided?  I recently read (again) about performing case insensitive searches in Oracle… as is the default behavior on SQL Server.  So, let’s try a couple of experiments.

First, we need a test table with a primary key index:

CREATE TABLE T9 (
  C1 VARCHAR2(20),
  C2 VARCHAR2(200),
  PRIMARY KEY (C1));

INSERT INTO
  T9
SELECT
  CHR(65+MOD(ROWNUM-1,26))||
    CHR(65+MOD(CEIL(ROWNUM/26)-1,26))||
    CHR(65+MOD(CEIL(ROWNUM/676)-1,26)),
  RPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=17576
UNION ALL
SELECT
  CHR(97+MOD(ROWNUM-1,26))||
    CHR(97+MOD(CEIL(ROWNUM/26)-1,26))||
    CHR(97+MOD(CEIL(ROWNUM/676)-1,26)),
  RPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=17576
UNION ALL
SELECT
  CHR(65+MOD(ROWNUM-1,26))||
    CHR(65+MOD(CEIL(ROWNUM/26)-1,26))||
    CHR(97+MOD(CEIL(ROWNUM/676)-1,26)),
  RPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=17576
UNION ALL
SELECT
  CHR(65+MOD(ROWNUM-1,26))||
    CHR(97+MOD(CEIL(ROWNUM/26)-1,26))||
    CHR(97+MOD(CEIL(ROWNUM/676)-1,26)),
  RPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=17576
UNION ALL
SELECT
  CHR(97+MOD(ROWNUM-1,26))||
    CHR(97+MOD(CEIL(ROWNUM/26)-1,26))||
    CHR(65+MOD(CEIL(ROWNUM/676)-1,26)),
  RPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=17576
UNION ALL
SELECT
  CHR(97+MOD(ROWNUM-1,26))||
    CHR(65+MOD(CEIL(ROWNUM/26)-1,26))||
    CHR(65+MOD(CEIL(ROWNUM/676)-1,26)),
  RPAD('A',200,'A')
FROM
  DUAL
CONNECT BY
  LEVEL<=17576;

COMMIT;

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T9',CASCADE=>TRUE)

The above test table is filled with one column having a three letter sequence (AAA, BAA, CAA,… ABA, BBA,… AZZ, … ) with a couple variations of upper and lowercase letters.  The other column is padded to 200 characters to intentionally discourage Oracle’s optimizer from using a full table scan when a suitable index is available.  Now the test (on Oracle Database 11.1.0.7).

To begin, we will explicitly list the upper and lowercase versions of the letters that are of interest:

SET AUTOTRACE TRACEONLY EXPLAIN

SELECT
  C1,
  C2
FROM
  T9
WHERE
  C1 IN ('ABC','abc','ABc','Abc','abC','aBC');

Plan hash value: 2861409042

---------------------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |     6 |  1230 |    13   (0)| 00:00:01 |
|   1 |  INLIST ITERATOR             |              |       |       |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T9           |     6 |  1230 |    13   (0)| 00:00:01 |
|*  3 |    INDEX UNIQUE SCAN         | SYS_C0016172 |     6 |       |     7   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("C1"='ABC' OR "C1"='ABc' OR "C1"='Abc' OR "C1"='aBC' OR "C1"='abC' OR
              "C1"='abc')

As can be seen, an index unique scan was performed for each of the values in the IN list.  The calculated cost is 13, and the optimizer is correctly predicting that 6 rows will be returned.

Next, we will try the brute force method, using the UPPER function on the C1 column:

SELECT
  C1,
  C2
FROM
  T9
WHERE
  UPPER(C1) = 'ABC';

Execution Plan
----------------------------------------------------------
Plan hash value: 3973213776

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1055 |   211K|   380   (2)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| T9   |  1055 |   211K|   380   (2)| 00:00:02 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(UPPER("C1")='ABC')

The calculated cost is now 380, compared to the earlier value of 13, and Oracle is predicting that 1,055 rows will be returned – the increased cost likely explains why the optimizer did not use a full table scan for the IN list version of the SQL statement.

Next, let’s try something that is not supposed to work, using an index when a function is applied to that index’s column in the WHERE clause:

SELECT /*+ INDEX(T9) */
  C1,
  C2
FROM
  T9
WHERE
  UPPER(C1) = 'ABC';

Plan hash value: 1084614729

--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |  1055 |   211K|  1332   (1)| 00:00:06 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T9           |  1055 |   211K|  1332   (1)| 00:00:06 |
|*  2 |   INDEX FULL SCAN           | SYS_C0016172 |  1055 |       |   277   (2)| 00:00:02 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(UPPER("C1")='ABC')

The primary key index was used, but note that this access is an index full scan where every block in the index is accessed (using single block reads), rather than an index unique scan as happened with the first SQL statement.  The calculated cost also increased again, this time to 1,332.

Now what?  Well, we can tell Oracle to perform case insensitive matches:

ALTER SESSION SET NLS_SORT=BINARY_CI;
ALTER SESSION SET NLS_COMP=LINGUISTIC;

SELECT
  C1,
  C2
FROM
  T9
WHERE
  C1 = 'ABC';

Plan hash value: 3973213776

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |   205 |   382   (3)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| T9   |     1 |   205 |   382   (3)| 00:00:02 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NLSSORT("C1",'nls_sort=''BINARY_CI''')=HEXTORAW('61626300') )

But now we are back to a full table scan and the calculated cost of that full table scan is 382, rather than the 380 that we saw earlier.  However, Oracle is expecting to retrieve only a single row now, not the actual 6 rows nor the 1,055 rows when a full table scan appeared earlier in this article.

So let’s force an index access path with a hint just to see what happens:

SELECT /*+ INDEX(T9) */
  C1,
  C2
FROM
  T9
WHERE
  C1 = 'ABC';

Plan hash value: 1084614729

--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |     1 |   205 |   279   (3)| 00:00:02 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T9           |     1 |   205 |   279   (3)| 00:00:02 |
|*  2 |   INDEX FULL SCAN           | SYS_C0016172 |     1 |       |   278   (3)| 00:00:02 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(NLSSORT("C1",'nls_sort=''BINARY_CI''')=HEXTORAW('61626300') )

There is the index access path, and notice that the calculated cost for this access path is less than that of the full table scan, yet Oracle did not automatically select that access path.  We saw this behavior in an earlier article too.  The predicted cardinality still shows only a single row is expected to be returned.

We still have not tried a function based index, so we will switch back to case sensitive matches and try again:

ALTER SESSION SET NLS_COMP=BINARY;

CREATE INDEX IND_T9_C1_UPPER ON T9(UPPER(C1));
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T9',CASCADE=>TRUE)

SELECT
  C1,
  C2
FROM
  T9
WHERE
  UPPER(C1) = 'ABC';

Execution Plan
----------------------------------------------------------
Plan hash value: 1260941705

-----------------------------------------------------------------------------------------------
| Id  | Operation                   | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                 |     6 |  1254 |     7   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T9              |     6 |  1254 |     7   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IND_T9_C1_UPPER |     6 |       |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access(UPPER("C1")='ABC')

The function based index access path produced a lower cost plan than the IN list plan at the start of this article, and the cardinality estimate is correct.

Now let’s change back to a case insensitive search of column C1 to see what happens:

ALTER SESSION SET NLS_SORT=BINARY_CI;
ALTER SESSION SET NLS_COMP=LINGUISTIC;

SELECT
  C1,
  C2
FROM
  T9
WHERE
  UPPER(C1) = 'ABC';

Execution Plan
----------------------------------------------------------
Plan hash value: 3973213776

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     6 |  1254 |   383   (3)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| T9   |     6 |  1254 |   383   (3)| 00:00:02 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NLSSORT(UPPER("C1"),'nls_sort=''BINARY_CI''')=HEXTORAW('61
              626300') )

The optimizer did not use our function based index, so a full table scan was performed.

Trying again:

SELECT
  C1,
  C2
FROM
  T9
WHERE
  C1 = 'ABC';

Plan hash value: 3973213776

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |   205 |   382   (3)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| T9   |     1 |   205 |   382   (3)| 00:00:02 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(NLSSORT("C1",'nls_sort=''BINARY_CI''')=HEXTORAW('61626300') )

Well, the WHERE clause did not match the function based index definition, so of course that index was not used.

One more time:

SELECT /*+ INDEX(T9) */
  C1,
  C2
FROM
  T9
WHERE
  C1 = 'ABC';

Plan hash value: 1084614729

--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |     1 |   205 |   279   (3)| 00:00:02 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T9           |     1 |   205 |   279   (3)| 00:00:02 |
|*  2 |   INDEX FULL SCAN           | SYS_C0016172 |     1 |       |   278   (3)| 00:00:02 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(NLSSORT("C1",'nls_sort=''BINARY_CI''')=HEXTORAW('61626300') )

I think that we saw that plan earlier in this article.

Now who wanted Oracle to behave like SQL Server anyway?


Actions

Information

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s




Follow

Get every new post delivered to your Inbox.

Join 144 other followers

%d bloggers like this: