Which is Most Efficient when Selecting Rows: EXISTS, IN, or a JOIN

29 12 2009

December 29, 2009

This post is a follow up to the previous post that questioned which approach is most efficient when deleting rows.

The Oracle documentation for 10g R2 states the following

“In certain circumstances, it is better to use IN rather than EXISTS. In general, if the selective predicate is in the subquery, then use IN. If the selective predicate is in the parent query, then use EXISTS.”

“Sometimes, Oracle can rewrite a subquery when used with an IN clause to take advantage of selectivity specified in the subquery. This is most beneficial when the most selective filter appears in the subquery and there are indexes on the join columns. Conversely, using EXISTS is beneficial when the most selective filter is in the parent query. This allows the selective predicates in the parent query to be applied before filtering the rows against the EXISTS criteria.”

The above quotes seems to be similar to the quotes provided in the earlier blog article.  Let’s set up the same test tables as were used in the previous blog article:

CREATE TABLE T1 (
  C1 NUMBER,
  FILLER VARCHAR2(300),
  PRIMARY KEY (C1));

CREATE TABLE T2 (
  C1 NUMBER,
  FILLER VARCHAR2(300),
  PRIMARY KEY (C1));

INSERT INTO
  T1
SELECT
  ROWNUM,
  LPAD('A',300,'A')
FROM
  (SELECT
    ROWNUM NR
  FROM
    DUAL
  CONNECT BY
    LEVEL<=1000) V1,
  (SELECT
    ROWNUM NR
  FROM
    DUAL
  CONNECT BY
    LEVEL<=1000) V2;

INSERT INTO
  T2
SELECT
  ROWNUM*3,
  LPAD('A',300,'A')
FROM
  (SELECT
    ROWNUM NR
  FROM
    DUAL
  CONNECT BY
    LEVEL<=333) V1,
  (SELECT
    ROWNUM NR
  FROM
    DUAL
  CONNECT BY
    LEVEL<=1000) V2;

COMMIT;

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER, TABNAME=>'T1')
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER, TABNAME=>'T2')

Next, we will set up to capture a 10053 trace for three SQL statements:

  1. Uncorrelated IN subquery
  2. Correlated IN subquery
  3. EXISTS subquery
ALTER SESSION SET TRACEFILE_IDENTIFIER = 'SELECT_METHODS';
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';

ALTER SYSTEM FLUSH BUFFER_CACHE;
ALTER SYSTEM FLUSH BUFFER_CACHE;

SET TIMING ON

SPOOL select_methods.txt

SELECT C1 FROM T1 WHERE C1 IN (SELECT C1 FROM T2);

ALTER SYSTEM FLUSH BUFFER_CACHE;
ALTER SYSTEM FLUSH BUFFER_CACHE;

SELECT C1 FROM T1 WHERE C1 IN (SELECT C1 FROM T2 WHERE T2.C1=T1.C1);

ALTER SYSTEM FLUSH BUFFER_CACHE;
ALTER SYSTEM FLUSH BUFFER_CACHE;

SELECT C1 FROM T1 WHERE EXISTS (SELECT 1 FROM T2 WHERE T2.C1=T1.C1);

SPOOL OFF

ALTER SESSION SET EVENTS '10053 trace name context off';

On Oracle Database 11.1.0.7, I received the following output in the select_methods.txt file (slightly cleaned up):

SQL> SELECT C1 FROM T1 WHERE C1 IN (SELECT C1 FROM T2);

Elapsed: 00:00:43.54

SQL> ALTER SYSTEM FLUSH BUFFER_CACHE;
SQL> ALTER SYSTEM FLUSH BUFFER_CACHE;

SQL> SELECT C1 FROM T1 WHERE C1 IN (SELECT C1 FROM T2 WHERE T2.C1=T1.C1);

Elapsed: 00:00:43.18

SQL> ALTER SYSTEM FLUSH BUFFER_CACHE;
SQL> ALTER SYSTEM FLUSH BUFFER_CACHE;

SQL> SELECT C1 FROM T1 WHERE EXISTS (SELECT 1 FROM T2 WHERE T2.C1=T1.C1);

Elapsed: 00:00:43.33

The above output seems to indicate that the correlated IN subquery completed the fastest, followed by the EXISTS subquery, and then the uncorrelated IN subquery.  Let’s check the 10053 trace file.

******************************************
----- Current SQL Statement for this session (sql_id=19gjmx8y5rcg9) -----
SELECT C1 FROM T1 WHERE C1 IN (SELECT C1 FROM T2)
*******************************************
...
Final query after transformations:******* UNPARSED QUERY IS *******
SELECT "T1"."C1" "C1" FROM "TESTUSER"."T2" "T2","TESTUSER"."T1" "T1" WHERE "T1"."C1"="T2"."C1"
...
============
Plan Table
============
---------------------------------------------+-----------------------------------+
| Id  | Operation              | Name        | Rows  | Bytes | Cost  | Time      |
---------------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT       |             |       |       |   576 |           |
| 1   |  NESTED LOOPS          |             |  325K | 3252K |   576 |  00:00:07 |
| 2   |   INDEX FAST FULL SCAN | SYS_C0016167|  977K | 4883K |   514 |  00:00:07 |
| 3   |   INDEX UNIQUE SCAN    | SYS_C0016168|     1 |     5 |     0 |           |
---------------------------------------------+-----------------------------------+
Predicate Information:
----------------------
3 - access("C1"="C1")

...
...
...

******************************************
----- Current SQL Statement for this session (sql_id=5hkw5pa3stxvr) -----
SELECT C1 FROM T1 WHERE C1 IN (SELECT C1 FROM T2 WHERE T2.C1=T1.C1)
*******************************************
...
Final query after transformations:******* UNPARSED QUERY IS *******
SELECT "T1"."C1" "C1" FROM "TESTUSER"."T2" "T2","TESTUSER"."T1" "T1" WHERE "T1"."C1"="T2"."C1" AND "T2"."C1"="T1"."C1"
...
============
Plan Table
============
---------------------------------------------+-----------------------------------+
| Id  | Operation              | Name        | Rows  | Bytes | Cost  | Time      |
---------------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT       |             |       |       |   576 |           |
| 1   |  NESTED LOOPS          |             |  325K | 3252K |   576 |  00:00:07 |
| 2   |   INDEX FAST FULL SCAN | SYS_C0016167|  977K | 4883K |   514 |  00:00:07 |
| 3   |   INDEX UNIQUE SCAN    | SYS_C0016168|     1 |     5 |     0 |           |
---------------------------------------------+-----------------------------------+
Predicate Information:
----------------------
3 - access("C1"="C1")

...
...
...

******************************************
----- Current SQL Statement for this session (sql_id=4x73yz9t7dazj) -----
SELECT C1 FROM T1 WHERE EXISTS (SELECT 1 FROM T2 WHERE T2.C1=T1.C1)
*******************************************
...
Final query after transformations:******* UNPARSED QUERY IS *******
SELECT "T1"."C1" "C1" FROM "TESTUSER"."T2" "T2","TESTUSER"."T1" "T1" WHERE "T2"."C1"="T1"."C1"
...
============
Plan Table
============
---------------------------------------------+-----------------------------------+
| Id  | Operation              | Name        | Rows  | Bytes | Cost  | Time      |
---------------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT       |             |       |       |   576 |           |
| 1   |  NESTED LOOPS SEMI     |             |  325K | 3252K |   576 |  00:00:07 |
| 2   |   INDEX FAST FULL SCAN | SYS_C0016167|  977K | 4883K |   514 |  00:00:07 |
| 3   |   INDEX UNIQUE SCAN    | SYS_C0016168|  108K |  541K |     0 |           |
---------------------------------------------+-----------------------------------+
Predicate Information:
----------------------
3 - access("T2"."C1"="T1"."C1")

Notice anything strange about the “Final query after transformations”?  The transformed uncorrelated IN subquery and the transformed EXISTS subquery are identical (if you ignore that T1.C1= T2.C1 is actually listed as T2.C1=T1.C1.  Notice also that the transformed queries are now standard joins rather than the IN query being transformed into an EXISTS query, or vice-versa.  The transformed version of the correlated IN subquery is a bit odd – it seems that the query optimizer in 11.1.0.7 did not automatically eliminate the duplicate T1.C1= T2.C1 (if I recall correctly 10.2.0.4 does remove the duplicate predicate when writing the “Final query”).  OK, the transformed versions of the SQL statements are for all purposes identical.  The plan for the EXISTS query shows that the join is a NESTED LOOPS SEMI, while the other plans show a join of NESTED LOOPS.

Let’s look at the DBMS_XPLAN output (note that STATISTICS_LEVEL was set to the default of TYPICAL):

SET TIMING OFF
SET AUTOTRACE OFF

SET PAGESIZE 2000
SET LINESIZE 140

ALTER SYSTEM FLUSH BUFFER_CACHE;
ALTER SYSTEM FLUSH BUFFER_CACHE;

SPOOL select_methods2.txt

SELECT /*+ GATHER_PLAN_STATISTICS */ C1 FROM T1 WHERE C1 IN (SELECT C1 FROM T2);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

ALTER SYSTEM FLUSH BUFFER_CACHE;
ALTER SYSTEM FLUSH BUFFER_CACHE;

SELECT /*+ GATHER_PLAN_STATISTICS */ C1 FROM T1 WHERE C1 IN (SELECT C1 FROM T2 WHERE T2.C1=T1.C1);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

ALTER SYSTEM FLUSH BUFFER_CACHE;
ALTER SYSTEM FLUSH BUFFER_CACHE;

SELECT /*+ GATHER_PLAN_STATISTICS */ C1 FROM T1 WHERE EXISTS (SELECT 1 FROM T2 WHERE T2.C1=T1.C1);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

SPOOL OFF 

The output of the above is as follows (note that the output has been cleaned up slightly):

SQL_ID  gwd4d6sk75hhk, child number 0                                
-------------------------------------                                
SELECT /*+ GATHER_PLAN_STATISTICS */ C1 FROM T1 WHERE C1 IN (SELECT C1 FROM T2)                          
Plan hash value: 319633161        

---------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name         | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |              |      1 |        |    333K|00:00:00.05 |   64418 |   2547 |
|   1 |  NESTED LOOPS         |              |      1 |    333K|    333K|00:00:00.05 |   64418 |   2547 |
|   2 |   INDEX FAST FULL SCAN| SYS_C0016167 |      1 |   1000K|   1000K|00:00:00.02 |   24052 |   1883 |
|*  3 |   INDEX UNIQUE SCAN   | SYS_C0016168 |   1000K|      1 |    333K|00:00:00.84 |   40366 |    664 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):                  
---------------------------------------------------                  
   3 - access("C1"="C1")          

-

SQL_ID  fcm7ptb886bzt, child number 0                                
-------------------------------------                                
SELECT /*+ GATHER_PLAN_STATISTICS */ C1 FROM T1 WHERE C1 IN (SELECT C1 FROM T2 WHERE T2.C1=T1.C1)        

Plan hash value: 319633161        

---------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name         | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |              |      1 |        |    333K|00:00:00.09 |   64418 |   2547 |
|   1 |  NESTED LOOPS         |              |      1 |    333K|    333K|00:00:00.09 |   64418 |   2547 |
|   2 |   INDEX FAST FULL SCAN| SYS_C0016167 |      1 |   1000K|   1000K|00:00:00.03 |   24052 |   1883 |
|*  3 |   INDEX UNIQUE SCAN   | SYS_C0016168 |   1000K|      1 |    333K|00:00:01.06 |   40366 |    664 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):                  
---------------------------------------------------                  
   3 - access("C1"="C1")          

-

SQL_ID  4ym7p8815nquy, child number 0                                
-------------------------------------                                
SELECT /*+ GATHER_PLAN_STATISTICS */ C1 FROM T1 WHERE EXISTS (SELECT 1 FROM T2 WHERE T2.C1=T1.C1)        

Plan hash value: 2371405353       

---------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name         | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |              |      1 |        |    333K|00:00:00.06 |   64418 |   2547 |
|   1 |  NESTED LOOPS SEMI    |              |      1 |    333K|    333K|00:00:00.06 |   64418 |   2547 |
|   2 |   INDEX FAST FULL SCAN| SYS_C0016167 |      1 |   1000K|   1000K|00:00:00.03 |   24052 |   1883 |
|*  3 |   INDEX UNIQUE SCAN   | SYS_C0016168 |   1000K|    110K|    333K|00:00:00.94 |   40366 |    664 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):                  
---------------------------------------------------                  
   3 - access("T2"."C1"="T1"."C1")

The output captured during the 10053 trace implied: “The above output seems to indicate that the correlated IN subquery completed the fastest, followed by the EXISTS subquery, and then the uncorrelated IN subquery.”  The output captured during the DBMS_XPLAN script seems to show that the uncorrelated subquery completed the fastest, followed by the EXISTS subquery, and then the correlated subquery.

OK, so the question remains: Which is most efficient when selecting rows: EXISTS, IN, or a JOIN?  It is pretty hard to tell when the optimizer automatically re-writes the queries that we submit.

It might be interesting to recheck the output with a new test case that permits:

  1. NULL values in one or both tables.
  2. Duplicate values in one or both tables.
  3. Larger data sets.

Actions

Information

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s




Follow

Get every new post delivered to your Inbox.

Join 139 other followers

%d bloggers like this: