Nested Loops Join – the Smaller Table is the Driving Table, the Larger Table is the Driving Table

21 03 2011

March 21, 2011

I occasionally see discussions about Oracle Database behavior that make me wonder… is it true, can I generate a test case that validates the statement, and just as important, can I generate a test case that refutes the statement.  An interesting question was posted the the OTN forums regarding apparently conflicting advice that the original posted had received regarding which table the Oracle Database’s optimizer would select as the driving table in a nested loops join; if there is a large table and a small table to be joined by nested loops, which will be the driving (outer) table?  Jonathan Lewis provided a good deal of insight in the thread, indicating that the optimizer’s decision is not as clear cut as suggested by the OP’s sources.

Not as clear cut as the optimizer always selects the smaller table as the driving (outer) table?  Maybe the documentation is able to provide a little more insight into the topic:

“The optimizer uses nested loop joins when joining small number of rows, with a good driving condition between the two tables. You drive from the outer loop to the inner loop, so the order of tables in the execution plan is important.

The outer loop is the driving row source. It produces a set of rows for driving the join condition. The row source can be a table accessed using an index scan or a full table scan. Also, the rows can be produced from any other operation. For example, the output from a nested loop join can be used as a row source for another nested loop join.

The inner loop is iterated for every row returned from the outer loop, ideally by an index scan. If the access path for the inner loop is not dependent on the outer loop, then you can end up with a Cartesian product; for every iteration of the outer loop, the inner loop produces the same set of rows. Therefore, you should use other join methods when two independent row sources are joined together.”

I guess that helps a little – the driving table (or row source) is the outer table, and the inner table (or row source) is checked for every row returned by the driving table (after the specified filter predicates are applied to the driving table).  However, the documentation does not explicitly state that the outer table must be the smaller table.  Let’s check a couple of books:

The books were mostly in agreement, except for the last two.  Since it appears that there might be some disagreement whether the smaller table is selected by the optimizer as the driving table or the larger table is selected by the optimizer as the driving table, we need a test case (as repeatedly requested as evidence by Sean in the OTN thread). 

The following noworkload system statistics will be used:

SELECT
  PNAME,
  PVAL1
FROM
  SYS.AUX_STATS$
ORDER BY
  PNAME;

PNAME                PVAL1
--------------- ----------
CPUSPEED
CPUSPEEDNW      2116.57559
DSTART
DSTOP
FLAGS                    1
IOSEEKTIM               10
IOTFRSPEED            4096
MAXTHR
MBRC
MREADTIM
SLAVETHR
SREADTIM
STATUS 

First the tables.  We will start simple, with one table (T3) having 100 rows and another table (T4) having 10 rows:

CREATE TABLE T3 (
  C1 NUMBER,
  C2 NUMBER,
  C3 NUMBER,
  C4 VARCHAR2(20),
  PADDING VARCHAR2(200));

CREATE TABLE T4 (
  C1 NUMBER,
  C2 NUMBER,
  C3 NUMBER,
  C4 VARCHAR2(20),
  PADDING VARCHAR2(200));

INSERT INTO
  T3
SELECT
  ROWNUM C1,
  1000000-ROWNUM C2,
  MOD(ROWNUM-1,1000) C3,
  TO_CHAR(SYSDATE+MOD(ROWNUM-1,10000),'DAY') C4,
  LPAD(' ',200,'A') PADDING
FROM
  DUAL
CONNECT BY
  LEVEL<=100;

INSERT INTO
  T4
SELECT
  ROWNUM C1,
  1000000-ROWNUM C2,
  MOD(ROWNUM-1,1000) C3,
  TO_CHAR(SYSDATE+MOD(ROWNUM-1,10000),'DAY') C4,
  LPAD(' ',200,'A') PADDING
FROM
  DUAL
CONNECT BY
  LEVEL<=10;

COMMIT;

CREATE INDEX IND_T3_C1 ON T3(C1);
CREATE INDEX IND_T3_C2 ON T3(C2);
CREATE INDEX IND_T3_C3 ON T3(C3);

CREATE INDEX IND_T4_C1 ON T4(C1);
CREATE INDEX IND_T4_C2 ON T4(C2);
CREATE INDEX IND_T4_C3 ON T4(C3);

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T3',CASCADE=>TRUE,ESTIMATE_PERCENT=>NULL)
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T4',CASCADE=>TRUE,ESTIMATE_PERCENT=>NULL) 

We are able to easily produce an example where the “smallest” table is selected as the driving table (note that I had to add a hint to specify a nested loop join in several of these examples):

SET AUTOTRACE TRACEONLY EXPLAIN

SELECT /*+ USE_NL(T3 T4) */
  T3.C1,
  T3.C2,
  T3.C3,
  T3.C4,
  T4.C1,
  T4.C2,
  T4.C3,
  T4.C4
FROM
  T3,
  T4
WHERE
  T3.C1=T4.C1;

Execution Plan
----------------------------------------------------------
Plan hash value: 567778651

------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |    10 |   420 |    13   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                |           |       |       |            |          |
|   2 |   NESTED LOOPS               |           |    10 |   420 |    13   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL         | T4        |    10 |   210 |     3   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN          | IND_T3_C1 |     1 |       |     0   (0)| 00:00:01 |
|   5 |   TABLE ACCESS BY INDEX ROWID| T3        |     1 |    21 |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T3"."C1"="T4"."C1") 

If we stop at that point, we could declare quite simply that the optimizer selects the smaller table as the driving table.  But wait a minute, take a look at this example where the optimizer selected the largest table as the driving table:

SELECT /*+ USE_NL(T3 T4) */
  T3.C1,
  T3.C2,
  T3.C3,
  T3.C4,
  T4.C1,
  T4.C2,
  T4.C3,
  T4.C4
FROM
  T3,
  T4
WHERE
  T3.C1=T4.C1
  AND T3.C2=1;

Execution Plan
----------------------------------------------------------
Plan hash value: 4214127300

-------------------------------------------------------------------------------------------
| Id  | Operation                     | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |           |     1 |    42 |     3   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                 |           |       |       |            |          |
|   2 |   NESTED LOOPS                |           |     1 |    42 |     3   (0)| 00:00:01 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T3        |     1 |    21 |     2   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN          | IND_T3_C2 |     1 |       |     1   (0)| 00:00:01 |
|*  5 |    INDEX RANGE SCAN           | IND_T4_C1 |     1 |       |     0   (0)| 00:00:01 |
|   6 |   TABLE ACCESS BY INDEX ROWID | T4        |     1 |    21 |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T3"."C2"=1)
   5 - access("T3"."C1"="T4"."C1") 

The above execution plans were generated on 11.2.0.2, which sometimes differs a bit from older Oracle Database release versions when nested loops joins are used (note the two nested loops joins), however we are able to hint the optimizer to generate the older style nested loops join:

SELECT /*+ USE_NL(T3 T4) OPTIMIZER_FEATURES_ENABLE('10.2.0.4') */
  T3.C1,
  T3.C2,
  T3.C3,
  T3.C4,
  T4.C1,
  T4.C2,
  T4.C3,
  T4.C4
FROM
  T3,
  T4
WHERE
  T3.C1=T4.C1;

Execution Plan
----------------------------------------------------------
Plan hash value: 2465588182

-----------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |    10 |   420 |    13   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T3        |     1 |    21 |     1   (0)| 00:00:01 |
|   2 |   NESTED LOOPS              |           |    10 |   420 |    13   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL        | T4        |    10 |   210 |     3   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN         | IND_T3_C1 |     1 |       |     0   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T3"."C1"="T4"."C1") 

 

SELECT /*+ USE_NL(T3 T4) OPTIMIZER_FEATURES_ENABLE('10.2.0.4') */
  T3.C1,
  T3.C2,
  T3.C3,
  T3.C4,
  T4.C1,
  T4.C2,
  T4.C3,
  T4.C4
FROM
  T3,
  T4
WHERE
  T3.C1=T4.C1
  AND T3.C2=1;

Execution Plan
----------------------------------------------------------
Plan hash value: 3446668716

-------------------------------------------------------------------------------------------
| Id  | Operation                     | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |           |     1 |    42 |     3   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID  | T4        |     1 |    21 |     1   (0)| 00:00:01 |
|   2 |   NESTED LOOPS                |           |     1 |    42 |     3   (0)| 00:00:01 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T3        |     1 |    21 |     2   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN          | IND_T3_C2 |     1 |       |     1   (0)| 00:00:01 |
|*  5 |    INDEX RANGE SCAN           | IND_T4_C1 |     1 |       |     0   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T3"."C2"=1)
   5 - access("T3"."C1"="T4"."C1") 

We found one case where the larger table was selected as the driving table, so the books and articles that simply state absolutely that the smallest table will be the driving table are not completely correct.  Maybe the larger table is only selected as the driving table when both tables are small?  Let’s test that theory by creating a couple of more tables:

SET AUTOTRACE OFF

CREATE TABLE T1 (
  C1 NUMBER,
  C2 NUMBER,
  C3 NUMBER,
  C4 VARCHAR2(20),
  PADDING VARCHAR2(200));

CREATE TABLE T2 (
  C1 NUMBER,
  C2 NUMBER,
  C3 NUMBER,
  C4 VARCHAR2(20),
  PADDING VARCHAR2(200));

INSERT INTO
  T1
SELECT
  ROWNUM C1,
  1000000-ROWNUM C2,
  MOD(ROWNUM-1,1000) C3,
  TO_CHAR(SYSDATE+MOD(ROWNUM-1,10000),'DAY') C4,
  LPAD(' ',200,'A') PADDING
FROM
  DUAL
CONNECT BY
  LEVEL<=1000000;

INSERT INTO
  T2
SELECT
  ROWNUM C1,
  1000000-ROWNUM C2,
  MOD(ROWNUM-1,1000) C3,
  TO_CHAR(SYSDATE+MOD(ROWNUM-1,10000),'DAY') C4,
  LPAD(' ',200,'A') PADDING
FROM
  DUAL
CONNECT BY
  LEVEL<=100000;

COMMIT;

CREATE INDEX IND_T1_C1 ON T1(C1);
CREATE INDEX IND_T1_C2 ON T1(C2);
CREATE INDEX IND_T1_C3 ON T1(C3);

CREATE INDEX IND_T2_C1 ON T2(C1);
CREATE INDEX IND_T2_C2 ON T2(C2);
CREATE INDEX IND_T2_C3 ON T2(C3);

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>NULL)
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T2',CASCADE=>TRUE,ESTIMATE_PERCENT=>NULL) 

The above script created table T1 with 1,000,000 rows and table T2 with 100,000 rows.  We will now use queries that are similar to those that were used with the 100 and 10 row tables.

The smaller table (T2) as the driving table:

SET AUTOTRACE TRACEONLY EXPLAIN

SELECT /*+ USE_NL(T1 T2) */
  T1.C1,
  T1.C2,
  T1.C3,
  T1.C4,
  T2.C1,
  T2.C2,
  T2.C3,
  T2.C4
FROM
  T1,
  T2
WHERE
  T1.C1=T2.C1;

Execution Plan
----------------------------------------------------------
Plan hash value: 2610346857

------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |   100K|  4687K|   300K  (1)| 01:00:12 |
|   1 |  NESTED LOOPS                |           |       |       |            |          |
|   2 |   NESTED LOOPS               |           |   100K|  4687K|   300K  (1)| 01:00:12 |
|   3 |    TABLE ACCESS FULL         | T2        |   100K|  2343K|   889   (1)| 00:00:11 |
|*  4 |    INDEX RANGE SCAN          | IND_T1_C1 |     1 |       |     2   (0)| 00:00:01 |
|   5 |   TABLE ACCESS BY INDEX ROWID| T1        |     1 |    24 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T1"."C1"="T2"."C1") 

The larger table as the driving table:

SELECT /*+ USE_NL(T1 T2) */
  T1.C1,
  T1.C2,
  T1.C3,
  T1.C4,
  T2.C1,
  T2.C2,
  T2.C3,
  T2.C4
FROM
  T1,
  T2
WHERE
  T1.C1=T2.C1
  AND T1.C2 BETWEEN 1 AND 10000;

Execution Plan
----------------------------------------------------------
Plan hash value: 2331401024

-------------------------------------------------------------------------------------------
| Id  | Operation                     | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |           | 10001 |   468K| 11353   (1)| 00:02:17 |
|   1 |  NESTED LOOPS                 |           |       |       |            |          |
|   2 |   NESTED LOOPS                |           | 10001 |   468K| 11353   (1)| 00:02:17 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T1        | 10001 |   234K|   348   (0)| 00:00:05 |
|*  4 |     INDEX RANGE SCAN          | IND_T1_C2 | 10001 |       |    25   (0)| 00:00:01 |
|*  5 |    INDEX RANGE SCAN           | IND_T2_C1 |     1 |       |     1   (0)| 00:00:01 |
|   6 |   TABLE ACCESS BY INDEX ROWID | T2        |     1 |    24 |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T1"."C2">=1 AND "T1"."C2"<=10000)
   5 - access("T1"."C1"="T2"."C1") 

So, what is happening?  Is it simply the case that it is the expected number of rows that will be returned from each table that determines which table will be the driving table?  Let’s test:

SET AUTOTRACE OFF

SELECT
  COUNT(*)
FROM
  T1
WHERE
  T1.C1 BETWEEN 890000 AND 1000000;

  COUNT(*)
----------
    110001

SELECT
  COUNT(*)
FROM
  T2
WHERE
 T2.C2 BETWEEN 900000 AND 1000000;

  COUNT(*)
----------
    100000 

The above shows that if we specify T1.C1 BETWEEN 890000 AND 1000000 in the WHERE clause there will be 110,001 rows from the larger table that match the criteria. If we specify T2.C2 BETWEEN 900000 AND 1000000 in the WHERE clause there will be 100,000 rows from the smaller table that match the criteria. If we execute the following query, which table will be the driving table, the 10 times larger T1 table where we are retrieving 110,001 rows or the smaller T2 table where we are retrieving 100,000 rows?

SET AUTOTRACE TRACEONLY EXPLAIN

SELECT /*+ USE_NL(T1 T2) */
  T1.C1,
  T1.C2,
  T1.C3,
  T1.C4,
  T2.C1,
  T2.C2,
  T2.C3,
  T2.C4
FROM
  T1,
  T2
WHERE
  T1.C3=T2.C3
  AND T1.C1 BETWEEN 890000 AND 1000000
  AND T2.C2 BETWEEN 900000 AND 1000000; 

This is the result that I received, which seems to demonstrate that it is not just the size of the tables, nor is it the number of expected rows to be returned from the tables, that determines which table will be the driving table:

-------------------------------------------------------------------------------------------
| Id  | Operation                     | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |           |    11M|   503M|    11M  (1)| 37:03:27 |
|   1 |  NESTED LOOPS                 |           |       |       |            |          |
|   2 |   NESTED LOOPS                |           |    11M|   503M|    11M  (1)| 37:03:27 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T1        |   110K|  2578K|  3799   (1)| 00:00:46 |
|*  4 |     INDEX RANGE SCAN          | IND_T1_C1 |   110K|       |   248   (1)| 00:00:03 |
|*  5 |    INDEX RANGE SCAN           | IND_T2_C3 |   100 |       |     1   (0)| 00:00:01 |
|*  6 |   TABLE ACCESS BY INDEX ROWID | T2        |   100 |  2400 |   101   (0)| 00:00:02 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T1"."C1">=890000 AND "T1"."C1"<=1000000)
   5 - access("T1"."C3"="T2"."C3")
   6 - filter("T2"."C2">=900000 AND "T2"."C2"<=1000000) 

110,001 rows from T1 is still somewhat close in number to the 100,000 rows from T2, so let’s try an experiment selecting 992,701 rows from T1:

SELECT /*+ USE_NL(T1 T2) */
  T1.C1,
  T1.C2,
  T1.C3,
  T1.C4,
  T2.C1,
  T2.C2,
  T2.C3,
  T2.C4
FROM
  T1,
  T2
WHERE
  T1.C3=T2.C3
  AND T1.C1 BETWEEN 7300 AND 1000000
  AND T2.C2 BETWEEN 900000 AND 1000000;

Execution Plan
----------------------------------------------------------
Plan hash value: 3718770616

------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |    99M|  4544M|   100M  (1)|334:20:23 |
|   1 |  NESTED LOOPS                |           |       |       |            |          |
|   2 |   NESTED LOOPS               |           |    99M|  4544M|   100M  (1)|334:20:23 |
|*  3 |    TABLE ACCESS FULL         | T1        |   992K|    22M|  8835   (1)| 00:01:47 |
|*  4 |    INDEX RANGE SCAN          | IND_T2_C3 |   100 |       |     1   (0)| 00:00:01 |
|*  5 |   TABLE ACCESS BY INDEX ROWID| T2        |   100 |  2400 |   101   (0)| 00:00:02 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("T1"."C1">=7300 AND "T1"."C1"<=1000000)
   4 - access("T1"."C3"="T2"."C3")
   5 - filter("T2"."C2">=900000 AND "T2"."C2"<=1000000) 

As shown above, table T1 is still the driving table in the nested loops join.  Let’s test retrieving 993,001 rows from T1:

SELECT /*+ USE_NL(T1 T2) */
  T1.C1,
  T1.C2,
  T1.C3,
  T1.C4,
  T2.C1,
  T2.C2,
  T2.C3,
  T2.C4
FROM
  T1,
  T2
WHERE
  T1.C3=T2.C3
  AND T1.C1 BETWEEN 7000 AND 1000000
  AND T2.C2 BETWEEN 900000 AND 1000000;

------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |    99M|  4545M|   100M  (1)|334:26:13 |
|   1 |  NESTED LOOPS                |           |       |       |            |          |
|   2 |   NESTED LOOPS               |           |    99M|  4545M|   100M  (1)|334:26:13 |
|*  3 |    TABLE ACCESS FULL         | T2        |   100K|  2343K|   889   (1)| 00:00:11 |
|*  4 |    INDEX RANGE SCAN          | IND_T1_C3 |  1000 |       |     3   (0)| 00:00:01 |
|*  5 |   TABLE ACCESS BY INDEX ROWID| T1        |   993 | 23832 |  1003   (0)| 00:00:13 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("T2"."C2">=900000 AND "T2"."C2"<=1000000)
   4 - access("T1"."C3"="T2"."C3")
   5 - filter("T1"."C1">=7000 AND "T1"."C1"<=1000000) 

As shown above, table T2 is now the driving table for the nested loops join.  So, there must be other factors beyond table (or better worded row source) size and the number of rows that will be retrieved from the tables.  You might be wondering if the CLUSTERING_FACTOR of the indexes also plays a role in determining which table is the driving table:

SET AUTOTRACE OFF
 
SELECT
  TABLE_NAME,
  INDEX_NAME,
  CLUSTERING_FACTOR,
  NUM_ROWS
FROM
  USER_INDEXES
WHERE
  TABLE_NAME IN ('T1','T2')
ORDER BY
  TABLE_NAME,
  INDEX_NAME;

TABLE_NAME INDEX_NAME CLUSTERING_FACTOR   NUM_ROWS
---------- ---------- ----------------- ----------
T1         IND_T1_C1              32259    1000000
T1         IND_T1_C2              32259    1000000
T1         IND_T1_C3            1000000    1000000
T2         IND_T2_C1               3226     100000
T2         IND_T2_C2               3226     100000
T2         IND_T2_C3             100000     100000 

I suggested (without checking) in the OTN thread that the CLUSTERING_FACTOR of the index on columns C2 would be higher than the CLUSTERING_FACTOR of the index on columns C1 because of the reverse (descending) order in which the C2 column values were inserted into the tables.  Surprisingly (at least to me), the optimizer set the CLUSTERING_FACTOR of the C1 and C2 columns to be the same values, and set the CLUSTERING_FACTOR of column C3 to be the same as the number of rows in the table.  Maybe one of the readers of this blog article can explain what happened to the CLUSTERING_FACTOR.

So, the answer to the OP’s question is not as simple as “the Smaller Table is the Driving Table” or “the Larger Table is the Driving Table”, but that there are other factors involved.  I think that it might be time for a third read through of the book “Cost-Based Oracle Fundamentals”.  In the mean time, anyone care to share more insight (yes we could look inside a 10053 trace file, but there must be at least one other tip that can be provided without referencing such a trace file).