True or False – Hash Joins

12 05 2010

May 12, 2010

(Back to the Previous Post in the Series) (Forward to the Next Post in the Series)

Continuing the true or false blog article series with more questions that seem to have conflicting answers.  Today’s quiz is on the topic of the hash joins, inspired by a recent thread that I found in a discussion forum.

When answering the questions, answer True or False, followed by a justification for the answer that you provided.

The Articles:

The Quiz:

1.  A hash join is ideal when joining a small table to a large table, and in such cases a hash join is typically faster than a nested loops join or a sort-merge join.

2.  When presented with very large tables in a production database, the Oracle query optimizer will attempt to avoid using hash joins unless the PGA_AGGREGATE_TARGET or HASH_AREA_SIZE (if PGA_AGGREGATE_TARGET is not used) is increased to a large value.

3.  When a PGA_AGGREGATE_TARGET is in use, one session cannot use all of the memory specified by PGA_AGGREGATE_TARGET nor all of the available memory in the server, whichever is the lower value.

4.  A hash join may use only 200MB of memory, or 5% of the PGA_AGGREGATE_TARGET, whichever is the lower value.

5.  The query optimizer will not select to use a hash join unless the hash join is able to complete in memory, as controlled by the PGA_AGGREGATE_TARGET and/or HASH_AREA_SIZE parameter values.

6.  A hash join is only faster than a nested loops join when the memory available for the hash join is at least 1.6 times the sum of bytes in the rows of the table to be hashed.

7.  A hash join’s performance is affected when the hash join overflows the memory established by the HASH_AREA_SIZE parameter, and the hash join spills to the TEMP tablespace.

8.  The degree of parallelism for a table (or individual SQL statement) must be adjusted when a full table scan is used to retrieve the rows during a hash join.  Hash joins typically require the parallel degree to be set to a value greater than 1 in order to improve performance over other join methods.

9.  The _PGA_MAX_SIZE parameter should be adjusted to improve performance when joining large tables with a hash join.

10. The Oracle query optimizer will commonly ignore USE_HASH hints.

11. The OPTIMIZER_INDEX_COST_ADJ parameter and OPTIMIZER_MAX_PERMUTATIONS parameters control the optimizer’s decision to use hash joins.

12. When the PGA_AGGREGATE_TARGET parameter is not used, the HASH_AREA_SIZE parameter should be set to 50% of the square root of the bytes in the smaller of the two row sources being joined, with a minimum value of 1MB.

13. The row source listed below the words HASH JOIN in an execution plan is the source for the hash table – the optimizer typically tries to select the smaller estimated row source of the two row sources as the source for the hash table. The second row source listed below the words HASH JOIN in an execution plan is fully scanned, probing the generated hash table in search of a match.

When attempting to answer the true or false questions, think about under what conditions the answer to the question will be true and under what conditions the answer to the question must be false.


Actions

Information

5 responses

13 05 2010
Narendra

1. A hash join is ideal when joining a small table to a large table, and in such cases a hash join is typically faster than a nested loops join or a sort-merge join.
FALSE. A NESTED LOOP join is better choice when joining small table to a large table, using equality condition. A SORT-MERGE join is better choice when joining small table to a large table, using non-equality conditions. I would also rather use the term “small result set” instead of “small table” and “large result set” instead of “large table”
2. When presented with very large tables in a production database, the Oracle query optimizer will attempt to avoid using hash joins unless the PGA_AGGREGATE_TARGET or HASH_AREA_SIZE (if PGA_AGGREGATE_TARGET is not used) is increased to a large value.
TRUE. If PGA_AGGREGATE_TARGET (or HASH_AREA_SIZE) is insufficient, the optimizer will expect hashing process to spill to disk and that will make hash join more costly as compared to NESTED LOOP join.
3. When a PGA_AGGREGATE_TARGET is in use, one session cannot use all of the memory specified by PGA_AGGREGATE_TARGET nor all of the available memory in the server, whichever is the lower value.
TRUE. But I must admit I don’t know how to prove the same.
4. A hash join may use only 200MB of memory, or 5% of the PGA_AGGREGATE_TARGET, whichever is the lower value.
TRUE. Because PGA_AGGREGATE_TARGET is also used for sorting
5. The query optimizer will not select to use a hash join unless the hash join is able to complete in memory, as controlled by the PGA_AGGREGATE_TARGET and/or HASH_AREA_SIZE parameter values.
FALSE. Optimizer will select hash join, even if it is expected to spill over to disk because it may still prove to be cheaper access path than a NESTED LOOP or SORT MERGE join.
6. A hash join is only faster than a nested loops join when the memory available for the hash join is at least 1.6 times the sum of bytes in the rows of the table to be hashed.
Not Sure
7. A hash join’s performance is affected when the hash join overflows the memory established by the HASH_AREA_SIZE parameter, and the hash join spills to the TEMP tablespace.
TRUE. When hash join spills to TEMP tablespace, additional physical I/O is required (first writing to TEMP followed by reading from TEMP), which will affect performance.
8. The degree of parallelism for a table (or individual SQL statement) must be adjusted when a full table scan is used to retrieve the rows during a hash join. Hash joins typically require the parallel degree to be set to a value greater than 1 in order to improve performance over other join methods.
FALSE. Hash joins can perform better than other join methods even without parallel degree.
9. The _PGA_MAX_SIZE parameter should be adjusted to improve performance when joining large tables with a hash join.
FALSE. As a general statement, this is not right as it is a hidden parameter and should not be touched without recommendations from support.
10. The Oracle query optimizer will commonly ignore USE_HASH hints.
FALSE. Hints are directives and unless they are invalid, optimizer will not ignore them
11. The OPTIMIZER_INDEX_COST_ADJ parameter and OPTIMIZER_MAX_PERMUTATIONS parameters control the optimizer’s decision to use hash joins.
FALSE/TRUE. OPTIMIZER_INDEX_COST_ADJ parameter will directly affect how an individual tale is accessed whereas OPTIMIZER_MAX_PERMUTATIONS will affect how many access paths are evaluated, which may influence optimizer’s decision to use hash join.
12. When the PGA_AGGREGATE_TARGET parameter is not used, the HASH_AREA_SIZE parameter should be set to 50% of the square root of the bytes in the smaller of the two row sources being joined, with a minimum value of 1MB.
FALSE. The HASH_AREA_SIZE parameter is not specific to a query (although it can be made such using OPT_ESTIMATE hint) and hence the calculation may yield different results for each query. The minimum value of 1 MB is correct.
13. The row source listed below the words HASH JOIN in an execution plan is the source for the hash table – the optimizer typically tries to select the smaller estimated row source of the two row sources as the source for the hash table. The second row source listed below the words HASH JOIN in an execution plan is fully scanned, probing the generated hash table in search of a match.
TRUE. That is the way oracle has implemented hash join method.

13 05 2010
Charles Hooper

Narendra,

Thanks for participating.

I thought that I would need to start answering the easy questions to encourage people to start answering the more difficult questions. You made some very good points with some of your answers, and have left enough room for others to supply their own answers, some of which may disagree with your answers. For question #9 – I think that you stated as much as needs to be stated about that parameter, although it might be interesting to look at the current value of the parameter occasionally.

17 05 2010
Charles Hooper

Time to answer another one of the questions – while quite a few of my answers agree with Narendra, not all of my answers agree.
#2 “When presented with very large tables in a production database, the Oracle query optimizer will attempt to avoid using hash joins unless the PGA_AGGREGATE_TARGET or HASH_AREA_SIZE (if PGA_AGGREGATE_TARGET is not used) is increased to a large value.”

I guess one of the things that needs to be defined is what is meant by very large tables. Since it is not real clear, let’s assume that a large table is 10,000,000 or more rows. Here are two test tables with 2 indexes each:

CREATE TABLE T1 (
  C1 NUMBER,
  C2 NUMBER,
  C3 VARCHAR2(100),
  C4 VARCHAR2(500),
  PRIMARY KEY (C1));
 
INSERT INTO T1
SELECT
  ROWNUM,
  DECODE(MOD(ROWNUM,100),0,NULL,ROWNUM),
  RPAD('A',100,'A'),
  RPAD('A',500,'A')
FROM
  (SELECT
    ROWNUM RN
  FROM
    DUAL
  CONNECT BY
    LEVEL<=10000),
  (SELECT
    ROWNUM RN
  FROM
    DUAL
  CONNECT BY
    LEVEL<=1000);
 
CREATE TABLE T2 (
  C1 NUMBER,
  C2 NUMBER,
  C3 VARCHAR2(100),
  C4 VARCHAR2(500),
  PRIMARY KEY (C1));
 
INSERT INTO T2
SELECT
  ROWNUM,
  DECODE(MOD(ROWNUM,100),0,NULL,ROWNUM),
  RPAD('A',100,'A'),
  RPAD('A',500,'A')
FROM
  (SELECT
    ROWNUM RN
  FROM
    DUAL
  CONNECT BY
    LEVEL<=10000),
  (SELECT
    ROWNUM RN
  FROM
    DUAL
  CONNECT BY
    LEVEL<=1000);
 
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T2',CASCADE=>TRUE)
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T2',CASCADE=>TRUE)

Next, let’s see the version of the optimizer that is in use:

SHOW PARAMETER OPTIMIZER_FEATURES_ENABLE
 
NAME                                 TYPE        VALUE
------------------------------------ ----------- --------
optimizer_features_enable            string      11.1.0.7

It may take an hour or so for the above test tables to be created.

Now, the test script:

SET AUTOTRACE TRACEONLY EXPLAIN
 
ALTER SESSION SET TRACEFILE_IDENTIFIER = 'HASH_JOIN_TEST';
ALTER SESSION SET EVENTS '10053 TRACE NAME CONTEXT FOREVER, LEVEL 1';
 
SELECT
  T1.C1,
  T2.C1,
  SUBSTR(T1.C3,1,10) T1_C3,
  SUBSTR(T2.C3,1,10) T2_C3
FROM
  T1,
  T2
WHERE
  T1.C2=T2.C2;
 
ALTER SESSION SET EVENTS '10053 TRACE NAME CONTEXT OFF';
 
SET AUTOTRACE OFF
 
SELECT SID FROM V$MYSTAT WHERE ROWNUM=1;

SHOW PARAMETER PGA

SHOW PARAMETER MEMORY_TARGET

The output of the above:

Execution Plan
----------------------------------------------------------
Plan hash value: 1838229974
 
-----------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  9966K|  2148M|       |   267K  (1)| 00:18:05 |
|*  1 |  HASH JOIN         |      |  9966K|  2148M|  1192M|   267K  (1)| 00:18:05 |
|   2 |   TABLE ACCESS FULL| T1   |    10M|  1077M|       |   108K  (1)| 00:07:21 |
|   3 |   TABLE ACCESS FULL| T2   |    10M|  1077M|       |   108K  (1)| 00:07:21 |
-----------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T1"."C2"="T2"."C2")
 
SID
--- 
297
 
NAME                                 TYPE        VALUE
------------------------------------ ----------- -----
pga_aggregate_target                 big integer 0
 
NAME                                 TYPE        VALUE
------------------------------------ ----------- -----
memory_target                        big integer 408M
 
NAME                                 TYPE        VALUE
------------------------------------ ----------- -----
pga_aggregate_target                 big integer 0

Well, that is no fun, the MEMORY_TARGET parameter was specified. Let’s take a look at a couple of the hidden parameter values since a hash join was selected.

__PGA_AGGREGATE_TARGET = 146,800,640 (140MB)
__PGA_MAX_SIZE = 209,715,200 (200MB)
SGA_TARGET = 0
__SGA_TARGET = 281,018,368 (268MB)

Let’s fix the MEMORY_TARGET problem using a session connected as the SYS user:

ALTER SYSTEM SET PGA_AGGREGATE_TARGET = 40M;
ALTER SYSTEM SET SGA_TARGET = 400M SCOPE=SPFILE;
ALTER SYSTEM RESET MEMORY_TARGET SCOPE=SPFILE SID='*';
 
SHUTDOWN IMMEDIATE;
 
STARTUP

The script:

SET AUTOTRACE TRACEONLY EXPLAIN

ALTER SESSION SET TRACEFILE_IDENTIFIER = 'HASH_JOIN_TEST2';
ALTER SESSION SET EVENTS '10053 TRACE NAME CONTEXT FOREVER, LEVEL 1';
 
SELECT
  T1.C1,
  T2.C1,
  SUBSTR(T1.C3,1,10) T1_C3,
  SUBSTR(T2.C3,1,10) T2_C3
FROM
  T1,
  T2
WHERE
  T1.C2=T2.C2;
 
ALTER SESSION SET EVENTS '10053 TRACE NAME CONTEXT OFF';
 
SET AUTOTRACE OFF
 
SELECT SID FROM V$MYSTAT WHERE ROWNUM=1;
 
SHOW PARAMETER PGA
 
SHOW PARAMETER MEMORY_TARGET

The output:

Execution Plan
----------------------------------------------------------
Plan hash value: 1838229974
 
-----------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  9966K|  2148M|       |   331K  (1)| 00:22:24 |
|*  1 |  HASH JOIN         |      |  9966K|  2148M|  1192M|   331K  (1)| 00:22:24 |
|   2 |   TABLE ACCESS FULL| T1   |    10M|  1077M|       |   108K  (1)| 00:07:21 |
|   3 |   TABLE ACCESS FULL| T2   |    10M|  1077M|       |   108K  (1)| 00:07:21 |
-----------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T1"."C2"="T2"."C2")
 
SID
---
320
 
NAME                                 TYPE        VALUE
------------------------------------ ----------- -----
pga_aggregate_target                 big integer 40M
 
NAME                                 TYPE        VALUE
------------------------------------ ----------- -----
memory_target                        big integer 0

The hidden parameter values:

__PGA_AGGREGATE_TARGET = 41,943,040 (40MB)
__PGA_MAX_SIZE = 209,715,200 (200MB)
SGA_TARGET = 419,430,400 (400MB)
__SGA_TARGET = 419,430,400 (400MB)

Let’s try again when connected as SYS:

ALTER SYSTEM SET PGA_AGGREGATE_TARGET = 10M;

The script:

SET AUTOTRACE TRACEONLY EXPLAIN
 
ALTER SESSION SET TRACEFILE_IDENTIFIER = 'HASH_JOIN_TEST3';
ALTER SESSION SET EVENTS '10053 TRACE NAME CONTEXT FOREVER, LEVEL 1';
 
SELECT
  T1.C1,
  T2.C1,
  SUBSTR(T1.C3,1,10) T1_C3,
  SUBSTR(T2.C3,1,10) T2_C3
FROM
  T1,
  T2
WHERE
  T1.C2=T2.C2;
 
ALTER SESSION SET EVENTS '10053 TRACE NAME CONTEXT OFF';

SELECT /*+ INDEX(T1) */
  T1.C1,
  T2.C1,
  SUBSTR(T1.C3,1,10) T1_C3,
  SUBSTR(T2.C3,1,10) T2_C3
FROM
  T1,
  T2
WHERE
  T1.C2=T2.C2;
 
SELECT /*+ INDEX(T1) USE_NL(T1 T2) */
  T1.C1,
  T2.C1,
  SUBSTR(T1.C3,1,10) T1_C3,
  SUBSTR(T2.C3,1,10) T2_C3
FROM
  T1,
  T2
WHERE
  T1.C2=T2.C2; 
 
SELECT /*+ USE_NL(T1 T2) */
  T1.C1,
  T2.C1,
  SUBSTR(T1.C3,1,10) T1_C3,
  SUBSTR(T2.C3,1,10) T2_C3
FROM
  T1,
  T2
WHERE
  T1.C2=T2.C2; 

The output:

----------------------------------------------------------
Plan hash value: 1838229974
 
-----------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  9966K|  2148M|       |   436K  (1)| 00:29:27 |
|*  1 |  HASH JOIN         |      |  9966K|  2148M|  1192M|   436K  (1)| 00:29:27 |
|   2 |   TABLE ACCESS FULL| T1   |    10M|  1077M|       |   108K  (1)| 00:07:21 |
|   3 |   TABLE ACCESS FULL| T2   |    10M|  1077M|       |   108K  (1)| 00:07:21 |
-----------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T1"."C2"="T2"."C2")
 
---
  
Execution Plan
----------------------------------------------------------
Plan hash value: 3851049713
 
--------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |  9966K|  2148M|       |  1259K  (1)| 01:24:58 |
|*  1 |  HASH JOIN                   |           |  9966K|  2148M|  1192M|  1259K  (1)| 01:24:58 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1        |    10M|  1077M|       |   931K  (1)| 01:02:51 |
|   3 |    INDEX FULL SCAN           | IND_T1_C2 |  9669K|       |       | 23026   (2)| 00:01:34 |
|   4 |   TABLE ACCESS FULL          | T2        |    10M|  1077M|       |   108K  (1)| 00:07:21 |
--------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T1"."C2"="T2"."C2")
 
---
 
Execution Plan
----------------------------------------------------------
Plan hash value: 819113148
 
-------------------------------------------------------------------------------------------
| Id  | Operation                     | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |           |  9966K|  2148M|    20M  (1)| 23:35:19 |
|   1 |  NESTED LOOPS                 |           |       |       |            |          |
|   2 |   NESTED LOOPS                |           |  9966K|  2148M|    20M  (1)| 23:35:19 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T1        |    10M|  1077M|   931K  (1)| 01:02:51 |
|   4 |     INDEX FULL SCAN           | IND_T1_C2 |  9669K|       | 23026   (2)| 00:01:34 |
|*  5 |    INDEX RANGE SCAN           | IND_T2_C2 |     1 |       |     2   (0)| 00:00:01 |
|   6 |   TABLE ACCESS BY INDEX ROWID | T2        |     1 |   113 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T1"."C2"="T2"."C2")
 
---
 
Execution Plan
----------------------------------------------------------
Plan hash value: 1384810690
 
------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |  9966K|  2148M|    30M  (1)| 33:52:40 |
|   1 |  NESTED LOOPS                |           |       |       |            |          |
|   2 |   NESTED LOOPS               |           |  9966K|  2148M|    30M  (1)| 33:52:40 |
|   3 |    TABLE ACCESS FULL         | T1        |    10M|  1077M|   108K  (1)| 00:07:21 |
|*  4 |    INDEX RANGE SCAN          | IND_T2_C2 |     1 |       |     2   (0)| 00:00:01 |
|   5 |   TABLE ACCESS BY INDEX ROWID| T2        |     1 |   113 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T1"."C2"="T2"."C2")

Maybe PGA_AGGREGATE_TARGET is still too large?

ALTER SYSTEM SET PGA_AGGREGATE_TARGET = 5M
*
ERROR at line 1:
ORA-02097: parameter cannot be modified because specified value is invalid
ORA-00093: pga_aggregate_target must be between 10M and 4096G-1

I think that it is safe to state that this question is False.

20 07 2011
PD

Chales, How about the rest of the statements please – are they true or false?

Thanks.

20 07 2011
Charles Hooper

PD,

14 months have elapsed since I wrote this quiz – I forgot all of my well thought out, creative answers to the questions. Maybe your comment will jump start the discussion again.

To get the ball rolling again:
3. When a PGA_AGGREGATE_TARGET is in use, one session cannot use all of the memory specified by PGA_AGGREGATE_TARGET nor all of the available memory in the server, whichever is the lower value.

A fun 2 part blog article series starts here:

http://hoopercharles.wordpress.com/2010/01/18/pga-memory-the-developers-secret-weapon-for-stealing-all-of-the-memory-in-the-server/

Quoting a portion of the above blog article:

When just the sort operation is returned by the above query, session 2 executes this SQL statement:

COLUMN VALUE FORMAT 999,999,999,990

SELECT
  SN.NAME,
  SS.VALUE
FROM
  V$STATNAME SN,
  V$SESSTAT SS
WHERE
  SS.SID=335
  AND SS.STATISTIC#=SN.STATISTIC#
  AND SN.NAME LIKE '%pga%';
 
NAME                             VALUE
------------------------ -------------
session pga memory       3,391,500,272
session pga memory max   3,391,500,272

Based on the above, Session 1 is not consuming about 90MB of PGA memory, but instead roughly 3234.39MB of PGA memory (the 2 DBAs still standing and clapping should sit down now). Let’s hope that the DBA responsible for this database did not consider the 1800MB value for the PGA_AGGREGATE_TARGET parameter as a hard upper limit, and set the other parameters to take full advantage of the 12GB of memory in the server

Question #3 is false – as shown above, a single session was using 179.67% of the amount of memory specified by the PGA_AGGREGATE_TARGET parameter, and it could have used much more memory.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s




Follow

Get every new post delivered to your Inbox.

Join 147 other followers

%d bloggers like this: