Parallel Execution Challenge – It’s About Time

8 07 2012

July 8, 2012

As I was reading a book, I saw a comment essentially stating that with multiple CPUs/cores, parallel execution will be faster than serial execution of the same query.  The book provided a test case – I decided to change around the test case a bit so that the results were a bit more fair, and then I performed a bit of testing using the sample database data that Oracle Corporation provides. 

My testing was performed in Oracle Database 11.2.0.1 on 64 bit Windows with a quad core CPU that supports hyperthreading, and two SSD drives in a RAID 0 array (with all scripts executed on the server).  So, is parallel query always faster than serial execution when the server has multiple CPUs/cores?  But wait, that is not the interesting question… not the point of this parallel execution challenge, but still an interesting question.  Here is the test script, which will be executed three times, with the results of the first execution thrown out so that we may eliminate the side effects of the hard parse:

ALTER SESSION SET CURRENT_SCHEMA=SH;
SET AUTOTRACE TRACEONLY STATISTICS EXPLAIN
SET TIMING ON
SET LINESIZE 145
SET PAGESIZE 1000

ALTER SYSTEM FLUSH BUFFER_CACHE;
SELECT /*+ PARALLEL(2) */ S.PROD_ID, S.CUST_ID, S.TIME_ID FROM SH.SALES S ORDER BY S.AMOUNT_SOLD DESC;

ALTER SYSTEM FLUSH BUFFER_CACHE;
SELECT S.PROD_ID, S.CUST_ID, S.TIME_ID FROM SH.SALES S ORDER BY S.AMOUNT_SOLD DESC; 

Notice in the above that the test case performs the parallel execution first, with a degree of parallelism of 2, and then executes the same query without using parallel execution.

Let’s look at the output for the parallel execution:

918843 rows selected.

Elapsed: 00:00:11.97

Execution Plan
----------------------------------------------------------
Plan hash value: 2055439529

-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |   918K|    19M|       |   308  (14)| 00:00:04 |       |       |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (ORDER)    | :TQ10001 |   918K|    19M|       |   308  (14)| 00:00:04 |       |       |  Q1,01 | P->S | QC (ORDER) |
|   3 |    SORT ORDER BY        |          |   918K|    19M|    31M|   308  (14)| 00:00:04 |       |       |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE          |          |   918K|    19M|       |   277   (4)| 00:00:04 |       |       |  Q1,01 | PCWP |            |
|   5 |      PX SEND RANGE      | :TQ10000 |   918K|    19M|       |   277   (4)| 00:00:04 |       |       |  Q1,00 | P->P | RANGE      |
|   6 |       PX BLOCK ITERATOR |          |   918K|    19M|       |   277   (4)| 00:00:04 |     1 |    28 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| SALES    |   918K|    19M|       |   277   (4)| 00:00:04 |     1 |    28 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------------------------------

Note
-----
   - Degree of Parallelism is 2 because of hint

Statistics
----------------------------------------------------------
         12  recursive calls
          0  db block gets
       1802  consistent gets
       1647  physical reads
          0  redo size
   22116448  bytes sent via SQL*Net to client
    1102982  bytes received via SQL*Net from client
      61258  SQL*Net roundtrips to/from client
          3  sorts (memory)
          0  sorts (disk)
     918843  rows processed 

918,843 rows selected (not displayed in SQL*Plus due to the AUTOTRACE setting), 1,647 blocks read from disk, 1,802 consistent gets, an estimated cost of 308, and the execution completed in 11.97 seconds.

Let’s take a look at the serial execution output:

918843 rows selected.

Elapsed: 00:00:08.84

Execution Plan
----------------------------------------------------------
Plan hash value: 3803407550

------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name  | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |   918K|    19M|       |  6637   (2)| 00:01:20 |       |       |
|   1 |  SORT ORDER BY       |       |   918K|    19M|    31M|  6637   (2)| 00:01:20 |       |       |
|   2 |   PARTITION RANGE ALL|       |   918K|    19M|       |   499   (4)| 00:00:06 |     1 |    28 |
|   3 |    TABLE ACCESS FULL | SALES |   918K|    19M|       |   499   (4)| 00:00:06 |     1 |    28 |
------------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       1718  consistent gets
       1647  physical reads
          0  redo size
   22116448  bytes sent via SQL*Net to client
    1102982  bytes received via SQL*Net from client
      61258  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
     918843  rows processed 

918,843 rows selected (not displayed in SQL*Plus due to the AUTOTRACE setting), 1,647 blocks read from disk, 1,718 consistent gets, an estimated cost of 6,637, and the execution completed in 8.84 seconds.

* The first part of the challenge is to describe how the cost is derived in the above execution plan for the parallel execution.  Here is the easy part of the cost calculation:

cost of serial full table scan / degree of parallel / 0.90
499 / 2 / 0.90 = 277.22

The above calculation is easy to understand, and is explained in this article.  The SORT ORDER BY operation in the parallel query execution plan added 31 to the calculated cost (308 – 277), while the same operation in the serial execution added 6,138 to the calculated cost (6637 – 499).

The first challenge is to describe why the calculated cost of the parallel execution plan is so small compared to the serial execution.

The timing output indicated that the serial execution required 3.13 fewer seconds than the parallel execution.  Let’s look at the results of the third execution of this test case script:

918843 rows selected.

Elapsed: 00:00:10.66

Execution Plan
----------------------------------------------------------
Plan hash value: 2055439529

-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |   918K|    19M|       |   308  (14)| 00:00:04 |       |       |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (ORDER)    | :TQ10001 |   918K|    19M|       |   308  (14)| 00:00:04 |       |       |  Q1,01 | P->S | QC (ORDER) |
|   3 |    SORT ORDER BY        |          |   918K|    19M|    31M|   308  (14)| 00:00:04 |       |       |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE          |          |   918K|    19M|       |   277   (4)| 00:00:04 |       |       |  Q1,01 | PCWP |            |
|   5 |      PX SEND RANGE      | :TQ10000 |   918K|    19M|       |   277   (4)| 00:00:04 |       |       |  Q1,00 | P->P |      RANGE |
|   6 |       PX BLOCK ITERATOR |          |   918K|    19M|       |   277   (4)| 00:00:04 |     1 |    28 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| SALES    |   918K|    19M|       |   277   (4)| 00:00:04 |     1 |    28 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------------------------------

Note
-----
   - Degree of Parallelism is 2 because of hint

Statistics
----------------------------------------------------------
         12  recursive calls
          0  db block gets
       1802  consistent gets
       1647  physical reads
          0  redo size
   22116448  bytes sent via SQL*Net to client
    1102982  bytes received via SQL*Net from client
      61258  SQL*Net roundtrips to/from client
          3  sorts (memory)
          0  sorts (disk)
     918843  rows processed 

The above shows the results of the parallel execution.  918,843 rows selected (not displayed in SQL*Plus due to the AUTOTRACE setting), 1,647 blocks read from disk, 1,802 consistent gets, an estimated cost of 308, and completed in 10.66 seconds.  Notice that with the exception of the elapsed time, the statistics are identical to what we saw during the previous execution of the test script.

918843 rows selected.

Elapsed: 00:00:08.48

Execution Plan
----------------------------------------------------------
Plan hash value: 3803407550

------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name  | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |   918K|    19M|       |  6637   (2)| 00:01:20 |       |       |
|   1 |  SORT ORDER BY       |       |   918K|    19M|    31M|  6637   (2)| 00:01:20 |       |       |
|   2 |   PARTITION RANGE ALL|       |   918K|    19M|       |   499   (4)| 00:00:06 |     1 |    28 |
|   3 |    TABLE ACCESS FULL | SALES |   918K|    19M|       |   499   (4)| 00:00:06 |     1 |    28 |
------------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       1718  consistent gets
       1647  physical reads
          0  redo size
   22116448  bytes sent via SQL*Net to client
    1102982  bytes received via SQL*Net from client
      61258  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
     918843  rows processed 

The above shows the results of the serial execution. 918,843 rows selected (not displayed in SQL*Plus due to the AUTOTRACE setting), 1,647 blocks read from disk, 1,718 consistent gets, an estimated cost of 6,637, and completed in 8.48 seconds. Notice that with the exception of the elapsed time, the statistics are identical to what we saw during the previous execution of the test script.

The second challenge is to describe why the serial execution completed at least 25% faster than the parallel execution of the same query – this result is counter to what the calculated cost of the execution plans suggest, and counter to what the book claimed.

Maybe we need to look at the result from a different angle.  The book did not use SET TIMING ON as my test case above did, so let’s change the test case again.  This time, we will use the GATHER_PLAN_STATISTICS hint so that the runtime engine will collect the timing and various other statistics, and then use DBMS_XPLAN to display the actual execution plan.  To eliminate the effects of hard parsing, the script will be executed three times with the first execution thrown out.  The new script:

ALTER SESSION SET CURRENT_SCHEMA=SH;
SET AUTOTRACE TRACEONLY STATISTICS
SET TIMING OFF
SET LINESIZE 145
SET PAGESIZE 1000

ALTER SYSTEM FLUSH BUFFER_CACHE;
SELECT /*+ Q1 PARALLEL(2) GATHER_PLAN_STATISTICS */ S.PROD_ID, S.CUST_ID, S.TIME_ID FROM SH.SALES S ORDER BY S.AMOUNT_SOLD DESC;

ALTER SYSTEM FLUSH BUFFER_CACHE;
SELECT /*+ Q2 GATHER_PLAN_STATISTICS */ S.PROD_ID, S.CUST_ID, S.TIME_ID FROM SH.SALES S ORDER BY S.AMOUNT_SOLD DESC;

SET AUTOTRACE OFF 

SELECT
  SQL_ID,
  CHILD_NUMBER,
  SUBSTR(SQL_TEXT,1,30) SQL_TEXT
FROM
  V$SQL
WHERE
  SQL_TEXT LIKE 'SELECT /*+ Q%'
ORDER BY
  3,
  1,
  2; 

The last of the above SQL statements simply retrieves the SQL_ID and CHILD_NUMBER for the two previous queries.  For my results, I received the following output from the last of the above SQL statements:

SQL_ID        CHILD_NUMBER SQL_TEXT
------------- ------------ ------------------------------
5p47nbddfn0k7            0 SELECT /*+ Q1 PARALLEL(2) GATH
8k6bdqr09xc8u            0 SELECT /*+ Q2 GATHER_PLAN_STAT 

To display the execution plans with the runtime statistics, we are able to use the following two statements:

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('5p47nbddfn0k7',0,'ALLSTATS LAST +COST'));

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('8k6bdqr09xc8u',0,'ALLSTATS LAST +COST')); 

Here are the autotrace statistics for the parallel execution:

Statistics
----------------------------------------------------------
         12  recursive calls
          0  db block gets
       1802  consistent gets
       1647  physical reads
          0  redo size
   22116448  bytes sent via SQL*Net to client
    1102982  bytes received via SQL*Net from client
      61258  SQL*Net roundtrips to/from client
          3  sorts (memory)
          0  sorts (disk)
     918843  rows processed 

The DBMS_XPLAN generated execution plan for the parallel execution follows:

SQL_ID  5p47nbddfn0k7, child number 0
-------------------------------------
SELECT /*+ Q1 PARALLEL(2) GATHER_PLAN_STATISTICS */ S.PROD_ID,
S.CUST_ID, S.TIME_ID FROM SH.SALES S ORDER BY S.AMOUNT_SOLD DESC

Plan hash value: 2055439529

-----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |      1 |        |   308 (100)|    918K|00:00:00.70 |     116 |     28 |       |       |          |
|   1 |  PX COORDINATOR         |          |      1 |        |            |    918K|00:00:00.70 |     116 |     28 |       |       |          |
|   2 |   PX SEND QC (ORDER)    | :TQ10001 |      0 |    918K|   308  (14)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   3 |    SORT ORDER BY        |          |      0 |    918K|   308  (14)|      0 |00:00:00.01 |       0 |      0 |    83M|  3142K|   37M (0)|
|   4 |     PX RECEIVE          |          |      0 |    918K|   277   (4)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   5 |      PX SEND RANGE      | :TQ10000 |      0 |    918K|   277   (4)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   6 |       PX BLOCK ITERATOR |          |      0 |    918K|   277   (4)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  7 |        TABLE ACCESS FULL| SALES    |      0 |    918K|   277   (4)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   7 - access(:Z>=:Z AND :Z<=:Z)

Note
-----
   - Degree of Parallelism is 2 because of hint 

Notice that the execution plan claims that the query completed in 0.70 seconds, that the calculated cost of the execution plan is (still) 308, that 28 blocks were read from disk, and 116 consistent gets were performed.

Here are the autotrace statistics for the serial execution:

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       1718  consistent gets
       1647  physical reads
          0  redo size
   22116448  bytes sent via SQL*Net to client
    1102982  bytes received via SQL*Net from client
      61258  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
     918843  rows processed 

The DBMS_XPLAN generated execution plan for the serial execution plan follows:

SQL_ID  8k6bdqr09xc8u, child number 0
-------------------------------------
SELECT /*+ Q2 GATHER_PLAN_STATISTICS */ S.PROD_ID, S.CUST_ID, S.TIME_ID
FROM SH.SALES S ORDER BY S.AMOUNT_SOLD DESC

Plan hash value: 3803407550

-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |      1 |        |  6637 (100)|    918K|00:00:00.93 |    1718 |   1647 |       |       |          |
|   1 |  SORT ORDER BY       |       |      1 |    918K|  6637   (2)|    918K|00:00:00.93 |    1718 |   1647 |    42M|  2308K|   37M (0)|
|   2 |   PARTITION RANGE ALL|       |      1 |    918K|   499   (4)|    918K|00:00:00.31 |    1718 |   1647 |       |       |          |
|   3 |    TABLE ACCESS FULL | SALES |     28 |    918K|   499   (4)|    918K|00:00:00.22 |    1718 |   1647 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------- 

Notice that the execution plan claims that the query completed in 0.93 seconds, that the calculated cost of the execution plan is (still) 6,637, that 1,647 blocks were read from disk, and 1,718 consistent gets were performed.

The third challenge is to describe why DBMS_XPLAN claims that the parallel execution completed roughly 32.9% faster than the serial execution, when the timing output in SQL*Plus indicated that the reverse was true.  Bonus: Why do the Buffers and Reads statistics in the top row of the execution plan for the serial execution plan exactly match the autotrace generated consistent gets and physical reads statistics, yet the same is not true for the execution plan for the parallel execution?  Double Bonus: the execution plan claims that the query completed in 0.93 seconds, yet the timing information in SQL*Plus indicated that the query required an additional 7.55 seconds, why is there so much difference between the reported times (keep in mind that all testing was performed on the server, and the output was discarded without formatting and without displaying on screen)?

If cost is time, and time is money, what is the exchange rate for parallel execution?  ;-)

I have not fully worked the answers to the above challenge questions, and I will wait a couple of days before providing any feedback regarding proposed answers supplied by readers.  I strongly suspect that some readers know without a doubt the answers to all of the above questions.








Follow

Get every new post delivered to your Inbox.

Join 142 other followers