Query Performance Problem, or is Something Else Happening?

3 03 2010

March 3, 2010

Let’s say that you encouter a query that is taking an unexpectedly long time to run, possibly 30 minutes.  Maybe the query looks something like this:

SELECT   mds.messagepartdata, ms.status, mi.ID, mi.messageguid, mi.channel,
         ROWNUM AS messagecount
    FROM pfmq_messagedata md,
         pfmq_messagedatastorage mds,
         pfmq_messageinfo mi,
         pfmq_messagestatus ms
   WHERE (    mi.queuename = 'CL312911032'
          AND mi.ID = ms.ID
          AND mi.ID = md.ID
          AND mi.ID = mds.ID
          AND md.ID = mds.ID
          AND md.messageparttype = mds.messageparttype
          AND md.messageparttype = 1
          AND (ms.statusrevisionnumber = (SELECT MAX (statusrevisionnumber)
                                            FROM pfmq_messagestatus ms2
                                           WHERE ms2.ID = ms.ID)
     AND ((ms.status = 64) AND (mi.direction = 1) AND mi.messagetype = 0)
ORDER BY mi.sequenceordinalnumber, mi.senttime

In the above, the table PFMQ_MESSAGEDATASTORAGE contains a long raw column – the MESSAGEPARTDATA column.  You enable a 10046 trace, but only at level 1 – so there are no wait events written to the trace file.  The resulting trace file is then processed using TKPROF, and you obtain the following output:

call     count       cpu    elapsed       disk      query    current  rows
------- ------  -------- ---------- ---------- ---------- ---------- -----
Parse        1      0.00       0.00          0          0          0     0
Execute      1      0.00       0.00          0          0          0     0
Fetch     4321     14.56     580.31     231750     746064          0 64806
------- ------  -------- ---------- ---------- ---------- ---------- -----
total     4323     14.56     580.31     231750     746064          0 64806

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 29

Rows     Row Source Operation
-------  ---------------------------------------------------
  64806  SORT ORDER BY (cr=681258 r=184767 w=0 time=403515790 us)
  64806   COUNT  (cr=681258 r=184767 w=0 time=1729762996 us)
  64806    NESTED LOOPS  (cr=681258 r=184767 w=0 time=1729717540 us)
  64806     NESTED LOOPS  (cr=486600 r=92648 w=0 time=901417748 us)
  64806      NESTED LOOPS  (cr=356748 r=46572 w=0 time=268980743 us)
  64820       TABLE ACCESS BY INDEX ROWID PFMQ_MESSAGEINFO (cr=31514 r=23422 w=0 time=44437657 us)
 120784        INDEX RANGE SCAN AK1_PFMQ_SEQUENCENUMBER (cr=3117 r=3062 w=0 time=10896605 us)(object id 6511)
  64806       TABLE ACCESS BY INDEX ROWID PFMQ_MESSAGESTATUS (cr=325234 r=23150 w=0 time=224278563 us)
  64820        INDEX RANGE SCAN XPKPF_MESSAGESTATUS (cr=260414 r=15792 w=0 time=208616639 us)(object id 6515)
  64820         SORT AGGREGATE (cr=129644 r=116 w=0 time=1973822 us)
  64820          FIRST ROW  (cr=129644 r=116 w=0 time=1810738 us)
  64820           INDEX RANGE SCAN (MIN/MAX) XPKPF_MESSAGESTATUS (cr=129644 r=116 w=0 time=1756030 us)(object id 6515)
  64806      INDEX UNIQUE SCAN XPKPF_MESSAGEDATA (cr=129852 r=46076 w=0 time=632244506 us)(object id 6505)
  64806     TABLE ACCESS BY INDEX ROWID PFMQ_MESSAGEDATASTORAGE (cr=194658 r=92119 w=0 time=828055493 us)
  64806      INDEX UNIQUE SCAN XPKPF_MESSAGEDATASTORAGE (cr=129852 r=46036 w=0 time=613528422 us)(object id 6507)

How would you troubleshoot this performance problem?  What are the problems?  What looks good about the above output?  What about the above makes you wonder if some detail is missing?  You will find the above SQL statement in this comp.databases.oracle.server Usenet thread.

What can be said about the above output?

  • The client application is retrieving roughly 15 rows in each fetch request: 64,806/4,321 = 14.998 rows per fetch.  Maybe setting the fetch array size to a larger value would help?
  • The fetch calls required 14.56 seconds of the server’s CPU time, while the elapsed time for the fetch was 508.31 580.31 seconds.  565.75 seconds were probably spent doing something other than actively burning CPU cycles – such as waiting for the completion of a disk read.
  • Most of the indexes accessed were high precision indexes, meaning that few of the rowids returned by the index where eliminated at the table level, with the exception of the AK1_PFMQ_SEQUENCENUMBER index, where 46% of the rows identified by the index were discarded.  Those rows were discarded very early in the plan, so the performance impact was likely minimal.
  • There was heavy use of nested loop joins – that might be OK, but might not be as efficient as a hash join if a large percentage of the tables were being accessed or if the clustering factor were high.
  • There were 231,750 blocks read from disk, and considering the heavy use of indexes and nested loop joins, those blocks were likely read one at a time from disk.  If that was the case, the average time to read a block from disk was 0.0024412 seconds (2.4412ms), which is considered to be very fast access times for physical reads from disk.
  • Considering the WHERE clause, the execution plan starts with the two tables with the greatest number of predicates – so that is probably a smart starting point.
  • The elapsed time reported on the last line of the execution plan is greater than the elapsed time reported on the first line of the execution plan – just an odd inconsistency that the time reported on the last line was 613.5 seconds, yet the elapsed time reported by TKPROF for the entire execution was only 580.31 seconds?  The second line of the execution plan shows 1,729.7 seconds (28.8 minutes) on the time= entry, which again is inconsistent with TKPROF’s elapsed time statistic.
  • 580.31 seconds is roughly 9.7 minutes – what happened to the other 20.3 minutes in the 30 minute total execution time that was reported by the original poster?

What might be the next couple of steps for troubleshooting this performance problem?

  • Generate the 10046 trace at level 8, rather than level 1.  That will cause the wait events to be written to the trace file.  If the trace file contains a large number of long duration waits on SQL*Net type wait events, check the network with a packet sniffer (Wireshark, for example) and check the client-side activity to make certain that it is not the client application that is the source of the slow performance.  If you see large unexplained gaps in the tim= values in the trace file without a corresponding long wait event in between, investigate whether the server’s CPUs are over-burdened.
  • Check the statistics on the tables and indexes to make certain that those statistics are reasonably accurate and up to date.
  • Review the current optimizer related initialization parameters to make certain that silly parameters are not specified.
  • Most of all, determine where the missing 20.3 minutes have gone.