Brain Teaser: 10046 Extended SQL Trace Shows a FETCH Call with c=306350000, e=299174653, p=0, cr=22298 – How is that Possible?

26 04 2012

April 26, 2012

Last week I posted a popular article that questioned the effectiveness of showing a job candidate a raw 10046 trace file that was captured at level 12, asking the job candidate to evaluate whether or not the SQL statement in the 10046 trace is efficient.  Many of the commenters suggested that the approach is potentially very effective if the interviewer is knowledgeable about the subject matter, and he recognizes that the question deserves more than a simple yes or no answer.

A year ago I posted a related article, a brain teaser in fact, that showed 0.015600 seconds of CPU consumption in 0.000510 seconds of elapsed time without using parallel execution.  The comments attached to last year’s article correctly identified the source of the seemingly impossible statistics.

A recent OTN thread (don’t click the link until you have thought about the problem for a while) shows an interesting section of a 10046 trace file, a portion of which appears below:

PARSE #2:c=0,e=4764,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=1182070262719
BINDS #2:
EXEC #2:c=0,e=255,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=1182070263378
WAIT #2: nam='SQL*Net message to client' ela= 5 driver id=1650815232 #bytes=1 p3=0 obj#=258 tim=1182070263514
FETCH #2:c=306350000,e=299174653,p=0,cr=22298,cu=0,mis=0,r=1,dep=0,og=1,tim=1182369438322
WAIT #2: nam='SQL*Net message from client' ela= 1454 driver id=1650815232 #bytes=1 p3=0 obj#=258 tim=1182369440799
WAIT #2: nam='SQL*Net message to client' ela= 4 driver id=1650815232 #bytes=1 p3=0 obj#=258 tim=1182369441012
FETCH #2:c=0,e=293,p=0,cr=0,cu=0,mis=0,r=15,dep=0,og=1,tim=1182369441265
WAIT #2: nam='SQL*Net message from client' ela= 7437 driver id=1650815232 #bytes=1 p3=0 obj#=258 tim=1182369448858
WAIT #2: nam='SQL*Net message to client' ela= 3 driver id=1650815232 #bytes=1 p3=0 obj#=258 tim=1182369449007
FETCH #2:c=0,e=283,p=0,cr=0,cu=0,mis=0,r=15,dep=0,og=1,tim=1182369449257
WAIT #2: nam='SQL*Net message from client' ela= 6701 driver id=1650815232 #bytes=1 p3=0 obj#=258 tim=1182369456082
WAIT #2: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=258 tim=1182369456222 

The interesting portion of the above trace file is the first FETCH line where a single row is returned to the client.  The c=306350000 entry indicates that 306.35 seconds of server CPU time was consumed during the fetch call.  The e=299174653 entry indicates that the elapsed time (as if captured by a stop watch) of the single row fetch was 299.174653 seconds.  The p=0 entry indicates that 0 blocks were physically read from disk when preparing to retrieve the one row.  The cr=22298 entry indicates that 22,298 consistent gets were performed when preparing to retrieve the first row.  Typically, performing a large number of consistent gets will result in the consumption of a considerable number of CPU seconds, yet 22,298 consistent gets could not explain the consumption of more than 5 minutes of CPU time, even if the OP was running Oracle Database on an Intel 8088 CPU.

The brain teaser posed (directly or indirectly) by the OP:

  1. Why did the first fetch of a single row require roughly 299 seconds of elapsed time, when the second fetch of 15 rows apparently required 0.000293 seconds?
  2. Why did the first fetch of a single row require roughly 306 seconds of CPU time, when the second fetch of 15 rows apparently required no CPU time?
  3. How is it possible that the CPU consumption exceeds the elapsed time by 7.175347 seconds?
  4. How is it possible that 306.35 seconds of CPU were burned while performing only 22,298 consistent gets?
  5. The trace file indicates that 306.35 seconds of CPU were burned during the first fetch call, does the number 306.35 represent 100% of the CPU consumption during the fetch call, or is the actual CPU consumption higher or lower than that number?
  6. What release version of Oracle Database is the OP running?
  7. What operating system is the OP using for Oracle Database?
  8. What other questions might you ask of a job candidate if the 10046 trace file from the OTN thread were handed to a job candidate during an interview?

Think about the questions for a couple of minutes before visiting the OTN thread.  Several of the above questions are answered in that thread.


Actions

Information

5 responses

27 04 2012
Mladen Gogala

Well, the first thought was that there was some kind of buffer busy wait problem, which would burn CPU like crazy, but there is no indication of any such thing in the trace file. A COU intensive operations like sort/hash would be done before the fetch phase and your trace shows that this is happening during the fetch phase. Being an ex-system administrator, I would also suspect paging. As a matter of fact, paging fits the description perfectly. On a memory-strapped systems, the parts of SGA containing the requested blocks might have been swapped out. The VM susbsystem has to make room to bring them in. So, it will make a survey of the allocated memory and chose what to swap out to make room for the needed part of SGA. That would account for the time in minutes.
You have even tried to give us a hint, by mentioning the OS. Good job, Charles!

28 04 2012
Charles Hooper

Mladen,

Nice analysis, and the explanation of why buffer busy waits should be considered, and then discounted due to the lack of wait events reporting that type of problem. The lack of wait events makes this problem a bit difficult to investigate. A couple of years ago Tanel Poder and James Morle put together a blog article series that describes approaches to troubleshooting problems like this one, but the series may not have covered the memory-strapped system situation that you mentioned. Below is a link to one of the articles in the series:
http://blog.tanelpoder.com/2010/01/15/beyond-oracle-wait-interface-part-2/

I think that the time/CPU required for a sort/hash actually would appear on the FETCH call line in the trace file, but I agree with you that the sort/hash would need to complete before the fetched row could be sent to the client. With so few consistent gets reported, it is important to think about what else could happen – maybe the runtime engine is pinning the blocks and then re-visiting the same blocks frequently:
http://books.google.com/books?id=G9AJA91PL54C&pg=PA114

At this point, it appears that the OP has abandoned the OTN thread. I guess that I should get back to finishing the review of the performance tuning book – the recipe for the souffle is faulty, it seems that the souffle deflates every time I open the oven door to check on it.

2 05 2012
Charles Hooper

There have been fewer attempts at answering the eight brain teaser questions that were posed directly or indirectly by the OP, despite the seemingly high number of views of the OTN thread. I thought that I might take an attempt at answering questions number 3 and 5.

#3 In the OTN thread, Jonathan Lewis noticed that with c=306350000 and e=299174653, the c (CPU) value is roughly 1.0239838 times the value of the e (Elapsed) time, which is roughly the same value as 1024/1000 (2^10 / 10^3). That might imply that the OP’s combination of Oracle Database version and operating system version very likely assumes that there are 1024 milliseconds of CPU time (2^10 ticks) per second, rather than 1000. But why?

I took a look through the book “Optimizing Oracle Performance” for clues, in part because I thought that I remembered that the reason for the bit shift optimization (using 2^10 rather than 10^3) was mentioned in that book. While I did not find the quote that I remembered., I found a couple of interesting quotes that lead down the path to a couple other lines of thought:
Page 148:

“On POSIX-compliant operating systems, the Oracle kernel obtains CPU usage information from a function called getrusage on Linux and many other operating systems, or a similar function called times on HP-UX and a few other systems.”

Page 156:

“The Linux gettimeofday call, for example, converts a value from nanoseconds (10^-9 seconds) to microseconds (10^-6) by performing an integer division of the nanosecond value by 1,000, effectively discarding the final three digits of information.”

Page 163:

“With a 1-cs clock resolution, getrusage may return microsecond data, but those microsecond values will never contain valid detail smaller than 1/100th (0.01) of a second.”

Page 164:

“At every interrupt, the operating system’s process scheduler tallies one full centisecond (10,000 microseconds) of CPU consumption to whatever process is running at that time.”

Interesting… the quote from page 164 might be helpful when answering question 5. The quote from page 156 makes me question, but what if instead of dividing the result of the Linux gettimeofday function by 1000 (10^3), Oracle were to optimize this routine by performing a bit shift operation (effectively dividing by 2^10)?

A trip through Metalink (MOS) found another error on that site. Doc ID 39817.1: “Interpreting Raw SQL_TRACE output” (https://supporthtml.oracle.com/epmos/faces/ui/km/DocumentDisplay.jspx?id=39817.1 ) essentially states that the c= statistic represents the microsecond CPU consumption rounded to 1/100th of a second in Oracle Database 9.0.1 and above (this statement is incorrect for some operating system platforms, including the Windows operating system).

But why divide or multiply by 1024, rather than 1000? Efficiency, especially with older CPUs (division performance is impacted more than multiplication). In one article that I found (http://stackoverflow.hewgill.com/questions/545/844.html ), the following statement was made:

“On the 1995-era Sparc I was running this code on integer multiplication took 33 clock cycles; one cycle per bit in the word. The bit shift took 1 cycle.”

So, if what this person stated was actually true, a bit shift operation is roughly 33 times more efficient than a multiplication (and even more efficient than a division) on some CPUs from the 1995 time frame.

Another article offered several comparisons between bit shift operations and multiplication/division (http://stackoverflow.com/questions/6357038/is-multiplication-and-division-using-shift-operators-in-c-actually-faster ). Among other things, the number of clock cycles required for such operations on Intel 80486 and Core i7 was discussed, as well as some of the automatic optimizations that are built into modern C code compilers:

“Say we stick to “486” (the newest one listed), 32 bit registers and immediates, IMUL takes 13-42 cycles and IDIV 44. Each SAL takes 2, and adding 1, so even with a few of those together shifting superficially looks like a winner.”

So, for the Intel 80486 processor, a bit shift operation is roughly 21 or 22 times more efficient than a multiply or divide operation.

The getrusage function, mentioned in the “Optimizing Oracle Performance” book, is not available on some platforms, including Windows and Solaris. Some of the websites that I found indicated that the C programming language’s clock() function might be a suitable substitute (on some operating system platforms, a call to the clock() function actually calls getrusage behind the scenes). The C clock() function returns the number of ticks counted, and that number must be divided by the CLOCKS_PER_SEC (macro function) constant that varies by operating system (1,000,000 on some OS platforms, and 128 or 1 on others)
http://web.archiveorange.com/archive/v/uSKRm99PbuVNLTzGkJSN

Oddly, it seems that the C clock() function is not applied consistently across all platforms. Apparently, that function returns the wall clock time on Windows and Solaris, and the CPU usage time on other platforms.
http://stackoverflow.com/questions/588307/c-obtaining-milliseconds-time-on-linux-clock-doesnt-seem-to-work-properl
http://bytes.com/topic/c/answers/517364-clock_t-clock-ambiguity
http://rabbit.eng.miami.edu/info/functions/time.html

The “Optimizing Oracle Performance” book states that HP-UX and a few other systems use the times() function to determine CPU usage. Some OS platforms require the value returned by times() to be divided by 128 (might this also be a source of calculation problem, if the number was actually divided by 100 thinking that doing so would generate a centisecond value).:
http://freebsd70.unixmanpages.net/man7/clocks.html

So, the answer to question number 3 (How is it possible that the CPU consumption exceeds the elapsed time by 7.175347 seconds?), in this case, the difference is probably caused by either Oracle Database or the operating system performing a bit shifting 2^10 rather than a multiplication of 1000 when returning the number of CPU microseconds consumed (of course, it could be incorrectly bit shifting the elapsed time instead).

Other possibilities mentioned in the OTN thread that might cause the CPU time to exceed the elapsed time:
* The session was spinning on a latch that could not be obtained. The fact that the session did not post a latch related wait event would mean that either the _SPIN_COUNT parameter was modified to an extremely high value, or that the session encountered a bug in the code that resulted in a nearly infinite loop.
* Auditing is enabled, and something went wrong in the audit code.
* The session’s process was suspended at the operating system level, although the CPU time probably would not closely equal the elapsed time if that happened.
* The session’s process was in the CPU run queue for 299 seconds – some operating systems will continue to accumulate CPU time for processes even when those processes are sitting in the run queue. It seems very unlikely that this is what happened, unless someone had used something such as the nice OS utility and there was a severe shortage of CPU availability. The tim= time stamps closely agree with the reported wait event elapsed times, so that fact might decrease the chances that the process spent a lot of time in the run queue due to excessive competition for the CPU.
* The server’s clock was resynchronized with an external time source. I could see this affecting the reported elapsed time, but not the CPU time.
* Memory is over-allocated on the server, which resulted in at least a portion of the buffer cache being swapped to the swap file. It seems unreasonable that it would take nearly 5 minutes to perform a memory swap operation.
* Parallel query is enabled for the SQL statement (it very likely was not in this case).

For question number 5 (The trace file indicates that 306.35 seconds of CPU were burned during the first fetch call, does the number 306.35 represent 100% of the CPU consumption during the fetch call, or is the actual CPU consumption higher or lower than that number?), I think that the safest bet is to state that the answer is an unknown. Does the reported CPU time include the CPU time required to write the previous trace file line to the 10046 trace file? The OP’s operating system is outputting CPU usage statistics that appear to be rounded to the 1/100th of a second – was the actual utilization higher or lower than that value (even if we divide by 1.024), keeping in mind the final quote from the book “Optimizing Oracle Performance”? Maybe “during the fetch call” must be better defined – should that figure also include or exclude the CPU time to perform the context switch to perform the process state transition? What about CPU consumption in un-instrumented code that is falsely associated with the FETCH call?

13 02 2013
Log Buffer #270, A Carnival of the Vanities for DBAs

[…] Charles Hooper dishes out another brain teaser which is related to 10046 Extended SQL Trace. […]

27 02 2014
jkstill

I’ve recently seen an interesting bit of information included in a fetch call.
Oracle thick clients on Windows use out of band breaks. The polling time for the out of band breaks is reported in a fetch call.
I’ve tested this on Oracle 9.2 (the client database) as well as a 11.2.0.3 on Linux 5.8 – same results.
The issue came about due to a setting on HPUX that caused the poll calls to take 15 milliseconds rather than the normal 5 microseconds, leading to some very long query times.
FETCH seems to be a catch all at times.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s




%d bloggers like this: