September 27, 2010
It is far too easy to invest many hours into building a test case to demonstrate or explore a simple concept, or to model a particular problem that is being encountered. I recently had the opportunity to attend a session by Jonathan Lewis titled “The Beginner’s Guide to Becoming an Expert”, which emphasized the use of test scripts to investigate various types of unexpected behavior when that behavior is found. Using examples from Jonathan’s session, the unexpected behavior might be a reduction in the redo generated when an update statement is driven through an index access path. Another example from the session was the significant amount of redo generated when a bitmap index is updated, compared to the amount generated when a b*tree index on the same column is updated – and the discovery that the amount of redo generated for the bitmap index drops significantly when the Oracle Database is upgraded from 9.2 to 11.1. Creating a test script that models a particular situation allows the problem to be investigated in isolation, and then repeatedly analyzed as one variable is changed at a time – the change might be a modified initialization parameter, different initial bind variables, execution of the script against a different Oracle Database release version, or any number of other small scale changes.
One of the key concepts in the creation of test cases is to compare the intended end result of the test case with the starting point, and then identify logical transformation stages between the two end points. At each transformation stage confirm that the preceding step logically leads to the next step, and then determine if there is any chance that a different result might be obtained given the same input. Work backward from the last transformation stage to the first, and ask yourself if there is any other possible input that might have caused the outcome. It takes practice, a fair amount of research, and sometimes trial and error to discover the multiple potential cause and effect relationships.
I recently used this technique to build a test case for an OTN thread to discover why an index was not used to retrieve three rows from an 8 million row table. The “given” information was that the index was used if an always-false comparison of two literal VARCHAR values was removed from an OR condition in the WHERE clause. The first step in building a test case, assuming that the production environment where the problem was found is not available for use, involves building a model of the source data. If you are familiar with the real data that causes the unexpected behavior, build a model that simulates the real data; otherwise, assume that the data to be returned is scattered among the other rows in the table. In my test case, to simplify the calculations a bit, the number of rows to be inserted into the test table was increased to 10 million. Before building the data model, determine what might cause a full table scan to be performed rather than using an index access path: a high clustering factor for the index, poor cardinality estimates (might be caused by out of date object statistics), incorrectly gathered system statistics, poorly specified optimizer parameters, a sub-optimal index definition (might require an index skip scan, or might return a large number of rows that are later eliminated at the table level), optimizer bugs, or maybe the full table scan really is faster than an index access path.
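As an illustration only (not the actual OTN test case – the object names and data distribution here are hypothetical), the data model might look something like this: 10,000,000 rows with just three rows of interest scattered evenly through the table, and a query carrying the always-false literal comparison OR’ed into the WHERE clause:

-- Hypothetical model: 10,000,000 rows, with the three rows of interest
-- scattered evenly through the table.
CREATE TABLE T1 AS
SELECT
  LEVEL C1,
  DECODE(MOD(LEVEL, 3000000), 0, 'RARE', 'COMMON') STATUS,
  RPAD('A', 200, 'A') PADDING
FROM
  DUAL
CONNECT BY
  LEVEL <= 10000000;

CREATE INDEX IND_T1_STATUS ON T1(STATUS);

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'T1', CASCADE=>TRUE)

-- Only three rows satisfy STATUS='RARE' - does the always-false literal
-- comparison discourage the index access path?
SELECT
  *
FROM
  T1
WHERE
  STATUS = 'RARE'
  OR 'A' = 'B';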
Now that we have identified some of the causes for a full table scan, our test case may be used to experiment with “what if” and “how would I know if that was the cause?” To determine if the clustering factor was the problem, I might first need to know how the optimizer uses the clustering factor, and whether I am able to artificially adjust it to see if a change in the value makes a difference in performance. To determine if the cardinality estimates are a problem I might use DBMS_XPLAN.DISPLAY_CURSOR to retrieve the estimated cardinality and actual number of rows returned by every operation in the execution plan. I might remove the comparison of the two literal VARCHAR values to see how that change affects the cardinality estimates, or see if changing one of the literal values causes a change – one of those changes just might also allow an index access path to be used. I might try to use a hint to force the use of the index (hints are, after all, directives and not suggestions), and if that does not work I might analyze why the hint was invalid. If the hint worked, after verifying that a performance improvement was achieved, I would likely consult a 10053 trace file to determine why the index access path was not automatically used – was the cost too high, or was the index access path cost not even calculated?
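For example, one way to see the estimated and actual row counts side by side (a sketch, reusing the hypothetical T1 table from above):

-- Execute the query with runtime statistics collection enabled, then pull
-- the actual execution plan showing estimated (E-Rows) and actual (A-Rows)
-- row counts for each operation.
SELECT /*+ GATHER_PLAN_STATISTICS */
  *
FROM
  T1
WHERE
  STATUS = 'RARE'
  OR 'A' = 'B';

SELECT
  *
FROM
  TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));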
Once the cause of the full table scan is determined, additional what-if questions should be considered. If the index hint worked, am I able to permanently add that hint to the SQL statement – and are there any cases where adding that hint might lead to other performance problems later? If the cardinality estimates are incorrect, is there anything that I am able to do to fix the cardinality estimates? If the optimizer parameters are poorly specified, am I able to adjust those parameters – and if I do, what might be the impact on the other SQL statements that depend on the current initialization parameter values for optimal performance and costing calculations? Are we able to modify the application that submitted the SQL statement to remove the unnecessary literal comparison?
Once the desired outcome is reached, work backward through the steps and determine where this approach might go wrong. Is it reasonable to conclude that the previous step always results in the next step? In short, test cases are not terribly difficult to construct once you learn what might cause a transformation step to lead to the wrong next step.
Oracle Database behavior changes from one release version to another, and sometimes from one patch level to the next. Save the test case scripts once generated, because the behavior that is demonstrated by a test script might only be true for a small number of Oracle Database releases. My test case script that generates a deadlock on Oracle Database 11.1 and 11.2, while simply generating an expected enqueue on Oracle Database 10.2, is just one example of the usefulness of test cases for demonstrating version-specific Oracle behavior.
Do we need a test case to determine why the OTN forums, if working at all, are so slow this morning? I believe that it required more than two minutes just to display the forum home – that was roughly an hour after I saw the message stating that the OTN forums were down (someone must have tripped over the power cord again).
Any other ideas for easily creating useful test cases?
Charles,
Nice write-up. I must admit it is, kind of, becoming more difficult with each release of Oracle (or the CBO, in particular) to write reproducible test cases.
p.s. BTW, if you don’t mind and have interest and time, can you have a look at this thread?
It is a long thread and I am not sure if the OP has managed to resolve the issue, but I would be curious to know how you would approach the problem.
That is an interesting thread. Just a couple of comments:
* It does not appear that the execution plan actually changed when the user changed the OPTIMIZER_MODE to ALL_ROWS. The server just appeared to be busier when the query was executed with the OPTIMIZER_MODE set to ALL_ROWS because all of the wait events increased in time – there were probably fewer of the needed blocks still in the buffer cache, and that likely explains the jump in the number of waits on the “db file sequential read” wait event.
* The execution plan that was displayed by the EXPLAIN PLAN FOR command does not exactly match the runtime plan that was output by TKPROF. If you copy both versions of the plan into a spreadsheet package (I used Microsoft Excel) you will see that lines 41 through 44 of the EXPLAIN PLAN FOR version of the plan do not appear in the runtime version of the plan. Between lines 34 and 35 of the EXPLAIN PLAN FOR version of the plan there is an index unique scan of the FND_TERRITORIES_TL_U1 index that is used to retrieve rows from the FND_TERRITORIES_TL table.
* The cardinality estimates appear to be inaccurate, and this might have to do with the ROWNUM=1 specification in the scalar subqueries – there is a bug that affects all release versions prior to 11.2.0.1 where a ROWNUM predicate in the WHERE clause incorrectly pushes the cardinality estimate too low for much of the rest of the plan (I have a test case somewhere in my blog that demonstrates this problem, but I am not sure if it applies in this case, and it may in fact be forcing a FIRST_ROWS optimizer mode – a rough sketch of the effect appears at the end of this comment). The cardinality estimate problem could also be caused by the use of ANALYZE rather than DBMS_STATS – Jonathan Lewis has several articles that describe the differences (http://jonathanlewis.wordpress.com/2009/04/30/analyze-this/, http://jonathanlewis.wordpress.com/2010/03/31/analyze-this-2/, http://jonathanlewis.wordpress.com/2010/04/01/analyze-this-2-2/).
* The excessive use of scalar subqueries is probably causing an excessive use of nested loop joins.
* The view POC_ASN_PICKUP_LOCATIONS_V might not be too terribly efficient. Inside the view the FND_LOOKUP_VALUES_U2 index returns 3778 rows, while only 1 row makes it through the restrictions on the FND_LOOKUP_VALUES table ("PV"."SEGMENT1"="B"."DESCRIPTION" AND "B"."DESCRIPTION" IS NOT NULL AND "B"."ATTRIBUTE9" IS NOT NULL AND "B"."ENABLED_FLAG"='Y' AND "B"."TAG"=:B1 AND TO_NUMBER("B"."ATTRIBUTE11")=:B2).
* The number of consistent gets is fairly high at 2,259,039 and that might explain the 19.89 seconds of CPU time.
* There are several calls to USERENV(‘LANG’) – I wonder if those calls result in a context switch.
* If there was a lot of competition for CPU time on the server, it might be the case that the process for this session was having difficulty getting back onto the server’s CPUs (edit: note that it is not the session’s process that must get back onto the CPU to send the next two packets) every time a fetch call was received from the client and every time two network packets were sent to the client. The max wait time of 7.48 seconds on the “SQL*Net more data to client” wait event in the one run, and the 232.41 seconds on the same wait event during the second run (when OPTIMIZER_MODE was set to ALL_ROWS), makes me wonder whether the server’s CPU is a bottleneck or there is a severe network problem.
* If the SDU size is set to 8,000, that implies the client was fetching roughly (228 + 11,289) * 8,000 = 92,136,000 bytes (87.87MB) of data, which would put the average row length at 4,064 bytes – enough to fill 3 network packets. If every two packets received by the client must be ACKed, that probably means there would be around 30,712 (92,136,000 / 1,500 / 2) ACKs, and if the ping time between the client and server was 20ms, it just might take 614 seconds to transfer the 92,136,000 bytes to the client – that is close to the reported 421.60 elapsed seconds for the SQL statement when the OPTIMIZER_MODE was set to CHOOSE. So, it might be a good idea to look at the average ping time between the server and client.
* The client computer may be choking on the data that it is receiving – it may be overloaded.
A Wireshark trace on both the client and server might help, as would monitoring the server’s CPU usage.
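A rough sketch of the ROWNUM cardinality effect mentioned above (a simplified, hypothetical demonstration – not the OP’s query):

CREATE TABLE T2 AS
SELECT
  LEVEL C1,
  RPAD('A', 100, 'A') C2
FROM
  DUAL
CONNECT BY
  LEVEL <= 100000;

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'T2')

-- Without ROWNUM the optimizer should estimate roughly 99,500 rows.
EXPLAIN PLAN FOR
SELECT * FROM T2 WHERE C1 > 500;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- With ROWNUM=1 a COUNT STOPKEY operation appears, and the cardinality
-- estimate at the top of the plan drops to 1.
EXPLAIN PLAN FOR
SELECT * FROM T2 WHERE C1 > 500 AND ROWNUM = 1;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);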
Charles,
Thanks a lot for your in-depth analysis accompanied with all calculations.
I must admit I would have to re-read the point about SDU setting many times to understand it 😉
Apart from that, you have pointed out some amazing things (which I had completely missed).
At present, I am struggling to understand your following statements. Can you help?
The excessive use of scalar subqueries is probably causing an excessive use of nested loop joins
Can the use of scalar subqueries really affect the way other tables in the query are joined?
If there was a lot of competition for CPU time on the server, it might be the case that the process for this session was having difficulty getting back onto the server’s CPUs (edit: note that it is not the session’s process that must get back onto the CPU to send the next two packets) every time a fetch call was received from the client and every time two network packets were sent to the client. The max wait time of 7.48 seconds on the “SQL*Net more data to client” wait event in the one run, and the 232.41 seconds on the same wait event during the second run (when OPTIMIZER_MODE was set to ALL_ROWS), makes me wonder whether the server’s CPU is a bottleneck or there is a severe network problem
Is it really possible that a bottleneck on server’s CPU results in large waits on “SQL*Net more data to client”? Again, I am struggling to build a test case for this.
BTW, the OP had managed to eliminate a few scalar subqueries (as mentioned in subsequent follow-ups), which brought down the disk reads and consistent gets to some extent. Also, in the OP’s version of Oracle, the default SDU size is 2KB, although the OP has not mentioned explicitly if he was using a default or non-default SDU size.
Besides the unusually large “SQL*Net more data to client” waits, I am struggling to understand if the NESTED LOOP join used by the CBO is the most appropriate join method, and whether the wait event (or the cause of it) is somehow affecting the optimizer’s decisions.
p.s. if the OP does respond, I would point him to your reply as it is definitely much more exhaustive and convincing than what I tried to provide on that thread.
I saw that the OP claimed to have removed some of the excessive scalar subqueries and transformed those into inline views, but I did not see those inline views listed in the FROM clause, so I did not spend a lot of time investigating the changes. I do see that the Predicate Information section of the explain plan no longer shows ROWNUM, for example “20 – filter(ROWNUM=1)”. The scalar subqueries could limit the number of available execution plans, and that is why I made the comment about nested loop joins.
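As a rough illustration (hypothetical tables, assuming T2.C1 is unique), compare a scalar subquery in the SELECT list with a logically equivalent outer join – the first form behaves much like a forced nested loop into T2, while the second leaves the optimizer free to consider other join methods:

-- Scalar subquery form: T2 is probed once per distinct T1.C1 value
-- (subject to scalar subquery caching) - effectively a nested loop.
SELECT
  T1.C1,
  (SELECT T2.C2 FROM T2 WHERE T2.C1 = T1.C1) C2
FROM
  T1;

-- Outer join form: the optimizer may choose a hash join, sort-merge join,
-- or nested loop join.
SELECT
  T1.C1,
  T2.C2
FROM
  T1,
  T2
WHERE
  T2.C1(+) = T1.C1;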
Regarding the test case for CPU competition causing excessive waits for the “SQL*Net more data to client” wait event, you may see different results on different operating system platforms, and even with different network cards. If you want to try testing, build several tables with a couple of columns, such that when the tables are joined and you select from those tables the average row length returned to the client is roughly 4,000 bytes. From a client computer, connect to the database using SQL*Plus, set the array fetch size to 100, disable output to the screen (SET AUTOTRACE TRACEONLY STATISTICS), set timing on, enable a 10046 trace at level 8, then query the tables. Now, apply a load to the database server such that there are 8 processes fighting for every CPU core (or core thread) – you will find examples of simple CPU load generators in the “Expert Oracle Practices” book: http://books.google.com/books?id=xzjw4hoeDegC&pg=PA197 Change the 10046 trace file name, and again from the client computer issue the same query as before. Some servers (in some cases depending on the network card’s offloading capability) will need CPU cycles for generating checksum values for network packets, and also verifying that all packets in the conversation were successfully received. The additional CPU competition should cause an effect on the network transfer speed.
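A sketch of the client-side portion of that test (the object names and trace file identifier are placeholders):

SET ARRAYSIZE 100
SET AUTOTRACE TRACEONLY STATISTICS
SET TIMING ON

ALTER SESSION SET TRACEFILE_IDENTIFIER = 'NET_TEST_NO_LOAD';
ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT FOREVER, LEVEL 8';

-- The tables should be built so that the joined row length is roughly
-- 4,000 bytes.
SELECT
  T1.*, T2.*, T3.*, T4.*
FROM
  T1, T2, T3, T4
WHERE
  T1.C1 = T2.C1
  AND T2.C1 = T3.C1
  AND T3.C1 = T4.C1;

ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT OFF';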
How do the elapsed times displayed in SQL*Plus compare? If you process the 10046 trace files using TKPROF, how do the statistics change? Do you see a large increase in the “SQL*Net more data to client” wait event?
You are right regarding the default SDU size being 2,000 – I was in a bit of a rush when I wrote the above comment and did not have a chance to re-read the quote for typing or logic errors (in this case, stating 8,000 was a logic error not a typing error).
We have to keep in mind that according to the wait events and the CPU usage, for the first TKPROF summary, we see the following:
13.0195 minutes waiting on the client/network
0.3315 minutes on the CPU
0.4963 minutes on server-side wait events
Even if we were able to completely eliminate the server CPU component and the server-side wait events, we still would have to account for the 13 minutes of client/network time. So, if we spend too much time trying to optimize the SQL statement, it might not make much of a difference in the overall execution time for the SQL statement.
Charles,
Many thanks for the pointers on building the test case. I will see what I can do, as I doubt I will be able to produce this using only my laptop.
One last question which is nagging me a lot about this thread.
1) Is the time spent on “SQL*Net more data to client” accounted for in the figure displayed under “elapsed time” for the query?
2) Is the time (and statistics) for scalar subqueries accounted for in the “CPU Time” and “Elapsed Time” figures for the query?
p.s. Ok. those are two questions, not one. Hope you don’t mind 😉
I will claim to not know the answer (actually I cannot remember at the moment), so let’s set up a test case to find out. The test table is created with the following commands:
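(A sketch of the creation commands – the exact original script may have differed, but the table must match the description that follows: a single row slightly less than 1,000 bytes in length, with column values that change from one row to the next.)

CREATE TABLE T1 AS
SELECT
  LEVEL C1,
  RPAD(TO_CHAR(LEVEL), 240, 'A') C2,
  RPAD(TO_CHAR(LEVEL), 240, 'B') C3,
  RPAD(TO_CHAR(LEVEL), 240, 'C') C4,
  RPAD(TO_CHAR(LEVEL), 240, 'D') C5
FROM
  DUAL
CONNECT BY
  LEVEL <= 1000000;

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'T1')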
The above created a table such that if we select a single row it will be slightly less than 1,000 bytes in length (note that I had to make certain that the column values changed from one row to the next so that SQL*Net compression would not alter the test results), which will allow us to try a couple of experiments:
* Using a small array fetch size of 2 we should be able to pack 2 of these rows into a single SDU unit, so we should not have any “SQL*Net more data to client” wait events.
* Using a larger array fetch size of 5,000 we should see plenty of “SQL*Net more data to client” wait events.
* Performing a four-way self-join should cause each row returned to be roughly 4,000 bytes.
* Adding a significant CPU load to the server for roughly 10 minutes to see the impact on the “SQL*Net more data to client” wait events as well as the other wait events – the 10 minute CPU load ends before the SQL statement completes.
* Forcing the server-side of the network connection to 10Mb/s (probably half-duplex because the switch side was not also forced to 10Mb/s full duplex), which should increase the duration of the “SQL*Net more data to client” wait events.
The test script (as written requires Oracle Database 11.1.0.6 or higher) – note that the results with the array fetch size set to 15 will be discarded:
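(A simplified sketch of the script – it omits whatever 11.1-specific features the original used, and the trace identifiers are the ones named in the results below:)

SET TIMING ON
SET AUTOTRACE TRACEONLY STATISTICS

-- Warm-up pass at the default array fetch size of 15 - results discarded.
SELECT * FROM T1;

-- Array fetch size of 2: two ~1,000 byte rows should just fit into a single
-- default 2,000 byte SDU.
ALTER SESSION SET TRACEFILE_IDENTIFIER = '10046TraceTest1';
ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT FOREVER, LEVEL 8';
SET ARRAYSIZE 2
SELECT * FROM T1;
ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT OFF';

-- Array fetch size of 5,000 with a four-way self-join (roughly 4,000 bytes
-- per returned row).
ALTER SESSION SET TRACEFILE_IDENTIFIER = '10046TraceTest2';
ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT FOREVER, LEVEL 8';
SET ARRAYSIZE 5000
SELECT
  A.*, B.*, C.*, D.*
FROM
  T1 A, T1 B, T1 C, T1 D
WHERE
  A.C1 = B.C1
  AND B.C1 = C.C1
  AND C.C1 = D.C1;
ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT OFF';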
Now the test script when the significant CPU load is applied:
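(Presumably the same four-way self-join as before, re-run while the load is active – a sketch:)

ALTER SESSION SET TRACEFILE_IDENTIFIER = '10046TraceTest3';
ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT FOREVER, LEVEL 8';
SET ARRAYSIZE 5000
SELECT
  A.*, B.*, C.*, D.*
FROM
  T1 A, T1 B, T1 C, T1 D
WHERE
  A.C1 = B.C1
  AND B.C1 = C.C1
  AND C.C1 = D.C1;
ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT OFF';

One simple hypothetical CPU load generator (run one copy in each of several sessions so that roughly 8 processes compete for every CPU core/thread):

-- Spin on the CPU for roughly 10 minutes, then exit.
DECLARE
  V NUMBER := 0;
  T DATE := SYSDATE;
BEGIN
  WHILE SYSDATE < T + 10 / 1440 LOOP
    V := V + 1;
  END LOOP;
END;
/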
The test script when the server’s network card was forced to 10Mb/s full duplex:
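(Presumably identical to the previous script apart from the trace file identifier, for example:)

ALTER SESSION SET TRACEFILE_IDENTIFIER = '10046TraceTest4';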
The results… will follow.
‘10046TraceTest1’
‘10046TraceTest2’
‘10046TraceTest3’
‘10046TraceTest4’
(still executing)
The tkprof summaries:
‘10046TraceTest1’ (array fetch size set to 2)
‘10046TraceTest2’ (array fetch size set to 5,000)
‘10046TraceTest3’ (array fetch size set to 5,000 with significant server CPU load)
‘10046TraceTest4’ (array fetch size set to 5,000 with the server’s network card forced to 10Mb/s) – still not done, so I aborted the query.
The above tkprof summary from the aborted execution should answer your question regarding whether the “SQL*Net more data to client” wait event is factored into the elapsed time.
This particular server has a network card that supports CPU offloading. From the raw ‘10046TraceTest3’ (array fetch size set to 5,000 with significant server CPU load), near the start:
Near the end (including the final fetch call):
In this case it does not appear that the duration of the “SQL*Net more data to client” wait events was affected by the significant CPU load at the start of the query execution, but what about the difference in the tim= values – is there a greater amount of “missing” time when the server was very busy? I should mention that the server was running on Windows, and that probably explains the increased CPU seconds displayed in the tkprof summary when the server’s CPUs were heavily loaded (on Windows, time in the run queue accumulates as if the process were still on the CPU).
Charles,
Awesome!!!
Thanks a zillion for taking time to answer my question, along with a complete test case.