December 27, 2009
What follows is part of a presentation that I gave in 2008.
A sudden unexpected spike in server CPU usage was noticed while running a batch process in an ERP package that normally has most of its time captured in the SQL*Net message from client wait event. My Toy Project for Performance Tuning was actively logging database performance when the problem was noticed. Below is a marked up view of the program as it captured the spike in CPU activity:In this 83 second time period, the database instance performed 7.8 million consistent gets, and not a single physical read. The database had a 100% buffer cache hit ratio during this time period, and that is good, right? The database instance read 505.6 million rows by full table scan in the 83 seconds with relatively few rows read by index lookup, which might be a sign of a problem.
Clicking on the Rows Read by Table Scan statistic, we see the running history of the number of rows read by a full table scan, with the current statistic value in blue:If we dig a bit more, knowing that the number of rows retrieved by table scan seems high, we see that we had a quiet period, and then the number of rows read by full table scan jumped considerably. At the same time, the CPU usage also jumped. We should dig into the system a bit more deeply to understand why this is a problem.
Switching over the the SQL Monitor, we start to investigate:If we look at the SQL statements as they are executed, we see one SQL statement that seems to be performing a large number of buffer gets, or logical reads – about 5.1 million per minute when fetching 123 times. The execution plans for the SQL statement are shown in the bottom pane. Note that there are two execution plans for this SQL statement, with the second created due to a bind mismatch – the initial parse call did not initialize the bind variables which caused Oracle to treat the bind variable as a VARCHAR2(2000), the second parse initialized the bind variable as a CHAR.
Zooming in on the grid at the top:If we look at the SQL statements as they are executed, we see one SQL statement that seems to be performing a large number of buffer gets, or logical reads – about 5.1 million per minute when fetching 123 times with 120 executions. Something seems a bit odd.
Zooming in on the execution plan at the bottom:
Taking a look at the execution plan, we see that the query optimizer decided to perform a full table scan when executing this SQL statement, rather than using the selective index on the PART_ID column. Oracle is applying a filter against the table rows to find only those with a specific PART_ID having a QTY greater than the COSTED_QTY, and with a TYPE of O.
If we submit the SQL statement with the STATISTICS_LEVEL parameter set to ALL at the session level, and retrieve the actual execution timing by passing in ‘ALLSTATS LAST’ as the format parameter for DBMS_XPLAN.DISPLAY_CURSOR, we see the following:The above slide shows that a single execution of the query required roughly 0.5 seconds (10% of the estimated time), required 42,751 consistent gets, and returned 1 row when the optimizer predicted that 943 rows would be returned. The good news is that the BCHR is 100%, or maybe that is not good news, depending on who you ask.
The final slide shows the expected execution plan output by DBMS_XPLAN.DISPLAY_CURSOR:The above output shows that the query should be executing in 0.01 seconds or less with the same value specified for the PART_ID, with just 16 consistent gets. While not checked, the clustering factor for the X_INV_TRANS_1 index on the PART_ID column will certainly be high, perhaps close to the number of rows in the table.
What caused the above problem? Assume that the OPTIMIZER_INDEX_COST_ADJ parameter is left at its default value of 100, and system statistics were collected with a command like the following:
The above command captures system (CPU) statistics with a 30 minute time interval. The statistics describe the performance of the server, including CPU speed (possibly a measure of the number of consistent gets per second), single block physical read time, multi-block physical read time, average number of blocks read by multi-block reads, as well as several other statistics - these statistics are visible in the SYS.AUX_STATS$ view. Assume that the data is stored in ASSM tablespaces with AUTO segment space management, which gradually ramps up the extent size from 64KB (for the first 1MB of data) to 1MB and beyond. If DB_FILE_MULTIBLOCK_READ_COUNT is set to allow 1MB multi-block reads (or auto-tuned to allow 1MB multi-block reads), it is quite possible that the MBRC statistic in SYS.AUX_STATS$ could be set to a very high value, possibly close to 100, if system statistics are gathered shortly after bouncing the database. It is also possible that a clever DBA might have executed something like the following to force the MBRC statistic to a value that is close to what would be expected if full table scans involving physical reads might be required (DB_FILE_MULTIBLOCK_READ_COUNT was auto-tuned to 128):
Since the optimizer assumes that every block needed for a query will result in a physical read, odd side-effects, such as high CPU usage with long execution times, may happen when every block needed is satisfied by blocks already in the buffer cache, and a popular value (possibly identifying 0.98% of the rows in the table) is specified on the initial hard parse when bind peeking is enabled (the default starting with Oracle 9.0.1).