Book Review: Oracle Performance Firefighting (Fourth Printing)

25 06 2010

June 25, 2010

Many Helpful Topics, However the Errors and Omissions are Unnecessary Distractions
http://resources.orapub.com/Oracle_Performance_Firefighting_Book_p/ff_book.htm

This review covers the June 2010 fourth printing of the “Oracle Performance Firefighting” book.  I bought a PDF copy of the book direct from the OraPub site.  The author’s excellent “Forecasting Oracle Performance” book was a factor in my decision to purchase this book, as was my desire to learn something new while trying to maintain previously acquired knowledge.  As stated in this book, the author has roughly 20 years of experience working with Oracle Database, worked with Cary Millsap at Oracle Corporation for a period of time, and conducts a successful consulting and training business.  I recognize the names of three of the six technical reviewers listed in the book as well known experts in the Oracle community.  The author’s website states that there are no known issues/errata starting with the third printing (November 2009) of the book (reference).  All of the components for the book’s success are in place, so how did the book perform?

There are definitely positive moments in the book, where the book successfully communicates complicated topics using very clear statements.  The book publisher selected a font size and margin size that packs a significant amount of detail into each of the book’s 387 pages.  There are several well-placed helpful diagrams with no space wasting cartoon drawings.  The author introduced his three circle analysis which involves identifying the intersection of the Oracle Database, the Application, and the Operating System to help pinpoint the source of performance problems.  Several examples were provided where the author used the Linux strace command to troubleshoot Oracle Database performance problems.  The book recommends against modifying the _SPIN_COUNT hidden initialization parameter, which is a wise advisement. Clearly states that when there is an operating system bottleneck, there are only 4 subsystems to investigate: CPU, memory, IO, and network.  Provides a good demonstration of using DBMS_MONITOR to enable 10046 traces.  Provides a nice description of in-memory undo and queuing theory in easily understood language.  The author’s website offers the OSM toolkit, which is a collection of scripts that can be freely downloaded – several of the scripts were used throughout the book.  It is obvious that the author invested a significant amount of time into the writing of this book.

There are also definitely negative moments in the book, sections filled with typing errors, logic errors, and omissions.  Signs of problems include a paragraph that includes four hidden initialization parameters that the author suggests to adjust, without the author indicating by how much the initialization parameters should be adjusted, and without the author explaining why three of the four hidden initialization parameters simply do not exist in Oracle Database 11.1.0.7.  Some sections of the book are written in a condescending tone, as if no one other than a DBA is intelligent enough to understand the challenges that a DBA or performance tuning consultant faces, even if the concepts are provided in simple terms to others.  Very few reproducible test cases are provided in the book, which makes it a little difficult for readers to practice or duplicate concepts introduced in the book.

This book is not a replacement for the “Troubleshooting Oracle Performance” book – the two books share very little in common.  This book seems to spend little time explaining how to understand the root cause of the performance problems through various trace file analyses (10046, 10053, 10032, etc.) or even analysis of execution plans, while the author does spend a lot of effort suggesting changes to initialization parameters.  The book’s author is in a somewhat unique position in that he has either a 3 out of 5 star book, or a 5 out of 5 star book depending on the corrections that make it into the fifth printing of the book.  The PDF version of the book cannot be printed, does not permit copying text, and has the purchaser’s name, address, and phone number at the top of each page.

Specific errors, omissions, and distractions in order:

  • The book uses a lot of footnotes, some of which do not add value to the discussion of Oracle Database.
  • Page 16 suggests to keep working to improve performance until time and budget are exhausted, or performance is acceptable.  I think that this is a pathway to compulsive tuning disorder.
  • Page 22 suggests that when the CPU run queue is usually longer than the number of CPU cores, that means that processes are needing to wait for CPU resources.  This suggestion begs the question – isn’t it possible to reach the point of processes needing to wait for CPU resources without the CPU run queue reaching the length of the number of CPU cores?
  • Chapters one and two seem to spend a lot of time trying to relate Oracle firefighting concepts to non-computer related activities: telephone operator, school report card, doctor checking a young child, clients requiring the author to keep a timecard to track time, etc.  Additionally, the early chapters excessively repeat the terms firefighting and firefight.
  • Page 41: Missing letter “Oracle has never guaranteed it has completely instrumented it code.”
  • Page 42: Missing letter “By tracing an Oracle process through the operating system, we can easily observe how Oracle has instrumented it code.”
  • Page 43: Incorrectly defines the meaning of a db file scattered read wait event when attempting to correlate a Linux readv OS call to an Oracle wait event: “Regardless of the operating system or the actual system call, Oracle gives a multiblock IO call the special name db file scattered read, because these multiple blocks can be scattered over the IO subsystem.”  One of the problems with the quote is that a multiblock read from the operating system could correspond to several different Oracle wait events.  A second problem is that the readv call reads logically adjacent sections of a file (think multiple Oracle blocks) on disk into non-adjacent memory locations (reference).
  • Page 47: While Table 2-1 states that only selected columns of V$SYSTEM_EVENT are displayed, the table probably should have included the TOTAL_TIMEOUTS column, which helps describe the true length of wait events that have, for instance, three second timeouts – this column has existed since Oracle Database 8.1.5.  The table also states that the WAIT_CLASS, WAIT_CLASS_ID, AND WAIT_CLASS# columns (the book incorrectly includes an underscore between CLASS and #) were introduced in 11g R1, while the Oracle documentation shows that those columns were introduced in 10g R2 (reference).
  • Page 49: While Table 2-2 states that only selected columns of V$SESSION_WAIT are displayed, the table seems to discount the value of the SECONDS_IN_WAIT column by excluding that column.  The table also states that the WAIT_CLASS, WAIT_CLASS_ID, AND WAIT_CLASS# columns (the book incorrectly includes an underscore between CLASS and #) were introduced in 11g R1, while the Oracle documentation shows that those columns were introduced in 10g R2 (reference).  The table should provide a better description for the STATE column’s value of “WAITED KNOWN TIME: Previous wait completed; the session is currently not waiting.” – the book should describe what does this STATE value means: that the session is in an idle wait event, or that the session is currently burning CPU time (this is what it means).
  • Page 55: States “Systems based on Oracle Database 10g and later can take advantage of the v$sys_time_model and v$ses_time_model views…” – the session level view is actually named V$SESS_TIME_MODEL.
  • Page 61: The book strongly suggests that a user’s “Submit” or “Save” operation equates to a COMMIT, and that a COMMIT equates to a LOG FILE SYNC wait event, and therefore if the average LOG FILE SYNC wait time is identical to a previous day when performance was acceptable, the performance problem is “most likely not related to the database.”  A lot may be said about where this paragraph fails the reality check – just think of one other operation that may happen when the user selects the “Save” option, or one other cause for the LOG FILE SYNC wait, or how averages may distort the true slowness that is experienced by the user, or how the author is simply measuring the wrong thing when trying to determine if the problem is present in the database (reference).
  • Page 68: Example of condescending tone: “When working on Oracle systems, you will eventually hear some poor whiner say something like, ‘Oracle is a resource hog!  It consumes every bit of resources it can get its despicable hands on!’”  More examples of condescending/confrontational tone with the server admin, storage admin, storage vendor, etc. are found on pages 128, 130, 132, 148, 156, 201, 264, 324.
  • Page 74: Example of an odd footnote that does not add value to the book: “I am in no way implying Oracle sincerely cares how much memory its database system requires.”
  • Page 94: The author indicated that one of his tests that modified the _KKS_USE_MUTEX_PIN parameter showed that the mutex implementation decreased the execution time by 11%, yet the author did not provide a SQL statement to create the test table, nor provide the PL/SQL script used in the test so that readers could reproduce the test results.
  • Page 104: The book states: “Virtual memory management, process scheduling, power management, or essentially any activity not directly related to a user task is classified as system time.  From an Oracle-centric perspective, system time is pure overhead…”  The book fails to mention that activities in Oracle Database itself directly cause kernel mode CPU utilization.  Network I/O,  disk I/O (especially with buffered file systems), timer calls (typically gettimeofday calls),  memory access requests, memory management, and process creation and termination (think session creation and termination) all directly cause kernel mode CPU utilization.  The paragraph seems to suggest that the DBAs has no control over kernel mode CPU consumption (reference reference2).
  • Page 116: States: “While it may seem strange, the run queue reported by the operating system includes processes waiting to be serviced by a CPU as well as processes currently being serviced by a CPU.”  This is an operating system specific observation.  Linux appears to behave this way, Windows does not, and others have indicated that various Unix releases do not behave the same as Linux (reference).
  • Page 136: States: “For example, if you ran a response time-focused report over a duration of 60 seconds, and if the single-CPU core subsystem were heavily overworked, Oracle could report that Oracle processes consumed 65 seconds of CPU time. Yet, we clearly know the CPU subsystem is limited to supplying up to 60 seconds of CPU time and no more.”  This appears to be an operating system specific CPU usage measurement problem.  Windows will exhibit this problem, where time spent in the run queue is accumulated with time spent actually running on the CPU, but I could not reproduce that behavior on Linux (reference).
  • The book seems to use Oracle 10g R1 frequently for tests, rather than something more recent, such as 10g R2, in various tests.  Oracle 10g R1 was desupported quite a long time ago, while the 10.2.0.5 patch was only recently released.  In the span of a couple of sentences, the book frequently uses the term “Oracle” to mean both “Oracle Database” and “Oracle Corporation” – this might cause some degree of confusion.
  • Page 175: The book seems to suggest that bind variable values are typically included in TKPROF summaries, except when multiple trace files are combined.  Bind variable values are never included in TKPROF summaries.
  • Page 177: The book seems to recommend changing hidden initialization parameters without warning that those parameters should only be modified after consulting Oracle support.
  • Page 182: The book suggests modifying the _ASH_SAMPLING_FREQUENCY hidden initialization parameter.  Unfortunately, this parameter does not exist in Oracle Database 11.1.0.7 (the only release that I checked).
  • Page 183: The book suggests that changing the _ASH_SIZE hidden initialization parameter will modify the amount of memory used by Active Session History, yet the book mentions that Oracle does not always respect the setting.  This might be a sign that the parameter controls something other than what the author described.  The author also mentioned changing the  _ASH_ENABLE hidden initialization parameter.
  • Page 200: The author suggests that the _DB_BOCK_MAX_CR_DBA hidden initialization parameter controls the number of clones per buffer, however that parameter does not exist in Oracle Database 11.1.0.7.  I believe that the author intended to write _DB_BLOCK_MAX_CR_DBA which attempts to control the maximum number of consistent read copies of a block.
  • Page 203: The author suggests that changing the _DB_BLOCK_HASH_LATCHES hidden initialization parameter will provide some relief to cache buffer chains latch contention.  This is one of the few times that the author cautions, “Before you change this, be aware there may be support issues from both Oracle and your application vendor.”
  • Page 208: The book states that the _SMALL_TABLE_THRESHOLD hidden initialization parameter was deprecated because of the modified LRU (Least Recently Used) algorithm that the book stated was introduced in Oracle Database 8.1.5, and the hidden initialization parameter assumed new life with the introduction of Oracle Database 11g where the “parameter is the threshold for a server process to start issuing direct reads.”  Metalink (My Oracle Support) Doc ID 787373.1 indicates that the parameter was never deprecated.  The actual threshold for selecting a direct path read for a serial operation in Oracle Database 11.1.0.6 and above is roughly five times the value specified for _SMALL_TABLE_THRESHOLD.  Various other articles describe the purpose of the _SMALL_TABLE_THRESHOLD parameter prior to Oracle Database 11.1.0.6. (reference reference2 reference3)
  • Page 213: The book states: “Without physical blocks being read from disk, there will be no LRU chain latch contention, because there will be no need to find a free buffer or insert a buffer header into an LRU chain.”  This statement begs the question, what about consistent read copies of blocks, where Oracle must essentially roll back a block to a specific SCN to provide a consistent read for an in-process query?
  • Page 219: The book states: “Investigate the _db_block_write_batch and the _db_writer_max_writes parameters. Also considering increasing the instance parameter _db_writer_max_scan_pct (the default may be 40, for 40%) or _db_writer_max_scan_cnt…”  One problem with these two sentences is that the _DB_BLOCK_WRITE_BATCH, _DB_WRITER_MAX_SCAN_PCT, and _DB_WRITER_MAX_SCAN_CNT hidden initialization parameters do not exist in Oracle Database 11.1.0.7 (the only release checked) (reference).
  • Page 221: The book states: “Now suppose you have scanned more than _db_writer_max_scan_pct buffer headers. If so, you would be very frustrated.”  Unfortunately, the _DB_WRITER_MAX_SCAN_PCT hidden initialization parameter does not exist in Oracle Database 11.1.0.7 (the only release checked).
  • Page 222: The book states: “Increase _db_writer_max_scan_pct. This will give the database writer more time to flush its write list.”  Unfortunately, the _DB_WRITER_MAX_SCAN_PCT hidden initialization parameter does not exist in Oracle Database 11.1.0.7 (the only release checked).
  • Page 223: The book states, “Strangely, in Oracle Database 11g, Oracle removed the reason code from the v$session_wait and v$session view’s p3 column!”  Column P3 was changed from the reason code to block class starting with Oracle Database 10g R1.
  • Page 226: The author used block dumps to investigate locks, while it seems that it would be easier to query a couple of views.
  • Page 230: The book states, “… Simply keep adding free lists until the contention subsides.” This begs the question of what side-effects may be encountered?
  • Page 231: The book describes how active transactions are assigned to undo segments.  Limited testing indicates that the book is correct unless the objects are created in the SYSTEM tablespace (creating objects in that tablespace is not recommended).
  • Page 253: The book used library_cache dumps to investigate child cursors – it seems that there are easier, less resource intensive ways to obtain this information.
  • Page 256: The book states: “Oracle keeps track of these reloads in the reloads column from v$library_cache…” – the correct view name is V$LIBRARYCACHE, not V$LIBRARY_CACHE.
  • Page 265: The book mentions the use of CURSOR_SHARING = SIMILAR and FORCE, but does not discuss the performance impact of the change other than stating that Mark Gury stated that CURSOR_SHARING = FORCE can result in incorrect results.  No other discussion of performance problems were provided.
  • Page 284: The book describes the author’s under-powered Linux server used in a test.  The server was configured with 512MB of memory, 256MB shared pool, 4MB for the buffer cache and was running Oracle Database 11g Release 1.  The Oracle documentation (reference) states that 1GB of memory is the minimum for Oracle Database 11g Release 1, and that puts into question the validity of any tests performed with that server.  The details of the test case, which obtained a 21% decrease in CPU usage through the use of in-memory undo over the configuration with in-memory undo disabled, was largely omitted, making it impossible for readers to duplicate the test.
  • Page 286: The book lists a parameter named COMPATIBILITY, however that parameter does not exist.  The actual parameter name is COMPATIBLE.
  • Page 289: The book states: “Increase the number of IMU pools by increasing the _im_pools parameter…”  The _im_pools hidden initialization parameter does not exist in Oracle Database 11.1.0.7 (the only release checked) – the author probably intended to write _IMU_POOLS.
  • Page 294: The test script shows that redo statistics for a session are recorded, the session sleeps for 5 seconds, the session queries a table, and then the redo statistics are recorded again.  The author attempted to explain that redo could be generated by a session merely by querying a table due to delayed block cleanout.  The test case is slightly unclear, making it difficult for readers to reproduce the results – there is no point in the session sleeping for 5 seconds, it would have been helpful if the author explained what changes were made to the CUSTOMERS that eventually resulted in the delayed block cleanouts, and because the statistic number for the redo size statistic is prone to change from one Oracle release to the next it would have made more sense to specify SN.NAME=’redo size’ rather than essentially specifying SN.STATISTIC#=133.
  • Page 297 states: “… There is also the instance parameter _log_small_entry_max_size, which is used to shift allocation latch activity onto one of the redo copy latches…”  The _LOG_SMALL_ENTRY_MAX_SIZE hidden initialization parameter does not exist in Oracle Database 11.1.0.7 (the only release checked).
  • Page 298 states: “The DBA can specifically set the number of redo strands via the hidden parameter _log_parallelism.”  The _LOG_PARALLELISM hidden initialization parameter does not exist in Oracle Database 11.1.0.7 (the only release checked).
  • Page 300 states: “Global temporary tables can be explicitly dropped. When this occurs, the associated indexes and views are also dropped.”  The associated indexes will be dropped assuming that the temporary table is also purged, however affected views will not be dropped and will have a status of INVALID.
  • Page 301 states: “Object growth can be observed by repeatedly by querying v$sort_usage_view.  Global temporary tables create 50% less redo than standard tables.” This quote has a minor grammar error, and the view name is V$SORT_USAGE (should not include _VIEW in the name).  The question remains how the author determined that working with global temporary tables generates half as much redo as regular heap tables.
  • Page 302: Another instance where the author correlates an application save with a LOG FILE SYNC wait and concludes that the problem is not in the database instance.
  • Page 305: While it is mentioned on the next page in another section, it probably would have made sense to mention that the redo log buffer is also flushed every three seconds and repeated that the redo log buffer is also by default flushed after every COMMIT.
  • Page 314: The footnote states: “When a process is serviced by one CPU and then is switched to another CPU, it undergoes a context switch… Exclusively binding a CPU and process together virtually eliminates context switching, because the CPU will always be there for that process and the process will never need to wait while the operating system switches its processing to another CPU.”  The author seems to have either redefined the meaning of context switch, or the footnote needs additional clarification (reference reference2).
  • Page 315 states: “The redo log buffer is structured in relatively small operating system-size blocks (typically 512 bytes), not Oracle-size blocks, so it is normal for log writer activity to be performed with multiple blocks.”  The Oracle documentation (reference) states that the redo log file block size defaults to the physical sector size of the disk, which is typically 512 bytes.
  • Page 322 states: “The heavy redo-generating SQL will obviously be some type of DML, like inserts, updates, deletes, or perhaps create statements. If you have ASH licensed, as Figure 8-9 demonstrated, you can easily find the SQL related to the log file parallel write event.” – The heavy redo generating SQL statement could also be a SELECT statement due to delayed block cleanout (as mentioned earlier in the book), a CREATE statement would likely be classified as DDL.  Since the LOG FILE PARALLEL WRITE wait event is posted by LGWR and not a user session, there probably will not be a SQL statement that is definitely associated with the LOG FILE PARALLEL WRITE wait event.
  • Page 324: The book makes a claim that is difficult to verify: “Using global temporary tables can cut redo generation by 50%, substantially reducing Oracle’s overall IO requirements. This alone could solve the log file parallel write problem.”
  • Page 326 states: “If the log writer encounters a nonarchived redo log, it will stop writing and post the wait event log file switch (archive incomplete).”  The actual wait event that will be posted is log file switch (archiving needed).
  • Pages 327 – 387 seem to be related to the author’s “Forecasting Oracle Performance” book, with concepts written in an easier to understand, more fluid writing style.  This is not a problem, just worth mentioning.
  • Page 330: The word “difference” should be replaced with “different”.
  • Page 330 states: “For example, the Statspack facility typically collects data in 60-minute intervals and stores the data in tables starting with stats$ (the key table is stats$snap).”  There is no typical collection period for Statspack, while AWR snapshots default to 60 minutes in duration.
  • Page 333 states: “In previous chapters, I have presented how to gather CPU requirements from the v$sesstat, v$sysstat, v$ses_time_mode, and v$sys_time_model views.”   V$SES_TIME_MODEL should be written as V$SESS_TIME_MODEL.
  • Page 343 states: “Also, Oracle CPU consumption includes Oracle processes waiting for CPU and also queuing for CPU. As a result, in a CPU-saturated system, Oracle may report CPU consumption higher than actually occurred.”  The behavior mentioned by the author is operating system dependent.
  • Page 344 states: “In other words, Oracle does not have the ability to split CPU time into service time and queue time. When we gather CPU consumption from either the instance statistics views or the system time model views, what we collect as CPU time and typically classify as CPU service time actually contains both CPU service time and CPU queue time. Oracle has no way of knowing the difference and reports the total, which is, in the truest sense, CPU response time.” This behavior is operating system dependent.
  • Page 346 states: “The arrival rate in Figure 9-9, which is the horizontal axis, is simply the number of logical IOs (v$sysstat: buffer gets plus consistent gets) processed per millisecond.”  There is no statistic in V$SYSSTAT that is named buffer gets, the author probably should have stated db block gets instead. (reference).
  • Page 347 states: “The instance statistic physical reads signifies the number of Oracle blocks that Oracle server processes had to request from the operating system because the blocks did not reside in Oracle’s buffer cache.”  The physical reads statistic includes block reads that will never hit the buffer cache, including blocks read into the PGA.  It appears that the author is describing the physical reads cache statistic instead.
  • Page 349 states: “…when focusing on a particular session or group of sessions, use v$sesstat, v$ses_time_model, and v$session_event.” The correct view name is V$SESS_TIME_MODEL.

The “Oracle Performance Firefighting” book has a lot of potential for being helpful, however the errors and omissions found in the fourth printing of the book give the book a decidedly dangerous edge.  If flipping initialization parameter values fits your style, this book will be a good fit.

Pages 352 though 387 are currently not included in this book review.

(Section added December 13, 2010): Related blog articles that discuss the contents of this book: