Faulty Quotes 6 – CPU Utilization

February 5, 2010

(Back to the Previous Post in the Series) (Forward to the Next Post in the Series)

The ideal level of CPU utilization in a server is an interesting topic.  Google (and other search engines) find a number of different web pages that advocate that 100% CPU utilization is ideal, that 95% CPU utilization is likely catastrophic, or that significant queuing for CPU time begins when the CPUs are 75% to 80% busy, along with a number of other interesting nuggets of information.  It is important to keep in mind that at any one instant, a CPU (or core or CPU instruction thread) is either 100% busy or 0% busy – at any one instant a CPU cannot be 75% busy.  The 75% or 95% utilization figures found on various web sites, in books, and in presentations are actually an average utilization between two points in time – whether those two points in time represent 0.000001 seconds, 24 hours, or somewhere in between could be very important when trying to determine if there is an excessive CPU utilization issue that causes service level agreement problems (or “slowness” problems reported by end-users).

Assume that in a one minute time period, the CPU utilization in a server is 75% – is that suitable, is that undesirable, or is there not enough information available to make an educated guess?  Keep in mind that the CPU utilization is an average utilization between a starting time point and an ending time point – much like with a Statspack/AWR report, if you look at too large of a time period, significant problems may be masked (hidden from view) when the statistics from the time intervals containing problems are averaged over a long time period.  The 75% CPU utilization could indicate that for every three of four points in time the CPU had work that needed to be performed.  The 75% CPU utilization might also indicate that there was intense competition for the CPU time by many tasks for the first 45 seconds, followed by a complete absence of the need for CPU time in the last 15 seconds of the one minute time period.  For the many tasks competing for CPU time in the first 45 seconds, what might normally complete in one second might have actually required close to 45 seconds due to the operating system attempting to allocate portions of the server’s CPU time to each of the tasks that needed to use the CPU.  The tasks queue up while waiting for their turn for processing, in what is known as the CPU run queue.  As more processes enter the run queue, it takes longer for each process to perform each unit of its normal processing.  This is where the topic of queuing theory becomes very important.  Two very helpful books that discuss queuing theory as it applies to Oracle Database functionality are “Optimizing Oracle Performance” (by Cary Millsap with Jeff Holt) and “Forecasting Oracle Performance” (by Craig Shallahamer).
(Note: This example used one minute as the time interval for measuring CPU utilization in order to rationalize the competition for CPU resources into terms that are easily understood – assuming that a given 3GHz processor is only able to perform one operation at a time, that processor is capable of performing 3,000,000,000 operations per second – 180,000,000,000 operations in that one minute.)
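As a rough back-of-the-envelope sketch of the averaging problem described above (the per-second utilization figures are invented purely for illustration), the following Python snippet shows how two very different utilization patterns produce the exact same one minute average:

```python
# Hypothetical sketch: the same 75% one-minute average can hide very
# different per-second behavior.  Both series below are invented data.
steady = [0.75] * 60                 # 75% busy during every second
bursty = [1.00] * 45 + [0.00] * 15   # saturated for 45s, then idle for 15s

avg_steady = sum(steady) / len(steady)
avg_bursty = sum(bursty) / len(bursty)

print(f"steady average: {avg_steady:.0%}")   # 75%
print(f"bursty average: {avg_bursty:.0%}")   # also 75% - identical average
print(f"seconds at 100% busy: {sum(1 for u in bursty if u == 1.0)}")
```

A monitoring tool that samples once per minute would report both servers as identically loaded, even though the second one was saturated for 45 straight seconds.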

There are a couple of different formulas and notations used in queuing theory, including the Erlang C function, Little’s Law, and Kendall’s Notation.  I will not go into significant detail here on the different queuing theory models, but I will provide a simple example.  Assume that you enter a grocery store that has 10 checkout lanes (think of this like 10 CPUs in a database server).  When it is time to pay for the items in your cart, a person working for the store directs you into one of the 10 checkout lanes.  If anyone else is directed into the same checkout lane as you, you will need to alternate with that person at the checkout counter every 10 seconds – when your 10 second time period is up, you will need to load up everything placed on the conveyor belt and allow the other person to unload their items on the belt to use the checkout lane for 10 seconds  (this loading and unloading of items could be time consuming).  If anyone else is directed into your checkout lane, that person will also be able to use the checkout counter for 10 second intervals.  In short order, what would have required 5 minutes to complete is now requiring 30 minutes.  If the line grows too long in one checkout lane, there might be a chance to jump into a different checkout lane used by fewer people, possibly once a minute (some Linux operating systems will potentially move a process from one CPU to a less busy CPU every 200ms).  Jumping into a different checkout lane not only allows you to check out faster, but also allows the people who remain in the original line to check out faster.  The above is a very rough outline of queuing theory.  If the customer expects to check out in no more than 10 minutes, how many lanes are necessary, given that customers arrive at a random rate and the target must be met 99% of the time?
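The lane-count question at the end of the paragraph can be approximated with the Erlang C function for an M/M/c queue.  The arrival rate and checkout time below are assumptions chosen purely for illustration (they do not come from the analogy above), and real checkout times are not exponentially distributed, so treat this as a sketch of the calculation rather than a definitive answer:

```python
import math

def erlang_c(c, a):
    """Probability that an arriving customer must wait (M/M/c queue).
    c = number of servers (lanes/CPUs), a = offered load in Erlangs."""
    if a >= c:
        return 1.0  # unstable system: the queue grows without bound
    top = (a ** c / math.factorial(c)) * (c / (c - a))
    bottom = sum(a ** k / math.factorial(k) for k in range(c)) + top
    return top / bottom

def p_wait_exceeds(c, a, service_time, t):
    """P(queuing delay > t), assuming exponential service times."""
    if a >= c:
        return 1.0
    return erlang_c(c, a) * math.exp(-(c - a) * t / service_time)

# Assumed numbers: customers arrive at 1.5 per minute on average and a
# full checkout takes 5 minutes, so the offered load is 7.5 Erlangs.
# Find the fewest lanes that keep the chance of queuing more than
# 10 minutes below 1%.
arrival_rate, service_time, target = 1.5, 5.0, 0.01
load = arrival_rate * service_time
lanes = math.ceil(load)
while p_wait_exceeds(lanes, load, service_time, 10.0) > target:
    lanes += 1
print(f"offered load = {load} Erlangs -> {lanes} lanes needed")
```

With these assumed inputs the answer works out to 10 lanes for a 7.5 Erlang load – the "extra" lanes beyond the average load are exactly the headroom that the random arrival pattern demands.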

CPU queuing is not a linear problem – 100% CPU utilization is not twice as bad as 50% CPU utilization, it is much worse than that.  Some of the articles below explain this concept very well – a Google search found a couple of interesting articles/presentations that computer science professors assembled for various classes – you might find it interesting to read some of those documents that are found in the .edu domain (it appears that none of those links made it into this blog article).  Some operating systems use a single run queue (for instance, Windows, and Linux prior to the 2.6 kernel release), with the end result of effectively evenly distributing the CPU load between CPUs, causing the processes to constantly jump from one CPU to another (this likely reduces the effectiveness of the CPU caches – pulling everything off the conveyor belt in the analogy).  Other operating systems have a separate run queue for each CPU, which keeps the process running on the same CPU.  Quick quiz: If our 10 CPU server in this example has a run queue of 10 – does that mean that one process is in each of the 10 CPU run queues, or is it possible that all 10 processes will be in just one of the 10 run queues, or possibly something in between those two extremes?  Are all three scenarios equally good or equally bad?
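A minimal sketch of that nonlinearity uses the classic single-server (M/M/1) response time formula, response time = service time / (1 − utilization).  A single-server model is a simplification of a multi-CPU server, but it makes the shape of the curve obvious:

```python
# Why queuing is nonlinear: in an M/M/1 queue the average response time
# is service_time / (1 - utilization), which explodes as utilization
# approaches 100% - it does not merely double between 50% and 100%.
service_time = 1.0  # one unit of CPU work

for u in (0.50, 0.75, 0.90, 0.95, 0.99):
    response = service_time / (1.0 - u)
    print(f"{u:.0%} busy -> response time {response:6.1f}x the idle case")
```

At 50% utilization the work takes 2x as long as on an idle server; at 95% it takes 20x as long, and at 99% it takes 100x as long – which is why the "75% to 80% busy" warnings quoted later in this article exist.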

Keep in mind that while sessions are in “wait events,” that does not mean that the sessions are not consuming server CPU time.  A session in an Oracle wait event might motivate a significant amount of system (kernel) mode CPU time on behalf of the session.  Sending/receiving data through the network, disk accesses, inspection of the current date/time, and even reading eight bytes (a 64 bit word) from memory motivates the use of the server’s CPU time.  CPU saturation may lead to latch contention (note that latch contention may also lead to CPU saturation due to sessions spinning while attempting to acquire a latch), long-duration log file waits (log file sync, log file parallel write), cluster-related waits, increased duration of single-block and multiblock reads, and significant increases in server response time.

So, with the above in mind, just what did my Google search find?  In the following quotes, I have attempted to quote the bare minimum of each article so that the quote is not taken too far out of context (I am attempting to avoid changing the meaning of what is being quoted).

“Oracle Performance Tuning 101” (Copyright 2001, directly from the book) by Gaja Vaidyanatha states:

“One of the classic myths about CPU utilization is that a system with 0 percent idle is categorized as a system undergoing CPU bottlenecks… It is perfectly okay to have a system with 0 percent idle, so long as the average runnable queue for the CPU is less than (2 x number of CPUs).”

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:6108562636780

“Are you at 100% utilization?  If not, you haven’t accomplished your job yet.  You cannot put CPU in the bank and save it for later.  So, if you are running with idle cycles you should be looking for more ways to use it.”

dba-oracle.com/t_high_cpu.htm

“Remember, all virtual memory servers are designed to drive CPU to 100%, and 100% CPU utilization is optimal, that’s how the server SMP architecture is designed.   You only have CPU enqueues when there are more tasks waiting for CPU, than you have CPU’s (your cpu_count)… Remember, it is not a cause for concern when the user + system CPU values approach 100 percent. This just means that the CPUs are working to their full potential.”

dba-oracle.com/t_tuning_cpu_usage_vmstat.htm

“Remember, it is not a cause for concern when the user + system CPU values approach 100 percent. This just means that the CPUs are working to their full potential. The only metric that identifies a CPU bottleneck is when the run queue (r value) exceeds the number of CPUs on the server.”

dba-oracle.com/t_monitor_cpu_unix.htm

“Within UNIX, the OS is geared to drive CPU consumption to 100%, so the best way to monitor CPU usage is by tracking the ‘r’ column in vmstat”

dba-oracle.com/t_detecting_windows_cpu_processor_bottlenecks.htm

“100% utilization DOES NOT always indicate any bottleneck. It just means that the CPU is busy!  You ONLY have a CPU bottleneck when the runqueue exceeds cpu_count.”

fast-track.cc/op_unix_79_identifying_high_cpu.htm

“Please note that it is not uncommon to see the CPU approach 100 percent even when the server is not overwhelmed with work. This is because the UNIX internal dispatchers will always attempt to keep the CPUs as busy as possible. This maximizes task throughput, but it can be misleading for a neophyte.”

dbaforums.org/oracle/index.php?showtopic=5552

“It’s normal for virtual memory systems to drive the CPU to 100%.

What you need to look for are CPU runqueues, not 100% values”

dbaforums.org/oracle/index.php?showtopic=9986

“No problem! Processors are designed to drive themselves up to 100%.

You are only CPU-bound when the runqueue exceeds the number of processors”

http://forums.oracle.com/forums/thread.jspa?messageID=2518290

“100% utilization is the optimal state. If you want to look for CPU bottlenecks, use vmstat and check the “r” (runqueue) column…  It’s not a claim, it’s a fact, according to the folks who built their servers!  The vendors who build the servers say that 100% CPU utilization is optimal, and they wrote both the OS and the underlying hardware… Every 1st year DBA noob panics at some point when they go into top and see that the CPU is at pegged at 100%.”

http://forums.oracle.com/forums/message.jspa?messageID=2501989

“All SMP architectures are designed to throttle-up the CPU quickly, and a 100% utilization DOES NOT mean an overload. It’s straight from Algorithms 101…  Just to make sure that you are not operating under “assumptions” here, I’m talking about server-side CPU consumption, on an SMP server running lots of concurrent tasks. The references to 100% CPU are as they display in standard OS monitors like lparstat, watch, sar and vmstat.  Also, don’t assume that all OS tasks have the same dispatching priority. In a server-side 100% CPU situation, some tasks may have enqueues, while other do not. That’s what ‘nice’ is for.”

http://books.google.com/books?id=cHHMDgKDXtIC&pg=PA112

“Remember, it is not a cause for concern when the user + system CPU values approach 100 percent.  This just means that the CPUs are working to their full potential. The only metric that identifies a CPU bottleneck is when the run queue (r value) exceeds the number of CPUs on the sever.”

——————–

Before deciding that 100% CPU utilization is not only normal, but something we should all try to achieve, visit the following links and spend a little time reading the text near the quoted section of each document.

——————–

“Optimizing Oracle Performance” page 264, by Cary Millsap:

“On batch-only application systems, CPU utilization of less than 100% is bad if there is work waiting in the job queue. The goal of a batch-only system user is maximized throughput. If there is work waiting, then every second of CPU capacity left idle is a second of CPU capacity gone that can never be reclaimed. But be careful: pegging CPU utilization at 100% over long periods often causes OS scheduler thrashing, which can reduce throughput. On interactive-only systems, CPU utilization that stays to the right of the knee over long periods is bad. The goal of an interactive-only system user is minimized response time. When CPU utilization exceeds the knee in the response time curve, response time fluctuations become unbearable.”

“Forecasting Oracle Performance” page 71 by Craig Shallahamer:

 “With the CPU subsystem shown in Figure 3-7, queuing does not set in (that is response time does not significantly change) until utilization is around 80% (150% workload increase). The CPU queue time is virtually zero and then skyrockets because there are 32 CPUs. If the system had fewer CPUs, the slope, while still steep, would have been more gradual.”

“Forecasting Oracle Performance” page 195 by Craig Shallahamer:

“The high-risk solution would need to contain at least 22 CPUs. Because the reference ratios came from a 20 CPU machine, scalability is not significant. However, recommending a solution at 75% utilization is significant and probably reckless. At 75% utilization, the arrival rate is already well into the elbow of the curve. It would be extremely rare to recommend a solution at 75% utilization.”

http://forums.oracle.com/forums/thread.jspa?messageID=2518290

“First: check the following simple example of how wrong you can be in saying {‘using’ all of your CPU is a good thing} especially in a multi-user, shared memory environment such as an active Oracle instance. You see, although ‘using’ all of your CPU may be desirable if you don’t waste any of it, in a multi-user system you can waste a lot of CPU very easily – even when nobody goes off the run queue.”

http://forums.oracle.com/forums/message.jspa?messageID=2501989

“But it’s not ‘normal’ to drive CPUs to 100%. Except for extremely exotic circumstances (and that excludes database processing) it means you’ve overloading the system and wasting resources…  Consider the simple case of 8 queries running on 8 CPUs. They will be competing for the same cache buffers chains latches – which means that seven processes could be spinning on the same latch while the eighth is holding it. None of the processes ever need wait, but most of them could be wasting CPU most of the time.”

http://www.dell.com/downloads/global/solutions/public/White_Papers/hied_blackboard_whitepaper.pdf Page 19

“One of the sizing concepts that is independent of the server model is resource utilization. It is never a good idea to attempt to achieve 100% resource utilization. In the Blackboard benchmark tests, the optimum Application Server CPU Utilization was 75% to 90%. In general, clients should size all Application Servers to achieve no more than 75% CPU utilization. For database servers, the optimum CPU utilization is 80% in non-RAC mode. In RAC mode, clients should consider CPU utilization rates around 65% at peak usage periods to allow reserve capacity in case of cluster node failover.”

http://www-03.ibm.com/support/techdocs/atsmastr.nsf/5cb5ed706d254a8186256c71006d2e0a/546c74feec117c118625718400173a3e/$FILE/RDB-DesignAndTuning.doc

“The CPU utilization goal should be about 70 to 80% of the total CPU time. Lower utilization means that the CPU can cope better with peak workloads.  Workloads between 85% to 90% result in queuing delays for CPU resources, which affect response times. CPU utilization above 90% usually results in unacceptable response times.  While running batch jobs, backups, or loading large amounts of data, the CPU may be driven to high percentages, such as to 80 to 100%, to maximize throughput.”

11g R2 Performance Tuning Guide:

“Workload is an important factor when evaluating your system’s level of CPU utilization. During peak workload hours, 90% CPU utilization with 10% idle and waiting time can be acceptable. Even 30% utilization at a time of low workload can be understandable. However, if your system shows high utilization at normal workload, then there is no room for a peak workload.”

Oracle9i Application Server Oracle HTTP Server powered by Apache Performance Guide

“In addition to the minimum installation recommendations, your hardware resources need to be adequate for the requirements of your specific applications. To avoid hardware-related performance bottlenecks, each hardware component should operate at no more than 80% of capacity.”

Relational Database Design and Performance Tuning for DB2 Database Servers

Page 26: “When CPU utilization rises above 80%, the system overhead increases significantly to handle other tasks. The lifespan of each child process is longer and, as a result, the memory usage supporting those active concurrent processes increases significantly. At stable load, 10% login, and CPU utilization below 80%, the memory usage formula is as follows…”

Page 27: “When system load generates a high CPU utilization (>90%) some of the constituent processes do not have enough CPU resource to complete within a certain time and remain ‘active’.”

Oracle Database High Availability Best Practices 10g Release 2

“If you are experiencing high load (excessive CPU utilization of over 90%, paging and swapping), then you need to tune the system before proceeding with Data Guard. Use the V$OSSTAT or V$SYSMETRIC_HISTORY view to monitor system usage statistics from the operating system.”

“Optimizing Oracle Performance” page 317

“Oracle’s log file sync wait event is one of the first events to show increased latencies due to the time a process spends waiting in a CPU run queue.”

Metalink Doc ID 148176.1 “Diagnosing hardware configuration induced performance problems”

“In general your utilization on anything should never be over 75-80%…”

http://www.tomfarwellconsulting.com/Queuing%20Presentation.pdf

“Linear thinking is a common human process.  This notion implies that if an input increases by 20 percent the output of the system changes by 20 percent.

Computer response does not follow a linear curve. It is entirely possible that a change in computer input by 20 percent results in a change in output of hundreds of percent.”

http://www.mcguireconsulting.com/newsl_queuing.html

“This utilization-based multiplier increases exponentially and does so rapidly after the 50% utilization point, as shown in the graph below. The translation is: if a resource’s utilization is much beyond 50%, there is a higher probability that congestion will occur. Keep in mind that at 100% utilization, the delay goes to infinity, which is the direction of these curves.”

http://www.db2-dba.net/articles/Article-Usage%20Factor.html

“Request Service time =  ( Ideal Request Service time x  Usage factor ) / (1 – Usage factor )    where   0>=  Usage factor >= 1
The  Request Service time is proportional to U/(1-U).

As you can see from the simple plot above as U reaches 0.95  you are fast approaching the meltdown point.
As U gets closer to .95  the Service time of the system reacts violently and starts approaching infinity. 
The ‘system’  might be your  CPU , DISK, Network, employee or your motor car. 

It is just a bad idea to push the average resource utilization factor beyond 0.9, and the peak resource utilization factor beyond 0.95.”

http://databaseperformance.blogspot.com/2009/01/queuing-theory-resource-utilisation.html

“So an Oracle database server does conform to the general model in Queuing Theory of having lots of separate clients (the shadow servers) making requests on the resources in the system. And as a result, it does conform to the golden rule of high resource utilisation equals queues of pending requests.

As a result, 100% CPU utilization is very bad, and is symptomatic of very large queues of waiting processes. Queuing Theory also shows that above 50% utilization of a resource, there is always a request in the queue more often than not…  A general rule of thumb is to get worried at 80% utilization, as the number of concurrent requests will average something around four, and rises exponentially above this.”

http://kevinclosson.wordpress.com/2007/07/21/manly-men-only-use-solid-state-disk-for-redo-logging-lgwr-io-is-simple-but-not-lgwr-processing/

“Once LGWR loses his CPU it may be quite some time until he gets it back. For instance, if LGWR is preempted in the middle of trying to perform a redo buffer flush, there may be several time slices of execution for other processes before LGWR gets back on CPU…” Fix the CPU problem, and the other significant waits may decrease.

Newsflash – 100% CPU is Worse than You Think!

“Amazing. During my entire discussion of CPU load and process priorities I completely ignored the fact that I’m using 2 dual core cpus on that system, and that all Oracle processes use shared memory, which means shared resource, which means locks, which means resource wasting by waiting for locks. And this complicated the discussion, because 6 processes on 8 CPUs will also waste time waiting for locks. You don’t need 100% CPU to suffer from this.”

Opinions are bound to change over time.  The first two quotes are from OakTable Network members, and those quotes were originally written eight or nine years ago.  If you were to ask those two people today (or even shortly after the release of the “Optimizing Oracle Performance” book in 2003), they might state something a bit different about driving and then holding CPUs at 100% utilization.  Interestingly, holding a CPU at 100% utilization will cause its core temperature to gradually increase.  Logic in some of the CPUs will sense this increase in temperature and actually throttle back the speed of the CPU until the core temperature drops to acceptable levels.  Of course, when the CPU speed is throttled back, that potentially causes the CPU run queue to increase in length because the amount of work required by each process has not decreased, while the number of CPU cycles per second available for performing work has decreased.

More could be written on this subject, but I will leave it at this point for now (for instance, I could have mentioned process/thread priority gradually decreasing on some operating systems for those processes consuming a lot of CPU time).  Opinions, other than this article being too short (or too long)?  As you can probably tell from the quotes, CPU utilization issues are not just a special situation on a single operating system, or a special situation with a certain program (Oracle Database).
