Faulty Quotes 6 – CPU Utilization

February 5, 2010

(Back to the Previous Post in the Series) (Forward to the Next Post in the Series)

The ideal level of CPU utilization in a server is an interesting topic.  Google (and other search engines) finds a number of different web pages claiming that 100% CPU utilization is ideal, that CPU utilization at 95% is likely catastrophic, that significant queuing for CPU time begins when the CPUs are 75% to 80% busy, as well as a number of other interesting nuggets of information.  It is important to keep in mind that at any one instant a CPU (or core, or CPU instruction thread) is either 100% busy or 0% busy – at any one instant a CPU cannot be 75% busy.  The 75% or 95% utilization figures found on various web sites, in books, and in presentations are actually an average utilization between two points in time – whether those two points in time are separated by 0.000001 seconds, 24 hours, or somewhere in between could be very important when trying to determine whether there is an excessive CPU utilization issue that causes service level agreement problems (or “slowness” problems reported by end-users).
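The utilization number reported by any monitoring tool is produced exactly this way: a busy/idle counter is sampled at two points in time and the busy delta is divided by the elapsed time.  The following is a minimal sketch of the idea (Linux-specific, and it assumes /proc/stat is readable – a rough illustration rather than a production monitor):

```python
# Minimal sketch: CPU utilization is busy time divided by elapsed time between
# two sample points (Linux-specific; reads the aggregate "cpu" line of /proc/stat).
import time

def cpu_times():
    with open("/proc/stat") as f:
        fields = [float(x) for x in f.readline().split()[1:9]]
    idle = fields[3] + fields[4]        # idle + iowait
    return sum(fields), idle            # total jiffies, idle jiffies

total1, idle1 = cpu_times()
time.sleep(60)                          # a 60-second interval; a 1-second interval
total2, idle2 = cpu_times()             # can paint a very different picture
utilization = 1.0 - (idle2 - idle1) / (total2 - total1)
print(f"Average CPU utilization over the interval: {utilization:.1%}")
```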

Assume that in a one minute time period, the CPU utilization in a server is 75% – is that suitable, is that undesirable, or is there simply not enough information available to make an educated guess?  Keep in mind that the CPU utilization is an average utilization between a starting time point and an ending time point – much like with a Statspack/AWR report, if you look at too large of a time period, significant problems may be masked (hidden from view) when the statistics from the time intervals containing problems are averaged over a long time period.  The 75% CPU utilization could indicate that at three out of every four points in time the CPU had work that needed to be performed.  The 75% CPU utilization might also indicate that there was intense competition for the CPU time by many tasks for the first 45 seconds, followed by a complete absence of the need for CPU time in the last 15 seconds of the one minute time period.  For the many tasks competing for CPU time in the first 45 seconds, what might normally complete in one second might have actually required close to 45 seconds due to the operating system attempting to allocate portions of the server’s CPU time to each of the tasks that needed to use the CPU.  The tasks queue up while waiting for their turn for processing, in what is known as the CPU run queue.  As more processes enter the run queue, it takes longer for each process to perform each unit of its normal processing.  This is where the topic of queuing theory becomes very important.  Two very helpful books that discuss queuing theory as it applies to Oracle Database functionality are “Optimizing Oracle Performance” (by Cary Millsap with Jeff Holt) and “Forecasting Oracle Performance” (by Craig Shallahamer).  (Note: This example used one minute as the time interval for measuring CPU utilization in order to put the competition for CPU resources into terms that are easily understood – assuming that a given 3GHz processor is only able to perform one operation at a time, that processor is capable of performing 3,000,000,000 operations per second – 180,000,000,000 operations in that one minute.)
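A trivial numeric illustration of that masking effect (the per-second busy figures are made up): a minute that is completely saturated for 45 seconds and idle for 15 seconds averages out to the same 75% as a minute of steady, comfortable load.

```python
# Hypothetical per-second busy fractions: saturated for 45 seconds, idle for 15.
samples = [1.0] * 45 + [0.0] * 15

print(f"60-second average: {sum(samples) / len(samples):.0%}")   # 75% -- looks fine

# The same data viewed in 15-second windows tells a very different story.
for start in range(0, 60, 15):
    window = samples[start:start + 15]
    print(f"seconds {start:2d}-{start + 14:2d}: {sum(window) / len(window):.0%}")
```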

Queuing theory offers several useful tools, including the Erlang C formula, Little’s Law, and Kendall’s notation.  I will not go into significant detail here on the different queuing theory models, but I will provide a simple example.  Assume that you enter a grocery store that has 10 checkout lanes (think of this like 10 CPUs in a database server).  When it is time to pay for the items in your cart, a person working for the store directs you into one of the 10 checkout lanes.  If anyone else is directed into the same checkout lane as you, you will need to alternate with that person at the checkout counter every 10 seconds – when your 10 second time period is up, you will need to pack up everything placed on the conveyor belt and allow the other person to unload their items onto the belt to use the checkout lane for 10 seconds (this loading and unloading of items could be time consuming).  If anyone else is directed into your checkout lane, that person will also be able to use the checkout counter for 10 second intervals.  In short order, what would have required 5 minutes to complete is now requiring 30 minutes.  If the line grows too long in one checkout lane, there might be a chance to jump into a different checkout lane used by fewer people, possibly once a minute (some Linux operating systems will potentially move a process from one CPU to a less busy CPU every 200ms).  Jumping into a different checkout lane not only allows you to check out faster, but also allows the people who remain in the original line to check out faster.  The above is a very rough outline of queuing theory.  If each customer expects to check out in no more than 10 minutes, customers arrive at a random rate, and the target must be met 99% of the time, how many lanes are necessary?
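A minimal sketch of that lane-sizing question using the Erlang C formula for an M/M/m queue – every number in it (the arrival rate, the five-minute service time, and the split of the 10-minute target into five minutes of service plus at most five minutes of queuing) is a hypothetical assumption chosen purely for illustration:

```python
import math

arrival_rate = 1.5     # customers arriving per minute (assumed)
service_time = 5.0     # minutes at an uncontended checkout lane (assumed)
max_wait     = 5.0     # minutes allowed in the queue, so the total stays under 10 (assumed)
percentile   = 0.99    # the wait target must be met this often

def erlang_c(m, a):
    """Probability that an arriving customer must queue, for m lanes and an
    offered load of a Erlangs (a = arrival rate x service time)."""
    rho = a / m                                   # per-lane utilization
    top = (a ** m / math.factorial(m)) / (1.0 - rho)
    bottom = sum(a ** k / math.factorial(k) for k in range(m)) + top
    return top / bottom

a = arrival_rate * service_time                   # offered load in Erlangs
m = int(a) + 1                                    # smallest lane count with rho < 1
while True:
    # For M/M/m: P(queue wait > t) = ErlangC * exp(-(m - a) * t / service_time)
    p_wait_too_long = erlang_c(m, a) * math.exp(-(m - a) * max_wait / service_time)
    if p_wait_too_long <= 1.0 - percentile:
        break
    m += 1

print(f"Lanes needed: {m}, per-lane utilization: {a / m:.0%}")
```

With these made-up numbers the loop stops at roughly 11 lanes, each only about 68% busy – random arrivals plus a response time target translate into a surprising amount of required headroom.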

CPU queuing is not a linear problem – 100% CPU utilization is not twice as bad as 50% CPU utilization, it is much worse than that.  Some of the articles below explain this concept very well – a Google search found a couple of interesting articles/presentations that computer science professors assembled for various classes – you might find it interesting to read some of those documents that are found in the .edu domain (it appears that none of those links made it into this blog article).  Some operating systems use a single run queue (for instance, Windows, and Linux prior to the 2.6 kernel release), with the end result of effectively evenly distributing the CPU load between CPUs, causing the processes to constantly jump from one CPU to another (this likely reduces the effectiveness of the CPU caches – pulling everything off the conveyor belt in the analogy).  Other operating systems have a separate run queue for each CPU, which keeps the process running on the same CPU.  Quick quiz: If our 10 CPU server in this example has a run queue of 10 – does that mean that one process is in each of the 10 CPU run queues, or is it possible that all 10 processes will be in just one of the 10 run queues, or possibly something in between those two extremes?  Are all three scenarios equally good or equally bad?
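Several of the quotes below lean on the vmstat “r” (run queue) column as the deciding metric, so it is worth seeing what that number actually is.  The following is a rough, Linux-specific sketch (it reads the procs_running counter from /proc/stat, which is essentially the figure vmstat reports); note that this is a system-wide count, so by itself it cannot answer the quiz above about how 10 runnable tasks are distributed across 10 per-CPU run queues:

```python
# Rough sketch (Linux-specific): compare the number of runnable tasks to the CPU
# count, similar to watching the "r" column in vmstat output.
import os
import time

cpu_count = os.cpu_count()

def runnable_tasks():
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("procs_running"):
                return int(line.split()[1])   # includes this sampling process itself
    return 0

for _ in range(10):
    r = runnable_tasks()
    note = "  <-- more runnable tasks than CPUs" if r > cpu_count else ""
    print(f"runnable tasks: {r:3d}   CPUs: {cpu_count}{note}")
    time.sleep(1)
```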

Keep in mind that a session shown in a “wait event” is not necessarily idle from the server’s point of view – it may still be consuming server CPU time.  A session in an Oracle wait event might drive a significant amount of system (kernel) mode CPU time on its behalf.  Sending/receiving data through the network, disk accesses, inspecting the current date/time, and even reading eight bytes (a 64 bit word) from memory all consume the server’s CPU time.  CPU saturation may lead to latch contention (note that latch contention may also lead to CPU saturation due to sessions spinning while attempting to acquire a latch), long-duration log file waits (log file sync, log file parallel write), cluster-related waits, increased duration of single-block and multiblock reads, and significant increases in server response time.

So, with the above in mind, just what did my Google search find?  In the following quotes, I have attempted to quote the bare minimum of each article so that the quote is not taken too far out of context (I am attempting to avoid changing the meaning of what is being quoted).

“Oracle Performance Tuning 101” (Copyright 2001, directly from the book) by Gaja Vaidyanatha states:

“One of the classic myths about CPU utilization is that a system with 0 percent idle is categorized as a system undergoing CPU bottlenecks… It is perfectly okay to have a system with 0 percent idle, so long as the average runnable queue for the CPU is less than (2 x number of CPUs).”

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:6108562636780

“Are you at 100% utilization?  If not, you haven’t accomplished your job yet.  You cannot put CPU in the bank and save it for later.  So, if you are running with idle cycles you should be looking for more ways to use it.”

dba-oracle.com/t_high_cpu.htm

“Remember, all virtual memory servers are designed to drive CPU to 100%, and 100% CPU utilization is optimal, that’s how the server SMP architecture is designed.   You only have CPU enqueues when there are more tasks waiting for CPU, than you have CPU’s (your cpu_count)… Remember, it is not a cause for concern when the user + system CPU values approach 100 percent. This just means that the CPUs are working to their full potential.”

dba-oracle.com/t_tuning_cpu_usage_vmstat.htm

“Remember, it is not a cause for concern when the user + system CPU values approach 100 percent. This just means that the CPUs are working to their full potential. The only metric that identifies a CPU bottleneck is when the run queue (r value) exceeds the number of CPUs on the server.”

dba-oracle.com/t_monitor_cpu_unix.htm

“Within UNIX, the OS is geared to drive CPU consumption to 100%, so the best way to monitor CPU usage is by tracking the ‘r’ column in vmstat”

dba-oracle.com/t_detecting_windows_cpu_processor_bottlenecks.htm

“100% utilization DOES NOT always indicate any bottleneck. It just means that the CPU is busy!  You ONLY have a CPU bottleneck when the runqueue exceeds cpu_count.”

fast-track.cc/op_unix_79_identifying_high_cpu.htm

“Please note that it is not uncommon to see the CPU approach 100 percent even when the server is not overwhelmed with work. This is because the UNIX internal dispatchers will always attempt to keep the CPUs as busy as possible. This maximizes task throughput, but it can be misleading for a neophyte.”

dbaforums.org/oracle/index.php?showtopic=5552

“It’s normal for virtual memory systems to drive the CPU to 100%.

What you need to look for are CPU runqueues, not 100% values”

dbaforums.org/oracle/index.php?showtopic=9986

“No problem! Processors are designed to drive themselves up to 100%.

You are only CPU-bound when the runqueue exceeds the number of processors”

http://forums.oracle.com/forums/thread.jspa?messageID=2518290

“100% utilization is the optimal state. If you want to look for CPU bottlenecks, use vmstat and check the “r” (runqueue) column…  It’s not a claim, it’s a fact, according to the folks who built their servers!  The vendors who build the servers say that 100% CPU utilization is optimal, and they wrote both the OS and the underlying hardware… Every 1st year DBA noob panics at some point when they go into top and see that the CPU is at pegged at 100%.”

http://forums.oracle.com/forums/message.jspa?messageID=2501989

“All SMP architectures are designed to throttle-up the CPU quickly, and a 100% utilization DOES NOT mean an overload. It’s straight from Algorithms 101…  Just to make sure that you are not operating under “assumptions” here, I’m talking about server-side CPU consumption, on an SMP server running lots of concurrent tasks. The references to 100% CPU are as they display in standard OS monitors like lparstat, watch, sar and vmstat.  Also, don’t assume that all OS tasks have the same dispatching priority. In a server-side 100% CPU situation, some tasks may have enqueues, while other do not. That’s what ‘nice’ is for.”

http://books.google.com/books?id=cHHMDgKDXtIC&pg=PA112

“Remember, it is not a cause for concern when the user + system CPU values approach 100 percent.  This just means that the CPUs are working to their full potential. The only metric that identifies a CPU bottleneck is when the run queue (r value) exceeds the number of CPUs on the sever.”

——————–

Before deciding that 100% CPU utilization is not only normal, but something we should all try to achieve, visit the following links and spend a little time reading the text near the quoted section of each document.

——————–

“Optimizing Oracle Performance” page 264, by Cary Millsap:

“On batch-only application systems, CPU utilization of less than 100% is bad if there is work waiting in the job queue. The goal of a batch-only system user is maximized throughput. If there is work waiting, then every second of CPU capacity left idle is a second of CPU capacity gone that can never be reclaimed. But be careful: pegging CPU utilization at 100% over long periods often causes OS scheduler thrashing, which can reduce throughput. On interactive-only systems, CPU utilization that stays to the right of the knee over long periods is bad. The goal of an interactive-only system user is minimized response time. When CPU utilization exceeds the knee in the response time curve, response time fluctuations become unbearable.”

“Forecasting Oracle Performance” page 71 by Craig Shallahamer:

 “With the CPU subsystem shown in Figure 3-7, queuing does not set in (that is response time does not significantly change) until utilization is around 80% (150% workload increase). The CPU queue time is virtually zero and then skyrockets because there are 32 CPUs. If the system had fewer CPUs, the slope, while still steep, would have been more gradual.”

“Forecasting Oracle Performance” page 195 by Craig Shallahamer:

“The high-risk solution would need to contain at least 22 CPUs. Because the reference ratios came from a 20 CPU machine, scalability is not significant. However, recommending a solution at 75% utilization is significant and probably reckless. At 75% utilization, the arrival rate is already well into the elbow of the curve. It would be extremely rare to recommend a solution at 75% utilization.”

http://forums.oracle.com/forums/thread.jspa?messageID=2518290

“First: check the following simple example of how wrong you can be in saying {‘using’ all of your CPU is a good thing} especially in a multi-user, shared memory environment such as an active Oracle instance. You see, although ‘using’ all of your CPU may be desirable if you don’t waste any of it, in a multi-user system you can waste a lot of CPU very easily – even when nobody goes off the run queue.”

http://forums.oracle.com/forums/message.jspa?messageID=2501989

“But it’s not ‘normal’ to drive CPUs to 100%. Except for extremely exotic circumstances (and that excludes database processing) it means you’ve overloading the system and wasting resources…  Consider the simple case of 8 queries running on 8 CPUs. They will be competing for the same cache buffers chains latches – which means that seven processes could be spinning on the same latch while the eighth is holding it. None of the processes ever need wait, but most of them could be wasting CPU most of the time.”

http://www.dell.com/downloads/global/solutions/public/White_Papers/hied_blackboard_whitepaper.pdf Page 19

“One of the sizing concepts that is independent of the server model is resource utilization. It is never a good idea to attempt to achieve 100% resource utilization. In the Blackboard benchmark tests, the optimum Application Server CPU Utilization was 75% to 90%. In general, clients should size all Application Servers to achieve no more than 75% CPU utilization. For database servers, the optimum CPU utilization is 80% in non-RAC mode. In RAC mode, clients should consider CPU utilization rates around 65% at peak usage periods to allow reserve capacity in case of cluster node failover.”

http://www-03.ibm.com/support/techdocs/atsmastr.nsf/5cb5ed706d254a8186256c71006d2e0a/546c74feec117c118625718400173a3e/$FILE/RDB-DesignAndTuning.doc

“The CPU utilization goal should be about 70 to 80% of the total CPU time. Lower utilization means that the CPU can cope better with peak workloads.  Workloads between 85% to 90% result in queuing delays for CPU resources, which affect response times. CPU utilization above 90% usually results in unacceptable response times.  While running batch jobs, backups, or loading large amounts of data, the CPU may be driven to high percentages, such as to 80 to 100%, to maximize throughput.”

11g R2 Performance Tuning Guide:

“Workload is an important factor when evaluating your system’s level of CPU utilization. During peak workload hours, 90% CPU utilization with 10% idle and waiting time can be acceptable. Even 30% utilization at a time of low workload can be understandable. However, if your system shows high utilization at normal workload, then there is no room for a peak workload.”

Oracle9i Application Server Oracle HTTP Server powered by Apache Performance Guide

“In addition to the minimum installation recommendations, your hardware resources need to be adequate for the requirements of your specific applications. To avoid hardware-related performance bottlenecks, each hardware component should operate at no more than 80% of capacity.”

Relational Database Design and Performance Tuning for DB2 Database Servers

Page 26: “When CPU utilization rises above 80%, the system overhead increases significantly to handle other tasks. The lifespan of each child process is longer and, as a result, the memory usage supporting those active concurrent processes increases significantly. At stable load, 10% login, and CPU utilization below 80%, the memory usage formula is as follows…”

Page 27: “When system load generates a high CPU utilization (>90%) some of the constituent processes do not have enough CPU resource to complete within a certain time and remain ‘active’.”

Oracle Database High Availability Best Practices 10g Release 2

“If you are experiencing high load (excessive CPU utilization of over 90%, paging and swapping), then you need to tune the system before proceeding with Data Guard. Use the V$OSSTAT or V$SYSMETRIC_HISTORY view to monitor system usage statistics from the operating system.”

“Optimizing Oracle Performance” page 317

“Oracle’s log file sync wait event is one of the first events to show increased latencies due to the time a process spends waiting in a CPU run queue.”

Metalink Doc ID 148176.1 “Diagnosing hardware configuration induced performance problems”

“In general your utilization on anything should never be over 75-80%…”

http://www.tomfarwellconsulting.com/Queuing%20Presentation.pdf

“Linear thinking is a common human process.  This notion implies that if an input increases by 20 percent the output of the system changes by 20 percent.

Computer response does not follow a linear curve. It is entirely possible that a change in computer input by 20 percent results in a change in output of hundreds of percent.”

http://www.mcguireconsulting.com/newsl_queuing.html

“This utilization-based multiplier increases exponentially and does so rapidly after the 50% utilization point, as shown in the graph below. The translation is: if a resource’s utilization is much beyond 50%, there is a higher probability that congestion will occur. Keep in mind that at 100% utilization, the delay goes to infinity, which is the direction of these curves.”

http://www.db2-dba.net/articles/Article-Usage%20Factor.html

“Request Service time = ( Ideal Request Service time x Usage factor ) / ( 1 – Usage factor )   where  0 <= Usage factor <= 1
The  Request Service time is proportional to U/(1-U).

As you can see from the simple plot above as U reaches 0.95  you are fast approaching the meltdown point.
As U gets closer to .95  the Service time of the system reacts violently and starts approaching infinity. 
The ‘system’  might be your  CPU , DISK, Network, employee or your motor car. 

It is just a bad idea to push the average resource utilization factor beyond 0.9, and the peak resource utilization factor beyond 0.95.”

http://databaseperformance.blogspot.com/2009/01/queuing-theory-resource-utilisation.html

“So an Oracle database server does conform to the general model in Queuing Theory of having lots of separate clients (the shadow servers) making requests on the resources in the system. And as a result, it does conform to the golden rule of high resource utilisation equals queues of pending requests.

As a result, 100% CPU utilization is very bad, and is symptomatic of very large queues of waiting processes. Queuing Theory also shows that above 50% utilization of a resource, there is always a request in the queue more often than not…  A general rule of thumb is to get worried at 80% utilization, as the number of concurrent requests will average something around four, and rises exponentially above this.”
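The arithmetic behind those last two statements is easy to check for the single-server (M/M/1) case, where the average number of requests in the system is U/(1-U) and the response time inflates by 1/(1-U); with many CPUs the curve stays flatter longer, which is why the 32-CPU example quoted earlier does not hit the knee until around 80%.  A quick table of points along the single-server curve:

```python
# Average requests in the system (L = U / (1 - U)) and the response time
# multiplier (1 / (1 - U)) for a single-server M/M/1 queue.
for u in (0.50, 0.75, 0.80, 0.90, 0.95, 0.99):
    print(f"U = {u:.0%}   avg requests in system = {u / (1 - u):5.1f}   "
          f"response time multiplier = {1 / (1 - u):5.1f}x")
```

At 80% utilization there are about four requests in the system on average (the rule of thumb quoted above), and by 95% the response time multiplier has reached 20x – the curve is anything but linear.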

http://kevinclosson.wordpress.com/2007/07/21/manly-men-only-use-solid-state-disk-for-redo-logging-lgwr-io-is-simple-but-not-lgwr-processing/

“Once LGWR loses his CPU it may be quite some time until he gets it back. For instance, if LGWR is preempted in the middle of trying to perform a redo buffer flush, there may be several time slices of execution for other processes before LGWR gets back on CPU…” Fix the CPU problem, and the other significant waits may decrease.

Newsflash – 100% CPU is Worse than You Think!

“Amazing. During my entire discussion of CPU load and process priorities I completely ignored the fact that I’m using 2 dual core cpus on that system, and that all Oracle processes use shared memory, which means shared resource, which means locks, which means resource wasting by waiting for locks. And this complicated the discussion, because 6 processes on 8 CPUs will also waste time waiting for locks. You don’t need 100% CPU to suffer from this.”

Opinions are bound to change over time.  The first two quotes are from OakTable Network members, and those quotes were originally written eight or nine years ago.  If you were to ask those two people today (or even shortly after the release of the “Optimizing Oracle Performance” book in 2003), they might state something a bit different about driving and then holding CPUs at 100% utilization.  Interestingly, holding a CPU at 100% utilization will cause its core temperature to gradually increase.  Logic in some CPUs will sense this increase in temperature and throttle back the speed of the CPU until the core temperature drops to acceptable levels.  Of course, when the CPU speed is throttled back, that potentially causes the CPU run queue to increase in length because the amount of work required by each process has not decreased, while the number of CPU cycles per second available for performing work has decreased.

More could be written on this subject, but I will leave it at this point for now (for instance, I could have mentioned process/thread priority gradually decreasing on some operating systems for those processes consuming a lot of CPU time).  Opinions, other than this article being too short (or too long)?  As you can probably tell from the quotes, CPU utilization issues are not just a special situation on a single operating system, or a special situation with a certain program (Oracle Database).


15 responses

5 02 2010
Chen Shapira

I still think that often administrators treat 100% CPU as the problem instead of a potential cause or a symptom.

The problem is, as usual, wait time. 100% CPU can cause waits for CPU (when run queue > 2) or waits for other resources.

If you define the problem as 100% CPU, you may be tempted to solve it by adding more CPU, which will not necessarily solve the real problem – the waits.

If you define the problem in terms of waits, you can arrive at the more difficult but usually better solution – find a way to do less work so you’ll use less CPU (and usually less other resources as well).

I was once asked to solve a 100% CPU problem using “nice”. The only way to do that is to prevent a process from running while the CPU is idle for part of the time. This will solve the 100% CPU problem while increasing wait times. Wrong problem definition led to a solution that actually makes the real issue worse.

6 02 2010
Charles Hooper

Chen, thanks for the comments – the points that you make are very important.

Just for clarification, in your comment you stated:

“The problem is, as usual, wait time. 100% CPU can cause waits for CPU”

I think that it might be important to point out that this “wait time” is from the end-user’s perspective (my SQL statement is taking longer than normal to execute), and not from the perspective of the database instance’s session (I am apparently active on the CPU, and therefore not in a “wait event”). This clarification, I believe, also applies to your comment about using “nice” (after the fix, my SQL statement is taking twice as long to execute, but at least the server’s CPUs are only 98% utilized).

The book “Optimizing Oracle Performance” (written by Cary Millsap) describes how adding more CPUs to a CPU constrained server could actually make a performance problem worse. Chapter 3 of “Expert Oracle Practices” (chapter written by Connie Green, Graham Wood, and Uri Shaft) also mentions how removing the CPU as the bottleneck could cause an even worse bottleneck to appear somewhere else in the system that then causes the end-user’s critical SQL statement to take even longer to execute. I believe that your comments touched on this point.

9 02 2010
Cary Millsap

Charles,

Your parenthetical “(I am apparently active on the CPU, and therefore not in a ‘wait event’)” is not quite right. When an Oracle process is Preempted, Ready to Run In Memory, Asleep in Memory, Asleep and Swapped, or Ready to Run but Swapped, it is not “apparently active on the CPU.” That is, the process won’t be accumulating CPU time (in either user or sys mode), and it won’t be accumulating time in any so-called Oracle WAIT.

In the extended SQL trace data for such a process, you can see clearly what goes on in such a circumstance. The clock (tim value) will move without any corresponding buildup in CPU time (c value on a PARSE, EXEC, or FETCH line) or syscall response time (ela value on a WAIT line). In the book and in our software tools, we call this unaccounted-for time, because it is time that is measured to have elapsed but without an explanation for how it was consumed. You can see examples of this phenomenon on the web page for our software tool called mrnl.

Lots of unaccounted-for time is the signature of a process that needed a CPU resource that it wasn’t able to obtain as early as it wanted. One common cure is to eliminate the wasteful overuse of the CPU resource by some other process on the system.

Your post reminds me of two ideas that I think are worth highlighting. (1) The term “wait event” so badly confuses so many conversations. In Oracle, the term WAIT (uppercase) means specifically “response time for operating system function call (usually exactly one, but sometimes two).” Of course, it’s possible to “wait” for a CPU (as a process does when it’s in the Preempted or Ready to Run In Memory state), but that’s a natural language “wait.” It’s not what an Oracle WAIT means.

And (2), people still miss the point that the maximum utilization you should sustain for a resource on a system with random arrivals depends upon the number of service channels (in this conversation, a service channel is a CPU) in your system. The utilization “wall” for a 2-CPU system is less than or equal to 57%, whereas the “wall” for a 16-CPU system is less than or equal to 81%, for example. I’m hopeful that the new “Thinking Clearly” paper that I wrote for RMOUG will help clear up that issue for more people.

9 02 2010
Charles Hooper

Cary,

Thanks for stopping by my blog and leaving such a fantastic follow-up comment with very helpful links. For those reading this blog article, I recommend taking a look at Cary’s “Thinking Clearly”/“Thinking Clearly About Performance” article that he linked to in his comment. I have only read a couple of pages of that article so far, but looking ahead a couple of pages I see a discussion of queuing theory.

My quoted comment was intended to draw a distinction between what the end-user views as a “wait” and what the Oracle session accumulates as a “wait”. (Of course there is also a third potential meaning for wait when talking about a CPU – waiting in the run queue.) In short, my wording in that comment was sloppy. The words “apparently active on the CPU” were intended to include all of the process states that you included in your comment (I think that I would have had trouble listing off all of the different states you supplied) and therefore would not be accumulating time in an Oracle wait event – and I did not make a distinction of which of those would actually increment the CPU time statistic.

In light of your detailed comment, I agree that my statement is not quite right. Unfortunately, my quoted comment could be read in one of several ways.

By the way, thanks for the detailed explanation of how the unaccounted-for time might be used as an indicator of competition for the CPU.

9 02 2010
Doug's Oracle Blog

Thinking about CPU…

Charles Hooper has written a number of impressive blog posts in a fairly short space of time (the man is a blogging machine!) but I really wanted to draw attention to one in particular. Fault Quotes 6 – CPU Utilization. Why? Well I seem to be spending mor…

10 02 2010
Rodger Lepinsky

Hello Oracle experts,

This and the other blog about the nice command reminded me of some experiences I had a few years ago. My comments started to be long, so I made a post here:

http://rodgersnotes.blogspot.com/2010/02/oracle-database-tuning-and-being-nice.html

Hope it’s useful.

15 02 2010
Thinking about CPU | Oracle

[…] Charles Hooper has written a number of impressive blog posts in a fairly short space of time (the man is a blogging machine!) but I really wanted to draw attention to one in particular. Fault Quotes 6 – CPU Utilization […]

5 03 2010
Blogroll Report 29/01/2009 – 05/02/2010 « Coskan’s Approach to Oracle

[…] and Where Do I Find It 3? 20-Faulty quotes about %100 CPU utilization ? (comments) Charles Hooper-Faulty Quotes 6 – CPU Utilization 21-SQL Tuning Advisor generated SQL Profiles and manual sql profile (comments) Kerry Osborne-Single […]

14 06 2010
CPU Utilization « Steve Harville's Blog

[…] By steveharville Here’s a good blog post about cpu utilization. […]

17 09 2010
Technical Review of MOTS « Charles Hooper's Oracle Notes

[…] Waits, and Commit Performance by Tanel Poder.  You might be curious about what happens when your CPUs are pushed toward 100% utilization.  You might be curious why placing your redo logs on SSD drives may not help.  You might be […]

28 11 2010
Book Review: Oracle Tuning: The Definitive Reference Second Edition « Charles Hooper's Oracle Notes

[…] Page 325 states, “The reason that CPU drives to 10% utilization is because the UNIX internal dispatchers will always attempt to keep the CPU’s as busy as possible. This maximizes task throughput, but it can be misleading for a neophyte. Remember, it is not a cause for concern when the user + system CPU values approach 100 percent.”  Why 10%?  Page 325 of the book makes the same errors as are found on page 25 of the book (reference). […]

15 05 2012
Yasir

Hi,
Interpreting CPU usage from AWR is tricky, I find. Can you tell me, for a 2 node RAC with 23 dual core CPUs, which makes cpu_count=46, what total CPU should we consider: 23 or 46 for each node?
If I want to calculate CPU usage in a single node, what metric should I take for total CPU; 23 or 46?
cpu_count=46
parallel threads per execution=2.
Thanks

16 05 2012
Charles Hooper

Yasir,

I think that this might be your OTN thread: https://forums.oracle.com/forums/thread.jspa?threadID=2389356

I am not sure how much help I will be able to provide to you. In general, assuming that Oracle is picking up the CPU_COUNT correctly, the CPU_COUNT should take into account any multi-threading/hyperthreading and multiple cores per CPU. Bugs do occasionally happen:
https://supporthtml.oracle.com/epmos/faces/ui/km/DocumentDisplay.jspx?_afrLoop=2820473851313000&id=738114.1 (requires a MOS account)

If the server does in fact have 23 occupied CPU sockets, with each socket containing a dual core CPU that does not support multi-threading/hyperthreading, then the server effectively has 46 seconds of CPU capacity per elapsed second – 165,600 CPU seconds per hour. Keep in mind that just because there are 165,600 CPU seconds of potentially available CPU time per hour, that does not mean that a single session has at its disposal 165,600 CPU seconds per hour. The processes for one or more sessions could still be CPU bound even if the server is not reporting anywhere near 100% CPU utilization.
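To put that arithmetic in one place, here is a back-of-the-envelope sketch; the snapshot interval and the “DB CPU” figure are made-up placeholders – substitute the values from your own AWR report:

```python
# Back-of-the-envelope CPU capacity check for one AWR snapshot interval.
cpu_count        = 46        # effective CPUs seen by the instance (from the question above)
interval_minutes = 60        # AWR snapshot interval (assumed)
db_cpu_seconds   = 38_500    # "DB CPU" from the Time Model statistics (made-up placeholder)

available_cpu_seconds = cpu_count * interval_minutes * 60
print(f"Available CPU seconds in the interval: {available_cpu_seconds:,}")      # 165,600
print(f"Database CPU consumption: {db_cpu_seconds / available_cpu_seconds:.1%} of capacity")
```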

The PARALLEL_THREADS_PER_CPU parameter is not exactly related to available CPU capacity, but is used in various formulas related to parallel processing:
http://docs.oracle.com/cd/E18283_01/server.112/e17110/initparams187.htm
http://docs.oracle.com/cd/E18283_01/server.112/e16541/parallel004.htm
http://books.google.com/books?id=b3DIkYO2gBQC&pg=PA497

There are AWR scripts that are specific to RAC environments, see the following articles:
http://jonathanlewis.wordpress.com/2011/02/23/awr-reports/
http://orainternals.wordpress.com/2009/12/23/rac-performance-tuning-understanding-global-cache-performance/

I am certain that other readers would be happy to provide additional assistance to you.

28 10 2013
Is CPU usage 100% really okay? | Oracle Diagnostician

[…]  After I already finished the first draft, I stumbled upon an excellent post by Charles Hooper on the same subject, with far more quotes and references, and a more technical […]

2 02 2015
Is CPU usage 100% really okay? | Oracle Diagnostician's scripts and stuff

[…]  After I already finished the first draft, I stumbled upon an excellent post by Charles Hooper on the same subject, with far more quotes and references, and a more technical […]
