CPU Run Queue – What is Wrong with this Quote?

June 14, 2010

I found another couple of interesting passages in the June 2010 printing of the “Oracle Performance Firefighting” book.  This quote is from page 116:

“While it may seem strange, the run queue reported by the operating system includes processes waiting to be serviced by a CPU as well as processes currently being serviced by a CPU. This is why you may have heard it’s OK to have the run queue up to the number of CPUs or CPU cores.”

Also, this quote from page 123:

“Have you ever heard someone say, ‘Our CPUs are not balanced. We’ve got to get that fixed.’? You might have heard this if you’re an operating system kernel developer or work for Intel, but not as an Oracle DBA. This is because there is a single CPU run queue, and any available core can service the next transaction.”

Also, this quote from page 136:

“The other area of misrepresentation has to do with time in the CPU run queue. When Oracle reports that a process has consumed 10 ms of CPU time, Oracle does not know if the process actually consumed 10 ms of CPU time or if the process first waited in the CPU run queue for 5 ms and then received 5 ms of CPU time.”

Interesting… Regarding the first quote: most of what I have read about the CPU run queue indicates that a process is removed from the run queue while it is running on a CPU, and re-inserted into the run queue when it stops executing on the CPU (assuming that the process has not terminated and is not suspended).  The “Oracle Performance Firefighting” book does not include a test case to demonstrate that the above is true, so I put together a test case using the CPU load generators on page 197 of the “Expert Oracle Practices” book, the Linux sar command, and a slightly modified version (set to refresh every 30 seconds rather than every second) of the WMI script on pages 198-200 of the same book.

For the test, I will use the following command on Linux, which reports the run queue length and load averages ten times at 30-second intervals:

sar -q 30 10

Immediately after the above command is started, a copy of the Linux version of the CPU load generator will be run (the load generator runs for 10 minutes and then exits):

#!/bin/bash
# CPU load generator: keep one CPU busy for 600 seconds (10 minutes)
i=0
STime=`date +%s`

# Loop until 600 seconds have elapsed; the date calls and the bc
# arithmetic keep the CPU busy on every pass through the loop
while [ `date +%s` -lt $(($STime+600)) ]; do
  i=`echo "$i + 0.000001" | bc`
done

Every time a new line is written by the sar utility, another copy of the CPU load generator is started.  For the first test run, I manually launched a new command line window from the GUI and then started the script.  For the second test run, I first opened as many command line windows as necessary and prepared each to execute the script.  Here is the output (the runq-sz column shows the run queue length):

[root@airforce-5 /]# sar -q 30 10
Linux 2.6.18-128.el5 (airforce-5.test.com)      06/13/2010

05:39:16 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
05:39:46 PM         1       228      0.76      0.37      0.16
05:40:16 PM         2       230      1.04      0.48      0.21
05:40:46 PM         3       232      1.31      0.59      0.25
05:41:16 PM         4       233      1.86      0.79      0.33
05:41:46 PM         6       237      2.81      1.11      0.45
05:42:16 PM         7       241      3.71      1.48      0.59
05:42:46 PM         9       244      5.56      2.14      0.84
05:43:16 PM        12       247      7.86      3.00      1.16
05:43:46 PM        16       250     10.29      4.04      1.56
05:44:16 PM        14       250     12.03      5.06      1.98
Average:            7       239      4.72      1.91      0.75

[root@airforce-5 /]# sar -q 30 10
Linux 2.6.18-128.el5 (airforce-5.test.com)      06/13/2010

05:50:53 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
05:51:23 PM         1       237      0.54      3.41      2.76
05:51:53 PM         3       239      1.44      3.35      2.76
05:52:23 PM         3       241      2.20      3.35      2.78
05:52:53 PM         5       242      3.31      3.51      2.85
05:53:23 PM         7       245      4.53      3.78      2.96
05:53:53 PM        10       247      6.40      4.31      3.16
05:54:23 PM        13       249      8.60      5.03      3.43
05:54:53 PM        13       249     10.71      5.87      3.76
05:55:23 PM        16       253     12.16      6.68      4.10
05:55:53 PM        14       251     13.02      7.41      4.42
Average:            8       245      6.29      4.67      3.30
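
As a cross-check of what the runq-sz column counts, the runnable processes can also be observed directly (a quick sketch of my own, not from either book; on Linux the R process state includes both processes currently executing on a CPU and processes waiting for a CPU):

# Count the processes currently in the runnable (R) state
ps -eo state,pid,comm | awk '$1 == "R" {count++} END {print "runnable:", count}'

# The fourth field of /proc/loadavg (for example 5/240) shows the
# number of currently runnable entities over the total number
cat /proc/loadavg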

For the sake of comparison, here is a CPU load generator script for Windows (VBScript) that performs the same operation as the script executed on Linux:

' CPU load generator: keep one CPU busy for 10 minutes
Dim i
Dim dteStartTime

dteStartTime = Now

' DateDiff with "n" returns the elapsed time in minutes
Do While DateDiff("n", dteStartTime, Now) < 10
  i = i + 0.000001
Loop
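
A script like this can be saved with a .vbs extension and launched from a command prompt with cscript (for example, cscript //nologo cpuload.vbs, where cpuload.vbs is just a hypothetical file name).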

Now let’s repeat the test on Windows, using the WMI script in place of the Linux sar command: the WMI script is started, then the CPU load generator script, and every time the WMI script outputs a line another copy of the load generator script is started.  Here is the output from the WMI script (the Q. Length column shows the run queue length):

6/13/2010 6:29:27 PM Processes: 53 Threads: 825 C. Switches: 1757840 Q. Length: 0
6/13/2010 6:29:57 PM Processes: 54 Threads: 826 C. Switches: 32912 Q. Length: 0
6/13/2010 6:30:27 PM Processes: 56 Threads: 831 C. Switches: 71766 Q. Length: 0
6/13/2010 6:30:57 PM Processes: 58 Threads: 836 C. Switches: 39857 Q. Length: 0
6/13/2010 6:31:27 PM Processes: 59 Threads: 830 C. Switches: 33946 Q. Length: 0
6/13/2010 6:31:57 PM Processes: 59 Threads: 821 C. Switches: 27955 Q. Length: 1
6/13/2010 6:32:27 PM Processes: 61 Threads: 830 C. Switches: 32088 Q. Length: 0
6/13/2010 6:32:57 PM Processes: 63 Threads: 826 C. Switches: 27027 Q. Length: 0
6/13/2010 6:33:29 PM Processes: 64 Threads: 827 C. Switches: 22910 Q. Length: 3
6/13/2010 6:34:01 PM Processes: 66 Threads: 836 C. Switches: 22936 Q. Length: 4
6/13/2010 6:34:34 PM Processes: 68 Threads: 839 C. Switches: 34076 Q. Length: 5
6/13/2010 6:35:07 PM Processes: 70 Threads: 840 C. Switches: 25564 Q. Length: 8
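
For readers without the “Expert Oracle Practices” book, here is a minimal sketch of the kind of WMI query the script performs (a simplified version of my own, assuming the Win32_PerfFormattedData_PerfOS_System performance class; the book’s script contains additional logic):

' Minimal WMI processor queue sampler - a sketch, not the book's script
Set objWMI = GetObject("winmgmts:\\.\root\cimv2")

Do
  Set colItems = objWMI.ExecQuery( _
    "SELECT * FROM Win32_PerfFormattedData_PerfOS_System")
  For Each objItem In colItems
    ' On Windows, ProcessorQueueLength counts threads that are ready
    ' to run but not currently executing on a CPU
    WScript.Echo Now & " Processes: " & objItem.Processes & _
      " Threads: " & objItem.Threads & _
      " C. Switches: " & objItem.ContextSwitchesPersec & _
      " Q. Length: " & objItem.ProcessorQueueLength
  Next
  WScript.Sleep 30000   ' sample every 30 seconds
Loop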

What, if anything, is wrong with the above quotes from the book?  The comments below might be helpful.

The point of blog articles like this one is not to insult authors who have spent thousands of hours carefully constructing an accurate and helpful book, but to suggest that readers investigate whenever something stated does not exactly match what they believe to be true.  It could be that the author “took a long walk off a short pier”, or that the author is revealing accurate information that simply cannot be found through other resources (and may in the process be directly contradicting information sources you have used in the past).  If you do not investigate in such cases, you may lose an important opportunity to learn something that could prove to be extremely valuable.


17 responses

14 06 2010
Noons

One of the possible causes of differences is that each *n*x system handles the run queue counter differently. The behavior is not covered by any standard definition, so it’s up to each manufacturer to give it their own “spin”. The result is exactly what you point out: the need to test these assertions on each system.
Some OSs do indeed have more than one CPU queue. And the load generated by a pure CPU loop like the one above is quite different from the load generated by a process that performs I/O and naturally goes into a wait queue, then gets sent back to the CPU wait queue after the I/O completion is posted.
Thought-provoking post, Charles: thanks for that.

14 06 2010
Charles Hooper

Noons,

Thank you for the comments. My Linux test indicates that I was using the 2.6.18-128.el5 kernel, and as I understand it, the 2.6 series of the Linux kernel has a separate run queue for each CPU.

Here is a related quote from page 136 of the “Oracle Performance Firefighting” book:

“For example, if you ran a response time-focused report over a duration of 60 seconds, and if the single-CPU core subsystem were heavily overworked, Oracle could report that Oracle processes consumed 65 seconds of CPU time. Yet, we clearly know the CPU subsystem is limited to supplying up to 60 seconds of CPU time and no more.”

Would anyone like to set up a test case to see whether it is possible for a single CPU system (without multiple cores/threads) to post 65 seconds of CPU time in a 60 second time period? I think that one of my blog articles showed that on the Windows platform the CPU time statistics continue to accumulate when a process is forced off the CPU and back into the run queue – at the moment I cannot recall which blog article.
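
A rough outline of such a test on Linux might look like this (a quick sketch of my own, not a polished test case):

#!/bin/bash
# Saturate the CPU with background busy loops, then compare the CPU
# time charged to one more busy loop against the wall-clock time
PIDS=""
for n in 1 2 3 4; do
  ( while :; do :; done ) &
  PIDS="$PIDS $!"
done

# time reports real (wall-clock), user, and sys time; on a saturated
# single CPU, real should greatly exceed user+sys if time spent in
# the run queue is not counted as CPU time
time ( i=0; while [ $i -lt 1000000 ]; do i=$((i+1)); done )

kill $PIDS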

15 06 2010
Charles Hooper

I think that this is the blog article with the test script which investigates whether time spent in the run queue accumulates time in the CPU used statistic (see the summary at the bottom of the article):

http://hoopercharles.wordpress.com/2010/03/17/what-is-the-impact-on-the-cpu-statistic-in-a-10046-trace-file-in-a-cpu-constrained-server/

This one might also apply:

http://hoopercharles.wordpress.com/2010/01/09/drilling-into-session-detail-from-the-operating-system-on-the-windows-platform-2/

16 06 2010
karlarao

Hi Charles,

Yes it is possible.. I have a test environment with just one CPU.. the snap interval is 10 minutes, so that would be a total CPU time of 600 seconds.. then as I was mining the AWR, I saw two instances where Oracle CPU had 845 and 706 seconds.. Also, as I have observed, OS CPU could also exceed 600 seconds..
But definitely when you reach this point, the server is already loaded..

You can check the report here.. http://goo.gl/3wPY

– Karl Arao

16 06 2010
Charles Hooper

Karl,

Thank you for posting an AWR report. Very interesting.

Out of curiosity, what kind (manufacturer and model) of CPU is in that server?

This line in the AWR report is interesting:

Parse CPU to Parse Elapsd %:  122.22

I think that it is normal for that percentage (100 * parse CPU time / parse elapsed time) to be less than 100, since the elapsed time of a parse should be at least as long as the CPU time the parse consumed.

Also interesting is this from the OS Statistics:

Statistic                                 Total
-------------------------------- --------------
BUSY_TIME                                46,115
IDLE_TIME                                13,989

That suggests that the server’s CPU was about 76.7% busy (46,115 / (46,115 + 13,989) ≈ 0.767). The OS statistics also show that the system (kernel) mode time exceeds the user mode time.

It does not appear to be a rounding problem, but I wonder if there is a double-counting problem caused by PL/SQL calling SQL:

    CPU      Elapsed                 CPU per  % Total
  Time (s)   Time (s)  Executions   Exec (s)  DB Time  SQL Id
---------- ---------- ----------- ---------- -------- -------------
       416        438           4     104.08     49.9 6gvch1xu9ca3g
DECLARE job BINARY_INTEGER := :job; next_date DATE := :mydate; broken BOOLEAN :
= FALSE; BEGIN EMD_MAINTENANCE.EXECUTE_EM_DBMS_JOB_PROCS(); :mydate := next_date
; IF broken THEN :b := 1; ELSE :b := 0; END IF; END;

       415        437           1     415.14     49.8 2zwjrv2186835
Module: Oracle Enterprise Manager.rollup
DELETE FROM MGMT_METRICS_RAW WHERE ROWID = :B1

The 416 seconds of consumed CPU seems to be close to what is reported by the operating system statistics (BUSY_TIME is reported in hundredths of a second, so 46,115 is roughly 461 seconds).

16 06 2010
karlarao

Hi Charles,

It’s an Intel Core2 Duo CPU T8100.. but Oracle is in a virtualized environment..

I’ve also updated the posts to contain the AWR and ASH reports of the following scenarios exceeding the total CPU time of 600 seconds:
1) Total Oracle CPUsec > CPUsec http://goo.gl/3wPY
– on this one I always see the DECLARE and DELETE operations consuming 320+ CPU seconds each in a SNAP period..

2) Total OS CPUsec > CPUsec http://goo.gl/DJ57
– on this one I noticed there’s a high value for %nice in SAR.. also check the significantly low value of DB Time, but the OS CPU in the three SNAP_IDs is 100% and slightly exceeds the 600 seconds..

– Karl Arao

17 06 2010
Charles Hooper

Karl,

Thank you for the additional information. I located your processor here:

http://ark.intel.com/Product.aspx?id=33916

It is a dual core CPU without hyperthreading support, and also without the automatic over-clocking capability you would find on a Core i7/Core i5 or Xeon 5500/5600/6500/7500 series CPU (I thought there was a small chance that a 60% automatic overclock, as found on the Core i7 920XM, might be a potential source of the problem). What that means is that, technically, that single CPU box is capable of producing 120 CPU seconds of work for every 60 seconds of wall-clock time. There is still the odd discrepancy between the OS Statistics section and the DB CPU statistic in the original report.

I have not yet had an opportunity to look at the new reports. I will try to find time tomorrow.

15 06 2010
kevinclosson

SMPs do not have a single run queue. There are run queues for each CPU whereby processor-affinity (think cache) can be maintained (best effort). If SMPs had a single run queue there would be hell to pay with dequeue thrashing because run queue elements are manipulated with extreme frequency and protected by locks. People tend to think that run queues have only to do with scheduling state changes but even process birth and process death require manipulation of run queue elements.

15 06 2010
Charles Hooper

Kevin,

Thank you for providing a detailed explanation of why a single run queue is not ideal – very helpful description.

15 06 2010
Charles Hooper

If anyone is interested, I bumped into these articles before putting together this blog article – some of you might find the articles to be interesting:
“Inside the Linux Scheduler”

http://www.ibm.com/developerworks/linux/library/l-scheduler/

“The pre-2.6 scheduler also used a single runqueue for all processors in a symmetric multiprocessing system (SMP). This meant a task could be scheduled on any processor — which can be good for load balancing but bad for memory caches. For example, suppose a task executed on CPU-1, and its data was in that processor’s cache. If the task got rescheduled to CPU-2, its data would need to be invalidated in CPU-1 and brought into CPU-2.

The prior scheduler also used a single runqueue lock; so, in an SMP system, the act of choosing a task to execute locked out any other processors from manipulating the runqueues. The result was idle processors awaiting release of the runqueue lock and decreased efficiency.”

“CPU Monitoring and Tuning”

http://www.ibm.com/developerworks/aix/library/au-aix5_cpu/index.html

“For local run queues, the dispatcher picks the best priority thread in the run queue when a CPU is available. When a thread has been running on a CPU, it tends to stay on that CPU’s run queue. If that CPU is busy, then the thread can be dispatched to another idle CPU and assigned to that CPU’s run queue.”

15 06 2010
kevinclosson

Yes, and therein lies the reason 2.4 kernels didn’t scale beyond 4 CPUs for process-heavy loads. I remember the first time I observed a scheduling bug with my own eyes. In the early 90s Sequent SMPs (huge) had LEDs to show CPU activity. I was toiling with a benchmark (can’t remember if it was Oracle or Informix) and happened to notice that my LEDs were all twinkling at *very* low levels of utilization, and at the end of the run my performance was miserable. The workload performed better when I established hard affinity to 1 CPU. The workload fit in 1 CPU worth of bandwidth, but running without my specified affinity the OS scheduler was spreading the workload across all CPUs and thus I had no CPU affinity. I had 30 CPUs. The processor caches were getting trashed. This was a freshly patched system, and I had picked up pre-production bits with a scheduling bug. The build was done without the affinity code enabled, so all CPUs were “happily” running anything they could get their hands on. If your CPUs have cache, you need cache affinity… and, of course, a threshold where the scheduler is willing to run something cache-cold. This extends to NUMA as well. Just another level of affinity.

>The pre-2.6 scheduler also used a single runqueue for all processors in a symmetric multiprocessing system (SMP). This meant a task could be scheduled on any processor

This quote always sounded stupid to me. No matter whether you have a single queue or a per-engine queue, if it is a shared memory SMP any CPU can execute the code. It seems to me like a politically-correct explanation for an implementation trade-off. Think about it: when your core engineering team consists of lots of people with single CPU desk-side servers, you don’t get much focus on SMP.

16 06 2010
Charles Hooper

Kevin,

Thank you for the additional detailed information, and the related history lesson.

One might wonder how the 30 CPU server behavior that you mention above might impact the log file sync quote that was provided in today’s blog article.

16 06 2010
Frits Hoogland

Couldn’t it be that ‘load’ was meant instead of ‘run queue’?

This too is subject to the same confusion, because load is calculated differently on different operating systems.

10 01 2011
Neeraj Bhatia

Hi Charles,
Very interesting thread :-) and especially for me as a Capacity Planner.

So ready-to-run statistics (the r values in the vmstat output) are threads waiting for a CPU, averaged across the number of physical CPUs configured on the system. In a virtualized environment, does it have any relation to the number of virtual CPUs?

Secondly, what would be a good threshold value for “r” from a monitoring perspective? Ideally, for a stable system, it should approach zero. Some texts suggest 1.5 to 2 times the number of CPUs, while others recommend 5X. Here is a link for the latter:

http://www.ibmsystemsmag.com/aix/februarymarch04/features/6670p1.aspx

Thanks,
Neeraj

10 01 2011
Charles Hooper

Hi Neeraj,

Based on the comment left by Noons (and I think that I observed a similar result under Linux), the “r” value means different things on different Unix platforms. In some cases that value includes the count of processes (or threads) currently running on the CPUs as well as the count of processes waiting in the run queue. Someone else reading this blog might be able to provide more information regarding how it applies to your specific version of Unix and how a virtualized environment behaves. If you can find an appropriate article on Kevin Closson’s blog (http://kevinclosson.wordpress.com/ ), you could post your question there.

Keep in mind that an Oracle database used for interactive processing could experience CPU related problems before the “r” value exceeds the number of CPUs in the server (due to latch acquisition requirements for various operations, and in theory all of the processes waiting to run could be in the run queue for a single CPU). I believe that the article stated that in situations where the server is NOT performing interactive processing with end users, but instead executing batch processes, the greatest amount of throughput was achieved when the run queue was 5 times longer than the number of CPUs in the server. The article that you referenced is roughly 7 years old now, and that ideal run queue value may or may not hold true with more modern operating systems, if it was true 7 years ago.

Sorry, I probably did not answer your question. Anyone else able to provide a better answer?

11 01 2011
Neeraj Bhatia

No problem, Charles; your answer is still useful to me.

I have AIX 5.3 and, as per the IBM Redbooks, it has multiple run queues (one per physical processor, I guess, and not per virtual processor), and the “r” statistic in the vmstat output denotes the number of threads waiting for a CPU – in other words, in the ready-to-run state.

Yes, it’s true that from a throughput perspective we can have a 3 to 5 times longer queue, but for performance the threshold should be lower.

Regarding Oracle processes waiting for CPU in a single CPU run queue, aren’t run queues load-balanced? I’ve learned that after a fixed amount of time (~200 ms), the operating system performs load balancing across the multiple queues, so we usually don’t have a contention issue on a single CPU.

Thanks for your reply!

Neeraj

11 01 2011
Charles Hooper

Neeraj,

What you mentioned about a process potentially moving to a different CPU is true (I believe that the Linux 2.6 and above kernels will rebalance every 200 ms; I am not sure about AIX). However, if someone has set the process affinity to essentially prevent a process from moving to another CPU, that could be a valid reason why all of the processes in the run queue may remain in the run queue for a single CPU.

You can see how to set the process affinity on AIX in this article:

http://www.ibm.com/developerworks/aix/library/au-processinfinity.html

And you can see some of the reason for setting the process affinity in this article (written for Linux, but should apply to other operating systems also):

http://www.ibm.com/developerworks/linux/library/l-affinity.html

Kevin Closson has a possibly related blog article here:

http://kevinclosson.wordpress.com/2007/08/10/learn-how-to-obliterate-processor-caches-configure-lots-and-lots-of-dbwr-processes/
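
As a concrete illustration of the affinity setting discussed in those articles, on Linux a process can be pinned to a single CPU with the taskset utility from util-linux (AIX uses bindprocessor instead). A sketch, where 1234 stands in for an actual process ID and cpu_load_generator.sh is a hypothetical script name:

# Pin an existing process to CPU 0; from then on the process should
# only appear in CPU 0's run queue
taskset -pc 0 1234

# Or launch a new command already bound to CPU 0
taskset -c 0 ./cpu_load_generator.sh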
