Unexpected Timer Resolution, Unexpected Parked CPUs, Unexpected Power Consumption

April 19, 2013 (Modified May 11, 2013, June 5, 2014)

This blog article is not purely Oracle Database specific, yet it may be relevant to companies that run Oracle Database on the Windows Server platform (DBAs lucky/unlucky enough to do so may find this article interesting).

I am in the process of setting up a couple of new Windows servers to perform various non-Oracle Database tasks.  I noticed that one of the servers had an odd issue – the server would occasionally become very slow at responding to mouse movements and keyboard input, for instance taking 30 seconds to move the mouse pointer a short distance across the screen.  These servers are running Windows Server 2012, which shares the same kernel and includes much the same features as Windows 8 – with the exception that the server operating system opens to the desktop rather than Windows 8’s new start screen.

Two years ago I wrote a brain teaser article that asked how it was possible that a 10046 extended SQL trace could output c=15600,e=510 on a line of the trace file when executing a SQL statement without using parallel query – essentially asking how it was possible to consume 0.015600 seconds of CPU time in 0.000510 seconds of elapsed time when the SQL statement was restricted to running on no more than one CPU.  In the comments section of the article I mentioned the ClockRes utility, but did not provide a link for the download of the program.  So, I thought that I would run the ClockRes utility on one of the new servers, make a change to the server, and then run the ClockRes utility again:

UnexpectedClockResOutput

As can be seen above, on the first execution of ClockRes the Current timer interval was 1.001 ms, while on the second execution of the ClockRes program the Current timer interval was 15.626 ms.  There is an odd similarity between that 15.626ms time (which oddly exceeds the reported Maximum timer interval of 15.625ms) and the c=15600 reported in the Oracle 10046 extended SQL trace file.  So, what change did I make to the server between the first execution of ClockRes utility and the second execution?  For now I will just say that I stopped one of the background services on the server (more later).
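
The similarity is easier to judge with everything in the same unit.  The sketch below is my own minimal Python illustration (the helper name parse_c_e is hypothetical, and it handles only the c=15600,e=510 fragment quoted earlier, not a full trace line); the c= and e= values are in microseconds:

```python
import re

def parse_c_e(fragment: str) -> tuple[int, int]:
    """Return (cpu_us, elapsed_us) from a 'c=...,e=...' trace fragment."""
    match = re.search(r"\bc=(\d+),e=(\d+)", fragment)
    if match is None:
        raise ValueError("no c=...,e=... pair found")
    return int(match.group(1)), int(match.group(2))

cpu_us, elapsed_us = parse_c_e("c=15600,e=510")
timer_interval_ms = 15.626  # Current timer interval from the second ClockRes run

print(f"CPU time:       {cpu_us / 1000:.3f} ms")
print(f"Elapsed time:   {elapsed_us / 1000:.3f} ms")
print(f"Timer interval: {timer_interval_ms:.3f} ms")
print(f"CPU time exceeds elapsed time: {cpu_us > elapsed_us}")
```

The reported CPU time is almost exactly one timer interval, while the elapsed time is a small fraction of it.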

I recall performing an experiment a couple of years ago with Oracle Database.  I downloaded a utility that offered to change the Windows default timer resolution from 15.625ms to 1.0ms.  That utility did in fact change the Windows timer resolution, resulting in Oracle Database outputting c= values in increments of 1000, rather than in increments of 15600.  If I am remembering correctly, a second outcome of the experiment was a decrease in performance of the test Oracle database on the computer due to the higher resolution of the Windows timer.
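
One way to picture why the c= values move in steps of the timer interval is a simplified sampling model.  The Python sketch below is only that, a model under my own assumption that whatever is running when a timer tick fires is charged one whole tick; it is not a description of how Windows or Oracle Database actually account for CPU time, and the function name is hypothetical:

```python
def charged_cpu_us(start_us: int, elapsed_us: int, tick_us: int) -> int:
    """CPU time charged to work that runs from start_us for elapsed_us.

    Simplified model (an assumption): whoever is running when a timer tick
    fires is charged one whole tick, and nothing is charged between ticks.
    """
    end_us = start_us + elapsed_us
    ticks_seen = end_us // tick_us - start_us // tick_us
    return ticks_seen * tick_us

# 510 microseconds of work that happens to straddle a 15,600 us tick boundary
print(charged_cpu_us(start_us=15_400, elapsed_us=510, tick_us=15_600))  # 15600
# The same amount of work that starts and finishes between two ticks
print(charged_cpu_us(start_us=1_000, elapsed_us=510, tick_us=15_600))   # 0
# With a 1,000 us tick the charge moves in steps of 1000 instead
print(charged_cpu_us(start_us=700, elapsed_us=510, tick_us=1_000))      # 1000
```

Under this model a c=15600,e=510 line is unremarkable: the work was short, but it was on the CPU at the moment the tick fired.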

Could the change in the resolution of the Windows timer from the Windows default of 15.625ms to 1.001ms be responsible for the occasionally sluggish performance of the server?  One article that I found (and unfortunately did not save the link to) claimed that adjusting the Windows timer from the default of 15.625ms to a lower value, 1ms for example, could cause a significant negative impact in multitasking system performance (roughly 30% decrease, if I recall correctly).  I located an article on Microsoft’s website that offered some level of clarification; below is a short quote from the article:

“Applications can call timeBeginPeriod to increase the timer resolution. The maximum resolution of 1 ms is used to support graphical animations, audio playback, or video playback. This not only increases the timer resolution for the application to 1 ms, but also affects the global system timer resolution, because Windows uses at least the highest resolution (that is, the lowest interval) that any application requests. Therefore, if only one application requests a timer resolution of 1 ms, the system timer sets the interval (also called the “system timer tick”) to at least 1 ms. For more information, see “timeBeginPeriod Function” on the MSDN® website.

Modern processors and chipsets, particularly in portable platforms, use the idle time between system timer intervals to reduce system power consumption. Various processor and chipset components are placed into low-power idle states between timer intervals. However, these low-power idle states are often ineffective at lowering system power consumption when the system timer interval is less than the default.

If the system timer interval is decreased to less than the default, including when an application calls timeBeginPeriod with a resolution of 1 ms, the low-power idle states are ineffective at reducing system power consumption and system battery life suffers.”
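
The rule described in the quote, that the smallest interval requested by any application sets the system-wide interval, can be sketched in a few lines of Python.  This is an illustration only: the function name and the request values are my own, and the quote says the system uses "at least" the highest requested resolution, so the real interval may differ slightly from what this simple model returns:

```python
DEFAULT_INTERVAL_100NS = 156_250  # 15.625 ms expressed in 100 ns units

def effective_interval_100ns(requested, default=DEFAULT_INTERVAL_100NS):
    """Interval under the simplified rule that the smallest outstanding
    request wins; with no outstanding requests the default applies."""
    return min([default, *requested])

print(effective_interval_100ns([]))                # 156250 (no requests)
print(effective_interval_100ns([10_000]))          # 10000  (one 1 ms request)
print(effective_interval_100ns([50_000, 10_000]))  # 10000  (smallest wins)
```

A single program asking for 1 ms is enough to move the whole system off the default.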

The above mentioned Microsoft article also suggested running the following command from the Windows command line:

powercfg /energy

I had actually executed the above command before running the ClockRes program for the first time, and again after running the ClockRes program for the second time.  A very small portion of the powercfg generated HTML file follows, generated prior to the first execution of ClockRes:

Platform Timer Resolution:Platform Timer Resolution
The default platform timer resolution is 15.6ms (15625000ns) and should be used whenever the system is idle. If the timer resolution is increased, processor power management technologies may not be effective. The timer resolution may be increased due to multimedia playback or graphical animations.
Current Timer Resolution (100ns units) 10009
Maximum Timer Period (100ns units) 156250

Platform Timer Resolution:Outstanding Timer Request
A program or service has requested a timer resolution smaller than the platform maximum timer resolution.
Requested Period 10000
Requesting Process ID 536
Requesting Process Path \Device\HarddiskVolume4\PROGRA~2\APC\POWERC~1\agent\pbeagent.exe

This is the same section of the generated HTML file, generated after the second execution of ClockRes:

Platform Timer Resolution:Platform Timer Resolution
The default platform timer resolution is 15.6ms (15625000ns) and should be used whenever the system is idle. If the timer resolution is increased, processor power management technologies may not be effective. The timer resolution may be increased due to multimedia playback or graphical animations.
Current Timer Resolution (100ns units) 156261
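
The values in the two excerpts are in units of 100 ns, which makes them awkward to compare with the ClockRes output in milliseconds.  A minimal Python sketch of the conversion follows; it parses only the text lines as pasted above (the real powercfg report is an HTML file, so this is not a parser for that file), and the function name is my own:

```python
import re

FIRST_RUN = """\
Current Timer Resolution (100ns units) 10009
Maximum Timer Period (100ns units) 156250
Requested Period 10000
"""
SECOND_RUN = "Current Timer Resolution (100ns units) 156261\n"

PATTERN = re.compile(
    r"^(Current Timer Resolution|Maximum Timer Period|Requested Period)"
    r"(?: \(100ns units\))? (\d+)\s*$",
    re.MULTILINE,
)

def timer_values_ms(report_text: str) -> dict[str, float]:
    """Map each recognized label to its value converted to milliseconds."""
    # There are 10,000 units of 100 ns in one millisecond.
    return {label: int(raw) / 10_000 for label, raw in PATTERN.findall(report_text)}

for name, text in (("first run", FIRST_RUN), ("second run", SECOND_RUN)):
    for label, ms in timer_values_ms(text).items():
        print(f"{name}: {label} = {ms:.4f} ms")
```

The first run works out to a current resolution of about 1.0009 ms against a requested period of 1 ms, and the second run to about 15.6261 ms.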

That is potentially interesting.  The output of powercfg stated that PROGRA~2\APC\POWERC~1\agent\pbeagent.exe requested a timer of 1.000 ms, which then changed the Windows server system-wide timer to 1.0009ms.  Interesting?  PROGRA~2\APC\POWERC~1\agent\pbeagent.exe resolves to the “APC PBE Agent” service in Windows, which is a component of the American Power Conversion (APC) PowerChute Business Edition software.  That software interfaces with an attached UPS to provide a gentle shutdown of the server in the event of an extended power outage.  The “APC PBE Agent” service happens to be the service that I shut down between the first and second execution of the ClockRes utility.

Interesting?  Does that suggest that installing the APC PowerChute Business Edition software on a server potentially has a significant impact on the performance of that server due to the program’s insistence on changing the Windows system-wide timer resolution to 1ms?  A quick observation indicates that the change made by the APC software to the Windows system-wide timer resolution apparently does NOT affect the reporting of the c=15600 entries in an Oracle Database 10046 extended SQL trace when the APC software is installed on the server.  The question remains whether or not this APC software could significantly decrease the performance of the Oracle Database software (potentially by 30%, as suggested in the one unnamed article).

——

The Windows Server that is experiencing occasionally jittery mouse and keyboard input is reasonably high-end for a Windows server: Intel Xeon E5-2690 8 core CPU at 2.9GHz (with hyperthreading enabled, giving the appearance of 16 CPUs in Windows), 64GB of memory, RAID controller with 1GB of battery backed cache, 16 internal 10,000 RPM hard drives, two gigabit network adapters in a teamed configuration, etc.  It should require a substantial load on the server to cause the jittery mouse and keyboard input behavior.

The power option plan in Windows was set to High Performance, while the default plan in Windows Server is Balanced.  Various articles on Microsoft’s website state that the Balanced plan allows the server/operating system to use CPU speed throttling (reducing the CPU speed from the stated speed rating, 2.9GHz in the case of this server), and core parking (essentially putting one or more CPU cores to sleep) in order to reduce energy consumption.  Some articles on Microsoft’s site indicate that, at least with Windows Server 2008, core parking may increase IO latencies – that, of course, would be bad if Oracle Database were installed on the server.  Other articles on Microsoft’s site indicate that there are bugs, at least with Windows Server 2008, related to core parking which cause the parked cores not to wake up when the CPU load increases.  I wonder if this particular bug is playing a part in the performance issue faced in this very recent Usenet thread that describes poor performance of Oracle Database running in Hyper-V on Windows?

Here is a screen capture of the Power Options window and Task Manager on the Windows Server 2012 machine that is experiencing occasionally jittery mouse and keyboard input (screen capture taken when the server was mostly idle):

UnexpectedPowerOptionsTaskManager

Notice the inconsistency?  The server’s CPU is throttled down from 2.9GHz to just 1.16GHz while the power option plan is set to High Performance.  The Microsoft published “Performance Tuning Guidelines for Windows Server 2012” document on pages 16-17 states the following (I highlighted some of the words in red):

“Balanced (recommended): Default setting. Targets good energy efficiency with minimal performance impact.  Matches capacity to demand. Energy-saving features balance power and performance.

High Performance: Increases performance at the cost of high energy consumption. Power and thermal limitations, operating expenses, and reliability considerations apply.  Processors are always locked at the highest performance state (including “turbo” frequencies). All cores are unparked.

Power Saver: Limits performance to save energy and reduce operating cost.  Caps processor frequency at a percentage of maximum (if supported), and enables other energy-saving features.”

Well, that is interesting, and is inconsistent with the above screen capture.  Incidentally, when the server was experiencing the worst of the occasionally jittery mouse and keyboard input, the CPU utilization was hovering around 6% and the CPU speed was still coasting at 1.16GHz to 1.18GHz, the network performance hovered between 600Mbps and 1100Mbps, and the server’s internal hard drives barely noticed the traffic passing to/from the disks through the network interface (lower than 75MB/s and 137MB/s, respectively).  6% CPU utilization causes the mouse and keyboard input to become jittery?  With hyperthreading enabled, there are essentially 16 CPU seconds available per second of elapsed time.  A quick check: 1/16 = 0.0625, so 1 (hyperthreaded) CPU at 100% utilization would be reported as a system-wide utilization of 6.25%.  Interesting, but is that statistic relevant?
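
The quick check generalizes to any number of busy logical CPUs.  A small Python sketch (the function name is my own) that reproduces the 6.25% figure:

```python
def system_wide_utilization_pct(busy_logical_cpus: float, total_logical_cpus: int) -> float:
    """System-wide CPU utilization when busy_logical_cpus are fully busy."""
    return 100.0 * busy_logical_cpus / total_logical_cpus

# One fully busy logical CPU out of the 16 that Windows sees on this server
print(system_wide_utilization_pct(1, 16))  # 6.25
```

A system-wide reading of roughly 6% is therefore about what a single saturated logical CPU would produce.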

I happened to have the Windows Resource Monitor open during one of the jittery episodes.  The Resource Monitor showed, shockingly, that 14 (possibly 15) of the hyperthreaded “CPUs” were parked!  That result is also in conflict with the Microsoft document mentioned above regarding “all cores are unparked” when the High Performance power plan is selected.  So, at 6% CPU utilization the server was CPU constrained.  Modifying the setting in the server’s BIOS that controls whether or not cores may be parked, so that the cores could not be parked, fixed the issue in Windows Server 2012 that resulted in the 30 second delay that accompanied moving the mouse pointer a short distance across the screen.
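
To see why 6% can mean CPU constrained, the system-wide figure can be rescaled to the logical CPUs that were actually awake.  The Python sketch below assumes, as a simplification of my own, that all of the measured work landed on the unparked logical CPUs; the function name is hypothetical:

```python
def utilization_of_unparked_pct(system_wide_pct: float, total: int, parked: int) -> float:
    """Rescale a system-wide utilization figure to the unparked logical CPUs."""
    unparked = total - parked
    if unparked <= 0:
        raise ValueError("at least one logical CPU must be unparked")
    return system_wide_pct * total / unparked

for parked in (14, 15):
    pct = utilization_of_unparked_pct(6.0, total=16, parked=parked)
    print(f"{parked} of 16 parked: unparked CPUs are about {pct:.0f}% busy")

# The throttled clock compounds the problem: observed speed versus rated speed
print(f"Clock speed relative to rating: {1.16 / 2.9:.0%}")
```

With 14 logical CPUs parked the two remaining ones are roughly half busy, and with 15 parked the single remaining one is nearly saturated, all while running at about 40% of the rated clock speed.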

The server still exhibits a bit of jittery behavior with mouse and keyboard input when the server’s teamed network cards are heavily used for file transfers to the server, but at least the CPU activity is no longer confined to a single hyperthreaded “CPU”:

UnexpectedResourceMonitor

Considering that this server was ordered from the manufacturer as “performance optimized”, I am a bit surprised at the power consumption of the server.  The server was ordered with dual (redundant) 1100 watt power supplies.  With the CPU’s 135 watt maximum TDP (per Intel: “Thermal Design Power (TDP) represents the near maximum power a product can draw for a thermally significant period while running commercially available software.”), 16 hard drives, and 64GB of memory, I fully expected the server to consume between 700 and 900 watts of electrical power.

Here is the server’s power consumption when the server is lightly loaded with roughly 68 running processes (note that the server is connected to a 120 volt power outlet):

UnexpectedPowerConsumptionLittleLoad

Here is the server’s power consumption when the server is moderately loaded with between 600Mbps and 1100Mbps of network traffic (the mouse pointer was slightly jittery at this point):

UnexpectedPowerConsumptionNetworkLoad

So, the server consumes 1.2 amps (126 watts) when lightly loaded and 1.4 amps (154 watts) when moderately loaded.  Keeping in mind that many of the popular incandescent light bulbs require 100 watts of power (note that some governments have now restricted the manufacturing of high wattage incandescent light bulbs), this server is consuming just a little more electrical power than a light bulb that might have been hung overhead just a decade or two ago.
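
For readers wondering why 1.2 amps at 120 volts is shown as 126 watts rather than 144, the difference is the power factor.  The Python sketch below is a rough illustration that assumes exactly 120 volts at the outlet and takes the rounded amp readings at face value, so the resulting power factors are approximate; the function names are my own:

```python
def apparent_power_va(amps: float, volts: float = 120.0) -> float:
    """Apparent power in volt-amperes."""
    return amps * volts

def implied_power_factor(watts: float, amps: float, volts: float = 120.0) -> float:
    """Ratio of real power (watts) to apparent power (volt-amperes)."""
    return watts / apparent_power_va(amps, volts)

SUPPLY_WATTS = 1100  # rating of one of the two redundant power supplies

for label, amps, watts in (("lightly loaded", 1.2, 126), ("moderately loaded", 1.4, 154)):
    print(
        f"{label}: {apparent_power_va(amps):.0f} VA, "
        f"power factor about {implied_power_factor(watts, amps):.2f}, "
        f"{watts / SUPPLY_WATTS:.0%} of one supply's rating"
    )
```

Either way, the server is drawing only a small fraction of what a single 1100 watt power supply is rated to deliver.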

One of the common arguments for server virtualization is energy savings – the above screen captures may suggest that energy savings may not be a significant cost-savings factor for virtualization with modern server hardware.  One might question how much energy is really being saved when the network interface is maxed out by a single virtualized server, just 6% CPU utilization results in a jittering mouse pointer, and there are eight to ten virtualized servers stacked on the physical hardware (all competing for the scarce CPU and network resources).

Added May 11, 2013:

Dell BIOS setting to enable or disable CPU parking in Windows Server 2012:

UnexpectedPowerConsumptionBIOSProcIdle

With the BIOS option set to enabled, disk activity caused by network traffic results in occasionally jittery mouse movements on the server.  Based on a bit of research, installing the Hyper-V role on either Windows Server 2012 or Windows 8 may disable CPU throttling and/or disable CPU parking.

Added June 5, 2014:

I finally had sufficient time to fully analyze this problem, where a 2.9GHz CPU in a Dell PowerEdge T620 server crawled along at a leisurely pace of about 1.16GHz, actually throttling back performance further as demand for the server’s resources increased.  A second Dell PowerEdge T620 server with a 2.6GHz CPU that was purchased at the same time also coasted along at roughly 1.16GHz, but that server did not seem to throttle back performance further as demand for the server’s resources increased.

As a review, the screen capture shown below at the left shows the Windows Server 2012 Power Options settings and the Performance tab of the Task Manager.  The screen capture below at the right shows the Windows Server 2012 Power Options settings and the Performance tab of the Task Manager after fixing this particular problem – note that the 2.9GHz CPU is now essentially overclocked at 3.28GHz (it has operated at roughly that speed since the fix).

UnexpectedT620PowerOptionsTaskManager UnexpectedT620PowerOptionsTaskManager2

The 2.9GHz PowerEdge T620 and the 2.6GHz PowerEdge T620 are both Active Directory domain controllers and internal DNS servers (along with supporting other tasks), so the occasionally slow (or extremely slow) performance of the servers negatively impacted the performance of other servers as well as client workstations.

There was a BIOS firmware update released in the third quarter of 2013, which was supposed to address some CPU throttling issues – that BIOS update did not seem to help the problem that I experienced.

I thought that the low power consumption of the big server with the 2.9 GHz E5-2690 8 core CPU was a significant clue when I tried troubleshooting the server a year ago since that CPU is rated to consume up to 135 watts and the server liked to hover between 120 watts and 140 watts regardless of the server’s workload.  The Dell T620 (and other recent Dell servers) has some pretty sophisticated power management capabilities.  Dell sells a utility that is able to alter the electrical power profile of a server, and I thought that Dell might have imposed a 140 watt limit on the server for some reason, but I could not find where that limit was specified.  The 2.9 GHz E5-2690 8 core CPU apparently has some additional electrical power limiting capabilities.  A year ago I even tried downloading a demo of Dell’s power management utility – that did not help resolve the issue (I think that the installation might have caused some other issues that I had to fix).  Last week Tuesday I read the following articles:
http://www.dell.com/learn/us/en/19/financial-services-markets-solutions-processor-acceleration-technology
http://www.dell.com/us/business/p/dell-openmanage-power-center/pd
http://www.intel.com/content/www/us/en/data-center/data-center-management/how-to-configure-node-manager-video.html
ftp://ftp.dell.com/Manuals/Common/poweredge-r720_Concept%20Guide_en-us.pdf
http://en.community.dell.com/techcenter/power-cooling/w/wiki/3536.openmanage-power-center-faq.aspx

I rebooted the server, pressed F2, and dug around in the settings a bit.  I found that the System Profile Setting was set to “Performance per Watt” (I believe that this was how it was set when it left the Dell factory).  I changed that setting to “Performance”, saved the changes, and rebooted the server again.  The server is now consuming 200+ watts, and the CPU is freely exceeding its rated speed.  The pictures below show the configuration changes, made in the System BIOS settings, that remove the electric power cap and thus allow the server to behave as it should have from the factory:

UnexpectedT620SystemProfile1 UnexpectedT620SystemProfile2

UnexpectedT620SystemProfile3 UnexpectedT620SystemProfile4

I suppose that if a Dell PowerEdge T620 (or similar recent model Dell server) seems to be running a bit slower than expected (note that the particular problem mentioned above is NOT Windows specific – a Dell PowerEdge T620 running Linux should be affected in the same way), you might take a quick peek at the System Profile Setting in the System BIOS to make certain that the System Profile is set to Performance.  As shipped from the factory, two Dell PowerEdge T620 servers purchased this year were NOT affected by the problems mentioned in this blog article.
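
On a Linux installation, one rough way to spot the same symptom without Task Manager is to compare the current clock speed with the rated speed in the text of /proc/cpuinfo.  The Python sketch below is an untested-on-a-T620 illustration: it assumes an x86 system whose "model name" line ends in the rated speed and whose "cpu MHz" line reflects the current clock, and the sample text is made up for the example rather than captured from a real server:

```python
import re

def clock_report(cpuinfo_text: str) -> list[tuple[float, float]]:
    """Return (current_ghz, rated_ghz) for each logical CPU listing both values."""
    results = []
    for block in cpuinfo_text.strip().split("\n\n"):
        rated = re.search(r"^model name\s*:.*@\s*([\d.]+)GHz", block, re.MULTILINE)
        current = re.search(r"^cpu MHz\s*:\s*([\d.]+)", block, re.MULTILINE)
        if rated and current:
            results.append((float(current.group(1)) / 1000.0, float(rated.group(1))))
    return results

# Made-up sample in the /proc/cpuinfo layout, for illustration only
SAMPLE = (
    "processor\t: 0\n"
    "model name\t: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz\n"
    "cpu MHz\t\t: 1160.000\n"
)

for current_ghz, rated_ghz in clock_report(SAMPLE):
    print(f"running at {current_ghz:.2f}GHz of a rated {rated_ghz:.2f}GHz "
          f"({current_ghz / rated_ghz:.0%})")
```

Passing the contents of /proc/cpuinfo instead of the sample text would report one line per logical CPU; a busy server that stays far below its rated speed would be a reason to check the System Profile Setting.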







