SpeedStep and VDI? Is it a good thing? Not for me.
SpeedStep problems are nothing new when it comes to Virtual Machines and performance issues. Just google around a bit and you will see the forum questions and articles on VMs not getting enough CPU due to SpeedStep. But often these problems are seen in workstation products and not in servers/hosted virtual desktops. Well that has changed.
This week I was introduced to an interesting problem one of our customers was having. They happened to notice it as we were sitting down to do some Unidesk testing. The “problem” was that on their test servers (almost no load) the VMs that were executing anything with CPU load seemed to be “slow”.
Login times were a little slower on these new servers, the Windows Mini-setup seemed to take longer, basically anything that was heavily CPU dependent was not as fast as it should be. On a lark I had them run Ron’s tried and true CPU consumer...
Jump to a command prompt, drop to the root of C: and run Dir /s
While this was running we jumped over to vCenter and checked the real-time performance tab of this VM. Here we could see that the CPU utilization table-topped at just over 50% and about 1080 MHz.
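If you want a number you can compare from run to run, here is a minimal sketch of a single-threaded CPU burner in Python. To be clear, this is my own stand-in and not what we ran at the customer (that was plain old Dir /s); the function name burn_cpu and the one second duration are just for illustration.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Spin one thread in a tight loop for `seconds` and return the loop count."""
    deadline = time.perf_counter() + seconds
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1
    return iterations


if __name__ == "__main__":
    duration = 1.0  # illustrative; use a longer run when watching vCenter
    count = burn_cpu(duration)
    # Compare this rate on a freshly booted host and again once things feel slow.
    print(f"{count / duration:,.0f} iterations/second")
```

Run it inside the VM while watching the real-time performance tab. If the iterations/second figure drops noticeably between runs on an otherwise idle host, the core is probably being clocked down underneath you.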
We then jumped back to the VM and in task manager noticed the machine was “thinking” it was using 100% of available CPU. WTH?
Further digging was needed. Why could this VM NOT get 100% of the pCPU (physical CPU to any non-VM types)? After all, this host had only a few VMs. We checked for VMware resource limits, and we looked for anything in the VM config, pools or vCenter that may be setting a limit.
Next step was the hardware itself. Our first theory was that this may be Hyper-Threading related. So we shut off HT in BIOS and rebooted the host and powered back up the test VM. Sure enough we could now get full CPU (little over 2 GHz) and max out a physical CPU. This led us to believe HT was to blame for a bit. But we were too quick to make the assumption and got burned.
Another interesting item that we noticed was that ESXTOP reported that the CPU USED and UTIL counters were VERY different. (Image below) Interestingly, UTIL was almost always about twice USED (UTIL = 2 × USED), plus or minus a couple of percentage points.
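In hindsight that 2x gap is about what you would expect if, as I understand the counters, UTIL is the share of wall-clock time the core was busy while USED is scaled by the clock speed the core is actually running at. A rough sketch of that relationship follows; the 2160 MHz nominal speed is my assumption (we only saw "a little over 2 GHz"), and expected_used_pct is a made-up helper, not anything from ESXTOP.

```python
def expected_used_pct(util_pct: float, current_mhz: float, nominal_mhz: float) -> float:
    """Estimate USED from UTIL, assuming USED is UTIL scaled by the clock ratio."""
    return util_pct * current_mhz / nominal_mhz


# A core that is busy 100% of the time but stepped down to 1080 MHz.
# nominal_mhz=2160 is an assumed figure for a "little over 2 GHz" part.
print(expected_used_pct(100.0, 1080.0, 2160.0))  # 50.0
```

So a VM that is pegged at 100% inside the guest shows up as roughly 50% USED on the host, which matches what vCenter was showing us.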
Fast forward 15 minutes or so, we begin to look at other items that needed work in the environment, and we happen to notice that things have gotten “slow” again. Jumping back to the performance metrics for the test VM we see AGAIN that running a task that consumes 100% of the CPU in the VM only really takes 50% or so of the pCPU. WTH? AGAIN?
Long story short we have figured out that it is a SpeedStep ‘issue’. This ‘issue’ is also expected behavior. See, SpeedStep slows down (or steps down) the processor's clock speed (or frequency to you processor geeks) when there is not a lot of demand for the CPU. The functionality can be seen in notebooks all the time; you are sitting there idle, reading an email or something, and SpeedStep may take your 3 GHz proc and drop it down to 1 or 1.5. Then if you start doing something CPU intensive (think AV scan) the CPU is stepped back UP to full clock speed. The idea is that since you don’t need all of the available CPU cycles, reduce the available number of cycles at the CPU, which will reduce overall power consumption and heat coming from the processor. A 'good thing' correct? For sure, in a laptop, to keep the leg burns to a minimum. But is it good for VDI host hardware?
Kind of. On the server side it seems that this technology is going to be looking at overall CPU demand/usage to determine when to step up and down the clock speed. At lower demand (think a bunch of idle VMs or when few people are logged in) SpeedStep will step all the processors down. As load increases on the server, there will be a tipping point, where SpeedStep will see that overall CPU demand is high enough to step all the CPUs back up to full speed. Sounds great right?
Well, maybe for power. But for perceived performance (performance as viewed from the end user perspective) this can cause variations in application launch time and app responsiveness. We noted that just after rebooting the server (prior to SpeedStep stepping down the CPUs) our login to a Win7 desktop was about 2x as fast as when SpeedStep had kicked in. We also noticed that those application starts (anti-virus, and other load at login type of things) seemed to move along much quicker and get the user to a desktop faster. Once SpeedStep kicked in, the login was noticeably slower and application launches were somewhat sluggish… Why? Well the VM only had access to one core and that core had been stepped down to 1080 MHz… welcome to 2001!
These are my issues: Variable performance and the unknown of how much load it will take to keep SpeedStep from dropping the CPUs to a lower frequency. If I were building and managing a VDI environment I would not risk users getting different experiences from session to session. One of the primary complaints with VDI is performance. People want to move to VDI only if they can supply the same or better performance for the desktop. Well, variable performance is just as bad as LESS performance in the VDI world. This will generate calls from the users. So for my environments SpeedStep is now a no-no.
As a side note here I want to point out that we cannot (at this time) disable SpeedStep on certain UCS blades.
The customer attempted to disable SpeedStep on two types of blades in the environment, UCS 200s and UCS 230s. For the UCS 200s this was not an issue. Those VMs behaved and acted as you would think (disable SpeedStep, get full access to the processor). But on the UCS 230 blades SpeedStep could not be disabled and the VMs continued to experience the performance issues…. Then my contact found this link in the Cisco forums:
“I'm assuming that is what VMWare noticed while looking at another issue as throttling my CPU performance by 50%.”
Yup… same problem… The customer also asks why he cannot disable it. He has set a policy and applied it and still has the problem. He also notices that you cannot disable the option in the BIOS to get it to stop doing it. Cisco Response?
"discussed this with our Dev team. This is expected behavior."
Basically, you can’t shut off SpeedStep. Great. Thanks. Pound sand.
The support folks also go on to state that you should see NO performance change at high or low utilization. But that is complete BS. The UCS 230 blades are not allowing VMs to spike up to 100% of true CPU when the CPUs are stepped down. The whole SYSTEM has to have enough utilization at the CPU level to step the CPUs back up. So while the host CPU has overall low utilization, an individual VM or process in a VM that is CPU constrained will take about twice as long as normal. Great. Is this the “expected behavior” you were aiming for?
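To put a number on "about twice as long", here is a back-of-the-envelope sketch. It assumes the task is purely CPU bound so that run time scales inversely with clock speed; the 30 second task and the 2160 MHz full speed are illustrative figures of mine, not measurements from the customer.

```python
def stepped_runtime(seconds_at_full: float, full_mhz: float, stepped_mhz: float) -> float:
    """Run time of a CPU-bound task on a stepped-down core (inverse clock scaling)."""
    return seconds_at_full * full_mhz / stepped_mhz


# A hypothetical 30 second task on a core stepped down from 2160 MHz to 1080 MHz.
print(stepped_runtime(30.0, 2160.0, 1080.0))  # 60.0
```

Real workloads that wait on disk or network will not slow down by the full ratio, so treat this as the worst case for something CPU constrained.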
Not good Cisco. I am going to keep an eye on this issue, and my faith in SpeedStep and its viability on the server side has been eroded. I think that the folks dealing with this technology think too much about single server workloads and not enough about VM environments.
Ron’s Politically Incorrect VDI Blog
Too many organizations are out there trying to implement VDI and failing. Whether you like what Ron has to say or not, he is here to say what others won’t about VDI and help you get it right in your environment. Get Ron’s advice… raw…unfiltered… without the sugar coating.