Saturday, May 3, 2014

vSphere performance issue: Make your future self proud as admin

Baselining is, like cable management, a skill that is often downplayed, overlooked, or otherwise ignored, yet it is absolutely essential to any virtual environment. Consider this scenario:

On Monday, you arrive to a slew of emails: "I think my blaa-blaa app is not working correctly. It is super-slow." You think about it, do a quick check (and I mean quick), and go get your coffee. When you return to your desk, your boss walks over and says, "I've been hearing a lot of people talking about their problems with the blaa-blaa app. What's going on?"

Now, if you haven't been practicing baselining, you won't be able to give anything other than a subjective answer: "Well, they think it is running slowly, but I don't think it is running slowly." There is no empirical data to back up your statement; it is your word against theirs.

Baselining is the answer to the problem of subjectivity in the computing experience. It gives you, as a systems administrator, the ability to:
1) know what good performance looks like,
2) determine, by delta comparison, when things are not functioning properly and are performing below designed expectations, and
3) correct any mistaken impressions, whether you are wrong and the app really is performing poorly, or something else is going wrong on the user's end.
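To make the "delta comparison" idea concrete, here is a minimal sketch of how it could look. The metric names, the numbers, and the 20% tolerance are all made up for illustration; they are assumptions, not output from any VMware tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deviation:
    metric: str
    baseline: float
    current: float
    delta_pct: float  # relative change versus the baseline, in percent


def compare_to_baseline(baseline, current, tolerance_pct=20.0):
    """Return the metrics whose current value differs from the
    baseline by more than tolerance_pct (relative, in percent)."""
    deviations = []
    for metric, base_value in baseline.items():
        if metric not in current or base_value == 0:
            continue  # nothing meaningful to compare against
        delta_pct = (current[metric] - base_value) / base_value * 100.0
        if abs(delta_pct) > tolerance_pct:
            deviations.append(
                Deviation(metric, base_value, current[metric], round(delta_pct, 1))
            )
    return deviations


# Hypothetical numbers, for illustration only.
baseline = {"cpu_pct": 58.0, "ram_pct": 62.0, "datastore_latency_ms": 2.3}
current = {"cpu_pct": 61.0, "ram_pct": 64.0, "datastore_latency_ms": 9.2}

for d in compare_to_baseline(baseline, current):
    print(f"{d.metric}: baseline {d.baseline}, now {d.current} ({d.delta_pct:+.1f}%)")
```

With these sample values, CPU and RAM stay within tolerance and only datastore latency gets flagged. That is the sort of objective answer you want to have ready when your boss walks over.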

So how does baselining work, and what is the proper way to perform it, to achieve this desired outcome? I'm glad you asked!

In my experience, many people baseline a system before Go Live, that is, without any users on it. That is important. However, a true production baseline must be taken under normal user workload and resource consumption; otherwise, how will you know when something is not operating as expected?
A few things to remember (call them the "gotchas" of performance baselining):
  • Utilization is a measure of resource consumption, not user workload.
  • Latency is a measure of resource transmission, not user experience.

Too often our designs are based on static calculations: I know each VDI session will consume 500 MHz of processor and produce 0.25 Mbps of LAN traffic. Scale out. We create a modular "pod," if you will, to figure out what we need for hardware, etc. This is a great practice, and virtually everyone uses it when it comes to sizing.
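For what it's worth, the static "pod" math usually looks something like the sketch below. The 500 MHz and 0.25 Mbps per-session figures are the ones mentioned above; the session count, the host CPU capacity, and the 70% target utilization are hypothetical values I picked only to make the arithmetic visible.

```python
import math


def hosts_needed(sessions, mhz_per_session, host_usable_mhz, target_utilization=0.7):
    """Return (hosts, sessions_per_host) so that average CPU demand
    per host stays at or below target_utilization."""
    sessions_per_host = math.floor(
        host_usable_mhz * target_utilization / mhz_per_session
    )
    if sessions_per_host < 1:
        raise ValueError("a single session does not fit on a host at this target")
    return math.ceil(sessions / sessions_per_host), sessions_per_host


# Per-session figures from the text.
MHZ_PER_SESSION = 500
MBPS_PER_SESSION = 0.25

# Assumptions for illustration: 200 sessions, hosts with 16 cores at 2.6 GHz.
SESSIONS = 200
HOST_USABLE_MHZ = 16 * 2600

hosts, per_host = hosts_needed(SESSIONS, MHZ_PER_SESSION, HOST_USABLE_MHZ)
lan_mbps = SESSIONS * MBPS_PER_SESSION

print(f"{per_host} sessions per host, {hosts} hosts, {lan_mbps} Mbps of LAN traffic")
```

Notice that nothing in this calculation has ever seen a real user. It tells you what to buy, not how the environment behaves once people log in.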

The problem enters, most often, when VMware admins and other sysadmins take what is meant to be a sizing guideline and turn it into a baseline, a function for which it was never intended. The sizing guideline is for purchasing and initial configuration; it is a place to start. A baseline, by contrast, is supposed to represent the end state at go-live under normal operating conditions.

So how do you baseline? Well, you use the same tools; you just use them properly. You can use simulated user loads, but in my experience users are less predictable than we think. It is best to perform your final baseline under normal user workload conditions, ideally over several days. You should then be able to summarize it simply for your users and your boss's boss, like this:

VDI baseline: On an average Monday, we have
  • 127 users logged in, with peak login hours of 7:45–8:30 am
  • 58% CPU utilization per host, 78% during peak login
  • 62% RAM utilization per host, 68% during peak login
  • 12% network utilization per host, 35% during peak login
  • 2.3 ms average latency per datastore, 15.8 ms during peak login
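A summary like this boils down to two numbers per metric: the average over the whole day and the average inside the peak login window. Here is a rough sketch of that reduction. The sample readings are invented, the peak window is the 7:45 to 8:30 one from the example, and I am assuming the all-day average includes the peak window; how you export the samples from your monitoring tool is up to you.

```python
from datetime import datetime, time
from statistics import mean


def summarize(samples, peak_start, peak_end):
    """samples: list of (datetime, value) pairs for one metric.
    Returns (overall_average, peak_window_average), rounded to 0.1.
    The peak average is None if no sample falls inside the window."""
    values = [value for _, value in samples]
    peak = [value for ts, value in samples if peak_start <= ts.time() <= peak_end]
    overall_avg = round(mean(values), 1)
    peak_avg = round(mean(peak), 1) if peak else None
    return overall_avg, peak_avg


# Hypothetical CPU utilization samples (%) for one host on one Monday.
cpu_samples = [
    (datetime(2014, 4, 28, 7, 30), 40.0),
    (datetime(2014, 4, 28, 7, 45), 76.0),
    (datetime(2014, 4, 28, 8, 0), 80.0),
    (datetime(2014, 4, 28, 8, 30), 78.0),
    (datetime(2014, 4, 28, 9, 0), 55.0),
    (datetime(2014, 4, 28, 12, 0), 50.0),
]

overall, peak = summarize(cpu_samples, time(7, 45), time(8, 30))
print(f"{overall}% CPU utilization, {peak}% during peak login")
```

In practice you would feed this several Mondays of data rather than six readings, but the shape of the result is the same one-line statement per metric.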

Make sure that you have a baseline not just for servers in general, but for specific workloads, too. In the case above, I used VDI as an example.

Make sense? So avoid the trap of laziness and do the hard work of optimizing and recording a baseline; your future self will thank you for it. And by the way, a tool like VMware vSphere Operations Manager or NetApp OnCommand Performance Manager can help a great deal along the way.