I’ve been thinking a lot lately about what has made my virtualization journey successful so far. I can point to eight good reasons, four of which are more technical than the rest: a test environment, not breaking vMotion, N+1 capacity, and maintenance windows & good patching practices.
1. A respectable test environment.
I have four physical hosts (two older hosts, two newer hosts) configured in two clusters where I can try new things, test patches, upgrades, and new functionality, run a couple of test VMs for each OS we support, develop procedures, train staff, do demos, and generally muck around without affecting production. I run the test vCenter instance for these hosts as a VM in my production cluster, and that’s great because I can snapshot it, copy it, and do everything you’d expect to do with a VM. The main thing about this test environment is that it belongs to my team. We don’t have app admins using it, we don’t need to schedule downtime in it, and we don’t have to answer to anybody but ourselves. It’s very liberating to be able to just do the work, without a lot of corresponding paperwork.
2. A policy wherein we do not implement anything that breaks vMotion.
Yes, this means that applications that use the Microsoft Cluster Service don’t get virtualized (because they require physical SCSI bus sharing). Neither do systems that need access to specific hardware (weird ISDN cards, etc.). These don’t represent much of our aggregate workload (1.5% of all of our hosts) so we don’t worry too much about them right now. In the future we will need to worry about this, but by then it’s possible that vendors like VMware and Microsoft will have come up with better solutions for us. It’s probably worth mentioning that we have virtualized some of the test instances of these clusters. When we want to move those VMs my team just shuts one down at a time. Works fine for test systems.
What this policy does get us is the ability to vMotion at any time, in order to do work on the physical hosts, to troubleshoot problems, and to patch. In fact, we’ve evolved to the point where my team rarely does ESX patching outside of normal work hours anymore. That has some nice side effects: A) patching actually gets done, as people who have to patch outside of business hours usually opt not to, B) we don’t make ridiculous mistakes because we’re half-asleep at 4 AM on a Sunday, and C) if there is an unanticipated problem we have all sorts of IT staff around to assist with resolution. Beyond all that, in keeping with the promise of virtualization, there are no hardware emergencies anymore and no emergency downtime requests. If a host is suffering some sort of failure we clear it off without fanfare, keeping calm, carrying on. vMotion, and by extension DRS, is just a fact of life.
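To give a feel for why evacuating a host is such a non-event, here’s a minimal sketch of the kind of placement decision DRS makes for us. This is not VMware’s actual DRS algorithm, just a greedy illustration with made-up VM and host names; in real life we put the host in maintenance mode and let DRS do this for us.

```python
def plan_evacuation(vms, targets):
    """Greedy sketch: assign each VM (biggest RAM first) to whichever
    target host currently has the most free RAM. Illustration only --
    not VMware's DRS algorithm."""
    free = dict(targets)  # host name -> free RAM in GB
    plan = []
    for name, ram in sorted(vms, key=lambda v: -v[1]):
        dest = max(free, key=free.get)  # host with the most headroom
        if free[dest] < ram:
            raise RuntimeError(f"no room for {name}")
        free[dest] -= ram
        plan.append((name, dest))
    return plan

# Hypothetical: clearing four VMs off a failing host onto esx2 and esx3.
print(plan_evacuation(
    [("web1", 4), ("db1", 16), ("app1", 8), ("test1", 2)],
    {"esx2": 20, "esx3": 24},
))
```

With N+1 headroom (below) this always succeeds, which is exactly why a failing host is no longer an emergency.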
3. N+1 cluster capacity.
We always have one extra host worth of CPU & RAM capacity in our cluster, so if a machine is having problems or being worked on we aren’t compromising performance by leaving it in maintenance mode for a couple of days. This also works nicely with our physical machine maintenance practices, in two ways.
First, two members of my team are Dell-certified technicians, so we’re able to order our own parts via Dell’s web site. This is great because we don’t have to call Dell, wait on hold for 30 minutes, convince the support guy that we’re not idiots and we really do have a dead hard disk, drop everything and trudge up to the front door to meet some repair guy at 5:30 PM, escort him into the data center, and babysit him as he takes 2 minutes to put a new disk in a drive carrier. I own a screwdriver; I can do that myself and save many hours of time and interruption. Besides, 100% of my hardware failures over the last year were covered by some form of redundancy (RAID, ECC, dual power supplies, multiple fans), and we were able to clear the affected host off using vMotion and wait for next-business-day parts to be shipped to us. My logistics guys get the part from FedEx at 10 AM the next day, drop it on my desk by 10:30 AM, and someone from my team installs it by noon.
Second, if we have N+1 capacity, and we are happier with next-business-day service anyhow, why pay for 24×7 warranties? In fact, the money we’ve saved by going with next-business-day warranties has paid for the extra capacity multiple times over, and actually helps fund my test environment, too.
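The N+1 math is simple enough to sanity-check in a script: the cluster’s current load has to fit with its single largest host removed. A minimal sketch, with hypothetical host specs and utilization numbers (not from our environment):

```python
def fits_n_minus_one(hosts, used_cpu_mhz, used_ram_gb):
    """Return True if current cluster load would still fit with the
    single largest host removed -- i.e., the cluster has N+1 headroom."""
    if len(hosts) < 2:
        return False
    # Worst case: the biggest host is the one that fails or is patched.
    biggest = max(hosts, key=lambda h: (h["cpu_mhz"], h["ram_gb"]))
    remaining = [h for h in hosts if h is not biggest]
    cpu_left = sum(h["cpu_mhz"] for h in remaining)
    ram_left = sum(h["ram_gb"] for h in remaining)
    return used_cpu_mhz <= cpu_left and used_ram_gb <= ram_left

# Hypothetical four-host cluster, each host 24,000 MHz / 64 GB RAM:
cluster = [{"cpu_mhz": 24000, "ram_gb": 64} for _ in range(4)]
print(fits_n_minus_one(cluster, used_cpu_mhz=60000, used_ram_gb=150))  # True
print(fits_n_minus_one(cluster, used_cpu_mhz=80000, used_ram_gb=150))  # False
```

A check like this, run against real utilization data, tells you whether you can leave a host in maintenance mode for a couple of days without compromising performance.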
4. Defined maintenance windows for every OS instance, and regular patching.
My organization’s policy is that every machine we stand up, whether physical or virtual, must have a weekly maintenance window defined before we turn it over to the app admins. As a result, a query of our CMDB lets us see when we can take an outage for any guest VM; we then schedule work during that window a week to a month in advance, depending on the importance of the system.
This simple policy has led to another remarkable phenomenon: regular OS patching. Since it isn’t a hassle to coordinate or negotiate outages, every six months our Linux admins apply updates and reboot all the hosts (methodically, over four weeks, moving through development, test, QA, and production environments). Likewise, every month, on Patch Tuesday, our Windows admins apply updates and reboot all their hosts, too. This lets us virtualization guys piggyback on their updates to apply new tuning techniques, update VMware Tools, and generally evolve our environment as we learn better ways to run virtual OSes. For example, when Red Hat released kernel updates with divider= support we were able to take advantage of it within six months, cutting our Linux VM CPU utilization by 90% in many cases. Or, when a security hole is found in VMware Tools for Windows, we remediate it within a month… sometimes before anybody asks what we’re doing to fix it.
How cool is that?
(Tomorrow’s post will cover what I consider the other keys to my success: relationships, speed, simple chargeback, and evangelism.)