Over the last ten years my organization has come a long way with its IT policies and processes. We’ve gone from the wild, wild west of IT where personal heroism ruled the day, to a place where there’s just enough process to make sure that communication happens correctly and things like our Configuration Management Database (CMDB) stay up to date. It’s been a lot of work, but I am actually really proud of where we’re at.
There are three fundamental decisions we made a long time ago that, had they not been made, would have drastically changed how virtualization has proceeded here.
1. Clearly defined maintenance windows.
Knowing exactly when someone can do maintenance on a server has been crucial to getting things done in our virtualization environment. There are many adjustments you can & should make in virtual environments, but if you can’t ever take the VMs down to make the changes you’re stuck. We’ve been able to do physical to virtual migrations, performance tuning, VMware Tools upgrades, vSphere upgrades, and a whole slew of other things in relatively short timeframes because we have this all worked out already. This also lets us “right-size” our VMs — rather than deploying huge VMs just in case they need the CPU or RAM, we deploy smaller ones and then take an outage later to add CPU and RAM if we need to. The maintenance window for a server is negotiated between the application/service admins and the system administrators when a machine is put into production; we track it in our CMDB, and any member of the team supporting the service can use the window, as long as they follow some rules about notifications for the change (timeframes, etc.).
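If you’re wondering what “tracking the window” actually amounts to, here’s a minimal sketch in Python. The record layout, field names, and hostnames are my own inventions for illustration, not what our CMDB really stores; the point is just that a maintenance window is data you can query before you touch a box.

```python
from datetime import datetime, time

# Hypothetical export of negotiated maintenance windows from a CMDB.
# Field names and hosts are illustrative only.
WINDOWS = {
    "app-server-01": {"day": "Tuesday", "start": time(18, 0), "end": time(22, 0)},
    "db-server-01":  {"day": "Sunday",  "start": time(6, 0),  "end": time(10, 0)},
}

def in_window(host, when=None):
    """Return True if 'when' falls inside the host's negotiated maintenance window."""
    when = when or datetime.now()
    w = WINDOWS.get(host)
    if w is None:
        return False  # no negotiated window on record means no maintenance
    return when.strftime("%A") == w["day"] and w["start"] <= when.time() <= w["end"]

host = "app-server-01"
if in_window(host):
    print(f"{host}: inside its window; send the change notification and go.")
else:
    print(f"{host}: outside its window; schedule it or negotiate an exception.")
```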
2. Use of load-balancing technologies.
We use application load balancers (operating at layers 4 through 7 of the OSI model) to decouple services from individual servers. Not only does this allow us to take a host down without affecting a service, but it also lets us spread the load out more among the physical hosts we have in our virtual infrastructure. In a lot of cases having more, smaller VMs results in better workload scheduling by ESX and DRS, especially on smaller ESX hosts.
Of course, this also plays nicely into the other points, because it’s very liberating to be able to do what we call “rolling maintenance” on a service, taking just one machine down at a time so that customers are not impacted. It also means that system administrator quality of life goes up, because now we can do maintenance tasks during the day instead of on weekends and off-hours. Doing maintenance during business hours has a few benefits. First, it means that the maintenance will actually get done. If you try to use someone’s personal time to do work, they tend to opt out of that work: servers go unpatched, tuning doesn’t happen, and lots of things that should get done don’t, because people will choose their personal time over work. Second, it means that if something goes wrong there are others around to help out. Doing work at 5 AM on a Sunday is no fun, and if things go sideways you have to wake someone up or try fixing it yourself. Doing work during the day means you have the rest of the team around to lend a hand.
Third, it gives you a way to make incremental changes and then watch the effects. This has been particularly awesome for performance tuning of applications and our virtual environments themselves. Testing tuning changes is often hard, because test suites and test load generators are synthetic and often don’t compare to real load. But because the load is spread out we can make a change to one VM, or one ESX host servicing one VM, and keep an eye on it. I’m not advocating being a complete cowboy — you still have to do testing — but the risks to your production environment are a lot lower if you can catch problems on one VM first.
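If “rolling maintenance” sounds abstract, here’s roughly what the loop looks like in practice. This is a sketch, not F5-specific code: the pool-member functions are stand-ins for whatever your load balancer’s API or CLI actually exposes, and the hostnames are made up.

```python
import time

POOL = ["web-01", "web-02", "web-03"]  # hypothetical VMs behind the load balancer

# The four functions below are placeholders for your load balancer's API or CLI
# and your own maintenance tooling; they just print so the sketch runs end to end.
def disable_member(host):
    print(f"draining {host} on the load balancer")

def enable_member(host):
    print(f"re-enabling {host} on the load balancer")

def do_maintenance(host):
    print(f"patching/tuning/rebooting {host}")

def is_healthy(host):
    print(f"checking the health monitor for {host}")
    return True

def rolling_maintenance(pool, drain_seconds=60):
    """Take one member down at a time so the service itself never goes dark."""
    for host in pool:
        disable_member(host)          # stop sending new connections to this member
        time.sleep(drain_seconds)     # let existing connections finish
        do_maintenance(host)
        enable_member(host)
        while not is_healthy(host):   # don't move on until this one is serving again
            time.sleep(10)

rolling_maintenance(POOL, drain_seconds=5)  # short drain time just for the demo
```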
There are usually some other benefits to load balancers, too, that make them virtualization-friendly. Many will offload SSL processing, so your VMs have less work to do. Others have features, like iRules in F5’s products, that let you rewrite network traffic on the fly, which has some really neat implications for security, monitoring, and service delivery. And if you don’t want to buy a piece of hardware you can often get a virtual appliance from these vendors, though the physical appliances are usually a lot faster.
3. Commitment to operating system and application patching.
It is a fundamental belief of mine that one of the best ways to stay secure is to keep up on your patching. My organization agrees, and by using load balancers and defining maintenance windows we’ve made it easy for ourselves to keep our hosts up to date with regular patching cycles. Because we can take servers down without taking services down, and because sysadmins know exactly when a server can come down, we can schedule maintenance cycles easily, whether they’re six months out or two weeks out. We can also respond very rapidly to emergency situations, like recent remote code execution vulnerabilities in Microsoft Windows, by rolling patches out to development & test hosts, then QA & production, over the course of just two days if needed.
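The sequencing is the part worth writing down: lower environments first, production last, and a stop if an earlier phase has trouble. That gate between phases is my own addition to make the idea concrete; the host groups and the patch function below are made up for illustration.

```python
# Phased rollout: dev & test first, then QA & production.
# Host names and patch_host() are stand-ins, not real tooling.
PHASES = [
    ("dev/test",      ["dev-01", "test-01", "test-02"]),
    ("QA/production", ["qa-01", "prod-01", "prod-02"]),
]

def patch_host(host):
    print(f"patching {host}")
    return True  # pretend the patch applied cleanly

for phase, hosts in PHASES:
    print(f"--- {phase} ---")
    results = [patch_host(h) for h in hosts]
    if not all(results):
        # a failure in an earlier phase stops the rollout before it reaches production
        print(f"stopping: failures in {phase}")
        break
```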
Keeping up to date with patches not only keeps you secure, it also lets you take advantage of new features that are added to operating systems. For example, Red Hat keeps adding new virtualization-friendly features, like the kernel interrupt clock divider. Because that’s a boot-time kernel parameter you can’t just change it on the fly; it takes a reboot. And if you have to reboot but can never get a time to do it, it won’t get done. We just rolled the change into one of our patching cycles and reduced the load on our infrastructure dramatically. That means more VMs per physical host, and a quantifiable amount of savings from just a small change on each machine.
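For the Red Hat example I have in mind the RHEL 5-era divider= boot parameter, which cuts the timer interrupt rate on virtual machines. Here’s a small sketch of how you might audit a running Linux VM for it before queuing the grub change for the next patching cycle; it just reads /proc/cmdline, and the parameter name is the only distro-specific assumption.

```python
# Quick audit: is the timer interrupt divider set on this VM's running kernel?
# Assumes the RHEL 5-style "divider=" boot parameter.
def divider_setting(cmdline_path="/proc/cmdline"):
    with open(cmdline_path) as f:
        args = f.read().split()
    for arg in args:
        if arg.startswith("divider="):
            return arg.split("=", 1)[1]
    return None

value = divider_setting()
if value:
    print(f"divider={value} already set; this VM fires fewer timer interrupts")
else:
    print("divider not set; queue a grub change and reboot in the next patching cycle")
```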
Our commitment to patching also extends to the virtual infrastructure itself, and we have a rule that we will not implement anything that breaks vMotion or Storage vMotion. Why? Because then it becomes very difficult to cope with ESX updates, hardware failures, or any situation where vMotion could be used to prevent an outage. Sure, this means that we still need physical hardware for some applications, but it’s still just a fraction of the hardware we were buying years ago. This also makes the virtual infrastructure easy to upgrade when the time comes for new versions of vSphere, new storage arrays, and new physical hosts. Instead of planning outages on hundreds of VMs we just vMotion them, and nobody is the wiser.
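That “nothing that breaks vMotion” rule is easy to turn into a report. Below is a sketch of what such a compliance check might look like; the inventory export format is made up, and the list of offenders (local storage, CPU affinity, host-attached devices) is my own shortlist of common culprits, not an official VMware checklist.

```python
# Sketch of a compliance report for the "don't break vMotion" rule.
# The inventory records are a hypothetical export, not pyVmomi objects.
INVENTORY = [
    {"name": "web-01", "datastore_shared": True,  "cpu_affinity": False, "host_devices": []},
    {"name": "db-02",  "datastore_shared": False, "cpu_affinity": False, "host_devices": []},
    {"name": "app-03", "datastore_shared": True,  "cpu_affinity": True,  "host_devices": ["USB"]},
]

def vmotion_blockers(vm):
    blockers = []
    if not vm["datastore_shared"]:
        blockers.append("VM lives on local (non-shared) storage")
    if vm["cpu_affinity"]:
        blockers.append("CPU affinity is set")
    if vm["host_devices"]:
        blockers.append("devices attached directly to the host: " + ", ".join(vm["host_devices"]))
    return blockers

for vm in INVENTORY:
    for b in vmotion_blockers(vm):
        print(f"{vm['name']}: {b}")
```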
Disclosure: F5 is a sponsor of Gestalt IT Tech Field Day, of which I have been a participant. I am not a customer of F5 at this time, though.
Hear, hear for redundancy in IT. In most situations it makes everyone’s life better.
Yes, but there is such a thing as too much redundancy! The trick is to use just the right amount, spending only as much money as you need to in order to get things done.