Category: System Administration

Three Organizational Decisions That Help Me Virtualize »

Over the last ten years my organization has come a long way with its IT policies and processes. We’ve gone from the wild, wild west of IT where personal heroism ruled the day, to a place where there’s just enough process to make sure that communication happens correctly and things like our Configuration Management Database (CMDB) stay up to date. It’s been a lot of work, but I am actually really proud of where we’re at.

There are three fundamental decisions we made a long time ago that, had they not been made, would have drastically changed how virtualization has proceeded here.

1. Clearly defined maintenance windows.

Knowing exactly when someone can do maintenance on a server has been crucial to getting things done in our virtualization environment. There are many adjustments you can and should make in virtual environments, but if you can’t ever take the VMs down to make the changes, you’re stuck. We’ve been able to do physical-to-virtual migrations, performance tuning, VMware Tools upgrades, vSphere upgrades, and a whole slew of other things in relatively short timeframes because we have this all worked out already. This also lets us “right-size” our VMs: rather than deploying huge VMs just in case they need the CPU or RAM, we deploy smaller ones and then take an outage to add CPUs and RAM if we need to. The maintenance window for a server is negotiated between the application/service admins and the system administrators when a machine is put into production, and we track it in our CMDB. Any member of the team supporting the service can use the maintenance window, as long as they follow some rules about notifications for the change (timeframes, etc.).

2. Use of load-balancing technologies.

We use application load balancers (operating at layer 4 of the OSI model and up) to decouple services from individual servers. Not only does this allow us to take a host down without affecting a service, but it also lets us spread the load out more among the physical hosts we have in our virtual infrastructure. In a lot of cases having more, smaller VMs results in better workload scheduling by ESX and DRS, especially on smaller ESX hosts.

Of course, this also plays nicely into the other points, because it’s very liberating to be able to do what we call “rolling maintenance” on a service, just taking one machine down at a time so that customers are not impacted. It also means that system administrator quality of life goes up, for now we can do maintenance tasks during the day instead of on weekends and off-hours. Doing maintenance during business hours has a couple of benefits. First, it means that the maintenance will actually get done. If you try to use someone’s personal time to do work they tend to opt out of that work. Servers go unpatched, tuning doesn’t happen, lots of things that should get done don’t because people will choose their personal time over work. Second, it means that if something goes wrong there are others around to help out. Doing work at 5 AM on a Sunday is fun, but if things go sideways you have to wake someone up or try fixing it yourself. Doing work during the day means you have the rest of the team around to lend a hand.

Third, it gives you a way to make incremental changes and then watch the effects. This has been particularly awesome for performance tuning of applications and our virtual environments themselves. Testing tuning changes is often hard, because test suites and test load generators are synthetic and often don’t compare to real load. But because the load is spread out we can make a change to one VM, or one ESX host servicing one VM, and keep an eye on it. I’m not advocating being a complete cowboy — you still have to do testing — but the risks to your production environment are a lot lower if you can catch problems on one VM first.
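The “rolling maintenance” idea can be sketched in a few lines. This is only an illustration, not how any particular load balancer works: the pool is modeled as a plain dictionary, and `maintain` and `is_healthy` are hypothetical callables standing in for whatever patching and health-check steps you actually use.

```python
def rolling_maintenance(pool, maintain, is_healthy):
    """Take servers out of rotation one at a time, maintain them, and put them back.

    pool        -- dict mapping server name -> True if it is in rotation
    maintain    -- hypothetical callable that patches/tunes one server
    is_healthy  -- hypothetical callable that returns True if the server is fit to serve

    Returns (servers_completed, failed_server_or_None).
    """
    completed = []
    for server in list(pool):
        # Refuse to drain the last server that is still carrying traffic.
        others = [name for name, active in pool.items() if active and name != server]
        if not others:
            raise RuntimeError(f"draining {server} would take the service down")

        pool[server] = False          # drain: stop sending traffic here
        maintain(server)

        if not is_healthy(server):
            # Leave the server out of rotation and stop, so the problem
            # is caught on one machine instead of spreading to the rest.
            return completed, server

        pool[server] = True           # back into rotation
        completed.append(server)
    return completed, None
```

Stopping at the first unhealthy server is the design choice that matters here: it is what makes a change to one VM a low-risk way to watch the effects before touching the others.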

There are usually some other benefits to load balancers, too, that make them virtualization-friendly. Many will offload SSL processing, so your VMs have less work to do. Others have features, like iRules in F5’s products, that let you rewrite network traffic on the fly, which has some really neat implications for security, monitoring, and service delivery. And if you don’t want to buy a piece of hardware you can often get a virtual appliance from these vendors, though the physical appliances are usually a lot faster.

3. Commitment to operating system and application patching.

It is a fundamental belief of mine that one of the best ways to stay secure is to keep up on your patching. My organization agrees, and by using load balancers and defining maintenance windows we’ve made it easy for ourselves to keep our hosts up to date with regular patching cycles. Because we can take servers down without taking services down, and because sysadmins know exactly when a server can come down, we can schedule maintenance cycles easily, whether it’s six months out or two weeks. We can also respond very rapidly to emergency situations, like recent remote execution vulnerabilities in Microsoft Windows, by rolling patches out to development & test hosts, then QA & production, over the course of just two days if needed.

Keeping up to date with patches not only keeps you secure, it also lets you take advantage of new features that are added to operating systems. For example, Red Hat keeps adding new virtualization-friendly features, like kernel interrupt clock dividers. Because the divider is a kernel parameter, you can’t just change it on the fly. And if you have to reboot but can’t get a time to do it, you won’t do it. We just rolled the change into one of our patching cycles and reduced the load on our infrastructure dramatically. That means more VMs per physical host, and a quantifiable amount of savings from just a small change on each machine.
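To get a feel for why a clock divider matters, here is a rough back-of-the-envelope sketch. The numbers are assumptions for illustration, not measurements from this environment: a guest kernel ticking at 1000 Hz, a divider of 10, and 40 two-vCPU VMs on one host. It also simplifies by treating timer ticks as one stream per vCPU.

```python
def timer_interrupts_per_second(base_tick_hz, divider, vcpus):
    """Timer interrupts one VM asks the host to deliver each second."""
    if divider < 1:
        raise ValueError("divider must be at least 1")
    return (base_tick_hz // divider) * vcpus


def host_interrupt_load(vm_vcpu_counts, base_tick_hz=1000, divider=1):
    """Total timer interrupts per second across every VM on a host."""
    return sum(
        timer_interrupts_per_second(base_tick_hz, divider, vcpus)
        for vcpus in vm_vcpu_counts
    )


# Hypothetical host: 40 VMs with 2 vCPUs each.
vms = [2] * 40
print("without divider:", host_interrupt_load(vms))
print("with divider=10:", host_interrupt_load(vms, divider=10))
```

Under those assumptions the host handles a tenth of the timer interrupts after the change, which is work it no longer has to schedule even when the guests are idle.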

Furthermore, our commitment to patching also extends to the virtual infrastructure itself, and we have a rule that we will not implement anything that breaks vMotion or Storage vMotion. Why? Because then it becomes very difficult to cope with ESX updates, or hardware failures, or any situation where vMotion could be used to prevent an outage. Sure, this means that we still need physical hardware for some applications, but it’s still just a fraction of the hardware we were buying years ago. This also makes virtual infrastructure easy to upgrade when the time comes, for new versions of vSphere, new storage arrays, and new physical hosts. Instead of planning outages on hundreds of VMs we just vMotion them, and nobody is the wiser.

Disclosure: F5 is a sponsor of Gestalt IT Tech Field Day, of which I have been a participant. I am not a customer of F5 at this time, though.

Happy System Administrator Appreciation Day! »

The Wisconsin DMV sent me my gift a day early:

And it was a present — I needed replacement plates but hadn’t ordered them yet. I’m glad I didn’t!

I often joke that I haven’t come up with an original solution to anything in years, thanks to all the other sysadmins out there who share their solutions, knowledge, and time in order to make the world better. Thank you all for everything you do!

Rate-Limiting Steps »

In the last month I’ve added quite a few blogs to my reading list. One new one is “Movin’ Meat,” written by an ER doctor out of the Pacific Northwest. Besides just being interesting, some of his blog posts support my theory that IT folks can often learn things from people in other fields. The post from June 25, 2010, part four of his “Advice for Interns,” is one of these cases. When you read it (link is at the end because I want to get to my actual point before you leave to read it), I think substituting “customer/system” for “patient” in his list works nicely.

My real point is this: one thing in his list really stood out for me. It’s something that seems really obvious when it’s said, but is also done wrong a lot:

“Determine the rate-limiting step and make it priority #1 in the work-up”

Figuring out what the slowest step in a project is going to be and getting to work on it right away is often key to getting a project done in a timely fashion, especially if a large chunk of that time will be spent waiting for something. When you know it’s going to take six weeks for a request to make it through your purchasing department you should start that right away, especially since all you’ll be doing is waiting.

As kids we were told by our teachers to read all the instructions first, then start working on whatever we were doing. Determining the rate-limiting steps is the same sort of thing. By taking a few minutes at the beginning to look at the whole project first, rather than just starting on step #1 and going one by one until you’re done, you can often optimize things so that the longest parts of the project are done in parallel with the rest.
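The arithmetic behind that is simple enough to sketch. Apart from the six-week purchasing wait, the task names and durations below are made up, and the sketch assumes the other tasks do not depend on the slow one finishing first.

```python
def find_rate_limiting_step(tasks):
    """Return the name of the longest task (durations in weeks)."""
    return max(tasks, key=tasks.get)


def serial_duration(tasks):
    """Total time if you do step 1, then step 2, and so on."""
    return sum(tasks.values())


def duration_with_early_start(tasks):
    """Total time if the longest task is kicked off first and
    everything else is done while waiting on it."""
    slowest = find_rate_limiting_step(tasks)
    everything_else = sum(
        weeks for name, weeks in tasks.items() if name != slowest
    )
    return max(tasks[slowest], everything_else)


# Hypothetical project plan, in weeks.
project = {
    "purchasing approval": 6,
    "design": 2,
    "rack and cable prep": 1,
    "test plan": 1,
}
print("rate-limiting step:", find_rate_limiting_step(project))
print("one step at a time:", serial_duration(project), "weeks")
print("slow step first:   ", duration_with_early_start(project), "weeks")
```

In this made-up plan, starting the purchase first takes the project from ten weeks to six without anybody working harder.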

Links:

- Movin’ Meat: Friday Flashback – Advice for Interns Part Four

Why No-Reply Email Is A Bad Idea »

I absolutely hate no-reply email. I understand why it exists (autoresponders and bounces), but to send an email with no way to respond at all using the same communications medium is ridiculous.

A good example of this is the customer satisfaction survey Red Hat just sent me. It is from a no-reply email address and there is no other email address listed. There is just some text and a URL, and clicking on the URL gets me:

rhapps.redhat.com not found

$ dig rhapps.redhat.com ns1.redhat.com
[...snip...]
;; QUESTION SECTION:
;rhapps.redhat.com.             IN      A

A records are overrated.

I generally am a nice guy and let vendors know something is messed up, but there are limits, especially when I’m already on the fence about a negative experience. I’m not going to open a support case with them, because it’ll never get escalated correctly[0]. And there’s no email address to send a quick note to. So it goes unfixed, Red Hat gets added to my mental list of vendors that don’t get it, and I blog about it, which may be worse than a negative survey response.

And for heaven’s sake, if you send out a customer satisfaction survey make sure it works.[1]

——————–

[0] Here’s a test for your organization: can a customer open a support problem against your web site? Will it go to the right place, i.e. the webmasters or someone intelligent who can get things fixed? If not, why not?

[1] This also may mean you should add an external survey service to what you monitor.
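A monitoring check for this kind of failure does not need to be elaborate. The sketch below only asks whether a hostname resolves at all, using Python’s standard `socket.getaddrinfo`; the `resolver` parameter is a convenience I have added so the check can be exercised without touching the network, and a real monitor would go on to fetch the page as well.

```python
import socket


def hostname_resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves to at least one address.

    resolver defaults to socket.getaddrinfo; pass a stand-in callable
    with the same (host, port) signature to test without a network.
    """
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        # Name lookup failed: no A/AAAA record, or no usable DNS answer.
        return False
```

Pointed at the hostname in a survey link before the email goes out, a check like `hostname_resolves("rhapps.redhat.com")` would have flagged the problem described above.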

More Equipment Means More To Go Wrong »

“Everything should be made as simple as possible, but not simpler.” – Albert Einstein

Over the last couple of years my organization has been building out an alternate site, for disaster recovery and business continuity purposes. I’ve noticed a disturbing trend, both among my own coworkers and others who are starting to think about DR & BC: the belief that by having multiple data centers, multiple servers, or multiple cloud vendors they’ll reduce the number of problems they’ll have.

From a system administrator’s point of view that idea is absolutely false.

Every piece of equipment you have can fail, and the more equipment you possess the more likely a failure will be. The more servers, switches, chillers, PDUs, and power grids you have the more likely it will be that one dies. Heck, that’s even true of components within the servers. As I add more servers to my own environment, each with mirrored disks and tens of DIMMs, I spend more time replacing failed drives, failed DIMMs, updating firmware, etc.
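A quick calculation makes the point. Assuming each device fails independently with the same probability over some period (a simplification, and the 3% figure below is made up), the chance that at least one of n devices fails is 1 - (1 - p)^n:

```python
def chance_of_any_failure(per_device_failure_rate, device_count):
    """Probability that at least one of device_count devices fails,
    assuming independent failures with the same per-device probability."""
    if not 0.0 <= per_device_failure_rate <= 1.0:
        raise ValueError("failure rate must be between 0 and 1")
    if device_count < 0:
        raise ValueError("device count cannot be negative")
    return 1.0 - (1.0 - per_device_failure_rate) ** device_count


# Hypothetical 3% chance that any single device fails in a year.
for count in (1, 10, 100):
    print(count, "devices:", round(chance_of_any_failure(0.03, count), 3))
```

With that assumed rate the odds of replacing something in a given year go from roughly 3% with one device to about 26% with ten and about 95% with a hundred.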

Alternate data centers also add problems beyond mechanical failure. Your second data center is probably attached to a different power grid, so now you have two power utilities to worry about, and are subject to twice as many outages. You also have to maintain the equipment in that other data center, change the air filters on the air handlers, etc. And it isn’t twice as much work as having one data center, it’s three or four times as much when you count the overhead of traveling to a remote site, not having tools available, etc.

Having alternate sites, and multiple servers, can certainly improve application reliability by keeping services available to customers in case something happens. It’s a good bet that a power problem in San Jose won’t affect your site in Omaha, for example. But it absolutely increases your odds of having equipment failure, and while your customers might be happy your operations folks won’t be.

More equipment means more that can go wrong. So what do you do? Keep it simple. Have fewer, bigger machines. Have machines with fewer moving parts. Use a cloud provider with multiple sites. Use VMware DPM and keep your alternate site machines in standby most of the time, and the HVAC off. There are lots of ideas, but remember that the more you have, the more that will fail, and those failures mean time spent not moving forward.

Levels of Indirection »

“All problems in computer science can be solved by another level of indirection…
Except for the problem of too many layers of indirection.”

- David Wheeler, though often attributed to Butler Lampson, who has some great quotes, too:

“When in doubt, use brute force.”
“In handling resources, strive to avoid disaster rather than to attain an optimum.”

Lots of good stuff if you read his “Hints for Computer System Design.”

Midnight is Always Tomorrow »

“So, are you ready for the big power outage on Sunday?” a colleague asks on Thursday.

“You mean Saturday.”

“No… Sunday morning.”

“Um, I was told two months ago, and countless times between, that the outage is on Saturday, midnight to 8 AM, and they were starting to shut things down at 10 PM.”

“It’s Sunday, midnight to 8 AM. They’re going to start shutting things down on Saturday at 10 PM.”

“Did they move the outage?”

“No, I bet they were just telling you when things were going to start. On Saturday.”

Midnight is 00:00, meaning the start of a new day. Always.

If you’re in doubt, use 00:01. Assume everybody is clueless about time, because they are. For example, a lot of people think in terms of when they go to sleep, not what actual time it is, so if they’re still up at 0200 on Sunday they consider it to be Saturday. While that’s wrong, and makes visions of their painful, torturous death flash in your mind, it’s a fact of life. Deal with it.

Be precise. Use 24-hour time, because there is no AM/PM question. 24-hour time runs between 0000 and 2359 on any given day. There is no 2400[0].

Last, all times should be accompanied by days, and vice-versa. It’s like units in science classes. You didn’t just write “1.67,” you wrote “1.67 meters.” It isn’t “0800,” it is always “0800 on 4/18/2010.” Times are useless without dates. And if your team or customers are not all in the same time zone, and they rarely are[1], you need that information, too.

“The system shutdowns will commence at 2200 on 4/17/2010, the power will be disconnected at 0000 on 4/18/2010, and power-ups will occur again at 0800 on 4/18/2010. All times are in CDT (-0500).”
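If you generate announcements like this from a script, the same rules are easy to enforce. This is a small sketch using Python’s standard `datetime` module; the fixed -0500 offset matches the CDT example above, while the output format (weekday plus an ISO-style date) is my own choice rather than anything prescribed here.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zone for the example; the announcement above uses CDT (-0500).
CDT = timezone(timedelta(hours=-5), "CDT")


def describe(moment):
    """Format a time with its weekday, date, and zone, or refuse to."""
    if moment.tzinfo is None:
        raise ValueError("a time without a time zone is ambiguous")
    return moment.strftime("%H%M on %A %Y-%m-%d %Z (%z)")


shutdown_start = datetime(2010, 4, 17, 22, 0, tzinfo=CDT)
power_off = shutdown_start + timedelta(hours=2)   # midnight lands on the next day
power_on = power_off + timedelta(hours=8)

for label, moment in (
    ("Shutdowns begin", shutdown_start),
    ("Power disconnected", power_off),
    ("Power restored", power_on),
):
    print(f"{label}: {describe(moment)}")
```

Spelling out the weekday is redundant on purpose: it is the piece that would have settled the Saturday-versus-Sunday argument.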

———————

[0] Yes, I am aware there are sometimes leap seconds, which get added to the end of a day, thus causing a 23:59:60. 99.99%+ of all outage planning does not need to take this into account.

[1] And even if they are, it doesn’t hurt to add that information.

What are P-states and how do I use them in vSphere? »

VMware vSphere 4 added the ability to take advantage of Intel SpeedStep and AMD PowerNow! CPU power management features. These features are commonly known as “Dynamic Voltage and Frequency Scaling” or DVFS, and let an OS cooperate with the CPU to reduce power consumption by reducing the frequency of the CPU and the voltage at which it is operating. It reduces these things in preset tiers, and these tiers are known as P-states. On Intel CPUs they are trademarked as “SpeedStep” and on AMD they are either “Cool’n’Quiet” or “PowerNow!”

The Wikipedia article on Intel SpeedStep points out that “power consumed by a CPU with a capacitance of C, running at voltage V, and frequency f is approximately P = CV²f.” This means if you can reduce the voltage to the CPU the power needs drop in a non-linear fashion. Furthermore, many electronic components run more efficiently at lower temperatures, and since consuming less power means less heat generated you end up seeing efficiency gains within the host as well as reduced load on data center cooling. This results in an overall reduced power bill, and potential savings in related systems like a UPS, generators, etc.
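Because the voltage term is squared, small reductions add up quickly. Here is a tiny sketch of the formula; the 90% voltage and 80% frequency figures are invented for illustration and are not the P-state values of any real CPU.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Approximate dynamic CPU power: capacitance times voltage squared times frequency."""
    return capacitance * voltage ** 2 * frequency


def relative_power(voltage_ratio, frequency_ratio):
    """Power in a lower P-state as a fraction of full power.

    The capacitance cancels out, so only the ratios matter.
    """
    return voltage_ratio ** 2 * frequency_ratio


# Hypothetical P-state: 90% of full voltage, 80% of full frequency.
print(round(relative_power(0.9, 0.8), 3))
```

In that made-up state the CPU gives up 20% of its clock speed but draws only about 65% of the power.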

Frequency and voltage in a CPU are correlated. So are instructions per second and frequency. Basically, if you want your CPU to get more work done per second you need to increase the frequency it runs at, and to do that you need to increase the voltage. So why would you want to turn the CPU’s performance down in the first place? The thing is, CPUs are much faster than everything else in a computer system. If the CPU needs data for an operation it’ll look in cache. L1 cache operates at the CPU speed: fast but small. L2 cache operates at a fraction of the CPU speed, but still many times faster than RAM[1]. The problem is when the CPU needs data that isn’t found in cache and has to go to RAM or disk. Going to RAM means it’ll wait for hundreds of clock cycles before the data is returned, because RAM is much slower than the CPU. Going to disk or network means waiting for millions of clock cycles, which is an eternity to a CPU. So while the system may be busy, the CPU might actually be idle, and that’s a great time to stop using power and generating heat.
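To put numbers on the waiting, you can convert a latency into clock cycles. The latencies and the 2.4 GHz clock below are round-number assumptions, not measurements; real figures vary a lot by hardware.

```python
def cycles_spent_waiting(latency_ns, clock_ghz):
    """Clock cycles that pass while the CPU waits latency_ns nanoseconds.

    1 GHz is one cycle per nanosecond, so the product is a cycle count.
    """
    return round(latency_ns * clock_ghz)


# Assumed, illustrative latencies in nanoseconds.
ASSUMED_LATENCIES_NS = {
    "RAM access": 100,
    "disk seek": 5_000_000,   # 5 milliseconds
}

for what, latency_ns in ASSUMED_LATENCIES_NS.items():
    print(what, cycles_spent_waiting(latency_ns, clock_ghz=2.4), "cycles")
```

Swap in measured latencies for your own hardware; the gap between a memory access and a disk access stays enormous either way.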

When one process is doing I/O like that it’s also a good time for the hypervisor in vSphere (or scheduler in a regular OS) to run something else. That “something else” might not need the full performance of the CPU, either, and the frequency & voltage of the CPU can be decreased to save power in that case, too.

Given that all this trouble has been taken to add this feature to hardware and software, how do you turn it on?

1. Make sure your CPUs have this feature. According to VMware vCenter, under Configuration->Processors, my sample Dell PowerEdge R610 has Intel E5530 CPUs. I can check that by looking at Intel’s product web site, ark.intel.com, under “Xeon” processors.

2. If, in vCenter, under Configuration->Processors it has something like “Enhanced Intel SpeedStep” listed by “Power Management Technology” then you can proceed to step 3. If it says “Not Available” or something else you may need to set your BIOS to allow operating system control of the power management. On my Dell PowerEdge R610 the option is under Power Management. Set it to “OS Control” as:

[Screenshot: Dell R610 BIOS 1.3.6 - Power Management]

On some older models, like the PowerEdge R900, it’s in the CPU options and called “Demand-Based Power Management.”

3. Go back into vCenter. By now the Power Management Technology should be populated with something other than “Not Available” (if that isn’t the case then check with your hardware vendor). If that’s set, go to Configuration->Advanced Settings, then Power, and change Power.CpuPolicy to “dynamic.”

[Screenshot: vSphere Advanced Settings - Power]

4. Say OK and you’re set.

I’ve added this to my checklist for bringing a new ESX host online, and now that I’ve got it enabled I’m watching the power consumption a lot more closely. Can I tell a difference? Hard to say right now, as I don’t have enough new data for my small clusters. It still doesn’t replace Dynamic Power Management (DPM), because if you genuinely don’t need the capacity of a host, shutting it completely off makes the most sense. But in the effort to be greener, every little bit helps, and it’s easy to enable.

As always, if I’ve made a mistake or you’d like to add relevant information just make a comment below. I read all my comments!

——————–

[1] This is why larger L1 & L2 caches are better, why prefetchers exist (to try prepopulating the caches with data the CPU might need), why architectures like Intel’s Nehalem add L3 caches that are shared among the cores, and why hypervisors try to schedule the same process on the same CPUs when they can (CPU affinity increases the chance that useful data is still in the caches). It’s all a big effort to keep the CPUs from waiting.