More Equipment Means More To Go Wrong

“Everything should be made as simple as possible, but not simpler.” – Albert Einstein

Over the last couple of years my organization has been building out an alternate site for disaster recovery and business continuity purposes. I’ve noticed a disturbing trend, both among my own coworkers and others who are starting to think about DR & BC: the idea that having multiple data centers, multiple servers, or multiple cloud vendors will reduce the number of problems they have.

From a system administrator’s point of view that idea is absolutely false.

Every piece of equipment you have can fail, and the more equipment you possess the more likely a failure will be. The more servers, switches, chillers, PDUs, and power grids you have the more likely it will be that one dies. Heck, that’s even true of components within the servers. As I add more servers to my own environment, each with mirrored disks and tens of DIMMs, I spend more time replacing failed drives, failed DIMMs, updating firmware, etc.
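To put rough numbers on that, here is a minimal sketch in Python. It assumes every component fails independently with the same probability over a year, which real hardware won't do exactly, and the 3% annual failure rate is an illustrative figure I made up, not a measurement. The function names are mine as well.

```python
def p_any_failure(n: int, p: float) -> float:
    """Probability that at least one of n independent components fails,
    given each one fails with probability p over the period."""
    return 1.0 - (1.0 - p) ** n


def expected_failures(n: int, p: float) -> float:
    """Expected number of failed components over the same period."""
    return n * p


if __name__ == "__main__":
    annual_rate = 0.03  # assumed, illustrative failure rate per component
    for count in (1, 10, 100):
        print(
            f"{count:>3} components: "
            f"P(at least one fails) = {p_any_failure(count, annual_rate):.3f}, "
            f"expected failures = {expected_failures(count, annual_rate):.2f}"
        )
```

The chance of seeing at least one failure climbs toward certainty as the count grows, while the expected number of failures, which is what actually eats your time, grows in direct proportion to how much equipment you own.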

Alternate data centers also add problems beyond mechanical failure. Your second data center is probably attached to a different power grid, so now you have two power utilities to worry about and are exposed to roughly twice as many utility outages. You also have to maintain the equipment in that other data center, change the air filters on the air handlers, etc. And it isn’t twice as much work as having one data center; it’s three or four times as much once you count the overhead of traveling to a remote site, not having tools on hand, etc.

Having alternate sites, and multiple servers, can certainly improve application reliability by keeping services available to customers in case something happens. It’s a good bet that a power problem in San Jose won’t affect your site in Omaha, for example. But it absolutely increases your odds of having an equipment failure somewhere, and while your customers might be happy, your operations folks won’t be.
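Both halves of that tradeoff can be sketched with a little arithmetic. This Python snippet assumes sites fail independently and that failover between them is perfect, neither of which is guaranteed in practice. The 99% per-site availability and four incidents per site per year are made-up numbers for illustration, and the function names are hypothetical.

```python
def service_availability(site_availability: float, sites: int) -> float:
    """Chance that at least one of several independent sites is up,
    assuming the service can run from any single site."""
    return 1.0 - (1.0 - site_availability) ** sites


def expected_site_incidents(incidents_per_site_per_year: float, sites: int) -> float:
    """Expected number of incidents operations has to handle per year."""
    return incidents_per_site_per_year * sites


if __name__ == "__main__":
    per_site_availability = 0.99  # assumed, illustrative
    per_site_incidents = 4.0      # assumed, illustrative
    for site_count in (1, 2, 3):
        print(
            f"{site_count} site(s): "
            f"availability = {service_availability(per_site_availability, site_count):.6f}, "
            f"incidents/year = {expected_site_incidents(per_site_incidents, site_count):.0f}"
        )
```

Under these assumptions each added site improves what customers see, while the incident count that lands on the operations team keeps growing in step with the number of sites.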

More equipment means more that can go wrong. So what do you do? Keep it simple. Have fewer, bigger machines. Have machines with fewer moving parts. Use a cloud provider with multiple sites. Use VMware DPM and keep your alternate site machines in standby most of the time, and the HVAC off. There are lots of ideas, but remember that the more you have, the more that will fail, and those failures mean time spent not moving forward.

3 thoughts on “More Equipment Means More To Go Wrong”

Alex

July 1, 2010 at 11:06 AM

WTF man, it can be expensive and a PITA to deal with multiple redundant sites, but it sure beats an outage that can cost anywhere from hundreds to millions of dollars. The bean counters won’t like it, but it’s smart. There are lots of things that can mitigate travel to additional locations, like using a co-location datacenter to host your equipment and using remote administration tools such as VPNs, SSH, RDP, VNC, etc. Redundancy in IT is just smart, period. If management doesn’t want to pay, that’s their prerogative; you can only advise. But as a sysadmin I can’t believe you would look at this negatively.
Bob Plankers

July 1, 2010 at 11:33 AM

I’m pretty sure you don’t understand what I’m saying. Redundancy is smart, but there is such a thing as too much redundancy. If you have 100 HDDs you should expect roughly 100x as many drive failures over a given time period as you would with 1 HDD. This isn’t negotiable, it’s just math. All the time you spend resolving those failures is 100% wasted time, i.e. time not spent moving forward.

Every site you add, every machine you add, every technology you add increases the likelihood that something will go wrong and that you’ll waste a lot of time dealing with it. So keep your redundancy as simple as possible, with as few machines and technologies as possible, to be able to deliver the reliability you need.
Alex

July 2, 2010 at 9:25 AM

I would agree that too much of a good thing is too much. There is a point at which replicating systems is not worth it. But determining the appropriate amount of built-in redundancy should be part of your job. If an internal website goes down, a day of downtime may be acceptable, let’s say, whereas having your email server go down could result in thousands lost. We may not make the decisions, but we should be making recommendations to the decision makers.
