A few days ago I wrote an article on downtime. In the end it was an article on how big, complex, highly available systems get really expensive and hard to maintain.
The interesting thing is that highly available systems often end up having more component failures than simpler systems, simply because they have so many more components. Every component you add is one more thing that can fail.
Obviously the goal of a highly available, highly redundant system is to survive outages. It’s just that the likelihood of actually having a problem somewhere is much greater. Buy a second server with mirrored disks for load balancing and now you have four disks to worry about instead of two. Ditto for power supplies, fans, and anything else with a mean time between failures (MTBF) rating. Data centers aren’t immune, either. Spread your systems between two data centers for disaster recovery and you now have two power grids to worry about.
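To put rough numbers on the disk example, here is a minimal sketch. It assumes each disk fails independently and that every disk has the same annual failure rate. The 3% figure is a made-up illustration, not a vendor number, and the function name is mine.

```python
def p_any_failure(n_components: int, annual_failure_rate: float) -> float:
    """Chance that at least one of n independent components fails within a year."""
    return 1.0 - (1.0 - annual_failure_rate) ** n_components


# Hypothetical annual failure rate per disk (an assumption for illustration only).
ASSUMED_AFR = 0.03

if __name__ == "__main__":
    for disks in (2, 4):
        chance = p_any_failure(disks, ASSUMED_AFR)
        print(f"{disks} disks: {chance:.1%} chance of at least one disk failure per year")
```

Under those assumptions, going from two disks to four roughly doubles the odds that you will be replacing a disk this year. The mirroring still protects the data, but somebody still has to notice the failure, swap the part, and watch the rebuild.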
Like everything else in IT, it’s all about tradeoffs. A good trade is almost always one that takes complexity out of a system. In many cases it starts with reducing the number of components. Instead of buying 10 servers you might buy 5 servers that are twice as large. You get half the moving parts and half the software bugs (at least in the OS). Sometimes centralization is a way to make things simpler, too. Instead of having each server handle its own replication to a remote site, use a disk array with that feature so you have one device to monitor instead of 10.
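The 10-versus-5 trade can be sketched the same way, using MTBF to estimate how many failures a fleet should expect per year. The 100,000-hour MTBF is a hypothetical figure, and the sketch assumes the larger servers have the same MTBF as the smaller ones, which may not hold for real hardware.

```python
HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours


def expected_failures_per_year(n_units: int, mtbf_hours: float) -> float:
    """Expected failures per year across n identical units with the given MTBF."""
    return n_units * HOURS_PER_YEAR / mtbf_hours


# Hypothetical per-server MTBF (an assumption, not a measured or vendor figure).
ASSUMED_MTBF_HOURS = 100_000

if __name__ == "__main__":
    for servers in (10, 5):
        count = expected_failures_per_year(servers, ASSUMED_MTBF_HOURS)
        print(f"{servers} servers: about {count:.2f} expected failures per year")
```

With these numbers the 10-server design expects twice as many failure events as the 5-server design. Whether that is worth each individual failure taking out a bigger share of capacity is exactly the kind of tradeoff in question.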
Sometimes tradeoffs aren’t what they seem, and sometimes you just get caught by vendor shenanigans. For instance, my example of array-based replication is a good one, except EMC’s MirrorView replication software on our arrays seems to have caused more problems than it’s ever helped with. Likewise, the current trend of virtualization may save money and/or add features to your environment, but it also adds complexity.
Knowing where to make tradeoffs like these comes from experience and from thinking carefully, not just about how a system is set up but also about how maintainable it will be over the long term. Looking for places to reduce complexity in system designs is something every system administrator should be doing. Less is more in system administration, too.
Bob –
I spun around some similar thoughts a few months ago –
Here.
“When person-resources are constrained, highest availability is achieved when the system is designed with the minimum complexity necessary to meet availability requirements.”
–Mike