Reliability Isn't As Straightforward As It Seems

The concept of reliability isn’t nearly as straightforward as it seems. It also depends heavily on what you are protecting yourself against.

A good example of this is hard disks. You can protect yourself against a single drive failure by adding another disk and mirroring the pair. However, in doing so you add a controller that is now a point of failure. You also add a second disk that may fail in a way that disrupts both disks (freaking out the controller, hanging the SCSI bus, and so on).

Is it worth the additional risk? Sure, as long as the controller is way more reliable than the drives.
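
To make that concrete, here is a back-of-the-envelope sketch. The failure probabilities (5 percent for a disk, 0.5 percent for the controller) are made-up numbers for illustration, and disk failures are assumed to be independent; the point is only to show how the controller's reliability decides whether the mirror pays off.

```python
# Back-of-the-envelope availability math for a mirrored pair behind one
# shared controller. The failure probabilities are illustrative
# assumptions, not measured values.

p_disk = 0.05         # assumed chance a given disk fails during the period
p_controller = 0.005  # assumed chance the added controller fails

# Single disk, no mirror: data is unavailable if that disk fails.
single = p_disk

# Mirrored pair: data is unavailable if BOTH disks fail (assuming
# independent failures) OR if the shared controller fails.
both_disks = p_disk ** 2
mirror = 1 - (1 - both_disks) * (1 - p_controller)

print(f"Single disk failure probability:   {single:.4f}")
print(f"Mirrored pair failure probability: {mirror:.4f}")

# With these numbers the mirror wins (~0.0075 vs 0.05). But set
# p_controller = 0.05 -- a controller no more reliable than a drive --
# and the mirror comes out slightly WORSE than a single disk. The
# controller's reliability dominates the outcome.
```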

On the other hand, sometimes the best way to make a service reliable is to keep it as simple as possible.

An example of this is a group that wanted off-site replication of its data. They used storage-array-based replication software to mirror the data to another array, and they bought expensive equipment to extend the SAN to the remote location. They were a medium-sized shop and didn't have the people or the money to become experts in these technologies. The environment went from being a simple collection of servers to a poorly understood tangle of servers, storage, and networking equipment. As a result, they ended up with a lot of downtime, which looks bad when you bought all that gear to increase the availability of a service.

People say that availability depends heavily on how much money you invest. In general, I disagree. Without a clear idea of what you are protecting against, and without good training and system design to support the implementation, adding components to a simple system usually makes it less reliable.
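
To see why, treat each piece of gear the service depends on as a link in a chain: if every link has to work, per-component availabilities multiply. The sketch below uses assumed availability figures, and it deliberately models the extra replication gear as serial dependencies, which is roughly what happens when a misbehaving SAN extension or replication link can take the primary service down with it.

```python
# Illustration of why piling components onto a simple design can hurt:
# when the service needs every component in the chain to work, their
# availabilities multiply. All numbers are assumptions chosen only to
# show the shape of the effect.

simple_stack = [0.999, 0.999]                 # e.g. server + local storage
complex_stack = [0.999, 0.999,                # same server + storage
                 0.999, 0.999, 0.995, 0.99]   # plus SAN switches, WAN link,
                                              # and replication software the
                                              # team doesn't fully understand

def series_availability(components):
    """Availability of a chain where every component must be up."""
    avail = 1.0
    for a in components:
        avail *= a
    return avail

print(f"Simple design:  {series_availability(simple_stack):.4%}")
print(f"Complex design: {series_availability(complex_stack):.4%}")

# The extra gear only pays off if the failure mode it removes is more
# likely than everything it adds to the chain.
```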