Downtime

(Matt over at Standalone Sysadmin posted a thought the other day about downtime, which coincided nicely with an explanation I ended up writing for a customer about their downtime requirements. Since I had it written up anyhow I figured I’d post it here.)

Downtime is usually discussed in terms of the number of 9’s in the availability percentage. Four 9’s, for example, means 99.99% availability, which translates to 52.56 minutes of downtime a year (ignoring leap years). It breaks down like this:

98% is 10,512 minutes of downtime.
99% is 5,256 minutes of downtime.
99.9% is 525.6 minutes of downtime.
99.99% is 52.56 minutes of downtime.
99.999% is 5.26 minutes of downtime.
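
If you want to run the numbers for other percentages, the arithmetic is simple enough to script. A quick back-of-the-envelope sketch in Python:

```python
# Annual downtime allowed by a given availability percentage.
# Uses a 365-day year (ignoring leap years), matching the figures above.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for pct in (98, 99, 99.9, 99.99, 99.999):
    print(f"{pct}% available -> {downtime_minutes_per_year(pct):.2f} minutes of downtime a year")
```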

By default all customers want five 9’s or better. “We can never be down,” they’ll say. What they don’t understand is how expensive it gets to actually deliver that sort of uptime. It goes something like this:

Everything up to three 9’s (99.9%) is quite achievable with a single server on a single network switch, etc. In fact, my home computers are more than 99.9% available. With a good hardware warranty, three 9’s is no big deal. The same goes for virtual machines, where you can take a snapshot, do your work, and still have time to revert everything if there are problems.

Once you get to four 9’s you start thinking about redundancy. You might build two servers and use some simple failover, often manual, to keep service outages to a minimum. Maybe you’ll just move a service IP around between two machines, or get a single load-balancing switch. Life still isn’t too bad because you can get a lot done in 53 minutes a year if you use your head.

Five nines gets much harder, because, frankly, there isn’t a lot you can accomplish in 5 minutes. It isn’t enough to have simple, manual failover for planned maintenance. You need to detect failures and route around them quickly, and have lots of redundancy so that failures you can’t actively route around don’t take your service out. So you need a load-balancing network switch, a proxy of some sort, or much more complex clustering software (for databases, as an example). If you get a load-balancing switch you’ll need two so it doesn’t become the single point of failure.

You’ll need N+1 servers behind those switches to cover the load, and each will need multiple NICs attached to multiple network switches. Ditto for storage if it’s external. You’ll need a generator in case the power goes out, and multiple ISP uplinks in case one ISP has trouble. Maybe even two physical locations. You’ll need someone monitoring the system, who is on call to fix things right away. Which also means you’ll need a good monitoring system. You’ll need training to operate all of this new software or hardware, or expensive consultants to set it up. On top of it all you’ll probably want to order everything twice, so you can build an identical test environment to vet changes, software updates, etc.
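
The reason all that redundancy helps is plain old probability: dependencies chained in series multiply their availabilities together, so every extra component drags the total down, while redundant copies in parallel only fail when every copy fails. A rough sketch of that math (assuming independent failures, which is never quite true in practice):

```python
def series(*availabilities):
    """A chain of dependencies is only up when every link is up."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

def parallel(availability, copies):
    """Redundant copies (assumed to fail independently) are down only when all copies are down."""
    return 1 - (1 - availability) ** copies

# A single 99.9% server behind a single 99.9% switch: about 99.8% overall.
print(f"{series(0.999, 0.999):.4%}")

# A redundant pair of servers behind a redundant pair of switches:
# roughly 99.9998% on paper, past five 9's, at least until correlated
# failures and maintenance windows show up in real life.
print(f"{series(parallel(0.999, 2), parallel(0.999, 2)):.4%}")
```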

See how it gets very complex, and simultaneously very expensive, very quickly?

Being realistic about one’s downtime requirements ends up saving vast quantities of cash, time, and effort.

9 thoughts on “Downtime”

  1. Bob,

    Thanks a lot for the link, and for covering this topic. I don’t have any experience at all when it comes to actually calculating the amount of uptime a system is capable of.

    Do you perform these sorts of calculations for your infrastructure?

  2. When we build a system we do some work up front to design it for the uptime we need, and we know what the maintenance needs of our network and SAN are, so we can plan for them. In fact, we pre-schedule those outages a year in advance so everybody can see them.

    For example, we know that if we attach a server to certain enterprise storage arrays of ours they will have to be able to endure a four hour outage on two Sunday mornings a year.

    However, our infrastructure does have a few single points of failure in it, including non-redundant load balancers. My network engineers are working on it, but again, it’s a big tradeoff of price vs. return on investment.

  3. One of my rules of thumb: if you add a nine to the availability requirements, you add a zero to the end of the price tag (i.e. price = price x 10). That ends up not quite being true; you can probably go from three nines to four nines for about 3x the cost, not 10x. But it is easy to remember, and it discourages business units from overstating their requirements.

    As far as planned downtime and maintenance windows go, most of the SLAs that I’ve seen don’t count those against availability.

    Also, the whole topic gets really messy when you start considering degraded systems (up or down?), time of day, number of users affected, etc. We built a formula that factors in partial availability and the % of users who were affected, but I’m not sure that’s the right thing to do.
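
    Roughly, the idea is to weight each outage by the fraction of users it actually affected, then compute availability as usual. A sketch of the concept (just the idea, not our exact formula):

    ```python
    # Sketch only: "effective" downtime weighted by the share of users affected.
    PERIOD_MINUTES = 30 * 24 * 60  # e.g. a month

    def weighted_availability(outages):
        """outages: list of (duration_minutes, fraction_of_users_affected)."""
        effective_downtime = sum(d * f for d, f in outages)
        return 1 - effective_downtime / PERIOD_MINUTES

    # A 60-minute outage that hit 10% of users counts as 6 "effective" minutes.
    print(f"{weighted_availability([(60, 0.10)]):.4%}")
    ```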

    –Mike

  4. 98, 99, and 99.9% availability, expressed as hours of downtime per year:

    98: 175.2 hours
    99: 87.6 hours
    99.9: 8.76 hours

    It seems reasonable that most SMBs will want something between 99 and 99.9% availability, once it is explained that each additional 9 costs more money.

    Thank you, Tom

  5. Isn’t it true that scheduled downtime is not usually factored in when calculating historic availability trends? If you had a scheduled maintenance window for 6 hours every Saturday morning, that wouldn’t count at all towards your downtime calculations. That could also affect the minutes-per-9 calculations above.

    Of course, I could be totally wrong about this.
