“Service X will be unavailable between 00:00 CST January 4th and 12:00 CST January 5th. We apologize for the inconvenience.”
I hate outage notifications that don’t actually say anything. Sure, I’d rather know ahead of time that something is going down, but there are several other qualities that I think any communication like this should have. The key is to put yourself in the customer’s shoes. What would you like to know if you were them?
Why is this happening? Tell people about the problem you’re having or fixing, and what you’re doing to fix it. Chances are people have noticed the problem already so saying nothing about it seems pretty pointless. Acknowledge the problem and take ownership of it.
If you’re doing preventative maintenance say so. Users understand and appreciate the idea of “an ounce of prevention is worth a pound of cure.” Except, of course, if you’ve made a very poor choice of when to schedule the preventative maintenance…
Exactly how does it affect the users? Have you ever sent out a note warning of an outage and then had someone call you to report the problem? Chances are you didn’t speak in a language the user understood. Lots of people don’t care if there’s a storage problem on a file server, but they do care if they lose what they’re working on. Tell people what they will notice during the outage. Say things like “Your workstation’s drive S: will be unavailable” or “The inventory lookup screens will be quite slow.”
Who is doing the work? You can’t take ownership of a problem without saying who you are. That way, when the CEO wants to tell your boss how much you rule, they’ll know who you are. If you plan on doing a sub-par job then don’t sign your work. Not signing your work means you are not proud of it.
Be positive. Yes, it might be 4 AM and your database server simultaneously lost two drives in its RAID 5 array, and now you have to drive 20 miles into work and restore it from backup. There is no good to be had in griping about it publicly. If you can’t say anything nice, don’t say anything at all, but if you can muster a humorous comment in the outage note, do it. I like joking about how peaceful the commute is at 4 AM.
Provide updates. Don’t let the outage notification dangle without a follow-up notice to say it was resolved. Even if it was planned maintenance, you should send a follow-up saying the work is done and it was uneventful. If you’re posting the notice to a blog or a web page, make periodic updates about what’s happening for the people who are checking.
Be honest. Lying, whether explicit or through omission, is not a good way to win friends and influence people. Tell people what is up (or down) and set their expectations accordingly. Will it be eight hours before the new part for the server shows up? Say so. It helps your customers plan their reactions to the problem. For instance, instead of sitting there waiting for the database to come back up they could go get some lunch, or go home for the day and finish their work later.
No blame. Get the problem fixed and get the service back online. Blame assignments are for internal discussions, and even then it should be about not having the problem happen again rather than who dropped the ball.
Keep it concise and skip the jargon. A little information goes a long way. Keep it generic. Be informative but not technical. Use terms like servers, network connection, and storage rather than boxes, Ethernet, or drives. And it’s a data center, not a computer room. Computer rooms are what schools have.
Don’t apologize. Thank. So many outage notifications include a heartless “We apologize for any inconvenience this may cause.” Don’t even say it, because frankly it’s just as inconvenient to you. Say something people aren’t desensitized to, like “thank you.” Thank your users for their patience or for allowing you to schedule this proactive maintenance to reduce trouble later.
“Underpromise and overdeliver” is crap. That whole Star Trek/Scotty thing where you say it’ll take six hours to fix and then do it in two is a great way to tell people you’re slow at your job and you like billing extra hours. It’s also lying. Practice making good time estimates and then use them. If your estimate is under or over by more than 10% you need more practice.
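To make the 10% rule concrete, here is a rough sketch of how you might check an estimate after the fact. It assumes the error is measured relative to how long the work actually took, which is my reading of the rule rather than something spelled out above, and the function names are made up for illustration.

```python
def estimate_error(estimated_minutes: float, actual_minutes: float) -> float:
    """Signed error of an estimate as a fraction of the actual duration.

    Positive means the estimate was too long, negative means too short.
    Assumption: the error is taken relative to the actual duration.
    """
    if actual_minutes <= 0:
        raise ValueError("actual_minutes must be positive")
    return (estimated_minutes - actual_minutes) / actual_minutes


def needs_more_practice(estimated_minutes: float,
                        actual_minutes: float,
                        tolerance: float = 0.10) -> bool:
    """True when the estimate is off by more than the tolerance in either direction."""
    return abs(estimate_error(estimated_minutes, actual_minutes)) > tolerance


# The Scotty approach: promise six hours, finish in two.
print(needs_more_practice(360, 120))  # estimate is three times the actual duration

# An honest estimate that landed close: promised 60 minutes, took 57.
print(needs_more_practice(60, 57))
```

Keeping a running log of estimated and actual times for each outage and feeding it through something like this is one way to get the practice.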
If you want examples of good outage notices, look at dreamhoststatus.com or Flickr’s outage entries on their blog. Dreamhost’s are a little more edgy and raw, but so is their customer base. It’s all about knowing what your customers would like to know and speaking to them informatively, respectfully, and honestly.
Our monitoring systems show some memory errors on HOMER, the server that handles our mailing lists. David S. from the sysadmin team will be taking the server down for emergency repairs today (January 4th, 2007) at 6:00 PM CST. The outage should last approximately 15 minutes. Any mail sent to a list during the outage will be stored temporarily on our backup mail server and sent when the list server is operating again. There might be a delay between when you send and when you see it mailed out, so don’t be alarmed if you don’t see your message right away.
Update (6:45 PM): We discovered that the memory was not faulty; the errors were a symptom of another problem. We are currently waiting for a replacement part (a new system board) to arrive. Because all the mail is safe, we have chosen to leave the server down until the problem is fully resolved.
Update (8:00 PM): HOMER is fixed and all mailing list email is being processed. Thank you for your patience and have a good night.