Lightning, Cold Water, and Me

This morning my building was hit by lightning. The power sorta browned out for about 10 seconds, and then everything was fine again.

Except some control system freaked out, and the chillers in the data center stopped receiving cold water.

Oops.

Thus began an hour of frantically shutting down development, test, staging, and other non-production machines, all in an attempt to keep the room temperature down while the facilities guys fixed the problem. It worked.

A couple of things occurred to me during all of this:

1. We discovered that the Jabber server we use to coordinate outage handling is open to the world. We had customers joining our chat room. Not that we were hiding anything, but it’s just not right. We’d never had that happen before, so we’d never thought about it.

2. We also discovered that the Jabber server we use has a fairly low limit on how many people are allowed into a chat room. Combined with #1, this made life a little difficult, since a lot of the visitors didn’t understand that they were actually impeding our ability to fix things. I’m not sure we’d have caught either issue with a practice outage. (There’s a rough config sketch for both of these after the list.)

3. During a crisis we always have a lead tech and a lead manager. The manager is in charge of political operations and decision making. The tech is in charge of coordinating the technical operations. I accidentally invented a whole new role, though, which I’ll dub “scribe,” or at the very least chat room moderator. I stayed out of the data center, at my desk, and kept track of everything going on with the servers so the others could make decisions and deal with one-off issues. By the time the outage went from “shut everything down” to “turn everything back on,” I already had a spreadsheet of everything that had gone down, published to the web. I was using Excel, but I want to check out Google Apps to see if it’d be easier to collaborate on a list.

4. As part of being scribe, I was in a position to know which servers were down but not actually powered off. Some OS & hardware combinations halt but don’t power themselves off, so the box keeps drawing power and putting out heat, which is exactly what you don’t want during a chiller problem. Two of my coworkers stepped in: once I noted that an admin had downed a server remotely, they’d go manually power it off. Super freaking cool, especially since it didn’t tax the guys making decisions about the outage. The best part was that it just happened. That’s a symptom of having good people on a team together: things just happen and get done.

5. Low tech helped a lot. We’ve had individual chiller problems before, and because of those outages we bought some big barn fans and extension cords. Lifesavers, they are. Likewise, the $5 desktop thermometer/hygrometers scattered around the data center were great. Sure, the routers and servers could tell us what they thought the temperatures were, but that’s so much harder than just looking at a gauge.
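
To put a little meat on #1 and #2: if the Jabber server happened to be ejabberd (this is just a sketch of the idea, not our actual config, and “example.org” stands in for the real staff domain), locking rooms down to staff accounts and raising the per-room cap would look roughly like this:

    # Hypothetical ejabberd.yml fragment -- a sketch, not the real config.
    acl:
      staff:
        server: example.org        # only accounts on our own domain count as staff

    access_rules:
      muc_join:
        - allow: staff             # staff can join the ops rooms
        - deny: all                # customers and other outsiders cannot

    modules:
      mod_muc:
        access: muc_join           # who may join rooms (issue #1)
        access_create: muc_join    # who may create rooms
        default_room_options:
          members_only: true       # belt and suspenders for issue #1
          max_users: 100           # raise the per-room occupancy cap (issue #2)

Whatever server is actually in play, the shape of the fix is the same: restrict who can get into the coordination room, and bump the occupancy limit so the whole outage crew fits.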

So that was my day. I didn’t get anything done, really, but it was good. And for the first time ever, I’m looking forward to the post-incident review.