I was called last night by my NOC staff to resolve a problem. I don’t remember what time it was, just that my phone rang and I answered it. It was like a dream, with VPNs and shell prompts.
On my way back to sleep I was thinking about problem resolution, and that there really are two phases to it. First, you get the immediate problem fixed. Then you make sure it won’t happen again. The trouble is that very few organizations get around to the “making sure it won’t happen again” part. It’s hard and time-consuming to track down the root cause of a problem. It’s hard to make changes to help prevent it. Heck, maybe you could have predicted the failure if your monitoring system had been watching for something new. Lots of problems may not be easily preventable (anything caused by humans, for instance), but they may be easily detectable if you think of what to look for.
Once the immediate outage is over, everything goes back to normal, and the underlying problem gets stuck in the list of low-priority things that need to get done. It’s a broken window nobody ever gets around to fixing. And two months later you get called in the middle of the night to fix the same problem, one you’ve seen ten times before.