I was called last night by my NOC staff to resolve a problem. I don’t remember what time it was, just that my phone rang and I answered it. It was like a dream, with VPNs and shell prompts.
On my way back to sleep I was thinking about problem resolution, and that there really are two phases to it. First, you get the immediate problem fixed. Then you make sure it won’t happen again. The trouble is that very few organizations get around to the “making sure it won’t happen again” part. It’s hard and time-consuming to track down the root cause of a problem. It’s hard to make changes to help prevent it. Heck, maybe you could have predicted the failure if your monitoring system had been watching for something new. Lots of problems may not be easily preventable (anything caused by humans, for instance), but they may be easily detectable if you think of what to look for.
Once the immediate outage is over, everything goes back to normal, and the underlying problem gets stuck in the list of low-priority things that need to get done. It’s a broken window nobody ever gets around to fixing. And two months later you get called in the middle of the night to fix the same problem, one you’ve seen ten times before.
I agree this is a common problem at many companies. You’re pointing out the difference between Incident Management and Problem Management. Most companies have a well-defined IM process but lack any formal PM process, or the tools to track problems. While it may be a dry subject for many people in IT, this is where implementing a solid set of ITIL practices can make all the difference.