If a picture is worth a thousand words, then getting to see something for yourself is worth an entire library. Especially when it comes to troubleshooting.
We’ve been having problems with our list server lately. Our monitoring system sends email through it at regular intervals to check the service, and if the round trip time is too long it sends an alarm. Standard stuff, built to handle the case where the list manager software spontaneously dies once every quarter or so. The problem lately has been that we get the alarm much more frequently: once a day or more.
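The post doesn't show the probe itself, but the check it describes is simple enough to sketch. Here is a minimal version in Python, assuming an SMTP list server and a POP3 mailbox for the probe account; the host names, addresses, credentials, and the five-minute threshold are placeholders, not details from the actual monitoring system.

```python
# Hypothetical sketch of the round-trip probe described above.
# All names, credentials, and thresholds are made up for illustration.
import poplib
import smtplib
import time
import uuid

LIST_SERVER = "lists.example.edu"                       # assumed list server host
PROBE_POP3 = ("mail.example.edu", "probe", "secret")    # assumed probe mailbox
THRESHOLD_SECONDS = 300                                 # alarm if the loop takes longer

def send_probe() -> str:
    """Send a uniquely tagged message through the list server."""
    token = uuid.uuid4().hex
    msg = f"Subject: monitoring probe {token}\r\n\r\nprobe body\r\n"
    with smtplib.SMTP(LIST_SERVER) as smtp:
        smtp.sendmail("probe@example.edu", ["probe-list@example.edu"], msg)
    return token

def wait_for_probe(token: str, timeout: float) -> bool:
    """Poll the probe mailbox until the tagged message comes back or time runs out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        pop = poplib.POP3(PROBE_POP3[0])
        pop.user(PROBE_POP3[1])
        pop.pass_(PROBE_POP3[2])
        for i in range(len(pop.list()[1])):
            lines = pop.retr(i + 1)[1]
            if any(token.encode() in line for line in lines):
                pop.dele(i + 1)
                pop.quit()
                return True
        pop.quit()
        time.sleep(10)
    return False

start = time.time()
if wait_for_probe(send_probe(), THRESHOLD_SECONDS):
    print(f"OK: round trip took {time.time() - start:.0f}s")
else:
    print(f"ALARM: round trip exceeded {THRESHOLD_SECONDS}s")  # page the operators
```

Note the fixed threshold: a probe like this has no idea whether the server is down or merely busy, which is exactly the ambiguity that bites later in the story.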
Since this thing is a production service, our operations staff have a procedure for restarting it. They get the alarm, they restart the service, everybody is happy. Except me. The alarms have either been happening late at night or the operators are restart ninjas; either way, nobody gets a chance to examine what is actually happening before they “fix” the problem. On top of that, the logs this software keeps are nearly useless, and the alarm times seem random.
When human beings can’t see what’s happening, we speculate. Which, in business, first manifests as a witch hunt. Who changed something? What changed? Are test and production the same? Why not? Second, it’s a blame game. You patched the production server’s OS right around when the problem started, so it’s your fault! You updated the database, it’s your fault! The app is unstable. No, the database is unstable. Shampoo is better. No, conditioner is better. Kindergarten disagreements with bigger words.
It makes my head hurt, because it’s entirely fact-free. Nobody has actually seen the problem occurring. In technical terms it’s known as talking out of one’s ass.
I just want to figure out what is wrong and get on with my life. So I threw caution to the wind and booby-trapped the restart procedure: try to restart and it would throw an error telling the operators to call me instead. Today it paid off. I actually got to see the problem as it was happening. I even got to call our DBA and verify what I was seeing from a different point of view: absolutely nothing wrong.
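The booby trap itself isn't shown, but the idea amounts to swapping the documented restart command for something that refuses to run. A minimal sketch, assuming the operators invoke a single restart script; the script and its wording are hypothetical.

```python
#!/usr/bin/env python3
# Hypothetical stand-in for the restart script named in the ops procedure.
# It does nothing except refuse and tell the operator who to call.
import sys

sys.exit(
    "Restart disabled for troubleshooting.\n"
    "Do NOT restart the list server -- call the list admin first\n"
    "so the problem can be observed while it is still happening."
)
```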
In 30 minutes of watching the “problem” it became obvious that the only real problem was the monitoring system. This time of year the list server gets busy. This year it is apparently busier than normal, and the monitoring probe’s round trip naturally takes longer when the server is busy. Sure enough, after 20 minutes of slow mail processing the problem cleared itself, and so did the alarm.
In three weeks of guessing, the users were never suggested as the problem. Which is strange in itself, since they almost always contribute to the problem in some way. Nor was the monitoring system. Sometimes that hammer you’re holding really does make everything look like a nail.
Next time I’m not waiting so long for the chance to see things firsthand.
Quote: “Kindergarten disagreements with bigger words.”
That’s exactly what it is and you went on to describe the role a professional takes: “Let’s work together and see what the problem is.”
On the other hand, if you’re out for a joke then it’s great to pompously state that “it can’t be my software that’s causing the problem” 😀
Just found my way to your blog today. Oh man, the entire paragraph (the one that concludes with the Billy Madison reference) had me rolling. How true it is. It only gets worse when no one has the time to do what you did to find the root cause.
🙂 Thanks guys.