Rube Goldberg, Part 2

“Hey Bob, big important web server A isn’t working right after you patched it the other day. We need to roll the patches back.”

“Can you tell me what’s going wrong first?”

“PHP can’t talk to the Oracle databases.”

“Does the test environment work?”

“Yeah.”

“It’s identical to production.”

“Um… I don’t get it.”

“I put the same patches on the test box that are on the production machine.” I always do. That’s the point.

“Well, maybe it’s on a different network.”

“What’s on a different network?”

“Did big important web server A switch networks?”

At this point an alarm is going off in my head. Switch networks? Yeah, I just randomly change the IPs of my servers because it keeps people on their toes. While we’re guessing, maybe it’s an evil spirit in our data center. Maybe we didn’t sacrifice enough goats last week. WTF.

“It didn’t switch networks. It sounds like a problem with your PHP installation.”

“It’s really important that we fix this.”

Yes, that’s why it took you 36 hours to notice the problem…

“Because you aren’t using the vendor-supplied web server, or a web server we maintain, it’s your responsibility. My position is that the machines are identical. Let me know if I can do anything.”

So an hour goes by, and they do some good, linear troubleshooting. The next thing I hear is:

“The Oracle environment variables aren’t set for the production web server.”

“Okay…”

“…”

“…so why don’t you do it?”

“Well, why would they not be set on the production machine?”

Very good question. Then it dawns on me:

1) When we rebooted the test machine the web server didn’t come up, because they had changed the location of apachectl and didn’t tell us, so rc.local still pointed at the old location.

2) I bet someone put stuff they weren’t supposed to in /etc/profile… checking… checking… bingo. Sometime in the past someone changed /etc/profile so that everybody gets the Oracle environment variables. That isn’t something we like, because it lets people make assumptions in their scripts. We hate assumptions. If you rely on something make sure it’s in your script. I’ve seen people do this before on my machines and it usually results in my “/etc is a directory that belongs to ME, this is an abuse of my trust” lecture.
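Here is roughly what “make sure it’s in your script” can look like. This is a minimal Python sketch, not anything that ran on these machines. The variable names ORACLE_HOME and TNS_ADMIN are my assumption, since all I know is that “the Oracle environment variables” were missing, and missing_vars and require_env are made-up helper names. The idea is only that a script states what it depends on and fails loudly when it isn’t there.

```python
import os

# Assumed list for illustration: the variables actually involved are not recorded.
REQUIRED_VARS = ("ORACLE_HOME", "TNS_ADMIN")


def missing_vars(env, required=REQUIRED_VARS):
    """Return the required names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


def require_env(env=None, required=REQUIRED_VARS):
    """Return the required settings, or raise a clear error up front."""
    env = os.environ if env is None else env
    missing = missing_vars(env, required)
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

Calling require_env() at the top of a script turns a silent assumption about /etc/profile into an error message that names exactly what is missing.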

After updating the test server I logged in to check whether the web server was running. Because of #2 I got all the right environment variables. When I found the web server wasn’t running, I started it, and it used my environment.

Then, because I like being helpful, I corrected the location of apachectl in the production host’s rc.local. Because apachectl runs from rc.local at boot, it doesn’t go through a login and never gets the environment a user would. So the web server starts, but without the Oracle variables PHP can’t talk to the databases.

Luckily, the guy who was running the web server kept the old apachectl, and as it turns out, the old one had all the right environment variables at the top of it, as a hack from the last guy who worked on this.
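The difference between “started from my login shell” and “started from rc.local” comes down to which environment the child process is handed. A small Python sketch of the cleaner version of that old hack: build the environment explicitly and pass it to the process, inheriting nothing. The paths, the variable values, and the helper names build_env and run_with_env are all hypothetical.

```python
import subprocess

# Hypothetical values; the real ones were whatever the old apachectl set.
ORACLE_ENV = {
    "ORACLE_HOME": "/opt/oracle/product/client",
    "TNS_ADMIN": "/opt/oracle/product/client/network/admin",
}


def build_env(extra, path="/usr/bin:/bin"):
    """Build the child's whole environment from scratch, inheriting nothing."""
    env = {"PATH": path}
    env.update(extra)
    return env


def run_with_env(argv, extra):
    """Run argv with only the explicit environment and return its stdout."""
    result = subprocess.run(
        argv,
        env=build_env(extra),  # replaces the parent's environment entirely
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Something like run_with_env(["/path/to/apachectl", "start"], ORACLE_ENV) then behaves the same whether I run it by hand or it runs at boot, because nothing depends on who happened to be logged in.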

Moral of the story? There are several:

1) I need to manage /etc/profile in cfengine or get alerts on changes from Osiris so bozos can’t sneak changes in on me. I hadn’t really thought about managing the stuff that we don’t customize, but it’s now apparent that we have to.

2) Don’t make system-wide changes, because they lead to system-wide assumptions.

3) You need to turn responsibilities over to people slowly, so that lapses in documentation can be corrected. In this case it was dumped on the new guy, and he didn’t have any reason to think there was something custom in apachectl until stuff broke. Then it’s a crisis.

4) If you offer a service, like PHP with Oracle, you need some way to monitor that it works. Had that been the case we would have caught this right away. A simple PHP script we can call from our monitoring system would work.

5) When you make changes it’s imperative you keep the old stuff for a while, as a reference. And not just in your backup systems, because that’ll take a while to restore.
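For moral #1, the core of what I want from cfengine or Osiris is simple: remember what a file looked like, and tell me when it stops looking like that. A rough Python sketch of the idea follows. It is an illustration only, not how either tool works internally, and file_digest and changed_files are names I made up.

```python
import hashlib


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(baseline):
    """Compare files against a {path: digest} baseline.

    Returns the sorted paths whose contents differ from the baseline,
    including files that have gone missing or cannot be read.
    """
    changed = []
    for path, expected in baseline.items():
        try:
            current = file_digest(path)
        except OSError:
            current = None  # missing or unreadable counts as changed
        if current != expected:
            changed.append(path)
    return sorted(changed)
```

Record a baseline such as {"/etc/profile": file_digest("/etc/profile")} once, run changed_files(baseline) on a schedule, and alert on anything it returns.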

*sigh*