Archive for April, 2006

links for 2006-04-29 »

links for 2006-04-26 »

Vacation »

Hey, sorry I’ve been quiet these last couple of weeks. I’ve been building up to a vacation, which I am on right now. I’m actually about to leave for San Diego, heading to Coachella this weekend. I so needed a vacation, especially after preparing for vacation. You ever notice that if you’re stressed out it just gets 500% worse right before you leave? Who’s your backup at work? What do they need to know? Finish that script. Charge your iPod. Find your camera. Do laundry. Pack. *sigh*

I did finally do something about comment spam, having just discovered 221 comments in moderation. The “Did You Pass Math?” plugin is the winner. Can you add 2 and 3? 7 and 5? If you can’t, can you use a calculator? Or are you a freakin’ spammer?

Okay, I have to go pack my Jeep. L8r.

links for 2006-04-19 »

links for 2006-04-18 »

links for 2006-04-13 »

Rube Goldberg, Part 2 »

“Hey Bob, big important web server A isn’t working right after you patched it the other day. We need to roll the patches back.”

“Can you tell me what’s going wrong first?”

“PHP can’t talk to the Oracle databases.”

“Does the test environment work?”

“Yeah.”

“It’s identical to production.”

“Um… I don’t get it.”

“I put the same patches on the test box that are on the production machine.” I always do. That’s the point.

“Well, maybe it’s on a different network.”

“What’s on a different network?”

“Did big important web server A switch networks?”

At this point an alarm is going off in my head. Switch networks? Yeah, I just randomly change the IPs of my servers because it keeps people on their toes. While we’re guessing, maybe it’s an evil spirit in our data center. Maybe we didn’t sacrifice enough goats last week. WTF.

“It didn’t switch networks. It sounds like a problem with your PHP installation.”

“It’s really important that we fix this.”

Yes, that’s why it took you 36 hours to notice the problem…

“Because you aren’t using the vendor-supplied web server, or a web server we maintain, it’s your responsibility. My position is that the machines are identical. Let me know if I can do anything.”

So an hour goes past, and they did some good, linear troubleshooting. The next thing I hear is:

“The Oracle environment variables aren’t set for the production web server.”

“Okay…”

“…”

“…so why don’t you do it?”

“Well, why would they not be set on the production machine?”

Very good question. Then it dawns on me:

1) When we rebooted the test machine the web server didn’t come up, because they changed the location of apachectl and didn’t tell us. So we didn’t make the change in rc.local.

2) I bet someone put stuff they weren’t supposed to in /etc/profile… checking… checking… bingo. Sometime in the past someone changed /etc/profile so that everybody gets the Oracle environment variables. That isn’t something we like, because it lets people make assumptions in their scripts. We hate assumptions. If you rely on something make sure it’s in your script. I’ve seen people do this before on my machines and it usually results in my “/etc is a directory that belongs to ME, this is an abuse of my trust” lecture.

After updating the test server I logged in to check to see if the web server was running. Because of #2 I got all the right environment variables. When I found the web server wasn’t running, I started it, and it used my environment.

Then, because I like being helpful, I corrected the location of apachectl in the production host’s rc.local. Because the web server now starts from rc.local at boot, and not from a login shell, it never reads /etc/profile and doesn’t get an environment like a user would. Therefore it starts, but without the Oracle variables, so it doesn’t do the right thing.

Luckily, the guy who was running the web server kept the old apachectl, and as it turns out, the old one had all the right environment variables at the top of it, as a hack from the last guy who worked on this.
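For what it’s worth, here is a rough sketch of the “if you rely on something, make sure it’s in your script” idea, written in Python rather than as an apachectl hack. The variable values are made up (I’m not going to publish our real paths), and build_env and start are just names I picked for illustration. The point is only that the launcher sets what it needs itself, so it behaves the same from rc.local as from a login shell.

```python
import os
import subprocess

# Hypothetical values for illustration; substitute whatever your install needs.
ORACLE_ENV = {
    "ORACLE_HOME": "/opt/oracle/example",
    "TNS_ADMIN": "/opt/oracle/example/network/admin",
}


def build_env(base=None):
    """Return a copy of base (default: the current environment)
    with the Oracle settings forced in."""
    env = dict(os.environ if base is None else base)
    env.update(ORACLE_ENV)
    return env


def start(command, base=None):
    """Run command (a list of arguments) with the explicit environment
    and return its exit code."""
    return subprocess.run(command, env=build_env(base)).returncode
```

Something like start(["/path/to/apachectl", "start"]) would then carry the same variables whether it is called at boot or by hand, with no help from /etc/profile.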

Moral of the story? There are several:

1) I need to manage /etc/profile in cfengine or get alerts on changes from Osiris so bozos can’t sneak changes in on me. I hadn’t really thought about managing the stuff that we don’t customize, but it’s now apparent that we have to.

2) Don’t make system-wide changes, because they lead to system-wide assumptions.

3) You need to turn responsibilities over to people slowly, so that lapses in documentation can be corrected. In this case it was dumped on the new guy, and he didn’t have any reason to think there was something custom in apachectl until stuff broke. Then it’s a crisis.

4) If you offer a service, like PHP with Oracle, you need some way to monitor that it works. Had that been the case we would have caught this right away. A simple PHP script we can call from our monitoring system would work.

5) When you make changes it’s imperative you keep the old stuff for a while, as a reference. And not just in your backup systems, because that’ll take a while to restore.
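As a sketch of what the monitoring side of moral #4 might look like: assume the web team publishes a small test page that runs a trivial Oracle query and prints a known marker string when it succeeds. That page, its URL, and the marker are all hypothetical here, and service_ok is just a name I made up. The monitoring system only has to fetch the page and look for the marker.

```python
import urllib.request


def service_ok(url, marker, timeout=10):
    """Return True if url can be fetched and its body contains marker."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
    return marker in body
```

A check like service_ok("http://webserver.example/oracle_check.php", "ORACLE_OK") would have gone red as soon as PHP lost its Oracle environment, instead of 36 hours later.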

*sigh*
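And for moral #1, a bare-bones stand-in for the “alert me when /etc/profile changes” idea. This is not how cfengine or Osiris work internally; it is only a minimal sketch that compares SHA-256 digests against a baseline you recorded earlier. The function names and the baseline format (a dict of path to digest) are my own assumptions.

```python
import hashlib


def file_digest(path):
    """Return the SHA-256 hex digest of the file at path."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def changed_files(baseline):
    """baseline maps path -> known-good digest.
    Return the paths whose current digest differs or that cannot be read."""
    changed = []
    for path, expected in sorted(baseline.items()):
        try:
            current = file_digest(path)
        except OSError:
            current = None  # a missing or unreadable file counts as changed
        if current != expected:
            changed.append(path)
    return changed
```

Record {"/etc/profile": file_digest("/etc/profile")} once while the file is known to be good, run changed_files on that from cron, and complain loudly if the list is ever non-empty.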

Hold, Fold, Walk Away, Run »

There was a post today over at 37signals about sunk costs that I’ve been thinking about. It’s been rolling around in this skull of mine all day.

If you aren’t familiar with the idea of sunk costs, Wikipedia states that they are costs that have already been incurred and which cannot be recovered to any significant degree. The Economics Web Institute has a good definition, too (search on that page). I was first introduced to this concept by an old boss who was studying for his MBA. “You have to be able to walk away from sunk costs,” he said. So many people consider walking away a waste of time, money, and effort, though. In order to not waste their “investments” they cling to them, instead of expending effort on projects, software, and systems that do a better job from the ground up.

For system administrators, much of our time and effort ends up being a sunk cost, spent towards the overall investment in whatever we’re building. Software licensing is a sunk cost. Server hardware is a sunk cost. By themselves, all of the components of a system are sunk costs. The assets gained just aren’t usually worth anything in resale value.

Things are a bit different for programmers. Code can be considered an investment if your business derives revenue from it. It certainly isn’t a sunk cost. Joel Spolsky has talked about this in his posts about rewriting code from scratch. This may be true of tools sysadmins write, too. Maybe. Is it ever a good idea to throw a tool out and start over?

When should you scrap a project or a system and start over? When isn’t it a waste to keep going? Can we do something to limit the “waste?” How ready should we be to switch to something new? Can we do anything to turn sysadmin sunk costs into saleable assets?

I have some ideas, but they need some more time to soak. Hang tight.
