Have you read “The Pragmatic Programmer” by Andrew Hunt and David Thomas?
No? For what it’s worth, I think you should. System administrators and software developers have so much in common, but we just don’t realize it. “Two sides of the same coin,” or something like that. A bunch of software development books would make great system administration books if you just replaced the word “software” with “operating system.” 🙂
As an example, check out this short excerpt on software entropy and broken windows, straight from the book.
Sound familiar?
What do you have that could be considered a broken window? I’ve got a few. One is my Linux server build system. When it was first designed we only supported 32-bit versions of Linux. The 64-bit support was hacked in one day in a rush, and now when we change the build process we have to change it in seven different places. Prone to error, forgetfulness, and laziness. Icky.
Another one of my broken windows is a script I have that runs at boot to detect Linux virtual machines that have the VMware tools installed, but not properly configured (like after a kernel upgrade). The script fixes the configuration and then reboots the VM. In a couple of situations my script gets into a loop, rebooting the VM continuously, and the sysadmin has to intervene. Usually the intervention is to comment out the script, which just adds to the problem.
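For what it's worth, here is a rough sketch of the kind of guard that would stop a reboot loop like that. It is not my actual script, just an illustration in Python: the state file, the `MAX_ATTEMPTS` limit, and the `tools_ok`, `repair`, and `reboot` callables are all hypothetical stand-ins for whatever the real script does. The idea is to keep a counter on disk that survives reboots and to give up once it hits the limit, instead of rebooting forever.

```python
from pathlib import Path

MAX_ATTEMPTS = 2  # hypothetical limit on consecutive repair-and-reboot cycles


def read_attempts(state_file: Path) -> int:
    """Return the number of repair attempts recorded so far (0 if none)."""
    try:
        return int(state_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0


def guarded_repair(state_file: Path, tools_ok, repair, reboot,
                   max_attempts: int = MAX_ATTEMPTS) -> str:
    """Repair and reboot at most max_attempts times in a row.

    tools_ok, repair, and reboot are caller-supplied callables (assumptions
    for this sketch). Returns "ok", "rebooting", or "gave_up".
    """
    if tools_ok():
        # Healthy boot: clear the counter so a future breakage starts fresh.
        state_file.unlink(missing_ok=True)
        return "ok"

    attempts = read_attempts(state_file)
    if attempts >= max_attempts:
        # Stop here and leave the VM up so a human can look at it.
        return "gave_up"

    # Record the attempt before doing anything that might not return.
    state_file.write_text(str(attempts + 1))
    repair()
    reboot()
    return "rebooting"
```

With a guard like this, the worst case is a VM that boots with misconfigured tools and a note in the log, which beats a sysadmin commenting the whole script out.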
We know exactly how to fix these problems. We just haven’t done it yet. Now I even find myself justifying more bad design and carelessness, saying “we’ll fix it later” and “it’s a mess, just do whatever.” Would I say these things if the problems were cleaned up? Heck no! So tomorrow I’m going to invest some time in my team’s future, fixing broken windows and raising the bar back to where it should be.
If you’re curious, the original “Broken Window Theory” is up on Wikipedia.
My backup scheme is a major broken window for me. I’m actually embarrassed by it, but I’m slow to fix it because I can’t find the time to fix it properly in a single day, which I’d need to do because otherwise we’d have no backup system in the meantime. It’s horribly inefficient, and we’ve outgrown it since I first built it. It’s a leap of faith I just need to take to get it fixed; it’s too important.
Yeah, backup systems are often like that, with people unwilling to take a multi-day outage to fix them, asking “what if something happens that night?” But what if the problems don’t get fixed? That’s why I like to rebuild the system on other hardware in parallel if I can (like if I get a new tape library). That’s not always an option, but when it is, it’s nice.