Once upon a time there was a man who was a sysadmin. He took care of twenty IBM AIX boxes. These machines were built by his predecessors, and they were built over time, each different, each with a personality. This man’s customers didn’t like the different personalities. They wanted each server to be the same so they could write scripts that had the same results on all the servers. They wanted to compile code and not have to worry about differences between compilers on the machines.
This man asked around and discovered that IBM’s Network Installation Manager, or NIM, was a tool he could use to manage the machines. He took an older AIX box and made it a NIM master. He started using the tool to patch and upgrade older machines and install new machines. The process he used to manage his twenty servers was involved and inflexible. On several occasions he inadvertently destroyed machines with NIM instead of upgrading them. New procedures had to be added to take image backups of the machines before his work, in case something went wrong.
Many of his days were spent maintaining the tool, rather than maintaining the servers. The days he didn’t maintain the tool itself were spent on ancillary tools to fix problems caused or exacerbated by the primary tool. The time spent on these tools far outstripped the time needed to admin the twenty servers by hand.
Once upon a time a manager wanted a better way for his group to keep track of work requests. Things sometimes got lost or delayed because his staff kept getting distracted by new walk-up requests. He was familiar with the enterprise case tracking system his organization’s help desk used (Clarify ClearSupport), and decided that all work his group did should be entered into that system and assigned to an individual.
After a few weeks he examined the records and discovered that, according to the case tracking system, nobody in his group was doing any work. The few cases that had been entered were woefully inaccurate and included very few details of the work done. He demanded that everybody enter everything they were working on, and do so accurately.
A week later he was checking the time reports from his staff and discovered that nearly all of them had spent many hours on “miscellaneous tasks.” When he asked his staff why they were spending so many hours on miscellany, they replied that it was the recording of their work in the case tracking system that was taking so long. The paperwork for their work was taking longer than the work itself.
Both of these cases suffer from the same problem: powerful tools. IBM’s NIM is a very powerful tool for managing machines. ClearSupport is a very powerful case tracking system. These tools are used very successfully by many large companies to manage their enterprises. In both cases, though, they are too big for most endeavors. NIM is great for hundreds of machines, but it is overkill for twenty. ClearSupport is wonderful for recording and escalating cases to different parts of a big organization, but is overkill as a shared to-do list.
When I am evaluating a tool I have a series of questions I ask myself about it. I think of tools in terms of return on investment. If my team invests 10 hours of time in this tool, will it save us more than 10 hours? Will it do so rapidly?
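To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python. Every number in it is made up for illustration; the only point is the shape of the calculation: setup cost, ongoing upkeep, and hours saved.

    # Rough ROI sketch: how long until a tool pays back the hours sunk into it?
    # Every figure here is an illustrative assumption, not a measurement.

    def months_to_break_even(setup_hours, upkeep_hours_per_month, saved_hours_per_month):
        """Months until cumulative savings exceed cumulative cost, or None if never."""
        net_per_month = saved_hours_per_month - upkeep_hours_per_month
        if net_per_month <= 0:
            return None  # the tool eats more time than it saves
        return -(-setup_hours // net_per_month)  # ceiling division

    # 40 hours to set the tool up, 6 hours/month to keep it fed,
    # 10 hours/month of hand work it replaces:
    print(months_to_break_even(40, 6, 10))   # 10 months to break even
    print(months_to_break_even(40, 12, 10))  # None -- it never breaks even

If the answer is “never,” or longer than the tool will plausibly be in service, the whiteboard wins.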
Does the tool help achieve other goals? A break-even ROI might be worth it if you can accomplish multiple goals or significantly improve the quality of your work. Task management systems can often help with customer billing and time reporting. Patch management tools might help with audits and compliance checking. But like a pocketknife pressed into service as a screwdriver, one tool should not be forced to take the place of another better suited to the task.
How much will the tool cost? Could I just hire someone to do the same job, or do it a different way with a low-tech solution? ClearSupport costs tens of thousands of dollars. Tasks Pro costs $500. A whiteboard costs $100. Could I write the tool myself?
How much time does it take to maintain the tool? How does that compare with just doing the work manually? An example of this would be using NIM for patching. How much work goes into using NIM versus downloading the AIX patches and using the native package management tools to update the machine?
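The same kind of made-up arithmetic shows why fleet size matters so much here. With purely hypothetical numbers (a fixed monthly overhead for keeping the tool healthy versus so many hours per machine to patch by hand), the crossover point can land well above twenty machines:

    # Hypothetical comparison of a patching tool's fixed overhead versus
    # patching by hand. All numbers are assumptions for illustration only.
    tool_overhead_per_month = 30    # hours/month spent feeding and fixing the tool
    tool_hours_per_machine = 0.25   # hours/machine once the tool is working
    manual_hours_per_machine = 1.0  # hours to patch one machine by hand

    for machines in (20, 60, 200):
        with_tool = tool_overhead_per_month + tool_hours_per_machine * machines
        by_hand = manual_hours_per_machine * machines
        winner = "tool" if with_tool < by_hand else "by hand"
        print(f"{machines} machines: {with_tool:.0f}h with tool, {by_hand:.0f}h by hand -> {winner}")
    # 20 machines: by hand wins; at 60 and 200 the tool starts to earn its keep

With numbers like these, twenty servers sits on the wrong side of the line, which is the story above.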
Does the tool require me to change my behavior, or can I keep my existing habits? Teams should be flexible enough to change their habits, but you should also choose a tool that can work with your setup. NIM is great when you have a consistent maintenance window and lots of machines that can be updated in lockstep. It isn’t as shiny when your servers all have different maintenance windows and are each a little different here and there. If your situation doesn’t match the design of the tool, move on.
What will I need to learn to operate this tool? Does it use its own language or configuration syntax? Will I need another server to run it? What happens when the tool breaks? If I cannot fix it right away, is there a workaround? Can I start using this tool a little bit here and there and ease into it, or do I need to dive in completely to make it work? Do I really need the feature that is missing from this tool? Does this tool’s integration with other systems matter? Will we actually integrate it, or is that just an excuse? Would it be better to have two tools that each do their job very well, or one that is merely okay at both?
Sometimes these are tough questions. Sometimes you have to fight your ego and the urge to pick a big tool when you really just need a small one. Sometimes you have to fight your management and your coworkers when they cause scope creep, evaluating tools with criteria that make no sense. “The patching tool we choose should have account management features.” Um, what? Hell no. Sometimes you just need to search freshmeat.net and call it a day.
Your tools should amplify your productivity, not rob it. Whatever you choose, think ROI.
Sounds very familiar.
We were tasked with automating failure monitoring for ~60 servers of different makes and generations. The goal was to know instantly when a part failed and then wake someone up to replace the failed, but redundant, part.
We never found a tool that actually worked. The solution was to walk the datacenter twice a day and look for hardware fault lights. Software isn’t as reliable as a human eye.