I’ve been watching people make changes to servers, OSes, networks, storage, applications, and the like for a while now. I’ve even been one of the people making the changes. There are four properties that every successful change I’ve witnessed has had.
1. The change is atomic.
If you are making changes to machines or systems, each change should start and finish before you make the next. This keeps the system’s state consistent for other changes, and makes it easy to find what went wrong.
Each change should also start and finish before system maintenance processes, like backups, run. That way the system’s state is consistent for those processes.
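To make that concrete, here’s a minimal sketch of what I mean: apply one change, check it, and only then move on to the next. The `patch-tool` command and the change names are made-up placeholders; the shape of the loop is the point.

```python
#!/usr/bin/env python3
"""Sketch of atomic, sequential changes.

'patch-tool' and the change names are hypothetical placeholders; the point
is that each change starts, finishes, and is verified before the next begins.
"""
import subprocess
import sys

# Each entry: (command that applies the change, command that verifies it).
CHANGES = [
    (["patch-tool", "--apply", "kernel-update"],
     ["patch-tool", "--verify", "kernel-update"]),
    (["patch-tool", "--apply", "libc-update"],
     ["patch-tool", "--verify", "libc-update"]),
]

for apply_cmd, verify_cmd in CHANGES:
    # Finish this change completely before touching the next one.
    if subprocess.run(apply_cmd).returncode != 0:
        sys.exit(f"Stopped: applying {apply_cmd} failed")
    if subprocess.run(verify_cmd).returncode != 0:
        sys.exit(f"Stopped: {apply_cmd} applied but did not verify")
```

Stopping at the first failure is deliberate: the system is left in a known state, and you know exactly which change put it there.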
2. The change is completed in the least time possible.
If you are making changes to machines or systems, you should strive to do them atomically and in the least time possible.
Think of it this way: nothing else should happen on the machine while a change is in progress. Generally that’s inconvenient to the users of the machine, so get your stuff done, make it quick, go drink a beer.
3. There is a clear way back to the pre-change state.
Change teams, committees, forms, books, managers, etc. are all useless in the change process if the system administrator does not have answers to:
“If the change I am making goes completely wrong, how can I get back to where I am right now?”
and
“Do I have everything I need to get back to where I am right now?”
A system administrator should leave themselves as many options as possible, so the recovery can be tailored to whatever actually went wrong.
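One way to answer the second question is to gather everything you need to get back before you touch anything. A rough sketch, with made-up paths, of capturing the pre-change state and keeping the way back obvious:

```python
#!/usr/bin/env python3
"""Sketch of answering "do I have everything I need to get back?"

The paths are made-up examples; the idea is to capture the pre-change
state explicitly, before the change, and know exactly how to restore it.
"""
import shutil
import time
from pathlib import Path

# Hypothetical files the change will touch.
FILES_TO_PRESERVE = [Path("/etc/httpd/httpd.conf"), Path("/etc/my.cnf")]
BACKOUT_DIR = Path("/var/backups/change-" + time.strftime("%Y%m%d-%H%M%S"))

def preserve():
    """Copy everything the change touches into one backout directory."""
    BACKOUT_DIR.mkdir(parents=True, exist_ok=True)
    for f in FILES_TO_PRESERVE:
        shutil.copy2(f, BACKOUT_DIR / f.name)

def restore():
    """The answer to "how do I get back to where I am right now?"."""
    for f in FILES_TO_PRESERVE:
        shutil.copy2(BACKOUT_DIR / f.name, f)

if __name__ == "__main__":
    preserve()
    # ... make the change here; call restore() if it goes wrong ...
```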
4. The change has a clear way of being tested.
For operating system changes this is usually an implicit “everything still works.” It helps immensely if you can define “everything.” 🙂
Like above, this can be posed simply as a question:
“How will I know after each separate change if I have broken something?”
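If it helps, you can turn “everything” into an actual checklist that runs after each separate change. A rough sketch, with hypothetical hosts, ports, and paths standing in for whatever matters on your systems:

```python
#!/usr/bin/env python3
"""Sketch of defining "everything still works" as a concrete checklist.

The hosts, ports, and paths are hypothetical; the point is to run the same
checks after every separate change so you know immediately what broke.
"""
import socket
from pathlib import Path

def port_open(host, port):
    """Return True if something is listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False

CHECKS = [
    ("web server answers",      lambda: port_open("localhost", 80)),
    ("database answers",        lambda: port_open("localhost", 3306)),
    ("app data is still there", lambda: Path("/srv/app/data").exists()),
]

failures = [name for name, check in CHECKS if not check()]
if failures:
    print("Broken after this change:", ", ".join(failures))
else:
    print("Everything (as defined above) still works.")
```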
An example from my life of how things went completely wrong:
Last week a junior admin was patching machines to bring them up to the latest levels of code. He didn’t make the changes atomic: he patched the machines Friday night and left them that way until the reboot on Sunday morning. Because of that, the nightly backups backed the machines up in that untested “meta” state. As it turns out the patches broke an application on one of the machines, and he had to revert them. Then it was discovered that he didn’t have everything he needed to go back, because he made some assumptions that were not true. Oops. Then he panicked and inadvertently destroyed all the data attached to the server while trying to fix things. We had, and we still have, some very unhappy people around the organization.
An example from my life of how things went well:
Two weeks ago an admin on my team was upgrading a reporting tool that runs in a virtual machine. He wanted to do some other work, too, while he had the machine down, but he did it all sequentially. When he had a problem he knew exactly what caused it, and therefore could fix it without too much troubleshooting. He left himself an out by putting the VM’s disks in “redo” mode, so that the changes could be discarded if they didn’t work. Before the changes he ran a backup, and then stopped the agent so that another backup would not run during his work. The upgrade wasn’t seamless, but he knew his backout plan was solid, so he didn’t worry or panic, just slogged through the problems.
There is a property of successful changes that I am omitting here: communication. I’ll save that for another time. 🙂