I’ve been writing lately about Blame and my attempt at Understanding Blame. I guess I didn’t mean to turn this into a miniseries but there’s been a lot of interest, including from Xangati. It turns out they’ve been talking about this same topic, in the context of their products, of course, but they’ve got some really cool stuff going on with their Management Dashboards and free tools. I’m glad others think so, too, though I don’t know if Sean Clark’s calling them “Skynet” constitutes a compliment. :) At least the end of civilization will be well monitored.
Anyhow, they’ve asked me to help host a webinar on the topic, a conversation & forum on blame, why it happens, and how to deal with it. Heck yeah! I’m excited about it. If you want to learn more, the webinar is on February 24th, and there’s a signup link…
So I’ve been thinking about all this blame and what we can do to prevent and minimize it. Some things came to mind. Some of them are easier said than done, but they’re pretty effective if you can pull them off.
A Good Team & Good Relationships
People don’t like being treated like insurance policies. They do like involvement up front, before there’s a problem. So find the people you will lean on, your go-to people when things get weird. Get a good network guy, a good storage guy, and a good data center guy on your team. Ask them what’s wrong with your design. Listen. Ask questions, especially if you don’t understand something. “So explain it to me.” Be honest. If you ditch the ego they will, too, and if they don’t, ditch them.
If they don’t want to be an active part of the team, or can’t, make them advisors. Get them all in the same room once in a while. Be Switzerland.
If you don’t have an environment with separate network, storage, and data center guys, get a couple of extra people involved to use as a sounding board. Just the sheer act of explaining things to people makes you realize things about your environment. Namely, if it’s hard to explain it’s usually too complex! At the least you’ll make some coworkers more knowledgeable. All the good IT guys want to be part of virtualization projects now, anyhow. Build an army of friendlies.
Hard Facts, Good Attitudes
When bad things are happening nothing cuts through all the crap faster than hard data and a good attitude. “Hey, I’m seeing some weird behavior, can you take a look at this graph?” is a great starting point, and so is the neutral phrasing. Don’t start the war with an offhand remark. You’re having a problem with your stuff and would like their help, and here’s some hard data to show them what’s going on.
A reproducible problem is priceless, too. I’ve asked a lot of app admins to show me how to break their app, or make it really slow.
Show your coworkers how to gather hard facts from the environment. Give them logins to the tools you use to gather data. Give impromptu mini-training sessions in a group meeting. Encourage them to gather what facts they can because it makes finding a problem much faster. Even just the exact time the problem happened is a big help.
This is doubly true for application admins. Things are slow, you say? How slow? What part is slow? When? How fast does it need to be? No, I mean in seconds, or transactions per minute. Oh, you don’t know how many transactions per minute? No problem (even if it is) – how can we set up some testing to measure the speed? That way we can compare a change to see if we’ve helped.
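If nobody has a number yet, even a crude timing harness gets the conversation started. Here’s a minimal Python sketch of that kind of before-and-after test. The function names are mine, and `transaction` is a stand-in for whatever the app admin agrees counts as one unit of work, so treat it as an illustration rather than a real benchmark tool.

```python
import time


def transactions_per_minute(count, elapsed_seconds):
    """Convert a transaction count and elapsed time into transactions per minute."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return count * 60.0 / elapsed_seconds


def measure(transaction, count):
    """Run `transaction` (a hypothetical callable) `count` times; return the observed rate."""
    start = time.perf_counter()
    for _ in range(count):
        transaction()
    elapsed = time.perf_counter() - start
    return transactions_per_minute(count, elapsed)


def percent_change(before, after):
    """Percent change in rate from a baseline run to a run after a change."""
    return (after - before) / before * 100.0
```

Run `measure` once before a change and once after, then feed both rates to `percent_change`. A single run is noisy, so repeat it a few times before anybody draws conclusions.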
If you keep your attitude friendly, genuine, and accusation-free you encourage honesty, and honesty saves tons of time. If it was a dumb mistake don’t rub it in, just tell them everybody gets one here and there. Blame the newness of virtualization. We’re all learning, after all.
IT staffers are like bears: they hate surprises. IT staff usually have a number of “bear bells” at their disposal, too. Use them. With your team of advisors, decide what hard facts constitute “getting bad,” “really bad,” and “we generally need more of something,” like certain disk or network latency, CPU & RAM utilization. Set alarms at those points, and work hard to eliminate false positives. Instrument everything you depend on. There’s no blame when you catch and fix a problem early.
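To make the “getting bad” and “really bad” idea concrete, here’s a small Python sketch. The metric names and threshold values are made up for illustration; the real ones should come from your team of advisors. Requiring several bad samples in a row is one simple way to cut down on false positives, though certainly not the only one.

```python
# Hypothetical thresholds: (getting bad, really bad). Agree on real values as a team.
THRESHOLDS = {
    "disk_latency_ms": (20.0, 50.0),
    "cpu_percent": (75.0, 90.0),
}

LEVELS = ["ok", "getting bad", "really bad"]


def classify(metric, value):
    """Map one sample to 'ok', 'getting bad', or 'really bad'."""
    warn, crit = THRESHOLDS[metric]
    if value >= crit:
        return "really bad"
    if value >= warn:
        return "getting bad"
    return "ok"


def sustained(metric, samples, level="getting bad", needed=3):
    """True only if the last `needed` samples are all at `level` or worse.

    A single spike does not fire the alarm, which helps with false positives.
    """
    if len(samples) < needed:
        return False
    floor = LEVELS.index(level)
    recent = samples[-needed:]
    return all(LEVELS.index(classify(metric, v)) >= floor for v in recent)
```

For example, `sustained("disk_latency_ms", [5, 60, 5])` is false because one spike isn’t a trend, while three elevated samples in a row would fire.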
If it’s a group effort everybody should see the data and the alarms. You’ll get notes like “oh, ignore that – that disk latency alarm was me adjusting the mirroring on the arrays.” Honest notes like that save hundreds of hours of staff time a year, at least where I work. And no witch hunts, either. You know who the witch was. Fix it, then either change things so it never happens again or accept that bad stuff happens from time to time, and move on.
Advance warning works other ways, too. Figure out a way to get projects to tell you in advance when they’re going to ask for 20 new VMs, each with 4 vCPUs and 16 GB of RAM. Get your purchasing people to cc: you. Have projects slow down their deployments a little so there’s time to react if their sizing is off. Help your app admins do testing. Insist on real numbers. Be honest with them. Help them be honest with you.
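The arithmetic on a request like that is worth doing out loud: 20 VMs at 4 vCPUs and 16 GB each is 80 vCPUs and 320 GB of RAM. A quick Python sketch of the check follows. The class and function names are my own, and the free-capacity figures you pass in are whatever your environment actually reports; it also ignores overcommit, which is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class VMRequest:
    """A project's request for a batch of identically sized VMs."""
    count: int
    vcpus_each: int
    ram_gb_each: int

    def total_vcpus(self):
        return self.count * self.vcpus_each

    def total_ram_gb(self):
        return self.count * self.ram_gb_each


def shortfall(request, free_vcpus, free_ram_gb):
    """How far the request exceeds free capacity (0 means it fits).

    Assumes no overcommit; adjust if your environment allows it.
    """
    return {
        "vcpus": max(0, request.total_vcpus() - free_vcpus),
        "ram_gb": max(0, request.total_ram_gb() - free_ram_gb),
    }
```

If `shortfall` comes back nonzero, that’s the number to take to purchasing before the project shows up, not after.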
I’m not saying any of this is easy. In fact, sometimes it really stinks, especially the parts where you ‘fess up after you’ve done something dumb. But transparency is amazing, honesty is liberating, and participation is empowering for everybody. When done right you spend your days writing email about the cool things you’ve done, the new features and speed, and the money you saved… not how something isn’t your fault.