There’s been a fair amount of commentary & impatience from IT staff as we wait for vendors to patch their products for the OpenSSL Heartbleed vulnerability. Why don’t they hurry up? They’ve had 10 days now, what’s taking so long? How big of a deal is it to change a few libraries?
Perhaps, to understand this, we need to consider how software development works.
The Software Development Life Cycle
To understand why vendors take a while to do their thing we need to understand how they work. In short, there are a few different phases they work through when designing a new system or responding to bug reports.
Requirement Analysis is where someone figures out precisely what the customer wants and what the constraints are, like budget. It’s a lot of back & forth between stakeholders, end users, and the project staff. In the case of a bug report, like “OMFG OPENSSL LEAKING DATA INTERNET HOLY CRAP” the requirements are often fairly clear. Bugs aren’t always clear, though, which is why you sometimes get a lot of questions from support guys.
Design is where the technical details of implementation show up. The project team takes the customer requirements and turns them into a technical design. In the case of a bug the team figures out how to fix the problem without breaking other stuff. That’s sometimes a real art. Read bugs filed against the kernel in Red Hat’s Bugzilla if you want to see guys try very hard to fix problems without breaking other things.
Implementation is where someone sits down and codes whatever was designed, or implements the agreed-upon fix.
The testing phase can be a variety of things. For new code it’s often it’s full system testing, integration testing, and end-user acceptance testing. But if this is a bug, the testing is often Quality Assurance. Basically a QA team is trying to make sure that whoever coded a fix didn’t introduce more problems along the way. If they find a problem, called a regression, they work with the Engineering team to get it resolved before it ships.
Evolution is basically just deploying what was built. For software vendors there’s a release cycle, and then the process starts again.
So what? Why can’t they just fix the OpenSSL problem?
The problem is that in an organization with a lot of coders, a sudden need for an unplanned release really messes with a lot of things, short-circuiting the requirements, design, and implementation phases and wreaking havoc in testing.
Using this fine graphic I’ve borrowed from a Git developer we can get an idea of how this happens. In this case there’s a “master” branch of the code that customer releases are done from. Feeding that, there’s a branch called “release” that is likely owned by the QA guys. When the developers think they’re ready for a release they merge “develop” up into “release” and QA tests it. If it is good it moves on to “master.”
Developers who are adding features and fixing bugs create their own branches (“feature/xxx” etc.) where they can work, and then merge into “develop.” At each level there’s usually senior coders and project managers acting as gatekeepers, doing review and managing the flow of updates. On big code bases there are sometimes hundreds of branches open at any given time.
So now imagine that you’re a company like VMware, and you’ve just done a big software release, like VMware vSphere 5.5 Update 1, that has huge new functionality in it (VSAN). There’s a lot of coding activity against your code base because you’re fixing new bugs that are coming in. You’re probably also adding features, and you’re doing all this against multiple major versions of the product. You might have had a plan for a maintenance release in a couple of months, but suddenly this OpenSSL thing pops up. It’s such a basic system library that it affects everything, so everybody will need to get involved at some level.
On top of that, the QA team is in hell because it isn’t just the OpenSSL fix that needs testing. A ton of other stuff was checked in, and is in the queue to be released. But all that needs testing, first. And if they find a regression they might not even be able to jettison the problem code, because it’ll be intertwined with other code in the version control system. So they need to sort it out, and test more, and sort more out, and test again, until it works like it should. The best way out is through, but the particular OpenSSL fix can’t get released until everything else is ready.
This all takes time, to communicate and resolve problems and coordinate hundreds of people. We need to give them that time. While the problem is urgent, we don’t really want software developers doing poor work because they’re burnt out. We also don’t want QA to miss steps or burn out, either, because this is code that we need to work in our production environments. Everybody is going to run this code, because they have to. If something is wrong it’ll create a nightmare for customers and support, bad publicity, and ill will.
So let’s not complain about the pace of vendor-supplied software updates appearing, without at least recognizing our hypocrisy. Let’s encourage them to fix the problem correctly, doing solid QA and remediation so the problem doesn’t get worse. Cut them some slack for a few more days while we remember that this is why we have mitigating controls, and defense-in-depth. Because sometimes one of the controls fails, for an uncomfortably long time, and it’s completely out of our control.
 This is 100% speculative, while I have experience with development teams I have no insight into VMware or IBM or any of the other companies I’m waiting for patches from.
Comments on this entry are closed.