I was just taking a break and reading some tech news and I saw a wonderfully detailed post from El Reg (link below) about an Intel CPU design flaw and impending crisis-level security updates to fix it. As if that wasn’t bad enough, the fix for the problem is estimated to decrease performance by 5% to 30%, with older systems being the hardest hit.
Welcome to 2018, folks.
In short, an Intel CPU tries to keep itself busy by speculating about what it’s going to need to work on next. On Intel CPUs (but not AMD) this speculative execution doesn’t properly respect the security boundaries between the OS kernel and userspace applications, so you can trick an Intel processor into letting you read memory you shouldn’t have access to. That’s a big problem because that memory could hold encryption keys & other secrets, virtual machines, anything.
So what? Here are my thoughts:
- All of our systems effectively just got more expensive. Put another way, we are all about to lose 5-30% of the capacity we paid for, if it’s built on Intel hardware. That includes network switches, storage arrays, traditional servers, everything.
- I’m guessing there’s a class-action lawsuit in the works already against Intel, if only to establish whose fault this is (not Dell, HP, etc. but Intel’s).
- We don’t know the effects of these updates yet, insofar as whether the performance hit will be global, just to CPU or memory, just to I/O, or some mix. We also don’t know how workloads will react to this. If you don’t have a proper test and/or QA environment you’re going to fly by the seat of your pants for a bit.
- What we can surmise, though, is that all system benchmarks are now null & void. This is an epoch, the great extinction of performance data from vendors. As of right now any sizing or performance data offered by a vendor needs to meet with questions around when that data was gathered, what OS levels & patches, and probably should have some written guarantees in the contract.
- If you have a system or application that’s Intel-based and within 30% of “full” you probably should start thinking about your options, especially if it’s on older hardware.
- If you aren’t collecting performance data from your systems you should get that going. There are lots of options, from established vendors like SolarWinds and newcomers like Uila to open-source tools like Observium. Historical performance data is essential for assessing a situation like this, as well as for system sizing and troubleshooting.
- Microsoft has announced that Azure instances will be rebooted on January 10, 2018. AWS is dancing around the same message. They don’t have live migration, like vMotion, so it’s a huge deal when they decide to fix something like this. The speed and scope of the reaction should tell you how important this is. It also should demonstrate how helpful things like vMotion are in a VMware vSphere environment, where you’ll be able to update the infrastructure without taking applications down. Yes, in an ideal world applications are built to not care, but very few of the world’s companies have their systems set up that way (and that’s a discussion for the comments or over a beer).
- Remember that the public cloud will take a performance hit, too. Yet one more way the public cloud DOESN’T actually help IT. At least a SaaS application means it’s someone else’s problem, though.
- Companies that don’t patch won’t have a problem with this, but that’s gross criminal negligence (e.g. Equifax) and should be the subject of whistleblowing action from here on out. Companies that do patch are getting screwed, of course, but this is solid due diligence and part of the cost of doing business. Truth is, regular patching is the #1 way to prevent security problems, but defense-in-depth is equally important (multiple other security controls that can help mitigate a problem like this until you figure out what you’re going to do to fix it). This update isn’t going to be avoidable for long, so you might as well suck it up and deal with it.
- I’d bet HPC/supercomputing folks won’t apply this update, but hopefully they have an understanding of their workloads and defense-in-depth. Losing even 5% of a system like TACC’s Stampede would hurt. Seen another way, Intel’s insecure design practices just made things like cancer research 5-30% slower.
- If you don’t take snapshots or image-level backups, now might be the time to try it, so you can roll things back quickly. Remember, though, that snapshots are a performance hit on their own. Rolling back the OS patches might be acceptable, too. The point is to have an answer to the question “how do we go back to the way things were after this patch is applied?” You might need to buy yourself some time to cope with these updates.
- AMD is probably going to try to make hay here, because they’re not affected. However, AMD systems have classically had problems of their own, such as bugs that ended up disabling all L3 cache. There’s no high ground to be occupied by them. As always, insist on actual performance data around vendor promises, and insist that those promises get documented, preferably in contractual form.
- Sysadmins are merely the messengers here, but we need to begin communicating this problem to the business around us. Our managers, VPs, CTOs, CIOs, everybody. This is an all-hands issue. The effect on IT is clear, but if we get ahead of it with our management stacks it’ll demonstrate our competence & security-mindedness. It’ll also clear the path for when we ask to buy something to cope with a capacity hit of up to 30%.
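If you want to put rough numbers on the capacity points above, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope sketch, not a sizing tool: it assumes the performance loss applies uniformly to whatever resource is your bottleneck (which, as noted, nobody knows yet), and the function names are made up for illustration.

```python
def cost_increase_per_unit(perf_loss: float) -> float:
    """Fractional rise in cost per unit of remaining capacity.

    Assumption: the hardware cost is fixed and capacity shrinks by
    perf_loss (0.05 for 5%, 0.30 for 30%).
    """
    if not 0 <= perf_loss < 1:
        raise ValueError("perf_loss must be in [0, 1)")
    return 1 / (1 - perf_loss) - 1


def utilization_after_patch(current_util: float, perf_loss: float) -> float:
    """Projected utilization once the same workload runs on less capacity.

    Assumption: the workload itself does not change, only the capacity.
    """
    if not 0 <= perf_loss < 1:
        raise ValueError("perf_loss must be in [0, 1)")
    return current_util / (1 - perf_loss)


def needs_attention(current_util: float, perf_loss: float = 0.30) -> bool:
    """True if the projected utilization reaches or exceeds 100%."""
    return utilization_after_patch(current_util, perf_loss) >= 1.0


if __name__ == "__main__":
    # Hypothetical inputs: the 5% and 30% bounds from the estimate,
    # and a few made-up utilization levels.
    for loss in (0.05, 0.30):
        pct = cost_increase_per_unit(loss) * 100
        print(f"{loss:.0%} slower -> {pct:.1f}% more cost per unit of work")
    for util in (0.50, 0.65, 0.75):
        projected = utilization_after_patch(util, 0.30)
        flag = "look at options" if needs_attention(util) else "ok"
        print(f"{util:.0%} busy today -> {projected:.0%} after a 30% hit ({flag})")
```

Note that a 30% capacity loss works out to more than a 30% increase in cost per unit of work, because you divide the same spend by a smaller number. Feed it utilization figures from your own monitoring rather than guesses.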
As always, good luck.
Update (2018/01/03): Should we panic about the KPTI/KAISER Intel CPU design flaw?