As a follow-up to yesterday’s post, I’ve been asked: should we panic about the KPTI/KAISER/F*CKWIT Intel CPU design flaw?
My answer was: it depends on a lot of unknowns. There are NDAs around a lot of the fixes, so it’s hard to know their scope and effect. We also don’t know how much this will affect particular workloads. The folks over at Sophos have a nice writeup today about the actual problem (link below), but in short, the fix will reduce the effectiveness of the CPU’s speculative execution and on-die caches, forcing it to go out to main memory more often. Main memory (what we call RAM) is roughly 20x slower than the CPU’s L2 cache (look below for a good link showing the speed/latency differences between computer components). How that affects driver performance, workloads, I/O, and so on is hard to tell right now.
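To put rough numbers behind that gap, here’s a back-of-the-envelope comparison using the commonly cited ballpark latency figures. The exact values vary by CPU model and generation, so treat the ratio as an order of magnitude, not a measurement of any particular chip:

```python
# Ballpark access latencies in nanoseconds, from the commonly cited
# "latency numbers every programmer should know" figures. Real values
# vary by CPU model and generation; these are rough estimates only.
L1_CACHE_NS = 1.0
L2_CACHE_NS = 7.0
MAIN_MEMORY_NS = 100.0

print(f"Main memory vs L2 cache: roughly {MAIN_MEMORY_NS / L2_CACHE_NS:.0f}x slower")
print(f"Main memory vs L1 cache: roughly {MAIN_MEMORY_NS / L1_CACHE_NS:.0f}x slower")
```

Depending on whose figures you use, the L2-to-RAM gap lands somewhere in the 15-20x range, which is why every extra trip to main memory hurts.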
Here’s what I think, based on my experience with stuff like this:
First, there are some people out there with gaming benchmarks saying there’s no performance impact. They’re benchmarking the wrong thing, though. This isn’t about GPUs; it’s about CPUs, and the frame rate they can get while killing each other online is mostly determined by the graphics card, not the processor.
If you use physical servers that are only accessed by a trusted team, and you have excess capacity, you should remain calm. Doubly so if you have a test environment and/or can simulate production workloads. Don’t panic; apply your security updates according to your regularly scheduled process.
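When those updates do land, it helps to be able to confirm that a host actually picked up the page table isolation fix. Here’s a minimal sketch, assuming a Linux host with a kernel new enough to expose the /sys/devices/system/cpu/vulnerabilities/ files; older patched kernels only mention the mitigation in the boot log, so this just reports “unknown” there:

```python
#!/usr/bin/env python3
"""Rough check for the Meltdown/KPTI mitigation status on a Linux host.

Sketch only: assumes a kernel new enough to expose the sysfs
vulnerabilities interface. Older patched kernels report the mitigation
in the boot log ("Kernel/User page tables isolation: enabled") instead.
"""
from pathlib import Path

MELTDOWN_SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")


def meltdown_status() -> str:
    if MELTDOWN_SYSFS.exists():
        return MELTDOWN_SYSFS.read_text().strip()
    return "unknown (kernel does not expose mitigation status via sysfs)"


if __name__ == "__main__":
    print(f"Meltdown mitigation: {meltdown_status()}")
```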
If you own virtual infrastructure and your company is the only user of it, meaning everything from the hardware to the applications is run by the same trusted group of admins, don’t panic. Plan to use your normal patching process for both the hypervisor and the workloads, but keep in mind that there might be a loss of performance.
If you own virtual infrastructure and there are workloads on it that are outside of your control, you will need to set yourself up to respond quickly when the patches are released. I wouldn’t panic, but you’re going to need to move faster than usual. I’d be putting a plan together right now for testing and deployment, both for the hypervisors and for the workloads you do control, prioritizing the hypervisors. Keep the potential loss of performance in mind. I might plan to start with a smaller cluster and work my way up to larger ones. I’d also be warning staff about some extra work coming up, and warning other projects that something is happening and timelines might change a bit.
If you use the public cloud, I’d be looking up the Azure, AWS, and Google Compute Engine notices about this problem and seeing whether your workloads will be forcibly rebooted in the near future. I’d also make plans to patch your virtual machines, keeping in mind the possible loss of performance depending on your instance type.
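As an example of what checking for forced reboots can look like, here’s a minimal sketch that lists pending scheduled events on EC2 using boto3. It assumes boto3 is installed and your AWS credentials and region are already configured; Azure and Google Compute Engine have their own equivalents, and the console dashboards show the same information:

```python
#!/usr/bin/env python3
"""List EC2 instances with pending scheduled events (e.g. forced reboots).

Sketch only: assumes boto3 is installed and AWS credentials/region are
configured in the environment. Pagination is omitted for brevity.
"""
import boto3

ec2 = boto3.client("ec2")

# IncludeAllInstances=True reports stopped instances as well as running ones.
response = ec2.describe_instance_status(IncludeAllInstances=True)

for status in response["InstanceStatuses"]:
    for event in status.get("Events", []):
        print(f"{status['InstanceId']}: {event['Code']} "
              f"({event.get('Description', 'no description')}), "
              f"not before {event.get('NotBefore')}")
```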
If you use containers, I’d make sure your base images are all patched once the patches are released. The same goes for template VMs, unless you have a process to bring them up to date immediately upon deployment or you build VMs dynamically.
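A minimal sketch of what rebuilding on top of patched base images can look like with Docker, assuming the docker CLI is installed; the base image names and build contexts below are hypothetical placeholders, so substitute your own:

```python
#!/usr/bin/env python3
"""Re-pull base images and rebuild local images on top of the patched bases.

Sketch only: the image names and build contexts are hypothetical
placeholders. Assumes the docker CLI is installed and on the PATH.
"""
import subprocess

# Hypothetical base images your Dockerfiles build FROM.
BASE_IMAGES = ["ubuntu:16.04", "alpine:3.7"]

# Hypothetical (image tag, build context) pairs for your own images.
REBUILDS = [
    ("myorg/api:latest", "./api"),
    ("myorg/worker:latest", "./worker"),
]


def run(cmd):
    """Print and run a command, raising if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


for image in BASE_IMAGES:
    run(["docker", "pull", image])  # fetch the patched base layer

for tag, context in REBUILDS:
    # --pull makes sure the build uses the freshly pulled base image.
    run(["docker", "build", "--pull", "-t", tag, context])
```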
I would stop trusting all internet-supplied VM appliances and container images until they have documented updates. If you didn’t build it yourself, you don’t know it’s safe.
In all of these scenarios, I’d be doing some basic capacity planning so you have a baseline to compare against, auditing to make sure that applications are patched, and auditing firewall rules and access controls.
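For the capacity-planning piece, even a crude before-and-after sample beats having nothing to compare against. Here’s a minimal sketch using the third-party psutil package (an assumption on my part; in practice you’d feed this into whatever monitoring system you already run):

```python
#!/usr/bin/env python3
"""Capture a crude pre-patch performance baseline to compare against later.

Sketch only: assumes the psutil package is installed (pip install psutil).
"""
import csv
import time

import psutil

SAMPLES = 60           # number of samples to take
INTERVAL_SECONDS = 5   # seconds between samples

with open("baseline.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["timestamp", "cpu_percent", "mem_percent",
                     "disk_read_bytes", "disk_write_bytes"])
    for _ in range(SAMPLES):
        disk = psutil.disk_io_counters()
        writer.writerow([
            time.time(),
            # cpu_percent blocks for the interval, which paces the loop.
            psutil.cpu_percent(interval=INTERVAL_SECONDS),
            psutil.virtual_memory().percent,
            disk.read_bytes,
            disk.write_bytes,
        ])
```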
As the British say, keep calm and carry on. Good luck.