We’re all at least roughly familiar with Fault Tolerance, a feature VMware added in vSphere 4 to establish a mirrored VM on a secondary host. It’s kind of like RAID 1 for VMs. To do this, Fault Tolerance records the inputs to a primary VM and then replays them on the secondary VM to achieve the same results.
There are two important and somewhat subtle points here that help us understand why Fault Tolerance is limited to one CPU.
First, the process records the inputs, not the state of the PC after the inputs happen. If you move the mouse on the primary VM, the mouse moves on the secondary VM in exactly the same fashion. If you ping the primary VM, the same ping is replayed on the secondary. The process isn’t mirroring the state of the VMs; it’s just mirroring the inputs in order to make the state identical.
Second, in order to guarantee that both VMs are identical, the process needs to be deterministic: a 100% probability that the same outcome will occur when an input happens. The inputs need to achieve exactly the same end results on both VMs. If you can’t guarantee that identical inputs on two identical VMs will yield exactly the same machine state, then you have failed. If you don’t have two identical VMs, future actions against the secondary VM can’t be guaranteed to work right.
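To make those two points concrete, here is a toy sketch in Python. It is not how VMware implements Fault Tolerance; `ToyVM`, `record`, and `replay` are hypothetical names invented for illustration, and a hash stands in for "machine state". It only shows the idea: log the inputs on the primary, feed the same log to the secondary, and a deterministic machine lands in the same state.

```python
import hashlib


class ToyVM:
    """Hypothetical stand-in for a VM whose state changes only via inputs."""

    def __init__(self):
        self.state = hashlib.sha256(b"boot").digest()

    def apply(self, event: bytes) -> None:
        # Deterministic transition: same state + same input -> same next state.
        self.state = hashlib.sha256(self.state + event).digest()


def record(vm, events):
    """Apply events to the primary and return the log of inputs seen."""
    log = []
    for event in events:
        log.append(event)
        vm.apply(event)
    return log


def replay(vm, log):
    """Feed a recorded input log to another VM, in the same order."""
    for event in log:
        vm.apply(event)


primary, secondary = ToyVM(), ToyVM()
log = record(primary, [b"mouse:10,20", b"key:a", b"ping:seq=1"])
replay(secondary, log)
in_sync = primary.state == secondary.state

# Same inputs in a different order: the states no longer match.
reordered = ToyVM()
replay(reordered, list(reversed(log)))
diverged = reordered.state != primary.state
```

Notice that only the three inputs cross the wire, never the state itself. Notice also that the whole scheme falls apart the moment `apply` stops being deterministic, or the inputs arrive in a different order.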
To really consider the problem of determinism you can think about having two desktop PCs right next to each other. You boot each of them. You start Microsoft Word on each of them, in exactly the same way. You minimize Word and start Excel. Everything is lock-step, everything is identical. Right?
Go deeper than the user interface, though. When you started Word on both PCs did Word end up using exactly the same blocks of memory on both PCs? Does the CPU cache on both PCs contain exactly the same contents? Did the operating system execute Word on exactly the same CPU cores? Did the hardware MMU interleave the memory accesses in exactly the same manner?
Probably not. While on the surface they look identical, the computer hardware probably did things differently at a very fundamental level, based on timing, random events, and subtle variations between the two machines. That’s non-determinism: you did the same thing twice and got two different results. CPUs are very complex, using lots of techniques like branch prediction, speculation, and out-of-order execution to get work done as fast as possible. Chipsets have hardware memory management units (MMUs) that independently handle the storing & retrieving of data into cache and RAM, interleaving data between different DIMMs and CPUs for speed. And operating systems themselves have complex CPU schedulers that move workloads around between cores, deciding what to do based on a lot of different rules. The more independent a subsystem is, the more components it has, the more non-deterministic it is.
So, if you want to build a system that’s predictable you need to avoid all of the things that cause non-determinism. Use only one CPU, so there’s no question what core you’re executing on or what the OS scheduler did (it placed you on the only CPU it knows about, duh!). And shut the hardware MMU off, so you can use a predictable (but slower) software one. These things are exactly what Fault Tolerance does.
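You can see a small-scale version of this with ordinary threads. The following is a minimal Python sketch, with `run_once` as a made-up helper, that logs the order in which workers touch a shared list. It says nothing about what vSphere does internally; it just illustrates why one worker is predictable and several are not.

```python
import threading


def run_once(workers, steps=1000):
    """Return the order in which `workers` threads append to a shared log."""
    log = []

    def work(name):
        for i in range(steps):
            log.append((name, i))  # list.append is thread-safe in CPython

    threads = [threading.Thread(target=work, args=(n,)) for n in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


# One worker: the order is fixed by the program, so every run matches.
single_matches = run_once(1) == run_once(1)

# Two workers: the same work gets done, but the interleaving is up to the
# scheduler, so two runs may or may not produce the same order.
a, b = run_once(2), run_once(2)
same_work = sorted(a) == sorted(b)
same_order = a == b
```

With one worker, `single_matches` is always true. With two, `same_work` is always true but `same_order` is not something you can count on, and that "can't count on it" is exactly the problem for a lock-step secondary VM.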
So why couldn’t you just copy the machine state across the network, and skip all this deterministic/non-deterministic stuff? Well, it’d be really, really slow, because you’d need to trap and record an incredible amount of information. You’d also need an incredible amount of bandwidth to transport it. For example, 1 Hz means that something changes once a second. 1 MHz means one million cycles per second. 1 GHz means one billion cycles per second. A 10-core Intel E7 CPU running flat-out at around 2 GHz is making, at the least, 20 billion changes a second (potentially more given hyperthreading, etc.). On top of that is all the independent work the hardware MMU is doing to assist the CPU. Good luck copying all that. To create my own Yogi Berra-ism, even if it was possible it’d be impossible. That’s why VMware engineers chose the route of mirroring only the inputs, and controlling the environment. Way less work to do, and a product that actually shipped and gets used.
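Here is the back-of-the-envelope version of that math in Python. The numbers are assumptions, not measurements: a 2 GHz clock (which is what 20 billion changes across 10 cores implies), 8 bytes to describe each change, and a 10 Gbit/s link between hosts.

```python
CORES = 10
CLOCK_HZ = 2_000_000_000             # assumed: 2 GHz per core
BYTES_PER_CHANGE = 8                 # assumed: one 64-bit value per change
LINK_BITS_PER_SEC = 10_000_000_000   # assumed: a 10 Gbit/s network link

changes_per_sec = CORES * CLOCK_HZ
bytes_per_sec = changes_per_sec * BYTES_PER_CHANGE
link_bytes_per_sec = LINK_BITS_PER_SEC // 8
links_needed = bytes_per_sec / link_bytes_per_sec

print(f"{changes_per_sec:,} changes per second")
print(f"{bytes_per_sec / 1e9:.0f} GB/s to ship them")
print(f"{links_needed:.0f} saturated 10 Gbit/s links")
```

Under those assumptions you would need on the order of a hundred fully saturated links just to carry the raw changes, before counting any of the work needed to trap them in the first place.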
Hopefully by now you see why SMP under Fault Tolerance is a hard problem to solve. Too many moving parts and not enough predictability & determinism. Uniprocessor operations were hard enough, and needed support built into CPUs by Intel and AMD to make Fault Tolerance usably fast. The same is going to be true of multiprocessor support. People at Intel and AMD are thinking about the problem, and not just because virtualization could use it. Anybody who does multi-threaded debugging could benefit from hardware-level assistance. It’s possible we’ll see something in the medium term, but for now, if you need SMP and fault tolerance, you need a cluster of some sort.
I’ve found my understanding of CPUs and operating system concepts to be very helpful in my career in IT. If you’re interested in these topics there are some additional links I’d recommend:
- A good introduction to a lot of these topics is ExtremeTech’s “PC Processor Microarchitecture” which, though 10 years old, has good explanations of some of the basics of CPU microarchitecture. Every newer CPU makes it even more complex, so it’s good to start with their examples of the Pentium 4.
- The Wikipedia entries on operating system CPU schedulers and superscalar CPU architecture are also good places to start, and branch out to things like pipelines and MMUs, and maybe even interrupts and DMA.
- “Virtues and Obstacles of Hardware-assisted Multi-processor Execution Replay” is a paper presented at USENIX HotPar10 by Intel engineers that talks about these topics and what can be done to assist. It’s interesting to see how they’re thinking about these topics, even if parts of the paper amount to a big academic shrug.
As always, if I was unclear or incorrect about something, leave me a comment. Thanks!