We’re all at least roughly familiar with Fault Tolerance, a feature VMware added to vSphere 4 to establish a mirrored VM on a secondary host. It’s kind of like RAID 1 for VMs. To do this, Fault Tolerance records the inputs to a primary VM, and then replays them on the secondary VM to achieve the same results.
There are two important and somewhat subtle points here that help us understand why Fault Tolerance is limited to one CPU.
First, the process records the inputs, not the state of the VM after the inputs happen. If you move the mouse on the primary it moves the mouse on the secondary VM in exactly the same fashion. If you ping the primary VM it pings the secondary. The process isn’t mirroring the state of the VMs, it’s just mirroring the inputs in order to make the state identical.
Second, in order to guarantee that both VMs are identical the process needs to be deterministic — 100% probability that the same outcome will occur when an input happens. The inputs need to achieve exactly the same end results on both VMs. If you can’t guarantee that identical inputs on two identical VMs will yield exactly the same machine state then you have failed. If you don’t have two identical VMs future actions against the secondary VM can’t be guaranteed to work right.
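To make that concrete, here’s a tiny sketch in Python. It’s purely illustrative (a toy state machine of my own invention, nothing like VMware’s actual implementation): the “VM” state is a pure function of the inputs it has seen, so replaying the primary’s input log on the secondary reproduces exactly the same state.

```python
# A toy "VM" whose state is a pure function of the inputs it has seen.
# Purely illustrative; nothing like VMware's actual implementation.
class ToyVM:
    def __init__(self):
        self.state = 0

    def apply(self, event):
        # Deterministic transition: the same event applied to the same
        # state always produces the same new state.
        self.state = self.state * 31 + event

primary = ToyVM()
secondary = ToyVM()

inputs = [3, 7, 42, 7]   # the "mouse moves" and "pings"
log = []

for event in inputs:      # primary executes each input and records it...
    primary.apply(event)
    log.append(event)

for event in log:         # ...secondary replays only the recorded inputs
    secondary.apply(event)

# Because apply() is deterministic, replaying the log reproduces the
# exact same state. If apply() consulted a clock, a random number, or
# thread timing, this assertion could fail.
assert primary.state == secondary.state
```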
To really consider the problem of determinism you can think about having two desktop PCs right next to each other. You boot each of them. You start Microsoft Word on each of them, in exactly the same way. You minimize Word and start Excel. Everything is lock-step, everything is identical. Right?
Go deeper than the user interface, though. When you started Word on both PCs did Word end up using exactly the same blocks of memory on both PCs? Does the CPU cache on both PCs contain exactly the same contents? Did the operating system execute Word on exactly the same CPU cores? Did the hardware MMU interleave the memory accesses in exactly the same manner?
Probably not. While on the surface they look identical, the computer hardware probably did things differently at a very fundamental level, based on timing, random events, and subtle variations between the two machines. That’s non-determinism: you did the same thing twice and got two different results. CPUs are very complex, using lots of techniques like branch prediction, speculation, and out-of-order execution to get work done as fast as possible. Chipsets have hardware memory management units (MMUs) that independently handle the storing & retrieving of data into cache and RAM, interleaving data between different DIMMs and CPUs for speed. And operating systems themselves have complex CPU schedulers that move workloads around between cores, deciding what to do based on a lot of different rules. The more independent a subsystem is, the more components it has, the more non-deterministic it is.
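You can see scheduler-driven non-determinism for yourself with a few lines of Python (again, a toy of my own, not anything FT-related): two threads record which of them got to run at each step, and the interleaving usually comes out differently every time you run the same program.

```python
import threading

trace = []  # shared log of which thread ran at each step

def worker(name):
    for _ in range(5):
        sum(range(500_000))   # a little busy work so the scheduler actually interleaves us
        trace.append(name)

threads = [threading.Thread(target=worker, args=(n,)) for n in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Same program, same machine, but the interleaving typically differs
# from run to run: "ABABABABAB" one time, "AABABBABAB" the next.
print("".join(trace))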
So, if you want to build a system that’s predictable you need to avoid all of the things that cause non-determinism. Use only one CPU, so there’s no question what core you’re executing on or what the OS scheduler did (it placed you on the only CPU it knows about, duh!). And shut the hardware MMU off, so you can use a predictable (but slower) software one. These things are exactly what Fault Tolerance does.
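Continuing the toy example above, the fix is the same one Fault Tolerance uses: take away the choices. Confine the same work to a single thread of execution and there’s no interleaving left to predict.

```python
# Same toy workload, confined to one thread of execution: with only one
# runner there is nothing for a scheduler to reorder, so every run
# produces exactly the same trace.
trace = []
for name in ("A", "B"):
    for _ in range(5):
        trace.append(name)

print("".join(trace))   # always "AAAAABBBBB"
```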
So why couldn’t you just copy the machine state across the network, and skip all this deterministic/non-deterministic stuff? Well, it’d be really, really slow, because you’d need to trap and record an incredible amount of information. You’d also need an incredible amount of bandwidth to transport it. For example, 1 Hz means that something changes once a second. 1 MHz means one million cycles per second. 1 GHz means one billion cycles per second. A 10-core Intel E7 CPU running flat-out is making, at the least, 20 billion changes a second (potentially more given hyperthreading, etc.). On top of that is all the independent work the hardware MMU is doing to assist the CPU. Good luck copying all that. To create my own Yogi Berra-ism, even if it was possible it’d be impossible. That’s why VMware engineers chose the route of mirroring only the inputs, and controlling the environment. Way less work to do, and a product that actually shipped and gets used.
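Here’s a rough back-of-the-envelope version of that bandwidth problem, with numbers I’m making up purely for scale (the core count, clock speed, and bytes per change are all assumptions):

```python
# Rough, made-up numbers just to show the scale of shipping raw machine
# state instead of inputs. Core count, clock, and bytes-per-change are
# all assumptions for illustration.
cores = 10
clock_hz = 2.0e9          # ~2 GHz per core
bytes_per_change = 8      # pretend each cycle touches one 64-bit value

state_bytes_per_sec = cores * clock_hz * bytes_per_change
print(f"{state_bytes_per_sec / 1e9:.0f} GB/s of state changes")  # ~160 GB/s

ten_gige_bytes_per_sec = 10e9 / 8  # a 10 GbE link moves ~1.25 GB/s
print(f"~{state_bytes_per_sec / ten_gige_bytes_per_sec:.0f}x what a 10 GbE link can carry")
```

Even with generous assumptions you’re a couple of orders of magnitude past the network, and that’s before counting whatever the MMU is doing on the side.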
Hopefully by now you see why SMP under Fault Tolerance is a hard problem to solve. Too many moving parts and not enough predictability & determinism. Uniprocessor operations were hard enough, and needed support built into CPUs by Intel and AMD to make Fault Tolerance usably fast. The same is going to be true of multiprocessor support. People at Intel and AMD are thinking about the problem, and not just because virtualization could use it. Anybody who does multi-threaded debugging could benefit from hardware-level assistance. It’s possible we’ll see something in the medium-term, but for now if you need SMP and fault tolerance you need a cluster of some sort.
I’ve found my understanding of CPUs and operating system concepts to be very helpful in my career in IT. If you’re interested in these topics there are some additional links I’d recommend:
- A good introduction to a lot of these topics is ExtremeTech’s “PC Processor Microarchitecture” which, though 10 years old, has good explanations of some of the basics of CPU microarchitecture. Every newer CPU makes it even more complex, so it’s good to start with their examples of the Pentium 4.
- The Wikipedia entries on operating system CPU schedulers and superscalar CPU architecture are also good places to start, and branch out to things like pipelines and MMUs, and maybe even interrupts and DMA.
- “Virtues and Obstacles of Hardware-assisted Multi-processor Execution Replay” is a paper presented at USENIX HotPar10 by Intel engineers that talks about these topics and what can be done to assist. It’s interesting to see how they’re thinking about these topics, even if parts of the paper amount to a big academic shrug.
As always, if I was unclear or incorrect with something leave me a comment. Thanks!
Face melting goodness, Bob! Great post to keep in my back pocket the next time I hear a snotty comment about VMware not having SMP FT yet. It’s because it’s freaking hard, people!!
Fantastic stuff, the kind of ‘under the bonnet’ explanation that will make my courses (I’m a VCI) so much more informative!
Perfectly good article!
Just an idea: why would it be a problem if the secondary VM starts a process on another core (or even using another scheduler)?
I think the main ‘purpose’ of FT is to failover the primary VM. I believe all I/O output on the secondary VM is ‘captured’ and discarded (meaning Disk & NIC output). When the secondary, shadow VM gets activated, those 2 don’t care on which cores processes are started or which mem blocks are used, right?
Or am I missing something?
I believe you are missing something. 🙂 It’s less about the specific cores on the host and more about the cores available to the VM. Whatever happens to the primary VM needs to happen identically to the secondary, including which CPU a process runs on, placement in RAM, etc. With multiple CPUs, if the primary OS executes a process on core 2 there is no guarantee that the secondary will execute the process on core 2. That’s non-deterministic, and it needs to be deterministic.
And yes, when the shadow VM gets activated then nobody cares any more about determinism, but up until that point it’s very important in order to guarantee that the VM can take over exactly where the other left off.
Have you heard of Remus? This nut has been cracked in 2008. Do some research before you say whats hard and whats been done.
You could be less of an ass and leave a link to whatever Remus is. Bloggers are not omniscient and we don’t know everything that happens in the whole realm of IT computing. I now see it’s some research project add-on to Xen (which kinda fits with your rude comment, and why I wasn’t aware of it). Their papers say:
“Another significant drawback of deterministic replay as exemplified by Bressoud and Schneider’s work is that it does not easily extend to multi-core CPUs. The problem is that it is necessary, but difficult, to determine the order in which cores access shared memory.
…
While these approaches do make SMP deterministic replay possible, it is not clear if they make it feasible due to their high overhead, which increases at least linearly with the degree of concurrency. Our work sidesteps this problem entirely because it does not require deterministic replay.”
So they may not be able to guarantee that the secondary copy of the VM on the other host is identical to the primary, which, for many, is a problem. I don’t consider this nut to be cracked when the solution is not acceptable to many.