“I just read an article about an IBM shop that is planning a 50:1 virtualization ratio in their VMware environment,” I told two coworkers.
“What’s our ratio?”
“19:1 across our whole VMware environment.”
“Do we plan to get to 50:1?”
“No,” I replied.
“Why not?”
“Well, that’s just not the way we’re built. They went for fewer, larger servers. Some places go for lots of smaller machines. We’re sort of middle-of-the-road with four-way, dual quad-core machines. The bigger the building blocks in the environment, the more VMs you can put on them. But the more eggs you have in one basket, too.”
“Isn’t it easier to manage two huge machines instead of ten smaller ones?”
“Often, yes. In this case, though, what if something happens to one of them and you have 50 VMs die instead of 19? Plus, think of how that works with VMotion. With two huge machines you need 50% extra capacity on both, so each machine can absorb the load from the other. With five smaller machines you only need 20% free. Big machines carry price premiums, too. The money you pay for 50% extra capacity on a big machine can often buy you several smaller machines.”
“I suppose. Plus, big machines might have other problems, too, strange performance bottlenecks with all that load.”
“Exactly!” I exclaimed. “You’re right on. You can often avoid having to do strange things, or having strange bottlenecks, just by keeping things reasonable.”
“It would be cool to announce we got to 100:1 or something.”
“Yeah, but that’s like uptime wars. Sure, you haven’t rebooted in ten years, but your machine is Swiss cheese from a security point of view. Plus, those 50:1 guys actually put thought into it, based on their workloads. More power to them when it works. It isn’t quite right for us, though, at least right now.”
“So, people don’t reboot their machines for ten years? How often do we reboot?”
Oh brother.
It’s all a question of tradeoffs, isn’t it? Nice job breaking down the pros and cons of UPMC’s approach. FYI, you’re not alone in opting for a more modest consolidation ratio. Check out, for example, another article on the site, “Capacity planner limits VMware consolidation ratio.”
This guy running 19:1 on dual quad-core machines can’t be compared to the company running 50:1 without looking at the entire scope of the virtual infrastructure. The 50:1 guy doesn’t need 50% free capacity on a host if he’s running 30+ servers in a cluster, all at 50:1. He’s also not going to spend more money on hardware; TCO should go down with the purchase of larger servers at increased ratios, since larger servers mean a reduction in HBAs, SAN ports, network ports, power, etc.
I do believe there is a ratio “sweet spot” for every environment, and no two are alike, but for the 19:1 guy to say the 50:1 guy’s solution is not solid, and to make the assumptions he’s making without knowing the entire picture, is not fair.
This guy should remember that there are still plenty of shops running 1:1 ratios (non-VMware) that look at his 19:1 as risky and expensive. He should have explained to his boss that ratios are all about the size and scale of the virtual infrastructure, and that for their environment 19:1 works best financially and operationally.
Unfortunately, Rob, I don’t think I was totally clear on what I was saying, because you took a lot of time to comment, and you and I agree on a lot of this! 🙂 I wasn’t saying that the 50:1 folks need 50% spare capacity. Someone with two ESX servers that wants to fail their entire workload between them does, because each machine represents 50% of the capacity. If you have five servers each one represents 20% of the workload, so you only need 20% spare capacity in the rest of the cluster in order for that same work to continue.
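The headroom arithmetic in that last paragraph can be sketched in a few lines of Python. This is only an illustration of the math, not anything from the article; the function name and the equal-load-per-host assumption are mine:

```python
# Sketch, assuming each of n identical hosts carries an equal share of the
# total workload and the cluster must absorb the failure of any one host.

def spare_capacity_needed(n_hosts: int) -> float:
    """Fraction of total cluster capacity to keep free so the surviving
    hosts can pick up one failed host's workload."""
    if n_hosts < 2:
        raise ValueError("need at least two hosts to fail over")
    # One host's share of the total workload is 1/n; that share must fit
    # into the free capacity spread across the rest of the cluster.
    return 1 / n_hosts

for n in (2, 5, 10):
    print(f"{n} hosts: keep {spare_capacity_needed(n):.0%} of capacity free")
```

With two hosts that works out to 50% headroom on each, and with five hosts only 20% across the cluster, which is the point the post makes about smaller building blocks.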
There isn’t a universal sweet spot; it depends heavily on each environment. Some folks can do 50:1, I can do 19:1, some like 4:1 or 1:1. Like I said in the title, it isn’t a competition. 🙂 The real competition is on the business side of things: whether I can help my company work smarter so we can beat another company in the marketplace.
As for larger machines, it’s a trade-off between a lot of things. A big machine may saturate its Fibre Channel HBA more easily, forcing admins to do strange things to balance the load between multiple cards. That adds complexity which may not be worth it. In HA terms, if one big machine fails, more servers go down with it. But you also have bigger resource pools to handle more dynamic loads with less hassle, and fewer machines to administer, pay maintenance on, power, and cool. On the other hand, bigger machines are often more expensive than an equivalent amount of clustered smaller machines, sometimes by orders of magnitude depending on how big you get.
It’s all a tradeoff and based on what you and your organization are comfortable with.
So, do you have a rule of thumb for an HBA-to-VM ratio? Yes, it’s load-related, but as a rule of thumb for moderately I/O-loaded servers, what would you guys think?
Thanks,
-C