VMware vSphere 4.1: What's New

Once again we find ourselves staring at a major release of VMware infrastructure software: vSphere 4.1. It’s been a bit over a year since 4.0 dropped, with two big bugfix releases along the way. vSphere 4.1 adds over 150 new features and improvements, including some that were previewed at VMworld 2009 to much applause. Here are some of the highlights, twists, and turns.

Storage I/O Control:

This is a global, cluster-wide I/O scheduler that throttles I/O so a single VM cannot monopolize a datastore. If you consider that a datastore’s backend storage can only sustain a certain number of IOPS, it’s possible, and likely, that a single VM will consume a disproportionate share of them. This feature lets you control which VMs get priority when there is congestion.

This is enabled per-datastore, and by default it kicks in when latencies reach 30 ms or more, though you can adjust that threshold via advanced settings. The feature is configured through VMware vCenter, but the ESX hosts coordinate among themselves via shared data on each datastore, so the cluster can keep enforcing I/O control even if the management interfaces are inoperative.
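To make the idea concrete, here is a minimal, purely illustrative sketch of how a latency-triggered, share-based throttle behaves. This is not VMware’s actual SIOC algorithm; the VM names, share values, and queue sizes are invented, and only the 30 ms default trigger comes from the product.

```python
# Illustrative sketch only: a toy model of a latency-triggered, share-based
# throttle. Not VMware's actual SIOC implementation; VM names, share values,
# and queue depths below are made up.

CONGESTION_THRESHOLD_MS = 30  # default SIOC latency trigger in vSphere 4.1

def device_queue_slots(vms, observed_latency_ms, total_slots=64):
    """Divide a datastore's device queue among VMs by their disk shares,
    but only once observed latency crosses the congestion threshold."""
    if observed_latency_ms < CONGESTION_THRESHOLD_MS:
        return {vm: total_slots for vm in vms}  # no congestion: no throttling
    total_shares = sum(vms.values())
    return {vm: max(1, total_slots * shares // total_shares)
            for vm, shares in vms.items()}

# Hypothetical VMs with disk shares: the low-share VM gets squeezed under congestion.
vms = {"sql01": 2000, "web01": 1000, "batch-report": 500}
print(device_queue_slots(vms, observed_latency_ms=12))  # below 30 ms: everyone runs free
print(device_queue_slots(vms, observed_latency_ms=45))  # congested: shares decide
```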

Storage I/O Control is only available to Enterprise Plus customers.

Network I/O Control:

As host networking moves from multiple 1 Gbps Ethernet links toward one or two 10 Gbps links per host, it becomes more important that network operations get prioritized, so that no particular traffic type gets starved. Network I/O Control does this through a few different features.

First, it adds the ability to schedule and prioritize traffic by type. Six traffic types are identified: VM, management, NFS, iSCSI, vMotion, and FT logging. You can assign limits to these types to cap the absolute amount of bandwidth each may consume, and you can assign shares to set their relative priority when they contend. Limits and shares are enforced on egress only, meaning only outbound traffic from the host is throttled.
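Here is a hedged sketch of how shares and hard limits interact when carving up one 10 GbE uplink’s egress bandwidth. The share and limit values are made up for illustration; they are not VMware defaults.

```python
# Illustrative sketch only: shares set relative priority under contention,
# while limits put an absolute cap on each traffic type. Values are invented.

LINK_MBPS = 10_000  # one 10 GbE uplink

traffic = {
    #  type        (shares, limit in Mbps or None for unlimited)
    "vm":          (100, None),
    "management":  (25,  None),
    "nfs":         (50,  None),
    "iscsi":       (50,  None),
    "vmotion":     (50,  4000),   # cap vMotion at 4 Gbps outbound
    "ft_logging":  (25,  1000),   # cap FT logging at 1 Gbps outbound
}

def egress_allocation(traffic, link_mbps):
    """Proportional share of the link when everything is contending,
    clamped by each type's absolute limit."""
    total_shares = sum(shares for shares, _ in traffic.values())
    alloc = {}
    for name, (shares, limit) in traffic.items():
        fair = link_mbps * shares / total_shares
        alloc[name] = min(fair, limit) if limit is not None else fair
    return alloc

for name, mbps in egress_allocation(traffic, LINK_MBPS).items():
    print(f"{name:12s} {mbps:7.0f} Mbps")
```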

Second, a new feature called Load-Based Teaming can move a traffic type to a different physical NIC when ESX detects that a link has been more than 75% utilized over a 30-second window. This will be especially useful for 1 Gbps links, spreading load across links as utilization rises.
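The rebalancing trigger is simple enough to sketch. The following is only a conceptual model of the “over 75% for 30 seconds” rule, with invented uplink names and utilization numbers; it is not the actual teaming logic.

```python
# Illustrative sketch only: the Load-Based Teaming trigger, i.e. "move ports
# off an uplink that averages >75% utilization over 30 seconds." Stats invented.

SATURATION = 0.75     # utilization threshold
WINDOW_S   = 30       # averaging window in seconds

def rebalance(uplinks):
    """uplinks: {name: average utilization over the last WINDOW_S seconds}.
    Returns a (from, to) suggestion if any uplink is saturated."""
    hot = max(uplinks, key=uplinks.get)
    cool = min(uplinks, key=uplinks.get)
    if uplinks[hot] > SATURATION and hot != cool:
        return (hot, cool)   # shift some ports from the hot uplink to the cool one
    return None

print(rebalance({"vmnic0": 0.82, "vmnic1": 0.35}))  # ('vmnic0', 'vmnic1')
print(rebalance({"vmnic0": 0.60, "vmnic1": 0.55}))  # None: nothing to do
```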

This technology only works on a vNetwork Distributed Switch, so it is available only with Enterprise Plus licenses.

Memory Compression:

Memory overcommitment, while a hotly debated topic at times, is a reality for a lot of people. Overcommitment is a gamble, though, and sometimes you end up with virtual machines that, in total, need more RAM than you have. vSphere 4 tried helping you out in three ways when you were caught short. First, it used transparent page sharing to try to deduplicate RAM. Second, it used the VMware Tools balloon driver to reduce the amount of RAM actively consumed by a VM. Third, it just paged to disk.

Paging to disk is a terribly slow operation; disks operate thousands to millions of times slower than RAM and CPUs. So VMware added another technology to stave off the need to page to disk: memory compression. ESX sets aside a configurable compression cache and compresses pages into it rather than swapping them out. The optimal situation is, of course, not to need any of these technologies, but another way to avoid disk access is welcome.

Those of you having bad flashbacks to MS-DOS and RAM Doubler should consider that CPUs and RAM are much faster these days, that the compression cache is configurable via advanced settings, and that VMware says decompression takes about 20 μs, which is still far faster than going to disk.
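Some quick back-of-the-envelope math shows why that 20 μs figure matters. The disk latency below is my own rough assumption for a busy spinning disk, not a VMware number.

```python
# Back-of-the-envelope comparison only, using the ~20 microsecond decompression
# figure quoted above and an assumed ~5 ms seek/read time for a loaded disk.

decompress_s = 20e-6    # ~20 microseconds to pull a page out of the compression cache
disk_read_s  = 5e-3     # ~5 ms to page the same data back in from swap (assumption)

print(f"disk is roughly {disk_read_s / decompress_s:.0f}x slower "
      f"than decompressing the page from the cache")
# -> disk is roughly 250x slower than decompressing the page from the cache
```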

Memory compression is available to all licensed customers.

vMotion & EVC Improvements:

vMotion has seen evolutionary improvements that significantly decrease migration times, both through behind-the-scenes work and by raising the number of concurrent migrations. You can now run 4 vMotions simultaneously on a 1 Gbps link and 8 on a 10 Gbps link, and a single datastore can have 128 concurrent vMotions running against it.
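If you like to see the limits as a quick admission check, here is a small sketch using the concurrency numbers quoted above; the running-count values in the example calls are invented.

```python
# Illustrative sketch only: checking a new migration against the vSphere 4.1
# concurrency limits (4 vMotions per host on 1 GbE, 8 on 10 GbE, 128 per datastore).

PER_HOST_LIMIT = {1: 4, 10: 8}     # keyed by vMotion link speed in Gbps
PER_DATASTORE_LIMIT = 128

def can_start_vmotion(link_gbps, active_on_host, active_on_datastore):
    return (active_on_host < PER_HOST_LIMIT[link_gbps]
            and active_on_datastore < PER_DATASTORE_LIMIT)

print(can_start_vmotion(10, active_on_host=7, active_on_datastore=40))  # True
print(can_start_vmotion(1,  active_on_host=4, active_on_datastore=40))  # False: host cap hit
```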

Enhanced vMotion Compatibility (EVC) has also been updated to consider the CPU feature set of a running VM, rather than the features supported by the host. This allows for some better compatibility while moving VMs around, and better error detection when adding new hosts to clusters.

vMotion is also now officially spelled with a lowercase ‘v’ — whew! :)

vStorage API for Array Integration (VAAI):

Any old mainframe IT guy will look at all this new virtualization technology and make a snide remark that they had all this 25 years ago. In a lot of cases it’s true, especially when you consider coprocessors. While my iPhone has more CPU power than a mid-1990s mainframe, the mainframe didn’t actually have to use its CPU for much. Everything non-CPU was offloaded to coprocessors, especially I/O.

Continuing this trend among us naive open systems types, VMware and a variety of storage partners are working to offload storage operations to the arrays themselves, coprocessor-style. Initially three operations are handled by the arrays: full copy, zero-out, and hardware-assisted locking. For operations like cloning, this means ESX won’t pull a template across the SAN or network just to write it right back to the array. That is going to be a killer performance improvement for people on slower storage links, like iSCSI over 1 Gbps. Likewise, the locking function avoids locking an entire datastore for management operations, which makes datastores more scalable, since locking has been a real concern for big datastores.
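To put a rough number on the full-copy case: here is a back-of-the-envelope sketch of what offloading saves when cloning a template over a 1 Gbps iSCSI link. The 40 GB template size and the assumption of an otherwise idle link are mine, purely for illustration.

```python
# Back-of-the-envelope only: host-mediated clone vs. full-copy offload over
# a 1 Gbps iSCSI link. Template size and link efficiency are assumptions.

template_gb   = 40
link_gbps     = 1
link_gbytes_s = link_gbps / 8 * 0.9     # ~90% of wire speed, roughly 112 MB/s

host_mediated_s = (template_gb * 2) / link_gbytes_s   # read it from the array, write it back
offloaded_s     = 0                                    # full copy: data never leaves the array

print(f"host-mediated clone pushes ~{template_gb * 2} GB over the link "
      f"and ties it up for ~{host_mediated_s / 60:.0f} minutes")
print("offloaded clone moves no template data over the link at all")
```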

Actual support on arrays is forthcoming, and vendors such as Dell, NetApp, IBM, Hitachi, HP, and EMC are actively working to release compatible array firmware. This feature is only available to Enterprise and Enterprise Plus licensees.

ESXi:

vSphere 4.1 is the last major release to support the classic ESX software with the service console. Future releases will be ESXi only, using APIs to control and configure the host, so people should start planning a move to ESXi. There is a nice “VMware ESX to ESXi Upgrade Center” to answer a lot of questions and help smooth the transition.

ESXi 4.1 really brings it up to the level classic ESX has been at for a while. VMware has added boot-from-SAN support, scripted installs, the ability for Update Manager to push drivers and other modules, built-in Active Directory integration, and full support for both local and remote Tech Support Mode.

This will be a big change for a lot of people, and some thought and testing would be prudent for everybody, beginning now.

High Availability:

High Availability has had its limits raised to 32 hosts, 320 VMs per host, and 3,000 VMs per cluster. Those limits still apply after a failover, of course, so you need to size the cluster as N+1 and keep the post-failure load under them. The limit of five primary nodes remains unchanged.
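A quick sizing check makes the N+1 point concrete. This sketch only uses the limits quoted above; the cluster size and VM counts in the example calls are invented.

```python
# Illustrative sketch only: does a cluster still fit inside the HA limits
# (32 hosts, 320 VMs per host, 3000 VMs per cluster) after losing one host?

MAX_HOSTS, MAX_VMS_PER_HOST, MAX_VMS_PER_CLUSTER = 32, 320, 3000

def fits_after_failover(hosts, total_vms, failed_hosts=1):
    surviving = hosts - failed_hosts
    return (hosts <= MAX_HOSTS
            and total_vms <= MAX_VMS_PER_CLUSTER
            and total_vms / surviving <= MAX_VMS_PER_HOST)

print(fits_after_failover(hosts=8, total_vms=2000))  # True: 2000/7 is about 286 per host
print(fits_after_failover(hosts=8, total_vms=2500))  # False: 2500/7 is about 357 per host
```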

Reporting of problems is much improved via the new Cluster Operational Status window, which gives a better view of what is wrong, as determined by a new health-check process. High Availability and DRS also now work together, with DRS freeing up resources to help restart VMs after a failure.

High Availability now also has APIs for working with applications. These let monitoring agents cooperate with HA to do a variety of things, up to and including a full guest restart. This works through VMware Tools and involves guest-to-host communication, which may be a security concern in some environments. VMware hasn’t said much about it so far, but it opens the door to some interesting monitoring and self-healing possibilities.

Fault Tolerance (FT):

Still no SMP support in FT; that’s a much harder problem than it seems, because there is hardware support for FT with single-vCPU VMs but none for SMP VMs. However, a lot of other restrictions have been lifted. DRS can now move FT VMs around, which is great, since having FT VMs locked to a specific host was terribly restricting. That does require EVC to be enabled, but a lot of people are already doing that. FT hosts also no longer need to be at identical ESX versions.

Dynamic Resource Scheduling & Dynamic Power Management:

There is a new set of host affinity rules for DRS that can tie VMs to (or keep them away from) a set of hosts in a cluster. This helps a lot when you’re faced with host-based licensing restrictions, for instance. The rules can be set to preferred or required, and both DRS and HA will obey them.
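The difference between preferred and required is easy to sketch. This is not DRS’s actual placement algorithm; the rule semantics are simplified and the host names and licensing scenario are invented.

```python
# Illustrative sketch only: how a "required" (must run on) host-group rule
# differs from a "preferred" (should run on) rule when picking candidate hosts.

def candidate_hosts(all_hosts, rule_hosts, rule_type):
    """Return the hosts a VM covered by a host-group rule may be placed on."""
    allowed = [h for h in all_hosts if h in rule_hosts]
    if rule_type == "required":
        return allowed                     # never violate the rule, even on failover
    if rule_type == "preferred":
        return allowed or all_hosts        # prefer the group, fall back if it's empty
    return all_hosts

cluster  = ["esx01", "esx02", "esx03", "esx04"]
licensed = ["esx01", "esx02"]              # e.g. only these hosts carry the app license
print(candidate_hosts(cluster, licensed, "required"))    # ['esx01', 'esx02']
print(candidate_hosts(cluster, ["esx09"], "preferred"))  # falls back to the whole cluster
```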

DPM now has a set of scheduled tasks to help control it, turning it on and off at certain times of the day if you’d like. Disabling DPM will bring all the hosts out of standby, to help guarantee that no hosts get stuck in a useless state.

vCenter:

The changes to vCenter are really too numerous to list here. The biggest is that it is now 64-bit only, which will be a giant headache for anyone currently running it on 32-bit Windows. There are plenty of other incremental improvements as well, including a lot more information surfaced to users, often in conjunction with the new features throughout vSphere 4.1.

Whew. Get testing, people!
