If there is one thing about the intarweb that really, really sets me off it’s when people reply privately to public inquiries. I usually see this in list archives where someone posted “Hey, I’m seeing a certain problem, has anyone else? Email me off the list.” Hell no. You posted your question publicly, let the response be public, too. Now I have the problem and I have no idea what the solution was, because you kept it private. Why the person replying to you replied privately is beyond me, too. The internet is a big conversation, you idiots, and you’re whispering.
If you comment on a post here there are three things that might happen:
1) Nothing. I don’t have more to add, so I’ll be quiet.
2) I’ll reply via email to you.
3) I’ll reply publicly so that everyone can see my answer.
If I reply in a new post there’s a chance that someone down the road will be able to find the whole thing and derive some value from it. That’s what I am going to do here.
This is long — sorry. I spent the day at the Wisconsin VMUG meeting (which was sweet, BTW), and ended up talking about this a lot, so it’s fresh in my mind.
For those just tuning in, I made a comment about horrible VMware/EMC disk I/O performance in a post a while back, and received a few comments & emailed questions about what I did to fix it. So I posted about that, too. Now I’ve been asked to characterize my workload when I was seeing I/O problems. With pleasure!
The simple answer is: the problems appeared with my “normal” workload, though I did diagnose and test with synthetic workloads.
At the time I had about 45 VMs between five Dell PowerEdge 6650s attached to EMC CLARiiON CX700s. The CX700s each have a remote counterpart, and all the LUNs on each are mirrored with MirrorView via long wave (under 2 km) links in our SANs. The mirroring is synchronous but because it is a short distance it doesn’t impose too much of a performance hit. The CX700s are pretty loaded, which has led recently to a purchase of a DMX-3, but they operate well when they are not in a degraded state from a storage processor problem. All of the disks in the arrays are RAID5 groups. A pack of disks in the CX700s, also known as a DAE to EMC folks, has fifteen 146 GB disks, and for us they are arranged as 7+7+1. Two seven-disk RAID5s, one hot spare. They are split up like that to minimize rebuild times during a disk failure, and therefore minimize the chance of a dual-drive failure where there would be data loss. Each LUN allocated to the ESX servers is a MetaLUN, is 200 GB, and is striped across six RAID groups in three DAEs using the defaults for caching, stripe size, etc.
Each ESX server is a Dell PowerEdge 6650, with four 2.7 GHz CPUs and 24 GB of RAM. They attach to the SAN via QLogic QLA2342 HBAs, running the latest firmware approved for use in our environment. Each server had 10-15 VMs on it, all pretty low utilization as we were mostly virtualizing “low hanging fruit,” or very easy virtualization targets. The VMs were all Microsoft Windows 2000 Server, Microsoft Windows Server 2003 Standard, or Red Hat Enterprise Linux AS 3. The VMs were mostly single CPU, with 384-1024 MB of RAM. Some developers were attempting to use some of the Linux VMs for development, but compile times with gcc were 50x longer on a VM than on physical hardware with local disk. The rest of the workload experiencing the problem was mostly light-use web servers, light-use mail servers (sendmail), a number of backup Windows domain and Active Directory controllers, and a medium-use PostgreSQL database supporting a development and test Lyris ListManager instance. We were running about 25% CPU and 10% RAM utilization. Disk I/O never peaked above 10 MB/sec, though I really thought it should, at least during backups.
To diagnose storage performance problems I use Navisphere Analyzer on the arrays, the basic monitoring on the Brocade 3900s, and MRTG running against the 3900s. On the VMs I generate load with the benchmark tools IOzone and bonnie++. If I just need some basic load I’ll use dd, like “dd if=/dev/zero of=testfile”. For ESX Server I used the stats gathered via esxtop and VirtualCenter. The problem itself could be seen as long I/O wait on VMs, coupled with horrible throughput. Random I/O sucked, sequential I/O sucked, and as the load went up the amount it sucked went up, too. Think logarithmic.
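If dd isn't handy, the same kind of basic sequential write load can be sketched in a few lines of Python. This is only an illustration, not what I ran at the time: the function name, the 16 MB default size, and the 1 MB block size are all made up for the example, and the fsync is there so the page cache doesn't flatter the number.

```python
import os
import tempfile
import time


def sequential_write_mb_per_sec(path, total_mb=16, block_kb=1024):
    """Write total_mb of zeros to path in block_kb chunks; return MB/sec.

    Hypothetical helper for illustration, roughly what a bounded
    'dd if=/dev/zero of=testfile' run would measure.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before stopping the clock
    elapsed = time.perf_counter() - start
    return total_mb / elapsed


if __name__ == "__main__":
    # Write into a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as d:
        rate = sequential_write_mb_per_sec(os.path.join(d, "testfile"))
        print(f"{rate:.1f} MB/sec")
```

A number like this only tells you about sequential writes from one guest. For random I/O and mixed workloads, IOzone and bonnie++ are still the better tools.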
To narrow things down I started removing potential causes. My storage guys disabled MirrorView for my LUNs. We created a new MetaLUN where only my test VM would reside, so that I could watch the I/O there without seeing other guest OS I/O. I even ran my test VMs alone on an ESX server, directly attached to a core SAN switch. Alone, the performance wasn’t too bad, but get a few VMs together and things got ugly real quick. Frustrated, I started benchmarking each and every Linux VM (about 30). In most cases, no matter how hard I tried I couldn’t get any VM to break 10 MB/sec. To my surprise a couple of them had decent performance, as you’d expect from the environment, including big bursts (100+ MB/sec). WTF?!? It would turn out that those VMs had inadvertently been aligned properly on disk, just because of their VMDK file size and placement. Oops.
When I/O performance was bad I could not correlate it with heavy I/O on the array (which did make the problem worse), SAN performance problems, or any sort of contention on the ESX servers. We had adjusted things like the queue length and maximum outstanding requests at the advice of VMware support staff early on in this saga. It didn’t help much. There was a lot of stripe crossing going on, though, which was starting to pique my interest, especially comparing the VMs that worked to the others that didn’t.
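To see why stripe crossing matters, here is a small Python sketch that counts how many I/Os straddle a stripe element boundary on the array. The numbers are assumptions, not measurements from my environment: a 64 KB stripe element, 512-byte sectors, and a partition that starts at sector 63 (the traditional MBR default) versus one that starts at sector 128. The function name is made up for the example.

```python
def crossing_fraction(partition_offset, io_size, element_size, n_ios=1024):
    """Fraction of back-to-back I/Os that span more than one stripe element.

    partition_offset, io_size and element_size are all in bytes. The I/Os
    are laid out contiguously from the start of the partition.
    """
    crossings = 0
    for i in range(n_ios):
        start = partition_offset + i * io_size  # byte address on the LUN
        end = start + io_size - 1               # last byte touched
        if start // element_size != end // element_size:
            crossings += 1
    return crossings / n_ios


SECTOR = 512
ELEMENT = 64 * 1024            # assumed stripe element size
LEGACY_OFFSET = 63 * SECTOR    # assumed unaligned partition start
ALIGNED_OFFSET = 128 * SECTOR  # start on a 64 KB boundary

for io_kb in (4, 8, 64):
    io = io_kb * 1024
    print(io_kb,
          crossing_fraction(LEGACY_OFFSET, io, ELEMENT),
          crossing_fraction(ALIGNED_OFFSET, io, ELEMENT))
```

Under those assumptions, every 64 KB I/O on the unaligned partition touches two stripe elements instead of one, while the aligned partition never crosses. Smaller I/Os cross less often, but each crossing is still extra back-end work for the array.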
The VMworld 2005 presentations were posted while I was in the middle of figuring this out. I read through Bala Ganeshan’s presentation, where he spoke about storage issues, and this issue in particular. A quick check of EMC’s “best practices” documents confirmed that alignment was a problem for Linux and Windows hosts.
I don’t have the data in a presentable format, but fixing the alignment of the VMFS filesystems improved our performance by about 70%, bringing VMware back from the dead. Fixing the alignment of the VMDK files got us another 30%. Your mileage will probably vary, but for us this was huge.
So what needs to happen to get this fixed?
1) VMware has a new KB article on the subject, but it is wrong. That needs to get fixed, because I, several other VMware admins from around Wisconsin, and at least one guy from EMC (Ganeshan) disagree with its crappy conclusions. I did submit feedback that that article is wrong.
2) EMC needs to add VMware-specific notes about this in their documentation. My storage guys had looked for notes about ESX Server and found nothing; the only alignment notes they found were for Linux hosts. They actually knew alignment was a problem on Linux and Windows hosts but didn’t think it was enough to worry about (they’re wrong, too: it’s worth a lot in extra I/O against the arrays).
3) ESX server should detect and deal with this automatically, maybe using the vendor information for the LUN. Or even offer to format LUNs with custom partition start blocks. At the very least, though, something needs to be put into the guest OS and SAN documents that VMware distributes.
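The "custom partition start block" idea in item 3 doesn't need to be complicated. As a rough sketch, assuming the tool knows the array's stripe element size (64 KB here is my assumption, as are the 512-byte sectors and the function name), picking a start block is just rounding up to the next element boundary:

```python
def aligned_start_sector(min_start_sector, element_bytes=64 * 1024, sector_bytes=512):
    """Smallest sector >= min_start_sector that sits on a stripe element boundary.

    Hypothetical helper: element_bytes would really come from the array or
    from vendor documentation, not from a hard-coded default.
    """
    if element_bytes % sector_bytes:
        raise ValueError("element size must be a whole number of sectors")
    sectors_per_element = element_bytes // sector_bytes
    # Ceiling division, then scale back up to a sector number.
    return -(-min_start_sector // sectors_per_element) * sectors_per_element


print(aligned_start_sector(63))  # a partition that would have started at sector 63
```

With those defaults a partition that would have started at sector 63 gets pushed out to sector 128. The hard part isn't the arithmetic, it's knowing the right element size for the LUN, which is why using the vendor information would help.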
So there, that’s pretty much my story, at least for my VMware experience between July 2005 and January 2006. I love VMware, but I love it so much more when it works fast.