I previously mentioned I/O problems with my VMware ESX Servers. I didn’t elaborate at the time, but since I’ve been asked, I will.
This was happening to me using ESX Server 2.x against EMC CLARiiON CX700s. It also appears to be a problem under ESX Server 3.
The problems were caused by the disk geometry the PC BIOS presents along with logical block addressing (LBA): the geometry is reworked so that a disk appears to have no more than 1024 cylinders and 63 sectors per track. It’s a hack around some old limits on PC hardware.
On storage attached to PCs, the first track is reserved for the master boot record (MBR), the partition table, etc. Because of LBA it is 63 sectors long (sectors 0-62), so the first partition you create on a disk starts at sector 63.
Computers, being what they are, like things that come in powers of 2. Storage arrays, being computers, like getting I/O requests in sizes that are powers of 2, and they stripe data internally in sizes that are powers of 2. 63 is not a power of 2, and because a PC starts its data at sector 63, the disk data structures the PC uses are misaligned with the structures the array uses. That means more work for the array: an I/O request from the PC that should fit in one stripe ends up straddling two, so the array has to read in twice as much data (two stripes instead of one).
[Illustration: PC disk data starting at sector 63, misaligned with the array’s stripes]
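The straddling can also be worked out with a little arithmetic. This is only a sketch: it assumes 512-byte sectors and a stripe element of 128 sectors (64 KB), and elements_touched is a made-up helper name, not anything from EMC or VMware.

```shell
#!/bin/sh
# Count how many stripe elements a single I/O touches.
# All values are in 512-byte sectors.
#   $1 = absolute starting sector of the I/O on the LUN
#   $2 = length of the I/O in sectors
#   $3 = stripe element size in sectors (assumed 128, i.e. 64 KB)
elements_touched() {
    first=$(( $1 / $3 ))                # element holding the first sector
    last=$(( ($1 + $2 - 1) / $3 ))      # element holding the last sector
    echo $(( last - first + 1 ))
}

# A 64 KB request at the very start of the partition's data area:
elements_touched 63 128 128     # partition starts at sector 63, prints 2
elements_touched 128 128 128    # partition starts at sector 128, prints 1
```

Requests smaller than a stripe element only straddle when they happen to cross a boundary, so the penalty depends on the I/O size and where the request lands.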
With VMware you get hit twice. Each virtual machine emulates LBA, too, and so each virtual machine is also misaligned by default. Coupled with the misalignment at the VMFS level you have a mess, and each I/O request made by a VM results in a lot of work at the lower level, as well as a lower cache hit ratio and fewer chances for I/O aggregation, write/read combining, etc.
EMC documents how to fix this in the notes for the arrays. Basically you have to use fdisk’s expert mode (or diskpart on Windows) to move the beginning of the partition so that it’s at sector 128. That way it’ll be aligned with the CX700’s stripes. With VMware, this means you have to partition the VMFS volume before you format it, because ESX Server doesn’t do the right thing when it formats it for you:
$ sudo /sbin/fdisk /dev/sdj
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-522, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-522, default 522):
Using default value 522
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): fb
Changed system type of partition 1 to fb (Unknown)
Command (m for help): x
Expert command (m for help): b
Partition number (1-4): 1
New beginning of data (63-8385929, default 63): 128
Expert command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Partition type ‘fb’ is VMware’s VMFS partition type. If you’re just fixing a Linux partition you can leave it as the default of ‘83’.
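Before formatting, it may be worth confirming where the partition really starts. The sketch below assumes a Linux system with sysfs, where the start sector is exposed in /sys/block/<disk>/<partition>/start; on older systems without sysfs, fdisk -lu lists partitions in sectors instead. The check_alignment function name and the default of 128 sectors are my assumptions.

```shell
#!/bin/sh
# Report whether a partition's starting sector is a multiple of the
# stripe element size (both in 512-byte sectors).
#   $1 = file holding the start sector, e.g. /sys/block/sdj/sdj1/start
#   $2 = stripe element size in sectors (defaults to 128, an assumption)
check_alignment() {
    start=$(cat "$1") || return 2
    element=${2:-128}
    if [ $(( start % element )) -eq 0 ]; then
        echo "aligned: starts at sector $start"
    else
        echo "misaligned: starts at sector $start"
        return 1
    fi
}

# Example, using the device name from the session above:
# check_alignment /sys/block/sdj/sdj1/start
```

The function returns non-zero for a misaligned partition, so it can be dropped into a script that refuses to go on until the partition has been fixed.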
Once you’ve done all that you can use the VMware ESX MUI to create a VMFS filesystem on that LUN. Don’t forget to rescan the SAN before you do so that ESX Server will see the partitioned LUN.
You should use this procedure for both the ESX Servers and the VMs. If you use LVM under Linux you should partition the disk first, then run pvcreate on the partition (sudo /usr/sbin/pvcreate /dev/sdb1). That way the LVM I/O will be aligned, too.
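To see why both layers need the fix, here is a rough sketch that adds the two offsets together. It makes a simplifying assumption that I can’t vouch for: that the virtual disk file itself begins on a stripe element boundary inside the VMFS partition. The guest_misalignment name and the 128-sector element size are also assumptions.

```shell
#!/bin/sh
# Distance, in sectors, from the previous stripe element boundary to the
# first data sector of a guest partition. 0 means aligned.
#   $1 = start sector of the VMFS partition on the LUN
#   $2 = start sector of the guest partition inside its virtual disk
#   $3 = stripe element size in sectors (assumed 128)
# Assumes (hypothetically) that the virtual disk file begins on an
# element boundary inside the VMFS partition.
guest_misalignment() {
    echo $(( ($1 + $2) % $3 ))
}

guest_misalignment 63 63 128      # neither layer fixed, prints 126
guest_misalignment 128 63 128     # only the VMFS partition fixed, prints 63
guest_misalignment 128 128 128    # both layers fixed, prints 0
```

Under those assumptions, fixing only the VMFS partition still leaves the guest’s data 63 sectors off the boundary.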
Most OS installers won’t let you alter the beginning sector of the partitions, so you might have to partition the disk using a Linux rescue disk or something. Luckily, with VMware you only have to do that once, on your template machines.
I have no idea if this is a problem on the EMC DMX/Symmetrix line, or for other vendors. If you happen to know one way or another please post a comment, and I’ll update this.
I first discovered this while reading the VMworld 2005 presentation by Bala Ganeshan, an EMC SE. The data you need starts on page 19 of the presentation.
Update: The EMC document describing the problem and resolution is the “EMC CLARiiON Best Practices for Fibre Channel Storage” (sorry, the link requires an EMC PowerLink login). Treat the VMFS filesystem like a Linux host, only with partition type ‘fb’, and then do whatever you need to make the guest OSes aligned right, too.