VMware I/O Problems

I previously mentioned the I/O problems I was having with my VMware ESX Servers. I didn’t really elaborate on them at the time, but having been asked, I will now.

This was happening to me using ESX Server 2.x against EMC CLARiiON CX700s. It also appears to be a problem under ESX Server 3.

The problems I was having were caused by logical block addressing (LBA), a feature of the PC BIOS that reworks the disk geometry so that a disk always appears to have a legacy layout of at most 1024 cylinders and exactly 63 sectors per track, whatever its real geometry is. It’s a hack around some old limits on PC hardware.

On storage attached to PCs the first track holds the master boot record (MBR), the partition table, etc. Because of that 63-sectors-per-track geometry it is 63 sectors long (sectors 0-62), so when you create the first partition on a disk it starts at sector 63.

Computers, being what they are, like things that come in powers of 2. Storage arrays, being computers, like getting I/O requests in sizes that are powers of 2. Storage arrays stripe the data internally in sizes that are powers of 2. 63 is not a power of 2, and because a PC starts data at sector 63 it causes the disk data structures the PC uses to be misaligned with the structures the array uses. And that results in more work for the array, because each I/O request from the PC ends up straddling the stripes on the array, so it has to read twice as much stuff in (two stripes instead of one).
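
To put rough numbers on that, here’s a back-of-the-envelope sketch. It assumes 512-byte sectors and a 64 KB stripe element, which is what the sector-128 alignment described below implies; the shell lines just compute which stripe element the first and last byte of a 64 KB read at the front of the partition land in:

$ echo $(( (63 * 512) / 65536 )) $(( (63 * 512 + 65535) / 65536 ))
0 1
$ echo $(( (128 * 512) / 65536 )) $(( (128 * 512 + 65535) / 65536 ))
1 1

With the partition starting at sector 63 the read touches elements 0 and 1, so the array does two stripe reads to satisfy one request; with the partition starting at sector 128 it stays inside a single element.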

Here’s a simple illustration:

[Image: VMFS & Array]

With VMware you get hit twice. Each virtual machine emulates LBA, too, and so each virtual machine is also misaligned by default. Coupled with the misalignment at the VMFS level you have a mess, and each I/O request made by a VM results in a lot of work at the lower level, as well as a lower cache hit ratio and fewer chances for I/O aggregation, write/read combining, etc.

[Image: VMDK, VMFS, and Array]
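
The stacked case works out the same way, at least as a simplified sketch: if the VMDK starts on a power-of-2 VMFS block boundary, the guest partition’s 63-sector offset just adds to the VMFS partition’s 63-sector offset, and you can compute where guest data lands inside a 64 KB stripe element:

$ echo $(( ((63 + 63) * 512) % 65536 ))
64512

That puts the start of guest data 64,512 bytes into a stripe element, only 1 KB shy of the next boundary, so neither layer lines up with the array.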

EMC documents how to fix this in the notes for the arrays. Basically you have to use fdisk’s expert mode (or diskpart on Windows) to move the beginning of the partition so that it’s at sector 128. That way it’ll be aligned with the CX700’s stripes. With VMware, this means you have to partition the VMFS volume before you format it — ESX Server doesn’t do the right thing when it formats it for you:

$ sudo /sbin/fdisk /dev/sdj

Command (m for help): n

Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-522, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-522, default 522):
Using default value 522

Command (m for help): t
Selected partition 1
Hex code (type L to list codes): fb
Changed system type of partition 1 to fb (Unknown)

Command (m for help): x

Expert command (m for help): b
Partition number (1-4): 1
New beginning of data (63-8385929, default 63): 128

Expert command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

Partition type ‘fb’ is VMware’s VMFS partition type. If you’re just fixing a Linux partition you can leave it as the default of ’83’.
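
If you’d rather script this than walk through fdisk interactively, something like the following should also work. This is just a sketch, assuming the classic sfdisk that takes sector units with -uS; /dev/sdj is the same example device as above, and “128,,fb” means “start at sector 128, use the rest of the disk, type fb”:

$ echo "128,,fb" | sudo /sbin/sfdisk -uS /dev/sdj

Either way, verify that partition 1 really starts at sector 128 before you format it.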

Once you’ve done all that you can use the VMware ESX MUI to create a VMFS filesystem on that LUN. Don’t forget to rescan the SAN first, so that ESX Server will see the newly partitioned LUN.
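
If you prefer the ESX 3 service console to the MUI, the equivalent steps look roughly like this. Treat it as a sketch: esxcfg-rescan and vmkfstools are the real tools, but the adapter name and target/LUN/partition numbers here are made-up examples, and the volume label is arbitrary:

# as root on the ESX 3 service console
esxcfg-rescan vmhba1
vmkfstools -C vmfs3 -S aligned_lun vmhba1:0:0:1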

You should use this procedure for both the ESX Servers and the VMs. If you use LVM under Linux you should partition the disk first, then run pvcreate on the partition (sudo /usr/sbin/pvcreate /dev/sdb1). That way the LVM I/O will be aligned, too.
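
Spelled out, the Linux guest workflow looks something like this. It’s a sketch; the device, volume group, and logical volume names are just examples, and the fdisk step is the same expert-mode dance shown above, only with partition type 8e (Linux LVM) instead of fb:

# create /dev/sdb1 starting at sector 128: n (new), t -> 8e, then x, b -> 128, w
sudo /sbin/fdisk /dev/sdb
# build LVM on the aligned partition
sudo /usr/sbin/pvcreate /dev/sdb1
sudo /usr/sbin/vgcreate datavg /dev/sdb1
sudo /usr/sbin/lvcreate -L 8G -n datalv datavg
sudo /sbin/mkfs -t ext3 /dev/datavg/datalv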

Most OS installers won’t let you alter the beginning sector of the partitions, so you might have to partition the disk using a Linux rescue disk or something. But luckily, with VMware you only have to do that once, to your template machines.
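
A quick way to check whether an existing disk is aligned, on the ESX Server or inside a guest, is to list the partition table in sector units and look at the Start column. A start sector that’s a multiple of 128 is 64 KB-aligned; 63 means you have the default misalignment:

$ sudo /sbin/fdisk -lu /dev/sda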

I have no idea if this is a problem on the EMC DMX/Symmetrix line, or for other vendors. If you happen to know one way or another please post a comment, and I’ll update this.

I first discovered this while reading the VMworld 2005 presentation by Bala Ganeshan, an EMC SE. The data you need starts on page 19 of the presentation.

Update:
The EMC document describing the problem and resolution is the “EMC CLARiiON Best Practices for Fibre Channel Storage” (sorry, the link requires an EMC PowerLink login). Treat the VMFS filesystem like a Linux host, only with partition type ‘fb’, and then do whatever you need to make the guest OSes aligned right, too.

Comments on this entry are closed.

  • Might be worth noting that VMFS3 will be aligned automatically in ESX 3.0. From the VI3 Installation guide, page 12:

    VMFS3 Partitioning – For best performance, use the VI Client or Web Access to set up your VMFS3 partitions, rather than using the ESX Server installer. Using the VI Client or Web Access ensures that the starting sectors of partitions are 64K-aligned, which improves storage performance.

  • What was the Disk I/O per sec you were seeing when you had this issue? How many VMs per physical server?

  • Are there any recommendations for configuring geometries within a Guest O/S like Solaris? The command `fdisk` provides the ability to override the geometry.
    Also any recommendations on filesystem blocking?
    Or should the work in the VM just be happily ignorant of the disk and disk i/o structures?

  • Bob,

    SAN/ESX/Linux/LVM is the most common OS & disk setup I am doing lately. When I get to my Linux VM install, I am using a Rescue Disk to create my disk partitions first, including LVM partitions. I then reboot into install mode, and go through the normal OS install, including creating the LVM volume groups.
    Is there a problem with this process? Is the LVM I/O aligned in this case?

  • Unless you use fdisk’s expert mode to change the alignment of the partitions to be on 128-sector (64 KB) boundaries you are not successfully aligning your volumes.

  • This is a very interesting find for me – although I don’t use ESX or a SAN, I am using VMWare Server 2.x and have found that disk i/o has been extremely CPU intensive on the host OS (guest OSes show no signs of CPU load, although the entire system is sluggish).

    I noticed that indeed the same normal formatting is going on in the guest VMs here – an offset of 63 sectors for the first bit of data. Thinking about it, this could very well make for similar performance issues like you noted. Having everything in the guest offset by 63 sectors means that the host has to read 2 blocks for every 1 block the guest wants and then re-create a new virtual block of data (I’m assuming). This could be the cause of the poor disk i/o I’m encountering (although I have heard that this wasn’t a problem in VMWare Server 1.x – but I haven’t done any testing).

    Running dumpe2fs on the host to quickly get the formatted block size of the disk the guest VMs are on, and then offsetting the first block of data in the guest disk to match, might be the trick. I’ll give this a try and see how it turns out. Hopefully it solves my problem!

  • Thoughts…

    1. Why would I possibly want to create a partition table for a disk that will become a PV?
    I have always created PVs on whole disks (pvcreate /dev/sdb) and that has been working just fine.

    2. Even without LVM, why bother with a partition?
    As far as I can see partitions do nothing but add problems (such as the alignment offset) and add no value whatsoever. The only reason to partition is if you need to slice a disk into smaller disks (like the installer does with /boot and / etc…). That’s it.
    Other than that partitions have no purpose.
    Why create a partition that is as large as the entire disk when I want to use the entire disk anyway?

    [root@vm1 astuck]# /sbin/mkfs -t ext3 -j -m0 /dev/sdb
    mke2fs 1.39 (29-May-2006)
    /dev/sdb is entire device, not just one partition!
    Proceed anyway? (y,n) y
    Filesystem label=
    OS type: Linux
    Block size=4096 (log=2)
    Fragment size=4096 (log=2)
    1310720 inodes, 2621440 blocks
    0 blocks (0.00%) reserved for the super user
    First data block=0
    Maximum filesystem blocks=2684354560
    80 block groups
    32768 blocks per group, 32768 fragments per group
    16384 inodes per group
    Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

    Writing inode tables: done
    Creating journal (32768 blocks): done
    Writing superblocks and filesystem accounting information: done

    This filesystem will be automatically checked every 26 mounts or
    180 days, whichever comes first. Use tune2fs -c or -i to override.

    Now I can easily expand the virtual disk on ESX and run resize2fs on /dev/sdb. No partition ever needed.

  • @Stucky101 — yes, everything you suggest works just fine. What works way better, though, is to follow the best practices of your storage vendor and align your I/O by partitioning the disk.

  • Bob

    Not sure I follow you since the guide tells you to align partitions. They’re not saying that you have to have partitions in the first place.
    If I don’t have partitions there is nothing to align.
    What is the difference between aligning a partition and not having one in the first place, except less hassle? Am I missing something here?

    Besides, as per my discussion with an EMC engineer…
    Quote :
    “There is absolutely no guarantee that any i/o will be aligned with our stripes. Clariion does not offer true Raid-3 in which all i/o’s are forced to read exactly the same full stripe or nothing at all. All the alignment tools do is make sure the beginning of the filesystem is aligned with the lun’s stripes but i/o’s in general are not aligned”
    End quote

    In other words, you can align your partitions all you want; you cannot prevent further disk crossings afterwards.
    I’m still discussing with them why they say that but at the same time put so much emphasis on that one-time alignment.

  • One of the interesting items I found when using 8-way partitions on EMC DMX disk subsystems in a UNISYS mainframe environment (driven by their XPC controller) was the specification that the at-rest position of the heads was “in the middle of the disk”, meaning between partitions 3 and 4 (in a 0,1,..7 set). We gained the highest disk performance by limiting all database files to partitions 3 and 4, and throwing away (ignoring) the outer partitions. We were able to reduce transaction times from 3 seconds to 0.3 second by reducing single-disk conflicts and moving all high-reference files to their own individual disks. The net result was to gain the equivalent capacity of four mainframes without hardware or application software changes.

    Similar effects can be observed in VM machines using small numbers of disks, where increasing the number of disks and repositioning data files can result in significant increases in overall throughput.
