Use elevator=noop For Linux Virtual Machines

This is post #6 in my December 2013 series about Linux Virtual Machine Performance Tuning. For more, please see the tag “Linux VM Performance Tuning.”

[Image: Chad Vader. "I've seen disk arrays in worse shape."]

Modern operating systems are fairly modular, and often have different modules to deal with memory, network I/O, and CPU scheduling. Disk I/O is no exception under Linux, where there are usually four different schedulers a sysadmin can choose from. Red Hat had a nice write-up on these a few years back, and it remains relevant today:

The Completely Fair Queuing (CFQ) scheduler is the default algorithm in Red Hat Enterprise Linux 4. As the name implies, CFQ maintains a scalable per-process I/O queue and attempts to distribute the available I/O bandwidth equally among all I/O requests. CFQ is well suited for mid-to-large multi-processor systems and for systems which require balanced I/O performance over multiple LUNs and I/O controllers.

The Deadline elevator uses a deadline algorithm to minimize I/O latency for a given I/O request. The scheduler provides near real-time behavior and uses a round robin policy to attempt to be fair among multiple I/O requests and to avoid process starvation. Using five I/O queues, this scheduler will aggressively re-order requests to improve I/O performance.

The NOOP scheduler is a simple FIFO queue and uses the minimal amount of CPU/instructions per I/O to accomplish the basic merging and sorting functionality to complete the I/O. It assumes performance of the I/O has been or will be optimized at the block device (memory-disk) or with an intelligent HBA or externally attached controller.

The Anticipatory elevator introduces a controlled delay before dispatching the I/O to attempt to aggregate and/or re-order requests improving locality and reducing disk seek operations. This algorithm is intended to optimize systems with small or slow disk subsystems. One artifact of using the AS scheduler can be higher I/O latency.

In the physical world, with operating systems on bare metal, we would choose an algorithm that is best suited to our workload. In the virtual world there is a hypervisor between the OS and the disks, and the hypervisor has its own disk queues. So what happens is:

  1. A process in the VM issues some disk I/O.
  2. The guest OS uses some CPU time to sort that I/O into an order it thinks is ideal, based on the disk algorithm that’s active.
  3. The guest OS makes the requests to the underlying virtual hardware in that new order.
  4. The hypervisor takes those requests and re-sorts them based on its own algorithms.
  5. The hypervisor makes the requests to the actual hardware based on its own order.

Since the hypervisor is going to do its own sorting, into its own disk queues, there’s very little point in a guest OS attempting the same work. At the very least, the guest OS is introducing some latency and wasting some CPU cycles. At the worst it’s making things more difficult for the hypervisor and the back-end storage (see “I/O blender“). If we switch to the NOOP scheduler the guest OS will do as little as possible to the I/O before it passes it along, which sounds perfect for a virtual environment.

So where do I put elevator=noop?

First, make sure NOOP is an option for you. The following command will show you the scheduler being used. Substitute your own block device for “sda.”

$ cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]

The one in brackets is the active scheduler. To switch to NOOP, add “elevator=noop” to the default kernel parameters in /etc/grub.conf. Something like:

title Red Hat Enterprise Linux Server (2.6.32-431.el6.x86_64)
   root (hd0,0)
   kernel /vmlinuz-2.6.32-431.el6.x86_64 ro root=/dev/mapper/Volume00-root rd_NO_LUKS LANG=en_US rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto rd_LVM_LV=Volume00/root selinux=1 audit=1 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet elevator=noop
   initrd /initramfs-2.6.32-431.el6.x86_64.img
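
If your Linux distribution has already moved to GRUB 2 there is no /etc/grub.conf to edit. The rough equivalent, as I understand it, is appending the parameter to GRUB_CMDLINE_LINUX in /etc/default/grub and regenerating the configuration; the exact file paths and command names vary by distribution, so treat this as a sketch:

# /etc/default/grub -- append elevator=noop to whatever parameters are already there
GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet elevator=noop"

# regenerate the config (grub2-mkconfig on Red Hat-family systems; Debian-family systems use update-grub)
grub2-mkconfig -o /boot/grub2/grub.cfg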

Reboot and rerun the test above; the brackets should move to noop.
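
You don't have to wait for a reboot to experiment, either. Writing to that same sysfs file flips the scheduler at runtime for that one device, which is handy for testing, though the change does not persist across reboots:

# echo noop > /sys/block/sda/queue/scheduler
$ cat /sys/block/sda/queue/scheduler
[noop] anticipatory deadline cfq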

You can also use Puppet and Augeas to change it for your virtual machines, as per the Augeas examples at Puppet Labs:

if $::is_virtual {
   augeas { "Kernel Options":
      context => "/files/etc/grub.conf",
      changes => [
         "set title[1]/kernel/elevator noop",
      ],
   }
}
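
If you want to sanity-check that Augeas path outside of Puppet, augtool (from the augeas package) can print it straight out of grub.conf. After a successful run you should see something like:

$ augtool print /files/etc/grub.conf/title[1]/kernel/elevator
/files/etc/grub.conf/title[1]/kernel/elevator = "noop"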

I use this method and it works well, though you may wish to do some testing with VMs in a public cloud scenario before you start changing kernel parameters. At the very least take a snapshot. :)
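
Another place to put this, if you would rather leave the kernel command line alone, is a udev rule that sets the scheduler as block devices show up. A minimal sketch, with a made-up file name and a device match you would adjust for your own environment:

# /etc/udev/rules.d/60-io-scheduler.rules (example file name)
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="noop"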

I also wrote about elevator=noop all the way back in 2008.

——

Image of Chad Vader © 2008 Geoff Stearns, licensed as CC-BY 2.0, provided via the Wikimedia Commons, and a shout-out to my friends in the Madison theater community who made Chad Vader possible.

Comments on this entry are closed.

  • Just a note, VMware recommends either the noop or the deadline scheduler (KB 2011861).
    We use the deadline scheduler for our workload (a fairly write-intensive virtual appliance).

    • Understandable — deadline is a fairly simplistic queue that tries to minimize latency, so it makes sense as a choice.

  • > The hypervisor makes the requests to the actual hardware based on its own order.

    It’s even worse than that, because the hypervisor doesn’t actually know what’s going on with the underlying storage either. The device it makes its I/O requests to, whether local RAID, iSCSI, or FC, is the first device in the chain that has _any_ actual knowledge of the underlying disks. The storage controller is going to have its own elevators and scheduling, and the underlying storage units (disks, SSDs) have their own firmware and do their own management with elevators and wear leveling. The one thing the hypervisor’s own I/O coalescing has going for it is that there is probably significant round-trip latency and there are limits on outstanding I/O commands, so queuing before unleashing a torrent of I/O requests at the storage controller CPU may make sense.

  • Regarding the Puppet snippet, I believe you want to check the $::is_virtual fact, which is false if the machine is physical ($::virtual is the string ‘physical’ in that case, which is still truthy, so the augeas resource would still get created).

    • Great catch. Was an error in my live manifest, too. Oops. Shows you how little physical hardware I have nowadays.

      • Yeah, me too. In fact, when I was trying to verify, I couldn’t find any physical Puppet clients to test against.

  • If I understand the technology correctly, this should *not* apply to OpenVZ or LXC VMs, since they don’t actually run their own kernel?

    • I believe you are correct; those inherit tuning parameters from their parent. If those technologies involve mounting loopback filesystems, there might be an opportunity for tuning there.

  • It’s interesting to see that in the SPECvirt benchmarks of ESXi hosts, they seem to always configure all VMs with the deadline scheduler (check the full result disclosures):
    http://www.spec.org/virt_sc2013/results/
    http://www.spec.org/virt_sc2010/results/

    It would be interesting to see some real comparison between different IO schedulers and how they affect various IO workloads.

    Also, looking at the notes, they tune quite a few parameters in the guest OS; perhaps a few of them are worth checking out for our templates in general.