The Problem With I/O

It used to be troublesome when someone needed 200 GB of disk space. It turned into a big negotiation between a system administrator or storage administrator and the DBA, or the user, or the application admin, about why, and how, and for how long, and how expensive space is, and so on. I believe the right term for it is “goat rodeo.”

With the advent of 300 GB fibre channel drives and 750 GB SATA drives, storage administrators don’t need to worry about any of that crap anymore. They don’t even bat an eye at a 500 GB space request because it isn’t a problem anymore. Some of you will say I’m spoiled by the environment I’m in, but it’s a fact for me. You want 500 GB? Sure thing, it’ll be ready in a minute.

The problem now is I/O.

A single drive has a single set of read/write heads mounted on a single armature, which means that it can only do one thing at a time. The advent of caches and command queues means the drive can make better decisions about what order it does things, but when it comes down to it each drive has a finite amount of I/O it can do in a second. The number of IOPS a drive can sustain depends on a number of factors. It’s less with SATA, more with fibre channel. Rotational speed (RPMs) has a lot to do with it, too. A 15K drive has 5000 more opportunities a minute to do stuff than a 10K drive does.
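If you want to put a rough number on it, you can estimate a drive’s random IOPS from its average seek time plus half a rotation. Here’s a quick back-of-the-envelope sketch; the seek times are typical published figures, not measurements of any particular drive:

```
# Rough random-IOPS estimate: one random I/O costs about an average seek
# plus half a rotation. Seek times here are typical published figures,
# not measurements of any particular drive.

def random_iops(rpm, avg_seek_ms):
    rotational_latency_ms = (60000.0 / rpm) / 2   # half a revolution, in ms
    return 1000.0 / (avg_seek_ms + rotational_latency_ms)

print("7.2K SATA: ~%d IOPS" % random_iops(7200, 8.5))    # ~79
print("10K FC:    ~%d IOPS" % random_iops(10000, 4.7))   # ~129
print("15K FC:    ~%d IOPS" % random_iops(15000, 3.5))   # ~181
```

The exact numbers vary by drive and workload, but the order of magnitude is what matters: call it somewhere around 80 to 180 random IOPS per spindle.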

Most users of applications want their application to complete its I/O as fast as possible, because they’re waiting for the results. Since a single drive usually cannot keep up with your enterprise demands, you gang a bunch of drives together and spread the I/O across them. What happens is you end up with a ton of space but not enough I/O capacity.
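To see how lopsided that gets, here’s an illustrative sizing exercise with made-up workload numbers, comparing how many drives you need for the space versus how many you need for the I/O:

```
# Same hypothetical workload sized two ways: by capacity and by I/O.
# Drive figures are typical examples, not vendor specs.
import math

needed_gb   = 2000    # 2 TB of data
needed_iops = 5000    # sustained random IOPS at peak

drive_gb    = 300     # one 300 GB 15K fibre channel drive
drive_iops  = 180     # rough random IOPS per spindle (see the estimate above)

print("drives needed for space: %d" % math.ceil(needed_gb / float(drive_gb)))      # 7
print("drives needed for I/O:   %d" % math.ceil(needed_iops / float(drive_iops)))  # 28
```

The I/O requirement, not the space requirement, ends up dictating the spindle count.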

So what do we do?

You could keep adding more drives, but drives cost money. They use power. Their capacity causes licensing issues on storage arrays, since most vendors charge per terabyte for their software licenses. As mechanical parts, drives fail, too, which means you have to replace them. You can take the MTBF of a drive and calculate how many hours a week you will spend dealing with dead disks; for large implementations that calculation is a very real number. Some of the large storage implementations, like those attached to the TeraGrid, have designed their systems to tolerate failure because of the staff time required to change disks. There are limits to adding drives to arrays, too. You have to account for rebuild times when a drive fails. The more drives you have in an array the longer the rebuild takes, which puts you at risk of a second disk failure, and two drive failures in a RAID 5 array are fatal.
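Here’s the kind of back-of-the-envelope MTBF math I mean. The fleet size and the hour-per-swap figure are made-up examples, and real annualized failure rates tend to run worse than the spec sheet implies:

```
# Expected drive replacements per week, straight from the spec-sheet MTBF.
# Fleet size and hours-per-swap are made-up examples for illustration.

drives         = 10000      # spindles in the installation
mtbf_hours     = 1000000    # a common spec-sheet MTBF
hours_per_week = 168
hours_per_swap = 1.0        # ticket, physical swap, babysitting the rebuild

failures_per_week = drives * hours_per_week / float(mtbf_hours)
print("expected failures per week: %.1f" % failures_per_week)                      # ~1.7
print("staff hours per week:       %.1f" % (failures_per_week * hours_per_swap))
```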

You could get drives with a faster rotational speed. However, faster drives cost more money. They generate more heat, which means they consume more data center resources in power and cooling. In my experience they also fail sooner because they are less tolerant of environmental fluctuations.

You could get smaller-capacity drives to add spindles without incurring licensing costs. However, as you grow you’ll end up with a lot more drives. You could also switch from RAID 3/4/5 to RAID 10. As disks come down in price it is now feasible to use RAID 10, which has advantages in both I/O capacity and fault tolerance for certain applications like databases. Storage admins have shied away from it mainly because it was costly to give up 50% of the capacity of an array. Nowadays, though, it isn’t about capacity, it’s about I/O.
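To put rough numbers on that trade-off, here are the same eight drives both ways, using the standard write-penalty figures (four back-end I/Os per random write for RAID 5, two for RAID 10):

```
# The same eight drives as RAID 5 and as RAID 10: usable capacity versus
# random-write capacity, using the standard write penalties.

drives     = 8
drive_gb   = 300
drive_iops = 180    # rough random IOPS per spindle

raid5_gb  = (drives - 1) * drive_gb        # one drive's worth of parity
raid10_gb = (drives // 2) * drive_gb       # mirrored pairs

raid5_write_iops  = drives * drive_iops / 4.0   # parity read-modify-write
raid10_write_iops = drives * drive_iops / 2.0   # every write hits two mirrors

print("RAID 5:  %d GB usable, ~%d random write IOPS" % (raid5_gb, raid5_write_iops))
print("RAID 10: %d GB usable, ~%d random write IOPS" % (raid10_gb, raid10_write_iops))
```

Roughly half the capacity, twice the random write capacity, plus a rebuild that only has to copy one mirror instead of reading every surviving drive.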

2.5″ disks show some promise in these areas. You can now get more spindles into 1U, and they run cooler and consume less power. Of course, running cooler and using less power is offset by the additional drives you’ll add, so effectively you get more IOPS for the same power bill.

Another option is to manage hotspots on your disks. Inevitably, when you add drives you don’t get the optimal mix of host I/O to I/O capacity. You can tweak that, though, and spend years sitting and watching performance graphs of your arrays. Most storage vendors have some capability to move a LUN around within an array; they just have no way for the array to do it automatically. Sure, some of them claim that capability, but their software has so many limitations it isn’t practical. Right now “hotspot management” still means “large staff time expenditure.”
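For what it’s worth, the state of the art today is basically a report like this: pull per-LUN stats out of whatever performance tool your array ships with and flag whatever is doing more than its fair share. The numbers here are hypothetical stand-ins:

```
# A minimal sketch of the hotspot report that eats all that staff time:
# flag any LUN doing more than its fair share of the array's I/O.
# The per-LUN figures are hypothetical stand-ins for whatever your
# array's performance tool can export.

lun_iops = {
    "db01_data": 2200,
    "db01_redo":  900,
    "fileshare":  150,
    "backups":     50,
}

fair_share = sum(lun_iops.values()) / float(len(lun_iops))

for lun, iops in sorted(lun_iops.items(), key=lambda kv: -kv[1]):
    flag = "  <-- candidate to migrate" if iops > 2 * fair_share else ""
    print("%-12s %5d IOPS%s" % (lun, iops, flag))
```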

Solid state disks (SSDs) are an option. They are really freaking fast, but they scare the crap out of most people (probably unnecessarily) because they store data in volatile RAM. SSDs also tend to be small in comparison to typical drives (100-200 GB), so you have to design your systems around them very specifically. One of my rules of system administration is that the more custom a system design is, the more time it takes to administer. On the whole it might be cheaper to just buy more conventional drives and manage them the way you manage everything else.

As you add I/O capacity to your arrays you also have to think about your SAN. Are you going to be moving the bottleneck to the fibre channel switches and HBAs?
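A rough sanity check, assuming a hypothetical 8 KB random-read workload on a 4 Gb/s link, suggests the fabric usually isn’t the first thing to fall over:

```
# How many busy 15K spindles does it take to fill one 4 Gb/s fibre channel
# link? Assumes a hypothetical 8 KB random-read workload.

link_mb_s  = 400    # roughly what a 4 Gb/s FC link delivers after overhead
io_size_kb = 8
drive_iops = 180    # rough random IOPS per 15K spindle

link_iops = link_mb_s * 1024 / io_size_kb
print("one 4 Gb link carries ~%d x 8 KB I/Os per second" % link_iops)             # ~51200
print("that's roughly %d busy spindles behind a single link" % (link_iops / drive_iops))
```

For small random I/O the link is rarely the first bottleneck; it’s large sequential workloads and HBA and switch port queue depths that deserve the closer look.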

So what do we do? It would be really nice if storage vendors designed a no-bullshit optimizer for their arrays. Something us admin types can just turn on and forget, kind of like the Distributed Resource Scheduler in VMware’s Virtual Infrastructure 3. VMware hit the nail on the head: the systems know what they’re doing, so let them make the call.

A lot of this functionality could also be moved to a storage virtualization device. I wrote about this a couple of days ago in a post about how virtualization engines suck.

In the meantime, however, storage admins just need to realize that it isn’t about dollars per GB anymore, it’s about dollars per I/O per second. Plan accordingly. And if you have a choice get as many 15K spindles as you can. 🙂
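If you want to convince yourself (or your management), run the math both ways. The prices below are made up for illustration; plug in your own quotes:

```
# Dollars per GB versus dollars per IOPS for two drives. The prices are
# made up for illustration; plug in your own quotes.

drives = [
    # (name,          price_usd, gb,  est_iops)
    ("750 GB SATA",    350,      750,  80),
    ("146 GB 15K FC",  400,      146, 180),
]

for name, price, gb, iops in drives:
    print("%-15s $%5.2f/GB   $%5.2f/IOPS" % (name, price / float(gb), price / float(iops)))
```

Judged by dollars per GB the big SATA drive wins by a mile; judged by dollars per IOPS the 15K drive is actually the cheaper disk.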

1 thought on “The Problem With I/O”

  1. I’m really looking forward to ZFS spreading for precisely this reason – I can’t wait to get out of the micro-optimization business and ZFS has automatic hotspot balancing across the entire pool. You can basically wait for there to be a problem (pool-wide – completely avoiding the problems with small volumes being trivially spindle-limited) and add an extra/faster drive (which can be anything accessible by the system – you could buy some time to deal with a SCSI hotspot by tossing SAN storage at it or even a USB/firewire drive if you were really desperate).

    One of the developers likes to tell an anecdote about how this flushed out some differences in the drive settings on a RAID enclosure one of the beta testers used – the tester was wondering why one drive’s activity LED flickered less often and learned that ZFS had noticed that it was slower than the others and wasn’t using it for hot data.
