I’ve long been a fan of RAID 5. Since you only lose one disk’s worth of space to parity, it has been the best way to maximize local disk space. Sure, the performance isn’t the greatest, but I haven’t had applications that taxed the local drives, and the combination of disk space and generally decent performance has been a good trade-off.
In the last six months, though, I’ve had three machines die from a double drive fault. This is the Achilles’ heel of RAID 5: a single drive failure is all it can tolerate. In two of those cases the array had a hot spare, and the second drive faulted during the rebuild onto that spare.
This makes me wonder why I’ve gone ten years without any problems, just to be blindsided now.
One answer comes to mind: increased capacities of disks, leading to long array rebuild times.
Think of what happens when a drive fails in a RAID 5 array, especially an older array that isn’t very busy. If there is a hot spare, the controller starts rebuilding the array, which generates a lot of I/O. If a second drive in the array is already questionable, that extra load might push it over the edge. Before the rebuild finishes you get a second drive error. Game over.
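To put rough numbers on that window of exposure, here is a back-of-the-envelope sketch. All of the figures (rebuild rate, annual failure rate, array size) are illustrative assumptions of mine, not measurements from any particular controller, and it models only independent failures, ignoring the correlated wear and latent sector errors that make the real-world risk higher:

```python
# Rough sketch: how the RAID 5 rebuild window grows with drive capacity.
# The rebuild rate and annual failure rate below are assumed values.

def rebuild_hours(capacity_gb, rebuild_mb_per_s=30):
    """Hours to rewrite one whole drive at a sustained rebuild rate."""
    return capacity_gb * 1024 / rebuild_mb_per_s / 3600

def p_second_failure(n_remaining, hours, annual_failure_rate=0.03):
    """Chance that any surviving drive fails during the rebuild window,
    assuming independent failures at a constant rate."""
    p_one = annual_failure_rate * hours / (365 * 24)
    return 1 - (1 - p_one) ** n_remaining

for gb in (73, 300, 1000):
    h = rebuild_hours(gb)
    p = p_second_failure(n_remaining=5, hours=h)
    print(f"{gb:5d} GB drive: rebuild ~{h:5.1f} h, "
          f"p(second fault during rebuild) ~{p * 100:.4f}%")
```

The independent-failure probability comes out small even for big drives; the point of the sketch is the scaling, since the exposure window grows linearly with capacity, and in practice the rebuild I/O itself stresses drives that have aged together.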
So what do I do about this? Change RAID levels? RAID 0 is out. RAID 1 can’t handle a double drive failure. RAID 1+0 (10) can survive a double drive failure as long as the two failed drives land in different mirrored pairs. Stick to smaller drives? With less capacity the rebuilds finish faster, shrinking the window of exposure. Use faster drives? Maybe switching from 10,000 RPM disks to 15,000 RPM disks would help; they’re faster, which would also shorten the rebuild window. However, 15K RPM disks seem to be more sensitive to cooling issues, making them less reliable and more prone to a fault if the environment isn’t perfect.
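The capacity-versus-redundancy trade-off behind those options can be summarized for an array of n identical disks. This follows the standard definitions of each level; “worst case” here means the minimum number of failures the array is guaranteed to survive regardless of which disks die:

```python
# Capacity and worst-case fault tolerance for common RAID levels,
# given n identical disks. Standard textbook definitions, not vendor
# specifics: RAID 1 is modeled as an n-way mirror of one disk.

def usable_disks(level, n):
    return {
        "RAID 0":  n,        # striping only, no redundancy
        "RAID 1":  1,        # n-way mirror of a single disk
        "RAID 5":  n - 1,    # one disk's worth of parity
        "RAID 10": n // 2,   # striped mirrored pairs
    }[level]

def worst_case_faults(level, n):
    return {
        "RAID 0":  0,
        "RAID 1":  n - 1,
        "RAID 5":  1,
        "RAID 10": 1,        # more survive only if failures hit different pairs
    }[level]

n = 6
for level in ("RAID 0", "RAID 1", "RAID 5", "RAID 10"):
    print(f"{level:8s}: {usable_disks(level, n)} of {n} disks usable, "
          f"survives {worst_case_faults(level, n)} worst-case fault(s)")
```

The table makes the problem plain: RAID 5 maximizes usable space but is guaranteed to survive only one fault, while RAID 10 gives up half the capacity for mirror pairs whose double-fault survival depends on luck of placement.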
Maybe I can make disks irrelevant… no, I can’t. I can push my applications toward enterprise storage arrays, but this is a big issue there, too. Similarly, the movement to embed hypervisors in hardware just shifts the issue to the central disk arrays. I don’t want to shuffle the problem around; I want to solve it. The closest I can get is keeping backup copies of my data in as many places as possible, sharing as little as possible: copies in separate data centers, on separate servers and disk arrays, preferably even on separate types of media, like tape.
All of that is expensive, though. Money is the ultimate trade-off with this sort of discussion.
For now I think RAID 1+0 plus a spare, on 73 GB 15K RPM disks might be my new direction.