Storage virtualization engines suck. Sure, they make it easy to move your data around, and they add cache, which helps a bit. But while they're doing that they also add another point of failure on your SAN, another potential performance bottleneck, and another system to learn. They're clumsy, they're feature-poor, and I think they have a really long way to go.
I’d think about them more positively if they did a few things for me:
1) Automate array-level failover. Take IBM's SAN Volume Controller (SVC), for example. It can mirror your data to another array, but if the primary array dies your hosts lose their storage. You then have to go into the SVC, promote the second copy to primary, and recover all your hosts. LAME. Just freaking do it automatically so the hosts don't crash, and offer an option to shut that off for the freaks who don't want it. (There's a rough sketch of this and item 2 after the list.)
Why do I want this? A lot of array and SAN maintenance is orders of magnitude easier if the array is not servicing I/O. Period. I want the virtualization engine to give me the same seamless failover that I get from RAID 1 on a local RAID controller.
2) RAID 1-like reads from arrays. One of the things you get from RAID 1 is the ability to read from both disks at once, roughly doubling the read rate. Why can't we read simultaneously from both the primary and secondary copies of the data?
Update: I realize suggestions 1 and 2 are not feasible in situations where the remote copy is distant. In my case, though, the remote copy is 2 kilometers away (by wire). I’ve also seen situations where people had their second copies in the same data center. Why? Long story.
3) Have a user interface and implementation that makes life simpler. Every virtualization technology I've seen adds another layer of abstraction but ships with a whole new interface to learn, and that interface is inevitably designed by an engineer. EMC, IBM: hire a UI designer to help you. The goal should be that you can put a new virtualization engine in front of a moderately knowledgeable SAN admin and have them set the system up from scratch, including business-continuity tasks like remote mirrors, without the manual.
4) Optimize my I/O for me automatically. You know what's doing a lot of I/O, so help spread it around to minimize hotspots on the disks. While you're at it, let me define the relative speeds of my arrays and use that to make decisions: the host LUNs that aren't I/O-intensive should automatically get moved from my expensive EMC DMX-3 to my all-SATA EMC CLARiiON, based on rules I set. If you want an example of this, look at the way VMware does its Distributed Resource Scheduler (DRS) in VI3. With DRS I can set rules, but the main control is a slider bar: one end is "all operations are manual," the other end is "fully automated," and in between sets the tolerance for partial automation. (The second sketch after the list shows the kind of policy I mean.)
What would be really slick is block-level moves, where a host might see a contiguous LUN but only the most active sectors reside on the fast disk. The rest is on slower storage. Why does information lifecycle management (ILM) need to be at the file or LUN level?
5) Do not rely on an IP network. I built a SAN because my IP network is run to different SLAs that are incompatible with storage traffic. Don't make me tie my completely isolated SAN into my administratively incompatible IP network. I'm not saying you should ignore IP networks, just be open to not using one. Hell, run IP over the SAN and call it even.
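To make items 1 and 2 a little more concrete, here's a minimal sketch of the behavior I'm asking for: a virtual LUN that spreads reads across both mirror copies and quietly keeps serving I/O from the survivor when one array dies. The ArrayCopy and VirtualLun classes and their read()/write() calls are made up for illustration; a real engine would do this in the data path, not in a script.

```python
import itertools


class ArrayCopy:
    """One copy of the data on one physical array (primary or remote mirror)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def read(self, lba, length):
        if not self.healthy:
            raise IOError(f"{self.name} is down")
        return b"\x00" * length            # placeholder payload

    def write(self, lba, data):
        if not self.healthy:
            raise IOError(f"{self.name} is down")


class VirtualLun:
    """The LUN the host sees; the virtualization engine hides the two copies."""

    def __init__(self, primary, secondary, auto_failover=True):
        self.copies = [primary, secondary]
        self.auto_failover = auto_failover     # the "option to shut it off"
        self._next = itertools.cycle(self.copies)

    def read(self, lba, length):
        # RAID 1-style reads: alternate between copies and skip any that are down.
        for _ in range(len(self.copies)):
            copy = next(self._next)
            if not copy.healthy:
                continue
            try:
                return copy.read(lba, length)
            except IOError:
                copy.healthy = False           # mark it failed, try the other copy
                if not self.auto_failover:
                    raise
        raise IOError("all copies failed")

    def write(self, lba, data):
        # Writes still go to every healthy copy so the mirror stays in sync.
        live = [c for c in self.copies if c.healthy]
        if not live:
            raise IOError("all copies failed")
        for copy in live:
            copy.write(lba, data)


# The host keeps getting its data even if the primary array dies.
lun = VirtualLun(ArrayCopy("DMX-3"), ArrayCopy("remote-mirror"))
lun.write(0, b"hello")
lun.copies[0].healthy = False    # simulate the primary array failing
print(lun.read(0, 5))            # served from the surviving copy, no host crash
```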
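And for item 4, here's a rough sketch of the kind of policy engine I have in mind, with the DRS-style slider as an explicit setting. The tier names, IOPS thresholds, and move_extent() hook are all invented; it just shows the shape of the decision: promote hot extents, demote cold ones, and either do it, recommend it, or stay out of the way depending on the automation level.

```python
FAST, SLOW = "DMX-3", "CLARiiON-SATA"      # invented tier names

def plan_moves(extent_heat, extent_tier, hot_iops=500, cold_iops=50):
    """Return (extent, from_tier, to_tier) moves based on recent IOPS per extent."""
    moves = []
    for extent, iops in extent_heat.items():
        tier = extent_tier[extent]
        if iops >= hot_iops and tier != FAST:
            moves.append((extent, tier, FAST))      # promote hot extents
        elif iops <= cold_iops and tier != SLOW:
            moves.append((extent, tier, SLOW))      # demote idle extents
    return moves

def apply_moves(moves, automation):
    """automation: 'manual' | 'recommend' | 'auto' -- the slider positions."""
    for extent, src, dst in moves:
        if automation == "auto":
            move_extent(extent, src, dst)           # just do it
        elif automation == "recommend":
            print(f"RECOMMEND: move extent {extent} from {src} to {dst}")  # human go/no-go
        # 'manual': do nothing; the admin drives everything

def move_extent(extent, src, dst):
    print(f"moving extent {extent}: {src} -> {dst}")   # stand-in for the real data mover

# Extent 7 is hot but sitting on SATA; extent 3 is idle on the expensive tier.
heat = {3: 10, 7: 900, 12: 200}
tier = {3: FAST, 7: SLOW, 12: FAST}
apply_moves(plan_moves(heat, tier), automation="recommend")
```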
I know some virtualization offerings have parts of this, but none of them have enough to warrant the large price tags. Make my life easier in ways that matter and I’ll buy.
I’ve been wondering about the possibility of transparent failover for a while now, but it’s not as simple as it seems. The array I’m most familiar with is one you don’t mention, the HP Enterprise Virtual Array (EVA). All the critical components are duplicated: it has two storage controllers with transparent failover and dual back-end loops, and dual SAN fabrics (switches, HBAs, etc.) are the norm. There's no SPOF unless you set up RAID-0 volumes, and you can replicate data to another array if you want.
However, you’re talking about failover between complete arrays, a.k.a. “site failover”, since real disaster recovery means having arrays on different physical sites. That’s a bit more serious: if there is a real disaster, you’ve lost the servers AND the arrays at one site, so you hope to fail all that over at once. Trying to make that fully transparent is a tall order.
The kind of thing you’re talking about is what HP calls “Cluster Extensions” (CLX), which can automate “site failover” of arrays and server clusters at the same time. It needs a lot of planning and testing, but the result is transparent failover at the “share” level (where clients attach to clusters).
Great wishlist. These items are pretty consistent with what we’ve been told by our customers and prospects. I’m pleased to say that most of them (I’m not sure about the RAID-1 request) are among the design goals for future versions of our storage virtualization products. As noted, however, they are not trivial, and there are lots of options for how to implement them.
We’ve found that while there’s a lot of interest in automation, the overwhelming majority of our customers don’t want our storage systems to implement changes on their own. Part of the issue is trusting our systems to make good decisions, but much of it is that a human can weigh other business factors. What most customers want is for our systems to make intelligent recommendations and then let a human make the go/no-go decision.
Rick, Product Marketing, Hitachi Data Systems
Note that the comments above are personal interpretation and do not necessarily reflect corporate position.