Brian over at stereoroid.com commented on my last post about what I want from a storage virtualization engine. Brian, I hope you don’t mind, but I’d like to answer outside of the comments section. Your comments weren’t counterpoint so much as added clarity, something my other post may have lacked. I hope this doesn’t scare people away from making comments. 🙂 I really appreciate them.
I’m most familiar with EMC and IBM high-end and midrange offerings, and a smattering of whatever LSI Logic is called now (StorageTek/FastT/Engenio/etc.). I don’t know very much about HP EVAs. From what you’ve said, it sounds like they are very much like EMC’s and IBM’s high-end offerings in that there are no SPOFs.
The problem is that high-end arrays are expensive and stodgy. Midrange arrays have cool new features on them. High-end arrays are stable. Midrange arrays aren’t so stable. From what I’ve seen, I’ve concluded that what makes midrange arrays cool and modern is also what makes them unstable (maybe this is just a property of the arrays I’ve been subjected to). They try to do too much to please people, and end up pleasing nobody because the faults are so devastating. To fix problems we always need the next version of the code, and to get there we need to take an outage. In my experience with EMC CX700s, you will get an outage even when they say you don’t need one, so it’s best to just take one and not have a mess on your hands. My experience with StorageTek D-series arrays, aka IBM FastT, is similar: it’s just easiest to bring the array down. Ditto for Apple Xserve RAIDs on the low end, where there is no redundancy at the controller level. Since my DR copy is within 2 kilometers, I could just use that, and take the outage on the array and not on my servers.
What if storage vendors split their software development from their hardware development? We know software has bugs, and that these defects are inevitable. We know that sometimes software bugs in one part of a system creep over and break other, working parts of the system. I’ve seen this in my storage arrays, where MirrorView dumps out and kills a storage processor. What if we separated the two so that it was clearer what was running where? What if one team within a storage company set out to make an array that was ultra-stable but had only the most basic of features? What if another team set out to implement all the cool advanced features somewhere else, like a virtualization controller, at a level above the individual array? Since that’s the level we expect the software to operate at, doesn’t it make sense to position it there physically as well as logically?
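To make that split concrete, here’s a rough sketch in Python of the division of labor I’m imagining. Everything here is hypothetical, not any vendor’s product or API: the array layer does nothing but carve out and present LUNs, and all the feature code (placement, migration, and so on) lives one layer up in the virtualization engine, where a bug can’t take an array’s firmware down with it.

```python
# Hypothetical sketch, not any real vendor's API: the point is the split
# between a dumb, stable array layer and a feature-rich layer above it.

class BasicArray:
    """An ultra-stable array: it creates and presents LUNs, nothing else."""

    def __init__(self, name: str):
        self.name = name
        self.luns: dict[str, int] = {}  # lun_id -> size in GB

    def create_lun(self, lun_id: str, size_gb: int) -> None:
        self.luns[lun_id] = size_gb


class VirtualizationEngine:
    """All the 'cool advanced features' live here, above the arrays."""

    def __init__(self, arrays: list[BasicArray]):
        self.arrays = arrays
        self.vdisks: dict[str, tuple[BasicArray, str]] = {}  # vdisk -> (array, lun)

    def create_vdisk(self, vdisk_id: str, size_gb: int) -> None:
        # Trivial placement policy for the sketch: pick the least-loaded array.
        target = min(self.arrays, key=lambda a: sum(a.luns.values()))
        lun_id = f"{vdisk_id}-backing"
        target.create_lun(lun_id, size_gb)
        self.vdisks[vdisk_id] = (target, lun_id)

    def migrate(self, vdisk_id: str, destination: BasicArray) -> None:
        # Feature code only touches the mapping layer; if it misbehaves,
        # the arrays underneath keep serving their LUNs.
        source, lun_id = self.vdisks[vdisk_id]
        destination.create_lun(lun_id, source.luns.pop(lun_id))
        self.vdisks[vdisk_id] = (destination, lun_id)


engine = VirtualizationEngine([BasicArray("array-a"), BasicArray("array-b")])
engine.create_vdisk("oracle-data", 500)
engine.migrate("oracle-data", engine.arrays[1])
```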
I’m not saying my wishlist is universal but I do think it has merit for a lot of companies and institutions. Classically, IT resources that have been considered “infrastructure” end up with very little flexibility. With that comes a general stodginess and slowness to change. These infrastructure resources are the ones with the most potential to advance, though, and while we’re marching forward I want to see us add as much flexibility as we can to them before we get entrenched again. That way we might be able to change a little faster in the future.
Interestingly enough, and as a topic change, when I talk about site failover I really am not interested in disaster recovery. If you lose a whole site you’re in trouble, and it’s a whole different mess. Not to say that I’m not concerned with it, because I am, but it’s a different problem. Some of what I propose might help with DR but I’m more interested in mitigating the ongoing week to week trouble that crops up as new applications and systems are added to a data center. I want to see administrators given more tools to fight back, and I want to see vendors thinking intelligently about how these tools are built. Cluster Extensions from HP, HACMP from IBM, all of those are really cool if you have a system that can incorporate them. I haven’t had the opportunity to work with them extensively but from the little I’ve seen they are expensive, require lots of planning, and necessarily have many rules and constraints. From a business perspective they seem to make sense but from a flexibility standpoint they leave much to be desired. I also think they are a symptom of a larger reliability problem. What if we could make monolithic machines more reliable?
Yeah, right. 🙂
Hi again. I totally understand where you’re coming from. I was just thinking in “business continuity” terms, e.g. the way banks in downtown NYC needed to (and were able to) keep the business going on September 11, 2001. A major disaster, certainly, but they could not stop business, even for that. (It’s all about the Money!) It wasn’t automated back then, but it’s getting there, and that’s kinda what I was thinking of when I saw your original question.
“What if storage vendors split their software development from their hardware development?”
In some ways this is already happening, with SNIA-driven interoperability initiatives. It seems like everyone is adding SMI-S-compliant management layers to their storage, so independent developers can build their own software front-ends. Even EMC, which is surprising considering their history of locking the end user out of real array management. AppIQ were doing well here before HP gobbled them up and rebadged their software as “Storage Essentials”. Storage management is one of the growth areas in the storage business at the moment; everyone wants in on the act, it seems. But it’s not cheap, and it only really makes sense in large SANs.
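Just to show what I mean by an independent front-end: SMI-S sits on top of CIM/WBEM, so a third-party tool can enumerate a vendor’s storage with a generic client. Here’s a rough sketch using the open-source pywbem library; the host, credentials, and namespace are placeholders (providers differ, “interop” is common but vendor-specific namespaces exist), and real providers may expose more or fewer properties than shown.

```python
# Rough sketch of an independent SMI-S front-end using pywbem, a generic
# CIM/WBEM client. Host, credentials, and namespace below are placeholders.
import pywbem

conn = pywbem.WBEMConnection(
    "https://smi-provider.example.com:5989",  # vendor's SMI-S provider
    ("monitor", "secret"),                    # read-only credentials
    default_namespace="interop",              # often 'interop'; vendors vary
)

# CIM_StorageVolume is the standard class SMI-S providers use for LUNs.
# Some providers omit properties, so a real tool would check for None.
for vol in conn.EnumerateInstances("CIM_StorageVolume"):
    size_gb = (vol["BlockSize"] * vol["NumberOfBlocks"]) / 1024 ** 3
    print(vol["ElementName"], f"{size_gb:.1f} GB")
```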
As for making monolithic machines more reliable… oh, you can, just be prepared to pay for it (think Tandem). For me, a cheapo Windows cluster keeps the data available if one node goes down.
“What if another team set out to implement all the cool advanced features somewhere else, like a virtualization controller, at a level above the individual array? Since that’s the level we expect the software to operate at doesn’t it make sense to position it there physically as well as logically?”
I’m not sure what the “cool advanced features” really are, but whether I’d want them in a separate box would depend on what’s involved. E.g. if it involves moving lots of data around, then I don’t want that traffic on the SAN if it can be avoided. If it’s anything that is essential for access to the data, I don’t want a single point of failure. If it’s anything along the lines of “tiered” or “hierarchical” storage, sending “stale” data off to “nearline” storage, that happens at the file-system level while the array offers block-level storage, so it will need an extra box and extra layers. And so on; I look at these on a case-by-case basis.
Sorry if I’ve gone on a bit, you seem to have brought out my pedantic streak..!
There’s a short podcast at http://www.evolvingsol.com/sol_storagevirtualization.html on the value of storage virtualization.