Rackspace's Terrible Maintenance Plan

Update, 3/21/12: please read the comments, too — we got a good response from one of Rackspace’s folks.

I got a note today from Rackspace, where I have two virtual servers in their Rackspace Cloud. It was opened in the form of a support ticket, waiting for input from me, but with the text of the support ticket labeled as if I entered it, which was weird.

As part of our ongoing effort to provide you with the best Cloud Servers service possible, we routinely perform maintenance and upgrades of our underlying systems. The majority of these are performed non-disruptively, however maintenances sometimes arise that impact Cloud Servers instances. At this time, a Cloud Servers host update is required that will involve an automated migration (i.e. relocation to a new physical host server) of some cloud servers including the following server(s) in your account:

This makes sense. Over time, hardware gets old and should be replaced. Amazon just did this, too, forcing a lot of people to reboot their stuff, and in the restart process their instances find their way to hardware & hypervisors from this decade. It shouldn’t be a big deal. I do this at work with my virtualized stuff, too, and most people can work it into their normal maintenance.

Preferred Option: MANAGED MIGRATIONS – Rackspace managed and controlled

After March 21st, 2012 at 11:59PM CDT, managed migrations will begin being scheduled for any cloud servers listed above. If you have multiple servers listed, it is likely that they will be spread out across several days beyond March 21st. You will be notified 24 hours in advance of any managed migrations and will be provided a specific time window in which the migration will occur. Managed migrations require no effort on your part, will automatically be performed by Rackspace, and will effectively appear to you as a reboot…

Alternate Option: AUTOMATED SELF-MIGRATIONS – One click migrations you control

To allow you to migrate at your convenience and minimize impact to your applications, you may perform your own automated self-migrations anytime between now when your servers are scheduled for a managed migration. This process is simple, can be performed with the click of a button, and will effectively appear to you as a reboot.

The plan seems sound, but today is March 20, 2012. From the time this made it to my inbox I had only 57 hours (33 + the 24 they give you prior to doing it for you) to reboot my stuff on my own terms? Are you serious?
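
For anyone checking the arithmetic, here is a minimal sketch of where the 57 hours comes from. The arrival time is an assumption inferred from the 33-hour figure (the post does not state when the notice landed), and the variable names are purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Central Daylight Time as a fixed offset (UTC-5).
CDT = timezone(timedelta(hours=-5), "CDT")

# Deadline from the original notice: March 21st, 2012 at 11:59 PM CDT.
deadline = datetime(2012, 3, 21, 23, 59, tzinfo=CDT)

# ASSUMPTION: arrival time back-computed from the 33-hour figure,
# not something stated in the notice or the post.
received = datetime(2012, 3, 20, 14, 59, tzinfo=CDT)

# Hours until managed migrations may start being scheduled.
hours_to_deadline = (deadline - received) / timedelta(hours=1)

# The notice promises 24 hours of warning before a managed migration.
advance_notice_hours = 24

total_hours = hours_to_deadline + advance_notice_hours
print(hours_to_deadline, total_hours)  # 33.0 57.0
```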

While I’m reading this whole thing an update comes in:

We apologize for the miscommunication. The date on the initial update is incorrect. Managed migrations will not begin on these servers until after March 27th, 2012 at 11:59 PM CST.

Now I get a week to take my stuff down on my own terms, which, frankly, isn’t a whole lot better. What if I don’t have a maintenance window in the next week, or staff availability, or any other good reason why I can’t or shouldn’t do the work in that big of a rush? This becomes an unplanned emergency for me now, for no good reason as far as I can tell.

Here’s my take on this whole situation:

  1. A day’s notice is irresponsible & asinine, a week’s notice is ridiculous. A month is better. Two months would be nice. Give people ample time & notice to take care of the situation themselves, then force the stragglers into compliance. There’s a darn good chance that in the next two months they’ll have a maintenance window anyhow. And it isn’t like the folks there at Rackspace haven’t known this was going to need to happen for a long time (or, if this was somehow a surprise, they need a new project manager). Send the stragglers a note every week for two months, and in the last email to them assign a firm date & time when they’re going to see a forced reboot.
  2. Fix the CST/CDT issues in the notices so they’re accurate. Better yet, express time the way everybody else with a multi-timezone audience does: in 24-hour GMT. If you’re worried about people asking questions, put a translation table into the FAQ.
  3. While I think the support ticket idea was a good one, don’t open support tickets in my name with initial text credited to me that I didn’t enter. At best it’s confusing, and I’m the kind of guy who doesn’t like being credited with things I didn’t do. The followup correction was from “Support at Rackspace Cloud…” so it’s obvious it can be done. Choose to do it right.
  4. Get someone who is detail-oriented to read the notices you send to your customers prior to sending them, to vet the whole plan, and perhaps play the devil’s advocate. I suspect there’s someone in the Rackspace Cloud support organization who could have provided all this same feedback, internally, if they’d had the chance, but now you have an annoyed customer doing it for you. Nice shot.
  5. Do a better job of highlighting that this process won’t be instantaneous, that it requires a soft reboot from the Rackspace Cloud control panel and not from within the OS, and that people should go read the FAQ for more details on how long it’ll take. The notice could have easily been more informative, though the FAQ did do a good job of indicating to me what was going to happen.
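
To make the CST/CDT point in item 2 concrete, here is a small sketch using only Python's standard library. It takes the corrected deadline from the notice and renders it in 24-hour UTC twice: once reading "CST" literally (UTC-6), and once as CDT (UTC-5), which is what US Central time was actually observing in late March 2012. The helper name `to_utc_24h` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for US Central time: CST is UTC-6, CDT is UTC-5.
CST = timezone(timedelta(hours=-6), "CST")
CDT = timezone(timedelta(hours=-5), "CDT")


def to_utc_24h(local: datetime) -> str:
    """Render a timezone-aware datetime as a 24-hour UTC string."""
    return local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


# The corrected deadline, read literally as written in the notice (CST).
deadline_as_cst = datetime(2012, 3, 27, 23, 59, tzinfo=CST)

# The same wall-clock time read as CDT, the zone in effect on that date.
deadline_as_cdt = datetime(2012, 3, 27, 23, 59, tzinfo=CDT)

print(to_utc_24h(deadline_as_cst))  # 2012-03-28 05:59 UTC
print(to_utc_24h(deadline_as_cdt))  # 2012-03-28 04:59 UTC
```

The two readings differ by an hour, and both land on March 28 in UTC, which is exactly the sort of thing a short translation table in the FAQ would clear up.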

I’m comfortable saying that if this were a change request I’d filed in my place of employment it would have been denied by our change managers based on lack of timely customer communication for a non-emergency change and inaccurate details. C’mon Rackspace, you can do much better, and you need to if you want enterprises to move any workloads in your direction after this.

9 thoughts on “Rackspace's Terrible Maintenance Plan”

  1. I was floored when they told me they didn’t have the capability to migrate between hosts – even Hyper-V has live migration.

    • Their FAQ says that they’re changing hypervisors, which explains why they cannot live-migrate. I’m okay with that, just not the timeline they’re forcing.

  2. Companies that have gone bankrupt have given their customers more time to grab their data and run.

    But what IaaS company out there really treats their customers that graciously? I mean, support like that would have to be downright fanatical. Wait…

  3. Though I agree that more notice would be nice, if you require long lead times on any change and can’t handle a (relatively) spontaneous restart or migration, I would assert that you miss the entire point of building out in a cloud.

    Your design can use ESX as the only component of an HA solution or RAID1 as your only “backup” solution, but that doesn’t mean either is a good design choice.

    • The “build for failure” model is great when you’re a larger organization and can justify it, but entirely overkill for many small & medium-sized businesses who don’t want to pay the 200%-500% overhead of extra VMs, load balancers, and administration when they really don’t need anything more than a single small server.

      I don’t consider a month for maintenance activities a long lead time, either, especially for an entirely predictable non-emergency situation.

      Also, RAID1 isn’t a backup solution, no matter who you are.

  4. Bob,
    Great post – this caught the attention of the team working on this maintenance (of which I am a part) so I wanted to respond to your take on the situation:

    1. Limited advance notice
    We seek to provide customers as much advance notice as possible but there are instances where that notice has to be shorter than we or our customers would prefer. I know this does little to make you feel better about the situation. If you have extenuating circumstances you need to plan around (such as those mentioned in your post) please open a ticket and mention our dialogue so that I can see how our team can best work with you.

    2. CST/CDT issues
    Completely agree with you – the team working on this maintenance made the switch to UTC today. Just for background there are instances where our maintenance windows expressed in UTC would break across multiple days but when expressed in CDT stay on the same day. We were trying to avoid the potential confusion that multiple dates in a notice could create but I like your idea of including a translation table in the FAQ.

    3. Ticket opened in your name
    It is an awkward way of starting the dialogue but the system we use has a limitation that the FIRST comment always looks like you initiated it. When we make updates to a ticket it gets attributed to us correctly but that first comment always looks like the customer made it. I will check with the team that handles that particular vendor relationship to see if it can be changed but I do agree that is a confusing way to start a ticket.

    4. Error in original message
    This is actually my fault. I changed some of the general wording of the ticket and sent it over to the team that was getting ready to send out your notice. I sent them changes based off a template that used A date but did not use THE date for your particular maintenance. This demonstrates poor execution of change control on my part and I apologize for the confusion this caused.

    5. Do a better job explaining the process
    We debated how much information to include in the original notice versus what to put in the FAQ. The first draft was much longer but the team felt that the main message would get lost in all the information we were placing in the ticket. I am glad you found the FAQ to be helpful but it sounds like we may have swung the pendulum too far towards short and concise for our original message.

    I hope my response helps to provide some additional insight to the poor experience you have had. We want you to be satisfied with our support so if there are further issues we can address please let us know.

    Thanks,

    Jeff

    • Jeff, thank you for replying. It’s a credit to Rackspace and its employees that you have taken the time to do so, and are doing it publicly. Many companies won’t reply, or they force their employees to do so with personal accounts, on the sly. It’s also a big credit to you that you can take responsibility for some of this. Where I work we try very hard to not assign blame, but just move forward with fixing the problem and learning from the situation. Hopefully Rackspace is the same way. I also joke that if you aren’t screwing something up once in a while you probably aren’t a productive employee, either. 🙂

      As far as the scheduling goes, item #1, perhaps you’d consider saying upfront that this is an emergency or a situation where a longer window isn’t possible. Knowing that we can open a case to discuss the situation goes a long way to helping customers, though. Should definitely make that as clear as you can upfront.

      #3 isn’t that big of a deal. I like the approach, and it solves the unending open communication problem by having the ticketing system do its thing to auto-close stuff. The rest of it is just communication cleanups, and I appreciate you considering my thoughts on the matter. I try to be hyper-accurate when it comes to communications of this nature, and hold my vendors to the same standards. Time is always a pain, because nobody gets it, ever.

      Not to keep going back to my environment, but we have a set of guidelines for risk & potential impact vs. amount and type of communication needed. They’re just guidelines but they’re handy because everybody knows what to do and what to expect, and we share them with our customers so they know that, as an example, they’ll have two weeks or four weeks of notice for certain types of big, planned changes. Does Rackspace have anything like that, or published scheduled maintenance windows in which work like this could be done? My cursory searching last night didn’t find anything.

      I hope I run into you at a conference, I’ll buy you a beer.

  5. FWIW, when my people asked for an extension, RS was totally unwilling to work with us. Also, for the little corner of our architecture for which I am responsible (7 servers), 2 of them failed to migrate cleanly. The first one simply came back corrupted and the 2nd one failed to migrate altogether. (RS support doesn’t know what went wrong so we are going to migrate it “later” – wasting a not insignificant portion of my Saturday)

    I understand that in today’s fluffy society they will probably get a pass from most people, but to me this is anything BUT how a professional hosting service should work. These last several months have *really* made me miss slicehost.
