My coworkers and I recently undertook the task of upgrading our vSphere 5.1 environment to version 5.5. While upgrades of this nature aren’t really newsworthy, we did something of increasing interest in the VMware world: switched from the Windows-based vCenter Server on a physical host to the vCenter Server Appliance, or vCSA, which is a VM. This is the story of that process.
If you aren’t familiar with the vCSA, it is a vCenter implementation delivered as a SuSE-based appliance from VMware. It has been around for several major versions, but until vSphere 5.5 it didn’t offer both feature parity with the Windows version and the ability to support very many hosts & VMs without connecting to an external database. Under vSphere 5.5 the embedded database has improved to support 100 hosts and 3000 virtual machines, which easily covers our needs. While my team consists of very capable IT professionals, able to run Windows and MS SQL Server with their proverbial eyes shut and limbs tied behind them, it’d be better if we simply didn’t need to. On top of all of this, upgrades between major vCenter releases on Windows have always been perilous, with full reinstalls the norm. The few major upgrades we’ve done with the vCSA have been pretty straightforward and easy, and when they weren’t we just reverted the snapshot and tried again.
There are still some limitations to the vCSA. It doesn’t support linked mode, because linked mode is built on the Active Directory Application Mode (ADAM) functionality in Windows (which is also why a Windows vCenter cannot reside on a domain controller). We don’t use linked mode because it makes the environment more complicated, without much return on investment for the time we would spend dealing with the additional complexity. The vCSA doesn’t support vCenter Heartbeat, either. We don’t use Heartbeat because it’s fairly expensive, and if our vCenter servers are virtual machines we can use snapshots, replication, HA, and DRS to help minimize possible downtime.
Last, the vCSA doesn’t include Update Manager support, so you still need a Windows guest to run it, and, if you follow directions, an MS SQL Server, too. We thought about those directions, and how we actually use Update Manager. We use Update Manager to keep our infrastructure updated, but it isn’t critical to our operations, and our Update Manager configuration isn’t complicated (the default baselines, add the Dell OpenManage depot URL, upload a couple of custom ESXi boot images, Dell EqualLogic MPMs, and newer Broadcom drivers for our blades). Coupled with the ability to take snapshots, and our use of Veeam Backup & Replication to back the whole thing up, what would we lose if it was down for a day? (Nothing; we plan our patching in advance.) Does anybody but my team rely on it? (No.) What would we lose if we had to rebuild it from scratch or restore it from backup? (About an hour of someone’s time.) Are we concerned with SQL database performance for this application? (No; we run scans and remediations asynchronously — we start them and walk away.) Given this, we decided to build a Windows VM for each of our vCSAs to run Update Manager, and we would use the MS SQL Server Express database it offers to install for non-production use. Easy.
While it is possible to run vCenter inside the cluster it manages, not all the VMware components support that as well as vCenter does. As a result, VMware best practices for vCloud instances, and likely many other things going forward, now include the idea of a separate “management cluster.” This cluster should be a simple, independent cluster that houses all the management components for a site, like vCenter, vCloud Director, Chargeback, Site Recovery Manager, Infrastructure Navigator, etc. Not only does this simplify administration, it helps organizations properly compute the overhead costs associated with their environments, and it makes some aspects of licensing and business continuity & disaster recovery easier. Since we were redoing all our management infrastructure anyhow we decided it would be a good time to implement this. It looks something like:
There isn’t an official upgrade process to move from Windows vCenter to the vCSA, so we had to come up with our own. What we’ve done in the past is disconnect an ESXi host from vCenter with all the VMs running, and add it to another vCenter somewhere else. When we tested that we found a big snag: the vSphere Distributed Switches (vDS) disappeared. In vSphere 5.1 VMware added the ability to export a vDS configuration and import it somewhere else, which, in theory, should have made this easy. When we did that export/import and then reconnected our ESXi hosts the vDS on the host didn’t mate up with vCenter’s vDS, erasing the vDS on the host and leaving our VMs with no network. Not good.
As it turns out, there is a bug in vSphere 5.1 that prevents this from working correctly, which has been fixed in vCenter 5.1 Update 2. Our vCenter was 5.1 Update 1, and because Windows vCenter upgrades are often a crapshoot we didn’t feel like wasting a ton of our staff time getting to Update 2. Most of our network links are redundant, and standard virtual switches import seamlessly. So, using a bunch of PowerCLI commands we moved the redundant NICs to a new standard vSwitch and recreated the tagged VLAN port groups.
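The gist of those PowerCLI commands looked something like this — a sketch only, where the vCenter name, the `vmnic3` uplink, and the port group names/VLAN IDs are placeholders for whatever your environment actually uses:

```powershell
# Build a standard vSwitch on the redundant (secondary) NIC of each host
# and recreate the tagged VLAN port groups that exist on the vDS.
Connect-VIServer vcenter.example.com

# Placeholder port-group-name -> VLAN ID map; yours will differ.
$vlans = @{ 'VM-Prod' = 100; 'VM-Dev' = 200; 'VM-DMZ' = 300 }

foreach ($vmhost in Get-VMHost) {
    $vss = New-VirtualSwitch -VMHost $vmhost -Name 'vSwitch1' -Nic 'vmnic3'
    foreach ($pg in $vlans.GetEnumerator()) {
        New-VirtualPortGroup -VirtualSwitch $vss -Name $pg.Key -VLanId $pg.Value | Out-Null
    }
}
```

Running it per host keeps the operation incremental, so a mistake only takes out one host’s standby path rather than the whole cluster’s.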
Our general plan became:
- Build the new management cluster first, get that set up, tested, and debugged. This also gives people a chance to upgrade clients and whatnot. Deploy a Veeam backup proxy into the management cluster so you can back the new appliances up.
- Get the new production cluster vCSA deployed, get authentication working, and duplicate the clusters, resource pools (enable DRS in partial mode), folder structure, and permissions. This was also a good time to work through some of the vSphere Hardening Guide, dealing with SSL, resetting service account passwords to long random strings, and ensuring there is a service account for each add-on (vCOPS, Veeam, VIN, etc.).
- Document resource pool configurations, as the cutover process will mess with them and you want to know the way they were set up originally.
- Document HA exceptions and settings.
- Document all DRS rules and groups for re-creation on the new vCSA (you can’t create rules until vCenter sees the VMs).
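One way to capture the affinity/anti-affinity rules with PowerCLI (a sketch; the output path is a placeholder, and note that in this PowerCLI version `Get-DrsRule` doesn’t expose VM-to-host groups, so we documented those by hand):

```powershell
# Dump every DRS affinity rule per cluster to CSV so the rules can be
# recreated by hand on the new vCSA once it can see the VMs.
Get-Cluster | ForEach-Object {
    $cluster = $_
    Get-DrsRule -Cluster $cluster | Select-Object `
        @{N = 'Cluster'; E = { $cluster.Name }},
        Name, Enabled, KeepTogether,
        @{N = 'VMs'; E = { (Get-VM -Id $_.VMIds).Name -join ';' }}
} | Export-Csv -NoTypeInformation 'C:\temp\drs-rules.csv'
```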
- Import a copy of the vSphere Distributed Switches, because even if we couldn’t use them straight up it made rebuilding easier. Resist the urge to upgrade them to 5.5 at this point — remember that you’ll be importing ESXi 5.1 hosts which can’t participate in a newer vDS. We also audited the port group configurations at this time.
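The export/import itself can be scripted; something like the following, with server names, paths, and the datacenter name all placeholders:

```powershell
# Export each vDS (including its port groups) from the old vCenter.
Connect-VIServer old-vcenter.example.com
Get-VDSwitch | ForEach-Object {
    Export-VDSwitch -VDSwitch $_ -Destination "C:\temp\$($_.Name).zip"
}

# Restore one into the new vCSA's datacenter from the backup file.
Connect-VIServer new-vcsa.example.com
New-VDSwitch -BackupPath 'C:\temp\dvSwitch0.zip' -Location (Get-Datacenter 'Production')
```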
- Set Update Manager up so we could do ESXi 5.5 upgrades.
- Verify all physical network port configurations. We actually didn’t do this, trusting that they’d been set up correctly by our Networking group. We discovered, the hard way, that at some point some of our ports became misconfigured through human error (switchport trunk allowed vlan vs. switchport trunk allowed vlan add — under Cisco IOS the word “add” is very significant), and others through configuration rot. As you’d expect, this caused outages when VMs were migrated to those ports. It’s an easy fix: fix the ports, put the VMs back on the primary NIC, or put the primary NIC in the standard vSwitch temporarily. I suggest you trust but verify. Actually, I suggest you automate and remove the humans from the process altogether.
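To illustrate that IOS pitfall (interface and VLAN numbers hypothetical): without the word `add`, the command replaces the trunk’s entire allowed list instead of appending to it.

```
interface TenGigabitEthernet1/0/1
 switchport mode trunk
 switchport trunk allowed vlan 100,200
 ! This REPLACES the allowed list -- only VLAN 300 survives:
 switchport trunk allowed vlan 300
 ! This APPENDS, which is almost always what was intended:
 switchport trunk allowed vlan add 300
```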
- One day before, remove all extra infrastructure components (Infrastructure Navigator, vC Ops, NetApp Virtual Storage Console, etc.) from the old vCenter. There may be a way to keep vCenter Operations Manager going and just move it, but in our testing it lost track of the VMs that moved, even when it could see them on a different vCenter. So we just dumped the reports we wanted, documented the customizations, and planned to start fresh on the other side.
- One day before, split the networking and move all VMs to the standard virtual switches. Use PowerCLI to reduce time and errors. Isolate workloads that do not have redundant networking or rely on a vDS feature to one host that can stay on the old vCenter until a future scheduled outage window. I would suggest using the backup or secondary links for the standard vSwitch. Why? When you add a host to a vDS you’ll be prompted to specify the uplink NIC for that host. vCenter will assign that NIC to the first uplink slot. You can save some work by choosing wisely in this step.
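The VM move can be sketched in PowerCLI like this, assuming (as in our case) that the standard port groups were created with the same names as their vDS counterparts; passing the port group object rather than a name avoids ambiguity between the two:

```powershell
# Move every VM NIC from its vDS port group to the standard port group
# of the same name on that VM's host, if one exists.
foreach ($vm in Get-VM) {
    foreach ($nic in Get-NetworkAdapter -VM $vm) {
        $vssPg = Get-VirtualPortGroup -Standard -VMHost $vm.VMHost `
                     -Name $nic.NetworkName -ErrorAction SilentlyContinue
        if ($vssPg) {
            Set-NetworkAdapter -NetworkAdapter $nic -Portgroup $vssPg -Confirm:$false
        }
    }
}
```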
- Remove the ESXi hosts from the vDSes.
- Day of the upgrade, disable vSphere Replication and Veeam Backup & Replication. We aren’t using these heavily, relying on array-based replication for most of our stuff. If you care about this you will definitely want to test this more than I did.
- Disable HA on the old vCenter (we didn’t want something we did to trigger it, and we’d be online anyhow to restart VMs if something went wrong).
- Cripple DRS by putting it into manual mode. Don’t ever disable DRS — your resource pools will go away.
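These two steps condense to a single hedged one-liner against the old vCenter:

```powershell
# Turn off HA and drop DRS to manual on every cluster. Never disable
# DRS itself -- doing so destroys the resource pool tree.
Get-Cluster | Set-Cluster -HAEnabled:$false -DrsAutomationLevel Manual -Confirm:$false
```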
- One at a time, disconnect (don’t remove) the ESXi hosts from the old vCenter, and add them to the vCSA. We asked it to keep the resource pools, grafting them into the root resource pool. This operation seems to mess with the resource pool settings a bit so you want to have already created good resource pools as part of step 2, and then you can just move the VMs out of one and into the other.
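The per-host move looks roughly like this in PowerCLI (names and credentials are placeholders; we actually did the add in the client so we’d get the prompt about grafting resource pools into the root pool):

```powershell
# Disconnect (don't remove!) a host from the old vCenter; VMs keep running.
Connect-VIServer old-vcenter.example.com
Set-VMHost -VMHost esx01.example.com -State Disconnected -Confirm:$false

# Add it to the new vCSA. -Force takes ownership from the old vCenter.
Connect-VIServer new-vcsa.example.com
Add-VMHost -Name esx01.example.com -Location (Get-Cluster 'Production') `
    -User root -Password 'placeholder' -Force
```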
- Move all ESXi hosts to the vCSA except the host that has workloads with specific networking needs. Get them organized into clusters.
- Sort out resource pools.
- Recreate DRS & HA rules & customizations.
- Re-add the ESXi hosts to the vCSA vDSes. Migrate VMs back to the vDS, then remove the standard vSwitch and re-add that NIC to the secondary uplink.
- Fix & restart Veeam Backup & Replication & vSphere Replication.
- During the wee hours of the next morning we moved the ESXi hosts with the specialized networking on them. We’d documented the networking configurations so we could rapidly rebuild them when they lost their vDS configurations, minimizing the outage.
- Remove all ESXi hosts from the old Windows vCenter. I like doing this because if I ever have to restart the old box, I don’t want that old vCenter getting ideas about my hosts. Shut the old vCenter down. I also disabled the services so they wouldn’t restart, or cause alarms (we have a monitoring check to see if any service set to “Automatic” isn’t running).
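Disabling the services is plain PowerShell on the old Windows box; the service names below are from our 5.1 install and may differ on yours, so verify before running:

```powershell
# Stop the vCenter services and set them to Disabled so they can't
# restart or trip the "Automatic but not running" monitoring check.
'vpxd', 'vctomcat', 'VMwareVCMSDS' | ForEach-Object {
    Stop-Service -Name $_ -Force -ErrorAction SilentlyContinue
    Set-Service  -Name $_ -StartupType Disabled
}
```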
- Re-enable HA. Put DRS back in fully automated mode. Resolve issues.
- Move any service DNS names (A or CNAME records) to where they need to go. We did this earlier in the process and discovered that the old vCenter used some of these names internally. It crashed and halted our upgrade for 45 minutes while we switched them back and restarted the old vCenter. We might have been able to hack it with a hosts file entry or some DNS trickery but it wasn’t worth it, as all the new vCSAs had their own DNS entries.
- Wait for a day or two to make sure everything is stable. If it isn’t you still have a working vCenter 5.1, and you haven’t upgraded your hosts yet.
- Upgrade your hosts to 5.5 using Update Manager.
- Upgrade your vDS to 5.5.
- Switch scripts & third-party apps to use the new vCSAs. You might also be able to do this earlier in the process, depending on the tool.
Bask in the glory of vSphere 5.5 running as a vCSA. It’s a lot of steps but mostly straightforward. It is also a good opportunity to learn how to script a lot of this if you are unfamiliar with it. For a couple of these steps I just used PowerCLI one-liners and populated them with Excel’s autocomplete, pasting them into the PowerCLI window as I needed them. Crude? Yes. Effective? Yes. I’m way better in C, C++, and Perl than I am in PowerShell. Also, I owe Alan Renouf and Luc Dekens beer.
Have we had any issues so far? Yes! No major upgrade is without a bunch of little problems.
- One of our hosts crashed a few months ago from a hardware fault, and during the upgrade when we moved VMs around we were getting disconnected NICs on VMs. When we tried to reconnect them we’d get “Invalid configuration for device ‘0’” and the NIC wouldn’t reconnect. A temporary fix is to reassign that NIC to a different port group, save it, then reassign it to the correct port group and reconnect it. The KB indicates that this is a problem with the cached information for the vDS but that it’s fixed in 5.1. I don’t think it is. Nevertheless, we just split the primary & backup NICs again, created a new vDS, and moved all the VMs a third time.
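That temporary fix can be scripted when it hits many VMs at once; a sketch, where the VM name and the 'TempPG' port group are placeholders for any other port group available on the host:

```powershell
# Workaround for "Invalid configuration for device '0'": bounce the NIC
# through another port group, then back, then reconnect it.
$nic  = Get-NetworkAdapter -VM (Get-VM 'afflicted-vm')
$orig = $nic.NetworkName
Set-NetworkAdapter -NetworkAdapter $nic -NetworkName 'TempPG' -Confirm:$false
Set-NetworkAdapter -NetworkAdapter $nic -NetworkName $orig -Connected:$true -Confirm:$false
```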
- We are also having a hell of a time with the VMware Tools under 5.5, where it just deletes the VMXNET3 driver and quits, leaving our Windows VMs without a NIC. We have a support case open on that one, moving incredibly slowly. I doubt that’s a problem with the vCSA, just the typical craptastic Tools installer, probably not checking return codes or handling error conditions well (not that I’m bitter, every hour my team spends dealing with Windows Tools issues is an hour out of our lives we never get back).
- Lastly, there’s a bug with the vCSA that causes the console functionality in the web client to fail, citing “Could not connect to x.y.z.a:7331.” There is a nice KB article on how to fix it that my team found, and I’m hoping that it’s fixed permanently in 5.5 Update 1. I know bugs happen, but I don’t like it when we customize appliances.
So, in conclusion, I hope this helps someone. Feel free to use the comments here to add upgrade tips if you’ve been through it or correct glaring mistakes I’ve made in writing this up. That’s why I like you folks. :)