vCenter 6.5b Resets Root Password Expiration Settings

I’m starting to update all my 6.x vCenters and vROPS, pending patches being released. You should be doing this, too, since they’re vulnerable to the Apache Struts 2 critical security holes. One thing I noted in my testing is that after patching the 6.5 appliances, their root password expiration settings go back to the defaults. In this case I’d set them to not expire, but it’s clearly not that way anymore:

Depending on your security requirements this might not be what you want. It’s bad form on VMware’s part, changing something that had been explicitly set. I also didn’t test to see if it resets the actual password age, or just the expiry. You might have far less than 365 days before it expires.
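If you want to know how long you actually have, the numbers are in /etc/shadow. Here is a minimal sketch, assuming the appliance keeps a standard shadow file (field 3 is the day of the last password change, counted from 1970-01-01, and field 5 is the maximum age in days). The function name and the sample entry are made up for illustration; `chage -l root` reports the same information where it is available.

```shell
#!/bin/sh
# Sketch: compute the days left before a password expires, from one
# /etc/shadow line. days_left is a hypothetical helper, not a VMware tool.

# days_left SHADOW_LINE TODAY_IN_DAYS
# Prints the number of days remaining, or "never" when no maximum age is set.
days_left() {
    last_change=$(printf '%s\n' "$1" | cut -d: -f3)
    max_age=$(printf '%s\n' "$1" | cut -d: -f5)
    # An empty field or the conventional 99999 means the password does not expire.
    if [ -z "$max_age" ] || [ "$max_age" -ge 99999 ]; then
        echo "never"
        return 0
    fi
    echo $(( last_change + max_age - $2 ))
}

# Made-up entry: password last changed on day 17200, maximum age back at
# 365 days, and "today" is day 17250.
days_left 'root:x:17200:0:365:7:::' 17250
```

On a live system the call would look something like `days_left "$(grep '^root:' /etc/shadow)" "$(( $(date +%s) / 86400 ))"`, run as root.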

While it’s a good idea to rotate passwords, I also hate being locked out of my infrastructure, especially since I usually discover it in the middle of another problem… But to each their own. Good luck!

How Not To Quit Your Job

I’ve thought a lot lately about Michael Thomas, a moron who caused criminal amounts of damage to his former employer in the process of quitting. From The Register[0]:

As well as deleting ClickMotive’s backups and notification systems for network problems, he cut off people’s VPN access and “tinkered” with the Texas company’s email servers. He deleted internal wiki pages, and removed contact details for the organization’s outside tech support, leaving the automotive software developer scrambling.

The real-life BOFH then left his keys, laptop, and entry badge behind with a letter of resignation and an offer to stay on as a consultant.

More than a decade ago I did some consulting for a company that had this happen. They fired their sysadmin and he basically ransomed them, logging in through dozens of back doors to disrupt their business. My first call was to the local police department. This was before these types of crimes were very prevalent; we were lucky that the larger Californian city where these crimes took place had a detective with an idea of what to do. Let me tell you: hiring the guy back was never on the list (though pretending to, and meeting up with the guy to grab him, was what the FBI wanted to do). If you do this to someone and they invite you back in to talk or rehire you, and you go, you deserve everything you get because you’re dumb.

Whistleblowing aside, if you’re playing Michael Thomas in a story like this there is absolutely nothing you can say to law enforcement to keep them from throwing you in jail. Think about it. On one side you have a business with a demonstrable material loss because of your actions. On the other side, you’re saying “BUT THEY WERE MEAN TO ME.” And unlike my story above, set in the early ‘oughts, there are actually laws and law enforcement professionals now that will bust your ass and make the charges stick. The process will be years long, too. Mr. Thomas pulled his stunt in 2011, and they finally got around to convicting him. Do you really want to waste that much of your life, with something like that hanging over your head that’ll ultimately destroy your life and career, because of something that felt good for a few minutes?[1]

Beyond all of that, what bugs me the most is how many ways this guy could have screwed with them and gotten away with it. I’m bothered for two reasons:

1. It speaks to how much trust we place in system administrators, and how system administrators need impeccable ethics as well as good judgement. We can implement all the security in the world and, usually, it still comes down to needing to trust a person. Hiring the right people is SO important.

2. It also bothers me because the guy was JUST. SO. DUMB. In a couple minutes over lunch some colleagues and I had ten different, solid, ideas for ways to screw with someone’s systems, mostly based in real-life experience with well-meaning dumbasses. Some highlights were: change the netmasks in their DHCP pools to non-standard ones (e.g. 254.192.138.0) so it’s pretty random what works and what doesn’t, any manner of trickery with scheduled tasks/at/cron, off-hours system shutdowns that look like scripting errors, and redefining localhost (we just had this happen in our Active Directory with someone trying to join an Ubuntu host… OMFG). Extra points if it all just looks like errors, or makes them think you’re an idiot if & when they find the problem. Though in smaller communities that may backfire — people do talk to one another.

Interestingly enough, though, nothing any of us suggested was inherently destructive, just annoying. And when it comes down to it, none of us would actually do any of it, choosing instead to drink a beer and move on with our lives. That, perhaps, is the biggest lesson in the Michael Thomas story. As cathartic as it may be to stick it to the man, if you don’t like your job it’s always a better choice to just simply find a different one and politely move on.

 


[0] “I was authorized to trash my employer’s network, sysadmin tells court” – The Register, 23 Feb 2017

[1] Get your mind out of the gutter, kids are great.

Standards, to and with Resolve

“You can have any color as long as it’s black” – Henry Ford (Image (C) Michael LoCascio, via Wikimedia Commons)

As the holiday season has progressed I’ve spent a bunch of time in the car, traveling three hours at a crack to see friends and family in various parts of Midwestern USA. Much of that travel has been alone, my family having decided to ensconce themselves with my in-laws for the full duration of the week. That has left me ample time to sing aloud in the car, take unplanned detours to collect growlers of beer from esteemed breweries, and to think.

I don’t do New Year’s resolutions. I’m not against them, per se, but I just think they’re too conveniently abandoned. I like the noun form of “resolve” better — a firm determination to do something. I aspire to have resolve, whether I am deciding firmly on a course of action, or settling or finding a solution to a problem, dispute, or contentious matter.

So to what issue should I bring my resolve to bear? What is it that I want to work on in 2017?

As I thought about this, I always crept back to the idea that IT just isn’t the game I signed up for a few decades ago. It seems a lot less technical, at least at the infrastructure level. A lot of the new infrastructure, whether it’s on site or in the cloud, is just simpler. Storage is getting simpler because SSDs are now cheaper than rotational media. Hyperconverged infrastructure has removed a number of pain points as well, including things like discrete SANs. Compute is getting ridiculously dense. What was possible in a 4U server is now possible in essentially a half rack unit (something like a Dell FX2).

With all that, a lot of the crap we’ve dealt with over the years just evaporates.

So what do I work on? What’s the biggest, most fundamental problem around, lying at the core of everything?

Standards.

That’s it. Standards. Without standards you cannot automate, and cannot remove many of the remaining problems at the infrastructure level. Without standards there are bad assumptions, and the inevitable human error and downtime that follow. The foundation of a modern IT operation is standards.

As it turns out, standards aren’t a technical problem, either. The way I see it, they’re usually a financial problem, insofar as someone didn’t budget enough money to do something the way everybody else does, and now it needs to work. Or perhaps it’s a difference of opinion, or a technical requirement that is incompatible with things. Maybe a time constraint. Or a workflow problem, where the workflow should have included IT but didn’t until it was too late. Regardless, though, I see standards as the foundation of IT moving forward, transcending clouds, containers, applications, networking, everything.

So that’s what I’m going to work on: finding a way to enable deep automation and staff time savings with standardization, without unduly limiting projects or adding financial burdens. I urge you to do the same with the copious free time you now have because of flash disk and hyperconvergence.

:)

esxupdate Error Code 99

So I’ve got a VMware ESXi 6.0 host that’s been causing me pain lately. It had some storage issues, and now it won’t let VMware Update Manager scan it, throwing the error:

The host returns esxupdate error code:99. An unhandled exception was encountered. Check the Update Manager log files and esxupdate log files for more details.

A little Google action later and it’s clear there isn’t a lot of documentation, recent or otherwise, about this out there. People suggest rebuilding Update Manager, or copying files from other hosts to repair them. The VMware KB has documentation of the particular error but only in context of the Cisco Nexus 1000V, and only for ESXi 5.0 and 5.1. Here’s another thought, if you’re in my same situation.

1. First, do what it says: check esxupdate.log. Log into the console of the ESXi host (SSH or otherwise) and “tail -f /var/log/esxupdate.log”

2. Scan the host with Update Manager so that the log has fresh data in it. You should see it pop up. In my case it showed:

2016-05-27T15:54:52Z esxupdate: esxupdate: ERROR: An unexpected exception was caught:
 2016-05-27T15:54:52Z esxupdate: esxupdate: ERROR: Traceback (most recent call last):
 2016-05-27T15:54:52Z esxupdate: esxupdate: ERROR: File "/usr/sbin/esxupdate", line 238, in main
 2016-05-27T15:54:52Z esxupdate: esxupdate: ERROR: cmd.Run()
 2016-05-27T15:54:52Z esxupdate: esxupdate: ERROR: File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esx5update/Cmdline.py", line 113, in Run
 2016-05-27T15:54:52Z esxupdate: esxupdate: ERROR: File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esx5update/MetadataScanner.py", line 244, in Scan
 2016-05-27T15:54:52Z esxupdate: esxupdate: ERROR: File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esx5update/MetadataScanner.py", line 106, in _generateOperationData
 2016-05-27T15:54:52Z esxupdate: esxupdate: ERROR: File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esx5update/MetadataScanner.py", line 89, in _getInstallProfile
 2016-05-27T15:54:52Z esxupdate: esxupdate: ERROR: File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esximage/ImageProfile.py", line 627, in ScanVibs
 2016-05-27T15:54:52Z esxupdate: esxupdate: ERROR: File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esximage/VibCollection.py", line 62, in __add__
 2016-05-27T15:54:52Z esxupdate: esxupdate: ERROR: File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esximage/VibCollection.py", line 79, in AddVib
 2016-05-27T15:54:52Z esxupdate: esxupdate: ERROR: File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esximage/Vib.py", line 627, in MergeVib
 2016-05-27T15:54:52Z esxupdate: esxupdate: ERROR: ValueError: Cannot merge VIBs Dell_bootbank_OpenManage_8.3.0.ESXi600-0000, Dell_bootbank_OpenManage_8.3.0.ESXi600-0000 with unequal payloads attributes: ([OpenManage: 7807.439 KB], [OpenManage: 7809.081 KB])
 2016-05-27T15:54:52Z esxupdate: esxupdate: DEBUG: <<<

Ctrl-C will end the “tail” command.

3. It looks like something about the OpenManage VIB became corrupt during the storage issues, and now it thinks there are two copies with different payload sizes. You know what? I can just remove this VIB and reinstall it (rather than having to rebuild the host or do some other complicated fixes). I issue an “esxcli software vib list | grep -i dell” command to find the name of the VIB:

[root@GOAT:/var/log] esxcli software vib list | grep -i dell
OpenManage 8.3.0.ESXi600-0000 Dell PartnerSupported 2016-05-04 
iSM        2.3.0.ESXi600-0000 Dell PartnerSupported 2016-05-04

4. Then we need a simple “esxcli software vib remove --vibname=OpenManage”

[root@GOAT:/var/log] esxcli software vib remove --vibname=OpenManage
Removal Result
 Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
 Reboot Required: true
 VIBs Installed: 
 VIBs Removed: Dell_bootbank_OpenManage_8.3.0.ESXi600-0000
 VIBs Skipped:

5. Do what it says and reboot, then scan to see if it works. In my case it did, then I reinstalled the missing extension, and patched to the latest version like normal.
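If the traceback is buried in a busy log, the offending VIB can be pulled out with sed rather than by eye. This is only a sketch: conflicting_vibs is a name I made up, and it assumes the error line has the same "Cannot merge VIBs" wording shown in the log above.

```shell
#!/bin/sh
# Sketch: list the VIB IDs named in "Cannot merge VIBs" errors.
# Reads esxupdate.log content on stdin and prints each ID once.
conflicting_vibs() {
    sed -n 's/.*Cannot merge VIBs \([^,]*\), .*/\1/p' | sort -u
}

# The error line from the log above, used as sample input.
sample='2016-05-27T15:54:52Z esxupdate: esxupdate: ERROR: ValueError: Cannot merge VIBs Dell_bootbank_OpenManage_8.3.0.ESXi600-0000, Dell_bootbank_OpenManage_8.3.0.ESXi600-0000 with unequal payloads attributes: ([OpenManage: 7807.439 KB], [OpenManage: 7809.081 KB])'

printf '%s\n' "$sample" | conflicting_vibs
```

Against a real file it would be `conflicting_vibs < /var/log/esxupdate.log`. The ID it prints is the long form; the short name to hand to “esxcli software vib remove” still comes from the “esxcli software vib list” output.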

Use Microsoft Excel For Your Text Manipulation Needs

I’m just going to lay it out there: sysadmins should use Microsoft Excel more.

I probably will be labeled a traitor and a heathen for this post. It’s okay, I have years of practice having blasphemous opinions on various IT religious beliefs. Do I know how to use the UNIX text tools like sed, awk, xargs, find, cut, and so on? Yes. Do I know how to use regular expressions? Yes. Do I know how to use Perl and Python to manipulate text, and do poor-man’s extract-transform-load sorts of things? Absolutely.

It’s just that I rarely need such complicated tools in my daily work. I often just have a short list of something that I need to turn into a bunch of one-off commands. And many times I’m sharing it with others of varying proficiency, so readability is key. As it turns out, Excel has some very worthwhile text manipulation. Couple that with the ability to import CSV and autofill, and it’s a pretty decent solution. Let me give you some examples.

First, we need some text to manipulate. In cells A1 through D1 we have Goats, Sheep, Clowns, and Fire. Some people have Alice & Bob, I have goats & sheep.

Excel Text Example

First, we can concatenate strings very easily in Excel, as well as insert new strings. This is very handy for building commands you can then paste into a CLI, especially for doing one-off sorts of things. We do this with the ampersand, ‘&’.

=C1&" eat "&B1&" that are on "&D1

="puppet cert sign "&A1&".domain.com"

Oh, you’re doing something that needs the text in all upper- or lower-case? No problem. We have UPPER() and LOWER() functions. Suck it, /usr/bin/tr.

=UPPER(C1)&" eat "&LOWER(B1)&" that are on "&UPPER(D1)

Maybe we have a list and we need the first or last few characters from each. There’s LEFT() and RIGHT(), which will return a certain number of characters from those sides of the string.

=LEFT(A1,2)

=RIGHT(C1,4)

Perhaps you have a list of domain names, and want to grab the first part. We can use FIND() with LEFT() and RIGHT(). FIND() returns the position of the dot, so the first formula keeps the trailing dot and the second subtracts 1 to drop it.

=LEFT(A17,FIND(".",A17))

=LEFT(A17,FIND(".",A17)-1)

Maybe we need to do some autofilling, perhaps for a quick way to take some snapshots through VMware’s PowerCLI. I had the list on the left, then incorporated it into a larger command, dragging down to autofill all the names. Copy & paste that into a PowerCLI window and you’re set. Ad-hoc PowerCLI commands on small lists are actually my #1 use case.

="New-Snapshot -Name Pre-Patch -VM "&A30&" -Confirm:$false"
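For comparison, here is roughly what the same list-to-commands step looks like when the names already live in a text file on a UNIX host. It is a sketch, not a replacement for the spreadsheet: snapshot_commands and the VM names are invented for the example, and it only prints the PowerCLI lines for you to paste, exactly as the Excel version does.

```shell
#!/bin/sh
# Sketch: turn a list of VM names (one per line on stdin) into PowerCLI
# New-Snapshot commands. Blank lines are skipped.
snapshot_commands() {
    while IFS= read -r vm; do
        [ -n "$vm" ] || continue
        # Single quotes keep $false literal so PowerCLI sees it unchanged.
        printf 'New-Snapshot -Name Pre-Patch -VM %s -Confirm:$false\n' "$vm"
    done
}

# Hypothetical VM names.
printf 'goat01\nsheep01\n' | snapshot_commands
```

The shell version is fine for one column. Once there are two or three columns to combine, or someone else has to read it, the spreadsheet tends to win on readability.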

Autofill automatically adjusts cell references, too, so if you specified A1 and dragged down it’ll use A2, A3, A4, and so on. If that’s not what you want you can preface parts of the reference with a dollar sign, ‘$’, to make it a static reference. I made it completely static with $A$1, but you can do $A1 or A$1, too.

=A30&"="&$A$1&".domain.com"

Excel knows how to autofill just about anything ending in a number or a letter sequence. If it doesn’t catch on with one, try selecting two cells, then filling down. And if it really doesn’t catch on just insert a new column, autofill there, then concatenate that column with your others. In a pinch I’ve built BIND DNS zone files in Excel this way.

I think you get the idea. There’s a good reference in the Excel help, too – hit F1 and then search for “text functions.” The “Text Functions (reference)” result will show more commands, like LEN() for string length, MID() for getting substrings from the middle of a cell, SUBSTITUTE() for replacing text, and so on.

Next time you are tempted to assemble a list of commands by hand save yourself time, keystrokes, and potential errors by doing it in Excel instead!

Here’s my sample workbook, too, if you want to look at these examples yourself. Have fun!

Big Trouble in Little Changes

I was making a few changes today when I ran across this snippet of code. It bothers me.

/bin/mkdir /var/lib/docker
/bin/mount /dev/Volume00/docker_lv /var/lib/docker
echo "/dev/Volume00/docker_lv /var/lib/docker ext4 defaults 1 2" >> /etc/fstab

“Why does it bother you, Bob?” you might ask. “They’re just mounting a filesystem.”

My problem is that any change that affects booting is high risk, because fixing startup problems is a real pain. And until the system reboots the person who executes this won’t know that it works. If it doesn’t work it’ll stop during the boot, sitting there waiting for someone with a root password to come fix it. So you’ll have to get a console on the machine and dig up the root password. Then you need to type it in. If it’s anything like my root passwords it’s 20+ characters long and horrible to type, especially on crappy cloud console applets that tend to repeat characters because they’re written in Java by a high schooler on a reliable, near-zero latency network, twelve versions of Chrome ago.

Once you’re in you need to figure out what the problem is, and that’s an even bigger rub. It might be months or, God help you, years between when these commands run and when they get tested in a reboot. So there’s no correlation, and you’ll have no idea what the problem is aside from a filesystem issue. And all the while it’s burning up your maintenance window and your chance to do the maintenance you actually intended & scheduled, making you look bad.

But what if we just change it a little?

/bin/mkdir /var/lib/docker
echo "/dev/Volume00/docker_lv /var/lib/docker ext4 defaults 1 2" >> /etc/fstab
/bin/mount -a

Now, when it runs it’ll actually test the entry in /etc/fstab, and you’ll know right away if it’s wrong.
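One more wrinkle with the echo-and-append approach: run it twice and you get two entries. Below is a minimal sketch of a guard, assuming the usual six-field fstab layout. add_fstab_entry is a hypothetical helper of mine, and the demo writes to a scratch file rather than the real /etc/fstab.

```shell
#!/bin/sh
# Sketch: append an fstab entry only if it is well formed and its mount
# point is not already listed.

# add_fstab_entry FSTAB_FILE "device mountpoint fstype options dump pass"
add_fstab_entry() {
    fstab=$1
    entry=$2
    # Split the entry into words; an fstab line has six fields.
    set -- $entry
    if [ "$#" -ne 6 ]; then
        echo "refusing malformed entry: $entry" >&2
        return 1
    fi
    mountpoint=$2
    # Look for an uncommented line that already uses this mount point.
    if awk -v mp="$mountpoint" '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' "$fstab"; then
        echo "already present: $mountpoint"
        return 0
    fi
    printf '%s\n' "$entry" >> "$fstab"
    echo "added: $mountpoint"
}

# Demo against a scratch file; the second call is a no-op.
demo=$(mktemp)
add_fstab_entry "$demo" "/dev/Volume00/docker_lv /var/lib/docker ext4 defaults 1 2"
add_fstab_entry "$demo" "/dev/Volume00/docker_lv /var/lib/docker ext4 defaults 1 2"
```

On a real host you would point it at /etc/fstab and still follow it with “mount -a”, since the guard only checks the shape of the line, not whether the device actually mounts.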

Slick, eh?

Are you properly assessing the risk of your changes? Anything that affects booting is high risk, in my opinion. Rebooting properly is the foundation of good patching practices, disaster recovery, automated deployments, and so on.

How do you know the change you’re making actually works? Not just because it worked on a test system, either. How do you know, without a doubt, that it works on each machine you changed?

Configuration management tools help immensely, too, but there’s no substitute for thinking critically about the change you’re making, big or seemingly small.

Interesting Dell iDRAC Tricks

Deploying a bunch of machines all at once? Know your way around for loops in shell scripts, or Excel enough to do some basic text functions & autofill? You, too, can set up a few hundred servers in one shot. Here are some interesting things I’ve done in the recent past using the Dell iDRAC out-of-band hardware management controllers.

You need to install the racadm utility on your Windows or Linux host. I’ll leave this up to you, but you probably want to look in the Dell Downloads for your server, under “Systems Management.” I recently found it as “Dell OpenManage DRAC Tools, includes Racadm” in 32- and 64-bit flavors.

Basic Command

The basic racadm command I’ll represent with $racadm from now on is:

racadm -r hostname.or.ip.com -u iDRACuser -p password
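Since the whole point is doing this to a few hundred machines, here is a sketch of the loop, written as a dry run: it prints one racadm line per host instead of executing anything, so you can read the result before piping it to a shell. racadm_for_hosts and the example.com host names are made up; the racadm options are the ones shown above.

```shell
#!/bin/sh
# Sketch: expand the basic racadm command across a list of iDRACs.
# Host names arrive one per line on stdin; blank lines are skipped.

# racadm_for_hosts USER PASSWORD SUBCOMMAND [ARGS...]
racadm_for_hosts() {
    user=$1
    pass=$2
    shift 2
    while IFS= read -r host; do
        [ -n "$host" ] || continue
        # Print only; nothing is sent to the iDRAC here.
        echo "racadm -r $host -u $user -p $pass $*"
    done
}

# Hypothetical iDRAC names and the factory-default password.
printf 'idrac-01.example.com\nidrac-02.example.com\n' | racadm_for_hosts root calvin racreset
```

A password with spaces or shell metacharacters will not survive being printed bare like this, so quote it before feeding the output to a shell.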

Set a New Root Password

I don’t know how many times I see people with iDRACs on a network and the root password is still ‘calvin.’ If you do nothing else change that crap right away:

$racadm set iDRAC.Users.2.Password newpassword

The number ‘2’ indicates the user ID on the iDRAC. The root user is 2 by default.

If you have special characters in your password, and you should, you may need to escape them or put them in single quotes. You will want to test this on an iDRAC that has another admin user on it, or where you have console access or access through a blade chassis, for when you screw up the root password and lock yourself out. Not that I’ve ever done this, not even in the course of writing this post. Nope, not admitting anything.
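For the quoting itself, the safest general trick I know is to wrap the value in single quotes and rewrite any embedded single quote as '\''. Here is a small sketch; shell_quote is a hypothetical helper name and the password is obviously invented.

```shell
#!/bin/sh
# Sketch: wrap a string in single quotes so the shell passes it through
# untouched. Each embedded single quote becomes '\'' (close the quote, add
# an escaped quote, reopen the quote).
shell_quote() {
    printf "'%s'\n" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Invented password with a dollar sign, a bang, and a single quote in it.
shell_quote "p@ss'w\$rd!"
```

The output can be pasted after “-p” or after “set iDRAC.Users.2.Password” as-is. That only covers the shell on your side; whether the iDRAC accepts every character is a separate question, which is why testing on a box with a second admin account is still the right move.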

Dump & Restore Machine Configurations

Once upon a time I embarked on a quest to configure a server solely with racadm ‘set’ commands. Want to know a secret? That was a complete waste of a few hours of my life. What I do now is take one server and run through all the BIOS, PERC, and iDRAC settings via the console and/or the web interface, then dump the configuration with a command:

$racadm get -t xml -f idrac-r730xd.xml

That’ll generate an XML file of all the settings, which you can then load back into the other servers with:

$racadm set -t xml -f idrac-r730xd.xml -b graceful -w 600

This tells it to gracefully shut the OS down, if there is one, before rebooting to reload the configurations. It also says to wait 600 seconds for the job to complete. The default is 300 seconds but with an OS shutdown, long reboot, memory check, etc. it gets tight. There are other reboot options; check out the help via:

$racadm help set

You can also edit the XML file to remove parts that you don’t want, such as when you want to preconfigure a new model of server with common iDRAC settings but do the BIOS & RAID configs on your own. That XML file will also give you clues to all the relevant configuration options, too, which you can then use via the normal iDRAC ‘get’ and ‘set’ methods.

Upload New SSL Certificates

I like knowing that the SSL certificates on my equipment aren’t the defaults (and I get tired of all the warnings). With access to a certificate authority you can issue some valid certs for your infrastructure. However, I don’t want to manage SSL certificates for hundreds of servers. Where I can I’ll get a wildcard certificate, or if that’s expensive or difficult I’ll abuse the Subject Alternative Name (SAN) features of SSL certificates to generate one with all my iDRAC names in it. Then I can upload new keys and certificates, and reset the iDRAC to make it effective:

$racadm sslkeyupload -t 1 -f idrac.key
$racadm sslcertupload -t 1 -f idrac.cer
$racadm racreset

Ta-dum, green valid certificates for a few years with only a bit of work. If you don’t have your own CA it’s probably worth creating one. You can load the CA certificate as a trusted root into your desktop OS and make the warnings go away, and you know that your SSL certs aren’t the vendor defaults. What’s the point of crypto when everybody has the same key as you?
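Typing a few hundred names into a certificate request by hand is its own kind of misery, so here is a sketch that builds the SAN line from a host list. san_line and the example.com names are made up; the output uses the "subjectAltName = DNS:..." form that OpenSSL configuration files accept in an extensions section, and where exactly it goes depends on how your CA takes requests.

```shell
#!/bin/sh
# Sketch: build a subjectAltName line from host names, one per line on stdin.
# Blank lines are ignored.
san_line() {
    awk 'NF { names = names (names ? "," : "") "DNS:" $1 }
         END { print "subjectAltName = " names }'
}

# Hypothetical iDRAC names.
printf 'idrac-01.example.com\nidrac-02.example.com\n' | san_line
```

Keep the host list in a file next to the key and you can regenerate the request whenever a rack gets added.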

There are lots of cool things you can do with the iDRAC, so if you’re doing something manually via the console or iDRAC web interface you might think about looking it up in the Dell iDRAC RACADM Command Line Reference first.

10 Years

Ten years ago I wrote the first post on this blog. 3:43 AM. I’m a late night kinda guy, I guess. Actually, I probably came home from a bar, installed WordPress 1.5.1, and started writing.

Ten years seems like an awfully long time ago. So much has changed in my life. I like my job, most days. That wasn’t true back then. That’s part of why this started, as a way to vent. I have a wife and a kid now… almost two kids, just a couple days more until it is man-to-man coverage around Chez Plankers.

I’ve been a little burnt out lately, with work and kids and life, and slacked off on writing in almost every way. As such, it’s been interesting to look back at some of my first posts here. Ugh. I wonder if, in ten years, my recent posts will be as irrelevant as those early posts are now. They’re not bad, per se, but I hadn’t found focus yet. There are even recipes back in the archives. Hell, I made the panang the other night. And to this day my #1 post is the one where I show how to reassemble a faucet aerator. No kidding. #2 is how to disable Teredo, 6to4, and whatnot under Windows.

I am definitely a better writer now, though. It is true about Carnegie Hall — you get there with practice.

I wasn’t part of the virtualization community, early on. My goal was to write about system administration, mostly. I’d been virtualizing things for a couple of years at that point, but it was only when I discovered that EMC wasn’t recommending that people align the partitions on their disks, and that there were serious negative performance implications there, that I started writing about VMware. We had Dell PowerEdge 6650s and EMC Clariion CX3s at the time, ESX 1.5, vMotion but nothing more. vMotion made us laugh the first time we set it up. I think we spent an hour moving things back & forth, in a shared area, and by the time my friend & coworker Rich and I were done we’d accumulated a lot of our coworkers around us, witnessing the beginning of the next phase of IT.

So I started writing about it, among other things. I owe two people thanks for support in those early years. John Troyer, who forged the next generation of vendor communities. He reached out to me early and encouraged me to write more and often. He used the term “bully pulpit” at least once with me, but in that I found balance and moderation. He may also have been the first one to tell me I was a good writer, in front of a lot of other people.

The other is Marc Farley, who surprised me once at an early Las Vegas VMworld by reaching out, inviting me to dinner, and drinking tequila with me. I had no idea what to think when he first made contact, but by the end of the night I had gained a sense of the possible community and friendships. Also, tequila, which would repeat itself a few times here and there. Not nearly enough, though, mostly due to proximity.

Thank you guys.

There are so many more out there who encourage me, who have encouraged me, who give me hope and inspiration and remind me there’s a point to this stuff. People I’ve enjoyed times with over the years, people I’m happy to call friends, even if we don’t see each other all that much anymore. Damian Karlson and an intoxicated evening in the Venetian. Frank Denneman and Duncan Epping and late night hot dogs in Copenhagen. Ed Czerwin, Chris Dearden, and Christian Hobbel, the vSoup guys, for ongoing support and love. Jason Boche, Todd Scalzott, Chris Wahl, Drew Denson, and Rich Lingk, people I can smoke cigars with and talk about anything into the wee hours of the morning. Michael Keen, Stu Miniman, and Ganesh Padmanabhan, always up for a Moscow Mule. People I don’t even know how I know them anymore, who I love seeing, people like Julia Weatherby and Jay Weinshenker, Gina Minks, GS Khalsa, and Matt Vogt. Edward Haletky, Bernd Herzog, and all the TVP crew past & present. Stephen Foskett, Claire Chaplais, Tom Hollingsworth, Matt Simmons, Ben Freedman, all the TFD crew, and all the repeat offenders I meet at conferences, like Justin Warren, Howard Marks, Ethan Banks, Greg Ferro, Alastair Cooke, Keith Townsend, John Obeto, Curtis Preston, and more. The TechTarget folks, Nick Martin, Alex Barrett, Colin Steele, and Lauren Horwitz, who have taken my writing to the next level. And of course all the folks with vendors that keep good track of me, and allow me to see some of these people from time to time. Doug Hazelman, Sarah Vela, Jason Collier, Rick Vanover, Melanie Boyer, Eric Nielsen, and more.

It’s late and I’ve forgotten people in this list. People who are important. I’m sorry, and I’m thankful. Thank you to everybody who still works for and in this community of bloggers. Thank you to everybody who has encouraged me. Thank you to everybody who reads my writings. Thank you, all.

Three Thoughts on the Nutanix & StorageReview Situation

Photo courtesy of National Nuclear Security Administration / Nevada Field Office.

I’ve watched the recent dustup between VMware and Nutanix carefully. It’s very instructive to watch how companies war with each other in public, and as a potential customer in the hyperconverged market it’s nice to see companies go through a public opinion shakedown. Certainly both VMware and Nutanix tell stories that seem too good to be true about their technology.

On the VMware side VSAN is new-ish, and VMware doesn’t have the greatest track record for stability in new tech, though vSphere 6 seems to be a major improvement. On the Nutanix side I have always had a guarded opinion of technologies that introduce complexity and dependency loops, especially where storage systems are competing with workloads for resources. I’ve argued the point with Nutanix on several occasions, and their answer has been essentially “well, we sell a lot of them.” I had no real data either way, so it was hard to argue.

As such, you can imagine that I found the StorageReview post on why they cannot review a Nutanix cluster very interesting (link below). I have a lot of respect for Brian and Kevin at StorageReview. Not only are they nice guys, they do a lot of good work supplying useful performance data to customers. They use testing methods designed to reflect real world situations. Not all of us have data centers full of idyllic cloud-ready apps that do 100% read I/O on 512 byte blocks. In fact, most of us in the real world have apps that are haphazardly smashed together by companies like Oracle or Infor, sold to CIOs with lies, kickbacks, and hookers. These abominations are often performance nightmares to start with, and if they’re designed at all it’s for copious professional services and collusion with hardware vendors. I need infrastructure that can run them well (or at least less poorly), and I appreciate a good review with good testing methodologies.

There are a lot of opinions about this article. Here are three of mine.

It Should Have Never Gone This Far

Some industry & vendor folks think that it’s irresponsible to have posted this. I empathize with them. Nobody likes the idea of someone publishing an article like this on their watch, especially during the middle of a nasty war with a huge competitor. StorageReview just armed all the competitors with fresh dirt to throw, and it’s bad.

However, it should have never gone this far. Six months is ample time to fix the situation or work something out in good faith. There are lots of ways to explain performance issues. All systems have tradeoffs, and perhaps the Nutanix OS trades performance for OpEx savings. Perhaps most customers don’t need that level of performance, and the system wasn’t designed for it. Whatever. Anything sounds better than what seems to have happened.

If there are problems, and it seems like there are some big ones, own them and fix them. If you need to know how to do this call someone at Jeep. Between the 2012 “Moose Test” failures (links below) and the recent hacks they’ve had a lot of experience acknowledging a problem, owning it, and fixing it.

Covering Something Up Makes People More Curious About It

Have you ever watched or read Tom Clancy’s “Clear and Present Danger?” In it, the main character, Jack Ryan, advises the US President to not dodge a question about a friend who was revealed as a drug smuggler:

“If a reporter asked if you and Hardin were friends, I’d say, ‘No, we’re good friends.’ If they asked if you were good friends, I’d say, ‘No, no, we’re lifelong friends.’ I would give them no place to go… There’s no sense defusing a bomb after it’s already gone off.”

Why can’t I run a standard benchmark like VMmark on a Nutanix cluster? Why can’t people share performance results? If I bought one of these would I be able to talk about my performance? Why is Nutanix uncomfortable with performance results? Why do they ship underpowered VSAN configurations for comparison to Nutanix clusters? Why do they insist on synthetic workloads? If I buy one of these systems and it doesn’t perform can I return it? What happens if I have performance problems after an upgrade? Can I downgrade? What will it cost to buy a reasonable test system so I can vet all changes on these systems?

We all have a lot of questions now, and that isn’t particularly good for Nutanix or their partner Dell. Great for VMware, great for SimpliVity, great for Scale Computing, though.

This Isn’t About Performance, It’s About Support

For me, this whole issue isn’t about performance. It’s about support. It’s about knowing that when I have a problem someone will help me fix it. If a reviewer who was intentionally shipped a system for review cannot get support for that system when they have issues, what are the chances I will be able to when I have issues? I already anticipate that, given the fighting, VMware won’t support me well or at all on a Nutanix system. Now I have doubts that Nutanix will be able to make up the difference. Doubly so if I bought an XC unit from Dell.

If you’re in the market for a hyperconverged system you have a lot of new questions to ask. Remember that vendors will tell you anything to get you to buy their goods and services. Insist on a try & buy with specific performance goals. Insist on a bake-off between your top two choices. Ask for industry-standard benchmark numbers. Stick to your guns.

Leave your comments below — I’m interested in what people think.

Links:

When Should I Upgrade to VMware vSphere 6?

I’ve been asked a few times about when I’m planning to upgrade to VMware vSphere 6.

Truth is, I don’t know. A Magic 8 Ball would say “reply hazy, try again.”

Some people say that you should wait until the first major update, like the first update pack or first service pack. I’ve always thought that approach is crap. Software is a rolling collection of bugs. Some are old, some are new, and while vendors try to make the number of bugs go down the truth is that isn’t the case all the time. Especially with large releases, like service packs. The real bug fixing gains are, to borrow a baseball term, in the “small ball” between the big plays. The way I see it, the most stable product is the version right before the big service pack.

Some people say that because 6.0 ends in .0 they’ll never run that code. “Dot-oh code is always horrible,” they volunteer. My best theory is that these people have some sort of PTSD from a .0, coupled with some form of cult-like shared delusion. A delusion like “nobody gets fired for buying IBM” or “we’ll be whisked away on the approaching comet.” Personally, I should get twitchy when I think about versions ending in .1. The upgrade to vSphere 5.1 was one of the most horrific I had. Actually, speaking of IBM, it seems to me that I filed a fair number of bugs against AIX 5.1, too, back in the day. Somehow I still can sleep at night.

Thing is, a version number is just a name, often chosen more for its marketing value than its basis in software development reality. It could have been vSphere 5.8, but some products were already 5.8. It could have been vSphere 5.9, but that’s real close to 6.0. Six is a nice round number, easy to rebase a ton of products to and call them all Six Dot Oh. Hell, AIX never had a real 5.0, either, except internally in IBM as an Itanium prototype. To the masses they went from 4.3.3 to 5.1. Oh, and IBM’s version number was 5.1.0. OH MY GOD A DOT-ONE AND A DOT-OH. Microsoft is skipping Windows 9, not because Windows 10 is so epically awesome, but because string comparisons on “Windows 9*” will match “Windows 95,” too. And a lot of version numbers get picked just because they’re bigger than the competitors’ versions.
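If you want to see the “Windows 9*” problem for yourself, here’s a minimal sketch. It assumes some legacy code sniffs the OS name with a prefix check; the function name `is_win9x` is hypothetical and not taken from any real codebase.

```python
def is_win9x(os_name: str) -> bool:
    """Naive legacy check: treat any name starting with 'Windows 9' as 95/98."""
    return os_name.startswith("Windows 9")


# A product literally named "Windows 9" would be lumped in with 95 and 98,
# while "Windows 10" falls outside the prefix entirely.
for name in ("Windows 95", "Windows 98", "Windows 9", "Windows 10"):
    print(f"{name!r}: {is_win9x(name)}")
```

The name is being used as data here, which is exactly why it tells you so little about the software underneath.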

Given all this it seems pretty stupid to put much stock in a version number. To me, it’s there to tell us where this release fits in the sequence of time. 6.0 was before 7.0 and after 5.5.
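Put another way, about the only thing you can reliably compute from a version number is ordering. A quick sketch, assuming plain dotted numeric versions; `parse_version` is a hypothetical helper written for this example, not part of any VMware tooling.

```python
def parse_version(version: str) -> tuple:
    """Turn '5.5' into (5, 5) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


releases = ["6.0", "5.5", "7.0", "5.1"]

# Sorting places each release in sequence. It says nothing about which bugs
# are fixed, which regressions exist, or how stable any of them are.
for version in sorted(releases, key=parse_version):
    print(version)
```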

Oh, but what about build numbers? I’ve had people suggest that. Sounds good, until you realize that the build numbers started back years ago when the codebase was forked for the new version. And, like the version number, it means almost nothing. It doesn’t tell you what bugs are fixed. It doesn’t tell you if there are regressions where 6.0 still has a bug that was fixed in 5.5, or the other way around where 5.5 still has a bug that 6.0 doesn’t because of the rework done to fix something else. Build numbers tell you where you are in time for a particular version, and roughly how many times someone (or something) recompiled the software. That’s it.

Some people say “don’t upgrade, what features does it have that you really need?” Heard this today on Twitter, and it’ll likely end up as a harsh comment on this post. Sure, maybe vSphere 6 doesn’t have any features I really need. But sometimes new versions have features that I want, like a much-improved version of that goddamned web client. Or automated certificate management. The manual process makes me think suicidal thoughts. Or cross-vCenter vMotion, oh baby where have you been all my life. Truth is, every time I hear this sort of upgrade “advice” and ask the person what version of vSphere they’re running it’s something ancient, like 4.1. I suspect their idea of job security is all the busy work it takes to run an environment like that, not to mention flouting end-of-support deadlines. Count me out. I like meaningful work and taking advantage of improvements that make things better.

Some people say “upgrade your smallest environments first, so if you have problems it doesn’t impact very much.” Isn’t that the role of a test environment, though? Test is never like production, mostly because there’s never the same amount of load on it. And if you do manage to get the same load on it, it’s not variable & weird like real user load. Just never the same. And while I agree in principle that we should choose the first upgrades wisely, I always rephrase it to say “the least critical environments.” My smallest environments hold some of the most critical workloads I have. One of them is “things die & police are dispatched if there are problems” critical. I don’t think I’ll start there.

So where do I start? And how long will it take?

First, I’m doing a completely fresh install of vSphere 6.0 GA code in a test environment. I’m setting it up like I’d want it to be in production. Load-balanced Platform Services Controllers (PSCs). Fresh vCenters, the new linked mode (old linked mode was a hack, new linked mode isn’t even really quite linked mode, just a shared perception of the PSCs). A few nested ESXi hosts for now. I just want to check out the new features and test compatibility, gauge if it’s worth it.

Second, I’m going to wait for the hardware and software vendors in my ecosystem to catch up. Dell has certified the servers I’m running with ESXi 6.0. Dell, HDS, and NetApp have certified my storage arrays. But Veeam hasn’t released a version of Backup & Replication that supports 6.0 yet (soon, says Rick). Backups are important, after all, and I like Veeam because they actually do meaningful QA (I got a laugh from them once because I said I adore their radical & non-standard coding practices, like actually checking return codes). Beyond that, I’m going to need to test some of my code, scripts written to do billing that use the Perl SDK, PowerCLI scripts to manage forgotten snapshots, etc. I’m also going to need to test the redundancy. What happens when a patch comes along? What happens if we lose a PSC, or a vCenter, or something? Does HA work for vRealize Automation? Does AD authentication work? Can I restore a backup?
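One way to keep that ecosystem check honest is a simple go/no-go list. This is a rough sketch, with the statuses taken from the paragraph above. The `Dependency` class and `blockers` function are hypothetical names made up for illustration, not part of any VMware or vendor tool.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    name: str
    ready_for_6_0: bool  # certified by the vendor, or tested by me


def blockers(deps):
    """Return the names of dependencies that are not yet ready for 6.0."""
    return [d.name for d in deps if not d.ready_for_6_0]


ecosystem = [
    Dependency("Dell servers", True),
    Dependency("Dell, HDS, and NetApp storage arrays", True),
    Dependency("Veeam Backup & Replication", False),   # no 6.0 support yet
    Dependency("Perl SDK billing scripts", False),     # still to be tested
    Dependency("PowerCLI snapshot scripts", False),    # still to be tested
]

waiting_on = blockers(ecosystem)
print("GO" if not waiting_on else "NO-GO, waiting on: " + ", ".join(waiting_on))
```

Nothing clever, but a written list beats memory when the upgrade drags on for months.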

Third, I’m going to test actual upgrades. I’ll do this with a fresh 5.5 install, running against demo ESXi hosts with demo VMs, with the goal of having the upgraded environment look exactly like my fresh install. Load-balanced PSCs, linked mode, vRealize Operations, Replication, Veeam, Converter, Perl SDK, PowerCLI, everything. I’ll write it all down so I can repeat it.

Last, I’ll test it against a clone of my 5.5 VCSA, fenced off from the production networks. I’ll use the playbook I wrote from the last step, and change it as I run into issues.

Truth is, I’ll probably get through steps 1 and 2 by mid-May. But then it’ll drag out a bit. I expect upgrade problems, based on experience. I also know I’ve got some big high-priority projects coming, so my time will be limited for something like this. And it’ll be summer, so I’ll want to be in a canoe or on my motorcycle and not upgrading vSphere.

The one thing I do know, though, is that when I get to the production upgrade my path will be laid out by facts and experience, and not folk wisdom and the wives’ tales of IT.
