Three Thoughts on the Nutanix & StorageReview Situation

Photo courtesy of National Nuclear Security Administration / Nevada Field Office.

I’ve watched the recent dustup between VMware and Nutanix carefully. It’s very instructive to watch how companies war with each other in public, and as a potential customer in the hyperconverged market it’s nice to see companies go through a public opinion shakedown. Certainly both VMware and Nutanix tell stories that seem too good to be true about their technology.

On the VMware side VSAN is new-ish, and VMware doesn’t have the greatest track record for stability in new tech, though vSphere 6 seems to be a major improvement. On the Nutanix side I have always had a guarded opinion of technologies that introduce complexity and dependency loops, especially where storage systems are competing with workloads for resources. I’ve argued the point with Nutanix on several occasions, and their answer has been essentially “well, we sell a lot of them.” I had no real data either way, so it was hard to argue.

As such, you can imagine that I found the StorageReview post on why they cannot review a Nutanix cluster very interesting (link below). I have a lot of respect for Brian and Kevin at StorageReview. Not only are they nice guys, they do a lot of good work supplying useful performance data to customers. They use testing methods designed to reflect real world situations. Not all of us have data centers full of idyllic cloud-ready apps that do 100% read I/O on 512 byte blocks. In fact, most of us in the real world have apps that are haphazardly smashed together by companies like Oracle or Infor, sold to CIOs with lies, kickbacks, and hookers. These abominations are often performance nightmares to start with, and if they’re designed at all it’s for copious professional services and collusion with hardware vendors. I need infrastructure that can run them well (or at least less poorly), and I appreciate a good review with good testing methodologies.

There are a lot of opinions about this article. Here are three of mine.

It Should Have Never Gone This Far

Some industry & vendor folks think that it’s irresponsible to have posted this. I empathize with them. Nobody likes the idea of someone publishing an article like this on their watch, especially during the middle of a nasty war with a huge competitor. StorageReview just armed all the competitors with fresh dirt to throw, and it’s bad.

However, it should have never gone this far. Six months is ample time to fix the situation or work something out in good faith. There are lots of ways to explain performance issues. All systems have tradeoffs, and perhaps NOS, the Nutanix operating system, trades performance for OpEx savings. Perhaps most customers don’t need that level of performance, and the system wasn’t designed for it. Whatever. Anything sounds better than what seems to have happened.

If there are problems, and it seems like there are some big ones, own them and fix them. If you need to know how to do this call someone at Jeep. Between the 2012 “Moose Test” failures (links below) and the recent hacks they’ve had a lot of experience acknowledging a problem, owning it, and fixing it.

Covering Something Up Makes People More Curious About It

Have you ever watched or read Tom Clancy’s “Clear and Present Danger?” In it, the main character, Jack Ryan, advises the US President to not dodge a question about a friend who was revealed as a drug smuggler:

“If a reporter asked if you and Hardin were friends, I’d say, ‘No, we’re good friends.’ If they asked if you were good friends, I’d say, ‘No, no, we’re lifelong friends.’ I would give them no place to go… There’s no sense defusing a bomb after it’s already gone off.”

Why can’t I run a standard benchmark like VMmark on a Nutanix cluster? Why can’t people share performance results? If I bought one of these would I be able to talk about my performance? Why is Nutanix uncomfortable with performance results? Why do they ship underpowered VSAN configurations for comparison to Nutanix clusters? Why do they insist on synthetic workloads? If I buy one of these systems and it doesn’t perform can I return it? What happens if I have performance problems after an upgrade? Can I downgrade? What will it cost to buy a reasonable test system so I can vet all changes on these systems?

We all have a lot of questions now, and that isn’t particularly good for Nutanix or their partner Dell. Great for VMware, great for Simplivity, great for Scale Computing, though.

This Isn’t About Performance, It’s About Support

For me, this whole issue isn’t about performance. It’s about support. It’s about knowing that when I have a problem someone will help me fix it. If a reviewer who was intentionally shipped a system for review cannot get support for that system when they have issues what are the chances I will be able to when I have issues? I already anticipate that, given the fighting, VMware won’t support me well or at all on a Nutanix system. Now I have doubts that Nutanix will be able to make up the difference. Doubly so if I bought an XC unit from Dell.

If you’re in the market for a hyperconverged system you have a lot of new questions to ask. Remember that vendors will tell you anything to get you to buy their goods and services. Insist on a try & buy with specific performance goals. Insist on a bake-off between your top two choices. Ask for industry-standard benchmark numbers. Stick to your guns.

Leave your comments below — I’m interested in what people think.


When Should I Upgrade to VMware vSphere 6?

I’ve been asked a few times about when I’m planning to upgrade to VMware vSphere 6.

Truth is, I don’t know. A Magic 8 Ball would say “reply hazy, try again.”

Some people say that you should wait until the first major update, like the first update pack or first service pack. I’ve always thought that approach is crap. Software is a rolling collection of bugs. Some are old, some are new, and while vendors try to make the number of bugs go down the truth is that isn’t the case all the time. Especially with large releases, like service packs. The real bug fixing gains are, to borrow a baseball term, in the “small ball” between the big plays. The way I see it, the most stable product is the version right before the big service pack.

Some people say that because 6.0 ends in .0 they’ll never run that code. “Dot-oh code is always horrible,” they volunteer. My best theory is that these people have some sort of PTSD from a .0, coupled with some form of cult-like shared delusion. A delusion like “nobody gets fired for buying IBM” or “we’ll be whisked away on the approaching comet.” Personally, I should get twitchy when I think about versions ending in .1. The upgrade to vSphere 5.1 was one of the most horrific I had. Actually, speaking of IBM, it seems to me that I filed a fair number of bugs against AIX 5.1, too, back in the day. Somehow I still can sleep at night.

Thing is, a version number is just a name, often chosen more for its marketing value than its basis in software development reality. It could have been vSphere 5.8, but some products were already 5.8. It could have been vSphere 5.9, but that’s real close to 6.0. Six is a nice round number, easy to rebase a ton of products to and call them all Six Dot Oh. Hell, AIX never had a real 5.0, either, except internally in IBM as an Itanium prototype. To the masses they went from 4.3.3 to 5.1. Oh, and IBM’s version number was 5.1.0. OH MY GOD A DOT-ONE AND A DOT-OH. Microsoft is skipping Windows 9, not because Windows 10 is so epically awesome, but because string comparisons on “Windows 9*” will match “Windows 95,” too. And a lot of version numbers get picked just because they’re bigger than the competitors’ versions.
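The prefix-matching problem is easy to reproduce. Here is a minimal shell sketch of the kind of check described above. The function name is_win9x is made up for illustration, and this is not actual Microsoft or third-party code, just the pattern.

```shell
#!/bin/sh
# Hypothetical legacy-style check: "is this Windows 95 or 98?"
# The glob "Windows 9"* matches any name that merely starts with "Windows 9".
is_win9x() {
    case "$1" in
        "Windows 9"*) return 0 ;;
        *)            return 1 ;;
    esac
}

for name in "Windows 95" "Windows 98" "Windows 9" "Windows 10"; do
    if is_win9x "$name"; then
        echo "$name: treated as Windows 9x"
    else
        echo "$name: not Windows 9x"
    fi
done
```

A hypothetical “Windows 9” lands in the same bucket as 95 and 98, while “Windows 10” does not.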

Given all this it seems pretty stupid to put much stock in a version number. To me, it’s there to tell us where this release fits in the sequence of time. 6.0 was before 7.0 and after 5.5.

Oh, but what about build numbers? I’ve had people suggest that. Sounds good, until you realize that the build numbers started back years ago when the codebase was forked for the new version. And, like the version number, it means almost nothing. It doesn’t tell you what bugs are fixed. It doesn’t tell you if there are regressions where 6.0 still has a bug that was fixed in 5.5, or the other way around where 5.5 still has a bug that 6.0 doesn’t because of the rework done to fix something else. Build numbers tell you where you are in time for a particular version, and roughly how many times someone (or something) recompiled the software. That’s it.

Some people say “don’t upgrade, what features does it have that you really need?” Heard this today on Twitter, and it’ll likely end up as a harsh comment on this post. Sure, maybe vSphere 6 doesn’t have any features I really need. But sometimes new versions have features that I want, like a much-improved version of that goddamned web client. Or automated certificate management (the manual process makes me think suicidal thoughts). Or cross-vCenter vMotion, oh baby where have you been all my life. Truth is, every time I hear this sort of upgrade “advice” and ask the person what version of vSphere they’re running it’s something ancient, like 4.1. I suspect their idea of job security is all the busy work it takes to run an environment like that, not to mention flouting end-of-support deadlines. Count me out. I like meaningful work and taking advantage of improvements that make things better.

Some people say “upgrade your smallest environments first, so if you have problems it doesn’t impact very much.” Isn’t that the role of a test environment, though? Test is never like production, mostly because there’s never the same amount of load on it. And if you do manage to get the same load on it it’s not variable & weird like real user load. Just never the same. And while I agree in principle that we should choose the first upgrades wisely I always rephrase it to say “the least critical environments.” My smallest environments hold some of the most critical workloads I have. One of them is “things die & police are dispatched if there are problems” critical. I don’t think I’ll start there.

So where do I start? And how long will it take?

First, I’m doing a completely fresh install of vSphere 6.0 GA code in a test environment. I’m setting it up like I’d want it to be in production. Load-balanced Platform Services Controllers (PSCs). Fresh vCenters, the new linked mode (old linked mode was a hack, new linked mode isn’t even really quite linked mode, just a shared perception of the PSCs). A few nested ESXi hosts for now. I just want to check out the new features and test compatibility, gauge if it’s worth it.

Second, I’m going to wait for the hardware and software vendors in my ecosystem to catch up. Dell has certified the servers I’m running with ESXi 6.0. Dell, HDS, and NetApp have certified my storage arrays. But Veeam hasn’t released a version of Backup & Replication that supports 6.0 yet (soon, says Rick). Backups are important, after all, and I like Veeam because they actually do meaningful QA (I got a laugh from them once because I said I adore their radical & non-standard coding practices, like actually checking return codes). Beyond that, I’m going to need to test some of my code, scripts written to do billing that use the Perl SDK, PowerCLI scripts to manage forgotten snapshots, etc. I’m also going to need to test the redundancy. What happens when a patch comes along? What happens if we lose a PSC, or a vCenter, or something? Does HA work for vRealize Automation? Does AD authentication work? Can I restore a backup?

Third, I’m going to test actual upgrades. I’ll do this with a fresh 5.5 install, running against demo ESXi hosts with demo VMs, with the goal of having the upgraded environment look exactly like my fresh install. Load balanced PSCs, linked mode, vRealize Operations, Replication, Veeam, Converter, Perl SDK, PowerCLI, everything. I’ll write it all down so I can repeat it.

Last, I’ll test it against a clone of my 5.5 VCSA, fenced off from the production networks. I’ll use the playbook I wrote from the last step, and change it as I run into issues.

Truth is, I’ll probably get through steps 1 and 2 by mid-May. But then it’ll drag out a bit. I expect upgrade problems, based on experience. I also know I’ve got some big high-priority projects coming, so my time will be limited for something like this. And it’ll be summer, so I’ll want to be in a canoe or on my motorcycle and not upgrading vSphere.

The one thing I do know, though, is that when I get to the production upgrade my path will be laid out by facts and experience, and not folk wisdom and the wives’ tales of IT.

9 Things You’ll Love About vSphere 6.0

vSphere 6.0, finally. It’s been in beta for what seems like an eternity. Betas are like Fight Club, where the first rule of participation is that you may not talk about your participation. But today’s the day that changes, as VMware just announced 6.0. A lot of rough edges were smoothed in this release, and all the limits have increased again (64 hosts per cluster, etc.). Beyond that, though, there’s much to like. Here are nine things I think are pretty neat about 6.0.

1. Centralized Services (PSC, Content Library, Update Manager)

VMware has acknowledged that there’s a fair amount of “meta-administration” (my term) that goes on for vSphere. To help curb that they’ve created the Platform Services Controller, which is essentially a separate virtual appliance, paired with each vCenter, that handles SSO, licensing, and certificate authority services. This next version of SSO has its own native replication in it, making it even more like Active Directory, and because it’s a virtual appliance it’s easier to maintain than the motley collection of services you’d have to run on a Windows host.

There is a new Content Library, which aims to help collect and organize ISO images, templates, vApps, scripts, etc. in one easy-to-find location.

UPDATE, 2/5/2015: Some folks have rightfully pointed out that my original statement about Update Manager as a virtual appliance is incorrect. They’re right. Between my notes being wrong (well, mashed together) and distractions while editing, Update Manager became something it isn’t right now. Sorry. VUM is still a Windows application and dependent on the C# client. Hopefully VMware will do something more interesting with it moving forward!

2. Web & Classic Client Improvements

Everybody hates the web client, so VMware spent some time on it. Login is 13x faster, they say. Most tasks are 50% faster, or more. Performance charts actually work, they say. It integrates the VM Remote Console (VMRC) for better console access & security. It has a dockable UI. They flattened the right-click menu. And it refreshes more, though you can’t see other running tasks by default. It’ll be interesting to see how these improvements stand in the court of public opinion.

The classic C# vSphere client is still around, and can now read the configurations for v10 and v11 (the vSphere 6.0 hardware version) VMs. It can’t edit them, but that’s fine. It’s still the way you get direct access to ESXi hosts and use Update Manager.

3. MSCS Improvements

A lot of people want to use Microsoft Cluster Services to protect their applications. This has classically presented serious issues, in that you usually had to give up vMotion, paravirtual drivers, etc. Not anymore — many of these restrictions have been fixed up for SQL Server 2012 & Windows Server 2012 R2.

4. Auditing & Security (Certs, Auditing, etc.)

Security is on VMware’s mind, and one of the biggest pains in securing a vSphere environment is certificate management. To that end VMware has created a Certificate Authority (VMCA) that runs on the Platform Services Controller and will manage the creation, signing, and installation of certificates for all aspects of vSphere (vCenter, ESXi, Solutions, etc.). It can work with third-party CAs as well, and promises to make it easier to do the right thing.

It’s also much easier to manage local ESXi users, and many of the password & lockout settings are now configured centrally from vCenter. Administrator actions are now logged as the particular user that caused them (and not just vpxuser anymore), enabling people to actually figure out who did what and when.

5. vCloud Suite 6 Updates

The vCloud Suite was looking pretty bad, with most of the components deprecated or not-updated in the face of the new vRealize Suites. Not anymore. vCloud Suite licensees get vRealize Automation, vRealize Operations Management, as well as vRealize Business Standard.

vCloud Director is dead, though, as promised, and it’s gone from the suites.

6. Fault Tolerance & VM Component Protection

FT has always been a feature that looks interesting but was mostly useless due to the 1 vCPU restriction. Not anymore. You can have up to 4 vCPUs per VM, and a total of 8 vCPUs in FT per host. It’s really just limited by the network bandwidth between source and target, though. FT no longer requires shared disk, either, which is a great addition, and they made FT-protected VMs capable of snapshots, paravirtualized drivers, and vADP. This helps make the choice of FT more palatable to administrators.

A new feature, VM Component Protection, has been added. It protects VMs against misconfigurations and connectivity problems. Because FT in vSphere 6.0 doesn’t need a shared disk backend you can place your replica copy on a separate array, and if something happens to the first array vSphere will detect the problem and proactively fail over to the replica. This is an understated but extremely powerful feature.

7. vCSA Improvements

The vCenter Server Appliance is now a full replacement for Windows-based vCenter. It can support 1000 hosts per instance, 10000 VMs, 64 hosts per cluster, and “linked mode” using the embedded database. This is great. Not only is the vCSA easier to maintain than a Windows-based installation but now you don’t need a database license or DBA, either.

You will need to dedicate more vCPU and vRAM resources to the new vCenter & PSC virtual appliances, though.

8. vMotion Improvements

You can vMotion between vCenters now, with or without shared storage. It keeps the UUID for the VM, so the VM looks like it moved, so it doesn’t confuse backup and other tools. When it moves it keeps history, alarms, HA & DRS settings, shares, limits, and the MAC address. AWESOME.

You can move from a vSwitch to a Distributed vSwitch in a vMotion, too, and it’ll move port metadata.

You can vMotion anywhere up to 100 ms away. That’s basically across a continent. Hurricane coming? No big deal, vMotion to Wisconsin! This also means that vMotion across a router is now supported. It’s worked for years but they’ll actually stand behind it now.

9. Virtual SAN 6.0

VSAN has jumped from 1.0 to 6.0, signifying that it’s ready for prime time. It now can treat SSD as read-intensive and write-intensive, making a two-tier all-flash configuration possible. You can use it on a 64-host cluster. You can use 62 TB VMDK files now, a much-missed feature. Snapshots and clones have been reworked to integrate the Virsto intellectual property, making snapshots useful again. They added fault domain support, so you can model your VSAN implementation around the possible problems in your data center, making it cheaper and more reliable to operate. Last, there are major management updates to VSAN, making it easy to set up, upgrade, and maintain your hyperconverged VSAN deployment.

Lots of great stuff! I look forward to upgrading to these new technologies.

Latest ESXi Turns Off Transparent Page Sharing, So Watch Your RAM

Transparent Page Sharing is a technology from VMware that essentially deduplicates memory. Instead of having 100 copies of the same memory segment it keeps just one, and returns the savings to the user in the form of additional free capacity.

In a move that further encourages people to never patch their systems VMware has set the new default for Transparent Page Sharing to “off.” They did this in the latest Updates to ESXi (ESXi 5.5 Update 2d, for example). More specifically, to share pages between VMs you now need to configure your virtual machines to have a “salt,” and only VMs with identical salts will share pages. To specify a salt you need to manually edit a virtual machine’s .VMX file, which requires the VM to be powered off and removed from inventory.

What a mess.

If you patch to the latest versions for security reasons it’ll be off, and you’ll likely have some serious memory contention issues. In one of my environments we have almost an entire host worth of shared pages (209.71 GB) which, under ESXi 5.5 Update 2d, will be gone.

If I were you I’d put it back the way it was, which means an advanced configuration option on each ESXi box, Mem.ShareForceSalting set to 0. You can’t use Update Manager, though, because the new default appears after ESXi 5.5 Update 2d boots. You need to set the flag once the host is up, then reboot the host once more. So it’s a pretty manual patching round this time.
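A rough sketch of that change as it might be run from the ESXi shell on each host. It assumes the esxcli advanced-settings namespace and the option path /Mem/ShareForceSalting for Mem.ShareForceSalting, so verify both against the VMware KB before relying on it. The ESXCLI variable and the restore_tps function are my own scaffolding so the script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch: put inter-VM page sharing back the way it was on one ESXi host.
# ESXCLI defaults to "echo esxcli", so by default this only PRINTS the
# commands. Set ESXCLI=esxcli on the host to actually apply them.
ESXCLI="${ESXCLI:-echo esxcli}"

restore_tps() {
    # Show the current value first, for the record
    $ESXCLI system settings advanced list -o /Mem/ShareForceSalting
    # Set Mem.ShareForceSalting to 0 (assumed option path, see lead-in)
    $ESXCLI system settings advanced set -o /Mem/ShareForceSalting -i 0
}

restore_tps
```

The reboot afterwards is still on you, and it still has to happen per host.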

This is all due to a low-risk security concern that’s likely not even a consideration for a typical enterprise. VMware wants ESXi to be secure by default, which I get, but this is a major change that seriously affects resources & budgets, too. Major changes go in major releases (5.1.0, 5.5.0, etc.), not third-level point releases, and NOT mixed up with a critical security fix, either. This is a classic VMware Hobson’s choice: take the whole thing or leave it. Or perhaps it’s a Catch-22. Regardless of the cliche, I suspect this poor “feature” release decision is going to cause some headaches for VMware customers this time.

More information is in the VMware KB:

How to Install CrashPlan on Linux

I like CrashPlan. They support a wider range of operating systems than some of their competitors, they have a simple pricing model, unlimited storage & retention, and nice local, mobile, and web interfaces. I’ve been a customer for a few years now, and recently have switched a few of my clients’ businesses over to them, too.

What I don’t like is that they don’t seem to support Linux very well, which is typical of companies when their installed base is mostly Windows & Mac. Most notably, their install instructions are sparse and they don’t tell you what packages you need to have installed, which is important because cloud VMs and whatnot are usually “minimal” installations. I’ve attempted to open a support case, but they suggested running a “headless” client, which is both unsupported and a huge pain. And then they closed the support case, because it’s unsupported! DOH.

So here’s how I get CrashPlan installed on Linux in case it helps others, and maybe Code42 themselves.


  • As of this writing the CrashPlan software is version 3.7.0.
  • I am doing this on Red Hat Enterprise Linux, CentOS, and Oracle Enterprise Linux 5, 6, and 7 servers.
  • These servers have outbound access to the Internet.
  • I am using a Windows 7 desktop.
  • This document assumes you’ve done some things with Linux before.
  • This document assumes you’re logging in as yourself and then using sudo to run things as root.
  • Your mileage may vary, and my ability to help you is limited. I don’t work for Code42. That said, if you can improve this document please let me know how.

1. Your Linux server has to have a few packages on it to enable basic X Windows support for the CrashPlan GUI. Most cloud servers are built with a minimal installation and don’t have these. On Enterprise Linux variants you can issue the command:

sudo yum install xauth xterm wget xorg-x11-fonts-Type1 xorg-x11-font-utils libXfont

This will get you xterm, so you can test the setup, xauth which is part of the X Windows authentication setup, and the fonts the client will need. It also grabs wget, which the CrashPlan installer will use to retrieve a copy of the Java Runtime Environment.
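If you want to confirm the command-line pieces landed before moving on, a quick check like the following works on a minimal install. It is only a sketch: missing_tools is a helper name I made up, and it can only see executables on the PATH, so it says nothing about the font packages.

```shell
#!/bin/sh
# Print the name of every listed command that is NOT on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools xauth xterm wget)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line
    echo "Still missing:" $missing
else
    echo "xauth, xterm, and wget are all present"
fi
```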

2. You need to forward X Windows (X11) graphics to your desktop. To do this you need what’s called an X Server, and you need an SSH client that can forward X11 packets. Given that you did #1 already you probably have SSH. I’m on Windows, so I use:

Xming: the X Server you need

PuTTY: a free SSH client

I actually use Van Dyke’s SecureCRT, but it’s a commercial product. IMO, totally worth it if you spend a lot of time logged into UNIX hosts, though. Both it and PuTTY can forward X11.

Install these things. Run Xming (not Xlaunch, etc.) and your SSH client of choice. Xming may trigger a firewalling prompt under Windows. The X Windows data will be coming across the SSH connection you’re about to establish, so you shouldn’t need to open anything up.

3. Set up a new SSH connection, or edit an existing one, and tell it to Enable X11 Forwarding. Here’s where the X11 option is in PuTTY:

(Screenshot: PuTTY Configuration)

Open that connection.

4. Run ‘xterm’:

xterm

You should see:

(Screenshot: xterm)

If you get this error on the console:

xterm: Xt error: Can't open display: localhost:10.0

Check to make sure Xming is running. If you get:

xterm: Xt error: Can't open display: %s
xterm: DISPLAY is not set

Forwarding wasn’t working when you logged in. Check your settings, and check to make sure that xauth and xterm got installed correctly. If you’re ssh’ing from a command line, like on a Mac or through a second server, you might need:

ssh -Y hostname
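You can also check for the second failure without launching xterm at all, since it comes down to whether DISPLAY is set in your login session. A minimal sketch, with check_x11_forwarding being a name I made up. Note that a set DISPLAY only means forwarding was negotiated; it does not prove Xming is running on the other end.

```shell
#!/bin/sh
# Report whether this login session has an X11 display to draw on.
# sshd sets DISPLAY (e.g. localhost:10.0) when forwarding was negotiated.
check_x11_forwarding() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set: X11 forwarding is not active for this login"
        return 1
    fi
    echo "DISPLAY is $DISPLAY"
    return 0
}

check_x11_forwarding || echo "Reconnect with X11 forwarding enabled (for example, ssh -Y hostname)"
```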

5. Close xterm – you can either type ‘exit’ like in a normal shell or just close the window like a normal window. That was just to check to make sure forwarding was working.

6. Put the CrashPlan installer on the Linux server. My favorite way to do this is to copy the direct download link from CrashPlan and paste it into wget, but you can use sftp, scp, Zmodem[0], whatever.


Then expand it:

tar xzf CrashPlanPRO_3.7.0_Linux.tgz
cd CrashPlanPRO-install/

7. Run the installer as root. The CrashPlan service needs to run as root if you want it to back up the whole system, and the installer will want to put startup files in the right spots so CrashPlan starts after a reboot. This is a good idea.

sudo ./

8. Answer questions, read the EULA. You can get out of the EULA by hitting ‘q’. I let it put everything in the default locations.

9. Let it start the CrashPlanDesktop app. In a few moments you should see the GUI pop up. Log in, do your thing. You will likely have to adjust what it’s backing up if you want it to get system files and such. Be careful, though, because it’ll back up things that change, and a lot of system files change a lot. You might want to consider lowering the backup frequency in that case.

(Screenshot: CrashPlan PRO)

(Screenshot: Change File Selection)

10. If the CrashPlan GUI didn’t start you can try running it manually:


If you get errors about permissions check your forwarding again (run xterm). If you’re trying to run it with sudo you might be getting permission errors. X Windows has some authentication in it so people can’t just pop windows open on your desktop. You can give root permission, though, by copying your .Xauthority file to root’s home directory:

cd ~; sudo cp .Xauthority /root/.Xauthority

You shouldn’t need this, though. The CrashPlan client will work when run as your user.

11. If you don’t want all the users on your server to be able to run CrashPlan you should set the CrashPlan client itself to require a password to run. This is the easiest way to handle this, but it requires that people who need to do a restore have the password to the account that it was set up under. Thus, in a multi-admin environment you might want to create a shared user to log in as.

Another way to do this is to create an archive password which is shared. You won’t ever be able to remove the archive password once you do that, though you can change it.

You might be tempted to just change the permissions on /usr/local/crashplan/bin/ so only root can get there, but remember that anybody can copy the GUI to the server and run it.

12. Make sure you have a firewall protecting incoming connections, as the CrashPlan backup engine listens on tcp/4243. You don’t want that open to the world.
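As a starting point, here is a sketch that prints iptables rules for that port instead of applying them, so you can review them first. It assumes plain iptables (Enterprise Linux 7 ships firewalld by default, where the approach differs), and the function name and example networks are placeholders of mine. Appended rules also land after whatever is already in your INPUT chain, so check the ordering.

```shell
#!/bin/sh
# Print (not apply) iptables rules that allow tcp/4243 only from trusted
# sources and drop everything else aimed at that port.
# The networks passed in are placeholders; substitute your own.
crashplan_fw_rules() {
    port=4243
    for src in "$@"; do
        echo "iptables -A INPUT -p tcp --dport $port -s $src -j ACCEPT"
    done
    echo "iptables -A INPUT -p tcp --dport $port -j DROP"
}

crashplan_fw_rules 127.0.0.1 192.0.2.0/24
```

Review the output, then run the lines with sudo if they suit your environment.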

13. I actually create a separate /usr/local/crashplan filesystem before I install. CrashPlan keeps logs and file caches there and they get big sometimes, and this keeps them separate from everything else.

Good luck! If you see an error here or have a suggestion let me know in the comments. Thank you.


[0] I’m not kidding. SecureCRT supports Zmodem, and I install lrzsz on all my servers so I can just type ‘rz’ to send a file to the server and ‘sz’ to send one back. Encrypted, fast, easy.

Why Use SD Cards For VMware ESXi?

I’ve had four interactions now regarding my post on replacing a failed SD card in one of my servers. They’ve ranged from inquisitive:

to downright rude:

“SD cards are NOT reliable and you are putting youre [sic^2] infrastructure at risk. Id [sic] think a person like you would know to use autodeploy.”

Aside from that fellow’s malfunctioning apostrophe, he has a good, if blunt, point. SD cards aren’t all that reliable, and there are other technologies to get a hypervisor like ESXi on a host. So why use SD cards?

1. Cost. If I outfit a Dell PowerEdge R630 with a traditional setup of two magnetic disks and a decent RAID controller my costs are:

300 GB 10K SAS 2.5″ disk: $212.75
300 GB 10K SAS 2.5″ disk: $212.75
PERC H730: $213.47
Keep My Hard Drive, 5 years: $213.46
Power for this setup, at 5.9 Watts per drive (as per Tom’s Hardware), guesstimating 5.9 Watts for the RAID controller, and $0.14133 per kWh in my locale: $109.60 for 5 years.
Labor costs dealing with drive replacements, monitoring, etc.: $200.00 (this is low).

This comes to $1162.03 per server. On a 32 node cluster that’s $37,184.96, or the cost of three servers, over five years.

In contrast, the Dell Internal Dual SD Module is $104.60 per server with two 16 GB SD cards. That’s $3347.20 for a 32 node cluster.
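If you want to rerun the math with your own prices, the totals above reduce to a few lines of shell and awk. The figures are copied straight from the list; the variable names are just mine.

```shell
#!/bin/sh
# Per-server costs in USD over five years, copied from the list above.
disk=212.75        # one 300 GB 10K SAS disk (two are needed)
perc=213.47        # PERC H730
keep_drive=213.46  # Keep My Hard Drive, 5 years
power=109.60       # power estimate, 5 years
labor=200.00       # drive replacements, monitoring, etc.
sd_module=104.60   # Internal Dual SD Module with two 16 GB cards
nodes=32

raid_per_server=$(awk -v d="$disk" -v p="$perc" -v k="$keep_drive" \
    -v w="$power" -v l="$labor" 'BEGIN { printf "%.2f", 2*d + p + k + w + l }')
raid_cluster=$(awk -v c="$raid_per_server" -v n="$nodes" 'BEGIN { printf "%.2f", c*n }')
sd_cluster=$(awk -v c="$sd_module" -v n="$nodes" 'BEGIN { printf "%.2f", c*n }')

echo "Disks + RAID, per server:  \$$raid_per_server"   # 1162.03
echo "Disks + RAID, $nodes nodes:   \$$raid_cluster"    # 37184.96
echo "Dual SD module, $nodes nodes: \$$sd_cluster"      # 3347.20
```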

To head off the inevitable comment: the PERC H310/H330 is not a decent RAID controller. To start, it isn’t even certified for VMware VSAN. Anybody that argues that the H330 is fine ought to be okay with the mirroring the Internal Dual SD Module does, because the two are about equal in that regard.

2. Use drive bays more productively. Say that I do want to put local disk in my servers, be it some SSD so I can do caching (a la SanDisk Flashsoft, PernixData, vFRC, etc.) or maybe do VSAN. With a traditional setup I’d have to use two of my limited drive bays for boot volumes. That isn’t the most productive use of my expensive drive bays (and data center space).

3. Avoid dependency loops. Auto Deploy is an interesting VMware feature but it relies on a functioning network, DHCP & PXE & TFTP, DNS, and vCenter infrastructure to work. And that’s a problem when you’re in the middle of an outage (planned or unplanned) and any of that infrastructure is a VM.

If your vCenter is a VM how do you start everything up after an outage? Oh, you run a management cluster that doesn’t Auto Deploy… well that’s a pain in the ass, because you now have a vSphere cluster that’s different. Different means harder to manage, which means human error and additional operational cost. What’s the ongoing cost of that, vs $104.60 per server?

If your DHCP/PXE/TFTP server is a VM how do you start everything up after an outage? Oh, you use Auto Deploy with local caching? Okay, where does it cache? Traditional magnetic media… ah, you’ve chosen expensive AND complicated!

My network guys use a bunch of VMs to provide network functionality like DHCP, some DNS (half our anycast DNS infrastructure is on VMs, half on physical hosts), carrier-grade NAT, log collection, etc. It’s great – vSphere dramatically reduces risk and improves uptime for their services. We always have to make sure that we keep track of dependencies, though, which is easy to do when there are so few.

4. Avoid unreliable vSphere features. While we’re on the topic of Auto Deploy I’d just like to say that it isn’t production-ready. First, it’s not at all configurable from the vCenter GUIs. It’s all done from PowerShell, which is hard for many IT shops. People just aren’t as good at scripting and CLIs as they should be. Second, it relies on the perpetually crappy Host Profiles. I don’t think I’ve ever seen a cluster that isn’t complaining about host profile compliance. And when you look at what’s out of compliance you see it’s some parameter that gets automatically changed by vCenter. Or the local RAID controller pathing, or a CD-ROM drive, or something that should just be automatically handled for you. And Lord help you if you want to use mutual CHAP with host profiles.

“I seem to have forgotten all the different iSCSI passwords you entered to be compliant with the Hardening Guide, Bob” – Host Profiles, on every edit.

Auto Deploy also scares me a little when it comes to my pre-existing datastores. I don’t build ESXi hosts with fibre channel storage connected, lest something go wrong and do something bad to a few hundred TB of storage. Yet every time an ESXi host boots from Auto Deploy it’ll look to do some partitioning of local storage. It isn’t supposed to interact with “remote” storage, but I don’t trust VMware’s QA very much, especially with underused features like this. Not worth the risk.

Auto Deploy & host profiles are interesting but until VMware puts more effort into both I cannot endorse putting either in your environment’s critical support path, if only because the alternatives are so cheap & reliable.

5. Boot From SAN is complicated, too. The three major alternatives to SD cards are traditional magnetic media, Auto Deploy, and boot from SAN. Boot from SAN is another one of those ideas that seems great on paper but doesn’t really pan out well in real life. First, look at your disk vendor’s upgrade notes. Pay attention to all the caveats if you’re booting from SAN, versus if you’re booting locally. A number of vendors don’t even support array software updates when you’re booting from the SAN. It all has to come down, and that’s lame.

Second, you’ve got dependency issues again. If you’re booting locally you need power and cooling and you can figure everything else out later. If you’re booting off the SAN you need working power, cooling, networking/SAN, etc. to start. You’re exposed to a lot more human error, too. Someone screws up a SAN zoning change and your whole infrastructure is offline, versus just some VMs. Good luck with the finger pointing that ensues.

Last, the pre-boot environment on servers is hard to manage, inflexible, and isn’t real helpful for troubleshooting. To make changes you need to be on the console, as very little of it is manageable through typical system management tools. Configuring this sort of boot often uses the horrible BIOS interfaces on the CNAs or NICs you have installed, or archaic DOS utilities you have to figure out how to cram on an ISO or bootable USB drive. That isn’t worth anybody’s time.

6. It just freakin’ works. When it comes right down to it none of these approaches to booting an ESXi host have any ROI. None. Zip. Zero. So every minute you spend messing around with them is a minute of your life you wasted.

The solution is super cheap, from both CapEx and OpEx perspectives. It doesn’t take long to get to $104.60 of labor, especially when Dell will also pre-load ESXi on your new SD cards, thereby saving you even more time.

Once booted, ESXi only writes configuration data back to the card(s) every 10 minutes, so despite the limited write cycles of flash a decent SD card will last the lifetime of the server. And if it doesn’t, the mirroring is reliable enough to let you limp along. Replacement & remirroring is easy, just say Yes at the prompt.
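To put a rough number on that, here is a small Python sketch that counts the write-backs over a server's life. The 10 minute interval is from the paragraph above; the five year lifetime is my assumption, so plug in your own refresh cycle. It counts write events only and doesn't try to model how a particular card's controller spreads them across the flash.

```python
WRITE_INTERVAL_MINUTES = 10  # from the text above
SERVER_LIFETIME_YEARS = 5    # assumption: adjust for your refresh cycle


def config_writes(lifetime_years, interval_minutes=WRITE_INTERVAL_MINUTES):
    """Number of periodic configuration write-backs over the server's life."""
    minutes = lifetime_years * 365 * 24 * 60
    return minutes // interval_minutes


total = config_writes(SERVER_LIFETIME_YEARS)
print(total)  # 262800
```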

Last, they save you from a ton of extra complexity, complexity with no ROI. I don’t know about you but I’d rather be working on real problems than spinning my wheels managing an overcomplicated environment.

SD cards — Just Enough Storage.

How to Replace an SD Card in a Dell PowerEdge Server

We use the Dell Internal Dual SD module (IDSDM) for our VMware ESXi hosts. It works great, and saves us a bunch of money per server in that we don’t need RAID controllers, spinning disks, etc. Ours are populated with two 2 GB SD cards from the factory, and set to Mirror Mode in the BIOS.

Dell BIOS SD Card Mirror Mode

The other day we received an alarm:

Failure detected on Internal Dual SD Module SD2

We’d never seen a failure like this so we had no idea how to fix it, and the Internet was only slightly helpful (hence the point of this writeup). Here’s what we did to replace it.

Note: I’m certified to work on Dell servers, and have been messing with hardware for 25 years. To me this is a real easy fix, but you should do what you’re comfortable with. I always suggest you manage possible static electricity by ensuring you’re at the same electrical potential as the server. Touching the metal portions of the case works well for this, which you clearly have to do to get inside a PowerEdge. Just remember to do it again if you wander off and come back. This is also good advice for fueling your car in dry climates. :)

1. First, the SD cards themselves are not covered under Dell warranty, so we bought new 4 GB cards (the smallest we could find) from the nearest place that sold them (try your nearest Walgreens or CVS). The SD card from Dell was Kingston so we chose that brand. Total expenditure was $10 for two cards. Why two cards? Because the staff time & fuel cost more than the part, so stocking a spare makes sense. Plus, if one has failed I suspect I’ll see another failure. After all, I do have a couple hundred of these things.

2. Second, we shut the host down, unplugged it, and found the SD card module using the map under the Dell server cover. On our PowerEdge R720 it was below the PCIe riser closest to the power supplies. On blades it’s out the back, labeled “SD,” and you just have to pull the blade out to get to it.

The Dell IDSDM whitepaper indicates that, because of the way the module is powered, you should always do this work with the AC disconnected.

3. We took the expansion cards out of that PCIe riser and noted which one was in what slot (top vs. bottom). Then we gently removed the PCIe riser itself. Last, the IDSDM has a little blue strap to help you pull it straight up and out of the socket.

4. The error in the system event log indicated that SD2 was faulty. But now that we’ve got the thing in our hands which card is SD2? Turns out on 12G PowerEdge IDSDMs there’s an activity light on each side, and one is labeled SD1_LED and the other is SD2_LED. Your mileage here will vary — 11G servers had the slots labeled, and I haven’t looked at a 13G IDSDM yet. Use your head.


5. The SD card locks in, so you need to push it in to eject it. We took the 2 GB card out, put our new 4 GB card in, and put everything back together.

6. When the server boots it’ll ask you what you want to do about rebuilding the mirror. If you have F1/F2 prompts disabled in the BIOS you’ll have 10 seconds to answer before the boot continues without a rebuild.


For us it took about 5 minutes to resilver the mirror, then the boot process continued into ESXi. In keeping with good security techniques I put the old SD card through a shredder.

Deduplication & Write Once, Read Many

It’s probably sad that I see this and think about deduplication & WORM. This fellow achieved a 27% deduplication rate, though. Think of all the extra letters he could tattoo on his back now!

Says 'Misisipi' instead of 'Mississippi'

For those of you who don’t speak English natively: I assume he was going for “Mississippi.”
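For the curious, the 27% is just the fraction of letters saved. A quick Python check, with a function name I made up for the occasion:

```python
def reduction(original: str, stored: str) -> float:
    """Fraction of characters saved by storing `stored` instead of `original`."""
    return 1 - len(stored) / len(original)


rate = reduction("Mississippi", "Misisipi")  # 11 letters down to 8
print(f"{rate:.0%}")  # 27%
```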


Table Stakes for Storage Arrays

What 1987 Storage Looked Like (Micropolis 760 MB SCSI HDD)

I was just looking at Andreas Lesslhumer’s post about blog posting volume in the virtualization community, and it’s depressing. I didn’t blog a whole lot here last year. Why was that? Because I was writing elsewhere!

Speaking of that, the first half of my “Six Features You Absolutely Need on Your Storage in 2015” list is up over at The Virtualization Practice. In it I outline what the table stakes are for enterprise storage arrays, get only slightly snarky about why we, as an industry, are still discussing why & how to use flash, and highlight the good work some vendors are doing (SolidFire, Dell, and Tintri in this post, more in next week’s second part). Check it out.

You Cannot Use open-vm-tools to Customize VMs

Homer Simpson: Kids: there’s three ways to do things; the right way, the wrong way and the Max Power way!
Bart: Isn’t that the wrong way?
Homer Simpson: Yeah, but faster!

My biggest pet peeve with open source is that projects don’t ever solve whole problems. They get 60% of the way to solving a whole problem and then run off to chase another squirrel.

The most recent example of this is VMware’s recommendation to use the open-vm-tools packages that ship with modern distributions of Linux. Dumbest recommendation ever. Why? Because the project got to 60% of the solution and stopped, effectively solving no problems for anybody. From what appears to be a VMware employee on the open-vm-tools mailing list archives:

> On Ubuntu it is very easy to install the open-vm-tools, so I wonder
> how these differ from the VMware provided? Do they have the same
> features?

There are a few features missing in open-vm-tools, e.g. DeployPkg (for guest customization), Unity etc. Otherwise, the two are pretty much the same.

Um, no, they’re decidedly not the same. With open-vm-tools, ongoing maintenance of the Tools is much easier, but you won’t get a working VM when you clone a template, deploy with vRA/vCAC/vCD, etc. Turns out that people like having working VMs, with the right IPs and whatnot. :)

With legacy VMware Tools, updating is mostly broken, but we’ve been working around that for a decade so it’s figured out. And you get working VMs from a deployment! Snazzy.

DeployPkg is critical in private cloud environments, and to pretty much everything VMware is promoting lately. Could someone add it and make life easier for those of us out here actually using this stuff? Please? And maybe stop recommending this unfinished software until then?
