Why Use SD Cards For VMware ESXi?

I’ve had four interactions now regarding my post on replacing a failed SD card in one of my servers. They’ve ranged from the inquisitive to the downright rude:

“SD cards are NOT reliable and you are putting youre [sic^2] infrastructure at risk. Id [sic] think a person like you would know to use autodeploy.”

Aside from that fellow’s malfunctioning apostrophe, he has a good, if blunt, point. SD cards aren’t all that reliable, and there are other technologies to get a hypervisor like ESXi on a host. So why use SD cards?

1. Cost. Looking at dell.com, if I outfit a Dell PowerEdge R630 with a traditional setup of two magnetic disks and a decent RAID controller, my costs are:

300 GB 10K SAS 2.5″ disk: $212.75
300 GB 10K SAS 2.5″ disk: $212.75
PERC H730: $213.47
Keep My Hard Drive, 5 years: $213.46
Power for this setup, at 5.9 Watts per drive (per Tom’s Hardware), guesstimating another 5.9 Watts for the RAID controller, and $0.14133 per kWh in my locale: $109.60 for 5 years.
Labor costs dealing with drive replacements, monitoring, etc.: $200.00 (this is low).

This comes to $1,162.03 per server. On a 32-node cluster that’s $37,184.96, or the cost of three servers, over five years.

In contrast, the Dell Internal Dual SD Module is $104.60 per server with two 16 GB SD cards. That’s $3,347.20 for a 32-node cluster.
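For anyone who wants to check my arithmetic, here’s a quick back-of-the-envelope sketch using the figures above. The prices are the ones quoted; the hours-per-year figure, the three-device power draw, and the rounding are my own assumptions.

# Five-year boot-media cost comparison, using the numbers quoted above.
HOURS_PER_YEAR = 24 * 365
KWH_RATE = 0.14133                 # $/kWh in my locale
WATTS = 5.9 * 3                    # two drives plus a guesstimated 5.9 W for the PERC

power_cost = WATTS * HOURS_PER_YEAR * 5 / 1000 * KWH_RATE   # ~$109.60 over 5 years

disk_setup = (212.75 + 212.75      # two 300 GB 10K SAS disks
              + 213.47             # PERC H730
              + 213.46             # Keep My Hard Drive, 5 years
              + power_cost
              + 200.00)            # (low) labor estimate

sd_setup = 104.60                  # Internal Dual SD Module, two 16 GB cards

# Rounding the power line to $109.60 gives the $1,162.03 and $37,184.96 quoted above.
print(f"Per server:      ${disk_setup:,.2f} vs ${sd_setup:,.2f}")
print(f"32-node cluster: ${disk_setup * 32:,.2f} vs ${sd_setup * 32:,.2f}")

However you slice the rounding, over five years the traditional setup costs a 32-node cluster roughly three extra servers’ worth of money.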

To head off the inevitable comment: the PERC H310/H330 is not a decent RAID controller. To start, it isn’t even certified for VMware VSAN. Anybody who argues that the H330 is fine ought to be okay with the mirroring the Internal Dual SD Module does, because the two are about equal in that regard.

2. Use drive bays more productively. Say that I do want to put local disks in my servers, be it some SSDs for caching (a la SanDisk FlashSoft, PernixData, vFRC, etc.) or maybe for VSAN. With magnetic boot media I’d still have to give up two of my limited drive bays for boot volumes, and that isn’t the most productive use of expensive drive bays (and data center space).

3. Avoid dependency loops. Auto Deploy is an interesting VMware feature but it relies on a functioning network, DHCP & PXE & TFTP, DNS, and vCenter infrastructure to work. And that’s a problem when you’re in the middle of an outage (planned or unplanned) and any of that infrastructure is a VM.

If your vCenter is a VM how do you start everything up after an outage? Oh, you run a management cluster that doesn’t use Auto Deploy… well, that’s a pain in the ass, because you now have a vSphere cluster that’s different. Different means harder to manage, which means human error and additional operational cost. What’s the ongoing cost of that, vs. $104.60 per server?

If your DHCP/PXE/TFTP server is a VM how do you start everything up after an outage? Oh, you use Auto Deploy with local caching? Okay, where does it cache? Traditional magnetic media… ah, you’ve chosen expensive AND complicated!

My network guys use a bunch of VMs to provide network functionality like DHCP, some DNS (half our anycast DNS infrastructure is on VMs, half on physical hosts), carrier-grade NAT, log collection, etc. It’s great – vSphere dramatically reduces risk and improves uptime for their services. We always have to make sure that we keep track of dependencies, though, which is easy to do when there are so few.

4. Avoid unreliable vSphere features. While we’re on the topic of Auto Deploy I’d just like to say that it isn’t production-ready. First, it’s not at all configurable from the vCenter GUIs. It’s all done from PowerShell, which is hard for many IT shops. People just aren’t as good at scripting and CLIs as they should be. Second, it relies on the perpetually crappy Host Profiles. I don’t think I’ve ever seen a cluster that isn’t complaining about host profile compliance. And when you look at what’s out of compliance you see it’s some parameter that gets automatically changed by vCenter. Or the local RAID controller pathing, or a CD-ROM drive, or something that should just be automatically handled for you. And Lord help you if you want to use mutual CHAP with host profiles.

“I seem to have forgotten all the different iSCSI passwords you entered to be compliant with the Hardening Guide, Bob” – Host Profiles, on every edit.

Auto Deploy also scares me a little when it comes to my pre-existing datastores. I don’t build ESXi hosts with fibre channel storage connected, lest something go wrong and do something bad to a few hundred TB of storage. Yet every time an ESXi host boots from Auto Deploy it’ll look to do some partitioning of local storage. It isn’t supposed to interact with “remote” storage, but I don’t trust VMware’s QA very much, especially with underused features like this. Not worth the risk.

Auto Deploy & host profiles are interesting but until VMware puts more effort into both I cannot endorse putting either in your environment’s critical support path, if only because the alternatives are so cheap & reliable.

5. Boot From SAN is complicated, too. The three major alternatives to SD cards are traditional magnetic media, Auto Deploy, and boot from SAN. Boot from SAN is another one of those ideas that seems great on paper but doesn’t really pan out well in real life. First, look at your disk vendor’s upgrade notes. Pay attention to all the caveats if you’re booting from SAN, versus if you’re booting locally. A number of vendors don’t even support array software updates when you’re booting from the SAN. It all has to come down, and that’s lame.

Second, you’ve got dependency issues again. If you’re booting locally you need power and cooling, and you can figure everything else out later. If you’re booting off the SAN you need working power, cooling, networking/SAN, etc. just to start. You’re also exposed to a lot more human error: someone screws up a SAN zoning change and your whole infrastructure is offline, versus just some VMs. Good luck with the finger pointing that ensues.

Last, the pre-boot environment on servers is hard to manage, inflexible, and isn’t much help for troubleshooting. To make changes you need to be on the console, as very little of it is manageable through typical system management tools. Configuring this sort of boot often means the horrible BIOS interfaces on the CNAs or NICs you have installed, or archaic DOS utilities you have to figure out how to cram onto an ISO or bootable USB drive. That isn’t worth anybody’s time.

6. It just freakin’ works. When it comes right down to it, none of these approaches to booting an ESXi host has any ROI. None. Zip. Zero. So every minute you spend messing around with them is a minute of your life you wasted.

The solution is super cheap, from both CapEx and OpEx perspectives. It doesn’t take long to get to $104.60 of labor, especially when Dell will also pre-load ESXi on your new SD cards, thereby saving you even more time.

Once booted, ESXi only writes configuration data back to the card(s) every 10 minutes, so despite the limited write cycles of flash, a decent SD card will last the lifetime of the server. And if it doesn’t, the mirroring is reliable enough to let you limp along. Replacement and remirroring is easy: just say Yes at the prompt.
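A back-of-the-envelope endurance estimate shows why I’m not worried. Every input below (the size of each config write-back, the rated program/erase cycles, the assumption that the card wear-levels writes evenly) is an illustrative guess on my part, not a measurement:

# Rough SD card endurance estimate. All inputs are illustrative assumptions.
backup_mb       = 10         # assumed size of each ESXi config write-back
backups_per_day = 24 * 6     # one write every 10 minutes
card_gb         = 16
pe_cycles       = 3000       # conservative rating for cheap flash, spread
                             # across the card by wear leveling

daily_writes_gb = backup_mb * backups_per_day / 1024    # ~1.4 GB/day
total_writes_gb = card_gb * pe_cycles                   # 48,000 GB of endurance
years = total_writes_gb / daily_writes_gb / 365
print(f"~{years:.0f} years of 10-minute config writes")  # on the order of 90 years

Even if those guesses are off by an order of magnitude, the card outlives the server, and the mirror covers the cards that die of something other than wear.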

Last, they save you from a ton of extra complexity, complexity with no ROI. I don’t know about you but I’d rather be working on real problems than spinning my wheels managing an overcomplicated environment.

SD cards — Just Enough Storage.

How to Replace an SD Card in a Dell PowerEdge Server

We use the Dell Internal Dual SD module (IDSDM) for our VMware ESXi hosts. It works great, and saves us a bunch of money per server in that we don’t need RAID controllers, spinning disks, etc. Ours are populated with two 2 GB SD cards from the factory, and set to Mirror Mode in the BIOS.

Dell BIOS SD Card Mirror Mode

The other day we received an alarm:

Failure detected on Internal Dual SD Module SD2

We’d never seen a failure like this, so we had no idea how to fix it, and the Internet was only slightly helpful (hence this writeup). Here’s what we did to replace it.

Note: I’m certified to work on Dell servers, and have been messing with hardware for 25 years. To me this is a real easy fix, but you should do what you’re comfortable with. I always suggest you manage possible static electricity by ensuring you’re at the same electrical potential as the server. Touching the metal portions of the case works well for this, which you clearly have to do to get inside a PowerEdge. Just remember to do it again if you wander off and come back. This is also good advice for fueling your car in dry climates. :)

1. First, the SD cards themselves are not covered under Dell warranty, so we bought new 4 GB cards (the smallest we could find) from the nearest place that sold them (try your local Walgreens or CVS). The SD card from Dell was a Kingston, so we chose that brand. Total expenditure was $10 for two cards. Why two? The staff time & fuel cost more than the part, so stocking a spare makes sense. Plus, if one has failed I suspect I’ll see another failure. After all, I do have a couple hundred of these things.

2. Second, we shut the host down, unplugged it, and found the SD card module using the map under the Dell server cover. On our PowerEdge R720 it was below the PCIe riser closest to the power supplies. On blades it’s out the back, labeled “SD,” and you just have to pull the blade out to get to it.

The Dell IDSDM whitepaper indicates that, because of the way the module is powered, you should always do this work with the AC disconnected.

3. We took the expansion cards out of that PCIe riser and noted which one was in what slot (top vs. bottom). Then we gently removed the PCIe riser itself. Last, the IDSDM has a little blue strap to help you pull it straight up and out of the socket.

4. The error in the system event log indicated that SD2 was faulty. But now that we’ve got the thing in our hands, which card is SD2? Turns out that on 12G PowerEdge IDSDMs there’s an activity light on each side, one labeled SD1_LED and the other SD2_LED. Your mileage here will vary — 11G servers had the slots labeled, and I haven’t looked at a 13G IDSDM yet. Use your head.


5. The SD card locks in, so you need to push it in to eject it. We took the 2 GB card out, put our new 4 GB card in, and put everything back together.

6. When the server boots it’ll ask you what you want to do about rebuilding the mirror. If you have F1/F2 prompts disabled in the BIOS you’ll have 10 seconds to answer before the boot continues without a rebuild.


For us it took about 5 minutes to resilver the mirror, then the boot process continued into ESXi. In keeping with good security practice I put the old SD card through a shredder.

Table Stakes for Storage Arrays

What 1987 Storage Looked Like (Micropolis 760 MB SCSI HDD)

I was just looking at Andreas Lesslhumer’s post about blog posting volume in the virtualization community, and it’s depressing. I didn’t blog a whole lot here last year. Why was that? Because I was writing elsewhere!

Speaking of that, the first half of my “Six Features You Absolutely Need on Your Storage in 2015” list is up over at The Virtualization Practice, wherein I outline what the table stakes are for enterprise storage arrays, get only slightly snarky about why we’re still discussing, as an industry, why & how to use flash, and highlight the good work some vendors are doing (SolidFire, Dell, and Tintri in this post, more in next week’s second part). Check it out.

You Cannot Use open-vm-tools to Customize VMs

Homer Simpson: Kids, there’s three ways to do things: the right way, the wrong way, and the Max Power way!
Bart: Isn’t that the wrong way?
Homer Simpson: Yeah, but faster!

My biggest pet peeve with open source is that projects don’t ever solve whole problems. They get 60% of the way to solving a whole problem and then run off to chase another squirrel.

The most recent example of this is VMware’s recommendation to use the open-vm-tools packages that ship with modern distributions of Linux. Dumbest recommendation ever. Why? Because the project got to 60% of the solution and stopped, effectively solving no problems for anybody. From what appears to be a VMware employee on the open-vm-tools mailing list archives:

> On Ubuntu it is very easy to install the open-vm-tools, so I wonder
> how these differ from the VMware provided? Do they have the same
> features?

There are a few features missing in open-vm-tools, e.g. DeployPkg (for guest customization), Unity etc. Otherwise, the two are pretty much the same.

Um, no, they’re decidedly not the same. With open-vm-tools, ongoing maintenance of the Tools is much easier, but you won’t get a working VM when you clone a template, deploy with vRA/vCAC/vCD, etc. Turns out that people like having working VMs, with the right IPs and whatnot. :)

With legacy VMware Tools, updating is mostly broken, but we’ve been working around that for a decade so it’s figured out. And you get working VMs from a deployment! Snazzy.

DeployPkg is critical in private cloud environments, and to pretty much everything VMware is promoting lately. Could someone add it and make life easier for those of us out here actually using this stuff? Please? And maybe stop recommending this unfinished software until then?

CentOS 7 Refusing VMware vSphere Guest OS Customizations

So I just spent two hours of my life trying to get my CentOS 7 VM template to deploy correctly with a vSphere customization specification. No matter what I did it would customize the VM, then uncustomize it, essentially leaving me with the template again. I finally asked our oracle and savior, Google, and two amazing things occurred.

First, I found the answer. About three weeks ago a fellow named Jeff Burns asked this same question on Server Fault, then answered his own question five minutes later (this is often what happens to me immediately upon filing a support case). He built on something I’d seen in /var/log/vmware-imc/toolsDeployPkg.log, where the VMware Tools couldn’t figure out what the OS was and would abort. In short, you need to make /etc/redhat-release say:

Red Hat Enterprise Linux Server release 7.0 (Maipo)

Indeed, it works, though I’m guessing someone at VMware will fix this and we’ll have to decide which string we stick with. :)
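If you’d rather bake the workaround into the template itself, here’s a minimal sketch of what I mean: it just rewrites /etc/redhat-release with the string above. Run it as root inside the template before shutting it down; note that on CentOS 7 the file may be a symlink.

# Workaround sketch: make a CentOS 7 template identify itself as RHEL 7 so
# vSphere guest customization (DeployPkg) recognizes the OS.
import os
import shutil

RELEASE_FILE = "/etc/redhat-release"
WORKAROUND = "Red Hat Enterprise Linux Server release 7.0 (Maipo)\n"

if os.path.islink(RELEASE_FILE):
    os.remove(RELEASE_FILE)      # replace the link so we don't clobber /etc/centos-release
elif os.path.exists(RELEASE_FILE):
    shutil.copy2(RELEASE_FILE, RELEASE_FILE + ".orig")   # keep a copy to revert later

with open(RELEASE_FILE, "w") as f:
    f.write(WORKAROUND)

Once VMware fixes the OS detection you can put the original file (or symlink) back and forget this ever happened.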

Second, as part of his question Jeff linked to my post about Preparing Linux Template VMs. Thank you Jeff! That’s really cool. In fact, I’ve just updated that post with new thoughts on cleaning up those VM templates so they’re as small as possible, based on my ongoing experiences. Go check it out.

Apple Lawsuit Over iOS Advertised Capacity

In case you hadn’t seen it, Apple is being sued over the fact that a 16 GB iOS device does not have 16 GB of space usable on it. The Verge has a good story on it; the link is below.

In contrast, Macworld’s Susie Ochs has published a whiny, elitist article entitled “Apple faces dumb lawsuit over the size of iOS 8.” This link is also below if you’d like to witness the cesspool that Macworld has become.

I don’t think the lawsuit is dumb at all. On one hand, computers have never included the space consumed by the OS when listing their storage capacities. Consider that an OS installed on a PC stays fairly static over the life of the PC. My mother’s computer’s copy of Windows 7 will be the OS on there until the PC is retired, and I’m comfortable generalizing that to most consumers. Given Microsoft support policies she won’t be forced to update to a new major OS release; rather, that upgrade will come with the purchase of a new PC. There will be some space usage fluctuation as patches are installed, but since local disks on PCs are generally measured in terabytes, or at least three-digit gigabytes, it’s not tight. Assuming 15 GB for Windows and a 1 TB drive, the consumer loses about 1.5% of the capacity to the OS.

Apple, on the other hand, doesn’t release security updates for older iOS versions. They also don’t guarantee backwards compatibility for applications. Apps always get the latest updates, and because there is no backwards compatibility, and no way for a consumer to downrev an app that was mistakenly updated, the owner of an iOS device is forced into running the latest major iOS version if they’d like their apps to work or their data to be secure.

I’ll repeat that, because it’s the crux of my whole argument. If you own an iOS device you must keep it at the latest iOS level to be secure. There are no security updates for older iOS versions.

A 16 GB iPhone 5S has 13.1 GB of usable space on it under iOS 8, representing a loss of 18% of the advertised capacity immediately out of the box.

Add to that the fact that iOS 8 required 5.7 GB of free space to install for an over-the-air update. It needs roughly half that if I connect it to iTunes. We’ll give Apple the benefit of the doubt here and say that to continue getting updates to my phone, via iTunes, I need 2.8 GB free. Now I’m down to 10.3 GB usable, or a 35.6% loss of advertised capacity.

Last, the nebulous “Other” data class that seems to crop up on iOS devices:

iTunes Display of Space Usage on my iPhone

What is this stuff? App data gets associated with the “Apps” class (when I copy a 1 GB movie to VLC on my iPhone the Apps class grows by 1 GB). There’s nothing in Settings->General->Usage that indicates what “Other” is, either. I can get rid of it for a while by wiping my phone, but then it grows back. In my case it’s consuming 1.5 GB of space. Now I’ve got 8.8 GB usable space, which is a 45% loss of advertised capacity. If I want to do over-the-air updates (which I do), I have 5.9 GB usable, for a 63% loss.

That’s complete bullshit, especially since I have to stay current with iOS.
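If you want to check that math, or plug in your own device’s numbers, the arithmetic is simple (the GB figures are the ones from my phone, above):

# Advertised vs. usable space on my 16 GB iPhone 5S, per the figures above.
advertised  = 16.0   # GB on the box
usable      = 13.1   # GB left once iOS 8 is installed
itunes_free = 2.8    # GB that must stay free for tethered (iTunes) updates
ota_free    = 5.7    # GB that must stay free for over-the-air updates
other       = 1.5    # the mystery "Other" class on my phone

def loss(remaining_gb):
    return (advertised - remaining_gb) / advertised * 100

print(f"Out of the box:           {loss(usable):.0f}%")                        # 18%
print(f"Plus iTunes update space: {loss(usable - itunes_free):.1f}%")          # 35.6%
print(f"Plus 'Other':             {loss(usable - itunes_free - other):.0f}%")  # 45%
print(f"OTA updates instead:      {loss(usable - ota_free - other):.0f}%")     # 63%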

Ever since I found myself clearing space on my iOS devices for iOS 8 I’ve thought that Apple really should start treating these things as devices with firmware, more like a Blu-ray player than a computer. My Blu-ray player has separate storage for its over-the-air updates, and conducts them seamlessly after I agree to proceed. Why doesn’t Apple have separate storage for their OS and for updates? Even the addition of a hidden 8 GB system storage volume, to hold iOS updates and “Other,” would do wonders in making iOS devices more supportable, more user-friendly, and more likely to actually get the updates Apple pushes. Maybe this lawsuit will force the issue.


Tech Bloggers: Punctuation Goes Inside Quotation Marks

One of the biggest differences between writing code for machines and writing English-language text for humans in the United States is the use of quotation marks. When you’re programming a computer, a set of double quotation marks indicates a string, which is an atomic entity. As such, punctuation goes outside the quotes to delimit lists and whatnot.

#include <string>
// The commas delimiting the list stay outside the quoted strings.
std::string animals[4] = {"Goat", "Sheep", "Cow", "Platypus"};

This is not how it works when you’re writing in the English language. Periods and commas always go inside the double quotation marks in English.

Incorrectly punctuated sarcasm: We all know how that piece of software “works”.
Correctly punctuated sarcasm: We all know how that piece of software “works.”

Incorrect: Her number is “867-5309”.
Correct: Her number is “867-5309.”

Question and exclamation marks go inside the double quotation marks, too.

Incorrect: “What the hell are you doing to that hard disk”? he said.
Correct: “What the hell are you doing to that hard disk?” he said.

Incorrect: “Don’t press the EPO button in the data center”! he screamed.
Correct: “Don’t press the EPO button in the data center!” he screamed.

When attribution follows the quote, use a comma inside the quotes instead of a period:

Incorrect: “I simply asked him why he was in his underwear and there was 1000 feet of unspooled fiber on the floor”, Joe said quietly to the HR representative.
Correct: “I simply asked him why he was in his underwear and there was 1000 feet of unspooled fiber on the floor,” Joe said quietly to the HR representative.

Yes, this presents problems sometimes. Sometimes you want to denote that something is a string, or spelled a particular way, and you don’t want the reader to think the punctuation is part of it. For me this usually happens with server names.

Grammatically incorrect: We need to fix the fans on servers “esx-1-goat-4”, “esx-1-goat-10”, and “hosebeast”.

You have several ways to tackle this without running afoul of the pedantic grammar Nazis.

Capital letters: We need to fix the fans on servers ESX-1-GOAT-4, ESX-1-GOAT-10, and HOSEBEAST.
Italics: We need to fix the fans on servers esx-1-goat-4, esx-1-goat-10, and hosebeast.
Bold: We need to fix the fans on servers esx-1-goat-4, esx-1-goat-10, and hosebeast.

A list works, too.

We need to fix the fans on the following servers:

  • esx-1-goat-4
  • esx-1-goat-10
  • hosebeast

My favorite is caps. Easy to read, gets the point across, looks like a normal paragraph. Be creative, but ask yourself how you can write it so your meaning is crystal clear to the reader.

In conclusion, y’all are really freaking smart, yet I see you all making this mistake all the time. English is just another programming language, except you’re getting humans to do your bidding instead of a machine. Consider this a compiler warning. :)

P.S. I explicitly mention the English language & U.S.A. just because I don’t know the rules in other languages. I’m guessing that in many cases the rules are the same. Leave a comment if they aren’t (or if I’ve missed something here, like always).

P.P.S. I’ve been informed — several times — that “proper” English, aka that which is spoken in the United Kingdom, has this all backwards, including using single quotes where Americans use double and vice-versa, putting all manner of punctuation outside the quotes, etc. Of course, the Americans are the ones with it backwards. :) Do what you want but in the U.S.A. it’s a grammatical error.

Minimum Vacation

Sysadmin1138 has a post today on minimum vacation policies, an interesting twist on the unlimited vacation policies many startups now have:

The idea seems to be a melding of the best parts of unlimited and max. Employees are required to take a certain number of days off a year, and those days have to be full-disconnect days in which no checking in on work is done. Instead of using scarcity to urge people to take real vacations, it explicitly states you will take these days and you will not do any work on them.

Sysadmin1138 expounds on several ways this is a cool idea. I agree. There are real benefits to forcing employees to go (and stay) completely away for a moderate amount of time. In fact, financial institutions usually have a mandatory absence policy as part of their security program. The United States’ Federal Deposit Insurance Corporation (FDIC) encourages institutions to require no less than two continuous weeks of vacation for all employees:

During this time, their duties and responsibilities should be assumed by other employees. This basic control has proven to be an effective internal safeguard in preventing fraud. In addition, such a policy is viewed as a benefit to the well-being of the employees and can be a valuable aid to the institution’s overall training program.

If someone is out of the office for two weeks there is a greater likelihood that fraudulent activity will become visible. Less maliciously, when someone is gone and unreachable, holes in training and documentation become pretty clear.

The Achilles’ heel of all these new-fangled vacation policies is that they need managers and company leadership to be results-oriented. That means corporate middle management actually has to manage people, might have to have some difficult conversations about performance, might have to occasionally deny vacation based on performance, and will definitely have to stop treating employees like kindergarteners. Just because Alice took 22 vacation days doesn’t mean Bob gets to take 22 days if he’s not meeting expectations.

Of course, it’s hard to tell Bob that, especially if management doesn’t have a good framework for setting goals and expectations in the first place. And creating that framework sounds like work… isn’t it easier to just go ahead and do things the way they’ve always been done, and leave these strange vacation pipe dreams to the Californians? And where is your TPS report?

A New Hope^H^H^H^HLook

Once in a while you’ve got to pull the trigger and actually ship some code.

It’s been seven years (!!!!) since I did anything serious with the way this blog looked and worked. There were plugins that weren’t supported anymore. The old theme had been so extensively customized by me that it wasn’t upgradable, and didn’t really work well with new functionality or WordPress releases. A lot of the new functionality duplicated what I’d hacked into the theme, too: stuff like Google +1, sharing, etc. Plus I wanted SSL (and not just CloudFlare’s poser Google-juice SSL crap; I wanted the actual security).

I started redoing the site six months ago, where “redoing” meant the cloud-esque “completely starting over.” The synchronization of content between my development site and the live site was pretty involved, and I found myself avoiding writing because I knew it’d just be more to synchronize later. Realistically a few more posts wouldn’t have mattered much, but it became a serious mental obstacle to writing, on all fronts.

Then came some asshats hammering the old blog’s XML-RPC interface, trying to break in. If you’ve tried to get to this site in the last couple of weeks and found it down, it’s because those idiots went from break-in attempts to a DoS, essentially causing Apache to use too much memory. The cognitively disabled Linux out-of-memory killer then kicked in, nuking the VM. Thanks, OOM Killer, you’ve been very helpful. Double thanks, jackhole in the Netherlands. May you rot in hell.

Yes, I could have tuned Apache and Linux better, but the new site is nginx on CentOS 7 and I didn’t want to waste more time on old, tired web servers. So screw it, I pulled the trigger. This battle station is fully operational. Let’s do this.