CODE Keyboard

“You spent $150 on a keyboard?” – My wife

There are two kinds of people in technology: those with an opinion about their keyboard, and everybody else. I happen to be one of the first.

Buckling Spring image courtesy of Wikipedia.

I grew up using the IBM Model F and M keyboards. They have a spring in the key switches that buckles as you press down. That gives you two things: a prominent clicking sound from the keypress, and solid tactile feedback from the key. You definitely know when that key switch actuated.

Years ago I had to give up my Model M keyboards. They’re built to last but it was getting harder to find working ones, it was getting inconvenient to adapt them to USB from PS/2, and a case of carpal tunnel made it painful to use a keyboard that required a decent amount of force to type. This also pleased my coworkers, who didn’t particularly like the stream of loud clicking when I was in the office. And so I settled on a series of Dell keyboards, mostly because we had some sitting around. The multimedia controls on the newer Dell Business keyboards are nice, and I’ve been using those for a while now.

“Does it do cool things?” – My six year old daughter

In a few weeks I’m not going to have coworkers within 50 feet of me, and my old keyboards are getting a little, well, old. So I thought I’d treat myself to a new keyboard. Over the last couple years I’ve been lurking in the community around keyboards, marveling at the incredible love that people pour into the devices at their fingertips. In particular, Massdrop has quite the stream of interesting keyboards and customizations, many available for purchase. There are cheaper options there, but I don’t like ground-effect lighting for my keyboard enough to spend $500.

Turns out you can buy a faithful clone of the IBM Model M from Unicomp, but I think I’m past the mega-clicky stage of my life. I don’t want people to hear all that when I’m on the phone. So after looking around I decided on a 104-key CODE Keyboard, which is a collaboration between Jeff Atwood of Stack Overflow fame and WASD Keyboards. You can choose the switches that are in it so you get exactly what you want for noise, feel, and actuation pressure. The keys have backlighting, which is great. The keyboard weighs a couple pounds, so you can defend your home office with it if you need to, and it has big patches of rubber underneath so it does not move. It’s got a standard USB cable (micro to A), so you can replace it or customize it, and a bunch of routing options underneath. And best of all, it’s simple & clean.

It’s got six DIP switches on the back to customize it if you are a Mac, Windows, or UNIX person (if you’re used to a Sun keyboard that swapped Ctrl and Caps Lock). I flipped the sixth switch so that the keyboard Function key can do the multimedia controls (versus an OS “menu” key). If you want to customize it further you can just order a WASD v2 keyboard and customize it fully, from a variety of languages and layouts to what color each key is. I liked the compromises and the LED backlighting in the CODE model, but I can order new keycaps in the future if I want.

“I AM A BAT. I FLY.” – My three year old son, unfazed by a new keyboard

Best of all, I was looking for a reason to try it out, so I wrote this. It’s definitely a different feel than my old keyboard, but that’s what I wanted. I like it so far. At the beginning here I was doing a lot of double capitalization (WRiting THings Like THis), but 600 words in that seems to have cleared up. I think this keyboard and I might get along just fine.

Now I need to find an amazing mouse to go with it. Thoughts?

Joining VMware

“We changed again, and yet again, and it was now too late and too far to go back, and I went on. And the mists had all solemnly risen now, and the world lay spread before me.” – Pip, Great Expectations

Growing up the son of a firefighter and homemaker, I was fortunate to have been given the opportunity to go to college so many years ago. So in the autumn of the release of Windows 95 I left my childhood home to go to school at the University of Wisconsin – Madison. At four hours by car the UW was far enough away from my parents that they wouldn’t stop in randomly, but it was close enough that I could go home easily. I never really went home, though. Sure, I’d go visit, but my home became Madison, and I dug in. And while my parents helped with my tuition, room & board was solely my responsibility. I got a job, hired at the UW-Madison Help Desk to do phone support for the dial-in modem pool.

Information Technology wasn’t a career path when I was in high school, at least according to the school guidance counselor who told me I was going to be a chemical engineer, and that was that. All engineering students go through the first sets of classes together, though, and along the way I heard about Electrical & Computer Engineering. Took me about 12 seconds to switch. The grass is always greener, it seems, and it didn’t take long for me to figure out that I liked the software side more than hardware. The overlap with Computer Science seemed a natural path.

Fast-forward a few years. I’d been promoted out of the Help Desk. I was running giant AIX systems for our PeopleSoft implementations, and I was wondering what was next in my life. The work I was doing was so much more interesting than school, and it was the path I wanted to be on. I liked the UW, I had lots of friends there, and the people I was working with and for had interesting problems to solve. Above all, it was safe and familiar. My father died in 2001 and that left me adrift and with a case of PTSD, so when the UW offered me a real job, with real pay and real benefits, I signed on.

23 years later I’ve been fortunate to have worked with some of the brightest (and interestingly enough, fastest and strongest) folks around. I’ve been able to reinvent my job a few times, as new technology comes along to reshape the landscape. Landscaping in higher education involves a lot of hard work, overcoming inertia of silos, culture, and incredible fear of change. It requires immense amounts of patience. It has worn on me, as I’d seen my father’s job as a first responder wear on him, turning us into sarcastic, bitter, angry people. I grew more and more like the mythical Sisyphus, destined to roll rocks up hills as punishment for offending self-appointed gods in non-specific ways.

I’ve been thinking about moving on for a while now. I don’t want to turn into my father, and I cannot keep rolling the rock uphill for 20 more years. I’ve talked to a number of friends who have made the leap to vendors, all of whom told me, nicely, to shut up and do it. I clearly enjoy technology, but I also enjoy speaking and writing about it to help others understand more. I’ve been active in the VMware community for years. With all of that I’ve been envious of the work the VMware Technical Marketing folks do in all these spaces, getting paid to do the things I basically do as a hobby.

With two small children I’ve been hesitant to take a position with a lot of travel, though, and I’m very fortunate to be in a spot where I could take some time to make sure where I was going is a very good fit. That said, it took almost no time for me to respond when I was asked to consider applying for a position at VMware, in the Cloud & Platform Business Unit’s Technical Marketing group. I am the secret Mike Foley’s been dying to reveal on Twitter, and I’m very excited to work with him, Adam Eckerle, Niels Hagoort (who just joined as well) and all the others that produce such great content and understanding for VMware customers.

I start at VMware in early December and for the first time in a long time I feel again like Pip in that quote above, excited and nervous at the possibilities that lie before me.

Fixing X11 Forwarding Over SSH and with Sudo

X11 forwarding over SSH not working? Not setting $DISPLAY correctly in your shell? Having problems with X11 and sudo? Yeah, me too. Total pain in the duff. Here’s what I do to fix it. I’m thinking about Linux when I write stuff like this but a lot of this has worked on AIX and Solaris, too.

  • Make sure your SSH client supports X11 Forwarding and that it’s turned on. I use SecureCRT but I know it works in PuTTY as well. Once you turn it on in your client & save the settings you will need to reconnect, because the forwarding is established when the connection is made.
  • Ensure xauth and xterm are installed. You need xauth for this to work, and xterm is a lightweight way to troubleshoot this stuff (just run “xterm” at a shell prompt and a window should pop open).
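  • If you are using a command-line client, or forwarding across multiple hosts, is X11 forwarding enabled in your ~/.ssh/config file? Add “ForwardX11 yes” to it. (“ForwardAgent yes” is a separate setting that forwards your SSH agent, which is handy when hopping across multiple hosts but isn’t what carries X11.) You can also force it with “ssh -X user@host” when you connect.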
  • Do you have an X Windows server running on your desktop PC? I use Windows on my desktop and I use VcXsrv. Make sure it’s started and running. When VcXsrv asks me how I want to run it I always choose “Multiple windows,” set the display number to -1 to let it choose, and start no client. You can futz with the rest once you know it’s working.
  • Is your $DISPLAY variable being set but you get errors? If so, that’s usually not forwarding, that’s something on your PC. Check your $DISPLAY with “echo $DISPLAY” at a prompt. It should have something in it like “localhost:10.0” or “localhost:13.0” or so. Does your X Windows server software (VcXsrv) have permissions? If so, set them wide open (allow all hosts to connect).
  • On your SSH server do you have “X11Forwarding yes” and “AllowAgentForwarding yes” in sshd_config? If it’s commented out uncomment it and restart the SSH daemon (“service sshd restart” works on a lot of distros).
  • Is your home directory writable? When you log in it’ll need to create an ~/.Xauthority file and if it cannot do that you’ll have problems.
  • Is your ~/.ssh directory writable, with the correct permissions? It should be owned by your user and chmod 700. Things in it should be chmod 600.
  • Is there an old ~/.Xauthority file sitting there? Try removing it and logging in again.
  • Did you disable IPv6? If you run “sysctl net.ipv6.conf.all.disable_ipv6” and it comes back as 1, or “lsmod | grep ipv6” shows nothing you might have IPv6 disabled. Turns out OpenSSH hates that and has a very passive-aggressive way of showing it. Add “AddressFamily inet” to your sshd_config and restart the daemon. That forces it back to IPv4 only.
  • Are you trying to run something as root using sudo or su? Getting “X11 connection rejected because of wrong authentication”? That gets funky because of permissions with xauth. There are lots of tricky fixes with xauth but I’ve just found copying my .Xauthority file to my target user works great. Then you can “sudo xterm” with impunity. You might try avoiding “sudo su -” as the hyphen wipes your environment out, and along with it your $DISPLAY. Just try “sudo -u targetusername command” instead.
    sudo cp ~plankers/.Xauthority ~root/.Xauthority
  • If you’ve gotten this far and you’re still not able to run ‘xterm’ and have it pop a window open I’m surprised. Try SSHing with debugging on, “ssh -v -X user@host”, and see if it tells you what’s wrong. Add more “v” to increase the debugging level, like “ssh -vv -X user@host”.
  • What do the logs say when you connect to the server? A lot of times when there’s something wrong it’ll put something in the logs about what it is.
  • Absolute vanilla installs of Linux distributions usually work fine. As a last resort try a VM running a stock installation of something like Ubuntu and see what happens.
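A few of the checks above are easy to script if you find yourself doing this a lot. What follows is only a rough sketch in Python, not a complete diagnostic: the function names (parse_display, check_x11_basics) are made up for illustration, and it only looks at $DISPLAY, whether xauth and xterm are on the PATH, and whether the home directory is writable. Run it on the remote host after you SSH in.

```python
import os
import re
import shutil
from pathlib import Path


def parse_display(value):
    """Split a DISPLAY value like 'localhost:10.0' into (host, display, screen).

    Returns None when the value is empty or does not look like host:display[.screen].
    """
    match = re.fullmatch(r"([^:]*):(\d+)(?:\.(\d+))?", value or "")
    if match is None:
        return None
    host, display, screen = match.groups()
    return host, int(display), int(screen or 0)


def check_x11_basics(env=None, home=None):
    """Return a list of problems found; an empty list means the basics look fine."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    problems = []

    # SSH sets DISPLAY to something like localhost:10.0 when forwarding is active.
    if parse_display(env.get("DISPLAY")) is None:
        problems.append("DISPLAY is unset or malformed")

    # xauth is required for forwarding; xterm is just a handy test client.
    for tool in ("xauth", "xterm"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} is not on PATH")

    # ~/.Xauthority has to be created at login, so the home directory must be writable.
    if not os.access(home, os.W_OK):
        problems.append(f"{home} is not writable, so ~/.Xauthority cannot be created")

    return problems


if __name__ == "__main__":
    for problem in check_x11_basics():
        print("PROBLEM:", problem)
```

An empty result doesn’t prove forwarding works, it just rules out the easy stuff before you start digging through sshd_config and the logs.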

Good luck! I hope at least some of this helps.

Fixing Veeam Backup & Replication Proxy Install Errors

Every once in a while I struggle a little to add a new Veeam Backup & Replication hot-add proxy. If you’re like me and seeing proxy install errors maybe some of these will fix you up. This is what worked for me on Windows Server 2016 when I was getting error 0x00000057, “Failed to create persistent connection to ADMIN$” and some other unhelpful messages.

If you’re using a hardened Windows installation all bets are off, since the goal of hardening is to intentionally disrupt remote access. I’d get it running with as close to a stock Windows installation as possible and then work from there if you need to secure things further. There are also ways to manually install the Veeam Transport Service that might be more helpful.

You might want to consider taking a snapshot before this work, so when you discover what fixes the problem you can revert the snapshot and just implement the fix cleanly.

  1. First, try specifying the username as the full “DOMAIN\Username” format when you add it to the Backup & Replication console. Don’t use the “.\username” format and don’t omit the domain part itself. If you are using local accounts you’ll want to specify “SERVERNAME\username” instead, using what the proxy knows as its name. This alone fixes 90% of the issues I’ve seen.
  2. If you aren’t using the Administrator account (and it’s a good idea not to) does the account you want to use have Administrator rights on the proxy VM, and the correct password? I sometimes forget to add the domain service account I created to the local administrators group.
  3. Check to see if you can reach the administrative shares on the proxy VM. Do this from the Backup & Replication main backup server itself by browsing to \\COMPUTERNAME\Admin$ using the credentials you’re going to use for Veeam. This may mean you need to use “net use” to map it so you can specify a different username. If that works you should see the Windows directory on the remote computer.
  4. Didn’t work? Is the firewall enabled? For troubleshooting try adding an explicit “allow any” rule for all traffic to & from the backup server. If that makes browsing to Admin$ work then make sure you have rules to permit traffic between the proxy and the other proxies, and the proxy and the main backup server. Note that you can test this by just shutting the firewall off, but don’t do that unless you’re protected in some other way (hardware firewall, etc.).
  5. If the firewall is disabled and you still cannot browse, can the backup server ping the proxy? Is there another firewall between them that’s denying traffic?
  6. If the firewall is disabled, they can ping each other, and you still cannot browse, have you disabled remote UAC on the proxy VM? Open an administrator-level command prompt and run:
    reg add HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\system /v LocalAccountTokenFilterPolicy /t REG_DWORD /d 1 /f

    Reboot the proxy VM and try again. At this point you probably can browse to Admin$, and you should take a moment to make sure your firewall is on and everything is secured again. If you still can’t get in I’d look at more fundamental issues, like time synchronization and DNS.
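For the firewall steps above, a quick way to separate “the network is blocking me” from “Windows is rejecting me” is to test the SMB port directly. Administrative shares like ADMIN$ are served over SMB, which uses TCP port 445. Here is a minimal sketch in Python, assuming you have Python available on the backup server; the function name can_reach_tcp is my own invention for illustration.

```python
import socket


def can_reach_tcp(host, port=445, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout.

    Port 445 is SMB, which is what browsing to ADMIN$ relies on.
    Name resolution failures, refusals, and timeouts all count as unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Call it with whatever name the backup server uses for the proxy, for example can_reach_tcp("PROXY01") (a placeholder name). If it returns False, look at firewalls, routing, and DNS. If it returns True and you still cannot browse to Admin$, the problem is more likely credentials or remote UAC than the network.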

Good luck!

vSphere 6.7 Will Not Run In My Lab: A Parable

“Hey Bob, I tried installing vSphere 6.7 on my lab servers and it doesn’t work right. You tried using it yet? Been beating my head against a wall here.”

“Yeah, I really like it. A lot. Like, resisting the urge to be irresponsible and upgrade everything. What are your lab servers?”

I knew what he was going to say before he said it. “Dell PowerEdge R610s.” I was actually surprised it was that new, and rack-mountable.

“Yeah, you’re out of luck. CPUs before the E3/E5/E7 family didn’t have VT-x extensions in them to make virtualization easy so VMware had to do this thing called binary translation. vSphere 6.5 was the last release that they supported that on because, frankly, it’s slow and everything associated with that technique is getting really old.”

“What the hell? You’d think they’d tell people about that!”

“What, an obscure KB article with absolutely no practical information in it and a reference in the 6.5 release notes to said obscure KB article didn’t catch your eye?” I said, dripping with sarcasm. “I think there was a warning that flashed on the console of affected hardware when you booted, too, but to be honest I only know that because someone mentioned it, I’ve never seen it myself.”

“That’s total crap, like anybody looks at the console. So now what am I going to do? All my gear doesn’t work.”

“One might argue it works just fine. 6.5 will be supported until November of 2021, you could stay on that. You could run 6.7 nested inside 6.5. I know this is a terrifying thought but you could buy some new equipment, too, something that was on an HCL this decade. Given the current generations of CPUs you’d probably be able to cut your VMware licensing in half while doubling your performance. Stick it to the man, or something.”

“Ha! Somehow I doubt my six licenses would attract their attention. I think I’d need four anyhow for vSAN. Maybe I’ll try the nested thing. Thanks man.”

As a side note to my parable here, if you’re thinking about this and have some time before you have to refresh your hardware it’s worth waiting to see how all this Spectre/Meltdown stuff turns out. None of the junk the Ferengi at Intel are shipping today is secure, at any level, especially given the latest wave of vulnerability disclosures. AMD might turn out to be a good play moving forward, too, if they’re not in exactly the same spot because they blindly copied everything from Intel. The SSD shortages are subsiding so you don’t have to plan 60 or 90 days out anymore. Time will tell, so take some time if you can.

Midnight is a Confusing Choice for Scheduling

Midnight is a poor choice for scheduling anything. Midnight belongs to tomorrow. It’s 0000 on the clock, which is the beginning of the next day. That’s not how humans think, though, because tomorrow is after we wake up!

A great example is a statement like “proposals are due by midnight on April 15.”

What you actually said: proposals aren’t welcome after April 14.
What you probably meant: you want the proposals before the date is April 16.

There’s a 24 hour difference there, and if you enforce the deadline accurately people are going to complain because they were all thinking the second thing (before April 16).
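If you want to see that gap in black and white, here’s a small sketch in Python. The year 2018 and the 3 PM submission time are assumptions I picked purely for illustration; the point is only that a literal “midnight on April 15” is 0000 at the very start of that day.

```python
from datetime import datetime, timedelta

# What was literally said: midnight on April 15 is 0000, the first instant of that day.
literal_deadline = datetime(2018, 4, 15, 0, 0)

# What most readers assume: they have until April 15 is over.
intended_deadline = datetime(2018, 4, 16, 0, 0)

gap = intended_deadline - literal_deadline
print(gap)  # 1 day, 0:00:00

# A proposal sent mid-afternoon on April 15 is late under one reading and fine under the other.
submitted = datetime(2018, 4, 15, 15, 0)
print(submitted <= literal_deadline)   # False
print(submitted <= intended_deadline)  # True
```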

Similarly, this is a problem in change notices and customer communications. When you say there’s an outage scheduled for midnight there’s a very good chance someone will misunderstand when that is. Being wrong by an hour in the middle of the night isn’t so bad. Being wrong by 24 hours gets people riled up and you have enough problems as it is.

The second issue with midnight is when folks represent it as 12:00 AM. When you’re moving fast, as many people are, it’s easy to confuse with noon. Even worse when people mess up and write it as 12:00 PM, because in their head midnight is night which is PM. Except, of course, it isn’t.

Last, midnight is a popular time to schedule automated processes. I get it, it’s easy. If you run something at midnight you don’t have to do much processing to separate yesterday from today. The problem is that there’s a ton of stuff already running on the hour, and you’re just piling on. Most people try to avoid shopping when it’s crazy busy, why would you want to run your jobs that way? If you run your job a bit earlier or later chances are it’ll run faster because you’re not competing with everyone else.

So instead of midnight, what?


1. If you care about time then act like you care about time and write your jobs the right way. Or, decide you don’t care about time so much and put a random sleep in them. Jobs don’t have to sleep long, just enough to avoid parts of the hour that end in :00 and :30.

2. Be strict about how you write your times. Write the date in the ISO 8601 format to help avoid global formatting issues (YYYY-MM-DDThh:mmTZD). Mind daylight saving time when you add the time zone (-0500 vs -0600, etc.). Don’t be afraid to spell it out in two ways, ISO and how a non-technical reader would want to see it:

“2018-04-05T23:00-07:00 (11 PM Pacific Daylight Time on April 5, 2018).”

3. Don’t schedule things at midnight or noon. Chances are that if you’re scheduling something you could move it to avoid the issue. Deadlines could move to 2200 or 0600 without too much inconvenience, drastically reducing the potential for confusion. Scheduled work could be 2330 (and if you needed to wait until 0000 just adjust the length of the maintenance window). Even if you’re simply telling someone else that something is going to happen, pick a different time that’s clearly inside a specific day.
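The first two suggestions are easy to sketch in code. This is a minimal Python example, not a finished tool: the function names (describe, start_delay_seconds) and the 1 to 15 minute delay range are my own assumptions. It uses a fixed -07:00 offset labeled PDT to reproduce the example above; a real job would more likely use a time zone database such as Python’s zoneinfo module so that the daylight saving offset is chosen for you.

```python
import random
from datetime import datetime, timedelta, timezone


def describe(when):
    """Render a timezone-aware datetime as ISO 8601 plus a plain-language form."""
    iso = when.isoformat(timespec="minutes")
    hour12 = when.hour % 12 or 12  # 0 and 12 both display as 12
    meridiem = "AM" if when.hour < 12 else "PM"
    friendly = (
        f"{hour12}:{when.minute:02d} {meridiem} {when.tzname()} "
        f"on {when:%B} {when.day}, {when.year}"
    )
    return f"{iso} ({friendly})"


def start_delay_seconds(min_seconds=60, max_seconds=900):
    """Pick a random delay so a job starts a little after the hour, not exactly on it."""
    return random.randint(min_seconds, max_seconds)


# Fixed offset for illustration only: Pacific Daylight Time is UTC-07:00.
pdt = timezone(timedelta(hours=-7), "PDT")
print(describe(datetime(2018, 4, 5, 23, 0, tzinfo=pdt)))
# 2018-04-05T23:00-07:00 (11:00 PM PDT on April 5, 2018)

print("sleep this long before starting work:", start_delay_seconds(), "seconds")
```

A job that doesn’t care about exact timing would pass the result of start_delay_seconds() to time.sleep() before doing anything else, which keeps it away from the :00 pileup.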

Time notation drives everybody crazy — look up some of the holy wars around server clocks set to UTC/GMT vs local time. Communication is hard, too, especially conveying technical topics to non-technical people. Let’s be mindful of these tricky spots and work to reduce confusion where we can. That way, instead of ridiculous & angry conversations about definitions of midnight we can have meaningful & clear conversations about the work itself.

No VMware NSX Hardware Gateway Support for Cisco

I find it interesting, as I’m taking my first real steps into the world of VMware NSX, that there is no Cisco equipment supported as a VMware NSX hardware gateway (VTEP). According to the HCL on March 13th, 2018 there is a complete lack of “Cisco” in the “Partner” category:

Cisco Missing from VMware NSX hardware gateway support

I wonder how that works out for Cisco UCS customers.

As I continue to remind vendors, virtualization environments cannot virtualize everything. There are still dependencies on things like DNS, DHCP, NTP, and AD that need a few physical servers. There will also always be a few hosts that can’t be virtualized because of vendor requirements, politics, and/or fear. Any solution for a virtual environment needs to help take care of those systems or it’s not a solution people can use. Beyond that, most people are unwilling to spend precious time and funds on two solutions. The most amazing solution for VM backup, monitoring, or security is useless if you don’t solve my entire problem, which includes the core dependencies I have running as physical hosts.

Folks like Rubrik and Veeam caught on and solved the problem with backup agents. Now we can back up the physical hosts we still have. Extending NSX services, especially security, to the physical systems would help immensely, too. This functionality is “table stakes” now, base functionality customers expect as we design new systems and refresh old ones, but lots of others are missing the boat, too. HPE only has two models of switches listed. Dell only has three. None of them are 25 Gbps. Most of them aren’t certified for recent NSX releases, either.

This seems like a fly in VMware’s NSX ointment. Is it weak demand for NSX that is leading to networking vendors not supporting VXLAN? Or is it terrible networking products that are causing a lack of NSX sales because of their inability to support these features? Whatever it is, this stands as a big opportunity for players like Arista to stand out and eat Cisco, Dell, and HPE’s lunches by being a big and reliable part of the solution, not just another perpetuation of the problem.

How to Troubleshoot Unreliable or Malfunctioning Hardware

My post on Intel X710 NICs being awful has triggered a lot of emotion and commentary from my readers. One of the common questions has been: so I have X710 NICs, what do I do? How do I troubleshoot hardware that isn’t working right?

1. Document how to reproduce the problem and its severity. Is it a management annoyance or does it cause outages & downtime? Is there a reasonable expectation that what you’re trying to do should work the way you expect? That might seem like an odd question, but sometimes other people do the procurement for (and without) us and there are gotchas they didn’t think to ask about.

In my case with the X710s I felt I had a reasonable expectation that the machine would stay up and that standard features like LLDP, which worked fine with other NICs, would work on these.

Being able to reproduce a problem is key. Intermittent issues are really hard to deal with. Get screen shots of the behavior, of the consoles, of the BSODs & PSODs. Get crash dumps if you can.

2. Check the Hardware Compatibility List for the particular OS and hardware you’re trying to use. Make sure it’s on there. If not, you might not have much success in getting support. The HCL might also have clues about driver levels and settings, too.

3. Check the vendor knowledge bases. At the time I was fighting the X710 issues there were no articles about it but now there are, and there are some suggested workarounds.

4. Update the firmware to the latest levels. You should be doing this already as part of your patching process. If you’re having issues your vendor’s support is going to ask you to do this anyhow, so might as well get ahead of it. Do it on the whole machine, not just the malfunctioning component, because sometimes the problem is an interaction somewhere else.

5. Update the driver to the latest levels. The VMware HCL often lists newer drivers you can apply via Update Manager. Try applying one of those. Sometimes a vendor like Intel will supply a newer driver than a server vendor like Dell will qualify. I usually try to stick with what the vendor who sold me the server has for drivers. For Dell & VMware, that often means installing with and/or remediating to the Dell customized ESXi ISO.

6. Update the OS to the latest levels. Again, you should be doing this for security reasons but on the off-chance you aren’t patched up to the latest levels do it and see if the problem persists. Support is going to ask you to do this anyhow. This isn’t saying you need to upgrade to Windows Server 2016 from 2012R2 or anything, just be at current releases of 2012R2. Of course, if you have the opportunity to test against another OS like that it might be a useful data point.

7. Open a support case with your vendor. Let them help you, or at least get it on record that there are problems. Ask for escalation if there isn’t timely progress.

8. Let your sales team know that you are having problems. Ask them how long you have to return the equipment since it isn’t performing correctly. Let them know you opened a support case. Let them know you need escalation because the support folks aren’t resolving your problems. Sales teams want you to be successful, and they absolutely don’t want the equipment returned so they’ll lean on their technical resources to fix your problem.

9. Let your management know that you are having problems. Often, vendors will be having separate conversations with management around business goals and whatnot. Executives need to know that a vendor isn’t delivering on their promises. I guarantee that the vendor isn’t going to bring it up in conversation so you need to. Besides, most executives & managers I know love a way to derail a sales pitch.

This is also very important if this equipment needs to be installed and operational in certain timeframes. Management might need to adjust project timelines, reset customer expectations, or do some damage control. Get ahead of it.

10. Let your purchasing people know that you are having problems. If this is new equipment they might want to get involved before they pay the vendor, or stop payment until this is resolved. Governmental & SLED entities sometimes have other mechanisms of recourse under their vendor contracts which can be very helpful.

11. Don’t be afraid to tell the vendor that their ideas aren’t an acceptable fix. For example, the LLDP problems on X710 cards have a fix in newer drivers, but it’s completely manual, and will not work if your card is partitioned.

If you need the partitions then you’re stuck with no LLDP, which is crap. If you have a large cluster or value your time (and even if you don’t your employer probably does) a time-consuming, hard-to-maintain manual fix is unacceptable, too. You paid a price premium for X710 cards and you expect them to be fully supported & functional in your OS. Frankly, you could have paid less and had a NIC that actually worked as advertised out of the box.

12. Have someone high in your organization start the conversation around returning the equipment. This is basically the nuclear option, but you might have to do it. If you’ve done the other steps here this shouldn’t be a surprise. In my case with the X710s we said “it’s been three months with no resolution, we either need to return this equipment or get replacement NICs.” Because we’d worked through it and offered them a chance to resolve it, and there wasn’t a resolution, Dell did right by us and got us replacement Broadcom NICs. Problem solved.

Finding a way through situations like these is half linear troubleshooting and half good communications. Make sure you are doing both. Good luck!

Intel X710 NICs Are Crap

(I’m grumpy this week and I’m giving myself permission to return to my blogging roots and complain about stuff. Deal with it.)

In the not so distant past we were growing a VMware cluster and ordered 17 new blade servers with X710 NICs. Bad idea. X710 NICs suck, as it turns out.

Those NICs do all sorts of offloads, and the onboard processor intercepts things like CDP and LLDP packets so that the OS cannot see or participate. That’s a real problem for ESXi hosts where you want to listen for and broadcast meaningful neighbor advertisements. Under Linux you can echo a bunch of crap into the right spot in /dev and shut that off but no such luck on VMware ESXi. It makes a person wonder if there’s any testing that goes into drivers advertising VMware compatibility.

Even worse, we had many cases where the driver would crash, orphaning all the VMs on the host and requiring VMware HA to detect host isolation and intercede. The NICs would just disappear from the host but the host would still be up. Warm reboot and everything is fine. I doubt it was random but we could never reproduce it. The advice from Dell & VMware was crappy, around shutting off the offload processing, updating the driver, updating firmware, double checking that we were running the current versions of everything, doing some crazy dance, slaughtering a goat. Didn’t change anything, we still had an outage a week.

What recently popped this onto my list of complaints was a network engineer coworker telling me he’s having a heck of a time getting X710 NICs to negotiate speed with some new 25 Gbps networking gear. When he told me what model NIC I just cringed, and had to share my experiences. “But the 520s were such solid cards,” he said.

Dell eventually ended up relenting and sending us replacement Broadcom 10 Gbps NICs for our blade servers. My team spent an afternoon replacing them and we’ve had absolutely no problems since (we did the work on “Bring Your Kid to Work Day” and gave the old X710s, which Dell said not to send back, to kids on a data center tour).

Back in the day we used to talk about Broadcom this way, all the problems their tg3 chipset had with offloads and such. It’s been a complete role reversal, with Broadcom being the better, more reliable choice in NICs now. Good for them, but in light of everything recently it’s an absolute shame what the monopolistic Intel, helmed by Ferengi, has become.

If you value your time or system reliability don’t buy Intel X710 NICs.

Update: Jase McCarty reports that newer firmware might fix some of these issues, and also provides some PowerCLI code for disabling TSO/LRO if you’re seeing PSODs (VMware KB 2126909). YMMV.

Update 2: John Nicholson reports:

Figures it’d be the vSAN guys with the details, at least around the PSOD/stability issues. Thanks guys.

Update 3: It appears that newer i40e drivers let you change the LLDP behavior under certain circumstances, but it still doesn’t work right by default, or if you are doing NIC partitioning. These drivers are as of February 9, 2018, which is several years after the release of these cards, and the fix is still a bunch of manual work. Just vote with your wallet and buy someone else’s NICs.

Fix the Security Audits in vRealize Operations Manager

(I’m grumpy this week and I’m giving myself permission to return to my blogging roots and complain about stuff. Deal with it.)

Several bloggers have written about the Runecast Analyzer lately. I was crazy bored in a meeting the other day so instead of stabbing myself with my pen to see if I still feel things I decided to go check out their website. My interest was piqued when I saw the screenshot where they show security hardening guideline compliance, as well as compliance with the DISA STIG for ESX 6. I do a lot of that crap nowadays.

You know what my first thought was about the Runecast product, though? It was “This is what vRealize Operations Manager (vROPS) could have been, especially the Security Hardening Guide alerts.” When it debuted, the vROPS security audit policies showed immense amounts of promise. They weren’t developed beyond that, though, and now someone is eating VMware’s lunch, to the dismay of all of us who actually own licenses for vROPS.

As someone who has to be deeply concerned with compliance regulations on virtualization systems, who is also an actual customer (not a partner, not a developer, not an analyst), here’s what I want improved with the vROPS security audit alerts:

Instead of a single, outdated, one-size-fits-nothing policy we need policies matching the current guidance for each supported version of ESXi, at each level (1, 2, and 3). I will stack up the policies to meet the level I need for a particular set of objects.

We need separate policies to match the guidance for virtual machines. Rolling the ESXi guidance up with the VM guidance is a mess. Separate them.

We need default actions to fix any deficiencies found. Just like you can resize a VM you should be able to disable the MOB on an ESXi host if it’s found to be enabled, fix the security on a virtual switch, or set a VM configuration flag. It’d be particularly sweet if it could just remediate a whole ESXi host or VM in one pass. After all, the product is “Operations Manager” and security is a massive part of operations, so make it able to manage that stuff. As my six-year-old has taught her two-year-old brother, “DUH.”

We need a policy for the DISA STIG (after 16 months we also need a prototype DISA STIG for ESXi 6.5, but that’s a whole other complaint). Lots of people use — and even more people should use — the STIG to harden their installations, and it’d be grand if life were easier for us people in federal regulation hell. The whole reason we spend gobs of money on these tools is to try to make things easier, but there’s always some catch. Hence this post.

The default vROPS policies should not (I repeat: NOT) complain about the new secure defaults in vSphere 6.5 being unset. It also shouldn’t complain about VMware Workstation parameters, or any other inane unexposed features it checks for. Just tell me if & when something is actually set wrong.

Last, the policies must be kept up to date. Maybe the vROPS team could just use a VPN service and secretly check the VMware Security web site from time to time (perhaps before a vROPS update?) so they don’t have to actually talk to the weird Security folks. Whatever it is, just get it done, and don’t give me bullcrap excuses about competing with other parts of the ecosystem. vROPS was in this space first; fix it up and make it right for your customers.
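For what it’s worth, here is roughly what I mean by stacking policies, only flagging things that are actually set wrong, and fixing them in one pass. This is a minimal Python sketch, not anything vROPS actually does or exposes. Every setting name, policy, default, and function in it is made up for illustration, and I’m assuming a policy is nothing more than a map of setting names to required values.

```python
from __future__ import annotations

from typing import Any

# Hypothetical per-level policies: setting name -> required value.
# These names are illustrative only, not real hardening guide entries.
POLICY_A = {"mob_enabled": False}
POLICY_B = {"vswitch_promiscuous": False}

# Hypothetical product defaults that are already secure when left unset.
SECURE_DEFAULTS = {"mob_enabled": False}


def stack(*policies: dict[str, Any]) -> dict[str, Any]:
    """Merge policies in order; later policies win on conflicting settings."""
    merged: dict[str, Any] = {}
    for policy in policies:
        merged.update(policy)
    return merged


def audit(config: dict[str, Any], policy: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (actual, required)} for settings that are really wrong."""
    findings: dict[str, tuple[Any, Any]] = {}
    for name, required in policy.items():
        # An unset value falls back to the known default instead of
        # being reported as a violation just because it is missing.
        actual = config.get(name, SECURE_DEFAULTS.get(name))
        if actual != required:
            findings[name] = (actual, required)
    return findings


def remediate(config: dict[str, Any], findings: dict[str, tuple[Any, Any]]) -> dict[str, Any]:
    """Return a copy of config with every finding set to its required value."""
    fixed = dict(config)
    for name, (_actual, required) in findings.items():
        fixed[name] = required
    return fixed


# Example: the MOB setting is unset (secure by default), the vSwitch is wrong.
host = {"vswitch_promiscuous": True}
policy = stack(POLICY_A, POLICY_B)
findings = audit(host, policy)
fixed_host = remediate(host, findings)
```

The point of the fallback in `audit` is that an unset value with a secure default produces no finding, so the only thing reported for this made-up host is the virtual switch setting, and `remediate` clears it in a single pass.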

Thank you. Sorry if you’re a vROPS person and offended, but hey, I said I’m grumpy this week, and I tried to be constructive. Fix your stuff. If you’re a fellow vROPS customer and agree with me, well, there’s nothing stopping you from sending this to your account team as a request for enhancement.
