(I’m grumpy this week and I’m giving myself permission to return to my blogging roots and complain about stuff. Deal with it.)
In the not so distant past we were growing a VMware cluster and ordered 17 new blade servers with X710 NICs. Bad idea. X710 NICs suck, as it turns out.
Those NICs do all sorts of offloads, and the onboard processor intercepts things like CDP and LLDP packets so that the OS cannot see or participate. That’s a real problem for ESXi hosts where you want to listen for and broadcast meaningful neighbor advertisements. Under Linux you can echo a bunch of crap into the right spot in debugfs (under /sys/kernel/debug/i40e) and shut that off, but no such luck on VMware ESXi. It makes a person wonder if there’s any testing that goes into drivers advertising VMware compatibility.
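For the record, the Linux trick looks roughly like this. It's a minimal sketch, not gospel: it assumes the i40e driver exposes a per-port `command` file under /sys/kernel/debug/i40e, that debugfs is mounted there, and that you're running as root. The helper name `stop_firmware_lldp` is mine, not part of any driver or tool.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: debugfs is mounted at /sys/kernel/debug and the i40e driver
# creates one directory per port, named after its PCI address.
I40E_DEBUGFS = Path("/sys/kernel/debug/i40e")


def stop_firmware_lldp(root: Path = I40E_DEBUGFS) -> list[str]:
    """Write 'lldp stop' to every i40e port's debugfs command file.

    Returns the PCI addresses that were written to, in sorted order.
    Hypothetical helper for illustration; needs root on a real host.
    """
    stopped = []
    for command_file in sorted(root.glob("*/command")):
        # Same effect as: echo lldp stop > <root>/<pci address>/command
        command_file.write_text("lldp stop\n")
        stopped.append(command_file.parent.name)
    return stopped
```

Calling `stop_firmware_lldp()` with no arguments hits every X710 port the driver knows about. The setting doesn't survive a reboot, so it'd need to go in a boot script, which is exactly the kind of manual babysitting I'm complaining about.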
Even worse, we had many cases where the driver would crash, orphaning all the VMs on the host and requiring VMware HA to detect host isolation and intercede. The NICs would just disappear from the host but the host would still be up. Warm reboot and everything is fine. I doubt it was random but we could never reproduce it. The advice from Dell & VMware was crappy: shut off the offload processing, update the driver, update the firmware, double-check that we were running the current versions of everything, do some crazy dance, slaughter a goat. None of it changed anything; we still had an outage a week.
What popped this onto my list of complaints recently was a network engineer coworker telling me he’s having a heck of a time getting X710 NICs to negotiate speed with some new 25 Gbps networking gear. When he told me what model NIC, I just cringed and had to share my experiences. “But the 520s were such solid cards,” he said.
Dell eventually ended up relenting and sending us replacement Broadcom 10 Gbps NICs for our blade servers. My team spent an afternoon replacing them and we’ve had absolutely no problems since (we did the work on “Bring Your Kid to Work Day” and gave the old X710s, which Dell said not to send back, to kids on a data center tour).
Back in the day we used to talk about Broadcom this way, all the problems their tg3 chipset had with offloads and such. It’s been a complete role reversal, with Broadcom being the better, more reliable choice in NICs now. Good for them, but in light of everything recently it’s an absolute shame what the monopolistic Intel, helmed by Ferengi, has become.
If you value your time or system reliability don’t buy Intel X710 NICs.
Update 2: John Nicholson reports:
After a couple years of cursing at the LSO/TSO bugs I’m told the new driver firmwares fix it so you don’t have to disable it.
— John Nicholson (@Lost_Signal) February 28, 2018
Figures it’d be the vSAN guys with the details, at least around the PSOD/stability issues. Thanks guys.
Update 3: It appears that newer i40e drivers let you change the LLDP behavior under certain circumstances, but it still doesn’t work right by default, or if you are doing NIC partitioning. These drivers are as of February 9, 2018, which is several years after the release of these cards, and the fix is still a bunch of manual work. Just vote with your wallet and buy someone else’s NICs.