Intel X710 NICs Are Crap

(I’m grumpy this week and I’m giving myself permission to return to my blogging roots and complain about stuff. Deal with it.)

In the not so distant past we were growing a VMware cluster and ordered 17 new blade servers with X710 NICs. Bad idea. X710 NICs suck, as it turns out.

Those NICs do all sorts of offloads, and the onboard processor intercepts things like CDP and LLDP packets so that the OS cannot see or participate. That’s a real problem for ESXi hosts where you want to listen for and broadcast meaningful neighbor advertisements. Under Linux you can echo a bunch of crap into the right spot in debugfs (the i40e directory under /sys/kernel/debug) and shut that off, but no such luck on VMware ESXi. It makes a person wonder if there’s any testing that goes into drivers advertising VMware compatibility.
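For the Linux side, here is a rough sketch of what that "echo a bunch of crap" workaround amounts to, written as Python instead of a shell one-liner. It assumes the i40e driver's debugfs interface is present, that debugfs is mounted at /sys/kernel/debug, and that you are root. The function name and the PCI address in the usage note are made up for illustration, and the exact behavior depends on your driver and firmware versions.

```python
from pathlib import Path

# Assumption: debugfs is mounted in the usual place and the i40e driver
# exposes one directory per port, named after its PCI address.
DEBUGFS_I40E = Path("/sys/kernel/debug/i40e")


def stop_firmware_lldp(pci_address: str, root: Path = DEBUGFS_I40E) -> Path:
    """Ask the NIC firmware to stop its LLDP agent for one port.

    Equivalent in spirit to: echo lldp stop > <root>/<pci_address>/command
    """
    command_file = root / pci_address / "command"
    if not command_file.exists():
        # Wrong address, driver not loaded, or debugfs not mounted.
        raise FileNotFoundError(f"no i40e debugfs entry at {command_file}")
    command_file.write_text("lldp stop\n")
    return command_file
```

You would call it as something like stop_firmware_lldp("0000:01:00.0"), once per port, using the addresses lspci shows for your cards. As far as I can tell the setting does not survive a reboot, so it needs to be reapplied at boot. Treat this as a sketch, not a supported procedure.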

Even worse, we had many cases where the driver would crash, orphaning all the VMs on the host and requiring VMware HA to detect host isolation and intercede. The NICs would just disappear from the host but the host would still be up. Warm reboot and everything is fine. I doubt it was random but we could never reproduce it. The advice from Dell & VMware was crappy: shut off the offload processing, update the driver, update the firmware, double-check that we were running the current versions of everything, do some crazy dance, slaughter a goat. None of it changed anything; we still had an outage a week.

Recently, and what popped this onto my list of complaints, a network engineer coworker told me he’s having a heck of a time getting X710 NICs to negotiate speed with some new 25 Gbps networking gear. When he told me what model NIC it was I just cringed, and had to share my experiences. “But the 520s were such solid cards,” he said.

Dell eventually ended up relenting and sending us replacement Broadcom 10 Gbps NICs for our blade servers. My team spent an afternoon replacing them and we’ve had absolutely no problems since (we did the work on “Bring Your Kid to Work Day” and gave the old X710s, which Dell said not to send back, to kids on a data center tour).

Back in the day we used to talk about Broadcom this way, all the problems their tg3 chipset had with offloads and such. It’s been a complete role reversal, with Broadcom being the better, more reliable choice in NICs now. Good for them, but in light of everything recently it’s an absolute shame what the monopolistic Intel, helmed by Ferengi, has become.

If you value your time or system reliability don’t buy Intel X710 NICs.

Update: Jase McCarty reports that newer firmware might fix some of these issues, and also provides some PowerCLI code for disabling TSO/LRO if you’re seeing PSODs (VMware KB 2126909). YMMV.

Update 2: John Nicholson reports:

Figures it’d be the vSAN guys with the details, at least around the PSOD/stability issues. Thanks guys.

Update 3: It appears that newer i40e drivers let you change the LLDP behavior under certain circumstances, but it still doesn’t work right by default, or if you are doing NIC partitioning. These drivers are as of February 9, 2018, which is several years after the release of these cards, and the fix is still a bunch of manual work. Just vote with your wallet and buy someone else’s NICs.

Comments
  • In my $day job we have been hit by numerous random PSODs due to an issue with ESXi and Intel X710 NICs. Disabling TSO/LRO is not an option due to our workload. We ended up replacing them all with Mellanox NICs. What puzzles me is that even with the random PSOD issue, VMware still lists the Intel X710 in its verified hardware HCL.

    • Thing is, you pay real money for those TSO/LRO offloads. They need to stay on.

  • You have no idea how many firmware and driver versions I had to go through over the past 2 years to get ones that were stable. When they were brand new the TSO had the same corruption bug that Intel had in their NICs 10 years earlier. This is all on Linux without the complications of VMware.

