How to Troubleshoot Unreliable or Malfunctioning Hardware

My post on Intel X710 NICs being awful has triggered a lot of emotion and commentary from my readers. One of the common questions has been: so I have X710 NICs, what do I do? How do I troubleshoot hardware that isn’t working right?

1. Document how to reproduce the problem and its severity. Is it a management annoyance or does it cause outages & downtime? Is there a reasonable expectation that what you’re trying to do should work the way you expect? That might seem like an odd question, but sometimes other people do the procurement for (and without) us and there are gotchas they didn’t think to ask about.

In my case with the X710s I felt I had a reasonable expectation that the machine would stay up and that standard features like LLDP, which worked fine with other NICs, would work on these.

Being able to reproduce a problem is key. Intermittent issues are really hard to deal with. Get screenshots of the behavior, of the consoles, of the BSODs & PSODs. Get crash dumps if you can.

2. Check the Hardware Compatibility List for the particular OS and hardware you’re trying to use. Make sure it’s on there. If not, you might not have much success in getting support. The HCL might also have clues about driver levels and settings.

3. Check the vendor knowledge bases. When I was fighting the X710 issues there were no articles about them, but now there are, along with some suggested workarounds.

4. Update the firmware to the latest levels. You should be doing this already as part of your patching process. If you’re having issues your vendor’s support is going to ask you to do this anyhow, so might as well get ahead of it. Do it on the whole machine, not just the malfunctioning component, because sometimes the problem is an interaction somewhere else.

5. Update the driver to the latest levels. The VMware HCL often lists newer drivers you can apply via Update Manager. Try applying one of those. Sometimes a vendor like Intel will supply a newer driver than a server vendor like Dell will qualify. I usually try to stick with what the vendor who sold me the server has for drivers. For Dell & VMware, that often means installing with and/or remediating to the Dell customized ESXi ISO.

6. Update the OS to the latest levels. Again, you should be doing this for security reasons, but on the off-chance you aren’t patched up to the latest levels, do it and see if the problem persists. Support is going to ask you to do this anyhow. This isn’t saying you need to upgrade to Windows Server 2016 from 2012R2 or anything, just be at current releases of 2012R2. Of course, if you have the opportunity to test against another OS like that, it might be a useful data point.

7. Open a support case with your vendor. Let them help you, or at least get it on record that there are problems. Ask for escalation if there isn’t timely progress.

8. Let your sales team know that you are having problems. Ask them how long you have to return the equipment since it isn’t performing correctly. Let them know you opened a support case. Let them know you need escalation because the support folks aren’t resolving your problems. Sales teams want you to be successful, and they absolutely don’t want the equipment returned so they’ll lean on their technical resources to fix your problem.

9. Let your management know that you are having problems. Often, vendors will be having separate conversations with management around business goals and whatnot. Executives need to know that a vendor isn’t delivering on their promises. I guarantee that the vendor isn’t going to bring it up in conversation so you need to. Besides, most executives & managers I know love a way to derail a sales pitch.

This is also very important if this equipment needs to be installed and operational in certain timeframes. Management might need to adjust project timelines, reset customer expectations, or do some damage control. Get ahead of it.

10. Let your purchasing people know that you are having problems. If this is new equipment they might want to get involved before they pay the vendor, or stop payment until this is resolved. Governmental & SLED entities sometimes have other mechanisms of recourse under their vendor contracts which can be very helpful.

11. Don’t be afraid to tell the vendor that their ideas aren’t an acceptable fix. For example, the LLDP problems on X710 cards have a fix in newer drivers, but it’s completely manual, and will not work if your card is partitioned.

If you need the partitions then you’re stuck with no LLDP, which is crap. If you have a large cluster or value your time (and even if you don’t, your employer probably does), a time-consuming, hard-to-maintain manual fix is unacceptable, too. You paid a price premium for X710 cards and you expect them to be fully supported & functional in your OS. Frankly, you could have paid less and had a NIC that actually worked as advertised out of the box.

12. Have someone high in your organization start the conversation around returning the equipment. This is basically the nuclear option, but you might have to do it. If you’ve done the other steps here this shouldn’t be a surprise. In my case with the X710s we said “it’s been three months with no resolution, we either need to return this equipment or get replacement NICs.” Because we’d worked through it and offered them a chance to resolve it, and there wasn’t a resolution, Dell did right by us and got us replacement Broadcom NICs. Problem solved.
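
On the linear-troubleshooting side, steps 4 through 6 come down to comparing what is installed against a known-good baseline, such as the levels listed on the HCL. Here is a minimal Python sketch of that comparison. The component names and version numbers are made up for illustration, and it assumes versions are plain dotted numbers; real firmware and driver version strings are often messier and would need their own parsing.

```python
import re


def parse_version(text):
    """Turn a dotted version string such as '1.4.28' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def find_outdated(installed, baseline):
    """Return the names of components that are missing or older than the baseline."""
    outdated = []
    for name, wanted in baseline.items():
        have = installed.get(name)
        if have is None or parse_version(have) < parse_version(wanted):
            outdated.append(name)
    return sorted(outdated)


# Hypothetical values for illustration only, not real release numbers.
installed = {"nic_firmware": "5.05", "nic_driver": "1.4.28", "bios": "2.4.3"}
baseline = {"nic_firmware": "6.01", "nic_driver": "1.5.6", "bios": "2.4.3"}

print(find_outdated(installed, baseline))
```

With these sample values the script reports the NIC driver and NIC firmware as behind the baseline. Keeping a record like this for the whole machine, not just the misbehaving component, also gives you something concrete to attach to the support case.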

Finding a way through situations like these is half linear troubleshooting and half good communications. Make sure you are doing both. Good luck!