Summary
NIC hardware failure covers adapters that are not detected in BIOS or the OS, show no link despite good cabling and switch ports, or fail only in specific slots or chassis paths. In this dataset, the core diagnostic challenge is separating a dead NIC from motherboard, riser, PCIe, optics, or firmware-state problems that present the same way.
Frequency
47 tickets.
Common Causes
- The NIC card itself is defective. Repeated evidence shows adapters that fail in multiple systems, remain undetected in both BIOS and OS, or never establish link even after normal firmware and cable checks ([11403], [6152], [18235], [31853], [35697]) ...and several more.
- Platform-side faults masquerade as NIC failure. Motherboards, risers, daughterboards, or specific PCIe slot paths often prevent otherwise good NICs from linking or appearing, especially when multiple cards fail in the same host path ([22408], [36662], [38340], [41155], [8224]) ...and 10+ more.
- External cabling or optics must be ruled out first. Several high-speed Ethernet and InfiniBand cases required eliminating QSFP/SFP modules, trunk cables, and switch-port issues before the NIC itself was blamed ([6152], [18128], [25210], [35697], [36560]) ...and 4 more.
- Firmware or driver state can confuse the diagnosis. Some cards appear in `lshw` or the OS but stay down, train poorly in PCIe, or need firmware/driver inspection before the hardware conclusion is solid ([14460], [25210], [31853], [35697], [6152]).
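One way to check the "trains poorly in PCIe" symptom is to compare the negotiated link against what the device advertises. The Python sketch below does this on text in the style of `lspci -vv` output. `lspci` is not named in the tickets and is an assumption here, as are the helper names `parse_link` and `link_degraded` and the sample text. The exact LnkCap/LnkSta layout can vary between pciutils versions, so treat the pattern as a starting point.

```python
import re

# Matches e.g. "Speed 8GT/s (downgraded), Width x8" on a LnkCap/LnkSta line.
LINK_RE = re.compile(r"Speed ([\d.]+)GT/s.*?Width x(\d+)")


def parse_link(lspci_text, field):
    """Return (speed_gt_s, width) for 'LnkCap' or 'LnkSta', or None if not found."""
    for line in lspci_text.splitlines():
        line = line.strip()
        # The trailing colon keeps LnkCap2/LnkSta2 lines from matching.
        if line.startswith(field + ":"):
            match = LINK_RE.search(line)
            if match:
                return float(match.group(1)), int(match.group(2))
    return None


def link_degraded(lspci_text):
    """True if negotiated speed or width is below capability, None if unknown."""
    cap = parse_link(lspci_text, "LnkCap")
    sta = parse_link(lspci_text, "LnkSta")
    if cap is None or sta is None:
        return None
    return sta[0] < cap[0] or sta[1] < cap[1]


# Hypothetical sample, not taken from any ticket.
SAMPLE = """\
        LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
        LnkSta: Speed 8GT/s (downgraded), Width x8 (downgraded)
"""

if __name__ == "__main__":
    print(parse_link(SAMPLE, "LnkCap"), parse_link(SAMPLE, "LnkSta"))
    print("degraded:", link_degraded(SAMPLE))
```

A degraded result is a clue rather than a verdict. Some devices lower link speed at idle, and a narrower slot or riser will also cap the width, so the result should be read alongside the slot and swap tests.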
Diagnostic Steps
- Confirm the failure at both firmware and OS level. Check BIOS visibility, `lshw`, `ip a`, `ethtool`, `ibnodes`, or equivalent tools to see whether the NIC is absent, present-but-down, or link-failing ([11403], [27047], [35697], [6152], [18235]).
- Eliminate cable, optic, and switch-port variables. Swap transceivers, DAC/AOC/fiber cables, and switch ports with known-good ones before deciding the NIC is bad ([6152], [18128], [25210], [35697], [36560]).
- Isolate whether the fault follows the card or stays with the host. Move the NIC to another slot or system, and if possible test a known-good NIC in the suspect chassis path ([11403], [6152], [18235], [22408], [38340]).
- Review firmware and low-level clues. Capture `journalctl`, `dmesg`, `mst`, `mstflint`, PCIe training symptoms, or link-training messages before escalating ([31853], [35697], [6152], [25210]).
- Escalate to board-level repair if multiple good NICs fail in one system. Repeated slot-path failures or integrated-NIC issues point toward motherboard, riser, or PCIe infrastructure rather than the adapter alone ([22408], [36662], [38340], [41155], [8224]).
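The first step above sorts a NIC into absent, present-but-down, or link-failing. A minimal Python sketch of that classification on Linux follows, assuming the interface list comes from `ip -j link show` (the JSON form of `ip link`). The function name `classify_interface`, the state labels, and the sample data are illustrative assumptions, not part of any ticket.

```python
import json


def classify_interface(ip_json_text, ifname):
    """Classify one interface from `ip -j link show` JSON output.

    Returns "absent", "present-admin-down", "present-no-link", or "link-up".
    """
    for link in json.loads(ip_json_text):
        if link.get("ifname") != ifname:
            continue
        flags = link.get("flags", [])
        if "UP" not in flags:
            # Interface exists but has not been brought up administratively.
            return "present-admin-down"
        if "LOWER_UP" in flags:
            # The kernel reports carrier on the physical layer.
            return "link-up"
        # Admin up without carrier: the link-failing case.
        return "present-no-link"
    return "absent"


# Hypothetical sample data; interface names are placeholders.
SAMPLE = json.dumps([
    {"ifname": "eth0", "flags": ["BROADCAST", "MULTICAST", "UP", "LOWER_UP"], "operstate": "UP"},
    {"ifname": "eth1", "flags": ["NO-CARRIER", "BROADCAST", "MULTICAST", "UP"], "operstate": "DOWN"},
    {"ifname": "eth2", "flags": ["BROADCAST", "MULTICAST"], "operstate": "DOWN"},
])

if __name__ == "__main__":
    for name in ("eth0", "eth1", "eth2", "eth3"):
        print(name, classify_interface(SAMPLE, name))
```

On a live host the input could be captured with `subprocess.run(["ip", "-j", "link", "show"], capture_output=True, text=True).stdout`. An "absent" result only covers the OS view, so BIOS visibility still needs a separate check, and "present-admin-down" should be retested after bringing the interface up.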
Solutions
- Component RMA or advance replacement of the NIC. This is the most common proven fix once the fault follows the card or the adapter fails in multiple environments ([11403], [18235], [35697], [41155], [6152]) ...and 10+ more.
- Use advance replacement when downtime matters. Fast NIC replacement was repeatedly used for research and production systems that could not wait for standard return-first processing ([17684], [24391], [36930], [41155], [3137]).
- Repair the host platform when the NIC is not the true failure. System RMA is appropriate when multiple known-good NICs fail in the same PCIe path or integrated NIC behavior points to motherboard/riser faults ([22408], [36662], [38340], [8224]).
- Reset firmware state or correct low-level config when hardware is still healthy. At least some apparent NIC failures were resolved by CMOS/BIOS recovery rather than replacing the card ([27047], [25210]).
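The mapping from swap-test results to these solutions can be sketched as a small decision function. This Python example is a simplification for illustration, not an official escalation policy. The function name `recommend_action`, its parameters, and the returned wording are all assumptions.

```python
def recommend_action(card_fails_elsewhere, known_good_fails_in_host,
                     externals_ruled_out=True):
    """Map swap-test outcomes to a suggested next step.

    card_fails_elsewhere: the suspect NIC also fails in another slot or system.
    known_good_fails_in_host: a known-good NIC fails in the suspect host path.
    externals_ruled_out: cables, optics, and switch ports were already swapped.
    """
    if not externals_ruled_out:
        return "rule out cable, optic, and switch port first"
    if card_fails_elsewhere and known_good_fails_in_host:
        # Both sides look faulty: replace the card but keep the host case open.
        return "NIC RMA, keep platform troubleshooting open"
    if card_fails_elsewhere:
        # Fault follows the card.
        return "NIC RMA or advance replacement"
    if known_good_fails_in_host:
        # Fault stays with the host path.
        return "system RMA or platform repair"
    # Neither test reproduces the failure: suspect firmware or BIOS state.
    return "check firmware/BIOS state before any RMA"


if __name__ == "__main__":
    print(recommend_action(True, False))
    print(recommend_action(False, True))
    print(recommend_action(False, False))
```

The final branch reflects the cases where a CMOS/BIOS recovery fixed the problem without replacing hardware, so it deliberately stops short of recommending an RMA.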
Edge Cases
- OS sees the card, but the link never comes up. InfiniBand and QSFP cases can show adapter presence in `lshw` or Linux while still having no electrical link or training success ([6152], [35697]).
- Customer modification can void the warranty path. One ConnectX-7 card with an extra fan soldered on was rejected from warranty replacement and disposed of at the customer’s request ([31853]).
- Advance replacement may close only the NIC portion of a broader system problem. In at least one case, Exxact shipped the replacement NIC while keeping larger chassis troubleshooting open in the parent ticket ([41155]).
- Integrated NIC issues can be firmware-state problems, not dead hardware. A workstation with both onboard NICs down recovered after CMOS reset, avoiding RMA entirely ([27047]).
Related Issues
- bios-bmc-issues
- motherboard-hardware-failure
- system-boot-failure
- incorrect-hardware-shipped
- firmware-driver-compatibility
Referenced by
- A100 — product affected by this issue (×2)
- Ian Dicarlo — handled tickets on this issue (×7)
- Jared Royster — handled tickets on this issue (×8)
- H200 — product affected by this issue (×1)
- RMA Workflow — co-occurs with this issue (×34)
- Matt — handled tickets on this issue (×4)
- Andrew Rodriguez — handled tickets on this issue (×8)
- Philip Nguyen — handled tickets on this issue (×2)
- David — handled tickets on this issue (×2)
- Network Port Failure — related issue (×1)