GPU Hardware Failure

Dev Account

Summary

GPU hardware failure covers cards that disappear from nvidia-smi, throw ECC or Xid faults, lose display output, or enter ERR state under load. These symptoms often overlap with motherboard, riser, slot, PSU, cable, or firmware problems, so isolation testing is essential.

Frequency

652 tickets

Common Causes

  1. The GPU card itself is defective. Failures repeatedly follow the card across slots or into another known-good system, including no-detect, no-video, ECC, and load-crash behavior ([13820], [15214], [18211], [30015], [35728]) ...and 80+ more.
  2. GPU faults only appear under load or thermal stress. Many cards look normal at idle, then enter ERR, crash gpu_burn, throw Xid timeouts, or fail once jobs start ([27436], [30738], [35728], [36869], [39756]) ...and 35+ more.
  3. Platform issues masquerade as GPU failure. Motherboards, risers, PCIe slots, backplanes, or PSUs often produce “dead GPU” symptoms until isolation proves the card is not the only problem ([10124], [30015], [32991], [34977], [39651]) ...and 200+ more.
  4. Firmware or driver symptoms can confound the diagnosis. Missing GPUs, handle errors, ECC state issues, or “not recognized” reports sometimes need software and firmware checks before the hardware conclusion is solid ([27436], [30015], [35279], [39756], [41270]) ...and 200+ more.

Diagnostic Steps

  1. Verify the symptom and capture GPU-level errors. Check nvidia-smi, nvidia-smi -q, journalctl, dmesg, and Xid or ECC output to distinguish no-detect, load-crash, memory, and display-loss patterns ([27436], [30015], [30738], [35728], [39756]).
  2. Do physical isolation before RMA. Reseat the card, move it to another slot, test the slot with a known-good GPU, and if possible test the suspect card in another system ([13820], [15214], [18211], [30015], [35728]).
  3. Check whether the fault follows the card or stays with the platform. If it moves with the GPU, suspect the card; if it stays with one slot or chassis, investigate motherboard, backplane, rails, PSU, or wiring instead ([30015], [32991], [34977], [35727], [39756]).
  4. Stress the card after a seemingly successful change. Use gpu_burn, memory tests, or real production workloads because intermittent cards may pass short checks and fail only later ([27436], [30738], [35728], [42171]).
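
The Xid capture in step 1 can be partly automated. The sketch below assumes the NVIDIA driver writes Xid events to the kernel log in its usual "NVRM: Xid (PCI:<address>): <code>, ..." form and groups the distinct codes by PCI address. The function name and the sample lines are illustrative only and do not come from any ticket.

```python
import re
from collections import defaultdict

# Assumed kernel log shape for NVIDIA Xid events, for example:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def xids_by_device(log_text):
    """Return {pci_address: sorted list of distinct Xid codes} found in log_text."""
    found = defaultdict(set)
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            found[match.group(1)].add(int(match.group(2)))
    return {pci: sorted(codes) for pci, codes in found.items()}


# Illustrative sample lines (hypothetical, not real ticket data).
SAMPLE = """\
[ 1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.
[ 1300.1] NVRM: Xid (PCI:0000:af:00): 48, pid=4400, An uncorrectable double bit error (DBE) has been detected
[ 1301.0] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.
[ 1400.0] usb 1-1: new high-speed USB device
"""

if __name__ == "__main__":
    for pci, codes in xids_by_device(SAMPLE).items():
        print(pci, codes)
```

In practice the input would be the saved output of dmesg or journalctl -k from the affected host. Comparing the PCI address before and after a slot swap helps show whether the same codes move with the card.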

Solutions

  1. Replace the GPU through component RMA or advance replacement. This is the most common proven fix once the card has been isolated cleanly ([13820], [15214], [18211], [35728], [39756]) ...and 90+ more.
  2. Use cross-ship or advance replacement to reduce downtime. Advance replacement shortened outages in many cases ([27436], [30044], [31680], [35728], [39756]) ...and dozens more GPU-specific RMAs.
  3. Fix the platform instead when the GPU is only a symptom. Motherboard, slot, riser, or PSU replacement is required when the same GPU passes elsewhere or the chassis keeps reproducing PCIe faults ([30015], [32991], [34977], [35727], [39651]).
  4. Reseat or reconfigure when the card is not truly failed. Some “GPU device error” or not-detected cases recovered after reseating or driver/ECC correction ([27436], [30015], [35279], [41270]).
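
The choice between solution 1 and solution 3 rests on whether the fault follows the card or stays with the slot. A minimal sketch of that rule, assuming swap-test results are recorded as (card, slot, failed) tuples, is shown here. The identifiers gpu-A, gpu-B, slot1, and slot2 are hypothetical, and the two-observation threshold is an assumption, not a documented policy.

```python
from collections import defaultdict


def suspects(trials):
    """Return (bad_cards, bad_slots) from swap-test results.

    trials: iterable of (card_id, slot_id, failed) tuples.
    A card is suspect if it failed in every slot it was tried in (2+ slots).
    A slot is suspect if every card tried in it failed (2+ cards).
    """
    by_card = defaultdict(dict)
    by_slot = defaultdict(dict)
    for card, slot, failed in trials:
        by_card[card][slot] = failed
        by_slot[slot][card] = failed
    bad_cards = sorted(
        c for c, results in by_card.items()
        if len(results) >= 2 and all(results.values())
    )
    bad_slots = sorted(
        s for s, results in by_slot.items()
        if len(results) >= 2 and all(results.values())
    )
    return bad_cards, bad_slots


if __name__ == "__main__":
    # Fault follows the card: gpu-A fails in both slots, gpu-B passes in slot1.
    follows_card = [
        ("gpu-A", "slot1", True),
        ("gpu-A", "slot2", True),
        ("gpu-B", "slot1", False),
    ]
    # Fault stays with the slot: both cards fail in slot1, gpu-A passes in slot2.
    stays_with_slot = [
        ("gpu-A", "slot1", True),
        ("gpu-B", "slot1", True),
        ("gpu-A", "slot2", False),
    ]
    print(suspects(follows_card))
    print(suspects(stays_with_slot))
```

An empty result in both lists means the swap tests are not yet conclusive, which matches the edge cases below where a bench pass or a replacement card did not settle the diagnosis.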

Edge Cases

  • Bench pass does not always clear the card. Some GPUs passed Exxact bench testing yet still ended in replacement after customer-side escalation and reproducible field failures ([30738], [32991]).
  • Replacement hardware can also be bad. One case involved an initially failed GPU, then a second replacement card that also lost display and failed recognition before the real platform fault was identified ([32991]).
  • Low-level silicon faults can be more specific than generic “not detected.” H200 failure evidence included Xid 143, FSP boot-complete polling failures, and bad-register reads, which justified replacement even without a simple display symptom ([39756]).
  • Packaging damage can complicate warranty processing. A returned Zotac card arrived with a bent bracket, but the manufacturer still replaced it ([18211]).

Related Issues

  • bios-bmc-issues
  • system-boot-failure
  • motherboard-hardware-failure
  • network-port-failure

Referenced by

  • RTX A6000 — product affected by this issue (×72)
  • RTX 4090 — product affected by this issue (×29)
  • RTX 3090 — product affected by this issue (×31)
  • A100 — product affected by this issue (×34)
  • RTX 5090 — product affected by this issue (×7)
  • L40S — product affected by this issue (×12)
  • RMA Workflow — co-occurs with this issue (×485)
  • Philip Nguyen — handled tickets on this issue (×27)
  • Ian Dicarlo — handled tickets on this issue (×56)
  • No Trouble Found RMA — co-occurs with this issue (×19)
