GPU Jobs Crash Node Poweroff
Summary
Systems in this cluster of tickets stay up at idle or light load, then freeze, reboot, hang, or power off once sustained GPU work starts. The pattern appears across training, RELION, CryoSPARC, PyTorch, GPU Burn, and similar multi-GPU workloads, with root causes spanning thermals, power delivery, PCIe behavior, firmware, and occasionally software.
Frequency
- 56 tickets
Common Causes
- Thermal overload or failed cooling components: Repeated evidence ties crashes to high GPU temperatures, CPU thermal trips, bad airflow, liquid-cooler faults, or failed CPU coolers. Examples: #6046, #13541, #18562, #22208, #35390, and 7 more.
- Power-delivery instability, PSU faults, or over-current events: Several cases mention PSU over-current alerts, knocking sounds, sudden hard-offs under full GPU draw, or suspected power faults that later led to an RMA. Examples: #12346, #16700, #18951, #21208, #7016, and 8 more.
- PCIe / motherboard path instability under load: Some systems failed only when all GPUs were active, with GPUs falling off the bus, downgraded PCIe link speed, bad slot behavior, or a suspected motherboard fault. Examples: #15022, #19229, #26199, #28711, #36958, and 7 more.
- Firmware / driver / OS interaction issues: A smaller but real subset stabilized after BIOS, BMC, PSU firmware, driver, or OS changes, or reproduced differently depending on the software in use. Examples: #12209, #15528, #18688, #19627, #18951, and 6 more.
- No-trouble-found or not fully reproduced in-house: Some RMAs never reproduced the customer crash cleanly, which complicated closure and shifted emphasis to extended validation. Examples: #15381, #15022, #21957, #26591, #28667.
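As a first pass at sorting a crash into the cause buckets above, log lines from the SEL or the kernel journal can be matched against a few keywords. The sketch below is only a heuristic: the `CAUSE_PATTERNS` map and the `triage` helper are hypothetical names, and the keyword lists are illustrative assumptions rather than a verified or complete set.

```python
import re

# Hypothetical keyword map. These patterns are illustrative assumptions, not a
# vendor-verified list; tune them against real SEL and journal output.
CAUSE_PATTERNS = {
    "thermal": re.compile(r"thermal|overheat|temperature|thermtrip", re.I),
    "power": re.compile(r"over-?current|power supply|\bpsu\b", re.I),
    "pcie": re.compile(r"fallen off the bus|\bxid\b|pcie|link (speed|width)", re.I),
}


def triage(lines):
    """Count how many log lines match each cause bucket."""
    counts = {cause: 0 for cause in CAUSE_PATTERNS}
    for line in lines:
        for cause, pattern in CAUSE_PATTERNS.items():
            if pattern.search(line):
                counts[cause] += 1
    return counts
```

A bucket with many hits is a lead to investigate, not a diagnosis. Several tickets in this cluster had no clear log evidence at all.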
Diagnostic Steps
- Confirm the failure pattern under real workload: Check whether the crash happens only under sustained GPU load, only with all GPUs active, or only after the system warms up. Representative tickets: #11007, #15022, #26591, #5628, #6046.
- Collect thermal, SEL, and health telemetry during or right after failure: Review nvidia-smi -q, ipmitool sel elist, ipmitool sdr, psensor, and journal logs for thermal trips, missing GPUs, or power events. Representative tickets: #18562, #18951, #19627, #24079, #35390.
- Reduce variables with staged stress tests: Compare the customer workload to gpu_burn, mprime, stress-ng, or single-GPU / partial-GPU tests to learn whether the issue is load-shape-specific. Representative tickets: #11007, #21208, #21516, #5628, #9134.
- Check platform paths, not just the GPUs: Inspect PCIe lane width, slot mapping, cooler condition, DIMMs, PSU behavior, and motherboard health when symptoms include bus drops or hard-offs. Representative tickets: #11007, #19229, #26199, #3001, #15022.
- Escalate to RMA when the crash is credible but cannot be safely isolated remotely: This issue often needed depot reproduction because remote logs alone did not prove the failing component. Representative tickets: #11721, #14498, #18132, #20475, #30954.
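The telemetry step above is easier to repeat if the same commands are captured into one timestamped folder after every crash. The following is a minimal sketch under stated assumptions: `collect_snapshot` and `DEFAULT_COMMANDS` are hypothetical names, the journalctl flags (kernel messages from the previous boot) are an assumption that requires a persistent journal, and ipmitool usually needs root.

```python
import subprocess
import time
from pathlib import Path

# Commands named in the diagnostic steps. The journalctl flags are an
# assumption (kernel log of the previous boot) and need a persistent journal.
DEFAULT_COMMANDS = {
    "nvidia-smi-q": ["nvidia-smi", "-q"],
    "sel": ["ipmitool", "sel", "elist"],
    "sdr": ["ipmitool", "sdr"],
    "journal-prev-boot": ["journalctl", "-k", "-b", "-1", "--no-pager"],
}


def collect_snapshot(outdir, commands=DEFAULT_COMMANDS, timeout=60):
    """Run each command and save its output under a timestamped directory."""
    target = Path(outdir) / time.strftime("%Y%m%d-%H%M%S")
    target.mkdir(parents=True, exist_ok=True)
    results = {}
    for name, argv in commands.items():
        try:
            proc = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            text, results[name] = proc.stdout + proc.stderr, proc.returncode
        except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
            # A missing tool or a hung command is recorded instead of aborting.
            text, results[name] = f"{type(exc).__name__}: {exc}\n", None
        (target / f"{name}.txt").write_text(text)
    return target, results
```

Calling `collect_snapshot("/var/tmp/gpu-crash")` (an example path) right after the node comes back up keeps the SEL and the previous boot's kernel log together, which helps when a ticket later moves to depot reproduction.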
Solutions
- Repair or replace the thermal root cause: Successful fixes included restoring correct chassis airflow, replacing failed coolers, and addressing GPU overheating. Examples: #13541, #18562, #22208, #6046, #34983.
- RMA the system or affected platform components: Many tickets only converged after depot testing and hardware replacement at system, motherboard, PSU, or barebone level. Examples: #12641, #14498, #16700, #18132, #36633, and 12 more.
- Apply firmware, BIOS, or power-management changes: Stability improved in some cases after BIOS or BMC updates, PSU firmware updates, or disabling ASPM and related PCIe power saving. Examples: #18951, #19627, #26199, #28711.
- Use workload-specific software remediation when hardware evidence is weak: A minority of tickets improved after driver updates, Windows-side changes, or application-specific configuration changes. Examples: #15528, #18688, #19627.
- Validate with the customer workload, not generic burn only: Several cases showed standard tests passing while the real training or cryo-EM workload still failed, so customer reproduction steps were essential to proving the fix. Examples: #15022, #18132, #18717, #26591, #5628.
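Validation against the customer workload can be scripted as a simple soak loop: run the real job repeatedly and confirm after each pass that every GPU is still enumerated. This is a sketch, not a depot procedure. `soak` and `count_gpus` are hypothetical helpers, the workload command is whatever the customer supplies, and the GPU count relies on `nvidia-smi -L` printing one line starting with "GPU " per device.

```python
import subprocess


def count_gpus():
    """Count GPUs the driver can still enumerate, via `nvidia-smi -L`."""
    proc = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return sum(1 for line in proc.stdout.splitlines() if line.startswith("GPU "))


def soak(workload_argv, expected_gpus, iterations=10, gpu_counter=count_gpus):
    """Run the workload repeatedly; stop at the first failure or missing GPU."""
    for i in range(1, iterations + 1):
        rc = subprocess.run(workload_argv).returncode
        seen = gpu_counter()
        if rc != 0 or seen != expected_gpus:
            return {"passed": False, "iteration": i, "returncode": rc, "gpus": seen}
    return {
        "passed": True,
        "iteration": iterations,
        "returncode": 0,
        "gpus": expected_gpus,
    }
```

A hard power-off kills the harness along with the node, so this loop only catches failures the system survives, such as a GPU dropping off the bus or the job exiting with an error. For hard-offs, the absence of a final result is itself the signal.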
Edge Cases
- Cooling issue outside the GPUs themselves: CPU cooler or liquid cooler faults presented as GPU-load crashes because the trigger was overall system thermal stress, not a dead GPU. See #18562, #22208, #24079.
- Shipping damage / mechanical contamination: transit damage and dried thermal paste in a socket created load-instability symptoms that looked like generic GPU crashes. See #3001, #19229.
- Bus-speed degradation rather than outright device loss: one repeat-RMA case stabilized only after PCIe power-management changes restored expected link speed. See #26199.
- Standard stress tools can miss the failure: multiple tickets report gpu_burn or other QA tools passing while the real research workload still crashed the node. See #5628, #15022, #26591.
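For the bus-speed degradation case, the current and maximum PCIe link generation and width can be compared per GPU. The sketch below assumes the `pcie.link.*` fields of `nvidia-smi --query-gpu`; `read_links` and `downgraded_links` are hypothetical helper names. Because GPUs commonly drop link generation at idle to save power, the reading is most meaningful while the workload is running.

```python
import csv
import io
import subprocess

QUERY = (
    "index,pcie.link.gen.current,pcie.link.gen.max,"
    "pcie.link.width.current,pcie.link.width.max"
)


def read_links():
    """Return raw CSV text from nvidia-smi for the PCIe link fields."""
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout


def downgraded_links(csv_text):
    """Return GPUs whose current link generation or width is below its maximum."""
    flagged = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 5:
            continue
        try:
            idx, gen_cur, gen_max, w_cur, w_max = (int(v) for v in row)
        except ValueError:
            continue  # skip rows with non-numeric fields such as "[N/A]"
        if gen_cur < gen_max or w_cur < w_max:
            flagged.append(
                {"gpu": idx, "gen": (gen_cur, gen_max), "width": (w_cur, w_max)}
            )
    return flagged
```

Running `downgraded_links(read_links())` under load before and after a PCIe power-management change gives a direct check on whether the expected link speed was restored.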
Related Issues
- gpu-hardware-failure
- power-supply-failure
- cpu-cooler-failure
- bios-bmc-issues
- pcie-riser-failure
- system-boot-failure
- cryosparc-integration
Referenced by
- RTX 3090 — product affected by this issue (×4)
- H100 — product affected by this issue (×5)
- Overheating — co-occurs with this issue (×5)
- Andrew Rodriguez — handled tickets on this issue (×16)
- Sheng Ye — handled tickets on this issue (×1)
- H200 — product affected by this issue (×1)
- Ian Dicarlo — handled tickets on this issue (×5)
- BIOS Firmware Update — related issue (×2)
- Nam Luong — handled tickets on this issue (×8)
- RMA Workflow — co-occurs with this issue (×23)