GPU Jobs Crash Node Poweroff

Dev Account

Summary

In this group of tickets, systems stay up at idle or under light load, then freeze, hang, reboot, or power off once sustained GPU work starts. The pattern appears across training jobs, RELION, CryoSPARC, PyTorch, GPU Burn, and similar multi-GPU workloads. Root causes span thermals, power delivery, PCIe behavior, firmware, and occasionally software.

Frequency

  • 56 tickets

Common Causes

  1. Thermal overload or failed cooling components
    Repeated evidence ties crashes to high GPU temperatures, CPU thermal trips, bad airflow, liquid-cooler faults, or failed CPU coolers. Examples: #6046, #13541, #18562, #22208, #35390, and 7 more.
  2. Power-delivery instability, PSU faults, or over-current events
    Several cases mention PSU over-current alerts, knock sounds, sudden hard-offs under full GPU draw, or suspected power faults that later led to an RMA. Examples: #12346, #16700, #18951, #21208, #7016, and 8 more.
  3. PCIe / motherboard path instability under load
    Some systems only failed when all GPUs were active, with GPUs falling off the bus, downgraded PCIe link speed, bad slot behavior, or motherboard suspicion. Examples: #15022, #19229, #26199, #28711, #36958, and 7 more.
  4. Firmware / driver / OS interaction issues
    A smaller subset of systems stabilized after BIOS, BMC, PSU firmware, driver, or OS changes, or reproduced the crash only under specific software. Examples: #12209, #15528, #18688, #19627, #18951, and 6 more.
  5. No-trouble-found or not fully reproduced in-house
    Some RMAs never reproduced the customer crash cleanly, which complicated closure and shifted emphasis to extended validation. Examples: #15381, #15022, #21957, #26591, #28667.

Diagnostic Steps

  1. Confirm the failure pattern under real workload
    Check whether the crash happens only under sustained GPU load, only with all GPUs active, or only after the system warms up. Representative tickets: #11007, #15022, #26591, #5628, #6046.
  2. Collect thermal, SEL, and health telemetry during or right after failure
    Review nvidia-smi -q, ipmitool sel elist, ipmitool sdr, psensor, and journal logs for thermal trips, missing GPUs, or power events. Representative tickets: #18562, #18951, #19627, #24079, #35390.
  3. Reduce variables with staged stress tests
    Compare customer workload to gpu_burn, mprime, stress-ng, or single-GPU / partial-GPU tests to learn whether the issue is load-shape-specific. Representative tickets: #11007, #21208, #21516, #5628, #9134.
  4. Check platform paths, not just the GPUs
    Inspect PCIe lane width, slot mapping, cooler condition, DIMMs, PSU behavior, and motherboard health when symptoms include bus drops or hard-offs. Representative tickets: #11007, #19229, #26199, #3001, #15022.
  5. Escalate to RMA when the crash is credible but cannot be safely isolated remotely
    These tickets often needed depot reproduction because remote logs alone did not identify the failing component. Representative tickets: #11721, #14498, #18132, #20475, #30954.
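
The telemetry collection in step 2 can be scripted so that the same evidence is captured on every ticket. The following is a minimal sketch, not a supported tool. The nvidia-smi and ipmitool invocations come from step 2. The journalctl flags (kernel messages from the previous boot), the file names, and the helper names run_command and collect are assumptions made for illustration, and the previous-boot journal is only available when persistent journaling is enabled.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path

# Commands from step 2. The journalctl flags are an assumption: they request
# kernel messages from the previous boot, which needs a persistent journal.
COMMANDS = {
    "nvidia-smi-q": ["nvidia-smi", "-q"],
    "ipmi-sel": ["ipmitool", "sel", "elist"],
    "ipmi-sdr": ["ipmitool", "sdr"],
    "journal-prev-boot-kernel": ["journalctl", "-k", "-b", "-1", "--no-pager"],
}


def run_command(argv, timeout=60):
    """Return the command's stdout and stderr, or a short note if it cannot run."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        return f"[not installed: {argv[0]}]\n"
    except subprocess.TimeoutExpired:
        # A GPU that has dropped off the bus can make a query tool stall.
        return f"[timed out after {timeout}s: {' '.join(argv)}]\n"
    return proc.stdout + proc.stderr


def collect(out_dir, commands=COMMANDS):
    """Write one text file per command into a timestamped folder and return it."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = Path(out_dir) / f"snapshot-{stamp}"
    target.mkdir(parents=True, exist_ok=True)
    for name, argv in commands.items():
        (target / f"{name}.txt").write_text(run_command(argv))
    return target
```

Calling collect("/var/tmp/gpu-crash") right after the node comes back up would leave one folder per incident that can be attached to the ticket. The output path here is only an example.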

Solutions

  1. Repair or replace the thermal root cause
    Successful fixes included restoring correct chassis airflow, replacing failed coolers, and addressing GPU overheating. Examples: #13541, #18562, #22208, #6046, #34983.
  2. RMA the system or affected platform components
    Many tickets were resolved only after depot testing and hardware replacement at the system, motherboard, PSU, or barebone level. Examples: #12641, #14498, #16700, #18132, #36633, and 12 more.
  3. Apply firmware, BIOS, or power-management changes
    Stability improved in some cases after BIOS or BMC updates, PSU firmware updates, or disabling ASPM and related PCIe power saving. Examples: #18951, #19627, #26199, #28711.
  4. Use workload-specific software remediation when hardware evidence is weak
    A minority of tickets improved after driver updates, Windows-side changes, or application-specific configuration changes. Examples: #15528, #18688, #19627.
  5. Validate with the customer workload, not generic burn only
    Several cases showed standard tests passing while the real training or cryo-EM workload still failed, so customer reproduction steps were essential to proving the fix. Examples: #15022, #18132, #18717, #26591, #5628.
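
For the PCIe power-management change in item 3, it helps to record the link state before and after the change. The sketch below is one possible check, assuming a Linux host with the NVIDIA driver installed. It parses the CSV output of nvidia-smi --query-gpu for link generation and width, and reads the active ASPM policy in the bracketed form the kernel exposes under /sys/module/pcie_aspm/parameters/policy. The function names and dictionary keys are hypothetical. Link generation commonly drops at idle as a power-saving measure, so the query is only meaningful while the GPUs are under load.

```python
import re
import subprocess

# nvidia-smi query fields for PCIe link generation and width.
QUERY = ("index,pcie.link.gen.current,pcie.link.gen.max,"
         "pcie.link.width.current,pcie.link.width.max")


def parse_link_rows(csv_text):
    """Parse 'index, gen, gen_max, width, width_max' rows into dictionaries."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 5 or not all(f.isdigit() for f in fields):
            continue  # skip header lines and unavailable values such as [N/A]
        idx, gen, gen_max, width, width_max = map(int, fields)
        rows.append({"index": idx, "gen": gen, "gen_max": gen_max,
                     "width": width, "width_max": width_max})
    return rows


def below_max(rows):
    """Return the GPUs whose current link is slower or narrower than its maximum."""
    return [r for r in rows
            if r["gen"] < r["gen_max"] or r["width"] < r["width_max"]]


def active_aspm_policy(policy_text):
    """Extract the bracketed (active) entry from the ASPM policy string."""
    match = re.search(r"\[([^\]]+)\]", policy_text)
    return match.group(1) if match else None


def query_links():
    """Run nvidia-smi and return parsed link rows (requires the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_link_rows(out)
```

Running below_max(query_links()) during a load test, once before and once after the BIOS or kernel change, gives a simple record of whether the expected link speed was restored.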

Edge Cases

  • Cooling issue outside the GPUs themselves: CPU cooler or liquid cooler faults presented as GPU-load crashes because the trigger was overall system thermal stress, not a dead GPU. See #18562, #22208, #24079.
  • Shipping damage / mechanical contamination: transit damage and dried thermal paste in a socket created load-instability symptoms that looked like generic GPU crashes. See #3001, #19229.
  • Bus-speed degradation rather than outright device loss: one repeat-RMA case stabilized only after PCIe power-management changes restored expected link speed. See #26199.
  • Standard stress tools can miss the failure: multiple tickets report gpu_burn or other QA tools passing while the real research workload still crashed the node. See #5628, #15022, #26591.
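
Because generic burn tools can pass while the real workload still fails, one way to stage the customer's own command across GPU subsets is sketched below. It is an illustration under stated assumptions, not a depot procedure. It assumes the workload honors the CUDA_VISIBLE_DEVICES environment variable, and the stage sizes (one GPU, half, all), the log format, and the function names are hypothetical. Each stage is written to disk and synced before it starts, so that after a hard power-off the log still shows which stage was running.

```python
import os
import subprocess
import time


def gpu_stages(gpu_count):
    """Return GPU index lists: one GPU, half of them, then all of them."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    sizes = sorted({1, max(1, gpu_count // 2), gpu_count})
    return [list(range(n)) for n in sizes]


def log_line(path, text):
    """Append a line and force it to disk so it survives a sudden power-off."""
    with open(path, "a") as fh:
        fh.write(text + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def run_stage(argv, gpu_ids, timeout_s):
    """Run argv with only gpu_ids visible; report exit code and duration."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(map(str, gpu_ids)))
    start = time.monotonic()
    try:
        code = subprocess.run(argv, env=env, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        code = None  # stopped at the deadline without the node going down
    return {"gpus": gpu_ids, "exit_code": code,
            "seconds": round(time.monotonic() - start, 1)}


def run_staged(argv, gpu_count, log_path, timeout_s=3600):
    """Run the workload once per stage, logging before and after each run."""
    results = []
    for gpu_ids in gpu_stages(gpu_count):
        log_line(log_path, f"START gpus={gpu_ids}")
        result = run_stage(argv, gpu_ids, timeout_s)
        log_line(log_path, f"END {result}")
        results.append(result)
    return results
```

A START line with no matching END after the node reboots points at the stage that brought the system down, which helps separate single-GPU faults from failures that need every GPU active.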

Related Issues

Referenced by

  • RTX 3090 — product affected by this issue (×4)
  • H100 — product affected by this issue (×5)
  • Overheating — co-occurs with this issue (×5)
  • Andrew Rodriguez — handled tickets on this issue (×16)
  • Sheng Ye — handled tickets on this issue (×1)
  • H200 — product affected by this issue (×1)
  • Ian Dicarlo — handled tickets on this issue (×5)
  • BIOS Firmware Update — related issue (×2)
  • Nam Luong — handled tickets on this issue (×8)
  • RMA Workflow — co-occurs with this issue (×23)
