Workload Resource Exhaustion System Crash
Summary
Systems that appear to shut down, kernel panic, run out of memory (OOM), or become unstable under load may be destabilized by user workloads, runaway agents, update fallout, or resource contention. Outcomes range from customer-side workload cleanup to full-system RMA when instability persists (#43599, #44285).
Frequency
- 2 tickets
Common Causes
- Runaway CPU-consuming agents or jobs
- Memory exhaustion/kernel panic followed by persistent OS or network corruption
A customer reported repeated load-triggered server crashes/shutdowns, then later found that a few agents were hogging CPU resources and the server remained stable after the workload issue was addressed (#43599).
A DGX Spark suffered a RAM-exhaustion kernel panic during LLVM/source-library builds, then remained prone to OOM errors and NetworkManager/SSH disruption after a DGX Dashboard update attempt reported some unsuccessful updates (#44285).
Diagnostic Steps
- Confirm whether the trigger is a specific workload or user process
- Check resource utilization before assuming hardware failure
- Check post-crash OS and access integrity
- Continue normal power, thermal, and RMA triage if symptoms persist
Ask what changed immediately before the shutdowns, kernel panic, or OOM state and whether a specific user job, daemon, build, or agent consistently precedes the event (#43599, #44285).
Review CPU load, memory pressure/OOM history, process tables, job scheduler state, logs, and whether the system stabilizes after stopping runaway processes or reducing load (#43599, #44285).
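The resource review above can be partly scripted. The following is a minimal sketch, assuming a Linux host with procps `ps` and systemd `journalctl`. The helper names (`parse_ps`, `heavy_processes`, `find_oom_kills`, `report`) and the 80% CPU threshold are hypothetical choices for illustration, not part of any support tooling. The OOM-killer message wording differs between kernel versions, so the pattern is an assumption to adjust per system.

```python
import re
import subprocess

# Assumed wording of the OOM-killer line on recent Linux kernels; older
# kernels phrase it differently, so adjust the pattern if nothing matches.
OOM_RE = re.compile(r"Out of memory: Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return (pid, command) pairs for each OOM-killer line in log_text."""
    return [(int(pid), comm) for pid, comm in OOM_RE.findall(log_text)]


def parse_ps(ps_output):
    """Parse `ps -eo pid,pcpu,pmem,comm` output into a list of dicts."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # ignore malformed or truncated lines
        pid, pcpu, pmem, comm = parts
        rows.append({"pid": int(pid), "cpu": float(pcpu),
                     "mem": float(pmem), "comm": comm.strip()})
    return rows


def heavy_processes(rows, cpu_threshold=80.0):
    """Return rows at or above cpu_threshold percent CPU, highest first.

    Note: ps reports CPU time divided by process lifetime, not an
    instantaneous reading, so short bursts can be understated.
    """
    hits = [r for r in rows if r["cpu"] >= cpu_threshold]
    return sorted(hits, key=lambda r: r["cpu"], reverse=True)


def run_cmd(argv):
    """Run a command and return its stdout, or '' if it is not installed."""
    try:
        return subprocess.run(argv, capture_output=True, text=True,
                              check=False).stdout
    except FileNotFoundError:
        return ""


def report():
    """Print CPU-heavy processes and any OOM kills found in the kernel log."""
    for row in heavy_processes(parse_ps(run_cmd(["ps", "-eo", "pid,pcpu,pmem,comm"]))):
        print(row)
    # Reading the kernel log may require elevated privileges.
    for pid, comm in find_oom_kills(run_cmd(["journalctl", "-k", "--no-pager"])):
        print(f"OOM kill: pid={pid} comm={comm}")
```

Calling `report()` on the affected host before and after stopping a suspect job gives a rough comparison of whether the workload is the trigger. It does not replace scheduler state or full log review.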
If the system remains unstable after restart, verify update status, NetworkManager/SSH health, boot/kernel logs, and whether remote access corruption blocks normal triage (#44285).
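As a rough illustration of the service-health part of this check, the sketch below asks systemd whether NetworkManager and the SSH service are active. It assumes a systemd-based host. The unit names are assumptions (`ssh` is typical on Debian/Ubuntu-based systems, `sshd` on RHEL-family systems), and the helper names are hypothetical. If SSH itself is down, it would need to be run from a local console.

```python
import subprocess

# Assumed unit names; replace "ssh" with "sshd" where that is the unit name.
DEFAULT_UNITS = ("NetworkManager", "ssh")


def systemctl_is_active(unit):
    """Return the state string printed by `systemctl is-active <unit>`."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", unit],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return "unknown"  # systemctl is not available on this host
    return result.stdout.strip() or "unknown"


def unhealthy_units(units=DEFAULT_UNITS, probe=systemctl_is_active):
    """Map each unit that is not 'active' to its reported state."""
    states = {unit: probe(unit) for unit in units}
    return {unit: state for unit, state in states.items() if state != "active"}
```

An empty result from `unhealthy_units()` means both services report as active. It does not confirm that update state or network configuration is intact, so boot and kernel logs still need review.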
If the system still powers off, runs out of memory, or cannot be managed after workload cleanup and OS/update checks, continue checking fans, thermal paste/cooler contact, PSU behavior, SEL/BMC logs, and facility power, and assess whether a system RMA is needed (#43599, #44285).
Solutions
- Control or stop the runaway workload when the system stabilizes afterward
- Escalate to full-system RMA when load-triggered instability persists or corrupts access paths
- Use best-effort thermal checks on out-of-warranty systems before recommending parts
One confirmed resolution was customer-side identification of CPU-hogging agents, after which the server was reported to be running normally without hardware replacement (#43599).
A DGX Spark case moved to full-system RMA after an LLVM-build RAM-exhaustion kernel panic was followed by repeated OOMs, manual restarts, NetworkManager/SSH disruption, and an unsuccessful update state (#44285).
For one out-of-warranty system, Support suggested observing fan behavior and considering CPU thermal-paste replacement before the customer reported a workload-side resolution (#43599).
Edge Cases
- Looks like power or thermal failure at first: delayed restart after shutdown and load-only failures can strongly resemble PSU or thermal protection behavior, but one case resolved through workload/resource management with no confirmed PSU or cooling repair (#43599).
- Workload-triggered does not always mean workload-only resolution: a RAM/OOM-triggered kernel panic can leave an appliance unstable enough to justify full-system RMA when update state, SSH, or NetworkManager are corrupted and the system remains unusable (#44285).