GPU Troubleshooting Guide: Crashes During Idle

Alexander Hill

Overview

GPU crashes during idle periods present unique troubleshooting challenges because they occur when the system is not under load. These crashes often manifest as system freezes, black screens, spontaneous reboots, or application failures that happen specifically when the GPU is in a low-power state.

The most common causes of idle GPU crashes include:

  1. Power state transitions - GPUs can become unstable when transitioning between different power states
  2. Memory or core clock fluctuations - Unstable clock speeds during idle can cause system instability
  3. Thermal issues - Some GPUs may overheat even at idle if cooling is inadequate
  4. Driver incompatibilities - Certain driver versions may have bugs specifically related to idle states
  5. PCI Express power management issues - Problems with how the PCIe bus handles low-power states

This guide takes a systematic approach to identifying and resolving these issues, starting with basic configuration and progressing to more advanced troubleshooting. Work through the steps in order.


Prerequisites

Before beginning the troubleshooting process, ensure these baseline requirements are met:

Driver Management

  • Ensure you're using the latest stable NVIDIA driver
  • If problems persist after updating, consider rolling back to a previously stable version

Boot Parameters

Add the following parameters to the GRUB_CMDLINE_LINUX line in /etc/default/grub:

  • pcie_aspm=off - Disables Active State Power Management
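  • acpi=off - Disables the Advanced Configuration and Power Interface entirely. This is a drastic setting that can affect power management and hardware detection, so treat it as a temporary diagnostic step and remove it once the cause is found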
  • nvidia-drm.modeset=1 - Enables kernel mode setting

Example:

GRUB_CMDLINE_LINUX="crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=28070ec6-f366-44fc-90c9-1ec719f63153 rhgb quiet pcie_aspm=off acpi=off nvidia-drm.modeset=1"
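
When editing this line by hand, it is easy to add a parameter twice or to drop an existing one. The following is a minimal sketch of an idempotent append. The name add_param is a hypothetical helper, not part of GRUB, and the starting value "rhgb quiet" is only an illustration. The sketch builds the string and prints it; it does not write to /etc/default/grub.

```shell
# Hypothetical helper: append a kernel parameter to a space-separated
# parameter string only if it is not already present (exact word match).
add_param() {
  case " $1 " in
    *" $2 "*) printf '%s\n' "$1" ;;          # already present, leave unchanged
    *)        printf '%s\n' "${1:+$1 }$2" ;; # append, adding a space if needed
  esac
}

# Start from an illustrative existing value and add the three parameters.
params="rhgb quiet"
for p in pcie_aspm=off acpi=off nvidia-drm.modeset=1; do
  params=$(add_param "$params" "$p")
done

# Print the resulting line for review before copying it into the config file.
echo "GRUB_CMDLINE_LINUX=\"$params\""
```

Running the loop a second time leaves the string unchanged, which is the point of the exact word match.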

Benefits of nvidia-drm.modeset=1:

  • Smooth kernel-mode framebuffer console (splash screen, VT switching)
  • Fewer display issues with hybrid graphics / Wayland
  • Required for some desktop environments (GNOME on Wayland)

Note: On most modern systems, enabling nvidia-drm.modeset=1 is recommended, especially if you see display issues, a black screen on boot, or idle crashes.

 

Update GRUB Configuration

RHEL/CentOS/Rocky/Fedora

sudo grub2-mkconfig -o /boot/grub2/grub.cfg

Ubuntu/Debian

sudo update-grub

Reboot your system

sudo reboot
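
After the reboot, it is worth confirming that the running kernel actually received the parameters. The sketch below checks a kernel command line for the three parameters from this guide. The function names has_param and report_params are hypothetical helpers, and the check assumes a Linux system that exposes /proc/cmdline.

```shell
# Hypothetical helper: succeed if parameter $2 appears as a whole word
# in the kernel command line string $1.
has_param() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Print one "present:" or "missing:" line per expected parameter.
report_params() {
  for p in pcie_aspm=off acpi=off nvidia-drm.modeset=1; do
    if has_param "$1" "$p"; then
      echo "present: $p"
    else
      echo "missing: $p"
    fi
  done
}

# Check the command line of the currently running kernel, if readable.
if [ -r /proc/cmdline ]; then
  report_params "$(cat /proc/cmdline)"
fi
```

If a parameter is reported as missing, the GRUB configuration was probably not regenerated or a different boot entry was used.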

 

Troubleshooting Steps

1. Enable Persistence Mode

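Persistence mode keeps the NVIDIA driver loaded and the GPU initialized even when no applications are using it, which avoids repeated driver teardown and re-initialization while the system is idle: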

sudo nvidia-smi -pm 1
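
To confirm the setting took effect on every GPU, the persistence state can be queried in CSV form. This is a small sketch, assuming nvidia-smi is on the PATH; gpus_without_persistence is a hypothetical helper name that filters the query output.

```shell
# Hypothetical helper: read "index, persistence_mode" CSV lines on stdin
# and print the index of every GPU whose mode is not "Enabled".
gpus_without_persistence() {
  awk -F', *' '$2 != "Enabled" { print $1 }'
}

# Query all GPUs; no output means persistence mode is enabled everywhere.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader \
    | gpus_without_persistence
fi
```

Note that a setting made with nvidia-smi -pm 1 does not survive a reboot, so the check is worth repeating after the next restart.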
 

2. Check System Logs

Examine logs to identify error patterns:

dmesg | grep -i "nvidia\|gpu\|err\|fail\|nvrm\|xid"
journalctl -b | grep -i "nvidia\|gpu\|error\|fail"
grep -i "nvidia\|gpu\|error\|fail" /var/log/Xorg.0.log
 

Look for common error patterns like:

  • XID errors (particularly XID 79, 62, or 31)
  • NVRM errors
  • GPU has fallen off the bus
  • GPU timeout or reset messages (the Linux counterpart of Windows TDR, Timeout Detection and Recovery, events)
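
To see at a glance which XID codes occur and how often, the kernel log can be reduced to a per-code count. The sketch below assumes the usual driver message layout "NVRM: Xid (PCI:...): <code>, ..."; count_xids is a hypothetical helper name, and lines that do not match that layout are ignored.

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "<xid code> <count>" for every XID code found, sorted by code.
count_xids() {
  sed -nE 's/.*NVRM: Xid \([^)]*\): ([0-9]+).*/\1/p' \
    | sort -n \
    | uniq -c \
    | awk '{ print $2 " " $1 }'
}

# Summarize the current kernel log (may require root on some systems).
if command -v dmesg >/dev/null 2>&1; then
  dmesg 2>/dev/null | count_xids
fi
```

A code that appears repeatedly around the time of each idle crash is the most useful one to look up first.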

3. Monitor GPU Temperature

Check if thermal issues are causing the crashes:

nvidia-smi -q -d TEMPERATURE

If idle temperatures are high:

  • Set a dynamic fan speed curve with a minimum of 30% to ensure constant cooling
  • Consider improved case ventilation if temperatures are consistently high
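
A single reading says little about idle behavior, so it can help to flag only the samples above a limit. The following sketch filters CSV output from nvidia-smi; flag_hot is a hypothetical helper name, and the 60 degree limit is an assumption for illustration, not a vendor specification. Check the limits documented for your GPU.

```shell
# Hypothetical helper: read "timestamp, temperature" CSV lines on stdin and
# print the samples whose temperature exceeds the limit given as $1 (deg C).
flag_hot() {
  awk -F', *' -v limit="$1" '$2 + 0 > limit + 0 { print $1 ": " $2 " C" }'
}

# Take one sample per GPU and report any that is above the assumed limit.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=timestamp,temperature.gpu \
    --format=csv,noheader,nounits | flag_hot 60
fi
```

Running this from a timer or a loop while the machine is idle gives a record to compare against the crash times.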

4. Check for Memory Clock Fluctuations

Memory clock fluctuations during idle can cause instability:

 
watch -n 1 nvidia-smi -q -d CLOCK

 

If you notice fluctuations when idle, check supported clock rates:

nvidia-smi -i 0 -q -d SUPPORTED_CLOCKS | head -20
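
Watching the output by eye makes it hard to judge how large the swings are. As a rough sketch, the memory clock can be sampled a few times and reduced to a minimum, maximum, and spread. clock_range is a hypothetical helper name, and the ten one-second samples on GPU 0 are arbitrary choices.

```shell
# Hypothetical helper: read one clock value (MHz) per line on stdin and
# print the minimum, maximum, and the difference between them.
clock_range() {
  awk 'NR == 1 { min = max = $1 + 0 }
       { v = $1 + 0; if (v < min) min = v; if (v > max) max = v }
       END { if (NR > 0) print "min=" min " max=" max " spread=" (max - min) }'
}

# Sample the memory clock of GPU 0 once per second for ten seconds.
if command -v nvidia-smi >/dev/null 2>&1; then
  for _ in 1 2 3 4 5 6 7 8 9 10; do
    nvidia-smi -i 0 --query-gpu=clocks.mem --format=csv,noheader,nounits
    sleep 1
  done | clock_range
fi
```

Some variation is expected as the GPU changes power states; the spread is mainly useful for comparing before and after a configuration change.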
 

5. Lock Memory and Graphics Clocks

If memory or graphics clocks are fluctuating, lock them to stable values:

 
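# Locking clocks requires root privileges.
# Lock memory clocks (adjust values to match your GPU)
sudo nvidia-smi -i 0 --lock-memory-clocks=8001,8001
# Lock graphics clocks (adjust values to match your GPU)
sudo nvidia-smi -i 0 -lgc 2100,2100
# To undo the locks later:
#   sudo nvidia-smi -i 0 -rmc
#   sudo nvidia-smi -i 0 -rgc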
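
Before locking a clock, it can be useful to confirm that the chosen value is one the GPU reports as supported. The sketch below checks a requested memory clock against the supported list. is_supported_clock is a hypothetical helper name, and 8001 is only the example value used in this guide.

```shell
# Hypothetical helper: read one supported clock (MHz) per line on stdin and
# print "supported" or "unsupported" for the requested clock given as $1.
is_supported_clock() {
  awk -v want="$1" '$1 + 0 == want + 0 { found = 1 }
                    END { print (found ? "supported" : "unsupported") }'
}

# Compare the example value against the memory clocks GPU 0 reports.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -i 0 --query-supported-clocks=memory \
    --format=csv,noheader,nounits | sort -u | is_supported_clock 8001
fi
```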
 

6. Disable Runtime Power Management

Runtime power management can cause issues when transitioning power states:

 
# Replace XX with your GPU's PCI bus ID
echo on | sudo tee /sys/bus/pci/devices/0000:XX:00.0/power/control

 

Possible values of the power/control file:

  • auto → Runtime PM is enabled (kernel may power down the device during idle)
  • on → Runtime PM is disabled (device stays fully powered even when idle)
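
To find the right device path and see the current setting, the PCI devices in sysfs can be filtered by NVIDIA's vendor ID (0x10de). This is a minimal sketch; list_nvidia_runtime_pm is a hypothetical helper name, and its optional argument exists only so the function can be pointed at a different directory.

```shell
# Hypothetical helper: print "<pci address> <power/control value>" for every
# PCI device under the given directory whose vendor ID is NVIDIA (0x10de).
list_nvidia_runtime_pm() {
  root=${1:-/sys/bus/pci/devices}
  for dev in "$root"/*; do
    [ -r "$dev/vendor" ] || continue
    [ "$(cat "$dev/vendor")" = "0x10de" ] || continue
    [ -r "$dev/power/control" ] || continue
    echo "$(basename "$dev") $(cat "$dev/power/control")"
  done
}

# List the NVIDIA functions on this machine (GPU and, if present, its audio device).
list_nvidia_runtime_pm
```

A value written to power/control through sysfs is not kept across reboots, so rerun the listing after a restart to see whether the setting needs to be applied again.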

7. Hardware Testing

If software solutions don't resolve the issue:

  1. Test the GPU in another system if possible
  2. Monitor system power with a wattmeter during idle periods
  3. Run memory tests on both system RAM and GPU memory
  4. Consider a different power supply if available

When to Consider RMA

Consider contacting the manufacturer about a replacement (RMA) if you have exhausted all troubleshooting steps and:

  • The problem persists across multiple driver versions
  • The crash reproduces consistently during idle
  • Other GPUs work fine in the same system
  • The problem follows the GPU when moved to another system

Summary Table

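  Step                                       Command or action
  Prerequisite: Update/Rollback Drivers      Install the latest stable NVIDIA driver, or roll back
  Prerequisite: Configure Boot Parameters    pcie_aspm=off acpi=off nvidia-drm.modeset=1
  Prerequisite: Update GRUB                  sudo grub2-mkconfig -o /boot/grub2/grub.cfg or sudo update-grub
  Enable Persistence Mode                    sudo nvidia-smi -pm 1
  Check System Logs                          dmesg, journalctl -b, /var/log/Xorg.0.log
  Monitor Temperature                        nvidia-smi -q -d TEMPERATURE
  Check Clock Fluctuations                   watch -n 1 nvidia-smi -q -d CLOCK
  Lock Memory Clocks                         sudo nvidia-smi -i 0 --lock-memory-clocks=8001,8001
  Lock Graphics Clocks                       sudo nvidia-smi -i 0 -lgc 2100,2100
  Disable Runtime PM                         echo on | sudo tee /sys/bus/pci/devices/0000:XX:00.0/power/control
  Hardware Testing                           Test the GPU in another system; check power supply and memory
  Consider RMA                               Contact the manufacturer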
 
