Overview
GPU crashes during idle periods present unique troubleshooting challenges because they occur when the system is not under load. These crashes often manifest as system freezes, black screens, spontaneous reboots, or application failures that happen specifically when the GPU is in a low-power state.
The most common causes of idle GPU crashes include:
- Power state transitions - GPUs can become unstable when transitioning between different power states
- Memory or core clock fluctuations - Unstable clock speeds during idle can cause system instability
- Thermal issues - Some GPUs may overheat even at idle if cooling is inadequate
- Driver incompatibilities - Certain driver versions may have bugs specifically related to idle states
- PCI Express power management issues - Problems with how the PCIe bus handles low-power states
This guide takes a systematic approach to identifying and resolving these issues, starting with basic configuration and progressing to more advanced troubleshooting. Follow the steps in order so that each change can be evaluated on its own.
Prerequisites
Before beginning the troubleshooting process, ensure these baseline requirements are met:
Driver Management
- Ensure you're using the latest stable NVIDIA driver
- If problems persist after updating, consider rolling back to a previously stable version
Boot Parameters
Add the following parameters to the GRUB_CMDLINE_LINUX line in /etc/default/grub:
pcie_aspm=off
- Disables Active State Power Management
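acpi=off
- Disables the Advanced Configuration and Power Interface entirely. This is a drastic setting that affects the whole system, not just the GPU, so remove it again if it makes no difference.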
nvidia-drm.modeset=1
- Enables kernel mode setting
Example:
GRUB_CMDLINE_LINUX="crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=28070ec6-f366-44fc-90c9-1ec719f63153 rhgb quiet pcie_aspm=off acpi=off nvidia-drm.modeset=1"
Benefits of nvidia-drm.modeset=1:
- Smooth kernel-mode framebuffer console (splash screen, VT switching)
- Fewer display issues with hybrid graphics / Wayland
- Required for some desktop environments (GNOME on Wayland)
Note: Enabling nvidia-drm.modeset=1 is recommended on most modern systems, especially those with display issues, a black screen on boot, or idle crashes.
Update GRUB Configuration
RHEL/CentOS/Rocky/Fedora
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
Ubuntu/Debian
sudo update-grub
Reboot your system
sudo reboot
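After the reboot, it may be worth confirming that the parameters actually reached the kernel. The sketch below compares a kernel command line against the parameters listed above. The function name missing_boot_params is illustrative and not part of any tool.

```shell
# Print every expected parameter that is absent from a kernel command line.
# $1 is the command line text; the remaining arguments are the parameters to look for.
missing_boot_params() {
    cmdline=$1
    shift
    for param in "$@"; do
        case " $cmdline " in
            *" $param "*) ;;          # present as a whole word
            *) echo "$param" ;;       # not found
        esac
    done
}

# Check the running kernel (Linux only); no output means every parameter is active.
if [ -r /proc/cmdline ]; then
    missing_boot_params "$(cat /proc/cmdline)" pcie_aspm=off acpi=off nvidia-drm.modeset=1
fi
```

Any parameter that is printed was not picked up, which usually means the GRUB configuration was not regenerated or a different boot entry was used.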
Troubleshooting Steps
1. Enable Persistence Mode
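Persistence mode keeps the NVIDIA driver loaded even when no client is using the GPU, which helps prevent GPUs from entering problematic low-power states when idle. Enable it as root with nvidia-smi -pm 1.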
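A minimal sketch of enabling persistence mode and confirming that it took effect on every GPU. The helper name all_persistence_enabled is an assumption made for this example; the nvidia-smi options are the standard -pm switch and the persistence_mode query field.

```shell
# Succeed only if stdin is non-empty and every line reads "Enabled".
all_persistence_enabled() {
    awk '{ gsub(/[ \t\r]/, ""); n++; if ($0 != "Enabled") bad = 1 }
         END { if (n > 0 && !bad) exit 0; exit 1 }'
}

# Only touch the GPU when the NVIDIA tools are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    sudo nvidia-smi -pm 1    # enable persistence mode on all GPUs (needs root)
    if nvidia-smi --query-gpu=persistence_mode --format=csv,noheader | all_persistence_enabled; then
        echo "Persistence mode is enabled on all GPUs"
    else
        echo "Persistence mode is not enabled on every GPU" >&2
    fi
fi
```

The setting made with -pm does not survive a reboot, so it has to be reapplied at startup, for example from a boot script or by running the nvidia-persistenced daemon where your distribution packages it.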
2. Check System Logs
Examine logs to identify error patterns:
sudo dmesg | grep -i "nvidia\|gpu\|err\|fail\|nvrm\|xid"
journalctl -b | grep -i "nvidia\|gpu\|error\|fail"
grep -i "nvidia\|gpu\|error\|fail" /var/log/Xorg.0.log
Look for common error patterns like:
- XID errors (particularly XID 79, 62, or 31)
- NVRM errors
- GPU has fallen off the bus
- TDR (Timeout Detection and Recovery) events (a Windows mechanism; on Linux, GPU hangs typically show up as Xid errors instead)
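To see at a glance which Xid codes occur and how often, the kernel log can be summarized with a small filter. This is a sketch that assumes the usual "NVRM: Xid (PCI:...): <code>, ..." message layout; the function name summarize_xids is hypothetical.

```shell
# Summarize NVIDIA Xid codes found in kernel log text read from stdin.
# Prints one "xid=<code> count=<n>" line per distinct code, sorted by code.
summarize_xids() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\).*/\1/p' |
        sort -n | uniq -c |
        awk '{ printf "xid=%s count=%s\n", $2, $1 }'
}

# Usage:
#   sudo dmesg | summarize_xids
#   journalctl -k -b | summarize_xids
```

A code that repeats around the time of each crash is a better lead than a single isolated entry.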
3. Monitor GPU Temperature
Check if thermal issues are causing the crashes:
nvidia-smi -q -d TEMPERATURE
If idle temperatures are higher than expected:
- Set a dynamic fan speed curve with a minimum of 30% to ensure constant cooling
- Consider improved case ventilation if temperatures are consistently high
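For repeated checks, the temperature query can be reduced to a list of GPUs above a chosen limit. The sketch below is one way to do that; the function name gpus_above and the 60 degree limit are assumptions for illustration, not recommended values.

```shell
# Read "index, temperature" CSV lines on stdin and print the GPUs hotter than $1 (degrees C).
gpus_above() {
    awk -F', *' -v limit="$1" '$2 + 0 > limit + 0 { printf "GPU %s: %s C\n", $1, $2 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # 60 is an arbitrary idle threshold; adjust it for your card and cooling.
    nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | gpus_above 60
fi
```

No output means every GPU is at or below the limit.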
4. Check for Memory Clock Fluctuations
Memory clock fluctuations during idle can cause instability:
watch -n 1 nvidia-smi -q -d CLOCK
If you notice fluctuations when idle, check supported clock rates:
nvidia-smi -i 0 -q -d SUPPORTED_CLOCKS | head -20
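Watching the output by eye makes it easy to miss short excursions. As a sketch, the memory clock can be sampled for a fixed period and reduced to its minimum and maximum; the function name clock_range and the 30 second window are assumptions for this example.

```shell
# Read one clock value (MHz) per line and report the range that was seen.
clock_range() {
    awk '{ v = $1 + 0
           if (NR == 1 || v < min) min = v
           if (NR == 1 || v > max) max = v }
         END { if (NR) printf "min=%d max=%d samples=%d\n", min, max, NR }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Sample the memory clock of GPU 0 once per second for 30 seconds.
    timeout 30 nvidia-smi -i 0 --query-gpu=clocks.mem --format=csv,noheader,nounits -l 1 | clock_range
fi
```

A large gap between min and max on an otherwise idle system indicates the clock is moving between power states during the sampling window.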
5. Lock Memory and Graphics Clocks
If memory or graphics clocks are fluctuating, lock them to stable values:
# Lock memory clocks (requires root; adjust values to match your GPU)
sudo nvidia-smi -i 0 --lock-memory-clocks=8001,8001
# Lock graphics clocks (requires root; adjust values to match your GPU)
sudo nvidia-smi -i 0 --lock-gpu-clocks=2100,2100
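The values passed to the lock options should come from the GPU's own list of supported clocks. The sketch below picks the pair with the highest memory clock and prints the matching lock commands instead of running them. Choosing the highest pair is an assumption made for illustration, and highest_clock_pair is a hypothetical helper name; any supported pair can be substituted.

```shell
# Read "memory, graphics" CSV lines (MHz) and print the pair with the highest
# memory clock, breaking ties by the highest graphics clock.
highest_clock_pair() {
    awk -F', *' '{ m = $1 + 0; g = $2 + 0
                   if (m > bm || (m == bm && g > bg)) { bm = m; bg = g } }
                 END { if (NR) print bm, bg }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    pair=$(nvidia-smi -i 0 --query-supported-clocks=memory,graphics \
               --format=csv,noheader,nounits | highest_clock_pair)
    mem=${pair% *}
    gfx=${pair#* }
    # Print the commands for review rather than applying them directly.
    echo "sudo nvidia-smi -i 0 --lock-memory-clocks=$mem,$mem"
    echo "sudo nvidia-smi -i 0 --lock-gpu-clocks=$gfx,$gfx"
fi

# To undo the locks later:
#   sudo nvidia-smi -i 0 --reset-memory-clocks
#   sudo nvidia-smi -i 0 --reset-gpu-clocks
```

Locked clocks keep the GPU out of its lowest power state, so expect higher idle power draw and temperature while the locks are in place.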
6. Disable Runtime Power Management
Runtime power management can cause issues when transitioning power states:
# Replace XX with your GPU's PCI bus ID
echo on | sudo tee /sys/bus/pci/devices/0000:XX:00.0/power/control
Possible values of power/control:
auto → Runtime PM is enabled (the kernel may power down the device during idle)
on → Runtime PM is disabled (the device stays fully powered even when idle)
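Finding the right sysfs directory is the error-prone part, because nvidia-smi prints the bus ID with an 8-digit domain while sysfs uses 4 digits in lowercase. The sketch below converts between the two and reports the current setting; both function names are hypothetical helpers for this example.

```shell
# Convert a bus ID as printed by nvidia-smi (8-digit domain, e.g. 00000000:3B:00.0)
# into the form used under /sys/bus/pci/devices (4-digit domain, lowercase hex).
to_sysfs_addr() {
    printf '%s\n' "$1" | cut -c5- | tr 'A-F' 'a-f'
}

# Describe the runtime PM setting of the PCI device directory given as $1.
runtime_pm_state() {
    case "$(cat "$1/power/control")" in
        on)   echo "runtime PM disabled" ;;
        auto) echo "runtime PM enabled" ;;
        *)    echo "unknown" ;;
    esac
}

if command -v nvidia-smi >/dev/null 2>&1; then
    addr=$(to_sysfs_addr "$(nvidia-smi -i 0 --query-gpu=pci.bus_id --format=csv,noheader)")
    runtime_pm_state "/sys/bus/pci/devices/$addr"
fi
```

A value written to power/control is lost at the next reboot, so the change has to be reapplied at startup if it turns out to help.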
7. Hardware Testing
If software solutions don't resolve the issue:
- Test the GPU in another system if possible
- Monitor system power with a wattmeter during idle periods
- Run memory tests on both system RAM and GPU memory
- Consider a different power supply if available
When to Consider RMA
If you've exhausted all troubleshooting steps and:
- The problem persists across multiple driver versions
- The issue occurs with consistent reproducibility during idle
- Other GPUs work fine in the same system
- The problem follows the GPU when moved to another system
It may be time to contact the manufacturer about a replacement (RMA).
Summary Table
✅ | Prerequisite: Update/Rollback Drivers
✅ | Prerequisite: Configure Boot Parameters
✅ | Prerequisite: Update GRUB
✅ | Enable Persistence Mode
✅ | Check System Logs
✅ | Monitor Temperature
✅ | Check Clock Fluctuations
✅ | Lock Memory Clocks
✅ | Lock Graphics Clocks
✅ | Disable Runtime PM
✅ | Hardware Testing
✅ | Consider RMA