How to Look for GPU Errors

Alexander Hill

Overview

Identifying and diagnosing GPU errors is essential for maintaining system stability and performance, especially in high-performance computing environments. This guide outlines the steps to check for common GPU errors, investigate hardware issues, and interpret error messages from NVIDIA GPUs.

Prerequisites

  • A system with one or more NVIDIA GPUs installed
  • NVIDIA GPU drivers properly installed
  • Terminal or command-line access to your system
  • Administrative or root privileges (for some commands)

Steps

Check GPU Visibility

1. Verify all GPUs are visible to the system:

lspci -tvnn | less

This command displays the PCI device tree. Look for entries containing "NVIDIA Corporation" to confirm that all GPUs are recognized by the system at the hardware level.
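
As a quick sanity check, you can also count how many NVIDIA PCI devices the system reports and compare that with the number of cards you expect. Keep in mind that a single card can expose more than one PCI function (for example a VGA controller plus an audio device), so treat this as a rough check rather than an exact GPU count:

lspci -nn | grep -ci 'nvidia'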


Basic GPU Information

2. Check basic GPU information and driver version:

nvidia-smi --query-gpu=name,index,driver_version,serial --format=csv

This returns essential information about each GPU, including the following (a periodic logging variant is sketched after the list):

    • GPU name/model
    • Index number (for multi-GPU systems)
    • Driver version
    • Serial number (useful for warranty and support)
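
If you suspect an intermittent problem, the same query can be logged on a schedule and reviewed later. The 60-second interval and the log file name below are only examples; adjust them for your environment:

nvidia-smi --query-gpu=timestamp,index,name,driver_version,temperature.gpu --format=csv,noheader -l 60 >> gpu-health.csv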

Monitor GPU Health Metrics

3. Check critical GPU health metrics:

nvidia-smi

Pay close attention to the following (a scripted threshold check is sketched after the list):

    • Fan speed (abnormal values may indicate cooling issues)
    • Temperature (consistently high temperatures can indicate cooling problems)
    • Power usage vs capacity (unusual power patterns can signal hardware issues)
    • Memory usage (unexpected memory consumption may indicate memory leaks)
    • GPU utilization (mismatches between utilization and expected workload can signal issues)
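
For unattended monitoring, the same metrics can be pulled in machine-readable form and compared against a threshold. The 85 °C limit below is only an illustrative value; check the thermal specifications for your GPU model:

nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits |
while IFS=', ' read -r idx temp; do
  [ "$temp" -gt 85 ] && echo "WARNING: GPU $idx is at ${temp} C"
done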

Check for ECC Errors

4. Monitor ECC (Error Correction Code) errors:

nvidia-smi -q -d ECC

This displays:

  • Whether ECC is enabled or disabled
  • Volatile ECC errors (errors since last driver reload)
  • Aggregate ECC errors (lifetime counts that persist across reboots and driver reloads)

Any non-zero values could indicate GPU memory issues or hardware problems.
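
The full -q -d ECC report is long. If you only need the headline counters for scripting or a quick glance, the same data is available through the query interface (the field names below are the standard nvidia-smi ECC counters; GPUs without ECC will report N/A):

nvidia-smi --query-gpu=index,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv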


Check System Logs for GPU Errors

5. Examine kernel messages for GPU-related errors:

dmesg | grep -iE 'nvidia|drm|nvrm'

This shows kernel messages related to NVIDIA drivers and GPUs, which may reveal driver crashes or hardware issues.
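
Xid messages (explained in the Error Interpretation section below) are the most important entries to look for here. To show only those, with human-readable timestamps:

dmesg -T | grep -i 'xid'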


6. Review system logs using journalctl:

journalctl -p 3 -xb

This shows error-level (priority 3) messages from the current boot, which may include critical GPU errors.
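
To narrow this down to kernel messages from the NVIDIA driver on the current boot, you can combine the kernel filter with a pattern match (the --grep option requires a reasonably recent systemd; otherwise pipe through grep as in the previous step):

journalctl -k -b --grep='NVRM|Xid'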


Check for GPU Processing Errors

7. Check for CUDA errors in applications:

export CUDA_DEVICE=0 # Target a specific GPU 
export CUDA_VISIBLE_DEVICES=0,1 # Limit visible GPUs
cuda-memcheck ./your_application

This helps detect memory access errors when running CUDA applications.
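
Note that cuda-memcheck is deprecated in recent CUDA toolkits in favor of compute-sanitizer. If your toolkit ships compute-sanitizer, a roughly equivalent invocation (memcheck is its default tool) is:

compute-sanitizer --tool memcheck ./your_application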


Reset Problematic GPUs

8. Reset a GPU if it becomes unresponsive:

sudo nvidia-smi -i [GPU_INDEX] -r

Replace [GPU_INDEX] with the index of the problematic GPU. This performs a software reset without restarting the system.
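
The reset only succeeds when nothing is using the GPU, so check for and stop any remaining compute processes first. A minimal sequence, assuming the problem device is index 0, might look like this:

nvidia-smi -i 0 --query-compute-apps=pid,process_name --format=csv # should list no processes
sudo nvidia-smi -i 0 -r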


Advanced Diagnostics with DCGM

9. Use NVIDIA Data Center GPU Manager (DCGM) for comprehensive diagnostics:

dcgmi diag -r 3

This runs level 3 diagnostics (the comprehensive test suite) on all GPUs, checking for the following (lighter-weight checks are sketched after the list):

    • Memory errors
    • GPU stress test issues
    • Hardware failures
    • Performance anomalies
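
A level 3 run can take a long time. Provided the DCGM host engine (nv-hostengine) is running, you can first confirm that DCGM sees every GPU and run the short test suite before committing to the full diagnostic:

dcgmi discovery -l # list the GPUs DCGM can manage
dcgmi diag -r 1 # quick (level 1) test suite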

Error Interpretation

Common error messages and their typical causes:

  • "NVRM: GPU at XX:XX.X has fallen off the bus" (Xid 79): Physical connection, power, or GPU hardware failure
  • "Xid errors": Critical driver-reported errors requiring attention; common codes include:
    • Xid 13: Graphics engine exception, often triggered by an application fault
    • Xid 31: GPU memory page fault
    • Xid 43/45: GPU stopped processing / preemptive cleanup after previous errors
    • Xid 62: Internal micro-controller halt, often seen alongside thermal or power problems
  • "ECC error": Memory corruption detected on the GPU
  • "CUDA out of memory": The application needs more GPU memory than is available

Additional Resources

For more advanced diagnostics and monitoring, refer to the NVIDIA Data Center GPU Manager (DCGM) documentation. DCGM provides enterprise-grade monitoring and management capabilities, including:

  • Health monitoring
  • Configuration management
  • Diagnostics
  • Policy-based automation

The complete DCGM documentation can be found on NVIDIA's website, with detailed information on installation, configuration, and advanced usage for datacenter environments.

