GPU Troubleshooting Guide: Resolving ECC Errors

Alexander Hill

Overview

Error-Correcting Code (ECC) memory is a critical feature in professional-grade GPUs designed to detect and correct memory errors during operation. When ECC errors occur, they can indicate underlying hardware issues that may impact system stability, computational accuracy, and long-term GPU reliability.

This guide provides a systematic approach to diagnosing and addressing ECC errors in NVIDIA GPUs. Understanding the difference between correctable and uncorrectable errors is essential, as it helps determine the severity of the issue and appropriate actions. While correctable errors are automatically fixed and serve as early warning signs, uncorrectable errors can lead to application crashes or system instability and require immediate attention.

By following this troubleshooting guide, system administrators and engineers can identify the root causes of ECC errors, implement effective solutions, and potentially prevent costly hardware failures or computational inaccuracies in critical workloads.

Prerequisites

Before beginning troubleshooting, ensure you have:

  • Verified your system BIOS is the latest version
  • Installed the latest NVIDIA driver for your GPU
  • Confirmed your PSU provides enough total wattage and sufficient amperage per rail

Understanding ECC Error Types

Type | Meaning | Risk Level
Correctable | Detected and fixed automatically | ⚠️ Warning sign
Uncorrectable | Could not be corrected; may crash applications | ❌ Critical

 

Step 1: Check ECC Status and Error Logs

Check the current ECC status and error counts:

nvidia-smi -q -d ECC

Look for:

  • Affected GPU(s)
  • Error counts (correctable vs uncorrectable)
  • Error locations (volatile vs aggregate)

🧠 Note: Volatile error counts reset on reboot, while aggregate errors accumulate over the GPU's lifetime.
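If you script or dashboard this check, the same counters can be pulled in CSV form through nvidia-smi's query interface. This is a minimal sketch; the field names below are assumed to be available on your driver version (verify with nvidia-smi --help-query-gpu):

# Per-GPU ECC mode plus volatile and aggregate error totals, in CSV form
nvidia-smi --query-gpu=index,name,ecc.mode.current,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv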

Step 2: Check for PCIe Errors

PCIe errors can sometimes correlate with or cause ECC errors:

nvidia-smi pci -gErrCnt


Sample output:

GPU 0: NVIDIA RTX A6000 (UUID: GPU-ac8aab02-60a5-c154-247b-d416b8fa93e7)
REPLAY_COUNTER: 0
REPLAY_ROLLOVER_COUNTER: 0
L0_TO_RECOVERY_COUNTER: 511
CORRECTABLE_ERRORS: 0
NAKS_RECEIVED: 0
RECEIVER_ERROR: 0
BAD_TLP: 0
NAKS_SENT: 0
BAD_DLLP: 0
NON_FATAL_ERROR: 0
FATAL_ERROR: 0
UNSUPPORTED_REQ: 0
LCRC_ERROR: 0
LANE_ERROR:
lane 0: 0
lane 1: 0
lane 2: 0
lane 3: 0
lane 4: 0
lane 5: 0
lane 6: 0
lane 7: 0
lane 8: 0
lane 9: 0
lane 10: 0
lane 11: 0
lane 12: 0

🧾 Interpreting Each Field

Field | Meaning | What to Watch For
REPLAY_COUNTER | Retries of PCIe packets due to transient issues | Occasional small numbers are okay; large or growing counts = ⚠️ suspect link quality
REPLAY_ROLLOVER_COUNTER | How many times REPLAY_COUNTER wrapped around | >0 means REPLAY_COUNTER exceeded 255 at least once
L0_TO_RECOVERY_COUNTER | Times the PCIe link dropped into recovery mode | 🚨 High values = serious PCIe link stability problem
CORRECTABLE_ERRORS | Total correctable PCIe errors | Minor, but should be monitored
NAKS_RECEIVED / NAKS_SENT | Negative acknowledgments received/sent | Non-zero values indicate PCIe communication problems
RECEIVER_ERROR | Errors receiving packets from the PCIe bus | Indicates potential signal integrity issues
BAD_TLP / BAD_DLLP | Malformed PCIe transaction/data link packets | Usually 0; non-zero = bus instability or a broken device/riser
NON_FATAL_ERROR | Non-fatal PCIe errors logged | Warning sign; can cause degraded performance
FATAL_ERROR | Critical error that could crash the GPU | 🚨 If >0, the GPU likely experienced a crash or hard fault
UNSUPPORTED_REQ | PCIe request not supported by the GPU or device | May happen in some workloads; shouldn't be frequent
LCRC_ERROR | Link CRC check failures (integrity issues) | Usually 0; high values = data corruption on the wire
LANE_ERROR | Errors per PCIe lane (x16 GPUs have 16 lanes) | Should all be 0; non-zero = a specific bad lane or contact
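As a cross-check outside the NVIDIA stack, the kernel's own view of the PCIe link (negotiated speed/width and AER error status) can be read with lspci. This is a sketch, assuming lspci is installed and your platform exposes AER; replace the example bus ID with the one reported for your GPU:

# Find each GPU's PCI bus ID as seen by the driver
nvidia-smi --query-gpu=index,pci.bus_id --format=csv

# Inspect the negotiated link and AER status for that device
# (use the bus ID from above without the leading domain, e.g. 65:00.0)
sudo lspci -vvv -s 65:00.0 | grep -iE 'lnkcap|lnksta|cesta|uesta'

A link that has trained down from its rated speed or width (for example, x16 running at x8) is another hint of the slot, riser, or signal-integrity problems described above.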

 

Step 3: Check System Logs

Examine system logs for GPU-related errors:

dmesg | egrep -i 'err|nvrm|xid' 

ipmitool sel list

These logs may provide additional context about when and how errors are occurring.
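On systemd-based distributions the same kernel messages are also available through journalctl, which makes it easy to search earlier boots after a crash. A small sketch (the exact driver message strings vary by version):

# Kernel messages from the current boot mentioning the NVIDIA driver or Xid events
journalctl -k | grep -iE 'nvrm|xid'

# The same search against the previous boot, useful after an unexpected reset
journalctl -k -b -1 | grep -iE 'nvrm|xid'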

 

Step 4: Power Cycle the System

Some ECC errors can occur due to unstable initialization:

  • Fully power off the system (disconnect AC power)
  • Wait 30 seconds
  • Reconnect power and reboot

This clears volatile GPU memory and forces a clean reinitialization of the GPU and its firmware.
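After the system comes back up, you can confirm that the volatile counters were actually cleared (aggregate counters persist across reboots). A quick check, assuming the ECC query fields used in Step 1 are available on your driver:

nvidia-smi --query-gpu=index,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv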

 

Step 5: Check Temperature and Cooling

Overheating can cause bit flips, especially under sustained load:

nvidia-smi --query-gpu=temperature.gpu,fan.speed,power.draw --format=csv

Recommended Operating Temperatures

GPU Type | Ideal Idle Temp | Typical Load Temp | Throttle Point
Consumer GPUs | 30-45°C | 65-85°C | ~90-95°C
Data Center GPUs | 35-50°C | 70-85°C | ~85-90°C

Keep GPUs under 80-85°C for optimal ECC reliability.
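For longer runs it helps to log these readings rather than spot-check them. A minimal sketch; the 5-second interval and the log file name are arbitrary choices:

# Append temperature, fan speed, and power draw to a CSV log every 5 seconds
nvidia-smi --query-gpu=timestamp,index,temperature.gpu,fan.speed,power.draw --format=csv -l 5 >> gpu_thermal.log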

 

Step 6: Check PCIe and Power Stability

Physical connections can affect memory integrity:

  • Reseat GPU(s) in their slots
  • Reconnect PCIe power cables
  • Verify PSU wattage is sufficient

Power instability can cause ECC errors during memory transactions.
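A quick software-side sanity check is to watch live power draw against the configured limit while a workload runs; draw pinned at the limit, or sudden dips, can point to marginal power delivery. A sketch using standard query fields:

# Compare live power draw to the board power limit, refreshing every 2 seconds
nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv -l 2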

 

Step 7: Stress Test with ECC Monitoring

Enable ECC (if not already enabled)

sudo nvidia-smi -e 1

Reboot the system after enabling ECC.

 

Run GPU Stress Test

  1. Clone and build the tool:
git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn
make

✅ Ensure nvcc (CUDA compiler) is available. If not, install the CUDA toolkit first.

  2. Run the GPU burn test:
./gpu_burn 60

  3. Monitor ECC errors while under load:
watch -n 1 nvidia-smi -q -d ECC

 

Advanced gpu-burn Options

Task | Command
Run on a specific GPU | CUDA_VISIBLE_DEVICES=0 ./gpu_burn 60
Run on multiple GPUs selectively | CUDA_VISIBLE_DEVICES=0,2 ./gpu_burn 120
Run in background | ./gpu_burn 300 & disown
Run with logging | ./gpu_burn 300 > thermal_test.log

⚠️ Warning: This test will heat up your GPU rapidly. Monitor temperatures closely to avoid damage. Data center GPUs may throttle or shut down above ~85-90°C. Don't run these tests unattended.
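If you prefer to combine the stress run and the ECC sampling in one step, a small wrapper script can launch gpu_burn in the background and record the ECC report at a fixed interval. This is only a sketch; the 300-second duration, 10-second sampling interval, and log file names are arbitrary:

#!/usr/bin/env bash
# Run gpu_burn for 5 minutes while sampling ECC counters every 10 seconds
set -euo pipefail

./gpu_burn 300 > gpu_burn.log 2>&1 &
BURN_PID=$!

while kill -0 "$BURN_PID" 2>/dev/null; do
    date >> ecc_during_burn.log
    nvidia-smi -q -d ECC >> ecc_during_burn.log
    sleep 10
done

echo "Stress test finished; see gpu_burn.log and ecc_during_burn.log"

Any increase in the error counts during or immediately after the run is worth correlating with the temperatures logged in Step 5.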

 

Step 8: Isolate Hardware Issues

If using a multi-GPU setup:

  • Rotate workloads across different GPUs
  • If ECC errors consistently appear on only one GPU, it suggests a hardware issue with that specific card; the per-GPU query sketched below makes the comparison easy
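A quick way to compare cards is to list the aggregate ECC totals for every GPU side by side. A sketch, assuming the aggregate ECC query fields are exposed by your driver version:

# Aggregate ECC error totals for every GPU in the system
nvidia-smi --query-gpu=index,name,uuid,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv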

Step 9: Address PCIe Issues (if detected)

If PCIe error counts are high:

Action | Reason
Reseat GPU | Fixes poor slot contact
Check/replace riser | Faulty risers cause signal errors
Reduce power draw or add airflow | High temps can corrupt PCIe signals
Update motherboard BIOS | Improves PCIe lane stability
Use a different PCIe slot | Works around a bad slot or insufficient lanes
Increase PCIe slot retention (if custom-built) | Prevents flex or sag from disrupting contacts

 

Step 10: Replace or RMA GPU

Consider replacement if:

  • Uncorrectable ECC errors occur regularly
  • Correctable errors are persistent across reboots and tests
  • Only one GPU consistently experiences errors while others function normally

These symptoms typically indicate a failing memory chip or board issue.
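Before opening an RMA, it is worth capturing the GPU's memory-page health report, since support teams often request it and it documents the failure. Depending on the GPU generation this is exposed as page retirement (older data-center parts) or row remapping (Ampere and newer); availability of these reports depends on the GPU model and driver:

# Retired memory pages (older data-center GPUs)
nvidia-smi -q -d PAGE_RETIREMENT

# Row remapping status (Ampere and newer data-center GPUs)
nvidia-smi -q -d ROW_REMAPPER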

 

Prevention Tips

Tip | Reason
Keep GPUs cool and well-ventilated | Reduces thermal-induced bit errors
Don't undervolt ECC-enabled GPUs | Can destabilize memory operations
Use stable, quality power supplies | Avoids brownouts or rail dips
Monitor ECC errors over time | Spots degrading GPUs early
Implement a regular maintenance schedule | Proactive rather than reactive
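For the "Monitor ECC errors over time" tip, one lightweight approach is a cron job that appends a timestamped snapshot of the counters to a log. The schedule, file path, and query fields below are only an example:

# /etc/cron.d/gpu-ecc-history: hourly snapshot of aggregate ECC counters
0 * * * * root nvidia-smi --query-gpu=timestamp,index,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv,noheader >> /var/log/gpu_ecc_history.csv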

 

Conclusion

ECC errors are important signals about GPU health and reliability. While occasional correctable errors may be normal, persistent or uncorrectable errors typically indicate hardware issues requiring attention. By systematically troubleshooting and addressing these errors, you can maintain computational accuracy and prevent system failures in critical environments.
