Overview
Error-Correcting Code (ECC) memory is a critical feature in professional-grade GPUs designed to detect and correct memory errors during operation. When ECC errors occur, they can indicate underlying hardware issues that may impact system stability, computational accuracy, and long-term GPU reliability.
This guide provides a systematic approach to diagnosing and addressing ECC errors in NVIDIA GPUs. Understanding the difference between correctable and uncorrectable errors is essential, as it helps determine the severity of the issue and appropriate actions. While correctable errors are automatically fixed and serve as early warning signs, uncorrectable errors can lead to application crashes or system instability and require immediate attention.
By following this troubleshooting guide, system administrators and engineers can identify the root causes of ECC errors, implement effective solutions, and potentially prevent costly hardware failures or computational inaccuracies in critical workloads.
Prerequisites
Before beginning troubleshooting, ensure you have:
- Verified that your system BIOS is up to date
- Installed the latest NVIDIA driver for your GPU (a quick verification command is shown below)
- Confirmed your PSU provides enough total wattage AND amperage per rail
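A quick way to confirm the driver, VBIOS, and GPU model before you start (this assumes nvidia-smi is already on your PATH):
nvidia-smi --query-gpu=name,driver_version,vbios_version --format=csv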
Understanding ECC Error Types
Type | Meaning | Risk Level |
---|---|---|
Correctable | Detected and fixed automatically | ⚠️ Warning sign |
Uncorrectable | Could not be corrected — may crash applications | ❌ Critical |
Step 1: Check ECC Status and Error Logs
Check the current ECC status and error counts:
nvidia-smi -q -d ECC
Look for:
- Affected GPU(s)
- Error counts (correctable vs uncorrectable)
- Error locations (volatile vs aggregate)
🧠 Note: Volatile error counts reset on reboot, while aggregate errors accumulate over the GPU's lifetime.
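For scripting or periodic collection, the same counts can be pulled in machine-readable form. The field names below come from nvidia-smi --help-query-gpu and may vary slightly between driver versions:
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv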
Step 2: Check for PCIe Errors
PCIe errors can sometimes correlate with or cause ECC errors:
nvidia-smi pci -gErrCnt
Sample output:
GPU 0: NVIDIA RTX A6000 (UUID: GPU-ac8aab02-60a5-c154-247b-d416b8fa93e7)
REPLAY_COUNTER: 0
REPLAY_ROLLOVER_COUNTER: 0
L0_TO_RECOVERY_COUNTER: 511
CORRECTABLE_ERRORS: 0
NAKS_RECEIVED: 0
RECEIVER_ERROR: 0
BAD_TLP: 0
NAKS_SENT: 0
BAD_DLLP: 0
NON_FATAL_ERROR: 0
FATAL_ERROR: 0
UNSUPPORTED_REQ: 0
LCRC_ERROR: 0
LANE_ERROR:
lane 0: 0
lane 1: 0
lane 2: 0
lane 3: 0
lane 4: 0
lane 5: 0
lane 6: 0
lane 7: 0
lane 8: 0
lane 9: 0
lane 10: 0
lane 11: 0
lane 12: 0
🧾 Interpreting Each Field
Field | Meaning | What to Watch For |
---|---|---|
REPLAY_COUNTER | Retries of PCIe packets due to transient issues | Occasional small numbers are okay; large or growing counts = ⚠️ suspect link quality |
REPLAY_ROLLOVER_COUNTER | Counter for how many times REPLAY_COUNTER wrapped | >0 means REPLAY_COUNTER exceeded 255 at least once |
L0_TO_RECOVERY_COUNTER | Times the PCIe link dropped to recovery mode | 🚨 High values here = serious PCIe link stability problem |
CORRECTABLE_ERRORS | Total correctable PCIe errors | Minor, but should be monitored |
NAKS_RECEIVED / NAKS_SENT | Negative acknowledgment signals sent/received | If non-zero, PCIe communication problems |
RECEIVER_ERROR | Errors receiving packets from the PCIe bus | Indicates potential signal integrity issues |
BAD_TLP / BAD_DLLP | Malformed PCIe transaction/data link packets | Usually 0 — non-zero = bus instability or broken device/riser |
NON_FATAL_ERROR | Non-fatal PCIe errors logged | Warning sign, can cause degraded performance |
FATAL_ERROR | Critical error that could crash the GPU | 🚨 If this is >0, the GPU likely experienced a crash or hard fault |
UNSUPPORTED_REQ | PCIe request not supported by GPU or device | May happen in some workloads — shouldn't be frequent |
LCRC_ERROR | Link CRC check failures — integrity issues | Usually 0; high = data corruption on the wire |
LANE_ERROR | Errors per PCIe lane (x16 GPUs have 16 lanes) | Should all be 0 — non-zero = specific bad lane or contact |
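Because a single snapshot cannot show whether counters are growing, it helps to re-run the same query at an interval while a workload is active, for example:
watch -n 5 'nvidia-smi pci -gErrCnt'
Counters that keep climbing under load point to an ongoing link problem rather than a one-off event.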
Step 3: Check System Logs
Examine system logs for GPU-related errors:
dmesg | grep -Ei 'err|nvrm|xid'
ipmitool sel list
These logs may provide additional context about when and how errors are occurring.
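On systemd-based distributions, the kernel log is also available through journalctl, which preserves history across reboots. A sketch of an equivalent query:
sudo journalctl -k --since "24 hours ago" | grep -iE 'xid|nvrm|ecc'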
Step 4: Power Cycle the System
Some ECC errors can occur due to unstable initialization:
- Fully power off the system (disconnect AC power)
- Wait 30 seconds
- Reconnect power and reboot
This clears GPU memory and volatile error counters and lets the GPU firmware reinitialize from a clean state.
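If a full power cycle is not immediately possible, some GPUs (mostly data center models) support an in-place reset. It requires root, only works when no processes are using the GPU, and is not supported on every model or driver:
sudo nvidia-smi --gpu-reset -i 0
Replace 0 with the index of the affected GPU.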
Step 5: Check Temperature and Cooling
Overheating can cause bit flips, especially under sustained load:
nvidia-smi --query-gpu=temperature.gpu,fan.speed,power.draw --format=csv
Recommended Operating Temperatures
GPU Type | Ideal Idle Temp | Typical Load Temp | Throttle Point |
---|---|---|---|
Consumer GPUs | 30-45°C | 65-85°C | ~90-95°C |
Data Center GPUs | 35-50°C | 70-85°C | ~85-90°C |
Keep GPUs under 80-85°C for optimal ECC reliability.
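To capture a thermal history during a run rather than a single reading, nvidia-smi can loop and append to a file (the log file name here is only an example; fan.speed may report [N/A] on passively cooled data center GPUs):
nvidia-smi --query-gpu=timestamp,temperature.gpu,fan.speed,power.draw --format=csv -l 5 >> gpu_thermal.log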
Step 6: Check PCIe and Power Stability
Physical connections can affect memory integrity:
- Reseat GPU(s) in their slots
- Reconnect PCIe power cables
- Verify PSU wattage is sufficient
Power instability can cause ECC errors during memory transactions.
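After reseating, it is worth confirming that the link negotiated its full generation and width; anything lower than the expected values (for example x8 instead of x16) suggests a seating, riser, or slot problem:
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv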
Step 7: Stress Test with ECC Monitoring
Enable ECC (if not already enabled)
sudo nvidia-smi -e 1
Reboot the system after enabling ECC.
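After the reboot, confirm that ECC is actually active before stress testing:
nvidia-smi --query-gpu=name,ecc.mode.current,ecc.mode.pending --format=csv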
Run GPU Stress Test
- Clone and build the tool:
git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn
make
✅ Ensure nvcc (the CUDA compiler) is available. If not, install the CUDA toolkit first.
- Run the GPU burn test
./gpu_burn 60
- Monitor ECC errors while under load:
watch -n 1 nvidia-smi -q -d ECC
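If you prefer a log you can review afterwards instead of watching interactively, here is a minimal sketch that samples volatile ECC counts every 10 seconds while gpu-burn runs (the file names are placeholders):
# start the burn in the background and remember its PID
./gpu_burn 300 > gpu_burn.log 2>&1 &
BURN_PID=$!
# append a timestamped ECC sample every 10 seconds until the burn finishes
while kill -0 "$BURN_PID" 2>/dev/null; do
  nvidia-smi --query-gpu=timestamp,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv,noheader >> ecc_during_burn.csv
  sleep 10
done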
Advanced gpu-burn Options
Task | Command |
---|---|
Run on a specific GPU | CUDA_VISIBLE_DEVICES=0 ./gpu_burn 60 |
Run on multiple GPUs selectively | CUDA_VISIBLE_DEVICES=0,2 ./gpu_burn 120 |
Run in background | ./gpu_burn 300 & disown |
Run with logging | ./gpu_burn 300 > thermal_test.log |
⚠️ Warning: This test will heat up your GPU rapidly. Monitor temperatures closely to avoid damage. Data center GPUs may throttle or shut down above ~85-90°C. Don't run these tests unattended.
Step 8: Isolate Hardware Issues
If using a multi-GPU setup:
- Rotate workloads across different GPUs (a rotation sketch follows this list)
- If ECC errors consistently appear on only one GPU, it suggests a hardware issue with that specific GPU
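For example, a simple rotation that burns each GPU in turn and keeps a separate log per device might look like this (assuming four GPUs numbered 0-3; adjust to your system):
# run the burn on one GPU at a time, logging each device separately
for i in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$i ./gpu_burn 120 > gpu_$i.log 2>&1
done
Comparing the per-GPU logs and ECC counts afterwards makes it easy to spot a single misbehaving card.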
Step 9: Address PCIe Issues (if detected)
If PCIe error counts are high:
Action | Reason |
---|---|
Reseat GPU | Fixes poor slot contact |
Check/replace riser | Faulty risers cause signal errors |
Reduce power draw or add airflow | High temps can corrupt PCIe signals |
Update motherboard BIOS | Improves PCIe lane stability |
Use different PCIe slot | Bad slot or insufficient lanes |
Increase PCIe slot retention (if custom-built) | Prevents flex or sag from disrupting contacts |
Step 10: Replace or RMA GPU
Consider replacement if:
- Uncorrectable ECC errors occur regularly
- Correctable errors are persistent across reboots and tests
- Only one GPU consistently experiences errors while others function normally
These symptoms typically indicate a failing memory chip or board issue.
Prevention Tips
Tip | Reason |
---|---|
Keep GPUs cool and well-ventilated | Reduces thermal-induced bit errors |
Don't undervolt ECC-enabled GPUs | Can destabilize memory operations |
Use stable, quality power supplies | Avoid brownouts or rail dips |
Monitor ECC errors over time | Spot degrading GPUs early |
Implement regular maintenance schedule | Proactive rather than reactive |
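One lightweight way to implement the monitoring tip is an hourly cron entry that appends aggregate counts to a history file (the path and schedule are only an example, and the job needs write access to the target file):
0 * * * * /usr/bin/nvidia-smi --query-gpu=timestamp,name,serial,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv,noheader >> /var/log/gpu_ecc_history.csv
Reviewing this file over time makes slowly degrading memory visible long before uncorrectable errors appear.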
Conclusion
ECC errors are important signals about GPU health and reliability. While occasional correctable errors may be normal, persistent or uncorrectable errors typically indicate hardware issues requiring attention. By systematically troubleshooting and addressing these errors, you can maintain computational accuracy and prevent system failures in critical environments.