Overview
Error-Correcting Code (ECC) memory is a critical feature in professional-grade GPUs designed to detect and correct memory errors during operation. When ECC errors occur, they can indicate underlying hardware issues that may impact system stability, computational accuracy, and long-term GPU reliability.
This guide provides a systematic approach to diagnosing and addressing ECC errors in NVIDIA GPUs. Understanding the difference between correctable and uncorrectable errors is essential, as it helps determine the severity of the issue and appropriate actions. While correctable errors are automatically fixed and serve as early warning signs, uncorrectable errors can lead to application crashes or system instability and require immediate attention.
By following this troubleshooting guide, system administrators and engineers can identify the root causes of ECC errors, implement effective solutions, and potentially prevent costly hardware failures or computational inaccuracies in critical workloads.
Prerequisites
Before beginning troubleshooting, ensure you have:
- Verified that your system BIOS is up to date
- Installed the latest NVIDIA driver for your GPU (a quick verification command is shown below)
- Confirmed your PSU provides enough total wattage AND amperage per rail
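A quick way to confirm the driver, VBIOS, and GPU model before you start (this assumes nvidia-smi is already on your PATH):
nvidia-smi --query-gpu=name,driver_version,vbios_version --format=csv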
Understanding ECC Error Types
Type | Meaning | Risk Level |
---|---|---|
Correctable | Detected and fixed automatically | ⚠️ Warning sign |
Uncorrectable | Could not be corrected — may crash applications | ❌ Critical |
Step 1: Check ECC Status and Error Logs
Check the current ECC status and error counts:
nvidia-smi -q -d ECC
Look for:
- Affected GPU(s)
- Error counts (correctable vs uncorrectable)
- Error locations (volatile vs aggregate)
🧠 Note: Volatile error counts reset on reboot, while aggregate errors accumulate over the GPU's lifetime.
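For scripting or periodic collection, the same counts can be pulled in machine-readable form. The field names below come from nvidia-smi --help-query-gpu and may vary slightly between driver versions:
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv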
Step 2: Check for PCIe Errors
PCIe errors can sometimes correlate with or cause ECC errors:
nvidia-smi pci -gErrCnt
Sample output:
GPU 0: NVIDIA RTX A6000 (UUID: GPU-ac8aab02-60a5-c154-247b-d416b8fa93e7)
REPLAY_COUNTER: 0
REPLAY_ROLLOVER_COUNTER: 0
L0_TO_RECOVERY_COUNTER: 511
CORRECTABLE_ERRORS: 0
NAKS_RECEIVED: 0
RECEIVER_ERROR: 0
BAD_TLP: 0
NAKS_SENT: 0
BAD_DLLP: 0
NON_FATAL_ERROR: 0
FATAL_ERROR: 0
UNSUPPORTED_REQ: 0
LCRC_ERROR: 0
LANE_ERROR:
lane 0: 0
lane 1: 0
lane 2: 0
lane 3: 0
lane 4: 0
lane 5: 0
lane 6: 0
lane 7: 0
lane 8: 0
lane 9: 0
lane 10: 0
lane 11: 0
lane 12: 0
🧾 Interpreting Each Field
Field | Meaning | What to Watch For |
---|---|---|
REPLAY_COUNTER | Retries of PCIe packets due to transient issues | Occasional small numbers are okay; large or growing counts = ⚠️ suspect link quality |
REPLAY_ROLLOVER_COUNTER | Counter for how many times REPLAY_COUNTER wrapped | >0 means REPLAY_COUNTER exceeded 255 at least once |
L0_TO_RECOVERY_COUNTER | Times the PCIe link dropped to recovery mode | 🚨 High values here = serious PCIe link stability problem |
CORRECTABLE_ERRORS | Total correctable PCIe errors | Minor, but should be monitored |
NAKS_RECEIVED / NAKS_SENT | Negative acknowledgment signals sent/received | If non-zero, PCIe communication problems |
RECEIVER_ERROR | Errors receiving packets from the PCIe bus | Indicates potential signal integrity issues |
BAD_TLP / BAD_DLLP | Malformed PCIe transaction/data link packets | Usually 0 — non-zero = bus instability or broken device/riser |
NON_FATAL_ERROR | Non-fatal PCIe errors logged | Warning sign, can cause degraded performance |
FATAL_ERROR | Critical error that could crash the GPU | 🚨 If this is >0, the GPU likely experienced a crash or hard fault |
UNSUPPORTED_REQ | PCIe request not supported by GPU or device | May happen in some workloads — shouldn't be frequent |
LCRC_ERROR | Link CRC check failures — integrity issues | Usually 0; high = data corruption on the wire |
LANE_ERROR | Errors per PCIe lane (x16 GPUs have 16 lanes) | Should all be 0 — non-zero = specific bad lane or contact |
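Because a single snapshot cannot show whether counters are growing, it helps to re-run the same query at an interval while a workload is active, for example:
watch -n 5 'nvidia-smi pci -gErrCnt'
Counters that keep climbing under load point to an ongoing link problem rather than a one-off event.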
Step 3: Check System Logs
Examine system logs for GPU-related errors:
dmesg | grep -Ei 'err|nvrm|xid'
ipmitool sel list
These logs may provide additional context about when and how errors are occurring.
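On systemd-based distributions, the kernel log is also available through journalctl, which preserves history across reboots. A sketch of an equivalent query:
sudo journalctl -k --since "24 hours ago" | grep -iE 'xid|nvrm|ecc'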
Step 4: Power Cycle the System
Some ECC errors can occur due to unstable initialization:
- Fully power off the system (disconnect AC power)
- Wait 30 seconds
- Reconnect power and reboot
This clears GPU memory and volatile error counters and lets the GPU firmware reinitialize from a clean state.
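If a full power cycle is not immediately possible, some GPUs (mostly data center models) support an in-place reset. It requires root, only works when no processes are using the GPU, and is not supported on every model or driver:
sudo nvidia-smi --gpu-reset -i 0
Replace 0 with the index of the affected GPU.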
Step 5: Check Temperature and Cooling
Overheating can cause bit flips, especially under sustained load:
nvidia-smi --query-gpu=temperature.gpu,fan.speed,power.draw --format=csv
Recommended Operating Temperatures
GPU Type | Ideal Idle Temp | Typical Load Temp | Throttle Point |
---|---|---|---|
Consumer GPUs | 30-45°C | 65-85°C | ~90-95°C |
Data Center GPUs | 35-50°C | 70-85°C | ~85-90°C |
Keep GPUs under 80-85°C for optimal ECC reliability.
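To capture a thermal history during a run rather than a single reading, nvidia-smi can loop and append to a file (the log file name here is only an example; fan.speed may report [N/A] on passively cooled data center GPUs):
nvidia-smi --query-gpu=timestamp,temperature.gpu,fan.speed,power.draw --format=csv -l 5 >> gpu_thermal.log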
Step 6: Check PCIe and Power Stability
Physical connections can affect memory integrity:
- Reseat GPU(s) in their slots
- Reconnect PCIe power cables
- Verify PSU wattage is sufficient
Power instability can cause ECC errors during memory transactions.
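After reseating, it is worth confirming that the link negotiated its full generation and width; anything lower than the expected values (for example x8 instead of x16) suggests a seating, riser, or slot problem:
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv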
Step 7: Stress Test with ECC Monitoring
Enable ECC (if not already enabled)
sudo nvidia-smi -e 1
Reboot the system after enabling ECC.
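After the reboot, confirm that ECC is actually active before stress testing:
nvidia-smi --query-gpu=name,ecc.mode.current,ecc.mode.pending --format=csv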
Run GPU Stress Test
- Clone and build the tool:
git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn
make
✅ Ensure nvcc (the CUDA compiler) is available. If not, install the CUDA toolkit first.
- Run the GPU burn test
./gpu_burn 60
- Monitor ECC errors while under load:
watch -n 1 nvidia-smi -q -d ECC
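If you prefer a log you can review afterwards instead of watching interactively, here is a minimal sketch that samples volatile ECC counts every 10 seconds while gpu-burn runs (the file names are placeholders):
# start the burn in the background and remember its PID
./gpu_burn 300 > gpu_burn.log 2>&1 &
BURN_PID=$!
# append a timestamped ECC sample every 10 seconds until the burn finishes
while kill -0 "$BURN_PID" 2>/dev/null; do
  nvidia-smi --query-gpu=timestamp,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv,noheader >> ecc_during_burn.csv
  sleep 10
done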
Advanced gpu-burn Options
Task | Command |
---|---|
Run on a specific GPU | CUDA_VISIBLE_DEVICES=0 ./gpu_burn 60 |
Run on multiple GPUs selectively | CUDA_VISIBLE_DEVICES=0,2 ./gpu_burn 120 |
Run in background | ./gpu_burn 300 & disown |
Run with logging | ./gpu_burn 300 > thermal_test.log |
⚠️ Warning: This test will heat up your GPU rapidly. Monitor temperatures closely to avoid damage. Data center GPUs may throttle or shut down above ~85-90°C. Don't run these tests unattended.
Step 8: Isolate Hardware Issues
If using a multi-GPU setup:
- Rotate workloads across different GPUs (a rotation sketch follows this list)
- If ECC errors consistently appear on only one GPU, it suggests a hardware issue with that specific GPU
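For example, a simple rotation that burns each GPU in turn and keeps a separate log per device might look like this (assuming four GPUs numbered 0-3; adjust to your system):
# run the burn on one GPU at a time, logging each device separately
for i in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$i ./gpu_burn 120 > gpu_$i.log 2>&1
done
Comparing the per-GPU logs and ECC counts afterwards makes it easy to spot a single misbehaving card.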
Step 9: Address PCIe Issues (if detected)
If PCIe error counts are high:
Action | Reason |
---|---|
Reseat GPU | Fixes poor slot contact |
Check/replace riser | Faulty risers cause signal errors |
Reduce power draw or add airflow | High temps can corrupt PCIe signals |
Update motherboard BIOS | Improves PCIe lane stability |
Use different PCIe slot | Bad slot or insufficient lanes |
Increase PCIe slot retention (if custom-built) | Prevents flex or sag from disrupting contacts |
Step 10: Replace or RMA GPU
Consider replacement if:
- Uncorrectable ECC errors occur regularly
- Correctable errors are persistent across reboots and tests
- Only one GPU consistently experiences errors while others function normally
These symptoms typically indicate a failing memory chip or board issue.
Prevention Tips
Tip | Reason |
---|---|
Keep GPUs cool and well-ventilated | Reduces thermal-induced bit errors |
Don't undervolt ECC-enabled GPUs | Can destabilize memory operations |
Use stable, quality power supplies | Avoid brownouts or rail dips |
Monitor ECC errors over time | Spot degrading GPUs early |
Implement regular maintenance schedule | Proactive rather than reactive |
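One lightweight way to implement the monitoring tip is an hourly cron entry that appends aggregate counts to a history file (the path and schedule are only an example, and the job needs write access to the target file):
0 * * * * /usr/bin/nvidia-smi --query-gpu=timestamp,name,serial,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv,noheader >> /var/log/gpu_ecc_history.csv
Reviewing this file over time makes slowly degrading memory visible long before uncorrectable errors appear.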
Conclusion
ECC errors are important signals about GPU health and reliability. While occasional correctable errors may be normal, persistent or uncorrectable errors typically indicate hardware issues requiring attention. By systematically troubleshooting and addressing these errors, you can maintain computational accuracy and prevent system failures in critical environments.