Before you begin: Have your Exxact serial number (SN) ready; it is printed on the system label. Note that 'nvidia-smi -L' lists GPU names and UUIDs rather than the system SN, but its output is still worth attaching to a ticket. Rule out hardware-level GPU issues (PCIe slot, power, thermals) before reinstalling drivers. If the system is under warranty, contact Exxact Support before making any permanent hardware changes.
1. Symptom Lookup — Go to the Right Section
| What you see | Most likely cause | Go to section |
|---|---|---|
| GPU absent from nvidia-smi or Device Manager | PCIe seating, power connector, slot issue | Section 3 |
| GPU present but 0–3% utilization under full load | IOMMU/VT-d DMA contention (Intel systems) | Section 5 |
| System lockup / hard hang at full GPU power | PSU insufficient, thermal throttle, BIOS version | Sections 4 & 6 |
| NVIDIA driver install fails on Windows | Corrupted driver stack, incomplete Windows update | Section 7 |
| NVIDIA driver install fails on Linux | Secure Boot, module signing, nouveau conflict | Section 8 |
| Xid 79 error / GPU reset required | PCIe AER fault, power delivery, bad VBIOS | Section 9 |
| GPU overheating / fan not spinning | Airflow blockage, fan hardware failure, power cap | Section 4 |
| DCGM or burn-in test failure | Memory defect, thermal, driver stack corruption | Section 10 |
2. Quick Diagnostic Commands — Run These First
Run these on the affected system before contacting support. Paste the output into your support ticket.
Linux
nvidia-smi -q # full GPU status, temps, power, ECC
nvidia-smi --query-gpu=index,name,temperature.gpu,power.draw,clocks.sm,clocks.gr,pstate,ecc.errors.uncorrected.aggregate.total --format=csv
lspci | grep -i nvidia # confirm GPUs visible on PCIe bus
dmesg | grep -i 'nvidia\|nvrm\|xid' # kernel GPU errors
cat /proc/driver/nvidia/gpus/*/information 2>/dev/null
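Since the output of these commands is meant to be pasted into a ticket, it can help to capture everything in one file. The following is a minimal sketch, not an Exxact-supplied tool: the function names `run_section` and `collect_gpu_diag` and the output file name are hypothetical, and the bundle only includes commands already listed in this section.

```shell
# run_section TITLE COMMAND [ARGS...]
# Prints a header, then the command's output (stdout and stderr).
# A missing or failing command is recorded instead of aborting the bundle.
run_section() {
    title=$1
    shift
    printf '===== %s =====\n' "$title"
    if ! "$@" 2>&1; then
        printf '(command failed or is not installed: %s)\n' "$1"
    fi
    printf '\n'
}

# collect_gpu_diag OUTFILE
# Writes the Linux quick diagnostics from this section into one file.
collect_gpu_diag() {
    out=$1
    {
        run_section "nvidia-smi -q" nvidia-smi -q
        run_section "lspci (NVIDIA)" sh -c 'lspci | grep -i nvidia'
        run_section "dmesg (nvidia/nvrm/xid)" sh -c "dmesg | grep -i 'nvidia\|nvrm\|xid'"
    } > "$out"
}

# Usage (run as root so dmesg is readable):
#   collect_gpu_diag "gpu-diag-$(hostname)-$(date +%Y%m%d).txt"
```

A section whose command is missing (for example, nvidia-smi when the driver is not installed) is still recorded with a failure note, which is itself useful information for support.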
Windows
nvidia-smi -q # PowerShell or CMD
Get-PnpDevice | Where-Object {$_.FriendlyName -like '*NVIDIA*'} | Select Name, Status
Get-EventLog -LogName System -Newest 200 | Where-Object {$_.Source -like '*nvlddmkm*'} | Format-List
| Tip: All Exxact systems ship with a burn-in report (burn-in_PASSED.html) and DCGM validation log under your SN at cpq.exxactcorp.com/qa/. Compare current output to the factory baseline. |
3. GPU Not Detected (nvidia-smi returns no GPUs)
Step 1 — Confirm PCIe visibility first
If the GPU does not appear in lspci (Linux) or Device Manager (Windows), this is a hardware problem — driver reinstall will not help.
- lspci | grep -i nvidia
If no output: the GPU is not seen by the CPU. Proceed to hardware checks below.
If output is present but driver says no GPU: skip to Section 7 (Windows driver) or Section 8 (Linux driver).
Step 2 — Reseat the GPU
- Power off completely. Remove AC power cord and hold the power button 5 seconds to discharge capacitors.
- Remove the GPU from its PCIe slot. Inspect the connector edge for dust, corrosion, or bent pins.
- Reseat firmly until the slot latch clicks. On multi-GPU systems, try moving the card to a different PCIe slot.
- Reconnect all PCIe power connectors (6-pin, 8-pin, or 16-pin). Confirm they are fully seated and not reversed.
- Power on and recheck lspci.
Step 3 — Check PCIe power connectors and slot assignment
High-wattage GPUs (300 W+) require dedicated PCIe power cables — daisy-chained connectors from a single PSU rail are a common failure point.
- Verify cable routing: each GPU should ideally have its own PSU rail or cable run.
- For 4-GPU workstations or GPU servers, confirm the PSU rated wattage covers peak load: 4 × 300 W GPUs (1200 W) plus CPU and storage comes to roughly 1800 W minimum; use the next PSU tier up for headroom.
- For Supermicro servers: confirm PCIe riser board seating and verify slot assignment in BIOS (Advanced > PCIe Configuration).
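The PSU sizing rule in the checklist above is simple arithmetic, sketched here as a small helper. The function name `psu_budget` is hypothetical. The 600 W figure for CPU, storage, and everything else is an assumption back-derived from the 1800 W total quoted above, and the 20% headroom is only an illustration of "the next tier up".

```shell
# psu_budget NUM_GPUS GPU_WATTS OTHER_WATTS [HEADROOM_PCT]
# Prints the estimated PSU wattage: (GPU load + other load) plus optional headroom.
psu_budget() {
    awk -v n="$1" -v g="$2" -v o="$3" -v h="${4:-0}" \
        'BEGIN { printf "%d\n", (n * g + o) * (100 + h) / 100 }'
}

# Example from this section: 4 GPUs at 300 W plus an assumed 600 W for the rest.
#   psu_budget 4 300 600       # minimum
#   psu_budget 4 300 600 20    # with 20% headroom (illustrative)
```

Substitute the real board power of your GPUs and the measured or rated draw of the rest of the system; the result is a sizing estimate, not a substitute for the PSU vendor's derating guidance.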
4. GPU Overheating, Fan Not Spinning, or Throttling
Identify throttle cause
nvidia-smi -q -d PERFORMANCE # look for HW Slowdown, HW Thermal, HW Power Brake flags
watch -n 1 'nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.throttle_reasons.active,clocks.sm --format=csv,noheader'
Thermal thresholds (RTX A6000 / typical NVIDIA professional GPUs)
| Threshold | Typical value | Meaning |
|---|---|---|
| GPU Slowdown Temp | 88–95 °C | Clock throttle begins |
| GPU Max Operating Temp | 93 °C | Performance degraded |
| GPU Shutdown Temp | 98 °C | Emergency power-off |
| GPU Target Temp | 84 °C | Fan control setpoint |
Cooling checklist
- Verify airflow direction. Exxact 1U/2U GPU servers use front-to-back airflow. Reverse-airflow cards will overheat in standard chassis.
- Check fan health via BMC/IPMI: ipmitool sdr type Fan — any fan reading 0 RPM is a failure.
- Inspect for blocked airflow: blanking panels missing, cables draped over GPU heatsinks, inadequate rack spacing.
- Check thermal paste on GPU heatsink if fan is spinning but temperatures are abnormally high (>85 °C at idle).
- On tower/workstation systems with blower-style cards: ensure at least 1U of clearance between GPU exhaust and the next component.
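The fan check in the list above can be scripted on systems with many fans. This is a sketch that assumes the usual pipe-separated layout of `ipmitool sdr type Fan` (sensor name first, reading last); the exact columns vary by BMC, and the function name `failed_fans` is hypothetical.

```shell
# failed_fans: reads `ipmitool sdr type Fan` output on stdin and prints the
# name of every sensor whose reading is "0 RPM" or "No Reading".
failed_fans() {
    awk -F'|' '
        {
            name = $1
            reading = $NF
            gsub(/^[ \t]+|[ \t]+$/, "", name)      # trim the sensor name
            gsub(/^[ \t]+|[ \t]+$/, "", reading)   # trim the reading column
            if (reading == "0 RPM" || reading == "No Reading") print name
        }'
}

# Usage: ipmitool sdr type Fan | failed_fans
```

Unpopulated fan headers are often reported as "No Reading" as well, so compare any flagged names against the fans actually installed in the chassis before treating them as failures.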
Power cap as a thermal workaround
If hardware changes cannot be made immediately, a temporary power cap reduces heat and prevents shutdowns:
sudo nvidia-smi -pl 220 # example: cap every GPU at 220 W (adjust per GPU TDP; add -i <index> for a single GPU)
nvidia-smi -q -d POWER | grep -i limit # verify the new limit; it resets at reboot unless reapplied
| Important: Power capping reduces compute throughput. It is a workaround, not a fix. Escalate to Exxact Support if the underlying cause (cooling or PSU) is not resolved. |
5. Severe GPU Slowdown on Intel Systems (IOMMU / VT-d)
Observed on Intel Xeon systems: deep learning workloads running at 1–3% GPU utilization. Benchmark shows 100–190× slowdown vs. AMD equivalents. Root cause: Intel VT-d (IOMMU) DMA translation overhead during concurrent CPU+GPU memory operations.
Diagnosis
dmesg | grep -i iommu
cat /proc/cmdline | grep iommu
If you see DMAR: IOMMU enabled in dmesg and no passthrough option in cmdline, IOMMU is likely the cause.
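The command-line half of that check can be expressed as a small classifier. This is a sketch only; the function name `iommu_mode` and its three labels are hypothetical, and it looks solely at the two kernel parameters discussed in this section.

```shell
# iommu_mode CMDLINE
# Classifies a kernel command line string:
#   off          intel_iommu=off is present
#   passthrough  iommu=pt is present
#   default      neither option is set
iommu_mode() {
    case " $1 " in
        *" intel_iommu=off "*) echo off ;;
        *" iommu=pt "*)        echo passthrough ;;
        *)                     echo default ;;
    esac
}

# Usage: iommu_mode "$(cat /proc/cmdline)"
```

A result of `default` on a system whose dmesg shows DMAR: IOMMU enabled matches the condition described above.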
Fix — Try in order
| # | Kernel parameter | Effect | Risk |
|---|---|---|---|
| 1 | iommu=pt | IOMMU passthrough — reduces DMA overhead while keeping IOMMU active | Low |
| 2 | intel_iommu=off | Fully disables VT-d — maximum GPU DMA performance; disables SR-IOV and IOMMU isolation | Medium — disable only on bare-metal GPU compute |
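Apply via GRUB (Ubuntu/Debian): append the parameter to the existing value in /etc/default/grub, then regenerate the config.
sudo nano /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=pt"
sudo update-grub && sudo reboot
On RHEL/Rocky, update-grub is not available; use grubby instead:
sudo grubby --update-kernel=ALL --args="iommu=pt" && sudo reboot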
Also check BIOS:
Advanced > CPU Configuration > Intel Virtualization Technology for Directed I/O (VT-d) → Disabled for bare-metal GPU compute workloads. Confirm with your system admin if virtualization (SR-IOV, KVM pass-through) is needed before disabling.
Field note (Ticket #41473, UT Dallas): 4× RTX A6000 on Supermicro X13DEG-QT (Intel Xeon Gold 5520+). iommu=pt reduced slowdown from ~192× to ~161×. Full resolution required intel_iommu=off. Confirmed: ECC enabled on all GPUs, power capped at 220 W, no Xid errors at time of report. Next steps at time of ticket: BIOS update and DCGM diagnostic.
6. BIOS Update and Power Delivery
When to update BIOS
- System lockups under GPU load that cannot be explained by thermals or PSU capacity.
- PCIe link speed or width stays downgraded under load (for example, nvidia-smi shows PCIe Gen 1 when Gen 4 is expected).
- BMC becomes unresponsive during GPU stress — may indicate BIOS-level power management bug.
Check current BIOS version
sudo dmidecode -t bios | grep -E 'Version|Release' # Linux (root required)
Get-CimInstance Win32_BIOS | Select-Object Name, SMBIOSBIOSVersion, ReleaseDate # Windows PowerShell
BIOS update by platform
| Platform | BIOS update method | Notes |
|---|---|---|
| Supermicro | IPMI Web UI > Maintenance > BIOS Update, or UEFI USB tool | Download from supermicro.com/support/resources/downloadcenter |
| ASUS | ASUS EZ Flash (BIOS screen) or USB BIOS Flashback | Flashback requires specific USB port — check manual |
| Gigabyte | Q-Flash or @BIOS utility | Use Q-Flash from BIOS for safest update |
| MSI | M-FLASH from BIOS menu | Requires formatted FAT32 USB drive |
PCIe link degradation note: nvidia-smi may report 'Current PCIe Generation: 1' even though the GPU max is Gen 4. This is normal at idle: NVIDIA GPUs downclock the PCIe link in the P8 state. To verify at load, run a workload and re-query nvidia-smi. If the link is still Gen 1 under load, check the BIOS PCIe settings.
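To make the under-load check repeatable, the current and maximum link values can be compared per GPU. The sketch below assumes the CSV produced by the nvidia-smi query shown in its usage comment; the function name `pcie_degraded` is hypothetical.

```shell
# pcie_degraded: reads CSV rows "index, gen_cur, gen_max, width_cur, width_max"
# on stdin and prints one line per GPU whose current link is below its maximum.
pcie_degraded() {
    awk -F', *' '
        $2 + 0 < $3 + 0 || $4 + 0 < $5 + 0 {
            printf "GPU %s: Gen %s/%s, x%s/x%s\n", $1, $2, $3, $4, $5
        }'
}

# Usage (while a GPU workload is running):
#   nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max \
#              --format=csv,noheader | pcie_degraded
```

No output means every GPU is running at its maximum link generation and width at the moment of the query. Run at idle, the same command is expected to list GPUs at Gen 1, for the reason given in the note above.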
7. NVIDIA Driver Reinstall — Windows
Use this procedure when: driver install fails, GPU shows 'Code 43' in Device Manager, or after a corrupted driver update pushed the system into repair mode.
Critical prerequisite: Complete ALL pending Windows updates before reinstalling NVIDIA drivers. A Windows servicing stack corruption (error 0x800F0991) will cause driver installation to fail even after DDU. Signs: WindowsUpdateClient errors in Event Viewer, SFC /scannow fails with 'Windows Resource Protection' error. Fix: run DISM /Online /Cleanup-Image /RestoreHealth, complete updates, then reboot before driver install.
Procedure
- Download DDU (Display Driver Uninstaller) from guru3d.com and the target NVIDIA driver from nvidia.com. Store both on a USB drive.
- Boot into Safe Mode: Settings > System > Recovery > Advanced Startup > Troubleshoot > Advanced Options > Startup Settings > Restart > press 4.
- Run DDU: Select GPU type = 'GPU', Device = 'NVIDIA'. Choose 'Clean and restart in Safe Mode'. DDU will remove all NVIDIA components including audio, PhysX, and USB.
- After DDU reboot (still in Safe Mode): verify Device Manager shows no NVIDIA devices.
- Boot to normal Windows. Disconnect from the network. Run the NVIDIA installer as Administrator.
- If install still fails: open Event Viewer (eventvwr.msc) > Windows Logs > System. Look for WindowsUpdateClient or SetupAPI errors in the 30 minutes before the failure.
- If Windows update loop blocks driver install: try attaching a USB NIC — some NIC firmware versions block Windows authentication for update packages. Alternatively update BIOS (newer BIOS often includes updated NIC firmware).
Verify successful install
nvidia-smi # GPU should appear with correct driver version
Get-PnpDevice | Where {$_.FriendlyName -like '*NVIDIA*'} | Select Name, Status
Field note (Ticket #41003, Fairfield University): 4× NVIDIA RTX A6000 on ASUS PRO WS WRX80E-SAGE (AMD Threadripper PRO 5995WX), Windows 11. Driver R580 (582.16) install failed after Windows repair mode. DDU was run in Safe Mode, but the reinstall still failed. Root cause: incomplete Windows updates (KB5077181 error 0x800F0991). SFC also failed. Resolution: complete Windows updates first (DISM restore health + Windows Update), then reinstall driver. Secondary fix considered: BIOS update from v1201 (2023) to v1801 (2025) to resolve NIC firmware issue blocking Windows Update.
8. NVIDIA Driver Reinstall — Linux
Use when: driver install fails, GPU disappears after kernel update, or nvidia.ko module fails to load.
Step 1 — Disable Secure Boot or sign the module
NVIDIA drivers are out-of-tree kernel modules. Secure Boot requires module signing.
- Recommended for pure GPU compute: disable Secure Boot in BIOS (Advanced > Security > Secure Boot).
- Alternative: enroll MOK key — see NVIDIA documentation for mokutil procedure.
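Before choosing between the two options, it helps to know whether Secure Boot is actually on. The sketch below assumes the `mokutil` package is installed and that `mokutil --sb-state` prints a line containing "SecureBoot enabled" or "SecureBoot disabled"; the function name `secure_boot_state` is hypothetical.

```shell
# secure_boot_state: reads `mokutil --sb-state` output on stdin and prints
# enabled, disabled, or unknown (for example on non-UEFI boots).
secure_boot_state() {
    awk '
        /SecureBoot enabled/  { state = "enabled" }
        /SecureBoot disabled/ { state = "disabled" }
        END { print (state == "" ? "unknown" : state) }'
}

# Usage: mokutil --sb-state 2>&1 | secure_boot_state
```

If the result is `disabled`, module signing is not what is blocking the driver and you can move on to Step 2.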
Step 2 — Full driver purge
sudo systemctl stop gdm lightdm sddm 2>/dev/null; true
sudo apt-get purge --autoremove 'nvidia-*' 'cuda-*' 'libcuda*' -y # Ubuntu/Debian
sudo dnf remove 'nvidia*' 'cuda*' -y # RHEL/Rocky
sudo rm -f /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u # Ubuntu/Debian
sudo dracut -f # RHEL/Rocky
sudo reboot
Step 3 — Blacklist nouveau and install
After reboot, confirm nouveau is not loaded:
lsmod | grep nouveau # should return empty
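Step 2 deletes any existing blacklist file, and this step lists no command for recreating it. If `lsmod` still shows nouveau, a common approach is a two-line modprobe config, sketched below. The function name `write_nouveau_blacklist` is hypothetical; the default path is the same file that Step 2 removes.

```shell
# write_nouveau_blacklist [PATH]
# Writes a modprobe config that blocks nouveau. PATH defaults to the
# conventional location; pass another path to preview the file first.
write_nouveau_blacklist() {
    target=${1:-/etc/modprobe.d/blacklist-nouveau.conf}
    printf 'blacklist nouveau\noptions nouveau modeset=0\n' > "$target"
}

# Usage (as root), then rebuild the initramfs and reboot:
#   write_nouveau_blacklist
#   update-initramfs -u    # Ubuntu/Debian
#   dracut -f              # RHEL/Rocky
```

Distribution driver packages usually ship their own nouveau blacklist, so this is mainly relevant for .run file installs.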
Install via package manager (recommended — handles DKMS automatically):
# Ubuntu — via graphics-drivers PPA
sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt-get update
sudo apt-get install nvidia-driver-550 -y # substitute target version
sudo reboot
Or install from .run file (for specific versions):
sudo bash NVIDIA-Linux-x86_64-550.163.01.run --dkms
Step 4 — Verify
nvidia-smi
lsmod | grep nvidia # nvidia, nvidia_modeset, nvidia_uvm should all be present
cat /proc/driver/nvidia/version
9. Xid 79 — GPU Reset Required (PCIe AER Error)
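Xid 79 ("GPU has fallen off the bus") means the driver lost communication with the GPU over PCIe. It is often accompanied by PCIe AER (Advanced Error Reporting) messages in dmesg, and the GPU needs a reset or reboot to recover. Possible causes: PCIe signal integrity, power delivery, or a defective GPU.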
Identify
dmesg | grep -i 'xid\|nvrm' | tail -40 # -i already matches both NVRM and nvrm
nvidia-smi --query-gpu=ecc.errors.uncorrected.aggregate.total --format=csv # check for aggregate ECC errors
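When the log contains many Xid lines, a per-GPU tally shows whether the errors follow one card. The sketch below assumes the usual kernel log form `NVRM: Xid (PCI:<address>): <code>, ...`; the function name `xid_summary` is hypothetical.

```shell
# xid_summary: reads kernel log lines on stdin and prints
# "count PCI-address Xid-code" for each distinct (device, code) pair.
xid_summary() {
    sed -n 's/.*NVRM: Xid (\(PCI:[^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p' |
        sort | uniq -c | awk '{ print $1, $2, $3 }'
}

# Usage: dmesg | xid_summary
```

Repeated Xid 79 entries against a single PCI address point toward that GPU, its slot, or its power cabling, which fits the triage steps in this section.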
Triage steps
- Check PCIe link width and generation under load — if downgrading, replace cable or switch to a different slot.
- Reseat GPU and reconnect power. Xid 79 is frequently caused by marginal power delivery.
- Update BIOS — BIOS updates often include PCIe error handling improvements.
- If Xid 79 is intermittent and tied to a specific GPU: run DCGM diag (Section 10) and consider RMA if test fails.
- Check VBIOS version: nvidia-smi --query-gpu=vbios_version --format=csv — outdated VBIOS can cause AER errors; contact Exxact for VBIOS update guidance.
10. DCGM Diagnostic (Linux Only)
NVIDIA's DCGM tool runs comprehensive hardware diagnostics including memory, compute, and bandwidth tests. Required for warranty RMA eligibility.
Install DCGM
# Ubuntu
sudo apt-get install datacenter-gpu-manager -y
# RHEL / Rocky
sudo dnf install datacenter-gpu-manager -y
sudo systemctl enable --now nvidia-dcgm # start the DCGM host engine that dcgmi connects to
dcgmi --version # verify install
Run diagnostic levels
| Level | Command | What it tests |
|---|---|---|
| -r 1 | dcgmi diag -r 1 | Quick — PCIe, memory bandwidth (< 1 min) |
| -r 2 | dcgmi diag -r 2 | Standard — adds ECC and compute (~ 2 min) |
| -r 3 | dcgmi diag -r 3 | Extended — full stress + memory (~ 15 min) |
| -r 4 | dcgmi diag -r 4 | Comprehensive — all tests at full duration (30+ min, required for RMA) |
RMA requirement: Exxact Support requires dcgmi diag -r 4 output before approving a GPU RMA. Attach the full output to your ticket. Note: DCGM is not installed by default on Exxact workstations. If 'dcgmi: command not found', install via the commands above. |
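Because the full level 4 output has to be attached to the ticket, it is convenient to save it to a file and then count failures from that file. The sketch below assumes the two-column, pipe-delimited result table that `dcgmi diag` prints, with the result in the second column; the layout can differ between DCGM versions, and the function name `dcgm_fail_count` and the file name are hypothetical.

```shell
# dcgm_fail_count: reads saved `dcgmi diag` output on stdin and prints the
# number of table rows whose result column is exactly "Fail".
dcgm_fail_count() {
    awk -F'|' '
        { result = $3; gsub(/^[ \t]+|[ \t]+$/, "", result) }
        result == "Fail" { n++ }
        END { print n + 0 }'
}

# Usage:
#   dcgmi diag -r 4 2>&1 | tee "dcgm-r4-$(hostname).txt"
#   dcgm_fail_count < "dcgm-r4-$(hostname).txt"
```

Attach the saved file itself to the ticket; the count is only a quick way to see whether anything failed.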
11. When to Escalate to Exxact Support
| Escalate if... | What to include in your ticket |
|---|---|
| GPU not visible in lspci after reseating | nvidia-smi -q output, lspci -vvv, dmesg output filtered for nvidia |
| Lockups persist after power cap + BIOS update | IPMI sensor log, ipmitool sdr, BIOS version, PSU model/wattage |
| DCGM diag -r 4 reports FAIL on any test | Full dcgmi diag -r 4 output (save to file) |
| Xid 79 errors recurring after reseat + BIOS update | dmesg grep Xid, nvidia-smi ECC aggregate errors, VBIOS version |
| Windows driver reinstall fails after DDU + Windows update | Event Viewer System log (.evtx), BIOS version, DDU log |
| GPU cooling issue requiring physical fan replacement | Fan RPM from IPMI, nvidia-smi temperature log, system model/SN |
Contact Exxact Support
| Portal | support.exxactcorp.com |
| Email | support@exxactcorp.com |
| Phone | (510) 226-7366, Mon–Fri 8:30am–5:30pm PT |
| AI Chat | Available on support.exxactcorp.com for instant answers to common issues |
Appendix A — Key nvidia-smi Fields Explained
| Field | What it means |
|---|---|
| Performance State (Pstate) | P0 = max performance, P8 = idle. Stays at P8 at idle; should go to P0 under load. |
| HW Slowdown: Active | GPU is thermally or power-throttled at the hardware level. Check temps and power. |
| ECC Errors > 0 (Aggregate) | Persistent ECC errors indicate memory cells going bad. File a support ticket if uncorrectable > 0. |
| PCIe Generation: Current 1 | Normal at idle (P8 state). Must be Gen 3/4/5 under active compute load. |
| Reset Required: Yes | GPU has detected a fatal error and needs a driver reset or system reboot. Note any Xid in dmesg. |
| Drain and Reset Recommended: Yes | Soft reset recommended. Stop all processes using the GPU, then run: sudo nvidia-smi --gpu-reset -i <GPU index> |
© 2026 Exxact Corporation. Internal use and customer-facing knowledge base article.