GPU Not Detected, Slow, or Crashing

Before you begin: Have your Exxact serial number (SN) ready. It is printed on the system label. Also note the GPU UUIDs from 'nvidia-smi -L'; that command lists the installed GPUs, not the system SN.

Note: GPU issues at the hardware level (PCIe slot, power, thermals) must be ruled out before driver reinstall.

If the system is under warranty, avoid permanent hardware changes without contacting Exxact Support first.

1. Symptom Lookup — Go to the Right Section

What you see	Most likely cause	Go to section
GPU absent from nvidia-smi or Device Manager	PCIe seating, power connector, slot issue	Section 3
GPU present but only 1–3% utilization under full load	IOMMU/VT-d DMA contention (Intel systems)	Section 5
System lockup / hard hang at full GPU power	PSU insufficient, thermal throttle, BIOS version	Sections 4 & 6
NVIDIA driver install fails on Windows	Corrupted driver stack, incomplete Windows update	Section 7
NVIDIA driver install fails on Linux	Secure Boot, module signing, nouveau conflict	Section 8
Xid 79 error / GPU reset required	PCIe AER fault, power delivery, bad VBIOS	Section 9
GPU overheating / fan not spinning	Airflow blockage, fan hardware failure, power cap	Section 4
DCGM or burn-in test failure	Memory defect, thermal, driver stack corruption	Section 10

2. Quick Diagnostic Commands — Run These First

Run these on the affected system before contacting support. Paste the output into your support ticket.

Linux

nvidia-smi -q # full GPU status, temps, power, ECC

nvidia-smi --query-gpu=index,name,temperature.gpu,power.draw,clocks.sm,clocks.gr,pstate,ecc.errors.uncorrected.aggregate.total --format=csv

lspci | grep -i nvidia # confirm GPUs visible on PCIe bus

dmesg | grep -i 'nvidia\|nvrm\|xid' # kernel GPU errors

cat /proc/driver/nvidia/gpus/*/information 2>/dev/null
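To make pasting into a ticket easier, the Linux commands above can be gathered into one file. The following is a minimal sketch, not Exxact tooling: the function name collect_gpu_diag and the default file name gpu-diag.txt are made up for illustration. A command that fails is recorded in the file rather than aborting the run.

```shell
# Sketch: run the Linux diagnostics listed above and save them to one file.
# collect_gpu_diag and gpu-diag.txt are hypothetical names.
collect_gpu_diag() {
    local out="${1:-gpu-diag.txt}"
    local cmd
    : > "$out"                              # start with an empty file
    for cmd in \
        "nvidia-smi -q" \
        "lspci | grep -i nvidia" \
        "dmesg | grep -i 'nvidia\|nvrm\|xid'" \
        "cat /proc/driver/nvidia/gpus/*/information"
    do
        {
            printf '===== %s =====\n' "$cmd"
            # Keep going on failure so every section appears in the file.
            sh -c "$cmd" 2>&1 || printf '(no output or command failed)\n'
            printf '\n'
        } >> "$out"
    done
    printf '%s\n' "$out"                    # report where the output went
}

# Usage:
#   collect_gpu_diag                 # writes ./gpu-diag.txt
#   collect_gpu_diag /tmp/diag.txt   # writes to a chosen path
```

Run it with sudo-capable privileges if dmesg is restricted on your system; otherwise that section will only contain the permission error.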

Windows

nvidia-smi -q # PowerShell or CMD

Get-PnpDevice | Where-Object {$_.FriendlyName -like '*NVIDIA*'} | Select Name, Status

Get-EventLog -LogName System -Newest 200 | Where-Object {$_.Source -like '*nvlddmkm*'} | Format-List

Tip: All Exxact systems ship with a burn-in report (burn-in_PASSED.html) and DCGM validation log under your SN at cpq.exxactcorp.com/qa/. Compare current output to the factory baseline.

3. GPU Not Detected (nvidia-smi returns no GPUs)

Step 1 — Confirm PCIe visibility first

If the GPU does not appear in lspci (Linux) or Device Manager (Windows), this is a hardware problem — driver reinstall will not help.

lspci | grep -i nvidia

If no output: the GPU is not seen by the CPU. Proceed to hardware checks below.

If output is present but driver says no GPU: skip to Section 7 (Windows driver) or Section 8 (Linux driver).
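The Step 1 decision can be sketched as a small helper. This is only an illustration: next_step is a hypothetical name, and it assumes nvidia-smi exits with a non-zero status when it cannot find or talk to a GPU.

```shell
# Sketch of the Step 1 decision. next_step is a hypothetical helper.
#   $1 = number of NVIDIA devices reported by lspci
#   $2 = exit status of nvidia-smi (assumed non-zero when no GPU is usable)
next_step() {
    local pci_count="$1" smi_status="$2"
    if [ "$pci_count" -eq 0 ]; then
        echo "hardware: GPU not on the PCIe bus, continue with Step 2 (reseat)"
    elif [ "$smi_status" -ne 0 ]; then
        echo "driver: GPU on the bus but not usable, go to Section 7 or 8"
    else
        echo "ok: GPU visible on the PCIe bus and to the driver"
    fi
}

# Usage on a live system:
#   count=$(lspci | grep -ci nvidia)
#   nvidia-smi >/dev/null 2>&1; next_step "$count" "$?"
```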

Step 2 — Reseat the GPU

Power off completely. Remove AC power cord and hold the power button 5 seconds to discharge capacitors.
Remove the GPU from its PCIe slot. Inspect the connector edge for dust, corrosion, or bent pins.
Reseat firmly until the slot latch clicks. On multi-GPU systems, try moving the card to a different PCIe slot.
Reconnect all PCIe power connectors (6-pin, 8-pin, or 16-pin). Confirm they are fully seated and not reversed.
Power on and recheck lspci.

Step 3 — Check PCIe power connectors and slot assignment

High-wattage GPUs (300 W+) require dedicated PCIe power cables — daisy-chained connectors from a single PSU rail are a common failure point.

Verify cable routing: each GPU should ideally have its own PSU rail or cable run.
For 4-GPU workstations or GPU servers, confirm the PSU rated wattage covers peak load: 4 × 300 W GPU + CPU + storage ≈ 1800 W minimum; use the next tier up for headroom.
For Supermicro servers: confirm PCIe riser board seating and verify slot assignment in BIOS (Advanced > PCIe Configuration).
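The PSU sizing rule above is simple addition, sketched below. The helper name psu_budget is hypothetical, and the CPU (350 W) and storage/fan (250 W) figures in the usage lines are placeholder assumptions chosen only so the total matches the ≈ 1800 W example; substitute the real values for your system.

```shell
# Sketch: estimate the peak load a PSU must cover.
# psu_budget GPUS GPU_WATTS CPU_WATTS OTHER_WATTS [HEADROOM_PERCENT]
psu_budget() {
    local gpus="$1" gpu_w="$2" cpu_w="$3" other_w="$4" headroom="${5:-0}"
    local load=$(( gpus * gpu_w + cpu_w + other_w ))
    # Apply headroom with integer math, rounding up to the next watt.
    echo $(( (load * (100 + headroom) + 99) / 100 ))
}

# Usage (placeholder CPU and storage figures):
#   psu_budget 4 300 350 250      # prints 1800
#   psu_budget 4 300 350 250 20   # prints 2160 (20% headroom)
```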

4. GPU Overheating, Fan Not Spinning, or Throttling

Identify throttle cause

nvidia-smi -q -d PERFORMANCE # look for HW Slowdown, HW Thermal, HW Power Brake flags

watch -n 1 'nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.throttle_reasons.active,clocks.sm --format=csv,noheader'

Thermal thresholds (RTX A6000 / typical NVIDIA professional GPUs)

Threshold	Typical value	Meaning
GPU Slowdown Temp	88–95 °C	Clock throttle begins
GPU Max Operating Temp	93 °C	Performance degraded
GPU Shutdown Temp	98 °C	Emergency power-off
GPU Target Temp	84 °C	Fan control setpoint
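The values in the table are typical; the thresholds for a specific card are listed by nvidia-smi -q -d TEMPERATURE. As a rough sketch, the helper below (flag_hot_gpus, a hypothetical name) reads "index, temperature" CSV lines and prints the GPUs at or above a limit you choose, such as the target temperature.

```shell
# Sketch: flag GPUs at or above a temperature limit (degrees C).
# Reads "index, temperature" CSV lines on stdin.
flag_hot_gpus() {
    local limit="$1"
    awk -F', *' -v limit="$limit" '
        $2 + 0 >= limit + 0 {
            printf "GPU %s: %d C (limit %d C)\n", $1, $2, limit
        }
    '
}

# Usage on a live system:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits \
#       | flag_hot_gpus 84
```

No output means every GPU is below the limit.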

Cooling checklist

Verify airflow direction. Exxact 1U/2U GPU servers use front-to-back airflow. Reverse-airflow cards will overheat in standard chassis.
Check fan health via BMC/IPMI: ipmitool sdr type Fan — any fan reading 0 RPM is a failure.
Inspect for blocked airflow: blanking panels missing, cables draped over GPU heatsinks, inadequate rack spacing.
Check thermal paste on GPU heatsink if fan is spinning but temperatures are abnormally high (>85 °C at idle).
On tower/workstation systems with blower-style cards: ensure at least 1U of clearance between GPU exhaust and the next component.

Power cap as a thermal workaround

If hardware changes cannot be made immediately, a temporary power cap reduces heat and prevents shutdowns:

sudo nvidia-smi -pl 220 # example: cap at 220 W (adjust per GPU TDP); requires root and does not persist across reboots

nvidia-smi -q -d POWER | grep -i limit # verify new limit applied

Important: Power capping reduces compute throughput. It is a workaround, not a fix. Escalate to Exxact Support if the underlying cause (cooling or PSU) is not resolved.

5. Severe GPU Slowdown on Intel Systems (IOMMU / VT-d)

Observed on Intel Xeon systems: deep learning workloads running at 1–3% GPU utilization. Benchmark shows 100–190× slowdown vs. AMD equivalents. Root cause: Intel VT-d (IOMMU) DMA translation overhead during concurrent CPU+GPU memory operations.

Diagnosis

dmesg | grep -i iommu

cat /proc/cmdline | grep iommu

If you see DMAR: IOMMU enabled in dmesg and no passthrough option in cmdline, IOMMU is likely the cause.
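The command-line half of that check can be sketched as a helper that classifies a kernel command line. iommu_mode is a hypothetical name, and it only looks for the two parameters discussed in this section.

```shell
# Sketch: classify the IOMMU setting on a kernel command line.
# Prints "off", "passthrough", or "default" (no passthrough option present).
iommu_mode() {
    local cmdline=" $1 "          # pad so parameters match as whole words
    case "$cmdline" in
        *" intel_iommu=off "*) echo "off" ;;
        *" iommu=pt "*)        echo "passthrough" ;;
        *)                     echo "default" ;;
    esac
}

# Usage on a live system:
#   iommu_mode "$(cat /proc/cmdline)"
```

A result of "default" together with the DMAR message in dmesg matches the condition described above.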

Fix — Try in order

#	Kernel parameter	Effect	Risk
1	iommu=pt	IOMMU passthrough — reduces DMA overhead while keeping IOMMU active	Low
2	intel_iommu=off	Fully disables VT-d — maximum GPU DMA performance; disables SR-IOV and IOMMU isolation	Medium — disable only on bare-metal GPU compute

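Apply via GRUB (Ubuntu/Debian):

sudo nano /etc/default/grub

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=pt"

sudo update-grub && sudo reboot

On RHEL/Rocky, update-grub is not available; add the parameter with grubby instead:

sudo grubby --update-kernel=ALL --args="iommu=pt" && sudo reboot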
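The manual edit of /etc/default/grub can also be scripted. The sketch below is an illustration under stated assumptions, not a supported tool: add_kernel_param is a hypothetical name, it expects a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, and it handles simple parameters such as iommu=pt (no regex or '|' characters). It keeps a .bak copy and does nothing if the parameter is already present.

```shell
# Sketch: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT once.
# add_kernel_param PARAM [FILE]   (FILE defaults to /etc/default/grub)
add_kernel_param() {
    local param="$1" file="${2:-/etc/default/grub}"
    # Fail if the expected line is missing rather than silently doing nothing.
    grep -q '^GRUB_CMDLINE_LINUX_DEFAULT="' "$file" || return 1
    # Already present as a whole word: leave the file untouched.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${param}[ \"]" "$file"; then
        return 0
    fi
    cp "$file" "$file.bak"
    # Insert the parameter just before the closing quote.
    sed "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 ${param}\"|" \
        "$file.bak" > "$file"
}

# Usage (from a root shell on Ubuntu), then regenerate the config and reboot:
#   add_kernel_param iommu=pt
#   update-grub && reboot
```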

Also check BIOS:

Advanced > CPU Configuration > Intel Virtualization Technology for Directed I/O (VT-d) → Disabled for bare-metal GPU compute workloads. Confirm with your system admin if virtualization (SR-IOV, KVM pass-through) is needed before disabling.

Field note (Ticket #41473 — UT Dallas): 4× RTX A6000 on Supermicro X13DEG-QT (Intel Xeon Gold 5520+).

iommu=pt reduced slowdown from ~192× to ~161×. Full resolution required intel_iommu=off.

Confirmed: ECC enabled on all GPUs, power capped at 220 W, no Xid errors at time of report.

Next steps at time of ticket: BIOS update and DCGM diagnostic.

6. BIOS Update and Power Delivery

When to update BIOS

System lockups under GPU load that cannot be explained by thermals or PSU capacity.
PCIe link width or generation stays downgraded under load (for example, nvidia-smi shows PCIe Gen 1 during a workload when Gen 4 is expected).
BMC becomes unresponsive during GPU stress — may indicate BIOS-level power management bug.

Check current BIOS version

sudo dmidecode -t bios | grep -E 'Version|Release' # Linux (dmidecode needs root)

Get-CimInstance Win32_BIOS | Select-Object Name, SMBIOSBIOSVersion, ReleaseDate # Windows PowerShell (Get-WmiObject is unavailable in PowerShell 7)

BIOS update by platform

Platform	BIOS update method	Notes
Supermicro	IPMI Web UI > Maintenance > BIOS Update, or UEFI USB tool	Download from supermicro.com/support/resources/downloadcenter
ASUS	ASUS EZ Flash (BIOS screen) or USB BIOS Flashback	Flashback requires specific USB port — check manual
Gigabyte	Q-Flash or @BIOS utility	Use Q-Flash from BIOS for safest update
MSI	M-FLASH from BIOS menu	Requires formatted FAT32 USB drive

PCIe link degradation note: nvidia-smi may report 'Current PCIe Generation: 1' even though the GPU max is Gen 4.

This is normal at idle — NVIDIA downclocks PCIe links in P8 state.

To verify at load: run a workload and re-query nvidia-smi. If still Gen 1 under load, check BIOS PCIe settings.
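The load check can be sketched with nvidia-smi's PCIe query fields. check_pcie_links is a hypothetical helper name; it reads "index, current gen, max gen, current width, max width" CSV lines, prints any GPU whose link is below its maximum, and returns 1 if it found one.

```shell
# Sketch: report GPUs whose PCIe link is below its maximum.
# Reads CSV lines: index, gen.current, gen.max, width.current, width.max
check_pcie_links() {
    awk -F', *' '
        $2 + 0 < $3 + 0 || $4 + 0 < $5 + 0 {
            printf "GPU %s: Gen %s of %s, x%s of x%s\n", $1, $2, $3, $4, $5
            degraded = 1
        }
        END { exit degraded + 0 }
    '
}

# Usage while a workload is running:
#   nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max \
#       --format=csv,noheader | check_pcie_links
```

Run it only under load; at idle the Gen 1 reading described above would be flagged even though it is normal.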

7. NVIDIA Driver Reinstall — Windows

Use this procedure when: driver install fails, GPU shows 'Code 43' in Device Manager, or after a corrupted driver update pushed the system into repair mode.

Critical prerequisite: Complete ALL pending Windows updates before reinstalling NVIDIA drivers.

A Windows servicing stack corruption (error 0x800F0991) will cause driver installation to fail even after DDU.

Signs: WindowsUpdateClient errors in Event Viewer, SFC /scannow fails with 'Windows Resource Protection' error.

Fix: run DISM /Online /Cleanup-Image /RestoreHealth, complete updates, then reboot before driver install.

Procedure

Download DDU (Display Driver Uninstaller) from guru3d.com and the target NVIDIA driver from nvidia.com. Store both on a USB drive.
Boot into Safe Mode: Settings > System > Recovery > Advanced Startup > Troubleshoot > Advanced Options > Startup Settings > Restart > press 4.
Run DDU: Select GPU type = 'GPU', Device = 'NVIDIA'. Choose 'Clean and restart in Safe Mode'. DDU will remove all NVIDIA components including audio, PhysX, and USB.
After DDU reboot (still in Safe Mode): verify Device Manager shows no NVIDIA devices.
Boot to normal Windows. Disconnect from the network. Run the NVIDIA installer as Administrator.
If install still fails: open Event Viewer (eventvwr.msc) > Windows Logs > System. Look for WindowsUpdateClient or SetupAPI errors in the 30 minutes before the failure.
If Windows update loop blocks driver install: try attaching a USB NIC — some NIC firmware versions block Windows authentication for update packages. Alternatively update BIOS (newer BIOS often includes updated NIC firmware).

Verify successful install

nvidia-smi # GPU should appear with correct driver version

Get-PnpDevice | Where {$_.FriendlyName -like '*NVIDIA*'} | Select Name, Status

Field note (Ticket #41003 — Fairfield University): 4× NVIDIA RTX A6000 on ASUS PRO WS WRX80E-SAGE (AMD Threadripper PRO 5995WX), Windows 11.

Driver R580 (582.16) install failed after Windows repair mode. DDU run in Safe Mode — reinstall still failed.

Root cause: incomplete Windows updates (KB5077181 error 0x800F0991). SFC also failed.

Resolution: complete Windows updates first (DISM restore health + Windows Update), then reinstall driver.

Secondary fix considered: BIOS update from v1201 (2023) to v1801 (2025) to resolve NIC firmware issue blocking Windows Update.

8. NVIDIA Driver Reinstall — Linux

Use when: driver install fails, GPU disappears after kernel update, or nvidia.ko module fails to load.

Step 1 — Disable Secure Boot or sign the module

NVIDIA drivers are out-of-tree kernel modules. Secure Boot requires module signing.

Recommended for pure GPU compute: disable Secure Boot in BIOS (Advanced > Security > Secure Boot).
Alternative: enroll MOK key — see NVIDIA documentation for mokutil procedure.

Step 2 — Full driver purge

sudo systemctl stop gdm lightdm sddm 2>/dev/null; true

sudo apt-get purge --autoremove 'nvidia-*' 'libnvidia-*' 'cuda-*' 'libcuda*' -y # Ubuntu/Debian

sudo dnf remove 'nvidia*' 'cuda*' -y # RHEL/Rocky

sudo rm -f /etc/modprobe.d/blacklist-nouveau.conf # remove any stale manual blacklist

sudo update-initramfs -u # Ubuntu/Debian

sudo dracut -f # RHEL/Rocky (use instead of update-initramfs)

sudo reboot

Step 3 — Blacklist nouveau and install

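After reboot, check whether nouveau is loaded. Because Step 2 removed the manual blacklist file, nouveau may load again; if it does, blacklist it, rebuild the initramfs, and reboot before installing:

lsmod | grep nouveau # should return empty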
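If the nouveau check prints anything, a blacklist file is needed before the driver will load. A minimal sketch follows, using the two lines commonly used for this purpose; write_nouveau_blacklist is a hypothetical helper name, and the default path is the same file that Step 2 removes.

```shell
# Sketch: write a nouveau blacklist file.
# write_nouveau_blacklist [FILE]
write_nouveau_blacklist() {
    local file="${1:-/etc/modprobe.d/blacklist-nouveau.conf}"
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$file"
}

# Usage (from a root shell), then rebuild the initramfs and reboot:
#   write_nouveau_blacklist
#   update-initramfs -u      # Ubuntu/Debian
#   dracut -f                # RHEL/Rocky
#   reboot
```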

Install via package manager (recommended — handles DKMS automatically):

# Ubuntu — via graphics-drivers PPA

sudo add-apt-repository ppa:graphics-drivers/ppa -y

sudo apt-get update

sudo apt-get install nvidia-driver-550 -y # substitute target version

sudo reboot

Or install from .run file (for specific versions):

sudo bash NVIDIA-Linux-x86_64-550.163.01.run --dkms

Step 4 — Verify

nvidia-smi

lsmod | grep nvidia # nvidia, nvidia_modeset, nvidia_uvm should all be present

cat /proc/driver/nvidia/version

9. Xid 79 — GPU Reset Required (PCIe AER Error)

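Xid 79 ('GPU has fallen off the bus') means the driver lost contact with the GPU over PCIe. It is often accompanied by PCIe AER (Advanced Error Reporting) messages in the kernel log, and the GPU stays unavailable until it is reset or the system is rebooted. Possible causes: PCIe signal integrity, power delivery, or defective GPU.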

Identify

dmesg | grep -i 'xid\|nvrm' | tail -40

nvidia-smi --query-gpu=ecc.errors.uncorrected.aggregate.total --format=csv # check for aggregate ECC errors
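To see whether Xid 79 is tied to one GPU, the kernel log can be summarized per PCI address. The sketch below assumes the driver's usual log line shape, "NVRM: Xid (PCI:<address>): <code>, ..."; summarize_xid is a hypothetical helper name, and lines in any other shape are ignored.

```shell
# Sketch: count Xid events per PCI address and Xid code.
# Reads kernel log lines on stdin; prints "count pci-address xid-code".
summarize_xid() {
    sed -n 's/.*NVRM: Xid (PCI:\([^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p' \
        | sort | uniq -c
}

# Usage on a live system:
#   dmesg | summarize_xid
```

Repeated Xid 79 entries for a single address point at that card, its slot, or its power feed; entries spread across addresses point at something shared.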

Triage steps

Check PCIe link width and generation under load — if downgrading, replace cable or switch to a different slot.
Reseat GPU and reconnect power. Xid 79 is frequently caused by marginal power delivery.
Update BIOS — BIOS updates often include PCIe error handling improvements.
If Xid 79 is intermittent and tied to a specific GPU: run DCGM diag (Section 10) and consider RMA if test fails.
Check VBIOS version: nvidia-smi --query-gpu=vbios_version --format=csv — outdated VBIOS can cause AER errors; contact Exxact for VBIOS update guidance.

10. DCGM Diagnostic (Linux Only)

NVIDIA's DCGM tool runs comprehensive hardware diagnostics including memory, compute, and bandwidth tests. Required for warranty RMA eligibility.

Install DCGM

# Ubuntu

sudo apt-get install datacenter-gpu-manager -y

# RHEL / Rocky

sudo dnf install datacenter-gpu-manager -y

dcgmi --version # verify install

Run diagnostic levels

Level	Command	What it tests
-r 1	dcgmi diag -r 1	Quick — PCIe, memory bandwidth (< 1 min)
-r 2	dcgmi diag -r 2	Standard — adds ECC and compute (~ 2 min)
-r 3	dcgmi diag -r 3	Extended — full stress + memory (~ 15 min)
-r 4	dcgmi diag -r 4	Comprehensive — all tests at full duration (30+ min, required for RMA)

RMA requirement: Exxact Support requires dcgmi diag -r 4 output before approving a GPU RMA. Attach the full output to your ticket.
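Because the full output has to be attached, it helps to capture it to a file as it is produced. The wrapper below is a generic sketch (log_run is a hypothetical name, not part of DCGM): it runs any command, saves stdout and stderr to a file, prints the saved output afterwards, and returns the command's exit status.

```shell
# Sketch: run a command and keep its full output in a file.
# log_run OUTFILE COMMAND [ARGS...]
log_run() {
    local outfile="$1" status
    shift
    {
        printf '# command: %s\n' "$*"
        "$@" 2>&1
    } > "$outfile"
    status=$?                # exit status of the wrapped command
    cat "$outfile"
    return "$status"
}

# Usage:
#   log_run "dcgm-diag-r4-$(hostname).txt" dcgmi diag -r 4
```

Output appears only after the command finishes; to follow a long run, watch the file from a second terminal with tail -f.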

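Note: DCGM is not installed by default on Exxact workstations. If 'dcgmi: command not found', install via the commands above; the packages come from NVIDIA's CUDA repository, which must be configured first. dcgmi talks to the DCGM host engine, so start it before running diagnostics: sudo systemctl enable --now nvidia-dcgm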

11. When to Escalate to Exxact Support

Escalate if...	What to include in your ticket
GPU not visible in lspci after reseating	nvidia-smi -q output, lspci -vvv, dmesg | grep -i nvidia
Lockups persist after power cap + BIOS update	IPMI sensor log, ipmitool sdr, BIOS version, PSU model/wattage
DCGM diag -r 4 reports FAIL on any test	Full dcgmi diag -r 4 output (save to file)
Xid 79 errors recurring after reseat + BIOS update	dmesg grep Xid, nvidia-smi ECC aggregate errors, VBIOS version
Windows driver reinstall fails after DDU + Windows update	Event Viewer System log (.evtx), BIOS version, DDU log
GPU cooling issue requiring physical fan replacement	Fan RPM from IPMI, nvidia-smi temperature log, system model/SN

Contact Exxact Support

Portal	support.exxactcorp.com
Email	support@exxactcorp.com
Phone	(510) 226-7366 | Mon–Fri 8:30am–5:30pm PT
AI Chat	Available on support.exxactcorp.com — instant answers for common issues

Appendix A — Key nvidia-smi Fields Explained

Field	What it means
Performance State (Pstate)	P0 = max performance, P8 = idle. Stays at P8 at idle; should go to P0 under load.
HW Slowdown: Active	GPU is thermally or power-throttled at the hardware level. Check temps and power.
ECC Errors > 0 (Aggregate)	Persistent ECC errors indicate memory cells going bad. File a support ticket if uncorrectable > 0.
PCIe Generation: Current 1	Normal at idle (P8 state). Must be Gen 3/4/5 under active compute load.
Reset Required: Yes	GPU has detected a fatal error and needs a driver reset or system reboot. Note any Xid in dmesg.
Drain and Reset Recommended: Yes	Soft reset recommended. With no processes using the GPU, run: sudo nvidia-smi --gpu-reset -i <GPU index>
