GPU Not Detected, Slow, or Crashing

Alexander Hill

 

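Before you begin:  Have your Exxact serial number (SN) ready. It is printed on the system label. 'nvidia-smi -L' lists each GPU's model and UUID, which helps identify the affected card, but it does not report the system SN.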

Note: GPU issues at the hardware level (PCIe slot, power, thermals) must be ruled out before driver reinstall.

If the system is under warranty, avoid permanent hardware changes without contacting Exxact Support first.

 

1. Symptom Lookup — Go to the Right Section

| What you see | Most likely cause | Go to section |
| --- | --- | --- |
| GPU absent from nvidia-smi or Device Manager | PCIe seating, power connector, slot issue | Section 3 |
| GPU present but 0–3% utilization under full load | IOMMU/VT-d DMA contention (Intel systems) | Section 5 |
| System lockup / hard hang at full GPU power | PSU insufficient, thermal throttle, BIOS version | Sections 4 & 6 |
| NVIDIA driver install fails on Windows | Corrupted driver stack, incomplete Windows update | Section 7 |
| NVIDIA driver install fails on Linux | Secure Boot, module signing, nouveau conflict | Section 8 |
| Xid 79 error / GPU reset required | PCIe AER fault, power delivery, bad VBIOS | Section 9 |
| GPU overheating / fan not spinning | Airflow blockage, fan hardware failure, power cap | Section 4 |
| DCGM or burn-in test failure | Memory defect, thermal, driver stack corruption | Section 10 |

 

2. Quick Diagnostic Commands — Run These First

Run these on the affected system before contacting support. Paste the output into your support ticket.

 

Linux

nvidia-smi -q                          # full GPU status, temps, power, ECC

nvidia-smi --query-gpu=index,name,temperature.gpu,power.draw,clocks.sm,clocks.gr,pstate,ecc.errors.uncorrected.aggregate.total --format=csv

lspci | grep -i nvidia                 # confirm GPUs visible on PCIe bus

dmesg | grep -i 'nvidia\|nvrm\|xid'   # kernel GPU errors

cat /proc/driver/nvidia/gpus/*/information 2>/dev/null
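The Linux commands above can be gathered into one file for the support ticket. The following is a minimal sketch, not an Exxact-supplied tool: the function names run_section and collect_gpu_diag and the default output file name are assumptions made for illustration. Commands that fail (for example, dmesg without root) are recorded with their exit status instead of stopping the run.

```shell
#!/usr/bin/env bash
# Sketch: collect the Linux diagnostics from this section into one text file.
# Function names and the output file name are illustrative assumptions.

run_section() {
  # $1 = command line to run; prints a header, then the command's output.
  echo "===== $1 ====="
  bash -c "$1" 2>&1 || echo "(no output or command failed, exit status $?)"
  echo
}

collect_gpu_diag() {
  # $1 = output file (defaults to a timestamped name in the current directory)
  local out="${1:-gpu-diag-$(uname -n)-$(date +%Y%m%d-%H%M%S).txt}"
  {
    run_section "nvidia-smi -q"
    run_section "nvidia-smi --query-gpu=index,name,temperature.gpu,power.draw,clocks.sm,clocks.gr,pstate,ecc.errors.uncorrected.aggregate.total --format=csv"
    run_section "lspci | grep -i nvidia"
    run_section "dmesg | grep -i 'nvidia\|nvrm\|xid'"
    run_section "cat /proc/driver/nvidia/gpus/*/information"
  } > "$out"
  echo "$out"   # print the path so it can be attached to the ticket
}

# Usage: collect_gpu_diag            (timestamped file in the current directory)
#        collect_gpu_diag /tmp/gpu-diag.txt
```

Running the function with sudo-capable privileges gives the most complete dmesg section.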

 

Windows

nvidia-smi -q                          # PowerShell or CMD

Get-PnpDevice | Where-Object {$_.FriendlyName -like '*NVIDIA*'} | Select Name, Status

Get-EventLog -LogName System -Newest 200 | Where-Object {$_.Source -like '*nvlddmkm*'} | Format-List

 

Tip:  All Exxact systems ship with a burn-in report (burn-in_PASSED.html) and DCGM validation log under your SN at cpq.exxactcorp.com/qa/. Compare current output to the factory baseline.

 

3. GPU Not Detected (nvidia-smi returns no GPUs)

Step 1 — Confirm PCIe visibility first

If the GPU does not appear in lspci (Linux) or Device Manager (Windows), this is a hardware problem — driver reinstall will not help.

  1. lspci | grep -i nvidia

If no output: the GPU is not enumerated on the PCIe bus, so the operating system cannot see it. Proceed to the hardware checks below.

If output is present but driver says no GPU: skip to Section 7 (Windows driver) or Section 8 (Linux driver).

 

Step 2 — Reseat the GPU

  1. Power off completely. Remove AC power cord and hold the power button 5 seconds to discharge capacitors.
  2. Remove the GPU from its PCIe slot. Inspect the connector edge for dust, corrosion, or bent pins.
  3. Reseat firmly until the slot latch clicks. On multi-GPU systems, try moving the card to a different PCIe slot.
  4. Reconnect all PCIe power connectors (6-pin, 8-pin, or 16-pin). Confirm they are fully seated and not reversed.
  5. Power on and recheck lspci.

 

Step 3 — Check PCIe power connectors and slot assignment

High-wattage GPUs (300 W+) require dedicated PCIe power cables — daisy-chained connectors from a single PSU rail are a common failure point.

  • Verify cable routing: each GPU should ideally have its own PSU rail or cable run.
  • For 4-GPU workstations or GPU servers, confirm the PSU rated wattage covers peak load: 4 × 300 W GPU + CPU + storage ≈ 1800 W minimum; use the next tier up for headroom.
  • For Supermicro servers: confirm PCIe riser board seating and verify slot assignment in BIOS (Advanced > PCIe Configuration).
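The PSU sizing rule above can be worked through as simple arithmetic. This sketch assumes a 350 W CPU and 250 W for storage, fans, memory, and motherboard; both figures are placeholders chosen so the total matches the 1800 W example, and the 20% headroom default is likewise an assumption. Substitute the values for your own configuration.

```shell
#!/usr/bin/env bash
# Sketch: estimate peak system load and a PSU rating with headroom.
# CPU and "other" wattages below are assumed example values, not measured ones.

psu_budget() {
  # $1 = GPU count, $2 = watts per GPU, $3 = CPU watts,
  # $4 = other components (storage, fans, memory), $5 = headroom percent (default 20)
  local gpus=$1 gpu_w=$2 cpu_w=$3 other_w=$4 headroom_pct=${5:-20}
  local load=$(( gpus * gpu_w + cpu_w + other_w ))
  local recommended=$(( load * (100 + headroom_pct) / 100 ))
  echo "peak_load_w=$load recommended_psu_w=$recommended"
}

psu_budget 4 300 350 250   # prints: peak_load_w=1800 recommended_psu_w=2160
```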

  

4. GPU Overheating, Fan Not Spinning, or Throttling

Identify throttle cause

nvidia-smi -q -d PERFORMANCE          # look for HW Slowdown, HW Thermal, HW Power Brake flags

watch -n 1 'nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.throttle_reasons.active,clocks.sm --format=csv,noheader'
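The clocks.throttle_reasons.active field is reported as a hexadecimal bitmask, which is hard to read at a glance. The sketch below decodes it using the bit values defined for NVML clock throttle reasons; the function name decode_throttle_reasons and the short labels are illustrative, and the parsing assumes the driver returns a plain hex value (not N/A) for this field.

```shell
#!/usr/bin/env bash
# Sketch: turn the throttle-reason bitmask from nvidia-smi into readable labels.
# Labels are shorthand chosen for this example; bit values follow NVML's definitions.

decode_throttle_reasons() {
  local mask=$(( $1 ))          # accepts hex such as 0x0000000000000048
  local bits=(0x1 0x2 0x4 0x8 0x10 0x20 0x40 0x80 0x100)
  local names=(GPU_IDLE APP_CLOCKS_SETTING SW_POWER_CAP HW_SLOWDOWN SYNC_BOOST
               SW_THERMAL_SLOWDOWN HW_THERMAL_SLOWDOWN HW_POWER_BRAKE DISPLAY_CLOCK_SETTING)
  local out=() i
  for i in "${!bits[@]}"; do
    if (( mask & bits[i] )); then out+=("${names[i]}"); fi
  done
  if (( ${#out[@]} == 0 )); then echo "NONE"; else echo "${out[*]}"; fi
}

# Decode the live value for every GPU when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,clocks.throttle_reasons.active --format=csv,noheader |
  while IFS=', ' read -r idx mask; do
    echo "GPU $idx: $(decode_throttle_reasons "$mask")"
  done
fi
```

A result containing HW_SLOWDOWN, HW_THERMAL_SLOWDOWN, or HW_POWER_BRAKE corresponds to the hardware flags named in the comment above; SW_POWER_CAP alone is expected when a power cap is set.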

 

Thermal thresholds (RTX A6000 / typical NVIDIA professional GPUs)

| Threshold | Typical value | Meaning |
| --- | --- | --- |
| GPU Slowdown Temp | 88–95 °C | Clock throttle begins |
| GPU Max Operating Temp | 93 °C | Performance degraded |
| GPU Shutdown Temp | 98 °C | Emergency power-off |
| GPU Target Temp | 84 °C | Fan control setpoint |

 

Cooling checklist

  1. Verify airflow direction. Exxact 1U/2U GPU servers use front-to-back airflow. Reverse-airflow cards will overheat in standard chassis.
  2. Check fan health via BMC/IPMI: ipmitool sdr type Fan — any fan reading 0 RPM is a failure.
  3. Inspect for blocked airflow: blanking panels missing, cables draped over GPU heatsinks, inadequate rack spacing.
  4. Check thermal paste on GPU heatsink if fan is spinning but temperatures are abnormally high (>85 °C at idle).
  5. On tower/workstation systems with blower-style cards: ensure at least 1U of clearance between GPU exhaust and the next component.
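The fan check in item 2 can be filtered so that only stopped fans are listed. This sketch assumes the usual pipe-separated layout of ipmitool sdr output, with the reading (for example "5600 RPM") in the last column; the function name failed_fans is hypothetical, and BMCs that format readings differently will need an adjusted match.

```shell
#!/usr/bin/env bash
# Sketch: list fans that report 0 RPM in `ipmitool sdr type Fan` output.
# Assumes rows such as: FAN1 | 41h | ok | 29.1 | 5600 RPM

failed_fans() {
  # Reads sensor rows on stdin and prints the name of each fan reading exactly 0 RPM.
  awk -F'|' '
    {
      name = $1; reading = $NF
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", reading)
      if (reading == "0 RPM") print name
    }'
}

# Run against the BMC when ipmitool is installed (may require root).
if command -v ipmitool >/dev/null 2>&1; then
  ipmitool sdr type Fan 2>/dev/null | failed_fans
fi
```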

 

Power cap as a thermal workaround

If hardware changes cannot be made immediately, a temporary power cap reduces heat and prevents shutdowns:

sudo nvidia-smi -pl 220     # example: cap all GPUs at 220 W (adjust per GPU TDP; add -i <index> for one GPU). Requires root and resets on reboot.

nvidia-smi -q -d POWER | grep -i limit   # verify new limit applied

Important:  Power capping reduces compute throughput. It is a workaround, not a fix. Escalate to Exxact Support if the underlying cause (cooling or PSU) is not resolved.

 

5. Severe GPU Slowdown on Intel Systems (IOMMU / VT-d)

Observed on Intel Xeon systems: deep learning workloads running at 1–3% GPU utilization. Benchmark shows 100–190× slowdown vs. AMD equivalents. Root cause: Intel VT-d (IOMMU) DMA translation overhead during concurrent CPU+GPU memory operations.

 

Diagnosis

dmesg | grep -i iommu

cat /proc/cmdline | grep iommu

If you see DMAR: IOMMU enabled in dmesg and no passthrough option in cmdline, IOMMU is likely the cause.
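The cmdline half of this check can be expressed as a small function. This is a sketch only: the name iommu_mode and its three result labels are made up for this example, and "default" simply means neither option is present, so the dmesg check above still decides whether translation is actually active.

```shell
#!/usr/bin/env bash
# Sketch: classify the IOMMU-related options on a kernel command line.

iommu_mode() {
  # $1 = kernel command line text
  local cmdline=" $1 "          # pad so whole-word matches work at both ends
  case "$cmdline" in
    *" intel_iommu=off "*) echo "disabled" ;;
    *" iommu=pt "*)        echo "passthrough" ;;
    *)                     echo "default" ;;
  esac
}

# Report the mode of the running kernel.
if [ -r /proc/cmdline ]; then
  echo "IOMMU option on this boot: $(iommu_mode "$(cat /proc/cmdline)")"
fi
```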

 

Fix — Try in order

| # | Kernel parameter | Effect | Risk |
| --- | --- | --- | --- |
| 1 | iommu=pt | IOMMU passthrough: reduces DMA overhead while keeping IOMMU active | Low |
| 2 | intel_iommu=off | Fully disables VT-d: maximum GPU DMA performance; disables SR-IOV and IOMMU isolation | Medium; disable only on bare-metal GPU compute |

 

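Apply via GRUB (Ubuntu):

sudo nano /etc/default/grub

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=pt"    # append iommu=pt to the parameters already on this line

sudo update-grub && sudo reboot

Apply via grubby (RHEL / Rocky, where update-grub is not available):

sudo grubby --update-kernel=ALL --args="iommu=pt" && sudo reboot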
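On Ubuntu the edit to /etc/default/grub can also be scripted so the parameter is appended once and existing parameters are preserved. The sketch below is an illustration, not a vendor script: the function name add_kernel_param is hypothetical, it only touches a GRUB_CMDLINE_LINUX_DEFAULT="..." line, and it relies on GNU sed. Back up the file before using it on a real system.

```shell
#!/usr/bin/env bash
# Sketch: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT if it is not already there.

add_kernel_param() {
  # $1 = path to the grub defaults file, $2 = parameter such as iommu=pt
  local file=$1 param=$2 current
  current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
  case " $current " in
    *" $param "*) return 0 ;;   # already present, leave the file untouched
  esac
  sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"\$|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"|" "$file"
}

# Usage (as root): add_kernel_param /etc/default/grub iommu=pt
#                  update-grub && reboot
```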

 

Also check BIOS:

Advanced > CPU Configuration > Intel Virtualization Technology for Directed I/O (VT-d) → Disabled for bare-metal GPU compute workloads. Confirm with your system admin if virtualization (SR-IOV, KVM pass-through) is needed before disabling.

 

Field note (Ticket #41473 — UT Dallas):  4× RTX A6000 on Supermicro X13DEG-QT (Intel Xeon Gold 5520+).

iommu=pt reduced slowdown from ~192× to ~161×. Full resolution required intel_iommu=off.

Confirmed: ECC enabled on all GPUs, power capped at 220 W, no Xid errors at time of report.

Next steps at time of ticket: BIOS update and DCGM diagnostic.

 

6. BIOS Update and Power Delivery

When to update BIOS

  • System lockups under GPU load that cannot be explained by thermals or PSU capacity.
  • PCIe link generation or width stays downgraded under load (for example, nvidia-smi shows PCIe Gen 1 when Gen 4 is expected). Gen 1 at idle is normal.
  • BMC becomes unresponsive during GPU stress — may indicate BIOS-level power management bug.

 

Check current BIOS version

sudo dmidecode -t bios | grep -E 'Version|Release'   # Linux (dmidecode needs root)

Get-CimInstance Win32_BIOS | Select-Object Name, SMBIOSBIOSVersion, ReleaseDate  # Windows PowerShell

 

BIOS update by platform

| Platform | BIOS update method | Notes |
| --- | --- | --- |
| Supermicro | IPMI Web UI > Maintenance > BIOS Update, or UEFI USB tool | Download from supermicro.com/support/resources/downloadcenter |
| ASUS | ASUS EZ Flash (BIOS screen) or USB BIOS Flashback | Flashback requires a specific USB port; check the manual |
| Gigabyte | Q-Flash or @BIOS utility | Use Q-Flash from BIOS for safest update |
| MSI | M-FLASH from BIOS menu | Requires a FAT32-formatted USB drive |

 

PCIe link degradation note:  nvidia-smi may report 'Current PCIe Generation: 1' even though the GPU max is Gen 4.

This is normal at idle — NVIDIA downclocks PCIe links in P8 state.

To verify at load: run a workload and re-query nvidia-smi. If still Gen 1 under load, check BIOS PCIe settings.
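The under-load check can be scripted by comparing the current link generation and width with the maximums that nvidia-smi reports. This is a sketch under a few assumptions: the function name check_pcie_link is hypothetical, the query returns plain integers for all four fields, and the reported maximum reflects what the slot can actually deliver (a Gen 4 card in a Gen 3 slot would always show as degraded). Run it while a workload is active.

```shell
#!/usr/bin/env bash
# Sketch: flag GPUs whose PCIe link is below its reported maximum.

check_pcie_link() {
  # $1 = one CSV row: gen_current, gen_max, width_current, width_max
  local gen_cur gen_max width_cur width_max
  IFS=', ' read -r gen_cur gen_max width_cur width_max <<< "$1"
  if (( gen_cur < gen_max || width_cur < width_max )); then
    echo "DEGRADED gen ${gen_cur}/${gen_max} width x${width_cur}/x${width_max}"
  else
    echo "OK gen ${gen_cur} width x${width_cur}"
  fi
}

# Query every GPU when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max \
    --format=csv,noheader |
  while IFS= read -r row; do check_pcie_link "$row"; done
fi
```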

 


 

 

7. NVIDIA Driver Reinstall — Windows

Use this procedure when: driver install fails, GPU shows 'Code 43' in Device Manager, or after a corrupted driver update pushed the system into repair mode.

 

Critical prerequisite:  Complete ALL pending Windows updates before reinstalling NVIDIA drivers.

A Windows servicing stack corruption (error 0x800F0991) will cause driver installation to fail even after DDU.

Signs: WindowsUpdateClient errors in Event Viewer, SFC /scannow fails with 'Windows Resource Protection' error.

Fix: run DISM /Online /Cleanup-Image /RestoreHealth, complete updates, then reboot before driver install.

 

Procedure

  1. Download DDU (Display Driver Uninstaller) from guru3d.com and the target NVIDIA driver from nvidia.com. Store both on a USB drive.
  2. Boot into Safe Mode: Settings > System > Recovery > Advanced Startup > Troubleshoot > Advanced Options > Startup Settings > Restart > press 4.
  3. Run DDU: Select GPU type = 'GPU', Device = 'NVIDIA'. Choose 'Clean and restart in Safe Mode'. DDU will remove all NVIDIA components including audio, PhysX, and USB.
  4. After DDU reboot (still in Safe Mode): verify Device Manager shows no NVIDIA devices.
  5. Boot to normal Windows. Disconnect from the network. Run the NVIDIA installer as Administrator.
  6. If install still fails: open Event Viewer (eventvwr.msc) > Windows Logs > System. Look for WindowsUpdateClient or SetupAPI errors in the 30 minutes before the failure.
  7. If Windows update loop blocks driver install: try attaching a USB NIC — some NIC firmware versions block Windows authentication for update packages. Alternatively update BIOS (newer BIOS often includes updated NIC firmware).

 

Verify successful install

nvidia-smi                             # GPU should appear with correct driver version

Get-PnpDevice | Where {$_.FriendlyName -like '*NVIDIA*'} | Select Name, Status

 

Field note (Ticket #41003 — Fairfield University):  4× NVIDIA RTX A6000 on ASUS PRO WS WRX80E-SAGE (AMD Threadripper PRO 5995WX), Windows 11.

Driver R580 (582.16) install failed after Windows repair mode. DDU run in Safe Mode — reinstall still failed.

Root cause: incomplete Windows updates (KB5077181 error 0x800F0991). SFC also failed.

Resolution: complete Windows updates first (DISM restore health + Windows Update), then reinstall driver.

Secondary fix considered: BIOS update from v1201 (2023) to v1801 (2025) to resolve NIC firmware issue blocking Windows Update.

 

8. NVIDIA Driver Reinstall — Linux

Use when: driver install fails, GPU disappears after kernel update, or nvidia.ko module fails to load.

 

Step 1 — Disable Secure Boot or sign the module

NVIDIA drivers are out-of-tree kernel modules. Secure Boot requires module signing.

  • Recommended for pure GPU compute: disable Secure Boot in BIOS (Advanced > Security > Secure Boot).
  • Alternative: enroll MOK key — see NVIDIA documentation for mokutil procedure.

 

Step 2 — Full driver purge

sudo systemctl stop gdm lightdm sddm 2>/dev/null; true

sudo apt-get purge --autoremove 'nvidia-*' 'cuda-*' 'libcuda*' -y    # Ubuntu/Debian

sudo dnf remove 'nvidia*' 'cuda*' -y                                 # RHEL/Rocky

sudo rm -f /etc/modprobe.d/blacklist-nouveau.conf

sudo update-initramfs -u    # Ubuntu/Debian

sudo dracut -f              # RHEL/Rocky (use instead of update-initramfs)

sudo reboot

 

Step 3 — Blacklist nouveau and install

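After reboot, confirm nouveau is not loaded. Step 2 removes any existing blacklist file, so nouveau may load again. If it does, add "blacklist nouveau" and "options nouveau modeset=0" to /etc/modprobe.d/blacklist-nouveau.conf, rebuild the initramfs, and reboot before installing.

lsmod | grep nouveau    # should return empty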
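If nouveau is still listed, the blacklist file can be written as sketched below. The function name write_nouveau_blacklist and its optional directory argument are illustrative (the argument exists so the function can be tried against a scratch directory first). Distribution driver packages often install an equivalent blacklist themselves, so this step matters mainly for .run file installs.

```shell
#!/usr/bin/env bash
# Sketch: create a modprobe blacklist entry that keeps nouveau from loading.

write_nouveau_blacklist() {
  # $1 = target directory (defaults to /etc/modprobe.d, which requires root)
  local dir=${1:-/etc/modprobe.d}
  printf 'blacklist nouveau\noptions nouveau modeset=0\n' > "$dir/blacklist-nouveau.conf"
}

# Usage (as root):
#   write_nouveau_blacklist
#   update-initramfs -u     # Ubuntu/Debian, or: dracut -f on RHEL/Rocky
#   reboot
```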

 

Install via package manager (recommended — handles DKMS automatically):

# Ubuntu — via graphics-drivers PPA

sudo add-apt-repository ppa:graphics-drivers/ppa -y

sudo apt-get update

sudo apt-get install nvidia-driver-550 -y    # substitute target version

sudo reboot

 

Or install from .run file (for specific versions):

sudo bash NVIDIA-Linux-x86_64-550.163.01.run --dkms

 

Step 4 — Verify

nvidia-smi

lsmod | grep nvidia    # nvidia, nvidia_modeset, nvidia_uvm should all be present

cat /proc/driver/nvidia/version
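The module check can be made explicit by listing which of the three expected modules are absent. A minimal sketch, with a hypothetical function name; note that nvidia_uvm may only load once a CUDA application or nvidia-smi has run, so a missing nvidia_uvm right after boot is not necessarily a fault.

```shell
#!/usr/bin/env bash
# Sketch: report which expected NVIDIA kernel modules are missing from lsmod output.

missing_nvidia_modules() {
  # $1 = text of `lsmod`; prints the missing module names on one line (empty if none)
  local lsmod_out=$1 mod missing=()
  for mod in nvidia nvidia_modeset nvidia_uvm; do
    grep -q "^${mod} " <<< "$lsmod_out" || missing+=("$mod")
  done
  echo "${missing[*]}"
}

# Check the running system.
if command -v lsmod >/dev/null 2>&1; then
  result=$(missing_nvidia_modules "$(lsmod)")
  if [ -z "$result" ]; then
    echo "All expected NVIDIA modules are loaded"
  else
    echo "Missing modules: $result"
  fi
fi
```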

 

9. Xid 79 — GPU Reset Required (PCIe AER Error)

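Xid 79 ("GPU has fallen off the bus") means the driver lost communication with the GPU over PCIe. It is often accompanied by PCIe AER (Advanced Error Reporting) messages in dmesg, and the GPU needs a reset or reboot to recover. Possible causes: PCIe signal integrity, power delivery, or a defective GPU.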

 

Identify

dmesg | grep -i 'xid\|nvrm' | tail -40   # most recent NVIDIA driver and Xid messages

nvidia-smi --query-gpu=ecc.errors.uncorrected.aggregate.total --format=csv   # check for aggregate ECC errors
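To see whether Xid errors are tied to one GPU (triage step 4 below depends on this), the kernel log can be summarized per PCI address. The sketch assumes the common log line shape "NVRM: Xid (PCI:0000:41:00): 79, ..."; the function name count_xids is hypothetical, and driver versions that format the line differently would need a different pattern.

```shell
#!/usr/bin/env bash
# Sketch: count Xid events per PCI address and Xid code from kernel log text on stdin.

count_xids() {
  sed -n 's/.*NVRM: Xid (\([^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p' |
    sort | uniq -c |
    awk '{ print $2, "xid=" $3, "count=" $1 }'
}

# Summarize the current kernel log (dmesg may require root).
if command -v dmesg >/dev/null 2>&1; then
  dmesg 2>/dev/null | count_xids
fi
```

Repeated Xid 79 entries against a single PCI address point toward that card, its slot, or its power cabling rather than a system-wide cause.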

 

Triage steps

  1. Check PCIe link width and generation under load — if downgrading, replace cable or switch to a different slot.
  2. Reseat GPU and reconnect power. Xid 79 is frequently caused by marginal power delivery.
  3. Update BIOS — BIOS updates often include PCIe error handling improvements.
  4. If Xid 79 is intermittent and tied to a specific GPU: run DCGM diag (Section 10) and consider RMA if test fails.
  5. Check VBIOS version: nvidia-smi --query-gpu=vbios_version --format=csv — outdated VBIOS can cause AER errors; contact Exxact for VBIOS update guidance.

 

10. DCGM Diagnostic (Linux Only)

NVIDIA's DCGM tool runs comprehensive hardware diagnostics including memory, compute, and bandwidth tests. Required for warranty RMA eligibility.

 

Install DCGM

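# Ubuntu (package comes from the NVIDIA CUDA repository, which must be configured)

sudo apt-get install datacenter-gpu-manager -y

# RHEL / Rocky (package comes from the NVIDIA CUDA repository, which must be configured)

sudo dnf install datacenter-gpu-manager -y

sudo systemctl --now enable nvidia-dcgm    # start the DCGM host engine that dcgmi connects to

dcgmi --version    # verify install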

 

Run diagnostic levels

| Level | Command | What it tests |
| --- | --- | --- |
| -r 1 | dcgmi diag -r 1 | Quick: PCIe, memory bandwidth (< 1 min) |
| -r 2 | dcgmi diag -r 2 | Standard: adds ECC and compute (~ 2 min) |
| -r 3 | dcgmi diag -r 3 | Extended: full stress + memory (~ 15 min) |
| -r 4 | dcgmi diag -r 4 | Comprehensive: all tests at full duration (30+ min, required for RMA) |

 

RMA requirement:  Exxact Support requires dcgmi diag -r 4 output before approving a GPU RMA. Attach the full output to your ticket.

Note: DCGM is not installed by default on Exxact workstations. If 'dcgmi: command not found', install via the commands above.
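Because the full -r 4 output has to be attached to the ticket, it helps to save it to a file and scan that file for failures. The sketch below assumes the table-style dcgmi output in which each row looks like "| Test name | Pass |" or "| Test name | Fail |"; the function name dcgm_failed_tests and the file name are hypothetical, and other DCGM versions may lay the table out differently.

```shell
#!/usr/bin/env bash
# Sketch: list the tests marked Fail in a saved dcgmi diag report.

dcgm_failed_tests() {
  # $1 = file containing saved `dcgmi diag` output
  awk -F'|' '
    NF >= 3 && $3 ~ /^[ \t]*Fail/ {
      name = $2
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      print name
    }' "$1"
}

# Usage:
#   dcgmi diag -r 4 2>&1 | tee dcgm-diag-r4.txt   # save the full report for the ticket
#   dcgm_failed_tests dcgm-diag-r4.txt            # prints nothing if no test failed
```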

 

11. When to Escalate to Exxact Support

| Escalate if... | What to include in your ticket |
| --- | --- |
| GPU not visible in lspci after reseating | nvidia-smi -q output, lspci -vvv, dmesg lines matching nvidia |
| Lockups persist after power cap + BIOS update | IPMI sensor log, ipmitool sdr, BIOS version, PSU model/wattage |
| DCGM diag -r 4 reports FAIL on any test | Full dcgmi diag -r 4 output (save to file) |
| Xid 79 errors recurring after reseat + BIOS update | dmesg lines matching Xid, nvidia-smi ECC aggregate errors, VBIOS version |
| Windows driver reinstall fails after DDU + Windows update | Event Viewer System log (.evtx), BIOS version, DDU log |
| GPU cooling issue requiring physical fan replacement | Fan RPM from IPMI, nvidia-smi temperature log, system model/SN |

 

Contact Exxact Support

Portal: support.exxactcorp.com
Email: support@exxactcorp.com
Phone: (510) 226-7366, Mon–Fri 8:30am–5:30pm PT
AI Chat: Available on support.exxactcorp.com for instant answers to common issues

 

Appendix A — Key nvidia-smi Fields Explained

| Field | What it means |
| --- | --- |
| Performance State (Pstate) | P0 = max performance, P8 = idle. Stays at P8 at idle; should go to P0 under load. |
| HW Slowdown: Active | GPU is thermally or power-throttled at the hardware level. Check temps and power. |
| ECC Errors > 0 (Aggregate) | Persistent ECC errors indicate memory cells going bad. File a support ticket if uncorrectable > 0. |
| PCIe Generation: Current 1 | Normal at idle (P8 state). Must be Gen 3/4/5 under active compute load. |
| Reset Required: Yes | GPU has detected a fatal error and needs a driver reset or system reboot. Note any Xid in dmesg. |
| Drain and Reset Recommended: Yes | Soft reset recommended. Stop all processes using the GPU, then run: sudo nvidia-smi --gpu-reset -i <GPU index> |

© 2026 Exxact Corporation. Internal use and customer-facing knowledge base article.
