Purpose
This document provides systematic troubleshooting procedures for cases where NVIDIA GPUs are not visible in nvidia-smi output on Linux systems. It serves as a reference for diagnosing and resolving GPU detection issues.
Prerequisites
Before diving into troubleshooting, ensure you have:
- Verified your System BIOS is the latest version
- Installed the latest NVIDIA driver for your GPU (see the version check after this list)
- Confirmed your PSU provides enough total wattage AND amperage per rail
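To confirm the driver prerequisite, one quick way to check which NVIDIA driver version is currently installed (assuming the NVIDIA kernel module is present; the exact check can vary by distribution) is:
modinfo nvidia | grep -i ^version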
Setting Persistence Mode
To set NVIDIA persistence mode permanently so that it survives reboots:
- Create a systemd service file:
vi /etc/systemd/system/nvidia-persistenced.service
- Add the following content:
[Unit]
Description=Enable NVIDIA Persistence Mode
After=default.target
[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm ENABLED
RemainAfterExit=true
[Install]
WantedBy=multi-user.target
- Enable and start the service:
systemctl daemon-reload
systemctl enable --now nvidia-persistenced.service
- Verify that persistence mode is enabled:
nvidia-smi -q | grep "Persistence Mode"
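If persistence mode is active, the query reports it for each GPU; the output looks similar to the following (illustrative):
Persistence Mode                      : Enabled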
Troubleshooting Steps
Step 1: Verify GPU Visibility
Check if the GPU is visible via PCI:
lspci -tvnn | grep -i nvidia
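A detected GPU shows up as a device entry with NVIDIA's vendor ID; the exact name, bus address, and IDs depend on your card, but the relevant line looks roughly like this (illustrative):
\-00.0  NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204]
If no line is returned at all, the GPU is not visible on the PCI bus and the problem lies below the driver layer.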
Step 2: Check System Logs
If the GPU is not visible, check logs for any GPU-related errors:
dmesg | egrep -i 'err|nvrm|xid'
ipmitool sel list
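Xid messages in dmesg are a strong signal; in particular, a GPU that has dropped off the bus typically logs a line similar to the following (bus address and PID are illustrative):
NVRM: Xid (PCI:0000:3b:00.0): 79, pid=1234, GPU has fallen off the bus.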
Step 3: System Reboot
Reboot the system to see if PCI devices reinitialize properly.
If this resolves the issue:
- Verify you're using the latest driver version
- Consider adding kernel parameters to GRUB_CMDLINE_LINUX in /etc/default/grub, for example:
GRUB_CMDLINE_LINUX="crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=28070ec6-f366-44fc-90c9-1ec719f63153 rhgb quiet pcie_aspm=off acpi=off"
- Regenerate the GRUB configuration after making changes:
RHEL/CentOS/Rocky/Fedora:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
Ubuntu/Debian:
sudo update-grub
- Reboot the system:
sudo reboot
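After the reboot, confirm that the new parameters were actually applied to the running kernel:
cat /proc/cmdline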
Step 4: Hardware Check
If issues persist after reboot:
- Physically reseat the GPUs and their power cables
- Ensure all connections are secure
If reseating fixes the issue:
- Update to the latest driver
- Add the recommended kernel parameters as in Step 3
Step 5: Isolate GPU Issues
Identify Specific GPU Issues
If only certain GPUs aren't showing:
- Map all visible GPUs and their locations:
nvidia-smi --query-gpu=name,index,serial,pci.bus_id --format=csv
- Compare this output with your expected GPU configuration to identify which specific GPUs are missing
- Note the PCI bus ID of any missing or problematic GPUs for further troubleshooting
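As an illustration of this mapping, on a hypothetical four-GPU system where one card has dropped off the bus, the query might return only three rows (names, serial numbers, and bus IDs below are made up):
name, index, serial, pci.bus_id
NVIDIA A100-SXM4-80GB, 0, 1650522000001, 00000000:07:00.0
NVIDIA A100-SXM4-80GB, 1, 1650522000002, 00000000:0B:00.0
NVIDIA A100-SXM4-80GB, 2, 1650522000003, 00000000:48:00.0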
Isolate Hardware Issues
If problems continue:
- Remove all but one GPU and boot the system with the single GPU installed
- If the single GPU is detected, reinstall the latest driver and add the kernel parameters from Step 3
- Add the remaining GPUs back one at a time
If GPUs stop appearing after adding more, you may have hit one of these limits:
- Power limit
- Thermal limit
- BIOS limitation
- PCIe bifurcation limit
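If you suspect a PCIe or bifurcation limitation, one way to inspect how a visible GPU is actually linked is to query its PCIe capabilities and status, which report the negotiated link speed and width (replace the example bus ID 3b:00.0 with one from Step 5):
sudo lspci -vvv -s 3b:00.0 | grep -iE 'lnkcap|lnksta'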
Step 6: Check and Adjust Power Settings
First, check the current power information and limits:
nvidia-smi -q -d POWER
This will show you:
- Power management mode
- Power draw
- Enforced power limit
- Default power limit
- Min/max power limits
If you suspect power issues, reduce consumption to see if you're hitting system power limits:
nvidia-smi -pl 250 # Set 250W power limit, adjust to your GPU model
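The limit can also be applied to a single GPU by index; root privileges are required, and the value must fall within the min/max limits reported by the query above. For example:
sudo nvidia-smi -i 0 -pl 250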
Step 7: Consider RMA
If you've tried all steps and the unit consistently fails, consider returning the system or GPU under warranty (RMA).
Common Causes of lspci Not Showing NVIDIA GPUs
- Power Issues: Insufficient power or loose connections
- Driver Problems: Incompatible or corrupted drivers
- BIOS Settings: Outdated BIOS or incorrect PCIe settings
- Hardware Failures: Damaged GPU or PCIe slots
- System Overloading: Too many GPUs for your system configuration
- Kernel Issues: Linux kernel parameters interfering with GPU detection