PCIe Riser Failure
Summary
These tickets cover failures in PCIe risers, daughterboards, bridge boards, slot assemblies, and their attached signal paths. The symptom is usually that one slot, one side of the chassis, or one whole riser path stops enumerating GPUs, NICs, drives, or other PCIe devices, often while the device itself still works in another slot.
Frequency
- 72 tickets
Common Causes
- Failed riser or daughterboard hardware
  The most common pattern is a fault that stays with one riser, slot bank, or daughterboard after card swaps, proving the add-in card is not the root cause. Examples: #16225, #17752, #19637, #26650, #8386, and 16 more.
- Physical damage to slots, connectors, or PCB components
  Several cases involve broken slots, snapped connectors, damaged sockets, shipping damage, or capacitors falling off the riser itself. Examples: #15826, #20777, #29305, #30755, #5610, and 11 more.
- Signal-cable or link-path faults masquerading as bad risers
  Some tickets initially looked like riser failure but were ultimately traced to degraded transfer cables, loose signal cables, or lane-width problems in the same path. Examples: #17061, #19205, #33918, #38142, #41067, and 7 more.
- Barebone or motherboard-side faults behind the riser path
  A smaller but important group started as riser suspicion and ended with barebone, tray, switch, or motherboard replacement instead. Examples: #17875, #19339, #24643, #33322, #33769, and 8 more.
- No-trouble-found or intermittent platform behavior
  Some returned systems passed burn-in after reseat, firmware, or OS changes, leaving the original riser-path failure plausible but not permanently reproduced. Examples: #18775, #19778, #34868, #39887, #6075.
Diagnostic Steps
- Prove the fault stays with the slot or riser path
  Move the GPU, NIC, or adapter to another slot and see whether the problem remains with the original riser position. Representative tickets: #16225, #17061, #19781, #20564, #8386.
- Check whether only part of the chassis is affected
  Loss of devices on one side, one slot group, or slots above a threshold often points to a daughterboard or bridge-board problem. Representative tickets: #17752, #19637, #20792, #23250, #26650.
- Inspect for physical damage and seating issues
  Look for cracked slots, loose capacitors, broken connectors, damaged boards, poor seating, and signs of shipping damage or melted plastic. Representative tickets: #15826, #20777, #30435, #30736, #40252.
- Validate the signal path, not just the riser board
  Review lane width, P2P bandwidth, lspci and nvidia-smi output, AER errors, and any transfer or SlimSAS-style cabling tied to the riser path. Representative tickets: #15459, #19205, #19778, #33918, #37240.
- Escalate to RMA when a subcomponent fault is credible but not safely repairable in the field
  Many cases needed depot validation or manufacturer handling because the riser is treated as a barebone subcomponent. Representative tickets: #11120, #17875, #19638, #27264, #32262.
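The swap test in the first step comes down to a simple rule: a slot is suspect when a device that works elsewhere fails in it, and a device is suspect when it fails in a slot that works with other devices. The sketch below encodes that rule. The function name, the observation tuple layout, and the slot and device labels are all hypothetical and chosen only for illustration; it is not an existing tool.

```python
def isolate_fault(observations):
    """Classify swap-test results.

    observations: iterable of (device_id, slot_id, detected) tuples, where
    detected is True if the device enumerated in that slot.
    Returns (suspect_slots, suspect_devices) as sorted lists.
    """
    obs = list(observations)
    # A device or slot counts as known-good once it has worked at least once.
    good_devices = {device for device, _slot, ok in obs if ok}
    good_slots = {slot for _device, slot, ok in obs if ok}

    suspect_slots, suspect_devices = set(), set()
    for device, slot, ok in obs:
        if ok:
            continue
        if device in good_devices and slot not in good_slots:
            # The device works elsewhere, so the fault stays with the slot.
            suspect_slots.add(slot)
        elif slot in good_slots and device not in good_devices:
            # The slot works with other devices, so the fault follows the card.
            suspect_devices.add(device)
        # Any other combination is inconclusive (for example, intermittent).
    return sorted(suspect_slots), sorted(suspect_devices)


# Hypothetical example: two GPUs both fail in one riser slot, both work in another.
EXAMPLE = [
    ("gpu-A", "riser1-slot2", False),
    ("gpu-A", "riser2-slot1", True),
    ("gpu-B", "riser1-slot2", False),
    ("gpu-B", "riser2-slot1", True),
]
print(isolate_fault(EXAMPLE))
```

A result that names neither a slot nor a device is worth treating as inconclusive rather than clean, since intermittent faults and signal-cable problems can produce mixed results across swaps.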
Solutions
- Replace the riser, daughterboard, or bridge board
  The most reliable fix was direct replacement once swapping proved the failure followed the subcomponent. Examples: #11120, #16225, #17947, #19781, #8386, and 17 more.
- RMA the full system or barebone when the fault reaches beyond one removable board
  Full return was often necessary when the riser issue implicated the motherboard tray, PCIe switch path, chassis, or multiple slot groups. Examples: #11062, #17875, #19329, #24643, #34056, and 14 more.
- Reseat or replace the associated signal cable
  Several tickets were fixed by reseating or replacing an I/O or transfer cable rather than replacing the riser PCB itself. Examples: #17061, #19205, #33918, #34868.
- Apply BIOS or firmware remediation when link speed or enumeration is wrong
  A smaller subset recovered after BIOS or firmware work restored proper PCIe generation or device visibility. Examples: #19778, #28928, #34868.
- Use onsite or advance replacement when downtime is the main constraint
  For production systems, cross-ship or onsite replacement reduced downtime even before full depot confirmation. Examples: #17947, #20777, #29009, #32262, #37240.
Edge Cases
- Storage or NIC failures were sometimes riser failures in disguise: several tickets began as missing NVMe drives or dead NICs before isolation pointed to the riser path. See #16618, #23250, #25825, #35263.
- Heat or electrical damage destroyed the riser secondarily: in one case a burned RAID controller melted the riser, making the riser damage real but not primary. See #40252.
- Shipping damage created first-boot riser faults: broken slots and bent assemblies were found immediately on arrival or after transport. See #20777, #30435, #30736, #34744.
- Intermittent PCIe generation downgrade can mimic hard failure: some systems looked like dead slots until link speed, BIOS state, or cable seating was corrected. See #17061, #19778, #38142.
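One way to spot the link-downgrade case above is to compare each device's advertised link capability (`LnkCap`) with its negotiated state (`LnkSta`) in `lspci -vv` output. The following is a minimal sketch, assuming the usual `Speed <n>GT/s, Width x<n>` layout of those lines; the helper names and the sample text are illustrative and not taken from any ticket.

```python
import re
from dataclasses import dataclass

# Matches "LnkCap:" / "LnkSta:" lines only (not LnkCap2 / LnkSta2 / LnkCtl).
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


@dataclass
class LinkReport:
    address: str
    description: str
    cap_speed: float  # GT/s advertised by the device
    cap_width: int    # lanes advertised by the device
    sta_speed: float  # GT/s actually negotiated
    sta_width: int    # lanes actually negotiated

    @property
    def degraded(self) -> bool:
        return self.sta_speed < self.cap_speed or self.sta_width < self.cap_width


def parse_lspci_vv(text):
    """Return one LinkReport per device that shows both LnkCap and LnkSta."""
    reports = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented lines start a new device: "<address> <description>".
            address, _, description = line.partition(" ")
            current = {"address": address, "description": description.strip()}
            continue
        match = LINK_RE.search(line)
        if match and current is not None:
            kind, speed, width = match.groups()
            current[kind] = (float(speed), int(width))
            if "LnkCap" in current and "LnkSta" in current:
                reports.append(LinkReport(
                    current["address"], current["description"],
                    *current["LnkCap"], *current["LnkSta"],
                ))
                current = None  # ignore any further lines for this device
    return reports


# Illustrative sample only; real output contains many more lines per device.
SAMPLE = """
3b:00.0 3D controller: Example GPU
        LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
        LnkSta: Speed 8GT/s (downgraded), Width x8 (downgraded)
5e:00.0 Ethernet controller: Example NIC
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
        LnkSta: Speed 8GT/s (ok), Width x8 (ok)
"""

for report in parse_lspci_vv(SAMPLE):
    print(report.address, "DEGRADED" if report.degraded else "ok")
```

On a live system the text could come from `subprocess.run(["lspci", "-vv"], capture_output=True, text=True).stdout`, typically run as root so the capability lines are readable. A reduced speed on an idle GPU may reflect link power management rather than a fault, so a downgrade is more meaningful when it persists under load or shows up as reduced width.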
Related Issues
- GPU Hardware Failure
- Network Port Failure
- Defective Storage Drives
- Motherboard Hardware Failure
- BIOS BMC Issues
- System Boot Failure
- NIC Hardware Failure
Referenced by
- TS4-194492555 — product affected by this issue (×3)
- Shipping Damage — co-occurs with this issue (×5)
- Allen Huynh — handled tickets on this issue (×2)
- RTX A5000 — product affected by this issue (×2)
- Jason Chen — handled tickets on this issue (×13)
- Ian Dicarlo — handled tickets on this issue (×10)
- RMA Workflow — co-occurs with this issue (×61)
- Jared Royster — handled tickets on this issue (×9)
- Sheng Ye — handled tickets on this issue (×1)
- H200 — product affected by this issue (×1)