PCIe Riser Failure

Dev Account

Summary

These tickets cover failures in PCIe risers, daughterboards, bridge boards, slot assemblies, and their attached signal paths. The symptom is usually that one slot, one side of the chassis, or one whole riser path stops enumerating GPUs, NICs, drives, or other PCIe devices, often while the device itself still works in another slot.

Frequency

  • 72 tickets

Common Causes

  1. Failed riser or daughterboard hardware
    The most common pattern is a fault that stays with one riser, slot bank, or daughterboard after card swaps, proving the add-in card is not the root cause. Examples: #16225, #17752, #19637, #26650, #8386, and 16 more.
  2. Physical damage to slots, connectors, or PCB components
    Several cases involve broken slots, snapped connectors, damaged sockets, shipping damage, or capacitors falling off the riser itself. Examples: #15826, #20777, #29305, #30755, #5610, and 11 more.
  3. Signal-cable or link-path faults masquerading as bad risers
    Some tickets initially looked like riser failure but were ultimately traced to degraded transfer cables, loose signal cables, or lane-width problems in the same path. Examples: #17061, #19205, #33918, #38142, #41067, and 7 more.
  4. Barebone or motherboard-side faults behind the riser path
    A smaller but important group started as riser suspicion and ended with barebone, tray, switch, or motherboard replacement instead. Examples: #17875, #19339, #24643, #33322, #33769, and 8 more.
  5. No-trouble-found or intermittent platform behavior
    Some returned systems passed burn-in after reseat, firmware, or OS changes, leaving the original riser-path failure plausible but not permanently reproduced. Examples: #18775, #19778, #34868, #39887, #6075.

Diagnostic Steps

  1. Prove the fault stays with the slot or riser path
    Move the GPU, NIC, or adapter to another slot and see whether the problem remains with the original riser position. Representative tickets: #16225, #17061, #19781, #20564, #8386.
  2. Check whether only part of the chassis is affected
    Loss of devices on one side of the chassis, in one slot group, or in every slot above a certain slot number often points to a daughterboard or bridge-board problem. Representative tickets: #17752, #19637, #20792, #23250, #26650.
  3. Inspect for physical damage and seating issues
    Look for cracked slots, loose capacitors, broken connectors, damaged boards, poor seating, and signs of shipping damage or melted plastic. Representative tickets: #15826, #20777, #30435, #30736, #40252.
  4. Validate the signal path, not just the riser board
    Check the negotiated lane width and link speed, P2P bandwidth, lspci and nvidia-smi output, AER errors, and any transfer or SlimSAS-style cabling tied to the riser path. Representative tickets: #15459, #19205, #19778, #33918, #37240.
  5. Escalate to RMA when a subcomponent path is credible but not safely repairable in the field
    Many cases needed depot validation or manufacturer handling because the riser is treated as a barebone subcomponent. Representative tickets: #11120, #17875, #19638, #27264, #32262.
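
The lane-width check in step 4 can be sketched as a small parser for `lspci -vv` output. This is a minimal illustration, not a tool taken from the tickets: the function name `find_degraded_links` is hypothetical, and the regular expressions assume the common `LnkCap:` / `LnkSta:` line layout, which can vary between pciutils versions.

```python
import re

# Matches lines such as
#   "LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM ..."
#   "LnkSta: Speed 8GT/s (downgraded), Width x8 (downgraded)"
# The colon right after the name keeps LnkCap2/LnkSta2 lines out.
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")

# A device block starts at column 0 with a bus address,
# e.g. "17:00.0" or "0000:17:00.0".
DEVICE_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+(.*)$")


def find_degraded_links(lspci_text):
    """Return devices whose negotiated link is below the advertised capability.

    Each result is (address, description, (cap_speed, cap_width),
    (sta_speed, sta_width)), with speed in GT/s and width in lanes.
    """
    devices = {}
    current = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            devices[current] = {"name": header.group(2)}
            continue
        link = LINK_RE.search(line)
        if link and current:
            kind, speed, width = link.groups()
            devices[current][kind] = (float(speed), int(width))

    degraded = []
    for addr, info in devices.items():
        cap, sta = info.get("LnkCap"), info.get("LnkSta")
        # Flag the device if either speed or width trained below capability.
        if cap and sta and (sta[0] < cap[0] or sta[1] < cap[1]):
            degraded.append((addr, info["name"], cap, sta))
    return degraded
```

Feed it the text of `lspci -vv` (root privileges are usually needed to see the link capability fields). A reduced speed on its own is weak evidence, because some devices lower link speed at idle to save power; a reduced width, or a downgrade that stays with the slot after a card swap, is more consistent with a riser-path or cable fault.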

Solutions

  1. Replace the riser, daughterboard, or bridge board
    The most reliable fix was direct replacement once swapping proved the failure followed the subcomponent. Examples: #11120, #16225, #17947, #19781, #8386, and 17 more.
  2. RMA the full system or barebone when the fault reaches beyond one removable board
    Full return was often necessary when the riser issue implicated the motherboard tray, PCIe switch path, chassis, or multiple slot groups. Examples: #11062, #17875, #19329, #24643, #34056, and 14 more.
  3. Reseat or replace the associated signal cable
    Several tickets were fixed by reseating or replacing an I/O or transfer cable rather than replacing the riser PCB itself. Examples: #17061, #19205, #33918, #34868.
  4. Apply BIOS or firmware remediation when link speed or enumeration is wrong
    A smaller subset recovered after BIOS or firmware work restored proper PCIe generation or device visibility. Examples: #19778, #28928, #34868.
  5. Use onsite or advance replacement when downtime is the main constraint
    For production systems, cross-ship or onsite replacement reduced downtime even before full depot confirmation. Examples: #17947, #20777, #29009, #32262, #37240.
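
As a rough illustration of how the diagnostic findings map onto the fixes above, the sketch below encodes one possible triage order. The `Findings` fields, the `suggest_fix` function, and the order in which the checks are applied are all assumptions made for this example; the tickets do not prescribe a fixed decision sequence.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Hypothetical summary of what isolation testing showed."""
    fault_follows_card: bool = False               # problem moved with the GPU/NIC
    cable_reseat_fixed_it: bool = False            # signal or transfer cable reseat cleared it
    beyond_one_board: bool = False                 # tray, switch path, or several slot groups
    link_speed_or_enumeration_wrong: bool = False  # wrong PCIe generation or missing device


def suggest_fix(f: Findings) -> str:
    """Map findings to one of the solutions listed above (assumed ordering)."""
    if f.fault_follows_card:
        # The swap test points away from the riser path entirely.
        return "Suspect the add-in card, not the riser path"
    if f.cable_reseat_fixed_it:
        return "Reseat or replace the associated signal cable"
    if f.beyond_one_board:
        return "RMA the full system or barebone"
    if f.link_speed_or_enumeration_wrong:
        return "Apply BIOS or firmware remediation, then retest"
    # Default: the fault stayed with one removable subcomponent.
    return "Replace the riser, daughterboard, or bridge board"
```

For example, `suggest_fix(Findings(beyond_one_board=True))` returns the full-system RMA recommendation, while an all-default `Findings()` falls through to direct board replacement.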

Edge Cases

  • Storage or NIC failures were sometimes riser failures in disguise: several tickets began as missing NVMe drives or dead NICs before isolation pointed to the riser path. See #16618, #23250, #25825, #35263.
  • Heat or electrical damage destroyed the riser secondarily: in one case a burned RAID controller melted the riser, making the riser damage real but not primary. See #40252.
  • Shipping damage created first-boot riser faults: broken slots and bent assemblies were found immediately on arrival or after transport. See #20777, #30435, #30736, #34744.
  • Intermittent PCIe generation downgrade can mimic hard failure: some systems looked like dead slots until link speed, BIOS state, or cable seating was corrected. See #17061, #19778, #38142.
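
Intermittent cases like the last one are easier to argue when the link state is sampled repeatedly rather than once. The sketch below assumes each sample is the CSV text produced by `nvidia-smi --query-gpu=pci.bus_id,pcie.link.gen.current,pcie.link.width.current --format=csv,noheader`; the helper names `parse_snapshot` and `find_unstable_gpus` are hypothetical.

```python
import csv
import io


def parse_snapshot(csv_text):
    """Map PCI bus ID -> (current gen, current width) for one sample."""
    snapshot = {}
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(rec) != 3:
            continue  # skip blank or malformed lines
        bus_id, gen, width = (field.strip() for field in rec)
        # Skip rows where the driver reports a non-numeric value.
        if gen.isdigit() and width.isdigit():
            snapshot[bus_id] = (int(gen), int(width))
    return snapshot


def find_unstable_gpus(snapshots):
    """Return GPUs that vanish or change link state across samples."""
    all_ids = set().union(*snapshots)  # every bus ID seen in any sample
    report = {}
    for bus_id in sorted(all_ids):
        states = [s.get(bus_id) for s in snapshots]
        missing = states.count(None)
        distinct = {st for st in states if st is not None}
        if missing or len(distinct) > 1:
            report[bus_id] = {
                "missing_samples": missing,
                "link_states": sorted(distinct),
            }
    return report
```

Because GPUs may lower their link generation at idle, samples are best taken under load; a GPU that drops out of some samples, or whose width changes, is a stronger signal than a generation change alone.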

Related Issues

Referenced by

  • TS4-194492555 — product affected by this issue (×3)
  • Shipping Damage — co-occurs with this issue (×5)
  • Allen Huynh — handled tickets on this issue (×2)
  • RTX A5000 — product affected by this issue (×2)
  • Jason Chen — handled tickets on this issue (×13)
  • Ian Dicarlo — handled tickets on this issue (×10)
  • RMA Workflow — co-occurs with this issue (×61)
  • Jared Royster — handled tickets on this issue (×9)
  • Sheng Ye — handled tickets on this issue (×1)
  • H200 — product affected by this issue (×1)
