Power Supply Failure

Dev Account

Summary

These tickets cover systems and redundant PSUs that lose standby power, show amber or fault LEDs, pop or spark, or shut down under load. Some are clean single-PSU failures; others begin as suspected PSU faults and later prove to be cable, firmware, motherboard, cooling, or power-distribution problems.

Frequency

  • 240 tickets

Common Causes

  1. A PSU itself failed and the fault followed the unit
    The clearest pattern is a bad module that still faults after slot, cable, or chassis swaps. Examples: #10000, #10784, #17087, #27403, #9685, and 40 more.
  2. Intermittent or load-triggered power loss
    Some systems stayed up at idle but shut down during training, burn-in, or heavy GPU use, with PSU behavior only becoming obvious under sustained draw. Examples: #10301, #18948, #18951, #25196, #26472, and 17 more.
  3. Cabling, AC source, or redundant-PSU path issues
    A subset involved bad power cords, slot-specific backplane issues, a fault limited to one PSU in a redundant pair, or confusion about wall power and feed redundancy. Examples: #10664, #11064, #12820, #13885, #15219, and 23 more.
  4. Burn, pop, or visible electrical damage
    Several tickets report sparks, popping sounds, burnt smell, smoke, or charred hardware, making the PSU failure explicit and urgent. Examples: #12927, #17374, #18086, #20065, #23811, and 10 more.
  5. Suspected PSU failure that turned out to be broader platform trouble
    Some cases were later traced to the motherboard, cooler, power distribution board (PDB), or firmware behavior rather than the PSU module alone. Examples: #12820, #18951, #19329, #24079, #42135, and 18 more.

Diagnostic Steps

  1. Prove whether the fault follows the PSU
    Swap the suspect unit into another slot or chassis, or insert a known-good PSU into the system, before assuming the server itself is bad. Representative tickets: #10000, #10784, #12820, #17087, #27403.
  2. Observe the PSU indicators and failure mode carefully
    Note amber LEDs, no standby LED, fan behavior, alarm state, pop sounds, smell, or whether the system still runs on the remaining PSU. Representative tickets: #14762, #15219, #20374, #23811, #9685.
  3. Check system logs for power events, not just the customer's description
    System event log (SEL) entries, BMC logs, and PSU over-current or thermal messages helped separate PSU failure from thermal or motherboard events. Representative tickets: #13564, #18951, #24079, #39535, #39651.
  4. Rule out simple external power-path issues
    Confirm outlet, PDU, cable, and feed behavior, especially in redundant or datacenter deployments. Representative tickets: #11064, #13592, #13812, #20720, #29782.
  5. Escalate quickly when there is electrical damage or repeated no-power behavior
    Smoke, popping, burnt odor, or repeated no-power behavior after swap-testing usually justified an immediate RMA. Representative tickets: #12927, #17374, #18086, #20065, #37472.
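
The steps above can be read as a rough decision order. The following Python sketch is illustrative only: the `PsuObservation` fields, the `next_action` function, and the returned strings are hypothetical names invented for this example, and the ordering (electrical damage first, then the swap test, then logs and external power) is an assumption rather than a documented procedure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PsuObservation:
    """Hypothetical record of what a technician has observed so far."""
    electrical_damage: bool = False            # smoke, popping, burnt odor, charring
    fault_follows_psu: Optional[bool] = None   # None means the swap test was not run
    no_power_after_swap: bool = False          # still dead with a known-good PSU
    log_points_elsewhere: bool = False         # SEL/BMC shows thermal or board events
    external_power_ok: Optional[bool] = None   # None means outlet/PDU/cable not checked


def next_action(obs: PsuObservation) -> str:
    """Suggest a next step; the ordering here is an illustrative assumption."""
    if obs.electrical_damage:
        return "escalate: immediate RMA"
    if obs.fault_follows_psu is None:
        return "run swap test"
    if obs.fault_follows_psu:
        return "replace PSU module"
    if obs.no_power_after_swap:
        return "escalate: immediate RMA"
    if obs.log_points_elsewhere:
        return "investigate thermal, motherboard, or PDB"
    if obs.external_power_ok is None:
        return "check outlet, PDU, cable, and feed"
    if not obs.external_power_ok:
        return "fix external power path"
    return "review SEL and BMC logs"


# Example: the swap test showed the fault stays with the chassis and the cord is bad.
print(next_action(PsuObservation(fault_follows_psu=False, external_power_ok=False)))
```

Checking for electrical damage before anything else reflects the urgency described in step 5; the rest of the ordering can be rearranged to match local practice.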

Solutions

  1. Replace the failed PSU module
    The most reliable fix was component RMA or advance replacement once swap tests proved the fault followed the PSU. Examples: #10784, #17087, #27403, #36217, #9685, and 60 more.
  2. RMA the full system when the power fault was not isolated to one module
    Full-system return was common when shutdowns persisted, multiple subsystems were implicated, or the backplane or board might be involved. Examples: #14762, #18951, #19329, #24079, #42135, and 35 more.
  3. Correct cabling, feed, or redundant-power configuration
    Some tickets resolved after fixing external power-path problems rather than replacing hardware. Examples: #10664, #13592, #13812, #20720, #29782.
  4. Apply firmware or platform remediation when PSU hardware passed
    A minority of cases stabilized after BIOS, BMC, or PSU firmware updates changed power behavior under load. Examples: #18951, #19685, #37170, #41046.
  5. Offer a replacement-part purchase or quote path for out-of-warranty systems
    Older systems often ended with replacement-part sales guidance instead of warranty RMA. Examples: #15058, #19110, #38890, #42137.

Edge Cases

  • Thermal trips can look like PSU failure: some systems shut off under load and only later proved to have CPU or chassis cooling faults rather than bad PSUs. See #18951, #24079, #34618.
  • Motherboard or PDB faults can mimic dead PSUs: several cases started as bad-PSU reports but later implicated the board-level power path. See #12820, #19329, #35091, #39631.
  • Known manufacturing or fleet defects: a few tickets were batch-style replacement requests tied to known PSU defects rather than new diagnosis. See #8210, #36217, #39510.
  • System still runs on one PSU, masking urgency: redundant designs sometimes hid the failure until alarms or slot swaps made the bad module obvious. See #10000, #10784, #27403, #9685.
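
Several of these edge cases come down to sorting log entries into power events and thermal events before blaming the PSU. The sketch below shows one way to bucket exported log lines by keyword. The keyword patterns, the `classify_events` function, and the sample lines are assumptions made up for illustration; they are not a vendor message catalog, and real SEL or BMC output will differ by platform.

```python
import re

# Keyword patterns are illustrative assumptions; tune them to the platform's real messages.
# Order matters: a line is assigned to the first bucket whose pattern matches.
PATTERNS = {
    "psu": re.compile(r"power supply|psu|over-?current|ac lost", re.IGNORECASE),
    "thermal": re.compile(r"temperature|thermal|overheat|fan", re.IGNORECASE),
}


def classify_events(log_text: str) -> dict:
    """Group non-empty log lines into 'psu', 'thermal', or 'other' buckets."""
    buckets = {name: [] for name in PATTERNS}
    buckets["other"] = []
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                buckets[name].append(line)
                break
        else:
            # No pattern matched this line.
            buckets["other"].append(line)
    return buckets


# Hypothetical sample lines, not copied from any real system.
sample = """\
1 | Power Supply PS1 | Failure detected | Asserted
2 | Temperature CPU1 | Upper Critical going high | Asserted
3 | System Event | OEM record
"""

for bucket, lines in classify_events(sample).items():
    print(bucket, len(lines))
```

A shutdown with thermal entries and no PSU entries around the same timestamp suggests checking cooling before requesting a PSU replacement.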

Related Issues

Referenced by

  • Pws 2k20a 1r — product affected by this issue (×13)
  • Sheng Ye — handled tickets on this issue (×3)
  • Vws 135223847 — product affected by this issue (×3)
  • Ian Dicarlo — handled tickets on this issue (×41)
  • David Nguyen — handled tickets on this issue (×4)
  • RMA Workflow — co-occurs with this issue (×184)
  • David — handled tickets on this issue (×9)
  • Jared Royster — handled tickets on this issue (×24)
  • Jason Chen — handled tickets on this issue (×16)
  • Shipping Damage — co-occurs with this issue (×3)
