Memory Failure
Summary
DIMM or memory-slot faults that surface as memory-training errors, missing installed capacity, or ECC-related shutdown concerns, either after POST or during runtime. This page starts with two cases: one clear new-system DIMM/slot RMA path, and one looser ECC power-off thread that shows how memory evidence can stay ambiguous when the symptoms later quiet down.
Frequency
7 tickets
Common Causes
- DIMM training failure or a bad slot can present immediately on a new system as one DIMM missing from total capacity and a BIOS error tied to a specific location (Ticket #23828).
- Correctable ECC reports do not by themselves prove an active DIMM failure: later logs may stay quiet, and the broader shutdown symptom may have a separate software or platform contributor (Ticket #23730).
Diagnostic Steps
- Reseat the DIMMs first and confirm whether the full expected memory capacity returns (Ticket #23828).
- Swap the suspect DIMM with a known-good DIMM to determine whether the failure follows the module or stays with the slot (Ticket #23828).
- Review SEL / system logs for recurring ECC events and exact slot labels before assuming the same memory fault is still active (Ticket #23730).
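The swap test in the steps above reduces to a small piece of decision logic. The following is a minimal sketch, not a support tool: the function name, the `Verdict` labels, and the two boolean inputs are hypothetical, and it assumes the suspect DIMM and the known-good DIMM have traded slots before the system is retested.

```python
from enum import Enum


class Verdict(Enum):
    MODULE = "fault follows the DIMM"
    SLOT = "fault stays with the slot"
    INCONCLUSIVE = "inconclusive, repeat the test"


def isolate_fault(fault_in_original_slot: bool, fault_in_new_slot: bool) -> Verdict:
    """Interpret a swap test (hypothetical helper).

    fault_in_original_slot: the error still appears at the original slot,
        which now holds the known-good DIMM.
    fault_in_new_slot: the error appears at the slot the suspect DIMM
        was moved into.
    """
    if fault_in_new_slot and not fault_in_original_slot:
        # The error moved with the module.
        return Verdict.MODULE
    if fault_in_original_slot and not fault_in_new_slot:
        # The error stayed put even with a known-good module installed.
        return Verdict.SLOT
    # Both or neither location fails: the test did not separate the two causes.
    return Verdict.INCONCLUSIVE
```

A result of `Verdict.SLOT` corresponds to the slot-level outcome described for Ticket #23828; an inconclusive result suggests repeating the reseat and swap before drawing a conclusion.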
Solutions
- Move to component or system RMA once swap testing shows that the fault stays with the slot path rather than following the DIMM (Ticket #23828).
- When the historical ECC alert does not reproduce and the live symptom may no longer be memory-led, continue monitoring with a clearer decision tree (Ticket #23730).
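One way to make the monitoring decision tree concrete is sketched below. This is an illustration under stated assumptions, not a documented procedure: the function name, the 30-day window, the repeat threshold of 2, and the slot label used in the usage note are all hypothetical, and the input is assumed to be correctable-ECC entries already extracted from the SEL as (timestamp, slot label) pairs.

```python
from collections import Counter
from datetime import datetime, timedelta


def triage_ecc(events, now, window_days=30, repeat_threshold=2):
    """Return (action, slot) for a list of correctable-ECC events.

    events: iterable of (timestamp, slot_label) pairs.
    window_days and repeat_threshold are assumed values, not vendor guidance.
    """
    cutoff = now - timedelta(days=window_days)
    # Count only events inside the monitoring window, grouped by slot.
    recent = Counter(slot for ts, slot in events if ts >= cutoff)
    if not recent:
        # Only historical events exist: memory is not confirmed as the cause.
        return ("monitor", None)
    slot, count = recent.most_common(1)[0]
    if count >= repeat_threshold:
        # Recurring events on one slot: move on to swap testing.
        return ("isolate", slot)
    # A single recent event: keep watching before escalating.
    return ("monitor", slot)
```

For example, a single event from several months ago yields `("monitor", None)`, while two events in the last week on a slot labelled `"DIMM_H1"` yield `("isolate", "DIMM_H1")`.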
Edge Cases
- A power-off complaint can initially mention ECC memory but later drift away from a confirmed memory root cause, so support should not overstate certainty just because one older SEL event exists (Ticket #23730).
Related Issues
Batch 35 Evidence
- Batch 35 establishes a new memory-failure page with both correctable-ECC and memory-training evidence, including one DIMM/P2 ECC shutdown thread that stayed partly inconclusive and one new-system H1 training fault that progressed through swap testing to probable slot-level RMA (Tickets #23730, #23828).
Batch 36 Evidence
- Batch 36 adds more DIMM and memory-led boot failures, including a Q-code 0d bad-DIMM case (Ticket #26635).
Batch 45 Evidence
- Batch 45 adds another strong DIMM/CPU isolation case, where widespread memory-training errors clustered around CPU2 and support used swap logic before moving the case into RMA (Ticket #29865).
Batch 46 Evidence
- Batch 46 adds another DIMM-led startup failure, where an amber DRAM light and failed boot after a routine reboot led support to treat the issue as memory-path or board-level hardware trouble and convert the case into system RMA flow (Ticket #34918).
Batch 78 Evidence
- Batch 78 adds another clean DIMM-failure case, where two new memory modules were reported DOA and support pushed the thread toward component RMA once slot-versus-DIMM isolation was requested (Ticket #41808).
Batch 80 Evidence
- Batch 80 adds a constrained correctable-MCE case, where L3/IO link errors appeared on multiple servers but full DoD logs could not be transferred, leaving memory as one plausible path among DIMM, CPU, and board causes rather than a confirmed DIMM failure (Ticket #40951).