How to Identify a Defective DIMM from an EDAC Error

Andrew Rodriguez

Document Scope

This article explains how to use EDAC to detect DIMMs that are reporting errors and how to locate those DIMMs on the motherboard. It covers installing edac-utils, running it, and acting on the errors it reports. An Asus 2U ESC-4000 G4 is used as the example for locating DIMMs on a motherboard.

Sample Error Message

CPU_SrcID#0_MC#1_Chan#0_DIMM#0

[root@c109 ~]# dmesg |grep -i error
[    1.281914] ERST: Error Record Serialization Table (ERST) support is initialized.
[  755.619919] mce: [Hardware Error]: Machine check events logged
[  755.620269] EDAC skx MC1: HANDLING MCE MEMORY ERROR
[  755.620362] EDAC MC1: 1 CE memory read error on CPU_SrcID#0_MC#1_Chan#0_DIMM#0 (channel:0 slot:0 page:0x5c068 offset:0x900 grain:32 syndrome:0x0 -  err_code:0101:0090 socket:0 imc:1 rank:0 bg:2 ba:0 row:1e05 col:40)
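On a busy system the kernel log can contain many EDAC lines, so it can help to reduce them to the DIMM labels they mention. The following is a minimal sketch, assuming the labels follow the CPU_SrcID#_MC#_Chan#_DIMM# format shown in the sample above. The function name edac_dimm_labels is hypothetical and not part of edac-utils.

```shell
# Summarize which DIMM labels appear in kernel log lines read from stdin.
# Assumption: labels look like CPU_SrcID#<n>_MC#<n>_Chan#<n>_DIMM#<n>,
# as in the sample message above. edac_dimm_labels is a hypothetical helper.
edac_dimm_labels() {
    # grep -o prints only the matching label, one per line.
    grep -o 'CPU_SrcID#[0-9][0-9]*_MC#[0-9][0-9]*_Chan#[0-9][0-9]*_DIMM#[0-9][0-9]*' |
        sort | uniq -c |
        awk '{ print $2, $1 }'   # "<label> <number of log lines mentioning it>"
}
```

A possible invocation is `dmesg | edac_dimm_labels`. The number printed is the count of log lines that mention the label, not the corrected-error total, since a single line can report more than one error.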

Installing EDAC

CentOS/Rocky

sudo yum install edac-utils

Ubuntu

sudo apt-get install edac-utils

Note: This will not work in a live-boot environment (e.g. USB, Live CD, etc.)

Usage Example

This example uses an Asus 2U ESC-4000 G4 server. Note that the EDAC output and the location of the DIMMs will differ between systems.

[root@c109 ~]# edac-util
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#0_DIMM#0: 13 Corrected Errors
[root@c109 ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#0_DIMM#0: 13 Corrected Errors
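For scripting or monitoring, the verbose output can be reduced to the DIMMs whose corrected-error count reaches a chosen threshold. This is a sketch only, assuming the `edac-util -v` line format shown above (fields separated by a colon and a space). The function name edac_dimms_over and the threshold argument are hypothetical.

```shell
# Print "<label> <count>" for every DIMM in `edac-util -v` output (on stdin)
# whose corrected-error count is at least the given threshold.
# Assumption: per-DIMM lines look like
#   mc1: csrow0: CPU_SrcID#0_MC#1_Chan#0_DIMM#0: 13 Corrected Errors
edac_dimms_over() {
    # $1: minimum corrected-error count to report (defaults to 1)
    awk -F': ' -v min="${1:-1}" '
        NF == 4 && $4 ~ / Corrected Errors$/ && ($4 + 0) >= (min + 0) {
            split($4, parts, " ")   # parts[1] holds the numeric count
            print $3, parts[1]
        }'
}
```

For example, `edac-util -v | edac_dimms_over 10` would list only the DIMMs with ten or more corrected errors. What threshold justifies replacing a DIMM is a site-specific decision.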


Interpreting the EDAC Output

The label CPU_SrcID#0_MC#1_Chan#0_DIMM#0 refers to CPU #0, memory controller #1, channel #0, DIMM #0. Using the tables below, the defective DIMM is D1.

CPU #0

  • MC #0 - Memory Controller #0 [Band A to C]
  • MC #1 - Memory Controller #1 [Band D to F]

CPU #1

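  • MC #2 - Memory Controller #2 [Band G to J]
  • MC #3 - Memory Controller #3 [Band K to M]

Note: In the edac-util output from this system, the DIMM labels for CPU #1 restart at MC#0 and MC#1 and are reported under mc2 and mc3, which correspond to MC #2 and MC #3 in this list.
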
CPU_SrcID#0 / MC #0    Channel Number    DIMM Number
A1                     Chan #0           DIMM #0
A2                     Chan #0           DIMM #1
B1                     Chan #1           DIMM #0
C1                     Chan #2           DIMM #0

CPU_SrcID#0 / MC #1    Channel Number    DIMM Number
D1                     Chan #0           DIMM #0
D2                     Chan #0           DIMM #1
E1                     Chan #1           DIMM #0
F1                     Chan #2           DIMM #0

CPU_SrcID#1 / MC #2    Channel Number    DIMM Number
G1                     Chan #0           DIMM #0
G2                     Chan #0           DIMM #1
H1                     Chan #1           DIMM #0
J1                     Chan #2           DIMM #0

CPU_SrcID#1 / MC #3    Channel Number    DIMM Number
K1                     Chan #0           DIMM #0
K2                     Chan #0           DIMM #1
L1                     Chan #1           DIMM #0
M1                     Chan #2           DIMM #0
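
The lookup in the tables above can be sketched as a small shell function. It is specific to the Asus ESC-4000 G4 layout described in this article and will not be correct for other boards. The function name dimm_slot is hypothetical. The sketch also assumes that a label on CPU #1 may use MC#0 or MC#1 (as in the edac-util -v output shown in this article) and converts it to the board-wide MC #2 or MC #3 numbering used by the tables.

```shell
# Map an EDAC DIMM label to the silkscreen slot name for the board in this
# article (hypothetical helper; the layout is an assumption for other systems).
dimm_slot() {
    ds_fields=$(printf '%s\n' "$1" | sed -n \
        's/^CPU_SrcID#\([0-9][0-9]*\)_MC#\([0-9][0-9]*\)_Chan#\([0-9][0-9]*\)_DIMM#\([0-9][0-9]*\)$/\1 \2 \3 \4/p')
    if [ -z "$ds_fields" ]; then
        echo "unrecognized label: $1" >&2
        return 1
    fi
    # Split "src mc chan dimm" into the positional parameters.
    set -- $ds_fields
    ds_src=$1 ds_mc=$2 ds_chan=$3 ds_dimm=$4
    # Labels on CPU #1 may restart at MC#0/MC#1; convert to the board-wide index.
    if [ "$ds_mc" -lt 2 ]; then
        ds_mc=$(($ds_src * 2 + $ds_mc))
    fi
    # The tables only cover MC 0-3, channels 0-2, and a second DIMM on channel 0.
    if [ "$ds_mc" -gt 3 ] || [ "$ds_chan" -gt 2 ] || [ "$ds_dimm" -gt 1 ] ||
       { [ "$ds_dimm" -eq 1 ] && [ "$ds_chan" -ne 0 ]; }; then
        echo "label not covered by the table: CPU $ds_src MC $ds_mc Chan $ds_chan DIMM $ds_dimm" >&2
        return 1
    fi
    # Three channels per controller; the letter I is skipped on this board.
    ds_idx=$(($ds_mc * 3 + $ds_chan + 1))
    ds_letter=$(printf '%s\n' "A B C D E F G H J K L M" | cut -d' ' -f"$ds_idx")
    printf '%s%s\n' "$ds_letter" "$(($ds_dimm + 1))"
}
```

For example, `dimm_slot 'CPU_SrcID#0_MC#1_Chan#0_DIMM#0'` prints D1, matching the defective DIMM identified above.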

 

Once the defective DIMM is replaced, no errors appear in the EDAC output:

[root@c109 ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_MC#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#0_MC#1_Chan#0_DIMM#1: 0 Corrected Errors
mc2: 0 Uncorrected Errors with no DIMM info
mc2: 0 Corrected Errors with no DIMM info
mc2: csrow0: 0 Uncorrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#0_DIMM#0: 0 Corrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#1_DIMM#0: 0 Corrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#2_DIMM#0: 0 Corrected Errors
mc2: csrow1: 0 Uncorrected Errors
mc2: csrow1: CPU_SrcID#1_MC#0_Chan#0_DIMM#1: 0 Corrected Errors
mc3: 0 Uncorrected Errors with no DIMM info
mc3: 0 Corrected Errors with no DIMM info
mc3: csrow0: 0 Uncorrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#0_DIMM#0: 0 Corrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#1_DIMM#0: 0 Corrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#2_DIMM#0: 0 Corrected Errors
mc3: csrow1: 0 Uncorrected Errors
mc3: csrow1: CPU_SrcID#1_MC#1_Chan#0_DIMM#1: 0 Corrected Errors
edac-util: No errors to report.
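
Rather than reading every line, the check after a DIMM replacement can be reduced to a single total. The sketch below sums every corrected and uncorrected count in `edac-util -v` output, assuming the line format shown above. The function name edac_total_errors is hypothetical.

```shell
# Sum all corrected and uncorrected error counts from `edac-util -v` output
# read on stdin and print the total (0 when nothing has been reported).
# Assumption: each count is the leading number of the last ": "-separated
# field, e.g. "13 Corrected Errors" or "0 Uncorrected Errors with no DIMM info".
edac_total_errors() {
    awk -F': ' '
        $NF ~ /^[0-9]+ (Un)?[Cc]orrected Errors/ { total += $NF + 0 }
        END { print total + 0 }'
}
```

A possible use is `[ "$(edac-util -v | edac_total_errors)" -eq 0 ] && echo "no EDAC errors"`. Note that whether the counters restart from zero after a reboot or driver reload depends on the system, so treat a zero total as a sign that no new errors have been logged since the counters were last reset.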

More Examples

EDAC DIMM Label                   Defective DIMM Location
CPU_SrcID#1_MC#2_Chan#0_DIMM#1    G2
CPU_SrcID#0_MC#1_Chan#2_DIMM#0    F1
