Document Scope
This article explains how to use EDAC to detect DIMMs that are reporting errors and how to locate those DIMMs on the motherboard. It covers installing EDAC, running it, and acting on the errors it reports. An Asus 2U ESC-4000 G4 is used as the example for locating DIMMs on a motherboard.
Sample Error Message
CPU_SrcID#0_MC#1_Chan#0_DIMM#0
[root@c109 ~]# dmesg |grep -i error
[ 1.281914] ERST: Error Record Serialization Table (ERST) support is initialized.
[ 755.619919] mce: [Hardware Error]: Machine check events logged
[ 755.620269] EDAC skx MC1: HANDLING MCE MEMORY ERROR
[ 755.620362] EDAC MC1: 1 CE memory read error on CPU_SrcID#0_MC#1_Chan#0_DIMM#0 (channel:0 slot:0 page:0x5c068 offset:0x900 grain:32 syndrome:0x0 - err_code:0101:0090 socket:0 imc:1 rank:0 bg:2 ba:0 row:1e05 col:40)
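On a busy system the kernel log can contain many such lines, so it can help to reduce them to the DIMM labels alone. The following is a minimal sketch, assuming the labels follow the CPU_SrcID#N_MC#N_Chan#N_DIMM#N pattern shown above and that grep supports the -o option (GNU and BSD grep do). The function name extract_dimm_labels is hypothetical and not part of EDAC.

```shell
# Hypothetical helper: read kernel log text on stdin and print each
# distinct DIMM label that appears in it, one per line.
extract_dimm_labels() {
    grep -oE 'CPU_SrcID#[0-9]+_MC#[0-9]+_Chan#[0-9]+_DIMM#[0-9]+' | sort -u
}

# Typical use (run as root on the affected host):
#   dmesg | extract_dimm_labels
```

Each label it prints can then be looked up against the slot layout of the specific motherboard.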
Installing EDAC
CentOS/Rocky
sudo yum install edac-utils
Ubuntu
sudo apt-get install edac-utils
Note: This will not work in a live-boot environment (e.g. USB, Live CD).
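After installation, one quick sanity check is whether the kernel exposes any memory controllers to EDAC. The sketch below counts the mcN directories under /sys/devices/system/edac/mc, which is where the kernel EDAC subsystem normally publishes them. The function name edac_mc_count and its optional directory argument are hypothetical, added only so the check can be pointed at a different path.

```shell
# Hypothetical helper: count the memory controllers registered with EDAC.
# The optional first argument overrides the sysfs directory to inspect.
edac_mc_count() {
    root=${1:-/sys/devices/system/edac/mc}
    n=0
    for d in "$root"/mc[0-9]*; do
        # If nothing matches, the glob stays literal and fails the -d test.
        [ -d "$d" ] && n=$((n + 1))
    done
    echo "$n"
}

# Typical use:
#   edac_mc_count
```

A result of 0 suggests that no EDAC driver is loaded for the platform, in which case edac-util will have nothing to report.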
Usage Example
This example uses an Asus 2U ESC-4000 G4 server. Note that the EDAC output and the location of the DIMMs will differ between systems.
[root@c109 ~]# edac-util
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#0_DIMM#0: 13 Corrected Errors
[root@c109 ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#0_DIMM#0: 13 Corrected Errors
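The verbose output grows quickly on fully populated systems, so it can be convenient to show only the lines with a non-zero count. The following is a minimal sketch, assuming the line format shown above, where the count is the word immediately before "Corrected" or "Uncorrected". The function name nonzero_edac_lines is hypothetical.

```shell
# Hypothetical helper: read `edac-util -v` output on stdin and print only
# the lines whose corrected or uncorrected error count is greater than zero.
nonzero_edac_lines() {
    awk '{
        for (i = 2; i <= NF; i++)
            if (($i == "Corrected" || $i == "Uncorrected") && $(i-1) + 0 > 0) {
                print
                break
            }
    }'
}

# Typical use:
#   edac-util -v | nonzero_edac_lines
```

Applied to the verbose output above, this leaves only the mc1 line that reports 13 corrected errors.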
Interpreting the EDAC Output
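The label CPU_SrcID#0_MC#1_Chan#0_DIMM#0 identifies memory controller 1, channel 0, DIMM 0, which on this board is slot D1. The defective DIMM is therefore D1.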
CPU #0
- MC #0 - Memory Controller #0 [Band A to C]
- MC #1 - Memory Controller #1 [Band D to F]
CPU #1
- MC #2 - Memory Controller #2 [Band G to J]
- MC #3 - Memory Controller #3 [Band K to M]
CPU_SrcID#0 / MC #0 | Channel Number | DIMM Number |
A1 | Chan #0 | DIMM #0 |
A2 | Chan #0 | DIMM #1 |
B1 | Chan #1 | DIMM #0 |
C1 | Chan #2 | DIMM #0 |
CPU_SrcID#0 / MC #1 | Channel Number | DIMM Number |
D1 | Chan #0 | DIMM #0 |
D2 | Chan #0 | DIMM #1 |
E1 | Chan #1 | DIMM #0 |
F1 | Chan #2 | DIMM #0 |
CPU_SrcID#1 / MC #2 | Channel Number | DIMM Number |
G1 | Chan #0 | DIMM #0 |
G2 | Chan #0 | DIMM #1 |
H1 | Chan #1 | DIMM #0 |
J1 | Chan #2 | DIMM #0 |
CPU_SrcID#1 / MC #3 | Channel Number | DIMM Number |
K1 | Chan #0 | DIMM #0 |
K2 | Chan #0 | DIMM #1 |
L1 | Chan #1 | DIMM #0 |
M1 | Chan #2 | DIMM #0 |
Once the defective DIMM is replaced, no errors appear in the EDAC output:
[root@c109 ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_MC#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#0_MC#1_Chan#0_DIMM#1: 0 Corrected Errors
mc2: 0 Uncorrected Errors with no DIMM info
mc2: 0 Corrected Errors with no DIMM info
mc2: csrow0: 0 Uncorrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#0_DIMM#0: 0 Corrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#1_DIMM#0: 0 Corrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#2_DIMM#0: 0 Corrected Errors
mc2: csrow1: 0 Uncorrected Errors
mc2: csrow1: CPU_SrcID#1_MC#0_Chan#0_DIMM#1: 0 Corrected Errors
mc3: 0 Uncorrected Errors with no DIMM info
mc3: 0 Corrected Errors with no DIMM info
mc3: csrow0: 0 Uncorrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#0_DIMM#0: 0 Corrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#1_DIMM#0: 0 Corrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#2_DIMM#0: 0 Corrected Errors
mc3: csrow1: 0 Uncorrected Errors
mc3: csrow1: CPU_SrcID#1_MC#1_Chan#0_DIMM#1: 0 Corrected Errors
edac-util: No errors to report.
More Examples
EDAC Label | Defective DIMM Location |
CPU_SrcID#1_MC#2_Chan#0_DIMM#1 | G2 |
CPU_SrcID#0_MC#1_Chan#2_DIMM#0 | F1 |