
In /var/log/kern.log:

kernel: [13291329.657499] EDAC MC0: 48 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)

This is an EDAC log entry: one of the memory modules is reporting correctable errors (CE).
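For reference, a minimal sketch of pulling the interesting fields out of a line like the one above. The function name `parse_edac_ce` is made up for illustration, and the pattern is written against this one log line only; other EDAC drivers may format their messages differently.

```shell
# Hypothetical helper: extract controller, count, label, channel and slot
# from EDAC correctable-error lines read on stdin.
parse_edac_ce() {
    sed -n 's/.*EDAC MC\([0-9][0-9]*\): \([0-9][0-9]*\) CE .* on \([^ ]*\) (channel:\([0-9][0-9]*\) slot:\([0-9][0-9]*\).*/mc=\1 count=\2 label=\3 channel=\4 slot=\5/p'
}

# The log line quoted above, used as sample input.
line='kernel: [13291329.657499] EDAC MC0: 48 CE error on CPU#0Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)'
printf '%s\n' "$line" | parse_edac_ce
```

On a live system the same function could be fed with something like `grep EDAC /var/log/kern.log | parse_edac_ce`.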

I have read the EDAC documentation:

Dual channels allows for 128 bit data transfers to the CPU from memory.
Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs
(FB-DIMMs). The following example will assume 2 channels:


            Channel 0   Channel 1
    ===================================
    csrow0  | DIMM_A0   | DIMM_B0 |
    csrow1  | DIMM_A0   | DIMM_B0 |
    ===================================

    ===================================
    csrow2  | DIMM_A1   | DIMM_B1 |
    csrow3  | DIMM_A1   | DIMM_B1 |
    ===================================

and found the channel that is reporting errors:

$ grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:144648966
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch1_ce_count:0
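Since `grep "[0-9]"` prints every counter, zeros included, here is a small sketch that prints only the non-zero ones. The function name `nonzero_ce_counts` is hypothetical, and the base directory is a parameter only so it can be tried against a copy of the tree; it assumes the legacy `csrow*/ch*_ce_count` layout shown above.

```shell
# Hypothetical helper: list only the per-channel CE counters that are non-zero.
nonzero_ce_counts() {
    base=${1:-/sys/devices/system/edac/mc}
    for f in "$base"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue          # skip unmatched globs and unreadable files
        count=$(cat "$f")
        # Print "path:count" only when the counter is a number greater than zero.
        [ "$count" -gt 0 ] 2>/dev/null && printf '%s:%s\n' "$f" "$count"
    done
    return 0
}

nonzero_ce_counts
```

With the counters listed above, only the `mc0/csrow0/ch2_ce_count` line would be printed.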

So the errors are on mc0/csrow0/ch2. Following the naming in the doc, that should be DIMM_C0, which I expected to find with dmidecode.

But dmidecode lists no DIMM with that name, so I don't know which memory module has the problem:

$ dmidecode -t memory | grep 'Locator: PROC'
        Locator: PROC 1 DIMM 2A
        Locator: PROC 1 DIMM 1D
        Locator: PROC 1 DIMM 4B
        Locator: PROC 1 DIMM 3E
        Locator: PROC 1 DIMM 6C
        Locator: PROC 1 DIMM 5F
        Locator: PROC 2 DIMM 2A
        Locator: PROC 2 DIMM 1D
        Locator: PROC 2 DIMM 4B
        Locator: PROC 2 DIMM 3E
        Locator: PROC 2 DIMM 6C
        Locator: PROC 2 DIMM 5F

There are 12 slots, 9 of which are populated.
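To see which of those locators are actually populated, something like the following sketch may help. The function name `dimm_sizes` is made up; it assumes the usual `dmidecode -t memory` layout in which each Memory Device block has a `Size:` line before its `Locator:` line.

```shell
# Hypothetical helper: read `dmidecode -t memory` output on stdin and print
# "locator: size" for each memory device. Empty slots are typically reported
# with a size of "No Module Installed".
dimm_sizes() {
    awk -F': ' '
        /^[[:space:]]*Size:/    { size = $2 }            # remember the latest size
        /^[[:space:]]*Locator:/ { print $2 ": " size }   # "Bank Locator:" does not match
    '
}

# Usage (needs root):
#   dmidecode -t memory | dimm_sizes
```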

So how can I tell which memory module has the problem?


Supplement:

System Information
        Manufacturer: HP
        Product Name: ProLiant DL180 G6
Tanky Woo

2 Answers


Your problem DIMM is most likely the one at Locator: PROC 1 DIMM 5F

CPU#0Channel#2_DIMM#0 means:

PROC 1, 
1D,2A = Channel 0  
3E,4B = Channel 1
5F,6C = Channel 2

5F = DIMM 0
6C = DIMM 1
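As a rough sketch, the table above can be written as a small shell lookup. The function name `edac_to_locator` is made up, and two things are assumptions rather than facts from the table: that EDAC's CPU#0 corresponds to dmidecode's PROC 1, and that the first slot listed for channels 0 and 1 is DIMM 0 (only channel 2 is spelled out above).

```shell
# Hypothetical lookup for an HP ProLiant DL180 G6, built from the table above.
# Arguments: EDAC CPU number (0-based), channel, DIMM index within the channel.
edac_to_locator() {
    cpu=$1 channel=$2 dimm=$3
    case "$channel:$dimm" in
        0:0) slot=1D ;; 0:1) slot=2A ;;   # order within channel 0 is assumed
        1:0) slot=3E ;; 1:1) slot=4B ;;   # order within channel 1 is assumed
        2:0) slot=5F ;; 2:1) slot=6C ;;   # stated explicitly in the table
        *) echo "unknown channel/DIMM: $channel/$dimm" >&2; return 1 ;;
    esac
    # Assumption: EDAC counts CPUs from 0, dmidecode counts PROCs from 1.
    printf 'PROC %d DIMM %s\n' "$((cpu + 1))" "$slot"
}

edac_to_locator 0 2 0   # CPU#0Channel#2_DIMM#0 from the log line
```

For the logged error this prints `PROC 1 DIMM 5F`.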

Edit:

When asking questions, more information is always better... Having the server manufacturer and model would have simplified this:

Here's the memory diagram from the HP ProLiant DL180 G6 Quickspecs:

[Image: memory slot diagram from the HP ProLiant DL180 G6 QuickSpecs]

My suggestion that the DIMM belongs to CPU #1 is correct... But this is HP hardware. You shouldn't need to guess!

You should be using HP's management agents, since they can alert and provide platform-specific details about hardware health and status...

[root@veloce ~]# hpasmcli
HP management CLI for Linux (v2.0)
Copyright 2008 Hewlett-Packard Development Group, L.P.

--------------------------------------------------------------------------
This server ProLiant DL180 G6  , is a Proliant 100 Series Server.
NOTE: Some hpasmcli commands may not be supported on 100 series servers.
      Type 'help' to get a list of all top level commands.
--------------------------------------------------------------------------
hpasmcli> show dimm
Cartridge #:    0
Processor #:    1
Module #:       2
Present:        Yes
Form Factor:    fh
Memory Type:    5h
Size:           4096 MB
Speed:          1333 MHz
Status:         N/A

Cartridge #:    0
Processor #:    1
Module #:       1
Present:        Yes
Form Factor:    fh
Memory Type:    5h
Size:           4096 MB
Speed:          1333 MHz
Status:         N/A

Cartridge #:    0
Processor #:    1
Module #:       4
Present:        Yes
Form Factor:    fh
Memory Type:    5h
Size:           4096 MB
Speed:          1333 MHz
Status:         N/A

Cartridge #:    0
Processor #:    1
Module #:       6
Present:        Yes
Form Factor:    fh
Memory Type:    5h
Size:           4096 MB
Speed:          1333 MHz
Status:         N/A
ewwhite

I just want to add that motherboard layouts differ between vendors, so it is not possible to map physical slots to csrow + channel + rank values reliably without testing every combination.

For example, on an ASUS B450 motherboard with an AMD Ryzen CPU and dual-rank memory sticks:

dmidecode: DIMM A1 (first slot from cpu side)
edac: csrow 0 channel 1 rank 1
edac: csrow 1 channel 1 rank 3

dmidecode: DIMM A2 (second slot from cpu side)
edac: csrow 2 channel 1 rank 5
edac: csrow 3 channel 1 rank 7

dmidecode: DIMM A1 + DIMM B1 (first and third slot from cpu side)
edac: csrow 0 channel 0 rank 0
edac: csrow 0 channel 1 rank 1
edac: csrow 1 channel 0 rank 2
edac: csrow 1 channel 1 rank 3

dmidecode: DIMM A2 + DIMM B2 (second and fourth slot from cpu side)
edac: csrow 2 channel 0 rank 4
edac: csrow 2 channel 1 rank 5
edac: csrow 3 channel 0 rank 6
edac: csrow 3 channel 1 rank 7

dmidecode: DIMM A1 + DIMM A2 + DIMM B1 + DIMM B2 (all)
edac: csrow 0 channel 0 rank 0
edac: csrow 0 channel 1 rank 1
edac: csrow 1 channel 0 rank 2
edac: csrow 1 channel 1 rank 3
edac: csrow 2 channel 0 rank 4
edac: csrow 2 channel 1 rank 5
edac: csrow 3 channel 0 rank 6
edac: csrow 3 channel 1 rank 7

So we can provide the following mapping:

csrow 0 channel 0 rank 0: DIMM B1
csrow 0 channel 1 rank 1: DIMM A1
csrow 1 channel 0 rank 2: DIMM B1
csrow 1 channel 1 rank 3: DIMM A1
csrow 2 channel 0 rank 4: DIMM B2
csrow 2 channel 1 rank 5: DIMM A2
csrow 3 channel 0 rank 6: DIMM B2
csrow 3 channel 1 rank 7: DIMM A2
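Once you have worked out such a table for your board, a lookup can be scripted. This is only a sketch with a made-up function name (`b450_slot`); the rows are copied from the mapping above and apply to that one ASUS B450 setup, not to other boards.

```shell
# Hypothetical lookup for the ASUS B450 mapping above.
# Arguments: csrow, channel. Prints the dmidecode slot name, fails if unknown.
b450_slot() {
    awk -v row="$1" -v ch="$2" '
        $1 == row && $2 == ch { print $3 " " $4; found = 1 }
        END { exit !found }
    ' <<'EOF'
0 0 DIMM B1
0 1 DIMM A1
1 0 DIMM B1
1 1 DIMM A1
2 0 DIMM B2
2 1 DIMM A2
3 0 DIMM B2
3 1 DIMM A2
EOF
}

b450_slot 2 1   # csrow 2, channel 1
```

The example call prints `DIMM A2`. The rank column is left out because, in the table above, each csrow and channel pair already identifies a single slot.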

If you are using HP servers, their dedicated CLI tools may help you determine the memory mapping. Otherwise you have to check every combination.

puchu