
One of the common server failure scenarios is bad DRAM, sometimes even when ECC memory is used.

memtest86+ is one of the most useful tools for diagnosing DRAM problems. Since it loads itself at the start of memory, I've been wondering whether memtest86+ checks the part of memory that it is loaded into.

Is the memory occupied by memtest86+ so small that it doesn't matter, or could memtest86+ miss a defect in the DRAM because it can't test the memory locations it resides in?

HopelessN00b
Robin
  • While this question is relevant for a server, it's also relevant for an ordinary PC, so I've voted to move this question to [Super User](http://superuser.com/) where it can reach more people. – Cristian Ciupitu Jan 21 '16 at 15:53

2 Answers


Obviously, memtest86+ cannot test the memory region that currently contains the memtest86+ executable code (though if there are memory errors in that region, the test itself will very likely crash). However, memtest86+ can relocate its own code to a different address at runtime, and with this trick it can test all memory that the firmware (BIOS) allows it to use, just not all at once.

This code relocation is described in README.background inside the memtest86+ source code archive. The file is slightly out of date: for example, it states that the addresses used for memtest86+ code are 0x2000 and 0x200000, but the low address defined in the source is actually 0x10000, and the high address is either 0x2000000 or 0x300000 depending on the amount of memory in the machine.

But even with this relocation trick, memtest86+ cannot test all memory, for the following reasons:

  • Usually the firmware (BIOS) reserves some RAM regions for its own use (e.g., ACPI tables). While the CPU can access these regions, writing anything into them can result in unpredictable behavior.

  • Some RAM is used by System Management Mode (SMM) and is not accessible to the CPU at all outside of privileged SMM code.

  • The RAM address range between 640K and 1M is inaccessible due to quirks of the legacy PC memory layout (some of this RAM may be used as a shadow for BIOS ROM and for SMM, other parts may be completely inaccessible).
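The reasons above amount to simple bookkeeping over a memory map. The sketch below is a toy model, not how memtest86+ is implemented: it assumes the firmware hands the tester a list of address ranges tagged by type, uses a made-up map (`FIRMWARE_MAP` and its values are hypothetical), and treats only ranges tagged "usable" as testable. Everything else, including addresses the map does not mention at all (here part of the 640K to 1M window), counts as skipped.

```python
"""Toy accounting of RAM a tester must skip, from a made-up firmware map."""

# Hypothetical firmware memory map: (start, end_exclusive, kind).
# Real maps differ from machine to machine; these values are invented.
FIRMWARE_MAP = [
    (0x00000000, 0x0009FC00, "usable"),
    (0x0009FC00, 0x000A0000, "reserved"),  # firmware data just below 640K
    (0x000F0000, 0x00100000, "reserved"),  # BIOS ROM shadow below 1M
    (0x00100000, 0x7FEE0000, "usable"),
    (0x7FEE0000, 0x7FF00000, "acpi"),      # ACPI tables
    (0x7FF00000, 0x80000000, "reserved"),  # e.g. RAM set aside for SMM
]


def summarize(fw_map):
    """Split a non-overlapping map into testable ranges and skipped bytes."""
    usable: list[tuple[int, int]] = []
    skipped: dict[str, int] = {}
    cursor = 0
    for start, end, kind in sorted(fw_map):
        if start > cursor:
            # A hole the map does not describe at all (here 640K to 960K).
            skipped["unmapped"] = skipped.get("unmapped", 0) + (start - cursor)
        if kind == "usable":
            usable.append((start, end))
        else:
            # Not plain RAM the tester may overwrite, so leave it alone.
            skipped[kind] = skipped.get(kind, 0) + (end - start)
        cursor = max(cursor, end)
    return usable, skipped


if __name__ == "__main__":
    usable, skipped = summarize(FIRMWARE_MAP)
    for kind, size in skipped.items():
        print(f"{kind:>8}: {size // 1024} KiB not tested")
```

Even with perfect relocation, a tester following this rule never writes to the reserved, ACPI, or unmapped ranges, so a defect there would go unnoticed.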

Sergey Vlasov
  • Interesting, I missed its relocation capability. Obviously SMM and the like are out of range (apart from specific BIOS support). – shodanshok Jan 21 '16 at 13:29
  • Don't those mapped regions generally exclude the DRAM, since something else "off module" (ROM and peripheral devices, say) is being addressed? – mckenzm Jan 22 '16 at 01:13
  • If you have several RAM modules, perform a second test after having swapped them... – JFL Jan 22 '16 at 12:32
  • Is it possible to have memory fail in just the right way to have memtest incorrectly report success due to having its instructions rewritten? Or rather, how many faults does it take? – John Dvorak Jan 22 '16 at 14:44
  • @JanDvorak: In theory, it's possible, of course. In practice, I'd say it's only slightly more likely than banging your head on the keyboard and randomly typing out a Shakespearean sonnet. – Ilmari Karonen Jan 22 '16 at 20:17

No, memtest can't test its own memory. However, it is so small (only a few KB) that it hardly matters. EDIT: this statement is wrong since, as stated in the accepted answer, memtest can dynamically relocate itself to test all user-addressable memory.

--

In theory, modern processors can, at boot time, configure part of their cache as directly addressable memory (cache-as-RAM), from which very small programs (such as memtest) can run without touching DRAM at all.

However, this is a model-specific feature (which requires BIOS support), and I don't think memtest uses it.

shodanshok
  • Thank you for your answer. `memtest` is testing the CPU cache as well. So if `memtest` were loaded into this cache, this part of the cache couldn't be tested, which would be more problematic, because the cache is much smaller than the memory? – Robin Jan 21 '16 at 10:11
  • Based on [memtest86 documentation](http://www.memtest86.com/technical.htm) it does **not** test the processor cache, at least not directly. Moreover, modern processors have separate instruction and data caches (I$ and D$). Executable code is loaded into the instruction cache and can't be directly modified or overwritten. – shodanshok Jan 21 '16 at 10:13
  • memtest86+ definitely tests the CPU's data cache, but that doesn't matter for this question. Thank you again for your answer. – Robin Jan 21 '16 at 10:23
  • Are you sure about this? I thought it copied itself somewhere else while testing the memory it normally lives in. That's why every test has a slow part (most of memory) and a really fast part (the tiny bit where its code/data is stored). – Peter Cordes Jan 21 '16 at 11:58
  • @Robin any reference on "memtest86+ definitely tests the CPUs data cache"? [Wikipedia](https://en.wikipedia.org/wiki/Memtest86) doesn't seem to agree: "access patterns are designed to keep most cache organizations flushed so that memory accesses are actually seen to the RAM" – Dmitry Grigoryev Jan 21 '16 at 15:04
  • As mentioned in Sergey's answer, memtest86+ relocates itself, so it effectively does test its own memory. I think this answer is misleading. – Nate Eldredge Jan 21 '16 at 18:10
  • @DmitryGrigoryev: memtest86+ tests L1, L2 and L3 caches of a CPU, which is the data cache, right? – Robin Jan 21 '16 at 19:13
  • @NateEldredge true. I edited it to point to the correct answer. – shodanshok Jan 21 '16 at 21:37
  • @Robin by reference I mean a link to some kind of source which confirms what you say. So far, I've only seen Wikipedia which says it doesn't – Dmitry Grigoryev Jan 21 '16 at 21:51
  • @DmitryGrigoryev: http://storage5.static.itmages.com/i/16/0122/h_1453462950_5192344_b15635706a.png – Robin Jan 22 '16 at 11:42
  • @Robin I bet info on that screenshot comes from [`cpuid`](http://stackoverflow.com/questions/14283171/how-to-receive-l1-l2-l3-cache-size-using-cpuid-instruction-in-x86). Testing the cache would mean you write a pattern in each cache line and read it back. – Dmitry Grigoryev Jan 22 '16 at 11:49
  • @DmitryGrigoryev: Ah okay.. so I've learnt something more :-) Cool thanks! – Robin Jan 22 '16 at 11:50
  • @shodanshok, for what it's worth, [__memtest86__](http://www.memtest86.com/) and [__memtest86+__](http://www.memtest.org/) are different programs. – Cristian Ciupitu Jan 22 '16 at 15:52