Can Linux scrub memory?

10

6

Does Linux have a mechanism to "scrub" memory? For example, by testing the memory and marking areas as dirty if they fail, so that the system can continue to operate "safely" even with bad RAM chips installed?

Waxhead

Posted 2011-12-28T20:40:11.423

Reputation: 1 092

Answers

2

The answer is yes, and it is done transparently (provided you have ECC memory to detect errors, and your kernel version is at least 2.6.30 to continue to operate safely).

Basically, your memory is checked at every read from the processor, and scrubbed periodically*, to check for consistency with the Error Correcting Codes (ECC). If an error occurs, you get a Machine Check Exception, which is then picked up and logged by mcelog (http://www.mcelog.org/).

If your error was correctable, it increments a "leaky bucket" counter, which causes a physical memory page that fails too often to be transparently replaced by another one. That is, your memory page is copied to a new location, your virtual memory address is updated to point to the new page, and the old page is marked by the OS as not to be used anymore.

This is called "soft-offlining" on Linux (and memory page retirement on Solaris, I don't know about other OSs).
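
The "leaky bucket" accounting described above can be sketched roughly as follows. This is only an illustrative model, not mcelog's actual code: the class name `LeakyBucket`, the threshold of 10 errors and the 24-hour window are all assumptions picked for the example.

```python
from dataclasses import dataclass


@dataclass
class LeakyBucket:
    """Toy leaky-bucket error counter (hypothetical, not mcelog's code)."""
    capacity: int = 10             # assumed threshold of errors per window
    drain_seconds: float = 86400   # assumed window: bucket empties after 24 h
    count: int = 0
    last_drain: float = 0.0

    def record_error(self, now: float) -> bool:
        """Account one corrected error at time `now` (in seconds).

        Returns True when the page has failed often enough that it
        would be a candidate for soft-offlining.
        """
        # Old errors "leak" away: once a full window has passed,
        # the bucket is emptied and counting starts over.
        if now - self.last_drain >= self.drain_seconds:
            self.count = 0
            self.last_drain = now
        self.count += 1
        return self.count >= self.capacity
```

With these assumed settings, a page that produces ten corrected errors within a day trips the threshold, while a page that produces one error every other day never does.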

If your error was not correctable, however, what is called "hard-offlining" happens: your memory page is removed from normal operating system memory management, and your application is killed (NB: by a catchable SIGBUS signal that tells you where the error happened, but this is rare enough that catching it is seldom worth the effort). If your memory page is mapped from a file and clean, the OS can also reload it transparently at another physical location instead of killing the process.

You can read more in the mcelog documentation: there are plenty of configuration options, you can have other behaviours triggered, and there are further leads on what to read and on how to make sure mcelog is running on your system.


* Scrubbing, or "patrol scrubbing", consists of reading memory, checking it against the ECC for errors, and overwriting it with the corrected memory words when an error is discovered. The term patrol scrubbing is used in contrast to overwriting incorrect data when errors are found during ordinary memory reads, which is sometimes called "demand scrubbing". Scrubbing is a hardware procedure that can usually be enabled through the BIOS.

Cimbali

Posted 2011-12-28T20:40:11.423

Reputation: 153

This only applies if you have the more expensive ECC memory. – psusi – 2014-12-25T15:44:13.770

This applies to all memory with ECC, be it parity (but then you can't correct), SECDED, the more expensive Chipkill or any newer scheme. DDR1 could already implement ECC, but it all depends on which actual model you use. The "home" market has traditionally had less need for resilience, but supercomputers have been equipped with ECC memory for over 20 years; servers are in between. – Cimbali – 2014-12-25T16:01:23.313

I meant ECC memory is more expensive (than non-ECC) and so most people don't have it. – psusi – 2014-12-25T16:16:42.213

Well, "most people" is pretty vague. Whether it is common to pay the price in investment and power depends on the market, as I said. My average Dell laptop, which is 2 years old now, is equipped with it (standard, no special options asked). It's getting more and more common, because the miniaturization of features makes DIMMs more sensitive to various kinds of radiation. – Cimbali – 2014-12-25T16:30:34.817

1

Cimbali, who does the "patrol scrubbing" (on systems with ECC memory): the BIOS firmware (probably in SMM mode, transparently to the OS kernel), or the Linux kernel in some software mode (and if so, which module does the patrol scrubbing)? ECC memory does not check ECC sums by itself; to check ECC, the data must be read (and the ECC logic in the memory controller will check the sum). Some memory is read often (by normal programs on the CPU), while other memory may not be read for weeks. Patrol scrubbing will read all memory every day (Intel) or every 1-48 hours to do the ECC checking - https://electronics.stackexchange.com/q/73546#comment911379_73573

– osgx – 2018-05-25T03:01:37.343

The mcelog module is not for software-based patrol scrubbing; it just logs MCE events - http://www.mcelog.org/ "*mcelog logs and accounts machine checks (in particular memory, IO, and CPU hardware errors) on modern x86 Linux systems*"

– osgx – 2018-05-25T03:02:53.367

@osgx I believe it is the memory controller, which is already where ECC checks are done, that also performs the scan of memory. – Cimbali – 2019-11-08T15:30:36.360

7

This is actually a bad idea. Memory cannot be reliably tested in a quick sweep. This is why software like memtest86 uses multiple passes with different bit patterns to test memory. Solution:

  1. Test memory with memtest86, preferably with the long test. Leave it running overnight, as it will take a long time.

  2. If bad memory is detected, use memmap kernel parameter to force kernel not to use that memory:

   memmap=nn[KMG]$ss[KMG]
            [KNL,ACPI] Mark specific memory as reserved.
            Region of memory to be used, from ss to ss+nn.
            Example: Exclude memory from 0x18690000-0x1869ffff
                     memmap=64K$0x18690000
                     or
                     memmap=0x10000$0x18690000
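
As a rough illustration of step 2, the helper below turns a bad address range reported by a memory tester into a `memmap=nn$ss` argument. The function name `memmap_reserve_param` is hypothetical, and the 4 KiB page size used to round the range outwards is an assumption (typical on x86, but not universal).

```python
PAGE_SIZE = 4096  # assumed page size; typical on x86


def memmap_reserve_param(first_bad: int, last_bad: int,
                         page_size: int = PAGE_SIZE) -> str:
    """Build a memmap=nn$ss argument covering [first_bad, last_bad].

    The range is widened to whole pages so that every byte of the
    failing region is included in the reserved area.
    """
    if last_bad < first_bad:
        raise ValueError("last_bad must not be below first_bad")
    start = first_bad - (first_bad % page_size)      # round start down
    end = (last_bad // page_size + 1) * page_size    # round end up (exclusive)
    return f"memmap={end - start:#x}${start:#x}"


# The range from the kernel documentation example above:
print(memmap_reserve_param(0x18690000, 0x1869FFFF))
```

For the documented range this prints `memmap=0x10000$0x18690000`, matching the second form in the excerpt. Depending on the bootloader, the `$` may need to be escaped when the argument is written into its configuration file.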

In addition, you can use ECC memory, which corrects 1-bit errors and detects 2-bit errors in your memory automatically (and you'll get log messages from the kernel about uncorrectable memory problems if they happen).

haimg

Posted 2011-12-28T20:40:11.423

Reputation: 19 503

1

I think it could be interesting to include a reference to the badram kernel module here. It uses memtest86 as you propose, but instead of keeping the kernel from seeing the bad memory, it hands those pages to the kernel to hold without using them, effectively guaranteeing that neither the kernel nor your applications run into that memory.

– Cimbali – 2015-08-09T11:06:56.660

Thanks for the tip on those kernel parameters. Could you please clarify why this is such a bad idea, and why you can't check a chunk of memory using the same methods as memtest86(+)? I am aware that more reliable testing requires more CPU time (and probably bigger chunks of RAM in one go as well), but why would this have to be a show stopper? CPU time may not be a problem if spread over a long enough period, and besides, multi-CPU systems are becoming more and more mainstream. – Waxhead – 2011-12-28T21:32:30.290

Well, technically, if done over a long enough period of time, this may be possible. But the bottleneck here is not the CPU(s) but the memory bus, and of course you "poison" your CPU's memory cache. I'm not aware of such a kernel module, and the idea looks very fragile to me (orchestrating repeated pattern writing to an arbitrary region of memory on a live system, etc.) – haimg – 2011-12-28T21:50:09.523

haimg: a question: will the VFS manage paging for this reserved memory? I think it cannot, as the memory won't be visible to it. – Jay D – 2012-09-19T22:30:06.767

@JayD: That reserved physical memory is off-limits to the kernel and is not touched by the kernel or anything else. – haimg – 2012-09-20T03:35:54.377

@Waxhead Memory scrubbing is usually done at the BIOS level using hardware. If it is supported, you should find options for patrol scrubbing and demand scrubbing. If memory integrity is important to you, which it surely is if you're using ECC memory, then the smallish performance hit incurred by enabling these options is worthwhile. – Ian – 2013-09-23T16:49:17.573

If you don't have this enabled in the BIOS it seems that you might have to roll your own solution, see http://www.gossamer-threads.com/lists/linux/kernel/1368453 and follow the links for more information. If you're /not/ using ECC memory then you can't perform memory scrubbing at all.

– Ian – 2013-09-23T16:49:17.573

2

The post and answer misunderstand the issue. Memory scrubbing is intended to keep correctable single-bit errors from turning into uncorrectable double-bit errors. The scrubber merely reads all physical memory (forcing cache misses to do so) occasionally. If there are any single-bit errors, they will be corrected (and the correction must rewrite the correct value using a compare-and-swap), thus clearing the error.

Otherwise, if a second error occurs in a word which already has one error, the entire word will be uncorrectible and the OS will have to do something drastic.

Scrubbing is important because without it, memory which is read but not written (like code pages) may accumulate errors over time.
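
To make the single-error versus double-error distinction concrete, here is a toy SECDED code together with a "patrol" pass over a list of words. It is only a sketch: real memory controllers do this in hardware on much wider words, and the 8-bit extended Hamming code and the names `encode`, `decode` and `scrub` are choices made for this illustration.

```python
from __future__ import annotations


def encode(nibble: int) -> list[int]:
    """Encode 4 data bits as an 8-bit SECDED word (extended Hamming code)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8                       # index 0: overall parity, 1..7: Hamming(7,4)
    word[3], word[5], word[6], word[7] = d
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2          # make the parity of the whole word even
    return word


def decode(word: list[int]) -> tuple[str, int | None]:
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos              # XOR of set positions is 0 for a valid word
    parity_ok = sum(word) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        word[syndrome] ^= 1              # single flip: the syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None     # even parity but bad syndrome: two flips
    return status, word[3] | word[5] << 1 | word[6] << 2 | word[7] << 3


def scrub(memory: list[list[int]]) -> int:
    """One patrol pass: rewrite every word that needed correction."""
    repaired = 0
    for i, word in enumerate(memory):
        status, nibble = decode(word)
        if status == "corrected":
            memory[i] = encode(nibble)   # write the clean word back
            repaired += 1
    return repaired
```

If one bit of a stored word flips and `scrub` runs before a second flip, the word is rewritten clean and a later flip is again a correctable single error. Without the scrub pass, the second flip leaves the word in the "uncorrectable" state.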

Larry Stewart

Posted 2011-12-28T20:40:11.423

Reputation: 21

Why do you think the answer misunderstood the issue when it has been marked as the answer? – Dave – 2013-06-13T14:35:30.587

Notwithstanding Dave's reply, Larry is quite correct: the answer /does/ misunderstand the question. The question asks whether Linux can do a memory scrub, which is used, as Larry carefully explains, to prevent single-bit errors detected and corrected by ECC hardware from turning into uncorrectable 2-bit errors. The answer talks about how to detect those errors in the first place using a software application. – Ian – 2013-09-23T16:49:17.573

I think you misunderstand the purpose here. You are of course correct in your description of scrubbing. However, if you run, for example, a (non-critical) file server on non-ECC RAM and have CPU cycles to spare, it sounds like a good idea to be able to detect corrupt memory sooner or later, flag it as bad and know about it, rather than to be unaware of a bad memory chip. Perhaps a better wording would be memory validation / verification. Not technically scrubbing perhaps, but still a viable way of reducing the damage done by potentially bad memory. – Waxhead – 2014-02-19T22:31:19.420

1

If you have ECC memory you may want to have a closer look at https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-edac. (I found "sdram_scrub_rate" especially interesting.)

(If this link breaks at some point (it really shouldn't), I'd suggest downloading the appropriate Linux documentation and searching for "scrub".)
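
As a small sketch of how one might look at that attribute, the function below reads `sdram_scrub_rate` for every memory controller the EDAC subsystem exposes. It assumes the sysfs layout described in the linked document (`/sys/devices/system/edac/mc/mc*/sdram_scrub_rate`) and that the value is a plain integer; the function name `read_scrub_rates` is made up for this example, and not every EDAC driver supports the attribute.

```python
from __future__ import annotations

from pathlib import Path

# Location taken from the sysfs-devices-edac document; present only when
# an EDAC driver for the memory controller is loaded.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_scrub_rates(root: Path = EDAC_MC_ROOT) -> dict[str, int]:
    """Return {controller name: scrub rate} for each mc*/sdram_scrub_rate."""
    rates: dict[str, int] = {}
    for attr in sorted(root.glob("mc*/sdram_scrub_rate")):
        try:
            rates[attr.parent.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            # Attribute unreadable or not a plain number on this driver: skip it.
            continue
    return rates
```

Calling `read_scrub_rates()` on a machine without EDAC support simply returns an empty dictionary. Passing a different `root` makes it possible to try the function against a copy of the directory tree.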

Kai

Posted 2011-12-28T20:40:11.423

Reputation: 33