
I'm looking for a way to force an ECC error in a DRAM DIMM to test some code associated with recovering from these errors. I believe Intel makes a test jig for several thousand dollars, but I'm looking for something a bit cheaper.

I've tried attaching a beta emitter (strontium-90, 0.01 µCi) to the DIMM to force a "bit flip" in the hardware. After two weeks of running, no ECC errors have been reported.

My next step is to either buy a stronger emitter... or see if anyone else has solved this some other way.

Question: Has anyone found a way to force ECC failures in a DIMM for test purposes (other than finding a failed DIMM and using that, which was our old technique until the DIMM gave up the ghost completely)?

albiglan
  • You're going to need something a bit stronger than that. Pop up to Fermilab and see what they've got. :) – Michael Hampton Mar 08 '16 at 00:51
  • Indeed.... I figure I need something that can pump out something in the "many MeV" range of beta particles, or... exceed my fine motor skills with a Dremel to remove the packaging then switch to an alpha emitter. In any case, I was hoping someone might suggest a "non-nuclear option" :-) – albiglan Mar 08 '16 at 13:47
  • I'm voting to close this question as off-topic because this is not a system administration question. It is suitable for migration to another SE site, though I am not entirely sure which one would be appropriate. – kasperd Mar 08 '16 at 13:55
  • I'm okay either way on closing it but indeed suffered from "which site works best" and this was a "best fit" so I suspect if this is off topic here, then it will simply be closed and I'll look elsewhere for answers. My argument for keeping it here is that SysAdmins may be interested in the testing/validation aspect of this, even if this is a corner case. That is- it relates to the SysAdmin function, even if 99% of SysAdmins will never perform this themselves. – albiglan Mar 08 '16 at 15:39
  • Do these [error injection](https://www.kernel.org/doc/Documentation/edac.txt "Linux: Documentation/edac.txt") examples help? Or do you need real hardware faults? – ckujau Mar 10 '16 at 07:48
  • Thanks for the link. We looked at that and it won't work. Unfortunately, we need to inject into HW :-( – albiglan Mar 10 '16 at 22:56
  • Would it be possible to simulate ECC errors with a virtualisation layer of some kind? Just a thought – Molomby Mar 11 '16 at 00:21
  • @Molomby it would. Fault injection for virtual machines is a research discipline in CS. Some decent work has been published over the course of the years. – the-wabbit Mar 11 '16 at 07:48

1 Answer


The issue was resolved by adding wires to a single DIMM (destroying it for normal use) and using it to generate random ECC errors, which allowed us to test the system.
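
For anyone trying something similar, here is a minimal sketch of how the reporting side might be watched on Linux while a modified DIMM like this is installed. It assumes the kernel's EDAC driver for the memory controller is loaded and exposes per-controller `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc`. The helper names `read_edac_counts` and `new_errors` are illustrative, and this is not necessarily how the test described above was run.

```python
from pathlib import Path

# Assumption: standard Linux EDAC sysfs location; empty or absent if no EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for every mc* directory under root."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


def new_errors(before, after):
    """Per-controller increase in corrected/uncorrected counts between two snapshots."""
    zero = {"ce": 0, "ue": 0}
    delta = {}
    for name, now in after.items():
        prev = before.get(name, zero)
        delta[name] = {key: now[key] - prev[key] for key in ("ce", "ue")}
    return delta


if __name__ == "__main__":
    # One-shot snapshot; prints an empty dict on machines without EDAC support.
    print(read_edac_counts())
```

Taking one snapshot before triggering the fault and one after, then passing both to `new_errors`, shows whether the hardware error actually reached the OS counters.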

albiglan