33

Given that many server-class systems are equipped with ECC RAM, is it necessary or useful to burn in memory DIMMs prior to deployment?

I've encountered an environment where all server RAM is put through a lengthy burn-in/stress-testing process. This has delayed system deployments on occasion and adds to hardware lead times.

The server hardware is primarily Supermicro, so the RAM is sourced from a variety of vendors rather than coming directly from the system manufacturer, as it would with a Dell PowerEdge or HP ProLiant.

Is this a useful exercise? In my past experience, I've simply used vendor RAM out of the box. Shouldn't the POST memory tests catch DOA memory? I've responded to ECC errors long before a DIMM actually failed, since crossing the ECC error thresholds was usually the trigger for a warranty replacement.
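
For reference, that kind of monitoring can be as simple as polling the corrected/uncorrected error counters the kernel exposes. A minimal sketch, assuming a Linux host with the EDAC driver loaded (sysfs paths and sensible thresholds vary by kernel, platform, and vendor):

```python
#!/usr/bin/env python3
"""Poll Linux EDAC counters for corrected/uncorrected ECC errors.

Assumes the EDAC kernel driver is loaded; sysfs paths can vary by
kernel and platform. The warning threshold is purely hypothetical.
"""
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")
CE_WARN_THRESHOLD = 10  # hypothetical corrected-error count that triggers an RMA

def read_count(path: Path) -> int:
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return 0

for mc in sorted(EDAC_ROOT.glob("mc[0-9]*")):
    ce = read_count(mc / "ce_count")  # corrected errors: ECC fixed them silently
    ue = read_count(mc / "ue_count")  # uncorrected errors: data was at risk
    flag = "  <-- investigate / RMA DIMM" if ce > CE_WARN_THRESHOLD or ue > 0 else ""
    print(f"{mc.name}: corrected={ce} uncorrected={ue}{flag}")
```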

  • Do you burn-in your RAM?
  • If so, what method(s) do you use to perform the tests?
  • Has it identified any problems ahead of deployment?
  • Has the burn-in process resulted in any additional platform stability versus not performing that step?
  • What do you do when adding RAM to an existing running server?
ewwhite

8 Answers

31

No.

The goal of burning in hardware is to stress it to the point of catalyzing a failure in a component.

Doing this with mechanical hard drives will get some results, but it's just not going to do a lot for RAM. The nature of the component is such that environmental factors and age are far more likely to be the cause of failures than reading and writing to the RAM (even at its maximum bandwidth for a few hours or days) would ever be.

Assuming your RAM is high enough quality that the solder won't melt the first time you really start to use it, a burn-in process won't help you find defects.

Shane Madden
26

I found a document by Kingston detailing how they handle server memory; I believe this process is broadly the same for most major manufacturers. Memory chips, like all semiconductor devices, follow a particular reliability/failure pattern known as the Bathtub Curve:

[Figure: the Bathtub Curve – failure rate over time, showing the Early Life, Useful Life, and End-of-Life periods]

Time is represented on the horizontal axis, beginning with the factory shipment and continuing through three distinct time periods:

  • Early Life Failures: Most failures occur during the early usage period. However, as time goes on, the number of failures diminishes quickly. The Early Life Failure period, shown in yellow, is approximately 3 months.

  • Useful Life: During this period, failures are extremely rare. The useful life period is shown in blue and is estimated to be 20+ years.

  • End-of-Life Failures: Eventually, semiconductor products wear out and fail. The End-of-Life period is shown in green.
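
(As an aside, not from the Kingston document: bathtub-shaped failure curves like this are conventionally modeled with a Weibull hazard rate, h(t) = (β/η)·(t/η)^(β−1), where a shape parameter β < 1 gives the falling early-life failure rate, β ≈ 1 the roughly constant useful-life rate, and β > 1 the rising wear-out rate.)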

Kingston noted that most failures occur during the first three months; after those three months, a module is considered good until its end of life, roughly 15–20 years later. So they designed a test rig called the KT2400, which brutally tests server memory modules for 24 hours at 100 degrees Celsius and elevated voltage, during which every cell of every DRAM chip is continuously exercised. This level of stress testing effectively ages the modules by at least three months, pushing them through the critical period in which most failures show up.

The results were:

In March 2004, Kingston began a six-month trial in which 100 percent of its server memory was tested in the KT2400. Results were closely monitored to measure the change in failures. In September 2004, after all the test data was compiled and analyzed, results showed that failures were reduced by 90 percent. These results exceeded expectations and represent a significant improvement for a product line that was already at the top of its class.

So why is burning in server memory not useful? Simply because it has already been done by your manufacturer!

jlliagre
Lucas Kauffman
  • 10
    The chip manufacturer, and maybe even the server vendor, might test *some* chips. But most components are only sample-tested these days to reduce cost. Even if your chips or whole DIMMs were once tested, that doesn't tell you whether the contacts or PCB were somehow tweaked or messed up during assembly or shipping. We've had a MemTest86 burn-in find problems with memory from two different servers, out-of-the-box from two different "tier 1" server vendors. If they had made it to production, ECC might have saved us, but silent database corruption could also have been the result. – rmalayter Jun 25 '13 at 21:50
  • 7
    This bathtub curve is not just for semiconductors. Most components built with any degree of quality control follow it: hard drives, SSDs, power supplies (mainly because of capacitors), fans, etc. – voretaq7 Jun 25 '13 at 21:51
  • 6
    This is one of the reasons I never buy extended warranties on electronics. The device (or component) is either going to fail in the first few months or will last the rest of its lifetime. This also demonstrates why it is so important to weed out the bad apples early so that you can get to the smooth sailing as soon as possible. – Atari911 Jun 27 '13 at 16:11
  • @rmalayter So you would advocate burning in the RAM anyway? – ewwhite Jun 29 '13 at 15:11
  • 3
    @ewwhite Yes, I would test. It only takes a few hours or so to boot memtest86 and let it check 384 GB of RAM. We burn in all storage subsystems as well, using IOmeter, for the same reason. We've had several RAID controllers or drives die on us during burn-in over the last several years, even though they initially worked fine during the OS install. Sometimes it was a bad firmware thing, sometimes faulty cache RAM on the RAID controller, sometimes it was "who knows - RMA it!" – rmalayter Jul 11 '13 at 14:15
15

We buy blades, and we generally buy a reasonably large block of them at a time; as such, we get them in and install them DAYS before our network ports are ready/secured. So we use that time to run memtest for around 24 hours, sometimes longer if it runs over a weekend. Once that's done we spray down a basic ESXi install and IP it, ready for its host profile to be applied once the network's up. So yeah, we test it, more out of opportunity than necessity, but it's caught a few DOA DIMMs before now, and it's not me physically doing it, so it takes me no effort. I'm for it.

Chopper3
  • 3
    A "Test of Opportunity" makes sense -- given the chance I'd do it. If it's going to delay deployments I can risk a bad DIMM and an ECC light :-) – voretaq7 Jun 25 '13 at 17:26
  • 2
    If you build the test into the deployment plan then you've bought yourself the time, if you just do everything as fast as you can you're setting yourself up for criticism at a later date. Strong-arm management whenever you can :) – Chopper3 Jun 25 '13 at 17:30
  • @Chopper3 So if you were establishing a policy, *do it always?*, *do it never?* or *do it when you can?*. – ewwhite Jun 29 '13 at 15:12
  • @ewwhite - I'd say the latter, though we tend to engineer that into the standard deployment plan, so it's highly likely each time. – Chopper3 Jun 29 '13 at 15:24
11

Well, I guess it depends on exactly what your process is. I ALWAYS run MemTest86 on memory before I put it in a system (server or otherwise). After you have a system up and running, problems caused by faulty memory can be hard to troubleshoot.

As for actually "stress-testing" the memory, I have yet to see why this would be useful unless you are testing for overclocking purposes.

Atari911
  • What does MemTest86 tell you? Have you found RAM issues prior to installing it in a server using this method? – ewwhite Jun 25 '13 at 11:50
  • 4
    I've found a lot of errors with MemTest86+ that the BIOS and Windows memory diagnostics won't find. I highly recommend it. Yes, ECC will find the same errors, but a memtest will help you find them all ahead of time. – Owen Johnson Jun 25 '13 at 15:58
  • 6
    MemTest will let you know if there are any flaws in the internals of the memory. It does this by storing patterns of bytes, as well as random sets of bytes, in the memory in an attempt to trigger an error. The program can run a single "pass" to tell you whether the memory is good, but I generally run multiple passes overnight just to make sure. The nice thing about MemTest is that it tells me if the memory is bad before I deploy the system. It has triggered an RMA many times and saved me a lot of headaches. Once the machine is deployed it's a pain in the @ss to RMA the memory. – Atari911 Jun 25 '13 at 16:03
  • 2
    @OwenJohnson Generally when you run MemTest86(+) you're hoping to trigger those ECC errors before you put the machine into production :-) – voretaq7 Jun 25 '13 at 17:00
6

I don't, but I've seen people who do. I never saw them gain anything from it, though; I think it might be a hangover from the past, or perhaps superstition.

Personally, I'm like you in that the ECC error rates are more useful to me, assuming the RAM isn't DOA, but then you'd know that anyhow.

Sirex
6

For non-ECC RAM, running 30 minutes of memtest86+ is useful, as there is usually no reliable method of detecting bit errors while the system is running.
(Blue-screening is not considered a reliable method...)
Slightly flaky RAM often doesn't show up immediately, only after the system has seen some full-memory load, and then only if the data in that RAM was code that got used and then crashed. Data corruption can go unnoticed for long periods of time.
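
To illustrate the idea behind such a tester (a toy userspace sketch only: a normal process can only touch the pages the OS happens to give it, which is exactly why the real tool boots on bare metal):

```python
"""Toy illustration of pattern-based RAM testing (the idea behind memtest86+).

A userspace process can only exercise the pages the OS maps for it, not
arbitrary physical addresses, so treat this purely as a sketch of the
write-pattern / read-back idea.
"""
import random

CHUNK_MB = 64                         # size of the buffer to exercise
PATTERNS = [0x00, 0xFF, 0x55, 0xAA]   # classic alternating-bit patterns

def test_buffer(size_bytes: int) -> int:
    errors = 0
    buf = bytearray(size_bytes)
    for pattern in PATTERNS + [random.randrange(256)]:
        buf[:] = bytes([pattern]) * size_bytes     # write the pattern everywhere
        if buf.count(pattern) != size_bytes:       # read back and verify
            errors += 1
            print(f"mismatch while verifying pattern 0x{pattern:02X}")
    return errors

if __name__ == "__main__":
    bad = test_buffer(CHUNK_MB * 1024 * 1024)
    print("no errors detected" if bad == 0 else f"{bad} pattern(s) failed")
```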

For ECC RAM it won't do anything the memory controller itself isn't already doing, so it really doesn't make sense. It's just a waste of time.

In my experience, people who insist on burn-in are usually old guys who have always done it like this and keep doing it out of habit, without really thinking things through.
Or they are young guys following a prescribed procedure written by those old guys.

Tonny
  • Bad knowledge, handed down across generations? – ewwhite Sep 26 '13 at 23:02
  • @ewwhite Yes, as far as I know. And I have a BSc in computer hardware technology, so I'm supposed to know what I'm talking about :-) – Tonny Sep 27 '13 at 09:41
  • Except for all the incidents of people who actually found errors, as shown in this thread. Also, if it's not obvious, there is a difference between getting parts swapped before taking a server into production and replacing RAM on a DB server that runs 24x7. Unless we pretend it's a "grown error" and everyone else is just old and doing cargo-cult stuff, it's still going to cause losses to have a prod server offline. – Florian Heigl Mar 16 '14 at 17:00
  • 1
    @FlorianHeigl I don't advocate burning in RAM for the sake of it, but I will never endorse putting a server into production without it being stress-tested for at least 24 hours. RAM is usually not the problem. Flaky HDDs, RAID controllers, IPMI cards, power supplies, CPUs, VRMs... I have seen it all. (And often the server survives the initial install just fine. It's the load and/or heat that does it when it has to really work.) – Tonny Mar 16 '14 at 19:55
3

It depends.

If you are deploying 50,000 new DIMMs, and you know that this particular hardware has a 0.01% failure rate within the first day of operation, then statistically speaking several of them are going to fail on their first day. Burn-in is meant to catch that. With deployments at that scale, failure is expected, not an exceptional situation.

If you're deploying only a couple of hundred items, though, the statistics are most likely on your side: you would have to be quite unlucky to get a failed part.
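
To put rough numbers on that (using the 0.01% first-day failure rate above purely as an assumption):

```python
# Rough arithmetic for the argument above, assuming a 0.01% first-day failure rate.
p_fail = 0.0001  # assumed probability that a given DIMM fails within its first day

for n_dimms in (200, 50_000):
    expected = n_dimms * p_fail                # expected number of first-day failures
    p_any = 1 - (1 - p_fail) ** n_dimms        # chance of at least one failure
    print(f"{n_dimms} DIMMs: expect {expected:.2f} failures, "
          f"P(at least one) = {p_any:.1%}")

# Prints roughly:
#   200 DIMMs: expect 0.02 failures, P(at least one) = 2.0%
#   50000 DIMMs: expect 5.00 failures, P(at least one) = 99.3%
```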

Lie Ryan
  • You've got a point. But let's face it, most of us will never do deployments that big. (Unless you are building a new Google data-center.) Most of us typically deploy at most 5 to 10 servers at the same time. The biggest one I personally ever did was 16 ESX nodes (4x 4-node clusters), each of which took 8 DIMMs. That was 3 years ago and since then 1 DIMM has failed (2 months ago). I had to replace 5 power supplies on those same machines, the first one after just a week. But as these are HP ProLiants we sort of expected that. (HP and power supplies... Don't get me started...) – Tonny Jun 26 '13 at 18:55
1

For one server, it's potentially a waste of time, depending on the context.

But if you install 2,000 servers at a time and you don't do a proper stress test, you are almost certain to find at least one server that behaves badly. And it's not only RAM; it applies to the network, CPUs, hard drives, etc. A stress test is also a good thing when you replace a DIMM, just to be sure the right DIMM was replaced (sometimes it isn't you who replaces it): running a stress test will tell you whether it's actually fixed.

In my experience on large-scale clusters, HPL (High-Performance Linpack) is a good tool for getting an idea of whether you have DIMM errors. Single-node HPL runs are enough, though larger runs can help too. If the system behaves as expected and doesn't throw MCE errors (which Linux catches and records in the logs), then you're good!
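
As a sketch of that final log check, assuming a Linux host where MCE/EDAC events land in the kernel ring buffer (a production setup would more likely rely on mcelog or rasdaemon):

```python
"""Scan the kernel ring buffer for MCE/EDAC messages after a stress run.

Assumes a Linux host and that `dmesg` is readable by the current user;
real deployments would typically use mcelog or rasdaemon instead.
"""
import re
import subprocess

SUSPECT = re.compile(r"machine check|mce:|edac|hardware error", re.IGNORECASE)

def scan_dmesg() -> list[str]:
    out = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if SUSPECT.search(line)]

if __name__ == "__main__":
    hits = scan_dmesg()
    if not hits:
        print("no MCE/EDAC messages found -- stress run looks clean")
    else:
        print(f"{len(hits)} suspicious kernel message(s):")
        for line in hits:
            print(" ", line)
```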