
I have a Dell PowerEdge R710 with dual Intel Xeon E5503 CPUs. It has 96GB (12x8GB) of ECC DIMMs. In its BIOS, memory is configured for "Advanced ECC".

My question: if my DIMMs are already ECC, does it make sense to enable this "Advanced ECC" mode in the BIOS, or should I switch to "Optimized"?

Dell describes these modes as such:

Advanced ECC Mode This mode uses two MCHs and “ties” them together to emulate a 128-bit data bus DIMM. This is primarily used to achieve a Single Device Data Correction (SDDC) for DIMMs based on x8 DRAM technology. SDDC is supported with x4 based DIMMs in every memory mode. One MCH is completely un-utilized, and any memory installed in this channel will generate a warning message during POST.

Memory Optimized Mode In this mode, the MCHs run independently of each other; for example, one can be idle, one can be performing a write operation, and the other can be preparing for a read operation. Memory may be installed in one, two, or three channels. To fully realize the performance benefit of the memory optimized mode, all three channels per CPU should be populated. This implies that some ‘atypical’ memory configurations, such as 3GB, 6GB, or 12GB, will yield the best performance. This is the recommended mode unless specific RAS features are needed.

Dell PowerEdge R710 Systems Hardware Owner's Manual (PDF)

Jon
Mxx

1 Answer


It does make a difference, but it only makes sense if you require the RAS (Reliability, Availability, and Serviceability) features on x4 or x8 devices and understand the trade-offs for your needs. More details are in the Dell white paper Dell™ PowerEdge™ Servers 2009 - Memory.

Also, configuration and layout details specific to the R710 are available in the Technical Guidebook for the PowerEdge R710 (Google this because I don't have the reputation to post the link).

The important point is the difference between the ECC provided by the DIMMs and the "Advanced ECC" that Dell's BIOS provides for Single Device Data Correction (SDDC). Both carry a performance impact. Standard ECC recovers from small errors, typically a single flipped bit in a word. SDDC goes a step further and organizes the bits so that an entire chip can fail and the data is still recoverable. For an example and details, see SDDC E7500 Chipset.

The issue is whether performance or reliability is of the utmost concern in your specific usage of the machine. If a chip failure would cause a loss of critical data or service on this machine and the implementation is non-redundant, Advanced ECC may be a great way to go. However, it comes with a performance impact, and performance may be more important to you.
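To get a rough feel for the upper bound of that trade-off, here is a minimal sketch comparing peak theoretical memory bandwidth per CPU with three channels active (Optimized) versus two (Advanced ECC, per Dell's description of one unused channel). The DDR3-800 transfer rate and the function name are assumptions for illustration; real workloads rarely run at peak bandwidth, so the measured difference can be far smaller than this ratio suggests.

```python
# Assumption: DDR3-800 (800 mega-transfers per second) on a 64-bit (8-byte) channel.
ASSUMED_MT_S = 800


def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_bytes=8):
    """Peak theoretical bandwidth in GB/s for the given number of active channels."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


optimized = peak_bandwidth_gb_s(3, ASSUMED_MT_S)  # all three channels active
advanced = peak_bandwidth_gb_s(2, ASSUMED_MT_S)   # one channel left unused
ratio = advanced / optimized

print(f"Optimized:    {optimized:.1f} GB/s per CPU")
print(f"Advanced ECC: {advanced:.1f} GB/s per CPU")
print(f"Advanced ECC keeps {ratio:.0%} of the peak")
```

This is only a ceiling calculation; testing under your actual load is the way to see what the mode really costs.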

I've implemented both in the field on Dell PowerEdge servers for single Microsoft SQL Server implementations. If I can be of more help, just comment to let me know.

Hope that helps.

EDIT: Coverage gap / ECC implementations

Yes, there is a coverage gap even if you implement both. Since you are specifically using a cluster of high-availability servers, IMHO you should use Advanced ECC. The performance impact is minimal compared to the benefits for the clustered devices. According to Crucial, ECC memory in general costs only about a 2% decrease in performance.

The gap is more specific to the types of errors that occur and how each mechanism handles them. In your specific situation it shouldn't translate to data loss, since this is an Enterprise DBMS, and errors, concurrency issues, etc. are managed at the software level to prevent data loss. A properly configured DBMS keeps a detailed history of changes, and the software that uses it can typically be set up to roll back the transaction if a severe error occurs.

ECC Implementations

ECC attempts to correct bit errors in memory reads and writes. If the error is more significant, though, not even ECC can recover, and data may be lost. There is more discussion of ECC at ServerFault/What is ECC ram and why is it better?
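To illustrate the "correct small errors, give up on larger ones" behaviour, here is a minimal sketch of a single-error-correct, double-error-detect (SECDED) code. It uses a toy extended Hamming code on 4 data bits; real ECC DIMMs protect 64-bit words with 8 check bits, and the function names here are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0: overall parity; indices 1..7: Hamming(7,4) positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    parity_broken = sum(code) % 2 == 1
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # An odd number of flips: treat it as one flip at position `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity holds but the syndrome is non-zero: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1

print(decode(word))       # clean read
print(decode(one_flip))   # single-bit error is repaired
print(decode(two_flips))  # double-bit error is detected but not repaired
```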

According to Wikipedia on ECC_Memory:

ECC memory maintains a memory system effectively free from single-bit errors...

SDDC

The E7500 chipset document above describes SDDC and how it's made possible (note that the 55xx/56xx documents from Intel require login/partnership, but the idea is similar, which is why I didn't link them originally). Basically, it organizes the words written to memory so that a failing device contributes at most a single bit error to each word, i.e. every word should be recoverable from the single-bit error (as above). That's per word, so it could potentially recover from up to 4-bit errors on x4 devices (1 per word) and up to 8-bit errors on x8 devices (still 1 per word) by error correcting each word.
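The interleaving idea can be sketched with a toy model. This is not Intel's actual code or layout: it assumes seven hypothetical x4 chips and a plain Hamming(7,4) single-error-correcting code, with chip j holding bit j of each of four codewords. Killing any one chip then damages only one bit per word, which the per-word correction repairs.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (position p at index p - 1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8  # indices 1..7 used
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]


def hamming74_decode(bits):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos
    if syndrome:
        c[syndrome] ^= 1  # the syndrome is the position of a single flipped bit
    return c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)


def store_interleaved(nibbles):
    """Spread 4 codewords over 7 hypothetical x4 chips: chips[j][w] is bit j of word w."""
    words = [hamming74_encode(n) for n in nibbles]
    return [[words[w][j] for w in range(4)] for j in range(7)]


def load_interleaved(chips):
    """Reassemble each word from one bit per chip and error-correct it."""
    return [hamming74_decode([chips[j][w] for j in range(7)]) for w in range(4)]


def fail_chip(chips, j):
    """Model a dead device by inverting every bit it holds (worst case)."""
    broken = [list(chip) for chip in chips]
    broken[j] = [b ^ 1 for b in broken[j]]
    return broken


def flip_four_in_one_word(nibble):
    """Contrast case: the same 4 bad bits all land in a single codeword."""
    bits = hamming74_encode(nibble)
    bits[:4] = [b ^ 1 for b in bits[:4]]
    return hamming74_decode(bits)


data = [0x3, 0xA, 0x5, 0xF]
chips = store_interleaved(data)
all_recovered = all(load_interleaved(fail_chip(chips, j)) == data for j in range(7))

print("every single-chip failure recovered:", all_recovered)
print("4 bad bits in one word decode correctly:", flip_four_in_one_word(0x3) == 0x3)
```

The contrast case shows why the layout matters: the same number of bad bits concentrated in one word is beyond what single-error correction can repair.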

Additional errors, more bit errors, total memory failure, channel failure, bus failure, etc. can still all cause horrible problems but that's why you have a cluster and an Enterprise DBMS.

In short, if you have everything enabled and there are too many bit errors for the error correction algorithms to correct, you will still have an error, i.e. an error coverage gap. These cases can be exceptionally rare, though.

inevio
  • To be more specific this is a set of 3 identical R710s running Oracle DB cluster. So availability of a single machine is not of the highest importance. However, data corruption is troubling. I've seen R710 technical guidebook. It did not have much additional information about memory. So with on-dimm ECC it will detect/correct errors within the dimm's chips? However, Advanced ECC will detect/correct errors for the whole dimm? If that's the case, is there a coverage gap between these 2 methods? – Mxx May 14 '12 at 13:36
  • @Mxx I've updated my answer to try to explain. IMHO since you are running an Oracle DB cluster, I'm doubtful you'll have data loss. In the rare event of a failure, the DBMS is built to prevent data loss and other issues. In your case for the cluster, I would enable Advanced ECC as performance should be negligible, but you can test it under load if you have concerns. – inevio May 14 '12 at 14:35
  • Thank you very much for the answer. I'm sorry, but I'm still not clear about one thing. What could "Advanced ECC" protect me from that on-dimm ECC couldn't? If we are using a DBMS, then wouldn't it make sense to switch the BIOS to "optimized mode" to get the performance benefit of a triple-channel memory config while still being protected by on-dimm ECC and Oracle's own validation? – Mxx May 14 '12 at 15:49
  • @Mxx I suppose it's not exactly guaranteed either way. However, with the Advanced ECC option on you'll be able to recover from more errors without intervention (lower overall probability of an uncorrected bit error), and the performance hit should be low. It's certainly lower than attempting to correct at the DBMS. While the DBMS may be able to save your data, the end user may still experience a software crash and/or the rollback of a potentially large operation. I suppose that with monitoring, if the chip is failing and the error frequency grows, Advanced ECC could give you time to replace the DIMM cleanly. – inevio May 14 '12 at 17:19