5

I have an AMD quad core, 8 GB RAM, 1 SSD formatted as ext2 (2 months old), and 2 HDDs formatted as ext4, approximately 1 year old. I'm using Ubuntu 10.04 x86-64, and when I compute the md5sum of large files (9 GB) I sometimes get values that differ from the one stored in a reference file.

After switching the PC off and restarting it, I get the expected results no matter how many times I repeat the check. But the mismatches show up at random.

I've turned on ECC (with the fastest possible settings) and the issue seems to be rarer, but I've run memtest86+ for 6+ hours without a glitch (and with ECC off!).

Any ideas? Should I update the BIOS of my motherboard (Asus EVO-something...don't remember it now)? I've tried everything else apart from this, and I genuinely don't know what to do anymore...

Any suggestion is appreciated!

Emanuele
  • 2
    Drop the VM caches and recompute the sum. See what you get then. "echo 3 >/proc/sys/vm/drop_caches". Also try to scan your filesystem for bad blocks, just to see if anything comes up. – Matthew Ife Nov 14 '11 at 23:42
  • Thanks mate, I would exclude a disk issue because it happens on both the SSD and the other two HDDs. What does dropping the VM caches do? Does it drop the cached files from the kernel so that md5sum is forced to re-read the file from disk? – Emanuele Nov 14 '11 at 23:46
  • Yes, that's correct. I'm clutching at straws, and I've seen only one example of it personally, but electromagnetic interference can cause bit flipping in RAM. – Matthew Ife Nov 14 '11 at 23:56
  • Ubuntu should also support control groups. You could try using the cpuset cgroup to allocate a specific RAM bank in your server to do the compute on; perhaps you'll find one that continually fails you. – Matthew Ife Nov 14 '11 at 23:59
  • Even dropping the file cache did **nothing**. Instead, **unplugging** the power and plugging it back in magically restored the correct MD5, on both physical disks and in parallel too... I'm really _lost_... – Emanuele Nov 15 '11 at 19:29
  • I've updated the BIOS (I have an M4A88TD-V EVO/USB3) to the latest revision (from 1404 to 2001, which came out 07/08/2011)... Let's see. Btw, I recommend _CPU-G_, the equivalent of CPU-Z for Ubuntu. – Emanuele Nov 15 '11 at 19:52
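
For anyone trying to reproduce this kind of intermittent mismatch, the check discussed in the comments above can be sketched roughly as follows. This is a minimal sketch, not the method the poster used: the helper names `md5_of_file`, `repeat_md5` and `is_stable` are hypothetical, and the 1 MiB chunk size is an arbitrary assumption.

```python
import hashlib
from collections import Counter


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeat_md5(path, runs=5):
    """Hash the same file several times and count each distinct digest."""
    return Counter(md5_of_file(path) for _ in range(runs))


def is_stable(counts, reference=None):
    """True if every run agreed (and matched the reference, when one is given)."""
    if len(counts) != 1:
        return False
    return reference is None or reference in counts
```

A call such as `repeat_md5("/path/to/large.file", runs=10)` (the path is a placeholder) returns a counter with one entry when every run agrees and more than one when the digest is unstable. Repeated reads may be served from the page cache, so dropping the caches between runs, as suggested above, is what forces the data to come from disk again.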

5 Answers

1

Is your RAM all the same? I had this happen after I bought more RAM and got some that was faster than what was already in the box. According to the specs for the motherboard it should have worked with mixed speeds, basically clocking to the lowest common denominator. Each set worked fine by itself if I took out the other, but together something would go wrong, and while the box worked for the most part, there were clearly problems. I did the checksums just like you described and had the same mismatches. I even ran memtest overnight and had the same result. I eventually wound up just taking the loss on the RAM and scrapped the smaller of the two sets.

Jeff Snider
1

If turning the machine off and restarting helps and ECC makes the problem rarer, I guess it's an overheating problem. See Enabling hardware sensors in Linux on how to use the embedded motherboard sensors (typically for the CPU and the motherboard). HDDs usually report temperature among their SMART attributes.

DIMMs don't have sensors so you have to either touch them, make guesses or use an additional piece of hardware with sensors on wires that can be placed anywhere - like this front panel.

ivan_pozdeev
1

Sometimes draining the capacitors can help. Unplug your machine and hold the power button for a few seconds. It sounds like witchcraft, but it works. (Sometimes.)

Also make sure your PSU is behaving properly; bad power supplies can cause bit errors.

Finally, start removing PCI/AGP/etc. devices and see if one of them is messing things up.

jon
0

Try using a diffing tool to verify byte by byte that the files do not in fact differ. This could also be a hard drive error of some kind.

Ztyx
  • 1
    Read the question carefully-- OP says he gets different results on literally the same file (not two copies presumed to have the same data.) – jon Dec 14 '11 at 01:11
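
Since the mismatch appears on repeated reads of a single file, one way to adapt the byte-by-byte idea is to compare two read passes of that same file chunk by chunk, which at least narrows down where the reads disagree. The following is only a sketch under that assumption: `chunk_digests` and `differing_chunks` are hypothetical helper names, and the chunk size is arbitrary.

```python
import hashlib


def chunk_digests(path, chunk_size=1 << 20):
    """Return one MD5 hex digest per fixed-size chunk of the file."""
    digests = []
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digests.append(hashlib.md5(chunk).hexdigest())
    return digests


def differing_chunks(first, second):
    """Return the chunk indices where two per-chunk digest lists disagree."""
    length = max(len(first), len(second))
    return [
        i for i in range(length)
        # A chunk missing from either pass also counts as a difference.
        if i >= len(first) or i >= len(second) or first[i] != second[i]
    ]
```

Calling `chunk_digests` twice on the same path and passing both lists to `differing_chunks` gives an empty list when the passes agree. If the second pass is served from the page cache, both passes will see the same in-memory data, so this is most informative when the caches are dropped in between.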
-1

Have you tried calculating the md5sum of a file that is large but still fits completely in RAM? It seems like there may be an issue with swapping.

dpendolino
  • These are files of different sizes, from 4k to 3 GB. Altogether they total 9 GB. – Emanuele Nov 16 '11 at 21:16
  • 1
    This seems unlikely, as md5 doesn't need the whole file in one go; it processes the input in 512-bit blocks, so its memory requirements are quite small. – user9517 Nov 17 '11 at 09:50
  • This is begging the question. Swapping is merely a process; he still needs to know what, hardware-wise, might be causing that process to go wrong. – jon Dec 14 '11 at 01:12
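
The point about MD5's small memory footprint can be illustrated with Python's `hashlib`, used here purely as a stand-in for what md5sum does internally. This is a minimal sketch: `md5_incremental` is a hypothetical helper, and it shows that feeding the data in small pieces produces the same digest as hashing it all at once.

```python
import hashlib


def md5_incremental(data, chunk_size):
    """Hash a bytes object by feeding it to MD5 one slice at a time."""
    digest = hashlib.md5()
    for start in range(0, len(data), chunk_size):
        digest.update(data[start:start + chunk_size])
    return digest.hexdigest()


# MD5's internal block size is 64 bytes, i.e. 512 bits.
BLOCK_BYTES = hashlib.md5().block_size

sample = bytes(range(256)) * 100
one_shot = hashlib.md5(sample).hexdigest()
# The digest does not depend on how the input is split up.
same_digest = md5_incremental(sample, BLOCK_BYTES) == one_shot
```

Because only the running state and the current slice need to be held in memory, the size of the file being hashed does not by itself push the hashing process into swap.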