Incorrect md5sums on downloaded files

2

I have an Ubuntu 10.04.1 LTS Linux server that is experiencing some weird issues. I just tried to download a 440 MB tgz archive over HTTP using wget, and when extracting it with tar -xzf filename.tgz I received:

gzip: stdin: invalid compressed data--crc error

Finding this odd, I renamed the file to filename-bad.tgz and downloaded it again. I received the same error on the second download. The site listed an MD5 checksum for the file, so I checksummed both download attempts to see whether the file on the site was simply corrupt...

The two files had different checksums!

So I downloaded this file to my local workstation and ran md5sum on it there. This time, the MD5 checksum was correct, and the file extracted properly. So I copied the file from my workstation to the server and ran md5sum on that copy. It was a new md5sum, different from the correct md5sum and different from the two other attempts!
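For anyone who wants to script the comparison described above, here is a minimal sketch, assuming Python 3 is available on both machines. The function names are made up for illustration; md5sum on the command line does the same job.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a 440 MB archive never sits in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_md5):
    """Compare a file's MD5 with the checksum listed by the download site."""
    return md5_of_file(path) == published_md5.strip().lower()
```

Calling matches_published("filename.tgz", checksum_from_site) on each copy shows which copies are intact. checksum_from_site is a placeholder for the value the site lists.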

Here is the detail of the server:

  • Intel(R) Core(TM) i5 CPU (Dual Core)
  • 8GB RAM
  • Software RAID 5 array using Linux md devices and three 1 TB SATA drives
  • 2 ethernet cards, connected to two different networks in our office (the wired and the wireless network)

I suspected maybe the RAID array was degraded/malfunctioning, so I ran mdadm --detail and it reported the state was clean and all drives were in active sync. To further test, I copied a 1GB file from an SD card to the RAID array, and the md5sum of that file verified.
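A single successful copy does not rule out an intermittent fault, so a repeated copy-and-verify loop can be more telling. The following is a rough sketch under stated assumptions: the helper names and the copy-N.bin file names are hypothetical, and SHA-256 is used only because it is handy, not because the original test used it.

```python
import hashlib
import os
import shutil


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, target_dir, rounds=5):
    """Copy source into target_dir several times and collect mismatching digests."""
    expected = sha256_of_file(source)
    mismatches = []
    for attempt in range(rounds):
        copy_path = os.path.join(target_dir, "copy-%d.bin" % attempt)
        shutil.copyfile(source, copy_path)
        actual = sha256_of_file(copy_path)
        if actual != expected:
            mismatches.append((attempt, actual))
        os.remove(copy_path)  # keep disk usage flat between rounds
    return expected, mismatches
```

Two caveats: the reference digest is itself computed on the suspect machine, so it is safer to compare it against a checksum obtained elsewhere, and the copy may be read back from the page cache rather than from disk. A clean run therefore proves little, while any mismatch is a strong hint.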

What could be going on?

EDIT: Output of cmp -l as requested:

324268145 115 105
324268657 274 264
324269297 332 322
324270577 345 344
324270833 155 154

EDIT2: I just realized one of the copies I have actually does have the correct MD5 checksum, so I copied the file from my local machine two more times and both times the checksum was correct! So a few more tests are in order here...

EDIT3: I am now unable to reproduce this issue. Which sounds like bad RAM to me. Will run memtest tonight, any other ideas welcomed!

EDIT4: OK, now this is weird. The issue is 100% reproducible when copying the file to a specific VMware virtual machine running on the server. If I copy the file to that virtual machine and then immediately copy it on to the host, the problem is sometimes reproducible on the host as well. scp also sometimes reports this when copying to the virtual machine:

Received disconnect from 10.1.0.73: 2: Packet corrupt

These all seem to me to be clues of bad RAM. Does everyone concur? Any other possible explanations?

EDIT5: Solved. Gee, what on earth could have been causing this problem? I just don't understand.... :-)

2436 Errors! All Right!

(I did test the RAM on this system right after I bought it, which was two or three months ago... oh well. Looks like it's time to call Dell...)

Josh

Posted 2010-08-02T21:23:43.443

Reputation: 7 540

I posted this on SuperUser as opposed to ServerFault because it's consumer-grade hardware, and it's a small office server as opposed to a serious production server. And it uses software RAID. But maybe SF is a better place for it, not sure! – Josh – 2010-08-02T21:25:01.200

Whatever's going on, it seems to have something to do with the server, since the file seems to be getting corrupted whenever you transfer it to the server from either your desktop or the download source. – David Z – 2010-08-02T21:58:50.110

My first guess would be the RAM. Did you run memtest? Could you swap the RAM for testing? – matthias krull – 2010-08-02T22:12:36.867

Answers

2

Did the two files have the same size? If not, one of the files was probably truncated.

If you used FTP to transfer files, some clients assume text files by default, and must be told to go into binary mode or they'll mangle ^M and ^J. This was once a major source of corrupted files, but is a rarity nowadays.

TCP segments carry a 16-bit checksum. If corruption were random, roughly one corrupted segment in 65536 would still pass the check, so an undetected transmission error is within the realm of possibility.
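To illustrate how an error can slip past a 16-bit checksum, here is a small sketch of the one's-complement sum described in RFC 1071. It is a simplification: the real TCP checksum also covers a pseudo-header, and the byte values below are invented for the example.

```python
def internet_checksum(data):
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


# Hypothetical payload: two 16-bit words, 0x1234 and 0x5679.
original = bytes([0x12, 0x34, 0x56, 0x79])
# Bit 0 set in the first word and cleared in the second: the sum is unchanged.
corrupted = bytes([0x12, 0x35, 0x56, 0x78])

print(internet_checksum(original) == internet_checksum(corrupted))
```

Because the checksum is just a sum, two errors that cancel out leave it unchanged, which is why a stronger end-to-end check such as an MD5 of the whole file is still worth doing.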

None of the possibilities above satisfactorily explain the third md5sum value, though.

Try comparing the files (e.g. with cmp -l) and see if there is a pattern to the differences. If you see that the differences always seem to be at certain bit positions (something like always at the most significant bit of a byte position of the form 8*n+3), it's usually a sign that your RAM is defective. Generally, in case of data corruption not explainable by software or network transmission, RAM is the first place to look.
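Spotting such a pattern by eye is tedious, so here is a minimal sketch that post-processes cmp -l output. It relies on cmp -l printing a 1-based decimal offset followed by the two differing bytes in octal. The sample lines are the ones posted in the question; the function names and the stride of 128 are assumptions chosen for illustration.

```python
CMP_OUTPUT = """\
324268145 115 105
324268657 274 264
324269297 332 322
324270577 345 344
324270833 155 154
"""


def parse_cmp_l(text):
    """Yield (offset, byte_a, byte_b) tuples from `cmp -l` output."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        # Offset is decimal; the two byte values are octal.
        yield int(fields[0]), int(fields[1], 8), int(fields[2], 8)


def summarize(text, stride=128):
    """Return (offset, offset % stride, flipped bit positions) per difference."""
    rows = []
    for offset, byte_a, byte_b in parse_cmp_l(text):
        flipped = byte_a ^ byte_b  # bits that differ between the two files
        bits = [bit for bit in range(8) if flipped & (1 << bit)]
        rows.append((offset, offset % stride, bits))
    return rows


for offset, remainder, bits in summarize(CMP_OUTPUT):
    print(offset, "offset mod 128 =", remainder, "flipped bits:", bits)
```

On the five lines from the question, every difference is a single flipped bit and every offset has the same remainder modulo 128. That kind of regularity points at hardware rather than at the network.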

Gilles 'SO- stop being evil'

Posted 2010-08-02T21:23:43.443

Reputation: 58 319

I did not use FTP... the file sizes are the same, so it must be corruption somewhere. I am starting to suspect bad RAM also. – Josh – 2010-08-02T22:29:57.193

wish I could add more plusses – Dave – 2010-08-02T22:32:45.513

"I'm afraid I can't let you do that, @Dave..." – Josh – 2010-08-02T22:36:34.840

I am now unable to reproduce this issue, meaning bad RAM is a very likely culprit. Will run memtest tonight and if the RAM is bad, your answer will be accepted! – Josh – 2010-08-02T22:54:26.197

0

If you are transferring using FTP, use binary mode. Otherwise, any line terminators in the file get mangled. Windows doesn't need to mangle line terminators in text mode.

BillThor

Posted 2010-08-02T21:23:43.443

Reputation: 9 384

I was transferring using HTTP (via wget) and SFTP (ssh). No FTP was involved. – Josh – 2010-08-02T21:44:50.440

0

As a quick check, if you transfer the file using sneakernet (i.e., put it on a flash drive and walk it over), does it work fine?

Tristan

Posted 2010-08-02T21:23:43.443

Reputation: 101

I'll try that with this specific file. A different file, copied from an SD card, was fine. This must be either a network issue or as some have said, a RAM issue. – Josh – 2010-08-02T22:32:11.847