
A few days ago I copied a large (56GB) file from a workstation to a file server. After checking the copy, I found that a few bytes differed from the original.

Details:

  • source system:
  • destination system:
    • HP ProLiant MicroServer N36L, ECC RAM
    • Windows Server 2012 R2 Standard
    • ReFS on Storage Spaces 2-way mirror

The file was copied by drag-and-drop on the workstation from the local disk to the network shared folder (on the server). The file size is 56886041991 bytes.

A second copy made the same way one day later was OK (checked with md5sum). Comparing the two copies reveals 97 bytes that differ (see below). The only pattern I see is that the broken bytes are clustered in three groups, and within each group every 128th byte is changed.
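
For anyone who wants to reproduce this kind of comparison without `cmp`, here is a minimal Python sketch. It is not the tool I used; the function names `diff_offsets` and `gap_histogram` are made up for illustration. It lists the offsets at which two files differ and counts the distances between consecutive differences, so a regular stride such as 128 stands out. Note that the offsets here are 0-based, while `cmp -l` prints 1-based byte numbers.

```python
from collections import Counter


def diff_offsets(path_a, path_b, chunk_size=1 << 20):
    """Return 0-based offsets where the two files differ.

    Only the common length is compared; a size mismatch is not reported here.
    """
    offsets = []
    base = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a or not b:
                break
            if a != b:
                # Only walk byte by byte through chunks that actually differ.
                n = min(len(a), len(b))
                offsets.extend(base + i for i in range(n) if a[i] != b[i])
            base += len(a)
    return offsets


def gap_histogram(offsets):
    """Count the distances between consecutive differing offsets."""
    return Counter(b - a for a, b in zip(offsets, offsets[1:]))
```

A histogram dominated by a single gap value (128 in my case) would suggest a systematic fault rather than random bit flips, although it does not say which component is responsible.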

What can I do? Where should I start looking for the cause? It cannot be the disks on the server, as they would report a read error in case of corruption. Even if they did not, ReFS would notice the bad checksum and read the sector from the other disk, and if that copy were corrupted too, it would (or should) report a read error. SATA has CRC. The server's RAM has ECC. The network has 2 layers of checksums. The workstation has no ECC memory. Maybe a network driver bug?

Output of cmp -l: see here


Additional tests done in the meantime:

  • 24 hours of memtest86+ v5.01 on both PCs, no errors
  • 24 hours of memtest86 v4.3.7 on both PCs, no errors
  • SMART long test on all HDDs: no errors (except on one that I know has a few bad sectors; they are outside the active partitions)
  • md5sum /dev/sdX in a loop: executed 5 times on the 5TB disk, over 20 times on the others; no errors detected
  • repeated the copy operation 10 times the same way as originally and checked the result each time: no errors
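
The repeated read-and-hash test above can be sketched roughly as follows. This is an illustration in Python rather than the md5sum loop I actually ran, and the helper names (`file_md5`, `repeated_digests`, `is_stable`) are hypothetical. The idea is simply to read the same data several times and check that every pass produces the same digest.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Hash a file (or block device) in chunks so it never has to fit in RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def repeated_digests(path, runs=5):
    """Read and hash the same path several times."""
    return [file_md5(path) for _ in range(runs)]


def is_stable(digests):
    """True if every pass produced the same digest."""
    return len(set(digests)) == 1
```

One caveat: for data smaller than free RAM, repeated reads may be served from the operating system's cache instead of the disk, so a test like this is most meaningful on data much larger than memory, such as a whole disk.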

I guess it was a lone cosmic ray...

David Balažic