
I have a backup server we've been using for some time: a FreeBSD server running ZFS and serving it over NFS. The export is simple: /backup/vm -maproot=root -alldirs. In case it's relevant, it was configured through ZFS:

zfs get sharenfs
backup/vm sharenfs  -maproot=root -alldirs  local

It's been running fine, and we've even restored from these backups. Today I discovered, purely by accident, that files read from the NFS share don't match what was written (and what's on the server).

To demonstrate: on the server we have

pg11.txt (downloaded on the server)
pg11.txt.1 (uploaded by a client over nfs)

Both of which are Alice in Wonderland, downloaded from here: http://www.gutenberg.org/cache/epub/11/pg11.txt

On the nfs server:

md5 pg11.txt*
MD5 (pg11.txt) = eff1e5d84df1d3a543d1c578192a2367
MD5 (pg11.txt.1) = eff1e5d84df1d3a543d1c578192a2367

So far so good. Now on a client:

md5sum pg11.txt*
4d79d99b8eebe364cddf5ce42949bc3e  pg11.txt
eff1e5d84df1d3a543d1c578192a2367  pg11.txt.1

What? Reading pg11.txt from the client I can easily find lines like:

Alice started to her feet, for it flashed across her <80>^A^@<80>^V<A0>R+^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@^A^@^@^A<A4>^@^@^@^A^^@^@^@^@^@^@^@^@^@^@^@^@^B<8E>^^@^@^@^@^@^B^B^@^@^@^@f7<D9>^@^@^@^@^@^@^V^V<EE>3^@^@^@^@^@^@^BFT^B<8C<FF>^E<D9>m(T^B<8C><E7>^]<CE>[<95>T^B<8C><E7>^]<CE>[<95>^@^A^@^@^@^@^@^@^@^A^@^@<U+FEFF>Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll

Now on a different client:

md5sum pg11.txt*
eff1e5d84df1d3a543d1c578192a2367  pg11.txt
b9c4076a85a151e730b9a9077fd6023b  pg11.txt.1

2nd client but over tcp:

md5sum  pg11.txt*
d80ce8c17092b1b759295e27a3c0af60  pg11.txt
14cde84fd05bd39845c9bb8fc0042eda  pg11.txt.1

The previous clients were both XenServer 6.2. If I try an Ubuntu system:

md5sum pg11.txt*
eff1e5d84df1d3a543d1c578192a2367  pg11.txt
81ca4f5b9b334d00a07fcb16f444a60a  pg11.txt.1

So every client seems to have a different picture, and usually not the right one. I'm hoping someone can give me some clue as to what's happening here and how to fix it, because I'm well stumped.
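One way to check more than a couple of files is to turn the server's BSD-style `md5` output into the format that `md5sum -c` reads on a Linux client. This is only a minimal sketch, assuming the server output looks like the `MD5 (name) = hash` lines above; the function name `bsd_to_gnu` and the file name `manifest.md5` are hypothetical:

```shell
# Convert FreeBSD md5 lines ("MD5 (name) = hash") read from stdin
# into the "hash  name" lines that GNU md5sum -c expects.
bsd_to_gnu() {
    sed -E 's/^MD5 \((.*)\) = ([0-9a-f]{32})$/\2  \1/'
}

# Hypothetical usage:
#   server$ md5 pg11.txt* | bsd_to_gnu > manifest.md5
#   client$ md5sum -c manifest.md5
```

The manifest itself should reach the client by some path other than the NFS mount (scp, for example), since a file read over the share may be corrupted too.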

Edit:

The various files, including diff can be found here: https://gist.github.com/Whoops/0fbe1751675d5e344d43. It appears that the start of the file is repeated several (7) times, preceded by the same binary string each time. Also it's interesting to note that the corruption appears to be consistent for each client, i.e. each client always sees the same corrupted version, rather than different corruption on each read.
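For anyone wanting to locate this kind of corruption without reading a full diff, `cmp -l` lists every differing byte, so the first line of its output gives the offset where a bad copy diverges from a good one. A rough sketch, assuming a known-good copy is available locally; the function name `first_diff_offset` and the file names in the usage comment are hypothetical:

```shell
# Print the 1-based byte offset of the first difference between two files.
# Prints nothing if the files are identical over their common length.
first_diff_offset() {
    cmp -l "$1" "$2" 2>/dev/null | head -n 1 | awk '{print $1}'
}

# Hypothetical usage: dump 64 bytes of the bad copy starting at the divergence.
#   off=$(first_diff_offset good/pg11.txt /mnt/backup/pg11.txt)
#   od -A d -c -j "$((off - 1))" -N 64 /mnt/backup/pg11.txt
```

Because the injected data shifts everything after it, only the first offset is meaningful here; the count of differing bytes will be large even for a single insertion.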

Edit2:

The problem occurs with both NFSv3 and NFSv4. It appears to occur only on Linux clients, not on other FreeBSD machines. Tested clients are XenServer 6.2 and Ubuntu 10.04, which means that if it's a client bug, it spans kernel versions 2.6 - 3.11. I don't currently have another FreeBSD server to test with.

Walton Hoops
  • Compare the files with `diff` and you'll get a clue. – Kondybas Aug 31 '14 at 07:27
  • I believe that's a bug. – Michael Hampton Aug 31 '14 at 09:23
  • @Kondybas diff is why I was using Alice in Wonderland, since it's not great for VM backups :-p. The relevant files including diff can be found here: https://gist.github.com/Whoops/ec2bbffb73461283d5b4. Unfortunately, the corrupted version contains invalid UTF-8 so github, pastebin etc. refuse to display them. It appears the start of the file is repeated several times, followed by a binary string, but I still have no idea why. – Walton Hoops Aug 31 '14 at 16:07
  • You have to know what the difference between the files is, otherwise there is no way to find the answer. – Kondybas Aug 31 '14 at 16:09
  • @Kondybas is there something you are trying to lead me towards I'm not seeing? The difference (that I see) is that the corrupted version has the leading part of the file re-injected several times, always followed by the same binary string. Why this would happen is still beyond me. – Walton Hoops Aug 31 '14 at 16:14
  • @Kondybas err always led by the same binary string. – Walton Hoops Aug 31 '14 at 16:19
  • Ugh, the gist disappeared on me, new one at: https://gist.github.com/Whoops/0fbe1751675d5e344d43 – Walton Hoops Aug 31 '14 at 16:26
  • I have a hard time imagining how this isn't a bug in the NFS server/client implementation, but you might share the same content out over some other protocol to confirm the problem scope (perhaps httpd, or even just scp'ing it off). – Joshua Miller Sep 02 '14 at 19:12
  • Agreed, it has to be a bug, and what's crazy is that BSD server/BSD client is fine and Linux/Linux is fine (our main storage is a Linux server), but BSD/Linux is not. I'm not sure what to do from here though, since I really have no clue which is at fault (client or server), and when I see a bug that looks like "straight up doesn't work as advertised" in well-established, widely used software, I wonder if there's an edge case I'm not recognizing. I've been kind of hoping someone would go "oh, that sounds like this bug" and then I'd know what I needed to up/downgrade. – Walton Hoops Sep 02 '14 at 19:30
  • Also, the same directory is shared read-only over CIFS which does not have this issue. – Walton Hoops Sep 02 '14 at 19:31
  • Looks like some kind of character encoding conversion is going on. In particular, the "<U+FEFF>" is a UTF byte order mark. No idea what would be doing that conversion, why, or how to fix it. – Chris S Sep 02 '14 at 21:19
  • Maybe, but I kind of suspect that's coincidence. For one, it affects binary files too, I just used Alice in Wonderland, because it's a lot easier to spot the differences than in a binary file (I discovered this with a XenServer patch, which is compressed). NFS shouldn't care about text or binary, it should transfer it all the same. Also it does not occur over CIFS, so it's hard to imagine something sitting between NFS and the rest of the system that doesn't affect CIFS as well. I may be off in the weeds, but I bet that binary has to do with the NFS protocol itself. – Walton Hoops Sep 02 '14 at 21:46
  • Have you tried mounting the NFS share on your clients with the noac NFS option? – c4f4t0r Sep 06 '14 at 18:36
  • Just tested: no effect, but thanks for the suggestion – Walton Hoops Sep 06 '14 at 18:44
  • I believe this is a sync problem in NFS. Try mounting NFS with sync. – Rainbow- Sep 08 '14 at 14:32
  • Occurs with both sync and async, nfs3 and 4 – Walton Hoops Sep 08 '14 at 17:46

1 Answer


OK, so it turns out this is a bug in the bxe driver, specifically in FreeBSD 10.0-RELEASE:

The bxe(4) driver can cause packet corruption when TSO (TCP Segmentation Offload) feature is enabled. This feature is enabled by default and can be disabled by using a -tso parameter of ifconfig(8). It can be specified in rc.conf(5) like the following:

ifconfig_bxe0="DHCP -tso"

This bug has been fixed on FreeBSD 10.0-STABLE.
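The same -tso flag can also be applied to a running interface, which lets you test the workaround before touching rc.conf. The sketch below is hedged: the interface name bxe0 is taken from the rc.conf example, the helper name `tso_enabled` is hypothetical, and it assumes ifconfig reports enabled offloads as TSO4/TSO6 inside its `options=...<...>` line:

```shell
# Succeeds if the ifconfig output on stdin lists TSO4 or TSO6 among the
# enabled options; fails otherwise.
tso_enabled() {
    grep -Eq 'options=.*<.*TSO[46]'
}

# On the server (assumed interface bxe0):
#   ifconfig bxe0 -tso        # disable TSO until the next reboot
#   ifconfig bxe0 | tso_enabled && echo "TSO still on" || echo "TSO off"
```

After disabling TSO, re-running the md5sum comparison from a Linux client against freshly read data should show whether the corruption is gone; remounting the share first avoids reading stale cached pages.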

Many thanks to junovitch on the FreeBSD forums for figuring this out.

Walton Hoops