Why do different du versions report different file sizes?


At our organization we have a Windows file server, which I use to store a large number of files. This file server is mounted via smbmount on two clusters.

Cluster A runs CentOS 4.8, and du version 5.2.1. Cluster B runs Ubuntu 8.04.4, and du version 6.10.

When I run the du command on Cluster A for a particular folder, I get

user@ClusterA:~/particular_dir$ du -h
....
637G  .

However, when I run the du command on Cluster B for the same folder, I get

user@ClusterB:~/particular_dir$ du -h
....
1.1T  .

Why is there such a large difference? The OSes and du versions differ, but surely a file size is a file size.

not_so_random

Posted 2010-08-19T09:21:45.057

Reputation: 23

Could you check if they report different sizes on all kinds of directories?

do a du -hs in a smaller directory with both du's, and see what happens. – polemon – 2010-08-19T09:26:16.177

Answers


What if you try ls -1s? It prints the file size in blocks. Or try ls -1ak (which reports the results with a block size of 1 KB), or just plain ls -lah; do the results look identical between the servers?

I suspect that Samba negotiates a different block size across different versions, so du can report misleading numbers over the network share. du stands for disk usage, not file usage :-) and in general things like the filesystem and its block size do matter if you have lots and lots of files around.
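One way to see the block-size effect directly is to compare a file's apparent size with its allocated size; du reports the latter, rounded up to whole blocks. A sketch using GNU coreutils stat and du (flag names may differ on non-GNU systems):

```shell
# Create a tiny file and compare its apparent size (actual bytes)
# with its allocated size (whole blocks) -- du reports the latter.
f=$(mktemp)
printf 'x' > "$f"                   # 1 byte of actual data
stat -c 'apparent: %s B, allocated: %b blocks of %B B' "$f"
du -h "$f"                          # disk usage, block-rounded
du -h --apparent-size "$f"          # byte count instead of blocks
rm -f "$f"
```

On a filesystem with 4 KB blocks the 1-byte file still occupies a full block, which is exactly the kind of gap that grows with the number of files and with the (negotiated) block size of the mount.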

Janne Pikkarainen

Posted 2010-08-19T09:21:45.057

Reputation: 6 717

Thanks a lot for your answer! I've followed your advice and have found that, yes, the block size is different. E.g. for a subset of XML files, on cluster A the block size of each file is 8, while on cluster B the block size of the same files is 1024. So in future I'll read what du is reporting with an awareness that block size will affect the results. Should I accept the smaller number because of the smaller block size, or treat the report as just a rough estimate? Thanks again - I was quite confused for a while. – not_so_random – 2010-08-19T12:18:57.110

With networked filesystem mounts you should always read the du output with care. With local filesystems it can still give you a hint about a need to optimize things. That's complicated, though: a very small block size (like 512 bytes) can lead to performance problems if you suddenly need to handle huge files on that partition; on the other hand, if you have a 4096-byte block size and the files average 1 KB, then you'll quickly waste lots of space, unless you use e.g. ReiserFS, which can pack multiple small files into a single block (with a slight performance hit). – Janne Pikkarainen – 2010-08-19T13:13:13.787


On the one hand, such reporting indeed seems confusing. The difference comes from the block size (512 B, 1 KB, 4 KB, etc.) defined for the particular file system, but also from the metadata describing the files (the file system usually keeps it on the same device, which increases disk usage).

On the other hand, it is quite handy to find out the useful (actual) data size and how it differs from disk usage, where disk usage = useful data size + metadata size + fragmentation (what du calls "file space usage").

To report disk usage instead of real size:

# du -sh Data/
2.0T    Data/

Now, to report the useful file size:

# du -sb Data/
1650071895576   Data/

Which is about 1.5 TiB, meaning that roughly 0.5 TiB goes to metadata (e.g. inode blocks) and to tail fragments: blocks that are allocated but not fully used at the very end of a file (true for every file whose size is not divisible by the block size). Note, though, that tail slack alone is small: with 2 million files and a 4096-byte block size, the waste averages about half a block (~2 KB) per file, i.e. only a few GB, so most of a gap this large must come from metadata and from how the (networked) file system accounts for blocks. Hence contiguous data saves space.
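The per-file overhead can also be measured directly. A sketch using GNU find and awk (Data/ is the directory from the example above; GNU find's %s is the apparent size in bytes and %b is the allocated size in 512-byte units):

```shell
# For each regular file, allocated minus apparent size is roughly the
# tail slack (plus any preallocation); summed over the tree, it shows
# how much of the du figure is not actual file data.
find Data/ -type f -printf '%s %b\n' 2>/dev/null |
awk '{apparent += $1; alloc += $2 * 512}
     END {printf "apparent: %d B, allocated: %d B, overhead: %d B\n",
          apparent, alloc, alloc - apparent}'
```

Comparing this overhead figure with the du -sh / du -sb gap tells you how much of the difference is slack in the files themselves versus metadata kept elsewhere on the device.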

Also, see the du man page for

-b, --bytes equivalent to `--apparent-size --block-size=1'

Yauhen Yakimovich

Posted 2010-08-19T09:21:45.057

Reputation: 437