stat and ls show wrong file size (terabytes wrong)

6

OK, I have a bunch of vCard files, all about 200 to 300 bytes in size.

While trying to archive them, I wondered why it was taking so long and discovered that one file has a wrong size. Both ls and stat show a size of about 8.1 terabytes. That's amazing, because my SSD is only about 250 gigabytes.

There are some other files with wrong sizes, too, but this is clearly the biggest one. I already ran fsck, but it found no errors in the (ext4) filesystem. How can I get rid of this wrong size?

Thanks, Wolle

WolleTD

Posted 2013-06-30T20:49:22.330

Reputation: 131

2 – Just a guess, but maybe those are (invalid) sparse files. That would explain the enormous size. – gronostaj – 2013-06-30T20:58:58.807

How do I get rid of those? And how can a sparse file be bigger than my hard drive? – WolleTD – 2013-06-30T20:59:34.197

1 – Imagine a binder capable of holding 100 pages. If you use that binder as a regular file, you could insert 100 pages. You could read all 100. You could write to all 100.

Now imagine a sparse binder. You insert the first page and write "page 1: content A" on it. You then insert a second page and write "page 9999: content B". Whenever you try to read a page, you check whether it exists. If it does not, the answer is an empty page. If it does exist, you return the contents of the page. Whenever you write to a page which does not yet exist in the binder, you add a new sheet of paper. – Hennes – 2013-06-30T21:06:26.550

Thus it is possible to have a binder with a page number (read: a file with a size) bigger than what would fit in the binder if all pages were present. – Hennes – 2013-06-30T21:06:47.223
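To make that concrete, here is a minimal sketch (assuming GNU coreutils; huge is a hypothetical file name) showing how a sparse file's apparent size can exceed the whole disk:

$ truncate -s 8T huge
$ ls -lsh huge
0 -rw-r--r-- 1 user user 8.0T Jul  1 10:00 huge

The leading 0 is the occupied size in blocks; the 8.0T is only the apparent size.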

1 – So, how do I fix that? I can't even open the file... – WolleTD – 2013-06-30T21:13:07.687

@WolleTD It is likely corrupt. Have you tried deleting it? – Paul – 2013-06-30T23:29:41.973

1 – Checking whether the file is sparse: ls -lsh file will print the occupied size in a new first column. If the occupied size is smaller than the apparent size, then the file is sparse. – pabouk – 2013-07-01T04:15:21.443

@Paul Then it is gone. I don't want it to be gone, I need it. And I can neither read its content nor copy it without getting the same wrong size... – WolleTD – 2013-07-01T09:44:44.657

Answers

1

vCard appears to be a text file format. This is a good thing, as text files should not contain NUL bytes - that will help if the OS mistakenly thinks the file is a sparse file containing very long runs of nulls.

You can use ls -lks bigfile to see if the occupied space differs from the apparent space.

You can use dd to extract chunks of data (e.g. the first 500 bytes only) into a new file. You can then use hexdump to see if there is recoverable text in that chunk.
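For example (bigfile and chunk are placeholder names), something like this pulls out the first 500 bytes for inspection:

$ dd if=bigfile of=chunk bs=500 count=1
$ hexdump -C chunk | head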

If you find the file is filled with long runs of nulls, you can try using a script to read the file and write only the non-null data to a new file, as shown below. In this way you may be able, with some effort, to construct a valid vCard file of the usual size.
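For instance, GNU tr can do that stripping in one pass (bigfile and fixed.vcard are placeholder names). Note that it drops every NUL byte, wherever it occurs:

$ tr -d '\0' < bigfile > fixed.vcard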

Alternatively, use strings bigfile to extract the text from the huge file.

Many of these operations will take a long time on a big file. You may want to practise on something smaller ...


Here's a vCard file:

$ cat gump.vcard
BEGIN:VCARD
VERSION:2.1
N:Gump;Forrest
FN:Forrest Gump
...
EMAIL;PREF;INTERNET:forrestgump@example.com
REV:20080424T195243Z
END:VCARD

$ file gump.vcard
gump.vcard: vCard visiting card

Let's make a corrupt sparse version:

$ dd of=sparse-file bs=1k seek=5120 count=0
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0 s, Infinity B/s

$ cat gump.vcard sparse-file > sparse-gump.vcard

$ cp --sparse=always sparse-gump.vcard really-sparse-gump.vcard

$ ls -lks *sparse*
   0 -rw-r--r-- 1 rgb rgb 5120 Jul 11 18:09 sparse-file
5136 -rw-r--r-- 1 rgb rgb 5121 Jul 11 18:10 sparse-gump.vcard
   4 -rw-r--r-- 1 rgb rgb 5121 Jul 11 18:18 really-sparse-gump.vcard

Note that the size on disk of the last file is only 4 KiB (first column), while its apparent size is 5121 KiB.
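As a cross-check (a sketch, assuming GNU coreutils), du can report the same two numbers:

$ du -h really-sparse-gump.vcard                    # allocated size
$ du -h --apparent-size really-sparse-gump.vcard    # nominal size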

Let's see what is in there:

$ hexdump really-sparse-gump.vcard | head -n 3
0000000 4542 4947 3a4e 4356 5241 0a44 4556 5352
0000010 4f49 3a4e 2e32 0a31 3a4e 7547 706d 463b
0000020 726f 6572 7473 460a 3a4e 6f46 7272 7365

$ hexdump really-sparse-gump.vcard | tail
0000230 4120 656d 6972 6163 450a 414d 4c49 503b
0000240 4552 3b46 4e49 4554 4e52 5445 663a 726f
0000250 6572 7473 7567 706d 6540 6178 706d 656c
0000260 632e 6d6f 520a 5645 323a 3030 3038 3234
0000270 5434 3931 3235 3334 0a5a 4e45 3a44 4356
0000280 5241 0a44 0000 0000 0000 0000 0000 0000
0000290 0000 0000 0000 0000 0000 0000 0000 0000
*
0500280 0000 0000
0500284

Note the * line between offsets 0000290 and 0500280 - hexdump prints * when successive lines repeat, and that's where all the imaginary nulls live.

$ strings really-sparse-gump.vcard > new-gump.vcard

$ ls -lks new-gump.vcard
4 -rw-r--r-- 1 rgb rgb 1 Jul 11 18:30 new-gump.vcard

$ cat new-gump.vcard
BEGIN:VCARD
VERSION:2.1
N:Gump;Forrest
FN:Forrest Gump
...
EMAIL;PREF;INTERNET:forrestgump@example.com
REV:20080424T195243Z
END:VCARD

We have recovered our normal-sized vCard from the huge file. Your mileage may vary.

RedGrittyBrick

Posted 2013-06-30T20:49:22.330

Reputation: 70 632

0

On Linux (since kernel 3.1), you can use lseek() with SEEK_DATA and/or SEEK_HOLE to identify the positions of data and holes in a sparse file. By repeating the call with an increasing offset, you can read the bytes identified as data and write them out to another file as you go. Perhaps something like this (most error checking and other tedium omitted for simplicity):

#define _GNU_SOURCE             /* for SEEK_DATA / SEEK_HOLE */
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int fd0 = open(file, O_RDONLY);
int fd1 = open(newfile, O_WRONLY | O_CREAT | O_TRUNC, S_IRWXU);
off_t eof = lseek(fd0, 0, SEEK_END);
off_t cur = 0;
char buf[8192];
while (cur < eof) {
  off_t d = lseek(fd0, cur, SEEK_DATA);  /* start of next data extent */
  if (d < 0)
    break;                               /* no data past cur (ENXIO) */
  off_t h = lseek(fd0, d, SEEK_HOLE);    /* start of the hole after it */
  lseek(fd0, d, SEEK_SET);               /* rewind to the data */
  size_t dlen = (h - d < 8192) ? (size_t)(h - d) : 8192;
  ssize_t rlen = read(fd0, buf, dlen);
  write(fd1, buf, rlen);                 /* copy out only the real data */
  cur = d + rlen;
}
close(fd0);
close(fd1);
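Since read() on a hole simply returns zeros, a plain cat or cp would also get through the file, but it would churn through terabytes of imaginary nulls; seeking directly from one data extent to the next skips all of that work.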

Eric Westbrook

Posted 2013-06-30T20:49:22.330

Reputation: 1