What are the end bytes of *.docx file format

23

3

I have a hard disk that was formatted and reinstalled its OS.

The problem is, it wasn't booting before formatting and the data backup that I've made before formatting, for some reason, don't have all the files.
There are Microsoft Word *.docx files missing.

Now I'm trying to recover the files with Puran File Recovery but it doesn't have a *.docx extension scan entry pre-built in it.
Puran File Recovery has an option to we create custom entries and I found in filesignatures.net the start bytes signature, so now I was able to find many *.docx headers in the hard disk.

My problem now is that I can't find anywhere what are the end bytes of *.docx files so that I might be able to recover some files.

J Rui Pinto

Posted 2019-09-16T16:46:14.870

Reputation: 249

18So many people test their backups but never test their recovery. I always found this strange: which of the two is the one that is actually interesting? – Jörg W Mittag – 2019-09-17T06:43:51.643

10@Jörg That's probably because doing the backup itself is relatively easy, while testing full restoration requires a third storage medium to put the restored copy of your disk on, which would generally not be at hand. Arguably you can also test whether individual files can be restored, but that's not a full test – tomsmeding – 2019-09-17T19:37:27.697

7@tomsmeding: I know a company who would rigorously test all their backups. At the end of every backup, they would do a byte-level comparison of the backup tape with the HDD. They also did redundant backups to two tape drives and would actually do the readback twice, once in the original drives, then swapping the two tapes between the drives. They did this every day, for the daily, weekly, monthly, and quarterly backups. The backups were in two separate climate-controlled, bomb-proof, fire-proof vaults, one on the opposite end of the building from the servers, one on the opposite end of the … – Jörg W Mittag – 2019-09-18T12:25:12.097

9… country. Then, one day, they actually had to restore a machine, and they found that the SCSI controller's firmware had an incompatibility with their tape roboter and didn't allow booting from tape … They lost millions that day. Since I heard that story, I make sure to always test restoring, not backuping. – Jörg W Mittag – 2019-09-18T12:26:20.443

Answers

70

A .docx file is just a .zip file. This is how a Zip file is structured:

Structure of a Zip file

The end of a Zip file is indicated by the end of central directory record (EOCD). The length of the EOCD is variable because it can contain a comment up to 65535 bytes long. See the bold part of the EOCD layout below:

+---------+--------+--------------------------------------------------------------------+
| Offset  | Bytes  |                            Description                             |
+---------+--------+--------------------------------------------------------------------+
|      0  | 4      | End of central directory signature = 0x06054b50                    |
|      4  | 2      | Number of this disk                                                |
|      6  | 2      | Disk where central directory starts                                |
|      8  | 2      | Number of central directory records on this disk                   |
|     10  | 2      | Total number of central directory records                          |
|     12  | 4      | Size of central directory (bytes)                                  |
|     16  | 4      | Offset of start of central directory, relative to start of archive |
|     20  | 2      | Comment length (n)                                                 |
|     22  | n      | Comment                                                            |
+---------+--------+--------------------------------------------------------------------+

Table from Wikipedia » Zip (file format) » End of central directory record (EOCD)

You can get the end of a Zip file by looking for 0x06054b50 (the beginning of the EOCD), then counting 16 bytes after that. Set the next two bytes to 0x0000 to ignore the comment, and you should now have the end of a valid Zip file.


Note: This does not take file system fragmentation into account. Your recovery approach will not work if the .docx/.zip file was fragmented on the disk because the signatures you're finding would be broken up. You would need some information from the file system in order to piece together fragmented files; beginning and end signatures don't have this information.

PhotoRec is a software I've used before that has some tricks to figure out how to piece together fragmented files. Crucially for you, PhotoRec has built-in support for Zip files, so you might want to try TestDisk/PhotoRec if your current signature search strategy isn't working for you.

Deltik

Posted 2019-09-16T16:46:14.870

Reputation: 16 807

4PhotoRec is pretty good. Sadly, sometimes, it just spits out multiple GB files randomly. – Ismael Miguel – 2019-09-17T11:11:57.570

@IsmaelMiguel I don't know the program, but it sounds like that is caused by the file system fragmentation Deltik talks about. That can only be avoided if a program is used that does not ignore the filesystem. – Paul – 2019-09-18T09:40:50.643

@Paul Won't it be required to ignore the file system if parts of it are overwritten/destroyed? This means that, yes, the fragmentation will cause issues. Which may be why the program creates those multiple GB files. – Ismael Miguel – 2019-09-18T09:45:40.623

@IsmaelMiguel If only a part of the metadata is destroyed, it can still use the other parts to figure out file locations. It's hard to write such a program though, but they might exist. – Paul – 2019-09-18T10:00:13.730

What happens if 0x06054b50 appears in the Comment? – Fax – 2019-09-19T12:43:07.100

@Fax: Highly unlikely for a .docx file. Even if this happened, one should use context clues, like checking for the presence of another EOCD signature before the one found and the comment length to confirm that there is not a whole .zip file in between. If the file is fragmented, it's still going to be tough to reconstruct regardless.

– Deltik – 2019-09-19T13:17:35.337

11

Deltik's answer is correct. Some potentially helpful information:

The sequence of bytes for the End-Of-Central-Directory Header will actually appear as 504b0506 (reverse order), as viewed by a hex editor such as xxd, or in a byte-addressed sequence.

In a valid OpenOfficeXML file, such as a .docx file, there is never an end-of-central-directory comment (See ECMA-376, Part 2, page 76: "ZIP file comment" should not be produced. However, consumers are supposed to support reading a file containing such a comment anyway.)

Also, multi-disk archives are not supported (see page 75), so the "Number of this disk" field and the "Disk where central directory starts" field are always 0. Moreover, the "Number of central directory records on this disk" and the "Total number of central directory records" fields should be equal.

All told, the final 22 bytes of any .docx file should always have the form

 50 4b 05 06 00 00 00 00 ## ## ## ## ## ## ## ## ## ## ## ## 00 00
| signature |disk |CD-  |num. |num. |size of CD | CD offset |comment
|           |num. |disk |recs |recs |           |           |length

Nick Matteo

Posted 2019-09-16T16:46:14.870

Reputation: 586