
I am analysing a big, cloud-based application. During the analysis I found that one of the biggest files (>3 MB) used in this app is a very small (16x16) icon.png file.

Further analysis revealed that the file contains over 60000 lines of metadata, consisting mostly of <rdf:li> tags inside a <photoshop:DocumentAncestors> tag. Here is an example:

     <photoshop:DocumentAncestors>
        <rdf:Bag>
           <rdf:li>0</rdf:li>
           <rdf:li>00094172844843523D09FDF552DF119E</rdf:li>
           <rdf:li>000B84DD32F5ABCC8D7B5E8681465EE9</rdf:li>
           <rdf:li>0013FA92942B6EC5451A4D9D4972AD7E</rdf:li>
           <rdf:li>0017ED7FA617555EF7D04797B72E2946</rdf:li>
           <rdf:li>0030491E2F4C927C3D67B20A9710BC01</rdf:li>
           <rdf:li>003287E12D0B5EA81D0AED63DDC335E5</rdf:li>
           <rdf:li>004657FECAF7D9DF3A459A2C0820D29A</rdf:li>
           <rdf:li>0048B527A1E225804FA1FE3E90A74F50</rdf:li>
           <rdf:li>0061E7DAD11961FF150102241FDE8BF5</rdf:li>

How can I check whether this metadata was placed there "naturally" or whether it contains some hidden data?

Kao
  • Create a similar file that contains the image data but no metadata. Use a proxy like [Charles](http://www.charlesproxy.com) to redirect requests for that file to the local copy. Check whether the application still works the same way as it did before. – Guntram Blohm Mar 18 '15 at 13:29
  • @GuntramBlohm Thanks for advice. Unfortunately, I am not able to deploy and run the application at this stage. – Kao Mar 18 '15 at 13:33
  • That's why I suggested a proxy. You don't need to deploy the application; you just need it deployed and running somewhere. The intercepting proxy is supposed to route everything to the application, except requests for that one image. Of course, this assumes the application runs in your browser (I assumed so because you said cloud-based), but I might be wrong. – Guntram Blohm Mar 18 '15 at 13:37
  • Are there additional 0 blocks, where the number changes? If so, does the data that follows that number begin with that hex value? – ǝɲǝɲbρɯͽ Mar 18 '15 at 16:35
  • Those are clearly 128-bit hexadecimal numbers. Concatenate them, convert them to binary, and run the result through a standard compression algorithm (e.g. gzip). How much smaller is the result, relative to the uncompressed version? (It's important that you convert to binary else you will only measure gzip's ability to compensate for hexadecimal coding overhead.) – zwol Mar 18 '15 at 16:49
  • Also, they appear to be in an ascending sequence. Does that continue throughout? If you take the numeric difference between pairs of numbers, does that reveal a pattern? – zwol Mar 18 '15 at 16:50
  • @zwol As for the ascending sequence: [this article](http://www.hackerfactor.com/blog/index.php?/archives/2013/05/23.html) states that `The Document Ancestors is supposed to be an unsorted array`; however, it can be sorted in some cases. – Kao Mar 18 '15 at 17:56
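
The two checks zwol suggests in the comments above (compress the binary form of the IDs, and look at the ordering and differences) could be sketched roughly as follows. This is a minimal illustration and not a forensic tool: the function names are made up for this example, the regex assumes every ID is exactly 32 hex digits as in the question's sample (so the leading `0` entry is skipped), and `SAMPLE` only holds the nine IDs quoted in the question.

```python
import gzip
import re

# The nine IDs quoted in the question; a real run would use the full XMP text.
SAMPLE = """
<rdf:li>0</rdf:li>
<rdf:li>00094172844843523D09FDF552DF119E</rdf:li>
<rdf:li>000B84DD32F5ABCC8D7B5E8681465EE9</rdf:li>
<rdf:li>0013FA92942B6EC5451A4D9D4972AD7E</rdf:li>
<rdf:li>0017ED7FA617555EF7D04797B72E2946</rdf:li>
<rdf:li>0030491E2F4C927C3D67B20A9710BC01</rdf:li>
<rdf:li>003287E12D0B5EA81D0AED63DDC335E5</rdf:li>
<rdf:li>004657FECAF7D9DF3A459A2C0820D29A</rdf:li>
<rdf:li>0048B527A1E225804FA1FE3E90A74F50</rdf:li>
<rdf:li>0061E7DAD11961FF150102241FDE8BF5</rdf:li>
"""


def ancestor_ids(xmp_text):
    """Return every 32-hex-digit <rdf:li> entry, in document order."""
    return re.findall(r"<rdf:li>([0-9A-Fa-f]{32})</rdf:li>", xmp_text)


def compression_ratio(ids):
    """Compressed size / raw size of the IDs after converting hex to binary."""
    raw = b"".join(bytes.fromhex(i) for i in ids)
    if not raw:
        raise ValueError("no IDs to compress")
    return len(gzip.compress(raw, compresslevel=9)) / len(raw)


def is_ascending(ids):
    """True if the IDs, read as 128-bit integers, never decrease."""
    values = [int(i, 16) for i in ids]
    return all(a <= b for a, b in zip(values, values[1:]))


def differences(ids):
    """Numeric difference between each pair of neighbouring IDs."""
    values = [int(i, 16) for i in ids]
    return [b - a for a, b in zip(values, values[1:])]


if __name__ == "__main__":
    ids = ancestor_ids(SAMPLE)
    print(len(ids), "IDs, ascending:", is_ascending(ids))
    print("compression ratio:", compression_ratio(ids))
```

On a nine-entry sample the compression ratio says very little, because the gzip header dominates; the number only becomes meaningful on the full list. A ratio close to 1 would be consistent with random-looking IDs, while a much smaller ratio would point to structure worth a closer look.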

2 Answers


Looks like this metadata lists document IDs that were used during creation of the file. You can check this article: http://www.hackerfactor.com/blog/index.php?/archives/2013/05/23.html, search for the "Ancestors" section.

So, it contains technical metadata which could be placed there 'naturally' by the Adobe applications.

Dmytro
  • Though this doesn't necessarily mean that the data was placed there "naturally". It could be that this is just the method someone decided to use to hide data. According to the link, DocumentAncestors refers to copy-paste/place operations. It seems quite unnatural that someone performed 60000 copy/paste operations to produce a 16x16 icon. Also according to the link, the right part of the hex string refers to the instance of the Adobe application. It should remain the same as long as the application is not restarted. So OP's sample implies that the Adobe application was restarted 60000 times as well. – Supr Mar 18 '15 at 12:56
  • The file seems fishy to me. At best, it's bad practice (SEO/webmaster-wise) to have a >3 MB file for a 16x16 icon. – k1DBLITZ Mar 18 '15 at 14:47
  • @Supr It could be a buggy image-editing batch script that got caught in an infinite loop. – Philipp Mar 18 '15 at 21:53

I believe everything I've linked is safe. Sorry for the format, I'll try to fix this when I can.

Here are some candidates with similarly large metadata counts. Report links come from Googling "Excessive number of items for * DocumentAncestors" (which comes from exiftool, apparently used by VirusTotal).

Here's a jpg or mp3 (report), a png with spam text (report), a png alone (report), and two with the same md5 (31a02712515ace35f1a593c14a7b5150), but this one starts with "0," like your example does. png (report) and a live sample png samsung tablet (SAMPLE). The sample comes from the hash; the others did not produce samples.

A histogram from the "samsung" sample (I quickly split out each byte of 107,000 entries, sorted and sent them through 'uniq') may be of limited utility, except to show that the bytes aren't completely random. This may be expected given how some operations are probably encoded, but I was assuming a programming error that generated purely random UUIDs. This isn't the prettiest picture so I can work on that. Decimal 17 (0x11) is the large spike at bottom.

[Figure: histogram of byte values from the "samsung" sample; 00-FF along the left side, count along the bottom]
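
For anyone who wants to reproduce a histogram like the one described above, here is a minimal Python sketch of the same idea (split each entry into bytes, then count them). It stands in for the sort-and-`uniq` approach rather than copying it, the function name `byte_histogram` is made up for this example, and it assumes every entry is exactly 32 hex digits.

```python
import re
from collections import Counter


def byte_histogram(xmp_text):
    """Count how often each byte value (0-255) occurs across all 32-hex-digit IDs."""
    counts = Counter()
    for hex_id in re.findall(r"<rdf:li>([0-9A-Fa-f]{32})</rdf:li>", xmp_text):
        # Iterating over a bytes object yields integers, one per byte.
        counts.update(bytes.fromhex(hex_id))
    return counts


def print_histogram(counts, width=60):
    """Print one text bar per byte value, scaled to the most frequent byte."""
    peak = max(counts.values(), default=1)
    for value in range(256):
        bar = "#" * (counts[value] * width // peak)
        print(f"{value:02X} {counts[value]:7d} {bar}")
```

If the IDs were uniformly random, every bar would have roughly the same length; a spike such as the one at 0x11 stands out immediately.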

I tried some experiments to see if there might be some encoded data (also the point of the histogram) but have mostly approached it as just metadata generated while a file is processed.

Here are some additional pursuits:

Another forum post at Adobe, "Photoshop CC is creating problematic JPEGs that make OSX Preview.app lose its mind", links a file (Note4Cover1.jpg) that is just as large but not as nicely formatted inside.

Someone else ran into an excessive number of items; I think this link suggests how to remove the extra data (be warned that it may remove metadata you want):

    exiftool -xmp:all= -tagsfromfile @ "-all:all<xmp:all" FILE

A caveat: I found that opening and saving with a new name using GIMP removed the data regardless of the checkboxes being set to save it. It seems like that's not supposed to happen according to the standards linked by other answers here.
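
For a PNG specifically, the XMP packet is normally stored in an `iTXt` chunk whose keyword is `XML:com.adobe.xmp`, so the chunk holding the bloat can also be found or dropped by walking the chunk list directly. The sketch below is an alternative to the exiftool command, not what it does internally. The function names are made up for this example, it handles PNG only, and it drops the whole XMP packet rather than just the DocumentAncestors list.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
XMP_KEYWORD = b"XML:com.adobe.xmp"  # iTXt keyword used for XMP packets in PNG


def iter_chunks(data):
    """Yield (type, body, raw_bytes) for every chunk in a PNG byte string."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte big-endian length, 4-byte type, body, 4-byte CRC.
    while pos + 12 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        end = pos + 12 + length
        yield chunk_type, data[pos + 8:pos + 8 + length], data[pos:end]
        pos = end


def strip_xmp(data):
    """Return the PNG bytes with every XMP iTXt chunk removed."""
    kept = [PNG_SIGNATURE]
    for chunk_type, body, raw in iter_chunks(data):
        # The iTXt body starts with a null-terminated keyword.
        is_xmp = chunk_type == b"iTXt" and body.split(b"\x00", 1)[0] == XMP_KEYWORD
        if not is_xmp:
            kept.append(raw)
    return b"".join(kept)
```

Listing `[(t, len(b)) for t, b, _ in iter_chunks(data)]` first shows which chunk accounts for the file size. Work on a copy, since the removed metadata is exactly what a later forensic look would need.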

And finally, differ (differ.readthedocs.org) is an image reporting library. I haven't evaluated it: it looks useful and dumps stats from many tools (like exiftool and imagemagick), but it might be a little tricky to set up (github). It still might be useful for forensic data.

ǝɲǝɲbρɯͽ