
It was mentioned that JPEG should not be used between image creation and redaction of sensitive contents, because compression artifacts around the redacted area may leak information. Given how this lossy format works, this makes sense. Is there any public research on this subject?

The core of the issue is that, for a lossy format like JPEG, the pixels are not entirely independent of each other: within each 8x8 pixel block, the DCT coefficients (one DC coefficient, representing the block average, plus the AC coefficients, representing variation within the block) encode the block as a whole, so compression artifacts anywhere in a block depend on every pixel in it. The blocks themselves are mostly independent of one another.
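As a concrete illustration, here is a minimal pure-Python sketch of the 2D DCT-II that JPEG applies per 8x8 block (orthonormal scaling assumed; real encoders use different scaling and add quantization). It shows that the DC coefficient is just a scaled block average, while the AC coefficients capture within-block variation:

```python
import math

def dct2_coeff(block, u, v):
    # One coefficient (u, v) of the 2D DCT-II with orthonormal scaling,
    # the transform JPEG applies (up to scaling) to each 8x8 block.
    N = len(block)
    def a(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    s = sum(block[y][x]
            * math.cos((2 * x + 1) * u * math.pi / (2 * N))
            * math.cos((2 * y + 1) * v * math.pi / (2 * N))
            for y in range(N) for x in range(N))
    return a(u) * a(v) * s

flat = [[128] * 8 for _ in range(8)]   # a uniform gray block
dc = dct2_coeff(flat, 0, 0)            # ≈ 1024, i.e. 8 × the block mean
ac = dct2_coeff(flat, 1, 0)            # ≈ 0: no variation, no AC energy
```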


From lcamtuf (Nitter link) describing this phenomenon:

PSA: If you're redacting text in an existing JPEG image file (e.g., a scan or a photograph of a document), you should probably maintain a margin of 8 or more pixels between the black bar and the underlying text.

The reason is that JPEG compression is a lossy algorithm that operates on 8x8 pixel blocks, and that barely perceptible content-dependent compression artifacts are present as a halo extending up to 8 pixels past the boundary of the text in the image you are trying to redact.

forest
  • I don't believe there is. In most cases, such an information leak would be purely theoretical. The circumstances would have to be very specific (font size, area of redaction, position of the redaction) to allow any meaningful exploitation. – Peter Harmann Apr 20 '18 at 07:41
  • @PeterHarmann Purely theoretical issues are what academia loves best. – forest Apr 20 '18 at 07:46
  • True, but I have never seen any research in this area, nor did I manage to google any. Also, research teams working on issues like these like their demos, so if exploitation is infeasible, they may not have published their results. It is just boring to write a paper saying: nope, there is no problem here. And even if they did, how would they prove there really is no method better than what they tried? – Peter Harmann Apr 20 '18 at 08:12
  • @PeterHarmann One example would be an analysis of lossy media formats in general. An analysis of multiple formats (everything from H.264 to MP3 to JPEG) could easily say "nope, no problem here" for several different formats. I recall a paper comparing cipher key schedules that still reported null results: the Serpent key schedule exhibited no problems (whereas IDEA's had quite a few). – forest Apr 20 '18 at 08:16
  • Well, that depends on whether "redaction" in all these formats makes similar sense as in images. But yes, it would be nice if there were a paper; if there is, I did not find it. – Peter Harmann Apr 20 '18 at 08:19

1 Answer


JPEG compression is a form of transform compression using DCT. A quick Google gives this overview of the algorithm:

http://www.dspguide.com/ch27/6.htm

Now the important bits here are:

  1. 8x8 pixel blocks (the block size is fixed by the JPEG standard) are compressed largely independently of each other
  2. Most of the compression comes from quantizing away the high-frequency components

Point #1 is the most practically useful one. If your redaction allows you to add an ample border, extending the black bar by at least the 8-pixel block size in every direction ensures that every block containing sensitive pixels, and with it all correlated artifacts, is covered.
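For example, a redaction rectangle can be snapped outward to the 8-pixel block grid before the black bar is drawn. This is a hypothetical helper, not something from the answer:

```python
def expand_redaction(x0, y0, x1, y1, block=8):
    """Snap a redaction rectangle (half-open pixel coordinates) outward
    to full block boundaries, so every 8x8 block that touches the
    sensitive region is blacked out in its entirety."""
    return (x0 // block * block,        # round left edge down
            y0 // block * block,        # round top edge down
            -(-x1 // block) * block,    # round right edge up (ceiling)
            -(-y1 // block) * block)    # round bottom edge up (ceiling)

expand_redaction(10, 5, 20, 15)  # → (8, 0, 24, 16)
```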

Point #2 is something that someone else can potentially elaborate on. Essentially, for the redacted content to be recoverable by forensics, the correlation between the pixels to be reconstructed and the surviving artifacts would have to affect the final encoding in a statistically significant way. That does not seem to me to be a trivial task. However, since JPEG encoding is deterministic, educated guesses can be verified, or candidates even brute-forced.
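The brute-force idea can be sketched as follows: re-encode each candidate for the redacted content with the same (assumed known) transform and quantizer, and keep the candidates whose quantized coefficients match what was observed near the redaction. The 1-D DCT, the candidate patterns, and the quantizer step of 16 are all illustrative assumptions, not part of the answer:

```python
import math

def dct_1d(x):
    # Orthonormal 1-D DCT-II; enough to illustrate the verification idea.
    N = len(x)
    return [(math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
            * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                  for n in range(N))
            for k in range(N)]

def encode(pixels, q=16):
    # Deterministic toy "encoder": DCT followed by uniform quantization.
    return [round(c / q) for c in dct_1d(pixels)]

# Hypothetical candidates for the pixels hidden under the redaction.
candidates = {"striped": [200, 200, 50, 50, 200, 200, 50, 50],
              "flat":    [50] * 8}

# Pretend the attacker observed these quantized coefficients leaking
# just outside the bar; here we fake them from the "striped" candidate.
observed = encode(candidates["striped"])

# Because encoding is deterministic, verification is a plain comparison.
matches = [name for name, px in candidates.items()
           if encode(px) == observed]  # → ["striped"]
```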

Tom