4

We have a report in our web application that is shown in tabular format in HTML. The report can be downloaded as a PDF by clicking the "download as PDF" button. My question is about the way this PDF download is implemented. I've been told to write a back-end service that converts a raw HTML string into a downloadable PDF. To convert the HTML into PDF we are using the Flying Saucer library for Java. The way this is supposed to work is that I will get raw HTML content as a string such as:

<table id="new-table">
    <thead>
        <tr>
            <th class="model">Column 1</th>
            <th class="description">Column 2</th>
            <th class="quantity">Column-3</th>
            <th class="listDollars">Column-4</th>
            <th class="payout">Column-5</th>
        </tr>
    </thead>
    <tbody>
        <tr id="row-H2285" style="background: #FFFFFF;" class="modelRow">
            <td class="model">H2285</td>
            <td class="description">F125</td>
            <td>16</td>
            <td class="list"></td>
            <td class="Percent">... and so on


I receive this from the front end in the request parameters, and I have to convert this HTML string using Flying Saucer and return a PDF file. My question is: is there a way for an attacker to inject malicious code into this HTML content and send it to the back-end service, so that it harms anyone who opens the PDF file?

I have googled for security issues in the Flying Saucer library but couldn't find anything. I did, however, find a question on this site, "How to inject malicious code in pdf or jpeg", and another one, "Detecting malicious javascript in PDF".

  • I am missing what sort of input validation you do. There should be some that also strips out any interactive elements (JavaScript) – LvB Jun 22 '17 at 09:47
  • @LvB: I currently just have a filter that removes whole sections of '' tags. But as Johnny mentioned in his answer, the 'escaping of strings' part remains. I'm working on that – Aditya Cherla Jun 22 '17 at 10:47

3 Answers

4

I have worked with many PDF libraries (though not this one exactly), mainly in .NET, and when used correctly they are fine. If you are worried that the client-side HTML sent back to the server has been manipulated, one suggestion is to render a separate HTML version in your backend specifically for the PDF library to use. That version is never sent to the client, so it is not subject to client manipulation.
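
A minimal sketch of that idea in Java, assuming the report data is already available on the server as plain strings. The class name `ReportHtmlRenderer` and its method names are hypothetical and not part of Flying Saucer; the sketch only builds the markup and leaves the PDF conversion itself out.

```java
import java.util.List;

// Hypothetical helper: builds the report table on the server from trusted data,
// so no HTML from the client ever reaches the PDF library.
final class ReportHtmlRenderer {

    // Escape the characters that are significant in XHTML text and attribute values.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Render header and body rows; every cell value is escaped before it is emitted.
    static String render(List<String> headers, List<List<String>> rows) {
        StringBuilder html = new StringBuilder("<table><thead><tr>");
        for (String header : headers) {
            html.append("<th>").append(escape(header)).append("</th>");
        }
        html.append("</tr></thead><tbody>");
        for (List<String> row : rows) {
            html.append("<tr>");
            for (String cell : row) {
                html.append("<td>").append(escape(cell)).append("</td>");
            }
            html.append("</tr>");
        }
        return html.append("</tbody></table>").toString();
    }
}
```

With this approach the front end would only send identifiers or filter parameters for the report, and the resulting string would be handed to the PDF library in place of the client's markup.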

ISMSDEV
4
  1. Make sure your web service is not susceptible to XSS injection.

  2. You must find ways to prevent abuse (since this becomes a web service), e.g. use API key control to prevent just anyone from uploading an HTML file with malicious JavaScript to your PDF generator.
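
As a rough illustration of the API key control in point 2, here is a sketch that uses only the JDK. `ApiKeyGate` and the way keys are stored are assumptions made for the example; a real service would load keys from configuration and would typically combine this with rate limiting and request size limits.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Set;

// Hypothetical gate placed in front of the PDF endpoint.
final class ApiKeyGate {
    private final Set<String> validKeys; // assumed to come from server-side configuration

    ApiKeyGate(Set<String> validKeys) {
        this.validKeys = validKeys;
    }

    // Returns true only if the presented key matches one of the configured keys.
    boolean isAllowed(String presentedKey) {
        if (presentedKey == null) {
            return false;
        }
        byte[] presented = presentedKey.getBytes(StandardCharsets.UTF_8);
        boolean match = false;
        for (String key : validKeys) {
            // MessageDigest.isEqual is meant to avoid the early-exit timing leak of String.equals.
            match |= MessageDigest.isEqual(presented, key.getBytes(StandardCharsets.UTF_8));
        }
        return match;
    }
}
```

The endpoint would call `isAllowed` with the key taken from a request header and refuse to run the conversion when it returns false.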

mootmoot
3

Without reviewing the entire source of the library in question and every other place the PDF could conceivably be used and have its content evaluated (uploaded to some other service that parses content or performs OCR, etc.), there's really no way of knowing.

The key here is that you need to be sanitizing your user inputs as soon as you receive them regardless of whether you trust this library. Your goal as a developer needs to be, "if someone I don't trust gives me a potentially explosive string, even if I can safely handle it, I need to defuse it before someone who doesn't know better accidentally detonates it."

That someone may be another tool in your data processing pipeline that will happily evaluate a malicious string, or the customer's PDF reader. Either way, it's a daywrecker.

Escape your strings! There are a lot of ways to sanitize user input. I'm not a Java expert so I don't know them all, but if it's your day job, you certainly should.
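
Since the input here is markup that has to keep its table structure, escaping the whole string would destroy it, so one of those ways is allowlist validation rather than escaping. The sketch below uses only the JDK's XML parser and rejects any input that contains elements or attributes outside a small allowlist. The class `TableHtmlValidator` and both allowlists are assumptions for illustration, and `style` is deliberately left out, so a row carrying a `style` attribute would be rejected.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical validator: accepts only well-formed table markup built from allowlisted names.
final class TableHtmlValidator {
    // Assumed allowlists; adjust to what the report actually needs.
    private static final Set<String> TAGS =
            new HashSet<>(Arrays.asList("table", "thead", "tbody", "tr", "th", "td"));
    private static final Set<String> ATTRS =
            new HashSet<>(Arrays.asList("id", "class"));

    static boolean isSafe(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so external entities cannot be declared or resolved.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Element root = builder.parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
            return check(root);
        } catch (Exception e) {
            // Malformed input or a rejected DOCTYPE is treated as unsafe.
            return false;
        }
    }

    // Walk the tree and fail on the first element, attribute, or node type not allowlisted.
    private static boolean check(Element element) {
        if (!TAGS.contains(element.getTagName())) {
            return false;
        }
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            if (!ATTRS.contains(attributes.item(i).getNodeName())) {
                return false;
            }
        }
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                if (!check((Element) child)) {
                    return false;
                }
            } else if (child.getNodeType() != Node.TEXT_NODE) {
                return false; // comments, CDATA sections, and processing instructions are rejected
            }
        }
        return true;
    }
}
```

Rejecting the request outright is a design choice made to keep the sketch simple; stripping the offending nodes and continuing is another option, but it is easier to get wrong.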

Ivan