16

Is there a way to generate PDF documents from HTML files automatically in Linux where the PDF offers some kind of reasonable level of resemblance to the input file?

A command-line tool - as opposed to an interactive GUI of some kind - is key.

I have tried htmldoc and some related cousins, of course. But these tools are hopelessly stone-age; htmldoc doesn't support CSS at all. You won't find a lot of HTML documents these days that don't have at least some CSS styling. I don't really care about stupid effects or minor embellishments, but the issue is that CSS is at the core of most layouts these days; not many folks are using 6 layers of nested tables anymore. So, if the conversion tool has no grasp of CSS whatsoever, it's not just a matter of "the document doesn't look quite right"; it is likely to not meet the minimum standard of usability at all.

Some folks have suggested using the Gecko rendering engine to generate images that can then be converted to PDFs, but I have no idea how one would go about doing this, let alone easily.

I have no trouble believing that there are good commercial tools that do this, but I'm really looking for an open-source package if possible, as the endeavour itself is an open-source one and doesn't pay.

Thanks in advance!

HopelessN00b
Alex Balashov

6 Answers

7

Have you seen wkhtmltopdf? I can't say personally how well it works, but it seems like exactly what you need. The only problem may be, with this and any 'browser automation' solution, that it will pick up the print stylesheet rather than the screen one, so the PDF may not be exactly what you see on screen.
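For batch use, a run could be scripted roughly as follows. This is a minimal sketch, assuming the `wkhtmltopdf` binary is installed and on the `PATH`; the function names and the directory layout are hypothetical. wkhtmltopdf has a `--print-media-type` option that selects the print stylesheet instead of the screen one, so the choice can be made explicitly for each run.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(html_file, pdf_file, print_css=False):
    """Build the wkhtmltopdf argument list for a single conversion."""
    command = ["wkhtmltopdf", "--quiet"]
    if print_css:
        # Ask for the print stylesheet rather than the screen one.
        command.append("--print-media-type")
    command += [str(html_file), str(pdf_file)]
    return command


def convert_directory(html_dir, pdf_dir, print_css=False):
    """Convert every *.html file in html_dir to a PDF in pdf_dir.

    The directory layout is an assumption made for this sketch.
    """
    if shutil.which("wkhtmltopdf") is None:
        raise RuntimeError("wkhtmltopdf not found on PATH")
    pdf_dir = Path(pdf_dir)
    pdf_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for html_file in sorted(Path(html_dir).glob("*.html")):
        pdf_file = pdf_dir / (html_file.stem + ".pdf")
        # check=True raises CalledProcessError if wkhtmltopdf exits non-zero.
        subprocess.run(build_command(html_file, pdf_file, print_css), check=True)
        written.append(pdf_file)
    return written
```

Calling `convert_directory("html", "pdf", print_css=True)` would then write one PDF per HTML file, rendered with the print stylesheet.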

robertc
  • What's a print stylesheet? I must be out of touch with the latest and greatest new stuff from the CSS universe. – Alex Balashov Aug 01 '09 at 21:23
  • It's nothing new, it just only became practical for most websites once they switched to CSS for layout instead of tables. Try http://www.alistapart.com/articles/goingtoprint/ or http://www.webcredible.co.uk/user-friendly-resources/css/print-stylesheet.shtml for an introduction. – robertc Aug 01 '09 at 22:07
2

XHTML2PDF is a Python toolset that includes both command-line scripts and a Python library (should you want to embed this in something larger without shelling out to the script). It supports HTML/XHTML and CSS, with additional vendor-specific CSS styles to tweak the formatted output (page numbers, paragraph flow, and so on).

I've only used it a tiny bit to batch process a few HTML docs, but it worked fine, and its feature set seems comprehensive to me. The manual is hidden on the demo page, but is, itself, a good example of the conversion from an HTML doc to a PDF.

I had a nice set of links to "before" and "after" examples, but I just created my account, and, apparently, only spammers put more than one link in their first post :-p

Tripp Lilley
2

Try chm2pdf with python-beautifulsoup.

riza
1

I wanted to generate some PNGs from HTML pages on the command line. Somewhere I found this Ruby script that uses mozembed to generate a screenshot. You can remove the scale line if you don't want it scaled.

The only problem I see is that the page actually appears on the screen for a moment...

chmeee
  • Hm, yeah. The last part seems to be a bit of a killer. This needs to be baked into a purely server-side backend; no display head or anything. Any way to accomplish that? – Alex Balashov Aug 01 '09 at 21:24
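One common way to run a program that needs an X display on a machine without one is Xvfb, a virtual framebuffer X server, whose `xvfb-run` wrapper starts a temporary display for a single command. A minimal sketch, assuming `xvfb-run` is installed; the function names are hypothetical, and whether the mozembed script behaves well under Xvfb is untested here:

```python
import shutil
import subprocess


def headless_command(command):
    """Prefix a command with xvfb-run so it gets a virtual X display."""
    # -a makes xvfb-run pick a free display number automatically.
    return ["xvfb-run", "-a"] + list(command)


def run_headless(command):
    """Run command under Xvfb; raises if xvfb-run is missing or the command fails."""
    if shutil.which("xvfb-run") is None:
        raise RuntimeError("xvfb-run not found on PATH")
    return subprocess.run(headless_command(command), check=True)
```

For example, `run_headless(["ruby", "screenshot.rb", "http://example.com/", "page.png"])` would run the screenshot script without a physical display; `screenshot.rb` and its arguments are hypothetical stand-ins for the Ruby script mentioned above.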
0

Try dompdf. It works fine from the command line and, judging by its examples, it works with any kind of HTML.

0

PrinceXML. It can handle CSS just fine. Linux, Windows, and Mac OS X versions are available. AFAICS, this is also the technology behind Google Docs' PDF output. But note: this is payware.

Kurt Pfeifle