How to convert a webpage to PDF while preserving its look (exactly as in the web browser) and its text/links?

27

16

I'm looking for a way to convert a webpage to PDF while preserving the webpage's look. The webpage's text must also stay selectable and searchable (an image screenshot of the webpage would make the text neither selectable nor searchable).

I want to print the webpage to PDF as is (as shown in the web browser), without any change to style or alignment and without losing any of the webpage's static components.

This would help keep offline copies of webpages that are easy to read, annotate and search.


You don't need to read anything below to understand my question (the question is just the section above). The following section only lists what I've found through research or others' answers while trying to reach an answer.

Research Outcomes (Suggestions that didn't solve my problem)

Outcomes so far from trying to find a solution (none of them works as a solution for this question)

I've tried these PDF web printing engines, but all of them alter the page's look, and some damage it so badly that it is hardly readable. (Example page screenshots are included in square brackets.)

  • Chrome [Original, Print Styles (Disabled | not Disabled)]
  • Firefox [Original, Print Styles (Disabled p1,p2 | not Disabled p1,p2)]
  • Readability
    • It simplifies the webpage (which is good for focused reading; however, this isn't what I'm looking for). I want to keep all of the webpage's positions and style properties as seen in the web browser, in PDF format, without any manipulation.
  • Foxit Reader
  • NovaPDF
  • CutyCapt [Original, Zoom Factor: 0.4: Screenshots, Outputted PDF]
    • I'll add links after I solve the program's running issues on Windows.
  • wkhtmltopdf [Original, Zoom Factor: 0.4: Screenshots, Outputted PDF]
    • It doesn't support CSS3.

None of the webpage screenshot plugins (e.g. Abduction, Awesome Screenshot, Fireshot, Firefox Screenshot Developer Tool, Full Page Screen Capture, Page2Images, web-capture, ...) answer my question, because they don't preserve text and links.

Scrible is great at preserving webpages as is for further annotation and research, but unfortunately it works only online and doesn't convert to PDF.

There are two other questions in this community that are somewhat similar to mine; this one differs slightly, with these important distinctions:

More similar questions where preserving text and links isn't a requirement (pages are mostly captured as image screenshots):


Notes

OS: Windows 10

Omar

Posted 2016-04-12T15:17:35.083

Reputation: 909

If you want to print from a browser you first have to disable any print stylesheets to maintain the web page's screen appearance. – DavidPostill – 2016-04-12T15:25:42.890

See "How to get WYSIWYP (print what you see) in a web browser?" and my answer to that question.

– DavidPostill – 2016-04-12T15:26:50.397

Then you can print using CutePDF writer.

– DavidPostill – 2016-04-12T15:27:53.183

@DavidPostill It seems that disabling print styles either doesn't work or doesn't make the browser render the PDF correctly. Example screenshots have been added to the edited version of the question. – Omar – 2016-04-12T19:11:51.770

I had the same question today and this page helped me (although the output was a mobile version of the page): https://stackoverflow.com/questions/9540990/using-chromes-element-inspector-in-print-preview-mode/

– MicroMachine – 2018-05-17T19:58:04.983

Answers

7

We faced the same problem in a university project and were able to solve it using wkhtmltopdf.

We quite enjoyed the capabilities of this tool on the command line. We also called it from Python code to render the current state of webpages. It can output the webpage as a PDF, which usually does not preserve the site's view perfectly because of the page format (A4, for example), or as a PNG, which preserves the view of the page but not the links.
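A minimal sketch of calling wkhtmltopdf from Python might look like the following. It is not the project's actual code: the function names are hypothetical, the `wkhtmltopdf` binary is assumed to be installed and on the PATH, and the `--zoom` and `--page-size` options should be checked against `wkhtmltopdf --help` for your version.

```python
import shutil
import subprocess


def build_wkhtmltopdf_command(url, output_pdf, zoom=None, page_size=None):
    """Build the argument list for a wkhtmltopdf call (hypothetical helper)."""
    command = ["wkhtmltopdf"]
    if zoom is not None:
        # --zoom scales the rendered page, e.g. 0.4 as in the question's tests.
        command += ["--zoom", str(zoom)]
    if page_size is not None:
        # --page-size selects the paper format, e.g. "A4".
        command += ["--page-size", page_size]
    # wkhtmltopdf expects the input URL followed by the output file.
    command += [url, output_pdf]
    return command


def render_pdf(url, output_pdf, **options):
    """Run wkhtmltopdf; assumes the binary is installed and on the PATH."""
    if shutil.which("wkhtmltopdf") is None:
        raise FileNotFoundError("wkhtmltopdf was not found on the PATH")
    command = build_wkhtmltopdf_command(url, output_pdf, **options)
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(command, check=True)
```

A call such as `render_pdf("https://example.com", "page.pdf", page_size="A4")` would then write `page.pdf` in the current directory. Keeping the command construction separate from the call makes it easy to inspect the exact arguments before running the tool.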

We also used the readability project (for Python: pypi.python.org/pypi/readability-lxml), which does ad removal and content detection quite well (e.g. for newspaper articles and the like). If you just want an add-on or extension for your browser, the following readability implementation might satisfy your need:

https://www.readability.com/addons/

sebisnow

Reputation: 178

Unfortunately, wkhtmltopdf didn't preserve the positions of the page's elements. Example page: Zoom Factor: 0.4: Screenshots, Outputted PDF

– Omar – 2016-05-06T18:36:47.963

Readability simplifies the page (which is a good thing; however, this isn't what I'm looking for). I need to keep all of the page's positions and style properties as seen in the web browser, in PDF format, without any manipulation. – Omar – 2016-05-06T19:08:47.607

Did you use the tool's PNG output (the wkhtmltoimage binary)? As a PNG the positions should be okay (at least much better than in the PDF version, where the page is fitted to A4 format). – sebisnow – 2016-05-09T06:36:17.953

3

Contributing another answer for possible users. In Firefox, there used to be an add-on called "Print pages to PDF". You can search for its last version, 0.1.9.3 (it works on pre-Quantum versions only).

Currently there's this addon for both Chrome and Firefox that works quite well: PDFMage

  • Saves all images in the page
  • Generates text as text, not as an image, so you can search the text in the generated PDF
  • Preserves hyperlinks
  • Has the option to save a long webpage as a one-page PDF (so the images are not split between pages)

nmhung1985

Reputation: 41

2

I had the same problem, and figured it out via Chrome and with a free printer driver called PDF995. This is part of a suite of PDF utilities; the publisher's web site is http://www.pdf995.com/.

However, I think any web browser and any pdf converter will suffice. Anyway, here's what I did:

  1. Select all, or highlight everything you want to keep.
  2. Right-click the highlighted selection or press Ctrl+P (the two options behave slightly differently, but you end up with the same result).
  3. If you right-clicked in step 2 (the shortcut), click "Print"; only what you selected will appear in the print preview. Make sure you change the printer destination to whatever PDF converter you decide to use (PDF995 or other).
  4. Click "Print" and the selection is saved as a PDF document.
  5. If you pressed Ctrl+P in step 2 instead (the slightly longer way), click "More settings" and scroll down to "Options".
  6. Check the box that says "Selection only"; from here on, everything works as in the shortcut described above.
  7. Don't forget to change the printer destination to whatever PDF converter you choose (PDF995 or other).
  8. Click "Print".

user726167

Reputation: 21

2

I really struggled with this and tried most of the tools that have been mentioned so far. The best results I got were with Chrome's headless mode. The command on macOS would look like this:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless --print-to-pdf=test.pdf http://127.0.0.1:8080

The best list of command line options I found was here.

However, there were problems with that. Specifically, my pages are very JavaScript-heavy and I couldn't make the print function wait for the scripts to finish executing, so my output didn't have the images in it.
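For scripting, the headless call above can be wrapped in Python. This is only a sketch under assumptions: the function names are hypothetical, the Chrome path is the macOS one from the command above, and `--virtual-time-budget` (which is meant to give page scripts some virtual time to run before the PDF is produced) may not behave the same in every Chrome version, so treat it as something to try rather than a guaranteed fix for the JavaScript problem.

```python
import subprocess

# Path taken from the macOS command above; adjust it for your system.
CHROME = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"


def build_chrome_pdf_command(url, output_pdf, chrome=CHROME, time_budget_ms=None):
    """Build the argument list for a headless Chrome PDF export (hypothetical helper)."""
    command = [chrome, "--headless", f"--print-to-pdf={output_pdf}"]
    if time_budget_ms is not None:
        # Assumption: --virtual-time-budget lets page scripts run for this many
        # milliseconds of virtual time before the PDF is written.
        command.append(f"--virtual-time-budget={time_budget_ms}")
    # The URL to render goes last, as in the original command.
    command.append(url)
    return command


def print_to_pdf(url, output_pdf, **options):
    """Run headless Chrome; raises CalledProcessError if Chrome exits with an error."""
    subprocess.run(build_chrome_pdf_command(url, output_pdf, **options), check=True)
```

For example, `print_to_pdf("http://127.0.0.1:8080", "test.pdf", time_budget_ms=5000)` would run the same export as the command above with a five-second virtual time budget. Passing the arguments as a list avoids having to escape the space in the Chrome path.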

The solution I found was a Node.js package: chrome-headless-render-pdf. Its scant documentation is here. It works and is easily scriptable.

AlanObject

Reputation: 207

1

If you're on Linux, try the small command-line tool CutyCapt, which depends only on Qt and QtWebKit and can export to PDF.

Ziggy Crueltyfree Zeitgeister

Reputation: 293

0

This is not exactly what you asked for, since the result is not a PDF, but if the objective is purely to keep an offline copy of webpages for later review, saving the page as a webpage would do just that.

The big caveat is that it will create a .html file and a folder with all the media content on the page rather than a single document.

In Chrome and Firefox, you can save a page by right-clicking it and choosing Save as... In Internet Explorer, you can save it under File -> Save as (press the Alt key to make the menus appear).

Pyheme

Reputation: 51

Saving the webpage in .html format would make it impossible to annotate, so I need it in PDF format. – Omar – 2016-04-12T15:34:35.833

That's a good point! I just remembered an extension that allows you to easily disable print-related stylesheets. A quick Google search led me to the discussion where I had first heard of it, on Super User: How to get WYSIWYP (print what you see) in a web browser?

– Pyheme – 2016-04-12T15:42:24.973

I tried doing "Save As" using Chrome. It creates an .HTML file and a folder. The .HTML file was missing a whole lot of stuff from the page. – SherlockSpreadsheets – 2018-12-10T22:33:16.723

0

Try this service. It creates a PDF from a website as you see it in the browser. https://lomotoh.com/ (I am affiliated with this site)

David Herse

Reputation: 101

This preserves links, but not selectable text, which is a requirement in the question. – fixer1234 – 2016-10-15T23:07:49.697

Seems to be selectable for some sites. I think it depends what sort of custom font the site uses. – David Herse – 2016-10-16T03:18:35.277

0

At least on some pages, all of the text is searchable, selectable, and can be cut and pasted. I tried it on a page assembled automatically by a computer out of text and pictures, and it turned the whole page into an image.

I have used these things for years. I get the best results in Linux by rebuilding the page in a word processor of your choice and exporting the result as a PDF. I can get what I want, but at considerable cost. From my limited use for archiving, the site David Herse put up, https://lomotoh.com/ (I am NOT affiliated with this site), works as well as any I have ever used. It will be my go-to resource for converting webpages to PDFs until I find something better or it costs too much for me to pay out of my own thin purse.

Gordon Couger

Reputation: 1