Why are PDFs generated from MS Word so large?

70

9

I created a simple MS Word document containing just this sentence:

This is a small document.

Nothing else. Then I saved this document as both a DOCX and a PDF. Here are the file sizes:

DOCX: 12 kB
PDF: 89 kB

This difference is huge, technically, and it really starts bothering me when mostly textual documents that are tens of kB in DOCX start generating PDFs that are hundreds of kB large. What's so inefficient about the PDF format? Or is Word just using some terrible output algorithm?

BTW, the PDF output settings were set to create the smallest file possible:

PDF output options

Borek Bernard

Posted 2015-09-30T08:08:37.507

Reputation: 11 400

28

My guess is that the PDF embeds the font, which is necessary if a document is to be truly portable. – AFH – 2015-09-30T08:29:14.710

You can open the PDF's properties to see whether the font is embedded or not. – phuclv – 2015-09-30T08:31:41.780

Can you add a link to the pdf and maybe the docx too? – Hastur – 2015-09-30T08:37:04.323

2

Yes, the font subset is embedded. That might be it. I've tried repeating the same sentence a few hundred times and the PDF file size only grew by 4 kB, which is just about right. (DOCX stayed at 12 kB, which is no surprise as that is a zipped format and repeated text will take hardly any new bytes.) – Borek Bernard – 2015-09-30T08:37:05.217

The setting "Minimum size (publishing online)" probably only affects the quality of embedded images, not of fonts. – Arjan – 2015-09-30T09:07:23.080

@AFH Spot on! I wonder if it will also embed common fonts such as Arial – MonkeyZeus – 2015-09-30T12:38:48.197

1

@AFH It does not embed Arial. http://i.stack.imgur.com/aUZgt.png

– MonkeyZeus – 2015-09-30T12:43:03.637

1

Thinking about it from a Kolmogorov complexity standpoint, Microsoft Word is larger than your average PDF viewer, by much more than a few hundred kB. – hobbs – 2015-09-30T15:06:39.667

8

I think the real question is why your wordprocessing format is so much bigger than the equivalent LaTeX ... :-p – Toby Speight – 2015-09-30T16:40:34.603

1

Also, remember that DOCX is really just a zip file, so you have built-in compression at the document level. PDF has some internal compression techniques (streams), but there's a lot of preamble (tokens/names) surrounding those that doesn't get any compression applied. – Chris Haas – 2015-09-30T20:51:04.663
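
The point made in the comments about zipped formats and repeated text can be checked with a minimal sketch. It uses Python's zlib module, which implements the same Deflate method that zip archives typically use; the repeat count of 300 is an assumption standing in for "a few hundred times", and the byte counts are for the raw text only, not for a real DOCX.

```python
import zlib

# The sentence from the question, as raw bytes.
sentence = b"This is a small document.\n"

# Repeat count is an assumption standing in for "a few hundred times".
repeated = sentence * 300

# Compress one copy and the repeated copies with Deflate (via zlib).
once = len(zlib.compress(sentence))
many = len(zlib.compress(repeated))

print(f"uncompressed: {len(sentence)} B once, {len(repeated)} B repeated")
print(f"compressed:   {once} B once, {many} B repeated")
```

The compressed size of the repeated text stays close to that of a single copy, which is why the DOCX barely changes in size.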

Answers

104

If you open the PDF in Notepad++ you'll find:

9 0 obj
<</Filter/FlateDecode/Length 79100/Length1 171804>>
stream
xœì}    XTGºvÕ9½/t7Ðl
..... many more bytes  ...   ëH|  
endstream
endobj
10 0 obj

and that object is referenced here at the end in the /FontFile2 instruction:

6 0 obj
<</Type/FontDescriptor/FontName/ABCDEE+Calibri/Flags 32/ItalicAngle 0/Ascent 750/Descent -250/CapHeight 750/AvgWidth 521/MaxWidth 1743/FontWeight 400/XHeight 250/StemV 52/FontBBox[ -503 -250 1240 750] /FontFile2 9 0 R>>
endobj

The fonts used by the Word document get embedded into the PDF so that the PDF is self-contained.

I used this slide-deck to decipher the PDF instructions.
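
The lookup described above (font descriptor, then /FontFile2 reference, then stream dictionary) can be sketched in Python. This is only a rough illustration under some assumptions: the fragment below is hand-made from the two objects quoted above with the stream body left out, the function name `embedded_truetype_fonts` is hypothetical, and a regex scan like this only works on PDFs whose objects are stored as plain text (not inside compressed object streams).

```python
import re

# Hand-made fragment shaped like the two objects quoted above; the stream
# body is omitted because only the dictionaries matter here.
PDF_FRAGMENT = b"""
6 0 obj
<</Type/FontDescriptor/FontName/ABCDEE+Calibri/Flags 32/FontFile2 9 0 R>>
endobj
9 0 obj
<</Filter/FlateDecode/Length 79100/Length1 171804>>
stream
endstream
endobj
"""

# Font name followed, inside the same dictionary, by a /FontFile2 reference.
DESCRIPTOR = re.compile(
    rb"/FontName\s*/([^\s/<>\[\]()]+)[^>]*?/FontFile2\s+(\d+)\s+\d+\s+R"
)


def embedded_truetype_fonts(pdf_bytes):
    """Return (font name, compressed bytes, uncompressed bytes) per font."""
    results = []
    for name, obj_num in DESCRIPTOR.findall(pdf_bytes):
        # Find the dictionary of the stream object the descriptor points to.
        obj = re.search(
            rb"(?<!\d)" + obj_num + rb"\s+\d+\s+obj\s*<<(.*?)>>", pdf_bytes, re.S
        )
        if obj is None:
            continue
        # /Length is the stored (compressed) size, /Length1 the original size.
        length = re.search(rb"/Length\s+(\d+)", obj.group(1))
        length1 = re.search(rb"/Length1\s+(\d+)", obj.group(1))
        if length and length1:
            results.append(
                (name.decode("latin-1"), int(length.group(1)), int(length1.group(1)))
            )
    return results


for font, stored, original in embedded_truetype_fonts(PDF_FRAGMENT):
    print(f"{font}: {stored} bytes stored, {original} bytes before compression")
```

For the objects quoted above, this reports 79100 bytes stored for the Calibri subset, which accounts for most of the 89 kB file from the question.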

If you want to prevent the fonts from being embedded in the PDF file, make sure your Word document uses one of the 14 standard typefaces available in PDF viewers (source: Wikipedia):

  • Times New Roman > Times (v3) (in regular, italic, bold, and bold italic)
  • Courier New > Courier (in regular, oblique, bold and bold oblique)
  • Arial > Helvetica (v3) (in regular, oblique, bold and bold oblique)
  • Symbol > Symbol
  • Wingdings > Zapf Dingbats
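
The list above can be turned into a quick check, as a minimal sketch. The mapping and variant counts are taken from the list; the function name `fonts_likely_embedded` is hypothetical, and whether a given exporter really skips embedding for these fonts depends on its settings, so treat the result as a hint only.

```python
# Word font -> (PDF standard family, number of variants), as listed above.
WORD_TO_STANDARD = {
    "Times New Roman": ("Times", 4),
    "Courier New": ("Courier", 4),
    "Arial": ("Helvetica", 4),
    "Symbol": ("Symbol", 1),
    "Wingdings": ("Zapf Dingbats", 1),
}

# The variants add up to the 14 standard typefaces mentioned above.
STANDARD_TYPEFACE_COUNT = sum(count for _, count in WORD_TO_STANDARD.values())


def fonts_likely_embedded(fonts_used):
    """Return the fonts that have no standard PDF counterpart in the list."""
    return sorted(f for f in set(fonts_used) if f not in WORD_TO_STANDARD)


print(STANDARD_TYPEFACE_COUNT)
print(fonts_likely_embedded(["Calibri", "Arial", "Calibri"]))
```

A document set in Calibri, as in the question, falls outside the list, which matches the embedded ABCDEE+Calibri subset seen in the PDF.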

rene

Posted 2015-09-30T08:08:37.507

Reputation: 1 115

2

Sidenote: The linked slide deck (a PowerShell presentation) is worth reading. Very detailed. Don't miss the comments where he explains the structure of a PDF.

– nixda – 2015-10-04T09:23:57.340

3

This has happened to me many times in Microsoft Word when trying to export a simple manuscript to PDF. A 5–8 page Word document, ~50 KB in size, will end up as a 10+ MB PDF file, which is far too large to reasonably email to someone.

Rene's answer is on the right track—the problem is that fonts get embedded into the document—but just using one of the standard typefaces won't necessarily solve the problem.

All of my documents were in Times New Roman, using nothing fancier than bold and italics. Or so I thought. It turns out that I have automatic kerning enabled in my default template (for obvious reasons). When exporting to PDF, Word was actually embedding each of those ligatures as a separate font object into the document, bloating it beyond all belief.

The fix is simple; you just have to remember to do it each time:

  1. Select all of the text in the document.
  2. Format → Font → Advanced
  3. Uncheck "Kerning for fonts"

Interestingly, you can leave ligatures, contextual alternatives, and other advanced typography features enabled; they have no perceptible effect on the size of the resulting PDF.

Re-export the document as a PDF, and it's down to a hundred or so KB. Unfortunately, the kerning is sub-par, so I wouldn't recommend printing this way, but it works fine for emailing a document.

Cody Gray

Posted 2015-09-30T08:08:37.507

Reputation: 1 856

-3

A less technical answer that may help: PDFs use vectors (i.e., mathematical equations) to describe everything you see. All the curves and lines are defined by mathematical equations, and so there will necessarily be a lot of information to hold, particularly when you have images in your documents.

The benefit of this is that you can theoretically zoom in infinitely close without losing any resolution or detail, because the lines and curves have no width, so they can scale with your zoom.

Just like how Google's recent font change reduced the size of the logo from ~14KB to ~300B, simpler fonts will likely help reduce your file size.

Ben Sandeen

Posted 2015-09-30T08:08:37.507

Reputation: 191

4

That analogy doesn't work. At all. Google's logo change was not just the font, but also from gradients to flat which makes the size difference. Furthermore, exporting a document to a large bitmap will be much larger than a font + text. The mathematical equations, as you misleadingly put it, are just integer coordinate pairs, of which there are maybe a few dozen per glyph. And since it's a font it doesn't need to be repeated for every letter. – Joey – 2015-10-02T05:56:13.617