Convert .doc or .rtf to clean HTML on OS X

2

0

When I export a file from Word or TextEdit, I get very bloated HTML, with inline styles on every paragraph, so cleaning it up by hand is not practical.

The only information I want preserved is:

  • <h1>, <h2>, <h3>, <p> tags.

  • Alignment (center, left, right)

  • links, external and internal (for the table of contents)

  • <img> tags

iDontKnowBetter

Posted 2012-02-14T03:41:54.937

Reputation: 273

Word is notorious for building messy markup. Can you use a different program? Try importing the documents into Google Docs and downloading as HTML (Zipped). – Synetech – 2012-02-14T03:56:55.150

Google Docs HTML does everything with spans and CSS classes and has no newlines. – Nathan – 2012-02-14T04:33:50.283

Cannot reproduce issues with TextEdit. Can you provide a sample document that uses inline styles? – Daniel Beck – 2012-02-14T12:53:24.677

I'd also try openoffice/libreoffice. – Rich Homolka – 2012-02-14T15:31:40.270

@DanielBeck This is a simple document, written in Pages, exported as .rtf, and saved as HTML, which is what I need to be able to do. http://snipt.org/uMr6 – iDontKnowBetter – 2012-02-14T19:56:48.030

OpenOffice seems to export the cleanest HTML of the three, but still, for a very long document (200 pages), it would be a pain to clean up. There has to be a program that lets you choose exactly which tags to allow in an HTML document and strips everything else. – iDontKnowBetter – 2012-02-14T20:40:20.230

Answers

0

I once heard that the blog feature of Microsoft Word exports much better HTML than even filtered HTML under the Save As menu.

To try it, go to the Word Ribbon -> Publish -> Blog. You will need to set up a dummy account, but if the results are good enough it might be worth it.

Otherwise, since your expected output sounds so simple, you may even want to consider writing your own VBA script that walks each element in the document in order, builds an HTML string from each, and saves the result to disk.
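If VBA is not an option, the same "keep only the tags I list" idea can be applied after the fact to HTML that Word, TextEdit, or OpenOffice has already exported. The following is only a rough sketch in Python (standard library only), not a tested converter: the names `ALLOWED`, `WhitelistFilter`, and `clean_html` are made up for illustration, and it assumes alignment shows up either as an `align` attribute or as `text-align` inside an inline `style`.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag -> attributes that survive. Everything else is
# dropped, but the text inside dropped tags is kept.
ALLOWED = {
    "h1": {"id", "align"},
    "h2": {"id", "align"},
    "h3": {"id", "align"},
    "p": {"id", "align"},
    "a": {"href", "name", "id"},   # name/id keep table-of-contents anchors
    "img": {"src", "alt"},
}
ALIGNABLE = {"h1", "h2", "h3", "p"}
SKIP_CONTENT = {"style", "script", "title"}  # drop these including their text
TEXT_ALIGN = re.compile(r"text-align\s*:\s*(left|right|center)", re.I)


class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return
        kept = []
        for name, value in attrs:
            value = value or ""
            if name in ALLOWED[tag]:
                kept.append((name, value))
            elif name == "style" and tag in ALIGNABLE:
                # Reduce a bloated inline style to just its alignment, if any.
                match = TEXT_ALIGN.search(value)
                if match:
                    kept.append(("style", "text-align:" + match.group(1).lower()))
        attr_text = "".join(f' {n}="{escape(v, quote=True)}"' for n, v in kept)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED and tag != "img":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_html(source: str) -> str:
    """Return source with only whitelisted tags and attributes left in place."""
    parser = WhitelistFilter()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

For example, `clean_html('<p class="MsoNormal" style="margin:0;text-align:center"><span>Hi</span></p>')` yields `<p style="text-align:center">Hi</p>`. Lists, tables, and bold/italic are flattened to plain text by this sketch, so the whitelist would need extending if any of those matter.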

Adam

Posted 2012-02-14T03:41:54.937

Reputation: 6 454