HTML to UNFORMATTED plain text?

6

1

I'm looking for a way to convert a folder full of HTML files to plain text. What I want is for the text files to be as much as possible like what I'd get if I selected all the text in a web browser, copied it, and pasted the text into a plain text file.

NO, REALLY, I WANT UNFORMATTED PLAIN TEXT. All of the solutions that I'm finding produce Markdown or something that looks like it, or try to preserve layout, or use asterisks and underscores to indicate text formatting, or preserve the content of scripts in the output file, or do some clever goddam thing.

All I want is the words written by the author in the order that the author wrote them. I don't even care if the processing converts all of the list items in a list into a single paragraph, or even collapses the entire document into a single paragraph. Any of this is much better than giving me anything at all other than the actual language contained in the document.

I'd love a terminal application or Python script, but I'll take anything I can get.

patrick-mooney

Posted 2016-02-19T23:12:27.010

Reputation: 163

Tip: remove everything between < and >. I don't know sed, but I'm pretty sure it could do it. – gronostaj – 2016-02-19T23:18:14.517

Yup, sed can do it, and so can a host of other utilities. This is a basic scrape for content, I think, but you're not saying whether you want the header information: there are tags that don't show in the body, including JavaScript and other content that isn't in tags. Can you clarify that what you want is just the text content of a page? – Ele Munjeli – 2016-02-19T23:36:31.557

@Ele Munjeli Yep, just the text content. (= – patrick-mooney – 2016-02-20T01:37:15.120

@gronostaj That gets me closer, but isn't perfect: some tags (<p>, <br>) are whitespace and really should be converted into space characters, because they separate actual words (as in "Here are some lines<br>in a quote"). OTOH, some tags (like <script> for inline scripts) are or can be containers for things that don't count as "plain text." – patrick-mooney – 2016-02-20T01:39:42.420

Answers

3

html2text is a command-line utility that converts a page of HTML into plain text. It is in the repositories of many Linux distributions. (A separate Python script with the same name converts HTML into Markdown-structured text; the -style option shown here belongs to the packaged command-line utility, not to that script.) It can be run from the command line like this:

html2text -style pretty input.html  

This command not only converts the original html file to text, but it also does a pretty good job of making the plain text output easy to read. The headings look like headings, the lists look like lists, etc.

karel

Reputation: 11 374

Thought I was pretty clear about really not wanting any formatting characters at all in the output, including those generated by Markdown. =( – patrick-mooney – 2016-02-21T06:27:46.127

All the formatting of the plain text output is done automatically by html2text by very clever use of the space character (which does not count as formatting because the space character is not a special character). There are no markdown asterisks or underline characters or any garbage like that. Also if you don't like the pretty style, you can use the -style compact option instead and get rid of the indentations made with the space character too. – karel – 2016-02-21T06:41:04.180

4

Use w3m -dump <page.html>.

It will give you the text representation of the html file.

From the man page:

-dump  dump formatted page into stdout

Although it says "formatted", the output is just plain text.
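Since the question is about a whole folder of files, here is a minimal Python sketch that runs w3m -dump on every .html file in a folder and writes a .txt file next to each one. It assumes w3m is installed and on the PATH; the function names dump_command and convert_folder are made up for this example, and the output encoding (UTF-8) is an assumption.

```python
import subprocess
from pathlib import Path


def dump_command(html_path):
    """Build the w3m command line for one file (only the -dump flag is used)."""
    return ["w3m", "-dump", str(html_path)]


def convert_folder(folder):
    """Write page.txt next to every page.html in folder; return the new paths."""
    written = []
    for html_path in sorted(Path(folder).glob("*.html")):
        # Assumption: w3m is installed. check=True raises if w3m fails.
        result = subprocess.run(
            dump_command(html_path),
            capture_output=True,
            text=True,
            check=True,
        )
        txt_path = html_path.with_suffix(".txt")
        txt_path.write_text(result.stdout, encoding="utf-8")
        written.append(txt_path)
    return written


if __name__ == "__main__":
    import sys

    # Usage: python dump_folder.py /path/to/folder
    for path in convert_folder(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(path)
```

Note that w3m still wraps lines and indents lists with spaces, so the result is laid out rather than collapsed into one paragraph.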

NZD

Reputation: 2 142

lynx also supports -dump. – TOOGAM – 2016-02-20T05:36:02.130

Yes, and the very same is achievable with the good old lynx like this: lynx -dump -nolist -nomargins – Gombai Sándor – 2016-02-20T05:36:41.353

0

Unix.com: How to remove only HTML tags in a file provides:
sed -n '/^$/!{s/<[^>]*>//g;p;}' filename
or html2text

CommandLineFu: Remove all HTML tags shows another sed line, or awk.

I believe this is a somewhat common operation provided by multiple programs, and that the most common name for this task is to "strip" the HTML. A quick Google Search for: Linux strip html tags shows multiple solutions, including PHP: strip tags.
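A plain sed tag strip runs into the two problems raised in the comments on the question: <p> and <br> should become spaces, and the contents of <script> should disappear entirely. Below is a rough Python sketch of a stripper that handles both, using only the standard library's html.parser. The names TextOnly and html_to_text, and the choice of which tags count as inline, are assumptions made for this example, not an established tool.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect the words of a page, dropping tags and script/style contents."""

    SKIP = {"script", "style"}  # containers whose contents are not prose
    # Assumed list of inline tags that must not split a word in two.
    INLINE = {"a", "abbr", "b", "code", "em", "i", "small", "span",
              "strong", "sub", "sup", "u"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &amp; becomes &
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        if tag not in self.INLINE:
            self.parts.append(" ")  # <p>, <br>, <li>, ... separate words

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        if tag not in self.INLINE:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string as one whitespace-collapsed line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse every run of whitespace, including newlines, to one space.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    import sys

    # Usage: python strip_html.py < page.html > page.txt
    print(html_to_text(sys.stdin.read()))
```

This collapses the entire document into a single paragraph, which the question says is acceptable. One known gap: text inside <title> is kept, even though a browser's select-all would not include it.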

TOOGAM

Reputation: 12 651