I'm looking for a way to convert a folder full of HTML files to plain text. What I want is for the text files to be as much as possible like what I'd get if I selected all the text in a web browser, copied it, and pasted the text into a plain text file.
NO, REALLY, I WANT UNFORMATTED PLAIN TEXT. All of the solutions that I'm finding produce Markdown or something that looks like it, or try to preserve layout, or use asterisks and underscores to indicate text formatting, or preserve the content of scripts in the output file, or do some clever goddam thing.
All I want is the words written by the author in the order that the author wrote them. I don't even care if the processing converts all of the list items in a list into a single paragraph, or even collapses the entire document into a single paragraph. Any of this is much better than giving me anything at all other than the actual language contained in the document.
I'd love a terminal application or Python script, but I'll take anything I can get.
Tip: remove everything between < and >. I don't know sed, but I'm pretty sure it could do it. – gronostaj – 2016-02-19T23:18:14.517
Yup, sed can do it, and so can a host of other utilities. This is a basic scrape for content, I think, but you're not saying whether you want the header information: there are tags that don't show in the body, including JavaScript and such not in tags. Can you clarify that what you want is just the text content of a page? – Ele Munjeli – 2016-02-19T23:36:31.557
=D https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Abraxas – 2016-02-20T01:17:24.827
@Ele Munjeli Yep, just the text content. (= – patrick-mooney – 2016-02-20T01:37:15.120
@gronostaj That gets me closer, but isn't perfect: some tags (<p>, <br>) are whitespace and really should be converted into space characters, because they separate actual words (as in "Here are some lines<br>in a quote"). OTOH, some tags (like <script> for inline scripts) are or can be containers for things that don't count as "plain text." – patrick-mooney – 2016-02-20T01:39:42.420
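For what it's worth, the two requirements in that last comment (tags that separate words become spaces, and containers like <script> are dropped) can be sketched with nothing but Python's standard-library `html.parser`. This is a rough sketch, not a tested solution: the names `TextOnly`, `html_to_text`, and `convert_folder` are made up for illustration, the `SKIP_TAGS` and `INLINE_TAGS` sets are guesses about which tags matter, the input is assumed to be UTF-8, and the whole document is deliberately collapsed into a single paragraph.

```python
import sys
from html.parser import HTMLParser
from pathlib import Path

# Assumption: containers whose contents a reader never sees as page text.
SKIP_TAGS = {"script", "style", "title"}
# Assumption: inline tags that can sit inside a word; they add no whitespace.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span", "code", "sub", "sup",
               "u", "small", "abbr", "cite", "q", "mark", "s", "font", "tt"}


class TextOnly(HTMLParser):
    """Collect text; any non-inline tag boundary becomes a space."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &amp; etc. become characters
        self.chunks = []
        self.skip_depth = 0  # > 0 while inside a SKIP_TAGS container

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        if tag not in INLINE_TAGS:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        if tag not in INLINE_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the words of an HTML string as one whitespace-normalized line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return " ".join("".join(parser.chunks).split())


def convert_folder(src, dst):
    """Write a .txt file into dst for every .html/.htm file directly in src."""
    dst = Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src).iterdir()):
        if path.suffix.lower() in {".html", ".htm"}:
            html = path.read_text(encoding="utf-8", errors="replace")
            out = dst / (path.stem + ".txt")
            out.write_text(html_to_text(html) + "\n", encoding="utf-8")


# Usage (hypothetical script name): python html2txt.py input_dir output_dir
if __name__ == "__main__" and len(sys.argv) == 3:
    convert_folder(sys.argv[1], sys.argv[2])
```

Unlike stripping everything between < and > with a regex, this keeps "lines<br>in" as two words and never emits script or style contents. It won't exactly match a browser's copy and paste (it ignores CSS-hidden elements, for instance), so adjust the two tag sets to taste.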