Utility to non-destructively fold or re-indent HTML

I have a script that takes input from wget or similar and searches through it for key words using grep. (I promise i am not trying to parse HTML with regular expressions, it is just a convenient way to emulate the content-detection behaviour we have in another much more complex product.) This works great, as long as the HTML content isn't too severely minified. When it is, the lines can become very long (over 50 kB in some cases i've seen), and grep chokes on them.

To remedy this, i would like to be able to fold or re-indent the HTML so that it is spread out over more lines. However, in order for the script to give accurate results, i need to be able to do this without otherwise altering the content. This means it can't correct invalid or unclosed tags, and it must fold only between elements, not inside them.

These two requirements seem to rule out all of the HTML-tidying or prettifying utilities i've found.

Are there any UNIX-based shell utilities, perl/python/ruby modules, or similar that can do this for me?

Alternatively, since all i need is to add some new lines in between tags, is there a way that i can semi-reliably do this myself?

kine

Posted 2013-09-05T19:21:09.783

Reputation: 1 669

How do you not fold inside the html element and still have this work? – Ignacio Vazquez-Abrams – 2013-09-05T19:26:54.483

The problem is that the only way to reliably detect what is "between elements", or even to identify an element, is to parse the document. If you had one particular document that you were working with, a solution could possibly be found using regex, but there is no general-purpose solution for what you want. – Darth Android – 2013-09-05T19:46:57.423

Ignacio: I mean that it can't fold 'text nodes'. – kine – 2013-09-06T00:49:47.757

Answers

Ok, for anyone else in need of this, I'm recording the suggestions made in this awesome thread (in case that link goes down, as per StackExchange guidelines):

HTB 2.0 - DOS based - http://www.digital-mines.com/htb/
Tabifier - supports CSS, HTML and C style syntax (including Javascript) - http://tools.arantius.com/tabifier
HTML-Kit - a full-featured free HTML editor running on Windows. You need to configure the Tidy options [Tools /Check code using Tidy /Add new config]: uncheck all switches except "Output only the body content" and "Convert non-breaking space to entities", then go to Actions /Tools /HTML Tidy /Indent Tags or beautify - http://www.chami.com/html-kit/
SCREEM - only for Linux -
NetBeans - "After opening an html file with NetBeans, click Source then select Format. That's it." -
WebmasterGate's HTML / XHTML Beautifier - Online tool - http://www.webmastergate.com/html-beautifier/
Aptana Studio (Version 2.0.4) - "Select Edit > Format or press Ctrl-Shift F to format the html code. The format function can be configured from Windows > Preferences, then select Aptana > Editors > HTML > Formatting, click Edit to add tags which should not take a new line then save it as a new preference." -
UniversalIndentGUI - Uses HTB Beautifier internally - While running Notepad++, go to Plugins > Plugin Manager > Show Plugin Manager, select UniversalIndentGUI from the available list to install it.
tidy with these options:


[HTML, XHTML, XML Options]
anchor-as-name:no
doctype:omit
drop-empty-paras:no
fix-backslash:no
fix-bad-comments:no
fix-uri:no
input-xml:yes
join-styles:no
lower-literals:no
preserve-entities:yes
quote-ampersand:no
quote-nbsp:no

[Diagnostics Options]
show-warnings:no

[Pretty Print Options]
indent:yes
indent-spaces:3
tab-size:3

[Miscellaneous Options]
quiet:yes

I have yet to try out these options (the input-xml: yes and force-output: yes config suggestions for HTML Tidy mentioned at https://stackoverflow.com/questions/7151180/use-html-tidy-to-just-indent-html-code work for my immediate purpose); I will update this answer if I do.
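The bracketed headings in the listing above look like the grouping of an options dialog rather than tidy's own syntax. As far as I know, a tidy configuration file is simply a list of "name: value" lines. A minimal sketch, assuming that format, which renders the options from the listing into such text; the helper name render_tidy_config and the file name tidy.conf are hypothetical:

```python
from typing import Mapping

# Option names and values copied from the listing above.
TIDY_OPTIONS = {
    "anchor-as-name": "no",
    "doctype": "omit",
    "drop-empty-paras": "no",
    "fix-backslash": "no",
    "fix-bad-comments": "no",
    "fix-uri": "no",
    "input-xml": "yes",
    "join-styles": "no",
    "lower-literals": "no",
    "preserve-entities": "yes",
    "quote-ampersand": "no",
    "quote-nbsp": "no",
    "show-warnings": "no",
    "indent": "yes",
    "indent-spaces": "3",
    "tab-size": "3",
    "quiet": "yes",
}


def render_tidy_config(options: Mapping[str, str]) -> str:
    """Render options as 'name: value' lines, one option per line.

    Assumption: this is the plain-text layout tidy reads via -config.
    """
    return "".join(f"{name}: {value}\n" for name, value in options.items())


def write_tidy_config(path: str, options: Mapping[str, str] = TIDY_OPTIONS) -> None:
    """Write the rendered options to a file (the path is the caller's choice)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_tidy_config(options))
```

After calling write_tidy_config("tidy.conf"), something like tidy -config tidy.conf < input.html should pick the options up. Diff the result against the input before trusting it, since tidy may still rewrite character entities.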

sundar - Reinstate Monica

Posted 2013-09-05T19:21:09.783

Reputation: 1 289

Another option is to use pup without arguments:

pup

With xmllint, --html selects the HTML parser and --format reformats the input. The dash for STDIN cannot be omitted.

xmllint --format --html -

XmlStarlet also supports using an HTML parser. fo is short for format. See xml fo -h for help.

xml fo --html

The main implementation of tidy does not support HTML5, but tidy-html5 does. On OS X, brew install tidy-html5 installs tidy-html5 as /usr/local/bin/tidy.

nisetama

Posted 2013-09-05T19:21:09.783

Reputation: 651

Run the file through HTML Tidy.

For example:

curl http://superuser.com | tidy -i | less

-i tells tidy to indent element content in its output.

Der Hochstapler

Posted 2013-09-05T19:21:09.783

Reputation: 77 228

The first paragraph of man tidy says: "For HTML variants, it detects and corrects many common coding errors." This makes it destructive to the original content. – kine – 2013-09-06T00:51:18.670

@kine: Oh, well, if the first paragraph of the man page says that, then I wouldn't even try it either. – Der Hochstapler – 2013-09-06T09:08:04.747

@kine I found the answer here: http://stackoverflow.com/questions/7151180/use-html-tidy-to-just-indent-html-code especially the second comment on the answer. Running it with that config (along with input-xml yes and force-output yes) indents it mostly non-destructively. "Mostly", because it still changes HTML character entities; I guess you have to hunt down and change that option too if that's a problem for you. – sundar - Reinstate Monica – 2013-09-18T15:48:27.363

The simplest way to do this without parsing/fixing the document is to look for a closing tag, followed by an open-angle-bracket or whitespace, and insert a newline. Search for:

(</[^>]+>)(<|\s)

and replace with

$1\n$2

You will still need to manually check over each output document and verify that it didn't break anything, but this should work for most cases. It won't be pretty output, but it should kill 50KB lines.
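For anyone who wants to script this, here is a minimal Python sketch of the same substitution, plus a check that nothing but newlines was added. One change from the pattern above, which is mine: the second capture group becomes a lookahead, so that back-to-back closing tags such as </b></a> are each matched rather than every other one. The function names are illustrative.

```python
import re

# A closing tag that is followed by "<" or whitespace. The lookahead leaves the
# following character unconsumed, so adjacent closing tags all match.
CLOSING_TAG = re.compile(r"(</[^>]+>)(?=<|\s)")


def fold_html(text: str) -> str:
    """Insert a newline after each closing tag followed by '<' or whitespace."""
    return CLOSING_TAG.sub(r"\1\n", text)


def only_newlines_added(original: str, folded: str) -> bool:
    """Return True if the two strings are identical once newlines are removed."""
    return original.replace("\n", "") == folded.replace("\n", "")
```

Text nodes are left alone unless whitespace directly follows a closing tag, in which case the break goes before that whitespace. A ">" inside an attribute value or a "</" inside a script block can still confuse the pattern, so running only_newlines_added on each document is a cheap sanity check, not a guarantee of correct HTML.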

Darth Android

Posted 2013-09-05T19:21:09.783

Reputation: 35 133

That's a possibility. It might be the one i have to go with. – kine – 2013-09-06T00:52:01.297