I have a script that takes input from wget or similar and searches through it for keywords using grep. (I promise I am not trying to parse HTML with regular expressions; it is just a convenient way to emulate the content-detection behaviour we have in another, much more complex product.) This works great, as long as the HTML content isn't too severely minified. When it is, the lines can become very long (over 50 kB in some cases I've seen), and grep chokes on them.
To remedy this, I would like to be able to fold or re-indent the HTML so that it is spread out over more lines. However, in order for the script to give accurate results, I need to be able to do this without otherwise altering the content. This means it can't correct invalid or unclosed tags, and it must fold only between elements, not inside them.
These two requirements seem to rule out all of the HTML-tidying or prettifying utilities I've found.
Are there any UNIX-based shell utilities, Perl/Python/Ruby modules, or similar that can do this for me?
Alternatively, since all I need is to add some newlines in between tags, is there a way that I can semi-reliably do this myself?
How do you not fold inside the html element and still have this work? – Ignacio Vazquez-Abrams – 2013-09-05T19:26:54.483
The problem is that the only way to reliably detect what is "between elements" or even identify an element, requires you to parse it. If you had one particular document that you were working with, a solution could possibly be found using regex, but there is no general-use case for what you want. – Darth Android – 2013-09-05T19:46:57.423
Ignacio: I mean that it can't fold 'text nodes'. – kine – 2013-09-06T00:49:47.757
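For what it's worth, a "semi-reliable" version of the do-it-yourself approach might look like the sketch below. It is not a parser: it only inserts a newline where one tag's closing `>` is immediately followed by another tag's opening `<`, so text nodes are never split and nothing is corrected or re-closed. The function name `fold_between_tags` is made up for illustration, and the sketch assumes that `><` never appears inside attribute values, comments, or script/style content; where it does, a line break will land there too. The added newlines are also technically new whitespace between the tags, which should not matter for keyword matching.

```python
import re

# A '>' that is immediately followed by '<': two adjacent tags with
# no text between them. The lookahead leaves the '<' untouched.
ADJACENT_TAGS = re.compile(r">(?=<)")


def fold_between_tags(html: str) -> str:
    """Insert a newline between directly adjacent tags.

    Nothing else is changed: no tags are repaired, and text nodes
    are never broken, because a break is only added at '><'.
    """
    return ADJACENT_TAGS.sub(">\n", html)
```

Used as a filter, the script would write `fold_between_tags(sys.stdin.read())` to standard output and sit between wget and grep in the pipeline. Lines can still be long if a single text node or a single tag is long, so this only reduces the problem rather than guaranteeing a maximum line length.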