How to "normalize" an html file?

Question

I have an HTML file whose content was collected from various websites, so it could contain security problems such as:

<script>alert("XSS alert!")</script>

Q: What can we do to "clear" this HTML file? Run a:

sed '/<script/d'

several times? (the "<script" could be hidden in HTML, like: "<script")

you may find this useful. http://security.stackexchange.com/q/32616/21234 — Shurmajee, Sep 23 '13 at 11:09
Are there any potentially "good" scripts that need to run on the page? Also, it's not always in a script tag (think onload, onmouseover, etc) — I'm A Person, Sep 23 '13 at 13:21
I highly recommend looking into [Content Security Policy](http://security.stackexchange.com/questions/38001/xss-prevention-through-content-security-policy) as a defense in depth approach to solving XSS issues. — Polynomial, Sep 23 '13 at 13:36
there are no "good" scripts that need to be run on the page! — gasko peter, Sep 23 '13 at 15:14

score 3 · Answer 1 · answered Sep 23 '13 at 13:50

Since the HTML was assembled from heterogeneous sources, chances are that you're not interested in any scripts it may contain. So your approach looks viable to me.

I believe the most practical and maintainable approach is to make a list of allowed HTML tags and attributes. You most likely don't want CSS at all (rather than merely not needing it), so the id, style and class attributes can be discarded altogether. Tags like IFRAME and attributes such as onfocus would very likely not be on the list at all.

Once you have this, you can run your HTML through a parser that will drop any unknown tags and attributes. You may want to do this in two stages, a "tagsoup" cleaner (HTML Tidy or TagSoup or HTMLClean or BeautifulSoup - I'd try HTML Tidy first) and a full-fledged HTML parser.

This way you'd end up with well-formed, defanged HTML that can then be manipulated more easily and more safely.
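As a rough illustration of the allowlist idea, here is a minimal sketch using only Python's standard html.parser module. The allowlist contents, the URL-scheme check, and the names ALLOWED, AllowlistSanitizer and sanitize are assumptions made for this example, not part of any library; treat it as a starting point rather than a complete sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist: tag name -> set of permitted attribute names.
ALLOWED = {
    "p": set(), "br": set(), "b": set(), "i": set(), "em": set(),
    "strong": set(), "ul": set(), "ol": set(), "li": set(),
    "a": {"href"},
}
VOID_TAGS = {"br"}                  # allowed tags that never get a closing tag
DROP_CONTENT = {"script", "style"}  # drop these tags together with their content
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")  # assumed safe schemes


class AllowlistSanitizer(HTMLParser):
    """Re-emits only allowlisted tags and attributes; text is re-escaped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # unknown tag: drop the tag, keep its text content
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue  # drops onclick, style, id, class, ...
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue  # drops javascript: and other unlisted schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


dirty = ('<p onclick="evil()">Hi <script>alert("XSS alert!")</script>'
         '<a href="javascript:alert(1)" style="x">link</a></p>')
print(sanitize(dirty))  # <p>Hi <a>link</a></p>
```

Because the output is rebuilt from parser events instead of being edited in place, anything the allowlist does not name never reaches the result. Comments and doctype declarations are dropped as well, since the sketch defines no handlers for them.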

I'd avoid regexp matching unless I really had no other alternative.

I agree with throwing a html parser at it. That way its output won't be really weird html that might be parsed incorrectly by some browsers. — CodesInChaos, Sep 23 '13 at 14:14

score 1 · Answer 2 · answered Sep 23 '13 at 19:23

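The problem with sed-style stripping is malicious input like <scr<script>ipt>. A filter that removes the substring <script> turns it into a working <script> tag, and the line-deleting sed '/<script/d' from the question misses anything not spelled exactly that way, such as <SCRIPT> or an onerror attribute. If you just try to sanitize the bad stuff, the bad guys will find a way through it. It is much safer to refuse the input entirely once you recognize something bad in it.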
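The failure mode is easy to reproduce. The sketch below uses Python rather than sed; the payload and the two helper functions (strip_script_once and delete_script_lines) are hypothetical, written only to mimic a substring-removing filter and the line-deleting sed '/<script/d' from the question.

```python
import re

# Hypothetical payload: a script tag split around a decoy script tag.
payload = "<scr<script>ipt>alert(1)</scr</script>ipt>"


def strip_script_once(text):
    """Naive blocklist filter: remove literal <script> / </script> in one pass."""
    return re.sub(r"</?script>", "", text, flags=re.IGNORECASE)


def delete_script_lines(text):
    """Rough analogue of sed '/<script/d': drop every line containing '<script'."""
    return "\n".join(line for line in text.splitlines() if "<script" not in line)


# Removing the decoy tags reassembles a working script element.
print(strip_script_once(payload))  # <script>alert(1)</script>

# The line-deleting variant never sees event-handler attributes.
handler = '<img src=x onerror="alert(1)">'
print(delete_script_lines(handler) == handler)  # True
```

Running the filter repeatedly until the output stops changing handles the first case, but not the second, which is why a blocklist keeps needing new rules.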

Even safer is a whitelist.
