User-defined HTML
You're attempting to sanitize user input by blacklisting things you don't want to allow. Unfortunately, especially given the very large list of options for HTML5, it's quite easy to miss something. Missing something will result in a potentially dangerous XSS vulnerability, which you really don't want. To pick some random examples off the top of my head:
- Do you know that SVG tags can execute scripts in some (limited) circumstances?
- In your list of on event attributes you banned, did you make sure and get the onbounce event attribute of the obsolete (but still available) marquee tag?
- Are you going to make sure and keep track of any changes to the HTML spec that might roll out over the next X years, just in case one might add a dangerous tag/event?
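To make the second bullet concrete, here is a minimal Python sketch of how a blacklist fails. The list of banned attributes and the blacklist_filter function are hypothetical, invented for illustration; they are not taken from any real sanitizer. The filter removes every attribute its author remembered and silently passes the one they forgot.

```python
import re

# Hypothetical blacklist: the event attributes a developer remembered to ban.
BANNED_ATTRIBUTES = ["onclick", "onerror", "onload", "onmouseover", "onfocus"]


def blacklist_filter(html: str) -> str:
    """Strip banned event attributes; everything else passes through untouched."""
    for name in BANNED_ATTRIBUTES:
        # Match  name="..."  or  name='...'  preceded by whitespace, case-insensitively.
        pattern = re.compile(
            r"\s" + re.escape(name) + r"""\s*=\s*("[^"]*"|'[^']*')""",
            re.IGNORECASE,
        )
        html = pattern.sub("", html)
    return html


# The remembered attribute is removed...
caught = blacklist_filter('<img src="x" onerror="alert(1)">')
# ...but the forgotten one survives intact.
missed = blacklist_filter('<marquee onbounce="alert(1)">hi</marquee>')

print(caught)
print(missed)
```

The filter is only as good as its author's memory of every dangerous attribute, which is the core weakness of the approach.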
There can be so many gotchas that securing input with a blacklist is nearly impossible. Moreover, you're approaching the problem from the wrong perspective. In security, the approach you want to start with is the Principle of Least Privilege. Rather than asking, "What should I stop my users from doing?", it's much safer to ask, "What should I allow them to do?". Therefore you need a two-step process:
1. Robust Parser. You need a very robust parser. This is surprisingly difficult because browsers are so very forgiving when it comes to parsing HTML. If an attacker gives you bad HTML and your parser stops trying and says, "There's no HTML here - you're safe!", but the browser takes the same input, takes some guesses at what the HTML was supposed to be, and ends up with something malicious, you have an XSS vulnerability. This is even harder than it sounds because different browsers can apply different "corrections" when processing HTML input, which further complicates the already difficult task of parsing HTML reliably.
As an example of browsers playing fast and loose with HTML, you can save this HTML to a file and load it in your browser:
<table><img src="1" <table onerror="alert(1)"
<p>hi</p></table>
If you load it up with Chrome and inspect element on the page, you'll see that the browser actually rendered this (probably depends on your version though):
<img src="1" <table="" onerror="alert(1)" <p="">hi<p></p><table></table>
That's an image tag with an active XSS payload (which is mildly obfuscated by the fact that a table tag and the p turned into meaningless attributes), a literal string hi, an empty p tag, and an empty table tag. The end result is quite different than the input. I haven't tried very hard to hide the payload, but would your parser have understood it the same way? Perhaps your parser would have tried to ignore the img tag inside of the table tag since anything not in a td tag would technically be illegal. Maybe your parser would have been confused by the <table inside the img tag and ignored the onerror since a table tag doesn't technically have events. As it turns out though, none of that stopped the browser from executing my JavaScript payload. Would your parser have caught it?
2. Whitelist allowed tags and attributes. Once you've parsed the user's HTML, you don't want to compare with a blacklist and remove disallowed tags/attributes. Instead you want to compare to a whitelist and remove anything that you haven't specifically vetted and approved as safe. This puts your security on much stronger footing and - let's be honest - do you really want your users to use the marquee tag anyway?
Most importantly though, building a robust HTML parser is surprisingly difficult. If you try to do it yourself you'll spend a lot of time and probably get a lot wrong. Under normal circumstances, you'll be much better off just finding a well-supported third-party library to use.
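To illustrate the whitelist step only (not as a replacement for a vetted library), here is a rough Python sketch built on the standard library's html.parser. The names ALLOWED_TAGS, ALLOWED_ATTRS, ALLOWED_SCHEMES, WhitelistSanitizer, and sanitize are all hypothetical, and the allowed set is an arbitrary assumption. Note that html.parser makes no promise to interpret malformed input the way a browser does, which is exactly the step 1 problem, so treat this as a demonstration of the filtering logic and nothing more.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: only what has been explicitly vetted and approved.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = ("http://", "https://")


class WhitelistSanitizer(HTMLParser):
    """Re-serializes input, keeping only whitelisted tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # anything not on the list is dropped, no questions asked
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            # Reject javascript:, data:, and any other unvetted URL scheme.
            if name == "href" and not value.strip().lower().startswith(ALLOWED_SCHEMES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always escaped, so it can never become markup again.
        self.out.append(escape(data))


def sanitize(html: str) -> str:
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p>hi</p><marquee onbounce="alert(1)">x</marquee>'))
print(sanitize('<a href="javascript:alert(1)" onclick="x()">link</a>'))
print(sanitize('<a href="https://example.com">ok</a>'))
```

The design choice worth noticing is that the output is rebuilt from scratch out of approved pieces rather than produced by deleting bad pieces from the input, so a tag or attribute the author never heard of is excluded by default. The sketch does not balance tags or handle every edge case a maintained library would.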
Alternate Suggestion
A different tactic I would normally recommend would be to not let users use HTML at all. Rather, allow a more limited language (like the Markdown used for writing questions and answers on Stack Overflow). The more limited language specification makes writing a parser much easier and less error-prone, and the process of converting Markdown to HTML makes it easier to guarantee "safeness" (note: easier, not guaranteed - Markdown to HTML converters still suffer from occasional XSS vulnerabilities). There is a slight disadvantage that it limits the sort of formatting options users have (although I don't really consider that a disadvantage under most circumstances), but you also have a nice advantage that Markdown parsers and HTML converters are available in a wide variety of languages. A more limited option like this is typically a good trade-off between usability and security. You can even add in a WYSIWYG editor that builds the Markdown for users.
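As a rough illustration of why a limited language is easier to make safe, here is a small Python sketch. It is not real Markdown: render_limited_markup is a hypothetical function that supports only **bold**, *italic*, and blank-line paragraph breaks, chosen as an assumption for the example. The key idea is the order of operations: escape everything the user typed first, then generate the few HTML tags you allow yourself.

```python
import re
from html import escape


def render_limited_markup(text: str) -> str:
    """Escape all user input, then convert a tiny, fixed set of markers to HTML."""
    # Step 1: neutralize every <, >, &, and quote the user typed.
    safe = escape(text, quote=True)
    # Step 2: the only tags in the output are the ones this function emits.
    safe = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", safe)
    safe = re.sub(r"\*(.+?)\*", r"<em>\1</em>", safe)
    # Step 3: blank lines separate paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", safe) if p.strip()]
    return "\n".join(f"<p>{p}</p>" for p in paragraphs)


print(render_limited_markup("**hi** <script>alert(1)</script>"))
print(render_limited_markup("*a*\n\n*b*"))
```

A real project would more likely use a maintained Markdown library than a hand-rolled converter like this one, and as noted above, even those have had XSS bugs, so keeping the library up to date still matters.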