
I am creating a website that allows people to upload HTML content.

Currently these are the tags that are banned:

<script></script>
<iframe>
<object>
<embed>
<style></style>
All on= attributes (I'm not going to list them all; there are around 70 of them)

I don't want the user using any JavaScript or placing any code that will affect other tags on the page, so no style tags either.

Are there any other tags that I need to be mindful of when enabling users to generate HTML without impacting other things on the page?

Jevon
  • There are libraries that can sanitize html for you. Use a server side one but which one depends on what framework you are using. There is no reason to have to reinvent the wheel and even if you really want to you could look at their implementation(s) and see what they do. – Igor Oct 30 '19 at 14:59
  • Use a library. You'd be surprised how many ways there are around tag sanitizing... In general, I'd recommend against it entirely if possible, but if you go that route then a library with enough testing is your best bet. – Jarrod Christman Oct 30 '19 at 15:01
  • Don't sanitize by banning tags! Rather, sanitize by allowing only a specific set of tags. https://cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html – Ghedipunk Oct 30 '19 at 17:14
  • Thanks, @Ghedipunk - I was going to ask what happens in a year or two when they introduce some new `` tag – A C Oct 31 '19 at 04:38
  • As many people have said, there are libraries that sanitize HTML and I would not try to do this alone... `<script>` and so on... I personally would never use user-generated code on a website. Think of MySpace pages back in the day. – Michael Mano Oct 31 '19 at 07:25
  • Possible duplicate of [Input validation: how to do it if I must accept HTML as part of my input?](https://security.stackexchange.com/questions/98197/input-validation-how-to-do-it-if-i-must-accept-html-as-part-of-my-input) – sleske Oct 31 '19 at 08:11
  • What about the next HTML version that adds another on= attribute that isn't in your list yet? – user253751 Oct 31 '19 at 10:15
  • Whitelist instead of blacklist, and I would recommend having checking code on _both_ the front end and on the backend - the front-end makes it more responsive for users, as checking can be done in real time. The back end checks keep you safe. – Baldrickk Oct 31 '19 at 11:20
  • Why not just whitelist specific tags? – dalearn Oct 31 '19 at 17:01
  • Don't implement a blacklist yourself; use a whitelist-based existing solution like jsoup. Also remember the context problem in escaping. And instead of removing critical content, better to reject the whole string „pt>“ – eckes Oct 31 '19 at 20:36
  • Can you please clarify "`All on= attributes`"? – curiousguy Nov 01 '19 at 22:42
  • I appreciate all the great answers; a lot of them are useful and I don't know which one to accept. – Jevon Nov 16 '19 at 23:30

7 Answers


User-defined HTML

You're attempting to sanitize user input by blacklisting things you don't want to allow. Unfortunately, especially given the very large list of options for HTML5, it's quite easy to miss something. Missing something will result in a potentially dangerous XSS vulnerability, which you really don't want. To pick some random examples off the top of my head:

  1. Do you know that SVG tags can execute scripts in some (limited) circumstances?
  2. In your list of banned on event attributes, did you make sure to include the onbounce event attribute of the obsolete (but still available) marquee tag?
  3. Are you going to keep track of any changes to the HTML spec that might roll out over the next X years, just in case one adds a dangerous tag/event?

There can be so many gotchas that securing input with a blacklist is nearly impossible. Moreover, you're approaching the problem from the wrong perspective. In security the approach you want to start with is the Principle of Least Privilege. Rather than asking, "what should I stop my users from doing", it's much safer to ask, "What should I allow them to do?". Therefore you need a two step process:

1. Robust Parser. You need a very robust parser. This is surprisingly difficult because browsers are so very forgiving when it comes to parsing HTML. If an attacker gives you bad HTML and your parser stops trying and says, "There's no HTML here - you're safe!", but the browser takes the same input, takes some guesses at what the HTML was supposed to be, and ends up with something malicious, you have an XSS vulnerability. This is even more difficult than it sounds because different browsers can apply different "corrections" when processing HTML input, making the already difficult task of parsing HTML reliably even more difficult.

As an example of browsers playing fast and loose with HTML, you can save this HTML to a file and load it in your browser:

<table><img src="1" <table onerror="alert(1)"
<p>hi</p></table>

If you load it up with Chrome and inspect element on the page, you'll see that the browser actually rendered this (probably depends on your version though):

<img src="1" <table="" onerror="alert(1)" <p="">hi<p></p><table></table>

That's an image tag with an active XSS payload (which is mildly obfuscated by the fact that a table tag and the p turned into meaningless attributes), a literal string hi, an empty p tag, and an empty table tag. The end result is quite different than the input. I haven't tried very hard to hide the payload, but would your parser have understood it the same way? Perhaps your parser would have tried to ignore the img tag inside of the table tag since anything not in a td tag would technically be illegal. Maybe your parser would have been confused by the <table inside the img tag and ignored the onerror since a table tag doesn't technically have events. As it turns out though, none of that stopped the browser from executing my javascript payload. Would your parser have caught it?

2. Whitelist allowed tags and attributes. Once you've parsed the user's HTML, you don't want to compare with a blacklist and remove disallowed tags/attributes. Instead you want to compare to a whitelist and remove anything that you haven't specifically vetted and approved as safe. This puts your security on much stronger footing and - let's be honest - do you really want your users to use the marquee tag anyway?

Most importantly though, building a robust HTML parser is surprisingly difficult. If you try to do it yourself you'll spend a lot of time and probably get a lot wrong. Under normal circumstances, you'll be much better off just finding a well supported third party library to use.
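For example, here is a minimal sketch of the parse-then-whitelist approach using DOMPurify with jsdom on the server; the choice of library and the tag/attribute lists are assumptions you would adapt to your own needs:

```typescript
// Sketch: server-side sanitization with DOMPurify + jsdom (Node).
// Assumes `npm install dompurify jsdom`; the whitelist below is illustrative only.
import createDOMPurify from "dompurify";
import { JSDOM } from "jsdom";

// DOMPurify needs a DOM implementation; on the server, jsdom provides one.
// The cast is only there because jsdom's window type differs from the DOM lib typings.
const DOMPurify = createDOMPurify(new JSDOM("").window as any);

const dirty = `<table><img src="1" <table onerror="alert(1)"
<p>hi</p></table>`;

// Whitelist-based cleanup: anything not in ALLOWED_TAGS / ALLOWED_ATTR is dropped.
const clean = DOMPurify.sanitize(dirty, {
  ALLOWED_TAGS: ["p", "b", "i", "em", "strong", "a", "ul", "ol", "li", "img"],
  ALLOWED_ATTR: ["href", "src", "alt", "title"],
});

console.log(clean); // only whitelisted markup survives
```

Because the library both parses the input the way a browser would and drops everything outside the whitelist, the mangled table/img markup from the earlier example comes out with the onerror handler stripped.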

Alternate Suggestion

A different tactic I would normally recommend would be to not let users use HTML at all. Rather, allow a more limited language (like the markdown used for writing questions and answers on Stack Overflow). The more limited language specification makes writing a parser much easier and less error-prone, and the process of converting markdown to HTML makes it easier to guarantee "safeness" (note: easier, not guaranteed - markdown to HTML converters still suffer from occasional XSS vulnerabilities). There is a slight disadvantage that it limits the sort of formatting options users have (although I don't really consider that a disadvantage under most circumstances), but you also have a nice advantage that markdown parsers and HTML converters are available in a wide variety of languages. A more limited option like this is typically a good trade-off between usability and security. You can even add in a WYSIWYG editor that builds the markdown for users.
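A rough sketch of that route using markdown-it (one of many markdown renderers; the option values shown are assumptions to verify for your own setup):

```typescript
// Sketch: accept Markdown instead of HTML and render it server-side.
// Assumes `npm install markdown-it`.
import MarkdownIt from "markdown-it";

// html: false tells markdown-it to escape raw HTML in the source rather than
// pass it through, so <script> and friends come out as harmless text.
const md = new MarkdownIt({ html: false, linkify: true });

const source = "**bold**, a [link](https://example.com), and <script>alert(1)</script>";
const rendered = md.render(source);
// roughly: <p><strong>bold</strong>, a <a href="https://example.com">link</a>,
//          and &lt;script&gt;alert(1)&lt;/script&gt;</p>
```

Even then, it is common to run the rendered HTML through a sanitizer as a second layer, since markdown renderers have had their own XSS issues (as the comments below point out).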

Conor Mancone
  • Although Stack allows _both_ markdown and restricted HTML (plus MathJax on _some_) and silently removes other HTML (except in 'code'). This results in maybe 1 in 20 Qs in my experience becoming nonsensical or wrong because the poster tried to use angle brackets for variables (formally, nonterminals), and they disappear from the post. PS: 'tack' (direction or path, as of a boat) not 'tact' (consideration of people's sensitivities or propriety/custom). – dave_thompson_085 Oct 31 '19 at 03:57
  • @dave_thompson_085 Interesting. I've never tried anything tricky here so haven't given too much thought to what they do behind the scenes. I would have expected that they first properly encoded special characters on output and then converted their markdown to HTML, causing HTML tags in input to come through as plain text. Obviously though they are doing something much more involved. I could see why it might be tricky to do things right, so perhaps they ended up going with a more layered approach (although one that obviously causes UI problems sometimes). – Conor Mancone Oct 31 '19 at 12:28
  • You mentioned under 1 that browsers correct incorrect html. Can you give an example there? (not sure how extensive these corrections can be from the statement alone and....it surprised me a lot there) – Thomas Oct 31 '19 at 13:22
  • @Thomas If browsers didn't have methods for dealing with incorrect HTML then they wouldn't be able to display most pages on the internet :) I'm a bit hesitant to add something to the answer as I think it's more of an aside than critical to the main point. However, consider the following HTML: `hi` If I save that to an HTML file, open it up in Chrome, and inspect element, Chrome has rendered this: `hi`. Using such changes to find an exploit is difficult, but plausible. – Conor Mancone Oct 31 '19 at 13:38
  • @ConorMancone Ah thanks. That explains a lot into which direction you meant the changes (honestly I thought already into the direction of "do they change the tags themselves to something more plausible" which didn't make any sense thus my question). – Thomas Oct 31 '19 at 13:48
  • Markdown is not secure itself! See [Markdown and XSS](https://michelf.ca/blog/2010/markdown-and-xss/). However, the output of a Markdown parser will usually provide you with well-formed HTML to pass into your sanitizer, so it can help in that respect. Except that Markdown allows raw HTML to be embedded... unless you disable that feature of the Markdown parser. In the end, Markdown really doesn't gain you much in the way of security. – Waylan Oct 31 '19 at 13:48
  • @Thomas I changed my mind and opted to put it in my answer - thanks for the suggestion! – Conor Mancone Oct 31 '19 at 13:50
  • @Waylan indeed! Which is why I limited my statement to "**easier** to guarantee safeness". These things are always surprisingly tricky. – Conor Mancone Oct 31 '19 at 13:51
  • @Thomas In case you're interested, I put up a new example of strange browser parsing that brings everything closer to an actual XSS payload. – Conor Mancone Oct 31 '19 at 19:45
  • Note that google has a similar problem in their site-preview, where they embed third-party content. Their solution to the parsing problem is to parse the html themselves, then filter that via whitelisting, then pretty-print it into conformant html that allows no room for browser interpretations. – WorldSEnder Nov 01 '19 at 06:16
  • I really like this answer: especially the part about using markdown (I'd extend this to a DSL too) but I don't think it answers huge part of the question: which elements should be in the whitelist/markdown? – JimmyJames Nov 01 '19 at 16:40
  • "You need a very robust parser." No. If text is invalid HTML, then don't treat/process/render it as HTML. Render it (escape it) as text. You would need a robust parser *if* you wanted to handle as HTML everything browsers accept as HTML. – Paul Draper Nov 01 '19 at 19:09
  • You don't need a robust parser if instead of storing the user input, you parse it in your parser, resynthesize valid HTML from it, and store _that_. You still want a parser that can parse all valid HTML, but the bare minimum for security is that it always outputs safe HTML for any input. – John Dvorak Nov 02 '19 at 04:41

Currently these are the tags that are banned:

In addition to what's already been posted, make sure that banning means "don't let the user save the data", not "remove the banned stuff and save the rest".

An example:

Input:
Hi there, here is my <script>alert('scary script')</script>, will I be shown?  
Filtered:  
Hi there, here is my alert('scary script'), will I be shown?

Seems fine if we just remove banned tags? After all, we stopped the script tag from doing something stupid. But what if I instead enter this?

Input:  
Hi there, here is my <scr<script>ipt>alert('scary script')</scr<script>ipt>, will I be shown?  
Filtered:  
Hi there, here is my <script>alert('scary script')</script>, will I be shown? 

Whoops! Removing the <script> tags made my string a valid attack.
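To make that concrete, here is a small sketch; the regexes are deliberately naive and exist only to illustrate the bypass, not to serve as a filter:

```typescript
// Deliberately naive single-pass removal: stripping "<script>" once lets a
// nested payload collapse into a live tag *after* filtering.
const naiveStrip = (input: string): string =>
  input.replace(/<\/?script[^>]*>/gi, "");

// A safer reaction to banned markup: reject the whole submission instead of
// trying to repair it (this check is illustrative, not exhaustive).
const containsBannedTag = (input: string): boolean =>
  /<\s*\/?\s*(script|iframe|object|embed|style)\b/i.test(input);

const payload =
  "Hi there, here is my <scr<script>ipt>alert('scary script')</scr<script>ipt>, will I be shown?";

console.log(naiveStrip(payload));
// -> "Hi there, here is my <script>alert('scary script')</script>, will I be shown?"
//    The filter itself assembled the attack.

if (containsBannedTag(payload)) {
  // In a real app: respond with HTTP 400 and refuse to save the submission.
  console.log("rejected: submission contains banned markup");
}
```

A single removal pass is never enough, and even a filter that repeats the removal until the output stops changing is still weaker than simply rejecting input that contains banned markup, which is the behaviour this answer recommends.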

I used this trick to make a cooler-looking profile page on a long-forgotten social network that allowed you to add some custom HTML. Filtering HTML is hard; try to find some other way to achieve what you're after.

Matsemann
  • Welcome to security.stackexchange, and great first post! Trying to strip banned (or all) attributes is a common way for people to try to remove dangerous HTML, and it's not *too* crazy of an idea. Some languages even support it by default (we use [strip_tags](https://www.php.net/manual/en/function.strip-tags.php) in our own systems). The trick (which `strip_tags` follows) is to keep looping through your tag removal algorithm until nothing is left. When trying to do it yourself though, it's a **very** easy thing to miss, and very bad when you do! – Conor Mancone Nov 01 '19 at 13:06
  • aka, this is definitely one of those "gotchas" that people easily miss when they try to implement something like this without the experience needed to get it right. – Conor Mancone Nov 01 '19 at 13:08

Leaving aside all the ways I could still slip XSS past this blacklist (which other answers have largely covered), allowing arbitrary HTML is still very dangerous. For example, suppose the user has control over the style= attribute (you said you blocked style tags, but nothing about inline attributes), or even just has access to the legacy positioning attributes (if you're letting users supply the HTML, the browser will have to allow somewhat-invalid input). The attacker can then basically draw over the entire page with malicious content, such as a phishing login form, a (fake) security warning/ransomware extortion message, horrible images or videos, etc. Malicious users could also find HTML that adversely impacts a browser's rendering engine (such as consuming a ton of RAM and/or taking forever to render and using all their CPU; not all attacks are aiming to take control of a system) and spam that HTML everywhere they can.

The only good option is to use a safe layout language that gets translated into HTML by a well-tested library (the various forms of markdown or bbcode are intended for this). If you must allow HTML, do it by whitelisting specific tags, and within those tags whitelisting specific attributes (and if necessary only allowing specific values of those attributes), and throwing out anything that doesn't match. Then re-test after each change your filter makes, to ensure the filter's own changes didn't introduce malicious input.
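As one possible shape for that attribute-level whitelist, here is a sketch using DOMPurify (named in another answer); the option names are DOMPurify-specific, and the lists and regex are assumptions to adapt:

```typescript
// Sketch: attribute-level whitelisting on the server with DOMPurify + jsdom.
// Assumes `npm install dompurify jsdom`; lists and regex below are illustrative only.
import createDOMPurify from "dompurify";
import { JSDOM } from "jsdom";

// Cast only because jsdom's window type differs from the DOM lib typings.
const DOMPurify = createDOMPurify(new JSDOM("").window as any);

const userHtml =
  `<p style="position:fixed;top:0">fake login form</p>` +
  `<a href="javascript:alert(1)">click</a>`;

const clean = DOMPurify.sanitize(userHtml, {
  ALLOWED_TAGS: ["p", "a", "img", "em", "strong", "ul", "ol", "li"],
  // Neither the style attribute nor legacy positioning attributes make the list.
  ALLOWED_ATTR: ["href", "src", "alt", "title"],
  // Only http(s) URLs survive in URL-bearing attributes, so the javascript: link is stripped.
  ALLOWED_URI_REGEXP: /^https?:/i,
});

console.log(clean); // the overlay style and the javascript: href are both gone
```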

CBHacking

There are major risks with user generated content that is to be interpreted/parsed and displayed publicly. XSS attacks and the like can occur from users being able to sneak tags through your sanitizing functions, and there are a lot of variations that a browser will interpret that you must design for.

Recommendation? Do not allow tags at all. If you must, there are libraries out there (server side) that attempt this and are likely to have a lot of the work done to avoid the sanitizing bypasses that may exist.

In terms of your question:

Are there any other tags that i need to be mindfull of when enabling users to generate html without effect other things on the page?

The best approach is to assume all tags are of concern, and instead decide on the specific tags you want to allow (as user Ghedipunk suggested). This is because tags can be exploited in unique, and sometimes unforeseen, ways. This can range from specific browser implementation quirks to less common uses of tags. It's far easier to remove all input that follows the pattern of an HTML tag, with an exception for specific tags, than it is to try and prevent specific tags.

There are various ways to achieve this, from the most naive to far more advanced methods (actual DOM processing). If you try to do a blacklist approach, you'll be constantly chasing variations of different bypass attacks as well as unexpected uses of various tags you thought safe or forgot to include.

I'm confident a library already exists out there that does such operations in a more advanced and secure fashion. My recommendation would be to seek those out for whatever platform you're developing for.

  • While "Do not allow tags at all" is likely good advice in general, the OP is specifically trying to allow users to provide HTML (likely for formatting reasons). As a result, you're basically telling the OP not to do what they are asking how to do. Sometimes a frame challenge is necessary, but there are suggestions that can legitimately be given here. – Conor Mancone Oct 30 '19 at 17:29
  • @ConorMancone, I agree, but I am more so answering the question in the title of, "are there any security risks of using user generated html tags?," to which the answer is a large, yes. I'll edit my answer to clarify a bit more though. – Jarrod Christman Oct 30 '19 at 17:35
  • There is no such thing as a naive approach to HTML parsing. Take for instance the user input: ``. Step one results in `` which a browser will happily treat as an image tag, ignoring the ` – Conor Mancone Oct 30 '19 at 18:45
  • That is why I called it a naive approach, it's for illustration only and isn't immune to attacks. It merely illustrates the point of whitelisting over black listing. Accept what you want, remove all else, rather than trying to remove what you don't want. Actual DOM processing, as suggested, is the proper method. Hence: "The above pseudo code just illustrates the simplicity of a whitelist approach and killing any other HTML" and "I'm confident a library already exist out there that has such operations done in a more advanced and secure fashion." – Jarrod Christman Oct 30 '19 at 19:31
  • Sure... except that your example of a naive approach likely leaves the system vulnerable to an XSS vulnerability. This is a security site - if you're going to give example code, then it really shouldn't have any security vulnerabilities. Someone somewhere is going to see your example, copy the general principles, and end up with a vulnerable site. Your answer doesn't need an example, so if you can't make one without a vulnerability (which is likely impossible due to the complexities involved) then you are better off just removing it. – Conor Mancone Oct 30 '19 at 19:38
  • One would hope that information would be read in context and understood before attempting an implementation... but fair enough. I removed the pseudo code. – Jarrod Christman Oct 30 '19 at 19:41
  • @ConorMancone requiring all sample code to be completely free of any possibly security vulnerabilities is not realistic. – barbecue Nov 01 '19 at 13:42

If it doesn't cause problems for your page layout, a good option would be to place the user-generated code inside an <iframe> and serve it from a separate subdomain or domain.

<iframe> is designed to isolate the content inside the frame from the page outside it. By serving the frame contents from a separate domain, you can also isolate it from any cookies the main site is using. Cookies from the main domain (example.com) are also visible in subdomains (foo.example.com), but if the main site is on a different subdomain (www.example.com) or a completely different domain (example2.com), the sites will have different cookies.

The downside is that it's difficult to make the frame contents merge seamlessly with the surrounding page, and it may cause, for example, a separate scroll bar to appear for the frame.
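A rough sketch of the serving side, assuming an Express app; the domain layout and CSP directives are illustrative, and the CSP sandbox directive adds iframe-style restrictions even if someone opens the user page directly:

```typescript
// Sketch: serve user-generated pages from a separate origin
// (e.g. usercontent.example.net) so they share no cookies or DOM access with
// the main site at www.example.com. Assumes Express; names are made up.
import express from "express";

const app = express();

// Hypothetical lookup of stored user HTML; replace with your own storage layer.
async function getUserHtml(id: string): Promise<string> {
  return "<p>user content for page " + id + "</p>";
}

app.get("/pages/:id", async (req, res) => {
  const html = await getUserHtml(req.params.id);
  // The CSP sandbox directive disables scripts, forms and plugins for this
  // document, mirroring the restrictions of <iframe sandbox> on the embedding side.
  res.set(
    "Content-Security-Policy",
    "sandbox; default-src 'none'; img-src https:; style-src 'unsafe-inline'"
  );
  res.type("html").send(html);
});

app.listen(8081);
```

The main site (www.example.com) would then embed it with something like `<iframe sandbox src="https://usercontent.example.net/pages/123"></iframe>`, keeping the user content isolated from the surrounding page.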

jpa
  • Can't a subdomain x.example.com get/set cookies from the domain example.com? So JS running in the subpage would be able to interfere in the page? – curiousguy Nov 02 '19 at 23:25
  • @curiousguy Ah, true - I'll update the answer. – jpa Nov 03 '19 at 06:30

This could lead to an attack vector called Cross-Site Scripting (XSS); more on this here. You should not rely on or trust a blacklist of tags, as there are many ways to bypass it in the ever-evolving HTML language.

The rule of thumb is: use a sandboxed environment, sanitise the input, and encode the output. There are various solutions and frameworks at both ends (client and server side) that can perform such tasks for you, like DOMPurify.
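On the "encode the output" side, the idea is to escape user text wherever it should appear as text rather than markup; a minimal helper as a sketch (most templating engines already do this for you):

```typescript
// Minimal HTML-escaping helper for output encoding: user text rendered through
// this is displayed literally instead of being parsed as markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

console.log(escapeHtml("<script>alert(1)</script>"));
// -> &lt;script&gt;alert(1)&lt;/script&gt;
```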

avicoder

In addition to all the dangers of parsing HTML, make sure that if you are saving this input in a database you watch out for SQL injection vulnerabilities.

One thing to start with is to use SQL parameters.

What is SQL injection?
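As a sketch of the parameter idea with a Node Postgres client (the table and column names are made up):

```typescript
// Sketch: store the (already sanitized) submission with a parameterized query.
// Assumes `npm install pg`; connection settings come from environment variables.
import { Pool } from "pg";

const pool = new Pool();

async function savePage(userId: number, body: string): Promise<void> {
  // $1/$2 placeholders let the driver send the values separately from the SQL
  // text, so quotes or SQL fragments inside the HTML cannot change the statement.
  await pool.query("INSERT INTO user_pages (user_id, body) VALUES ($1, $2)", [
    userId,
    body,
  ]);
}
```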

OrangeKing89
  • Saving to a database does not equal SQLi. – schroeder Nov 01 '19 at 18:40
  • While validation and sanitization of input before putting it into a database (and after getting it back out) is an important consideration, that isn't the topic of this question. – Ghedipunk Nov 01 '19 at 19:36