
Drupal has a filter_xss function. Is it safe to use for filtering arbitrary user HTML input?

If not, what should be used instead when using Drupal 7?

This question is a duplicate of Drupal's built-in xss filters vs. HTML purifier module, but the answer there seems incorrect to me, since filter_xss doesn't contain code for validating HTML.

Andrei Botalov

2 Answers


I'm not sure whether Drupal's filter_xss is safe. According to the link you provided, Drupal's filter_xss is based upon the kses library for HTML filtering. To put it bluntly, I don't trust kses or kses-derived filters.

If you look at the code of kses, it is based upon a monster regexp. That's a crummy architecture for a HTML sanitizer, and I don't trust anything that's built in that way. If you have to filter arbitrary HTML, the proper way to do it is to parse the HTML and then operate on the parse tree. (HTML Purifier works that way, and partly as a result, I trust HTML Purifier a lot more.)
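
To illustrate what "parse the HTML and then operate on the parse tree" means in practice, here is a rough, hedged sketch in plain PHP using DOMDocument. It is not Drupal, kses, or HTML Purifier code, and it is not a production sanitizer (encoding handling and many corner cases are deliberately omitted); the function name and whitelist are made up for the example.

<?php
// Sketch of parse-tree-based filtering: build a DOM, drop anything not on a
// whitelist, then re-serialize (which also balances tags). Illustrative only.
function sketch_tree_filter($html, array $allowed = array('p', 'em', 'strong', 'a')) {
  $doc = new DOMDocument();
  @$doc->loadHTML('<div>' . $html . '</div>');   // tolerant of broken markup
  $wrapper = $doc->getElementsByTagName('div')->item(0);
  // Elements the parser itself adds stay on the keep list.
  $keep = array_merge($allowed, array('html', 'body', 'div'));
  $xpath = new DOMXPath($doc);
  foreach ($xpath->query('//*') as $node) {
    if (!in_array(strtolower($node->nodeName), $keep, TRUE)) {
      // Drop the disallowed element, contents and all.
      $node->parentNode->removeChild($node);
      continue;
    }
    // Whitelist attributes too: only href on <a>, and never javascript: URLs.
    foreach (iterator_to_array($node->attributes) as $attr) {
      $ok = $node->nodeName === 'a'
        && $attr->nodeName === 'href'
        && !preg_match('/^\s*javascript\s*:/i', $attr->nodeValue);
      if (!$ok) {
        $node->removeAttribute($attr->nodeName);
      }
    }
  }
  // Serializing the tree closes any tags the input left open.
  $out = '';
  foreach ($wrapper->childNodes as $child) {
    $out .= $doc->saveHTML($child);
  }
  return $out;
}

The point is architectural: every decision is made against a parsed element or attribute node, not against a regexp match over raw text.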

Historically, kses has had some security problems. See, e.g., Vulnerabilities in kses-based HTML filters and HTML Sanitisation: The Devil’s In The Details (And The Vulnerabilities). I don't know whether those vulnerabilities affect Drupal's version, but they don't inspire confidence. So, if I needed HTML filtering, and if I were choosing based upon security, I think I'd probably choose HTML Purifier over filter_xss.
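
If you do go with HTML Purifier, standalone usage looks roughly like the sketch below. In Drupal you would normally let the HTML Purifier contrib module wire it into a text format rather than call it directly, and the policy values shown are example assumptions, not recommendations.

<?php
// Rough illustration of standalone HTML Purifier usage.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Example policy choices; adjust to your own whitelist.
$config->set('HTML.Allowed', 'p,em,strong,a[href],ul,ol,li,blockquote');
$config->set('URI.AllowedSchemes', array('http' => TRUE, 'https' => TRUE, 'mailto' => TRUE));

$purifier = new HTMLPurifier($config);
$clean = $purifier->purify($untrusted_html);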

I should also back up a little bit. You didn't say much about the situation you are facing. Are you sure that HTML filtering is the right tool for your job? In my experience, it's a lot more common that you need HTML escaping (output encoding) rather than HTML filtering.

If you are including untrusted input into a HTML document, you need to decide what kind of functionality you need:

  • HTML escaping, a.k.a. output encoding. Is the user-provided input "just plain text", with no rich formatting? If yes, use HTML escaping, not filter_xss. When doing HTML escaping, you want context-sensitive escaping: escape the data in the proper way for the parse context where the untrusted data will be inserted (HTML body, attribute value, URL, JavaScript string, and so on).

  • HTML filtering. Do you want to allow the user-provided input to contain rich HTML formatting? Do you want to allow the user to enter in nearly-arbitrary HTML, which you want to include verbatim into the output document? If yes, you need a HTML filter. In this situation, HTML Purifier is an excellent choice, and probably safer than Drupal's filter_xss. That's just my personal opinion.

When inserting user-provided input into a HTML document, my experience is that probably 95+% of the time you want HTML escaping, not HTML filtering. HTML filtering is the exception. (How often do you expect users to enter in HTML markup? If you're writing an application for the general population, the answer is: almost never.) So, are you sure you need a HTML filter anyway?
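
In Drupal 7 terms, a minimal sketch of that distinction (check_plain, check_url, and filter_xss are the stock Drupal 7 functions; the variable names are assumptions for the example):

<?php
// Plain-text input (titles, names, comments without markup): escape it for
// the context it lands in instead of filtering it.
$safe_title = check_plain($user_supplied_title);   // text in HTML body or attribute
$safe_href  = check_url($user_supplied_url);       // URL destined for an href attribute

// Rich-text input where limited markup is genuinely wanted: filter it against
// an explicit whitelist (or hand it to HTML Purifier).
$safe_body = filter_xss($user_supplied_body, array('a', 'em', 'strong', 'ul', 'ol', 'li'));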

D.W.

D.W. has some excellent points, but I'd just point out a few things:

The filter_xss documentation says the code does four things:

  1. Removes characters and constructs that can trick browsers.
  2. Makes sure all HTML entities are well-formed.
  3. Makes sure all HTML tags and attributes are well-formed.
  4. Makes sure no HTML tags contain URLs with a disallowed protocol (e.g. javascript:).
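
For reference, the Drupal 7 call that documentation describes looks like the sketch below (the whitelist shown mirrors filter_xss's documented default; the variable names are illustrative):

<?php
// Drupal 7: filter_xss($string, $allowed_tags) strips tags not on the
// whitelist and sanitizes what remains (protocols, entities, attributes).
$default_allowed = array('a', 'em', 'strong', 'cite', 'blockquote', 'code', 'ul', 'ol', 'li', 'dl', 'dt', 'dd');
$clean = filter_xss($comment_body, $default_allowed);

// filter_xss_admin() is the looser variant meant for text entered by administrators.
$clean_admin = filter_xss_admin($admin_entered_text);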

Assuming it does an excellent job of everything on that list, there are a few notable omissions: tag balancing, encoding level attacks, and link spam.

Even if it handled those well, HTML Purifier has been around for a while and has been attacked extensively. I can think of at least one security academic off the top of my head who makes sure HTML Purifier is patched and stable before publishing new attacks. If Drupal's filter doesn't receive similar scrutiny, then I'd use the more hardened one.

Tag Balancing

If your users' security depends on them being able to distinguish content that you authored from that authored by commenters or other third parties, then tag balancing is important.

Imagine that you used <table>s for formatting. Unbalanced tags can let an attacker sneak content out of the region that is supposed to hold third-party content and into a region of the page that appears to be controlled by the site owners. For example, if your white-list included otherwise innocuous formatting tags like <table>, then

</table>
<center>If you have any questions,
<a href="support@deceptively-similar-name.com">contact us</a>
<br>Bogus copyright</center><br><br><br><br><sub><sub><sub><sub><sub>

might let the attacker forge a footer that contains phishing links.

A similarly innocuous seeming </ul> might help an attacker break out of a list of user comments that are displayed using <ul><li>...</ul> per semantic HTML.

User-configurable white-lists give you a lot of room to hang yourself here, since HTML experts rightly assume that balanced elements like <table> and <ul> don't do anything security-sensitive (they just create boxes around flowable content), but individual unbalanced tags are problematic.
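
To see why balancing neutralizes that payload, here is a hedged sketch using plain PHP DOMDocument (not the code Drupal actually uses; as greggles notes in the comments, Drupal's text filtering system has its own tag-balancing filter that can be enabled separately from filter_xss):

<?php
// Illustrative only: round-tripping a fragment through a DOM parser drops the
// stray closing tag and closes the unclosed ones, so a comment can neither
// close elements the surrounding page opened nor leave its own tags dangling.
function sketch_balance($fragment) {
  $doc = new DOMDocument();
  @$doc->loadHTML('<div>' . $fragment . '</div>');
  $wrapper = $doc->getElementsByTagName('div')->item(0);
  $out = '';
  foreach ($wrapper->childNodes as $child) {
    $out .= $doc->saveHTML($child);
  }
  return $out;
}

// The leading </table> disappears and the <sub> tags get closed.
echo sketch_balance('</table><center>Bogus footer</center><sub><sub>fine print');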

Encoding level attacks

If an attacker can get sanitized content into the first few kB of an HTML page that does not have a Content-type header specifying an encoding, then they might be able to trick IE into treating the page as UTF-7, bypassing all the other sanitization.

This may fall under "constructs that can trick browsers", but the source code on that page does not indicate that it does.
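
Two cheap mitigations, sketched below in plain PHP. Declaring the charset explicitly is what Drupal already does for its own pages (per greggles' comment below); the '+'-encoding helper name is made up, illustrating the idea described in the comments.

<?php
// 1. Declare the encoding so IE never has to sniff (no sniffing, no UTF-7).
header('Content-Type: text/html; charset=utf-8');
// Drupal 7 equivalent:
// drupal_add_http_header('Content-Type', 'text/html; charset=utf-8');

// 2. Belt and braces: encode '+' in sanitized output so a UTF-7
//    misinterpretation has nothing to latch onto. Illustrative helper.
function sketch_defang_utf7($sanitized_html) {
  return str_replace('+', '&#43;', $sanitized_html);
}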

Link spam

If you allow links but the sanitizer does not add rel="nofollow" to them, then your site's search reputation can be hijacked by comment spammers.
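
A hedged sketch of that post-processing step in plain PHP DOMDocument (as greggles notes in the comments, Drupal's text filtering system can add rel="nofollow" through a separate filter; the helper name here is made up):

<?php
// Illustrative helper: add rel="nofollow" to every link in already-sanitized
// HTML so third-party comments cannot borrow the site's search reputation.
function sketch_add_nofollow($sanitized_html) {
  $doc = new DOMDocument();
  @$doc->loadHTML('<div>' . $sanitized_html . '</div>');
  foreach ($doc->getElementsByTagName('a') as $link) {
    $link->setAttribute('rel', 'nofollow');
  }
  $wrapper = $doc->getElementsByTagName('div')->item(0);
  $out = '';
  foreach ($wrapper->childNodes as $child) {
    $out .= $doc->saveHTML($child);
  }
  return $out;
}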

Mike Samuel
  • Thanks for the review and feedback! filter_xss is one piece of Drupal's text filtering system which includes other filters to do things like tag balancing and adding rel="nofollow". Not every bit of text needs those levels of handholding, so they are on by default for the default text filter, but can be disabled if that's appropriate for the site. Content-type is added as a header to all pages and is also out of scope of filter_xss. – greggles Sep 10 '12 at 20:26
  • @greggles, It's nice to know that you can get tag balancing. I disagree re headers being out of scope. Part of providing library code that preserves security guarantees is making sure that your clients are secure even if they don't understand all of the attack vectors, so that app developers can focus on apps while security folk focus on security. If `filter_xss` produced output that was robust against encoding level attacks regardless of whether the Content-type header is set, then it would be easier to use, and this is not difficult -- just encoding '+' as `&#43;` will defang UTF-7. – Mike Samuel Sep 10 '12 at 20:45
  • Makes sense, thanks Mike. Any chance you'll file an issue with details and suggestion to improve filter_xss? – greggles Sep 12 '12 at 06:12
  • I'm torn right now between spending time tracking down libraries with weaknesses and my current project which uses code-generation to provide consistent high-quality anti-code-injection tools for many application languages. I'll try to find some time to look at filter_xss in detail but can't promise anything. – Mike Samuel Sep 12 '12 at 14:47