9

Drupal filters HTML strings against XSS attacks using regexes: http://api.drupal.org/api/drupal/includes%21common.inc/function/filter_xss/7

However, as a lot of people know, HTML can't be parsed with regex.

Which makes me think that the filter_xss function could let some invalid HTML pass a script tag in it, thus being a security flaw.

But I'm not efficient enough with regexes. Maybe somebody can find something that passes? If there is, I'd make a patch to use simplexml (or something better if there is) instead of regexes.

FWIW, here is the code of the function:

function filter_xss($string, $allowed_tags = array('a', 'em', 'strong', 'cite', 'blockquote', 'code', 'ul', 'ol', 'li', 'dl', 'dt', 'dd')) {
  // Only operate on valid UTF-8 strings. This is necessary to prevent cross
  // site scripting issues on Internet Explorer 6.
  if (!drupal_validate_utf8($string)) {
    return '';
  }
  // Store the text format.
  _filter_xss_split($allowed_tags, TRUE);
  // Remove NULL characters (ignored by some browsers).
  $string = str_replace(chr(0), '', $string);
  // Remove Netscape 4 JS entities.
  $string = preg_replace('%&\s*\{[^}]*(\}\s*;?|$)%', '', $string);

  // Defuse all HTML entities.
  $string = str_replace('&', '&', $string);
  // Change back only well-formed entities in our whitelist:
  // Decimal numeric entities.
  $string = preg_replace('/&#([0-9]+;)/', '&#\1', $string);
  // Hexadecimal numeric entities.
  $string = preg_replace('/&#[Xx]0*((?:[0-9A-Fa-f]{2})+;)/', '&#x\1', $string);
  // Named entities.
  $string = preg_replace('/&([A-Za-z][A-Za-z0-9]*;)/', '&\1', $string);

  return preg_replace_callback('%
    (
    <(?=[^a-zA-Z!/])  # a lone <
    |                 # or
    <!--.*?-->        # a comment
    |                 # or
    <[^>]*(>|$)       # a string that starts with a <, up until the > or the end of the string
    |                 # or
    >                 # just a >
    )%x', '_filter_xss_split', $string);
}

And this function uses _filter_xss_split:

function _filter_xss_split($m, $store = FALSE) {
  static $allowed_html;

  if ($store) {
    $allowed_html = array_flip($m);
    return;
  }

  $string = $m[1];

  if (substr($string, 0, 1) != '<') {
    // We matched a lone ">" character.
    return '&gt;';
  }
  elseif (strlen($string) == 1) {
    // We matched a lone "<" character.
    return '&lt;';
  }

  if (!preg_match('%^<\s*(/\s*)?([a-zA-Z0-9]+)([^>]*)>?|(<!--.*?-->)$%', $string, $matches)) {
    // Seriously malformed.
    return '';
  }

  $slash = trim($matches[1]);
  $elem = &$matches[2];
  $attrlist = &$matches[3];
  $comment = &$matches[4];

  if ($comment) {
    $elem = '!--';
  }

  if (!isset($allowed_html[strtolower($elem)])) {
    // Disallowed HTML element.
    return '';
  }

  if ($comment) {
    return $comment;
  }

  if ($slash != '') {
    return "</$elem>";
  }

  // Is there a closing XHTML slash at the end of the attributes?
  $attrlist = preg_replace('%(\s?)/\s*$%', '\1', $attrlist, -1, $count);
  $xhtml_slash = $count ? ' /' : '';

  // Clean up attributes.
  $attr2 = implode(' ', _filter_xss_attributes($attrlist));
  $attr2 = preg_replace('/[<>]/', '', $attr2);
  $attr2 = strlen($attr2) ? ' ' . $attr2 : '';

  return "<$elem$attr2$xhtml_slash>";
}
Florian Margaine
  • 2,465
  • 3
  • 13
  • 10
  • 4
    People who find an answer to this question may be interested to know that disclosing the answer privately is worth $250 http://www.whitefirdesign.com/about/drupal-security-bug-bounty-program.html http://drupal.org/node/101494 – greggles Nov 02 '12 at 19:12
  • 1
    See [Is Drupal's filter_xss enough for filtering HTML?](http://security.stackexchange.com/questions/19864/is-drupals-filter-xss-enough-for-filtering-html) – Andrei Botalov Nov 02 '12 at 19:50
  • Useful tip for those doing source code review of this API: The [source code for `_filter_xss_attributes()`](http://api.drupal.org/api/drupal/includes!common.inc/function/_filter_xss_attributes/7) is also highly relevant (it is called by `_filter_xss_split()` on the list of attributes). – D.W. Nov 02 '12 at 20:03

2 Answers2

3

See Is Drupal's filter_xss enough for filtering HTML?, which has some discussion of the security of Drupal's filter_xss(). Make sure to read Mike Samuel's analysis, which identifies a number of shortcomings of filter_xss(). I don't know whether you would classify them as vulnerabilities, exactly, but they are design flaws/shortcomings that could make filter_xss() less effective than developers might expect.


The developer documentation for filter_xss() is atrociously bad.

There's a sum total of two sentences: "Filters HTML to prevent cross-site-scripting (XSS) vulnerabilities." and "Use check_markup or filter_xss for markup containing text.". When the documentation doesn't explain how to use filter_xss() properly, you shouldn't be surprised if developers fail to use it correctly. This could lead to vulnerabilities, e.g., of the sort that Rook identifies.

(Drupal also has a document entitled Handle text in a secure fashion, but it doesn't even mention filter_xss().)


I would also suggest that anyone who calls filter_xss() should make sure not to include !-- in the list of allowed tags. The code for validating comments (which is enabled if you add !-- to the list of allowed tags) looks super-sketchy to me: it doesn't do anything to validate the contents of the comment, which intuitively feels like it can't possibly be safe.

D.W.
  • 98,420
  • 30
  • 267
  • 572
0

XSS is an output problem, there is no magic function that can prevent all XSS vulnerabilities. The root of all vulnerabilities is using functionality in a way that it was never intended.

These two are pretty obvious:

print "<script>".filter_xss($_GET['still_xss1'])."</script>"; 
print "<a href=".filter_xss($_GET['still_xss2']).">xss</a>";

PoC:

?still_xss1=alert(1)
?still_xss2=javascript:alert(1)

It also looks like you can inject event handlers: ' onclick=alert(1) ', although I haven't tried it...

rook
  • 46,916
  • 10
  • 92
  • 181
  • 1
    Those examples work, but that's not how the function is designed to be used and would be a bug in the calling code, not the filter_xss function. Turning your first example into proper usage of the api is: print filter_xss(""); – greggles Nov 02 '12 at 19:29
  • 1
    @greggles But all vulnerabilities are using functionality in a way that is not intended... – rook Nov 02 '12 at 19:47
  • So, to the question of "What could bypass regexes" your answer is "avoiding the regexes." Seems like a pretty poor answer. – greggles Nov 02 '12 at 20:54
  • 2
    @greggles I am not sure how you came to that conclusion. I am saying that its impossible to write some magical function that prevents all of XSS. A regular expression should be able to process any regular language... although it maybe be difficult to do so. – rook Nov 02 '12 at 22:50