Is there a way to bypass Django's XSS escaping with "unicode"?

Question

Django (the Python web framework) escapes output to prevent XSS (Cross Site Scripting) attacks. It replaces ', ", <, >, & with their HTML safe versions.

However, this presentation on SlideShare (specifically slide № 13) says:

Problems

Any other Unicode will bypass this check

I can't understand this complaint. Is there some Unicode character that Django's escape function will not replace and that would allow an XSS? I know a bit about Unicode, and I can't think how.

bobince · Answer 1 · 2013-04-10T19:40:18.810

It is not clear what exactly the slide is referring to. Django's auto-escaping should be fine against HTML-injection in text content and properly-quoted attribute values.

There are no other Unicode characters that can evade HTML escaping, but in principle there are byte sequences that could be misinterpreted if the browser assumes the wrong encoding:

If the browser decides to interpret a document as UTF-7, +ADw- becomes a synonym for < (and similar sequences for &"'>), allowing HTML metacharacters to avoid being escaped.
Some East Asian multibyte encodings allow trailing bytes in a multibyte sequence to be in the 0x00-0x7F range where they could be interpreted as ASCII characters, and mis-escaped if handled in that way. Usually that would just lead to broken text rather than a security issue, though.
Invalid 'overlong' UTF-8 byte sequences may be interpreted as ASCII by some very old browsers (the original IE6 pre-SP1, and Opera at around the same time). This could allow HTML metacharacters to avoid being escaped, such as the byte sequence 0xC0 0xBC representing <.
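To make the first and third cases concrete, here is a minimal sketch using only the Python standard library. It assumes html.escape() as a stand-in for Django's escape (it replaces the same five characters), and the helper name decodes_as_strict_utf8 is my own, not anything from Django.

```python
import html

# An attacker-supplied string with no HTML metacharacters at all.
payload = "+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# Stand-in for Django's escape(): there is nothing here for it to replace.
escaped = html.escape(payload)

# The bytes as they would go over the wire (pure ASCII, so the same in UTF-8).
wire_bytes = escaped.encode("utf-8")

# A client that wrongly decides the page is UTF-7 sees markup...
as_utf7 = wire_bytes.decode("utf-7")
# ...while one that honours charset=utf-8 sees harmless text.
as_utf8 = wire_bytes.decode("utf-8")


def decodes_as_strict_utf8(data):
    """Return True if the bytes are valid UTF-8 for a strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The overlong two-byte form of '<' mentioned above.
overlong_lt = b"\xc0\xbc"

print(as_utf7)
print(as_utf8)
print(decodes_as_strict_utf8(overlong_lt))
```

A strict decoder such as Python's rejects the overlong sequence outright, which is why the risk is limited to clients that decode leniently or guess the charset.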

To avoid these problems you would (a) make sure to serve your documents with a UTF-8 Content-Type charset, and (b) keep all your text strings as native Unicode strings internally so that they can never encode to invalid UTF-8 sequences.

Since Django apps tend to do this by default already, it is not a likely scenario that Django templates' auto-escaping would be defeated by Unicode problems.

That's not to say XSS is solved in general of course - you still have to avoid misuse of |safe, unquoted attributes, non-HTML injection problems (like JavaScript strings, CSS properties, URL parameters), HTML content-sniffing, dangerous URL schemes (javascript: et al), and so on. But as a defence against HTML-injection in templates it should be sound.
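As an illustration of the unquoted-attribute case, here is a small sketch. It again assumes html.escape() as a stand-in for Django's escape, and the two render functions are hypothetical stand-ins for template fragments, not Django APIs.

```python
import html


def render_unquoted(value):
    # Hypothetical template fragment: <input value={{ value }}>
    return "<input value=%s>" % html.escape(value)


def render_quoted(value):
    # Hypothetical template fragment: <input value="{{ value }}">
    return '<input value="%s">' % html.escape(value)


# No &, <, >, " or ' here, so escaping changes nothing.
user_input = "x onmouseover=alert(1)"

# The space ends the unquoted value, so onmouseover becomes a new attribute.
print(render_unquoted(user_input))
# Inside quotes the whole string stays one attribute value.
print(render_quoted(user_input))
# Trying to break out of the quotes fails because " is escaped.
print(render_quoted('x" onmouseover="alert(1)'))
```

The escaping is identical in both cases; only the quoting in the template decides whether the input stays inside the attribute value.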

score 8 · Answer 2 · answered Apr 10 '13 at 18:48

Yes, there are at least three instances where this XSS filter fails. XSS is complex, and blindly replacing characters doesn't solve this problem. The most obvious is if you are writing within a script tag:

<script>
var x = alert(1);
</script>

If you are writing an href or iframe src you can use the javascript: URI:

<a href=javascript:alert(1)>alert</a>
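HTML escaping leaves a value like javascript:alert(1) untouched, so a URL needs a separate check on its scheme. Below is a rough sketch of an allow-list check using the standard library; the function name has_allowed_scheme and the ALLOWED_SCHEMES set are my own assumptions, it only accepts absolute URLs, and it is not a complete URL sanitizer.

```python
from urllib.parse import urlsplit

# Assumed allow-list for this sketch; adjust to what your application needs.
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def has_allowed_scheme(url):
    """Return True only for absolute URLs whose scheme is on the allow-list."""
    # Browsers may strip tabs and newlines inside a URL, so reject any
    # whitespace or control characters before looking at the scheme.
    if any(ch.isspace() or ord(ch) < 0x20 for ch in url):
        return False
    try:
        scheme = urlsplit(url).scheme.lower()
    except ValueError:
        # urlsplit() raises ValueError for some malformed inputs.
        return False
    return scheme in ALLOWED_SCHEMES


print(has_allowed_scheme("https://example.com/"))
print(has_allowed_scheme("javascript:alert(1)"))
```

An allow-list is used rather than a block-list because javascript: is not the only dangerous scheme.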

It is also vulnerable to XSS if you write inside a DOM event handler:

<a href="doSomethingCool('userInput%27);sendHaxor(document.cookie);//');">Cool Link</a>

The browser will automatically decode the %27 (as well as other methods of encoding) prior to executing the JavaScript event.

score 4 · Answer 3 · edited Mar 17 '17 at 13:14

Django does the sensible things to reduce exposure to XSS.

Django uses Unicode and UTF-8 encoding everywhere by default, and sensibly forces values to Unicode before doing substitution on all template variables (done by default) to prevent users inserting arbitrary HTML elements. Django allows developers to change the encoding with the DEFAULT_CHARSET setting, but will force that encoding throughout the application and will insert Content-Type: text/html; charset=utf-8 HTTP response headers by default (with 'text/html' and 'utf-8' changing if you return a different content_type or have changed the charset). Furthermore, Django's base templates and admin pages also set <meta http-equiv="content-type" content="text/html; charset=utf-8">, but again developers have the option not to use the base templates (and developers' custom templates may not define a charset in the meta tag or, worse, may use the wrong charset). So while bobince's great answer lists some ways that substituting &lt; for < in user input can be defeated via encoding issues, Django by default handles these properly.

Is it 100% fool-proof? No. Django still gives the developer enough configurability to do unsafe things, like inserting user input into an onclick action, bypassing the automatic escaping (through the mark_safe() function or {{ user_input|safe }} in the template), or allowing user input into an unsafe location, e.g. a link or eval'd JavaScript. Granted, it would be near impossible to do much more without intensive compiling/semantic analysis of each template.
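To show why marking a value as safe bypasses the escaping, here is a simplified sketch of the marker-type idea. SafeText, mark_trusted and conditional_escape_sketch are hypothetical names of my own that only mimic the shape of Django's SafeData, mark_safe() and conditional_escape(); html.escape() stands in for Django's escape.

```python
import html


class SafeText(str):
    """Hypothetical marker type: a string that must not be escaped again."""


def mark_trusted(value):
    # Analogue of mark_safe(): nothing is checked, the value is simply
    # tagged, so the caller is fully responsible for its contents.
    return SafeText(value)


def conditional_escape_sketch(value):
    # Already-tagged values pass through untouched.
    if isinstance(value, SafeText):
        return value
    # Everything else is escaped, then tagged to prevent double escaping.
    return SafeText(html.escape(str(value)))


user_input = "<script>alert(1)</script>"

print(conditional_escape_sketch(user_input))
print(conditional_escape_sketch(mark_trusted(user_input)))
```

The second call returns the script tag verbatim, which is what happens when user input reaches |safe or mark_safe().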

For people interested, the escaping code is quite readable in django/utils/html.py. (My link goes to the current dev version, but my copy-paste is from Django 1.2. The main differences between the dev and 1.2 versions are that force_unicode was renamed to force_text (in Python 3 all text is Unicode) and that the code was made compatible with Python 3 (hence all the references to six).)

Basically, the escape function is run on every variable to be rendered in the template; it first checks that the value can be converted to Unicode properly and then replaces the characters &<>'" with their HTML-escaped equivalents. There are also functions for escaping JS, though I believe they have to be called manually in the template, like {{ variable|escapejs }}.

def escape(html):
    """
    Returns the given HTML with ampersands, quotes and angle brackets encoded.
    """
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))
escape = allow_lazy(escape, unicode)

_base_js_escapes = (
    ('\\', r'\u005C'),
    ('\'', r'\u0027'),
    ('"', r'\u0022'),
    ('>', r'\u003E'),
    ('<', r'\u003C'),
    ('&', r'\u0026'),
    ('=', r'\u003D'),
    ('-', r'\u002D'),
    (';', r'\u003B'),
    (u'\u2028', r'\u2028'),
    (u'\u2029', r'\u2029')
)

# Escape every ASCII character with a value less than 32.
_js_escapes = (_base_js_escapes +
               tuple([('%c' % z, '\\u%04X' % z) for z in range(32)]))

def escapejs(value):
    """Hex encodes characters for use in JavaScript strings."""
    for bad, good in _js_escapes:
        value = mark_safe(force_unicode(value).replace(bad, good))
    return value
escapejs = allow_lazy(escapejs, unicode)

def conditional_escape(html):
    """
    Similar to escape(), except that it doesn't operate on pre-escaped strings.
    """
    if isinstance(html, SafeData):
        return html
    else:
        return escape(html)

and from django/utils/encoding.py:

def force_unicode(s, encoding='utf-8', strings_only=False, errors='strict'):
    """
    Similar to smart_unicode, except that lazy instances are resolved to
    strings, rather than kept as lazy objects.

    If strings_only is True, don't convert (some) non-string-like objects.
    """
    if strings_only and is_protected_type(s):
        return s
    try:
        if not isinstance(s, basestring,):
            if hasattr(s, '__unicode__'):
                s = unicode(s)
            else:
                try:
                    s = unicode(str(s), encoding, errors)
                except UnicodeEncodeError:
                    if not isinstance(s, Exception):
                        raise
                    # If we get to here, the caller has passed in an Exception
                    # subclass populated with non-ASCII data without special
                    # handling to display as a string. We need to handle this
                    # without raising a further exception. We do an
                    # approximation to what the Exception's standard str()
                    # output should be.
                    s = ' '.join([force_unicode(arg, encoding, strings_only,
                            errors) for arg in s])
        elif not isinstance(s, unicode):
            # Note: We use .decode() here, instead of unicode(s, encoding,
            # errors), so that if s is a SafeString, it ends up being a
            # SafeUnicode at the end.
            s = s.decode(encoding, errors)
    except UnicodeDecodeError, e:
        if not isinstance(s, Exception):
            raise DjangoUnicodeDecodeError(s, *e.args)
        else:
            # If we get to here, the caller has passed in an Exception
            # subclass populated with non-ASCII bytestring data without a
            # working unicode method. Try to handle this without raising a
            # further exception by individually forcing the exception args
            # to unicode.
            s = ' '.join([force_unicode(arg, encoding, strings_only,
                    errors) for arg in s])
    return s
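The Django 1.2 code above is Python 2. For readers on Python 3, here is a rough sketch of the same table-driven idea behind escapejs. The name js_escape_sketch is my own, not a Django function, and str.translate does in one pass what the quoted loop does with repeated replace calls.

```python
# Same characters as _base_js_escapes in the Django 1.2 source quoted above.
_JS_ESCAPES = {
    ord("\\"): "\\u005C",
    ord("'"): "\\u0027",
    ord('"'): "\\u0022",
    ord(">"): "\\u003E",
    ord("<"): "\\u003C",
    ord("&"): "\\u0026",
    ord("="): "\\u003D",
    ord("-"): "\\u002D",
    ord(";"): "\\u003B",
    ord("\u2028"): "\\u2028",
    ord("\u2029"): "\\u2029",
}

# Escape every ASCII character with a value less than 32.
_JS_ESCAPES.update((z, "\\u%04X" % z) for z in range(32))


def js_escape_sketch(value):
    """Replace risky characters with \\uXXXX escapes for a JS string literal."""
    return str(value).translate(_JS_ESCAPES)


print(js_escape_sketch("</script>"))
print(js_escape_sketch("it's"))
```

Like escapejs, this only makes a value safe inside a quoted JavaScript string literal; it does not make it safe to place elsewhere in a script.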
