Canonicalization & Output Encoding

Question

I'm reading OWASP's Secure Coding Practices Checklist and under their "Input Validation" section they have an item that reads:

If any potentially hazardous characters (<>"'%()&+\\'\") must be allowed as input, be sure you implement additional controls like output encoding. Utilize canonicalization to address double encoding or other forms of obfuscation attacks.

What is "output encoding", and can someone provide a concrete example of how a validation routine could make use of it?
What is "double encoding", and why is it an "obfuscation attack"?
What is "canonicalization" and why does it prevent against double encoding?

For the third one, I found a rather vague definition for canonicalization provided by OWASP: The reduction of various data encodings to a single, simple form. But that definition doesn't really help me make sense of what they're talking about.

I'm strong with Java and Python but could follow an example in any language. I'm just trying to visualize what they're talking about here and am having a tough time seeing the "forest through the trees." Thanks in advance!

+1 Good question and good answers. Output encoding (a.k.a. output filtering) is often overlooked as a security requirement. The community tends to focus on input validation (too much focus, IMO) at the expense of output filtering. One nit pick: output encoding and validation are two independent concepts. A validation routine **should not use output encoding**, because doing so would be mixing two different concerns and misunderstanding the purpose of each. Data should be validated when it is inputted (and before persistence, ideally), and data should be encoded (filtered) when it is outputted. — Mark E. Haase, Aug 09 '12 at 15:18

mhswende · Answer 1 · 2012-08-09T18:27:03.403

What is "output encoding", and can someone provide a concrete example of how a validation routine could make use of it?

Output encoding means that the data is encoded appropriately for the context into which it is being placed. For example, say you want to dynamically display a name from an untrusted source: Your name is: <b>Foo bar</b>. If the name contains HTML characters, you want those to be encoded, so the result is <b>Foo &lt;i&gt; Bar</b> instead of <b>Foo <i> Bar</b>.

So, converting < to &lt; is an example of HTML encoding. However, if the context is an HTML attribute, you may also have to encode space characters, since an attribute may be unquoted, and a space may thus end the attribute value and let the input create a new attribute: <input value=data> is attacked with: <input value=data onclick=javascript:alert(1)/>
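To make the HTML case concrete in Python, here is a minimal sketch using the standard library's html.escape. The helper names render_name and render_input are made up for illustration. Note that the sketch always quotes the attribute and encodes quote characters, which sidesteps the unquoted-attribute problem instead of encoding spaces.

```python
import html


def render_name(name: str) -> str:
    """Encode untrusted text for placement inside an HTML element body."""
    return "Your name is: <b>" + html.escape(name) + "</b>"


def render_input(value: str) -> str:
    """Encode untrusted text for a double-quoted HTML attribute value."""
    # quote=True also encodes " and ', so the value cannot close the attribute.
    return '<input value="' + html.escape(value, quote=True) + '">'


print(render_name("Foo <i> Bar"))
print(render_input('data" onclick="alert(1)'))
```

The encoding happens at the point of output, and the function used depends on where the data lands (element body versus attribute), not on where the data came from.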

What is "double encoding", and why is it an "obfuscation attack"?

When you type certain characters into a URL, they become URL-encoded (usually, though not always in IE):

Not encoded parameter: test<script>alert(1)</script>
URL-encoded parameter: test%3Cscript%3Ealert%281%29%3C%2fscript%3E
Double-encoded parameter: test%253Cscript%253Ealert%25281%2529%253C%252fscript%253E

Depending on how input parameters are handled, double-encoded input may pass through some filters/validators and wind up breaking out of the context where it is echoed (thus leading to XSS).
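Here is a rough Python sketch of how that can happen. Both naive_filter_allows and later_component are hypothetical stand-ins: the first is a blacklist that decodes once before checking, the second is some downstream code that ends up decoding the value a second time before echoing it.

```python
from urllib.parse import quote, unquote

payload = "<script>alert(1)</script>"
once = quote(payload, safe="")   # single URL encoding
twice = quote(once, safe="")     # double encoding: every % becomes %25


def naive_filter_allows(value: str) -> bool:
    """Hypothetical blacklist: decode one layer, then reject angle brackets."""
    decoded = unquote(value)
    return "<" not in decoded and ">" not in decoded


def later_component(value: str) -> str:
    """Hypothetical downstream code that decodes two layers before echoing."""
    return unquote(unquote(value))


print(naive_filter_allows(once))    # the singly encoded payload is caught
print(naive_filter_allows(twice))   # the doubly encoded payload slips through
print(later_component(twice))       # and comes out as live markup later on
```

The filter and the consumer disagree about how many layers of encoding there are, and that disagreement is the whole attack.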

What is "canonicalization" and why does it prevent against double encoding?

Canonicalization is the act of writing something in its simplest form; the canonical form of something is the "simplest" way to write it. To canonicalize in this context means un-encoding data until it no longer changes.

A triple-encoded < sign goes through the following transformations:

%25253C
%253C
%3C
<

Other examples are input written as octal escapes, overlong UTF-8 sequences, or esoteric encodings such as UTF-7. Canonicalization converts these into a common base, for the sake of disambiguation.
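The "un-encode until it stops changing" idea can be sketched in a few lines of Python for the URL-encoding case. The function name canonicalize and the max_rounds limit are my own choices for illustration, and this only handles percent-encoding, not the other encodings mentioned above.

```python
from urllib.parse import unquote


def canonicalize(value: str, max_rounds: int = 10) -> str:
    """Percent-decode repeatedly until the value no longer changes."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            return value  # stable: this is the canonical form
        value = decoded
    # Assumption: very deep nesting is treated as hostile, not decoded forever.
    raise ValueError("too many nested encoding layers")


print(canonicalize("%25253C"))
```

Validation would then run on the canonical value, so a filter looking for < sees it no matter how many layers it was wrapped in. A stricter variant could reject any input that needed more than one round of decoding.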

score 3 · Accepted Answer · answered Aug 09 '12 at 08:33

I think the best way to describe canonicalization is to remember that it stems from canon, meaning an authentic piece of writing. What they're talking about is taking untrusted data and formatting it as an unambiguous representation, such that it can never be misrepresented by any software process.

The first step is to take your input and store it somewhere. Your input might be encoded as ASCII, UTF-8, UTF-16, or any number of other encoding schemes. The software must detect this and appropriately convert and store the data in a single format. It is now in a single unambiguous format, and therefore known to be correct when interpreted as such, i.e. it is canon. This allows for absolute certainty when later outputting the data.
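As a rough Python illustration of that first step, the sketch below decodes incoming bytes into a single internal form (a Python str) and refuses anything that is not valid in the stated encoding. The function name to_text is hypothetical, and it assumes the encoding is declared by the caller instead of being detected.

```python
def to_text(raw: bytes, declared_encoding: str) -> str:
    """Decode bytes into one internal representation, rejecting invalid input."""
    # errors="strict" raises UnicodeDecodeError instead of guessing.
    return raw.decode(declared_encoding, errors="strict")


print(to_text("Zoë".encode("utf-16"), "utf-16"))

try:
    # 0xC0 0xBC is an overlong two-byte form of "<"; a strict UTF-8
    # decoder must not accept it as that character.
    to_text(b"\xc0\xbc", "utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

Once everything is a str, later code has one representation to reason about, and it can be written back out as UTF-8 (or whatever the storage layer expects) in a single place.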

For example, if I insert '; DROP TABLE users; -- into a form, it might cause an SQL injection if the app is poorly written. However, with canonicalization, the data is only data, and cannot possibly be represented as part of an SQL query. In reality, SQL's form of canonicalization is parameterized queries. Furthermore, steps must be taken to convert text encoding to a single known type, so that only valid codepoints are stored. If this is not done, a codepoint may be misinterpreted as a different character.
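For the SQL case, a minimal Python sketch with the standard library's sqlite3 module shows what "the data is only data" looks like in practice. The table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

hostile = "'; DROP TABLE users; --"

# The ? placeholder sends the value separately from the SQL text,
# so it is never parsed as part of the statement.
conn.execute("INSERT INTO users (name) VALUES (?)", (hostile,))

rows = conn.execute("SELECT name FROM users").fetchall()
print(rows)
```

The hostile string is stored verbatim as a name and the users table is still there afterwards.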

A similar example can be given for output into HTML. If the database contains <script>alert('xss!');</script>, then a naive app might just write that to the page directly and introduce a security issue. However, with proper canonicalization in the form of output encoding, we'd get &lt;script&gt;alert('xss!');&lt;/script&gt;, which a browser cannot misinterpret.

Double encoding is a trick used to fool certain parsers. The attacker identifies the encoding you're using, then pre-encodes their data in this format. The parser wrongly assumes the data to be canon, and handles it as such. The result is that the data is mishandled, such that an exploit takes place. It's an obfuscation attack, because the attacker is obfuscating exploit data, such that the encoder doesn't see bad characters.
