First, URL encoding, also known as percent-encoding, is a simple scheme where %xx in a URL represents a single byte (a number from 0-255), with each x being a hex digit (base 16: 0-9 and A-F; note 16*16 = 256, the number of different byte values). Hence %C0%AF in a URL corresponds to putting the bytes C0 AF into the decoded URL, meaning byte 192 (1100 0000) and byte 175 (1010 1111), while %C0%2F corresponds to byte 192 (1100 0000) and byte 47 (0010 1111).
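To make the byte-level view concrete, here is a minimal sketch in Python of doing the percent-decoding by hand (the standard library's urllib.parse.unquote_to_bytes does the same job); the two example strings are just the ones discussed above.

```python
def percent_decode(url_path: str) -> bytes:
    """Turn %xx escapes into raw bytes; other characters pass through as ASCII."""
    out = bytearray()
    i = 0
    while i < len(url_path):
        if url_path[i] == "%" and i + 2 < len(url_path):
            out.append(int(url_path[i + 1:i + 3], 16))  # two hex digits -> one byte
            i += 3
        else:
            out.append(ord(url_path[i]))
            i += 1
    return bytes(out)

print(percent_decode("%C0%AF").hex(" "))  # c0 af -> bytes 192 and 175
print(percent_decode("%C0%2F").hex(" "))  # c0 2f -> bytes 192 and 47
```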
Now, ASCII only defines symbols for bytes 0-127. The most common extension of ASCII to allow additional symbols (e.g., for non-English writers) is Unicode. Unicode maps symbols to codepoints, which are just numbers; e.g., / is the 47th codepoint (0x2f in hex), π is the 960th codepoint (0x3c0), and ♥ is the 9829th codepoint (0x2665). Now, for a Unicode symbol to be put into a stream of bytes it has to be encoded, and the most common encoding nowadays is UTF-8, since UTF-8 continues to encode ASCII characters in a single byte (8 bits) and so doesn't screw up the encoding of plain old ASCII documents. Note that ASCII only defines 128 symbols (0-127), which all have the first bit as 0.
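If you want to check these numbers yourself, a couple of lines of Python print the codepoints and their UTF-8 byte sequences:

```python
for ch in "/π♥":
    print(ch, ord(ch), hex(ord(ch)), ch.encode("utf-8").hex(" "))
# / 47   0x2f   2f        (plain ASCII: one byte)
# π 960  0x3c0  cf 80     (two-byte UTF-8 sequence)
# ♥ 9829 0x2665 e2 99 a5  (three-byte UTF-8 sequence)
```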
The way UTF-8 works is that normal ASCII characters are still encoded as usual in one byte, which a decoding application recognizes by noticing the first bit is 0. When the decoder gets to the next byte to process, a first bit of 1 indicates that the next symbol is spread across a multi-byte sequence. The number of bytes is determined by the form of the first byte (specifically, the number of leading 1s before the first 0). For example, if the first byte of the multi-byte sequence has the form 110x xxxx, the next symbol is represented with two bytes; similarly, 1110 xxxx starts a three-byte sequence, 1111 0xxx a four-byte sequence, etc.
Now, if you read the UTF-8 Wikipedia page, you'll notice a two-byte sequence should have the form 110i jklm 10no pqrs to represent the Unicode codepoint with the binary value ijk lmno pqrs, which in principle could be any binary number from 000 0000 0000 (0) to 111 1111 1111 (2047). In our first case (C0 AF), we have the bits 1100 0000 1010 1111, which represents the codepoint 000 0010 1111 = 47 = /. Note that 47 could also be represented more simply as the single ASCII character /, that is, as the byte 0010 1111. You may wonder why the second byte is defined to start with 10: UTF-8 put this in so a decoder can tell whether a byte is a continuation byte or the start of a new multi-byte UTF-8 character, which helps catch errors.
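To see how a sloppy decoder ends up here, below is a hypothetical, deliberately naive two-byte decoder sketch (not any particular library's code): it combines the payload bits exactly as described above, but never checks that the result is at least 128 or that the second byte really starts with 10, so the overlong sequence C0 AF comes out as /.

```python
def naive_decode_two_byte(b1: int, b2: int) -> str:
    """Hypothetical careless decoder: combine the payload bits of a two-byte
    sequence without validating the range or the continuation-byte prefix."""
    codepoint = ((b1 & 0b0001_1111) << 6) | (b2 & 0b0011_1111)
    # A correct decoder would also require codepoint >= 0x80 (reject overlong
    # forms) and (b2 & 0b1100_0000) == 0b1000_0000 (a proper 10xxxxxx byte).
    return chr(codepoint)

print(hex(ord("/")))                      # 0x2f, i.e. the one-byte form 0010 1111
print(naive_decode_two_byte(0xC0, 0xAF))  # '/' -- the overlong two-byte form slips through
```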
So this seems to allow multiple ways to represent every Unicode character. But that isn't allowed by the standard: two-byte sequences are only supposed to encode values between 128 and 2047, so C0 AF shouldn't represent a / but should be rejected as an error. However, Unicode libraries are often designed for speed, and their authors may not consider the security implications. Thus some libraries may choose not to check that the value of a two-byte character falls in the valid range (even though the standard forbids these overlong forms). Or a developer may decide that an application sending C0 AF most likely meant to send 2F, and fall back on the behavior most convenient for the user (displaying a / seems more sensible than any other choice of character).
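For contrast, a strict decoder such as Python's built-in UTF-8 codec rejects the overlong sequence outright rather than quietly producing a /:

```python
try:
    print(b"\xc0\xaf".decode("utf-8"))
except UnicodeDecodeError as e:
    print("rejected:", e)  # strict decoders refuse overlong sequences like C0 AF

# Even a lenient error handler substitutes U+FFFD instead of producing '/':
print(b"\xc0\xaf".decode("utf-8", errors="replace"))  # two replacement characters, not '/'
```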
Similarly, the even more malformed %C0%2F version also works in some bad Unicode libraries, because many applications that decode UTF-8 do not check that the second byte actually starts with 10, since the previous byte already indicated it was a two-byte codepoint. That is, the bad decoder accepts 110i jklm ??no pqrs as a valid two-byte sequence regardless of whether ?? is 10, as the UTF-8 standard mandates. Those first two bits of the second byte carry no payload, so a quick-and-dirty decoder may simply mask them off and never check that they have the proper value.
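The naive decoder sketched earlier behaves exactly this way, because it masks off the top two bits of the second byte instead of checking them:

```python
print(naive_decode_two_byte(0xC0, 0x2F))  # also '/' -- 0x2F & 0b0011_1111 == 47
```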
So now we know why %C0%AF and %C0%2F both ultimately decode to the symbol / in Unicode decoders that skip proper checking.
As for why this succeeds in allowing directory traversal: it often happens that filtering out improper input and decoding the Unicode are done at different stages of the application. The web server may be smart enough not to let someone navigate to http://www.example.com/../../../etc/shadow or even http://www.example.com/..%2f..%2f..%2fetc%2fshadow by filtering out the improper path segments. However, if the Unicode decoding is done after the check that prevents directory traversal, or is done slightly differently by the operating system when it resolves the path, the %C0%AF form may slip past the filter and the attack works.
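Here is a hypothetical sketch of that ordering bug (the function and path names are made up for illustration): the traversal check runs on the raw URL, and only afterwards does a permissive decoder turn %C0%AF into /, so the check never sees the ../ it was written to block.

```python
from urllib.parse import unquote_to_bytes

def lenient_utf8_decode(data: bytes) -> str:
    """Stand-in for a permissive decoder that maps overlong C0 xx sequences to ASCII."""
    out, i = [], 0
    while i < len(data):
        if data[i] == 0xC0 and i + 1 < len(data):       # overlong two-byte form
            out.append(chr(data[i + 1] & 0b0011_1111))  # keep only the payload bits
            i += 2
        else:
            out.append(chr(data[i]))
            i += 1
    return "".join(out)

def serve(url_path: str) -> str:
    # 1. The security check runs on the still-encoded path, so it sees no "../".
    if "../" in url_path or "..%2f" in url_path.lower():
        raise PermissionError("directory traversal blocked")
    # 2. Decoding happens afterwards, resurrecting the slashes.
    decoded = lenient_utf8_decode(unquote_to_bytes(url_path))
    return decoded  # a real server would now open this path on disk

print(serve("/static/..%C0%AF..%C0%AF..%C0%AFetc%C0%AFshadow"))
# -> /static/../../../etc/shadow   (the filter above never saw a "../")
```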
For a more detailed but accessible introduction to Unicode, I recommend "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".