
Introduction: I am trying to learn the basics of Directory Traversal.

Question: While trying to understand the 2-byte Unicode conversion of characters, I came across this SANS article that explains the Directory Traversal vulnerability. It states that / is represented as %C0%2F, but that the representation %C0%AF also works, which aids in a successful attack.

Can anyone please explain why both representations work? It would be of much help if the reason is explained at the binary level.

– DA12C 917

1 Answer

First, URL encoding, also known as percent-encoding, is a simple scheme in which %xx in a URL represents a byte (a number from 0-255), where each x is a hex digit (base 16: 0-9A-F; note 16 × 16 = 256, the number of distinct byte values).

Hence %C0%AF in a URL corresponds to putting the bytes C0 AF into the decoded URL, meaning byte 192 (1100 0000) followed by byte 175 (1010 1111), while %C0%2F corresponds to byte 192 (1100 0000) followed by byte 47 (0010 1111).
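
To see this concretely, here is a minimal Python sketch of percent-decoding (the helper `percent_decode` is made up for this illustration; real code would use a library routine such as `urllib.parse.unquote_to_bytes`):

```python
# Minimal sketch: each %xx pair in a URL becomes one raw byte.
def percent_decode(fragment):
    out = bytearray()
    i = 0
    while i < len(fragment):
        if fragment[i] == '%':
            out.append(int(fragment[i + 1:i + 3], 16))  # two hex digits -> one byte
            i += 3
        else:
            out.append(ord(fragment[i]))
            i += 1
    return bytes(out)

print(percent_decode('%C0%AF'))  # b'\xc0\xaf' -> bytes 192, 175
print(percent_decode('%C0%2F'))  # b'\xc0/'    -> bytes 192, 47
print(format(0xC0, '08b'), format(0xAF, '08b'), format(0x2F, '08b')
)  # 11000000 10101111 00101111
```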

Now ASCII only defines symbols for bytes 0-127. The most common extension to ASCII to allow special symbols (e.g., for non-English writers) is unicode. Unicode maps symbols like / to codepoints represented by numbers; e.g., / is the 47th codepoint (in hex 0x2f), π is the 960th codepoint (0x3c0), and ♥ is the 9829th codepoint (0x2665). Now for a unicode symbol to be put into a stream of bytes, it has to be encoded, and the most common encoding nowadays is UTF-8, as UTF-8 continues to encode ASCII characters in a single byte (8 bits), so it doesn't break the encoding of plain old ASCII documents. Note ASCII only defines 128 symbols (0-127), which all have their first bit as 0.
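
A quick Python sketch of that codepoint/encoding distinction (Python 3 strings are sequences of codepoints, and `.encode('utf-8')` produces the byte encoding):

```python
# Codepoints are just numbers; UTF-8 is one particular way to turn them into bytes.
for ch in ['/', 'π', '♥']:
    print(ch, ord(ch), hex(ord(ch)), ch.encode('utf-8'))
# /  47    0x2f    b'/'             (ASCII range: one byte, first bit 0)
# π  960   0x3c0   b'\xcf\x80'      (two-byte UTF-8 sequence)
# ♥  9829  0x2665  b'\xe2\x99\xa5'  (three-byte UTF-8 sequence)
```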

The way UTF-8 works is that normal ASCII characters are still encoded as usual using one byte, which a decoding application recognizes by noticing that the first bit is 0. If instead the first bit of the next byte to process is a 1, that indicates the next symbol is spread across a multi-byte sequence. The number of bytes is determined by the form of the first byte of that sequence (specifically, by the number of leading 1s before the first 0). For example, a first byte of the form 110x xxxx indicates a two-byte sequence; similarly, 1110 xxxx means the start of a three-byte sequence, 1111 0xxx a four-byte sequence, etc. Now, if you read the UTF-8 wikipedia page, you'll notice a two-byte sequence should have the form 110i jklm 10no pqrs to represent the unicode codepoint with the binary value ijk lmno pqrs, which in principle could be any binary number from 000 0000 0000 (0) to 111 1111 1111 (2047). In our first case (C0 AF), we have the bits 1100 0000 1010 1111, which represent the codepoint 000 0010 1111 = 47 = /. Note that 47 could also be represented more simply as the single ASCII character /, that is, as the bits 0010 1111. You may wonder why the second byte is defined to start with 10: UTF-8 requires this so a decoder can tell whether a byte is a continuation byte or the start of a new multibyte character, which helps catch errors.
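
Here is that bit arithmetic spelled out in a small Python sketch (this mimics a decoder that blindly stitches the payload bits together, with none of the validity checks discussed below):

```python
# Two bytes from %C0%AF, interpreted as a 110xxxxx 10xxxxxx sequence.
b1, b2 = 0xC0, 0xAF
print(format(b1, '08b'), format(b2, '08b'))               # 11000000 10101111
codepoint = ((b1 & 0b00011111) << 6) | (b2 & 0b00111111)  # 5 payload bits + 6 payload bits
print(codepoint, chr(codepoint))                          # 47 /  -- same codepoint as ASCII 0x2F
```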

So this seems to allow multiple ways to represent every unicode character. But this isn't allowed by the unicode standard: two-byte sequences are only supposed to encode values between 128 and 2047, so C0 AF shouldn't represent a / but should be treated as an error. However, unicode libraries are often designed to be fast, and the people writing them may not consider the security implications. Thus some libraries choose not to check that the value of a two-byte unicode character is in the valid range (even though the unicode standard forbids this). Or the developer decided that an application sending C0 AF most likely just produced malformed UTF-8 and meant to send 2F, so the decoder falls back on what seems like the most sensible and convenient behavior for the user (displaying a / seems more sensible than any other choice of character).
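
Python's built-in decoder is one example of a strict implementation, shown here only to contrast with the sloppy decoders described above:

```python
# A conforming UTF-8 decoder rejects the overlong two-byte form of '/'.
try:
    b'\xc0\xaf'.decode('utf-8')
except UnicodeDecodeError as e:
    print('rejected:', e)        # 0xC0 can never start a valid UTF-8 sequence
print(b'\x2f'.decode('utf-8'))   # '/' -- the only valid UTF-8 encoding of the slash
```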

Similarly, the even more flawed %C0%2F version also works with some bad unicode libraries, because many applications that decode unicode do not check that the first bit of the second byte is actually a 1, given that the previous byte already indicated a two-byte codepoint. That is, the bad decoder accepts 110i jklm ??no pqrs as a valid two-byte codepoint regardless of whether ?? is 10, as the UTF-8 standard mandates. The first two bits of the second byte are redundant, so a quick-and-dirty unicode decoder may simply not check that they have the proper value.
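
A sketch of what such a quick-and-dirty decoder effectively does (hypothetical code, not taken from any real library): it masks off the low six bits of the second byte without ever checking for the 10 prefix or the valid codepoint range:

```python
# Hypothetical sloppy decoder: trusts the lead byte and never validates
# the continuation byte's 10-prefix or the resulting codepoint's range.
def sloppy_decode_two_bytes(b1, b2):
    return chr(((b1 & 0b00011111) << 6) | (b2 & 0b00111111))

print(sloppy_decode_two_bytes(0xC0, 0xAF))  # '/' (overlong encoding, accepted anyway)
print(sloppy_decode_two_bytes(0xC0, 0x2F))  # '/' (no 10 prefix on the second byte, accepted anyway)
```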

So now we know why %C0%AF and %C0%2F both ultimately decode to the symbol / with unicode decoders that skip proper validation.

As for why this succeeds in allowing directory traversal: it often happens that the filtering of improper input and the decoding of unicode symbols are done at different stages of the application. The web server may be smart enough not to let someone navigate to http://www.example.com/../../../etc/shadow or even http://www.example.com/..%2f..%2f..%2fetc%2fshadow by filtering out improper symbols. However, if the unicode decoding is done after the check that prevents directory traversal, or is done slightly differently by the operating system, the encoded form may get past the filter and allow the attack to work.
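
As a toy illustration of that ordering problem (entirely hypothetical code; real servers are more involved), imagine a filter that percent-decodes the path and looks for ../ before any lenient UTF-8 decoding happens:

```python
import urllib.parse

def is_blocked(path):
    # Hypothetical filter: percent-decode, then look for '../'.
    # At this point %c0%af is still just the two raw bytes C0 AF.
    return '../' in urllib.parse.unquote(path, errors='replace')

for path in ['/files/../../etc/shadow',
             '/files/..%2f..%2fetc/shadow',
             '/files/..%c0%af..%c0%afetc/shadow']:
    print(path, '-> blocked' if is_blocked(path) else '-> passed the filter')
# The first two are blocked, but the %c0%af form never matches '../' here.
# If a later stage decodes those bytes leniently back into '/', the traversal works.
```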

For a more detailed accessible introduction to unicode, I recommend "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

– dr jimbob

  • Excellent answer! To put it as I understand it: 1. The ultimate culprit behind such vulnerabilities is improper validation of unicode! 2. The SANS article I referred to gives the representations the wrong way round, as it says, and I quote: "Therefore, to represent the character '/', you would use the representation '%c0%2f', since the character '/' is ASCII character 0x2f." and "Unfortunately, there seems to be a workaround to make it work on US systems. %c0%af = '/'". That suggests the opposite of what you've explained here! – DA12C 917 Jan 18 '14 at 12:44
  • 2
    @DA12C917 - The author of the SANS article is mistaken (it happens). In UTF-8 (the only scheme where a leading byte of `%c0` makes senes), says that continuation bytes have to start with `10`, meaning you have to add `0x80` to `0x2f` to make it `0xaf`. But anyhow, this is all behavior that is explicitly forbidden in the unicode standard. In UTF-8 the only way to encode `/` is with the single byte `2f`. `c0 af` should be an error (overlong representation) and similarly `c0 2f` should be an error (invalid byte representation as the continuation byte doesn't start with 10). – dr jimbob Jan 18 '14 at 16:58
  • See: http://en.wikipedia.org/wiki/UTF-8#Overlong_encodings and the following wikipedia section on invalid byte sequences. – dr jimbob Jan 18 '14 at 16:58
  • fabulous answer! – Eugene Oct 18 '21 at 17:48