How did Anonymous use UTF-16 ASCII to fool PHP escaping?

Question

A few months ago, Anonymous took down a child pornography site using SQL injection. I read in this article that Anonymous claimed that "the server was using hardened PHP with escaping," but they were able to "bypass it with UTF-16 ASCII encoding." What exactly does that mean they did? How do I protect my site from a similar attack?

First make sure admins and devs are all security-aware people. The weakest link will get your server owned in a blink of an eye, assuming a targeted attack. — Aki, Feb 05 '12 at 20:50

score 27 · Accepted Answer · edited Feb 07 '12 at 20:32

First of all "UTF-16 ASCII encoding" is a contradiction, since UTF-16 and ASCII are mutually-exclusive encoding schemes. But presumably he's just referring to using Unicode to bypass filtering mechanisms.

The general principle is this: we often think of characters encoded in ASCII -- "A" is the number 65, "z" is the number 122. But that's not the only character encoding scheme; because the world uses more than just the English alphabet, we need to represent far more characters than that. Hence, Unicode, which has representations for pretty much every character in every language ever written, from Sinhala to Klingon.

Representing all those characters (approx. 1.1 million possible, not all in use) in a numeric form is a real challenge. You could use 32 bits, but that's a waste of space since 3 of the 4 bytes are usually zero. You could use a variable length, but then you can't do constant-time substring operations. So a number of standards exist, one of which is UTF-16 (which, as you probably guessed, uses 16-bit code units).

Not all programmers are used to the idea of dealing with multiple character sets, even though the underlying framework will often support them. So sometimes filtering rules or precautions will be established using the assumption that characters will be represented in UTF-8 or ASCII, which they usually are.

So the filter looks for a given string, like \" for example, which in ASCII and UTF-8 corresponds to the byte pattern {92,34}. But in UTF-16 (big-endian here) it looks different; it's actually {0,92,0,34}, which is just different enough to slip by a filter that wasn't expecting it.

And while the filter doesn't understand UTF-16, the underlying framework does, so the content gets normalized and interpreted just the same as anything else, allowing the query to continue unfiltered.
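The byte patterns described above can be checked directly. What follows is only a minimal sketch, assuming the mbstring extension is loaded; `naive_filter_blocks` is a hypothetical stand-in for a byte-oriented filter, not code from the site in question, and big-endian UTF-16 is assumed so that the bytes match the {0,92,0,34} pattern.

```php
<?php
// Hypothetical byte-oriented filter: flags input containing the two-byte
// sequence backslash + double quote (bytes 92, 34).
function naive_filter_blocks(string $input): bool
{
    return strpos($input, '\\"') !== false;
}

$ascii = 'x\\"y';                                           // bytes: 78 5c 22 79
$utf16 = mb_convert_encoding($ascii, 'UTF-16BE', 'UTF-8');  // bytes: 00 78 00 5c 00 22 00 79

// The pattern on its own, re-encoded: 00 5c 00 22
$hex = bin2hex(mb_convert_encoding('\\"', 'UTF-16BE', 'UTF-8'));

$blocked_ascii = naive_filter_blocks($ascii);  // true: 5c 22 is present
$blocked_utf16 = naive_filter_blocks($utf16);  // false: 5c is followed by 00, not 22

// Any later layer that decodes UTF-16 recovers the original characters,
// after the filter has already let the input through.
$decoded = mb_convert_encoding($utf16, 'UTF-8', 'UTF-16BE');

var_dump($hex, $blocked_ascii, $blocked_utf16, $decoded === $ascii);
```

Whether a real stack ever performs that decoding step depends on the framework and database driver in use, which the article does not specify.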

EDIT TO ADD:
Note that PHP is exceptionally poor at handling character encodings; and if anything, that's understating the issue. PHP by default treats all strings as ASCII, meaning internal functions such as strstr and preg_replace simply assume that all strings are ASCII-encoded. If that sounds dangerously inadequate, that's because it is. But in their defense, remember that PHP predates UTF-16 by about a year, and all this is supposedly fixed in PHP version 6.

In the meantime, the mbstring library was created to address this deficiency, but it's neither widely deployed nor understood. If you're lucky enough to have this extension available to you, you can use mbstring.func_overload in your php.ini file to force internal string-processing functions to be replaced with multibyte-aware alternatives. This can also be activated using the php_admin_value directive in your Apache configuration.

Another useful function is mb_internal_encoding, which sets the encoding used internally by PHP to represent strings. By using a unicode-compatible internal encoding, you may alleviate some nastiness. At least one reference I read (but unfortunately can't find now) suggests that by setting the internal encoding to UTF-8, you enable additional processing on inbound strings that normalizes them to a single encoding. On the other hand, at least one other reference suggests that PHP behaves as stupidly as possible in this regard, and simply slurps data down unmodified irrespective of its encoding, and lets you deal with the aftermath. While the former makes more sense, with what I know about PHP, I think the latter is just as likely.
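Whichever reference is right about mb_internal_encoding, an application can check inbound data explicitly instead of relying on PHP to normalize it. The following is a sketch only, assuming the mbstring extension is loaded and that the application has standardized on UTF-8; `is_acceptable_utf8` is a hypothetical helper name. Note that a UTF-16 payload made of ASCII characters is still valid UTF-8 (NUL bytes are legal), so the sketch also rejects NUL bytes, which is an assumption about what the application considers acceptable.

```php
<?php
// Hypothetical helper: accept only well-formed UTF-8 that contains no NUL bytes.
function is_acceptable_utf8(string $input): bool
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        return false;                       // malformed byte sequence
    }
    return strpos($input, "\0") === false;  // NUL bytes hint at UTF-16/UTF-32 input
}

$ok_plain   = is_acceptable_utf8('hello');             // true
$ok_invalid = is_acceptable_utf8("\xbf\x27");          // false: stray continuation byte
$ok_utf16   = is_acceptable_utf8("\x00\x5c\x00\x22");  // false: contains NUL bytes

var_dump($ok_plain, $ok_invalid, $ok_utf16);
```

A check like this belongs before any filtering or escaping, so that every later step can assume a single encoding.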

A final alternative, and I mention this only partly in jest, is to just not use PHP and instead adopt a better-designed architecture. It's hard to come up with a framework this popular that has as many fundamental problems as PHP does. The language, the implementation, the development team, the plugin architecture, the security model -- it really is a shame that PHP is as widely deployed as it is. But this is, of course, just an opinion.

-1 if I could give it for "To protect yourself, use the latest updated software.", that is a completely inadequate advice. To protect yourself use one single encoding, make sure to filter anything illegal in that encoding, and make sure that every tool you use expect only that encoding. Encoding confusion holes are almost always caused by the custom code or setup, there is nothing updated tools can do about that. — aaaaaaaaaaaa, Feb 04 '12 at 10:21
Informative, comprehensive and thorough answer. There's no advice to give actually, it depends on your situation. As a web developer you should be careful and aware of this issue, as an admin you should stay up to date and regularly check for security issues concerning the software you're using. There's more, but I don't think either of the above advice is wrong, I just want to let eBusiness know that tylerl didn't give an inadequate advice. — Aki, Feb 05 '12 at 20:48
Klingon isn't in Unicode; see http://unicode.org/alloc/nonapprovals.html ... it was proposed and rejected. There are unofficial encodings of it in a private use area, but it's not part of Unicode. — derobert, Feb 07 '12 at 23:25
@derobert: What!! May I be the first to say, *Dochvam lo'laHbe'ghach!* — tylerl, Feb 07 '12 at 23:38
Technically, PHP doesn't treat encodings at all. PHP doesn't expect strings to be *ASCII*, it expects *byte arrays*. Strings have no encoding in PHP. Some functions that half-intelligently do something with strings, like case-insensitive functions, only do their case insensitivity thing for ASCII characters. But other than that, PHP is only working on and comparing bytes, period. It's up to the programmer to know what that means. — deceze, Feb 09 '12 at 01:47

score 4 · Answer 2 · answered Feb 05 '12 at 09:46

I have no idea if this is the method that Anonymous used, but have a look at http://bugs.mysql.com/bug.php?id=22243

It appears there was a bug in Connector.Net (MySQL's managed .Net driver). From the linked bug report:

.net strings are a encoded in UTF-16. Strings are converted to Windows-1252 (SBCS Encoding) to be sent over the network and during this conversion unicode characters that may not have been checked for will "become" single quotes.

The bug report goes on to list a string containing the problem Unicode character and say:

Specifically, In the second string the problem quote is unicode character 8242 ("\u8242"). When this string is received by the server the quote will be a single quote (ASCII 96) and break the query and could be used as a sql injection attack.

The linked bug was marked as a duplicate of a bug fixed back in 2009, but it's entirely possible that the exploited server was running an older version of MySQL that had this issue.

That's for a Windows server. The question didn't mention the server OS. Windows is not likely, but not impossible either. No matter the OS, this is an interesting example. — SPRBRN, May 01 '14 at 09:47

score 3 · Answer 3 · answered Feb 05 '12 at 21:05

I came to the conclusion from the article that the site was depending more on 'hardening' techniques than on good SQL input filtering/escaping. There is no proof that the kiddie porn site's SQL code was not flawed.

Bypassing so-called hardened PHP filters is often quite trivial. ModSecurity, for example, can be bypassed quite easily, and attackers constantly use a number of methods to walk around such input filters.

There are also filters, included in the website code as plugins, that do not urldecode the input correctly before checking it for malicious content.

For example: %5e

as seen in:

id=0%5E(select%20position(0x61%20in%20(select%20id%20from%20users%20where%20num=1))=1)

By playing with byte sequences such as "%bf%5c%27", "%bf%27", "%ef%bb%bf", "%8c%5c", it is possible to bypass so-called hardening and trigger the injection.
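One well-known reading of the "%bf%27" and "%bf%5c%27" sequences is the multibyte escaping problem: a byte-wise escaping function inserts a backslash, and a multibyte character set then swallows that backslash as the second byte of a character. The sketch below assumes the database connection interprets bytes as GBK (the article does not say which character set the site used) and that mbstring is loaded; CP936 is the name mbstring uses for the GBK code page.

```php
<?php
$input   = urldecode('%bf%27');   // bytes: bf 27 (a stray byte, then a single quote)
$escaped = addslashes($input);    // bytes: bf 5c 27 (backslash inserted before the quote)
$hex     = bin2hex($escaped);

// Read as GBK, bf 5c forms one two-byte character, so the backslash no longer
// escapes anything and the quote that follows stands on its own.
$char_count = mb_strlen($escaped, 'CP936');
$last_char  = mb_substr($escaped, -1, 1, 'CP936');

var_dump($hex, $char_count, $last_char);
```

The same bytes read as ASCII or UTF-8 keep the backslash and the quote together, which is why the escaping looks correct to code that assumes a single-byte encoding.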

Even worse are the whitelist filters that recursively update the $_GET with whitelist allowed characters such as:

$cleansed = preg_replace( "/[^\s{}a-z0-9_\.\-]/i", "", urldecode( $get ) );

Then consider this: id=-1%20ui*o+s|e|l|e|c|t+1,^2,*3,[4,[5,]6,]7,<8,<9,>10>

While urldecoding before filtering is a good idea, it is pointless here: the filter strips out the disallowed characters and in doing so delivers the injection vector in its raw form.

In fact, this method can enhance the attacker's ability to bypass so-called PHP hardeners and filtering modules like ModSecurity.

In the end, the request is crafted specifically to bypass the input filtering. Once those defenses are bypassed, though, the site code itself must still have faulty DB input handling for the injection vectors to trigger, regardless of the claims of the attackers (in this instance, Anonymous).
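The effect can be reproduced with the pattern quoted above. This is a sketch under assumptions: `cleanse` wraps that pattern, and `keyword_check_blocks` is a hypothetical stand-in for an upstream filter that looks for the literal keyword; neither is taken from the attacked site.

```php
<?php
// Wraps the whitelist pattern quoted above.
function cleanse(string $get): string
{
    return preg_replace('/[^\s{}a-z0-9_\.\-]/i', '', urldecode($get));
}

// Hypothetical upstream filter that looks for the literal keyword.
function keyword_check_blocks(string $raw): bool
{
    return stripos($raw, 'select') !== false;
}

$raw = '-1%20s|e|l|e|c|t+1';

$blocked_before = keyword_check_blocks($raw);       // false: no literal "select" yet
$cleansed       = cleanse($raw);                    // the pipes are stripped out
$blocked_after  = keyword_check_blocks($cleansed);  // true, but the check already ran

var_dump($blocked_before, $cleansed, $blocked_after);
```

Here the cleansing step itself assembles the keyword that the earlier check was meant to catch.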
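For contrast, DB input handling that does not depend on filtering or escaping at all might look like the sketch below. It is an illustration, not the site's code: it assumes the pdo_sqlite extension and uses an in-memory SQLite database, and the `users` table and `find_user` function are hypothetical. The value is bound as a parameter, so it is never parsed as SQL whatever bytes it contains.

```php
<?php
// In-memory database with a hypothetical users table, for illustration only.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)');
$pdo->exec("INSERT INTO users (id, name) VALUES (1, 'alice'), (2, 'bob')");

// The query text is fixed; the user-supplied value travels separately as a parameter.
function find_user(PDO $pdo, string $id): array
{
    $stmt = $pdo->prepare('SELECT id, name FROM users WHERE id = ?');
    $stmt->execute([$id]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

$rows   = find_user($pdo, '1');         // one row: alice
$attack = find_user($pdo, '0 OR 1=1');  // no rows: the whole string is compared as a value

var_dump(count($rows), count($attack));
```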

score 0 · Answer 4 · answered Feb 05 '12 at 12:20

Just a wild guess. They can encode ASCII strings in UTF-16, this way, routines that were maybe used to check for dangerous user input were fooled/bypassed. The string was then interpreted and the malicious input didn't get filtered.

This sounds like the developers used unsafe coding practices, or some libraries/applications were outdated and therefore dangerous. It's not like anonymous hackers/script kiddies have any bypass magic; it's all about experimenting.

Mostly, if they have 0days or some new techniques to hack everything, they would not let people know about it. They are often using old school techniques that keep working because of the incompetence of some programmers/admins. Security is important.
