PHP - How to do safe input filtering in the light of multibyte encoding vulnerabilities?

Question

For days now I have been trying to get my head around how to write a secure web application in PHP, and it turns out to be particularly difficult. The more I read, the deeper I sink into swamps full of vulnerabilities that are not mentioned by kind people like Matt Robinson or Chris Shiflett.

For a few examples, take:

http://sirdarckcat.blogspot.de/2009/10/couple-of-unicode-issues-on-php-and.html
https://www.owasp.org/index.php/Double_Encoding
Also see character consumption as in Rook's answer

In short, I see the following problems:

When filtering input, it is not very clear how that data will be decoded later, and thus character encoding and escaping systems can bypass the input filtering. (such as double url decoding)
When escaping output, one uses standard functions such as htmlspecialchars. It is all nice that htmlspecialchars has an encoding parameter, however that doesn't prevent you from sending UTF-16 input to it, and that could probably break the security value of the function.

It seems there is an mbstring module in PHP, but if it is even vaguely as secure as its documentation is comprehensible, then it'll probably be useless even if I can figure out how to use it. Just a sample of the docs for illustration:

mbstring.strict_detection boolean

    Enables the strict encoding detection.

Great, that's helpful.

Unfortunately the functions also depend on what you set in the configuration options... There seems to be a function called mb_convert_encoding, but the docs say nothing about the security aspect, and it seems you need to know the input encoding (a no-go zone for security). There is also mb_check_encoding, which seems to be meant for the purpose, but reading the user comments on the docs doesn't exactly inspire confidence.

So the question is, in the light of all this, how do you do secure input filtering? Something like this?

1. mb_convert_encoding to UTF-8
2. mb_check_encoding to reject invalid input
3. loop urldecode until the string stops changing
4. Do your normal input filtering with text comparison, regex, etc.

edit: note that 3 is problematic because your normal input filtering might introduce entities again that can be url decoded

edit
I found a partial answer here, from Shiflett. It seems that for htmlspecialchars, using its encoding parameter and setting the character encoding header sent to the browser to the same value prevents the browser from interpreting characters differently than htmlspecialchars does. That all assumes that the input to htmlspecialchars is valid for the given encoding, or that for every possible invalid input htmlspecialchars interprets the string in exactly the same way as every browser does. We know that if we find no way of sanitizing our input, we cannot ensure that the input to htmlspecialchars is validly encoded, because an attacker might cook up a string with invalid encoding. This leads us to the second possibility, that htmlspecialchars behaves identically to the browser for all possible inputs. This is a problem, since we cannot use the same tokenizer for escaping and for rendering: one happens on the server and the other in the browser.
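As a hedged illustration of the point about matching encodings, the sketch below declares UTF-8 in the response header and passes the same encoding to htmlspecialchars. The helper name `escape_html` is hypothetical. It also shows what htmlspecialchars does with input that is not valid UTF-8: without ENT_SUBSTITUTE or ENT_IGNORE it returns an empty string, and with ENT_SUBSTITUTE the invalid bytes are replaced, so in neither case do the raw invalid bytes reach the browser.

```php
<?php
// Hypothetical helper: escape a value for HTML text or a quoted attribute,
// always using the same encoding that is declared to the browser.
function escape_html(string $value): string
{
    // ENT_QUOTES escapes both quote styles; ENT_SUBSTITUTE replaces invalid
    // UTF-8 sequences instead of making the function return an empty string.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Declare the same encoding to the browser (only possible before output starts).
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

$invalid = "\xbf'"; // a stray byte that is not valid UTF-8, then a quote

// Without ENT_SUBSTITUTE / ENT_IGNORE, invalid input yields an empty string.
$strict = htmlspecialchars($invalid, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE, the stray byte is replaced and the quote is still escaped.
$substituted = escape_html($invalid);
```

This does not prove that every browser agrees with htmlspecialchars on every malformed sequence; it only shows that malformed sequences can be kept out of the output in the first place.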

This is all similar to what mysql_real_escape_string does for a database, although I think you can properly solve this problem for MySQL by using prepared statements instead.

A third problematic output is the filesystem when using php to do file uploads or other file system manipulations. About this last one there seems to be very little information available. I don't even know of a specific escape function, let alone one that is robust when it gets crooked input.
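One way to sidestep the filesystem question, sketched here under stated assumptions, is to never use the client-supplied file name as part of a path at all. The function name `storage_name_for_upload` and the extension whitelist are hypothetical; the idea is to reject names that are not valid UTF-8 or that contain separators or control bytes, and to store the file under a random name.

```php
<?php
// Hypothetical helper: derive the on-disk name for an uploaded file without
// ever using the client-supplied name as part of a path.
// Returns null when the client name is rejected.
function storage_name_for_upload(string $clientName, array $allowedExtensions): ?string
{
    // Reject anything that is not valid UTF-8 (the empty pattern with the
    // "u" modifier only matches when the subject is well-formed UTF-8).
    if (preg_match('//u', $clientName) !== 1) {
        return null;
    }
    // Reject path separators and control characters, including NUL.
    if (preg_match('#[/\\\\\x00-\x1f]#', $clientName) === 1) {
        return null;
    }
    // Take the extension from the last dot and check it against a whitelist.
    $dot = strrpos($clientName, '.');
    if ($dot === false) {
        return null;
    }
    $extension = strtolower(substr($clientName, $dot + 1));
    if (!in_array($extension, $allowedExtensions, true)) {
        return null;
    }
    // The stored name is random hex plus a known-good extension.
    return bin2hex(random_bytes(16)) . '.' . $extension;
}
```

The original name can still be kept as metadata (for example in a database column) and escaped for display, so non-ASCII names stay usable without ever reaching the filesystem.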

Have a look at [mod_security](http://www.modsecurity.org/) as an example of a Web Application Firewall. Even though it will never guarantee every flaw is blocked, it's an additional layer of security because it uses rule lists which are updated regularly upon discovery of new vulnerabilities. Another advantage is that it filters requests *before* they even have a chance to hit PHP. — Shadok, May 15 '12 at 16:11
@Shadok Thanks, we have mod_security installed (you can't count on it for an application that might be installed elsewhere though). Is there an easy way to see a list of vulnerabilities mod_security will prevent? — , May 15 '12 at 18:35
Sure, just find the version of your installed rules and browse them in the svn repository (here is link to the 'current' version): http://mod-security.svn.sourceforge.net/viewvc/mod-security/crs/tags/2.2.4/ . For a very quick glance there is a global description here: https://www.owasp.org/index.php/Category:OWASP_ModSecurity_Core_Rule_Set_Project — Shadok, May 16 '12 at 09:34
Why would you be converting the string (using either `utf8_decode` or `mb_*`) to begin with? Receive it as UTF-8, treat it as UTF-8, use the appropriate SQL and HTML escaping functions on it as UTF-8, inform the browser that it's UTF-8. — deceze, May 16 '12 at 23:16
@nus you should check out this exploitation technique: http://www.exploit-db.com/exploits/15838/ — rook, Jun 01 '12 at 05:30

rook · Answer 1 · 2012-05-15T04:20:23.300

In order to mount a proper defense against a threat you need to understand it. Blindly layering encoding schemes is very dangerous, as this can introduce a problem where by default there usually isn't one. In general, security problems related to encoding arise because escape functions can act upon data differently than the way it is later interpreted. But that is not the only concern: it can also confuse programmers into thinking a string is escaped when it's not.

The first concern is that escape functions can build dangerous strings for the attacker.

mysql_real_escape_string() is a special function that knows what encoding the database connection is using and adjusts its escaping to follow suit (provided the connection character set was set through the API, e.g. mysql_set_charset(), and not with a SET NAMES query). Therefore there is no disconnect between the encoding assumed by the escape function and the one used by the interpreter. Simply using addslashes() can be very dangerous for this reason.

As an example, let's use the two bytes 0xbf27. addslashes() doesn't understand GBK; it works byte by byte and only recognizes ASCII characters. Read as single bytes, the first byte is 0xbf, which lies outside ASCII (in ISO-8859-1 it displays as ¿), and the second is 0x27, which is a single quote '. After addslashes(), 0xbf27 becomes 0xbf5c27, or, printed as single bytes, ¿\'. The problem is that 0xbf5c is a valid GBK character, so a GBK-aware interpreter reads the result as one multi-byte character followed by an unescaped single quote. Another way of thinking of this is that the backslash is being consumed by the GBK encoding.
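The byte-level effect is easy to reproduce. The following is a minimal sketch that only inspects the bytes addslashes() produces; it does not talk to a database.

```php
<?php
// The attacker-supplied bytes: 0xbf followed by a single quote (0x27).
$input = "\xbf\x27";

// addslashes() works byte by byte and inserts a backslash (0x5c) before the quote.
$escaped = addslashes($input);

// Hex dump of the result: bf 5c 27. A GBK-aware parser reads bf5c as one
// character, which leaves the trailing 27 (the quote) unescaped.
$hex = bin2hex($escaped);
echo $hex, PHP_EOL; // prints bf5c27
```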

Character consumption can be a problem even without escaping functions. Let's use Shift-JIS and HTML as an example. This example was taken from The Tangled Web:

<img src="http://fuzzybunnies.com/0xE0">
...this is still a part of the markup...
 " onerror="alert('this will execute!')"
<div>
...the page continues...

In this case the "> at the end of the img tag are very important control characters that are being consumed. The " is consumed by the encoding scheme as the second byte of a multi-byte character, and then the > and the text that follows appear to be a part of the HTML src attribute; it's not until the parser hits the " before onerror that the attribute is terminated.

... But that's not all. Let's switch gears a bit. What about urldecode()?

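<?php
// VULNERABLE: the value is escaped first and decoded afterwards.
$id = mysql_real_escape_string($_GET['id']);
$id = urldecode($id); // decoding after escaping can reintroduce a quote
mysql_query("select * from user where id='" . $id . "'");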

Do you see the vulnerability? In my experience most PHP programmers don't. The problem is that urldecode() can be used to build a vulnerable string. Let's say the attacker requests http://localhost/vuln.php?id=%2527 or sleep(30)--. Just about every web application platform automatically runs a URL decode over all HTTP inputs, so the content of $_GET['id'] is actually %27 or sleep(30)--, because %25 was decoded into a % (% is hex 25). After mysql_real_escape_string() or even addslashes() the value is still just %27 or sleep(30)--. After running urldecode() again, the content of $id is ' or sleep(30)--, which is dangerous.
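The sequence of transformations can be traced without a database. In this sketch the first urldecode() call simulates the decoding PHP performs before populating $_GET, and addslashes() stands in for the escaping step; the variable names are illustrative only.

```php
<?php
// Raw query-string value as sent by the attacker (hypothetical request).
$raw = '%2527 or sleep(30)--';

// PHP has already URL-decoded the value once before it reaches $_GET;
// urldecode() is used here to simulate that first pass.
$fromGet = urldecode($raw);          // "%27 or sleep(30)--"

// addslashes() stands in for the escaping step; there is no quote to escape yet.
$escaped = addslashes($fromGet);     // unchanged

// The second decode, applied after escaping, produces the quote.
$decodedAgain = urldecode($escaped); // "' or sleep(30)--"
```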

So as a hacker tip: when I am auditing a code base that uses magic_quotes_gpc or something similar, I grep through the code looking for decode functions like urldecode(), htmlspecialchars_decode(), base64_decode() and similar. Even though the application runs addslashes() blindly on all input, a decode function will allow the attacker to build his attack string.

So how do you defend yourself? Well, test your code. Some web application scanners will test for encoding flaws exactly like this. In fact I researched this topic while building a vulnerability scanner. Another point: always sanitize input at the time of use, which avoids the problem of decoding after escaping. You cannot know in advance how a value will be used, so encoding/decoding/escaping everything up front will always be flawed. Use parameterized queries for building SQL statements. UTF-8 is a great default in most cases (but not always), and *most* of these problems don't arise with this multi-byte encoding scheme.

Hmm, the examples are nice, but this doesn't really answer the question. How do I sanitize on demand if I send code to a browser? What if a browser does a urldecode? How about a filesystem? I don't think I entirely agree with your vision either (even in your own code, several developers might add stuff and forget to sanitize). I think the generally accepted method is to clean everything and put it in a clean variable for safe use. Furthermore, even if I did it on demand, how do I guarantee that the encoding doesn't risk permitting an XSS or path traversal? — , May 15 '12 at 12:24
Testing is always nice, but I think it is hard to test for everything an attacker could come up with. And I would prefer an offline system for testing on a developer machine first if it was for input/output filtering/escaping. Furthermore, the checkout button on your product page leads to a 404. — , May 15 '12 at 12:26
@nus ok you just posted so many questions. Why does a browser urldecode matter? (it actually does a urlencode... but it still doesn't matter) But the most important one is that **you must always** sanitize immediately before use, and **NEVER NEVER NEVER NEVER** cram all input through the same sanitize function. You don't know how the data will be used in your program, so it **will always be wrong**, and this behavior causes encoding problems. — rook, May 15 '12 at 12:51
@nus Automation is very useful, especially if your application depends on a security system that only works some of the time, like magic_quotes_gpc or some similar global security measure. That 404 issue is fixed, oops, we all make mistakes sometimes. — rook, May 15 '12 at 13:33
@Rook Often, a webserver will not actually use the inputted value. It will be sent out to a filesystem or to a browser (XSS vulnerability). Output escaping is one thing, but you also have to filter your input in order to make sure the escaping will work. That is what this question is about. — , May 15 '12 at 14:00
@nus Most XSS is a server-side problem (except for DOM-based XSS, but that is very uncommon), and the input needs to be escaped when the HTML is being built. If you blindly run htmlspecialchars() on all input, you'll have problems with XSS, because [xss is an output problem](https://sitewat.ch/en/Blog/5) — rook, May 15 '12 at 16:48
@Rook I never suggested running an escaping function on all input. All XSS, including Dom based xss is a server problem. That still doesn't solve the initial problem though. Input filtering is not the same as output escaping. Input filtering is needed amongst other things to make sure output escaping works. — , May 15 '12 at 18:20
@nus actually dom based xss is just a vulnerability in javascript. Yes escaping input is vital, however applying the same filter to all input is suicide, and decode functions will undermine this approach to security. — rook, May 15 '12 at 18:41
let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/3440/discussion-between-nus-and-rook) — , May 15 '12 at 18:46

score 2 · Answer 2 · edited Mar 17 '17 at 13:14

No, I don't recommend the approach you mentioned.

First off, let me back up. Conceptually, perhaps the safest approach is to apply a combination of input validation and output escaping. Input validation means you define the form that expected/good inputs take, and check that the inputs have that form. Output escaping means that you escape outputs according to the context where they will be consumed. Input validation is generally done to inputs as soon as you receive them; output escaping is generally done at the very end, just before inserting a value into an HTML document or other output.

For general background on the concepts of input validation and output escaping, as well as implementation tips, OWASP has some excellent resources. See also this question: Filter user input before the database or upon display?, for more on the concepts.

To do input validation, I think I would define a whitelist of allowed characters or a regexp that characterizes the expected input, and check that the input matches this regexp/whitelist. (An appropriate whitelist should deal with issues related to funny input encodings.) In some cases another way to do input validation is by casting the input to a specific type, like (int). The specific validation function will depend upon the type and format of the input; you can't use a one-size-fits-all validator.
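A minimal sketch of both ideas follows, assuming a display-name field and a numeric id; the function names and the exact character classes are hypothetical and would differ per field. The regexp uses the `u` pattern modifier, with which preg_match() refuses subjects that are not valid UTF-8.

```php
<?php
// Hypothetical validator for a display name: letters and digits from any
// script, plus space, underscore, dot and hyphen; 1 to 64 characters.
function is_valid_display_name(string $value): bool
{
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8;
    // preg_match() returns false (not 0) if the subject is not valid UTF-8,
    // so comparing against 1 rejects malformed input as well.
    return preg_match('/\A[\p{L}\p{N} _.\-]{1,64}\z/u', $value) === 1;
}

// Hypothetical validator for a numeric id passed as a string.
function parse_positive_int(string $value): ?int
{
    $result = filter_var($value, FILTER_VALIDATE_INT, ['options' => ['min_range' => 1]]);
    return $result === false ? null : $result;
}
```

filter_var() is used instead of a bare (int) cast here because the cast silently turns a value like 42abc into 42, whereas the filter rejects it.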

To do output escaping, I recommend using an escaping library, such as OWASP ESAPI. You need to understand the different parse contexts where values might appear, and then use the right escaping function for that parse context. For instance, htmlspecialchars() is an appropriate escaping function for dynamic values that will be inserted into an HTML document between tags, but it is not appropriate for a URL (for URLs, you must also check that the protocol is valid). OWASP ESAPI provides a suite of escaping functions for the most common contexts where you might be inserting data into outputs.
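To make the difference between contexts concrete, here is a small sketch using only built-in PHP functions rather than ESAPI. The function names are hypothetical, and the scheme whitelist (http and https only) is an assumption that rejects relative URLs as well.

```php
<?php
// Context: text between HTML tags or inside a quoted attribute value.
function escape_for_html(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Context: a URL placed in an href/src attribute. Escaping alone is not
// enough here, so the scheme is checked against a whitelist first.
// Returns null when the URL is rejected.
function escape_url_for_attribute(string $url): ?string
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return null;
    }
    return escape_for_html($url);
}
```

Other contexts (JavaScript string literals, CSS, unquoted attributes) need their own escaping rules, which is the argument for a dedicated library.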

Make sure to use prepared statements. Don't build up SQL queries through string concatenation.
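A minimal PDO sketch of a prepared statement, assuming a hypothetical `users` table with `id` and `name` columns:

```php
<?php
// Hypothetical lookup; the table and column names are illustrative only.
function find_user_by_name(PDO $pdo, string $name): ?array
{
    // The SQL text is fixed; the value travels separately as a bound
    // parameter and is never concatenated into the statement.
    $stmt = $pdo->prepare('SELECT id, name FROM users WHERE name = :name');
    $stmt->execute([':name' => $name]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}
```

With the MySQL driver it is probably worth also setting the charset in the DSN and turning off PDO::ATTR_EMULATE_PREPARES, so that the driver does not fall back to client-side escaping.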

That said, PHP's support for security is weak. Many other modern web programming frameworks provide stronger support for security, e.g., by providing a template system with context-sensitive auto-escaping, by providing automatic support for CSRF tokens, secure session management, user logins, a database ORM, and so forth.

See also Security attacks on PHP web applications, What security features should a PHP framework have?, Is strip_tags() horribly unsafe?, What are some good, free tools to run automated security audits for PHP code?, How to perform a security audit for a PHP application?, Why do people say that PHP is inherently insecure?.

Thanks for answering. I will have a look at ESAPI, but this question is about input validation. How would you do input validation if you don't know the encoding of the input? Say I have a file upload and thus need to sanitize a filename. I don't want to whitelist, because not everyone uses an ASCII alphabet, so that would be very unfriendly for everyone but English users. I do need to be absolutely sure, however, that no corrupt or wide character encodings cause something to be interpreted as a slash by the underlying filesystem, otherwise I have path traversal issues. — , May 16 '12 at 01:19
1. ESAPI does provide some utility functions for input validation. 2. The simple approach: PHP treats all strings as ASCII encoding. If you check that your inputs are ASCII, and treat them consistently as ASCII, you will be fine (if an attacker sends you funny characters, perhaps intended as a funny encoding, then if you've done input validation right, the funny characters won't get through your input validation and you'll be fine). (cont.) — D.W., May 16 '12 at 01:24
3. If you are asking how to support, say, UTF8 encoded strings in PHP, that might be a question for Stack Overflow. See also [this answer](http://security.stackexchange.com/a/11371/971). If you want to use, say, UTF-8, then use some library for manipulating UTF-8 strings in PHP. From a security perspective, the important thing is to do all input validation on the UTF-8 strings using the UTF-8 aware libraries; be consistent about treating it as UTF-8; and if you include it into output, either make sure the output encoding is also UTF-8, or else convert to an appropriate encoding. — D.W., May 16 '12 at 01:26
Using UTF-8 is not a problem, but the problem is that a user might send a piece of text that is not valid in UTF-8, or that is downright in some other encoding. As PHP functions like preg_replace treat all text as byte arrays with each byte representing one character, this can be used to bypass the input filtering. See the sirdarckcat link in my question for a detailed explanation and examples. — , May 16 '12 at 01:31
the link talks about the utf8_decode, but htmlspecialchars, preg_replace, and other functions have similar issues. If needed I can add links to more vulnerabilities. — , May 16 '12 at 01:35
@nus, don't use preg_replace with UTF-8 strings unless you pass the `u` pattern modifier: by default it isn't UTF-8 aware. If your strings are in UTF-8, you must find a library that can perform all of the operations you want to do in UTF-8 format, and make sure to consistently use only UTF-8-aware functions. Yes, as far as I can tell, it is a shortcoming of PHP that its standard string type and many standard library functions are not UTF-8 aware. (Or, put it another way: If you want to make your question well-defined, select a string handling library and an input validation operation you want to perform, then ask about that specifically.) — D.W., May 16 '12 at 01:48

score 0 · Answer 3 · answered Sep 29 '15 at 09:59

I have come to the conclusion to use:

$input = mb_convert_encoding( $input, 'UTF-8' );

in order to sanitize the character encoding before doing any other operations like input validation and output escaping. I'll probably have to look at the source for mb_convert_encoding and its unit tests to be sure it's safe, but I have the impression that it will always return a valid UTF-8 string.

The rest of my application will use UTF-8 throughout, which avoids anything being interpreted differently, all the way to the browser.
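An alternative worth considering, sketched below, is to reject malformed input instead of converting it. Note that mb_convert_encoding() called without a third argument takes the source encoding from the mbstring internal encoding setting, so its result depends on configuration. The function names here are hypothetical, and the preg_match fallback is only an assumption for installations without mbstring.

```php
<?php
// Hypothetical gatekeeper: accept a string only if it is well-formed UTF-8.
function is_valid_utf8(string $value): bool
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($value, 'UTF-8');
    }
    // Fallback without mbstring: the empty pattern with the "u" modifier
    // matches only when the subject is valid UTF-8.
    return preg_match('//u', $value) === 1;
}

// Reject instead of repairing: malformed input never enters the application.
function require_utf8(string $value): string
{
    if (!is_valid_utf8($value)) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $value;
}
```

Rejecting avoids having to reason about what a repair step turned the attacker's bytes into; validation and escaping then only ever see well-formed UTF-8.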
