0

While building my website I put together a security class which includes two functions, one takes input and returns it as a string passed through htmlentities function of PHP. The other strips out common XSS tags etc, until now I've been using both in conjunction when outputting user inputted data to the browser Just to make it clear; my context is HTML, not pdf, xls etc (I.E. string tags then htmlentities then echo). My question is, should I just use HTML entities on its own, is that enough?

below are the two functions I described for reference:

    public function SanitiseXSS($data)
    {
        // If its empty there is no point cleaning it 
        if(empty($data))
        {
            return $data;
        }

        // Recursive loop for arrays
        if(is_array($data))
        {
            foreach($data as $key => $value)
            {
                $data[$key] = $this->SanitiseXSS($data);
            }

            return $data;
        }

        // Fix &entity\n;
        $data = str_replace(array('&','<','>'), array('&','<','>'), $data);
        $data = preg_replace('/(&#*\w+)[\x00-\x20]+;/u', '$1;', $data);
        $data = preg_replace('/(&#x*[0-9A-F]+);*/iu', '$1;', $data);
        $data = html_entity_decode($data, ENT_COMPAT, 'UTF-8');

        // Remove any attribute starting with "on" or xmlns
        $data = preg_replace('#(<[^>]+?[\x00-\x20"\'])(?:on|xmlns)[^>]*+>#iu', '$1>', $data);

        // Remove javascript: and vbscript: protocols
        $data = preg_replace('#([a-z]*)[\x00-\x20]*=[\x00-\x20]*([`\'"]*)[\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iu', '$1=$2nojavascript...', $data);
        $data = preg_replace('#([a-z]*)[\x00-\x20]*=([\'"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iu', '$1=$2novbscript...', $data);
        $data = preg_replace('#([a-z]*)[\x00-\x20]*=([\'"]*)[\x00-\x20]*-moz-binding[\x00-\x20]*:#u', '$1=$2nomozbinding...', $data);

        // Only works in IE: <span style="width: expression(alert('Ping!'));"></span>
        $data = preg_replace('#(<[^>]+?)style[\x00-\x20]*=[\x00-\x20]*[`\'"]*.*?expression[\x00-\x20]*\([^>]*+>#i', '$1>', $data);
        $data = preg_replace('#(<[^>]+?)style[\x00-\x20]*=[\x00-\x20]*[`\'"]*.*?behaviour[\x00-\x20]*\([^>]*+>#i', '$1>', $data);
        $data = preg_replace('#(<[^>]+?)style[\x00-\x20]*=[\x00-\x20]*[`\'"]*.*?s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:*[^>]*+>#iu', '$1>', $data);

        // Remove namespaced elements (we do not need them)
        $data = preg_replace('#</*\w+:\w[^>]*+>#i', '', $data);

        do
        {
            // Remove really unwanted tags
            $old_data = $data;
            $data = preg_replace('#</*(?:applet|b(?:ase|gsound|link)|embed|frame(?:set)?|i(?:frame|layer)|l(?:ayer|ink)|meta|object|s(?:cript|tyle)|title|xml)[^>]*+>#i', '', $data);
        }
        while ($old_data !== $data);

        return $data;
    }

    public function EscapeOutput($input)
    {
        return htmlentities($input, ENT_QUOTES, 'UTF-8');
    }
sousdev
  • 3
  • 3

1 Answers1

2

Without deciphering your regular expressions I tell you that you are doing this wrong, here are some examples why:

  • HTML entities and attributes are changing all the time, you can't really blacklist things.
  • You can filter the input to avoid HTML tags altogether. This way you will still be vulnerable if you put the filtered data inside script or style tags for example.
  • You may need to accept formated user-supplied data (think FCKEditor), and you would kill the formatting with your filters. At this point you would basically have to deal with XML that you can't possibly parse with regular expressions.
  • The backend database or other system components can format the already filtered input in a way that it is becoming harmful on output (see the ValidateRequest filter bypass).
  • Also, good luck with debugging your code!

I would recommend not reinventing the wheel and separating your views from your controllers. Using some template engine/framework that HTML-encodes all template parameters before output will help a lot. If you need to display formatted content you can use HTMLPurifier.

buherator
  • 1,730
  • 1
  • 9
  • 15
  • Thanks for your response, the user input that needs to parsed as HTML is filtered by a separate function which white-lists the allowed tags so this question is just for plain text output. I know this can be filtered at input, which I do, however I would also like to escape at output, to ensure maximum security. – sousdev Nov 14 '13 at 15:19
  • Maximum security is not ensured by tricky filters but by robust software design. This is especially true for XSS protection that is a much more complex problem than it looks like. I can't stress enough to take a look at software design patterns and existing, widely used frameworks (a bit far from PHP, but .NET MVC 4+ does XSS prevention in a really nice way). – buherator Nov 24 '13 at 13:53