Blacklisting vs. whitelisting characters to prevent XSS?

Question

I've been reading about XSS prevention on OWASP and other security channels. They all say that I should use ESAPI or a similar library and do input filtering through a whitelist approach.

However, I use a framework (WebObjects) which encodes by default, so using ESAPI changes my input and is therefore not an option for me.

The second option is to use a whitelist approach. I support many languages such as Japanese, Russian and Korean, so how do I decide which characters to whitelist?

Also, why is a whitelist approach better than a blacklist approach, as mentioned by OWASP? Why not just block a handful of characters used in XSS, like < and >?

score 8 · Answer 1 · answered Sep 13 '12 at 19:44

It is not just a handful of characters that you need to blacklist. In security we go by this dogma:

"There are things we know that we know. There are known unknowns. That is to say there are things that we now know we don't know. But there are also unknown unknowns. There are things we do not know we don't know."

A blacklist might help you with the first two cases; a whitelist helps cover all three :)

While it is easy to identify and validate a set of characters that are harmless, it's difficult to identify everything that is known to be bad. Most anti-virus software uses a blacklist approach (signatures), yet it still fails to catch a 0-day, because the 0-day was not yet known to be bad and hence had no signature.
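
As a rough illustration of whitelisting "known good" for multilingual text, the sketch below accepts characters by Unicode general category instead of enumerating them, so Japanese, Russian and Korean letters pass without being listed one by one. Python is used here as an assumption (the question's stack is WebObjects), and the names `ALLOWED_CATEGORY_PREFIXES`, `ALLOWED_EXTRA` and `is_allowed`, as well as the choice of categories, are hypothetical and would need tuning per field.

```python
import unicodedata

# Hypothetical policy for a "name" field: letters (L*), combining marks (M*)
# and numbers (N*) from any script, plus a few explicitly listed characters.
ALLOWED_CATEGORY_PREFIXES = ("L", "M", "N")
ALLOWED_EXTRA = set(" -'.")


def is_allowed(value: str) -> bool:
    """Return True if every character is on the whitelist."""
    return all(
        ch in ALLOWED_EXTRA
        or unicodedata.category(ch).startswith(ALLOWED_CATEGORY_PREFIXES)
        for ch in value
    )


print(is_allowed("山田太郎"))      # True
print(is_allowed("Иван Петров"))  # True
print(is_allowed("김철수"))        # True
print(is_allowed("<script>"))     # False
```

Note that the apostrophe is allowed here because real names need it, which is one reason validation like this complements output escaping and does not replace it.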

score 7 · Answer 2 · edited Sep 13 '12 at 18:17

> Also, why is whitelist approach better than blacklist approach as mentioned by OWASP. Why not just block a handful of characters used in XSS like < , > , etc

Blacklists are static in the sense that they only stop the 'known bad'. The problem is that new attack vectors are found every day, so you would need to update your blacklist constantly to stay safe. A whitelist, on the other hand, is more robust because you define exactly what you accept, and anything you did not anticipate is rejected by default. That is why OWASP suggests whitelists.
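
To make this concrete, here is a minimal sketch in Python (chosen for illustration; the function names and the payload are made up) of a vector that a blacklist of < and > does not anticipate: the value lands inside an HTML attribute, where a double quote is enough to break out.

```python
import html

BLACKLIST = "<>"  # the "handful of characters" from the question


def strip_blacklisted(value: str) -> str:
    """Remove blacklisted characters and keep everything else."""
    return "".join(ch for ch in value if ch not in BLACKLIST)


def render_with_blacklist(value: str) -> str:
    return f'<input type="text" value="{strip_blacklisted(value)}">'


def render_with_escaping(value: str) -> str:
    return f'<input type="text" value="{html.escape(value, quote=True)}">'


payload = '" onmouseover="alert(1)'  # contains neither < nor >

print(render_with_blacklist(payload))
# <input type="text" value="" onmouseover="alert(1)">
print(render_with_escaping(payload))
# <input type="text" value="&quot; onmouseover=&quot;alert(1)">
```

The blacklist version lets the payload add an event handler attribute, while escaping for the attribute context keeps it inside the value.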

score 4 · Answer 3 · edited Mar 17 '17 at 10:46

I think you might have rejected ESAPI too quickly. To defend against XSS, I recommend you do output escaping: any place where you insert data dynamically into an HTML document, escape the data (in a way suitable for that parse context). ESAPI provides libraries for the escaping and is very useful. This does not involve "changing your input".
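
A minimal sketch of what "suitable for that parse context" means, using Python's standard library as a stand-in for ESAPI's encoders (an assumption for illustration; ESAPI is a Java library with its own API, which is not shown here). The helper names are hypothetical.

```python
import html
from urllib.parse import quote


def escape_for_html(value: str) -> str:
    """Escape for HTML element content or a quoted attribute value."""
    return html.escape(value, quote=True)


def escape_for_url_param(value: str) -> str:
    """Percent-encode for use as a single URL query parameter value."""
    return quote(value, safe="")


user_input = 'Tom & "Jerry" <tj@example.com>'

html_fragment = f"<p>{escape_for_html(user_input)}</p>"
link = f"/search?q={escape_for_url_param(user_input)}"

print(html_fragment)  # <p>Tom &amp; &quot;Jerry&quot; &lt;tj@example.com&gt;</p>
print(link)           # /search?q=Tom%20%26%20%22Jerry%22%20%3Ctj%40example.com%3E
```

The stored value of `user_input` is never modified; only the copy written into each output is encoded, and differently for each destination.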

For more, read OWASP's "XSS (Cross Site Scripting) Prevention Cheat Sheet", "Can anybody explain XSS to an idiot?", "Filter user input before the database or upon display?", and "Canonicalization & Output Encoding".

symcbean · Answer 4 · 2018-02-08T20:17:33.630

> do input filtering

No, no, no.

By all means do input validation - accept or reject the input based on rules. Don't try to change the input data. If the interface between your webserver and your application language allows content through which compromises your application language, then there's something very, very wrong. Certainly you can't handle this kind of scenario within your application code.

Vulnerabilities in applications typically arise at the point where data leaves your application language, and in the case of XSS this is where they always arise. So this is the point at which you should apply any transformation to the data. And the transformation must be appropriate to where the data is going: how you escape data you are writing to a database is very different from how you escape data to be written into HTML.
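
A small sketch of that idea, assuming Python with its built-in `sqlite3` and `html` modules (the table and variable names are made up): the same raw value is handed to the database through a parameterized query and, separately, HTML-escaped only when it is written into a page.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")

raw = "O'Reilly <b>says</b> hi"

# Leaving the application towards the database: a bound parameter lets the
# driver handle quoting, so the apostrophe cannot alter the SQL statement.
conn.execute("INSERT INTO comments (body) VALUES (?)", (raw,))

# The database holds the original text, not an HTML-encoded copy.
stored = conn.execute("SELECT body FROM comments").fetchone()[0]

# Leaving the application towards the browser: escape for HTML at this point.
page = f"<p>{html.escape(stored)}</p>"

print(stored)  # O'Reilly <b>says</b> hi
print(page)    # <p>O&#x27;Reilly &lt;b&gt;says&lt;/b&gt; hi</p>
```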

+1. "Sanitize your inputs" does not mean "store HTML-encoded text in your database". — tdammers, Sep 14 '12 at 10:41

score 0 · Answer 5 · answered Sep 13 '12 at 19:46

Ideally, the following steps should be performed on input:

1. Filter (optionally). Remove whitespace around values. For a phone number, for example, you might also remove spaces, hyphens and periods, because some people enter things like 046 339 1312.
2. Validate.

a. This catches user errors that the filter did not.

b. Validation checks should function as a whitelist. With proper escaping (see point 3 below) this shouldn't be needed for security, but it may block an as-yet-undiscovered attack in the future. Be careful though, it's often harder than it seems.

Whether you should go for something basic (like only checking for an @ sign when validating e-mail addresses) or an exact and strict regular expression depends on whether you know all possible values and whether you really need to get it right. For a site with credit card data in its database, you might want to spend some more time researching possible inputs.

c. You may also want to check things like: does the domain in this e-mail address even exist? Or is this number really valid? You could even test whether a mail server accepts the user (Gmail will return an error when a user does not exist, even before you actually send anything), but the server could be down. SMTP would handle this for you and delay delivery, but your live e-mail address checker would tell the user their address does not exist when it actually does.
3. Escape. This should be done both for input and output. Input when you are saving it to a database, output when you are displaying it. Yes, also escape data that comes straight from your database. An apostrophe may be totally harmless in HTML, whereas in JavaScript it may provide a great XSS opportunity. Or a < will not be escaped by mysql_real_escape_string, but if you then output it to an HTML page you have another XSS.
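
The filter and validate steps for the phone number example could look roughly like this in Python. This is only a sketch: the rule in `PHONE_RE` (an optional leading + followed by 7 to 15 digits) and the function names are assumptions, not a standard.

```python
import re

# Hypothetical whitelist rule: optional "+", then 7 to 15 digits.
PHONE_RE = re.compile(r"\+?[0-9]{7,15}")


def filter_phone(raw: str) -> str:
    """Filter step: drop surrounding whitespace, inner spaces, periods, hyphens."""
    return re.sub(r"[\s.\-]", "", raw.strip())


def validate_phone(value: str) -> bool:
    """Validate step: accept only values that match the whitelist rule entirely."""
    return PHONE_RE.fullmatch(value) is not None


cleaned = filter_phone(" 046 339 1312 ")
print(cleaned)                                    # 0463391312
print(validate_phone(cleaned))                    # True
print(validate_phone(filter_phone("<script>")))   # False
```

Input that fails validation is rejected, not repaired, and escaping still happens later when the value is written out.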

I can't think of any use case where the above steps would make a blacklist practical or necessary, so I would advise against using one. There may always be a reason though. Escaping should take care of the rest regardless, but it's good to have another layer of defense.

When it's important enough, you might also write a unit test to test all possible inputs and common exploits: ../, ', ", \, numerics, strings, or even the full ASCII set.
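
Such a test might be sketched as follows, here with Python's `unittest` and `html.escape` standing in for whatever escaping function your application actually uses (both are assumptions for illustration; the sample list is taken from the inputs mentioned above plus one script tag).

```python
import html
import unittest

# Samples from the list above, plus one classic script payload.
EXPLOIT_SAMPLES = ["../", "'", '"', "\\", "12345", "<script>alert(1)</script>"]


class EscapeForHtmlTests(unittest.TestCase):
    def test_ascii_set_and_samples(self):
        inputs = [chr(code) for code in range(128)] + EXPLOIT_SAMPLES
        for value in inputs:
            with self.subTest(value=value):
                escaped = html.escape(value, quote=True)
                # No character that can open a tag or close an attribute survives.
                for dangerous in "<>\"'":
                    self.assertNotIn(dangerous, escaped)
                # Escaping is lossless: decoding gives back the original input.
                self.assertEqual(html.unescape(escaped), value)
```

Saved as a module, this can be run with `python -m unittest`. It only covers the HTML context; other output contexts need their own escaping function and their own test.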

"_You may even try if a mailserver accepts the user (Gmail will error when a user does not exist, even before you really sent anything)_" do you suggest verifying email validity with `RCPT TO` instead of the command intended for this (`VRFY`)? — curiousguy, Sep 14 '12 at 04:41
Escaping input *and* output is wrong; you will be escaping twice, and that is practically never correct. — tdammers, Sep 14 '12 at 10:39
@tdammers No, you're not storing it as escaped data. You're transferring it as such. This prevents SQL injections, XSS attacks, and probably more in other languages. — Luc, Sep 14 '12 at 14:53
@Luc: You probably mean the right thing, but the wording is a bit misleading. Things you store in the database are output, just like things you send to the client, and you need to supply both in the appropriate format (SQL syntax, HTML, etc.). Things you read from the database and things you receive from a client are inputs, and while you *validate* and possibly *restrict* those, you don't *escape* them; if anything, you translate from POST fields or whatever encoding you receive to your internal encoding (typically UTF-8). — tdammers, Sep 14 '12 at 16:11
@tdammers I see what you mean. What I meant was that you escape them for the database with functions like mysql_real_escape_string, so you do actually escape them. Then when outputting to a client you escape them again with htmlspecialchars or something. But yeah my wording might be misleading. The post got no upvotes though, I guess it's a bit offtopic, but feel free to edit if you think it should be changed anyway. — Luc, Sep 14 '12 at 17:50
