22

In the bits that I've searched about this, I've seen a few people declare as word-of-god that you should only sanitize outputs and not inputs. Why? Would it not be safer to cover both ends?

D.W.
Todd Schwine
  • 6
    As worded, this is an opinion question that could start a flame war (and might get closed for that reason). You might [edit](https://security.stackexchange.com/posts/95325/edit) it to be more like "what are the pros and cons of both approaches?" – Mike Ounsworth Jul 30 '15 at 01:41
  • 8
    I'd like to see a reference to _anyone_ saying you should not sanitize inputs. – gowenfawr Jul 30 '15 at 02:12
  • @gowenfawr There are cases for that; our security assurance department encourages us to put our sanitization as close to the output line as possible (db call or whatever) since it makes their code reviews easier. The rationale being that it's easier for them to `ctrl+f` for db calls than for all possible sources of input. – Mike Ounsworth Jul 30 '15 at 02:19
  • @gowenfawr it's not that you shouldn't sanitize inputs, but rather for intra-system data transfer, the focus is on output sanitation (making an assumption on the OP's experience). – schroeder Jul 30 '15 at 02:31
  • 2
    That said, Todd, could you add some notion of scope to this question? – schroeder Jul 30 '15 at 02:32
  • 2
    Could you elaborate on what exactly you mean by ‘input sanitization’ and ‘output sanitization’? – Gumbo Jul 30 '15 at 04:15
  • 1
    **Please clarify:** If I have data (1) Input by the user in a Form field (2) sent to the Server-Application (3) Sent to the Persistence Framework (4) Saved in the Table-Columns and then later (5) read from the Table (6) packed and formatted by the Application Server (7) presented to the User on a Webpage... `Where do you see Input/Output ?` – Falco Jul 30 '15 at 08:51
  • @MikeOunsworth my understanding of your scenario is that you still sanitize inputs. The way I see your scenario is: Line 24: Sanitize(); Line 25: Db.Execute(); Line:w/e; This would make it an input sanitisation as it is happening on the input stream. – sir_k Jul 30 '15 at 09:18
  • @FlorinCoada Yes, it's still the input _stream_, but you're allowing the raw input to sit in memory and potentially be processed by a parser before being sanitized. `Line 25: str = get_input(); Line 26: // ... ; Line 150: sanitize(str); Line 151: Db.Execute()`. Arguably you'd sanitize it both at input and at output so that there's no risk of it exploiting one of the library calls between Lines 26 - 149. – Mike Ounsworth Jul 30 '15 at 12:07
  • @MikeOunsworth At what point between lines 26-149 did the input suddenly become an output? It hasn't even gone into the database yet... – Michael Jul 30 '15 at 18:48
  • @Michael I should have written `Line 151: Db.Execute(str);` But I think I see the bigger confusion; to me, "output" is any string that leaves my own source code, for example passed to a system call, or written to a file / db. But I suppose many people would still call that _input_ since it came from a user. – Mike Ounsworth Jul 30 '15 at 18:58

5 Answers

33

When you sanitize input, you risk altering the data in ways that might make it unusable. So input sanitization is avoided in cases where the nature of the data is unknown. For instance, perhaps some special characters hold significance in the data and stripping them means destroying that significance.

One such scenario: your system stores data that is later pulled into a third-party system, and in that system those characters hold meaning. By stripping them you've altered the data in a significant way. For instance, perhaps the string is used as a key to look up a record in the third-party system, and by stripping the symbol you alter the key such that the record cannot be found.
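
A minimal sketch of that lookup-key scenario, under made-up assumptions: the `third_party_records` dictionary stands in for the external system, and `strip_special` is a hypothetical, deliberately naive input sanitizer.

```python
import re

# Hypothetical third-party system: records are keyed by the raw string.
third_party_records = {"acct#42/eu": "Alice's account"}


def strip_special(value: str) -> str:
    """Naive input 'sanitizer': drop everything except letters, digits, spaces."""
    return re.sub(r"[^A-Za-z0-9 ]", "", value)


raw_key = "acct#42/eu"
stored_key = strip_special(raw_key)  # what would end up in our own database

# The raw key still finds the record; the sanitized key no longer matches.
found_with_raw = third_party_records.get(raw_key)
found_with_stored = third_party_records.get(stored_key)
```

Here `stored_key` has lost `#` and `/`, so the later lookup returns nothing even though the user supplied a perfectly valid key.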

Input sanitization can be used when the nature of the data is known and sanitization would not adversely affect the data in any way.

Your decision to sanitize input data is in part a business decision. Will third-party systems depend on the input exactly as it is provided? If so, it's probably not a good idea. However, you may be able to shape expectations such that the third parties understand that you will be sanitizing input data based on specified criteria that you share with them.

saghaulor
  • 7
    Additionally, you (almost) always know what format you're outputting. Either you're outputting `HTML`, or `javascript`, or `file paths`, or `SQL`, etc, and you can sanitize it in that context since different things are safe in different contexts. – Mike Ounsworth Jul 30 '15 at 12:14
  • I agree with @MikeOunsworth. When you output the data, you know the use case of the data. This knowledge allows you to safely sanitize the output data accordingly. – saghaulor Jul 30 '15 at 16:07
  • 3
    I would call this output encoding instead of sanitization. – Matthew Jul 30 '15 at 18:45
14

Gee... "Sanitize output." I've never actually heard that term used before, and I've been doing this for over a decade at least. You don't "sanitize your output"; you encode it for the proper context within the application in which it is being presented. You encode the output for HTML, HTML attribute, URL, JavaScript... I've never seen or heard anyone claim that you "sanitize" your output. Do you folks mean in the sense of whitelisting or blacklisting which character strings can be sent down the wire to the browser, for example? Nobody does that. They shouldn't anyway, for the reasons listed above: you don't know what may be a legitimate use of particular data for a given application. Some web sites (like, for example, this one) must allow code to be uploaded and then rendered as code within the request-response life cycle. If a script tag were not allowed, how could examples of code ever be exchanged on code-sharing sites?

By the way "You can never in retrospect go through the database and see how many of the posts were malicious." is simply not true. There are scrubbers available to go through a database and "scrub it" of malicious code. I know, I did it last year for a major financial services company.

RatboySTL
  • I think you misunderstood "You can never in retrospect go through the database and see how many of the posts were malicious." - it was said in the context of having already sanitized the input. In which case the malicious part is gone, so no you cannot go back and look for which ones were originally malicious. – Aaron Dufour Jul 30 '15 at 17:56
  • 2
    +1 because why should you sanitize the output - you're the one that generated it, right? – Michael Jul 30 '15 at 18:50
8

You don't know how to sanitise data until you output it, or more precisely use it.

In many cases it might seem obvious: in your blogging engine you want to filter out script tags, always and forever, so you simply delete them from the input and never think of them again.

In other cases it might not be so easy, for example if the same data is used in different contexts. "<" needs to be escaped to "&lt;" in HTML and is completely harmless if exported as text.
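
To make the two-contexts point concrete, here is a small sketch using Python's standard `html.escape`. The `render_html` and `export_text` helpers are hypothetical names invented for illustration; the comparison is between escaping at input time and escaping at output time.

```python
import html

comment = "a < b"

# Option A: escape at input time and store the escaped form.
stored_escaped = html.escape(comment)
# Option B: store the raw text and escape only when it is used.
stored_raw = comment


def render_html(value: str) -> str:
    """HTML context: escape right before the value goes into markup."""
    return "<p>" + html.escape(value) + "</p>"


def export_text(value: str) -> str:
    """Plain-text context: nothing needs escaping."""
    return value


# Option A corrupts the text export and double-escapes in HTML.
text_from_escaped = export_text(stored_escaped)
html_from_escaped = render_html(stored_escaped)

# Option B is correct in both contexts.
text_from_raw = export_text(stored_raw)
html_from_raw = render_html(stored_raw)
```

With the raw value stored, the HTML output is `<p>a &lt; b</p>` and the text export is still `a < b`; with the pre-escaped value, the text export contains a literal `&lt;`.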

But even if it is simple, by removing <script> from your input you lose important data. You can never in retrospect go through the database and see how many of the posts were malicious.

Then comes the possibility of moving goal posts: someone finds a new exploit that your filter doesn't deal with. Suddenly you must reapply a fixed filter on your entire database. What if there is a false positive bug in your fix?

But even if you are absolutely certain that the posted data is completely free of XSS, viruses, et cetera, and is completely safe to show in a browser, you can't just shove it into your database willy-nilly. That's how SQL injections are born.

The bottom line is that until you use the data, you cannot know what "bad" data looks like, and every time you use the data you must sanitize it.

Trying to fix the data in advance is like darning socks before there is a hole in them.

Odalrick
  • Creating an SQL query with user input is outputting to SQL, so you should do whatever is necessary to escape the data in SQL context. Or avoid having to escape data by using bind parameters. – Lie Ryan Jul 30 '15 at 12:57
  • 1
    While this is technically true, sometimes you use the data immediately, and therefore need to sanitize it immediately. In essence, you're arguing that "sanitizing input" is a misnomer, and possibly a misleading one. I'd agree with that. But the action it refers to, however inaccurately, is indisputably sometimes necessary. – Kevin Krumwiede Jul 30 '15 at 18:46
5

It is a risk to have XSS content in your database. Databases are meant to be shared by applications, and are long-lived compared to web front-ends.

Example: The new intern starts working on a new web app for the db, shows his boss, and bam, his login cookie is in St. Petersburg.

You don't want to alter user input; you want to validate user input and reject it if it contains possible XSS. This is pretty easy and fast with a proper HTML parser like JSoup. It's built into Hibernate Validator.
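
As a rough illustration of "validate and reject" rather than "alter", the sketch below swaps in Python's standard `html.parser` module in place of JSoup. The `contains_markup` helper is a hypothetical name, and the check is intentionally crude: it only flags input in which the parser recognizes a tag, so it is not a complete XSS detector.

```python
from html.parser import HTMLParser


class _TagDetector(HTMLParser):
    """Records whether the parser recognized any start or end tag."""

    def __init__(self):
        super().__init__()
        self.saw_markup = False

    def handle_starttag(self, tag, attrs):
        self.saw_markup = True

    def handle_endtag(self, tag):
        self.saw_markup = True


def contains_markup(value: str) -> bool:
    """Return True if the input parses as containing HTML tags."""
    detector = _TagDetector()
    detector.feed(value)
    detector.close()
    return detector.saw_markup


def validate_comment(value: str) -> str:
    """Reject, rather than rewrite, input that contains markup."""
    if contains_markup(value):
        raise ValueError("markup is not allowed in this field")
    return value  # stored exactly as the user typed it
```

Accepted input is stored unmodified, so nothing is silently lost; rejected input never reaches the database at all.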

I'm not saying you shouldn't escape user input on output. Given the number of XSS issues out there, though, it's obviously easy to miss a few.

Neil McGuigan
  • 2
    You are assuming the world is web. It's not. There are also legitimate reasons for having code-like data in a database. StackExchange for example :) However you are certainly correct to validate, e.g. your phone number should not have much more than numbers and a few other bits of punctuation. – LoztInSpace Jul 30 '15 at 23:59
4

I would recommend validating the input and sanitizing the output. That way you can ensure that valid data is stored in the database and harmless data is consumed on the users' end.

If a field expects a date, make sure you are receiving a date. You can easily validate dates, numbers, emails, ZIP codes, phone numbers, and a lot of fields. So do it.
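
A minimal server-side sketch of that kind of validation. The formats are assumptions made for illustration only: ISO `YYYY-MM-DD` dates, US-style ZIP codes, and a quantity between 1 and 1000. All function names are hypothetical.

```python
import re
from datetime import date

# Assumed format: five digits, optionally followed by a four-digit extension.
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")


def is_valid_date(value: str) -> bool:
    """Accept only strings that parse as an ISO calendar date."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def is_valid_zip(value: str) -> bool:
    """Accept only strings that match the assumed ZIP pattern in full."""
    return ZIP_RE.fullmatch(value) is not None


def is_valid_quantity(value: str, low: int = 1, high: int = 1000) -> bool:
    """Accept only integers inside the allowed range."""
    try:
        number = int(value)
    except ValueError:
        return False
    return low <= number <= high
```

Each check accepts or rejects the value as a whole and never rewrites it, which is what separates validation from sanitization.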

Do it in JavaScript, on the client side, AND do it again on the server side. If you validate on the client side, you can show an error message much faster than if the data has to travel all the way to the server, be validated, and be sent back. Do it on the server again, because if someone disables client-side validation, you are still covered.

Sanitize before storing the data - you don't want to be hit by a SQL injection. Use prepared statements if possible, and escape every control char if not possible.
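
A short sketch of the prepared-statement route, using Python's standard `sqlite3` module with an in-memory database. The `comments` table and the hostile string are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")

# Input that would break a query built by string concatenation.
hostile = "x'); DROP TABLE comments; --"

# The ? placeholder sends the value separately from the SQL text,
# so it is stored as data and never parsed as SQL.
conn.execute("INSERT INTO comments (body) VALUES (?)", (hostile,))
conn.commit()

stored = conn.execute("SELECT body FROM comments").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
```

The value comes back byte for byte as it was submitted, and the table is still there: no escaping was needed and no data was altered.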

On the output side, encode the data so that it is harmless in the output format. If outputting HTML, escape all special HTML chars. If outputting JSON or XML, do the encoding accordingly.
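
As a sketch of per-format encoding, the snippet below sends one stored value to three output formats using only Python's standard library (`html.escape`, `json.dumps`, and `xml.sax.saxutils.escape`). The sample value and the surrounding element names are arbitrary.

```python
import html
import json
from xml.sax.saxutils import escape as xml_escape

# One raw value, stored unmodified.
value = 'Tom & "Jerry" <b>'

# Each output format gets its own encoder at the moment of output.
as_html = "<p>" + html.escape(value) + "</p>"
as_json = json.dumps({"name": value})
as_xml = "<name>" + xml_escape(value) + "</name>"
```

The three results differ: HTML escaping turns the quotes into `&quot;`, JSON backslash-escapes them, and XML element content leaves them alone. That difference is why the encoding can only be chosen once the output format is known.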

As others said, filtering and encoding the data on the input side will destroy the data: it can delete parts of the data that would be harmless in some contexts, or keep hazardous data. Validating the input and encoding the output would be the best approach.

ThoriumBR