Input validation is good practice (checking that the input is what's expected), but input filtering is problematic for two reasons:
1. Since input filtering lacks context, the usual option is to filter very broadly (e.g. to attempt to filter for SQL injection and various forms of XSS simultaneously). This leads to much more complicated filtering strategies, adding maintenance overhead, risking data corruption, and generally increasing technical debt.
2. Data loss. This is less of an immediate security issue, but full data retention can be useful for debugging, for future migrations or data transformations, &c. It also ensures you don't get data corruption (e.g. where filtering for one context breaks your data in a different context).
There's another, less appreciated problem, I think: the "sensitive" values you might be trying to filter out are very often perfectly valid values as well.
For instance, the apostrophe character is a very potent character in a number of injection scenarios and one you might be tempted to "filter", but it is also a perfectly legal character in things as common as "names". It is merely one example of a very large and constantly growing set.
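For illustration, a minimal sketch (using Python's stdlib sqlite3 module, with a made-up table): with a parameterized query, the apostrophe in a legitimate name never needs to be filtered at all, because the driver handles quoting for the SQL context.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

name = "Miles O'Brien"  # perfectly legal value containing a "dangerous" character
# Placeholder, not string concatenation: the apostrophe is handled for us.
conn.execute("INSERT INTO users (name) VALUES (?)", (name,))

print(conn.execute("SELECT name FROM users").fetchall())
```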
There's no getting around correctly encoding things on the way out if you want things to work properly.
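A quick sketch of what encoding "on the way out" looks like in practice (Python stdlib; the stored value is made up): the same stored string gets encoded differently depending on the output context.

```python
import html
import urllib.parse

stored = "Tom & Jerry's <picks>"

html_context = html.escape(stored)        # for embedding in an HTML body
url_context = urllib.parse.quote(stored)  # for embedding in a URL component

print(html_context)  # Tom &amp; Jerry&#x27;s &lt;picks&gt;
print(url_context)
```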
There is also the question of "filtering" vs. "rejecting". I personally recommend that, one way or another, nothing containing the first 32 ASCII characters that you don't expect should end up in your database, because those characters are full of magical behaviors in all kinds of places; but I also tend to recommend outright rejection, on the grounds that these things don't come in innocently. Nobody accidentally types the Negative ACK character into their name. But at the very least, filter them out early. You can also outright "filter" on Unicode character classes you don't expect. This really ought to be seen more as mere day-to-day business "data validation" than as a security measure, though, because of the aforementioned fact that some of the Characters of Interest are still valid, and you can't afford to just filter them all out.
(You basically end up with "English letters and numbers". If you're trying to "filter" away all the "bad" characters in advance, without really knowing where they're going, you can't even allow things like "space" (a very active shell character), and UTF-8 can actually be dangerous if something downstream isn't expecting it, etc. And when push really comes to shove, even strings of nothing but English letters and numbers can become dangerous if they are too long, in certain pathological contexts; i.e., "seriously, don't write network software in C". Because the safety of a string is not an intrinsic property of the string but has everything to do with how it is interpreted by further bits of code, there isn't a way to generically "cleanse" a string.)
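Here's a minimal sketch of the reject-rather-than-filter approach for control characters described above (the validate_name function and the name-field framing are assumptions for illustration):

```python
import unicodedata

def validate_name(value: str) -> str:
    # Reject rather than silently clean: control characters (Unicode category
    # "Cc", which includes ASCII 0-31) never arrive innocently in a name field.
    if any(unicodedata.category(ch) == "Cc" for ch in value):
        raise ValueError("unexpected control character in input")
    return value

print(validate_name("Miles O'Brien"))  # passes: the apostrophe is fine
try:
    validate_name("bob\x15smith")      # NAK never shows up by accident
except ValueError as exc:
    print("rejected:", exc)
```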
I've also come across scenarios where allowing the user to enter valid HTML was a requirement. Especially in cases where users enter HTML that renders as part of the site, it was much easier to treat all user input as potentially unsafe and escape output everywhere, except for the one or two places where user-created HTML was supposed to be rendered and/or sent to an external API.
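A sketch of that arrangement, assuming Jinja2-style templates: everything is escaped on output by default, and raw HTML is an explicit opt-in at the one place it's meant to render. Note that sanitize_html here is a hypothetical stand-in for a real allowlist sanitizer, not an actual implementation.

```python
from jinja2 import DictLoader, Environment, select_autoescape
from markupsafe import Markup

env = Environment(
    loader=DictLoader({"post.html": "<h1>{{ title }}</h1>\n<div>{{ body }}</div>"}),
    autoescape=select_autoescape(["html"]),  # escape everything on output by default
)

def sanitize_html(untrusted: str) -> str:
    """Stand-in for an allowlist-based HTML sanitizer; not a real implementation."""
    return untrusted  # assume only known-safe tags would survive here

title = "O'Brien <script>alert(1)</script>"                          # auto-escaped
body = Markup(sanitize_html("<p>User-authored <em>HTML</em></p>"))   # explicit opt-in to raw HTML

print(env.get_template("post.html").render(title=title, body=body))
```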
I run into web applications all the time where lazy and incompetent developers have blocked or filtered out the '<' and '>' characters in a naive attempt to prevent content injection attacks.
If you rely on input filtering and you miss something, there's a bug and the filtering doesn't work, or a new type of text needs to be filtered (e.g. a tag that didn't exist at the time), you have no recourse -- that text is already in the database.
A software update can fix/change the output filtering, and since that runs at display time (when the vulnerability is actually activated), it can address it.
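A toy illustration of that point (escape_v1 and escape_v2 are hypothetical names): the raw text is already stored, but because escaping runs at display time, shipping a better escaper fixes the rendering of old rows without touching the database.

```python
stored_comment = 'He said "hi" & left <b>quickly</b>'  # stored as-is, unfiltered

def escape_v1(s: str) -> str:
    # Original release: only handled angle brackets.
    return s.replace("<", "&lt;").replace(">", "&gt;")

def escape_v2(s: str) -> str:
    # Later update: also handles ampersands and quotes; no data migration needed.
    return (s.replace("&", "&amp;").replace("<", "&lt;")
             .replace(">", "&gt;").replace('"', "&quot;"))

print(escape_v1(stored_comment))  # old behavior
print(escape_v2(stored_comment))  # fixed behavior, same stored data
```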
I think context-sensitive filtering is the logical choice at all stages of the operation. Input coming in from a web form should pass through two filters: one context-sensitive filter for the form data (were the required fields filled out, and ideally does the data look like what's expected in those fields?), and another when inserting into the SQL database, to protect against SQL injection.
Then, when the data itself is read, it should pass through context-sensitive output filters: one for the template engine, and again depending on whether it's going to be embedded in HTML, JavaScript, a stylesheet, or a URL. Output filtering needs to happen, but it critically should not be attempted at input time, as the developer of the form should not be expected to predict every possible use of that data once gathered.
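Roughly, that pipeline might look like the following sketch (field names, schema, and validation rules are all assumed for illustration): a form-context check on the way in, a parameterized insert for the SQL context, and per-context encoding only at render time.

```python
import html
import sqlite3
import urllib.parse

def validate_form(form: dict) -> dict:
    # Stage 1: form-context validation - required fields present, plausible shape.
    if not form.get("email") or "@" not in form["email"]:
        raise ValueError("email is required and must look like an address")
    return form

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (email TEXT)")

form = validate_form({"email": "o'brien@example.com"})
# Stage 2: SQL context - placeholders, not hand-rolled escaping.
conn.execute("INSERT INTO signups (email) VALUES (?)", (form["email"],))

# Output stage: encode per destination context only when the value is rendered.
(email,) = conn.execute("SELECT email FROM signups").fetchone()
print(html.escape(email))         # embedding in HTML
print(urllib.parse.quote(email))  # embedding in a URL
```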
The broad-strokes filtering you describe in step 1 is an anti-pattern through and through. :) I'd personally avoid any codebase or framework attempting this strategy, as I've seen it fail too many times.