July 10th, 2024

Don't try to sanitize input. Escape output. (2020)

Limitations of input sanitization in preventing XSS attacks are discussed. Filtering unsafe characters may alter input or provide false security. Contextual escaping and validation are crucial for secure coding practices.


The article discusses the limitations of input sanitization in preventing cross-site scripting (XSS) attacks. It explains how simply filtering out unsafe characters can lead to unintended consequences like altering user input or providing a false sense of security. The recommended approach is to escape output based on the context in which it will be displayed, ensuring proper handling of characters in different scenarios like HTML, JSON, and SQL. The article emphasizes the importance of contextual escaping and using features like parameterized queries in SQL to prevent vulnerabilities. It also addresses the challenge of allowing users to input HTML or Markdown content, suggesting approaches like whitelisting allowed tags and attributes or using security-vetted libraries. Additionally, the article highlights the significance of input validation for ensuring data integrity and preventing malicious inputs. It provides further resources for understanding and implementing secure coding practices to mitigate XSS and SQL injection risks effectively.
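As a rough illustration of what "escape output based on the context" can mean, here is a minimal Python sketch using only the standard library. The sample input and variable names are made up for this example and do not come from the article; the point is that the raw input stays unchanged and each output context applies its own escaping.

```python
import html
import json

# Hypothetical user input, stored exactly as submitted.
user_input = '<script>alert("hi")</script> & O\'Reilly'

# HTML text context: escape &, <, > and both quote characters.
html_out = "<p>" + html.escape(user_input, quote=True) + "</p>"

# HTML attribute context: same escaper, and the attribute value is always quoted.
attr_out = '<input value="' + html.escape(user_input, quote=True) + '">'

# JSON context: let the serializer do the quoting and escaping.
json_out = json.dumps({"comment": user_input})

print(html_out)
print(attr_out)
print(json_out)
```

Because nothing was stripped on the way in, the original text can be recovered from every output form, and a new output context later only needs its own escaper.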

22 comments
By @terraexpert - 6 months
A bug hunter once contacted us, claiming he had access to our database and asking for a bounty for his finding. He even provided a sample of the first 100 users from the users table.

After some investigating, I figured out how he had obtained the data.

He was one of the first 100 users. He had set one of his fields to an XSS Hunter payload and slept on it.

Two years later, a developer had a dump of the data to test some things on and loaded it into a SQL development tool on his Mac. Using his VS Code muscle memory, he pressed Command+Shift+P to open the VS Code command palette, but in the SQL editor that shortcut opened "Print Preview". The software rendered the current table view into a webview to make printing easier, the XSS payload executed there, and the page content was sent to the researcher.

Escape input; you never know where it will be rendered.

By @simonw - 6 months
This is such an important lesson, but it's a difficult one to convince people of - telling people NOT to sanitize their input goes against so much existing thinking and teaching about web application security.

It's worth emphasizing that there's still plenty of scope for sensible input validation. If a field is a number, or one of a known list of items (US States for example) then obviously you should reject invalid data.

But... most web apps end up with some level of free-form text. A comment on Hacker News. A user's bio field. A feedback form.

Filtering those is where things go wrong. You don't want to accidentally create a web development discussion forum where people can't talk about HTML because it gets stripped out of their comments!
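
The kind of validation described in the comment above (a numeric field, or one of a known list of items) could look something like the following Python sketch. The function names, the upper bound, and the abbreviated state list are assumptions for illustration only. Note that it rejects bad values rather than trying to repair them.

```python
# Hypothetical, abbreviated list; a real application would hold all states.
US_STATES = frozenset({"CA", "NY", "TX", "WA"})


def parse_quantity(raw: str, maximum: int = 1000) -> int:
    """Accept only a plain non-negative whole number up to `maximum`."""
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a whole number: {raw!r}")
    value = int(text)
    if value > maximum:
        raise ValueError(f"value too large: {value}")
    return value


def parse_state(raw: str) -> str:
    """Accept only a known state code, normalised to upper case."""
    code = raw.strip().upper()
    if code not in US_STATES:
        raise ValueError(f"unknown state: {raw!r}")
    return code
```

Free-form text fields get no such treatment here: they would be stored as entered and escaped when rendered.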

By @foota - 6 months
It's buried a bit in the article, but if you have to sanitize input to allow only some kinds of inputs (e.g., specific tags), you should really be parsing it fully to an AST and then acting on that (or using a library doing the same) since otherwise you're going to be subject to all sorts of pain.
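
A very small sketch of the "parse it, then act on the parsed form" idea, in Python. This is an assumption-laden illustration, not a vetted sanitizer: it uses the standard library's event-based `html.parser` rather than building a full tree, the allowlist is hypothetical, and all attributes are dropped. The output is rebuilt from scratch, so it can only ever contain allowlisted tags and escaped text. For production use, a security-reviewed library is the safer choice.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist; no attributes are ever emitted.
ALLOWED_TAGS = frozenset({"b", "i", "em", "strong", "p"})


class AllowlistRenderer(HTMLParser):
    """Re-serialises parsed input, keeping only allowlisted tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out: list[str] = []
        self.open_tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.open_tags.append(tag)
            self.out.append(f"<{tag}>")  # attributes are deliberately dropped
        # Disallowed tags are dropped; their text content is still escaped and kept.

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag in self.open_tags:
            # Close everything opened since the matching start tag.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break

    def handle_data(self, data):
        self.out.append(escape(data))

    def close(self):
        super().close()
        # Close any tags the input left open.
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")


def render_allowlisted(fragment: str) -> str:
    renderer = AllowlistRenderer()
    renderer.feed(fragment)
    renderer.close()
    return "".join(renderer.out)


result = render_allowlisted('<b onclick="steal()">bold</b> <script>alert(1)</script>')
print(result)
```

Comments, doctype declarations, and processing instructions are silently dropped because the parser's default handlers for them do nothing.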
By @WillAdams - 6 months
I still wish that the Unicode folks had set up a bunch of duplicate code points which could have been used exclusively for processing marked-up text and that the folks making markup systems/languages had followed through.

Say one was updating TeX to take advantage of this --- all the normal Unicode character points would then have catcodes set to make them appropriate to process as text (or a matching special character), while "processing-marked-up" characters would then be set up so that for example:

- \ (processing-marked-up variant) would work to begin TeX commands

- # (processing-marked-up variant) would work to enumerate macro command arguments

- & (processing-marked-up variant) would work to delineate table columns

&c.

and the matching "normal" characters when encountered would simply be set.

By @marticode - 6 months
Why not both? Escaping output should be a requirement, but it doesn't hurt to remove obvious garbage from the input (including harmless stuff like pointless spaces).
By @buro9 - 6 months
I store the raw input in my database, but run it through bluemonday before rendering it. Simples.

https://github.com/microcosm-cc/bluemonday

By @hinkley - 6 months
This is another place where 80% of the time one way works but 20% of the time you need to go the other way.

Of course once the product is in production you can swim one direction but not fight the current going in the other. You can always move to escaping output, but retroactively sanitizing input is a giant pain in the ass.

But the problem comes in with your architecture, and whether you can discern data you generated from data the customers generated. Choose the wrong metaphors and you end up with partially formatted data existing halfway up your call stack instead of only at the view layer. And now you really are fucked.

Rails has a cheat for this. It sets a single boolean value on the strings which is meant to indicate the provenance of the string content. If it has already been escaped, it is not escaped again. If you are combining escaped and unescaped data, you have to write your own templating function that is responsible for escaping the unescaped data (or it can lie and create security vulnerabilities. "It's fine! This data will always be clean!" Oh foolish man.)

The better solution is to push the formatting down the stack. But this is a rule that Expediency is particularly fond of breaking.
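
The provenance flag described in this comment can be approximated in a few lines of Python. This is only loosely analogous to what Rails does and is not its actual implementation; the `SafeHtml`, `to_html`, and `render` names are invented for this sketch. A `str` subclass marks content as already escaped, and anything without the mark is escaped before it is combined.

```python
from html import escape


class SafeHtml(str):
    """Marker type: the content has already been escaped for HTML."""


def to_html(value: str) -> SafeHtml:
    """Escape plain strings once; pass already-escaped content through."""
    if isinstance(value, SafeHtml):
        return value
    return SafeHtml(escape(value))


def render(template: str, **values: str) -> SafeHtml:
    """Fill a trusted template with values, escaping the untrusted ones."""
    escaped = {name: to_html(value) for name, value in values.items()}
    return SafeHtml(template.format(**escaped))


greeting = render("<p>Hello, {name}</p>", name="<b>Eve</b>")
page = render("<div>{body}</div>", body=greeting)  # greeting is not escaped twice
print(page)
```

One design consequence: ordinary string methods on a `SafeHtml` value return a plain `str`, so modified content loses the mark and is escaped again. That errs toward double escaping rather than toward a vulnerability.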

By @shaftway - 6 months
I've always been a big fan of structuring data on input, escaping it on output.

I think the big problem with just escaping output is that you can accidentally change what the output will actually be in ways that your users can't predict. If I am explaining some HTML in a field and drop `<i>...</i>` in there today, your escaper may escape this properly. But next month when you decide to change your output to actually allow an `<i>` tag, then all of a sudden my comment looks like some italicized dots, which broke it.

Instead if you structure it, and store it in your datastore as a tree of nodes and tags, then next month when you want to support `<i>` you update the input reader to generate the new structure, and the output writer to handle the new tags. You preserve old values while sanitizing or escaping things properly for each platform.
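
A minimal Python sketch of the "tree of nodes and tags" storage idea follows. The `Text` and `Element` types and both renderers are hypothetical names chosen for this example. The old comment that merely talks about `<i>` is stored as text, so it keeps rendering literally even after the renderer starts allowing real `<i>` elements.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Text:
    value: str


@dataclass
class Element:
    tag: str
    children: list = field(default_factory=list)


def render_html(node, allowed_tags) -> str:
    """Render for an HTML page: escape text, emit only allowed tags."""
    if isinstance(node, Text):
        return escape(node.value)
    inner = "".join(render_html(child, allowed_tags) for child in node.children)
    if node.tag in allowed_tags:
        return f"<{node.tag}>{inner}</{node.tag}>"
    return inner  # unknown tag: keep the content, drop the markup


def render_plain(node) -> str:
    """Render for a plain-text channel (for example an email body)."""
    if isinstance(node, Text):
        return node.value
    return "".join(render_plain(child) for child in node.children)


# Stored last month, when <i> was not supported: the markup is just text.
old_comment = Element("p", [Text("Use <i>...</i> for italics")])
# Stored after <i> support was added to the input reader.
new_comment = Element("p", [Text("This is "), Element("i", [Text("important")])])

allowed = {"p", "i"}
print(render_html(old_comment, allowed))
print(render_html(new_comment, allowed))
```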

By @zzo38computer - 6 months
It is a reasonable idea, but there are other things that can be done too.

However, in the stuff about SQL, you could use SQL host parameters (usually denoted by question marks) if the database system you use supports it, which can avoid SQL injection problems.
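
For readers unfamiliar with host parameters, here is a small Python sketch using the standard `sqlite3` module with an in-memory database. The table, rows, and hostile string are invented for the example. It contrasts string interpolation with a `?` placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "user")],
)

hostile = "nobody' OR '1'='1"

# Unsafe: the input becomes part of the SQL text and changes the query's meaning.
unsafe_rows = conn.execute(
    f"SELECT name FROM users WHERE name = '{hostile}'"
).fetchall()

# Safe: the input is bound as a value and is only ever compared as a string.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

The interpolated query matches every row, while the parameterized one matches none, since no user is literally named `nobody' OR '1'='1`.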

If you deliberately allow the user to enter SQL queries, there are some better ways to handle this. If you use a database system that allows restricting SQL queries (like the authorizer callback and several other functions in SQLite which can be used for this purpose), then you might use that; I think it is better than trying to write a parser for the SQL code which is independent of the database, and expecting it to work. Another alternative is to allow the database (in CSV or SQLite format) to be downloaded (and if the MIME type is set correctly, then it is possible that a browser or browser extension will allow the user to do so using their own user interface if they wish to do so; otherwise, an external program can be used).
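
Python's `sqlite3` module exposes the SQLite authorizer callback as `Connection.set_authorizer`, so the restriction mentioned above can be sketched as follows. The read-only policy and the table are assumptions for illustration; a real policy would likely also restrict which tables and columns may be read.

```python
import sqlite3

# Actions a read-only visitor is allowed to perform (hypothetical policy).
READ_ONLY_ACTIONS = {
    sqlite3.SQLITE_SELECT,
    sqlite3.SQLITE_READ,
    sqlite3.SQLITE_FUNCTION,
}


def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Called by SQLite while a statement is compiled; deny anything not read-only."""
    if action in READ_ONLY_ACTIONS:
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")
conn.execute("INSERT INTO notes VALUES ('hello')")
conn.commit()

# From here on, user-supplied SQL runs under the restricted policy.
conn.set_authorizer(read_only_authorizer)

rows = conn.execute("SELECT body FROM notes").fetchall()

try:
    conn.execute("DROP TABLE notes")
    drop_allowed = True
except sqlite3.DatabaseError:
    drop_allowed = False

rows_after = conn.execute("SELECT body FROM notes").fetchall()
print(rows, drop_allowed, rows_after)
```

The database engine itself decides what each statement does, so there is no separate SQL parser to keep in sync with it.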

Some of the other problems mentioned, and the complexity involved, are due to problems with the messy complexity of HTML and WWW, in general.

For validation, you should of course validate on the back end, and you may do so in the front end too (especially if the data needed for validation is small and is intended to be publicly known). However, if JavaScript is disabled, the form should still be sent, and the server will reply with an error message if the validation fails; if JavaScript is enabled, it can check for the error before sending the form to the server. It will therefore work either way.

By @chx - 6 months
This has been the way for Drupal since ... 2005 at least. My memory becomes fuzzy before that. Since 2015 it's highly automated too thanks to Twig autoescape.
By @Udo - 6 months
They're not even related. Sanitizing input is at best a formatting/style issue. Escaping output is a security issue.
By @lsb - 6 months
Of the “six famous bad ideas in computer security”, the first and second are “default permit” and “enumerating badness”.

http://www.ranum.com/security/computer_security/editorials/d...

By @KingOfCoders - 6 months
I think the challenge is, you share data with other systems. If you don't treat "sharing" as "output" you're in trouble.
By @kazinator - 6 months
Of course you should sanitize input, and escape everything properly in the context-specific way.

Defining what is valid for an input field and rejecting everything else helps the user catch mistakes. It's not just for security.

Some kinds of information are tricky to sanitize. Names, addresses and such. Especially in an application or site that has global users. Do the wrong thing and you end up aggravating users, who are not able to input something legitimate.

But maybe don't allow, say, a date field to be "la la la" or even "December 47, 2023".
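
The date example can be checked with Python's standard `datetime` module. This sketch assumes dates arrive in the "Month day, year" form used in the comment and that the process runs with an English locale, since `%B` month names are locale-dependent. The `parse_date` name is made up for the example.

```python
from datetime import date, datetime


def parse_date(raw: str) -> date:
    """Parse 'December 25, 2023' style input; raise ValueError otherwise."""
    return datetime.strptime(raw.strip(), "%B %d, %Y").date()


def is_valid_date(raw: str) -> bool:
    try:
        parse_date(raw)
        return True
    except ValueError:
        return False


for candidate in ["December 25, 2023", "December 47, 2023", "la la la"]:
    print(candidate, is_valid_date(candidate))
```

Impossible calendar dates such as "February 30, 2023" are rejected as well, because the parsed fields must form a real date.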

By @ww520 - 6 months
Still looking for a way to safely parse an HTML string into a DOM while avoiding XSS attacks. Most solutions end up sanitizing input.
By @ecjhdnc2025 - 6 months
Ehhh!? I don't get this at all. You obviously do both.

1) you get your input data into the form that is meaningful in the database by validating, sanitising and transforming it. Because you know what form that data should be in, and that's the only form that belongs in your database. Data isn't just output, sometimes it is processed, queried, joined upon.

2) you correctly format/transform it for output formats. Now you know what the normalised form is in the database, you likely have a simpler job to transform it for output.

It's not just lazy to suggest there's a choice here, it's wrong.

By @atmanactive - 6 months
Absolutely the worst advice ever!
By @dudeinjapan - 6 months
Why not both?
By @TheChaplain - 6 months
Disagree.

Escaping/sanitizing on output takes extra cycles/energy that can be spared if the same process is done once upon submission.

Think more sustainable.

By @ungamedplayer - 6 months
The reason you sanitise input is that the data can attack both the host and the client.

This post has a narrow view on attackers.