How to allow specific characters with OWASP HTML Sanitizer?


I am using the OWASP Html Sanitizer to prevent XSS attacks on my web app. For many fields that should be plain text the Sanitizer is doing more than I expect.
For example:
HtmlPolicyBuilder htmlPolicyBuilder = new HtmlPolicyBuilder();
PolicyFactory stripAllTagsPolicy = htmlPolicyBuilder.toFactory();
stripAllTagsPolicy.sanitize("a+b"); // returns a&#43;b
stripAllTagsPolicy.sanitize("foo@example.com"); // returns foo&#64;example.com
When I have fields such as an email address that contains a +, such as foo+bar@gmail.com, I end up with the wrong data in the database. So two questions:
Are characters such as + - @ dangerous on their own? Do they really need to be encoded?
How do I configure the OWASP HTML Sanitizer to allow specific characters such as + - @?
Question 2 is the more important one for me to get an answer to.

You may want to use the ESAPI API to filter specific characters. If you would rather allow specific HTML elements or attributes, you can use allowElements and allowAttributes as follows.
// Define the policy.
Function<HtmlStreamEventReceiver, HtmlSanitizer.Policy> policy
= new HtmlPolicyBuilder()
.allowElements("a", "p")
.allowAttributes("href").onElements("a")
.toFactory();
// Sanitize your output.
HtmlSanitizer.sanitize(myHtml, policy.apply(myHtmlStreamRenderer));

I know I am answering this question after 7 years, but maybe it will be useful for someone.
So, basically I agree with you guys: we should not allow specific characters, for security reasons (you covered this topic, thanks).
However, I was working on a legacy internal project which required escaping HTML characters except "@", for a reason I cannot tell (but it does not matter). My workaround for this was simple:
private static final PolicyFactory PLAIN_TEXT_SANITIZER_POLICY = new HtmlPolicyBuilder().toFactory();
public static String toString(Object stringValue) {
if (stringValue != null && stringValue.getClass() == String.class) {
// Sanitize first, then restore the one character that must stay literal.
return HTMLSanitizerUtils.PLAIN_TEXT_SANITIZER_POLICY.sanitize((String) stringValue).replace("&#64;", "@");
} else {
return null;
}
}
I know it is not clean and it creates an additional String, but we badly needed this.
So, if you need to allow specific characters you can use this workaround. But if you need to do this, your application is probably incorrectly designed.

The danger in XSS is that one user may insert HTML code in his input data that you later insert in a web page that is sent to another user.
There are in principle two strategies you can follow if you want to protect against this. You can either remove all dangerous characters from user input when it enters your system, or you can HTML-encode the dangerous characters when you later write them back to the browser.
Example of the first strategy:
User enters data (with HTML code)
Server removes all dangerous characters
Modified data is stored in the database
Some time later, server reads modified data from the database
Server inserts modified data in a web page sent to another user
Example of second strategy:
User enters data (with HTML code)
Unmodified data, with dangerous characters, is stored in the database
Some time later, server reads unmodified data from the database
Server HTML-encodes dangerous data and inserts it into a web page sent to another user
The first strategy is simpler, since data is usually written less often than it is read. However, it is also more difficult because it potentially destroys the data. It is particularly difficult if you need the data for something other than sending it back to the browser later on (like using an email address to actually send an email). It makes it more difficult to, for example, search the database, include data in a PDF report, insert data in an email and so on.
The other strategy has the advantage of not destroying the input data, so you have greater freedom in how you want to use the data later on. However, it may be more difficult to actually check that you HTML-encode all user-submitted data that is sent to the browser. A solution to your particular problem would be to HTML-encode the email address when (or if) you ever put that email address on a web page.
The XSS problem is an example of a more general problem that arises when you mix user-submitted data and control code. SQL injection is another example of the same problem. The problem is that the user-submitted data is interpreted as instructions and not data. A third, less well-known example is mixing user-submitted data into an email. The user-submitted data may contain strings that the email server interprets as instructions. The "dangerous character" in this scenario is a line break followed by "From:".
It would be impossible to validate all input data against all possible control characters or sequences of characters that may in some way be interpreted as instructions in some potential application in the future. The only permanent solution is to sanitize all potentially unsafe data at the point where you actually use that data.
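As a rough illustration of the second strategy, the sketch below keeps the stored value untouched and encodes it only when it is written into HTML. The class and method names are made up for this example, and the hand-written encoder is an assumption for illustration; a real project would more likely rely on a maintained encoder library.

```java
final class HtmlOutputEncoding {

    // Hypothetical helper: escapes the characters that are significant in
    // HTML text and in quoted attribute values. Everything else, including
    // '+', '-' and '@', passes through unchanged.
    static String encodeForHtml(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this approach the database still holds foo+bar@gmail.com exactly as entered, so it can be used to send mail, while a value containing markup is rendered as inert text when it is displayed.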

To be honest you should really be doing a whitelist against all user supplied input. If it's an email address, just use the OWASP ESAPI or something to validate the input against their Validator and email regular expressions.
If the input passes the whitelist, you should go ahead and store it in the DB. When displaying the text back to a user, you should always HTML encode it.
Your blacklist approach is not recommended by OWASP and could be bypassed by someone who is committed to attacking your users.
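A minimal sketch of that kind of allow-list check, using only java.util.regex rather than ESAPI itself. The pattern, the length limit and the class name are assumptions chosen for illustration; the pattern is deliberately simple and does not cover every address the mail RFCs permit.

```java
import java.util.regex.Pattern;

final class EmailAllowList {

    // Assumed pattern for illustration: a local part that may contain '+',
    // an '@', a domain, and an alphabetic top-level label.
    private static final Pattern EMAIL =
            Pattern.compile("^[A-Za-z0-9._%+'-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$");

    // Assumed upper bound on the total length of an address.
    private static final int MAX_LENGTH = 254;

    // Returns true only when the whole input matches the allow-list pattern.
    static boolean isValidEmail(String input) {
        return input != null
                && input.length() <= MAX_LENGTH
                && EMAIL.matcher(input).matches();
    }
}
```

An input that passes is stored as-is; HTML encoding still happens later, when the value is displayed.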

You should decode after sanitising your input:
System.out.println(StringEscapeUtils.unescapeHtml("<br />foo'example.com"));

Related

Best Practice For XSS Attacks in Rest Api [closed]

I have read a lot about it, but couldn't really decide which way is best.
I have a web app and a Java REST application which serves customers.
What is the best way to prevent XSS attacks using parameters in a REST API and frontend?
Validating each parameter in both server and client side
Filter and control request params
On client side control before putting every data in between tags
etc...
Thank you for your time.

As with anything defense in depth is important, so validation and encoding should be done on any user provided input. Encoding is very important because what might be considered malicious is contextual. For example, what might be safe HTML might be an SQL Injection attack.
Parameters in a REST API may be saved which means they are returned from subsequent requests or the results may be reflected back to the user in the request. This means that you can get both reflected and stored XSS attacks. You also need to be careful about DOM Based XSS attacks. A more modern categorization that addresses overlap between stored, reflected, and DOM XSS is Server XSS and Client XSS.
OWASP has a great Cross Site Scripting Prevention Cheat Sheet that details out how to prevent cross site scripting. I find the XSS Prevention Rules Summary and the Output Encoding Rules Summary sections to be very handy.
The big take away is that browsers parse data differently depending on the context, so it is very important that you don't just HTML Entity Encode the data everywhere. This means it is important to do two things:
Rule #0 - Only insert untrusted (user provided) data in allowed locations. Only insert data into an HTML document into a "slot" defined by Rules #1-5.
When you insert data into one of the trusted slots follow the encoding rules for that specific slot. Again the rules are detailed in the previously linked Cross Site Scripting Prevention Cheat Sheet.
There is also a DOM based XSS Prevention Cheat Sheet. Like the server-side XSS cheat sheet, it provides a set of rules to prevent DOM based XSS.
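To make the point about context concrete, here is a hedged sketch of encoding for one non-HTML slot: a value placed inside a quoted JavaScript string. It escapes every non-alphanumeric character, which is the general approach the cheat sheet describes for that slot; the class and method names are hypothetical, and a vetted encoder library is the safer choice in practice.

```java
final class JsStringEncoding {

    // Hypothetical helper: keeps ASCII letters and digits, and writes every
    // other UTF-16 code unit as a four-digit hexadecimal JavaScript escape.
    static String encodeForJsString(String raw) {
        StringBuilder out = new StringBuilder(raw.length() * 2);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            boolean alphanumeric = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alphanumeric) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }
}
```

HTML entity encoding would be the wrong tool here: inside a script block the browser does not decode entities, so the value would be displayed incorrectly or could still break out of the string in other ways.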

When it comes to XSS, the only possible choice is to validate user input, any kind of user input, whether it is passed from the browser or in any other way (like from a terminal client).
It depends on the scenario you are following.
If it is just data without HTML content then you don't need to worry about XSS.
Otherwise, just removing the < and > symbols or converting them into character-encoded strings would be enough.
You can also avoid using innerHTML to append new content to the document; use innerText instead, and even if there is XSS content it won't execute.
But it gets a little bit complicated when the API response returns HTML content as well, which you need to display somewhere. In such cases avoid directly displaying user input inside an HTML snippet; try to character-encode or remove the < and > symbols and it will be just fine.

How to defend against xss when saving data and when displaying it

Let's say I have a simple CRUD application with a form to add a new object and edit an existing one. From a security point of view I want to defend against cross-site scripting. First I would validate the submitted data on the server. But after that, I would also escape the values being displayed in the view, because maybe I have more than one application writing to my database (some developer might by mistake insert unvalidated data in the DB in the future). So I will have this JSP:
<%@ taglib prefix="esapi" uri="http://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API" %>
<form ...>
<input name="myField" value="<esapi:encodeForHTMLAttribute>${myField}</esapi:encodeForHTMLAttribute>" />
</form>
<esapi:encodeForHTMLAttribute> does almost the same thing as <c:out>: it HTML-escapes sensitive characters like < > " etc.
Now, if I load an object that somehow was saved in the database with myField=abc<def, the input will correctly display the value abc<def while the value in the HTML behind it will be abc&lt;def.
The problem is when the user submits this form without changing the values, the server receives the value abc&lt;def instead of what is visible in the page, abc<def. So this is not correct. How should I implement the protection in this case?

The problem is when the user submits this form without changing the values, the server receives the value abc&lt;def instead of what is visible in the page, abc<def
Easy. In this case HTML decode the value, and then validate.
Though as noted in a few comments, you should see how we operate with the OWASP ESAPI-Java project. By default we always canonicalize the data which means we run a series of decoders to detect multiple/mixed encoding as well as to create a string safe to validate against with regex.
For the part that really guarantees you protection however, you normally want to have raw text stored on the server--not anything that contains HTML characters, so you may wish to store the unescaped string, if only that you can safely encode it when you send it back to the user.
Encoding is the best protection for XSS, and I would in fact recommend it BEFORE input validation if for some reason you had to choose.
I say may because in general I think it's a bad practice to store altered data. It can make troubleshooting a chore. This can be even more complicated if you're using a technology like TinyMCE, a rich-text editor in the browser. It also renders HTML, so it's like dealing with a browser within a browser.

Whitelist validation for http request

I am trying to create a servlet request filter which filters any incoming request based on the whitelist characters.
I want to accept only those characters which matches the whitelist pattern to avoid any malicious code to be executed by the attacker in the form of script or modified URL.
Does anyone know which whitelist characters should be used for filtering any HTTP request string?
Any help would be appreciated
Thanks in Advance

Implement pattern matching mechanism to find whitelist characters from your URL pattern by using RegEx..
Follow this link1
Or you can try:
if (inputUrl.contains(whiteList)) {
// your code goes here
}
Or If you need to know where it occurs, you can use indexOf:
int index = inputUrl.indexOf(whiteList);
if (index != -1) // -1 means "not found"
{
...
}
Thanks,
~Chandan

The problem is that "malicious" is a very broad term. You should have a clear idea what types of attacks you are trying to protect against and then take measures to prevent them.
You cannot specify in general a set of characters that needs to be filtered out; you need to know the domain in which the input from the URL will be used. Generally what is dangerous is not the URL itself but the URL parameters, which are provided by your users and then interpreted by your application. Depending on how your application will use this input, you need to take specific precautions. So for example:
A URL param is used to determine the target of a redirect. An attacker can use this to send a victim to a malicious site, a site which masquerades as your site but will steal the user's credentials, and so on. In that case you should construct a whitelist of allowed destinations expected by your application and forbid all others. See OWASP Top Ten - Unvalidated Redirects and Forwards.
You save data from a URL param to the DB. You should prevent SQL injection by using parameterized queries. See the OWASP SQL Injection Cheat Sheet.
URL param data will be displayed as HTML. You should sanitize your HTML with an already proven sanitizer such as OWASP HTML Sanitizer or AntiSamy to prevent cross-site scripting.
And so on...
The point is, there is no silver bullet to protect you from all malicious attack vectors, and especially not whitelisting certain characters in a servlet filter. You should know where potentially malicious data is used and process it with that specific usage in mind, because different targets have different vulnerabilities and require different protective measures.
A good start for a high-level overview of security issues and the measures that protect against them is the OWASP Top Ten. After that I recommend the more detailed guides and resources provided by OWASP.
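For the redirect case in the list above, an allow-list of destinations could be sketched roughly as follows. The host names and the class name are placeholders invented for this example, and requiring https is an extra assumption, not something the answer states.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

final class RedirectAllowList {

    // Placeholder hosts; a real application would list its own destinations.
    private static final Set<String> ALLOWED_HOSTS =
            Set.of("example.com", "www.example.com");

    // Accepts only absolute https URLs whose host is on the allow-list.
    // Anything that fails to parse, lacks a host, or uses another scheme
    // is rejected.
    static boolean isAllowedRedirect(String target) {
        if (target == null) {
            return false;
        }
        try {
            URI uri = new URI(target);
            String host = uri.getHost();
            return "https".equalsIgnoreCase(uri.getScheme())
                    && host != null
                    && ALLOWED_HOSTS.contains(host.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            return false;
        }
    }
}
```

The check compares the parsed host rather than searching the raw string, so a target such as https://example.com@evil.example.net/ is rejected even though it contains an allowed name.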

Do I need to enable canonicalization when using OWASP ESAPI?

We are adding ESAPI 2.x (owasp java security library) to an application.
The change is easy though quite repetitive. We are adding validations to all input parameters so we make sure all the characters they are composed by are within a whitelist.
This is it:
Validator instance = ESAPI.validator();
Assert.assertTrue(instance.isValidInput("test", "xxx@gmail.com", "Email", 100, false));
Then the Email pattern is set in the validation.properties file like:
Validator.Email=^[A-Za-z0-9._%'-]+@[A-Za-z0-9.-]+\\.[a-zA-Z]{2,4}$
Easy!
We are not encoding output given that after the input validation, data becomes trusted.
I can see in ESAPI that it has a flag to canonicalize the input String. I understand that canonicalization is "de-encoding" so any encoded String is transformed in plain text.
The question is. Why do we need to canonicalize?
Can anybody show a sample of an attack that will be prevented by using canonicalization?? (in java)
thank you!

Here's one (of several thousand possible examples):
Take this simple XSS input:
<script>alert('XSS');</script>
//Now we URI encode it:
%3Cscript%3Ealert(%27XSS%27)%3B%3C%2Fscript%3E
//Now we URI encode it again:
%253Cscript%253Ealert(%2527XSS%2527)%253B%253C%252Fscript%253E
Canonicalization on the input that's been encoded once will result in the original input, but in ESAPI's case, the third input will throw an IntrusionException because there is NEVER a valid use case where user input will be URI-encoded more than once. In this particular example, canonicalization means "all URI data will be reduced into its actual character representation." ESAPI actually does more than just URI decoding, btw. This is important if you wish to perform both security and/or business validation using regular expressions--the primary use of regular expressions in most applications.
At a bare minimum, canonicalization gives you good assurance that sneaking malicious input into the application isn't easy: The goal is to restrict to known-good values (whitelist) and reject everything else.
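The idea can be sketched with the JDK alone. This is not ESAPI's implementation, which handles many more encodings; it only decodes percent-encoding once and rejects input that still contains a percent escape afterwards. The class and method names are hypothetical.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class CanonicalizeSketch {

    // Matches a percent sign followed by two hexadecimal digits.
    private static final Pattern PERCENT_ESCAPE = Pattern.compile("%[0-9A-Fa-f]{2}");

    // Decodes once, then refuses anything that still looks percent-encoded,
    // on the assumption that legitimate input is never encoded twice.
    // Note: URLDecoder follows form-encoding rules, so '+' becomes a space.
    static String canonicalize(String input) {
        String decoded = URLDecoder.decode(input, StandardCharsets.UTF_8);
        if (PERCENT_ESCAPE.matcher(decoded).find()) {
            throw new IllegalArgumentException("input appears to be encoded more than once");
        }
        return decoded;
    }
}
```

The singly encoded example above comes back as the original script tag, ready to be matched against the whitelist, while the doubly encoded one is rejected before any validation regex sees it.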
In regards to your ill-advised comment here:
We are not encoding output given that after the input validation, data becomes trusted.
Here's the dirty truth: Javascript, XML, JSON, and HTML are not "regular languages." They're nondeterministic. What this means in practical terms is that it is mathematically impossible to write a regular expression to reject all attempts to insert HTML or Javascript into your application. Look at that XSS Filter Evasion Cheat sheet I posted above.
Does your application use jQuery? The following input is malicious:
$=''|'',_=$+!"",__=_+_,___=__+_,($)[_$=($$=(_$=""+{})[__+__+_])+_$[_]+(""+_$[-__])[_]+(""+!_)[___]+($_=(_$=""+!$)[$])+_$[_]+_$[__]+$$+$_+(""+{})[_]+_$[_]][_$]((_$=""+!_)[_]+_$[__]+_$[__+__]+(_$=""+!$)[_]+_$[$]+"("+_+")")()
So you must encode all data when it is output to the user, for the proper context. This means that if a piece of data is going to be first passed into a JavaScript function and then displayed as HTML, you encode for JavaScript and then for HTML. If it's output into an HTML data field (such as a default input box), you encode it for an HTML attribute.
It's actually MORE IMPORTANT to do output encoding than input filtering in protecting against XSS. (If I HAD to choose just one...)
The pattern you want to follow in web development is one where any input that is coming from the outside world is treated as malicious at all times. You encode any time you're handing off to a dynamic interpreter.

Canonicalization of data is also about reducing the data to its basic form. So if we take a different scenario, where a file path (relative or symlink) and its associated directory permissions are involved, we need to first canonicalize the path and then validate it; otherwise somebody could explore those files without permission just by passing data that looks acceptable.
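A small sketch of canonicalize-then-validate for paths, using java.nio.file. The class name is illustrative. Note that normalize() only collapses "." and ".." segments; it does not resolve symbolic links, so code that must account for symlinks would additionally need something like Path.toRealPath() on an existing file.

```java
import java.nio.file.Path;

final class SafePathResolver {

    // Resolves a user-supplied relative path against a base directory,
    // normalizes the result, and only then checks that it is still inside
    // the base directory.
    static Path resolveInside(Path baseDir, String userSupplied) {
        Path base = baseDir.toAbsolutePath().normalize();
        Path candidate = base.resolve(userSupplied).normalize();
        if (!candidate.startsWith(base)) {
            throw new SecurityException("path escapes the base directory");
        }
        return candidate;
    }
}
```

Validating the raw string instead (for example, checking that it starts with an allowed folder name) would accept reports/../../etc/passwd, which is exactly the bypass described above.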

How do I send a query to a website and parse the results?

I want to do some development in Java. I'd like to be able to access a website, say for example
www.chipotle.com
On the top right, they have a place where you can enter in your zip code and it will give you all of the nearest locations. The program will just have an empty box for user input for their zip code, and it will query the actual chipotle server to retrieve the nearest locations. How do I do that, and also how is the data I receive stored?
This will probably be a followup question as to what methods I should use to parse the data.
Thanks!

First you need to know the parameters needed to execute the query and the URL which these parameters should be submitted to (the action attribute of the form). With that, your application will have to do an HTTP request to the URL, with your own parameters (possibly only the zip code). Finally parse the answer.
This can be done with standard Java API classes, but it won't be very robust. A better solution would be HttpClient. Here are some examples.
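As a sketch of those steps, the example below uses the JDK's own java.net.http.HttpClient (Java 11 and later) instead of the Apache HttpClient the answer most likely refers to. The endpoint URL and the "zip" parameter name are invented placeholders; the real values have to be read from the form on the target site.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class StoreLocatorQuery {

    // Placeholder endpoint; not the real address of any site.
    private static final String ENDPOINT = "https://www.example.com/locations/search";

    // Builds the request URI, URL-encoding the user-supplied parameter.
    static URI buildUri(String zipCode) {
        String query = "zip=" + URLEncoder.encode(zipCode, StandardCharsets.UTF_8);
        return URI.create(ENDPOINT + "?" + query);
    }

    // Sends a GET request and returns the response body as a String,
    // which can then be handed to an HTML, XML or JSON parser.
    static String fetch(String zipCode) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(buildUri(zipCode)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

The received data is simply held in memory as a String here; whether to keep it, parse it, or write it to disk depends on what the response actually contains.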

This will probably be a followup question as to what methods I should use to parse the data.
It very much depends on what the website actually returns.
If it returns static HTML, a regular (strict) or permissive HTML parser should be used.
If it returns dynamic HTML (i.e. HTML with embedded Javascript) you may need to use something that evaluates the Javascript as part of the content extraction process.
There may also be a web API designed for programs (like yours) to use. Such an API would typically return the results as XML or JSON so that you don't have to scrape the results out of an HTML document.
Before you go any further you should check the Terms of Service for the site. Do they say anything about what you are proposing to do?
A lot of sites DO NOT WANT people to scrape their content or provide wrappers for their services. For instance, if they get income from ads shown on their site, what you are proposing to do could result in a diversion of visitors to their site and a resulting loss of potential or actual income.
If you don't respect a website's ToS, you could be on the receiving end of lawyers' letters ... or worse. In addition, they could already be using technical means to make life difficult for people who scrape their service.
