I am working on a Java web application that is many years old.
Most of the <bean:write>s in the JSPs have filter="false" even when it isn't needed, probably because of developers blindly copying existing code. <bean:write> is the Struts tag to output a JSP variable, and when filter="false" is specified it does not do HTML escaping (so filter="false" is similar to the <c:out> attribute escapeXml="false"). This means that the application is vulnerable to XSS attacks, because some of these <bean:write filter="false">s are outputting user input.
A blanket removal of filter="false" isn't an option because in some cases the application allows the user to enter HTML using a TinyMCE text area, so we do need to output raw HTML in some cases to retain the user-entered formatting (although we should still be sanitising user-entered HTML to remove scripts).
There are thousands of filter="false"s in the code so an audit of each one to work out whether it is required would take too long.
What we are thinking of doing is making our own version of the bean:write tag, say secure:write, and doing a global find/replace of bean:write with secure:write in our JSPs. secure:write will strip scripts from the output when filter="false" is specified. After this change users would still be able to cause formatting HTML to be output where they shouldn't really be able to, but we aren't worried about that for the time being as long as the XSS vulnerabilities are fixed.
We would like to use a library to implement the script-stripping in the secure:write tag and we have been looking at https://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project and https://code.google.com/p/owasp-java-html-sanitizer/. Both look like they are capable of sanitising HTML, although AntiSamy looks like it is intended to be used to sanitise HTML on the way in to the application instead of on the way out, and since data is output more often than it is input we are concerned that running all of our secure:write output through it could be slow.
I have 2 main questions:
1) Will our proposed approach work to fix the XSS vulnerabilities caused by filter="false"?
2) Can anyone recommend a library to use for HTML sanitisation when displaying content, i.e. which is fast enough to not significantly affect the page-rendering performance? Has anyone used AntiSamy or owasp-java-html-sanitizer for something similar?
1) Will our proposed approach work to fix the XSS vulnerabilities caused by filter="false"?
This definitely sounds like an improvement that will reduce your attack surface, but it's not sufficient.
Once an attacker can no longer inject <script>doEvil()</script> they will then focus on injecting javascript:doEvil() where URLs are expected, so you will need to harden places where URLs are injected as well.
If you're using an XSS scanner, I would do what you describe, then rerun your scanners, making sure that it tests for injected javascript URLs.
Once URLs are locked down, you should audit any writes into style attributes or elements and event handler attributes.
2) Can anyone recommend a library to use for HTML sanitisation when displaying content, i.e. which is fast enough to not significantly affect the page-rendering performance? Has anyone used AntiSamy or owasp-java-html-sanitizer for something similar?
Shameless plug: https://code.google.com/p/owasp-java-html-sanitizer/
A fast and easy to configure HTML Sanitizer written in Java which lets you include HTML authored by third-parties in your web application while protecting against XSS.
Related
In my Java webapp, I create summary text of long HTML text. In the process of truncation, the HTML fragments in the string often break, producing HTML string with invalid & broken fragments. Like this example HTML string:
Visit this link <img src="htt
Is there any Java library to deal with this better so that such broken fragments as above are avoided ?
Or could I let this be included in the HTML pages & somehow deal with this using client side code ?
Since browsers will usually be able to deal with almost any garbage you feed into it (if it ain't XHTML...), if the only thing that actually happens with the input (assuming it's valid HTML of any kind) is being sliced, then the only thing you have to worry about is to actually get rid of invalid opening tags; you won't be able to distinguish broken 'endings' of tags, since they, in themselves, ain't special in any way. I'd just take a slice I've generated and parse it from the end; if I encounter a stray '<', I'd get rid of everything after it. Likewise, I'd keep track of the last opened tag - if the next close after it wasn't closing that exact tag, it's likely the closing tag got out, so I'd insert it.
This would still generate a lot of garbage, but would at least fix some rudimentary problems.
A better way would be to manage a stack of opened/closed tags and generate/remove the needed/broken/unnecessary ones as they emerge. A stack is a proper solution since HTML tags musn't 'cross' [by the spec, AFAIR it's this way from HTML 4], i.e. <span><div></span></div> isn't valid.
A much better way would be to splice the document after first parsing it as SGML/HTML/XML (depends on the exact HTML doctype) - then you could just remove the nodes, without damaging the structure.
Note that you can't actually know if a tag is correct without providing an exact algorithm you use to generate this 'garbled' content.
I used owasp-java-html-sanitizer to fix those broken fragments to generate safe HTML markup from Java.
PolicyFactory html_sanitize_policy = Sanitizers.LINKS.and(Sanitizers.IMAGES);
String safeHTML = html_sanitize_policy.sanitize(htmlString);
This seemed to be easiest of all solutions I came across.
We allow users to create rich content using TinyMCE and this includes Javascript and CSS.
However, when the content reaches server (Java), we want to filter out all XSS code or potentially malicious code, things like document.cookie, eval, etc, whether they are in CSS, inline JS, XSS Javascript crafted using string text (eg. document.write), etc. Everything else, eg. changing color on mouse over, set gradient on CSS, etc are fine.
We want to allow flexibility to our users but at the same time we want to ensure users are secured. We researched on libs like HTML Purifier, jSoup, but they do not seem smart enough to distinguish potentially malicious JS from safe one.
We are wondering if there is any way to do it?
Thank you.
Have you looked at google caja? It is a compiler for third party javascript so it can be safely embedded in another site:
https://developers.google.com/caja/
It sounds like what you are looking for.
You can use JSoup for this job. JSoup has a XSS Cleaner Parser which can work on a whilelist Object (list of permitted tags).The jsoup whitelist sanitizer works by parsing the input HTML, and then iterating through the parse tree and only allowing known-safe tags and attributes (and values) through into the cleaned output. It does not use regular expressions, which are inappropriate for this task.
jsoup provides a range of Whitelist configurations to suit most requirements; they can be modified if necessary. Read this link for more details [http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer].
Right now I use Jsoup to extract certain information (not all the text) from some third party webpages, I do it periodically. This works fine until the HTML of certain webpage changes, this change leads to a change in the existing Java code, this is a tedious task, because these webpage change very frequently. Also it requires a programmer to fix the Java code. Here is an example of HTML code of my interest on a webpage:
<div>
<p><strong>Score:</strong>2.5/5</p>
<p><strong>Director:</strong> Bryan Singer</p>
</div>
<div>some other info which I dont need</div>
Now here is what I want to do, I want to save this webpage (an HTML file) locally and create a template out of it, like:
<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
<div>some other info which I dont need</div>
Along with the actual URLs of the webpages these HTML templates will be the input to the Java program which will find out the location of these predefined keywords (e.g. {MOVIE_RATING}, {MOVIE_DIRECTOR}) and extract the values from the actual webpages.
This way I wouldn't have to modify the Java program every time a webpage changes, I will just save the webpage's HTML and replace the data with these keywords and rest will be taken care by the program. For example in future the actual HTML code may look like this:
<div>
<div><b>Rating:</b>**1/2</div>
<div><i>Director:</i>Singer, Bryan</div>
</div>
and the corresponding template will look like this:
<div>
<div><b>Rating:</b>{MOVIE_RATING}</div>
<div><i>Director:</i>{MOVIE_DIRECTOR}</div>
</div>
Also creating these kind of templates can be done by a non-programmer, anyone who can edit a file.
Now the question is, how can I achieve this in Java and is there any existing and better approach to this problem?
Note: While googling I found some research papers, but most of them require some prior learning data and accuracy is also a matter of concern.
The approach you gave is pretty much similar to the Gilbert's except
the regex part. I don't want to step into the ugly regex world, I am
planning to use template approach for many other areas apart from
movie info e.g. prices, product specs extraction etc.
The template you describe is not actually a "template" in the normal sense of the word: a set static content that is dumped to the output with a bunch of dynamic content inserted within it. Instead, it is the "reverse" of a template - it is a parsing pattern that is slurped up & discarded, leaving the desired parameters to be found.
Because your web pages change regularly, you don't want to hard-code the content to be parsed too precisely, but want to "zoom in" on its' essential features, making the minimum of assumptions. i.e. you want to commit to literally matching key text such as "Rating:" and treat interleaving markup such as"<b/>" in a much more flexible manner - ignoring it and allowing it to change without breaking.
When you combine (1) and (2), you can give the result any name you like, but IT IS parsing using regular expressions. i.e. the template approach IS the parsing approach using a regular expression - they are one and the same. The question is: what form should the regular expression take?
3A. If you use java hand-coding to do the parsing then the obvious answer is that the regular expression format should just be the java.util.regex format. Anything else is a development burden and is "non-standard" and will be hard to maintain.
3B. If you use want to use an html-aware parser, then jsoup is a good solution. Problem is you need more text/regular expression handling and flexibility than jsoup seems to provide. It seems too locked into specific html tags and structures and so breaks when pages change.
3C. You can use a much more powerful grammar-controlled general text parser such as ANTLR - a form of backus-naur inspired grammar is used to control the parsing and generator code is inserted to process parsed data. Here, the parsing grammar expressions can be very powerful indeed with complex rules for how text is ordered on the page and how text fields and values relate to each other. The power is beyond your requirements because you are not processing a language. And there's no escaping the fact that you still need to describe the ugly bits to skip - such as markup tags etc. And wrestling with ANTLR for the first time involves educational investment before you get productivity payback.
3D. Is there a java tool that just uses a simple template type approach to give a simple answer? Well a google search doesn't give too much hope https://www.google.com/search?q=java+template+based+parser&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a. I believe that any attempt to create such a beast will degenerate into either basic regex parsing or more advanced grammar-controlled parsing because the basic requirements for matching/ignoring/replacing text drive the solution in those directions. Anything else would be too simple to actually work. Sorry for the negative view - it just reflects the problem space.
My vote is for (3A) as the simplest, most powerful and flexible solution to your needs.
Not really a template-based approach here, but jsoup can still be a workable solution if you just externalize your Selector queries to a configuration file.
Your non-programmer doesn't even have to see HTML, just update the selectors in the configuration file. Something like SelectorGadget will make it easier to pick out what selector to actually use.
How can I achieve this in Java and is there any existing and better approach to this problem?
The template approach is a good approach. You gave all of the reasons why in your question.
Your templates would consist of just the HTML you want to process, and nothing else. Here's my example based on your example.
<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
Basically, you would use Jsoup to process your templates. Then, as you use Jsoup to process the web pages, you check all of your processed templates to see if there's a match.
On a template match, you find the keywords in the processed template, then you find the corresponding values in the processed web page.
Yes, this would be a lot of coding, and more difficult than my description indicates. Your Java programmer will have to break this description down into simpler and simpler tasks until she or he can code the tasks.
If the web page changes frequently, then you'll probably want to confine your search for the fields like MOVIE_RATING to the smallest possible part of the page, and ignore everything else. There are two possibilities: you could either use a regular expression for each field, or you could use some kind of CSS selector. I think either would work and either "template" can consist of a simple list of search expressions, regex or css, that you would apply. Just roll through the list and extract what you can, and fail if some particular field isn't found because the page changed.
For example, the regex could look like this:
"Score:"(.)*[0-9]\.[0-9]\/[0-9]
(I haven't tested this.)
Or you can try different approach, using what i would call 'rules' instead of templates: for each piece of information that you need from the page, you can define jQuery expression(s) that extracts the text. Often when page change is small, the same well written jQuery expressions would still give the same results.
Then you can use Jerry (jQuery in Java), with the almost the same expressions to fetch the text you are looking for. So its not only about selectors, but you also have other jQuery methods for walking/filtering the DOM tree.
For example, rule for some Director text would be (in sort of sudo-java-jerry-code):
$.find("div#movie").find("div:nth-child(2)")....text();
There could be more (and more complex) expressions in the rule, spread across several lines, that for example iterate some nodes etc.
If you are OO person, each rule may be defined in its own implementation. If you are groovy person, you can even rewrite rules when needed, without recompiling your project, and still being in java. Etc.
As you see, the core idea here is to define rules how to find your text; and not to match to patterns as that may be fragile to minor changes - imagine if just a space has been added between two divs:). In this example of mine, I've used jQuery-alike syntax (actually, it's Jerry-alike syntax, since we are in Java) to define rules. This is only because jQuery is popular and simple, and known by your web developer too; at the end you can define your own syntax (depending on parsing tool you are using): for example, you may parse HTML into DOM tree and then write rules using your helper methods how to traverse it to the place of interest. Jerry also gives you access to underlaying DOM tree, too.
Hope this helps.
I used the following approach to do something similar in a personal project of mine that generates a RSS feed out of here the leading real estate website in spain.
Using this tool I found the rented place I'm currently living in ;-)
Get the HTML code from the page
Transform the HTML into XHTML. I used this this library I guess there might be today better options available
Use XPath to navigate the XHTML to the information you're interesting in
Of course every time they change the original page you will have to change the XPath expression. The other approach I can think of -semantic analysis of the original HTML source- is far, far beyond my humble skills ;-)
I need to tidy user input in a web application so that I remove certain HTML-tags and encode < to > etc.
I've made a couple of simple util methods that strips the HTML, but I find myself adding these EVERYWHERE in my application.
Is there a smarter way to tidy the user input? E.g. in the binding process, or as a filter somehow?
I've seen JTidy that can act as a servlet filter, but I'm not sure that this is what I want because I need to clean user input, not output of my JSP's.
From JTidy's homepage:
It can be used as a tool for cleaning up malformed and faulty HTML generated by your dynamic web application.
It can Validate HTML without changing the output and generate warnings for each page so you could identify JSP or Servlet that need to be fixed.
It can save you hours of time. The more HTML you write in JSP or Servlets, the more time you will save. Don't waste time manually looking for problems, figuring out why your HTML doesn't display like it should.
In addition to JTidy validation you could submit dynamically generated pages to online HTML validators for example W3C Markup Validation Service, WAVE Accessibility Tool or WDG HTML Validator even if you are behind the firewall.
I find myself adding these EVERYWHERE in my application.
Really? It's unusual to have many user inputs that accept HTML. Most inputs should be plain text, so that when the user types < they literally get a less-than sign, not a (potentially-tidied/filtered-out) tag. This requires HTML-encoding at the output stage. Typically you'd get that from the <c:out> tag.
(Old-school JSP before JSTL, lamentably, provided no HTML-encoder, so if for some reason that's what you're working with you would have to provide your own HTML-encoding method built out of string replacments, or use one of the many third-party tools that contain one.)
For the usually-few-if-any ‘rich text’ fields that are deliberately meant to accept user-supplied HTML, you should be filtering them strongly to prevent JavaScript injection from the markup. This is a difficult job! A “couple of simple util methods that strips the HTML” are highly unlikely to do it correctly and securely.
The proper way to do this is to parse the input HTML into a DOM; walk over it checking that only known-safe element and attribute names are used; then serialise it back to well-formed [X]HTML. There are a number of tools that can do this and yes, jTidy is one. You would use the method Tidy.parseDOM on the input field value, remove unwanted items from the resulting DOM with removeChild and removeAttribute, then reserialise using pprint.
A good alternative to HTML-based rich text is to give the user a simpler form of textual markup that you can then convert to known-safe HTML tags. Like this SO text box I'm typing into now.
There's Interceptor interface in Spring MVC which may be used to do some common stuff on every request. Regardless of tool you are using for tidying, you may use it for getting what you need at one point. See this manual to manage using ut. Just put the tidying routine into preHandle method and walk through data in HttpServletRequest to update it.
I need to parse HTML 4 in Java.
Ideally I'd like an implementation that is SAX compatible.
I'm aware that there are numerous HTML parsers in for Java, however, they all seem to perform 'tidying'. In other words, they will correct badly formed HTML. I don't want this.
My requirements are:
No tidying.
If the input document is invalid HTML parsing should fail.
The document should be validatable against the HTML DTDs.
The parser can produce SAX2 events.
Is there a library that meets these requirements?
You can find a collection of HTML parsers here HTML Parsers. I don't remeber exactly but I think TagSoup parses the file without applying corrections...
I think the Jericho HTML Parser can deliver at least one of your core requirements ('If the input document is invalid HTML parsing should fail.') in that it will at least tell you if there are mismatched tags or other poisonous HTML flaws, and you can choose to fail based on this information.
Try typing invalid html into this Jericho formatting demo, and note the 'Parser Log' at the bottom of the page:
http://jerichohtmlparser.appspot.com/samples/FormatSource.jsp
So yes, this is doing tag tidying, but it is at least telling you about it - you can grab this information by setting a net.htmlparser.jericho.Logger (e.g. a WriterLogger or something more specific of your own creation) on your source, and then proceeding depending on what errors are logged out. This is a small example:
Source source=new Source("<a>I forgot to close my link!");
source.setLogger(myListeningLogger);
source.getSourceFormatter().writeTo(new NullWriter());
// myListeningLogger has now had all the HTML flaws written to it
In the example above, your logger's info() method is called with the string: 'StartTag at (r1,c1,p0) missing required end tag', which is relatively parseable, and you can always decide to just reject any HTML that logs any message worse than debug - in fact Jericho logs almost all errors as 'info' level, with a couple at 'warn' level (you might be tempted to create a small fork with the severities adjusted to correspond to what you care about).
Jericho is available on Maven Central, which is always a good sign:
http://mvnrepository.com/artifact/net.htmlparser.jericho/jericho-html
Good luck!
You may wish to check http://lobobrowser.org/cobra.jsp. They have a pure Java web browser (Lobo) implemented. They have the parser component (Cobra) pulled out separately for use. I honestly am not sure if it will do what you require with the "no tidying" requirement, but it may be worth a look. I ran across it when exploring the wild for a pure Java web browser.
You can try to subclass javax.swing.text.html.parser.Parser and implement the handleXXX() methods. It seems it doesn't try to fix the XML. See more at the API