We allow users to create rich content using TinyMCE, and this includes JavaScript and CSS.
However, when the content reaches the server (Java), we want to filter out all XSS or otherwise potentially malicious code, things like document.cookie, eval, etc., whether it appears in CSS, in inline JS, or in JavaScript crafted from string text (e.g. document.write). Everything else, e.g. changing a color on mouse over or setting a gradient in CSS, is fine.
We want to give our users flexibility, but at the same time we want to keep them secure. We have researched libraries like HTML Purifier and jsoup, but they do not seem smart enough to distinguish potentially malicious JS from safe JS.
We are wondering if there is any way to do this.
Thank you.
Have you looked at Google Caja? It is a compiler for third-party JavaScript so it can be safely embedded in another site:
https://developers.google.com/caja/
It sounds like what you are looking for.
You can use jsoup for this job. jsoup has an XSS cleaner that works from a whitelist object (a list of permitted tags). The jsoup whitelist sanitizer works by parsing the input HTML, then iterating through the parse tree and only allowing known-safe tags and attributes (and values) through into the cleaned output. It does not use regular expressions, which are inappropriate for this task.
jsoup provides a range of Whitelist configurations to suit most requirements; they can be modified if necessary. Read this link for more details: http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer
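For example, a minimal sketch of the whitelist cleaner (newer jsoup releases rename Whitelist to Safelist; the sample input below is just an illustration):

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
// basic() permits simple text formatting and links; script tags, event-handler
// attributes and javascript: URLs are dropped from the output.
String safe = Jsoup.clean(unsafe, Whitelist.basic());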
Related
I am working on a Java web application that is many years old.
Most of the <bean:write>s in the JSPs have filter="false" even when it isn't needed, probably because of developers blindly copying existing code. <bean:write> is the Struts tag to output a JSP variable, and when filter="false" is specified it does not do HTML escaping (so filter="false" is similar to the <c:out> attribute escapeXml="false"). This means that the application is vulnerable to XSS attacks, because some of these <bean:write filter="false">s are outputting user input.
A blanket removal of filter="false" isn't an option because in some cases the application allows the user to enter HTML using a TinyMCE text area, so we do need to output raw HTML in some cases to retain the user-entered formatting (although we should still be sanitising user-entered HTML to remove scripts).
There are thousands of filter="false"s in the code so an audit of each one to work out whether it is required would take too long.
What we are thinking of doing is making our own version of the bean:write tag, say secure:write, and doing a global find/replace of bean:write with secure:write in our JSPs. secure:write will strip scripts from the output when filter="false" is specified. After this change users would still be able to cause formatting HTML to be output where they shouldn't really be able to, but we aren't worried about that for the time being as long as the XSS vulnerabilities are fixed.
We would like to use a library to implement the script-stripping in the secure:write tag and we have been looking at https://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project and https://code.google.com/p/owasp-java-html-sanitizer/. Both look like they are capable of sanitising HTML, although AntiSamy looks like it is intended to be used to sanitise HTML on the way in to the application instead of on the way out, and since data is output more often than it is input we are concerned that running all of our secure:write output through it could be slow.
I have 2 main questions:
1) Will our proposed approach work to fix the XSS vulnerabilities caused by filter="false"?
2) Can anyone recommend a library to use for HTML sanitisation when displaying content, i.e. which is fast enough to not significantly affect the page-rendering performance? Has anyone used AntiSamy or owasp-java-html-sanitizer for something similar?
1) Will our proposed approach work to fix the XSS vulnerabilities caused by filter="false"?
This definitely sounds like an improvement that will reduce your attack surface, but it's not sufficient.
Once an attacker can no longer inject <script>doEvil()</script> they will then focus on injecting javascript:doEvil() where URLs are expected, so you will need to harden places where URLs are injected as well.
If you're using an XSS scanner, I would do what you describe, then rerun the scanner, making sure that it tests for injected javascript: URLs.
Once URLs are locked down, you should audit any writes into style attributes or elements and event handler attributes.
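As a rough illustration of that URL hardening (this helper is hypothetical, not part of any library mentioned in this thread):

import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical check: only echo a user-supplied URL into href/src if its scheme
// is absent (a relative URL), http, or https; anything else (javascript:, data:,
// vbscript:) is rejected.
static boolean isSafeUrl(String url) {
    try {
        String scheme = new URI(url.trim()).getScheme();
        return scheme == null
                || scheme.equalsIgnoreCase("http")
                || scheme.equalsIgnoreCase("https");
    } catch (URISyntaxException e) {
        return false;
    }
}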
2) Can anyone recommend a library to use for HTML sanitisation when displaying content, i.e. which is fast enough to not significantly affect the page-rendering performance? Has anyone used AntiSamy or owasp-java-html-sanitizer for something similar?
Shameless plug: https://code.google.com/p/owasp-java-html-sanitizer/
A fast and easy to configure HTML Sanitizer written in Java which lets you include HTML authored by third-parties in your web application while protecting against XSS.
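A minimal usage sketch (the particular policy combination below is an assumption, and untrustedHtml stands for whatever markup you need to clean; tune the policy to the tags your rich-text editor produces):

import org.owasp.html.PolicyFactory;
import org.owasp.html.Sanitizers;

// Combine the prepackaged policies: inline formatting, block elements,
// links and safe inline styles; everything else, including scripts and
// event-handler attributes, is stripped.
PolicyFactory policy = Sanitizers.FORMATTING
        .and(Sanitizers.BLOCKS)
        .and(Sanitizers.LINKS)
        .and(Sanitizers.STYLES);
String safeHtml = policy.sanitize(untrustedHtml);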
Is this possible? I am using JavaScript, AJAX and JSON to pull data from a Java servlet, and I'd like to display elements of the JavaScript array I create within my HTML (as opposed to creating large chunks of HTML within JavaScript). I know I can muck up my HTML using:
<script>document.write(arrayVar[0].firstName);</script>
But I'd really like to avoid that. In the past I would use JSTL and EL tags when pulling data from the server, but is there a similar way to do this purely with JavaScript? I wouldn't be opposed to using an external library; I just don't know of any because I don't have a ton of experience with JS.
You could use any JavaScript templating library, like http://mustache.github.com/, the one that comes with underscore.js, or jQuery templates (although I understand these are deprecated now). JavaScript templates are probably as magical as you will get for this.
I do try to avoid document.write wherever possible as it can mess up pages when this call is returned in HTML through AJAX.
Unfortunately, to answer your question, probably not - at least not in a way better than what you're currently doing. document.write can be avoided by instead using DOM methods, such as:
document.getElementById('test').innerHTML = arrayVar[0].firstName;
With pure JavaScript, I don't see any other way of doing this that would be similar to EL syntax.
Saxon-CE (currently an Alpha release) is an XSLT 2 processor with extensions for browser interaction - it's implemented in JavaScript and deployed in the same way as a JS library.
With this, you could write a template to iterate the JavaScript array (by calling a JS function) and insert each value into an HTML literal-result element within the template. The resulting HTML can then be appended to, or replace, any part of the HTML DOM.
The template can either run on page load, or be linked to any DOM event.
Does the document tree returned by JSoup when it parses an HTML document support getComputedStyle on the individual document elements?
What I would like to do is inline the CSS in an HTML fragment so that I can insert the fragment into a larger HTML document, with all of its formatting preserved but without messing with any other formatting in the document.
The research I've done would seem to suggest that I can accomplish this by iterating through all of the elements in the document, calling getComputedStyle on each one, and assigning the result to be the style for the element.
Yes, I realize that this may very well bloat the resulting HTML by putting a bunch of redundant / unnecessary style information on the individual elements, but I'm willing to pay the price of larger HTML, and as far as I can tell, embedding the style inline like this is the only way to preserve the formatting exactly while also making the HTML fragments fully portable. (If you've got another suggestion for accomplishing that purpose, I'm all ears. :-)
Getting back on topic... If I can't use getComputedStyle (or the equivalent) with JSoup, is there another Java HTML+CSS parser that supports getComputedStyle or the equivalent?
Thanks.
That's not possible. Jsoup is just an HTML parser with CSS selector support; it is not an HTML renderer.
You may want to take a look at Lobobrowser, which is a Java-based HTML renderer supporting JavaScript and the like. I do not know, nor can I guarantee, that getComputedStyle() is supported by Lobo.
No other tools come to mind. HtmlUnit comes close, as it can also access/invoke JavaScript, but some Google results suggest that getComputedStyle() doesn't work on HtmlUnit either; after all, it is not a real HTML renderer.
Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with short descriptions of the full texts (this is common in news portals), and I don't want to discard these short texts.
So if an API can do this, i.e. get the different textual parts/blocks, splitting them in some manner rather than returning everything as a single text (all in one text is not useful), please report it.
The Question
I download some pages from random sites, and now I want to analyze the textual content of the page.
The problem is that a web page has a lot of content like menus, advertising, banners, etc.
I want to try to exclude everything that is not related to the content of the page.
Taking this page as an example, I don't want the menus at the top nor the links in the footer.
Important: all pages are HTML and come from various different sites. I need suggestions on how to exclude this content.
At the moment, I am thinking of excluding content inside "menu" and "banner" classes in the HTML, plus consecutive words that look like a proper name (first letter capitalized).
The solution can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).
Edit: I want to do this inside my Java code, not with an external application (if that is possible).
I tried a way of parsing the HTML content described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract the text.
You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
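For instance, a minimal self-contained sketch of the String variant (the HTML literal and the exception handling are illustrative additions):

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

String myHtml = "<html><body><div id='menu'>Home | About</div>"
        + "<p>The actual article text goes here...</p></body></html>";
try {
    // Returns only the text that boilerpipe classifies as main content.
    String mainText = ArticleExtractor.INSTANCE.getText(myHtml);
    System.out.println(mainText);
} catch (BoilerpipeProcessingException e) {
    e.printStackTrace();
}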
There are also options to use a Reader, which opens up a large number of options.
You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).
Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:
Reader reader = ...
InputSource is = new InputSource(reader);
// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);
// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
// block.isContent() tells you if it's likely to be content or not
// block.getText() gives you the block's text
}
TextBlock has some more exciting methods, feel free to play around!
There appears to be a possible problem with Boilerpipe. Why?
Well, it appears that it is suited to certain kinds of web pages, such as web pages that have a single body of content.
So one can crudely classify web pages into three kinds in respect to Boilerpipe:
a web page with a single article in it (Boilerpipe worthy!)
a web page with multiple articles in it, such as the front page of the New York Times
a web page that really doesn't have an article in it, but has some content in the form of links, and may also have some degree of clutter.
Boilerpipe works on case #1. But if one is doing a lot of automated text processing, then how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied for case #1. Case #2 is a problem, and case #3 is a problem as well: it might require an aggregate of related web pages to determine what is clutter and what isn't.
You can use a library like goose. It works best on articles/news.
You can also check the JavaScript code that does similar extraction to goose in the readability bookmarklet.
My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (e.g. Elements posts = doc.select("p");) and not have to worry about the other elements with random content.
On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the MIN_WORDS_SEQUENCE number or be more selective with your selectors (i.e. not retrieve div elements)?
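For reference, a small sketch of that selector approach (the URL and the "p" selector are placeholders):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch and parse the page, then pull out only the elements you care about,
// skipping menus, banners and other clutter by simply not selecting them.
Document doc = Jsoup.connect("http://example.com/").get();   // throws IOException
Elements posts = doc.select("p");
for (Element post : posts) {
    System.out.println(post.text());
}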
http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php
Proprietary software, but it makes it very easy to extract content from web pages and it integrates well with Java.
You use a provided application to design XML files that are read by the RoboServer API to parse web pages. The XML files are built by analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and DB integration using the provided Java API.
If you're against using commercial software and want to do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate the tags and then build rules per site.
You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:
Tag Soup
HTML Unit
You can filter out the HTML junk and then parse the required details, or use the APIs of the existing site.
Refer to the link below for filtering the HTML; I hope it helps.
http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/
You could use the textracto API; it extracts the main 'article' text, and there is also the option to extract all other textual content. By 'subtracting' these texts you could separate the navigation texts, preview texts, etc. from the main textual content.
I need to tidy user input in a web application so that I remove certain HTML tags and encode < to &lt;, etc.
I've made a couple of simple util methods that strip the HTML, but I find myself adding these EVERYWHERE in my application.
Is there a smarter way to tidy the user input? E.g. in the binding process, or as a filter somehow?
I've seen JTidy that can act as a servlet filter, but I'm not sure that this is what I want because I need to clean user input, not output of my JSP's.
From JTidy's homepage:
It can be used as a tool for cleaning up malformed and faulty HTML generated by your dynamic web application.
It can Validate HTML without changing the output and generate warnings for each page so you could identify JSP or Servlet that need to be fixed.
It can save you hours of time. The more HTML you write in JSP or Servlets, the more time you will save. Don't waste time manually looking for problems, figuring out why your HTML doesn't display like it should.
In addition to JTidy validation you could submit dynamically generated pages to online HTML validators for example W3C Markup Validation Service, WAVE Accessibility Tool or WDG HTML Validator even if you are behind the firewall.
I find myself adding these EVERYWHERE in my application.
Really? It's unusual to have many user inputs that accept HTML. Most inputs should be plain text, so that when the user types < they literally get a less-than sign, not a (potentially-tidied/filtered-out) tag. This requires HTML-encoding at the output stage. Typically you'd get that from the <c:out> tag.
(Old-school JSP before JSTL, lamentably, provided no HTML-encoder, so if for some reason that's what you're working with, you would have to provide your own HTML-encoding method built out of string replacements, or use one of the many third-party tools that contain one.)
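Such a method is only a few lines. A sketch (the ampersand must be replaced first, otherwise the entities produced by the other replacements would themselves get escaped):

// Minimal HTML-encoder for text destined for element content or
// double-quoted attribute values.
static String htmlEncode(String s) {
    return s.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\"", "&quot;");
}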
For the usually-few-if-any ‘rich text’ fields that are deliberately meant to accept user-supplied HTML, you should be filtering them strongly to prevent JavaScript injection from the markup. This is a difficult job! A “couple of simple util methods that strips the HTML” are highly unlikely to do it correctly and securely.
The proper way to do this is to parse the input HTML into a DOM; walk over it checking that only known-safe element and attribute names are used; then serialise it back to well-formed [X]HTML. There are a number of tools that can do this and yes, jTidy is one. You would use the method Tidy.parseDOM on the input field value, remove unwanted items from the resulting DOM with removeChild and removeAttribute, then reserialise using pprint.
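A rough sketch of that parse / walk / re-serialise flow with jTidy (the allowed tag and attribute sets below are placeholders, and a real policy must also vet attribute values, e.g. reject javascript: in href):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class RichTextCleaner {
    // html/head/body are kept because Tidy wraps fragments in a full document.
    private static final Set<String> ALLOWED_TAGS = new HashSet<String>(Arrays.asList(
            "html", "head", "body", "p", "br", "b", "i", "em", "strong",
            "ul", "ol", "li", "a"));
    private static final Set<String> ALLOWED_ATTRS = new HashSet<String>(Arrays.asList(
            "href", "title"));

    public static String clean(String input) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        Document doc = tidy.parseDOM(new ByteArrayInputStream(input.getBytes()), null);
        walk(doc.getDocumentElement());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.pprint(doc, out);               // serialise back to well-formed markup
        return out.toString();
    }

    private static void walk(Element el) {
        NodeList children = el.getChildNodes();
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            Element e = (Element) child;
            if (!ALLOWED_TAGS.contains(e.getTagName().toLowerCase())) {
                el.removeChild(e);           // drops the element and everything inside it
            } else {
                NamedNodeMap attrs = e.getAttributes();
                for (int j = attrs.getLength() - 1; j >= 0; j--) {
                    String name = attrs.item(j).getNodeName();
                    if (!ALLOWED_ATTRS.contains(name.toLowerCase())) {
                        e.removeAttribute(name);
                    }
                }
                walk(e);
            }
        }
    }
}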
A good alternative to HTML-based rich text is to give the user a simpler form of textual markup that you can then convert to known-safe HTML tags. Like this SO text box I'm typing into now.
There's the HandlerInterceptor interface in Spring MVC, which can be used to do common work on every request. Regardless of which tool you use for tidying, you can use it to apply what you need in one place. See the manual for how to use it. Just put the tidying routine into the preHandle method and walk through the data in HttpServletRequest to update it.
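A bare-bones sketch of such an interceptor (the tidy() method below is a stand-in for whatever cleaning routine you use, and because servlet request parameters are read-only, this version publishes the cleaned values as request attributes instead of modifying the parameters in place):

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.springframework.web.servlet.handler.HandlerInterceptorAdapter;

public class TidyInputInterceptor extends HandlerInterceptorAdapter {

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response,
                             Object handler) {
        // Expose a cleaned copy of every parameter; a servlet Filter with a
        // request wrapper is the usual alternative if handlers must see the
        // cleaned values transparently.
        for (Object name : request.getParameterMap().keySet()) {
            String key = (String) name;
            request.setAttribute("clean_" + key, tidy(request.getParameter(key)));
        }
        return true;   // let the normal handler run
    }

    private String tidy(String value) {
        // placeholder for your HTML-stripping / encoding routine
        return value;
    }
}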