I need to tidy user input in a web application so that I remove certain HTML tags and encode < to &lt; etc.
I've made a couple of simple util methods that strip the HTML, but I find myself adding these EVERYWHERE in my application.
Is there a smarter way to tidy the user input? E.g. in the binding process, or as a filter somehow?
I've seen that JTidy can act as a servlet filter, but I'm not sure this is what I want, because I need to clean user input, not the output of my JSPs.
From JTidy's homepage:
It can be used as a tool for cleaning up malformed and faulty HTML generated by your dynamic web application.
It can Validate HTML without changing the output and generate warnings for each page so you could identify JSP or Servlet that need to be fixed.
It can save you hours of time. The more HTML you write in JSP or Servlets, the more time you will save. Don't waste time manually looking for problems, figuring out why your HTML doesn't display like it should.
In addition to JTidy validation you could submit dynamically generated pages to online HTML validators for example W3C Markup Validation Service, WAVE Accessibility Tool or WDG HTML Validator even if you are behind the firewall.
I find myself adding these EVERYWHERE in my application.
Really? It's unusual to have many user inputs that accept HTML. Most inputs should be plain text, so that when the user types < they literally get a less-than sign, not a (potentially-tidied/filtered-out) tag. This requires HTML-encoding at the output stage. Typically you'd get that from the <c:out> tag.
(Old-school JSP before JSTL, lamentably, provided no HTML-encoder, so if for some reason that's what you're working with you would have to provide your own HTML-encoding method built out of string replacements, or use one of the many third-party tools that contain one.)
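If you do end up rolling your own, a minimal string-replacement encoder is only a few lines; this is just a sketch, and utilities such as Apache Commons Lang's StringEscapeUtils already provide the same thing:

// Bare-bones HTML encoding via string replacement, covering the critical characters.
public static String encodeHtml(String s) {
    if (s == null) {
        return "";
    }
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        switch (c) {
            case '&':  sb.append("&amp;");  break;
            case '<':  sb.append("&lt;");   break;
            case '>':  sb.append("&gt;");   break;
            case '"':  sb.append("&quot;"); break;
            case '\'': sb.append("&#39;");  break;
            default:   sb.append(c);
        }
    }
    return sb.toString();
}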
For the usually-few-if-any ‘rich text’ fields that are deliberately meant to accept user-supplied HTML, you should be filtering them strongly to prevent JavaScript injection from the markup. This is a difficult job! A “couple of simple util methods that strip the HTML” are highly unlikely to do it correctly and securely.
The proper way to do this is to parse the input HTML into a DOM; walk over it checking that only known-safe element and attribute names are used; then serialise it back to well-formed [X]HTML. There are a number of tools that can do this and yes, jTidy is one. You would use the method Tidy.parseDOM on the input field value, remove unwanted items from the resulting DOM with removeChild and removeAttribute, then reserialise using pprint.
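A rough sketch of that shape using JTidy follows; the class name and whitelists are purely illustrative, and you should verify that the JTidy version you use supports these standard DOM mutation calls:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class RichTextCleaner {

    // Illustrative whitelists only; decide for yourself what is safe in your application.
    private static final Set<String> SAFE_ELEMENTS = new HashSet<String>(
            Arrays.asList("p", "br", "b", "i", "em", "strong", "ul", "ol", "li", "a"));
    // Note: even a whitelisted href still needs its value checked for javascript: URLs.
    private static final Set<String> SAFE_ATTRIBUTES = new HashSet<String>(
            Arrays.asList("href", "title"));

    public static String clean(String input) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        tidy.setPrintBodyOnly(true); // emit only the body content, not the html/head wrapper
        Document doc = tidy.parseDOM(
                new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)), null);
        Node body = doc.getElementsByTagName("body").item(0);
        if (body != null) {
            scrub(body);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.pprint(doc, out);
        return out.toString();
    }

    private static void scrub(Node parent) {
        NodeList children = parent.getChildNodes();
        // walk backwards so removals don't upset the iteration
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // text nodes carry no markup
            }
            Element el = (Element) child;
            if (!SAFE_ELEMENTS.contains(el.getTagName().toLowerCase())) {
                parent.removeChild(el); // drop the element and everything inside it
                continue;
            }
            NamedNodeMap attrs = el.getAttributes();
            for (int j = attrs.getLength() - 1; j >= 0; j--) {
                String name = attrs.item(j).getNodeName();
                if (!SAFE_ATTRIBUTES.contains(name.toLowerCase())) {
                    el.removeAttribute(name);
                }
            }
            scrub(el);
        }
    }
}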
A good alternative to HTML-based rich text is to give the user a simpler form of textual markup that you can then convert to known-safe HTML tags. Like this SO text box I'm typing into now.
Spring MVC has a HandlerInterceptor interface which can be used to run common logic on every request. Regardless of which tool you use for tidying, an interceptor lets you hook it in at a single point. See the Spring reference manual for how to register one. Just put the tidying routine into the preHandle method and walk through the data in the HttpServletRequest to update it.
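A minimal sketch of that idea, assuming Spring's HandlerInterceptorAdapter; since request parameters can't be rewritten in place, this version exposes sanitized copies as request attributes, and sanitize() is a placeholder for whatever tidying routine you use:

import java.util.Enumeration;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.springframework.web.servlet.handler.HandlerInterceptorAdapter;

public class TidyInputInterceptor extends HandlerInterceptorAdapter {
    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response,
                             Object handler) {
        Enumeration<String> names = request.getParameterNames();
        while (names.hasMoreElements()) {
            String name = names.nextElement();
            // expose a sanitized copy; the raw parameter itself is read-only
            request.setAttribute(name, sanitize(request.getParameter(name)));
        }
        return true; // continue with the normal handler chain
    }

    private String sanitize(String value) {
        return value; // placeholder: plug in your HTML-stripping/encoding routine here
    }
}

If you need the sanitized values to show up as actual request parameters, the usual route is a servlet Filter that wraps the request in an HttpServletRequestWrapper overriding getParameter().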
Related
We allow users to create rich content using TinyMCE and this includes Javascript and CSS.
However, when the content reaches the server (Java), we want to filter out all XSS or potentially malicious code: things like document.cookie, eval, etc., whether they appear in CSS, inline JS, or XSS JavaScript crafted from string text (e.g. document.write). Everything else, e.g. changing color on mouse over, setting a gradient in CSS, etc., is fine.
We want to allow our users flexibility, but at the same time we want to ensure they are secure. We researched libraries like HTML Purifier and jsoup, but they do not seem smart enough to distinguish potentially malicious JS from safe JS.
We are wondering if there is any way to do it?
Thank you.
Have you looked at Google Caja? It is a compiler for third-party JavaScript so that it can be safely embedded in another site:
https://developers.google.com/caja/
It sounds like what you are looking for.
You can use jsoup for this job. jsoup has an XSS cleaner which works from a whitelist object (a list of permitted tags). The jsoup whitelist sanitizer works by parsing the input HTML, and then iterating through the parse tree and only allowing known-safe tags and attributes (and values) through into the cleaned output. It does not use regular expressions, which are inappropriate for this task.
jsoup provides a range of Whitelist configurations to suit most requirements; they can be modified if necessary. Read this link for more details [http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer].
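A minimal sketch of the cleaning step; Whitelist.basic() is just one of the bundled configurations, so pick or extend whichever matches the tags you want to allow:

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

String unsafeHtml = "<p>Hello <script>document.cookie</script><b>world</b></p>";
// keeps only whitelisted tags and attributes, producing something like "<p>Hello <b>world</b></p>"
String safeHtml = Jsoup.clean(unsafeHtml, Whitelist.basic());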
I am working on a Java web application that is many years old.
Most of the <bean:write>s in the JSPs have filter="false" even when it isn't needed, probably because of developers blindly copying existing code. <bean:write> is the Struts tag to output a JSP variable, and when filter="false" is specified it does not do HTML escaping (so filter="false" is similar to the <c:out> attribute escapeXml="false"). This means that the application is vulnerable to XSS attacks, because some of these <bean:write filter="false">s are outputting user input.
A blanket removal of filter="false" isn't an option because in some cases the application allows the user to enter HTML using a TinyMCE text area, so we do need to output raw HTML in some cases to retain the user-entered formatting (although we should still be sanitising user-entered HTML to remove scripts).
There are thousands of filter="false"s in the code so an audit of each one to work out whether it is required would take too long.
What we are thinking of doing is making our own version of the bean:write tag, say secure:write, and doing a global find/replace of bean:write with secure:write in our JSPs. secure:write will strip scripts from the output when filter="false" is specified. After this change users would still be able to cause formatting HTML to be output where they shouldn't really be able to, but we aren't worried about that for the time being as long as the XSS vulnerabilities are fixed.
We would like to use a library to implement the script-stripping in the secure:write tag and we have been looking at https://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project and https://code.google.com/p/owasp-java-html-sanitizer/. Both look like they are capable of sanitising HTML, although AntiSamy looks like it is intended to be used to sanitise HTML on the way in to the application instead of on the way out, and since data is output more often than it is input we are concerned that running all of our secure:write output through it could be slow.
I have 2 main questions:
1) Will our proposed approach work to fix the XSS vulnerabilities caused by filter="false"?
2) Can anyone recommend a library to use for HTML sanitisation when displaying content, i.e. which is fast enough to not significantly affect the page-rendering performance? Has anyone used AntiSamy or owasp-java-html-sanitizer for something similar?
1) Will our proposed approach work to fix the XSS vulnerabilities caused by filter="false"?
This definitely sounds like an improvement that will reduce your attack surface, but it's not sufficient.
Once an attacker can no longer inject <script>doEvil()</script> they will then focus on injecting javascript:doEvil() where URLs are expected, so you will need to harden places where URLs are injected as well.
If you're using an XSS scanner, I would do what you describe, then rerun your scanners, making sure that it tests for injected javascript URLs.
Once URLs are locked down, you should audit any writes into style attributes or elements and event handler attributes.
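As a rough illustration of the URL-hardening step (the allowed-scheme list here is an assumption; adjust it to what your application needs):

import java.net.URI;
import java.net.URISyntaxException;

// Allow only http(s) and scheme-less (relative) URLs; reject everything else,
// including "javascript:".
public static boolean isSafeUrl(String url) {
    try {
        String scheme = new URI(url.trim()).getScheme();
        return scheme == null
                || "http".equalsIgnoreCase(scheme)
                || "https".equalsIgnoreCase(scheme);
    } catch (URISyntaxException e) {
        return false; // unparseable URLs are rejected outright
    }
}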
2) Can anyone recommend a library to use for HTML sanitisation when displaying content, i.e. which is fast enough to not significantly affect the page-rendering performance? Has anyone used AntiSamy or owasp-java-html-sanitizer for something similar?
Shameless plug: https://code.google.com/p/owasp-java-html-sanitizer/
A fast and easy to configure HTML Sanitizer written in Java which lets you include HTML authored by third-parties in your web application while protecting against XSS.
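For context, a minimal usage sketch of that library; untrustedHtml is a placeholder, and the particular combination of prepackaged policies is just an example:

import org.owasp.html.PolicyFactory;
import org.owasp.html.Sanitizers;

// Build the policy once (e.g. in a static field) and reuse it across requests;
// constructing the policy is the relatively expensive part.
PolicyFactory policy = Sanitizers.FORMATTING.and(Sanitizers.LINKS).and(Sanitizers.BLOCKS);
String safeHtml = policy.sanitize(untrustedHtml);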
Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with short descriptions of the full texts (this is common in news portals), and I don't want to discard these short texts.
So if an API does this, returning the different textual parts/blocks split up in some manner rather than as one single text (everything lumped into one text is not useful), please let me know.
The Question
I download some pages from random sites, and now I want to analyze the textual content of the page.
The problem is that a web page has a lot of content like menus, advertising, banners, etc.
I want to try to exclude everything that is not related to the main content of the page.
Taking this page as an example, I don't want the menus above nor the links in the footer.
Important: all pages are HTML and come from various different sites. I need suggestions on how to exclude this content.
At the moment, I am thinking of excluding content inside "menu" and "banner" classes in the HTML, plus consecutive words that look like a proper name (first letter capitalized).
The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).
Edit: I want to do this inside my Java code, not with an external application (if possible).
I tried a way parsing the HTML content described in this question : https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract the text.
You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
There are also options to use a Reader, which opens up a large number of options.
You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).
Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:
import java.io.Reader;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

Reader reader = ...
InputSource is = new InputSource(reader);
// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);
// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not
    // block.getText() gives you the block's text
}
TextBlock has some more exciting methods, feel free to play around!
There appears to be a possible problem with Boilerpipe. Why?
Well, it appears that it is suited to certain kinds of web pages, such as web pages that have a single body of content.
So one can crudely classify web pages into three kinds in respect to Boilerpipe:
a web page with a single article in it (Boilerpipe worthy!)
a web page with multiple articles in it, such as the front page of the New York Times
a web page that doesn't really have any article in it, but has some content in the form of links, and may also have some degree of clutter.
Boilerpipe works on case #1. But if one is doing a lot of automated text processing, then how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied for case #1. Case #2 is a problem, and case #3 is a problem as well - it might require an aggregate of related web pages to determine what is clutter and what isn't.
You can use a library like goose. It works best on articles/news.
You can also check the JavaScript code that does similar extraction to goose in the readability bookmarklet.
My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (e.g. Elements posts = doc.select("p");) and not have to worry about the other elements with random content.
On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the MIN_WORDS_SEQUENCE number or be more selective with your selectors (i.e. not retrieve div elements)?
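For instance, combining the selector approach with the asker's "menu"/"banner" idea (a sketch; the selectors are only an illustration, and html is assumed to hold the page source):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.parse(html);
// drop the obvious clutter first
doc.select(".menu, .banner").remove();
// then keep just the paragraph text
Elements posts = doc.select("p");
for (Element p : posts) {
    System.out.println(p.text());
}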
http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php
Proprietary software, but it makes it very easy to extract from web pages and integrates well with Java.
You use a provided application to design XML files that are read by the RoboServer API to parse web pages. The XML files are built by analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and DB integration using the provided Java API.
If you're against using software and want to do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate tags and then build per-site rules.
You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:
Tag Soup
HTML Unit
You can filter out the HTML junk and then parse the required details, or use the APIs of the existing site.
Refer to the link below to filter the HTML; I hope it helps.
http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/
You could use the textracto API; it extracts the main 'article' text, and there is also the option to extract all other textual content. By 'subtracting' these texts you could split the navigation texts, preview texts, etc. from the main textual content.
I need to parse HTML 4 in Java.
Ideally I'd like an implementation that is SAX compatible.
I'm aware that there are numerous HTML parsers for Java; however, they all seem to perform 'tidying'. In other words, they will correct badly formed HTML. I don't want this.
My requirements are:
No tidying.
If the input document is invalid HTML parsing should fail.
The document should be validatable against the HTML DTDs.
The parser can produce SAX2 events.
Is there a library that meets these requirements?
You can find a collection of HTML parsers here: HTML Parsers. I don't remember exactly, but I think TagSoup parses the file without applying corrections...
I think the Jericho HTML Parser can deliver at least one of your core requirements ('If the input document is invalid HTML parsing should fail.') in that it will at least tell you if there are mismatched tags or other poisonous HTML flaws, and you can choose to fail based on this information.
Try typing invalid html into this Jericho formatting demo, and note the 'Parser Log' at the bottom of the page:
http://jerichohtmlparser.appspot.com/samples/FormatSource.jsp
So yes, this is doing tag tidying, but it is at least telling you about it - you can grab this information by setting a net.htmlparser.jericho.Logger (e.g. a WriterLogger or something more specific of your own creation) on your source, and then proceeding depending on what errors are logged out. This is a small example:
Source source = new Source("<a>I forgot to close my link!");
source.setLogger(myListeningLogger);
// NullWriter just discards the formatted output; we only care about what gets logged
source.getSourceFormatter().writeTo(new NullWriter());
// myListeningLogger has now had all the HTML flaws written to it
In the example above, your logger's info() method is called with the string: 'StartTag at (r1,c1,p0) missing required end tag', which is relatively parseable, and you can always decide to just reject any HTML that logs any message worse than debug - in fact Jericho logs almost all errors as 'info' level, with a couple at 'warn' level (you might be tempted to create a small fork with the severities adjusted to correspond to what you care about).
Jericho is available on Maven Central, which is always a good sign:
http://mvnrepository.com/artifact/net.htmlparser.jericho/jericho-html
Good luck!
You may wish to check http://lobobrowser.org/cobra.jsp. They have a pure Java web browser (Lobo) implemented. They have the parser component (Cobra) pulled out separately for use. I honestly am not sure if it will do what you require with the "no tidying" requirement, but it may be worth a look. I ran across it when exploring the wild for a pure Java web browser.
You can try to subclass javax.swing.text.html.parser.Parser and implement the handleXXX() methods. It seems it doesn't try to fix the markup. See the API docs for more.
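If you go down that road, note that the usual entry point is the related ParserDelegator/HTMLEditorKit.ParserCallback pair (subclassing Parser directly means setting up a DTD yourself). A minimal sketch, just to show where the handleXXX()-style callbacks land; be aware it reports bad markup through handleError rather than failing the parse:

import java.io.StringReader;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingHtmlParseDemo {
    public static void main(String[] args) throws Exception {
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                System.out.println("start tag: " + tag);
            }

            @Override
            public void handleError(String message, int pos) {
                // invalid markup shows up here instead of aborting the parse
                System.out.println("error at " + pos + ": " + message);
            }
        };
        new ParserDelegator().parse(new StringReader("<p>unclosed paragraph"), callback, true);
    }
}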
I would like to let users customize pages; let's call them A and B. So basically I want to provide a hyperlink to a JSP page with a big text box where a user should be able to enter any text or HTML (to appear on page A), with the ability to preview it and save it.
I haven't really dealt with this sort of issue before and would appreciate help on how to implement it (examples and references would be very helpful too).
Thanks
Are you using any kind of web framework (Spring MVC / Struts / Tapestry / etc.)? If you are, they all have tutorials on dealing with user input / form submission, so take a look at those. They all differ slightly in how user input is processed, so it's impossible to answer this question generically.
If you're not (e.g. this is straight JSP), take a look at this tutorial.
Basically, what you want to do is define an HTML form on your page B with a textarea where the user would input custom HTML. When the form is submitted, you'll get the text the user entered as a request parameter, and you can store it somewhere (in the database / a flat file / memory / what have you). On your page A you'll need to retrieve that text and bind it to request or page scope; you can then display it using <%= %> or <jsp:getProperty> tags.
To ChssPly76's answer I'd just add that if you're going to provide text entry of HTML on a web page (or anywhere, really), you're going to want to provide some kind of validation and a mechanism to give feedback if the HTML is bad. You might dispense with this for a raw internal tool, but anything for public consumption will need it. E.g. what do you do if someone enters
<b>sometext
You can deal with this with simple rules that strip away HTML tags, a preview that lets people see how they're doing so far (à la Stack Overflow), a rich-text input option, or just validation that shows a big honking "Try again" if the tags don't balance; either way, you'll want some kind of check so that you won't just be putting up broken pages.