prevent XSS attack on JSTL & JSP scriptlet [duplicate] - java

I'm writing a servlet-based application in which I need to provide a messaging system. I'm in a rush, so I chose CKEditor to provide editing capabilities, and I currently insert the generated HTML directly in the web page displaying all messages (messages are stored in a MySQL database, FYI). CKEditor already filters HTML based on a whitelist, but a user can still inject malicious code with a POST request, so this is not enough.
A good library already exists to prevent XSS attacks by filtering HTML tags, but it's written in PHP: HTML Purifier
So, is there a similar mature library that can be used in Java?
A simple string replacement based on a white list doesn't seem to be enough, since I'd like to filter malformed tags too (which could alter the design of the page on which the message is displayed).
If there isn't, then how should I proceed? An XML parser seems overkill.
Note: There are a lot of questions about this on SO, but all the answers refer to filtering ALL HTML tags: I want to keep valid formatting tags.

I'd recommend using Jsoup for this. Here's an extract of relevance from its site.
Sanitize untrusted HTML
Problem
You want to allow untrusted users to supply HTML for output on your website (e.g. as comment submission). You need to clean this HTML to avoid cross-site scripting (XSS) attacks.
Solution
Use the jsoup HTML Cleaner with a configuration specified by a Whitelist.
String unsafe =
"<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p>Link</p>
Jsoup offers more advantages than that as well. See also Pros and Cons of HTML parsers in Java.
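Since you want to keep valid formatting tags rather than strip everything, jsoup's built-in whitelists can be extended. A minimal sketch, assuming the extra tags below roughly match what CKEditor emits (adjust the list to your actual editor configuration):
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class MessageSanitizer {
    public static String sanitize(String untrustedHtml) {
        // Whitelist.basic() already allows b, i, em, strong, p, ul, ol, li, a[href], etc.;
        // the added heading tags are only an example of keeping more formatting.
        Whitelist formatting = Whitelist.basic().addTags("h1", "h2", "h3");
        return Jsoup.clean(untrustedHtml, formatting);
    }
}
Malformed markup is handled as well, because jsoup parses the input into a document tree before re-serializing it, so unbalanced tags can't leak out and break the layout of the surrounding page.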

You should use AntiSamy. (That's what I did)
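For reference, a minimal AntiSamy sketch, assuming the standard OWASP API and one of the sample policy files shipped with the project (the policy file name here is just an example):
import org.owasp.validator.html.AntiSamy;
import org.owasp.validator.html.CleanResults;
import org.owasp.validator.html.Policy;

public class AntiSamySanitizer {
    public static String sanitize(String untrustedHtml) throws Exception {
        // The policy XML declares which tags/attributes survive;
        // antisamy-slashdot.xml is one of the sample policies, used purely as an illustration.
        Policy policy = Policy.getInstance("antisamy-slashdot.xml");
        CleanResults results = new AntiSamy().scan(untrustedHtml, policy);
        return results.getCleanHTML();
    }
}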

If none of the ready-made options seem like enough, there is an excellent series of articles on XSS and attack prevention at Google Code. It should provide plenty of information to work with, if you end up going down that path.

Related

Ignoring spam/ads from a url using jsoup

I am using the jsoup parser for loading the contents of some sites. Generally some sites have advertisements and other non-relevant stuff on the pages. Is it possible to ignore these
when parsing a URL?
No, there isn't an advertisement-filtering function built into Jsoup. You have to do it manually (by inspecting the ad URLs of each page and matching them, with a regex for example).
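A hedged jsoup sketch of that manual approach (the selectors and ad-host patterns are purely illustrative and need to be adapted per site):
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AdFilter {
    public static String loadWithoutAds(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();
        // Selectors are only examples; inspect each site's ad markup and adjust them.
        doc.select("iframe[src*=doubleclick], div[class*=advert], a[href*=adserver]").remove();
        return doc.html();
    }
}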
This is not a direct answer to your question but you could use AlchemyAPI for that. They have a free 1,000 API calls program (and 30,000 if that's for academic purposes):
http://www.alchemyapi.com/api/text/

How to sanitize HTML code to prevent XSS attacks in Java or JSP?


How to filter (remove) JSP content from user-submitted pages

Overflowed Stack,
I have a Java web application (tomcat) whereby I allow the user to upload HTML code through a form.
Now since I am running on Tomcat and I actually display the user-uploaded HTML, I do not want a user to maliciously include JSP tags/scriptlets/EL and have these executed on the server. I want to filter out any JSP/non-HTML content.
Writing a parser myself seems too onerous, apart from the many subtleties one has to take care of (comments, byte representation for the scripts, etc.).
Do you know of any API/library which does this for me? I know about Caja filtering, but am looking for something specifically for JSPs.
Many Thanks,
JP, Malta.
Using a library for content cleaning is better than trying to do it yourself with, e.g., regexes.
Try AntiSamy from the Open Web Application Security Project.
http://www.owasp.org/index.php/Antisamy
I haven't used it (yet), but it seems to be suitable. JSP content should be automatically removed/escaped by the HTML normalization.
Edit, just found these:
Best Practice: User generated HTML cleaning
RegEx match open tags except XHTML self-contained tags
Don't worry about executing JSP code. Your JSP will be turned into a servlet once, so you will have something like:
out.println(contents);
and the contents won't be evaluated as JSP code. But you must worry about malicious JavaScript.
Just save it as *.html, not as *.jsp, then it won't be passed through the JspServlet which does all the taglib/EL processing work. All taglibs/EL will end up plain (unparsed) in response.
I'm not sure if I have understood your question completely, but if you want to remove all content surrounded by "<%# .. %>" you can replace it with a regex.
String resultString = subjectString.replaceAll("(?sim)<%# .*? %>", "");
I don't have a library to remove JSP tags, but you can write a little one based on regexps (see the sketch after this list) that would:
delete all "<% %>" tags
delete all HTML tags that contain the ':' character (to avoid "<jsp:...>" tags, for example)
I don't know whether all potential malicious Java code is covered by these two filters, but it is a good start...
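As a rough sketch of those two rules with plain java.util.regex (hedged: a real sanitizer library is still the safer choice, and regexes will miss edge cases such as scriptlets embedded in attributes):
import java.util.regex.Pattern;

public class JspStripper {
    // Rule 1: drop <% ... %> blocks (scriptlets, expressions, declarations, directives).
    private static final Pattern SCRIPTLETS = Pattern.compile("<%.*?%>", Pattern.DOTALL);
    // Rule 2: drop any tag whose name contains ':' (e.g. <jsp:include>, taglib tags).
    private static final Pattern PREFIXED_TAGS = Pattern.compile("</?\\w+:[^>]*>");

    public static String strip(String html) {
        String withoutScriptlets = SCRIPTLETS.matcher(html).replaceAll("");
        return PREFIXED_TAGS.matcher(withoutScriptlets).replaceAll("");
    }
}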
Another solution, but a little more complicated: use an HTTP proxy server (Apache httpd, Nginx, etc.) that serves static resources (CSS, images, HTML pages) directly and forwards only dynamic resources (JSPs and .do actions, for example) to Tomcat.
When a file is uploaded, you force the file extension to ".html". You are then sure (thanks to the HTTP proxy) that the file will not be interpreted by Tomcat.
If the pages supplied by the users aren't mentioned in the web.xml and you don't have a rule "anything that ends with *.jsp is a JSP" in web.xml, Tomcat won't try to compile/run them.
What is much more important: you must filter the HTML, or users could add arbitrary JavaScript which would then steal other users' passwords. This is non-trivial. Try to clean the code with JTidy to get XML, then remove all <script> tags, <link>, <object>, maybe even <img> (unless you make sure the supplied URLs are valid; some buggy browsers might run JavaScript if the image source is actually text/JavaScript), strip all CSS styles, and make sure any href points to a safe URL. Don't forget <iframe> and <applet> and all the other things that might break your security.
[EDIT] That should give you an idea of where this is going. In the end, you should do the reverse: allow only a very small subset of HTML, if at all. Most sites (like this one) use special markup for formatting, for two reasons:
It's more simple for the user
It's more secure
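A rough sketch of that JTidy-plus-DOM approach; the tag list below is illustrative only, and in practice a whitelist (keep only known-safe tags) is safer than a blacklist:
import java.io.InputStream;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class HtmlCleaner {
    public static Document clean(InputStream dirtyHtml) {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);            // normalize the tag soup to XHTML first
        tidy.setShowWarnings(false);
        Document doc = tidy.parseDOM(dirtyHtml, null);
        // Blacklist of obviously dangerous elements; extend as needed.
        String[] banned = {"script", "link", "object", "iframe", "applet"};
        for (String tag : banned) {
            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = nodes.getLength() - 1; i >= 0; i--) {
                Node n = nodes.item(i);
                n.getParentNode().removeChild(n);
            }
        }
        return doc;
    }
}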

Best practices for processing input HTML content at server side in java

I would like to implement a content management system with an RDBMS in Java/J2EE, and would like to know the best practices for handling input HTML content.
Below are a few doubts I have; I am sure there are lots of other things to take care of:
Do we need to escape HTML tags and special characters before we save HTML content to the database?
How do we validate/remove invalid special symbols in large input HTML content?
Best practices for displaying HTML content back to the browser from the database?
Any security risks involved while handling HTML content?
Looking forward to see some great ideas from gurus!
Use a tool like Neko to clean up the HTML into XHTML, then use any XML parser to parse it.
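A minimal sketch of that Neko route, assuming the NekoHTML DOMParser from org.cyberneko.html.parsers (the post-processing step is left as a comment):
import java.io.IOException;
import java.io.StringReader;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class NekoCleanup {
    public static Document toDom(String dirtyHtml) throws IOException, SAXException {
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(dirtyHtml)));
        // The tag soup is now a well-formed DOM: walk it, drop unwanted nodes,
        // then serialize it back to XHTML before storing or rendering it.
        return parser.getDocument();
    }
}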
I recently tried out some HTML clean-up libraries, and the best I came across was the Cobra HTML Renderer and Parser, which seems to be faster than others and also manages to convert dirtier HTML to XHTML. I first went for HTML Tidy, but it ended up complaining about "Unparseable HTML" way too often.
What I'd strongly discourage you from doing is to use a REGEX ;-)
I am not a guru in this, but I think you will have to figure out how to deal with some special characters and escape sequences, such as quotes (both double and single), etc.
Maybe you can try replacing those special characters and escape sequences with some other characters.
Maybe someone else who is currently dealing with a CMS might help you out. Anyway, cheers!
I would recommend looking at the architecture and design of an open source CMS like Alfresco or Apache Jackrabbit.
These are actual content repositories and will not contain end-to-end integration most likely, but can show you an underlying data model that is a good place to start.
I would also recommend you check out OWASP for information on web application security and vulnerabilities, and in particular security issues relevant to Java developers.

How do you grab a text from webpage (Java)?

I'm planning to write a simple J2SE application to aggregate information from multiple web sources.
The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I might want to extract a list of questions from stackoverflow, but I absolutely don't need that huge tag cloud or navbar.
What technique/library would you advice?
Updates/Remarks
Speed doesn't matter — as long as it can parse about 5MB of HTML in less than 10 minutes.
It should be really simple.
You may use HTMLParser (http://htmlparser.sourceforge.net/) in combination with URL#getInputStream() to parse the content of HTML pages hosted on the Internet.
You could look at how HttpUnit does it. They use a couple of decent HTML parsers; one is NekoHTML.
As far as getting the data, you can use what's built into the JDK (HttpURLConnection), or use Apache's HttpClient:
http://hc.apache.org/httpclient-3.x/
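A minimal fetch sketch using only the JDK's HttpURLConnection (the user agent string is just a placeholder, and the charset handling is simplified; real pages declare their encoding in headers or meta tags):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class PageFetcher {
    public static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestProperty("User-Agent", "my-aggregator/1.0"); // some sites block the default agent
        StringBuilder sb = new StringBuilder();
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } finally {
            in.close();
        }
        return sb.toString();
    }
}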
If you want to take advantage of any structural or semantic markup, you might want to explore converting the HTML to XML and using XQuery to extract the information in a standard form. Take a look at this IBM developerWorks article for some typical code, excerpted below (they're outputting HTML, which is, of course, not required):
<table>
{
for $d in //td[contains(a/small/text(), "New York, NY")]
for $row in $d/parent::tr/parent::table/tr
where contains($d/a/small/text()[1], "New York")
return <tr><td>{data($row/td[1])}</td>
<td>{data($row/td[2])}</td>
<td>{$row/td[3]//img}</td> </tr>
}
</table>
In short, you may either parse the whole page and pick the things you need (for speed I recommend looking at SAXParser), or run the HTML through a regexp that trims off all of the HTML... you can also convert it all into a DOM, but that's going to be expensive, especially if you're shooting for a decent throughput.
You seem to want to screen scrape. You would probably want to write a framework which, via an adapter/plugin per source site (as each site's format will differ), can parse the HTML source and extract the text. You would probably use Java's I/O API to connect to the URL and stream the data via InputStreams.
If you want to do it the old-fashioned way, you need to connect with a socket to the webserver's port, and then send the following data:
GET /file.html HTTP/1.0
Host: site.com
<ENTER>
<ENTER>
then use Socket#getInputStream, read the data using a BufferedReader, and parse the data using whatever you like.
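A small sketch of that raw-socket approach (the host and path are placeholders; in practice HttpURLConnection or HttpClient handles redirects, chunked responses, and so on for you):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class RawHttpGet {
    public static void main(String[] args) throws IOException {
        Socket socket = new Socket("site.com", 80);
        try {
            PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
            out.print("GET /file.html HTTP/1.0\r\n");
            out.print("Host: site.com\r\n");
            out.print("\r\n");               // blank line terminates the request headers
            out.flush();
            BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);    // status line, headers, then the HTML body
            }
        } finally {
            socket.close();
        }
    }
}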
You can use NekoHTML to parse your HTML document. You will get a DOM document. You may use XPath to retrieve the data you need.
If your "web sources" are regular websites using HTML (as opposed to structured XML format like RSS) I would suggest to take a look at HTMLUnit.
This library, while targeted for testing, is a really general purpose "Java browser". It is built on a Apache httpclient, Nekohtml parser and Rhino for Javascript support. It provides a really nice API to the web page and allows to traverse website easily.
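A short HtmlUnit sketch; the URL and XPath selector are purely illustrative, and the class names assume the classic com.gargoylesoftware.htmlunit packages:
import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class QuestionScraper {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        HtmlPage page = webClient.getPage("http://stackoverflow.com/questions");
        // XPath against the parsed DOM; the selector is just an example.
        List<?> anchors = page.getByXPath("//a[contains(@class,'question-hyperlink')]");
        for (Object a : anchors) {
            System.out.println(((HtmlAnchor) a).asText());
        }
    }
}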
Have you considered taking advantage of RSS/Atom feeds? Why scrape the content when it's usually available for you in a consumable format? There are libraries available for consuming RSS in just about any language you can think of, and it'll be a lot less dependent on the markup of the page than attempting to scrape the content.
If you absolutely MUST scrape content, look for microformats in the markup, most blogs (especially WordPress based blogs) have this by default. There are also libraries and parsers available for locating and extracting microformats from webpages.
Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you without reinventing the wheel.
Check this out: http://www.alchemyapi.com/api/demo.html
They return pretty good results and have an SDK for most platforms. Not only text extraction, but they also do keyword analysis, etc.
