How to check if the content is plain text or not? - java

I have a plain text area where I accept only plain text from users. I want to make sure that users do not put any markup in the text area. I also assume that users can post in different languages. So, what is the best approach to validate the content both on the server side (using Java) and on the client side (using jQuery)?
Any help in this regard would be appreciated.
Update: I am sorry if the question wasn't clear enough. To make it simple, this is what I want to do: I let users type text in the textarea (no rich text box here). For each double newline in the text area, I want to show a paragraph in the HTML page. How do I do that correctly?

It makes little sense to validate user input on HTML content. You can just escape HTML when redisplaying this user input on the webpage. Since you mentioned that you're using Java on the server side and thus you're likely using JSP as view technology, it's good to know that you can use the JSTL <c:out> tag and fn:escapeXml() function to escape HTML before printing to output.
E.g. when redisplaying user-controlled input:
<c:out value="${somebean.sometext}" />
or when redisplaying user-submitted request parameter:
<input type="text" name="foo" value="${fn:escapeXml(param.foo)}" />
This way, for example, <script>alert('xss')</script> will be printed to the HTML output as &lt;script&gt;alert(&#039;xss&#039;)&lt;/script&gt; and thus be displayed literally, exactly as the end user typed it.
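Outside a JSP, the same idea can be sketched in plain Java. The helper below is hypothetical and hand-rolled (the class and method names are assumptions, not part of JSTL); it only illustrates which characters get replaced, and a maintained encoder library is usually the better choice in real code.

```java
// Hypothetical helper, not part of JSTL: replaces the same five
// characters that fn:escapeXml() escapes (& < > " ').
final class HtmlEscaper {
    private HtmlEscaper() {}

    static String escape(String input) {
        if (input == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

Calling HtmlEscaper.escape(request.getParameter("text")) right before writing the value into the page keeps the stored text untouched and escapes only at output time.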
If you really insist on validating this, you could use an HTML parser such as Jsoup:
String text = request.getParameter("text");
if (text != null && !text.equals(Jsoup.parse(text).text())) {
    // There was HTML in the text. Note that text() also normalizes
    // whitespace, so multi-line input may differ for that reason alone.
}
Update: as per the comments, you actually want to sanitize the input against an HTML whitelist to remove potentially malicious tags. You can do this with Jsoup as well; see also this page.
String sanitized = Jsoup.clean(text, Whitelist.basic());
The elements allowed by Whitelist#basic() are specified in the API documentation. (Newer jsoup releases name this class Safelist.)
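For the paragraph requirement in the question's update, one possible approach is to escape the text first and then wrap every block separated by blank lines in a <p> element. This is only a sketch under assumptions: the class and method names are made up, single newlines inside a block are turned into <br>, and only the characters that matter in element content are escaped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns plain textarea input into escaped HTML paragraphs.
final class PlainTextParagraphs {
    private PlainTextParagraphs() {}

    static String toHtml(String text) {
        if (text == null) {
            return "";
        }
        // Normalize Windows and old Mac line endings to \n.
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n').trim();
        if (normalized.isEmpty()) {
            return "";
        }
        List<String> paragraphs = new ArrayList<>();
        // A paragraph break is a newline followed by one or more blank lines.
        for (String block : normalized.split("\\n(?:[ \\t]*\\n)+")) {
            String trimmed = block.trim();
            if (!trimmed.isEmpty()) {
                // Escape first, then add markup, so user input never becomes a tag.
                paragraphs.add("<p>" + escape(trimmed).replace("\n", "<br>") + "</p>");
            }
        }
        return String.join("\n", paragraphs);
    }

    // Enough for element content; attribute values would also need quotes escaped.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Because Java strings are Unicode, this works the same way for input in any language; the only requirement is that the request is decoded with the right character encoding before it gets here.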

If it's HTML markup you want to prevent, you could use a regular expression to throw an error if it sees a chevron (<).
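A minimal sketch of that check in Java, with hypothetical names. The first method rejects any chevron at all, which also rejects harmless text such as "3 < 5"; the second is a slightly stricter variation (my assumption, not part of the answer) that only flags a chevron that looks like the start of a tag.

```java
import java.util.regex.Pattern;

// Hypothetical validator for plain-text input.
final class MarkupCheck {
    // '<' followed by a letter, '/' or '!' looks like the start of a tag or comment.
    private static final Pattern TAG_START = Pattern.compile("<[A-Za-z/!]");

    private MarkupCheck() {}

    // True if the text contains any '<' at all.
    static boolean containsChevron(String text) {
        return text != null && text.indexOf('<') >= 0;
    }

    // True only if a '<' is followed by something tag-like.
    static boolean looksLikeTag(String text) {
        return text != null && TAG_START.matcher(text).find();
    }
}
```

Either check is only a convenience for giving the user feedback; escaping on output is still needed, because input can reach the database by other routes.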

Related

How to defend against xss when saving data and when displaying it

Let's say I have a simple CRUD application with a form to add a new object and edit an existing one. From a security point of view I want to defend against cross-site scripting. First I would validate the input of submitted data on the server. But after that, I would escape the values being displayed in the view because maybe I have more than one application writing to my database (some developer by mistake inserts unvalidated data in the DB in the future). So I will have this JSP:
<%@ taglib prefix="esapi" uri="http://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API" %>
<form ...>
<input name="myField" value="<esapi:encodeForHTMLAttribute>${myField}</esapi:encodeForHTMLAttribute>" />
</form>
<esapi:encodeForHTMLAttribute> does almost the same thing as <c:out>: it HTML-escapes sensitive characters like < > " etc.
Now, if I load an object that somehow was saved in the database with myField=abc<def, the input will correctly display the value abc<def, while the value in the HTML behind it will be abc&lt;def.
The problem is when the user submits this form without changing the values, the server receives the value abc&lt;def instead of what is visible in the page abc<def. So this is not correct. How should I implement the protection in this case?
The problem is when the user submits this form without changing the values, the server receives the value abc&lt;def instead of what is visible in the page abc<def
Easy. In this case, HTML-decode the value and then validate it.
Though as noted in a few comments, you should see how we operate with the OWASP ESAPI-Java project. By default we always canonicalize the data, which means we run a series of decoders to detect multiple/mixed encoding as well as to create a string that is safe to validate against with a regex.
For the part that really guarantees you protection, however, you normally want to have raw text stored on the server, not anything that contains HTML-escaped characters, so you may wish to store the unescaped string, if only so that you can safely encode it when you send it back to the user.
Encoding is the best protection for XSS, and I would in fact recommend it BEFORE input validation if for some reason you had to choose.
I say "may" because in general I think it's a bad practice to store altered data. It can make troubleshooting a chore. This can be even more complicated if you're using a technology like TinyMCE, a rich-text editor in the browser. It also renders HTML, so it's like dealing with a browser within a browser.
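The decode, validate, store raw, encode-on-output flow described above can be sketched as follows. This is not the ESAPI API: every name here is hypothetical, the decoder only knows five entities, and the validation rule (at most 100 characters, no control characters) is an assumption chosen purely for illustration.

```java
// Hypothetical sketch of "decode, validate, store raw, encode on output".
final class FieldRoundTrip {
    private FieldRoundTrip() {}

    // Encode for use inside a quoted HTML attribute value.
    static String encodeForAttribute(String raw) {
        return raw.replace("&", "&amp;")
                  .replace("<", "&lt;")
                  .replace(">", "&gt;")
                  .replace("\"", "&quot;")
                  .replace("'", "&#39;");
    }

    // One decoding pass; "&amp;" goes last so "&amp;lt;" becomes "&lt;", not "<".
    static String decodeOnce(String encoded) {
        return encoded.replace("&lt;", "<")
                      .replace("&gt;", ">")
                      .replace("&quot;", "\"")
                      .replace("&#39;", "'")
                      .replace("&amp;", "&");
    }

    // If a second pass still changes the value, the input was encoded more than once.
    static boolean hasNestedEncoding(String input) {
        String once = decodeOnce(input);
        return !decodeOnce(once).equals(once);
    }

    // Assumed business rule for the field, applied to the decoded value.
    static boolean isValid(String value) {
        return value.length() <= 100
                && value.chars().noneMatch(Character::isISOControl);
    }
}
```

With this, a submitted abc&lt;def is decoded to abc<def, validated, stored as abc<def, and passed through encodeForAttribute again each time the form is rendered.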

Escape HTML only within PRE tag

Inside my JSP, I simply print out my content as follows:
${article.body}
Of course, any HTML tags within that object are rendered, and that's expected behaviour. However within this content, I want to show everything within a <pre> tag as plain text.
I know HTML can be escaped by using ${fn:escapeXml(article.body)} or <c:out value="${article.body}" />, but that will escape all the HTML, whereas I just need everything inside the <pre> tag to be escaped.
I am using Java to generate the contents, and JSP as the view.
Any help would be greatly appreciated.
You could try using jsoup to make the content safe before sending it to the JSP.
http://jsoup.org/ allows many levels of escaping.
You could find the text in your servlet and then send it to be escaped using jsoup or similar.
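As a rough illustration of the "or similar" route, the sketch below uses only java.util.regex to escape whatever sits between <pre> and </pre> and leaves the rest of the markup alone. The names are hypothetical, and it assumes that <pre> blocks are not nested and that their content has not already been escaped (otherwise it would be escaped twice). A real HTML parser is more robust than a regex here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: escapes HTML only inside <pre>...</pre> blocks.
final class PreEscaper {
    // Group 1: opening tag, group 2: content, group 3: closing tag.
    private static final Pattern PRE = Pattern.compile(
            "(<pre\\b[^>]*>)(.*?)(</pre>)",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    private PreEscaper() {}

    static String escapeInsidePre(String html) {
        if (html == null) {
            return "";
        }
        Matcher m = PRE.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + escape(m.group(2)) + m.group(3);
            // quoteReplacement keeps '$' and '\' in the content literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

The servlet would call PreEscaper.escapeInsidePre(article.getBody()) (a hypothetical getter) and hand the result to the JSP, which can then keep printing it unescaped.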

correcting parsed URLs in java

I am creating an HTML parser that gets the HTML from a given URL, finds the navigation menu HTML, and puts it into a String. The URLs in the HTML that are being copied into the String need part of the URL added (the "www.stackoverflow.com" part). How can I go about finding the existing URLs in the String and adding the missing part to them so that they work?
The URLs in the String are of the form:
<a href="/questions/11744851.cfm">
and I need to make them in the following form:
<a href="www.stackoverflow.com/questions/11744851.cfm">
Try using this regular expression with the replaceAll() method:
str = subString.replaceAll("<a href=\"/([^\"]*)\">", "<a href=\"http://www.stackoverflow.com/$1\">");
If the XHTML is valid XML, the easiest way is to parse it as XML and use XPath (for example /body/div/a/@href, where /body/div is the path to the menu section in the HTML).
There is also a project called HTMLParser (http://htmlparser.sourceforge.net/), you may want to give it a try (according to the page, it has 'link extraction, for crawling through web pages or harvesting email addresses'; but I've never used it, so I can't help much).
If, on the other hand, the HTML is anything but valid, you may want to use http://ccil.org/~cowan/XML/tagsoup/ - it might work or it might not; on the websites we've tried, it did pretty well.
Edit: adding the missing part may be done using simple concatenation after finding the interesting parts.
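Instead of plain concatenation, java.net.URI can resolve each href against the site's base URL, which leaves links that are already absolute untouched. The sketch below is one way to combine that with a regex; the class and method names are hypothetical, and it assumes double-quoted href attributes.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites relative href values to absolute URLs.
final class LinkAbsolutizer {
    // Assumes double-quoted attribute values, as in the question.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    private LinkAbsolutizer() {}

    static String absolutize(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String href = m.group(1);
            String resolved;
            try {
                // Relative paths get the base's scheme and host; absolute URLs pass through.
                resolved = base.resolve(href).toString();
            } catch (IllegalArgumentException e) {
                // Not a syntactically valid URI (e.g. contains spaces): leave it as it was.
                resolved = href;
            }
            m.appendReplacement(out, Matcher.quoteReplacement("href=\"" + resolved + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For the example in the question, LinkAbsolutizer.absolutize(menuHtml, "http://www.stackoverflow.com/") would be called on the extracted menu String (menuHtml being whatever variable holds it).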

Problem with HTML output in JSP

I have code which generates some HTML, but when I try to output this content in a JSP all '<' are replaced with '&lt;' and all '>' with '&gt;'. Here is the piece which renders the result:
<c:out value="${data}"/>
Can someone please explain what causes the character replacement, and how it can be avoided?
PS: I've tried the escapeXml='false' attribute of the <c:out/> tag, but then nothing at all is displayed.
Thanks,
Natasha
What do you want the generated HTML to do? Do you want it to be markup used by the browser, or do you want it shown on the final page, for example as sample code visible to the user? With escapeXml='false' it will be output to the browser as HTML and interpreted along with all the other markup. With escapeXml='true' it will be turned into escaped markup which is rendered visible to the end user. So it all depends on what you're trying to do.
If you want to have < and > characters visible to the end user then they have to appear as &lt; and &gt; in the markup.
Nick
The out tag behaves as it should - if you do wish to output the literal value you can directly insert the EL expression ${data} in your text. That is, instead of
Some content <c:out value="${data}"/> some more content
you would use
Some content ${data} some more content
in your JSP.
Before you get angry with c:out, please consider that the output often is something a user has put in there - potentially with some unwanted code. Imagine using ${data} on StackOverflow instead of the (C# version of) c:out :-)
There is an escapeXml attribute on the c:out tag that is set to true by default. Set it to false and no escaping will take place, i.e. your HTML will be output as is in the browser.
<c:out value="${data}"/> escapes HTML characters while ${data} does not. Take a look at this related question.

Customizing jsp pages

I would like to let users customize pages, let's call them A and B. So basically I want to provide a hyperlink to a JSP page with a big text box where a user should be able to enter any text or HTML (to appear on page A), with the ability to preview it and save.
I haven't really dealt with this sort of issue before and would appreciate help on how to implement it (examples and references would be very helpful too).
Thanks
Are you using any kind of web framework (Spring MVC / Struts / Tapestry / etc.)? If you are, they all have tutorials on dealing with user input / form submission, so take a look at those. They all differ slightly in how user input is processed, so it's impossible to answer this question generically.
If you're not (e.g. this is straight JSP), take a look at this tutorial.
Basically, what you want to do is to define an HTML form on your page B with textarea where user would input custom HTML. When form is submitted, you'll get the text user entered as a request parameter and you can store it somewhere (in the database / flat file / memory / what have you). On your page A you'll need to retrieve that text and bind it to request or page scope, you can then display it using <%= %> or <jsp:getProperty> tags.
To ChssPly76's answer I'd just add that if you're going to provide text entry of html on a web page (or anywhere, really) you're going to want to provide some kind of validation and a mechanism to provide feedback if the html is bad. You might dispense with this for a raw internal tool but anything for public consumption will need it. e.g. what do you do if someone enters
<b>sometext
You can deal with this with simple rules that parse away HTML tags, a preview that lets people know how they're doing so far (à la Stack Overflow), an RTF input option, or just a validation step that shows a big honking "Try again" if the tags don't balance. Either way, you'll want some kind of check so that you won't just be putting up broken pages.
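The "do the tags balance" check can be sketched with a stack. This is a deliberately naive illustration with hypothetical names: it finds tags with a regex, ignores comments and scripts, and treats a small assumed set of void elements (br, hr, img, input, meta, link) as never needing a close. For anything public-facing, a real HTML parser or sanitizer is the safer tool.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical validator: reports whether opening and closing tags pair up.
final class TagBalanceChecker {
    // Assumed subset of elements that never take a closing tag.
    private static final Set<String> VOID_TAGS = new HashSet<>(
            Arrays.asList("br", "hr", "img", "input", "meta", "link"));

    // Group 1: "/" for a closing tag, group 2: tag name, group 3: "/" if self-closing.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)\\b[^>]*?(/?)>");

    private TagBalanceChecker() {}

    static boolean isBalanced(String html) {
        if (html == null) {
            return true;
        }
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (VOID_TAGS.contains(name) || selfClosing) {
                continue; // nothing to pair up
            }
            if (closing) {
                // A closing tag must match the most recently opened tag.
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else {
                open.push(name);
            }
        }
        return open.isEmpty();
    }
}
```

For the <b>sometext example above, isBalanced returns false, which is the point at which the page could show its "Try again" message instead of saving.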
