Problem with HTML output in JSP - java

I have code which generates some HTML, but when I try to output this content in a JSP, all '<' are replaced with '&lt;' and all '>' with '&gt;'. Here is the piece which renders the result:
<c:out value="${data}"/>
Can someone please explain what causes the character replacement, and how it can be avoided?
PS: I've tried the escapeXml='false' attribute of the <c:out/> tag, but then nothing at all is displayed.
Thanks,
Natasha

What do you want the generated HTML to do? Do you want it to be markup used by the browser, or do you want it shown on the final page, for example as sample code visible to the user? With escapeXml='false' it will be output to the browser as HTML and interpreted along with all the other markup. With escapeXml='true' (the default) it will be turned into escaped markup which is rendered visible to the end user. So it all depends on what you're trying to do.
If you want to have < and > characters visible to the end user then they have to appear as &lt; and &gt; in the markup.
Nick

The out tag behaves as it should. If you do wish to output the literal value, you can insert the EL expression ${data} directly in your text. That is, instead of
Some content <c:out value="${data}"/> some more content
you would use
Some content ${data} some more content
in your JSP.
Before you get angry with c:out, please consider that the output is often something a user has put in there, potentially with some unwanted code. Imagine using ${data} on StackOverflow instead of the (C# version of) c:out :-)

There is an escapeXml attribute on the c:out tag that is set to true by default. Set it to false and no escaping will take place, i.e. your HTML will be output as is to the browser.

<c:out value="${data}"/> escapes HTML characters while ${data} does not. Take a look at this related question.
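To make the character replacement concrete, here is a minimal Java sketch of the substitution that <c:out> performs when escapeXml is true. The class and method names are made up for illustration; this is not the JSTL implementation, only an approximation of its documented behaviour for the five characters <, >, &, " and '.

```java
final class XmlEscapeSketch {
    // Hypothetical helper: replaces the five characters that <c:out>
    // escapes by default with their entity / character references.
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&#034;"); break;
                case '\'': out.append("&#039;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Markup goes in, visible text comes out.
        System.out.println(escapeXml("<b>bold</b>"));
    }
}
```

Printing ${data} directly skips this step, which is why the markup is then interpreted by the browser.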

Related

Escape HTML only within PRE tag

Inside my JSP, I simply print out my content as follows:
${article.body}
Of course, any HTML tags within that object are rendered, and that's expected behaviour. However within this content, I want to show everything within a <pre> tag as plain text.
I know HTML can be escaped by using ${fn:escapeXml(article.body)} or <c:out value="${article.body}" />, but that will escape all the HTML, whereas I just need everything inside the <pre> tag to be escaped.
I am using Java to generate the contents, and JSP as the view.
Any help would be greatly appreciated.
You could try using jsoup to make the content safe before sending it to the JSP.
http://jsoup.org/ allows many levels of escaping.
You could find the text in your servlet and then send it to be escaped using jsoup or similar.
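As a rough illustration of the "find the text in your servlet and escape it" idea, the sketch below uses only java.util.regex rather than jsoup, so it is a swapped-in technique and not what the answer literally proposes. The class and method names are hypothetical. It assumes <pre> blocks are not nested, that the body contains no literal </pre>, and that the body is not already escaped (otherwise entities would be escaped twice).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PreEscapeSketch {
    // Opening <pre ...> tag, body (non-greedy, may span lines), closing tag.
    private static final Pattern PRE = Pattern.compile(
            "(<pre\\b[^>]*>)(.*?)(</pre>)",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Escapes &, < and > inside every <pre> block; leaves the rest untouched.
    static String escapeInsidePre(String html) {
        Matcher m = PRE.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(2)
                    .replace("&", "&amp;")   // must run first
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
            // quoteReplacement keeps '$' and '\' in the body literal.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + body + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The result could then be placed in the request and printed with ${article.body} as before. For untrusted or malformed HTML, a real parser is likely the safer choice.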

correcting parsed URLs in java

I am creating a HTML parser that gets the HTML from a given URL, finds the navigation menu html, and puts it into a String. The URLs in the HTML that are being copied into the String need part of the URL added (the "www.stackoverflow.com" part). How can I go about finding the existing URLs in the String and adding the missing part to it so that they work.
The URLs in the String are of the form:
<a href="/questions/11744851.cfm">
and I need to make them in the following form:
<a href="www.stackoverflow.com/questions/11744851.cfm">
Try using this regular expression with the replaceAll() method. It matches only hrefs that start with a slash and stops at the closing quote:
str = subString.replaceAll("<a href=\"/([^\"]*)\">", "<a href=\"http://www.stackoverflow.com/$1\">");
If the XHTML is valid XML, the easiest way is to parse it as XML and use XPath (for example /html/body/div/a/@href, where /html/body/div is the path to the menu section in the HTML).
There is also a project called HTMLParser (http://htmlparser.sourceforge.net/) that you may want to try. According to its page, it has 'link extraction, for crawling through web pages or harvesting email addresses', but I've never used it, so I can't help much.
If, on the other hand, the HTML is anything but valid, you may want to use http://ccil.org/~cowan/XML/tagsoup/. It might work or it might not; on the websites we've tried, it worked pretty well.
Edit: the missing part can be added by simple concatenation once the relevant links have been found.
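One way to do the "find the links, then add the missing part" step is sketched below. Instead of plain string concatenation it uses java.net.URI.resolve, which also leaves already-absolute links alone. The class name, method name and the base URL passed in are assumptions for illustration, and the pattern only handles double-quoted href attributes. Note that URI.resolve throws IllegalArgumentException for hrefs that are not valid URIs (for example ones containing spaces).

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AbsoluteLinksSketch {
    // Double-quoted href attributes only; group 1 is the URL.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Rewrites every href so that it is absolute relative to baseUrl.
    static String absolutize(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String resolved = base.resolve(m.group(1)).toString();
            m.appendReplacement(out,
                    Matcher.quoteReplacement("href=\"" + resolved + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, absolutize(menuHtml, "http://www.stackoverflow.com/") would turn a root-relative href into one that starts with that host, where menuHtml stands for the String holding the extracted navigation markup.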

How to check if the content is plain text or not?

I have a plain text area where I accept only plain text from users. I want to make sure that users do not put any markup in the text area. I also assume that users can post in different languages. So, what is the best approach to validate the content both at the server side (using java) and at the client side (using jquery).
Any help in this regard would be appreciated.
Update: I am sorry if the question wasn't clear enough. To make it simple, this is what I want to do: I let users type text in the textarea (no rich text box here). For each double newline in the text area I want to show a paragraph in the HTML page. How do I do that correctly?
It makes little sense to validate user input on HTML content. You can just escape HTML when redisplaying this user input on the webpage. Since you mentioned that you're using Java on the server side and thus you're likely using JSP as view technology, it's good to know that you can use the JSTL <c:out> tag and fn:escapeXml() function to escape HTML before printing to output.
E.g. when redisplaying user-controlled input:
<c:out value="${somebean.sometext}" />
or when redisplaying user-submitted request parameter:
<input type="text" name="foo" value="${fn:escapeXml(param.foo)}" />
This way, for example, <script>alert('xss')</script> will be printed to the HTML output as &lt;script&gt;alert('xss')&lt;/script&gt; and thus be displayed literally, exactly as the end user typed it.
If you really insist on validating this, you could grab an HTML parser like Jsoup for this.
String text = request.getParameter("text");
if (!text.equals(Jsoup.parse(text).text())) {
// There was HTML in the text.
}
Update: as per the comments, you actually want to sanitize the input against an HTML whitelist to remove potentially malicious tags. You can do this with Jsoup as well, see also this page.
String sanitized = Jsoup.clean(text, Whitelist.basic());
The allowed elements of Whitelist#basic() are specified in the API documentation.
If it's HTML markup you want to prevent, you could use a regular expression to throw an error if it sees a chevron (<)
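For the updated question (one paragraph per double newline), a possible server-side sketch is to escape the text first and only then wrap each block in a <p> element, so that nothing the user typed can end up as markup. The class and method names are hypothetical. Only the ASCII characters &, < and > are touched, so text in other languages passes through unchanged.

```java
final class ParagraphSketch {
    // Minimal escaping for text placed inside an element body.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Turns blank-line-separated blocks of plain text into <p> elements.
    static String toParagraphs(String text) {
        StringBuilder out = new StringBuilder();
        // Browsers submit textarea line breaks as \r\n; normalize them,
        // then split wherever a blank line (possibly with spaces) occurs.
        for (String block : text.replace("\r\n", "\n").split("\n\\s*\n")) {
            String trimmed = block.trim();
            if (!trimmed.isEmpty()) {
                out.append("<p>").append(escape(trimmed)).append("</p>\n");
            }
        }
        return out.toString();
    }
}
```

Because the result already contains the intended markup, it would have to be printed without further escaping, for example with ${...} rather than <c:out>.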

How to find URLs in HTML using Java

I have the following... I wouldn't say problem, but situation.
I have some HTML with tags and everything. I want to search the HTML for every URL. I'm doing it now by checking where it says 'h' then 't' then 't' then 'p', but I don't think that is a great solution.
Any good ideas?
Added: I'm looking for some kind of pseudocode but, just in case, I'm using Java for this project in particular
Try using a HTML parsing library then search for <a> tags in the HTML document.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
Not all URLs are in tags; some are in plain text,
and some are in links or other tags.
You shouldn't scan the HTML source to achieve this.
You will end up with link elements that are not necessarily in the 'text' of the page, i.e you could end up with 'links' of JS scripts in the page for example.
Best way is still that you use a tool made for the job.
You should grab HTML tags and cover the most likely ones to have 'links' inside them (say: <h1>, <p>, <div> etc) . HTML parsers provide regex-like functionalities to filter through the content of the tags, something similar to your logic of "starts with HTTP".
[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. select("[href*=/path/]")
See: jSoup.
You may want to have a look at XPath or Regular Expressions.
Use a DOM parser to extract all <a href> tags, and, if desired, additionally scan the source for http:// outside of those tags.
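The "scan the source for http://" part of that suggestion could look roughly like the sketch below. It uses a deliberately loose pattern (not a full URL grammar) and hypothetical class and method names. It will also pick up URLs inside scripts and attributes, so the results may need filtering.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlScanSketch {
    // "http://" or "https://" followed by anything that is not
    // whitespace, a quote, or an angle bracket.
    private static final Pattern URL =
            Pattern.compile("https?://[^\\s\"'<>]+");

    // Returns every match in document order, duplicates included.
    static List<String> findUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

This replaces the character-by-character check for 'h', 't', 't', 'p' with a single pattern, while a parser remains the better tool for the links that live in tags.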
The best way should be to google for regexes. One example is this one:
/^(https?):\/\/((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})(?::(\d+))?((?:\/(?:[a-z0-9\-._~!$&'()+,;=:#]|%[0-9A-F]{2})))(?:\?((?:[a-z0-9\-._~!$&'()+,;=:\/?#]|%[0-9A-F]{2})))?(?:#((?:[a-z0-9\-._~!$&'()+,;=:\/?#]|%[0-9A-F]{2})*))?$/i
found in a hacker news article. As far as I can follow it, it looks good. But there is, as far as I know, no formal regex for this problem. So the best solution is to google for some and try which one matches most of what you want.

amp is included in url struts tag

In my web application, I use the Struts2 url tag to pass parameters such as an id. For example, I use a link to delete an entity and I use param to pass the id of the entity to be deleted. I follow this pattern throughout my web app for adding, editing and deleting an entity.
During run time, sometimes, I don't get the params to be stored in my action's bean properties. When I see the link that is generated, I get something like
<a href='/projit1/p/discuss/viewDiscussion.action?d=11&amp;amp;amp;projid=11&amp;amp;disid=4'>
What are these amps for? Why do they sit in between the action calls (made by links via the url tag)? By the time I have traversed back and forth in my web app, I get 10s and 20s of amps sitting in the request URL. What is the problem here? Please help.
In HTML, XHTML and XML certain characters are treated specially. The special characters used the most are less than (<) and ampersand (&). The < is only valid at the beginning of a tag, while the & is used to encode character entities (special characters, characters that can't be typed, etc.). Because & is special and cannot appear unescaped as part of an attribute value, it is encoded as &amp;. It may look strange if you don't know why, but the href value in your question is almost correct. In the same manner, < should be encoded as &lt; to ensure correct browser behavior. Not encoding these characters MAY work but is NOT GUARANTEED to work.
The problem with your URL is the multiple amp; sequences, which indicate that the href has been encoded multiple times. The first time, & was changed to &amp;, and at that point another parameter was added with its own & separator. The whole URL was then encoded a second time, changing the first &amp; to &amp;amp; and the second & to &amp;. Then for some reason the URL was encoded a third time, causing the first to change to &amp;amp;amp; and the second to &amp;amp;. To remove the excess amp;s you need to ensure the URL is encoded for HTML only once, not multiple times.
Your resulting tag should look like this:
<a href='/projit1/p/discuss/viewDiscussion.action?d=11&amp;projid=11&amp;disid=4'>
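The effect of encoding the same URL more than once can be reproduced with a few lines of Java. This is only an illustration of the mechanism, not of what Struts does internally; the class name, the helper and the shortened URL are made up.

```java
final class AmpEncodingSketch {
    // Hypothetical stand-in for "encode this URL for use in HTML":
    // every '&' becomes "&amp;", including the one in an existing "&amp;".
    static String encodeForHtml(String url) {
        return url.replace("&", "&amp;");
    }

    public static void main(String[] args) {
        String raw = "viewDiscussion.action?d=11&projid=11";
        String once = encodeForHtml(raw);    // correct for an href attribute
        String twice = encodeForHtml(once);  // one encoding too many
        System.out.println(once);
        System.out.println(twice);
    }
}
```

Each extra pass adds another amp; to every separator, which is why the sequences grow as already-encoded request parameters are fed back into newly generated links.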
I have found the problem. Hope it helps others.
I will have to set includeParams to none. It will avoid old request parameters
Add below attribute to the url tag
escapeAmp="false"
I know this is a bit old but I ran into the same problem and solved it thanks to luckydev. Here is my code sample:
<s:url id="remoteurlGrid" action="certificationListJSON" includeParams="get">
<s:param name="population" value="%{getPopulation()}" />
<s:param name="selectedSalesRepID" value="%{getSelectedSalesRepID()}" />
</s:url>
All I needed to do was add includeParams="get". The odd thing is that this SHOULD be the default, so why it didn't work without it I don't know.
Here is a reference to the API: http://struts.apache.org/2.1.6/struts2-core/apidocs/org/apache/struts2/components/URL.html
