Escape HTML only within PRE tag - java

Inside my JSP, I simply print out my content as follows:
${article.body}
Of course, any HTML tags within that object are rendered, and that's expected behaviour. However within this content, I want to show everything within a <pre> tag as plain text.
I know HTML can be escaped by using ${fn:escapeXml(article.body)} or <c:out value="${article.body}" />, but that will escape all the HTML, whereas I just need everything inside the <pre> tag to be escaped.
I am using Java to generate the contents, and JSP as the view.
Any help would be greatly appreciated.

You could try using jsoup to make the content safe before sending it to the JSP.
http://jsoup.org/ supports several levels of escaping.
You could find the text in your servlet and then send it to be escaped using jsoup or similar.
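As a rough illustration of the "escape it in the servlet first" idea, here is a minimal sketch that uses only java.util.regex instead of jsoup. The class and method names (PreEscaper, escapeInsidePre) are hypothetical, and the sketch assumes that <pre> blocks are not nested and that their content has not already been escaped (otherwise it would be escaped twice).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: escapes markup only between <pre ...> and </pre>.
class PreEscaper {
    // Group 1: opening tag, group 2: content (non-greedy), group 3: closing tag.
    private static final Pattern PRE_BLOCK = Pattern.compile(
            "(<pre\\b[^>]*>)(.*?)(</pre>)",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Replaces the characters that are significant in HTML with entities.
    // '&' must be handled first so the other entities are not re-escaped.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Leaves everything outside <pre> untouched and escapes what is inside.
    static String escapeInsidePre(String html) {
        Matcher m = PRE_BLOCK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + escapeHtml(m.group(2)) + m.group(3);
            // quoteReplacement keeps '$' and '\' in the content literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The servlet could call PreEscaper.escapeInsidePre(...) on the article body before forwarding, so that the JSP can keep printing ${article.body} unchanged.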

Related

How to parse a file containing html using JSOUP?

I have files containing HTML and I am trying to parse that file and then tokenise the text of the body.
I achieve this through:
Document docs = Jsoup.parse(new File("myFile"), "UTF-8", "");
System.out.println(docs.body().text());
The above code works fine, but the problem is that text which appears outside of any HTML tag is also printed as part of the body.
I need to find a way to stop this text outside of HTML tags from being read.
Help this is a time sensitive question !
You can select and remove unwanted elements in your document.
doc.select("body > :matchText").remove();
The above statement will remove all text nodes that are direct children of the body element. The :matchText selector is rather new, so please make sure to use a reasonably recent version of jsoup (1.11.3 definitely works, but 1.10.2 does not).
More information on the selector syntax is available at https://jsoup.org/cookbook/extracting-data/selector-syntax

Java - Extract html information from string

All of the guides out there tell me on how to remove the HTML tags from the text to extract the text between them. What I am after is the extraction of the data that is within the HTML tags.
e.g.
If I have a string:
"<FONT SIZE="5">Hello World</FONT>"
I want to get the font size information to update other variables. How do I go about this?
I've used jsoup several times for this purpose. It's a lenient HTML parser. Beware trying to parse it as "standard" XML as XML-parsing is strict by nature and will fail if the page does not conform to XML markup specs (which few HTML pages do).
You go about this by using one of the available Java libraries for HTML parsing, like TagSoup.
You can use a library like jerichoHTML, which enables you to search for HTML tags as well as their attributes, or you can build a DOM on your own.
Take a look at this:
http://en.wikipedia.org/wiki/Java_API_for_XML_Processing
If you parse the HTML you should be able to extract the values from the DOM tree.
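To illustrate the parse-then-read-attributes approach without an external library, here is a minimal sketch using the HTML parser that ships with the JDK (javax.swing.text.html). The class and method names (FontSizeExtractor, fontSizes) are hypothetical; a dedicated library such as jsoup would usually be more convenient for real pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the SIZE attribute of every <FONT> tag.
class FontSizeExtractor {
    static List<String> fontSizes(String html) {
        List<String> sizes = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Called for each opening tag; only FONT tags are of interest here.
                if (tag == HTML.Tag.FONT) {
                    Object size = attrs.getAttribute(HTML.Attribute.SIZE);
                    if (size != null) {
                        sizes.add(size.toString());
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sizes;
    }
}
```

Called with the string from the question, fontSizes should return a list containing the font size as a string, which can then be converted with Integer.parseInt if a number is needed.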

correcting parsed URLs in java

I am creating a HTML parser that gets the HTML from a given URL, finds the navigation menu html, and puts it into a String. The URLs in the HTML that are being copied into the String need part of the URL added (the "www.stackoverflow.com" part). How can I go about finding the existing URLs in the String and adding the missing part to it so that they work.
The URLs in the String are of the form:
<a href="/questions/11744851.cfm">
and I need to make them in the following form:
<a href="www.stackoverflow.com/questions/11744851.cfm">
Try using this regular expression with the replaceAll() method:
str = subString.replaceAll("<a href=\"(.*?)\">", "<a href=\"http://www.stackoverflow.com$1\">");
If the XHTML is valid XML, the easiest way is to parse it as XML and use XPath (for example /body/div/a/@href, where /body/div is the path to the menu section in the HTML).
There is also a project called HTMLParser (http://htmlparser.sourceforge.net/), you may want to give it a try (according to the page, it has 'link extraction, for crawling through web pages or harvesting email addresses'; but I've never used it, so I can't help much).
If, on the other hand, the HTML is anything but valid, you may want to use http://ccil.org/~cowan/XML/tagsoup/ - it might work, or it might not; on the websites we've tried, it did pretty well.
Edit: adding the missing part can be done with simple concatenation after finding the interesting parts.
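For the regex route, a slightly more defensive sketch is shown below. It only rewrites href values that start with a single "/" and leaves absolute and protocol-relative links alone. The class and method names (LinkFixer, absolutize) and the base URL passed in are assumptions for illustration; a real HTML parser remains the more robust choice for messy markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: prefixes root-relative href values with a base URL.
class LinkFixer {
    // Matches href="/..." but not href="//..." (protocol-relative) or absolute URLs.
    private static final Pattern ROOT_RELATIVE_HREF =
            Pattern.compile("href=\"(/(?!/)[^\"]*)\"");

    static String absolutize(String html, String base) {
        // quoteReplacement keeps any '$' or '\' in the base URL literal;
        // $1 is the original path, which already starts with '/'.
        String replacement = "href=\"" + Matcher.quoteReplacement(base) + "$1\"";
        return ROOT_RELATIVE_HREF.matcher(html).replaceAll(replacement);
    }
}
```

For example, LinkFixer.absolutize(menuHtml, "http://www.stackoverflow.com") would be applied to the extracted navigation string, where menuHtml stands for that string.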

How to check if the content is plain text or not?

I have a plain text area where I accept only plain text from users. I want to make sure that users do not put any markup in the text area. I also assume that users can post in different languages. So, what is the best approach to validate the content both at the server side (using java) and at the client side (using jquery).
Any help in this regard would be appreciated.
Update: I am sorry if the question wasn't clear enough. To make it simple, this is what I want to do - I let users type text in the textarea (no rich text box here). For each double new line in the text area I want to show a paragraph in the HTML page. How do I do that correctly?
It makes little sense to validate user input on HTML content. You can just escape HTML when redisplaying this user input on the webpage. Since you mentioned that you're using Java on the server side and thus you're likely using JSP as view technology, it's good to know that you can use the JSTL <c:out> tag and fn:escapeXml() function to escape HTML before printing to output.
E.g. when redisplaying user-controlled input:
<c:out value="${somebean.sometext}" />
or when redisplaying user-submitted request parameter:
<input type="text" name="foo" value="${fn:escapeXml(param.foo)}" />
This way, for example, <script>alert('xss')</script> will be printed to the HTML output as &lt;script&gt;alert('xss')&lt;/script&gt; and thus be displayed literally, exactly as the end user typed it.
If you really insist on validating this, you could grab an HTML parser like Jsoup for this.
String text = request.getParameter("text");
if (!text.equals(Jsoup.parse(text).text())) {
// There was HTML in the text.
}
Update: as per the comments, you actually want to sanitize the input against an HTML whitelist to remove potentially malicious tags. You can do this with Jsoup as well, see also this page.
String sanitized = Jsoup.clean(text, Whitelist.basic());
The allowed elements of Whitelist#basic() are specified in the API documentation.
If it's HTML markup you want to prevent, you could use a regular expression to throw an error if it sees a chevron (<).
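For the paragraph requirement in the update, one possible server-side approach is to escape the text first and only then wrap each block separated by blank lines in a <p> element. The following is a minimal sketch under that assumption; the class and method names (PlainTextRenderer, toParagraphs) are hypothetical and not part of JSTL or Jsoup.

```java
// Hypothetical helper: renders user-entered plain text as escaped HTML paragraphs.
class PlainTextRenderer {
    // Replaces the characters that are significant in HTML with entities.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Splits on two or more consecutive line breaks and wraps each part in <p>.
    static String toParagraphs(String text) {
        StringBuilder out = new StringBuilder();
        for (String paragraph : text.trim().split("(?:\\r?\\n){2,}")) {
            if (!paragraph.trim().isEmpty()) {
                // Escaping happens before the <p> tags are added,
                // so any markup typed by the user stays inert.
                out.append("<p>").append(escapeHtml(paragraph)).append("</p>\n");
            }
        }
        return out.toString();
    }
}
```

Because the result already contains trusted <p> tags around escaped text, it would be printed in the JSP without a second round of escaping.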

Problem with HTML output in JSP

I have code which generates some HTML, but when I try to output this content in a JSP, all '<' are replaced with '&lt;' and all '>' with '&gt;'. Here is the piece which renders the result:
<c:out value="${data}"/>
Can someone please explain what causes the character replacement, and how it can be avoided?
PS: I've tried escapeXml='false' property of <c:out/> tag, nothing at all is displayed.
Thanks,
Natasha
What do you want the generated HTML to do? Do you want it to be markup used by the browser, or do you want it shown on the final page, for example as sample code visible to the user? With escapeXml='false' it will be output to the browser as HTML and interpreted along with all the other markup. With escapeXml='true' it will be turned into escaped markup, which is rendered visibly to the end user. So it all depends on what you're trying to do.
If you want to have < and > characters visible to the end user, then they have to appear as &lt; and &gt; in the markup.
Nick
The out tag behaves as it should - if you do wish to output the literal value, you can directly insert the EL expression ${data} in your text. That is, instead of
Some content <c:out value="${data}"/> some more content
you would use
Some content ${data} some more content
in your JSP.
Before you get angry with c:out, please consider that the output often is something a user has put in there - potentially with some unwanted code. Imagine using ${data} on StackOverflow instead of the (C# version of) c:out :-)
There is an escapeXml attribute on the c:out tag that is set to true by default; set it to false and no escaping will take place, i.e. your HTML will be output as is in the browser.
<c:out value="${data}"/> escapes HTML characters while ${data} does not. Take a look at this related question.
