Escape special characters of an HTML string in Java

I have HTML content as a string.
String attachment = "<div style=\"color:black;font-style:normal;font-size:10pt;font-family:verdana;\"><div><span style=\"background-color: rgb(255,255,255);\">This is special \"'; </span></div></div>";
If I try to add this as multipart form data, I get an exception. The cause turns out to be the special characters inside the HTML, namely " and '. So I tried escaping the entire string using
org.apache.commons.lang.StringEscapeUtils.escapeJava(attachment);
After doing this, the exception disappeared and it worked. But the double quotes used for attributes such as style are also escaped by this method, which is not desired.
Instead of <div style="color:black;
it was sent as <div style=\"color:black;
So I realized that I need to escape only the text inside the HTML content, not the entire string. I could extract the text content using jsoup or something similar and then build the HTML again.
But is there a generic easy solution to do this?

Related

Replace placeholder of HTML file in JAVA

I am using jsoup library to parse HTML file in Java.
I want to replace placeholders of that HTML file.
Presently I am putting the placeholders in <span id = "id_1"> xx </span> and replacing them.
I have tried to do many other things, but didn't achieve success.
Document doc = Jsoup.parse(new File("abc.html"), UTF_8);
doc.getElementById("id_1").text("MUKUL");
Placeholders in my HTML file look like <%= name %>. I want to replace all placeholders with suitable values. For now I have changed my HTML file to put the placeholders in a <span id = "id_1"> xx </span> tag, but I don't want to change my HTML template.
Can anyone please suggest a cleaner and better way to achieve this?
Why I am not converting it to a String: the HTML file contains some Japanese characters, and whenever I convert it to a String, some of the Japanese characters get distorted and produce junk data.
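A possible approach, sketched under the assumption that the distorted Japanese characters come from reading the file with the platform default charset rather than with the file's real encoding (UTF-8 is assumed here). The class name TemplateFiller and its methods are hypothetical names made up for illustration, and only the JDK is used. The <%= name %> placeholders are replaced with a regular expression, so the template itself stays unchanged:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or any other library.
final class TemplateFiller {
    // Matches <%= name %> with optional whitespace around the name.
    private static final Pattern PLACEHOLDER = Pattern.compile("<%=\\s*(\\w+)\\s*%>");

    // Replaces every known placeholder; unknown placeholders are left untouched.
    // Values are inserted as-is, so escape them first if they are user-controlled.
    static String fill(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Reads the file with an explicit charset (assumed UTF-8) instead of the platform default.
    static String fillFile(Path file, Map<String, String> values) throws IOException {
        String template = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return fill(template, values);
    }
}
```

If the file is actually stored in another encoding such as Shift_JIS, the charset passed when decoding would have to match it, and the same encoding should be used when writing the result back out.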

Avoid removal of spaces and newline while parsing html using jsoup

I have a sample code as below.
String sample = "<html>
<head>
</head>
<body>
This is a sample on parsing html body using jsoup
This is a sample on parsing html body using jsoup
</body>
</html>";
Document doc = Jsoup.parse(sample);
String output = doc.body().text();
I get the output as
This is a sample on parsing html body using jsoup This is a sample on parsing html body using jsoup
But I want the output as
This is a sample on parsing html body using jsoup
This is a sample on parsing html body using jsoup
How do I parse it so that I get this output? Or is there another way to do so in Java?
You can disable the pretty printing of your document to get the output like you want it. But you also have to change the .text() to .html().
Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();
The HTML specification requires that multiple whitespace characters are collapsed into a single whitespace. Therefore, when parsing the sample, the parser correctly eliminates the superfluous whitespace characters.
I don't think you can change how the parser works. You could add a preprocessing step where you replace multiple whitespaces with non-breakable spaces (&nbsp;), which will not collapse. The side effect, though, would of course be that those would be, well, non-breakable (which doesn't matter if you really just want to use the rendered text, as in doc.body().text()).

How to handle double escaping in chrome/firefox?

Here is my Java code in the JSP:
custUrl="customer.action?custId=211&custAddressId=2341";
Now javascript code :
function submit() {
window.location = "<c:out value='<%=custUrl%>' />";
// here is generated javascript code
// window.location = "customer.action?custId=211&amp;custAddressId=2341"
}
Firefox and Chrome (IE does not double escape) are escaping the already escaped value (that's why I am getting the second parameter name as amp;custAddressId instead of custAddressId).
Is there any generic solution for handling double escaping in Firefox/Chrome?
UPDATE:
So the bottom line is that I want to escape the intended characters with c:out (which is happening), but I also want to avoid the double escaping when sending the data to the server, which happens in some browsers.
By default special characters are escaped by <c:out>. Turn escaping off as
<c:out value='<%=custUrl%>' escapeXml='false' />
Ampersand & is escaped as &amp; in XML. Here amp is short for ampersand.
This isn't a Firefox/Chrome issue because final HTML generated is the same irrespective of which browser you use to access your site. IE's HTML source viewer must have chosen to display the ampersand in its unescaped form.
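To make the effect of the default escaping visible, here is a small JDK-only sketch. XmlEscapeDemo and escapeXml are hypothetical, hand-written stand-ins for illustration, not the JSTL implementation. It shows the URL after one round of XML escaping, which is the form that ends up inside the script block:

```java
// Hypothetical stand-in for the kind of escaping <c:out> applies by default.
final class XmlEscapeDemo {
    // Escapes the five characters that are special in XML.
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&#034;"); break;
                case '\'': out.append("&#039;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String custUrl = "customer.action?custId=211&custAddressId=2341";
        // prints customer.action?custId=211&amp;custAddressId=2341
        System.out.println(escapeXml(custUrl));
    }
}
```

Since the contents of a script element are not entity-decoded by the browser, the JavaScript string keeps the literal &amp; and the server then sees a parameter named amp;custAddressId.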

Jsoup Whitelist: Parsing non-english character

I am trying to clean HTML text and extract plain text from it using Jsoup. The HTML might contain non-English characters.
For example the HTML text is:
String html = "<p>Á <a href='http://example.com/'><b>example</b></a> link.</p>";
Now if I use Jsoup#parse(String html):
String text = Jsoup.parse(html).text();
It is printing:
Á example link.
And if I clean the text using Jsoup#clean(String bodyHtml, Whitelist whitelist):
String text = Jsoup.clean(html, Whitelist.none());
It is printing:
&Aacute; example link.
My question is, how can I get the text
Á example link.
using Whitelist and the clean() method? I want to use Whitelist since I might need to use Whitelist#addTags(String... tags).
Any information will be very helpful to me.
Thanks.
Not possible in the current version (1.6.1). jsoup prints Á as &Aacute; because of its entity escaping feature, and there is no "don't escape" mode at the moment (check Entities.EscapeMode).
You can either 1. unescape these HTML entities afterwards, or 2. extend jsoup's source code by adding a new escape mode with an empty map.
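The unescaping option could look roughly like the sketch below, which uses only the JDK. EntityUnescaper is a hypothetical helper written for illustration. It handles numeric character references plus a deliberately tiny table of named entities, so a real application would need a complete entity table or a library routine instead:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the named-entity table is intentionally incomplete.
final class EntityUnescaper {
    private static final Pattern ENTITY =
            Pattern.compile("&(#x[0-9a-fA-F]{1,6}|#[0-9]{1,7}|[A-Za-z][A-Za-z0-9]*);");
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("Aacute", "\u00C1");
    }

    // Replaces each recognized entity in a single pass; anything unknown is kept as-is.
    static String unescape(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement = m.group();
            if (body.startsWith("#")) {
                boolean hex = body.startsWith("#x");
                int codePoint = Integer.parseInt(body.substring(hex ? 2 : 1), hex ? 16 : 10);
                if (Character.isValidCodePoint(codePoint)) {
                    replacement = new String(Character.toChars(codePoint));
                }
            } else if (NAMED.containsKey(body)) {
                replacement = NAMED.get(body);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

It would be applied to the string returned by Jsoup.clean(html, Whitelist.none()). Note that unescaping is only safe when the result is treated as plain text, not written back into an HTML page.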

Newline not preserved when reading from Textarea

My java webapp fetches content from a textarea, and e-mails the same.
The problem I'm facing is that the newline character in the textarea message is not preserved when reading the same using
request.getParameter("message");
Any clues how it can be tackled?
TIA.
EDIT:
The content in the textarea is:
abcd
abcd
CODE:
String message = request.getParameter("message");
System.out.println("index loc for message "+message+" using \\r\\n : "+message.indexOf("\r\n"));
System.out.println("index loc for message "+message+" using \\n : "+message.indexOf("\n"));
System.out.println("index loc for message "+message+" using \\r : "+message.indexOf("\r"));
System.out.println("index loc for message "+message+" using \\n\\r : "+message.indexOf("\n\r"));
OUTPUT:
index loc for message asdfasdf using \r\n : -1
index loc for message asdfasdf using \n : -1
index loc for message asdfasdf using \r : -1
index loc for message asdfasdf using \n\r : -1
That completely depends on how you're redisplaying it.
It sounds like you're redisplaying it in HTML. "Raw" newlines are not part of HTML markup. Right-click the page and choose View Page Source in the web browser. You'll see line breaks all over the place, usually before and/or after HTML tags.
In order to visually present linebreaks in the HTML presentation, you should actually be using <br> tags. You can replace newlines by <br> strings as below:
message = message.replace("\n", "<br>");
This opens an XSS hole only if the message is a user-controlled variable, because you have to present it unescaped in JSP (i.e. without <c:out>) in order to get <br> to work. You thus need to make sure that the message variable is sanitized beforehand.
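One way to do that sanitizing, as a minimal JDK-only sketch: escape the HTML special characters first and only then insert the <br> tags, so that the <br> strings are the only markup left in the result. NewlineToBr and its methods are hypothetical names chosen for illustration:

```java
// Hypothetical helper: escape first, then turn line breaks into <br>.
final class NewlineToBr {
    // Escapes the HTML special characters; '&' must be handled first.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }

    // Handles \r\n, lone \r and lone \n, each becoming a single <br>.
    static String toHtmlWithBreaks(String message) {
        return escapeHtml(message).replaceAll("\\r\\n|\\r|\\n", "<br>");
    }

    public static void main(String[] args) {
        // prints abcd<br>ab&lt;script&gt;
        System.out.println(toHtmlWithBreaks("abcd\r\nab<script>"));
    }
}
```

The returned string can then be printed without <c:out>, since the user-supplied parts are already escaped.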
Alternatively, you can set the CSS white-space property to pre on the element where you're redisplaying the message. If you'd like lines to wrap inside the context of a block element, then set pre-wrap. Or if you'd like to collapse spaces and tabs as well, then set pre-line.
<div id="message"><c:out value="${message}" /></div>
#message {
white-space: pre-line;
}
This will display the text preformatted (as a textarea by default does).
Two possible problems:
The text in the textarea is word wrapped and doesn't really have any newlines.
The String you get with getParameter() contains newlines (\n) but no carriage returns (\r) as expected by many email programs.
As a first step, I'd try dumping the retrieved String in a way you can check for this. You could write to a file and use od or a hex editor to look at the file, for example.
If it turns out you're simply missing CRs, you could do some simple regexp-based replacement on the string to fix that.
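Both steps can be sketched with the JDK alone. CrlfNormalizer, showControlChars and toCrlf are hypothetical names used for illustration. The first method makes CR and LF visible so you can see what getParameter() really returned, and the second normalizes every kind of line break to \r\n:

```java
// Hypothetical helper for inspecting and normalizing line breaks.
final class CrlfNormalizer {
    // Makes CR and LF visible, e.g. for logging the raw parameter value.
    static String showControlChars(String text) {
        return text.replace("\r", "\\r").replace("\n", "\\n");
    }

    // Converts \r\n, lone \r and lone \n to \r\n without doubling existing CRs.
    static String toCrlf(String text) {
        return text.replaceAll("\\r\\n|\\r|\\n", "\r\n");
    }

    public static void main(String[] args) {
        String message = "abcd\nabcd";
        // prints abcd\nabcd
        System.out.println(showControlChars(message));
        // prints abcd\r\nabcd
        System.out.println(showControlChars(toCrlf(message)));
    }
}
```

Listing \r\n first in the alternation matters: it keeps an existing CRLF pair from being rewritten as two separate line breaks.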
Looking at the ASCII codes, I found that the newline here is not the usual \n but \r\n.
Regards.
You need to encodeURIComponent() before submitting the form and decodeURIComponent() on the server side.
