Avoid removal of spaces and newline while parsing html using jsoup - java

I have a sample code as below.
String sample = "<html>
<head>
</head>
<body>
This is a sample on parsing html body using jsoup
This is a sample on parsing html body using jsoup
</body>
</html>";
Document doc = Jsoup.parse(sample);
String output = doc.body().text();
I get the output as
This is a sample on parsing html body using jsoup This is a sample on `parsing html body using jsoup`
But I want the output as
This is a sample on parsing html body using jsoup
This is a sample on parsing html body using jsoup
How do parse it so that I get this output? Or is there another way to do so in Java?

You can disable the pretty printing of your document to get the output like you want it. But you also have to change the .text() to .html().
Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();

The HTML specification requires that multiple whitespace characters are collapsed into a single whitespace. Therefore, when parsing the sample, the parser correctly eliminates the superfluous whitespace characters.
I don't think you can change how the parser works. You could add a preprocessing step where you replace multiple whitespaces with non-breakable spaces ( ), which will not collapse. The side effect, though, would of course be that those would be, well, non-breakable (which doesn't matter if you really just want to use the rendered text, as in doc.body().text()).

Related

Escape special characters of html string in java

I have a html content as a string.
String attachment = "<div style=\"color:black;font-style:normal;font-size:10pt;font-family:verdana;\"><div><span style=\"background-color: rgb(255,255,255);\">This is special "'; </span></div></div>";
If I try to add this as a multipart form data I get an exception. The reason happens to be the special characters inside the html which is " and '. So I tried escaping the entire string using
org.apache.commons.lang.StringEscapeUtils.escapeJave(attachment);
After doing this the exception disappeared and it was working fine. But the double quotes used for the attributes, like style are also escaped using this method, which is not desired.
Instead of <div> style="color:black;
it was sent as <div> style=\"color:black;
So far I realized that I need to escape only the text inside the html content and not the entire text. i could extract the text content using jsoup or something else then form the html again.
But is there a generic easy solution to do this?

Remove html which have class x from string in java

Is there a good way to remove HTML from a Java string which have class "abc"? A simple regex like -
replaceAll("\\<.*?>","")
will remove all but i want to remove only those tag whose having class "abc".
<H1 class="abc">Hey</H1>
<H1 class="xyz">Hello</H1>
Remove h1 with class abc only.
Note -> have to ddo it through regex not through parser because this is the only instance where i am modifying HTML in my code. Don't want additional JAR in my code.
This should Work
replaceAll("<h1[^>]*?class=\"*\'*abc\"*\'*>.*?h1>","")
Try
replaceAll("<[Hh]1 class=['\"]landingPage['\"]>.*?</[Hh]1>", "")
But note that since regex is not well-suited for this task, there might be unwanted results when it comes to complex HTML input.
For the input
<H1 class="abc">Hey</H1>
<H1 class="xyz">Hello</H1>
the output is
<H1 class="xyz">Hello</H1>
It's never a good idea to parse HTML using regex, see RegEx match open tags except XHTML self-contained tags
See Which HTML Parser is the best? for alternatives.
For example, using JSoup you could write something like this (untested):
Document doc = Jsoup.parse(html);
Elements elements = doc.select(".abc");
elements.remove();

Parsing HTML and get all the nodes

I need to parse an HTML file in java. Unlike XML there is no repetitive tags. So I need a code that can parse the html file and reach all nodes, it includes nested tags .. etc. The HTML code is not fixed. In other words given any HTML code I need to reach all the tags in the HTML.
try this HTML Parser
http://htmlparser.sourceforge.net/samples.html
I think you need this...
var els=document.getElementsByTagName("*");
for(var i=0;i<els.length;i+)document.write(els.nodeName+"<br />");

Using jsoup to escape disallowed tags

I am evaluating jsoup for the functionality which would sanitize (but not remove!) the non-whitelisted tags. Let's say only <b> tag is allowed, so the following input
foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>
has to yield the following:
foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>
I see the following problems/questions with jsoup:
document.getAllElements() always assumes <html>, <head> and <body>. Yes, I can call document.body().getAllElements() but the point is that I don't know if my source is a full HTML document or just the body -- and I want the result in the same shape and form as it came in;
how do I replace <script>...</script> with <script>...</script>? I only want to replace brackets with escaped entities and do not want to alter any attributes, etc. Node.replaceWith sounds like an overkill for this.
Is it possible to completely switch off pretty printing (e.g. insertion of new lines, etc.)?
Or maybe I should use another framework? I have peeked at htmlcleaner so far, but the given examples don't suggest my desired functionality is supported.
Answer 1
How do you load / parse your Document with Jsoup? If you use parse() or connect().get() jsoup will automaticly format your html (inserting html, body and head tags). This this ensures you always have a complete Html document - even if input isnt complete.
Let's assume you only want to clean an input (no furhter processing) you should use clean() instead the previous listed methods.
Example 1 - Using parse()
final String html = "<b>a</b>";
System.out.println(Jsoup.parse(html));
Output:
<html>
<head></head>
<body>
<b>a</b>
</body>
</html>
Input html is completed to ensure you have a complete document.
Example 2 - Using clean()
final String html = "<b>a</b>";
System.out.println(Jsoup.clean("<b>a</b>", Whitelist.relaxed()));
Output:
<b>a</b>
Input html is cleaned, not more.
Documentation:
Jsoup
Answer 2
The method replaceWith() does exactly what you need:
Example:
final String html = "<b><script>your script here</script></b>";
Document doc = Jsoup.parse(html);
for( Element element : doc.select("script") )
{
element.replaceWith(TextNode.createFromEncoded(element.toString(), null));
}
System.out.println(doc);
Output:
<html>
<head></head>
<body>
<b><script>your script here</script></b>
</body>
</html>
Or body only:
System.out.println(doc.body().html());
Output:
<b><script>your script here</script></b>
Documentation:
Node.replaceWith(Node in)
TextNode
Answer 3
Yes, prettyPrint() method of Jsoup.OutputSettings does this.
Example:
final String html = "<p>your html here</p>";
Document doc = Jsoup.parse(html);
doc.outputSettings().prettyPrint(false);
System.out.println(doc);
Note: if the outputSettings() method is not available, please update Jsoup.
Output:
<html><head></head><body><p>your html here</p></body></html>
Documentation:
Document.OutputSettings.prettyPrint(boolean pretty)
Answer 4 (no bullet)
No! Jsoup is one of the best and most capable Html library out there!

Extract text between html tags parsed from xml

Can anyone help me in extracting text from within the html tags to plain text?
I have parsed an xml and get some output as body which has html tags now i want to remove the tags and use the text.
thanks in advance!!!!
You can use HTML Parser like JSoup
For example
HTML is
<div style="height:240px;"><br>test: example<br>test1:example1</div>
You can get the html using
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
div.html();
Try a HTML Parser.
If the HTML is escaped, i.e. < instead of < you might have to decode first.
Considering your requirements you might try Jericho HTML Parser
Take a look at TextExtractor class:
Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>"
produces the text "One Two Three".
If all you want to do is remove HTML tags from a string, you can do this:
String output = input.replaceAll("(?s)\\<.*?\\>", " ");

Categories