Formatting issue with HTML while using JSoup for Java - java

I'm trying to scrape "text" off of a website with JSoup. I can get the text cleanly (with no formatting at all, just the text), or with all the formatting still attached (i.e. < br > along with < p > and < /p >).
However, I can't seem to get the formatted version to include < br/ > to any extent, and that's the only thing that was specifically requested to go along with the text.
For example, I can get this:
<p><br>Worldwide database</p>
and this:
Worldwide database
but I can't get this, which is my desired result:
Worldwide database<br/>
I don't see any < br / >'s while looking at the HTML code via the FireBug plugin on Firefox so I'm wondering if that might be the issue? Or maybe there's an issue with the method's I'm using in my code to pull the text?
Anyways, here's my code:
Elements descriptionHTML = doc.select("div[jsname]"); // <-- Get access to the text w/ JSoup
String descText = descriptionHTML.text(); // <-- Get the code w/o any formating at all
// This prints out the desired text with the <p><br> and </p>, but no <br/>
for (Element link : descriptionHTML)
{
String jsname = link.attr("jsname");
if( jsname.equals("C4s9Ed")){
System.out.println(link);
break;
}
}
I'd really apprecaite any help with this issue.
Thanks,
Jack

HTML does not define a closing tag for <br> elements. XHTML however requires that the tag is marked as empty: <br />. JSoup parses both, but will print out only normal HTML (<br>).
If you use the XML parser in Jsoup, the <br> tags are not closed and so Jsoup tries to guess where to place matching closing tags </br> which are neither HTML nor XHTML compliant.
If you want to keep the line break info and strip out all other tags, I think you need to program that part outside of Jsoup. You could for example replace all <br> and <br /> strings with a uniqe other string, say "_brSplitPos_", then parse the document with JSoup, print out the text only and replace the "_brSplitPos_" against <br />:
String html = "<div>This<br>is<br />a<br>test</div>";
html = html.replaceAll("<br(?:\\s+/)?>", "_brSplitPos_");
Document docH = Jsoup.parse(html);
String onlyText = docH.text();
onlyText = onlyText.replace("_brSplitPos_", "<br />");
System.out.println(onlyText);

Related

Avoid removal of spaces and newline while parsing html using jsoup

I have a sample code as below.
String sample = "<html>
<head>
</head>
<body>
This is a sample on parsing html body using jsoup
This is a sample on parsing html body using jsoup
</body>
</html>";
Document doc = Jsoup.parse(sample);
String output = doc.body().text();
I get the output as
This is a sample on parsing html body using jsoup This is a sample on `parsing html body using jsoup`
But I want the output as
This is a sample on parsing html body using jsoup
This is a sample on parsing html body using jsoup
How do parse it so that I get this output? Or is there another way to do so in Java?
You can disable the pretty printing of your document to get the output like you want it. But you also have to change the .text() to .html().
Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();
The HTML specification requires that multiple whitespace characters are collapsed into a single whitespace. Therefore, when parsing the sample, the parser correctly eliminates the superfluous whitespace characters.
I don't think you can change how the parser works. You could add a preprocessing step where you replace multiple whitespaces with non-breakable spaces ( ), which will not collapse. The side effect, though, would of course be that those would be, well, non-breakable (which doesn't matter if you really just want to use the rendered text, as in doc.body().text()).

How to validate that at least one element in a html string has content?

I have a wysiwyg editor that I can't modify that sometimes returns <p></p> which obviously looks like an empty field to the person using the wysiwyg.
So I need to add some validation on my back-end which uses java.
should be rejected
<p></p>
<p> </p>
<div><p> </p></div>
should be accepted
<p>a</p>
<div><p>a</p></div>
<p> </p>
<div><p>a</p></div>
basically as long as any element contains some content we will accept it and save it.
I am looking for libraries that I should look at and ideas for how to approach it. Thanks.
You may look on jsoup library. It's pretty fast
It takes HTML and you may return text from it (see example from their website below).
Extract attributes, text, and HTML from elements
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link"
I would advise you to do it on the client side. The reason is because it is natural for the browser to do this. You need to hook your wysiwyg editor in the send or "save" part, a lot of them have this ability.
Javascript would be
function stripIfEmpty(html)
{
var tmp = document.createElement("DIV");
tmp.innerHTML = html;
var contentText = tmp.textContent || tmp.innerText || "";
if(contentText.trim().length === 0){
return "";
}else{
return html;
}
}
In the case if you need backend javascript, then the only correct solution would be to use some library that parse HTML, like jsoup - #Dmytro Pastovenskyi show you that.
If you want to use backend but allow it to be fuzzy, not strict, then you can use regex like replaceAll("\\<[^>]*>","") then trim, then check if the string is empty.
You can use regular expressions (built-in to Java).
For example,
"<p>\\s*\\w+\\s*</p>"
would match a <p> tag with at least 1 character of content.

Jsoup select text WITH including html tags

I use Jsoup to select some code between <td></td> tags. It looks like this:
Document doc = Jsoup.parse(response, "UTF-8");
Element elMotD = doc.select("td.info").first();
String motdText = elMotD.text();
My problem now is that jsoup selects the text like I want but it simply sorts out tags like <br> which are important for my displaying in Android TextView later.
How can I do this that Jsoup doesn't miss the tags in between this text?
See here: http://jsoup.org/cookbook/extracting-data/attributes-text-html
Use the Element.html() method to get to the html including its inner html tags. You can also use Node.outerHtml() to the the html including the outer tags.
In your case:
Document doc = Jsoup.parse(response, "UTF-8");
Element elMotD = doc.select("td.info").first();
String motdHtml = elMotD.html();

How to parse html from javascript variables with Jsoup in Java?

I'm using Jsoup to parse html file and pull all the visible text from elements. The problem is that there are some html bits in javascript variables which are obviously ignored. What would be the best solution to get those bits out?
Example:
<!DOCTYPE html>
<html>
<head>
<script>
var html = "<span>some text</span>";
</script>
</head>
<body>
<p>text</p>
</body>
</html>
In this example Jsoup only picks up the text from p tag which is what it's supposed to do. How do I pick up the text from var html span? The solution must be applied to thousands of different pages, so I can't rely on something like javascript variable having the same name.
You can use Jsoup to parse all the <script>-tags into DataNode-objects.
DataNode
A data node, for contents of style, script tags etc, where contents should not show in text().
Elements scriptTags = doc.getElementsByTag("script");
This will give you all the Elements of tag <script>.
You can then use the getWholeData()-method to extract the node.
// Get the data contents of this node.
String getWholeData()
for (Element tag : scriptTags){
for (DataNode node : tag.dataNodes()) {
System.out.println(node.getWholeData());
}
}
Jsoup API - DataNode
I am not so sure about the answer, but I saw a similar situation before here.
You probably can use Jsoup and manual parsing to get the text according to that answer.
I just modify that piece of code for your specific case:
Document doc = ...
Element script = doc.select("script").first(); // Get the script part
Pattern p = Pattern.compile("(?is)html = \"(.+?)\""); // Regex for the value of the html
Matcher m = p.matcher(script.html()); // you have to use html here and NOT text! Text will drop the 'html' part
while( m.find() )
{
System.out.println(m.group()); // the whole html text
System.out.println(m.group(1)); // value only
}
Hope it will be helpful.

Using jsoup to escape disallowed tags

I am evaluating jsoup for the functionality which would sanitize (but not remove!) the non-whitelisted tags. Let's say only <b> tag is allowed, so the following input
foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>
has to yield the following:
foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>
I see the following problems/questions with jsoup:
document.getAllElements() always assumes <html>, <head> and <body>. Yes, I can call document.body().getAllElements() but the point is that I don't know if my source is a full HTML document or just the body -- and I want the result in the same shape and form as it came in;
how do I replace <script>...</script> with <script>...</script>? I only want to replace brackets with escaped entities and do not want to alter any attributes, etc. Node.replaceWith sounds like an overkill for this.
Is it possible to completely switch off pretty printing (e.g. insertion of new lines, etc.)?
Or maybe I should use another framework? I have peeked at htmlcleaner so far, but the given examples don't suggest my desired functionality is supported.
Answer 1
How do you load / parse your Document with Jsoup? If you use parse() or connect().get() jsoup will automaticly format your html (inserting html, body and head tags). This this ensures you always have a complete Html document - even if input isnt complete.
Let's assume you only want to clean an input (no furhter processing) you should use clean() instead the previous listed methods.
Example 1 - Using parse()
final String html = "<b>a</b>";
System.out.println(Jsoup.parse(html));
Output:
<html>
<head></head>
<body>
<b>a</b>
</body>
</html>
Input html is completed to ensure you have a complete document.
Example 2 - Using clean()
final String html = "<b>a</b>";
System.out.println(Jsoup.clean("<b>a</b>", Whitelist.relaxed()));
Output:
<b>a</b>
Input html is cleaned, not more.
Documentation:
Jsoup
Answer 2
The method replaceWith() does exactly what you need:
Example:
final String html = "<b><script>your script here</script></b>";
Document doc = Jsoup.parse(html);
for( Element element : doc.select("script") )
{
element.replaceWith(TextNode.createFromEncoded(element.toString(), null));
}
System.out.println(doc);
Output:
<html>
<head></head>
<body>
<b><script>your script here</script></b>
</body>
</html>
Or body only:
System.out.println(doc.body().html());
Output:
<b><script>your script here</script></b>
Documentation:
Node.replaceWith(Node in)
TextNode
Answer 3
Yes, prettyPrint() method of Jsoup.OutputSettings does this.
Example:
final String html = "<p>your html here</p>";
Document doc = Jsoup.parse(html);
doc.outputSettings().prettyPrint(false);
System.out.println(doc);
Note: if the outputSettings() method is not available, please update Jsoup.
Output:
<html><head></head><body><p>your html here</p></body></html>
Documentation:
Document.OutputSettings.prettyPrint(boolean pretty)
Answer 4 (no bullet)
No! Jsoup is one of the best and most capable Html library out there!

Categories