Parsing HTML and breaking lines in HTML text - java

I have plenty of Java code adding HTML fragments on the server side. The complexity of the HTML varies, but it always contains some text that must be broken according to a specified line length.
So the argument is a whole HTML fragment:
<div class="container">
<div id="header">
<br class="cbt">
<div id="hlogo">
<a href="/" >
Stack Overflow
</a>
I must for example break Stack Overflow to
Stack
Overflow
because it exceeds the line length limit, which would be 9 characters.
How could I do that? Maybe there is some library that would parse this HTML fragment into a document object so that I could then break these lines, but what if the text is mixed with HTML?

Yes, you can parse the whole String of HTML content with the jsoup library. It turns the HTML into a tree of node objects; you can then iterate over those objects, look for text longer than 9 characters, and break it by inserting, for example, a <br>.
Example:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
The Document object consists of Elements and TextNodes; the TextNode is exactly what you are looking for.
You can find some excellent examples at http://jsoup.org/cookbook/introduction/parsing-a-document
Hope it helps.
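The breaking step itself does not depend on the parser. Below is a minimal sketch of splitting one piece of text at a length limit, assuming breaks are only allowed at whitespace and that a single word longer than the limit is left intact. LineBreaker and wrap are hypothetical names, not part of jsoup, and applying the result to each TextNode is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup: splits plain text into lines of at most maxLen characters.
class LineBreaker {

    // Breaks at whitespace; a single word longer than maxLen is kept whole on its own line.
    static List<String> wrap(String text, int maxLen) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only happens for blank input
            }
            // Start a new line if appending " word" would exceed the limit.
            if (current.length() > 0 && current.length() + 1 + word.length() > maxLen) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // "Stack Overflow" is 14 characters, so with a limit of 9 it splits at the space.
        System.out.println(wrap("Stack Overflow", 9)); // [Stack, Overflow]
        // Joining with <br> gives markup that could stand in for the original text.
        System.out.println(String.join("<br>", wrap("Stack Overflow", 9)));
    }
}
```

Because only text is passed in, tags mixed into the surrounding HTML are never cut in half; the parser is what separates text from markup.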

Related

Parsing html body outer text only

I used JSoup to parse HTML.
How can I get only the body text?
I mean I want only the outer text, without including the text of other tags.
(Music causes us to think eloquently.)
<html>
<body>
<p class=\"mm3h\">ဂီတကဆွဲဆောင်အားကောင်းတဲ့ကျွန်တော်တို့ကိုဖြစ်စေတယ်လို့ထင်တယ်။</p>
Music causes us to think eloquently.
<a class=\"\" href=\"\" aria-label=\"--Ralph Waldo Emerson (1 item)\">--Ralph Waldo Emerson</a>
</body>
</html>
I know the question is already answered and the answer is marked as the accepted answer, but I think there is another way to get what was asked:
JSoup offers the ownText() method. With it, you get the text of all text nodes that are direct children of an element. Child elements and their text nodes are not returned.
Document doc = Jsoup.parse("<body> text <p> not included </p> included </body>");
Element body = doc.body();
String ownText = body.ownText();
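Alternatively, you can pick a single text node of the body by index (which index holds the wanted text depends on your content):
Document doc = Jsoup.parse("<body> your content </body>");
String body = doc.body().textNodes().get(1).text();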

Why doesn't Jsoup.parse() return the full HTML document?

I am tinkering around with Jsoup and I wonder why Jsoup.parse(url) returns only part of the HTML. Here is my code:
System.out.println(Jsoup.parse("https://www.example.com"));
Here is the output
<html>
<head></head>
<body>
https://www.example.com
</body>
</html>
It is partial at best, and if you go to www.example.com you would see that the parser missed two <p> tags.
Now the docs say this
public static Document parse(String html)
Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.
Parameters: html - HTML to parse
Returns: sane HTML
It says it brings back a document, but in practice it's a partial one. Also, what is "sane" HTML?

Java Scanner to find a tag, then delimiters to write what's in that tag to a file

I'm writing a program that's intended to search the HTML of a website, find a specific tag, then write the contents of that tag to a file. For example, the HTML could look like this:
<div class="something" specific-tag:"print this 1">some content</div>
<div class="something" not-the-right-tag:"don't print this">some content</div>
<div class="something" specific-tag:"print this 2">some content</div>
<div class="something" not-the-right-tag:"don't print this">some content</div>
<div class="something" specific-tag:"print this 3">some content</div>
The desired file output would look like this:
print this 1
print this 2
print this 3
I know how to use the Scanner class to find the specific tag, in this case "specific-tag", and I know how to write to a file using delimiters, the delimiter in this case being ". What I don't know is how to search for a tag, write everything between the delimiters after that tag to a file, then resume searching for the next tag and repeat until the end of the file.
Thoughts?
You really should use some kind of HTML parsing library. A quick Google search turned up http://jsoup.org/, which seems easy to use. Calling
Elements divs = doc.select("div[specific-tag]");
should yield the divs and then you can extract the specific-tag attribute.
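If adding a parser is not an option, the literal text from the question can also be handled with the standard library alone. This is a minimal sketch, assuming the marker always appears exactly as specific-tag:"..." as in the sample and that the value contains no double quote. TagValueExtractor and extract are hypothetical names, and this is a plain regular expression, not real HTML parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the quoted value that follows every occurrence of specific-tag:
class TagValueExtractor {

    // The lookbehind rejects longer names that merely end in "specific-tag".
    private static final Pattern MARKER =
            Pattern.compile("(?<![\\w-])specific-tag:\"([^\"]*)\"");

    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = MARKER.matcher(html);
        while (m.find()) {           // each call resumes after the previous match
            values.add(m.group(1));  // the text between the two double quotes
        }
        return values;
    }

    public static void main(String[] args) {
        String html =
                "<div class=\"something\" specific-tag:\"print this 1\">some content</div>\n"
              + "<div class=\"something\" not-the-right-tag:\"don't print this\">some content</div>\n"
              + "<div class=\"something\" specific-tag:\"print this 2\">some content</div>\n";
        // Each value could be written to a file instead, e.g. with java.nio.file.Files.write.
        extract(html).forEach(System.out::println);
    }
}
```

The find() loop is what provides the "resume searching and repeat until the end" behaviour the question asks about.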

Presence of HTML tags using Jsoup

With Jsoup it is easy to count how many times a particular tag appears in a text. For example, I am counting how many times the anchor tag is present in the given text.
String content = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(content);
Elements links = doc.select("a[href]"); // a with href
System.out.println(links.size());
This gives me a count of 4. If I have a sentence and I want to know if the sentence contains any html tags or not, is it possible with Jsoup? Thank you.
You are possibly better off with a regular expression, but if you really want to use JSoup, you can try to match all elements and then subtract 4, because JSoup automatically adds four elements: first the root element, and then an <html>, a <head> and a <body> element.
This might loosely look like:
// attempt to count html elements in string - incorrect code, see below
public static int countHtmlElements(String content) {
Document doc = Jsoup.parse(content);
Elements elements = doc.select("*");
return elements.size()-4;
}
However this gives a wrong result if the text contains a <html>, <head> or <body>; compare the results of:
// gives a correct count of 2 html elements
System.out.println(countHtmlElements("some <b>text</b> with <i>markup</i>"));
// incorrectly counts 0 elements, as the body is subtracted
System.out.println(countHtmlElements("<body>this gives a wrong result</body>"));
So to make this work, you would have to check for the "magic" tags separately; that is why I feel a regular expression might be simpler.
More failed attempts to make this work: using parseBodyFragment instead of parse does not help, as JSoup sanitizes the input in the same way. Likewise, counting with doc.select("body *"); saves you the trouble of subtracting 4, but it still yields the wrong count if a <body> is involved. It only works, under that limitation, in an application where you are sure that no <html>, <head> or <body> elements are present in the strings to be checked.
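For the yes/no question of whether a sentence contains any tag at all, the regular expression route mentioned above could look roughly like this. It is a heuristic sketch, not a parser: it assumes a tag is "<" or "</" followed by a letter and closed by ">", so it ignores comments and doctypes and can be fooled by text such as "a <b and c> d". HtmlTagDetector and containsHtmlTag are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical helper: reports whether a string contains something that looks like an HTML tag.
class HtmlTagDetector {

    // "<" or "</", a tag name starting with a letter, then anything up to the closing ">".
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^<>]*>");

    static boolean containsHtmlTag(String text) {
        return TAG.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsHtmlTag("some <b>text</b> with <i>markup</i>"));    // true
        System.out.println(containsHtmlTag("<body>this gives a wrong result</body>")); // true
        System.out.println(containsHtmlTag("1 < 2 and 3 > 2"));                        // false
    }
}
```

Since nothing is added to the input, <html>, <head> and <body> need no special handling here.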

Using jsoup to escape disallowed tags

I am evaluating jsoup for the functionality which would sanitize (but not remove!) the non-whitelisted tags. Let's say only <b> tag is allowed, so the following input
foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>
has to yield the following:
foo <b>bar</b> &lt;script onLoad='stealYourCookies();'&gt;baz&lt;/script&gt;
I see the following problems/questions with jsoup:
document.getAllElements() always assumes <html>, <head> and <body>. Yes, I can call document.body().getAllElements() but the point is that I don't know if my source is a full HTML document or just the body -- and I want the result in the same shape and form as it came in;
how do I replace <script>...</script> with &lt;script&gt;...&lt;/script&gt;? I only want to replace brackets with escaped entities and do not want to alter any attributes, etc. Node.replaceWith sounds like overkill for this.
Is it possible to completely switch off pretty printing (e.g. insertion of new lines, etc.)?
Or maybe I should use another framework? I have peeked at htmlcleaner so far, but the given examples don't suggest my desired functionality is supported.
Answer 1
How do you load / parse your Document with Jsoup? If you use parse() or connect().get(), jsoup will automatically format your HTML (inserting html, body and head tags). This ensures you always have a complete HTML document, even if the input isn't complete.
If you only want to clean an input (no further processing), you should use clean() instead of the methods listed above.
Example 1 - Using parse()
final String html = "<b>a</b>";
System.out.println(Jsoup.parse(html));
Output:
<html>
<head></head>
<body>
<b>a</b>
</body>
</html>
Input html is completed to ensure you have a complete document.
Example 2 - Using clean()
final String html = "<b>a</b>";
// Newer jsoup releases name this class Safelist instead of Whitelist.
System.out.println(Jsoup.clean(html, Whitelist.relaxed()));
Output:
<b>a</b>
The input HTML is cleaned, nothing more.
Documentation:
Jsoup
Answer 2
The method replaceWith() does exactly what you need:
Example:
final String html = "<b><script>your script here</script></b>";
Document doc = Jsoup.parse(html);
for( Element element : doc.select("script") )
{
element.replaceWith(TextNode.createFromEncoded(element.toString(), null));
}
System.out.println(doc);
Output:
<html>
<head></head>
<body>
<b>&lt;script&gt;your script here&lt;/script&gt;</b>
</body>
</html>
Or body only:
System.out.println(doc.body().html());
Output:
<b>&lt;script&gt;your script here&lt;/script&gt;</b>
Documentation:
Node.replaceWith(Node in)
TextNode
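For comparison, the escaping asked for in the question (the brackets of every tag except <b> replaced by &lt; and &gt;, nothing else touched) can be sketched without any library. This swaps in plain string replacement for the jsoup approach, and it assumes only the bare, attribute-free forms <b> and </b> are allowed. BracketEscaper and escapeAllExceptBold are hypothetical names, and this is not a vetted sanitizer.

```java
// Hypothetical helper, not a jsoup API: escapes every angle bracket, then restores bare <b> tags.
class BracketEscaper {

    static String escapeAllExceptBold(String html) {
        // Step 1: escape all brackets, so no markup survives.
        String escaped = html.replace("<", "&lt;").replace(">", "&gt;");
        // Step 2: restore only the exact attribute-free forms, so <b onclick=...> stays escaped.
        return escaped.replace("&lt;b&gt;", "<b>").replace("&lt;/b&gt;", "</b>");
    }

    public static void main(String[] args) {
        String input = "foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>";
        // Prints: foo <b>bar</b> &lt;script onLoad='stealYourCookies();'&gt;baz&lt;/script&gt;
        System.out.println(escapeAllExceptBold(input));
    }
}
```

The input keeps its shape because nothing is parsed or re-serialized, which also avoids the added <html>, <head> and <body> elements and any pretty printing. One caveat: text that already contains the literal characters &lt;b&gt; would be turned into a real <b> tag by step 2.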
Answer 3
Yes, the prettyPrint() method of Document.OutputSettings does this.
Example:
final String html = "<p>your html here</p>";
Document doc = Jsoup.parse(html);
doc.outputSettings().prettyPrint(false);
System.out.println(doc);
Note: if the outputSettings() method is not available, please update Jsoup.
Output:
<html><head></head><body><p>your html here</p></body></html>
Documentation:
Document.OutputSettings.prettyPrint(boolean pretty)
Answer 4 (no bullet)
No! Jsoup is one of the best and most capable HTML libraries out there!
