Parsing html body outer text only

I am using jsoup to parse HTML.
How can I get only the body's own text?
I mean only the outer text, without including the text of any other tags.
(For the HTML below, that is: Music causes us to think eloquently.)
<html>
<body>
<p class="mm3h">ဂီတကဆွဲဆောင်အားကောင်းတဲ့ကျွန်တော်တို့ကိုဖြစ်စေတယ်လို့ထင်တယ်။</p>
Music causes us to think eloquently.
<a class="" href="" aria-label="--Ralph Waldo Emerson (1 item)">--Ralph Waldo Emerson</a>
</body>
</html>

I know this question already has an accepted answer, but I think there is another way to get what was asked:
jsoup offers the ownText() method. It returns the text of all text nodes that are direct children of an element; child elements and their text nodes are not included.
Document doc = Jsoup.parse("<body> text <p> not included </p> included </body>");
Element body = doc.body();
String ownText = body.ownText(); // "text included": only the direct text nodes of <body>

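// get(1) assumes the wanted sentence is the second direct text node of <body>, as in the question's HTML;
// with a single text node, as in this placeholder string, get(1) would throw, so use get(0) there.
Document doc = Jsoup.parse("<body> your content </body>");
String body = doc.body().textNodes().get(1).text();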

Related

Parsing HTML and breaking lines in HTML text

I have plenty of Java code that adds HTML fragments on the server side. The complexity of the HTML can vary; however, it will contain some text that must be broken according to a specified line length.
So the argument is a whole HTML fragment:
<div class="container">
<div id="header">
<br class="cbt">
<div id="hlogo">
<a href="/" >
Stack Overflow
</a>
I must, for example, break Stack Overflow into
Stack
Overflow
because it exceeds the line length limit, which would be 9 characters.
How could I do that? Maybe there is some library that would parse this HTML fragment into a document object, and then I could break these lines, but what if the text is mixed with HTML?

Yes, you can parse the whole string of HTML content with the jsoup library. It turns all your HTML nodes into objects; you can then iterate over these objects, look for texts longer than 9 characters, and break them by inserting a <br>, for example.
Example:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
The Document object consists of Elements and TextNodes; the TextNode is exactly what you are looking for.
You can find some excellent examples at http://jsoup.org/cookbook/introduction/parsing-a-document
Hope it helps.
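
The breaking step itself is independent of the parser. As a minimal sketch in plain Java (the class name LineBreaker and the helper wrap are hypothetical, not part of jsoup), the text of each TextNode could be split greedily at whitespace so that no line exceeds the limit. It assumes that a single word longer than the limit is left whole on its own line:

```java
import java.util.ArrayList;
import java.util.List;

class LineBreaker {
    // Hypothetical helper: packs whitespace-separated words into lines of at most maxLen characters.
    // A single word longer than maxLen is kept whole on its own line (an assumption of this sketch).
    public static List<String> wrap(String text, int maxLen) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for empty or all-whitespace input
            }
            // Start a new line if adding " word" would exceed the limit.
            if (current.length() > 0 && current.length() + 1 + word.length() > maxLen) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Joins the pieces with <br>, as suggested above.
        System.out.println(String.join("<br>", wrap("Stack Overflow", 9))); // Stack<br>Overflow
    }
}
```

The joined result would then be written back in place of the original text node; how that is done depends on the parser and is not shown here.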

HTML Parsing and removing anchor tags while preserving inner html using Jsoup

I have to parse some HTML and remove the anchor tags, but I need to preserve the inner HTML of the anchor tags.
For example, if my html text is:
String html = "<div> <p> some text <a href=\"#\"> some link text </a> </p> </div>";
Now I can parse the above HTML and select the a tags in jsoup like this:
Document doc = Jsoup.parse(inputHtml);
// this gives me all anchor elements
Elements elements = doc.select("a");
and I can remove all of them by,
element.remove()
But that would remove the complete anchor tag, from the opening bracket to the closing bracket, and the inner HTML would be lost. How can I preserve the inner HTML while removing only the start and close tags?
Also, please note: I know there are methods to get outerHTML() and innerHTML() from the element, but those methods only give me ways to retrieve the text; the remove() method removes the complete HTML of the tag. Is there any way to remove only the outer tags and preserve the inner HTML?
Thanks a lot in advance and appreciate your help.
--Rajesh

Use unwrap(); it preserves the inner HTML:
doc.select("a").unwrap();
Check the API docs for more info:
http://jsoup.org/apidocs/org/jsoup/select/Elements.html#unwrap%28%29

How about extracting the inner HTML first, adding it to the DOM and then removing your tags? This code is untested, but should do the trick:
Edit:
I updated the code to use replaceWith(), making the code more intuitive and probably more efficient; thanks to A.J.'s hint in the comments.
Document doc = Jsoup.parse(inputHtml);
Elements links = doc.select("a");
for (Element link : links) {
    // Taking the base URI from each link avoids links.get(0) failing when there are no links.
    // Note: newer jsoup versions may only offer the one-argument TextNode(String) constructor.
    Node linkText = new TextNode(link.html(), link.baseUri());
    // optionally wrap it in a tag instead:
    // Element linkText = doc.createElement("span");
    // linkText.html(link.html());
    link.replaceWith(linkText);
}
Instead of using a text node, you can wrap the inner HTML in anything you want; you might even have to, if there is more than plain text inside your links.

Presence of HTML tags using Jsoup

With jsoup it is easy to count how many times a particular tag occurs in a text. For example, I am trying to see how many times the anchor tag is present in the given text.
String content = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(content);
Elements links = doc.select("a[href]"); // a with href
System.out.println(links.size());
This gives me a count of 4. If I have a sentence and want to know whether it contains any HTML tags at all, is that possible with jsoup? Thank you.

You are possibly better off with a regular expression, but if you really want to use jsoup, you can try to match all elements and then subtract 4, as jsoup automatically adds four elements: first the root element, and then an <html>, a <head> and a <body> element.
This might loosely look like:
// attempt to count html elements in string - incorrect code, see below
public static int countHtmlElements(String content) {
    Document doc = Jsoup.parse(content);
    Elements elements = doc.select("*");
    return elements.size() - 4;
}
However this gives a wrong result if the text contains a <html>, <head> or <body>; compare the results of:
// gives a correct count of 2 html elements
System.out.println(countHtmlElements("some <b>text</b> with <i>markup</i>"));
// incorrectly counts 0 elements, as the body is subtracted
System.out.println(countHtmlElements("<body>this gives a wrong result</body>"));
So to make this work, you would have to check for the "magic" tags separately; that is why I feel a regular expression might be simpler.
More failed attempts to make this work: using parseBodyFragment instead of parse does not help, as jsoup normalizes the input in the same way. Similarly, counting with doc.select("body *") saves you the trouble of subtracting 4, but it still yields the wrong count if a <body> is involved. Only in an application where you are sure that no <html>, <head> or <body> elements are present in the strings to be checked might this work, under that limitation.
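
For the regular-expression route mentioned above, a rough sketch in plain Java could look like the following. The class name HtmlTagCheck and both method names are hypothetical, and the pattern is only a heuristic: it accepts anything shaped like a tag, so text such as "a<b and c>d" is reported as markup, and attribute values containing ">" are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlTagCheck {
    // Heuristic: "<", optional "/", a tag name starting with a letter,
    // optional attributes without angle brackets, optional "/", then ">".
    private static final Pattern ANY_TAG =
            Pattern.compile("</?[A-Za-z][A-Za-z0-9]*(\\s[^<>]*)?/?>");

    // Same shape, but without the leading "/", so closing tags are skipped.
    private static final Pattern OPENING_TAG =
            Pattern.compile("<[A-Za-z][A-Za-z0-9]*(\\s[^<>]*)?/?>");

    // True if the string contains at least one thing that looks like a tag.
    public static boolean containsHtmlTag(String s) {
        return ANY_TAG.matcher(s).find();
    }

    // Counts things that look like opening (or self-closing) tags.
    public static int countOpeningTags(String s) {
        Matcher m = OPENING_TAG.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOpeningTags("some <b>text</b> with <i>markup</i>"));    // 2
        System.out.println(countOpeningTags("<body>this gives a wrong result</body>")); // 1
        System.out.println(containsHtmlTag("1 < 2 and 3 > 2"));                         // false
    }
}
```

Because nothing is parsed or normalized, a literal <body> in the input is counted like any other tag, which is the case the jsoup-based count above gets wrong.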

Using jsoup to escape disallowed tags

I am evaluating jsoup for the functionality which would sanitize (but not remove!) the non-whitelisted tags. Let's say only <b> tag is allowed, so the following input
foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>
has to yield the following:
foo <b>bar</b> &lt;script onLoad='stealYourCookies();'&gt;baz&lt;/script&gt;
I see the following problems/questions with jsoup:
document.getAllElements() always assumes <html>, <head> and <body>. Yes, I can call document.body().getAllElements() but the point is that I don't know if my source is a full HTML document or just the body -- and I want the result in the same shape and form as it came in;
how do I replace <script>...</script> with &lt;script&gt;...&lt;/script&gt;? I only want to replace the brackets with escaped entities and do not want to alter any attributes, etc. Node.replaceWith sounds like overkill for this.
Is it possible to completely switch off pretty printing (e.g. insertion of new lines, etc.)?
Or maybe I should use another framework? I have peeked at htmlcleaner so far, but the given examples don't suggest my desired functionality is supported.

Answer 1
How do you load and parse your document with jsoup? If you use parse() or connect().get(), jsoup will automatically normalize your HTML (inserting html, head and body tags). This ensures you always have a complete HTML document, even if the input is not complete.
If you only want to clean an input (with no further processing), you should use clean() instead of the methods listed above.
Example 1 - Using parse()
final String html = "<b>a</b>";
System.out.println(Jsoup.parse(html));
Output:
<html>
<head></head>
<body>
<b>a</b>
</body>
</html>
Input html is completed to ensure you have a complete document.
Example 2 - Using clean()
final String html = "<b>a</b>";
// Whitelist was renamed Safelist in later jsoup releases
System.out.println(Jsoup.clean(html, Whitelist.relaxed()));
Output:
<b>a</b>
The input HTML is cleaned, nothing more.
Documentation:
Jsoup
Answer 2
The method replaceWith() does exactly what you need:
Example:
final String html = "<b><script>your script here</script></b>";
Document doc = Jsoup.parse(html);
for (Element element : doc.select("script")) {
    element.replaceWith(TextNode.createFromEncoded(element.toString(), null));
}
System.out.println(doc);
Output:
<html>
<head></head>
<body>
<b>&lt;script&gt;your script here&lt;/script&gt;</b>
</body>
</html>
Or body only:
System.out.println(doc.body().html());
Output:
<b>&lt;script&gt;your script here&lt;/script&gt;</b>
Documentation:
Node.replaceWith(Node in)
TextNode
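
To make the target output concrete: escaping a disallowed tag amounts to replacing its markup characters with entities while leaving everything else, attributes included, untouched. A minimal plain-Java sketch of just that step follows; the class MarkupEscaper and the method escapeMarkup are hypothetical names and not part of jsoup, and deciding which tags are disallowed is left to the caller.

```java
class MarkupEscaper {
    // Hypothetical helper: replaces &, < and > with their entities, character by character.
    // Handling & in the same pass avoids double-escaping the entities produced for < and >.
    public static String escapeMarkup(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String allowed = "foo <b>bar</b> ";
        String disallowed = "<script onLoad='stealYourCookies();'>baz</script>";
        // Only the disallowed part is escaped; the allowed <b> stays as markup.
        System.out.println(allowed + escapeMarkup(disallowed));
    }
}
```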
Answer 3
Yes, the prettyPrint() method of Document.OutputSettings does this.
Example:
final String html = "<p>your html here</p>";
Document doc = Jsoup.parse(html);
doc.outputSettings().prettyPrint(false);
System.out.println(doc);
Note: if the outputSettings() method is not available, please update Jsoup.
Output:
<html><head></head><body><p>your html here</p></body></html>
Documentation:
Document.OutputSettings.prettyPrint(boolean pretty)
Answer 4 (no bullet)
No! Jsoup is one of the best and most capable HTML libraries out there!

Insert HTML into the Body of an HTMLDocument

This seems like such a simple question, but I'm having such difficulty with it.
Problem:
I have some text to insert into an HTMLDocument. This text sometimes specifies some html as well. E.G.:
Some <br />Random <b>HTML</b>
I'm using HTMLEditorKit.insertHTML to insert it at a specified offset. This works fine, unless the offset is at the beginning of the doc (offset = 1). When this is the case the text gets inserted into the head of the document instead of the body.
Example:
editorKitInstance.insertHTML(doc, offset, "<font>"+stringToInsert+"</font>", 0, 0, HTML.Tag.FONT);
I use the font tag so I know that what I'm inserting will be in a font tag with no attributes, so it won't affect the format. I need to know this because the last parameter, insertTag, is required and I can't know the contents of stringToInsert until runtime. If there is already text in the doc (such as "1234567890") then this is the output:
<html>
<head>
</head>
<body>
<p style="margin-top: 0">
1234567890 <font>something <br />Some <br />Random <b>HTML</b></font>
</p>
</body>
</html>
However if the offset is 1 and the document is empty this is the result:
<html>
<head>
<font>Some <br />Random <b>HTML</b></font>
</head>
<body>
</body>
</html>
Other Notes:
This is all being done on the inner document of a JEditorPane. If there is a better way to replace text in a JEditorPane with potential HTML, I would be open to those ideas as well.
Any help would be appreciated. Thanks!

There are several things you should know about the internal structure of the HTMLDocument.
First of all - the body does not start at position 0. All textual content of the document is stored in an instance of javax.swing.text.AbstractDocument$Content. This includes the title and script tags as well. The position/offset argument of ANY document and editor kit function refers to the text in this Content instance! You have to determine the start of the body element to correctly insert content into the body. BTW: Even if you didn't define a body element in your HTML, it will be auto-generated by the parser.
Simply inserting at a position tends to have unexpected side effects. You need to know where you want to put the content in relation to the (HTML) elements at this position. E.g. if you have the following text in your document: "...</span><span>..." - there is only one position (referring to the Content instance) for "at the end of the first span", "between the spans" and "at the start of the second span". To solve this problem there are 4 functions in the HTMLDocument API:
insertAfterEnd
insertAfterStart
insertBeforeEnd
insertBeforeStart
In conclusion: for a general solution, you have to find the BODY element and tell the document to "insertAfterStart" of the body, at the start offset of the body element.
The following snippet should work in any case:
HTMLDocument htmlDoc = ...;
// Element here is javax.swing.text.Element
Element[] roots = htmlDoc.getRootElements(); // #0 is the HTML element, #1 the bidi-root
Element body = null;
for (int i = 0; i < roots[0].getElementCount(); i++) {
    Element element = roots[0].getElement(i);
    if (element.getAttributes().getAttribute(StyleConstants.NameAttribute) == HTML.Tag.BODY) {
        body = element;
        break;
    }
}
htmlDoc.insertAfterStart(body, "<font>text</font>");
If you're sure that the header is always empty, there is another way:
kit.read( new StringReader( "<font>test</font>" ), htmlDoc, 1 );
But this will throw a RuntimeException, if the header is not empty.
By the way, I prefer to use JWebEngine to handle and render HTML content since it keeps header and content separated, so inserting at position 0 always works.
