Insert HTML into the Body of an HTMLDocument - java

This seems like such a simple question, but I'm having such difficulty with it.
Problem:
I have some text to insert into an HTMLDocument. This text sometimes specifies some html as well. E.G.:
Some <br />Random <b>HTML</b>
I'm using HTMLEditorKit.insertHTML to insert it at a specified offset. This works fine unless the offset is at the beginning of the doc (offset = 1). In that case the text gets inserted into the head of the document instead of the body.
Example:
editorKitInstance.insertHTML(doc, offset, "<font>"+stringToInsert+"</font>", 0, 0, HTML.Tag.FONT);
I use the font tag so I know that what I'm inserting will be wrapped in a font tag with no attributes, so it won't affect the formatting. I need to know this because the last parameter, insertTag, is required and I can't know the contents of stringToInsert until runtime. If there is already text in the doc (such as "1234567890") then this is the output:
<html>
<head>
</head>
<body>
<p style="margin-top: 0">
1234567890 <font>something <br />Some <br />Random <b>HTML</b></font>
</p>
</body>
</html>
However if the offset is 1 and the document is empty this is the result:
<html>
<head>
<font>Some <br />Random <b>HTML</b></font>
</head>
<body>
</body>
</html>
Other Notes:
This is all being done on the inner document of a JEditorPane. If there is a better way to replace text in a JEditorPane with potential HTML, I would be open to those ideas as well.
Any help would be appreciated. Thanks!

There are several things you should know about the internal structure of the HTMLDocument.
First of all, the body does not start at position 0. All textual content of the document is stored in an instance of javax.swing.text.AbstractDocument$Content. This includes the content of the title and script tags as well. The position/offset argument of ANY document and editor kit function refers to the text in this Content instance! You have to determine the start of the body element to insert content into the body correctly. BTW: even if you didn't define a body element in your HTML, it will be auto-generated by the parser.
Simply inserting at a position tends to have unexpected side effects. You need to know where you want to put the content in relation to the (HTML) elements at this position. E.g. if you have the following text in your document: "...</span><span>..." - there is only one position (referring to the Content instance) for "at the end of the first span", "between the spans" and "at the start of the second span". To solve this problem there are 4 functions in the HTMLDocument API:
insertAfterEnd
insertAfterStart
insertBeforeEnd
insertBeforeStart
In conclusion: for a general solution, you have to find the BODY element and tell the document to "insertAfterStart" of that element, which inserts at the start offset of the body.
The following snippet should work in any case:
HTMLDocument htmlDoc = ...;
Element[] roots = htmlDoc.getRootElements(); // #0 is the HTML element, #1 the bidi-root
Element body = null;
// Scan the direct children of the HTML element for the BODY element
for( int i = 0; i < roots[0].getElementCount(); i++ ) {
    Element element = roots[0].getElement( i );
    if( element.getAttributes().getAttribute( StyleConstants.NameAttribute ) == HTML.Tag.BODY ) {
        body = element;
        break;
    }
}
// insertAfterStart declares BadLocationException and IOException
htmlDoc.insertAfterStart( body, "<font>text</font>" );
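The body lookup above can be tried outside a JEditorPane. The following is a minimal sketch, assuming a plain HTMLEditorKit with its default document; the class name, method names, and sample page are hypothetical and chosen only for illustration.

```java
import java.io.StringReader;
import javax.swing.text.Element;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical demo class; not part of the original answer.
final class BodyInsertSketch {

    /** Returns the BODY child of the HTML root element, or null if there is none. */
    static Element findBody(HTMLDocument doc) {
        Element html = doc.getDefaultRootElement();
        for (int i = 0; i < html.getElementCount(); i++) {
            Element child = html.getElement(i);
            Object name = child.getAttributes().getAttribute(StyleConstants.NameAttribute);
            if (name == HTML.Tag.BODY) {
                return child;
            }
        }
        return null;
    }

    /** Loads pageHtml, inserts fragment right after the start of BODY, and returns the document text. */
    static String insertAtBodyStart(String pageHtml, String fragment) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        // createDefaultDocument() also installs the parser that insertAfterStart needs
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        kit.read(new StringReader(pageHtml), doc, 0);

        Element body = findBody(doc);
        if (body == null) {
            throw new IllegalStateException("no BODY element found");
        }
        doc.insertAfterStart(body, fragment);
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head></head><body><p>1234567890</p></body></html>";
        System.out.println(insertAtBodyStart(page, "<font>INSERTED</font>"));
    }
}
```

The explicit null check is there because insertAfterStart returns without doing anything when it is handed a null element, which would hide a failed lookup.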
If you're sure that the header is always empty, there is another way:
kit.read( new StringReader( "<font>test</font>" ), htmlDoc, 1 );
But this will throw a RuntimeException if the header is not empty.
By the way, I prefer to use JWebEngine to handle and render HTML content since it keeps header and content separated, so inserting at position 0 always works.

Related

Parsing html body outer text only

I used JSoup to parse HTML.
How can I get only the body text?
I mean I want only the body's own text, without including the text of other tags.
(Music causes us to think eloquently.)
<html>
<body>
<p class=\"mm3h\">ဂီတကဆွဲဆောင်အားကောင်းတဲ့ကျွန်တော်တို့ကိုဖြစ်စေတယ်လို့ထင်တယ်။</p>
Music causes us to think eloquently.
<a class=\"\" href=\"\" aria-label=\"--Ralph Waldo Emerson (1 item)\">--Ralph Waldo Emerson</a>
</body>
</html>
I know the question is already answered and the answer is marked as the accepted answer, but I think there is another way to get what was asked:
JSoup offers the ownText() method. With it, you get the text of all text nodes that are direct children of the element. Child elements and their text nodes are not returned.
Document doc = Jsoup.parse("<body> text <p> not included </p> included </body>");
Element body = doc.body();
String ownText = body.ownText();
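Alternatively, index into the body's direct text nodes; which index holds the wanted text depends on the markup:
Document doc = Jsoup.parse("<body> your content </body>");
String body = doc.body().textNodes().get(1).text();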

jsoup unexpected behaviour and find all div for a class

I am using jsoup to parse a webpage with the following command
Document document = Jsoup.connect("http://www.blablabla.de/").get();
then
System.out.println(document.toString());
I get the desired result. But saving the subject webpage and then trying to do the same operation
Document doc = Jsoup.parse("/user/test/test.html","UTF-8");
System.out.println(doc.toString());
I got
<html>
<head></head>
<body>
/home/1.html
</body>
</html>
My second issue is that I want to get the content of every single div of a specific class. I am using
Elements elements = document.select("div.things.subthings");
the divs I want to catch are as follows
<div class="col_a col text">
<div class="text">
done
</div>
</div>
"But saving the subject webpage and then trying to do the same operation"
The wrong method is called. Actually, the method called is this one:
static Document Jsoup::parse(String html, String baseUri) // Parse HTML into a Document.
You want to call this one:
static Document parse(File in, String charsetName) // Parse the contents of a file as HTML.
Try this instead:
Document doc = Jsoup.parse(new File("/user/test/test.html"), "UTF-8");
System.out.println(doc.toString());
"My second issue is that I want to get the content of every single div of a specific class."
Try one of the css queries below:
For finding all divs with class="col_a col text"
div.col_a.col.text
For finding all divs with class="col_a col text" OR class="text"
div.col_a.col.text, div.text
For finding all divs with class="col_a col text" having divs with class="text" among their descendants
div.col_a.col.text:has(div.text)

How to parse html from javascript variables with Jsoup in Java?

I'm using Jsoup to parse html file and pull all the visible text from elements. The problem is that there are some html bits in javascript variables which are obviously ignored. What would be the best solution to get those bits out?
Example:
<!DOCTYPE html>
<html>
<head>
<script>
var html = "<span>some text</span>";
</script>
</head>
<body>
<p>text</p>
</body>
</html>
In this example Jsoup only picks up the text from p tag which is what it's supposed to do. How do I pick up the text from var html span? The solution must be applied to thousands of different pages, so I can't rely on something like javascript variable having the same name.
You can use Jsoup to parse all the <script>-tags into DataNode-objects.
DataNode
A data node, for contents of style, script tags etc, where contents should not show in text().
Elements scriptTags = doc.getElementsByTag("script");
This will give you all the Elements of tag <script>.
You can then use the getWholeData() method to extract the contents of each node.
// Get the data contents of this node.
String getWholeData()
for (Element tag : scriptTags) {
    for (DataNode node : tag.dataNodes()) {
        System.out.println(node.getWholeData());
    }
}
Jsoup API - DataNode
I am not so sure about the answer, but I saw a similar situation before here.
You can probably combine Jsoup with manual parsing to get the text, as in that answer.
I have just modified that piece of code for your specific case:
Document doc = ...
Element script = doc.select("script").first(); // Get the script part
Pattern p = Pattern.compile("(?is)html = \"(.+?)\""); // Regex for the value of the html
Matcher m = p.matcher(script.html()); // you have to use html here and NOT text! Text will drop the 'html' part
while( m.find() )
{
    System.out.println(m.group()); // the whole html text
    System.out.println(m.group(1)); // value only
}
Hope it will be helpful.

Presence of HTML tags using Jsoup

With Jsoup it is easy to count the number of times a particular tag appears in a text. For example, I am trying to see how many times the anchor tag is present in the given text.
String content = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(content);
Elements links = doc.select("a[href]"); // a with href
System.out.println(links.size());
This gives me a count of 4. If I have a sentence and I want to know if the sentence contains any html tags or not, is it possible with Jsoup? Thank you.
You are possibly better off with a regular expression, but if you really want to use JSoup, you can try to match all elements and then subtract 4, as JSoup automatically adds four elements: first the root element, and then an <html>, a <head> and a <body> element.
This might loosely look like:
// attempt to count html elements in string - incorrect code, see below
public static int countHtmlElements(String content) {
    Document doc = Jsoup.parse(content);
    Elements elements = doc.select("*");
    return elements.size()-4;
}
However this gives a wrong result if the text contains a <html>, <head> or <body>; compare the results of:
// gives a correct count of 2 html elements
System.out.println(countHtmlElements("some <b>text</b> with <i>markup</i>"));
// incorrectly counts 0 elements, as the body is subtracted
System.out.println(countHtmlElements("<body>this gives a wrong result</body>"));
So to make this work, you would have to check for the "magic" tags separately; that is why I feel a regular expression might be simpler.
More failed attempts to make this work: using parseBodyFragment instead of parse does not help, as the input gets sanitized in the same way by JSoup. Likewise, counting with doc.select("body *") saves you the trouble of subtracting 4, but it still yields the wrong count if a <body> is involved. This approach only works, within that limitation, if you are sure that no <html>, <head> or <body> elements are present in the strings to be checked.
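The regular-expression route mentioned above could look roughly like the following. This is a heuristic sketch using only java.util.regex, not a parser: the class name, method names, and the pattern itself are assumptions made for illustration, and text such as "a<b and c>d" would be misread as a tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a heuristic, not a full HTML parser.
final class TagDetectorSketch {

    // Something shaped like an opening, closing, or self-closing tag: <b>, </p>, <br />, <a href='x'>
    private static final Pattern ANY_TAG =
            Pattern.compile("</?[A-Za-z][A-Za-z0-9]*(\\s[^<>]*)?/?>");

    // Same shape, but opening (or self-closing) tags only, so each element is counted once
    private static final Pattern OPENING_TAG =
            Pattern.compile("<[A-Za-z][A-Za-z0-9]*(\\s[^<>]*)?/?>");

    /** True if the text contains at least one tag-shaped token. */
    static boolean containsHtmlTag(String text) {
        return ANY_TAG.matcher(text).find();
    }

    /** Counts opening tags, including html, head and body if they are literally present. */
    static int countOpeningTags(String text) {
        Matcher m = OPENING_TAG.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(containsHtmlTag("some <b>text</b> with <i>markup</i>"));
        System.out.println(countOpeningTags("<body>this gives a wrong result</body>"));
        System.out.println(containsHtmlTag("plain sentence where 1 < 2"));
    }
}
```

Because nothing is parsed, no implicit elements are added, so a literal <body> in the input is counted like any other tag.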

Using jsoup to escape disallowed tags

I am evaluating jsoup for the functionality which would sanitize (but not remove!) the non-whitelisted tags. Let's say only <b> tag is allowed, so the following input
foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>
has to yield the following:
foo <b>bar</b> &lt;script onLoad='stealYourCookies();'&gt;baz&lt;/script&gt;
I see the following problems/questions with jsoup:
1. document.getAllElements() always assumes <html>, <head> and <body>. Yes, I can call document.body().getAllElements(), but the point is that I don't know if my source is a full HTML document or just the body, and I want the result in the same shape and form as it came in;
2. How do I replace <script>...</script> with &lt;script&gt;...&lt;/script&gt;? I only want to replace the angle brackets with escaped entities and do not want to alter any attributes, etc. Node.replaceWith sounds like overkill for this.
3. Is it possible to completely switch off pretty printing (e.g. insertion of new lines, etc.)?
Or maybe I should use another framework? I have peeked at htmlcleaner so far, but the given examples don't suggest my desired functionality is supported.
Answer 1
How do you load / parse your Document with Jsoup? If you use parse() or connect().get(), jsoup will automatically normalize your HTML (inserting html, body and head tags). This ensures you always have a complete HTML document, even if the input isn't complete.
Assuming you only want to clean an input (no further processing), you should use clean() instead of the previously listed methods.
Example 1 - Using parse()
final String html = "<b>a</b>";
System.out.println(Jsoup.parse(html));
Output:
<html>
<head></head>
<body>
<b>a</b>
</body>
</html>
Input html is completed to ensure you have a complete document.
Example 2 - Using clean()
final String html = "<b>a</b>";
System.out.println(Jsoup.clean(html, Whitelist.relaxed())); // newer jsoup releases name this class Safelist
Output:
<b>a</b>
The input HTML is cleaned, nothing more.
Documentation:
Jsoup
Answer 2
The method replaceWith() does exactly what you need:
Example:
final String html = "<b><script>your script here</script></b>";
Document doc = Jsoup.parse(html);
for( Element element : doc.select("script") )
{
    element.replaceWith(TextNode.createFromEncoded(element.toString(), null));
}
System.out.println(doc);
Output:
<html>
<head></head>
<body>
<b>&lt;script&gt;your script here&lt;/script&gt;</b>
</body>
</html>
Or body only:
System.out.println(doc.body().html());
Output:
<b>&lt;script&gt;your script here&lt;/script&gt;</b>
Documentation:
Node.replaceWith(Node in)
TextNode
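If the input must come back in exactly the shape it arrived in (fragment or full document), one alternative is to skip parsing and escape the brackets of disallowed tags directly in the string. The following is a sketch of that different technique using only java.util.regex, not jsoup; the class name, method name, and pattern are assumptions. It does not inspect attributes on allowed tags and should not be treated as a replacement for a real sanitizer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; regex-based, so it only recognizes simple tag-shaped tokens.
final class EscapeDisallowedTagsSketch {

    // Group 1 captures the tag name of an opening, closing, or self-closing tag
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)(\\s[^<>]*)?/?>");

    /** Escapes the angle brackets of every tag whose name is not in allowedTags (lower-case names). */
    static String escapeDisallowed(String input, Set<String> allowedTags) {
        Matcher m = TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String tag = m.group();
            String name = m.group(1).toLowerCase(Locale.ROOT);
            String replacement = allowedTags.contains(name)
                    ? tag                                              // allowed: keep untouched
                    : tag.replace("<", "&lt;").replace(">", "&gt;");   // disallowed: escape brackets only
            // quoteReplacement keeps '$' and '\' in the tag from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>(Arrays.asList("b"));
        String input = "foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>";
        System.out.println(escapeDisallowed(input, allowed));
    }
}
```

Text between the tags is left exactly as it was, so no html, head or body wrapper and no extra line breaks are introduced.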
Answer 3
Yes, the prettyPrint() method of Document.OutputSettings does this.
Example:
final String html = "<p>your html here</p>";
Document doc = Jsoup.parse(html);
doc.outputSettings().prettyPrint(false);
System.out.println(doc);
Note: if the outputSettings() method is not available, please update Jsoup.
Output:
<html><head></head><body><p>your html here</p></body></html>
Documentation:
Document.OutputSettings.prettyPrint(boolean pretty)
Answer 4 (no bullet)
No! Jsoup is one of the best and most capable HTML libraries out there!
