jsoup unexpected behaviour and find all div for a class - java

I am using jsoup to parse a web page with the following command:
Document document = Jsoup.connect("http://www.blablabla.de/").get();
When I then print it with
System.out.println(document.toString());
I get the desired result. But when I save that web page to disk and try the same operation on the saved file
Document doc = Jsoup.parse("/user/test/test.html","UTF-8");
System.out.println(doc.toString());
I get
<html>
<head></head>
<body>
/home/1.html
</body>
</html>
My second issue is that I want to get the content of every div with a specific class. I am using
Elements elements = document.select("div.things.subthings");
The divs I want to match look like this:
<div class="col_a col text">
<div class="text">
done
</div>
</div>

"But saving the subject webpage and then trying to do the same operation"
The wrong overload is being called. The method you are actually calling is this one:
static Document parse(String html, String baseUri) // Parse HTML into a Document.
It treats the first argument as the HTML markup itself, not as a file path, which is why the path shows up as the body text. You want to call this one:
static Document parse(File in, String charsetName) // Parse the contents of a file as HTML.
Try this instead:
Document doc = Jsoup.parse(new File("/user/test/test.html"), "UTF-8");
System.out.println(doc.toString());
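To make the difference between the two overloads concrete, here is a minimal sketch. The class name, the helper names, the temporary file and the example.com base URI are assumptions made up for illustration; only the two Jsoup.parse overloads come from the answer above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseOverloads {

    // Jsoup.parse(String, String): the first argument IS the markup,
    // the second is a base URI used to resolve relative links.
    static String bodyTextOfString(String markup) {
        Document doc = Jsoup.parse(markup, "http://example.com/");
        return doc.body().text();
    }

    // Jsoup.parse(File, String): the file is read and decoded with the given charset.
    static String titleOfFile(File file) throws IOException {
        Document doc = Jsoup.parse(file, "UTF-8");
        return doc.title();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for the saved web page.
        File saved = File.createTempFile("saved-page", ".html");
        saved.deleteOnExit();
        Files.write(saved.toPath(),
                "<html><head><title>Saved page</title></head><body><p>Hello</p></body></html>"
                        .getBytes(StandardCharsets.UTF_8));

        // The path string is parsed as if it were HTML, so it ends up as body text.
        System.out.println(bodyTextOfString(saved.getPath()));
        // The file contents are parsed, so the real title is available.
        System.out.println(titleOfFile(saved));
    }
}
```

Passing a java.io.File (rather than a String holding the path) is what selects the file-reading overload.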
"My second issue is that I want to get the content of every single div of a specific class."
Try one of the CSS queries below:
For finding all divs with class="col_a col text"
div.col_a.col.text
For finding all divs with class="col_a col text" OR class="text"
div.col_a.col.text, div.text
For finding all divs with class="col_a col text" having divs with class="text" among their descendants
div.col_a.col.text:has(div.text)
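A small sketch for trying the three queries. The sample markup is modelled on the question, with one extra standalone div added as an assumption; the class and method names are made up. Note that a class selector such as div.text matches any div that has the class text among its classes, so it also matches the outer div here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {

    // Markup modelled on the question, plus one standalone div (an assumption).
    static final String HTML =
            "<div class=\"col_a col text\"><div class=\"text\">done</div></div>"
          + "<div class=\"text\">standalone</div>";

    // Returns how many elements the given CSS query selects in the sample markup.
    static int countMatches(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "div.col_a.col.text",
            "div.col_a.col.text, div.text",
            "div.col_a.col.text:has(div.text)"
        };
        for (String query : queries) {
            System.out.println(query + " -> " + countMatches(query));
        }
    }
}
```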

Related

Parsing html body outer text only

I used jsoup to parse HTML.
How can I get only the body text?
I mean I want only the outer text, without including the text of other tags.
That is, from the HTML below I want only: Music causes us to think eloquently.
<html>
<body>
<p class=\"mm3h\">ဂီတကဆွဲဆောင်အားကောင်းတဲ့ကျွန်တော်တို့ကိုဖြစ်စေတယ်လို့ထင်တယ်။</p>
Music causes us to think eloquently.
<a class=\"\" href=\"\" aria-label=\"--Ralph Waldo Emerson (1 item)\">--Ralph Waldo Emerson</a>
</body>
</html>
I know the question has already been answered and an answer has been accepted, but I think there is another way to get what was asked:
jsoup offers the ownText() method. With it, you get the text of all text nodes that are direct children of an element. Child elements and their text nodes are not included.
Document doc = Jsoup.parse("<body> text <p> not included </p> included </body>");
Element body = doc.body();
String ownText = body.ownText();
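Alternatively, pick a single direct text node by index with textNodes():
Document doc = Jsoup.parse("<body> your content </body>");
// Index 1 assumes the body has at least two direct text nodes; adjust it to your document.
String body = doc.body().textNodes().get(1).text();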
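Both approaches can be compared side by side in a minimal sketch like the following. The sample markup is the one from the ownText() snippet above; the class and helper names are assumptions for illustration.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.TextNode;

public class OwnTextDemo {

    static final String HTML = "<body> text <p> not included </p> included </body>";

    // Text of the body's direct text nodes only; the <p> content is skipped.
    static String ownTextOfBody() {
        return Jsoup.parse(HTML).body().ownText();
    }

    // Text of one direct text node of the body, selected by index.
    static String directTextNode(int index) {
        List<TextNode> nodes = Jsoup.parse(HTML).body().textNodes();
        return nodes.get(index).text().trim();
    }

    public static void main(String[] args) {
        System.out.println(ownTextOfBody());
        System.out.println(directTextNode(0));
        System.out.println(directTextNode(1));
    }
}
```

ownText() is the more robust choice when you want all of the element's own text; the index-based variant depends on how many text nodes the parser produced for your particular document.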

Jsoup select text WITH including html tags

I use jsoup to select some content between <td></td> tags. It looks like this:
Document doc = Jsoup.parse(response, "UTF-8");
Element elMotD = doc.select("td.info").first();
String motdText = elMotD.text();
My problem is that jsoup selects the text I want, but it strips out tags like <br>, which I need for displaying the result in an Android TextView later.
How can I get jsoup to keep the tags inside this text?
See here: http://jsoup.org/cookbook/extracting-data/attributes-text-html
Use the Element.html() method to get the element's inner HTML, including nested tags. You can also use Node.outerHtml() to get the HTML including the element's own outer tags.
In your case:
Document doc = Jsoup.parse(response, "UTF-8");
Element elMotD = doc.select("td.info").first();
String motdHtml = elMotD.html();
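As a quick illustration of the three accessors, here is a sketch using a made-up table cell (the markup, class name and helper name are assumptions, not the asker's real page):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TextVsHtmlDemo {

    // Parses a small hypothetical table and returns its td.info cell.
    static Element infoCell() {
        Document doc = Jsoup.parse(
                "<table><tr><td class=\"info\">Line one<br>Line two</td></tr></table>");
        return doc.select("td.info").first();
    }

    public static void main(String[] args) {
        Element cell = infoCell();
        System.out.println("text():      " + cell.text());      // tags are dropped
        System.out.println("html():      " + cell.html());      // inner markup, <br> kept
        System.out.println("outerHtml(): " + cell.outerHtml()); // includes the <td> itself
    }
}
```

The exact whitespace in the html() and outerHtml() output depends on jsoup's pretty-print settings, so it is best not to rely on it.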

style attribute not being displayed using jsoup

I am using Jsoup to fetch all images of a particular manga chapter from online-manga sites using only the first page link.
I have successfully retrieved the total number of pages and the src of the first page. For example, if supplied with the link "http://www.mangapanda.com/feng-shen-ji/1/1", the output will be:
Total page : 49
Title : Feng Shen Ji 1
ImageURL : http://i15.mangapanda.com/feng-shen-ji/1/feng-shen-ji-2974919.jpg
What I want to do now is fetch the src of the second page and then auto-increment to get the rest. The link to the second page appears in the HTML as:
<div id="prefetchimg" style="background-image: url("http://i34.mangapanda.com/feng-shen-ji/1/feng-shen-ji-2974921.jpg");"></div>
But when I use jsoup like this:
String url = "http://www.mangapanda.com/feng-shen-ji/1";
Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
Elements div = doc.select("div");
for (Element divParse : div) {
    if (divParse.id().equals("prefetchimg")) {
        System.out.println(divParse);
    }
}
I only get
<div id="prefetchimg"></div>
Instead of
<div id="prefetchimg" style="background-image: url("http://i34.mangapanda.com/feng-shen-ji/1/feng-shen-ji-2974921.jpg");"></div>
How do I get the style attribute?
@eltabo said:
"OK, in your case the tag has been modified by a JavaScript function, so jsoup can't see this attribute."
And this is true: jsoup only parses the HTML that the server returns and does not execute JavaScript. For pages whose HTML is changed by JavaScript, use a tool such as HtmlUnit.

HTML Parsing and removing anchor tags while preserving inner html using Jsoup

I have to parse some HTML and remove the anchor tags, but I need to preserve the inner HTML of the anchor tags.
For example, if my html text is:
String html = "<div> <p> some text <a>some link text</a> </p> </div>";
Now I can parse the above HTML and select the a tags in jsoup like this:
Document doc = Jsoup.parse(inputHtml);
// this gives me all anchor elements
Elements elements = doc.select("a");
and I can remove each of them with
element.remove()
But that removes the complete anchor tag, from the opening bracket to the closing bracket, and the inner HTML is lost. How can I preserve the inner HTML while removing only the start and end tags?
Also, please note: I know there are methods to get outerHTML() and innerHTML() from the element, but those methods only give me ways to retrieve the text, while the remove() method removes the complete HTML of the tag. Is there any way to remove only the outer tags and preserve the inner HTML?
Thanks a lot in advance and appreciate your help.
--Rajesh
Use unwrap(); it preserves the inner HTML:
doc.select("a").unwrap();
Check the API docs for more info:
http://jsoup.org/apidocs/org/jsoup/select/Elements.html#unwrap%28%29
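A minimal sketch of unwrap() in action. The sample markup (including the nested <b> and the href value), the class name and the helper name are assumptions for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class UnwrapDemo {

    // Removes every <a> element but keeps its children in place.
    static Document unwrapLinks(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a").unwrap();
        return doc;
    }

    public static void main(String[] args) {
        Document doc = unwrapLinks(
                "<div><p>some text <a href=\"#\">some <b>link</b> text</a></p></div>");
        // The anchor is gone; its text and the nested <b> remain inside the <p>.
        System.out.println(doc.body().html());
    }
}
```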
How about extracting the inner HTML first, adding it to the DOM and then removing your tags? This code is untested, but should do the trick:
Edit:
I updated the code to use replaceWith(), making the code more intuitive and probably more efficient; thanks to A.J.'s hint in the comments.
Document doc = Jsoup.parse(inputHtml);
Elements links = doc.select("a");
String baseUri = links.get(0).baseUri();
for (Element link : links) {
    // Note: newer jsoup releases offer a single-argument TextNode(String) constructor;
    // the two-argument form used here may be deprecated or unavailable there.
    Node linkText = new TextNode(link.html(), baseUri);
    // optionally wrap it in a tag instead:
    // Element linkText = doc.createElement("span");
    // linkText.html(link.html());
    link.replaceWith(linkText);
}
Instead of using a text node, you can wrap the inner HTML in any element you want; you may even have to if your links contain more than plain text, because markup placed in a text node is escaped on output.

Insert HTML into the Body of an HTMLDocument

This seems like such a simple question, but I'm having such difficulty with it.
Problem:
I have some text to insert into an HTMLDocument. This text sometimes specifies some html as well. E.G.:
Some <br />Random <b>HTML</b>
I'm using HTMLEditorKit.insertHTML to insert it at a specified offset. This works fine unless the offset is at the beginning of the doc (offset = 1). In that case the text gets inserted into the head of the document instead of the body.
Example:
editorKitInstance.insertHTML(doc, offset, "<font>"+stringToInsert+"</font>", 0, 0, HTML.Tag.FONT);
I use the font tag so that I know what I'm inserting will be in a font tag with no attributes, so it won't affect the formatting. I need to know this because the last parameter, insertTag, is required and I can't know the contents of stringToInsert until runtime. If there is already text in the doc (such as "1234567890"), then this is the output:
<html>
<head>
</head>
<body>
<p style="margin-top: 0">
1234567890 <font>something <br />Some <br />Random <b>HTML</b></font>
</p>
</body>
</html>
However if the offset is 1 and the document is empty this is the result:
<html>
<head>
<font>Some <br />Random <b>HTML</b></font>
</head>
<body>
</body>
</html>
Other Notes:
This is all being done on the inner document of a JEditorPane. If there is a better way to replace text in a JEditorPane with potential HTML, I would be open to those ideas as well.
Any help would be appreciated. Thanks!
There are several things you should know about the internal structure of the HTMLDocument.
First of all, the body does not start at position 0. All textual content of the document is stored in an instance of javax.swing.text.AbstractDocument$Content, and this includes the title and script tags as well. The position/offset argument of ANY document and editor kit function refers to the text in this Content instance! You have to determine the start of the body element to insert content into the body correctly. BTW: even if you didn't define a body element in your HTML, one will be auto-generated by the parser.
Simply inserting at a position tends to have unexpected side effects. You need to know where you want to put the content in relation to the (HTML) elements at this position. E.g. if you have the following text in your document: "...</span><span>..." - there is only one position (referring to the Content instance) for "at the end of the first span", "between the spans" and "at the start of the second span". To solve this problem there are 4 functions in the HTMLDocument API:
insertAfterEnd
insertAfterStart
insertBeforeEnd
insertBeforeStart
In conclusion: for a general solution, you have to find the BODY element and tell the document to insert after the start of the body (insertAfterStart), that is, at the start offset of the body element.
The following snippet should work in any case:
HTMLDocument htmlDoc = ...;
Element[] roots = htmlDoc.getRootElements(); // #0 is the HTML element, #1 the bidi-root
Element body = null;
for( int i = 0; i < roots[0].getElementCount(); i++ ) {
    Element element = roots[0].getElement( i );
    if( element.getAttributes().getAttribute( StyleConstants.NameAttribute ) == HTML.Tag.BODY ) {
        body = element;
        break;
    }
}
htmlDoc.insertAfterStart( body, "<font>text</font>" );
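For reference, here is a self-contained sketch of the same idea that can be run outside a JEditorPane. The class name, helper names and the initial document are assumptions; it uses getDefaultRootElement() in place of getRootElements()[0], and serializes the result with the editor kit so the insertion point can be inspected.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.swing.text.Element;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class InsertIntoBody {

    // Walks the children of the HTML root and returns the BODY element, or null.
    static Element findBody(HTMLDocument doc) {
        Element html = doc.getDefaultRootElement();
        for (int i = 0; i < html.getElementCount(); i++) {
            Element child = html.getElement(i);
            Object name = child.getAttributes().getAttribute(StyleConstants.NameAttribute);
            if (name == HTML.Tag.BODY) {
                return child;
            }
        }
        return null;
    }

    // Loads initialHtml, inserts fragment at the start of the body, returns the serialized document.
    static String insertAtBodyStart(String initialHtml, String fragment) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        // createDefaultDocument() installs the parser that insertAfterStart needs.
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        kit.read(new StringReader(initialHtml), doc, 0);

        Element body = findBody(doc);
        if (body == null) {
            throw new IllegalStateException("no BODY element found");
        }
        doc.insertAfterStart(body, fragment);

        StringWriter out = new StringWriter();
        kit.write(out, doc, 0, doc.getLength());
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(insertAtBodyStart(
                "<html><head></head><body></body></html>",
                "<font>Some <br />Random <b>HTML</b></font>"));
    }
}
```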
If you're sure that the header is always empty, there is another way:
kit.read( new StringReader( "<font>test</font>" ), htmlDoc, 1 );
But this will throw a RuntimeException if the header is not empty.
By the way, I prefer to use JWebEngine to handle and render HTML content since it keeps header and content separated, so inserting at position 0 always works.
