Jsoup Not Parsing Particluar DIv - java

I am unable to get the div 'live ticker' from here using Jsoup Library.
Here is my code:
Document doc = Jsoup.connect(Link).get();
Element link = doc.select("div.data-of-match-live-experience").first();
Elements squad = doc.select("div.data-of-match-live-experience");
Elements li = squad.select("li"); // select all li from ul
Log.d("jsoup", "size: " + li.size());

The text in this tag in not part of initial html, but is set by JavaScript after page is loaded. You can check it by disabling JavaScript in your browser. Jsoup only gets static html, does not execute JavaScript code.
When you examine what connections are made from the page you will find out that the value is updated through request to this API:
https://shapeshifter-api.onefootball.com/v1/en/match/live-experience/5/6700/718129
Make a request to this url, parser result and you will get desired value.

Related

Picking a JavaScript button in HtmlUnit

In AliX, this page for example https://www.aliexpress.com/item/32956185908.html
how do you pick the particular country and colour?
For example, following js configure as country:
"skuPropertyValues":[
{"propertyValueDisplayName":"China","propertyValueId":201336100,"propertyValueIdLong":201336100,"propertyValueName":"China","skuPropertySendGoodsCountryCode":"CN","skuPropertyTips":"China","skuPropertyValueShowOrder":2,"skuPropertyValueTips":"China"},
{"propertyValueDisplayName":"GERMANY","propertyValueId":201336101,"propertyValueIdLong":201336101,"propertyValueName":"GERMANY","skuPropertySendGoodsCountryCode":"DE","skuPropertyTips":"GERMANY","skuPropertyValueShowOrder":2,"skuPropertyValueTips":"GERMANY"},
{"propertyValueDisplayName":"SPAIN","propertyValueId":201336104,"propertyValueIdLong":201336104,"propertyValueName":"SPAIN","skuPropertySendGoodsCountryCode":"ES","skuPropertyTips":"SPAIN","skuPropertyValueShowOrder":2,"skuPropertyValueTips":"SPAIN"},
{"propertyValueDisplayName":"Russian Federation","propertyValueId":201336103,"propertyValueIdLong":201336103,"propertyValueName":"Russian Federation","skuPropertySendGoodsCountryCode":"RU","skuPropertyTips":"Russian Federation","skuPropertyValueShowOrder":2,"skuPropertyValueTips":"Russian Federation"}]},
I'm not sure how to pick one through JavaScript or am I looing at the wrong thing, would it CS and picking up a div tag?
You have to enable js support and wait after retrieving the page until all the javascript is done. The js code will generate div's from you code above (use page.asXML() to get an idea of the page fro the viewpoint of HtmlUnit.
If the div's are there you can click() on them - like on any other html element. This click should trigger the same js code like in real browsers.
To find the div elements please have a look at https://htmlunit.sourceforge.io/gettingStarted.html; there are many different options listed.
Like #RBRi said, use the click method on the element. To choose the element (DIV in this case) you can use the getbyxpath method:
Starting from the page you can access using the XPATH of the element. The XPATH can be easily obtained using the browser inspect tool and copying the full XPATH of the div (right click over the div and use the copy option). For the purpose of your URL, "CHINA" element has the XPATH: /html/body/div[6]/div/div[2]/div/div[2]/div[7]/div/div[1]/ul/li[1]/div/ "GERMANY": /html/body/div[6]/div/div[2]/div/div[2]/div[7]/div/div[1]/ul/li[2]/div/span
page1 = webClient.getPage("https://www.aliexpress.com/item/32956185908.html");
// get china div using xpath
element = ((HtmlElement)page1.getByXPath("/html/body/div[6]/div/div[2]/div/div[2]/div[7]/div/div[1]/ul/li[1]/div/").get(0));
// Click over china
element.click();
webClient.waitForBackgroundJavaScript(10000);
If you want to iterate over the list of countries you can use the same approach, but this time copy the XPATH of the div that contains all the countries, get that element, and from this element then iterate by getting the divs. In this case, you can use the attribute "class" to get those elements:
page1 = webClient.getPage("https://www.aliexpress.com/item/32956185908.html");
element = ((HtmlElement)page1.getByXPath("/html/body/div[6]/div/div[2]/div/div[2]/div[7]/div/div[1]/ul").get(0); // Get the div that contains all countries
List<HtmlElement> elements = element.getElementsByAttribute("div", "class", "sku-property-text");
Then you can iterate over the list of elements and click the one you prefer.
:
choosenElement.click();
:

JSoup not reading content from URL with anchor

I'm using JSoup to read content from the following page:
https://www.astrology.com/horoscope/daily/aries.html#Monday
This is the code that I'm using:
String test1 = "https://www.astrology.com/horoscope/daily/aries.html#Monday";
String test2 = "https://www.astrology.com/horoscope/daily/aries.html#Tuesday";
Document document = Jsoup.connect(test1).get();
Element content = document.getElementById("content");
Element p = content.child(0);
String myTest = p.text();
In the URL I can pass the day with an anchor (see test1 and test2 variables) but in both cases it returns the same content, looks like it JSoup is simply ignoring the anchor and just using the base URL: https://www.astrology.com/horoscope/daily/aries.html. Is there a way for JSoup to read an URL with an anchor?
Jsoup ignores the anchor because the relevant information is rendered with JavaScript and Jsoup cannot process it. If you examin the page with your browser's dev tools you'll see that the daily info is found in a json file, like https://www.astrology.com/horoscope/daily/all/aries/2021-03-23/, so you can easily change the date/sign and get whatever you like.

Unable to retrieve table elements using jsoup

I'm new to using jsoup and I am struggling to retrieve the tables with class name: verbtense with the headers: Present and Past, under the div named Indicative from the from this site: https://www.verbix.com/webverbix/Swedish/misslyckas
I have started off trying to do the following, but there are no results from the get go:
Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements tables = document.select("table[class=verbtense]"); // empty
I also tried this, but again no results:
Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements divs = document.select("div");
if (!divs.isEmpty()) {
for (Element div : divs) {
// all of these are empty
Elements verbTenses = div.getElementsByClass("verbtense");
Elements verbTables = div.getElementsByClass("verbtable");
Elements tables = div.getElementsByClass("table verbtable");
}
}
What am I doing incorrectly?
The page you are trying to scrape have dynamically generated content on the client side (with javascript), therfore you won be able to extact data using that link
You might me able to scrape some content from the API call that this webpage is making eg https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas
Inspect browser console to see what page is doing, and do the same
The first catch is that this page loads its content asynchronously using AJAX and uses JavaScript to add the content to the DOM. You can even see the loader for a short time.
Jsoup can't parse and execute JavaScript so all you get is the initial page :(
The next step would be to check what the browser is doing and what is the source of this additional content. You can check it using Chrome's debugger (Ctrl + Shift + i). If you open Network tab, select only XHR communication and refresh the page you can see two requests:
One of them gets such content https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas
as you can see it's a JSON with HTML fragments and this content seems to have verbs forms you need. But here's another catch because unfortunately Jsoup can't parse JSON :( So you'll have to use another library to get the HTML fragment and then you can parse it using Jsoup.
General advice to download JSON is to ignore content type (Jsoup will complain it doesn't support JSON):
String json = Jsoup.connect("https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas").ignoreContentType(true).execute().body();
then
you'll have to use some JSON parsing library for example json-simple
to obtain html fragment and then you can parse it to HTML with Jsoup:
String json = Jsoup.connect(
"https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas")
.ignoreContentType(true).execute().body();
System.out.println(json);
JSONObject jsonObject = (JSONObject) JSONValue.parse(json);
String htmlFragmentObtainedFromJson = (String) ((JSONObject) jsonObject.get("p1")).get("html");
Document document = Jsoup.parse(htmlFragmentObtainedFromJson);
System.out.println(document);
Now you can try your initial approach with using selectors to get what you want from document object.

JSoup select Div based on Id and href based on title

Im using JSoup to parse HTML response. I have multiple Div tags. I have to select Div tag based on an ID.
My pseudo code looks like this,
Document divTag = Jsoup.connect(link).get();
Elements info = divTag.select("div#navDiv");
where navDiv is the ID. But it doesnt seem to work.
Also I would want to select Href inside the Div based on some title, where hrefTitle[] would be string array. So while iterating the href I would check if the title is present in the string array, if so i would add them to list else ignore. How do i select href inside Div ? and How to select title? any inputs much appreciated.
But it doesnt seem to work.
It should work. Proof:
Document doc = Jsoup.parse("<html><body><div/>" +
"<div id=\"navDiv\">" +
"link1" +
"link2<" +
"</div></body></html>");
Element div = doc.select("div#navDiv").first();
Now, we can select the a element inside the div that has (for example) an href attribute whose value is href2:
System.out.println(div.select("a[href=href2]"));
Output:
link2
You can find the full selector syntax here:
http://jsoup.org/apidocs/org/jsoup/select/Selector.html

How to parse content with <pre>?

I am using jsoup to parse a number of things.
I am trying to parse this tag
<pre>HEllo Worl<pre>
But just cant get it to work.
How would i parse this using jsoup?\
Document jsDoc = null;
jsDoc = Jsoup.connect(url).get();
Elements titleElements = jsDoc.getElementsByTag("pre");
Here is what i have so far.
Works fine for me with latest Jsoup:
String html = "<p>lorem ipsum</p><pre>Hello World</pre><p>dolor sit amet</p>";
Document document = Jsoup.parse(html);
Elements pres = document.select("pre");
for (Element pre : pres) {
System.out.println(pre.text());
}
Result:
Hello World
If you get nothing, then the HTML which you're parsing simply doesn't contain any <pre> element. Check it yourself by
System.out.println(document.html());
Perhaps the URL is wrong. Perhaps there's some JavaScript which alters the HTML DOM with new elements (Jsoup doesn't interpret nor execute JS). Perhaps the site expects a real browser instead of a bot (change the user agent then). Perhaps the site requires a login (you'd need to maintain cookies). Who knows. You can figure this all out with a real webbrowser like Firefox or Chrome.

Categories