JSoup not reading content from URL with anchor - java

I'm using JSoup to read content from the following page:
https://www.astrology.com/horoscope/daily/aries.html#Monday
This is the code that I'm using:
String test1 = "https://www.astrology.com/horoscope/daily/aries.html#Monday";
String test2 = "https://www.astrology.com/horoscope/daily/aries.html#Tuesday";
Document document = Jsoup.connect(test1).get();
Element content = document.getElementById("content");
Element p = content.child(0);
String myTest = p.text();
In the URL I can pass the day with an anchor (see the test1 and test2 variables), but both cases return the same content. It looks like JSoup is simply ignoring the anchor and using only the base URL: https://www.astrology.com/horoscope/daily/aries.html. Is there a way for JSoup to read a URL with an anchor?

Jsoup ignores the anchor because the relevant information is rendered with JavaScript, which Jsoup cannot execute. If you examine the page with your browser's dev tools, you'll see that the daily info comes from a JSON file such as https://www.astrology.com/horoscope/daily/all/aries/2021-03-23/, so you can change the date and sign in that URL to get whatever you like.
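A small sketch of both points, using only the JDK. The fragment (the part after #) is handled by the browser and is not sent to the server, so both test URLs request the same document. The URL pattern in dailyJsonUrl is an assumption inferred from the single example URL above, the class and method names are made up for illustration, and the shape of the JSON response is not covered here.

```java
import java.net.URI;
import java.time.LocalDate;

final class HoroscopeUrls {

    // Returns the URL without its fragment, i.e. what is actually requested from the server.
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Assumed pattern, inferred from the one example URL in the answer.
    // LocalDate prints in ISO format (yyyy-MM-dd) when concatenated.
    static String dailyJsonUrl(String sign, LocalDate date) {
        return "https://www.astrology.com/horoscope/daily/all/" + sign + "/" + date + "/";
    }

    public static void main(String[] args) {
        URI monday = URI.create("https://www.astrology.com/horoscope/daily/aries.html#Monday");
        System.out.println(monday.getFragment()); // Monday
        System.out.println(withoutFragment(monday.toString()));
        System.out.println(dailyJsonUrl("aries", LocalDate.of(2021, 3, 23)));
    }
}
```

The string returned by dailyJsonUrl could then be passed to Jsoup.connect(...).ignoreContentType(true).execute().body() to download the JSON.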

Related

Unable to retrieve table elements using jsoup

I'm new to jsoup and I am struggling to retrieve the tables with the class name verbtense and the headers Present and Past, under the div named Indicative, from this site: https://www.verbix.com/webverbix/Swedish/misslyckas
I started by trying the following, but there are no results from the get-go:
Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements tables = document.select("table[class=verbtense]"); // empty
I also tried this, but again no results:
Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements divs = document.select("div");
if (!divs.isEmpty()) {
    for (Element div : divs) {
        // all of these are empty
        Elements verbTenses = div.getElementsByClass("verbtense");
        Elements verbTables = div.getElementsByClass("verbtable");
        Elements tables = div.getElementsByClass("table verbtable");
    }
}
What am I doing incorrectly?
The page you are trying to scrape has content that is generated dynamically on the client side (with JavaScript), so you won't be able to extract the data using that link.
You might be able to scrape some content from the API call that this webpage is making, e.g. https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas
Inspect the browser console to see what the page is doing, and do the same.
The first catch is that this page loads its content asynchronously using AJAX and uses JavaScript to add the content to the DOM. You can even see the loader for a short time.
Jsoup can't parse and execute JavaScript so all you get is the initial page :(
The next step would be to check what the browser is doing and what is the source of this additional content. You can check it using Chrome's debugger (Ctrl + Shift + i). If you open Network tab, select only XHR communication and refresh the page you can see two requests:
One of them gets its content from https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas
As you can see, it's JSON with HTML fragments, and this content seems to have the verb forms you need. But here's another catch: unfortunately Jsoup can't parse JSON :( So you'll have to use another library to get the HTML fragment, and then you can parse it using Jsoup.
General advice for downloading JSON is to ignore the content type (otherwise Jsoup will complain that it doesn't support JSON):
String json = Jsoup.connect("https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas").ignoreContentType(true).execute().body();
Then you'll have to use some JSON parsing library, for example json-simple, to obtain the HTML fragment, which you can then parse with Jsoup:
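import org.json.simple.JSONObject;
import org.json.simple.JSONValue;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Download the raw JSON; ignoreContentType(true) stops Jsoup from rejecting the non-HTML response
String json = Jsoup.connect(
        "https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas")
        .ignoreContentType(true).execute().body();
System.out.println(json);
// Pull the HTML fragment out of the "p1" object, then parse it with Jsoup
JSONObject jsonObject = (JSONObject) JSONValue.parse(json);
String htmlFragmentObtainedFromJson = (String) ((JSONObject) jsonObject.get("p1")).get("html");
Document document = Jsoup.parse(htmlFragmentObtainedFromJson);
System.out.println(document);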
Now you can try your initial approach of using selectors to get what you want from the document object.

Jsoup extract Hrefs from the HTML content

My problem is that I am trying to get the hrefs from this site with JSoup
https://www.amazon.de/s?k=kissen&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss_2
but it does not work.
I tried to select the class from the Href like this
Elements elements = documentMainSite.select(".a-link-normal");
and after that I tried to extract the Hrefs with the following piece of code.
for (Element element : elements) {
String href = element.attributes().get("href");
}
but unfortunately it gives me nothing...
Can someone tell me where my mistake is, please?
I don't just connect to the website. I also save the hrefs in a string by extracting them with
String href = element.attributes().get("href");
After that I printed the href string, but it is empty.
On the other hand, the code works with another CSS selector, so it has nothing to do with the code itself. It's just the CSS selector (.a-link-normal) that is probably wrong.
You won't get anything by simply connecting to the url via Jsoup.
Document document = Jsoup.connect(yourUrl).get();
String bodyText = document.getElementsByTag("body").get(0).text();
Here is the translation of the body text, which I got from the above code.
Enter the characters below We ask for your understanding and want to
be sure that you are not a bot. For best results, please use a browser
that accepts cookies. Type the characters you see in the image: Enter
characters Try another image Continue shopping Terms & Conditions
Privacy Policy © 1996-2015, Amazon.com, Inc. or its affiliates
You either need to get past the captcha or emulate a real browser, for example with Selenium.

Jsoup Not Parsing Particular Div

I am unable to get the div 'live ticker' from here using Jsoup Library.
Here is my code:
Document doc = Jsoup.connect(Link).get();
Element link = doc.select("div.data-of-match-live-experience").first();
Elements squad = doc.select("div.data-of-match-live-experience");
Elements li = squad.select("li"); // select all li from ul
Log.d("jsoup", "size: " + li.size());
The text in this tag is not part of the initial HTML, but is set by JavaScript after the page is loaded. You can check this by disabling JavaScript in your browser. Jsoup only gets the static HTML and does not execute JavaScript code.
When you examine what connections are made from the page, you will find that the value is updated through a request to this API:
https://shapeshifter-api.onefootball.com/v1/en/match/live-experience/5/6700/718129
Make a request to this URL, parse the result, and you will get the desired value.
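The request itself can be sketched with the JDK's own java.net.http.HttpClient instead of Jsoup, assuming a runtime that ships that package (Java 11+). The class and method names are hypothetical, the Accept header is an assumption about what the API prefers, and the structure of the returned JSON is not known here, so parsing is left to whichever JSON library you use.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class LiveExperienceFetcher {

    static final String API_URL =
            "https://shapeshifter-api.onefootball.com/v1/en/match/live-experience/5/6700/718129";

    // Builds a plain GET request; the Accept header is an assumption.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body as a string.
    static String fetchBody(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchBody(API_URL));
    }
}
```

Where java.net.http is not available, Jsoup.connect(url).ignoreContentType(true).execute().body() returns the same raw body.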

Jsoup: How to get the returned Document's URL if a redirection was involved between request and response

I have a Java web crawler. It opens URLs of this type:
http://jobbank.dk/prefetch_net/s.aspx?c=j&u=/5l/GCyVEQ4dr07BQM6aDvW1I0UefK7VvjHbG5dHDz2P2tCsrbBFYiCBFyAvIdjVnWkl3nwjaUdTp8spu4B9B833lJobgVCKRfM MawPa4AoPK7JvRti4tFFFdmUbtr4LajxRjFH ERBWO7cx43GJ6ColMjDI40vayZSqQ Zl54dK4hqc/nj909Nvb 8Hm9aUmecabYb8Lecyigr3RH/msy NRXW8Le66u2OVepyXyLXHApptPYf2RK42PcqKEawanyjbWAnP8WlT9DaiO/adJ9mEEPIAadtEY/ocN3wSa4=
The final URL is different from this one, which I guess means that a redirect is involved. I can get and parse the returned Document, but is there any way to get the "final", "real" URL too?
That URL is not doing a redirect; it is returning a page which has this meta header:
<meta http-equiv="refresh" content="0; url=https://krb-xjobs.brassring.com/TGWebHost/jobdetails.aspx?PartnerId=30033&SiteID=5635&type=mail&jobId=722158"-->
You can see your "final" URL there.
You can search the document for this tag with (for example) select("meta[http-equiv=refresh]")
and then parse its content attribute.
Summing up:
Document doc = Jsoup.connect("http://jobbank.dk/prefetch_net/s.aspx?c=j&u=/5l/GCyVEQ4dr07BQM6aDvW1I0UefK7VvjHbG5dHDz2P2tCsrbBFYiCBFyAvIdjVnWkl3nwjaUdTp8spu4B9B833lJobgVCKRfM MawPa4AoPK7JvRti4tFFFdmUbtr4LajxRjFH ERBWO7cx43GJ6ColMjDI40vayZSqQ Zl54dK4hqc/nj909Nvb 8Hm9aUmecabYb8Lecyigr3RH/msy NRXW8Le66u2OVepyXyLXHApptPYf2RK42PcqKEawanyjbWAnP8WlT9DaiO/adJ9mEEPIAadtEY/ocN3wSa4=").get();
Elements select = doc.select("meta[http-equiv=refresh]");
String content = select.first().attr("content");
String prefix = "url=";
String url = content.substring(content.indexOf(prefix) + prefix.length());
System.out.println(url);
This will give you your desired URL:
https://krb-xjobs.brassring.com/TGWebHost/jobdetails.aspx?PartnerId=30033&SiteID=5635&type=mail&jobId=722158
I hope it will help.
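A slightly more defensive variant of the parsing step, as a sketch using only the JDK: it matches url= case-insensitively, tolerates optional quotes around the target, and resolves a relative target against the page URL. The class and method names are hypothetical, the base URL in main is shortened for the example, and the regular expression covers common content values rather than every form browsers accept.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaRefresh {

    // Matches e.g. "0; url=https://example.com/" or "5;URL='/next'".
    private static final Pattern CONTENT = Pattern.compile(
            "^\\s*\\d+\\s*[;,]\\s*url\\s*=\\s*['\"]?([^'\"]+?)['\"]?\\s*$",
            Pattern.CASE_INSENSITIVE);

    // Returns the absolute refresh target, or empty if the content attribute carries no URL.
    // Throws IllegalArgumentException if baseUrl or the target is not a valid URI.
    static Optional<String> targetUrl(String baseUrl, String content) {
        Matcher m = CONTENT.matcher(content);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(URI.create(baseUrl).resolve(m.group(1).trim()).toString());
    }

    public static void main(String[] args) {
        // Base URL shortened for the example: the query string from the question is omitted.
        String base = "http://jobbank.dk/prefetch_net/s.aspx";
        String content = "0; url=https://krb-xjobs.brassring.com/TGWebHost/jobdetails.aspx"
                + "?PartnerId=30033&SiteID=5635&type=mail&jobId=722158";
        System.out.println(targetUrl(base, content).orElse("no refresh target"));
    }
}
```

With Jsoup, the content argument would come from doc.select("meta[http-equiv=refresh]"), after checking that the selection is not empty.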

How to parse content with <pre>?

I am using jsoup to parse a number of things.
I am trying to parse this tag
<pre>HEllo Worl<pre>
But I just can't get it to work.
How would I parse this using jsoup?
Document jsDoc = null;
jsDoc = Jsoup.connect(url).get();
Elements titleElements = jsDoc.getElementsByTag("pre");
Here is what I have so far.
Works fine for me with latest Jsoup:
String html = "<p>lorem ipsum</p><pre>Hello World</pre><p>dolor sit amet</p>";
Document document = Jsoup.parse(html);
Elements pres = document.select("pre");
for (Element pre : pres) {
System.out.println(pre.text());
}
Result:
Hello World
If you get nothing, then the HTML which you're parsing simply doesn't contain any <pre> element. Check it yourself by
System.out.println(document.html());
Perhaps the URL is wrong. Perhaps there's some JavaScript which alters the HTML DOM with new elements (Jsoup neither interprets nor executes JS). Perhaps the site expects a real browser instead of a bot (change the user agent then). Perhaps the site requires a login (you'd need to maintain cookies). Who knows. You can figure this all out with a real web browser like Firefox or Chrome.
