Right now I'm trying to implement a WebCrawler that lists out every file (images, etc.) and file extension (.jpg, .png, etc.) contained within a website using Jsoup, and I don't know how to extract the file elements from a URL.
So far I know how to get all the text contained in a page by doing something like this:
val doc = Jsoup.connect(link).get
val body: Element = doc.body()
val allText: String = body.text
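For what it's worth, here is a minimal Java sketch of what such a listing could look like, based on the same kind of src selectors that come up later in this thread; the example URL, the selector, and the crude extension check are my own assumptions rather than anything from the question:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FileLister {
    public static void main(String[] args) throws Exception {
        String link = "https://example.com/"; // hypothetical site to crawl
        Document doc = Jsoup.connect(link).get();

        // Anything that points at another resource via src or href
        for (Element el : doc.select("[src], [href]")) {
            String url = el.hasAttr("src") ? el.attr("abs:src") : el.attr("abs:href");
            int dot = url.lastIndexOf('.');
            int slash = url.lastIndexOf('/');
            if (dot > slash && dot < url.length() - 1) { // crude check: last path segment has an extension
                System.out.println(url + "  ->  " + url.substring(dot));
            }
        }
    }
}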
I'm using JSoup to read content from the following page:
https://www.astrology.com/horoscope/daily/aries.html#Monday
This is the code that I'm using:
String test1 = "https://www.astrology.com/horoscope/daily/aries.html#Monday";
String test2 = "https://www.astrology.com/horoscope/daily/aries.html#Tuesday";
Document document = Jsoup.connect(test1).get();
Element content = document.getElementById("content");
Element p = content.child(0);
String myTest = p.text();
In the URL I can pass the day with an anchor (see the test1 and test2 variables), but in both cases it returns the same content; it looks like Jsoup is simply ignoring the anchor and just using the base URL: https://www.astrology.com/horoscope/daily/aries.html. Is there a way for Jsoup to read a URL with an anchor?
Jsoup ignores the anchor because the fragment is never sent to the server; the day-specific information is rendered with JavaScript, which Jsoup cannot execute. If you examine the page with your browser's dev tools you'll see that the daily info is fetched from a JSON endpoint, such as https://www.astrology.com/horoscope/daily/all/aries/2021-03-23/, so you can easily change the date/sign and get whatever you like.
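As a rough sketch of that approach (the endpoint pattern is the one observed in the dev tools above; that it returns JSON for any date/sign is an assumption about the current site), you can download it with Jsoup by ignoring the content type and then hand the string to a JSON library:

import org.jsoup.Jsoup;

public class DailyHoroscope {
    public static void main(String[] args) throws Exception {
        // Date and sign are examples; adjust the path segments as needed
        String url = "https://www.astrology.com/horoscope/daily/all/aries/2021-03-23/";
        String json = Jsoup.connect(url)
                .ignoreContentType(true) // the response is JSON, not HTML
                .execute()
                .body();
        System.out.println(json); // parse with a JSON library of your choice
    }
}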
I'm new to using Jsoup and I am struggling to retrieve the tables with the class name verbtense and the headers Present and Past, under the div named Indicative, from this site: https://www.verbix.com/webverbix/Swedish/misslyckas
I started off trying the following, but there are no results from the get-go:
Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements tables = document.select("table[class=verbtense]"); // empty
I also tried this, but again no results:
Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements divs = document.select("div");
if (!divs.isEmpty()) {
    for (Element div : divs) {
        // all of these are empty
        Elements verbTenses = div.getElementsByClass("verbtense");
        Elements verbTables = div.getElementsByClass("verbtable");
        Elements tables = div.getElementsByClass("table verbtable");
    }
}
What am I doing incorrectly?
The page you are trying to scrape has dynamically generated content on the client side (with JavaScript), therefore you won't be able to extract the data using that link.
You might be able to scrape some content from the API call that this webpage is making, e.g. https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas
Inspect the browser console to see what the page is doing, and do the same.
The first catch is that this page loads its content asynchronously using AJAX and uses JavaScript to add the content to the DOM. You can even see the loader for a short time.
Jsoup can't parse and execute JavaScript so all you get is the initial page :(
The next step would be to check what the browser is doing and where this additional content comes from. You can check it using Chrome's debugger (Ctrl + Shift + i). If you open the Network tab, filter for XHR requests and refresh the page, you can see two requests:
One of them fetches this content: https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas
As you can see, it's JSON containing HTML fragments, and this content seems to have the verb forms you need. But here's another catch: unfortunately Jsoup can't parse JSON :( So you'll have to use another library to get the HTML fragment and then you can parse it with Jsoup.
The general advice for downloading JSON with Jsoup is to ignore the content type (otherwise Jsoup will complain it doesn't support JSON):
String json = Jsoup.connect("https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas").ignoreContentType(true).execute().body();
Then you'll have to use some JSON parsing library, for example json-simple, to obtain the HTML fragment, and then you can parse it with Jsoup:
import org.json.simple.JSONObject;
import org.json.simple.JSONValue;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String json = Jsoup.connect(
        "https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas")
        .ignoreContentType(true).execute().body();
System.out.println(json);
// the HTML fragment sits under the "p1" object's "html" key in the JSON response
JSONObject jsonObject = (JSONObject) JSONValue.parse(json);
String htmlFragmentObtainedFromJson = (String) ((JSONObject) jsonObject.get("p1")).get("html");
Document document = Jsoup.parse(htmlFragmentObtainedFromJson);
System.out.println(document);
Now you can try your initial approach of using selectors to get what you want from the document object.
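To close the loop, here is a small sketch of that last step; it assumes the fragment really contains tables carrying the verbtense class, which is the markup the original question was selecting for:

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class VerbTablePrinter {
    // "document" is the Document parsed from the JSON's "html" fragment above
    static void printVerbTables(Document document) {
        // table.verbtense matches tables that have verbtense among their classes (an assumption)
        for (Element table : document.select("table.verbtense")) {
            System.out.println(table.text());
        }
    }
}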
I'm trying to get all the images of a website, but why does Jsoup not only get the images of the page but also create a document named like the part of the link after the slash?
Elements imageElements = document.select("img[src$=.png], img[src$=.jpg], img[src$=.jpeg]");
for (Element imageElement : imageElements) {
    String strImageURL = imageElement.attr("abs:src");
    // ... download / save strImageURL here
}
I found a way to fix it; it is probably a prnt.sc issue.
Instead of selecting everything with the "img" tag like this:
Elements img = doc.getElementsByTag("img");
I selected everything with the file endings png, jpg and jpeg. Here is the code:
Elements imageElements = document.select("img[src$=.png], img[src$=.jpg], img[src$=.jpeg]");
I have just started working on a content extraction project. First I am trying to get the image URLs in a webpage. In some cases, the "src" attribute of "img" has a relative URL, but I need the complete URL.
I was looking for a Java library to achieve this and thought Jsoup would be useful. Is there any other library that achieves this easily?
If you just need to get the complete URL from a relative one, the solution is simple in Java:
import java.net.URL;

URL pageUrl = new URL("https://example.com/articles/page.html"); // the base URL of the HTML page
String src = "../images/photo.jpg";                              // relative or absolute URL from the src attribute
URL imgUrl = new URL(pageUrl, src);                              // resolves to https://example.com/images/photo.jpg
The base URL of the HTML page is usually just the URL you obtained the HTML code from. However, a <base> tag in the document header may specify a different base URL (though it is not used very frequently).
You may use Jsoup or just a DOM parser to obtain the src attribute values and to find a possible <base> tag.
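If you do go with Jsoup, its abs: attribute prefix already performs this resolution for you (and, as far as I know, it also takes a <base href> tag into account when one is present); a minimal sketch with a hypothetical page URL:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteImageUrls {
    public static void main(String[] args) throws Exception {
        // Hypothetical page; any page with <img> tags will do
        Document doc = Jsoup.connect("https://example.com/gallery/index.html").get();
        for (Element img : doc.select("img[src]")) {
            // abs:src resolves a relative src against the document's base URI
            System.out.println(img.attr("abs:src"));
        }
    }
}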
Hi, I am looking for a simple URL & title extractor for HTML files in Java. I am trying to parse bookmarks.html (IE, Firefox, etc.) and add the title & URL to a DB. I need to do this in Java (no 3rd-party libraries allowed), so probably I have to use SAX/DOM/regex.
You can load the file into a DOM document and then use an XPath expression to find all the instances of an <a> tag. Extracting the HREF attribute and the tag contents should do what you want. The XPath would probably be something as simple as '//A'.
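A rough sketch of that approach using only the JDK, under the assumption that the bookmarks file is well-formed enough for the standard XML parser (browser exports often are not, so some tidying may be needed first); the file name, the '//A' expression, and the HREF spelling are taken from the question and answer above:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class BookmarkExtractor {
    public static void main(String[] args) throws Exception {
        // Parse the exported bookmarks file (assumes it is well-formed XML/XHTML)
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("bookmarks.html"));

        // '//A' as suggested above; use '//a' if the export uses lowercase tags
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList anchors = (NodeList) xpath.evaluate("//A", doc, XPathConstants.NODESET);

        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            System.out.println(a.getAttribute("HREF") + " -> " + a.getTextContent());
        }
    }
}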