JSOUP - Crawling Images & Text from URLs Found on a Previously Crawled Page - java

I'm attempting to create a crawler using Jsoup that will...
Go to a web page (specifically, a publicly published Google Sheets page like this one: https://docs.google.com/spreadsheets/d/1CE9HTe2rdgPsxMHj-PxoKRGX_YEOCRjBTIOVtLa_2iI/pubhtml) and collect all href URL links found in each cell.
Next, I want it to go to each individual URL found on that page and crawl THAT URL's headline and main image.
Ideally, if the URLs on the Google Sheets page were, for example, a specific Wikipedia page and a Huffington Post article, it would print out something like:
Link: https://en.wikipedia.org/wiki/Wolfenstein_3D
Headline: Wolfenstein 3D
Image: https://en.wikipedia.org/wiki/Wolfenstein_3D#/media/File:Wolfenstein-3d.jpg
Link: http://www.huffingtonpost.com/2012/01/02/ron-pippin_n_1180149.html
Headline: Ron Pippin’s Mythical Archives Contain History Of Everything (PHOTOS)
Image: http://i.huffpost.com/gen/453302/PIPPIN.jpg
So far, I've got Jsoup working for the first step (pulling the links from the initial URL) using this code:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class mycrawler {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("https://docs.google.com/spreadsheets/d/1CE9HTe2rdgPsxMHj-PxoKRGX_YEOCRjBTIOVtLa_2iI/pubhtml").get();
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                System.out.println(link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I'm now having trouble figuring out how to create the second aspect of the crawler where it cycles through each link (could be a variable number of links) and finds the headline and main image from each.

// Note: this also needs java.util.regex.Pattern and java.util.regex.Matcher imports.
public static void main(String[] args) {
    Document doc;
    String url = "https://docs.google.com/spreadsheets/d/1CE9HTe2rdgPsxMHj-PxoKRGX_YEOCRjBTIOVtLa_2iI/pubhtml";
    try {
        doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String innerurl = link.text();
            if (!innerurl.contains("://")) {
                continue;
            }
            System.out.println("*******");
            System.out.println(innerurl);

            Document innerDoc = Jsoup.connect(innerurl).get();

            Elements headerLinks = innerDoc.select("h1");
            for (Element innerLink : headerLinks) {
                System.out.println("Headline : " + innerLink.text());
            }

            Elements imgLinks = innerDoc.select("img[src]");
            for (Element innerLink : imgLinks) {
                String innerImgSrc = innerLink.attr("src");
                if (innerurl.contains("huffingtonpost") && innerImgSrc.contains("i.huffpost.com/gen")) {
                    System.out.println("Image : " + innerImgSrc);
                }
                if (innerurl.contains("wikipedia")) {
                    Pattern pattern = Pattern.compile("(jpg)$", Pattern.CASE_INSENSITIVE);
                    Matcher matcher = pattern.matcher(innerImgSrc);
                    if (matcher.find()) {
                        System.out.println("Image : " + innerImgSrc);
                        break;
                    }
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Output
*******
https://en.wikipedia.org/wiki/Wolfenstein_3D
Headline : Wolfenstein 3D
Image : //upload.wikimedia.org/wikipedia/en/0/05/Wolfenstein-3d.jpg
*******
http://www.huffingtonpost.com/2012/01/02/ron-pippin_n_1180149.html
Headline : Ron Pippin's Mythical Archives Contain History Of Everything (PHOTOS)
Image : http://i.huffpost.com/gen/453302/PIPPIN.jpg
Image : http://i.huffpost.com/gen/453304/PIPSHIP.jpg

I think you should get the href attribute of the link with link.attr("href") instead of link.text() (on that page the displayed text and the underlying href can differ). Collect all the links into a list, then iterate that list in a second step to fetch each corresponding Document, from which you can extract the headline and image URL.
For Wikipedia pages we can extract the heading with Jsoup as follows:
Element heading = document.select("#firstHeading").first();
System.out.println("Heading : " + heading.text());

Related

Jsoup hyperlink scraping not working for some websites

I've been working on a project recently which involves scraping specific products from websites and reporting their availability status (graphics cards, if anyone is curious).
Using Jsoup, I've been doing this by going through product listing pages, scraping all the links, and filtering out the appropriate ones. For some websites my code works completely fine, but for others only some, or even no, links are scraped.
Working example:
https://www.bhphotovideo.com/c/buy/Graphic-Cards/ci/6567
Non-Working example:
https://www.bestbuy.com/site/computer-cards-components/video-graphics-cards/abcat0507002.c?id=abcat0507002
https://www.evga.com/products/productlist.aspx?type=0
Here is the snippet of code in charge of scraping the links:
public class LinkScrapeLite {
    public static void main(String[] args) {
        try {
            // EVGA gives me no output whatsoever
            Document doc = Jsoup.connect("https://www.evga.com/products/productlist.aspx?type=0").get();
            String title = doc.title();
            System.out.println("title: " + title);
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // get the value from the href attribute
                System.out.println("\nlink: " + link.attr("href"));
                System.out.println("text: " + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I understand that what I'm doing is by no means efficient, so if anyone has suggestions for how I could do this in a better way, please let me know :)
In this case you need a library that can wait for JavaScript to finish loading; for example, we can use HtmlUnit.
Here is the solution for the EVGA site:
String url = "https://www.evga.com/products/productlist.aspx?type=0";
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setPrintContentOnFailingStatusCode(false);
    HtmlPage htmlPage = webClient.getPage(url);
    webClient.waitForBackgroundJavaScript(1000);
    webClient.waitForBackgroundJavaScriptStartingBefore(1000);
    final List<DomElement> hrefs = htmlPage.getByXPath("//a");
    for (DomElement element : hrefs) {
        System.out.println(element.getAttribute("href"));
    }
}

Extracting user details from facebook page

I am extracting details from a page I administer. I tried using Jsoup to extract the links and then, from those, the names of users, but it's not working: it only shows links other than user links. I tried extracting names from this link
https://www.facebook.com/plugins/fan.php?connections=100&id=pageid
which works quite well, but it does not work for this link
https://www.facebook.com/browse/?type=page_fans&page_id=
Can anyone help me? Below is the code I tried.
doc = Jsoup.connect("https://www.facebook.com/browse/?type=page_fans&page_id=mypageid").get();
Elements els = doc.getElementsByClass("fsl fwb fcb");
Elements link = doc.select("a[href]");
for (Element ele : link) {
    System.out.println(ele.attr("href"));
}
Try this:
Document doc = Jsoup.connect("https://www.facebook.com/plugins/fan.php?connections=100&id=pageid").timeout(0).get();
Elements nameLinks = doc.getElementsByClass("link");
for (Element users : nameLinks) {
    String name = users.attr("title");
    String url = users.attr("href");
    System.out.println(name + "-" + url);
}
It will give the name and URL of every user present at the first link in your question.

Finding a specific file on a site using jsoup

So I'm trying to create a little program that updates a World of Warcraft addon for me. I'm using Jsoup to get a list of links on a specific site. How do I ignore files/links that don't end in .zip?
This is my link list so far; as you can see, it will print a list of all the links on the site. The goal is to find only the .zip files (there are only two) and then download one of them. The direct download link changes every time they update the addon, so I can't just download a fixed link; I need to find the latest version every time.
public static void LinkList() {
    Document doc;
    try {
        doc = Jsoup.connect("http://www.tukui.org/dl.php").get();
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            System.out.println("\nlink : " + link.attr("href"));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
You can use the [attr$=value] selector to check whether an attribute ends with a given value:
Elements links = doc.select("a[href$=zip]");
Demo:
Document doc = Jsoup.connect("http://www.tukui.org/dl.php").get();
Elements links = doc.select("a[href$=zip]");
List<String> list = new ArrayList<>();
for (Element link : links) {
    System.out.println("link : " + link.attr("href"));
    list.add(link.attr("href"));
}
String[] arr = list.toArray(new String[list.size()]);
System.out.println("array content:" + Arrays.toString(arr));
Output:
link : http://www.tukui.org/downloads/tukui-15.79.zip
link : http://www.tukui.org/downloads/elvui-6.82.zip
link : /client/win/tc2430.zip
array content:[http://www.tukui.org/downloads/tukui-15.79.zip, http://www.tukui.org/downloads/elvui-6.82.zip, /client/win/tc2430.zip]
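Since the question also asks about downloading one of the files, here is a hedged sketch of fetching the first matching .zip with Jsoup itself: ignoreContentType(true) and maxBodySize(0) let Jsoup return the raw bytes, and the local file name addon.zip is just an assumed example.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AddonDownloader {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.tukui.org/dl.php").get();
        // first link whose href ends in "zip"; absUrl resolves relative download paths
        Element zipLink = doc.select("a[href$=zip]").first();
        if (zipLink == null) {
            System.out.println("No .zip links found");
            return;
        }
        String zipUrl = zipLink.absUrl("href");
        System.out.println("Downloading " + zipUrl);

        // ignoreContentType lets Jsoup fetch a non-HTML response; maxBodySize(0) removes the size cap
        Connection.Response response = Jsoup.connect(zipUrl)
                .ignoreContentType(true)
                .maxBodySize(0)
                .execute();
        Files.write(Paths.get("addon.zip"), response.bodyAsBytes()); // assumed local file name
        System.out.println("Saved addon.zip (" + response.bodyAsBytes().length + " bytes)");
    }
}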

Jsoup get href within a class

I have this HTML code that I need to parse:
<a class="sushi-restaurant" href="/greatSushi">Best Sushi in town</a>
I know there's a Jsoup example that gets all the links on a page, e.g.
Elements links = doc.select("a[href]");
for (Element link : links) {
    print(" * a: <%s> (%s)", link.attr("abs:href"), trim(link.text(), 35));
}
but I need a piece of code that can return the href for that specific class.
Thanks, guys.
You can select elements by class. This example finds elements with the class sushi-restaurant, then gets the absolute URL of the first result.
Make sure that when you parse the HTML, you specify the base URL (where the document was fetched from) to allow jsoup to determine what the absolute URL of a link is.
public static void main(String[] args) {
    String html = "<a class=\"sushi-restaurant\" href=\"/greatSushi\">Best Sushi in town</a>";
    Document doc = Jsoup.parse(html, "http://example.com/");
    // find all <a class="sushi-restaurant">...
    Elements links = doc.select("a.sushi-restaurant");
    Element link = links.first();
    // 'abs:' makes "/greatSushi" = "http://example.com/greatSushi":
    String url = link.attr("abs:href");
    System.out.println("url = " + url);
}
Shorter version:
String url = doc.select("a.sushi-restaurant").first().attr("abs:href");
Hope this helps!
Elements links = doc.select("a");
for (Element link : links) {
    String attribute = link.attr("class");
    if (attribute.equalsIgnoreCase("sushi-restaurant")) {
        System.out.println(link.attr("href")); // You probably need this
    }
}

How to extract full URLs from all paragraphs in a webpage using jsoup

How do I extract full URLs from all paragraphs on a web page using Jsoup? I am able to extract only the relative URLs.
Expected:
http://fr.wikipedia.org/wiki/Husni_al-Zaim
Actual: /Husni_al-Zaim
My Code:
Elements links = doc.select("p");
Elements linkss = links.select("a");
for (Element link : linkss) {
    if (link.text().matches("^[A-Z].+") == true) {
        list.add(new NamedLink(link.attr("href"), link.text()));
    }
}
Use .absUrl("href") instead of .attr("href"). This only works when you get the document from a webpage or parse the full file from disk (and thus do not massage portions from HTML to text and back as in your example).
Document document = Jsoup.connect("http://stackoverflow.com").get();
Elements paragraphLinks = document.select("p a");
for (Element paragraphLink : paragraphLinks) {
    String absUrl = paragraphLink.absUrl("href");
    // ...
}
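Applied to the question's loop, a self-contained sketch might look like this (printing instead of storing in the asker's NamedLink class, and using a French Wikipedia article as an assumed source page):

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AbsUrlExample {
    public static void main(String[] args) throws IOException {
        // the base URI comes from the connection, so absUrl can resolve relative hrefs
        Document doc = Jsoup.connect("https://fr.wikipedia.org/wiki/Syrie").get();
        Elements paragraphLinks = doc.select("p a[href]");
        for (Element link : paragraphLinks) {
            if (link.text().matches("^[A-Z].+")) {
                // "/wiki/Husni_al-Zaim" becomes "https://fr.wikipedia.org/wiki/Husni_al-Zaim"
                System.out.println(link.absUrl("href") + " -> " + link.text());
            }
        }
    }
}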
