I am trying to run a web scraper in Eclipse that uses Jsoup to take the names of the professors on this page: yu.edu/faculty and print them out. This is my code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class YUscraper {
    public static void main(String[] args) throws IOException {
        // Fetch and parse the HTML
        String url = "http://yu.edu/faculty/";
        Document document = Jsoup.connect(url).get();

        // Extract data
        Element content = document.getElementById("mainlist");
        Elements names = content.getElementsByTag("a");

        // Output data
        for (Element name : names) {
            System.out.println("Name: " + name.text());
        }
    }
}
I am getting this error:
Exception in thread "main" java.lang.NullPointerException
at YUscraper.main(YUscraper.java:18)
I am relatively new to this, so pardon me if I am missing something really evident. I used many examples I have seen to get to this point, but I still don't understand what throws IOException is for and what it means when an exception is thrown. Please help, thanks!
Line 18 is
Elements names = content.getElementsByTag("a");
It seems there is no element with id "mainlist" in the HTML retrieved from http://yu.edu/faculty/. It looks like you meant to target the element with id main-nav instead of mainlist.
On the line Element content = document.getElementById("mainlist"); the variable content comes back null, so calling content.getElementsByTag("a") on it throws the NullPointerException. It looks like the HTML simply has no element with id 'mainlist'.
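(As for throws IOException: it declares that main may propagate the checked exception that Jsoup.connect(url).get() can throw if the fetch fails; the NullPointerException you hit is a different, unchecked exception.) Below is a minimal sketch of the scraper with a null check, so a missing id fails with a clear message instead of an NPE. The id main-nav is only a guess taken from the answer above; verify the actual id in the page source first:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class YUscraper {
    public static void main(String[] args) throws IOException {
        Document document = Jsoup.connect("http://yu.edu/faculty/").get();

        // getElementById returns null when no element on the page has that id,
        // so check before dereferencing.
        Element content = document.getElementById("main-nav"); // assumed id, per the answer above
        if (content == null) {
            System.err.println("No element with the expected id was found on the page.");
            return;
        }

        Elements names = content.getElementsByTag("a");
        for (Element name : names) {
            System.out.println("Name: " + name.text());
        }
    }
}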
So I am trying to make a program that retrieves the iframe tag from a website, opens the link, and downloads the video. Currently it retrieves the iframe tag, but I can't figure out how to ignore the tag markup itself. I am pretty sure I could use .split(), but I don't know how to write a regex that pulls only the data from inside the quotes. I also tried Jsoup's .html(), but it just printed a blank string. Here is what I have (it mostly splits correctly, except that the URL contains "id=...", which causes it to split again):
package com.trentmenard;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Main {
    public static void main(String[] args) {
        Document website;
        try {
            website = Jsoup.connect("https://swordartonlineepisode.com/sword-art-online-season-3-episode-1-english-dubbed-watch-online/").get();
            System.out.println("Website Found! Title: " + website.title());

            Element videoLink = website.select("iframe").first();
            System.out.println("Found Video Link: " + videoLink);

            videoLink.removeAttr("width");
            videoLink.removeAttr("height");
            videoLink.removeAttr("scrolling");
            videoLink.removeAttr("allowfullscreen");
            System.out.println("Modified: " + videoLink);

            String link = videoLink.toString();
            String[] stringArray = link.split("=");
            for (String a : stringArray) {
                System.out.println(a);
            }
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output: https://i.stack.imgur.com/ZXTiV.png
Thanks in advance!
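An aside that may help here: rather than splitting the serialized tag on "=", Jsoup can read the attribute value directly, which avoids the regex problem altogether. A minimal sketch, assuming the iframe carries its link in a src attribute (verify against the actual page):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class IframeSrc {
    public static void main(String[] args) {
        try {
            Document website = Jsoup.connect("https://swordartonlineepisode.com/sword-art-online-season-3-episode-1-english-dubbed-watch-online/").get();
            Element videoLink = website.select("iframe").first();
            if (videoLink != null) {
                // attr("src") returns just the attribute value, with no quotes or markup;
                // absUrl("src") resolves it against the page's base URL.
                System.out.println(videoLink.attr("src"));
                System.out.println(videoLink.absUrl("src"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}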
I am trying to learn the basic methods of Jsoup. I tried to get all the hyperlinks of a particular web page, but when I used a Stack Overflow link I was unable to get any hyperlinks from that page. When I changed it to javatpoint, it worked.
Can someone explain why?
Here is the code.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class Repo {
    // String html;
    public static void main(String s[]) throws IOException {
        try {
            Document doc = Jsoup.connect("http://www.javatpoint.com/java-tutorial").get();
            // Document doc = Jsoup.connect("http://www.stackoverflow.com").get();
            System.out.println("doc");
            // Elements link = doc.select("span[class]");
            // Elements link = doc.select("span").first();
            // Elements link = doc.select("span");
            Elements link = doc.select("a[href]");
            for (Element el : link) {
                // System.out.print("-");
                // System.out.println(el.attr("class"));
                String str = el.attr("href");
                System.out.println(str);
            }
        } catch (Exception e) {
        }
    }
}
Many websites require valid HTTP requests to carry certain headers. A prominent example is the User-Agent header. Stack Overflow, for example, will work with this:
Document doc = Jsoup
        .connect("http://www.stackoverflow.com")
        .userAgent("Mozilla/5.0")
        .get();
Side note:
You should never catch exceptions and then silently ignore the failure case. At least do some logging there; otherwise your programs will be very hard to debug.
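For illustration, a minimal sketch of what that logging might look like with java.util.logging, so no extra dependency is needed (the class name LoggingExample is just a placeholder):

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LoggingExample {
    private static final Logger LOG = Logger.getLogger(LoggingExample.class.getName());

    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://www.stackoverflow.com")
                    .userAgent("Mozilla/5.0")
                    .get();
            System.out.println(doc.title());
        } catch (IOException e) {
            // Record the failure and its stack trace instead of swallowing it.
            LOG.log(Level.WARNING, "Could not fetch page", e);
        }
    }
}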
I already tried my own version, grabbing HTML from Wikipedia's main page like the suggested sample on jsoup.org, but I got a similar error when I was trying to print it out using a simple for loop. It was saying you can't use .size() on Elements.
for(int d=1; d<= newsHeadlines.size(); d++)
Then I tried an example that was posted here, and I get this error:
Exception in thread "main" java.lang.Error: Unresolved compilation problems:
Type mismatch: cannot convert from org.jsoup.select.Elements to javax.lang.model.util.Elements
Can only iterate over an array or an instance of java.lang.Iterable
at grabdatafromHTML.Main.main(Main.java:23)
Not sure why I get this error for the code down below; any help would be much appreciated.
Thanks :)
package grabdatafromHTML;

import java.util.List;
import javax.lang.model.util.Elements;
import org.jsoup.select.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;

public class Main {
    public static void main(String[] args) {
        try {
            String url = "http://en.wikipedia.org/wiki/Data_scraping#Screen_scraping";

            // Download the HTML and store it in a Document
            Document doc = Jsoup.connect(url).get();

            // Select the <p> elements from the document
            Elements paragraphs = doc.select("p");

            // For each selected <p> element, print out its text
            for (Element e : paragraphs) {
                System.out.println(e.text());
            }
        }
        catch (Exception e) {
            System.out.println("some error");
        }
    }
}
Remove the import
import javax.lang.model.util.Elements;
so that the class org.jsoup.select.Elements (which you've already imported via org.jsoup.select.*) is the one that gets used.
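For clarity (not part of the original answer), here is a sketch of the corrected file with explicit Jsoup imports in place of the wildcards; javax.lang.model.util.Elements is an annotation-processing type that has nothing to do with Jsoup:

package grabdatafromHTML;

// javax.lang.model.util.Elements removed: it clashes with Jsoup's Elements.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) {
        try {
            String url = "http://en.wikipedia.org/wiki/Data_scraping#Screen_scraping";
            Document doc = Jsoup.connect(url).get();

            // Elements now resolves to org.jsoup.select.Elements,
            // which is Iterable, so the for-each loop compiles.
            Elements paragraphs = doc.select("p");
            for (Element e : paragraphs) {
                System.out.println(e.text());
            }
        } catch (Exception e) {
            System.out.println("some error");
        }
    }
}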
I have the following code that is supposed to extract data from an HTML document. I used Eclipse. It gives me two errors (even though this code is copied and pasted from the Jsoup site as a tutorial). The errors are on 1) File and 2) Elements. I can't see any problem with these two types.
import java.io.IOException;
import java.net.MalformedURLException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TestClass
{
    public static void main(String args[]) throws IOException
    {
        try {
            File input = new File("/tmp/input.html");
            Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

            Element content = doc.getElementById("content");
            Elements links = content.getElementsByTag("a");
            for (Element link : links) {
                String linkHref = link.attr("href");
                String linkText = link.text();
            }
        } // try
        catch (Exception e) { // catch exception if any
            System.err.println("Error: " + e.getMessage());
        } // catch
    }
}
You forgot to import them.
import java.io.File;
import org.jsoup.select.Elements;
See also:
Java tutorial - Using package members
Hint: read the "Quick Fix" options suggested by Eclipse. It's already the 1st option for File.
My goal is to always get the same string (which is the URI in my case) when reading the href property from a link. Example:
Suppose an HTML file has many links like
<a href="index.html">, where the base domain is http://www.domainname.com/index.html
<a href="../index.html">, where the base domain is http://www.domainname.com/dit/index.html
How can I get all the links correctly, meaning the full link including the domain name?
How can I do that in Java?
The input is HTML; that is, it needs to extract the correct links from a bunch of HTML code.
You can do this using a full-fledged HTML parser like Jsoup. There's Node#absUrl(), which does exactly what you want.
package com.stackoverflow.q3394298;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Test {
    public static void main(String... args) throws Exception {
        // Jsoup.connect() takes the URL as a String.
        String url = "https://stackoverflow.com/questions/3394298/";
        Document document = Jsoup.connect(url).get();

        Element link = document.select("a.question-hyperlink").first();
        System.out.println(link.attr("href"));
        System.out.println(link.absUrl("href"));
    }
}
which prints (correctly) the following for the title link of your current question:
/questions/3394298/full-link-extraction-using-java
https://stackoverflow.com/questions/3394298/full-link-extraction-using-java
Jsoup may have other advantages for your purpose as well that you haven't discovered yet.
Related questions:
What are the pros and cons of the leading HTML parsers in Java?
Update: if you want to select all links in the document, then do as follows:
Elements links = document.select("a");
for (Element link : links) {
    System.out.println(link.attr("href"));
    System.out.println(link.absUrl("href"));
}
Use the URL class; its two-argument constructor resolves a relative spec against a base URL:
URL url = new URL(URL context, String spec)
Here's an example:
import java.net.*;

public class Test {
    public static void main(String[] args) throws Exception {
        URL base = new URL("http://www.java.com/dit/index.html");
        URL url = new URL(base, "../hello.html");
        System.out.println(base);
        System.out.println(url);
    }
}
It will print:
http://www.java.com/dit/index.html
http://www.java.com/hello.html