I already tried my own version, grabbing the HTML from Wikipedia's main page like the sample suggested on jsoup.org, but I got a similar error when I tried to print it out using a simple for loop. It said you can't use .size() on Elements.
for(int d=1; d<= newsHeadlines.size(); d++)
Then I tried an example that was posted here, and I get this error:
Exception in thread "main" java.lang.Error: Unresolved compilation problems:
Type mismatch: cannot convert from org.jsoup.select.Elements to javax.lang.model.util.Elements
Can only iterate over an array or an instance of java.lang.Iterable
at grabdatafromHTML.Main.main(Main.java:23)
I'm not sure why I get this error for the code below; any help would be much appreciated.
Thanks :)
package grabdatafromHTML;
import java.util.List;
import javax.lang.model.util.Elements;
import org.jsoup.select.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
public class Main {
    public static void main(String[] args) {
        try{
            String url = "http://en.wikipedia.org/wiki/Data_scraping#Screen_scraping";
            // Download the HTML and store in a Document
            Document doc = Jsoup.connect(url).get();
            // Select the <p> Elements from the document
            Elements paragraphs = doc.select("p");
            // For each selected <p> element, print out its text
            for (Element e : paragraphs) {
                System.out.println(e.text());
            }
        }
        catch (Exception e){
            System.out.println("some error");
        }
    }
}
Remove the import
import javax.lang.model.util.Elements;
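so that Elements resolves to org.jsoup.select.Elements, which you've already imported through org.jsoup.select.*. The explicit single-type import takes precedence over the wildcard import, which is why the compiler picked javax.lang.model.util.Elements.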
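A minimal sketch of the corrected imports, assuming jsoup is on the classpath. To stay runnable offline it parses an inline HTML string instead of fetching Wikipedia; the class name ElementsImportDemo and the sample HTML are made up for illustration. Since org.jsoup.select.Elements extends ArrayList<Element>, both the for-each loop and .size() compile once the javax import is gone.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements; // jsoup's Elements, not javax.lang.model.util.Elements

// Hypothetical demo class; the HTML passed in main is invented sample input.
class ElementsImportDemo {

    // Collects the text of every <p> element in the given HTML.
    static List<String> paragraphTexts(String html) {
        Document doc = Jsoup.parse(html);
        Elements paragraphs = doc.select("p");

        List<String> texts = new ArrayList<>();
        for (Element p : paragraphs) {
            texts.add(p.text());
        }
        return texts;
    }

    // Elements is a list, so size() is available; valid indices run from 0 to size() - 1.
    static int countParagraphs(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>First</p><p>Second</p></body></html>";
        System.out.println(countParagraphs(html) + " paragraphs");
        for (String text : paragraphTexts(html)) {
            System.out.println(text);
        }
    }
}
```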
Related
I am trying to make a program that retrieves the iframe tag from a website, opens the link and downloads the video. Currently, it retrieves the iframe tag, but I can't figure out how to ignore the actual tags. I am pretty sure I can use .split(), but I don't know how to write a regex that only pulls the data from inside the quotes. I also tried using JSoup's .html(), but it just printed a blank line. Here is what I have (it mostly splits correctly, except that the URL contains "id=...", which causes it to split again):
package com.trentmenard;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Main {
    public static void main(String[] args) {
        Document website;
        try{
            website = Jsoup.connect("https://swordartonlineepisode.com/sword-art-online-season-3-episode-1-english-dubbed-watch-online/").get();
            System.out.println("Website Found! Title: " + website.title());
            Element videoLink = website.select("iframe").first();
            System.out.println("Found Video Link: " + videoLink);
            videoLink.removeAttr("width");
            videoLink.removeAttr("height");
            videoLink.removeAttr("scrolling");
            videoLink.removeAttr("allowfullscreen");
            System.out.println("Modified: " + videoLink);
            String link = videoLink.toString();
            String[] stringArray = link.split("=");
            for(String a : stringArray){
                System.out.println(a);
            }
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output: https://i.stack.imgur.com/ZXTiV.png
Thanks in advance!
I am trying to learn the basic methods of jsoup. I tried to get all the hyperlinks of a particular web page. When I use the Stack Overflow link, I am unable to get all the hyperlinks on that page, but when I change it to javatpoint, it works.
Can someone explain why?
Here is the code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.jsoup.*;
import org.jsoup.nodes.*;
import java.io.*;
import org.jsoup.nodes.Document;
class Repo {
    // String html;
    public static void main(String s[]) throws IOException {
        try {
            Document doc = Jsoup.connect("http://www.javatpoint.com/java-tutorial").get();
            // Document doc=Jsoup.connect("http://www.stackoverflow.com").get();
            System.out.println("doc");
            // Elements link=(Elements)doc.select("span[class]");
            // Elements link = doc.select("span").first();
            // Elements link = (Elements)doc.select("span");
            Elements link = (Elements) doc.select("a[href]");
            for (Element el : link) {
                // System.out.print("-");
                // System.out.println(el.attr("class"));
                String str = el.attr("href");
                System.out.println(str);
            }
        } catch (Exception e) {
        }
    }
}
Many websites require HTTP requests to carry certain headers. A prominent example is the User-Agent header. Stack Overflow, for example, will work with this:
Document doc = Jsoup
        .connect("http://www.stackoverflow.com")
        .userAgent("Mozilla/5.0")
        .get();
Side note:
You should never catch exceptions and then silently ignore the failure. At least do some logging there; otherwise your programs will be very hard to debug.
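As a rough illustration of that side note, here is a sketch that logs the failure instead of swallowing it. It uses only java.util.logging; the LoggedFetch class and the PageSource interface are hypothetical stand-ins for a call such as Jsoup.connect(url).get(), so the example runs without network access.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of jsoup.
class LoggedFetch {
    private static final Logger LOG = Logger.getLogger(LoggedFetch.class.getName());

    // Assumed abstraction over "something that downloads a page and may fail".
    interface PageSource {
        String fetch(String url) throws IOException;
    }

    // Returns the page body, or null after logging the failure with its stack trace.
    static String fetchOrNull(PageSource source, String url) {
        try {
            return source.fetch(url);
        } catch (IOException e) {
            // Log the URL and the exception so the failure is visible when debugging.
            LOG.log(Level.WARNING, "Could not fetch " + url, e);
            return null;
        }
    }

    public static void main(String[] args) {
        // Simulate a failing download.
        String body = fetchOrNull(u -> { throw new IOException("connection refused"); },
                "http://www.stackoverflow.com");
        System.out.println(body == null ? "fetch failed, see log" : body);
    }
}
```

Whether to return null, return an empty result, or rethrow is a design choice; the point is only that the catch block records what went wrong.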
I am trying to run a web scraper in Eclipse, using Jsoup, that takes the names of the professors on this page: yu.edu/faculty and prints them out. This is my code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class YUscraper {
    public static void main(String[] args) throws IOException {
        // fetches & parses HTML
        String url = "http://yu.edu/faculty/";
        Document document = Jsoup.connect(url).get();
        // Extract data
        Element content = document.getElementById("mainlist");
        Elements names = content.getElementsByTag("a");
        // Output data
        for (Element name : names) {
            System.out.println("Name: " + name.text());
        }
    }
}
I am getting this error:
Exception in thread "main" java.lang.NullPointerException
at YUscraper.main(YUscraper.java:18)
I am relatively new to this so pardon me if I am missing something really evident. I used many examples I have seen to get to this point, but I still don't understand what throws IOException is for and what it means that an exception was found. Please help, thanks!
Line 18 is
Elements names = content.getElementsByTag("a");
It seems there is no element with id "mainlist" in the HTML retrieved from http://yu.edu/faculty/.
Perhaps you meant to access the element main-nav instead of mainlist.
In the line Element content = document.getElementById("mainlist");
content is returned as null, so calling getElementsByTag on it throws the NullPointerException. It looks like the HTML has no element with the id 'mainlist'.
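A minimal sketch of guarding against that null, assuming jsoup is on the classpath. It parses an inline HTML string rather than the real page, so the markup, the names in it, and the NullSafeLookup class are invented for illustration; only the fact that getElementById returns null for a missing id is taken from jsoup.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the HTML in main is invented sample input.
class NullSafeLookup {

    // Returns the text of every <a> inside the element with the given id,
    // or an empty list when no such element exists.
    static List<String> linkTexts(String html, String id) {
        Document document = Jsoup.parse(html);
        Element content = document.getElementById(id); // null if the id is absent

        List<String> names = new ArrayList<>();
        if (content == null) {
            System.err.println("No element with id \"" + id + "\" in the document");
            return names;
        }
        for (Element link : content.getElementsByTag("a")) {
            names.add(link.text());
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<div id='main-nav'><a href='#'>Alice</a><a href='#'>Bob</a></div>";
        System.out.println(linkTexts(html, "mainlist")); // id not present
        System.out.println(linkTexts(html, "main-nav")); // id present
    }
}
```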
I have the following code that is supposed to extract data from an HTML document. I am using Eclipse. It gives me two errors (even though this code is copied and pasted from a tutorial on the jsoup site). The errors are on 1) File and 2) Elements. I can't see any problem with these two types.
import java.io.IOException;
import java.net.MalformedURLException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class TestClass
{
    public static void main(String args[]) throws IOException
    {
        try{
            File input = new File("/tmp/input.html");
            Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
            Element content = doc.getElementById("content");
            Elements links = content.getElementsByTag("a");
            for (Element link : links) {
                String linkHref = link.attr("href");
                String linkText = link.text();
            }
        }//try
        catch (Exception e){//Catch exception if any
            System.err.println("Error: " + e.getMessage());
        }//catch
    }
}
You forgot to import them.
import java.io.File;
import org.jsoup.select.Elements;
See also:
Java tutorial - Using package members
Hint: read the "Quick Fix" options suggested by Eclipse. It's already the 1st option for File.
import org.jsoup.*;
import org.w3c.dom.Document;
public class jsoup {
    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p id='xxx'>Parsed HTML into a doc.</p></body></html>";
        Document doc = (Document)Jsoup.parse(html);
        Element el = doc.getElementById("xxx");
    }
}
When I run the code above, I receive this error:
Element cannot be resolved to a type in line "Element el = doc.getElementById("xxx");"
Can you help me?
That's just a compilation error. You need to import Element.
import org.jsoup.nodes.Element;
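Read the Jsoup javadocs for all packages and classes; they are linked from the Jsoup home page. Please also note that Jsoup doesn't use Document from org.w3c.dom. Remove that import and the unnecessary cast, and import org.jsoup.nodes.Document instead (the wildcard import org.jsoup.* does not cover subpackages).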
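Put together, the corrected program might look like the sketch below, assuming jsoup is on the classpath. The class name FirstParse and the textById helper are made up for illustration; the HTML string is the one from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document; // jsoup's Document, not org.w3c.dom.Document
import org.jsoup.nodes.Element;

// Hypothetical class name; the helper is only for illustration.
class FirstParse {

    // Returns the text of the element with the given id, or null if there is none.
    static String textById(String html, String id) {
        Document doc = Jsoup.parse(html); // no cast needed
        Element el = doc.getElementById(id);
        return el == null ? null : el.text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p id='xxx'>Parsed HTML into a doc.</p></body></html>";
        System.out.println(textById(html, "xxx"));
    }
}
```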