I have the following code that is supposed to extract data from an HTML document. I am using Eclipse. It gives me two errors, even though this code was copied and pasted from the jsoup site as a tutorial. The errors are on 1) File and 2) Elements. I can't see any problem with these two types.
import java.io.IOException;
import java.net.MalformedURLException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TestClass {
    public static void main(String args[]) throws IOException {
        try {
            File input = new File("/tmp/input.html");
            Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
            Element content = doc.getElementById("content");
            Elements links = content.getElementsByTag("a");
            for (Element link : links) {
                String linkHref = link.attr("href");
                String linkText = link.text();
            }
        } catch (Exception e) { // catch exception if any
            System.err.println("Error: " + e.getMessage());
        }
    }
}
You forgot to import them.
import java.io.File;
import org.jsoup.select.Elements;
See also:
Java tutorial - Using package members
Hint: read the "Quick Fix" options suggested by Eclipse. It's already the 1st option for File.
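With those two lines added, the complete import block for the example looks like this:
import java.io.File;
import java.io.IOException;
import java.net.MalformedURLException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;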
I am trying to learn the basic methods of jsoup. I tried to get all the hyperlinks of a particular web page. When I used a Stack Overflow link, I was unable to get the hyperlinks on that page, but when I changed it to javatpoint, it worked.
Can someone explain why?
Here is the code.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
class Repo {
    // String html;
    public static void main(String s[]) throws IOException {
        try {
            Document doc = Jsoup.connect("http://www.javatpoint.com/java-tutorial").get();
            // Document doc = Jsoup.connect("http://www.stackoverflow.com").get();
            System.out.println("doc");
            // Elements link = doc.select("span[class]");
            // Elements link = doc.select("span").first();
            // Elements link = doc.select("span");
            Elements link = doc.select("a[href]");
            for (Element el : link) {
                // System.out.print("-");
                // System.out.println(el.attr("class"));
                String str = el.attr("href");
                System.out.println(str);
            }
        } catch (Exception e) {
        }
    }
}
Many websites require valid HTTP requests to carry certain headers. A prominent example is the User-Agent header. Stack Overflow, for example, will work with this:
Document doc = Jsoup
.connect("http://www.stackoverflow.com")
.userAgent("Mozilla/5.0")
.get();
Side note:
You should never catch exceptions and then silently ignore the possible failure case. At least do some logging there; otherwise your programs will be very hard to debug.
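For example, here is a minimal sketch of the same fetch with the failure logged instead of swallowed (the class name FetchWithLogging is just for illustration):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchWithLogging {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://www.stackoverflow.com")
                    .userAgent("Mozilla/5.0")
                    .get();
            System.out.println(doc.title());
        } catch (IOException e) {
            // Record what failed and why, so the failure is visible when debugging
            System.err.println("Failed to fetch page: " + e.getMessage());
            e.printStackTrace();
        }
    }
}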
I have searched everywhere for a correct solution, but I still couldn't fix this. Please take a look at this and help me.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class NewClass {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://tamilblog.ishafoundation.org").get();
        Elements section = doc.select("section#content");
        Elements article = section.select("article");
        for (Element a : article) {
            System.out.println("Title : \n" + a.select("a").text());
            System.out.println("Article summary: \n" + a.select("div.entry-summary").text());
        }
    }
}
I have the above code for getting an article and its contents from a single page.
Document doc = Jsoup.connect("http://tamilblog.ishafoundation.org").get();
I want to do this for several websites. In this line, or using some iteration, I want to apply my code to several web pages, say 500+. And I want to save each article in a separate text document under its article title, with its contents. I am new to coding, so I could not find the correct code. I have been working on this for the past two months.
For starters, you can do something like this:
String[] urls = {"http://tamilblog.ishafoundation.org", "url2", "url3"}; // your 500 urls will be stored here
for (String url : urls) {
    Document doc = Jsoup.connect(url).get();
    Elements section = doc.select("section#content");
    Elements article = section.select("article");
    for (Element a : article) {
        System.out.println("Title : \n" + a.select("a").text());
        System.out.println("Article summary: \n" + a.select("div.entry-summary").text());
    }
}
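To also write each article to its own text file under its title, as the question asks, you could extend that loop along these lines (a minimal sketch; the ArticleSaver class name and the filename sanitizing are my own additions, adjust them as needed):

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ArticleSaver {
    public static void main(String[] args) throws IOException {
        String[] urls = {"http://tamilblog.ishafoundation.org"}; // your 500+ urls go here
        for (String url : urls) {
            Document doc = Jsoup.connect(url).get();
            for (Element a : doc.select("section#content article")) {
                String title = a.select("a").text();
                String summary = a.select("div.entry-summary").text();
                // Keep only characters that are safe in file names (simple heuristic)
                String fileName = title.replaceAll("[^a-zA-Z0-9 _-]", "") + ".txt";
                try (PrintWriter out = new PrintWriter(new FileWriter(fileName))) {
                    out.println(title);
                    out.println(summary);
                }
            }
        }
    }
}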
I want to remove the script elements when reading from a URL, not a file. Please help me.
Document connect = Jsoup.connect("http://www.tutorialspoint.com/ant/ant_deploying_applications.htm");
Elements selects = connect.select("div.middle-col");
System.out.println(selects.removeAttr("script").html());
This is how you need to remove the script elements:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestJsoup {
    public static void main(String args[]) throws IOException {
        Document doc = Jsoup.connect("http://www.tutorialspoint.com/ant/ant_deploying_applications.htm").get();
        Elements selects = doc.select("div.middle-col");
        for (Element div : selects) {
            // Remove every <script> element nested inside this div
            Elements scripts = div.select("script");
            scripts.remove();
        }
        System.out.println(selects.html());
    }
}
Additionally, you can use Jsoup.clean(html, whitelist) with a whitelist that excludes scripts.
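For example, a minimal sketch (note that recent jsoup releases rename Whitelist to Safelist, so pick the class matching your version):

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanExample {
    public static void main(String[] args) {
        String html = "<div>Keep this text.<script>alert('gone');</script></div>";
        // clean() keeps only markup allowed by the whitelist; <script> is never allowed
        String cleaned = Jsoup.clean(html, Whitelist.relaxed());
        System.out.println(cleaned); // the div text survives, the script is stripped
    }
}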
I already tried my own version, grabbing the HTML from Wikipedia's main page like the suggested sample on jsoup.org, but I got a similar error when I was trying to print it out using a simple for loop. It was saying you can't call .size() on Elements.
for(int d=1; d<= newsHeadlines.size(); d++)
Then I tried an example that was posted here, and I get this error:
Exception in thread "main" java.lang.Error: Unresolved compilation problems:
Type mismatch: cannot convert from org.jsoup.select.Elements to javax.lang.model.util.Elements
Can only iterate over an array or an instance of java.lang.Iterable
at grabdatafromHTML.Main.main(Main.java:23)
I am not sure why I get this error for the code below; any help would be much appreciated.
Thanks :)
package grabdatafromHTML;

import java.util.List;
import javax.lang.model.util.Elements;
import org.jsoup.select.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;

public class Main {
    public static void main(String[] args) {
        try {
            String url = "http://en.wikipedia.org/wiki/Data_scraping#Screen_scraping";
            // Download the HTML and store it in a Document
            Document doc = Jsoup.connect(url).get();
            // Select the <p> elements from the document
            Elements paragraphs = doc.select("p");
            // For each selected <p> element, print out its text
            for (Element e : paragraphs) {
                System.out.println(e.text());
            }
        } catch (Exception e) {
            System.out.println("some error");
        }
    }
}
Remove the import
import javax.lang.model.util.Elements;
to allow the class org.jsoup.select.Elements to be used (which you've already imported)
Can anyone help me figure out why I can't use jsoup to read the table at the link below?
http://data.fpt.vn/InfoDNS.aspx?domain=google.com
I use it to get the DNS records of a host.
Here is the code that I used:
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
public class dnsjava {
    public static void main(String... args) throws Exception {
        String fpt = "http://data.fpt.vn/InfoDNS.aspx?domain=google.com";
        String espn = "http://espn.go.com/mens-college-basketball/conferences/standings/_/id/2/year/2012/acc-conference";
        Document doc = Jsoup.connect(fpt).get();
        Elements table = doc.select("table.tabular");
        for (Element row : table.select("tr")) {
            Elements tds = row.select("td");
            System.out.println(tds.text());
        }
    }
}
It works with the espn URL and doc.select("table.tablehead"), but with the fpt URL, nothing happens! Thank you for your help!
It looks like the response you are seeking is not present. When I did "view source" in the browser for that link, the class "tabular" does not appear anywhere in the markup, so
doc.select("table.tabular");
matches nothing. The table is most likely rendered client-side by JavaScript, which jsoup does not execute.