JSoup error in data types

JSoup error in data types - java

I have the following code that is supposed to extract data from HTML document. I used eclipse. It gives me two errors (though, this code is copied and pasted from JSoup site as a tutorial). The errors in 1) File, and 2) Elements. I can't see any problem in these two types.
import java.io.IOException;
import java.net.MalformedURLException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class TestClass
{
public static void main(String args[]) throws IOException
{
try{
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
String linkHref = link.attr("href");
String linkText = link.text();
}
}//try
catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
}//catch
}
}</i>

You forgot to import them.
import java.io.File;
import org.jsoup.select.Elements;
See also:
Java tutorial - Using package members
Hint: read the "Quick Fix" options suggested by Eclipse. It's already the 1st option for File.

Related

Why "http://www.stackoverflow.com" is not getting parsed but "http://www.javatpoint.com/java-tutorial" is getting parsed

I am trying to learn the basic methods of jsoup.I tried to get all the hyperlinks
of a particular web page.But i used stackoverflow link then,i am unable to get all the hyperlinks on that page ,but on the other side when i changed it to
javatpoint it's working.
Can someone explain Why??
Here is the code.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.jsoup.*;
import org.jsoup.nodes.*;
import java.io.*;
import org.jsoup.nodes.Document;
class Repo {
// String html;
public static void main(String s[]) throws IOException {
try {
Document doc = Jsoup.connect("http://www.javatpoint.com/java-tutorial").get();
// Document doc=Jsoup.connect("http://www.stackoverflow.com").get();
System.out.println("doc");
// Elements link=(Elements)doc.select("span[class]");
// Elements link = doc.select("span").first();
// Elements link = (Elements)doc.select("span");
Elements link = (Elements) doc.select("a[href]");
for (Element el : link) {
// System.out.print("-");
// System.out.println(el.attr("class"));
String str = el.attr("href");
System.out.println(str);
}
} catch (Exception e) {
}
}
}

Many websites require valid http requests to carry certain headers. A prominent example is the userAgent header. SO for example will work with this:
Document doc = Jsoup
.connect("http://www.stackoverflow.com")
.userAgent("Mozilla/5.0")
.get();
Side note:
You should never try catch exceptions and then silently ignore the possible fail case. At least do some logging there - otherwise your programs will be very hard to debug.

HOW to get article content from many urls webpages

I have searched more than anything for correct solution still i couldn't fix.
Please look on this & help me.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class NewClass {
public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect("http://tamilblog.ishafoundation.org").get();
Elements section = doc.select("section#content");
Elements article = section.select("article");
for (Element a : article) {
System.out.println("Title : \n" + a.select("a").text());
System.out.println("Article summary: \n" + a.select("div.entry-summary").text());
}
}
}
I have the above code for getting article and its contents from an single page.
Document doc = Jsoup.connect("http://tamilblog.ishafoundation.org").get();
I want to do this for several websites.
In this line or using some iteration i want to apply my code for several webpages say 500+.
And i want to save it in separate text document under its article title and its contents.
I am new to coding so i could not find the correct code.
I was doing this code for past two months to create my code.

For starter you can do something like this,
String[] urls={"http://tamilblog.ishafoundation.org","url2","url3"};//your 500 urls wil be stored here,
for(String url: urls){
Document doc = Jsoup.connect(url).get();
Elements section = doc.select("section#content");
Elements article = section.select("article");
for (Element a : article) {
System.out.println("Title : \n" + a.select("a").text());
System.out.println("Article summary: \n" + a.select("div.entry-summary").text());
}
}

Remove script in link jsoup

I want to remove the script when reading url not file, please help me
Document connect = Jsoup.connect("http://www.tutorialspoint.com/ant/ant_deploying_applications.htm");
Elements selects = connect.select("div.middle-col");
System.out.println(selects.removeAttr("script").html());

This is how you need to remove script element:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class TestJsoup {
public static void main(String args[]) throws IOException {
Document doc = Jsoup.connect("http://www.tutorialspoint.com/ant/ant_deploying_applications.htm").get();
Elements selects = doc.select("div.middle-col");
for (Element script : selects) {
Elements scripts = script.select("script");
scripts.remove();
}
System.out.println(selects.html());
}
}

Additionally, you can use Jsoup.Clean(html,white).

Jsoup and same mismatch error for grabbing HTML

I tried my own one already grabbing html from wikis main page like the suggested sample on JSoup.org but I got a similar error when I was trying to print it out using a simple for loop/ It was saying you cant use.size on Elements.
for(int d=1; d<= newsHeadlines.size(); d++)
Then I tried an example that was posted here and I get this error
Exception in thread "main" java.lang.Error: Unresolved compilation problems:
Type mismatch: cannot convert from org.jsoup.select.Elements to javax.lang.model.util.Elements
Can only iterate over an array or an instance of java.lang.Iterable
at grabdatafromHTML.Main.main(Main.java:23)
Not sure why I get this error for the code down below and help would be much appreciated.
Thanks :)
package grabdatafromHTML;
import java.util.List;
import javax.lang.model.util.Elements;
import org.jsoup.select.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
public class Main {
public static void main(String[] args) {
try{
String url = "http://en.wikipedia.org/wiki/Data_scraping#Screen_scraping";
// Download the HTML and store in a Document
Document doc = Jsoup.connect(url).get();
// Select the <p> Elements from the document
Elements paragraphs = doc.select("p");
// For each selected <p> element, print out its text
for (Element e : paragraphs) {
System.out.println(e.text());
}
}
catch (Exception e){
System.out.println("some error");
}
}
}

Remove the import
import javax.lang.model.util.Elements;
to allow the class org.jsoup.select.Elements to be used (which you've already imported)

Use jsoup to read table content

Can anyone help me figure out why I can't use jsoup to read table in this link below:
http://data.fpt.vn/InfoDNS.aspx?domain=google.com
I use it to get DNS of a host.
Here is the code that I used:
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
public class dnsjava {
public static void main(String... args) throws Exception {
String fpt = "http://data.fpt.vn/InfoDNS.aspx?domain=google.com";
String espn = "http://espn.go.com/mens-college-basketball/conferences/standings/_/id/2/year/2012/acc-conference"
org.jsoup.nodes.Document doc = Jsoup.connect(fpt).get();
Elements table = doc.select("table.tabular");
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
System.out.println(tds.text());
System.out.println(tds.text());
}
}
}
It work with the url of espn and doc.select("table.tablehead"); but with fpt url, nothing happen!
Thank you for your help!

looks like the response you are seeking is not present, when i did the "view source"(in browser) of the link.
doc.select("table.tabular"); //
"tabular" is not present in response.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

JSoup error in data types - java

You forgot to import them. import java.io.File; import org.jsoup.select.Elements; See also: Java tutorial - Using package members Hint: read the "Quick Fix" options suggested by Eclipse. It's already the 1st option for File.

Related

Why "http://www.stackoverflow.com" is not getting parsed but "http://www.javatpoint.com/java-tutorial" is getting parsed

HOW to get article content from many urls webpages

Remove script in link jsoup

Jsoup and same mismatch error for grabbing HTML

Use jsoup to read table content

Categories

Resources