Use jsoup to read table content - java

Can anyone help me figure out why I can't use jsoup to read the table at the link below?
http://data.fpt.vn/InfoDNS.aspx?domain=google.com
I use it to get the DNS records of a host.
Here is the code I used:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class dnsjava {
    public static void main(String... args) throws Exception {
        String fpt = "http://data.fpt.vn/InfoDNS.aspx?domain=google.com";
        String espn = "http://espn.go.com/mens-college-basketball/conferences/standings/_/id/2/year/2012/acc-conference";
        Document doc = Jsoup.connect(fpt).get();
        Elements table = doc.select("table.tabular");
        for (Element row : table.select("tr")) {
            Elements tds = row.select("td");
            System.out.println(tds.text());
        }
    }
}
It works with the ESPN URL and doc.select("table.tablehead"), but with the FPT URL nothing happens!
Thank you for your help!

It looks like the content you are seeking is not present in the response. When I did "view source" (in the browser) on that link, "tabular" does not appear anywhere, so
doc.select("table.tabular");
matches nothing. The table is most likely filled in by JavaScript after the page loads, and Jsoup only sees the raw HTML the server returns.

Related

Why "http://www.stackoverflow.com" is not getting parsed but "http://www.javatpoint.com/java-tutorial" is getting parsed

I am trying to learn the basic methods of jsoup. I tried to get all the hyperlinks of a particular web page, but when I used the stackoverflow link I was unable to get any hyperlinks from that page; when I changed it to javatpoint it worked.
Can someone explain why?
Here is the code.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class Repo {
    public static void main(String s[]) throws IOException {
        try {
            Document doc = Jsoup.connect("http://www.javatpoint.com/java-tutorial").get();
            // Document doc = Jsoup.connect("http://www.stackoverflow.com").get();
            Elements link = doc.select("a[href]");
            for (Element el : link) {
                String str = el.attr("href");
                System.out.println(str);
            }
        } catch (Exception e) {
            // exception is silently swallowed here
        }
    }
}
Many websites require valid HTTP requests to carry certain headers. A prominent example is the User-Agent header. Stack Overflow, for example, will work with this:
Document doc = Jsoup
        .connect("http://www.stackoverflow.com")
        .userAgent("Mozilla/5.0")
        .get();
Side note:
You should never catch exceptions and then silently ignore the failure case. At least do some logging there; otherwise your programs will be very hard to debug.
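For instance, a minimal sketch combining both points (the User-Agent header and a catch block that is not silent):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Fetch {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://www.stackoverflow.com")
                    .userAgent("Mozilla/5.0")
                    .get();
            System.out.println(doc.title());
        } catch (IOException e) {
            // At minimum, surface the failure; a real program would use a logger.
            e.printStackTrace();
        }
    }
}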

How to get article content from many URLs/webpages

I have searched everywhere for a correct solution but still couldn't fix this. Please take a look and help me.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NewClass {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://tamilblog.ishafoundation.org").get();
        Elements section = doc.select("section#content");
        Elements article = section.select("article");
        for (Element a : article) {
            System.out.println("Title : \n" + a.select("a").text());
            System.out.println("Article summary: \n" + a.select("div.entry-summary").text());
        }
    }
}
I have the above code for getting an article and its contents from a single page.
Document doc = Jsoup.connect("http://tamilblog.ishafoundation.org").get();
I want to do this for several websites. In this line, or using some iteration, I want to apply my code to several webpages, say 500+, and I want to save each article in a separate text document under its article title, together with its contents.
I am new to coding, so I could not figure out the correct code. I have been working on this for the past two months.
For a start, you can do something like this:
String[] urls = {"http://tamilblog.ishafoundation.org", "url2", "url3"}; // your 500 URLs will be stored here
for (String url : urls) {
    Document doc = Jsoup.connect(url).get();
    Elements section = doc.select("section#content");
    Elements article = section.select("article");
    for (Element a : article) {
        System.out.println("Title : \n" + a.select("a").text());
        System.out.println("Article summary: \n" + a.select("div.entry-summary").text());
    }
}
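To also save each article to its own text file, as the question asks, a hedged sketch might look like the following (the file-name sanitizing and the selectors are assumptions carried over from the question's code):

import java.io.PrintWriter;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SaveArticles {
    public static void main(String[] args) throws Exception {
        String[] urls = {"http://tamilblog.ishafoundation.org"}; // add your 500+ URLs here
        for (String url : urls) {
            Document doc = Jsoup.connect(url).get();
            for (Element a : doc.select("section#content article")) {
                String title = a.select("a").text();
                String summary = a.select("div.entry-summary").text();
                // Replace characters that are unsafe in file names (simplistic assumption)
                String fileName = title.replaceAll("[^a-zA-Z0-9 _-]", "_") + ".txt";
                try (PrintWriter out = new PrintWriter(fileName, "UTF-8")) {
                    out.println(title);
                    out.println(summary);
                }
            }
        }
    }
}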

Remove script in link jsoup

I want to remove the script elements when reading from a URL rather than a file. Please help me.
Document connect = Jsoup.connect("http://www.tutorialspoint.com/ant/ant_deploying_applications.htm").get();
Elements selects = connect.select("div.middle-col");
System.out.println(selects.removeAttr("script").html());
This is how you need to remove the script elements (removeAttr only removes attributes, not child elements):
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestJsoup {
    public static void main(String args[]) throws IOException {
        Document doc = Jsoup.connect("http://www.tutorialspoint.com/ant/ant_deploying_applications.htm").get();
        Elements selects = doc.select("div.middle-col");
        for (Element div : selects) {
            // Select the script elements inside this div and remove them
            Elements scripts = div.select("script");
            scripts.remove();
        }
        System.out.println(selects.html());
    }
}
Additionally, you can use Jsoup.clean(html, whitelist).
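A hedged sketch of that approach (Whitelist.relaxed() keeps common content tags but drops script elements; in newer Jsoup versions the class is called Safelist):

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

String dirty = selects.html();
// relaxed() allows common text/structure tags; <script> is not among them
String clean = Jsoup.clean(dirty, Whitelist.relaxed());
System.out.println(clean);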

Fetch ajax/javascript content using HTMLunit

I have written code which fetches the HTML contents of a page as a response; I am using HtmlUnit to do so. But I am getting errors for some specific URLs like
https://communities.netapp.com/welcome
For the first page I am able to retrieve the contents, but I don't get the content that appears after pressing the "load more" button.
Here's my code:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.net.MalformedURLException;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Sample {
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException, InterruptedException {
        String url = "https://communities.netapp.com/welcome";
        WebClient client = new WebClient(BrowserVersion.INTERNET_EXPLORER_9);
        client.getOptions().setJavaScriptEnabled(true);
        client.getOptions().setRedirectEnabled(true);
        client.getOptions().setThrowExceptionOnScriptError(true);
        client.getOptions().setCssEnabled(true);
        client.getOptions().setUseInsecureSSL(true);
        client.getOptions().setThrowExceptionOnFailingStatusCode(false);
        client.setAjaxController(new NicelyResynchronizingAjaxController());
        HtmlPage page = client.getPage(url);
        String text = page.asText();
        File file = new File("D://write6.txt");
        Writer output = new BufferedWriter(new FileWriter(file));
        output.write(text);
        output.close();
        System.out.println("Your file has been written");
        // System.out.println("as Text == " + page.asText());
        // System.out.println("asXML == " + page.asXml());
        // System.out.println("text content == " + page.getTextContent());
        // System.out.println(page.getWebResponse().getContentAsString());
    }
}
Any suggestions?
As I understand from your question, you have a button which you have to press.
Please look at: http://htmlunit.sourceforge.net/gettingStarted.html
There is an example of submitting a form there; this should be very similar here.
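For a "load more" style button, a hedged sketch along those lines (the XPath is an assumption; inspect the page to find the real id or text of the button):

import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

HtmlPage page = client.getPage("https://communities.netapp.com/welcome");
// Locate the button; getFirstByXPath returns null if nothing matches
HtmlElement loadMore = page.getFirstByXPath("//a[contains(text(), 'load more')]");
if (loadMore != null) {
    HtmlPage updated = loadMore.click();          // fires the button's JavaScript
    client.waitForBackgroundJavaScript(5000);     // give the AJAX call time to finish
    System.out.println(updated.asText());
}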

JSoup error in data types

I have the following code that is supposed to extract data from an HTML document. I am using Eclipse. It gives me two errors (even though this code is copied and pasted from the JSoup site as a tutorial). The errors are in 1) File and 2) Elements. I can't see any problem with these two types.
import java.io.IOException;
import java.net.MalformedURLException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TestClass {
    public static void main(String args[]) throws IOException {
        try {
            File input = new File("/tmp/input.html");
            Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
            Element content = doc.getElementById("content");
            Elements links = content.getElementsByTag("a");
            for (Element link : links) {
                String linkHref = link.attr("href");
                String linkText = link.text();
            }
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
        }
    }
}
You forgot to import them.
import java.io.File;
import org.jsoup.select.Elements;
See also:
Java tutorial - Using package members
Hint: read the "Quick Fix" options suggested by Eclipse; the missing import is already the first suggestion for File.
