Getting multiple tables from HTML using Jsoup - java

I am trying to scrape data from multiple tables on this website: http://www.national-autograss.co.uk/march.htm
I need to keep the table data together with their respective dates located in h2 so I would like a way to do the following:
Find first date header h2
Extract table data beneath h2 (can be multiple tables)
Move on to next header and extract tables etc
I have written code to extract all parts separately but I do not know how to extract the data so that it stays with the relevant date header.
Any help or guidance would be much appreciated. The code I am starting with is below but like I said all it is doing is iterating through the data.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Main {
public static void main(String[] args) {
Document doc = null;
try {
doc = Jsoup.connect("http://www.national-autograss.co.uk/march.htm").get();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Elements elementsTable1 = doc.select("#table1");
Elements elementsTable2 = doc.select("#table2");
Elements dateElements = doc.select("h2");
for (int i = 0; i < dateElements.size(); i++) {
System.out.println(dateElements.get(i).text());
System.out.println(elementsTable1.get(i).text());
System.out.println(elementsTable2.get(i).text());
}
}
}

It seems that the values that you want are stored inside <tr>'s in a table where in every table the first child is a <h2>.
<table align="center"><col width="200"><col width="150"><col width="100"><col width="120"><col width="330"><col width="300">
<h2>Sunday 30 March</h2>
<tr id="table1">
<td><b>Club</b></td>
<td><b>Venue</b></td>
<td><b>Start Time</b></td>
<td><b>Meeting Type</b></td>
<td><b>Number of Days for Meeting</b></td>
<td><b>Notes</b></td>
</tr>
<tr id="table2">
<td>Evesham</td>
<td>Dodwell</td>
<td>11:00am</td>
<td>RO</td>
<td>Single Days Racing</td>
<td></td>
</tr>
</table>
My suggestion is that you search for all tables, when first child is a h2 you do something with the rest of its children:
Elements tables = doc.select("table");
for(Element table : tables) {
if(table.child(0).tagName().equals("h2")) {
Elements children = table.children()
}
}
Hope this helps!
EDIT : You want to remove all <col> before the <h2> as they will appear before it (did not notice this before):
for(Element element : doc.select("col"))
{
element.remove();
}

Related

Why is JSoup printing a question mark

I'm trying to understand the following. I have some code reading a page from gutenberg.org. Almost everything is ok but some characters are not. They are ok in the browser.
package nl.atticworks.gutenberg;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class Gutenberg {
private static final String GET_URL = "http://www.gutenberg.org/browse/languages/nl";
public static void main(String[] args) {
try {
Document doc = Jsoup.connect(GET_URL).get();
Elements data = doc.select("div.pgdbbylanguage");
for (Element d : data) {
Elements children = d.select("*");
for (Element child : children) {
if (child.tagName().equals("ul")) {
Element author = children.get(children.indexOf(child) - 1);
String a1 = author.select("a:last-child").text();
if (a1.startsWith("Kara")) {
System.out.println(a1);
Elements titles = child.select("li.pgdbetext a");
for (Element title : titles) {
System.out.println("\t" + title.text());
}
}
}
}
}
} catch (IOException ex) {
// do something...
}
}
}
The string a1 prints "Karadži?, Vuk Stefanovi?, 1787-1864" but should print "Karadžić, Vuk Stefanović, 1787-1864"
I'm pretty sure that the encoding is ok (UTF-8) but the c with acute isn't encoded properly.
Still, browsers do show the correct char, Jsoup doesn't. Why?
Regards,
Hans
As you haven't said what you are running your program in it is difficult to give a definitive answer, but basically there is nothing wrong with your code. JSoup is not responsible for your display problem, whichever console you are displaying on is the problem.
If you set your console (or IDE) to the UTF-8 encoding it should display correctly.
I tried this code on my own IDEA,and the output was just as you expected.
So I insist that the encoding is the problem.

Trying to use jSoup to scrape data from a table

First time poster and fairly new coder, so please go easy on me. I'm trying to use jSoup to scrape data from a table. However, I'm having a couple problems:
1) I'm using NetBeans. I get a "stop" error on Line 30 (Elements tds...) that says cannot find symbol symbol method getElementsByTag. I'm confused because I thought I imported the correct package, and I use the same code a couple lines above and get no error.
2) When I run the code, I get an error that says:
Exception in thread "main" java.lang.NullPointerException
at mytest.JsoupTest1.main(JsoupTest1.java:26)
Which I thought means that a variable with a value of NULL is being used. Did I incorrectly enter the "row" variable in my for loop above?
Here's my code. I truly appreciate any help!
package mytest;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupTest1 {
private static Object row;
public static void main(String[] args) {
Document doc = null;
try {
doc = Jsoup.connect( "http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2015&month=0&season1=2015&ind=0&team=18&rost=0&age=0&filter=&players=0" ).get();
}
catch (IOException ioe) {
ioe.printStackTrace();
}
Element table = doc.getElementById( "LeaderBoard1_dg1_ct100" );
Elements rows = table.getElementsByTag( "tr" );
for( Element row:rows ) {
}
Elements tds = row.getElementsByTag( "td" );
for( int i=0; i < tds.size(); i++ ) {
System.out.println(tds.get(i).text());
}
}
}
Welcome to StackOverflow.
This works.
Document doc = null;
try {
doc = Jsoup
.connect(
"http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2015&month=0&season1=2015&ind=0&team=18&rost=0&age=0&filter=&players=0")
.get();
}
catch (IOException ioe) {
ioe.printStackTrace();
}
Element table = doc.getElementById("LeaderBoard1_dg1_ctl00");
Elements rows = table.getElementsByTag("tr");
for (Element row : rows) {
Elements tds = row.getElementsByTag("td");
for (int i = 0; i < tds.size(); i++) {
System.out.println(tds.get(i).text());
}
}
There are three problems with your code.
The id you are using is wrong. Instead of LeaderBoard1_dg1_ct100 use LeaderBoard1_dg1_ctl00. You mistook the l for 1.
The second problem is the Object row. No need for this one. Remove it.
You had the iteration of the rows outside of the for loop. And because you had the Object row variable, no compilation errors where present, thus hiding the problem.

Finding the content in a row

The row contains
row number -- name surname -- instructor name-- E
</tr>
<tr height=20 style='height:15.0pt'>
<td height=20 class=xl6429100 align=right width=28 style='height:15.0pt;
border-top:none;width:21pt'>row number</td>
<td class=xl8629100 width=19 style='border-top:none;border-left:none;
width:14pt'> </td>
<td class=xl6529100 width=137 style='border-top:none;border-left:none;
width:103pt'>name</td>
<td class=xl6529100 width=92 style='border-top:none;border-left:none;
width:69pt'>surname</td>
<td class=xl7929100 style='border-top:none;border-left:none'>instructor name</td>
<td class=xl8129100 style='border-top:none'>grade</td>
I want to retrieve only one row from this html file to control my own grade. I get the source of the html by using java but now how can I reach the row that I want? I will find the surname first. In this part of the table how can I reach the grade coloumn?
here is my code;
import java.net.*;
import java.io.*;
public class staj {
public static void main(String[] args) throws Exception {
URL staj = new URL("http://www.cs.bilkent.edu.tr/~sekreter/SummerTraining/2014G/CS399.htm");
BufferedReader in = new BufferedReader(new InputStreamReader(staj.openStream()));
String inputLine;
String grade;
while ((inputLine = in.readLine()) != null){
if(inputLine.contains(mysurname))
//grade = WHAT?
}
in.close();
}
And also, is using java efficient and appropriate for this aim? Which language would be better?
You should definitely use Jsoup library to extract what you need from HTML document - http://jsoup.org/
I've created a sample code that demonstrates an example of extracting data from the table you provided in a description: https://gist.github.com/wololock/15f511fd9d7da9770f1d
public static void main(String[] args) throws IOException {
String url = "http://www.cs.bilkent.edu.tr/~sekreter/SummerTraining/2014G/CS399.htm";
String username = "Samet";
Document document = Jsoup.connect(url).get();
Elements rows = document.select("tr:contains("+username+")");
for (Element row : rows) {
System.out.println("---------------");
System.out.printf("No: %s\n", row.select("td:eq(0)").text());
System.out.printf("Evaluator: %s\n", row.select("td:eq(4)").text());
System.out.printf("Status: %s\n", row.select("td:eq(5)").text());
}
}
Take a look on this:
document.select("tr:contains("+username+")");
Jsoup allows you to use jquery-like methods and selectors to extract data from html documents. In this example selector you extracts only those tr elements that contain given username in nested elements. When you have a list of those rows you can simply extract the data. Here we use:
row.select("td:eq(n)")
where :eq(n) means select n-th td element nested in tr. Here is the output:
---------------
No: 85
Evaluator: Buğra Gedik
Status: E
---------------
No: 105
Evaluator: Çiğdem Gündüz Demir
Status: E

jsoup: How to extract correct data from this website

I am trying to extract data from a Spanish dictionary using jsoup. Essentially, the user will input words he wants to define as command line arguments and the program will return a formatted list of definitions. Here is what I have done so far:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class Main {
public static void main(String[] args) {
String[] urls = new String[args.length];
for(int i=0; i<args.length; i++) {
urls[i] = "http://www.diccionarios.com/detalle.php?palabra="
+ args[i]
+ "&Buscar.x=0&Buscar.y=0&Buscar=submit&dicc_100=on&dicc_100=on";
try{
Document doc = Jsoup.connect(urls[i]).get();
Elements htmly = doc.getElementsByTag("html");
String untokenized = htmly.text();
System.out.println(untokenized);
}catch (Exception e) {
System.out.println("EXCEPTION: Word is probably not in this dictionary.");
}
}
}
}
That url array gives the correct urls where the information for the definition is.
Now, what I'm expecting to be returned is what you would get if you went to the try.jsoup website and used (for example) this : http://www.diccionarios.com/detalle.php?palabra=libro&Buscar.x=0&Buscar.y=0&Buscar=submit&dicc_100=on&dicc_100=on
as the link and typed in html as the CSS Query. I need that data so I can tokenize the definition from that.
So I guess my question is, what method would I use to obtain the same data that you can see on the try.jsoup website. Thanks a lot!
Edit: This is about interpreting the data from the url. The end result data I want (in this example) is "Conjunto de hojas escritas unidas o cosidas por uno de sus lados y cubiertas por tapas de cartón u otro material." That is the definition on the website. However, I noticed that on that try.jsoup website that if I put the html text in the CSS Query box then the result was a huge bunch of text. My assumption was that the following 2 lines of code would capture this huge bunch of text and save it as a string:
Elements htmly = doc.getElementsByTag("html");
String untokenized = htmly.text();
However, the output for when I print untokenized is instead this: "Usuario Clave ¿Olvidaste tu clave? Condiciones Privacidad Versión completa © 2011 Larousse Editorial, SL." So my question is, how to I obtain the string data for that huge bunch of text found on the try.jsoup website?
EDIT: I followed the advice of the question here: Jsoup - CSS Query selector issue (?) and it worked great.

Jsoup Html parsing problem finding internal links data

Usually we have many internal links in a file. I want to parse a html file such that i get the headings of a page and its corresponding data in a map.
Steps i did:
1) Got all the internal reference elements
2) Parsed the document for the id = XXX where XXX == (element <a href="#XXX").
3) it takes me to the <span id="XXX">little text here </span> <some tags here too ><p> actual text here </p> <p> here too </p>
4) How to go from <span> to <p> ???
5) I tried going to parent of span and thought that its one of the child is <p> too... its true. But it also involves <p> of other internal links too.
EDIT: added an sample html file portion:
<li class="toclevel-1 tocsection-1"><a href="#Enforcing_mutual_exclusion">
<span class="tocnumber">1</span> <span class="toctext">Enforcing mutual exclusion</span> </a><ul>
<li class="toclevel-2 tocsection-2"><a href="#Hardware_solutions">
<span class="tocnumber">1.1</span> <span class="toctext">Hardware solutions</span>
</a></li>
<li class="toclevel-2 tocsection-3"><a href="#Software_solutions">
<h2><span class="editsection">[<a href="/w/index.php?title=Mutual_exclusion&
amp;action=edit&section=1" title="Edit section: Enforcing mutual exclusion">
edit</a>]</span> <span class="mw-headline" id="Enforcing_mutual_exclusion">
<comment --------------------------------------------------------------------
**see the id above = Enforcing_mutual_exclusion** which is same as first internal
link . Jsoup takes me to this span element. i want to access every <p> element after
this <span> tag before another <span> tag with id="any of the internal links"
------------------------------------------------------------------------------!>
Enforcing mutual exclusion</span></h2>
<p>There are both software and hardware solutions for enforcing mutual exclusion.
The different solutions are shown below.</p>
<h3><span class="editsection">[<a href="/w/index.php?title=Mutual_exclusion&
amp;action=edit&section=2" title="Edit section: Hardware solutions">
edit</a>]</span> <span class="mw-headline" id="Hardware_solutions">Hardware
solutions</span></h3>
<p>On a <a href="/wiki/Uniprocessor" title="Uniprocessor" class="mw-
redirect">uniprocessor</a> system a common way to achieve mutual exclusion inside
kernels is
disable <a href="/wiki/Interrupt" title="Interrupt">
Here is my code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public final class Website {
private URL websiteURL ;
private Document httpDoc ;
LinkedHashMap<String, ArrayList<String>> internalLinks =
new LinkedHashMap<String, ArrayList<String>>();
public Website(URL __websiteURL) throws MalformedURLException, IOException, Exception{
if(__websiteURL == null)
throw new Exception();
websiteURL = __websiteURL;
httpDoc = Jsoup.parse(connect());
System.out.println("Parsed the http file to Document");
}
/* Here is my function: i first gets all the internal links in internalLinksElements.
I then get the href name of <a ..> tag so that i can search for it in documnet.
*/
public void getDataWithHeadingsTogether(){
Elements internalLinksElements;
internalLinksElements = httpDoc.select("a[href^=#]");
for(Element element : internalLinksElements){
// some inline links were bad. i only those having span as their child.
Elements spanElements = element.select("span");
if(!spanElements.isEmpty()){
System.out.println("Text(): " + element.text()); // this can not give what i want
// ok i get the href tag name that would be the id
String href = element.attr("href") ;
href = href.replace("#", "");
System.out.println(href);
// selecting the element where we have that id.
Element data = httpDoc.getElementById(href);
// got the span
if(data == null)
continue;
Elements children = new Elements();
// problem is here.
while(children.isEmpty()){
// going to its element unless gets some data.
data = data.parent();
System.out.println(data);
children = data.select("p");
}
// its giving me all the data of file. thats bad.
System.out.println(children.text());
}
}
}
/**
*
* #return String Get all the headings of the document.
* #throws MalformedURLException
* #throws IOException
*/
#SuppressWarnings("CallToThreadDumpStack")
public String connect() throws MalformedURLException, IOException{
// Is this thread safe ? url.openStream();
BufferedReader reader = null;
try{
reader = new BufferedReader( new InputStreamReader(websiteURL.openStream()));
System.out.println("Got the reader");
} catch(Exception e){
e.printStackTrace();
System.out.println("Bye");
String html = "<html><h1>Heading 1</h1><body><h2>Heading 2</h2><p>hello</p></body></html>";
return html;
}
String inputLine, result = new String();
while((inputLine = reader.readLine()) != null){
result += inputLine;
}
reader.close();
System.out.println("Made the html file");
return result;
}
/**
*
* #param argv all the command line parameters.
* #throws MalformedURLException
* #throws IOException
*/
public static void main(String[] argv) throws MalformedURLException, IOException, Exception{
System.setProperty("proxyHost", "172.16.0.3");
System.setProperty("proxyPort","8383");
System.out.println("Sending url");
// a html file or any url place here ------------------------------------
URL url = new URL("put a html file here ");
Website website = new Website(url);
System.out.println(url.toString());
System.out.println("++++++++++++++++++++++++++++++++++++++++++++++++");
website.getDataWithHeadingsTogether();
}
}
I think you need to understand that the <span>s that you are locating are children of header elements, and that the data you want to store is made up of siblings of that header.
Therefore, you need to grab the <span>'s parent, and then use nextSibling to collect nodes that are your data for that <span>. You need to stop collecting data when you run out of siblings, or you encounter another header element, because another header indicates the start of the next item's data.

Categories