Trying to use jSoup to scrape data from a table - java

First time poster and fairly new coder, so please go easy on me. I'm trying to use jSoup to scrape data from a table. However, I'm having a couple of problems:
1) I'm using NetBeans. I get a "stop" error on Line 30 (Elements tds...) that says "cannot find symbol: method getElementsByTag". I'm confused because I thought I imported the correct package, and I use the same code a couple of lines above with no error.
2) When I run the code, I get an error that says:
Exception in thread "main" java.lang.NullPointerException
at mytest.JsoupTest1.main(JsoupTest1.java:26)
This, I thought, means that a variable with a null value is being used. Did I incorrectly declare the "row" variable used in my for loop above?
Here's my code. I truly appreciate any help!
package mytest;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTest1 {

    private static Object row;

    public static void main(String[] args) {
        Document doc = null;
        try {
            doc = Jsoup.connect( "http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2015&month=0&season1=2015&ind=0&team=18&rost=0&age=0&filter=&players=0" ).get();
        }
        catch (IOException ioe) {
            ioe.printStackTrace();
        }
        Element table = doc.getElementById( "LeaderBoard1_dg1_ct100" );
        Elements rows = table.getElementsByTag( "tr" );
        for( Element row:rows ) {
        }
        Elements tds = row.getElementsByTag( "td" );
        for( int i=0; i < tds.size(); i++ ) {
            System.out.println(tds.get(i).text());
        }
    }
}

Welcome to StackOverflow.
This works.
Document doc = null;
try {
    doc = Jsoup
            .connect(
                    "http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2015&month=0&season1=2015&ind=0&team=18&rost=0&age=0&filter=&players=0")
            .get();
}
catch (IOException ioe) {
    ioe.printStackTrace();
}
Element table = doc.getElementById("LeaderBoard1_dg1_ctl00");
Elements rows = table.getElementsByTag("tr");
for (Element row : rows) {
    Elements tds = row.getElementsByTag("td");
    for (int i = 0; i < tds.size(); i++) {
        System.out.println(tds.get(i).text());
    }
}
There are three problems with your code.
The id you are using is wrong. Instead of LeaderBoard1_dg1_ct100, use LeaderBoard1_dg1_ctl00. You mistook the l for a 1.
The second problem is the Object row field. There is no need for it; remove it.
Third, you had the per-row iteration (the td extraction) outside of the for loop. And because the Object row field existed, no compilation errors were present, which hid the problem.
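As a side note (my addition, not part of the original answer): doc can still be null here if the connect call throws an IOException, and getElementById returns null when the id doesn't exist, so either case ends in a NullPointerException. A minimal defensive sketch using the same Jsoup calls:

Document doc;
try {
    doc = Jsoup.connect("http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2015&month=0&season1=2015&ind=0&team=18&rost=0&age=0&filter=&players=0").get();
} catch (IOException ioe) {
    ioe.printStackTrace();
    return; // no page fetched, nothing to parse
}
Element table = doc.getElementById("LeaderBoard1_dg1_ctl00");
if (table == null) {
    System.err.println("Table not found - check the element id");
    return;
}

That keeps the failure message close to its cause instead of surfacing later as an NPE.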

Related

How to get the values of child nodes in JDOM

I am trying to get a value by parsing an XML document using the JDOM library.
I want to get the values of the driverJar tags, which are child nodes of the driverJars tag, but I can't get the values.
<connection>
    <driverJars>
        <driverJar>ojdbc11.jar</driverJar>
        <driverJar>orai18n.jar</driverJar>
        <driverJar>test.jar</driverJar>
    </driverJars>
</connection>
I tried:
(It's done until the document is already loaded.)
if (element.getChild(DRIVER_JARS) != null) {
    Element driverJarsElement = element.getChild(DRIVER_JARS);
    List<Element> driverJarElementList = driverJarsElement.getChildren(DRIVER_JAR);
    for (int i = 0; i < driverJarElementList.size(); i++) {
        Element driverJarElement = driverJarElementList.get(i);
        System.out.println(driverJarElement.getText()); // [Element: <driverJar/>]
    }
}
I can get a value from a single child, but when I loop over the children I cannot get the text value of each child by index.
The value marked as a comment above is what comes out when I print it with System.out.println.
How can I get the value?
What I want to get from the XML above is the String values ojdbc11.jar, orai18n.jar, and test.jar.
Full code example
<connection>
    <productId>oracle_10g</productId>
    <productName>Oracle 9i ~ 21c</productName>
    <driverJars>
        <driverJar>ojdbc8.jar</driverJar>
        <driverJar>orai18n.jar</driverJar>
    </driverJars>
</connection>
String productId = element.getChildTextTrim(PRODUCT_ID); // oracle_10g
String productName = element.getChildTextTrim(PRODUCT_NAME); // Oracle 9i ~ 21c
Element driverJarsElement = element.getChild(DRIVER_JARS);
List<Element> driverJarElementList = driverJarsElement.getChildren(DRIVER_JAR);
if (element.getChild(DRIVER_JARS) != null) {
    for (int i = 0; i < driverJarElementList.size(); i++) {
        description.setDriverJars(new ArrayList<String>(Arrays.asList(driverJarElementList.get(i).toString())));
    }
}
(The reason I pass it to setDriverJars that way is that the field is a List.)
The code above does the following:
(1) After loading the document, it inserts values into the fields declared in the description object.
(2) Then it makes a copy of the object.
(3) It analyzes the element and reconstructs the description using the copy.
(The method used to reconstruct the description has different logic from the method in (1).)
In (1), I want to get values from the XML, but I can't get the values of the multiple child nodes.
While the code in your question is not a minimal, reproducible example, the code below is essentially the same. One difference is that in the below code, I first get the root element from the DOM that is created from the XML file.
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.JDOMException;
import org.jdom2.input.SAXBuilder;

public class JdomTest {

    private static final String DRIVER_JARS = "driverJars";
    private static final String DRIVER_JAR = "driverJar";

    public static void main(String[] args) {
        File xmlFile = new File("connects.xml");
        SAXBuilder saxBuilder = new SAXBuilder();
        try {
            Document doc = saxBuilder.build(xmlFile);
            Element root = doc.getRootElement();
            Element driverJarsElement = root.getChild(DRIVER_JARS);
            List<Element> driverJarElementList = driverJarsElement.getChildren(DRIVER_JAR);
            for (int i = 0; i < driverJarElementList.size(); i++) {
                Element driverJarElement = driverJarElementList.get(i);
                System.out.println(driverJarElement.getText());
            }
        }
        catch (JDOMException | IOException x) {
            x.printStackTrace();
        }
    }
}
Here are the contents of file connects.xml
<connection>
    <productId>oracle_10g</productId>
    <productName>Oracle 9i ~ 21c</productName>
    <driverJars>
        <driverJar>ojdbc11.jar</driverJar>
        <driverJar>orai18n.jar</driverJar>
    </driverJars>
</connection>
And here is the output I get when I run the above code:
ojdbc11.jar
orai18n.jar
My environment is JDK 17.0.4 on Windows 10 (64-bit) with JDOM 2.0.6.
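If the end goal is to collect the jar names into a List<String> (for example, to hand to the setDriverJars setter mentioned in the question), a small helper along the same lines could sit inside the JdomTest class above; it additionally needs java.util.ArrayList imported, and the helper name is my own:

// Collects the text of each <driverJar> child of the given <driverJars> element,
// e.g. ["ojdbc11.jar", "orai18n.jar"].
private static List<String> collectDriverJars(Element driverJarsElement) {
    List<String> jarNames = new ArrayList<>();
    for (Element driverJarElement : driverJarsElement.getChildren(DRIVER_JAR)) {
        jarNames.add(driverJarElement.getTextTrim());
    }
    return jarNames;
}

Calling it as description.setDriverJars(collectDriverJars(root.getChild(DRIVER_JARS))) would then replace the per-index setDriverJars calls from the question, assuming that setter accepts a List<String>.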

Why is JSoup printing a question mark

I'm trying to understand the following. I have some code reading a page from gutenberg.org. Almost everything is ok but some characters are not. They are ok in the browser.
package nl.atticworks.gutenberg;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class Gutenberg {

    private static final String GET_URL = "http://www.gutenberg.org/browse/languages/nl";

    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect(GET_URL).get();
            Elements data = doc.select("div.pgdbbylanguage");
            for (Element d : data) {
                Elements children = d.select("*");
                for (Element child : children) {
                    if (child.tagName().equals("ul")) {
                        Element author = children.get(children.indexOf(child) - 1);
                        String a1 = author.select("a:last-child").text();
                        if (a1.startsWith("Kara")) {
                            System.out.println(a1);
                            Elements titles = child.select("li.pgdbetext a");
                            for (Element title : titles) {
                                System.out.println("\t" + title.text());
                            }
                        }
                    }
                }
            }
        } catch (IOException ex) {
            // do something...
        }
    }
}
The string a1 prints "Karadži?, Vuk Stefanovi?, 1787-1864" but should print "Karadžić, Vuk Stefanović, 1787-1864"
I'm pretty sure that the encoding is ok (UTF-8) but the c with acute isn't encoded properly.
Still, browsers do show the correct char, Jsoup doesn't. Why?
Regards,
Hans
As you haven't said what you are running your program in, it is difficult to give a definitive answer, but basically there is nothing wrong with your code. JSoup is not responsible for your display problem; whichever console you are displaying on is.
If you set your console (or IDE) to the UTF-8 encoding it should display correctly.
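For example, one way to do that from inside the program itself (a sketch of my own, assuming Java 10+ for the Charset-taking PrintStream constructor) is to wrap System.out in a UTF-8 PrintStream; alternatively, start the JVM with -Dfile.encoding=UTF-8:

import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class Utf8ConsoleDemo {
    public static void main(String[] args) {
        // Replace System.out with a PrintStream that encodes output as UTF-8,
        // so characters such as "ć" are not printed as "?" when the platform
        // default charset cannot represent them.
        System.setOut(new PrintStream(System.out, true, StandardCharsets.UTF_8));
        System.out.println("Karadžić, Vuk Stefanović, 1787-1864");
    }
}

The console or IDE window still has to interpret the output as UTF-8 for the glyphs to show up correctly.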
I tried this code in my own IDEA, and the output was just as you expected.
So I also think the console encoding is the problem.

Webpage collector using google bot

I'm continuing a project that has been going on for a few years at my university. One of the activities in this project is collecting some web pages using the Google bot.
Due to a problem that I cannot understand, the project is not getting through this part. I've already researched a lot about what may be happening and whether some part of the code is outdated.
The code is in Java and uses Maven for project management.
I've tried updating some information in the Maven pom, and I've already tried changing the part of the code that uses the bot, but nothing works.
I'm posting the part of code that isn't working as it should:
private List<JSONObject> querySearch(int numSeeds, String query) {
    List<JSONObject> result = new ArrayList<>();
    start = 0;
    do {
        String url = SEARCH_URL + query.replaceAll(" ", "+") + FILE_TYPE + "html" + START + start;
        Connection conn = Jsoup.connect(url).userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)").timeout(5000);
        try {
            Document doc = conn.get();
            result.addAll(formatter(doc));
        } catch (IOException e) {
            System.err.println("Could not search for seed pages in IO.");
            System.err.println(e);
        } catch (ParseException e) {
            System.err.println("Could not search for seed pages in Parse.");
            System.err.println(e);
        }
        start += 10;
    } while (result.size() < numSeeds);
    return result;
}
What some of the variables do:
private static final String SEARCH_URL = "https://www.google.com/search?q=";
private static final String FILE_TYPE = "&fileType=";
private static final String START = "&start=";

private QueryBuilder queryBuilder;

public GoogleAjaxSearch() {
    this.queryBuilder = new QueryBuilder();
}
Up to this point everything is OK: it connects as the bot and gets the HTML back from Google. The problem is separating what was found and taking only the links, which should be inside ("h3.r > a").
It does that in the part with result.addAll(formatter(doc)):
public List<JSONObject> formatter(Document doc) throws ParseException {
    List<JSONObject> entries = new ArrayList<>();
    Elements results = doc.select("h3.r > a");
    for (Element result : results) {
        //System.out.println(result.toString());
        JSONObject entry = new JSONObject();
        entry.put("url", (result.attr("href").substring(6, result.attr("href").indexOf("&")).substring(1)));
        entry.put("anchor", result.text());
        entries.add(entry);
    }
    return entries;
}
So when it gets to this part, Elements results = doc.select("h3.r > a"), it probably finds no h3 elements and never enters the for loop, so the results list stays empty. It then goes back to querySearch and tries again, still without growing the result list. With that, it ends up in an infinite loop trying to get the requested data and never finding it.
If anyone here can help me, I'd appreciate it; I've been trying for a while and I don't know what else to do. Thanks in advance.
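No answer is recorded for this question here, but one defensive change suggested by the description above (a sketch of my own; it does not fix the selector itself, which apparently matches nothing in the HTML Google returns) is to stop the do/while loop in querySearch when a page yields no results instead of looping forever:

// Inside the try block of the do/while in querySearch:
// bail out when the selector matched nothing on this page.
List<JSONObject> pageResults = formatter(doc);
if (pageResults.isEmpty()) {
    System.err.println("Selector \"h3.r > a\" matched nothing; stopping the search loop.");
    break;
}
result.addAll(pageResults);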

Not all documents are inserted in MongoDB when using the async-driver for Java

I was experimenting with the MongoDB async driver (http://mongodb.github.io/mongo-java-driver/3.0/driver-async/) and noticed odd behaviour, which I reproduced with the code below:
import com.mongodb.async.SingleResultCallback;
import com.mongodb.async.client.MongoClient;
import com.mongodb.async.client.MongoClients;
import com.mongodb.async.client.MongoCollection;
import com.mongodb.async.client.MongoDatabase;
import org.bson.Document;
public class main {

    public static void main(String[] args) {
        MongoClient mongoClient = MongoClients.create();
        MongoDatabase database = mongoClient.getDatabase("mongotest");
        MongoCollection<Document> collection = database.getCollection("coll");
        for (Integer i = 0; i < 100000; i++) {
            Document doc = new Document("name" + i.toString(), "TESTING");
            collection.insertOne(doc, new SingleResultCallback<Void>() {
                public void onResult(final Void result, final Throwable t) {
                    System.out.println("Inserted!");
                }
            });
        }
        while (true) {
        }
    }
}
I would expect this code to insert 100,000 documents into the collection 'coll' of the Mongo database called "mongotest".
However, when I check the number of elements after running this code, thousands of documents are missing.
When running this statement in the mongodb-shell
db.getCollection("coll").count()
I get 93062 as a result. This number varies for each run but never reaches 100,000. Can anyone explain why not all objects are properly stored as documents in MongoDB when I use this code? We tested this on 3 different machines and every machine showed the same behaviour.
I have the feeling it is a driver-related issue because following up on this I wrote a similar experiment using node.js:
var express = require('express');
var MongoClient = require('mongodb').MongoClient;
var app = express();

var url = 'mongodb://localhost:27017/mongotest';
MongoClient.connect(url, function (err, db) {
    for (var i = 0; i < 100000; i++) {
        var name = "name" + i;
        db.collection("coll").insertOne({
            name: name
        }, function (err, results) {
            if (err == null) {
                console.log("Sweet");
            }
        });
    }
});

module.exports = app;
This code took longer to run than the Java code, but when it finishes, 100,000 documents are sitting in the collection as expected.
Can anyone explain why this is not the case with the java-example, and possibly provide a solution?
When did you run db.getCollection("coll").count() to check the insert result?
Maybe the insertion process has not finished when you check the result.
2016-02-19 15:00 edit
I did the same test and had the same result.
But when I changed the following line
Document doc = new Document("name"+ i.toString(), "TESTING");
to
Document doc = new Document("_id", "name"+ i.toString());
it inserted exactly 100,000 documents.
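Following up on the first suggestion above, that the count may have been taken before all inserts completed, here is a hedged sketch of the same program that waits for every callback before counting and closing. It uses the same com.mongodb.async API as the question; the class name and the CountDownLatch handling are my additions, not part of the original posts.

import java.util.concurrent.CountDownLatch;

import com.mongodb.async.SingleResultCallback;
import com.mongodb.async.client.MongoClient;
import com.mongodb.async.client.MongoClients;
import com.mongodb.async.client.MongoCollection;
import com.mongodb.async.client.MongoDatabase;
import org.bson.Document;

public class MainWithLatch {

    public static void main(String[] args) throws InterruptedException {
        MongoClient mongoClient = MongoClients.create();
        MongoDatabase database = mongoClient.getDatabase("mongotest");
        MongoCollection<Document> collection = database.getCollection("coll");

        final int total = 100000;
        final CountDownLatch latch = new CountDownLatch(total);

        for (Integer i = 0; i < total; i++) {
            Document doc = new Document("name" + i.toString(), "TESTING");
            collection.insertOne(doc, new SingleResultCallback<Void>() {
                public void onResult(final Void result, final Throwable t) {
                    if (t != null) {
                        // Surface failed inserts instead of silently ignoring them.
                        t.printStackTrace();
                    }
                    latch.countDown();
                }
            });
        }

        // Block until every callback has fired (successfully or not) before counting in the shell.
        latch.await();
        System.out.println("All insert callbacks completed");
        mongoClient.close();
    }
}

Counting the collection only after "All insert callbacks completed" is printed (and checking the callback's Throwable) separates "the count was taken too early" from "some inserts actually failed".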

Getting multiple tables from HTML using Jsoup

I am trying to scrape data from multiple tables on this website: http://www.national-autograss.co.uk/march.htm
I need to keep the table data together with its respective date, located in an h2, so I would like a way to do the following:
Find first date header h2
Extract table data beneath h2 (can be multiple tables)
Move on to next header and extract tables etc
I have written code to extract all the parts separately, but I do not know how to extract the data so that it stays with the relevant date header.
Any help or guidance would be much appreciated. The code I am starting with is below, but, like I said, all it does is iterate through the data.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {

    public static void main(String[] args) {
        Document doc = null;
        try {
            doc = Jsoup.connect("http://www.national-autograss.co.uk/march.htm").get();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        Elements elementsTable1 = doc.select("#table1");
        Elements elementsTable2 = doc.select("#table2");
        Elements dateElements = doc.select("h2");

        for (int i = 0; i < dateElements.size(); i++) {
            System.out.println(dateElements.get(i).text());
            System.out.println(elementsTable1.get(i).text());
            System.out.println(elementsTable2.get(i).text());
        }
    }
}
It seems that the values you want are stored inside <tr> elements in tables where the first child of each table is an <h2>.
<table align="center"><col width="200"><col width="150"><col width="100"><col width="120"><col width="330"><col width="300">
    <h2>Sunday 30 March</h2>
    <tr id="table1">
        <td><b>Club</b></td>
        <td><b>Venue</b></td>
        <td><b>Start Time</b></td>
        <td><b>Meeting Type</b></td>
        <td><b>Number of Days for Meeting</b></td>
        <td><b>Notes</b></td>
    </tr>
    <tr id="table2">
        <td>Evesham</td>
        <td>Dodwell</td>
        <td>11:00am</td>
        <td>RO</td>
        <td>Single Days Racing</td>
        <td></td>
    </tr>
</table>
My suggestion is that you search for all tables; when the first child is an h2, you do something with the rest of its children:
Elements tables = doc.select("table");
for (Element table : tables) {
    if (table.child(0).tagName().equals("h2")) {
        Elements children = table.children();
    }
}
Hope this helps!
EDIT: You want to remove all <col> elements before the <h2>, as they will otherwise appear before it (I did not notice this before):
for (Element element : doc.select("col")) {
    element.remove();
}
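Putting the two fragments of this answer together, here is a sketch of what the grouping could look like (my own combination, assuming, as the answer does, that after the <col> elements are removed each table's first child is the <h2> date header):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class GroupedByDate {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.national-autograss.co.uk/march.htm").get();

        // Remove the <col> elements so the <h2> becomes each table's first child.
        for (Element col : doc.select("col")) {
            col.remove();
        }

        for (Element table : doc.select("table")) {
            if (!table.children().isEmpty() && table.child(0).tagName().equals("h2")) {
                // Print the date header, then every row that belongs to it.
                System.out.println(table.child(0).text());
                for (Element row : table.select("tr")) {
                    System.out.println("\t" + row.text());
                }
            }
        }
    }
}

This keeps each set of rows printed directly under its date, which is the grouping asked for in the question.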
