Parsing html tables using jsoup


I am parsing tables using jsoup. I need to read the division standings tables from this website: https://www.basketball-reference.com/leagues/NBA_2006.html. I don't know how to parse the tables, because I need to use the same method for every division standings table, but the id is different for older seasons (e.g. id="divs_standings_W", id="divs_standings_E" and id="divs_standings_"). Link to an older season: https://www.basketball-reference.com/leagues/NBA_1950.html.
How can I check whether the table with the given id exists and, if it exists, put it in a variable table? I don't have much relevant code.
Document doc = Jsoup.connect("https://www.basketball-reference.com/leagues/NBA_1950.html").get();
Elements table = doc.select("table[id=\"divs_standings_\"]");

You can just use prefix matching: table[id^="divs_standings_"]. This will match all tables whose ids start with divs_standings_:
Document doc = Jsoup.connect("https://www.basketball-reference.com/leagues/NBA_1950.html").get();
Element table = doc.selectFirst("table[id^=\"divs_standings_\"]");
This will work for old and new seasons.
To wrap this in a method you can just use something like this:
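private static void processTable(String url) throws IOException {
    Document doc = Jsoup.connect(url).get();
    // selectFirst returns null when no element matches the selector
    Element table = doc.selectFirst("table[id^=\"divs_standings_\"]");
    if (table == null) {
        System.out.println("No division standings table found at " + url);
        return;
    }
    System.out.println(table);
}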
and call it with both urls:
processTable("https://www.basketball-reference.com/leagues/NBA_1950.html");
processTable("https://www.basketball-reference.com/leagues/NBA_2006.html");
You can also use pattern matching if you have more complex ids. Check out the link above for this.
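To make the difference between prefix matching and pattern matching concrete, here is a minimal plain-Java sketch of the rule each selector applies to an id value. It does not use jsoup: IdMatching, its two methods, the regex and the id confs_standings_E are assumptions made up for illustration. The sketch assumes that jsoup's [attr~=regex] selector behaves like a regex "find" on the attribute value.

```java
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java model of two attribute-matching rules; a hypothetical helper, not part of jsoup.
class IdMatching {
    // Models table[id^="divs_standings_"]: the id must start with the prefix.
    static boolean prefixMatch(String id, String prefix) {
        return id.startsWith(prefix);
    }

    // Models table[id~=regex]: the regex must be found somewhere in the id.
    static boolean regexMatch(String id, String regex) {
        return Pattern.compile(regex).matcher(id).find();
    }

    public static void main(String[] args) {
        // The first three ids come from the question; the last one is invented as a non-match.
        List<String> ids = List.of(
                "divs_standings_W", "divs_standings_E", "divs_standings_", "confs_standings_E");
        for (String id : ids) {
            System.out.println(id
                    + " prefix=" + prefixMatch(id, "divs_standings_")
                    + " regex=" + regexMatch(id, "^divs_standings_[EW]?$"));
        }
    }
}
```

The prefix rule is enough for the three ids in the question. A regex only pays off when the ids follow a pattern that a fixed prefix cannot express.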


How to upgrade this code to latest Mongo Java driver?

I am working on this legacy application (7 years old). I have many methods that do the same thing that I am trying to upgrade to a newer MongoDB Java driver, but it won't compile.
@Override
public void saveOrUpdatePrinter(Document printer) {
printer.put(PRINTER_COLUMNS.updateDate,new Date());
MongoCollection<Document> collection = mongoTemplate.getCollection("PRINTERS");
printer.remove("_id");
Document query = new Document().append(PRINTER_COLUMNS.internal_id, printer.get(PRINTER_COLUMNS.internal_id));
WriteResult result = collection.update(query, printer, true, false);
logger.debug("saveOrUpdatePrinter updeded records: " + result.getN());
}//
The error is:
The method update(Document, Document, boolean, boolean) is undefined
for the type MongoCollection<Document>
Why was this removed?
printer.remove("_id");
Also I would like to know how to do either update or save on the document in one go?
And what will be the proper way to update a single document in the new (MongoDB Java driver 4.7.0)?
Reading this code a little more seems like it was an attempt to do UPSERT operation (update or insert).

I will try to answer your questions.
Q : How to do either Update or Save on the Document in one go?
-> In the legacy API, the update method changes values in an existing document, whereas the save method replaces the existing document with the document passed in. Neither call on its own covers both cases; to update or insert in a single call you need an upsert, which is covered below.
An update with operators such as $set changes only the fields named in the update document and leaves the other fields untouched, whereas save overwrites all fields of the stored document with the values from the document you pass in.
Q : What will be the proper way to update a single document in the new (Mongo Java driver 4.7.0)
-> You should be using updateOne(query, updates, options) to update a single document on a MongoCollection object.
From updateOne docs :
The method accepts a filter that matches the document you want to
update and an update statement that instructs the driver how to change
the matching document. The updateOne() method only updates the first
document that matches the filter.
To perform an update with the updateOne() method, you must pass a
query filter and an update document. The query filter specifies the
criteria for which document to perform the update on and the update
document provides instructions on what changes to make to it.
You can optionally pass an instance of UpdateOptions to the
updateOne() method in order to specify the method's behavior. For
example, if you set the upsert field of the UpdateOptions object to
true, the operation inserts a new document from the fields in both the
query and update document if no documents match the query filter.
Q : Does it seem like it was an attempt to do an UPSERT operation (update or insert)?
-> Yes, it's an upsert operation.
Q : Why is the code trying to remove _id from the document?
-> The update looks the document up by internal_id and updates it if it is found. If it is not found, the upsert flag makes the server insert the document as a new one, and since the document carries no _id field, a fresh one is generated for it. The _id field also cannot be changed on an existing document, so a leftover _id that differs from the stored one would make the update fail. That is why it was removed from the document.
Just update the code to this.
@Override
public void saveOrUpdatePrinter(Document printer) {
    MongoCollection<Document> collection = mongoTemplate.getCollection("PRINTERS");
    Document query = new Document().append(PRINTER_COLUMNS.internal_id, printer.get(PRINTER_COLUMNS.internal_id));
    UpdateOptions options = new UpdateOptions().upsert(true);
    printer.put(PRINTER_COLUMNS.updateDate, new Date());
    // _id is immutable, so keep it out of the fields being set
    printer.remove("_id");
    // updateOne expects update operators; a bare document is rejected by the driver
    UpdateResult result = collection.updateOne(query, new Document("$set", printer), options);
    logger.debug("saveOrUpdatePrinter updated records: " + result.getModifiedCount());
}
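One detail about the log line is worth knowing: when an upsert ends up inserting, nothing was modified, so getModifiedCount() reports 0 and the id of the new document is available through getUpsertedId() instead. The sketch below models the three possible outcomes with a plain in-memory map. It is not the driver; UpsertModel, Outcome and upsert are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// In-memory model of upsert semantics; a sketch, not the MongoDB driver.
class UpsertModel {
    enum Outcome { INSERTED, MODIFIED, UNCHANGED }

    private final Map<String, Map<String, Object>> byKey = new HashMap<>();

    // Applies the given fields to the document stored under key, inserting it if absent.
    Outcome upsert(String key, Map<String, Object> fields) {
        Map<String, Object> existing = byKey.get(key);
        if (existing == null) {
            // No match: the document is inserted, so nothing counts as "modified".
            byKey.put(key, new HashMap<>(fields));
            return Outcome.INSERTED;
        }
        boolean changed = false;
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            Object old = existing.put(e.getKey(), e.getValue());
            // A field counts as changed when its previous value differs (absent reads as null).
            if (!Objects.equals(old, e.getValue())) {
                changed = true;
            }
        }
        return changed ? Outcome.MODIFIED : Outcome.UNCHANGED;
    }

    public static void main(String[] args) {
        UpsertModel printers = new UpsertModel();
        System.out.println(printers.upsert("p-1", Map.of("name", "Lobby")));   // first call inserts
        System.out.println(printers.upsert("p-1", Map.of("name", "Lobby 2"))); // value differs
        System.out.println(printers.upsert("p-1", Map.of("name", "Lobby 2"))); // same value again
    }
}
```

If the log should distinguish inserts from updates, checking whether getUpsertedId() is non-null is one way to do it.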

You can update a document using the MongoCollection#updateOne() method.
An example would be:
// the id must be a 24-character hex string, otherwise ObjectId rejects it
collection.updateOne(Filters.eq("_id", new ObjectId("59a4db1a6812d7430c3ef2a5")), Updates.set("date", new Date()));

MongoDB & Java DBRef Usage

So let's say I have a patient document in MongoDB. It has fields such as first name, last name, etc. I am trying to add a list of providers to the current document (providers are another collection, referenced by DBRef, as I am using POJOs). How would I append multiple providers to the document in Java?

One way to do this is by just appending to a document like this:
Document doc = new Document("user", userObject)
.append("providers", providersObject);
providersObject would be your list of providers.

Get the object id after inserting the mongodb document in java

I am using MongoDB 3.4 and I want to get the id of the last inserted document. I have searched around and found that the code below can be used with a BasicDBObject.
BasicDBObject docs = new BasicDBObject(doc);
collection.insertOne(docs);
ID = (ObjectId)doc.get( "_id" );
But the problem is that I am using the Document type, not BasicDBObject, so I tried to get it like this: doc.getObjectId();. But it asks for a parameter, which is the very thing I want. Does anyone know how to get it?
EDIT
This is how I am inserting it into MongoDB.
Document doc = new Document("jarFileName", jarDataObj.getJarFileName())
.append("directory", jarDataObj.getPathData())
.append("version", jarDataObj.getVersion())
.append("artifactID", jarDataObj.getArtifactId())
.append("groupID", jarDataObj.getGroupId());
If I use doc.toJson() it shows me the whole document. Is there a way to extract only _id?
The following gives me only the value, but I want it as an ObjectId so that I can use it as a reference key.
collection.insertOne(doc);
jarID = doc.get( "_id" );
System.out.println(jarID); //59a4db1a6812d7430c3ef2a5

Based on the ObjectId Javadoc, you can simply instantiate an ObjectId from a 24-character hex string, which is what 59a4db1a6812d7430c3ef2a5 is. Why don't you just do new ObjectId("59a4db1a6812d7430c3ef2a5")? Note that the byte[] constructor expects the 12 raw bytes of the id, not the UTF-8 bytes of the hex string, so new ObjectId("59a4db1a6812d7430c3ef2a5".getBytes(StandardCharsets.UTF_8)) is rejected. That said, I'd say that exposing ObjectId outside the layer that integrates with Mongo is a design flaw.
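The hex string is 24 characters long because an ObjectId is 12 bytes and each byte takes two hex digits. A small plain-Java sketch, without the driver, shows the difference between decoding the hex and taking the UTF-8 bytes of the string. HexId and decodeHex are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Shows why a 24-character hex id represents 12 bytes; illustrative only, not driver code.
class HexId {
    // Decodes a hex string (two digits per byte) into raw bytes.
    static byte[] decodeHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        String id = "59a4db1a6812d7430c3ef2a5";
        // Decoded hex: the raw bytes the id stands for.
        System.out.println("decoded bytes: " + decodeHex(id).length);
        // UTF-8 bytes of the text: one byte per character, which is a different thing.
        System.out.println("utf-8 bytes:   " + id.getBytes(StandardCharsets.UTF_8).length);
    }
}
```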

Using MongoDB 3.4 to load and save userdata

How can I find a document and retrieve it if found, but insert and retrieve it if not found in one command?
I have an outline for the formats I wish my documents to look like for a user's data. Here is what it looks like
{
"username": "HeyAwesomePeople",
"uuid": "0f91ede5-54ed-495c-aa8c-d87bf405d2bb",
"global": {},
"servers": {}
}
When a user first logs in, I want to store the first two values of data (username and uuid) and create those empty values (global and servers. Both those global and servers will later on have more information filled into them, but for now they can be blank). But I also don't want to override any data if it already exists for the user.
I would normally use the insertOne or updateOne calls on the collection and then use the upsert option (new UpdateOptions().upsert(true)) to insert if it isn't found, but in this case I also need to retrieve the user's document as well.
So in a case in which the user isn't found in the database, I need to insert the outlined data into the database and return the document saved. In a case where the user is found in the database, I need to just return the document from the database.
How would I go about doing this? I am using the latest version of Mongo which has deprecated the old BasicDBObject types, so I can't find many places online that use the new 'Document' type. Also, I am using the Async driver for java and would like to keep the calls to the minimum.

How can I find a document and retrieve it if found, but insert and retrieve it if not found in one command?
You can use findOneAndUpdate() method to find and update/upsert.
The MongoDB Java driver exposes the same method name findOneAndUpdate(). For example:
// Example callback method for Async
SingleResultCallback<Document> printDocument = new SingleResultCallback<Document>() {
@Override
public void onResult(final Document document, final Throwable t) {
System.out.println(document.toJson());
}
};
Document userdata = new Document("username","HeyAwesomePeople")
.append("uuid", "0f91ede5")
.append("global", new Document())
.append("servers", new Document());
collection.findOneAndUpdate(userdata,
new Document("$set", userdata),
new FindOneAndUpdateOptions()
.upsert(true)
.returnDocument(ReturnDocument.AFTER),
printDocument);
The query above will try to find a document matching userdata; if found set it to the same value as userdata. If not found, the upsert boolean flag will insert it into the collection. The returnDocument option is to return the document after the action is performed.
The upsert and returnDocument flags are part of FindOneAndUpdateOptions
See also MongoDB Async Java Driver v3.4 for tutorials/examples. The above snippet was tested with current version of MongoDB v3.4.x.
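One caveat, offered as a suggestion rather than part of the tested snippet: the filter above matches on the whole userdata document, including the empty global and servers values, so once those fields hold data the filter no longer matches and the upsert would insert a second document. A common alternative is to filter on uuid only and put the defaults under the $setOnInsert operator, so they are written only on insert. The plain-Java sketch below models that "insert defaults only if absent, always return the stored document" behaviour in memory. UserStore, UserDoc, findOrCreate and the coins field are hypothetical and not driver API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory model of "return the user if stored, otherwise insert defaults and return them".
// A sketch only: these names are made up and nothing here talks to MongoDB.
class UserStore {
    static final class UserDoc {
        final String username;
        final String uuid;
        final Map<String, Object> global = new HashMap<>();   // starts empty, filled later
        final Map<String, Object> servers = new HashMap<>();  // starts empty, filled later

        UserDoc(String username, String uuid) {
            this.username = username;
            this.uuid = uuid;
        }
    }

    private final Map<String, UserDoc> byUuid = new ConcurrentHashMap<>();

    // Matches on uuid only; the defaults are applied only when no document exists yet.
    UserDoc findOrCreate(String uuid, String username) {
        return byUuid.computeIfAbsent(uuid, id -> new UserDoc(username, id));
    }

    public static void main(String[] args) {
        UserStore store = new UserStore();
        String uuid = "0f91ede5-54ed-495c-aa8c-d87bf405d2bb";
        UserDoc first = store.findOrCreate(uuid, "HeyAwesomePeople");
        first.global.put("coins", 5); // invented data added after the first login
        UserDoc again = store.findOrCreate(uuid, "HeyAwesomePeople");
        System.out.println(again.global); // the data written earlier is still there
    }
}
```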

Add basic value to Ontology individuals #Jena

I have an ontology with some classes and everything set up to run. What is a good way to fill it with individuals and data? In short, I want to do a one-way mapping from a database (as input) to an ontology.
public class Main {
static String SOURCE = "http://www.umingo.de/ontology/bento.owl";
static String NS = SOURCE+"#";
public static void main(String[] args) throws Exception {
OntModel model = ModelFactory.createOntologyModel( OntModelSpec.OWL_MEM );
// read the RDF/XML file
model.read(SOURCE);
OntologyPreLoader loader = new OntologyPreLoader();
model = loader.init(model);
model.write(System.out,"RDF/XML");
}
}
My Preloader has a Method init with the goal to copy data from a database into the ontology. Here is the Excerpt.
public OntModel init(OntModel model) throws SQLException{
Resource r = model.getResource( Main.NS + "Tag" );
Property tag_name = model.createProperty(Main.NS + "Tag_Name");
OntClass tag = r.as( OntClass.class );
// statements allow to issue SQL queries to the database
statement = connect.createStatement();
// resultSet gets the result of the SQL query
resultSet = statement
.executeQuery("select * from niuu.tags");
// resultSet is initialised before the first data set
while (resultSet.next()) {
// it is possible to get the columns via name
// also possible to get the columns via the column number
// which starts at 1
// e.g., resultSet.getString(2);
String id = resultSet.getString("id");
String name = resultSet.getString("name");
Individual tag_tmp = tag.createIndividual(Main.NS+"Tag_"+id);
tag_tmp.addProperty(tag_name,name);
System.out.println("id: " + id);
System.out.println("name: " + name);
}
return model;
}
Everything is working, but I feel really unsure about this way of preloading ontologies. Also, every individual should get its own ID so that I can match it with the database at a later point.
Can I simply define a property ID and add it to every individual?
I thought about adding ID to "Thing", as it is the most basic type in OWL ontologies.

At first sight it seems OK. One tip is to convert the Jena model into an RDF serialization and load it into Protégé to get a clearer picture of what your ontology mapping looks like.
You can definitely make your own property to describe the id of every individual.
Below is an example of how you can create such a property in Turtle format. (I did not add the prefixes for OWL and RDFS since they are so common.)
You can add this in Jena as well if needed (or load it into your model in Jena).
@prefix you: <your domain> .
you:dbIdentificator a owl:DatatypeProperty .
you:dbIdentificator rdfs:label "<Your database identificator>"@en .
you:dbIdentificator rdfs:comment "<Some valuable information if needed>"@en .
you:dbIdentificator rdfs:isDefinedBy <your domain> .
you:dbIdentificator rdfs:domain owl:Thing .
You could also add owl:Thing to every resource, but that is not best practice because it is a vague definition of a resource. I would look around for vocabularies that define more precisely what the resource is. Take a look at GoodRelations. It is a very well-defined vocabulary that can describe your information even if it is not for commercial use. Especially check out the classes there.
Hope that answered some of your question.

Programmatically generating URIs is always somewhat unsettling. If you have Guava, use Preconditions to make some fail-fast assertions about what's coming out of the database (so that your code will let you know if it gets out of alignment with your schema). Use the JDK's URLEncoder to ensure that the id you get from the database is converted to a URI-friendly format. (Note that if your data contains characters that cannot be printed in XML and have no percent encoding, you'll need to handle them manually.)
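A minimal sketch of that advice, using only the JDK: Objects.requireNonNull and an explicit check stand in for Guava's Preconditions here, and TagUris and tagUri are hypothetical names. It assumes Java 10 or later for the URLEncoder.encode(String, Charset) overload.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

// Builds individual URIs from database ids; class and method names are illustrative.
class TagUris {
    // Fails fast on unexpected database values, then encodes the id for use in a URI.
    static String tagUri(String namespace, String id) {
        Objects.requireNonNull(namespace, "namespace must not be null");
        Objects.requireNonNull(id, "id must not be null");
        if (id.trim().isEmpty()) {
            throw new IllegalArgumentException("id must not be blank");
        }
        // URLEncoder applies form encoding: spaces become '+', reserved characters become %XX.
        return namespace + "Tag_" + URLEncoder.encode(id, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ns = "http://www.umingo.de/ontology/bento.owl#";
        System.out.println(tagUri(ns, "42"));
        System.out.println(tagUri(ns, "a b/c"));
    }
}
```

In the init method from the question, the result would take the place of the Main.NS+"Tag_"+id concatenation passed to createIndividual.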
For your property/column values, explicitly create the literal. This makes it very clear whether you are using plain literals, language literals, or typed literals:
// If things can have multiple names in multiple languages, for example
tag_tmp.addProperty(tag_name, model.createLiteral(name, "en"));
Note that you may not wish to define your schema so that it implies things about owl:Thing, because that would have implications outside of your domain. Instead, define a domain-specific notion like a :DatabaseResource. Set the domains of your properties to be that and its subclasses rather than owl:Thing. This way the use of your property implies that the subject is within your domain, rather than simply being an OWL individual (which is implied by the domain of owl:DatatypeProperty anyway).
EDIT: It's absolutely acceptable to create a representation of the database's unique ID and place it into the RDF model. If you are using OWL 2, you can define an OWL 2 key on that property for your :DatabaseResources and keep the same semantics that you had in the database.
EDIT: Noting a portion of your post on the Jena mailing list:
I have a huge MYSQL-Database for read only purpose and want to extract some Data into the Ontology.
I would highly recommend using the TDB Java API to construct a Dataset that is backed by your disk. I've worked on very large database exports before, and it's quite possible that your data size won't be tractable otherwise. TDB's indexing requires a lot of disk space, but the memory-mapped IO makes it very difficult to kill due to OOM errors. Finally, once you have constructed the database on disk, you won't have to perform this expensive import operation again (or could at least optimize it).
If you find database creation times to be prohibitive, then you may wish to utilize the bulk loader in creative ways. This answer has an example of using the bulk loader from Java.
