how to create a jena Model not in-memory? - java

I'm trying to create a Model in jena that won't load the entire data into memory but instead will read from the filesystem.
I found a whole lot of available configurations, but they all seem to be in-memory (for example on OntModelSpec).

Use Apache Jena TDB - see documentation here.
TDB stores your dataset on disk, but accesses it very efficiently: you shouldn't experience any real performance difference over an in-memory model.
Typically, if I'm dealing with a large model or dataset I work like this:
Load model on commandline:
# /tmp/DB is where TDB will store the indexed model
$ tdbloader2 --loc /tmp/DB file.nt
(use tdbloader on Windows)
(Optional) Try a query:
$ tdbquery --loc /tmp/DB #query.sparql
Access like any old model from java:
Dataset dataset = TDBFactory.createDataset("/tmp/DB") ;
Model model = dataset.getDefaultModel() ;
... continue as before ...

You can create your own implementation of org.apache.jena.graph.Graph, which won't work with memory.
An example is d2rq, where de.fuberlin.wiwiss.d2rq.jena.GraphD2RQ works with databases. but it is based on outdated jena.

Related

How to retrieve a RDF Property

I have this piece of information in RDF/XML
<rdf:RDF xmlns:cim="http://iec.ch/TC57/2012/CIM-schema-cim16#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<cim:SynchronousMachineTimeConstantReactance rdf:ID="_54302da0-b02c-11e3-af35-080027008896">
<cim:IdentifiedObject.aliasName>GENCLS_DYN</cim:IdentifiedObject.aliasName>
<cim:IdentifiedObject.name>RoundRotor Dynamics</cim:IdentifiedObject.name>
<cim:SynchronousMachineTimeConstantReactance.tpdo>0.30000001192092896</cim:SynchronousMachineTimeConstantReactance.tpdo>
<cim:SynchronousMachineTimeConstantReactance.tppdo>0.15000000596046448</cim:SynchronousMachineTimeConstantReactance.tppdo>
I have learned a little bit about how to read the document but now I want to go farther. I am "playing" with API functions to try to get the values but I am lost (and I think I do not understand properly how JENA and RDF work). So, how can I get the values of each tag?
Greetings!
I would start with the Reading and Writing RDF in Apache Jena documentation, and then read The Core RDF Api. One important step in understanding the RDF Data Model is to seperate any notion of XML from your understanding of RDF. RDF is a graph data model that just so happens to have one serialization which is in XML.
You'll note that xml-specific language like "tags" actually don't show up at all in the discussion unless you are talking about how to serialize/deserialize RDF/XML.
In order to make the data you are looking at more human friendly, I'd suggest writing it out in TURTLE. TURTLE (or TTL) is another serialization of RDF that is much easier to read or write.
The following code will express your data in TURTLE and will be helpful in understanding what you see.
final InputStream yourInputFile = ...;
final Model model = ModelFactory.createDefaultModel();
model.read(yourInputFile, "RDF/XML");
model.write(System.out, null, "TURTLE");
You'll also want to provide minimal working examples whenever submitting questions on the subject area. For example, I had to add some missing end-tags to your data in order for it to be valid XML:
<rdf:RDF xmlns:cim="http://iec.ch/TC57/2012/CIM-schema-cim16#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<cim:SynchronousMachineTimeConstantReactance rdf:ID="_54302da0-b02c-11e3-af35-080027008896">
<cim:IdentifiedObject.aliasName>GENCLS_DYN</cim:IdentifiedObject.aliasName>
<cim:IdentifiedObject.name>RoundRotor Dynamics</cim:IdentifiedObject.name>
<cim:SynchronousMachineTimeConstantReactance.tpdo>0.30000001192092896</cim:SynchronousMachineTimeConstantReactance.tpdo>
<cim:SynchronousMachineTimeConstantReactance.tppdo>0.15000000596046448</cim:SynchronousMachineTimeConstantReactance.tppdo>
</cim:SynchronousMachineTimeConstantReactance>
</rdf:RDF>
Which becomes the following TURTLE:
<file:///R:/workspaces/create/git-svn/create-sparql/RDF/XML#_54302da0-b02c-11e3-af35-080027008896>
a cim:SynchronousMachineTimeConstantReactance ;
cim:IdentifiedObject.aliasName "GENCLS_DYN" ;
cim:IdentifiedObject.name "RoundRotor Dynamics" ;
cim:SynchronousMachineTimeConstantReactance.tpdo "0.30000001192092896" ;
cim:SynchronousMachineTimeConstantReactance.tppdo "0.15000000596046448" .
RDF operates at the statement level, so to find out that your _54302da0-b02c-11e3-af35-080027008896 is a cim:SynchronousMachineTimeConstantReactance you would look for the corresponding triples. Jena's Model API (linked to above) will provide you with methods to identify the properties that resources have.
The following will list all statements whose subject is the aforementioned resource:
final Resource s = model.getResource("file:///R:/workspaces/create/git-svn/create-sparql/RDF/XML#_54302da0-b02c-11e3-af35-080027008896");
final ExtendedIterator<Statement> properties = s.listProperties();
while( properties.hasNext() ) {
System.out.println(properties.next());
}
which produces:
[file:///R:/workspaces/create/git-svn/create-sparql/RDF/XML#_54302da0-b02c-11e3-af35-080027008896, http://iec.ch/TC57/2012/CIM-schema-cim16#SynchronousMachineTimeConstantReactance.tppdo, "0.15000000596046448"]
[file:///R:/workspaces/create/git-svn/create-sparql/RDF/XML#_54302da0-b02c-11e3-af35-080027008896, http://iec.ch/TC57/2012/CIM-schema-cim16#SynchronousMachineTimeConstantReactance.tpdo, "0.30000001192092896"]
[file:///R:/workspaces/create/git-svn/create-sparql/RDF/XML#_54302da0-b02c-11e3-af35-080027008896, http://iec.ch/TC57/2012/CIM-schema-cim16#IdentifiedObject.name, "RoundRotor Dynamics"]
[file:///R:/workspaces/create/git-svn/create-sparql/RDF/XML#_54302da0-b02c-11e3-af35-080027008896, http://iec.ch/TC57/2012/CIM-schema-cim16#IdentifiedObject.aliasName, "GENCLS_DYN"]
[file:///R:/workspaces/create/git-svn/create-sparql/RDF/XML#_54302da0-b02c-11e3-af35-080027008896, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://iec.ch/TC57/2012/CIM-schema-cim16#SynchronousMachineTimeConstantReactance]
which produces:

Write to Jena RDF model through pagination

I intend to convert the data in an SQL database into an RDF dump. I have a model and an ontology defined.
Model model = ModelFactory.createDefaultModel();
OntModel ontModel = ModelFactory.createOntologyModel(OntModelSpec.OWL_DL_MEM, model);
The ontModel has many classes defined in it. Now let us suppose I have 10000 records in my SQL db and I want to load them into the model and write it into a file. However, I want to paginate in case of a memory overflow.
int fromIndex = 0;
int toIndex = 10;
while(true) {
//1. get resources between fromIndex to toIndex from sql db
// if no more resources 'break'
//2. push these resources in model
//3. write the model to a file
RDFWriter writer = model.getWriter();
File file = new File(file_path);
FileWriter fileWriter = new FileWriter(file, true);
writer.write(this.model, fileWriter, BASE_URL);
model.close();
from = to+1;
to = to+10;
}
Now, how does one append the new resources to the existing resources in the file. Because currently I see the ontology getting written twice and it throws an exception
org.apache.jena.riot.RiotException: The markup in the document
following the root element must be well-formed.
Is there a way to handle this already?
Now, how does one append the new resources to the existing resources
in the file. Because currently I see the ontology getting written
twice and it throws an exception
A model is a set of triples. You can add more triples to the set, but that's not quite the same thing as "appending", since a model doesn't contain duplicate triples, and sets don't have a specified order, so "appending" isn't quite the right metaphor.
Jena can handle pretty big models, so you might first see whether you can just create the model and add everything to it, and then write the model to the file. While it's good to be cautious, it's not a bad idea to see whether you can do what you want without jumping through hoops.
If you do have problems with in-memory models, you might consider using a TDB backed model which will use disk for storage. You could do incremental updates to that model using the model API, or SPARQL queries, and then extract the model afterward in some serialization.
One more option, and this is probably the easiest if you really do want to append to a file, is to use a non-XML serialization of the RDF, such as Turtle or N-Triples. These are text based (N-Triples is line-based), so appending new content to a file is not a problem. This approach is described in the answer to Adding more individuals to existing RDF ontology.

Nutch Seed URLs

Is it possible to get URLs into Nutch directly from a database or a service etc. I'm not interested in the ways which data is taken from the database or service and written to seed.txt.
No. This cannot be done directly with the default nutch codebase. You need to modify Injector.java to achieve that.
EDIT:
Try using DBInputFormat : an InputFormat that reads input data from an SQL table. You need to modify the Inject code here (line 3 in snippet below):
JobConf sortJob = new NutchJob(getConf());
sortJob.setJobName("inject " + urlDir);
FileInputFormat.addInputPath(sortJob, urlDir);
sortJob.setMapperClass(InjectMapper.class);

how to get the data from xml feeds

I have the following feeds from my vendor,
http://scores.cricandcric.com/cricket/getFeed?key=4333433434343&format=xml&tagsformat=long&type=schedule
I wanted to get the data from that xml files as java objects, so that I can insert into my database regularly.
The above data is nothing but regular updates from the vendor, so that I can update in my website.
can you please suggest me what are my options available to get this working
Should I use any webservices or just Xstream
to get my final output.. please suggest me as am a new comer to this concept
Vendor has suggested me that he can give me the data in following 3 formats rss, xml or json, I am not sure what is easy and less consumable to get it working
I would suggest just write a program that parses the XML and inserts the data directly into your database.
Example
This groovy script inserts data into a H2 database.
//
// Dependencies
// ============
import groovy.sql.Sql
#Grapes([
#Grab(group='com.h2database', module='h2', version='1.3.163'),
#GrabConfig(systemClassLoader=true)
])
//
// Main program
// ============
def sql = Sql.newInstance("jdbc:h2:db/cricket", "user", "pass", "org.h2.Driver")
def dataUrl = new URL("http://scores.cricandcric.com/cricket/getFeed?key=4333433434343&format=xml&tagsformat=long&type=schedule")
dataUrl.withReader { reader ->
def feeds = new XmlSlurper().parse(reader)
feeds.matches.match.each {
def data = [
it.id,
it.name,
it.type,
it.tournamentId,
it.location,
it.date,
it.GMTTime,
it.localTime,
it.description,
it.team1,
it.team2,
it.teamId1,
it.teamId2,
it.tournamentName,
it.logo
].collect {
it.text()
}
sql.execute("INSERT INTO matches (id,name,type,tournamentId,location,date,GMTTime,localTime,description,team1,team2,teamId1,teamId2,tournamentName,logo) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", data)
}
}
Well... you could use an XML Parser (stream or DOM), or a JSON parser (again stream of 'DOM'), and build the objects on the fly. But with this data - which seems to consist of records of cricket matches, why not go with a csv format?
This seems to be your basic 'datum':
<id>1263</id>
<name>Australia v India 3rd Test at Perth - Jan 13-17, 2012</name>
<type>TestMatch</type>
<tournamentId>137</tournamentId>
<location>Perth</location>
<date>2012-01-14</date>
<GMTTime>02:30:00</GMTTime>
<localTime>10:30:00</localTime>
<description>3rd Test day 2</description>
<team1>Australia</team1>
<team2>India</team2>
<teamId1>7</teamId1>
<teamId2>1</teamId2>
<tournamentName>India tour of Australia 2011-12</tournamentName>
<logo>/cricket/137/tournament.png</logo>
Of course you would still have to parse a csv, and deal with character delimiting (such as when you have a ' or a " in a string), but it will reduce your network traffic quite substantially, and likely parse much faster on the client. Of course, this depends on what your client is.
Actually you have RESTful store that can return data in several formats and you only need to read from this source and no further interaction is needed.
So, you can use any XML Parser to parse XML data and put the extracted data in whatever data structure that you want or you have.
I did not hear about XTREME, but you can find more information about selecting the best parser for your situation at this StackOverflow question.

Import CSV file to oracle DB

Is there a simple Java library or approach that will take a SQL query and load data in a CSV file to oracle database. Pls help
You don't have to use Java to load a data file into a table unless it is absolutely necessary. Instead, I'd recommend Oracle's command-line SQL*Loader utility which was designed specially for this purpose.
For similar tasks I usually use Groovy scripts as it's really easy and quick to write and runs on the JVM off course.
...an example:
import groovy.sql.Sql
def file1 = new File(/C:\Documents and Settings\USER\Desktop\Book1.csv/)
def reader = new FileReader(file1)
def sql = Sql.newInstance("jdbc:oracle:thin:#XXXXXX:XXXX:XXX", "SCHEMA",
"USER", "oracle.jdbc.driver.OracleDriver")
reader.each { line ->
fields = line.split(';')
sql.executeInsert("insert into YOUR_TABLE values(${fields[0]},${fields[1]},${fields[2]})")
}
It's a basic example, if you have double quotes and semi columns in your csv you will probably want to use something like OpenCSV to handle that.
You could transform each line in the CSV with regular expressions, to make an insert query, and then send to Oracle (with JDBC).
I think this tool will help you for any type of database import-export problem.
http://www.dmc-fr.com/home_en.php
Do you have that CSV in a file on the database server or can you store it there? Then you may try to have Oracle open it by declaring a DIRECTORY object for the path the file is in and then create an EXTERNAL TABLE which you can query in SQL afterwards. Oracle does the parsing of the file for you.
If you are open to Python you can do bulk load using SQL*Loader
loadConf=('sqlldr userid=%s DATA=%s control=%s LOG=%s.log BAD=%s.bad DISCARD=%s.dsc' % (userid,datafile, ctlfile,fn,fn,fn)).split(' ')
p = Popen(loadConf, stdin=PIPE, stdout=PIPE, stderr=PIPE, shell=False, env=os.environ)
output, err = p.communicate()
It's will be much faster that row insert.
I uploaded basic working example here.

Categories