How to get nested RDF/XML from Jena? - java

I need to create RDF that looks like this:
<rdf:Description rdf:about='uri1'>
  <namespace:level1>
    <rdf:Description>
      <namespace:blankNode rdf:resource='uri2'/>
      <namespace:text></namespace:text>
    </rdf:Description>
  </namespace:level1>
</rdf:Description>
<rdf:Description rdf:about="uri2">
  some properties here
</rdf:Description>
As you can see, there are nested structures, as well as blank nodes. (I don't know if that's the exact terminology for the "blankNode" property in my structure.) If I use
model.write(System.out, "RDF/XML-ABBREV");
then even the blank node is nested, which I don't want. Is there any way to get this kind of structure using Jena? Or is there any other library for Java that can handle this better?

I think you're approaching this the wrong way.
Nesting is a concept that only makes sense when talking about trees. But RDF is not about trees, it's about triples. Forget for a while about the structure of the XML, and think about the triples that are encoded in the XML. I find model.write(System.out, "N-TRIPLES"); most useful for that.
You first need to understand what triples you want your RDF/XML file to express. As long as it expresses the right triples, it doesn't matter whether the one node is written nested inside the other or what order things appear in. These are purely “cosmetic” questions, like indentation.
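For instance, spelled out as triples, the desired structure above amounts to something like this (a sketch in N-Triples; the namespace URI is a placeholder, and `_:b0` is the blank node's label):

```
<uri1> <http://example.org/ns#level1> _:b0 .
_:b0 <http://example.org/ns#blankNode> <uri2> .
_:b0 <http://example.org/ns#text> "" .
<uri2> <http://example.org/ns#someProperty> "some value" .
```

Once this triple set is right, any serialization that encodes these triples is correct; whether the blank node is written nested is purely presentational.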

Meanwhile, in RDF4J I have found that org.eclipse.rdf4j.rio.rdfxml.util.RDFXMLPrettyWriter does exactly what I want: it produces a clean, compact, nicely nested RDF/XML document from a nested configuration.
While I agree with @cygri, it is often very desirable (e.g. in consulting situations) to have easily readable RDF/XML, and the default writer often emits hard-to-digest output (probably due to streaming and optimization for memory consumption/speed).

Related

Data retrieval / search in text

I am working on a personal project, for my own interest, on data retrieval. I have one text file with the following format:
.I 1
.T
experimental investigation of the aerodynamics of a
wing in a slipstream . 1989
.A
brenckman,m.
.B
experimental investigation of the aerodynamics of a
wing in a slipstream .
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small
viscosity .
.A
ting-yili
.B
some texts...
some more text....
.I 3
...
".I 1" indicates the beginning of the chunk of text corresponding to doc ID 1, and ".I 2" indicates the beginning of the chunk of text corresponding to doc ID 2.
I did:
split the docs and put them in separate files
delete stopwords (and, or, while, is, are, ...)
stem the words to get the root of each (achievement, achieve, achievable, ...all converted to achiv and so on)
and finally create a TreeMultimap which looks like this:
{key: word} {values: an ArrayList of [docID, frequency of that word in that docID] pairs}
aerodynam [[Doc_00001,5],[Doc_01344,4],[Doc_00123,3]]
book [[Doc_00562,6],[Doc_01111,1]]
....
....
result [[Doc_00010,5]]
....
....
zzzz [[Doc_01235,1]]
Now my questions:
Suppose the user wants to know:
which documents have achieving and book? (the idea)
which documents have achieving and skills, but neither book nor video
which documents include Aerodynamic
and some other simple queries like these
(Input) So suppose she enters:
achieving AND book
(achieving AND skills) AND (NOT (book AND video))
Aerodynamic
... and some other simple queries
(Output)
[Doc_00562,6],[Doc_01121,5],[Doc_01151,3],[Doc_00012,2],[Doc_00001,1]
....
as you can see, there might be:
precedence modifiers (parentheses whose nesting depth we don't know)
precedence of AND, OR, NOT
and some other interesting challenges and issues
So, I would like to run the queries against the TreeMultimap, search the words (keys), and return the values (lists of docs) to the user.
How should I think about this problem, and how should I design my solution? What articles or algorithms should I read? Any idea would be appreciated. (Thanks for reading this long post.)
The collection you have used is the Cranfield test collection, which I believe has around 3000 documents. For a collection of this size it is fine to store the inverted list (the data structure that you have constructed) in memory with a hash-based or trie-based organization. For realistic collections of much larger sizes, often comprising millions of documents, you would find it difficult to keep the inverted list entirely in memory.
Instead of reinventing the wheel, the practical solution is thus to make use of a standard text indexing (and retrieval) framework such as Lucene. This tutorial should help you to get started.
The questions that you seek to address can be answered by Boolean queries where you can specify set of Boolean operators AND, OR and NOT between its constituent terms. Lucene supports this. Have a look at the API doc here and a related StackOverflow question here.
The Boolean query retrieval algorithm is very simple. The list elements (i.e. the document ids) corresponding to each term are stored in sorted order, so that at run time the union and intersection can be computed in time linear in the size of the lists, i.e. O(n1 + n2), much like the merge step of mergesort.
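The linear-time intersection (the AND case) can be sketched as follows; the doc ids are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class PostingsIntersect {
    // Intersect two sorted posting lists of document ids in O(n1 + n2),
    // advancing whichever pointer holds the smaller id (merge-style scan).
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {            // doc id occurs in both lists
                result.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {
                i++;                   // a is behind, advance it
            } else {
                j++;                   // b is behind, advance it
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sorted postings for "achiev" and "book"
        List<Integer> achiev = List.of(1, 123, 562, 1344);
        List<Integer> book   = List.of(562, 1111, 1344);
        System.out.println(intersect(achiev, book)); // [562, 1344]
    }
}
```

Union (OR) and difference (NOT) work the same way, only differing in which ids are emitted as the two pointers advance.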
You can find more information in this book chapter.

Built-in libraries to perform efficient searching in 100GB files

Is there any built-in library in Java for searching strings in large files of about 100 GB? I am currently using binary search, but it is not very efficient.
As far as I know Java does not contain any file search engine, with or without an index. There is a very good reason for that too: search engine implementations are intrinsically tied to both the input data set and the search pattern format. A minor variation in either could result in massive changes in the search engine.
For us to be able to provide a more concrete answer you need to:
Describe exactly the data set: the number, path structure and average size of files, the format of each entry and the format of each contained token.
Describe exactly your search patterns: are those fixed strings, glob patterns or, say, regular expressions? Do you expect the pattern to match a full line or a specific token in each line?
Describe exactly your desired search results: do you want exact or approximate matches? Do you want to get a position in a file, or extract specific tokens?
Describe exactly your requirements: are you able to build an index beforehand? Is the data set expected to be modified in real time?
Explain why can't you use third party libraries such as Lucene that are designed exactly for this kind of work.
Explain why your current binary search, which should have a complexity of O(log n), is not efficient enough. The only thing that might be faster, with constant complexity, would involve the use of a hash table.
It might be best if you described your problem in broader terms. For example, one might assume from your sample data set that what you have is a set of words with associated offset or document-identifier lists. A simple approach to searching such a set would be to store a word/file-position index in a hash table, giving constant-time access to each associated list.
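Such a hash-based word index might look like this (a sketch; positions here are token offsets, but they could equally be file offsets or document ids):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordIndex {
    // Map each word to the list of positions where it occurs,
    // so a lookup runs in expected constant time.
    static Map<String, List<Integer>> buildIndex(List<String> words) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int pos = 0; pos < words.size(); pos++) {
            index.computeIfAbsent(words.get(pos), k -> new ArrayList<>())
                 .add(pos);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> words = List.of("wing", "slipstream", "wing", "plate");
        Map<String, List<Integer>> index = buildIndex(words);
        System.out.println(index.get("wing")); // [0, 2]
    }
}
```

For a 100 GB data set the index itself may not fit in memory, which is exactly why an on-disk indexing framework such as Lucene is usually the practical answer.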
If you don't want to use tools built for search, then store the data in a database and use SQL.

Is there a java library to bind XML (LOM) to XML+RDF?

For an educational project, I need some code (if it exists) that transforms XML files (specifically LOM metadata, but plain XML is fine) to XML+RDF.
I need that because I'm using an RDF store (4store) to query the triples and make searches faster.
I have read that with XSLT it's possible to transform any XML into another XML, so if you know of an existing class, library or code, please tell me.
Thank you all.
My advice would be to use a software library to transform the XML to RDF/XML, since the mapping may not be straightforward and RDF/XML has different XML semantics.
There are loads of different RDF APIs for different technology stacks, including
dotNetRDF, Jena, Sesame, ARC, Redland
http://semanticweb.org/wiki/Tools
You also need to define how the LOM metadata should be serialised into RDF. There is a good article here:
http://www.downes.ca/xml/rss_lom.htm
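If you do end up writing an XSLT stylesheet for the mapping, the JDK can apply it without any third-party library. A minimal sketch (the stylesheet here is a trivial stand-in that just extracts one element, not a real LOM-to-RDF mapping):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltDemo {
    // Apply an XSLT stylesheet to an XML document using the JDK's
    // built-in XSLT 1.0 engine and return the result as a string.
    static String transform(String xml, String xslt) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in stylesheet: copies the <entry> value out of <general>
        String xslt =
            "<xsl:stylesheet version='1.0' "
          + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "  <xsl:output omit-xml-declaration='yes'/>"
          + "  <xsl:template match='/general'>"
          + "    <entry><xsl:value-of select='identifier/entry'/></entry>"
          + "  </xsl:template>"
          + "</xsl:stylesheet>";
        String xml = "<general><identifier><catalog>oai</catalog>"
                   + "<entry>oai:archiplanet.org:ap_44629</entry>"
                   + "</identifier></general>";
        System.out.println(transform(xml, xslt)); // prints the extracted <entry>
    }
}
```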
Answering my own question:
I'm using a key/value binding for the LOM file. So this part of the metadata:
<general>
<identifier xmlns="http://ltsc.ieee.org/xsd/LOM">
<catalog>oai</catalog>
<entry>oai:archiplanet.org:ap_44629</entry>
</identifier>
catalog and entry will be converted like this:
s = the URI of my graph, it contains my filename or identifier.
p = "lom.general.identifier.catalog"
v = "oai"
...
s = the URI of my graph, it contains my filename or identifier.
p = "lom.general.identifier.entry"
v = "oai:archiplanet.org:ap_44629"
And so it generates all the triples for the RDF file. I think this approach will help when making queries about specific values or properties.
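The flattening described above can be sketched with the JDK's DOM parser (a sketch; the dotted-path scheme follows the example, and the namespace declaration is left out for brevity):

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class LomFlattener {
    // Walk the DOM and emit dotted-path keys such as
    // "lom.general.identifier.catalog" paired with each leaf's text,
    // mirroring the p/v scheme above.
    static void flatten(Element e, String path, Map<String, String> out) {
        boolean hasChildElements = false;
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElements = true;
                flatten((Element) c, path + "." + c.getNodeName(), out);
            }
        }
        if (!hasChildElements) {           // leaf element: record its text
            out.put(path, e.getTextContent().trim());
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<general><identifier>"
                   + "<catalog>oai</catalog>"
                   + "<entry>oai:archiplanet.org:ap_44629</entry>"
                   + "</identifier></general>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Map<String, String> pairs = new LinkedHashMap<>();
        flatten(doc.getDocumentElement(), "lom.general", pairs);
        pairs.forEach((p, v) -> System.out.println(p + " = " + v));
        // lom.general.identifier.catalog = oai
        // lom.general.identifier.entry = oai:archiplanet.org:ap_44629
    }
}
```

Each (path, value) pair then becomes one triple with the graph URI as subject.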
IEEE LOM is not a straightforward structure: it contains a hierarchical taxonomy which should be taken into account when you are mapping. Here you can find instructions on how to map each IEEE LOM element to RDF, if that is your case.
Regarding the conversion, you can use a Java XML library to read the XML files and create the final RDF/XML file with Jena, according to the ontology I mentioned. The LOM ontology is available here

How to sort XML files using SAX

I have 6 XML files, each containing the following tags
the first XML file is
<root>
<firstName> Smith</firstName>
<lastname>Joe</lastname>
<age>60</age>
</root>
the second is
<root>
<firstName> John</firstName>
<lastname>Andrew</lastname>
<age>55</age>
</root>
and so on
The requirement is to print the first name, last name and age, which I have done. However, I also need to
print the ages sorted,
so 55 should come first and then 60. I could not manage that with SAX.
If you use a SAX parser you need to collect the values into some intermediate structure (e.g. one of the Collections classes) and sort that. A SAX parser is event-based, so you can't really sort with it directly.
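The intermediate-structure approach can be sketched like this (element names taken from the question; the files are inlined as strings for brevity):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AgeCollector extends DefaultHandler {
    final List<Integer> ages = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes attrs) {
        text.setLength(0); // reset the character buffer for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (qName.equalsIgnoreCase("age")) {
            ages.add(Integer.parseInt(text.toString().trim()));
        }
    }

    public static void main(String[] args) throws Exception {
        String[] files = {
            "<root><firstName>Smith</firstName><lastname>Joe</lastname>"
                + "<age>60</age></root>",
            "<root><firstName>John</firstName><lastname>Andrew</lastname>"
                + "<age>55</age></root>"
        };
        AgeCollector handler = new AgeCollector();
        for (String xml : files) {          // one handler accumulates all ages
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        }
        Collections.sort(handler.ages);     // sort only after parsing finishes
        System.out.println(handler.ages);   // [55, 60]
    }
}
```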
The only possible reason for using SAX is because you don't want to allocate memory to store the whole document. If you're sorting, then SAX gives you no benefits - you're using a very low-level interface to no purpose. If you want to sort the data then by far the best solution is to use a high-level XML processing language such as XSLT or XQuery.

Storing a two-dimensional table (decision table) in XML for efficient querying

I need to implement a routing table with a number of parameters.
For example, I am stating five attributes of the incoming message below:
Customer  Txn Group  Txn Type  Sender  Priority  Target
UTI       CORP       ONEOFF    ABC     LOW       TRG1
UTI       GOV        ONEOFF    ABC     LOW       TRG2
What is the best way to represent this data in XML so that it can be queried efficiently.
I want to store this data in XML, and using Java I would load it into memory; when a message comes in, I want to identify the target based on its attributes.
Appreciate any inputs.
Thanks,
Manglu
Here is a pure XML representation that can be processed very efficiently as is, without the need to be converted into any other internal data structure:
<table>
<record Customer="UTI" Txn-Group="CORP"
Txn-Type="ONEOFF" Sender="ABC1"
Priority="LOW" Target="TRG1"/>
<record Customer="UTI" Txn-Group="Gov"
Txn-Type="ONEOFF" Sender="ABC2"
Priority="LOW" Target="TRG2"/>
</table>
There is an extremely efficient way to query data in this format using the <xsl:key> instruction and the XSLT key() function:
This transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:key name="kRec" match="record"
use="concat(@Customer,'+',@Sender)"/>
<xsl:template match="/">
<xsl:copy-of select="key('kRec', 'UTI+ABC2')"/>
</xsl:template>
</xsl:stylesheet>
when applied on the above XML document produces the desired result:
<record Customer="UTI"
Txn-Group="Gov" Txn-Type="ONEOFF"
Sender="ABC2" Priority="LOW"
Target="TRG2"/>
Do note the following:
There can be multiple <xsl:key>s defined that identify a record using different combinations of values to be concatenated together (whatever will be considered "keys" and/or "primary keys").
If an <xsl:key> is defined to use the concatenation of "primary keys" then a unique record (or no record) will be found when the key() function is evaluated.
If an <xsl:key> is defined to use the concatenation of "non-primary keys", then more than one record may be found when the key() function is evaluated.
The <xsl:key> instruction is the equivalent of defining an index in a database. This makes using the key() function extremely efficient.
In many cases it is not necessary to convert the above XML form into an intermediary data structure, neither for reasons of understandability nor of efficiency.
If you're loading it into memory, it doesn't really matter what form the XML takes - make it the easiest to read or write by hand, I would suggest. When you load it into memory, then you should transform it into an appropriate data structure. (The exact nature of the data structure would depend on the exact nature of the requirements.)
EDIT: This is to counter the arguments made in comments by Dimitre:
I'm not sure whether you thought I was suggesting that people implement their own hashtable - I certainly wasn't. Just keep a straight hashtable or perhaps a MultiMap for each column which you want to use as a key. Developers know how to use hashtables.
As for the runtime efficiency, which do you think is going to be more efficient:
You build some XSLT (and bear in mind this is foreign territory, at least relatively speaking, for most developers)
XSLT engine parses it. This step may be avoidable if you're using an XSLT library which lets you just parameterise an existing query. Even so, you've got some extra work to do.
XSLT engine hits hashtables (you hope, at least) and returns a node
You convert the node into a more useful data structure
Or:
You look up appropriate entries in your hashtable based on the keys you've been given, getting straight to a useful data structure
I think I'd trust the second one, personally. Using XSLT here feels like using a screwdriver to bash in a nail...
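The hashtable lookup described above can be sketched like this (class and field names are made up; keyed on Customer and Sender to mirror the XSLT example, though any combination of key fields works the same way):

```java
import java.util.HashMap;
import java.util.Map;

public class RoutingTable {
    private final Map<String, String> targetByKey = new HashMap<>();

    // Combine the fields that identify a route into one key,
    // using '+' as a separator just like the concat() key above.
    private static String key(String customer, String sender) {
        return customer + "+" + sender;
    }

    void add(String customer, String sender, String target) {
        targetByKey.put(key(customer, sender), target);
    }

    String lookup(String customer, String sender) {
        return targetByKey.get(key(customer, sender)); // expected O(1)
    }

    public static void main(String[] args) {
        RoutingTable table = new RoutingTable();
        // Rows loaded once at startup, e.g. parsed from the XML file
        table.add("UTI", "ABC1", "TRG1");
        table.add("UTI", "ABC2", "TRG2");
        System.out.println(table.lookup("UTI", "ABC2")); // TRG2
    }
}
```

A MultiMap-style variant (mapping a key to a list of records) covers the non-unique-key case from the XSLT discussion.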
That depends on what is repeating and what could be empty. XML is not known for its efficient queryability, as it is neither fixed-length nor compact.
I agree with the previous two posters - you should definitely not keep the internal representation of this data in XML when querying as messages come in.
The XML representation can be anything, you could do something like this:
<routes>
<route customer="UTI" txn-group="CORP" txn-type="ONEOFF" .../>
...
</routes>
My internal representation would depend on the format of the message coming in, and the language. A simple representation would be a map, mapping a structure of data (i.e. the key fields from which the routing decision is made) to the info on the target route.
Depending on your performance requirements, you could keep the key/target information as strings, though in any high-performance system you'd probably want a straight memory comparison (in C/C++) or some form of integer comparison.
Yeah, your basic problem is that you're using "XML" and "efficient" in the same sentence.
Edit: No, seriously, yer killin' me. The fact that several people in this thread are using "highly efficient" to describe anything to do with operations on a data format that require string parsing just to find out where your fields are shows that several people in this thread do not even know what the word "efficient" means. Downvote me as much as you like for saying it. I can take it, coach.
