Aggregating within Apache Jena - java

I'm using the Java API of Apache Jena to store and retrieve documents and the words within them. For this I decided to set up the following datastructure:
_dataset = TDBFactory.createDataset("./database");
_dataset.begin(ReadWrite.WRITE);
Model model = _dataset.getDefaultModel();
Resource document= model.createResource("http://name.space/Source/DocumentA");
document.addProperty(RDF.value, "Document A");
Resource word = model.createResource("http://name.space/Word/aword");
word.addProperty(RDF.value, "aword");
Resource resource = model.createResource();
resource.addProperty(RDF.value, word);
resource.addProperty(RSS.items, "5");
document.addProperty(RDF.type, resource);
_dataset.commit();
_dataset.end();
The code example above represents a document ("Document A") consisting of five (5) words ("aword"). The occurences of a word in a document are counted and stored as a property. A word can also occur in other documents, therefore the occurence count relating to a specific word in a specific document is linked together by a blank node. (I'm not entirely sure if this structure makes any sense as I'm fairly new to this way of storing information, so please feel free to provide better solutions!)
My major question is: How can I get a list of all distinct words and the sum of their occurences over all documents?

Your data model is a bit unconventional, in my opinion. With your code, you'll end up with data that looks like this (in Turtle notation), and which uses rdf:type and rdf:value in unconventional ways:
:doc rdf:value "document a" ;
rdf:type :resource .
:resource rdf:value :word ;
:items 5 .
:word rdf:value "aword" .
It's unusual, because usually you wouldn't have such complex information on the type attribute of a resource. From the SPARQL standpoint though, rdf:type and rdf:value are properties just like any other, and you can still retrieve the information you're looking for with a simple query. It would look more or less like this (though you'll need to define some prefixes, etc.):
select ?word (sum(?n) as ?nn) where {
?document rdf:type ?type .
?type rdf:value/rdf:value ?word ;
:items ?n .
}
group by ?word
That query will produce a result for each word, and with each will be the sum of all the values of the :items properties associated with the word. There are lots of questions on Stack Overflow that have examples of running SPARQL queries with Jena. E.g., (the first one that I found with Google): Query Jena TDB store.

Related

JPA Select query not returning results with one letter word

I have a query that when given a word that starts with a one-letter word followed by space character and then another word (ex: "T Distribution"), does not return results. While given "Distribution" alone returns results including the results for "T Distribution". It is the same behavior with all search terms beginning with a one-letter word followed by space character and then another word.
The problem appears when the search term is of this pattern:
"[one-letter][space][letter/word]". example: "o ring".
What would be the problem that the LIKE operator not working correctly in this case?
Here is my query:
#Cacheable(value = "filteredConcept")
#Query("SELECT NEW sina.backend.data.model.ConceptSummaryVer04(s.id, s.arabicGloss, s.englishGloss, s.example, s.dataSourceId,
s.synsetFrequnecy, s.arabicWordsCache, s.englishWordsCache, s.superId, s.categoryId, s.dataSourceCacheAr, s.dataSourceCacheEn,
s.superTypeCasheAr, s.superTypeCasheEn, s.area, s.era, s.rank, s.undiacritizedArabicWordsCache, s.normalizedEnglishWordsCache,
s.isTranslation, s.isGloss, s.arabicSynonymsCount, s.englishSynonymsCount) FROM Concept s
where s.undiacritizedArabicWordsCache LIKE %:searchTerm% AND data_source_id != 200 AND data_source_id != 31")
List<ConceptSummaryVer04> findByArabicWordsCacheAndNotConcept(#Param("searchTerm") String searchTerm, Sort sort);
the result of the query on the database itself:
link to screenshot
results on the database are returned no matter the letters case:
link to screenshot
I solved this problem.
It was due to the default configuration of the Full-text index on mysql database which is by default set to 2 (ft_min_word_len = 2).
I changed that and rebuilt the index. Then, one-letter words were returned by the query.
12.9.6 Fine-Tuning MySQL Full-Text Search
Use some quotes:
LIKE '%:searchTerm%';
Set searchTerm="%your_word%" and use it on query like this :
... s.undiacritizedArabicWordsCache LIKE :searchTerm ...

How to describe classes and properties using RDFS in Java

I am quite new to Java Sesame. I am following this tutorial: http://openrdf.callimachus.net/sesame/2.7/docs/users.docbook?view . I know how to create statements and add them into the Sesame repository. At the moment, I am trying to describe classes and properties for the statements I am going to add. For example, I am having the ones below:
:Book rdf:type rdfs:Class .
:bookTitle rdf:type rdf:Property .
:bookTitle rdfs:domain :Book .
:bookTitle rdfs:range rdfs:Literal .
:MyBook rdf:type :Book .
:MyBook :bookTitle "Open RDF" .
As shown, Book is defined as a Class. bookTitle is defined as a Property. My question is: How can I do this in Java Openrdf using org.openrdf.model.vocabulary.RDFS. To clarify the point, Here is another example:
con.add(alice, RDF.TYPE, person);
alice is a type of person. How can I define person as a class using org.openrdf.model.vocabulary.RDFS. Your assistance would be very much appreciated.
You'd do this in exactly the same way as you're describing alice as a person. Like this:
con.add(person, RDF.TYPE, RDFS.CLASS);
Similarly for the other things you want to add (assuming you've created a URI for bookTitle):
con.add(bookTitle, RDF.TYPE, RDF.PROPERTY);
con.add(bookTitle, RDFS.DOMAIN, book);
etc.
I should point out that although it is of course possible to create your schema or ontology in this fashion, it might be easier to instead create a file containing your ontology (e.g. in Turtle or N-Triples syntax), and then simply upload that file to your Sesame repository.

How to access OWL documents using XPath in Java?

I am having an OWL document in the form of an XML file. I want to extract elements from this document. My code works for simple XML documents, but it does not work with OWL XML documents.
I was actually looking to get this element: /rdf:RDF/owl:Ontology/rdfs:label, for which I did this:
DocumentBuilder builder = builderfactory.newDocumentBuilder();
Document xmlDocument = builder.parse(
new File(XpathMain.class.getResource("person.xml").getFile()));
XPathFactory factory = javax.xml.xpath.XPathFactory.newInstance();
XPath xPath = factory.newXPath();
XPathExpression xPathExpression = xPath.compile("/rdf:RDF/owl:Ontology/rdfs:label/text()");
String nameOfTheBook = xPathExpression.evaluate(xmlDocument,XPathConstants.STRING).toString();
I also tried extracting only the rdfs:label element this way:
XPathExpression xPathExpression = xPath.compile("//rdfs:label");
NodeList nodes = (NodeList) xPathExpression.evaluate(xmlDocument, XPathConstants.NODESET);
But this nodelist is empty.
Please let me know where I am going wrong. I am using Java XPath API.
Don't query RDF (or OWL) with XPath
There's already an accepted answer, but I wanted to elaborate on #Michael's comment on the question. It's a very bad idea to try to work with RDF as XML (and hence, the RDF serialization of an OWL ontology), and the reason for that is very simple: the same RDF graph can be serialized as lots of different XML documents. In the question, all that's being asked for the is rdfs:label of an owl:Ontology element, so how much could go wrong? Well, here are two serializations of the ontology.
The first is fairly human readable, and was generated by the OWL API when I saved the ontology using the Protégé ontology editor. The query in the accepted answer would work on this, I think.
<rdf:RDF xmlns="http://www.example.com/labelledOnt#"
xml:base="http://www.example.com/labelledOnt"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<owl:Ontology rdf:about="http://www.example.com/labelledOnt">
<rdfs:label>Here is a label on the Ontology.</rdfs:label>
</owl:Ontology>
</rdf:RDF>
Here is the same RDF graph using fewer of the fancy features available in the RDF/XML encoding. This is the same RDF graph, and thus the same OWL ontology. However, there is no owl:Ontology XML element here, and the XPath query will fail.
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns="http://www.example.com/labelledOnt#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" >
<rdf:Description rdf:about="http://www.example.com/labelledOnt">
<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Ontology"/>
<rdfs:label>Here is a label on the Ontology.</rdfs:label>
</rdf:Description>
</rdf:RDF>
You cannot reliably query an RDF graph in RDF/XML serialization by using typical XML-processing techniques.
Query RDF with SPARQL
Well, if we cannot query reliably query RDF with XPath, what are we supposed to use? The standard query language for RDF is SPARQL. RDF is a graph-based representation, and SPARQL queries include graph patterns that can match a graph.
In this case, the pattern that we want to match in a graph consists of two triples. A triple is a 3-tuple of the form [subject,predicate,object]. Both triples have the same subject.
The first triple says that the subject is of type owl:Ontology. The relationship “is of type” is rdf:type, so the first triple is [?something,rdf:type,owl:Ontology].
The second triple says that subject (now known to be an ontology) has an rdfs:label, and that's the value that we're interested in. The corresponding triple is [?something,rdfs:label,?label].
In SPARQL, after defining the necessary prefixes, we can write the following query.
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
?ontology a owl:Ontology ;
rdfs:label ?label .
}
(Note that because rdf:type is so common, SPARQL includes a as an abbreviation for it. The notation s p1 o1; p2 o2 . is just shorthand for the two-triple pattern s p1 o1 . s p2 o2 ..)
You can run SPARQL queries against your model in Jena either programmatically, or using the command line tools. If you do it programmatically, it is fairly easy to get the results out. To confirm that this query gets the value we're interested in, we can use Jena's command line for arq to test it out.
$ arq --data labelledOnt.owl --query getLabel.sparql
--------------------------------------
| label |
======================================
| "Here is a label on the Ontology." |
--------------------------------------
as xpath does not know the namespaces you are using.
try using:
"/*[local-name()='RDF']/*[local-name()='Ontology']/*[local-name()='label']/text()"
local name will ignore the namespaces and will work (for the first instance of this that it finds)
You would be able to use namespaces in query if you implement javax.xml.namespace.NamespaceContext for yourself. Please have a look at this answer https://stackoverflow.com/a/5466030/1443529, this explains how to get it done.

How to retrieve the Field that "hit" in Lucene

Maybe I'm really missing something.
I have indexed a bunch of key/value pairs in Lucene (v4.1 if it matters). Say I have
key1=value1 and key2=value2, e.g. as read from a properties file.
They get indexed both as specific fields and into a catchall "ALL" field, e.g.
new Field("key1", "value1", aFieldTypeMimickingKeywords);
new Field("key2", "value2", aFieldTypeMimickingKeywords);
new Field("ALL", "key1=value1", aFieldTypeMimickingKeywords);
new Field("ALL", "key2=value2", aFieldTypeMimickingKeywords);
// then get added to the Document of course...
I can then do a wildcard search, using
new WildcardQuery(new Term("ALL", "*alue1"));
and it will find the hit.
But, it would be nice to get more info, like "what was complete value (e.g. "key1=value1") that goes with that hit?".
The best I can figure out it to get the Document, then get the list of IndexableFields, then loop over all of them and see if the field.stringValue().contains("alue1"). (I can look at the data structures in the debugger and all the info is there)
This seems completely insane cause isn't that what Lucene just did? Shouldn't the Hit information return some of the Fields?
Is Lucene missing what seems like "obvious" functionality? Google and starting at the APIs hasn't revealed anything straightforward, but I feel like I must be searching on the wrong stuff.
You might want to try with IndexSearcher.explain() method. Once you get the ID of the matching document, prepare a query for each field (using the same search keywords) and invoke Explanation.isMatch() for each query: the ones that yield true will give you the matched field. Example:
for (String field: fields){
Query query = new WildcardQuery(new Term(field, "*alue1"));
Explanation ex = searcher.explain(query, docID);
if (ex.isMatch()){
//Your query matched field
}
}

How can one extract rdf:about or rdf:ID properties from triples using SPARQL?

It seemed a trivial matter at the beginning but so far I have not managed to get the unique identifier for a given resource using SPARQL. What I mean is given, e.g., rdf:Description rdf:about="http://..." and then some properties identifying this resource, what I want to do is to first find this very resource and then retrieve all the triples given some URI.
I have tried naïve approaches by writing statements in a WHERE clause such as:
?x rdf:about ?y and ?x rdfs:about ?y
I hope I am being precise.
You're making a classic mistake: confusing RDF (which is what SPARQL queries) with (one of) its serialisation, namely RDF/XML. rdf:about (and rdf:ID, rdf:Description, rdf:resource) are part of RDF/XML, a way RDF is written down. You can play around with the RDF Validator to see what RDF triples result from a piece of RDF/XML.
In your case let's start with:
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/terms/">
<rdf:Description rdf:about="http://www.example.org/">
<dc:title>Example for Donal Fellows</dc:title>
</rdf:Description>
</rdf:RDF>
Plug that into the validator and you get:
Number Subject Predicate Object
1 http://www.example.org/ http://purl.org/dc/terms/title "Example for Donal Fellows"
(you can also ask for a pictorial representation)
Notice that rdf:about is not present: its value provides the subject for the triple.
How do I do a query to find properties associated with http://www.example.org? Like this:
select * {
<http://www.example.org/> ?predicate ?object
}
You'll get:
?predicate ?object
<http://purl.org/dc/terms/title> "Example for Donal Fellows"
You'll notice that the query is a triple match with variables (?v) in places where we want to find values. We could also ask what predicate links http://www.example.org/ with "Example for..." by asking:
select * {
<http://www.example.org/> ?predicate "Example for Donal Fellows"
}
This pattern matching is the heart of SPARQL.
RDF/XML is a tricky beast, and you might find it easier to work with N-Triples, which is very verbose but clear, or turtle, which is like N-Triples with a large number of shorthands and abbreviations. Turtle is often preferred by the rdf community.
P.S. rdfs:about doesn't exist anywhere.

Categories