blank output on IndriRunQuery in lemur project - java

I'm using lemur for a nlp project, and I indexed some data succesffully
I wanna run a query on index files by IndriRunQuery command
parameter file:
<parameters>
<index>PATH-TO-INDEX-DIRECTORY</index>
<query>
<number>1</number>
<text>QUERY SAMPLE STRING</text>
</query>
<count>50</count></parameters>
there is no error, there is no answer. just a blank line in output

I found answer myself
my documents in indexing step weren't in the format that lemur document told
documents told the make training document in this format:
<DOC>
<DOCNO>DOCUMENT-ID</DOCNO>
<TEXT>DCOUMENT-PLAIN-TEXT</TEXT>
</DOC>
and indexed documents again by: buildIndex [parameterFile]
then user indriRunQuery.exe and worked well

Related

Task: Import csv, xml files into new Schema. Displaying reports. JAVA

I have a task to do. Just for practice. Would you please give ma any advises how to start it ( I am not begging for a code :) )
What should I do? Any suggestions, hints?
Import list of files csv, xml.
Every order I have to write in (in-memory) database
The order should have fields like: Client_Id, Name, Price etc....
Program has to generate reports e.g.: quantity of orders, average amount of order, total amount of orders etc.
Every raport can be displayed or written to (any) file
Database can not be divided between every start-up
Incorrect lines in order are ignored but the information about bad format is displayed.
the csv and xml files contains e.g.:
>Client_id,Other_id,Name,Amount
> 1,1,Joe,25,
> 1,2,Mike,34
> 2,2,Maria,10
> 2,1,Elizabeth,5
>
> xml <reuqests> <request>
> <client_id>1</client_Id>
> <name>Joe</name> etc.......
> </request>
> </requests>
What I might know is:
1. Import csv using csvopen. Sounds easy but, do I need to import them separately or can I use the List
2/3 Can I create schema using hibernate?
4 ? I am not sure. Can I use "switch"?
5. What library can I use to write reports? (Can I use Jesper Reports?)
6/7 What does it mean?

Using Lucene, how to index TXT files into different fields?

I am using the NSF data whose format is txt. Now I have indexed these data and can send a query and got several results. But how can I search something in a selected field (eg. title) ? Because all of these NSF data are totally plain txt file. I do not think Lucene can recognize which part of the file is a "title" or something else. Should I firstly transfer the txt files to XML files (with tags telling Lucene which part is "title")? Can Lucene do that? I have no idea how to split the txt files into several fields. Can anyone please give me some suggestions? Thanks a lot!
BTW, every txt file looks like this:
---begin---
Title: Mitochondrial DNA and Historical Demography
Type: Award
Date: August 1, 1991
Number: 9000006
Abstract: asdajsfhsjdfhsjngfdjnguwiehfrwiuefnjdnfsd
----end----
You have to split the text into the several parts. You can use the resulting strings to create a field for each part of the text, i.e. title.
Create your lucene document with the fields like this:
Document doc = new Document();
doc.add(new Field("title", titleString, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("abstract", abstractString, Field.Store.NO, Field.Index.TOKENIZED));
and so on. After indexing the document you can search in the title like this: title:dna
More complex queries and mixing multiple fields in the query also possible: +title:dna +abstract:"some example text" -number:935353

How to access OWL documents using XPath in Java?

I am having an OWL document in the form of an XML file. I want to extract elements from this document. My code works for simple XML documents, but it does not work with OWL XML documents.
I was actually looking to get this element: /rdf:RDF/owl:Ontology/rdfs:label, for which I did this:
DocumentBuilder builder = builderfactory.newDocumentBuilder();
Document xmlDocument = builder.parse(
new File(XpathMain.class.getResource("person.xml").getFile()));
XPathFactory factory = javax.xml.xpath.XPathFactory.newInstance();
XPath xPath = factory.newXPath();
XPathExpression xPathExpression = xPath.compile("/rdf:RDF/owl:Ontology/rdfs:label/text()");
String nameOfTheBook = xPathExpression.evaluate(xmlDocument,XPathConstants.STRING).toString();
I also tried extracting only the rdfs:label element this way:
XPathExpression xPathExpression = xPath.compile("//rdfs:label");
NodeList nodes = (NodeList) xPathExpression.evaluate(xmlDocument, XPathConstants.NODESET);
But this nodelist is empty.
Please let me know where I am going wrong. I am using Java XPath API.
Don't query RDF (or OWL) with XPath
There's already an accepted answer, but I wanted to elaborate on #Michael's comment on the question. It's a very bad idea to try to work with RDF as XML (and hence, the RDF serialization of an OWL ontology), and the reason for that is very simple: the same RDF graph can be serialized as lots of different XML documents. In the question, all that's being asked for the is rdfs:label of an owl:Ontology element, so how much could go wrong? Well, here are two serializations of the ontology.
The first is fairly human readable, and was generated by the OWL API when I saved the ontology using the Protégé ontology editor. The query in the accepted answer would work on this, I think.
<rdf:RDF xmlns="http://www.example.com/labelledOnt#"
xml:base="http://www.example.com/labelledOnt"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<owl:Ontology rdf:about="http://www.example.com/labelledOnt">
<rdfs:label>Here is a label on the Ontology.</rdfs:label>
</owl:Ontology>
</rdf:RDF>
Here is the same RDF graph using fewer of the fancy features available in the RDF/XML encoding. This is the same RDF graph, and thus the same OWL ontology. However, there is no owl:Ontology XML element here, and the XPath query will fail.
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns="http://www.example.com/labelledOnt#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" >
<rdf:Description rdf:about="http://www.example.com/labelledOnt">
<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Ontology"/>
<rdfs:label>Here is a label on the Ontology.</rdfs:label>
</rdf:Description>
</rdf:RDF>
You cannot reliably query an RDF graph in RDF/XML serialization by using typical XML-processing techniques.
Query RDF with SPARQL
Well, if we cannot query reliably query RDF with XPath, what are we supposed to use? The standard query language for RDF is SPARQL. RDF is a graph-based representation, and SPARQL queries include graph patterns that can match a graph.
In this case, the pattern that we want to match in a graph consists of two triples. A triple is a 3-tuple of the form [subject,predicate,object]. Both triples have the same subject.
The first triple says that the subject is of type owl:Ontology. The relationship “is of type” is rdf:type, so the first triple is [?something,rdf:type,owl:Ontology].
The second triple says that subject (now known to be an ontology) has an rdfs:label, and that's the value that we're interested in. The corresponding triple is [?something,rdfs:label,?label].
In SPARQL, after defining the necessary prefixes, we can write the following query.
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
?ontology a owl:Ontology ;
rdfs:label ?label .
}
(Note that because rdf:type is so common, SPARQL includes a as an abbreviation for it. The notation s p1 o1; p2 o2 . is just shorthand for the two-triple pattern s p1 o1 . s p2 o2 ..)
You can run SPARQL queries against your model in Jena either programmatically, or using the command line tools. If you do it programmatically, it is fairly easy to get the results out. To confirm that this query gets the value we're interested in, we can use Jena's command line for arq to test it out.
$ arq --data labelledOnt.owl --query getLabel.sparql
--------------------------------------
| label |
======================================
| "Here is a label on the Ontology." |
--------------------------------------
as xpath does not know the namespaces you are using.
try using:
"/*[local-name()='RDF']/*[local-name()='Ontology']/*[local-name()='label']/text()"
local name will ignore the namespaces and will work (for the first instance of this that it finds)
You would be able to use namespaces in query if you implement javax.xml.namespace.NamespaceContext for yourself. Please have a look at this answer https://stackoverflow.com/a/5466030/1443529, this explains how to get it done.

Extracting a column from a paragraph from a csv file using java

MAJOR ACC NO,MINOR ACC NO,STD CODE,TEL NO,DIST CODE
7452145,723456, 01,4213036,AAA
7254287,7863265, 01,2121920,AAA
FRUNDTE,FMACNO,FACCNO,FDISTCOD,FBILSEQ,FOOCTYP,FOOCDES,FOOCAMT,FSTD,FTELNO,FNORECON,FXFRACCN,FLANGIND,CUR
12345,71234,7643234,AAA,001,DX,WLR Promotion - Insitu /Pre-Cabled PSTN Connection,-37.87,,,0,,E,EUR
FRUNDTE,FMACNO,FACCNO,FDISTCOD,FBILSEQ,FORDNO,FREF,FCHGDES,FCHGAMT,CUR,FORENFRM,FORENTO
3242241,72349489,2345352,AAA,001,30234843P ,1,NEW CONNECTION - PRECABLED CHARGE,37.87,EUR,2123422,201201234
12123471,7618412389,76333232,AAA,001,3123443P ,2,BROKEN PERIOD RENTAL,5.40,EUR,201234523,20123601
I have a csv file something like the one above and I want to extract certain columns from it. For example I want to extract the first column of the first paragraph. I'm kind of new to java but I am able to read the file but I want to extract certain columns from different paragraphs. Any help will be appreciated.

Adding entities to solr using solrj and schema.xml

I would like to add entities to documents like you can do with the data-config.
At the moment I'm indexing every page of my documents as a single document.
Now :
<solrDoc>
<id>1</id>
<docname>test.pdf</docmname>
<pagenumber>1</pagenumber>
<pagecontent>blablabla</pagecontent>
</solrDoc>
<solrDoc>
<id>2</id>
<docname>test.pdf</docmname>
<pagenumber>2</pagenumber>
<pagecontent>blablabla</pagecontent>
</solrDoc>
As you can see the data related to the document is stored x pages times. I would like to get documents like this:
<doc>
<id>1</id>
<docname>test.pdf</docmname>
<pageEntries> //multivaluefield
<pageEntry><pagenumber>1</pagenumber><pagecontent>blablabla</pagecontent></pageEntry>
<pageEntry><pagenumber>2</pagenumber><pagecontent>blablabla</pagecontent></pageEntry>
</pageEntries>
</doc>
I don't know how to make something like pageEntry. I saw that solr can import entities from databases but I'm wondering how I can do the same? (or something similar)
I'm using solr 3.6.1. The page extraction is done by myself using pdfbox.
Java code:
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField("id", 1);
solrDoc.setField("filename", "test");
for (int p : pages) {
solrDoc.addField("page", p);
}
for (String pc : pagecont) {
solrDoc.addField("pagecont", pc);
}
If the extraction is performed by you, you can club all the pages and feed it as a single Solr document with the pagenumber & pagecontent being multivalued fields.
You can use the same id for all the pages (with the id not being a primary field in the schema definition) and use Grouping (Field Collapsing) to group the results for the documents.

Categories