I would like to add entities to documents like you can do with the data-config.
At the moment I'm indexing every page of my documents as a single document.
Now :
<solrDoc>
<id>1</id>
<docname>test.pdf</docmname>
<pagenumber>1</pagenumber>
<pagecontent>blablabla</pagecontent>
</solrDoc>
<solrDoc>
<id>2</id>
<docname>test.pdf</docmname>
<pagenumber>2</pagenumber>
<pagecontent>blablabla</pagecontent>
</solrDoc>
As you can see the data related to the document is stored x pages times. I would like to get documents like this:
<doc>
<id>1</id>
<docname>test.pdf</docmname>
<pageEntries> //multivaluefield
<pageEntry><pagenumber>1</pagenumber><pagecontent>blablabla</pagecontent></pageEntry>
<pageEntry><pagenumber>2</pagenumber><pagecontent>blablabla</pagecontent></pageEntry>
</pageEntries>
</doc>
I don't know how to make something like pageEntry. I saw that solr can import entities from databases but I'm wondering how I can do the same? (or something similar)
I'm using solr 3.6.1. The page extraction is done by myself using pdfbox.
Java code:
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField("id", 1);
solrDoc.setField("filename", "test");
for (int p : pages) {
solrDoc.addField("page", p);
}
for (String pc : pagecont) {
solrDoc.addField("pagecont", pc);
}
If the extraction is performed by you, you can club all the pages and feed it as a single Solr document with the pagenumber & pagecontent being multivalued fields.
You can use the same id for all the pages (with the id not being a primary field in the schema definition) and use Grouping (Field Collapsing) to group the results for the documents.
Related
I need some help with a project I am planing to do. At this stage I am trying to learn using NoSQL Databases in Java.
I've got a few nested documents looking like this:
MongoDB nesting structure
Like you can see on the image, my inner attributes are "model" and "construction".
Now I need to iterate through all the documents in my collection, whose keynames are unknown, because they are generated in runtime, when a user enters some information.
At the end I need to list them in a TreeView, keeping the structure they have already in the database.
What I've tried is getting keySets from documents, but I cannot pass the second layer of the structure. I am able to print the whole Object in Json format, but I cannot access the specific attributes like "model" or "construction".
MongoCollection collection= mongoDatabase.getCollection("test");
MongoCursor<Document> cursor = collection.find().iterator();
for(String keys: document.keySet()) {
Document vehicles = (Document) document.getString(keys);
//System.out.println(keys);
//System.out.println(document.get(keys));
}
/Document cars = (Document) vehicle.get("cars");
Document types = (Document) cars.get("coupes");
Document brands = (Document) types.get("Ford");
Document model = (Document) brands.get("Mustang GT");
Here I tried to get some properties, by hardcoding the keynames of the documents, but I can't seem to get any value either. It keeps telling me that it could not read from vehicle, because it is null.
The most tutorials and posts in forums, somehow does not work for me. I don't know if they have any other version of MongoDB Driver. Mine is: mongodb driver 3.12.7. if this helps you in any way.
I am trying to get this working for days now and it is driving me crazy.
I hope there is anyone out there who is able to help me with this problem.
Here is a way you can try using the Document class's methods. You use the Document#getEmbedded method to navigate the embedded (or sub-document) document's path.
try (MongoCursor<Document> cursor = collection.find().iterator()) {
while (cursor.hasNext()) {
// Get a document
Document doc = (Document) cursor.next();
// Get the sub-document with the known key path "vehicles.cars.coupes"
Document coupes = doc.getEmbedded(
Arrays.asList("vehicles", "cars", "coupes"),
Document.class);
// For each of the sub-documents within the "coupes" get the
// dynamic keys and their values.
for (Map.Entry<String, Object> coupe : coupes.entrySet()) {
System.out.println(coupe.getKey()); // e.g., Mercedes
// The dynamic sub-document for the dynamic key (e.g., Mercedes):
// {"S-Class": {"model": "S-Class", "construction": "2011"}}
Document coupeSubDoc = (Document) coupe.getValue();
// Get the coupeSubDoc's keys and values
coupeSubDoc.keySet().forEach(k -> {
System.out.println("\t" + k); // e.g., S-Class
System.out.println("\t\t" + "model" + " : " +
coupeSubDoc.getEmbedded(Arrays.asList(k, "model"), String.class));
System.out.println("\t\t" + "construction" + " : " +
coupeSubDoc.getEmbedded(Arrays.asList(k, "construction"), String.class));
});
}
}
}
The above code prints to the console as:
Mercedes
S-Class
model : S-Class
construction : 2011
Ford
Mustang
model : Mustang GT
construction : 2015
I think it's not the complete answer to his question.
Here he says:
Now I need to iterate through all the documents in my collection, whose keynames are unknown, because they are generated in runtime, when a user enters some information.
Your answer #prasad_ just refers to his case with vehicles, cars and so on. He needs a way to handle unknown key/value pairs i guess. For example, in this case he only knows the keys:vehicle,cars,coupe,Mercedes/Ford and their subkeys. If another user inserts some new key/value paairs in the collection he will have problems because he can't navigate trough the new document without to have a look into the database.
I'm also interested in the solution because I never nested my key/value pairs and cant see the advantage of it. Am I wrong or does it make the programming more difficult?
I'm using the Java API of Apache Jena to store and retrieve documents and the words within them. For this I decided to set up the following datastructure:
_dataset = TDBFactory.createDataset("./database");
_dataset.begin(ReadWrite.WRITE);
Model model = _dataset.getDefaultModel();
Resource document= model.createResource("http://name.space/Source/DocumentA");
document.addProperty(RDF.value, "Document A");
Resource word = model.createResource("http://name.space/Word/aword");
word.addProperty(RDF.value, "aword");
Resource resource = model.createResource();
resource.addProperty(RDF.value, word);
resource.addProperty(RSS.items, "5");
document.addProperty(RDF.type, resource);
_dataset.commit();
_dataset.end();
The code example above represents a document ("Document A") consisting of five (5) words ("aword"). The occurences of a word in a document are counted and stored as a property. A word can also occur in other documents, therefore the occurence count relating to a specific word in a specific document is linked together by a blank node. (I'm not entirely sure if this structure makes any sense as I'm fairly new to this way of storing information, so please feel free to provide better solutions!)
My major question is: How can I get a list of all distinct words and the sum of their occurences over all documents?
Your data model is a bit unconventional, in my opinion. With your code, you'll end up with data that looks like this (in Turtle notation), and which uses rdf:type and rdf:value in unconventional ways:
:doc rdf:value "document a" ;
rdf:type :resource .
:resource rdf:value :word ;
:items 5 .
:word rdf:value "aword" .
It's unusual, because usually you wouldn't have such complex information on the type attribute of a resource. From the SPARQL standpoint though, rdf:type and rdf:value are properties just like any other, and you can still retrieve the information you're looking for with a simple query. It would look more or less like this (though you'll need to define some prefixes, etc.):
select ?word (sum(?n) as ?nn) where {
?document rdf:type ?type .
?type rdf:value/rdf:value ?word ;
:items ?n .
}
group by ?word
That query will produce a result for each word, and with each will be the sum of all the values of the :items properties associated with the word. There are lots of questions on Stack Overflow that have examples of running SPARQL queries with Jena. E.g., (the first one that I found with Google): Query Jena TDB store.
I am using the NSF data whose format is txt. Now I have indexed these data and can send a query and got several results. But how can I search something in a selected field (eg. title) ? Because all of these NSF data are totally plain txt file. I do not think Lucene can recognize which part of the file is a "title" or something else. Should I firstly transfer the txt files to XML files (with tags telling Lucene which part is "title")? Can Lucene do that? I have no idea how to split the txt files into several fields. Can anyone please give me some suggestions? Thanks a lot!
BTW, every txt file looks like this:
---begin---
Title: Mitochondrial DNA and Historical Demography
Type: Award
Date: August 1, 1991
Number: 9000006
Abstract: asdajsfhsjdfhsjngfdjnguwiehfrwiuefnjdnfsd
----end----
You have to split the text into the several parts. You can use the resulting strings to create a field for each part of the text, i.e. title.
Create your lucene document with the fields like this:
Document doc = new Document();
doc.add(new Field("title", titleString, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("abstract", abstractString, Field.Store.NO, Field.Index.TOKENIZED));
and so on. After indexing the document you can search in the title like this: title:dna
More complex queries and mixing multiple fields in the query also possible: +title:dna +abstract:"some example text" -number:935353
I am having an OWL document in the form of an XML file. I want to extract elements from this document. My code works for simple XML documents, but it does not work with OWL XML documents.
I was actually looking to get this element: /rdf:RDF/owl:Ontology/rdfs:label, for which I did this:
DocumentBuilder builder = builderfactory.newDocumentBuilder();
Document xmlDocument = builder.parse(
new File(XpathMain.class.getResource("person.xml").getFile()));
XPathFactory factory = javax.xml.xpath.XPathFactory.newInstance();
XPath xPath = factory.newXPath();
XPathExpression xPathExpression = xPath.compile("/rdf:RDF/owl:Ontology/rdfs:label/text()");
String nameOfTheBook = xPathExpression.evaluate(xmlDocument,XPathConstants.STRING).toString();
I also tried extracting only the rdfs:label element this way:
XPathExpression xPathExpression = xPath.compile("//rdfs:label");
NodeList nodes = (NodeList) xPathExpression.evaluate(xmlDocument, XPathConstants.NODESET);
But this nodelist is empty.
Please let me know where I am going wrong. I am using Java XPath API.
Don't query RDF (or OWL) with XPath
There's already an accepted answer, but I wanted to elaborate on #Michael's comment on the question. It's a very bad idea to try to work with RDF as XML (and hence, the RDF serialization of an OWL ontology), and the reason for that is very simple: the same RDF graph can be serialized as lots of different XML documents. In the question, all that's being asked for the is rdfs:label of an owl:Ontology element, so how much could go wrong? Well, here are two serializations of the ontology.
The first is fairly human readable, and was generated by the OWL API when I saved the ontology using the Protégé ontology editor. The query in the accepted answer would work on this, I think.
<rdf:RDF xmlns="http://www.example.com/labelledOnt#"
xml:base="http://www.example.com/labelledOnt"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<owl:Ontology rdf:about="http://www.example.com/labelledOnt">
<rdfs:label>Here is a label on the Ontology.</rdfs:label>
</owl:Ontology>
</rdf:RDF>
Here is the same RDF graph using fewer of the fancy features available in the RDF/XML encoding. This is the same RDF graph, and thus the same OWL ontology. However, there is no owl:Ontology XML element here, and the XPath query will fail.
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns="http://www.example.com/labelledOnt#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" >
<rdf:Description rdf:about="http://www.example.com/labelledOnt">
<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Ontology"/>
<rdfs:label>Here is a label on the Ontology.</rdfs:label>
</rdf:Description>
</rdf:RDF>
You cannot reliably query an RDF graph in RDF/XML serialization by using typical XML-processing techniques.
Query RDF with SPARQL
Well, if we cannot query reliably query RDF with XPath, what are we supposed to use? The standard query language for RDF is SPARQL. RDF is a graph-based representation, and SPARQL queries include graph patterns that can match a graph.
In this case, the pattern that we want to match in a graph consists of two triples. A triple is a 3-tuple of the form [subject,predicate,object]. Both triples have the same subject.
The first triple says that the subject is of type owl:Ontology. The relationship “is of type” is rdf:type, so the first triple is [?something,rdf:type,owl:Ontology].
The second triple says that subject (now known to be an ontology) has an rdfs:label, and that's the value that we're interested in. The corresponding triple is [?something,rdfs:label,?label].
In SPARQL, after defining the necessary prefixes, we can write the following query.
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
?ontology a owl:Ontology ;
rdfs:label ?label .
}
(Note that because rdf:type is so common, SPARQL includes a as an abbreviation for it. The notation s p1 o1; p2 o2 . is just shorthand for the two-triple pattern s p1 o1 . s p2 o2 ..)
You can run SPARQL queries against your model in Jena either programmatically, or using the command line tools. If you do it programmatically, it is fairly easy to get the results out. To confirm that this query gets the value we're interested in, we can use Jena's command line for arq to test it out.
$ arq --data labelledOnt.owl --query getLabel.sparql
--------------------------------------
| label |
======================================
| "Here is a label on the Ontology." |
--------------------------------------
as xpath does not know the namespaces you are using.
try using:
"/*[local-name()='RDF']/*[local-name()='Ontology']/*[local-name()='label']/text()"
local name will ignore the namespaces and will work (for the first instance of this that it finds)
You would be able to use namespaces in query if you implement javax.xml.namespace.NamespaceContext for yourself. Please have a look at this answer https://stackoverflow.com/a/5466030/1443529, this explains how to get it done.
I want to create nested document in solr, I am using java/GWT/SolrJ.
Currently I am indexing following fields:
Items:
id title desc.
1 xyz xyzxyzxyz
2 pqr pqrpqrpqr
3 abc abcabcabc.
But now i want to create one more document linked with each document from above i.e. for id 1 there is one subdocument which contains follwing fields:
Item_User_Details:
for item 1 :
user details
1 qweqweqwe
2 xyzxyzxyz
3 asdasdasd
in this way I want to create for each item id from above table, there is one linked document of item_user_details.
How can I do this...?
Thanks in advance...
In our schema we've a lot of related tables.
We decided to flatten all relations into one document. To achieve this we created a custom importer (using SolrJ), which loads each document from the index, adds the related fields and write that document back.
[edit]
We do this in the following way:
export the data in a csv-file for each table (item, item_user_details)
import each csv-file into Solr, starting with the top (item in your case)
Start an Embedded-Solr server:
System.setProperty("solr.solr.home", config.getSolrIndexPath());
CoreContainer.Initializer initializer = new CoreContainer.Initializer();
this.coreContainer = initializer.initialize();
this.solr = new EmbeddedSolrServer(this.coreContainer, "");
Alternatively you can access a remote solr instance:
this.solr = new HttpSolrServer("http://[your-url]/solr");
Create a SolrDocument for each line in the file
add it to the index this.solr.add(ClientUtils.toSolrInputDocument(doc));
Commit this.solr.commit();
Load documents from the index (items)
Idetify relations in the csv-file for item_user_details via the document id (item-id)
Exted the loaded document with the fields from item_user_details
Commit the Document again