Custom index in Apache Solr - java

Suppose that in addition to simple text terms I want to retrieve some complex data from text. For example, the text can contain descriptions of graphs in some format. I then want to run queries that place conditions on those graphs (for example, find all documents with planar graphs, or something like this). The standard Solr index does not seem sufficient for this task because, as I understand it, it ultimately treats a document as a collection of tokens, which are just strings, whereas I need an additional index in a more suitable format. So the question is: can I somehow customize how Solr indexes data and retrieves it from the index? I've read a lot of documentation but could not find an answer.

Yes. You are able to define each field in the schema.xml file. Within that file, you can define what type of data is stored, how the document is tokenized, and how the tokenized data is manipulated. In order to meet your need, you will probably need to write a custom tokenizer and possibly custom filters as well.

Your best starting point is the field type definition of text_general in the schema. It shows the various tokenizers and filters that are applied to the text and help with indexing. You can define different analysis chains for the indexing and the querying process.
Keep in mind that the tokenizer applies to the text, while filters apply to each token. You have descriptions of graphs in some format. Can you elaborate on that format, so that we can think of better approaches? There are many existing tokenizers and filters available. Depending on the format, you can use an existing one or write your own.
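If you do end up writing your own analysis component, the shape of it is fairly small. Below is a minimal, purely illustrative sketch of a custom Lucene TokenFilter (the class name and the "graph:" prefix convention are invented for the example); in Solr you would wrap it in a TokenFilterFactory and reference that factory from the field type in schema.xml.

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical filter: rewrites tokens that look like graph descriptors
// (e.g. "graph:planar") into searchable terms such as "planar_graph".
public final class GraphPropertyFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public GraphPropertyFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;           // no more tokens from the upstream tokenizer
        }
        String term = termAtt.toString();
        if (term.startsWith("graph:")) {
            // Replace the token text with a normalised, searchable form.
            termAtt.setEmpty().append(term.substring("graph:".length())).append("_graph");
        }
        return true;
    }
}
```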

Related

Indexing NTriples with Lucene

Part of my project is to index the s-p-o triples in NTriples files, and I need some help figuring out how exactly to do this in Java (or another language if possible).
Problem statement:
We have around 10 files with the extension “.ntriple”, each containing at least 10k triples. The format of each file is multiple RDF triples:
<subject1_uri> <predicate1_uri> <object1_uri>
<subject2_uri> <predicate1_uri> <object2_uri>
<subject2_uri> <predicate1_uri> <object3_uri>
…..
…..
What I need to do is index each of these subjects, predicates, and objects so that we can search and retrieve quickly for queries like “Give me all subjects and objects for predicate1_uri”, and so on.
I gave it a try using this example, but I saw that it was doing a full-text search. That doesn't seem efficient, as the NTriples files could be as large as 50 MB per file.
Then I thought of NOT doing a full-text search and instead storing each s-p-o triple as an index Document, with each of s, p, and o as a Document Field and another Field as an Id (the offset of the s-p-o triple in the corresponding NTriples file).
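For illustration, the layout I have in mind would be roughly the following in Lucene (field names are just placeholders):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;

public class TripleDocument {
    // One Lucene Document per triple; the offset identifies where the
    // triple sits in its source .ntriple file.
    static Document fromTriple(String subject, String predicate, String object, long offset) {
        Document doc = new Document();
        doc.add(new StringField("subject", subject, Field.Store.YES));
        doc.add(new StringField("predicate", predicate, Field.Store.YES));
        doc.add(new StringField("object", object, Field.Store.YES));
        doc.add(new StoredField("offset", offset));
        return doc;
    }
}
```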
I have two questions:
Is Lucene the only option for what I am trying to achieve?
Would the size of the index files themselves be larger than half the size of the data itself?
Any and all help really appreciated.
To answer your first question: No, Lucene is not the only option to do this. You can (and probably should) use any generic RDF database to store the triples. Then you can query the triples using their Java API or using SPARQL. I personally recommend Apache Jena as a Java API for working with RDF.
If you need free-text search across literals in your dataset, there is Lucene Integration with Apache Jena through Jena Text.
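As a minimal sketch of what that looks like with the Jena API (the file name and predicate URI below are placeholders, and jena-core is assumed to be on the classpath):

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;

public class TripleLookup {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // Load one of the .ntriple files; "N-TRIPLES" is the serialization name Jena expects.
        model.read("file:data/part1.ntriple", "N-TRIPLES");

        // "Give me all subjects and objects for predicate1_uri"
        Property predicate = model.createProperty("http://example.org/predicate1_uri");
        StmtIterator it = model.listStatements(null, predicate, (RDFNode) null);
        while (it.hasNext()) {
            Statement stmt = it.nextStatement();
            System.out.println(stmt.getSubject() + " -> " + stmt.getObject());
        }
    }
}
```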
Regarding index sizes, this depends entirely on the entropy of your data. If you have 40,000 lines in an NTriples file but they are all repetitions of the same triples, then the index will be relatively small. Typically, however, RDF databases build multiple indexes over the data, and you will see a size increase.
The primary benefit of this indexing is that you can ask more generic questions than “Give me all subjects and objects for predicate1_uri”. That question can be answered by linearly processing all of the NTriples files without even knowing you're using RDF. The following SPARQL-like query shows an example of a more difficult search facilitated by these data stores:
SELECT DISTINCT ?owner
WHERE {
  ?owner :owns ?thing .
  ?thing rdf:type/rdfs:subClassOf* :Automobile .
  ?thing :hasColor "red"@en .
}
In the preceding query, we locate owners of something that is an automobile (or any more specific subclass of automobile), so long as the color of that thing is "red" (as specified in English).

What is the best and optimized way to extract data from XML multiple times?

In our application we have a requirement to retrieve data from XML multiple times. We make a service call, get the data in XML format, and save it in memory. Later we need to retrieve data by element name or attribute name multiple times, which forces us to parse the XML each and every time; this is not a good way to go.
We are limited to saving only a String in memory and also cannot use Spring or any other framework, so we can either keep the XML as a String or convert the String into some other format and parse that. These are the options I could think of:
Parse the XML every time we need to retrieve a value.
Extract the required data from the XML using a parser, save it as a map in String format, and parse that map data with custom code.
Convert the big XML into a smaller XML and parse that small XML every time.
Use String split functions.
I would appreciate it if anyone could suggest a fast way to retrieve data from the String.
Transform your large dataset to a small dataset. Use an efficient serializer/parser. Do pull parsing and serialization, avoid object bindings (DOM / annotated objects).
Stop parsing when you have what you want, if possible. Possibly arrange your data (e.g. sort it) to achieve this.
Whether you use JSON or XML is secondary.
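As a concrete illustration of pull parsing with the JDK's built-in StAX API (the element name "item" is just an example), the parser can stop as soon as it has the value it needs:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PullParseExample {
    // Pull the text of the first <item> element out of an XML string, then stop parsing.
    static String firstItem(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    return reader.getElementText();   // stop as soon as we have what we want
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<data><list><item>Item A</item><item>Item B</item></list></data>";
        System.out.println(firstItem(xml)); // prints: Item A
    }
}
```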
Since I am the author of vtd-xml, I must acknowledge that my point of view may be biased. But VTD-XML is ideal for your use case.
Let me explain more:
First, parsing will not be as big a problem with VTD-XML as it is with DOM.
You can also choose to persist the parsed result with VTD-XML's built-in indexing. If you want to reuse the same XML without parsing it more than once, this is very handy: just load the .vxl file into memory. A VTD-XML index has two parts: the literal XML (which remains human readable) and the binary index produced as the output of parsing.
Since VTD-XML uses far less memory than DOM, your option #3 may become unnecessary.
Also, VTD-XML's indexing structure is very easy to understand; it can be written on the back of a matchbox.
VTD-XML is also well suited to splitting big XML files once you understand its underlying principle.
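A minimal sketch of that usage with the com.ximpleware API (the file name and XPath expression are placeholders taken from the example XML further down this thread):

```java
import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdExample {
    public static void main(String[] args) throws Exception {
        VTDGen gen = new VTDGen();
        // Parse once; the resulting VTDNav can be kept in memory and navigated repeatedly.
        if (!gen.parseFile("data.xml", true)) {
            throw new IllegalStateException("parse failed");
        }
        VTDNav nav = gen.getNav();
        AutoPilot ap = new AutoPilot(nav);
        ap.selectXPath("/data/list/item");
        int i;
        while ((i = ap.evalXPath()) != -1) {
            // Print the text content of each matched element.
            System.out.println(nav.toString(nav.getText()));
        }
    }
}
```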
Let me know if you have any questions.
So here's the impression I'm getting: you need to store the serialized contents of an XML file in a variable of type String, and you need the fastest way to get data back out of it.
Assuming that arrays or linked lists of strings are not allowed, you could convert the XML into JSON, which is considerably faster to parse, easier to cache, and smaller than the equivalent XML. The resulting JSON would then be minified and stored in a string.
For example, the XML
<data>
<list>
<item>Item A</item>
<item>Item B</item>
</list>
</data>
could become
{"data":{"list":{"item":["a","b"]}}}
Note how much smaller that is than the XML, especially considering the fact that there are only opening "tags" and not closing ones, as are needed in XML. A string storing the transformed JSON data would take up less memory (and the amount of data saved would become more apparent with larger data sets) and be considerably faster to parse. Additionally, JSON is the standard for online data transfers and is outperforming XML in many areas, especially in larger datasets or where there is a considerable level of complexity in the objects being stored.
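If you go this route, one option is the org.json library (assuming it is available to you), whose XML helper can do the conversion and minification in a couple of lines; the element names here are just the example above:

```java
import org.json.JSONObject;
import org.json.XML;

public class XmlToJson {
    public static void main(String[] args) {
        String xml = "<data><list><item>Item A</item><item>Item B</item></list></data>";
        JSONObject json = XML.toJSONObject(xml);
        // toString() with no arguments produces minified JSON, ready to keep in a String field.
        String minified = json.toString();
        System.out.println(minified);
        // e.g. {"data":{"list":{"item":["Item A","Item B"]}}}
    }
}
```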
Here is some more reading on this topic:
JSON vs XML with a Web-oriented point of view
https://www.w3schools.com/js/js_json_xml.asp
"JSON: The Fat-Free Alternative to XML"
http://www.json.org/xml.html
An objective comparison between JSON and XML
https://www.sitepoint.com/json-vs-xml/
Hope I helped! Let me know if you have any questions.
EDIT:
I just saw your comment on your question that your XML stores extremely large data sets. In this case, I would not recommend using any sort of serialization but rather storing the data in a database (for what you need, I think something like MongoDB would work the best, given its unstructured approach and suitability for large datasets) and extracting only the records you currently need into a smaller string of JSON, XML, or even an array of classes in whichever language you use.

how to create a new Word document from a template with docx4j

I have the following scenario, and need some advice:
The user will provide a Word document as a template, plus some parameters at runtime, so I can query my database and get the data to fill the document.
So there are two basic things I need to do:
Replace every key in the document with its respective value from the current query row.
"Merge" (copy? duplicate?) the existing document unchanged into itself (append) once for each row returned by the query, replacing the keys in each new copy with the values of the next row.
What is the best approach to do this? I have managed to do the replace part for now by using unmarshallFromTemplate and providing it a HashMap.
But this way is a little tricky, because I need to add "${variable_name}" placeholders to the document, and sometimes Word splits "${" and "}" into different tags, causing issues.
I have read about custom XML binding but did not understand it completely. Do I need to generate custom XML, inject it into the document (all of this at runtime), and call applyBindings? If so, how would I bind the fields in the document to the XML? By name?
docx4j includes VariablePrepare, which can tidy up your input docx so that your keys are not split across separate runs.
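A rough sketch of that variable-replacement route (file names and the placeholder key are invented for the example, and exact method signatures may vary between docx4j versions):

```java
import java.io.File;
import java.util.HashMap;

import org.docx4j.model.datastorage.migration.VariablePrepare;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;

public class SimpleVariableReplace {
    public static void main(String[] args) throws Exception {
        WordprocessingMLPackage pkg = WordprocessingMLPackage.load(new File("template.docx"));

        // Join up runs so that "${" and "}" end up in the same text run.
        VariablePrepare.prepare(pkg);

        MainDocumentPart main = pkg.getMainDocumentPart();
        HashMap<String, String> mappings = new HashMap<>();
        mappings.put("variable_name", "value from the current query row");
        main.variableReplace(mappings);

        pkg.save(new File("output.docx"));
    }
}
```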
But, you would still be better off switching to content control data binding, particularly if you have repeated data (think for example of line items in an invoice). Disclosure: I champion this approach in docx4j.
To adopt the content control data binding approach:
dream up an XML format which makes sense for your data, and write some code to convert the results of your database query into that format.
modify your template so that the content controls are bound to elements in your XML document. Ordinarily you'd use an authoring add-in for Word to help with this. (The technology Microsoft uses for binding is XPath, so how you bind depends on your XML structure, but, yes, you'd typically bind to the element name or ID.)
now that you have your XML file and a suitable input docx, ContentControlsMergeXML contains the code you need to create an instance document at run time. There's also a version of this for a servlet environment at https://github.com/plutext/OpenDoPE-WAR
As an alternative to 1 & 2, there is also org.docx4j.model.datastorage.migration.FromVariableReplacement in current nightlies, which can convert your existing "${" document. Only to a standardised target XML format though.
If you have further questions, there is a forum devoted to this topic at http://www.docx4java.org/forums/data-binding-java-f16/

Java: ArrayList, String manipulation and Parsing

I am writing an application that stores references for books, journals, sites, and so on. I have already done most of it.
What I want is a suggestion about the best way to implement the specs below.
What text format should I use to store the library? Not the file type, but the format. I am using a simple text file at the moment, but I am planning to implement a format like the one below.
<book><Art of Human Hacking><New York><2011><1>
<journal><Achieving Maximum Speed In 802.11n><London><2009><260-265>
The first tags, <book> and <journal>, are type identifiers. I have used an ArrayList. Should I use a multi-dimensional ArrayList and store each item like below?
[[Art of Human Hacking,New York,2011,1][Achieving Maximum Speed In 802.11n,London,2009,260-265]]
I have used StringTokenizer, but I cannot handle values that contain spaces. How do I fix this?
I have already implemented all the features, including listing all, listing unique, searching, editing, removing, and adding, but everything only works for content without spaces.
You should use an existing serializer instead of writing your own, unless the project forbids it.
For compatibility and human readability, CSV would be your best bet. Use an existing CSV parser to get your escaping correct (not that hard to write yourself, but difficult enough to warrant using an existing parser to avoid bugs). Google turned up: http://sourceforge.net/projects/javacsv/
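For example, with the javacsv library linked above (treat the exact constructors and method names as a sketch from memory), each reference becomes one row and quoting is handled for you, so titles with spaces are safe:

```java
import java.nio.charset.StandardCharsets;

import com.csvreader.CsvReader;
import com.csvreader.CsvWriter;

public class LibraryCsv {
    public static void main(String[] args) throws Exception {
        // Write: one row per reference, first column is the type identifier.
        CsvWriter writer = new CsvWriter("library.csv", ',', StandardCharsets.UTF_8);
        writer.writeRecord(new String[] {"book", "Art of Human Hacking", "New York", "2011", "1"});
        writer.writeRecord(new String[] {"journal", "Achieving Maximum Speed In 802.11n", "London", "2009", "260-265"});
        writer.close();

        // Read back: the parser takes care of escaping and embedded spaces.
        CsvReader reader = new CsvReader("library.csv", ',', StandardCharsets.UTF_8);
        while (reader.readRecord()) {
            System.out.println(reader.get(0) + ": " + reader.get(1));
        }
        reader.close();
    }
}
```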
If human editing is not a priority, then JSON is also a good format (it is human readable, but not quite as simple as CSV, and won't display in Excel, for instance).
Then you have binary protocols, such as native Java serialization, or even Google protocol buffers. These may be faster, but obviously lose the ability to view the data store and would probably complicate your debugging.

Indexing semi-structured data

I would like to index a set of documents containing semi-structured data, typically key-value pairs such as "#author Joe Bloggs". These keywords should then be available as searchable attributes of the document which can be queried individually.
I have been looking at Lucene and I'm able to build an index over the documents I'm interested in but I'm not sure how best to proceed with the next step of keyword extraction.
Is there a common approach for doing this in Lucene or another indexing system? I'd like to be able to search over the documents using a typical word search, as I already can, so I would like something more than a custom regex extraction.
Any help would be greatly appreciated.
Niall
I wrote a source code search engine using Lucene as part of my bachelor thesis. One of the key features was that the source code was treated as structured information and therefore should be searchable as such, i.e. searchable by attributes as you describe above.
Here you can find more information about this project. If that is too extensive for you, I can sum up some things:
I created separate search fields for all the attributes that should be searchable. In my case those were, for example, 'method name', 'commentary', or 'class name'.
It can be advantageous to have the content of these fields overlap; however, this will blow up your index (though only linearly with the amount of redundant data in the searchable fields).
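As a hedged sketch of that idea with Lucene (field names and the in-memory directory are just for the example; adjust to your Lucene version):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class AttributeIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();   // recent Lucene; older versions use RAMDirectory
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // The extracted "#author Joe Bloggs" pair becomes its own searchable field ...
            doc.add(new TextField("author", "Joe Bloggs", Field.Store.YES));
            // ... while the full text stays available for an ordinary word search.
            doc.add(new TextField("content", "#author Joe Bloggs ... rest of the document ...", Field.Store.NO));
            writer.addDocument(doc);
        }
        // Queries can then target fields individually, e.g. author:"joe bloggs".
    }
}
```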
