Part of my project is to index the s-p-o triples in N-Triples files, and I need some help figuring out how exactly to do this in Java (or another language if possible).
Problem statement:
We have around 10 files with the extension ".ntriple", each file having at least 10k triples. The format of these files is multiple RDF triples:
<subject1_uri> <predicate1_uri> <object1_uri>
<subject2_uri> <predicate1_uri> <object2_uri>
<subject2_uri> <predicate1_uri> <object3_uri>
…..
…..
What I need to do is index each of these subjects, predicates and objects so that we can have fast search and retrieval for queries like "Give me all subjects and objects for predicate1_uri", and so on.
I gave it a try using this example, but I saw that it was doing a full-text search. This doesn't seem efficient, as the N-Triples files could be as large as 50 MB per file.
Then I thought of NOT doing a full-text search, and instead storing each s-p-o triple as an index Document, with each of s, p and o as a Document Field, plus another Field as an id (the offset of the triple in the corresponding N-Triples file).
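Something along these lines is what I had in mind (a rough sketch only; field names and the offset id are just placeholders):

// Rough sketch of the idea: one Lucene Document per triple, with s, p, o and the
// file offset as separate fields (names here are illustrative, not a recommendation).
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;

public class TripleDoc {
    static Document build(String subject, String predicate, String object, long offset) {
        Document doc = new Document();
        doc.add(new StringField("s", subject, Field.Store.YES));   // exact-match, not tokenized
        doc.add(new StringField("p", predicate, Field.Store.YES));
        doc.add(new StringField("o", object, Field.Store.YES));
        doc.add(new StoredField("offset", offset));                // where the triple sits in the file
        return doc;
    }
}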
I have two questions:
Is Lucene the only option for what I am trying to achieve?
Would the size of the index files themselves be larger than half the size of the data itself?
Any and all help really appreciated.
To answer your first question: No, Lucene is not the only option to do this. You can (and probably should) use any generic RDF database to store the triples. Then you can query the triples using their Java API or using SPARQL. I personally recommend Apache Jena as a Java API for working with RDF.
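For example, here is a minimal sketch with Jena's TDB2 store and SPARQL; the store location, file name and predicate URI are placeholders:

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.tdb2.TDB2Factory;

public class TripleStoreSketch {
    public static void main(String[] args) {
        // Persistent TDB2 store on disk; the location is just a placeholder
        Dataset dataset = TDB2Factory.connectDataset("triple-store");

        // Load one ".ntriple" file, explicitly hinting the N-Triples syntax
        dataset.begin(ReadWrite.WRITE);
        try {
            Model model = dataset.getDefaultModel();
            RDFDataMgr.read(model, "data/file1.ntriple", Lang.NTRIPLES);
            dataset.commit();
        } finally {
            dataset.end();
        }

        // "Give me all subjects and objects for predicate1_uri"
        String query = "SELECT ?s ?o WHERE { ?s <http://example.org/predicate1_uri> ?o }";
        dataset.begin(ReadWrite.READ);
        try (QueryExecution qe = QueryExecutionFactory.create(query, dataset)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("s") + "  " + row.get("o"));
            }
        } finally {
            dataset.end();
        }
    }
}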
If you need free-text search across literals in your dataset, there is Lucene Integration with Apache Jena through Jena Text.
Regarding index sizes, this depends entirely on the entropy of your data. If you have 40,000 lines in an N-Triples file but they are all repetitions of the same triples, then the index will be relatively small. Typically, however, RDF databases build multiple indexes over the data, and you will see a size increase.
The primary benefit of this indexing is that you can ask more generic questions than "Give me all subjects and objects for predicate1_uri". That question can be answered by linearly processing all of the N-Triples files without even knowing you're using RDF. The following SPARQL-like query shows an example of a more difficult search facilitated by these data stores:
SELECT DISTINCT ?owner
WHERE {
  ?owner :owns ?thing .
  ?thing rdf:type/rdfs:subClassOf* :Automobile .
  ?thing :hasColor "red"@en .
}
In the preceding query, we locate owners of something that is an automobile, or any more specific subclass of automobile, so long as the color of that thing is "red" (as specified in English).
Related
I have been asked to build a reconciliation tool which could compare two large datasets (we may assume the input sources are two Excel files).
Each row contains 40-50 columns, and records need to be compared at the column level. Each file contains close to 3 million records, or roughly 4-5 GB of data. (The data may not be sorted.)
I would appreciate any hints.
Could the following technologies be a good fit?
Apache Spark
Apache Spark + Ignite [assuming real time reconciliation in between time frames]
Apache Ignite + Apache Hadoop
Any suggestions for building an in-house tool?
I have also been working on the same problem.
You can load the CSV files into temporary tables using PySpark/Scala and query on top of the temp tables created.
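For illustration, here is a minimal sketch of that approach using Spark's Java API (the same idea works in PySpark or Scala); file names are placeholders and the inputs are assumed to have been exported to CSV with header rows:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RecSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("reconciliation-sketch")
                .master("local[*]")   // local run for testing; use a real cluster for 3M-row files
                .getOrCreate();

        // Load both exports and register them as temp views
        Dataset<Row> left = spark.read().option("header", "true").csv("left.csv");
        Dataset<Row> right = spark.read().option("header", "true").csv("right.csv");
        left.createOrReplaceTempView("left_side");
        right.createOrReplaceTempView("right_side");

        // Rows present in one source but not the other (column lists must match)
        Dataset<Row> leftOnly =
                spark.sql("SELECT * FROM left_side EXCEPT SELECT * FROM right_side");
        Dataset<Row> rightOnly =
                spark.sql("SELECT * FROM right_side EXCEPT SELECT * FROM left_side");

        leftOnly.write().option("header", "true").csv("diff/left_only");
        rightOnly.write().option("header", "true").csv("diff/right_only");

        spark.stop();
    }
}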
First a Warning:
Writing a reconciliation tool involves lots of small annoyances and edge cases: date formats, number formats (commas in numbers, scientific notation, etc.), compound keys, thresholds, ignoring columns, ignoring headers/footers, and so on.
If you only have one file to reconcile, with well-defined inputs, then consider doing it yourself.
However, if you are likely to try to extend it to be more generic, then pay for an existing solution if you can, because it will be cheaper in the long run.
Potential Solution:
The difficulty with a distributed process is how you match the keys in unsorted files.
The issue with running it all in a single process is memory.
The approach I took for a commercial rec tool was to save the CSV to tables in H2 and use SQL to do the diff.
H2 is much faster than Oracle for something like this.
If your data is well structured, you can take advantage of H2's ability to load directly from CSV. If you save the result in a table, you can also write the output back to CSV, or you can use other frameworks to produce a more structured output or stream the result to a web page.
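As a rough sketch of that approach (database path, table and file names are placeholders; H2's CSVREAD expects a header row by default):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class H2RecSketch {
    public static void main(String[] args) throws Exception {
        // File-based H2 database; path and CSV names are placeholders
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./recdb");
             Statement st = conn.createStatement()) {

            // Load both sides straight from CSV into tables
            st.execute("CREATE TABLE left_side  AS SELECT * FROM CSVREAD('left.csv')");
            st.execute("CREATE TABLE right_side AS SELECT * FROM CSVREAD('right.csv')");

            // Rows present on one side but not the other
            st.execute("CREATE TABLE left_only  AS (SELECT * FROM left_side EXCEPT SELECT * FROM right_side)");
            st.execute("CREATE TABLE right_only AS (SELECT * FROM right_side EXCEPT SELECT * FROM left_side)");

            // Write the differences back out as CSV
            st.execute("CALL CSVWRITE('left_only.csv',  'SELECT * FROM left_only')");
            st.execute("CALL CSVWRITE('right_only.csv', 'SELECT * FROM right_only')");
        }
    }
}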
If your format is xls(x) and not CSV, you should do a performance test of the various libraries for reading the file, as there are huge differences when dealing with files of that size.
I have been working on the above problem and here is the solution.
https://github.com/tharun026/SparkDataReconciler
The prerequisites as of now are:
Both datasets should have the same number of columns.
For now, the solution accepts only Parquet files.
The tool gives you a match percentage for each column, so you can understand which transformation went wrong.
I need to find a specific string (an id or a name, for example) in one Excel sheet.
This is the basic need.
Later on, we need to find a user across several Excel sheets, copy the whole record identified by that code, and send it to a JTable in the frame.
Are you looking for a high-level search function or something? I don't think that exists.
As you load the sheets, you might consider just adding the interesting columns to a HashMap if you can use exact matches; otherwise, just iterate over the sheets/columns/rows and search manually.
You could create some mid-level tooling to do this, a "Sheet Indexer" perhaps, that takes a sheet and a list of columns and then lets you do lookups. Even if you have to write code to iterate over everything manually, you shouldn't worry too much about speed; the number of sheets/rows is very unlikely to get large enough to affect performance.
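For example, a rough sketch of such a "Sheet Indexer" with Apache POI, assuming exact matches on an id column (the file name, column position and lookup key are placeholders):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class SheetIndexer {
    public static void main(String[] args) throws Exception {
        DataFormatter fmt = new DataFormatter();
        // Map from the value in the id column to the row it appears on
        Map<String, Row> index = new HashMap<>();

        try (Workbook wb = WorkbookFactory.create(new File("users.xlsx"))) { // placeholder file
            Sheet sheet = wb.getSheetAt(0);
            int idColumn = 0; // assume the id lives in the first column
            for (Row row : sheet) {
                Cell cell = row.getCell(idColumn);
                if (cell != null) {
                    index.put(fmt.formatCellValue(cell), row);
                }
            }

            // Exact-match lookup; fall back to iterating if you need fuzzy matching
            Row hit = index.get("user-42");
            if (hit != null) {
                System.out.println("Found on row " + hit.getRowNum());
            }
        }
    }
}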
We actually have a lot of tooling built around POI, including an ORM layer that lets us load from spreadsheets using annotations, just like Hibernate. We called it "son of POI", aka "poison".
First of all, as I explained in my profile, I'm not a native English speaker, so I hope you can forgive me if I make some grammar mistakes.
I'm trying to develop with the Apache Lucene API in Java. I'm able to write some index and search methods, but I'm still confused about how it works behind the scenes.
From what I know, Lucene doesn't care about the origin of the data. It just takes the data and indexes it. Let me ask with a simple example:
I want to index words from a .txt based dictionary. Once Lucene has made its indexes, do I need the source .txt dictionary anymore? How, exactly, do the indexes work?
Do the indexes contain the necessary content to perform searches without the original source? Or do the indexes only contain directions to where words are stored in the original source .txt dictionary file? I'm a little confused.
Once you have indexed everything, Lucene does not refer back to, or have further need of, any of the source documents. Everything it needs to operate is saved in its index directory. Many people use Lucene to index files, others database records, others online resources. Whatever your source, you always have to bring in the data yourself (or with some third-party tool) and construct Documents for Lucene to index, and nothing about a Document says anything about where it came from. So not only does Lucene not need to refer back to the original data sources, it couldn't find them if you wanted it to.
Many people's implementations do rely on having the original source present. It's not at all unusual to set up Lucene to index everything but only store a file name, database id, or some similar pointer to the original source. This allows them to perform an effective full-text search through Lucene, while leaving storage of the full content to some other system.
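A minimal sketch of that second pattern (Lucene 8/9-style API; the path and field names are illustrative): index the text without storing it, and store only a pointer back to the source file.

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class PointerOnlyIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("index-dir"));
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index the content (searchable, not stored) and store only the file path
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("content", "full text extracted from the file", Field.Store.NO));
            doc.add(new StringField("path", "/data/docs/example.txt", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the indexed text, then use the stored path to open the original file
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new QueryParser("content", analyzer).parse("extracted"), 10);
            for (ScoreDoc sd : hits.scoreDocs) {
                System.out.println(searcher.doc(sd.doc).get("path"));
            }
        }
    }
}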
I would like to index a set of documents that will contain semi-structured data, typically key-value pairs, something like #author Joe Bloggs. These keywords should then be available as searchable attributes of the document, which can be queried individually.
I have been looking at Lucene and I'm able to build an index over the documents I'm interested in but I'm not sure how best to proceed with the next step of keyword extraction.
Is there a common approach for doing this in Lucene or another indexing system? I'd like to be able to search over the documents using a typical word search, as I'm already able to, and so would like something more than a custom regex extraction.
Any help would be greatly appreciated.
Niall
I wrote a source code search engine using Lucene as part of my bachelor thesis. One of the key features was that the source code was treated as structured information, and therefore should be searchable as such, i.e. searchable by attributes as you describe above.
Here you can find more information about this project. If that is too extensive for you, I can sum up a few things:
I created separate search fields for all the attributes which should be searchable. In my case those were, for example, 'method name', 'commentary' or 'class name'.
It can be advantageous to have the content of these fields overlap; however, this will blow up your index size (but only linearly with the amount of redundant data in the searchable fields).
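Applied to the #author example from the question, a small sketch of what such a Document might look like (field names are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class AttributeDoc {
    // One Document per source file: the full text stays word-searchable, while each
    // extracted key-value pair (e.g. "#author Joe Bloggs") becomes its own field.
    static Document build(String fullText, String author) {
        Document doc = new Document();
        doc.add(new TextField("content", fullText, Field.Store.NO));   // normal word search
        doc.add(new StringField("author", author, Field.Store.YES));   // searchable attribute
        return doc;
    }
}

Queries can then mix free text and attributes, for example: content:lucene AND author:"Joe Bloggs".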
If someone could point me in the right direction that would be most helpful.
I have written a custom CMS where I want to be able to allow each individual user to upload documents (.doc .docx .pdf .rtf .txt etc) and then be able to search the contents of those files for keywords.
The CMS is written entirely in PHP and MySQL within a Linux environment.
Once uploaded, the documents would be stored in the user's private folder on the server "as is". There will be hundreds, if not thousands, of documents stored by each user.
It is very important that the specific users files are searchable only by that user.
Could anyone point me in the right direction? I have had a look at Solr but these types of solutions seem so complicated. I have spent an entire week looking at different solutions and this is my last attempt at finding a solution.
Thank you in advance.
2 choices I see.
A search index per user. Their documents are indexed separately from everyone else's. When they do a search, they hit their own search index. There is no danger of seeing others' results, or of scores being based on the contents of others' documents. The downside is having to store and update each index separately. I would look into using Lucene for something like this, as the indices will be small.
A single search index. The users all share one search index. The results from searches would have to be filtered so that only that user's results are returned. The upside is implementing a single search index (Solr would be great for this). The downside is the risk of crosstalk between users' searches: scoring would be influenced by other users' documents, resulting in poorer search results.
I hate to say it, but from a quality standpoint, I'd lean towards number 1. Number 2 seems more efficient and easier, but user results are more important to me.
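To make option 1 concrete, here is a minimal sketch in Java/Lucene terms; the directory layout and field names are just placeholders:

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class PerUserIndex {
    // One index directory per user; a user's searches only ever open that user's directory.
    static void indexDocument(String userId, String filename, String extractedText) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("indexes", userId)); // e.g. indexes/1234/
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("filename", filename, Field.Store.YES));
            doc.add(new TextField("content", extractedText, Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}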
Keep the files outside of the public directory tree, and keep a reference to each file's path and the creator's user id in a database table; then users can search for the files using database queries. You will of course have to let users create accounts and log in. You can then let them download the files using PHP.
As long as each user's files are all located in an isolated directory, or there is some way to identify one user's documents, like adding the user id to the filename, you could use grep.
The disadvantages:
Each search would have to go through all the documents, so if you have a lot of documents or very large documents it would be slow.
Binary document formats, like Word or PDF, might not produce accurate results.
This is not an enterprise solution.
Revised answer: Try mnoGoSearch