If someone could point me in the right direction, that would be most helpful.
I have written a custom CMS where I want to allow each individual user to upload documents (.doc, .docx, .pdf, .rtf, .txt, etc.) and then be able to search the contents of those files for keywords.
The CMS is written entirely in PHP and MySQL within a Linux environment.
Once uploaded, the documents would be stored in the user's private folder on the server "as is". There will be hundreds, if not thousands, of documents stored by each user.
It is very important that a specific user's files are searchable only by that user.
Could anyone point me in the right direction? I have had a look at Solr, but these types of solutions seem so complicated. I have spent an entire week looking at different options, and this is my last attempt at finding a solution.
Thank you in advance.
I see two choices.
1. A search index per user. Their documents are indexed separately from everyone else's, and when they do a search, they hit their own index. There is no danger of seeing other users' results, or of scores being influenced by the contents of other users' documents. The downside is having to store and update each index separately. I would look into using Lucene for something like this, as the individual indices will be small.
2. A single shared search index. All users share one index, and search results have to be filtered so that only that user's documents are returned. The upside is implementing a single search index (Solr would be great for this). The downside is the risk of crosstalk between users' searches: scoring would be affected by other users' documents, resulting in poorer search results.
I hate to say it, but from a quality standpoint I'd lean towards number 1. Number 2 seems more efficient and easier to implement, but the quality of each user's results matters more to me.
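A minimal sketch of option 1 in Java, assuming a recent Lucene release (5+) and one index directory per user; the directory layout, field names, and the idea of extracting text first (e.g. with Apache Tika for .doc/.pdf) are illustrative, not a finished design:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class PerUserIndexer {
    public static void indexDocument(String userId, String filePath, String extractedText)
            throws Exception {
        // Each user gets their own index directory, so a search can never cross users.
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("indexes", userId)),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("path", filePath, Field.Store.YES)); // pointer to the uploaded file
            doc.add(new TextField("contents", extractedText, Field.Store.NO)); // searchable text
            writer.addDocument(doc);
        }
        // For option 2 (one shared index) you would instead add something like
        // new StringField("user_id", userId, Field.Store.NO) to every document
        // and AND a TermQuery on that field into every search.
    }
}

(In a real application you would keep one IndexWriter open per index rather than reopening it for every document; this is just the shape of the idea.)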
Keep the files outside of the public directory tree, store a reference to each file's path and its creator's user ID in a database table, and then users can search for the files using database queries. You will of course have to let users create accounts and log in. You can then let them download the files using PHP.
As long as each user's files are located in an isolated directory, or there is some way to identify one user's documents (such as adding the user ID to the filename), you could use grep.
The disadvantages:
Each search would have to go through all the documents, so if you have a lot of documents or very large documents it would be slow.
Binary document formats, like Word or PDF, might not produce accurate results.
This is not an enterprise solution.
Revised answer: Try mnoGoSearch
Part of my project is to index the s-p-o triples in N-Triples files, and I need some help figuring out how exactly to do this in Java (or another language if possible).
Problem statement:
We have around 10 files with the extension ".ntriple", each having at least 10k triples. The format of these files is multiple RDF triples:
<subject1_uri> <predicate1_uri> <object1_uri>
<subject2_uri> <predicate1_uri> <object2_uri>
<subject2_uri> <predicate1_uri> <object3_uri>
…..
…..
What I need to do is index each of these subjects, predicates, and objects so that we can have fast search and retrieval for queries like "Give me all subjects and objects for predicate1_uri", and so on.
I gave it a try using this example, but I saw that it was doing a full-text search. This doesn't seem efficient, as the N-Triples files could be as large as 50 MB each.
Then I thought of not doing a full-text search, and instead just storing each s-p-o as an index Document, with each of s, p, and o as a Document Field and another Field as an ID (the offset of the s-p-o in the corresponding .ntriple file).
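Roughly, this is the kind of thing I have in mind (only a sketch; the field names, analyzer, and writer setup are placeholders, assuming a recent Lucene):

import java.nio.file.Paths;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class TripleIndexer {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("triple-index")),
                new IndexWriterConfig(new KeywordAnalyzer()))) {
            // One Lucene Document per triple; StringField means exact match, no tokenization.
            Document doc = new Document();
            doc.add(new StringField("subject", "<subject1_uri>", Field.Store.YES));
            doc.add(new StringField("predicate", "<predicate1_uri>", Field.Store.YES));
            doc.add(new StringField("object", "<object1_uri>", Field.Store.YES));
            doc.add(new StoredField("offset", 0L)); // offset of the triple in its .ntriple file
            writer.addDocument(doc);
        }
        // "Give me all subjects and objects for predicate1_uri" would then be a
        // TermQuery(new Term("predicate", "<predicate1_uri>")).
    }
}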
I have two questions:
Is Lucene the only option for what I am trying to achieve?
Would the size of the index files themselves be larger than half the size of the data itself?
Any and all help is really appreciated.
To answer your first question: No, Lucene is not the only option to do this. You can (and probably should) use any generic RDF database to store the triples. Then you can query the triples using their Java API or using SPARQL. I personally recommend Apache Jena as a Java API for working with RDF.
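For example, a minimal Jena sketch for the "all subjects and objects for predicate1_uri" case might look like this (the file name and predicate URI are placeholders; recent org.apache.jena packages assumed):

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

public class TripleQuery {
    public static void main(String[] args) {
        // Load one of the .ntriple files into an in-memory model.
        Model model = ModelFactory.createDefaultModel();
        RDFDataMgr.read(model, "file1.ntriple", Lang.NTRIPLES);

        // List every (subject, object) pair for the given predicate.
        Property predicate = model.createProperty("http://example.org/predicate1_uri");
        StmtIterator it = model.listStatements(null, predicate, (RDFNode) null);
        while (it.hasNext()) {
            Statement stmt = it.next();
            System.out.println(stmt.getSubject() + " -> " + stmt.getObject());
        }
    }
}

For data that does not comfortably fit in memory, Jena's TDB storage backend can be used in place of the in-memory model.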
If you need free-text search across literals in your dataset, there is Lucene Integration with Apache Jena through Jena Text.
Regarding index sizes, this depends entirely on the entropy of your data. If you have 40,000 lines in an N-Triples file, but it's all replications of the same triples, then the index will be relatively small. Typically, however, RDF databases build multiple indexes over the data, and you will see a size increase.
The primary benefit of this indexing is that you can ask more generic questions than "Give me all subjects and objects for predicate1_uri". That question can be answered by linearly processing all of the N-Triples files without even knowing you're using RDF. The following SPARQL query (prefixes omitted) shows an example of a more difficult search facilitated by these data stores:
SELECT DISTINCT ?owner
WHERE {
  ?owner :owns ?thing .
  ?thing rdf:type/rdfs:subClassOf* :Automobile .
  ?thing :hasColor "red"@en .
}
In the preceding query, we locate owners of something that is an automobile, or any more specific subclass of automobile, so long as the color of that thing is "red" (as an English-language literal).
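With Jena's ARQ engine, running that kind of query from Java might look roughly like this (the prefix URIs are placeholders, and a recent Jena version is assumed):

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;

public class OwnerQuery {
    public static void printOwners(Model model) {
        String query =
              "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> "
            + "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
            + "PREFIX : <http://example.org/ontology#> "
            + "SELECT DISTINCT ?owner WHERE { "
            + "  ?owner :owns ?thing . "
            + "  ?thing rdf:type/rdfs:subClassOf* :Automobile . "
            + "  ?thing :hasColor \"red\"@en . "
            + "}";
        try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.getResource("owner"));
            }
        }
    }
}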
First of all, as I explained in my profile, I'm not a native English speaker, so I hope you can forgive me if I make some grammar mistakes.
I'm trying to develop with the Apache Lucene API in Java. I'm able to write some index and search methods, but I'm still confused about how it works behind the scenes.
From what I know, Lucene doesn't care about the origin of the data. It just takes the data and indexes it. Let me ask with a simple example:
I want to index words from a .txt based dictionary. Once Lucene has made its indexes, do I need the source .txt dictionary anymore? How, exactly, do the indexes work?
Do the indexes contain the necessary content to perform searches without the original source? Or do the indexes only contain directions to where words are stored in the original source .txt dictionary file? I'm a little confused.
Once you have indexed everything, Lucene does not refer back to, or have any further need of, the source documents. Everything it needs to operate is saved in its index directory. Many people use Lucene to index files, others index database records, and others online resources. Whatever your source, you always have to bring in the data yourself (or with some third-party tool) and construct Documents for Lucene to index, and nothing about a Document says anything about where it came from. So not only does Lucene not need to refer back to the original data sources, it couldn't find them if you wanted it to.
That said, many implementations do rely on having the original source present. It's not at all unusual for people to set up Lucene to index everything but store only a file name, a database ID, or some similar pointer back to the original source. This lets them perform an effective full-text search through Lucene while handling storage of the full content in some other system.
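For instance, a minimal sketch of the "index the text, store only a pointer" pattern (the file path and field names are illustrative; a recent Lucene is assumed):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class PointerOnlyIndexer {
    public static void main(String[] args) throws Exception {
        Path source = Paths.get("docs/report.txt"); // hypothetical source file
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("index-dir")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // The full text is indexed (searchable) but not stored in the index itself.
            doc.add(new TextField("contents",
                    new String(Files.readAllBytes(source)), Field.Store.NO));
            // Only a pointer back to the original file is stored with each hit.
            doc.add(new StringField("path", source.toString(), Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}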
I want to implement a website word search (the user searches for a word and gets the sentences from a specific website that contain it) in Java myself (that is, without using Lucandra, Solr, Nutch, etc.). So far I can get the content (the text, not the source code) of a website with the help of jsoup, but I don't know how to index the data in the database. I searched Google to understand the algorithms that well-known indexers such as Solr use, and I have understood some things, such as using an inverted list or a hashtable, but only in general terms. I want to know exactly what I should do.
I want to use Cassandra as my database, so I read about Cassandra secondary indexing, but I need to know more about it. Is this really what I should focus on?
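From what I have read so far, the core idea of an inverted index seems to be roughly this (just my own plain-Java sketch, ignoring Cassandra for the moment; class and field names are mine):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SimpleInvertedIndex {
    // word -> ids of the sentences that contain it
    private final Map<String, List<Integer>> index = new HashMap<>();
    private final List<String> sentences = new ArrayList<>();

    public void addSentence(String sentence) {
        int id = sentences.size();
        sentences.add(sentence);
        for (String word : sentence.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            List<Integer> postings = index.computeIfAbsent(word, k -> new ArrayList<>());
            // avoid duplicate ids when a word occurs twice in the same sentence
            if (postings.isEmpty() || postings.get(postings.size() - 1) != id) {
                postings.add(id);
            }
        }
    }

    public List<String> search(String word) {
        List<String> hits = new ArrayList<>();
        for (int id : index.getOrDefault(word.toLowerCase(), Collections.emptyList())) {
            hits.add(sentences.get(id));
        }
        return hits;
    }
}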
I'm in the early stages of a note-taking application for android and I'm hoping that somebody can point me to a nice solution for storing the note data.
Ideally, I'm looking to have a solution where:
Each note document is a separate file (for Dropbox syncing)
A note can be composed of multiple pages
Note pages can have binary data (such as images)
A single page can be loaded without having to parse the entire document into memory
Thread-safety: Multiple reads/writes can occur at the same time.
XML is out (at least for the entire file), since I don't have a good way to extract a single page at a time. I considered using zip files, but (especially when compressed) I think they'd be stuck loading the entire file as well.
It seems like there should be a Java library out there that does this, but my Google-fu is failing me. The only other alternative I can think of is to make a separate SQLite database for every note.
Does anybody know of a good solution to this problem? Thanks!
Seems like a relational database would work here. You just need to play around with the schema a little.
Maybe make a Pages table with each page including, say, a field for the document it belongs to and a field for its order in the document. Pages could also have a field for binary data, which might be contained in another table. If the document itself has additional data, maybe you have a table for documents too.
I haven't used SQLite transactions on an Android device, but it seems like that would be a good way to address thread safety.
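A rough sketch of what that schema might look like with Android's SQLiteOpenHelper (table and column names are just placeholders, and the upgrade path is deliberately naive):

import android.content.Context;
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteOpenHelper;

public class NotesDbHelper extends SQLiteOpenHelper {
    public NotesDbHelper(Context context) {
        super(context, "notes.db", null, 1);
    }

    @Override
    public void onCreate(SQLiteDatabase db) {
        // One row per note document.
        db.execSQL("CREATE TABLE notes ("
                + "_id INTEGER PRIMARY KEY AUTOINCREMENT, "
                + "title TEXT)");
        // One row per page; pages are ordered within their note and may hold binary data.
        db.execSQL("CREATE TABLE pages ("
                + "_id INTEGER PRIMARY KEY AUTOINCREMENT, "
                + "note_id INTEGER NOT NULL REFERENCES notes(_id), "
                + "page_order INTEGER NOT NULL, "
                + "text_content TEXT, "
                + "image_blob BLOB)");
    }

    @Override
    public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
        // A real app would migrate existing data; dropping is just for the sketch.
        db.execSQL("DROP TABLE IF EXISTS pages");
        db.execSQL("DROP TABLE IF EXISTS notes");
        onCreate(db);
    }

    // Loads the text of a single page without reading the rest of the note.
    public String loadPageText(SQLiteDatabase db, long noteId, int pageOrder) {
        Cursor c = db.query("pages",
                new String[] {"text_content"},
                "note_id = ? AND page_order = ?",
                new String[] {String.valueOf(noteId), String.valueOf(pageOrder)},
                null, null, null);
        try {
            return c.moveToFirst() ? c.getString(0) : null;
        } finally {
            c.close();
        }
    }
}

Exporting a note for Dropbox would then be a matter of querying all pages for a note in page_order and serializing them into whatever single-file format you settle on.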
I would recommend using SQLite to store the documents. Ultimately, it'll be easier than dealing with file I/O every time you access a note. Then, when somebody wants to upload to Dropbox, you generate the file on the fly and upload it. It would make sense to have a notes table and a pages table, at least; that way you can load each page individually, and a note is just a collection of pages anyway. Additionally, you can store images as BLOBs in the database for a particular page.
Basically, if you only want one type of content per page, then in the pages table you would have something like an id column and a content column. Alternatively, if you wanted to support something more complex, such as multiple types of content per page, then you would need to make your pages a collection of something else, like "entities".
IMO, a relational database is going to be the easiest way to accomplish your requirement of reading from particular pages without having to load the entire file.
My problem is that I am going to develop a site where everyone uploads documents (.doc files, .txt files, etc.). I need a component that actually parses the files for certain keywords and maintains an index of them. That index should also be updated based on structured data, such as whether a document is actively viewable, and so forth. When another user tries to look up the list of documents based on some keyword and some structured data, as mentioned earlier, they should find the list quickly. It should also support multiple languages. We have an algorithm in place, but we need an open-source API for reading the files and indexing them by keyword along with the unstructured data. Can anyone help with this?
Lucene with Solr is the best open source solution out there.
This is not a trivial task, so why reinvent the wheel when other people have already done the work? Try Apache Lucene.