My problem is this: I am going to develop a site where everyone uploads .doc files, .txt files, etc. I need a component that parses each file for certain keywords and maintains an index of them. The index should also be updated based on structured data, such as whether a document is actively viewable and so forth. When another user searches that list of documents by keyword and by the structured data mentioned above, they should get results quickly. It should also support multiple languages. We have an algorithm in place, but we need an open-source API for reading the files and indexing their unstructured content by keyword. Can anyone help with this?
Lucene with Solr is the best open source solution out there.
This is not a trivial task, so why reinvent when other people have already done that: try Apache Lucene.
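To give a feel for what that looks like in practice, here is a minimal sketch of keyword indexing and search with plain Lucene. The field names, paths, and sample values are placeholders, and recent Lucene releases use roughly this API; check the version you pick.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(Paths.get("index-dir"));

        // Index one uploaded document: its extracted text plus a structured field.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("content", "example text extracted from the uploaded file", Field.Store.NO));
            doc.add(new StringField("status", "active", Field.Store.YES));    // structured metadata
            doc.add(new StringField("filename", "report.txt", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the content field by keyword and print the matching file names.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query q = new QueryParser("content", new StandardAnalyzer()).parse("example");
            for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("filename"));
            }
        }
    }
}
```

Structured filters (status, language, owner, and so on) become additional fields combined into the query, and Solr wraps this same machinery in a server with an HTTP API.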
I am stuck on a project at work that I do not think is really possible and I am wondering if someone can confirm my belief that it isn't possible or at least give me new options to look at.
We are doing a project for a client that involved a mass download of files from a server (easily done with ftp4j and a document-name list), but now we need to sort through the data from the server. The client works in contracts and wants us to pull out relevant information such as: licensor, licensee, product, agreement date, termination date, royalties, and restrictions.
Since the documents are completely unstandardized, is that even possible? I can imagine loading the files and searching them, but I have no idea how to pull information such as the licensor and the restrictions on the agreement out of a paragraph. These are not hashes; they are just long contracts. Even if I search for 'Licensor', it comes up multiple times in each document. The documents aren't even in a consistent file format: some are PDF, some are text, some are HTML, and I've even seen some that were as bad as a scanned image inside a PDF.
My boss keeps pushing me to work on this project, but I feel as if I am out of options. I primarily do web and mobile work, so big data is really not my strong area. Does this sound possible to do in a reasonable amount of time? (We're talking about 1000 documents at the very minimum.) I have been working on this in Java.
I'll do my best to give you some information, as this is not my area of expertise. I would strongly consider writing a script that identifies the type of file you are dealing with and then calls the appropriate parsing method to handle what you are looking for.
Since you are dealing with big data, Python could be pretty useful; JavaScript would be my next choice.
If your overall code is written in Java, it should be very portable and flexible no matter which one you choose. Using a regex or a specific string search would be a good way to approach this;
If you are concerned only with 'Licensor' followed by a name, you could identify the format of that particular instance and search for something similar using the regex you create. The same approach can be extrapolated to the other fields you need.
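A rough sketch of that regex idea, run against plain text you have already extracted from a contract. The pattern itself is only a guess at how the clauses might be phrased and would need tuning against real documents:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LicensorScan {
    // Hypothetical pattern: assumes clauses like "Licensor: Acme Corp." or
    // "Licensor, Acme Corp," appear somewhere in the extracted plain text.
    private static final Pattern LICENSOR =
            Pattern.compile("Licensor[:,]?\\s+([A-Z][\\w&.,' -]{2,60})");

    public static void scan(String plainText) {
        Matcher m = LICENSOR.matcher(plainText);
        while (m.find()) {
            // Every candidate is printed; a human (or further rules) still has
            // to decide which match is the real party name.
            System.out.println("Possible licensor: " + m.group(1).trim());
        }
    }
}
```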
For getting text from an image, try the APIs discussed on these pages:
How to read images using Java API?
Scanned Image to Readable Text
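Those links lead toward OCR. If you go that route, one option (my own suggestion, not something from the pages above) is the Tess4J wrapper around Tesseract. A rough sketch, assuming Tesseract and its language data are installed locally:

```java
import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class OcrExample {
    public static String readScan(File imageFile) throws TesseractException {
        ITesseract ocr = new Tesseract();
        // Path to the installed "tessdata" language files -- adjust for your machine.
        ocr.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata");
        return ocr.doOCR(imageFile);
    }
}
```

Expect noisy output from scanned contracts; OCR errors will make the regex matching above less reliable.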
For text from a PDF:
https://www.idrsolutions.com/how-to-search-a-pdf-file-for-text/
Also, most PDFs carry a text layer, so once you have extracted it you should most likely be able to search through it using a regex. That would be my method of attack, or possibly using String.split() and a StringBuffer that you can append to.
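As an alternative to the library behind that link, Apache PDFBox can pull the text layer out of a PDF. This sketch uses the 2.x API; it will not help with scanned-image PDFs, which have no text layer and need OCR as above:

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfText {
    public static String extract(File pdf) throws IOException {
        // PDFBox 2.x style loading; 3.x moved this to Loader.loadPDF(...)
        try (PDDocument doc = PDDocument.load(pdf)) {
            return new PDFTextStripper().getText(doc);
        }
    }
}
```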
For text from HTML doc:
Here is a cool HTML parser library: http://jericho.htmlparser.net/docs/index.html
A resource that teaches how to remove HTML tags and get the good stuff: http://www.rgagnon.com/javadetails/java-0424.html
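With Jericho specifically, stripping the tags comes down to a couple of calls. A minimal sketch, assuming the page is reachable and the Jericho jar is on the classpath:

```java
import java.net.URL;
import net.htmlparser.jericho.Source;

public class HtmlText {
    public static String extract(URL page) throws Exception {
        Source source = new Source(page);             // also accepts a raw HTML String
        return source.getTextExtractor().toString();  // tags stripped, visible text kept
    }
}
```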
If you need anything else, let me know. I'll do my best to find it!
Apache tika can extract plain text from almost any commonly used file format.
But in the situation you describe, you would still need to analyze the text, as in "natural language recognition". That's a field where, despite some advances made by dedicated research teams spending many person-years, computers still fail pretty badly (heck, even humans fail at it sometimes).
With the number of documents you mentioned (1000s), hire a temp worker and have them sorted/tagged by human brain power. It will be cheaper and you will have fewer misclassifications.
You can use Tika for text extraction. If there is a fixed pattern, you can extract information using regex or XPath queries. Another solution is to use Solr, as shown in this video. You don't need Solr, but watch the video to get the idea.
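For reference, a minimal sketch of the Tika facade: it detects the format itself (PDF, Word, HTML, plain text, ...) and returns plain text you can then run your regexes over. The file path is a placeholder:

```java
import java.io.File;
import org.apache.tika.Tika;

public class TikaExtract {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Format detection is automatic; the result is plain text.
        String text = tika.parseToString(new File(args[0]));
        System.out.println(text);
    }
}
```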
First of all, as I explained in my profile, I'm not a native English speaker, so I hope you can forgive me if I make some grammar mistakes.
I'm trying to develop with the Apache Lucene API in Java. I'm able to write some index and search methods, but I'm still confused about how it works behind the scenes.
From what I know, Lucene doesn't care about the origin of the data. It just takes the data and indexes it. Let me ask with a simple example:
I want to index words from a .txt based dictionary. Once Lucene has made its indexes, do I need the source .txt dictionary anymore? How, exactly, do the indexes work?
Do the indexes contain the necessary content to perform searches without the original source? Or do the indexes only contain directions to where words are stored in the original source .txt dictionary file? I'm a little confused.
Once you have indexed everything, Lucene does not refer back to, or have further need of, any of the source documents. Everything it needs to operate is saved in its index directory. Many people use Lucene to index files, others database records, others online resources. Whatever your source, you always have to bring in the data yourself (or with some third-party tool) and construct Documents for Lucene to index, and nothing about a Document says anything about where it came from. So not only does Lucene not need to refer back to the original data sources, it couldn't find them even if you wanted it to.
Many people's implementations do rely on having the original source present. It's not at all unusual to set up Lucene to index everything but store only a file name, a database id, or some similar pointer to the original source. This allows an effective full-text search through Lucene while leaving storage of the full content to some other system.
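To make that concrete, here is roughly what such a Document looks like (field names are arbitrary): the text is indexed but not stored, and only a path back to the original is stored and returned with search hits.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class PointerDoc {
    // Index the full text for searching, but keep only a pointer back to the source.
    public static Document build(String path, String fullText) {
        Document doc = new Document();
        doc.add(new StringField("path", path, Field.Store.YES));      // stored: returned with hits
        doc.add(new TextField("contents", fullText, Field.Store.NO)); // indexed only, not stored
        return doc;
    }
}
```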
I want to implement a website word search myself, in Java (that is, without using Lucandra, Solr, Nutch, etc.): the user searches for a word and gets back the sentences from a specific website that contain it. So far I can get the content of the website (the text, not the source code) with the help of jsoup, but I don't know how to index the data in the database. I searched Google to understand the algorithms that well-known indexers such as Solr use, and I understood some of the ideas, such as using an inverted list or a hashtable, but they are very general. I want to know exactly what I should do.
I want to use Cassandra as my database, so I read about Cassandra secondary indexing, but I need to know more about it. Is that really what I should focus on?
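For what it's worth, the in-memory core of an inverted index is small; the hard parts are persisting it (for example, mapping it onto Cassandra rows) and scoring the results. A toy sketch of the data structure only, mapping each term to the sentences that contain it:

```java
import java.util.*;

public class InvertedIndex {
    // term -> ids of the sentences that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> sentences = new ArrayList<>();

    public void add(String sentence) {
        int id = sentences.size();
        sentences.add(sentence);
        for (String term : sentence.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
    }

    public List<String> search(String word) {
        List<String> hits = new ArrayList<>();
        for (int id : postings.getOrDefault(word.toLowerCase(), Collections.emptySet())) {
            hits.add(sentences.get(id));
        }
        return hits;
    }
}
```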
I am new to IR techniques.
I am looking for a Java-based API or tool that does the following:
Download the given set of URLs
Extract the tokens
Remove the stop words
Perform Stemming
Create Inverted Index
Calculate the TF-IDF
Kindly let me know how Lucene can be helpful to me.
Regards
Yuvi
You could try the Word Vector Tool - it's been a while since the latest release, but it works fine here. It should be able to perform all of the steps you mention. I've never used the crawler part myself, however.
Actually, TF-IDF is a score given to a term in a document, rather than the whole document.
If you just want the TF-IDFs per term in document, maybe use this method, without ever touching Lucene.
If you want to create a search engine, you need to do a bit more (such as extracting text from the given URLs, whose corresponding documents would probably not contain raw text). If this is the case, consider using Solr.
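To expand on the Lucene part of the question: Lucene's analyzers cover the tokenizing, stop-word removal, and stemming steps in one pass (downloading the URLs and reporting TF-IDF you would still handle yourself). A small sketch using EnglishAnalyzer; the field name and sample sentence are placeholders:

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzeText {
    public static void main(String[] args) throws Exception {
        // EnglishAnalyzer tokenizes, drops English stop words, and applies stemming.
        try (EnglishAnalyzer analyzer = new EnglishAnalyzer();
             TokenStream stream = analyzer.tokenStream("body", "The crawlers were downloading the pages")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());   // roughly: "crawler", "download", "page"
            }
            stream.end();
        }
    }
}
```

The inverted index is then built for you when you feed the analyzed fields to an IndexWriter, as in the earlier examples.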
If someone could point me in the right direction that would be most helpful.
I have written a custom CMS where I want to allow each individual user to upload documents (.doc, .docx, .pdf, .rtf, .txt, etc.) and then be able to search the contents of those files for keywords.
The CMS is written entirely in PHP and MySQL within a Linux environment.
Once uploaded, the documents would be stored "as is" in the user's private folder on the server. There will be hundreds, if not thousands, of documents stored by each user.
It is very important that a given user's files are searchable only by that user.
Could anyone point me in the right direction? I have had a look at Solr, but these types of solutions seem so complicated. I have spent an entire week looking at different solutions, and this is my last attempt at finding one.
Thank you in advance.
2 choices I see.
A search index per user. Their documents are indexed separately from everyone else's. When they do a search, they hit their own search index. There is no danger of seeing other's results, or getting scores based on contents from other's documents. The downside is having to store and update the index separately. I would look into using Lucene for something like this, as the indices will be small.
A single search index. The users all share a search index. The results from searches would have to be filtered so that only results for that user were returned. The upside is implementing a single search index (Solr would be great for this). The downside is the risk of crosstalk between users' searches: scoring would be affected by other users' documents, resulting in poorer search results.
I hate to say it, but from a quality standpoint, I'd lean towards number 1. Number 2 seems more efficient and easier, but user results are more important to me.
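In Lucene terms, option 1 just means opening a separate index directory per user. Java is shown purely as an illustration since the answer above suggests Lucene; from a PHP stack you would more likely front this with a service such as Solr. The directory layout is an assumption:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class PerUserIndex {
    // One index directory per user keeps results and scoring fully isolated.
    public static IndexWriter openWriterFor(long userId) throws Exception {
        FSDirectory dir = FSDirectory.open(Paths.get("indexes", "user-" + userId));
        return new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
    }
}
```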
Keep the files outside of the public directory tree, keep a reference to each file's path and its creator's user id in a database table, and then users can search for the files using database queries. You will of course have to let users create accounts and log in. You can then let them download the files using PHP.
As long as each user's files are located in an isolated directory, or there is some way to identify one user's documents, such as adding the user id to the filename, you could use grep.
The disadvantages:
Each search would have to go through all the documents, so if you have a lot of documents or very large documents it would be slow.
Binary document formats, like Word or PDF, might not produce accurate results.
This is not an enterprise solution.
Revised answer: Try mnoGoSearch