How can I index a lot of txt files? (Java/C/C++)

I need to index a lot of text. The search results must give me the name of the files containing the query and all of the positions where the query matched in each file - so, I don't have to load the whole file to find the matching portion. What libraries can you recommend for doing this?
update: Lucene has been suggested. Can you give me some info on how I should use Lucene to achieve this? (I have seen examples where the search query returned only the matching files.)

For Java, try Lucene.

I believe the Lucene term for what you are looking for is highlighting. Here is a very recent report on Lucene highlighting. You will probably need to store word position information in order to get the snippets you are looking for. The Token API may help.
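To make that concrete, here is a rough sketch of the kind of code involved, assuming the older 3.x-style contrib Highlighter API; the field name "content" and the analyzer are assumptions about how your index was built, and exact class names vary between Lucene versions.
// Hedged sketch: return up to three highlighted fragments of a document's text.
// Assumes the contrib highlighter jar is on the classpath (Lucene 3.x-era API).
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class SnippetExample {
    // "content" is the (assumed) name of the indexed text field.
    static String[] snippets(Query query, Analyzer analyzer, String text) throws Exception {
        Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter("<b>", "</b>"),
                                                  new QueryScorer(query));
        TokenStream tokens = analyzer.tokenStream("content", new StringReader(text));
        return highlighter.getBestFragments(tokens, text, 3);
    }
}
You would iterate over your search hits, load the stored text for each hit, and pass it through something like this to get the snippets; if you need offsets as well, the highlighter also has a variant that returns fragment objects rather than plain strings.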

It all depends on how you are going to access it, and of course on how many users are going to access it. Read up on MapReduce.
If you are going to roll your own, you will need to create an index file which is essentially a map between unique words and tuples like (file, line, offset); see the sketch below. Of course, you can also think of other in-memory data structures like a trie (prefix tree), a Judy array and the like...
Some 3rd party solutions are listed here.
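For illustration, here is a minimal, hand-rolled sketch of such an inverted index using only the JDK; the class and method names are made up, and a real implementation would also need to persist the map to disk.
// Minimal sketch of a hand-rolled inverted index: each word maps to a list of
// (file, line, offset) postings. Uses a Java record, so it needs a recent JDK.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleInvertedIndex {
    record Posting(String file, int line, int offset) {}

    private final Map<String, List<Posting>> index = new HashMap<>();
    private static final Pattern WORD = Pattern.compile("\\w+");

    public void addFile(Path path) throws IOException {
        List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
        for (int lineNo = 0; lineNo < lines.size(); lineNo++) {
            Matcher m = WORD.matcher(lines.get(lineNo));
            while (m.find()) {
                index.computeIfAbsent(m.group().toLowerCase(), k -> new ArrayList<>())
                     .add(new Posting(path.toString(), lineNo + 1, m.start()));
            }
        }
    }

    // All places where 'word' occurs; the caller can seek directly to each position.
    public List<Posting> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptyList());
    }
}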

Have a look at http://www.compass-project.org/. It can be seen as a wrapper on top of Lucene: Compass simplifies common usage patterns of Lucene such as Google-style search and index updates, as well as more advanced concepts such as caching and index sharding (sub-indexes). Compass also uses built-in optimizations for concurrent commits and merges.
The Overview can give you more info
http://www.compass-project.org/overview.html
I have integrated this into a Spring project in no time. It is really easy to use and gives your users what they will see as Google-like results.

Lucene - Java
It's open source as well, so you are free to use and deploy it in your application.
As far as I know, the Eclipse IDE help system is powered by Lucene, so it has been tested by millions.

Also take a look at Lemur Toolkit.

Why don't you try to construct a state machine by reading all the files? Transitions between states will be letters, and states will be either final (some files contain the considered word, in which case the list of those files is available there) or intermediate.
As for multiple-word lookups, you'll have to handle each word independently before intersecting the results.
I believe the Boost::Statechart library may be of some help for that matter.
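As a rough illustration of the idea (in Java rather than Boost::Statechart), a trie where letters are transitions and "final" states carry the set of files containing the word might look like this:
// Minimal trie (prefix-tree) sketch: non-empty 'files' marks a final state.
import java.util.*;

public class WordTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final Set<String> files = new HashSet<>(); // non-empty only at final states
    }

    private final Node root = new Node();

    public void add(String word, String file) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.files.add(file);
    }

    public Set<String> filesContaining(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.get(c);
            if (node == null) return Collections.emptySet();
        }
        return node.files;
    }
}
For the multiple-word case you can call retainAll() on a copy of the first word's result set to intersect it with the others.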

I'm aware you asked for a library, just wanted to point you to the underlying concept of building an inverted index (from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze).

Related

Work out Analyzer, Version, etc. from Lucene index files?

Just double-checking on this: I assume this is not possible and that if you want to keep such info somehow bundled up with the index files in your index directory you have to work out a way to do it yourself.
Obviously you might be using different Analyzers for different directories, and 99% of the time it is pretty important to use the right one when constructing a QueryParser: if your QueryParser uses a different one, all sorts of inaccuracies might crop up in the results.
Equally, getting the wrong Version of the index files might, for all I know, not result in a complete failure: again, you might instead get inaccurate results.
I wonder whether the Lucene people have ever considered bundling up this sort of info with the index files? Equally I wonder if anyone knows whether any of the Lucene derivative apps, like Elasticsearch, maybe do incorporate such a mechanism?
Actually, just looking inside the "_0" files (_0.cfe, _0.cfs and _0.si) of an index, all 3 do actually contain the word "Lucene" seemingly followed by version info. Hmmm...
PS other related thoughts which occur: say you are indexing a text document of some kind (or 1000 documents)... and you want to keep your index up-to-date each time it is opened. One obvious way to do this would be to compare the last-modified date of individual files with the last time the index was updated: any documents which are now out-of-date would need to have the info pertaining to them removed from the index and then be re-indexed.
This need must occur all the time in connection with Lucene indices. How is it generally tackled in the absence of helpful "meta info" included in with the index files proper?
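One common pattern, sketched below purely as an illustration (older Lucene 3.x-style Field API, with a made-up "path" key field), is to key each document by its file path and call updateDocument() whenever the file is newer than whatever last-indexed timestamp you keep yourself:
// Hedged sketch of keeping an index in sync: the path acts as the document key,
// and updateDocument() replaces the stale entry (or adds a new one if absent).
import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IncrementalIndexer {
    // Re-index 'file' if it changed since 'lastIndexed' (read from your own metadata).
    static void refresh(IndexWriter writer, File file, long lastIndexed) throws Exception {
        if (file.lastModified() <= lastIndexed) return;
        Document doc = new Document();
        doc.add(new Field("path", file.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        // ... add the content fields here ...
        writer.updateDocument(new Term("path", file.getPath()), doc);
    }
}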
Anyone interested in this issue:
It does appear, from what I said above, that the Version is contained in the index files. I looked at the CheckIndex class and the various info you can get from it, e.g. CheckIndex.Status.SegmentInfoStatus, without finding a way to obtain the Version. I'm starting to assume this is deliberate, and that the idea is just to let Lucene handle the updating of the index as required. Not an entirely satisfactory state of affairs if so...
As for other things, such as the Analyzer class, it appears you have to implement this sort of "metadata" storage yourself if you want it... this could be done by just including a text file alongside the other index files, or alternatively it appears you can use the IndexData class. Of course your Version could also be stored this way.
For writing such info, see IndexWriter.setCommitData().
For retrieving such info, you have to use one of several (?) subclasses of IndexReader, such as DirectoryReader.
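A minimal sketch of that commit-data route, assuming a Lucene 4.x-era API (setCommitData() was later renamed setLiveCommitData()); the keys and values here are just examples of what you might choose to store:
// Sketch: stash "metadata" such as the analyzer class and version in the commit
// user data, and read it back from a DirectoryReader.
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

public class CommitMetadata {
    static void write(IndexWriter writer) throws Exception {
        Map<String, String> userData = new HashMap<>();
        userData.put("analyzer", "org.apache.lucene.analysis.standard.StandardAnalyzer");
        userData.put("luceneVersion", "4.10.4"); // whatever you built the index with
        writer.setCommitData(userData);
        writer.commit(); // the user data is stored with this commit
    }

    static String readAnalyzer(DirectoryReader reader) throws Exception {
        return reader.getIndexCommit().getUserData().get("analyzer");
    }
}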

Identify an english word as a thing or product?

Write a program with the following objective -
be able to identify whether a word/phrase represents a thing/product. For example -
1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <-Be able to identify glove as a thing/product.
2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- be able to identify regulator as a thing.
Doing this tells me that the text is talking about a thing/product. As a contrast, the following text talks about a process instead of a thing/product: "An extrusion coating process for the production of flexible packaging films of nylon coated substrates consisting of the steps of..."
I have millions of such texts; hence, doing it manually is not feasible. So far, with the help of NLTK + Python, I have been able to identify some specific cases which use very similar keywords, but I have not been able to do the same with the kinds mentioned in the examples above. Any help will be appreciated!
What you want to do is actually pretty difficult. It is a sort of (very specific) semantic labelling task. The possible solutions are:
create your own labelling algorithm, create training data, test, evaluate and finally tag your data
use an existing knowledge base (lexicon) to extract semantic labels for each target word
The first option is a complex research project in itself. Do it if you have the time and resources.
The second option will only give you the labels that are available in the knowledge base, and these might not match your wishes. I would give it a try with python, NLTK and Wordnet (interface already available), you might be able to use synset hypernyms for your problem.
This task is an instance of the named entity recognition (NER) problem.
EDIT: There is no clean definition of NER in the NLP community, so one could say this is not an NER task but an instance of the more general sequence labeling problem. Anyway, there is still no tool that can do this out of the box.
Out of the box, Stanford NLP can only recognize the following types:
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical
(MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION,
SET) entities
so it is not suitable for this task. There are some commercial solutions that can possibly do the job; they can be readily found by googling "product name named entity recognition", and some of them offer free trial plans. I don't know of any free, ready-to-deploy solution.
Of course, you can create your own model by hand-annotating 1000 or so sentences containing product names and training a classifier such as a Conditional Random Field (CRF) classifier with some basic features (here is the documentation page that explains how to do that with Stanford NLP). This solution should work reasonably well, although it won't be perfect of course (no system will be perfect, but some solutions are better than others).
EDIT: This is a complex task per se, but not that complex unless you want state-of-the-art results. You can create a reasonably good model in just 2-3 days. Here is an example of step-by-step instructions for doing this with an open source tool:
Download CRF++ and look at the provided examples; they are in a simple text format.
Annotate your data in a similar manner:
a OTHER
glove PRODUCT
comprising OTHER
...
and so on.
Split your annotated data into two files: train (80%) and dev (20%).
Use the following baseline template features (paste them into the template file):
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
Run:
crf_learn template train.txt model
crf_test -m model dev.txt > result.txt
Look at result.txt: one column will contain your hand-labeled data and the other the machine-predicted labels. You can then compare the two, compute accuracy, etc. After that you can feed new unlabeled data into crf_test and get your labels.
As I said, this won't be perfect, but I will be very surprised if it isn't reasonably good (I actually solved a very similar task not long ago), and it will certainly be better than just using a few keywords/templates.
ENDNOTE: this ignores many things and some best practices for solving such tasks, won't be good enough for academic research, and is not 100% guaranteed to work, but it is still useful as a relatively quick solution for this and many similar problems.

Should I use Lucene just for Highlighting?

I have an application that searches text indexed in a MSSQL database. My current search functionality works fine. I just need to display the search results with the surrounding text of the search terms (like google does). The only tool I could find to do this is Lucene's text highlighting. I read about it from this question: Displaying sample text from the Lucene Search Results. I haven't looked into Lucene for very long, but I'm guessing I'd have to create documents for each search hit.
I was wondering if what I want to do is even possible with Lucene, and whether it'd be overkill to use a tool like this for my purpose. Are there any other tools I could/should use for this?
It depends on the size of the text you are trying to highlight, but if it is rather small, you could use Lucene's highlighting functionality on top of your search backend. See the Highlighter documentation for more information.
In case this is not fast enough for you (if you want to highlight large chunks of text, for example), Lucene can make highlighting faster by using term vectors, but this would require you to move your backend from MSSQL to Lucene.
If you can already get the surrounding text of the found keywords, and this is really the only thing you need, then yes, Lucene is overkill: just surround your keywords with highlighting tags.
But in most cases, as time goes by, people start thinking about other advanced options, such as stemming (if you search for "highlight" you also find "highlighting" and "highlighter"), synonym search, language detection, etc. If you think you may ever need such things, or if you don't already have an algorithm to find text snippets with keywords (surrounding text), I highly recommend diving into the Lucene world. The best option I can think of is to index all your text fields from MSSQL and base all your text search on Lucene.
If you are afraid of heavy Lucene coding, you may use Solr, a Lucene-based web server with an extremely wide range of capabilities, easily configured with XML files. Solr has both a simple web interface and a number of programming interfaces (SolrJ for Java).
It would be overkill: Lucene is a complete search/indexing engine with stemming, scoring, and other stuff. It's likely better than what you're doing, but it depends on your goals.
If you're just doing simple keyword highlighting, consider a regex to insert highlighting tags.
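A minimal sketch of that regex approach, using only the JDK; the tag and keyword handling here are just placeholders:
// Wrap each keyword occurrence in a highlighting tag.
import java.util.List;
import java.util.regex.Pattern;

public class RegexHighlighter {
    static String highlight(String text, List<String> keywords) {
        for (String kw : keywords) {
            // (?i) = case-insensitive, \b = whole-word match, quote() escapes metacharacters
            text = text.replaceAll("(?i)\\b(" + Pattern.quote(kw) + ")\\b", "<b>$1</b>");
        }
        return text;
    }
}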

Storing data in Lucene or database

I'm a Lucene newbie and am thinking of using it to index the words in the title and description elements of RSS feeds so that I can record counts of the most popular words in the feeds.
Various search options are needed: some keywords will be entered manually by users, whereas in other cases popular terms would be generated automatically by the system. So I could have Lucene use query strings to return the hit counts for manually entered keywords, and use TermEnums in the automated cases?
The system also needs to be able to handle new data from the feeds as they are polled at regular intervals.
Now, I could do much or all of this using hashmaps in Java to work out the counts, but if I use Lucene, my question concerns the best way to store the words for counting. To take a single RSS feed: is it wise to have Lucene create a temporary index in memory and pass the words and hit counts out so other programs can write them to a database?
Or is it better to create a Lucene document per feed and add new feed data to it at polling time, so that if a keyword count is required between dates x and y, Lucene can return the values? This implies I can datestamp Lucene entries, which I'm not sure about yet.
Hope this makes sense.
Mr Morgan.
From the description you have given in the question, I think Lucene alone will be sufficient (no need for MySQL or Solr). The Lucene API is also easy to use, and you won't need to change your frontend code.
From every RSS feed, you can create a Document having three fields, namely title, description and date. The date should preferably be a NumericField. You can then append every document to the Lucene index as the feeds arrive.
How frequently do you want the system to automatically generate the popular terms? For example, do you want to show users the "most popular terms last week", etc.? If so, you can use a NumericRangeFilter to efficiently search the date field you have stored. Once you get the documents satisfying a date range, you can find the document frequency of each term in the retrieved documents to find the most popular terms. (Do not forget to remove the stopwords from your documents, say by using the StopAnalyzer, or else the most popular terms will be the stopwords.)
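A sketch of the indexing and date-range search described above, assuming the older Lucene 3.x API that NumericField belongs to (newer versions use different field and query classes):
// Index title/description/date per feed item, then filter by a date range.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.NumericRangeFilter;
import org.apache.lucene.search.TopDocs;

public class FeedIndexer {
    static void addFeedItem(IndexWriter writer, String title, String description, long date)
            throws Exception {
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("description", description, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new NumericField("date", Field.Store.YES, true).setLongValue(date));
        writer.addDocument(doc);
    }

    // Documents whose date falls between 'from' and 'to' (both inclusive).
    static TopDocs inDateRange(IndexSearcher searcher, long from, long to) throws Exception {
        NumericRangeFilter<Long> filter =
                NumericRangeFilter.newLongRange("date", from, to, true, true);
        return searcher.search(new MatchAllDocsQuery(), filter, 100);
    }
}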
I can recommend you check out Apache Solr. In a nutshell, Solr is a web enabled front end to Lucene that simplifies integration and also provides value added features. Specifically, the Data Import Handlers make updating/adding new content to your Lucene index really simple.
Further, for the word counting feature you are asking about, Solr has a concept of "faceting" which will exactly fit the problem you're describing.
If you're already familiar with web applications, I would definitely consider it: http://lucene.apache.org/solr/
Solr is definitely the way to go although I would caution against using it with Apache Tomcat on Windows as the install process is a bloody nightmare. More than happy to guide you through it if you like as I have it working perfectly now.
You might also consider the full-text indexing capabilities of MySQL, which are far easier than Lucene.
Regards

Text processing / comparison engine

I'm looking to compare two documents to determine what percentage of their text matches based on keywords.
To do this I could easily chop them into sets of sanitised words and compare them, but I would like something a bit smarter, something that can match words based on their root, i.e. even if their tense or plurality is different. This sort of technique seems to be used in full-text search, but I have no idea what to look for.
Does such an engine (preferably applicable to Java) exist?
Yes, you want a stemmer. Lauri Karttunen did some amazing work with finite state machines, but sadly I don't think there's an available implementation to use. As mentioned, Lucene has stemmers for a variety of languages, and the OpenNLP and GATE projects might help you as well. Also, how were you planning to "chop them up"? This is a little trickier than most people think because of punctuation, possessives, and the like. And just splitting on whitespace doesn't work at all in many languages. Take a look at OpenNLP for that too.
Another thing to consider is that just comparing the non-stop-words of the two documents might not be the best approach for good similarity, depending on what you are actually trying to do, because you lose locality information. For example, a common approach to plagiarism detection is to break the documents into chunks of n tokens and compare those (a toy sketch of this is shown below). There are algorithms that let you compare many documents at the same time this way, much more efficiently than doing a pairwise comparison between every pair of documents.
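As a toy illustration of that chunking idea, using only the JDK and no stemming (in practice you would normalise/stem the tokens first):
// Break each document into n-token shingles and compare the sets (Jaccard overlap).
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ShingleOverlap {
    static Set<String> shingles(String text, int n) {
        String[] tokens = text.toLowerCase().split("\\W+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }

    static double jaccard(String a, String b, int n) {
        Set<String> sa = shingles(a, n);
        Set<String> sb = shingles(b, n);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        sa.retainAll(sb); // sa now holds the intersection
        return union.isEmpty() ? 0.0 : (double) sa.size() / union.size();
    }
}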
I don't know of a pre-built engine, but if you decide to roll your own (e.g., if you can't find pre-written code to do what you want), searching for "Porter Stemmer" should get you started on an algorithm to get rid of (most) suffixes reasonably well.
I think Lucene might be along the lines of what you're looking for. From my experience it's pretty easy to use.
EDIT: I just reread the question and thought about it some more. Lucene is a full-text search engine for Java. However, I'm not quite sure how hard it would be to repurpose it for what you're trying to do. Either way, it might be a good resource to start looking at and go from there.
