Tagging of names using Lucene/Java

I have the names of all the employees of my company (5000+). I want to write an engine that can find names in online articles (blogs/wikis/help documents) on the fly and tag them with a "mailto" tag containing the user's email.
As of now I am planning to remove all the stop words from the article and then search for each remaining word in a Lucene index. But even then I see a lot of queries hitting the index: for example, if an article has 2000 words and only two references to people's names, there will most probably be around 1000 Lucene queries.
Is there a way to reduce these queries? Or a completely different way of achieving the same thing?
Thanks in advance

If you have only 5000 names, I would just stick them into a hash table in memory instead of bothering with Lucene. You can hash them several ways (e.g., nicknames, first-last or last-first, etc.) and still have a relatively small memory footprint and really efficient performance.
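A minimal sketch of that idea (the class shape and key variants are illustrative assumptions, not a fixed design):
import java.util.HashMap;
import java.util.Map;

public class NameDirectory {
    private final Map<String, String> nameToEmail = new HashMap<String, String>();

    // Index each employee under several key variants.
    public void add(String first, String last, String email) {
        nameToEmail.put((first + " " + last).toLowerCase(), email);
        nameToEmail.put((last + " " + first).toLowerCase(), email);
        nameToEmail.put((last + ", " + first).toLowerCase(), email);
    }

    // O(1) lookup per candidate phrase pulled from the article.
    public String emailFor(String candidate) {
        return nameToEmail.get(candidate.toLowerCase());
    }
}
Sliding a one- or two-token window over the article and looking each window up costs a constant-time hash probe per position, instead of a Lucene query per word.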

http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
This algorithm might be of use to you. The way it works is that you first compile the entire list of names into a giant finite state machine (which would probably take a while), but once that state machine is built, you can run as many documents as you want through it and detect names pretty efficiently.
I think it would look at every character in each document only once, so it should be much more efficient than tokenizing the document and comparing each word to a list of known names.
There are a bunch of implementations available for different languages on the web. Check it out.
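As a hedged illustration, here is what a single-pass scan can look like with the open-source Java implementation at github.com/robert-bor/aho-corasick (the library choice, its API, and the sample names are my assumptions, not the answer's):
import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;

public class NameScanner {
    public static void main(String[] args) {
        // Build the automaton once from all 5000+ names (the slow part)...
        Trie trie = Trie.builder()
                .ignoreCase()
                .onlyWholeWords()
                .addKeyword("John Smith")
                .addKeyword("Jane Doe")
                .build();

        // ...then scan each document in one pass over its characters.
        String article = "Yesterday John Smith updated the help document.";
        for (Emit emit : trie.parseText(article)) {
            System.out.println(emit.getStart() + "-" + emit.getEnd() + ": " + emit.getKeyword());
        }
    }
}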

Related

Search stem and exact words in Lucene 4.4.0

I've stored a Lucene document with a single TextField containing words without stems.
I need to implement a search program that allows users to search for words and exact words, but if I've stored the words without stemming, a stemmed search cannot be done.
Is there a way to search both exact words and/or stemmed words in documents without storing two fields?
Thanks in advance.
Indexing two separate fields seems like the right approach to me.
Stemmed and unstemmed text require different analysis strategies, and so require you to provide different Analyzers to the QueryParser. Lucene doesn't really support indexing text in the same field with different analyzers; that is by design. Furthermore, duplicating the text in the same field could have some fairly strange scoring impacts (terms that are not touched by the stemmer would be weighted more heavily, in particular).
There is no need to store the text in each of these fields, but it only makes sense to index them in separate fields.
You can apply a different analyzer to different fields by using a PerFieldAnalyzerWrapper, by the way. Like:
// Map each field to its own analyzer: EnglishAnalyzer stems, StandardAnalyzer does not.
Map<String, Analyzer> analyzerList = new HashMap<String, Analyzer>();
analyzerList.put("stemmedText", new EnglishAnalyzer(Version.LUCENE_44));
analyzerList.put("unstemmedText", new StandardAnalyzer(Version.LUCENE_44));
// The first argument is the default analyzer for any field not listed in the map.
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
        new StandardAnalyzer(Version.LUCENE_44), analyzerList);
I can see a couple of possibilities to accomplish it though, if you really want to.
One would be to create your own stem filter, based on (or possibly extending) the one you wish to use already, and add in the ability to keep the original tokens after stemming. Mind your position increments, in this case. Phrase queries and the like may be problematic.
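For illustration, Lucene 4.x already ships building blocks for that first option; a minimal sketch (my own, not the answer's code) chains KeywordRepeatFilter, which emits each token twice at the same position with one copy marked as a keyword so the stemmer skips it, with a stemmer and RemoveDuplicatesTokenFilter:
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class StemAndExactAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_44, reader);
        TokenStream stream = new LowerCaseFilter(Version.LUCENE_44, source);
        stream = new KeywordRepeatFilter(stream);         // emit each token twice, one marked keyword
        stream = new PorterStemFilter(stream);            // stems only the unmarked copy
        stream = new RemoveDuplicatesTokenFilter(stream); // drop copies the stemmer left unchanged
        return new TokenStreamComponents(source, stream);
    }
}
Because both copies share a position, phrase queries keep working, which addresses the position-increment concern above.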
The other (probably worse) possibility would be to add the text to the field normally, then add it again to the same field, but this time after manually stemming. Two fields added with the same name are effectively concatenated. You'd want to store the text in a separate field, in this case. Expect wonky scoring.
Again, though, both of these are bad ideas. I see no benefit whatsoever to either of these strategies over the much easier and more useful approach of just indexing two fields.

Assign a paper to a reviewer based on keywords

I was wondering if you know of any algorithm that can do an automatic assignment for the following situation: I have some papers with some keywords defined, and some reviewers who have some specific keywords defined. How could I do an automatic mapping, so that each reviewer reviews the papers from his/her area of interest?
If you are open to using external tools, Lucene is a library that will allow you to search text based on (from their website; a short sketch of this approach follows the list):
phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g., title, author, contents)
date-range searching
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
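A minimal sketch of that approach against Lucene 4.4 (the field names and sample data are made up for illustration): index each paper's keywords, then run each reviewer's keywords as a query and treat the top hits as candidate assignments.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class ReviewerMatcher {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
        RAMDirectory dir = new RAMDirectory();

        // Index each paper with its keywords.
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_44, analyzer));
        Document paper = new Document();
        paper.add(new StringField("id", "paper-1", Field.Store.YES));
        paper.add(new TextField("keywords",
                "machine learning, information retrieval", Field.Store.YES));
        writer.addDocument(paper);
        writer.close();

        // Run one reviewer's keywords as a query; top hits are candidate assignments.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        Query query = new QueryParser(Version.LUCENE_44, "keywords", analyzer)
                .parse("information retrieval");
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("id") + "  score=" + hit.score);
        }
    }
}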
You will basically need to design your own parser, or specialize an existing parser according to your needs. You need to scan the papers and, according to your keywords, search and match the tokens accordingly. The sentences containing those keywords can then be separated out and displayed to the reviewer.
I would suggest the Stanford NLP POS tagger. Every keyword that you need will fall under some part of speech. You can then tag your complete document, search for those tags, and sort out the sentences accordingly.
Apache Lucene could be one solution.
It allows you to index documents either in a RAM directory, or within a real directory of your file system, and then to perform full-text searches.
It offers a lot of very interesting features, like filters or analyzers. You can, for example:
remove the stop words depending on the language of the documents (e.g. for English: a, the, of, etc.);
stem the tokens (e.g. function, functional, functionality, etc., are considered as a single instance);
perform complex queries (e.g. review*, keyw?rds, "to be or not to be", etc.);
and so on and so forth...
You should have a look! Don't hesitate to ask me for some code samples if Lucene is the way you choose :)

Storing data in Lucene or database

I'm a Lucene newbie and am thinking of using it to index the words in the title and description elements of RSS feeds so that I can record counts of the most popular words in the feeds.
Various search options are needed: some will have keywords entered manually by users, whereas in other cases popular terms would be generated automatically by the system. So I could have Lucene use query strings to return the hit counts for manually entered keywords, and TermEnums in the automated cases?
The system also needs to be able to handle new data from the feeds as they are polled at regular intervals.
Now, I could do much or all of this using hashmaps in Java to work out counts, but if I use Lucene, my question concerns the best way to store the words for counting. To take a single RSS feed, is it wise to have Lucene create a temporary index in memory and pass the words and hit counts out so other programs can write them to a database?
Or is it better to create a Lucene document per feed and add new feed data to it at polling time? So that if a keyword count is required between dates x and y, Lucene can return the values? This implies I can datestamp Lucene entries which I'm not sure of yet.
Hope this makes sense.
Mr Morgan.
From the description you have given in the question, I think Lucene alone will be sufficient (no need for MySQL or Solr). The Lucene API is also easy to use, and you won't need to change your frontend code.
From every RSS feed, you can create a Document having three fields, namely title, description, and date. The date should preferably be a NumericField. You can then append every document to the Lucene index as the feeds arrive.
How frequently do you want the system to generate the popular terms automatically? For example, do you want to show users the "most popular terms last week", etc.? If so, you can use a NumericRangeFilter to efficiently search the date field you have stored. Once you get the documents satisfying a date range, you can find the document frequency of each term in the retrieved documents to identify the most popular terms. (Do not forget to remove the stopwords from your documents, say by using the StopAnalyzer, or else the most popular terms will be the stopwords.)
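A sketch of the date handling described above, assuming Lucene 4.x (the field names and the helper class are illustrative):
import java.util.Date;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.search.NumericRangeFilter;

public class FeedDocs {
    // One document per feed item: title, description, and a numeric date.
    static Document toDocument(String title, String description, Date published) {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new TextField("description", description, Field.Store.NO));
        doc.add(new LongField("date", published.getTime(), Field.Store.YES));
        return doc;
    }

    // Restricts a search to the last seven days, e.g.
    // searcher.search(query, lastWeekFilter(), 100).
    static NumericRangeFilter<Long> lastWeekFilter() {
        long now = System.currentTimeMillis();
        long weekAgo = now - 7L * 24 * 60 * 60 * 1000;
        return NumericRangeFilter.newLongRange("date", weekAgo, now, true, true);
    }
}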
I can recommend you check out Apache Solr. In a nutshell, Solr is a web-enabled front end to Lucene that simplifies integration and also provides value-added features. Specifically, the Data Import Handlers make updating/adding new content to your Lucene index really simple.
Further, for the word-counting feature you are asking about, Solr has a concept of "faceting" which exactly fits the problem you're describing.
If you're already familiar with web applications, I would definitely consider it: http://lucene.apache.org/solr/
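As a hedged illustration of the faceting idea with SolrJ 4.x (the core URL and field name are assumptions on my part): faceting on an indexed text field returns each term with its document count, which is essentially the popular-terms report.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PopularTerms {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/feeds");

        // Match everything, skip the documents, and ask only for facet counts.
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);
        query.setFacet(true);
        query.addFacetField("description");
        query.setFacetLimit(20); // top 20 most popular terms

        QueryResponse response = solr.query(query);
        for (FacetField.Count term : response.getFacetField("description").getValues()) {
            System.out.println(term.getName() + ": " + term.getCount());
        }
    }
}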
Solr is definitely the way to go, although I would caution against using it with Apache Tomcat on Windows, as the install process is a bloody nightmare. More than happy to guide you through it if you like, as I have it working perfectly now.
You might also consider the full-text indexing capabilities of MySQL, which are far easier than Lucene.
Regards

Text processing / comparison engine

I'm looking to compare two documents to determine what percentage of their text matches based on keywords.
To do this I could easily chop them into sets of sanitised words and compare, but I would like something a bit smarter, something that can match words based on their root, i.e. even if their tense or plurality differs. This sort of technique seems to be used in full-text searches, but I have no idea what to look for.
Does such an engine (preferably applicable to Java) exist?
Yes, you want a stemmer. Lauri Karttunen did some amazing work with finite state machines, but sadly I don't think there's an implementation available to use. As mentioned, Lucene has stemmers for a variety of languages, and the OpenNLP and GATE projects might help you as well. Also, how were you planning to "chop them up"? This is a little trickier than most people think because of punctuation, possessives, and the like. And just splitting on whitespace doesn't work at all in many languages. Take a look at OpenNLP for that too.
Another thing to consider is that just comparing the non-stop-words of the two documents might not be the best approach for measuring similarity, depending on what you are actually trying to do, because you lose locality information. For example, a common approach to plagiarism detection is to break the documents into chunks of n tokens and compare those. There are algorithms that let you compare many documents this way at the same time, much more efficiently than doing a pairwise comparison between each pair of documents.
I don't know of a pre-built engine, but if you decide to roll your own (e.g., if you can't find pre-written code to do what you want), searching for "Porter Stemmer" should get you started on an algorithm to get rid of (most) suffixes reasonably well.
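As a rough sketch of how far stock Lucene gets you here (the class is mine; EnglishAnalyzer tokenizes, lowercases, removes stop words, and applies a Porter-style stemmer): collect each document's stems into a set and compare the sets.
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StemOverlap {
    // Tokenize, lowercase, drop stop words, and stem.
    static Set<String> stems(String text) throws Exception {
        Set<String> result = new HashSet<String>();
        EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_44);
        TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            result.add(term.toString());
        }
        stream.end();
        stream.close();
        return result;
    }

    // Crude match percentage: shared stems over the smaller document's stem count.
    static double overlapPercent(String a, String b) throws Exception {
        Set<String> stemsA = stems(a);
        Set<String> stemsB = stems(b);
        Set<String> shared = new HashSet<String>(stemsA);
        shared.retainAll(stemsB);
        return 100.0 * shared.size() / Math.min(stemsA.size(), stemsB.size());
    }
}
As the earlier answer notes, set overlap throws away locality, so for plagiarism-style comparison you would chunk into n-token shingles instead.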
I think Lucene might be along the lines of what you're looking for. From my experience it's pretty easy to use.
EDIT: I just reread the question and thought about it some more. Lucene is a full-text search engine for Java. However, I'm not quite sure how hard it would be to repurpose it for what you're trying to do. Either way, it might be a good resource to start looking at and go from there.

Given a list of words - what would be a good algorithm for word completion in java? Tradeoffs: Speed/efficiency/memory footprint

I'm exploring the hardware/software requirements (ultimate goal is mobile Java app) for a potential free/paid application.
The application will start with this simple goal: Given a list of relevant words in a database, to be able to do word completion on a single string input.
In other words I already know the contents of the database - but the memory footprint/speed/search efficiency of the algorithm will determine the amount of data supported.
I have started at the beginning with suffix-based tree searches, but am wondering if anyone has experience with the speed/memory-size tradeoffs of this simple approach vs. the more complex ones being talked about at the conferences.
Honestly, the initial application probably has fewer than 500 words in context, so it might not matter, but ultimately the application could expand to tens of thousands or hundreds of thousands of records; hence the question about speed vs. memory footprint.
I suppose I could start with something simple and switch over later, but I hope to understand the tradeoff earlier!
Word completion suggests that you want to find all the words that start with a given prefix.
Tries are good for this, and particularly good if you're adding or removing elements - other nodes do not need to be reallocated.
If the dictionary is fairly static, and retrieval is important, consider a far simpler data structure: put your words in an ordered vector! You can do binary-search to discover a candidate starting with the correct prefix, and a linear search each side of it to discover all other candidates.
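A minimal sketch of that ordered-vector approach (class and method names are illustrative): the binary search's insertion point is the first candidate, and all completions sit contiguously after it.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrefixCompleter {
    private final String[] words;

    PrefixCompleter(String[] dictionary) {
        words = dictionary.clone();
        Arrays.sort(words); // sort once at load time
    }

    List<String> complete(String prefix) {
        List<String> matches = new ArrayList<String>();
        int i = Arrays.binarySearch(words, prefix);
        if (i < 0) {
            i = -i - 1; // insertion point: first word >= prefix
        }
        // Scan forward while entries still start with the prefix.
        while (i < words.length && words[i].startsWith(prefix)) {
            matches.add(words[i++]);
        }
        return matches;
    }
}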
