Word association search in Apache Lucene - java

I have a requirement to associate math terms that come under a common topic. For example, angles, cos, tan, etc. should relate to trigonometry. So when a user searches for angles, triangles, etc., the search should also present results related to trigonometry. Can anyone provide leads on how to do this in Apache Lucene?

There is a classification API which includes k-nearest neighbors and naive Bayes models.
You would first use the train() method with your training set. Once the classifier is trained, use the assignClass() method to classify a given string.
For a training set you could use Wikipedia pages for your given classes.
After you give those two a try, you could make use of the Classifier interface to build a competing model.
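As a rough sketch of what that flow can look like (assuming a Lucene 5.x-era API, where Classifier exposes train() and assignClass(), and assuming an existing training index with a "text" field holding page content and a "topic" field holding the class label; adjust names and versions to your setup):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.classification.ClassificationResult;
    import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.SlowCompositeReaderWrapper;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;

    import java.nio.file.Paths;

    public class TopicClassifierSketch {
        public static void main(String[] args) throws Exception {
            // "training-index" is a hypothetical index built from e.g. Wikipedia pages per topic.
            try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("training-index")))) {
                LeafReader leafReader = SlowCompositeReaderWrapper.wrap(reader);

                SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
                // Train on the "text" field, using "topic" as the class label field.
                classifier.train(leafReader, "text", "topic", new StandardAnalyzer());

                // Classify a new query/string.
                ClassificationResult<BytesRef> result = classifier.assignClass("angles of a triangle");
                System.out.println("Predicted topic: " + result.getAssignedClass().utf8ToString());
            }
        }
    }

Swapping in KNearestNeighborClassifier instead of SimpleNaiveBayesClassifier follows the same pattern.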

If you already know the associations, you can just add them to the index for the specific terms -- i.e. indexing 'cos' as 'cos', 'trigonometry'.
Also if you know the associations, you could index the parent term and all of the sibling terms -- i.e. indexing 'cos' as 'trigonometry', 'cos', 'sin', etc. This sounds more like what you want.
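If you go that route, one way to wire it up is a custom Analyzer with a synonym filter, so that 'cos', 'sin', etc. are also indexed under 'trigonometry'. This is only a sketch, assuming a recent Lucene (7/8-era package layout); the term-to-topic mappings are purely illustrative:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.FlattenGraphFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.util.CharsRef;

    public class TopicSynonymAnalyzer extends Analyzer {
        private final SynonymMap synonyms;

        public TopicSynonymAnalyzer() throws Exception {
            SynonymMap.Builder builder = new SynonymMap.Builder(true);
            // Map each term to its parent topic, keeping the original term as well (includeOrig = true).
            builder.add(new CharsRef("cos"), new CharsRef("trigonometry"), true);
            builder.add(new CharsRef("sin"), new CharsRef("trigonometry"), true);
            builder.add(new CharsRef("tan"), new CharsRef("trigonometry"), true);
            builder.add(new CharsRef("angles"), new CharsRef("trigonometry"), true);
            this.synonyms = builder.build();
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream stream = new LowerCaseFilter(source);
            // Inject the topic term at index time; a query for "trigonometry" now matches "cos" docs.
            stream = new SynonymGraphFilter(stream, synonyms, true);
            // Flatten the graph for index-time use, as the Lucene docs recommend.
            stream = new FlattenGraphFilter(stream);
            return new TokenStreamComponents(source, stream);
        }
    }

Pass this analyzer to your IndexWriterConfig so the extra topic terms are written into the index.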

In addition to @Josh S.'s good answer, you can also take a more direct approach and generate your own synonyms dictionary, e.g. see Match a word with similar words using Solr?

Related

Custom Edit distance weights for operations in Lucene FuzzySearch

I came across the Python library weighted-levenshtein (https://pypi.org/project/weighted-levenshtein/), which allows you to specify different costs/weights for the different operations (insertion, substitution, deletion and transposition). This is very helpful for detecting and correcting keystroke errors.
I have been searching through Lucene's FuzzySearch, which uses Damerau-Levenshtein distance, to check whether specifying different costs/weights for different operations is supported, but I have not been able to find anything.
Please let me know if there is a way to specify custom costs/weights within Lucene's fuzzy search.
Thanks in advance!
To accomplish this you would have to extend and/or edit Lucene code. To support fuzzy matching, Lucene compiles an automaton using the LevenshteinAutomata class, which implements this algorithm; it not only doesn't support edit weights, it also only supports matching within 0 to 2 edits.
How one might edit this algorithm to produce an automaton that supports weighted edits is beyond my knowledge, but it could be worth a try, as it would keep your customization simple (you would only have to override the getAutomaton method) and would, in theory, keep performance consistent.
The alternative would be to forgo the automaton approach to fuzzy matching and simply implement a weighted Levenshtein algorithm, like the one you linked to, directly in the actual fuzzy match check. By doing this, however, you could pay a rather high performance cost depending on the nature of the fuzzy queries you handle and the content of your index.
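If you do go the second route, the distance computation itself is straightforward. Below is a rough sketch of a weighted Levenshtein distance (insert/delete/substitute only, no transpositions) that a custom fuzzy check could call; the cost constants are placeholders you would replace with your keystroke model:

    /**
     * Weighted Levenshtein distance sketch: per-operation costs instead of a flat cost of 1.
     * The costs here are illustrative; a real keystroke model would vary them per character pair.
     */
    public final class WeightedLevenshtein {

        private static final double INSERT_COST = 1.0;
        private static final double DELETE_COST = 1.0;
        private static final double SUBSTITUTE_COST = 1.5;

        public static double distance(String a, String b) {
            double[][] d = new double[a.length() + 1][b.length() + 1];
            // Cost of deleting all of a's prefix / inserting all of b's prefix.
            for (int i = 1; i <= a.length(); i++) {
                d[i][0] = d[i - 1][0] + DELETE_COST;
            }
            for (int j = 1; j <= b.length(); j++) {
                d[0][j] = d[0][j - 1] + INSERT_COST;
            }
            // Standard dynamic program, but each transition uses its own weight.
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    double sub = d[i - 1][j - 1]
                            + (a.charAt(i - 1) == b.charAt(j - 1) ? 0.0 : SUBSTITUTE_COST);
                    double del = d[i - 1][j] + DELETE_COST;
                    double ins = d[i][j - 1] + INSERT_COST;
                    d[i][j] = Math.min(sub, Math.min(del, ins));
                }
            }
            return d[a.length()][b.length()];
        }
    }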

Using another index structure in Apache Lucene

I would like to use Lucene to write my own search engine. Because I use spatial information, I would like to try some index structures that are more suitable for spatial data. As far as I know there is no alternative structure available in Lucene itself, and LGTE (a Lucene extension for geo-temporal data) does not seem to let you access other structures either.
Did I just not see other structures or do I have to implement them?
The direct and simple answer to the title of your question, "can you use another index structure", is that you can't -- at least not if it would have a different API than Lucene's. In a nutshell, the index is fundamentally a sorted mapping of bytes (terms) to DocIds, optionally with postings (position offsets within a document), which can in turn carry optional "payloads" (arbitrary bytes per posting).
That said, I suppose you could implement a so-called Lucene Codec (new in Lucene 4.x) that has its own extended API, and then search against a field that assumes your specific Codec. Codecs are envisioned to have different implementations of Lucene's APIs (e.g. balancing what's in memory vs. on disk, or when to cleverly compress/encode vs. represent data directly), but not to introduce a different API as well. But I suppose you could.
The context beyond the title of your question is that you want to do this for spatial/temporal data because, it appears to me, you don't believe Lucene's index is sufficiently suitable. I strongly disagree. There have been great strides in Lucene spatial recently and there is more to come. For example, many non-relational databases will only let you index point data, but Lucene spatial can handle any Spatial4j shape (a dependency), and with JTS (another dependency) that covers basically all the typical shapes most people want. And there are scalable recursive algorithms for matching indexed shapes via the Intersects, Within, Contains, and Disjoint predicates. I expect some big performance enhancements by the end of summer, or the end of the year at the latest. You may find this very recent post I responded to on the Solr-user list interesting:
http://lucene.472066.n3.nabble.com/Multi-dimensional-spatial-search-tt4062515.html#a4062646
So instead I propose that you help me improve the parts of Lucene spatial that need it for whatever system you are building. Perhaps it already fits the bill.
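As a taste of what Lucene spatial already gives you, here is a rough sketch of indexing a point and running an Intersects query with RecursivePrefixTreeStrategy. This assumes the Lucene 4.x-era spatial module plus Spatial4j; the field name, tree depth, and coordinates are only illustrative:

    import com.spatial4j.core.context.SpatialContext;
    import com.spatial4j.core.shape.Point;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy;
    import org.apache.lucene.spatial.prefix.tree.GeohashPrefixTree;
    import org.apache.lucene.spatial.query.SpatialArgs;
    import org.apache.lucene.spatial.query.SpatialOperation;

    public class SpatialSketch {
        public static void main(String[] args) {
            SpatialContext ctx = SpatialContext.GEO;
            // Geohash-based prefix tree with roughly 11 levels of precision.
            RecursivePrefixTreeStrategy strategy =
                    new RecursivePrefixTreeStrategy(new GeohashPrefixTree(ctx, 11), "location");

            // Index a point (lon, lat) alongside whatever other fields the document has.
            Document doc = new Document();
            doc.add(new StringField("id", "doc-1", Field.Store.YES));
            Point point = ctx.makePoint(-80.93, 33.77);
            for (Field field : strategy.createIndexableFields(point)) {
                doc.add(field);
            }
            // ... add doc to an IndexWriter as usual ...

            // Later: find everything intersecting a 10-degree-radius circle around a point.
            Query query = strategy.makeQuery(new SpatialArgs(
                    SpatialOperation.Intersects,
                    ctx.makeCircle(-80.0, 33.0, 10.0)));
            // ... run query with an IndexSearcher ...
        }
    }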

Assign a paper to a reviewer based on keywords

I was wondering if you know of any algorithm that can do an automatic assignment for the following situation: I have some papers with some keywords defined, and some reviewers that have some specific keywords defined. How could I do an automatic mapping, so that each reviewer reviews the papers from his/her area of interest?
If you are open to using external tools, Lucene is a library that will allow you to search text based on (from their website):
phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g., title, author, contents)
date-range searching
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
You will basically need to design your own parser, or specialize an existing parser according to your needs. You need to scan the papers and, based on your keywords, search for and match your tokens accordingly. Then the sentences with these keywords can be separated out and displayed to the reviewer.
I would suggest the Stanford NLP POS tagger. Every keyword that you would need will fall under some part of speech. You can then tag your complete document, search for those tags, and sort out the sentences accordingly.
Apache Lucene could be one solution.
It allows you to index documents either in a RAM directory, or within a real directory of your file system, and then to perform full-text searches.
It offers a lot of very interesting features, like filters and analyzers. You can, for example:
remove the stop words depending on the language of the documents (e.g. for English: a, the, of, etc.);
stem the tokens (e.g. function, functional, functionality, etc. are treated as a single term);
perform complex queries (e.g. review*, keyw?rds, "to be or not to be", etc.);
and so on and so forth...
You should have a look! Don't hesitate to ask me for some code samples if Lucene is the way you choose :)
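For example, a bare-bones version could index each paper's keywords and then run each reviewer's keywords as a query, taking the top hits as candidate assignments. This is only a sketch using the classic Lucene 5/6-era API (RAMDirectory, QueryParser); the field names and sample data are made up:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.RAMDirectory;

    public class ReviewerMatchSketch {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            RAMDirectory dir = new RAMDirectory();

            // Index each paper with its id and keywords.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document paper = new Document();
                paper.add(new StringField("id", "paper-42", Field.Store.YES));
                paper.add(new TextField("keywords", "machine learning lucene text classification", Field.Store.YES));
                writer.addDocument(paper);
            }

            // Query with a reviewer's keywords; the top-scoring papers are candidate assignments.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                QueryParser parser = new QueryParser("keywords", analyzer);
                ScoreDoc[] hits = searcher.search(parser.parse("classification OR lucene"), 10).scoreDocs;
                for (ScoreDoc hit : hits) {
                    System.out.println(searcher.doc(hit.doc).get("id") + " score=" + hit.score);
                }
            }
        }
    }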

How to classify documents indexed with lucene

I have classified a set of documents with Lucene (fields: content, category). Each document has its own category, but some of them are labeled as uncategorized. Is there any way to classify these documents easily in Java?
Classification is a broad problem in the field of machine learning/statistics. After reading your question, it sounds like what you have done so far is more like an SQL GROUP BY clause (though in Lucene). If you want the machine to classify the documents, then you need to know about machine learning algorithms like neural networks, Bayesian classifiers, SVMs, etc. There are excellent libraries available in Java for these tasks. For this to work you will need features (a set of attributes extracted from the data) on which you can train your algorithm so that it can predict the classification label.
There are some good APIs in Java (which allow you to concentrate on code without going too deep into the mathematical theory behind those algorithms, though knowing it would be very advantageous). Weka is good. I also came across a couple of books from Manning which handle these tasks well. Here you go:
Chapter 10 (Classification) of Collective Intelligence in Action: http://www.manning.com/alag/
Chapter 5 (Classification) of Algorithms of Intelligent Web: http://www.manning.com/marmanis/
These are absolutely fantastic materials (for Java people) on classification, particularly suited for people who just don't want to dive into the theory (though it is very essential :)) and quickly want working code.
Collective Intelligence in Action solves the classification problem using JDM and Weka, so have a look at those two for your tasks.
Yes, you can use similarity queries such as the one implemented by the MoreLikeThisQuery class for this kind of thing (assuming you have some large text field in the documents of your Lucene index). Have a look at the javadoc of the underlying MoreLikeThis class for details on how it works.
To turn your Lucene index into a text classifier you have two options:
For any new text to classify, query for the top 10 or 50 most similar documents that have at least one category, sum the category occurrences among those "neighbors", and pick the top 3 most frequent categories among those similar documents (for instance).
Alternatively, you can index a new set of aggregate documents, one per category, by concatenating (all or a sample of) the text of the documents in that category. Then run the similarity query with your input text directly against those "fake" documents.
The first strategy is known in machine learning as k-Nearest Neighbors classification. The second is a hack :)
If you have many categories (say more than 1000) the second option might be better (faster to classify). I have not run any clean performance evaluation though.
You might also find this blog post interesting.
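As a concrete illustration of the first (k-Nearest Neighbors) strategy, here is a rough Java sketch against a plain Lucene index. It assumes the MoreLikeThis API as in Lucene 5/6; the "content" and "category" field names come from the question, the index path is hypothetical, and k=10 is arbitrary:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.FSDirectory;

    import java.io.StringReader;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class KnnCategorizer {
        public static String categorize(String newText) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);

                MoreLikeThis mlt = new MoreLikeThis(reader);
                mlt.setAnalyzer(new StandardAnalyzer());
                mlt.setFieldNames(new String[] {"content"});
                mlt.setMinTermFreq(1);
                mlt.setMinDocFreq(1);

                // Build a similarity query from the unclassified text.
                Query query = mlt.like("content", new StringReader(newText));

                // Vote over the categories of the 10 most similar categorized documents.
                Map<String, Integer> votes = new HashMap<>();
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    String category = searcher.doc(hit.doc).get("category");
                    if (category != null && !category.equals("uncategorized")) {
                        votes.merge(category, 1, Integer::sum);
                    }
                }
                return votes.entrySet().stream()
                        .max(Map.Entry.comparingByValue())
                        .map(Map.Entry::getKey)
                        .orElse("uncategorized");
            }
        }
    }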
If you want to use Solr, you need to enable the MoreLikeThisHandler and set termVectors=true on the content field.
The sunburnt Solr client for Python is able to perform MLT queries. Here is a prototype Python classifier that uses Solr for classification against an index of Wikipedia categories:
https://github.com/ogrisel/pignlproc/blob/master/examples/topic-corpus/categorize.py
As of Lucene 5.2.1, you can use indexed documents to classify new documents. Out of the box, Lucene offers a naive Bayes classifier, a k-nearest-neighbor classifier (based on the MoreLikeThis class), and a perceptron-based classifier.
The drawback is that all of these classes are marked with experimental warnings and documented with links to Wikipedia.

entity set expansion python

Do you know of any existing implementation, in any language (preferably Python), of an entity set expansion algorithm such as the one behind Google Sets? (http://labs.google.com/sets)
I couldn't find any library implementing such algorithms, and I'd like to play with some of them to see how they would perform on a specific task I want to implement.
Any help is welcome !
I'm not aware of any ready-to-use open source libraries that implement the sort of on-demand clustering of named entities provided by Google Sets. However, there are a few academic papers that describe in detail how to build similar systems, e.g.:
Language-Independent Set Expansion of Named Entities using the Web, Wang and Cohen, EMNLP 2009 (online demo available)
Bayesian Sets, Ghahramani and Heller, NIPS 2005
Below is a brief summary of Wang and Cohen's method. If you do end up implementing something like this yourself, it might be good to start with their method. I suspect most people will find it more intuitive than Ghahramani and Heller's formulation.
Wang and Cohen 2009
Wang and Cohen start by describing a method for automatically constructing extraction patterns that allow them to find lists of named entities in any sort of structured document. The method looks at the prefixes and suffixes bracketing known occurrences of named entities. These prefixes and suffixes are then used to identify other named entities within the same document.
To complete a cluster of entities, they build a graph consisting of the interconnections between named entities, the extraction patterns associated with them, and the documents. Using this graph, and starting at the nodes for the cluster's seed entities (i.e., the initial set of entities in the set to be completed), they perform numerous random walks on the graph, each up to 10 steps in length. They count how many times they reach the nodes corresponding to non-seed entities. Non-seed entities with high counts can then be used to complete the cluster.
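To make the random-walk step concrete, here is a small, self-contained toy sketch of the scoring idea (in Java, to match the rest of this page; it is not Wang and Cohen's actual system, and the graph, seed set, and walk counts are made up purely for illustration):

    import java.util.*;

    public class RandomWalkExpansionSketch {
        public static void main(String[] args) {
            // Bipartite graph: entities on one side, extraction patterns / documents on the other.
            Map<String, List<String>> graph = new HashMap<>();
            Set<String> entities = Set.of("canada", "france", "japan", "germany");
            addEdge(graph, "canada", "pattern:list-of-countries");
            addEdge(graph, "france", "pattern:list-of-countries");
            addEdge(graph, "japan", "pattern:list-of-countries");
            addEdge(graph, "france", "pattern:european-states-table");
            addEdge(graph, "germany", "pattern:european-states-table");

            List<String> seeds = List.of("canada", "france");
            Map<String, Integer> visits = new HashMap<>();
            Random rng = new Random(42);

            // Many short random walks (up to 10 steps, as in the summary above),
            // each starting from one of the seed entities.
            for (int walk = 0; walk < 10_000; walk++) {
                String node = seeds.get(rng.nextInt(seeds.size()));
                for (int step = 0; step < 10; step++) {
                    List<String> neighbors = graph.get(node);
                    node = neighbors.get(rng.nextInt(neighbors.size()));
                    if (entities.contains(node) && !seeds.contains(node)) {
                        visits.merge(node, 1, Integer::sum);
                    }
                }
            }

            // Non-seed entities visited most often are the proposed set completions.
            visits.entrySet().stream()
                  .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                  .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
        }

        private static void addEdge(Map<String, List<String>> graph, String a, String b) {
            graph.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
            graph.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
        }
    }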
