How to classify documents indexed with lucene - java

I have indexed a set of documents with Lucene (fields: content, category). Each document has its own category, but some of them are labeled as uncategorized. Is there any way to classify these documents easily in Java?

Classification is a broad problem in machine learning and statistics. From your question, it sounds like what you have done so far resembles a SQL GROUP BY clause (though in Lucene). If you want the machine to classify the documents, then you need machine learning algorithms such as neural networks, Bayesian classifiers, SVMs, etc. There are excellent Java libraries for these tasks. For this to work you will need features (a set of attributes extracted from the data) on which you can train your algorithm so that it can predict the classification label.
There are some good Java APIs that let you concentrate on code without going too deep into the mathematical theory behind those algorithms (though knowing it is a real advantage). Weka is good. I also came across a couple of books from Manning that handle these tasks well. Here you go:
Chapter 10 (Classification) of Collective Intelligence in Action: http://www.manning.com/alag/
Chapter 5 (Classification) of Algorithms of Intelligent Web: http://www.manning.com/marmanis/
This is absolutely fantastic material (for Java people) on classification, particularly suited to people who don't want to dive into the theory (though it is essential :)) and just want working code quickly.
Collective Intelligence in Action solves the classification problem using JDM and Weka, so have a look at these two for your task.

Yes, you can use similarity queries such as those implemented by the MoreLikeThisQuery class for this kind of thing (assuming you have some large text field in the documents of your Lucene index). Have a look at the javadoc of the underlying MoreLikeThis class for details on how it works.
To turn your Lucene index into a text classifier you have two options:
For any new text to classify, query for the top 10 or 50 most similar documents that have at least one category, sum the category occurrences among those "neighbors" and pick the top 3 most frequent categories among those similar documents (for instance).
Alternatively, you can index a new set of aggregate documents, one for each category, by concatenating (all or a sample of) the text of the documents of this category. Then run a similarity query with your input text directly on those "fake" documents.
The first strategy is known in machine learning as k-Nearest Neighbors classification. The second is a hack :)
If you have many categories (say more than 1000) the second option might be better (faster to classify). I have not run any clean performance evaluation though.
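To make the first strategy concrete, here is a minimal sketch of the voting step only. It assumes you have already run the similarity query and collected the category of each of the k most similar labelled documents into a list, in ranking order. The class and method names (NeighborVote, topCategories) are made up for illustration and are not part of Lucene. It needs Java 9+ because of List.of.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts the categories of the
// nearest neighbours returned by a similarity query and picks the winners.
final class NeighborVote {

    // neighborCategories: one category per neighbour, best hit first.
    // topN: how many categories to return, most frequent first.
    static List<String> topCategories(List<String> neighborCategories, int topN) {
        // LinkedHashMap remembers first-seen order, so ties go to the
        // category that showed up in a higher-ranked neighbour.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String category : neighborCategories) {
            counts.merge(category, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so equal counts keep their first-seen order.
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));

        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Categories of six neighbours, as they might come back from the query.
        List<String> hits = List.of("sports", "politics", "sports", "tech", "politics", "sports");
        System.out.println(topCategories(hits, 2));
    }
}
```

Every neighbour gets one vote here. Weighting each vote by the similarity score of the hit is a common variation, but that is my addition and not something the answer above prescribes.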
You might also find this blog post interesting.
If you want to use Solr, you need to enable the MoreLikeThisHandler and set termVectors=true on the content field.
The sunburnt Solr client for Python is able to perform MLT queries. Here is a prototype Python classifier that uses Solr for classification using an index of Wikipedia categories:
https://github.com/ogrisel/pignlproc/blob/master/examples/topic-corpus/categorize.py

As of Lucene 5.2.1, you can use indexed documents to classify new documents. Out of the box, Lucene offers a naive Bayes classifier, a k-nearest-neighbor classifier (based on the MoreLikeThis class) and a perceptron-based classifier.
The drawback is that all of these classes are marked with experimental warnings and documented with links to Wikipedia.

Related

Fitting the training dataset for text classification in Java

I'm building a system that does text classification. I'm building the system in Java. As features I'm using the bag-of-words model. However one problem with such a model is that the number of features is really high, which makes it impossible to fit the data in memory.
However, I came across this tutorial from Scikit-learn which uses specific data structures to solve the issue.
My questions:
1 - How do people solve such an issue using Java in general?
2 - Is there a solution similar to the one given in scikit-learn?
Edit: the only solution I've found so far is to write my own sparse vector implementation using hash tables.
If you want to build this system in Java, I suggest you use Weka, a machine learning toolkit similar to scikit-learn. Here is a simple tutorial about text classification with Weka:
https://weka.wikispaces.com/Text+categorization+with+WEKA
You can download Weka from:
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
HashSet/HashMap are the usual way people store bag-of-words vectors in Java. They are naturally sparse representations that grow not with the size of the dictionary but with the size of the document, and the latter is usually much smaller.
If you deal with unusual scenarios, like very big documents or representations, you can look at the few sparse bitset implementations around. They may be slightly more economical in terms of memory and are used for massive text classification implementations based on Hadoop, for example.
Most NLP frameworks make this decision for you anyway - you need to supply things in the format the framework wants them.
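As a rough illustration of the HashMap approach, here is a small sketch. The tokenizer (lower-case, split on anything that is not a-z) is a simplifying assumption of mine, and the class name SparseBagOfWords is made up. The point is only that memory use depends on the number of distinct words in the document, not on the vocabulary size.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sparse bag-of-words representation backed by a HashMap.
final class SparseBagOfWords {

    // Builds a term -> count map; terms that do not occur are simply absent.
    static Map<String, Integer> vectorize(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Naive tokenizer (an assumption): lower-case, split on non-letters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Dot product of two sparse vectors; iterates over the smaller map only.
    static long dot(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> small = a.size() <= b.size() ? a : b;
        Map<String, Integer> large = (small == a) ? b : a;
        long sum = 0;
        for (Map.Entry<String, Integer> e : small.entrySet()) {
            sum += (long) e.getValue() * large.getOrDefault(e.getKey(), 0);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = vectorize("The cat sat on the mat.");
        System.out.println(doc.get("the"));  // count of "the" in the sentence
        System.out.println(dot(vectorize("cat cat dog"), vectorize("cat bird")));
    }
}
```

If you later hand the data to a framework such as Weka, you would convert these maps into whatever sparse instance format that framework expects.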

Identify an english word as a thing or product?

Write a program with the following objective -
be able to identify whether a word/phrase represents a thing/product. For example -
1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <-Be able to identify glove as a thing/product.
2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- be able to identify regulator as a thing.
Doing this tells me that the text is talking about a thing/product. As a contrast, the following text talks about a process instead of a thing/product -> "An extrusion coating process for the production of flexible packaging films of nylon coated substrates consisting of the steps of..."
I have millions of such texts; hence, manually doing it is not feasible. So far, with the help of using NLTK + Python, I have been able to identify some specific cases which use very similar keywords. But I have not been able to do the same with the kinds mentioned in the examples above. Any help will be appreciated!
What you want to do is actually pretty difficult. It is a sort of (very specific) semantic labelling task. The possible solutions are:
create your own labelling algorithm, create training data, test, eval and finally tag your data
use an existing knowledge base (lexicon) to extract semantic labels for each target word
The first option is a complex research project in itself. Do it if you have the time and resources.
The second option will only give you the labels that are available in the knowledge base, and these might not match your wishes. I would give it a try with Python, NLTK and WordNet (interface already available); you might be able to use synset hypernyms for your problem.
This task is called the named entity recognition (NER) problem.
EDIT: There is no clean definition of NER in the NLP community, so one can say this is not an NER task but an instance of the more general sequence labeling problem. Anyway, there is still no tool that can do this out of the box.
Out of the box, Stanford NLP can only recognize the following types:
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical
(MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION,
SET) entities
so it is not suitable for solving this task. There are some commercial solutions that can possibly do the job. They can be readily found by googling "product name named entity recognition", and some of them offer free trial plans. I don't know of any free, ready-to-deploy solution.
Of course, you can create your own model by hand-annotating about 1000 or so sentences containing product names and training a classifier such as a Conditional Random Field classifier with some basic features (here is a documentation page that explains how to do that with Stanford NLP). This solution should work reasonably well, though it won't be perfect of course (no system will be perfect, but some solutions are better than others).
EDIT: This is a complex task per se, but not that complex unless you want state-of-the-art results. You can create a reasonably good model in just 2-3 days. Here is an (example) step-by-step instruction on how to do this using an open source tool:
1. Download CRF++ and look at the provided examples; they are in a simple text format.
2. Annotate your data in a similar manner:
a OTHER
glove PRODUCT
comprising OTHER
...
and so on.
3. Split your annotated data into two files: train (80%) and dev (20%).
4. Use the following baseline template features (paste them into the template file):
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
5. Run
crf_learn template train.txt model
crf_test -m model dev.txt > result.txt
6. Look at result.txt. One column will contain your hand-labeled data and the other the machine-predicted labels. You can then compare these, compute accuracy, etc. After that you can feed new unlabeled data into crf_test and get your labels.
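For the "compare these, compute accuracy" part, a small sketch in Java follows. It assumes each non-empty line of result.txt is whitespace-separated with the hand-assigned label in the second-to-last column and the predicted label in the last column, that sentences are separated by blank lines, and that lines starting with "# " (which I believe crf_test prints in verbose mode) can be skipped. The class name TagAccuracy is made up.

```java
import java.util.List;

// Hypothetical helper: token-level accuracy from crf_test-style output lines.
final class TagAccuracy {

    static double accuracy(List<String> lines) {
        int total = 0;
        int correct = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            // Skip sentence separators and (assumed) verbose comment lines.
            if (trimmed.isEmpty() || trimmed.startsWith("# ")) {
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length < 3) {
                continue; // need at least token, gold label, predicted label
            }
            total++;
            String gold = cols[cols.length - 2];
            String predicted = cols[cols.length - 1];
            if (gold.equals(predicted)) {
                correct++;
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        // In-memory sample standing in for the contents of result.txt.
        List<String> sample = List.of(
                "a\tOTHER\tOTHER",
                "glove\tPRODUCT\tPRODUCT",
                "comprising\tOTHER\tPRODUCT",
                "",
                "at\tOTHER\tOTHER");
        System.out.println(accuracy(sample));
    }
}
```

To run it on the real file, read the lines with java.nio.file.Files.readAllLines and pass them in. Since most tokens are OTHER, plain accuracy will look flattering; per-label precision and recall for PRODUCT would be a more honest number.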
As I said, this won't be perfect, but I will be very surprised if it isn't reasonably good (I actually solved a very similar task not long ago), and certainly better than just using a few keywords/templates.
ENDNOTE: this ignores many things and some best-practices in solving such tasks, won't be good for academic research, not 100% guaranteed to work, but still useful for this and many similar problems as relatively quick solution.

Word association search in Apache Lucene

I have a requirement to associate math terms that come under a common topic. For e.g. angles, cos, tan, etc., should relate to trigonometry. So when a user searches for angles, triangles, etc. the search should present results related to trigonometry as well. Can anyone provide leads on how to do this in Apache Lucene?
There is a classification API which includes k-nearest neighbors and naive Bayes models.
You would first use the train() method with your training set. Once the classifier is trained use the assignClass() method to classify a given string.
For a training set you could use Wikipedia pages for your given classes.
After you give those two a try you could make use of the Classifier interface to build a competing model.
If you already know the associations, you can just add them to the index for the specific terms -- i.e. indexing 'cos' as 'cos', 'trigonometry'.
Also if you know the associations, you could index the parent term and all of the sibling terms -- i.e. indexing 'cos' as 'trigonometry', 'cos', 'sin', etc. This sounds more like what you want.
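Here is a small sketch of both variants in plain Java, leaving the actual Lucene indexing out. Given a hand-made topic-to-terms map (the map contents and the class name TopicExpansion are assumptions for illustration), it produces the list of tokens you would index for a term: either the term plus its parent topic, or the term plus its parent topic and all siblings.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: expands a term into the tokens to index for it.
final class TopicExpansion {

    static Set<String> expand(String term,
                              Map<String, List<String>> topicToTerms,
                              boolean includeSiblings) {
        // LinkedHashSet avoids duplicates and keeps a predictable order.
        Set<String> tokens = new LinkedHashSet<>();
        tokens.add(term);
        for (Map.Entry<String, List<String>> topic : topicToTerms.entrySet()) {
            if (topic.getValue().contains(term)) {
                tokens.add(topic.getKey());            // parent topic
                if (includeSiblings) {
                    tokens.addAll(topic.getValue());   // sibling terms
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, List<String>> topics =
                Map.of("trigonometry", List.of("cos", "sin", "tan"));
        System.out.println(expand("cos", topics, false));
        System.out.println(expand("cos", topics, true));
    }
}
```

The resulting tokens would then go into the indexed field for that term, or into a synonym list if you prefer to expand at query time.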
In addition to @Josh S.'s good answer, you can also take a more direct approach of generating your own synonyms dictionary, e.g. see Match a word with similar words using Solr?

Using another index structure in Apache Lucene

I would like to use Lucene to write my own search engine. Because I use spatial information, I would like to try some index structures which are more suitable for spatial data. As far as I know there is no alternative structure available in Lucene itself, and LGTE (a Lucene extension for geo-temporal data) does not seem to let you access other structures either.
Did I just not see other structures or do I have to implement them?
The direct and simple answer to the title of your question, "can you use another index structure", is that you can't, at least not if it would have a different API than Lucene's. In a nutshell, it is fundamentally a sorted mapping of bytes to DocIds, plus optionally postings (position offsets for a document, plus optionally "payloads" (arbitrary bytes for a posting)).
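To picture that shape, here is a toy model in plain Java. It is emphatically not how Lucene is implemented: terms are Strings rather than bytes, everything lives in memory, there are no payloads, and the class name ToyInvertedIndex is made up. It only shows a sorted term dictionary mapping to doc ids and positions, and why the sort order is useful (prefix and range scans).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Toy illustration only: sorted term -> (docId -> token positions).
final class ToyInvertedIndex {

    private final TreeMap<String, TreeMap<Integer, List<Integer>>> terms = new TreeMap<>();

    void add(int docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) {
                continue;
            }
            terms.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                 .computeIfAbsent(docId, d -> new ArrayList<>())
                 .add(pos);
        }
    }

    // Doc ids containing the term, in ascending order.
    List<Integer> docs(String term) {
        TreeMap<Integer, List<Integer>> postings = terms.get(term);
        if (postings == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(postings.keySet());
    }

    // Token positions of the term inside one document.
    List<Integer> positions(String term, int docId) {
        TreeMap<Integer, List<Integer>> postings = terms.get(term);
        if (postings == null || !postings.containsKey(docId)) {
            return new ArrayList<>();
        }
        return new ArrayList<>(postings.get(docId));
    }

    // Because terms are sorted, a prefix lookup is a cheap range scan.
    List<String> termsWithPrefix(String prefix) {
        return new ArrayList<>(
                terms.subMap(prefix, true, prefix + Character.MAX_VALUE, false).keySet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "red fox");
        index.add(2, "red wolf runs");
        System.out.println(index.docs("red"));
        System.out.println(index.positions("wolf", 2));
        System.out.println(index.termsWithPrefix("r"));
    }
}
```

Spatial support built on such a structure typically works by encoding shapes or grid cells as terms, so that spatial predicates turn into term and prefix lookups like the ones above.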
That said, I suppose you could implement a so-called Lucene Codec (new in Lucene 4.x) that has its own extended API and search against a field that assumes your specific Codec. Codecs are envisioned to have different implementations of Lucene's APIs (e.g. balancing what's in memory vs on-disk, when to cleverly compress/encode vs directly represent data) but not to introduce different API as well. But I suppose you could.
The context beyond the title of your question is that you want to do this for spatial/temporal data because, it appears to me, you don't believe that Lucene's index is sufficiently suitable. I strongly disagree. There have been great strides in Lucene spatial recently, and it still has further to go. For example, many non-relational databases will only let you index point data, but Lucene spatial can handle any Spatial4j shape (a dependency), and with JTS (another dependency) that's basically all the typical shapes most people want. And there are scalable recursive algorithms for matching indexed shapes via Intersects, Within, Contains, and Disjoint predicates. I expect some big performance enhancements by end of summer or end of year at the latest. You may find this very recent post I responded to on the Solr-user list interesting:
http://lucene.472066.n3.nabble.com/Multi-dimensional-spatial-search-tt4062515.html#a4062646
So instead I propose that you help me improve the parts of Lucene spatial that need it for whatever system you are building. Perhaps it already fits the bill.

Assign a paper to a reviewer based on keywords

I was wondering if you know any algorithm that can do an automatic assignment for the following situation: I have some papers with some keywords defined, and some reviewers that have some specific keywords defined. How could I do an automatic mapping, so that each reviewer reviews the papers from his/her area of interest?
If you are open to using external tools, Lucene is a library that will allow you to search text based on (from their website)
phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g., title, author, contents)
date-range searching
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
You will basically need to design your own parser, or specialize an existing parser according to your needs. You need to scan the papers and, according to your keywords, search and match your tokens accordingly. Then the sentences with these keywords are to be separated and displayed to the reviewer.
I would suggest the Stanford NLP POS tagger. Every keyword that you would need, will fall under some part-of-speech. You can then just tag your complete document, and search for those tags and accordingly sort out the sentences.
Apache Lucene could be one solution.
It allows you to index documents either in a RAM directory, or within a real directory of your file system, and then to perform full-text searches.
It offers a lot of very interesting features, such as filters and analyzers. You can, for example:
remove the stop words depending on the language of the documents (e.g. for English: a, the, of, etc.);
stem the tokens (e.g. function, functional, functionality, etc., are considered as a single instance);
perform complex queries (e.g. review*, keyw?rds, "to be or not to be", etc.);
and so on and so forth...
You should have a look! Don't hesitate to ask me for some code samples if Lucene is the way you choose :)
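If the keyword lists are already clean, you may not even need a full-text engine to get a first version going. The sketch below is a plain set-overlap (Jaccard) score, which is my own stand-in and not what Lucene computes. It assumes keywords have already been normalized (lower-cased, stemmed, stop words removed), and the names ReviewerMatcher and bestReviewer are made up.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical matcher: assigns a paper to the reviewer with the largest
// keyword overlap, measured by the Jaccard index.
final class ReviewerMatcher {

    // |A intersect B| / |A union B|; 0.0 when both sets are empty.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Returns the best-matching reviewer, or null if nobody shares a keyword.
    static String bestReviewer(Set<String> paperKeywords,
                               Map<String, Set<String>> reviewerKeywords) {
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Set<String>> reviewer : reviewerKeywords.entrySet()) {
            double score = jaccard(paperKeywords, reviewer.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = reviewer.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> reviewers = Map.of(
                "alice", Set.of("search", "ranking", "indexing"),
                "bob", Set.of("databases", "sql"));
        System.out.println(bestReviewer(Set.of("lucene", "search", "ranking"), reviewers));
    }
}
```

This ignores reviewer workload and conflicts of interest. Balancing papers across reviewers turns the task into an assignment problem, which is beyond this sketch.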
