Fitting the training dataset for text classification in Java

I'm building a text classification system in Java. As features I'm using the bag-of-words model. However, one problem with such a model is that the number of features is very high, which makes it impossible to fit the data in memory.
I then came across this tutorial from scikit-learn, which uses specific data structures to solve the issue.
My questions:
1. How do people generally solve this issue in Java?
2. Is there a solution similar to the one in scikit-learn?
Edit: the only solution I've found so far is to write my own sparse vector implementation backed by hash tables.
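For reference, a minimal sketch of the kind of hash-table-backed sparse vector I have in mind (the class and method names are just illustrative):

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sparse vector backed by a HashMap: only non-zero entries are stored. */
public class SparseVector {
    private final Map<Integer, Double> entries = new HashMap<>();

    public void set(int index, double value) {
        if (value == 0.0) {
            entries.remove(index);      // keep the map truly sparse
        } else {
            entries.put(index, value);
        }
    }

    public double get(int index) {
        return entries.getOrDefault(index, 0.0);
    }

    /** Dot product: iterate over the smaller map only. */
    public double dot(SparseVector other) {
        SparseVector small = entries.size() <= other.entries.size() ? this : other;
        SparseVector large = (small == this) ? other : this;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.entries.entrySet()) {
            sum += e.getValue() * large.get(e.getKey());
        }
        return sum;
    }
}
```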

If you want to build this system in Java, I suggest you use Weka, a machine learning toolkit similar to scikit-learn. Here is a simple tutorial about text classification with Weka:
https://weka.wikispaces.com/Text+categorization+with+WEKA
You can download Weka from:
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
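As a rough sketch of what the Weka route can look like (assuming a training file documents.arff with a string attribute holding the text and a nominal class attribute as the last attribute; the file name is made up):

```java
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class WekaTextClassification {
    public static void main(String[] args) throws Exception {
        // Load a dataset with a string attribute holding the raw text
        Instances data = DataSource.read("documents.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // Turn the string attribute into a (sparse) bag-of-words representation
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(data);
        Instances vectorized = Filter.useFilter(data, filter);

        // Train a simple classifier on the vectorized data
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(vectorized);
        System.out.println(nb);
    }
}
```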

HashSet/HashMap are the usual way people store bag-of-words vectors in Java. They are naturally sparse representations that grow not with the size of the dictionary but with the size of the document, and the latter is usually much smaller.
If you deal with unusual scenarios, such as very large documents/representations, you can look for one of the sparse bitset implementations around; they may be slightly more economical in terms of memory and are used, for example, in massive text classification implementations based on Hadoop.
Most NLP frameworks make this decision for you anyway: you need to supply the data in the format the framework expects.
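For illustration, a minimal sketch of such a per-document term-frequency map; its size is bounded by the number of distinct tokens in the document, not by the dictionary size (the tokenization here is deliberately naive):

```java
import java.util.HashMap;
import java.util.Map;

public class BagOfWords {
    /** Build a term-frequency map for a single document. */
    public static Map<String, Integer> termFrequencies(String document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : document.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum); // increment the count for this token
            }
        }
        return counts;
    }
}
```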

Related

Scikit-learn vs. WEKA classification model implementation

Am I correct to assume that the classification models implementations in scikit-learn and WEKA (e.g. Naive Bayes, Random Forest etc.) produce the same results (not taking processing time and such into account)?
I am asking because I wrote my pipeline in Python and would like to use scikit-learn for easy integration. Since most related research and previous work in my field have used WEKA and Java, I was wondering whether comparing performance to my pipeline is valid and scientifically sound, given that I use the same models, settings, etc.

Does anyone know how to extract a TensorFlow DNNRegressor model and evaluate it manually?

I am trying to use a DNNRegressor model in a Java real-time context; unfortunately this requires a garbage-free implementation. It doesn't look like TensorFlow Lite offers a GC-free implementation. The path of least resistance would be to extract the weights and re-implement the NN manually. Has anyone tried extracting the weights from a regression model and implementing the regression manually, and if so, could you describe any pitfalls?
Thanks!
I am not quite sure if your conclusion
The path of least resistance would be to extract the weights and re-implement the NN manually.
is actually true. It sounds to me like you want to use the trained model in an Android mobile application. I personally do not know much about that, but I am sure there are efficient ways to do exactly that.
However, assuming you actually need to extract the weights, there are multiple ways to do this.
One straightforward way to do this is to implement the exact network you want yourself with TensorFlow's low-level API instead of using the canned DNNRegressor class (which is deprecated, by the way). That might sound unnecessarily complex, but it is actually quite easy and has the upside of giving you full control.
A general way to get all trainable variables is to use TensorFlow's trainable_variables method.
Or maybe this might help you.
In terms of pitfalls, I don't really believe there are any. At the end of the day you are just storing a bunch of floats. You should probably make sure to use an appropriate file format like HDF5 and sufficient float precision.
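If you do go the manual route, the re-implementation itself is small. Below is a rough Java sketch of a single fully connected layer forward pass with pre-allocated buffers, so that no allocation (and hence no GC) happens per inference call. The weight values would have to come from whatever export you do on the Python side; all names here are illustrative.

```java
/** Minimal dense-layer forward pass with pre-allocated buffers (no per-call allocation). */
public class DenseLayer {
    private final float[][] weights; // [outputs][inputs], exported from the trained model
    private final float[] biases;    // [outputs]
    private final float[] output;    // reused on every call to stay garbage-free

    public DenseLayer(float[][] weights, float[] biases) {
        this.weights = weights;
        this.biases = biases;
        this.output = new float[biases.length];
    }

    /** y = relu(W * x + b); returns the internal buffer, so copy it if you need to keep it. */
    public float[] forward(float[] input) {
        for (int o = 0; o < output.length; o++) {
            float sum = biases[o];
            for (int i = 0; i < input.length; i++) {
                sum += weights[o][i] * input[i];
            }
            output[o] = Math.max(0f, sum); // ReLU; use identity for the final regression layer
        }
        return output;
    }
}
```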

LSH Libraries in Java

I'm looking for a lightweight Java library that supports nearest neighbor searches via Locality Sensitive Hashing for nearly equally distributed data in a high-dimensional (in my case 32) dataset with some hundreds of thousands of data points.
It's good enough to retrieve all entries in a bucket for a query; the ones I actually need can then be processed separately, taking into account some filter parameters specific to my problem.
I already found likelike, but I hope there is something a bit smaller that doesn't need any other tools (like Apache Hadoop in the case of likelike).
Maybe this one:
"TarsosLSH is a Java library implementing Locality-sensitive Hashing (LSH), a practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time. It supports several Locality Sensitive Hashing (LSH) families: the Euclidean hash family (L2), city block hash family (L1) and cosine hash family. The library tries to hit the sweet spot between being capable enough to get real tasks done, and compact enough to serve as a demonstration on how LSH works."
Code can be found here
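If it helps to see what the cosine hash family mentioned above boils down to, here is a rough, library-independent sketch of random-hyperplane bucketing; it only illustrates the idea and is not the TarsosLSH API:

```java
import java.util.Random;

/** Toy random-hyperplane LSH: vectors with small cosine distance tend to share a bucket key. */
public class CosineLsh {
    private final double[][] hyperplanes; // [numBits][dimensions]

    public CosineLsh(int numBits, int dimensions, long seed) {
        Random rnd = new Random(seed);
        hyperplanes = new double[numBits][dimensions];
        for (int b = 0; b < numBits; b++) {
            for (int d = 0; d < dimensions; d++) {
                hyperplanes[b][d] = rnd.nextGaussian();
            }
        }
    }

    /** Each bit records on which side of a random hyperplane the vector falls. */
    public int bucket(double[] vector) {
        int key = 0;
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0.0;
            for (int d = 0; d < vector.length; d++) {
                dot += hyperplanes[b][d] * vector[d];
            }
            if (dot >= 0) {
                key |= (1 << b);
            }
        }
        return key;
    }
}
```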
Apache Spark has an LSH implementation: https://spark.apache.org/docs/2.1.0/ml-features.html#locality-sensitive-hashing (API).
After having played with both the tdebatty and TarsosLSH implementations, I'll likely use Spark, as it supports sparse vectors as input. The tdebatty implementation requires a non-sparse array of booleans or ints, and the TarsosLSH Vector implementation is a non-sparse array of doubles. This severely limits the number of dimensions one can reasonably support.
This page provides links to more projects, as well as related papers and information: https://janzhou.org/lsh/.
There is this one:
http://code.google.com/p/lsh-clustering/
I haven't had time to test it but at least it compiles.
Here is another one:
https://github.com/allenlsy/knn
It uses LSH for KNN. I'm currently investigating its usability =)
The ELKI data mining framework comes with an LSH index. It can be used with most of the included algorithms (anything that uses range or nearest-neighbor searches) and sometimes works very well.
In other cases, LSH doesn't seem to be a good approach. It can be quite tricky to get the LSH parameters right: if you choose them too high, the runtime grows a lot (all the way to a linear scan); if you choose them too low, the index becomes too approximate and loses too many neighbors.
That is probably the biggest challenge with LSH: finding good parameters that yield the desired speedup while still getting good enough accuracy out of the index...

Using Java, what's the simplest method of writing a file to disk in a format that is easily readable by other applications

I've been asked to "write a file to disk in a format that is easily readable by other applications. There is no requirement for the file to be human readable". The data to be written to file is a combination of integer, string and date variables.
I can't quite figure out what the aim of this question is and what the correct answer should be.
What are the core considerations to be made in order to write a file to disk in a format that is easily readable by other applications (using the simplest possible method)?
No this is not homework.
This is a pretty vague requirement. If the other applications are also written in Java, then Java serialization may be the best approach.
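If Java serialization is acceptable, a minimal sketch (class and file names are made up) could look like this:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Date;

/** Simple class holding the three kinds of data mentioned in the question. */
class Record implements Serializable {
    private static final long serialVersionUID = 1L;
    int id;
    String name;
    Date created;
}

public class SerializationExample {
    public static void main(String[] args) throws Exception {
        Record r = new Record();
        r.id = 42;
        r.name = "example";
        r.created = new Date();

        // Write the object graph to disk
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("record.ser"))) {
            out.writeObject(r);
        }
        // Read it back (only straightforward for other Java applications)
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("record.ser"))) {
            Record back = (Record) in.readObject();
            System.out.println(back.id + " " + back.name + " " + back.created);
        }
    }
}
```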
EDIT (in answer to #leftbrain's comment): I think I would lean toward XML; it was designed to support basic interoperability among applications. The three kinds of data mentioned (integer, string, and date) can generally be represented exactly with no special tricks, and there is good support for XML processing across most programming languages. However, each data type (in the abstract) can present challenges. I would have asked the following (a sketch of what the XML itself might look like follows the list of questions below):
What range of integer values need to be supported?
What assumptions (if any) can be made about the string data? Does the full Unicode character set need to be supported?
What range of dates, and what calendar systems, need to be supported?
Is there a well-defined structure to the data?
Is the performance (in terms of time and/or memory) of the write or read operation an issue? What does "easily readable" mean?
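To make the XML option concrete, here is a minimal sketch using the standard StAX API; the element names and values are purely illustrative:

```java
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class XmlWriteExample {
    public static void main(String[] args) throws Exception {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream("record.xml"), StandardCharsets.UTF_8)) {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(writer);
            xml.writeStartDocument("UTF-8", "1.0");
            xml.writeStartElement("record");

            xml.writeStartElement("id");
            xml.writeCharacters(Integer.toString(42));      // integer field
            xml.writeEndElement();

            xml.writeStartElement("name");
            xml.writeCharacters("example");                 // string field
            xml.writeEndElement();

            xml.writeStartElement("created");
            xml.writeCharacters(LocalDate.now().toString()); // date field, ISO-8601
            xml.writeEndElement();

            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        }
    }
}
```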
What the interviewer is looking for is to find out what experience you have in this area and what you might consider when implementing such a solution. It's an open-ended question with no single right answer; it rather depends on the requirements.
Some suggestions:
Use serialization of a data structure. Here are a few benchmarks: http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
Use an SQL or NoSQL database. Here are some NoSQL databases. http://nosql-database.org/
Write the data to disk using DataOutputStream (IO), heap or direct ByteBuffers, or memory-mapped files. This can work well for simple cases like the one suggested in the question. As your requirements get more complicated, you might consider other options.
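For example, a minimal sketch of the DataOutputStream option (the field order and file name are illustrative; the reading application must of course know the same layout):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Date;

public class BinaryRecordExample {
    public static void main(String[] args) throws Exception {
        // Write: a fixed, documented field order is what makes the file "easily readable"
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream("record.bin"))) {
            out.writeInt(42);                     // integer field
            out.writeUTF("example");              // string field (length-prefixed modified UTF-8)
            out.writeLong(new Date().getTime());  // date field as epoch milliseconds
        }

        // Read back in exactly the same order
        try (DataInputStream in = new DataInputStream(new FileInputStream("record.bin"))) {
            int id = in.readInt();
            String name = in.readUTF();
            Date created = new Date(in.readLong());
            System.out.println(id + " " + name + " " + created);
        }
    }
}
```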
If you need to support multiple languages, you can use XML.

How to classify documents indexed with Lucene

I have classified a set of documents with Lucene (fields: content, category). Each document has its own category, but some of them are labeled as uncategorized. Is there any way to classify these documents easily in Java?
Classification is a broad problem in the field of Machine Learning/Statistics. After reading your question, it sounds like what you have done so far is more akin to an SQL GROUP BY clause (though in Lucene). If you want the machine to classify the documents, then you need to know machine learning algorithms such as neural networks, Bayesian classifiers, SVMs, etc. There are excellent libraries available in Java for these tasks. For this to work you will need features (a set of attributes extracted from the data) on which you can train your algorithm so that it may predict your classification label.
There are some good APIs in Java (which allow you to concentrate on code without going too deep into the mathematical theory behind those algorithms, though knowing it would be very advantageous). Weka is good. I also came across a couple of books from Manning which handle these tasks well. Here you go:
Chapter 10 (Classification) of Collective Intelligence in Action: http://www.manning.com/alag/
Chapter 5 (Classification) of Algorithms of Intelligent Web: http://www.manning.com/marmanis/
These are absolutely fantastic materials (for Java people) on classification, particularly suited for people who don't want to dive into the theory (though it is very essential :)) and just quickly want working code.
Collective Intelligence in Action solves the problem of classification using JDM and Weka. So have a look at these two for your task.
Yes, you can use similarity queries such as those implemented by the MoreLikeThisQuery class for this kind of thing (assuming you have some large text field in the documents of your Lucene index). Have a look at the javadoc of the underlying MoreLikeThis class for details on how it works.
To turn your Lucene index into a text classifier you have two options:
For any new text to classify, query for the top 10 or 50 most similar documents that have at least one category, sum the category occurrences among those "neighbors", and pick the top 3 most frequent categories among those similar documents (for instance).
Alternatively you can index a new set of aggregate documents, one for each category, by concatenating (all or a sample of) the text of the documents of that category. Then run a similarity query with your input text directly on those "fake" documents.
The first strategy is known in machine learning as k-Nearest Neighbors classification. The second is a hack :)
If you have many categories (say more than 1000) the second option might be better (faster to classify). I have not run any clean performance evaluation though.
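To make the first (k-NN) strategy concrete, here is a rough sketch against the MoreLikeThis API, assuming a reasonably recent Lucene version (where MoreLikeThis.like(String, Reader...) exists) and the "content"/"category" fields from the question; treat it as an illustration rather than production code:

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;

public class MltCategorizer {
    /** Returns the most frequent category among the k most similar categorized documents. */
    public static String categorize(Directory dir, String newText, int k) throws Exception {
        try (IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Analyzer analyzer = new StandardAnalyzer();

            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setAnalyzer(analyzer);
            mlt.setFieldNames(new String[] { "content" });
            mlt.setMinTermFreq(1);
            mlt.setMinDocFreq(1);

            // Build a similarity query from the new, uncategorized text
            Query query = mlt.like("content", new StringReader(newText));

            // Vote among the categories of the k nearest neighbors
            Map<String, Integer> votes = new HashMap<>();
            for (ScoreDoc sd : searcher.search(query, k).scoreDocs) {
                Document doc = searcher.doc(sd.doc);
                String category = doc.get("category");
                if (category != null && !category.isEmpty()) {
                    votes.merge(category, 1, Integer::sum);
                }
            }
            return votes.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse("uncategorized");
        }
    }
}
```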
You might also find this blog post interesting.
If you want to use Solr, you need to enable the MoreLikeThisHandler and set termVectors=true on the content field.
The sunburnt Solr client for Python is able to perform MoreLikeThis queries. Here is a prototype Python classifier that uses Solr for classification against an index of Wikipedia categories:
https://github.com/ogrisel/pignlproc/blob/master/examples/topic-corpus/categorize.py
As of Lucene 5.2.1, you can use indexed documents to classify new documents. Out of the box, Lucene offers a naive Bayes classifier, a k-Nearest Neighbor classifier (based on the MoreLikeThis class) and a Perceptron based classifier.
The drawback is that all of these classes are marked with experimental warnings and documented with links to Wikipedia.
