Am I correct to assume that the classification model implementations in scikit-learn and WEKA (e.g. Naive Bayes, Random Forest, etc.) produce the same results (not taking processing time and such into account)?
I am asking because I wrote my pipeline in Python and would like to use scikit-learn for easy integration. Since most related research and previous work in my field have used WEKA and Java, I was wondering if comparing performance to my pipeline is valid and scientifically sound, given I use the same models, settings, etc.
Related
What is the doNotCheckCapabilities property in Weka, used with the Multilayer Perceptron, and what is its influence on the classification result?
" If set, classifier capabilities are not checked before classifier is built (Use with caution to reduce runtime)."
The Weka hint is not enough for me.
Before a classifier is trained, the provided dataset is tested against its capabilities, i.e., the types of data it can handle, the required number of training instances, and so on. Depending on the data (e.g., tens of thousands of attributes), these capability tests can take a long time and are computationally expensive. If you are an expert and you know that your data is already in the right format (or you are currently developing a new algorithm and using a custom dataset for testing), then you can disable this check. In general, it is a good idea to leave the check in place to avoid errors or unexpected behavior further down the track.
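For reference, a minimal sketch of how the flag can be toggled programmatically. This assumes a recent Weka release (3.8.x) where classifiers inherit setDoNotCheckCapabilities from AbstractClassifier, and a hypothetical mydata.arff file; on the command line the equivalent option is typically -do-not-check-capabilities.

```java
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CapabilitiesCheckSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical dataset; replace with your own ARFF/CSV file.
        Instances data = DataSource.read("mydata.arff");
        data.setClassIndex(data.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        // Skip the (potentially expensive) capability check before training.
        // Only do this when you already know the data matches what the classifier can handle.
        mlp.setDoNotCheckCapabilities(true);
        mlp.buildClassifier(data);
    }
}
```

Skipping the check does not change the learned model; it only removes a safety net, so malformed data fails later and with less helpful error messages.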
I have a large S3 bucket full of photos of 4 different types of animals. My foray into ML will be to see if I can successfully get Deep Learning 4 Java (DL4J) to be shown a new arbitrary photo of one of those 4 species and get it to consistently, correctly guess which animal it is.
My understanding is that I must first perform a "training phase" which effectively builds up an (in-memory) neural network that consists of nodes and weights derived from both this S3 bucket (input data) and my own coding and usage of the DL4J library.
Once trained (meaning, once I have an in-memory neural net built up), my understanding is that I can then enter zero or more "testing phases" where I give a single new image as input, let the program decide what type of animal it thinks the image is of, and then manually mark the output as correct (the program guessed right) or incorrect with corrections (the program guessed wrong, and, by the way, such and so was the correct answer). My understanding is that these test phases should help tweak the algorithm and minimize error.
Finally, it is my understanding that the library can then be used in a live "production phase" whereby the program is just responding to images as inputs and making decisions as to what it thinks they are.
All this to ask: is my understanding of ML and DL4J's basic methodology correct, or am I misled in any way?
Training: that's true of any framework. You can also persist the neural network, either with the Java-based SerializationUtils or, in the newer release, with the ModelSerializer.
This is more of an integration question than a "can it do X?" one.
DL4J can integrate with Kafka/Spark Streaming and do online/mini-batch learning.
The neural nets are embeddable in a production environment.
My only tip here is to ensure that you have the same data pipeline for training as well as testing.
This is mainly to ensure consistency between the data you train on and the data you test on.
For mini-batch/online learning, make sure you use minibatch(true) (the default); use minibatch(false) if you are training on the whole dataset at once.
I would also suggest using StandardScaler (https://github.com/deeplearning4j/nd4j/blob/master/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/org/nd4j/linalg/dataset/api/iterator/StandardScaler.java) or something similar to persist global statistics about your data. Much of the data pipeline will depend on the libraries you are using to build it, though.
I would assume you would want to normalize your data in some way though.
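For the persistence part mentioned above, a rough sketch (assuming a DL4J release that ships ModelSerializer; file paths and method names around it are placeholders):

```java
import java.io.File;

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.util.ModelSerializer;

public class PersistenceSketch {

    // Save a trained network so the "production phase" can reload it later.
    static void save(MultiLayerNetwork trainedNet, File file) throws Exception {
        // true = also persist the updater state, useful if you want to resume training later
        ModelSerializer.writeModel(trainedNet, file, true);
    }

    // Reload the network in the serving/production application.
    static MultiLayerNetwork load(File file) throws Exception {
        return ModelSerializer.restoreMultiLayerNetwork(file);
    }
}
```

The same idea applies to the normalization statistics: whatever scaling you fit on the training data should be saved and re-applied unchanged to test and production inputs.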
I'm building a system that does text classification. I'm building the system in Java. As features I'm using the bag-of-words model. However, one problem with such a model is that the number of features is really high, which makes it impossible to fit the data in memory.
However, I came across this tutorial from Scikit-learn which uses specific data structures to solve the issue.
My questions:
1. How do people solve such an issue using Java in general?
2. Is there a solution similar to the one given in scikit-learn?
Edit: the only solution I've found so far is to write my own sparse vector implementation using hash tables.
If you want to build this system in Java, I suggest you use Weka, which is machine learning software similar to sklearn. Here is a simple tutorial about text classification with Weka:
https://weka.wikispaces.com/Text+categorization+with+WEKA
You can download Weka from:
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
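The linked tutorial covers the details; roughly, the Weka approach looks something like the sketch below. It assumes a hypothetical documents.arff file with one string attribute holding the text and a nominal class attribute, and uses StringToWordVector, which produces sparse instances, so the memory issue from the question is largely handled for you.

```java
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class WekaTextClassificationSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF: one string attribute with the raw text, one nominal class.
        Instances raw = DataSource.read("documents.arff");
        raw.setClassIndex(raw.numAttributes() - 1);

        // StringToWordVector turns the string attribute into a sparse bag-of-words representation.
        StringToWordVector bow = new StringToWordVector();
        bow.setInputFormat(raw);
        Instances vectorized = Filter.useFilter(raw, bow);

        // Multinomial naive Bayes is a common baseline for bag-of-words text classification.
        NaiveBayesMultinomial nb = new NaiveBayesMultinomial();
        nb.buildClassifier(vectorized);
    }
}
```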
HashSet/HashMap are the usual way people store bag-of-words vectors in Java: they are naturally sparse representations that grow not with the size of the dictionary but with the size of the document, and the latter is usually much smaller.
If you deal with unusual scenarios, such as very large documents/representations, you can look at the various sparse bitset implementations around; they may be slightly more economical in terms of memory and are used in massive text classification implementations based on Hadoop, for example.
Most NLP frameworks make this decision for you anyway - you need to supply things in the format the framework wants them.
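A minimal sketch of the HashMap-based representation described above (plain Java, no external libraries; the tokenization is deliberately simplistic):

```java
import java.util.HashMap;
import java.util.Map;

public class SparseBagOfWords {

    // A sparse term-frequency vector: only terms that actually occur in the
    // document are stored, so memory grows with document length, not vocabulary size.
    public static Map<String, Integer> vectorize(String document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : document.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(vectorize("The cat sat on the mat"));
        // Prints something like {the=2, cat=1, sat=1, on=1, mat=1}
    }
}
```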
One of the projects I'm currently working on includes a Java Swing application for field users to input data about equipment parts scattered all over the company's facilities.
Most of the data collected is measured (field validation is required) or is some value the operator can choose from a predefined list. The software then does some calculations and displays instructions back to the operator.
So, for example, for part number 3435Af-B, engineering requires the operator to measure the diameter of some bolt and choose the part maker from a list. The application then compares the measured diameter to the stock diameter and tells the operator whether it should be replaced (this is obviously not a real-world example, but you get the idea).
The problem is that there are over 200 known equipment parts, and these are pretty old and heterogeneous, so the engineering team has only a limited idea of what they will want to measure on each part in the future. It doesn't seem reasonable to have the development team write and maintain over 200 different classes, but it also doesn't seem realistic to have the engineers use a complicated system of form builders and BPM like Drools (I'm getting sweaty just thinking about it).
We're currently trying to make a poor man's solution that would allow the engineers to build forms with a simple GUI having limited features and outputting the forms to XML files. The few complicated cases not covered by the GUI solution would be custom-made by the developers (very few cases). For the calculation part, we would use the Java Expression Library (JEL); the expressions to evaluate would be generated by the same GUI (a rough sketch of the idea follows below).
Before we commit to this solution, I was wondering if there is something we could do differently to get a more robust system. I know that some people will consider this too much soft-coding, and I agree that extracting and processing the data is going to be difficult.
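To make the idea concrete, here is a small sketch of what the GUI output and its consumption might look like. The XML format, field names, and values are all hypothetical, and the actual expression evaluation (JEL or otherwise) is deliberately left out; only standard JDK XML parsing is used.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class FormDefinitionSketch {

    // Hypothetical XML produced by the form-builder GUI: one measured field with a
    // validation range, plus a rule expression to be handed to an expression evaluator.
    static final String FORM_XML =
        "<form part=\"3435Af-B\">" +
        "  <field name=\"boltDiameter\" type=\"measurement\" min=\"10.0\" max=\"14.0\"/>" +
        "  <rule expression=\"boltDiameter &lt; stockDiameter * 0.95\" message=\"Replace bolt\"/>" +
        "</form>";

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(FORM_XML.getBytes("UTF-8")));

        Element field = (Element) doc.getElementsByTagName("field").item(0);
        double min = Double.parseDouble(field.getAttribute("min"));
        double max = Double.parseDouble(field.getAttribute("max"));

        // Field validation against the value the operator typed in (hard-coded here).
        double measured = 12.3;
        if (measured < min || measured > max) {
            System.out.println("Value out of range for " + field.getAttribute("name"));
        }

        // In the real system, the <rule> expression would be compiled and evaluated
        // by JEL (or another expression library) against the measured values.
        Element rule = (Element) doc.getElementsByTagName("rule").item(0);
        System.out.println("Rule to evaluate: " + rule.getAttribute("expression"));
    }
}
```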
I have classified a set of documents with Lucene (fields: content, category). Each document has its own category, but some of them are labeled as uncategorized. Is there any way to classify these documents easily in Java?
Classification is a broad problem in the field of Machine Learning/Statistics. After reading your question, it sounds like what you have done so far is more like an SQL GROUP BY clause (though in Lucene). If you want the machine to classify the documents, then you need to know machine learning algorithms like Neural Networks, Bayesian classifiers, SVMs, etc. There are excellent libraries available in Java for these tasks. For this to work you will need features (a set of attributes extracted from the data) on which you can train your algorithm so that it can predict your classification label.
There are some good APIs in Java which allow you to concentrate on code without going too deep into the mathematical theory behind those algorithms (though knowing it would be very advantageous). Weka is good. I also came across a couple of books from Manning which handle these tasks well. Here you go:
Chapter 10 (Classification) of Collective Intelligence in Action: http://www.manning.com/alag/
Chapter 5 (Classification) of Algorithms of Intelligent Web: http://www.manning.com/marmanis/
These are absolutely fantastic materials (for Java people) on classification, particularly suited for people who don't want to dive deep into the theory (though it is very essential :)) and just quickly want working code.
Collective Intelligence in Action has solved the problem of classification using JDM and Weka. So have a look at these two for your tasks.
Yes, you can use similarity queries such as the one implemented by the MoreLikeThisQuery class for this kind of thing (assuming you have some large text field in the documents of your Lucene index). Have a look at the javadoc of the underlying MoreLikeThis class for details on how it works.
To turn your Lucene index into a text classifier you have two options:
For any new text to classify, query for the top 10 or 50 most similar documents that have at least one category, sum the category occurrences among those "neighbors", and pick the top 3 most frequent categories among those similar documents (for instance); see the sketch further below.
Alternatively, you can index a new set of aggregate documents, one for each category, by concatenating (all or a sample of) the text of the documents in that category. Then run the similarity query with your input text directly against those "fake" documents.
The first strategy is known in machine learning as k-Nearest Neighbors classification. The second is a hack :)
If you have many categories (say more than 1000) the second option might be better (faster to classify). I have not run any clean performance evaluation though.
You might also find this blog post interesting.
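A rough sketch of the first (k-NN) strategy, assuming a Lucene 5.x+ index with a stored "category" field and a "content" field with term vectors; note that the exact MoreLikeThis.like(...) signature varies between Lucene versions:

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class KnnCategorizer {

    // Classify a piece of text by majority vote over its k most similar indexed documents.
    public static String classify(IndexReader reader, String text, int k) throws Exception {
        IndexSearcher searcher = new IndexSearcher(reader);

        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setAnalyzer(new StandardAnalyzer());
        mlt.setFieldNames(new String[] {"content"});
        mlt.setMinTermFreq(1);
        mlt.setMinDocFreq(1);

        // Build a similarity query from the new text (signature differs across Lucene versions).
        Query query = mlt.like("content", new StringReader(text));

        // Vote over the categories of the k nearest neighbors.
        TopDocs neighbors = searcher.search(query, k);
        Map<String, Integer> votes = new HashMap<>();
        for (ScoreDoc hit : neighbors.scoreDocs) {
            String category = searcher.doc(hit.doc).get("category");
            if (category != null && !category.equals("uncategorized")) {
                votes.merge(category, 1, Integer::sum);
            }
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("uncategorized");
    }
}
```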
If you want to use Solr, you need to enable the MoreLikeThisHandler and set termVectors=true on the content field.
The sunburnt Solr client for Python is able to perform MLT queries. Here is a prototype Python classifier that uses Solr for classification with an index of Wikipedia categories:
https://github.com/ogrisel/pignlproc/blob/master/examples/topic-corpus/categorize.py
As of Lucene 5.2.1, you can use indexed documents to classify new documents. Out of the box, Lucene offers a naive Bayes classifier, a k-Nearest Neighbor classifier (based on the MoreLikeThis class) and a Perceptron based classifier.
The drawback is that all of these classes are marked as experimental and are documented mostly with links to Wikipedia.
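A rough sketch of using the built-in naive Bayes classifier against an existing index. This assumes a Lucene 5.x-era classification module; the constructor arguments have shifted between releases, so treat the signatures and the single-segment simplification below as approximate.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class LuceneClassifierSketch {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader =
                DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));

        // Simplification: use the first segment only; a real setup would wrap or
        // merge the reader so that all segments contribute training documents.
        LeafReader leafReader = reader.leaves().get(0).reader();

        // Train on the already-categorized documents: "category" is the class field,
        // "content" the text field; null means no filtering query.
        SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier(
                leafReader, new StandardAnalyzer(), null, "category", "content");

        ClassificationResult<BytesRef> result =
                classifier.assignClass("text of an uncategorized document");
        System.out.println(result.getAssignedClass().utf8ToString());

        reader.close();
    }
}
```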