Three arff files majority voting - java

I have three different ARFF files that contain different classification information about the same instances, so that each line of each ARFF file concerns the same instance but contains different information about it. I would like to build a new classifier that takes a majority vote over three classifiers, one applied to each ARFF data file, with cross-validation.
Any clue or hint is highly appreciated...

This is a very basic proposition, assuming you would use Java to train and evaluate your ensemble:
Prepare each of the three datasets according to their attribute requirements, or, if possible, use the same dataset for all three models with an attribute filter for each classifier (I have never tried this)
Train each of the three classifiers using the required training/attribute data
Code the majority-vote rules at the end of your evaluation process
Evaluate your model on your testing set.
There may be other ways to do this (such as using the Vote combiner with AttributeSelectedClassifier), but doing it in code may give you more control and flexibility over what you are trying to combine.
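To make steps 1-3 concrete, here is a minimal sketch in Java with Weka, assuming each ARFF file has the class attribute last and using J48 as an arbitrary placeholder learner (the view*.arff file names are hypothetical). It scores the vote on the training data for brevity; for a real evaluation you would hold out the same row indices across all three files:

import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MajorityVoteSketch {
    public static void main(String[] args) throws Exception {
        // Load the three ARFF files; row i of each file describes the same instance.
        Instances[] views = new Instances[3];
        for (int i = 0; i < 3; i++) {
            views[i] = DataSource.read("view" + i + ".arff"); // hypothetical names
            views[i].setClassIndex(views[i].numAttributes() - 1);
        }
        // Train one classifier per view.
        Classifier[] models = new Classifier[3];
        for (int i = 0; i < 3; i++) {
            models[i] = new J48();
            models[i].buildClassifier(views[i]);
        }
        // Majority vote over the three per-view predictions for each row.
        int correct = 0;
        for (int row = 0; row < views[0].numInstances(); row++) {
            int[] votes = new int[views[0].numClasses()];
            for (int i = 0; i < 3; i++) {
                votes[(int) models[i].classifyInstance(views[i].instance(row))]++;
            }
            int winner = 0;
            for (int c = 1; c < votes.length; c++) {
                if (votes[c] > votes[winner]) winner = c;
            }
            if (winner == (int) views[0].instance(row).classValue()) correct++;
        }
        System.out.println("Vote accuracy: " + (double) correct / views[0].numInstances());
    }
}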

WEKA - filtering out classes in a MultiClassClassifier

I have trained a MultiClassClassifier (tested, working) and saved it somewhere on my hard drive. Now I want to make predictions for a new sample. I load my application and my classifier loads with it. I have already narrowed the search down to five possible classes for the sample, outside the classification process. This means I know the remaining classes that can safely be excluded from the classification.
Is it possible to filter a MultiClassClassifier (filter out all unwanted classes) before using it?
If so, what Weka method should I use for this purpose? If not, is there an alternative solution?
I want to increase the accuracy of the classifier by narrowing its focus down to 5 classes out of n.
I've found how to filter Instances objects, but I can't seem to find a corresponding method for the MultiClassClassifier.
The data I have to work with are my testing Instances and my MultiClassClassifier.
Thank you in advance.
There isn't really a way to modify an existing MultiClassClassifier to exclude classes. However, depending on the underlying classifier you're using, you could try distributionForInstance, which outputs a vector of confidence scores, one per class. You could then take the class with the highest score, ignoring the scores of the classes outside your target set.
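A minimal sketch of that idea; the predictAmong helper and the allowed array are illustrative names, not Weka API:

import weka.classifiers.Classifier;
import weka.core.Instance;

public class RestrictedPrediction {
    // Pick the most confident class among an allowed subset of class indices
    // (e.g. the 5 candidate classes known before classification).
    static int predictAmong(Classifier model, Instance inst, int[] allowed) throws Exception {
        double[] dist = model.distributionForInstance(inst); // one confidence score per class
        int best = allowed[0];
        for (int c : allowed) {
            if (dist[c] > dist[best]) best = c;
        }
        return best; // class index, restricted to the allowed set
    }
}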

JSAT: Data wrangling / manipulating

After building a prototype in R (using dplyr), I need to build a model that is deployable to our Java-based server infrastructure. Right now, I'm using the JSAT machine-learning library.
What is the best way to wrangle data?
None of the collection-like types from the JSAT package (ClassificationDataSet, RegressionDataSet, DataSet) seem to support even basic tasks like:
Filtering out datapoints based on conditions
Splitting the dataset into two (different sized) datasets, e.g. training and testing dataset
Mutating or adding new rows based on the values of other rows
1) This isn't currently supported in JSAT; JSAT is a source of machine-learning algorithms, and dataframe-like operations are not a goal of the project in any way. I'm not sure why you would want to filter out data in a production system; there is no reason you couldn't do that in a better-suited tool and then export the data for JSAT to build the model.
2) All DataSet objects inherit a randomSplit method that can do what you have asked for. An example of that is here.
3) See 1; I'm not sure what the use case is for adding "new rows based on the values of other rows". All the different DataSet classes support adding new data points, you just have to create them yourself.
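For point 2, a hedged sketch of the randomSplit call; the exact signature here is my reading of the JSAT javadoc, so check it against the version you use:

import java.util.List;
import java.util.Random;
import jsat.classifiers.ClassificationDataSet;

public class SplitSketch {
    // 80/20 train/test split via the randomSplit method inherited from DataSet.
    static List<ClassificationDataSet> split(ClassificationDataSet data) {
        return data.randomSplit(new Random(42), 0.8, 0.2); // fractions, returned in order
    }
}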
source: I'm the author of JSAT

Document classification using libsvm in java

I am using the libsvm library for document classification of resumes. I have multiple resumes and need to classify them. Do I need multilabel or multiclass classification in this case? Which option should I consider, and could you also suggest a way to do it?
Your requirement is not straightforward. To build such a system you need several steps, for example:
You need a data set of different types of documents (various types of resumes)
Then you need to identify what kind of features can be used to separate them (how are you going to distinguish them, and based on what: resume length, word count, content of the resume header, etc.)
Then you need to prepare sets of feature vectors to train the SVM. (If you only need to classify resumes as relevant or irrelevant, this is two classes; if there are more than two classes, this is multi-class, and LibSVM supports multi-class)
When training, you should perform scaling and cross-validation to increase the accuracy (read here)
You need to complete the above steps to make successful predictions.
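To make the training step concrete, here is a hedged sketch using the libsvm Java bindings with toy two-dimensional feature vectors; real resume features would replace them, and the parameter values are placeholders to be tuned with scaling and cross-validation as noted above:

import libsvm.*;

public class ResumeSvmSketch {
    public static void main(String[] args) {
        // Toy, already-scaled feature vectors (e.g. resume length, word count).
        double[][] features = { {0.1, 0.9}, {0.8, 0.2}, {0.2, 0.7}, {0.9, 0.1} };
        double[] labels = { 0, 1, 0, 1 }; // one class per resume; >2 values = multi-class

        svm_problem prob = new svm_problem();
        prob.l = features.length;
        prob.y = labels;
        prob.x = new svm_node[prob.l][];
        for (int i = 0; i < prob.l; i++) {
            prob.x[i] = new svm_node[features[i].length];
            for (int j = 0; j < features[i].length; j++) {
                svm_node node = new svm_node();
                node.index = j + 1; // libsvm feature indices are 1-based
                node.value = features[i][j];
                prob.x[i][j] = node;
            }
        }

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.RBF;
        param.C = 1;       // placeholder; tune via cross-validation
        param.gamma = 0.5; // placeholder; tune via cross-validation
        param.cache_size = 100;
        param.eps = 1e-3;

        svm_model model = svm.svm_train(prob, param);
        System.out.println("Predicted class: " + svm.svm_predict(model, prob.x[0]));
    }
}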

Identify an English word as a thing or product?

Write a program with the following objective -
be able to identify whether a word/phrase represents a thing/product. For example:
1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <- be able to identify glove as a thing/product.
2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- be able to identify regulator as a thing.
Doing this tells me that the text is talking about a thing/product. As a contrast, the following text talks about a process instead of a thing/product: "An extrusion coating process for the production of flexible packaging films of nylon coated substrates consisting of the steps of..."
I have millions of such texts, so doing this manually is not feasible. So far, with the help of NLTK + Python, I have been able to identify some specific cases that use very similar keywords, but I have not been able to do the same with the kinds mentioned in the examples above. Any help will be appreciated!
What you want to do is actually pretty difficult. It is a sort of (very specific) semantic labelling task. The possible solutions are:
create your own labelling algorithm, create training data, test, evaluate and finally tag your data
use an existing knowledge base (lexicon) to extract semantic labels for each target word
The first option is a complex research project in itself. Do it if you have the time and resources.
The second option will only give you the labels that are available in the knowledge base, and these might not match your wishes. I would give it a try with Python, NLTK and WordNet (an interface is already available); you might be able to use synset hypernyms for your problem.
This task is called the named entity recognition (NER) problem.
EDIT: There is no clean definition of NER in the NLP community, so one could say this is not an NER task but an instance of the more general sequence-labeling problem. Either way, there is still no tool that can do this out of the box.
Out of the box, Stanford NLP can only recognize the following types:
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical
(MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION,
SET) entities
so it is not suitable for solving this task. There are some commercial solutions that can possibly do the job; they can be readily found by googling "product name named entity recognition", and some of them offer free trial plans. I don't know of any free, ready-to-deploy solution.
Of course, you can create your own model by hand-annotating about 1000 or so sentences containing product names and training a classifier such as a Conditional Random Field with some basic features (here is a documentation page that explains how to do that with Stanford NLP). This solution should work reasonably well, though it won't be perfect of course (no system is perfect, but some solutions are better than others).
EDIT: This is a complex task per se, but not that complex unless you want state-of-the-art results. You can create a reasonably good model in just 2-3 days. Here are example step-by-step instructions for doing this with an open-source tool:
Download CRF++ and look at the provided examples; they are in a simple text format
Annotate your data in a similar manner:
a OTHER
glove PRODUCT
comprising OTHER
...
and so on.
Split your annotated data into two files: train (80%) and dev (20%)
Use the following baseline template features (paste them into the template file):
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
Run:
crf_learn template train.txt model
crf_test -m model dev.txt > result.txt
Look at result.txt: one column will contain your hand-labeled data and the other the machine-predicted labels. You can then compare them, compute accuracy, etc. After that, you can feed new unlabeled data into crf_test and get your labels.
As I said, this won't be perfect, but I would be very surprised if it weren't reasonably good (I actually solved a very similar task not long ago), and it will certainly be better than just using a few keywords/templates.
ENDNOTE: this ignores many things and some best practices in solving such tasks; it won't be good for academic research and is not 100% guaranteed to work, but it is still useful for this and many similar problems as a relatively quick solution.

How to classify documents indexed with lucene

I have classified a set of documents with Lucene (fields: content, category). Each document has its own category, but some of them are labeled as uncategorized. Is there any way to classify these documents easily in Java?
Classification is a broad problem in the field of machine learning/statistics. After reading your question, it seems what you have done is akin to a SQL GROUP BY clause (though in Lucene). If you want the machine to classify the documents, then you need machine-learning algorithms such as neural networks, Bayesian classifiers, SVMs, etc. There are excellent libraries available in Java for these tasks. For this to work you will need features (a set of attributes extracted from the data) on which you can train your algorithm so that it can predict your classification label.
There are some good APIs in Java (which allow you to concentrate on code without going too deeply into the mathematical theory behind the algorithms, though knowing it would be very advantageous). Weka is good. I also came across a couple of books from Manning that handle these tasks well. Here you go:
Chapter 10 (Classification) of Collective Intelligence in Action: http://www.manning.com/alag/
Chapter 5 (Classification) of Algorithms of Intelligent Web: http://www.manning.com/marmanis/
These are absolutely fantastic materials (for Java people) on classification, particularly suited for people who don't want to dive into the theory (though it is very essential :)) and just quickly want working code.
Collective Intelligence in Action solves the classification problem using JDM and Weka, so have a look at those two for your tasks.
Yes, you can use similarity queries such as those implemented by the MoreLikeThisQuery class for this kind of thing (assuming you have some large text field in the documents of your Lucene index). Have a look at the javadoc of the underlying MoreLikeThis class for details on how it works.
To turn your Lucene index into a text classifier you have two options:
For any new text to classify, query for the top 10 or 50 most similar documents that have at least one category, sum the category occurrences among those "neighbors", and pick the top 3 most frequent categories among those similar documents (for instance).
Alternatively, you can index a new set of aggregate documents, one per category, by concatenating (all or a sample of) the text of the documents in that category. Then run the similarity query with your input text directly against those "fake" documents.
The first strategy is known in machine learning as k-Nearest Neighbors classification. The second is a hack :)
If you have many categories (say more than 1000) the second option might be better (faster to classify). I have not run any clean performance evaluation though.
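For reference, a sketch of the first (k-nearest-neighbors) strategy, assuming a Lucene 5.x-era MoreLikeThis API and the content/category fields from the question; the "index" path is hypothetical:

import java.io.StringReader;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class MltKnnClassifier {
    // Classify a text by majority category among its k most similar indexed docs.
    public static String classify(String text, int k) throws Exception {
        IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setAnalyzer(new StandardAnalyzer());
        mlt.setFieldNames(new String[] { "content" });

        Query query = mlt.like("content", new StringReader(text));
        Map<String, Integer> votes = new HashMap<>();
        for (ScoreDoc hit : searcher.search(query, k).scoreDocs) {
            String category = searcher.doc(hit.doc).get("category");
            if (category != null) { // skip uncategorized neighbors
                votes.merge(category, 1, Integer::sum);
            }
        }
        reader.close();
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null); // null if no categorized neighbor was found
    }
}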
You might also find this blog post interesting.
If you want to use Solr, you need to enable the MoreLikeThisHandler and set termVectors=true on the content field.
The sunburnt Solr client for Python is able to perform MLT queries. Here is a prototype Python classifier that uses Solr for classification with an index of Wikipedia categories:
https://github.com/ogrisel/pignlproc/blob/master/examples/topic-corpus/categorize.py
As of Lucene 5.2.1, you can use indexed documents to classify new documents. Out of the box, Lucene offers a naive Bayes classifier, a k-nearest-neighbor classifier (based on the MoreLikeThis class) and a perceptron-based classifier.
The drawback is that all of these classes are marked as experimental and are documented only with links to Wikipedia.
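For instance, a minimal sketch of the naive Bayes classifier from the 5.x classification module; the field names match the question, the "index" path is hypothetical, and SlowCompositeReaderWrapper existed in 5.x but was removed in later versions:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class NaiveBayesSketch {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")));
        LeafReader leaf = SlowCompositeReaderWrapper.wrap(reader); // 5.x-only helper

        // Train on the categorized documents already in the index.
        SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
        classifier.train(leaf, "content", "category", new StandardAnalyzer());

        ClassificationResult<BytesRef> result =
                classifier.assignClass("text of an uncategorized document");
        System.out.println("Predicted category: " + result.getAssignedClass().utf8ToString());
        reader.close();
    }
}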
