I have used the Weka GUI to build a Naive Bayes classifier, and I saved the model by following this tutorial. Now I want to load this model from Java code, but I am unable to find any way to load a saved model using Weka.
My requirement is that I have to build the model separately and then use it in a separate program.
If anyone can guide me in this regard I will be thankful to you.
You can easily load a saved model in Java using this command:
Classifier myCls = (Classifier) weka.core.SerializationHelper.read(pathToModel);
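For reference, here is a minimal, self-contained sketch of loading a saved model and classifying new instances with it (the file names are placeholders you would replace):

import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadModelExample {
    public static void main(String[] args) throws Exception {
        // Deserialize the model that was saved from the Weka GUI
        Classifier myCls = (Classifier) SerializationHelper.read("naivebayes.model");
        // Load instances to classify; their attributes must match the training data
        Instances unlabeled = new DataSource("unlabeled.arff").getDataSet();
        unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
        for (int i = 0; i < unlabeled.numInstances(); i++) {
            double pred = myCls.classifyInstance(unlabeled.instance(i));
            System.out.println(unlabeled.classAttribute().value((int) pred));
        }
    }
}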
For a complete workflow in Java I wrote the following article in SO Documentation, now copied here:
Text Classification in Weka
Text Classification with LibLinear
Create training instances from .arff file
// uses weka.core.Instances and weka.core.converters.ConverterUtils.DataSource
private static Instances getDataFromFile(String path) throws Exception {
    DataSource source = new DataSource(path);
    Instances data = source.getDataSet();
    if (data.classIndex() == -1) {
        // use the last attribute as the class index
        data.setClassIndex(data.numAttributes() - 1);
    }
    return data;
}
Instances trainingData = getDataFromFile(pathToArffFile);
Use the StringToWordVector filter to transform your string attributes into a numeric representation:
Important features of this filter:
tf-idf representation
stemming
lowercase words
stopwords
n-gram representation
StringToWordVector filter = new StringToWordVector();
filter.setWordsToKeep(1000000);
if (useIdf) {
    filter.setIDFTransform(true);
}
filter.setTFTransform(true);
filter.setLowerCaseTokens(true);
filter.setOutputWordCounts(true);
filter.setMinTermFreq(minTermFreq);
filter.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL, StringToWordVector.TAGS_FILTER));
NGramTokenizer t = new NGramTokenizer();
t.setNGramMaxSize(maxGrams);
t.setNGramMinSize(minGrams);
filter.setTokenizer(t);
WordsFromFile stopwords = new WordsFromFile();
stopwords.setStopwords(new File("data/stopwords/stopwords.txt"));
filter.setStopwordsHandler(stopwords);
if (useStemmer) {
    Stemmer s = new /*Iterated*/LovinsStemmer();
    filter.setStemmer(s);
}
filter.setInputFormat(trainingData);
Apply the filter to trainingData: trainingData = Filter.useFilter(trainingData, filter);
Create the LibLinear Classifier
SVMType 0 below corresponds to the L2-regularized logistic regression
Set setProbabilityEstimates(true) to print the output probabilities
Classifier cls = null;
LibLINEAR liblinear = new LibLINEAR();
liblinear.setSVMType(new SelectedTag(0, LibLINEAR.TAGS_SVMTYPE));
liblinear.setProbabilityEstimates(true);
// liblinear.setBias(1); // default value
cls = liblinear;
cls.buildClassifier(trainingData);
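As a side note, a hedged alternative: instead of filtering the data yourself and keeping the filter around, you can wrap the filter and classifier in a FilteredClassifier, which applies the filter internally. The sketch below assumes trainingData still holds the raw, unfiltered string instances:

import weka.classifiers.meta.FilteredClassifier;

// The FilteredClassifier applies the StringToWordVector filter internally,
// so raw string instances can be passed at both training and test time.
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(filter);        // the StringToWordVector configured above
fc.setClassifier(liblinear); // the LibLINEAR configured above
fc.buildClassifier(trainingData);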
Save model
System.out.println("Saving the model...");
ObjectOutputStream oos;
oos = new ObjectOutputStream(new FileOutputStream(path+"mymodel.model"));
oos.writeObject(cls);
oos.flush();
oos.close();
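Alternatively, weka.core.SerializationHelper does the same in one line:

weka.core.SerializationHelper.write(path + "mymodel.model", cls);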
Create testing instances from .arff file
Instances testingData = getDataFromFile(pathToArffFile);
Load classifier
Classifier myCls = (Classifier) weka.core.SerializationHelper.read(path+"mymodel.model");
Use the same StringToWordVector filter as above, or create a new one for testingData, but remember to use the trainingData for this command: filter.setInputFormat(trainingData); This will make the training and testing instances compatible.
Alternatively you could use InputMappedClassifier
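A hedged sketch of that alternative: InputMappedClassifier wraps the trained classifier and maps incoming attributes onto the attribute structure seen during training:

import weka.classifiers.misc.InputMappedClassifier;

InputMappedClassifier imc = new InputMappedClassifier();
imc.setClassifier(liblinear); // the trained LibLINEAR from above
imc.buildClassifier(trainingData);
// imc.distributionForInstance(...) then maps test attributes automatically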
Apply the filter to testingData: testingData = Filter.useFilter(testingData, filter);
Classify!
1. Get the class value for every instance in the testing set
for (int j = 0; j < testingData.numInstances(); j++) {
    double res = myCls.classifyInstance(testingData.get(j));
}
res is a double value that corresponds to the nominal class defined in the .arff file. To get the nominal class label, use: testingData.classAttribute().value((int) res)
2. Get the probability distribution for every instance
for (int j = 0; j < testingData.numInstances(); j++) {
    double[] dist = myCls.distributionForInstance(testingData.get(j));
}
dist is a double array that contains the probabilities for every class defined in the .arff file.
Note: the classifier should support probability distributions; enable them with myClassifier.setProbabilityEstimates(true);
Related
I have created a model in Weka using the SMO algorithm. I am trying to evaluate a test sample using the mentioned model to classify it in my two-class problem. I am a bit confused about how to evaluate the sample using the Weka SMO code. I have built an empty .arff file which contains only the meta-data of the file. I calculate the sample features and add the vector to the .arff file. I have created the following function Evaluate in order to evaluate a sample. The file template.arff is the template which contains the meta-data of an .arff file, and models/smo is my model.
public static void Evaluate(ArrayList<Float> temp) throws Exception {
    temp.add(Float.parseFloat("1"));
    System.out.println(temp.size());
    double dt[] = new double[temp.size()];
    for (int index = 0; index < temp.size(); index++) {
        dt[index] = temp.get(index);
    }
    double data[][] = new double[1][];
    data[0] = dt;
    weka.classifiers.Classifier c = loadModel(new File("models/"), "/smo"); // loads smo model
    File tmp = new File("template.arff"); // loads data template
    Instances dataset = new weka.core.converters.ConverterUtils.DataSource(tmp.getAbsolutePath()).getDataSet();
    int numInstances = data.length;
    for (int inst = 0; inst < numInstances; inst++) {
        dataset.add(new Instance(1.0, data[inst]));
    }
    dataset.setClassIndex(dataset.numAttributes() - 1);
    Evaluation eval = new Evaluation(dataset);
    // returned evaluated index
    double a = eval.evaluateModelOnceAndRecordPrediction(c, dataset.instance(0));
    double arr[] = c.distributionForInstance(dataset.instance(0));
    System.out.println(" Confidence Scores");
    for (int idx = 0; idx < arr.length; idx++) {
        System.out.print(arr[idx] + " ");
    }
    System.out.println();
}
I am not sure if I am right here. I create the sample file, and afterwards I load my model. I am wondering if my code is what I need in order to evaluate the class of sample temp. If this code is OK, how can I extract the confidence score rather than just the binary decision about the class? The structure of the template.arff file is:
@relation Dataset
@attribute Attribute0 numeric
@attribute Attribute1 numeric
@attribute Attribute2 numeric
...
@ATTRIBUTE class {1, 2}
@data
Moreover, the loadModel function is the following:
public static SMO loadModel(File path, String name) throws Exception {
    SMO classifier;
    FileInputStream fis = new FileInputStream(path + name + ".model");
    ObjectInputStream ois = new ObjectInputStream(fis);
    classifier = (SMO) ois.readObject();
    ois.close();
    return classifier;
}
I found this post here which suggests locating the SMO.java file and changing the following line: smo.buildClassifier(train, cl1, cl2, true, -1, -1); // from false to true
However, it seems that when I did so, I got the same binary output.
My training function:
public void weka_train(File input, String[] options) throws Exception {
    long start = System.nanoTime();
    File tmp = new File("data.arff");
    TwitterTrendSetters obj = new TwitterTrendSetters();
    Instances data = new weka.core.converters.ConverterUtils.DataSource(
            tmp.getAbsolutePath()).getDataSet();
    data.setClassIndex(data.numAttributes() - 1);
    Classifier c = null;
    String ctype = null;
    boolean newmodel = false;
    ctype = "SMO";
    c = new SMO();
    for (int i = 0; i < options.length; i++) {
        System.out.print(options[i]);
    }
    c.setOptions(options);
    c.buildClassifier(data);
    newmodel = true;
    if (newmodel) {
        obj.saveModel(c, ctype, new File("models"));
    }
}
I have some suggestions, but I have no idea whether they will work. Let me know if this works for you.
First, use the SMO class directly rather than just the parent Classifier class. I created a new method loadModelSMO as an example of this.
SMO Class
public static SMO loadModelSMO(File path, String name) throws Exception {
    SMO classifier;
    FileInputStream fis = new FileInputStream(path + name + ".model");
    ObjectInputStream ois = new ObjectInputStream(fis);
    classifier = (SMO) ois.readObject();
    ois.close();
    return classifier;
}
and then
SMO c = loadModelSMO(new File("models/"), "/smo");
...
I found an article that might help you out, from the mailing-list thread titled
I used SMO with logistic regression but I always get a confidence of 1.0
It suggests using the -M option to fit your logistic model, which can be set through the method
setOptions(java.lang.String[] options)
Also, maybe you need to set building of logistic models to true (see the thread Confidence score in SMO):
c.setBuildLogisticModels(true);
Let me know if this helped at all.
Basically, you should try to use the option "-M" for SMO to fit logistic models during the training process. Check the solution proposed here. It should work!
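To make that concrete, here is a small hedged sketch of both ways of enabling the logistic models (the option string and the setter should be equivalent; data stands for your training set):

import weka.classifiers.functions.SMO;
import weka.core.Utils;

SMO smo = new SMO();
// Option 1: pass -M in command-line style
smo.setOptions(Utils.splitOptions("-M"));
// Option 2: call the setter directly
smo.setBuildLogisticModels(true);
smo.buildClassifier(data);
// distributionForInstance(...) now returns calibrated probabilities
// instead of hard 0/1 decisions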
I want to make a list of all the predictions.
I have this code:
//Get File
BufferedReader reader = new BufferedReader(new FileReader(PATH + "TempArffFile.arff"));
//Get the data
Instances data = new Instances(reader);
reader.close();
//Setting class attribute
data.setClassIndex(data.numAttributes() - 1);
//Make tree
J48 tree = new J48();
String[] options = new String[1];
options[0] = "-U";
tree.setOptions(options);
tree.buildClassifier(data);
//Print tree
System.out.println(tree);
It works fine and I can see the tree printed, but I don't know how to work with it from here.
I want to make a list of the predictions; how can I do that?
If you would like a list of all the testing predictions, you could use the following code (sample code provided here):
import weka.core.Instances;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
...
Instances train = ... // from somewhere
Instances test = ... // from somewhere
// train classifier
Classifier cls = new J48();
cls.buildClassifier(train);
// evaluate classifier and print some statistics
Evaluation eval = new Evaluation(train);
eval.evaluateModel(cls, test);
System.out.println(eval.toSummaryString("\nResults\n======\n", false));
You could also use J48.classifyInstance() to predict a single instance, if you prefer to go that way.
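And if you literally want the predictions in a list, a hedged sketch using the train/test variables from the snippet above:

import java.util.ArrayList;
import java.util.List;

// Collect the predicted class label of every test instance
List<String> predictions = new ArrayList<String>();
for (int i = 0; i < test.numInstances(); i++) {
    double pred = cls.classifyInstance(test.instance(i));
    predictions.add(test.classAttribute().value((int) pred));
}
System.out.println(predictions);

Evaluation also records each prediction internally, so eval.predictions() is another way to get at them.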
I am creating a prototype for my thesis using a model generated/trained in Weka. My thesis is about emotion analysis of text. Now I have the test data/set that I want to classify using the trained model.
This is my partial code that reads an .arff file and applies a filter (StringToWordVector):
Classify ct = new Classify(TextJ48.model); // loads model
string sample = getARFFile();
StringBuilder buffer = new StringBuilder(sample);
BufferedReader reader = new BufferedReader(new java.io.StringReader(buffer.ToString()));
weka.core.converters.ArffLoader.ArffReader arff = new weka.core.converters.ArffLoader.ArffReader(reader);
Instances dataRaw = arff.getData();
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(dataRaw);
Instances dataFiltered = Filter.useFilter(dataRaw, filter);
When I display dataFiltered, it has successfully been filtered from words to numeric attributes.
This is the Classify class:
public Classify(string filename)
{
    try
    {
        classifier = (Classifier)weka.core.SerializationHelper.read(filename);
    }
    catch (java.lang.Exception ex)
    {
        lblProgress.Text = ex.getMessage();
    }
    loadAttributes();
    this.fileName = filename;
}
I don't know what to do in loadAttributes(). My plan is to add all attributes to a FastVector; I saw in some sources that they add attributes easily because they have a fixed number of attributes, but in my case the number of attributes varies with the text.
Now, how do I classify the text that I input using the model?
I've looked at lots of examples for this, and so far no luck. I'd like to classify free text.
Configure a text classifier. (FilteredClassifier using StringToWordVector and LibSVM)
Train the classifier (add in lots of documents, train on filtered text)
Serialize the FilteredClassifier to disk, quit the app
Then later
Load up the serialized FilteredClassifier
Classify stuff!
It goes OK up to the point where I try to read from disk and classify things. All the documentation and examples show the training list and testing list being built at the same time; in my case, I'm trying to build a testing list after the fact.
A FilteredClassifier alone is not enough to create a testing Instance with the same "dictionary" as the original training set, so how do I save everything I need to classify at a later date?
http://weka.wikispaces.com/Use+WEKA+in+your+Java+code just says "Instances loaded from somewhere" and doesn't say anything about using a similar dictionary.
ClassifierFramework cf = new WekaSVM();
if (!cf.isTrained()) {
    train(cf); // Train, save to disk
    cf = new WekaSVM(); // reloads from file
}
cf.test("this is a test");
Ends up throwing
java.lang.ArrayIndexOutOfBoundsException: 2
at weka.core.DenseInstance.value(DenseInstance.java:332)
at weka.filters.unsupervised.attribute.StringToWordVector.convertInstancewoDocNorm(StringToWordVector.java:1587)
at weka.filters.unsupervised.attribute.StringToWordVector.input(StringToWordVector.java:688)
at weka.classifiers.meta.FilteredClassifier.filterInstance(FilteredClassifier.java:465)
at weka.classifiers.meta.FilteredClassifier.distributionForInstance(FilteredClassifier.java:495)
at weka.classifiers.AbstractClassifier.classifyInstance(AbstractClassifier.java:70)
at ratchetclassify.lab.WekaSVM.test(WekaSVM.java:125)
Serialize your Instances header, which holds the definition of the training data (the "similar dictionary" you ask about), while you are serializing your classifier:
Instances trainInstances = ... //
Instances trainHeader = new Instances(trainInstances, 0);
trainHeader.setClassIndex(trainInstances.classIndex());
OutputStream os = new FileOutputStream(fileName);
ObjectOutputStream objectOutputStream = new ObjectOutputStream(os);
objectOutputStream.writeObject(classifier);
if (trainHeader != null)
    objectOutputStream.writeObject(trainHeader);
objectOutputStream.flush();
objectOutputStream.close();
To deserialize:
Classifier classifier = null;
Instances trainHeader = null;
InputStream is = new BufferedInputStream(new FileInputStream(fileName));
ObjectInputStream objectInputStream = new ObjectInputStream(is);
classifier = (Classifier) objectInputStream.readObject();
try { // see if we can load the header
    trainHeader = (Instances) objectInputStream.readObject();
} catch (Exception e) {
    // no header was stored with this model; ignore
}
objectInputStream.close();
Use trainHeader to create a new Instance:
int numAttributes = trainHeader.numAttributes();
double[] vals = new double[numAttributes];
for (int i = 0; i < numAttributes - 1; i++) {
    Attribute attribute = trainHeader.attribute(i);
    double value;
    if (attribute.isNominal() || attribute.isString()) {
        value = attribute.indexOfValue(myStrVal); // get myStrVal from your source
    } else { // numeric attribute
        value = myNumericVal; // get myNumericVal from your source
    }
    vals[i] = value;
}
vals[numAttributes - 1] = Instance.missingValue(); // the class value is unknown
Instance instance = new Instance(1.0, vals);
instance.setDataset(trainHeader);
return instance;
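A short follow-up sketch on how the rebuilt instance might then be used, assuming the classifier deserialized above:

// Classify the rebuilt instance with the deserialized classifier
double pred = classifier.classifyInstance(instance);
System.out.println("Predicted class: " + trainHeader.classAttribute().value((int) pred));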
I'm analysing the k-means algorithm with Mahout. I'm going to run some tests, observe performance, and do some statistics with the results I get.
I can't figure out how to run my own program within Mahout. However, the command-line interface might be enough.
To run the sample program I do
$ mahout seqdirectory --input uscensus --output uscensus-seq
$ mahout seq2sparse -i uscensus-seq -o uscensus-vec
$ mahout kmeans -i uscensus-vec/tfidf-vectors -o uscensus-kmeans-clusters -c uscensus-kmeans-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25
The dataset is one large CSV file. Each line is a record. Features are comma separated. The first field is an ID.
Because of the input format, I cannot use seqdirectory right away.
I'm trying to implement the answer to this similar question, How to perform k-means clustering in mahout with vector data stored as CSV?, but I still have two questions:
1. How do I convert from CSV to a SequenceFile? I guess I can write my own program using Mahout to make this conversion and then use its output as input for seq2sparse. I guess I can use CSVIterator (https://cwiki.apache.org/confluence/display/MAHOUT/File+Format+Integrations). What class should I use to read and write?
2. How do I build and run my new program? I couldn't figure it out with the book Mahout in Action or with other questions here.
For getting your data in SequenceFile format, you have a couple of strategies you can take. Both involve writing your own code -- i.e., not strictly command-line.
Strategy 1
Use Mahout's CSVVectorIterator class. You pass it a java.io.Reader and it will read in your CSV file and turn each row into a DenseVector. I've never used this, but I saw it in the API. It looks straightforward enough if you're OK with DenseVectors.
Strategy 2
Write your own parser. This is really easy, since you just split each line on "," and you have an array you can loop through. For each array of values in each line, you instantiate a vector using something like this:
new DenseVector(<your array here>);
and add it to a List (for example).
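A hedged sketch of that parser, assuming (as in the question) that the first CSV field is the record ID and the rest are numeric features; line is one row of the CSV file and vectors is the List mentioned above:

// Parse one CSV line: first field is the ID, the rest are numeric features
String[] fields = line.split(",");
double[] values = new double[fields.length - 1];
for (int i = 1; i < fields.length; i++) {
    values[i - 1] = Double.parseDouble(fields[i]);
}
vectors.add(new NamedVector(new DenseVector(values), fields[0]));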
Then ... once you have a List of Vectors, you can write them to SequenceFiles using something like this (I'm using NamedVectors in below code):
FileSystem fs = null;
SequenceFile.Writer writer;
Configuration conf = new Configuration();
List<NamedVector> vectors = <here's your List of vectors obtained from CSVVectorIterator>;
// Write the data to SequenceFile
try {
    fs = FileSystem.get(conf);
    Path path = new Path(<your path> + <your filename>);
    writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
    VectorWritable vec = new VectorWritable();
    for (NamedVector vector : vectors) {
        vec.set(vector);
        writer.append(new Text(vector.getName()), vec);
    }
    writer.close();
} catch (Exception e) {
    System.out.println("ERROR: " + e);
}
Now you have a directory of "points" in SequenceFile format that you can use for your K-means clustering. You can point the command line Mahout commands at this directory as input.
Anyway, that's the general idea. There are probably other approaches as well.
To run k-means with a CSV file, you first have to create a SequenceFile to pass as an argument to KMeansDriver. The following code reads each line of the CSV file "points.csv", converts it into a vector, and writes it to the SequenceFile "points.seq":
// fs and conf were not defined in the original snippet; standard Hadoop setup shown
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

try (
    BufferedReader reader = new BufferedReader(new FileReader("testdata2/points.csv"));
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, new Path("testdata2/points.seq"), LongWritable.class, VectorWritable.class)
) {
    String line;
    long counter = 0;
    while ((line = reader.readLine()) != null) {
        String[] c = line.split(",");
        if (c.length > 1) {
            double[] d = new double[c.length];
            for (int i = 0; i < c.length; i++)
                d[i] = Double.parseDouble(c[i]);
            Vector vec = new RandomAccessSparseVector(c.length);
            vec.assign(d);
            VectorWritable writable = new VectorWritable();
            writable.set(vec);
            writer.append(new LongWritable(counter++), writable);
        }
    }
    // reader and writer are closed automatically by try-with-resources
}
Hope it helps!!
There were a few issues when I ran the above code, so here is the working code with a few modifications to the syntax.
String inputfiledata = Input_file_path;
String outputfile = output_path_for_sequence_file;

FileSystem fs = null;
SequenceFile.Writer writer;
Configuration conf = new Configuration();
fs = FileSystem.get(conf);
Path path = new Path(outputfile);
writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
VectorWritable vec = new VectorWritable();

try {
    FileReader fr = new FileReader(inputfiledata);
    BufferedReader br = new BufferedReader(fr);
    String s = null;
    while ((s = br.readLine()) != null) {
        // My columns are split by tabs, with each row on a new line
        String spl[] = s.split("\\t");
        String key = spl[0];
        double[] colvalues = new double[spl.length - 1];
        int val = 0;
        for (int k = 1; k < spl.length; k++) {
            colvalues[val] = Double.parseDouble(spl[k]);
            val++;
        }
        NamedVector nmv = new NamedVector(new DenseVector(colvalues), key);
        vec.set(nmv);
        writer.append(new Text(nmv.getName()), vec);
    }
    writer.close();
} catch (Exception e) {
    System.out.println("ERROR: " + e);
}
I would suggest you implement a program to convert the CSV into the sparse-vector sequence file that Mahout accepts.
What you need to do is understand how InputDriver converts text files containing space-delimited floating-point numbers into Mahout sequence files of VectorWritable, suitable for input to the clustering jobs in particular and to any Mahout job requiring this input in general. You can then customize the code to your needs.
If you have downloaded the source code of Mahout, the InputDriver is at package org.apache.mahout.clustering.conversion.
org.apache.mahout.clustering.conversion.InputDriver is a class that you can use to create sparse vectors.
A sample invocation is given below:
mahout org.apache.mahout.clustering.conversion.InputDriver -i testdata -o output1/data -v org.apache.mahout.math.RandomAccessSparseVector
If you run mahout org.apache.mahout.clustering.conversion.InputDriver
it will list out the parameters it expects.
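If you would rather call it from Java, a hedged sketch based on the runJob signature used in Mahout's synthetic-control example (the paths are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.conversion.InputDriver;

// Convert space-delimited text vectors into a SequenceFile of VectorWritable
InputDriver.runJob(new Path("testdata"),
        new Path("output1/data"),
        "org.apache.mahout.math.RandomAccessSparseVector");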
Hope this helps.
Also, here is an article I wrote explaining how I ran k-means clustering on an .arff file:
http://mahout-hadoop.blogspot.com/2013/10/using-mahout-to-cluster-iris-data.html