I am using LIBSVM in Java and I need to calculate AUC values. The README says the -v option splits the data into n parts and calculates cross-validation accuracy/mean squared error on them, but in Java I am using the svm_train function, which does not have a -v option (it takes an SVM problem and SVM parameters as inputs). So I am using the svm_cross_validation function as below, but it does not return the accuracy; it returns the labels in the target array.
svm.svm_cross_validation(SVM_Prob, SVM_Param, 3, target);
I get results like the ones below, which do not show any accuracy:
optimization finished, #iter = 21
nu = 0.06666666666666667
obj = -21.0, rho = 0.0
nSV = 42, nBSV = 0
Total nSV = 42
My data is not unbalanced, so I am not sure if I should use LibLINEAR instead.
Can anyone tell me how to find the cross-validation accuracy of LIBSVM in Java?
Thanks!
You can write a simple one yourself:
double[] target = new double[problem.l];
svm.svm_cross_validation(problem, param, 3, target);

// target now holds the cross-validated predicted label for each instance;
// compare against the true labels in problem.y to get the accuracy.
double correctCounter = 0;
for (int i = 0; i < target.length; i++) {
    if (target[i] == problem.y[i]) {
        correctCounter++;
    }
}
System.out.println("Cross-validation accuracy = "
        + 100.0 * correctCounter / target.length + "%");
Note that target only contains the cross-validated predicted labels, so you can compute accuracy this way; for AUC you would need decision values or probability estimates, which svm_cross_validation does not return.
I am using the smile library to train a neural network for a simple nonlinear regression problem. For every input I get the same prediction. Maybe I am doing something wrong with the construction of the neural network. Could you suggest what I am missing? I am not including the data for legal reasons, e.g. [1, 3, 8, 15, 7 ...]
int[] units=new int[4];
units[0]=1;
units[1]=10;
units[2]=10;
units[3]=1;
smile.regression.NeuralNetwork net =
new smile.regression.NeuralNetwork(
smile.regression.NeuralNetwork.ActivationFunction.LOGISTIC_SIGMOID,
units);
int epochs=10;
for (int i = 0; i < epochs; i++) {
net.learn(data, label);
}
double[] pred = new double[label.length];
double maxError=0.0;
for (int i = 0; i < label.length; i++) {
pred[i] = net.predict(data[i]);
double error=Math.abs(pred[i]-label[i]);
if(error>maxError){
maxError=error;
}
System.out.println("For data "+data[i][0]);
System.out.println(pred[i]+" "+label[i]);
}
I believe it is due to your units vector; it should contain only 3 numbers (see the sketch below):
the number of units in the input layer = the number of features
the number of units in the hidden layer
the number of units in the output layer = 1
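A minimal sketch of the suggested fix, reusing the same smile constructor as in the question (one input feature, one hidden layer of 10 units, one output):
int[] units = {1, 10, 1}; // input units, hidden units, output units
smile.regression.NeuralNetwork net =
    new smile.regression.NeuralNetwork(
        smile.regression.NeuralNetwork.ActivationFunction.LOGISTIC_SIGMOID,
        units);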
I want to use an SVM (support vector machine) in my program, but I cannot get correct results.
I want to know how to train the data for the SVM.
What I am doing:
Suppose we have 5 documents (the numbers are just an example), 3 of them in the first category and the other 2 in the second category. I merge the documents within each category (meaning the 3 docs in the first category are merged into one document). After that I build a training array like this:
double[][] train = new double[cat1.getDocument().getAttributes().size() + cat2.getDocument().getAttributes().size()][];
and I will fill the array like this:
int i = 0;
Iterator<String> iteraitor = cat1.getDocument().getAttributes().keySet().iterator();
Iterator<String> iteraitor2 = cat2.getDocument().getAttributes().keySet().iterator();
while (i < train.length) {
if (i < cat2.getDocument().getAttributes().size()) {
while (iteraitor2.hasNext()) {
String key = (String) iteraitor2.next();
Long value = cat2.getDocument().getAttributes().get(key);
double[] vals = { 0, value };
train[i] = vals;
i++;
System.out.println(vals[0] + "," + vals[1]);
}
} else {
while (iteraitor.hasNext()) {
String key = (String) iteraitor.next();
Long value = cat1.getDocument().getAttributes().get(key);
double[] vals = { 1, value };
train[i] = vals;
i++;
System.out.println(vals[0] + "," + vals[1]);
}
i++;
}
So I continue like this to get the model:
svm_problem prob = new svm_problem();
int dataCount = train.length;
prob.y = new double[dataCount];
prob.l = dataCount;
prob.x = new svm_node[dataCount][];
for (int k = 0; k < dataCount; k++) {
double[] features = train[k];
prob.x[k] = new svm_node[features.length - 1];
for (int j = 1; j < features.length; j++) {
svm_node node = new svm_node();
node.index = j;
node.value = features[j];
prob.x[k][j - 1] = node;
}
prob.y[k] = features[0];
}
svm_parameter param = new svm_parameter();
param.probability = 1;
param.gamma = 0.5;
param.nu = 0.5;
param.C = 1;
param.svm_type = svm_parameter.C_SVC;
param.kernel_type = svm_parameter.LINEAR;
param.cache_size = 20000;
param.eps = 0.001;
svm_model model = svm.svm_train(prob, param);
Is this approach correct? If not, please help me to fix it.
These two answers are correct: answer one, answer two.
Even without examining the code, one can find conceptual errors:
"Suppose we have 5 documents, 3 of them in the first category and the other 2 in the second category. I merge the documents within each category (meaning the 3 docs in the first category are merged into one document). After that I build a training array like this..."
So:
Training on 5 documents won't give any reasonable results with any machine learning model; these are statistical models, and there are no reasonable statistics in 5 points in R^n, where n ~ 10,000.
You should not merge anything. That approach can work for Naive Bayes, which does not really treat documents as a "whole" but rather as probabilistic dependencies between features and classes. With an SVM, each document should be a separate point in the R^n space, where n can be the number of distinct words (for a bag-of-words/set-of-words representation).
A problem might be that you do not terminate each set of features in a training example with an index of -1, which you should according to the README.
I.e., if you have one example with two features, I think you should do:
Index[0]: 0
Value[0]: 22
Index[1]: 1
Value[1]: 53
Index[2]: -1
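In the Java API, that layout would look something like this (a sketch of the suggestion above; whether the terminator is strictly required may depend on your libsvm version):
svm_node[] example = new svm_node[3];
example[0] = new svm_node();
example[0].index = 0;
example[0].value = 22;
example[1] = new svm_node();
example[1].index = 1;
example[1].value = 53;
example[2] = new svm_node();
example[2].index = -1; // terminator, per the README's sparse format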
Good luck!
Using SVMs to classify text is a common task. You can check out research papers by Joachims [1] regarding SVM text classification.
Basically you have to:
1. Tokenize your documents
2. Remove stopwords
3. Apply a stemming technique
4. Apply a feature selection technique (see [2])
5. Transform your documents using the features obtained in step 4; a simple representation is binary (0: feature is absent, 1: feature is present), but other measures like TFC also work (see the sketch after this list)
6. Train your SVM and be happy :)
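A minimal sketch of steps 1, 2, and 5, assuming a fixed vocabulary produced by feature selection (all names here are illustrative; stemming and feature selection are omitted):
import java.util.*;

public class BinaryFeatures {
    // A tiny illustrative stopword list; real lists are much longer.
    static final Set<String> STOPWORDS = Set.of("the", "a", "an", "is", "of");

    // Turn a document into a binary feature vector over the vocabulary.
    static double[] toBinaryVector(String document, List<String> vocabulary) {
        Set<String> tokens = new HashSet<>();
        for (String t : document.toLowerCase().split("\\W+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                tokens.add(t);
            }
        }
        double[] vec = new double[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            vec[i] = tokens.contains(vocabulary.get(i)) ? 1.0 : 0.0;
        }
        return vec;
    }
}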
[1] T. Joachims: Text Categorization with Support Vector Machines: Learning with Many Relevant Features; Springer: Heidelberg, Germany, 1998, doi:10.1007/BFb0026683.
[2] Y. Yang, J. O. Pedersen: A Comparative Study on Feature Selection in Text Categorization. International Conference on Machine Learning, 1997, 412-420.
I have a machine learning scheme in which I use the Java classes from Weka inside a MATLAB script. I then upload the classifier's model to a database, since I need to perform the classification on a different machine in a different language (Obj-C). The evaluation of the network was fairly straightforward to program, but I need the values that Weka used to normalize the data set before training, so I can use them when evaluating the network later. Does anyone know how to get the normalization factors that Weka would use for training a Multilayer Perceptron network? I would prefer the answer to be in Java.
After some digging through the Weka source code and documentation, this is what I've come up with. Even though there is a filter in Weka called "Normalize", the Multilayer Perceptron doesn't use it; instead it uses a bit of code internally that looks like this:
double min, max, value; // declarations trimmed from the original excerpt
m_attributeRanges = new double[inst.numAttributes()];
m_attributeBases = new double[inst.numAttributes()];
for (int noa = 0; noa < inst.numAttributes(); noa++) {
    min = Double.POSITIVE_INFINITY;
    max = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < inst.numInstances(); i++) {
        if (!inst.instance(i).isMissing(noa)) {
            value = inst.instance(i).value(noa);
            if (value < min) {
                min = value;
            }
            if (value > max) {
                max = value;
            }
        }
    }
    // Half-range and midpoint for each attribute.
    m_attributeRanges[noa] = (max - min) / 2;
    m_attributeBases[noa] = (max + min) / 2;
    if (noa != inst.classIndex() && m_normalizeAttributes) {
        for (int i = 0; i < inst.numInstances(); i++) {
            if (m_attributeRanges[noa] != 0) {
                inst.instance(i).setValue(noa, (inst.instance(i).value(noa)
                        - m_attributeBases[noa]) / m_attributeRanges[noa]);
            } else {
                inst.instance(i).setValue(noa, inst.instance(i).value(noa)
                        - m_attributeBases[noa]);
            }
        }
    }
}
So the only values that I should need to transmit to the other system I'm trying to use to evaluate this network would be the min and the max. Luckily for me, there turned out to be a method on the filter weka.filters.unsupervised.attribute.Normalize that returns a double array of the mins and the maxes for a processed dataset. All I had to do then was tell the multilayer perceptron to not automatically normalize my data, and to process it separately with the filter so I could extract the mins and maxes to send to the database along with the weights and everything else.
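For reference, here is a sketch of that extraction step, assuming Weka's Normalize filter and its getMinArray()/getMaxArray() accessors (check that these exist in your Weka version):
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

// data is the training set as a weka.core.Instances object.
Normalize norm = new Normalize();
norm.setInputFormat(data);
Instances normalized = Filter.useFilter(data, norm);
double[] mins = norm.getMinArray();   // per-attribute minimums
double[] maxes = norm.getMaxArray();  // per-attribute maximums
// Store mins/maxes alongside the model weights for later evaluation
// (and disable the MLP's own attribute normalization, as described above).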
I need the steps to perform document clustering using the k-means algorithm in Java.
It would be very helpful if you could lay out the steps simply.
Thanks in advance.
You need to count the words in each document and build a feature representation generally called a bag of words. Before that you need to remove stop words (very common words that carry little information, like "the", "a", etc.). You can generally take the top n common words from your documents, count the frequency of these words, and store them in an n-dimensional vector.
For the distance measure you can use cosine similarity (see the sketch below).
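A minimal sketch of cosine similarity between two term-frequency vectors (the method name is illustrative):
// Cosine similarity: dot(a, b) / (|a| * |b|); lies in [0, 1] for
// non-negative term-frequency vectors.
static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}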
Here is a simple algorithm for 2-means on 1-dimensional data points. You can extend it to k-means and n-dimensional data points easily. Let me know if you want an n-dimensional implementation.
import java.util.ArrayList;
import java.util.Random;

double[] x = {1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 7, 8, 8.5, 9, 9.5, 10};
double[] center = new double[2];
double[] precenter = new double[2];
ArrayList<Double>[] cluster = new ArrayList[2]; // unchecked warning is fine here
cluster[0] = new ArrayList<>();
cluster[1] = new ArrayList<>();

// Generate 2 random indices from 0 to x.length - 1 without replacement.
// (There is a better way to generate k random numbers without replacement;
// just search for it.)
Random random = new Random();
int[] idx = new int[2];
idx[0] = random.nextInt(x.length);
idx[1] = random.nextInt(x.length);
while (idx[0] == idx[1]) {
    idx[1] = random.nextInt(x.length);
}
center[0] = x[idx[0]];
center[1] = x[idx[1]];

do {
    cluster[0].clear();
    cluster[1].clear();
    // Assignment step: each point joins the cluster of its nearest center.
    for (int i = 0; i < x.length; i++) {
        if (Math.abs(x[i] - center[0]) <= Math.abs(x[i] - center[1])) {
            cluster[0].add(x[i]);
        } else {
            cluster[1].add(x[i]);
        }
    }
    // Update step: move each center to the mean of its cluster.
    precenter[0] = center[0];
    precenter[1] = center[1];
    center[0] = mean(cluster[0]);
    center[1] = mean(cluster[1]);
} while (precenter[0] != center[0] || precenter[1] != center[1]);

static double mean(ArrayList<Double> list) {
    double sum = 0;
    for (double v : list) {
        sum += v;
    }
    return sum / list.size();
}
cluster[0] and cluster[1] contain the points in the two clusters, and center[0] and center[1] are the 2 means.
You may still want to test this carefully, because I originally wrote the code in R and converted it into Java for you :)
Does this help you? The Wikipedia article also has some links to implementations in other languages, ready to be ported to Java.
Steps of the algorithm:
Define the number of clusters you want to have.
Distribute the center points randomly in your problem space.
Link every observation to the nearest center point.
Calculate the center of mass for each cluster and move the center point there.
Link the observations to the center points again, and repeat until the center points don't move any more.
What do you want to cluster the documents on? If it's by similarity, you'll need to do some natural language processing (NLP) first, and then you'll need a metric and some kind of assignment algorithm to place the documents into clusters (CRP works and is relatively straightforward).
The hardest part will be the NLP, if you're not clustering them based on something like "length". I can provide more info on all of these, but I won't dive down the rabbit hole if you don't need it.
I'm using libsvm, and the documentation leads me to believe that there's a way to output the estimated probability that a predicted classification is correct. Is this so? And if so, can anyone provide a clear example of how to do it in code?
Currently, I'm using the Java libraries in the following manner:
SvmModel model = Svm.svm_train(problem, parameters);
SvmNode x[] = getAnArrayOfSvmNodesForProblem();
double predictedValue = Svm.svm_predict(model, x);
Given your code-snippet, I'm going to assume you want to use the Java API packaged with libSVM, rather than the more verbose one provided by jlibsvm.
To enable prediction with probability estimates, train a model with the svm_parameter field probability set to 1. Then, just change your code so that it calls the svm method svm_predict_probability rather than svm_predict.
Modifying your snippet, we have:
parameters.probability = 1;
svm_model model = svm.svm_train(problem, parameters);
svm_node x[] = problem.x[0]; // let's try the first data pt in problem
double[] prob_estimates = new double[NUM_LABEL_CLASSES]; // one slot per class
double predictedLabel = svm.svm_predict_probability(model, x, prob_estimates);
It's worth knowing that training with multiclass probability estimates can change the predictions made by the classifier. For more on this, see the question Calculating Nearest Match to Mean/Stddev Pair With LibSVM.
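As a follow-up sketch, you can map each entry of prob_estimates back to its class label via svm_get_labels (the ordering of prob_estimates matches the label array):
int[] labels = new int[model.nr_class];
svm.svm_get_labels(model, labels);
for (int i = 0; i < labels.length; i++) {
    System.out.println("P(class " + labels[i] + ") = " + prob_estimates[i]);
}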
The accepted answer worked like a charm. Make sure to set probability = 1 during training.
If you want to drop a prediction when its confidence does not meet a threshold, here is a code sample:
double confidenceScores[] = new double[model.nr_class];
svm.svm_predict_probability(model, svmVector, confidenceScores);
/*System.out.println("text="+ text);
for (int i = 0; i < model.nr_class; i++) {
System.out.println("i=" + i + ", labelNum:" + model.label[i] + ", name=" + classLoadMap.get(model.label[i]) + ", score="+confidenceScores[i]);
}*/
//finding max confidence;
int maxConfidenceIndex = 0;
double maxConfidence = confidenceScores[maxConfidenceIndex];
for (int i = 1; i < confidenceScores.length; i++) {
if(confidenceScores[i] > maxConfidence){
maxConfidenceIndex = i;
maxConfidence = confidenceScores[i];
}
}
double threshold = 0.3; // set this based on your data & the number of classes
int labelNum = model.label[maxConfidenceIndex];
// reverse map number to name
String targetClassLabel = classLoadMap.get(labelNum);
LOG.info("classNumber:{}, className:{}; confidence:{}; for text:{}",
labelNum, targetClassLabel, (maxConfidence), text);
if (maxConfidence < threshold) {
LOG.info("Not enough confidence; threshold={}", threshold);
targetClassLabel = null;
}
return targetClassLabel;