Java-ML (LibSVM): How can I get the class probabilities? - java

We are using Java-ML (LibSVM) to run the SVM algorithm on a multi-class problem:
Classifier clas = new LibSVM();
clas.buildClassifier(data);
Dataset dataForClassification = FileHandler.loadDataset(new File(.), 0, ",");
/* Counters for correct and wrong predictions. */
int correct = 0, wrong = 0;
/* Classify all instances and check them against the correct class values. */
for (Instance inst : dataForClassification) {
    Object predictedClassValue = clas.classify(inst);
    Map<Object, Double> map = clas.classDistribution(inst);
    Object realClassValue = inst.classValue();
    if (predictedClassValue.equals(realClassValue))
        correct++;
    else
        wrong++;
}
classDistribution() only ever returns a standard basis vector (all values are 0 except one, which equals 1).
java-ml - http://java-ml.sourceforge.net/

Despite the other answers, it is possible to output probability estimates for SVMs, and LibSVM does support this. However, I'm fairly sure you can't use this feature from Java-ML: the file LibSVM.java only ever calls svm_predict_values and never svm_predict_probability. It probably wouldn't be too hard to add this functionality to Java-ML yourself if you really need it.
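For reference, calling the underlying libsvm Java API directly would look roughly like this. This is only a sketch: 'problem' is an svm_problem you build from your data, and toSvmNodes is a hypothetical helper that converts a Java-ML Instance into an svm_node[] array.

// Classes from the libsvm package (bundled with or alongside Java-ML).
svm_parameter param = new svm_parameter();
param.svm_type = svm_parameter.C_SVC;
param.kernel_type = svm_parameter.RBF;
param.probability = 1;                      // required, otherwise probability estimates are unavailable
// ... set C, gamma, etc. ...

svm_model model = svm.svm_train(problem, param);

svm_node[] x = toSvmNodes(inst);            // hypothetical conversion helper
double[] probs = new double[svm.svm_get_nr_class(model)];
double predictedLabel = svm.svm_predict_probability(model, x, probs);
// probs[i] is the estimate for the i-th label returned by svm.svm_get_labels(model, labels)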

AFAIK, LibSVM acts as a deterministic classifier here, meaning that the only distributions you will see are concentrated on a single class, i.e. a standard basis vector. This is different from a probabilistic classifier such as Naive Bayes, which may give values other than 0.0 and 1.0.

Related

Custom distance metric for DBSCAN in Apache Commons Math (v3.1 vs. v3.6)

I want to use Apache Commons Math's DBSCANClusterer<T extends Clusterable> to perform a clustering using the DBSCAN algorithm, but with a custom distance metric as my data points contain non-numerical values. This seems to have been easily achievable in the older version (note that the fully qualified name of this class is org.apache.commons.math3.stat.clustering.DBSCANClusterer<T> whereas it is org.apache.commons.math3.ml.clustering.DBSCANClusterer<T> for the current release), which has now been deprecated. In the older version, Clusterable would take a type-param, T, describing the type of the data points being clustered, and the distance between two points would be defined by one's implementation of Clusterable.distanceFrom(T), e.g.:
class MyPoint implements Clusterable<MyPoint> {
    private String someStr = ...;
    private double someDouble = ...;

    @Override
    public double distanceFrom(MyPoint p) {
        // Arbitrary distance metric goes here, e.g.:
        double stringsEqual = this.someStr.equals(p.someStr) ? 0.0 : 10000.0;
        return stringsEqual + Math.sqrt(Math.pow(p.someDouble - this.someDouble, 2.0));
    }
}
In the current release, Clusterable is no longer parameterized. This means that one has to come up with a way of representing one's (potentially non-numerical) data points as a double[] and return that representation from getPoint(), e.g.:
class MyPoint implements Clusterable {
    private String someStr = ...;
    private double someDouble = ...;

    @Override
    public double[] getPoint() {
        double[] res = new double[2];
        res[1] = someDouble; // obvious
        res[0] = ...;        // some way of representing someStr as a double required
        return res;
    }
}
And then provide an implementation of DistanceMeasure that defines the custom distance function in terms of the double[] representations of the two points being compared, e.g.:
class CustomDistanceMeasure implements DistanceMeasure {
    @Override
    public double compute(double[] a, double[] b) {
        // Let's mimic the distance function from earlier, assuming that
        // a[0] is different from b[0] if the two 'someStr' variables were
        // different when their double representations were created.
        double stringsEqual = a[0] == b[0] ? 0.0 : 10000.0;
        return stringsEqual + Math.sqrt(Math.pow(a[1] - b[1], 2.0));
    }
}
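For reference, wiring the custom measure into the clusterer looks roughly like this (a sketch: eps and minPts are placeholder values, and myPoints is assumed to be a Collection<MyPoint>; the classes come from org.apache.commons.math3.ml.clustering and org.apache.commons.math3.ml.distance):

// Cluster with the custom DistanceMeasure defined above.
DBSCANClusterer<MyPoint> clusterer =
        new DBSCANClusterer<>(1.0 /* eps */, 2 /* minPts */, new CustomDistanceMeasure());
List<Cluster<MyPoint>> clusters = clusterer.cluster(myPoints);
for (Cluster<MyPoint> c : clusters) {
    System.out.println(c.getPoints().size());
}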
My data points are of the form (integer, integer, string, string):
class MyPoint {
    int i1;
    int i2;
    String str1;
    String str2;
}
And I want to use a distance function/metric that essentially says "if str1 and/or str2 differ for MyPoint mpa and MyPoint mpb, the distance is maximal, otherwise the distance is the Euclidean distance between the integers" as illustrated by the following snippet:
class Dist {
    static double distance(MyPoint mpa, MyPoint mpb) {
        if (!mpa.str1.equals(mpb.str1) || !mpa.str2.equals(mpb.str2)) {
            return Double.MAX_VALUE;
        }
        return Math.sqrt(Math.pow(mpa.i1 - mpb.i1, 2.0) + Math.pow(mpa.i2 - mpb.i2, 2.0));
    }
}
Questions:
1. How do I represent a String as a double in order to enable the above distance metric in the current release (v3.6.1) of Apache Commons Math? String.hashCode() is insufficient, as hash code collisions would cause different strings to be considered equal. This seems like an unsolvable problem, as I'm essentially trying to create a unique mapping from an infinite set of strings to a finite set of numerical values (a 64-bit double).
2. As (1) seems impossible, am I misunderstanding how to use the library? If yes, where did I take a wrong turn?
3. Is my only alternative to use the deprecated version for this kind of distance metric? If yes, (3a) why would the designers choose to make the library less general? Perhaps in favor of speed? Perhaps to get rid of the self-reference in class MyPoint implements Clusterable<MyPoint>, which some might consider bad design? (I realize that this might be too opinionated, so please disregard it if that is the case.) For the commons-math experts: (3b) what downsides are there to using the deprecated version other than forward compatibility (the deprecated version will be removed in 4.0)? Is it slower? Perhaps even incorrect?
Note: I am aware of ELKI which is apparently popular among a set of SO users, but it does not fit my needs as it is marketed as a command-line and GUI tool rather than a Java library to be included in third-party applications:
You can even embed ELKI into your application (if you accept the
AGPL-3 license), but we currently do not (yet) recommend to do so,
because the API is still changing substantially. [...]
ELKI is not designed as embeddable library. It can be used, but it is
not designed to be used this way. ELKI has tons of options and
functionality, and this comes at a price, both in runtime (although it
can easily outperform R and Weka, for example!) memory usage and in
particular in code complexity.
ELKI was designed for research in data mining algorithms, not for
making them easy to include in arbitrary applications. Instead, if you
have a particular problem, you should use ELKI to find out which
approach works good, then reimplement that approach in an optimized
manner for your problem (maybe even in C++ then, to further reduce
memory and runtime).

Incorrect class prediction using Weka

I am using the WEKA API weka-stable-3.8.1.
I have been trying to use the J48 decision tree (Weka's C4.5 implementation).
My data has around 22 features and a nominal class with 2 possible values : yes or no.
While evaluating with the following code :
Classifier model = (Classifier) weka.core.SerializationHelper.read(trainedModelDestination);
Evaluation evaluation = new Evaluation(trainingInstances);
evaluation.evaluateModel(model, testingInstances);
System.out.println("Number of correct predictions : "+evaluation.correct());
I get all predictions correct.
But when I try these test cases individually using :
for (Instance i : testingInstances) {
    double predictedClassLabel = model.classifyInstance(i);
    System.out.println("predictedClassLabel : " + predictedClassLabel);
}
I always get the same output, i.e. 0.0.
Why is this happening ?
If the provided snippet is indeed from your code, you seem to be classifying only the first test instance ("testingInstances.firstInstance()").
Instead, you want to loop over the test instances and classify each one:
for (Instance i : testingInstances) {
    double predictedClassLabel = model.classifyInstance(i);
    System.out.println("predictedClassLabel : " + predictedClassLabel);
}
Should have updated much sooner.
Here's how I fixed this:
During the training phase, the model learns from your training set, and while doing so it also encounters categorical/nominal features.
Most algorithms require numerical values to work. To deal with this, the algorithm maps each nominal value to a specific numerical value (longer explanation here).
Since the algorithm learned this mapping during the training phase, the Instances object holds this information. During the testing phase you have to use the same Instances structure (header) that was created during the training phase. Otherwise, the classifier will not correctly map your nominal values to the values it expects.
Note:
This kind of encoding gives biased training results in non-tree-based models, and things like one-hot encoding should be used in such cases.
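A minimal sketch of what that looks like in code, assuming trainingInstances is the dataset the classifier was built on with a nominal class as the last attribute; the attribute indices and values here are hypothetical:

// Build test instances against the training header so nominal values map correctly.
Instances testStructure = new Instances(trainingInstances, 0);          // copies the header only
double[] vals = new double[testStructure.numAttributes()];
vals[0] = testStructure.attribute(0).indexOfValue("someNominalValue");  // hypothetical nominal attribute
vals[1] = 3.14;                                                         // hypothetical numeric attribute
// ... fill the remaining attributes ...
Instance inst = new DenseInstance(1.0, vals);
inst.setDataset(testStructure);                                         // ties the instance to the training header
double predicted = model.classifyInstance(inst);
System.out.println(testStructure.classAttribute().value((int) predicted)); // e.g. "yes" or "no"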

How does WEKA normalize attributes?

Suppose I input some dataset to WEKA and set a normalization filter on the attributes so that the values lie between 0 and 1. Suppose the normalization is done by dividing by the maximum value, and then the model is built. What happens if I deploy the model and a new instance to be classified has a feature value larger than the maximum in the training set? How is such a situation handled? Does it clip to 1, produce a value greater than 1, or throw an exception?
The documentation doesn't specify this for filters in general, so it must depend on the filter. I looked at the source code of weka.filters.unsupervised.attribute.Normalize, which I assume you are using, and I don't see any bounds checking in it.
The actual scaling code is in the Normalize.convertInstance() method:
value = (vals[j] - m_MinArray[j]) / (m_MaxArray[j] - m_MinArray[j])
* m_Scale + m_Translation;
Barring any (unlikely) additional checks outside this method, I'd say that it will scale to a value greater than 1 in the situation you describe. To be 100% sure, your best bet is to write a test case, invoke the filter yourself, and find out. With libraries that haven't specified their behaviour in the Javadoc, you never know what the next release will do, so if you depend heavily on a particular behaviour, it's not a bad idea to write an automated test that regression-tests it.
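Such a test could look roughly like this (a sketch: train is your training Instances, and unseenInstance holds a value above the training maximum for attribute 0):

Normalize norm = new Normalize();                 // weka.filters.unsupervised.attribute.Normalize
norm.setInputFormat(train);
Instances normalizedTrain = Filter.useFilter(train, norm);

// Push an unseen instance through the already-configured filter.
norm.input(unseenInstance);
Instance normalized = norm.output();
System.out.println(normalized.value(0));          // expect a value > 1.0 if there is no bounds checking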
I had the same question, and the following approach may help you.
I assume you use weka.filters.unsupervised.attribute.Normalize to normalize your data.
As Erwin Bolwidt said, Weka uses
value = (vals[j] - m_MinArray[j]) / (m_MaxArray[j] - m_MinArray[j])
* m_Scale + m_Translation;
to normalize each attribute.
Don't forget that the Normalize class has these two methods:
public double[] getMinArray()
public double[] getMaxArray()
which return the calculated minimum/maximum values for the attributes in the data.
You can store those minimum/maximum values and then use the formula above to normalize your data yourself.
Remember that you can set attribute values on the Instance class, and you can classify a single result with Evaluation.evaluationForSingleInstance.
Hope this helps.
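A rough sketch of that manual approach, assuming the filter has already processed the training data and the defaults m_Scale = 1 and m_Translation = 0 are in effect:

double[] min = norm.getMinArray();
double[] max = norm.getMaxArray();

// Apply the same scaling to a raw value of attribute j taken from a new sample.
double normalized = (rawValue - min[j]) / (max[j] - min[j]);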

Create an almost unique identifier based on a given array of numbers

Given an array of numbers, I would like to create a numeric identifier that represents that combination as uniquely as possible.
For example:
int[] inputNumbers = { 543, 134, 998 };
int identifier = createIdentifier(inputNumbers);
System.out.println( identifier );
Output:
4532464234
- The returned number must be as unique as possible
- Ordering of the elements must influence the result
- The algorithm must always return the same result for the same input array
- The algorithm must be as fast as possible, as it will be used a lot in 'for' loops
The purpose of this algorithm is to create a small value to be stored in a DB and compared easily. It is nothing critical, so it is acceptable for some arrays of numbers to return the same value, but such cases must be rare.
Can you suggest a good way to accomplish this?
The standard (Java 7) implementation of Arrays.hashCode(int[]) has the required properties. It is implemented thus:
public static int hashCode(int a[]) {
    if (a == null)
        return 0;

    int result = 1;
    for (int element : a)
        result = 31 * result + element;

    return result;
}
As you can see, the implementation is fast, and the result depends on the order of the elements as well as the element values.
If there is a requirement that the hash values are the same across all Java platforms, I think you can rely on that being satisfied. The javadoc says that the method returns the same value as calling hashCode() on a List<Integer> containing the same values, and the formula for that hash code is specified.
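A quick illustration of both properties (order sensitivity, and consistency with the specified List<Integer> hash code):

// java.util.Arrays
int a = Arrays.hashCode(new int[]{543, 134, 998});
int b = Arrays.hashCode(new int[]{998, 134, 543});
System.out.println(a == b);                                        // false: element order changes the result
System.out.println(a == Arrays.asList(543, 134, 998).hashCode());  // true: same specified 31-based formula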
Have a look at Arrays.hashCode(int[]); it does exactly this (see the documentation).
What you're looking for is the array's hash code.
int hash = Arrays.hashCode(new int[]{1, 2, 3, 4});
See also the Java API
I also think you are looking for some kind of hash function.
I don't know how much you will rely on point 3 (the algorithm must always return the same result for the same input array), but in general Object.hashCode() depends on the JVM implementation.
So depending on your use case you might run into trouble (the solution then would be to use an external hashing library).
For further information take a look at this SO question: Java, Object.hashCode() result constant across all JVMs/Systems?
EDIT
I just read that you want to store the values in a DB. In that case I would recommend using an external hashing library that is reliable and guaranteed to yield the same value every time it is invoked. Otherwise you would have to re-hash your whole DB every time you start your application to keep it in a consistent state.
EDIT2
Since you are using only plain ints, the hash value will be the same every time, as @Stephen C showed in his answer.

Weka's PCA is taking too long to run

I am trying to use Weka for feature selection using PCA algorithm.
My original feature space contains ~9000 attributes, in 2700 samples.
I tried to reduce dimensionality of the data using the following code:
AttributeSelection selector = new AttributeSelection();
PrincipalComponents pca = new PrincipalComponents();
Ranker ranker = new Ranker();
selector.setEvaluator(pca);
selector.setSearch(ranker);
Instances instances = SamplesManager.asWekaInstances(trainSet);
try {
    selector.SelectAttributes(instances);
    return SamplesManager.asSamplesList(selector.reduceDimensionality(instances));
} catch (Exception e) {
    ...
}
However, it did not finish running within 12 hours. It is stuck in the method selector.SelectAttributes(instances).
My questions are:
Is such a long computation time expected for Weka's PCA, or am I using it wrongly?
If the long run time is expected:
How can I tune the PCA algorithm to run much faster? Can you suggest an alternative? (+ example code how to use it)?
If it is not:
What am I doing wrong? How should I invoke PCA using weka and get my reduced dimensionality?
Update: The comments confirm my suspicion that it is taking much more time than expected.
I'd like to know: How can I get PCA in java - using weka or an alternative library.
Added a bounty for this one.
After digging into the WEKA code, the bottleneck is creating the covariance matrix and then calculating the eigenvectors of that matrix. Even trying to switch to a sparse matrix implementation (I used COLT's SparseDoubleMatrix2D) did not help.
The solution I came up with was to first reduce the dimensionality using a fast method (I used the information gain ranker and filtering based on document frequency), and then use PCA on the reduced feature set to reduce it further.
The code is more complex, but it essentially comes down to this:
Ranker ranker = new Ranker();
InfoGainAttributeEval ig = new InfoGainAttributeEval();
Instances instances = SamplesManager.asWekaInstances(trainSet);
ig.buildEvaluator(instances);
int[] firstAttributes = ranker.search(ig, instances);
int[] candidates = Arrays.copyOfRange(firstAttributes, 0, FIRST_SIZE_REDUCTION);
instances = reduceDimensions(instances, candidates); // helper that keeps only the candidate attributes

PrincipalComponents pca = new PrincipalComponents();
pca.setVarianceCovered(var);
ranker = new Ranker();
ranker.setNumToSelect(numFeatures);

AttributeSelection selection = new AttributeSelection();
selection.setEvaluator(pca);
selection.setSearch(ranker);
selection.SelectAttributes(instances);
instances = selection.reduceDimensionality(instances);
However, this method scored worse than using greedy information gain with a ranker when I cross-validated for estimated accuracy.
It looks like you're using the default configuration for the PCA, which, judging by the long runtime, is likely doing far more work than you need.
Take a look at the options for PrincipalComponents.
I'm not sure if -D means it will normalize the data for you or if you have to do it yourself. You do want your data to be normalized (centered about the mean), so I would do this manually first.
-R sets the amount of variance you want accounted for. Default is 0.95. The correlation in your data might not be good so try setting it lower to something like 0.8.
-A sets the maximum number of attributes to include. I presume the default is all of them. Again, you should try setting it to something lower.
I suggest first starting out with very lax settings (e.g. -R=0.1 and -A=2) then working your way up to acceptable results.
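In code, those options correspond to setters on weka.attributeSelection.PrincipalComponents. A sketch of the 'lax' starting point suggested above; the setMaximumAttributeNames call is an assumption and may differ by Weka version:

PrincipalComponents pca = new PrincipalComponents();
pca.setVarianceCovered(0.1);            // -R
// pca.setMaximumAttributeNames(2);     // -A, if your Weka version exposes this setter
Ranker ranker = new Ranker();
ranker.setNumToSelect(100);             // additionally cap how many components are kept
AttributeSelection selector = new AttributeSelection();
selector.setEvaluator(pca);
selector.setSearch(ranker);
selector.SelectAttributes(instances);
Instances reduced = selector.reduceDimensionality(instances);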
Best

For the construction of your covariance matrix, you can use the following formula, which is also the one MATLAB uses. It is faster than the Apache library.
Here Matrix is an m x n matrix (m --> #databaseFaces).
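The formula in question is presumably the standard sample covariance that MATLAB's cov computes: center each column of the m x n matrix X, then C = Xc' * Xc / (m - 1). A minimal plain-Java sketch of that computation, assuming one sample per row:

double[][] X = { {1, 2}, {3, 4}, {5, 7} };   // example m x n data matrix, one sample per row
int m = X.length, n = X[0].length;

// Center each column.
double[][] Xc = new double[m][n];
for (int j = 0; j < n; j++) {
    double mean = 0;
    for (int i = 0; i < m; i++) mean += X[i][j];
    mean /= m;
    for (int i = 0; i < m; i++) Xc[i][j] = X[i][j] - mean;
}

// C = Xc' * Xc / (m - 1): an n x n covariance matrix.
double[][] C = new double[n][n];
for (int j = 0; j < n; j++) {
    for (int k = 0; k < n; k++) {
        double s = 0;
        for (int i = 0; i < m; i++) s += Xc[i][j] * Xc[i][k];
        C[j][k] = s / (m - 1);
    }
}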