Incorrect class prediction using Weka - java

I am using the WEKA API weka-stable-3.8.1.
I have been trying to use the J48 decision tree (WEKA's C4.5 implementation).
My data has around 22 features and a nominal class with two possible values: yes or no.
While evaluating with the following code:
Classifier model = (Classifier) weka.core.SerializationHelper.read(trainedModelDestination);
Evaluation evaluation = new Evaluation(trainingInstances);
evaluation.evaluateModel(model, testingInstances);
System.out.println("Number of correct predictions : "+evaluation.correct());
I get all predictions correct.
But when I try these test cases individually using:
for(Instance i : testingInstances){
double predictedClassLabel = model.classifyInstance(i);
System.out.println("predictedClassLabel : "+predictedClassLabel);
}
I always get the same output, i.e. 0.0.
Why is this happening?

If the provided snippet is indeed from your code, you seem to be classifying only the first test instance: testingInstances.firstInstance().
Instead, loop over the test instances and classify each one:
for(Instance i : testingInstances){
double predictedClassLabel = model.classifyInstance(i);
System.out.println("predictedClassLabel : "+predictedClassLabel);
}

Should have updated much sooner.
Here's how I fixed this:
During the training phase, the model learns from your training set, and while learning from this set it encounters categorical/nominal features as well.
Most algorithms require numerical values to work, so to deal with this the algorithm maps each nominal value to a specific numerical value (longer explanation here).
Since the algorithm learned this mapping during the training phase, the Instances object holds this information. During the testing phase you have to use the same Instances structure that was created during the training phase; otherwise the classifier will not correctly map your nominal values to the values it expects.
Note:
This kind of encoding gives biased training results in non-tree-based models, and techniques like one-hot encoding should be used in such cases.
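A minimal sketch of reusing the training header at prediction time (the attribute name "color" and value "red" are purely illustrative):
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

// Empty copy of the training data: same attributes, same nominal value ordering.
Instances header = new Instances(trainingInstances, 0);
header.setClassIndex(trainingInstances.classIndex());

// Build the test instance against that header so nominal strings resolve
// to the same internal indices the model was trained with.
Instance inst = new DenseInstance(header.numAttributes());
inst.setDataset(header);
inst.setValue(header.attribute("color"), "red"); // illustrative attribute/value

double predicted = model.classifyInstance(inst);
System.out.println(header.classAttribute().value((int) predicted));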

Related

Java - Efficient evaluation of user-input math functions (preparation possible, existing variables)

In a Java program which has a variable t counting up the time (relative to the program start, not system time), how can I turn a user-input String into a math formula that can be evaluated efficiently when needed?
(Basically, the preparation of the formula can be slow as it happens pre-runtime, but each stored function may be called several times during runtime and then has to be evaluated efficiently.)
As I could not find a math parser that keeps a formula loaded for later reference, instead of finding a general graph solving the equation y=f(x), I was considering having my Java program generate a script (JS, Python, etc.) from the input String and then call said script with the current t as an input parameter.
However, I have been told that scripts are rather slow and thus impractical for real-time applications.
Is there a more efficient way of doing this? (I would even consider making my Java application generate and compile C code for every user input if this were viable.)
Edit: A tree construct does work for storing expressions, but it is still fairly slow to evaluate: as far as I understand, I would have to traverse the tree object on every evaluation, which requires more calls than solving the equation directly. Instead I will attempt to generate additional Java classes.
What I do is generate Java code at runtime and compile it. There are a number of libraries to help you do this; one I wrote is https://github.com/OpenHFT/Java-Runtime-Compiler. This way it can be as efficient as if you had hand-written the Java code yourself, and if called enough times it will be compiled to native code.
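A minimal sketch of this approach (the generated class name and the choice of DoubleUnaryOperator as the interface are illustrative; CompilerUtils.CACHED_COMPILER.loadFromJava is the library's documented entry point):
import java.util.function.DoubleUnaryOperator;
import net.openhft.compiler.CompilerUtils;

// Wrap the user's expression in a tiny generated class and compile it once.
String userExpr = "Math.sin(t) * t + 1"; // illustrative user input
String source =
    "package gen;\n" +
    "public class Formula implements java.util.function.DoubleUnaryOperator {\n" +
    "    public double applyAsDouble(double t) { return " + userExpr + "; }\n" +
    "}\n";
Class<?> clazz = CompilerUtils.CACHED_COMPILER.loadFromJava("gen.Formula", source);
DoubleUnaryOperator f = (DoubleUnaryOperator) clazz.getDeclaredConstructor().newInstance();

// From here on, evaluation is a plain JIT-compiled method call.
double y = f.applyAsDouble(0.5);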
Can you provide some information on the assumed function type and the required performance? Maybe it would be enough to use a math parser library that pre-compiles the string containing the math formula with variables just once, and then uses this pre-compiled form to deliver results even as the variable values change. This kind of solution is pretty fast, as it typically does not require repeated string parsing, syntax checking, and so on.
An example of such an open-source math parser, which I recently used for my own project, is mXparser:
mXparser on GitHub
http://mathparser.org/
Usage example containing a function definition:
Function f = new Function("f(x,y) = sin(x) + cos(y)");
double v1 = f.calculate(1,2);
double v2 = f.calculate(3,4);
double v3 = f.calculate(5,6);
In the above code the real string parsing is done just once, before calculating v1. The further calculations of v1, v2 (and any possible vn) are done in fast mode.
Additionally, you can use the function definition in a string expression:
Expression e = new Expression("f(1,2)+f(3,4)", f);
double v = e.calculate();

Automate real time data using java

I am a newbie to automation and Java. I am working on a problem which requires me to read real-time stock market data from the database and verify it against the value seen in the UI. I am OK with approximations of up to 5% in the value. To verify that these tests have passed, it is important for me to assert the values against the value in the UI.
I have a small piece of logic to verify these values; I wanted to know if this is a good way of coding in Java or whether there is a better way to achieve these results.
Algorithm:
1. Read the int/float value from the db.
2. Calculate 5% of the value from step 1.
3. Get the value from the UI and check whether it is greater than or equal to the value from step 2.
4. If greater, I say Assert.assertEquals(true, true); else I fail my assert.
If there is a better way to work with these values, I'd welcome a better answer.
It's more usual to have your assertion represent the meaning of your test; having to assert(true, true) does not do this. So:
3. Calculate the absolute difference between the value obtained in step 1 and the UI value (when I say absolute, remember that the UI might be higher or lower than the db value; you need to make the difference always positive).
4. Assert.assertTrue(difference < theFivePercentValue);
Also, you could consider using the Hamcrest matchers with JUnit, which include a closeTo() method.
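A minimal sketch of that idea (the two read helpers are illustrative placeholders):
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.closeTo;

// Assert that the UI value is within 5% of the database value.
double dbValue = readValueFromDb(); // illustrative helper
double uiValue = readValueFromUi(); // illustrative helper
assertThat(uiValue, closeTo(dbValue, Math.abs(dbValue) * 0.05));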

How does WEKA normalize attributes?

Suppose I feed WEKA some dataset and set a normalization filter on the attributes so the values lie between 0 and 1. Then suppose the normalization is done by dividing by the maximum value, and the model is built. What happens if I deploy the model and one of the new instances to be classified has a feature value larger than the maximum in the training set? How is such a situation handled? Does it just take 1, does it take more than 1, or does it throw an exception?
The documentation doesn't specify this for filters in general, so it must depend on the filter. I looked at the source code of weka.filters.unsupervised.attribute.Normalize, which I assume you are using, and I don't see any bounds checking in it.
The actual scaling code is in the Normalize.convertInstance() method:
value = (vals[j] - m_MinArray[j]) / (m_MaxArray[j] - m_MinArray[j])
* m_Scale + m_Translation;
Barring any (unlikely) additional checks outside this method, I'd say that it will scale to a value greater than 1 in the situation you describe. To be 100% sure, your best bet is to write a test case, invoke the filter yourself, and find out. With libraries that haven't specified their behaviour in the Javadoc, you never know what the next release will do, so if you greatly depend on a particular behaviour, it's not a bad idea to write an automated test that regression-tests it.
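A minimal sketch of such a probe (not from the original answer; the single attribute and the values are illustrative):
import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

ArrayList<Attribute> atts = new ArrayList<>();
atts.add(new Attribute("x"));
Instances train = new Instances("probe", atts, 2);
train.add(new DenseInstance(1.0, new double[] {0.0}));
train.add(new DenseInstance(1.0, new double[] {10.0}));

Normalize norm = new Normalize();
norm.setInputFormat(train);
Filter.useFilter(train, norm); // the filter learns min=0, max=10

// Feed a value above the training maximum and inspect the output.
DenseInstance probe = new DenseInstance(1.0, new double[] {15.0});
probe.setDataset(train);
norm.input(probe);
System.out.println(norm.output()); // 1.5 would confirm there is no clipping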
I had the same question, and the following approach may help you.
I assume you use weka.filters.unsupervised.attribute.Normalize to normalize your data.
As Erwin Bolwidt said, WEKA uses
value = (vals[j] - m_MinArray[j]) / (m_MaxArray[j] - m_MinArray[j])
* m_Scale + m_Translation;
to normalize your attributes.
Don't forget that the Normalize class has these two methods:
public double[] getMinArray()
public double[] getMaxArray()
which return the calculated minimum/maximum values for the attributes in the data.
You can store those minimum/maximum values and then apply the formula to normalize your data yourself.
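A minimal sketch of that (assuming the default m_Scale = 1 and m_Translation = 0; train, rawValue, and j are illustrative):
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

Normalize norm = new Normalize();
norm.setInputFormat(train);
Filter.useFilter(train, norm); // computes per-attribute min/max

double[] min = norm.getMinArray();
double[] max = norm.getMaxArray();

// Apply the same formula to a raw value of attribute j at prediction time.
double scaled = (rawValue - min[j]) / (max[j] - min[j]);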
Remember that you can set attribute values via the Instance class, and you can classify a single result with Evaluation.evaluationForSingleInstance.
I'll add the link later; I hope this helps.
Thank you

Naive Bayes Text Classifier - determining when a document should be labelled 'unclassified'

I have designed and implemented a Naive Bayes text classifier (in Java). I am primarily using it to classify tweets into 20 classes. To determine the probability that a document belongs to a class I use:
foreach (class)
{
    Probability = (P(bag of words occurring for class) * P(class)) / P(bag of words occurring globally)
}
What is the best way to determine whether a bag of words really shouldn't belong to any class? I'm aware I could just set a minimum threshold for P(bag of words occurring for class) and, if all the classes fall under that threshold, label the document as unclassified; however, I realise this prevents the classifier from being sensitive.
Would an option be to create an Unclassified class and train it with documents I deem unclassifiable?
Thanks,
Mark
--- Edit ---
I just had a thought: I could set a maximum threshold for P(bag of words occurring globally) * (number of words in document). This would mean that any document consisting mainly of common words (typically the tweets I want to filter out), e.g. "Yes I agree with you", would be filtered out. Your thoughts on this would be appreciated also.
Or perhaps I should compute the standard deviation of the class probabilities and, if it is low, conclude the document should be unclassified?
I see two different options, viewing the problem as a set of 20 binary classification problems:
You can compute the likelihood ratio P(doc being in class) / P(doc not being in class). Some Naive Bayes implementations use this kind of method.
Assuming that you have some evaluation measure, you can compute a threshold per class and optimise it through cross-validation. This is the standard way of applying text classification: you would use thresholds (one per class), but they would be based on your data. In your case SCut or SCutFBR would be the best option, as explained in this paper. A sketch of the per-class threshold idea follows below.
Regards,
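A minimal sketch of per-class thresholding (posterior() and the thresholds map are hypothetical placeholders for your own implementation):
// Pick the best-scoring class, then reject it if the score falls below
// the threshold tuned for that class via cross-validation.
String best = null;
double bestScore = Double.NEGATIVE_INFINITY;
for (String c : classes) {
    double p = posterior(doc, c); // P(bag|c) * P(c) / P(bag), as in the question
    if (p > bestScore) {
        bestScore = p;
        best = c;
    }
}
String label = (bestScore >= thresholds.get(best)) ? best : "unclassified";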

Weka's PCA is taking too long to run

I am trying to use WEKA for feature selection using the PCA algorithm.
My original feature space contains ~9,000 attributes over 2,700 samples.
I tried to reduce the dimensionality of the data using the following code:
AttributeSelection selector = new AttributeSelection();
PrincipalComponents pca = new PrincipalComponents();
Ranker ranker = new Ranker();
// PCA as the evaluator, Ranker as the search method
selector.setEvaluator(pca);
selector.setSearch(ranker);
Instances instances = SamplesManager.asWekaInstances(trainSet);
try {
    selector.SelectAttributes(instances);
    return SamplesManager.asSamplesList(selector.reduceDimensionality(instances));
} catch (Exception e) {
    ...
}
However, it did not finish running within 12 hours. It is stuck in the method selector.SelectAttributes(instances).
My questions are:
Is such a long computation time expected for WEKA's PCA, or am I using PCA wrongly?
If the long run time is expected:
How can I tune the PCA algorithm to run much faster? Can you suggest an alternative (with example code of how to use it)?
If it is not:
What am I doing wrong? How should I invoke PCA using WEKA to get my reduced dimensionality?
Update: The comments confirm my suspicion that it is taking much more time than expected.
I'd like to know: how can I run PCA in Java, using WEKA or an alternative library?
Added a bounty for this one.
After digging into the WEKA code, the bottleneck is creating the covariance matrix and then calculating its eigenvectors. Even switching to a sparse matrix implementation (I used COLT's SparseDoubleMatrix2D) did not help.
The solution I came up with was to first reduce the dimensionality using a fast method (I used an information gain ranker with filtering based on document frequency), and then apply PCA on the reduced feature space to reduce it further.
The code is more complex, but it essentially comes down to this:
Ranker ranker = new Ranker();
InfoGainAttributeEval ig = new InfoGainAttributeEval();
Instances instances = SamplesManager.asWekaInstances(trainSet);
// First pass: rank attributes by information gain and keep the top ones.
ig.buildEvaluator(instances);
int[] firstAttributes = ranker.search(ig, instances);
int[] candidates = Arrays.copyOfRange(firstAttributes, 0, FIRST_SIZE_REDUCTION);
instances = reduceDimensions(instances, candidates);
// Second pass: run PCA on the reduced feature space.
PrincipalComponents pca = new PrincipalComponents();
pca.setVarianceCovered(var);
ranker = new Ranker();
ranker.setNumToSelect(numFeatures);
AttributeSelection selection = new AttributeSelection();
selection.setEvaluator(pca);
selection.setSearch(ranker);
selection.SelectAttributes(instances);
instances = selection.reduceDimensionality(instances);
However, this method scored worse than using greedy information gain with a ranker when I cross-validated for estimated accuracy.
It looks like you're using the default configuration for PCA, and judging by the long runtime, it is likely doing far too much work for your purposes.
Take a look at the options for PrincipalComponents.
I'm not sure whether -D means it will normalize the data for you or whether you have to do it yourself. You want your data to be normalized (centered about the mean), though, so I would do this manually first.
-R sets the amount of variance you want accounted for. The default is 0.95. The correlation in your data might not be that strong, so try setting it lower, to something like 0.8.
-A sets the maximum number of attributes to include. I presume the default is all of them. Again, you should try setting it to something lower.
I suggest starting out with very lax settings (e.g. -R=0.1 and -A=2), then working your way up to acceptable results; see the sketch below for setting these options programmatically.
Best
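A minimal sketch of setting those options in code (assuming setVarianceCovered and setMaximumAttributeNames are the setters behind -R and -A, as in the WEKA 3.8 API):
import weka.attributeSelection.PrincipalComponents;

PrincipalComponents pca = new PrincipalComponents();
pca.setVarianceCovered(0.8);     // -R 0.8: keep components covering 80% of the variance
pca.setMaximumAttributeNames(2); // -A 2: cap the attributes used in component names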
For the construction of your covariance matrix, you can use the standard sample-covariance formula that MATLAB also uses; it is faster than the Apache library:
C = Xc' * Xc / (m - 1)
where X is an m x n matrix (m --> #databaseFaces) and Xc is X with the column means subtracted.
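A minimal sketch of that computation in plain Java (a direct implementation of the sample-covariance formula above, not code from the original answer):
// Sample covariance: C = Xc' * Xc / (m - 1), with Xc the mean-centered X.
static double[][] covariance(double[][] X) {
    int m = X.length, n = X[0].length;
    double[] mean = new double[n];
    for (double[] row : X)
        for (int j = 0; j < n; j++)
            mean[j] += row[j] / m;
    double[][] C = new double[n][n];
    for (double[] row : X)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[j][k] += (row[j] - mean[j]) * (row[k] - mean[k]) / (m - 1);
    return C;
}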
