Background:
If I open the Weka Explorer GUI and train and test a J48 tree using the NSL-KDD training and testing datasets, a pruned tree is produced. The Weka Explorer GUI expresses the algorithm's reasoning for classifying something as an anomaly or not in terms of queries such as src_bytes <= 28.
Screenshot of Weka Explorer GUI showing pruned tree
Question:
Referring to the pruned tree example produced by the Weka Explorer GUI, how can I programmatically have Weka express the reasoning for each instance classification in Java?
i.e. Instance A was classified as an anomaly as src_bytes < 28 &&
dst_host_srv_count < 88 && dst_bytes < 3 etc.
So far I've been able to:
Train and test a J48 tree on the NSL-KDD dataset.
Output a description of the J48 tree within Java.
Return the J48 tree as an if-then statement.
But I simply have no idea how, while iterating through each instance during the testing phase, to express the reasoning for each classification, short of manually outputting the J48 tree as an if-then statement each time and adding numerous printlns expressing when each branch was triggered (which I'd really rather not do, as it would dramatically increase the amount of human intervention required in the long term).
Additional Screenshots:
Screenshot of the 'description of the J48 tree within Java'
Screenshot of the 'J48 tree as an if-then statement'
Code:
public class Junction_Tree {
String train_path = "KDDTrain+.arff";
String test_path = "KDDTest+.arff";
double accuracy;
double recall;
double precision;
int correctPredictions;
int incorrectPredictions;
int numAnomaliesDetected;
int numNetworkRecords;
public void run() {
try {
Instances train = DataSource.read(train_path);
Instances test = DataSource.read(test_path);
train.setClassIndex(train.numAttributes() - 1);
test.setClassIndex(test.numAttributes() - 1);
if (!train.equalHeaders(test))
throw new IllegalArgumentException("datasets are not compatible..");
Remove rm = new Remove();
rm.setAttributeIndices("1");
J48 j48 = new J48();
j48.setUnpruned(true);
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(rm);
fc.setClassifier(j48);
fc.buildClassifier(train);
numAnomaliesDetected = 0;
numNetworkRecords = 0;
int n_ana_p = 0;
int ana_p = 0;
correctPredictions = 0;
incorrectPredictions = 0;
for (int i = 0; i < test.numInstances(); i++) {
double pred = fc.classifyInstance(test.instance(i));
String a = "anomaly";
String actual;
String predicted;
actual = test.classAttribute().value((int) test.instance(i).classValue());
predicted = test.classAttribute().value((int) pred);
if (actual.equalsIgnoreCase(a))
numAnomaliesDetected++;
if (actual.equalsIgnoreCase(predicted))
correctPredictions++;
if (!actual.equalsIgnoreCase(predicted))
incorrectPredictions++;
if (actual.equalsIgnoreCase(a) && predicted.equalsIgnoreCase(a))
ana_p++;
if ((!actual.equalsIgnoreCase(a)) && predicted.equalsIgnoreCase(a))
n_ana_p++;
numNetworkRecords++;
}
accuracy = 100.0 * correctPredictions / (correctPredictions + incorrectPredictions);
recall = 100.0 * ana_p / numAnomaliesDetected;
precision = 100.0 * ana_p / (ana_p + n_ana_p);
System.out.println("\n\naccuracy: " + accuracy + ", Correct Predictions: " + correctPredictions
+ ", Incorrect Predictions: " + incorrectPredictions);
writeFile(j48.toSource("J48_if_then"));
writeFile(j48.toString());
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
Junction_Tree JT1 = new Junction_Tree();
JT1.run();
}
}
I have never used it myself, but according to the WEKA documentation the J48 class includes a getMembershipValues method. This method should return an array that indicates the node membership of an instance. One of the few mentions of this method appears to be in this thread on the WEKA forums.
Other than this, I can't find any information on possible alternatives aside from the one you mentioned.
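If it helps, here is a rough, untested sketch of how that method might be called. It assumes the weka.core.PartitionGenerator interface that J48 implements in newer Weka versions (3.7+), and that generatePartition is run over the training data first:
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
// Hypothetical sketch (untested): list the tree nodes an instance falls
// into, via the PartitionGenerator interface implemented by J48.
public static String describeClassification(J48 j48, Instances train, Instance inst) throws Exception {
    j48.generatePartition(train); // build the partition over the training data
    double[] membership = j48.getMembershipValues(inst);
    StringBuilder reasoning = new StringBuilder("visited nodes:");
    for (int node = 0; node < membership.length; node++) {
        if (membership[node] > 0) { // non-zero => the instance passes through this node
            reasoning.append(' ').append(node);
        }
    }
    return reasoning.toString();
}
Note that mapping those node indices back to the actual split conditions (e.g. src_bytes <= 28) would still require walking the tree structure, so treat this as a starting point rather than a complete solution.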
Related
Hi, I coded a single neuron to predict a student's mark for subject D based on the marks they got for subjects A, B and C.
After training my neuron with some historical data containing the 3 marks as well as the actual mark they got for subject D, I then input test data to see how closely the predicted mark would match the actual one.
Below is my Neuron class
public class Neuron
{
double[] Weights = new double[3];
public Neuron(double W1, double W2, double W3)
{
Weights[0] = W1;
Weights[1] = W2;
Weights[2] = W3;
}
public double FnetLinear(int Z1, int Z2, int Z3)
{
return (Z1*Weights[0] + Z2*Weights[1] + Z3*Weights[2]);
}
public void UpdateWeight(int i, double Wi)
{
Weights[i] = Wi;
}
}
And here is my main class
public class Main
{
public int t;
public Neuron neuron;
double LearningRate = 0.00001;
public ArrayList<Marks> TrainingSet, TestSet;
public static void main(String[] args) throws IOException
{
Main main = new Main();
main.run();
}
public void run()
{
TrainingSet = ReadCSV("G:\\EVOS\\EVO_Assignemnt1\\resources\\Streamdata.csv");
TestSet = ReadCSV("G:\\EVOS\\EVO_Assignemnt1\\resources\\Test.csv");
Random ran = new Random();
neuron = new Neuron(ran.nextDouble(), ran.nextDouble(), ran.nextDouble());
train();
Test();
}
public void train()
{
t = 0;
while(t<1000000)
{
for(Marks mark: TrainingSet)
{
for(int i=0; i<neuron.Weights.length; i++)
{
double yp = neuron.FnetLinear(mark.marks[0] , mark.marks[1], mark.marks[2]);
// gradient-descent step for squared error: w_i <- w_i - LR * (-2 * (target - yp)) * x_i
double wi = neuron.Weights[i] - LearningRate*(-2*(mark.marks[3]-yp))*mark.marks[i];
neuron.UpdateWeight(i, wi);
}
}
t++;
}
}
public void Test()
{
System.out.println("Test Set results:");
int count = 1;
for(Marks mark: TestSet)
{
double fnet = neuron.FnetLinear(mark.marks[0] , mark.marks[1], mark.marks[2]);
System.out.println("Mark " + count + ": " + fnet);
count++;
}
}
public static ArrayList<Marks> ReadCSV(String csv)
{
ArrayList<Marks> temp = new ArrayList<>();
String line;
BufferedReader br;
try
{
br = new BufferedReader(new FileReader(csv));
while((line=br.readLine()) != null)
{
String[] n = line.split(",");
Marks stud = new Marks(Integer.valueOf(n[0]), Integer.valueOf(n[1]), Integer.valueOf(n[2]), Integer.valueOf(n[3]));
temp.add(stud);
}
}
catch (Exception e)
{
System.out.println("ERROR");
}
return temp;
}
}
This is the test data with the last number being the actual mark.
After running the test data I get results around these:
As you can see, the first 4 mark predictions are way off from the actual marks.
I followed the textbook explanation in Computational Intelligence: An Introduction (Chapter 2, if you are curious).
However, I would like to know what I am doing wrong. How can I get more accurate results?
Neural networks are very black-box-esque; because of this, it's pretty hard to say exactly why your mark predictions are way off.
That being said, here are some of the main methods of increasing the accuracy of your neural network:
Adjust the number of layers and neurons. I notice you're only using a single neuron, and a single neuron in a neural network is typically just... bad. You're never going to get good results like that. Neural networks need enough complexity, in the form of layering and neuron count, to calculate or predict whatever it is you're trying to teach them. A single neuron by itself really can't learn anything useful, and this is probably a big reason why your network's accuracy is so bad.
Train for longer. I notice you're only training your network for 1 million passes; this is not always enough. For reference, the last time I trained a neural network, I used over 30 million sets of input/output.
Retrain your network with different starting weights; Randomized starting weights are great, but sometimes you just get a bad batch of starting weights. In the same project where I used 30 million input/output sets, I also tried over 25 different configurations of initial starting weights across 15 different layouts of nodes and layers.
Pick a different activation function. Linear activation functions are usually not that useful. I usually default to a sigmoid function to start off, unless there's a specific other function that fits the use case I'm trying to train (see the sketch after this answer).
A common pitfall that can cause low accuracy is bad training data. Make sure the training data you're using is correct and internally consistent with whatever you're trying to teach.
As a final note, I find myself having some trouble understanding exactly what kind of neural network you're trying to write. I've assumed this is an attempt at a feed-forward, back-propagation neural network with a single neuron in it, but most of the advice here should still apply.
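To illustrate the activation-function point, here is a minimal sketch of a sigmoid version of the neuron (my own illustration, not code from the question). Note the extra yp * (1 - yp) derivative term in the gradient, and that inputs and target marks would need rescaling to [0, 1] for the sigmoid output to make sense:
// Illustrative sketch only: a single sigmoid neuron with the matching
// gradient-descent update. Inputs and targets assumed rescaled to [0, 1].
public class SigmoidNeuron {
    double[] weights = new double[3];
    public double fnet(double z1, double z2, double z3) {
        double net = z1 * weights[0] + z2 * weights[1] + z3 * weights[2];
        return 1.0 / (1.0 + Math.exp(-net)); // sigmoid activation
    }
    public void trainStep(double[] in, double target, double learningRate) {
        double yp = fnet(in[0], in[1], in[2]);
        // dE/dw_i for squared error through a sigmoid:
        // -2 * (target - yp) * yp * (1 - yp) * in[i]
        for (int i = 0; i < weights.length; i++) {
            weights[i] -= learningRate * (-2 * (target - yp)) * yp * (1 - yp) * in[i];
        }
    }
}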
I have been trying to use jlibSVM.
I want to use it for multi-output regression.
For example, my feature set / inputs will be x1, x2, x3 and my outputs / target values will be y1, y2.
Is this possible using the libSVM library?
The API docs are not clear and there is no example app showing the use of jlibsvm, so I tried to modify the code inside legacyexec/svm_train.java.
The author originally created the app to use only one output/target value.
This is seen in the part where the author reads the training file:
private void read_problem() throws IOException
{
BufferedReader fp = new BufferedReader(new FileReader(input_file_name));
Vector<Float> vy = new Vector<Float>();
Vector<SparseVector> vx = new Vector<SparseVector>();
int max_index = 0;
while (true)
{
String line = fp.readLine();
if (line == null)
{
break;
}
StringTokenizer st = new StringTokenizer(line, " \t\n\r\f:");
vy.addElement(Float.parseFloat(st.nextToken()));
int m = st.countTokens() / 2;
SparseVector x = new SparseVector(m);
for (int j = 0; j < m; j++)
{
//x[j] = new svm_node();
x.indexes[j] = Integer.parseInt(st.nextToken());
x.values[j] = Float.parseFloat(st.nextToken());
}
if (m > 0)
{
max_index = Math.max(max_index, x.indexes[m - 1]);
}
vx.addElement(x);
}
// ... (rest of the method omitted in the question)
}
I tried to modify it so that the vector vy accepts a sparse vector with 2 values.
The program executes, but the model file seems to be wrong.
Can anyone verify whether they have used jlibsvm for multi-output SVM regression?
If yes, can you explain how you achieved this?
If not, does anyone know of a similar SVM implementation in Java?
The classic SVM algorithm does not support multi-dimensional outputs. One way to work around this is to train a separate SVM model for each output dimension.
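As a rough sketch of that workaround using the plain Java libsvm API (my own illustration, untested; the trainPerOutput/predict names are hypothetical, and param.svm_type should be svm_parameter.EPSILON_SVR for regression):
import libsvm.*;
// Sketch: multi-output regression via one SVR model per output dimension.
// x holds the shared feature vectors; y[d][k] is output d for example k.
public class MultiOutputSVR {
    public static svm_model[] trainPerOutput(svm_node[][] x, double[][] y, svm_parameter param) {
        svm_model[] models = new svm_model[y.length];
        for (int d = 0; d < y.length; d++) {
            svm_problem prob = new svm_problem();
            prob.l = x.length;
            prob.x = x;    // shared inputs
            prob.y = y[d]; // labels for this output dimension
            models[d] = svm.svm_train(prob, param);
        }
        return models;
    }
    public static double[] predict(svm_model[] models, svm_node[] x) {
        double[] out = new double[models.length];
        for (int d = 0; d < models.length; d++) {
            out[d] = svm.svm_predict(models[d], x);
        }
        return out;
    }
}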
I want to use an SVM (support vector machine) in my program, but I cannot get correct results.
I want to know how data must be trained for an SVM.
What I am doing:
Suppose we have 5 documents (the numbers are just an example), 3 of them in the first category and the other 2 in the second category. I merge the documents within each category (meaning the 3 docs in the first category are merged into one document), and after that I make a train array like this:
double[][] train = new double[cat1.getDocument().getAttributes().size() + cat2.getDocument().getAttributes().size()][];
and I will fill the array like this:
int i = 0;
Iterator<String> iteraitor = cat1.getDocument().getAttributes().keySet().iterator();
Iterator<String> iteraitor2 = cat2.getDocument().getAttributes().keySet().iterator();
while (i < train.length) {
if (i < cat2.getDocument().getAttributes().size()) {
while (iteraitor2.hasNext()) {
String key = (String) iteraitor2.next();
Long value = cat2.getDocument().getAttributes().get(key);
double[] vals = { 0, value };
train[i] = vals;
i++;
System.out.println(vals[0] + "," + vals[1]);
}
} else {
while (iteraitor.hasNext()) {
String key = (String) iteraitor.next();
Long value = cat1.getDocument().getAttributes().get(key);
double[] vals = { 1, value };
train[i] = vals;
i++;
System.out.println(vals[0] + "," + vals[1]);
}
i++;
}
I then continue like this to get the model:
svm_problem prob = new svm_problem();
int dataCount = train.length;
prob.y = new double[dataCount];
prob.l = dataCount;
prob.x = new svm_node[dataCount][];
for (int k = 0; k < dataCount; k++) {
double[] features = train[k];
prob.x[k] = new svm_node[features.length - 1];
for (int j = 1; j < features.length; j++) {
svm_node node = new svm_node();
node.index = j;
node.value = features[j];
prob.x[k][j - 1] = node;
}
prob.y[k] = features[0];
}
svm_parameter param = new svm_parameter();
param.probability = 1;
param.gamma = 0.5;
param.nu = 0.5;
param.C = 1;
param.svm_type = svm_parameter.C_SVC;
param.kernel_type = svm_parameter.LINEAR;
param.cache_size = 20000;
param.eps = 0.001;
svm_model model = svm.svm_train(prob, param);
Is this approach correct? If not, please help me correct it.
These two answers are both correct: answer one, answer two.
Even without examining the code one can find conceptual errors:
Suppose we have 5 documents, 3 of them in the first category and the other 2 in the second category. I merge the documents within each category (meaning the 3 docs in the first category are merged into one document), and after that I make a train array like this
So:
Training on 5 documents won't give any reasonable results with any machine learning model. These are statistical models, and there are no reasonable statistics in 5 points in R^n, where n ~ 10,000.
You should not merge anything. That approach can work for Naive Bayes, which does not really treat documents as a "whole" but rather as probabilistic dependencies between features and classes. In an SVM, each document should be a separate point in the R^n space, where n can be the number of distinct words (for a bag-of-words/set-of-words representation).
A problem might be that you do not terminate each set of features in a training example with an index of -1, which you should do according to the README...
That is, if you have one example with two features, I think you should do:
Index[0]: 0
Value[0]: 22
Index[1]: 1
Value[1]: 53
Index[2]: -1
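If that terminator really is required by the port you're using (worth double-checking: the stock Java libsvm API does not appear to need it for in-memory svm_node arrays, since index -1 termination is a C libsvm convention that some ports mirror), the population loop from the question could be adjusted like this:
// Sketch: the question's loop plus a trailing svm_node with index -1.
double[] features = train[k];
prob.x[k] = new svm_node[features.length]; // features.length - 1 real nodes + 1 terminator
for (int j = 1; j < features.length; j++) {
    svm_node node = new svm_node();
    node.index = j;
    node.value = features[j];
    prob.x[k][j - 1] = node;
}
svm_node terminator = new svm_node();
terminator.index = -1; // marks the end of this example's feature list
prob.x[k][features.length - 1] = terminator;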
Good luck!
Using SVMs to classify text is a common task. You can check out research papers by Joachims [1] regarding SVM text classification.
Basically you have to:
1. Tokenize your documents
2. Remove stopwords
3. Apply a stemming technique
4. Apply a feature selection technique (see [2])
5. Transform your documents using the features selected in step 4 (a simple scheme is binary: 0 if a feature is absent, 1 if it is present; other measures like TFC also work)
6. Train your SVM and be happy :)
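To make steps 1-5 concrete, here is a minimal, illustrative sketch (my own; the stopword list and tokenization are stand-ins, stemming is omitted, and the vocabulary is assumed to come from your feature selection step):
import java.util.*;
// Illustrative bag-of-words pipeline: tokenize, drop stopwords, and map a
// document to a binary feature vector over a pre-selected vocabulary.
public class BagOfWords {
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "a", "is", "of"));
    public static double[] toBinaryVector(String document, List<String> vocabulary) {
        Set<String> tokens = new HashSet<>();
        for (String tok : document.toLowerCase().split("\\W+")) {
            if (!tok.isEmpty() && !STOPWORDS.contains(tok)) {
                tokens.add(tok); // a stemmer would be applied here
            }
        }
        double[] vec = new double[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            vec[i] = tokens.contains(vocabulary.get(i)) ? 1.0 : 0.0;
        }
        return vec;
    }
}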
[1] T. Joachims: Text Categorization with Support Vector Machines: Learning with Many Relevant Features; Springer: Heidelberg, Germany, 1998, doi:10.1007/BFb0026683.
[2] Y. Yang, J. O. Pedersen: A Comparative Study on Feature Selection in Text Categorization. International Conference on Machine Learning, 1997, 412-420.
I have a machine learning scheme in which I use the Java classes from Weka to implement machine learning in a MATLAB script. I then upload the classifier's model to a database, since I need to perform the classification on a different machine in a different language (Obj-C). Evaluating the network was fairly straightforward to program, but I need the values that WEKA used to normalize the data set before training, so I can use them when evaluating the network later. Does anyone know how to get the normalization factors that WEKA uses when training a Multilayer Perceptron network? I would prefer the answer to be in Java.
After some digging through the WEKA source code and documentation, this is what I've come up with. Even though there is a filter in WEKA called Normalize, the Multilayer Perceptron doesn't use it; instead it uses a bit of internal code that looks like this:
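// For each attribute, WEKA stores the half-range and the midpoint, then
// rescales values to roughly [-1, 1]: x' = (x - base) / range.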
m_attributeRanges = new double[inst.numAttributes()];
m_attributeBases = new double[inst.numAttributes()];
for (int noa = 0; noa < inst.numAttributes(); noa++) {
min = Double.POSITIVE_INFINITY;
max = Double.NEGATIVE_INFINITY;
for (int i=0; i < inst.numInstances();i++) {
if (!inst.instance(i).isMissing(noa)) {
value = inst.instance(i).value(noa);
if (value < min) {
min = value;
}
if (value > max) {
max = value;
}
}
}
m_attributeRanges[noa] = (max - min) / 2;
m_attributeBases[noa] = (max + min) / 2;
if (noa != inst.classIndex() && m_normalizeAttributes) {
for (int i = 0; i < inst.numInstances(); i++) {
if (m_attributeRanges[noa] != 0) {
inst.instance(i).setValue(noa, (inst.instance(i).value(noa)
- m_attributeBases[noa]) /
m_attributeRanges[noa]);
}
else {
inst.instance(i).setValue(noa, inst.instance(i).value(noa) -
m_attributeBases[noa]);
}
} // end inner for over instances
} // end if (normalize this attribute)
} // end outer for over attributes
So the only values I should need to transmit to the other system for evaluating this network are the min and the max for each attribute. Luckily for me, there turned out to be a method on the filter weka.filters.unsupervised.attribute.Normalize that returns a double array of the mins and the maxes for a processed dataset. All I had to do then was tell the Multilayer Perceptron not to automatically normalize my data, and to process it separately with the filter, so I could extract the mins and maxes to send to the database along with the weights and everything else.
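In case it helps anyone else, here is a rough sketch of that setup (assuming the getMinArray/getMaxArray accessors on the Normalize filter and setNormalizeAttributes(false) on the perceptron, which is what I used; treat the exact calls as needing verification against your WEKA version):
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
// Sketch: normalize externally so the per-attribute min/max can be
// exported, then train an MLP with its internal normalization off.
public class ExportNormalization {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");
        train.setClassIndex(train.numAttributes() - 1);
        Normalize norm = new Normalize();
        norm.setInputFormat(train);
        Instances normalized = Filter.useFilter(train, norm);
        // Per-attribute minima and maxima, to ship to the other system.
        double[] mins = norm.getMinArray();
        double[] maxs = norm.getMaxArray();
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setNormalizeAttributes(false); // data is already normalized
        mlp.buildClassifier(normalized);
    }
}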
I'm using libsvm, and the documentation leads me to believe that there's a way to output the estimated probability of a predicted classification. Is this so? And if so, can anyone provide a clear example of how to do it in code?
Currently, I'm using the Java libraries in the following manner
SvmModel model = Svm.svm_train(problem, parameters);
SvmNode x[] = getAnArrayOfSvmNodesForProblem();
double predictedValue = Svm.svm_predict(model, x);
Given your code-snippet, I'm going to assume you want to use the Java API packaged with libSVM, rather than the more verbose one provided by jlibsvm.
To enable prediction with probability estimates, train a model with the svm_parameter field probability set to 1. Then, just change your code so that it calls the svm method svm_predict_probability rather than svm_predict.
Modifying your snippet, we have:
parameters.probability = 1;
svm_model model = svm.svm_train(problem, parameters);
svm_node x[] = problem.x[0]; // let's try the first data pt in problem
double[] prob_estimates = new double[NUM_LABEL_CLASSES];
svm.svm_predict_probability(model, x, prob_estimates);
It's worth knowing that training with multiclass probability estimates can change the predictions made by the classifier. For more on this, see the question Calculating Nearest Match to Mean/Stddev Pair With LibSVM.
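One small note on the snippet above: NUM_LABEL_CLASSES is a placeholder for the number of classes the model was trained on. If you don't have it handy, svm.svm_get_nr_class(model) returns it, and model.label tells you which class label corresponds to each position in prob_estimates.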
The accepted answer worked like a charm. Make sure to set probability = 1 during training.
If you are trying to drop the prediction when the confidence does not meet a threshold, here is a code sample:
double confidenceScores[] = new double[model.nr_class];
svm.svm_predict_probability(model, svmVector, confidenceScores);
/*System.out.println("text="+ text);
for (int i = 0; i < model.nr_class; i++) {
System.out.println("i=" + i + ", labelNum:" + model.label[i] + ", name=" + classLoadMap.get(model.label[i]) + ", score="+confidenceScores[i]);
}*/
// find the max confidence
int maxConfidenceIndex = 0;
double maxConfidence = confidenceScores[maxConfidenceIndex];
for (int i = 1; i < confidenceScores.length; i++) {
if(confidenceScores[i] > maxConfidence){
maxConfidenceIndex = i;
maxConfidence = confidenceScores[i];
}
}
double threshold = 0.3; // set this based on the data and the number of classes
int labelNum = model.label[maxConfidenceIndex];
// reverse map number to name
String targetClassLabel = classLoadMap.get(labelNum);
LOG.info("classNumber:{}, className:{}; confidence:{}; for text:{}",
labelNum, targetClassLabel, (maxConfidence), text);
if (maxConfidence < threshold ) {
LOG.info("Not enough confidence; threshold={}", threshold);
targetClassLabel = null;
}
return targetClassLabel;