I want to use an SVM (support vector machine) in my program, but I am not getting correct results.
I want to know how the training data for an SVM should be prepared.
What I am doing:
Say we have 5 documents (the numbers are just an example): 3 of them belong to the first category and the other 2 to the second category. I merge each category into a single document (i.e. the 3 documents in the first category are merged into one document), and then I build a train array like this:
double[][] train = new double[cat1.getDocument().getAttributes().size() + cat2.getDocument().getAttributes().size()][];
and I will fill the array like this:
int i = 0;
Iterator<String> iteraitor = cat1.getDocument().getAttributes().keySet().iterator();
Iterator<String> iteraitor2 = cat2.getDocument().getAttributes().keySet().iterator();
while (i < train.length) {
    if (i < cat2.getDocument().getAttributes().size()) {
        while (iteraitor2.hasNext()) {
            String key = (String) iteraitor2.next();
            Long value = cat2.getDocument().getAttributes().get(key);
            double[] vals = { 0, value };
            train[i] = vals;
            i++;
            System.out.println(vals[0] + "," + vals[1]);
        }
    } else {
        while (iteraitor.hasNext()) {
            String key = (String) iteraitor.next();
            Long value = cat1.getDocument().getAttributes().get(key);
            double[] vals = { 1, value };
            train[i] = vals;
            i++;
            System.out.println(vals[0] + "," + vals[1]);
        }
        i++;
    }
}
Then I continue like this to get the model:
svm_problem prob = new svm_problem();
int dataCount = train.length;
prob.y = new double[dataCount];
prob.l = dataCount;
prob.x = new svm_node[dataCount][];

for (int k = 0; k < dataCount; k++) {
    double[] features = train[k];
    prob.x[k] = new svm_node[features.length - 1];
    for (int j = 1; j < features.length; j++) {
        svm_node node = new svm_node();
        node.index = j;
        node.value = features[j];
        prob.x[k][j - 1] = node;
    }
    prob.y[k] = features[0];
}

svm_parameter param = new svm_parameter();
param.probability = 1;
param.gamma = 0.5;
param.nu = 0.5;
param.C = 1;
param.svm_type = svm_parameter.C_SVC;
param.kernel_type = svm_parameter.LINEAR;
param.cache_size = 20000;
param.eps = 0.001;

svm_model model = svm.svm_train(prob, param);
Is this approach correct? If not, please help me correct it.
These two answers are correct: answer one, answer two.
Even without examining the code one can find conceptual errors:
Say we have 5 documents (the numbers are just an example): 3 of them belong to the first category and the other 2 to the second category. I merge each category into a single document (i.e. the 3 documents in the first category are merged into one document), and then I build a train array like this
So:
Training on 5 documents won't give any reasonable results with any machine learning model. These are statistical models; there are no meaningful statistics in 5 points in R^n, where n ~ 10,000.
You should not merge anything. Such an approach can work for Naive Bayes, which does not really treat documents as a "whole" but rather as probabilistic dependencies between features and classes. In an SVM, each document should be a separate point in the R^n space, where n can be the number of distinct words (for a bag-of-words/set-of-words representation).
A problem might be that you do not terminate each set of features in a training example with an index of -1, which you should do according to the README...
I.e. if you have one example with two features, I think you should do:
Index[0]: 0
Value[0]: 22
Index[1]: 1
Value[1]: 53
Index[2]: -1
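For what it's worth, here is a minimal sketch of how that one example could be built with the svm_node class from the libsvm Java API, assuming the terminating -1 entry is indeed required by your version, as described above:

// one training example with two features, followed by a terminating node
svm_node[] example = new svm_node[3];

svm_node n0 = new svm_node();
n0.index = 0;    // Index[0]
n0.value = 22;   // Value[0]
example[0] = n0;

svm_node n1 = new svm_node();
n1.index = 1;    // Index[1]
n1.value = 53;   // Value[1]
example[1] = n1;

svm_node end = new svm_node();
end.index = -1;  // end-of-example marker (assumed to be required here)
example[2] = end;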
Good luck!
Using SVMs to classify text is a common task. You can check out research papers by Joachims [1] regarding SVM text classification.
Basically you have to:
Tokenize your documents
Remove stopwords
Apply a stemming technique
Apply a feature selection technique (see [2])
Transform your documents using the features selected in step 4 (a simple option is binary weighting: 0 if the feature is absent, 1 if it is present; other measures like TFC also work); see the sketch after this list
Train your SVM and be happy :)
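A rough sketch of step 5 with the simple binary representation; the feature list and the pre-processed documents below are made-up placeholders that would in practice come from steps 1 to 4:

import java.util.*;

public class BinaryBagOfWords {
    public static void main(String[] args) {
        // Features assumed to come from the feature-selection step (step 4).
        List<String> features = Arrays.asList("price", "market", "goal", "team");

        // Each document is assumed to be already tokenized, stopword-filtered and stemmed.
        List<Set<String>> documents = Arrays.asList(
                new HashSet<>(Arrays.asList("price", "market", "rise")),
                new HashSet<>(Arrays.asList("goal", "team", "match")));

        // Binary weighting: 1 if the feature occurs in the document, 0 otherwise.
        double[][] vectors = new double[documents.size()][features.size()];
        for (int d = 0; d < documents.size(); d++) {
            for (int f = 0; f < features.size(); f++) {
                vectors[d][f] = documents.get(d).contains(features.get(f)) ? 1.0 : 0.0;
            }
        }
        System.out.println(Arrays.deepToString(vectors));
    }
}

Each row of vectors can then be turned into an svm_node[] for training, one document per training example.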
[1] T. Joachims: Text Categorization with Support Vector Machines: Learning with Many Relevant Features; Springer: Heidelberg, Germany, 1998, doi:10.1007/BFb0026683.
[2] Y. Yang, J. O. Pedersen: A Comparative Study on Feature Selection in Text Categorization. International Conference on Machine Learning, 1997, 412-420.
I'm currently programming something where I'm paying a lot of attention to performance and RAM usage.
I came across this problem and I am trying to make a decision. Imagine this situation:
I need to associate a certain class (Location) and an Integer with a String (let's say a name). So a Name has an Id and a Location....
What would be the best approach to this?
First: Create two hashmaps
HashMap<String, Location> one = new HashMap<String, Location>();
HashMap<String, Integer> two = new HashMap<String, Integer>();
Second: Use only one hashmap and create a new class
HashMap<String, NewClass> one = new HashMap<String, NewClass>();
where NewClass contains:
class NewClass {
    Location loc;
    Integer id;
}
If you want every String to be coupled with BOTH the location and the integer, use a new class; it will be much easier to debug and maintain, because it makes sense: a String X is connected to both a location and an integer. It ensures you will make fewer mistakes (like inserting only one of them, or deleting only one), and the code will be more readable.
If the association is loose, and some strings might need only a location and some only an integer, using two maps is probably preferable, as future readers of the code (including you in 3 months) may fail to understand what this new class is and why the String X needs to have a location.
tl;dr:
String->MyClass if each string is always associated with a location and an integer
String->Integer, String->Location if each string is independently associated with locations and integers.
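For the first case (String->MyClass), here is a tiny usage sketch of the single-map approach; the class and field names (NameInfo and so on) are made up for illustration:

import java.util.*;

class Location {
    double lat, lon;
    Location(double lat, double lon) { this.lat = lat; this.lon = lon; }
}

class NameInfo {
    final Location location;
    final int id;
    NameInfo(Location location, int id) { this.location = location; this.id = id; }
}

public class SingleMapExample {
    public static void main(String[] args) {
        Map<String, NameInfo> byName = new HashMap<String, NameInfo>();
        byName.put("Alice", new NameInfo(new Location(52.5, 13.4), 1));

        NameInfo info = byName.get("Alice"); // one lookup returns both the id and the location
        System.out.println(info.id + " @ " + info.location.lat + "," + info.location.lon);
    }
}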
If you always need to retrieve both the Id and the Location, the first approach requires 2 hash lookups while the second requires only 1. In that case, the second approach should have slightly better performance.
To test that, I ran the simple test below:
// setup assumed by the snippet: the two-map and one-map structures and the RNG;
// NewClass here is assumed to have an int field "i" and a String field "loc"
Random random = new Random();
HashMap<String, String> hash1 = new HashMap<String, String>();
HashMap<String, Integer> hash2 = new HashMap<String, Integer>();
HashMap<String, NewClass> hash3 = new HashMap<String, NewClass>();

// create 2 hashes with 1M entries
for (int i = 0; i < 1000000; i++) {
    String s = new BigInteger(80, random).toString(32);
    hash1.put(s, s);
    hash2.put(s, new BigInteger(80, random).intValue());
}

// create 1 hash with 1M entries
for (int i = 0; i < 1000000; i++) {
    String s = new BigInteger(80, random).toString(32);
    NewClass n = new NewClass();
    n.i = new BigInteger(80, random).intValue();
    n.loc = s;
    hash3.put(s, n);
}

// 5M lookups
long start = new Date().getTime();
for (int i = 0; i < 5000000; i++) {
    String s = "AAA";
    hash1.get(s);
    hash2.get(s);
}
System.out.println("Approach 1 (2 hashes): " + (new Date().getTime() - start));

// 5M lookups
long start2 = new Date().getTime();
for (int i = 0; i < 5000000; i++) {
    String s = "BBB";
    hash3.get(s);
}
System.out.println("Approach 2 (1 hash): " + (new Date().getTime() - start2));
Running on my computer, the results were:
Approach 1 (2 hashes): 37 ms
Approach 2 (1 hash): 18 ms
The test is super simplistic and, if you are facing serious performance issues, you should investigate deeper, considering other aspects such as memory footprint, the cost of object creation, etc. But, in any case, using 2 hashes will increase the total lookup time.
Got this question during an interview. Wanted to know if there was a better solution:
Given N tasks and the dependencies among them, provide an execution sequence which makes sure the jobs are executed without violating the dependencies.
Sample File:
5
1<4
3<2
4<5
First line is the number of total tasks.
1<4 means Task 1 has to be executed before task 4.
One possible sequence would be:
1 4 5 3 2
My solution uses a DAG to store all the numbers, followed by a topological sort. Is there a less heavy-handed way of solving this problem?
DirectedAcyclicGraph<Integer, DefaultEdge> dag = new DirectedAcyclicGraph<Integer, DefaultEdge>(DefaultEdge.class);
Integer[] hm = new Integer[6];

// Add integer objects to storage array for later edge creation and add vertices to DAG
for (int x = 1; x <= numVertices; x++) {
    Integer newInteger = new Integer(x);
    hm[x] = newInteger;
    dag.addVertex(newInteger);
}

// Add edges between vertices
for (int x = 1; x < lines.size() - 1; x++) {
    String[] parts = lines.get(x).split("<");
    String firstVertex = parts[0];
    String secondVertex = parts[1];
    dag.addDagEdge(hm[Integer.valueOf(firstVertex)], hm[Integer.valueOf(secondVertex)]);
}

// Topological sort
Iterator<Integer> itr = dag.iterator();
while (itr.hasNext()) {
    System.out.println(itr.next());
}
As already said by several users (Gassa, shekhar suman, mhum and Colonel Panic), the problem is solved by finding a topological sorting. As long as the iterator in dag returns the elements in that order, it's correct.
I don't know where the DirectedAcyclicGraph class comes from, so I can't help with that. Otherwise, the method below parses the input the same way yours does and uses a simple algorithm (actually, the first one that sprang to my mind):
public static int[] orderTasks(String[] lines) {
    // parse
    int numTasks = Integer.parseInt(lines[0]);
    List<int[]> restrictions = new ArrayList<int[]>(lines.length - 1);
    for (int i = 1; i < lines.length; i++) {
        String[] strings = lines[i].split("<");
        restrictions.add(new int[]{Integer.parseInt(strings[0]), Integer.parseInt(strings[1])});
    }

    // order
    int[] tasks = new int[numTasks];
    int current = 0;
    Set<Integer> left = new HashSet<Integer>(numTasks);
    for (int i = 1; i <= numTasks; i++) {
        left.add(i);
    }
    while (current < tasks.length) {
        // these numbers can't be written yet
        Set<Integer> currentIteration = new HashSet<Integer>(left);
        for (int[] restriction : restrictions) {
            // the second element has at least the first one as a precondition
            currentIteration.remove(restriction[1]);
        }
        if (currentIteration.isEmpty()) {
            // control for circular dependencies
            throw new IllegalArgumentException("There are circular dependencies");
        }
        for (Integer i : currentIteration) {
            tasks[current++] = i;
        }
        // update tasks left
        left.removeAll(currentIteration);
        // update restrictions
        Iterator<int[]> iterator = restrictions.iterator();
        while (iterator.hasNext()) {
            if (currentIteration.contains(iterator.next()[0])) {
                iterator.remove();
            }
        }
    }
    return tasks;
}
BTW, in your hm array initialization you define it as having 6 elements. That leaves position 0 null (not a problem, since you never use it), but in the general case the number of tasks could be greater than 5, and then you'll get an ArrayIndexOutOfBoundsException.
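A minimal fix, assuming numVertices has already been parsed from the first line of the input:

// size the lookup array from the input instead of hard-coding 6
Integer[] hm = new Integer[numVertices + 1];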
Another punctilious remark: when adding the edges, in the case of circular dependencies, if the message of the exception raised by the DAG is not clear enough, the user could be confused. Again, since I don't know where that class comes from, I can't tell.
I am trying to build a 4 x 4 sudoku solver using a genetic algorithm. I have some issues with values converging to local minima. I am using a ranked approach, removing the two lowest-ranked answer possibilities and replacing them with a crossover between the two highest-ranked answer possibilities. To further help avoid local minima, I am also using mutation. If an answer is not determined within a specific number of generations, my population is refilled with completely new and random state values. However, my algorithm seems to get stuck in local minima. As a fitness function, I am using:
(total number of open squares * 7 (possible violations at each square: row, column, and box)) - total violations
population is an ArrayList of integer arrays in which each array is a possible end state for the sudoku based on the input. Fitness is determined for each array in the population.
Would someone be able to assist me in determining why my algorithm converges on local minima, or perhaps recommend a technique to avoid them? Any help is greatly appreciated.
Fitness Function:
public int[] fitnessFunction(ArrayList<int[]> population)
{
    int emptySpaces = this.blankData.size();
    int maxError = emptySpaces * 7;
    int[] fitness = new int[populationSize];

    for (int i = 0; i < population.size(); i++)
    {
        int[] temp = population.get(i);
        int value = evaluationFunc(temp);
        fitness[i] = maxError - value;
        System.out.println("Fitness(i)" + fitness[i]);
    }
    return fitness;
}
Crossover Function:
public void crossover(ArrayList<int[]> population, int indexWeakest, int indexStrong, int indexSecStrong, int indexSecWeak)
{
    int[] tempWeak = new int[16];
    int[] tempStrong = new int[16];
    int[] tempSecStrong = new int[16];
    int[] tempSecWeak = new int[16];

    tempStrong = population.get(indexStrong);
    tempSecStrong = population.get(indexSecStrong);
    tempWeak = population.get(indexWeakest);
    tempSecWeak = population.get(indexSecWeak);
    population.remove(indexWeakest);
    population.remove(indexSecWeak);

    int crossoverSite = random.nextInt(14) + 1;

    for (int i = 0; i < tempWeak.length; i++)
    {
        if (i < crossoverSite)
        {
            tempWeak[i] = tempStrong[i];
            tempSecWeak[i] = tempSecStrong[i];
        }
        else
        {
            tempWeak[i] = tempSecStrong[i];
            tempSecWeak[i] = tempStrong[i];
        }
    }

    mutation(tempWeak);
    mutation(tempSecWeak);
    population.add(tempWeak);
    population.add(tempSecWeak);

    for (int j = 0; j < tempWeak.length; j++)
    {
        System.out.print(tempWeak[j] + ", ");
    }
    for (int j = 0; j < tempWeak.length; j++)
    {
        System.out.print(tempSecWeak[j] + ", ");
    }
}
Mutation Function:
public void mutation(int[] mutate)
{
    if (this.blankData.size() > 2)
    {
        Blank blank = this.blankData.get(0);
        int x = blank.getPosition();
        Blank blank2 = this.blankData.get(1);
        int y = blank2.getPosition();
        Blank blank3 = this.blankData.get(2);
        int z = blank3.getPosition();

        int rando = random.nextInt(4) + 1;
        if (rando == 2)
        {
            int rando2 = random.nextInt(4) + 1;
            mutate[x] = rando2;
        }
        if (rando == 3)
        {
            int rando2 = random.nextInt(4) + 1;
            mutate[y] = rando2;
        }
        if (rando == 4)
        {
            int rando3 = random.nextInt(4) + 1;
            mutate[z] = rando3;
        }
    }
}
The reason you see rapid convergence is that your methodology for "mating" is not very good. You are always producing two offspring from the "mating" of the top two scoring individuals. Imagine what happens when one of the new offspring is the same as your top individual (by chance, with no crossover and no mutation, or at least none that has an effect on the fitness). Once this occurs, the top two individuals are identical, which eliminates the effectiveness of crossover.
A more typical approach is to replace EVERY individual on every generation. There are lots of possible variations here, but you might, for example, choose the two parents at random, weighted by fitness.
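As an illustration of fitness-weighted parent selection, here is a sketch of the general roulette-wheel idea (this is not code from the question; it assumes the non-negative fitness values produced by the fitness function above):

import java.util.Random;

public class RouletteWheel {
    // Returns one parent index with probability proportional to its fitness.
    static int selectParent(int[] fitness, Random random) {
        long total = 0;
        for (int f : fitness) {
            total += f;
        }
        if (total == 0) {
            return random.nextInt(fitness.length); // all fitnesses zero: pick uniformly
        }
        long r = (long) (random.nextDouble() * total);
        long running = 0;
        for (int i = 0; i < fitness.length; i++) {
            running += fitness[i];
            if (r < running) {
                return i;
            }
        }
        return fitness.length - 1; // guard against rounding at the upper edge
    }

    public static void main(String[] args) {
        Random random = new Random();
        int[] fitness = {10, 40, 25, 5};
        System.out.println("parents: " + selectParent(fitness, random) + ", " + selectParent(fitness, random));
    }
}

Picking two parents this way for every child, and building a completely new population each generation, lets the fitter genes spread without guaranteeing that any single individual dominates.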
Regarding population size: I don't know how hard of a problem sudoku is given your genetic representation and fitness function, but I suggest that you think about millions of individuals, not dozens.
If you are working on really hard problems, genetic algorithms are much more effective when you place your population on a 2-D grid and choose "parents" for each point in the grid from the nearby individuals. You will get local convergence, but each locality will have converged on a different solution; you get a huge amount of variation produced at the borders between the locally-converged areas of the grid.
Another technique you might think about is running to convergence from random populations many times and storing the top individual from each run. After you have built up a bunch of different local-minimum genomes, build a new random population from those top individuals.
I think Sudoku is a permutation problem, so I suggest initializing the population with random permutations and using a crossover method that is compatible with permutation problems.
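A small sketch of the permutation idea for a 4x4 board: each row is initialized as a random permutation of 1..4 (handling the given clues is deliberately left out here):

import java.util.*;

public class PermutationInit {
    public static void main(String[] args) {
        int[][] board = new int[4][4];
        Random random = new Random();
        for (int row = 0; row < 4; row++) {
            List<Integer> values = new ArrayList<Integer>(Arrays.asList(1, 2, 3, 4));
            Collections.shuffle(values, random); // random permutation of 1..4 for this row
            for (int col = 0; col < 4; col++) {
                board[row][col] = values.get(col);
            }
        }
        System.out.println(Arrays.deepToString(board));
    }
}

With this representation, crossover and mutation should also preserve the permutation property (e.g. order crossover and swap mutation), so row constraints are never violated and the fitness function only has to count column and box violations.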
I need the steps to perform document clustering using the k-means algorithm in Java.
It would be very helpful if you could lay out the steps clearly.
Thanks in advance.
You need to count the words in each document and build a feature vector, generally called a bag of words. Before that you need to remove stop words (very common words that do not carry much information, like "the", "a", etc.). You can take the top n most common words from your documents, count the frequency of these words per document, and store the counts in an n-dimensional vector.
As a distance measure you can use cosine similarity.
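A small sketch of cosine similarity between two such term-frequency vectors (both vectors are assumed to use the same word order, e.g. the top-n words):

double cosineSimilarity(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA == 0 || normB == 0) {
        return 0; // at least one empty document: define similarity as 0
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}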
Here is a simple algorithm for 2-means on 1-dimensional data points. You can easily extend it to k-means and n-dimensional data points. Let me know if you want an n-dimensional implementation.
double[] x = {1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 7, 8, 8.5, 9, 9.5, 10};
double[] center = new double[2];
double[] precenter = new double[2];
ArrayList<Double>[] cluster = new ArrayList[2];
cluster[0] = new ArrayList<Double>();
cluster[1] = new ArrayList<Double>();

// pick 2 distinct random indices from 0 to x.length - 1 as the initial centers
Random random = new Random();
int[] randIdx = new int[2];
randIdx[0] = random.nextInt(x.length);
randIdx[1] = random.nextInt(x.length);
while (randIdx[0] == randIdx[1]) {
    randIdx[1] = random.nextInt(x.length);
}
center[0] = x[randIdx[0]];
center[1] = x[randIdx[1]];
// (there are better ways to draw k random indices without replacement; just search for them)

do {
    cluster[0].clear();
    cluster[1].clear();
    // assignment step: put each point into the cluster of its nearest center
    for (int i = 0; i < x.length; i++) {
        if (Math.abs(x[i] - center[0]) <= Math.abs(x[i] - center[1])) {
            cluster[0].add(x[i]);
        } else {
            cluster[1].add(x[i]);
        }
    }
    // update step: move each center to the mean of its cluster
    precenter[0] = center[0];
    precenter[1] = center[1];
    center[0] = mean(cluster[0]);
    center[1] = mean(cluster[1]);
} while (precenter[0] != center[0] || precenter[1] != center[1]);

double mean(ArrayList<Double> list) {
    double sum = 0;
    for (int index = 0; index < list.size(); index++) {
        sum += list.get(index);
    }
    return list.isEmpty() ? 0 : sum / list.size();
}
cluster[0] and cluster[1] contain the points of the two clusters, and center[0], center[1] are the two means.
You may still need to do some debugging, because I originally wrote the code in R and just converted it into Java for you :)
Does this help you? Also, the Wikipedia article has some links to implementations in other languages, ready to be ported to Java.
Steps of the algorithm:
Define the number of clusters you want to have.
Distribute the points randomly in your problem space.
Link every observation to the nearest point.
Calculate the center of mass for each cluster and place the point in the middle.
Link the points again to the center points and repeat until the points don't move any more.
What do you want to cluster the documents based on? If it's by similarity, you'll need to do some natural language processing first, and then you'll need a metric (some kind of assignment algorithm) to place the documents into clusters (CRP works and is relatively straightforward).
The hardest part will be the NLP (natural language processing) if you're not clustering them based on something like "length". I can provide more info on all of these, but I won't dive down the rabbit hole if you don't need it.
I'm using libsvm, and the documentation leads me to believe that there's a way to output the estimated probability that an output classification is correct. Is this so? And if so, can anyone provide a clear example of how to do it in code?
Currently, I'm using the Java libraries in the following manner
SvmModel model = Svm.svm_train(problem, parameters);
SvmNode x[] = getAnArrayOfSvmNodesForProblem();
double predictedValue = Svm.svm_predict(model, x);
Given your code-snippet, I'm going to assume you want to use the Java API packaged with libSVM, rather than the more verbose one provided by jlibsvm.
To enable prediction with probability estimates, train a model with the svm_parameter field probability set to 1. Then, just change your code so that it calls the svm method svm_predict_probability rather than svm_predict.
Modifying your snippet, we have:
parameters.probability = 1;
svm_model model = svm.svm_train(problem, parameters);
svm_node x[] = problem.x[0]; // let's try the first data pt in problem
double[] prob_estimates = new double[NUM_LABEL_CLASSES];
svm.svm_predict_probability(model, x, prob_estimates);
It's worth knowing that training with multiclass probability estimates can change the predictions made by the classifier. For more on this, see the question Calculating Nearest Match to Mean/Stddev Pair With LibSVM.
The accepted answer worked like a charm. Make sure to set probability = 1 during training.
If you are trying to drop the prediction when the confidence does not meet a threshold, here is a code sample:
double confidenceScores[] = new double[model.nr_class];
svm.svm_predict_probability(model, svmVector, confidenceScores);

/*System.out.println("text=" + text);
for (int i = 0; i < model.nr_class; i++) {
    System.out.println("i=" + i + ", labelNum:" + model.label[i] + ", name=" + classLoadMap.get(model.label[i]) + ", score=" + confidenceScores[i]);
}*/

// find the max confidence
int maxConfidenceIndex = 0;
double maxConfidence = confidenceScores[maxConfidenceIndex];
for (int i = 1; i < confidenceScores.length; i++) {
    if (confidenceScores[i] > maxConfidence) {
        maxConfidenceIndex = i;
        maxConfidence = confidenceScores[i];
    }
}

double threshold = 0.3; // set this based on the data & no. of classes
int labelNum = model.label[maxConfidenceIndex];
// reverse-map the number to a name
String targetClassLabel = classLoadMap.get(labelNum);

LOG.info("classNumber:{}, className:{}; confidence:{}; for text:{}",
        labelNum, targetClassLabel, maxConfidence, text);

if (maxConfidence < threshold) {
    LOG.info("Not enough confidence; threshold={}", threshold);
    targetClassLabel = null;
}
return targetClassLabel;