LIBSVM - perfect confusion matrix on training set, abysmal results on test set?

LIBSVM - perfect confusion matrix on training set, abysmal results on test set? - java

I am doing a project in argument mining. One of the tasks is classifying Strings as PREM(ise), CONC(lusion) or M(ajor)CONC(lusion). I am working with AAEC dataset and have a few thousand features per vector.
For the task I employ a CSVM with polynomial kernel implemented in LibSVM and accessed through WEKA.
I am performing a grid search (w/o cross-validation, its a custom code I wrote that trains an SVM on a subset of the data and prints its results) for best C, gamma. I am trying in range 10^-5 to 10^5 and 2^-15 to 2^3 respectively. I am also printing out the results on the training set and on the test set.
I either get all classified as a for both confusion matrices, or this :
Confusion matrix (on training set)
a b c <-- classified as
416 0 0 | a = PREM
8 169 0 | b = CONC
5 0 80 | c = MCONC
Confusion matrix (on test set)
a b c <-- classified as
107 1 0 | a = PREM
40 0 0 | b = CONC
16 0 0 | c = MCONC
I am not too familiar with SVMs and I am not sure whether this is supposed to be normal or anomalous. Intuitively it seems unlikely that the data is so well separable in the training set yet the result is completely off on the test set.
I am not sure how to proceed. Is this a result of not having optimal C,gamma or the data being not descriptive enough, or is this potentially a signal of a more hidden problem (e.g. filtering mistakes, overfitting)?
Advice would be appreciated, thanks!

Related

Optaplanner does not change planning variable

I am working on an optimisation problem with Optaplanner. I have the feeling that I have not designed my planning entity or variable correctly. Maybe someone would like to say what they think about the following case:
Entities:
In principle, it is a fairly simple problem. There are a lot of warehouses that store exactly one product and that can receive products or supply other warehouses on specified transport routes. Let's call them nodes.
class Node
int upperBoundary
int lowerBoundary
int items
String name
Each transport route can transport all products. You could think of the problem as a matrix or a tree. I have defined the cells of the matrix as planning entities. The planning variable is an integer value on the cell. Let's call them weighted edges. In my Solution Class is a List<Edge> edges.
E1 E2 E3 ... En
N1 1 0 0 2
N2 3 1 3 1
N3 0 1 0 0
...
Nn 1 1 0 0
#PlanningEntity
class WeightedEdge
Node from
Node to
#PlanningVariable
Integer items (nullable)
int minCapacity
int maxCapacity
In my example I now have about 450 weighted edges. All of them have minimum and maximum values. These are recognised correctly.
Node a min 0 max 40
Node b min 0 max 20 uws.
If I now run a benchmark, it shows me exactly the sum of the number ranges from the individual nodes as the "problem scale". In the example above it would be 60.
But the problem is that the planning variables have dependencies. It therefore makes a difference whether I enter 39 for a and 20 for b. Or 38 for a and 20 for b. This gives me 800 possibilities. Does anyone have an idea where the design error could be in this structure?
The Scoring Function aims to balance the Nodes, so that after the planning period their amount of items is between the upper and lower boundary. For that I am using the constraintProvider.
SoftScore = Sum of to many or to few items at each station through the whole network
--- Edit ---
As suggested, I changed the environment mode to FULL_ASSERT. With this config, the solver throws an error:
with this mode it throws an error. My undoMove is corrupted:
Caused by: java.lang.IllegalStateException: UndoMove corruption (1soft): the beforeMoveScore (-65init/0hard/0medium/-2412soft) is not the undoScore (-65init/0hard/0medium/-2411soft) which is the uncorruptedScore (-65init/0hard/0medium/-2411soft) of the workingSolution.
1) Enable EnvironmentMode FULL_ASSERT (if you haven't already) to fail-faster in case there's a score corruption or variable listener corruption.
2) Check the Move.createUndoMove(...) method of the moveClass (class org.optaplanner.core.impl.heuristic.selector.move.generic.ChangeMove). The move (0 items -> 3 items) might have a corrupted undoMove (Undo(0 items -> 3 items)).
3) Check your custom VariableListeners (if you have any) for shadow variables that are used by score constraints that could cause the scoreDifference (1soft).
at org.optaplanner.core.impl.score.director.AbstractScoreDirector.assertExpectedUndoMoveScore(AbstractScoreDirector.java:670)
at org.optaplanner.core.impl.heuristic.thread.MoveThreadRunner.run(MoveThreadRunner.java:149)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
... 4 more
--- Edit 2: Score over time Graph ---
--- Edit 3: Memory Consumption Graph ---

Competitive Coding - Clearing all levels with minimum cost : Not passing all test cases

I was solving problems on a competitive coding website when I came across this. The problem states that:
In this game there are N levels and M types of available weapons. The levels are numbered from 0 to N-1 and the weapons are numbered from 0 to M-1 . You can clear these levels in any order. In each level, some subset of these M weapons is required to clear this level. If in a particular level, you need to buy x new weapons, you will pay x^2 coins for it. Also note that you can carry all the weapons you have currently to the next level . Initially, you have no weapons. Can you find out the minimum coins required such that you can clear all the levels?
Input Format
The first line of input contains 2 space separated integers:
N = the number of levels in the game
M = the number of types of weapons
N lines follows. The ith of these lines contains a binary string of length M. If the jth character of
this string is 1 , it means we need a weapon of type j to clear the ith level.
Constraints
1 <= N <=20
1<= M <= 20
Output Format
Print a single integer which is the answer to the problem.
Sample TestCase 1
Input
1 4
0101
Output
4
Explanation
There is only one level in this game. We need 2 types of weapons - 1 and 3. Since, initially Ben
has no weapons he will have to buy these, which will cost him 2^2 = 4 coins.
Sample TestCase 2
Input
3 3
111
001
010
Output
3
Explanation
There are 3 levels in this game. The 0th level (111) requires all 3 types of weapons. The 1st level (001) requires only weapon of type 2. The 2nd level requires only weapon of type 1. If we clear the levels in the given order(0-1-2), total cost = 3^2 + 0^2 + 0^2 = 9 coins. If we clear the levels in the order 1-2-0, it will cost = 1^2 + 1^2 + 1^2 = 3 coins which is the optimal way.
Approach
I was able to figure out that we can calculate the minimum cost by traversing the Binary Strings in a way that we purchase minimum possible weapons at each level.
One possible way could be traversing the array of binary strings and calculating the cost for each level while the array is already arranged in the correct order. The correct order should be when the Strings are already sorted i.e. 001, 010, 111 as in case of the above test case. Traversing the arrays in this order and summing up the cost for each level gives the correct answer.
Also, the sort method in java works fine to sort these Binary Strings before running a loop on the array to sum up cost for each level.
Arrays.sort(weapons);
This approach work fine for some of the test cases, however more than half of the test cases are still failing and I can't understand whats wrong with my logic. I am using bitwise operators to calculate the number of weapons needed at each level and returning their square.
Unfortunately, I cannot see the test cases that are failing. Any help is greatly appreciated.

This can be solved by dynamic programming.
The state will be the bit mask of weapons we currently own.
The transitions will be to try clearing each of the n possible levels in turn from the current state, acquiring the additional weapons we need and paying for them.
In each of the n resulting states, we take the minimum cost of the current way to achieve it and all previously observed ways.
When we already have some weapons, some levels will actually require no additional weapons to be bought; such transitions will automatically be disregarded since in such case, we arrive at the same state having paid the same cost.
We start at the state of m zeroes, having paid 0.
The end state is the bitwise OR of all the given levels, and the minimum cost to get there is the answer.
In pseudocode:
let mask[1], mask[2], ..., mask[n] be the given bit masks of the n levels
p2m = 2 to the power of m
f[0] = 0
all f[1], f[2], ..., f[p2m-1] = infinity
for state = 0, 1, 2, ..., p2m-1:
current_cost = f[state]
current_ones = popcount(state) // popcount is the number of 1 bits
for level = 1, 2, ..., n:
new_state = state | mask[level] // the operation is bitwise OR
new_cost = current_cost + square (popcount(new_state) - current_ones)
f[new_state] = min (f[new_state], new_cost)
mask_total = mask[1] | mask[2] | ... | mask[n]
the answer is f[mask_total]
The complexity is O(2^m * n) time and O(2^m) memory, which should be fine for m <= 20 and n <= 20 in most online judges.

The dynamic optimization idea by #Gassa could be extended by using A* by estimating min and max of the remaining cost, where
minRemaining(s)=bitCount(maxState-s)
maxRemaining(s)=bitCount(maxState-s)^2
Start with a priority queue - and base it on cost+minRemaining - with the just the empty state, and then replace a state from this queue that has not reached maxState with at most n new states based the n levels:
Keep track bound=min(cost(s)+maxRemaining(s)) in queue,
and initialize all costs with bitCount(maxState)^2+1
extract state with lowest cost
if state!=maxState
remove state from queue
for j in 1..n
if (state|level[j]!=state)
cost(state|level[j])=min(cost(state|level[j]),
cost(state)+bitCount(state|level[j]-state)^2
if cost(state|level[j])+minRemaining(state|level[j])<=bound
add/replace state|level[j] in queue
else break
The idea is to skip dead-ends. So consider an example from a comment
11100 cost 9 min 2 max 4
11110 cost 16 min 1 max 1
11111 cost 25 min 0 max 0
00011 cost 4 min 3 max 9
bound 13
remove 00011 and replace with 11111 (skipping 00011 since no change)
11111 cost 13 min 0 max 0
11100 cost 9 min 2 max 4
11110 cost 16 min 1 max 1
remove 11100 and replace with 11110 11111 (skipping 11100 since no change):
11111 cost 13 min 0 max 0
11110 cost 10 min 1 max 1
bound 11
remove 11110 and replace with 11111 (skipping 11110 since no change)
11111 cost 11 min 0 max 0
bound 11
Number of operations should be similar to dynamic optimization in the worst case, but in many cases it will be better - and I don't know if the worst case can occur.

The logic behind this problem is that every time you have to find the minimum count of set bits corresponding to a binary string which will contain the weapons so far got in the level.
For ex :
we have data as
4 3
101-2 bits
010-1 bits
110-2 bits
101-2 bits
now as 010 has min bits we compute cost for it first then update the current pattern (by using bitwise OR) so current pattern is 010
next we find the next min set bits wrt to current pattern
i have used the logic by first using XOR for current pattern and the given number then using AND with the current number(A^B)&A
so the bits become like this after the operation
(101^010)&101->101-2 bit
(110^010)&110->100-1 bit
now we know the min bit is 110 we pick it and compute the cost ,update the pattern and so on..
This method returns the cost of a string with respect to current pattern
private static int computeCost(String currPattern, String costString) {
int a = currPattern.isEmpty()?0:Integer.parseInt(currPattern, 2);
int b = Integer.parseInt(costString, 2);
int cost = 0;
int c = (a ^ b) & b;
cost = (int) Math.pow(countSetBits(c), 2);
return cost;
}

How to calculate Centered Moving Average of a set of data in Hadoop Map-Reduce?

I want to calculate Centered Moving average of a set of Data .
Example Input format :
quarter | sales
Q1'11 | 9
Q2'11 | 8
Q3'11 | 9
Q4'11 | 12
Q1'12 | 9
Q2'12 | 12
Q3'12 | 9
Q4'12 | 10
Mathematical Representation of data and calculation of Moving average and then centered moving average
Period Value MA Centered
1 9
1.5
2 8
2.5 9.5
3 9 9.5
3.5 9.5
4 12 10.0
4.5 10.5
5 9 10.750
5.5 11.0
6 12
6.5
7 9
I am stuck with the implementation of RecordReader which will provide mapper sales value of a year i.e. of four quarter.

This is actually totally doable in the MapReduce paradigm; it does not have to be though of as a 'sliding window'. Instead think of the fact that each data point is relevant to a max of four MA calculations, and remember that each call to the map function can emit more than one key-value pair. Here is pseudo-code:
First MR job:
map(quarter, sales)
emit(quarter - 1.5, sales)
emit(quarter - 0.5, sales)
emit(quarter + 0.5, sales)
emit(quarter + 1.5, sales)
reduce(quarter, list_of_sales)
if (list_of_sales.length == 4):
emit(quarter, average(list_of_sales))
endif
Second MR job:
map(quarter, MA)
emit(quarter - 0.5, MA)
emit(quarter + 0.5, MA)
reduce(quarter, list_of_MA)
if (list_of_MA.length == 2):
emit(quarter, average(list_of_sales))
endif

In best of my understanding moving average is not nicely maps to MapReduce paradigm since its calculation is essentially "sliding window" over sorted data, while MR is processing of non-intersected ranges of sorted data.
Solution I do see is as following:
a) To implement custom partitioner to be able to make two different partitions in two runs. In each run
your reducers will get different ranges of data and calculate moving average where approprieate
I will try to illustrate:
In first run data for reducers should be:
R1: Q1, Q2, Q3, Q4
R2: Q5, Q6, Q7, Q8
...
here you will cacluate moving average for some Qs.
In next run your reducers should get data like:
R1: Q1...Q6
R2: Q6...Q10
R3: Q10..Q14
And caclulate the rest of moving averages.
Then you will need to aggregate results.
Idea of custom partitioner that it will have two modes of operation - each time dividing into equal ranges but with some shift. In a pseudocode it will look like this :
partition = (key+SHIFT) / (MAX_KEY/numOfPartitions) ;
where:
SHIFT will be taken from the configuration.
MAX_KEY = maximum value of the key. I assume for simplicity that they start with zero.
RecordReader, IMHO is not a solution since it is limited to specific split and can not slide over split's boundary.
Another solution would be to implement custom logic of splitting input data (it is part of the InputFormat). It can be done to do 2 different slides, similar to partitioning.

Exclusive or between N bit sets

I am implementing a program in Java using BitSets and I am stuck in the following operation:
Given N BitSets return a BitSet with 0 if there is more than 1 one in all the BitSets, and 1 otherwise
As an example, suppose we have this 3 sets:
10010
01011
00111
11100 expected result
For the following sets :
10010
01011
00111
10100
00101
01000 expected result
I am trying to do this exclusive with bit wise operations, and I have realized that what I need is literally the exclusive or between all the sets, but not in an iterative fashion,
so I am quite stumped with what to do. Is this even possible?
I wanted to avoid the costly solution of having to check each bit in each set, and keep a counter for each position...
Thanks for any help
Edit : as some people asked, this is part of a project I'm working on. I am building a time table generator and basically one of the soft constraints is that no student should have only 1 class in 1 day, so those Sets represent the attending students in each hour, and I want to filter the ones who have only 1 class.

You can do what you want with two values. One has the bits set at least once, the second has those set more than once. The combination can be used to determine those set once and no more.
int[] ints = {0b10010, 0b01011, 0b00111, 0b10100, 0b00101};
int setOnce = 0, setMore = 0;
for (int i : ints) {
setMore |= setOnce & i;
setOnce |= i;
}
int result = setOnce & ~setMore;
System.out.println(String.format("%5s", Integer.toBinaryString(result)).replace(' ', '0'));
prints
01000

Well first of all, you can't do this without checking every bit in each set. If you could solve this question without checking some arbitrary bit, then that would imply that there exist two solutions (i.e. two different ones for each of the two values that bit can be).
If you want a more efficient way of computing the XOR of multiple bit sets, I'd consider representing your sets as integers rather than with sets of individual bits. Then simply XOR the integers together to arrive at your answer. Otherwise, it seems to me that you would have to iterate through each bit, check its value, and compute the solution on your own (as you described in your question).

Extracting rightmost N bits of an integer

In the yester Code Jam Qualification round http://code.google.com/codejam/contest/dashboard?c=433101#s=a&a=0 , there was a problem called Snapper Chain. From the contest analysis I came to know the problem requires bit twiddling stuff like extracting the rightmost N bits of an integer and checking if they all are 1. I saw a contestant's(Eireksten) code which performed the said operation like below:
(((K&(1<<N)-1))==(1<<N)-1)
I couldn't understand how this works. What is the use of -1 there in the comparison?. If somebody can explain this, it would be very much useful for us rookies. Also, Any tips on identifying this sort of problems would be much appreciated. I used a naive algorithm to solve this problem and ended up solving only the smaller data set.(It took heck of a time to compile the larger data set which is required to be submitted within 8 minutes.). Thanks in advance.

Let's use N=3 as an example. In binary, 1<<3 == 0b1000. So 1<<3 - 1 == 0b111.
In general, 1<<N - 1 creates a number with N ones in binary form.
Let R = 1<<N-1. Then the expression becomes (K&R) == R. The K&R will extract the last N bits, for example:
101001010
& 111
————————————
000000010
(Recall that the bitwise-AND will return 1 in a digit, if and only if both operands have a 1 in that digit.)
The equality holds if and only if the last N bits are all 1. Thus the expression checks if K ends with N ones.

For example: N=3, K=101010
1. (1<<N) = 001000 (shift function)
2. (1<<N)-1 = 000111 (http://en.wikipedia.org/wiki/Two's_complement)
3. K&(1<<N)-1 = 0000010 (Bitmask)
4. K&(1<<N)-1 == (1<<N)-1) = (0000010 == 000111) = FALSE

I was working through the Snapper Chain problem and came here looking for an explanation on how the bit twiddling algorithm I came across in the solutions worked. I found some good info but it still took me a good while to figure it out for myself, being a bitwise noob.
Here's my attempt at explaining the algorithm and how to come up with it. If we enumerate all the possible power and ON/OFF states for each snapper in a chain, we see a pattern. Given the test case N=3, K=7 (3 snappers, 7 snaps), we show the power and ON/OFF states for each snapper for every kth snap:
1 2 3
0b:1 1 1.1 1.0 0.0 -> ON for n=1
0b:10 2 1.0 0.1 0.0
0b:11 3 1.1 1.1 1.0 -> ON for n=1, n=2
0b:100 4 1.0 0.0 1.0
0b:101 5 1.1 1.0 1.0 -> ON for n=1
0b:110 6 1.0 0.1 0.1
0b:111 7 1.1 1.1 1.1 -> ON for n=2, n=3
The lightbulb is on when all snappers are on and receiving power, or when we have a kth snap resulting in n 1s. Even more simply, the bulb is on when all of the snappers are ON, since they all must be receiving power to be ON (and hence the bulb). This means for every k snaps, we need n 1s.
Further, you can note that k is all binary 1s not only for k=7 that satisfies n=3, but for k=3 that satisifes n=2 and k=1 that satisifes n=1. Further, for n = 1 or 2 we see that every number of snaps that turns the bulb on, the last n digits of k are always 1. We can attempt to generalize that all ks that satisfy n snappers will be a binary number ending in n digits of 1.
We can use the expression noted by an earlier poster than 1 << n - 1 always gives us n binary digits of 1, or in this case, 1 << 3 - 1 = 0b111. If we treat our chain of n snappers as a binary number where each digit represents on/off, and we want n digits of one, this gives us our representation.
Now we want to find those cases where 1 << n - 1 is equal to some k that ends in n binary digits of 1, which we do by performing a bitwise-and: k & (1 << n - 1) to get the last n digits of k, and then comparing that to 1 << n - 1.
I suppose this type of thinking comes more naturally after working with these types of problems a lot, but it's still intimidating to me and I doubt I would ever have come up with such a solution by myself!
Here's my solution in perl:
$tests = <>;
for (1..$tests) {
($n, $k) = split / /, <>;
$m = 1 << $n - 1;
printf "Case #%d: %s\n", $_, (($k & $m) == $m) ? 'ON' : 'OFF';
}

I think we can recognize this kind of problem by calculating the answer by hand first, for some series of N (for example 1,2,3,..). After that, we will recognize the state change and then write a function to automate the process (first function). Run the program for some inputs, and notice the pattern.
When we get the pattern, write the function representing the pattern (second function), and compare the output of the first function and the second function.
For the Code Jam case, we can run both function against the small dataset, and verify the output. If it is identical, we have a high probability that the second function can solve the large dataset in time.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.