Java K-means implementation with unexpected output - java

I'm using the Trickl-Cluster project to cluster my data set
and Colt to store the data objects in matrices.
After executing this code:
import cern.colt.matrix.DoubleMatrix2D;
import cern.colt.matrix.impl.DenseDoubleMatrix2D;
import com.trickl.cluster.KMeans;
DoubleMatrix2D dm1 = new DenseDoubleMatrix2D(3, 3);
dm1.setQuick(0, 0, 5.9);
dm1.setQuick(0, 1, 1.6);
dm1.setQuick(0, 2, 18.0);
dm1.setQuick(1, 0, 2.0);
dm1.setQuick(1, 1, 3.5);
dm1.setQuick(1, 2, 20.3);
dm1.setQuick(2, 0, 11.5);
dm1.setQuick(2, 1, 100.5);
dm1.setQuick(2, 2, 6.5);
System.out.println(dm1);
KMeans km = new KMeans();
km.cluster(dm1, 1);
DoubleMatrix2D dm11 = km.getPartition();
System.out.println(dm11);
DoubleMatrix2D dm111 = km.getMeans();
System.out.println(dm111);
I got the following output:
3 x 3 matrix
5.9 1.6 18
2 3.5 20.3
11.5 100.5 6.5
3 x 1 matrix
1
1
1
3 x 1 matrix
6.466667
35.2
14.933333
Following the algorithm's steps, it seems strange to ask for 1 cluster and get 3 means.
The documentation is not very clear on this point.
This is the signature of the cluster method according to the project's Javadoc:
void cluster(cern.colt.matrix.DoubleMatrix2D data, int clusters)
So, logically, the int clusters parameter is the number of clusters expected when K-means terminates.
Do you have any idea how the outputs of the project's KMeans class relate to the results expected from the K-means algorithm?

This is one 3-dimensional mean. If you put in three-dimensional data, you get out three-dimensional means.
Note that running k-means with k=1 is absolutely nonsensical, as it will simply compute the mean of the data set:
(5.9+2+11.5) / 3 = 6.466667
(1.6+3.5+100.5) / 3 = 35.2
(18+20.3+6.5) / 3 = 14.933333
The result is obviously correct.
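The same check can be sketched in plain Java, without Colt or Trickl-Cluster. The class and method names here are made up for illustration; the code only averages each column of the 3 x 3 input, which is what k=1 reduces to.

```java
final class ColumnMeans {
    /** Averages each column of a non-empty, rectangular matrix. */
    static double[] columnMeans(double[][] data) {
        double[] means = new double[data[0].length];
        for (double[] row : data) {
            for (int j = 0; j < row.length; j++) {
                means[j] += row[j];
            }
        }
        for (int j = 0; j < means.length; j++) {
            means[j] /= data.length;
        }
        return means;
    }

    public static void main(String[] args) {
        double[][] data = {
            {5.9, 1.6, 18.0},
            {2.0, 3.5, 20.3},
            {11.5, 100.5, 6.5}
        };
        // One mean with three coordinates, one per input dimension.
        for (double m : columnMeans(data)) {
            System.out.println(m);
        }
    }
}
```

The three values printed are the three coordinates of a single cluster mean, not three separate means.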


How to find the point that gives the maximum value fast? Java or c++ code please

I need a fast way to find the maximum value when intervals overlap. Unlike finding the point where the most intervals overlap, there is an "order" here. I have an int[][] data in which each int[] holds 2 values: the first number is the center and the second is the radius. The closer a point is to the center, the larger the value at that point. For example, given data like:
int[][] data = new int[][]{
{1, 1},
{3, 3},
{2, 4}};
Then on a number line, this is what it looks like:
x axis: -2 -1  0  1  2  3  4  5  6  7
1 1:           1  2  1
3 3:           1  2  3  4  3  2  1
2 4:     1  2  3  4  5  4  3  2  1
So for the value of my point to be as large as possible, I need to pick the point x = 2, which gives a total value of 1 + 3 + 5 = 9, the largest possible value. Is there a way to do this fast, say with a time complexity of O(n) or O(n log n)?
This can be done with a simple O(n log n) algorithm.
Consider the value function v(x), and then consider its discrete derivative dv(x)=v(x)-v(x-1). Suppose you only have one interval, say {3,3}. dv(x) is 0 from -infinity to -1, then 1 from 0 to 3, then -1 from 4 to 7, then 0 from 8 to infinity. That is, the derivative changes by 1 "just after" -1, by -2 just after 3, and by 1 just after 7.
For n intervals, there are 3*n derivative changes (some of which may occur at the same point). So find the list of all derivative changes (x,change), sort them by their x, and then just iterate through the set.
Behold:
intervals = [(1,1), (3,3), (2,4)]
events = []
for mid, width in intervals:
    before_start = mid - width - 1  # last x where this interval is still 0
    at_end = mid + width + 1        # first x past the interval, where it is 0 again
    events += [(before_start, 1), (mid, -2), (at_end, 1)]
events.sort()
prev_x = -1000
v = 0
dv = 0
best_v = -1000
best_x = None
for x, change in events:
    dx = x - prev_x
    v += dv * dx      # v is now the total value at x
    if v > best_v:
        best_v = v
        best_x = x
    dv += change      # slope that applies from x + 1 onwards
    prev_x = x
print(best_x, best_v)
And also the java code:
// cows is a List<int[]> of {center, radius} pairs
TreeMap<Integer, Integer> ts = new TreeMap<Integer, Integer>();
for (int[] cow : cows) {
    // Slope changes, keyed by the first x at which the new slope applies
    ts.merge(cow[0] - cow[1], 1, Integer::sum);
    ts.merge(cow[0] + 1, -2, Integer::sum);
    ts.merge(cow[0] + cow[1] + 2, 1, Integer::sum);
}
int value = 0;
int best = 0;
int change = 0;
// Any starting value works here: change is 0 at the first event
int indexBefore = -100000000;
for (int index : ts.keySet()) {
    // Advance to index - 1 using the slope that was in effect so far
    value += (index - indexBefore) * change;
    best = Math.max(value, best);
    change += ts.get(index);
    indexBefore = index;
}
where cows is the data
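For small inputs, a brute-force scan is a handy way to cross-check the sweep above. This is only a sketch: the class and method names are invented, it assumes a non-empty data array, and it takes the value of a {center, radius} entry at x to be radius + 1 - |x - center| (never below 0), which is how the number line in the question reads.

```java
final class BruteForcePeak {
    /** Value contributed by one {center, radius} entry at position x. */
    static int valueAt(int[] entry, int x) {
        return Math.max(0, entry[1] + 1 - Math.abs(x - entry[0]));
    }

    /** Returns {bestX, bestValue}; O(n * range), so only suitable for testing. */
    static int[] best(int[][] data) {
        int lo = Integer.MAX_VALUE;
        int hi = Integer.MIN_VALUE;
        for (int[] entry : data) {
            lo = Math.min(lo, entry[0] - entry[1]);
            hi = Math.max(hi, entry[0] + entry[1]);
        }
        int bestX = lo;
        int bestValue = -1;
        for (int x = lo; x <= hi; x++) {
            int total = 0;
            for (int[] entry : data) {
                total += valueAt(entry, x);
            }
            if (total > bestValue) {
                bestValue = total;
                bestX = x;
            }
        }
        return new int[] {bestX, bestValue};
    }

    public static void main(String[] args) {
        int[] result = best(new int[][] {{1, 1}, {3, 3}, {2, 4}});
        System.out.println("x = " + result[0] + ", value = " + result[1]);
    }
}
```

For the sample data this picks x = 2 with a total of 9, matching the question.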
Hmmm, a general O(n log n) or better would be tricky, probably solvable via linear programming, but that can get rather complex.
After a bit of wrangling, I think this can be solved via line intersections and summation of functions (represented by line segment intersections). Basically, think of each input as a triangle on top of a line. If the input is (C,R), the triangle is centered on C and has a radius of R. The points on the line are C-R (value 0), C (value R) and C+R (value 0). Each line segment of the triangle represents a value.
Consider any 2 such "triangles"; the max value occurs in one of 2 places:
The peak of one of the triangles
The intersection point of the triangles, or the point where the two triangles overlap.
Multiple triangles just mean more possible intersection points. Sadly, the number of possible intersections grows quadratically, so O(N log N) or better may be impossible with this method unless some good optimizations are found or the number of intersections is O(N) or less.
To find all the intersection points, we can use a standard algorithm for that, modified in one specific way: we add a line that extends from each peak high enough to be above any other line, so basically from (C,C) to (C,Max_R). We then run the algorithm. Output-sensitive intersection-finding algorithms are O(N log N + K), where K is the number of intersections. Sadly, K can be as high as O(N^2): consider the case (1,100), (2,100), (3,100), and so on up to (50,100), where every line intersects every other line.
Once you have the O(N + K) intersections, you can calculate the value at each one by summing all the lines in the queue. The running sum can be kept as a cached value so that it only changes O(K) times, though that might not be possible, in which case the cost would be O(N*K) instead, making it potentially O(N^3) in the worst case for K :(. That bound seems reasonable, since for each intersection you need to sum up to O(N) lines to get the value at that point, though in practice the performance would likely be better.
There are optimizations that could be done, considering that you aim for the max and not just to find intersections. There are likely intersections not worth pursuing; for example, in practice there are certainly cases where you can be sure that the sum is going to be less than the current known max value. However, I could also see a situation where the values are so close that you can't cut anything down. It reminds me of convex hull: in many cases you can easily discard 90% of the data, but there are cases where you hit the worst case (every point, or almost every point, is a hull point).
Another optimization might be building an interval tree.

Compute covariance matrix using Nd4j

Given a 2 dimensional matrix, I'd like to compute the corresponding covariance matrix.
Are there any methods included with Nd4j that would facilitate this operation?
For example, the covariance matrix computed from the following matrix
1 2
8 12
constructed using Nd4j here:
INDArray array1 = Nd4j.zeros(2, 2);
array1.putScalar(0, 0, 1);
array1.putScalar(0, 1, 2);
array1.putScalar(1, 0, 8);
array1.putScalar(1, 1, 12);
should be
24.5 35.0
35.0 50.0
This can easily be done using pandas' DataFrame's cov method like so:
>>> pandas.DataFrame([[1, 2],[8, 12]]).cov()
0 1
0 24.5 35.0
1 35.0 50.0
Is there any way of doing this using Nd4j?
I hope you have already found a solution. For those who are facing the same problem, here is a method in ND4J that computes a covariance matrix:
/**
 * Returns the covariance matrix of a data set of many records, each with N features.
 * It also returns the average values, which are usually going to be important since in this
 * version, all modes are centered around the mean. It's a matrix that has elements that are
 * expressed as average dx_i * dx_j (used in procedure) or average x_i * x_j - average x_i * average x_j
 *
 * @param in A matrix of vectors of fixed length N (N features) on each row
 * @return INDArray[2], an N x N covariance matrix is element 0, and the average values is element 1.
 */
public static INDArray[] covarianceMatrix(INDArray in)
GitHub source
This method is found in the org.nd4j.linalg.dimensionalityreduction.PCA class.
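To check the expected numbers from the question independently of ND4J, here is a minimal plain-Java sketch of the sample covariance (dividing by rows - 1, which is what reproduces the pandas values quoted in the question). The class and method names are made up, and nothing here says which normalization the ND4J method uses; compare its output against this if it matters.

```java
import java.util.Arrays;

final class CovarianceSketch {
    /** Sample covariance of the columns of data (rows are records, at least 2 of them). */
    static double[][] covariance(double[][] data) {
        int rows = data.length;
        int cols = data[0].length;
        double[] mean = new double[cols];
        for (double[] row : data) {
            for (int j = 0; j < cols; j++) {
                mean[j] += row[j] / rows;
            }
        }
        double[][] cov = new double[cols][cols];
        for (double[] row : data) {
            for (int i = 0; i < cols; i++) {
                for (int j = 0; j < cols; j++) {
                    cov[i][j] += (row[i] - mean[i]) * (row[j] - mean[j]) / (rows - 1);
                }
            }
        }
        return cov;
    }

    public static void main(String[] args) {
        double[][] cov = covariance(new double[][] {{1, 2}, {8, 12}});
        System.out.println(Arrays.deepToString(cov)); // [[24.5, 35.0], [35.0, 50.0]]
    }
}
```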

Find total number of permutation of sub-categories ignoring reflections

I have categories and the total number of elements in each category.
For example:
2 L,
2 S, and
1 P
I could line them up in the following 16 ways:
llpss
llsps
llssp
lplss
lpsls
lslps
lslsp
lspls
lspsl
lsslp
lsspl
pllss
sllps
sllsp
slpls
slslp
Before you object that the list is incomplete, you should know that
mirror images are considered to be equivalent.
For example, since "sspll" is the same as "llpss" read from back to front,
we count them as one.
You are given an int[] containing the number of elements in each
category (L,S,P,A,B). Return an int stating the number of ways they can
be lined up, ignoring reflections.
For example:
{2, 2, 1}
Returns: 16 // illustrated above.
{2, 2, 2}
Returns: 48
The algorithm I can think of is very basic:
Convert the numbers to their respective letters (L,S,P,A,B ; B at index 0).
Count the total number of permutations possible with those letters.
Remove the reflections.
But this is certainly not an optimal solution. Can anyone suggest another solution for this problem?
Thanks.
The formula for the number of words of type (a,b,c) discounting reflections is:
[ (a+b+c)! / a! / b! / c! + correction ] / 2
where correction is the number of words whose reflection equals themselves.
For example, for (2,2,1) the correction term is 2 for the two words lspsl and slpls.
The total number of words is (5! / 2! / 2! + 2)/2 = (120/4 + 2) / 2 = 32/2 = 16.
For (1,1,1) the correction term is 0. The correction term is also 0 for (2,1,1). The correction term can be computed directly from the numbers a, b and c (left as an exercise.)
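One way to fill in the exercise, as a sketch: a word equals its own reflection exactly when it is a palindrome. A palindrome is fixed by its left half, and at most one category may have an odd count (its extra element sits in the middle). So the correction term is 0 when two or more counts are odd, and otherwise the multinomial coefficient of the halved counts. The class and method names below are made up for illustration, and long arithmetic is assumed to be large enough for the inputs.

```java
final class ReflectionCount {
    /** Multinomial coefficient (c1 + c2 + ...)! / (c1! * c2! * ...). */
    static long multinomial(int[] counts) {
        long result = 1;
        int placed = 0;
        for (int c : counts) {
            for (int i = 1; i <= c; i++) {
                placed++;
                // Stays an integer: this multiplies in C(placed, i) step by step.
                result = result * placed / i;
            }
        }
        return result;
    }

    /** Number of words that read the same in both directions (the correction term). */
    static long palindromes(int[] counts) {
        int odd = 0;
        int[] half = new int[counts.length];
        for (int k = 0; k < counts.length; k++) {
            odd += counts[k] % 2;
            half[k] = counts[k] / 2;
        }
        return odd > 1 ? 0 : multinomial(half);
    }

    static long countIgnoringReflections(int[] counts) {
        return (multinomial(counts) + palindromes(counts)) / 2;
    }

    public static void main(String[] args) {
        System.out.println(countIgnoringReflections(new int[] {2, 2, 1})); // 16
        System.out.println(countIgnoringReflections(new int[] {2, 2, 2})); // 48
    }
}
```

For (2,2,2) this gives (90 + 6) / 2 = 48, which agrees with the second example in the question.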

Partition 2d array in sub-arrays

I have to partition a 2d array (the size is given by the user) into a number of sub-arrays that the user also provides. The code I wrote works well for most instances, but there are some that I need help with.
I do this by taking the square root of the input number. So for example:
If the user inserts [10, 10, 9] it means that this is a 10 * 10 array with 9 sub-arrays. Taking the square root of 9 works fine because it gives 3.
If the user inserts [8, 6, 6] it takes the square root of 6 and rounds it up for the longest side (which gives 3) and rounds it down for the shortest (which is 2). So 3 * 2 = 6. It also works fine.
Then there is a situation like 8. The square root of 8 gives 3 and 2. So the array is partitioned into 6 sub-arrays. Is there another way to find a better partitioning for numbers like 8, 14? Or is there a way to find the optimal distribution for such numbers (e.g. 2 * 4 = 8, 2 * 7 = 14)?
You can calculate them in a slightly different way (Math.round returns a long, hence the casts):
int x = (int) Math.round(Math.sqrt(n));
int y = (int) Math.round(1. * n / x);
This gives:
n = 8 => x = 3, y = 3
n = 14 => x = 4, y = 4
What you need to do is find the two nearest factors to the square root. Try this code:
long n = 14;
long y = 0;
long x = Math.round(Math.sqrt(n));
while (true) {
    if (n % x == 0) {
        y = n / x;
        break;
    } else {
        x--;
    }
}
You might also like to put in some error checking to cope with input errors. e.g. n<1.
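A sketch of the same search wrapped in a method with the suggested input check. The class and method names are invented for illustration, and it starts from the floor of the square root rather than the rounded value, so the smaller factor always comes first.

```java
final class GridFactors {
    /** Returns {x, y} with x * y == n and x <= y, choosing x as close to sqrt(n) as possible. */
    static long[] closestFactors(long n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1, got " + n);
        }
        long x = (long) Math.sqrt((double) n);
        while (x * x > n) {
            x--; // guard in case the floating-point square root came out slightly high
        }
        while (n % x != 0) {
            x--; // ends at x == 1 at the latest, since 1 divides everything
        }
        return new long[] {x, n / x};
    }

    public static void main(String[] args) {
        for (long n : new long[] {6, 8, 9, 14}) {
            long[] f = closestFactors(n);
            System.out.println(n + " = " + f[0] + " * " + f[1]);
        }
    }
}
```

Note that a prime n only factors as 1 * n, so a strip layout is the only exact partition in that case.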

Liblinear usage format

I am using the .NET implementation of liblinear in my C# code via the following NuGet package:
https://www.nuget.org/packages/Liblinear/
But in the readme file of liblinear, the format for x is:
struct problem describes the problem:
struct problem
{
    int l, n;
    int *y;
    struct feature_node **x;
    double bias;
};
where `l` is the number of training data. If bias >= 0, we assume
that one additional feature is added to the end of each data
instance. `n` is the number of feature (including the bias feature
if bias >= 0). `y` is an array containing the target values. (integers
in classification, real numbers in regression) And `x` is an array
of pointers, each of which points to a sparse representation (array
of feature_node) of one training vector.
For example, if we have the following training data:
LABEL       ATTR1   ATTR2   ATTR3   ATTR4   ATTR5
-----       -----   -----   -----   -----   -----
1           0       0.1     0.2     0       0
2           0       0.1     0.3    -1.2     0
1           0.4     0       0       0       0
2           0       0.1     0       1.4     0.5
3          -0.1    -0.2     0.1     1.1     0.1
and bias = 1, then the components of problem are:
l = 5
n = 6
y -> 1 2 1 2 3
x -> [ ] -> (2,0.1) (3,0.2) (6,1) (-1,?)
     [ ] -> (2,0.1) (3,0.3) (4,-1.2) (6,1) (-1,?)
     [ ] -> (1,0.4) (6,1) (-1,?)
     [ ] -> (2,0.1) (4,1.4) (5,0.5) (6,1) (-1,?)
     [ ] -> (1,-0.1) (2,-0.2) (3,0.1) (4,1.1) (5,0.1) (6,1) (-1,?)
But, in the example showing java implementation:
https://gist.github.com/hodzanassredin/6682771
problem.x <- [|
[|new FeatureNode(1,0.); new FeatureNode(2,1.)|]
[|new FeatureNode(1,2.); new FeatureNode(2,0.)|]
|]// feature nodes
problem.y <- [|1.;2.|] // target values
which means his data set is:
1 0 1
2 2 0
So he is not storing the nodes in liblinear's sparse format. Does anyone know the correct format for x in this liblinear implementation?
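For reference, the sparse layout described in the readme excerpt can be sketched in plain Java. The Node class and toSparse method below are hypothetical stand-ins, not types from any liblinear port. The sketch keeps only the non-zero attributes with 1-based indices and appends the bias as one extra feature when bias >= 0. It leaves out the C version's (-1,?) terminator, since a Java list carries its own length.

```java
import java.util.ArrayList;
import java.util.List;

final class SparseRows {
    /** Hypothetical (index, value) pair, standing in for a feature node. */
    static final class Node {
        final int index;
        final double value;

        Node(int index, double value) {
            this.index = index;
            this.value = value;
        }
    }

    /** Converts one dense training vector to the sparse form from the readme. */
    static List<Node> toSparse(double[] dense, double bias) {
        List<Node> nodes = new ArrayList<>();
        for (int j = 0; j < dense.length; j++) {
            if (dense[j] != 0.0) {
                nodes.add(new Node(j + 1, dense[j])); // indices start at 1
            }
        }
        if (bias >= 0) {
            nodes.add(new Node(dense.length + 1, bias)); // the extra bias feature
        }
        return nodes;
    }

    public static void main(String[] args) {
        // First row of the readme table, with bias = 1.
        for (Node node : toSparse(new double[] {0, 0.1, 0.2, 0, 0}, 1)) {
            System.out.println(node.index + " -> " + node.value);
        }
    }
}
```

With the first row of the table and bias = 1, this yields nodes at indices 2, 3 and 6, as in the readme's first x entry.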
Though it doesn't address exactly the library you mentioned, I can offer you an alternative. The
Accord.NET Framework has recently incorporated all of LIBLINEAR's algorithms in its machine learning
namespaces. It is also available through NuGet.
In this library, the direct syntax to create a linear support vector machine from in-memory data is
// Create a simple binary AND
// classification problem:
double[][] problem =
{
    //             a  b  a + b
    new double[] { 0, 0, 0 },
    new double[] { 0, 1, 0 },
    new double[] { 1, 0, 0 },
    new double[] { 1, 1, 1 },
};
// Get the two first columns as the problem
// inputs and the last column as the output
// input columns
double[][] inputs = problem.GetColumns(0, 1);
// output column
int[] outputs = problem.GetColumn(2).ToInt32();
// However, SVMs expect the output value to be
// either -1 or +1. As such, we have to convert
// it so the vector contains { -1, -1, -1, +1 }:
//
outputs = outputs.Apply(x => x == 0 ? -1 : 1);
After the problem is created, one can learn a linear SVM using
// Create a new linear-SVM for two inputs (a and b)
SupportVectorMachine svm = new SupportVectorMachine(inputs: 2);
// Create a L2-regularized L2-loss support vector classification
var teacher = new LinearDualCoordinateDescent(svm, inputs, outputs)
{
    Loss = Loss.L2,
    Complexity = 1000,
    Tolerance = 1e-5
};
// Learn the machine
double error = teacher.Run(computeError: true);
// Compute the machine's answers for the learned inputs
int[] answers = inputs.Apply(x => Math.Sign(svm.Compute(x)));
This assumes, however, that your data is already in-memory. If you wish to load your data from the
disk, from a file in libsvm sparse format, you can use the framework's SparseReader class.
An example on how to use it can be found below:
// Suppose we are going to read a sparse sample file containing
// samples which have an actual dimension of 4. Since the samples
// are in a sparse format, each entry in the file will probably
// have a much smaller number of elements.
//
int sampleSize = 4;
// Create a new Sparse Sample Reader to read any given file,
// passing the correct dense sample size in the constructor
//
SparseReader reader = new SparseReader(file, Encoding.Default, sampleSize);
// Declare a vector to obtain the label
// of each of the samples in the file
//
int[] labels = null;
// Declare a vector to obtain the description (or comments)
// about each of the samples in the file, if present.
//
string[] descriptions = null;
// Read the sparse samples and store them in a dense vector array
double[][] samples = reader.ReadToEnd(out labels, out descriptions);
Afterwards, one can use the samples and labels vectors as the inputs and outputs of the problem,
respectively.
I hope it helps.
Disclaimer: I am the author of this library. I am answering this question in the sincere hope it
can be useful for the OP, since not long ago I also faced the same problems. If a moderator thinks
this looks like spam, feel free to delete. However, I am only posting this because I think it might
help others. I even came across this question by mistake while searching for existing C#
implementations of LIBSVM, not LIBLINEAR.
