I'm doing K-means clustering on a 12-dimensional matrix. I managed to get the result as K sets of clusters. I want to show the result by plotting it on a 2D graph, but I can't figure out how to convert the 12-dimensional data into 2 dimensions.
Any suggestions on how to do the conversion, or any alternative ways of visualizing the result? I tried Multidimensional Scaling for Java (MDSJ) but it did not work.
The K-means algorithm I'm using is from the Java Machine Learning Library: Clustering basics.
I would do Principal Component Analysis (probably the easiest of the multidimensional scaling algorithms). (BTW, PCA has nothing to do with K-means; it is a general method for dimensionality reduction.)
I assume variables are in columns, observations are in rows.
Standardize the data - convert the variables to z-scores. That means: from each cell, subtract the mean of its column and divide the result by the standard deviation of the column. That way you get zero mean and unit variance. The former is obligatory; the latter is, I would say, good to do. If the data have unit variance, you can calculate the eigenvectors from the covariance matrix; otherwise you have to use the correlation matrix, which kind of standardizes the data automatically. See this for an explanation.
Calculate the eigenvectors and eigenvalues of the covariance matrix. Sort the eigenvectors by their eigenvalues, in decreasing order. (Many libraries already give you the eigenvectors sorted that way.)
Multiply the original matrix (converted to z-scores) by the first two columns of the eigenvector matrix, and visualize that data.
Using the colt library, you can do the following. It will be similar with other matrix libraries:
import cern.colt.matrix.DoubleMatrix1D;
import cern.colt.matrix.DoubleMatrix2D;
import cern.colt.matrix.doublealgo.Statistic;
import cern.colt.matrix.impl.DenseDoubleMatrix2D;
import cern.colt.matrix.impl.SparseDoubleMatrix2D;
import cern.colt.matrix.linalg.Algebra;
import cern.colt.matrix.linalg.EigenvalueDecomposition;
import hep.aida.bin.DynamicBin1D;
public class Pca {
// to show matrix creation, it does not make much sense to calculate PCA on random data
public static void main(String[] x) {
double[][] data = {
{2.0,4.0,1.0,4.0,4.0,1.0,5.0,5.0,5.0,2.0,1.0,4.0},
{2.0,6.0,3.0,1.0,1.0,2.0,6.0,4.0,4.0,4.0,1.0,5.0},
{3.0,4.0,4.0,4.0,2.0,3.0,5.0,6.0,3.0,1.0,1.0,1.0},
{3.0,6.0,3.0,3.0,1.0,2.0,4.0,6.0,1.0,2.0,4.0,4.0},
{1.0,6.0,4.0,2.0,2.0,2.0,3.0,4.0,6.0,3.0,4.0,1.0},
{2.0,5.0,5.0,3.0,1.0,1.0,6.0,6.0,3.0,2.0,6.0,1.0}
};
DoubleMatrix2D matrix = new DenseDoubleMatrix2D(data);
DoubleMatrix2D pm = pcaTransform(matrix);
// print the first two dimensions of the transformed matrix - they capture most of the variance of the original data
System.out.println(pm.viewPart(0, 0, pm.rows(), 2).toString());
}
/** Returns a matrix in the space of principal components, take the first n columns */
public static DoubleMatrix2D pcaTransform(DoubleMatrix2D matrix) {
DoubleMatrix2D zScoresMatrix = toZScores(matrix);
final DoubleMatrix2D covarianceMatrix = Statistic.covariance(zScoresMatrix);
// compute eigenvalues and eigenvectors of the covariance matrix (flip needed since it is sorted by ascending).
final EigenvalueDecomposition decomp = new EigenvalueDecomposition(covarianceMatrix);
// Columns of Vs are eigenvectors = principal components = base of the new space; ordered by decreasing variance
final DoubleMatrix2D Vs = decomp.getV().viewColumnFlip();
// eigenvalues: ev(i) / sum(ev) is the percentage of variance captured by i-th column of Vs
// final DoubleMatrix1D ev = decomp.getRealEigenvalues().viewFlip();
// project the original matrix to the pca space
return Algebra.DEFAULT.mult(zScoresMatrix, Vs);
}
/**
* Converts matrix to a matrix of z-scores (by columns)
*/
public static DoubleMatrix2D toZScores(final DoubleMatrix2D matrix) {
final DoubleMatrix2D zMatrix = new SparseDoubleMatrix2D(matrix.rows(), matrix.columns());
for (int c = 0; c < matrix.columns(); c++) {
final DoubleMatrix1D column = matrix.viewColumn(c);
final DynamicBin1D bin = Statistic.bin(column);
if (bin.standardDeviation() == 0) { // constant column; ideally compare against a small epsilon rather than exact zero
for (int r = 0; r < matrix.rows(); r++) {
zMatrix.set(r, c, 0.0);
}
} else {
for (int r = 0; r < matrix.rows(); r++) {
double zScore = (column.get(r) - bin.mean()) / bin.standardDeviation();
zMatrix.set(r, c, zScore);
}
}
}
return zMatrix;
}
}
You could also use Weka. I would first load your data into Weka, then run PCA using the GUI (under attribute selection). You will see which classes are called with which parameters, and can then do the same thing from your own code. The catch is that you will need to convert/wrap your matrix into the data format Weka works with.
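For illustration, here is a minimal sketch of what that could look like with Weka's unsupervised PCA filter; the file name clusters.arff is a placeholder, and this is my own sketch rather than code from the original answer:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;
public class WekaPcaSketch {
    public static void main(String[] args) throws Exception {
        // load the 12-dimensional data, exported to ARFF (or CSV) beforehand
        Instances data = DataSource.read("clusters.arff");
        PrincipalComponents pca = new PrincipalComponents();
        pca.setInputFormat(data);
        Instances transformed = Filter.useFilter(data, pca);
        // the first attributes of 'transformed' are the leading principal components - plot the first two
        System.out.println(transformed);
    }
}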
A similar question has been discussed on CrossValidated. The basic idea is to find an appropriate projection that separates these clusters (e.g., with discproj in R) and then to plot the clusters in the projected space.
In addition to what the other answers suggest, you should probably have a look at multidimensional scaling too.
I'm working on a data cleansing algorithm. When I calculate the Mahalanobis distance between each point in the data set and the mean vector, the distances all come out the same.
For example, I have a data set like:
{{2,2,3},{4,5,9},{7,8,9}}
The mean vector is:
{13/3,5,7}
And the covariance matrix is:
{{6.333333333333333,7.5,7.0},{7.5,9.0,9.0},{7.0,9.0,12.0}}
The distances between {2,2,3}, {4,5,9}, {7,8,9} and the mean vector then all come out as 8290542, which is quite strange. Calculating it by hand on paper gives the same result.
Does anyone know what's wrong with my code or my reasoning? I'd be more than grateful if someone could help me out. Below is the code I used for this problem.
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.stat.correlation.Covariance;
import org.apache.mahout.math.*;
import org.apache.mahout.common.distance.MahalanobisDistanceMeasure;
public class Test {
public static void main(String[] args) {
double[] a = {2,2,3};
Vector aVector = new DenseVector(a);
double[] b = {4,5,9};
Vector bVector = new DenseVector(b);
double[] c = {7,8,9};
Vector cVector = new DenseVector(c);
double[] mean = {13.0/3,5,7};
Vector meanVector = new DenseVector(mean);
MahalanobisDistanceMeasure measure = new MahalanobisDistanceMeasure();
double[][] ma = {{2,2,3},{4,5,9},{7,8,9}};
RealMatrix matrix = new Covariance(ma).getCovarianceMatrix();
Matrix math = new DenseMatrix(matrix.getData());
measure.setCovarianceMatrix(math);
measure.setMeanVector(meanVector);
System.out.println(matrix.toString());
System.out.println(measure.distance(meanVector,cVector));
}
}
You need to use more data.
The mean vector + covariance matrix will otherwise overfit to your data and give the same distance for each point.
For 3d data, use at least 20 points.
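As a rough sketch of that advice (the random sample and variable names below are made up, not taken from the question), estimating the mean and covariance from a larger sample and reusing the same Mahout and Commons Math calls yields distances that differ from point to point:
// hypothetical larger sample: 20 random 3-d points instead of 3
double[][] sample = new double[20][3];
java.util.Random rnd = new java.util.Random(42);
for (int i = 0; i < sample.length; i++)
    for (int j = 0; j < 3; j++)
        sample[i][j] = rnd.nextGaussian();
// mean vector of the sample
double[] mean = new double[3];
for (double[] row : sample)
    for (int j = 0; j < 3; j++)
        mean[j] += row[j] / sample.length;
MahalanobisDistanceMeasure measure = new MahalanobisDistanceMeasure();
measure.setCovarianceMatrix(new DenseMatrix(new Covariance(sample).getCovarianceMatrix().getData()));
measure.setMeanVector(new DenseVector(mean));
// the distances now differ between points
System.out.println(measure.distance(new DenseVector(mean), new DenseVector(sample[0])));
System.out.println(measure.distance(new DenseVector(mean), new DenseVector(sample[1])));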
I have to fit a Gaussian curve to data points where 1 peak is expected. The data points can have an arbitrary offset in the y-direction.
I am using the org.apache.commons.math3 package. However, when creating a GaussianCurveFitter instance, it is only possible to pass the following initial guess values:
Normalization
Mean
Sigma
Right now I have come this far:
import org.apache.commons.math3.fitting.GaussianCurveFitter;
import org.apache.commons.math3.fitting.WeightedObservedPoints;
public void fitGaussian(double[] data) {
WeightedObservedPoints obs = new WeightedObservedPoints();
//add data points
for (int j = 0; j < data.length; j++) {
obs.add(j, data[j]);
}
//fit gaussian curve
double[] parameters = GaussianCurveFitter.create().fit(obs.toList());
}
Here parameters contains the above-mentioned values (Normalization, Mean, Sigma).
Does anybody have an idea how I can also include the offset in the y-direction as a free optimization parameter? Or maybe how to transform the original data points to suit the optimizer?
Thanks for the help!
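One possible workaround, sketched below as my own suggestion rather than an answer from the thread: treat the minimum of the data as a rough estimate of the constant offset, subtract it before fitting, and add it back when evaluating the fitted model.
import java.util.Arrays;
import org.apache.commons.math3.fitting.GaussianCurveFitter;
import org.apache.commons.math3.fitting.WeightedObservedPoints;
public void fitGaussianWithOffset(double[] data) {
    // crude baseline estimate: the smallest observed value
    double offset = Arrays.stream(data).min().orElse(0.0);
    WeightedObservedPoints obs = new WeightedObservedPoints();
    for (int j = 0; j < data.length; j++) {
        obs.add(j, data[j] - offset); // shift so the baseline is roughly zero
    }
    double[] p = GaussianCurveFitter.create().fit(obs.toList()); // {Normalization, Mean, Sigma}
    // fitted model: f(x) = offset + p[0] * exp(-(x - p[1])^2 / (2 * p[2]^2))
}
This only works well when the baseline really is close to the minimum of the data; a proper solution would fit the offset as a fourth parameter, e.g., with a custom fitter built on AbstractCurveFitter.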
Basically, I would like to calculate the inverse of a matrix which belongs to the ComplexDoubleMatrix class, but I could not find a function like inverse() or inv(). Does anybody know how to calculate the inverse of a matrix? Thank you in advance.
My final goal is to create an eigendecomposition of a matrix using jblas.eigen.
My current implementation, shown below, uses the JAMA library. To perform the same steps, I need to calculate Vinverse, which is why I want to find an inverse function in jblas.
public static SimpleEigenDecomposition SimpleEigenDecomposition(double [][] rates)
{
Matrix ratesMatrix = new Matrix(rates);
EigenvalueDecomposition ed = new EigenvalueDecomposition(ratesMatrix);
Matrix V = ed.getV();
Matrix D =ed.getD();
Matrix Vinverse = V.inverse();
Matrix resultMatrix = V.times(D).times(V.inverse());
//check if result and rates are close enough
SimpleMatrix trueMatrix = new SimpleMatrix(rates);
SimpleMatrix calculatedMatrix = new SimpleMatrix(resultMatrix.getArray()) ;
if(EJMLUtils.isClose(trueMatrix, calculatedMatrix, THRESHOLD))
{
return new SimpleEigenDecomposition(V, D, Vinverse);
}else{
throw new RuntimeException();
}
}
The reason there is no inverse operation is that it is simply too computationally expensive if done using Cramer's rule. I initially thought this was weird, since it could have been implemented using Gauss-Jordan elimination, but strangely I could not find one. If anyone finds it in jblas, please comment below.
One alternative I can suggest is using pinv(). It uses the least-squares method and is available in org.jblas.Solve as a static function.
import org.jblas.Solve;
public static SimpleEigenDecomposition SimpleEigenDecomposition(double [][] rates)
{
// inside your method, replace the V.inverse() call with this
DoubleMatrix Vinverse = Solve.pinv(V);
}
The least-squares pinv gives the same output as the actual inverse when the matrix is invertible.
The inverse of a matrix A can be found by solving AX = I, where I is the identity matrix and X will be the inverse of A. So, using jblas, we can write:
DoubleMatrix Vinverse = Solve.solve(A, DoubleMatrix.eye(A.rows));
Note that we cannot invert a non-square matrix. We can check that matrix A is square using the isSquare method:
A.isSquare(); // returns true if it is
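To tie this back to the eigendecomposition goal, here is a rough sketch for the real symmetric case; the method name and flow are mine, and Eigen.symmetricEigenvectors assumes the rate matrix is symmetric (for the general case the result is complex and Eigen.eigenvectors would be needed):
import org.jblas.DoubleMatrix;
import org.jblas.Eigen;
import org.jblas.Solve;
public static void eigenSketch(double[][] rates) {
    DoubleMatrix A = new DoubleMatrix(rates);
    DoubleMatrix[] vd = Eigen.symmetricEigenvectors(A); // vd[0] = V, vd[1] = diagonal D
    DoubleMatrix V = vd[0];
    DoubleMatrix D = vd[1];
    DoubleMatrix Vinverse = Solve.solve(V, DoubleMatrix.eye(V.rows));
    DoubleMatrix reconstructed = V.mmul(D).mmul(Vinverse); // should be close to A
    System.out.println(reconstructed.sub(A).normmax()); // maximum absolute reconstruction error
}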
I am trying to build an OCR by calculating the correlation coefficient between characters extracted from an image and every character I have pre-stored in a database. My implementation is based on Java, and the pre-stored characters are loaded into an ArrayList when the application starts, i.e.
ArrayList<byte []> storedCharacters, extractedCharacters;
storedCharacters = load_all_characters_from_database();
extractedCharacters = extract_characters_from_image();
// Calculate the correlation coefficient between every extracted character
// and every character in the database.
double maxCorr = -1;
for(byte [] extractedCharacter : extractedCharacters)
for(byte [] storedCharacter : storedCharacters)
{
double corr = findCorrelation(extractedCharacter, storedCharacter);
if (corr > maxCorr)
maxCorr = corr;
}
...
...
public double findCorrelation(byte [] extractedCharacter, byte [] storedCharacter)
{
double mag1 = 0, mag2 = 0, corr = 0;
for(int i=0; i < extractedCharacter.length; i++)
{
mag1 += extractedCharacter[i] * extractedCharacter[i];
mag2 += storedCharacter[i] * storedCharacter[i];
corr += extractedCharacter[i] * storedCharacter[i];
} // for
corr /= Math.sqrt(mag1*mag2);
return corr;
}
The number of extractedCharacters is around 100-150 per image, but the database has 15600 stored binary characters. Checking the correlation coefficient between every extracted character and every stored character has an impact on performance, as it takes around 15-20 seconds per image on an Intel i5 CPU.
Is there a way to improve the speed of this program, or can you suggest another approach that gives similar results? (The results produced by comparing every character against such a large dataset are quite good.)
Thank you in advance
UPDATE 1
public static void run() {
ArrayList<byte []> storedCharacters, extractedCharacters;
storedCharacters = load_all_characters_from_database();
extractedCharacters = extract_characters_from_image();
// Calculate the correlation coefficient between every extracted character
// and every character in the database.
computeNorms(storedCharacters, extractedCharacters);
double maxCorr = -1;
for(int ext = 0; ext < extractedCharacters.size(); ext++)
for(int str = 0; str < storedCharacters.size(); str++)
{
double corr = findCorrelation(extractedCharacters.get(ext), storedCharacters.get(str), str, ext);
if (corr > maxCorr)
maxCorr = corr;
}
}
private static double[] storedNorms;
private static double[] extractedNorms;
// Correlation between two binary images
public static double findCorrelation(byte[] arr1, byte[] arr2, int strCharIndex, int extCharNo){
final int dotProduct = dotProduct(arr1, arr2);
final double corr = dotProduct * storedNorms[strCharIndex] * extractedNorms[extCharNo];
return corr;
}
public static void computeNorms(ArrayList<byte[]> storedCharacters, ArrayList<byte[]> extractedCharacters) {
storedNorms = computeInvNorms(storedCharacters);
extractedNorms = computeInvNorms(extractedCharacters);
}
private static double[] computeInvNorms(List<byte []> a) {
final double[] result = new double[a.size()];
for (int i=0; i < result.length; ++i)
result[i] = 1 / Math.sqrt(dotProduct(a.get(i), a.get(i)));
return result;
}
private static int dotProduct(byte[] arr1, byte[] arr2) {
int dotProduct = 0;
for(int i = 0; i< arr1.length; i++)
dotProduct += arr1[i] * arr2[i];
return dotProduct;
}
Nowadays it's hard to find a CPU with a single core (even in mobile phones). As the tasks are nicely separated, you can parallelize them with just a few lines. So I'd go for it, though the gain is limited.
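For example, one "few lines" version of that idea, sketched here with the variable names from the question but not taken from the answer, is to run each extracted character against the database on the common fork-join pool:
extractedCharacters.parallelStream().forEach(extractedCharacter -> {
    double best = -1;
    for (byte[] storedCharacter : storedCharacters) {
        double corr = findCorrelation(extractedCharacter, storedCharacter);
        if (corr > best) {
            best = corr; // remember the best match for this extracted character
        }
    }
});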
In case you really mean cross-correlation, then a transform like the DFT or DCT could help. They surely do for big images, but with yours at 12x16, I'm not sure.
Maybe you mean just a dot product? And maybe you should tell us?
Note that you often don't actually need to compute the correlation; most of the time all you need is to find out whether it's bigger than a threshold:
corr = findCorrelation(extractedCharacter, storedCharacter)
..... more code to check if this is the best match ......
This may or may not lead to some optimizations, depending on what the images look like.
Note also that a simple low-level optimization can give you nearly a factor of 4, as in this question of mine. Maybe you really should tell us what you're doing?
UPDATE 1
I guess that, due to the computation of three products in the loop, there's enough instruction-level parallelism, so manual loop unrolling like in my question above is not necessary.
However, I see that those three products get computed some 100 * 15600 times, while only one of them depends on both extractedCharacter and storedCharacter. So you can compute
100 + 15600 + 100 * 15600
dot products instead of
3 * 100 * 15600
This way you may get a factor of three pretty easily.
Or not. After this step there's only a single sum computed in the inner loop, so the problem linked above applies, and so does its solution (manual unrolling).
Factor 5.2
While byte[] is nicely compact, the computation involves extending the bytes to ints, which costs some time, as my benchmark shows. Converting the byte[]s to int[]s before all the correlations get computed saves time. Even better is to exploit the fact that this conversion for storedCharacters can be done beforehand.
Manual loop unrolling twice helps but unrolling more doesn't.
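A rough sketch of those two ideas combined (the helper names are mine): convert each character to int[] once up front, and unroll the dot-product loop twice.
// one-time conversion; do this for the 15600 stored characters before the matching loop
static int[][] toIntArrays(java.util.List<byte[]> characters) {
    int[][] result = new int[characters.size()][];
    for (int i = 0; i < result.length; i++) {
        byte[] b = characters.get(i);
        int[] v = new int[b.length];
        for (int j = 0; j < b.length; j++) {
            v[j] = b[j];
        }
        result[i] = v;
    }
    return result;
}
// dot product on the pre-converted arrays, manually unrolled twice
static int dotProduct(int[] a, int[] b) {
    int sum0 = 0, sum1 = 0;
    int i = 0;
    for (; i + 1 < a.length; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    if (i < a.length) {
        sum0 += a[i] * b[i];
    }
    return sum0 + sum1;
}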
I am not familiar with coordinate systems or much of the math dealing with these things at all. What I am trying to do is take a Point (x,y) and find its position in a 1-dimensional array such that it follows this mapping:
(0,2)->0 (1,2)->1 (2,2)->2
(0,1)->4 (1,1)->5 (2,1)->6
(0,0)->8 (1,0)->9 (2,0)->10
where the arrows are showing what value the coordinates should map to. Notice that an index is skipped after each row. I'm think it'll end up being a fairly trivial solution, but I can't find any questions similar to this and I haven't had any luck coming up with ideas myself. I do know the width and height of the 2 dimensional array. Thank you for any help!
My question is perhaps ambiguous or using the wrong terminology, my apologies.
I know that the coordinate (0,0) will be the bottom left position. I also know that the top left coordinate should be placed at index 0. Each new row skips an index by 1. The size of the coordinate system varies, but I know the number of rows and number of columns.
First step: flip the values upside down, keeping the points intact:
(0,2)->8 (1,2)->9 (2,2)->10
(0,1)->4 (1,1)->5 (2,1)->6
(0,0)->0 (1,0)->1 (2,0)->2
You'll notice that y affects the output by a factor of 4 and x by a factor of 1.
Thus we get a very simple 4y + x.
Now to get back to the original, you'll notice the transformation is (x,y) <- (x,2-y) (that is, if we transform each point above with this transformation, we get the original required mapping).
So, substituting it into the equation, we get (2-y)*4 + x.
Now this is specific to 3x3, but I'm sure you'll be able to generalize it by replacing 2 and 4 by variables.
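For example, a generalized version could look like the sketch below, assuming rows rows and cols columns with one index skipped after each row, so every row occupies cols + 1 slots:
static int toIndex(int x, int y, int rows, int cols) {
    // the top row (y = rows - 1) starts at index 0; each row below starts (cols + 1) later
    return (rows - 1 - y) * (cols + 1) + x;
}
// toIndex(0, 2, 3, 3) == 0, toIndex(2, 1, 3, 3) == 6, toIndex(2, 0, 3, 3) == 10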
If you want to reduce the dimension and avoid overlapping, you need a space-filling curve, for example a Morton curve. Your example looks like a Peano curve because it's a 3x3 matrix. These curves are difficult to calculate but have some nice properties. If you are just looking for self-avoiding curves, you can create your own; read here: http://www.fractalcurves.com/Root4Square.html.
I was beaten to the formula; here is the brute-force approach using a Map.
import java.awt.Point;
import java.util.HashMap;
import java.util.Map;
public class MapPointToIndex {
private Map<Point, Integer> map;
private int index, rowcount;
public MapPointToIndex(int rows, int columns) {
map = new HashMap<Point, Integer>();
for (int i = rows - 1; i >= 0; i--) {
index += rowcount;
for (int j = 0; j < columns; j++) {
Point p = new Point(j, i);
map.put(p, index);
index++;
}
rowcount = 1;
}
}
public int getIndex(Point point){
return map.get(point);
}
public static void main(String[] args) {
MapPointToIndex one = new MapPointToIndex(3, 3);
System.out.println(one.map);
}
}
Out:
{java.awt.Point[x=0,y=0]=8, java.awt.Point[x=2,y=2]=2, java.awt.Point[x=1,y=2]=1, java.awt.Point[x=2,y=1]=6, java.awt.Point[x=1,y=1]=5, java.awt.Point[x=2,y=0]=10, java.awt.Point[x=0,y=2]=0, java.awt.Point[x=1,y=0]=9, java.awt.Point[x=0,y=1]=4}