I'm a computer science student in Paris. In mathematics this year we have to use the K-means algorithm to solve a problem (the Clustered Capacitated Vehicle Routing Problem applied to the resupplying of self-service bicycle stations). Here is my algorithm:
public void run() {
    boolean hasConverged = false;
    List<Integer> nearestClusters = null;
    // A list used to check whether the nearestClusters list has changed;
    // if it has not, the algorithm has converged
    List<Integer> previousList = new ArrayList<Integer>();
    // Random initialization of the clusters' centroids
    for (int i = 0; i < clustersNumber; ++i) {
        clusters.add(ClusterGenerator.Generate(stationsList, colorList.get(i), latMin, latMax, lngMin, lngMax));
    }
    while (!hasConverged) {
        if (nearestClusters != null) {
            previousList.clear();
            previousList.addAll(nearestClusters);
        }
        nearestClusters = new ArrayList<Integer>();
        // Each point is assigned to its nearest cluster
        for (int j = 0; j < stationsList.size(); ++j) {
            nearestClusters.add(getIndexOfTheNearestCluster(stationsList.get(j)));
        }
        // Move each cluster's centroid to the center of the points assigned to it
        for (int k = 0; k < clusters.size(); ++k) {
            clusters.get(k).setCentre(stationsCenters(getStationsOfCluster(clusters.get(k), nearestClusters)));
        }
        if (!nearestClusters.isEmpty() && previousList.equals(nearestClusters))
            hasConverged = true;
    }
}
I wanted to display the result of my algorithm with the clusters it forms, and I found this project on the Internet: https://github.com/ertugrulozcan/K-Means-Simulation
I imported into my project the class ClusterGenerator, which creates clusters with random elements, the class Item, the class Graphic (which I did not modify) and the class MainWindow, which initializes all the graphical elements.
I have not managed to display the plot, and Eclipse reports no errors that could give me a clue.
Can someone please explain to me where the problem is?
Thanks
The problem was that my algorithm was generating clusters for the stations, but I had not configured the class Graphic (which, I understood later, is essential for the display) to render my points correctly. Since I used latitude and longitude as coordinates for my stations, I had to scale these coordinates to the window. Here is how I did that (using cross multiplication): I compute the "gap" between two units in the graph and add an adjustment because the coordinates don't start at zero.
double gapX = (this.getWidth() - 2 * edgeSpace) / (topX-bottomX+1);
int adjustmentX =(int) (-bottomX*gapX);
(getWidth() gives the actual width of the panel containing the graph, edgeSpace is the padding between the graph and the edge of the panel, topX is the maximum value of a coordinate and bottomX the minimum value.)
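To illustrate, a minimal sketch of how such a mapping could look for both axes; only the gapX/adjustmentX computation above comes from my code, the helper names toPixelX/toPixelY and the Y-axis handling are assumptions:

// Sketch only: maps a station's (lng, lat) to panel pixels with the same cross-multiplication idea.
private int toPixelX(double lng) {
    double gapX = (getWidth() - 2 * edgeSpace) / (topX - bottomX + 1); // pixels per unit of longitude
    int adjustmentX = (int) (-bottomX * gapX);                         // shift so bottomX lands at the left edge
    return (int) (lng * gapX) + adjustmentX + edgeSpace;
}

private int toPixelY(double lat) {
    double gapY = (getHeight() - 2 * edgeSpace) / (topY - bottomY + 1);
    int adjustmentY = (int) (-bottomY * gapY);
    // The Y axis is inverted in Swing: 0 is the top of the panel
    return getHeight() - ((int) (lat * gapY) + adjustmentY + edgeSpace);
}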
I am trying to build an OCR by calculating the correlation coefficient between characters extracted from an image and every character I have pre-stored in a database. My implementation is based on Java, and the pre-stored characters are loaded into an ArrayList at application startup, i.e.:
ArrayList<byte[]> storedCharacters, extractedCharacters;
storedCharacters = load_all_characters_from_database();
extractedCharacters = extract_characters_from_image();
// Calculate the coefficient between every extracted character
// and every character in the database.
double maxCorr = -1;
for (byte[] extractedCharacter : extractedCharacters)
    for (byte[] storedCharacter : storedCharacters)
    {
        double corr = findCorrelation(extractedCharacter, storedCharacter);
        if (corr > maxCorr)
            maxCorr = corr;
    }
...
...
public double findCorrelation(byte[] extractedCharacter, byte[] storedCharacter)
{
    double mag1 = 0, mag2 = 0, corr = 0;
    for (int i = 0; i < extractedCharacter.length; i++)
    {
        mag1 += extractedCharacter[i] * extractedCharacter[i];
        mag2 += storedCharacter[i] * storedCharacter[i];
        corr += extractedCharacter[i] * storedCharacter[i];
    } // for
    corr /= Math.sqrt(mag1 * mag2);
    return corr;
}
The number of extractedCharacters is around 100-150 per image, but the database holds 15600 stored binary characters. Computing the correlation coefficient between every extracted character and every stored character hurts performance: it takes around 15-20 seconds per image on an Intel i5 CPU.
Is there a way to improve the speed of this program, or can you suggest another approach that gives similar results? (The results produced by comparing every character against such a large dataset are quite good.)
Thank you in advance
UPDATE 1
public static void run() {
    ArrayList<byte[]> storedCharacters, extractedCharacters;
    storedCharacters = load_all_characters_from_database();
    extractedCharacters = extract_characters_from_image();
    // Calculate the coefficient between every extracted character
    // and every character in the database.
    computeNorms(storedCharacters, extractedCharacters);
    double maxCorr = -1;
    for (int extCharNo = 0; extCharNo < extractedCharacters.size(); extCharNo++)
        for (int strCharIndex = 0; strCharIndex < storedCharacters.size(); strCharIndex++)
        {
            double corr = findCorrelation(extractedCharacters.get(extCharNo),
                                          storedCharacters.get(strCharIndex),
                                          strCharIndex, extCharNo);
            if (corr > maxCorr)
                maxCorr = corr;
        }
}
private static double[] storedNorms;
private static double[] extractedNorms;

// Correlation between two binary images
public static double findCorrelation(byte[] arr1, byte[] arr2, int strCharIndex, int extCharNo) {
    final int dotProduct = dotProduct(arr1, arr2);
    final double corr = dotProduct * storedNorms[strCharIndex] * extractedNorms[extCharNo];
    return corr;
}

public static void computeNorms(ArrayList<byte[]> storedCharacters, ArrayList<byte[]> extractedCharacters) {
    storedNorms = computeInvNorms(storedCharacters);
    extractedNorms = computeInvNorms(extractedCharacters);
}

private static double[] computeInvNorms(List<byte[]> a) {
    final double[] result = new double[a.size()];
    for (int i = 0; i < result.length; ++i)
        result[i] = 1 / Math.sqrt(dotProduct(a.get(i), a.get(i)));
    return result;
}

private static int dotProduct(byte[] arr1, byte[] arr2) {
    int dotProduct = 0;
    for (int i = 0; i < arr1.length; i++)
        dotProduct += arr1[i] * arr2[i];
    return dotProduct;
}
Nowadays, it's hard to find a CPU with a single core (even in mobile phones). As the tasks are nicely independent, you can parallelize them with just a few lines. I'd go for it, though the gain is limited by the number of cores.
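For example, a minimal sketch of such a parallelization with Java 8 parallel streams, reusing the names from your code (assuming findCorrelation is reachable from this context):

// Sketch: each extracted character is matched against the database independently,
// so the outer loop can run in parallel.
double maxCorr = extractedCharacters.parallelStream()
        .mapToDouble(extracted -> {
            double best = -1;
            for (byte[] stored : storedCharacters) {
                double corr = findCorrelation(extracted, stored);
                if (corr > best)
                    best = corr;
            }
            return best;
        })
        .max()
        .orElse(-1);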
In case you really mean cross-correlation, a transform like the DFT or DCT could help. It surely does for big images, but with your 12x16 images, I'm not sure.
Maybe you mean just a dot product? And maybe you should tell us?
Note that you actually don't need to compute the correlation; most of the time you only need to find out whether it's bigger than a threshold:
corr = findCorrelation(extractedCharacter, storedCharacter)
..... more code to check if this is the best match ......
This may lead to some optimizations or not, depending on what the images look like.
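For instance, a small sketch of the early-exit idea, reusing findCorrelation from the question (assumed callable here); the chosen threshold is an assumption:

// Sketch: stop scanning the database as soon as a match above the threshold is found,
// instead of always searching for the global maximum.
static boolean hasMatch(byte[] extracted, List<byte[]> storedCharacters, double threshold) {
    for (byte[] stored : storedCharacters) {
        if (findCorrelation(extracted, stored) > threshold)
            return true; // good enough, no need to keep looking
    }
    return false;
}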
Note also that a simple low-level optimization can give you nearly a factor of 4, as in this question of mine. Maybe you really should tell us what you're doing?
UPDATE 1
I guess that, due to the computation of three products in the loop, there's enough instruction-level parallelism, so manual loop unrolling like in my question above is not necessary.
However, I see that those three products get computed some 100 * 15600 times, while only one of them depends on both extractedCharacter and storedCharacter. So you can compute
100 + 15600 + 100 * 15600
dot products instead of
3 * 100 * 15600
This way you may get a factor of three pretty easily.
Or not. After this step only a single sum gets computed in the inner loop, so the problem linked above applies, and so does its solution (manual unrolling).
Factor 5.2
While byte[] is nicely compact, the computation involves extending the bytes to ints, which costs some time, as my benchmark shows. Converting the byte[]s to int[]s before all the correlations get computed saves time. Even better, the conversion of storedCharacters can be done once, beforehand.
Manual loop unrolling twice helps but unrolling more doesn't.
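A minimal sketch of the two ideas combined (the byte[] to int[] conversion done up front, and the dot product unrolled twice); the exact speedup depends on the JIT and the hardware:

// Sketch: convert once, then use int[] everywhere (assumes even-length arrays,
// e.g. 12x16 = 192 pixels; otherwise handle the leftover element separately).
static int[] toInts(byte[] a) {
    int[] r = new int[a.length];
    for (int i = 0; i < a.length; i++)
        r[i] = a[i];
    return r;
}

// Dot product unrolled by two: the two partial sums can be computed independently.
static int dotProduct(int[] a, int[] b) {
    int s0 = 0, s1 = 0;
    for (int i = 0; i < a.length; i += 2) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    return s0 + s1;
}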
I'm doing K-means clustering on a 12-dimensional matrix. I managed to get the result as K sets of clusters. I want to show the result by plotting it in a 2D graph, but I can't figure out how to convert the 12-dimensional data into 2 dimensions.
Any suggestions on how I can do the conversion, or any alternative ways of visualizing the result? I tried Multidimensional Scaling for Java (MDSJ) but it did not work.
The K-means algorithm I'm using is from the Java Machine Learning Library: Clustering basics.
I would do Principal Component Analysis (probably the easiest of the multidimensional scaling algorithms). (By the way, PCA has nothing to do with K-means; it is a general method for dimensionality reduction.)
I assume variables are in columns, observations are in rows.
Standardize the data: convert the variables to z-scores. That means: from each cell, subtract the mean of the column and divide the result by the standard deviation of the column. That way you get zero mean and unit variance. The former is obligatory; the latter, I would say, is good to do. If you have unit variance, you calculate the eigenvectors from the covariance matrix; otherwise you have to use the correlation matrix, which in a way standardizes the data automatically. See this for an explanation.
Calculate the eigenvectors and eigenvalues of the covariance matrix. Sort the eigenvectors by their eigenvalues. (Many libraries already give you the eigenvectors sorted that way.)
Take the first two columns of the eigenvector matrix, multiply the original matrix (converted to z-scores) by them, and visualize the resulting data.
Using the colt library, you can do the following; it will be similar with other matrix libraries:
import cern.colt.matrix.DoubleMatrix1D;
import cern.colt.matrix.DoubleMatrix2D;
import cern.colt.matrix.doublealgo.Statistic;
import cern.colt.matrix.impl.DenseDoubleMatrix2D;
import cern.colt.matrix.impl.SparseDoubleMatrix2D;
import cern.colt.matrix.linalg.Algebra;
import cern.colt.matrix.linalg.EigenvalueDecomposition;
import hep.aida.bin.DynamicBin1D;

public class Pca {

    // to show matrix creation; it does not make much sense to calculate PCA on random data
    public static void main(String[] x) {
        double[][] data = {
            {2.0, 4.0, 1.0, 4.0, 4.0, 1.0, 5.0, 5.0, 5.0, 2.0, 1.0, 4.0},
            {2.0, 6.0, 3.0, 1.0, 1.0, 2.0, 6.0, 4.0, 4.0, 4.0, 1.0, 5.0},
            {3.0, 4.0, 4.0, 4.0, 2.0, 3.0, 5.0, 6.0, 3.0, 1.0, 1.0, 1.0},
            {3.0, 6.0, 3.0, 3.0, 1.0, 2.0, 4.0, 6.0, 1.0, 2.0, 4.0, 4.0},
            {1.0, 6.0, 4.0, 2.0, 2.0, 2.0, 3.0, 4.0, 6.0, 3.0, 4.0, 1.0},
            {2.0, 5.0, 5.0, 3.0, 1.0, 1.0, 6.0, 6.0, 3.0, 2.0, 6.0, 1.0}
        };

        DoubleMatrix2D matrix = new DenseDoubleMatrix2D(data);
        DoubleMatrix2D pm = pcaTransform(matrix);

        // print the first two dimensions of the transformed matrix - they capture most of the variance of the original data
        System.out.println(pm.viewPart(0, 0, pm.rows(), 2).toString());
    }

    /** Returns a matrix in the space of principal components; take the first n columns. */
    public static DoubleMatrix2D pcaTransform(DoubleMatrix2D matrix) {
        DoubleMatrix2D zScoresMatrix = toZScores(matrix);
        final DoubleMatrix2D covarianceMatrix = Statistic.covariance(zScoresMatrix);
        // compute eigenvalues and eigenvectors of the covariance matrix (flip needed since they are sorted ascending)
        final EigenvalueDecomposition decomp = new EigenvalueDecomposition(covarianceMatrix);
        // columns of Vs are eigenvectors = principal components = basis of the new space, ordered by decreasing variance
        final DoubleMatrix2D Vs = decomp.getV().viewColumnFlip();
        // eigenvalues: ev(i) / sum(ev) is the percentage of variance captured by the i-th column of Vs
        // final DoubleMatrix1D ev = decomp.getRealEigenvalues().viewFlip();
        // project the original matrix into the PCA space
        return Algebra.DEFAULT.mult(zScoresMatrix, Vs);
    }

    /**
     * Converts a matrix to a matrix of z-scores (by columns).
     */
    public static DoubleMatrix2D toZScores(final DoubleMatrix2D matrix) {
        final DoubleMatrix2D zMatrix = new SparseDoubleMatrix2D(matrix.rows(), matrix.columns());
        for (int c = 0; c < matrix.columns(); c++) {
            final DoubleMatrix1D column = matrix.viewColumn(c);
            final DynamicBin1D bin = Statistic.bin(column);
            if (bin.standardDeviation() == 0) { // use epsilon
                for (int r = 0; r < matrix.rows(); r++) {
                    zMatrix.set(r, c, 0.0);
                }
            } else {
                for (int r = 0; r < matrix.rows(); r++) {
                    double zScore = (column.get(r) - bin.mean()) / bin.standardDeviation();
                    zMatrix.set(r, c, zScore);
                }
            }
        }
        return zMatrix;
    }
}
You could also use weka. I would first load your data into weka, then run PCA using the GUI (under attribute selection). You will see which classes are called with which parameters, and you can then do the same thing from your code. The catch is that you will need to convert/wrap your matrix into the data format weka works with.
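As a rough, hedged sketch of what that might look like in code, assuming your matrix has already been written out as an ARFF file named data.arff (an assumed name); the PrincipalComponents class lives in weka's attributeSelection package, but the exact signature of transformedData differs between weka versions, so check it against the version you use:

import weka.attributeSelection.PrincipalComponents;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaPcaSketch {
    public static void main(String[] args) throws Exception {
        // data.arff: your 12-dimensional observations, one per row (assumed file name)
        Instances data = new DataSource("data.arff").getDataSet();

        PrincipalComponents pca = new PrincipalComponents();
        pca.buildEvaluator(data);

        // transformedData gives the observations in principal-component space;
        // take the first two attributes of each instance for a 2D plot
        Instances transformed = pca.transformedData(data);
        System.out.println(transformed);
    }
}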
A similar question has been discussed on Cross Validated. The basic idea is to find an appropriate projection that separates these clusters (e.g., with discproj in R) and then to plot the clusters in the new space.
In addition to what the other answers suggest, you should probably have a look at multidimensional scaling too.
I am not familiar with coordinate systems or much of the math dealing with these things at all. What I am trying to do is take a Point (x,y) and find its position in a 1-dimensional array such that it follows this mapping:
(0,2)->0 (1,2)->1 (2,2)->2
(0,1)->4 (1,1)->5 (2,1)->6
(0,0)->8 (1,0)->9 (2,0)->10
where the arrows show which value each coordinate should map to. Notice that an index is skipped after each row. I think it'll end up being a fairly trivial solution, but I can't find any similar questions and I haven't had any luck coming up with ideas myself. I do know the width and height of the 2-dimensional array. Thank you for any help!
My question is perhaps ambiguous or using the wrong terminology, my apologies.
I know that the coordinate (0,0) will be the bottom-left position. I also know that the top-left coordinate should be placed at index 0. One index is skipped after each row. The size of the coordinate system varies, but I know the number of rows and columns.
First step, flip the values upside down, keep points in tact:
(0,2)->8 (1,2)->9 (2,2)->10
(0,1)->4 (1,1)->5 (2,1)->6
(0,0)->0 (1,0)->1 (2,0)->2
You'll notice that y affects the output by a factor of 4 and x by a factor of 1.
Thus we get a very simple 4y + x.
Now to get back to the original, you'll notice the transformation is (x,y) <- (x,2-y) (that is, if we transform each point above with this transformation, we get the original required mapping).
So, substituting it into the equation, we get (2-y)*4 + x.
Now this is specific to 3x3, but I'm sure you'll be able to generalize it by replacing 2 and 4 by variables.
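As a sketch of that generalization (assuming rows and columns are known, with one index skipped after each row, as in the question):

// Sketch: generalizes (2 - y) * 4 + x. For a 3x3 grid, rows = 3 and columns = 3,
// so the row stride is columns + 1 = 4 (the +1 accounts for the skipped index).
static int toIndex(int x, int y, int rows, int columns) {
    return (rows - 1 - y) * (columns + 1) + x;
}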
If you want to reduce the dimension and avoid overlaps, you need a space-filling curve, for example a Morton curve. Your example looks like a Peano curve because it's a 3x3 matrix. These curves are difficult to calculate but have some nice properties. If you are just looking for self-avoiding curves, you can also create your own. Read here: http://www.fractalcurves.com/Root4Square.html.
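For reference, a minimal sketch of Morton (Z-order) encoding, which interleaves the bits of x and y; note this is a different ordering than the row-wise one asked about, so it only applies if you actually want a space-filling curve:

// Sketch: interleave the bits of x and y (Z-order / Morton code) for 16-bit coordinates.
static int mortonEncode(int x, int y) {
    int code = 0;
    for (int i = 0; i < 16; i++) {
        code |= (x & (1 << i)) << i;        // bit i of x goes to position 2*i
        code |= (y & (1 << i)) << (i + 1);  // bit i of y goes to position 2*i + 1
    }
    return code;
}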
I was beaten to the formula; here is a brute-force approach using a Map.
import java.awt.Point;
import java.util.HashMap;
import java.util.Map;

public class MapPointToIndex {

    private Map<Point, Integer> map;
    private int index, rowcount;

    public MapPointToIndex(int rows, int columns) {
        map = new HashMap<Point, Integer>();
        for (int i = rows - 1; i >= 0; i--) {
            index += rowcount;
            for (int j = 0; j < columns; j++) {
                Point p = new Point(j, i);
                map.put(p, index);
                index++;
            }
            rowcount = 1;
        }
    }

    public int getIndex(Point point) {
        return map.get(point);
    }

    public static void main(String[] args) {
        MapPointToIndex one = new MapPointToIndex(3, 3);
        System.out.println(one.map);
    }
}
Out:
{java.awt.Point[x=0,y=0]=8, java.awt.Point[x=2,y=2]=2, java.awt.Point[x=1,y=2]=1, java.awt.Point[x=2,y=1]=6, java.awt.Point[x=1,y=1]=5, java.awt.Point[x=2,y=0]=10, java.awt.Point[x=0,y=2]=0, java.awt.Point[x=1,y=0]=9, java.awt.Point[x=0,y=1]=4}
I just ran into an issue while trying to write a bitmap-manipulation algorithm for an Android device.
I have a 1680x128 pixel Bitmap and need to apply a filter to it. But this very simple piece of code actually took almost 15-20 seconds to run on my Android device (an Xperia ray with a 1 GHz processor).
So I tried to find the bottleneck, removed as many lines of code as possible, and ended up with the loop itself, which took almost the same time to run:
for (int j = 0; j < 128; j++) {
    for (int i = 0; i < 1680; i++) {
        Double test = Math.random();
    }
}
Is it normal for such a device to take so much time for a simple for loop with no expensive operations?
I'm very new to programming on mobile devices, so please excuse me if this question is stupid.
UPDATE: I got it faster now with some simpler operations.
But back to my main problem:
public static void filterImage(Bitmap img, FilterStrategy filter) {
    img.prepareToDraw();
    int height = img.getHeight();
    int width = img.getWidth();
    RGB rgb;
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            rgb = new RGB(img.getPixel(i, j));
            if (filter.isBlack(rgb)) {
                img.setPixel(i, j, 0);
            } else
                img.setPixel(i, j, 0xffffffff);
        }
    }
    return;
}
The code above is what I really need to run faster on the device (nearly immediately).
Do you see any optimization potential in it?
RGB is only a class that extracts the red, green and blue values, and the filter simply returns true if all three color components are below 100 or any other specified value.
The loop alone around img.getPixel(i,j) or setPixel takes 20 or more seconds. Are these really such expensive operations?
It may be because too many objects of type Double are being created; this increases heap usage and the device starts freezing. A way around it is:
double[] arr = new double[1680];
for (int j = 0; j < 128; j++) {
    for (int i = 0; i < 1680; i++) {
        arr[i] = Math.random(); // primitive double, no boxing
    }
}
First of all, Stephen C makes a good point: try to avoid creating a bunch of RGB objects.
Second, you can make a huge improvement by replacing your relatively expensive calls to getPixel with a single call to getPixels.
I did some quick testing and managed to cut the runtime to about 10% of the original. Try it out. This was the code I used:
int[] pixels = new int[height * width];
img.getPixels(pixels, 0, width, 0, 0, width, height);
for (int pixel : pixels) {
    // check the pixel
}
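A hedged sketch of how the whole filterImage method from the question might look with a single getPixels/setPixels pair; the inline threshold check is an assumption standing in for the original FilterStrategy, and the bitmap must be mutable for setPixels to work:

// Sketch: read all pixels once, filter them in memory, write them back once.
public static void filterImage(Bitmap img, int threshold) {
    int width = img.getWidth();
    int height = img.getHeight();
    int[] pixels = new int[width * height];
    img.getPixels(pixels, 0, width, 0, 0, width, height);
    for (int i = 0; i < pixels.length; i++) {
        int p = pixels[i];
        int r = (p >> 16) & 0xff;
        int g = (p >> 8) & 0xff;
        int b = p & 0xff;
        // "black" here means all three components below the threshold (assumption)
        pixels[i] = (r < threshold && g < threshold && b < threshold) ? 0 : 0xffffffff;
    }
    img.setPixels(pixels, 0, width, 0, 0, width, height);
}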
There is a disclaimer in the docs for Math.random() that might explain the performance; try creating an instance yourself rather than using the static version. I have highlighted the performance disclaimer in bold:
Returns a pseudo-random double n, where n >= 0.0 && n < 1.0. This method reuses a single instance of Random. This method is thread-safe because access to the Random is synchronized, but this harms scalability. Applications may find a performance benefit from allocating a Random for each of their threads.
Try creating your own random as a static field of your class to avoid synchronized access:
private static Random random = new Random();
Then use it as follows:
double r = random.nextDouble();
Also consider using float (random.nextFloat()) if you do not need double precision.
RGB is only a class that extracts the red, green and blue values, and the filter simply returns true if all three color components are below 100 or any other specified value.
One problem is that you are creating height * width instances of the RGB class, simply to test whether a single pixel is black. Replace that method with a static method call that takes the pixel to be tested as an argument.
More generally, if you don't know why some piece of code is slow ... profile it. In this case, the profiler would tell you that a significant amount of time is spent in the RGB constructor. And the memory profiler would tell you that large numbers of RGB objects are being created and garbage collected.
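A minimal sketch of that suggestion, assuming the filter's rule is "all three components below some threshold" as described in the question:

import android.graphics.Color;

// Sketch: a static test on the packed color int; no RGB object is allocated per pixel.
static boolean isBlack(int pixel, int threshold) {
    return Color.red(pixel) < threshold
            && Color.green(pixel) < threshold
            && Color.blue(pixel) < threshold;
}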