mapreduce distance calculation in hadoop

mapreduce distance calculation in hadoop - java

Is there a distance calculation implementation using hadoop map/reduce. I am trying to calculate a distance between a given set of points.
Looking for any resources.
Edit
This is a very intelligent solution. I have tried some how like the first algorithm, and I get almost what I was looking for. I am not concerned about optimizing the program at the moment, but my problem was the dist(X,Y) function was not working. When I got all the points on the reducer, I was unable to go through all the points on an Iterator and calculate the distance. Someone on stackoverflow.com told me that the Iterator on hadoop is different than the normal JAVA Iterator, i am not sure about that. But if i can find a simple way to go through the Iterator on my dist() function, i can use your second algorithm to optimize.
//This is your code and I am refering to that code too, just to make my point clear.
map(x,y) {
for i in 1:N #number of points
emit(i, (x,y)) //i did exactly like this
reduce (i, X)
p1 = X[i]
for j in i:N
// here is my problem, I can't get the values from the Iterator.
emit(dist(X[i], X[j]))

you need to do a self join on that data set. In hive that would look like, more or less
select dist(P1.x,P1.y,P2.x, P2.y) from points P1 join points P2 on (True) where P1.x < P2.x or (P1.x = P2.x and P1.y < P2.y)
The function dist would need to be implemented using other hive functions or written in Java and added as a UDF. Also I am not sure about the True constant but you can write 0=0 to the same effect. The where clause is to avoid computing the same distances twice or 0 distances. The question is: would hive optimize this the way you can do programming carefully in hadoop? I am not sure. This is a sketch in hadoop
map(x,y) {
for i in 1:N #number of points
emit(i, (x,y))
reduce (i, X)
p1 = X[i]
for j in i:N
emit(dist(X[i], X[j]))
For this to work you need X to get to the reducer sorted in some order, for instance by x and then by y using secondary sort keys (that do not affect the grouping). This way every reducer gets a copy of all the points and works on a column of the distance matrix you are trying to generate. The memory requirements are minimal. You could trade some communication for memory by re-organizing the computation so that every reducer computes a square submatrix of the final matrix, knowing only two subsets of the points and calculating the distances among all of them. To achieve this, you need to make explicit the order of your points, say you are storing i, x, y
map(i,x,y) {
for j in 1:N/k #k is size of submatrix
emit((i/k, j), ("row", (x,y)))
emit((j, i/k), ("col", (x,y)))
reduce ((a,b), Z)
split Z in rows X and cols Y
for x in X
for y in Y
emit(dist(x,y))
In this case you can see that the map phase emits only 2*N*N/k points, whereas the previous algorithm emitted N^2. Here we have (N/k)^2 reducers vs N for the other one. Each reducer has to hold k values in memory (using the secondary key technique to have all the rows get to the reducer before all the columns), vs only 2 before. So you see there are tradeoffs and for the second algorithm you can use the parameter k for perf tuning.

This problem does not sound like a good fit for map-reduce since you're not really able to break it into pieces and calculate each piece independently. If you could have a separate program that generates the complete graph of your points as a list (x1,y1,x2,y2) then you could do a straightforward map to get the distance.

Related

How to divide Array Length into arrays with lengths that are multiples of 3

So I have a fairly large array that contains xyz coordinates, where array[0] = x0, array[1] = y0, array[2] = z0, array[3] = x1, array[4] = y1... and so on.
I'm running an algorithm on this array that is taking longer than I would like it to, and I want to split the work amongst threads. I have my threads set up, but I am not sure how to divide this array properly so I can distribute this work across 3 threads.
Even though I have an array length that is divisible by 3, this won't work, because splitting into 3 can split an xyz coordinate (for instance, if my array was size 15, dividing it by 3 will give me arrays of size 5, which means I'm splitting an XYZ coordinate.
How can I split this array (it doesn't have to necessarily be equal in size) so that I can distribute the work? (for instance, in the previous example, I would like to have two arrays of size 6 and one of size 3).
Note: The size of the array is variable, but is always divisible by 3.
EDIT: Sorry, should have mentioned that I'm working in Java. My algorithm iterates through a collection of coordinates and determines which coordinates lie inside of a particular 3d shape (such as an ellipsoid). It saves these coordinates and I perform other tasks with these coordinates (I'm working on a computer graphics app).
EDIT2: I'm going to elaborate on the algorithm a bit more.
Basically, I am working in Android OpenGL-ES-3.0. I have complex 3D-object with somewhere around 230000 vertices and close to a million triangles.
In the app, the user moves either a ellipsoid or box (they choose which one) to a location close to or on the object. After moving it, they click a button, which runs my algorithm.
The purpose of the algorithm is to determine which points from my object lie inside of the ellipsoid or box. These points are subsequently changed to a different color. To add to the complexity, however, is the fact that I have transformation matrices applied to both the points of the object and the points of the ellipsoid/box.
My current algorithm begins by iterating through all the points of the object. For those of you unclear on my iteration, this is my loop.
for(int i = 0; i < numberOfVertices*3;)
{
pointX = vertices[i];
i++;
pointY = vertices[i];
i++;
pointZ = vertices[i];
i++;
//consider transformations, then run algorithm
}
I perform the necessary steps to consider all my transformations, and after that is finished, I have a point from my object and the location of my ellipsoid/box centroid.
Then, depending on the shape, one of the following algorithms is used:
Ellipsoid: I use the centroid of the ellipse and apply the formula
(x−c)T RT A R(x−c) (sorry I don't know how to format that, I'll explain the formula). x is a column vector describing the xyz point from my object that I am on in my iteration. c is a column vector describing the xyz point of my centroid. T is supposed to mean transpose. R is my rotation matrix. A is a diagonal matrix with entries with entries (1/a^2, 1/b^2, 1/c^2), and I have values for a b and c. If this formula is > 1, then x lies outside of my ellipsoid and is not a valid point. If it is <=1, then I save x.
Box: I simply check if the point falls within a range. If the point of the object lies a certain distance in the X-direction, Y-direction, and Z-direction from the centroid, I save it.
These algorithms are accurate, and work as intended. The issue, is obviously efficiency. I don't seem to have a good understanding of what makes my app strain and what doesn't. I thought multi-threading would work, and I tried some of the techniques described but they didn't have a significant improvement on performance. If anyone has ideas on filtering out my search so I'm not iterating through all these points, it would help.

May I suggest a slightly different way to handle it. I know this isn't a direct answer to your question, but please consider it.
This could be easier to see if you implemented it as coordinate Objects, each with x, y and z values. Your "array" would now be 1/3 as long. You might think this would be less efficient--and you might be right--but you'd be surprised at how well java can optimize things. Often Java optimizes for the cases people use the most and your manually manipulating this array as you suggest is possibly even slower than using objects. Until you've proven the most readable design too slow you shouldn't optimize it.
Now you have a collection of coordinate objects. Java has queues that multiple threads can pull from efficiently. Dump all your objects into a queue and have each of your threads pull one and work on it by processing it and putting it in a "Completed" queue. Note that this gives you the ability to add or remove threads easily, without effecting your code except for 1 number. How would you take the array based solution to 4 or 6 threads?
Good luck

Here is a demo of the work explained below.
Observations
Each coordinate is 3 indexes.
You have 3 threads.
Let's say you have 17 coordinates, that's 51 indexes. You want to split the 17 coordinates among your 3 threads.
var arraySize = 51;
var numberOfThreads = 3;
var numberOfIndexesPerCoordinate = 3;
var numberOfCoordinates = arraySize / numberOfIndexesPerCoordinate; //17 coordinates
Now split that 17 coordinates among your threads.
var coordinatesPerThread = numberOfCoordinates / numberOfThreads; //5.6667
This isn't an even number, so you need to distribute unevenly. We can use Math.floor and modulo to distribute.
var floored = Math.floor(coordinatesPerThread); //5 - every thread gets at least 5.
var modulod = numberOfCoordinates % floored; // 2 - there will be 2 left that need to be placed sequentially into your thread pool
This should give you all the information you need. Without knowing what language you are using, I don't want to give any real code samples.
I see you edited your question to specify Java as your language. I'm not going to do the threading work for you, but I'll give a rough idea.
float[] coordinates = new float[17 * 3]; //17 coordinates with 3 indexes each.
int numberOfThreads = 3;
int numberOfIndexesPerCoordinate = 3;
int numberOfCoordinates = coordinates.length / numberOfIndexesPerCoordinate ; //coordinates * 3 indexes each = 17
//Every thread has this many coordinates
int coordinatesPerThread = Math.floor(numberOfCoordinates / numberOfThreads);
//This is the number of coordinates remaining that couldn't evenly be split.
int remainingCoordinates = numberOfCoordinates % coordinatesPerThread
//To make things easier, I'm just going to track the offset in the original array. It could probably be computed instead, but its just an int.
int offset = 0;
for (int i = 0; i < numberOfThreads; i++) {
int numberOfIndexes = coordinatesPerThread * numberOfIndexesPerCoordinate;
//If this index is one of the remainders, then increase by 1 coordinate (3 indexes).
if (i < remainingCoordinates)
numberOfIndexes += numberOfIndexesPerCoordinate ;
float[] dest = new float[numberOfIndexes];
System.arraycopy(coordinates, offset, dest, 0, numberOfIndexes);
offset += numberOfIndexes;
//Put the dest array of indexes into your threads.
}
Another, potentially better option would be to use a Concurrent Deque that has all of your coordinates, and have each thread pull from it as they need a new coordinate to work with. For this solution, you'd need to create Coordinate objects.
Declare a Coordinate object
public static class Coordinate {
protected float x;
protected float y;
protected float z;
public Coordinate(float x, float y, float z) {
this.x = x;
this.y = y;
this.z = z;
}
}
Declare a task to do your work, and pass it your concurrent deque.
public static class CoordinateTask implements Runnable {
private final Deque<Coordinate> deque;
public CoordinateTask(Deque<Coordinate> deque) {
this.deque = deque;
}
public void run() {
Coordinate coordinate;
while ((coordinate = this.deque.poll()) != null) {
//Do your processing here.
System.out.println(String.format("Proccessing coordinate <%f, %f, %f>.",
coordinate.x,
coordinate.y,
coordinate.z));
}
}
}
Here's the main method showing the example in action
public static void main(String []args){
Coordinate[] coordinates = new Coordinate[17];
for (int i = 0; i < coordinates.length; i++)
coordinates[i] = new Coordinate(i, i + 1, i + 2);
final Deque<Coordinate> deque = new ConcurrentLinkedDeque<Coordinate>(Arrays.asList(coordinates));
Thread t1 = new Thread(new CoordinateTask(deque));
Thread t2 = new Thread(new CoordinateTask(deque));
Thread t3 = new Thread(new CoordinateTask(deque));
t1.start();
t2.start();
t3.start();
}
See this demo.

Before trying to optimize with concurrency, try to minimize the amount of points you need to test, and minimize the cost of those tests, by using the most efficient collision detection methods at your disposal.
Some general suggestions:
Consider normalizing everything to a common frame of reference before running through your calculations. For example, instead of applying transformations to each point, transform the selection box/ellipsoid into the shape's coordinate system so you can perform your collision detection without the transformations within each iteration.
You may also be able to combine some or all of your transformations (rotation, translation, etc.) into a single matrix calculation, but that won't gain you much unless you're performing a lot of transformations, which you should try to avoid.
Generally speaking it's beneficial to keep the transformation pipeline as streamlined as possible, and keep all coordinate calculations in the same space to avoid transformations as much as possible.
Try to minimize the number of points you need to perform your slowest calculations on. The most accurate collision test should only be necessary for points that you can't rule out as being inside the shape by faster means, using an approximation of the shape, such as a collection of spheres, or the shape's convex hull. Simplifying the shape allows you to limit the slowest calculations to only those points that lie very close to your shape's actual bounds.
In my own 2D work in the past I found that even calculating the convex hulls for hundreds of complex animated shapes in real time was faster than doing collision detection directly without using their convex hulls, because they enable much faster collision calculations.
Consider calculating/storing additional information about the shape, such as an inner and outer collision sphere (one sphere inside all points, and one outside all points) which you can use as a fast initial filter. Anything inside the smaller sphere is guaranteed to be inside your shape, anything outside the outer sphere is known to be outside your shape. You might even want to store a simplified version of your shape, (or its convex hull), which you could calculate in advance and use to aid collision detection.
Similarly, consider using one or more spheres to approximate your ellipsoid in initial calculations, to minimize which points you need to test for collision.
Instead of calculating actual distances, calculate the squared distances and use those for comparison. However, prefer using faster tests for collision if possible. For example, for convex polygons you can use the Separating Axis Theorem, which projects vertices onto a common axis/plane to permit very quick overlap calculations.

How to efficiently select neighbour in 1-dimensional and n-dimensional space for Simulated Annealing

I would like to use Simulated Annealing to find local minimum of single variable Polynomial function, within some predefined interval. I would also like to try and find Global minimum of Quadratic function.
Derivative-free algorithm such as this is not the best way to tackle the problem, so this is only for study purposes.
While the algorithm itself is pretty straight-forward, i am not sure how to efficiently select neighbor in single or n-dimensional space.
Lets say that i am looking for local minimum of function: 2*x^3+x+1 over interval [-0.5, 30], and assume that interval is reduced to tenths of each number, e.g {1.1, 1.2 ,1.3 , ..., 29.9, 30}.
What i would like to achieve is balance between random walk and speed of convergence from starting point to points with lower energy.
If i simply select random number form the given interval every time, then there is no random walk and the algorithm might circle around. If, on the contrary, next point is selected by simply adding or subtracting 0.1 with the equal probability, then the algorithm might turn into exhaustive search - based on the starting point.
How should i efficiently balance Simulated Annealing neighbor search in single dimensional and n-dimensional space ?

So you are trying to find an n-dimensional point P' that is "randomly" near another n-dimensional point P; for example, at distance T. (Since this is simulated annealing, I assume that you will be decrementing T once in a while).
This could work:
double[] displacement(double t, int dimension, Random r) {
double[] d = new double[dimension];
for (int i=0; i<dimension; i++) d[i] = r.nextGaussian()*t;
return d;
}
The output is randomly distributed in all directions and centred on the origin (notice that r.nextDouble() would favour 45º angles and be centred at 0.5). You can vary the displacement by increasing t as needed; 95% of results will be within 2*t of the origin.
EDIT:
To generate a displaced point near a given one, you could modify it as
double[] displaced(double t, double[] p, Random r) {
double[] d = new double[p.length];
for (int i=0; i<p.length; i++) d[i] = p[i] + r.nextGaussian()*t;
return d;
}
You should use the same r for all calls (because if you create a new Random() for each you will keep getting the same displacements over and over).

In "Numerical Recepies in C++" there is a chapter titled "Continuous Minimization by Simulated Annealing". In it we have
A generator of random changes is inefficient if, when local downhill moves exist, it nevertheless almost always proposes an uphill move. A good generator, we think, should not become inefficient in narrow valleys; nor should it become more and more inefficient as convergence to a minimum is approached.
They then proceed to discuss a "downhill simplex method".

Java Math.abs vs Math.pow

So here's an odd question. I'm working on a kNN problem and need to find the nearest neighbor. I'm looking into distance, but once again, I don't care about the actual distance, just which one is closest. However, since distance can't be negative, I need to either square or take the absolute value of the distance.
So here are two options for how to accomplish this:
//note: it's been abstracted for multiple dimensions (not just x and y)
for(int i = 0; i < (numAttributes - 1); i++)
{
distance += Math.pow((a.value(i) - b.value(i)), 2);
}
and
//note: it's been abstracted for multiple dimensions (not just x and y)
for(int i = 0; i < (numAttributes - 1); i++)
{
distance += Math.abs(a.value(i) - b.value(i));
}
My question is which is faster. Since this is a data mining application, I want it to be able to process the information as quickly as possible. And while I understand, that in the guts, a power of two can be implemented with a shift, I'm not sure this is the case in such a high level language like Java where it gets translated for the JVM. Is there a reason why one is better than the other?

First, consider vectors A=[0,0,0], B=[1,1,1], C=[0,0,2]. Which one is closer to A? Is it B or C? Actually, caring about the distance measure is absolutely crucial in kNN. And we are only talking about Manhattan and euclidean distances. You could, by example, use the cosine similarity as well, and you should select the distance measure carefully, taking the into account the knowledge you have about your data.
Second, instead of such a low-level optimization, consider something smarter. Such as breaking your for(int i = 0; i < (numAttributes - 1); i++) loop as soon as too large distance is detected.
Third, using Math.pow(a,2) to compute a*a is definitely very inefficient.
Fourth, i < (numAttributes - 1)? Didn't you mean i < numAttributes??

Getting a gradient of a bi-variant function

I'm doing some video processing, for each frame I need to get a gradient of a bi-variate function.
The function is represented as a two dimensional array of doubles. Where the domain is the rows and columns indices and the range is the double value of the corresponding indices values. Or more simply put, the function f is defined for double[][] matrix as such:
f(x,y)=matrix[x][y]
I'm trying to use the Apache Commons Math library for it:
SmoothingPolynomialBicubicSplineInterpolator iterpolator = new SmoothingPolynomialBicubicSplineInterpolator();
BicubicSplineInterpolatingFunction f = iterpolator.interpolate(xs, ys, matrix.getData());
for (int i = 0; i < ans.length; i++) {
for (int j = 0; j < ans[0].length; j++) {
ans[i][j] = f.partialDerivativeY(i, j);
}
}
with xs, as a sorted array of the x indices (0,1,...,matrix.getRowDimension() - 1)
ys the same on the columns dimension (0,1,...,matrix.getColumnDimension() - 1)
The problem is that for a typical matrix in the size of 150X80 it takes as much as 1.4 seconds to run, which renders it completely irrelevant for my needs. So, as a novice user of this library, and programmatic numeric analysis in general, I want to know:
Am I doing something wrong?
Is there another, faster, way I can accomplish this task with?
Is there another open source library (preferably maven-friendly) that offers a solution?

Numerical differentiation is an entire topic unto itself, a simple google should bring up enough material for you to work with (just the wiki might be sufficient). There are parameters of your problem that I cannot know, so I can only speak broadly here, but there are direct methods of determining the gradient at a given point, i.e. ones that don't require an interpolation. See the wikipedia for the formulae (ranging from the simple f(x+1)-f(x), which is where h=1, to the higher order ones). Calculating the partial derivatives is then a simple O(NM) loop with an uber easy formula inside (no interpolation required).
The specifics can get gritty:
The higher order formulae need to be reduced for the edges, or
discarded altogether.
Your precise speed requirements might render more complex formulae useless (depending on the platform sometimes the lookup times for higher order formulae make them too slow; again, it depends on the cache etc.). This is easy to test, the formulae are simple; code them and benchmark.
The specific implementation is also dependent on your error requirements. The theory provides error bounds, so that will play a role in what formula you need; but again, there's a trade-off with speed requirements. The in turn can be practically lowered if you know specifics about the types of matrices you'll be processing, if such a thing is known.
The implementation can be made even easier (and maybe faster) if you have existing convolution tools, since this method is really just a convolution of the matrix (note; technically it's called a cross-correlation).

R - lm and r squared

I have been running through the questions on this site and others and I just want to make sure I understand that I am going this correctly and then I will want some advice to analyse the results.
I am exporting a m by n binary matrix from Java to R (using jri) and then I want to run lm() against an expected vector of 0s.
Here is the export function for getting the matrix into R
REXP x = re.eval("selectionArray <- c()");
for (int j = 0; j < currentSelection.length; j++){
boolean result = re.assign("currentSNPs", currentSelection[j]);
if (result == true){
x = re.eval("selectionArray <- rbind(selectionArray, currentSNPs)");
}
}
So then I want to execute the the lm() function to get the r squared values
x = re.eval("fm = lm(selectionArray ~ 0)");
I know that I need to use summary(fm) at this point to get the r squared values but I am not sure how to pull them out or what they mean at this point. I want to know the significance of the deviation from the expected 0 value at each column.
Thanks!

to extract the R^2 value from an 'lm' object named 'm'
summary(m)$r.squared
you can always view the structure of an object in R by using the str() function; in this situation you want str(summary(m))
However, it's not clear what you're trying to accomplish here. In the formula argument of the lm() function you specify selectionArray ~ 0, which doesn't make sense for two reasons: 1) as previously hinted at, a 0 on the right side of the formula corresponds to a model where your predictor variable is a vector of zeros and the beta coefficient corresponding to this predictor cannot be defined. 2) Your outcome, selectionArray, is a matrix. As far as I know, lm() isn't set up to have multiple outcomes.
Are you attempting to test the significance that each column of selectionArray differs from a 0? If so, ANY column with at least one success (1) in it is significantly different from a 0 column. If you're interested in the confidence intervals for the probability of success in each column, use the following code. Note that this does not adjust for multiple comparisons.
First let's start with a toy example to demonstrate the concept
v1 <- rbinom(100,size=1,p=.25)
#create a vector, length 100,
#where each entry corresponds to the
#result of a bernoulli trial with probability p
binom.test(sum(v1), n=length(v1), p = 0)
##let's pretend we didn't just generate v1 ourselves,
##we can use binom.test to determine the 95% CI for p
#now in terms of what you want to do...
#here's a dataset that might be something like yours:
selectionArray <- sapply(runif(10), FUN=function(.p) rbinom(100,size=1,p=.p))
#I'm just generating 10 vectors from a binomial distribution
#where each entry corresponds to 1 trial and each column
#has a randomly generated p between 0 and 1
#using a for loop
#run a binomial test on each column, store the results in binom.test.results
binom.test.results <- list()
for(i in 1:ncol(selectionArray)){
binom.test.results[[i]] <- binom.test(sum(selectionArray[,i]),
n=nrow(selectionArray), p=0)
}
#for loops are considered bad programming in r, so here's the "right" way to do it:
binom.test.results1 <- lapply(as.data.frame(selectionArray), function(.v){
binom.test(sum(.v), n=nrow(selectionArray), p = 0)
})
#using str() on a single element of binom.test.result will help you
#identify what results you'd like to extract from each test

I don't know much about Java, so I am not talking about that.
So, you 've got a matrix of 0 and 1 values and no other binary numbers?
And you want to know if the means of the columns are signif. different from 0?
That means, you should do hypothesis tests and not necessarily a regression. However a regression can be equivalent to such a test.
lm(y~0) does not make sense. If you only want an intercept you should use lm(y~1). However, that would be equivalent to a t-Test, which is not statistically correct.
I suspect it would be better to use fit<-glm(y~1,family=binomial) and than extract the p-value p<-summary(fit)$coef[4], but I am not a statistician.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.