I was given the following interview question:
Suppose we have a line with M points on it. Define the distance of a subset of N points (N <= M) to be the minimum distance between any pair of points in the subset. Write an algorithm to find the maximum of this distance over all subsets of N points...
By this I mean: if we have the array {1, 2, 10} and N = 2, then the subset with the maximum distance is {1, 10}. My first thought was to generate all combinations of subsets and calculate the distance of each one, but the interviewer didn't like that because it would take too much time. Does anyone have a more time-efficient idea?
Sort the array. For N = 2, the answer is always the 1st element (smallest) together with the last element (largest), i.e. the subset {smallest, largest}, since a two-point subset's distance is just the gap between its two points, and that gap is maximized by the extremes.
Having things arranged along a line is often a tip-off that dynamic programming will work.
Work from left to right along the sorted points. At each point, work out, for k = 1..N, the best minimum distance of a set of k points among those seen so far that ends at that point. You can compute the answers for one point from the answers you have already worked out for points to its left: to find the answer for k points ending at the current point, consider each point to its left and take min(minimum distance for k-1 points ending at that point, distance from the current point to that point). Then take the maximum of these candidate values. A sketch of this DP follows.
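A minimal sketch of this DP in Java (my own illustration, not the interviewer's reference solution; it assumes integer coordinates, and the method name is made up):

import java.util.Arrays;

// best[i][k]: the largest achievable minimum pairwise distance over all
// k-point subsets whose rightmost point is points[i].
static int maxMinDistance(int[] points, int n) {
    Arrays.sort(points);
    int m = points.length;
    int[][] best = new int[m][n + 1];
    for (int i = 0; i < m; i++) {
        best[i][1] = Integer.MAX_VALUE; // a single point has no pair; treat as infinite
        for (int k = 2; k <= n; k++) {
            best[i][k] = -1; // -1 marks "no k-point subset ends here"
            for (int j = k - 2; j < i; j++) {
                if (best[j][k - 1] < 0) continue;
                int candidate = Math.min(best[j][k - 1], points[i] - points[j]);
                best[i][k] = Math.max(best[i][k], candidate);
            }
        }
    }
    int answer = -1;
    for (int i = 0; i < m; i++) answer = Math.max(answer, best[i][n]);
    return answer; // e.g. points = {1, 2, 10}, n = 2 gives 9, from {1, 10}
}

This runs in O(M^2 * N) time, a big improvement over enumerating all subsets.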
I have a 16x16x16 cubic matrix whose cells contain values in [0, k]. I would like to be able to list the largest possible cuboids inside that matrix in which every cell has the same value.
An iterative "expansion" algorithm could do the trick, but given there are 4096 cells, that would be way too expensive.
There are similar questions, but they only address a two-dimensional matrix.
I expect that by "cuboid" you mean it must be the same size in all 3 dimensions, i.e. a cube.
In that case, the size of the largest cuboid with maximal point (x,y,z) can be calculated from the sizes of the largest cuboids with maximal points (x-1,y,z), (x,y-1,z), (x,y,z-1), (x-1,y-1,z), (x-1,y,z-1), (x,y-1,z-1), and (x-1, y-1, z-1).
Just process the points in sum(x,y,z) order; then, if all those neighboring points have the same value as (x,y,z), largest_cuboid_size(x,y,z) = 1 + min(largest_cuboid_size(..for each neighbor with a smaller coordinate...)). A sketch follows.
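A minimal sketch of my own for the cube case, where size[x][y][z] is the edge length of the largest same-valued cube whose maximal corner is (x,y,z); a plain triple loop in increasing (x,y,z) visits every smaller-coordinate neighbor first, so no explicit sum ordering is needed:

// v is the 16x16x16 value matrix.
static int[][][] largestCubes(int[][][] v) {
    int n = v.length;
    int[][][] size = new int[n][n][n];
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            for (int z = 0; z < n; z++) {
                if (x == 0 || y == 0 || z == 0) { size[x][y][z] = 1; continue; }
                int val = v[x][y][z];
                // all 7 smaller-coordinate neighbors must share the value
                if (v[x-1][y][z] == val && v[x][y-1][z] == val && v[x][y][z-1] == val
                        && v[x-1][y-1][z] == val && v[x-1][y][z-1] == val
                        && v[x][y-1][z-1] == val && v[x-1][y-1][z-1] == val) {
                    int m = Math.min(Math.min(size[x-1][y][z], size[x][y-1][z]), size[x][y][z-1]);
                    m = Math.min(m, Math.min(Math.min(size[x-1][y-1][z], size[x-1][y][z-1]),
                            Math.min(size[x][y-1][z-1], size[x-1][y-1][z-1])));
                    size[x][y][z] = 1 + m;
                } else {
                    size[x][y][z] = 1;
                }
            }
    return size;
}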
EDIT: Since you do NOT require all sides to be the same length, you would need to keep track of multiple values per cell in order to use this method. For example, you could calculate, for each (width, height) pair, the maximum size of a same-valued box with that cross-section ending at the cell.
You can still calculate the entries for each cell from its neighbors, but there can be up to 256 entries per cell, so it's a longer process.
That's up to 16^5 (1048576) values that have to be calculated to solve the whole problem. Should take much less than a second, so maybe it's fast enough for your purpose.
I'm working on k-means clustering in Java. I don't see a problem in my code and it looks fine. However, there is something I don't understand.
Step 1:
Choose N centers (so there are N clusters).
Step 2:
Put each vector into the cluster with the nearest center, using Euclidean distance (||v1 - v2||).
Step 3:
Find the new mean (= center) of each cluster.
Step 4:
If the centers have moved significantly, go to step 2.
However, when I plot the total of the point-to-respective-center distances after each iteration, I can see that the total is not decreasing all the time (although it is decreasing in general and converging well).
The total distance after the 2nd iteration is always shorter than after the 1st, and is in fact the shortest; the total then increases slightly at the 3rd iteration and converges at the 4th or 5th iteration.
I believe I was told that it should always be decreasing. What's wrong: my algorithm (implementation), or my assumption about the total distance?
It must always be decreasing for the same seed.
Maybe your error is that you use Euclidean distance.
K-means does not minimize Euclidean distances.
This is a common misconception that even many professors get wrong. K-means minimizes the sum-of-squares, i.e., the sum of squared Euclidean distances. And no, this does not find the solution with the smallest Euclidean distances.
So make sure you are plotting SSQ everywhere. Remove all square roots from your code; they do not belong in k-means.
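For illustration, a minimal sketch of my own of the quantity to plot (the method name is made up; note there is no Math.sqrt anywhere):

// Sum of squared Euclidean distances (SSQ) of each point to its
// assigned center: the objective k-means actually minimizes.
static double ssq(double[][] points, double[][] centers, int[] assignment) {
    double total = 0.0;
    for (int i = 0; i < points.length; i++) {
        double[] c = centers[assignment[i]];
        for (int d = 0; d < points[i].length; d++) {
            double diff = points[i][d] - c[d];
            total += diff * diff; // squared distance, no square root
        }
    }
    return total;
}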
Additional comments:
Don't minimize variances (or equivalently, standard deviations), as tempting as it might be:
Minimizing the sum of squared distances is not equivalent to minimizing variances, but that hasn't stopped people from suggesting it as the proper objective for k-means.
It is easy to imagine why this could be a bad idea:
Imagine a single point that is almost mid-way (Euclidean) between two cluster centroids, both with the same variance before including the new point. Now imagine one of the clusters has a much larger membership of points than the other cluster. Let's say the new point is slightly closer to the one with the much larger membership. Adding the new point to the larger cluster, though correct because it is closer to that centroid, won't decrease its variance nearly as much as adding the new point to the other cluster with the much smaller membership.
If you are minimizing the proper objective function, but it still isn't decreasing monotonically, check that you aren't quantizing your centroid means:
This would happen, for example, if you are performing image segmentation with integer values that range in [0, 255] rather than float values in [0, 1], and you are forcing the centroid means to be uint8 datatypes.
Whenever the centroid means are found, they should then be used in the objective function as-is. If your algorithm is finding one value for centroid means (floats), but is then minimizing the objective with other values (byte ints), this could lead to unacceptable variations from the supposed monotonically decreasing objective.
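A minimal sketch of my own of this pitfall for a 1-D cluster (the commented-out rounding is the bug to avoid):

// Recompute the centroid mean of one cluster.
static double updateCenter(double[] clusterPoints) {
    double sum = 0.0;
    for (double p : clusterPoints) sum += p;
    double mean = sum / clusterPoints.length;
    // BUG to avoid: double mean = Math.round(sum / clusterPoints.length);
    // Quantizing the mean (e.g. to emulate uint8 pixel values) and then
    // plugging it into the SSQ can break the monotonic decrease.
    return mean; // use the float mean in the objective as-is
}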
I would like to use simulated annealing to find a local minimum of a single-variable polynomial function within some predefined interval. I would also like to try to find the global minimum of a quadratic function.
A derivative-free algorithm such as this is not the best way to tackle the problem, so this is only for study purposes.
While the algorithm itself is pretty straightforward, I am not sure how to efficiently select a neighbor in single- or n-dimensional space.
Let's say that I am looking for a local minimum of the function 2*x^3 + x + 1 over the interval [-0.5, 30], and assume that the interval is reduced to tenths of each number, e.g. {1.1, 1.2, 1.3, ..., 29.9, 30}.
What I would like to achieve is a balance between random walk and speed of convergence from the starting point to points with lower energy.
If I simply select a random number from the given interval every time, then there is no random walk and the algorithm might just circle around. If, on the contrary, the next point is selected by simply adding or subtracting 0.1 with equal probability, then the algorithm might turn into an exhaustive search, depending on the starting point.
How should I efficiently balance the simulated annealing neighbor search in single-dimensional and n-dimensional space?
So you are trying to find an n-dimensional point P' that is "randomly" near another n-dimensional point P; for example, at distance T. (Since this is simulated annealing, I assume that you will be decrementing T once in a while).
This could work:
// Returns a random displacement vector whose coordinates are
// independent Gaussians with standard deviation t.
double[] displacement(double t, int dimension, Random r) {
    double[] d = new double[dimension];
    for (int i = 0; i < dimension; i++) d[i] = r.nextGaussian() * t;
    return d;
}
The output is randomly distributed in all directions and centred on the origin (notice that r.nextDouble() would favour 45° angles and be centred at 0.5). You can vary the spread by increasing t as needed; roughly 95% of each coordinate will be within 2*t of the origin.
EDIT:
To generate a displaced point near a given one, you could modify it as
// Returns a new point near p: each coordinate of p is perturbed by an
// independent Gaussian with standard deviation t.
double[] displaced(double t, double[] p, Random r) {
    double[] d = new double[p.length];
    for (int i = 0; i < p.length; i++) d[i] = p[i] + r.nextGaussian() * t;
    return d;
}
You should use the same r for all calls (if you create a new Random() for each call, you risk getting the same displacements over and over).
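To show how this neighbor generator fits into an annealing loop, here is a minimal 1-D sketch of my own (not from the original answer) on the asker's example f(x) = 2*x^3 + x + 1 over [-0.5, 30]; the cooling schedule, constants, and method names are arbitrary choices:

import java.util.Random;

static double f(double x) { return 2 * x * x * x + x + 1; }

static double anneal() {
    Random r = new Random(); // one shared Random, as noted above
    double x = 15.0, best = x; // arbitrary starting point
    for (double t = 5.0; t > 1e-3; t *= 0.99) { // geometric cooling
        double candidate = x + r.nextGaussian() * t; // Gaussian neighbor
        candidate = Math.max(-0.5, Math.min(30.0, candidate)); // clamp to the interval
        double delta = f(candidate) - f(x);
        // always accept downhill moves; accept uphill with Boltzmann probability
        if (delta < 0 || r.nextDouble() < Math.exp(-delta / t)) x = candidate;
        if (f(x) < f(best)) best = x;
    }
    return best; // f is increasing on [-0.5, 30], so this should approach x = -0.5
}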
In "Numerical Recipes in C++" there is a chapter titled "Continuous Minimization by Simulated Annealing". In it we have:
A generator of random changes is inefficient if, when local downhill moves exist, it nevertheless almost always proposes an uphill move. A good generator, we think, should not become inefficient in narrow valleys; nor should it become more and more inefficient as convergence to a minimum is approached.
They then proceed to discuss a "downhill simplex method".
I have two ArrayLists of Double, each with over 200 elements:
1. latitudes
2. longitudes
Say I give a random test coordinate, e.g. (1.33, 103.4), in [latitude, longitude] format. Is there an algorithm to easily find the closest point, or do I have to brute-force it: calculate the hypotenuse to every possible point, then compare over 200 hypotenuses to return the closest one? Thanks.
Sort the array of points along one axis. Then, locate the point in the array closest to the required point along this axis and calculate the distance (using whatever metric is appropriate to the problem topology and scale).
Then, search along the array in both directions until the along-axis distance to the candidate points is greater than the best full distance found so far. The point with the shortest distance is the answer.
This can result in having to search the entire array, and is a form of Branch and bound constrained by the geometry of the problem. If the points are reasonably evenly distributed around the point you are searching for, then the scan will not require many trials.
Alternate spatial indices (like quad-trees) will give better results, but your small number of points would make the setup cost in preparing the index much larger than a simple sort. You will need to track the position changes caused by the sort as your other array will not be sorted the same way. If you change the data into a single array of points, then the sort will reorder entire points at the same time.
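Here is a minimal sketch of that scan in Java (my own illustration, not from the original answer). It assumes the two lists have been merged into a single array of {latitude, longitude} points, and it uses plain squared Euclidean distance, which is only a rough metric for geographic coordinates:

import java.util.Arrays;
import java.util.Comparator;

static double[] closest(double[][] points, double lat, double lon) {
    double[][] s = points.clone();
    Comparator<double[]> byLat = Comparator.comparingDouble(p -> p[0]);
    Arrays.sort(s, byLat); // sort by latitude
    // find the insertion index of the query latitude
    int idx = Arrays.binarySearch(s, new double[] {lat, lon}, byLat);
    if (idx < 0) idx = -idx - 1;
    double best = Double.POSITIVE_INFINITY; // best squared distance so far
    double[] bestPoint = null;
    int i = idx - 1, j = idx;
    while (i >= 0 || j < s.length) {
        // advance on whichever side has the smaller latitude gap
        boolean left = j >= s.length || (i >= 0 && lat - s[i][0] <= s[j][0] - lat);
        double[] p = left ? s[i--] : s[j++];
        double dLat = p[0] - lat;
        if (dLat * dLat > best) break; // every remaining point is farther
        double dLon = p[1] - lon;
        double d2 = dLat * dLat + dLon * dLon;
        if (d2 < best) { best = d2; bestPoint = p; }
    }
    return bestPoint; // e.g. closest(points, 1.33, 103.4)
}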
If your arrays are sorted, you can use binary search to find the position of the requested point in the array. After you find the index, you should check the four nearby points to find the closest one.
1) Suppose you have two sorted arrays, one longitude-wise and one latitude-wise.
2) You search the first one and find two nearby points.
3) Then you search the second one and find two more points.
4) Now you have two to four points (the results might intersect).
5) These points will form a square around the destination point.
6) Find the closest point.
It's not true that the point with the closest latitude (or longitude) value should be chosen when searching over the longitude (or latitude) axis: a point can lie on your latitude (or longitude) line yet be far away along the longitude (or latitude) axis.
So the safest way is to calculate all the distances and sort them.
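For completeness, a minimal brute-force sketch of my own (taking the minimum instead of sorting; with ~200 points this is effectively instant, and no square root is needed just to compare):

import java.util.List;

static int closestIndex(List<Double> latitudes, List<Double> longitudes,
                        double lat, double lon) {
    int best = -1;
    double bestD2 = Double.POSITIVE_INFINITY;
    for (int i = 0; i < latitudes.size(); i++) {
        double dLat = latitudes.get(i) - lat;
        double dLon = longitudes.get(i) - lon;
        double d2 = dLat * dLat + dLon * dLon; // squared distance is enough
        if (d2 < bestD2) { bestD2 = d2; best = i; }
    }
    return best; // index of the closest point
}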
I have one 2D line (it can be a curved line, with loops and so on) and multiple similar paths. I want to compare the first path with the rest and determine which one is the most similar (as a percentage, if possible).
I was thinking of transforming the paths into bitmaps and then using a library to compare the bitmaps, but that seems like overkill. In my case, I have only an uninterrupted path made of points, and no different colors or anything.
Can anyone help me?
Edit:
So the first line is the black one, and I compare all the other lines to it. I want a library or algorithm that can say: the red line is 90% accurate (because it has almost the same shape and is close to the black one); the blue line is 5% accurate (this percentage is made up for this example) because it has a similar shape, but it's smaller and not close to the black path.
So the criterion of similarity would be:
how close the lines are one to another
what shape do they have
how big they are
(color doesn't matter)
I know it's impossible to find a library that considers all of this. But the most important comparisons should be: do they have the same shape and size? The distance I can calculate on my own.
I can think of two measures to express the similarity between two lines: N (defined as straight line segments between points p0, p1, ..., pr) and M (with straight line segments between q0, q1, ..., qs). I assume that p0 and q0 are always closer than p0 and qs.
1) Area
Use the sum of the areas enclosed between N and M: the larger the total area, the more different N and M are.
To get N and M to form a closed shape you should connect p0 and q0 and pr and qs with straight line segments.
To be able to calculate the surface of the enclosed areas, introduce new points at the intersections between segments of N and M, so that you get one or more simple polygons without holes or self-intersections. The area of such a polygon is relatively straightforward to compute (search for "polygon area calculation" around on the web), sum the areas and you have your measure of (dis)similarity.
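For reference, here is a minimal sketch of my own for the area step: the shoelace formula for a simple polygon (the method name is made up):

// Area of a simple (non-self-intersecting) polygon given its vertices
// in order, as parallel arrays of x and y coordinates.
static double polygonArea(double[] xs, double[] ys) {
    double sum = 0.0;
    int n = xs.length;
    for (int i = 0; i < n; i++) {
        int j = (i + 1) % n; // next vertex, wrapping around to the first
        sum += xs[i] * ys[j] - xs[j] * ys[i];
    }
    return Math.abs(sum) / 2.0;
}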
2) Sampling
Take a predefined number (say, 1000) of sample points O that lie on N (either evenly spaced with respect to the entire line, or evenly spaced over each line segment of N). For each sample point o in O, calculate the distance to the closest corresponding point on M; the result is the sum of these distances.
Next, reverse the roles: take the sample points from M and calculate each closest corresponding point on N, and sum their distances.
Whichever of these two produces the smallest sum (they're likely not the same!) is the measure of (dis)similarity.
Note: To locate the closest corresponding point on M, locate the closest point for every straight line segment in M (which is simple algebra, google for "shortest distance between a point and a straight line segment"). Use the result from the segment that has the smallest distance to o.
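Here is a minimal sketch of my own of those two pieces (polylines represented as arrays of {x, y} points; method names are made up):

// Distance from point (px, py) to the segment (ax, ay)-(bx, by):
// project the point onto the segment's line and clamp to the segment.
static double pointSegmentDistance(double px, double py,
                                   double ax, double ay, double bx, double by) {
    double dx = bx - ax, dy = by - ay;
    double len2 = dx * dx + dy * dy;
    double t = (len2 == 0) ? 0 : ((px - ax) * dx + (py - ay) * dy) / len2;
    t = Math.max(0.0, Math.min(1.0, t)); // clamp the projection
    return Math.hypot(px - (ax + t * dx), py - (ay + t * dy));
}

// Sum over all sample points of the distance to the closest point on
// polyline m: the one-directional half of the measure described above.
static double directedDistanceSum(double[][] samples, double[][] m) {
    double total = 0.0;
    for (double[] o : samples) {
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < m.length; i++)
            best = Math.min(best, pointSegmentDistance(o[0], o[1],
                    m[i][0], m[i][1], m[i + 1][0], m[i + 1][1]));
        total += best;
    }
    return total;
}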
Comparison
Method 1 requires several geometric primitives (point, line segment, polygon) and operations on them (such as calculating intersection points and polygon areas) in order to implement. This is more work, but it produces a more robust result and is easier to optimize for lines consisting of lots of line segments.
Method 2 requires picking a "correct" number of sample points, which can be hard if the lines have alternating parts with little detail and parts with lots of detail (i.e., many line segments close together), and its implementation is likely to get (very) slow with a large number of sample points (matching every sample point against every line segment is a quadratic operation).
On the upside, it doesn't require a lot of geometric operations and is relatively easy to implement.