I have a 16x16x16 cubic matrix whose cells hold values in [0;k]. I would like to list the largest possible cuboids inside that matrix such that every cell in a cuboid has the same value.
An iterative "expansion" algorithm could do the trick, but given there are 4096 cells, that would be way too expensive to do.
There are similar questions, but they only address a two-dimensional matrix.
I expect that by "cuboid", you mean that it must be the same size in all 3 dimensions.
In that case, the size of the largest cuboid with maximal point (x,y,z) can be calculated from the sizes of the largest cuboids with maximal points (x-1,y,z), (x,y-1,z), (x,y,z-1), (x-1,y-1,z), (x-1,y,z-1), (x,y-1,z-1), and (x-1, y-1, z-1).
Just process the points in sum(x,y,z) order. If all those neighboring points have the same value as (x,y,z), then largest_cuboid_size(x,y,z) = 1 + min(largest_cuboid_size(..for each neighbor with a smaller coordinate...)); otherwise it is 1.
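The recurrence for equal-sided cuboids can be sketched in Java as follows. This is a minimal sketch, not a definitive implementation: the class name LargestCube and the method largestCubeSide are made up for illustration, and it visits cells in plain nested-loop order rather than sum(x,y,z) order, which should work equally well because every neighbor it reads has coordinates that are no larger.

```java
final class LargestCube {
    /** Returns the side length of the largest cube whose cells all hold one value. */
    static int largestCubeSide(int[][][] grid) {
        int nx = grid.length, ny = grid[0].length, nz = grid[0][0].length;
        // size[x][y][z]: side of the largest uniform cube whose maximal corner is (x, y, z).
        int[][][] size = new int[nx][ny][nz];
        int best = 0;
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                for (int z = 0; z < nz; z++) {
                    int s = 1; // a single cell is always a cube of side 1
                    if (x > 0 && y > 0 && z > 0) {
                        int v = grid[x][y][z];
                        int min = Integer.MAX_VALUE;
                        boolean same = true;
                        // m enumerates the 7 neighbors with at least one smaller coordinate.
                        for (int m = 1; m < 8 && same; m++) {
                            int px = x - (m & 1);
                            int py = y - ((m >> 1) & 1);
                            int pz = z - ((m >> 2) & 1);
                            same = grid[px][py][pz] == v;
                            min = Math.min(min, size[px][py][pz]);
                        }
                        if (same) {
                            s = 1 + min;
                        }
                    }
                    size[x][y][z] = s;
                    best = Math.max(best, s);
                }
            }
        }
        return best;
    }
}
```

Calling LargestCube.largestCubeSide(grid) on a 16x16x16 array touches each of the 4096 cells once and reads 7 neighbors per cell.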
EDIT: Since you do NOT require all sides the same length, you would need to keep track of multiple data per cell in order to use this method. For example, you could calculate the maximum box size for each (width,height).
You can still calculate the entries for each cell from its neighbors, but there can be up to 256 entries per cell, so it's a longer process.
That's up to 16^5 (1048576) values that have to be calculated to solve the whole problem. Should take much less than a second, so maybe it's fast enough for your purpose.
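One way to realize the "entry per (width,height)" idea is sketched below. It is an assumption-laden variant rather than the exact scheme described above: instead of deriving each cell's table from neighboring cells' tables, it first records run lengths along z and then fills the (width,height) table of each cell incrementally. The names LargestCuboid, largestVolume, run and depth are hypothetical, and the sketch only reports the best volume instead of listing the cuboids.

```java
final class LargestCuboid {
    /** Returns the volume of the largest axis-aligned box whose cells all hold one value. */
    static int largestVolume(int[][][] grid) {
        int nx = grid.length, ny = grid[0].length, nz = grid[0][0].length;
        // run[x][y][z]: number of consecutive equal cells in column (x, y) ending at z.
        int[][][] run = new int[nx][ny][nz];
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                for (int z = 0; z < nz; z++) {
                    boolean extend = z > 0 && grid[x][y][z] == grid[x][y][z - 1];
                    run[x][y][z] = extend ? run[x][y][z - 1] + 1 : 1;
                }
            }
        }
        int best = 0;
        // depth[w][h]: deepest uniform box of footprint w x h whose maximal corner
        // is the cell currently being processed (0 if no such box exists).
        int[][] depth = new int[nx + 1][ny + 1];
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                for (int z = 0; z < nz; z++) {
                    int v = grid[x][y][z];
                    for (int w = 1; w <= x + 1; w++) {
                        for (int h = 1; h <= y + 1; h++) {
                            // The only column not covered by the (w-1, h) and (w, h-1)
                            // footprints is the far corner column.
                            int cx = x - w + 1, cy = y - h + 1;
                            int d = grid[cx][cy][z] == v ? run[cx][cy][z] : 0;
                            if (w > 1) d = Math.min(d, depth[w - 1][h]);
                            if (h > 1) d = Math.min(d, depth[w][h - 1]);
                            depth[w][h] = d;
                            best = Math.max(best, w * h * d);
                        }
                    }
                }
            }
        }
        return best;
    }
}
```

For a 16x16x16 input the innermost statement runs at most 16^5 times, which matches the estimate above.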
Related
I have a rather interesting problem: I'm given an input list of points in 3D space and I'm required to output a collection of combinations of these points using the factorial combination equation below:
C(n, r) = n! / (r! (n - r)!)
where n is the size of the input list of points, and r is the combination length.
For the output, I'm required to produce a list of lists with the sub-list containing the chosen points (size of each sublist being r, and the size of the parent list is the output of 'n choose r').
The problem is that given large enough values of n and r, I start running into the Integer.MAX_VALUE size limitation for lists in Java. E.g. an input list of size 200 with an 'r' value of 5 gives about 2.5 billion combinations, which is already above the max list size.
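To check the arithmetic, the count can be computed without overflow using java.math.BigInteger. This is only an illustrative sketch; the class name CombinationCount and the method choose are invented here.

```java
import java.math.BigInteger;

final class CombinationCount {
    /** Returns n choose r as an exact BigInteger. */
    static BigInteger choose(int n, int r) {
        if (r < 0 || r > n) {
            return BigInteger.ZERO;
        }
        r = Math.min(r, n - r); // symmetry keeps the loop short
        BigInteger result = BigInteger.ONE;
        for (int i = 1; i <= r; i++) {
            // After this step result equals C(n - r + i, i), so the division is exact.
            result = result.multiply(BigInteger.valueOf(n - r + i))
                           .divide(BigInteger.valueOf(i));
        }
        return result;
    }
}
```

CombinationCount.choose(200, 5) gives 2535650040, which exceeds Integer.MAX_VALUE (2147483647).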
One way I've thought of to get around this is to split the input list into manageable chunks before I pass it to the combinatorial function:
// inputPoints is a List<Point>
// Split the input so that each sublist holds at most, say, 100 points.
List<List<Point>> inputSplits = Helper.splitInputList(inputPoints);
List<List<List<Point>>> outputSplit = new ArrayList<>();
for (var inputSplit : inputSplits) {
    // Each result is a List whose size stays below Integer.MAX_VALUE.
    outputSplit.add(getCombinations(inputSplit));
}
This can work but is inelegant. I've also thought of using linked lists (which apparently don't have a size limit) but haven't looked into the pros and cons of that just yet.
Are there any other ways this could be tackled? I'm required to produce all possible combination outputs (they don't need to be ordered).
You can try a java.util.LinkedList, which has (in theory) limitless size.
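A different technique from the LinkedList suggestion, offered here only as a sketch, is to avoid storing the combinations at all and hand them out one at a time through an Iterator, so that no list ever needs more than r elements. The class name Combinations is hypothetical and is not part of any library; whether streaming fits depends on whether the consumer really needs the whole list of lists in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class Combinations<T> implements Iterable<List<T>> {
    private final List<T> items;
    private final int r;

    Combinations(List<T> items, int r) {
        if (r < 0 || r > items.size()) {
            throw new IllegalArgumentException("need 0 <= r <= n");
        }
        this.items = items;
        this.r = r;
    }

    @Override
    public Iterator<List<T>> iterator() {
        return new Iterator<List<T>>() {
            // idx holds the positions of the current combination, in increasing order.
            private final int[] idx = firstIndices();
            private boolean more = true;

            @Override
            public boolean hasNext() {
                return more;
            }

            @Override
            public List<T> next() {
                if (!more) {
                    throw new NoSuchElementException();
                }
                List<T> out = new ArrayList<>(r);
                for (int i : idx) {
                    out.add(items.get(i));
                }
                advance();
                return out;
            }

            // Moves idx to the next combination in lexicographic order.
            private void advance() {
                int n = items.size();
                int i = r - 1;
                while (i >= 0 && idx[i] == n - r + i) {
                    i--;
                }
                if (i < 0) {
                    more = false;
                    return;
                }
                idx[i]++;
                for (int j = i + 1; j < r; j++) {
                    idx[j] = idx[j - 1] + 1;
                }
            }
        };
    }

    private int[] firstIndices() {
        int[] idx = new int[r];
        for (int i = 0; i < r; i++) {
            idx[i] = i;
        }
        return idx;
    }
}
```

A caller could then write for (List<Point> combo : new Combinations<Point>(inputPoints, 5)) { ... } and process each combination as it is produced.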
I'm working on k-means clustering in Java. I don't see a problem in my code and it looks fine. However, there is something I don't understand.
Step 1:
Choose N centers. (Let there be N clusters.)
Step 2:
Put each vector into cluster with nearest center using Euclidean distance. (||v1 - v2||)
Step 3:
Find new mean (=center) for each cluster
Step 4:
If the centers have moved significantly, go to step 2
However, when I plot the total of the point-to-respective-center distances after each iteration, I can see that the total is not decreasing all the time (although it is decreasing in general and converging well).
The total distance of the 2nd iteration is always shorter than that of the first, and is the shortest overall. The total distance then increases slightly at the 3rd iteration and converges at the 4th or 5th iteration.
I believe I was told that it should always be decreasing. What's wrong? My algorithm (implementation) or my assumption about the total distance?
It must always be decreasing for the same seed.
Maybe your error is that you use Euclidean distance.
K-means does not minimize Euclidean distances.
This is a common misconception that even half of the professors get wrong. K-means minimizes the sum-of-squares, i.e., the sum of squared Euclidean distances. And no, this does not find the solution with the smallest Euclidean distances.
So make sure you are plotting SSQ everywhere. Remove all square roots from your code. They do not belong in k-means.
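As a rough illustration of plotting SSQ rather than plain distances, the sketch below runs the four steps from the question and records the sum of squared distances after each update. The names KMeansSsq, sqDist, ssq and run are assumptions made for this example, the caller supplies the initial centers, and the centers array is modified in place.

```java
final class KMeansSsq {
    /** Squared Euclidean distance: note that no square root is taken. */
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    /** Sum of squared distances from each point to its assigned center. */
    static double ssq(double[][] points, double[][] centers, int[] assign) {
        double total = 0;
        for (int i = 0; i < points.length; i++) {
            total += sqDist(points[i], centers[assign[i]]);
        }
        return total;
    }

    /** Runs the given number of iterations and returns the SSQ after each one. */
    static double[] run(double[][] points, double[][] centers, int iterations) {
        int k = centers.length, dim = points[0].length;
        int[] assign = new int[points.length];
        double[] history = new double[iterations];
        for (int it = 0; it < iterations; it++) {
            // Step 2: assign each point to the center with the smallest squared distance.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (sqDist(points[i], centers[c]) < sqDist(points[i], centers[best])) {
                        best = c;
                    }
                }
                assign[i] = best;
            }
            // Step 3: move each non-empty cluster's center to the mean of its points.
            double[][] sum = new double[k][dim];
            int[] count = new int[k];
            for (int i = 0; i < points.length; i++) {
                count[assign[i]]++;
                for (int d = 0; d < dim; d++) {
                    sum[assign[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) {
                    for (int d = 0; d < dim; d++) {
                        centers[c][d] = sum[c][d] / count[c];
                    }
                }
            }
            history[it] = ssq(points, centers, assign);
        }
        return history;
    }
}
```

Plotting the returned history should give a curve that never rises (up to floating-point rounding), whereas the same plot with Math.sqrt applied per point carries no such guarantee.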
Additional comments:
Don't minimize variances (or equivalently, standard deviations), as tempting as it might be:
Minimizing sum of squared distances is not equivalent to minimizing variances, but that hasn't stopped people from suggesting it as the proper objective for k-means.
It is easy to imagine why this could be a bad idea:
Imagine a single point that is almost mid-way (Euclidean) between two cluster centroids, both with the same variance before including the new point. Now imagine one of the clusters has a much larger membership of points than the other cluster. Let's say the new point is slightly closer to the one with the much larger membership. Adding the new point to the larger cluster, though correct because it is closer to that centroid, won't decrease its variance nearly as much as adding the new point to the other cluster with the much smaller membership.
If you are minimizing the proper objective function, but it still isn't decreasing monotonically, check that you aren't quantizing your centroid means:
This would happen, for example, if you are performing image segmentation with integer values that range in [0, 255] rather than float values in [0, 1], and you are forcing the centroid means to be uint8 datatypes.
Whenever the centroid means are found, they should then be used in the objective function as-is. If your algorithm is finding one value for centroid means (floats), but is then minimizing the objective with other values (byte ints), this could lead to unacceptable variations from the supposed monotonically decreasing objective.
I had an interview question as below:
Suppose we have a line and M points on this line. Define the distance of a subset of N points (N <= M) to be the minimum distance between any pair of points in the subset. Write an algorithm to find the maximum distance over all subsets of N points...
By this I mean, if we have an array {1,2,10}, and N=2, then the subset with the maximum distance should be {1,10}. My first thought was to get all the combinations of subset and calculate the distance of each one, but the interviewer didn't like it because it would take too much time. Does anyone have a time efficient idea?
You need to sort the array; the first element is then the smallest and the last (M-th) element is the largest. For N = 2, the subset will always be {smallest, largest}.
Having things arranged along a line is often a tip-off that dynamic programming will work.
Work from left to right. At each point on the line, work out, for k = 1..N, the set of k points amongst those seen so far that ends at this point and has the largest minimum distance. You can work out the answers for one point from the answers you have already worked out for points to its left. To find the answer for k points, consider each point to its left and find min(minimum distance for k-1 points at that point, distance from current point to that point). Then take the maximum of these possible values.
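The dynamic program can be sketched as below, under the assumption that the coordinates are ints and that N >= 2 (with a single point there is no pair to measure). The class and method names are made up for illustration; the running time is O(M^2 N).

```java
import java.util.Arrays;

final class MaxMinDistance {
    /** Largest possible minimum pairwise distance among any n of the given points. */
    static long maxMinDistance(int[] points, int n) {
        if (n < 2 || n > points.length) {
            throw new IllegalArgumentException("need 2 <= n <= number of points");
        }
        int[] a = points.clone();
        Arrays.sort(a);
        int m = a.length;
        // prev[i]: best minimum distance of a (k-1)-subset whose rightmost point is a[i].
        long[] prev = new long[m];
        Arrays.fill(prev, Long.MAX_VALUE); // k = 1: no pair yet, so no constraint
        for (int k = 2; k <= n; k++) {
            long[] cur = new long[m];
            Arrays.fill(cur, -1); // -1 marks "no k-subset can end here"
            for (int i = k - 1; i < m; i++) {
                // Try every point j to the left as the previous member of the subset.
                for (int j = k - 2; j < i; j++) {
                    long candidate = Math.min(prev[j], (long) a[i] - a[j]);
                    cur[i] = Math.max(cur[i], candidate);
                }
            }
            prev = cur;
        }
        long best = -1;
        for (long v : prev) {
            best = Math.max(best, v);
        }
        return best;
    }
}
```

For the array {1,2,10} with N=2 from the question, this returns 9, the distance of the subset {1,10}.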
I have two ArrayLists of Double:
1. latitudes
2. longitudes
Each has over 200 elements.
Say I give random test coordinates, say (1.33, 103.4), in the format [latitude, longitude].
Is there any algorithm to easily find the closest point,
or do I have to brute-force it: calculate the hypotenuse to every point, then compare over 200 hypotenuses to return the closest point? Thanks.
Sort the array of points along one axis. Then, locate the point in the array closest to the required point along this axis and calculate the distance (using whatever metric is appropriate to the problem topology and scale).
Then, search along the array in both directions until the distance to these points is greater than the best result so far. The shortest distance point is the answer.
This can result in having to search the entire array, and is a form of Branch and bound constrained by the geometry of the problem. If the points are reasonably evenly distributed around the point you are searching for, then the scan will not require many trials.
Alternate spatial indices (like quad-trees) will give better results, but your small number of points would make the setup cost in preparing the index much larger than a simple sort. You will need to track the position changes caused by the sort as your other array will not be sorted the same way. If you change the data into a single array of points, then the sort will reorder entire points at the same time.
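The sort-then-scan idea can be sketched as follows. This is a minimal sketch under stated assumptions: the two lists are first merged into one array of {latitude, longitude} pairs so that the sort moves whole points, and the distance is the plain planar squared "hypotenuse" from the question rather than a great-circle distance. The names NearestPoint, sortByLatitude and nearest are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class NearestPoint {
    /** Merges the two lists into {lat, lon} pairs sorted by latitude. */
    static double[][] sortByLatitude(List<Double> lats, List<Double> lons) {
        double[][] pts = new double[lats.size()][];
        for (int i = 0; i < pts.length; i++) {
            pts[i] = new double[] {lats.get(i), lons.get(i)};
        }
        Arrays.sort(pts, Comparator.comparingDouble(p -> p[0]));
        return pts;
    }

    /** Returns the point closest to (lat, lon), or null if there are no points. */
    static double[] nearest(double[][] sorted, double lat, double lon) {
        // Binary search for the first point whose latitude is >= lat.
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid][0] < lat) lo = mid + 1; else hi = mid;
        }
        double[] best = null;
        double bestSq = Double.POSITIVE_INFINITY;
        int left = lo - 1, right = lo;
        // Scan outwards; stop a direction once the latitude gap alone exceeds the best distance.
        while (left >= 0 || right < sorted.length) {
            if (left >= 0) {
                double dLat = lat - sorted[left][0];
                if (dLat * dLat >= bestSq) {
                    left = -1;
                } else {
                    double sq = sqDist(sorted[left], lat, lon);
                    if (sq < bestSq) { bestSq = sq; best = sorted[left]; }
                    left--;
                }
            }
            if (right < sorted.length) {
                double dLat = sorted[right][0] - lat;
                if (dLat * dLat >= bestSq) {
                    right = sorted.length;
                } else {
                    double sq = sqDist(sorted[right], lat, lon);
                    if (sq < bestSq) { bestSq = sq; best = sorted[right]; }
                    right++;
                }
            }
        }
        return best;
    }

    private static double sqDist(double[] p, double lat, double lon) {
        double dLat = p[0] - lat, dLon = p[1] - lon;
        return dLat * dLat + dLon * dLon;
    }
}
```

The sort is done once; each query then costs a binary search plus however many neighbors survive the latitude cutoff, which in the worst case is still the whole array.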
If your arrays are sorted, you can use binary search to find the position of the requested point in each array. After you find the index, you should check the four nearby points to find the closest.
1) Suppose you have two sorted arrays, one sorted longitude-wise and one latitude-wise
2) You search the first one and find two nearby points
3) Then you search the second one and find two more points
4) Now you have from two to four points (results might intersect)
5) These points will form a square around the destination point
6) Find the closest point
It's not true that the closest lat (or long) value should be chosen to search along the long (or lat) axis; in fact, a point could sit on the same lat (or long) line but be far away along the long (or lat) axis.
So the best way is to calculate all distances and sort them.
I'm trying to sort d-dimensional data vectors by their Hilbert order, for bulk-loading a spatial index.
However, I do not want to compute the Hilbert value for each point explicitly, which in particular requires setting a particular precision. In high-dimensional data, this involves a precision such as 32*d bits, which becomes quite messy to do efficiently. When the data is distributed unevenly, some of these calculations are unnecessary, while extra precision is necessary for other parts of the data set.
Instead, I'm trying to do a partitioning approach. When you look at the 2D first order hilbert curve
1   4
|   |
2---3
I'd split the data along the x-axis first, so that the first part (not necessarily containing half of the objects!) will consist of 1 and 2 (not yet sorted) and the second part will have objects from 3 and 4 only. Next, I'd split each half again, on the Y axis, but reverse the order in 3-4.
So essentially, I want to perform a divide-and-conquer strategy (closely related to QuickSort - on evenly distributed data this should even be optimal!), and only compute the necessary "bits" of the hilbert index as needed. So assuming there is a single object in "1", then there is no need to compute the full representation of it; and if the objects are evenly distributed, partition sizes will drop quickly.
I do know the usual textbook approach of converting to long, gray-coding, dimension interleaving. This is not what I'm looking for (there are plenty of examples of this available). I explicitly want a lazy divide-and-conquer sorting only. Plus, I need more than 2D.
Does anyone know of an article or hilbert-sorting algorithm that works this way? Or a key idea how to get the "rotations" right, which representation to choose for this? In particular in higher dimensionalities... in 2D it is trivial; 1 is rotated +y, +x, while 4 is -y,-x (rotated and flipped). But in higher dimensionalities this gets more tricky, I guess.
(The result should of course be the same as when sorting the objects by their hilbert order with a sufficiently large precision right away; I'm just trying to save the time computing the full representation when not needed, and having to manage it. Many people keep a hashmap "object to hilbert number" that is rather expensive.)
Similar approaches should be possible for Peano curves and Z-curve, and probably a bit easier to implement... I should probably try these first (Z-curve is already working - it indeed boils down to something closely resembling a QuickSort, using the appropriate mean/grid value as virtual pivot and cycling through dimensions for each iteration).
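For reference, the lazy Z-curve variant described here might look roughly like the sketch below. It is not the author's code: it assumes non-negative int coordinates below 2^bits, so the "virtual pivot" is simply the current bit of the current dimension, and the names LazyZSort and sort are invented for this example. Dimension 0 is treated as the most significant within each level.

```java
final class LazyZSort {
    /** Sorts points (non-negative coordinates below 2^bits) into Z-order, in place. */
    static void sort(int[][] pts, int bits) {
        if (pts.length > 1) {
            sort(pts, 0, pts.length, bits - 1, 0);
        }
    }

    private static void sort(int[][] pts, int lo, int hi, int bit, int dim) {
        // A partition with fewer than two objects needs no further bits computed.
        if (hi - lo < 2 || bit < 0) {
            return;
        }
        // Partition on one bit of one dimension: objects with a 0 bit come first.
        int i = lo, j = hi - 1;
        while (i <= j) {
            if (((pts[i][dim] >>> bit) & 1) == 0) {
                i++;
            } else {
                int[] tmp = pts[i];
                pts[i] = pts[j];
                pts[j] = tmp;
                j--;
            }
        }
        // Cycle through the dimensions; move to the next lower bit after the last one.
        int nextDim = (dim + 1) % pts[lo].length;
        int nextBit = nextDim == 0 ? bit - 1 : bit;
        sort(pts, lo, i, nextBit, nextDim);
        sort(pts, i, hi, nextBit, nextDim);
    }
}
```

The result should match sorting by the fully interleaved Z-order key, but a partition that shrinks to a single object stops immediately, so the deeper bits for that object are never examined.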
Edit: see below for how I solved it for Z and peano curves. It is also working for 2D Hilbert curves already. But I do not have the rotations and inversion right yet for Hilbert curves.
Use radix sort. Split each 1-dimensional index into d .. 32 parts, each of size 1 .. 32/d bits. Then (from high-order bits to low-order bits) for each index piece compute its Hilbert value and shuffle objects to proper bins.
This should work well with both evenly and unevenly distributed data, both Hilbert ordering or Z-order. And no multi-precision calculations needed.
One detail about converting index pieces to Hilbert order:
first extract necessary bits,
then interleave bits from all dimensions,
then convert 1-dimensional indexes to inverse Gray code.
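The interleaving and inverse Gray code steps can be sketched as below. This covers only those two steps: the per-level rotation and reflection that a complete Hilbert index also needs is deliberately left out, and the names GrayCode, toGray, fromGray and interleaveBit are assumptions made for this example.

```java
final class GrayCode {
    /** Binary-reflected Gray code of n. */
    static int toGray(int n) {
        return n ^ (n >>> 1);
    }

    /** Inverse Gray code: recovers n from toGray(n) with a prefix XOR over all 32 bits. */
    static int fromGray(int g) {
        g ^= g >>> 16;
        g ^= g >>> 8;
        g ^= g >>> 4;
        g ^= g >>> 2;
        g ^= g >>> 1;
        return g;
    }

    /** Collects bit number `bit` of every coordinate; coords[0] ends up most significant. */
    static int interleaveBit(int[] coords, int bit) {
        int out = 0;
        for (int c : coords) {
            out = (out << 1) | ((c >>> bit) & 1);
        }
        return out;
    }
}
```

For one sorting step with one bit per dimension, GrayCode.fromGray(GrayCode.interleaveBit(coords, bit)) would give the bin number before any rotation is applied.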
If the indexes are stored in doubles:
If indexes may be negative, add some value to make everything positive and thus simplify the task.
Determine the smallest integer power of 2 that is greater than all the indexes, and divide all indexes by this value.
Multiply the index by 2^(necessary number of bits for current sorting step).
Truncate the result, convert it to integer, and use it for Hilbert ordering (interleave and compute the inverse Gray code)
Subtract the result, truncated on previous step, from the index: index = index - i
Coming to your variant of radix sort, I'd suggest extending zsort (to make hilbertsort out of zsort) with two binary arrays of size d (one used mostly as a stack, the other used to invert index bits) and the rotation value (used to rearrange dimensions).
If top value in the stack is 1, change pivotize(... ascending) to pivotize(... descending), and then for the first part of the recursion, push this top value to the stack, for second one - push the inverse of this value. This stack should be restored after each recursion. It contains the "decision tree" of last d recursions of radix sort procedure (in inverse Gray code).
After d recursions this "decision tree" stack should be used to recalculate both the rotation value and the array of inversions. The exact way how to do it is non-trivial. It may be found in the following links: hilbert.c or hilbert.c.
You can compute the Hilbert curve from f(x)=y directly without using recursion or L-systems or divide and conquer. Basically it's a Gray code or Hamiltonian path traversal. You can find a good description at Nick's spatial index hilbert curve quadtree blog or in the book Hacker's Delight. Or take a look at monotonic n-ary Gray code. I've written an implementation in PHP including a Moore curve.
I already answered this question (and others) but my answer(s) mysteriously disappeared. The Compact Hilbert Index implementation from http://code.google.com/p/uzaygezen/source/browse/trunk/core/src/main/java/com/google/uzaygezen/core/CompactHilbertCurve.java (method index()) already allows one to limit the number of Hilbert index bits computed up to a given level. Each iteration of the loop in that method computes a number of bits equal to the dimensionality of the space. You can easily refactor the for loop to compute just one level (i.e., a number of bits equal to the dimensionality of the space) at a time, going only as deep as needed to lexicographically compare two numbers by their Compact Hilbert Index.