I am trying to compute the best latitude/longitude pair for each of several locations.
I have a database of locations, and for each location I may have multiple coordinates. Most of these coordinates seem relevant to the location, as they lie within 5 meters of each other.
So I can derive a new (final) latitude/longitude pair by averaging them.
Sometimes, however, I have a point (occasionally more than one) that is located several hundred meters away from the others.
Given a small set (at most 10) of latitude/longitude points, I would like to find and keep only those points that make sense and discard those that are too far away from the others.
What approach/algorithm should I use?
Note: I work in Java.
Simple approach:
Compute the distance of all points to some arbitrary point.
Find the median distance of all points.
Discard all points where abs(dist - median) > threshold.
This is a bit more robust than the centroid approach below, which can be skewed by a few far-away points that happen to be clustered together. A sketch follows.
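For illustration, here is a minimal sketch of that filter, assuming a simple Point class with a distanceTo() method (e.g. haversine for lat/long) and a caller-chosen threshold; both are assumptions of this sketch, not part of the question.

import java.util.ArrayList;
import java.util.List;

// Keep only points whose distance to the reference point is close to the median distance.
static List<Point> filterByMedian(List<Point> points, Point reference, double threshold) {
    double[] dists = points.stream()
            .mapToDouble(p -> p.distanceTo(reference))
            .sorted()
            .toArray();
    double median = dists[dists.length / 2];   // adequate for small sets (up to ~10 points)
    List<Point> kept = new ArrayList<>();
    for (Point p : points) {
        if (Math.abs(p.distanceTo(reference) - median) <= threshold) kept.add(p);
    }
    return kept;
}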
The simplest approach is likely to be:
Find the centroid (average long/lat) point for a given set of points
Compute the distance from each point in the set to the centroid. Discard all points with a distance over a certain constant value (calling these points noise)
Recompute the centroid from the remaining non-noise points, call that the location.
This should be pretty simple to implement in Java, and it can certainly be done in O(N), N being the number of points in your set. A sketch of these steps follows.
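A minimal sketch of the three steps, assuming a simple Point value class with lat/lon fields and a distanceTo() method (both assumptions of this sketch; note also that naively averaging longitudes misbehaves near the antimeridian):

import java.util.ArrayList;
import java.util.List;

static Point robustCentre(List<Point> points, double noiseDistance) {
    Point centroid = average(points);               // step 1: centroid of all points
    List<Point> keep = new ArrayList<>();
    for (Point p : points) {                        // step 2: discard "noise" points
        if (p.distanceTo(centroid) <= noiseDistance) keep.add(p);
    }
    return average(keep);                           // step 3: recompute from the rest
}

static Point average(List<Point> points) {
    // assumes at least one point survives the noise filter
    double lat = 0, lon = 0;
    for (Point p : points) { lat += p.lat; lon += p.lon; }
    return new Point(lat / points.size(), lon / points.size());
}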
Your problem is a specific case of k-means clustering, in that you know which real-world data correspond to which samples, whereas in the general case you don't have that knowledge. So look into that problem and the associated approaches if you want to do more research.
There are a couple of questions you need to ask yourself:
Which point should be treated as "not making sense" if you have only two points, 100 meters apart?
Which point should be treated as "not making sense" if you have two separate clusters of points?
What should you do if you have a continuous chain of points, each within the margin of error of its closest neighbour, but spanning more than the limit in total?
The question you've asked is hard to answer without clear criteria, although I'd try looking through clustering algorithms.
If we set aside the problems I've mentioned, a computationally heavy but workable approach is:
calculating the distances between all points in the given set,
sorting the points by their sum of distances to all the others,
filtering out the one with the highest sum,
iterating until there are no points for which the sum of distances is greater than errorMargin * (N - 1), where N is the current number of points.
Still, you need to take the border cases into consideration, because, for instance, the problem mentioned in 1) would leave you with a single, essentially random point. I doubt you're OK with that, so you need to analyse your domain carefully.
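That said, a rough sketch of the iteration above, assuming a simple Point class and a distance() placeholder for whatever metric you use (e.g. haversine for lat/long):

import java.util.ArrayList;
import java.util.List;

static List<Point> filterOutliers(List<Point> points, double errorMargin) {
    List<Point> kept = new ArrayList<>(points);
    while (kept.size() > 1) {
        Point worst = null;
        double worstSum = -1;
        for (Point p : kept) {                      // O(N^2): sum distances to all others
            double sum = 0;
            for (Point q : kept) sum += distance(p, q);
            if (sum > worstSum) { worstSum = sum; worst = p; }
        }
        if (worstSum <= errorMargin * (kept.size() - 1)) break;  // all within margin
        kept.remove(worst);
    }
    return kept;
}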
If you are using Java 8, then the following code provides an elegant solution.
Collector<Location, ?, Location> centreCollector = new CentreCollector();
Location centre = locations.stream().collect(centreCollector);
centre = locations.stream().filter(centre.furtherThan(NOISE_DISTANCE).negate()).collect(centreCollector);
You have two things to create: the CentreCollector class, which implements Collector and averages Location objects as they are streamed to it; and the furtherThan method, which returns a Predicate comparing the distance between this and a given location to a given distance (negated above, since we want to keep the points that are not further than NOISE_DISTANCE). A sketch of the collector is below.
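For example, rather than a hand-written class, Collector.of can build the same thing; this is a minimal sketch assuming a simple Location value class with getLatitude()/getLongitude() and a (lat, lon) constructor, none of which is shown in the answer:

import java.util.stream.Collector;

// Averages latitudes and longitudes as Locations are streamed in.
// The double[3] accumulator holds {latSum, lonSum, count}.
Collector<Location, double[], Location> centreCollector = Collector.of(
    () -> new double[3],
    (acc, loc) -> { acc[0] += loc.getLatitude(); acc[1] += loc.getLongitude(); acc[2]++; },
    (a, b) -> { a[0] += b[0]; a[1] += b[1]; a[2] += b[2]; return a; },
    acc -> new Location(acc[0] / acc[2], acc[1] / acc[2])
);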
A slightly more elegant method would be to calculate the standard deviation of the distances to the centre and then discard any locations that are more than a certain number of standard deviations from the average distance. This would have the advantage of taking account of sets of locations in which all or most of the samples are more than the NOISE_DISTANCE from the centre. In that case the CentreCollector will have to return a more complex object that holds the location and statistical information and have furtherThan as a member of that class rather than of Location. Let me know in the comments if you want me to post the equivalent code for using standard deviations.
I use haifengl/smile and I need to get the optimal cluster number.
I am using CLARANS, where I need to specify the number of clusters to create. I think there may be a way to try, for example, 2 to 10 clusters, evaluate each result, and choose the number of clusters with the best score. How can this be done with the Elbow method?
The appropriate number of clusters, such that elements within a cluster are similar to each other and dissimilar to elements in other clusters, can be found by applying a variety of techniques, such as:
Gap Statistic: compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data.
Silhouette Method: the optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.
Sum of Squares (Elbow) method.
For more details, read the sklearn documentation on this subject.
The Elbow method is not automatic.
You compute the scores for the desired range of k, plot this, and then visually try to find an "elbow" - which may or may not work.
Because x and y have no "correct" relation to each other, beware that the interpretation of the plot (and any geometric attempt to automate this) depend on the scaling of the plot and are inherently subjective. In the end, the entire concept of an "elbow" likely is flawed and not sound in this form. I'd rather look for more advanced measures where you can argue for the maximum or minimum, although some notion of "significantly better k" would be desirable.
Ways to find clusters:
1- Silhouette method:
Using separation and cohesion, or simply an existing implementation, the optimal number of clusters is the one with the maximum average silhouette coefficient. The silhouette coefficient ranges over [-1, 1], where 1 is the best value.
Example of the silhouette method with scikit-learn.
2- Elbow method (the elbow method can be applied automatically)
The elbow method plots the number of clusters against the within-cluster sum of squared distances.
To apply it automatically in Python, there is a library, kneed, that detects the knee in a graph. Kneed Repository
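Since the question asks for Java, here is a hedged sketch (plain Java, not a smile-specific API) of the quantity the elbow plot needs: run your clustering for k = 2..10, compute the within-cluster sum of squares for each labelling, plot it against k, and pick the bend (the step kneed automates in Python).

// Within-cluster sum of squares for 2D points under a given cluster labelling.
static double wss(double[][] points, int[] labels, int k) {
    double[][] sums = new double[k][2];
    int[] counts = new int[k];
    for (int i = 0; i < points.length; i++) {       // accumulate per-cluster sums
        sums[labels[i]][0] += points[i][0];
        sums[labels[i]][1] += points[i][1];
        counts[labels[i]]++;
    }
    double total = 0;
    for (int i = 0; i < points.length; i++) {       // squared distance to own centroid
        double cx = sums[labels[i]][0] / counts[labels[i]];
        double cy = sums[labels[i]][1] / counts[labels[i]];
        double dx = points[i][0] - cx, dy = points[i][1] - cy;
        total += dx * dx + dy * dy;
    }
    return total;
}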
A coworker and I are looking for a method to calculate the maximum number of points in a 2D plane that can communicate, given the maximum communication distance, represented by "D". The code must be in Java, and each of the points must be an object with two coordinates, "X" and "Y", each represented as an int.
We found that if we select any of these points in the plane, it is possible to draw a circle of radius D around the selected point, such that all of the points contained inside this radius can communicate with the target point.
Then you can iterate to determine all the communications in the 2D plane at a given distance "D" and find the zone with the maximum number of communications between points (see the sketch below).
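A minimal sketch of that brute-force idea, assuming nothing beyond the int-coordinate Point objects the question requires:

class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

// For each point, count how many other points lie within communication distance d,
// and return the best count. O(n^2) pairwise comparison; no sqrt needed.
static int maxReachable(Point[] points, double d) {
    double d2 = d * d;
    int best = 0;
    for (Point p : points) {
        int count = 0;
        for (Point q : points) {
            if (p == q) continue;
            long dx = p.x - q.x, dy = p.y - q.y;
            if (dx * dx + dy * dy <= d2) count++;
        }
        best = Math.max(best, count);
    }
    return best;
}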
Given all of the above, my question is: is there an easier way to do this in Java?
A friend suggested we do it in C#, because it includes a library that facilitates this kind of representation through pointers and memory addresses, but doing it in Java is a hard requirement.
Any suggestions or a better way to approach our problem would be much appreciated.
I'm working on k-means clustering in Java. I don't see a problem in my code and it looks fine; however, there is something I don't understand.
Step 1:
Choose N centers (so there will be N clusters).
Step 2:
Put each vector into the cluster with the nearest center, using Euclidean distance (||v1 - v2||).
Step 3:
Find the new mean (= center) of each cluster.
Step 4:
If the centers have moved significantly, go to step 2.
However, when I plot the total of point-to-respective-center distances after each iteration, I can see that the total is not always decreasing (although in general it is decreasing and converging well).
The total distance at the 2nd iteration is always shorter than at the 1st, and it is the shortest overall. The total distance then increases slightly at the 3rd iteration and converges at the 4th or 5th iteration.
I was told it should always be decreasing. What's wrong: my algorithm (implementation), or my assumption about the total distance?
It must always be decreasing for the same seed.
Maybe your error is that you use Euclidean distance.
K-means does not minimize Euclidean distances.
This is a common misconception that even half of the professors get wrong. K-means minimizes the sum-of-squares, i.e., the sum of squared Euclidean distances. And no, this does not find the solution with the smallest Euclidean distances.
So make sure you are plotting SSQ everywhere. Remove all square roots from your code; they do not belong in k-means.
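A minimal sketch of the objective to plot, with no square roots anywhere (points and centers kept as plain arrays; this is an illustration, not your code):

// Sum of squared Euclidean distances of each point to its assigned center.
// This is the quantity that must decrease monotonically between iterations.
static double ssq(double[][] points, double[][] centers, int[] assignment) {
    double total = 0;
    for (int i = 0; i < points.length; i++) {
        double[] c = centers[assignment[i]];
        for (int d = 0; d < points[i].length; d++) {
            double diff = points[i][d] - c[d];
            total += diff * diff;                   // squared term, never sqrt'ed
        }
    }
    return total;
}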
Additional comments:
Don't minimize variances (or equivalently, standard deviations), as tempting as it might be:
Minimizing sum of squared distances is not equivalent to minimizing variances, but that hasn't stopped people from suggesting it as the proper objective for k-means.
It is easy to imagine why this could be a bad idea:
Imagine a single point that is almost mid-way (Euclidean) between two cluster centroids, both with the same variance before including the new point. Now imagine one of the clusters has a much larger membership of points than the other cluster. Let's say the new point is slightly closer to the one with the much larger membership. Adding the new point to the larger cluster, though correct because it is closer to that centroid, won't decrease its variance nearly as much as adding the new point to the other cluster with the much smaller membership.
If you are minimizing the proper objective function, but it still isn't decreasing monotonically, check that you aren't quantizing your centroid means:
This would happen, for example, if you are performing image segmentation with integer values that range in [0, 255] rather than float values in [0, 1], and you are forcing the centroid means to be uint8 datatypes.
Whenever the centroid means are found, they should then be used in the objective function as-is. If your algorithm is finding one value for centroid means (floats), but is then minimizing the objective with other values (byte ints), this could lead to unacceptable variations from the supposed monotonically decreasing objective.
I have two ArrayLists of type Double:
1. latitudes
2. longitudes
Each has over 200 elements.
Say I give random test coordinates, e.g. (1.33, 103.4); the format is [latitude, longitude].
Is there any algorithm to easily find the closest point, or do I have to brute-force it: calculate the distance to every possible point, find each hypotenuse, and then compare over 200 hypotenuses to return the closest point? Thanks.
Sort the array of points along one axis. Then, locate the point in the array closest to the required point along this axis and calculate the distance (using whatever metric is appropriate to the problem topology and scale).
Then, search along the array in both directions until the distance to these points is greater than the best result so far. The shortest distance point is the answer.
This can result in having to search the entire array, and is a form of Branch and bound constrained by the geometry of the problem. If the points are reasonably evenly distributed around the point you are searching for, then the scan will not require many trials.
Alternative spatial indices (like quad-trees) will give better results, but with your small number of points the setup cost of preparing the index would be much larger than a simple sort. Note that you will need to track the position changes caused by the sort, as your other array will not be sorted the same way; if you convert the data into a single array of points, the sort will reorder entire points at once. A sketch of the sort-and-scan approach is below.
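A hedged sketch of the sort-and-scan idea, treating the coordinates as a single double[][] of {lat, lon} pairs sorted by latitude, and using squared planar distance (swap in a geodesic metric if you need real-world accuracy):

import java.util.Arrays;
import java.util.Comparator;

// points must be pre-sorted by points[i][0] (latitude).
static double[] closest(double[][] points, double lat, double lon) {
    Comparator<double[]> byLat = Comparator.comparingDouble(p -> p[0]);
    int start = Arrays.binarySearch(points, new double[]{lat, lon}, byLat);
    if (start < 0) start = -start - 1;              // insertion point if not found
    double best = Double.POSITIVE_INFINITY;
    double[] bestPoint = null;
    for (int i = start; i < points.length; i++) {   // scan upward
        double gap = points[i][0] - lat;
        if (gap * gap > best) break;                // axis gap alone already too far
        double d = sq(points[i][0] - lat) + sq(points[i][1] - lon);
        if (d < best) { best = d; bestPoint = points[i]; }
    }
    for (int i = start - 1; i >= 0; i--) {          // scan downward
        double gap = lat - points[i][0];
        if (gap * gap > best) break;
        double d = sq(points[i][0] - lat) + sq(points[i][1] - lon);
        if (d < best) { best = d; bestPoint = points[i]; }
    }
    return bestPoint;
}

static double sq(double v) { return v * v; }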
If your arrays are sorted, you can use binary search to find the position of the requested point in the array. Once you have the index, check the nearby points to find the closest one.
1) Suppose you have two sorted arrays, one longitude-wise and one latitude-wise.
2) You search the first one and find two nearby points.
3) Then you search the second one and find two more points.
4) Now you have from two to four points (the results might intersect).
5) These points will form a square around the destination point.
6) Find the closest point.
It's not true that the closest lat (or long) value should be chosen to search along the long (or lat) axis; in fact, a point could sit on the same lat (or long) line yet be far away along the long (or lat) axis.
So the safest way is to calculate all the distances and sort them.
I have one 2D line (it can be a curved line, with loops and so on) and multiple similar paths. I want to compare the first path with the rest and determine which one is the most similar (as a percentage, if possible).
I was thinking maybe transforming the paths into bitmaps and then using a library to compare the bitmaps, but that seems like overkill. In my case, I have only an uninterrupted path, made of points, and no different colors or anything.
Can anyone help me?
Edit:
So the first line is the black one. I compare all other lines to it. I want a library or algorithm that can say: the red line is 90% accurate (because it has almost the same shape, and is close to the black one); the blue line is 5% accurate - this percentage is made up for this example... - because it has a similar shape, but it's smaller and not close to the black path.
So the criterion of similarity would be:
how close the lines are one to another
what shape do they have
how big they are
(color doesn't matter)
I know it's impossible to find a library that considers all this. But the most important comparisons should be: are they the same shape and size? The distance I can calculate on my own.
I can think of two measures to express similarity between two lines N (defined as straight line segments between points p0, p1, ..., pr) and M (with straight line segments between q0, q1, ..., qs). I assume that p0 and q0 are always closer than p0 and qs.
1) Area
Use the sum of the areas enclosed between N and M, where N and M are more different as the area gets larger.
To get N and M to form a closed shape you should connect p0 and q0 and pr and qs with straight line segments.
To be able to calculate the surface of the enclosed areas, introduce new points at the intersections between segments of N and M, so that you get one or more simple polygons without holes or self-intersections. The area of such a polygon is relatively straightforward to compute (search for "polygon area calculation" around on the web), sum the areas and you have your measure of (dis)similarity.
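For the per-polygon area step, the shoelace formula is the usual tool; a minimal sketch, with the polygon given as an array of {x, y} vertices:

// Area of a simple (non-self-intersecting) polygon via the shoelace formula.
static double polygonArea(double[][] poly) {
    double sum = 0;
    for (int i = 0; i < poly.length; i++) {
        double[] a = poly[i];
        double[] b = poly[(i + 1) % poly.length];   // wrap around to close the ring
        sum += a[0] * b[1] - b[0] * a[1];
    }
    return Math.abs(sum) / 2.0;
}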
2) Sampling
Take a predefined number (say, 1000) of sample points O that lie on N (either evenly spaced with respect to the entire line, or evenly spaced over each line segment of N). For each sample point o in O, calculate the distance to the closest corresponding point on M; the result is the sum of these distances.
Next, reverse the roles: take the sample points from M and calculate each closest corresponding point on N, and sum their distances.
Whichever of these two produces the smallest sum (they're likely not the same!) is the measure of (dis)similarity.
Note: To locate the closest corresponding point on M, locate the closest point for every straight line segment in M (which is simple algebra, google for "shortest distance between a point and a straight line segment"). Use the result from the segment that has the smallest distance to o.
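The point-to-segment primitive mentioned in that note, as a minimal self-contained sketch:

// Shortest distance from point (px, py) to segment (ax, ay)-(bx, by).
static double pointToSegment(double px, double py,
                             double ax, double ay, double bx, double by) {
    double dx = bx - ax, dy = by - ay;
    double len2 = dx * dx + dy * dy;
    if (len2 == 0) return Math.hypot(px - ax, py - ay);  // degenerate segment
    double t = ((px - ax) * dx + (py - ay) * dy) / len2; // projection parameter
    t = Math.max(0, Math.min(1, t));                     // clamp onto the segment
    return Math.hypot(px - (ax + t * dx), py - (ay + t * dy));
}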
Comparison
Method 1 requires several geometric primitives (point, line segment, polygon) and operations on them (such as calculating intersection points and polygon areas) in order to implement. This is more work, but it produces a more robust result and is easier to optimize for lines consisting of many line segments.
Method 2 requires picking a "correct" number of sample points, which can be hard if the lines alternate between parts with little detail and parts with lots of detail (i.e. many line segments close together), and its implementation is likely to get (very) slow with a large number of sample points (matching every sample point against every line segment is a quadratic operation). On the upside, it doesn't require many geometric operations and is relatively easy to implement.