OPTICS Clustering algorithm. How to get the best epsilon - java

I am implementing a project that needs to cluster geographical points. The OPTICS algorithm seems to be a very nice solution. It needs just 2 input parameters (MinPts and Epsilon), which are, respectively, the minimum number of points needed to consider them a cluster, and the distance value used to decide whether two points can be placed in the same cluster.
My problem is that, due to the extreme variety of the points, I can't set a fixed epsilon.
Just look at the image below.
The same point structure at a different scale would produce very different results. Suppose I set MinPts = 2 and epsilon = 1 km.
On the left, the algorithm would create 2 clusters (red and blue), but on the right it would create one single cluster containing all of the points (red). I would like to obtain 2 clusters on the right as well.
So my question is: is there any way to calculate the epsilon value dynamically to get this result?
EDIT 05 June 2012 3.15pm:
I thought I was using the OPTICS implementation from the javaml library, but it seems it is actually a DBSCAN implementation.
So the question now is: does anybody know a Java-based implementation of the OPTICS algorithm?
Thank you very much, and excuse my poor English.
Marco

The epsilon value in OPTICS is solely to limit the runtime complexity when using index structures. If you do not have an index for acceleration, you can set it to infinity.
To quote Wikipedia on OPTICS:
"The parameter ε is strictly speaking not necessary. It can be set to a maximum value. When a spatial index is available, it does however play a practical role when it comes to complexity."
What you seem to have looks much more like DBSCAN than OPTICS. In OPTICS, you should not need to choose epsilon (it should have been called max-epsilon by the authors!), but your cluster extraction method will take care of that. Are you using the Xi extraction proposed in the OPTICS paper?
minPts is much more important. You should try a value of at least 5 or 10, not 2. With 2, you are essentially performing single-linkage clustering!
The example you gave above should work fine once you increase minPts!
Re: edit: As you can even see in the Wikipedia article, ELKI has a proper OPTICS implementation and it's in Java.
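To illustrate why minPts matters more than epsilon, here is a minimal sketch of the core distance that OPTICS builds on, assuming 2-D points in planar coordinates and the convention that a point counts as its own first neighbour. The class and method names are made up for this example and are not taken from ELKI or any other library. With minPts = 2 the core distance is just the nearest-neighbour distance, which is why the result behaves like single linkage.

```java
import java.util.Arrays;

final class CoreDistanceSketch {
    /** Euclidean distance between two 2-D points. */
    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    /**
     * Core distance of points[i] with an unbounded epsilon: the distance to
     * its minPts-th nearest neighbour, counting the point itself
     * (assumed convention).
     */
    static double coreDistance(double[][] points, int i, int minPts) {
        double[] d = new double[points.length];
        for (int j = 0; j < points.length; j++) {
            d[j] = dist(points[i], points[j]);
        }
        Arrays.sort(d);       // d[0] == 0 is the point itself
        return d[minPts - 1];
    }
}
```

Because this value is derived from the data, it shrinks and grows with the scale of the point set, which a fixed epsilon cannot do.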

You can try to scale epsilon by the total size of the enclosing rectangle. For example, your left data is about 4km x 6km (using my Mark I eyeball to measure) and the right is about 2km x 2km. So, epsilon on the right should be about 2.5 times smaller.
Of course, this doesn't work reliably. If, on your right hand data, there were an additional single point 4km to the right and 2km down, that would make the enclosing rectangle for the right the same as on the left, and you'd get similar (wrong) results.
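A rough sketch of that scaling, assuming the points are already in planar coordinates (km, say, not raw lat/long) and using the diagonal of the bounding rectangle as the size measure. The `fraction` parameter is a hypothetical tuning knob, not something OPTICS or DBSCAN defines.

```java
final class ScaledEpsilon {
    /** Diagonal of the axis-aligned bounding rectangle of the points. */
    static double boundingDiagonal(double[][] points) {
        double minX = Double.POSITIVE_INFINITY, maxX = Double.NEGATIVE_INFINITY;
        double minY = Double.POSITIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (double[] p : points) {
            minX = Math.min(minX, p[0]);
            maxX = Math.max(maxX, p[0]);
            minY = Math.min(minY, p[1]);
            maxY = Math.max(maxY, p[1]);
        }
        return Math.hypot(maxX - minX, maxY - minY);
    }

    /** Epsilon as a fixed fraction of the bounding diagonal. */
    static double epsilonFor(double[][] points, double fraction) {
        return fraction * boundingDiagonal(points);
    }
}
```

For a 4 x 6 rectangle against a 2 x 2 one, the diagonals are about 7.21 and 2.83, which gives the factor of roughly 2.5 mentioned above. A single far-away outlier still inflates the diagonal, so the caveat about reliability applies to this sketch too.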

You can try building a minimum spanning tree and then removing its longest edge. That leaves two subtrees; the center of each one is the best center for OPTICS, and you can count the number of points around it.
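A minimal sketch of the spanning-tree idea, assuming 2-D planar points and exactly two wanted clusters. It uses Prim's algorithm in its simple O(n^2) form; the class name and the 0/1 labelling are choices made for this example.

```java
import java.util.Arrays;

final class MstSplit {
    /** Returns a label (0 or 1) per point by cutting the longest MST edge. */
    static int[] splitInTwo(double[][] pts) {
        int n = pts.length;
        int[] parent = new int[n];          // MST parent of each vertex
        double[] best = new double[n];      // length of the edge to that parent
        boolean[] inTree = new boolean[n];
        Arrays.fill(parent, -1);
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[0] = 0;                        // vertex 0 becomes the root
        for (int step = 0; step < n; step++) {
            int u = -1;                     // cheapest vertex not yet in the tree
            for (int v = 0; v < n; v++) {
                if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
            }
            inTree[u] = true;
            for (int v = 0; v < n; v++) {
                double d = Math.hypot(pts[u][0] - pts[v][0], pts[u][1] - pts[v][1]);
                if (!inTree[v] && d < best[v]) {
                    best[v] = d;
                    parent[v] = u;
                }
            }
        }
        // The longest MST edge is identified by its child endpoint.
        int cut = -1;
        for (int v = 1; v < n; v++) {
            if (cut == -1 || best[v] > best[cut]) cut = v;
        }
        // Every vertex whose path to the root passes through 'cut' gets label 1.
        int[] label = new int[n];
        for (int v = 0; v < n; v++) {
            for (int w = v; w != -1; w = parent[w]) {
                if (w == cut) {
                    label[v] = 1;
                    break;
                }
            }
        }
        return label;
    }
}
```

Since only the relative edge lengths matter, multiplying all coordinates by the same factor gives the same split, which is the scale independence asked for in the question.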

In your explanation above, it is the change in scale which creates the uncertainty. When your scale gets bigger, your epsilon should change accordingly. Because they are at two very different scales, the two images you've presented are NOT the same set of points. They will not respond identically to your OPTICS algorithm without changing the parameters.
In short, no: there is no way to dynamically calculate epsilon to get this result. Clustering like this is already NP-hard, and these clustering algorithms (OPTICS, k-means, Voronoi) can only approximate the optimal solution.

Related

Convert lat/long to US State

I have access to a list of lat/long coordinates, and I want to know (roughly) the US State these coordinates are located in. I can do with loss of precision, but I can't rely on external libraries or API. I can also add a database of locations in my code.
What is a reasonable way to do this?
I thought about 3 possibilities:
1. Represent each state by a single point at its center, then do a nearest-neighbour search
2. Represent each state by points located at cities in the state, then do a nearest-neighbour search (with many more points)
3. Represent each state by a simple bounding box, then use some algorithm to query which bounding box my point belongs to
What do you think is best? I would tend towards solution 3, but I can't find a list of coarse "bounding boxes" for US states.
I did a little searching and found a proper solution for what you are looking for, with a dataset of bounding boxes.
Answer on StackOverflow: LINK
Dataset: LINK
Algorithm to use(implement): LINK
So yes, the proper way to implement it is solution 3 with the given dataset.
Hope it helps :)
1. Will not work, consider
2. Has a high likelihood of not working for at least some states. Consider states with towns/cities clustered towards the middle, against states with towns/cities clustered towards the edge.
3. Will not work (these were supposed to be 90 degree angles, perfect squares, but drawing with a mouse is hard :) )
If you want to do this even vaguely accurately you will need some shape data which defines the boundaries between states. You will then need an algorithm which can determine whether a point is within an irregular polygon.
See List of the United States (US) state boundaries / borders as latitude/longitude pairs for geofence?
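For the point-in-polygon part, a common choice is the even-odd (ray casting) test. A minimal sketch, assuming each state boundary is given as an array of {x, y} vertices (for example {longitude, latitude}) and ignoring points that lie exactly on an edge; the class name is made up for this example.

```java
final class PointInPolygon {
    /**
     * Even-odd ray casting: counts how many polygon edges a horizontal ray
     * starting at (x, y) and going towards +x crosses. An odd count means
     * the point is inside.
     */
    static boolean contains(double[][] poly, double x, double y) {
        boolean inside = false;
        for (int i = 0, j = poly.length - 1; i < poly.length; j = i++) {
            double xi = poly[i][0], yi = poly[i][1];
            double xj = poly[j][0], yj = poly[j][1];
            // The edge straddles the ray's height and the crossing lies to the right.
            boolean crosses = (yi > y) != (yj > y)
                    && x < (xj - xi) * (y - yi) / (yj - yi) + xi;
            if (crosses) {
                inside = !inside;
            }
        }
        return inside;
    }
}
```

Usage would be a loop over all state polygons, returning the first one for which `contains` is true. Checking a cheap bounding box per state first avoids running the full test against every polygon.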

Clustering of images to evaluate diversity (Weka?)

Within a university course I have some features of images (as text files). I have to rank those images according to their diversity.
The idea I have in mind is to feed a k-means clusterer with the images and then compute the Euclidean distance from the images within a cluster to the cluster's centroid. Then rotate between clusters and always take the (next) closest image to the centroid. I.e., return the closest to centroid 1, then the closest to centroid 2, then 3, ... then the second closest to centroid 1, 2, 3 and so on.
First question: would this be a clever approach? Or am I on the wrong path?
Second question: I'm a bit confused. I thought I'd feed the data to Weka and it'd tell me "hey, if I were you, I'd split this data into 7 clusters", or something like that. I mean, that it'd be able to give me some information about the clusters I need. Instead, to use simplekmeans I'm supposed to know a priori how many clusters I'll use... how could I possibly know that?
One example of what I mean: let's say I have 3 mono-color images: light-blue, blue, red.
I thought Weka would notice that the 2 blues are similar and cluster them together.
Btw I'm kind of new to Weka (as you might have seen), so if you could provide some information on which functions I might want to use (and why :P) I'd be grateful!
Thank you!
Simple K-means is an algorithm where you have to specify the number of clusters in the data set.
If you don't know how many clusters there might be, it's better to use a different algorithm or to estimate the number of clusters first.
You can use X-means; there you don't need to specify the k parameter. (http://weka.sourceforge.net/doc.packages/XMeans/weka/clusterers/XMeans.html)
X-Means is K-Means extended by an Improve-Structure part. In this part of the algorithm the centers are attempted to be split in its region. The decision between the children of each center and itself is done comparing the BIC-values of the two structures.
Or you can inspect a cut-point chart (dendrogram) based on AHC, the agglomerative hierarchical clustering algorithm (https://en.wikipedia.org/wiki/Hierarchical_clustering), and then deduce the number of clusters from it.
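Once you have clusters from whichever algorithm, the round-robin ranking described in the question could be sketched as follows. This assumes you already have the feature vectors, a cluster index per image and the centroids as plain arrays; it does not use the Weka API, and all names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class DiversityRanking {
    /** Euclidean distance between two feature vectors of equal length. */
    static double dist(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    /** Returns image indices: closest to centroid 0, 1, ..., then second closest, and so on. */
    static List<Integer> rank(double[][] features, int[] cluster, double[][] centroids) {
        int k = centroids.length;
        List<List<Integer>> byCluster = new ArrayList<>();
        for (int c = 0; c < k; c++) {
            byCluster.add(new ArrayList<>());
        }
        for (int i = 0; i < features.length; i++) {
            byCluster.get(cluster[i]).add(i);
        }
        // Within each cluster, order the members by distance to the centroid.
        for (int c = 0; c < k; c++) {
            final double[] centre = centroids[c];
            byCluster.get(c).sort(
                    Comparator.comparingDouble((Integer i) -> dist(features[i], centre)));
        }
        // Take one image per cluster per round until all images are used.
        List<Integer> order = new ArrayList<>();
        for (int round = 0; order.size() < features.length; round++) {
            for (int c = 0; c < k; c++) {
                if (round < byCluster.get(c).size()) {
                    order.add(byCluster.get(c).get(round));
                }
            }
        }
        return order;
    }
}
```

Clusters of very different sizes run out at different rounds; the sketch simply skips an exhausted cluster, which is one possible choice among several.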

How do I do a simple Gaussian distribution algorithm to distribute points on a plane?

What I seek is to turn a grid into a somewhat "random" plane of tiles.
I tried just multiplying Math.random() individually with the width and height of the plane (in this case it's 800 x 600). The circles you see there are points that intersect each other and have been removed from the scene.
As you can see, it looks very far from an "evenly distributed" field of points. There are large holes and just as bad, clusters of points can be seen.
What I am looking for is a way to distribute these points better to have a minimum amount of clusters and holes. Ideally, to have a value that is the minimum distance between any two points, while having the maximum number of points that can fit in the area. I am fine with approximations of all kinds, I just don't want to attempt to do a greedy distribution.
Whatever ECMA solution you give is fine; I can convert it to ActionScript.
I have found a visual example. The left side is what I got and the right is what I aim for.
You can try Lloyd's algorithm, i.e. centroidal (weighted) Voronoi diagrams. Compute the Voronoi diagram and then the center of gravity of each cell. Replace the old points with those centers, then rinse and repeat: http://www-cs-students.stanford.edu/~amitp/game-programming/polygon-map-generation/.
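Computing an exact Voronoi diagram takes some machinery, so here is a rough sketch of one relaxation step that approximates each cell by random sampling instead. This is a swapped-in Monte Carlo approximation, not the exact construction from the linked article, and the class name and parameters are made up for this example.

```java
import java.util.Random;

final class LloydRelaxation {
    /**
     * One approximate Lloyd step: throws random samples onto the plane,
     * assigns each to its nearest site (its Voronoi cell) and moves every
     * site to the mean of the samples it received.
     */
    static void relax(double[][] sites, double width, double height,
                      int samples, Random rng) {
        int n = sites.length;
        double[] sumX = new double[n];
        double[] sumY = new double[n];
        int[] count = new int[n];
        for (int s = 0; s < samples; s++) {
            double x = rng.nextDouble() * width;
            double y = rng.nextDouble() * height;
            int nearest = 0;
            double bestSq = Double.POSITIVE_INFINITY;
            for (int i = 0; i < n; i++) {
                double dx = sites[i][0] - x, dy = sites[i][1] - y;
                double sq = dx * dx + dy * dy;
                if (sq < bestSq) {
                    bestSq = sq;
                    nearest = i;
                }
            }
            sumX[nearest] += x;
            sumY[nearest] += y;
            count[nearest]++;
        }
        for (int i = 0; i < n; i++) {
            if (count[i] > 0) {   // sites that received no samples stay put
                sites[i][0] = sumX[i] / count[i];
                sites[i][1] = sumY[i] / count[i];
            }
        }
    }
}
```

Calling `relax` a handful of times on the Math.random() points should even out both the holes and the clumps; more samples per step give a closer approximation of the true cell centroids.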
In general, it is a non-trivial problem, and there are many different approaches.
One that I have liked, since it is fast and produces decent results, is the quasi-random number generator from this article: "The Unreasonable Effectiveness of Quasirandom Sequences"
Other approaches are generally iterative, where the more iterations you do, the better results. You could look up "Mitchell's Best Candidate", for one. Another is "Poisson Disc Sampling".
There are innumerable variations on the different algorithms depending on what you want — some applications demand certain frequencies of noise, for instance. But if you just want something that "looks okay", I think the quasirandom one is a good starting point.
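For reference, a minimal sketch of the 2-D sequence from that article (often called R2), as I understand it: each coordinate advances by a fixed irrational step derived from the plastic constant and wraps around at 1. The class name and the scaling to width x height are choices made for this example.

```java
final class R2Sequence {
    // Plastic constant: the real root of g^3 = g + 1.
    static final double G = 1.32471795724474602596;
    static final double A1 = 1.0 / G;
    static final double A2 = 1.0 / (G * G);

    /** n-th point (n >= 0) of the sequence, scaled to a width x height plane. */
    static double[] point(int n, double width, double height) {
        double x = (0.5 + A1 * n) % 1.0;   // fractional part, in [0, 1)
        double y = (0.5 + A2 * n) % 1.0;
        return new double[] {x * width, y * height};
    }
}
```

Generating the first N points with `point(n, 800, 600)` fills the 800 x 600 plane without any iteration or neighbour checks, which is what makes it cheap.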
Another cheap and easy one is a "jittered grid", where you evenly space the points on your plane, then randomly adjust each one a small amount.
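A minimal sketch of such a jittered grid, assuming one point per cell and a hypothetical `jitter` parameter between 0 (regular grid) and 1 (anywhere inside the cell). The names are made up for this example.

```java
import java.util.Random;

final class JitteredGrid {
    /**
     * Places one point per grid cell, displaced from the cell centre by a
     * random offset of at most jitter / 2 cell sizes in each direction.
     */
    static double[][] generate(int cols, int rows, double width, double height,
                               double jitter, Random rng) {
        double cellW = width / cols;
        double cellH = height / rows;
        double[][] pts = new double[cols * rows][];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double ox = (rng.nextDouble() - 0.5) * jitter;
                double oy = (rng.nextDouble() - 0.5) * jitter;
                pts[r * cols + c] = new double[] {
                    (c + 0.5 + ox) * cellW,
                    (r + 0.5 + oy) * cellH
                };
            }
        }
        return pts;
    }
}
```

With jitter below 1, neighbouring points can never get closer than (1 - jitter) cell sizes, so it doubles as a crude minimum-distance guarantee.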

cost / mapping function for determining center of object based on detected features

I wrote an object tracker that will try to detect and follow a moving object in a recorded video. In order to maximize the detection rate, my algorithm is using a bunch of detection & tracking algorithms (cascade, foreground & particle tracker). Each tracking algorithm will return me some point of interest that might be part of the object that I'm trying to track. Let's assume (for the simplicity of this example) that my object is a rectangle and that the three tracking algorithms returned the points 1, 2 and 3:
Based on the relation / distance of these three points it is possible to calculate the center of gravity (blue X in above image) of the tracked object. So for each frame I might be able to come up with some good estimate of the center of gravity. However, the object might move from one frame to the next:
In this example I merely rotated the original object. My algorithm will give me three new points of interest: 1',2' and 3'. I could again calculate the center of gravity based on these three new points, but I would throw away important information that I've acquired from the previous frame: based on points 1, 2 and 3 I already do know something about the relationship of these points and thus by combining the information from 1, 2 and 3 and 1',2' and 3' I should be able to come up with a better estimate of the center of gravity.
Furthermore, the next frame might yield a fourth data point:
This is what I would like to do (but I don't know how):
Based on the individual points (and their relationship to each other) that are returned from the different tracking algorithms, I want to build up a localization map of the tracked object. Intuitively I feel like I need to come up with A) an identification function that will identify individual points across frames and B) some cost function that will determine how similar tracked points (and the relationship / distance between them) are from frame to frame, but I can't get my head around how to implement this. Alternatively, maybe some kind of map buildup based on the points will work. But again, I don't know how to approach this.
Any advice (and example code) is highly appreciated!
EDIT1
A simple particle filter would probably work too, but again I don't know how to define the cost function. A particle filter for tracking a certain color is easy to program: for each pixel you calculate the difference between target color and pixel color. But how would I do the same for estimating the relationship between tracked points?
EDIT2 intuitively I feel like Kalman filters could also help with the prediction step. See slides 24 - 32 of this pdf. Or am I misled?
What I think you're trying to do is essentially build up a state space of features, which can be applied to a filtering process, such as an Extended Kalman Filter. This is a useful framework when you have multiple observations in every frame, and you're trying to estimate or measure something indicated by these observations.
To determine the similarity of the tracked points, you can perform simple template matching from frame to frame for small regions around the points. One way of doing this is to extract an NxN (say, 7x7) region around point a in frame n and point a' in frame n+1, followed by normalised cross correlation between the extracted regions. This will give you a reasonable measure of how similar the patches are. If the patches are not similar, then you've probably lost track of that point.
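A minimal sketch of the normalised cross correlation step, assuming each NxN patch has already been extracted as grey values and flattened into a row-major array. The class name is made up for this example, and a flat patch (zero variance) is treated as uncorrelated.

```java
final class PatchSimilarity {
    /**
     * Zero-mean normalised cross-correlation of two equally sized patches.
     * Returns a value in [-1, 1]; values near 1 mean the patches match.
     */
    static double ncc(double[] a, double[] b) {
        double meanA = 0, meanB = 0;
        for (int i = 0; i < a.length; i++) {
            meanA += a[i];
            meanB += b[i];
        }
        meanA /= a.length;
        meanB /= b.length;
        double num = 0, varA = 0, varB = 0;
        for (int i = 0; i < a.length; i++) {
            double da = a[i] - meanA;
            double db = b[i] - meanB;
            num += da * db;
            varA += da * da;
            varB += db * db;
        }
        double denom = Math.sqrt(varA * varB);
        return denom == 0 ? 0 : num / denom;   // flat patch: no correlation defined
    }
}
```

Because the mean is subtracted and the result is normalised, the score is unaffected by a uniform brightness or contrast change between the two frames. The threshold below which a point counts as lost is something you would have to tune on your own footage.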
There is an enormous literature on this and related problems, starting in the 80's. Try searching for "optical flow" algorithms. The input for such algorithms is two successive frames of the same scene. The output is a vector field, one vector per pixel in the second image, which shows the direction and speed of movement of the feature at that pixel. This presentation is a pretty nice summary.
A nice thing about optical flow is that many algorithms for it parallelize nicely and map onto your favorite video card GPU, so they can run in real time. Think ESPN overlays.
In my opinion, in order to identify which point is which in each frame, you will have to work in a higher dimension. For example, if you want to know which point went where between two frames (assuming the extracted points are the same), you will have to build vectors or a simplex and then deduce an arrangement of your points from them (angle values, for instance).
The main problem is that the number of combinations grows with the number of points. If your camera is fixed, you could use the background as a reference to deduce object rotations and translations; I mean, build vectors between background interest points and object points in order to clearly identify them.
Hope that helps you move forward.
I would recommend looking into the divided difference filter (DDF), which is similar to the extended Kalman filter (EKF), but does not require an approximate model of the dynamics of your system (which you may not have). Basically the DDF approximates the derivatives used in the EKF using a difference equation. There are plenty of papers online about this, but I do not know whether you have access to them so I have not linked them here. If you are working from a university or a company that has access to online journals (like IEEE Xplore), then just Google "divided difference filter" and check out some of the papers.

Realworld parameter optimization

I need to do parameter optimization for my latest research project. I have an algorithm which currently has 5 parameters (four doubles in [0,1] and one nominal with 3 values). The algorithm uses those parameters to calculate some stuff, and afterwards I calculate the Precision, Recall & FMeasure. A single run takes about 1.8 s. Currently I'm going through each parameter with a 0.1 step size, which shows me approximately where the global maximum is. But I want to find the precise global maximum. I've looked into Gradient Descent, but I don't really know how to apply it to my algorithm (if that's even possible). Could anybody please guide me a little on how I would implement such an algorithm, since I'm very new to this kind of work?
Cheers,
Daniel
You can certainly do better than a grid search.
Before applying an algorithm like gradient descent, you have to be sure that your parameter space does not contain local maxima or that at least your starting point is close to the global maximum and your step size is appropriate enough to bring you to it.
In your case, I would recommend starting by drawing as many random samples as you can. This is a much better way of exploring the parameter space than a grid search. Once you collect enough data this way, you can use a mode-finding algorithm, such as mean shift or one of its faster derivatives, or go straight to optimization. Since you don't have the Jacobian of your parameter space, you could use Broyden's method, which iteratively approximates it, or a quasi-Newton (secant) method such as BFGS.
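The random sampling step could be sketched like this, assuming the score (your F-measure, say) is available as a function of the four continuous parameters in [0,1]. The class and method names are made up for this example; the nominal parameter is left out and could be handled by running the search once per nominal value.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

final class RandomSearch {
    /**
     * Draws `trials` uniform samples from [0,1]^dims and returns the sample
     * with the highest score seen so far.
     */
    static double[] maximise(ToDoubleFunction<double[]> score, int dims,
                             int trials, Random rng) {
        double[] best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int t = 0; t < trials; t++) {
            double[] candidate = new double[dims];
            for (int d = 0; d < dims; d++) {
                candidate[d] = rng.nextDouble();
            }
            double s = score.applyAsDouble(candidate);
            if (s > bestScore) {
                bestScore = s;
                best = candidate;
            }
        }
        return best;
    }
}
```

The best sample found this way is a reasonable starting point for the local optimizer, and keeping all (sample, score) pairs instead of only the best one gives you the data for the mode-finding step.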
Also, see this related question: How can I adjust parameters for image processing algorithm in an efficient way?
