ELKI get clustering data points - java

How do I get the data points and the centroid of a k-means (Lloyd) cluster when I use ELKI?
Also, could I plug those points into one of the distance functions and get the distance between any two of them?
This question is different because its main focus is retrieving the data points, not custom data points. The answer on the other thread is also currently incomplete, since it refers to a wiki that is not functioning at the moment. Additionally, I would like to know specifically what needs to be done: the documentation across these libraries is a bit of a wild goose chase, so if you know and understand the library, a direct answer would give others with the same problem a solid reference instead of leaving them to figure the library out themselves.

A Cluster (JavaDoc) in ELKI never stores the point data. It only stores point DBIDs (Wiki), which you can get using the getIDs() method. To get the original data, you need the Relation from your database. The method getModel() returns the cluster model, which for k-means is a KMeansModel.
You can get the point data from the database Relation by their DBID,
or compute the distance based on two DBIDs.
The centroid in k-means is special: it is not a database object, but always a numerical vector, the arithmetic mean of the cluster. With k-means, you should be using SquaredEuclideanDistanceFunction. This is a NumberVectorDistanceFunction, which has the method distance(NumberVector o1, NumberVector o2) (not all distances work on number vectors!).
Relation<? extends NumberVector> rel = ...;
NumberVectorDistanceFunction<? super NumberVector> df = SquaredEuclideanDistanceFunction.STATIC;
// ... run the algorithm, then iterate over each cluster: ...
Cluster<KMeansModel> cluster = ...;
Vector center = cluster.getModel().getMean();
double varsum = cluster.getModel().getVarianceContribution();
double sum = 0.;
// C-style for loop over DBIDs, for efficiency:
for(DBIDIter id = cluster.getIDs().iter(); id.valid(); id.advance()) {
    sum += df.distance(rel.get(id), center);
}
System.out.println(varsum + " should be the same as " + sum);

Related

Custom distance metric for DBSCAN in Apache Commons Math (v3.1 vs. v3.6)

I want to use Apache Commons Math's DBSCANClusterer<T extends Clusterable> to perform a clustering using the DBSCAN algorithm, but with a custom distance metric as my data points contain non-numerical values. This seems to have been easily achievable in the older version (note that the fully qualified name of this class is org.apache.commons.math3.stat.clustering.DBSCANClusterer<T> whereas it is org.apache.commons.math3.ml.clustering.DBSCANClusterer<T> for the current release), which has now been deprecated. In the older version, Clusterable would take a type-param, T, describing the type of the data points being clustered, and the distance between two points would be defined by one's implementation of Clusterable.distanceFrom(T), e.g.:
class MyPoint implements Clusterable<MyPoint> {
    private String someStr = ...;
    private double someDouble = ...;

    @Override
    public double distanceFrom(MyPoint p) {
        // Arbitrary distance metric goes here, e.g.:
        double stringsEqual = this.someStr.equals(p.someStr) ? 0.0 : 10000.0;
        return stringsEqual + Math.sqrt(Math.pow(p.someDouble - this.someDouble, 2.0));
    }
}
In the current release, Clusterable is no longer parameterized. This means that one has to come up with a way of representing one's (potentially non-numerical) data points as a double[] and return that representation from getPoint(), e.g.:
class MyPoint implements Clusterable {
    private String someStr = ...;
    private double someDouble = ...;

    @Override
    public double[] getPoint() {
        double[] res = new double[2];
        res[1] = someDouble; // obvious
        res[0] = ...; // some way of representing someStr as a double required
        return res;
    }
}
And then provide an implementation of DistanceMeasure that defines the custom distance function in terms of the double[] representations of the two points being compared, e.g.:
class CustomDistanceMeasure implements DistanceMeasure {
    @Override
    public double compute(double[] a, double[] b) {
        // Let's mimic the distance function from earlier, assuming that
        // a[0] is different from b[0] if the two 'someStr' variables were
        // different when their double representations were created.
        double stringsEqual = a[0] == b[0] ? 0.0 : 10000.0;
        return stringsEqual + Math.sqrt(Math.pow(a[1] - b[1], 2.0));
    }
}
My data points are of the form (integer, integer, string, string):
class MyPoint {
    int i1;
    int i2;
    String str1;
    String str2;
}
And I want to use a distance function/metric that essentially says "if str1 and/or str2 differ for MyPoint mpa and MyPoint mpb, the distance is maximal, otherwise the distance is the Euclidean distance between the integers" as illustrated by the following snippet:
class Dist {
    static double distance(MyPoint mpa, MyPoint mpb) {
        if (!mpa.str1.equals(mpb.str1) || !mpa.str2.equals(mpb.str2)) {
            return Double.MAX_VALUE;
        }
        return Math.sqrt(Math.pow(mpa.i1 - mpb.i1, 2.0) + Math.pow(mpa.i2 - mpb.i2, 2.0));
    }
}
Questions:
How do I represent a String as a double in order to enable the above distance metric in the current release (v3.6.1) of Apache Commons Math? String.hashCode() is insufficient, as hash code collisions would cause different strings to be considered equal. This seems like an unsolvable problem, as I'm essentially trying to create a unique mapping from an infinite set of strings to a finite set of numerical values (64-bit double).
As (1) seems impossible, am I misunderstanding how to use the library? If yes, where did I take a wrong turn?
Is my only alternative to use the deprecated version for this kind of distance metric? If yes, (3a) why would the designers choose to make the library less general? Perhaps in favor of speed? Perhaps to get rid of the self-reference in class MyPoint implements Clusterable<MyPoint> which some might consider bad design? (I realize that this might be too opinionated, so please disregard it if that is the case). For the commons-math experts: (3b) what downsides are there to using the deprecated version other than forward compatibility (the deprecated version will be removed in 4.0)? Is it slower? Perhaps even incorrect?
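Regarding question (1): a workaround occasionally used in practice, sketched below under my own assumptions (this is not an official Commons Math pattern), is to sidestep the string-to-double mapping entirely. Store each point's index into a shared list as its double[] representation, and let the distance function dereference the original objects. The compute method below has the same signature as DistanceMeasure.compute(double[], double[]), so the same body could be dropped into a DistanceMeasure implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the double[] handed to the clusterer holds only an index into a
// shared list of the real points; the distance function dereferences the
// originals, so strings never need a numeric encoding.
public class IndexedDistanceSketch {
    static class MyPoint {
        final int i1, i2;
        final String str1, str2;
        MyPoint(int i1, int i2, String str1, String str2) {
            this.i1 = i1; this.i2 = i2; this.str1 = str1; this.str2 = str2;
        }
    }

    // Shared registry; a Clusterable's getPoint() would return new double[]{ index }.
    static final List<MyPoint> POINTS = new ArrayList<>();

    // Same signature as DistanceMeasure.compute(double[], double[]).
    static double compute(double[] a, double[] b) {
        MyPoint mpa = POINTS.get((int) a[0]);
        MyPoint mpb = POINTS.get((int) b[0]);
        if (!mpa.str1.equals(mpb.str1) || !mpa.str2.equals(mpb.str2)) {
            return Double.MAX_VALUE; // strings differ: maximal distance
        }
        return Math.hypot(mpa.i1 - mpb.i1, mpa.i2 - mpb.i2);
    }

    public static void main(String[] args) {
        POINTS.add(new MyPoint(0, 0, "x", "y"));
        POINTS.add(new MyPoint(3, 4, "x", "y"));
        POINTS.add(new MyPoint(3, 4, "x", "z"));
        System.out.println(compute(new double[]{0}, new double[]{1})); // 5.0
        System.out.println(compute(new double[]{0}, new double[]{2})); // Double.MAX_VALUE
    }
}
```

The obvious drawback is statefulness: the shared list must stay in sync with the points passed to the clusterer, which is exactly the kind of coupling the new API presumably tried to avoid.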
Note: I am aware of ELKI which is apparently popular among a set of SO users, but it does not fit my needs as it is marketed as a command-line and GUI tool rather than a Java library to be included in third-party applications:
You can even embed ELKI into your application (if you accept the
AGPL-3 license), but we currently do not (yet) recommend to do so,
because the API is still changing substantially. [...]
ELKI is not designed as embeddable library. It can be used, but it is
not designed to be used this way. ELKI has tons of options and
functionality, and this comes at a price, both in runtime (although it
can easily outperform R and Weka, for example!) memory usage and in
particular in code complexity.
ELKI was designed for research in data mining algorithms, not for
making them easy to include in arbitrary applications. Instead, if you
have a particular problem, you should use ELKI to find out which
approach works good, then reimplement that approach in an optimized
manner for your problem (maybe even in C++ then, to further reduce
memory and runtime).

Find if location is land or water in WorldWind

I know in WorldWind Java you can find out the elevation at a particular location with something like this:
public Double getPositionElevationMeters(Double lat, Double lon) {
    double elevation = getWorldWindCanvas().getModel().getGlobe()
            .getElevation(Angle.fromDegrees(lat), Angle.fromDegrees(lon));
    return elevation;
}
Is there a way to figure out programmatically whether that lat/lon is actually a major body of water or land? I've taken a "blind" approach of just considering elevation less than 0 to be water, but that's obviously not ideal.
I'd even use another library that would give me this information; I just need it to work offline.
You could possibly use a data source such as this from which you should be able to determine the polygons for all countries on Earth. Antarctica has also been added to that data set. This would get you most of the way there, depending on what you define as a "major" body of water.
From there, you can use GeoTools to import the shape data and calculate which polygons a given lat/lon pair fall in to. If none, then it is probably an ocean.
The following pseudocode illustrates the logical flow:
// pseudocode, not the actual GeoTools API
boolean isWater(Coordinate point) {
    ShapeDataStore countryShapes = loadShapeData("world.shp");
    GeoShape shape = countryShapes.findShapeByPoint(point);
    if (shape == null)
        return true;  // not a country or Antarctica, must be international waters
    else
        return false;
}
Edit: see this answer to a similar question, which describes this process in a bit more detail.

Implementing nonlinear optimization with nonlinear inequality constraints with java

How do I implement a nonlinear optimization with nonlinear constraints in Java? I am currently using org.apache.commons.math3.optim.nonlinear.scalar.noderiv, and I have read that none of the optimizers (such as the one I am currently working with, SimplexOptimizer) take constraints by default; instead one must map the constrained parameters to unconstrained ones by wrapping the objective with the MultivariateFunctionPenaltyAdapter or MultivariateFunctionMappingAdapter classes. However, as far as I can tell, even using these wrappers, one can still only implement linear or "simple" constraints. I am wondering if there is any way to include nonlinear inequality constraints?
For example, suppose that my objective function is a function of 3 parameters a, b, and c (depending on them non-linearly), and that additionally these parameters are subject to the constraint that ab
Any advice that would solve the problem using just apache commons would be great, but any suggestions for extending existing classes or augmenting the package would also be welcome of course.
My best attempt so far at implementing the COBYLA package is given below:
public static double[] Optimize(double[][] contractDataMatrix, double[] minData,
        double[] maxData, double[] modelData, String modelType, String weightType) {
    ObjectiveFunction objective = new ObjectiveFunction(contractDataMatrix, modelType, weightType);
    double rhobeg = 0.5;
    double rhoend = 1.0e-6;
    int iprint = 3;
    int maxfun = 3500;
    int n = modelData.length;
    Calcfc calcfc = new Calcfc() {
        @Override
        public double Compute(int n, int m, double[] x, double[] con) {
            con[0] = x[3] * x[3] - 2 * x[0] * x[1];
            System.out.println("constraint: " + (x[3] * x[3] - 2 * x[0] * x[1]));
            return objective.value(x);
        }
    };
    COBYLAExitStatus result = COBYLA.FindMinimum(calcfc, n, 1, modelData, rhobeg, rhoend, iprint, maxfun);
    return modelData;
}
The issue is that I am still getting illegal values in my optimization. As you can see, within the anonymous override of the Compute function, I am printing out the value of my constraint. The result is often negative. But shouldn't this value be constrained to be non-negative?
EDIT: I found the bug in my code, which was unrelated to the optimizer itself but rather my implementation.
Best,
Paul
You might want to consider an optimizer that is not available in Apache Commons Math. COBYLA is a derivative-free method for relatively small optimization problems (fewer than 100 variables) with nonlinear constraints. I have ported the original Fortran code to Java; the source code is here.
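As a side note, the penalty idea behind MultivariateFunctionPenaltyAdapter (which itself only supports simple bounds) can be reproduced by hand for a nonlinear inequality constraint. The sketch below is my own illustration of the general technique, not Commons Math API: fold a constraint g(x) >= 0 into the objective as f(x) + mu * max(0, -g(x))^2, then hand the wrapped function to any unconstrained optimizer.

```java
import java.util.function.ToDoubleFunction;

// Hand-rolled penalty wrapper for a nonlinear inequality constraint.
public class PenaltySketch {
    static ToDoubleFunction<double[]> penalize(
            ToDoubleFunction<double[]> f,
            ToDoubleFunction<double[]> g,   // constraint: g(x) >= 0
            double mu) {                    // penalty weight
        return x -> {
            double violation = Math.max(0.0, -g.applyAsDouble(x));
            return f.applyAsDouble(x) + mu * violation * violation;
        };
    }

    public static void main(String[] args) {
        // Toy objective; the constraint mirrors the shape of the COBYLA
        // snippet's con[0] = x3^2 - 2*x0*x1, with indices shifted to fit a
        // 3-element vector (illustrative only).
        ToDoubleFunction<double[]> f = x -> x[0] * x[0] + x[1] * x[1];
        ToDoubleFunction<double[]> g = x -> x[2] * x[2] - 2 * x[0] * x[1];
        ToDoubleFunction<double[]> p = penalize(f, g, 1e6);

        System.out.println(p.applyAsDouble(new double[]{1, -1, 0})); // feasible: just f = 2.0
        System.out.println(p.applyAsDouble(new double[]{1, 1, 0}));  // infeasible: 2 + 1e6*4 = 4000002.0
    }
}
```

The quadratic penalty only enforces the constraint approximately; mu usually has to be increased over several outer iterations, which is part of why a dedicated method like COBYLA is preferable.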

Solving a non linear system in java (using optim toolbox)

I have a system of nonlinear dynamics which I wish to solve to optimality. I know how to do this in MATLAB, but I want to implement it in Java, and for some reason I am lost on how to do it there.
What I have is following:
z(t) which returns states in a dynamic system.
z(t) = [state1(t),...,state10(t)]
The rate of change of this dynamic system is given by:
z'(t) = f(z(t),u(t),d(t)) = [dstate1(t)/dt,...,dstate10(t)/dt]
where u(t) and d(t) are external variables whose values I know.
In addition I have a function, let us denote it g(t), which is defined from a state variable:
g(t) = state4(t)/c1
where c1 is some constant.
Now I wish to solve the following unconstrained nonlinear system numerically:
g(t) - c2 = 0
f(z(t),u(t),0)= 0
where c2 is some constant. The above system can be seen as a simple f(x) = 0 root-finding problem consisting of 11 equations and 11 unknowns, and if I were to solve this in MATLAB I would do the following:
[output] = fsolve(@myDerivatives, someInitialGuess);
I am aware of the fact that Java doesn't come with any built-in solvers. So as I see it, there are two options for solving the above problem:
Option 1: Do it myself: I could use a numerical method such as Gauss-Newton to solve this system of nonlinear equations. However, I will start by using a Java toolbox first, and then move to a numerical method afterwards.
Option 2: Solvers (e.g. commons optim): This is the solution I would like to look into. I have been looking into this toolbox; however, I have failed to find a concrete example of how to actually use the MultivariateFunction evaluator and the numerical optimizer. Do any of you have experience in doing so?
Please let me know if you have any ideas or suggestions for solving this problem.
Thanks!
Please compare what your original problem looks like:
A global optimization problem
minimize f(y)
is solved by looking for solutions of the derivatives system
0=grad f(y) or 0=df/dy (partial derivatives)
(the gradient is the column vector containing all partial derivatives), that is, you are computing the "flat" or horizontal points of f(y).
For optimization under constraints
minimize f(y,u) such that g(y,u)=0
one builds the Lagrangian functional
L(y,p,u) = f(y,u)+p*g(y,u) (scalar product)
and then compute the flat points of that system, that is
g(y,u)=0, dL/dy(y,p,u)=0, dL/du(y,p,u)=0
After that, as in the global optimization case, you have to determine the type of the flat point: maximum, minimum, or saddle point.
Optimal control problems have the structure (one of several equivalent variants)
minimize integral(0,T) f(t,y(t),u(t)) dt
such that y'(t)=g(t,y(t),u(t)), y(0)=y0 and h(T,y(T))=0
To solve it, one considers the Hamiltonian
H(t,y,p,u)=f(t,y,u)-p*g(t,y,u)
and obtains the transformed problem
y' = -dH/dp = g, (partial derivatives, gradient)
p' = dH/dy,
with boundary conditions
y(0)=y0, p(T)= something with dh/dy(T,y(T))
u(t) realizes the minimum in v -> H(t,y(t),p(t),v)
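The flat-point conditions above are themselves a nonlinear system, typically attacked with Newton-type iterations. As a minimal illustration of the building block (my own sketch, shown for a scalar equation h(y) = 0 with an accessible derivative; a real system needs the full Jacobian and safeguards such as damping):

```java
import java.util.function.DoubleUnaryOperator;

// Minimal Newton iteration for a scalar equation h(y) = 0, the building
// block behind solving flat-point conditions like 0 = df/dy.
public class NewtonSketch {
    static double solve(DoubleUnaryOperator h, DoubleUnaryOperator dh,
                        double y0, double tol, int maxIter) {
        double y = y0;
        for (int i = 0; i < maxIter && Math.abs(h.applyAsDouble(y)) > tol; i++) {
            y -= h.applyAsDouble(y) / dh.applyAsDouble(y); // Newton step
        }
        return y;
    }

    public static void main(String[] args) {
        // Example: f(y) = (y-3)^2 has a flat point where f'(y) = 2(y-3) = 0.
        double root = solve(y -> 2 * (y - 3), y -> 2.0, 10.0, 1e-12, 100);
        System.out.println(root); // converges to 3.0
    }
}
```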

Find shortest route between points in 2d from collection

I have a list of 2D points in my scene, and I have an array of connections between these Points stored as unordered pairs.
Pair is defined exactly as in how to write a set for unordered pair in Java,
so I have:
ArrayList<PointF> mPoints = new ArrayList<PointF>();
ArrayList<Pair<PointF>> mConnections = new ArrayList<Pair<PointF>>();
//
PointF mStartPoint = mPoints.get(0);
PointF mEndPoint = mPoints.get(80);
I need to find the array of Points that will lead me from the source to the destination Point.
I am thinking of adding distance information to each Pair, but what next?
This is an instance of a standard path finding problem.
If you need a guaranteed exact solution, go with something like Dijkstra's algorithm. If you need something more efficient, but can live with suboptimal solutions for certain cases, go with the A* algorithm.
See http://en.wikipedia.org/wiki/Dijkstras_algorithm#Algorithm for a solution.
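A minimal self-contained Dijkstra sketch over an edge list like mConnections follows; it assumes nodes are referred to by index into parallel coordinate arrays rather than by PointF, and that the edge weight is the Euclidean distance between the endpoints:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Dijkstra over an undirected edge list; weights are Euclidean distances.
public class DijkstraSketch {
    // edges: {u, v} index pairs into xs/ys. Returns the node indices of the
    // shortest path from src to dst, or an empty list if dst is unreachable.
    static List<Integer> shortestPath(double[] xs, double[] ys,
                                      int[][] edges, int src, int dst) {
        int n = xs.length;
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) {            // unordered pairs -> both directions
            adj.get(e[0]).add(e[1]);
            adj.get(e[1]).add(e[0]);
        }
        double[] dist = new double[n];
        int[] prev = new int[n];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        Arrays.fill(prev, -1);
        dist[src] = 0;
        PriorityQueue<double[]> pq =       // entries: {distance, node}
                new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
        pq.add(new double[]{0, src});
        while (!pq.isEmpty()) {
            double[] top = pq.poll();
            int u = (int) top[1];
            if (top[0] > dist[u]) continue; // stale queue entry
            if (u == dst) break;
            for (int v : adj.get(u)) {
                double w = Math.hypot(xs[u] - xs[v], ys[u] - ys[v]);
                if (dist[u] + w < dist[v]) {
                    dist[v] = dist[u] + w;
                    prev[v] = u;
                    pq.add(new double[]{dist[v], v});
                }
            }
        }
        List<Integer> path = new ArrayList<>();
        if (dist[dst] == Double.POSITIVE_INFINITY) return path;
        for (int v = dst; v != -1; v = prev[v]) path.add(v);
        Collections.reverse(path);
        return path;
    }

    public static void main(String[] args) {
        // Square of points; no diagonal edge, so 0 -> 2 must pass through 1 or 3.
        double[] xs = {0, 1, 1, 0};
        double[] ys = {0, 0, 1, 1};
        int[][] edges = {{0, 1}, {1, 2}, {2, 3}, {3, 0}};
        System.out.println(shortestPath(xs, ys, edges, 0, 2));
    }
}
```

Mapping this back to the question: build an index for each PointF in mPoints, translate each Pair in mConnections into an {u, v} entry, and convert the returned indices back to PointF objects.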
