Hierarchical clustering with custom distance

I need to implement a hierarchical clustering algorithm based on a custom distance. The distance is computed by looking up, in a database, the value stored for the IDs of the two objects being compared.
Is there an easy way to do this in Java? I took a look at Weka and its custom distance function, but I cannot find a way to define instances so that, inside the custom distance function, I can get the IDs of the two original objects.
Any help would be greatly appreciated
Thanks a lot in advance
Rossella

You can take a look at Apache Mahout.
Here is a link: Mahout Hierarchical clustering
This tool is written in Java and it's open source.
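If a full library turns out to be more than is needed, the idea in the question can also be sketched in plain Java. The following is a minimal, naive single-linkage agglomerative clustering that works directly on object IDs and takes the distance as a callback, so the callback is free to look the value up in a database. The class name `IdClustering`, the method names, and the choice of single linkage are assumptions made for illustration; none of this comes from Weka or Mahout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** Naive single-linkage agglomerative clustering over object IDs (illustrative sketch). */
class IdClustering {

    /**
     * Repeatedly merges the two closest clusters until targetClusters remain.
     * The distance callback receives two IDs; in practice it could wrap a
     * (preferably cached) database lookup.
     */
    static List<List<Integer>> cluster(List<Integer> ids,
                                       ToDoubleBiFunction<Integer, Integer> distance,
                                       int targetClusters) {
        // Start with one cluster per ID.
        List<List<Integer>> clusters = new ArrayList<>();
        for (Integer id : ids) {
            List<Integer> single = new ArrayList<>();
            single.add(id);
            clusters.add(single);
        }
        while (clusters.size() > Math.max(1, targetClusters)) {
            int bestA = 0, bestB = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = linkage(clusters.get(a), clusters.get(b), distance);
                    if (d < best) { best = d; bestA = a; bestB = b; }
                }
            }
            // bestB > bestA, so removing bestB does not shift bestA.
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }

    /** Single linkage: the smallest distance between any member of a and any member of b. */
    private static double linkage(List<Integer> a, List<Integer> b,
                                  ToDoubleBiFunction<Integer, Integer> distance) {
        double min = Double.POSITIVE_INFINITY;
        for (Integer x : a) {
            for (Integer y : b) {
                min = Math.min(min, distance.applyAsDouble(x, y));
            }
        }
        return min;
    }
}
```

A call could look like `IdClustering.cluster(ids, (a, b) -> lookupDistance(a, b), 5)`, where `lookupDistance` is a hypothetical method that queries the database. This naive version evaluates the distance many times, so caching the lookups would probably matter for anything beyond small inputs.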

Related

Process GeoJSON and find whether a given point is contained within a polygon

I have a requirement to process a GeoJSON file (it has multiple polygons) and find out whether a given point (longitude and latitude) is contained within a polygon. I am looking for a Java solution. Can you please recommend possible solutions?
Thanks.

If you need more functionality than what JTS offers, you should check out GeoTools.
It provides the ability to read and write most major cartographic formats, supports map projections and coordinate transformations, and is a much more full-featured GIS suite.
JTS is strictly geometry -- it deals with 2d shapes with no units attached.

When one needs a geometry library for Java, JTS should be the first one mentioned. And, of course, Java-GeoJSON to load the data. The latter library is built upon JTS, so you don't even need any special adapters between the two libraries.
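To show what such a containment check does underneath, here is a minimal ray-casting sketch in plain Java. It is not JTS or GeoTools code: the class name `PointInPolygon` and the `double[][]` ring layout are assumptions for illustration. It treats longitude and latitude as flat x/y values, ignores holes, and does not define a result for points lying exactly on the boundary, all of which a real geometry library handles properly.

```java
/** Ray-casting point-in-polygon test for a single ring (illustrative sketch). */
class PointInPolygon {

    /**
     * Casts a horizontal ray from (x, y) and counts the polygon edges it crosses;
     * an odd number of crossings means the point is inside.
     * ring holds {longitude, latitude} pairs of one linear ring, in the order
     * used by GeoJSON coordinates.
     */
    static boolean contains(double[][] ring, double x, double y) {
        boolean inside = false;
        for (int i = 0, j = ring.length - 1; i < ring.length; j = i++) {
            double xi = ring[i][0], yi = ring[i][1];
            double xj = ring[j][0], yj = ring[j][1];
            // The edge straddles the ray's height and the crossing lies to the right of x.
            boolean crosses = (yi > y) != (yj > y)
                    && x < (xj - xi) * (y - yi) / (yj - yi) + xi;
            if (crosses) {
                inside = !inside;
            }
        }
        return inside;
    }
}
```

For a file with multiple polygons, the same test would be run against the outer ring of each polygon after parsing the coordinates.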

Simple Genetic Algorithm tutorial for timetabling?

My final year project is about automated timetabling using Genetic Algorithm.
First, I'm not asking for working sample code.
I just need a tutorial that helps me understand more about GA in timetabling.
I currently understand the GA operations (selection, crossover, mutation) based on the tutorials I found.
But I have no idea how to apply them to a timetable. The GA tutorials I looked at encode data as binary or as strings. But what about creating a timetable?
I hope somebody can guide me to understand GA in timetabling in more detail. If you have another GA tutorial that can help me understand GA better, it is welcome. :)
Thanx in advance!

Define your individual/genotype
Which parameters does a timetable have? Can you store them as a bit string or an array of integers?
Define your fitness function
Create rules for calculating the goodness of a timetable.
Define the type of selection
How do you select individuals for mating? Will the best individual be kept during the whole run (elitism)?
Define genetic operators
How can two individuals create an offspring? Do you want to use mutation, crossover or both?
Define parameters for the algorithm
Will the population size be fixed, with new individuals replacing old ones depending on their fitness value (steady state)? Or do you want to create a new generation each time all individuals are evaluated?
Implement a simple genetic algorithm (SGA) and test it.
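The steps above can be sketched in a few lines of Java. This is only one possible set of answers to those questions, chosen for illustration: the genotype is an array of integers where `slots[i]` is the time slot of lesson `i`, the fitness is the number of clashing lesson pairs in the same slot (lower is better), selection is a two-way tournament with elitism, and each generation is replaced as a whole. The class name `TimetableGa`, the population size, the mutation rate, and the idea of passing clashes as index pairs are all assumptions, not part of any particular tutorial.

```java
import java.util.Random;

/** Toy generational GA for timetabling (illustrative sketch). */
class TimetableGa {

    /** Fitness: number of conflicting lesson pairs placed in the same slot; 0 is a valid timetable. */
    static int clashes(int[] slots, int[][] conflicts) {
        int count = 0;
        for (int[] pair : conflicts) {
            if (slots[pair[0]] == slots[pair[1]]) count++;
        }
        return count;
    }

    static int[] solve(int lessons, int slotCount, int[][] conflicts, long seed) {
        Random rnd = new Random(seed);
        int popSize = 50;
        // Genotype: one random slot per lesson.
        int[][] pop = new int[popSize][lessons];
        for (int[] ind : pop)
            for (int g = 0; g < lessons; g++) ind[g] = rnd.nextInt(slotCount);

        int[] best = pop[0];
        for (int gen = 0; gen < 200; gen++) {
            // Evaluate the population and remember the fittest individual.
            for (int[] ind : pop)
                if (clashes(ind, conflicts) < clashes(best, conflicts)) best = ind;
            if (clashes(best, conflicts) == 0) break;

            int[][] next = new int[popSize][];
            next[0] = best.clone();                       // elitism
            for (int i = 1; i < popSize; i++) {
                int[] p1 = tournament(pop, conflicts, rnd);
                int[] p2 = tournament(pop, conflicts, rnd);
                int cut = rnd.nextInt(lessons);           // one-point crossover
                int[] child = new int[lessons];
                for (int g = 0; g < lessons; g++) child[g] = g < cut ? p1[g] : p2[g];
                if (rnd.nextDouble() < 0.2)               // mutation: move one lesson
                    child[rnd.nextInt(lessons)] = rnd.nextInt(slotCount);
                next[i] = child;
            }
            pop = next;
        }
        return best.clone();
    }

    /** Tournament selection: pick two individuals at random and keep the fitter one. */
    private static int[] tournament(int[][] pop, int[][] conflicts, Random rnd) {
        int[] a = pop[rnd.nextInt(pop.length)];
        int[] b = pop[rnd.nextInt(pop.length)];
        return clashes(a, conflicts) <= clashes(b, conflicts) ? a : b;
    }
}
```

A real timetable would need a richer fitness function (rooms, teacher availability, soft preferences), but the encoding question from the post is answered the same way: each gene position stands for a lesson and its value for the assigned slot.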

Word association search in Apache Lucene

I have a requirement to associate math terms that come under a common topic. For example, angles, cos, tan, etc. should relate to trigonometry. So when a user searches for angles, triangles, etc., the search should present results related to trigonometry as well. Can anyone provide leads on how to do this in Apache Lucene?

There is a classification API which includes k-nearest neighbors and naive Bayes models.
You would first use the train() method with your training set. Once the classifier is trained use the assignClass() method to classify a given string.
For a training set you could use Wikipedia pages for your given classes.
After you give those two a try you could make use of the Classifier interface to build a competing model.

If you already know the associations, you can just add them to the index for the specific terms -- i.e. indexing 'cos' as 'cos', 'trigonometry'.
Also if you know the associations, you could index the parent term and all of the sibling terms -- i.e. indexing 'cos' as 'trigonometry', 'cos', 'sin', etc. This sounds more like what you want.
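The indexing idea can be illustrated without Lucene by a toy in-memory inverted index. This is not Lucene code: the class `AssociationIndex` and its methods are hypothetical and only show the principle of posting a document under the parent topic as well as under the term itself, so that a search for any term of the topic finds the sibling documents too.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index that also posts documents under their parent topic (illustrative sketch). */
class AssociationIndex {
    private final Map<String, String> topicOf = new HashMap<>();        // term -> parent topic
    private final Map<String, Set<Integer>> postings = new HashMap<>(); // key -> document ids

    /** Registers known associations, e.g. associate("trigonometry", "cos", "sin"). */
    void associate(String topic, String... terms) {
        for (String term : terms) topicOf.put(term, topic);
    }

    /** Indexes a document under each of its terms and under each term's topic. */
    void index(int docId, String... terms) {
        for (String term : terms) {
            post(term, docId);
            String topic = topicOf.get(term);
            if (topic != null) post(topic, docId);
        }
    }

    /** Returns documents matching the term itself or anything filed under its topic. */
    Set<Integer> search(String term) {
        Set<Integer> hits = new TreeSet<>(postings.getOrDefault(term, Collections.emptySet()));
        String topic = topicOf.get(term);
        if (topic != null) hits.addAll(postings.getOrDefault(topic, Collections.emptySet()));
        return hits;
    }

    private void post(String key, int docId) {
        postings.computeIfAbsent(key, k -> new TreeSet<>()).add(docId);
    }
}
```

In Lucene itself the same effect would come from adding the extra terms to the indexed field or from a synonym mapping at analysis time; the sketch only shows why that makes the associated documents findable.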

In addition to @Josh S.'s good answer, you can also take a more direct approach and generate your own synonyms dictionary, e.g. see Match a word with similar words using Solr?

Which Java data object to use for multidimensional range matching?

Project Background:
I am writing a map tile overlay class for java that can use gdal2tile.py tiles. Basically I will end up with thousands of jpg files that are in a file structure like
"Zoom Level/X coordinate/Y coordinate"
The coordinates are ints but will not necessarily start at 0 or 1.
I will have to search for tiles that are within a certain range to find out which ones I need to render.
My Problem:
I tried iterating using the file structure itself but it is wicked slow (not surprising).
I tried iterating using an ArrayList of strings of the file structure and .contains() but it seems to be even slower (not too surprising).
Optimally I would like to use a data structure that would let me choose a range on multiple dimensions so that I can call something like.
Tiles.getWhere(Zoom Level,min X,max X,min Y,maxY);
I assume that some sort of Collection or TreeMap would be the right choice but I'm not experienced enough with Java to know for sure and I'd prefer not to have to benchmark a lot of different approaches.
I could use SQLite to do it but that seems like overkill.
My Question:
What is the most efficient way to check for the existence of datasets given multiple dimensional constraints?

Maybe you are looking for a map with multiple keys.
Commons Collections provides a map with multiple lookup keys:
http://commons.apache.org/collections/apidocs/org/apache/commons/collections/map/MultiKeyMap.html
A hash-based map gives O(1) insertion and lookup on average.
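The same idea can be tried with only the JDK. The sketch below is not MultiKeyMap; it is a plain `HashSet` keyed by the "zoom/x/y" path from the question, with a hypothetical `TileIndex` class and a `getWhere` method named after the call the question wishes for. It replaces the slow `ArrayList.contains()` scan with a hash lookup per candidate tile.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hash-based tile lookup keyed by "zoom/x/y" (illustrative sketch). */
class TileIndex {
    private final Set<String> paths = new HashSet<>();

    void add(int zoom, int x, int y) {
        paths.add(path(zoom, x, y));
    }

    /** Returns the paths of all stored tiles inside the inclusive x and y ranges. */
    List<String> getWhere(int zoom, int minX, int maxX, int minY, int maxY) {
        List<String> found = new ArrayList<>();
        for (int x = minX; x <= maxX; x++) {
            for (int y = minY; y <= maxY; y++) {
                String p = path(zoom, x, y);
                if (paths.contains(p)) found.add(p);  // average O(1) per lookup
            }
        }
        return found;
    }

    private static String path(int zoom, int x, int y) {
        return zoom + "/" + x + "/" + y;
    }
}
```

The cost of a query is proportional to the size of the requested range rather than to the number of tiles on disk, which suits the case where only the tiles in the current viewport are needed.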

Thinking about your problem, I can see three directions you could explore next (this is not a step-by-step guide, but rather some out-of-the-box ideas for getting unstuck):
1) Use Java's built-in structures. A list is indeed the worst choice for searching. A Map is far better suited: a lookup by key in a Map takes significantly less time than scanning a List. You can picture your tiles as a cube: with a List you have to visit about half of the points inside it, whereas a keyed lookup in a Map touches only a narrow layer. That is an order-of-magnitude difference. So my answer here is that Map is the keyword pointing in the right direction (assuming you still want to do it this way after reading the rest of my answer).
2) Use a map server. This is probably a long way from your approach, but entire frameworks exist to solve this type of problem. One example is GeoServer, which offers a ready-made, stable solution for the larger problem you may be facing: showing a map to a user from a source.
3) Stick with the GDAL framework you are using and pick a slightly different script, such as gdal_proximity.py, which gives you a search capability. This particular one searches by a center point and a distance, but it will do what you need =)
That is how I would start. Does this help?

Sounds to me like you are looking for something like an Interval Tree.
http://en.wikipedia.org/wiki/Interval_tree
I have implemented one of these in the past but only in one dimension. The Wikipedia reference mentions extensions to more dimensions.
Paul
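Since the tiles are points rather than intervals, a simpler sorted-map structure may already be enough. The sketch below is not an interval tree; it nests the JDK's `TreeMap` and `TreeSet` (zoom, then x, then y) and uses their range views to answer the query. The class name `SortedTileIndex` and the `{x, y}` return format are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Range lookup over tiles using nested sorted collections (illustrative sketch). */
class SortedTileIndex {
    // zoom -> x -> sorted set of y values
    private final Map<Integer, TreeMap<Integer, TreeSet<Integer>>> byZoom = new HashMap<>();

    void add(int zoom, int x, int y) {
        byZoom.computeIfAbsent(zoom, z -> new TreeMap<>())
              .computeIfAbsent(x, k -> new TreeSet<>())
              .add(y);
    }

    /**
     * Returns {x, y} pairs of stored tiles inside the inclusive ranges.
     * Callers must pass minX <= maxX and minY <= maxY, otherwise the
     * range views throw IllegalArgumentException.
     */
    List<int[]> getWhere(int zoom, int minX, int maxX, int minY, int maxY) {
        List<int[]> found = new ArrayList<>();
        TreeMap<Integer, TreeSet<Integer>> columns = byZoom.get(zoom);
        if (columns == null) return found;
        for (Map.Entry<Integer, TreeSet<Integer>> col
                : columns.subMap(minX, true, maxX, true).entrySet()) {
            for (int y : col.getValue().subSet(minY, true, maxY, true)) {
                found.add(new int[] {col.getKey(), y});
            }
        }
        return found;
    }
}
```

Unlike a per-tile hash lookup, this only visits x columns and y values that actually exist, which may help when the requested range is large but sparsely populated.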

How to implement a cluster graph in java

I want to draw a cluster graph like this. Is there a library for this? How should I build the data structure to hold the input data? For example, a dictionary with each node as the key and an array of the nodes it connects to as the value? Is there a more precise term for this?

Try a library like JUNG.
JUNG is a framework for displaying and working with all kinds of graphs and networks in Java. It supports transitions, collapsing, complex layouts, …
About the data structure: it is complicated, and depends on the type of cluster (bidirectional or unidirectional). In the latter case, you shouldn't use a Dictionary, or connections would be stored twice.
Look at JUNG, for example. I think its data is Serializable.
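The dictionary described in the question is usually called an adjacency list. A minimal sketch of it in plain Java follows; the class `UndirectedGraph` and its methods are illustrative and are not taken from JUNG or JGraphT. It files each connection under both endpoints, which is the double storage mentioned above, in exchange for a fast neighbour lookup from either side. A directed graph would add only one direction.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Adjacency-list graph: each node maps to the set of nodes it connects to (illustrative sketch). */
class UndirectedGraph {
    private final Map<String, Set<String>> adjacency = new HashMap<>();

    /** Adds an undirected edge by recording it under both endpoints. */
    void addEdge(String a, String b) {
        adjacency.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        adjacency.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    /** Returns the neighbours of a node, or an empty set for an unknown node. */
    Set<String> neighbours(String node) {
        return adjacency.getOrDefault(node, Collections.emptySet());
    }
}
```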

Take a look at JGraphT: it provides the data structures and you can then render that using JGraph.
