Finding N nodes in a graph with maximum spread / distance from each other - java

Given a graph with N nodes (thousands), I need to find K nodes such that the average path length over all pairs (K1, K2) of the K chosen nodes is maximized. So basically, I want to place them as far away as possible from each other.
Which algorithm would I use for this / how could I program it without having to try every single combination of K nodes?
Also as an extension: if I now have N nodes and I need to place 2 groups of nodes K and L in the graph such that the average path length between each pair (L,K) is maximized, how would I do this?
My current attempt is to just do a couple of random placements and then calculate the average path length between the pairs of both K and L, but this calculation is starting to take a lot of time, so I'd rather not spend it all on evaluating randomly chosen combinations. I'd rather spend time once on getting the REAL most spread-out combination.
Are there any algorithms out there for this?

The bad news is that this problem is NP-hard, by a reduction from Independent Set. (Assume unit weights. Add a new vertex connected to all other vertices; then every pair of nodes is at distance 1 if adjacent and 2 otherwise, so the graph has an independent set of size K if and only if there are K nodes whose average pairwise distance is 2.)
If you're determined to get an exact solution (and I'm not sure that you shouldn't be), then I'd try branch and bound. Branch on the decision "node is / is not one of the K", and use a crude bound: given a partial subset of the K, find the two nodes that maximize the appropriate combination of the distance between them and the distance to the subset, then set the bound to the appropriate weighted average incorporating the known inter-node distances.
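For concreteness, here is a heavily simplified sketch of that idea. Note it does not implement the bound described above: it uses a much cruder optimistic bound (each unformed pair contributes at most the largest distance in the graph), and it assumes an all-pairs distance matrix dist has been precomputed, e.g. by BFS from every node.

import java.util.*;

/** Sketch: branch and bound for choosing k nodes maximizing the sum of
 *  pairwise distances (for fixed k this also maximizes the average). */
public class MaxSpreadBnB {
    static int n, k, maxDist, bestSum = -1;
    static int[][] dist;                    // precomputed all-pairs distances
    static int[] best;

    static void search(int next, int[] chosen, int count, int sum) {
        if (count == k) {
            if (sum > bestSum) { bestSum = sum; best = Arrays.copyOf(chosen, k); }
            return;
        }
        if (n - next < k - count) return;   // not enough nodes left to fill up
        // Crude bound: every pair not yet formed contributes at most maxDist.
        int pairsLeft = k * (k - 1) / 2 - count * (count - 1) / 2;
        if (sum + pairsLeft * maxDist <= bestSum) return;   // prune this branch
        // Branch 1: node `next` IS one of the k.
        int added = 0;
        for (int i = 0; i < count; i++) added += dist[chosen[i]][next];
        chosen[count] = next;
        search(next + 1, chosen, count + 1, sum + added);
        // Branch 2: node `next` is NOT one of the k.
        search(next + 1, chosen, count, sum);
    }

    static int[] solve(int[][] allPairsDist, int kk) {
        dist = allPairsDist; n = dist.length; k = kk;
        for (int[] row : dist) for (int d : row) maxDist = Math.max(maxDist, d);
        search(0, new int[k], 0, 0);
        return best;
    }
}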
If the exact algorithm above chokes on thousand-node graphs as Evgeny fears it will, then use farthest-point clustering (the link goes to the Wikipedia page on facility location, which describes FPC) to cut the graph down to a manageable size, incurring a hopefully small approximation error.
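If you take that route, here is a minimal sketch of the greedy farthest-point heuristic (pick an arbitrary start node, then repeatedly add the node farthest from everything chosen so far; assumes an unweighted, connected graph, so BFS gives shortest paths):

import java.util.*;

public class FarthestPoints {
    /** Sketch: returns k spread-out nodes; adj.get(v) lists the neighbors of v. */
    static List<Integer> pick(List<List<Integer>> adj, int k) {
        int n = adj.size();
        int[] minDist = new int[n];         // distance to the nearest chosen node
        Arrays.fill(minDist, Integer.MAX_VALUE);
        List<Integer> chosen = new ArrayList<>();
        int next = 0;                       // arbitrary starting node
        while (chosen.size() < k) {
            chosen.add(next);
            // BFS from the newly chosen node and tighten minDist.
            int[] d = new int[n];
            Arrays.fill(d, -1);
            Deque<Integer> queue = new ArrayDeque<>();
            d[next] = 0;
            queue.add(next);
            while (!queue.isEmpty()) {
                int u = queue.poll();
                for (int w : adj.get(u))
                    if (d[w] == -1) { d[w] = d[u] + 1; queue.add(w); }
            }
            for (int v = 0; v < n; v++)
                if (d[v] >= 0) minDist[v] = Math.min(minDist[v], d[v]);
            // Next pick: the node farthest from all chosen nodes so far.
            next = 0;
            for (int v = 0; v < n; v++)
                if (minDist[v] > minDist[next]) next = v;
        }
        return chosen;
    }
}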

Related

Does the total distance sum in K-means have to be always decreasing?

I'm working on k-means clustering in Java. I don't see a problem in my code and it looks fine. However, there is something I don't understand.
Step 1:
Choose N centers (so there will be N clusters).
Step 2:
Put each vector into the cluster with the nearest center, using Euclidean distance (||v1 - v2||).
Step 3:
Find the new mean (= center) of each cluster.
Step 4:
If the centers have moved significantly, go to step 2.
However, when I plot the total of the point-to-respective-center distances after each iteration, I can see that the total is not always decreasing (although it decreases in general and converges well).
The total distance of the 2nd iteration is always shorter than the 1st, and is the shortest overall; the total then increases slightly at the 3rd iteration and converges at the 4th or 5th iteration.
I was told it should always be decreasing. What's wrong? My algorithm (implementation) or my assumption about the total distance?
It must always be decreasing for the same seed.
Maybe your error is that you use Euclidean distance.
K-means does not minimize Euclidean distances.
This is a common misconception that even half of the professors get wrong. K-means minimizes the sum-of-squares, i.e., the sum of squared Euclidean distances. And no, this does not find the solution with the smallest Euclidean distances.
So make sure you are plotting SSQ everywhere. Remove all square roots from your code. They do not belong in k-means.
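In code, the quantity to plot after each iteration looks like this (a sketch; points and centers are assumed to be double[][] arrays and assign[i] the cluster index of point i):

/** Sketch: the k-means objective is the sum of SQUARED distances (SSQ). */
static double ssq(double[][] points, double[][] centers, int[] assign) {
    double total = 0;
    for (int i = 0; i < points.length; i++) {
        double[] p = points[i], c = centers[assign[i]];
        for (int d = 0; d < p.length; d++) {
            double diff = p[d] - c[d];
            total += diff * diff;       // squared; no Math.sqrt anywhere
        }
    }
    return total;
}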
Additional comments:
Don't minimize variances (or equivalently, standard deviations), as tempting as it might be:
Minimizing sum of squared distances is not equivalent to minimizing variances, but that hasn't stopped people from suggesting it as the proper objective for k-means.
It is easy to imagine why this could be a bad idea:
Imagine a single point that is almost mid-way (Euclidean) between two cluster centroids, both with the same variance before including the new point. Now imagine one of the clusters has a much larger membership of points than the other cluster. Let's say the new point is slightly closer to the one with the much larger membership. Adding the new point to the larger cluster, though correct because it is closer to that centroid, won't decrease its variance nearly as much as adding the new point to the other cluster with the much smaller membership.
If you are minimizing the proper objective function, but it still isn't decreasing monotonically, check that you aren't quantizing your centroid means:
This would happen, for example, if you are performing image segmentation with integer values that range in [0, 255] rather than float values in [0, 1], and you are forcing the centroid means to be uint8 datatypes.
Whenever the centroid means are found, they should then be used in the objective function as-is. If your algorithm is finding one value for centroid means (floats), but is then minimizing the objective with other values (byte ints), this could lead to unacceptable variations from the supposed monotonically decreasing objective.
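A minimal illustration of that pitfall (hypothetical uint8-style image setup): compute the mean as a double and feed that same double into the objective, never a rounded copy.

// Sketch: centroid mean over pixel values in [0, 255].
static double centroidMean(int[] pixels) {
    long sum = 0;
    for (int p : pixels) sum += p;
    return (double) sum / pixels.length;    // keep the full-precision double...
}
// ...and do NOT quantize it before evaluating the objective:
// int quantized = (int) centroidMean(pixels);   // this breaks monotonicity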

Minimum Hitting Set algorithm using DFS in polynomial time

I need to write an algorithm that finds a minimum hitting set F of a given undirected graph, i.e. a set containing the minimum-weight edge of each cycle, such that the intersection of F with any cycle is non-empty. I have written an algorithm that uses depth-first search to find all the possible cycles in the graph, and then takes the minimum edge of each cycle and puts it in a set (from which I remove duplicates).
However, I was asked to complete this task in polynomial time, which I am not sure my algorithm does. For instance, I added a counter while solving the following graph starting at A, and my DFS method gets called 34 times:
Could anyone help me figure out the running time of the algorithm I wrote? It is functional, but it seems very inefficient. Thanks.
Here is the code from my DFS method. MHS is a basic data structure that acts like a node: it has a tag and a list of links, each holding an endpoint (the other node) and an associated integer value. cycles is simply an ArrayList containing all the cycles, which are themselves represented as ArrayLists of edge values.
I used the approach described in this post https://stackoverflow.com/a/549312/1354784.
public static void DFS(MHS v) {
    ++counter;
    if (v.visited) {
        // Reached a node already on the current path: walk the predecessor
        // chain back around to v, collecting the edge values of the cycle.
        MHS current = v.predecessor;
        ArrayList<Integer> currEdges = new ArrayList<Integer>();
        currEdges.add(getEdgeValue(current, v));
        while (!current.equals(v)) {
            MHS p = current.predecessor;
            if (p == null)
                break;
            currEdges.add(getEdgeValue(p, current));
            current = p;
        }
        if (currEdges.size() > 0)
            cycles.add(currEdges);
    } else {
        v.visited = true;
        for (int i = 0; i < v.links.size(); i++) {
            MHS w = v.links.get(i).next;
            // Don't walk straight back along the edge we came from.
            if (v.predecessor == null || !v.predecessor.equals(w)) {
                if (w.visited) {
                    // w is already on the current path: recursing records the cycle.
                    MHS ok = w.predecessor;
                    w.predecessor = v;
                    DFS(w);
                    w.predecessor = ok;    // restore after recording the cycle
                } else {
                    w.predecessor = v;
                    DFS(w);
                }
            }
        }
        v.visited = false;   // unmark on the way back so other paths can reuse v
    }
}
Do you really start by finding every cycle in the graph? If you have the complete graph Kn, in which every pair of nodes is connected, then every ordered subset of three or more nodes defines a cycle, so there are exponentially many cycles.
You could try something like

while (there are any cycles left)
    find a cycle
    find the shortest edge in that cycle
    remove that edge from the graph

which should be polynomial time, because every trip round the while loop removes an edge from the graph, and sooner or later you will run out of edges.
But I am not sure this answers your question. I note also that the standard hitting set problem is NP-complete, via set cover - https://en.wikipedia.org/wiki/Set_cover_problem#Hitting_set_formulation
As it turns out, it isn't necessary to consider all cycles in the graph. We can simply adapt the cycle property of minimum spanning trees, which says that the most costly edge in a cycle will not be included in the MST. The same property, applied instead to a maximum spanning tree, tells us that the least costly edge of a cycle will not be included in the maximum ST.
In order to solve the problem, we only need to find the maximum ST of G (by adapting Kruskal's algorithm) and return all the edges of G that were not added to the tree, as sketched below.
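A sketch of that approach (sort edges by descending weight, run standard union-find Kruskal; the edges it rejects are exactly the set to return):

import java.util.*;

public class MinHittingSet {
    record Edge(int u, int v, int weight) {}

    static int[] parent;
    static int find(int x) {                 // union-find with path halving
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }

    /** Sketch: the edges NOT taken into the maximum spanning tree; each one
     *  closes a cycle whose other edges are at least as heavy, so it is the
     *  cheapest edge of that cycle. */
    static List<Edge> hittingSet(int n, List<Edge> edges) {
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        List<Edge> sorted = new ArrayList<>(edges);
        sorted.sort((a, b) -> Integer.compare(b.weight(), a.weight())); // descending
        List<Edge> rejected = new ArrayList<>();
        for (Edge e : sorted) {
            int ru = find(e.u()), rv = find(e.v());
            if (ru == rv) rejected.add(e);   // would close a cycle in the max ST
            else parent[ru] = rv;            // tree edge of the max ST
        }
        return rejected;
    }
}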

Simplest algorithm to find 4-cycles in an undirected graph

I have an input text file containing a line for each edge of a simple undirected graph. The file contains reciprocal edges, i.e. if there's a line u,v, then there's also the line v,u.
I need an algorithm that just counts the number of 4-cycles in this graph. I don't need it to be optimal, because I only have to use it as a term of comparison. If you can suggest a Java implementation, I would appreciate it for the rest of my life.
Thank you in advance.
Construct the adjacency matrix M, where M[i,j] is 1 if there's an edge between i and j. M² is then a matrix which counts the numbers of paths of length 2 between each pair of vertices.
The number of 4-cycles is sum_{i<j}(M²[i,j]*(M²[i,j]-1)/2)/2. This is because if there's n paths of length 2 between a pair of points, the graph has n choose 2 (that is n*(n-1)/2) 4-cycles. We sum only the top half of the matrix to avoid double counting and degenerate paths like a-b-a-b-a. We still count each 4-cycle twice (once per pair of opposite points on the cycle), so we divide the overall total by another factor of 2.
If you use a matrix library, this can be implemented in very few lines of code.
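For example, a sketch with plain arrays instead of a matrix library (assumes n is small enough for an O(n³) multiply):

/** Sketch: count 4-cycles; adj[i][j] == 1 iff there is an edge between i and j. */
static long countFourCycles(int[][] adj) {
    int n = adj.length;
    long[][] m2 = new long[n][n];           // M^2: number of length-2 paths
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            if (adj[i][k] == 1)
                for (int j = 0; j < n; j++)
                    m2[i][j] += adj[k][j];
    long total = 0;
    for (int i = 0; i < n; i++)             // upper triangle only: i < j
        for (int j = i + 1; j < n; j++)
            total += m2[i][j] * (m2[i][j] - 1) / 2;
    return total / 2;                       // each 4-cycle counted twice
}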
Detecting a cycle is one thing but counting all of the 4-cycles is another. I think what you want is a variant of breadth first search (BFS) rather than DFS as has been suggested. I'll not go deeply into the implementation details, but note the important points.
1) A path is a concatenation of edges in which consecutive edges share a vertex.
2) A 4-cycle is a 4-edge path whose start and end vertices are the same.
So I'd approach it this way.
Read in graph G and maintain it using Java objects Vertex and Edge. Every Vertex object will have an ArrayList of all of the Edges that are connected to that Vertex.
The object Path will contain all of the vertexes in the path in order.
PathList will contain all of the paths.
Initialize PathList to all of the 1-edge paths, which are exactly all of the edges in G. BTW, this list will contain all of the 1-cycles (vertices connected to themselves, if the graph has self-loops) as well as all other paths.
Create a function that will (pseudocode, infer the meaning from the function names)

PathList iterate(PathList currentPathList)
{
    newPathList = new PathList();
    for (path in currentPathList.getPaths())
    {
        for (edge in path.lastVertex().getEdges())
        {
            newPathList.addPath(Path.newPathFromPathAndEdge(path, edge));
        }
    }
    return newPathList;
}
Replace currentPathList with iterate(currentPathList) once and you will have all of the 2-edge paths; call it twice and you will have all of the 3-edge paths; call it three times and you will have all of the 4-edge paths.
Search through all of the paths and find the 4-cycles by checking
path.firstVertex().isEqualTo(path.lastVertex())
Depth-first search (DFS) is what you need.
Construct an adjacency matrix, as prescribed by Anonymous on Jan 18th, and then find all the cycles of size 4.
It is an enumeration problem. If we know that the graph is a complete graph, then we know of a generating function for the number of cycles of any length. But for most other graphs, you have to find all the cycles to count them exactly.
Depth first search with backtracking should be the ideal strategy. Implement it with each node as the starting node, one by one. Keep track of visited nodes. If you run out of nodes without finding a cycle of size 4, just backtrack and try a different route.
Backtracking is not ideal for larger graphs. For example, even a complete graph of order 11 is a little too much for backtracking algorithms. For larger graphs you can look for a randomized algorithm.

Finding the single nearest neighbor using a Prefix tree in O(1)?

I'm reading a paper where they mention that they were able to find the single nearest neighbor in O(1) using a prefix tree. I will describe the general problem, then the classical solution, and finally the solution proposed in the paper:
Problem: given a list L of bit vectors (all of the same length) and a query bit vector q, we would like to find the nearest neighbor of q. The distance metric is the Hamming distance (how many bits differ). The naive approach is to go through the list and calculate the Hamming distance between each vector in the list and q, which takes O(N). However, given that we will have millions of bit vectors, that is very expensive, so we would like to reduce it.
Classical solution:
The classical solution is to find an approximate nearest neighbor, achieving O(log N). The way to do this is to first sort L lexicographically, so that similar bit vectors end up close to each other. Given q, we then binary-search the sorted list for the position where q would be, take the vectors just above and below it (which are similar to q because of the sorting), calculate the distance to each, and pick the one with the lowest Hamming distance.

However, with a single sorted list we would still miss many similar vectors. To cover as many similar vectors as possible, we use P lists and P jumbling functions, one jumbling function per list. We insert each bit vector into each of the P lists after jumbling its bits with the corresponding function, so we end up with P lists that each hold all the vectors, but with the bits in different orders, and we again sort each list lexicographically. Given q, we apply the same binary search to each list, except that we first jumble q with the function belonging to the list we are accessing. This step yields P candidate vectors, from which we finally pick the one most similar to q. Ignoring the time required for sorting, the time needed to locate the nearest neighbor is O(log N), the cost of the binary search on each list.
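To make the scheme concrete, here is a minimal sketch of a single jumbled, sorted list. It assumes the vectors fit in 63 bits (so signed long order coincides with lexicographic order), uses a simple bit permutation as the jumbling function, and a window parameter controlling how many sorted neighbors are examined on each side; all names are illustrative.

import java.util.*;

/** Sketch: approximate Hamming nearest neighbor via one sorted, jumbled list.
 *  In the full scheme there are P of these, each with its own permutation. */
public class JumbledList {
    private final long[] sorted;
    private final int[] perm;          // bit permutation ("jumbling function")

    JumbledList(long[] vectors, int[] perm) {
        this.perm = perm;
        sorted = new long[vectors.length];
        for (int i = 0; i < vectors.length; i++) sorted[i] = permute(vectors[i]);
        Arrays.sort(sorted);           // numeric order == lexicographic for 63-bit values
    }

    long permute(long v) {             // move bit i to position perm[i]
        long out = 0;
        for (int i = 0; i < perm.length; i++)
            if ((v >>> i & 1) != 0) out |= 1L << perm[i];
        return out;
    }

    /** Best candidate near q among `window` entries on either side of its slot.
     *  Hamming distance is permutation-invariant, so comparing in jumbled
     *  bit order is fine; the result is returned still in jumbled order. */
    long nearest(long q, int window) {
        long jq = permute(q);
        int pos = Arrays.binarySearch(sorted, jq);
        if (pos < 0) pos = -pos - 1;   // insertion point
        long best = sorted[Math.max(0, Math.min(pos, sorted.length - 1))];
        for (int i = Math.max(0, pos - window);
                 i < Math.min(sorted.length, pos + window); i++)
            if (Long.bitCount(sorted[i] ^ jq) < Long.bitCount(best ^ jq))
                best = sorted[i];
        return best;
    }
}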
Proposed solution:
This solution, as proposed in the paper (but without any explanation), says that we can get the nearest neighbor in O(1) time using prefix trees. The authors use P prefix trees and P jumbling functions, where each jumbling function corresponds to one tree. They insert the bit vectors into each tree after jumbling the bits of each vector with the corresponding function. Given q, we apply the jumbling function corresponding to each tree to q and retrieve the most similar vector from each tree, ending up with P candidate bit vectors. In the paper they say that just getting the most similar vector to q from a prefix tree is O(1). I really don't understand this at all, as I know searching a prefix tree is O(M), where M is the length of the bit vector. Does anybody understand why it is O(1)?
This is the paper I'm referring to (Section 3.3.2): Content-Based Crowd Retrieval on the Real-Time Web
http://students.cse.tamu.edu/kykamath/papers/cikm2012/fp105-kamath.pdf
I also wish if you can answer my other question related to this one: How to lookup the most similar bit vector in an prefix tree for NN-search?
I think the argument in the paper is that if it were O(f(x)), then x would have to be the number of items stored in the tree, not the number of dimensions. As you point out, for a prefix tree the time grows as O(M), where M is the length of the bit vector, but if you treat M as fixed and what you are interested in is the behaviour as the number of items in the tree increases, you have O(1).
By the way, the paper "Fast approximate nearest neighbors with automatic algorithm configuration" by Muja and Lowe also considers tree-based competitors to LSH. The idea here appears to be to randomise the tree construction, create multiple trees, and do a quick but sketchy search of each tree, picking the best answer found in any tree.
This is O(1) only in a very loosely defined sense. In fact I would go so far as to challenge their usage in this case.
From their paper, to determine the nearest neighbor to a user u:
"We first calculate its signature, u": can be O(1), depending on the "signature".
"Then for every prefix tree in P" : Uh oh, not sounding very O(1), O(P) would be more correct.
The iterative part of step 2, "... we find the nearest signature in the prefix tree, by iterating through the tree one level at a time...": best case O(d), where d is the depth of the tree, i.e. the length of the word (and this is generous, as finding the nearest point in a prefix tree can cost more than this).
"After doing this... we end up with |P| signatures... of which the smallest hamming distance is picked" : so another P calculations times the length of the word. O(Pd).
More correctly the total runtime is O(1) + O(P)+ O(Pd) + O(Pd) = O(Pd)
I believe that @mcdowella is correct in his analysis of how they try to make this O(1), but from what I've read they haven't convinced me.
I assume they keep a reference to each element's node in the tree and can navigate to the next or previous entry in O(1) amortized time, i.e. the trick is to have access to the underlying nodes.

Question about k-Connected Graphs

Given an undirected graph G, is there any standard algorithm to find the value of k such that the removal of any (k-1) vertices leaves the graph connected, but there exist k vertices whose removal disconnects it?
Thank you!
I don't know of any standard algorithm, but for a graph to have this property, every pair of vertices must have >= k independent paths between them (it's a simple proof by contradiction to see that this is the case; this is essentially Menger's theorem).
So a potential algorithm would be to check that for all pairs of vertices in your graph there are at least k independent paths. To find this you can use a maximum flow algorithm. Unfortunately, doing this naively will probably take a long time: Ford-Fulkerson network flow takes O(EV) time (on the graph you would use for this), and there are O(V^2) pairs of nodes to check, so the worst-case time is approximately O(V^5).
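For reference, here is a sketch of the per-pair computation (the standard vertex-splitting construction with plain BFS augmenting paths; the connectivity k is then the minimum of this count over the pairs checked):

import java.util.*;

/** Sketch: number of internally vertex-disjoint s-t paths via max flow.
 *  adj is the undirected adjacency matrix; s and t are assumed non-adjacent. */
public class VertexConnectivity {
    static int disjointPaths(boolean[][] adj, int s, int t) {
        int n = adj.length, INF = n;        // "infinite": never the bottleneck
        // Split v into v_in = v and v_out = v + n, joined by a capacity-1 arc,
        // so each vertex can be used by at most one path.
        int[][] cap = new int[2 * n][2 * n];
        for (int v = 0; v < n; v++) cap[v][v + n] = 1;
        for (int u = 0; u < n; u++)
            for (int v = 0; v < n; v++)
                if (adj[u][v]) cap[u + n][v] = INF;   // arc u_out -> v_in
        int flow = 0;
        while (true) {                      // BFS augmenting path from s_out to t_in
            int[] pred = new int[2 * n];
            Arrays.fill(pred, -1);
            Deque<Integer> queue = new ArrayDeque<>();
            pred[s + n] = s + n;
            queue.add(s + n);
            while (!queue.isEmpty() && pred[t] == -1) {
                int u = queue.poll();
                for (int v = 0; v < 2 * n; v++)
                    if (pred[v] == -1 && cap[u][v] > 0) { pred[v] = u; queue.add(v); }
            }
            if (pred[t] == -1) return flow; // no augmenting path left
            for (int v = t; v != s + n; v = pred[v]) {
                cap[pred[v]][v] -= 1;       // every arc on the path has capacity >= 1,
                cap[v][pred[v]] += 1;       // so augmenting by 1 is always valid
            }
            flow++;
        }
    }
}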
