I'm working on an algorithm whose goal is to find the minimum set of packages that must be installed in order to install package "X".
I'll explain better with an example:
X depends on A and (E or C)
A depends on E and (H or Y)
E depends on B and (Z or Y)
C depends on (A or K)
H depends on nothing
Y depends on nothing
Z depends on nothing
K depends on nothing
The solution is to install: A E B Y.
Here is an image to describe the example:
Is there an algorithm to solve the problem without using a brute-force approach?
I've already read a lot about algorithms such as DFS, BFS, Dijkstra, etc...
The problem is that these algorithms are unable to handle the "OR" condition.
UPDATE
I don't want to use external libraries.
The algorithm doesn't have to handle circular dependencies.
UPDATE
One possible solution could be to calculate all the possible paths of each vertex and then, for each vertex in a possible path, do the same.
So, the possible path for X would be (A E),(A C). Now, for each element in those two possible paths we can do the same: A = (E H),(E Y) / E = (B Z),(B Y), and so on...
At the end we can combine the possible paths of each vertex in a SET and choose the one with minimum length.
What do you think?
Unfortunately, there is little hope of finding an algorithm much better than brute force, considering that the problem is actually NP-hard (but, as an optimization problem, not even NP-complete).
A proof of the NP-hardness of this problem is that the minimum vertex cover problem (well known to be NP-hard and not NP-complete) is easily reducible to it:
Given a graph, let's create a package Pv for each vertex v of the graph. Also create a package X that "and"-requires (Pu or Pv) for each edge (u, v) of the graph. Find a minimum set of packages to be installed in order to satisfy X. Then v is in the minimum vertex cover of the graph iff the corresponding package Pv is in the installation set.
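To make the reduction concrete, here is a small Python sketch (the function name and output format are mine; the format matches the deps dictionary used in the pulp answer further down):

def vertex_cover_to_packages(edges):
    # Build a package-problem instance from a graph given as an edge list:
    # one package Pv per vertex, plus X, which and-requires (Pu or Pv)
    # for each edge (u, v).
    deps = {"X": []}
    for (u, v) in edges:
        deps.setdefault("P%d" % u, [])
        deps.setdefault("P%d" % v, [])
        deps["X"].append(("P%d" % u, "P%d" % v))
    return deps

# Example: the path graph 1-2-3 has minimum vertex cover {2}; the minimum
# installation set of the produced instance is correspondingly {X, P2}.
print(vertex_cover_to_packages([(1, 2), (2, 3)]))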
"I dint get the problem with "or" (the image is not loading for me).
Here is my reasoning . Say we take standard shortest route algo like Dijkstras and then use equated weightage to find the best path .
Taking your example
Select the best Xr from the 2 options below:
Xr = X + Ar + Er
Xr = X + Ar + Cr
where Ar is the best option from the tree A=H (and subsequent children) or A=Y (and subsequent children).
The idea is to first assign a standard weight to each OR option (the AND options are not a problem).
Later, for each OR option, we repeat the process with its child nodes until we reach no more OR options.
However, we first need to define what the best choice means; assume that the least number of dependencies, i.e. the shortest path, is the criterion.
Then by the above logic we assign a weight of 1 to X. From there onwards:
X = 1
X = A and (E or C), hence X = A1 + E1 or X = A1 + C1
A = H or Y; assuming H and Y are leaf nodes, A gets a final weight of 1
hence X = 1 + E1 or X = 1 + C1
Now for E and C:
E1 = B1 + Z1 or B1 + Y1. C1 = A1 or C1 = K1.
Assuming B1, Z1, Y1, A1 and K1 are leaf nodes:
E1 = 1 + 1 or 1 + 1. C1 = 1 or C1 = 1
i.e. E = 2 and C = 1
Hence
X = 1 + 2 or X = 1 + 1, so please choose X => C as the best route.
Hope this clears it up.
Also, we need to take care of cyclical dependencies X => Y => Z => X; here we may assign such nodes a weight of zero at the parent or leaf node level and take care of the dependency.
I actually think graphs are the appropriate structure for this problem. Note that
A and (E or C) <==> (A and E) or (A and C). Thus, we can represent X = A and (E or C) with the following set of directed edges:
A <- K1
E <- K1
A <- K2
C <- K2
K1 <- X
K2 <- X
Essentially, we're just decomposing the logic of the statement and using "dummy" nodes to represent the ANDs.
Suppose we decompose all the logical statements in this fashion (dummy Ki nodes for ANDS and directed edges otherwise). Then, we can represent the input as a DAG and recursively traverse the DAG. I think the following recursive algorithm could solve the problem:
Definitions:
Node u - Current Node.
S - The visited set of nodes.
children(x) - Returns the out neighbors of x.
Algorithm:
shortestPath u S =
    if (u has no children) {
        add u to S
        return 1
    } else if (u is a dummy node) {
        (a,b) = children(u)
        if (a and b are in S) {
            return 0
        } else if (b is in S) {
            x = shortestPath a S
            add a to S
            return x
        } else if (a is in S) {
            y = shortestPath b S
            add b to S
            return y
        } else {
            x = shortestPath a S
            add a to S
            if (b in S) return x
            else {
                y = shortestPath b S
                add b to S
                return x + y
            }
        }
    } else {
        min = Int.Max
        min_node = null
        for (x in children(u)) {
            if (x is not in S) {
                S_1 = S
                k = shortestPath x S_1
                if (k < min) min = k, min_node = x
            } else {
                min = 0
                min_node = x
            }
        }
        return 1 + min
    }
Analysis:
This is an entirely sequential algorithm that (I think) traverses each edge at most once.
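For concreteness, here is a small Python sketch of the decomposition step described at the start of this answer (the dummy-node naming and the AND-list-of-OR-tuples input format are my own):

from itertools import product

def decompose(package, groups):
    # Decompose "package requires an AND of OR-groups" into dummy AND-nodes:
    # X = A and (E or C) becomes X -> {X_K1, X_K2} (pick one), with
    # X_K1 -> {A, E} and X_K2 -> {A, C} (require all).
    edges = []
    for idx, combo in enumerate(product(*groups), 1):
        dummy = "%s_K%d" % (package, idx)
        edges.append((package, dummy))        # OR edge: pick one dummy child
        for dep in dict.fromkeys(combo):      # AND edges, duplicates dropped
            edges.append((dummy, dep))
    return edges

for parent, child in decompose("X", [("A",), ("E", "C")]):
    print(child, "<-", parent)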
A lot of the answers here focus on how this is a theoretically hard problem due to its NP-hard status. While this means you will experience asymptotically poor performance exactly solving the problem (given current solution techniques), you may still be able to solve it quickly (enough) for your particular problem data. For instance, we are able to exactly solve enormous traveling salesman problem instances despite the fact that the problem is theoretically challenging.
In your case, a way to solve the problem would be to formulate it as a mixed integer linear program, where there is a binary variable x_i for each package i. You can convert requirements like A requires (B or C or D) and (E or F) and (G) to constraints of the form x_A <= x_B + x_C + x_D ; x_A <= x_E + x_F ; x_A <= x_G, and you can require that a package P be included in the final solution with x_P = 1. Solving such a model exactly is relatively straightforward; for instance, you can use the pulp package in Python:
import pulp

deps = {"X": [("A",), ("E", "C")],
        "A": [("E",), ("H", "Y")],
        "E": [("B",), ("Z", "Y")],
        "C": [("A", "K")],
        "H": [],
        "B": [],
        "Y": [],
        "Z": [],
        "K": []}
required = ["X"]

# Variables (note: single-package OR groups need a trailing comma to be
# tuples; a bare ("A") is just the string "A")
x = pulp.LpVariable.dicts("x", deps.keys(), lowBound=0, upBound=1, cat=pulp.LpInteger)
mod = pulp.LpProblem("PackageOptimization", pulp.LpMinimize)

# Objective
mod += sum([x[k] for k in deps])

# Dependencies
for k in deps:
    for dep in deps[k]:
        mod += x[k] <= sum([x[d] for d in dep])

# Include required variables
for r in required:
    mod += x[r] == 1

# Solve
mod.solve()
for k in deps:
    print("Package", k, "used:", x[k].value())
This outputs the minimal set of packages:
Package A used: 1.0
Package C used: 0.0
Package B used: 1.0
Package E used: 1.0
Package H used: 0.0
Package Y used: 1.0
Package X used: 1.0
Package K used: 0.0
Package Z used: 0.0
For very large problem instances, this might take too long to solve. You could either accept a potentially sub-optimal solution using a timeout (see here) or you could move from the default open-source solvers to a commercial solver like gurobi or cplex, which will likely be much faster.
To add to Misandrist's answer: your problem is NP-hard (see dened's answer).
Edit: Here is a direct reduction of a Set Cover instance (U,S) to your "package problem" instance: make each point z of the ground set U an AND requirement for X. Make each set in S that covers a point z an OR requirement for z. Then the solution of the package problem gives the minimum set cover.
Equivalently, you can ask which satisfying assignment of a monotone boolean circuit has fewest true variables, see these lecture notes.
Since the graph consists of two different types of edges (AND and OR relationships), we can split the algorithm into two parts: searching for all nodes that are required successors of a node (AND) and searching for all nodes from which we have to select one single node (OR).
Nodes hold a package, a list of nodes that must be successors of this node (AND), a list of lists of nodes that can be successors of this node (OR), and a flag that marks in which step of the algorithm the node was visited.
define node: package p , list required , listlist optional ,
int visited[default=MAX_VALUE]
The main-routine translates the input into a graph and starts traversal at the starting node.
define searchMinimumP:
    input: package start , string[] constraints
    output: list

    //generate a graph from the given constraints
    //and save the node holding start as starting point
    node r = getNode(generateGraph(constraints) , start)

    //list all required nodes
    return requiredNodes(r , 0)
requiredNodes searches for all nodes that are required successors of a node (that are connected to n via AND-relations over one or multiple edges).
define requiredNodes:
    input: node n , int step
    output: list

    //generate a list of all nodes that MUST be part of the solution
    list rNodes
    list todo
    add(todo , n)

    while NOT isEmpty(todo)
        node next = remove(0 , todo)
        if NOT contains(rNodes , next) AND next.visited > step
            add(rNodes , next)
            next.visited = step

    addAll(rNodes , optionalMin(rNodes , step + 1))

    for node r in rNodes
        r.visited = step

    return rNodes
optionalMin searches for the shortest solution among all possible solutions for optional neighbours (OR). This algorithm is brute force (all possible selections of neighbours will be inspected).
define optionalMin:
    input: list nodes , int step
    output: list

    //find all possible combinations for selectable packages
    listlist optSeq
    for node n in nodes
        if NOT n.visited < step
            for list opt in n.optional
                add(optSeq , opt)

    //iterate over all possible combinations of selectable packages
    //for the given list of nodes and find the shortest solution
    list shortest
    int curLen = MAX_VALUE

    //search through all possible solutions (combinations of nodes)
    for list seq in sequences(optSeq)
        list subseq

        for node n in distinct(seq)
            addAll(subseq , requiredNodes(n , step + 1))

        if length(subseq) < curLen
            //mark all nodes of the old solution as unvisited
            for node n in shortest
                n.visited = MAX_VALUE

            curLen = length(subseq)
            shortest = subseq
        else
            //mark all nodes in this possible solution as unvisited
            //since they aren't used in the final solution (not at this place)
            for node n in subseq
                n.visited = MAX_VALUE

    for node n in shortest
        n.visited = step

    return shortest
The basic idea would be the following: Start from the starting node and search for all nodes that must be part of the solution (nodes that can be reached from the starting node by only traversing AND-relationships). Now for all of these nodes, the algorithm searches for the combination of optional nodes (OR) with the fewest nodes required.
NOTE: so far this algorithm isn't much better than brute force. I'll update as soon as I've found a better approach.
My code is here.
Scenario:
Represent the constraints.
X : A&(E|C)
A : E&(Y|N)
E : B&(Z|Y)
C : A|K
Prepare two variables target and result.
Add the node X to target.
target = X, result=[]
Add the single node X to the result.
Replace node X with its dependencies in the target.
target = A&(E|C), result=[X]
Add the single node A to the result.
Replace node A with its dependencies in the target.
target = E&(H|Y)&(E|C), result=[X, A]
The single node E must be true.
So (E|C) is always true.
Remove it from the target.
target = E&(H|Y), result=[X, A]
Add the single node E to the result.
Replace node E with its dependencies in the target.
target = B&(Z|Y)&(H|Y), result=[X, A, E]
Add the single node B to the result.
Replace node B with its dependencies in the target.
target = (Z|Y)&(H|Y), result=[X, A, E, B]
There are no single nodes any more.
Then expand the target expression.
target = Z&H|Z&Y|Y&H|Y&Y, result=[X, A, E, B]
Replace Y&Y with Y.
target = Z&H|Z&Y|Y&H|Y, result=[X, A, E, B]
Choose the term that has the smallest number of nodes.
Add all nodes in that term to the result.
target = , result=[X, A, E, B, Y]
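Here is a rough Python sketch of this substitution-and-expansion idea (the deps format follows the pulp answer above; the expansion is still exponential in the worst case, just organized differently):

from itertools import product

deps = {"X": [("A",), ("E", "C")], "A": [("E",), ("H", "Y")],
        "E": [("B",), ("Z", "Y")], "C": [("A", "K")],
        "H": [], "B": [], "Y": [], "Z": [], "K": []}

def min_install(root):
    # Each term is a candidate installation set; a term is complete when
    # every OR-group of every member is satisfied by some member.
    frontier = {frozenset([root])}
    complete = set()
    while frontier:
        term = frontier.pop()
        open_groups = [g for p in term for g in deps[p]
                       if not any(choice in term for choice in g)]
        if not open_groups:
            complete.add(term)
            continue
        # Branch on every combination of choices for the open OR-groups.
        for picks in product(*open_groups):
            frontier.add(term | set(picks))
    return min(complete, key=len)

print(sorted(min_install("X")))  # ['A', 'B', 'E', 'X', 'Y']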
I would suggest you first transform the graph into an AND-OR tree. Once that is done, you can perform a search in the tree for the best path (where you can choose what "best" means: shortest, lowest memory occupation of the packages in the nodes, etc...).
A suggestion I'd make, given that the condition to install X is something like install(X) = install(A) and (install(E) or install(C)), is to group the OR nodes (in this case E and C) into a single node, say EC, and transform the condition into install(X) = install(A) and install(EC).
Alternatively, based on the AND-OR tree idea, you could create a custom AND-OR graph using the grouping idea. In this way you could use an adaptation of a graph traversal algorithm, which could be more useful in certain scenarios.
Yet another solution could be to use Forward Chaining. You'd have to follow these steps:
Transform (just re-writing the conditions here):
A and (E or C) => X
E and (H or Y) => A
B and (Z or Y) => E
into
(A and E) or (A and C) => X
(E and H) or (E and Y) => A
(B and Z) or (B and Y) => E
Set X as goal.
Insert B, H, K, Y, Z as facts.
Run Forward chaining and stop on the first occurrence of X (the goal). That should be the shortest way to achieve the goal in this case (just remember to keep track of the facts that have been used).
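A minimal Python sketch of that loop (the rules encode the rewritten conditions above; note that the order in which rules fire decides which facts end up being used):

rules = [({"A", "E"}, "X"), ({"A", "C"}, "X"),
         ({"E", "Y"}, "A"), ({"E", "H"}, "A"),
         ({"B", "Y"}, "E"), ({"B", "Z"}, "E")]
facts = {"B", "H", "K", "Y", "Z"}
goal, used = "X", []

derived = True
while goal not in facts and derived:
    derived = False
    for body, head in rules:
        if head not in facts and body <= facts:
            facts.add(head)
            used.append((sorted(body), head))  # keep track of the facts used
            derived = True
            break

print(used)  # [(['B', 'Y'], 'E'), (['E', 'Y'], 'A'), (['A', 'E'], 'X')]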
Let me know if anything is unclear.
This is an example of a Constraint Satisfaction Problem. There are Constraint Solvers for many languages, even some that can run on generic 3SAT engines, and thus be run on GPGPU.
Another (fun) way to solve this issue is to use a genetic algorithm.
Genetic algorithms are powerful, but you have to use a lot of parameters and find the best ones.
The genetic steps are the following:
a. Creation: create a number of random individuals, the first generation (for instance: 100)
b. Mutation: mutate a low percentage of them (for instance: 0.5%)
c. Rating: rate (also called fitness) all the individuals.
d. Reproduction: select (using the rates) pairs of them and create children (for instance: 2 children)
e. Selection: select parents and children to create a new generation (for instance: keep 100 individuals per generation)
f. Loop: go back to step "a" and repeat the whole process a number of times (for instance: 400 generations)
g. Pick: select an individual of the last generation with a maximal rate.
That individual will be your solution.
Here is what you have to decide:
Find a genetic code for your individuals
You have to represent a possible solution (called an individual) to your problem as a genetic code.
In your case, it could be a group of letters representing the nodes which respect the AND and OR constraints.
For instance:
[ A E B Y ], [ A C K H ], [ A E Z B Y ] ...
Find a way to rate individuals
To know whether an individual is a good solution, you have to rate it, in order to compare it to other individuals.
In your case, it could be pretty easy: individual rate = total number of nodes - number of nodes in the individual
For instance:
[ A E B Y ] = 8 - 4 = 4
[ A E Z B Y ] = 8 - 5 = 3
[ A E B Y ] has a better rate than [ A E Z B Y ]
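A tiny Python sketch of that rating function, assuming the 8 installable packages of the example:

ALL_NODES = {"A", "B", "C", "E", "H", "K", "Y", "Z"}

def rate(individual):
    # Higher is better: fewer packages installed.
    return len(ALL_NODES) - len(set(individual))

print(rate(["A", "E", "B", "Y"]))       # 8 - 4 = 4
print(rate(["A", "E", "Z", "B", "Y"]))  # 8 - 5 = 3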
Selection
Thanks to the individuals' rates, we can select pairs of them for reproduction.
For instance by using Genetic Algorithm roulette wheel selection
Reproduction
Take a pair of individuals and create some (for instance 2) children (other individuals) from them.
For instance:
Take a node from the first one and swap it with a node of the second one.
Make some adjustments to fit the "or, and" constraints.
[ A E B Y ], [ A C K H ] => [ A C E H B Y ], [ A E C K B Y ]
Note: this is not a good way to reproduce, because the children are worse than the parents. Maybe we could swap a range of nodes instead.
Mutation
You just have to change the genetic code of the selected individuals.
For instance:
Delete a node
Make some adjustments to fit the "or, and" constraints.
As you can see, it's not hard to implement, but a lot of choices have to be made when designing it for a specific issue and when controlling the different parameters (percentage of mutation, rating system, reproduction system, number of individuals, number of generations, ...).
Related
I have written a bigger program in which I construct a graph Graph(V,E) from words in a data file. Then I parse another file with two words on the same line: "other there" <-- like that; the first string is the start word, the second is the end word, as seen below:
while (true) {
    String line = reader.readLine();
    if (line == null) { break; }
    assert line.length() == 11;
    String start = line.substring(0, 5);
    String goal = line.substring(6, 11);
    System.out.println(
        G.findShortestPath(
            words.indexOf(start),
            words.indexOf(goal)));
}
I am using undirected graphs with BFS. I am doing word-transformation word chains, like this: climb → blimp → limps → pismo → moist → stoic.
The input file contains words of length five, and a path/connection from X to Y in one step exists if, and only if, the four last letters of X are found in Y.
What is known, and what I have calculated:
Time complexity: O(V + E) for building the graph G(V, E) of words. The second part of the program consists of a while loop and a for loop of finding the shortest path (using BFS), which is O(V^2).
Space complexity: O(V) in the worst case. The graph holds all the words. The nodes are made up of a single node class object which contains n neighbor(s).
Process:
Program loads into buffer a file with words.
The program builds the graph.
The program runs test and loads into buffer information from a test file (different file). Then selects start and end node and performs the shortest path search.
If there's a connection, the code returns the shortest path length. If there's no connection between two words or end/goal cannot be reached, we return -1.
Now, I am trying to come up with an O(?) time algorithm or total time complexity in terms of V, E and F, where:
V is the number of vertices
E is the number of edges
F is the number of test cases (number of lines in the test file)*
*Number of test cases in the function public void performTest(String filename) throws IOException; the body of that function is shown above. Now, I know that for n lines there will be the same n number of test cases, so F = O(n). But in what way can one incorporate this calculation into a general O expression, with variables, that holds for whatever number of words are in the list, in the graph, and in the test file?
The main body of the BFS algorithm is the nesting of two loops: the while loop visits each vertex once, so it is O(|V|); and in the for loop nested in the while, each edge is checked only once, when its starting vertex u is dequeued (and each vertex is dequeued at most once), so the total is O(|E|). The time complexity of BFS is O(|V| + |E|).
Contents in Testfile:
The first word becomes the starting word, the second on the same line becomes the end or target word.
blimp moist
limps limps
Some other code I've written previously:
public static void main(String... args) throws IOException {
    ArrayList<String> words = new ArrayList<String>();
    words.add("their");
    words.add("moist");
    words.add("other");
    words.add("blimp");
    Graph g = new Graph(words.size());
    for (String word : words) {
        for (String word2 : words) {
            g.addEdge(words.indexOf(word), words.indexOf(word2));
        }
    }
    BufferedReader readValues = null;
    try {
        readValues =
            new BufferedReader(new InputStreamReader(new FileInputStream("testfile.txt")));
        String line = null;
        while ((line = readValues.readLine()) != null) {
            // assert line.length() == 11;
            String[] tokens = line.split(" ");
            String start = tokens[0];
            String goal = tokens[1];
            BreadthFirstPaths bfs = new BreadthFirstPaths(g, words.indexOf(start));
            if (bfs.hasPathTo(words.indexOf(goal))) {
                System.out.println("Shortest path distance from " + start + " to " + goal + " = " + bfs.distTo(words.indexOf(goal)));
            } else {
                System.out.println("Nothing");
            }
        }
    } finally {
        if (readValues != null) {
            try {
                readValues.close();
            } catch (Throwable t) {
                t.printStackTrace();
            }
        }
    }
}
Notice: Not interested in FASTER solutions.
The straightforward answer:
The direct approach would be to use the Floyd-Warshall algorithm. This algorithm computes shortest paths between all pairs of vertices in a directed graph without negative cycles. Since you are using an undirected graph with positive weights, it is sufficient to replace every undirected edge (u,v) with the directed pair (u,v), (v,u).
The runtime of Floyd-Warshall is O(V^3), and it would compute all the answers you could ever seek at once, given that you can retrieve them in a reasonable time. (Which should be rather easy, since you already have V^3 of breathing room.)
Getting faster:
In your case that most likely isn't optimal. (Not to mention that I don't know how many queries you will make - if only a few, then FW is definitely overkill.) Since your graph doesn't have any negative edges, and it seems from your space complexity that the edge count is only C * |V|, we can go further. Enter Dijkstra.
Dijkstra's algorithm has complexity O(E + V log(V)).
Considering that you most likely have only ~ C * V edges, this would bring the repeated Dijkstra computation costs to F * O(V * log(V)).
And faster:
If you wish to give frying your brain a go, Dijkstra can be improved upon in some special cases by using the dark magic of Fibonacci heaps (which are modified for the purpose of the algorithm to make things more confusing). From what I can see, your case could be special enough that the O(N * sqrt(log(N))) from this article is achievable. Their assumptions are:
n vertices
m edges
the longest arc (the length of an edge, if my google-fu is correct) being bounded by a polynomial function of n.
This is it for my attempt at a quick dive into the shortest path problem. If you wish to research more, I would recommend looking into the all-pairs shortest paths problem in general. There are many other algorithms of similar complexity. Your ideal approach will also depend on your F.
P.S.:
Unless you have many, many words, your neighbor count can still be rather big: 5! * 26 in the worst case, to be precise (four letters are fixed and one is arbitrary: possible permutations * letter count). It can be lower in case of repetitions; still, it isn't small, although it can technically be considered a constant.
It seems to me that you are simply asking about the computational complexity of your existing solution, expressed in terms of the 3 variables V, E and F.
If I am reading your question correctly, it is
O(V + E) // loading
+ O(V^2) done F times // F test cases
which simplifies to:
O(V + E + (F * V^2))
This assumes that your Big-O characterizations of the load and search times are correct. We cannot confirm [1] that those characterizations are correct without seeing the complete Java source code for your solution. (An English or pseudo-code description is too imprecise.)
Note that the above formula is "canonical". It cannot be simplified further unless you eliminate variables; e.g. by treating them as a constant or placing bounds on them.
However, if we can assume that F > 0, we can reduce it to:
O(E + (F * V^2))
since when F > 0 the F*V^2 term will dominate the V term as either F or V tends to infinity. (Intuitively, the F == 0 case corresponds to just loading the graph and running no test cases. You would expect the performance characteristics to be different in that case.)
It is possible that the E variable could be approximated as a function of V. (Each edge from one node to another represents a permutation of a word with one letter changed. If one did some statistical analysis of words in the English language, it may be possible to determine the average number of edges per node as a function of the number of nodes.) If we can prove that this (hypothetical) average_E(V) function is O(V^2) or better (as seems likely!), then we can eliminate the E term from the overall complexity class.
(Of course, that assumes that the input graph contains no duplicate or incorrect edges per the description of the problem you are trying to solve.)
Finally, it seems that the O(V^2) is actually a measure of worst-case performance, depending on the sparseness of the graph. So you need to factor that into your answer ... and the way that you attempt to validate the complexity.
[1] The O(V^2) seems a bit suspicious to me.
Time: O(F * (V + E))
Space: O(V + E)
Following what you described
V: Vertices
E: Edges
F: path queries
The BFS algorithm complexity is O(V + E) time, and O(V) space
Each query is a BFS without any modification, so the time complexity is O(F * (V + E)). The space complexity is the same as for one single BFS, since you are using the same structure, O(V); but we also have to consider the space used to store the graph, O(E).
Graph construction: since you are iterating over all pairs of words and adding one edge for each pair, your graph always has E = V^2. If you had asked for advice on improving your algorithm, as is usual in this community, I would tell you to avoid adding edges that should not be used (those with more than one character of difference).
Find the path from one node to another given node in a tree represented by an adjacency list
EDIT:
The tree is given as an acyclic connected graph of n nodes where nodes are from 1 to n
For example, when n = 5, the tree is given as:
1 4
4 5
3 2
4 2
I should be able to find the path between any of the n nodes using the algorithm.
I can program in C++, Java and Python.
You can do it via either BFS or DFS. It will take O(N) time. However, if you want to query for any 2 of the N nodes, then it can be done online using Heavy-Light Decomposition.
Here is a BFS based implementation.
#include <iostream>
#include <queue>
#include <utility>
#include <vector>
using namespace std;

int main() {
    int n;
    cin >> n;
    vector<vector<int>> graph(n);
    vector<int> visited(n, 0);
    for (int i = 0; i < n - 1; ++i)
    {
        int st, en;
        cin >> st >> en;
        st--;
        en--;
        graph[st].push_back(en);
        graph[en].push_back(st);
    }
    int st, en;
    cin >> st >> en;
    st--, en--;
    queue<pair<int, int>> q;  // (node, distance from the start node)
    q.push({st, 0});
    visited[st] = 1;
    while (!q.empty())
    {
        auto top = q.front();
        q.pop();
        if (top.first == en)
        {
            cout << top.second << endl;  // length of the path from st to en
            return 0;
        }
        for (auto & x : graph[top.first])
        {
            if (!visited[x])
            {
                q.push({x, top.second + 1});
                visited[x] = 1;
            }
        }
    }
    return 0;
}
First, take a visited array and initialize it with 0 for all the nodes.
Mark the start node s in the visited array as 1 (i.e. visited) and perform a basic DFS from this node. As soon as you arrive at the desired node, stop the algorithm.
DFS:
http://www.geeksforgeeks.org/depth-first-traversal-for-a-graph/
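A small Python sketch of this idea, using the example tree from the question as an adjacency-list dict (the representation is mine):

def dfs_path(adj, s, t, visited=None):
    # Recursive DFS that stops as soon as the target t is reached and
    # returns the path; returns None if t is unreachable from s.
    visited = visited if visited is not None else set()
    visited.add(s)
    if s == t:
        return [s]
    for nxt in adj.get(s, []):
        if nxt not in visited:
            rest = dfs_path(adj, nxt, t, visited)
            if rest is not None:
                return [s] + rest
    return None

# Tree from the question: edges 1-4, 4-5, 3-2, 4-2.
adj = {1: [4], 4: [1, 5, 2], 5: [4], 3: [2], 2: [3, 4]}
print(dfs_path(adj, 3, 5))  # [3, 2, 4, 5]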
You should look up Dijkstra's and A* pathfinding algorithms; they will solve this for you if your data can be manipulated.
Dijkstra's is my favorite, simple to understand, but not as efficient as A*.
https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
I can't really give a more useful answer without knowing how your data is stored and what programming language or graphing software you're using. Please add tags and be more descriptive in your question.
Hope this helps.
Consider a directed graph which is traversed from first node 1 to some final nodes (which have no more outgoing edges). Each edge in the graph has a probability associated with it. Summing up the probabilities to take each possible path towards all possible final nodes returns 1. (Which means, we are guaranteed to arrive at one of the final nodes eventually.)
The problem would be simple if loops in the graph did not exist. Unfortunately, rather convoluted loops can arise in the graph, and they can be traversed an infinite number of times (the probability decreasing multiplicatively with each loop traversal, obviously).
Is there a general algorithm to find the probabilities to arrive at each of the final nodes?
A particularly nasty example:
We can represent the edges as a matrix (the probability to go from node x to node y is in the entry (x,y)):
{{0, 1/2, 0, 1/14, 1/14, 0, 5/14},
{0, 0, 1/9, 1/2, 0, 7/18, 0},
{1/8, 7/16, 0, 3/16, 1/8, 0, 1/8},
{0, 1, 0, 0, 0, 0, 0},
{0, 0, 0, 0, 0, 0, 0},
{0, 0, 0, 0, 0, 0, 0},
{0, 0, 0, 0, 0, 0, 0}}
Or as a directed graph:
The starting node 1 is blue, the final nodes 5,6,7 are green. All edges are labelled by the probability to traverse them when starting from the node where they originate.
This has eight different paths from starting node 1 to the final nodes:
{{1/14, {1, 5}}, {5/14, {1, 7}}, {7/36, {1, 2, 6}},
{1/144, {1, 2, 3, 5}}, {1/144, {1, 2, 3, 7}},
{1/36, {1, 4, 2, 6}}, {1/1008, {1, 4, 2, 3, 5}}, {1/1008, {1, 4, 2, 3, 7}}}
(The notation for each path is {probability,sequence of nodes visited})
And there are five distinct loops:
{{1/144, {2, 3, 1}}, {7/144, {3, 2}}, {1/2, {4, 2}},
{1/48, {3, 4, 2}}, {1/1008, {4, 2, 3, 1}}}
(Notation for loops is {probability to traverse loop once,sequence of nodes visited}).
If only these cycles could be resolved to obtain an effectively tree-like graph, the problem would be solved.
Any hint on how to tackle this?
I'm familiar with Java, C++ and C, so suggestions in these languages are preferred.
I'm not expert in the area of Markov chains, and although I think it's likely that algorithms are known for the kind of problem you present, I'm having difficulty finding them.
If no help comes from that direction, then you can consider rolling your own. I see at least two different approaches here:
Simulation.
Examine how the state of the system evolves over time by starting with the system in state 1 at 100% probability, and performing many iterations in which you apply your transition probabilities to compute the probabilities of the state obtained after taking a step. If at least one final ("absorbing") node can be reached (at non-zero probability) from every node, then over enough steps, the probability that the system is in anything other than a final state will decrease asymptotically toward zero. You can estimate the probability that the system ends in final state S as the probability that it is in state S after n steps, with an upper bound on the error in that estimate given by the probability that the system is in a non-final state after n steps.
As a practical matter, this is the same as computing Tr^n, where Tr is your transition probability matrix, augmented with self-edges at 100% probability for all the final states.
Exact computation.
Consider a graph, G, such as you describe. Given two vertices i and f, such that there is at least one path from i to f, and f has no outgoing edges other than self-edges, we can partition the paths from i to f into classes characterized by the number of times they revisit i prior to reaching f. There may be an infinite number of such classes, which I will designate C_if(n), where n represents the number of times the paths in C_if(n) revisit node i. In particular, C_ii(0) contains all the simple loops in G that contain i (clarification: as well as other paths).
The total probability of ending at node f given that the system traverses graph G starting at node i is given by
Pr(f|i,G) = Pr(C_if(0)|G) + Pr(C_if(1)|G) + Pr(C_if(2)|G) + ...
Now observe that if n > 0 then each path in C_if(n) has the form of a union of two paths c and t, where c belongs to C_ii(n-1) and t belongs to C_if(0). That is, c is a path that starts at node i and ends at node i, passing through i n-1 times in between, and t is a path from i to f that does not pass through i again. We can use that to rewrite our probability formula:
Pr(f|i,G) = Pr(C_if(0)|G) + Pr(C_ii(0)|G) * Pr(C_if(0)|G) + Pr(C_ii(1)|G) * Pr(C_if(0)|G) + ...
But note that every path in C_ii(n) is a composition of n+1 paths belonging to C_ii(0). It follows that Pr(C_ii(n)|G) = Pr(C_ii(0)|G)^(n+1), so we get
Pr(f|i,G) = Pr(C_if(0)|G) + Pr(C_ii(0)|G) * Pr(C_if(0)|G) + Pr(C_ii(0)|G)^2 * Pr(C_if(0)|G) + ...
And now, a little algebra gives us
Pr(f|i,G) - Pr(C_if(0)|G) = Pr(C_ii(0)|G) * Pr(f|i,G),
which we can solve for Pr(f|i,G) to get
Pr(f|i,G) = Pr(C_if(0)|G) / (1 - Pr(C_ii(0)|G))
We've thus reduced the problem to one in terms of paths that do not return to the starting node, except possibly as their end node. These do not preclude paths that have loops that don't include the starting node, but we can nevertheless rewrite this problem in terms of several instances of the original problem, computed on a subgraph of the original graph.
In particular, let S(i, G) be the set of successors of vertex i in graph G -- that is, the set of vertices s such that there is an edge from i to s in G -- and let X(G,i) be the subgraph of G formed by removing all edges that start at i. Furthermore, let p_is be the probability associated with edge (i, s) in G.
Pr(C_if(0)|G) = sum over s in S(i, G) of p_is * Pr(f|s,X(G,i))
In other words, the probability of reaching f from i through G without revisiting i in between is the sum over all successors of i of the product of the probability of reaching s from i in one step with the probability of reaching f from s through G without traversing any edges outbound from i. That applies for all f in G, including i.
Now observe that S(i, G) and all the p_is are known, and that the problem of computing Pr(f|s,X(G,i)) is a new, strictly smaller instance of the original problem. Thus, this computation can be performed recursively, and such a recursion is guaranteed to terminate. It may nevertheless take a long time if your graph is complex, and it looks like a naive implementation of this recursive approach would scale exponentially in the number of nodes. There are ways you could speed up the computation in exchange for higher memory usage (i.e. memoization).
There are likely other possibilities as well. For example, I'm suspicious that there may be a bottom-up dynamic programming approach to a solution, but I haven't been able to convince myself that loops in the graph don't present an insurmountable problem there.
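For what it's worth, here is a direct (and, as noted, exponential) Python transcription of these formulas, assuming the graph is given as a dict mapping each node to a list of (successor, probability) pairs (the representation is mine):

def prob(graph, i, f, removed=frozenset()):
    # Probability of ending at final node f, starting at i, in the
    # subgraph where the outgoing edges of nodes in `removed` are cut.
    if i == f:
        return 1.0
    if i in removed or not graph.get(i):
        return 0.0  # i was cut, or i is a final node other than f
    cut = removed | {i}
    # Pr(C_if(0)|G): reach f without revisiting i in between.
    p_f = sum(p * prob(graph, s, f, cut) for s, p in graph[i])
    # Pr(C_ii(0)|G): return to i without revisiting it in between.
    p_i = sum(p * prob(graph, s, i, cut) for s, p in graph[i])
    return p_f / (1.0 - p_i) if p_i < 1.0 else 0.0

# Tiny example: from node 1 we either absorb at 3 or bounce back via 2.
g = {1: [(2, 0.5), (3, 0.5)], 2: [(1, 1.0)]}
print(prob(g, 1, 3))  # 1.0: node 3 is reached with certainty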
Problem Clarification
The input data is a set of m rows of n columns of probabilities, essentially an m by n matrix, where m = n = the number of vertices in a directed graph. Rows are edge origins and columns are edge destinations. We will assume, on the basis of the mention of cycles in the question, that the graph is cyclic, that is, that at least one cycle exists in the graph.
Let's define the starting vertex as s. Let's also define a terminal vertex as a vertex for which there are no exiting edges, and the set of them as the set T with size z. Therefore we have z sets of routes from s to a vertex in T, and the set sizes may be infinite due to cycles [1]. In such a scenario, one cannot conclude that a terminal vertex will be reached in an arbitrarily large number of steps.
In the input data, probabilities for rows that correspond to vertices not in T are normalized to total 1.0. We shall assume the Markov property, that the probabilities at each vertex do not vary with time. This precludes the use of probability to prioritize routes in a graph search [2].
Finite math texts sometimes name example problems similar to this question Drunken Random Walks to underscore the fact that the walker forgets the past, referring to the memory-free nature of Markov chains.
Applying Probability to Routes
The probability of arriving at a terminal vertex can be expressed as an infinite series sum of products.
P_t = lim_{s → ∞} Σ ∏ P_{i,j},
where s is the step index, t is a terminal vertex index, i ∈ [1 .. m] and j ∈ [1 .. n].
Reduction
When two or more cycles intersect (sharing one or more vertices), analysis is complicated by an infinite set of patterns involving them. It appears, after some analysis and review of relevant academic work, that arriving at an accurate set of terminal vertex arrival probabilities with today's mathematical tools may best be accomplished with a converging algorithm.
A few initial reductions are possible.
The first consideration is to enumerate the terminal vertices, which is easy since the corresponding rows have probabilities of zero.
The next consideration is to differentiate any further reductions from what the academic literature calls irreducible sub-graphs. The depth-first algorithm below remembers which vertices have already been visited while constructing a potential route, so it can easily be retrofitted to identify which vertices are involved in cycles. However, it is recommended to use existing, well-tested, peer-reviewed graph libraries to identify and characterize sub-graphs as irreducible.
Mathematical reduction of irreducible portions of the graph may or may not be plausible. Consider starting vertex A and sole terminating vertex B in the graph represented as {A->C, C->A, A->D, D->A, C->D, D->C, C->B, D->B}.
Although one can reduce the graph to probability relations absent of cycles through vertex A, vertex A cannot be removed for further reduction without either modifying the probabilities of edges exiting C and D or allowing both totals of the probabilities of edges exiting C and D to be less than 1.0.
Convergent Breadth First Traversal
A breadth-first traversal that ignores revisiting and allows cycles can iterate the step index s, not to some fixed s_max but to some sufficiently stable and accurate point in a convergent trend. This approach is especially called for if cycles overlap, creating bifurcations in the simpler periodicity caused by a single cycle:
Σ P_s Δs.
For the establishment of a reasonable convergence as s increases, one must determine the desired accuracy as a criterion for completing the convergence algorithm and a metric for measuring accuracy by looking at longer-term trends in the results at all terminal vertices. It may be important to provide a criterion where the sum of the terminal vertex probabilities is close to unity, in conjunction with the trend convergence metric, as both a sanity check and an accuracy criterion. Practically, four convergence criteria may be necessary [3]:
Per terminal vertex probability trend convergence delta
Average probability trend convergence delta
Convergence of total probability on unity
Total number of steps (to cap depth for practical computing reasons)
Even beyond these four, the program may need to contain a trap for an interrupt that permits writing and subsequently examining the output after a long wait, without all four of the above criteria being satisfied.
An Example Cycle Resistant Depth First Algorithm
There are more efficient algorithms than the following one, but it is fairly comprehensible, it compiles without warnings with C++ -Wall, and it produces the desired output for all finite and legitimate directed graphs and all possible start and destination vertices [4]. It is easy to load a matrix in the form given in the question using the addEdge method [5].
#include <iostream>
#include <list>

class DirectedGraph {
    private:
        int miNodes;
        std::list<int> * mnpEdges;
        bool * mpVisitedFlags;

    private:
        void initAlreadyVisited() {
            for (int i = 0; i < miNodes; ++ i)
                mpVisitedFlags[i] = false;
        }

        void recurse(int iCurrent, int iDestination,
                     int route[], int index,
                     std::list<std::list<int> *> * pnai) {
            mpVisitedFlags[iCurrent] = true;
            route[index ++] = iCurrent;
            if (iCurrent == iDestination) {
                auto pni = new std::list<int>;
                for (int i = 0; i < index; ++ i)
                    pni->push_back(route[i]);
                pnai->push_back(pni);
            } else {
                auto it = mnpEdges[iCurrent].begin();
                auto itBeyond = mnpEdges[iCurrent].end();
                while (it != itBeyond) {
                    if (! mpVisitedFlags[* it])
                        recurse(* it, iDestination,
                                route, index, pnai);
                    ++ it;
                }
            }
            -- index;
            mpVisitedFlags[iCurrent] = false;
        }

    public:
        DirectedGraph(int iNodes) {
            miNodes = iNodes;
            mnpEdges = new std::list<int>[iNodes];
            mpVisitedFlags = new bool[iNodes];
        }

        ~DirectedGraph() {
            delete [] mnpEdges;
            delete [] mpVisitedFlags;
        }

        void addEdge(int u, int v) {
            mnpEdges[u].push_back(v);
        }

        std::list<std::list<int> *> * findRoutes(int iStart,
                                                 int iDestination) {
            initAlreadyVisited();
            auto route = new int[miNodes];
            auto pnpi = new std::list<std::list<int> *>();
            recurse(iStart, iDestination, route, 0, pnpi);
            delete [] route;
            return pnpi;
        }
};

int main() {
    DirectedGraph dg(5);
    dg.addEdge(0, 1);
    dg.addEdge(0, 2);
    dg.addEdge(0, 3);
    dg.addEdge(1, 3);
    dg.addEdge(1, 4);
    dg.addEdge(2, 0);
    dg.addEdge(2, 1);
    dg.addEdge(4, 1);
    dg.addEdge(4, 3);

    int startingNode = 2;
    int destinationNode = 3;
    auto pnai = dg.findRoutes(startingNode, destinationNode);

    std::cout
        << "Unique routes from "
        << startingNode
        << " to "
        << destinationNode
        << std::endl
        << std::endl;

    bool bFirst;
    std::list<int> * pi;
    auto it = pnai->begin();
    auto itBeyond = pnai->end();
    std::list<int>::iterator itInner;
    std::list<int>::iterator itInnerBeyond;
    while (it != itBeyond) {
        bFirst = true;
        pi = * it ++;
        itInner = pi->begin();
        itInnerBeyond = pi->end();
        while (itInner != itInnerBeyond) {
            if (bFirst)
                bFirst = false;
            else
                std::cout << ' ';
            std::cout << (* itInner ++);
        }
        std::cout << std::endl;
        delete pi;
    }
    delete pnai;
    return 0;
}
Notes
[1] Improperly handled cycles in a directed graph algorithm will hang in an infinite loop. (Note the trivial case where the number of routes from A to B for the directed graph represented as {A->B, B->A} is infinity.)
[2] Probabilities are sometimes used to reduce the CPU cycle cost of a search. Probabilities, in that strategy, are input values for meta rules in a priority queue to reduce the computational challenge of very tedious searches (even for a computer). The early literature on production systems termed the exponential character of unguided large searches Combinatory Explosions.
[3] It may be practically necessary to detect the breadth-first probability trend at each vertex and specify satisfactory convergence in terms of four criteria:
Δ(Σ∏P)_t <= Δ_max ∀ t
(Σ_{t=0..T} Δ(Σ∏P)_t) / T <= Δ_ave
|Σ Σ∏P - 1| <= u_max, where u_max is the maximum allowable deviation from unity for the sum of the final probabilities
s < S_max
[4] Provided there are enough computing resources available to support the data structures and ample time to arrive at an answer for the given computing system speed.
[5] You can load DirectedGraph dg(7) with the input data using two nested loops to iterate through the rows and columns enumerated in the question. The body of the inner loop would simply be a conditional edge addition:
if (prob != 0) dg.addEdge(i, j);
Variable prob is P_{m,n}. Route existence is only concerned with zero/nonzero status.
I found this question while researching directed cyclic graphs. The probability of reaching each of the final nodes can be calculated using absorbing Markov chains.
The video Markov Chains - Part 7 (+ parts 8 and 9) explains absorbing states in Markov chains and the math behind it.
I understand this as the following problem:
Given an initial distribution over the nodes as a vector b, and a matrix A that stores the probability of jumping from node i to node j in each time step, somewhat resembling an adjacency matrix.
Then the distribution b_1 after one time step is A x b. The distribution b_2 after two time steps is A x b_1. Likewise, the distribution b_n is A^n x b.
For an approximation of b_infinite, we can do the following:
Vector final_probability(Matrix A, Vector b,
                         Function Vector x Vector -> Scalar distance,
                         Scalar threshold) {
    b_old = b
    b_current = A x b
    while (distance(b_old, b_current) > threshold) {
        b_old = b_current
        b_current = A x b_current
    }
    return b_current
}
(I used mathematical variable names for convenience.)
In other words, we assume that the sequence of distributions converges nicely to within the given threshold. This might not hold true, but it will usually work.
You might want to add a maximum number of iterations to that.
Euclidean distance should work well as the distance function.
(This uses the concept of a Markov chain but is more of a pragmatic solution.)
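For instance, here is a sketch of this iteration in Python/numpy for the matrix from the question. Since rows there are edge origins, the update is b <- b A, and self-loops are added on the final nodes 5, 6, 7 so that probability mass stays put once absorbed:

import numpy as np

A = np.array([
    [0,   1/2,  0,    1/14, 1/14, 0,    5/14],
    [0,   0,    1/9,  1/2,  0,    7/18, 0   ],
    [1/8, 7/16, 0,    3/16, 1/8,  0,    1/8 ],
    [0,   1,    0,    0,    0,    0,    0   ],
    [0,   0,    0,    0,    1,    0,    0   ],
    [0,   0,    0,    0,    0,    1,    0   ],
    [0,   0,    0,    0,    0,    0,    1   ],
])

b = np.zeros(7)
b[0] = 1.0  # start in node 1 with probability 1

for _ in range(100000):  # cap the number of iterations, as suggested above
    b_next = b @ A
    if np.linalg.norm(b_next - b) < 1e-12:
        break
    b = b_next

print(b[4:])  # approximate probabilities of ending in nodes 5, 6, 7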
I have a homework assignment that asks me to check, for any three numbers a, b, c such that 0 <= a,b,c <= 10^16, whether I can reach c by adding a and b to each other. The trick is that with every addition their values change, so if we add a to b, we would then have the numbers a and a+b instead of a and b. Because of this, I realized it's not a simple linear equation.
In order for this to be possible, the target number c, must be able to be represented in the form:
c = xa + yb
Through some testing, I figured out that the values of x and y can't be equal, nor can both of them be even, in order for me to be able to reach the number c. I'm keeping this in mind, along with some special cases involving a, b or c being equal to zero.
Any ideas?
EDIT:
It's not Euclid's algorithm, and it's not a Diophantine equation. Maybe I have misled you with the statement that c = xa + yb. Even though x and y should satisfy this equation, it's not enough for the assignment at hand.
Take a=2, b=3, c=10 for example. In order to reach c, you would need to add a to b or b to a in the first step, and then in the second step you'd get either a = 2, b = 5 or a = 5, b = 3. If you keep doing this, you will never reach c. Euclid's algorithm will output yes, but it's clear that you can't reach 10 by adding 2 and 3 to one another.
Note: To restate the problem, as I understand it: Suppose you're given nonnegative integers a, b, and c. Is it possible, by performing a sequence of zero or more operations a = a + b or b = b + a, to reach a point where a + b == c?
OK, after looking into this further, I think you can make a small change to the statement you made in your question:
In order for this to be possible, the target number c, must be able to
be represented in the form:
c = xa + yb
where GCD(x,y) = 1.
(Also, x and y need to be nonnegative; I'm not sure if they may be 0 or not.)
Your original observations, that x may not equal y (unless they're both 1) and that x and y cannot both be even, are implied by the new condition GCD(x,y) = 1; so those observations were correct, but not strong enough.
If you use this in your program instead of the test you already have, it may make the tests pass. (I'm not guaranteeing anything.) For a faster algorithm, you can use Extended Euclid's Algorithm as suggested in the comments (and Henry's answer) to find one x0 and y0; but if GCD(x0,y0) ≠ 1, you'd have to try other possibilities x = x0 + nb, y = y0 - na, for some n (which may be negative).
I don't have a rigorous proof. Suppose we constructed the set S of all pairs (x,y) such that (1,1) is in S, and if (x,y) is in S then (x,x+y) and (x+y,y) are in S. It's obvious that (1,n) and (n,1) are in S for all n > 1. Then we can try to figure out, for some m and n > 1, how could the pair (m,n) get into S? If m < n, this is possible only if (m, n-m) was already in S. If m > n, it's possible only if (m-n, n) was already in S. Either way, when you keep subtracting the smaller number from the larger, what you get is essentially Euclid's algorithm, which means you'll hit a point where your pair is (g,g) where g = GCD(m,n); and that pair is in S only if g = 1. It appears to me that the possible values for x and y in the above equation for the target number c are exactly those which are in S. Still, this is partly based on intuition; more work would be needed to make it rigorous.
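Here is a small brute-force Python check of this conjecture (only suitable for small numbers, not the 10^16 bound from the question; it assumes a, b >= 1):

from math import gcd

def reachable(a, b, c):
    # Conjecture: c is reachable from (a, b) iff c = x*a + y*b for some
    # x, y >= 1 with gcd(x, y) == 1.
    for x in range(1, c // a + 1):
        rem = c - x * a
        if rem >= b and rem % b == 0 and gcd(x, rem // b) == 1:
            return True
    return False

print(reachable(2, 3, 10))  # False: 10 = 2*2 + 2*3 only, and gcd(2, 2) > 1
print(reachable(2, 3, 13))  # True:  13 = 2*2 + 3*3 and gcd(2, 3) == 1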
If we forget for a moment that x and y should be positive, the equation c = xa + yb has either no solutions or infinitely many. When c is not a multiple of gcd(a,b), there is no solution.
Otherwise, calling gcd(a,b) = t, use the extended Euclidean algorithm to find d and e such that t = da + eb. One solution is then given by c = (dc/t)a + (ec/t)b.
It is clear that 0 = (b/t)a - (a/t)b, so more solutions can be found by adding a multiple f of that to the equation:
c = ((dc + fb)/t)a + ((ec - af)/t)b
When we now reintroduce the restriction that x and y must be positive or zero, the question becomes finding the values of f that make x = (dc + fb)/t and y = (ec - af)/t both positive or zero.
If dc < 0 try the smallest f that makes dc + fb >= 0 and see if ec - af is also >=0.
Otherwise try the largest f (a negative number) that makes ec - af >= 0 and check if dc + fb >= 0.
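A Python sketch of this procedure (the ext_gcd helper and the choice of f are mine); it returns one pair (x, y) with x, y >= 0 and c = xa + yb, or None if no such pair exists:

def ext_gcd(a, b):
    # Returns (g, d, e) with g = gcd(a, b) and g == d*a + e*b.
    if b == 0:
        return a, 1, 0
    g, d, e = ext_gcd(b, a % b)
    return g, e, d - (a // b) * e

def nonneg_solution(a, b, c):
    t, d, e = ext_gcd(a, b)
    if c % t != 0:
        return None  # no integer solution at all
    x0, y0 = d * (c // t), e * (c // t)
    bt, at = b // t, a // t
    # General solution: x = x0 + f*bt, y = y0 - f*at. Pick the smallest
    # f that makes x >= 0 (this maximizes y), then check that y >= 0.
    f = -(x0 // bt)
    x, y = x0 + f * bt, y0 - f * at
    return (x, y) if y >= 0 else None

print(nonneg_solution(2, 3, 10))  # (2, 2): 10 = 2*2 + 2*3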
public class Main
{
    private static boolean result(long a, long b, long c)
    {
        long M = c % (a + b);
        return (M % b == 0) || (M % a == 0);
    }

    public static void main(String[] args)
    {
        // example check for a = 2, b = 3, c = 10
        System.out.println(result(2, 3, 10));
    }
}
Idea: c = xa + yb. Because either x or y is bigger, we can write the latter equation in one of two forms:
c = x(a+b) + (y-x)b,
c = y(a+b) + (x-y)a
depending on which is bigger. So, by reducing c by a+b each time, c eventually becomes:
c = (y-x)b or c = (x-y)a, so c%b or c%a will evaluate to 0.
I wonder whether there is an algorithm to efficiently calculate
a discrete 1-dimensional Minkowski sum. The Minkowski sum is defined as:
S + T = { x + y | x in S, y in T }
Could it be that we can represent the sets as lists, sort S and T, and
then do something similar to computing the union of two sets, i.e. walk
along the sets in parallel and generate the result?
Are there such algorithms known where I don't have to additionally sort the
result to remove overlapping cases x1+y1 = x2+y2? Preferably formulated in Java?
First, the size of the output can be O(nm) if there are no collisions (e.g., A = {0, 1, 2, ..., n-1}, B = {n, 2*n, 3*n, ..., n*n}), so in terms of n and m we have no hope of finding a sub-quadratic algorithm.
A straightforward one is computing all pairs (O(nm)), then sorting and unique-ing (total O(nm log nm)).
If you have an upper bound M such that x <= M for all x in A union B, we can compute the sum in O(M log M) in the following way.
Generate the characteristic vectors A[i] = 1 iff i ∈ A, 0 otherwise, and similarly for B. Each such vector is of size M.
Compute the convolution of A and B using FFT (time: O(M log M)). Output size is O(M).
Scan output O - at each cell, O[i] is nonzero iff i is an element of the Minkowski sum.
Proof: O[i] != 0 iff there exists k such that A[k] != 0 and B[i-k] != 0, iff k ∈ A and i-k ∈ B, iff k + (i-k), that is i, is in the Minkowski sum.
(Taken from this paper)
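A Python/numpy sketch of the FFT approach (numpy's FFT stands in for a hand-rolled one; the 0.5 threshold guards against floating-point noise in the convolution):

import numpy as np

def minkowski_sum(A, B, M):
    # Characteristic vectors over {0, ..., M}; assumes max(A | B) <= M.
    a = np.zeros(M + 1)
    b = np.zeros(M + 1)
    a[list(A)] = 1
    b[list(B)] = 1
    # Convolution via FFT in O(M log M); cell i of the result counts the
    # pairs (x, y) in A x B with x + y == i.
    size = 2 * M + 2
    c = np.fft.irfft(np.fft.rfft(a, size) * np.fft.rfft(b, size), size)
    return {i for i, v in enumerate(c) if v > 0.5}

print(sorted(minkowski_sum({0, 1, 4}, {2, 3}, 4)))  # [2, 3, 4, 5, 6, 7]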
Sort S and T; iterate over S searching for matching elements in T; each time you find a match, remove the element from S and T and put it in a new set U. Because they are sorted, once you find a match in T, further comparisons in T can start from the last match.
Now S, T and U are all disjoint. So iterate over S and T, adding every element of S to every element of T; then do the same for S and U, and for T and U. Finally iterate over U, and add every element in U to every element in U whose set index is equal to or greater than the current set index.
Sadly, the algorithm is still O(n^2) with this optimization. If T and S are identical, it will be 2x faster than the naive solution. You also don't have to search the output set for duplicates.