Directed probability graph - algorithm to reduce cycles? - java

Consider a directed graph which is traversed from a first node 1 to some final nodes (which have no more outgoing edges). Each edge in the graph has a probability associated with it. The probabilities of taking each possible path towards all possible final nodes sum to 1. (Which means we are guaranteed to arrive at one of the final nodes eventually.)
The problem would be simple if the graph contained no loops. Unfortunately, rather convoluted loops can arise, and they can be traversed an arbitrary number of times (the probability decreases multiplicatively with each traversal of a loop, obviously).
Is there a general algorithm to find the probabilities to arrive at each of the final nodes?
A particularly nasty example:
We can represent the edges as a matrix (probability to go from row (node) x to row (node) y is in the entry (x,y))
{{0, 1/2, 0, 1/14, 1/14, 0, 5/14},
{0, 0, 1/9, 1/2, 0, 7/18, 0},
{1/8, 7/16, 0, 3/16, 1/8, 0, 1/8},
{0, 1, 0, 0, 0, 0, 0},
{0, 0, 0, 0, 0, 0, 0},
{0, 0, 0, 0, 0, 0, 0},
{0, 0, 0, 0, 0, 0, 0}}
Or as a directed graph:
The starting node 1 is blue, the final nodes 5,6,7 are green. All edges are labelled by the probability to traverse them when starting from the node where they originate.
This has eight different paths from starting node 1 to the final nodes:
{{1/14, {1, 5}}, {5/14, {1, 7}}, {7/36, {1, 2, 6}},
{1/144, {1, 2, 3, 5}}, {1/144, {1, 2, 3, 7}},
{1/36, {1, 4, 2, 6}}, {1/1008, {1, 4, 2, 3, 5}}, {1/1008, {1, 4, 2, 3, 7}}}
(The notation for each path is {probability,sequence of nodes visited})
And there are five distinct loops:
{{1/144, {2, 3, 1}}, {7/144, {3, 2}}, {1/2, {4, 2}},
{1/48, {3, 4, 2}}, {1/1008, {4, 2, 3, 1}}}
(Notation for loops is {probability to traverse loop once,sequence of nodes visited}).
If only these cycles could be resolved to obtain an effectively tree like graph, the problem would be solved.
Any hint on how to tackle this?
I'm familiar with Java, C++ and C, so suggestions in these languages are preferred.

I'm not an expert in the area of Markov chains, and although I think it's likely that algorithms are known for the kind of problem you present, I'm having difficulty finding them.
If no help comes from that direction, then you can consider rolling your own. I see at least two different approaches here:
Simulation.
Examine how the state of the system evolves over time by starting with the system in state 1 at 100% probability, and performing many iterations in which you apply your transition probabilities to compute the probabilities of the state obtained after taking a step. If at least one final ("absorbing") node can be reached (at non-zero probability) from every node, then over enough steps, the probability that the system is in anything other than a final state will decrease asymptotically toward zero. You can estimate the probability that the system ends in final state S as the probability that it is in state S after n steps, with an upper bound on the error in that estimate given by the probability that the system is in a non-final state after n steps.
As a practical matter, this is the same as computing Tr^n, where Tr is your transition probability matrix, augmented with self-edges at 100% probability for all the final states.
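For illustration, a rough Java sketch of that iteration for the matrix in the question (0-based indices, a fixed number of steps, and names of my own choosing) might look like this:

import java.util.Arrays;

public class PowerIteration {
    public static void main(String[] args) {
        // Transition matrix from the question, with 100% self-edges added to the
        // final nodes 5, 6 and 7 so that they keep whatever probability mass arrives.
        double[][] p = {
            {0, 1.0/2, 0, 1.0/14, 1.0/14, 0, 5.0/14},
            {0, 0, 1.0/9, 1.0/2, 0, 7.0/18, 0},
            {1.0/8, 7.0/16, 0, 3.0/16, 1.0/8, 0, 1.0/8},
            {0, 1, 0, 0, 0, 0, 0},
            {0, 0, 0, 0, 1, 0, 0},
            {0, 0, 0, 0, 0, 1, 0},
            {0, 0, 0, 0, 0, 0, 1}
        };
        double[] dist = new double[7];
        dist[0] = 1.0;                             // start in state 1 with certainty
        for (int step = 0; step < 1000; step++) {  // enough steps for the mass to drain
            double[] next = new double[7];
            for (int i = 0; i < 7; i++)
                for (int j = 0; j < 7; j++)
                    next[j] += dist[i] * p[i][j];
            dist = next;
        }
        System.out.println(Arrays.toString(dist)); // mass accumulates on nodes 5, 6, 7
    }
}

The entries for the non-final states give the error bound mentioned above; they shrink towards zero as the step count grows.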
Exact computation.
Consider a graph, G, such as you describe. Given two vertices i and f, such that there is at least one path from i to f, and f has no outgoing edges other than self-edges, we can partition the paths from i to f into classes characterized by the number of times they revisit i prior to reaching f. There may be an infinite number of such classes, which I will designate Cif(n), where n represents the number of times the paths in Cif(n) revisit node i. In particular, Cii(0) contains all the simple loops in G that contain i (clarification: as well as other paths).
The total probability of ending at node f given that the system traverses graph G starting at node i is given by
Pr(f|i, G) = Pr(Cif(0)|G) + Pr(Cif(1)|G) + Pr(Cif(2)|G) ...
Now observe that if n > 0 then each path in Cif(n) has the form of a union of two paths c and t, where c belongs to Cii(n-1) and t belongs to Cif(0). That is, c is a path that starts at node i and ends at node i, passing through i n-1 times between, and t is a path from i to f that does not pass through i again. We can use that to rewrite our probability formula:
Pr(f|i,G) = Pr(Cif(0)|G) + Pr(Cii(0)|G) * Pr(Cif(0)|G) + Pr(Cii(1)|G) * Pr(Cif(0)|G) + ...
But note that every path in Cii(n) is a composition of n+1 paths belonging to Cii(0). It follows that Pr(Cii(n)|G) = Pr(Cii(0)|G)^(n+1), so we get
Pr(f|i,G) = Pr(Cif(0)|G) + Pr(Cii(0)|G) * Pr(Cif(0)|G) + Pr(Cii(0)|G)^2 * Pr(Cif(0)|G) + ...
And now, a little algebra gives us
Pr(f|i,G) - Pr(Cif(0)|G) = Pr(Cii(0)|G) * Pr(f|i,G)
, which we can solve for Pr(f|i,G) to get
Pr(f|i,G) = Pr(Cif(0)|G) / (1 - Pr(Cii(0)|G))
We've thus reduced the problem to one in terms of paths that do not return to the starting node, except possibly as their end node. These do not preclude paths that have loops that don't include the starting node, but we can nevertheless rewrite this problem in terms of several instances of the original problem, computed on a subgraph of the original graph.
In particular, let S(i, G) be the set of successors of vertex i in graph G -- that is, the set of vertices s such that there is an edge from i to s in G, and let X(G,i) be the subgraph of G formed by removing all edges that start at i. Furthermore, let p_is be the probability associated with edge (i, s) in G.
Pr(Cif(0)|G) = Sum over s in S(i, G) of p_is * Pr(f|s,X(G,i))
In other words, the probability of reaching f from i through G without revisiting i in between is the sum over all successors of i of the product of the probability of reaching s from i in one step with the probability of reaching f from s through G without traversing any edges outbound from i. That applies for all f in G, including i.
Now observe that S(i, G) and all the pis are knowns, and that the problem of computing Pr(f|s,X(G,i)) is a new, strictly smaller instance of the original problem. Thus, this computation can be performed recursively, and such a recursion is guaranteed to terminate. It may nevertheless take a long time if your graph is complex, and it looks like a naive implementation of this recursive approach would scale exponentially in the number of nodes. There are ways you could speed the computation in exchange for higher memory usage (i.e. memoization).
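For concreteness, here is a minimal, unmemoized Java sketch of that recursion (the structure and names are mine, not a prescribed implementation); removed holds the vertices whose outgoing edges have been deleted, i.e. it encodes the X(G, i) subgraphs:

import java.util.HashSet;
import java.util.Set;

public class ExactAbsorption {
    static double[][] prob;   // prob[u][v] = probability of edge (u, v)

    /** Pr(f | i, G'), where G' is G with all edges leaving vertices in 'removed' deleted. */
    static double reach(int i, int f, Set<Integer> removed) {
        boolean absorbing = removed.contains(i);
        if (!absorbing) {
            absorbing = true;
            for (int s = 0; s < prob.length; s++)
                if (prob[i][s] > 0) { absorbing = false; break; }
        }
        if (absorbing) return i == f ? 1.0 : 0.0;

        Set<Integer> sub = new HashSet<>(removed);
        sub.add(i);               // X(G, i): additionally delete the edges leaving i

        double pToF = 0.0;        // Pr(Cif(0) | G): reach f without revisiting i
        double pToI = 0.0;        // Pr(Cii(0) | G): return to i
        for (int s = 0; s < prob.length; s++) {
            if (prob[i][s] == 0) continue;
            pToF += prob[i][s] * reach(s, f, sub);
            pToI += prob[i][s] * reach(s, i, sub);
        }
        return pToF / (1.0 - pToI);   // assumes some final node stays reachable, so pToI < 1
    }

    public static void main(String[] args) {
        prob = new double[][] {
            {0, 1.0/2, 0, 1.0/14, 1.0/14, 0, 5.0/14},
            {0, 0, 1.0/9, 1.0/2, 0, 7.0/18, 0},
            {1.0/8, 7.0/16, 0, 3.0/16, 1.0/8, 0, 1.0/8},
            {0, 1, 0, 0, 0, 0, 0},
            {0, 0, 0, 0, 0, 0, 0},
            {0, 0, 0, 0, 0, 0, 0},
            {0, 0, 0, 0, 0, 0, 0}
        };
        for (int f = 4; f < 7; f++)   // final nodes 5, 6, 7 are indices 4, 5, 6
            System.out.println("Pr(end at node " + (f + 1) + ") = "
                + reach(0, f, new HashSet<>()));
    }
}

Memoizing reach on the pair (current vertex, removed set) is the speed/memory trade-off mentioned above.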
There are likely other possibilities as well. For example, I'm suspicious that there may be a bottom-up dynamic programming approach to a solution, but I haven't been able to convince myself that loops in the graph don't present an insurmountable problem there.

Problem Clarification
The input data is a set of m rows of n columns of probabilities, essentially an m by n matrix, where m = n = the number of vertices in the directed graph. Rows are edge origins and columns are edge destinations. We will assume, on the basis of the mention of cycles in the question, that the graph is cyclic, i.e. that at least one cycle exists in it.
Let's define the starting vertex as s. Let's also define a terminal vertex as a vertex with no exiting edges, and the set of them as T with size z. We therefore have z sets of routes from s to a vertex in T, and the set sizes may be infinite due to cycles [1]. In such a scenario, one cannot conclude that a terminal vertex will be reached in an arbitrarily large number of steps.
In the input data, probabilities for rows that correspond to vertices not in T are normalized to total 1.0. We shall assume the Markov property, that the probabilities at each vertex do not vary with time. This precludes the use of probability to prioritize routes in a graph search [2].
Finite math texts sometimes name example problems similar to this question as Drunken Random Walks to underscore the fact that the walker forgets the past,
referring to the memory-free nature of Markovian chains.
Applying Probability to Routes
The probability of arriving at a terminal vertex can be expressed as an infinite series sum of products.
P_t = lim_{s → ∞} Σ ∏ P_{i,j},
where s is the step index, t is a terminal vertex index, i ∈ [1 .. m], and j ∈ [1 .. n].
Reduction
When two or more cycles intersect (sharing one or more vertices), analysis is complicated by an infinite set of patterns involving them. It appears, after some analysis and review of relevant academic work, that arriving at an accurate set of terminal vertex arrival probabilities with today's mathematical tools may best be accomplished with a converging algorithm.
A few initial reductions are possible.
The first consideration is to enumerate the destination (terminal) vertices, which is easy since their corresponding rows contain only zero probabilities.
The next consideration is to differentiate any further reductions from what the academic literature calls irreducible sub-graphs. The below depth first algorithm remembers which vertices have already been visited while constructing a potential route, so it can be easily retrofitted to identify which vertices are involved in cycles. However it is recommended to use existing well tested, peer reviewed graph libraries to identify and characterize sub-graphs as irreducible.
Mathematical reduction of irreducible portions of the graph may or may not be plausible. Consider starting vertex A and sole terminating vertex B in the graph represented as {A->C, C->A, A->D, D->A, C->D, D->C, C->B, D->B}.
Although one can reduce the graph to probability relations absent of cycles through vertex A, vertex A cannot be removed for further reduction without either modifying the probabilities of the edges exiting C and D or allowing the totals of the probabilities of edges exiting C and D to be less than 1.0.
Convergent Breadth First Traversal
A breadth first traversal that ignores revisiting and allows cycles can iterate the step index s, not to some fixed s_max but to some sufficiently stable and accurate point in a convergent trend. This approach is especially called for if cycles overlap, creating bifurcations in the simpler periodicity caused by a single cycle.
Σ P_s Δs.
To establish reasonable convergence as s increases, one must determine the desired accuracy as a criterion for terminating the convergence algorithm and a metric for measuring accuracy by looking at longer-term trends in the results at all terminal vertices. It may be important to provide a criterion requiring that the sum of terminal vertex probabilities be close to unity, in conjunction with the trend convergence metric, as both a sanity check and an accuracy criterion. Practically, four convergence criteria may be necessary [3]:
Per terminal vertex probability trend convergence delta
Average probability trend convergence delta
Convergence of total probability on unity
Total number of steps (to cap depth for practical computing reasons)
Even beyond these four, the program may need to contain a trap for an interrupt that permits writing out and examining the output after a long wait, without all four of the above criteria having been satisfied.
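To make those criteria concrete, here is a hedged Java sketch of such a convergence-controlled iteration; the method name and the threshold parameters (deltaMax, deltaAve, uMax, sMax) are illustrative choices, not values taken from the question:

// p is the full transition matrix (terminal rows are all zero), terminals lists the
// indices of the terminal vertices, start is s. Returns the accumulated arrival
// probability per terminal vertex once the four criteria are met or sMax is reached.
static double[] converge(double[][] p, int[] terminals, int start,
                         double deltaMax, double deltaAve, double uMax, int sMax) {
    int n = p.length;
    double[] dist = new double[n];
    dist[start] = 1.0;
    double[] arrived = new double[n];
    for (int s = 0; s < sMax; s++) {
        double[] next = new double[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                next[j] += dist[i] * p[i][j];
        double maxDelta = 0, sumDelta = 0, total = 0;
        for (int t : terminals) {
            double delta = next[t];     // probability newly arriving at t in this step
            arrived[t] += delta;
            maxDelta = Math.max(maxDelta, delta);
            sumDelta += delta;
            total += arrived[t];
            next[t] = 0;                // terminal mass has been banked; keep iterating the rest
        }
        if (maxDelta <= deltaMax                            // per-terminal trend delta
                && sumDelta / terminals.length <= deltaAve  // average trend delta
                && Math.abs(total - 1.0) <= uMax)           // total probability near unity
            return arrived;
        dist = next;
    }
    return arrived;                     // step cap reached without full convergence
}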
An Example Cycle Resistant Depth First Algorithm
There are more efficient algorithms than the following one, but it is fairly comprehensible, it compiles without warnings with -Wall, and it produces the desired output for all finite and legitimate directed graphs and all possible start and destination vertices [4]. It is easy to load a matrix in the form given in the question using the addEdge method [5].
#include <iostream>
#include <list>

class DirectedGraph {
private:
    int miNodes;
    std::list<int> * mnpEdges;
    bool * mpVisitedFlags;

private:
    void initAlreadyVisited() {
        for (int i = 0; i < miNodes; ++i)
            mpVisitedFlags[i] = false;
    }

    void recurse(int iCurrent, int iDestination,
                 int route[], int index,
                 std::list<std::list<int> *> * pnai) {
        mpVisitedFlags[iCurrent] = true;
        route[index++] = iCurrent;
        if (iCurrent == iDestination) {
            auto pni = new std::list<int>;
            for (int i = 0; i < index; ++i)
                pni->push_back(route[i]);
            pnai->push_back(pni);
        } else {
            auto it = mnpEdges[iCurrent].begin();
            auto itBeyond = mnpEdges[iCurrent].end();
            while (it != itBeyond) {
                if (!mpVisitedFlags[*it])
                    recurse(*it, iDestination,
                            route, index, pnai);
                ++it;
            }
        }
        --index;
        mpVisitedFlags[iCurrent] = false;
    }

public:
    DirectedGraph(int iNodes) {
        miNodes = iNodes;
        mnpEdges = new std::list<int>[iNodes];
        mpVisitedFlags = new bool[iNodes];
    }

    ~DirectedGraph() {
        delete[] mnpEdges;        // arrays allocated with new[] must be released with delete[]
        delete[] mpVisitedFlags;
    }

    void addEdge(int u, int v) {
        mnpEdges[u].push_back(v);
    }

    std::list<std::list<int> *> * findRoutes(int iStart,
                                             int iDestination) {
        initAlreadyVisited();
        auto route = new int[miNodes];
        auto pnpi = new std::list<std::list<int> *>();
        recurse(iStart, iDestination, route, 0, pnpi);
        delete[] route;
        return pnpi;
    }
};
int main() {
    DirectedGraph dg(5);
    dg.addEdge(0, 1);
    dg.addEdge(0, 2);
    dg.addEdge(0, 3);
    dg.addEdge(1, 3);
    dg.addEdge(1, 4);
    dg.addEdge(2, 0);
    dg.addEdge(2, 1);
    dg.addEdge(4, 1);
    dg.addEdge(4, 3);

    int startingNode = 2;
    int destinationNode = 3;
    auto pnai = dg.findRoutes(startingNode, destinationNode);

    std::cout
        << "Unique routes from "
        << startingNode
        << " to "
        << destinationNode
        << std::endl
        << std::endl;

    bool bFirst;
    std::list<int> * pi;
    auto it = pnai->begin();
    auto itBeyond = pnai->end();
    std::list<int>::iterator itInner;
    std::list<int>::iterator itInnerBeyond;
    while (it != itBeyond) {
        bFirst = true;
        pi = *it++;
        itInner = pi->begin();
        itInnerBeyond = pi->end();
        while (itInner != itInnerBeyond) {
            if (bFirst)
                bFirst = false;
            else
                std::cout << ' ';
            std::cout << (*itInner++);
        }
        std::cout << std::endl;
        delete pi;
    }
    delete pnai;
    return 0;
}
Notes
[1] Improperly handled cycles in a directed graph algorithm will hang in an infinite loop. (Note the trivial case where the number of routes from A to B for the directed graph represented as {A->B, B->A} is infinity.)
[2] Probabilities are sometimes used to reduce the CPU cycle cost of a search. Probabilities, in that strategy, are input values for meta rules in a priority queue to reduce the computational challenge of very tedious searches (even for a computer). The early literature on production systems termed the exponential character of unguided large searches combinatorial explosion.
[3] It may be practically necessary to detect the breadth-first probability trend at each vertex and specify satisfactory convergence in terms of four criteria:
Δ(Σ∏P)_t <= Δ_max ∀ t
Σ_{t=0..T} Δ(Σ∏P)_t / T <= Δ_ave
|Σ Σ∏P - 1| <= u_max, where u_max is the maximum allowable deviation from unity for the sum of final probabilities
s < s_max
[4] Provided there are enough computing resources available to support the data structures and ample time to arrive at an answer for the given computing system speed.
[5] You can load DirectedGraph dg(7) with the input data using two nested loops to iterate through the rows and columns enumerated in the question. The body of the inner loop would simply be a conditional edge addition.
if (prob != 0) dg.addEdge(i, j);
Variable prob is P_{m,n}. Route existence is only concerned with zero/nonzero status.

I found this question while researching directed cyclic graphs. The probability of reaching each of the final nodes can be calculated using absorbing Markov chains.
The video Markov Chains - Part 7 (+ parts 8 and 9) explains absorbing states in Markov chains and the math behind it.
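For reference, the absorbing-chain computation for the matrix in the question comes down to B = (I - Q)^-1 * R, where Q is the transient-to-transient block (nodes 1-4) and R is the transient-to-absorbing block (nodes 5-7). A small Java sketch (my own layout, solving the linear system with Gauss-Jordan elimination rather than inverting explicitly):

public class AbsorbingChain {
    public static void main(String[] args) {
        // Transient rows of the question's matrix: nodes 1-4 (all 7 columns kept).
        double[][] p = {
            {0, 1.0/2, 0, 1.0/14, 1.0/14, 0, 5.0/14},
            {0, 0, 1.0/9, 1.0/2, 0, 7.0/18, 0},
            {1.0/8, 7.0/16, 0, 3.0/16, 1.0/8, 0, 1.0/8},
            {0, 1, 0, 0, 0, 0, 0}
        };
        int t = 4, a = 3;                      // 4 transient, 3 absorbing states
        double[][] m = new double[t][t + a];   // augmented system (I - Q | R)
        for (int i = 0; i < t; i++) {
            for (int j = 0; j < t; j++) m[i][j] = (i == j ? 1 : 0) - p[i][j];
            for (int j = 0; j < a; j++) m[i][t + j] = p[i][t + j];
        }
        // Gauss-Jordan elimination; afterwards the right block holds B = (I - Q)^-1 * R.
        for (int col = 0; col < t; col++) {
            int piv = col;
            for (int r = col + 1; r < t; r++)
                if (Math.abs(m[r][col]) > Math.abs(m[piv][col])) piv = r;
            double[] tmp = m[col]; m[col] = m[piv]; m[piv] = tmp;
            double d = m[col][col];
            for (int j = 0; j < t + a; j++) m[col][j] /= d;
            for (int r = 0; r < t; r++) {
                if (r == col) continue;
                double f = m[r][col];
                for (int j = 0; j < t + a; j++) m[r][j] -= f * m[col][j];
            }
        }
        // Row 0 of B: probabilities of ending at nodes 5, 6, 7 when starting at node 1.
        for (int j = 0; j < a; j++)
            System.out.printf("Pr(absorbed at node %d) = %.6f%n", t + j + 1, m[0][t + j]);
    }
}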

I understand this as the following problem:
Given an initial distribution over the nodes as a vector b, and a matrix A that stores the probability of jumping from node i to node j in each time step, somewhat resembling an adjacency matrix.
Then the distribution b_1 after one time step is A x b. The distribution b_2 after two time steps is A x b_1. Likewise, the distribution b_n is A^n x b.
For an approximation of b_infinite, we can do the following:
Vector final_probability(Matrix A, Vector b,
                         Function Vector x Vector -> Scalar distance, Scalar threshold) {
    b_old = b
    b_current = A x b
    // iterate while successive distributions still differ by more than the threshold
    while (distance(b_old, b_current) > threshold) {
        b_old = b_current
        b_current = A x b_current
    }
    return b_current
}
(I used mathematical variable names for convenience.)
In other words, we assume that the sequence of distributions has effectively converged once two successive distributions are closer than the given threshold. Might not hold true, but will usually work.
You might want to add a maximal amount of iterations to that.
Euclidean distance should work well as distance.
(This uses the concept of a Markov Chain but is more of a pragmatical solution)

Related

Time Complexity of this Word Chain

I have written a bigger program in which I construct a graph Graph(V,E) from words in a data file. Then I parse another file that has two words on the same line, like "other there", where the first string is the start word and the second is the end word, as seen below:
while (true) {
    String line = reader.readLine();
    if (line == null) { break; }
    assert line.length() == 11;
    String start = line.substring(0, 5);
    String goal = line.substring(6, 11);
    System.out.println(
        G.findShortestPath(
            words.indexOf(start),
            words.indexOf(goal)));
}
I am using undirected graphs with BFS. I am doing a word-transformation word chain, like this: climb → blimp → limps → pismo → moist → stoic
The input file contains words of length five, and a path/connection is defined such that one can go from X to Y in one step if and only if the four last letters in X are found in Y.
What is known, and I have calculated.
Time complexity: O(V + E) for building the graph G(V, E) of words. The second part of the program consists of a while loop and a for loop of finding the shortest path (using BFS), which is O(V^2).
Space complexity: O(V) in the worst case. The graph holds all the words. The nodes are made up of a single node class object which contains n neighbor(s).
Process:
Program loads into buffer a file with words.
The program builds the graph.
The program runs test and loads into buffer information from a test file (different file). Then selects start and end node and performs the shortest path search.
If there's a connection, the code returns the shortest path length. If there's no connection between two words or end/goal cannot be reached, we return -1.
Now, I am trying to come up with an O(?) time algorithm or total time complexity for V,E,F, where:
V is the number of vertices
E is the number of edges
F is the number of test cases (number of lines in the test file)*
*Number of test cases in function: public void performTest(String filename) throws IOException. The body of that function is shown above. Now, I know that for n lines there will be n test cases, so F = O(n). But in what way can one incorporate this calculation into a general O expression with variables that holds for whatever number of words are in the list, in the graph, and in the test file?
The main body of the BFS algorithm is the nesting of two loops: the while loop visits each vertex once, so it is O(|V|); for the for loop nested inside the while, each edge will be checked only once, when its starting vertex u is dequeued, and each vertex will be dequeued at most once, so in total the edge checks are O(|E|). The time complexity of BFS is therefore O(|V| + |E|).
Contents in Testfile:
The first word becomes the starting word, the second on the same line becomes the end or target word.
blimp moist
limps limps
Some other code I've written previously:
public static void main(String... args) throws IOException {
    ArrayList<String> words = new ArrayList<String>();
    words.add("their");
    words.add("moist");
    words.add("other");
    words.add("blimp");
    Graph g = new Graph(words.size());
    for (String word : words) {
        for (String word2 : words) {
            g.addEdge(words.indexOf(word), words.indexOf(word2));
        }
    }
    BufferedReader readValues = null;
    try {
        readValues =
            new BufferedReader(new InputStreamReader(new FileInputStream("testfile.txt")));
        String line = null;
        while ((line = readValues.readLine()) != null) {
            // assert line.length() == 11;
            String[] tokens = line.split(" ");
            String start = tokens[0];
            String goal = tokens[1];
            BreadthFirstPaths bfs = new BreadthFirstPaths(g, words.indexOf(start));
            if (bfs.hasPathTo(words.indexOf(goal))) {
                System.out.println("Shortest path distance from " + start + " to " + goal + " = " + bfs.distTo(words.indexOf(goal)));
            } else System.out.println("Nothing");
        }
    } finally {
        if (readValues != null) {
            try {
                readValues.close();
            } catch (Throwable t) {
                t.printStackTrace();
            }
        }
    }
}
Notice: Not interested in FASTER solutions.
The straightforward answer:
The direct approach would be to use the Floyd-Warshall algorithm. This algorithm computes shortest paths between all pairs of vertices in a directed graph without negative cycles. Since you are using an undirected graph with positive weights, it is sufficient to replace every undirected edge (u,v) with the directed pair (u,v), (v,u).
The runtime of Floyd-Warshall is O(V^3) and it would compute all the answers you could ever seek at once, given that you can retrieve them in a reasonable time. (Which should be rather easy since you already have V^3 of breathing room.)
Getting faster:
In your case that most likely isn't optimal (not to mention that I don't know how many queries you will make - if only a few, then FW is definitely overkill). Since your graph doesn't have any negative edges and it seems from your space complexity that the edge count is only C * |V|, we can go further. Enter Dijkstra.
Dijkstra's algorithm has complexity O(E + V log(V)).
Considering that you most likely have only ~ C * V edges, this would bring the repeated Dijkstra computation cost to F * O(V * log(V)).
And faster:
If you wish to give frying your brain a go, Dijkstra can be improved upon in some special cases by using the dark magic of Fibonacci heaps (which are modified for the purpose of the algorithm to make things more confusing). From what I can see, your case could be special enough that the O(N * sqrt(log(N))) from this article is achievable. Their assumptions are:
n vertices
m edges
the longest arc (the length of an edge, if my google-fu is correct) being bounded by a polynomial function of n.
This is it for my attempt at a quick dive into the shortest path problem. If you wish to research more, I would recommend looking into the all-pairs-shortest-paths problem in general. There are many other algorithms that are similar in complexity. Your ideal approach will also depend on your F.
P.S.:
Unless you have many, many words, your neighbor count can still be rather big: 5! * 26 in the worst case to be precise. (Four letters are fixed and one is arbitrary - possible permutations * letter count). Can be lower in case of repetitions, still it isn't small, although it can technically be considered a constant.
It seems to me that you are simply asking about the computational complexity of your existing solution, expressed in terms of the 3 variables V, E and F.
If I am reading your question correctly, it is
O(V + E) // loading
+ O(V^2) done F times // F test cases
which simplifies to:
O(V + E + (F * V^2))
This assumes that your Big-O characterizations of the load and search times are correct. We cannot confirm [1] that those characterizations are correct without seeing the complete Java source code for your solution. (An English or pseudo-code description is too imprecise.)
Note that the above formula is "canonical". It cannot be simplified further unless you eliminate variables; e.g. by treating them as a constant or placing bounds on them.
However, if we can assume that F > 0, we can reduce it to:
O(E + (F * V^2))
since when F > 0 the F*V^2 term will dominate the V term as either F or V tends to infinity. (Intuitively, the F == 0 case corresponds to just loading the graph and running no test cases. You would expect the performance characteristics to be different in that case.)
It is possible that the E variable could be approximated as a function of V. (Each edge from one node to another represents a permutation of a word with one letter changed. If one did some statistical analysis of words in the English language, it may be possible to determine the average number of edges per node as a function of the number of nodes.) If we can prove that this (hypothetical) average_E(V) function is O(V^2) or better (as seems likely!), then we can eliminate the E term from the overall complexity class.
(Of course, that assumes that the input graph contains no duplicate or incorrect edges per the description of the problem you are trying to solve.)
Finally, it seems that the O(V^2) is actually a measure of worst-case performance, depending on the sparseness of the graph. So you need to factor that into your answer ... and the way that you attempt to validate the complexity.
[1] The O(V^2) seems a bit suspicious to me.
Time: O(F * (V + E))
Space: O(V + E)
Following what you described
V: Vertices
E: Edges
F: path queries
The BFS algorithm complexity is O(V + E) time, and O(V) space
Each query is a BFS without any modification, so the time complexity is O(F * (V + E)), and space complexity is the same as one single BFS since you are using the same structure O(V), but we have to consider the space used to store the graph O(E).
Graph construction: you are iterating over all pairs of words and adding one edge for each pair, so your graph always has E = V^2. If you had asked for advice on improving your algorithm, as is usual in this community, I would tell you to avoid adding edges that should not be used (those with more than one character of difference), as sketched below.
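For instance, a possible filter (this is only a sketch of my reading of the connection rule in the question, where the four last letters of the start word must all be found in the destination word) would be:

// Hypothetical helper: true when every one of x's last four letters occurs in y
// (counting multiplicity), which is one way to read the rule from the question.
static boolean connected(String x, String y) {
    int[] counts = new int[26];
    for (char c : y.toCharArray()) counts[c - 'a']++;
    for (int i = 1; i < x.length(); i++)       // skip the first letter of x
        if (--counts[x.charAt(i) - 'a'] < 0) return false;
    return true;
}

// Graph construction with the filter applied, instead of adding every pair:
for (int i = 0; i < words.size(); i++)
    for (int j = 0; j < words.size(); j++)
        if (i != j && connected(words.get(i), words.get(j)))
            g.addEdge(i, j);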

Arrays Left Rotation { A baby coder's struggle with time complexities}

I am working on a HackerRank problem where I am given an array and must rotate its elements to the left a given number of times. I was able to solve the problem, but I am still stuck on how to calculate the time complexity of my solution: since I am traversing the array repeatedly, initializing i at 0 each time until the while-loop condition is false, I wanted to say O(n), but I'm leaning towards O(n^2), or I'm just way off... Can someone please help me understand the approach to calculating time complexities? I know that my space complexity is O(1) since I am not using an extra data structure. Lastly, depending on the actual time complexity, is there a need to make my code more efficient? P.S. please be kind with your responses, I am an undergrad still trying to master basic Computer Science principles. If you have any good books or website suggestions specifically for learning time complexities I'd be so grateful if you'd provide a link in the comments.
Here is the question:
A left rotation operation on an array shifts each of the array's elements one unit to the left. For example, if 2 left rotations are performed on the array [1,2,3,4,5], then the array would become [3,4,5,1,2].
Given an array of n integers and a number d, perform d left rotations on the array. Return the updated array to be printed as a single line of space-separated integers.
Here is my input:
5 4
1 2 3 4 5
n = 5 d = 4
Here is my output:
5, 1, 2, 3, 4
My Solution/Code :
static int[] rotLeft(int[] a, int d) {
    while (d != 0) {
        int k = a[0];
        int i = 0;
        while (i < a.length - 1) {
            a[i] = a[i + 1];
            i++;
        }
        a[a.length - 1] = k;
        d--;
    }
    return a;
}
A left rotation performs n assignments on the array, so it's an O(n) operation.
You're performing d of those operations. Since d is unrelated to n, you could say that the general form of the time complexity is O(n*d).
If you were given any additional information about the relative sizes of d and n, you could further refine this:
If d is much larger than n, n is negligible and you could say the time complexity of the entire operation is O(d).
If d is much smaller than n, d is negligible and you could say the time complexity of the entire operation is O(n).
If d is the same order of magnitude as n, you could say the time complexity of the entire operation is O(n^2).
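If you ever do need to reduce that cost, one option (a sketch only, and it trades your O(1) extra space for O(n)) is to compute the rotated array in a single pass:

static int[] rotLeft(int[] a, int d) {
    int n = a.length;
    int shift = d % n;                 // rotating by a multiple of n changes nothing
    int[] out = new int[n];
    for (int i = 0; i < n; i++)
        out[i] = a[(i + shift) % n];   // the element shift places to the right ends up at i
    return out;
}

This is O(n) time regardless of how large d is.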

How to find the point that gives the maximum value fast? Java or c++ code please

I need a fast way to find the maximum value when intervals are overlapping. Unlike finding the point that is overlapped the most, there is an "order": I would have int[][] data with 2 values in each int[], where the first number is the center and the second number is the radius; the closer to the center, the larger the value at that point is going to be. For example, if I am given data like:
int[][] data = new int[][]{
{1, 1},
{3, 3},
{2, 4}};
Then on a number line, this is how it is going to look:
x axis:  -2 -1  0  1  2  3  4  5  6  7
1 1:              1  2  1
3 3:              1  2  3  4  3  2  1
2 4:        1  2  3  4  5  4  3  2  1
So for the value of my point to be as large as possible, I need to pick the point x = 2, which gives a total value of 1 + 3 + 5 = 9, the largest possible value. Is there a way to do it fast, like a time complexity of O(n) or O(n log n)?
This can be done with a simple O(n log n) algorithm.
Consider the value function v(x), and then consider its discrete derivative dv(x) = v(x) - v(x-1). Suppose you only have one interval, say {3,3}. dv(x) is 0 from -infinity to -1, then 1 from 0 to 3, then -1 from 4 to 7, then 0 from 8 to infinity. That is, the derivative changes by 1 "just after" -1, by -2 just after 3, and by 1 just after 7.
For n intervals, there are 3*n derivative changes (some of which may occur at the same point). So find the list of all derivative changes (x,change), sort them by their x, and then just iterate through the set.
Behold:
intervals = [(1, 1), (3, 3), (2, 4)]
events = []
for mid, width in intervals:
    before_start = mid - width - 1
    at_end = mid + width + 1   # the slope returns to 0 just after mid + width
    events += [(before_start, 1), (mid, -2), (at_end, 1)]
events.sort()

prev_x = -1000
v = 0
dv = 0
best_v = -1000
best_x = None
for x, change in events:
    dx = x - prev_x
    v += dv * dx
    if v > best_v:
        best_v = v
        best_x = x
    dv += change
    prev_x = x
print best_x, best_v
And also the java code:
TreeMap<Integer, Integer> ts = new TreeMap<Integer, Integer>();
for (int i = 0; i < cows.size(); i++) {
    int index = cows.get(i)[0] - cows.get(i)[1];
    if (ts.containsKey(index)) {
        ts.replace(index, ts.get(index) + 1);
    } else {
        ts.put(index, 1);
    }
    index = cows.get(i)[0] + 1;
    if (ts.containsKey(index)) {
        ts.replace(index, ts.get(index) - 2);
    } else {
        ts.put(index, -2);
    }
    index = cows.get(i)[0] + cows.get(i)[1] + 2;
    if (ts.containsKey(index)) {
        ts.replace(index, ts.get(index) + 1);
    } else {
        ts.put(index, 1);
    }
}
int value = 0;
int best = 0;
int change = 0;
int indexBefore = -100000000;
while (!ts.isEmpty()) {
    int index = ts.firstKey();
    value += (index - indexBefore) * change;   // advance the running value to this x
    best = Math.max(value, best);
    change += ts.get(index);
    indexBefore = index;                       // remember where we last evaluated
    ts.remove(index);
}
where cows is the data
Hmmm, a general O(n log n) or better would be tricky, probably solvable via linear programming, but that can get rather complex.
After a bit of wrangling, I think this can be solved via line intersections and the summation of functions (represented by intersecting line segments). Basically, think of each input as a triangle on top of a line. If the input is (C,R), the triangle is centered on C and has a radius of R. The points on the line are C-R (value 0), C (value R) and C+R (value 0). Each line segment of the triangle represents a value.
Consider any 2 such "triangles", the max value occurs in one of 2 places:
The peak of one of the triangles
The intersection point of the triangles, or a point where the two triangles overlap. Multiple triangles just mean more possible intersection points; sadly the number of possible intersections grows quadratically, so O(N log N) or better may be impossible with this method (unless some good optimizations are found), unless the number of intersections is O(N) or less.
To find all the intersection points, we can just use a standard algorithm for that, but we need to modify things in one specific way: we need to add a line that extends from each peak high enough that it would be higher than any other line, so basically from (C,C) to (C,Max_R). We then run the algorithm; output-sensitive intersection-finding algorithms are O(N log N + k), where k is the number of intersections. Sadly this can be as high as O(N^2) (consider the case (1,100), (2,100), (3,100)... and so on up to (50,100): every line would intersect with every other line). Once you have the O(N + K) intersections, at every intersection you can calculate the value by summing the values of all the lines active at that point. The running sum could be kept as a cached value so it only changes O(K) times, though that might not be possible, in which case it would be O(N*K) instead, making it potentially O(N^3) (in the worst case for K) :(. For each intersection you need to sum up to O(N) lines to get the value for that point, though in practice the performance would likely be better.
There are optimizations that could be done considering that you aim for the max and not just to find intersections. There are likely intersections not worth pursuing; however, I could also see a situation where it is so close you can't cut them down. It reminds me of convex hull: in many cases you can easily discard 90% of the data, but there are cases where you see the worst-case results (every point, or almost every point, is a hull point). For example, in practice there are certainly cases where you can be sure that the sum is going to be less than the current known max value.
Another optimization might be building an interval tree.

Dependency Algorithm - find a minimum set of packages to install

I'm working on an algorithm whose goal is to find a minimum set of packages needed to install package "X".
I'll explain better with an example:
X depends on A and (E or C)
A depends on E and (H or Y)
E depends on B and (Z or Y)
C depends on (A or K)
H depends on nothing
Y depends on nothing
Z depends on nothing
K depends on nothing
The solution is to install: A E B Y.
Here is an image to describe the example:
Is there an algorithm to solve the problem without using a brute-force approach?
I've already read a lot about algorithms such as DFS, BFS, Dijkstra, etc...
The problem is that these algorithms are unable to handle the "OR" condition.
UPDATE
I don't want to use external libraries.
The algorithm doesn't have to handle circular dependencies.
UPDATE
One possible solution could be to calculate all the possible paths of each vertex and, for each vertex in those paths, do the same.
So, the possible paths for X would be (A E),(A C). Now, for each element in those two possible paths we can do the same: A = (E H),(E Y) / E = (B Z),(B Y), and so on...
At the end we can combine the possible paths of each vertex in a SET and choose the one with minimum length.
What do you think?
Unfortunately, there is little hope to find an algorithm which is much better than brute-force, considering that the problem is actually NP-hard (but not even NP-complete).
A proof of NP-hardness of this problem is that the minimum vertex cover problem (well known to be NP-hard and not NP-complete) is easily reducible to it:
Given a graph, create a package Pv for each vertex v of the graph. Also create a package X that "and"-requires (Pu or Pv) for each edge (u, v) of the graph. Find a minimum set of packages to be installed in order to satisfy X. Then v is in the minimum vertex cover of the graph iff the corresponding package Pv is in the installation set.
"I dint get the problem with "or" (the image is not loading for me).
Here is my reasoning . Say we take standard shortest route algo like Dijkstras and then use equated weightage to find the best path .
Taking your example
Select the best Xr from below 2 options
Xr= X+Ar+Er
Xr= X+Ar+Cr
where Ar = is the best option from the tree A=H(and subsequent child's) or A=Y(and subsequent childs)
The idea is to first assign standard weight for each or option (since and option is not a problem) .
And later for each or option we repeat the process with its child nodes till we reach no more or option .
However , we need to first define , what best choice means, assume that least number of dependencies ie shortest path is the criteria .
The by above logic we assign weight of 1 for X. There onwards
X=1
X=A and E or C hence X=A1+E1 and X=A1+C1
A= H or Y, assuming H and Y are leaf node hence A get final weight as 1
hence , X=1+E1 and X=1+C1
Now for E and C
E1=B1+Z1 and B1+Y1 . C1=A1 and C=K1.
Assuming B1,Z1,Y1,A1and K1 are leaf node
E1=1+1 and 1+1 . C1=1 and C1=1
ie E=2 and C=1
Hence
X=1+2 and X=1+1 hence please choose X=>C as the best route
Hope this clears it .
Also we need to take care of cyclical dependencies X=>Y=>Z=>X , here we may assign such nodes are zero at parent or leaf node level and take care of dependecy."
I actually think graphs are the appropriate structure for this problem. Note that
A and (E or C) <==> (A and E) or (A and C). Thus, we can represent X = A and (E or C) with the following set of directed edges:
A <- K1
E <- K1
A <- K2
C <- K2
K1 <- X
K2 <- X
Essentially, we're just decomposing the logic of the statement and using "dummy" nodes to represent the ANDs.
Suppose we decompose all the logical statements in this fashion (dummy Ki nodes for ANDS and directed edges otherwise). Then, we can represent the input as a DAG and recursively traverse the DAG. I think the following recursive algorithm could solve the problem:
Definitions:
Node u - Current Node.
S - The visited set of nodes.
children(x) - Returns the out neighbors of x.
Algorithm:
shortestPath u S =
    if (u has no children) {
        add u to S
        return 1
    } else if (u is a dummy node) {
        (a,b) = children(u)
        if (a and b are in S) {
            return 0
        } else if (b is in S) {
            x = shortestPath a S
            add a to S
            return x
        } else if (a in S) {
            y = shortestPath b S
            add b to S
            return y
        } else {
            x = shortestPath a S
            add a to S
            if (b in S) return x
            else {
                y = shortestPath b S
                add b to S
                return x + y
            }
        }
    } else {
        min = Int.Max
        min_node = m
        for (x in children(u)) {
            if (x is not in S) {
                S_1 = S
                k = shortestPath x S_1
                if (k < min) min = k, min_node = x
            } else {
                min = 1
                min_node = x
            }
        }
        return 1 + min
    }
Analysis:
This is an entirely sequential algorithm that (I think) traverses each edge at most once.
A lot of the answers here focus on how this is a theoretically hard problem due to its NP-hard status. While this means you will experience asymptotically poor performance exactly solving the problem (given current solution techniques), you may still be able to solve it quickly (enough) for your particular problem data. For instance, we are able to exactly solve enormous traveling salesman problem instances despite the fact that the problem is theoretically challenging.
In your case, a way to solve the problem would be to formulate it as a mixed integer linear program, where there is a binary variable x_i for each package i. You can convert requirements A requires (B or C or D) and (E or F) and (G) to constraints of the form x_A <= x_B + x_C + x_D ; x_A <= x_E + x_F ; x_A <= x_G, and you can require that a package P be included in the final solution with x_P = 1. Solving such a model exactly is relatively straightforward; for instance, you can use the pulp package in python:
import pulp

deps = {"X": [("A"), ("E", "C")],
        "A": [("E"), ("H", "Y")],
        "E": [("B"), ("Z", "Y")],
        "C": [("A", "K")],
        "H": [],
        "B": [],
        "Y": [],
        "Z": [],
        "K": []}
required = ["X"]

# Variables
x = pulp.LpVariable.dicts("x", deps.keys(), lowBound=0, upBound=1, cat=pulp.LpInteger)
mod = pulp.LpProblem("Package Optimization", pulp.LpMinimize)

# Objective
mod += sum([x[k] for k in deps])

# Dependencies
for k in deps:
    for dep in deps[k]:
        mod += x[k] <= sum([x[d] for d in dep])

# Include required variables
for r in required:
    mod += x[r] == 1

# Solve
mod.solve()
for k in deps:
    print "Package", k, "used:", x[k].value()
This outputs the minimal set of packages:
Package A used: 1.0
Package C used: 0.0
Package B used: 1.0
Package E used: 1.0
Package H used: 0.0
Package Y used: 1.0
Package X used: 1.0
Package K used: 0.0
Package Z used: 0.0
For very large problem instances, this might take too long to solve. You could either accept a potentially sub-optimal solution using a timeout (see here) or you could move from the default open-source solvers to a commercial solver like gurobi or cplex, which will likely be much faster.
To add to Misandrist's answer: your problem is NP-hard (see dened's answer).
Edit: Here is a direct reduction of a Set Cover instance (U,S) to your "package problem" instance: make each point z of the ground set U an AND requirement for X. Make each set in S that covers a point z an OR requirement for z. Then the solution of the package problem gives the minimum set cover.
Equivalently, you can ask which satisfying assignment of a monotone boolean circuit has fewest true variables, see these lecture notes.
Since the graph consists of two different types of edges (AND and OR relationship), we can split the algorithm up into two parts: search all nodes that are required successors of a node and search all nodes from which we have to select one single node (OR).
Nodes hold a package, a list of nodes that must be successors of this node (AND), a list of list of nodes that can be successors of this node (OR) and a flag that marks on which step in the algorithm the node was visited.
define node: package p , list required , listlist optional ,
int visited[default=MAX_VALUE]
The main-routine translates the input into a graph and starts traversal at the starting node.
define searchMinimumP:
    input: package start , string[] constraints
    output: list
    //generate a graph from the given constraint
    //and save the node holding start as starting point
    node r = getNode(generateGraph(constraints) , start)
    //list all required nodes
    return requiredNodes(r , 0)
requiredNodes searches for all nodes that are required successors of a node (that are connected to n via AND-relation over 1 or multiple edges).
define requiredNodes:
    input: node n , int step
    output: list
    //generate a list of all nodes that MUST be part of the solution
    list rNodes
    list todo
    add(todo , n)
    while NOT isEmpty(todo)
        node next = remove(0 , todo)
        if NOT contains(rNodes , next) AND next.visited > step
            add(rNodes , next)
            next.visited = step
    addAll(rNodes , optionalMin(rNodes , step + 1))
    for node r in rNodes
        r.visited = step
    return rNodes
optionalMin searches for the shortest solution among all possible solutions for optional neighbours (OR). This algorithm is brute force (all possible selections for neighbours will be inspected).
define optionalMin:
    input: list nodes , int step
    output: list
    //find all possible combinations for selectable packages
    listlist optSeq
    for node n in nodes
        if NOT n.visited < step
            for list opt in n.optional
                add(optSeq , opt)
    //iterate over all possible combinations of selectable packages
    //for the given list of nodes and find the shortest solution
    list shortest
    int curLen = MAX_VALUE
    //search through all possible solutions (combinations of nodes)
    for list seq in sequences(optSeq)
        list subseq
        for node n in distinct(seq)
            addAll(subseq , requiredNodes(n , step + 1))
        if length(subseq) < curLen
            //mark all nodes of the old solution as unvisited
            for node n in shortest
                n.visited = MAX_VALUE
            curLen = length(subseq)
            shortest = subseq
        else
            //mark all nodes in this possible solution as unvisited
            //since they aren't used in the final solution (not at this place)
            for node n in subseq
                n.visited = MAX_VALUE
    for node n in shortest
        n.visited = step
    return shortest
The basic idea would be the following: Start from the starting node and search for all nodes that must be part of the solution (nodes that can be reached from the starting node by only traversing AND-relationships). Now for all of these nodes, the algorithm searches for the combination of optional nodes (OR) with the fewest nodes required.
NOTE: so far this algorithm isn't much better than brute-force. I'll update as soon as i've found a better approach.
My code is here.
Scenario:
Represent the constraints.
X : A&(E|C)
A : E&(Y|N)
E : B&(Z|Y)
C : A|K
Prepare two variables target and result.
Add the node X to target.
target = X, result=[]
Add single node X to the result.
Replace node X with its dependent in the target.
target = A&(E|C), result=[X]
Add single node A to result.
Replace node A with its dependent in the target.
target = E&(Y|N)&(E|C), result=[X, A]
Single node E must be true.
So (E|C) is always true.
Remove it from the target.
target = E&(Y|N), result=[X, A]
Add single node E to result.
Replace node E with its dependent in the target.
target = B&(Z|Y)&(Y|N), result=[X, A, E]
Add single node B to result.
Replace node B with its dependent in the target.
target = (Z|Y)&(Y|N), result=[X, A, E, B]
There are no single nodes any more.
Then expand the target expression.
target = Z&Y|Z&N|Y&Y|Y&N, result=[X, A, E, B]
Replace Y&Y to Y.
target = Z&Y|Z&N|Y|Y&N, result=[X, A, E, B]
Choose the term that has the smallest number of nodes.
Add all nodes in that term to the result.
target = , result=[X, A, E, B, Y]
I would suggest you first transform the graph into an AND-OR tree. Once done, you can perform a search in the tree for the best path (where you can choose what "best" means: shortest, lowest memory occupation of packages in the nodes, etc...).
A suggestion I'd make, being that the condition to install X would be something like install(X) = install(A) and (install(E) or install(C)), is to group the OR nodes (in this case: E and C) into a single node, say EC, and transform the condition in install(X) = install(A) and install(EC).
Alternatively, based on the AND-OR tree idea, you could create a custom AND-OR graph using the grouping idea. That way you could use an adaptation of a graph traversal algorithm, which could be more useful in certain scenarios.
Yet another solution could be to use Forward Chaining. You'd have to follow these steps:
Transform (just re-writing the conditions here):
A and (E or C) => X
E and (H or Y) => A
B and (Z or Y) => E
into
(A and E) or (A and C) => X
(E and H) or (E and Y) => A
(B and Z) or (B and Y) => E
Set X as goal.
Insert B, H, K, Y, Z as facts.
Run Forward chaining and stop on the first occurrence of X (the goal). That should be the shortest way to achieve the goal in this case (just remember to keep track of the facts that have been used).
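For illustration only, here is a minimal Java sketch of plain forward chaining over those rewritten rules (the data structures and names are mine, and on its own it only reports whether the goal becomes derivable; tracking which facts fired is left out):

import java.util.*;

public class ForwardChaining {
    public static void main(String[] args) {
        // Each head maps to its alternative bodies (conjunctions of prerequisites).
        Map<String, List<List<String>>> rules = new HashMap<>();
        rules.put("X", List.of(List.of("A", "E"), List.of("A", "C")));
        rules.put("A", List.of(List.of("E", "H"), List.of("E", "Y")));
        rules.put("E", List.of(List.of("B", "Z"), List.of("B", "Y")));

        Set<String> facts = new HashSet<>(List.of("B", "H", "K", "Y", "Z"));
        String goal = "X";

        boolean changed = true;
        while (changed && !facts.contains(goal)) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> rule : rules.entrySet()) {
                if (facts.contains(rule.getKey())) continue;
                for (List<String> body : rule.getValue()) {
                    if (facts.containsAll(body)) {   // one alternative is fully satisfied
                        facts.add(rule.getKey());
                        changed = true;
                        break;
                    }
                }
            }
        }
        System.out.println("Goal reached: " + facts.contains(goal));
    }
}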
Let me know if anything is unclear.
This is an example of a Constraint Satisfaction Problem. There are Constraint Solvers for many languages, even some that can run on generic 3SAT engines, and thus be run on GPGPU.
Another (fun) way to solve this issue is to use a genetic algorithm.
Genetic algorithms are powerful, but you have to tune a lot of parameters and find the best ones.
The genetic steps are the following:
a. Creation: a number of random individuals, the first generation
(for instance: 100)
b. Mutation: mutate a low percentage of them (for instance: 0.5%)
c. Rating: rate (also called fitness) all the individuals.
d. Reproduction: select (using the rates) pairs of them and create children (for instance: 2 children)
e. Selection: select parents and children to create a new generation (for instance: keep 100 individuals per generation)
f. Loop: go back to step "a" and repeat the whole process a number of times (for instance: 400 generations)
g. Pick: select an individual of the last generation with a maximal rate.
That individual will be your solution.
Here is what you have to decide:
Find a genetic code for your individuals
You have to represent a possible solution (called an individual) of your problem as a genetic code.
In your case, it could be a group of letters representing the nodes, respecting the "or" and "and" constraints.
For instance:
[ A E B Y ], [ A C K H ], [A E Z B Y] ...
Find a way to rate individuals
To know whether an individual is a good solution, you have to rate it, in order to compare it to other individuals.
In your case, it could be pretty easy: individual rate = number of nodes - number of nodes in the individual
For instance:
[ A E B Y ] = 8 - 4 = 4
[ A E Z B Y ] = 8 - 5 = 3
[ A E B Y ] has a better rate than [ A E Z B Y ]
Selection
Thanks to the individuals' rates, we can select pairs of them for reproduction.
For instance by using genetic algorithm roulette wheel selection.
Reproduction
Take a pair of individuals and create some (for instance 2) children (other individuals) from them.
For instance:
Take a node from the first one and swap it with a node of the second one.
Make some adjustments to fit the "or, and" constraints.
[ A E B Y ], [ A C K H ] => [ A C E H B Y ], [ A E C K B Y ]
Note: this is not a good way to reproduce, because the children are worse than the parents. Maybe we could swap a range of nodes instead.
Mutation
You just have to change the genetic code of the selected individuals.
For instance:
Delete a node
Make some adjustments to fit the "or, and" constraints.
As you can see, it is not hard to implement, but a lot of choices have to be made when designing it for a specific issue and to control the different parameters (percentage of mutation, rating system, reproduction system, number of individuals, number of generations, ...).

What is the best way to represent a tree in this case?

I am trying to solve this question: https://www.hackerrank.com/challenges/journey-to-the-moon I.e. a problem of finding the connected components of a graph. What I have is a list of vertices (from 0 to N-1), and each line in the standard input gives me a pair of vertices that are connected by an edge (i.e. if I have 1, 3 it means that vertex 1 and vertex 3 are in one connected component). My question is: what is the best way to store the input, i.e. how to represent my graph? My idea is to use an ArrayList of ArrayLists - each position in the outer list stores another ArrayList of adjacent vertices. This is the code:
public static List<ArrayList<Integer>> graph;
and then in the main() method:
graph = new ArrayList<ArrayList<Integer>>(N);
for (int j = 0; j < N; j++) {
    graph.add(new ArrayList<Integer>());
}
//then for each line in the standard input I fill the corresponding values in the array:
for (int j = 0; j < I; j++) {
    String[] line2 = br.readLine().split(" ");
    int a = Integer.parseInt(line2[0]);
    int b = Integer.parseInt(line2[1]);
    graph.get(a-1).add(b);
    graph.get(b-1).add(a);
}
I'm pretty sure that for solving the question I have to put vertex a at position b-1 and vertex b at position a-1, so this should not change. But what I am looking for is a better way to represent the graph.
Using Java's collections (ArrayList, for example) adds a lot of memory overhead. each Integer object will take at least 12 bytes, in addition to the 4 bytes required for storing the int.
Just use a huge single int array (let's call it edgeArray), which represents the adjacency matrix. Enter 1 when the cell corresponds to an edge, e.g., if nodes k and m are seen in the input, then cell (k, m) will have 1, else 0. In row-major order, that is the index k * N + m, i.e., edgeArray[k * N + m] = 1. You can choose either column-major or row-major order. But then your int array will be very sparse. It's trivial to implement a sparse array: just keep an array of the non-zero indices in sorted order. It should be in sorted order so that you can binary search. The number of elements will be on the order of the number of edges.
Of course, when you are building the adjacency matrix, you won't know how many edges are there. So you won't be able to allocate the array. Just use a hash set. Don't use HashSet, which is very inefficient. Look at IntOpenHashSet from fastutils. If you are not allowed to use libraries, implement one that is similar to that.
Let us say that the open hash set variable you will be using is called adjacencyMatrix. So if you see 3 and 2, and there are 10^6 nodes in total (N = 10^6), then you will just do
adjacencyMatrix.add(3 * 1000000 + 2);
Once you have processed all the inputs, then you can make the sparse adjacency matrix implementation above:
final int[] edgeArray = adjacencyMatrix.toIntArray(new int[adjacencyMatrix.size()]);
IntArrays.sort(edgeArray)
Given an node, finding all adjacent nodes:
So if you need all the nodes connected to node p, you would binary search for the next value that is greater than or equal to p * N (O(log (number of edges))). Then you will just traverse the array until you hit a value that is greater than or equal to (p + 1) * N. All the values you encounter will be nodes connected to p.
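A small sketch of that lookup (my own code; note that for N = 10^6 the packed key k * N + m can exceed the int range, so this sketch stores the sorted keys as long):

import java.util.ArrayList;
import java.util.List;

class SparseAdjacency {
    // Classic binary search for the first index whose value is >= key.
    static int lowerBound(long[] a, long key) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** All nodes m such that the packed key p * N + m appears in sortedKeys. */
    static List<Integer> neighbours(long[] sortedKeys, long N, int p) {
        List<Integer> result = new ArrayList<>();
        long lo = p * N, hi = (p + 1) * N;
        for (int i = lowerBound(sortedKeys, lo); i < sortedKeys.length && sortedKeys[i] < hi; i++)
            result.add((int) (sortedKeys[i] - lo));   // recover m from the packed key
        return result;
    }
}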
Comparing it with the approach you mentioned in your question:
It uses O(N*b) space, where N is the number of nodes and b is the branching factor. It's lower bounded by the number of edges.
For the approach I mentioned, the space complexity is just O(E). In fact it's exactly E integers plus the header for the int array.
I used var graph = new Dictionary<long, List<long>>();
See here for complete solution in c# - https://gist.github.com/newton3/a4a7b4e6249d708622c1bd5ea6e4a338
PS - 2 years but just in case someone stumbles into this.
