I'm looking for an algorithm (C/C++/Java, doesn't matter) which solves the following problem: find the shortest path between two nodes A and B of a graph, with the catch that the path must visit certain other given nodes (cities). A city can be visited more than once. Example of a path: A-H-D-C-E-F-G-F-B (where A is the source, B is the destination, and F and G are cities which must be visited).
I see this as a variation of the Traveling Salesman Problem, but I couldn't find or write a working algorithm based on my searches.
I was trying to find a solution starting from these topics but without any luck:
https://stackoverflow.com/questions/24856875/tsp-branch-and-bound-implementation-in-c and
Variation of TSP which visits multiple cities
An easy reduction of the problem to TSP would be:
1. For every pair (u,v) of nodes that must be visited (including A and B), find the shortest distance d(u,v) between them. This can be done efficiently with the Floyd-Warshall algorithm, which computes all-pairs shortest paths; a sketch follows these steps.
2. Create a new graph G' consisting only of those nodes, with all edges present, weighted with the distances calculated in step 1.
3. Run a standard TSP algorithm to solve the problem on the reduced graph.
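For step 1, here is a minimal Floyd-Warshall sketch (my own illustration; it assumes an adjacency matrix dist[][] where dist[i][j] holds the direct edge weight, dist[i][i] == 0, and missing edges hold a large-but-safe INF such as Long.MAX_VALUE / 4 so the sums below cannot overflow):

// After this runs, dist[u][v] is the shortest distance between every pair,
// which gives the edge weights of the reduced graph G'.
static void floydWarshall(long[][] dist) {
    int n = dist.length;
    for (int k = 0; k < n; k++)           // intermediate node
        for (int i = 0; i < n; i++)       // source
            for (int j = 0; j < n; j++)   // destination
                if (dist[i][k] + dist[k][j] < dist[i][j])
                    dist[i][j] = dist[i][k] + dist[k][j];
}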
I think that, in addition to amit's answer, you'll want to increase the cost of the edges that have A or B as an endpoint by a sufficient amount (the total edge cost of the graph + 1 would be enough) to ensure that you don't end up with a path that goes through A or B (instead of ending at A and B).
A--10--X--0--B
|      |     |
|      10    |
|      |     |
+---0--Y--0--+
The above case would result in a path from A to Y to B to X, unless you increase the cost of the A and B edges (by 21).
A--31--X--21--B
|      |      |
|      10     |
|      |      |
+--21--Y--21--+
Now it goes from A to X to Y to B.
Also make sure you remove any edges (A,B) (if they exist).
This is quite long, and I am sorry about that.
I have been trying to implement the MinHash LSH algorithm discussed in Chapter 3, using Spark (Java). I am using a toy problem like this:
+--------+------+------+------+------+
|element | doc0 | doc1 | doc2 | doc3 |
+--------+------+------+------+------+
| d | 1 | 0 | 1 | 1 |
| c | 0 | 1 | 0 | 1 |
| a | 1 | 0 | 0 | 1 |
| b | 0 | 0 | 1 | 0 |
| e | 0 | 0 | 1 | 0 |
+--------+------+------+------+------+
The goal is to identify, among these four documents (doc0, doc1, doc2 and doc3), which ones are similar to each other. Obviously, the only possible candidate pair is doc0 and doc3.
Using Spark's support, generating the following "characteristic matrix" is as far as I have gotten at this point:
+----+---------+-------------------------+
|key |value |vector |
+----+---------+-------------------------+
|key0|[a, d] |(5,[0,2],[1.0,1.0]) |
|key1|[c] |(5,[1],[1.0]) |
|key2|[b, d, e]|(5,[0,3,4],[1.0,1.0,1.0])|
|key3|[a, c, d]|(5,[0,1,2],[1.0,1.0,1.0])|
+----+---------+-------------------------+
Here are the code snippets:
import org.apache.spark.ml.feature.*;  // CountVectorizer, MinHashLSH, MinHashLSHModel
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// df is the four-document dataset shown above (columns: key, value)
CountVectorizer vectorizer = new CountVectorizer()
    .setInputCol("value").setOutputCol("vector").setBinary(false);
Dataset<Row> matrixDoc = vectorizer.fit(df).transform(df);

MinHashLSH mh = new MinHashLSH()
    .setNumHashTables(5)
    .setInputCol("vector")
    .setOutputCol("hashes");
MinHashLSHModel model = mh.fit(matrixDoc);
Now, there seem to be two main calls on the MinHashLSHModel that one can use: model.approxSimilarityJoin(...) and model.approxNearestNeighbors(...). Examples of using these two calls are here: https://spark.apache.org/docs/latest/ml-features.html#lsh-algorithms
On the other hand, model.approxSimilarityJoin(...) requires two datasets to join, and I have only one dataset of 4 documents in which I want to find the mutually similar ones, so I don't have a second dataset to join... Just to try it out, I joined my only dataset with itself. Based on the result, model.approxSimilarityJoin(...) seems to have just done a pairwise Jaccard calculation, and I don't see any impact from changing the number of hash functions etc., which left me wondering where exactly the MinHash signatures were calculated and where the band/row partitioning happened...
The other call, model.approxNearestNeighbors(...), asks for a comparison point, and then the model identifies the nearest neighbor(s) to that given point... Obviously this is not what I want either, since I have four toy documents and no extra reference point.
Running out of ideas, I went ahead and implemented my own version of the algorithm using the Spark APIs, but with little support from MinHashLSHModel, which really made me feel bad. I am thinking I must have missed something...?
I would love to hear any thoughts; I really wish to solve the mystery.
Thank you guys in advance!
The MinHash signature calculation happens in model.approxSimilarityJoin(...) itself, where the model.transform(...) function is called on each of the input datasets and the hash signatures are calculated before joining them and doing a pairwise Jaccard distance calculation. So the impact of changing the number of hash functions can be seen here.
In model.approxNearestNeighbors(...), the impact of the same can be seen while creating the model using the minHash.fit(...) function, in which transform(...) is called on the input dataset.
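So, for the toy problem above, the self-join is the intended usage. A minimal sketch, reusing the model and matrixDoc from the question (the 0.6 threshold and the "JaccardDistance" column name are arbitrary choices of mine):

// All pairs whose Jaccard distance is below 0.6; filtering on key order
// drops self-pairs and mirrored duplicates.
model.approxSimilarityJoin(matrixDoc, matrixDoc, 0.6, "JaccardDistance")
     .filter("datasetA.key < datasetB.key")
     .select("datasetA.key", "datasetB.key", "JaccardDistance")
     .show();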
Hi, this is my first time posting here.
I have been trying to work through a question while studying, but haven't been able to figure it out:
We consider the forest implementation of the disjoint-set abstract data type, with Weighted Union by size and Path Compression. Initially, each element is in a one-node tree.
Starting from the above initial state:
give a (short) sequence of UNION and FIND operations in which the last operation is a UNION that causes a taller tree A to become a subtree of a shorter tree B (i.e. the height of A is strictly larger than the height of B).
Show the two trees A and B that the last UNION merges.
Hint: You can start from n = 9 elements, each one in a one-node tree.
I'm not sure how that would work, since the smaller tree always gets merged into the larger tree because of union by size?
Thanks.
I don't want to answer your homework, but this question is old enough that your semester is likely over, and in any case a hint should help enough.
There's a distinction between union by size and union by height, primarily because of path compression. Specifically, path compression can result in very high-degree nodes, and thus trees with many nodes but very small height. For example, these are two trees you can create with union-find with path compression:
T1:    o     (n=5, h=2)
     / | | \
    o  o o  o

T2:    o     (n=4, h=3)
      /|
     o o
     |
     o
If the next operation is a merge of these two trees, the "union by size" and "union by height" rules would select different parents (note that T2 is taller, yet has fewer nodes).
In practice, "union by rank" is usually used. Rank is an upper bound on height which can be updated in constant time, and it yields the best asymptotic running time. A web search will turn up many good explanations of that algorithm.
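If you want to experiment, here is a minimal sketch of the structure the exercise describes (union by size plus path compression); the class and method names are my own:

class DisjointSet {
    private final int[] parent, size;

    DisjointSet(int n) {                  // n one-node trees
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) { parent[i] = i; size[i] = 1; }
    }

    int find(int x) {                     // with path compression
        if (parent[x] != x) parent[x] = find(parent[x]);
        return parent[x];
    }

    void union(int a, int b) {            // weighted union by size, not height
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        if (size[ra] < size[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;                  // smaller tree hangs under larger
        size[ra] += size[rb];
    }
}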
Possible duplicate: How to find list of possible words from a letter matrix [Boggle Solver]
I have a String[][] array such as
h,b,c,d
e,e,g,h
i,l,k,l
m,l,o,p
I need to match an ArrayList of words against this array to find the words it contains. When searching for the word hello, I need to get a positive match plus the locations of the letters, which in this case would be (0,0), (1,1), (2,1), (3,1) and (3,2).
When going letter by letter, suppose we have successfully located the first l. The program should then try to find the next letter (the second l) in the positions adjacent to it, horizontally, vertically and diagonally, so it should match against e, e, g, k, o, l, m and i. The same position can't be used in a word twice, so (0,0), (1,1), (2,1), (2,1) and (3,2) wouldn't be acceptable, because the position (2,1) is matched twice. Both l's in the grid can be used, since diagonal moves are allowed, but the second l of the word must be matched to the other grid l, because a position cannot be used more than once.
This case should also be matched:
h,b,c,d
e,e,g,h
l,l,k,l
m,o,f,p
If we suppose that we try to search for helllo, it won't match: neither (x1,y1) (x1,y1) nor (x1,y1) (x2,y2) (x1,y1) can be matched.
What I want to know is the best way to implement this kind of feature. If I have a 4x4 String[][] array and 100,000 words in an ArrayList, what is the most efficient and easiest way to do this?
I think you will probably spend most of your time trying to match words that can't possibly be built by your grid. So, the first thing I would do is try to speed up that step and that should get you most of the way there.
I would re-express the grid as a table of possible moves that you index by letter. Start by assigning each letter a number (usually A=0, B=1, C=2, and so forth). For your example, let's just use the alphabet of the letters you have (in the second grid, where the last row reads m, o, f, p):
b | c | d | e | f | g | h | k | l | m | o | p
---+---+---+---+---+---+---+---+---+---+----+----
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
Then you make a 2D boolean array that tells whether you have a certain letter transition available:
     |  0  1  2  3  4  5  6  7  8  9 10 11   <- from letter
     |  b  c  d  e  f  g  h  k  l  m  o  p
-----+------------------------------------
 0 b |     T     T     T  T
 1 c |  T     T  T     T  T
 2 d |     T           T  T
 3 e |  T  T     T     T  T  T  T
 4 f |                       T  T     T  T
 5 g |  T  T  T  T        T  T  T
 6 h |  T  T  T  T     T     T  T
 7 k |           T  T  T  T     T     T  T
 8 l |           T  T  T  T  T  T  T  T  T
 9 m |                          T     T
10 o |              T        T  T  T
11 p |              T        T  T
  ^
  to letter
Now go through your word list and convert the words to transitions (you can pre-compute this):
hello (6, 3, 8, 8, 10):
6 -> 3, 3 -> 8, 8 -> 8, 8 -> 10
Then check if these transitions are allowed by looking them up in your table:
[6][3] : T
[3][8] : T
[8][8] : T
[8][10] : T
If they are all allowed, there's a chance that this word might be found.
For example the word "helmet" can be ruled out on the 4th transition (m to e: helMEt), since that entry in your table is false.
And the word hamster can be ruled out, since the first (h to a) transition is not allowed (doesn't even exist in your table).
Now, for the remaining words that you didn't eliminate, try to actually find them in the grid the way you're doing it now, or as suggested in some of the other answers here. This is to avoid false positives that result from jumps between identical letters in your grid. For example, the word "help" is allowed by the table, but not by the grid.
Let me know when your boggle phone-app is done! ;)
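Here is a minimal sketch of that pre-filter (my own illustration; the class name and the assumption of a lowercase a-z alphabet are mine, and surviving words still need the real grid search described above):

import java.util.*;

class TransitionFilter {
    private final boolean[][] allowed = new boolean[26][26];
    private final boolean[] present = new boolean[26];

    TransitionFilter(char[][] grid) {
        int rows = grid.length, cols = grid[0].length;
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++) {
                present[grid[r][c] - 'a'] = true;
                // mark the letters of all 8 neighbors as reachable from this letter
                for (int dr = -1; dr <= 1; dr++)
                    for (int dc = -1; dc <= 1; dc++) {
                        int nr = r + dr, nc = c + dc;
                        if ((dr != 0 || dc != 0) && nr >= 0 && nr < rows && nc >= 0 && nc < cols)
                            allowed[grid[r][c] - 'a'][grid[nr][nc] - 'a'] = true;
                    }
            }
    }

    // false means the word cannot possibly be traced in the grid
    boolean mightMatch(String word) {
        if (word.isEmpty() || !present[word.charAt(0) - 'a']) return false;
        for (int i = 1; i < word.length(); i++)
            if (!allowed[word.charAt(i - 1) - 'a'][word.charAt(i) - 'a']) return false;
        return true;
    }
}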
Although I am sure there is a beautiful and academically efficient answer to this question, you can use the same approach, but with a list of possibilities. So, for the word 'hello', when you find the letter 'h', you next add the possible 'e' letters, and so on. Every possibility forms a path of letters.
I would start by thinking of your grid as a graph, where each grid position is a node and each node connects to its eight neighbors (you shouldn't need to explicitly code it as a graph, however). Once you find the potential starting letters, all you need to do is a depth-first search of the graph from each start position. The key is to remember where you've already searched, so you don't end up making more work for yourself (or worse, getting stuck in a cycle).
Depending on the size of the character space being used, you might also benefit from building lookup tables. Let's assume English (26 contiguous character codepoints); if you start by building a 26-element List<Point>[] array, you can populate that array once from your grid and then quickly get a list of start locations for any word. For example, to get the locations of h I would write arr['h'-'a'].
You can even leverage this further if you apply the same strategy and build lookup tables for each edge list in the graph. Instead of having to search all 8 edges for each node, you already know which edges to search (if any).
(One note - if your character space is non-contiguous, you can still do a lookup table, but you'll need to use a HashMap<Character,List<Point>> and map.get('h') instead.)
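A minimal sketch of that lookup table, under the same contiguous-alphabet assumption (the method name is mine):

import java.awt.Point;
import java.util.ArrayList;
import java.util.List;

// One list of grid positions per letter; arr['h'-'a'] gives all h locations.
@SuppressWarnings("unchecked")
static List<Point>[] buildStartIndex(char[][] grid) {
    List<Point>[] arr = new List[26];
    for (int i = 0; i < 26; i++) arr[i] = new ArrayList<>();
    for (int r = 0; r < grid.length; r++)
        for (int c = 0; c < grid[r].length; c++)
            arr[grid[r][c] - 'a'].add(new Point(r, c));
    return arr;
}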
One approach to investigate is to generate all the possible sequences of letters (strings) from the grid, then check if each word exists in this set of strings, rather than checking each word against the grid. E.g. starting at h in your first grid:
h
hb
he
he // duplicate, but different path
hbc
hbg
hbe
hbe // ditto
heb
hec
heg
...
This is only likely to be faster for very large lists of words because of the overhead of generating the sequences. For small lists of words it would be much faster to test them individually against the grid.
You would either need to store the entire path (including coordinates) or have a separate step to work out the path for the words that match. Which is faster will depend on the hit rate (i.e. what proportion of input words you actually find in the grid).
Depending on what you need to achieve, you could perhaps compare the sequences against a list of dictionary words to eliminate the non-words before beginning the matching.
Update 2: In the linked question there are several working, fast solutions that generate sequences from the grid, deepening recursively to build longer sequences. However, they test these against a trie generated from the word list, which lets them abandon a subtree of sequences early; this prunes the search and greatly improves efficiency. It has a similar effect to the transition filtering suggested by Markus.
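As a rough illustration of that pruning idea (my own sketch, not code from the linked question; it uses a HashSet of all word prefixes instead of a real trie, which gives the same early abandonment at a higher memory cost):

import java.util.*;

class GridSearch {
    static Set<String> findWords(char[][] grid, Set<String> words) {
        Set<String> prefixes = new HashSet<>();
        for (String w : words)
            for (int i = 1; i <= w.length(); i++)
                prefixes.add(w.substring(0, i));

        Set<String> found = new HashSet<>();
        boolean[][] used = new boolean[grid.length][grid[0].length];
        for (int r = 0; r < grid.length; r++)
            for (int c = 0; c < grid[0].length; c++)
                dfs(grid, r, c, "", used, prefixes, words, found);
        return found;
    }

    private static void dfs(char[][] grid, int r, int c, String path, boolean[][] used,
                            Set<String> prefixes, Set<String> words, Set<String> found) {
        if (r < 0 || r >= grid.length || c < 0 || c >= grid[0].length || used[r][c]) return;
        String next = path + grid[r][c];
        if (!prefixes.contains(next)) return;   // abandon this whole subtree early
        if (words.contains(next)) found.add(next);
        used[r][c] = true;                      // a position may be used only once per word
        for (int dr = -1; dr <= 1; dr++)
            for (int dc = -1; dc <= 1; dc++)
                if (dr != 0 || dc != 0)
                    dfs(grid, r + dr, c + dc, next, used, prefixes, words, found);
        used[r][c] = false;                     // backtrack
    }
}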
I am using a slightly modified Dijkstra algorithm in my app, but it's quite slow, and I know there has to be a much better approach. My input data are bus stops with specified travel times between each other (~400 nodes and ~800 paths, max result depth = 4, i.e. at most 4 bus changes, otherwise no route).
Input data (bus routes) :
bus_id | location-from | location-to | travel-time | calendar_switch_for_today
XX     | A             | B           | 12          | 1
XX     | B             | C           | 25          | 1
YY     | C             | D           | 5           | 1
ZZ     | A             | D           | 15          | 0
dijkstraResolve(A,D, '2012-10-10') -> (XX,A,B,12),(XX,B,C,25),(YY,C,D,5)
=> one bus change, 3 bus stops to final destination
* A->D can't be used, as its calendar switch is OFF
As you can imagine, in more complicated graphs, where e.g. a main city (node) has 170 connections to different cities, Dijkstra is slower (more than 5 seconds), because it computes all the neighbours one by one instead of "trying" to reach the target destination some other way...
Could you recommend any other algorithm which would fit well?
I was looking on :
http://xlinux.nist.gov/dads//HTML/bellmanford.html (is it faster?)
http://jboost.sourceforge.net/examples.html (I do not see a straightforward example here...)
It would be great to have (just optional things):
- an option to prefer the minimal number of bus changes or the minimal time
- an option to look at alternative ways (if the travel time is similar)
Thank you for the tips.
Sounds like you're looking for A*. It's a variant of Dijkstra's algorithm which uses a heuristic to speed up the search. Under certain reasonable assumptions, A* is the fastest optimal algorithm. Just make sure to always break ties towards the endpoint.
There are also variants of A* which can provide near-optimal paths in much shorter time. See for example here and here.
Bellman-Ford (as suggested in your question) tends to be slower than either Dijkstra's or A*; it is primarily used when there are negative edge weights, which there are not here.
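To give a feel for it, here is a minimal A* sketch (my own illustration; it assumes nonnegative travel times, integer node ids, and some admissible heuristic h, e.g. a lower bound on the remaining travel time; it returns an empty list when no route exists):

import java.util.*;

class AStar {
    // adj: node -> list of {neighbor, weight}
    static List<Integer> shortestPath(Map<Integer, List<int[]>> adj, int start, int goal,
                                      java.util.function.IntToDoubleFunction h) {
        Map<Integer, Double> g = new HashMap<>();       // best known cost from start
        Map<Integer, Integer> parent = new HashMap<>();
        PriorityQueue<double[]> open =                  // entries: {f = g + h, node}
                new PriorityQueue<>(Comparator.comparingDouble(e -> e[0]));
        g.put(start, 0.0);
        open.add(new double[]{h.applyAsDouble(start), start});
        while (!open.isEmpty()) {
            int u = (int) open.poll()[1];
            if (u == goal) {                            // reconstruct the route
                LinkedList<Integer> path = new LinkedList<>();
                for (Integer n = goal; n != null; n = parent.get(n)) path.addFirst(n);
                return path;
            }
            for (int[] e : adj.getOrDefault(u, List.of())) {
                double cand = g.get(u) + e[1];
                if (cand < g.getOrDefault(e[0], Double.POSITIVE_INFINITY)) {
                    g.put(e[0], cand);
                    parent.put(e[0], u);
                    open.add(new double[]{cand + h.applyAsDouble(e[0]), e[0]});
                }
            }
        }
        return List.of();                               // no route
    }
}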
Maybe A* algorithm? See: http://en.wikipedia.org/wiki/A-star_algorithm
Maybe contraction hierarchies? See: http://en.wikipedia.org/wiki/Contraction_hierarchies.
Contraction hierarchies are implemented by the very nice, very fast Open Source Routing Machine (OSRM):
http://project-osrm.org/
and by OpenTripPlanner:
http://opentripplanner.com/
A* is implemented by a number of routing systems. Just do a search with Google.
OpenTripPlanner is a multi-modal routing system and, as far as I can see, should be very similar to your project.
The A* algorithm would be great for this; it achieves better performance by using heuristics.
Here is a simple tutorial to get you started: Link
I am looking for an efficient way to solve the following problem.
List 1 is a list of records that are identified by a primitive triplet:
X | Y | Z
List 2 is a list of records that are identified by three sets: one of Xs, one of Ys, one of Zs. The X, Y, Zs are of the same 'type' as those in list one, so they are directly comparable with one another.
Set(X) | Set(Y) | Set(Z)
For an item in list 1 I need to find all the items in list 2 where the X, Y, Z from list 1 all occur in their corresponding sets in list 2. This is best demonstrated by an example:
List 1:
X1, Y1, Z1
List 2:
(X1, X2) | (Y1) | (Z1, Z3)
(X1) | (Y1, Y2) | (Z1, Z2, Z3)
(X3) | (Y1, Y3) | (Z2, Z3)
In the above, the item in list 1 would match the first two items in list 2. The third item would not be matched as X1 does not occur in the X set, and Z1 does not occur in the Z set.
I have written a functionally correct version of the algorithm, but am concerned about performance on larger data sets. Both lists are very large, so iterating over list 1 and then performing an iteration over list 2 for each item is going to be very inefficient.
I tried to build an index by de-normalizing each item in list 2 into a map, but the number of index entries per item is proportional to the size of the item's subsets. As such, this uses a very large amount of memory and also takes significant resources to build.
Can anyone suggest an optimal way of solving this? I'm happy to consider both memory-optimal and CPU-optimal solutions, but striking a balance would be nice!
There are going to be a lot of ways to approach this. Which is right depends on the data and how much memory is available.
One simple technique is to build a table from list2, to accelerate the queries coming from list1.
from collections import defaultdict

# Build "hits". hits[0] is a table of, for each x,
# which items in list2 contain it. Likewise hits[1]
# is for y and hits[2] is for z.
hits = [defaultdict(set) for i in range(3)]
for rowid, row in enumerate(list2):
    for i in range(3):
        for v in row[i]:
            hits[i][v].add(rowid)

# For each row of list1, query the table to find which
# items in list2 contain all three values.
for x, y, z in list1:
    print(hits[0][x].intersection(hits[1][y], hits[2][z]))
If the total size of the sets is not too large, you could try to model list 2 as bitfields. The structure will probably be quite fragmented, though; maybe the structures referenced in the Wikipedia article on bit arrays (Judy arrays, tries, Bloom filters) can help address the memory problems of your normalization approach.
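A rough sketch of the bitfield idea (my own illustration; all names are made up): keep one BitSet of list2 row indices per distinct value, so that a query for (x, y, z) is just two ANDs:

import java.util.*;

class BitIndex {
    final Map<String, BitSet> byX = new HashMap<>(), byY = new HashMap<>(), byZ = new HashMap<>();

    void add(int rowId, Set<String> xs, Set<String> ys, Set<String> zs) {
        for (String x : xs) byX.computeIfAbsent(x, k -> new BitSet()).set(rowId);
        for (String y : ys) byY.computeIfAbsent(y, k -> new BitSet()).set(rowId);
        for (String z : zs) byZ.computeIfAbsent(z, k -> new BitSet()).set(rowId);
    }

    BitSet query(String x, String y, String z) {
        BitSet r = (BitSet) byX.getOrDefault(x, new BitSet()).clone();
        r.and(byY.getOrDefault(y, new BitSet()));   // rows whose Y set contains y
        r.and(byZ.getOrDefault(z, new BitSet()));   // ...and whose Z set contains z
        return r;                                   // set bits = matching list2 rows
    }
}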
You could build a tree out of List2: the first level of the tree is the first of (X1..Xn) that appears in the X set; the second level holds the possible second values, plus an EOF leaf pointing to the set of lines that contain only X1; the next level contains the next possible value, and so on.
Root --+--X1--+--EOF--> List of pointers to list2 lines containing only "X1"
       |      |
       |      +--X2---+--EOF--> List of pointers to list2 lines containing only "X1,X2"
       |      |       |
       |      |       +--X3--+--etc--
       |      |
       |      +--X3---+--EOF--> "X1,X3"
       |
       +--X2--+--EOF--> "X2"
       |      |
       |      +--X3---+--EOF--> "X2,X3"
       |      |       |
      ...
This is expensive in memory consumption (N^2 log K, I think, where N = the number of possible values for X and K = the number of lines in List2) but results in fast retrieval times. If the number of possible Xs is large, then this approach breaks down...
Obviously you could build this index for all 3 parts of the tuple, and then AND together the results from searching each tree.
There's a fairly efficient way to do this with a single pass over list2. You start by building an index of the items in list1.
from collections import defaultdict

# index is HashMap<X, HashMap<Y, HashMap<Z, Integer>>>
index = defaultdict(lambda: defaultdict(dict))
for rowid, (x, y, z) in enumerate(list1):
    index[x][y][z] = rowid

for rowid2, (xs, ys, zs) in enumerate(list2):
    # xhits[y] collects the z-maps of every index entry whose x occurs in xs
    xhits = defaultdict(list)
    for x in xs:
        if x in index:
            for y, zmap in index[x].items():
                xhits[y].append(zmap)

    # yhits[z] collects the list1 row ids that survived the x and y tests
    yhits = defaultdict(list)
    for y in ys:
        if y in xhits:
            for zmap in xhits[y]:
                for z, rowid1 in zmap.items():
                    yhits[z].append(rowid1)

    for z in zs:
        if z in yhits:
            for rowid1 in yhits[z]:
                print("list1[%d] matches list2[%d]" % (rowid1, rowid2))
The extra bookkeeping here will probably make it slower than indexing list2. But since in your case list1 is typically much smaller than list2, this will use much less memory. If you're reading list2 from disk, with this algorithm you never need to keep any part of it in memory.
Memory access can be a big deal, so I can't say for sure which will be faster in practice; you'd have to measure. The worst-case time complexity in both cases, barring hash table malfunctions, is O(len(list1) * len(list2)).
How about using a HashSet (or HashSets) for list 2? This way you would only need to iterate over list 1.
If you use Guava, there is a high-level way to do this that is not necessarily optimal but doesn't do anything crazy:
List<SomeType> list1 = ...;
List<Set<SomeType>> candidateFromList2 = ...;
if (Sets.cartesianProduct(candidateFromList2).contains(list1)) { ... }
But it's not that hard to check this "longhand" either.