I am using a slightly modified Dijkstra algorithm in my app, but it's quite slow and I know there has to be a much better approach. My input data are bus stops with specified travel times between each other (~400 nodes and ~800 edges, max. result depth = 4, i.e. at most 4 bus changes, otherwise no result).
Input data (bus routes):
bus_id | location-from | location-to | travel-time | calendar_switch_for_today
XX     | A             | B           | 12          | 1
XX     | B             | C           | 25          | 1
YY     | C             | D           | 5           | 1
ZZ     | A             | D           | 15          | 0
dijkstraResolve(A,D, '2012-10-10') -> (XX,A,B,12),(XX,B,C,25),(YY,C,D,5)
=> one bus change, 3 bus stops to final destination
* A->D can't be used because its calendar switch is OFF
As you can imagine, in more complicated graphs, e.g. where the main city (node) has 170 connections to other cities, Dijkstra gets slow (more than 5 seconds), because it expands all neighbours one by one instead of "trying" to reach the target destination in a more directed way...
Could you recommend any other algorithm that would fit well?
I was looking at:
http://xlinux.nist.gov/dads//HTML/bellmanford.html (is it faster?)
http://jboost.sourceforge.net/examples.html (I do not see a straightforward example here...)
It would be great to have (just optional):
- an option to prefer the minimal number of bus changes or the minimal travel time
- an option to look at alternative ways (if the travel time is similar)
Thank you for any tips.
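For reference, a minimal sketch of how rows like the table above could be loaded into an adjacency list, dropping edges whose calendar switch is off; the BusEdge/BusGraph names are made up for this sketch and are not the actual app code:

import java.util.*;

class BusEdge {
    String busId;
    String from;
    String to;
    int travelTime;
    boolean runsToday;   // calendar_switch_for_today
    BusEdge(String busId, String from, String to, int travelTime, boolean runsToday) {
        this.busId = busId; this.from = from; this.to = to;
        this.travelTime = travelTime; this.runsToday = runsToday;
    }
}

class BusGraph {
    // stop name -> outgoing edges that run on the query date
    Map<String, List<BusEdge>> adjacency = new HashMap<>();

    void add(BusEdge e) {
        if (!e.runsToday) return;   // e.g. the ZZ A->D row above is dropped here
        adjacency.computeIfAbsent(e.from, k -> new ArrayList<>()).add(e);
    }
}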
Sounds like you're looking for A*. It's a variant of Dijkstra's which uses a heuristic to speed up the search. Under certain reasonable assumptions, A* is the fastest optimal algorithm. Just make sure to always break ties towards the endpoint.
There are also variants of A* which can provide near-optimal paths in much shorter time. See for example here and here.
Bellman-Ford (as suggested in your question) tends to be slower than either Dijkstra's or A* - it is primarily used when there are negative edge weights, which there are not here.
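To make the suggestion concrete, here is a minimal A* sketch (not your code); it assumes an integer-indexed adjacency list and an admissible heuristic h, e.g. straight-line distance divided by the fastest bus speed, or simply 0, which degrades to plain Dijkstra. The bus-change limit and calendar filtering are left out for brevity.

import java.util.*;
import java.util.function.IntBinaryOperator;

class AStar {
    static class Edge { int to; int cost; Edge(int to, int cost) { this.to = to; this.cost = cost; } }

    // Returns the cheapest travel time from start to goal, or -1 if unreachable.
    static int shortestPath(List<List<Edge>> graph, int start, int goal, IntBinaryOperator h) {
        int n = graph.size();
        int[] g = new int[n];                              // best known cost from start
        Arrays.fill(g, Integer.MAX_VALUE);
        g[start] = 0;
        // Queue entries are {node, f = g + h, g}; ties on f are broken towards
        // the larger g, i.e. towards the node that is closer to the goal.
        PriorityQueue<int[]> open = new PriorityQueue<>(
                Comparator.<int[]>comparingInt(e -> e[1]).thenComparingInt(e -> -e[2]));
        open.add(new int[]{start, h.applyAsInt(start, goal), 0});
        while (!open.isEmpty()) {
            int[] cur = open.poll();
            int node = cur[0];
            if (node == goal) return g[goal];
            if (cur[2] > g[node]) continue;                // stale queue entry
            for (Edge e : graph.get(node)) {
                int tentative = g[node] + e.cost;
                if (tentative < g[e.to]) {
                    g[e.to] = tentative;
                    open.add(new int[]{e.to, tentative + h.applyAsInt(e.to, goal), tentative});
                }
            }
        }
        return -1;
    }
}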
Maybe A* algorithm? See: http://en.wikipedia.org/wiki/A-star_algorithm
Maybe contraction hierarchies? See: http://en.wikipedia.org/wiki/Contraction_hierarchies.
Contraction hierarchies are implemented by the very nice, very fast Open Source Routing Machine (OSRM):
http://project-osrm.org/
and by OpenTripPlanner:
http://opentripplanner.com/
A* is implemented by a number of routing systems. Just do a search with Google.
OpenTripPlanner is a multi-modal routing system and, as far as I can see, should be very similar to your project.
The A* algorithm would be great for this; it achieves better performance by using heuristics.
Here is a simple tutorial to get you started: Link
I am doing a project in argument mining. One of the tasks is classifying strings as PREM(ise), CONC(lusion) or M(ajor)CONC(lusion). I am working with the AAEC dataset and have a few thousand features per vector.
For the task I employ a C-SVM with a polynomial kernel, implemented in LibSVM and accessed through WEKA.
I am performing a grid search for the best C and gamma (without cross-validation; it's custom code I wrote that trains an SVM on a subset of the data and prints its results), trying the ranges 10^-5 to 10^5 and 2^-15 to 2^3, respectively. I am also printing out the results on the training set and on the test set.
I either get everything classified as a in both confusion matrices, or this:
Confusion matrix (on training set)
a b c <-- classified as
416 0 0 | a = PREM
8 169 0 | b = CONC
5 0 80 | c = MCONC
Confusion matrix (on test set)
a b c <-- classified as
107 1 0 | a = PREM
40 0 0 | b = CONC
16 0 0 | c = MCONC
I am not too familiar with SVMs and I am not sure whether this is supposed to be normal or anomalous. Intuitively it seems unlikely that the data is so well separable in the training set yet the result is completely off on the test set.
I am not sure how to proceed. Is this a result of not having optimal C and gamma, or of the data not being descriptive enough, or is it potentially a sign of a more hidden problem (e.g. filtering mistakes, overfitting)?
Advice would be appreciated, thanks!
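For comparison, a minimal sketch of a cross-validated grid search in WEKA; it uses the built-in SMO classifier with a PolyKernel instead of LibSVM so the snippet stays self-contained, searches only C, and the file name and kernel exponent are just placeholders. With LibSVM you would loop over gamma as well.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GridSearchCV {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("aaec-train.arff");   // placeholder path
        train.setClassIndex(train.numAttributes() - 1);

        double bestC = 1.0, bestAccuracy = 0.0;
        for (double c : new double[]{1e-3, 1e-2, 1e-1, 1, 10, 100, 1000}) {
            SMO svm = new SMO();
            svm.setC(c);
            PolyKernel kernel = new PolyKernel();
            kernel.setExponent(2.0);                             // example degree only
            svm.setKernel(kernel);
            // 10-fold cross-validation on the training set only, so the
            // test set never influences the parameter choice.
            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(svm, train, 10, new Random(1));
            if (eval.pctCorrect() > bestAccuracy) {
                bestAccuracy = eval.pctCorrect();
                bestC = c;
            }
        }
        System.out.println("best C = " + bestC + ", CV accuracy = " + bestAccuracy + "%");
    }
}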
Assume you have something like
class Person {
LocalDate bornOn;
LocalDate diedOn;
}
Let's say you have a bunch of Person instances that you can store any way you like.
What's the best way of writing an efficient function that can list all people that were alive at a given time?
The data structure should be efficiently mutable as well, in particular in terms of adding new elements.
I.e. conceptually something like
List<Person> alive(List<Person> people, LocalDate date) {
    return people.stream()
            .filter(x -> x.bornOn.compareTo(date) <= 0 && x.diedOn.compareTo(date) > 0)
            .collect(Collectors.toList());
}
Only more efficient.
My first gut feeling would be to have two NavigableMaps
NavigableMap<LocalDate, Person> peopleSortedByBornOn;
NavigableMap<LocalDate, Person> peopleSortedByDiedOn;
each could be queried with headMap() / tailMap() of the given date, and the intersection of those queries would be the result.
Is there any faster, or significantly more convenient solution though? Maybe even some sort of widely used Java collection/map type that'd support such an operation?
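A minimal sketch of that two-map idea, assuming the Person class above; the values are lists because several people can share a birth or death date (a plain NavigableMap<LocalDate, Person> would silently overwrite entries):

import java.time.LocalDate;
import java.util.*;

class PeopleIndex {
    private final NavigableMap<LocalDate, List<Person>> byBornOn = new TreeMap<>();
    private final NavigableMap<LocalDate, List<Person>> byDiedOn = new TreeMap<>();

    void add(Person p) {
        byBornOn.computeIfAbsent(p.bornOn, k -> new ArrayList<>()).add(p);
        byDiedOn.computeIfAbsent(p.diedOn, k -> new ArrayList<>()).add(p);
    }

    List<Person> alive(LocalDate date) {
        // Born on or before the given date ...
        Set<Person> candidates = new HashSet<>();
        byBornOn.headMap(date, true).values().forEach(candidates::addAll);
        // ... intersected with those who died strictly after it.
        Set<Person> diedLater = new HashSet<>();
        byDiedOn.tailMap(date, false).values().forEach(diedLater::addAll);
        candidates.retainAll(diedLater);
        return new ArrayList<>(candidates);
    }
}

Note that building the intersection still touches every candidate, so this is only faster than the plain stream filter when the date cuts off a large part of the population.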
I would like to mention geometric data structures, like quad trees, for theoretical purposes. Treat (born, died) as point coordinates, with died >= born.
d          b = d
|      :  /
|  +   : /
|      :/
D______/
|     /:
|    / :
|   /  :
|  /   :
| /    :
|/_____:_________ b
       D
The points all lie in the upper triangle, and + marks the rectangular region containing the people alive at date D. The rectangle is open ended to the left and at the top.
A geometric data structure could do the job, and there are databases that can handle such geometric queries.
I would love to see an implementation, though I would not bet on a speed advantage, except maybe with huge numbers of people.
Given the clarified constraints, I would keep it simple and use a map holding references to the people alive on a given day, effectively creating an index.
Map<LocalDate,LinkedList<Person>> aliveMap;
A put would cost O(1) for the map and O(1) for the LinkedList.
A get, on the other hand, is as good as it gets: O(1) (assuming a good hash function).
Memory-wise, you would incur the cost of the extra references, which might be significant (~80 years x 365 days x 8 bytes on an x64 VM, or 233,600 bytes per person).
This approach yields optimal performance for get operations, probably the worst memory footprint, and average performance for puts.
Variation:
Instead of creating the full index, you can create buckets, e.g. yearly ones, where you first get everyone alive in a given year and then filter out the dead.
Map<Integer,LinkedList<Person>> aliveMap;
NB: I assume that your data spans hundreds of years and does not cover the whole population (7.5 billion). If you were only looking at a 50-100 year window, there may be more efficient specialisations.
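A minimal sketch of the per-day index, again assuming the Person class from the question; note that indexing one person touches one bucket per day of their lifetime:

import java.time.LocalDate;
import java.util.*;

class AliveIndex {
    // One bucket per calendar day; each person is referenced from every
    // day of their lifetime, so a lookup is a single map get.
    private final Map<LocalDate, LinkedList<Person>> aliveMap = new HashMap<>();

    void add(Person p) {
        for (LocalDate d = p.bornOn; d.isBefore(p.diedOn); d = d.plusDays(1)) {
            aliveMap.computeIfAbsent(d, k -> new LinkedList<>()).add(p);
        }
    }

    List<Person> alive(LocalDate date) {
        return aliveMap.getOrDefault(date, new LinkedList<>());
    }
}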
The only way I can think of to make that more efficient is to create your own custom data structure. For example, create your own HashMap in Java in which you override the "put" method. This way, when you insert a Person object into the map, you will know from the moment of insertion whether it is alive or dead.
Here is an example of how to create a custom HashMap.
This is quite long, and I am sorry about that.
I have been trying to implement the MinHash LSH algorithm discussed in chapter 3, using Spark (Java). I am using a toy problem like this:
+--------+------+------+------+------+
|element | doc0 | doc1 | doc2 | doc3 |
+--------+------+------+------+------+
| d | 1 | 0 | 1 | 1 |
| c | 0 | 1 | 0 | 1 |
| a | 1 | 0 | 0 | 1 |
| b | 0 | 0 | 1 | 0 |
| e | 0 | 0 | 1 | 0 |
+--------+------+------+------+------+
The goal is to identify, among these four documents (doc0, doc1, doc2 and doc3), which ones are similar to each other. Obviously, the only plausible candidate pair is doc0 and doc3.
Using Spark's support, generating the following "characteristic matrix" is as far as I have got at this point:
+----+---------+-------------------------+
|key |value |vector |
+----+---------+-------------------------+
|key0|[a, d] |(5,[0,2],[1.0,1.0]) |
|key1|[c] |(5,[1],[1.0]) |
|key2|[b, d, e]|(5,[0,3,4],[1.0,1.0,1.0])|
|key3|[a, c, d]|(5,[0,1,2],[1.0,1.0,1.0])|
+----+---------+-------------------------+
and here are the code snippets:
CountVectorizer vectorizer = new CountVectorizer().setInputCol("value").setOutputCol("vector").setBinary(false);
Dataset<Row> matrixDoc = vectorizer.fit(df).transform(df);
MinHashLSH mh = new MinHashLSH()
.setNumHashTables(5)
.setInputCol("vector")
.setOutputCol("hashes");
MinHashLSHModel model = mh.fit(matrixDoc);
Now, there seem to be two main calls on the MinHashLSHModel that one can use: model.approxSimilarityJoin(...) and model.approxNearestNeighbors(...). Examples of using these two calls are here: https://spark.apache.org/docs/latest/ml-features.html#lsh-algorithms
However, model.approxSimilarityJoin(...) requires joining two datasets, and I have only one dataset of 4 documents in which I want to find the similar ones, so I don't have a second dataset to join... Just to try it out, I joined my only dataset with itself. Based on the result, it seems that model.approxSimilarityJoin(...) just did a pair-wise Jaccard calculation, and I don't see any impact from changing the number of hash functions etc., which left me wondering where exactly the MinHash signatures were calculated and where the band/row partitioning happened...
The other call, model.approxNearestNeighbors(...), asks for a comparison point, and then the model identifies the nearest neighbor(s) to that point... Obviously, this is not what I want either, since I have four toy documents and no extra reference point.
I am running out of ideas, so I went ahead and implemented my own version of the algorithm using Spark APIs, with little support from the MinHashLSHModel, which felt wrong. I am thinking I must have missed something...?
I would love to hear any thoughts, really wish to solve the mystery.
Thank you guys in advance!
The MinHash signature calculation happens in model.approxSimilarityJoin(...) itself, where model.transform(...) is called on each of the input datasets and the hash signatures are calculated before joining them and doing a pair-wise Jaccard distance calculation. So the impact of changing the number of hash functions can be seen here.
In model.approxNearestNeighbors(...), the impact of the same can be seen while creating the model using minHash.fit(...), in which transform(...) is called on the input dataset.
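A minimal sketch of how this could be exercised on the dataset from the question (model and matrixDoc as built there); the 0.6 Jaccard-distance threshold and the "JaccardDistance" column name are just examples:

import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Inspect the MinHash signatures directly; this is where changing
// setNumHashTables(...) becomes visible.
model.transform(matrixDoc).select("key", "hashes").show(false);

// Self-join to find candidate pairs, then drop self-pairs and mirrored duplicates.
Dataset<Row> pairs = model.approxSimilarityJoin(matrixDoc, matrixDoc, 0.6, "JaccardDistance");
pairs.filter(col("datasetA.key").lt(col("datasetB.key")))
     .select(col("datasetA.key"), col("datasetB.key"), col("JaccardDistance"))
     .show();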
I'm looking for an algorithm (C/C++/Java, doesn't matter) that solves the problem of finding the shortest path between 2 nodes (A and B) of a graph. The catch is that the path must visit certain other given nodes (cities). A city can be visited more than once. Example path: A-H-D-C-E-F-G-F-B (where A is the source, B is the destination, and F and G are cities which must be visited).
I see this as a variation of the Traveling Salesman Problem but I couldn't find or write a working algorithm based on my searches.
I was trying to find a solution starting from these topics but without any luck:
https://stackoverflow.com/questions/24856875/tsp-branch-and-bound-implementation-in-c and
Variation of TSP which visits multiple cities
An easy reduction of the problem to TSP would be:
1. For each pair (u,v) of nodes that "must be visited", find the distance d(u,v) between them. This can be done efficiently using the Floyd-Warshall algorithm to find all-to-all shortest paths (see the sketch after this list).
2. Create a new graph G' consisting only of those nodes, with all edges present, using the distances calculated in (1).
3. Run a standard TSP algorithm to solve the problem on the reduced graph.
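A minimal Floyd-Warshall sketch for step 1; dist is the distance matrix of the original graph, initialised with the direct edge weights, a large value (e.g. Integer.MAX_VALUE / 2) where there is no edge, and 0 on the diagonal:

// Runs in O(n^3) and leaves dist[i][j] = length of the shortest path from i to j.
static void floydWarshall(int[][] dist, int n) {
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (dist[i][k] + dist[k][j] < dist[i][j])
                    dist[i][j] = dist[i][k] + dist[k][j];
}

The rows and columns for the nodes kept in G' then give its edge weights.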
I think that in addition to amit's answer, you'll want to increase the cost of the edges that have A or B as endpoints by a sufficient amount (the total cost of the graph + 1 would probably be sufficient) to ensure that you don't end up with a path that goes through A or B (instead of ending at A and B).
A--10--X--0--B
|      |     |
|     10     |
|      |     |
+---0--Y--0--+
The above case would result in a path from A to Y to B to X, unless you increase the cost of the edges incident to A and B (by 21, i.e. the total edge cost of 20 plus 1).
A--31--X--21--B
|      |      |
|     10      |
|      |      |
+--21--Y--21--+
Now the cheapest path goes from A through Y and X (A-Y-X-B) and ends at B, as required.
Also make sure you remove any edges (A,B) (if they exist).
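A sketch of that cost adjustment on the reduced graph's distance matrix (dist as in the Floyd-Warshall sketch above); a and b are the endpoint indices and bigM is the total edge cost of the graph plus 1:

static void inflateEndpointEdges(int[][] dist, int n, int a, int b, int bigM) {
    for (int i = 0; i < n; i++) {
        if (i != a) { dist[a][i] += bigM; dist[i][a] += bigM; }
        if (i != b) { dist[b][i] += bigM; dist[i][b] += bigM; }
    }
    // Remove the direct A-B edge, if any, by making it prohibitively expensive.
    dist[a][b] = dist[b][a] = Integer.MAX_VALUE / 2;
}

Every valid A-to-B path then pays exactly 2 * bigM (one edge at each endpoint), while any path passing through A or B in the middle pays more.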
Given an array of any size (for instance [1 2 3]), I need a function that gives all combinations, like
1 |
1 2 |
1 2 3 |
1 3 |
2 |
2 1 3 |
2 3 |
...
Since I'm guessing this is homework, I'll try to refrain from giving a complete answer.
Suppose you already had all combinations (or permutations if that is what you are looking for) of an array of size n-1. If you had that, you could use those combinations/permutations as a basis for forming the new combinations/permutations by adding the nth element to them in the appropriate way. That is the basis for what computer scientists call recursion (and mathematicians like to call a very similar idea induction).
So you could write a method that handles the n case, assuming the n-1 case has been handled, and add a check to handle the base case as well.