Fastest way to take top N closest vectors by cosine distance

I have a large list of vectors (~100k), representing words and computed using random indexing, and given one input word I have to find the top N closest vectors. Currently I do a complete sort by distance and then extract the top N results, but this takes too long to be usable, since I have to compute 100k distances. Is there a more efficient way to do it? The vectors are already normalized, so computing the distance only requires a dot product.
The vectors are stored in a Java HashMap<String, Vector>, where Vector is the la4j class for sparse vectors.

You can put your vectors into a spatially aware container, such as an R-tree, a k-d tree, or a PK-tree.
This way you'll be able to find the points without iterating through your whole dataset, by looking only in a few adjacent cells. Don't forget that you'll need to search not just a single cell but the adjacent cells as well, and in a multi-dimensional space there are a lot of neighbors.
Update: You still need to measure the distance manually. However, you will not need to iterate through all the vectors.
One simple solution: define a maximum distance, iterate over all vectors in the cells within that distance, sort them, and pick the top N.
The optimal solution (much harder to develop) is an iterative search process. For example, start with the single cell that contains your input vector vX and find the N closest vectors in this cell. If the distance between vX and the N-th found vector (the farthest one) is less than the distance between vX and the nearest point of any cell that has not yet been searched, then you have your N results. Otherwise, add the vectors from the nearest cell that has not yet been searched, and repeat the process. The most complex part here is keeping track of which cells have already been searched and what to do next (especially for the PK-tree, where the tree is of variable height).
The tradeoff solution (not that hard to develop, and possibly good enough for you) is an iterative search process where you always go up the tree. You start with the leaf node containing vX. If it doesn't have N vectors, or if vX is closer to a boundary of the cell than to the N-th found vector, you go one level up and add the complete sub-tree starting from the parent node. This way the algorithm is much simpler, because the searched area is always rectangular. However, the worst case (when vX lies on the border between the 2 root cells) is much worse: you'll have to iterate through all your 100k points.

If you know your vectors are more or less evenly distributed in your N-dimensional space, you don't need all that complexity with the spatial trees.
Instead, you can split your space into a regular hypercubic grid so that the average grid cell contains under 20 vectors, and store the cells in a HashMap<List<Integer>, List<Vector>>, where the keys are the integer coordinates of a grid cell and the values are the lists of vectors that lie inside the corresponding cell.
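The grid described above could be sketched roughly as follows. This is only a minimal illustration under some assumptions: the class name GridIndex and its methods are hypothetical, vectors are plain double[] arrays rather than la4j vectors, and the cell size is something you would have to tune for your data. It only shows how vectors are bucketed by cell; a real query would also have to visit neighboring cells.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: buckets vectors into hypercubic grid cells.
class GridIndex {
    private final double cellSize; // edge length of one grid cell (assumed tuning parameter)
    private final Map<List<Integer>, List<double[]>> cells = new HashMap<>();

    GridIndex(double cellSize) {
        this.cellSize = cellSize;
    }

    /** Integer coordinates of the grid cell that contains v. */
    List<Integer> cellOf(double[] v) {
        List<Integer> key = new ArrayList<>(v.length);
        for (double x : v) {
            key.add((int) Math.floor(x / cellSize));
        }
        return key;
    }

    /** Stores v in the list that belongs to its cell. */
    void add(double[] v) {
        cells.computeIfAbsent(cellOf(v), k -> new ArrayList<>()).add(v);
    }

    /** All vectors stored in the given cell (empty list if the cell is unused). */
    List<double[]> vectorsInCell(List<Integer> key) {
        return cells.getOrDefault(key, new ArrayList<>());
    }
}
```

List<Integer> works as a map key because List defines equals and hashCode element-wise, so two keys with the same coordinates land in the same bucket.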

Related

How to determine the final path in A* [duplicate]

Blue-Walls
Green highlighted cells = open list
Red Highlighted cells = closed list
Hello, can anyone tell me how I can implement backtracking in an A* search algorithm?
I've implemented the A* search according to the wiki, but it does not backtrack. What I mean by backtracking is this: the open list (green cells) contains 2,0 and 3,3, as shown in the picture. Upon reaching 2,0, the current node would "jump" to 3,3, since the cost at 2,0 is now higher than at 3,3, and the search would continue from there. How can it be done so that it backtracks from 2,0->2,1->2,2... all the way back to 3,3 and starts the search from there?

Your image looks like a 2D grid map,
but your text suggests a graph approach, which is a bit confusing.
For a 2D grid map, the costs must differ between the cells on the path.
You have too many cells with cost=100 in there, and therefore you cannot backtrack the path. You have to increase or decrease the cost on each step and fill only the cells that are next to the last filled cells. That can be done by recursion on big maps, or by scanning the whole map or a bounding box for the last filled number on small maps.
Look here for my C++ A* implementation
The backtracking
It can be done by scanning the neighbors of the start/end cell after the A* filling, always moving to the smallest/biggest cost.
In this example, start filling from (2,0) until (3,3) is hit, and then backtrack from (3,2) with cost=8 to the smallest cost (always cost-1 for incremental filling). If you need the path in reverse order, start filling from (3,3) instead.
Speedup
Sometimes double filling speeds up the process: start filling from both ends and stop when they join. To recognize which cell was filled from which end, you can use positive and negative values, or some big enough ranges for the costs.
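The incremental filling and cost-1 backtracking described above could look roughly like the sketch below. Note that this is a plain breadth-first wave fill, not the linked C++ implementation and not a full A* with a heuristic; the class name WaveFill and the boolean wall grid are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: fill costs outward from the start, then walk back from the goal.
class WaveFill {
    private static final int[][] DIRS = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};

    /** wall[y][x] == true marks a wall. Returns the path as {x, y} pairs, or an empty list. */
    static List<int[]> path(boolean[][] wall, int sx, int sy, int gx, int gy) {
        int h = wall.length, w = wall[0].length;
        int[][] cost = new int[h][w];
        for (int[] row : cost) Arrays.fill(row, -1); // -1 = not filled yet
        cost[sy][sx] = 0;
        Deque<int[]> queue = new ArrayDeque<>();
        queue.add(new int[]{sx, sy});
        // Fill: each newly reached cell gets the cost of its predecessor plus one.
        while (!queue.isEmpty() && cost[gy][gx] < 0) {
            int[] c = queue.poll();
            for (int[] d : DIRS) {
                int nx = c[0] + d[0], ny = c[1] + d[1];
                if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                if (wall[ny][nx] || cost[ny][nx] >= 0) continue;
                cost[ny][nx] = cost[c[1]][c[0]] + 1;
                queue.add(new int[]{nx, ny});
            }
        }
        List<int[]> path = new ArrayList<>();
        if (cost[gy][gx] < 0) return path; // goal was never reached
        // Backtrack: from the goal, always step to a neighbor whose cost is exactly one less.
        int x = gx, y = gy;
        path.add(new int[]{x, y});
        while (cost[y][x] > 0) {
            for (int[] d : DIRS) {
                int nx = x + d[0], ny = y + d[1];
                if (nx >= 0 && ny >= 0 && nx < w && ny < h && cost[ny][nx] == cost[y][x] - 1) {
                    x = nx;
                    y = ny;
                    break;
                }
            }
            path.add(new int[]{x, y});
        }
        Collections.reverse(path); // path now runs from start to goal
        return path;
    }
}
```

Because every filled cell other than the start was reached from a neighbor with cost one less, the backtracking loop always finds a next step.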

You can follow backpointers from the two nodes until you reach the common ancestor (0,2), then print the nodes you visited when following from (2,0), followed by the nodes you visited when following from (3,3), printed in reverse.
To find the common ancestor of two nodes in an A* search tree, just maintain the two "current nodes", and follow the backpointer of whichever has the higher g-cost, until the two current nodes are in the same place.
It bears mentioning that this is a weird thing to do, though. A* is not a stack-based traversal, so it doesn't backtrack.
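As a rough illustration of following backpointers to the common ancestor, here is a minimal sketch. The Node class, its fields, and the method name pathBetween are hypothetical; the sketch assumes both nodes belong to the same search tree and that every step has a positive cost, so g strictly increases away from the start.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of walking two backpointer chains until they meet.
class BackpointerPath {
    static class Node {
        final String name;
        final int g;       // cost from the start node
        final Node parent; // backpointer; null for the start node

        Node(String name, int g, Node parent) {
            this.name = name;
            this.g = g;
            this.parent = parent;
        }
    }

    /** Names of the nodes on the tree path from a, up to the common ancestor, down to b. */
    static List<String> pathBetween(Node a, Node b) {
        List<String> fromA = new ArrayList<>();
        List<String> fromB = new ArrayList<>();
        while (a != b) {
            // Always advance the node with the higher g-cost (ties: advance a).
            if (a.g >= b.g) {
                fromA.add(a.name);
                a = a.parent;
            } else {
                fromB.add(b.name);
                b = b.parent;
            }
        }
        fromA.add(a.name);          // the common ancestor
        Collections.reverse(fromB); // nodes visited from b, printed in reverse
        fromA.addAll(fromB);
        return fromA;
    }
}
```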

How to look for the edges that must exist in every minimum spanning tree of a weighted graph

In an undirected weighted graph, the actual weights of the edges are not known; instead, each edge is classified as either Light, Medium or Heavy.
All Light edges have a smaller weight than any Medium or Heavy edge.
All Medium edges have a smaller weight than any Heavy edge.
In general, nothing is known about the relationship between two edges in the same weight class.
How can I identify all the edges that must exist in every MST of this graph? This is what I'm thinking:
1. Determine the number of strongly connected components.
2. The edges composed of articulation points must exist in the MST.
3. The lightest edge in each connected component must exist in the MST.
I am not sure whether my thinking is correct. If it is, how can I implement it in Java? Thank you very much.

Jason, I will not go into the description of how to implement the code in Java, but let's look at the thought process behind the algorithm for your problem.
Since your edges are classified into three weight categories, we can re-label them with comparative weights as follows: Light is 1; Medium is 2; Heavy is 3. This way, your conditions are maintained.
Next, we can use Kruskal's Minimum Spanning Tree Algorithm (MST) as we normally would to create a minimum spanning tree on an undirected weighted graph. This algorithm is greedy, so it would sort the edges from light to heavy, pick the next smallest edge so long as it doesn't create a cycle, and then repeat step 2 until all vertices are included in the MST. (See link below for reference)
https://www.geeksforgeeks.org/kruskals-minimum-spanning-tree-algorithm-greedy-algo-2/
When it comes to verifying that your algorithm is correct, there are two potential cases.
1). You can reveal the actual weights of the edges in the MST and of those excluded. For each excluded edge, add it to the MST; if it is not the heaviest edge in the resulting cycle, swap it with the heaviest edge. Keep doing this until all originally excluded edges have been explored, while the MST keeps its property of containing every vertex.
2). You cannot reveal the actual weight of any edge in the graph. In this case, there is no way to even verify that your algorithm created a minimum spanning tree, so your algorithm would have no way of checking itself. In any event, using Kruskal's algorithm with comparative weights would create a spanning tree that is very close to minimum, even without knowing the actual weights.

Here is a simple algorithm that runs in O(|E|) time.
Initialize an empty set S.
Add all bridges in the graph to S.
Remove all heavy edges from the graph.
Add all bridges in the graph to S.
Remove all medium edges from the graph.
Add all bridges in the graph to S.
Return S.
There are a few fast algorithms that find all bridges in a graph, such as this one.
Why is the algorithm above correct?
Theorem on characterization of an all-MST edge: An edge appears in every minimum spanning tree if and only if for any cycle that contains that edge, that cycle also contains an edge heavier than that edge.
The theorem above is proved here. By the way, an edge that is in every MST of a graph is called a critical MST edge sometimes.
The algorithm given in the question is incorrect. For example, consider a triangle with all 3 edges classified as light. For another example, consider a triangle with one light edge, one medium edge and one heavy edge.
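The bridge-based steps listed above could be sketched as follows. This is a hedged illustration, not the answer author's code: the class name CriticalEdges, the {u, v, weightClass} edge encoding with 1 = Light, 2 = Medium, 3 = Heavy, and the choice of a recursive DFS low-link bridge finder are all assumptions made for the example. The recursion may overflow the stack on very large graphs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: collect bridges of the full graph, then without Heavy, then without Medium.
class CriticalEdges {
    /** edges[i] = {u, v, weightClass}; returns the indices of edges that are in every MST. */
    static Set<Integer> edgesInEveryMst(int n, int[][] edges) {
        Set<Integer> result = new TreeSet<>();
        for (int maxClass = 3; maxClass >= 1; maxClass--) {
            result.addAll(bridges(n, edges, maxClass));
        }
        return result;
    }

    /** Bridges of the subgraph that keeps only edges with weightClass <= maxClass. */
    static List<Integer> bridges(int n, int[][] edges, int maxClass) {
        List<List<int[]>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int id = 0; id < edges.length; id++) {
            if (edges[id][2] <= maxClass) {
                adj.get(edges[id][0]).add(new int[]{edges[id][1], id});
                adj.get(edges[id][1]).add(new int[]{edges[id][0], id});
            }
        }
        int[] disc = new int[n]; // discovery time, 0 = unvisited
        int[] low = new int[n];  // lowest discovery time reachable without the parent edge
        int[] timer = {0};
        List<Integer> found = new ArrayList<>();
        for (int v = 0; v < n; v++) {
            if (disc[v] == 0) dfs(v, -1, adj, disc, low, timer, found);
        }
        return found;
    }

    private static void dfs(int u, int parentEdge, List<List<int[]>> adj,
                            int[] disc, int[] low, int[] timer, List<Integer> found) {
        disc[u] = low[u] = ++timer[0];
        for (int[] e : adj.get(u)) {
            int v = e[0], id = e[1];
            if (id == parentEdge) continue; // skip the edge we arrived by (parallel edges still count)
            if (disc[v] == 0) {
                dfs(v, id, adj, disc, low, timer, found);
                low[u] = Math.min(low[u], low[v]);
                if (low[v] > disc[u]) found.add(id); // no back edge bypasses u-v, so it is a bridge
            } else {
                low[u] = Math.min(low[u], disc[v]);
            }
        }
    }
}
```

For the triangle with one light, one medium and one heavy edge, this returns the light and the medium edge; for the all-light triangle it returns nothing, which matches the theorem.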

Finding the distance between two objects in a 2D array, not going Diagonal

Not sure if this was posted before, but this is a relatively short question.
I am currently working on a maze game in which the player is chased by a monster.
Currently, the player starts at (0,0), and the monster starts at (9,9). If a move increments or decrements one coordinate (not both), what is the algorithm/code to find the number of moves it would take for the monster to reach the main character?
From the comments I realized I should have clarified a few things.
If the room type is 1, then it's a wall; otherwise it is open. But the main thing is that the walls do NOT impact the monster. Perhaps the better way to ask is how many moves it would take if all of the cells in the array were open.

You could take a look at the A* search algorithm which is widely used in games since it uses a heuristic to make performance improvements.
In your case, the heuristic could be the Manhattan distance between your grid elements. This metric is the sum of the absolute differences in the X and Y coordinates (which, unlike Euclidean distance, does not allow for diagonal traversal).
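A minimal sketch of the Manhattan distance is shown here; the class and method names are hypothetical. Since the question states that walls do not affect the monster, this value is also the number of moves it needs on an open grid.

```java
// Hypothetical helper: Manhattan distance between two grid cells.
class GridDistance {
    static int manhattan(int x1, int y1, int x2, int y2) {
        // Sum of the absolute coordinate differences; diagonal steps are not allowed.
        return Math.abs(x1 - x2) + Math.abs(y1 - y2);
    }
}
```

For the monster at (9,9) and the player at (0,0), manhattan(9, 9, 0, 0) is 18.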
To use this algorithm however, you would need to make a graph representation of your map, something like below:

Finding if an edge lies within a set of disjoint rectangles

I'm using a triangulation library to compute the Constrained Delaunay Triangulation of a set of rectangles within some large boundary. The algorithm returns all the edges, but also adds edges inside of the rectangles that define the constraints. I want to be able to find if an edge lies inside of a rectangle in O(1) time.
Here's a more general description of the problem I want to solve. Given a set of nonoverlapping rectangles (the borders of the rectangles may touch) and an edge e with endpoints (x1,y1) and (x2, y2), find in O(1) time if e lies within any of the rectangles (including the border).
Also let me know of any data structures I can use for speedups! I'm also implementing this in java so I have easy access to hash sets, maps and all those nice data structures.

Since the rectangles are completely enclosed, the inside of each rectangle will simply be the CDT of the rectangle itself -- which is to say, two triangles, meeting along a diagonal of the rectangle. So you can just insert all rectangles' diagonals (remember, two possible diagonals per rectangle) into a hashtable, and check which edges exactly match those endpoints.
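The hashtable of diagonals might be sketched like this. It assumes axis-aligned rectangles with integer coordinates (with floating-point coordinates, exact matching would need more care), and the class name RectangleDiagonals and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: store both diagonals of every rectangle, then test edges in expected O(1).
class RectangleDiagonals {
    private final Set<List<Integer>> diagonals = new HashSet<>();

    /** Orders the endpoints so that (p, q) and (q, p) produce the same key. */
    private static List<Integer> key(int x1, int y1, int x2, int y2) {
        if (x1 < x2 || (x1 == x2 && y1 <= y2)) {
            return Arrays.asList(x1, y1, x2, y2);
        }
        return Arrays.asList(x2, y2, x1, y1);
    }

    /** Registers both possible diagonals of an axis-aligned rectangle. */
    void addRectangle(int minX, int minY, int maxX, int maxY) {
        diagonals.add(key(minX, minY, maxX, maxY));
        diagonals.add(key(minX, maxY, maxX, minY));
    }

    /** True if the edge exactly matches a diagonal of a registered rectangle. */
    boolean isDiagonal(int x1, int y1, int x2, int y2) {
        return diagonals.contains(key(x1, y1, x2, y2));
    }
}
```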

It is possible to break up the area that all of the rectangles cover into an N by M grid of boxes. By labeling each box with the rectangle it is in, or the rectangles it overlaps, it is possible to obtain O(1) queries with O(N*M) pre-processing.
However, in order for it to work, the grid has to be created based on an algorithm that allows calculating which box a point lies in in O(1). It also requires that the number of rectangles a box overlaps be very small (ideally no more than 2 or 3), as otherwise the average query time could be O(log N) or worse. This means that the number of boxes can get very large.

Shortest path between raw geo coordinates and a node of a graph

I have implemented a simple Dijkstra's algorithm for finding the shortest path on an .osm map with Java.
The pathfinding in a graph which is created from an .osm file works pretty well. But in case the user's current location and/or destination is not a node of this graph (just raw coordinates) how do we 'link' those coordinates to the graph to make pathfinding work?
The simple, straightforward solution, "find the node nearest to the current location and draw a straight line", doesn't seem to be realistic. What if we have a situation like the one in the attached picture? (UPD)
The problem here is that before we start any 'smart' pathfinding algorithm (like Dijkstra's), we 'link' the current position to the graph, but this is just a dumb formula (a hypotenuse from the Pythagorean theorem) for finding the nearest node in terms of geographical coordinates, and this formula is not 'pathfinding': it cannot take obstacles and types of nodes into account.
To paraphrase it - how do we find the shortest path between A and B if B is a node in a graph, and A is not a node?
Have you heard of any other solutions to this problem?

The process you're describing is "map matching," and it uses a spatial index to find the nearest node.
One common approach is to construct a quadtree that contains all your nodes, then identify the quad that contains your point, then calculate the distance from your point to all nodes in the quad (recognizing that longitudinal degrees are shorter than latitudinal degrees). If there are no nodes in the quad then you progressively expand your search. There are several caveats with quadtrees, but this should at least get you started.
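To illustrate the remark about longitudinal degrees being shorter, here is a small sketch of the distance step only. It uses a simple approximation that scales the longitude difference by the cosine of the latitude; the class NearestNode, its methods, and the {lat, lon} node layout are assumptions for the example, and it ignores wrap-around at the antimeridian. In practice the nodes array would hold only the candidates from the matching quad.

```java
// Hypothetical sketch: nearest candidate node, with longitude scaled by cos(latitude).
class NearestNode {
    /** Approximate squared distance, measured in latitude degrees. */
    static double approxDistSq(double lat1, double lon1, double lat2, double lon2) {
        double meanLat = Math.toRadians((lat1 + lat2) / 2.0);
        double dLat = lat2 - lat1;
        double dLon = (lon2 - lon1) * Math.cos(meanLat); // a longitude degree shrinks toward the poles
        return dLat * dLat + dLon * dLon;
    }

    /** Index of the node in nodes ({lat, lon} pairs) closest to the point, or -1 if there are none. */
    static int nearest(double lat, double lon, double[][] nodes) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < nodes.length; i++) {
            double d = approxDistSq(lat, lon, nodes[i][0], nodes[i][1]);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }
}
```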

Practically speaking, I would just ignore the problem and use your suggested algorithm, "straight line to nearest node". It is the user's responsibility to be as close as possible to a routable entity. If the first guess you suggested was misleading, the user can change the starting position accordingly.
A user who really starts the journey in no man's land hopefully knows the covered area much better than your algorithm does. Trust them.
BTW, this is the algorithm that OpenRouteService and Google Maps are using.
If you are still not convinced, I suggest using the "shortest straight line that does not cross an obstacle". If this is still not enough, define a virtual grid of, say, 5m x 5m cells and their diagonals. Then run a shortest path algorithm on it until you reach a graph vertex.

You could treat the current location as a node, and connect that node to a few of the nearest nodes in a straight line. GPS applications would consider this straight line as being 'off road', so the cost of this line is very big compared to the other lines between nodes.
However, I didn't see your attached picture, so I'm not sure this is a good solution for you.

If you have constraints in your path, you should consider using a linear programming formulation of the same shortest path problem.
http://en.wikipedia.org/wiki/Shortest_path_problem
Your objective would be to minimize the sum of the distance of each "way" taken between the starting and ending "nodes" defined in your .osm file. Any obstacles would be formulated as constraints. To implement with Java, see the link below.
http://javailp.sourceforge.net/

Really nice question!
A quad tree is a solution, as you can also index lines/edges in it, not only nodes.
But the problem with this 'naive' approach is that it is memory intensive. E.g. if you already have a very big graph for the shortest path calculation, then you need the same amount of memory or more for the quad tree (or I was doing something very stupid).
One solution is as follows:
Use an array which is a grid over the used area. I mean you can divide your area into tiles, and per tile you have an array entry with the list of nodes.
So per array entry you'll have a list of the nodes in that tile. For an edge you can just add both nodes to the entry. Take care when there are edges crossing a tile without having a node in this tile. The Bresenham line algorithm will help here.
Querying: convert the input (lat, lon) into a tile number. Now get all points from the array entry. Calculate the nearest neighbor among the nodes AND edges to your point A using Euclidean geometry (which should be fine as long as they are not too far away, which is the case for nearest neighbors).
Is this description clear?
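The tile array described above could be sketched roughly as follows. This is not the graphhopper implementation; the class name TileIndex, its bounding-box constructor, and the use of integer node ids are assumptions for illustration. It covers only nodes; edges that cross a tile without having a node in it would still need the line rasterization mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a flat array of tiles, each holding the ids of the nodes inside it.
class TileIndex {
    private final double minLat, minLon, maxLat, maxLon;
    private final int rows, cols;
    private final List<List<Integer>> tiles;

    TileIndex(double minLat, double minLon, double maxLat, double maxLon, int rows, int cols) {
        this.minLat = minLat;
        this.minLon = minLon;
        this.maxLat = maxLat;
        this.maxLon = maxLon;
        this.rows = rows;
        this.cols = cols;
        this.tiles = new ArrayList<>(rows * cols);
        for (int i = 0; i < rows * cols; i++) tiles.add(new ArrayList<>());
    }

    /** Converts (lat, lon) into a tile number; points outside the area are clamped to the border tiles. */
    int tileOf(double lat, double lon) {
        int r = (int) ((lat - minLat) / (maxLat - minLat) * rows);
        int c = (int) ((lon - minLon) / (maxLon - minLon) * cols);
        r = Math.max(0, Math.min(rows - 1, r));
        c = Math.max(0, Math.min(cols - 1, c));
        return r * cols + c;
    }

    void addNode(int nodeId, double lat, double lon) {
        tiles.get(tileOf(lat, lon)).add(nodeId);
    }

    /** Node ids stored in the tile that contains the query point. */
    List<Integer> candidates(double lat, double lon) {
        return tiles.get(tileOf(lat, lon));
    }
}
```

A query then only has to compare the point against the handful of candidates in its tile instead of every node in the graph.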
Update
This is now implemented in graphhopper!
To get a more memory-efficient solution, you simply have to exclude identical nodes from one entry (tile).
A slightly more complicated technique to reduce memory usage: if a graph traversal respects the tile bounds, you can imagine that the graph is divided into several sub-graphs for that tile (i.e. a graph traversal wouldn't reach the other sub-graph within the tile bounds). Now you don't need all the nodes, but only the nodes which lie in different sub-graphs! This will reduce the memory usage, but while querying you need to traverse more than just one edge further (as in the current graphhopper implementation), because you need to traverse the full tile, i.e. as long as the tile bounds are not exceeded.
