Effective search from a huge number of points

I have collected a set of GPS points and now need to match them against 18,000 other points. Both sets are stored in ArrayLists. Is there a better way to search? I am doing this in Java.
Here is a sample of the large data set. It contains an additional parameter, ID1, by which a set of points can be grouped.
ID1  ID2  ID3  longi       lati
2    1    1    -79.911635  39.609849
2    1    2    -79.91151   39.60956
2    1    3    -79.9115    39.609489
2    1    4    -79.911496  39.609433
3    1    1    -79.908162  39.609841
3    1    2    -79.908447  39.610019
4    1    1    -79.911136  39.608433
4    1    2    -79.910961  39.608446
4    1    3    -79.910629  39.608451
4    1    4    -79.910064  39.608493
4    1    5    -79.909117  39.608586

If you are looking for exact matches, you can place the points in a set (either HashSet or TreeSet will work) and compute the intersection with set1.retainAll(set2). You will have to implement hashCode() and equals() for a HashSet, or compareTo() (or supply a Comparator) for a TreeSet; overriding equals() is advisable in either case. That is the easy scenario.
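A minimal sketch of the exact-match case, assuming a hypothetical GpsPoint value class (the class and method names here are illustrative, not from the question). It relies only on HashSet and retainAll() from the standard library:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Hypothetical value class for one GPS fix; name and fields are illustrative. */
final class GpsPoint {
    final double lon;
    final double lat;

    GpsPoint(double lon, double lat) {
        this.lon = lon;
        this.lat = lat;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof GpsPoint)) return false;
        GpsPoint other = (GpsPoint) o;
        // Exact comparison; consistent with hashCode() below.
        return Double.compare(lon, other.lon) == 0
            && Double.compare(lat, other.lat) == 0;
    }

    @Override
    public int hashCode() {
        return Objects.hash(lon, lat);
    }
}

final class ExactMatch {
    /** Returns the points present in both lists without modifying the inputs. */
    static Set<GpsPoint> common(List<GpsPoint> collected, List<GpsPoint> reference) {
        Set<GpsPoint> result = new HashSet<>(collected);
        result.retainAll(new HashSet<>(reference)); // set intersection
        return result;
    }
}
```

Note that exact equality on doubles only matches coordinates that are bit-for-bit identical, so this is only appropriate when both lists come from the same source data.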
If you are looking for "closer than X", you should use a quadtree. Place all the points from the first ArrayList in a quadtree, and then perform quick lookups using this data structure (which can find the closest point in roughly O(log N) per lookup for reasonably evenly distributed points, instead of the O(N) per lookup of the brute-force approach). There is an open-source implementation of a quadtree in, for example, GeoTools.
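To make the idea concrete, here is a small hand-rolled point quadtree. It is only a sketch and not the GeoTools class: the class name, CAPACITY, MAX_DEPTH and the method names are all assumptions made for illustration. It also treats longitude/latitude as planar coordinates, which is a simplification; real distances would need a projection or a great-circle formula.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal bucketed point quadtree; an illustrative sketch only. */
final class QuadTree {
    private static final int CAPACITY = 4;   // points per leaf before it splits (arbitrary)
    private static final int MAX_DEPTH = 32; // stops endless splitting on duplicate points

    private final double minX, minY, maxX, maxY;
    private final int depth;
    private final List<double[]> points = new ArrayList<>();
    private QuadTree[] children; // null while this node is a leaf

    QuadTree(double minX, double minY, double maxX, double maxY) {
        this(minX, minY, maxX, maxY, 0);
    }

    private QuadTree(double minX, double minY, double maxX, double maxY, int depth) {
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
        this.depth = depth;
    }

    /** Adds a point; returns false if it lies outside this node's bounds. */
    boolean insert(double x, double y) {
        if (x < minX || x > maxX || y < minY || y > maxY) return false;
        if (children == null) {
            if (points.size() < CAPACITY || depth >= MAX_DEPTH) {
                points.add(new double[] {x, y});
                return true;
            }
            split();
        }
        for (QuadTree child : children) {
            if (child.insert(x, y)) return true;
        }
        return false; // not reached for in-bounds points
    }

    /** Turns this leaf into an inner node and pushes its points down. */
    private void split() {
        double midX = (minX + maxX) / 2, midY = (minY + maxY) / 2;
        children = new QuadTree[] {
            new QuadTree(minX, minY, midX, midY, depth + 1),
            new QuadTree(midX, minY, maxX, midY, depth + 1),
            new QuadTree(minX, midY, midX, maxY, depth + 1),
            new QuadTree(midX, midY, maxX, maxY, depth + 1)
        };
        for (double[] p : points) {
            for (QuadTree child : children) {
                if (child.insert(p[0], p[1])) break;
            }
        }
        points.clear();
    }

    /** Collects into out every stored point within distance r of (x, y). */
    void queryWithin(double x, double y, double r, List<double[]> out) {
        // Skip subtrees whose rectangle cannot intersect the query square.
        if (x + r < minX || x - r > maxX || y + r < minY || y - r > maxY) return;
        for (double[] p : points) {
            double dx = p[0] - x, dy = p[1] - y;
            if (dx * dx + dy * dy <= r * r) out.add(p);
        }
        if (children != null) {
            for (QuadTree child : children) child.queryWithin(x, y, r, out);
        }
    }
}
```

You would insert the 18,000 reference points once and then call queryWithin() for each collected point with your threshold X.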

You could also use the spatial index known as an R-tree. It is often faster than a quadtree.
For example, this paper finds it to be 2-3 times faster in Oracle databases: http://pdf.aminer.org/000/300/406/incorporating_updates_in_domain_indexes_experiences_with_oracle_spatial_r.pdf
Java Topology Suite (JTS) contains a good implementation of the R-tree: http://www.vividsolutions.com/jts/javadoc/com/vividsolutions/jts/index/strtree/STRtree.html
Note that GeoTools is based on JTS, so there may well also be an R-tree lurking inside its spatial index functionality: http://docs.geotools.org/latest/userguide/library/main/collection.html

Related

Best algorithm to group and add in a list of following pattern

Let us consider we have objects in a list as
listOfObjects = [a,b,ob,ob,c,ob,c,ob,c,ob,ob,c,ob]
we have to group them as
[ob,ob,c,ob,c,ob] from index 2 to 7
[ob,ob,c,ob] from index 9 to 12
i.e. a group starts where two ob's occur together, as at indices 2 and 9, and ends just before a 'c' that is followed by two ob's (as with the 'c' at index 8), or when the list ends.
So what would be the best algorithm to get the above (in Java)?

I assume that by "best algorithm" you mean the one that is optimal in terms of time complexity.
You can do this in a single traversal, keeping track of the next three elements (taking care not to run past the end of the list) and ending the group according to the rule you described. If fewer than three elements follow the current element, you simply end your group at the end of the list (as your rule specifies).
So the time complexity of this algorithm is O(n). It is not possible to do better than this, because every element has to be inspected at least once.
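The single traversal could look roughly like the sketch below. The class and method names are made up, the elements are modelled as plain strings, and I am assuming that only the "'c' followed by two ob's" rule (or the end of the list) closes a group:

```java
import java.util.ArrayList;
import java.util.List;

final class ObGrouper {
    /** Returns {start, end} index pairs (inclusive) for every group found. */
    static List<int[]> findGroups(List<String> items) {
        List<int[]> groups = new ArrayList<>();
        int n = items.size();
        int i = 0;
        while (i < n) {
            boolean startsGroup = i + 1 < n && isOb(items, i) && isOb(items, i + 1);
            if (!startsGroup) {
                i++;
                continue;
            }
            int end = i + 2;
            // Extend until a "c" directly followed by two "ob" entries, or the list end.
            while (end < n && !endsGroup(items, end)) {
                end++;
            }
            groups.add(new int[] {i, end - 1});
            i = end; // resume scanning at the terminating "c" (or past the list)
        }
        return groups;
    }

    private static boolean isOb(List<String> items, int k) {
        return "ob".equals(items.get(k));
    }

    private static boolean endsGroup(List<String> items, int k) {
        return "c".equals(items.get(k))
            && k + 2 < items.size()
            && isOb(items, k + 1)
            && isOb(items, k + 2);
    }
}
```

Each index is examined a constant number of times (the look-ahead is at most two elements), so the whole pass stays O(n).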

I think a stack is a suitable data structure.
It will work once you push each 'ob' onto the stack.
You also need a 'count' variable.

Where to start on a "generate and test" approach using Java

I am required to solve the "N Queens" problem using a generate and test approach in Java, so basically if N=8 my program must generate the 8^8 possible lists and test each one to return the 92 distinct lists that result in a solution to the problem. I must also use a DFS algorithm with backtracking to enumerate the possibilities.
To provide an example, list (2,4,6,8,3,1,7,5) means that the first queen is column 1 row 2, the second is column 2 row 4, the third is column 3 row 6...and so on.
The two main things preventing me from making headway on this are:
1) I have no idea how to generate every possible list of length N (with integer entries from 1 to N) in Java
2) I don't really understand how once I have all these lists, to abstract them to a datatype that can be traversed with a DFS algorithm.
I'm not begging someone to do my homework for me, more I'd like a conceptual walkthrough of how #2 can be thought of and a (somewhat) tangible example of how given an input N I can generate all N^N lists.
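One way to think about both points at once: the N^N lists are the leaves of an implicit tree in which the depth is the column and each branch picks a row for that column. A recursive depth-first walk over that tree generates every list without ever storing them all, and "backtracking" is simply overwriting the current column with the next row. The sketch below follows that idea; the class and method names are illustrative assumptions, and it is deliberately the naive generate-and-test version (every complete list is tested, nothing is pruned early):

```java
import java.util.ArrayList;
import java.util.List;

final class QueensGenerateAndTest {
    /** Returns every list of rows (one per column, values 1..n) that is a valid placement. */
    static List<int[]> solve(int n) {
        List<int[]> solutions = new ArrayList<>();
        generate(new int[n], 0, solutions);
        return solutions;
    }

    /** DFS: fixes a row for column col, recurses, then tries the next row. */
    private static void generate(int[] rows, int col, List<int[]> out) {
        if (col == rows.length) {
            // A leaf of the tree is one complete candidate list: test it.
            if (isSolution(rows)) {
                out.add(rows.clone());
            }
            return;
        }
        for (int row = 1; row <= rows.length; row++) {
            rows[col] = row;
            generate(rows, col + 1, out);
        }
    }

    /** True if no two queens share a row or a diagonal. */
    static boolean isSolution(int[] rows) {
        for (int i = 0; i < rows.length; i++) {
            for (int j = i + 1; j < rows.length; j++) {
                boolean sameRow = rows[i] == rows[j];
                boolean sameDiagonal = Math.abs(rows[i] - rows[j]) == j - i;
                if (sameRow || sameDiagonal) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

Calling solve(8) walks all 8^8 leaves; moving the isSolution-style check into the loop so that hopeless prefixes are abandoned early is what turns this into the usual pruned backtracking search.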

Find the lowest sum path from 2d Array

I have been thinking about the following algorithm problem; here is its statement.
Given a matrix, with each node having a value. You start from 0,0 and have to reach n,m. From i,j you can either go to i+1,j or i,j+1. When you step on each block, the value on that block gets added to your current score. What's the minimum initial score you must carry so that you can always reach n,m (through any possible path) with a positive score at the end?
Eg:
Matrix ->  2  3  4
          -5 -6  7
           8  3  1
Ans -> 6: for the path 2,-5,-6,3,1 we need an initial score of 6 so that when we land on 1, we have a positive score of 1.
So I can do this using brute force and dynamic programming, but I am still looking for an approach that could be better than these. Please share your thoughts; just ideas, I do not need an implementation, as I can write that myself.

There are many search algorithms; I encourage you to read these Wikipedia pages:
https://en.wikipedia.org/wiki/Pathfinding
https://en.wikipedia.org/wiki/Tree_traversal
One possible solution is to transform the array into a graph and apply a shortest-path algorithm to it; another is to use an AI search algorithm such as A*.
Link to the Wikipedia page for A* (pronounced "A star"):
https://en.wikipedia.org/wiki/A*_search_algorithm
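For what it's worth, because moves only go right or down the grid is a directed acyclic graph, so the shortest-path idea collapses into the usual dynamic programme over cells. The sketch below assumes the interpretation suggested by the example in the question, namely that the score only has to be positive at the end and the worst case is the minimum-sum path; the class and method names are hypothetical:

```java
final class MinInitialScore {
    /** Smallest sum over all right/down paths from the top-left to the bottom-right cell. */
    static int minPathSum(int[][] grid) {
        int rows = grid.length;
        int cols = grid[0].length;
        int[][] best = new int[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                if (i == 0 && j == 0) {
                    best[i][j] = grid[i][j];
                    continue;
                }
                // A missing neighbour is treated as unreachable.
                int fromAbove = i > 0 ? best[i - 1][j] : Integer.MAX_VALUE;
                int fromLeft = j > 0 ? best[i][j - 1] : Integer.MAX_VALUE;
                best[i][j] = grid[i][j] + Math.min(fromAbove, fromLeft);
            }
        }
        return best[rows - 1][cols - 1];
    }

    /** Initial score needed so that even the worst path ends with a score of at least 1. */
    static int requiredInitialScore(int[][] grid) {
        return Math.max(0, 1 - minPathSum(grid));
    }
}
```

For the matrix in the question the minimum path sum is -5 (path 2,-5,-6,3,1), which gives the required initial score of 6. Every cell has to be read at least once, so O(n*m) is hard to beat asymptotically.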

Optimal merging of triplets

I'm trying to come up with an algorithm for the following problem :
I've got a collection of triplets of integers - let's call these integers A, B, C. The value stored inside can be big, so generally it's impossible to create an array of size A, B, or C. The goal is to minimize the size of the collection. To do this, we're provided a simple rule that allows us to merge the triplets :
For two triplets (A, B, C) and (A', B', C'), remove the original triplets and place the triplet (A | A', B, C) if B == B' and C == C', where | is bitwise OR. Similar rules hold for B and C also.
In other words, if two values of two triplets are equal, remove these two triplets, bitwise OR the third values and place the result to the collection.
The greedy approach is usually misleading in similar cases and so it is for this problem, but I can't find a simple counterexample that'd lead to a correct solution. For a list with 250 items where the correct solution is 14, the average size computed by greedy merging is about 30 (varies from 20 to 70). The sub-optimal overhead gets bigger as the list size increases.
I've also tried playing around with set bit counts, but I've found no meaningful results. Just the obvious fact that if the records are unique (which is safe to assume), the set bit count always increases.
Here's the stupid greedy implementation (it's just a conceptual thing, please disregard the code style):
import java.util.ArrayList;
import java.util.List;

public class Record {
    long A;
    long B;
    long C;

    public static void main(String[] args) {
        List<Record> data = new ArrayList<>();
        // Fill it with some data
        boolean found;
        do {
            found = false;
            outer:
            for (int i = 0; i < data.size(); ++i) {
                for (int j = i + 1; j < data.size(); ++j) {
                    try {
                        Record r = merge(data.get(i), data.get(j));
                        found = true;
                        // Remove the higher index first so that index i stays valid.
                        data.remove(j);
                        data.remove(i);
                        data.add(r);
                        break outer;
                    } catch (IllegalArgumentException ignored) {
                        // This pair cannot be merged; try the next one.
                    }
                }
            }
        } while (found);
    }

    // Merges two records that agree in two fields by OR-ing the third field.
    public static Record merge(Record r1, Record r2) {
        if (r1.A == r2.A && r1.B == r2.B) {
            Record r = new Record();
            r.A = r1.A;
            r.B = r1.B;
            r.C = r1.C | r2.C;
            return r;
        }
        if (r1.A == r2.A && r1.C == r2.C) {
            Record r = new Record();
            r.A = r1.A;
            r.B = r1.B | r2.B;
            r.C = r1.C;
            return r;
        }
        if (r1.B == r2.B && r1.C == r2.C) {
            Record r = new Record();
            r.A = r1.A | r2.A;
            r.B = r1.B;
            r.C = r1.C;
            return r;
        }
        throw new IllegalArgumentException("Unable to merge these two records!");
    }
}
Do you have any idea how to solve this problem?

This is going to be a very long answer, sadly without an optimal solution (sorry). It is however a serious attempt at applying greedy problem solving to your problem, so it may be useful in principle. I didn't implement the last approach discussed, perhaps that approach can yield the optimal solution -- I can't guarantee that though.
Level 0: Not really greedy
By definition, a greedy algorithm has a heuristic for choosing the next step in a way that is locally optimal, i.e. optimal right now, hoping to reach the global optimum which may or may not be possible always.
Your algorithm chooses any mergeable pair, merges it and then moves on. It does no evaluation of what this merge implies and whether there is a better local solution. Because of this I wouldn't call your approach greedy at all. It is just a solution, an approach. I will call it the blind algorithm just so that I can succinctly refer to it in my answer. I will also use a slightly modified version of your algorithm, which, instead of removing two triplets and appending the merged triplet, removes only the second triplet and replaces the first one with the merged one. The order of the resulting triplets is different, and thus possibly the final result too. Let me run this modified algorithm over a representative data set, marking to-be-merged triplets with a *:
0: 3 2 3     3 2 3     3 2 3
1: 0 1 0*    0 1 2     0 1 2
2: 1 2 0     1 2 0*    1 2 1
3: 0 1 2*
4: 1 2 1     1 2 1*
5: 0 2 0     0 2 0     0 2 0
Result: 4
Level 1: Greedy
To have a greedy algorithm, you need to formulate the merging decision in a way that allows for comparison of options, when multiple are available. For me, the intuitive formulation of the merging decision was:
If I merge these two triplets, will the resulting set have the maximum possible number of mergeable triplets, when compared to the result of merging any other two triplets from the current set?
I repeat, this is intuitive for me. I have no proof that this leads to the globally optimal solution, not even that it will lead to a better-or-equal solution than the blind algorithm -- but it fits the definition of greedy (and is very easy to implement). Let's try it on the above data set, showing between each step, the possible merges (by indicating the indices of triplet pairs) and resulting number of mergables for each possible merge:
               mergables
0: 3 2 3       (1,3)->2
1: 0 1 0       (1,5)->1
2: 1 2 0       (2,4)->2
3: 0 1 2       (2,5)->2
4: 1 2 1
5: 0 2 0
Any choice except merging triplets 1 and 5 is fine. If we take the first pair, we get the same interim set as with the blind algorithm (this time I will collapse indices to remove gaps):
               mergables
0: 3 2 3       (2,3)->0
1: 0 1 2       (2,4)->1
2: 1 2 0
3: 1 2 1
4: 0 2 0
This is where this algorithm differs: it chooses triplets 2 and 4, because one merge is still possible after merging them, in contrast to the choice made by the blind algorithm:
               mergables
0: 3 2 3       (2,3)->0       3 2 3
1: 0 1 2                      0 1 2
2: 1 2 0                      1 2 1
3: 1 2 1
Result: 3
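For readers who prefer Java, the Level 1 heuristic can be sketched as follows. This is not the code from my repository: the class and method names are made up for this illustration, triplets are plain long[3] arrays, ties are broken by taking the first best pair in index order, and no attempt is made at efficiency (each step rescans all pairs).

```java
import java.util.ArrayList;
import java.util.List;

final class GreedyTripletMerge {
    /** Returns the merged triplet, or null if fewer than two components match. */
    static long[] merge(long[] p, long[] q) {
        if (p[0] == q[0] && p[1] == q[1]) return new long[] {p[0], p[1], p[2] | q[2]};
        if (p[0] == q[0] && p[2] == q[2]) return new long[] {p[0], p[1] | q[1], p[2]};
        if (p[1] == q[1] && p[2] == q[2]) return new long[] {p[0] | q[0], p[1], p[2]};
        return null;
    }

    /** Number of pairs in data that could still be merged. */
    static int countMergeablePairs(List<long[]> data) {
        int count = 0;
        for (int i = 0; i < data.size(); i++) {
            for (int j = i + 1; j < data.size(); j++) {
                if (merge(data.get(i), data.get(j)) != null) count++;
            }
        }
        return count;
    }

    /** Copy of data where triplet i is replaced by the merge of i and j, and j is removed. */
    static List<long[]> applyMerge(List<long[]> data, int i, int j) {
        List<long[]> next = new ArrayList<>(data);
        next.set(i, merge(data.get(i), data.get(j)));
        next.remove(j);
        return next;
    }

    /** Repeatedly applies the merge that leaves the most mergeable pairs behind. */
    static List<long[]> reduce(List<long[]> input) {
        List<long[]> data = new ArrayList<>(input);
        while (true) {
            List<long[]> best = null;
            int bestScore = -1;
            for (int i = 0; i < data.size(); i++) {
                for (int j = i + 1; j < data.size(); j++) {
                    if (merge(data.get(i), data.get(j)) == null) continue;
                    List<long[]> candidate = applyMerge(data, i, j);
                    int score = countMergeablePairs(candidate);
                    if (score > bestScore) { // strict: the first best pair wins ties
                        bestScore = score;
                        best = candidate;
                    }
                }
            }
            if (best == null) return data; // nothing left to merge
            data = best;
        }
    }
}
```

Run on the six triplets above, this follows the same sequence of merges as the walkthrough and ends with three triplets.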
Level 2: Very greedy
Now, a second step from this intuitive heuristic is to look ahead one merge further and to ask the heuristic question then. Generalized, you would look ahead k merges further and apply the above heuristic, backtrack and decide the best option. This gets very verbose by now, so to exemplify, I will only perform one step of this new heuristic with lookahead 1:
               mergables
0: 3 2 3       (1,3)->(2,3)->0
1: 0 1 0              (2,4)->1*
2: 1 2 0       (1,5)->(2,4)->0
3: 0 1 2       (2,4)->(1,3)->0
4: 1 2 1              (1,4)->0
5: 0 2 0       (2,5)->(1,3)->1*
                      (2,4)->1*
Merge sequences marked with an asterisk are the best options when this new heuristic is applied.
In case a verbal explanation is necessary:
Instead of checking how many merges are possible after each possible merge for the starting set; this time we check how many merges are possible after each possible merge for each resulting set after each possible merge for the starting set. And this is for lookahead 1. For lookahead n, you'd be seeing a very long sentence repeating the part after each possible merge for each resulting set n times.
Level 3: Let's cut the greed
If you look closely, the previous approach has disastrous performance for even moderate inputs and lookaheads(*). For inputs beyond 20 triplets, anything beyond a 4-merge lookahead takes unreasonably long. The idea here is to cut out merge paths that seem to be worse than an existing solution. If we want to perform lookahead 10, and a specific merge path yields fewer mergables after three merges than another path after five merges, we may just as well cut the current merge path and try another one. This should save a lot of time and allow large lookaheads, which would hopefully get us closer to the globally optimal solution. I haven't implemented this one for testing though.
(*): Assuming a large reduction of input sets is possible, the number of merges is proportional to input size, and lookahead approximately indicates how much you permute those merges. So you have "|input| choose lookahead" possibilities, which is the binomial coefficient C(|input|, lookahead); for lookahead ≪ |input| this can be approximated as O(|input|^lookahead) -- which is also (rightfully) written as "you are thoroughly screwed".
Putting it all together
I was intrigued enough by this problem that I sat down and coded this in Python. Sadly, I was able to prove that different lookaheads yield possibly different results, and that even the blind algorithm occasionally gets it better than lookahead 1 or 2. This is a direct proof that the solution is not optimal (at least for lookahead ≪ |input|). See the source code and helper scripts, as well as proof-triplets on GitHub. Be warned that, apart from memoization of merge results, I made no attempt at optimizing the code CPU-cycle-wise.

I don't have the solution, but I have some ideas.
Representation
A helpful visual representation of the problem is to consider the triplets as points in 3D space. You have integers, so the records will be nodes of a grid. Two records are mergeable if and only if the nodes representing them lie on the same line parallel to one of the axes.
Counter-example
I found a (minimal) example where a greedy algorithm may fail. Consider the following records:
(1, 1, 1)  \
(2, 1, 1)  |      (3, 1, 1)  \
(1, 2, 1)  |==>   (3, 2, 1)  |==>   (3, 3, 1)
(2, 2, 1)  |      (2, 2, 2)  /      (2, 2, 2)
(2, 2, 2)  /
But by choosing the wrong way, it might get stuck at three records:
(1, 1, 1)  \
(2, 1, 1)  |      (3, 1, 1)
(1, 2, 1)  |==>   (1, 2, 1)
(2, 2, 1)  |      (2, 2, 3)
(2, 2, 2)  /
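For inputs this small, the true optimum can be checked by brute force, trying every possible merge order. The sketch below does exactly that; it is exponential and only meant for verifying tiny examples like the one above. The class and method names are illustrative, and triplets are plain long[3] arrays:

```java
import java.util.ArrayList;
import java.util.List;

final class ExhaustiveTripletMerge {
    /** Returns the merged triplet, or null if fewer than two components match. */
    static long[] merge(long[] p, long[] q) {
        if (p[0] == q[0] && p[1] == q[1]) return new long[] {p[0], p[1], p[2] | q[2]};
        if (p[0] == q[0] && p[2] == q[2]) return new long[] {p[0], p[1] | q[1], p[2]};
        if (p[1] == q[1] && p[2] == q[2]) return new long[] {p[0] | q[0], p[1], p[2]};
        return null;
    }

    /** Smallest collection size reachable from data by any sequence of merges. */
    static int minimalSize(List<long[]> data) {
        int best = data.size(); // size if we stop merging here
        for (int i = 0; i < data.size(); i++) {
            for (int j = i + 1; j < data.size(); j++) {
                long[] merged = merge(data.get(i), data.get(j));
                if (merged == null) continue;
                // Branch: perform this merge on a copy and explore what follows.
                List<long[]> next = new ArrayList<>(data);
                next.set(i, merged);
                next.remove(j);
                best = Math.min(best, minimalSize(next));
            }
        }
        return best;
    }
}
```

For the five records above it reports 2, confirming that the three-record outcome is a dead end rather than the optimum.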
Intuition
I feel that this problem is somehow similar to finding a maximum matching in a graph. Most of those algorithms find the optimal solution by beginning with an arbitrary, suboptimal solution and making it 'more optimal' in each iteration by searching for augmenting paths, which have the following properties:
they are easy to find (polynomial time in the number of nodes),
an augmenting path and the current solution can be combined into a new solution, which is strictly better than the current one,
if no augmenting path is found, the current solution is optimal.
I think that the optimal solution to your problem can be found in a similar spirit.

Based on your problem description:
I'm given a bunch of events in time that's usually got some pattern.
The goal is to find the pattern. Each of the bits in the integer
represents "the event occurred in this particular year/month/day". For
example, the representation of March 7, 2014 would be [1 <<
(2014-1970), 1 << 3, 1 << 7]. The pattern described above allows us to
compress these events so that we can say 'the event occurred every 1st
in years 2000-2010'. – Danstahr Mar 7 at 10:56
I'd like to encourage you with the answers that MicSim has pointed at, specifically
Based on your problem description, you should check out this SO
answers (if you didn't do it already):
stackoverflow.com/a/4202095/44522 and
stackoverflow.com/a/3251229/44522 – MicSim Mar 7 at 15:31
The description of your goal is much more clear than the approach you are using. I'm scared that you won't get anywhere with the idea of merging. Sounds scary. The answer you get depends upon the order that you manipulate your data. You don't want that.
It seems you need to keep data and summarize. So, you might try counting those bits instead of merging them. Try clustering algorithms, sure, but more specifically try regression analysis. I should think you would get great results using a correlation analysis if you create some auxiliary data. For example, if you create data for "Monday", "Tuesday", "first Monday of the month", "first Tuesday of the month", ... "second Monday of the month", ... "even years", "every four years", "leap years", "years without leap days", ... "years ending in 3", ...
What you have right now is "1st day of the month", "2nd day of the month", ... "1st month of the year", "2nd month of the year", ... These don't sound like sophisticated enough descriptions to find the pattern.
If you feel it is necessary to continue the approach you have started, then you might treat it more as a search than a merge. What I mean is that you're going to need a criterion/measure for success. You can do the merge on the original data while requiring strictly that A==A'. Then repeat the merge on the original data while requiring B==B'. Likewise C==C'. Finally compare the results (using the criterion/measure). Do you see where this is going? Your idea of bit counting could be used as a measure.
Another point, you could do better at performance. Instead of double-looping through all your data and matching up pairs, I'd encourage you to do single passes through the data and sort it into bins. The HashMap is your friend. Make sure to implement both hashCode() and equals(). Using a Map you can sort data by a key (say where month and day both match) and then accumulate the years in the value. Oh, man, this could be a lot of coding.
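To show what I mean by bins, here is a rough single-pass sketch. It assumes triplets stored as long[3] arrays in the order {A, B, C} (year, month, day bits in your encoding); the class and method names are invented for the example. Using a List as the map key gives you equals() and hashCode() for free:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TripletBins {
    /** Groups triplets {A, B, C} by (B, C) and ORs together the A values of each bin. */
    static Map<List<Long>, Long> mergeOnA(List<long[]> triplets) {
        Map<List<Long>, Long> bins = new LinkedHashMap<>();
        for (long[] t : triplets) {
            // Lists compare by content, so equal (B, C) pairs land in the same bin.
            List<Long> key = Arrays.asList(t[1], t[2]);
            bins.merge(key, t[0], (x, y) -> x | y);
        }
        return bins;
    }
}
```

This does in one O(n) pass what the double loop does for the "B and C match" rule; the other two rules would use the keys (A, C) and (A, B) in the same way.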
Finally, if execution time isn't an issue, here's something to try. Your algorithm depends on the ordering of the data: you get different answers for different orderings. Your criterion for success is the answer with the smallest size after merging. So, repeatedly loop through this algorithm: shuffle the original data, do your merge, save the result. Every time through the loop, keep the result which is the smallest so far. Whenever you get a result smaller than the previous minimum, print out the number of iterations and the size. This is a very simplistic algorithm, but given enough time it will find small solutions. Based on your data size, it might take too long ...
Kind Regards,
-JohnStosh

N-way merge sort a 2G file of strings

This is another question from Cracking the Coding Interview; I still have some doubts after reading the solution.
9.4 If you have a 2 GB file with one string per line, which sorting algorithm
would you use to sort the file and why?
SOLUTION
When an interviewer gives a size limit of 2GB, it should tell you something - in this case, it suggests that they don’t want you to bring all the data into memory.
So what do we do? We only bring part of the data into memory..
Algorithm:
How much memory do we have available? Let’s assume we have X MB of memory available.
1) Divide the file into K chunks, where X * K = 2 GB. Bring each chunk into memory and sort the lines as usual using any O(n log n) algorithm. Save the lines back to the file.
2) Now bring the next chunk into memory and sort.
3) Once we’re done, merge them one by one.
The above algorithm is also known as external sort. Step 3 is known as N-way merge.
The rationale behind using external sort is the size of data. Since the data is too huge and we can’t bring it all into memory, we need to go for a disk based sorting algorithm.
Doubt:
During the merge in step 3, while comparing two arrays, do we need 2*X space each time we compare? The limit was X MB. Should we instead size the chunks so that (X/2) * 2K = 2 GB, so that each chunk is X/2 MB and there are 2K chunks? Or am I just misunderstanding merge sort?
Thanks!

http://en.wikipedia.org/wiki/External_sorting
A quick look on Wikipedia tells me that during the merging process you never hold a whole chunk in memory. So basically, if you have K chunks, you will have K open file pointers but you will only hold one line from each file in memory at any given time. You will compare the lines you have in memory and then output the smallest one (say, from chunk 5) to your sorted file (also an open file pointer, not in memory), then overwrite that line with the next line from that file (in our example, file 5) into memory and repeat until you reach the end of all the chunks.
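A rough sketch of that merge step, assuming each sorted chunk is available as a BufferedReader and using a PriorityQueue to find the smallest of the K current lines. The class and method names are my own; only one line per chunk is held in memory at a time:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class KWayMerge {
    /** One buffered line together with the index of the chunk it came from. */
    private static final class Entry {
        final String line;
        final int source;

        Entry(String line, int source) {
            this.line = line;
            this.source = source;
        }
    }

    /** Merges already-sorted chunks into out, one line at a time. */
    static void merge(List<BufferedReader> chunks, Writer out) {
        PriorityQueue<Entry> heap =
            new PriorityQueue<>(Comparator.comparing((Entry e) -> e.line));
        try {
            // Prime the heap with the first line of every chunk.
            for (int i = 0; i < chunks.size(); i++) {
                String line = chunks.get(i).readLine();
                if (line != null) heap.add(new Entry(line, i));
            }
            while (!heap.isEmpty()) {
                Entry smallest = heap.poll();
                out.write(smallest.line);
                out.write('\n');
                // Refill from the chunk that just supplied the smallest line.
                String next = chunks.get(smallest.source).readLine();
                if (next != null) heap.add(new Entry(next, smallest.source));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In practice each reader would wrap a FileReader for one chunk file and out would be a BufferedWriter on the result file; the readers' internal buffers add a small fixed amount of memory per chunk.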

First off, step 3 itself is not a merge sort, the whole thing is a merge sort. Step 3 is just a merge, with no sorting involved at all.
And as to the storage required, there are two possibilities.
The first is to merge the sorted data in groups of two. Say you have three groups:
A: 1 3 5 7 9
B: 0 2 4 6 8
C: 2 3 5 7
With that method, you would merge A and B into a single group Y, then merge Y and C into the final result Z:
Y: 0 1 2 3 4 5 6 7 8 9 (from merging A and B).
Z: 0 1 2 2 3 3 4 5 5 6 7 7 8 9 (from merging Y and C).
This has the advantage of a very small constant memory requirement in that you only ever need to store the "next" element from each of two lists but, of course, you need to do multiple merge operations.
The second way is a "proper" N-way merge where you select the next element from any of the groups. With that you would check the lowest value in every list to see which one comes next:
Z: 0 1 2 2 3 3 4 5 5 6 7 7 8 9 (from merging A, B and C).
This involves only one merge operation but it requires more storage, basically one element per list.
Which of these you choose depends on the available memory and the element size.
For example, if you have 100M memory available to you and the element size is 100K, you can use the latter. That's because, for a 2G file, you need 20 groups (of 100M each) for the sort phase which means a proper N-way merge will need 100K by 20, or about 2M, well under your memory availability.
Alternatively, let's say you only have 1M available. That will be about 2000 (2G / 1M) groups and multiplying that by 100K gives 200M, well beyond your capacity.
So you would have to do that merge in multiple passes. Keep in mind though that it doesn't have to be multiple passes merging two lists.
You could find a middle ground where for example each pass merges ten lists. Ten groups of 100K is only a meg so will fit into your memory constraint and that will result in fewer merge passes.
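The arithmetic above can be captured in a few lines. This is only a back-of-the-envelope sketch with made-up names: it assumes one buffered element per group being merged and ignores output buffers and other overhead.

```java
final class MergePlan {
    /** How many groups can be merged at once when one element per group is buffered. */
    static long fanIn(long memoryBytes, long elementBytes) {
        return memoryBytes / elementBytes;
    }

    /** Number of merge passes needed to reduce the given number of groups to one. */
    static int passes(long groups, long fanIn) {
        if (fanIn < 2) {
            throw new IllegalArgumentException("need room for at least two elements");
        }
        int passes = 0;
        while (groups > 1) {
            groups = (groups + fanIn - 1) / fanIn; // ceiling division
            passes++;
        }
        return passes;
    }
}
```

With 1M of memory and 100K elements, fanIn is 10, so the roughly 2000 groups shrink to 200, 20, 2 and finally 1: four passes. With 100M of memory the 20 groups fit in a single pass.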

The merging process is much simpler than that. You'll be outputting them to a new file, but basically you only need constant memory: you only need to read one element from each of the two input files at a time.
