Efficient Matching Algorithm for Set Based Triplets - java

I am looking for an efficient way to solve the following problem.
List 1 is a list of records that are identified by a primitive triplet:
X | Y | Z
List 2 is a list of records that are identified by three sets: one of Xs, one of Ys, and one of Zs. The Xs, Ys and Zs are of the same 'type' as those in list 1, so they are directly comparable with one another.
Set(X) | Set(Y) | Set(Z)
For an item in list 1 I need to find all the items in list 2 where the X, Y, Z from list 1 all occur in their corresponding sets in list 2. This is best demonstrated by an example:
List 1:
X1, Y1, Z1
List 2:
(X1, X2) | (Y1) | (Z1, Z3)
(X1) | (Y1, Y2) | (Z1, Z2, Z3)
(X3) | (Y1, Y3) | (Z2, Z3)
In the above, the item in list 1 would match the first two items in list 2. The third item would not be matched as X1 does not occur in the X set, and Z1 does not occur in the Z set.
I have written a functionally correct version of the algorithm but am concerned about performance on larger data sets. Both lists are very large so iterating over list 1 and then performing an iteration over list 2 per item is going to be very inefficient.
I tried to build an index by de-normalizing each item in list 2 into a map, but the number of index entries per item is proportional to the size of that item's sets. As such this uses a very large amount of memory and also takes significant resources to build.
Can anyone suggest an optimal way of solving this? I'm happy to consider both memory- and CPU-optimal solutions, but striking a balance would be nice!

There are going to be a lot of ways to approach this. Which is right depends on the data and how much memory is available.
One simple technique is to build a table from list2, to accelerate the queries coming from list1.
from collections import defaultdict

# Build "hits". hits[0] is a table of, for each x,
# which items in list2 contain it. Likewise hits[1]
# is for y and hits[2] is for z.
hits = [defaultdict(set) for i in range(3)]
for rowid, row in enumerate(list2):
    for i in range(3):
        for v in row[i]:
            hits[i][v].add(rowid)

# For each row of list1, query the index to find which
# items in list2 contain all three values.
for x, y, z in list1:
    print(hits[0][x].intersection(hits[1][y], hits[2][z]))
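Since the question is tagged java, here is a rough sketch of the same indexing idea in Java. The representations (list1 rows as String[] triplets, list2 rows as lists of three sets) are stand-ins for your actual record types:

import java.util.*;

class TripletMatcher {
    // hits.get(i) maps a value to the ids of the list2 rows whose i-th set contains it.
    static void matchAll(List<String[]> list1, List<List<Set<String>>> list2) {
        List<Map<String, Set<Integer>>> hits = new ArrayList<>();
        for (int i = 0; i < 3; i++) hits.add(new HashMap<>());
        for (int rowid = 0; rowid < list2.size(); rowid++)
            for (int i = 0; i < 3; i++)
                for (String v : list2.get(rowid).get(i))
                    hits.get(i).computeIfAbsent(v, k -> new HashSet<>()).add(rowid);

        // Query: intersect the three posting lists for each (x, y, z) triplet.
        for (String[] t : list1) {
            Set<Integer> matches = new HashSet<>(hits.get(0).getOrDefault(t[0], Set.of()));
            matches.retainAll(hits.get(1).getOrDefault(t[1], Set.of()));
            matches.retainAll(hits.get(2).getOrDefault(t[2], Set.of()));
            System.out.println(Arrays.toString(t) + " matches list2 rows " + matches);
        }
    }
}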

If the total size of the sets is not too large, you could try to model List 2 as bitfields. The structure will probably be quite fragmented, though - maybe the structures referenced in the Wikipedia article on bit arrays (Judy arrays, tries, Bloom filters) can help address the memory problems of your normalization approach.
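To make that concrete, one way to use bitfields here is to keep one java.util.BitSet of list2 row ids per value and per column, so that a query is just two ANDs. A sketch, assuming String values and an index populated from list2 the same way as a hash-based index:

import java.util.*;

class BitfieldIndex {
    // One BitSet of list2 row ids per value, per column (0 = X, 1 = Y, 2 = Z).
    final List<Map<String, BitSet>> hits =
            Arrays.asList(new HashMap<>(), new HashMap<>(), new HashMap<>());

    // Query for one triplet: AND the three bitfields together. The set bits
    // of the result are the row ids of the matching list2 items.
    BitSet query(String x, String y, String z) {
        BitSet result = (BitSet) hits.get(0).getOrDefault(x, new BitSet()).clone();
        result.and(hits.get(1).getOrDefault(y, new BitSet()));
        result.and(hits.get(2).getOrDefault(z, new BitSet()));
        return result;
    }
}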

You could build a tree out of List2; the first level of the tree is the first of (X1..Xn) that appears in the X set. The second level holds the next possible values, plus an EOF leaf node listing the list2 rows whose set contains only X1. The next level contains the next possible value, and so on.
Root --+--X1--+--EOF--> List of pointers to list2 lines containing only "X1"
       |      |
       |      +--X2--+--EOF--> List of pointers to list2 lines containing only "X1,X2"
       |      |      |
       |      |      +--X3--+--etc--
       |      |
       |      +--X3--+--EOF--> "X1,X3"
       |
       +--X2--+--EOF--> "X2"
       |      |
       |      +--X3--+--EOF--> "X2,X3"
       |      |      |
       ...
This is expensive in memory consumption (N^2 log K, I think? where N=values for X, K=lines in List2) but results in fast retrieval times. If the number of possible Xs is large then this approach will break down...
Obviously you could build this index for all 3 parts of the tuple, and then AND together the results from searching each tree.
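For illustration, a minimal sketch of one such tree node, assuming each set's elements are inserted in a canonical sorted order and the "EOF" entries become row-id lists stored on the nodes:

import java.util.*;

// One tree per column (X, Y, Z). Insert each row's set with its elements
// in sorted order; rows whose set ends at this node are recorded here.
class SetTrieNode {
    final Map<String, SetTrieNode> children = new HashMap<>();
    final List<Integer> rowIds = new ArrayList<>();   // the "EOF" lists

    void insert(List<String> sortedSet, int i, int rowId) {
        if (i == sortedSet.size()) {
            rowIds.add(rowId);                        // set ends here
            return;
        }
        children.computeIfAbsent(sortedSet.get(i), k -> new SetTrieNode())
                .insert(sortedSet, i + 1, rowId);
    }
}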

There's a fairly efficient way to do this with a single pass over list2. You start by building an index of the items in list1.
from collections import defaultdict

# index maps x -> y -> z -> row id, i.e. conceptually
# HashMap<X, HashMap<Y, HashMap<Z, Integer>>>
index = defaultdict(lambda: defaultdict(dict))
for rowid, (x, y, z) in enumerate(list1):
    index[x][y][z] = rowid

for rowid2, (xs, ys, zs) in enumerate(list2):
    # Collect the y -> z maps of every list1 row whose x occurs in xs.
    xhits = defaultdict(list)
    for x in xs:
        if x in index:
            for y, zmap in index[x].items():
                xhits[y].append(zmap)
    # Of those, collect the z -> rowid entries whose y occurs in ys.
    yhits = defaultdict(list)
    for y in ys:
        if y in xhits:
            for zmap in xhits[y]:
                for z, rowid1 in zmap.items():
                    yhits[z].append(rowid1)
    # Finally, report every z that also occurs in zs.
    for z in zs:
        if z in yhits:
            for rowid1 in yhits[z]:
                print("list1[%d] matches list2[%d]" % (rowid1, rowid2))
The extra bookkeeping here will probably make it slower than indexing list2. But since in your case list1 is typically much smaller than list2, this will use much less memory. If you're reading list2 from disk, with this algorithm you never need to keep more than one row of it in memory.
Memory access patterns can be a big deal, so I can't say for sure which will be faster in practice; you'll have to measure. The worst-case time complexity in both cases, barring pathological hash collisions, is O(len(list1)*len(list2)).

How about using a HashSet (or HashSets) for the sets in List 2? This way you only need to iterate over List 1.

If you use Guava, there is a high-level way to do this that is not necessarily optimal but doesn't do anything crazy:
List<SomeType> list1 = ...;
List<Set<SomeType>> candidateFromList2 = ...;
if (Sets.cartesianProduct(candidateFromList2).contains(list1)) { ... }
But it's not that hard to check this "longhand" either.

Related

Determining Time and Space Complexity of program

So I had a coding challenge for an internship and part of it was to determine the space and time complexity of my program. The program was roughly as follows.
while(A){
    int[][] grid;
    // additional variables
    while(B){ // for loop involves iterating through grid
        // additional variables
        for(...)
            for(....)
    }
    for(...) // for loop involves iterating through grid
        for(....)
}
So what I said was that the program overall has time complexity of (AN^2+BN^2), therefore concluding that it has an amortized time of O(N^2).
As for the space complexity, was I supposed to sum the space used by all variables? Assuming every variable is an int and there are 3 in loop A and two in loop B, would the space complexity be (A*24 + B*16)?
To avoid mistakes, I tend to use an approach where you make a side note for each line representing how many times it gets executed (to be more accurate, you can include both its best and its worst case).
Taking into consideration the example, the idea may look as follows:
num_exec
        |  while(A){
A       |      int[][] grid;
A       |      additional variables
        |
        |      while(B){ // for loop involves iterating through grid
AB      |          additional variables
ABN^2   |          for(...)
        |              for(....)
        |      }
        |
AN^2    |      for(...) // for loop involves iterating through grid
        |          for(....)
        |  }
To estimate your code's time complexity, a simple summation of those side-noted numbers does the trick (as you may have done yourself, though you have obtained slightly different results than mine):

2A + AB + ABN^2 + AN^2 = O(A*B*N^2)
As for your memory complexity, your intuition is right for an 8-byte integer. However, if we are talking about primitive datatypes, you can simply think of them as constants. Thus, you should rather be concerned about complex datatypes, i.e. an array, since it aggregates multiple primitives. To sum up, you take into account the data sizes of the elements designated to preserve your data.
Consequently, applied on the example:
memory
        |  while(A){
ANk     |      int[][] grid;
A3k     |      additional variables
        |
        |      while(B){ // for loop involves iterating through grid
AB2k    |          additional variables
        |          for(...)
        |              for(....)
        |      }
        |
        |      for(...) // for loop involves iterating through grid
        |          for(....)
        |  }
Supposing a grid of size N, a primitive datatype of size k, and a total number of additional variables of 3 in the outer loop followed by 2 in the inner one, the total space complexity adds up to:

ANk + 3Ak + 2ABk = O(ANk + ABk)

Note that for the complexities given above to hold, A and B have to be both significantly less than N and independent of it.
You may be interested in further explanation of the matter provided at this link. Hope that helps (even if it is just approximate because of the coarse details you've provided) and best of luck!

Implementation of a particular Travelling-Salesman variation

I'm looking for an algorithm (C/C++/Java - doesn't matter) that solves the following problem: find the shortest path between two nodes (A and B) of a graph. The catch is that the path must visit certain other given nodes (cities). A city can be visited more than once. An example path: A-H-D-C-E-F-G-F-B (where A is the source, B is the destination, and F and G are cities which must be visited).
I see this as a variation of the Traveling Salesman Problem but I couldn't find or write a working algorithm based on my searches.
I was trying to find a solution starting from these topics but without any luck:
https://stackoverflow.com/questions/24856875/tsp-branch-and-bound-implementation-in-c and
Variation of TSP which visits multiple cities
An easy reduction of the problem to TSP would be:
For each pair (u,v) of nodes that must be visited (including the endpoints A and B), find the distance d(u,v) between them. This can be done efficiently using the Floyd-Warshall algorithm to find all-to-all shortest paths; a sketch follows this list.
Create a new graph G' consisting only of those nodes, with all edges present, using the distances calculated in (1).
Run a standard TSP algorithm to solve the problem on the reduced graph.
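For reference, a minimal sketch of the all-pairs step mentioned above, assuming the graph is given as an adjacency matrix of doubles with Double.POSITIVE_INFINITY for missing edges:

// Floyd-Warshall: after the loops, dist[i][j] is the shortest-path
// distance from i to j. dist must start as the edge-weight matrix,
// with 0 on the diagonal and +infinity where there is no edge.
static void floydWarshall(double[][] dist) {
    int n = dist.length;
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (dist[i][k] + dist[k][j] < dist[i][j])
                    dist[i][j] = dist[i][k] + dist[k][j];
}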
I think that in addition to amit's answer, you'll want to increase the cost of the edges that have A or B as endpoints by a sufficient amount (the total cost of the graph + 1 would probably be sufficient) to ensure that you don't end up with a path that goes through A or B (instead of ending at A and B).
A--10--X--0--B
|      |     |
|     10     |
|      |     |
+---0--Y--0--+
The above case would result in a path from A to Y to B to X, unless you increase the cost of the A and B edges (by 21).
A--31--X--21--B
|      |      |
|     10      |
|      |      |
+--21--Y--21--+
Now it goes from A to X to Y to B.
Also make sure you remove any edges (A,B) (if they exist).

Disjoint set with Union by Size and Path Compression; possible for taller tree to become subtree of shorter tree?

Hi this is my first time posting here,
I have been trying to work out a question for studying but haven't been able to figure it out:
We consider the forest implementation of the disjoint-set abstract data type, with Weighted Union by size and Path Compression. Initially, each element is in a one-node tree.
Starting from the above initial state:
give a (short) sequence of UNION and FIND operations in which the last operation is a UNION that causes a taller tree A to become the subtree of a shorter tree B (i.e. the height of A is strictly larger than the height of B).
Show the two trees A and B that the last UNION merges
Hint: You can start from n = 9 elements, each one in a one-node tree.
I'm not sure how that would work since the smaller tree always gets merged with the larger tree because of union by size?
Thanks.
I don't want to answer your homework, but this question is old enough that your semester is likely over, and in any case a hint should help enough.
There's a distinction between union by size and union by height, primarily because of path compression. Specifically, path compression can result in very high degree nodes, and thus trees with many nodes but a very short height. For example, these are two trees you can create with union find with path compression:
T1:    o      (n=5, h=2)
     /| |\
    o o o o

T2:    o      (n=4, h=3)
      /|
     o o
     |
     o
If the next operation is a merge of these two trees, the "union by size" and "union by height" algorithms would select different parents.
In practice, "union by rank" is usually used. Rank is an upper bound for height which can be updated in constant time, and yields the best asymptotic running time. A web search will yield many good explanations of that algorithm.
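For experimenting with such sequences, here is a standard sketch of a disjoint-set with union by size and path compression:

class DisjointSet {
    private final int[] parent;
    private final int[] size;

    DisjointSet(int n) {
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) { parent[i] = i; size[i] = 1; }
    }

    int find(int x) {
        if (parent[x] != x)
            parent[x] = find(parent[x]);   // path compression
        return parent[x];
    }

    void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        if (size[ra] < size[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;                   // smaller tree (by size) goes under larger
        size[ra] += size[rb];
    }
}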

Searching words in array [duplicate]

Possible Duplicate:
How to find list of possible words from a letter matrix [Boggle Solver]
I have a String[][] array such as
h,b,c,d
e,e,g,h
i,l,k,l
m,l,o,p
I need to match an ArrayList against this array to find the words specified in the ArrayList. When searching for word hello, I need to get a positive match and the locations of the letters, for example in this case (0,0), (1,1), (2,1), (3,1) and (3,2).
Going letter by letter, suppose we have successfully located the first l. The program should then try to find the next letter (the second l) in the cells next to it, i.e. all the letters around it horizontally, vertically and diagonally: e, e, g, k, o, l, m and i. The same position can't be used in the word twice, so (0,0), (1,1), (2,1), (2,1) and (3,2) wouldn't be acceptable, because position (2,1) is matched twice. Diagonal moves are allowed, so the word still matches here, but it has to use the other l, since a position cannot be used more than once.
This case should also be matched
h,b,c,d
e,e,g,h
l,l,k,l
m,o,f,p
If we suppose that we try to search for helllo, it won't match: neither (x1, y1) (x1, y1) nor (x1, y1) (x2, y2) (x1, y1) can be matched.
What I want to know is the best way to implement this kind of feature. If I have a 4x4 String[][] array and 100 000 words in an ArrayList, what is the most efficient and easiest way to do this?
I think you will probably spend most of your time trying to match words that can't possibly be built by your grid. So, the first thing I would do is try to speed up that step and that should get you most of the way there.
I would re-express the grid as a table of possible moves that you index by the letter. Start by assigning each letter a number (usually A=0, B=1, C=2, ... and so forth). For your example, let's just use the alphabet of the letters you have (in the second grid where the last row reads " m o f p "):
b | c | d | e | f | g | h | k | l | m | o | p
---+---+---+---+---+---+---+---+---+---+----+----
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
Then you make a 2D boolean array that tells whether you have a certain letter transition available:
     |  0  1  2  3  4  5  6  7  8  9 10 11   <- from letter
     |  b  c  d  e  f  g  h  k  l  m  o  p
-----+------------------------------------
 0 b |     T     T     T  T
 1 c |  T     T  T     T  T
 2 d |     T           T  T
 3 e |  T  T     T     T  T  T  T
 4 f |                     T  T     T  T
 5 g |  T  T  T  T        T  T  T
 6 h |  T  T  T  T     T     T  T
 7 k |           T  T  T  T     T     T  T
 8 l |           T  T  T  T  T  T  T  T  T
 9 m |                        T     T
10 o |              T        T  T  T
11 p |              T        T  T
  ^
  to letter
Now go through your word list and convert the words to transitions (you can pre-compute this):
hello (6, 3, 8, 8, 10):
6 -> 3, 3 -> 8, 8 -> 8, 8 -> 10
Then check if these transitions are allowed by looking them up in your table:
[6][3] : T
[3][8] : T
[8][8] : T
[8][10] : T
If they are all allowed, there's a chance that this word might be found.
For example the word "helmet" can be ruled out on the 4th transition (m to e: helMEt), since that entry in your table is false.
And the word hamster can be ruled out, since the first (h to a) transition is not allowed (doesn't even exist in your table).
Now, for the remaining words that you didn't eliminate, try to actually find them in the grid the way you're doing it now or as suggested in some of the other answers here. This is to avoid false positives that result from jumps between identical letters in your grid. For example, the word "help" is allowed by the table, but not by the grid.
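A sketch of the table construction and the precheck, assuming a lowercase char[][] grid:

// allowed[a][b] is true if some cell holding letter a is adjacent
// (horizontally, vertically or diagonally) to some cell holding letter b.
static boolean[][] buildTransitions(char[][] grid) {
    boolean[][] allowed = new boolean[26][26];
    int rows = grid.length, cols = grid[0].length;
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            for (int dr = -1; dr <= 1; dr++)
                for (int dc = -1; dc <= 1; dc++) {
                    int nr = r + dr, nc = c + dc;
                    if ((dr != 0 || dc != 0)
                            && nr >= 0 && nr < rows && nc >= 0 && nc < cols)
                        allowed[grid[r][c] - 'a'][grid[nr][nc] - 'a'] = true;
                }
    return allowed;
}

// A word survives the precheck only if every adjacent letter pair is allowed;
// surviving words still need a real grid search to rule out false positives.
static boolean mightContain(boolean[][] allowed, String word) {
    for (int i = 0; i + 1 < word.length(); i++)
        if (!allowed[word.charAt(i) - 'a'][word.charAt(i + 1) - 'a'])
            return false;
    return true;
}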
Let me know when your boggle phone-app is done! ;)
Although I am sure there is a beautiful and efficient answer to this question academically, you can use the same approach, but with a list of possibilities. So, for the word 'hello', when you find the letter 'h', you next add all the possible 'e' letters, and so on. Every possibility will form a path of letters.
I would start by thinking of your grid as a graph, where each grid position is a node and each node connects to its eight neighbors (you shouldn't need to explicitly code it as a graph, however). Once you find the potential starting letters, all you need to do is a depth-first search of the graph from each start position. The key is to remember where you've already searched, so you don't end up making more work for yourself (or worse, get stuck in a cycle).
Depending on the size of character space being used, you might also be able to benefit from building lookup tables. Let's assume English (26 contiguous character codepoints); if you start by building a 26-element List<Point>[] array, you can populate that array once from your grid, and then can quickly get a list of locations to start your search for any word. For example, to get the locations of h I would write arr['h'-'a']
You can even leverage this further if you apply the same strategy and build lookup tables for each edge list in the graph. Instead of having to search all 8 edges for each node, you already know which edges to search (if any).
(One note - if your character space is non-contiguous, you can still do a lookup table, but you'll need to use a HashMap<Character,List<Point>> and map.get('h') instead.)
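A sketch of that lookup table, assuming lowercase a-z:

import java.awt.Point;
import java.util.ArrayList;
import java.util.List;

class GridIndex {
    // locations[ch - 'a'] lists every grid cell that holds the letter ch.
    @SuppressWarnings("unchecked")
    static List<Point>[] buildIndex(char[][] grid) {
        List<Point>[] locations = new List[26];
        for (int i = 0; i < 26; i++) locations[i] = new ArrayList<>();
        for (int r = 0; r < grid.length; r++)
            for (int c = 0; c < grid[r].length; c++)
                locations[grid[r][c] - 'a'].add(new Point(r, c));
        return locations;
    }
}
// Usage: buildIndex(grid)['h' - 'a'] gives every starting point for words beginning with h.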
One approach to investigate is to generate all the possible sequences of letters (strings) from the grid, then check if each word exists in this set of strings, rather than checking each word against the grid. E.g. starting at h in your first grid:
h
hb
he
he // duplicate, but different path
hbc
hbg
hbe
hbe // ditto
heb
hec
heg
...
This is only likely to be faster for very large lists of words because of the overhead of generating the sequences. For small lists of words it would be much faster to test them individually against the grid.
You would either need to store the entire path (including coordinates) or have a separate step to work out the path for the words that match. Which is faster will depend on the hit rate (i.e. what proportion of input words you actually find in the grid).
Depending on what you need to achieve, you could perhaps compare the sequences against a list of dictionary words to eliminate the non-words before beginning the matching.
Update 2: in the linked question there are several working, fast solutions that generate sequences from the grid, deepening recursively to generate longer sequences. However, they test these against a trie generated from the word list, which enables them to abandon a subtree of sequences early - this prunes the search and improves efficiency greatly. It has a similar effect to the transition filtering suggested by Markus.

Sorting an array while moving duplicates to the end?

This was a question in one my friend's programming class.
Q. How do you sort an array of ints and then arrange them such that all duplicate elements appear at the end of the array?
For example, given the input
{5, 2, 7, 6, 1, 1, 5, 6, 2}
The output would be
{1, 2, 5, 6, 7, 1, 2, 5, 6}
Note that the numbers are sorted and duplicate numbers are after 7, which is the maximum in the array.
This has to be achieved without using any Java library packages/utils.
I suggested sorting the array first using insertion or bubble sort, and then going over the array, performing something like the following:
for (int i = 0; i < nums.length - 2; i++) {
    for (int j = i + 1; j < nums.length; j++) {
        // current and next are the same: move elements up
        // and place the duplicate at the end.
        if (nums[i] == nums[j]) {
            int temp = nums[j];
            for (int k = j; k < nums.length - 1; k++) {
                nums[k] = nums[k + 1];
            }
            nums[nums.length - 1] = temp;
            break;
        }
    }
}
I tried this myself later (that is how I got the code above). As I tried it out, I started to think this could be achieved with less code and more efficiently, and that maybe I gave wrong advice.
Any thoughts?
Depending on the parameters of your problem, there are many approaches to solving this.
If you are not allowed to use O(n) external memory, then one option would be to use a standard sorting algorithm to sort the array in-place in O(n log n) time, then to run a second pass over it to move the duplicates to the end (as you've suggested). The code you posted above takes O(n^2) time, but I think that this step can be done in O(n log n) time using a slightly more complicated algorithm. The idea works in two steps. In the first step, in O(n log n) time you bring all non-duplicated elements to the front in sorted order and bring all the duplicates to the back in non-sorted order. Once you've done that, you then sort the back half of the array in O(n log n) time using the sorting algorithm from the first step.
I'm not going to go into the code to sort the array. I really love sorting, but there are so many other good resources on how to sort arrays in-place that it's not a good use of my time/space here to go into them. If it helps, here are links to Java implementations of heapsort, quicksort, and smoothsort, all of which run in O(n log n) time. Heapsort and smoothsort use only O(1) external memory, while quicksort can use O(n) in the worst case (though good implementations can limit this to O(log n) using cute tricks).
The interesting code is the logic to bring all the non-duplicated elements to the front of the range. Intuitively, the code works by storing two pointers - a read pointer and a write pointer. The read pointer points to the next element to read, while the write pointer points to the location where the next unique element should be placed. For example, given this array:
1 1 1 1 2 2 3 4 5 5
We start with the read and write pointers initially pointing at 1:

write   v
        1 1 1 1 2 2 3 4 5 5
read    ^

Next, we skip the read pointer ahead to the next element that isn't 1. This finds 2:

write   v
        1 1 1 1 2 2 3 4 5 5
read            ^

Then, we bump the write pointer to the next location:

write     v
        1 1 1 1 2 2 3 4 5 5
read            ^

Now, we swap the 2 into the spot held by the write pointer:

write     v
        1 2 1 1 1 2 3 4 5 5
read            ^

advance the read pointer to the next value that isn't 2:

write     v
        1 2 1 1 1 2 3 4 5 5
read                ^

then advance the write pointer:

write       v
        1 2 1 1 1 2 3 4 5 5
read                ^

Again, we exchange the values pointed at by 'read' and 'write' and move the write pointer forward, then move the read pointer to the next unique value:

write         v
        1 2 3 1 1 2 1 4 5 5
read                  ^

Once more yields

write           v
        1 2 3 4 1 2 1 1 5 5
read                    ^

and the final iteration gives

write             v
        1 2 3 4 5 2 1 1 1 5
read                        ^

If we now sort from the write pointer to the read pointer, we get

write             v
        1 2 3 4 5 1 1 1 2 5
read                        ^
and bingo! We've got the answer we're looking for.
In (untested, sorry...) Java code, this fixup step might look like this:
int read = 0;
int write = 0;
while (read < array.length) {
    /* Swap the next unique value (at read) into the write position. */
    int temp = array[write];
    array[write] = array[read];
    array[read] = temp;

    /* Advance the read pointer past the duplicates. Anything no larger
     * than the unique value we just placed (now at array[write]) must be
     * a duplicate, because every smaller value has already been moved to
     * the front, so we compare against array[write].
     */
    while (read < array.length && array[read] <= array[write])
        ++read;

    /* Advance the write pointer. */
    ++write;
}
/* Now sort array[write .. array.length - 1] to finish. */
This algorithm runs in O(n) time, which leads to an overall O(n log n) algorithm for the problem. Since the reordering step uses O(1) memory, the overall memory usage would be either O(1) (for something like smoothsort or heapsort) or O(log n) (for something like quicksort).
EDIT: After talking this over with a friend, I think that there is a much more elegant solution to the problem based on a modification of quicksort. Typically, when you run quicksort, you end up partitioning the array into three regions:
+----------------+----------------+----------------+
| values < pivot | values = pivot | values > pivot |
+----------------+----------------+----------------+
The recursion then sorts the first and last regions to put them into sorted order. However, we can modify this for our version of the problem. We'll need as a primitive the rotation algorithm, which takes two adjacent blocks of values in an array and exchanges them in O(n) time. It does not change the relative order of the elements in those blocks. For example, we could use rotation to convert the array
1 2 3 4 5 6 7 8
into
3 4 5 6 7 8 1 2
and can do so in O(n) time.
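For concreteness, here is a sketch of that rotation primitive using the classic triple-reverse trick (reverse each block, then reverse the whole range):

// Reverses a[lo..hi) in place.
static void reverse(int[] a, int lo, int hi) {
    for (hi--; lo < hi; lo++, hi--) {
        int t = a[lo]; a[lo] = a[hi]; a[hi] = t;
    }
}

// Exchanges the adjacent blocks a[lo..mid) and a[mid..hi) in O(n) time
// without changing the relative order inside either block.
static void rotate(int[] a, int lo, int mid, int hi) {
    reverse(a, lo, mid);
    reverse(a, mid, hi);
    reverse(a, lo, hi);
}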
The modified version of quicksort would work by using the Bentley-McIlroy three-way partition algorithm (described here) to, using O(1) extra space, rearrange the array elements into the configuration shown above. Next, we apply a rotation to reorder the elements so that they look like this:
+----------------+----------------+----------------+
| values < pivot | values > pivot | values = pivot |
+----------------+----------------+----------------+
Next, we perform a swap so that we move exactly one copy of the pivot element into the set of elements at least as large as the pivot. This may leave extra copies of the pivot behind. We then recursively apply the sorting algorithm to the < and > ranges. When we do this, the resulting array will look like this:
+---------+-------------+---------+-------------+---------+
| < pivot | dup < pivot | > pivot | dup > pivot | = pivot |
+---------+-------------+---------+-------------+---------+
We then apply two rotations to the range to put it into the final order. First, rotate the duplicate values less than the pivot with the values greater than the pivot. This gives
+---------+---------+-------------+-------------+---------+
| < pivot | > pivot | dup < pivot | dup > pivot | = pivot |
+---------+---------+-------------+-------------+---------+
At this point, this first range is the unique elements in ascending order:
+---------------------+-------------+-------------+---------+
| sorted unique elems | dup < pivot | dup > pivot | = pivot |
+---------------------+-------------+-------------+---------+
Finally, do one last rotation of the duplicate elements greater than the pivot and the elements equal to the pivot to yield this:
+---------------------+-------------+---------+-------------+
| sorted unique elems | dup < pivot | = pivot | dup > pivot |
+---------------------+-------------+---------+-------------+
Notice that these last three blocks are just the sorted duplicate values:
+---------------------+-------------------------------------+
| sorted unique elems | sorted duplicate elements |
+---------------------+-------------------------------------+
and voila! We've got everything in the order we want. Using the same analysis that you'd do for normal quicksort, plus the fact that we're only doing O(n) work at each level (three extra rotations), this works out to O(n log n) in the best case with O(log n) memory usage. It's still O(n^2) in the worst case with O(log n) memory, but that happens with extremely low probability.
If you are allowed to use O(n) memory, one option would be to build a balanced binary search tree out of all of the elements that stores key/value pairs, where each key is an element of the array and the value is the number of times it appears. You could then sort the array in your format as follows:
For each element in the array:
If that element already exists in the BST, increment its count.
Otherwise, add a new node to the BST with that element having count 1.
Do an inorder walk of the BST. When encountering a node, output its key.
Do a second inorder walk of the BST. When encountering a node, if its count is greater than one, output c - 1 copies of its key, where c is the count.
The runtime of this algorithm is O(n log n), but it would be pretty tricky to code up a BST from scratch. It also requires external space, which I'm not sure you're allowed to do.
However, if you are allowed external space and the arrays you are sorting are small and contain small integers, you could modify the above approach by using a modified counting sort. Just replace the BST with an array large enough for each integer in the original array to be a key. This reduces the runtime to O(n + k), with memory usage O(k), where k is the largest element in the array.
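A sketch of that counting-sort variant, assuming non-negative ints bounded by a known max:

// Returns a new array: each distinct value once, in order, followed by
// the leftover duplicate copies, also in order. O(n + k) time, O(k) space.
static int[] sortWithDupsAtEnd(int[] nums, int max) {
    int[] count = new int[max + 1];
    for (int v : nums) count[v]++;
    int[] out = new int[nums.length];
    int write = 0;
    for (int v = 0; v <= max; v++)          // pass 1: unique values
        if (count[v] > 0) out[write++] = v;
    for (int v = 0; v <= max; v++)          // pass 2: duplicates
        for (int i = 1; i < count[v]; i++) out[write++] = v;
    return out;
}
// sortWithDupsAtEnd(new int[]{5,2,7,6,1,1,5,6,2}, 7) -> {1,2,5,6,7,1,2,5,6}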
Hope this helps!
A modified merge sort could do the trick: on the last merge pass, keep track of the last number you pushed onto the front of the result array, and if the lowest of the next numbers is equal to it, add it to the end instead of the front.
Welcome to the world of Data Structures and Algorithms. You're absolutely right in that you could sort this faster. You could also do it a dozen different ways. PhDs are spent on this stuff :)
Here's a link where you can see an optimized bubble sort
You might also want to check out Big O Notation
Have fun and good luck!
Use quicksort to sort the array. When implementing the sort, you can modify it slightly by adding all duplicates to a separate duplicate array. When done, simply append the duplicate array to the end of the sorted array.
