Possible Duplicate:
How to find list of possible words from a letter matrix [Boggle Solver]
I have a String[][] array such as
h,b,c,d
e,e,g,h
i,l,k,l
m,l,o,p
I need to match an ArrayList of words against this array to find the words it contains. When searching for the word hello, I need to get a positive match and the locations of the letters, in this case (0,0), (1,1), (2,1), (3,1) and (3,2).
Going letter by letter, suppose we have successfully located the first l. The program should then look for the next letter (the second l) in the cells next to it, so it should match against e, e, g, k, o, l, m and i, meaning all the letters around it: horizontally, vertically and diagonally. The same position can't be used twice within one word, so (0,0), (1,1), (2,1), (2,1) and (3,2) wouldn't be acceptable, because position (2,1) is matched twice. In this case the word will still match, because diagonal moves are allowed, but the second l has to come from the other l cell, due to the requirement that a position cannot be used more than once.
This case should also be matched
h,b,c,d
e,e,g,h
l,l,k,l
m,o,f,p
If we instead search for helllo, it won't match: neither (x1, y1) (x1, y1) nor (x1, y1) (x2, y2) (x1, y1) is an acceptable sequence of positions.
What I want to know is the best way to implement this kind of feature. If I have a 4x4 String[][] array and 100,000 words in an ArrayList, what is the most efficient and easiest way to do this?
I think you will probably spend most of your time trying to match words that can't possibly be built by your grid. So, the first thing I would do is try to speed up that step and that should get you most of the way there.
I would re-express the grid as a table of possible moves that you index by the letter. Start by assigning each letter a number (usually A=0, B=1, C=2, ... and so forth). For your example, let's just use the alphabet of the letters you have (in the second grid where the last row reads " m o f p "):
b | c | d | e | f | g | h | k | l | m | o | p
---+---+---+---+---+---+---+---+---+---+----+----
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
Then you make a 2D boolean array that tells whether you have a certain letter transition available:
| 0 1 2 3 4 5 6 7 8 9 10 11 <- from letter
| b c d e f g h k l m o p
-----+--------------------------------------
0 b | T T T T
1 c | T T T T T
2 d | T T T
3 e | T T T T T T T
4 f | T T T T
5 g | T T T T T T T
6 h | T T T T T T T
7 k | T T T T T T T
8 l | T T T T T T T T T
9 m | T T
10 o | T T T T
11 p | T T T
^
to letter
Now go through your word list and convert the words to transitions (you can pre-compute this):
hello (6, 3, 8, 8, 10):
6 -> 3, 3 -> 8, 8 -> 8, 8 -> 10
Then check if these transitions are allowed by looking them up in your table:
[6][3] : T
[3][8] : T
[8][8] : T
[8][10] : T
If they are all allowed, there's a chance that this word might be found.
For example the word "helmet" can be ruled out on the 4th transition (m to e: helMEt), since that entry in your table is false.
And the word hamster can be ruled out, since the first (h to a) transition is not allowed (doesn't even exist in your table).
Now, for the remaining words that you didn't eliminate, try to actually find them in the grid the way you're doing it now or as suggested in some of the other answers here. This is to avoid false positives that result from jumps between identical letters in your grid. For example, the word "help" is allowed by the table, but not by the grid.
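To make that pre-filter concrete, here is a rough Java sketch. It uses the full a-z alphabet instead of the compact 12-letter one above, assumes each grid cell holds a single lowercase letter, and the method names are mine:

// Build the "can letter B directly follow letter A somewhere in the grid?" table.
static boolean[][] buildTransitions(String[][] grid) {
    boolean[][] canFollow = new boolean[26][26]; // [from][to]
    int rows = grid.length, cols = grid[0].length;
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            int from = grid[r][c].charAt(0) - 'a';
            for (int dr = -1; dr <= 1; dr++) {
                for (int dc = -1; dc <= 1; dc++) {
                    if (dr == 0 && dc == 0) continue;
                    int nr = r + dr, nc = c + dc;
                    if (nr >= 0 && nr < rows && nc >= 0 && nc < cols) {
                        canFollow[from][grid[nr][nc].charAt(0) - 'a'] = true;
                    }
                }
            }
        }
    }
    return canFollow;
}

// Cheap test: is every adjacent letter pair of the word available in the grid?
static boolean mightBeInGrid(String word, boolean[][] canFollow) {
    for (int i = 0; i + 1 < word.length(); i++) {
        if (!canFollow[word.charAt(i) - 'a'][word.charAt(i + 1) - 'a']) {
            return false; // rules out words like "hamster" and "helmet" immediately
        }
    }
    return true; // only a candidate; still verify it with the real grid search
}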
Let me know when your boggle phone-app is done! ;)
Although I am sure there is a beautiful and efficient academic answer for this question, you can use the same approach, but with a list of possibilities. So, for the word 'hello', when you find the letter 'h', you next add the possible 'e' letters, and so on. Every possibility forms a path of letters.
I would start by thinking of your grid as a graph, where each grid position is a node and each node connects to its (up to) eight neighbors (you shouldn't need to explicitly represent it as a graph in code, however). Once you have found the potential starting letters, all you need to do is a depth-first search of the graph from each start position. The key is to remember where you've already searched, so you don't end up making more work for yourself (or worse, getting stuck in a cycle).
Depending on the size of the character space being used, you might also be able to benefit from building lookup tables. Let's assume English (26 contiguous character codepoints); if you start by building a 26-element List<Point>[] array, you can populate that array once from your grid, and then quickly get a list of locations to start your search for any word. For example, to get the locations of h I would write arr['h'-'a'].
You can even leverage this further if you apply the same strategy and build lookup tables for each edge list in the graph. Instead of having to search all 8 edges for each node, you already know which edges to search (if any).
(One note - if your character space is non-contiguous, you can still do a lookup table, but you'll need to use a HashMap<Character,List<Point>> and map.get('h') instead.)
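A rough sketch of that lookup table (assuming lowercase a-z letters in the grid; the names are mine):

import java.awt.Point;
import java.util.ArrayList;
import java.util.List;

// One list of grid positions per letter of the alphabet.
static List<Point>[] buildStartTable(String[][] grid) {
    @SuppressWarnings("unchecked")
    List<Point>[] positions = new List[26];
    for (int i = 0; i < 26; i++) {
        positions[i] = new ArrayList<Point>();
    }
    for (int row = 0; row < grid.length; row++) {
        for (int col = 0; col < grid[row].length; col++) {
            positions[grid[row][col].charAt(0) - 'a'].add(new Point(row, col));
        }
    }
    return positions;
}

// Usage: every place a search for "hello" could start.
// List<Point> starts = buildStartTable(grid)['h' - 'a'];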
One approach to investigate is to generate all the possible sequences of letters (strings) from the grid, then check if each word exists in this set of strings, rather than checking each word against the grid. E.g. starting at h in your first grid:
h
hb
he
he // duplicate, but different path
hbc
hbg
hbe
hbe // ditto
heb
hec
heg
...
This is only likely to be faster for very large lists of words because of the overhead of generating the sequences. For small lists of words it would be much faster to test them individually against the grid.
You would either need to store the entire path (including coordinates) or have a separate step to work out the path for the words that match. Which is faster will depend on the hit rate (i.e. what proportion of input words you actually find in the grid).
Depending on what you need to achieve, you could perhaps compare the sequences against a list of dictionary words to eliminate the non-words before beginning the matching.
Update 2: In the linked question there are several working, fast solutions that generate sequences from the grid, deepening recursively to generate longer sequences. However, they test these against a Trie generated from the word list, which enables them to abandon a subtree of sequences early - this prunes the search and improves efficiency greatly. It has a similar effect to the transition filtering suggested by Markus.
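As an illustration of that trie-pruned search, here is a rough Java sketch (the class and method names are mine, and it may report a word more than once if several paths spell it):

import java.util.ArrayList;
import java.util.List;

class TrieNode {
    TrieNode[] next = new TrieNode[26];
    boolean isWord;
}

class GridSearch {
    static TrieNode buildTrie(List<String> words) {
        TrieNode root = new TrieNode();
        for (String w : words) {
            TrieNode node = root;
            for (char ch : w.toCharArray()) {
                if (node.next[ch - 'a'] == null) node.next[ch - 'a'] = new TrieNode();
                node = node.next[ch - 'a'];
            }
            node.isWord = true;
        }
        return root;
    }

    static List<String> search(String[][] grid, TrieNode root) {
        List<String> found = new ArrayList<>();
        boolean[][] used = new boolean[grid.length][grid[0].length];
        for (int r = 0; r < grid.length; r++)
            for (int c = 0; c < grid[0].length; c++)
                dfs(grid, r, c, root, "", used, found);
        return found;
    }

    static void dfs(String[][] grid, int r, int c, TrieNode node,
                    String prefix, boolean[][] used, List<String> found) {
        if (r < 0 || r >= grid.length || c < 0 || c >= grid[0].length || used[r][c]) return;
        char ch = grid[r][c].charAt(0);
        TrieNode next = node.next[ch - 'a'];
        if (next == null) return;           // no word continues this way: prune the whole subtree
        String word = prefix + ch;
        if (next.isWord) found.add(word);   // a complete word from the list
        used[r][c] = true;                  // a position may not be reused within one word
        for (int dr = -1; dr <= 1; dr++)
            for (int dc = -1; dc <= 1; dc++)
                if (dr != 0 || dc != 0)
                    dfs(grid, r + dr, c + dc, next, word, used, found);
        used[r][c] = false;                 // backtrack so other paths can use this cell
    }
}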
I'm looking for an algorithm (C/C++/Java - doesn't matter) which will solve a problem that consists of finding the shortest path between 2 nodes (A and B) of a graph. The catch is that the path must visit certain other given nodes (cities). A city can be visited more than once. Example of a path: (A-H-D-C-E-F-G-F-B), where A is the source, B is the destination, and F and G are cities which must be visited.
I see this as a variation of the Traveling Salesman Problem but I couldn't find or write a working algorithm based on my searches.
I was trying to find a solution starting from these topics but without any luck:
https://stackoverflow.com/questions/24856875/tsp-branch-and-bound-implementation-in-c and
Variation of TSP which visits multiple cities
An easy reduction of the problem to TSP would be:
1. For each pair (u, v) of nodes that must be visited, find the distance d(u,v) between them. This can be done efficiently with the Floyd-Warshall algorithm, which finds all-to-all shortest paths.
2. Create a new graph G' consisting only of those nodes, with all edges present and weighted by the distances calculated in (1).
3. Run a standard TSP algorithm to solve the problem on the reduced graph.
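For concreteness, a minimal Java sketch of steps 1 and 2; the adjacency-matrix representation, the INF convention and the method names are my own assumptions:

// Step 1: Floyd-Warshall over an adjacency matrix.
// dist[i][j] starts as the direct edge weight, dist[i][i] = 0, and a large
// INF value (e.g. Long.MAX_VALUE / 4, so sums cannot overflow) where there
// is no edge.
static void floydWarshall(long[][] dist) {
    int n = dist.length;
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (dist[i][k] + dist[k][j] < dist[i][j])
                    dist[i][j] = dist[i][k] + dist[k][j];
}

// Step 2: G' is just dist[][] restricted to the rows/columns of the nodes
// that must be visited (plus the endpoints A and B).
static long[][] reduce(long[][] dist, int[] keep) {
    long[][] reduced = new long[keep.length][keep.length];
    for (int i = 0; i < keep.length; i++)
        for (int j = 0; j < keep.length; j++)
            reduced[i][j] = dist[keep[i]][keep[j]];
    return reduced;
}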
I think that in addition to amit's answer, you'll want to increase the cost of the edges that have A or B as endpoints by a sufficient amount (the total cost of the graph + 1 would probably be sufficient) to ensure that you don't end up with a path that goes through A or B (instead of ending at A and B).
A--10--X--0--B
| | |
| 10 |
| | |
+---0--Y--0--+
The above case would result in a path from A to Y to B to X, unless you increase the cost of the A and B edges (by 21).
A--31--X--21--B
| | |
| 10 |
| | |
+---21--Y--21-+
Now it goes from A to X to Y to B.
Also make sure you remove any edges (A,B) (if they exist).
I'm trying to come up with an algorithm for the following problem :
I've got a collection of triplets of integers - let's call these integers A, B, C. The value stored inside can be big, so generally it's impossible to create an array of size A, B, or C. The goal is to minimize the size of the collection. To do this, we're provided a simple rule that allows us to merge the triplets :
For two triplets (A, B, C) and (A', B', C'): if B == B' and C == C', remove the original triplets and add the triplet (A | A', B, C), where | is bitwise OR. Similar rules hold for B and C also.
In other words, if two values of two triplets are equal, remove these two triplets, bitwise OR the third values and place the result to the collection.
The greedy approach is usually misleading in similar cases and so it is for this problem, but I can't find a simple counterexample that'd lead to a correct solution. For a list with 250 items where the correct solution is 14, the average size computed by greedy merging is about 30 (varies from 20 to 70). The sub-optimal overhead gets bigger as the list size increases.
I've also tried playing around with set bit counts, but I've found no meaningful results. Just the obvious fact that if the records are unique (which is safe to assume), the set bit count always increases.
Here's the stupid greedy implementation (it's just a conceptual thing, so please disregard the code style):
import java.util.ArrayList;
import java.util.List;

public class Record {
    long A;
    long B;
    long C;

    public static void main(String[] args) {
        List<Record> data = new ArrayList<>();
        // Fill it with some data

        // Keep merging any mergeable pair until no more merges are possible.
        boolean found;
        do {
            found = false;
            outer:
            for (int i = 0; i < data.size(); ++i) {
                for (int j = i + 1; j < data.size(); ++j) {
                    try {
                        Record r = merge(data.get(i), data.get(j));
                        found = true;
                        data.remove(j); // j > i, so remove j first to keep i valid
                        data.remove(i);
                        data.add(r);
                        break outer;
                    } catch (IllegalArgumentException ignored) {
                    }
                }
            }
        } while (found);
    }

    // A merge is allowed when two of the three components are equal;
    // the remaining component is bitwise OR'ed.
    public static Record merge(Record r1, Record r2) {
        if (r1.A == r2.A && r1.B == r2.B) {
            Record r = new Record();
            r.A = r1.A;
            r.B = r1.B;
            r.C = r1.C | r2.C;
            return r;
        }
        if (r1.A == r2.A && r1.C == r2.C) {
            Record r = new Record();
            r.A = r1.A;
            r.B = r1.B | r2.B;
            r.C = r1.C;
            return r;
        }
        if (r1.B == r2.B && r1.C == r2.C) {
            Record r = new Record();
            r.A = r1.A | r2.A;
            r.B = r1.B;
            r.C = r1.C;
            return r;
        }
        throw new IllegalArgumentException("Unable to merge these two records!");
    }
}
Do you have any idea how to solve this problem?
This is going to be a very long answer, sadly without an optimal solution (sorry). It is however a serious attempt at applying greedy problem solving to your problem, so it may be useful in principle. I didn't implement the last approach discussed, perhaps that approach can yield the optimal solution -- I can't guarantee that though.
Level 0: Not really greedy
By definition, a greedy algorithm has a heuristic for choosing the next step in a way that is locally optimal, i.e. optimal right now, hoping to reach the global optimum which may or may not be possible always.
Your algorithm chooses any mergable pair and merges them and then moves on. It does no evaluation of what this merge implies and whether there is a better local solution. Because of this I wouldn't call your approach greedy at all. It is just a solution, an approach. I will call it the blind algorithm just so that I can succinctly refer to it in my answer. I will also use a slightly modified version of your algorithm, which, instead of removing two triplets and appending the merged triplet, removes only the second triplet and replaces the first one with the merged one. The order of the resulting triplets is different and thus the final result possibly too. Let me run this modified algorithm over a representative data set, marking to-be-merged triplets with a *:
0: 3 2 3    3 2 3    3 2 3
1: 0 1 0*   0 1 2    0 1 2
2: 1 2 0    1 2 0*   1 2 1
3: 0 1 2*
4: 1 2 1    1 2 1*
5: 0 2 0    0 2 0    0 2 0
Result: 4
Level 1: Greedy
To have a greedy algorithm, you need to formulate the merging decision in a way that allows for comparison of options, when multiple are available. For me, the intuitive formulation of the merging decision was:
If I merge these two triplets, will the resulting set have the maximum possible number of mergable triplets, when compared to the result of merging any other two triplets from the current set?
I repeat, this is intuitive for me. I have no proof that this leads to the globally optimal solution, not even that it will lead to a better-or-equal solution than the blind algorithm -- but it fits the definition of greedy (and is very easy to implement). Let's try it on the above data set, showing between each step, the possible merges (by indicating the indices of triplet pairs) and resulting number of mergables for each possible merge:
            mergables
0: 3 2 3    (1,3)->2
1: 0 1 0    (1,5)->1
2: 1 2 0    (2,4)->2
3: 0 1 2    (2,5)->2
4: 1 2 1
5: 0 2 0
Any choice except merging triplets 1 and 5 is fine; if we take the first pair, we get the same interim set as with the blind algorithm (this time I will collapse the indices to remove gaps):
            mergables
0: 3 2 3    (2,3)->0
1: 0 1 2    (2,4)->1
2: 1 2 0
3: 1 2 1
4: 0 2 0
This is where this algorithm gets it differently: it chooses the triplets 2 and 4 because there is still one merge possible after merging them in contrast to the choice made by the blind algorithm:
            mergables
0: 3 2 3    (2,3)->0    3 2 3
1: 0 1 2                0 1 2
2: 1 2 0                1 2 1
3: 1 2 1
Result: 3
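In code, this level-1 heuristic boils down to something like the following Java sketch. It reuses Record and merge from the question; canMerge, countMergables and chooseGreedyMerge are helper names I made up:

// How many mergable pairs does this list currently contain?
static int countMergables(List<Record> data) {
    int count = 0;
    for (int i = 0; i < data.size(); ++i)
        for (int j = i + 1; j < data.size(); ++j)
            if (canMerge(data.get(i), data.get(j)))
                ++count;
    return count;
}

static boolean canMerge(Record r1, Record r2) {
    return (r1.A == r2.A && r1.B == r2.B)
        || (r1.A == r2.A && r1.C == r2.C)
        || (r1.B == r2.B && r1.C == r2.C);
}

// Try every mergable pair and pick the merge whose result leaves the
// largest number of mergable pairs behind (i.e. lookahead 1).
static int[] chooseGreedyMerge(List<Record> data) {
    int[] best = null;
    int bestScore = -1;
    for (int i = 0; i < data.size(); ++i) {
        for (int j = i + 1; j < data.size(); ++j) {
            if (!canMerge(data.get(i), data.get(j))) continue;
            List<Record> copy = new ArrayList<>(data);
            Record merged = merge(copy.get(i), copy.get(j));
            copy.remove(j);        // j > i, so removing j first keeps index i valid
            copy.set(i, merged);   // replace instead of append, as described above
            int score = countMergables(copy);
            if (score > bestScore) {
                bestScore = score;
                best = new int[] { i, j };
            }
        }
    }
    return best; // null when no merge is possible any more
}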
Level 2: Very greedy
Now, a second step from this intuitive heuristic is to look ahead one merge further and to ask the heuristic question then. Generalized, you would look ahead k merges further and apply the above heuristic, backtrack and decide the best option. This gets very verbose by now, so to exemplify, I will only perform one step of this new heuristic with lookahead 1:
            mergables
0: 3 2 3    (1,3)->(2,3)->0
1: 0 1 0           (2,4)->1*
2: 1 2 0    (1,5)->(2,4)->0
3: 0 1 2    (2,4)->(1,3)->0
4: 1 2 1           (1,4)->0
5: 0 2 0    (2,5)->(1,3)->1*
                   (2,4)->1*
Merge sequences marked with an asterisk are the best options when this new heuristic is applied.
In case a verbal explanation is necessary:
Instead of checking how many merges are possible after each possible merge for the starting set; this time we check how many merges are possible after each possible merge for each resulting set after each possible merge for the starting set. And this is for lookahead 1. For lookahead n, you'd be seeing a very long sentence repeating the part after each possible merge for each resulting set n times.
Level 3: Let's cut the greed
If you look closely, the previous approach has a disastrous performance for even moderate inputs and lookaheads (*). For inputs beyond 20 triplets, anything beyond 4-merge-lookahead takes unreasonably long. The idea here is to cut out merge paths that seem to be worse than an existing solution. If we want to perform lookahead 10, and a specific merge path yields fewer mergables after three merges than another path does after 5 merges, we may just as well cut the current merge path and try another one. This should save a lot of time and allow large lookaheads, which would get us closer to the globally optimal solution, hopefully. I haven't implemented this one for testing, though.
(*): Assuming a large reduction of the input set is possible, the number of merges is proportional to the input size, and the lookahead approximately indicates how much you permute those merges. So you have (|input| choose lookahead) possibilities, a binomial coefficient that for lookahead ≪ |input| can be approximated as O(|input|^lookahead) -- which is also (rightfully) written as "you are thoroughly screwed".
Putting it all together
I was intrigued enough by this problem that I sat and coded this down in Python. Sadly, I was able to prove that different lookaheads yield possibly different results, and that even the blind algorithm occasionally gets it better than lookahead 1 or 2. This is a direct proof that the solution is not optimal (at least for lookahead ≪ |input|). See the source code and helper scripts, as well as proof-triplets on github. Be warned that, apart from memoization of merge results, I made no attempt at optimizing the code CPU-cycle-wise.
I don't have the solution, but I have some ideas.
Representation
A helpful visual representation of the problem is to consider the triplets as points in 3D space. You have integers, so the records will be nodes of a grid, and two records are mergeable if and only if the nodes representing them lie on a common axis-parallel line (i.e. they agree in two of the three coordinates).
Counter-example
I found a (minimal) example where a greedy algorithm may fail. Consider the following records:
(1, 1, 1) \
(2, 1, 1)  |      (3, 1, 1) \
(1, 2, 1)  | ==>  (3, 2, 1)  | ==>  (3, 3, 1)
(2, 2, 1)  |      (2, 2, 2) /       (2, 2, 2)
(2, 2, 2) /
But by choosing the wrong way, it might get stuck at three records:
(1, 1, 1) \
(2, 1, 1)  |      (3, 1, 1)
(1, 2, 1)  | ==>  (1, 2, 1)
(2, 2, 1)  |      (2, 2, 3)
(2, 2, 2) /
Intuition
I feel that this problem is somehow similar to finding the maximum matching in a graph. Most of those algorithms find the optimal solution by beginning with an arbitrary, suboptimal solution, and making it 'more optimal' in each iteration by searching for augmenting paths, which have the following properties:
they are easy to find (polynomial time in the number of nodes),
an augmenting path and the current solution can be combined into a new solution which is strictly better than the current one,
if no augmenting path is found, the current solution is optimal.
I think that the optimal solution to your problem can be found in a similar spirit.
Based on your problem description:

"I'm given a bunch of events in time that's usually got some pattern. The goal is to find the pattern. Each of the bits in the integer represents "the event occurred in this particular year/month/day". For example, the representation of March 7, 2014 would be [1 << (2014-1970), 1 << 3, 1 << 7]. The pattern described above allows us to compress these events so that we can say 'the event occurred every 1st in years 2000-2010'." – Danstahr Mar 7 at 10:56

I'd like to encourage you with the answers that MicSim has pointed at, specifically:

"Based on your problem description, you should check out these SO answers (if you didn't do it already): stackoverflow.com/a/4202095/44522 and stackoverflow.com/a/3251229/44522" – MicSim Mar 7 at 15:31
The description of your goal is much clearer than the approach you are using. I'm scared that you won't get anywhere with the idea of merging. Sounds scary. The answer you get depends upon the order in which you manipulate your data. You don't want that.
It seems you need to keep data and summarize. So, you might try counting those bits instead of merging them. Try clustering algorithms, sure, but more specifically try regression analysis. I should think you would get great results using a correlation analysis if you create some auxiliary data. For example, if you create data for "Monday", "Tuesday", "first Monday of the month", "first Tuesday of the month", ... "second Monday of the month", ... "even years", "every four years", "leap years", "years without leap days", ... "years ending in 3", ...
What you have right now is "1st day of the month", "2nd day of the month", ... "1st month of the year", "2nd month of the year", ... These don't sound like sophisticated enough descriptions to find the pattern.
If you feel it is necessary to continue the approach you have started, then you might treat it more as a search than a merge. What I mean is that you're going to need a criteria/measure for success. You can do the merge on the original data while requiring strictly that A==A'. Then repeat the merge on the original data while requiring B==B'. Likewise C==C'. Finally compare the results (using the criteria/measure). Do you see where this is going? Your idea of bit counting could be used as a measure.
Another point, you could do better at performance. Instead of double-looping through all your data and matching up pairs, I'd encourage you to do single passes through the data and sort it into bins. The HashMap is your friend. Make sure to implement both hashCode() and equals(). Using a Map you can sort data by a key (say where month and day both match) and then accumulate the years in the value. Oh, man, this could be a lot of coding.
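For instance, a sketch of that binning idea in Java; the key class and the assumption that A holds the year bits, B the month bits and C the day bits are hypothetical, taken from the quoted comment rather than from any posted code:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical key: a record's month and day bits; equal keys land in the same bin.
final class MonthDayKey {
    final long monthBits, dayBits;
    MonthDayKey(long monthBits, long dayBits) {
        this.monthBits = monthBits;
        this.dayBits = dayBits;
    }
    @Override public boolean equals(Object o) {
        if (!(o instanceof MonthDayKey)) return false;
        MonthDayKey k = (MonthDayKey) o;
        return monthBits == k.monthBits && dayBits == k.dayBits;
    }
    @Override public int hashCode() {
        return Objects.hash(monthBits, dayBits);
    }
}

// Single pass: group records by (month, day) and OR the year bits together.
static Map<MonthDayKey, Long> binByMonthDay(List<Record> data) {
    Map<MonthDayKey, Long> bins = new HashMap<>();
    for (Record r : data) {
        MonthDayKey key = new MonthDayKey(r.B, r.C);  // assuming B = months, C = days
        bins.merge(key, r.A, (a, b) -> a | b);        // accumulate the year bits (A)
    }
    return bins;
}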
Finally, if the execution time isn't an issue and you don't need performance, then here's something to try. Your algorithm is dependent on the ordering of the data. You get different answers based on different sorting. Your criteria for success is the answer with the smallest size after merging. So, repeatedly loop though this algorithm: shuffle the original data, do your merge, save the result. Now, every time through the loop keep the result which is the smallest so far. Whenever you get a result smaller than the previous minimum, print out the number of iterations, and the size. This is a very simplistic algorithm, but given enough time it will find small solutions. Based on your data size, it might take too long ...
Kind Regards,
-JohnStosh
This was a question in one of my friend's programming classes.
Q. How do you sort an array of ints and then arrange them such that all duplicate elements appear at the end of the array?
For example, given the input
{5, 2, 7, 6, 1, 1, 5, 6, 2}
The output would be
{1, 2, 5, 6, 7, 1, 2, 5, 6}
Note that the numbers are sorted and duplicate numbers are after 7, which is the maximum in the array.
This has to be achieved with out using any Java library packages/utils.
I suggested sorting the array first using insertion or bubble sort, and then going over the array, performing something like the following:
for (int i = 0; i < nums.length - 2; i++) {
    for (int j = i + 1; j < nums.length; j++) {
        // current and next are same, move elements up
        // and place the next number at the end.
        if (nums[i] == nums[j]) {
            int temp = nums[j];
            for (int k = j; k < nums.length - 1; k++) {
                nums[k] = nums[k + 1];
            }
            nums[nums.length - 1] = temp;
            break;
        }
    }
}
I tried this myself later (that is where the code above comes from). As I tried it out, I started to think this could be achieved with less code and more efficiently, and that maybe I gave wrong advice.
Any thoughts?
Depending on the parameters of your problem, there are many approaches to solving this.
If you are not allowed to use O(n) external memory, then one option would be to use a standard sorting algorithm to sort the array in-place in O(n log n) time, then to run a second pass over it to move the duplicates to the end (as you've suggested). The code you posted above takes O(n²) time, but I think that this step can be done in O(n log n) time using a slightly more complicated algorithm. The idea works in two steps. In the first step, in O(n log n) time you bring all non-duplicated elements to the front in sorted order and bring all the duplicates to the back in non-sorted order. Once you've done that, you then sort the back half of the array in O(n log n) time using the sorting algorithm from the first step.
I'm not going to go into the code to sort the array. I really love sorting, but there are so many other good resources on how to sort arrays in-place that it's not a good use of my time/space here to go into them. If it helps, here are links to Java implementations of heapsort, quicksort, and smoothsort, all of which run in O(n log n) time. Heapsort and smoothsort use only O(1) external memory, while quicksort can use O(n) in the worst case (though good implementations can limit this to O(log n) using cute tricks).
The interesting code is the logic to bring all the non-duplicated elements to the front of the range. Intuitively, the code works by storing two pointers - a read pointer and a write pointer. The read pointer points to the next element to read, while the write pointer points to the location where the next unique element should be placed. For example, given this array:
1 1 1 1 2 2 3 4 5 5
We start with the read and write pointers initially pointing at 1:
write  v
       1 1 1 1 2 2 3 4 5 5
read   ^
Next, we skip the read pointer ahead to the next element that isn't 1. This finds 2:
write  v
       1 1 1 1 2 2 3 4 5 5
read           ^
Then, we bump the write pointer to the next location:
write    v
       1 1 1 1 2 2 3 4 5 5
read           ^
Now, we swap the 2 into the spot held by the write pointer:
write    v
       1 2 1 1 1 2 3 4 5 5
read           ^
advance the read pointer to the next value that isn't 2:
write    v
       1 2 1 1 1 2 3 4 5 5
read               ^
then advance the write pointer:
write      v
       1 2 1 1 1 2 3 4 5 5
read               ^
Again, we exchange the values pointed at by 'read' and 'write' and move the write pointer forward, then move the read pointer to the next unique value:
write        v
       1 2 3 1 1 2 1 4 5 5
read                 ^
Once more yields
write          v
       1 2 3 4 1 2 1 1 5 5
read                   ^
and the final iteration gives
write            v
       1 2 3 4 5 2 1 1 1 5
read                       ^
If we now sort from the write pointer to the read pointer, we get
write            v
       1 2 3 4 5 1 1 1 2 5
read                       ^
and bingo! We've got the answer we're looking for.
In (untested, sorry...) Java code, this fixup step might look like this:
int read = 0;
int write = 0;
while (read < array.length) {
    /* Swap the values pointed at by read and write. */
    int temp = array[write];
    array[write] = array[read];
    array[read] = temp;

    /* Step past the element we just swapped back to the read position (it is
     * a duplicate of an earlier unique value), then advance the read pointer
     * past any remaining copies of the value we just placed. Since we moved
     * the unique value to the write location, we compare values against
     * array[write] instead of array[read].
     */
    ++ read;
    while (read < array.length && array[write] == array[read])
        ++ read;

    /* Advance the write pointer. */
    ++ write;
}
This algorithm runs in O(n) time, which leads to an overall O(n log n) algorithm for the problem. Since the reordering step uses O(1) memory, the overall memory usage would be either O(1) (for something like smoothsort or heapsort) or O(log n) (for something like quicksort).
EDIT: After talking this over with a friend, I think that there is a much more elegant solution to the problem based on a modification of quicksort. Typically, when you run quicksort, you end up partitioning the array into three regions:
+----------------+----------------+----------------+
| values < pivot | values = pivot | values > pivot |
+----------------+----------------+----------------+
The recursion then sorts the first and last regions to put them into sorted order. However, we can modify this for our version of the problem. We'll need as a primitive the rotation algorithm, which takes two adjacent blocks of values in an array and exchanges them in O(n) time. It does not change the relative order of the elements in those blocks. For example, we could use rotation to convert the array
1 2 3 4 5 6 7 8
into
3 4 5 6 7 8 1 2
and can do so in O(n) time.
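That rotation primitive can be done in place with the classic three-reversal trick; a quick sketch (my own helpers, not from any library):

// Rotate the range [from, to) left by 'shift' positions in O(n) time and O(1)
// space using the reverse-reverse-reverse trick.
// Example: rotateLeft(a, 0, 8, 2) turns 1 2 3 4 5 6 7 8 into 3 4 5 6 7 8 1 2.
static void rotateLeft(int[] a, int from, int to, int shift) {
    reverse(a, from, from + shift);
    reverse(a, from + shift, to);
    reverse(a, from, to);
}

// Reverses the half-open range [from, to).
static void reverse(int[] a, int from, int to) {
    for (int i = from, j = to - 1; i < j; i++, j--) {
        int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
    }
}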
The modified version of quicksort would work by using the Bentley-McIlroy three-way partitioning algorithm (described here) to rearrange the array elements into the configuration shown above, using O(1) extra space. Next, we apply a rotation to reorder the elements so that they look like this:
+----------------+----------------+----------------+
| values < pivot | values > pivot | values = pivot |
+----------------+----------------+----------------+
Next, we perform a swap so that we move exactly one copy of the pivot element into the set of elements at least as large as the pivot. This may leave extra copies of the pivot behind. We then recursively apply the sorting algorithm to the < and > ranges. When we do this, the resulting array will look like this:
+---------+-------------+---------+-------------+---------+
| < pivot | dup < pivot | > pivot | dup > pivot | = pivot |
+---------+-------------+---------+-------------+---------+
We then apply two rotations to the range to put it into the final order. First, rotate the duplicate values less than the pivot with the values greater than the pivot. This gives
+---------+---------+-------------+-------------+---------+
| < pivot | > pivot | dup < pivot | dup > pivot | = pivot |
+---------+---------+-------------+-------------+---------+
At this point, the first range holds the unique elements in ascending order:
+---------------------+-------------+-------------+---------+
| sorted unique elems | dup < pivot | dup > pivot | = pivot |
+---------------------+-------------+-------------+---------+
Finally, do one last rotation of the duplicate elements greater than the pivot and the elements equal to the pivot to yield this:
+---------------------+-------------+---------+-------------+
| sorted unique elems | dup < pivot | = pivot | dup > pivot |
+---------------------+-------------+---------+-------------+
Notice that these last three blocks are just the sorted duplicate values:
+---------------------+-------------------------------------+
| sorted unique elems | sorted duplicate elements |
+---------------------+-------------------------------------+
and voila! We've got everything in the order we want. Using the same analysis that you'd do for normal quicksort, plus the fact that we're only doing O(n) work at each level (three extra rotations), this works out to O(n log n) in the best case with O(log n) memory usage. It's still O(n²) in the worst case with O(log n) memory, but that happens with extremely low probability.
If you are allowed to use O(n) memory, one option would be to build a balanced binary search tree out of all of the elements that stores key/value pairs, where each key is an element of the array and the value is the number of times it appears. You could then sort the array in your format as follows:
1. For each element in the array:
   - If that element already exists in the BST, increment its count.
   - Otherwise, add a new node to the BST with that element having count 1.
2. Do an inorder walk of the BST. When encountering a node, output its key.
3. Do a second inorder walk of the BST. When encountering a node, if it has count greater than one, output n - 1 copies of that node, where n is the number of times it appears.
The runtime of this algorithm is O(n log n), but it would be pretty tricky to code up a BST from scratch. It also requires external space, which I'm not sure you're allowed to do.
However, if you are allowed external space and the arrays you are sorting are small and contain small integers, you could modify the above approach by using a modified counting sort. Just replace the BST with an array large enough for each integer in the original array to be a key. This reduces the runtime to O(n + k), with memory usage O(k), where k is the largest element in the array.
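As an illustration, a small sketch of that counting-based variant (it assumes the values are non-negative and no larger than some known max; the names are mine):

// Counting-sort flavoured version: each distinct value once, in order,
// followed by all the leftover duplicates, also in order.
static int[] sortWithDuplicatesAtEnd(int[] nums, int max) {
    int[] count = new int[max + 1];
    for (int v : nums) count[v]++;

    int[] out = new int[nums.length];
    int pos = 0;
    for (int v = 0; v <= max; v++)                        // the unique front part
        if (count[v] > 0) out[pos++] = v;
    for (int v = 0; v <= max; v++)                        // then the duplicates
        for (int c = 1; c < count[v]; c++) out[pos++] = v;
    return out;
}

// sortWithDuplicatesAtEnd(new int[] {5, 2, 7, 6, 1, 1, 5, 6, 2}, 7)
// returns {1, 2, 5, 6, 7, 1, 2, 5, 6}.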
Hope this helps!
A modified merge sort could do the trick: on the last merge pass, keep track of the last number you pushed onto the front of the result array, and if the lowest of the next numbers is equal to it, add it to the end instead of the front.
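A rough sketch of what that last merge pass could look like, assuming left and right are the two already-sorted halves (this is my reading of the suggestion, not tested code from the answerer):

// Final merge pass of a merge sort, modified so that any value equal to the
// last value emitted at the front goes into a "duplicates" tail instead.
static int[] mergeWithDuplicatesAtEnd(int[] left, int[] right) {
    int n = left.length + right.length;
    int[] result = new int[n];
    int[] dups = new int[n];
    int i = 0, j = 0, front = 0, dupCount = 0;
    int last = 0;
    boolean haveLast = false;
    while (i < left.length || j < right.length) {
        int next;
        if (j >= right.length || (i < left.length && left[i] <= right[j])) {
            next = left[i++];
        } else {
            next = right[j++];
        }
        if (haveLast && next == last) {
            dups[dupCount++] = next;   // duplicate of the value just pushed to the front
        } else {
            result[front++] = next;    // a new value goes to the front
            last = next;
            haveLast = true;
        }
    }
    // The duplicates come out in ascending order; append them after the uniques.
    for (int k = 0; k < dupCount; k++) result[front + k] = dups[k];
    return result;
}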
Welcome to the world of Data Structures and Algorithms. You're absolutely right in that you could sort that faster. You could also do it a dozen different ways. PhDs are spent on this stuff :)
Here's a link where you can see an optimized bubble sort
You might also want to check out Big O Notation
Have fun and good luck!
Use quicksort to sort the array. When implementing the sort you can modify it slightly by adding all duplicates to a separate duplicate array. When done, simply append the duplicate array to the end of the sorted array.
Given an array of any dimension (for instance [1 2 3]), I need a function that gives all combinations, like
1 |
1 2 |
1 2 3 |
1 3 |
2 |
2 1 3 |
2 3 |
...
Since I'm guessing this is homework, I'll try to refrain from giving a complete answer.
Suppose you already had all combinations (or permutations if that is what you are looking for) of an array of size n-1. If you had that, you could use those combinations/permutations as a basis for forming the new combinations/permutations by adding the nth element to them in the appropriate way. That is the basis for what computer scientists call recursion (and mathematicians like to call a very similar idea induction).
So you could write a method that would handle the n case, assuming the n-1 case had been handled, and you can put in a check to handle the base case as well.
I am looking for an efficient way to solve the following problem.
List 1 is a list of records that are identified by a primitive triplet:
X | Y | Z
List 2 is a list of records that are identified by three sets. One Xs, one Ys, one Zs. The X, Y, Zs are of the same 'type' as those in list one so are directly comparable with one another.
Set(X) | Set(Y) | Set(Z)
For an item in list 1 I need to find all the items in list 2 where the X, Y, Z from list 1 all occur in their corresponding sets in list 2. This is best demonstrated by an example:
List 1:
X1, Y1, Z1
List 2:
(X1, X2) | (Y1) | (Z1, Z3)
(X1) | (Y1, Y2) | (Z1, Z2, Z3)
(X3) | (Y1, Y3) | (Z2, Z3)
In the above, the item in list 1 would match the first two items in list 2. The third item would not be matched as X1 does not occur in the X set, and Z1 does not occur in the Z set.
I have written a functionally correct version of the algorithm but am concerned about performance on larger data sets. Both lists are very large so iterating over list 1 and then performing an iteration over list 2 per item is going to be very inefficient.
I tried to build an index by de-normalizing each item in list 2 into a map, but the number of index entries per item is proportional to the size of the item's sets. As such this uses a very large amount of memory and also requires significant resources to build.
Can anyone suggest an optimal way of solving this? I'm happy to consider both memory-optimal and CPU-optimal solutions, but striking a balance would be nice!
There are going to be a lot of ways to approach this. Which is right depends on the data and how much memory is available.
One simple technique is to build a table from list2, to accelerate the queries coming from list1.
from collections import defaultdict
# Build "hits". hits[0] is a table of, for each x,
# which items in list2 contain it. Likewise hits[1]
# is for y and hits[2] is for z.
hits = [defaultdict(set) for i in range(3)]
for rowid, row in enumerate(list2):
    for i in range(3):
        for v in row[i]:
            hits[i][v].add(rowid)

# For each row, query the database to find which
# items in list2 contain all three values.
for x, y, z in list1:
    print hits[0][x].intersection(hits[1][y], hits[2][z])
If the total size of the Sets is not too large you could try to model List 2 as bitfields. The structure will probably be quite fragmented though - maybe the structures referenced in the Wikipedia article on Bit arrays (Judy arrays, tries, Bloom filters) can help address the memory problems of your de-normalization approach.
You could build a tree out of List2; the first level of the tree is the first of (X1..Xn) that appears in set X. The second level is the values for the second item, plus a leaf node containing the set of lists which contain only X1. The next level contains the next possible value, and so on.
Root --+--X1--+--EOF--> List of pointers to list2 lines containing only "X1"
       |      |
       |      +--X2---+--EOF--> List of pointers to list2 lines containing only "X1,X2"
       |      |       |
       |      |       +--X3--+--etc--
       |      |
       |      +--X3---+--EOF--> "X1,X3"
       |
       +--X2--+--EOF--> "X2"
       |      |
       |      +--X3---+--EOF--> "X2,X3"
       |      |       |
       ...
This is expensive in memory consumption (N^2 log K, I think? where N=values for X, K=lines in List2) but results in fast retrieval times. If the number of possible Xs is large then this approach will break down...
Obviously you could build this index for all 3 parts of the tuple, and then AND together the results from searching each tree.
There's a fairly efficient way to do this with a single pass over list2. You start by building an index of the items in list1.
from collections import defaultdict
# index is HashMap<X, HashMap<Y, HashMap<Z, Integer>>>
index = defaultdict(lambda: defaultdict(dict))
for rowid, (x, y, z) in enumerate(list1):
    index[x][y][z] = rowid

for rowid2, (xs, ys, zs) in enumerate(list2):
    xhits = defaultdict(list)
    for x in xs:
        if x in index:
            for y, zmap in index[x].iteritems():
                xhits[y].append(zmap)

    yhits = defaultdict(list)
    for y in ys:
        if y in xhits:
            for zmap in xhits[y]:
                for z, rowid1 in zmap.iteritems():
                    yhits[z].append(rowid1)

    for z in zs:
        if z in yhits:
            for rowid1 in yhits[z]:
                print "list1[%d] matches list2[%d]" % (rowid1, rowid2)
The extra bookkeeping here will probably make it slower than indexing list2. But since in your case list1 is typically much smaller than list2, this will use much less memory. If you're reading list2 from disk, with this algorithm you never need to keep any part of it in memory.
Memory access can be a big deal, so I can't say for sure which will be faster in practice. Have to measure. The worst-case time complexity in both cases, barring hash table malfunctions, is O(len(list1)*len(list2)).
How about using HashSets for List 2? This way you will only need to iterate over List 1.
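For example (a sketch; Row is a hypothetical holder for the three sets of one List 2 item):

import java.util.HashSet;
import java.util.Set;

// Hypothetical holder for one item of List 2.
class Row {
    Set<Long> xs = new HashSet<>();
    Set<Long> ys = new HashSet<>();
    Set<Long> zs = new HashSet<>();

    // With HashSets each membership test is O(1) on average, so checking one
    // List 1 triplet against this row costs three cheap lookups.
    boolean matches(long x, long y, long z) {
        return xs.contains(x) && ys.contains(y) && zs.contains(z);
    }
}

You still visit every List 2 row for every List 1 item, but each visit becomes a constant amount of work instead of a scan through the sets.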
If you use Guava, there is a high-level way to do this that is not necessarily optimal but doesn't do anything crazy:
List<SomeType> list1 = ...;
List<Set<SomeType>> candidateFromList2 = ...;
if (Sets.cartesianProduct(candidateFromList2).contains(list1)) { ... }
But it's not that hard to check this "longhand" either.