This is a more descriptive version of my previous question. The problem I am trying to solve relates to block-matching or image-within-image recognition.
I see an image, extract the [x,y] of every black pixel and create a set for that image, such as
{[8,0], [9,0], [11,0]}
The set is then normalized so that the first pixel in the set is at [0,0], but the relationships between the pixels are preserved. For example, I see {[8,0], [9,0]} and change the set to {[0,0], [1,0]}. The point of the extraction is that if I later see {[4,0], [5,0]}, I can recognize that basic relationship of two adjacent pixels as my {[0,0], [1,0]}, since it is the same image, only in a different location.
I have a list of these pixel sets, called "seen images". Each 'seen image' has a unique identifier that allows it to be used as a nested component of other sets. For example:
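In code, the normalization step might look like this minimal sketch (class and method names are mine; pixels are int[] pairs, and the anchor pixel is assumed to come first in the list):

```java
import java.util.*;

// Sketch of the normalization step described above: translate a pixel set
// so that its first pixel lands on [0,0], preserving relative positions.
// Class and method names are illustrative, not from the original post.
public class PixelNormalize {

    // pixels: list of [x, y] pairs, assumed ordered so the anchor pixel is first
    public static List<int[]> normalize(List<int[]> pixels) {
        List<int[]> out = new ArrayList<>();
        if (pixels.isEmpty()) return out;
        int dx = pixels.get(0)[0];
        int dy = pixels.get(0)[1];
        for (int[] p : pixels) {
            out.add(new int[] { p[0] - dx, p[1] - dy }); // shift every pixel by the anchor offset
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> seen = Arrays.asList(new int[]{4, 0}, new int[]{5, 0});
        for (int[] p : normalize(seen)) {
            System.out.println(Arrays.toString(p)); // [0, 0] then [1, 0]
        }
    }
}
```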
{[0,0], [1,0]} has the identifier 'Z'
So if I see:
{[0,0], [1, 0], [5,6]}
I can identify and store it as:
{[z], [5, 6]}
The problem with this is that I have to generate every combination of [x,y]'s within the pixel set to check for a pattern match, and to build the best representation. Using the previous example, I have to check:
{[0,0], [1,0]},
{[0,0], [5,6]},
{[1,0], [5,6]} which is {[0,0], [4,6]}
{[0,0], [1,0], [5,6]}
And then if a match occurs, that subset gets replaced with its ID, merged with the remainder of the original set, and the new combination needs to be checked against the 'seen images':
{[z],[5, 6]}
The point of all that is to match as many of the [x,y]'s as possible, using the fewest pre-existing pieces as components, so that a newly seen image is represented concisely. The greedy solution of taking the component that matches the largest subset is not the right one. The complexity lies in generating all of the combinations that need checking, plus the combinations spawned by each match: if some match and swap produces {[z], [1,0], [2,0]}, then I also need to check (and, on a match, repeat the process for):
{[z], [1,0]}
{[z], [2,0]}
{[1,0], [2,0]} which is {[0,0], [1,0]}
{[z], [1,0], [2,0]}
Currently I generate the pixel combinations this way (here I use numbers to represent pixels, so 1 == some [x,y]). Ex. (1, 2, 3, 4): make 3 lists:
1.) 2.) 3.)
12 23 34
13 24
14
Then for each number, for each list starting at that number index + 1, concatenate the number and each item and store on the appropriate list, ex. (1+23) = 123, (1+24) = 124
1.) 2.) 3.)
12 23 34
13 24
14
---- ---- ----
123 234
124
134
So those are all the combinations I need to check if they are in my 'seen images'. This is a bad way to do this whole process. I have considered different variations / optimizations, including once the second half of a list has been generated (below the ----), check every item on the list for matches, and then destroy the list to save space, and then continue generating combinations. Another option would be to generate a single combination, and then check it for a match, and somehow index the combinations so you know which one to generate next.
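For what it's worth, the same search space can be enumerated with a bitmask sketch like the one below (illustrative only; it also shows why the approach explodes, since the number of candidate subsets is exponential in the set size, so it is not viable for ~a million items):

```java
import java.util.*;

// Sketch of the combination enumeration described above, using bitmasks
// instead of the three-list scheme: every subset of size >= 2 of the pixel
// indices is a candidate to look up in "seen images".
public class Combos {

    public static List<List<Integer>> subsets(int n) {
        List<List<Integer>> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            if (Integer.bitCount(mask) < 2) continue; // single pixels are not patterns
            List<Integer> subset = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) subset.add(i);
            }
            result.add(subset);
        }
        return result;
    }

    public static void main(String[] args) {
        // For 4 pixels there are 2^4 - 4 - 1 = 11 subsets of size >= 2:
        // the ten shown in the lists above, plus the full set 1234.
        System.out.println(subsets(4).size()); // 11
    }
}
```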
Does anyone recognize this process in general? Can you help me optimize what I am doing for a set of ~1 million items? I also have not yet come up with a non-recursive or efficient way to handle the fact that each match generates additional combinations to check.
Related
This question is related to "Comparison method violates its general contract!" - TimSort and GridLayout,
and several other similar "general contract violation" questions. My question is particularly related to Ceekay's answer at the bottom of that page about "How to test the TimSort implementation". In my case I have fixed the application bug that brought me here, which was due to a symmetry violation, but I am having trouble creating a unit test to expose that violation (should the fix be commented out or regressed in the future).
public class TickNumber implements Comparable<TickNumber> {
protected String zone;
protected String track;
}
public class GisTickNumber extends TickNumber implements Comparable<TickNumber> {
private String suffix;
}
I've left out all the implementation details, but basically a Tick number is a 4-digit number where the first two digits are the zone and the second two digits are the track. GisTickNumbers can have alpha characters in the zone and/or track fields, and they can optionally have an alpha suffix of one or two characters. Valid Ticks are all integers in the range [0000, 9999] (even when represented as Strings). All valid Tick numbers are valid Gis Tick numbers, but valid Gis Ticks can also look like A912, R123, 0123G, A346*.
My symmetry violation was that in the GisTick compareTo I was accounting for the possible suffix, but in the plain Tick compareTo I was not. Thus, if 'this' was a 0000 Tick and 'that' was a 0000* Gis Tick, 0000.compareTo(0000*) would return 0, while if 'this' was a 0000* Gis Tick and 'that' was a 0000 Tick, 0000*.compareTo(0000) would return 1. A clear symmetry violation (once the shroud is pulled back).
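For reference, a stripped-down reconstruction of that bug might look like the sketch below (class and field names are simplified guesses, not the real code; the point is only the asymmetric pair of compareTo methods):

```java
// Minimal reconstruction of the symmetry bug described above.
// Tick ignores any suffix; GisTick orders a suffixed value after the bare one.
public class SymmetryDemo {

    static class Tick implements Comparable<Tick> {
        final String digits;           // e.g. "0000"
        Tick(String digits) { this.digits = digits; }
        @Override public int compareTo(Tick that) {
            return this.digits.compareTo(that.digits); // BUG: never looks at a suffix
        }
    }

    static class GisTick extends Tick {
        final String suffix;           // e.g. "*"
        GisTick(String digits, String suffix) { super(digits); this.suffix = suffix; }
        @Override public int compareTo(Tick that) {
            int c = super.compareTo(that);
            if (c != 0) return c;
            String thatSuffix = (that instanceof GisTick) ? ((GisTick) that).suffix : "";
            return this.suffix.compareTo(thatSuffix); // suffixed sorts after bare
        }
    }

    public static void main(String[] args) {
        Tick plain = new Tick("0000");
        Tick gis = new GisTick("0000", "*");
        // Symmetry requires sgn(a.compareTo(b)) == -sgn(b.compareTo(a)):
        System.out.println(plain.compareTo(gis)); // 0
        System.out.println(gis.compareTo(plain)); // 1 -> contract violated
    }
}
```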
According to Ceekay in an answer to the linked question,
Create a list with 32 or more objects.
Within that list, there needs to [be] two or more runs.
Each run must contain 3 or more objects.
Once you meet those [three] criteria you can begin testing for this failure.
I believe I have set up such a list of TickNumber (and GisTickNumber) objects for my unit test, but I can't seem to get the test to fail. Even though the list has over 100 objects, more than two runs, and each run contains about 10 objects. So, my question is what other characteristics does the list of objects under test need to satisfy in order for a call to Collections.sort(testList) to fail due to "general (symmetry) contract violation"?
and yes, I commented out the fix before I ran the unit test that I was expecting to fail.
Solved:
I ended up debugging to a breakpoint where I could view the toString() representation of the Objects in the List getting sorted, and was then able to extract the TickNumber information from the rest of that data and eventually use that extracted data in my unit test. Finally, I went back and removed list items until I crafted what seems to be a list that satisfies the "minimal requirements" for triggering symmetry related "general contract violations".
I'm not sure how to generalize my specific solution into generic characteristics a list must satisfy in order to trigger TimSort and this "general contract violation". But here goes...
The list must contain 64 elements (49 + 1 + 12 + 1 + 1)
The list must contain a run of 50, where for 49 of the 50 elements the compare result is 0 (i.e. comparisons match)
Within the front half of that "matching run" there must be 1 element that sorts before all the others in the run (all the others in the run match when compared), and that single odd element must also "symmetry mismatch" the element at the end of the other runs.
The list must contain a minimum of 2 other runs of three or more elements (my test list has a run of 8 followed by a run of 4)
The other half of the "symmetry mismatch" must be the last item in the run of 4 (the second other run).
The list must contain an element at (end - 1) position that sorts to the beginning of the sorted list
The list must contain an element at (end) position that sorts somewhere in the middle of the sorted list
I'm pretty sure the above bullets are not an exhaustive list of general requirements a list must satisfy to expose a symmetry violation when the list is sorted, but they worked for me in one specific case.
Specifically, my crafted test list starts with 49 TickNumber objects where Tick = "9999", and somewhere in the front half of the 49 Ticks there is a "9910" Tick, for a total of 50 Tick numbers in this opening pseudo-run. (Pseudo because "9910" breaks up the unsorted run of 49 matching "9999" Ticks.) The "9910" Tick in the opening run is one-half of the symmetry mismatch I am testing for. Then the test list contains 12 GisTickNumber objects as a run of 8 ("9915*", "9920*", "9922*", "9931*", "9933*", "9934*", "9936*", "9939*"), followed by a run of 4 ("9907*", "9908*", "9909*", "9910*"). Note that the last item in the run of 4 is the other half of the "symmetry mismatch" I am testing for. Finally, the list caps off with a "9901" TickNumber object that will lead off the sorted list, and a "9978*" GisTickNumber object that sorts somewhere in the middle. I have tried removing and/or rearranging the Objects in the test list to no avail. The Unit test will start issuing false-positive (success) results if, for example, the "9901" element is removed from the test list. (false-positives will also occur if "9901" is moved to the front of the unsorted list)
Note: I suspect that the plain TickNumber part of the "9910" symmetry mismatch can appear anywhere in the opening run before the MIN_RUN'th element. In other words, if MIN_RUN is 32 and the leading run in my test list has 50 elements with 49 that compare "the same", then the "9910" symmetry mismatch element can appear at any position in the run less than position 32. This supposition hasn't been proven; but I have empirically determined that the symmetry mismatch element can't appear near the end of the leading run, and that it can appear in multiple spots near the start of the leading run. (one different spot per test run)
In general, if any of these conditions are not "exactly right" you won't trigger the "general contract violation" even though you are testing list data where comparisons should violate the contract.
In my case, the only TickNumber objects that match in my test list are the 49 "9999" Ticks and the 2 ("9910" and "9910*") Ticks that violate symmetry on comparison.
I implemented a custom HashMap class (in C++, but shouldn't matter). The implementation is simple -
A large array holds pointers to Items.
Each item contains the key - value pair, and a pointer to an Item (to form a linked list in case of key collision).
I also implemented an iterator for it.
My implementation of incrementing/decrementing the iterator is not very efficient. From the present position, the iterator scans the array of hashes for the next non-null entry. This is very inefficient when the map is sparsely populated (which it would be for my use case).
Can anyone suggest a faster implementation, without affecting the complexity of other operations like insert and find? My primary use case is find, secondary is insert. Iteration is not even needed, I just want to know this for the sake of learning.
PS: Why did I implement a custom class? Because I need to find strings with some error tolerance, while the ready-made hash maps I have seen provide only exact matching.
EDIT: To clarify, I am talking about incrementing/decrementing an already obtained iterator. Yes, this is mostly done in order to traverse the whole map.
The errors in the strings (keys) in my case come from OCR errors, so I cannot use the error-handling techniques used to detect typing errors. The chance of the first character being wrong is almost the same as that of the last one.
Also, my keys are always strings, one word to be exact. The number of entries will be less than 5000, so a hash table size of 2^16 is enough for me. It will still be sparsely populated, but that's OK.
My hash function:
hash code size is 16 bits.
First 5 bits for the word length. ==> Max possible key length = 32. Reasonable, given that key is a single word.
Last 11 bits for the sum of the char codes. I only store the English alphabet characters and do not need case sensitivity, so 26 codes are enough, 0 to 25. A key of 32 'z's sums to 25 * 32 = 800, which is well within 2^11. I even have scope to add case sensitivity, if needed in future.
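A minimal sketch of that hash function, assuming (as the worked example below does, where "hello" lands in section 4) that the length field actually stores length − 1:

```java
// Sketch of the 16-bit hash described above: top 5 bits encode the length
// (stored as length - 1, to match the worked example), low 11 bits the sum
// of character codes with 'a' = 0 .. 'z' = 25.
public class OcrHash {

    public static int hash(String key) {
        int sum = 0;
        for (char c : key.toLowerCase().toCharArray()) {
            sum += c - 'a';               // 0..25 per character
        }
        return ((key.length() - 1) << 11) | (sum & 0x7FF);
    }

    public static void main(String[] args) {
        // 16-bit value 00100|00000101111 (leading zeros not printed)
        System.out.println(Integer.toBinaryString(hash("hello")));
        System.out.println(hash("hello")); // (4 << 11) | 47 = 8239
        System.out.println(hash("helo"));  // (3 << 11) | 36 = 6180
    }
}
```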
Now when you compare a key containing an error with the correct one,
say "hell" with "hello"
1. Length of the keys is approx the same
2. sum of their chars will differ by the sum of the dropped/added/distorted chars.
In the hash code, as the first 5 bits are for length, the whole table has a fixed section for every possible key length. All sections are the same size: the first section stores keys of length 1, the second keys of length 2, and so on.
Now 'hello' is stored in the 5th section, as its length is 5. When we try to find 'hello' given the OCR reading 'helo':
Hashcode of 'hello' = (length - 1) (sum of chars) = (4) (7 + 4 + 11 + 11 + 14) = (4) (47)
= (00100)(00000101111)
similarly, hashcode of 'helo' = (3)(36)
= (00011)(00000100100)
We jump to its bucket, and don't find it there.
So we check for ONE distorted character. This will not change the length, but changes the sum of characters by at most −25 to +25. So we search from 25 places backward to 25 places forward, i.e., we check the sum part from (36−25) to (36+25) in the same section. We won't find it.
We check for an additional-character error. That means the correct string would contain only 3 characters, so we go to the third section. The additional character would have increased the sum of chars by at most 25, which has to be compensated for, so we search the appropriate places in the third section, (36 − 0) to (36 − 25). Again we don't find it.
Now we consider the case of a missing character. The original string would then contain 5 chars, and the sum-of-chars part of its hashcode would be larger by 0 to 25. So we search the corresponding buckets in the 5th section, (36 + 0) to (36 + 25). As 47 (the sum part of 'hello') lies in this range, we will find a match on the hashcode. And we also know that this match arises from a missing character, so we compare the keys allowing a tolerance of 1 missing character. And we get a match!
In reality, this has been implemented to allow more than one error in key.
It can also be optimized to use only 26 buckets for the first section (a single character's sum is 0 to 25), and so on.
Also, checking 25 places seems overkill, as we already know the largest and smallest char of the key. But it gets complex in case of multiple errors.
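To illustrate the probing scheme, here is a sketch of the bucket enumeration for just the missing-character case (helper names are mine; the layout follows the 5-bit/11-bit split described above):

```java
import java.util.*;

// Sketch of the one-error probe described above, for the "missing character"
// case only: given the (length, sum) of the observed key, list the candidate
// buckets in the (length + 1) section whose sums lie within +0..+25 of the
// observed sum.
public class FuzzyProbe {

    static int hash(int length, int charSum) {
        return ((length - 1) << 11) | (charSum & 0x7FF);
    }

    // Candidate buckets when the stored key has one MORE character than the query
    static List<Integer> missingCharCandidates(int queryLength, int querySum) {
        List<Integer> buckets = new ArrayList<>();
        for (int delta = 0; delta <= 25; delta++) {
            buckets.add(hash(queryLength + 1, querySum + delta));
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Query "helo": length 4, char sum 36. Stored "hello": length 5, sum 47.
        List<Integer> buckets = missingCharCandidates(4, 36);
        System.out.println(buckets.contains(hash(5, 47))); // true: 47 = 36 + 11 ('l')
    }
}
```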
You mention an 'error tolerance' for the strings. Why not build the tolerance into the hash function itself and thus obviate the need for iteration?
You could go the way of Java's LinkedHashMap class. It adds efficient iteration to a hash map by also threading a doubly-linked list through its entries.
The entries are key-value pairs that have pointers to the previous and next entries. The hashmap itself has the large array as well as the head of the linked list.
Insertion/deletion are constant time for both data structures, searches are done via the hashmap, and iteration via the linked list.
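A minimal sketch of that combination, assuming insertion order (names are mine, and this leans on Java's built-in HashMap for the bucket array):

```java
import java.util.*;

// Sketch of the LinkedHashMap idea described above: a plain hash table for
// O(1) find/insert, plus an intrusive doubly-linked list threaded through
// the entries so iteration skips empty buckets entirely.
public class LinkedMap<K, V> {

    private static final class Node<K, V> {
        final K key; V value; Node<K, V> prev, next;
        Node(K key, V value) { this.key = key; this.value = value; }
    }

    private final Map<K, Node<K, V>> table = new HashMap<>();
    private Node<K, V> head, tail;                   // insertion-order list

    public void put(K key, V value) {
        Node<K, V> node = table.get(key);
        if (node != null) { node.value = value; return; }
        node = new Node<>(key, value);
        table.put(key, node);
        if (tail == null) { head = node; }           // first entry
        else { tail.next = node; node.prev = tail; } // append to the list
        tail = node;
    }

    public V get(K key) {
        Node<K, V> node = table.get(key);
        return node == null ? null : node.value;
    }

    // Iteration walks the linked list, not the bucket array
    public List<K> keysInOrder() {
        List<K> keys = new ArrayList<>();
        for (Node<K, V> n = head; n != null; n = n.next) keys.add(n.key);
        return keys;
    }

    public static void main(String[] args) {
        LinkedMap<String, Integer> m = new LinkedMap<>();
        m.put("hello", 1); m.put("world", 2); m.put("helo", 3);
        System.out.println(m.keysInOrder()); // [hello, world, helo]
    }
}
```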
I was going through data structures in java under the topic Skip list and I came across the following:
In a skip list of n nodes, for each k and i such that 1 ≤ k ≤ ⌊lg n⌋ and 1 ≤ i ≤ ⌊n/2^(k−1)⌋ − 1, the node in position 2^(k−1) · i points to the node in position 2^(k−1) · (i + 1). This means that every second node points to the node two positions ahead, every fourth node points to the node four positions ahead, and so on, as shown in Figure 3.17a. This is accomplished by having different numbers of reference fields in the nodes on the list: half of the nodes have just one reference field, one-fourth of the nodes have two reference fields, one-eighth of the nodes have three reference fields, and so on. The number of reference fields indicates the level of each node, and the number of levels is maxLevel = ⌊lg n⌋ + 1.
And the figure is :
A skip list with (a) evenly and (b) unevenly spaced nodes of different levels;
(c) the skip list with reference nodes clearly shown.
I don't understand the mathematical part, what exactly a skip list is, or what "evenly spaced nodes" means.
Ok let me try to make you understand this.
A skip list is a data-structure which definitely makes your searches faster in a list of given elements.
A good analogy is the subway network of a big city. Imagine there are 90 stations to cover, served by different lines (Green, Yellow and Blue).
The Green line only connects the stations numbered 0, 30, 60 and 90
The Yellow line connects 0, 10, 20, 30, 40, 50, 60, 70, 80 and 90
The Blue line connects all the stations from 0 through 90.
If you want to board the train at station 0 and want to get down at 75. What is the best strategy?
Common sense would suggest boarding a train on the Green line at station 0 and getting down at station 60.
Board another train on Yellow line from station 60 and get down at station 70.
Board another train on Blue line from station 70 and get down at 75.
Any other way would have been more time consuming.
Now replace the stations with nodes and the lines with three individual lists (together, this set of lists is called a skip list).
And just imagine that you wanted to search for the node containing the value 75.
I hope this explains what Skip Lists are and how they are efficient.
In the traditional approach of searching, you could have visited each node and got to 75 in 75 hops.
With binary search you could do it in log N comparisons, though binary search needs random access and so does not work on a plain linked list.
In the skip list you can do the same in 2 + 1 + 5 = 8 hops in our particular case: two Green-line hops to 60, one Yellow-line hop to 70, and five Blue-line hops to 75. You can do the math, seems to be simple though :)
EDIT: Evenly spaced nodes & Unevenly spaced nodes
As you can see in my analogy, there is an equal number of stations between the stops on each line.
This is evenly spaced nodes. It is an ideal situation.
To understand it better we need to understand the creation of Skip Lists.
In the early stages of its construction there is only one list (the Blue line), and each new node is first added to that list at the appropriate location. When the number of nodes in the Blue line grows, there comes a need to create another list (the Yellow line) and promote some of the nodes to list 2. (PS: The first and the last elements of list 1 are always promoted to the newly added list.) Hence, the moment a new list is added it will have three nodes.
Promotion Strategy : How to find out which node to promote from the bottom most list(blue line) to the upper lists (yellow line and green line).
The best way to decide is randomly :) So let's say upon addition of a new node, we flip a coin to see if it should be promoted to the second list. If yes, we add it to the second list and flip a coin again to check whether it should also be added to the third list.
So you see, if you use this random mechanism, there might arrive situations where the nodes are unevenly spaced. :)
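The coin-flip promotion can be sketched like this (names are mine; maxLevel caps the number of lists, as with the three subway lines in the analogy):

```java
import java.util.*;

// Sketch of the random promotion ("coin flip") strategy described above:
// every new node starts on the bottom list, and each successful flip
// promotes it one list higher, capped at maxLevel.
public class Promotion {

    public static int randomLevel(Random coin, int maxLevel) {
        int level = 1;                               // every node is on the Blue line
        while (level < maxLevel && coin.nextBoolean()) {
            level++;                                 // heads: promote one list higher
        }
        return level;
    }

    public static void main(String[] args) {
        Random coin = new Random(42);
        int[] histogram = new int[4];
        for (int i = 0; i < 1000; i++) {
            histogram[randomLevel(coin, 4) - 1]++;
        }
        // Roughly half the nodes stay on level 1, a quarter reach level 2, ...
        System.out.println(Arrays.toString(histogram));
    }
}
```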
Hope this helps.
Had a question regarding generating a list of 10-digit phone numbers on a PhonePad, given a set of possible moves and a starting number.
The PhonePad:
1 2 3
4 5 6
7 8 9
* 0 #
Possible moves:
The same number of moves a Queen in chess can make (so north, south, east, west, north-east, north-west, south-east, south-west... n-spaces per each orientation)
Starting number: 5
So far I have implemented the PhonePad as a 2-dimensional char array, implemented the possible moves a Queen can make in a HashMap (using offsets of x and y), and I can make the Queen move one square using one of the possible moves.
My next step is to figure out an algorithm that would give me all 10-digit permutations (phone numbers), using the possible moves in my HashMap. Repetition of a number is allowed. * and # are not allowed in the list of phone numbers returned.
I would imagine starting out with
- 5555555555, 5555555551, 5555555552... and so on up to 0,
- 5555555515, 5555555155, 5555551555.. 5155555555.. and with the numbers 2 up to 0
- 5555555151, 5555551515, 5555515155.. 5151555555.. and with numbers 2 up to 0
... and so on for a two digit combination
Any suggestions on a systematic approach generating 10-digit combinations? Even a pseudocode algorithm is appreciated! Let me know if further clarification is required.
Thanks in advance! :)
In more detail, the simplest approach would be a recursive method, roughly like:
It accepts a prefix string initially empty, a current digit (initially '5'), and a number of digits to generate (initially 10).
If the number of digits is 1, it will simply output the prefix concatenated with the current digit.
If the number of digits is greater than 1, then it will make a list of all possible next digits and call itself recursively with (prefix + (current digit), next digit, (number of digits)-1 ) as the arguments.
Other approaches, and refinements to this one, are possible of course. The "output" action could be writing to a file, adding to a field in the current class or object, or adding to a local variable collection (List or Set) that will be returned as a result. In that last case, the (ndigits>1) logic would have to combine results from multiple recursive calls to get a single return value.
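The recursion above can be sketched roughly as follows (names are mine; it also assumes, from the question's examples like 5555555555, that a digit may repeat, i.e. the queen may stay put, and that * and # can be passed over but not landed on):

```java
import java.util.*;

// Sketch of the recursive scheme described above. The keypad is the 2-D
// char array from the question; "next digits" are all squares a queen can
// reach in one move of any length, skipping '*' and '#'.
public class PadNumbers {

    static final char[][] PAD = {
        {'1','2','3'}, {'4','5','6'}, {'7','8','9'}, {'*','0','#'}
    };
    static final int[][] DIRS = {           // the queen's eight directions
        {-1,0},{1,0},{0,-1},{0,1},{-1,-1},{-1,1},{1,-1},{1,1}
    };

    static List<Character> nextDigits(char digit) {
        int r0 = -1, c0 = -1;
        for (int r = 0; r < 4; r++)
            for (int c = 0; c < 3; c++)
                if (PAD[r][c] == digit) { r0 = r; c0 = c; }
        List<Character> next = new ArrayList<>();
        next.add(digit);                    // assumption: repetition = staying put
        for (int[] d : DIRS) {              // slide like a queen, n squares each way
            for (int r = r0 + d[0], c = c0 + d[1];
                 r >= 0 && r < 4 && c >= 0 && c < 3; r += d[0], c += d[1]) {
                if (PAD[r][c] != '*' && PAD[r][c] != '#') next.add(PAD[r][c]);
            }
        }
        return next;
    }

    static void generate(String prefix, char current, int digitsLeft, List<String> out) {
        if (digitsLeft == 1) { out.add(prefix + current); return; }
        for (char d : nextDigits(current)) {
            generate(prefix + current, d, digitsLeft - 1, out);
        }
    }

    public static void main(String[] args) {
        List<String> twoDigit = new ArrayList<>();
        generate("", '5', 2, twoDigit);     // full 10-digit runs get huge; use 2 to demo
        System.out.println(twoDigit.size()); // 10: from 5 a queen reaches every digit
    }
}
```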
I want to check the sequential order of decimal numbers and find the missing number.
For example: if I have 1.1.1, 1.1.3, 1.1.4, 2.1.1, 2.1.3, 2.1.2, 3, etc.
Here I need to find the missing number 1.1.2 and also the out-of-sequence 2.1.2. Kindly help me with the logic.
This does sound suspiciously like homework, but here are some hints for the algorithm. For simplicity, not efficiency, try a 2-step approach.
You'll have to treat each value in your initial list as an ordered list of integers. That is, the value 2.1.3 becomes an ArrayList whose elements are 2, 1, 3.
First determine what's out of sequence - this catches the 2.1.2 value. Something's out of sequence when the n-th element of the list compares greater, part by part, than the (n+1)-th element. Walk through the list of values comparing two at a time, breaking each element into a list of integers.
Second, sort the list and determine if there are gaps. Sorting still needs to treat each value as a list of integers. A gap in the sorted list is a change of more than 1 in any part between two adjacent values. Stop comparing two values when you find a gap and move on to the next two values to compare.
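The two steps above can be sketched on the question's example data like this (names are mine; the gap rule here only reports gaps in the last part under an otherwise equal prefix, which covers the 1.1.2 case):

```java
import java.util.*;

// Sketch of the two-step check described above: values are split into
// integer parts; step 1 flags items that sort before their predecessor,
// step 2 sorts and reports a gap wherever the last part jumps by more
// than 1 under an equal prefix.
public class SequenceCheck {

    static List<Integer> parts(String v) {
        List<Integer> p = new ArrayList<>();
        for (String s : v.split("\\.")) p.add(Integer.parseInt(s));
        return p;
    }

    static int compare(String a, String b) {       // part-by-part comparison
        List<Integer> pa = parts(a), pb = parts(b);
        for (int i = 0; i < Math.min(pa.size(), pb.size()); i++) {
            int c = Integer.compare(pa.get(i), pb.get(i));
            if (c != 0) return c;
        }
        return Integer.compare(pa.size(), pb.size());
    }

    // Step 1: values that sort before their predecessor are out of sequence
    static List<String> outOfSequence(List<String> values) {
        List<String> bad = new ArrayList<>();
        for (int i = 1; i < values.size(); i++) {
            if (compare(values.get(i), values.get(i - 1)) < 0) bad.add(values.get(i));
        }
        return bad;
    }

    // Step 2: after sorting, a last-part jump > 1 under an equal prefix is a gap
    static List<String> missing(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(SequenceCheck::compare);
        List<String> gaps = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++) {
            List<Integer> a = parts(sorted.get(i - 1)), b = parts(sorted.get(i));
            if (a.size() == b.size()
                    && a.subList(0, a.size() - 1).equals(b.subList(0, b.size() - 1))
                    && b.get(b.size() - 1) - a.get(a.size() - 1) > 1) {
                StringBuilder sb = new StringBuilder();
                for (int j = 0; j < a.size() - 1; j++) sb.append(a.get(j)).append('.');
                sb.append(a.get(a.size() - 1) + 1);  // first value in the gap
                gaps.add(sb.toString());
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("1.1.1", "1.1.3", "1.1.4",
                                          "2.1.1", "2.1.3", "2.1.2", "3");
        System.out.println(outOfSequence(data)); // [2.1.2]
        System.out.println(missing(data));       // [1.1.2]
    }
}
```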