How to improve the complexity of HashMap iteration? - java

I implemented a custom HashMap class (in C++, but shouldn't matter). The implementation is simple -
A large array holds pointers to Items.
Each item contains the key - value pair, and a pointer to an Item (to form a linked list in case of key collision).
I also implemented an iterator for it.
My implementation of incrementing/decrementing the iterator is not very efficient. From the present position, the iterator scans the array of hashes for the next non-null entry. This is very inefficient when the map is sparsely populated (which it will be in my use case).
Can anyone suggest a faster implementation, without affecting the complexity of other operations like insert and find? My primary use case is find, secondary is insert. Iteration is not even needed, I just want to know this for the sake of learning.
PS: Why did I implement a custom class? Because I need to find strings with some error tolerance, while the ready-made hash maps I have seen provide only exact matching.
EDIT: To clarify, I am talking about incrementing/decrementing an already obtained iterator. Yes, this is mostly done in order to traverse the whole map.
The errors in the strings (keys) in my case come from OCR errors, so I cannot use the error-handling techniques used to detect typing errors. The chance of the first character being wrong is almost the same as that of the last one.
Also, my keys are always strings, one word to be exact. The number of entries will be less than 5000, so a hash table size of 2^16 is enough for me. It will still be sparsely populated, but that's OK.
My hash function:
hash code size is 16 bits.
First 5 bits for the word length. ==> Max possible key length = 32. Reasonable, given that key is a single word.
Last 11 bits for sum of the char codes. I only store the English alphabet characters, and do not need case sensitivity. So 26 codes are enough, 0 to 25. So a key with 32 'z' = 25 * 32 = 800. Which is well within 2^11. I even have scope to add case sensitivity, if needed in future.
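As a minimal sketch (assuming lowercase ASCII input; the method name hash16 is mine), the hash described above could look like this in Java:

    // Top 5 bits hold (length - 1), low 11 bits hold the sum of the
    // character codes (a=0 .. z=25).
    static int hash16(String key) {
        int sum = 0;
        for (char c : key.toCharArray()) {
            sum += Character.toLowerCase(c) - 'a';  // 0..25 per character
        }
        return ((key.length() - 1) << 11) | (sum & 0x7FF);
    }

For 'hello' this gives (4 << 11) | 47, matching the worked example below.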
Now when you compare a key containing an error with the correct one,
say "hell" with "hello"
1. Length of the keys is approx the same
2. sum of their chars will differ by the sum of the dropped/added/distorted chars.
In the hash code, since the first 5 bits encode the length, the whole table has fixed sections for every possible key length, all of the same size. The first section stores keys of length 1, the second keys of length 2, and so on.
Now 'hello' is stored in the 5th section, as its length is 5. Suppose the OCR reads it as 'helo' and we try to look that up:
Hashcode of 'hello' = (length - 1) (sum of chars) = (4) (7 + 4 + 11 + 11 + 14) = (4) (47)
= (00100)(00000101111)
similarly, hashcode of 'helo' = (3)(36)
= (00011)(00000100100)
We jump to its bucket, and don't find it there.
So we check for ONE distorted character. This does not change the length, but changes the sum of characters by at most -25 to +25. So we search from 25 places backward to 25 places forward, i.e., we check the sum part from (36 - 25) to (36 + 25) in the same section. We won't find it.
Next we check for an additional-character error. That would mean the correct string contains only 3 characters, so we go to the third section. The extra character would have increased the sum of chars by at most 25, and this has to be compensated for, so we search the appropriate places in the third section, from (36 - 0) down to (36 - 25). Again we don't find it.
Now we consider the case of a missing character. The original string would then contain 5 chars, and the sum of chars in the original string would be larger by 0 to 25. So we search the corresponding buckets in the 5th section, from (36 + 0) to (36 + 25). Since 47 (the sum part of 'hello') lies in this range, we find a match on the hash code. And we also know that this match is due to a missing character, so we compare the keys allowing a tolerance of 1 missing character. And we get a match!
In reality, this has been implemented to allow more than one error in key.
It can also be optimized so the first section uses only the slots for sums 0 to 25 (since it holds one-character keys), and so on.
Also, checking 25 places on each side seems like overkill, since we already know the largest and smallest char codes in the key. But it gets complex in the case of multiple errors.
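To make the probing concrete, here is a hedged Java sketch of the one-error search described above. The probe(len, sum, query) helper is hypothetical; it stands for "jump to the bucket for this length section and sum, and compare keys with the appropriate tolerance".

    Item findWithOneError(String query, int sum) {
        int len = query.length();
        // Case 1: one distorted character -- same length, sum off by -25..+25.
        for (int s = sum - 25; s <= sum + 25; s++) {
            Item hit = probe(len, s, query);
            if (hit != null) return hit;
        }
        // Case 2: one extra character in the query -- the stored key is one
        // shorter and its sum is smaller by 0..25.
        for (int s = sum - 25; s <= sum; s++) {
            Item hit = probe(len - 1, s, query);
            if (hit != null) return hit;
        }
        // Case 3: one missing character in the query -- the stored key is one
        // longer and its sum is larger by 0..25.
        for (int s = sum; s <= sum + 25; s++) {
            Item hit = probe(len + 1, s, query);
            if (hit != null) return hit;
        }
        return null;  // no match within one error
    }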

You mention an 'error tolerance' for the strings. Why not build the tolerance into the hash function itself, and thus obviate the need for iteration?

You could go the way of Java's LinkedHashMap class. It adds efficient iteration to a hash map by also making it a doubly-linked list.
The entries are key-value pairs that have pointers to the previous and next entries. The hashmap itself has the large array as well as the head of the linked list.
Insertion/deletion are constant time for both data structures, searches are done via the hashmap, and iteration via the linked list.
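A minimal sketch of such an entry (field names are mine, loosely following the real class):

    // Each entry sits in two structures at once: its bucket's collision
    // chain, and a doubly-linked list threaded through every entry.
    class Entry<K, V> {
        K key;
        V value;
        Entry<K, V> nextInBucket;   // collision chain within one bucket
        Entry<K, V> before, after;  // doubly-linked list for iteration
    }

On insert, the new entry is linked at the tail of the list in O(1); the iterator just follows the after pointers, so each increment is O(1) no matter how sparse the table is.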

Related

Java default string hash function produces collisions on single character strings

I am a CS student, so please bear with me if what I say sounds too ridiculous. It definitely does so to me, that is why I am here in search of an answer.
I read how strings are hashed in Java, and then I took a look at the ASCII table. The letters "d" and "n" hash to 100 and 110 respectively. Now if I were to create a brand new hashmap in Java, by default it has 10 buckets. So even though the hashcodes are unique, mod 10 they are both 0. This leads to a collision.
Having collisions on 1-character strings just doesn't sit well with me, so is the process I described correct?
Thanks in advance.
What you described is probably correct; both would fall into the same bucket due to the pigeonhole principle, which basically means that if you have more items than holes to put them in, two or more will end up in the same hole. In this case, considering only the 95 printable ASCII characters and 10 buckets, the principle guarantees that at least one bucket ends up with at least 10 of them (considering only their number, not the actual values).
However, shazin's answer is also correct in that the hash values are not actually used as the identity of the values in a map; instead they are used to find the bucket in which the key/value pair belongs, and then the values in the bucket are checked for equality with their equals() method (or with ==, if using IdentityHashMap).
The hash is used in hash-based collections as an index or grouping mechanism, not as an actual reference. The hash is used first, to find the bucket that may contain the element.
As you said, d and n can end up in the same bucket, but after that the actual values, in this case d and n, are used to identify the actual objects. Since HashMaps don't allow duplicate keys, you can be sure that there will always be at most one d and one n.
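A small illustration (using plain % 10 to mirror the question's model; the real HashMap spreads the hash bits and indexes with (n - 1) & hash, but the principle is the same):

    Map<String, Integer> map = new HashMap<>();
    map.put("d", 1);
    map.put("n", 2);
    System.out.println("d".hashCode() % 10);  // 0
    System.out.println("n".hashCode() % 10);  // 0 -- same bucket under mod 10
    System.out.println(map.get("d"));         // 1 -- equals() still separates them
    System.out.println(map.get("n"));         // 2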

Sorting string so that there aren't two same characters on adjacent places [duplicate]

It's a bonus school task for which we haven't received any teaching yet, and I'm not looking for complete code, but some tips to get going would be pretty cool. I'm going to post what I've done so far in Java when I get home, but here's something I've done already.
So, we have to write a sorting algorithm which, for example, sorts "AAABBB" to ABABAB. The max input size is 10^6, and it all has to happen in under 1 second. If there's more than one answer, the first one in alphabetical order is the right one. I started by testing different algorithms that sort without the alphabetical-order requirement in mind, just to see how things work out.
First version:
Save the counts in an integer array, where the index is the ASCII code and the value is the number of times that character occurs in the char array.
Then I picked the 2 most frequent characters and alternated them into the new character array until some other count became higher, and then swapped to that one. It worked well, but of course the order wasn't right.
Second version:
Followed the same idea, but stopped picking the most frequent character and just picked the indexes in the order they appeared in my array. This works well until the input is something like CBAYYY: the algorithm sorts it to ABCYYY instead of AYBYCY. Of course I could try to find free spots for those Y's, but at that point it starts to take too long.
An interesting problem, with an interesting tweak. Yes, this is a permutation or rearranging rather than a sort. No, the quoted question is not a duplicate.
Algorithm.
Count the character frequencies.
Output alternating characters from the two lowest in alphabetical order.
As each is exhausted, move to the next.
At some point the highest frequency char will be exactly half the remaining chars. At that point switch to outputting all of that char alternating in turn with the other remaining chars in alphabetical order.
Some care required to avoid off-by-one errors (odd vs even number of input characters). Otherwise, just writing the code and getting it to work right is the challenge.
Note that there is one special case, where the number of characters is odd and the frequency of one character starts at (half plus 1). In this case you need to start with step 4 in the algorithm, outputting all one character alternating with each of the others in turn.
Note also that if one character comprises more than half the input then, apart from this special case, no solution is possible. This situation may be detected in advance by inspecting the frequencies, or during execution when the tail consists of all one character. Detecting this case was not part of the spec.
Since no sort is required the complexity is O(n). Each character is examined twice: once when it is counted and once when it is added to the output. Everything else is amortised.
My idea is the following. With the right implementation it can be almost linear.
First establish a function that checks whether a solution is even possible. It should be very fast: something like most frequent letter > 1/2 of all letters, taking into consideration whether it can go first.
Then, while there are still letters remaining, take the alphabetically first letter that is not the same as the previous one and still makes a solution possible.
The correct algorithm would be the following:
Build a histogram of the characters in the input string.
Put the CharacterOccurrences in a PriorityQueue / TreeSet where they're ordered on highest occurrence, lowest alphabetical order
Have an auxiliary variable of type CharacterOccurrence
Loop while the PQ is not empty
Take the head of the PQ and keep it
Add the character of the head to the output
If the auxiliary variable is set => Re-add it to the PQ
Store the kept head in the auxiliary variable with 1 occurrence less unless the occurrence ends up being 0 (then unset it)
If the size of the output == the size of the input, it was possible and you have your answer. Otherwise it was impossible.
Complexity is O(N * log(N))
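A hedged Java sketch of this greedy (assuming lowercase input; the method name is mine). Note that it guarantees a valid arrangement whenever one exists, though not necessarily the alphabetically first one the question asks for:

    import java.util.PriorityQueue;

    static String rearrange(String s) {
        int[] freq = new int[26];
        for (char c : s.toCharArray()) freq[c - 'a']++;
        // order on highest occurrence, then lowest letter
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            (a, b) -> a[1] != b[1] ? b[1] - a[1] : a[0] - b[0]);
        for (int i = 0; i < 26; i++) {
            if (freq[i] > 0) pq.add(new int[]{i, freq[i]});
        }
        StringBuilder out = new StringBuilder();
        int[] held = null;                    // the auxiliary variable
        while (!pq.isEmpty()) {
            int[] head = pq.poll();
            out.append((char) ('a' + head[0]));
            if (held != null) pq.add(held);   // re-add the previous head
            head[1]--;
            held = head[1] > 0 ? head : null; // keep it out for one round
        }
        return out.length() == s.length() ? out.toString() : null;  // null: impossible
    }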
Make a bidirectional table of character frequencies: character->count and count->character. Record an optional<Character> which stores the last character output (or nothing if there is none). Store the total number of characters.
If (total number of characters - 1) < 2 * (count of the highest-count character), use the highest-count character (otherwise no solution would be possible). Fail if it is the same as the last character output.
Otherwise, use the earliest alphabetically that isn't the last character output.
Record the last character output, decrease both the total and used character count.
Loop while we still have characters.
While this question is not quite a duplicate, the part of my answer giving the algorithm for enumerating all permutations with as few adjacent equal letters as possible can readily be adapted to return only the minimum, as its proof of optimality requires that every recursive call yield at least one permutation. The only changes outside of the test code are to try keys in sorted order and to break after the first hit is found. The running time of the code below is polynomial (O(n) if I bothered with better data structures), since unlike its ancestor it does not enumerate all possibilities.
david.pfx's answer hints at the logic: greedily take the least letter that doesn't eliminate all possibilities, but, as he notes, the details are subtle.
from collections import Counter
from itertools import permutations
from operator import itemgetter
from random import randrange

def get_mode(count):
    # most frequent remaining letter
    return max(count.items(), key=itemgetter(1))[0]

def enum2(prefix, x, count, total, mode):
    # extend the prefix with x, recurse, then undo
    prefix.append(x)
    count_x = count[x]
    if count_x == 1:
        del count[x]
    else:
        count[x] = count_x - 1
    yield from enum1(prefix, count, total - 1, mode)
    count[x] = count_x
    del prefix[-1]

def enum1(prefix, count, total, mode):
    if total == 0:
        yield tuple(prefix)
        return
    if count[mode] * 2 - 1 >= total and [mode] != prefix[-1:]:
        # the mode is forced: anything else leaves too many copies of it
        yield from enum2(prefix, mode, count, total, mode)
    else:
        defect_okay = not prefix or count[prefix[-1]] * 2 > total
        mode = get_mode(count)
        for x in sorted(count.keys()):
            if defect_okay or [x] != prefix[-1:]:
                # try keys in sorted order; stop after the first success
                yield from enum2(prefix, x, count, total, mode)
                break

def enum(seq):
    count = Counter(seq)
    if count:
        yield from enum1([], count, sum(count.values()), get_mode(count))
    else:
        yield ()

def defects(lst):
    # number of adjacent equal pairs
    return sum(lst[i - 1] == lst[i] for i in range(1, len(lst)))

def test(lst):
    # brute-force reference: minimum-defect, lexicographically least permutation
    perms = set(permutations(lst))
    opt = min(map(defects, perms))
    slow = min(perm for perm in perms if defects(perm) == opt)
    fast = list(enum(lst))
    assert len(fast) == 1
    fast = min(fast)
    print(lst, fast, slow)
    assert slow == fast

for r in range(10000):
    test([randrange(3) for i in range(randrange(6))])
You start by counting the number of each letter you have in your array:
For example you have 3 - A, 2 - B, 1 - C, 4 - Y, 1 - Z.
1) Then each time you put the lowest one you can (here it is A).
so you start by :
A
then you cannot put A any more, so you put B:
AB
then:
ABABACYZ
This works as long as you still have at least 2 kinds of characters. But here you will still have 3 Y's left.
2) To place the last characters, you go from your first Y and insert one at every second position, working toward the beginning of the string.
So ABAYBYAYCYZ.
3) Then you take the subsequence between your Y's, which is YBYAYCY, and sort the letters between the Y's:
BAC => ABC
And you arrive at
ABAYAYBYCYZ
which should be the solution of your problem.
To do all this, I think a LinkedList is the best structure.
I hope it helps :)

Selecting top 10 most frequently occurring strings from an array, java

I have an array of strings from which I want to find the top 10 most frequently occurring strings.
One primitive way of doing this is of course to loop through the array once, collect all the distinct strings, store these distinct strings in an array, then count how many times each string in this new array occurs in the original array, and finally store the counts in n distinct integers, where n is the number of distinct strings.
Obviously this is a horrible method when it comes to time efficiency, so I was wondering if there is a better way of doing this.
If you don't care about memory, you can build a hash map holding the count of each string: you loop through all your strings and for each one you do
myhash[mystring] += 1
if the string is already present in the hash, or
myhash[mystring] = 1
otherwise.
If you consider that looking up a value in a hash map is done in constant time (which may not always be true), then this algorithm is "only" O(n) (but it takes up a lot of memory).
If you care about memory, you can sort the array and then easily count how many times each string appears (all occurrences of a string will be contiguous, at positions i, i+1, i+2, ..., i+k, and nowhere else).
Sorting will take O(n log n), then O(n) for counting the occurrences of strings.
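Either way, here is a hedged Java sketch of the hash-map count plus a size-10 min-heap to pick the top 10, so selection costs O(n log 10) after the O(n) counting pass (the method name is mine):

    import java.util.*;

    static List<String> top10(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);        // O(n) counting pass
        }
        // min-heap on count, capped at 10 entries
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.add(e);
            if (heap.size() > 10) heap.poll();       // evict the least frequent
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result);                 // most frequent first
        return result;
    }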
You could use a Guava Multiset, adding all the strings, then call Multisets.copyHighestCountFirst() and look only at the first 10 elements.
See this question for an example
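For instance, a hedged sketch (assuming Guava is on the classpath; copyHighestCountFirst() returns a copy whose element set iterates in descending count order):

    Multiset<String> counts = HashMultiset.create(Arrays.asList(words));
    Iterator<String> byCount =
        Multisets.copyHighestCountFirst(counts).elementSet().iterator();
    for (int i = 0; i < 10 && byCount.hasNext(); i++) {
        String s = byCount.next();
        System.out.println(s + " x " + counts.count(s));
    }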

Searching an array for sum of values

I have a system that generates values in a text file which contains values as below
Line 1 : Total value possible
Line 2 : No of elements in the array
Line 3(extra lines if required) : The numbers themselves
I am now thinking of an approach where I subtract the first integer in the array from the total value and then search the array for the remainder, doing the same for each element until the pair is found.
The other approach is to add pairs of integers from the array, on a permutation-and-combination basis, until the pair is found.
As per my analysis the first solution is better, since it cuts down on the number of iterations. Is my analysis correct here, and is there any better approach?
Edit :
I'll give a sample here to make it more clear
Line 1 : 200
Line 2 : 10
Line 3 : 10 20 80 78 19 25 198 120 12 65
Now the valid pair here is (80, 120), since it sums to 200 (given in line one as the total value possible in the input file), and their positions in the array are 3 and 8. So to find this pair, my approach is to take the first element, subtract it from the total value possible, and search for the other element using basic search algorithms.
Using the example here, I first take 10 and subtract it from 200, which gives 190; then I search for 190. If it is found, the pair is found; otherwise I continue the same process.
Your problem is a bit vague, but if you are looking for a pair in the array that sums to a certain number, it can be done in O(n) on average using a hash table.
Iterate the array, and for each element:
(1) Check if it is in the table. If it is, stop and report that such a pair exists.
(2) Else: insert num - element into the hash table.
If your iteration terminated without finding a match - there is no such pair.
Pseudocode:
checkIfPairExists(arr, num):
    set <- new empty hash set
    for each element in arr:
        if set.contains(element):
            return true
        else:
            set.add(num - element)
    return false
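The same idea in Java, as a hedged sketch (the method name matches the pseudocode):

    import java.util.HashSet;
    import java.util.Set;

    static boolean checkIfPairExists(int[] arr, int num) {
        Set<Integer> complements = new HashSet<>();
        for (int element : arr) {
            if (complements.contains(element)) {
                return true;                 // element completes an earlier one
            }
            complements.add(num - element);  // remember what would complete it
        }
        return false;
    }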
The general problem of "is there a subset that sums to a certain number" is NP-Hard, and is known as the subset-sum problem, so there is no known polynomial solution to it.
If you're trying to find a pair (2) numbers which sum to a third number, in general you'll have something like:
for (i = 0; i < N; i++)
    for (j = i+1; j < N; j++)
        if (numbers[i] + numbers[j] == result)
            The answer is <i,j>
end
which is O(n^2). However, it is possible to do better.
If the list of numbers is sorted (which takes O(n log n) time) then you can try:
for (i = 0; i < N; i++)
    binary_search 'numbers[i+1:N]' for result - numbers[i]
    if search succeeds:
        The answer is <i, search_result_index>
end
That is you can step through each number and then do a binary search on the remaining list for its companion number. This takes O(n log n) time. You may need to implement the search function above yourself as built-in functions may just walk down the list in O(n) time leading to an O(n^2) result.
For both methods, you'll want to check for the special case that the current number is equal to your result.
Both algorithms use no more space than is taken by the array itself.
Apologies for the coding style, I'm not terribly familiar with Java and it's the ideas here which are important.
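For reference, a hedged Java version of the binary-search variant (using java.util.Arrays; the method name is mine):

    import java.util.Arrays;

    static int[] findPair(int[] numbers, int result) {
        Arrays.sort(numbers);                            // O(n log n)
        for (int i = 0; i < numbers.length; i++) {
            // binary-search the tail numbers[i+1..N-1] for the companion value
            int j = Arrays.binarySearch(numbers, i + 1, numbers.length,
                                        result - numbers[i]);
            if (j >= 0) return new int[]{i, j};          // pair of indices found
        }
        return null;                                     // no such pair
    }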

How can a Hash Set incur collision?

If a hash set contains only one instance of any distinct element(s), how might a collision occur in this case?
And how could the load factor be an issue, since there is only one of any given element?
While this is homework, it is not for me. I am tutoring someone, and I need to know how to explain it to them.
Let's assume you have a HashSet of Integers, and your hash function is mod 4. The integers 0, 4, 8, 12, 16, etc. will all collide if you try to insert them. (mod 4 is a terrible hash function, but it illustrates the concept.)
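Concretely (a tiny snippet just to illustrate the arithmetic):

    int buckets = 4;
    for (int x : new int[]{0, 4, 8, 12, 16}) {
        System.out.println(x + " -> bucket " + (x % buckets));  // all bucket 0
    }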
Assuming a proper hash function, the load factor is correlated with the chance of a collision; note that I say correlated and not equal, because it also depends on the strategy you use to handle collisions. In general, a higher load factor increases the probability of collisions. Assuming you have 4 slots and use mod 4 as the hash function: when the load factor is 0 (an empty table), you can't have a collision; once you have one element, the probability that the next insertion collides is .25, which obviously degrades performance, since you have to resolve the collision.
Now, assuming you use linear probing (i.e., on a collision, use the next available entry), once you reach 3 entries in the table you have a .75 probability of a collision on the next insertion. If a collision happens, in the best case you go to the next entry, but in the worst case you go through all 3 entries; so a collision means that instead of a direct access you need, on average, a linear search over about 2 items.
Of course, there are better strategies for handling collisions, and generally, in non-pathological cases, a load factor of .7 is acceptable, but beyond that collisions shoot up and performance degrades.
The general idea behind a "hash table" (which a "hash set" is a variety of) is that you have a number of objects containing "key" values (eg, character strings) that you want to put into some sort of container and then be able to find individual objects by their "key" values easily, without having to examine every item in the container.
One could, eg, put the values into a sorted array and then do a binary search to find a value, but maintaining a sorted array is expensive if there are lots of updates.
So the key values are "hashed". One might, for instance, add together all of the ASCII values of the characters to create a single number which is the "hash" of the character string. (There are better hash computation algorithms, but the precise algorithm doesn't matter, and this is an easy one to explain.)
When you do this you'll get a number that, for a ten-character string, will be in the range from maybe 600 to 1280. Now, if you divide that by, say, 500 and take the remainder, you'll have a value between 0 and 499. (Note that the string doesn't have to be ten characters -- longer strings will add to larger values, but when you divide and take the remainder you still end up with a number between 0 and 499.)
Now create an array of 500 entries, and each time you get a new object, calculate its hash as described above and use that value to index into the array. Place the new object into the array entry that corresponds to that index.
But (especially with the naive hash algorithm above) you could have two different strings with the same hash. Eg, "ABC" and "CBA" would have the same hash, and would end up going into the same slot in the array.
To handle this "collision" there are several strategies, but the most common is to create a linked list off the array entry and put the various "hash synonyms" into that list.
You'd generally try to have the array large enough (and have a better hash calculation algorithm) to minimize such collisions, but, using the hash scheme, there's no way to absolutely prevent collisions.
Note that the multiple entries in a synonym list are not identical -- they have different key values -- but they have the same hash value.
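A hedged sketch of the naive sum-of-codes hash from the explanation above (the method name is mine):

    static int bucketIndex(String key, int tableSize) {
        int sum = 0;
        for (char c : key.toCharArray()) {
            sum += c;              // naive hash: add up the character codes
        }
        return sum % tableSize;    // index in 0 .. tableSize-1
    }

    // "ABC" and "CBA" contain the same characters, so they sum to the same
    // value (198) and land in the same bucket:
    // bucketIndex("ABC", 500) == bucketIndex("CBA", 500)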
