Running time of insertion into 2 hashtables with iteration and printing - java

I have a program that does the following:
Iterates through a string, placing words into a HashMap<String, Integer> where the key represents the unique word, and the value represents a running total occurrences (incremented each time the word is found).
I believe up to this point we are O(n) since each of the insertions is constant time.
Then, I iterate through the hashmap and insert the values into a new HashMap<Integer, List<String>>. The String goes into the List in the value where the count matches. I think that we are still at O(n) because the operations used on HashMaps and Lists are constant time.
Then, I iterate through the HashMap and print the Strings in each List.
Does anything in this program cause me to go above O(n) complexity?

That is O(n), unless your word-parsing algorithm is not linear (but it should be).

You're correct, with a caveat. In a hash table, insertions and lookups take expected O(1) time each, so the expected runtime of your algorithm is O(n). If you have a bad hash function, there's a chance it will take longer than that, usually (for most reasonable hash table implementations) O(n2) in the worst-case.
Additionally, as #Paul Draper pointed out, this assumes that the computation of the hash code for each string takes time O(1) and that comparing the strings in the table takes time O(1). If you have strings whose lengths aren't bounded from above by some constant, it might take longer to compute the hash codes. In fact, a more accurate analysis would be that the runtime is O(n + L), where L is the total length of all the strings.
Hope this helps!

Beyond the two issues that Paul Draper and templatetypedef point out, there's another potential one. You write that your second map is a hashmap < int,list < string > >. This allows for a total linear complexity only if the implementation you choose for the list allows for (amortized) constant time appending. This is the case if you use an ArrayList and you add entries at the end, or you choose a LinkedList and add entries at either end.
I think this covers the default choices for most developers, so it's not really an obstacle.

Related

HashSet disjoint() complexity

What is the big O/time complexity for Java Collections disjoint() Method for two hash sets of integers?
Would appreciate any help, really stumped as I'm not sure if it's O(1) or O(n).
I know the hash set contains is an O(1) operation , but I'm not sure if the disjoint operation loops through all the elements of set 1 and checks if set 2 contains any of these elements.
It's O(n).
Let's assume the set query is O(1). You need to iterate through one of the sets, and make queries in the other set to see if it contains the items.
Therefore, the set iterations take O(n) time at least (you can choose to iterate the set with a smaller size. That's what they did in the source code).
In total, time complexity is O(n).

Want to delete some elements from a list that is existed in another list

I have a list suppose
listA=[679,890,907,780,5230,781]
and want to delete some elements that is existed in another
listB=[907,5230]
in minimum time complexity?
I can do this problem by using two "for loops" means O(n2) time complexity, but I want to reduce this complexity to O(nlog(n)) or O(n)?
Is it possible?
It's possible - if one of the lists is sorted. Assuming that list A is sorted and list B is unsorted, with respective dimensions M and N, the minimum time complexity to remove all of list B's elements from list A will be O((N+M)*log(M)). The way you can achieve this is by binary search - each lookup for an element in list A takes O(log(M)) time, and there are N lookups (one for each element in list B). Since it takes O(M*log(M)) time to sort A, it's more efficient for huge lists to sort and then remove all elements, with total time complexity O((N+M)*log(M)).
On the other hand, if you don't have a sorted list, just use Collection.removeAll, which has a time complexity of O(M*N) in this case. The reason for this time complexity is that removeAll does (by default) something like the following pseudocode:
public boolean removeAll(Collection<?> other)
for each elem in this list
if other contains elem
remove elem from this list
Since contains has a time complexity of O(N) for lists, and you end up doing M iterations, this takes O(M*N) time in total.
Finally, if you want to minimize the time complexity of removeAll (with possibly degraded real world performance) you can do the following:
List<Integer> a = ...
List<Integer> b = ...
HashSet<Integer> lookup = new HashSet<>(b);
a.removeAll(lookup);
For bad values of b, the time to construct lookup could take up to time O(N*log(N)), as shown here (see "pathologically distributed keys"). After that, invoking removeAll will take O(1) for contains over M iterations, taking O(M) time to execute. Therefore, the time complexity of this approach is O(M + N*log(N)).
So, there are three approaches here. One provides you with time complexity O((N+M)*log(M)), another provides you with time complexity O(M*N), and the last provides you with time complexity O(M + N*log(N)). Considering that the first and last approaches are similar in time complexity (as log tends to be very small even for large numbers), I would suggest going with the naive O(M*N) for small inputs, and the simplest O(M + N*log(N)) for medium-sized inputs. At the point where your memory usage starts to suffer from creating a HashSet to store the elements of B (very large inputs), I would finally switch to the more complex O((N+M)*log(M)) approach.
You can find an AbstractCollection.removeAll implementation here.
Edit:
The first approach doesn't work so well for ArrayLists - removing from the middle of list A takes O(M) time, apparently. Instead, sort list B (O(N*log(N))), and iterate through list A, removing items as appropriate. This takes O((M+N)*log(N)) time and is better than the O(M*N*log(M)) that you end up with when using an ArrayList. Unfortunately, the "removing items as appropriate" part of this algorithm requires that you create data to store the non-removed elements in O(M), as you don't have access to the internal data array of list A. In this case, it's strictly better to go with the HashSet approach. This is because (1) the time complexity of O((M+N)*log(N)) is actually worse than the time complexity for the HashSet method, and (2) the new algorithm doesn't save on memory. Therefore, only use the first approach when you have a List with O(1) time for removal (e.g. LinkedList) and a large amount of data. Otherwise, use removeAll. It's simpler, often faster, and supported by library designers (e.g. ArrayList has a custom removeAll implementation that allows it to take linear instead of quadratic time using negligible extra memory).
You can achieve this in following way
Sort second list( you can sort any one of the list. Here I have sorted second list). After that loop through first array and for each element of first array, do binary search in second array.
You can sort list by using Collections.sort() method.
Total complexity:-
For sorting :- O(mLogm) where m is size of second array. I have sorted only second array.
For removing :- O(nLogm)

Difference between a general Hash table and java's HashMap in Big O

The Hash table wiki entry lists its Big O as:
Search: O(n)
Insert: O(n)
Delete: O(n)
while a java HashMap is listed with Big O as:
get: O(1)
put: O(1)
remove: O(1)
Can someone plz explain why does the Big O differ between the concept and the implementation? I mean if there an implementation with a worst case of O(1) then why is there a possibility of O(n) in the concept?
The worst case is O(n) because it might be possible that every entry you put into the HashMap produces the same hash value (lets say 10). This produces a conflict for every entry because every entry is put at HashMap[10]. Depending on what collision resolution strategy was implemented, the HashMap either creates a list at the index 10 or moves the entry to the next index.
Nevertheless when the entry should be accessed again, the hash value is used to get the initial index of the HashMap. As it is 10 in every case, the HashMap has to resolve this.
Because there's a difference between worst case and average case, and even wikipedia lists the O(1) complexity for the avarage case. Java's HashMap is exactly the same as wikipedia's Hash table. So it is just a documentation issue.
Basically, hash tables compute a numerical value from the object you want to store. That numerical value is roughly used as an index to access the location to store the object into (leading to O(1) complexity). However, sometimes certain objects may lead to the same numerical value. In this case those objects will be stored in a list stored in the corresponding location in the hash map, hence the O(n) complexity for the worst case.
I'm not sure where you found the reported complexity of a java HashMap, but it is listing the average case, which matches what wikipedia states on the page you linked.

Is efficiency of java's TreeMap based on number of keys or values?

Since Java uses a red-black tree to implement the TreeMap class, is the efficiency of put() and get() lg(N), where N = number of distinct keys, or N = number of insertions/retrievals you plan to do?
For example, say I want to use a
TreeMap<Integer, ArrayList<String>>
to store the following data:
1 million <1, "bob"> pairs and 1 million <2, "jack"> pairs (the strings get inserted into the arraylist value corresponding to the key)
The final treemap will have 2 keys, with each one storing arraylist of million "bob" or "jack" strings. Is the time efficiency lg(2mil) or lg(2)? I am guessing it's lg(2) since that's how a red-black tree works, but just wanted to check.
Performance of a TreeMap with 2 pairs will behave as N=2, regardless of how many duplicate additions were previously made. There is no "memory" of the excess additions so they cannot possibly produce any overhead.
So yes, you can informally assume that time efficiency is "log 2".
Although it's fairly meaningless as big-O notation is intended to relate to asymptotic efficiency rather than be relevant for small sizes. An O(N^3) algorithm could easily be faster than a O(log N) algorithm for N=2.
For this case, a tree map is lg(n) where n=2 as you describe. There are only 2 values in the map: one arraylist, and another arraylist. No matter what is contained inside those, the map only knows of two values.
While not directly concerned with your question, you may want to consider not using a treemap for this... I mean, how do you plan to access the data stored inside your "bob" or "jack" lists? These are going to be O(n) searches unless you're going to use some kind of binary search on them or something, and the n here is 1 million. If you elaborate more on your end goal, perhaps a more encompassing solution can be achieved.

Is a Java hashmap search really O(1)?

I've seen some interesting claims on SO re Java hashmaps and their O(1) lookup time. Can someone explain why this is so? Unless these hashmaps are vastly different from any of the hashing algorithms I was bought up on, there must always exist a dataset that contains collisions.
In which case, the lookup would be O(n) rather than O(1).
Can someone explain whether they are O(1) and, if so, how they achieve this?
A particular feature of a HashMap is that unlike, say, balanced trees, its behavior is probabilistic. In these cases its usually most helpful to talk about complexity in terms of the probability of a worst-case event occurring would be. For a hash map, that of course is the case of a collision with respect to how full the map happens to be. A collision is pretty easy to estimate.
pcollision = n / capacity
So a hash map with even a modest number of elements is pretty likely to experience at least one collision. Big O notation allows us to do something more compelling. Observe that for any arbitrary, fixed constant k.
O(n) = O(k * n)
We can use this feature to improve the performance of the hash map. We could instead think about the probability of at most 2 collisions.
pcollision x 2 = (n / capacity)2
This is much lower. Since the cost of handling one extra collision is irrelevant to Big O performance, we've found a way to improve performance without actually changing the algorithm! We can generalzie this to
pcollision x k = (n / capacity)k
And now we can disregard some arbitrary number of collisions and end up with vanishingly tiny likelihood of more collisions than we are accounting for. You could get the probability to an arbitrarily tiny level by choosing the correct k, all without altering the actual implementation of the algorithm.
We talk about this by saying that the hash-map has O(1) access with high probability
You seem to mix up worst-case behaviour with average-case (expected) runtime. The former is indeed O(n) for hash tables in general (i.e. not using a perfect hashing) but this is rarely relevant in practice.
Any dependable hash table implementation, coupled with a half decent hash, has a retrieval performance of O(1) with a very small factor (2, in fact) in the expected case, within a very narrow margin of variance.
In Java, how HashMap works?
Using hashCode to locate the corresponding bucket [inside buckets container model].
Each bucket is a LinkedList (or a Balanced Red-Black Binary Tree under some conditions starting from Java 8) of items residing in that bucket.
The items are scanned one by one, using equals for comparison.
When adding more items, the HashMap is resized (doubling the size) once a certain load percentage is reached.
So, sometimes it will have to compare against a few items, but generally, it's much closer to O(1) than O(n) / O(log n).
For practical purposes, that's all you should need to know.
Remember that o(1) does not mean that each lookup only examines a single item - it means that the average number of items checked remains constant w.r.t. the number of items in the container. So if it takes on average 4 comparisons to find an item in a container with 100 items, it should also take an average of 4 comparisons to find an item in a container with 10000 items, and for any other number of items (there's always a bit of variance, especially around the points at which the hash table rehashes, and when there's a very small number of items).
So collisions don't prevent the container from having o(1) operations, as long as the average number of keys per bucket remains within a fixed bound.
I know this is an old question, but there's actually a new answer to it.
You're right that a hash map isn't really O(1), strictly speaking, because as the number of elements gets arbitrarily large, eventually you will not be able to search in constant time (and O-notation is defined in terms of numbers that can get arbitrarily large).
But it doesn't follow that the real time complexity is O(n)--because there's no rule that says that the buckets have to be implemented as a linear list.
In fact, Java 8 implements the buckets as TreeMaps once they exceed a threshold, which makes the actual time O(log n).
O(1+n/k) where k is the number of buckets.
If implementation sets k = n/alpha then it is O(1+alpha) = O(1) since alpha is a constant.
If the number of buckets (call it b) is held constant (the usual case), then lookup is actually O(n).
As n gets large, the number of elements in each bucket averages n/b. If collision resolution is done in one of the usual ways (linked list for example), then lookup is O(n/b) = O(n).
The O notation is about what happens when n gets larger and larger. It can be misleading when applied to certain algorithms, and hash tables are a case in point. We choose the number of buckets based on how many elements we're expecting to deal with. When n is about the same size as b, then lookup is roughly constant-time, but we can't call it O(1) because O is defined in terms of a limit as n → ∞.
Elements inside the HashMap are stored as an array of linked list (node), each linked list in the array represents a bucket for unique hash value of one or more keys.
While adding an entry in the HashMap, the hashcode of the key is used to determine the location of the bucket in the array, something like:
location = (arraylength - 1) & keyhashcode
Here the & represents bitwise AND operator.
For example: 100 & "ABC".hashCode() = 64 (location of the bucket for the key "ABC")
During the get operation it uses same way to determine the location of bucket for the key. Under the best case each key has unique hashcode and results in a unique bucket for each key, in this case the get method spends time only to determine the bucket location and retrieving the value which is constant O(1).
Under the worst case, all the keys have same hashcode and stored in same bucket, this results in traversing through the entire list which leads to O(n).
In the case of java 8, the Linked List bucket is replaced with a TreeMap if the size grows to more than 8, this reduces the worst case search efficiency to O(log n).
We've established that the standard description of hash table lookups being O(1) refers to the average-case expected time, not the strict worst-case performance. For a hash table resolving collisions with chaining (like Java's hashmap) this is technically O(1+α) with a good hash function, where α is the table's load factor. Still constant as long as the number of objects you're storing is no more than a constant factor larger than the table size.
It's also been explained that strictly speaking it's possible to construct input that requires O(n) lookups for any deterministic hash function. But it's also interesting to consider the worst-case expected time, which is different than average search time. Using chaining this is O(1 + the length of the longest chain), for example Θ(log n / log log n) when α=1.
If you're interested in theoretical ways to achieve constant time expected worst-case lookups, you can read about dynamic perfect hashing which resolves collisions recursively with another hash table!
It is O(1) only if your hashing function is very good. The Java hash table implementation does not protect against bad hash functions.
Whether you need to grow the table when you add items or not is not relevant to the question because it is about lookup time.
This basically goes for most hash table implementations in most programming languages, as the algorithm itself doesn't really change.
If there are no collisions present in the table, you only have to do a single look-up, therefore the running time is O(1). If there are collisions present, you have to do more than one look-up, which drives down the performance towards O(n).
It depends on the algorithm you choose to avoid collisions. If your implementation uses separate chaining then the worst case scenario happens where every data element is hashed to the same value (poor choice of the hash function for example). In that case, data lookup is no different from a linear search on a linked list i.e. O(n). However, the probability of that happening is negligible and lookups best and average cases remain constant i.e. O(1).
Only in theoretical case, when hashcodes are always different and bucket for every hash code is also different, the O(1) will exist. Otherwise, it is of constant order i.e. on increment of hashmap, its order of search remains constant.
Academics aside, from a practical perspective, HashMaps should be accepted as having an inconsequential performance impact (unless your profiler tells you otherwise.)
Of course the performance of the hashmap will depend based on the quality of the hashCode() function for the given object. However, if the function is implemented such that the possibility of collisions is very low, it will have a very good performance (this is not strictly O(1) in every possible case but it is in most cases).
For example the default implementation in the Oracle JRE is to use a random number (which is stored in the object instance so that it doesn't change - but it also disables biased locking, but that's an other discussion) so the chance of collisions is very low.

Categories