MultiMap Data Structure for constant time ranking - java

Let me just say this is a HW project from UC Berkeley, so if you could give me a hint instead of the solution, it'd be great.
The problem is to create a data structure called "YearlyRecord" that maps a word to the number of times it is encountered in an ngram file. Here are the restrictions:
getting the count must be O(1)
getting the number of mappings must be O(1)
"put" must be on average O(log n)
getting the "rank" must be O(1)
counts, which provides a data structure of all counts in increasing order
words, which provides a data structure of all words, in the order of counts
Note: "rank" is a function that takes a word and tells me its rank (e.g. 1, 2, 3, 4), which is based on the number of occurrences we have of it.
My solutions so far:
A HashMap (word -> count) provides an O(1) getCount operation and
an O(1) size operation
Make a private inner class that implements Comparator, comparing based on the number of occurrences in the HashMap. Then use
this comparator as input to a TreeSet of the words. This provides the
"words" collection, which we can return in O(1) time.
Make another TreeSet of the counts. This provides an O(1) return of the counts in increasing order.
rank: Here is where I am stuck. It is clear that rank should be a map as well. I have the words in the TreeSet in increasing order of rank, but no indices to map to.
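To make the pieces concrete, here is a minimal sketch of the setup described so far (only the parts already worked out, not the rank part; the class layout and names are mine, not the assignment's):

import java.util.*;

public class YearlyRecord {
    // word -> count: gives O(1) getCount and O(1) size
    private final HashMap<String, Integer> counts = new HashMap<>();

    // Words ordered by count, ties broken by the word itself so distinct words are kept.
    private final TreeSet<String> wordsByCount = new TreeSet<>((a, b) -> {
        int c = Integer.compare(counts.get(a), counts.get(b));
        return c != 0 ? c : a.compareTo(b);
    });

    public int count(String word) { return counts.getOrDefault(word, 0); }

    public int size() { return counts.size(); }

    public void put(String word, int count) {
        if (counts.containsKey(word)) {
            wordsByCount.remove(word);   // remove under the old count before it changes
        }
        counts.put(word, count);
        wordsByCount.add(word);          // O(log n) insert under the new count
    }
}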

algorithm - sort most repeated number first (descending order)

I have a list of numbers, for example:
10 4
5 3
7 1
-2 2
The first line means the number 10 repeats 4 times, the second line means the number 5 repeats thrice, and so on. The objective is to sort these numbers, the most repeated first, in descending order. I think using a HashMap to record the data and then feeding it to a TreeSet sorted by value would be the most efficient way -> O(n log n), but is there a more efficient way? I've heard this problem is solved with a max-heap, but I don't think a heap can do better than O(n log n).
I think with bucket-style sorting, O(N) complexity is possible. At least in theory. But it comes with additional cost in terms of memory for the buckets. That may make the approach intractable in practice.
The buckets are HashSets. Each bucket holds all numbers with the same count. For fast access, we keep the buckets in an ArrayList, each bucket at the index position of its count.
Like the OP, we use a HashMap to associate numbers with counters. When a number arrives, we increment the counter and move the number from the bucket of the old count to the bucket of the new count. That keeps the numbers sorted at all times.
Each arriving number takes O(1) to process, so all take O(N).
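A minimal sketch of that bucket scheme, under the assumptions above (class and method names are mine, not from the answer):

import java.util.*;

class CountBuckets {
    private final Map<Integer, Integer> counts = new HashMap<>();   // number -> count
    private final List<Set<Integer>> buckets = new ArrayList<>();   // buckets.get(c) holds numbers seen c times

    void add(int number) {
        int oldCount = counts.getOrDefault(number, 0);
        int newCount = oldCount + 1;
        counts.put(number, newCount);
        if (oldCount > 0) {
            buckets.get(oldCount).remove(number);   // move out of the old bucket...
        }
        while (buckets.size() <= newCount) {
            buckets.add(new HashSet<>());           // grow the bucket list on demand
        }
        buckets.get(newCount).add(number);          // ...and into the new one
    }

    // Most repeated numbers first, as the question asks.
    List<Integer> mostRepeatedFirst() {
        List<Integer> result = new ArrayList<>();
        for (int c = buckets.size() - 1; c >= 1; c--) {
            result.addAll(buckets.get(c));
        }
        return result;
    }
}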
You can get a sorted list of Map.Entry<Integer,Integer> like this:
List<Map.Entry<Integer,Integer>> entries = map.entrySet()
        .stream()
        .sorted((a, b) -> a.getValue().compareTo(b.getValue()))
        .collect(Collectors.toList());
It should run in O(n log n); reverse the comparator arguments if you want descending order.
(I replaced my original answer.)
GeeksforGeeks has example code for sorting a HashMap by value.
Since a HashMap can't be sorted, and the order of elements in a LinkedHashMap can't be changed unless they are removed and re-entered in a different order, the code copies the contents into a List, then sorts the list. The author(s) want to retain the data in a HashMap, so the sorted list is then copied to a new LinkedHashMap.
For the OP's purposes, <Integer, Integer> would be used where the example has <String, Integer>. Also, since the desired order is descending, o2 and o1 would be reversed in the return statement of the comparator:
return (o2.getValue()).compareTo(o1.getValue());
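A sketch of that approach adapted to <Integer, Integer> and descending order (class, method, and variable names are mine, not from the GeeksforGeeks article):

import java.util.*;

class SortByValue {
    static LinkedHashMap<Integer, Integer> sortByValueDescending(Map<Integer, Integer> counts) {
        List<Map.Entry<Integer, Integer>> list = new ArrayList<>(counts.entrySet());
        // Descending order: compare o2 against o1, as noted above.
        list.sort((o1, o2) -> o2.getValue().compareTo(o1.getValue()));
        LinkedHashMap<Integer, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : list) {
            sorted.put(e.getKey(), e.getValue());   // LinkedHashMap preserves this insertion order
        }
        return sorted;
    }
}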
There are mathematical proofs that no comparison sort can run in less than O(n log n).
Assuming the number is unique:
Why sort? Just loop through every key, keep track of the max occurrence, and you'll get it in O(n). In Java, with counts as the number -> occurrences map (the name is mine):
int maxNum = 0;
int maxCount = -1;
for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
    if (e.getValue() > maxCount) {
        maxCount = e.getValue();
        maxNum = e.getKey();
    }
}
return maxNum;

Way to make this HashMap more efficient

I have a class User that has 3 fields (I'm not sure of the terminology):
an (int) ID code
an (int) date that the user was created
and a (string) name
I am trying to create methods that:
Add a user to my data structure (1)
return the name of a user based on their ID number (2)
return a full list of all users sorted by date (3)
return a list of users whose name contains a certain string, sorted by date (4)
return a list of users who joined before a certain date (5)
I have made 10 arrays based on the year joined (2004-2014) and then sorted the elements within each array by date (by month, then day).
Am I correct in thinking that this means methods (3) and (5) have O(1) time complexity but that (1), (4) and (2) have O(N)?
Also, is there another data structure/method that I can use to have O(1) for all my methods? I tried repeatedly to come up with one, but the inclusion of method (2) has me stumped.
Comparison-based sorting is always O(N log N), and adding to an already sorted container is O(log N). To avoid that, you need buckets, the way you now have them for years. This trades memory for execution time.
(1) can be O(1) only if you only add things to HashMaps.
(2) can be O(1) if you have a separate HashMap which maps the ID to the user.
(3) of course is O(N) because you need to list all N users, but if you have a HashMap where the key is the day and the value is a list of users, you only need to go through a constant (10 years * 365 days + 2) number of arrays to list all users. So O(N), with (1) still being O(1), assuming users are unsorted within a single day.
(4) Basically the same as (3) for a simple implementation, just with less printing. You could perhaps speed up the best case with a trie or something, but it'll still be O(N) because a certain percentage of N will match.
(5) Same as (3), you just can break out sooner.
You have to make compromises, and make informed guesses about the most common operations. There is a good chance that the most common operation will be to find a user by ID. A HashMap is thus the ideal structure for that: it's O(1), as well as the insertion into the map.
To implement the list of users sorted by date, and the list of users who joined before a given date, the best data structure would be a TreeSet. The TreeSet is already sorted (so your 3rd operation would be O(1)), and it can return a sorted subset in O(log n) time.
But maintaining a TreeSet in parallel to a HashMap is cumbersome, error-prone, and costs memory. And insertion complexity would become O(log N). If these aren't common operations, you could simply iterate over the entries and filter/sort them. Definitely forget about your 10 arrays: they are unmaintainable, and a TreeSet is a much better and easier solution, not limited to 10 years.
The list of users by name containing a given string is O(N), whatever the data structure you choose.
A HashMap doesn't sort anything; its primary purpose/advantage is to offer near-O(1) lookups (which you can use for lookups by ID). If you need to sort something, you should have the class implement Comparable, add it to a List, and use Collections.sort to sort the elements.
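For example, a small sketch of that pattern, assuming a hypothetical User class ordered by its join date:

import java.util.*;

class User implements Comparable<User> {
    final int id;
    final int date;      // join date as an int, e.g. yyyymmdd
    final String name;

    User(int id, int date, String name) {
        this.id = id;
        this.date = date;
        this.name = name;
    }

    @Override
    public int compareTo(User other) {
        return Integer.compare(this.date, other.date);   // order by join date
    }
}

class UserSortDemo {
    public static void main(String[] args) {
        Map<Integer, User> byId = new HashMap<>();       // O(1) lookup by ID
        byId.put(1, new User(1, 20120315, "alice"));
        byId.put(2, new User(2, 20041101, "bob"));

        List<User> byDate = new ArrayList<>(byId.values());
        Collections.sort(byDate);                        // sorts by date via compareTo
        for (User u : byDate) System.out.println(u.name);
    }
}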
As for efficiency:
O(1)
O(1)
O(n log n)
O(n) (at least)
O(n) (or less, but I think it will have to be O(n))
Also is there another data structure/method that I can use to have O(1) for all my methods?
Three HashMaps. This isn't a database, so you have to maintain "indices" by hand.
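One possible reading of that suggestion (the answer doesn't spell out which three maps, so this is a guess), reusing the hypothetical User class sketched above:

import java.util.*;

class UserDirectory {
    Map<Integer, User> byId = new HashMap<>();           // ID -> user: O(1) lookup for (2)
    Map<Integer, List<User>> byDate = new HashMap<>();   // join date -> users who joined that day: (3), (5)
    Map<String, List<User>> byName = new HashMap<>();    // name -> users with that name: (4)
    // Every add (1) must update all three "indices" by hand.
}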

Running time of insertion into 2 hashtables with iteration and printing

I have a program that does the following:
Iterates through a string, placing words into a HashMap<String, Integer> where the key represents the unique word, and the value represents a running total of occurrences (incremented each time the word is found).
I believe up to this point we are O(n) since each of the insertions is constant time.
Then, I iterate through the hashmap and insert the values into a new HashMap<Integer, List<String>>. The String goes into the List in the value where the count matches. I think that we are still at O(n) because the operations used on HashMaps and Lists are constant time.
Then, I iterate through the HashMap and print the Strings in each List.
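A minimal sketch of that pipeline, assuming whitespace-separated words (class, method, and variable names are mine):

import java.util.*;

class WordCounts {
    static Map<Integer, List<String>> groupWordsByCount(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.split("\\s+")) {
            counts.merge(word, 1, Integer::sum);             // step 1: word -> running total of occurrences
        }
        Map<Integer, List<String>> byCount = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                   .add(e.getKey());                          // step 2: count -> words seen that many times
        }
        return byCount;                                       // step 3: iterate over this map and print each list
    }
}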
Does anything in this program cause me to go above O(n) complexity?
That is O(n), unless your word-parsing algorithm is not linear (but it should be).
You're correct, with a caveat. In a hash table, insertions and lookups take expected O(1) time each, so the expected runtime of your algorithm is O(n). If you have a bad hash function, there's a chance it will take longer than that, usually (for most reasonable hash table implementations) O(n^2) in the worst case.
Additionally, as @Paul Draper pointed out, this assumes that the computation of the hash code for each string takes time O(1) and that comparing the strings in the table takes time O(1). If you have strings whose lengths aren't bounded from above by some constant, it might take longer to compute the hash codes. In fact, a more accurate analysis would be that the runtime is O(n + L), where L is the total length of all the strings.
Hope this helps!
Beyond the two issues that Paul Draper and templatetypedef point out, there's another potential one. You write that your second map is a HashMap<Integer, List<String>>. This allows for a total linear complexity only if the implementation you choose for the list allows for (amortized) constant time appending. This is the case if you use an ArrayList and you add entries at the end, or you choose a LinkedList and add entries at either end.
I think this covers the default choices for most developers, so it's not really an obstacle.

Is efficiency of java's TreeMap based on number of keys or values?

Since Java uses a red-black tree to implement the TreeMap class, is the efficiency of put() and get() lg(N), where N = number of distinct keys, or N = number of insertions/retrievals you plan to do?
For example, say I want to use a
TreeMap<Integer, ArrayList<String>>
to store the following data:
1 million <1, "bob"> pairs and 1 million <2, "jack"> pairs (the strings get inserted into the ArrayList value corresponding to the key)
The final TreeMap will have 2 keys, each storing an ArrayList of a million "bob" or "jack" strings. Is the time efficiency lg(2 million) or lg(2)? I am guessing it's lg(2) since that's how a red-black tree works, but just wanted to check.
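For reference, a small sketch of the described setup (the exact insertion code is mine, not from the question):

import java.util.*;

public class TreeMapSizeDemo {
    public static void main(String[] args) {
        TreeMap<Integer, ArrayList<String>> map = new TreeMap<>();
        for (int i = 0; i < 1_000_000; i++) {
            map.computeIfAbsent(1, k -> new ArrayList<>()).add("bob");
            map.computeIfAbsent(2, k -> new ArrayList<>()).add("jack");
        }
        // Every put/get navigates a tree of only 2 keys, so the log factor is log 2, not log 2,000,000.
        System.out.println(map.size() + " keys, " + map.get(1).size() + " values under key 1");
    }
}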
Performance of a TreeMap with 2 pairs will behave as N=2, regardless of how many duplicate additions were previously made. There is no "memory" of the excess additions so they cannot possibly produce any overhead.
So yes, you can informally assume that time efficiency is "log 2".
Although it's fairly meaningless, as big-O notation is intended to relate to asymptotic efficiency rather than be relevant for small sizes. An O(N^3) algorithm could easily be faster than an O(log N) algorithm for N=2.
For this case, the TreeMap is lg(n) where n = 2, as you describe. There are only 2 values in the map: one ArrayList and another ArrayList. No matter what is contained inside those, the map only knows of two values.
While not directly concerned with your question, you may want to consider not using a TreeMap for this... I mean, how do you plan to access the data stored inside your "bob" or "jack" lists? Those are going to be O(n) searches unless you use some kind of binary search on them, and the n here is 1 million. If you elaborate more on your end goal, perhaps a more encompassing solution can be achieved.

Peak Value of Number of Occurences in Array of Integers

In Java, I need an algorithm to find the max number of occurrences in a collection of integers. For example, if my set is [2,4,3,2,2,1,4,2,2], the algorithm needs to output 5 because 2 is the most frequently occurring integer and it appears 5 times. Consider it like finding the peak of the histogram of the set of integers.
The challenge is, I have to do it one by one for multiple sets of many integers so it needs to be efficient. Also, I do not know which element will be mostly appearing in the sets. It is totally random.
I thought about putting the values of the set into an array, sorting it, and then iterating over the array, counting consecutive appearances of the numbers and identifying the maximum of the counts, but I am guessing it will take a long time. Are there any libraries or algorithms that could help me do this efficiently?
I would loop over the collection, inserting into a Map data structure with the following logic:
If the integer has not yet been inserted into the map, then insert key=integer, value=1.
If the key exists, increment the value.
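In code, that counting logic might look something like this (a sketch; merge covers both the "not yet inserted" and "increment" cases, and the names are mine):

import java.util.*;

class Histogram {
    static int peakOccurrences(int[] numbers) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int value : numbers) {
            counts.merge(value, 1, Integer::sum);   // insert 1 for a new key, otherwise add 1 to the value
        }
        return Collections.max(counts.values());    // the peak of the histogram
    }
}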
There are two Maps in Java you could use - HashMap and TreeMap - these are compared below:
HashMap vs. TreeMap
You can skip the detailed explanation and jump straight to the summary if you wish.
A HashMap is a Map which stores key-value pairs in an array. The index used for key k is (roughly):
k.hashCode() % capacity
where capacity is the length of the backing array. Sometimes two completely different keys will end up at the same index. To solve this, each location in the array is really a linked list, which means a lookup may have to walk the linked list and check for equality using the k.equals(other) method. In the worst case, all keys get stored at the same location and the HashMap degrades into an unindexed list.
As the HashMap gains more entries, the likelihood of these clashes increases, and the efficiency of the structure decreases. To solve this, when the number of entries reaches a critical point (determined by the loadFactor argument in the constructor), the structure is resized:
A new array is allocated at about twice the current size
A loop is run over all the existing keys
The key's location is recomputed for the new array
The key-value pair is inserted into the new structure
As you can see, this can become relatively expensive if there are many resizes.
This problem can be overcome if you can pre-allocate the HashMap at an appropriate size before you begin, e.g. map = new HashMap<>((int) (input.size() * 1.5)). For large datasets, this can dramatically reduce memory churn.
Because the keys are essentially randomly positioned in the HashMap, the key iterator will iterate over them in a random order. Java does provide the LinkedHashMap, which will iterate in the order the keys were inserted.
Performance for a HashMap:
Given the correct size and good distribution of hashes, lookup is constant-time.
With bad distribution, performance drops to (in the worst case) linear search - O(n).
With bad initial sizing, performance becomes that of rehashing. I can't trivially calculate this, but it's not good.
OTOH a TreeMap stores entries in a balanced tree - a dynamic structure that is incrementally built up as key-value pairs are added. Insert is dependent on the depth of the tree (log(tree.size())), but is predictable - unlike HashMap, there are no hiatuses, and no edge conditions where performance drops.
Each insert and lookup is more expensive than in a well-distributed HashMap.
Further, in order to insert the key in the tree, every key must be comparable to every other key, requiring the k.compareTo(other) method from the Comparable interface. Obviously, given the question is about integers, this is not a problem.
Performance for a TreeMap:
Insert of n elements is O(n log n)
Lookup is O(log n)
Summary
First thoughts: Dataset size:
If small (even in the 1000's and 10,000's) it really doesn't matter on any modern hardware
If large, to the point of causing the machine to run out of memory, then TreeMap may be the only option
Otherwise, size is probably not the determining factor
In this specific case, a key factor is whether the expected number of unique integers is large or small compared to the overall dataset size.
If small, then the overall time will be dominated by key lookup in a small set, so optimization is irrelevant (you can stop here).
If large, then the overall time will be dominated by insert, and the decision rests on more factors:
Dataset is of known size?
If yes: The HashMap can be pre-allocated, and so memory churn eliminated. This is especially important if the hashCode() method is expensive (not in our case)
If no: A TreeMap provides more predictable performance and may be the better choice
Is predictable performance with no large stalls required, eg in real-time systems or on the event thread of a GUI?
If yes: A TreeMap provides much better predictability with no stalls
If no: A HashMap probably provides better overall performance for the whole computation
One final point, if there is no overwhelming factor from the above:
Is a sorted list of keys of value?
If yes (eg to print a histogram): A TreeMap has already sorted the keys, and so is convenient
However, if performance is important, the only way to decide would be to code against the Map interface, then profile both HashMap and TreeMap to see which is actually better in your situation. Premature optimization is the root of much evil :)
What's wrong with sorting? That's O(n log n), which isn't bad at all. Any better solution would either require more information about the input sets (an upper bound on the numbers involved, perhaps) or involve a Map<Integer, Integer> or something equivalent.
The basic method is to sort the collection and then simply run through the sorted collection. (This would be done in O(n log n + n), which is O(n log n).)
If the numbers are bounded (say, for example, to [-10000, 10000]) and the collection contains a lot of integers, you can use a lookup table and count each element. This would take O(n + l) (O(n) for the count, O(l) to find the max element), where l is the range length (20001 in this case).
As you can see, if n >> l then this becomes O(n), which is better than the first approach, but if n << l then it's O(l), which is constant but big enough to make this unusable.
Another variant of the previous approach is to use a hash table instead of a lookup table. This would improve the complexity to O(n) but is not guaranteed to be faster than the second approach when n >> l.
The good news is that the values don't have to be bounded.
I'm not much of a Java programmer, but if you need help coding these, let me know.
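For what it's worth, here is a sketch of the lookup-table variant described above, assuming every value lies in [-10000, 10000] as in the example bound (names are mine):

class LookupTablePeak {
    static int peakWithLookupTable(int[] values) {
        final int OFFSET = 10000;
        int[] table = new int[2 * OFFSET + 1];   // one counter per possible value (l = 20001 slots)
        for (int v : values) {
            table[v + OFFSET]++;                 // O(n): count every value
        }
        int max = 0;
        for (int count : table) {
            if (count > max) max = count;        // O(l): scan the table for the peak
        }
        return max;
    }
}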
Here is a sample implementation of your program. It returns the number with the highest frequency; if two numbers are tied for the maximum number of occurrences, the larger number is returned. If you want to return the frequency instead, change the last line of the code to "return mf".
public int mode(int[] a, int n) {
    int f, mf = 0, mv = a[0];
    for (int i = 0; i < n; i++) {
        f = 0;
        for (int j = 0; j < n; j++) {
            if (a[i] == a[j]) {
                f++;
            }
        }
        if (f > mf || (f == mf && a[i] > mv)) {
            mf = f;
            mv = a[i];
        }
    }
    return mv;
}
Since it's a collection of integers, one can use either
radix sort to sort the collection, which takes O(nb) where b is the number of bits used to represent the integers (32 or 64, if you use Java's primitive integer data types), or
a comparison-based sort (quicksort, merge sort, etc) and that takes O(n log n).
Notes:
The larger your n becomes, the more likely that radix sort will be faster than comparison-based sorts. For smaller n, you are probably better off with a comparison-based sort.
If you know a bound on the values in the collection, b will be even smaller than 32 (or 64) making the radix sort more desirable.
This little puppy works (edited to return the frequency instead of the number):
import java.util.*;
import java.util.Map.Entry;
import java.util.concurrent.atomic.AtomicInteger;

public static int mostFrequent(int[] numbers) {
    Map<Integer, AtomicInteger> map = new HashMap<Integer, AtomicInteger>() {
        public AtomicInteger get(Object key) {
            AtomicInteger value = super.get(key);
            if (value == null) {
                value = new AtomicInteger();
                super.put((Integer) key, value);
            }
            return value;
        }
    };
    for (int number : numbers)
        map.get(number).incrementAndGet();
    List<Entry<Integer, AtomicInteger>> entries = new ArrayList<Map.Entry<Integer, AtomicInteger>>(map.entrySet());
    Collections.sort(entries, new Comparator<Entry<Integer, AtomicInteger>>() {
        @Override
        public int compare(Entry<Integer, AtomicInteger> o1, Entry<Integer, AtomicInteger> o2) {
            return o2.getValue().get() - o1.getValue().get();
        }
    });
    return entries.get(0).getValue().get(); // return the largest *frequency*
    // Use this next line instead to return the most frequent *number*
    // return entries.get(0).getKey();
}
AtomicInteger was chosen to avoid creating new objects with every increment, and the code reads a little cleaner.
The anonymous map class was used to centralize the "if null" code.
Here's a test:
public static void main(String[] args) {
System.out.println(mostFrequent(new int[] { 2, 4, 3, 2, 2, 1, 4, 2, 2 }));
}
Output:
5
Using a HashMap:
import java.util.HashMap;

public class NumberCounter {
    static HashMap<Integer, Integer> map = new HashMap<>();
    static int[] arr = {1, 2, 1, 23, 4, 5, 4, 1, 2, 3, 12, 23};
    static int max = 0;

    public static void main(String[] args) {
        // Count occurrences of each number.
        for (int i = 0; i < arr.length; i++) {
            Integer count = map.get(arr[i]);
            map.put(arr[i], count == null ? 1 : count + 1);
        }
        // Find the highest count by scanning the map's values (not the array indices).
        for (int count : map.values()) {
            if (max < count) {
                max = count;
            }
        }
        System.out.print(max);
    }
}
