I use a Dico class to store the weight of a term and the id of the document where it appears:
public class Dico
{
private String m_term; // term
private double m_weight; // weight of term
private int m_Id_doc; // id of doc that contain term
public Dico(int Id_Doc,String Term,double tf_ief )
{
this.m_Id_doc = Id_Doc;
this.m_term = Term;
this.m_weight = tf_ief;
}
public String getTerm()
{
return this.m_term;
}
public double getWeight()
{
return this.m_weight;
}
public void setWeight(double weight)
{
this.m_weight= weight;
}
public int getDocId()
{
return this.m_Id_doc;
}
}
And I use this method to calculate the final weight from a Map<String,Double> and a List<Dico>:
public List<Dico> merge_list_map(List<Dico> list,Map<String,Double> map)
{
// in the map each term is unique, but the list contains duplicates
List<Dico> list_term_weight = new ArrayList <>();
for (Map.Entry<String,Double> entrySet : map.entrySet())
{
String key = entrySet.getKey();
Double value = entrySet.getValue();
for(Dico dic : list)
{
String term =dic.getTerm();
double weight = dic.getWeight();
if(key.equals(term))
{
double new_weight =weight*value;
list_term_weight.add(new Dico(dic.getDocId(), term, new_weight));
}
}
}
return list_term_weight;
}
I have 36736 elements in the map and 1053914 in the list; currently this program takes a lot of time to run: BUILD SUCCESSFUL (total time: 17 minutes 15 seconds).
How can I get only the terms from the list that equal the terms from the map?
You can use the lookup functionality of the Map, i.e. Map.get(), given that your map maps terms to weights. This should give a significant performance improvement. The only difference is that the output list is in the same order as the input list, rather than the order in which the keys occur in the weighting Map.
public List<Dico> merge_list_map(List<Dico> list, Map<String, Double> map)
{
// in the map each term is unique, but the list contains duplicates
List<Dico> list_term_weight = new ArrayList<>();
for (Dico dic : list)
{
String term = dic.getTerm();
double weight = dic.getWeight();
Double value = map.get(term); // <== fetch weight from Map
if (value != null)
{
double new_weight = weight * value;
list_term_weight.add(new Dico(dic.getDocId(), term, new_weight));
}
}
return list_term_weight;
}
Basic test
List<Dico> list = Arrays.asList(new Dico(1, "foo", 1), new Dico(2, "bar", 2), new Dico(3, "baz", 3));
Map<String, Double> weights = new HashMap<String, Double>();
weights.put("foo", 2d);
weights.put("bar", 3d);
System.out.println(merge_list_map(list, weights));
Output
[Dico [m_term=foo, m_weight=2.0, m_Id_doc=1], Dico [m_term=bar, m_weight=6.0, m_Id_doc=2]]
Timing test - 10,000 elements
List<Dico> list = new ArrayList<Dico>();
Map<String, Double> weights = new HashMap<String, Double>();
for (int i = 0; i < 1e4; i++) {
list.add(new Dico(i, "foo-" + i, i));
if (i % 3 == 0) {
weights.put("foo-" + i, (double) i); // <== every 3rd has a weight
}
}
long t0 = System.currentTimeMillis();
List<Dico> result1 = merge_list_map_original(list, weights);
long t1 = System.currentTimeMillis();
List<Dico> result2 = merge_list_map_fast(list, weights);
long t2 = System.currentTimeMillis();
System.out.println(String.format("Original: %d ms", t1 - t0));
System.out.println(String.format("Fast: %d ms", t2 - t1));
// prove results equivalent, just different order
// requires Dico class to have hashCode/equals() - used eclipse default generator
System.out.println(new HashSet<Dico>(result1).equals(new HashSet<Dico>(result2)));
Output
Original: 1005 ms
Fast: 16 ms <=== loads quicker
true
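For reference, the hashCode()/equals() pair mentioned in the comment above could look like the following (an Eclipse-style version over the three fields), together with a toString() matching the output format of the basic test:
@Override
public int hashCode()
{
    final int prime = 31;
    int result = 1;
    result = prime * result + m_Id_doc;
    long temp = Double.doubleToLongBits(m_weight);
    result = prime * result + (int) (temp ^ (temp >>> 32));
    result = prime * result + ((m_term == null) ? 0 : m_term.hashCode());
    return result;
}
@Override
public boolean equals(Object obj)
{
    if (this == obj) return true;
    if (obj == null || getClass() != obj.getClass()) return false;
    Dico other = (Dico) obj;
    return m_Id_doc == other.m_Id_doc
            && Double.doubleToLongBits(m_weight) == Double.doubleToLongBits(other.m_weight)
            && java.util.Objects.equals(m_term, other.m_term);
}
@Override
public String toString()
{
    return "Dico [m_term=" + m_term + ", m_weight=" + m_weight + ", m_Id_doc=" + m_Id_doc + "]";
}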
Also, check the initialization of the Map (http://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html). Rehashing the map is costly in terms of performance.
As a general rule, the default load factor (.75) offers a good
tradeoff between time and space costs. Higher values decrease the
space overhead but increase the lookup cost (reflected in most of the
operations of the HashMap class, including get and put). The expected
number of entries in the map and its load factor should be taken into
account when setting its initial capacity, so as to minimize the
number of rehash operations. If the initial capacity is greater than
the maximum number of entries divided by the load factor, no rehash
operations will ever occur.
If many mappings are to be stored in a HashMap instance, creating it
with a sufficiently large capacity will allow the mappings to be
stored more efficiently than letting it perform automatic rehashing as
needed to grow the table.
If you know, or can approximate, the number of elements that you will put in the map, you can create your Map like this:
Map<String, Double> foo = new HashMap<String, Double>(maxSize * 2);
In my experience, you can increase your performance by a factor of 2 or more.
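For example, with the ~36736 distinct terms mentioned in the question and the default load factor of 0.75, a sizing sketch could be:
int expectedTerms = 36736; // distinct terms, from the question
// capacity whose resize threshold (capacity * 0.75) is at least expectedTerms, so no rehash occurs
Map<String, Double> weights = new HashMap<String, Double>((int) (expectedTerms / 0.75f) + 1);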
For the merge_list_map function to be efficient, you need to actually use the Map for what it is: an efficient data structure for key lookup.
As written, looping over the Map entries and looking for a match in the List, the algorithm is O(N*M), where M is the size of the map and N the size of the list. That is certainly the worst you can get.
If you loop first through the List and then, for each term, do a lookup in the Map with Map.get(String key), you will get a time complexity of O(N), since the map lookup can be considered O(1).
In terms of design, and if you can use Java 8, your problem can be translated into Streams:
public static List<Dico> merge_list_map(List<Dico> dico, Map<String, Double> weights) {
List<Dico> wDico = dico.stream()
.filter (d -> weights.containsKey(d.getTerm()))
.map (d -> new Dico(d.getDocId(), d.getTerm(), d.getWeight()*weights.get(d.getTerm())))
.collect (Collectors.toList());
return wDico;
}
The new weighted list is built following a logical process:
stream(): take the list as a stream of Dico elements
filter(): keep only the Dico elements whose term is in the weights map
map(): for each filtered element, create a new Dico() instance with the computed weight.
collect(): collect all the new instances in a new list
return the new list that contains the filtered Dico with the new weight.
Performance-wise, I tested it against some text, The Narrative of Arthur Gordon Pym by E. A. Poe:
String text = null;
try (InputStream url = new URL("http://www.gutenberg.org/files/2149/2149-h/2149-h.htm").openStream()) {
text = new Scanner(url, "UTF-8").useDelimiter("\\A").next();
}
String[] words = text.split("[\\p{Punct}\\s]+");
System.out.println(words.length); // => 108028
Since there are only 100k words in the book, for good measure, just x10 (initDico() is a helper to build the List<Dico> from the words):
List<Dico> dico = initDico(words);
List<Dico> bigDico = new ArrayList<>(10*dico.size());
for (int i = 0; i < 10; i++) {
bigDico.addAll(dico);
}
System.out.println(bigDico.size()); // 1080280
Build the weights map, using all words (initWeights() builds a frequency map of the words in the book):
Map<String, Double> weights = initWeights(words);
System.out.println(weights.size()); // 9449 distinct words
Then the test of merging the 1M words against the map of weights:
long start = System.currentTimeMillis();
List<Dico> wDico = merge_list_map(bigDico, weights);
long end = System.currentTimeMillis();
System.out.println("===== Elapsed time (ms): "+(end-start));
// => 105 ms
The weights map is significantly smaller than yours, but that should not impact the timing, since the lookup operations run in quasi-constant time.
This is no serious benchmark for the function, but it already shows that merge_list_map() should take well under 1 s (loading and building the list and the map are not part of the measurement).
Just to complete the exercise, following are the initialisation methods used in the test above:
private static List<Dico> initDico(String[] terms) {
List<Dico> dico = Arrays.stream(terms)
.map(String::toLowerCase)
.map(s -> new Dico(0, s, 1.0)) // doc id is irrelevant for this test
.collect(Collectors.toList());
return dico;
}
// weight of a word is the frequency*1000
private static Map<String, Double> initWeights(String[] terms) {
Map<String, Long> wfreq = termFreq(terms);
long total = wfreq.values().stream().reduce(0L, Long::sum);
return wfreq.entrySet().stream()
.collect(Collectors.toMap(Map.Entry::getKey, e -> (double)(1000.0*e.getValue()/total)));
}
private static Map<String, Long> termFreq(String[] terms) {
Map<String, Long> wfreq = Arrays.stream(terms)
.map(String::toLowerCase)
.collect(groupingBy(Function.identity(), counting()));
return wfreq;
}
You should use the contains() method of the list. In this way you'll avoid the second for loop. Even if the contains() method has O(n) complexity, you should see a small improvement. Of course, remember to re-implement equals(). Otherwise you should use a second Map, as others have suggested.
Use the lookup functionality of the Map, as Adam pointed out, and use HashMap as the Map implementation - HashMap lookup complexity is O(1). This should result in increased performance.
Related
I am working on the question below:
Suppose you have a list of Dishes, where each dish is associated with
a list of ingredients. Group together dishes with common ingredients.
For example:
Input:
"Pasta" -> ["Tomato Sauce", "Onions", "Garlic"]
"Chicken Curry" --> ["Chicken", "Curry Sauce"]
"Fried Rice" --> ["Rice", "Onions", "Nuts"]
"Salad" --> ["Spinach", "Nuts"]
"Sandwich" --> ["Cheese", "Bread"]
"Quesadilla" --> ["Chicken", "Cheese"]
Output:
("Pasta", "Fried Rice")
("Fried Rice, "Salad")
("Chicken Curry", "Quesadilla")
("Sandwich", "Quesadilla")
Also what is the time and space complexity?
I came up with the code below. Is there any better way to do this problem? It looks like the algorithm is connected components from graph theory.
public static void main(String[] args) {
List<String> ing1 = Arrays.asList("Tomato Sauce", "Onions", "Garlic");
List<String> ing2 = Arrays.asList("Chicken", "Curry Sauce");
List<String> ing3 = Arrays.asList("Rice", "Onions", "Nuts");
List<String> ing4 = Arrays.asList("Spinach", "Nuts");
List<String> ing5 = Arrays.asList("Cheese", "Bread");
List<String> ing6 = Arrays.asList("Chicken", "Cheese");
Map<String, List<String>> map = new HashMap<>();
map.put("Pasta", ing1);
map.put("Chicken Curry", ing2);
map.put("Fried Rice", ing3);
map.put("Salad", ing4);
map.put("Sandwich", ing5);
map.put("Quesadilla", ing6);
System.out.println(group(map));
}
private static List<List<String>> group(Map<String, List<String>> map) {
List<List<String>> output = new ArrayList<>();
if (map == null || map.isEmpty()) {
return output;
}
Map<String, List<String>> holder = new HashMap<>();
for (Map.Entry<String, List<String>> entry : map.entrySet()) {
String key = entry.getKey();
List<String> value = entry.getValue();
for (String v : value) {
if (!holder.containsKey(v)) {
holder.put(v, new ArrayList<String>());
}
holder.get(v).add(key);
}
}
return new ArrayList<List<String>>(holder.values());
}
We can have an actual complexity estimation of this approach using graph theory. A "connected components" approach would have O(|V| + |E|) complexity, where V is the set of all ingredients and dishes, and E is the set containing all relations (a, b) where a is a dish and b is an ingredient of dish a (i.e. assuming that you are storing this graph G = (V, E) as an adjacency list, as opposed to an adjacency matrix).
In any algorithm that needs to find out all the ingredients of each dish to find the result, you would have to investigate each and every dish and all of their ingredients. This would result in an investigation (i.e. traversal) that takes O(|V| + |E|) time, which would mean that no such algorithm could be better than your approach.
Let's first turn this problem into a graphs problem. Each dish and each ingredient will be a vertex. Each relation between dish and ingredient will be an edge.
Let's analyse the maximal size of the solution. Assuming there are N dishes and M ingredients overall, the maximal solution output is when every single dish is related. In that case the output is of size N^2, so this is a lower bound on the time complexity you can achieve. We can quite easily create an input for which we must iterate over all vertices and edges, so another lower bound on the time complexity is N * M. Also, we must store all of the vertices and edges, so M * N is a lower bound on the space complexity.
Now let's analyse your solution. You iterate over all dishes (N), and for each dish you iterate over all of its values (at most M), checking in O(1) whether each is in the dictionary, so in total O(N * M). Your space complexity is O(M * N) as well. I would say your solution is good.
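Since connected components were mentioned, here is a minimal union-find sketch (not part of the original code; the names are illustrative) that groups dishes transitively by shared ingredients. For the sample input it puts Pasta, Fried Rice and Salad in one group and Chicken Curry, Quesadilla and Sandwich in another:
private static Collection<List<String>> groupConnected(Map<String, List<String>> map) {
    // one union-find root per dish
    Map<String, String> parent = new HashMap<>();
    for (String dish : map.keySet()) {
        parent.put(dish, dish);
    }
    // remember the first dish seen for each ingredient and union later dishes with it
    Map<String, String> firstDish = new HashMap<>();
    for (Map.Entry<String, List<String>> entry : map.entrySet()) {
        for (String ingredient : entry.getValue()) {
            String other = firstDish.putIfAbsent(ingredient, entry.getKey());
            if (other != null) {
                parent.put(find(parent, entry.getKey()), find(parent, other));
            }
        }
    }
    // collect dishes by the root of their component
    Map<String, List<String>> components = new HashMap<>();
    for (String dish : map.keySet()) {
        components.computeIfAbsent(find(parent, dish), k -> new ArrayList<>()).add(dish);
    }
    return components.values();
}
private static String find(Map<String, String> parent, String x) {
    while (!parent.get(x).equals(x)) {
        parent.put(x, parent.get(parent.get(x))); // path halving
        x = parent.get(x);
    }
    return x;
}
Asymptotically this is still O(|V| + |E|) for the traversal plus near-constant union-find overhead, so it does not beat the reverse-map approach; it just answers the transitive-grouping reading of the problem.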
You just need to build a reverse map here.
I think you can write the code in a more expressive way by using the Stream API introduced in Java 8.
Basic steps:
Extract all the ingredients from the map
For each ingredient, get the set of dishes that use it; you will have many such sets - collect them all into a set, and so the return type of the method becomes Set<Set<String>>
The following is the implementation:
private static Set<Set<String>> buildReverseMap(Map<String, Set<String>> map) {
// extracting all the values of map in a Set
Set<String> ingredients = map.values()
.stream()
.flatMap(Set::stream)
.collect(Collectors.toSet());
return ingredients.stream()
// map each ingredient to a set
.map(s ->
map.entrySet()
.stream()
.filter(entry -> entry.getValue().contains(s))
.map(Map.Entry::getKey)
.collect(Collectors.toSet())
).collect(Collectors.toSet());
}
Time complexity analysis:
Assume you have N dishes and M ingredients, and in the worst case each dish can contain every ingredient. For each ingredient you need to iterate through every dish and check whether it contains the current ingredient or not. This check can be done in amortized O(1), as the ingredients of each dish can be kept in a HashSet<String>.
So for each ingredient you iterate through every dish and check in amortized O(1) whether that dish contains the ingredient. This gives an amortized time complexity of O(M*N).
Space-complexity Analysis:
Simply O(M*N), as in the worst case every dish can be made up of every available ingredient.
Note:
You can return a List<Set<String>> instead of Set<Set<String>> just by changing .collect(Collectors.toSet()) to .collect(Collectors.toList())
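For example (a hypothetical call, with the question's input expressed as Set values rather than List), groups such as [Pasta, Fried Rice] and [Chicken Curry, Quesadilla] show up alongside singleton groups for ingredients used by only one dish:
Map<String, Set<String>> map = new HashMap<>();
map.put("Pasta", new HashSet<>(Arrays.asList("Tomato Sauce", "Onions", "Garlic")));
map.put("Fried Rice", new HashSet<>(Arrays.asList("Rice", "Onions", "Nuts")));
map.put("Chicken Curry", new HashSet<>(Arrays.asList("Chicken", "Curry Sauce")));
map.put("Quesadilla", new HashSet<>(Arrays.asList("Chicken", "Cheese")));
System.out.println(buildReverseMap(map));
// prints something like [[Pasta], [Fried Rice], [Pasta, Fried Rice], [Chicken Curry], [Chicken Curry, Quesadilla], ...]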
I have a map Map<String, List<String>>. I'd like to merge keys, if one key is a function of another, for example:
if the function is "prefix", I'd like given these values in the map:
{"123", ["a"]]}
{"85", ["a","b"]]}
{"8591", ["c"]}
to get a new map with these values:
{"123", ["a"]}
{"85", ["a","b","c"]}
This map "reduction" is called as part of a user request, so it must be fast. I know I can do O(n^2) but I'm looking for something better, parallel if possible.
Below is code that finds the super key for each key by calling the getMatchingKey function:
Map<String, Set<String>> result= new HashMap<>();
for (Map.Entry<String, List<String>> entry : input.entrySet()){
String x = getMatchingKey(entry.getKey(), input.keySet());
if (!result.containsKey(x)){
result.put(x, new HashSet<String>());
}
result.get(x).addAll(input.get(x));
result.get(x).addAll(entry.getValue());
}
EDIT
The full problem I'm having is such:
Given a map of entity names to their footprints, Map<String, Footprint>, I would like to remove from a Footprint any Subnet that is already included in a different entity.
The Footprint object includes a List of Subnet.
So my thought was to reverse the map into a Map<Subnet, List<String>> mapping all subnets to their entity names, then union all subnets, and at the end filter the subnets from the original Map. Something like this:
public Map<String, Footprint> clearOverlaps(Map<String, Footprint> footprintsMap) {
Map<Subnet, List<String>> subnetsToGroupNameMap =
footprintsMap.entrySet()
.parallelStream()
.flatMap(e -> e.getValue().getSubnets().stream().map(i -> new AbstractMap.SimpleEntry<>(i, e.getKey())))
.collect(groupingBy(e->e.getKey(), mapping(e->e.getValue(), toList())));
Map<Subnet, Set<String>> subnetsToGroupNameFiltered = new HashMap<>();
for (Map.Entry<Subnet, List<String>> entry : subnetsToGroupNameMap.entrySet()){
Subnet x = findSubnetBiggerOrEqualToMe(entry.getKey(), subnetsToGroupNameMap.keySet());
if (!subnetsToGroupNameFiltered.containsKey(x)){
subnetsToGroupNameFiltered.put(x, new HashSet<String>());
}
subnetsToGroupNameFiltered.get(x).addAll(subnetsToGroupNameMap.get(x));
subnetsToGroupNameFiltered.get(x).addAll(entry.getValue());
}
footprintsMap.entrySet().stream().forEach(entry->entry.getValue().getSubnets().stream().filter(x->!subnetsToGroupNameFiltered.containsKey(x)));
return footprintsMap;
}
The function findSubnetBiggerOrEqualToMe finds, among all the subnets, the biggest one that includes the given Subnet instance.
But since this function should run on user request, and the Map contains tens of entities with tens of thousands of subnets, I need something that will be fast (memory is free:))
I played around with an approach that first sorts the subnets lexicographically. This would reduce the overhead caused by your call to findSubnetBiggerOrEqualToMe from n^2 to the sort algorithm's complexity (usually ~n log(n)). I will assume that you can order the subnets, as the logic should be similar to what you have in findSubnetBiggerOrEqualToMe.
Ideally, if all of a subnet's supernets were prefixes of one another, it would then be a simple reduction in linear time. Example [1, 2, 22, 222, 3]:
for (int i = 0; i < sortedEntries.size() - 1; i++)
{
Entry<Subnet, Set<String>> subnet = sortedEntries.get(i);
Entry<Subnet, Set<String>> potentialSupernet = sortedEntries.get(i + 1);
if (subnet.getKey().isPrefix(potentialSupernet.getKey()))
{
potentialSupernet.getValue().addAll(subnet.getValue());
sortedEntries.remove(i);
i--;
}
}
But as soon as you encounter cases like [1, 2, 22, 23] (22 and 23 are not prefixes of the same net), it is not a simple reduction anymore, as you have to look further than just the next entry to make sure you find all supernets (2 has to be merged into both 22 and 23):
for (int i = 0; i < sortedEntries.size(); i++)
{
Entry<Subnet, Set<String>> subnet = sortedEntries.get(i);
for (int j = i + 1; j < sortedEntries.size(); j++)
{
Entry<Subnet, Set<String>> nextNet = sortedEntries.get(j);
if (!subnet.getKey().isPrefix(nextNet.getKey()))
{
break;
}
Entry<Subnet, Set<String>> nextNextNet = j < sortedEntries.size() - 1 ? sortedEntries.get(j + 1) : null;
if (nextNextNet == null || !subnet.getKey().isPrefix(nextNextNet.getKey()))
{
// biggest, and last superset found
nextNet.getValue().addAll(subnet.getValue());
sortedEntries.remove(i);
i--;
}
else if (!nextNet.getKey().isPrefix(nextNextNet.getKey()))
{
// biggest superset found, but not last
nextNet.getValue().addAll(subnet.getValue());
}
}
}
How well this approach reduces n^2 depends on the number of independent nets. The smaller the sets with equal prefix are, the less quadratic the runtime should be.
In the end, I think this approach is very similar in behavior to a prefix tree approach. There, you would build the tree and then iterate the leaves (i.e. the biggest supersets) and merge all their ancestors' items into their sets.
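A rough sketch of that prefix-tree idea, using plain strings in place of the Subnet type and character-level prefixes in place of isPrefix() (both are simplifications, not the original types): the trie is built from the keys, and each leaf absorbs the value sets of all its ancestor keys.
static class Node {
    Map<Character, Node> children = new HashMap<>();
    Set<String> values; // set only for nodes where an input key ends
}
static Map<String, Set<String>> mergeIntoLeaves(Map<String, Set<String>> input) {
    Node root = new Node();
    // build the trie from the keys
    for (Map.Entry<String, Set<String>> e : input.entrySet()) {
        Node n = root;
        for (char c : e.getKey().toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.values = new HashSet<>(e.getValue());
    }
    Map<String, Set<String>> result = new HashMap<>();
    if (!input.isEmpty()) {
        collect(root, "", new ArrayList<Set<String>>(), result);
    }
    return result;
}
// depth-first walk: every key node pushes its values onto the ancestor stack,
// and every leaf absorbs the whole stack
static void collect(Node node, String prefix, List<Set<String>> ancestors, Map<String, Set<String>> out) {
    if (node.values != null) {
        ancestors.add(node.values);
    }
    if (node.children.isEmpty()) {
        Set<String> merged = new HashSet<>();
        ancestors.forEach(merged::addAll);
        out.put(prefix, merged);
    } else {
        for (Map.Entry<Character, Node> child : node.children.entrySet()) {
            collect(child.getValue(), prefix + child.getKey(), ancestors, out);
        }
    }
    if (node.values != null) {
        ancestors.remove(ancestors.size() - 1);
    }
}
Building the trie is linear in the total key length and the walk visits each node once, so the quadratic scan over candidate supernets disappears. Note that in this formulation the merged values end up under the longest keys, the mirror image of the question's example where the shortest key survives.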
Here is what I do. I have a list of objects to be converted to a map, with the object id as key and the object itself as value. I have thousands of objects in the list and it is causing a performance issue. Is there any simple way to do it without using a loop, or using some other data structure?
final List<Object> objects = new ArrayList<Object>();
final Map<Id, Object> objectMap = new HashMap<Id, Object>();
for (final Object object : objects)
{
objectMap.put(object.getId(), object);
}
You can try to optimize the HashMap with the right capacity and load factor:
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
The best value for the capacity is n / lf, where n is the maximum element count and lf the load factor, so that adding elements will not trigger a rehash. The default load factor is 0.75, but you can set it in the constructor to meet your needs.
The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
The default values make your map rehash the elements many times with so many put operations, and this impacts performance.
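As a rough sketch, reusing the question's placeholder types (Id and getId() come from the question, not from a real API), presizing looks like this:
int n = objects.size(); // expected number of entries
// capacity whose resize threshold (capacity * 0.75) is at least n, so the puts never trigger a rehash
final Map<Id, Object> objectMap = new HashMap<Id, Object>((int) (n / 0.75f) + 1);
for (final Object object : objects)
{
objectMap.put(object.getId(), object);
}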
The loop is mandatory, whether written by you or performed internally by the collector.
You can try to invoke a parallel stream on the list:
objects.parallelStream().collect(Collectors.toMap(object -> object.getId(), object -> object));
or else see some more of Java 8 parallel capabilities in the Parallelism Java tutorial
The use of Java 8's Stream won't spare you the iteration over the list, but might be slightly more optimised than repeated puts:
final List<Object> objects = new ArrayList<Object>();
final Map<Id, Object> objectMap = objects.stream().collect(Collectors.toMap(e -> e.getId(), e -> e));
Try using a stream to convert the List to a Map. But a loop is still used internally anyway.
Map<Id, Object> objectMap = objects.stream().collect(
Collectors.toMap(object -> object.getId(), object -> object));
I have run a JMH benchmark with one million objects to compare what is best.
forloop: 26.191 ± 0.567 ms/op
java8 Parallel: 42.693 ± 1.784 ms/op
Guava.uniqueIndex: 38.097 ± 3.521 ms/op
It seems that the for loop is the fastest!
Here is the benchmark: (MyObject extends Object and has an ID integer field)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(5)
@State(Scope.Benchmark)
public class ZipIteratorBenchmark {
static ArrayList<MyObject> objects;
@Setup(Level.Trial)
public void setup() {
objects = new ArrayList<>();
for (int i = 0; i < 1000000; i++) {
objects.add(new MyObject(i));
}
}
@Benchmark
public static Map<Integer, MyObject> forloop() {
final Map<Integer, MyObject> objectMap = new HashMap<>();
for (final MyObject object : objects) {
objectMap.put(object.getId(), object);
}
return objectMap;
}
@Benchmark
public static Map<Integer, MyObject> toMap() {
return FluentIterable.from(objects).uniqueIndex(MyObject::getId);
}
@Benchmark
public static Map<Integer, MyObject> java8Parallel() {
return objects.parallelStream().collect(Collectors.toConcurrentMap(MyObject::getId, object -> object));
}
}
I have a List of Orders, each with an order date and an order value. How do I group by order date and calculate the total order value per order date? How can I achieve this with Google Guava? If it's complicating things, how do I achieve this with plain Java collections?
Order POJO
Date date;
Integer value;
Util.java
ListMultimap<Date, Integer> listMultiMap = ArrayListMultimap.create();
for(Order o : orders){
listMultiMap.put(o.date, o.value);
}
//Now how do I iterate this listMultiMap and calculate the total value?
I don't think Guava is necessarily the best tool here... nor any normal map, for that matter: if you will have a huge amount of Orders, you should think about using Java 8 Streams, which will let you parallelise your calculation. They also offer optimizations for primitive types (int vs. Integer)...
In any case, for the specific use case you describe, and following the starting code you posted, here is a potential solution (using LocalDate instead of Date just because it's more handy):
@Test
public void test(){
// Basic test data
Order today1 = new Order(LocalDate.now(),1);
Order today2 = new Order(LocalDate.now(),2);
Order today3 = new Order(LocalDate.now(),5);
Order tomorrow1 = new Order(LocalDate.now().plusDays(1),2);
Order yesterday1 = new Order(LocalDate.now().minusDays(1),5);
Order yesterday2 = new Order(LocalDate.now().minusDays(1),4);
List<Order> list = Lists.newArrayList(today1,today2,today3,tomorrow1,yesterday1,yesterday2);
// Setup multimap and fill it with Orders
ListMultimap<LocalDate, Integer> mm = ArrayListMultimap.create();
for(Order o : list){
mm.put(o.date,o.value);
}
// At this point, all you need to do is, for each date "bucket", sum up all values.
Map<LocalDate, Integer> resultMap = Maps.newHashMap();
for(LocalDate d : mm.keySet()){
List<Integer> values = mm.get(d);
int valuesSum = 0;
for(int i : values){
valuesSum += i;
}
resultMap.put(d,valuesSum);
}
/*
* Result map should contain:
* today -> 8
* tomorrow -> 2
* yesterday -> 9
* */
assertThat(resultMap.size(), is(3));
assertThat(resultMap.get(LocalDate.now()), is(8));
assertThat(resultMap.get(LocalDate.now().minusDays(1)), is(9));
assertThat(resultMap.get(LocalDate.now().plusDays(1)), is(2));
}
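If Java 8 is available (as hinted at above), the same per-date totals can be computed directly with groupingBy and summingInt, skipping the intermediate multimap; the field access follows the Order POJO from the question, and Collectors is java.util.stream.Collectors:
// equivalent aggregation with streams: today -> 8, tomorrow -> 2, yesterday -> 9
Map<LocalDate, Integer> resultMap = list.stream()
        .collect(Collectors.groupingBy(o -> o.date, Collectors.summingInt(o -> o.value)));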
One famous programmer said, "why anybody need DB, just give me hash table!". I have a list of grammar symbols together with their frequencies. One way, it's a map: symbol# -> frequency. The other way, it's a [binary] relation. Problem: get the top 5 symbols by frequency.
A more general question: I'm aware of [binary] relational algebra slowly making inroads into CS theory. Is there a Java library supporting relations?
List<Entry<String, Integer>> myList = new ArrayList<>();
for (Entry<String, Integer> e : myMap.entrySet())
myList.add(e);
Collections.sort(myList, new Comparator<Entry<String, Integer>>(){
public int compare(Entry<String, Integer> a, Entry<String, Integer> b){
// compare b to a to get reverse order
return b.getValue().compareTo(a.getValue());
}
});
List<Entry<String, Integer>> top5 = myList.subList(0, 5);
More efficient:
TreeSet<Entry<String, Integer>> myTree = new TreeSet<>(
new Comparator<Entry<String, Integer>>(){
public int compare(Entry<String, Integer> a, Entry<String, Integer> b){
// compare b to a to get reverse order
return b.getValue().compareTo(a.getValue());
}
});
for (Entry<String, Integer> e : myMap.entrySet())
myTree.add(e);
List<Entry<String, Integer>> top5 = new ArrayList<>();
int i=0;
for (Entry<String, Integer> e : myTree) {
top5.add(e);
if (i++ == 4) break;
}
With TreeSet it should be easy:
int i = 0;
for(Symbol s: symbolTree.descendingSet()) {
i++;
if(i > 5) break; // or probably return
whatever(s);
}
Here is a general algorithm, assuming you already have a completed symbol HashTable
Make 2 arrays:
freq[5] // Use this to save the frequency counts for the 5 most frequent seen so far
word[5] // Use this to save the words that correspond to the above array, seen so far
Use an iterator to traverse your HashTable or Map:
Compare the current symbol's frequency against the ones in freq[5] in sequential order.
If the current symbol has a higher frequency than any entry in the array pairing above, shift that entry and all entries below it one position (i.e. the 5th position gets kicked out)
Add the current symbol / frequency pair to the newly vacated position
Otherwise, ignore.
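A minimal sketch of these steps, assuming the frequencies already sit in a Map<String, Integer> named frequencies (the names are illustrative):
String[] word = new String[5];
int[] freq = new int[5];
for (Map.Entry<String, Integer> e : frequencies.entrySet()) {
    int f = e.getValue();
    for (int i = 0; i < 5; i++) {
        if (word[i] == null || f > freq[i]) {
            // shift lower entries down one slot; the old 5th entry falls out
            for (int j = 4; j > i; j--) {
                freq[j] = freq[j - 1];
                word[j] = word[j - 1];
            }
            freq[i] = f;
            word[i] = e.getKey();
            break;
        }
    }
}
// word[0..4] now holds the top 5 symbols, most frequent first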
Analysis:
You make at most 5 comparisons (constant time) against the arrays with each symbol seen in the HashTable, so this is O(n)
Each time you have to shift the entries in the array down, it is also constant time. Assuming you do a shift every time, this is still O(n)
Space: O(1) to store the arrays
Runtime: O(n) to iterate through all the symbols