Java stream: merge map keys by function

I have a map Map<String, List<String>>. I'd like to merge keys whenever one key is a function of another. For example, if the function is "prefix", then given these entries in the map:
{"123": ["a"]}
{"85": ["a","b"]}
{"8591": ["c"]}
I'd like to get a new map with these entries:
{"123": ["a"]}
{"85": ["a","b","c"]}
This map "reduction" is called as part of a user request, so it must be fast. I know I can do O(n^2) but I'm looking for something better, parallel if possible.
Below is a code that find the super key for each key by calling the getMatchingKey function:
Map<String, Set<String>> result = new HashMap<>();
for (Map.Entry<String, List<String>> entry : input.entrySet()) {
    String x = getMatchingKey(entry.getKey(), input.keySet());
    if (!result.containsKey(x)) {
        result.put(x, new HashSet<String>());
    }
    result.get(x).addAll(input.get(x));
    result.get(x).addAll(entry.getValue());
}
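As an aside, the containsKey/put dance can be collapsed with computeIfAbsent; a minimal sketch of the same loop, assuming the same getMatchingKey helper:
// Same merge as above; computeIfAbsent creates the target set on demand
Map<String, Set<String>> result = new HashMap<>();
for (Map.Entry<String, List<String>> entry : input.entrySet()) {
    String superKey = getMatchingKey(entry.getKey(), input.keySet());
    Set<String> merged = result.computeIfAbsent(superKey, k -> new HashSet<>());
    merged.addAll(input.get(superKey));
    merged.addAll(entry.getValue());
}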
EDIT
The full problem I'm having is this:
Given a map of entity names to their footprints, Map<String, Footprint>, I would like to remove from each Footprint any Subnet which is included in a different entity.
The Footprint object contains a List of Subnet.
So my thought was to reverse the map into a Map<Subnet, List<String>> mapping each subnet to its entity names, then union all subnets, and at the end filter the subnets out of the original map. Something like this:
public Map<String, Footprint> clearOverlaps(Map<String, Footprint> footprintsMap) {
    Map<Subnet, List<String>> subnetsToGroupNameMap =
            footprintsMap.entrySet()
                         .parallelStream()
                         .flatMap(e -> e.getValue().getSubnets().stream()
                                        .map(i -> new AbstractMap.SimpleEntry<>(i, e.getKey())))
                         .collect(groupingBy(e -> e.getKey(), mapping(e -> e.getValue(), toList())));
    Map<Subnet, Set<String>> subnetsToGroupNameFiltered = new HashMap<>();
    for (Map.Entry<Subnet, List<String>> entry : subnetsToGroupNameMap.entrySet()) {
        Subnet x = findSubnetBiggerOrEqualToMe(entry.getKey(), subnetsToGroupNameMap.keySet());
        if (!subnetsToGroupNameFiltered.containsKey(x)) {
            subnetsToGroupNameFiltered.put(x, new HashSet<String>());
        }
        subnetsToGroupNameFiltered.get(x).addAll(subnetsToGroupNameMap.get(x));
        subnetsToGroupNameFiltered.get(x).addAll(entry.getValue());
    }
    // filter() without a terminal operation does nothing; removeIf actually mutates the subnet lists
    footprintsMap.values().forEach(fp -> fp.getSubnets().removeIf(x -> !subnetsToGroupNameFiltered.containsKey(x)));
    return footprintsMap;
}
The function findSubnetBiggerOrEqualToMe finds, among all the subnets, the biggest one that includes the given Subnet instance.
But since this function runs on user request, and the map contains tens of entities with tens of thousands of subnets, I need something that will be fast (memory is free :))

I played around with an approach that first sorts the subnets lexicographically. This reduces the overhead caused by your call to findSubnetBiggerOrEqualToMe from n^2 to the sort algorithm's complexity (usually ~n log(n)). I will assume that you can order the subnets, as the logic should be similar to what you have in findSubnetBiggerOrEqualToMe.
Ideally, if all of a subnet's supernets were prefixes of the same set, it would then be a simple reduction in linear time. Example [1, 2, 22, 222, 3]:
for (int i = 0; i < sortedEntries.size() - 1; i++)
{
    Entry<Subnet, Set<String>> subnet = sortedEntries.get(i);
    Entry<Subnet, Set<String>> potentialSupernet = sortedEntries.get(i + 1);
    if (subnet.getKey().isPrefix(potentialSupernet.getKey()))
    {
        potentialSupernet.getValue().addAll(subnet.getValue());
        sortedEntries.remove(i);
        i--;
    }
}
But as soon as you encounter cases like [1, 2, 22, 23] (22 and 23 are not prefixes of the same net), it is not a simple reduction anymore, as you have to look further than just the next entry to make sure you find all supernets (2 has to be merged into both 22 and 23):
for (int i = 0; i < sortedEntries.size(); i++)
{
    Entry<Subnet, Set<String>> subnet = sortedEntries.get(i);
    for (int j = i + 1; j < sortedEntries.size(); j++)
    {
        Entry<Subnet, Set<String>> nextNet = sortedEntries.get(j);
        if (!subnet.getKey().isPrefix(nextNet.getKey()))
        {
            break;
        }
        Entry<Subnet, Set<String>> nextNextNet = j < sortedEntries.size() - 1 ? sortedEntries.get(j + 1) : null;
        if (nextNextNet == null || !subnet.getKey().isPrefix(nextNextNet.getKey()))
        {
            // biggest, and last superset found
            nextNet.getValue().addAll(subnet.getValue());
            sortedEntries.remove(i);
            i--;
        }
        else if (!nextNet.getKey().isPrefix(nextNextNet.getKey()))
        {
            // biggest superset found, but not last
            nextNet.getValue().addAll(subnet.getValue());
        }
    }
}
How well this approach reduces n^2 depends on the number of independent nets: the smaller the groups of subnets sharing a common prefix are, the less quadratic the runtime should be.
In the end, I think this approach is very similar in behavior to a prefix tree approach. There, you would build the tree and then iterate the leaves (i.e. the biggest supersets) and merge all their ancestors' items into their sets.
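For illustration, here is a minimal sketch of that prefix-tree idea, assuming subnets can be rendered as prefix strings (TrieNode, PrefixMerger, and the merging traversal are illustrative names, not part of the original code):
import java.util.*;

// Minimal trie sketch: insert each subnet's string key with its entity names,
// then push every node's names down so that the leaves (the biggest supersets)
// end up holding the union of all their prefixes' names.
class TrieNode {
    final Map<Character, TrieNode> children = new HashMap<>();
    final Set<String> names = new HashSet<>();
}

class PrefixMerger {
    private final TrieNode root = new TrieNode();

    void insert(String key, Collection<String> names) {
        TrieNode node = root;
        for (char c : key.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.names.addAll(names);
    }

    Map<String, Set<String>> mergedLeaves() {
        Map<String, Set<String>> result = new HashMap<>();
        collect(root, new StringBuilder(), new HashSet<>(), result);
        return result;
    }

    private void collect(TrieNode node, StringBuilder path,
                         Set<String> inherited, Map<String, Set<String>> result) {
        Set<String> all = new HashSet<>(inherited);
        all.addAll(node.names);
        if (node.children.isEmpty()) { // leaf = biggest superset
            result.put(path.toString(), all);
            return;
        }
        for (Map.Entry<Character, TrieNode> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, all, result);
            path.deleteCharAt(path.length() - 1);
        }
    }
}
Inserting "123" -> [a], "85" -> [a, b], and "8591" -> [c] and calling mergedLeaves() would, under this answer's convention that the longer key absorbs its prefixes, return {"123": [a], "8591": [a, b, c]}. Building the trie is linear in the total key length, and the traversal visits each node once, so the pairwise comparisons are avoided entirely.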

Related

Effective way of comparing list elements in Java

Is there any **effective way** of comparing elements in Java and printing out the position of the element which occurs only once?
For example, if I have the list ["Hi", "Hi", "No"], I want to print out 2, because "No" is at position 2. I have solved this using the following algorithm and it works, BUT the problem is that with a large list it takes too much time to compare the entire list just to print out the first position of the unique word.
ArrayList<String> strings = new ArrayList<>(); // populated elsewhere
for (int i = 0; i < strings.size(); i++) {
    int oc = Collections.frequency(strings, strings.get(i));
    if (oc == 1) {
        System.out.print(i);
        break; // braces added: in the original, break ran unconditionally after the first iteration
    }
}
I can think of counting each element's occurrences and then filtering out the first unique element, though I'm not sure how large your list is.
Using Stream:
List<String> list = Arrays.asList("Hi", "Hi", "No");
// iterate through the list, storing each element and its number of occurrences in a Map
Map<String, Long> counts = list.stream()
        .collect(Collectors.groupingBy(Function.identity(), LinkedHashMap::new, Collectors.counting()));
String value = counts.entrySet().stream()
        .filter(e -> e.getValue() == 1) // keep only the elements which occur exactly once
        .map(Map.Entry::getKey)        // stream of the unique elements
        .findFirst()                   // first unique element in encounter order
        .get();                        // note: throws NoSuchElementException if there is none
System.out.println(list.indexOf(value));
EDIT:
A simplified version can be
Map<String, Long> counts2 = new LinkedHashMap<>();
for (String val : list) {
    long count = counts2.getOrDefault(val, 0L);
    counts2.put(val, ++count);
}
for (String key : counts2.keySet()) {
    if (counts2.get(key) == 1) {
        System.out.println(list.indexOf(key));
        break;
    }
}
The basic idea is to count each element's occurrences and store them in a Map. Once you have the counts of all elements, you can simply look for the first element with a count of 1.
You can use a HashMap. For example, you can put the word as the key and its index as the value. Once you find the same word again you can delete the key, and at the end the map contains the result.
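A minimal sketch of that delete-on-duplicate idea (the seen set is an addition; without it, a third occurrence would re-insert a word that was already removed):
Map<String, Integer> firstIndex = new LinkedHashMap<>(); // keeps insertion order
Set<String> seen = new HashSet<>();
List<String> strings = Arrays.asList("Hi", "Hi", "No");
for (int i = 0; i < strings.size(); i++) {
    String word = strings.get(i);
    if (seen.add(word)) {
        firstIndex.put(word, i);  // first time we see this word
    } else {
        firstIndex.remove(word);  // duplicate: drop it for good
    }
}
// firstIndex now maps each unique word to its position; prints 2 here
firstIndex.values().stream().findFirst().ifPresent(System.out::println);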
If there's only one word that's present only once, you can probably use a HashMap, or a HashSet plus a Deque (the set for values, the Deque for indices), to do this in linear time. A sort can give you the same in n log(n): slower than linear, but a lot faster than your solution, because after sorting it's easy to find in linear time which element is present only once, as all duplicates will be next to each other in the array.
For example for a linear solution in pseudo-code (pseudo-Kotlin!):
counters = HashMap()
for (i, word in words.withIndex()) {
    counters.merge(word, Counter(i, 1), (oldVal, newVal) -> Counter(oldVal.firstIndex, oldVal.count + newVal.count));
}
for (counter in counters.entrySet()) {
    if (counter.count == 1) return counter.firstIndex;
}
class Counter(firstIndex, count)
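In real Java, the same single-pass idea might look like the following sketch (FirstUnique and Counter are illustrative names mirroring the pseudocode, not an existing API):
import java.util.*;

public class FirstUnique {
    // Immutable pair: first index of a word plus how often it was seen
    record Counter(int firstIndex, int count) {}

    static int firstUniqueIndex(List<String> words) {
        Map<String, Counter> counters = new LinkedHashMap<>();
        for (int i = 0; i < words.size(); i++) {
            counters.merge(words.get(i), new Counter(i, 1),
                    (oldVal, newVal) -> new Counter(oldVal.firstIndex(), oldVal.count() + newVal.count()));
        }
        for (Counter c : counters.values()) {
            if (c.count() == 1) return c.firstIndex();
        }
        return -1; // no unique word
    }

    public static void main(String[] args) {
        System.out.println(firstUniqueIndex(Arrays.asList("Hi", "Hi", "No"))); // prints 2
    }
}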
Map<String,Boolean> + loops
Instead of using Map<String,Integer> as suggested in other answers, you can maintain a HashMap (if you need to maintain the order, use LinkedHashMap instead) of type Map<String,Boolean>, where the value denotes whether an element is unique.
The simplest way to populate the map is the method put() in conjunction with a containsKey() check.
But there are also more concise options, like replace() + putIfAbsent(). putIfAbsent() creates a new entry only if the key is not present in the map, so we can associate such a string with a value of true (considered unique so far). On the other hand, replace() updates only an existing entry (otherwise the map is not affected), and if the entry exists, the key is proved to be a duplicate, so it has to be associated with a value of false (non-unique).
And since Java 8 we also have the method merge(), which expects three arguments: a key, a value, and a function which is used when the given key already exists, to resolve the old value and the new one.
The last step is to generate the list of unique strings by iterating over the entry set of the newly created map. We need every key having a value of true (i.e. unique) associated with it.
List<String> strings = // initializing the list
Map<String, Boolean> isUnique = new HashMap<>(); // or LinkedHashMap if you need to preserve the initial order of strings
for (String next : strings) {
    isUnique.replace(next, false);
    isUnique.putIfAbsent(next, true);
    // isUnique.merge(next, true, (oldV, newV) -> false); // does the same as the two lines above
}
List<String> unique = new ArrayList<>();
for (Map.Entry<String, Boolean> entry : isUnique.entrySet()) {
    if (entry.getValue()) unique.add(entry.getKey());
}
Stream-based solution
With streams, it can be done using collector toMap(). The overall logic remains the same.
List<String> unique = strings.stream()
.collect(Collectors.toMap( // creating intermediate map Map<String, Boolean>
Function.identity(), // key
key -> true, // value
(oldV, newV) -> false, // resolving duplicates
LinkedHashMap::new // Map implementation, if order is not important - discard this argument
))
.entrySet().stream()
.filter(Map.Entry::getValue)
.map(Map.Entry::getKey)
.toList(); // for Java 16+ or collect(Collectors.toList()) for earlier versions

Check if list of integers contains two groups of different repeated numbers

How can I check, using a Java stream, whether a list of integers contains two groups of different repeated numbers? A number must be repeated no more than two times.
Example: the list 23243.
Answer: true, because of 22 and 33.
Example 2: the list 23245.
Answer: none.
Example 3: the list 23232.
Answer: none, because 2 is repeated three times.
One more question: how can I return not just an anyMatch, but the biggest of the repeated numbers?
listOfNumbers.stream().anyMatch(e -> Collections.frequency(listOfNumbers, e) == 2)
This will tell you if the list meets your requirements.
stream the list of digits.
do a frequency count.
stream the resultant counts
filter out those not equal to a count of 2.
and count how many of those there are.
Returns true if final count == 2, false otherwise.
List<Integer> list = List.of(2, 2, 3, 3, 3, 4, 4);
boolean result = list.stream()
        .collect(Collectors.groupingBy(a -> a, Collectors.counting()))
        .values().stream()
        .filter(count -> count == 2)
        .limit(2)
        .count() >= 2; // fixed per OP's comment
The above prints true, since there are two groups of exactly two digits, namely the 2's and the 4's.
EDIT
First, I applied Holger's suggestion to short-circuit the count check.
To address your question about returning multiple values, I broke the process up into parts. The first is the normal frequency count that I did before; the next is gathering the requested information. I used a record to return the information (a class would also work). The max count for a particular number is housed in an AbstractMap.SimpleEntry.
List<Integer> list = List.of(2, 3, 3, 3, 4, 4, 3, 2, 3);
Results results = groupCheck(list);
System.out.println(results.check);
System.out.println(results.maxEntry);
Prints the following (getKey() and getValue() may be used to get the individual values; the first is the number, the second is the number of occurrences):
true
3=5
The method and record declaration
record Results(boolean check,
               AbstractMap.SimpleEntry<Integer, Long> maxEntry) {
}
Once the frequency count is computed, simply iterate over the entries, count the pairs, and compute maxEntry by comparing the existing maximum count to the iterated one, updating as required.
public static Results groupCheck(List<Integer> list) {
    Map<Integer, Long> map = list.stream().collect(
            Collectors.groupingBy(a -> a, Collectors.counting()));
    AbstractMap.SimpleEntry<Integer, Long> maxEntry =
            new AbstractMap.SimpleEntry<>(0, 0L);
    int count = 0;
    for (Entry<Integer, Long> e : map.entrySet()) {
        if (e.getValue() == 2) {
            count++;
        }
        maxEntry = e.getValue() > maxEntry.getValue()
                ? new AbstractMap.SimpleEntry<>(e) : maxEntry;
    }
    return new Results(count >= 2, maxEntry);
}
One could write a method which builds a TreeMap of the frequencies.
What happens here is that a frequency map is built first (by groupingBy(Function.identity(), Collectors.counting())), and then we must 'swap' the keys and values, because we want to use the frequencies as keys.
public static TreeMap<Long, List<Integer>> frequencies(List<Integer> list) {
    return list.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
            .entrySet().stream()
            .collect(Collectors.toMap(
                    e -> e.getValue(),
                    e -> List.of(e.getKey()),
                    (a, b) -> someMergeListsFunction(a, b), // placeholder for any function that concatenates two lists
                    TreeMap::new));
}
And then we can just use our method like this:
// We assume the input list is not empty
TreeMap<Long, List<Integer>> frequencies = frequencies(list);
var higher = frequencies.higherEntry(2L);
if (higher != null) {
    System.out.printf("There is a number which occurs more than twice: %s (occurs %s times)%n",
            higher.getValue().get(0), higher.getKey());
}
else {
    List<Integer> occurTwice = frequencies.lastEntry().getValue();
    if (occurTwice.size() < 2) {
        System.out.println("Only " + occurTwice.get(0) + " occurs twice...");
    }
    else {
        System.out.println(occurTwice);
    }
}
A TreeMap is a Map whose keys are sorted by some comparator, or by their natural order if none is given. The TreeMap class contains methods to search for certain keys. For example, the higherEntry method returns the first entry whose key is strictly higher than the given key. With this method, you can easily check whether a key higher than 2 exists, since one of the requirements is that none of the numbers may occur more than twice.
The above code checks whether there is a number occurring more than twice; that is the case when higherEntry(2L) returns a non-null value. Otherwise, lastEntry() holds the highest occurrence count, and with getValue() you can retrieve the list of numbers occurring that often.
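To see higherEntry and lastEntry in isolation, here is a tiny illustrative example (the values are arbitrary):
TreeMap<Long, List<Integer>> freqs = new TreeMap<>();
freqs.put(1L, List.of(5));     // 5 occurs once
freqs.put(2L, List.of(2, 4));  // 2 and 4 occur twice
System.out.println(freqs.higherEntry(2L)); // null: nothing occurs more than twice
System.out.println(freqs.lastEntry());     // 2=[2, 4]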

getting map key via value

I have this kind of data structure:
Map<Integer, Integer> groupMap= new LinkedHashMap<>();
groupMap.put(10, 1);
groupMap.put(11, 0);
groupMap.put(14, 1);
groupMap.put(13, 0);
groupMap.put(12, 0);
groupMap.put(15, 1);
What would be the best way to find the next key with value 1, given some current key?
E.g. given key 14, I need to find key 15, which is the next key with value 1.
The least looping would be helpful.
My approach:
List<Integer> keys = new ArrayList<>(groupMap.keySet()); // copy of the key set (addAll, not putAll)
// getting the index of the current key I have
int index = keys.indexOf(14);
if (index == keys.size() - 1) return -1;
for (int i = index + 1; i < keys.size(); i++) {
    if (groupMap.get(keys.get(i)) == 1) return keys.get(i); // look up by key, not by loop index
}
I know it isn't a very good approach, but can you please suggest a better one?
This completely defeats the purpose of a key-value map. But if it's really what you want, I suppose you could do the following (the map is passed in here, rather than created empty inside the method as in the original):
public static int getNextKeyByValue(Map<Integer, Integer> groupMap, int value, int previousKey) {
    Iterator<Map.Entry<Integer, Integer>> iterator = groupMap.entrySet().iterator();
    while (iterator.hasNext()) {
        Map.Entry<Integer, Integer> entry = iterator.next();
        if (entry.getValue() == value && entry.getKey() != previousKey) {
            return entry.getKey();
        }
    }
    return -1;
}
From the topic which #Titus mentioned in the comments, the most elegant and shortest solution is to use a stream:
int getFirstCorrectValueBiggerThan(int lastValue) {
    return groupMap.entrySet().stream()
            .filter(entry -> Objects.equals(entry.getValue(), 1))
            .map(Map.Entry::getKey)
            .filter(value -> value > lastValue)
            .findFirst()
            .orElse(-1); // findFirst() returns an Optional, so unwrap it with a default
}
edit:
Sorry for the mistake; the code provided above does not solve your problem, since it compares keys, not indexes. Here is a proper version, however it is not so elegant anymore.
ArrayList<Integer> filteredList = groupMap.entrySet().stream()
.filter(entry -> entry.getValue().equals(1))
.map(Map.Entry::getKey)
.collect(Collectors.toCollection(ArrayList::new));
int nextCorrectElement = filteredList.get(filteredList.indexOf(14) + 1);
update
As far as I understand what is written in this tutorial about maps:
When a user calls put(K key, V value) or get(Object key), the function computes the index of the bucket in which the Entry should be. Then, the function iterates through the list to look for the Entry that has the same key (using the equals() function of the key).
and check out this topic about hash map complexity.
O(1) certainly isn't guaranteed - but it's usually what you should assume when considering which algorithms and data structures to use.
On top of that, the key part of your solution, ArrayList::indexOf, is O(N): you have to iterate through the elements until you reach the one which meets the condition. More info is in this topic.
So effectively you are iterating through every element of your hashmap anyway. What is more, HashMap lookup (the get method) is not guaranteed to be O(1), so there is a chance that you will double your work.
I have made a simple performance test of the stream-based solution against the simple loop proposed in this topic. I think the loop will in fact be faster than a sequential stream in every case; if you really want that kind of performance gain, try writing it in C++. Otherwise, for more complex examples, a parallel stream may gain some advantage thanks to the higher abstraction level of the problem statement.
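For reference, a rough harness for that kind of comparison might look like the sketch below (plain System.nanoTime timing, no JMH-style warmup, so treat any numbers as indicative only):
// Crude loop-vs-stream comparison on a large LinkedHashMap
Map<Integer, Integer> groupMap = new LinkedHashMap<>();
for (int i = 0; i < 1_000_000; i++) {
    groupMap.put(i, i % 2);
}

long t0 = System.nanoTime();
int loopHits = 0;
for (Map.Entry<Integer, Integer> e : groupMap.entrySet()) {
    if (e.getValue() == 1) loopHits++;
}
long t1 = System.nanoTime();
long streamHits = groupMap.entrySet().stream()
        .filter(e -> e.getValue() == 1)
        .count();
long t2 = System.nanoTime();

System.out.printf("loop: %d ms (%d hits), stream: %d ms (%d hits)%n",
        (t1 - t0) / 1_000_000, loopHits, (t2 - t1) / 1_000_000, streamHits);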
Your question is not entirely clear to me. If you are looking for all the tuples with a value equal to 1, you could follow the approach below:
for (Entry<Integer, Integer> entry : groupMap.entrySet()) {
    if (entry.getValue() == 1) {
        System.out.println("The key is: " + entry.getKey().toString());
    }
}

Increase speed of composition from list and map

I use a Dico class to store the weight of a term and the id of the document where it appears:
public class Dico
{
    private String m_term;    // term
    private double m_weight;  // weight of term
    private int m_Id_doc;     // id of doc that contains term

    public Dico(int Id_Doc, String Term, double tf_ief)
    {
        this.m_Id_doc = Id_Doc;
        this.m_term = Term;
        this.m_weight = tf_ief;
    }

    public String getTerm()
    {
        return this.m_term;
    }

    public double getWeight()
    {
        return this.m_weight;
    }

    public void setWeight(double weight)
    {
        this.m_weight = weight;
    }

    public int getDocId()
    {
        return this.m_Id_doc;
    }
}
And I use this method to calculate the final weight from a Map<String,Double> and a List<Dico>:
public List<Dico> merge_list_map(List<Dico> list, Map<String, Double> map)
{
    // in map each term is unique, but in list I have redundancy
    List<Dico> list_term_weight = new ArrayList<>();
    for (Map.Entry<String, Double> entrySet : map.entrySet())
    {
        String key = entrySet.getKey();
        Double value = entrySet.getValue();
        for (Dico dic : list)
        {
            String term = dic.getTerm();
            double weight = dic.getWeight();
            if (key.equals(term))
            {
                double new_weight = weight * value;
                list_term_weight.add(new Dico(dic.getDocId(), term, new_weight));
            }
        }
    }
    return list_term_weight;
}
I have 36,736 elements in the map and 1,053,914 in the list; currently this program takes a lot of time to run: BUILD SUCCESSFUL (total time: 17 minutes 15 seconds).
How can I get only the terms from the list that equal a term from the map?
You can use the lookup functionality of the Map, i.e. Map.get(), given that your map maps terms to weights. This should give a significant performance improvement. The only difference is that the output list is in the order of the input list, rather than the order in which the keys occur in the weighting Map.
public List<Dico> merge_list_map(List<Dico> list, Map<String, Double> map)
{
    // in map each term is unique, but in list I have redundancy
    List<Dico> list_term_weight = new ArrayList<>();
    for (Dico dic : list)
    {
        String term = dic.getTerm();
        double weight = dic.getWeight();
        Double value = map.get(term); // <== fetch weight from Map
        if (value != null)
        {
            double new_weight = weight * value;
            list_term_weight.add(new Dico(dic.getDocId(), term, new_weight));
        }
    }
    return list_term_weight;
}
Basic test
List<Dico> list = Arrays.asList(new Dico(1, "foo", 1), new Dico(2, "bar", 2), new Dico(3, "baz", 3));
Map<String, Double> weights = new HashMap<String, Double>();
weights.put("foo", 2d);
weights.put("bar", 3d);
System.out.println(merge_list_map(list, weights));
Output
[Dico [m_term=foo, m_weight=2.0, m_Id_doc=1], Dico [m_term=bar, m_weight=6.0, m_Id_doc=2]]
Timing test - 10,000 elements
List<Dico> list = new ArrayList<>();
Map<String, Double> weights = new HashMap<>();
for (int i = 0; i < 1e4; i++) {
    list.add(new Dico(i, "foo-" + i, i));
    if (i % 3 == 0) {
        weights.put("foo-" + i, (double) i); // <== every 3rd has a weight
    }
}
long t0 = System.currentTimeMillis();
List<Dico> result1 = merge_list_map_original(list, weights);
long t1 = System.currentTimeMillis();
List<Dico> result2 = merge_list_map_fast(list, weights);
long t2 = System.currentTimeMillis();
System.out.println(String.format("Original: %d ms", t1 - t0));
System.out.println(String.format("Fast: %d ms", t2 - t1));
// prove the results are equivalent, just in a different order
// requires the Dico class to have hashCode()/equals() - used the Eclipse default generator
System.out.println(new HashSet<>(result1).equals(new HashSet<>(result2)));
Output
Original: 1005 ms
Fast: 16 ms <=== loads quicker
true
Also, check the initialization of the Map (see the HashMap documentation: http://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html). Rehashing the map is costly for performance.
As a general rule, the default load factor (.75) offers a good
tradeoff between time and space costs. Higher values decrease the
space overhead but increase the lookup cost (reflected in most of the
operations of the HashMap class, including get and put). The expected
number of entries in the map and its load factor should be taken into
account when setting its initial capacity, so as to minimize the
number of rehash operations. If the initial capacity is greater than
the maximum number of entries divided by the load factor, no rehash
operations will ever occur.
If many mappings are to be stored in a HashMap instance, creating it
with a sufficiently large capacity will allow the mappings to be
stored more efficiently than letting it perform automatic rehashing as
needed to grow the table.
If you know, or have an approximation of, the number of elements that you will put in the map, you can create your Map like this:
Map<String, Double> foo = new HashMap<String, Double>(maxSize * 2);
In my experience, you can increase your performance by a factor of 2 or more.
In order for the merge_list_map function to be efficient, you need to actually use the Map for what it is: an efficient data structure for key lookup.
As you are doing it, looping over the Map entries and looking for a match in the List, the algorithm is O(N*M), where M is the size of the map and N is the size of the list. That is certainly the worst you can get.
If you loop first through the List and then, for each term, do a lookup in the Map with Map.get(String key), you will get a time complexity of O(N), since a map lookup can be considered O(1).
In terms of design, if you can use Java 8, your problem can be translated into streams:
public static List<Dico> merge_list_map(List<Dico> dico, Map<String, Double> weights) {
    List<Dico> wDico = dico.stream()
            .filter(d -> weights.containsKey(d.getTerm()))
            .map(d -> new Dico(d.getDocId(), d.getTerm(), d.getWeight() * weights.get(d.getTerm()))) // doc id added to match the Dico constructor above
            .collect(Collectors.toList());
    return wDico;
}
The new weighted list is built following a logical process:
stream(): take the list as a stream of Dico elements
filter(): keep only the Dico elements whose term is in the weights map
map(): for each filtered element, create a new Dico() instance with the computed weight.
collect(): collect all the new instances in a new list
return the new list that contains the filtered Dico with the new weight.
Performance-wise, I tested it against some text, The Narrative of Arthur Gordon Pym by E. A. Poe:
String text = null;
try (InputStream url = new URL("http://www.gutenberg.org/files/2149/2149-h/2149-h.htm").openStream()) {
text = new Scanner(url, "UTF-8").useDelimiter("\\A").next();
}
String[] words = text.split("[\\p{Punct}\\s]+");
System.out.println(words.length); // => 108028
Since there are only ~100k words in the book, for good measure, just multiply by 10 (initDico() is a helper to build the List<Dico> from the words):
List<Dico> dico = initDico(words);
List<Dico> bigDico = new ArrayList<>(10*dico.size());
for (int i = 0; i < 10; i++) {
    bigDico.addAll(dico);
}
System.out.println(bigDico.size()); // 1080280
Build the weights map, using all words (initWeights() builds a frequency map of the words in the book):
Map<String, Double> weights = initWeights(words);
System.out.println(weights.size()); // 9449 distinct words
Then the test of merging the 1M words against the map of weights:
long start = System.currentTimeMillis();
List<Dico> wDico = merge_list_map(bigDico, weights);
long end = System.currentTimeMillis();
System.out.println("===== Elapsed time (ms): "+(end-start));
// => 105 ms
The weights map is significantly smaller than yours, but that should not impact the timing, since the lookup operations run in quasi-constant time.
This is no serious benchmark for the function, but it already shows that merge_list_map() should score less than 1 s (loading and building the list and map are not part of the function).
Just to complete the exercise, following are the initialisation methods used in the test above:
private static List<Dico> initDico(String[] terms) {
    List<Dico> dico = Arrays.stream(terms)
            .map(String::toLowerCase)
            .map(s -> new Dico(0, s, 1.0)) // dummy doc id to match the Dico constructor
            .collect(Collectors.toList());
    return dico;
}
// weight of a word is the frequency*1000
private static Map<String, Double> initWeights(String[] terms) {
    Map<String, Long> wfreq = termFreq(terms);
    long total = wfreq.values().stream().reduce(0L, Long::sum);
    return wfreq.entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey, e -> (double) (1000.0 * e.getValue() / total)));
}

private static Map<String, Long> termFreq(String[] terms) {
    Map<String, Long> wfreq = Arrays.stream(terms)
            .map(String::toLowerCase)
            .collect(groupingBy(Function.identity(), counting()));
    return wfreq;
}
You could use the method contains() of List; this way you avoid the second for loop. Even though contains() has O(n) complexity, you should see a small improvement. Of course, remember to re-implement equals(); otherwise you should use a second Map, as others suggested.
Use the lookup functionality of the Map, as Adam pointed out, and use HashMap as the implementation of Map; HashMap lookup complexity is O(1). This should result in increased performance.

Representing binary relation in java

One famous programmer said, "Why does anybody need a DB? Just give me a hash table!" I have a list of grammar symbols together with their frequencies. One way, it's a map: symbol# -> frequency. The other way, it's a [binary] relation. Problem: get the top 5 symbols by frequency.
More general question: I'm aware of [binary] relation algebra slowly making inroads into CS theory. Is there a Java library supporting relations?
List<Entry<String, Integer>> myList = new ArrayList<>();
for (Entry<String, Integer> e : myMap.entrySet())
    myList.add(e);
Collections.sort(myList, new Comparator<Entry<String, Integer>>() {
    public int compare(Entry<String, Integer> a, Entry<String, Integer> b) {
        // compare b to a to get reverse order
        return b.getValue().compareTo(a.getValue());
    }
});
List<Entry<String, Integer>> top5 = myList.subList(0, 5);
More efficient:
TreeSet<Entry<String, Integer>> myTree = new TreeSet<>(
        new Comparator<Entry<String, Integer>>() {
            public int compare(Entry<String, Integer> a, Entry<String, Integer> b) {
                // compare b to a to get reverse order
                return b.getValue().compareTo(a.getValue());
            }
        });
for (Entry<String, Integer> e : myMap.entrySet())
    myTree.add(e); // was myList.add(e); beware that entries with equal values collapse under this comparator
List<Entry<String, Integer>> top5 = new ArrayList<>();
int i = 0;
for (Entry<String, Integer> e : myTree) {
    top5.add(e);
    if (i++ == 4) break;
}
With TreeSet it should be easy:
int i = 0;
for (Symbol s : symbolTree.descendingSet()) {
    i++;
    if (i > 5) break; // or probably return
    whatever(s);
}
Here is a general algorithm, assuming you already have a completed symbol HashTable.
Make 2 arrays:
freq[5] // holds the frequency counts of the 5 most frequent symbols seen so far
word[5] // holds the words that correspond to the entries of the array above
Use an iterator to traverse your HashTable or Map:
Compare the current symbol's frequency against the ones in freq[5] in sequential order.
If the current symbol has a higher frequency than an entry in the arrays above, shift that entry and all entries below it down one position (i.e. the entry in the 5th position gets kicked out).
Add the current symbol/frequency pair to the newly vacated position.
Otherwise, ignore it. (A sketch of this follows after the analysis below.)
Analysis:
You make at most 5 comparisons (constant time) against the arrays with each symbol seen in the HashTable, so this is O(n)
Each time you have to shift the entries in the array down, it is also constant time. Assuming you do a shift every time, this is still O(n)
Space: O(1) to store the arrays
Runtime: O(n) to iterate through all the symbols
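A compact sketch of that array-based selection (class and method names are illustrative):
import java.util.*;

public class TopFive {
    // Up to 5 symbols with the highest frequencies, using the shift-down
    // scheme described above: O(n) time, O(1) extra space.
    static String[] topFive(Map<String, Integer> frequencies) {
        int[] freq = new int[5];
        String[] word = new String[5];
        for (Map.Entry<String, Integer> e : frequencies.entrySet()) {
            int f = e.getValue();
            for (int i = 0; i < 5; i++) {
                if (f > freq[i]) {
                    // shift everything from position i down one slot;
                    // the entry in the 5th position falls out
                    for (int j = 4; j > i; j--) {
                        freq[j] = freq[j - 1];
                        word[j] = word[j - 1];
                    }
                    freq[i] = f;
                    word[i] = e.getKey();
                    break;
                }
            }
        }
        return word;
    }

    public static void main(String[] args) {
        Map<String, Integer> m = Map.of("a", 7, "b", 3, "c", 9, "d", 1, "e", 5, "f", 4);
        System.out.println(Arrays.toString(topFive(m))); // [c, a, e, f, b]
    }
}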
