How to get the second most repeated word in a String using Maps - Java

I am trying to get the second most repeated word in a sentence.
e.g.:
String paraString = "This is a paragraph with multiple strings. Get the second most repeated word from the paragraph text and print the words with count";
Here 'the' is repeated three times, and 'paragraph' and 'with' are repeated twice.
I need to print the second most repeated words, 'paragraph' and 'with'.
Here is the method I wrote to get the most repeated words.
public Set<String> getMostRepeatedWords(Map<String, Integer> sortedMap) {
    Set<String> mostRepeatedWords = new HashSet<String>();
    int mostrepeatedWord = Collections.max(sortedMap.values());
    for (Map.Entry<String, Integer> entry : sortedMap.entrySet()) {
        if (mostrepeatedWord == entry.getValue()) {
            mostRepeatedWords.add(entry.getKey());
        }
    }
    return mostRepeatedWords;
}
Please help me out.
One option I have is below (it assumes the second-highest count is always exactly one less than the maximum). Let me know if there are any other ways.
int mostrepeatedWord = Collections.max(sortedMap.values())-1;

Here is an example of what you could do with Java 8:
public List<String> getMostRepeatedWords(String s) {
    Map<String, Integer> map = new HashMap<>();
    String[] words = s.split("\\s+");
    for (String word : words)
        map.put(word, map.containsKey(word) ? map.get(word) + 1 : 1);
    List<Entry<String, Integer>> tmp = new ArrayList<>(map.entrySet());
    Collections.sort(tmp, (e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()));
    return tmp.stream().map(e -> e.getKey()).collect(Collectors.toList());
}
This method computes the complete list of the words sorted by decreasing number of occurrences. If you don't need the whole list, you should rather store the entries of the map in an array and then apply a quickselect on it, with a custom Comparator. Let me know if you are interested and I'll go into further detail.
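If all you actually need is the second most repeated words, a full sort is not required at all. Here is a minimal sketch (my addition, not part of the answer above) that scans the frequency map once for the two highest distinct counts; it assumes map is the Map<String, Integer> of word frequencies built above:
int max = Integer.MIN_VALUE;
int second = Integer.MIN_VALUE;
for (int c : map.values()) {
    if (c > max) {
        second = max;
        max = c;
    } else if (c > second && c < max) {
        second = c;
    }
}
// collect every word whose count equals the second-highest distinct count
List<String> secondMost = new ArrayList<>();
for (Map.Entry<String, Integer> e : map.entrySet()) {
    if (e.getValue() == second) {
        secondMost.add(e.getKey());
    }
}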

Following your solution
So you have getMostRepeatedWords and now want the second most repeated words.
In pseudo-code this would be:
Map<String, Integer> sortedMap = ...;
SortedMap<String, Integer> rest = new TreeMap<>(sortedMap);
rest.keySet().removeAll(getMostRepeatedWords(sortedMap));
Set<String> secondMostRepeatedWords = getMostRepeatedWords(rest);
Remove the most repeated words and then on the rest get the most repeated words.
More effort
You could also copy the values, sort them in decreasing order, and then take the second-highest value: the first value at index > 0 that is less than the first element.
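A minimal sketch of that approach (my wording; it assumes sortedMap is the non-empty word-frequency Map<String, Integer> from the question):
List<Integer> values = new ArrayList<>(sortedMap.values());
values.sort(Collections.reverseOrder());
int secondHighest = -1;
for (int v : values) {
    if (v < values.get(0)) {   // first value strictly below the maximum
        secondHighest = v;
        break;
    }
}
Set<String> secondMostRepeatedWords = new HashSet<>();
for (Map.Entry<String, Integer> entry : sortedMap.entrySet()) {
    if (entry.getValue() == secondHighest) {
        secondMostRepeatedWords.add(entry.getKey());
    }
}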

Related

Effective way of comparing list elements in Java

Is there any effective way of comparing elements in Java and printing out the position of the element which occurs only once?
For example: if I have a list: ["Hi", "Hi", "No"], I want to print out 2 because "No" is in position 2. I have solved this using the following algorithm and it works, BUT the problem is that if I have a large list it takes too much time to compare the entire list to print out the first position of the unique word.
ArrayList<String> strings = new ArrayList<>(); // populated elsewhere
for (int i = 0; i < strings.size(); i++) {
    int oc = Collections.frequency(strings, strings.get(i));
    if (oc == 1) {
        System.out.print(i);
        break;
    }
}
I can think of counting each element's occurrences and then filtering out the first element that occurs only once, though I'm not sure how large your list is.
Using Stream:
List<String> list = Arrays.asList("Hi", "Hi", "No");
// iterate through the list and store each element and its number of occurrences in a Map
Map<String, Long> counts = list.stream()
        .collect(Collectors.groupingBy(Function.identity(), LinkedHashMap::new, Collectors.counting()));
String value = counts.entrySet().stream()
        .filter(e -> e.getValue() == 1) // keep only the elements that occur exactly once
        .map(Map.Entry::getKey)         // stream of the keys of those single-occurrence elements
        .findFirst()                    // take the first such element
        .get();
System.out.println(list.indexOf(value));
EDIT:
A simplified version can be
Map<String, Long> counts2 = new LinkedHashMap<String, Long>();
for (String val : list) {
    long count = counts2.getOrDefault(val, 0L);
    counts2.put(val, ++count);
}
for (String key : counts2.keySet()) {
    if (counts2.get(key) == 1) {
        System.out.println(list.indexOf(key));
        break;
    }
}
The basic idea is to count each element's occurrences and store them in a Map. Once you have the counts of all elements, you can simply look for the first element whose count is 1.
You can use a HashMap. For example, you can put the word as key and its index as value. Once you find the same word again you can delete the key, and at the end the map contains the result.
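Here is a minimal sketch of that idea (the extra duplicates set is my addition so that a third occurrence of a word does not re-enter the map):
List<String> list = Arrays.asList("Hi", "Hi", "No");
Map<String, Integer> firstIndex = new LinkedHashMap<>(); // word -> first position
Set<String> duplicates = new HashSet<>();
for (int i = 0; i < list.size(); i++) {
    String w = list.get(i);
    if (duplicates.contains(w)) {
        continue;                 // already known to repeat
    }
    if (firstIndex.containsKey(w)) {
        firstIndex.remove(w);     // second occurrence: drop it
        duplicates.add(w);
    } else {
        firstIndex.put(w, i);
    }
}
// firstIndex now holds only the unique words; its first entry gives the answer (2 for "No").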
If there's only one word that's present only once, you can probably use a HashMap or HashSet + Deque (set for values, Deque for indices) to do this in linear time. A sort can give you the same in n log(n), so slower than linear but a lot faster than your solution. By sorting, it's easy to find in linear time (after the sort) which element is present only once because all duplicates will be next to each other in the array.
For example for a linear solution in pseudo-code (pseudo-Kotlin!):
counters = HashMap()
for ((i, word) in words.withIndex()) {
    counters.merge(word, Counter(i, 1), (oldVal, newVal) -> Counter(oldVal.firstIndex, oldVal.count + newVal.count));
}
for (counter in counters.values()) {
    if (counter.count == 1) return counter.firstIndex;
}
class Counter(firstIndex, count)
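A possible Java translation of that pseudo-code (a sketch of mine; names are illustrative and the usual java.util imports are assumed):
static int firstUniqueIndex(List<String> words) {
    class Counter {
        final int firstIndex;
        int count;
        Counter(int firstIndex, int count) { this.firstIndex = firstIndex; this.count = count; }
    }
    // LinkedHashMap keeps first-occurrence order, so the first count == 1 entry is the answer
    Map<String, Counter> counters = new LinkedHashMap<>();
    for (int i = 0; i < words.size(); i++) {
        counters.merge(words.get(i), new Counter(i, 1),
                (oldVal, newVal) -> { oldVal.count += newVal.count; return oldVal; });
    }
    for (Counter c : counters.values()) {
        if (c.count == 1) return c.firstIndex;
    }
    return -1; // no unique word
}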
Map<String,Boolean> + loops
Instead of using Map<String,Integer> as suggested in other answers, you can maintain a HashMap (or a LinkedHashMap if you need to preserve the order) of type Map<String,Boolean>, where the value denotes whether an element is unique or not.
The simplest way to generate the map is the method put() in conjunction with a containsKey() check.
But there are also more concise options like replace() + putIfAbsent(). putIfAbsent() creates a new entry only if the key is not present in the map, therefore we can associate such a string with a value of true (considered unique). On the other hand, replace() updates only an existing entry (otherwise the map is not affected), and if the entry exists, the key is proved to be a duplicate, so it has to be associated with a value of false (non-unique).
And since Java 8 we also have the method merge(), which expects three arguments: a key, a value, and a function that is used when the given key already exists to resolve the old value and the new one.
The last step is to generate the list of unique strings by iterating over the entry set of the newly created map. We need every key that has a value of true (i.e. is unique) associated with it.
List<String> strings = // initializing the list
Map<String, Boolean> isUnique = new HashMap<>(); // or LinkedHashMap if you need to preserve the initial order of strings
for (String next : strings) {
    isUnique.replace(next, false);
    isUnique.putIfAbsent(next, true);
    // isUnique.merge(next, true, (oldV, newV) -> false); // does the same as the two lines above
}
List<String> unique = new ArrayList<>();
for (Map.Entry<String, Boolean> entry : isUnique.entrySet()) {
    if (entry.getValue()) unique.add(entry.getKey());
}
Stream-based solution
With streams, it can be done using collector toMap(). The overall logic remains the same.
List<String> unique = strings.stream()
    .collect(Collectors.toMap(    // creating intermediate Map<String, Boolean>
        Function.identity(),      // key
        key -> true,              // value
        (oldV, newV) -> false,    // resolving duplicates
        LinkedHashMap::new        // Map implementation; if order is not important, discard this argument
    ))
    .entrySet().stream()
    .filter(Map.Entry::getValue)
    .map(Map.Entry::getKey)
    .toList(); // Java 16+; use collect(Collectors.toList()) on earlier versions

Problem sorting ConcurrentHashMap by values using java.util.Collections.sort() in Java

I have this code, which prints a list of words sorted by keys (alphabetically) from counts, my ConcurrentHashMap that stores words as keys and their frequencies as values.
// Method to create a stopword list with the most frequent words from the lemmas key in the json file
private static List<String> StopWordsFile(ConcurrentHashMap<String, String> lemmas) {
    // counts stores each word and its frequency
    ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<String, Integer>();
    // corpus is an array list for all the individual words
    ArrayList<String> corpus = new ArrayList<String>();
    for (Entry<String, String> entry : lemmas.entrySet()) {
        String line = entry.getValue().toLowerCase();
        line = line.replaceAll("\\p{Punct}", " ");
        line = line.replaceAll("\\d+", " ");
        line = line.replaceAll("\\s+", " ");
        line = line.trim();
        String[] value = line.split(" ");
        List<String> words = new ArrayList<String>(Arrays.asList(value));
        corpus.addAll(words);
    }
    // count all the words in the corpus and store each word with its frequency in counts
    for (String word : corpus) {
        if (counts.keySet().contains(word)) {
            counts.put(word, counts.get(word) + 1);
        } else {
            counts.put(word, 1);
        }
    }
    // Create a list to store all the words with their frequency and sort it by values.
    List<Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());
    List<String> stopwordslist = new ArrayList<>(counts.keySet()); // this works, but counts.values() gives an error
    Collections.sort(stopwordslist);
    System.out.println("List after sorting: " + stopwordslist);
So the output is:
List after sorting: [a, abruptly, absent, abstractmap, accept,...]
How can I sort them by values as well? When I use
List stopwordslist = new ArrayList<>(counts.values());
I get an error,
- Cannot infer type arguments for ArrayList<>
I guess that is because an ArrayList can store <String> but not <String, Integer>, and it gets confused.
I have also tried to do it with a custom Comparator like so:
Comparator<Entry<String, Integer>> valueComparator = new Comparator<Entry<String, Integer>>() {
    @Override
    public int compare(Entry<String, Integer> e1, Entry<String, Integer> e2) {
        Integer v1 = e1.getValue();
        Integer v2 = e2.getValue();
        return v1.compareTo(v2);
    }
};
List<Entry<String, Integer>> stopwordslist = new ArrayList<Entry<String, Integer>>();
// sorting HashMap by values using comparator
Collections.sort(counts, valueComparator);
which gives me another error,
The method sort(List<T>, Comparator<? super T>) in the type Collections is not applicable for the arguments (ConcurrentHashMap<String,Integer>, Comparator<Map.Entry<String,Integer>>)
how can I sort my list by values?
my expected output is something like
[the, of, value, v, key, to, given, a, k, map, in, for, this, returns, if, is, super, null, specified, u, function, and, ...]
Let’s go through all the issues of your code
Name conventions. Method names should start with a lowercase letter.
Unnecessary use of ConcurrentHashMap. For a purely local use like within your method, an ordinary HashMap will do. For parameters, just use the Map interface, to allow the caller to use whatever Map implementation will fit.
Unnecessarily iterating over the entrySet(). When you’re only interested in the values, you don’t need to use entrySet() and call getValue() on every entry; you can iterate over values() in the first place. Likewise, you would use keySet() when you’re interested in the keys only. Only iterate over entrySet() when you need key and value (or want to perform updates).
Don’t replace pattern matches by spaces, to split by the spaces afterwards. Specify the (combined) pattern directly to split, i.e. line.split("[\\p{Punct}\\d\\s]+").
Don’t use List<String> words = new ArrayList<String>(Arrays.asList(value)); unless you specifically need the features of an ArrayList. Otherwise, just use List<String> words = Arrays.asList(value);
But when the only thing you’re doing with the list, is addAll to another collection, you can use Collections.addAll(corpus, value); without the List detour.
Don’t use counts.keySet().contains(word) as you can simply use counts.containsKey(word). But you can simplify the entire
if (counts.containsKey(word)) {
    counts.put(word, counts.get(word) + 1);
} else {
    counts.put(word, 1);
}
to
counts.merge(word, 1, Integer::sum);
The points above yield
ArrayList<String> corpus = new ArrayList<>();
for (String line : lemmas.values()) {
    String[] value = line.toLowerCase().trim().split("[\\p{Punct}\\d\\s]+");
    Collections.addAll(corpus, value);
}
for (String word : corpus) {
    counts.merge(word, 1, Integer::sum);
}
But there is no point in performing two loops, the first only to store everything into a potentially large list, to iterate over it a single time. You can perform the second loop’s operation right in the first (resp. only) loop and get rid of the list.
for (String line : lemmas.values()) {
    for (String word : line.toLowerCase().trim().split("[\\p{Punct}\\d\\s]+")) {
        counts.merge(word, 1, Integer::sum);
    }
}
You already acknowledged that you can’t sort a map, by copying the map into a list and sorting the list in your first variant. In the second variant, you created a List<Entry<String, Integer>> but then, you didn’t use it at all but rather tried to pass the map to sort. (By the way, since Java 8, you can invoke sort directly on a List, no need to call Collections.sort).
You have to keep copying the map data into a list and sorting the list. For example,
List<Map.Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());
list.sort(Map.Entry.comparingByValue());
Now, you have to decide whether you change the return type to List<Map.Entry<String, Integer>> or copy the keys of the sorted entries to a new list.
Taking all points together and staying with the original return type, the fixed code looks like
private static List<String> stopWordsFile(Map<String, String> lemmas) {
    Map<String, Integer> counts = new HashMap<>();
    for (String line : lemmas.values()) {
        for (String word : line.toLowerCase().trim().split("[\\p{Punct}\\d\\s]+")) {
            counts.merge(word, 1, Integer::sum);
        }
    }
    List<Map.Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());
    list.sort(Map.Entry.comparingByValue());
    List<String> stopwordslist = new ArrayList<>();
    for (Map.Entry<String, Integer> e : list) stopwordslist.add(e.getKey());
    // System.out.println("List after sorting: " + stopwordslist);
    return stopwordslist;
}
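If you prefer streams, the sort-and-copy of the keys could also be written like this (my variant, equivalent to the two steps above):
List<String> stopwordslist = counts.entrySet().stream()
        .sorted(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());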

Fast way to create a word-occurrence counting vector

I have a HashMap<String, Integer> vocabulary, containing words and their weight (not important, only the string is important here):
vocabulary = ["this movie"=5, "great"=2, "bad"=2, ...]
and a tokenized string as a list:
String str = "this movie is great";
List<String> tokens = tokenize(str) // tokens = ["this", "movie", "is", "great", "this movie", "is great", ...]
Now I need a fast way to create a vector for this tokenized string that counts, for every entry of the vocabulary, the number of occurrences of that word within the tokenized string:
HashMap<String, Integer> vec = new HashMap();
Iterator it = vocabulary.entrySet().iterator();
while (it.hasNext()) {
    Map.Entry pair = (Map.Entry) it.next();
    String word = (String) pair.getKey();
    int count = 0;
    for (String w : tokens) {
        if (w.equals(word)) {
            count += 1;
        }
    }
    vec.put(word, count);
}
So, vec should be ["this movie"=1, "great"=1, "bad"=0]
Is there a better performing way to do this? I'm having performance issues in a larger context and assumed that the issue must be here, since vocabulary has approximately 300'000 entries. A normal tokenized text contains around 100 words.
Is it a problem that vocabulary is a HashMap?
Count the number of occurrences of each element of tokens:
Map<String, Long> tokensCount = tokens.stream().collect(
Collectors.groupingBy(Function.identity(), Collectors.counting()));
Then just look up from this map instead of your inner loop:
count = tokensCount.getOrDefault(word, 0L).intValue();
This is faster because the lookup in the map is O(1), whereas iterating the tokens looking for equal elements is O(# tokens).
Also note that you aren't using pair other than to get its key, so you can iterate vocabulary.keySet(), rather than vocabulary.entrySet().
Additionally, if you weren't using a raw iterator, you wouldn't need the explicit casts:
Iterator<Map.Entry<String, Integer>> it = ...
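Putting those pieces together, the rewritten loop might look like this sketch (variable names taken from the question):
Map<String, Long> tokensCount = tokens.stream()
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
Map<String, Integer> vec = new HashMap<>();
for (String word : vocabulary.keySet()) {
    vec.put(word, tokensCount.getOrDefault(word, 0L).intValue());
}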
Edit, now that you've added the relative sizes of the two collections:
You can simply iterate tokens, and see if vocabulary contains that:
Map<String, Integer> vec = new HashMap<>();
for (String token : tokens) {
    if (vocabulary.containsKey(token)) {
        vec.merge(token, 1, (old, v) -> old + v);
    }
}
If vocabulary is already a HashMap, there is no need to iterate over it. Simply use the method containsKey which, in the case of a HashMap, runs in constant time (O(1)), so you only have to iterate over the token list.
for (String w : tokens) {
    if (vocabulary.containsKey(w)) {
        vec.put(w, vec.getOrDefault(w, 0) + 1);
    }
}

How to read strings off of .txt file and sort them into an ArrayList based on the number of occurrences?

I have a program that reads a .txt file, creates a HashMap containing each unique string and its number of occurrences, and I would like to create an ArrayList that displays these unique strings in descending order in terms of their number of appearances.
Currently, my program sorts in descending order from an alphabetical standpoint (using ASCII values I assume).
How can I sort this in descending order in terms of their number of appearances?
Here's the relevant part of the code:
Scanner in = new Scanner(new File("C:/Users/ahz9187/Desktop/counter.txt"));
while (in.hasNext()) {
    String string = in.next();
    // makes sure unique strings are not repeated - adds a new unit if new, updates the count if repeated
    if (map.containsKey(string)) {
        Integer count = (Integer) map.get(string);
        map.put(string, new Integer(count.intValue() + 1));
    } else {
        map.put(string, new Integer(1));
    }
}
System.out.println(map);

// places units of map into an arrayList which is then sorted
// Using ArrayList because length does not need to be designated - can take in the units of HashMap 'map' regardless of length
ArrayList arraylist = new ArrayList(map.keySet());
Collections.sort(arraylist); // this method sorts in ascending order

// Outputs the list in reverse alphabetical (or descending) order, case sensitive
for (int i = arraylist.size() - 1; i >= 0; i--) {
    String key = (String) arraylist.get(i);
    Integer count = (Integer) map.get(key);
    System.out.println(key + " --> " + count);
}
In Java 8:
public static void main(final String[] args) throws IOException {
    final Path path = Paths.get("C:", "Users", "ahz9187", "Desktop", "counter.txt");
    try (final Stream<String> lines = Files.lines(path)) {
        final Map<String, Integer> count = lines.
                collect(HashMap::new, (m, v) -> m.merge(v, 1, Integer::sum), Map::putAll);
        final List<String> ordered = count.entrySet().stream().
                sorted((l, r) -> Integer.compare(r.getValue(), l.getValue())). // descending by count
                map(Entry::getKey).
                collect(Collectors.toList());
        ordered.forEach(System.out::println);
    }
}
First read the file using the Files.lines method, which gives you a Stream<String> of the lines.
Now collect the lines into a Map<String, Integer> using the Map.merge method which takes a key and a value and also a lambda that is applied to the old value and the new value if the key is already present.
You now have your counts.
Now take a Stream of the entrySet of the Map and sort that by the value of each Entry and then take the key. Collect that to a List. You now have a List of your values sorted by count.
Now simply use forEach to print them.
If still using Java 7 you can use the Map to provide the sort order:
final Map<String, Integer> counts = /* from somewhere */;
final List<String> sorted = new ArrayList<>(counts.keySet());
Collections.sort(sorted, new Comparator<String>() {
    @Override
    public int compare(final String o1, final String o2) {
        return counts.get(o2).compareTo(counts.get(o1)); // descending by count
    }
});
You haven't shown the declaration of your map, but for the purpose of this answer I'm assuming that your map is declared like this:
Map<String,Integer> map = new HashMap<String,Integer>();
You need to use a Comparator in the call to sort, but it needs to compare by the count, while remembering the string. So you need to put objects in the list that have both the string and the count.
One type that provides this capability, and that is easily available from the Map.entrySet method, is the type Map.Entry.
The last part rewritten with Map.Entry and a Comparator:
ArrayList<Map.Entry<String, Integer>> arraylist = new ArrayList<Map.Entry<String, Integer>>(map.entrySet());
Collections.sort(arraylist, new Comparator<Map.Entry<String, Integer>>() {
    @Override
    public int compare(Entry<String, Integer> e1, Entry<String, Integer> e2) {
        // Compares by count in descending order
        return e2.getValue().compareTo(e1.getValue());
    }
});
// Outputs the entries in descending order of count
for (Map.Entry<String, Integer> entry : arraylist) {
    System.out.println(entry.getKey() + " --> " + entry.getValue());
}

Using HashMap for getting repeating occurrences

I have a HashMap which is populated with String and Integer:
Map<String, Integer> from_table;
from_table = new HashMap<String, Integer>();
Next I want to get all the keys whose value (the Integer) is above x.
For example all the keys which their value is over 4.
Is there a fast method for doing that?
Thanks!
public static void printMap(Map<String, Integer> mp) {
    for (Map.Entry<String, Integer> pairs : mp.entrySet()) {
        if (pairs.getValue() >= 4) {
            System.out.println(pairs.getKey());
        }
    }
}
Well, iterate over the key-value pairs and collect keys where values meet the criteria
// collect results here
List<String> resultKeys = new ArrayList<String>();
// hash map iterator
Iterator<String> it = from_table.keySet().iterator();
while (it.hasNext()) {
    // get the key
    String key = it.next();
    // get the value for the key
    Integer value = from_table.get(key);
    // check the criteria
    if (value.intValue() > x) {
        resultKeys.add(key);
    }
}
Not in standard Java. Guava has a filter method (Maps.filterValues) that does exactly this as a one-liner, plus the predicate.
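For illustration, a hedged sketch of what that could look like with Guava on the classpath (names reused from the question):
// Assumes com.google.common.collect.Maps from Guava; keeps only entries whose value exceeds 4.
Map<String, Integer> filtered = Maps.filterValues(from_table, v -> v > 4);
Set<String> keysAboveFour = filtered.keySet();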
As the above solutions state, there is nothing faster than just looping through. An alternative, though, is to do the check in the method that puts items into the map: whenever a count reaches 4 or more, add the key to a separate list that only holds the frequent items.
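A minimal sketch of that insertion-time idea (the class, threshold constant, and method names are my own illustration, not from the answer):
class OccurrenceTracker {
    private static final int THRESHOLD = 4;
    private final Map<String, Integer> counts = new HashMap<>();
    private final List<String> frequentKeys = new ArrayList<>();

    void add(String key) {
        int newCount = counts.merge(key, 1, Integer::sum);
        if (newCount == THRESHOLD) {   // added exactly once, when the threshold is first reached
            frequentKeys.add(key);
        }
    }

    List<String> keysAtOrAboveThreshold() {
        return frequentKeys;
    }
}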
