Fast way to create a word-occurrence counting vector

I have a HashMap<String, Integer> vocabulary, containing words and their weight (not important, only the string is important here):
vocabulary = ["this movie"=5, "great"=2, "bad"=2, ...]
and a tokenized string as a list:
String str = "this movie is great";
List<String> tokens = tokenize(str) // tokens = ["this", "movie", "is", "great", "this movie", "is great", ...]
Now I need a fast way to create a vector for this tokenized string that counts, for every entry of the vocabulary, the number of occurrences of that word within the tokenized string:
HashMap<String, Integer> vec = new HashMap();
Iterator it = vocabulary.entrySet().iterator();
while (it.hasNext()) {
Map.Entry pair = (Map.Entry) it.next();
String word = (String) pair.getKey();
int count = 0;
for (String w : tokens) {
if (w.equals(word)) {
count += 1;
}
}
vec.put(word, count);
}
So, vec should be ["this movie"=1, "great"=1, "bad"=0]
Is there a better performing way to do this? I'm having performance issues in a larger context and assumed that the issue must be here, since vocabulary has approximately 300'000 entries. A normal tokenized text contains around 100 words.
Is it a problem that vocabulary is a hashMap?

Count the number of occurrences of each element of tokens:
Map<String, Long> tokensCount = tokens.stream().collect(
Collectors.groupingBy(Function.identity(), Collectors.counting()));
Then just look up from this map instead of your inner loop:
count = tokensCount.getOrDefault(word, 0L).intValue();
This is faster because the lookup in the map is O(1), whereas iterating the tokens looking for equal elements is O(# tokens).
Also note that you aren't using pair other than to get its key, so you can iterate vocabulary.keySet(), rather than vocabulary.entrySet().
Additionally, if you weren't using a raw iterator, you wouldn't need the explicit casts:
Iterator<Map.Entry<String, Integer>> it = ...
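Putting the pieces of this answer together, a minimal sketch could look like the following. The class name VocabularyVector and the method name countVector are made up for illustration, and the sample data is taken from the question.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class; the names are illustrative, not from the question.
class VocabularyVector {

    // Builds a dense vector: one entry per vocabulary word, 0 if the word is absent.
    static Map<String, Integer> countVector(Map<String, Integer> vocabulary, List<String> tokens) {
        // Count each distinct token once, up front: O(# tokens).
        Map<String, Long> tokensCount = tokens.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        Map<String, Integer> vec = new HashMap<>();
        // Only the keys are needed, so iterate keySet() with a typed for-each loop.
        for (String word : vocabulary.keySet()) {
            vec.put(word, tokensCount.getOrDefault(word, 0L).intValue());
        }
        return vec;
    }

    public static void main(String[] args) {
        Map<String, Integer> vocabulary = new HashMap<>();
        vocabulary.put("this movie", 5);
        vocabulary.put("great", 2);
        vocabulary.put("bad", 2);
        List<String> tokens = Arrays.asList("this", "movie", "is", "great", "this movie", "is great");
        System.out.println(countVector(vocabulary, tokens));
    }
}
```

This still visits every vocabulary entry, so the cost is proportional to the vocabulary size, but each visit is a single hash lookup instead of a scan over the tokens.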
Edit, now that you've added the relative sizes of the two collections:
You can simply iterate tokens, and see if vocabulary contains that:
Map<String, Integer> vec = new HashMap<>();
for (String token : tokens) {
if (vocabulary.containsKey(token)) {
vec.merge(token, 1, (old,v) -> old+v);
}
}

If vocabulary is already a HashMap, there is no need to iterate over it. Simply use the method containsKey which, for a HashMap, runs in constant time (O(1)) on average, so you only have to iterate over the token list.
for (String w : tokens) {
    if (vocabulary.containsKey(w)) {
        // merge inserts 1 for a new key and adds 1 to an existing count
        vec.merge(w, 1, Integer::sum);
    }
}
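Iterating the tokens produces a sparse vector: vocabulary words that never occur in the text simply have no entry. A minimal sketch under that assumption follows; the class and method names are hypothetical, and a missing entry is read back as zero with getOrDefault.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the names are illustrative, not from the question.
class SparseVocabularyVector {

    // Returns counts only for vocabulary words that actually occur in tokens.
    static Map<String, Integer> sparseVector(Map<String, Integer> vocabulary, List<String> tokens) {
        Map<String, Integer> vec = new HashMap<>();
        for (String token : tokens) {
            if (vocabulary.containsKey(token)) {   // average O(1) hash lookup
                vec.merge(token, 1, Integer::sum); // insert 1 or increment
            }
        }
        return vec;
    }

    public static void main(String[] args) {
        Map<String, Integer> vocabulary = new HashMap<>();
        vocabulary.put("this movie", 5);
        vocabulary.put("great", 2);
        vocabulary.put("bad", 2);
        List<String> tokens = Arrays.asList("this", "movie", "is", "great", "this movie", "is great");

        Map<String, Integer> vec = sparseVector(vocabulary, tokens);
        // Absent words are treated as zero when reading the vector.
        System.out.println(vec.getOrDefault("bad", 0));
    }
}
```

With roughly 100 tokens and 300'000 vocabulary entries, this touches about 100 map entries per text instead of 300'000.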

Related

Problem sorting ConcurrentHashMap by values using java.util.Collections.sort() in Java

I have this code which prints me a list of words sorted by keys (alphabetically) from counts, my ConcurrentHashMap which stores words as keys and their frequencies as values.
// Method to create a stopword list with the most frequent words from the lemmas key in the json file
private static List<String> StopWordsFile(ConcurrentHashMap<String, String> lemmas) {
// counts stores each word and its frequency
ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<String, Integer>();
// corpus is an array list for all the individual words
ArrayList<String> corpus = new ArrayList<String>();
for (Entry<String, String> entry : lemmas.entrySet()) {
String line = entry.getValue().toLowerCase();
line = line.replaceAll("\\p{Punct}", " ");
line = line.replaceAll("\\d+"," ");
line = line.replaceAll("\\s+", " ");
line = line.trim();
String[] value = line.split(" ");
List<String> words = new ArrayList<String>(Arrays.asList(value));
corpus.addAll(words);
}
// count all the words in the corpus and store the words with each frequency i
//counts
for (String word : corpus) {
if (counts.keySet().contains(word)) {
counts.put(word, counts.get(word) + 1);
} else {counts.put(word, 1);}
}
// Create a list to store all the words with their frequency and sort it by values.
List<Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());
List<String> stopwordslist = new ArrayList<>(counts.keySet()); // this works but counts.values() gives an error
Collections.sort(stopwordslist);
System.out.println("List after sorting: " +stopwordslist);
So the output is:
List after sorting: [a, abruptly, absent, abstractmap, accept,...]
How can I sort them by values as well? when I use
List stopwordslist = new ArrayList<>(counts.values());
I get an error,
- Cannot infer type arguments for ArrayList<>
I guess that is because ArrayList can store < String > but not <String,Integer> and it gets confused.
I have also tried to do it with a custom Comparator like so:
Comparator<Entry<String, Integer>> valueComparator = new Comparator<Entry<String,Integer>>() {
@Override
public int compare(Entry<String, Integer> e1, Entry<String, Integer> e2) {
Integer v1 = e1.getValue();
Integer v2 = e2.getValue();
return v1.compareTo(v2);
}
};
List<Entry<String, Integer>> stopwordslist = new ArrayList<Entry<String, Integer>>();
// sorting HashMap by values using comparator
Collections.sort(counts, valueComparator)
which gives me another error,
The method sort(List<T>, Comparator<? super T>) in the type Collections is not applicable for the arguments (ConcurrentHashMap<String,Integer>, Comparator<Map.Entry<String,Integer>>)
How can I sort my list by values?
My expected output is something like
[the, of, value, v, key, to, given, a, k, map, in, for, this, returns, if, is, super, null, specified, u, function, and, ...]

Let’s go through all the issues of your code
Name conventions. Method names should start with a lowercase letter.
Unnecessary use of ConcurrentHashMap. For a purely local use like within your method, an ordinary HashMap will do. For parameters, just use the Map interface, to allow the caller to use whatever Map implementation will fit.
Unnecessarily iterating over the entrySet(). When you’re only interested in the values, you don’t need to use entrySet() and call getValue() on every entry; you can iterate over values() in the first place. Likewise, you would use keySet() when you’re interested in the keys only. Only iterate over entrySet() when you need key and value (or want to perform updates).
Don’t replace pattern matches by spaces, to split by the spaces afterwards. Specify the (combined) pattern directly to split, i.e. line.split("[\\p{Punct}\\d\\s]+").
Don’t use List<String> words = new ArrayList<String>(Arrays.asList(value)); unless you specifically need the features of an ArrayList. Otherwise, just use List<String> words = Arrays.asList(value);
But when the only thing you’re doing with the list, is addAll to another collection, you can use Collections.addAll(corpus, value); without the List detour.
Don’t use counts.keySet().contains(word) as you can simply use counts.containsKey(word). But you can simplify the entire
if (counts.containsKey(word)) {
counts.put(word, counts.get(word) + 1);
} else {counts.put(word, 1);}
to
counts.merge(word, 1, Integer::sum);
The points above yield
ArrayList<String> corpus = new ArrayList<>();
for(String line: lemmas.values()) {
String[] value = line.toLowerCase().trim().split("[\\p{Punct}\\d\\s]+");
Collections.addAll(corpus, value);
}
for (String word : corpus) {
counts.merge(word, 1, Integer::sum);
}
But there is no point in performing two loops, the first only to store everything into a potentially large list, to iterate over it a single time. You can perform the second loop’s operation right in the first (resp. only) loop and get rid of the list.
for(String line: lemmas.values()) {
for(String word: line.toLowerCase().trim().split("[\\p{Punct}\\d\\s]+")) {
counts.merge(word, 1, Integer::sum);
}
}
You already acknowledged that you can’t sort a map, by copying the map into a list and sorting the list in your first variant. In the second variant, you created a List<Entry<String, Integer>> but then, you didn’t use it at all but rather tried to pass the map to sort. (By the way, since Java 8, you can invoke sort directly on a List, no need to call Collections.sort).
You have to keep copying the map data into a list and sorting the list. For example,
List<Map.Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());
list.sort(Map.Entry.comparingByValue());
Now, you have to decide whether you change the return type to List<Map.Entry<String, Integer>> or copy the keys of the sorted entries to a new list.
Taking all points together and staying with the original return type, the fixed code looks like
private static List<String> stopWordsFile(Map<String, String> lemmas) {
Map<String, Integer> counts = new HashMap<>();
for(String line: lemmas.values()) {
for(String word: line.toLowerCase().trim().split("[\\p{Punct}\\d\\s]+")) {
counts.merge(word, 1, Integer::sum);
}
}
List<Map.Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());
list.sort(Map.Entry.comparingByValue());
List<String> stopwordslist = new ArrayList<>();
for(Map.Entry<String, Integer> e: list) stopwordslist.add(e.getKey());
// System.out.println("List after sorting: " + stopwordslist);
return stopwordslist;
}
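Note that Map.Entry.comparingByValue() sorts in ascending order, while the expected output in the question lists the most frequent words first. A hedged variant that reverses the order is sketched below. The class name StopWords, the method name, the alphabetical tie-break and the filtering of empty strings (which split can produce when a line starts with a delimiter) are assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the names are illustrative, not from the question.
class StopWords {

    static List<String> byDescendingFrequency(Map<String, String> lemmas) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lemmas.values()) {
            for (String word : line.toLowerCase().split("[\\p{Punct}\\d\\s]+")) {
                // split yields an empty first element if the line starts with a delimiter
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken alphabetically (an assumption, for a stable result).
        list.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.comparingByKey()));

        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : list) {
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> lemmas = new LinkedHashMap<>();
        lemmas.put("1", "The cat and the dog");
        lemmas.put("2", "the end, and... 42 cats");
        System.out.println(byDescendingFrequency(lemmas));
    }
}
```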

Comparing keys in HashMap and Values

I have a HashMap as follows-
HashMap<String, Integer> BC = new HashMap<String, Integer>();
which stores "token/tag" strings as keys and the frequency of each token/tag pair as values.
Example-
"the/at" 153
"that/cs" 45
"Ann/np" 3
I now go through each key and check whether the same token, say "the", is associated with more than one tag; if so, I keep the entry with the largest count.
Example-
"the/at" 153
"the/det" 80
Then I take the key- "the/at" with value - 153.
The code that I have written to do so is as follows-
private HashMap<String, Integer> Unigram_Tagger = new HashMap<String, Integer>();
for(String curr_key: BC.keySet())
{
for(String next_key: BC.keySet())
{
if(curr_key.equals(next_key))
continue;
else
{
String[] split_key_curr_key = curr_key.split("/");
String[] split_key_next_key = next_key.split("/");
//out.println("CK- " + curr_key + ", NK- " + next_key);
if(split_key_curr_key[0].equals(split_key_next_key[0]))
{
int ck_v = 0, nk_v = 0;
ck_v = BC.get(curr_key);
nk_v = BC.get(next_key);
if(ck_v > nk_v)
Unigram_Tagger.put(curr_key, BC.get(curr_key));
else
Unigram_Tagger.put(next_key, BC.get(next_key));
}
}
}
}
But this code is taking too long to compute since the original HashMap 'BC' has 68442 entries which comes approximately to its square = 4684307364 times (plus some more).
My question is this- can I accomplish the same output using a more efficient method?
Thanks!

Create a new
Map<String,Integer> highCount = new HashMap<>();
that will map tokens to their largest count.
Make a single pass through the keys.
Split each key into its component tokens.
For each token, look in highCount. If the key does not exist, add it with its count. If the entry already exists and the current count is greater than the previous maximum, replace the maximum in the map.
When you are done with the single pass the highCount will contain all the unique tokens along with the highest count seen for each token.
Note: This answer is intended to give you a starting point from which to develop a complete solution. The key concept is that you create and populate a new map from token to some "value" type (not necessarily just Integer) that provides you with the functionality you need. Most likely the value type will be a new custom class that stores the tag and the count.
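The single pass described above could be sketched as follows. The class UnigramTagger, the value type TagCount and the decision to skip keys without a "/" are assumptions made for illustration; the token is taken as everything before the first "/".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; class, method and field names are illustrative.
class UnigramTagger {

    // Value type holding the best tag seen so far for a token, with its count.
    static final class TagCount {
        final String tag;
        final int count;

        TagCount(String tag, int count) {
            this.tag = tag;
            this.count = count;
        }
    }

    // One pass over the entries: O(n) instead of the O(n^2) pairwise comparison.
    static Map<String, TagCount> highestCounts(Map<String, Integer> bc) {
        Map<String, TagCount> highCount = new HashMap<>();
        for (Map.Entry<String, Integer> e : bc.entrySet()) {
            String key = e.getKey();
            int slash = key.indexOf('/');
            if (slash < 0) {
                continue; // assumption: keys without a tag are ignored
            }
            String token = key.substring(0, slash);
            String tag = key.substring(slash + 1);
            TagCount best = highCount.get(token);
            if (best == null || e.getValue() > best.count) {
                highCount.put(token, new TagCount(tag, e.getValue()));
            }
        }
        return highCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> bc = new HashMap<>();
        bc.put("the/at", 153);
        bc.put("the/det", 80);
        bc.put("that/cs", 45);
        bc.put("Ann/np", 3);
        for (Map.Entry<String, TagCount> e : highestCounts(bc).entrySet()) {
            System.out.println(e.getKey() + "/" + e.getValue().tag + " " + e.getValue().count);
        }
    }
}
```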

The slowest part of your current method is due to the pairwise comparison of keys. First, define a Tuple class:
public class Tuple<X, Y> {
public final X x;
public final Y y;
public Tuple(X x, Y y) {
this.x = x;
this.y = y;
}
}
Thus you can try an algorithm that does:
Initializes a new HashMap<String, Tuple<String, Integer>> result
Given an input pair (key, value) from the old map, where key = "a/b", check whether result.containsKey(a) and result.containsKey(b).
If neither a nor b is present, call result.put(a, new Tuple<String, Integer>(b, value)) and result.put(b, new Tuple<String, Integer>(a, value)).
If a is present, compare value with v = result.get(a).y. If value > v, remove a and b from result and do step 3. Do the same for b. Otherwise, get the next key-value pair.
After you have iterated through the old hash map and inserted everything, then you can easily reconstruct the output you want by transforming the key-values in result.

A basic thought on the algorithm:
You should get the entrySet() of the HashMap and convert it to a List:
ArrayList<Map.Entry<String, Integer>> list = new ArrayList<>(map.entrySet());
Now you should sort the list by the keys in alphabetical order. We do that because the HashMap has no order, so you can expect that the corresponding keys might be far apart. But by sorting them, all related keys are directly next to each other.
Collections.sort(list, Comparator.comparing(e -> e.getKey()));
The entries "the/at" and "the/det" will be next to each other, thanks to sorting alphabetically.
Now you can iterate over the entire list while remembering the best item, until you find a better one or you find the first item which has not the same prefix (e.g. "the").
ArrayList<Map.Entry<String, Integer>> bestList = new ArrayList<>();
// The first entry of the list is considered the currently best item for its group
Map.Entry<String, Integer> currentBest = list.get(0);
String key = currentBest.getKey();
String currentPrefix = key.substring(0, key.indexOf('/'));
for (int i=1; i<list.size(); i++) {
// The item we compare the current best with
Map.Entry<String, Integer> next = list.get(i);
String nkey = next.getKey();
String nextPrefix = nkey.substring(0, nkey.indexOf('/'));
// If both items have the same prefix, then we want to keep the best one
// as the current best item
if (currentPrefix.equals(nextPrefix)) {
if (currentBest.getValue() < next.getValue()) {
currentBest = next;
}
// If the prefix is different we add the current best to the best list and
// consider the current item the best one for the next group
} else {
bestList.add(currentBest);
currentBest = next;
currentPrefix = nextPrefix;
}
}
// The last one must be added here, or we would forget it
bestList.add(currentBest);
Now you should have a list of Map.Entry objects representing the desired entries. The complexity should be O(n log n), dominated by the sort, while grouping and collecting the items is O(n).

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.stream.Collectors;
public class Point {
public static void main(String[] args) {
HashMap<String, Integer> BC = new HashMap<>();
//some random values
BC.put("the/at",5);
BC.put("Ann/npe",6);
BC.put("the/atx",7);
BC.put("that/cs",8);
BC.put("the/aty",9);
BC.put("Ann/np",1);
BC.put("Ann/npq",2);
BC.put("the/atz",3);
BC.put("Ann/npz",4);
BC.put("the/atq",0);
BC.put("the/atw",12);
BC.put("that/cs",14);
BC.put("that/cs1",16);
BC.put("the/at1",18);
BC.put("the/at2",100);
BC.put("the/at3",123);
BC.put("that/det",153);
BC.put("xyx",123);
BC.put("xyx/w",2);
System.out.println("\nUnsorted Map......");
printMap(BC);
System.out.println("\nSorted Map......By Key");
//sort original map using TreeMap, it will sort the Map by keys automatically.
Map<String, Integer> sortedBC = new TreeMap<>(BC);
printMap(sortedBC);
// find all distinct prefixes by splitting the keys at "/"
List<String> uniquePrefixes = sortedBC.keySet().stream().map(i->i.split("/")[0]).distinct().collect(Collectors.toList());
System.out.println("\nuniquePrefixes: "+uniquePrefixes);
TreeMap<String,Integer> mapOfMaxValues = new TreeMap<>();
// for each prefix from the list above filter the entries from the sorted map
// having keys starting with this prefix
//and sort them by value in descending order and get the first which will have the highest value
uniquePrefixes.stream().forEach(i->{
Entry <String,Integer> e =
sortedBC.entrySet().stream().filter(j->j.getKey().split("/")[0].equals(i))
.sorted(Map.Entry.comparingByValue(Comparator.reverseOrder())).findFirst().get();
mapOfMaxValues.put(e.getKey(), e.getValue());
});
System.out.println("\nmapOfMaxValues...\n");
printMap(mapOfMaxValues);
}
//pretty print a map
public static <K, V> void printMap(Map<K, V> map) {
map.entrySet().stream().forEach((entry) -> {
System.out.println("Key : " + entry.getKey()
+ " Value : " + entry.getValue());
});
}
}
// note: only tested with random values provided in the code
// behavior for large maps untested
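The program above re-scans the whole map once per prefix. As an alternative, here is a hedged sketch that finds the maximum per prefix in a single stream pass using Collectors.toMap with a merge function. The class and method names are hypothetical, keys are assumed to be non-empty strings, and on equal counts the entry encountered first is kept.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch; class and method names are illustrative.
class MaxPerPrefix {

    static Map<String, Integer> maxPerPrefix(Map<String, Integer> bc) {
        // prefix -> entry with the highest value seen for that prefix
        Map<String, Map.Entry<String, Integer>> best = bc.entrySet().stream()
                .collect(Collectors.toMap(
                        e -> e.getKey().split("/")[0],               // group by the part before "/"
                        e -> e,                                      // keep the whole entry
                        (a, b) -> a.getValue() >= b.getValue() ? a : b)); // keep the larger count

        // Copy the winners into a map sorted by key for readable output.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : best.values()) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> bc = new HashMap<>();
        bc.put("the/at", 153);
        bc.put("the/det", 80);
        bc.put("then/rb", 200);
        bc.put("that/cs", 45);
        System.out.println(maxPerPrefix(bc));
    }
}
```

Because the grouping key is the exact prefix rather than a startsWith test, "then/rb" is not mixed into the group for "the".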

Java String array parsing and getting data

String input data is
{phone=333-333-3333, pr_specialist_email=null, sic_code=2391, status=ACTIVE, address1=E.BALL Drive, fax=333-888-3315, naics_code=325220, client_id=862222, bus_name=ENTERPRISES, address2=null, contact=BYRON BUEGE}
Key and values will increase in the array.
I want to get the value for each key ie myString.get("phone") should return 333-333-3333
I am using Java 1.7, is there any tools I can use this to parse the data and get the values.
Some of my input is having values like,
{phone=000000002,Desc="Light PROPERTITES, LCC", Address1="C/O ABC RICHARD",Address2="6508 THOUSAND OAKS BLVD.,",Adress3="SUITE 1120",city=MEMPHIS,state=TN,name=,dob=,DNE=,}
Comma separator doesn't work here

Here is a simple function that will do exactly what you want. It takes your string as an input and returns a HashMap containing all the keys and values.
private HashMap<String, String> getKeyValueMap(String str) {
// Trim the curly ({}) brackets
str = str.trim();
str = str.substring(1, str.length() - 1);
// Split all the key-values tuples
String[] split = str.split(",");
String[] keyValue;
HashMap<String, String> map = new HashMap<String, String>();
for (String tuple : split) {
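// Separate the key from the value and put them in the HashMap
// The limit of 2 keeps an empty value (e.g. "name=") as an empty string
keyValue = tuple.split("=", 2);
map.put(keyValue[0].trim(), keyValue.length > 1 ? keyValue[1].trim() : "");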
}
// Return the HashMap with all the key-value combinations
return map;
}
Note: This will not work if there's ever a '=' or ',' character in any of the key names or values.
To get any value, all you have to do is:
HashMap<String, String> map = getKeyValueMap(...);
String value = map.get(key);
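For the second input format in the question, where quoted values contain commas, a plain split on "," is not enough. Below is a minimal quote-aware sketch that stays within Java 1.7 syntax. The class name KeyValueParser is hypothetical, and it assumes that double quotes only ever delimit values and are never escaped inside them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; assumes quotes are never escaped inside values.
class KeyValueParser {

    static Map<String, String> parse(String input) {
        String s = input.trim();
        // Drop the surrounding curly braces, if present.
        if (s.startsWith("{") && s.endsWith("}")) {
            s = s.substring(1, s.length() - 1);
        }
        Map<String, String> map = new LinkedHashMap<String, String>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;            // quotes delimit a value and are dropped
            } else if (c == ',' && !inQuotes) {
                putPair(map, current.toString()); // a comma outside quotes ends the pair
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        putPair(map, current.toString());
        return map;
    }

    private static void putPair(Map<String, String> map, String pair) {
        int eq = pair.indexOf('=');
        if (eq < 0) {
            return; // empty fragment, e.g. after a trailing comma
        }
        map.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
    }

    public static void main(String[] args) {
        String input = "{phone=000000002,Desc=\"Light PROPERTITES, LCC\",city=MEMPHIS,name=,}";
        System.out.println(parse(input).get("Desc"));
    }
}
```

Only the first "=" in each pair is treated as the separator, so values may contain "=" as well.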

You can write a simple parser yourself. I'll exclude error checking in this code for brevity.
You should first remove the { and } characters, then split by ', ' and split each resulting string by =. Finally, add the results into a map.
String input = ...;
Map<String, String> map = new HashMap<>();
input = input.substring(1, input.length() - 1);
String elements[] = input.split(", ");
for(String elem : elements)
{
String values[] = elem.split("=");
map.put(values[0].trim(), values[1].trim());
}
Then, to retrieve a value, just do
String value = map.get("YOURKEY");

You can use Guava (Google Core Libraries for Java) and its Splitter.MapSplitter to do the job.
First remove the curly braces using substring method and use the below code to do your job.
Map<String, String> splitKeyValues = Splitter.on(",")
.omitEmptyStrings()
.trimResults()
.withKeyValueSeparator("=")
.split(stringToSplit);

How to get Second Most repeated Word in String using Maps

I am trying to get the second most repeated word in the sentence.
eg:
String paraString = "This is a paragraph with multiple strings. Get the second most repeated word from the paragraph text and print the words with count";
Here 'the' is repeated three times, and 'paragraph' and 'with' are repeated twice.
I need to print the second most repeated words 'paragraph' & 'with'.
Here is the program which I wrote to get the First Most Repeated Words.
public Set<String> getMostRepeatedWords(Map<String, Integer> sortedMap) {
Set<String> mostRepeatedWords = new HashSet<String>();
int mostrepeatedWord = Collections.max(sortedMap.values());
for (Map.Entry<String, Integer> entry : sortedMap.entrySet()) {
if (mostrepeatedWord == entry.getValue()) {
mostRepeatedWords.add(entry.getKey());
}
}
return mostRepeatedWords;
}
Please help me out.
The one option which I have is below. Let me know if there are any other ways.
int mostrepeatedWord = Collections.max(sortedMap.values())-1;

Here is an example of what you could do with Java 8 :
public List<String> getMostRepeatedWords(String s) {
Map<String,Integer> map = new HashMap<>();
String[] words = s.split("\\s+");
for (String word : words)
map.put(word,map.containsKey(word) ? map.get(word) + 1 : 1);
List<Entry<String,Integer>> tmp = new ArrayList<>(map.entrySet());
Collections.sort(tmp,(e1,e2) -> Integer.compare(e2.getValue(),e1.getValue()));
return tmp.stream().map(e -> e.getKey()).collect(Collectors.toList());
}
This method computes the complete list of the words sorted by decreasing number of occurrences. If you don't need the whole list, you should rather store the entries of the map in an array and then apply a quickselect on it, with a custom Comparator. Let me know if you are interested and I'll go in further details.

Following your solution
So you have getMostRepeatedWords and now want the second most repeated words.
In pseudo-code this would be:
Map<String, Integer> sortedMap = ...;
SortedMap<String, Integer> rest = new TreeMap<>(sortedMap);
rest.keySet().removeAll(getMostRepeatedWords(sortedMap));
Set<String> secondMostRepeatedWords = getMostRepeatedWords(rest);
Remove the most repeated words and then on the rest get the most repeated words.
More effort
You could also copy the values, sort them in decreasing order, and then take the second-largest value:
the first value at an index > 0 that is smaller than the first element.
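A minimal sketch of that idea, using a sorted set of the distinct counts instead of a sorted copy of the values. The class and method names are hypothetical, and the word splitting on "\\W+" with lowercasing is an assumption. Subtracting 1 from the maximum, as suggested in the question, only works when the counts happen to be consecutive.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; class and method names are illustrative.
class SecondMostRepeated {

    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    static Set<String> secondMostRepeatedWords(Map<String, Integer> counts) {
        // Distinct counts, largest first.
        TreeSet<Integer> distinct = new TreeSet<>(Collections.reverseOrder());
        distinct.addAll(counts.values());
        if (distinct.size() < 2) {
            return Collections.emptySet(); // no second-highest count exists
        }
        Iterator<Integer> it = distinct.iterator();
        it.next();               // skip the highest count
        int second = it.next();  // second-highest distinct count

        Set<String> words = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == second) {
                words.add(e.getKey());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String paraString = "This is a paragraph with multiple strings. Get the second most "
                + "repeated word from the paragraph text and print the words with count";
        System.out.println(secondMostRepeatedWords(countWords(paraString)));
    }
}
```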

How to read strings off of .txt file and sort them into an ArrayList based on the number of occurrences?

I have a program that reads a .txt file, creates a HashMap containing each unique string and its number of occurrences, and I would like to create an ArrayList that displays these unique strings in descending order in terms of their number of appearances.
Currently, my program sorts in descending order from an alphabetical standpoint (using ASCII values I assume).
How can I sort this in descending order in terms of their number of appearances?
Here's the relevant part of the code:
Scanner in = new Scanner(new File("C:/Users/ahz9187/Desktop/counter.txt"));
while(in.hasNext()){
String string = in.next();
//makes sure unique strings are not repeated - adds a new unit if new, updates the count if repeated
if(map.containsKey(string)){
Integer count = (Integer)map.get(string);
map.put(string, new Integer(count.intValue()+1));
} else{
map.put(string, new Integer(1));
}
}
System.out.println(map);
//places units of map into an arrayList which is then sorted
//Using ArrayList because length does not need to be designated - can take in the units of HashMap 'map' regardless of length
ArrayList arraylist = new ArrayList(map.keySet());
Collections.sort(arraylist); //this method sorts in ascending order
//Outputs the list in reverse alphabetical (or descending) order, case sensitive
for(int i = arraylist.size()-1; i >= 0; i--){
String key = (String)arraylist.get(i);
Integer count = (Integer)map.get(key);
System.out.println(key + " --> " + count);
}

In Java 8:
public static void main(final String[] args) throws IOException {
final Path path = Paths.get("C:", "Users", "ahz9187", "Desktop", "counter.txt");
try (final Stream<String> lines = Files.lines(path)) {
final Map<String, Integer> count = lines.
collect(HashMap::new, (m, v) -> m.merge(v, 1, Integer::sum), Map::putAll);
final List<String> ordered = count.entrySet().stream().
sorted((l, r) -> Integer.compare(r.getValue(), l.getValue())).
map(Entry::getKey).
collect(Collectors.toList());
ordered.forEach(System.out::println);
}
}
First read the file using the Files.lines method which gives you a Stream<String> of the lines.
Now collect the lines into a Map<String, Integer> using the Map.merge method which takes a key and a value and also a lambda that is applied to the old value and the new value if the key is already present.
You now have your counts.
Now take a Stream of the entrySet of the Map and sort that by the value of each Entry and then take the key. Collect that to a List. You now have a List of your values sorted by count.
Now simply use forEach to print them.
If still using Java 7 you can use the Map to provide the sort order:
final Map<String, Integer> counts = /*from somewhere*/
final List<String> sorted = new ArrayList<>(counts.keySet());
Collections.sort(sorted, new Comparator<String>() {
@Override
public int compare(final String o1, final String o2) {
return counts.get(o2).compareTo(counts.get(o1));
}
});

You haven't shown the declaration of your map, but for the purpose of this answer I'm assuming that your map is declared like this:
Map<String,Integer> map = new HashMap<String,Integer>();
You need to use a Comparator in the call to sort, but it needs to compare by the count, while remembering the string. So you need to put objects in the list that have both the string and the count.
One type that provides this capability, and that is easily available from the Map.entrySet method, is the type Map.Entry.
The last part rewritten with Map.Entry and a Comparator:
ArrayList<Map.Entry<String,Integer>> arraylist = new ArrayList<Map.Entry<String,Integer>>(map.entrySet());
Collections.sort(arraylist, new Comparator<Map.Entry<String,Integer>>() {
@Override
public int compare(Entry<String, Integer> e1, Entry<String, Integer> e2) {
// Compares by count in descending order
return e2.getValue() - e1.getValue();
}
});
// Outputs the entries in descending order of count
for (Map.Entry<String,Integer> entry : arraylist) {
System.out.println(entry.getKey() + " --> " + entry.getValue());
}
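For reference, here is a compact end-to-end sketch of the same approach: count with Map.merge, then sort the entries by count. It reads from a Scanner so that the file-based Scanner from the question can be passed in unchanged. The class and method names and the alphabetical tie-break are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

// Hypothetical sketch; class and method names are illustrative.
class WordFrequency {

    // Returns "word --> count" lines, most frequent first.
    static List<String> wordsByDescendingCount(Scanner in) {
        Map<String, Integer> map = new HashMap<>();
        while (in.hasNext()) {
            map.merge(in.next(), 1, Integer::sum); // insert 1 or increment
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(map.entrySet());
        entries.sort((e1, e2) -> {
            // Integer.compare avoids the overflow risk of subtracting counts.
            int byCount = Integer.compare(e2.getValue(), e1.getValue());
            // Tie-break alphabetically (an assumption) so the output is deterministic.
            return byCount != 0 ? byCount : e1.getKey().compareTo(e2.getKey());
        });

        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            lines.add(e.getKey() + " --> " + e.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        // A String-backed Scanner stands in for new Scanner(new File(...)).
        for (String line : wordsByDescendingCount(new Scanner("b a b c a b"))) {
            System.out.println(line);
        }
    }
}
```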
