Word count with java 8


I am trying to implement a word count program in Java 8, but I am unable to make it work. The method must take a String as a parameter and return a Map<String, Integer>.
When I do it the old Java way, everything works fine. But when I try to do it in Java 8 style, it returns a map whose keys are empty or blank, together with occurrence counts.
Here is my code in Java 8 style:
public Map<String, Integer> countJava8(String input){
return Pattern.compile("(\\w+)").splitAsStream(input).collect(Collectors.groupingBy(e -> e.toLowerCase(), Collectors.reducing(0, e -> 1, Integer::sum)));
}
Here is the code I would use in a normal situation :
public Map<String, Integer> count(String input){
Map<String, Integer> wordcount = new HashMap<>();
Pattern compile = Pattern.compile("(\\w+)");
Matcher matcher = compile.matcher(input);
while(matcher.find()){
String word = matcher.group().toLowerCase();
if(wordcount.containsKey(word)){
Integer count = wordcount.get(word);
wordcount.put(word, ++count);
} else {
wordcount.put(word.toLowerCase(), 1);
}
}
return wordcount;
}
The main program :
public static void main(String[] args) {
WordCount wordCount = new WordCount();
Map<String, Integer> phrase = wordCount.countJava8("one fish two fish red fish blue fish");
Map<String, Integer> count = wordCount.count("one fish two fish red fish blue fish");
System.out.println(phrase);
System.out.println();
System.out.println(count);
}
When I run this program, I get this output:
{ =7, =1}
{red=1, blue=1, one=1, fish=4, two=1}
I thought that the method splitAsStream would stream the elements matching the regex as a Stream. How can I correct that?

The problem is that you are in fact splitting by words, i.e. you are streaming over everything that is not a word, or that is in between words. Unfortunately, Java 8 seems to have no equivalent method for streaming the actual match results (hard to believe, but I did not find any; feel free to comment if you know one).
Instead, you could just split by non-words, using \W instead of \w. Also, as noted in the comments, you can make it a bit more readable by using String::toLowerCase instead of a lambda, and Collectors.summingInt instead of Collectors.reducing.
public static Map<String, Integer> countJava8(String input) {
return Pattern.compile("\\W+")
.splitAsStream(input)
.collect(Collectors.groupingBy(String::toLowerCase,
Collectors.summingInt(s -> 1)));
}
But IMHO this is still very hard to comprehend, not only because of the "inverse" lookup, and it's also difficult to generalize to other, more complex patterns. Personally, I would just go with the "old school" solution, maybe making it a bit more compact using the new getOrDefault.
public static Map<String, Integer> countOldschool(String input) {
Map<String, Integer> wordcount = new HashMap<>();
Matcher matcher = Pattern.compile("\\w+").matcher(input);
while (matcher.find()) {
String word = matcher.group().toLowerCase();
wordcount.put(word, wordcount.getOrDefault(word, 0) + 1);
}
return wordcount;
}
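For this input, both versions produce the same result. Note that the split-based version yields a leading empty string when the input starts with a non-word character, so such input would need an additional filter for empty strings.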
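For readers on Java 9 or later, the matches themselves can be streamed with Matcher.results(), which did not exist in Java 8. The following is a minimal sketch under that assumption; the class and method names are made up for illustration, and a TreeMap is used only so that the printed order is predictable.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class name, used only for this sketch.
class WordCountResults {
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Requires Java 9+: Matcher.results() streams the matches, not the gaps between them.
    static Map<String, Integer> countWithResults(String input) {
        return WORD.matcher(input)
                .results()
                .map(MatchResult::group)
                .collect(Collectors.groupingBy(String::toLowerCase,
                        TreeMap::new,                     // sorted keys, for stable output
                        Collectors.summingInt(w -> 1)));  // count one per match
    }

    public static void main(String[] args) {
        System.out.println(countWithResults("one fish two fish red fish blue fish"));
        // prints {blue=1, fish=4, one=1, red=1, two=1}
    }
}
```

Because this streams matches rather than separators, it keeps the same \w+ pattern as the loop-based version and does not produce empty strings for leading punctuation.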

Try this.
String in = "go go go go og sd";
Map<String, Integer> map = new HashMap<String, Integer>();
//Replace all punctuation with space
String[] s = in.replaceAll("\\p{Punct}", " ").split("\\s+");
for(int i = 0; i < s.length; i++)
{
map.put(s[i], i);
}
Set<String> st = new HashSet<String>(map.keySet());
for(int k = 0; k < s.length; k++)
{
int i = 0;
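// Match whole words only and treat the word literally, not as a regex
Pattern p = Pattern.compile("\\b" + Pattern.quote(s[k]) + "\\b");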
Matcher m = p.matcher(in);
while (m.find()) {
i++;
}
map.put(s[k], i);
}
for(String strin : st)
{
System.out.println("String: " + strin.toString() + " - Occurrency: " + map.get(strin.toString()));
}
System.out.println("Word: " + s.length);
This is the output:
String: sd - Occurrency: 1
String: go - Occurrency: 4
String: og - Occurrency: 1
Word: 6

Related

How to hold position of every character from encoded string

I am trying to hold the position of every character (A-Z) from encoded string. I am unable to find out a way to implement it in Java.
Below is one example from C++ which I am trying to rewrite in Java.
map<Character,Integer> enc = new map<Character,Integer>();
for (int i = 0; i < encoded.length(); i++)
{
enc[encoded.charAt(i)] = i;
}
Example below:
I will have a keyword with unique letters, e.g. the keyword NEW.
The string is formed by concatenating the keyword and the letters A-Z that are not in the keyword, e.g. NEWABCDFGHIJKLMOPQRSTUVXYZ (note that N, E and W are not repeated in this 26-character string). Finally, I would like to hold the position of every character, i.e. A-Z, in that string.

If I understand what you're saying, you want to map each character in a string to its index. However, a Map needs a unique key for each entry, so your code won't work directly for strings that contain duplicate characters. Instead of storing a single Integer for each character, we'll use a List of Integers to store all the indexes at which the character appears.
Here's how you would do that in Java:
public static void main(String[] args) {
Map<Character, List<Integer>> charMap = new HashMap<>();
String string = "aaabbcdefg";
for (int i = 0; i < string.length(); i++) {
Character c = string.charAt(i);
if (charMap.containsKey(c)) {
List<Integer> positions = charMap.get(c);
positions.add(i);
} else {
charMap.put(c, new ArrayList<>(Arrays.asList(i)));
}
}
for (Character c : charMap.keySet()) {
System.out.print(c + ": ");
charMap.get(c).forEach(System.out::print);
System.out.println();
}
}
output:
a: 012
b: 34
c: 5
d: 6
e: 7
f: 8
g: 9
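The containsKey/else branch above can be shortened with Map.computeIfAbsent (available since Java 8). This is only a sketch of the same idea; the class and method names are hypothetical, and a TreeMap is used so the printed order is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class name, used only for this sketch.
class CharPositions {
    // Maps every character of s to the list of indexes where it occurs.
    static Map<Character, List<Integer>> positionsOf(String s) {
        Map<Character, List<Integer>> charMap = new TreeMap<>();
        for (int i = 0; i < s.length(); i++) {
            // Create the list on first sight of the character, then append the index.
            charMap.computeIfAbsent(s.charAt(i), k -> new ArrayList<>()).add(i);
        }
        return charMap;
    }

    public static void main(String[] args) {
        System.out.println(positionsOf("aaabbcdefg"));
        // prints {a=[0, 1, 2], b=[3, 4], c=[5], d=[6], e=[7], f=[8], g=[9]}
    }
}
```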

If you don't want to handle duplicate letters, you can do the following; it will keep only the last occurrence of each letter:
Map<Character, Integer> enc = new HashMap<>();
for (int i = 0; i < encoded.length(); i++) {
enc.put(encoded.charAt(i), i);
}
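Applied to the keyword example from the question, this could look like the sketch below. The helper names buildEncoded and positions are invented for illustration, and the code assumes the keyword consists of distinct uppercase letters A-Z.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class name, used only for this sketch.
class KeyedAlphabet {
    // Keyword followed by the letters A-Z that do not occur in it.
    // Assumes the keyword contains only distinct uppercase letters.
    static String buildEncoded(String keyword) {
        StringBuilder sb = new StringBuilder(keyword);
        for (char c = 'A'; c <= 'Z'; c++) {
            if (keyword.indexOf(c) < 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Position of every character; with unique characters nothing is overwritten.
    static Map<Character, Integer> positions(String encoded) {
        Map<Character, Integer> enc = new LinkedHashMap<>();
        for (int i = 0; i < encoded.length(); i++) {
            enc.put(encoded.charAt(i), i);
        }
        return enc;
    }

    public static void main(String[] args) {
        String encoded = buildEncoded("NEW");
        System.out.println(encoded); // prints NEWABCDFGHIJKLMOPQRSTUVXYZ
        Map<Character, Integer> enc = positions(encoded);
        System.out.println(enc.get('N') + " " + enc.get('A') + " " + enc.get('Z'));
        // prints 0 3 25
    }
}
```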
——————-
To handle duplicate characters, you can hold their positions in a List or concatenate them into a String, for example (in the second variant I add a filter operation to remove spaces):
public static void main (String[] args)
{
String str = "AUBU CUDU";
Map<Character, List<Integer>> mapList =
IntStream.range(0, str.length())
.boxed()
.collect(Collectors.toMap(i->str.charAt(i), i->Arrays.asList(i), (la,lb)->{List<Integer>res =new ArrayList<>(la); res.addAll(lb); return res;}));
System.out.println(mapList);
//{ =[4], A=[0], B=[2], C=[5], D=[7], U=[1, 3, 6, 8]}
Map<Character, String> mapString =
IntStream.range(0, str.length())
.boxed()
.filter(i->(""+str.charAt(i)).matches("[\\S]"))
.collect(Collectors.toMap(i->str.charAt(i), i->Integer.toString(i), (sa,sb)-> sa+sb ));
System.out.println(mapString);
//{A=0, B=2, C=5, D=7, U=1368}
}

Find the most common word from user input

I'm very new to Java and am creating a software application that allows a user to input text into a field; the program then runs through all of the text and identifies the most common word. At the moment, my code looks like this:
JButton btnMostFrequentWord = new JButton("Most Frequent Word");
btnMostFrequentWord.addActionListener(new ActionListener() {
public void actionPerformed(ActionEvent e) {
String text = textArea.getText();
String[] words = text.split("\\s+");
HashMap<String, Integer> occurrences = new HashMap<String, Integer>();
for (String word : words) {
int value = 0;
if (occurrences.containsKey(word)) {
value = occurrences.get(word);
}
occurrences.put(word, value + 1);
}
JOptionPane.showMessageDialog(null, "Most Frequent Word: " + occurrences.values());
}
});
This just prints what the values of the words are, but I would like it to tell me what the number one most common word is instead. Any help would be really appreciated.

Just after your for loop, you can sort the map entries by value in descending order and select the first one.
for (String word: words) {
int value = 0;
if (occurrences.containsKey(word)) {
value = occurrences.get(word);
}
occurrences.put(word, value + 1);
}
Map.Entry<String,Integer> tempResult = occurrences.entrySet().stream()
.sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
.findFirst().get();
JOptionPane.showMessageDialog(null, "Most Frequent Word: " + tempResult.getKey());
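Since only the top entry is needed, a full sort is not strictly necessary; Stream.max with Map.Entry.comparingByValue() finds it in one pass over the entries. The sketch below assumes plain whitespace splitting as in the question; the class and method names are hypothetical, and ties are resolved arbitrarily.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical class name, used only for this sketch.
class MostFrequentWord {
    // Returns the most frequent whitespace-separated word, or empty if there are no words.
    static Optional<String> mostFrequent(String text) {
        Map<String, Integer> occurrences = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {                    // "".split(...) yields one empty string
                occurrences.merge(word, 1, Integer::sum);
            }
        }
        return occurrences.entrySet().stream()
                .max(Map.Entry.comparingByValue())    // entry with the highest count
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        System.out.println(mostFrequent("one fish two fish red fish blue fish").orElse("(no words)"));
        // prints fish
    }
}
```

Returning an Optional avoids the NoSuchElementException that an unconditional get() would throw when the text area is empty.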

For anyone who is more familiar with Java, here is a compact way to do it with Java 8:
List<String> words = Arrays.asList(text.split("\\s+"));
Collections.sort(words, Comparator.comparingInt(word -> {
return Collections.frequency(words, word);
}).reversed());
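The most common word is stored in words.get(0) after sorting. Note that this calls Collections.frequency for every comparison, so it does far more work than a single counting pass over the words.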

I would do something like this
int max = 0;
String a = null;
for (String word : words) {
int value = 0;
if(occurrences.containsKey(word)){
value = occurrences.get(word);
}
occurrences.put(word, value + 1);
if(max < value+1){
max = value+1;
a = word;
}
}
System.out.println(a);
You could sort it, and the solution would be much shorter, but I think this runs faster.

You can either iterate through the occurrences map afterwards and find the max, or track the maximum while counting, like below:
String text = textArea.getText();
String[] words = text.split("\\s+");
HashMap<String, Integer> occurrences = new HashMap<>();
int mostFreq = -1;
String mostFreqWord = null;
for (String word : words) {
int value = 0;
if (occurrences.containsKey(word)) {
value = occurrences.get(word);
}
value = value + 1;
occurrences.put(word, value);
if (value > mostFreq) {
mostFreq = value;
mostFreqWord = word;
}
}
JOptionPane.showMessageDialog(null, "Most Frequent Word: " + mostFreqWord);

Replace multiple substrings in a string, Array vs HashMap

There are two functions defined below. They do exactly the same thing: each takes a template (in which one wants to replace some substrings) and an array of string values (key-value pairs to replace, e.g. [subStrToReplace1, value1, subStrToReplace2, value2, ...]) and returns the string with the replacements applied.
In the second function I iterate over the words of the template and look up each one in the HashMap, then move on to the next word. If I want to replace a word with some substring that I then want to replace again using another key in values, I need to iterate over the template twice. That's what I did.
I would like to know which one I should use, and why. Any better alternatives are also welcome.
1st function
public static String populateTemplate1(String template, String... values) {
String populatedTemplate = template;
for (int i = 0; i < values.length; i += 2) {
populatedTemplate = populatedTemplate.replace(values[i], values[i + 1]);
}
return populatedTemplate;
}
2nd function
public static String populateTemplate2(String template, String... values) {
HashMap<String, String> map = new HashMap<>();
for (int i = 0; i < values.length; i += 2) {
map.put(values[i],values[i+1]);
}
StringBuilder regex = new StringBuilder();
boolean first = true;
for (String word : map.keySet()) {
if (first) {
first = false;
} else {
regex.append('|');
}
regex.append(Pattern.quote(word));
}
Pattern pattern = Pattern.compile(regex.toString());
int N0OfIterationOverTemplate =2;
// Pattern allowing to extract only the words
// Pattern pattern = Pattern.compile("\\w+");
StringBuilder populatedTemplate = new StringBuilder();
String temp_template=template;
while(N0OfIterationOverTemplate!=0){
populatedTemplate = new StringBuilder();
Matcher matcher = pattern.matcher(temp_template);
int fromIndex = 0;
while (matcher.find(fromIndex)) {
// The start index of the current word
int startIdx = matcher.start();
if (fromIndex < startIdx) {
// Add what we have between two words
populatedTemplate.append(temp_template, fromIndex, startIdx);
}
// The current word
String word = matcher.group();
// Replace the word by itself or what we have in the map
// populatedTemplate.append(map.getOrDefault(word, word));
if (map.get(word) == null) {
populatedTemplate.append(word);
}
else {
populatedTemplate.append(map.get(word));
}
// Start the next find from the end index of the current word
fromIndex = matcher.end();
}
if (fromIndex < temp_template.length()) {
// Add the remaining sub String
populatedTemplate.append(temp_template, fromIndex, temp_template.length());
}
N0OfIterationOverTemplate--;
temp_template=populatedTemplate.toString();
}
return populatedTemplate.toString();
}

Definitely the first one, for at least two reasons:
It is shorter and easier to read, so it is easier to maintain and much less error-prone.
You don't rely on a regular expression, so it is faster by far.

The first function is much clearer and easier to understand. I would prefer it unless you find out (by a profiler) that it takes a considerable amount of time and slows your application down. Then you can figure out how to optimize it.
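One thing to keep in mind when choosing: the two approaches are not equivalent in behavior. Sequential String.replace calls, as in populateTemplate1, can replace text that an earlier replacement just inserted, while a single regex pass never re-reads its own output. The sketch below illustrates the difference; the class and method names are hypothetical, keys are assumed to be non-empty, and when one key is a prefix of another the one listed first wins.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class name, used only for this sketch.
class TemplateDemo {
    // Same idea as populateTemplate1: later pairs also see earlier replacements.
    static String sequential(String template, String... values) {
        String result = template;
        for (int i = 0; i + 1 < values.length; i += 2) {
            result = result.replace(values[i], values[i + 1]);
        }
        return result;
    }

    // One pass over the template: inserted text is never replaced again.
    static String singlePass(String template, String... values) {
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i + 1 < values.length; i += 2) {
            map.put(values[i], values[i + 1]);
        }
        if (map.isEmpty()) {
            return template;
        }
        String regex = map.keySet().stream()
                .map(Pattern::quote)                  // treat keys as literals
                .collect(Collectors.joining("|"));
        Matcher matcher = Pattern.compile(regex).matcher(template);
        StringBuffer out = new StringBuffer();        // StringBuffer keeps this Java 8 compatible
        while (matcher.find()) {
            // quoteReplacement prevents '$' and '\' in values from being interpreted
            matcher.appendReplacement(out, Matcher.quoteReplacement(map.get(matcher.group())));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String[] values = {"{x}", "{y}", "{y}", "done"};
        System.out.println(sequential("{x} and {y}", values)); // prints done and done
        System.out.println(singlePass("{x} and {y}", values)); // prints {y} and done
    }
}
```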

Why make things complicated when you can keep them simple?
Keep in mind that simple solutions tend to be the best.
FYI, if the number of elements is odd, you will get an ArrayIndexOutOfBoundsException.
I propose this improvement:
public static String populateTemplate(String template, String... values) {
String populatedTemplate = template;
int nextTarget = 2;
int lastTarget = values.length - nextTarget;
for (int i = 0; i <= lastTarget; i += nextTarget) {
String target = values[i];
String replacement = values[i + 1];
populatedTemplate = populatedTemplate.replace(target, replacement);
}
return populatedTemplate;
}
"Good programmers write code that humans can understand". Martin Fowler

Remove repeated words from String Array

Good Morning
I wrote a function that calculates the frequency of a term:
public static int tfCalculator(String[] totalterms, String termToCheck) {
int count = 0; //to count the overall occurrence of the term termToCheck
for (String s : totalterms) {
if (s.equalsIgnoreCase(termToCheck)) {
count++;
}
}
return count;
}
After that I use it in the code below to calculate the frequency of every word in a String[] words:
for(String word:words){
int freq = tfCalculator(words, word);
System.out.println(word + "|" + freq);
mm+=word + "|" + freq+"\n";
}
The problem I have is that the words are repeated. Here is an example of the result:
cytoskeletal|2
network|1
enable|1
equal|1
spindle|1
cytoskeletal|2
...
...
Can someone help me remove the repeated words and get a result like this:
cytoskeletal|2
network|1
enable|1
equal|1
spindle|1
...
...
Thank you very much!

Java 8 solution
words = Arrays.stream(words).distinct().toArray(String[]::new);
The distinct method removes duplicates; words is replaced with a new array without duplicates.

I think you want to print the frequency of each string in the array totalterms. Using a Map is an easier solution, because a single traversal of the array stores the frequency of all the strings. Check the following implementation.
public static void printFrequency(String[] totalterms)
{
// Typed map instead of a raw Map, so no cast is needed below
Map<String, Integer> frequencyMap = new HashMap<>();
for (String string : totalterms) {
if(frequencyMap.containsKey(string))
{
Integer count = frequencyMap.get(string);
frequencyMap.put(string, count+1);
}
else
{
frequencyMap.put(string, 1);
}
}
Set <Entry<String, Integer>> elements= frequencyMap.entrySet();
for (Entry<String, Integer> entry : elements) {
System.out.println(entry.getKey()+"|"+entry.getValue());
}
}
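With Java 8, the containsKey/else branch can be folded into Map.merge. The sketch below makes two assumptions that are not part of the answer above: it lower-cases terms so that counting matches the equalsIgnoreCase comparison in tfCalculator, and it uses a LinkedHashMap so that words are printed in first-seen order. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical class name, used only for this sketch.
class TermFrequency {
    // Counts each term once per occurrence, ignoring case, keeping first-seen order.
    static Map<String, Integer> frequencies(String[] terms) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String term : terms) {
            freq.merge(term.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        String[] words = {"cytoskeletal", "network", "enable", "equal", "spindle", "cytoskeletal"};
        frequencies(words).forEach((word, count) -> System.out.println(word + "|" + count));
        // prints:
        // cytoskeletal|2
        // network|1
        // enable|1
        // equal|1
        // spindle|1
    }
}
```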

You can just use a HashSet and that should take care of the duplicates issue:
words = new HashSet<String>(Arrays.asList(words)).toArray(new String[0]);
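This will take your array, convert it to a List, feed that to the constructor of HashSet<String>, and then convert it back to an array for you. A HashSet does not preserve the original order of the words; use a LinkedHashSet instead if the order matters.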

Sort the array, then you can just count equal adjacent elements:
Arrays.sort(totalterms);
int i = 0;
while (i < totalterms.length) {
int start = i;
while (i < totalterms.length && totalterms[i].equals(totalterms[start])) {
++i;
}
System.out.println(totalterms[start] + "|" + (i - start));
}

In two lines:
String s = "cytoskeletal|2 - network|1 - enable|1 - equal|1 - spindle|1 - cytoskeletal|2";
System.out.println(new LinkedHashSet<>(Arrays.asList(s.split(" - "))).toString().replaceAll("(^\\[|\\]$)", "").replace(", ", " - "));

Your code is fine; you just need to keep track of which words have already been encountered. For that you can keep a running set:
Set<String> prevWords = new HashSet<>();
for(String word:words){
// proceed if word is new to the set, otherwise skip
if (prevWords.add(word)) {
int freq = tfCalculator(words, word);
System.out.println(word + "|" + freq);
mm+=word + "|" + freq+"\n";
}
}

Best way of scanning for letter combinations

So let's say I have a 32 character string like this:
GCAAAGCTTGGCACACGTCAAGAGTTGACTTT
My goal is to count all occurrences of specific substrings, such as 'AA' 'ATT' 'CGG' and so on. For this purpose, the 3rd through 5th characters above contain 2 occurrences of 'AA'. There are a total of 8 of these substrings, 6 that are 3 characters in length and 2 that are 2 characters in length, and I would want counts for all eight.
What would be the most efficient way of doing this in Java? My thoughts follow a couple lines:
1. Scan through character by character, checking and flagging for each substring. This seems intensive and inefficient.
2. Find some existing function that would do the work (not sure of efficiency of what function it would be, String.contains is a boolean, not a count).
3. Scan through the string multiple times, each sweep checking for a different substring.
The implementation of 3 is trivial, but 1 might give a few extra headaches and won't be very clean code.

I think this should answer your question.
The naive approach (checking for the substring at each possible index) runs in O(nk), where n is the length of the string and k is the length of the substring. This could be implemented with a for-loop and something like haystack.substring(i).startsWith(needle).
More efficient algorithms exist, though. You may want to have a look at the Knuth-Morris-Pratt algorithm or the Aho-Corasick algorithm. As opposed to the naive approach, both of these algorithms also behave well on input like "look for the substring of 100 'X's in a string of 10000 'X's".
Taken from stackoverflow.com/questions/4121875/count-of-substrings-within-string
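A minimal sketch of the naive approach described in the quote, counting overlapping occurrences of one substring. It uses String.startsWith(prefix, offset) rather than substring(i).startsWith(...) to avoid creating a new string at every index; the class and method names are hypothetical.

```java
// Hypothetical class name, used only for this sketch.
class SubstringCount {
    // Counts occurrences of needle in haystack, including overlapping ones. O(n*k) worst case.
    static int countOccurrences(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0; // by convention here, an empty needle is not counted
        }
        int count = 0;
        for (int i = 0; i + needle.length() <= haystack.length(); i++) {
            if (haystack.startsWith(needle, i)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String dna = "GCAAAGCTTGGCACACGTCAAGAGTTGACTTT";
        System.out.println("AA: " + countOccurrences(dna, "AA")); // prints AA: 3
        System.out.println("TT: " + countOccurrences(dna, "TT")); // prints TT: 4
    }
}
```

For a 32-character string and eight short substrings, this loop run once per substring is likely fast enough that the more advanced algorithms only pay off on much larger inputs.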

One approach is to essentially code up an NFA (http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton)
and just run your input on the NFA.
Here's my attempt at coding an NFA. You'd probably want to convert to a DFA first before running it so that you don't have to manage a bunch of branches. With the branches it's basically as slow as O(nk), whereas if you convert to a DFA it would be O(n)
import java.util.*;
public class Test
{
public static void main (String[] args)
{
new Test();
}
private static final String input = "TAAATGGAGGTAATAGAGGAGGTGTAT";
private static final String[] substrings = new String[] { "AA", "AG", "GG", "GAG", "TA" };
private static final int[] occurrences = new int[substrings.length];
public Test()
{
ArrayList<Branch> branches = new ArrayList<Branch>();
// For each character, read it, create branches for each substring, and pass the current character
// to each active branch
for (int i = 0; i < input.length(); i++)
{
char c = input.charAt(i);
// Make a new branch, one for each substring that we are searching for
for (int j = 0; j < substrings.length; j++)
branches.add(new Branch(substrings[j], j, branches));
// Pass the current input character to each branch that is still alive
// Iterate in reverse order because the nextCharacter method may
// cause the branch to be removed from the ArrayList
for (int j = branches.size()-1; j >= 0; j--)
branches.get(j).nextCharacter(c);
}
for (int i = 0; i < occurrences.length; i++)
System.out.println(substrings[i]+": "+occurrences[i]);
}
private static class Branch
{
private String searchFor;
private int position, index;
private ArrayList<Branch> parent;
public Branch(String searchFor, int searchForIndex, ArrayList<Branch> parent)
{
this.parent = parent;
this.searchFor = searchFor;
this.position = 0;
this.index = searchForIndex;
}
public void nextCharacter(char c)
{
// If the current character matches the ith character of the string we are searching for,
// Then this branch will stay alive
if (c == searchFor.charAt(position))
position++;
// Otherwise the substring didn't match, so this branch dies
else
suicide();
// Reached the end of the substring, so the substring was found.
if (position == searchFor.length())
{
occurrences[index] += 1;
suicide();
}
}
private void suicide()
{
parent.remove(this);
}
}
}
The output for this example is:
AA: 3
AG: 4
GG: 4
GAG: 3
TA: 4
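When the substrings are short and have only a few distinct lengths, as in the question (lengths 2 and 3), a simpler single-pass alternative to an automaton is to look up each window of those lengths in a map. This is a sketch of that swapped-in technique, not of the NFA above; the class and method names are hypothetical and the cost is O(n) windows per distinct length.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical class name, used only for this sketch.
class MultiSubstringCount {
    // Counts overlapping occurrences of every pattern in one pass over the input.
    static Map<String, Integer> countAll(String input, String... patterns) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        TreeSet<Integer> lengths = new TreeSet<>();   // distinct pattern lengths, ascending
        for (String p : patterns) {
            if (p.isEmpty()) {
                continue;                             // empty patterns are ignored
            }
            counts.put(p, 0);
            lengths.add(p.length());
        }
        for (int i = 0; i < input.length(); i++) {
            for (int len : lengths) {
                if (i + len > input.length()) {
                    break;                            // longer windows will not fit either
                }
                // Increment only if this window is one of the patterns we track
                counts.computeIfPresent(input.substring(i, i + len), (k, v) -> v + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countAll("TAAATGGAGGTAATAGAGGAGGTGTAT", "AA", "AG", "GG", "GAG", "TA"));
        // prints {AA=3, AG=4, GG=4, GAG=3, TA=4}
    }
}
```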

Do you want to find all possible substrings that are longer than 1 character?
In that case one approach is to use HashMaps.
This example outputs:
{AA=3, TT=4, AC=3, CTT=2, CAA=2, GCA=2, CAC=2, AG=3, TTG=2, AAG=2, GT=2, CT=2, TG=2, GA=2, GC=3, CA=4}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map; // needed for Map.Entry below
public class Test {
public static void main(String[] args) {
String str = "GCAAAGCTTGGCACACGTCAAGAGTTGACTTT";
HashMap<String, Integer> map = countMatches(str);
System.out.println(map);
}
private static HashMap<String, List<Integer>> findOneLetterMatches(String str) {
ArrayList<Integer> list = new ArrayList<>();
for(int i = 0; i < str.length(); i++) list.add(i);
return extendMatches(str, list, 1);
}
private static HashMap<String, List<Integer>> extendMatches(String str, List<Integer> indices, int targetLength) {
HashMap<String, List<Integer>> map = new HashMap<>();
for(int index: indices) {
if(index+targetLength <= str.length()) {
String s = str.substring(index, index + targetLength);
List<Integer> list = map.get(s);
if(list == null) {
list = new ArrayList<>();
map.put(s, list);
}
list.add(index);
}
}
return map;
}
private static void addIfListLongerThanOne(HashMap<String, List<Integer>> source,
HashMap<String, List<Integer>> target) {
for(Map.Entry<String, List<Integer>> e: source.entrySet()) {
String s = e.getKey();
List<Integer> l = e.getValue();
if(l.size() > 1) target.put(s, l);
}
}
private static HashMap<String, List<Integer>> extendAllMatches(String str, HashMap<String, List<Integer>> map, int targetLength) {
HashMap<String, List<Integer>> result = new HashMap<>();
for(List<Integer> list: map.values()) {
HashMap<String, List<Integer>> m = extendMatches(str, list, targetLength);
addIfListLongerThanOne(m, result);
}
return result;
}
private static HashMap<String, Integer> countMatches(String str) {
HashMap<String, Integer> result = new HashMap<>();
HashMap<String, List<Integer>> matches = findOneLetterMatches(str);
for(int targetLength = 2; !matches.isEmpty(); targetLength++) {
HashMap<String, List<Integer>> m = extendAllMatches(str, matches, targetLength);
for(Map.Entry<String, List<Integer>> e: m.entrySet()) {
String s = e.getKey();
List<Integer> l = e.getValue();
result.put(s, l.size());
}
matches = m;
}
return result;
}
}
