Words frequency in percentage java

I have to write a program that computes word frequencies from a LinkedList and prints each result like this:
word, number of occurrences, frequency in percentage
import java.io.File;
import java.io.FileNotFoundException;
import java.util.*;
public class Link {
public static void main(String args[]) {
long start = System.currentTimeMillis();
LinkedList<String> list = new LinkedList<String>();
File file = new File("words.txt");
try {
Scanner sc = new Scanner(file);
String words;
while (sc.hasNext()) {
words = sc.next();
words = words.replaceAll("[^a-zA-Z0-9]", "");
words = words.toLowerCase();
words = words.trim();
list.add(words);
}
sc.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
}
Map<String, Integer> frequency = new TreeMap<String, Integer>();
for (String count : list) {
if (frequency.containsKey(count)) {
frequency.put(count, frequency.get(count) + 1);
} else {
frequency.put(count, 1);
}
}
System.out.println(frequency);
long end = System.currentTimeMillis();
System.out.println("\n" + "Duration: " + (end - start) + " ms");
}
}
Output : {a=1, ab=3, abbc=1, asd=2, xyz=1}
What I don't know is how to compute the frequency as a percentage and how to ignore words shorter than 2 characters. For example, "a=1" should be ignored.
Thanks in advance.

First, introduce a double variable to keep track of the total number of occurrences (a double, so that the division further down is not integer division). E.g.
double total = 0;
Next is to filter out any String with length() < 2. You can already do this before adding them to your LinkedList.
while (sc.hasNext()) {
words = sc.next();
words = words.replaceAll("[^a-zA-Z0-9]", "");
words = words.toLowerCase();
words = words.trim();
if (words.length() >= 2) list.add(words); //Filter out strings < 2 chars
}
Now, when going through your Strings, increase the total variable by 1 for each occurrence, like so:
for (String count : list) {
if (frequency.containsKey(count)) {
frequency.put(count, frequency.get(count) + 1);
} else {
frequency.put(count, 1);
}
total++; //Increase total number of occurrences
}
We can then use System.out.printf() to print it all out nicely.
for (Map.Entry<String, Integer> entry: frequency.entrySet()) {
System.out.printf("String: %s \t Occurrences: %d \t Percentage: %.2f%%%n", entry.getKey(), entry.getValue(), entry.getValue()/total*100);
}
Note that the printf output will no longer line up once you are working with long Strings or large counts. Optionally, you could do the following, given that maxLength contains the largest length() of any String in your list and occLength contains the number of digits of the largest count.
for (Map.Entry<String, Integer> entry: frequency.entrySet()) {
System.out.printf("String: %" + maxLength + "s Occurrences: %" + occLength + "d Percentage: %.2f%%%n", entry.getKey(), entry.getValue(), entry.getValue()/total*100);
}
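The answer above leaves open how maxLength and occLength are obtained. A minimal sketch of one way to derive them from the data is shown here; the class name AlignedFrequencyReport and the method report are made up for illustration, and the column layout is an assumption rather than the answerer's exact format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class AlignedFrequencyReport {

    // Builds one formatted line per distinct word; column widths are derived from the data.
    static List<String> report(List<String> words) {
        Map<String, Integer> frequency = new TreeMap<>();
        for (String w : words) {
            frequency.merge(w, 1, Integer::sum);
        }
        double total = words.size(); // double, so the division below is not integer division

        int maxLength = 1; // length of the longest word
        int occLength = 1; // number of digits in the largest count
        for (Map.Entry<String, Integer> e : frequency.entrySet()) {
            maxLength = Math.max(maxLength, e.getKey().length());
            occLength = Math.max(occLength, String.valueOf(e.getValue()).length());
        }

        // Left-align the word, right-align the count, two decimals for the percentage.
        String format = "%-" + maxLength + "s  %" + occLength + "d  %6.2f%%";
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : frequency.entrySet()) {
            lines.add(String.format(Locale.ROOT, format,
                    e.getKey(), e.getValue(), e.getValue() / total * 100));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Words from the question's sample output, with the one-character word already filtered out.
        report(Arrays.asList("ab", "asd", "ab", "xyz", "abbc", "asd", "ab"))
                .forEach(System.out::println);
    }
}
```

Locale.ROOT is used so that the decimal separator is always a dot, regardless of the machine's default locale.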

Ignore strings shorter than 2 characters while adding to the map, and maintain a counter of legal words for calculating the percentage.
int legalWords = 0;
for (String count: list) {
if (count.length() >= 2) { // String has length(), not size()
if (frequency.containsKey(count)) {
frequency.put(count, frequency.get(count) + 1);
} else {
frequency.put(count, 1);
}
legalWords++;
}
}
for (Map.Entry<String, Integer> entry: frequency.entrySet()) {
System.out.println(entry.getKey() + " " + entry.getValue() + " " + (entry.getValue() / (double) legalWords) * 100.0 + "%");
}

Note: As the OP's question does not give details, let's assume that one-character words are counted but not printed.
Separate your logic from your main class:
class WordStatistics {
private String word;
private long occurrences;
private float frequency;
public WordStatistics(String word){
this.word=word;
}
public WordStatistics calculateOccurrences(List<String> words) {
this.occurrences = words.stream()
.filter(p -> p.equalsIgnoreCase(this.word)).count();
return this;
}
public WordStatistics calculateFrequency(List<String> words) {
this.frequency = (float) this.occurrences / words.size() * 100;
return this;
}
// getters and setters
}
Explanation:
Considering this list of words:
List<String> words = Arrays.asList("Java", "C++", "R", "php", "Java",
"C", "Java", "C#", "C#","Java","R");
Count the occurrences of word in words using the Java 8 Streams API:
words.stream()
.filter(p -> p.equalsIgnoreCase(word)).count();
Calculate word's frequency percentage:
frequency = (float) occurrences / words.size() * 100;
Setting up your words' statistics (occurrences + frequency):
List<WordStatistics> wordsStatistics = new LinkedList<WordStatistics>();
words.stream()
.distinct()
.forEach(
word -> wordsStatistics.add(new WordStatistics(word)
.calculateOccurrences(words)
.calculateFrequency(words)));
Output the results, ignoring one-character words:
wordsStatistics
.stream()
.filter(word -> word.getWord().length() > 1)
.forEach(
word -> System.out.printf("Word : %s \t"
+ "Occurrences : %d \t"
+ "Frequency : %.2f%% \t\n", word.getWord(),
word.getOccurrences(), word.getFrequency()));
Output:
Word : C# Occurrences : 2 Frequency : 18.18%
Word : Java Occurrences : 4 Frequency : 36.36%
Word : C++ Occurrences : 1 Frequency : 9.09%
Word : php Occurrences : 1 Frequency : 9.09%
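Note that calculateOccurrences scans the whole list once per distinct word. As a hedged alternative, the same numbers can be collected in a single pass with Collectors.groupingBy; the class SinglePassWordStats and its two methods are illustrative names, not part of the answer above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class SinglePassWordStats {

    // Counts every word in one pass, lower-casing it so that "Java" and "java" share a key.
    static Map<String, Long> countWords(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(
                        w -> w.toLowerCase(Locale.ROOT), TreeMap::new, Collectors.counting()));
    }

    // Percentage of the whole list, which still includes the one-character words.
    static double percentage(long occurrences, int totalWords) {
        return 100.0 * occurrences / totalWords;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Java", "C++", "R", "php", "Java",
                "C", "Java", "C#", "C#", "Java", "R");
        countWords(words).forEach((word, n) -> {
            if (word.length() > 1) { // count one-character words, but do not print them
                System.out.printf(Locale.ROOT, "Word : %s \tOccurrences : %d \tFrequency : %.2f%%%n",
                        word, n, percentage(n, words.size()));
            }
        });
    }
}
```

Because the keys are lower-cased and kept in a TreeMap, the words print in alphabetical order rather than in order of first appearance.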

Use a simple data structure to make this easier.
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
public class WordCounter {
private int wordTotal;
private Map<String, Integer> wordCount = new HashMap<>();
public WordCounter(List<String> words) {
wordTotal = words.size();
for (String word : words) {
wordCount.put(word, wordCount.getOrDefault(word, 0) + 1);
}
}
public Map<String, Double> getPercentageByWord() {
return wordCount.entrySet()
.stream()
.collect(Collectors.toMap(e -> e.getKey(),
e -> getPercentage(e.getValue())));
}
private double getPercentage(double count) {
return (count / wordTotal) * 100;
}
}
Here's a test that uses it.
@Test
public void testWordCount() {
List<String> words = Arrays.asList("a", "a", "a", "a", "b", "b", "c", "d", "e", "f");
WordCounter counter = new WordCounter(words);
Map<String, Double> results = counter.getPercentageByWord();
assertThat(results).hasSize(6);
assertThat(results).containsEntry("a", 40.0);
assertThat(results).containsEntry("b", 20.0);
assertThat(results).containsEntry("c", 10.0);
assertThat(results).containsEntry("d", 10.0);
assertThat(results).containsEntry("e", 10.0);
assertThat(results).containsEntry("f", 10.0);
}

Related

How can I convert a String into ArrayList by counting occurrence of each characters?

I have an input String:
String str="1,1,2,2,2,1,3";
I want to count the occurrences of each id and store them in a List, with output like this:
[
{
"count": "3",
"ids": "1, 2"
}
{
"count": "1",
"ids": "3"
}
]
I tried org.springframework.util.StringUtils.countOccurrencesOf(input, "a"), but the counts it returns are not in the shape I want.

This will give you the desired result. You first count the occurrences of each character, then you group the characters by count in a new HashMap<Integer, List<String>>.
Here's a working example:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
public class Test {
public static void main(String[] args) {
String str = "1,1,2,2,2,1,3";
String[] list = str.split(",");
HashMap<String, Integer> occr = new HashMap<>();
for (int i = 0; i < list.length; i++) {
if (occr.containsKey(list[i])) {
occr.put(list[i], occr.get(list[i]) + 1);
} else {
occr.put(list[i], 1);
}
}
HashMap<Integer, List<String>> res = new HashMap<>();
for (String key : occr.keySet()) {
int count = occr.get(key);
if (res.containsKey(count)) {
res.get(count).add(key);
} else {
List<String> l = new ArrayList<>();
l.add(key);
res.put(count, l);
}
}
StringBuffer sb = new StringBuffer();
sb.append("[\n");
for (Integer count : res.keySet()) {
sb.append("{\n");
List<String> finalList = res.get(count);
sb.append("\"count\":\"" + count + "\",\n");
sb.append("\"ids\":\"" + finalList.get(0));
for (int i = 1; i < finalList.size(); i++) {
sb.append("," + finalList.get(i));
}
sb.append("\"\n}\n");
}
sb.append("\n]");
System.out.println(sb.toString());
}
}
EDIT: A more generalised solution
Here's a method that returns a HashMap<Integer, List<String>>. Each key is a number of occurrences, and its List<String> value contains all the strings that occur that many times.
public HashMap<Integer, List<String>> countOccurrences(String str, String delimiter) {
// First, we count the number of occurrences of each string.
String[] list = str.split(delimiter);
HashMap<String, Integer> occr = new HashMap<>();
for (int i = 0; i < list.length; i++) {
if (occr.containsKey(list[i])) {
occr.put(list[i], occr.get(list[i]) + 1);
} else {
occr.put(list[i], 1);
}
}
/** Now, we group them by the number of occurrences,
* All strings with the same number of occurrences are put into a list;
* this list is put into a HashMap as a value, with the number of
* occurrences as a key.
*/
HashMap<Integer, List<String>> res = new HashMap<>();
for (String key : occr.keySet()) {
int count = occr.get(key);
if (res.containsKey(count)) {
res.get(count).add(key);
} else {
List<String> l = new ArrayList<>();
l.add(key);
res.put(count, l);
}
}
return res;
}
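The grouped map still has to be turned into the list of count/ids items the question asks for. The following is only a sketch of that last step: the class GroupedIds and the method groupByCount are hypothetical names, and the "highest count first, ids sorted" ordering is an assumption, since the question does not state one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupedIds {

    // Returns one {"count", "ids"} item per distinct count, highest count first.
    static List<Map<String, String>> groupByCount(String str, String delimiter) {
        // TreeMap keeps the ids sorted (as strings, so "10" sorts before "2").
        Map<String, Integer> occurrences = new TreeMap<>();
        for (String id : str.split(delimiter)) { // note: split() treats the delimiter as a regex
            occurrences.merge(id, 1, Integer::sum);
        }

        // Group the ids by their count, with the largest count first.
        Map<Integer, List<String>> byCount = new TreeMap<>(Comparator.reverseOrder());
        for (Map.Entry<String, Integer> e : occurrences.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }

        // Convert each group into a small map with the two fields from the question.
        List<Map<String, String>> result = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : byCount.entrySet()) {
            Map<String, String> item = new LinkedHashMap<>();
            item.put("count", String.valueOf(e.getKey()));
            item.put("ids", String.join(", ", e.getValue()));
            result.add(item);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(groupByCount("1,1,2,2,2,1,3", ","));
        // prints [{count=3, ids=1, 2}, {count=1, ids=3}]
    }
}
```

A JSON library could serialise the returned list directly; building the JSON text by hand, as in the first example, works only as long as the ids contain no characters that need escaping.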

You need to do some tedious conversion. I'm not sure whether you want to keep the ids sorted; a simple implementation is:
public List<Map<String, Object>> countFrequency(String s) {
// Count by char
Map<String, Integer> countMap = new HashMap<String, Integer>();
for (String ch : s.split(",")) {
Integer count = countMap.get(ch);
if (count == null) {
count = 0;
}
count++;
countMap.put(ch, count);
}
// Count by frequency
Map<Integer, String> countByFrequency = new HashMap<Integer, String>();
for (Map.Entry<String, Integer> entry : countMap.entrySet()) {
String chars = countByFrequency.get(entry.getValue());
System.out.println(entry.getValue() + " " + chars);
if (chars == null) {
chars = "" + entry.getKey();
} else {
chars += ", " + entry.getKey();
}
countByFrequency.put(entry.getValue(), chars);
}
// Convert to list
List<Map<String, Object>> result = new ArrayList<Map<String, Object>>();
for (Map.Entry<Integer, String> entry : countByFrequency.entrySet()) {
Map<String, Object> item = new HashMap<String, Object>();
item.put("count", entry.getKey());
item.put("ids", entry.getValue());
result.add(item);
}
return result;
}

Check the code below; it should help you achieve your expected result.
import java.util.*;
public class Test
{
public static void main(String args[])
{
String str = "1,1,2,2,2,1,3"; //Your input string
List<String> listOfIds = Arrays.asList(str.split(",")); //Splits the string
System.out.println("List of IDs : " + listOfIds);
HashMap<String, List<String>> map = new HashMap<>();
Set<String> uniqueIds = new HashSet<>(Arrays.asList(str.split(",")));
for (String uniqueId : uniqueIds)
{
String frequency = String.valueOf(Collections.frequency(listOfIds, uniqueId));
System.out.println("ID = " + uniqueId + ", frequency = " + frequency);
if (!map.containsKey(frequency))
{
map.put(frequency, new ArrayList<String>());
}
map.get(frequency).add(uniqueId);
}
for (Map.Entry<String, List<String>> entry : map.entrySet())
{
System.out.println("Count = "+ entry.getKey() + ", IDs = " + entry.getValue());
}
}
}

One approach I can suggest is to put each "character" in a HashMap as a key and its "count" as the value.
Sample code to do so:
String str = "1,1,2,2,2,1,3";
HashMap<String, String> map = new HashMap<>();
for (String c : str.split(",")) {
if (map.containsKey(c)) {
int count = Integer.parseInt(map.get(c));
map.put(c, ++count + "");
} else
map.put(c, "1");
}
System.out.println(map.toString());

First split the string on "," and store the parts in an array. Then iterate over the array: if an element is not yet in the map, put it in as a key with a count of 1; otherwise increase its count in the map.
Like this:
String str="1,1,2,2,2,1,3";
String[] strArray=str.split(",");
Map<String, Integer> strMap= new HashMap<>();
for(int i=0; i < strArray.length; i++){
if(!strMap.containsKey(strArray[i])){
strMap.put(strArray[i],1);
}else{
strMap.put(strArray[i],strMap.get(strArray[i])+1);
}
}

String str="1,1,2,2,2,1,3";
//Converting given string to string array
String[] strArray = str.split(",");
//Creating a HashMap containing each id as a key and its occurrences as a value
Map<String,Integer> charCountMap = new HashMap<String, Integer>();
//Checking each element of strArray
for(String num :strArray){
if(charCountMap.containsKey(num))
{
//If the id is present in charCountMap, increment its count by 1
charCountMap.put(num, charCountMap.get(num)+1);
}
else
{
//If the id is not present in charCountMap, add it with 1 as its value
charCountMap.put(num, 1);
}
}
//Printing the charCountMap
for (Map.Entry<String, Integer> entry : charCountMap.entrySet())
{
System.out.println("ID ="+entry.getKey() + " count=" + entry.getValue());
}

// Split according to comma (str is assumed to be the input string from the question)
String[] tokens = str.split(",");
HashMap<String, Integer> hm = new HashMap<String, Integer>();
for (String key : tokens) {
if (hm.containsKey(key)) {
Integer currentCount = hm.get(key);
hm.put(key, ++currentCount);
} else {
hm.put(key, 1);
}
}
// Organize the IDs according to their count
HashMap<Integer, String> result = new HashMap<Integer, String>();
for (Map.Entry<String, Integer> entry : hm.entrySet()) {
Integer newKey = entry.getValue();
if (result.containsKey(newKey)) {
String newValue = entry.getKey() + ", " + result.get(newKey);
result.put(newKey, newValue);
} else {
result.put(newKey, entry.getKey());
}
}

And here is a complete Java 8 streaming solution for the problem. The main idea is to first build a map of the occurrences of each id, which results in:
{1=3, 2=3, 3=1}
(the ID first, then the count), and then to group the entries by count:
public static void main(String[] args) {
String str = "1,1,2,2,2,1,3";
System.out.println(
Pattern.compile(",").splitAsStream(str)
.collect(groupingBy(identity(), counting()))
.entrySet().stream()
.collect(groupingBy(i -> i.getValue(), mapping( i -> i.getKey(), toList())))
);
}
which results in:
{1=[3], 3=[1, 2]}
This is the most compact version I could come up with. Is there anything even smaller?
EDIT: By the way here is the complete class, to get all static method imports right:
import static java.util.function.Function.identity;
import java.util.regex.Pattern;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.mapping;
import static java.util.stream.Collectors.toList;
public class Java8StreamsTest6 {
public static void main(String[] args) {
String str = "1,1,2,2,2,1,3";
System.out.println(
Pattern.compile(",").splitAsStream(str)
.collect(groupingBy(identity(), counting()))
.entrySet().stream()
.collect(groupingBy(i -> i.getValue(), mapping(i -> i.getKey(), toList())))
);
}
}
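The version above relies on HashMap iteration order, which is not guaranteed. As a sketch under that concern, passing a TreeMap factory to groupingBy makes the order explicit; the class SortedGrouping and the method idsByCount are illustrative names, and "largest count first" is an assumed ordering.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class SortedGrouping {

    // Maps each count to the ids that occur that many times, largest count first.
    static Map<Long, List<String>> idsByCount(String str) {
        return Arrays.stream(str.split(","))
                // id -> count, kept in a TreeMap so the ids are visited in sorted order
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()))
                .entrySet().stream()
                // count -> ids, kept in a TreeMap with reversed key order
                .collect(Collectors.groupingBy(
                        e -> e.getValue(),
                        () -> new TreeMap<Long, List<String>>(Comparator.reverseOrder()),
                        Collectors.mapping(e -> e.getKey(), Collectors.toList())));
    }

    public static void main(String[] args) {
        System.out.println(idsByCount("1,1,2,2,2,1,3"));
        // prints {3=[1, 2], 1=[3]}
    }
}
```

This is slightly longer than the original, but the printed order no longer depends on how the map implementation happens to lay out its buckets.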

Output one occurrence of a character in a string c#

This program should print each character of a string only once, followed by the number of times it occurs in that string. The output should be sorted in ascending order by the number of occurrences of each character. It works except for the (char)i part. Does it have something to do with ASCII codes?
Desired Output:
b: 1
d: 1
a: 2
s: 2
Code's output:
ü: 1
ý: 1
þ: 2
ÿ: 2
public class HuffmanCode {
static String string;
static Scanner input = new Scanner(System.in);
public static void main(String args[]){
System.out.print("Enter a string: ");
string = input.nextLine();
int count[] = countOccurence(string);
Arrays.sort(count);
for (int i = 0; i < count.length; i++) {
if (count[i] > 0)
System.out.println((char)i + ": " + count[i]);
}
}
public static int[] countOccurence(String str){
int counts[] = new int[256];
for(int i=0;i<str.length();i++){
char charAt = str.charAt(i);
counts[(int)charAt]++;
}
return counts;
}
}

In Java 8, you could use the Stream API and do something like this:
String input = "ababcabcd" ;
input.chars() // split the string to a stream of int representing the chars
.boxed() // convert to stream of Integer
.collect(Collectors.groupingBy(c->c,Collectors.counting())) // aggregate by counting the letters
.entrySet() // collection of entries (key, value), i.e. char, count
.stream() // corresponding stream
.sorted(Map.Entry.comparingByValue()) // sort by value, i.e. by number of occurrences of each letter
.forEach(e->System.out.println((char)(int)e.getKey() + ": " + e.getValue())); // Output the result
The result would be:
d: 1
c: 2
a: 3
b: 3
I hope it helps.
EDIT:
Suppose your input is
String input = "ababc\u0327abçd" ;
In that case we would have ababçabçd as input, and we need normalization to make sure that letters which are the same but have different representations are counted together. To achieve that, we preprocess the input string using Normalizer, which was introduced in JDK 6:
input = Normalizer.normalize(input, Form.NFC);
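To make the effect of the normalization step concrete, here is a small self-contained sketch. The class NormalizedCharCount and the method count are made-up names; the code counts UTF-16 code units, so characters outside the Basic Multilingual Plane would need codePoints() instead of chars().

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class NormalizedCharCount {

    // Normalizes to NFC first, so "c" + combining cedilla and the precomposed "ç" share one key.
    static Map<Character, Long> count(String input) {
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);
        return normalized.chars()
                .mapToObj(c -> (char) c)
                .collect(Collectors.groupingBy(c -> c, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // The first cedilla is written as 'c' + U+0327, the second as the single character U+00E7.
        Map<Character, Long> counts = count("ababc\u0327ab\u00e7d");
        counts.forEach((ch, n) -> System.out.println(ch + ": " + n));
    }
}
```

Without the normalize call, the same input would yield separate entries for c, the combining cedilla, and ç.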

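Create a list of (character, count) pairs and sort that instead of sorting count. Arrays.sort(count) moves the counts themselves, so an index no longer identifies a character: the four non-zero counts end up at indices 252 to 255, which print as ü, ý, þ, ÿ.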
List<int[]> list = new ArrayList<>();
for (int i = 0; i < count.length; i++) {
if (count[i] > 0)
list.add(new int[] {i , count[i]});
}
Collections.sort(list, Comparator.comparing(a -> a[1]));
for (int[] a : list) {
System.out.println((char)a[0] + ": " + a[1]);
}

You could use a TreeMap in combination with a custom Comparator.
Here's an example:
String test = "ABBCCCDDDDEEEEEFFFFFF";
Map<Character, Integer> map = new HashMap<>();
for (Character c : test.toCharArray()) {
if (!map.containsKey(c)) map.put(c, 0);
map.put(c, map.get(c) + 1);
}
Map<Character, Integer> tMap = new TreeMap<>(new MyComparator(map));
tMap.putAll(map);
for (Map.Entry<Character, Integer> entry : tMap.entrySet()) {
System.out.println(entry.getKey() + ": " + entry.getValue());
}
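And here's the implementation of MyComparator. Note that it never returns 0, so characters with equal counts are kept as separate entries, but lookups such as tMap.get(key) will not find anything; use the TreeMap for iteration only.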
class MyComparator implements Comparator<Object> {
Map<Character, Integer> map;
public MyComparator(Map<Character, Integer> map) {
this.map = map;
}
public int compare(Object o1, Object o2) {
if (map.get(o1).equals(map.get(o2)))
return 1;
else
return (map.get(o1)).compareTo(map.get(o2));
}
}
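A comparator that never returns 0 breaks the Comparator contract. As a hedged alternative that avoids the TreeMap altogether, the entries can be copied into a list and sorted by value; SortByCount and sortedByCount are illustrative names, and breaking ties by character is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SortByCount {

    // Returns the (character, count) entries in ascending order of count, ties broken by character.
    static List<Map.Entry<Character, Integer>> sortedByCount(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        List<Map.Entry<Character, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<Character, Integer>comparingByValue()
                .thenComparing(Map.Entry.<Character, Integer>comparingByKey()));
        return entries;
    }

    public static void main(String[] args) {
        for (Map.Entry<Character, Integer> e : sortedByCount("ABBCCCDDDDEEEEEFFFFFF")) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

For an input such as "sadbas" this yields b, d, a, s with counts 1, 1, 2, 2, which matches the desired output in the question.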

get the two most used words in sentence in java

How do I get the two most used words in a sentence? For example, in the code below, after it counts the total number of appearances of all the words, it should also display the two most used words.
import javax.swing.*;
import java.util.*;
import java.awt.event.*;
import java.util.Map;
import java.util.HashMap;
public class Tokenizer
{
public static void main(String[] args)
{
int index = 0; int tokenCount; int i =0;
Map<String,Integer> wordCount = new HashMap<String,Integer>();
Map<Integer,Integer> letterCount = new HashMap<Integer,Integer>();
String message="The Quick brown fox jumps over the lazy brown dog";
StringTokenizer string = new StringTokenizer(message);
tokenCount = string.countTokens();
System.out.println("Number of tokens = " + tokenCount);
while (string.hasMoreTokens()) {
String word = string.nextToken().toLowerCase();
Integer count = wordCount.get(word);
Integer lettercount = letterCount.get(word);
if(count == null) {
wordCount.put(word, 1);
}
else {
wordCount.put(word, count + 1);
}
}
for (String words : wordCount.keySet())
{
System.out.println("Word : " + words + " has count :" +wordCount.get(words));
}
}
}

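Iterate through the HashMap and keep track of the two highest counts (map is the word-count map, called wordCount in the question).
int first, second;
first = second = Integer.MIN_VALUE;
String firstWord = null, secondWord = null; // initialised so that the code compiles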
for (Map.Entry<String, Integer> entry : map.entrySet())
{
int count = entry.getValue();
String word = entry.getKey();
if (count > first)
{
second = first;
secondWord = firstWord;
first = count;
firstWord = word;
}
else if (count > second && count != first)
{
second = count;
secondWord = word;
}
}
System.out.println(firstWord + " " + first);
System.out.println(secondWord + " " + second);

You need to iterate over the map's entry set.
This gives you the entry that holds the key with the maximum value.
Map.Entry<String, Integer> max = null;
for (Map.Entry<String, Integer> entry : map.entrySet())
{
if (max == null || entry.getValue().compareTo(max .getValue()) > 0)
{
max = entry;
}
}
For the second most used word, you can remove the max entry from the map and run the same loop again to retrieve the next one.
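Instead of removing the maximum and scanning again, the entries can be sorted by count and cut off after two. This is only a sketch: TopWords and topN are hypothetical names, and ties are broken alphabetically as an assumption, since the question does not say how ties should be handled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

class TopWords {

    // Returns the n most frequent words, highest count first, ties in alphabetical order.
    static List<Map.Entry<String, Integer>> topN(String message, int n) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (String word : message.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!word.isEmpty()) { // an empty message splits into one empty token
                wordCount.merge(word, 1, Integer::sum);
            }
        }
        return wordCount.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String message = "The Quick brown fox jumps over the lazy brown dog";
        System.out.println(topN(message, 2));
        // prints [brown=2, the=2]
    }
}
```

Sorting costs O(n log n) instead of the single O(n) pass shown earlier, which is rarely a concern for a sentence-sized input.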

Java Inverted Index program

I am writing an inverted index program in Java which returns the frequency of terms among multiple documents. I have been able to return the number of times a word appears in the entire collection, but I have not been able to return which documents the word appears in. This is the code I have so far:
import java.util.*; // Provides TreeMap, Iterator, Scanner
import java.io.*; // Provides FileReader, FileNotFoundException
public class Run
{
public static void main(String[ ] args)
{
// **THIS CREATES A TREE MAP**
TreeMap<String, Integer> frequencyData = new TreeMap<String, Integer>( );
Map[] mapArray = new Map[5];
mapArray[0] = new HashMap<String, Integer>();
readWordFile(frequencyData);
printAllCounts(frequencyData);
}
public static int getCount(String word, TreeMap<String, Integer> frequencyData)
{
if (frequencyData.containsKey(word))
{ // The word has occurred before, so get its count from the map
return frequencyData.get(word); // Auto-unboxed
}
else
{ // No occurrences of this word
return 0;
}
}
public static void printAllCounts(TreeMap<String, Integer> frequencyData)
{
System.out.println("-----------------------------------------------");
System.out.println(" Occurrences Word");
for(String word : frequencyData.keySet( ))
{
System.out.printf("%15d %s\n", frequencyData.get(word), word);
}
System.out.println("-----------------------------------------------");
}
public static void readWordFile(TreeMap<String, Integer> frequencyData)
{
int total = 0;
Scanner wordFile;
String word; // A word read from the file
Integer count; // The number of occurrences of the word
int counter = 0;
int docs = 0;
//**FOR LOOP TO READ THE DOCUMENTS**
for(int x=0; x<Docs.length; x++)
{ //start of for loop [*
try
{
wordFile = new Scanner(new FileReader(Docs[x]));
}
catch (FileNotFoundException e)
{
System.err.println(e);
return;
}
while (wordFile.hasNext( ))
{
// Read the next word and get rid of the end-of-line marker if needed:
word = wordFile.next( );
// This makes the Word lower case.
word = word.toLowerCase();
word = word.replaceAll("[^a-zA-Z0-9\\s]", "");
// Get the current count of this word, add one, and then store the new count:
count = getCount(word, frequencyData) + 1;
frequencyData.put(word, count);
total = total + count;
counter++;
docs = x + 1;
}
} //End of for loop *]
System.out.println("There are " + total + " terms in the collection.");
System.out.println("There are " + counter + " unique terms in the collection.");
System.out.println("There are " + docs + " documents in the collection.");
}
// Array of documents
static String Docs [] = {"words.txt", "words2.txt",};
}

Instead of simply having a Map from word to count, create a Map from each word to a nested Map from document to count. In other words:
Map<String, Map<String, Integer>> wordToDocumentMap;
Then, inside your loop which records the counts, you want to use code which looks like this:
Map<String, Integer> documentToCountMap = wordToDocumentMap.get(currentWord);
if(documentToCountMap == null) {
// This word has not been found anywhere before,
// so create a Map to hold document-map counts.
documentToCountMap = new TreeMap<>();
wordToDocumentMap.put(currentWord, documentToCountMap);
}
Integer currentCount = documentToCountMap.get(currentDocument);
if(currentCount == null) {
// This word has not been found in this document before, so
// set the initial count to zero.
currentCount = 0;
}
documentToCountMap.put(currentDocument, currentCount + 1);
Now you're capturing the counts on a per-word and per-document basis.
Once you've completed the analysis and you want to print a summary of the results, you can run through the map like so:
for(Map.Entry<String, Map<String,Integer>> wordToDocument :
wordToDocumentMap.entrySet()) {
String currentWord = wordToDocument.getKey();
Map<String, Integer> documentToWordCount = wordToDocument.getValue();
for(Map.Entry<String, Integer> documentToFrequency :
documentToWordCount.entrySet()) {
String document = documentToFrequency.getKey();
Integer wordCount = documentToFrequency.getValue();
System.out.println("Word " + currentWord + " found " + wordCount +
" times in document " + document);
}
}
For an explanation of the for-each structure in Java, see this tutorial page.
For a good explanation of the features of the Map interface, including the entrySet method, see this tutorial page.
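On Java 8 and later, the two null checks in the counting step can be collapsed with Map.computeIfAbsent and Map.merge. The sketch below assumes a small wrapper class; InvertedIndex, add, and countsFor are illustrative names, and the document names in main are placeholders.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndex {

    // word -> (document -> number of occurrences of the word in that document)
    private final Map<String, Map<String, Integer>> wordToDocumentMap = new TreeMap<>();

    // Records one occurrence of word in document.
    void add(String word, String document) {
        wordToDocumentMap
                .computeIfAbsent(word, w -> new TreeMap<>()) // create the inner map on first sight
                .merge(document, 1, Integer::sum);           // start at 1 or add 1
    }

    // Returns the per-document counts for word, or an empty map if the word was never seen.
    Map<String, Integer> countsFor(String word) {
        return wordToDocumentMap.getOrDefault(word, Collections.emptyMap());
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add("cat", "words.txt");
        index.add("cat", "words.txt");
        index.add("cat", "words2.txt");
        System.out.println(index.countsFor("cat"));
        // prints {words.txt=2, words2.txt=1}
    }
}
```

The behaviour is the same as the explicit null checks; the shorter form just removes the chance of forgetting to put the inner map back.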

Try adding a second map from each word to the set of document names it appears in, like this:
Map<String, Set<String>> filenames = new HashMap<String, Set<String>>();
...
word = word.replaceAll("[^a-zA-Z0-9\\s]", "");
// Get the current count of this word, add one, and then store the new count:
count = getCount(word, frequencyData) + 1;
frequencyData.put(word, count);
Set<String> filenamesForWord = filenames.get(word);
if (filenamesForWord == null) {
filenamesForWord = new HashSet<String>();
}
filenamesForWord.add(Docs[x]);
filenames.put(word, filenamesForWord);
total = total + count;
counter++;
docs = x + 1;
When you need the set of filenames in which a particular word occurs, just get() it from the filenames map. Here is an example that prints, for each word, all the file names in which it was encountered:
public static void printAllCounts(TreeMap<String, Integer> frequencyData, Map<String, Set<String>> filenames) {
System.out.println("-----------------------------------------------");
System.out.println(" Occurrences Word");
for(String word : frequencyData.keySet( ))
{
System.out.printf("%15d %s\n", frequencyData.get(word), word);
for (String filename : filenames.get(word)) {
System.out.println(filename);
}
}
System.out.println("-----------------------------------------------");
}

I've put a Scanner into the main method, and searching for a word returns the documents the word occurs in. I also return how many times the word occurs, but I can only get the total across all three documents. I want it to return how many times the word occurs in each document, so that I can calculate tf-idf. If you have a complete answer for the whole tf-idf calculation, I would appreciate it. Cheers
Here is my code:
import java.util.*; // Provides TreeMap, Iterator, Scanner
import java.io.*; // Provides FileReader, FileNotFoundException
public class test2
{
public static void main(String[ ] args)
{
// **THIS CREATES A TREE MAP**
TreeMap<String, Integer> frequencyData = new TreeMap<String, Integer>();
Map<String, Set<String>> filenames = new HashMap<String, Set<String>>();
Map<String, Integer> countByWords = new HashMap<String, Integer>();
Map[] mapArray = new Map[5];
mapArray[0] = new HashMap<String, Integer>();
readWordFile(countByWords, frequencyData, filenames);
printAllCounts(countByWords, frequencyData, filenames);
}
public static int getCount(String word, TreeMap<String, Integer> frequencyData)
{
if (frequencyData.containsKey(word))
{ // The word has occurred before, so get its count from the map
return frequencyData.get(word); // Auto-unboxed
}
else
{ // No occurrences of this word
return 0;
}
}
public static void printAllCounts( Map<String, Integer> countByWords, TreeMap<String, Integer> frequencyData, Map<String, Set<String>> filenames)
{
System.out.println("-----------------------------------------------");
System.out.print("Search for a word: ");
String worde;
int result = 0;
Scanner input = new Scanner(System.in);
worde=input.nextLine();
if(!filenames.containsKey(worde)){
System.out.println("The word does not exist");
}
else{
for(String filename : filenames.get(worde)){
System.out.println(filename);
System.out.println(countByWords.get(worde));
}
}
System.out.println("\n-----------------------------------------------");
}
public static void readWordFile(Map<String, Integer> countByWords ,TreeMap<String, Integer> frequencyData, Map<String, Set<String>> filenames)
{
Scanner wordFile;
String word; // A word read from the file
Integer count; // The number of occurrences of the word
int counter = 0;
int docs = 0;
//**FOR LOOP TO READ THE DOCUMENTS**
for(int x=0; x<Docs.length; x++)
{ //start of for loop [*
try
{
wordFile = new Scanner(new FileReader(Docs[x]));
}
catch (FileNotFoundException e)
{
System.err.println(e);
return;
}
while (wordFile.hasNext( ))
{
// Read the next word and get rid of the end-of-line marker if needed:
word = wordFile.next( );
// This makes the Word lower case.
word = word.toLowerCase();
word = word.replaceAll("[^a-zA-Z0-9\\s]", "");
// Get the current count of this word, add one, and then store the new count:
count = countByWords.get(word);
if(count != null){
countByWords.put(word, count + 1);
}
else{
countByWords.put(word, 1);
}
Set<String> filenamesForWord = filenames.get(word);
if (filenamesForWord == null) {
filenamesForWord = new HashSet<String>();
}
filenamesForWord.add(Docs[x]);
filenames.put(word, filenamesForWord);
counter++;
docs = x + 1;
}
} //End of for loop *]
System.out.println("There are " + counter + " terms in the collection.");
System.out.println("There are " + docs + " documents in the collection.");
}
// Array of documents
static String Docs [] = {"Document1.txt", "Document2.txt", "Document3.txt"};
}
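For the tf-idf part, the counts have to be kept per document rather than as one total per word. Assuming such a document-to-word-count map is available, one common variant can be sketched as follows: tf is the word's count divided by the number of terms in the document, and idf is the natural logarithm of the number of documents divided by the number of documents containing the word. Other weightings and smoothing schemes exist; the class TfIdf and the method tfIdf are hypothetical names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class TfIdf {

    // counts: document name -> (word -> occurrences of the word in that document)
    static double tfIdf(String word, String document, Map<String, Map<String, Integer>> counts) {
        Map<String, Integer> docCounts = counts.getOrDefault(document, Collections.emptyMap());
        int inDoc = docCounts.getOrDefault(word, 0);
        int docLength = 0; // total number of terms in the document
        for (int c : docCounts.values()) {
            docLength += c;
        }
        if (inDoc == 0 || docLength == 0) {
            return 0.0; // the word does not occur in this document
        }

        int docsWithWord = 0; // document frequency of the word
        for (Map<String, Integer> m : counts.values()) {
            if (m.containsKey(word)) {
                docsWithWord++;
            }
        }
        double tf = (double) inDoc / docLength;
        double idf = Math.log((double) counts.size() / docsWithWord);
        return tf * idf;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        Map<String, Integer> doc1 = new HashMap<>();
        doc1.put("cat", 2);
        doc1.put("dog", 2);
        Map<String, Integer> doc2 = new HashMap<>();
        doc2.put("dog", 1);
        counts.put("Document1.txt", doc1);
        counts.put("Document2.txt", doc2);
        System.out.println(tfIdf("cat", "Document1.txt", counts)); // 0.5 * ln(2)
        System.out.println(tfIdf("dog", "Document1.txt", counts)); // 0.0, "dog" is in every document
    }
}
```

With this definition, a word that appears in every document always scores 0, which is the intended effect of the idf factor.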

More efficient way of getting frequency of words

I want to count the words in an ArrayList by their first letter, e.g. [cat, cog, mouse] means there are 2 words beginning with c and one word beginning with m. The code I have works fine, but there are 26 letters in the alphabet, which will require a lot more ifs. Is there any other way of doing this?
public static void countAlphabeticalWords(ArrayList<String> arrayList) throws IOException
{
int counta =0, countb=0, countc=0, countd=0,counte=0;
String word = "";
for(int i = 0; i<arrayList.size();i++)
{
word = arrayList.get(i);
if (word.charAt(0) == 'a' || word.charAt(0) == 'A'){ counta++;}
if (word.charAt(0) == 'b' || word.charAt(0) == 'B'){ countb++;}
}
System.out.println("The number of words beginning with A are: " + counta);
System.out.println("The number of words beginning with B are: " + countb);
}

Use a Map
public static void countAlphabeticalWords(List<String> list) throws IOException {
Map<Character,Integer> counts = new HashMap<Character,Integer>();
for(String word : list) {
Character c = Character.toUpperCase(word.charAt(0));
if (counts.containsKey(c)) {
counts.put(c, counts.get(c) + 1);
}
else {
counts.put(c, 1);
}
}
for (Map.Entry<Character, Integer> entry : counts.entrySet()) {
System.out.println("The number of words beginning with " + entry.getKey() + " are: " + entry.getValue());
}
}
Or use a Map and AtomicInteger (as per Jarrod Roberson)
// requires import java.util.concurrent.atomic.AtomicInteger;
public static void countAlphabeticalWords(List<String> list) throws IOException {
Map<Character,AtomicInteger> counts = new HashMap<Character,AtomicInteger>();
for(String word : list) {
Character c = Character.toUpperCase(word.charAt(0));
if (counts.containsKey(c)) {
counts.get(c).incrementAndGet();
}
else {
counts.put(c, new AtomicInteger(1));
}
}
for (Map.Entry<Character, AtomicInteger> entry : counts.entrySet()) {
System.out.println("The number of words beginning with " + entry.getKey() + " are: " + entry.getValue());
}
}
Best Practices
Avoid list.get(i) in loops; use for(element : list) instead. And never use ArrayList in a signature; use the List interface instead so you can change the implementation.
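On Java 8 and later, the containsKey/put pair can be replaced by Map.merge. The following sketch also skips empty strings, for which charAt(0) would throw; FirstLetterCounter and countByFirstLetter are illustrative names, and returning the map instead of printing it is a design choice made here for easier reuse.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FirstLetterCounter {

    // Counts words by their upper-cased first character; a TreeMap keeps the letters sorted.
    static Map<Character, Integer> countByFirstLetter(List<String> words) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (String word : words) {
            if (word == null || word.isEmpty()) {
                continue; // nothing to classify
            }
            counts.merge(Character.toUpperCase(word.charAt(0)), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Character, Integer> counts = countByFirstLetter(Arrays.asList("cat", "cog", "mouse"));
        counts.forEach((letter, n) ->
                System.out.println("The number of words beginning with " + letter + " are: " + n));
    }
}
```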

How about this? Considering that the words start only with [a-zA-Z]:
public static int[] getCount(List<String> arrayList) {
int[] data = new int[26];
final int a = (int) 'a';
for(String s : arrayList) {
data[((int) Character.toLowerCase(s.charAt(0))) - a]++;
}
return data;
}
edit:
Just out of curiosity, I made a very simple test comparing my method and Steph's method with map.
List with 236 items, 10000000 iterations (without printing the result): my code took ~10000ms and Steph's took ~65000ms.
Test: http://pastebin.com/HNBgKFRk
Data: http://pastebin.com/UhCtapZZ

Every character can be cast to an integer representing its ASCII code. For example, (int)'a' is 97 and (int)'z' is 122. http://www.asciitable.com/
You can create a lookup table for the characters:
int[] characters = new int[128];
Then in your algorithm's loop use the ASCII decimal as index and increment the value:
word = arrayList.get(i);
characters[word.charAt(0)]++;
In the end, you can print the number of occurrences for each lowercase letter:
for (int i = 97; i<=122; i++){
System.out.println(String.format("The number of words beginning with %s are: %d", (char)i, characters[i]));
}
