So the task was to read a file with the following names:
Alice
Bob
James
Richard
Bob
Alice
Alice
Alice
James
Richard
Bob
Richard
Bob
Stephan
Michael
Henry
And print out each name with its number of occurrences, e.g. "Alice - <4>".
I got it working, basically. The only problem is that the last name in sorted order (Stephan - <1>) is missing from my output, and I can't work out how to fix it. It's probably because I used [i-1], but I haven't found the right solution.
Here's my code:
package Assignment4;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.BufferedReader;
import java.util.Arrays;
public class ReportUniqueNames {
public static void main(String[] args) {
// TODO Auto-generated method stub
System.out.println (" This programm counts words, characters and lines!\n");
System.out.println ("Please enter the name of the .txt file:");
BufferedReader input = new BufferedReader(new InputStreamReader (System.in));
BufferedReader read = null;
String file = "";
String text = "";
String line = "";
boolean unique = true;
int nameCounter = 1;
try {
file = input.readLine();
read = new BufferedReader (new FileReader(file));
while ((line = read.readLine()) != null) {
text += line.trim() + " ";
}
} catch (FileNotFoundException e) {
System.out.println("File was not found.");
} catch (IOException e) {
System.out.println("An error has occured.");
}
String textarray[] = text.split(" ");
Arrays.sort(textarray);
for (int i=0; i < textarray.length; i++) {
if (i > 0 && textarray[i].equals(textarray[i-1])) {
nameCounter++;
unique = false;
}
if (i > 0 && !textarray[i].equals(textarray[i-1]) && !unique) {
System.out.println("<"+textarray[i-1]+"> - <"+nameCounter+">");
nameCounter = 1;
unique = true;
} else if (i > 0 && !textarray[i].equals(textarray[i-1]) && unique) {
//nameCounter = 1;
System.out.println("<"+textarray[i-1]+"> - <"+nameCounter+">");
}
}
}
}
So that's it.. Hopefully one of you could help me out.
EDIT: Wow, so many different approaches. First of all thanks for all of your help. I'll look through your suggested solutions and maybe restart from the bottom ;). Will give you a heads up when I'm done.
You could simply use a Map (that emulates a "Multiset") for the purpose of counting words:
String textarray[] = text.split(" ");
// TreeMap gives sorting by alphabetical order "for free"
Map<String, Integer> wordCounts = new TreeMap<>();
for (int i = 0; i < textarray.length; i++) {
Integer count = wordCounts.get(textarray[i]);
wordCounts.put(textarray[i], count != null ? count + 1 : 1);
}
for (Map.Entry<String, Integer> e : wordCounts.entrySet()) {
System.out.println("<" + e.getKey() + "> - <" + e.getValue() + ">");
}
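The counting loop above can also be written with Map.merge (available since Java 8). The following is only a minimal sketch: the class name NameCounts and the sample input are assumptions for illustration, and in the question the text would come from the file.

```java
import java.util.Map;
import java.util.TreeMap;

class NameCounts {
    /** Counts how often each whitespace-separated token occurs, sorted by name. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : text.trim().split("\\s+")) {
            if (!name.isEmpty()) {
                // Stores 1 for a new name, otherwise adds 1 to the existing count.
                counts.merge(name, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for the file contents.
        String text = "Alice Bob James Richard Bob Alice Alice Alice";
        count(text).forEach((name, n) -> System.out.println("<" + name + "> - <" + n + ">"));
    }
}
```

Because every name is stored in the map before anything is printed, the last name cannot be skipped the way it is in the index-based loop.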
You can use Scanner to read your input file (whose location is denoted by "filepath") using the new line character as your delimiter and add the words directly to an ArrayList<String>.
Then, iterate the ArrayList<String> and count the frequency of each word in your original file in a HashMap<String, Integer>.
Full Working Code:
Scanner s = new Scanner(new File("filepath")).useDelimiter("\n");
List<String> list = new ArrayList<>();
while (s.hasNext()){
list.add(s.next());
}
s.close();
Map<String, Integer> wordFrequency = new HashMap<>();
for(String str : list)
{
if(wordFrequency.containsKey(str))
wordFrequency.put(str, wordFrequency.get(str) + 1); // Increment the frequency by 1
else
wordFrequency.put(str, 1);
}
//Print the frequency:
for(String str : wordFrequency.keySet()) // iterate the distinct words so each one is printed only once
{
System.out.println(str + ": " + wordFrequency.get(str));
}
EDIT:
Alternatively, you can read the entire file into a single String and then split the contents of the String using \n as delimiter into a list. The code is shorter than the first option:
String fileContents = new Scanner(new File("filepath")).useDelimiter("\\Z").next(); // \Z is the end of string anchor, so the entire file is read in one call to next()
List<String> list = Arrays.asList(fileContents.split("\\s*\\n\\s*"));// Using new line character as delimiter, it adds every word to the list
I'd do it like this:
Map<String,Integer> occurs = new HashMap<String,Integer>();
int i = 0, number;
for (; i < textarray.length; i++) {
if (occurs.containsKey(textarray[i])) {
number = occurs.get(textarray[i]);
occurs.put(textarray[i], number + 1);
} else {
occurs.put(textarray[i], 1);
}
}
for(Map.Entry<String, Integer> entry : occurs.entrySet()){
System.out.println("<" + entry.getKey() + "> - " + entry.getValue());
}
System.out.println("<"+textarray[textarray.length-1]+"> - <"+nameCounter+">");
You need this line after your loop: inside the loop you only ever print the name at index i-1, so the last name is never printed even though the loop runs the correct number of times.
That said, using a map is the better choice.
Your code prints the result for a name only when it first encounters a different name, so the print statement for the last entry never runs. To solve this, add another if statement at the end of the loop body that checks whether this is the final iteration. The if statement would look like this:
if(i == textarray.length - 1){
System.out.println("<"+textarray[i]+"> - <"+nameCounter+">");
}
Now the loop will look like this:
for (int i=1; i < textarray.length; i++) {
if (i > 0 && textarray[i].equals(textarray[i-1])) {
nameCounter++;
unique = false;
}
if (i > 0 && !textarray[i].equals(textarray[i-1]) && !unique) {
System.out.println("<"+textarray[i-1]+"> - <"+nameCounter+">");
nameCounter = 1;
unique = true;
}
else if (i > 0 && !textarray[i].equals(textarray[i-1]) && unique) {
//nameCounter = 1;
System.out.println("<"+textarray[i-1]+"> - <"+nameCounter+">");
}
if(i == textarray.length - 1){
System.out.println("<"+textarray[i]+"> - <"+nameCounter+">");
}
}
And now the loop will also print the results for the last entry in the list.
I hope this helps :)
P.S. some of the other solutions here are far more efficient but this is a solution for your current approach.
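For comparison, here is a minimal sketch of the same sorted-array idea that looks ahead to the next element instead of back to the previous one, so the final run is handled by the same condition. The class name SortedRunCounter and the sample names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortedRunCounter {
    /** Sorts a copy of the names and reports one "<name> - <count>" line per run of equal names. */
    static List<String> report(String[] names) {
        String[] sorted = names.clone();
        Arrays.sort(sorted);
        List<String> lines = new ArrayList<>();
        int runLength = 0;
        for (int i = 0; i < sorted.length; i++) {
            runLength++;
            // A run ends at the last element or when the next name differs.
            boolean runEnds = (i == sorted.length - 1) || !sorted[i].equals(sorted[i + 1]);
            if (runEnds) {
                lines.add("<" + sorted[i] + "> - <" + runLength + ">");
                runLength = 0;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        String[] names = {"Bob", "Stephan", "Alice", "Bob", "Alice", "Alice"};
        report(names).forEach(System.out::println);
    }
}
```

With this shape there is no need for the unique flag or for a separate print after the loop.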
I want to discuss the logic you initially used to decide whether a value in an array of strings is unique.
You compare just two neighbouring cells of the array and assume that, if they are not equal, the name in textarray[i] is unique.
That is not true in general: in an unsorted array the name can occur again later, after your "unique" boolean variable has already been set to true.
Example:
john | luke | john | charlotte
Comparing the first and second cells tells you that john and luke are different, and comparing the second and third cells when "i" advances tells you the same, yet john is not unique.
So let's imagine Java had no Map. How could you solve this with a plain algorithm?
Here is an idea:
1 - Create a function that takes as parameters the string you want to check and the array.
2 - Loop over the whole array, testing whether the string equals the current cell; if it matches a cell other than its own, return null or -1.
3 - If you reach the last cell of the array without such a match, your string is unique, so print it on screen.
4 - Call this function textarray.length times,
and you will have only the unique names on your screen.
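The steps above can be sketched as follows. This is one possible reading of the idea, not the answerer's code: instead of returning null or -1 it counts the matches, and the names UniqueNames, occurrences and uniqueNames are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class UniqueNames {
    /** Counts how many cells of the array equal the given name. */
    static int occurrences(String[] names, String name) {
        int count = 0;
        for (String candidate : names) {
            if (candidate.equals(name)) {
                count++;
            }
        }
        return count;
    }

    /** Returns the names that occur exactly once, in their original order. */
    static List<String> uniqueNames(String[] names) {
        List<String> unique = new ArrayList<>();
        for (String name : names) {
            // The name always matches its own cell, so "unique" means exactly one match.
            if (occurrences(names, name) == 1) {
                unique.add(name);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        String[] names = {"john", "luke", "john", "charlotte"};
        System.out.println(uniqueNames(names)); // prints [luke, charlotte]
    }
}
```

This does a full pass over the array for every name, so the work grows quadratically with the number of names; that is acceptable for a short list but is the main reason the map-based answers scale better.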
First things first:
You won't need your text variable since we will be replacing it with a more appropriate data structure. You need one to store the names that you found so far in the file, along with an integer (number of occurrences) for each name that you found.
Like Dmitry said, the best data structure you can use for this particular case is Hashtable or HashMap.
Assuming that the file structure is a single name per line without any punctuation or spaces, your code would look something like this:
try {
Hashtable<String,Integer> table = new Hashtable<String,Integer>();
file = input.readLine();
read = new BufferedReader (new FileReader(file));
while ((line = read.readLine()) != null) {
line = line.trim(); // trim() returns a new String, so the result must be assigned
if(table.containsKey(line))
table.put(line, table.get(line)+1);
else
table.put(line, 1);
}
System.out.println(table); // looks pretty good and compact on the console... :)
} catch (FileNotFoundException e) {
System.out.println("File was not found.");
} catch (IOException e) {
System.out.println("An error has occured.");
}
I'm new to programming, and I need to capitalise the first letter of each word in the user's input, except for certain words.
For example, if the input is
THIS IS A TEST, I get This Is A Test.
However, I want the output in the form This is a Test.
String s = in.nextLine();
StringBuilder sb = new StringBuilder(s.length());
String wordSplit[] = s.trim().toLowerCase().split("\\s");
String[] t = {"is","but","a"};
for(int i=0;i<wordSplit.length;i++){
if(wordSplit[i].equals(t))
sb.append(wordSplit[i]).append(" ");
else
sb.append(Character.toUpperCase(wordSplit[i].charAt(0))).append(wordSplit[i].substring(1)).append(" ");
}
System.out.println(sb);
}
This is the closest I have gotten so far but I seem to be unable to exclude capitalising the specific words.
The problem is that you are comparing each word to the entire array. Java allows this, but it does not make sense: a String is never equal to a String[], so the condition is always false. Instead, you could loop over the words in the array and compare each one, but that is a bit lengthy in code, and also not very fast if the array of words gets bigger.
Instead, I'd suggest creating a Set from the array and checking whether it contains the word:
String[] t = {"is","but","a"};
Set<String> t_set = new HashSet<>(Arrays.asList(t));
...
if (t_set.contains(wordSplit[i])) {
...
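Putting the Set check into the complete loop could look like the sketch below. It is an illustration only: the class name TitleCase and the method capitalizeExcept are invented for the example, and the input is hard-coded instead of being read from a Scanner.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class TitleCase {
    /** Lower-cases the input, then capitalises every word that is not in the excluded set. */
    static String capitalizeExcept(String input, Set<String> excluded) {
        StringBuilder sb = new StringBuilder(input.length());
        for (String word : input.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the input is blank
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            if (excluded.contains(word)) {
                sb.append(word);
            } else {
                sb.append(Character.toUpperCase(word.charAt(0))).append(word.substring(1));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> excluded = new HashSet<>(Arrays.asList("is", "but", "a"));
        System.out.println(capitalizeExcept("THIS IS A TEST", excluded)); // prints This is a Test
    }
}
```

Appending the space before each word rather than after it avoids the trailing space that the original loop leaves at the end of the result.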
Your problem (as pointed out by @sleepToken) is that
if(wordSplit[i].equals(t))
is checking to see if the current word is equal to the array containing your keywords.
Instead what you want to do is to check whether the array contains a given input word, like so:
if (Arrays.asList(t).contains(wordSplit[i].toLowerCase()))
Note that there is no case-insensitive contains() method, so it's important to convert the word in question to lower case before searching for it.
You're already doing the iteration once. Just do it again; iterate through every String in t for each String in wordSplit:
for (int i = 0; i < wordSplit.length; i++){
boolean found = false;
for (int j = 0; j < t.length; j++) {
if(wordSplit[i].equals(t[j])) {
found = true;
}
}
if (found) { /* do your stuff */ }
else { }
}
First, write a method that checks whether the array contains the word:
static boolean contains(String[] arr, String word) {
for (int i = 0; i < arr.length; i++) {
if (word.equals(arr[i])) {
return true;
}
}
return false;
}
Then change your condition wordSplit[i].equals(t) to contains(t, wordSplit[i]).
You are not comparing against each word to ignore: the line if(wordSplit[i].equals(t)) compares the word with the whole array.
You can do something like this:
public class Sample {
public static void main(String[] args) {
String s = "THIS IS A TEST";
String[] ignore = {"is","but","a"};
List<String> toIgnoreList = Arrays.asList(ignore);
StringBuilder result = new StringBuilder();
for (String s1 : s.split(" ")) {
if(!toIgnoreList.contains(s1.toLowerCase())) {
result.append(s1.substring(0,1).toUpperCase())
.append(s1.substring(1).toLowerCase())
.append(" ");
} else {
result.append(s1.toLowerCase())
.append(" ");
}
}
System.out.println("Result: " + result);
}
}
Output is:
Result: This is a Test
To check for the excluded words, the java.util.List.contains() method would be a better choice.
The expression below checks whether the exclude list contains the word and, if not, capitalises the first letter:
tlist.contains(x) ? x : (x = x.substring(0,1).toUpperCase() + x.substring(1))
The expression corresponds to:
if(tlist.contains(x)) { // ?
x = x; // do nothing
} else { // :
x = x.substring(0,1).toUpperCase() + x.substring(1);
}
or:
if(!tlist.contains(x)) {
x = x.substring(0,1).toUpperCase() + x.substring(1);
}
If you're allowed to use java 8:
String s = in.nextLine();
String wordSplit[] = s.trim().toLowerCase().split("\\s");
List<String> tlist = Arrays.asList("is","but","a");
String result = Stream.of(wordSplit).map(x ->
tlist.contains(x) ? x : (x = x.substring(0,1).toUpperCase() + x.substring(1)))
.collect(Collectors.joining(" "));
System.out.println(result);
Output:
This is a Test
I have implemented some code that finds the anagrams in the text file sample.txt and prints them to the console. The file contains one string (word) per line.
Is this the right approach if I want to find the anagrams in a text file with a million or even 20 billion words? If not, which technology should I use in that case?
I appreciate any help.
Sample
abac
aabc
hddgfs
fjhfhr
abca
rtup
iptu
xyz
oifj
zyx
toeiut
yxz
jrgtoi
oupt
Output
abac aabc abca
xyz zyx yxz
Code
package org.reader;
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class Test {
// To store the anagram words
static List<String> match = new ArrayList<String>();
// Flag to check whether the checkWorld1InMatch() was invoked.
static boolean flagCheckWord1InMatch;
public static void main(String[] args) {
String fileName = "G:\\test\\sample2.txt";
StringBuilder sb = new StringBuilder();
// In case of matching, this flag is used to append the first word to
// the StringBuilder once.
boolean flag = true;
BufferedReader br = null;
try {
// convert the data in the sample.txt file to list
List<String> list = Files.readAllLines(Paths.get(fileName));
for (int i = 0; i < list.size(); i++) {
flagCheckWord1InMatch = true;
String word1 = list.get(i);
for (int j = i + 1; j < list.size(); j++) {
String word2 = list.get(j);
boolean isExist = false;
if (match != null && !match.isEmpty() && flagCheckWord1InMatch) {
isExist = checkWord1InMatch(word1);
}
if (isExist) {
// A word with the same characters was checked before
// and there is no need to check it again. Therefore, we
// jump to the next word in the list.
// flagCheckWord1InMatch = true;
break;
} else {
boolean result = isAnagram(word1, word2);
if (result) {
if (flag) {
sb.append(word1 + " ");
flag = false;
}
sb.append(word2 + " ");
}
if (j == list.size() - 1 && sb != null && !sb.toString().isEmpty()) {
match.add(sb.toString().trim());
sb.setLength(0);
flag = true;
}
}
}
}
} catch (
IOException e) {
e.printStackTrace();
} finally {
try {
if (br != null) {
br.close();
}
} catch (IOException ex) {
ex.printStackTrace();
}
}
for (String item : match) {
System.out.println(item);
}
// System.out.println("Sihwail");
}
private static boolean checkWord1InMatch(String word1) {
flagCheckWord1InMatch = false;
boolean isAvailable = false;
for (String item : match) {
String[] content = item.split(" ");
for (String word : content) {
if (word1.equals(word)) {
isAvailable = true;
break;
}
}
}
return isAvailable;
}
public static boolean isAnagram(String firstWord, String secondWord) {
char[] word1 = firstWord.toCharArray();
char[] word2 = secondWord.toCharArray();
Arrays.sort(word1);
Arrays.sort(word2);
return Arrays.equals(word1, word2);
}
}
For 20 billion words you will not be able to hold all of them in RAM, so you need an approach that processes them in chunks.
20,000,000,000 words. Java needs quite a lot of memory to store strings: you can count on 2 bytes per character plus at least 38 bytes of overhead per string.
This means 20,000,000,000 one-character words would need 800,000,000,000 bytes, or 800 GB, which is more than any computer I know of has.
Your file will contain far fewer than 20,000,000,000 different words, so you might avoid the memory problem if you store every word only once (e.g. in a Set).
First, for a smaller number of words.
Use a more powerful data structure, and do not read all lines into memory at once; read the file line by line.
Map<String, Set<String>> mapSortedToWords = new HashMap<>();
Path path = Paths.get(fileName);
try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
for (;;) {
String word = in.readLine();
if (word == null) {
break;
}
String key = sorted(word);
Set<String> words = mapSortedToWords.get(key);
if (words == null) {
words = new TreeSet<String>();
mapSortedToWords.put(key, words);
}
words.add(word);
}
}
for (Set<String> anagrams : mapSortedToWords.values()) {
if (anagrams.size() > 1) {
... anagrams
}
}
static String sorted(String word) {
char[] letters = word.toCharArray();
Arrays.sort(letters);
return new String(letters);
}
For each sorted key, the map stores the set of words that share it, which corresponds to a line such as abac aabc abca in your output.
For a large number of words, a database in which you store (sortedLetters, word) pairs would be better. An embedded database like Derby or H2 poses no installation problems.
For the file size you mention (20 billion words), there are two obvious problems with your code:
List<String> list = Files.readAllLines(Paths.get(fileName));
and
for (int i = 0; i < list.size(); i++)
These two lines raise two questions:
Do you have enough memory to read the full file in one go?
Is it OK to iterate 20 billion times?
For most systems, the answer to both questions is no.
So your goal is to cut down the memory footprint and reduce the number of iterations.
You need to read your file chunk by chunk and use some kind of search data structure (such as a trie) to store your words.
You will find numerous questions on SO for both of above topics like,
Fastest way to incrementally read a large file
Finding anagrams for a given word
The algorithm above says that you first have to create a dictionary of your words.
Anyway, I believe there is no ready-made answer for you. Take a file with one billion words (creating one is a difficult task in itself) and see what works and what doesn't, but your current code will obviously not work.
Hope it helps !!
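Use a stream to read the file. That way you only hold one word in memory at a time.
FileReader file = new FileReader("file.txt"); // character stream over the file
String word = ""; // must be initialised before characters are appended
while(file.ready()) // returns true while a character can be read without blocking
{
char c = (char) file.read(); // read() returns an int, so cast it to char
if(c != '\n')
{
word+=c;
}
else {
process(word); // do whatever you want
word = "";
}
}
if(!word.isEmpty())
{
process(word); // last word when the file does not end with a newline
}
file.close();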
Update
You can use a map to find the anagrams, as shown below. For each word, sort its characters to obtain a sorted String; this is the key in your anagrams map. The value for that key is the list of words that are anagrams of each other.
public void findAnagrams(String[] yourWords) {
Map<String, List<String>> anagrams = new HashMap<String, List<String>>();
for (String word : yourWords) {
String sortedWord = sortedString(word);
List<String> values = anagrams.get(sortedWord);
if (values == null)
values = new LinkedList<>();
values.add(word);
anagrams.put(sortedWord, values);
}
System.out.println(anagrams);
}
private static String sortedString(String originalWord) {
char[] chars = originalWord.toCharArray();
Arrays.sort(chars);
String sorted = new String(chars);
return sorted;
}
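The two ideas in this answer (reading word by word and grouping by sorted letters) can be combined so that the grouping consumes the words lazily. The sketch below is an illustration under assumptions: the class AnagramGroups and its method names are invented, and the demo feeds the words from a StringReader, whereas for a real file the same stream could come from Files.lines(path) or BufferedReader.lines(). The map of groups still has to fit in memory, so this helps with millions of words, not with 20 billion.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

class AnagramGroups {
    /** Returns the letters of the word in sorted order; anagrams share this key. */
    static String sortedLetters(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    /** Groups the words of the stream by their sorted-letter key, one word at a time. */
    static Map<String, List<String>> group(Stream<String> lines) {
        Map<String, List<String>> groups = new HashMap<>();
        lines.map(String::trim)
             .filter(word -> !word.isEmpty())
             .forEach(word -> groups
                     .computeIfAbsent(sortedLetters(word), key -> new ArrayList<>())
                     .add(word));
        return groups;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical in-memory input standing in for the sample file.
        String sample = "abac\naabc\nhddgfs\nabca\nxyz\nzyx\nyxz\n";
        try (BufferedReader in = new BufferedReader(new StringReader(sample))) {
            for (List<String> words : group(in.lines()).values()) {
                if (words.size() > 1) {
                    System.out.println(String.join(" ", words));
                }
            }
        }
    }
}
```

Each word is compared with nothing else; it is sorted once and dropped into its group, which replaces the nested loop over all pairs of words in the question.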
I know this question has already been asked several times, but I can't find a way to apply the answers to my code.
My goal is the following:
I have two files, griechenland_test.txt and outagain5.txt. I want to read them and then work out what percentage of outagain5.txt is contained in the other file.
outagain5.txt has input like this:
mit dem 542824
und die 517126
griechenland_test.txt is a normal Wikipedia article about that topic (plain text, without frequency counts).
1. Problem
- How can I split the input into bigrams? That is, every two words, always overlapping with the one before? So if I have the words A, B, C, D, I get AB, BC, CD?
I have this:
while ((sCurrentLine = in.readLine()) != null) {
// System.out.println(sCurrentLine);
arr = sCurrentLine.split(" ");
for (int i = 0; i < arr.length; i++) {
if (null == hash.get(arr[i])) {
hash.put(arr[i], 1);
} else {
int x = hash.get(arr[i]) + 1;
hash.put(arr[i], x);
}
}
Then I read the other file with this code (I only add the words, not the number; I split on 4 spaces, so the two words are in h[0]).
for (String line = br.readLine(); line != null; line = br.readLine()) {
String h[] = line.split(" ");
words.add(h[0]);
}
2. Problem
Now I compare each String x in hash with each String s in words. I added the else branch with System.out.println to see which words are not contained in outagain5.txt, but several words are printed that ARE contained in outagain5.txt. I don't understand why.
So I think the comparison doesn't work properly, or maybe this will be resolved once the first problem is fixed.
ArrayList<String> words = new ArrayList<String>();
ArrayList<String> neuS = new ArrayList<String>();
ArrayList<Long> neuZ = new ArrayList<Long>();
for (String x : hash.keySet()) {
summe = summe + hash.get(x);
long neu = hash.get(x);
for (String s : words) {
if (x.equals(s)) {
neuS.add(x);
neuZ.add(neu);
disc = disc + 1;
} else {
System.out.println(x);
break;
}
}
}
Hope I made my question clear, thanks a lot!!
public static List<String> ngrams(int n, String str) {
List<String> ngrams = new ArrayList<String>();
String[] words = str.split(" ");
for (int i = 0; i < words.length - n + 1; i++)
ngrams.add(concat(words, i, i+n));
return ngrams;
}
public static String concat(String[] words, int start, int end) {
StringBuilder sb = new StringBuilder();
for (int i = start; i < end; i++)
sb.append((i > start ? " " : "") + words[i]);
return sb.toString();
}
It is much easier to use the generic "n-gram" approach, so you can split every 2 or 3 words if you want. I took the code above from the linked page (NGram Sequence), and I have used this exact code almost every time I needed to split words in the (AB), (BC), (CD) format.
If I recall correctly, String has a method split(regex, limit) that splits the string around matches of the regex, and the limit controls how many times the pattern is applied.
I am referencing this JavaDoc https://docs.oracle.com/javase/6/docs/api/java/lang/String.html#split(java.lang.String, int).
For the comparison between the two text files, I would recommend having your code read both of them, populate two collections, and then compare the entries of one against the other. Hope I helped.
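As a rough sketch of both steps, the code below builds overlapping bigrams and then checks which known bigrams occur in a text by looking them up in a Set, which avoids the nested loop with break from the question. The names BigramCoverage, bigrams and coverage are invented, the sample sentence is made up, and "percentage" is interpreted here as the share of the known bigrams that appear in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BigramCoverage {
    /** Splits the text on whitespace and joins each word with its successor: A B C D gives "A B", "B C", "C D". */
    static List<String> bigrams(String text) {
        String[] words = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1]);
        }
        return result;
    }

    /** Fraction (0.0 to 1.0) of the known bigrams that occur somewhere in the text. */
    static double coverage(Set<String> known, String text) {
        if (known.isEmpty()) {
            return 0.0;
        }
        Set<String> inText = new HashSet<>(bigrams(text));
        int found = 0;
        for (String bigram : known) {
            if (inText.contains(bigram)) {
                found++;
            }
        }
        return (double) found / known.size();
    }

    public static void main(String[] args) {
        // The two bigrams come from the question's outagain5.txt excerpt; the sentence is hypothetical.
        Set<String> known = new HashSet<>(Arrays.asList("mit dem", "und die"));
        String text = "er kam mit dem Zug und blieb";
        System.out.println(bigrams("A B C D"));
        System.out.println(coverage(known, text) * 100 + " %");
    }
}
```

The lookup in a HashSet checks every known bigram against the whole text, whereas the loop in the question prints x and breaks as soon as the first entry of words differs from it.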
I have a text that I split into words separated by whitespace.
I'm classifying units. This works if the value and the unit occur in the same word (e.g. '100m', '90kg', '140°F', 'US$500'), but I have problems when they appear separately, each part in its own word (e.g. '100 °C', 'US$ 450', '150 km').
The classifier algorithm can tell that a word is a unit whose value is missing, and whether the value is expected on the left or on the right side.
My question is how I can iterate over all the words in a list while providing the correct words to the classifier.
This is only an example of the code. I have tried it in a lot of ways.
for(String word: words){
String category = classifier.classify(word);
if(classifier.needPreviousWord()){
// ?
}
if(classifier.needNextWord()){
// ?
}
}
In other words, I need to iterate over the list classifying all the words; if the previous word is needed, provide the previous word together with the unit, and if the next word is needed, provide the unit together with the next word. It seems simple, but I don't know how to do it.
Don't use an implicit iterator in your for loop, but an explicit one. With a ListIterator you can go back and forth as you like.
ListIterator<String> i = words.listIterator(); // a plain Iterator has no previous()
while (i.hasNext()) {
String category = classifier.classify(i.next());
if(classifier.needPreviousWord()){
i.previous(); // note: right after next(), this returns the same element again
}
if(classifier.needNextWord()){
i.next();
}
}
This is not complete, because I don't know what your classifier does exactly, but it should give you an idea on how to proceed.
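One way to look at the neighbouring words without moving the cursor is to use ListIterator's previousIndex() and nextIndex(). The sketch below only collects each word with its neighbours; the classifier from the question is not modelled, and the class name NeighbourWords, the method withNeighbours and the "|" separator are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

class NeighbourWords {
    /** Pairs every word with its left and right neighbour ("" at either end of the list). */
    static List<String> withNeighbours(List<String> words) {
        List<String> out = new ArrayList<>();
        ListIterator<String> it = words.listIterator();
        while (it.hasNext()) {
            // Before next(): previousIndex() is the index of the word to the left.
            String previous = it.hasPrevious() ? words.get(it.previousIndex()) : "";
            String current = it.next();
            // After next(): nextIndex() is the index of the word to the right.
            String next = it.hasNext() ? words.get(it.nextIndex()) : "";
            out.add(previous + "|" + current + "|" + next);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input where value and unit are separate words.
        List<String> words = Arrays.asList("about", "150", "km", "away");
        withNeighbours(words).forEach(System.out::println);
    }
}
```

At the point where the three strings are known, a classifier could be handed previous + current or current + next as needed. The get(index) calls are cheap for an ArrayList but would be slow on a LinkedList.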
This could help.
public static void main(String [] args)
{
List<String> words = new ArrayList<String>();
String previousWord = "";
String nextWord = "";
for(int i=0; i < words.size(); i++) {
if(i > 0) {
previousWord = words.get(i-1);
}
String currentWord = words.get(i);
if(i < words.size() - 1) {
nextWord = words.get(i+1);
} else {
nextWord = "";
}
String category = classifier.classify(currentWord);
if(classifier.needPreviousWord()){
if(previousWord.length() == 0) {
System.out.println("ERROR: missing previous unit");
} else {
System.out.println(previousWord + currentWord);
}
}
if(classifier.needNextWord()){
if(nextWord.length() == 0) {
System.out.println("ERROR: missing next unit");
} else {
System.out.println(currentWord + nextWord);
}
}
}
}
Hi, I'm pretty new to Stack Overflow, so I hope I'm doing this correctly and that someone out there has the answer I need.
I'm currently writing a program in Java with the Eclipse IDE, and my question is this:
I need a snippet of code that does the following:
It should take a .txt file containing text and, from that file,
count the number of rows and print it,
count the number of words and print it,
count the number of characters and print it.
And finally, make a list of the top 10 most used words and print that.
All the printing is done with System.out.println.
I'm pretty new to Java and am having some difficulties.
Is there anyone out there who can provide me with these lines of code, or who knows where I can find them? I want to study the code provided; that's how I learn best =)
Thanks to all
Didn't find the edit button, sorry...
I added this to my question:
Hehe, it's an assignment, but not a homework assignment. OK, I see. I can provide what I've done so far; I think I'm pretty close, but it's not working for me. Is there anything I have missed?
// Class Tip
import java.io.*;
import java.util.*;
class Tip
{
public static void main(String [] args) throws Exception
{
String root = System.getProperty("user.dir");
InputStream is = new FileInputStream( root + "\\tip.txt" );
Scanner scan = new Scanner( is );
String tempString = "";
int lines = 0;
int words = 0;
Vector<Integer> wordLength = new Vector<Integer>();
int avarageWordLength = 0;
while(scan.hasNextLine() == true)
{
tempString = scan.nextLine();
lines++;
}
is.close();
is = new FileInputStream( root );
scan = new Scanner( is );
while(scan.hasNext() == true)
{
tempString = scan.next();
wordLength.add(tempString.length());
words++;
}
for(Integer i : wordLength)
{
avarageWordLength += i;
}
avarageWordLength /= wordLength.size();
System.out.println("Lines : " + lines);
System.out.println("Words : " + words);
System.out.println("Words Avarage Length : " + avarageWordLength);
is.close();
}
}
This sounds a bit too much like a homework assignment to warrant providing a full answer, but I'll give you some tips on where to look in the Java API:
FileReader and BufferedReader for getting the data in.
Collections API for storing your data
A custom data structure for storing your list of words and occurrence count
Comparator or Comparable for sorting your data structure to get the top 10 list out
Once you've started work and have something functioning and need specific help, come back here with specific questions and then we'll do our best to help you.
Good luck!
Typing "java count words example" into Google came up with a few suggestions.
This link looks to be a decent starting point.
This simple example from here might also give you some ideas:
public class WordCount
{
public static void main(String args[])
{
System.out.println(java.util.regex.Pattern.compile("[\\w]+").split(args[0].trim()).length);
}
}
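Here's a solution (it needs imports from java.io and java.util, plus java.util.Map.Entry and java.util.regex.Pattern):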
public static void main(String[] args) {
int nRows = 0;
int nChars = 0;
int nWords = 0;
final HashMap<String, Integer> map = new HashMap<String, Integer>();
try {
BufferedReader input = new BufferedReader(new FileReader("c:\\test.txt"));
try {
String line = null;
Pattern p = Pattern.compile("[^\\w]+");
while ((line = input.readLine()) != null) {
nChars += line.length();
nRows++;
String[] words = p.split(line);
nWords += words.length;
for (String w : words) {
String word = w.toLowerCase();
Integer n = map.get(word);
if (null == n)
map.put(word, 1);
else
map.put(word, n.intValue() + 1);
}
}
TreeMap<String, Integer> treeMap = new TreeMap<String, Integer>(new Comparator<String>() {
@Override
public int compare(String o1, String o2) {
if (map.get(o1) > map.get(o2))
return -1;
else if (map.get(o1) < map.get(o2))
return 1;
else
return o1.compareTo(o2);
}
});
treeMap.putAll(map);
System.out.println("N.º Rows: " + nRows);
System.out.println("N.º Words: " + nWords);
System.out.println("N.º Chars: " + nChars);
System.out.println();
System.out.println("Top 10 Words:");
for (int i = 0; i < 10 && !treeMap.isEmpty(); i++) { // stop early if there are fewer than 10 distinct words
Entry<String, Integer> e = treeMap.pollFirstEntry();
System.out.println("Word: " + e.getKey() + " Count: " + e.getValue());
}
} finally {
input.close();
}
} catch (IOException ex) {
ex.printStackTrace();
}
}
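As an alternative for the top-10 step, the entries of the count map can be copied into a list and sorted with an explicit comparator, which avoids a TreeMap whose comparator looks values up in another map. This is a sketch only; the class name TopWords, the method top and the sample counts are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopWords {
    /** Returns up to n entries, highest count first; ties are broken alphabetically. */
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // descending by count
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // Copy the sublist so the result does not depend on the full sorted list.
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Hypothetical counts standing in for the map built while reading the file.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5);
        counts.put("a", 3);
        counts.put("cat", 3);
        counts.put("dog", 1);
        for (Map.Entry<String, Integer> e : top(counts, 10)) {
            System.out.println("Word: " + e.getKey() + " Count: " + e.getValue());
        }
    }
}
```

Because the result is cut with Math.min, a file with fewer than ten distinct words simply yields a shorter list.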
Not a complete answer, but I'd recommend looking at Sun's Java I/O tutorials, which deal with reading from and writing to files, especially the tutorial on Scanners and Formatters.
Here is the summary of the tutorial from the website:
Programming I/O often involves
translating to and from the neatly
formatted data humans like to work
with. To assist you with these chores,
the Java platform provides two APIs.
The scanner API breaks input into
individual tokens associated with bits
of data. The formatting API assembles
data into nicely formatted,
human-readable form.
So to me it looks like these are exactly the APIs you are asking about.
You might get some leverage out of Apache Commons Lang, which has a handy utility class called WordUtils that does some simple things with sentences and words.