I'm at the start of writing my program (this is for a class) and I'm running into trouble just getting it down. Here's a list of goals I am hoping to meet.
It is a method that is given a .txt file (using java.io.File)
It needs to read the file and split out the words; duplicates are allowed. (I plan to use String.split and java.util.regex.Pattern to handle whitespace and punctuation)
I'm aiming to put the words in a 1D array and then just find the length of the array.
The problem I'm running into is parsing the .txt file. I was told in class that Scanner can do it, but I'm not finding how while R(ing)TFM. I guess I'm asking for some directions to the part of the API that would help me understand how to read a file with Scanner. Once I can get it to put each word in the array I should be in the clear.
EDIT: I figured out what I needed to do thanks to everyone's help and input. My final snippet ends up looking like this, should anyone in the future come across this question.
Scanner in = new Scanner(file).useDelimiter(" ");
ArrayList<String> prepwords=new ArrayList<String>();
while(in.hasNext())
prepwords.add(in.next());
return prepwords; //returns an ArrayList without spaces but still has punctuation
I had to declare throws IOException, since Java won't let you open a file without handling the possibility that it doesn't exist. So if you run into a FileNotFoundException, you need to import IOException (its superclass) and add it to your method's throws clause. At the very least this worked for me. Thank you everyone for your input!
BufferedReader input = new BufferedReader(new FileReader(filename));
input.readLine();
This is what I use to read from files. Note that you have to handle the IOException.
Here is a link to the JSE 6.0 Scanner API
Here is the info you need to complete your project:
1. Use the Scanner(File) constructor.
2. Use a loop that is essentially this:
a. Scanner blam = new Scanner(theInputFile);
b. Map<String, Integer> wordMap = new HashMap<String, Integer>();
c. Set<String> wordSet = new HashSet<String>();
d. while (blam.hasNextLine())
e. String nextLine = blam.nextLine();
f. Split nextLine into words (read about the String.split() method).
g. If you need a count of words: for each word on the line, check if the word is in the map; if it is, increment the count. If not, add it to the map with a count of 1. This uses the wordMap (you don't need wordSet for this solution).
h. If you just need to track the words, add each word on the line to the set. This uses the wordSet (you don't need wordMap for this solution).
3. that is all.
If you don't need either the map or the set, then use a List<String> and either an ArrayList or a LinkedList. If you don't need random access to the words, LinkedList is the way to go.
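The loop in steps a–g can be sketched like this (the class and method names are mine, and a string-backed Scanner stands in for the file so the sketch runs on its own; with a real file you'd pass new Scanner(theInputFile)):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class WordCountSketch {
    // Count word occurrences from any Scanner (file-backed or string-backed).
    static Map<String, Integer> countWords(Scanner in) {
        Map<String, Integer> wordMap = new HashMap<String, Integer>();
        while (in.hasNextLine()) {
            String line = in.nextLine();
            for (String word : line.split("\\s+")) { // step f: split the line
                if (word.isEmpty()) continue;        // skip leading-space artifacts
                Integer count = wordMap.get(word);   // step g: check the map
                wordMap.put(word, count == null ? 1 : count + 1);
            }
        }
        return wordMap;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(new Scanner("the cat\nthe dog"));
        System.out.println(counts.get("the")); // 2
    }
}
```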
Something simple:
//variables you need
File file = new File("someTextFile.txt");//put your file here
Scanner scanFile = new Scanner(new FileReader(file));//create scanner
ArrayList<String> words = new ArrayList<String>();//just a place to put the words
String theWord;//temporary variable for words
//loop through file
//this walks through the .txt file word by word (Scanner splits on whitespace)
while (scanFile.hasNext())//check if there is another word
{
theWord = scanFile.next();//get next word
words.add(theWord);//add word to list
//if you dont want to add the word to the list
//you can easily do your split logic here
}
//print the list of words
System.out.println("Total number of words is: " + words.size());
for(int i = 0; i<words.size(); i++)
{
System.out.println("Word at " + i + ": " + words.get(i));
}
Source:
http://www.dreamincode.net/forums/topic/229265-reading-in-words-from-text-file-using-scanner/
Related
I have a very specific problem for my CS course. I have a sentence in a string, and I need that separated into individual words within an ArrayList, and cannot use the split method.
The issue I have is that I have had zero teaching on arrays, and only the bare minimum on loops and String methods. I've done a lot of research and figured out the general approach: make a loop and send the words to the ArrayList. However, I still can't find a good way to actually have it loop through the sentence and separate each individual word. I get how easily the very first word can be separated, but after that I get lost. I have no idea how to make the loop's later iterations grab the next word in the sentence after the one it previously got.
(Note: The only utilities imported are Scanner, File, HashMap, ArrayList, Random, and *)
What I'm looking for is any tips on specific methods I should try to employ or research, or perhaps a set of fairly functional code that does something similar, which I can look at and build my own code from.
When you say "word" I assume the words are separated by spaces. If you are reading the input yourself, then just use Scanner... Then:
Scanner input = new Scanner(System.in);
ArrayList<String> words = new ArrayList<String>();
for(int i = 0; i < numberOfWord; i++){ // numberOfWord (given): how many words to read
words.add(input.next());// next() reads one whitespace-separated token
}
or:
String theLineOfWord;//Given to you
StringTokenizer st = new StringTokenizer(theLineOfWord);//used to separate words
ArrayList<String> words = new ArrayList<String>();
while(st.hasMoreTokens()){
words.add(st.nextToken());
}
or:
public static ArrayList<String> getWords(String line){
ArrayList<String> words = new ArrayList<String>();
line += " ";//append a space so the last word is handled like the others
while(line.length() != 0) {
words.add(line.substring(0, line.indexOf(' ')));//the word before the next space
line = line.substring(line.indexOf(' ') + 1);//drop that word and the space just consumed
}
return words;
}
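A quick self-contained check of the manual-split loop above (getWords is reproduced here so the sketch compiles on its own):

```java
import java.util.ArrayList;

public class ManualSplit {
    // Same logic as getWords above: peel off one word per iteration.
    public static ArrayList<String> getWords(String line) {
        ArrayList<String> words = new ArrayList<String>();
        line += " "; // trailing space so the last word is handled like the rest
        while (line.length() != 0) {
            words.add(line.substring(0, line.indexOf(' '))); // word before the space
            line = line.substring(line.indexOf(' ') + 1);    // drop word and space
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(getWords("one two three")); // [one, two, three]
    }
}
```

Note this assumes words are separated by exactly one space; consecutive spaces would produce empty entries.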
A file called “getwordinfo.txt” should reside in your project directory that should contain some of the words that have been found in the input files. Read in each of the words from this file (maybe using a simple Scanner object), and then output the following to the console window:
The word itself
The list of occurrences of that word, or, if the word never occurred, simply output “Not found”
The total number of occurrences of the word, and the usage frequency of the word (as a percentage) relative to all word occurrences in the input files
File fileInTheFolder = new File(f, docname);
fileInTheFolder.createNewFile();
File infile = new File("input.txt");
Scanner scanner = new Scanner(infile);
String w1 = scanner.nextLine();
What I suggest you do (read: what I would probably do as a first approach) is to create a Map to hold the data, and then, using your reader, insert the data into the map one entry at a time.
A Map, for example:
HashMap<String, Integer> hmap = new HashMap<String, Integer>();
works by having two fields, a key and a value. In your case the key is the word you want to count instances of and the value is the counter value.
Once you have your map you can begin inserting into it.
For example, as seen here:
for (String a : words) { // words: the String[] you got from splitting your line
Integer count = hmap.get(a);
hmap.put(a, (count == null) ? 1 : count + 1);
}
What we do is:
Check if a has already been seen.
If it has been seen then we add to the counter ( + 1) otherwise we set the initial counter value to 1.
So if you could take your line and parse it into words, go through the words and insert them into the map you will have your answer.
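Putting the pieces together on a sample line (the class name and the sentence are just illustration):

```java
import java.util.HashMap;

public class CountDemo {
    // Build the word -> count map for one line, using the counting idiom above.
    static HashMap<String, Integer> countLine(String line) {
        HashMap<String, Integer> hmap = new HashMap<String, Integer>();
        for (String a : line.split(" ")) {
            Integer count = hmap.get(a);
            hmap.put(a, (count == null) ? 1 : count + 1); // first sighting -> 1
        }
        return hmap;
    }

    public static void main(String[] args) {
        System.out.println(countLine("to be or not to be").get("be")); // 2
    }
}
```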
I need to make a dictionary that takes words from a .txt file. These words (separated line by line) need to be stored in a String array. I have already gotten to the point of separating the words and adding them to a new .txt file, but I have no idea how to add them each to a String array. There are
You need to count the lines in the file. Create an array of that size.
Then for each line in the file, read it and insert it into the array at the index of the line it was read from.
Since you are not allowed to use ArrayList or LinkedList objects, I would suggest to save every found word "on the fly" while you are reading the input file. These is a series of steps you could follow to get this done:
1. Read the file, line by line: Use the common new BufferedReader(new FileInputStream("/path/to/file")) approach and read line by line (as I assume you are already doing, looking at your code).
2. Check every line for words: Break each line into possible words by splitting on spaces with String.split() and remove punctuation characters.
3. Save every word: Loop through the String array returned by the String.split() and for every element that you considered a word, update your statistics and write it to your dictionary file with the common new BufferedWriter(new FileWriter("")).write(...);
4. Close your resources: Close the reader and writer after you have finished looping, preferably in a finally block.
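The four steps can be sketched roughly as follows. A StringReader/StringWriter stand in for the real file streams here so the sketch runs without files on disk; with real files you'd use the FileInputStream/FileWriter constructors named above:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

public class DictionarySketch {
    // Steps 1-3: read line by line, split on whitespace, strip punctuation,
    // and write each surviving word on its own line.
    static String extractWords(String input) {
        StringWriter out = new StringWriter();
        BufferedReader reader = new BufferedReader(new StringReader(input));
        BufferedWriter writer = new BufferedWriter(out);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String token : line.split("\\s+")) {
                    String word = token.replaceAll("[^a-zA-Z]", ""); // drop punctuation
                    if (!word.isEmpty()) {
                        writer.write(word + "\n");
                    }
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e); // in-memory streams won't actually throw
        } finally {
            try { reader.close(); writer.close(); } // step 4: close in a finally block
            catch (IOException ignored) { }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(extractWords("Hello, world!\nfoo bar")); // Hello world foo bar, one per line
    }
}
```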
Here is a complete code sample:
public static void main(String[] args) throws IOException {
File dictionaryFile = new File("dict.txt");
// Count the number of lines in the file
LineNumberReader lnr = new LineNumberReader(new FileReader(dictionaryFile));
lnr.skip(Long.MAX_VALUE);
// Instantiate a String[] with the size = number of lines
String[] dict = new String[lnr.getLineNumber() + 1];
lnr.close();
Scanner scanner = new Scanner(dictionaryFile);
int wordNumber = 0;
while (scanner.hasNextLine()) {
String word = scanner.nextLine();
if (word.length() >= 2 && !(Character.isUpperCase(word.charAt(0)))) {
dict[wordNumber] = word;
wordNumber++;
}
}
scanner.close();
}
It took about 350 ms to finish executing on a 118,620 line file, so it should work for your purposes. Note that I instantiated the array in the beginning instead of creating a new String[] on each line (and replacing the old one like you did in your code).
I used wordNumber to keep track of the current array index so that each word would be added to the array at the right location.
I also used .nextLine() instead of .next() since you said that the dictionary was separated by line instead of by spaces (which is what .next() uses).
I'm currently working on an anagram solver. I saw a really good post which had one recommendation on alphabetizing the letters of both the user input and dictionary list before comparing. It seemed interesting so I'm giving it a try. Previously I used permutations, but I want something that I can eventually (and efficiently) use to solve multi word anagrams.
I can put both my user input and dictionary into char arrays and sorting alphabetically. Now I need to compare each so I can determine if something is an anagram or not. I thought about taking the alphabetized user input and determining if the alphabetized dictionary contained it or not. I've posted my code below. As you can guess I'm a little confused on the logic of this process. I was wondering if someone could help me straighten out the logic a little. Thanks for any help.
public class AnagramSolver1 {
public static void main(String[] args) throws IOException {
List<String> dictionary = new ArrayList<String>();
List<String> inputList = new ArrayList<String>();
BufferedReader in = new BufferedReader(new FileReader("src/dictionary.txt"));
String line = null;
Scanner scan = new Scanner(System.in);
while (null!=(line=in.readLine())){
dictionary.add(line);
}
in.close();
char[] sortDictionary;
char[] inputSort;
System.out.println("Enter Word: ");
String input = scan.next();
inputList.add(input);
//Getting a little confused here. I thought about sorting my input
//then iterating through my dictionary (while sorting it too) and comparing
//thus far it produces nothing
for(int i = 0; i < inputList.size(); i++){
inputSort = inputList.get(i).toCharArray();
Arrays.sort(inputSort);
for (int j = 0; j < dictionary.size(); j++) {
sortDictionary = dictionary.get(i).toCharArray();
Arrays.sort(sortDictionary);
if(inputSort.equals(sortDictionary)){
System.out.println("Anagram" +dictionary.get(i));
} //end if
}//end for
}//end for
}//end main
}
Why not maintain a Map<String, Set<String>> that maps a sorted-character string to a set of strings that are its anagrams. You can update this map as you read words from the dictionary. For example, if you read the word dog you would add an entry to the map "dgo" => {"dog"} (notice that dgo consists of the sorted characters of the word dog). Then if you read the word god, you would sort its characters to obtain the same dgo and consequently amend the previous entry to be "dgo" => {"dog", "god"}. You would of course repeat this for every word in the dictionary.
This should allow for quick and easy querying. If you wanted to then find anagrams of the word dog you would use map.get(sortChars("dog")).
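A minimal sketch of that map, assuming the helper names sortChars and addWord (both mine, not from the original code):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AnagramIndex {
    // Key = the word's characters in sorted order; value = all words sharing it.
    static Map<String, Set<String>> index = new HashMap<String, Set<String>>();

    static String sortChars(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars); // "dog" -> "dgo"
    }

    static void addWord(String word) {
        String key = sortChars(word);
        Set<String> group = index.get(key);
        if (group == null) {
            group = new HashSet<String>();
            index.put(key, group);
        }
        group.add(word); // "dgo" => {"dog", "god", ...}
    }

    public static void main(String[] args) {
        for (String w : new String[] {"dog", "god", "cat"}) addWord(w);
        System.out.println(index.get(sortChars("dog"))); // dog and god, in some order
    }
}
```

Querying is then just map.get(sortChars(word)), exactly as described above.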
On another note, I'm going to reiterate what the other answer mentioned, namely that it's important to modularize your code. You should put logically related functions/tasks in their own methods as opposed to having everything in one place. This helps with readability and your/others' ability to maintain your code in the future.
You are doing too many things at once here. You've got file IO, user input, sorting and the algorithm all in one place. Try to modularize it so you have a function called isAnagram(List<Character> firstPhrase, List<Character> secondPhrase). Make sure that works correctly, then have all the other steps figure out how to call it. This way you can test your algorithm without requiring user input. This will be a much faster feedback loop.
Its algorithm will work like this:
(optionally) copy the contents of the input so you don't mutate the input
compare their lengths. If they're not equal, return false
sort each list
iterate element by element and check if they're equal. If they're not, return false
if you reach the end, return true.
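Those steps translate into a small isAnagram along these lines (a sketch, with a toChars helper I added for convenience):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AnagramCheck {
    // Implements the steps above: copy, compare lengths, sort, compare elements.
    static boolean isAnagram(List<Character> firstPhrase, List<Character> secondPhrase) {
        if (firstPhrase.size() != secondPhrase.size()) return false; // lengths differ
        List<Character> a = new ArrayList<Character>(firstPhrase);   // copies, so the
        List<Character> b = new ArrayList<Character>(secondPhrase);  // inputs aren't mutated
        Collections.sort(a);
        Collections.sort(b);
        return a.equals(b); // element-by-element comparison
    }

    static List<Character> toChars(String s) {
        List<Character> out = new ArrayList<Character>();
        for (char c : s.toCharArray()) out.add(c);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(isAnagram(toChars("listen"), toChars("silent"))); // true
    }
}
```

With this in place, file IO and user input only need to produce two List<Character> values and call it.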
I am trying to read a java file and modify it simultaneously. This is what I need to do : My file is of the format :
aaa
bbb
aaa
ccc
ddd
ddd
I need to read through the file and get the count of the # of occurrences and modify the duplicates to get the following file:
aaa - 2
bbb - 1
ccc - 1
ddd - 2
I tried using the RandomAccessFile to do this, but couldn't do it. Can somebody help me out with the code for this one?
It's far easier if you don't do two things at the same time. The best way is to run through the entire file, count all the occurrences of each string in a hash and then write out all the results into another file. Then if you need to, move the new file over the old one.
You never want to read and write to the same file at the same time. Your offsets within the file will shift everytime you make a write and the read cursor will not keep track of that.
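The count-then-write approach can be sketched like this (the class name is mine, and the summary is returned as a string here rather than written to a second file, so the logic is easy to check in isolation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DedupCount {
    // One pass: count occurrences (in first-seen order), then emit the
    // "word - count" lines that would go into the new file.
    static String summarize(String[] lines) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String line : lines) {
            Integer c = counts.get(line);
            counts.put(line, c == null ? 1 : c + 1);
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.append(e.getKey()).append(" - ").append(e.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The sample data from the question above.
        System.out.print(summarize(new String[] {"aaa", "bbb", "aaa", "ccc", "ddd", "ddd"}));
    }
}
```

In the real program you would read the lines with a BufferedReader, write the result with a BufferedWriter to a new file, and then move the new file over the old one.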
I'd do it this way:
- Parse the original file and save all entries into a new file. Use fixed-length data blocks to write entries to the new file (so, say your longest string is 10 bytes long, take 10 + x as the block length, where x is room for the extra info you want to save along with the entries; the 10th entry in the file would then be at byte position 10*(10+x)). You'd also have to know the number of entries in advance to create the file (the file size would be noOfEntries*blocklength; use a RandomAccessFile and setLength to set this file length).
- Now use the quicksort algorithm to sort the entries in the file (my idea is to end up with a sorted file, which makes the final step far easier and faster. Hashing would theoretically work too, but you'd then have to deal with rearranging duplicate entries so that all duplicates are grouped together - not really an option here).
- Parse the file with the now-sorted entries. Save a pointer to the first occurrence of each entry. Increment the number of duplicates until there is a new entry. Change the first entry, add the additional info you want, and write it to a new "final result" file. Continue this way with all remaining entries in the sorted file.
Conclusions: I think this should be reasonably fast and use a reasonable amount of resources. However, it depends on the data you have. If you have a very large number of duplicates, quicksort performance will degrade. Also, if your longest data entry is much longer than the average, it will waste file space.
If you have to, there are ways you can manipulate the same file and update the counters, without having to open another file or keep everything in memory. However, the simplest of the approaches would be very slow.
import java.io.*;
import java.util.*;

class WordFrequencyCountTest
{
    public static void main(String args[])
    {
        System.out.println("Enter the file name:");
        Scanner sc = new Scanner(System.in);
        String fname = sc.next();
        File f1 = new File(fname);
        if (!f1.exists())
        {
            System.out.println("Source file does not exist");
            System.exit(0);
        }
        else
        {
            try
            {
                FileReader fis = new FileReader(f1);
                BufferedReader br = new BufferedReader(fis);
                String str;
                Map<String, Integer> map = new TreeMap<String, Integer>();
                while ((str = br.readLine()) != null)
                {
                    for (String token : str.split("\\s")) // each word on the line
                    {
                        Integer count = map.get(token);
                        map.put(token, count == null ? 1 : count + 1);
                    }
                }
                br.close();
                System.out.println("========");
                for (Map.Entry<String, Integer> entry : map.entrySet())
                {
                    System.out.println(entry.getKey() + " " + entry.getValue());
                }
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
        }
    }
}