Counting words from a text-file in Java

I'm writing a program that reads a text file and counts the number of words in it. The assignment defines a word as follows: 'A word is a non-empty string consisting only of letters (a, ..., z, A, ..., Z), surrounded by blanks, punctuation, hyphenation, line start, or line end.'
I'm a novice at Java programming. So far I've managed to write this instance method, which I thought should work, but it doesn't.
public int wordCount() {
int countWord = 0;
String line = "";
try {
File file = new File("testtext01.txt");
Scanner input = new Scanner(file);
while (input.hasNext()) {
line = line + input.next()+" ";
input.next();
}
input.close();
String[] tokens = line.split("[^a-zA-Z]+");
for (int i=0; i<tokens.length; i++){
countWord++;
}
return countWord;
} catch (Exception ex) {
ex.printStackTrace();
}
return -1;
}

Quoting from Counting words in text file?
int wordCount = 0;
while (input.hasNextLine()){
String nextLine = input.nextLine();
Scanner word = new Scanner(nextLine);
while(word.hasNext()){
wordCount++;
word.next();
}
word.close();
}
input.close();

The only usable word separators in your file are spaces and hyphens, so you can use a regular expression with the split() method.
int num_words = line.split("[\\s\\-]+").length; // number of words
System.out.print("Number of words in file is " + num_words);
About the regular expression:
\\s matches whitespace (including line breaks) and \\- matches a hyphen; the trailing + treats a run of consecutive separators as a single one. So wherever there is a space, line break or hyphen, the text is split. The pieces are returned as an array whose length is the number of words in your file, provided the text does not begin with a separator.
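One caveat with counting via split() is that the array can contain empty strings: a separator at the very start of the text leaves an empty first element, and an empty input still produces an array of length 1. The following is a minimal sketch of that behaviour, not part of the original answer; the class name `SplitPitfalls` and the helper `countTokens` are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class: shows where split() produces empty strings.
class SplitPitfalls {

    // Counts only the non-empty pieces after splitting on runs of
    // whitespace or hyphens (same separator set as the answer above).
    static int countTokens(String text) {
        int count = 0;
        for (String token : text.split("[\\s\\-]+")) {
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The leading space leaves an empty first element in the array.
        System.out.println(Arrays.toString(" well-known fact".split("[\\s\\-]+")));
        // An empty input still yields an array of length 1.
        System.out.println("".split("[\\s\\-]+").length);
        // Skipping empty pieces gives the intended count.
        System.out.println(countTokens(" well-known fact"));
    }
}
```

Filtering out empty pieces, as `countTokens` does, keeps the count right for lines that are blank or that start with a separator.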

You can use a Java regular expression.
See http://docs.oracle.com/javase/tutorial/essential/regex/groups.html to learn about groups.
public int wordCount(){
String patternToMatch = "([a-zA-Z]+)"; // note: A-z would also match the punctuation between 'Z' and 'a'
int countWord = 0;
try {
Pattern pattern = Pattern.compile(patternToMatch);
File file = new File("abc.txt");
Scanner sc = new Scanner(file);
while(sc.hasNextLine()){
Matcher matcher = pattern.matcher(sc.nextLine());
while(matcher.find()){
countWord++;
}
}
sc.close();
}catch(Exception e){
e.printStackTrace();
}
return countWord > 0 ? countWord : -1;
}

void run(String path)
throws Exception
{
try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(path), "UTF-8")))
{
int result = 0;
while (true)
{
String line = reader.readLine();
if (line == null)
{
break;
}
result += countWords(line);
}
System.out.println("Words in text: " + result);
}
}
final Pattern pattern = Pattern.compile("[A-Za-z]+");
int countWords(String text)
{
Matcher matcher = pattern.matcher(text);
int result = 0;
while (matcher.find())
{
++result;
System.out.println("Matcher found [" + matcher.group() + "]");
}
System.out.println("Words in line: " + result);
return result;
}

Related

Word Count from a text file using Java

I am trying to write a simple code that will give me the word count from a text file. The code is as follows:
import java.io.File; //to read file
import java.util.Scanner;
public class ReadTextFile {
public static void main(String[] args) throws Exception {
String filename = "textfile.txt";
File f = new File (filename);
Scanner scan = new Scanner(f);
int wordCnt = 1;
while(scan.hasNextLine()) {
String text = scan.nextLine();
for (int i = 0; i < text.length(); i++) {
if(text.charAt(i) == ' ' && text.charAt(i-1) != ' ') {
wordCnt++;
}
}
}
System.out.println("Word count is " + wordCnt);
}
}
This code compiles but does not give the correct word count. What am I doing incorrectly?
Right now you only increment wordCnt when the current character is a space and the character before it is not. This misses several cases, such as words separated by a newline rather than a space. In addition, text.charAt(i-1) throws a StringIndexOutOfBoundsException when a line starts with a space, because i is 0 at that point. Consider a file that looks like this:
This is a text file\n
with a bunch of\n
words.
Your method should return ten, but since there is no space after the words "file" and "of", the words that follow them are not counted.
If you just want the word count you can do something along the lines of:
while(scan.hasNextLine()){
String text = scan.nextLine();
wordCnt+= text.split("\\s+").length;
}
This splits each line on runs of whitespace and adds the number of tokens in the resulting array. With this approach wordCnt should start at 0 rather than 1.
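An alternative that avoids per-line splitting altogether is to let Scanner hand out tokens directly, since its default delimiter treats spaces, tabs and line breaks alike. The sketch below is only an illustration under that assumption; `TokenCount` and `countTokens` are hypothetical names, and a StringReader stands in for the file so the example is self-contained.

```java
import java.io.StringReader;
import java.util.Scanner;

// Hypothetical demo class: counts whitespace-delimited tokens with Scanner.
class TokenCount {

    // Scanner's default delimiter is whitespace, so newlines and blank
    // lines need no special handling here.
    static int countTokens(Readable source) {
        int count = 0;
        try (Scanner scan = new Scanner(source)) {
            while (scan.hasNext()) {
                scan.next(); // consume exactly one token per iteration
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Same sample as above, with an extra blank line in the middle.
        String sample = "This is a text file\nwith a bunch of\n\nwords.";
        System.out.println(countTokens(new StringReader(sample))); // prints 10
    }
}
```

For a real file, a `new Scanner(new File(...))` could be passed through the same loop; the counter starts at 0 and blank lines contribute nothing.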
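First of all, remember to close resources. Please check this out.
Since Java 8 you can count words this way:
// needs java.io.IOException, java.nio.file.Files, java.nio.file.Paths, java.util.Arrays, java.util.stream.Stream
String regex = "\\s+";
String filename = "textfile.txt";
long wordCnt = 0;
// Scanner has no lines() method, so Files.lines is used to stream the file line by line
try (Stream<String> lines = Files.lines(Paths.get(filename))) {
wordCnt = lines.flatMap(str -> Arrays.stream(str.trim().split(regex)))
.filter(token -> !token.isEmpty()) // blank lines produce one empty token
.count();
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("Word count is " + wordCnt);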

BufferedReader ( scanner )

This is what I have for now. I want to know how many times a given word occurs in a .txt document. I am trying to use BufferedReader but haven't managed it well. I guess there is an easier way to solve this, but I don't know it.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
public class TekstiAnalüsaator {
public static void main(String[] args) throws Exception {
InputStream baidid = new FileInputStream("test.txt");
InputStreamReader tekst = new InputStreamReader(baidid, "UTF-8");
BufferedReader puhverdab = new BufferedReader(tekst);
String rida = puhverdab.readLine();
while (rida != null){
System.out.println("Reading: " + rida);
rida = puhverdab.readLine();
}
puhverdab.close();
}
}
I want to search for words using the structure below: give the file, then the word to find, and get back how many times that word occurs in the file.
TekstiAnalüsaator analüsaator = new TekstiAnalüsaator("kiri.txt");
int esinemisteArv = analüsaator.sõneEsinemisteArv("kala");
Please see the code example below. This should solve the issue you are facing.
import java.io.*;
public class CountWords {
public static void main(String args[]) throws IOException {
System.out.println(count("Test.java", "static"));
}
public static int count(String filename, String wordToSearch) throws IOException {
// Number of lines that contain wordToSearch (not the total number of occurrences)
int tokencount = 0;
// try-with-resources closes the reader even if readLine() throws
try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
String s;
while ((s = br.readLine()) != null) {
if (s.contains(wordToSearch))
tokencount++;
}
}
return tokencount;
}
}
This is a bit of a tricky question, because counting words in a string is not a simple task. Your approach to reading the file line by line is fine, so the remaining problem is how to count the word matches.
For example, you can do a simple check for matches like this:
public static int getCountOFWordsInLine(String line, String test){
int count=0;
int index=0;
while(line.indexOf(test,index ) != -1) {
count++;
index=line.indexOf(test,index)+1;
}
return count;
}
The problem with this approach is that if your word is "test" and your string is "Next word matches asdfatestsdf", the embedded "test" is counted as a match. So you can try a more precise regex:
public static int getCountOFWordsInLine(String line, String word) {
int count = 0;
Pattern pattern = Pattern.compile("\\b"+word+"\\b");
Matcher matcher = pattern.matcher(line);
while (matcher.find())
count++;
return count;
}
It checks for the word surrounded by \b, which matches a word boundary.
It still won't find the word if it starts with an uppercase letter, though. If you want the search to be case insensitive, you can modify the previous method to convert both the line and the word to lowercase before searching. But it depends on your definition of a word.
The whole program becomes:
import java.io.*;
import java.util.regex.*;
public class MainClass {
public static void main(String[] args) throws InterruptedException {
try {
InputStream baidid = new FileInputStream("c:\\test.txt");
InputStreamReader tekst = new InputStreamReader(baidid, "UTF-8");
BufferedReader puhverdab = new BufferedReader(tekst);
String rida = puhverdab.readLine();
String word="test";
int count=0;
while (rida != null){
System.out.println("Reading: " + rida);
count+=getCountOFWordsInLine(rida,word );
rida = puhverdab.readLine();
}
System.out.println("count:"+count);
puhverdab.close();
}catch(Exception e) {
e.printStackTrace();
}
}
public static int getCountOFWordsInLine(String line, String test) {
int count = 0;
Pattern pattern = Pattern.compile("\\b"+test+"\\b");
Matcher matcher = pattern.matcher(line);
while (matcher.find())
count++;
return count;
}
}
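The case-insensitivity point raised in the answer above can also be handled by the regex engine itself instead of lowercasing the input. The following is a rough sketch under two assumptions that are not in the original answer: the search word should be treated as literal text (hence Pattern.quote), and ASCII case folding is sufficient. `WholeWordCount` and `countWholeWord` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: whole-word, case-insensitive occurrence count.
class WholeWordCount {

    static int countWholeWord(String line, String word) {
        // Pattern.quote makes regex metacharacters in the word literal;
        // \b on both sides rejects matches embedded in longer words.
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(line);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "Test", "test" and "TEST" match; "asdfatestsdf" does not.
        System.out.println(countWholeWord("Test the test, not asdfatestsdf or TEST.", "test"));
    }
}
```

By default CASE_INSENSITIVE only folds ASCII letters; for words with non-ASCII letters, adding Pattern.UNICODE_CASE to the flags would likely be needed.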
import java.io.*;
import java.util.regex.*;
public class TA
{
public static void main(String[] args) throws Exception
{
InputStream baidid = new FileInputStream("test.txt");
InputStreamReader tekst = new InputStreamReader(baidid, "UTF-8");
BufferedReader puhverdab = new BufferedReader(tekst);
String rida;
String word = args[0]; // search word passed via command line
int count1=0, count2=0, count3=0, count4=0;
Pattern P1 = Pattern.compile("\\b" + word + "\\b");
Pattern P2 = Pattern.compile("\\b" + word + "\\b", Pattern.CASE_INSENSITIVE);
while ((rida = puhverdab.readLine()) != null)
{
System.out.println("Reading: " + rida);
// Version 1: counts lines containing [word]
if (rida.contains(word)) count1++;
// Version 2: counts every occurrence of [word], including inside longer words
int pos = 0;
while ((pos = rida.indexOf(word, pos)) != -1) { count2++; pos++; }
// Version 3: whole-word matches (\b is a word boundary, not only whitespace)
Matcher m1 = P1.matcher(rida);
while (m1.find()) count3++;
// Version 4: whole-word matches, case insensitive
Matcher m2 = P2.matcher(rida);
while (m2.find()) count4++;
}
System.out.println("Found exactly " + count1 + " line(s) containing word: \"" + word + "\"");
System.out.println("Found word \"" + word + "\" exactly " + count2 + " time(s)");
System.out.println("Found word \"" + word + "\" surrounded by whitespace " + count3 + " time(s).");
System.out.println("Found, case insensitive search, word \"" + word + "\" surrounded by whitespace " + count4 + " time(s).");
puhverdab.close();
}
}
This reads line-by-line as you've already done, splits a line by whitespace to obtain individual words, and checks each word for a match.
int countWords(String filename, String word) throws Exception {
InputStream inputStream = new FileInputStream(filename);
InputStreamReader inputStreamReader = new InputStreamReader(inputStream, "UTF-8");
BufferedReader reader = new BufferedReader(inputStreamReader);
int count = 0;
String line = reader.readLine();
while (line != null) {
String[] words = line.split("\\s+");
for (String w : words)
if (w.equals(word))
count++;
line = reader.readLine();
}
reader.close();
return count;
}

InputStream reading file and counting lines/words

I'm working on a project and I'm trying to count
1) The number of words.
2) The number of lines in a text file.
My problem is that I can't figure out how to detect when the file goes to the next line so I can increment lines correctly. Basically if next is not a space increment words and if next is a new line, increment lines. How would I do this? Thanks!
public static void readFile(Scanner f) {
int words = 0;
int lines = 0;
while (f.hasNext()) {
if (f.next().equals("\n")) {
lines++;
} else if (!(f.next().equals(" "))) {
words++;
}
}
System.out.println("Total number of words: " + words);
System.out.println("Total number of lines: " + lines);
}
Try this:
public static void readFile(Scanner f) {
int words = 0;
int lines = 0;
while (f.hasNextLine()) {
String line = f.nextLine();
lines++;
for (String token : line.split("\\s+")) {
if (!token.isEmpty()) {
words++;
}
}
}
System.out.println("Total number of words: " + words);
System.out.println("Total number of lines: " + lines);
}
Do you have to use InputStream? (Yes) Then it is better to wrap it in an InputStreamReader and a BufferedReader, so you can read the file line by line and increment the counter as you go.
int numLines = 0;
try (BufferedReader br = new BufferedReader(new InputStreamReader(inputStream))) {
String line;
while ((line = br.readLine()) != null)
{
numLines++;
// process the line.
}
}
Then, to count the words, split each line using a regular expression that matches whitespace: String[] myStringArray = myString.split(myRegexPattern); returns an array of all the words. Then all you do is numWords += myStringArray.length; (length is a field on arrays, not a method).
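Put together, the line counting and word counting described above might look like the sketch below. It is an illustration only: `LineWordCount` and `count` are hypothetical names, UTF-8 is an assumed encoding, and a ByteArrayInputStream stands in for the real file stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: counts lines and words from any InputStream.
class LineWordCount {

    // Returns a two-element array: {number of lines, number of words}.
    static int[] count(InputStream in) {
        int lines = 0;
        int words = 0;
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines++; // every readLine() result is one line, even if blank
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    // trim() first so leading whitespace adds no empty token
                    words += trimmed.split("\\s+").length;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new int[] {lines, words};
    }

    public static void main(String[] args) {
        byte[] sample = "one two\n\n  three four five\n".getBytes(StandardCharsets.UTF_8);
        int[] result = count(new ByteArrayInputStream(sample));
        System.out.println("lines=" + result[0] + " words=" + result[1]); // lines=3 words=5
    }
}
```

Blank lines are counted as lines but contribute no words, which is usually what a "wc"-style count is expected to do.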
You can use an InputStreamReader to create a BufferedReader, which can read a file line by line:
int amountOfLines = 0;
// the reader must be declared in the try header so it is in scope for the loop and closed afterwards
try (BufferedReader br = new BufferedReader(new InputStreamReader(inputStream))) {
String line;
while ((line = br.readLine()) != null) {
amountOfLines++;
// process the line.
}
} catch (IOException e) {
e.printStackTrace();
}
You can then use the split(String) method to separate each line into its parts.
Try the following:
public static void readFile(Scanner f) {
int words = 0;
int lines = 0;
while (f.hasNextLine()) {
String line = f.nextLine();
String[] arr = line.split("\\s");
words += arr.length;
lines++;
}
System.out.println("Total number of words: " + words);
System.out.println("Total number of lines: " + lines);
}

Count number of sentences in a text file

By sentences I mean strings that end in !, ? or .
except for things like Dr. and Mr.
It is true that you cannot reliably detect a sentence in Java, because that depends on grammar.
But I guess what I mean is a period, exclamation mark or question mark followed by a capital letter.
How would one do this?
This is what I have,
but it's not working:
BufferedReader Compton = new BufferedReader(new FileReader(fileName));
int sentenceCount=0;
String violet;
String limit="?!.";
while(Compton.ready())
{
violet=Compton.readLine();
for(int i=0; i<violet.length()-1;i++)
{
if(limit.indexOf(violet.charAt(i)) != -1 && i>0 && limit.indexOf(violet.charAt(i-1)) != -1)
{
sentenceCount++;
}
}
}
System.out.println("the amount of sentence is " + sentenceCount);
EDIT
New way that works better
String violet;
while(Compton.ready())
{
violet=Compton.readLine();
sentenceCount=violet.split("[!?.:]+").length;
System.out.println("the number of words in line is " +
sentenceCount);
}
BufferedReader reader = new BufferedReader(new FileReader(fileName));
int sentenceCount = 0;
String line;
String delimiters = "?!.";
while ((line = reader.readLine()) != null) { // Continue reading until end of file is reached
for (int i = 0; i < line.length(); i++) {
if (delimiters.indexOf(line.charAt(i)) != -1) { // If the delimiters string contains the character
sentenceCount++;
}
}
}
reader.close();
System.out.println("The number of sentences is " + sentenceCount);
One-liner:
int n = new String(Files.readAllBytes(Paths.get(path))).split("[\\.\\?!]").length;
This uses Java 7 constructs to read the whole file into a byte array, creates a string from it, splits it into an array of sentences, and takes the length of that array.
A potential way to do this is to scan your file as words and then count words that are not in your exception list that end in your given punctuation.
Here's a possible implementation using Java 8 streams:
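List<String> exceptions = Arrays.asList("Dr.", "Mr.");
// assuming filename is a String path; new Scanner(File) throws FileNotFoundException, so it is created outside the lambda
Scanner scanner = new Scanner(new File(filename));
// Scanner implements Iterator<String>, so it can back an Iterable
Iterable<String> iterableScanner = () -> scanner;
long sentenceCount = StreamSupport.stream(iterableScanner.spliterator(), false)
.filter(word -> word.matches(".*[\\.\\?!]"))
.filter(word -> !exceptions.contains(word))
.count();
scanner.close();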
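The rule described in the question (a terminator followed by a capital letter, with exceptions such as "Dr." and "Mr.") can be approximated without streams as well. The sketch below is one possible reading of that rule, not a definitive sentence detector: it also counts a terminator at the very end of the text, and `SentenceCount`, `countSentences` and the abbreviation set are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class: heuristic sentence counter.
class SentenceCount {

    // Illustrative exception list; a real one would need many more entries.
    static final Set<String> ABBREVIATIONS = new HashSet<>(Arrays.asList("Dr.", "Mr."));

    static int countSentences(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        String[] tokens = trimmed.split("\\s+");
        int count = 0;
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            boolean endsWithTerminator = token.matches(".*[.?!]");
            boolean isAbbreviation = ABBREVIATIONS.contains(token);
            boolean atEnd = (i == tokens.length - 1);
            // A sentence ends here only if the next token starts with a capital letter.
            boolean nextStartsUpper = !atEnd && Character.isUpperCase(tokens[i + 1].charAt(0));
            if (endsWithTerminator && !isAbbreviation && (atEnd || nextStartsUpper)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "Dr. Smith arrived. Was he late? No! He was e.g. early.";
        System.out.println(countSentences(sample)); // prints 4
    }
}
```

The heuristic still miscounts cases such as an unlisted abbreviation followed by a proper noun, which is the grammar problem the question already acknowledges.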

Removing back to back dashes and asterisks in a string

I am having some trouble reading in a file and removing all of the punctuation from it.
Below is what I currently have, and I cannot figure out why "----" and "*****" still occur.
Can anyone point me in a direction to figure out how I need to adjust my replaceAll() in order to make sure repeated occurrences of punctuation are removed?
public void analyzeFile(File filepath) {
try {
FileInputStream fStream = new FileInputStream(filepath);
DataInputStream in = new DataInputStream(fStream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String textFile = "";
String regex = "[a-zA-Z0-9\\s]";
String putString = "";
wordCount = 0;
while ((textFile = br.readLine()) != null) {
if (!textFile.equals("") && textFile.length() > 0) {
String[] words = textFile.split(" ");
wordCount += words.length;
for (int i = 0; i < words.length; i++) {
putString = cleanString(regex, words[i]);
if(putString.length() > 0){
mapInterface.put(putString, 1);
}
}
putString = "";
}
}
in.close();
} catch (Exception e) {
System.out.println("Error while attempting to read file: "
+ filepath + " " + e.getMessage());
}
}
private String cleanString(String regex, String str){
String newString = "";
Pattern regexChecker = Pattern.compile(regex);
Matcher regexMatcher = regexChecker.matcher(str);
while(regexMatcher.find()){
if(regexMatcher.group().length() != 0){
newString += regexMatcher.group().toString();
}
}
return newString;
}
Surely you can use the \w character class? It matches letters, digits and the underscore, but not other punctuation.
putString = words[i].replaceAll("[^\\w]+", "");
This replaces every run of non-word characters with an empty string.
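Because \w also matches the underscore, an explicit character class may be closer to the letters-and-digits set used in the question. A small sketch of both variants follows; `CleanToken` and `clean` are hypothetical names, and the inputs are made up for illustration.

```java
// Hypothetical demo class: strips punctuation from a single token.
class CleanToken {

    // Removes every run of characters that are not ASCII letters or digits.
    static String clean(String token) {
        return token.replaceAll("[^a-zA-Z0-9]+", "");
    }

    public static void main(String[] args) {
        // Tokens made only of punctuation collapse to the empty string.
        System.out.println("[" + clean("----") + "]");
        System.out.println("[" + clean("*****hello,") + "]");
        // With \w the underscore survives; with the explicit class it does not.
        System.out.println("snake_case".replaceAll("[^\\w]+", ""));
        System.out.println(clean("snake_case"));
    }
}
```

A caller can then skip any token whose cleaned form is empty, which prevents entries such as "----" from reaching the map.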
