I'm having a problem counting the number of words in a file. My approach is to count a word every time I see a space or a newline.
The problem is that if there are multiple blank lines between paragraphs, I end up counting them as words too. If you look at the readFile() method you can see what I am doing.
Could you help me out and point me in the right direction on how to fix this?
Example input file (including a blank line):
word word word
word word
word word word
You can use a Scanner with a FileInputStream instead of a BufferedReader with a FileReader. For example:
File file = new File("sample.txt");
try(Scanner sc = new Scanner(new FileInputStream(file))){
int count=0;
while(sc.hasNext()){
sc.next();
count++;
}
System.out.println("Number of words: " + count);
}
I would change your approach a bit. First, I would use a BufferedReader to read the file line by line using readLine(). Then trim each line, split it on runs of whitespace using String.split("\\s+"), and use the length of the resulting array as the number of words on that line; skip empty lines, because splitting an empty string still yields one (empty) element. To get the number of characters, you could look at the length of each line or of each split word (depending on whether you want to count whitespace as characters).
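A minimal sketch of the line-by-line approach described above. The class name LineWordCounter and the in-memory sample text are assumptions for illustration; with a real file you would wrap a FileReader instead of the StringReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class LineWordCounter {

    // Counts whitespace-separated words, reading one line at a time.
    static int countWords(BufferedReader reader) {
        int count = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // blank line between paragraphs: contributes no words
                }
                count += trimmed.split("\\s+").length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Same shape as the example input: three lines of words and one blank line.
        String text = "word word word\nword word\n\nword word word\n";
        System.out.println(countWords(new BufferedReader(new StringReader(text))));
    }
}
```

Trimming before splitting matters: without it, a line that starts with whitespace would produce an empty first element and inflate the count.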
This is just a thought. There is one very easy way to do it. If you just need the number of words and not the actual words, you can use WordUtils from Apache Commons Lang:
import org.apache.commons.lang.WordUtils;
public class CountWord {
public static void main(String[] args) {
String str = "Just keep a boolean flag around that lets you know if the previous character was whitespace or not pseudocode follows";
String initials = WordUtils.initials(str);
System.out.println(initials);
//so number of words in your file will be
System.out.println(initials.length());
}
}
Just keep a boolean flag around that lets you know if the previous character was whitespace or not (pseudocode follows):
boolean prevWhitespace = true; // treat the start of the input as whitespace
int wordCount = 0;
while (char ch = getNextChar(input)) {
if (isWhitespace(ch)) {
prevWhitespace = true;
} else {
if (prevWhitespace) {
wordCount++; // first character of a new word
}
prevWhitespace = false;
}
}
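A runnable sketch of the flag idea, for illustration only. It counts a word at the first non-whitespace character that follows whitespace (or the start of the input), so runs of blank lines, leading whitespace and a missing trailing newline do not change the result. The class name FlagWordCounter is an assumption; for a file you would feed the characters from a Reader instead of a CharSequence.

```java
public class FlagWordCounter {

    // Counts words by tracking whether the previous character was part of a word.
    static int countWords(CharSequence text) {
        boolean inWord = false;
        int wordCount = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                inWord = false;          // whitespace ends the current word, if any
            } else if (!inWord) {
                inWord = true;           // first character of a new word
                wordCount++;
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        System.out.println(countWords("word word word\nword word\n\n\nword word word"));
    }
}
```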
I think a correct approach would be by means of Regex:
String fileContent = <text from file>;
String[] words = Pattern.compile("\\s+").split(fileContent);
System.out.println("File has " + words.length + " words");
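Hope it helps. The meaning of "\\s+" (one or more whitespace characters) is explained in the Pattern javadoc. Note that if the content starts with whitespace, split() returns an empty first element, so trim the content before splitting.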
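A small sketch of the regex approach with the two edge cases handled: leading whitespace (which would otherwise yield an empty first element) and empty content (which would otherwise be counted as one word). The class and method names are assumptions for illustration.

```java
import java.util.regex.Pattern;

public class RegexWordCounter {

    // Compiled once and reused; matches one or more whitespace characters.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static int countWords(String content) {
        String trimmed = content.trim();
        if (trimmed.isEmpty()) {
            return 0; // splitting "" would return one empty element
        }
        return WHITESPACE.split(trimmed).length;
    }

    public static void main(String[] args) {
        String fileContent = "\n word  word\n\nword ";
        System.out.println("File has " + countWords(fileContent) + " words");
    }
}
```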
import java.io.BufferedReader;
import java.io.FileReader;
public class CountWords {
public static void main (String args[]) throws Exception {
System.out.println ("Counting Words");
FileReader fr = new FileReader ("c:\\Customer1.txt");
BufferedReader br = new BufferedReader (fr);
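String line = br.readLine();
int count = 0;
while (line != null) {
// split on runs of whitespace so repeated spaces do not create extra tokens
String []parts = line.trim().split("\\s+");
for( String w : parts)
{
// a blank line still yields one empty token; do not count it
if (!w.isEmpty()) count++;
}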
line = br.readLine();
}
System.out.println(count);
}
}
Hack solution
You can read the text file into a String variable, then split the String into an array using a single space as the delimiter: stringVar.split(" ").
The array length would equal the number of "words" in the file, as long as words are separated by exactly one space; consecutive spaces or line breaks would throw the count off.
Of course, this wouldn't give you a count of lines.
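3 steps: consume all the whitespace (counting line breaks along the way), stop if the input has ended, then consume all the non-whitespace characters of one word.
int c = inFile.read();
while (c != -1) {
// consume whitespace; a newline is whitespace too, so count lines here
while (c != -1 && Character.isWhitespace(c)) {
if (c == '\n') { numberLines++; }
c = inFile.read();
}
if (c == -1) { break; }
// consume one word
while (c != -1 && !Character.isWhitespace(c)) {
numberChars++;
c = inFile.read();
}
numberWords++;
}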
File Word-Count
If the words are separated by symbols rather than spaces, you can split on those symbols and count the resulting pieces.
Scanner sc = new Scanner(new FileInputStream(new File("Input.txt")));
int count = 0;
while (sc.hasNext()) {
String[] s = sc.next().split("\\d*[.#:=#-]");
for (int i = 0; i < s.length; i++) {
if (!s[i].isEmpty()){
System.out.println(s[i]);
count++;
}
}
}
System.out.println("Word-Count : "+count);
Take a look at my solution here; it should work. The idea is to strip the unwanted symbols from the words, then separate the words and store them in another variable (I used an ArrayList). By adjusting the "excludedSymbols" variable you can add more symbols that you would like to exclude from the words.
public static void countWords () {
String textFileLocation ="c:\\yourFileLocation";
String readWords ="";
ArrayList<String> extractOnlyWordsFromTextFile = new ArrayList<>();
// excludedSymbols can be extended to whatever you want to exclude from the file
String[] excludedSymbols = {" ", "," , "." , "/" , ":" , ";" , "<" , ">", "\n"};
String readByteCharByChar = "";
boolean testIfWord = false;
try {
InputStream inputStream = new FileInputStream(textFileLocation);
byte byte1 = (byte) inputStream.read();
while (byte1 != -1) {
readByteCharByChar +=String.valueOf((char)byte1);
for(int i=0;i<excludedSymbols.length;i++) {
if(readByteCharByChar.equals(excludedSymbols[i])) {
if(!readWords.equals("")) {
extractOnlyWordsFromTextFile.add(readWords);
}
readWords ="";
testIfWord = true;
break;
}
}
if(!testIfWord) {
readWords+=(char)byte1;
}
readByteCharByChar = "";
testIfWord = false;
byte1 = (byte)inputStream.read();
if(byte1 == -1 && !readWords.equals("")) {
extractOnlyWordsFromTextFile.add(readWords);
}
}
inputStream.close();
System.out.println(extractOnlyWordsFromTextFile);
System.out.println("The number of words in the chosen text file is: " + extractOnlyWordsFromTextFile.size());
} catch (IOException ioException) {
ioException.printStackTrace();
}
}
This can be done in a very concise way using Java 8:
Files.lines(Paths.get(file))
.flatMap(str->Stream.of(str.split("[ ,.!?\r\n]")))
.filter(s->s.length()>0).count();
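A self-contained sketch of the same stream pipeline. The class name StreamWordCounter is an assumption for illustration. The Path overload wraps Files.lines in try-with-resources, because the stream it returns holds the file open until it is closed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

public class StreamWordCounter {

    // Splits every line on spaces and basic punctuation and counts the non-empty pieces.
    static long countWords(Stream<String> lines) {
        return lines
                .flatMap(line -> Arrays.stream(line.split("[ ,.!?]+")))
                .filter(word -> !word.isEmpty())
                .count();
    }

    // Same count for a file; the stream is closed when the try block exits.
    static long countWords(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return countWords(lines);
        }
    }

    public static void main(String[] args) {
        System.out.println(countWords(Stream.of("Hello, world!", "", "How are you?")));
    }
}
```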
BufferedReader bf= new BufferedReader(new FileReader("G://Sample.txt"));
String line=bf.readLine();
while(line!=null)
{
String[] words=line.split(" ");
System.out.println("this line contains " +words.length+ " words");
line=bf.readLine();
}
The code below works in Java 8:
//Read file into String
String fileContent=new String(Files.readAllBytes(Paths.get("MyFile.txt")),StandardCharsets.UTF_8);
//Split on runs of non-letter characters and keep the pieces in a list of strings
List<String> words = Arrays.asList(fileContent.split("\\PL+"));
int count=0;
for(String x: words){
if(x.length()>0) count++; //skip the empty string that a leading delimiter produces
}
System.out.println(count);
It is easy: we can get the String from the file with a method such as getText() and count the words in it:
public class Main {
static int countOfWords(String str) {
if (str == null || str.equals("")) {
return 0;
}else{
int numberWords = 0;
for (char c : str.toCharArray()) {
if (c == ' ') {
numberWords++;
}
}
return ++numberWords;
}
}
}
Related
I have 3 files, "MyFile", "myOtherFile" and "yetAnotherFile", that my code draws words from. It puts the words in an array, checks whether they all start with an uppercase letter, and if they do, sorts them alphabetically.
All 3 files have 3 or more words; one of them has a single word that starts with a lowercase letter, so I can test the invalid-input print line.
Somehow all 3 files print the invalid-input line.
I added a counter so that the print statement runs only if counter > 0.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.*;
public class StringSorter {
private String inputFileName;
//private String line;
public StringSorter(String fileName) {
inputFileName = fileName;
}
public void sortStrings() throws IOException {
FileReader input = new FileReader(inputFileName);
BufferedReader myReader = new BufferedReader(input);
String line, data = "";
String[] words;
int posCount = 0;
while ((line = myReader.readLine()) != null)
data += line;
words = data.split(",");
for(int posi = 0; posi < words.length; posi++) {
if(!Character.isUpperCase(words[posi].charAt(0))) {
posCount++;
}
}
if(posCount > 0) {
System.out.print("Invalid input. Word found which does not start with an uppercase letter.");
}
else {
for (int k = 0; k < words.length; k++) {
for (int i = k - 1; i >= 0; i--) {
if (words[i].charAt(0) < words[k].charAt(0)) {
String temp = words[k];
words[k] = words[i];
words[i] = temp;
k = i;
}
}
}
for(int print = 0; print < words.length - 1; print++){
System.out.print(words[print].trim() + ", ");
}
System.out.print(words[words.length-1]);
}
input.close();
myReader.close();
}
}
import java.io.*;
public class TestStringSorter {
public static void main(String[] args) throws IOException {
StringSorter sorterA = new StringSorter("MyFile.txt");
sorterA.sortStrings();
StringSorter sorterB = new StringSorter("myOtherFile.txt");
sorterB.sortStrings();
StringSorter sorterC = new StringSorter("yetAnotherFile.txt");
sorterC.sortStrings();
}
}
Invalid input. Word found which does not start with an uppercase letter.
Invalid input. Word found which does not start with an uppercase letter.
Invalid input. Word found which does not start with an uppercase letter.
I see what might be the problem. You're splitting on ',', but you have spaces after the comma. So you're going to have a "word" like " Dog", and if you test the first character of that, you're going to get a failure, because a space is not an uppercase letter.
Try splitting on:
words = data.split("[,\\s]+");
that would fix the problem with the spaces in the data.
I see another problem that will cause you to probably not get the results you expect. You're concatenating multiple lines together, but not putting anything between the lines, so the last word on one line is going to combine with the first word on the next. You probably want to put a "," between each line when you concatenate them together.
I guess you want to write your own sort. I'll leave that to you or others to debug. But you could just:
Arrays.sort(words);
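The suggestions above can be put together roughly as follows. This is a sketch under assumptions: the class and method names are made up for illustration, and the lines of the file are assumed to be joined with a newline so that words on adjacent lines do not run together.

```java
import java.util.Arrays;

public class CapitalizedWordSorter {

    // Splits on runs of commas and/or whitespace, so "Dog, Cat" yields "Dog" and "Cat".
    static String[] splitWords(String data) {
        String trimmed = data.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("[,\\s]+");
    }

    // True only if every word starts with an uppercase letter.
    static boolean allCapitalized(String[] words) {
        for (String word : words) {
            if (word.isEmpty() || !Character.isUpperCase(word.charAt(0))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] words = splitWords("Dog, Cat, Bird\nAnt, Emu");
        if (allCapitalized(words)) {
            Arrays.sort(words); // natural String order
            System.out.println(String.join(", ", words));
        } else {
            System.out.println("Invalid input. Word found which does not start with an uppercase letter.");
        }
    }
}
```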
Maybe you are putting a space before each word and this is what you are trying to check if it is upper-case...
I am tasked with taking a user's sentence, separating it at the uppercase letters, and making those letters lowercase after adding a " " before them.
I want to add a space at that position, so that if the user inputs "HappyDaysToCome" the output will be "Happy days to come".
Current code
public static void main(String[] args)
{
Scanner s = new Scanner(System.in);
System.out.println("Please enter a sentence");
String sentenceString = s.nextLine();
char[] sentenceArray = sentenceString.toCharArray();
for(int i = 0; i < sentenceArray.length; i++)
{
if(i!=0 && Character.isUpperCase(sentenceArray[i]))
{
Character.toLowerCase(sentenceArray[i]);
sentenceArray.add(i, ' ');
}
}
System.out.println(sentenceArray)
s.close();
}
}
There is no add method for arrays. Arrays are not resizable. If you really want to use a char[] array, you need to allocate one that is large enough, e.g. by counting the uppercase letters or simply by allocating an array that is sure to be large enough (twice the String length minus 1).
String input = ...
String outputString;
if (input.isEmpty()) {
outputString = "";
} else {
char[] output = new char[input.length() * 2 - 1];
output[0] = input.charAt(0);
int outputIndex = 1;
for (int i = 1; i < input.length(); i++, outputIndex++) {
char c = input.charAt(i);
if (Character.isUpperCase(c)) {
output[outputIndex++] = ' ';
output[outputIndex] = Character.toLowerCase(c);
} else {
output[outputIndex] = c;
}
}
outputString = new String(output, 0, outputIndex);
}
System.out.println(outputString);
Or, better still, use a StringBuilder:
String input = ...
String outputString;
if (input.isEmpty()) {
outputString = "";
} else {
StringBuilder sb = new StringBuilder().append(input.charAt(0));
for (int i = 1; i < input.length(); i++) {
char c = input.charAt(i);
if (Character.isUpperCase(c)) {
sb.append(' ').append(Character.toLowerCase(c));
} else {
sb.append(c);
}
}
outputString = sb.toString();
}
System.out.println(outputString);
You're approaching this the wrong way. Just add each char back to a new string but with spaces included at the right spots. Don't worry about modifying your char array at all. Here is a slight modification of your code:
public static void main(String[] args)
{
Scanner s = new Scanner(System.in);
System.out.println("Please enter a sentence");
String sentenceString = s.nextLine();
char[] sentenceArray = sentenceString.toCharArray();
//new string to hold the output
//starts with only the first char of the old string
String spacedString = sentenceArray[0] + "";
for(int i = 1; i < sentenceArray.length; i++)
{
if(Character.isUpperCase(sentenceArray[i]))
{
//if we find an upper case char, add a space and the lower case of that char
spacedString = spacedString + " " + Character.toLowerCase(sentenceArray[i]);
}else {
//otherwise just add the char itself
spacedString = spacedString + sentenceArray[i];
}
}
System.out.println(spacedString);
s.close();
}
If you want to optimize performance, you can use a StringBuilder object. However, for spacing out a single sentence, performance isn't going to make any real difference at all. If performance does matter to you, read more on StringBuilder here: https://docs.oracle.com/javase/7/docs/api/java/lang/StringBuilder.html
We basically want to tokenize the input string on uppercase letters. This can be done using the regular expression [A-Z][^A-Z]* (i.e., one uppercase, followed by zero or more "not" uppercase). The String class has a built-in split() method that takes a regular expression. Unfortunately, you also want to keep the delimiter (which is the uppercase letter), so that slightly complicates matters, but it can still be done using Pattern and Matcher to put the matched delimiter back into the string:
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.List;
import java.util.ArrayList;
public class Foo {
public static void main(String[] args) {
String text = "ThisIsATest1234ABC";
String regex = "\\p{javaUpperCase}[^\\p{javaUpperCase}]*";
Matcher matcher = Pattern.compile(regex).matcher(text);
StringBuffer buf = new StringBuffer();
List<String> result = new ArrayList<String>();
while(matcher.find()){
matcher.appendReplacement(buf, matcher.group());
result.add(buf.toString());
buf.setLength(0);
}
matcher.appendTail(buf);
result.add(buf.toString());
String resultString = "";
for(String s: result) { resultString += s + " "; }
System.out.println("Final: \"" + resultString.trim() + "\"");
}
}
Output:
Final: "This Is A Test1234 A B C"
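As an alternative to the Matcher loop (a different technique, not the one used above), split() can keep the uppercase letters if the regex is a zero-width lookahead. The sketch below assumes Java 8 or later, where a zero-width match at the start of the input does not produce an empty leading element; the class name CamelCaseSpacer is made up for illustration.

```java
public class CamelCaseSpacer {

    // Splits before every uppercase letter, then lowercases the first letter
    // of every piece except the first and joins the pieces with spaces.
    static String spaceOut(String input) {
        if (input.isEmpty()) {
            return "";
        }
        String[] parts = input.split("(?=\\p{Lu})");
        StringBuilder sb = new StringBuilder(parts[0]);
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i]; // never empty: it starts at an uppercase letter
            sb.append(' ')
              .append(Character.toLowerCase(part.charAt(0)))
              .append(part, 1, part.length());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(spaceOut("HappyDaysToCome"));
    }
}
```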
I have implemented some code to find the anagrams in the file sample.txt and print them to the console. The text file contains one string (word) per line.
Is this the right approach if I want to find the anagrams in a text file with a million or even 20 billion words? If not, which technology should I use in this case?
I appreciate any help.
Sample
abac
aabc
hddgfs
fjhfhr
abca
rtup
iptu
xyz
oifj
zyx
toeiut
yxz
jrgtoi
oupt
Output
abac aabc abca
xyz zyx yxz
Code
package org.reader;
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class Test {
// To store the anagram words
static List<String> match = new ArrayList<String>();
// Flag to check whether the checkWorld1InMatch() was invoked.
static boolean flagCheckWord1InMatch;
public static void main(String[] args) {
String fileName = "G:\\test\\sample2.txt";
StringBuilder sb = new StringBuilder();
// In case of matching, this flag is used to append the first word to
// the StringBuilder once.
boolean flag = true;
BufferedReader br = null;
try {
// convert the data in the sample.txt file to list
List<String> list = Files.readAllLines(Paths.get(fileName));
for (int i = 0; i < list.size(); i++) {
flagCheckWord1InMatch = true;
String word1 = list.get(i);
for (int j = i + 1; j < list.size(); j++) {
String word2 = list.get(j);
boolean isExist = false;
if (match != null && !match.isEmpty() && flagCheckWord1InMatch) {
isExist = checkWord1InMatch(word1);
}
if (isExist) {
// A word with the same characters was checked before
// and there is no need to check it again. Therefore, we
// jump to the next word in the list.
// flagCheckWord1InMatch = true;
break;
} else {
boolean result = isAnagram(word1, word2);
if (result) {
if (flag) {
sb.append(word1 + " ");
flag = false;
}
sb.append(word2 + " ");
}
if (j == list.size() - 1 && sb != null && !sb.toString().isEmpty()) {
match.add(sb.toString().trim());
sb.setLength(0);
flag = true;
}
}
}
}
} catch (
IOException e) {
e.printStackTrace();
} finally {
try {
if (br != null) {
br.close();
}
} catch (IOException ex) {
ex.printStackTrace();
}
}
for (String item : match) {
System.out.println(item);
}
// System.out.println("Sihwail");
}
private static boolean checkWord1InMatch(String word1) {
flagCheckWord1InMatch = false;
boolean isAvailable = false;
for (String item : match) {
String[] content = item.split(" ");
for (String word : content) {
if (word1.equals(word)) {
isAvailable = true;
break;
}
}
}
return isAvailable;
}
public static boolean isAnagram(String firstWord, String secondWord) {
char[] word1 = firstWord.toCharArray();
char[] word2 = secondWord.toCharArray();
Arrays.sort(word1);
Arrays.sort(word2);
return Arrays.equals(word1, word2);
}
}
For 20 billion words you will not be able to hold all of them in RAM, so you need an approach that processes them in chunks.
Consider 20,000,000,000 words. Java needs quite a lot of memory to store strings: you can count on 2 bytes per character and at least 38 bytes of overhead per string.
This means 20,000,000,000 words of one character each (about 40 bytes per word) would need 800,000,000,000 bytes, or 800 GB, which is more than any computer I know has.
Your file will contain much less than 20,000,000,000 different words, so you might avoid the memory problem if you store every word only once (e.g. in a Set).
First, for a smaller number of words:
Use a more suitable data structure, and do not read all lines into memory at once; read the file line by line.
Map<String, Set<String>> mapSortedToWords = new HashMap<>();
Path path = Paths.get(fileName);
try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
for (;;) {
String word = in.readLine();
if (word == null) {
break;
}
String key = sorted(word);
Set<String> words = mapSortedToWords.get(key);
if (words == null) {
words = new TreeSet<String>();
mapSortedToWords.put(key, words);
}
words.add(word);
}
}
for (Set<String> anagrams : mapSortedToWords.values()) {
if (anagrams.size() > 1) {
... anagrams
}
}
static String sorted(String word) {
char[] letters = word.toCharArray();
Arrays.sort(letters);
return new String(letters);
}
This stores a set of words in the map for each sorted key, comparable to the line abac aabc abca in your output.
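A runnable sketch of the map-based grouping, for illustration. The class name AnagramGrouper and the in-memory sample are assumptions; a LinkedHashMap is used here only so that groups come out in order of first appearance, and a HashMap would work just as well.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class AnagramGrouper {

    // The sorted letters of a word are identical for all of its anagrams.
    static String sortedLetters(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Reads one word per line and returns every group with at least two members.
    static List<Set<String>> findAnagramGroups(BufferedReader in) {
        Map<String, Set<String>> bySortedLetters = new LinkedHashMap<>();
        try {
            String word;
            while ((word = in.readLine()) != null) {
                if (word.isEmpty()) {
                    continue; // ignore blank lines
                }
                bySortedLetters
                        .computeIfAbsent(sortedLetters(word), key -> new TreeSet<>())
                        .add(word);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<Set<String>> groups = new ArrayList<>();
        for (Set<String> words : bySortedLetters.values()) {
            if (words.size() > 1) {
                groups.add(words);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        String sample = "abac\naabc\nhddgfs\nfjhfhr\nabca\nrtup\niptu\nxyz\noifj\nzyx\ntoeiut\nyxz\njrgtoi\noupt\n";
        for (Set<String> group : findAnagramGroups(new BufferedReader(new StringReader(sample)))) {
            System.out.println(String.join(" ", group));
        }
    }
}
```

Each word is sorted once and looked up once, so the work grows roughly linearly with the number of words instead of comparing every pair.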
For a large number a database where you store (sortedLetters, word) would be better. An embedded database like Derby or H2 poses no installation problems.
For the kind of file size that you specify (20 billion words), there are two obvious problems with your code:
List<String> list = Files.readAllLines(Paths.get(fileName));
and
for (int i = 0; i < list.size(); i++)
These two lines in your program raise two questions:
Do you have enough memory to read the full file in one go?
Is it OK to iterate 20 billion times?
For most systems, the answer to both questions would be no.
So your target is to cut down the memory footprint and reduce the number of iterations.
To do that, you need to read your file chunk by chunk and use some kind of search data structure (like a trie) to store your words.
You will find numerous questions on SO for both of the above topics, such as:
Fastest way to incrementally read a large file
Finding anagrams for a given word
The anagram algorithm discussed there requires you to first create a dictionary of your words.
Anyway, I believe there is no ready-made answer for you. Take a file with one billion words (producing one is a difficult task in itself) and see what works and what doesn't, but your current code will obviously not work.
Hope it helps !!
Use a stream to read the file. That way you are only storing one word at a time.
FileReader file = new FileReader("file.txt"); // character stream over the file
String word = "";
int c;
while ((c = file.read()) != -1) // read() returns -1 at the end of the stream
{
if (c != '\n')
{
word += (char) c;
}
else {
process(word); // do whatever you want
word = "";
}
}
if (!word.isEmpty()) process(word); // last word when the file has no trailing newline
file.close();
Update
You can use a map for finding the anagrams like below. For each word you have you may sort its chars and obtain a sorted String. So, this would be the key of your anagrams map. And values of this key will be the other anagram words.
public void findAnagrams(String[] yourWords) {
Map<String, List<String>> anagrams = new HashMap<String, List<String>>();
for (String word : yourWords) {
String sortedWord = sortedString(word);
List<String> values = anagrams.get(sortedWord);
if (values == null)
values = new LinkedList<>();
values.add(word);
anagrams.put(sortedWord, values);
}
System.out.println(anagrams);
}
private static String sortedString(String originalWord) {
char[] chars = originalWord.toCharArray();
Arrays.sort(chars);
String sorted = new String(chars);
return sorted;
}
I am writing a Java program. I need help with the input of the program, that is a sequence of lines containing two tokens separated by one or more spaces.
import java.util.Scanner;
class ArrayCustomer {
public static void main(String[] args) {
Customer[] array = new Customer[5];
Scanner aScanner = new Scanner(System.in);
int index = readInput(aScanner, array);
}
}
It is better to use value.trim().length()
The trim() method will remove extra spaces if any.
Also, a String is being assigned to a Customer; you will need to create an object of type Customer from the String before assigning it.
Try this code... You can put the file you want to read from where "stuff.txt" currently is. This code uses the split() method from the String class to tokenize each line of text until the end of the file. In the code, split() splits each line on a single space. The method takes a regex (here just a space) that determines how to tokenize; use "\\s+" instead if the tokens can be separated by more than one space.
import java.io.*;
import java.util.ArrayList;
public class ReadFile {
static ArrayList<String> AL = new ArrayList<String>();
public static void main(String[] args) {
try {
BufferedReader br = new BufferedReader(new FileReader("stuff.txt"));
String datLine;
while((datLine = br.readLine()) != null) {
AL.add(datLine); // add line of text to ArrayList
System.out.println(datLine); //print line
}
System.out.println("tokenizing...");
//loop through String array
for(String x: AL) {
//split each line into 2 segments based on the space between them
String[] tokens = x.split(" ");
//loop through the tokens array
for(int j=0; j<tokens.length; j++) {
//only print if j is a multiple of two and j+1 is less than the length of the tokens array, to prevent an ArrayIndexOutOfBoundsException
if ( j % 2 ==0 && (j+1) < tokens.length) {
System.out.println(tokens[j] + " " + tokens[j+1]);
}
}
}
} catch(IOException ioe) {
System.out.println("this was thrown: " + ioe);
}
}
}
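For input where each line holds exactly two tokens separated by one or more spaces, a sketch along these lines may be closer to what the question asks for. The class and method names are assumptions, and the pairs are returned as String arrays because the Customer class is not shown; the Scanner in main reads from a String only so that the example runs without user input, and new Scanner(System.in) would be used the same way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class TwoTokenParser {

    // Returns the two tokens of a line, or null if the line does not hold exactly two.
    static String[] parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        return tokens.length == 2 ? tokens : null;
    }

    // Reads all lines and keeps only the well-formed pairs.
    static List<String[]> readPairs(Scanner in) {
        List<String[]> pairs = new ArrayList<>();
        while (in.hasNextLine()) {
            String[] pair = parseLine(in.nextLine());
            if (pair != null) {
                pairs.add(pair);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Scanner in = new Scanner("Smith    42\n\nJones 17\nmalformed\n");
        for (String[] pair : readPairs(in)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```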
Hi, I wrote Java code to find the longest word made of other words. My logic is to read the list of words from the text file and add each word to an array (in the file the words are sorted and there is only one word per line). After that, we check whether each element in the array contains other elements as substrings. If so, we count the number of substrings. The element with the maximum number of substrings will be the result.
The code runs when I give it a text file with only two words. But when there are more than two words I get the following error:
java.lang.StringIndexOutOfBoundsException: String index out of range: 3
I feel the error is occurring in this line: if(s.charAt(i1)==w.charAt(j1))
import java.util.*;
import java.io.*;
import java.lang.reflect.Array;
public class Parser
{
public static void main (String[] args) throws IOException
{
String [] addyArray = null;
FileReader inFile = new FileReader ("sample.txt");
BufferedReader in = new BufferedReader (inFile);
String line = "";
int a = 0;
int size=0;
String smallestelement = "";
while(in.ready())
{
line=in.readLine();
while (line != null && line != "\n")
{
size++;
System.out.println(size);
line = in.readLine();
if (line == null) line = "\n";
}
}
addyArray = new String[size];
FileReader inFile2 = new FileReader ("sample.txt");
BufferedReader in2 = new BufferedReader (inFile2);
String line2 = "";
while(in2.ready())
{
line2 = in2.readLine();
while (line2 != null && line2 != "\n")
{
addyArray[a] = line2;
System.out.println("Array"+addyArray[a]);
line2 = in.readLine();
a++;
if (line2 == null) line2 = "\n";
}
}
int numberofsubstrings=0;
int[] substringarray= new int[size];
int count=0,no=0;
for(int i=0;i<size;i++)
{
System.out.println("sentence "+addyArray[i]);
for(int j=0;j<size;j++)
{
System.out.println("word "+addyArray[j]);
String w,s;
s=addyArray[i].trim();
w=addyArray[j].trim();
try{
for(int i1=0;i1<s.length();i1++)
{
if(s.equals(w)&& s.indexOf(addyArray[j-1].trim()) == -1)
{}
else
{
if(s.charAt(i1)==w.charAt(0))
{
for(int j1=0;j1<w.length();j1++,i1++)
{
if(s.charAt(i1)==w.charAt(j1)) //I feel the error is occuring here
{ count=count+1;}
if(count==w.length())
{no=no+1;count=0;};
}
}
}
}
System.out.println(no);
}
catch(Exception e){System.out.println(e);}
substringarray[i]=no;
no=0;
}
}
for(int i=0;i<size;i++)
{
System.out.println("Substring array"+substringarray[i]);
}
Arrays.sort(substringarray);
int max=substringarray[0];
System.out.println("Final result is"+addyArray[max]+size);
}
}
This is the problem:
for(int j1=0;j1<w.length();j1++,i1++)
On each iteration through the loop, you're incrementing i1 as well as j1. i1 could already be at the end of s, so after you've incremented it, s.charAt(i1) is going to be invalid.
Two asides:
You should look at String.regionMatches
Using consistent indentation and sensible whitespace can make your code much easier to read.
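To illustrate the String.regionMatches suggestion, here is a sketch that counts how often one word occurs inside another string without any manual charAt indexing. The class and method names are assumptions, and counting overlapping occurrences is a choice made for this example, not something the question specifies.

```java
public class OccurrenceCounter {

    // Counts (possibly overlapping) occurrences of word inside sentence.
    static int countOccurrences(String sentence, String word) {
        if (word.isEmpty()) {
            return 0;
        }
        int count = 0;
        // Stop at the last position where word still fits into sentence.
        for (int start = 0; start + word.length() <= sentence.length(); start++) {
            if (sentence.regionMatches(start, word, 0, word.length())) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("catdogcat", "cat"));
        System.out.println(countOccurrences("cat", "catalog"));
    }
}
```

regionMatches returns false instead of throwing when the requested region runs past the end of either string, so the loop bound here only avoids pointless comparisons.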
A few tips:
First, always include the full stack trace when you are asking for debugging help. It should point to the exact line number the issue is happening on.
Second, your issue is likely in your innermost loop, for(int j1=0;j1<w.length();j1++,i1++): you are incrementing i1 in addition to j1, which will eventually push i1 beyond the length of String s.
Finally, you should consider using the String.contains() method for Strings or even a Regular Expression.
When you use string.charAt(x), you must check that it is not beyond string length.
Documentation shows that you will get "IndexOutOfBoundsException if the index argument is negative or not less than the length of this string". And in your particular case, you are only validating in the loop that you are under w length, so it will fail.
As SO already said, the loop runs only taking into account w length, so in case you have a shorter s, it will raise that exception. Check the condition so it goes up to the shorter string or rethink the process.