Search for multiple strings from a text file in Java

I'm trying to search one text file for multiple words given by the user (I store them in an array). If a word occurs once in the file it should be displayed, and if it doesn't it should not.
If the user enters the same word more than once, it should be searched for only once.
The problem: searching for a single word works, but with multiple words the program keeps reporting that the word isn't present even when it is there.
I would like to know where I should put the for loop and what else needs to change.
package search;
import java.io.*;
import java.util.Scanner;
public class Read {
public static void main(String[] args) throws IOException
{
Scanner sc = new Scanner(System.in);
String[] words=null;
FileReader fr = new FileReader("java.txt");
BufferedReader br = new BufferedReader(fr);
String s;
System.out.println("Enter the number of words:");
Integer n = sc.nextInt();
String wordsArray[] = new String[n];
System.out.println("Enter words:");
for(int i=0; i<n; i++)
{
wordsArray[i]=sc.next();
}
for (int i = 0; i <n; i++) {
int count=0; //Intialize the word to zero
while((s=br.readLine())!=null) //Reading Content from the file
{
{
words=s.split(" "); //Split the word using space
for (String word : words)
{
if (word.equals(wordsArray[i])) //Search for the given word
{
count++; //If Present increase the count by one
}
}
if(count == 1)
{
System.out.println(wordsArray[i] + " is unique in file ");
}
else if (count == 0)
{
System.out.println("The given word is not present in the file");
}
else
{
System.out.println("The given word is present in the file more than 1 time");
}
}
}
}
fr.close();
}
}

The code you wrote is error-prone. Remember that a while loop should always have a proper exit condition.
Try the following code:
import java.util.HashMap;
import java.util.Map;
public class Read {
public static void main(String[] args)
{
// Declaring the String
String paragraph = "These words can be searched";
// Declaring a HashMap of <String, Integer>
Map<String, Integer> hashMap = new HashMap<>();
// Splitting the words of string
// and storing them in the array.
String[] words = new String[]{"These", "can", "searched"};
for (String word : words) {
// Asking whether the HashMap contains the
// key or not. Will return null if not.
Integer integer = hashMap.get(word);
if (integer == null)
// Storing the word as key and its
// occurrence as value in the HashMap.
hashMap.put(word, 1);
else {
// Incrementing the value if the word
// is already present in the HashMap.
hashMap.put(word, integer + 1);
}
}
System.out.println(hashMap);
}
}
I've hard-coded the values here; you can read the paragraph from the file and the words from the console instead.
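For the original question, a likely cause of the symptom is that the BufferedReader is already at end-of-file after the first search word, and that the result is printed inside the read loop. A minimal sketch of one way around both, assuming it is acceptable to read the file once and count every token up front: the class name WordSearch and the methods countWords and describe are hypothetical names chosen for this illustration, and the in-memory lines stand in for the real file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Stream;

class WordSearch {
    // Counts how often each whitespace-separated token occurs in the given lines.
    static Map<String, Integer> countWords(Stream<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        lines.forEach(line -> {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        });
        return counts;
    }

    // Builds the message for one search word from the counts gathered above.
    static String describe(String word, Map<String, Integer> counts) {
        int count = counts.getOrDefault(word, 0);
        if (count == 0) {
            return word + " is not present in the file";
        }
        if (count == 1) {
            return word + " is unique in the file";
        }
        return word + " is present in the file more than once";
    }

    public static void main(String[] args) {
        // Stand-in for the file; with a real file, Files.lines(Paths.get("java.txt")) yields the same type.
        Map<String, Integer> counts = countWords(Stream.of("java is fun", "java is fast"));
        // A LinkedHashSet drops repeated search words while keeping the order they were entered in.
        Set<String> searchWords = new LinkedHashSet<>(Arrays.asList("java", "fun", "ruby", "java"));
        for (String word : searchWords) {
            System.out.println(describe(word, counts));
        }
    }
}
```

Because the counts are gathered in a single pass, the number of search words no longer affects how often the file is read.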

The 'proper' class to use for extracting words from text is java.text.BreakIterator.
You can try the following (it reads line by line, in case the file is large):
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;
import java.nio.file.Files;
import java.nio.file.Paths;
public class WordFinder {
public static void main(String[] args) {
try {
if (args.length < 2) {
WordFinder.usage();
System.exit(1);
}
ArrayList<String> argv = new ArrayList<>(Arrays.asList(args));
String path = argv.remove(0);
List<String> found = WordFinder.findWords(Files.lines(Paths.get(path)), argv);
System.out.printf("Found the following word(s) in file at %s%n", path);
System.out.println(found);
} catch (Throwable t) {
t.printStackTrace();
}
}
public static List<String> findWords(Stream<String> lines, ArrayList<String> searchWords) {
List<String> result = new ArrayList<>();
BreakIterator boundary = BreakIterator.getWordInstance();
lines.forEach(line -> {
boundary.setText(line);
int start = boundary.first();
for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
String candidate = line.substring(start, end);
if (searchWords.contains(candidate)) {
result.add(candidate);
searchWords.remove(candidate);
}
}
});
return result;
}
private static void usage() {
System.err.println("Usage: java WordFinder <Path to input file> <Word 1> [<Word 2> <Word 3>...]");
}
}
Sample run:
goose@t410:/tmp$ echo 'the quick brown fox jumps over the lazy dog' >quick.txt
goose@t410:/tmp$ java WordFinder quick.txt dog goose the did quick over
Found the following word(s) in file at quick.txt
[the, quick, over, dog]
goose@t410:/tmp$

Related

parse a document with million words

I have implemented some code to find the anagrams in the text file sample.txt and print them to the console. The file contains one word (String) per line.
Is this the right approach if I want to find the anagrams in a text file with a million or even 20 billion words? If not, which technology should I use in this case?
I appreciate any help.
Sample
abac
aabc
hddgfs
fjhfhr
abca
rtup
iptu
xyz
oifj
zyx
toeiut
yxz
jrgtoi
oupt
abac aabc abca
xyz zyx yxz
Code
package org.reader;
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class Test {
// To store the anagram words
static List<String> match = new ArrayList<String>();
// Flag to check whether the checkWorld1InMatch() was invoked.
static boolean flagCheckWord1InMatch;
public static void main(String[] args) {
String fileName = "G:\\test\\sample2.txt";
StringBuilder sb = new StringBuilder();
// In case of matching, this flag is used to append the first word to
// the StringBuilder once.
boolean flag = true;
BufferedReader br = null;
try {
// convert the data in the sample.txt file to list
List<String> list = Files.readAllLines(Paths.get(fileName));
for (int i = 0; i < list.size(); i++) {
flagCheckWord1InMatch = true;
String word1 = list.get(i);
for (int j = i + 1; j < list.size(); j++) {
String word2 = list.get(j);
boolean isExist = false;
if (match != null && !match.isEmpty() && flagCheckWord1InMatch) {
isExist = checkWord1InMatch(word1);
}
if (isExist) {
// A word with the same characters was checked before
// and there is no need to check it again. Therefore, we
// jump to the next word in the list.
// flagCheckWord1InMatch = true;
break;
} else {
boolean result = isAnagram(word1, word2);
if (result) {
if (flag) {
sb.append(word1 + " ");
flag = false;
}
sb.append(word2 + " ");
}
if (j == list.size() - 1 && sb != null && !sb.toString().isEmpty()) {
match.add(sb.toString().trim());
sb.setLength(0);
flag = true;
}
}
}
}
} catch (
IOException e) {
e.printStackTrace();
} finally {
try {
if (br != null) {
br.close();
}
} catch (IOException ex) {
ex.printStackTrace();
}
}
for (String item : match) {
System.out.println(item);
}
// System.out.println("Sihwail");
}
private static boolean checkWord1InMatch(String word1) {
flagCheckWord1InMatch = false;
boolean isAvailable = false;
for (String item : match) {
String[] content = item.split(" ");
for (String word : content) {
if (word1.equals(word)) {
isAvailable = true;
break;
}
}
}
return isAvailable;
}
public static boolean isAnagram(String firstWord, String secondWord) {
char[] word1 = firstWord.toCharArray();
char[] word2 = secondWord.toCharArray();
Arrays.sort(word1);
Arrays.sort(word2);
return Arrays.equals(word1, word2);
}
}
For 20 billion words you will not be able to hold all of them in RAM, so you need an approach that processes them in chunks.
20,000,000,000 words. Java needs quite a lot of memory to store strings so you can count 2 bytes per character and at least 38 bytes overhead.
This means 20,000,000,000 words of one character would need 800,000,000,000 bytes or 800 GB, which is more than any computer I know has.
Your file will contain much less than 20,000,000,000 different words, so you might avoid the memory problem if you store every word only once (e.g. in a Set).
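The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in this answer (2 bytes per character, about 38 bytes of overhead per String); the real overhead depends on the JVM version and settings, and the class name MemoryEstimate is hypothetical.

```java
class MemoryEstimate {
    // Rough cost of storing wordCount Strings of charsPerWord characters each,
    // using the assumed figures from the text: 38 bytes overhead + 2 bytes per char.
    static long bytesFor(long wordCount, int charsPerWord) {
        long bytesPerWord = 38L + 2L * charsPerWord;
        return wordCount * bytesPerWord;
    }

    public static void main(String[] args) {
        long bytes = bytesFor(20_000_000_000L, 1);
        System.out.println(bytes / 1_000_000_000L + " GB"); // prints "800 GB"
    }
}
```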
First, for a smaller number of words.
It is better to use a more powerful data structure, and to read the file line by line rather than loading all lines into memory.
Map<String, Set<String>> mapSortedToWords = new HashMap<>();
Path path = Paths.get(fileName);
try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
for (;;) {
String word = in.readLine();
if (word == null) {
break;
}
String key = sorted(word);
Set<String> words = mapSortedToWords.get(key);
if (words == null) {
words = new TreeSet<String>();
mapSortedToWords.put(key, words);
}
words.add(word);
}
}
for (Set<String> anagrams : mapSortedToWords.values()) {
if (anagrams.size() > 1) {
System.out.println(String.join(" ", anagrams)); // e.g. aabc abac abca
}
}
static String sorted(String word) {
char[] letters = word.toCharArray();
Arrays.sort(letters);
return new String(letters);
}
This stores a set of words in the map under each sorted key, comparable to your output line abac aabc abca.
For a large number a database where you store (sortedLetters, word) would be better. An embedded database like Derby or H2 poses no installation problems.
For the kind of file size that you specify (20 billion words), there are obviously two main problems with your code:
List<String> list = Files.readAllLines(Paths.get(fileName));
and
for (int i = 0; i < list.size(); i++)
These two lines in your program raise two questions:
Do you have enough memory to read the full file in one go?
Is it OK to iterate 20 billion times?
For most systems, the answer to both questions would be no.
So your goal is to cut down the memory footprint and reduce the number of iterations.
So you need to read your file chunk by chunk and use some kind of search data structure (like a trie) to store your words.
You will find numerous questions on SO on both of these topics, such as:
Fastest way to incrementally read a large file
Finding anagrams for a given word
The algorithm above says that you first have to create a dictionary of your words.
Anyway, I believe there is no ready-made answer for you. Take a file with one billion words (creating that is a difficult task in itself) and see what works and what doesn't, but your current code will obviously not work.
Hope it helps!
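To make the trie suggestion concrete, here is a minimal sketch, assuming each word is inserted along the path spelled by its sorted letters so that anagrams end at the same node. The class AnagramTrie and its methods are hypothetical names for this illustration. Note that it still keeps every word in memory, so on its own it does not solve the 20-billion-word case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AnagramTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        final List<String> words = new ArrayList<>(); // words whose sorted letters end at this node
    }

    private final Node root = new Node();

    // Inserts a word along the path given by its letters in sorted order.
    void add(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        Node node = root;
        for (char c : letters) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.words.add(word);
    }

    // Collects every group of two or more words that ended at the same node.
    List<List<String>> anagramGroups() {
        List<List<String>> groups = new ArrayList<>();
        collect(root, groups);
        return groups;
    }

    private static void collect(Node node, List<List<String>> groups) {
        if (node.words.size() > 1) {
            groups.add(node.words);
        }
        for (Node child : node.children.values()) {
            collect(child, groups);
        }
    }

    public static void main(String[] args) {
        AnagramTrie trie = new AnagramTrie();
        for (String w : new String[] {"abac", "aabc", "hddgfs", "abca", "xyz", "zyx", "yxz"}) {
            trie.add(w);
        }
        System.out.println(trie.anagramGroups()); // [[abac, aabc, abca], [xyz, zyx, yxz]]
    }
}
```

Words with a common sorted prefix share nodes, which is where a trie can save space compared with storing each sorted key separately.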
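Use a stream to read the file. That way you are only storing one word at a time.
// needs java.io.BufferedReader, java.io.FileReader and java.io.Reader
StringBuilder word = new StringBuilder();
try (Reader file = new BufferedReader(new FileReader("file.txt"))) { // file stream
    int c;
    while ((c = file.read()) != -1) { // read() returns -1 at the end of the stream
        if (c != '\n') {
            word.append((char) c); // still inside the current word
        } else {
            process(word.toString()); // do whatever you want
            word.setLength(0);
        }
    }
    if (word.length() > 0) {
        process(word.toString()); // last word, if the file has no trailing newline
    }
}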
Update
You can use a map to find the anagrams, as shown below. For each word, sort its characters to obtain a sorted String and use that as the key in the anagrams map. The values for that key are all the words that are anagrams of one another.
public void findAnagrams(String[] yourWords) {
Map<String, List<String>> anagrams = new HashMap<String, List<String>>();
for (String word : yourWords) {
String sortedWord = sortedString(word);
List<String> values = anagrams.get(sortedWord);
if (values == null)
values = new LinkedList<>();
values.add(word);
anagrams.put(sortedWord, values);
}
System.out.println(anagrams);
}
private static String sortedString(String originalWord) {
char[] chars = originalWord.toCharArray();
Arrays.sort(chars);
String sorted = new String(chars);
return sorted;
}

What is wrong with my ArrayList when trying to detect persons who have same first name and last name. I am using Regex Pattern

Hi,
I have the following code:
package regexsimple5;
import java.util.Scanner;
import java.util.ArrayList;
import java.util.regex.*;
import java.io.*;
import java.util.regex.Pattern;
public class RegexSimple5 {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
ArrayList <String> foundName = new ArrayList<String>();
ArrayList <String> noDuplicatesName = new ArrayList<String>();
try{
Scanner myfis = new Scanner(new File("D:\\myfis5.txt"));
while(myfis.hasNext())
{
String delim = " ";
String line = myfis.nextLine();
String [] words = line.split(delim);
for( String s: words)
{
if(!s.isEmpty()&&s!=null)
{
Pattern search = Pattern.compile("[A-Z][a-z]*");
Matcher match = search.matcher(s);
if(match.find())
{
foundName.add(s);
}
}
}
}
if(!foundName.isEmpty())
{
for(String s: foundName)
{
System.out.println(s);
int n = foundName.size();
System.out.println(foundName.get(0));
}
int n = foundName.size();
for(int i=0; i<n; i++)
{
if(foundName.get(i).equals(foundName.get(i+1)))
{
noDuplicatesName.add(foundName.get(i));
}
}
System.out.println(n);
}
if(!noDuplicatesName.isEmpty())
{
for(String s: noDuplicatesName)
{
System.out.print("***********");
System.out.print(s);
}
}
}
catch (Exception ex)
{
System.out.println(ex);
}
}
}
With this code I am trying to display the persons who have the same first and last name.
But I get the error:
java.lang.IndexOutOfBoundsException:
and my ArrayList of duplicate first names and last names is not displayed.
Sincerely,
The problem line is most likely this:
if(foundName.get(i).equals(foundName.get(i+1)))
At the end of the list, accessing the (i+1)th element causes an IndexOutOfBoundsException.
It is difficult to understand the whole code, but you can probably fix it by running your loop only up to n-1, i.e.:
for(int i=0; i<n-1; i++)
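As a small self-contained illustration of that loop bound, the sketch below compares each element with its successor and stops one short of the end. It assumes, as the question's code does, that a matching first and last name appear as adjacent entries in the list; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AdjacentDuplicates {
    // Returns every element that is equal to its immediate successor.
    static List<String> adjacentDuplicates(List<String> names) {
        List<String> result = new ArrayList<>();
        // Stop at size() - 1 so that get(i + 1) never runs past the end of the list.
        for (int i = 0; i < names.size() - 1; i++) {
            if (names.get(i).equals(names.get(i + 1))) {
                result.add(names.get(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> found = Arrays.asList("John", "John", "Mary", "Smith", "Paul", "Paul");
        System.out.println(adjacentDuplicates(found)); // [John, Paul]
    }
}
```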

Reading a sequence until the empty line

I am writing a Java program. I need help with the input of the program, that is a sequence of lines containing two tokens separated by one or more spaces.
import java.util.Scanner;
class ArrayCustomer {
public static void main(String[] args) {
Customer[] array = new Customer[5];
Scanner aScanner = new Scanner(System.in);
int index = readInput(aScanner, array);
}
}
It is better to use value.trim().length().
The trim() method removes leading and trailing spaces, if any.
Also, a String is being assigned to a Customer; you will need to create an object of type Customer from the String before assigning it.
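Since the question's readInput method is not shown, here is only a sketch of how the trim-based check could drive the input loop: read lines until the first blank one and split each line into two tokens on one or more whitespace characters. The class ReadUntilBlank and the method readPairs are hypothetical, and the pairs are kept as String arrays rather than Customer objects because the Customer class is not part of the posted code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ReadUntilBlank {
    // Reads lines until the first blank line (or end of input) and keeps the two-token lines.
    static List<String[]> readPairs(Scanner in) {
        List<String[]> pairs = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (line.isEmpty()) { // same check as line.length() == 0 after trim()
                break; // an empty or whitespace-only line ends the sequence
            }
            String[] tokens = line.split("\\s+"); // one or more whitespace characters between tokens
            if (tokens.length == 2) {
                pairs.add(tokens);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // A Scanner over a String stands in for new Scanner(System.in).
        Scanner in = new Scanner("Smith   42\nJones 7\n\nignored 1\n");
        for (String[] pair : readPairs(in)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```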
Try this code. You can put the file you want to read from where "stuff.txt" currently is. This code uses the split() method of the String class to tokenize each line of text until the end of the file. The split() method takes a regex, such as the single space used in this code, that determines where each line is split.
import java.io.*;
import java.util.ArrayList;
public class ReadFile {
static ArrayList<String> AL = new ArrayList<String>();
public static void main(String[] args) {
try {
BufferedReader br = new BufferedReader(new FileReader("stuff.txt"));
String datLine;
while((datLine = br.readLine()) != null) {
AL.add(datLine); // add line of text to ArrayList
System.out.println(datLine); //print line
}
System.out.println("tokenizing...");
//loop through String array
for(String x: AL) {
//split each line into 2 segments based on the space between them
String[] tokens = x.split(" ");
//loop through the tokens array
for(int j=0; j<tokens.length; j++) {
//only print if j is a multiple of two and j+1 is less than the length of the tokens array, to prevent an ArrayIndexOutOfBoundsException
if ( j % 2 ==0 && (j+1) < tokens.length) {
System.out.println(tokens[j] + " " + tokens[j+1]);
}
}
}
} catch(IOException ioe) {
System.out.println("this was thrown: " + ioe);
}
}
}

Calculate number of words in an ArrayList while some words are on the same line

I'm trying to calculate how many words an ArrayList contains. I know how to do this if every word is on a separate line, but some of the words are on the same line, like:
hello there
blah
cats dogs
So I'm thinking I should go through every entry and somehow find out how many words the current entry contains, something like:
public int numberOfWords(){
for(int i = 0; i < arraylist.size(); i++) {
int words = 0;
words = words + (number of words on current line);
//words should eventually equal to 5
}
return words;
}
Am I thinking right?
You should declare and initialize int words outside of the loop so that it is not reset on every iteration. You can use the for-each syntax to loop through the list, which eliminates the need to get() items out of the list. To handle multiple words on a line, split the String into an array and count the items in the array.
public int numberOfWords(){
int words = 0;
for(String s:arraylist) {
words += s.split(" ").length;
}
return words;
}
Full Test
import java.util.ArrayList;
import java.util.List;
public class StackTest {
public static void main(String[] args) {
List<String> arraylist = new ArrayList<String>();
arraylist.add("hello there");
arraylist.add("blah");
arraylist.add(" cats dogs");
arraylist.add(" ");
arraylist.add(" ");
arraylist.add(" ");
int words = 0;
for(String s:arraylist) {
s = s.trim().replaceAll(" +", " "); //clean up the String
if(!s.isEmpty()){ //do not count empty strings
words += s.split(" ").length;
}
}
System.out.println(words);
}
}
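The clean-up in the full test can also be folded into the split itself. A short sketch, assuming words are separated by any run of whitespace: trimming first and then splitting on the regex \s+ avoids the empty tokens that split(" ") produces for repeated spaces. The class name WordCount is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class WordCount {
    // Counts whitespace-separated words across all entries, skipping blank entries.
    static int numberOfWords(List<String> lines) {
        int words = 0;
        for (String s : lines) {
            String trimmed = s.trim();
            if (!trimmed.isEmpty()) {
                words += trimmed.split("\\s+").length;
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello there", "blah", "  cats   dogs", "   ");
        System.out.println(numberOfWords(lines)); // 5
    }
}
```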
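It should look like this:
public int numberOfWords(){
    int words = 0; // declared once, outside the loop, so the total accumulates
    for(int i = 0; i < arraylist.size(); i++) {
        // count the whitespace-separated words on the current line (assumes no blank entries)
        words = words + arraylist.get(i).trim().split("\\s+").length;
        //words should eventually equal 5
    }
    return words;
}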
I think this could help you.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.StringTokenizer;
public class LineWord {
public static void main(String args[]) {
try {
File f = new File("C:\\Users\\MissingNumber\\Documents\\NetBeansProjects\\Puzzlecode\\src\\com\\test\\test.txt"); // Creating the File passing path to the constructor..!!
BufferedReader br = new BufferedReader(new FileReader(f)); //
String strLine = " ";
String filedata = "";
while ((strLine = br.readLine()) != null) {
filedata += strLine + " ";
}
StringTokenizer stk = new StringTokenizer(filedata);
List <String> token = new ArrayList <String>();
while (stk.hasMoreTokens()) {
token.add(stk.nextToken());
}
//Collections.sort(token);
System.out.println(token.size());
br.close();
} catch (Exception e) {
System.err.println("Error: " + e.getMessage());
}
}
}
So in this case you read the data from a file, tokenize it, store the tokens in a list, and simply count them. If you just want to get input from the console, use a BufferedReader, tokenize the input on spaces, put the tokens in a list, and get its size.
Hope you got what you were looking for.

Keeping track of punctuation, spacing, when editing a file in Java

I'm writing a program that deletes consecutive duplicate words from a text file and then replaces the file with the version without duplicates. I know that my current code does not handle the case where a duplicate word is at the end of one line and at the beginning of the next, since I read each line into an ArrayList, find the duplicates, and remove them. After writing it, though, I wasn't sure this was an 'ok' way to do it, since now I don't know how to write the result back out. I'm not sure how to keep track of the punctuation at the beginning and end of sentences, the correct spacing, and the line breaks in the original text file. Is there a way to handle those things (spacing, punctuation, etc.) with what I have so far, or do I need to redesign? The other thing I thought I could do is return an array of the indices of the words that need to be deleted, but I wasn't sure that's much better. Anyway, here is my code (thanks in advance!):
/** Removes consecutive duplicate words from text files.
It accepts only one argument, that argument being a text file
or a directory. It finds all text files in the directory and
its subdirectories and removes duplicate words from those files
as well. It replaces the original file. */
import java.io.*;
import java.util.*;
public class RemoveDuplicates {
public static void main(String[] args) {
if (args.length != 1) {
System.out.println("Program accepts one command-line argument. Exiting!");
System.exit(1);
}
File f = new File(args[0]);
if (!f.exists()) {
System.out.println("Does not exist!");
}
else if (f.isDirectory()) {
System.out.println("is directory");
}
else if (f.isFile()) {
System.out.println("is file");
String fileName = f.toString();
RemoveDuplicates dup = new RemoveDuplicates(f);
dup.showTextFile();
List<String> noDuplicates = dup.doDeleteDuplicates();
showTextFile(noDuplicates);
//writeOutputFile(fileName, noDuplicates);
}
else {
System.out.println("Shouldn't happen");
}
}
/** Reads in each line of the passed in .txt file into the lineOfWords array. */
public RemoveDuplicates(File fin) {
lineOfWords = new ArrayList<String>();
try {
BufferedReader in = new BufferedReader(new FileReader(fin));
for (String s = null; (s = in.readLine()) != null; ) {
lineOfWords.add(s);
}
}
catch (IOException e) {
e.printStackTrace();
}
}
public void showTextFile() {
for (String s : lineOfWords) {
System.out.println(s);
}
}
public static void showTextFile(List<String> list) {
for (String s : list) {
System.out.print(s);
}
}
public List<String> doDeleteDuplicates() {
List<String> noDup = new ArrayList<String>(); // List to be returned without duplicates
// go through each line and split each word into end string array
for (String s : lineOfWords) {
String endString[] = s.split("[\\s+\\p{Punct}]");
// add each word to the arraylist
for (String word : endString) {
noDup.add(word);
}
}
for (int i = 0; i < noDup.size() - 1; i++) {
if (noDup.get(i).toUpperCase().equals(noDup.get(i + 1).toUpperCase())) {
System.out.println("Removing: " + noDup.get(i+1));
noDup.remove(i + 1);
i--;
}
}
return noDup;
}
public static void writeOutputFile(String fileName, List<String> newData) {
try {
PrintWriter outputFile = new PrintWriter(new BufferedWriter(new FileWriter(fileName)));
for (String str : newData) {
outputFile.print(str + " ");
}
outputFile.close();
}
catch (IOException e) {
e.printStackTrace();
}
}
private List<String> lineOfWords;
}
An example.txt:
Hello hello this is a test test in order
order to see if it deletes duplicates Duplicates words.
How about something like this? In this case, I assume the comparison is case-insensitive.
Pattern p = Pattern.compile("\\b(\\w+) \\1\\b"); // the word boundaries keep "this is" from matching as a repeated "is"
String line = "Hello hello this is a test test in order\norder to see if it deletes duplicates Duplicates words.";
Matcher m = p.matcher(line.toUpperCase());
StringBuilder sb = new StringBuilder(1000);
int idx = 0;
while (m.find()) {
sb.append(line.substring(idx, m.end(1)));
idx = m.end();
}
sb.append(line.substring(idx));
System.out.println(sb.toString());
Here's the output:
Hello this is a test in order
order to see if it deletes duplicates words.
Note that the duplicate "order" split across the line break is left in place, because the pattern only matches a single space between the two words.
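A variant of the regex idea, offered as a sketch rather than as the answerer's method: using the (?i) flag instead of upper-casing, \b word boundaries so that "this is" is not treated as a repeated "is", and \s+ so that a repeat on the next line is also caught. The class and method names are hypothetical. One trade-off to be aware of: when a duplicate spans a line break, the line break is removed together with the repeated word, and duplicates separated by punctuation are not handled.

```java
import java.util.regex.Pattern;

class DuplicateWords {
    // (?i)      case-insensitive matching, including the back-reference
    // \b(\w+)   a whole word, captured as group 1
    // (?:\s+\1\b)+  one or more repeats of that word, separated by any whitespace
    private static final Pattern REPEATED = Pattern.compile("(?i)\\b(\\w+)(?:\\s+\\1\\b)+");

    // Replaces each run of repeated words with its first occurrence.
    static String removeConsecutiveDuplicates(String text) {
        return REPEATED.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "Hello hello this is a test test in order\n"
                + "order to see if it deletes duplicates Duplicates words.";
        System.out.println(removeConsecutiveDuplicates(text));
        // Hello this is a test in order to see if it deletes duplicates words.
    }
}
```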
