parse a document with million words

I have implemented some code that finds the anagrams in the text file sample.txt and prints them to the console. The file contains one word (String) per line.
Is this the right approach if I want to find the anagrams in a text file with a million, or even 20 billion, words? If not, which technology should I use in this case?
I appreciate any help.
Sample
abac
aabc
hddgfs
fjhfhr
abca
rtup
iptu
xyz
oifj
zyx
toeiut
yxz
jrgtoi
oupt
abac aabc abca
xyz zyx yxz
Code
package org.reader;
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class Test {
// To store the anagram words
static List<String> match = new ArrayList<String>();
// Flag to check whether the checkWorld1InMatch() was invoked.
static boolean flagCheckWord1InMatch;
public static void main(String[] args) {
String fileName = "G:\\test\\sample2.txt";
StringBuilder sb = new StringBuilder();
// In case of matching, this flag is used to append the first word to
// the StringBuilder once.
boolean flag = true;
BufferedReader br = null;
try {
// convert the data in the sample.txt file to list
List<String> list = Files.readAllLines(Paths.get(fileName));
for (int i = 0; i < list.size(); i++) {
flagCheckWord1InMatch = true;
String word1 = list.get(i);
for (int j = i + 1; j < list.size(); j++) {
String word2 = list.get(j);
boolean isExist = false;
if (match != null && !match.isEmpty() && flagCheckWord1InMatch) {
isExist = checkWord1InMatch(word1);
}
if (isExist) {
// A word with the same characters was checked before
// and there is no need to check it again. Therefore, we
// jump to the next word in the list.
// flagCheckWord1InMatch = true;
break;
} else {
boolean result = isAnagram(word1, word2);
if (result) {
if (flag) {
sb.append(word1 + " ");
flag = false;
}
sb.append(word2 + " ");
}
if (j == list.size() - 1 && sb != null && !sb.toString().isEmpty()) {
match.add(sb.toString().trim());
sb.setLength(0);
flag = true;
}
}
}
}
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (br != null) {
br.close();
}
} catch (IOException ex) {
ex.printStackTrace();
}
}
for (String item : match) {
System.out.println(item);
}
// System.out.println("Sihwail");
}
private static boolean checkWord1InMatch(String word1) {
flagCheckWord1InMatch = false;
boolean isAvailable = false;
for (String item : match) {
String[] content = item.split(" ");
for (String word : content) {
if (word1.equals(word)) {
isAvailable = true;
break;
}
}
}
return isAvailable;
}
public static boolean isAnagram(String firstWord, String secondWord) {
char[] word1 = firstWord.toCharArray();
char[] word2 = secondWord.toCharArray();
Arrays.sort(word1);
Arrays.sort(word2);
return Arrays.equals(word1, word2);
}
}

For 20 billion words you will not be able to hold all of them in RAM, so you need an approach that processes them in chunks.
Take 20,000,000,000 words. Java needs quite a lot of memory to store strings: you can count on 2 bytes per character plus at least 38 bytes of overhead per String.
This means that 20,000,000,000 one-character words would need 20,000,000,000 × 40 = 800,000,000,000 bytes, or 800 GB, which is more than any computer I know has.
Your file will contain far fewer than 20,000,000,000 different words, so you might avoid the memory problem if you store every word only once (e.g. in a Set).
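The estimate can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in this answer (2 bytes per character, 38 bytes of overhead per String); the class and method names are made up for illustration, and the real overhead depends on the JVM version and its settings.

```java
public class StringMemoryEstimate {
    // Figures taken from the answer; actual values vary by JVM.
    static final long BYTES_PER_CHAR = 2;
    static final long OVERHEAD_PER_STRING = 38;

    /** Rough number of bytes needed to hold wordCount strings of the given length. */
    static long estimateBytes(long wordCount, int charsPerWord) {
        return wordCount * (OVERHEAD_PER_STRING + BYTES_PER_CHAR * charsPerWord);
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(20_000_000_000L, 1);
        System.out.println(bytes / 1_000_000_000L + " GB"); // prints "800 GB"
    }
}
```

Longer words only make the number larger, which is why storing each distinct word once, or moving the data out of the heap, matters at this scale.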

First, for a smaller number of words.
It is better to use a more powerful data structure, and to read the file line by line instead of loading all lines into memory.
Map<String, Set<String>> mapSortedToWords = new HashMap<>();
Path path = Paths.get(fileName);
// try-with-resources closes the reader; only one line is held at a time
try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    for (;;) {
        String word = in.readLine();
        if (word == null) {
            break;
        }
        // all anagrams of a word share the same sorted-letter key
        String key = sorted(word);
        Set<String> words = mapSortedToWords.get(key);
        if (words == null) {
            words = new TreeSet<String>();
            mapSortedToWords.put(key, words);
        }
        words.add(word);
    }
}
for (Set<String> anagrams : mapSortedToWords.values()) {
    if (anagrams.size() > 1) {
        System.out.println(anagrams); // or process the group in some other way
    }
}

static String sorted(String word) {
    char[] letters = word.toCharArray();
    Arrays.sort(letters);
    return new String(letters);
}
The map then holds, for every sorted-letter key, the set of words that share it, which corresponds to an output line such as abac aabc abca.
For a large number of words, a database in which you store (sortedLetters, word) rows would be better. An embedded database such as Derby or H2 poses no installation problems.
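For the in-memory case, the same grouping can also be written with the Java 8 stream API. The following is a minimal sketch, not a drop-in replacement: the class name AnagramGroups and its methods are invented for this example, and it still keeps every distinct word in memory, so it does not solve the 20-billion-word case.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class AnagramGroups {
    /** Sorts the letters of a word; all anagrams share the same result. */
    static String sortedKey(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    /** Groups words by sorted-letter key and keeps groups with at least two distinct words. */
    static List<TreeSet<String>> group(Stream<String> words) {
        Map<String, TreeSet<String>> byKey = words
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(AnagramGroups::sortedKey,
                        Collectors.toCollection(TreeSet::new)));
        return byKey.values().stream()
                .filter(g -> g.size() > 1)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<String> sample = Stream.of("abac", "aabc", "hddgfs", "abca", "xyz", "zyx", "yxz");
        // The order of the groups is not guaranteed (it follows the map's iteration order).
        for (TreeSet<String> g : group(sample)) {
            System.out.println(String.join(" ", g));
        }
    }
}
```

To read from a file, pass Files.lines(path) as the stream inside a try-with-resources block so that the file is closed afterwards.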

For the file size you mention (20 billion words), there are obviously two main problems with your code:
List<String> list = Files.readAllLines(Paths.get(fileName));
and
for (int i = 0; i < list.size(); i++)
These two lines in your program raise two questions:
Do you have enough memory to read the whole file in one go?
Is it OK to iterate 20 billion times?
On most systems the answer to both questions is no.
So your goal is to cut down the memory footprint and reduce the number of iterations.
That means reading the file chunk by chunk and storing the words in some kind of search data structure (such as a trie).
You will find numerous questions on SO about both topics, for example:
Fastest way to incrementally read a large file
Finding anagrams for a given word
The algorithm in the second question requires you to build a dictionary of your words first.
Anyway, I believe there is no ready-made answer for you. Take a file with one billion words (producing one is a difficult task in itself) and see what works and what doesn't; your current code will obviously not work.
Hope it helps!

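Use a stream to read the file. That way you only hold one word in memory at a time.
try (Reader file = new BufferedReader(new FileReader("file.txt"))) { // character stream
    StringBuilder word = new StringBuilder();
    int c; // read() returns an int so that -1 can signal end of stream
    while ((c = file.read()) != -1) { // reads one character
        if (c != '\n') {
            word.append((char) c);
        } else {
            process(word.toString()); // do whatever you want
            word.setLength(0);
        }
    }
    if (word.length() > 0) {
        process(word.toString()); // last word when the file has no trailing newline
    }
}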

Update
You can use a map to find the anagrams, as below. For each word, sort its characters to obtain a sorted String and use that as the key of the anagrams map. The value for each key is the list of words that are anagrams of one another.
public void findAnagrams(String[] yourWords) {
Map<String, List<String>> anagrams = new HashMap<String, List<String>>();
for (String word : yourWords) {
String sortedWord = sortedString(word);
List<String> values = anagrams.get(sortedWord);
if (values == null)
values = new LinkedList<>();
values.add(word);
anagrams.put(sortedWord, values);
}
System.out.println(anagrams);
}
private static String sortedString(String originalWord) {
char[] chars = originalWord.toCharArray();
Arrays.sort(chars);
String sorted = new String(chars);
return sorted;
}

Related

search for multiple strings from a text file in java

I'm trying to search one text file for multiple words given by the user (I store them in an array). If a word occurs exactly once in the file it should be reported, and otherwise not.
If a search word itself is entered more than once, it should be searched for only once.
The problem: searching for a single word works, but with multiple words the program keeps reporting that the word isn't present even when it is.
I would like to know where I should put the for loop and what changes are needed.
package search;
import java.io.*;
import java.util.Scanner;
public class Read {
public static void main(String[] args) throws IOException
{
Scanner sc = new Scanner(System.in);
String[] words=null;
FileReader fr = new FileReader("java.txt");
BufferedReader br = new BufferedReader(fr);
String s;
System.out.println("Enter the number of words:");
Integer n = sc.nextInt();
String wordsArray[] = new String[n];
System.out.println("Enter words:");
for(int i=0; i<n; i++)
{
wordsArray[i]=sc.next();
}
for (int i = 0; i <n; i++) {
int count=0; //Intialize the word to zero
while((s=br.readLine())!=null) //Reading Content from the file
{
{
words=s.split(" "); //Split the word using space
for (String word : words)
{
if (word.equals(wordsArray[i])) //Search for the given word
{
count++; //If Present increase the count by one
}
}
if(count == 1)
{
System.out.println(wordsArray[i] + " is unique in file ");
}
else if (count == 0)
{
System.out.println("The given word is not present in the file");
}
else
{
System.out.println("The given word is present in the file more than 1 time");
}
}
}
}
fr.close();
}
}

The code you wrote is error-prone. Remember that a while loop always needs a proper termination condition.
Try the following code:
import java.util.HashMap;
import java.util.Map;

public class Read {
    public static void main(String[] args)
    {
        // The text to search (hard-coded here)
        String paragraph = "These words can be searched";
        // Maps each word of the text to its number of occurrences
        Map<String, Integer> hashMap = new HashMap<>();
        // The words to search for (hard-coded here)
        String[] words = new String[]{"These", "can", "searched"};
        // Count every word of the text
        for (String token : paragraph.split("\\s+")) {
            // get() returns null if the key is not in the map yet
            Integer integer = hashMap.get(token);
            if (integer == null)
                // First occurrence: store the word with a count of 1
                hashMap.put(token, 1);
            else {
                // Word already present: increment its count
                hashMap.put(token, integer + 1);
            }
        }
        // Look up each search word; absent words count as 0
        for (String word : words) {
            System.out.println(word + " occurs " + hashMap.getOrDefault(word, 0) + " time(s)");
        }
    }
}
I hard-coded the values; you can read the paragraph from the file and the search words from the console.
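The reason the question's program fails for every word after the first is that the inner while loop reads the BufferedReader to the end during the first pass, so later passes get null immediately. One way around that, sketched here under the assumption that the file's lines are already available as a List (for example from Files.readAllLines), is to go through the text once and count all search words together. The class and method names are illustrative only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class WordSearch {
    /** Counts how often each search word occurs in the given lines (one pass over the text). */
    static Map<String, Integer> countOccurrences(List<String> lines, Set<String> searchWords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : searchWords) {
            counts.put(w, 0); // every search word starts at zero
        }
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                if (counts.containsKey(token)) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("These words can be searched", "can you see");
        // A LinkedHashSet drops duplicate search words but keeps their order.
        Set<String> search = new LinkedHashSet<>(Arrays.asList("These", "can", "missing", "can"));
        countOccurrences(lines, search).forEach((word, n) ->
                System.out.println(word + ": "
                        + (n == 0 ? "not present" : n == 1 ? "unique" : "present " + n + " times")));
    }
}
```

Because the result is printed only after the whole text has been scanned, each word is reported exactly once instead of once per line.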

The 'proper' class to use for extracting words from text is java.text.BreakIterator.
You can try the following (it reads line by line, which helps with large files):
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;
import java.nio.file.Files;
import java.nio.file.Paths;
public class WordFinder {
public static void main(String[] args) {
try {
if (args.length < 2) {
WordFinder.usage();
System.exit(1);
}
ArrayList<String> argv = new ArrayList<>(Arrays.asList(args));
String path = argv.remove(0);
List<String> found = WordFinder.findWords(Files.lines(Paths.get(path)), argv);
System.out.printf("Found the following word(s) in file at %s%n", path);
System.out.println(found);
} catch (Throwable t) {
t.printStackTrace();
}
}
public static List<String> findWords(Stream<String> lines, ArrayList<String> searchWords) {
List<String> result = new ArrayList<>();
BreakIterator boundary = BreakIterator.getWordInstance();
lines.forEach(line -> {
boundary.setText(line);
int start = boundary.first();
for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
String candidate = line.substring(start, end);
if (searchWords.contains(candidate)) {
result.add(candidate);
searchWords.remove(candidate);
}
}
});
return result;
}
private static void usage() {
System.err.println("Usage: java WordFinder <Path to input file> <Word 1> [<Word 2> <Word 3>...]");
}
}
Sample run:
goose#t410:/tmp$ echo 'the quick brown fox jumps over the lazy dog' >quick.txt
goose#t410:/tmp$ java WordFinder quick.txt dog goose the did quick over
Found the following word(s) in file at quick.txt
[the, quick, over, dog]
goose#t410:/tmp$

Search ArrayList for certain character in string

What is the correct syntax for searching an ArrayList of strings for a single character? I want to check each string in the array for a single character.
Ultimately I want to perform multiple search and replaces on all strings in an array based on the presence of a single character in the string.
I have reviewed java-examples.com and java docs as well as several methods of searching ArrayLists. None of them do quite what I need.
P.S. Any pointers on using some sort of file library to perform multiple search and replaces would be great.
--- Edit ---
As per MightyPork's recommendations arraylist revised to use simple string type. This also made it compatible with hoosssein's solution which is included.
public void ArrayInput() {
String FileName; // set file variable
FileName = fileName.getText(); // get file name
ArrayList<String> fileContents = new ArrayList<String>(); // create arraylist
try {
BufferedReader reader = new BufferedReader(new FileReader(FileName)); // create reader
String line = null;
while ((line = reader.readLine()) != null) {
if(line.length() > 0) { // don't include blank lines
line = line.trim(); // remove whitespaces
fileContents.add(line); // add to array
}
}
for (String row : fileContents) {
System.out.println(row); // print array to cmd
}
String oldstr;
String newstr;
oldstr = "}";
newstr = "!!!!!";
for(int i = 0; i < fileContents.size(); i++) {
if(fileContents.contains(oldstr)) {
fileContents.set(i, fileContents.get(i).replace(oldstr, newstr));
}
}
for (String row : fileContents) {
System.out.println(row); // print array to cmd
}
// close file
}
catch (IOException ex) { // E.H. for try
JOptionPane.showMessageDialog(null, "File not found. Check name and directory.");
}
}

First you need to iterate over the list and check each string for that character:
string.contains("A");
When replacing, keep in mind that String is immutable, so you must put the new string back into the list in place of the old one.
So the code looks like this:
public void replace(ArrayList<String> toSearchIn, String oldstr, String newStr) {
    for (int i = 0; i < toSearchIn.size(); i++) {
        // test the element itself; List.contains would look for an element equal to oldstr
        if (toSearchIn.get(i).contains(oldstr)) {
            toSearchIn.set(i, toSearchIn.get(i).replace(oldstr, newStr));
        }
    }
}

For the search and replace you are better off using a dictionary (a Map), if you know in advance that you will replace, say, Hi with Hello. The first method is a simple search that returns the matching string and its index in an Object[2], so you will have to cast the result. It returns the first match; you were not clear on which match you want.
public static Object[] findStringMatchingCharacter(List<String> list,
        char character) {
    if (list == null)
        return null;
    for (int i = 0; i < list.size(); i++) {
        String s = list.get(i);
        if (s.contains("" + character)) {
            // first match: element 0 is the string, element 1 its index
            return new Object[] { s, i };
        }
    }
    return null; // no string contains the character
}
public static void searchAndReplace(ArrayList<String> original,
Map<String, String> dictionary) {
if (original == null || dictionary == null)
return;
for (int i = 0; i < original.size(); i++) {
String s = original.get(i);
if (dictionary.get(s) != null)
original.set(i, dictionary.get(s));
}
}

You can try this, modify as needed:
public static ArrayList<String> findInString(String needle, List<String> haystack) {
ArrayList<String> found = new ArrayList<String>();
for(String s : haystack) {
if(s.contains(needle)) {
found.add(s);
}
}
return found;
}
(to search char, just do myChar+"" and you have string)
To add the find'n'replace functionality should now be fairly easy for you.
Here's a variant for searching String[]:
public static ArrayList<String[]> findInString(String needle, List<String[]> haystack) {
ArrayList<String[]> found = new ArrayList<String[]>();
for(String fileLines[] : haystack) {
for(String s : fileLines) {
if(s.contains(needle)) {
found.add(fileLines);
break;
}
}
}
return found;
}

You don't need to iterate over lines twice to do what you need. You can make replacement when iterating over file.
Java 8 solution
try (BufferedReader reader = Files.newBufferedReader(Paths.get("pom.xml"))) {
reader
.lines()
.filter(x -> x.length() > 0)
.map(x -> x.trim())
.map(x -> x.replace("a", "b"))
.forEach(System.out::println);
} catch (IOException e){
//handle exception
}
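If the lines are already in an ArrayList, as in the question, Java 8 also offers List.replaceAll, which applies a function to every element in place and avoids the index loop. A minimal sketch follows; the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReplaceInList {
    /** Replaces oldStr with newStr in every element that contains it, in place. */
    static void replaceInAll(List<String> lines, String oldStr, String newStr) {
        lines.replaceAll(line -> line.contains(oldStr) ? line.replace(oldStr, newStr) : line);
    }

    public static void main(String[] args) {
        List<String> fileContents = new ArrayList<>(Arrays.asList("if (x) {", "}", "done"));
        replaceInAll(fileContents, "}", "!!!!!");
        System.out.println(fileContents); // prints [if (x) {, !!!!!, done]
    }
}
```

The contains check is not strictly needed, since replace leaves strings without a match unchanged, but it mirrors the "only if the character is present" wording of the question.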

Another way by using iterator
public static void main(String[] args) {
ArrayList<String> list = new ArrayList<>();
list.add("Naman");
list.add("Aman");
list.add("Nikhil");
list.add("Adarsh");
list.add("Shiva");
list.add("Namit");
Iterator<String> iterator = list.iterator();
while (iterator.hasNext()) {
String next = iterator.next();
if (next.startsWith("Na")) {
System.out.println(next);
}
}
}

Problem implementing classifier algorithm for whitespace separated words

I have a text and split it into words separated by white spaces.
I'm classifying units, and this works when the unit and the value occur in the same word (e.g. '100m', '90kg', '140°F', 'US$500'), but I'm having problems when they appear separately, each part in its own word (e.g. '100 °C', 'US$ 450', '150 km').
The classifier can tell when it has a unit but the value is missing, and whether the value should be on the left or the right side.
My question is: how can I iterate over all the words in a list and provide the correct words to the classifier?
This is only an example of the code. I have tried it in a lot of ways.
for(String word: words){
String category = classifier.classify(word);
if(classifier.needPreviousWord()){
// ?
}
if(classifier.needNextWord()){
// ?
}
}
In other words, I need to iterate over the list classifying all the words; if the previous word is needed, provide the previous word together with the unit, and if the next word is needed, provide the unit together with the next word. It appears simple, but I don't know how to do it.

Don't use the implicit iterator of the for-each loop; use an explicit ListIterator. Then you can go back and forth as you like.
ListIterator<String> i = words.listIterator(); // a plain Iterator has no previous()
while (i.hasNext()) {
    String category = classifier.classify(i.next());
    if(classifier.needPreviousWord()){
        i.previous(); // note: right after next(), this returns the same element again
    }
    if(classifier.needNextWord()){
        i.next();
    }
}
This is not complete, because I don't know what your classifier does exactly, but it should give you an idea of how to proceed.

This could help.
public static void main(String [] args)
{
    List<String> words = new ArrayList<String>(); // fill with the words of your text
    for(int i=0; i < words.size(); i++) {
        // an empty string means "no neighbour on that side"
        String previousWord = (i > 0) ? words.get(i-1) : "";
        String currentWord = words.get(i);
        String nextWord = (i < words.size() - 1) ? words.get(i+1) : "";
        // classifier is the object from the question
        String category = classifier.classify(currentWord);
        if(classifier.needPreviousWord()){
            if(previousWord.length() == 0) {
                System.out.println("ERROR: missing previous unit");
            } else {
                System.out.println(previousWord + currentWord);
            }
        }
        if(classifier.needNextWord()){
            if(nextWord.length() == 0) {
                System.out.println("ERROR: missing next unit");
            } else {
                System.out.println(currentWord + nextWord);
            }
        }
    }
}
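Since the real classifier is not shown, here is a self-contained sketch of the same index-based idea with a toy rule set standing in for it. Everything in it is an assumption for illustration: the class name, the three helper predicates and the tiny list of units are hypothetical and would be replaced by calls to the actual classifier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnitJoiner {
    // Hypothetical stand-ins for the real classifier's needPreviousWord()/needNextWord().
    static boolean isBareNumber(String w) { return w.matches("\\d+([.,]\\d+)?"); }
    static boolean isPrefixUnit(String w) { return w.equals("US$"); }                         // value follows
    static boolean isSuffixUnit(String w) { return w.matches("km|kg|m|\u00B0C|\u00B0F"); }  // value precedes

    /** Joins a separated value and unit into one word; other words pass through unchanged. */
    static List<String> join(List<String> words) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String current = words.get(i);
            String next = i + 1 < words.size() ? words.get(i + 1) : null;
            boolean pair = next != null
                    && ((isBareNumber(current) && isSuffixUnit(next))
                     || (isPrefixUnit(current) && isBareNumber(next)));
            if (pair) {
                out.add(current + next);
                i++; // the neighbour has been consumed, skip it
            } else {
                out.add(current);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("It", "was", "100", "\u00B0C", "and", "US$", "450", "for", "150", "km");
        System.out.println(join(words));
    }
}
```

Joining the pair first and classifying the joined word afterwards means the classifier only ever sees complete units, which is the case the question says already works.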

Counting number of words in a file

I'm having a problem counting the number of words in a file. The approach that I am taking is when I see a space or a newLine then I know to count a word.
The problem is that if I have multiple lines between paragraphs then I ended up counting them as words also. If you look at the readFile() method you can see what I am doing.
Could you help me out and guide me in the right direction on how to fix this?
Example input file (including a blank line):
word word word
word word
word word word

You can use a Scanner with a FileInputStream instead of BufferedReader with a FileReader. For example:-
File file = new File("sample.txt");
try(Scanner sc = new Scanner(new FileInputStream(file))){
int count=0;
while(sc.hasNext()){
sc.next();
count++;
}
System.out.println("Number of words: " + count);
}

I would change your approach a bit. First, I would use a BufferedReader to read the file line by line using readLine(). Then split each line on runs of whitespace using String.split("\\s+"), skipping blank lines, and use the length of the resulting array to see how many words are on that line. To get the number of characters you could look either at the length of each line or at the length of each split word (depending on whether you want to count whitespace as characters).
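A minimal sketch of that line-by-line approach might look as follows. The class name is invented, and the text comes from a StringReader only to keep the example self-contained; with a real file you would wrap a FileReader instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class LineWordCounter {
    /** Counts whitespace-separated words, ignoring blank lines. */
    static int countWords(BufferedReader reader) throws IOException {
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                // split on runs of whitespace so repeated spaces do not create empty words
                count += trimmed.split("\\s+").length;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        String text = "word word word\nword word\n\nword word word\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            System.out.println(countWords(reader)); // prints 8
        }
    }
}
```

Trimming before the split matters: a line with leading whitespace would otherwise produce an empty first element, and a blank line would count as one word.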

This is just a thought. There is one very easy way to do it: if you just need the number of words and not the actual words, use WordUtils from Apache Commons Lang.
import org.apache.commons.lang.WordUtils;
public class CountWord {
public static void main(String[] args) {
String str = "Just keep a boolean flag around that lets you know if the previous character was whitespace or not pseudocode follows";
String initials = WordUtils.initials(str);
System.out.println(initials);
//so number of words in your file will be
System.out.println(initials.length());
}
}

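Just keep a boolean flag around that tells you whether the previous character was whitespace, and count a word at every transition from whitespace to non-whitespace:
static int countWords(java.io.Reader input) throws java.io.IOException {
    boolean prevWhitespace = true; // treat the start of the input like whitespace
    int wordCount = 0;
    int ch;
    while ((ch = input.read()) != -1) {
        if (Character.isWhitespace(ch)) {
            prevWhitespace = true;
        } else {
            if (prevWhitespace) {
                wordCount++; // first character of a new word
            }
            prevWhitespace = false;
        }
    }
    return wordCount;
}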

I think a correct approach would be by means of Regex:
String fileContent = <text from file>;
String[] words = Pattern.compile("\\s+").split(fileContent);
System.out.println("File has " + words.length + " words");
Hope it helps. The meaning of "\\s+" (one or more whitespace characters) is explained in the Pattern Javadoc.

import java.io.BufferedReader;
import java.io.FileReader;
public class CountWords {
    public static void main (String args[]) throws Exception {
        System.out.println ("Counting Words");
        // try-with-resources closes the reader when done
        try (BufferedReader br = new BufferedReader (new FileReader ("c:\\Customer1.txt"))) {
            String line = br.readLine();
            int count = 0;
            while (line != null) {
                // split on runs of whitespace so repeated spaces are not counted
                String[] parts = line.trim().split("\\s+");
                for (String w : parts) {
                    if (!w.isEmpty()) { // a blank line yields one empty string
                        count++;
                    }
                }
                line = br.readLine();
            }
            System.out.println(count);
        }
    }
}

Hack solution
You can read the text file into a String variable, then split the String into an array using a single space as the delimiter: stringVar.split(" ").
The length of the array then equals the number of "words" in the file (words separated only by a line break are not split apart this way).
Of course this wouldn't give you a count of lines.

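3 steps: consume all the whitespace, check whether it is a line break, consume all the non-whitespace.
int c = inFile.read();
while (c != -1) {
    // 1. consume whitespace, 2. counting line breaks on the way
    while (c != -1 && Character.isWhitespace(c)) {
        if (c == '\n') { numberLines++; }
        c = inFile.read();
    }
    if (c == -1) { break; } // end of file reached
    // 3. consume one word
    while (c != -1 && !Character.isWhitespace(c)) {
        numberChars++;
        c = inFile.read();
    }
    numberWords++;
}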

File Word-Count
If the words are separated by symbols, you can split on those symbols and count the resulting words.
Scanner sc = new Scanner(new FileInputStream(new File("Input.txt")));
int count = 0;
while (sc.hasNext()) {
String[] s = sc.next().split("\\d*[.#:=#-]");
for (int i = 0; i < s.length; i++) {
if (!s[i].isEmpty()){
System.out.println(s[i]);
count++;
}
}
}
System.out.println("Word-Count : "+count);

Take a look at my solution here; it should work. The idea is to treat all the unwanted symbols as separators and to collect the words between them in another variable (I used an ArrayList). By extending the excludedSymbols array you can add more symbols to exclude from the words.
public static void countWords () {
String textFileLocation ="c:\\yourFileLocation";
String readWords ="";
ArrayList<String> extractOnlyWordsFromTextFile = new ArrayList<>();
// excludedSymbols can be extended to whatever you want to exclude from the file
String[] excludedSymbols = {" ", "," , "." , "/" , ":" , ";" , "<" , ">", "\n"};
String readByteCharByChar = "";
boolean testIfWord = false;
try {
InputStream inputStream = new FileInputStream(textFileLocation);
byte byte1 = (byte) inputStream.read();
while (byte1 != -1) {
readByteCharByChar +=String.valueOf((char)byte1);
for(int i=0;i<excludedSymbols.length;i++) {
if(readByteCharByChar.equals(excludedSymbols[i])) {
if(!readWords.equals("")) {
extractOnlyWordsFromTextFile.add(readWords);
}
readWords ="";
testIfWord = true;
break;
}
}
if(!testIfWord) {
readWords+=(char)byte1;
}
readByteCharByChar = "";
testIfWord = false;
byte1 = (byte)inputStream.read();
if(byte1 == -1 && !readWords.equals("")) {
extractOnlyWordsFromTextFile.add(readWords);
}
}
inputStream.close();
System.out.println(extractOnlyWordsFromTextFile);
System.out.println("The number of words in the choosen text file are: " + extractOnlyWordsFromTextFile.size());
} catch (IOException ioException) {
ioException.printStackTrace();
}
}

This can be done in a very simple way using Java 8:
Files.lines(Paths.get(file))
.flatMap(str->Stream.of(str.split("[ ,.!?\r\n]")))
.filter(s->s.length()>0).count();

BufferedReader bf= new BufferedReader(new FileReader("G://Sample.txt"));
String line=bf.readLine();
while(line!=null)
{
String[] words=line.split(" ");
System.out.println("this line contains " +words.length+ " words");
line=bf.readLine();
}

The code below works with Java 8:
// Read the file into a String
String fileContent = new String(Files.readAllBytes(Paths.get("MyFile.txt")), StandardCharsets.UTF_8);
// Split on runs of non-letter characters and keep the pieces in a list
List<String> words = Arrays.asList(fileContent.split("\\PL+"));
int count = 0;
for (String x : words) {
    if (!x.isEmpty()) count++; // skip the empty string that a leading separator produces
}
System.out.println(count);

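It is easy once the file contents are in a String (obtained here by some method such as getText()):
public class Main {
    static int countOfWords(String str) {
        // check for null first, otherwise equals() would throw a NullPointerException
        if (str == null || str.equals("")) {
            return 0;
        } else {
            int numberWords = 0;
            for (char c : str.toCharArray()) {
                if (c == ' ') {
                    numberWords++;
                }
            }
            // n single spaces separate n + 1 words
            return numberWords + 1;
        }
    }
}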

Keeping track of punctuation, spacing, when editing a file in Java

I'm writing a program to delete duplicate consecutive words from a text file, then replaces that text file without the duplicates. I know that my current code does not handle the case where a duplicate word is at the end of one line, and at the beginning of the next line since I read each line into an ArrayList, find the duplicate, and remove it. After writing it though, I wasn't sure if this was an 'ok' way to do it since now I don't know how to write it back out. I'm not sure how I can keep track of the punctuation for beginning and end of line sentences, as well as the correct spacing, and when there are line returns in the original text file. Is there a way to handle those things (spacing, punctuation, etc) with what I have so far? Or, do I need to do a redesign? The other thing I thought I could do is return an array of what indices of words I need deleted, but then I wasn't sure if that's much better. Anyway, here is my code: (thanks in advance!)
/** Removes consecutive duplicate words from text files.
It accepts only one argument, that argument being a text file
or a directory. It finds all text files in the directory and
its subdirectories and moves duplicate words from those files
as well. It replaces the original file. */
import java.io.*;
import java.util.*;
public class RemoveDuplicates {
public static void main(String[] args) {
if (args.length != 1) {
System.out.println("Program accepts one command-line argument. Exiting!");
System.exit(1);
}
File f = new File(args[0]);
if (!f.exists()) {
System.out.println("Does not exist!");
}
else if (f.isDirectory()) {
System.out.println("is directory");
}
else if (f.isFile()) {
System.out.println("is file");
String fileName = f.toString();
RemoveDuplicates dup = new RemoveDuplicates(f);
dup.showTextFile();
List<String> noDuplicates = dup.doDeleteDuplicates();
showTextFile(noDuplicates);
//writeOutputFile(fileName, noDuplicates);
}
else {
System.out.println("Shouldn't happen");
}
}
/** Reads in each line of the passed in .txt file into the lineOfWords array. */
public RemoveDuplicates(File fin) {
lineOfWords = new ArrayList<String>();
try {
BufferedReader in = new BufferedReader(new FileReader(fin));
for (String s = null; (s = in.readLine()) != null; ) {
lineOfWords.add(s);
}
}
catch (IOException e) {
e.printStackTrace();
}
}
public void showTextFile() {
for (String s : lineOfWords) {
System.out.println(s);
}
}
public static void showTextFile(List<String> list) {
for (String s : list) {
System.out.print(s);
}
}
public List<String> doDeleteDuplicates() {
List<String> noDup = new ArrayList<String>(); // List to be returned without duplicates
// go through each line and split each word into end string array
for (String s : lineOfWords) {
String endString[] = s.split("[\\s+\\p{Punct}]");
// add each word to the arraylist
for (String word : endString) {
noDup.add(word);
}
}
for (int i = 0; i < noDup.size() - 1; i++) {
if (noDup.get(i).toUpperCase().equals(noDup.get(i + 1).toUpperCase())) {
System.out.println("Removing: " + noDup.get(i+1));
noDup.remove(i + 1);
i--;
}
}
return noDup;
}
public static void writeOutputFile(String fileName, List<String> newData) {
try {
PrintWriter outputFile = new PrintWriter(new BufferedWriter(new FileWriter(fileName)));
for (String str : newData) {
outputFile.print(str + " ");
}
outputFile.close();
}
catch (IOException e) {
e.printStackTrace();
}
}
private List<String> lineOfWords;
}
An example.txt:
Hello hello this is a test test in order
order to see if it deletes duplicates Duplicates words.

How about something like this? In this case, I assume it is case insensitive.
// \b keeps the match on whole words; without it "this is" would match as "IS IS"
Pattern p = Pattern.compile("\\b(\\w+) \\1\\b");
String line = "Hello hello this is a test test in order\norder to see if it deletes duplicates Duplicates words.";
Matcher m = p.matcher(line.toUpperCase());
StringBuilder sb = new StringBuilder(1000);
int idx = 0;
while (m.find()) {
    // keep everything up to the end of the first copy, skip the second copy
    sb.append(line.substring(idx, m.end(1)));
    idx = m.end();
}
sb.append(line.substring(idx));
System.out.println(sb.toString());
Here's the output:
Hello this is a test in order
order to see if it deletes duplicates words.
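A more compact variant, offered as a sketch rather than a definitive answer: String.replaceAll with the case-insensitive flag (?i), word boundaries, and \s+ between the copies. Because \s+ also matches a line break, it catches a duplicate that ends one line and starts the next, at the cost of dropping that line break. The class and method names are made up for this example.

```java
public class DuplicateWords {
    /** Collapses runs of the same word (ignoring case), keeping the first occurrence. */
    static String removeConsecutiveDuplicates(String text) {
        // \b...\b keeps matches on whole words; (\s+\1\b)+ swallows every repeated copy
        return text.replaceAll("(?i)\\b(\\w+)(\\s+\\1\\b)+", "$1");
    }

    public static void main(String[] args) {
        String text = "Hello hello this is a test test in order\n"
                + "order to see if it deletes duplicates Duplicates words.";
        System.out.println(removeConsecutiveDuplicates(text));
        // prints: Hello this is a test in order to see if it deletes duplicates words.
    }
}
```

Punctuation and spacing outside the matched run are left untouched, so the result can be written back to the file as is. Duplicates separated by punctuation (for example "test, test") are not treated as consecutive by this pattern.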
