Search ArrayList for certain character in string

What is the correct syntax for searching an ArrayList of strings for a single character? I want to check each string in the array for a single character.
Ultimately I want to perform multiple search and replaces on all strings in an array based on the presence of a single character in the string.
I have reviewed java-examples.com and java docs as well as several methods of searching ArrayLists. None of them do quite what I need.
P.S. Any pointers on using some sort of file library to perform multiple search and replaces would be great.
--- Edit ---
As per MightyPork's recommendation, the ArrayList was revised to hold plain String elements. This also made it compatible with hoosssein's solution, which is included below.
public void ArrayInput() {
String FileName; // set file variable
FileName = fileName.getText(); // get file name
ArrayList<String> fileContents = new ArrayList<String>(); // create arraylist
try {
BufferedReader reader = new BufferedReader(new FileReader(FileName)); // create reader
String line = null;
while ((line = reader.readLine()) != null) {
if(line.length() > 0) { // don't include blank lines
line = line.trim(); // remove whitespaces
fileContents.add(line); // add to array
}
}
for (String row : fileContents) {
System.out.println(row); // print array to cmd
}
String oldstr;
String newstr;
oldstr = "}";
newstr = "!!!!!";
for(int i = 0; i < fileContents.size(); i++) {
if(fileContents.get(i).contains(oldstr)) { // test the string at index i, not the list itself
fileContents.set(i, fileContents.get(i).replace(oldstr, newstr));
}
}
for (String row : fileContents) {
System.out.println(row); // print array to cmd
}
// close file
}
catch (IOException ex) { // E.H. for try
JOptionPane.showMessageDialog(null, "File not found. Check name and directory.");
}
}

First, iterate over the list and check each string for that character:
string.contains("A");
To replace the character, keep in mind that String is immutable: replace() returns a new string, which you must store back into the list in place of the old one.
So the code looks like this:
public void replace(ArrayList<String> toSearchIn,String oldstr, String newStr ){
for(int i=0;i<toSearchIn.size();i++){
if(toSearchIn.get(i).contains(oldstr)){ // check the element, not the list
toSearchIn.set(i, toSearchIn.get(i).replace(oldstr, newStr));
}
}
}
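A minimal sketch of the same idea, assuming Java 8 or later: List.replaceAll stores the new string in every slot, so no index bookkeeping is needed. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class ListReplaceSketch {

    // Replaces every occurrence of oldStr with newStr in each element.
    // String.replace returns a new String, so the result has to be
    // stored back into the list; List.replaceAll takes care of that.
    static void replaceInAll(List<String> lines, String oldStr, String newStr) {
        lines.replaceAll(line -> line.replace(oldStr, newStr));
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>(Arrays.asList("a } b", "no brace", "}}"));
        replaceInAll(lines, "}", "!!!!!");
        lines.forEach(System.out::println);
    }
}
```

Strings that do not contain oldStr come back unchanged, so a separate contains() check is optional here.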

For the search and replace you are better off using a dictionary (a Map) if you know the replacements in advance, for example that Hi becomes Hello. The first method below is a simple search; it returns the matching string and its index in an Object[2], so you will have to cast the result. It returns only the first match, since you were not clear on this.
public static Object[] findStringMatchingCharacter(List<String> list,
char character) {
if (list == null)
return null;
Object[] ret = new Object[2];
for (int i = 0; i < list.size(); i++) {
String s = list.get(i);
if (s.contains("" + character)) {
ret[0] = s;
ret[1] = i;
return ret; // first match found
}
}
return null; // no match
}
public static void searchAndReplace(ArrayList<String> original,
Map<String, String> dictionary) {
if (original == null || dictionary == null)
return;
for (int i = 0; i < original.size(); i++) {
String s = original.get(i);
if (dictionary.get(s) != null)
original.set(i, dictionary.get(s));
}
}
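Note that searchAndReplace() above only swaps elements that equal a dictionary key as a whole. If the goal is to replace substrings inside each line, something like the following sketch may be closer. It assumes the replacements should be applied in a fixed order (hence the LinkedHashMap in the usage example); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original answer.
class DictionaryReplaceSketch {

    // Applies every (search -> replacement) pair to every line,
    // in the iteration order of the map.
    static void replaceSubstrings(List<String> lines, Map<String, String> dictionary) {
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            for (Map.Entry<String, String> entry : dictionary.entrySet()) {
                line = line.replace(entry.getKey(), entry.getValue());
            }
            lines.set(i, line); // store the new string back into the list
        }
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new LinkedHashMap<>();
        dictionary.put("Hi", "Hello");
        dictionary.put("}", "!");
        List<String> lines = new ArrayList<>(Arrays.asList("Hi there }", "nothing to do"));
        replaceSubstrings(lines, dictionary);
        lines.forEach(System.out::println);
    }
}
```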

You can try this, modify as needed:
public static ArrayList<String> findInString(String needle, List<String> haystack) {
ArrayList<String> found = new ArrayList<String>();
for(String s : haystack) {
if(s.contains(needle)) {
found.add(s);
}
}
return found;
}
(To search for a char, just use myChar + "" to get a String.)
Adding the find'n'replace functionality should now be fairly easy for you.
Here's a variant for searching String[]:
public static ArrayList<String[]> findInString(String needle, List<String[]> haystack) {
ArrayList<String[]> found = new ArrayList<String[]>();
for(String fileLines[] : haystack) {
for(String s : fileLines) {
if(s.contains(needle)) {
found.add(fileLines);
break;
}
}
}
return found;
}

You don't need to iterate over the lines twice to do what you need. You can make the replacement while iterating over the file.
Java 8 solution
try (BufferedReader reader = Files.newBufferedReader(Paths.get("pom.xml"))) {
reader
.lines()
.filter(x -> x.length() > 0)
.map(x -> x.trim())
.map(x -> x.replace("a", "b"))
.forEach(System.out::println);
} catch (IOException e){
//handle exception
}

Another way, using an Iterator:
public static void main(String[] args) {
ArrayList<String> list = new ArrayList<>();
list.add("Naman");
list.add("Aman");
list.add("Nikhil");
list.add("Adarsh");
list.add("Shiva");
list.add("Namit");
Iterator<String> iterator = list.iterator();
while (iterator.hasNext()) {
String next = iterator.next();
if (next.startsWith("Na")) {
System.out.println(next);
}
}
}

Related

Filtering out Repeated Characters in Java

I am trying to write a program that has the method public static void method(List<String> words) where the parameter words is a list of words from the text file words.txt that are sorted and contain only the words where each letter occurs only once. For example, the word "feel" would not be included in this list since "e" occurs more than once. The word list is not to be used as an argument in the rest of the program, so the method method is only to be used to store and remember the wordlist for later use. This function can also perform any of the sorting methods.
My thought process was to create a method that would read the text file, and use that text file as the argument in method. method would then filter out all words with letters that appear more than once, and also sort the new list.
When running the program, I'm getting an error "java.util.ConcurrentModificationException: null (in java.util.LinkedList$ListItr)" on the line for (String word : words). Also, does the line public static List list; properly save and store the list for later use?
import java.util.*;
import java.io.*;
class ABC
{
public static List<String> list = new LinkedList<String>();
public static List readFile()
{
String content = new String();
File file = new File("words.txt");
LinkedList<String> words = new LinkedList<String>();
try
{
Scanner sc = new Scanner(new FileInputStream(file));
while (sc.hasNextLine())
{
content = sc.nextLine();
words.add(content);
}
}
catch (FileNotFoundException fnf)
{
fnf.printStackTrace();
}
catch (Exception e)
{
e.printStackTrace();
System.out.println("\nProgram terminated safely");
}
for (String word : words)
{
if (letters(word) == false)
{
list.add(word);
}
}
Collections.sort(list);
return list;
}
public static boolean letters(String word)
{
for (int i = 0; i < word.length() - 1; i++)
{
if (word.contains(String.valueOf(word.charAt(i))) == true)
{
return true;
}
}
return false;
}
public static void main(String args[])
{
System.out.println(readFile());
}
}

The source of the error is that you are changing a list that you are iterating on. This is generally not a good idea.
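A small sketch of that failure, assuming a LinkedList like the one in the question (the class and method names are invented for illustration). Removing an element inside a for-each loop makes the list's iterator throw on its next step:

```java
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.LinkedList;
import java.util.List;

// Hypothetical demo class, not taken from the question.
class ConcurrentModificationSketch {

    // Removes an element inside a for-each loop and reports whether the
    // iterator noticed the structural change.
    static boolean removeInsideForEachFails(List<String> words) {
        try {
            for (String word : words) {
                if (word.equals("feel")) {
                    words.remove(word); // modifies the list behind the iterator's back
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> words = new LinkedList<>(Arrays.asList("feel", "bad", "bye"));
        System.out.println(removeInsideForEachFails(words));
    }
}
```

The check is a best-effort safeguard rather than a guarantee, so code should never rely on the exception being thrown.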
Since you are building a new list, you don't actually need to change the one you are iterating on. I would recommend changing your code so that the logic for deciding if a letter appears more than once goes in a separate method. This way the complexity of any given method is manageable, and you can test them separately.
So create a new method that tests if any letter appears more than once:
static boolean doesAnyLetterAppearMoreThanOnce(String word) {
...
}
Then you can use it in your existing method:
for (String word : words) {
if (!doesAnyLetterAppearMoreThanOnce(word)) {
list.add(word);
}
}
Collections.sort(list);
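One possible body for doesAnyLetterAppearMoreThanOnce, as a sketch only: the answer leaves the implementation open, and this version assumes Java 8 streams and a case-sensitive comparison. The wrapper class name is invented.

```java
// Hypothetical wrapper class for the method outlined above.
class RepeatedLetterSketch {

    // A word has a repeated letter when it has fewer distinct
    // characters than characters in total.
    static boolean doesAnyLetterAppearMoreThanOnce(String word) {
        return word.chars().distinct().count() < word.length();
    }

    public static void main(String[] args) {
        System.out.println("feel: " + doesAnyLetterAppearMoreThanOnce("feel"));
        System.out.println("bad: " + doesAnyLetterAppearMoreThanOnce("bad"));
    }
}
```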

Use an iterator. Try it like this.
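Iterator<String> it = words.iterator();
while (it.hasNext()) {
    String word = it.next();
    boolean repeated = false;
    for (int j = 0; j < word.length() && !repeated; j++) {
        for (int k = j + 1; k < word.length(); k++) {
            if (word.charAt(j) == word.charAt(k)) {
                repeated = true; // same letter at two positions
                break;
            }
        }
    }
    if (repeated) {
        it.remove(); // Iterator.remove() takes no argument
    } else {
        list.add(word);
    }
}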
However, I would approach it differently.
String[] data =
{ "hello", "bad", "bye", "computer", "feel", "glee" };
outer: for (String word : data) {
for (int i = 0; i < word.length() - 1; i++) {
if (word.charAt(i) == word.charAt(i + 1)) {
System.out.println("dropping '" + word + "'");
continue outer;
}
}
System.out.println("Keeping '" + word + "'");
list.add(word);
}
Note: You used feel as an example so it wasn't clear if you wanted to check for the same letter anywhere in the word or only adjacent letters that are the same.
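To make the difference in the note concrete, here is a hedged sketch of both checks; the class and method names are invented for this example. "feel" is caught by both checks, while a word such as "level" is only caught by the second one.

```java
// Hypothetical comparison of the two interpretations.
class AdjacentVersusAnywhereSketch {

    // True only when two neighbouring characters are equal, e.g. "feel".
    static boolean hasAdjacentRepeat(String word) {
        for (int i = 0; i < word.length() - 1; i++) {
            if (word.charAt(i) == word.charAt(i + 1)) {
                return true;
            }
        }
        return false;
    }

    // True when any character also occurs at an earlier position, e.g. "level".
    static boolean hasRepeatAnywhere(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (word.indexOf(word.charAt(i)) != i) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String word : new String[] { "feel", "level", "computer" }) {
            System.out.println(word + ": adjacent=" + hasAdjacentRepeat(word)
                    + ", anywhere=" + hasRepeatAnywhere(word));
        }
    }
}
```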

There are several problems with your program:
public static List list;
Whenever you see a collection (like List) without generics, it's a code smell. It should be public static List<String> list;
Also consider changing public to private.
In the readFile() method you shadow the class variable 'list' with a local variable 'list', so your class variable remains uninitialized:
list = new LinkedList<String>();
Better use try-with-resources for scanner:
try(Scanner sc = new Scanner(new FileInputStream(file))) {
You don't need to close it afterwards manually.
You cannot modify the list through which you are iterating. You should either use an iterator and its remove method, or create a new list and append good words to it, instead of removing bad words from the original list.
public static List<String> readFile() {
File file = new File("words.txt");
List<String> list = new ArrayList<>();
try (Scanner scanner = new Scanner(file)) {
while (scanner.hasNextLine()) {
String word = scanner.nextLine();
if (noDuplicates(word)) {
list.add(word);
}
}
Collections.sort(list);
} catch (FileNotFoundException e) {
System.out.println("File not found");
}
return list;
}
private static boolean noDuplicates(String word) {
Set<Character> distinctChars = new HashSet<>();
for (char c : word.toCharArray()) {
if (!distinctChars.add(c)) {
return false;
}
}
return true;
}

I suggest this shorter approach:
public static void method(List<String> words) {
words.removeIf(word -> {
Set<Integer> hs = new HashSet<>();
return word.chars().anyMatch(c -> {
if (hs.contains(c)) return true;
else hs.add(c);
return false;
});
});
System.out.println(words);
}
The words list now contains only the words in which each letter occurs exactly once.

Java checking if an element from a list appears in all occurrences

I have a method that takes in an ArrayList of strings with each element in the list equaling to a variation of:
>AX018718 Equine influenza virus H3N8 // 4 (HA)
CAAAAGCAGGGTGACAAAAACATGATGGATTCCAACACTGTGTCAAGCTTTCAGGTAGACTGTTTTCTTT
GGCATGTCCGCAAACGATTTGCAGACCAAGAACTGGGTGATGCCCCATTCCTTGACCGGCTTCGCCGAGA
Each element is broken down into the Acc, which is AX018718 in this case, and the seq, which is the two lines following the Acc.
The seq is then checked against another ArrayList of strings called pal, e.g. [AAAATTTT, AAACGTTT, AAATATATTT], to see whether any of those substrings occur in it.
I am able to get all of the matches for the different elements of the first list outputted as:
AATATATT in organism: AX225014 Was found in position: 15 and at 15
AATATT in organism: AX225014 Was found in position: 1432 and at 1432
AATATT in organism: AX225016 Was found in position: 1404 and at 1404
AATT in organism: AX225016 Was found in position: 169 and at 2205
Is it possible to check, across all of the output, whether one pal was found in every Acc?
In the case above, the wanted output would be:
AATATT was found in all of the Acc.
my working code:
public static ArrayList<String> PB2Scan(ArrayList<String> Pal) throws FileNotFoundException, IOException
{
ArrayList<String> PalindromesSpotted = new ArrayList<String>();
File file = new File("IAV_PB2_32640.txt");
Scanner sc = new Scanner(file);
sc.useDelimiter(">");
//initializes the ArrayList
ArrayList<String> Gene1 = new ArrayList<String>();
//initializes the writer
FileWriter fileWriter = new FileWriter("PB2out");
PrintWriter printwriter = new PrintWriter(fileWriter);
//Loads the Array List
while(sc.hasNext()) Gene1.add(sc.next());
for(int i = 0; i < Gene1.size(); i++)
{
//Acc breaks down the title so the element:
//>AX225014 Equine influenza virus H3N8 // 1 (PB2)
//ATGAAGACAACCATTATTTTGATACTACTGACCCATTGGGTCTACAGTCAAAACCCAACCAGTGGCAACA
//GGCATGTCCGCAAACGATTTGCAGACCAAGAACTGGGTGATGCCCCATTCCTTGACCGGCTTCGCCGAGA
//comes out as AX225014
String Acc = Accession(Gene1.get(i));
//seq takes the same element as above and returns only
//ATGAAGACAACCATTATTTTGATACTACTGACCCATTGGGTCTACAGTCAAAACCCAACCAGTGGCAACA
//GGCATGTCCGCAAACGATTTGCAGACCAAGAACTGGGTGATGCCCCATTCCTTGACCGGCTTCGCCGAGA
String seq = trimHeader(Gene1.get(i));
for(int x = 0; x<Pal.size(); x++)
{
if(seq.contains(Pal.get(x))){
String match = (Pal.get(x) + " in organism: " + Acc + " Was found in position: "+ seq.indexOf(Pal.get(x)) + " and at " +seq.lastIndexOf(Pal.get(x)));
printwriter.println(match);
PalindromesSpotted.add(match);
}
}
}
Collections.sort(PalindromesSpotted);
return PalindromesSpotted;
}

First off, your code won't write any results to the file, since you never close your writers or at the very least flush the PrintWriter. As a matter of fact, you don't close your reader either. You really should close your Readers and Writers to free resources. Food for thought.
You can make your PB2Scan() method return either a simple result list as it does now, or a result list of just acc's which contain the same Pal(s), or perhaps both where a simple result list is logged and at the end of that list a list of acc's which contain the same Pal(s) which will also be logged.
Some additional code and an additional integer parameter for the PB2Scan() method would do this. For the additional parameter you might want to add something like this:
public static ArrayList<String> PB2Scan(ArrayList<String> Pal, int resultType)
throws FileNotFoundException, IOException
{ .... }
Where the integer resultType argument would take one of three integer values from 0 to 2:
0 - Simple result list as the code currently does now;
1 - Acc's that match Pal's;
2 - Simple result list and Acc's that Match Pal's at the end of result list.
You should also really have the file to read as an argument for the PB2Scan() method since this file could very easily be a different name the next go around. This makes the method more versatile rather than if the name of the file was hard-coded.
public static ArrayList<String> PB2Scan(String filePath, ArrayList<String> Pal, int resultType)
throws FileNotFoundException, IOException { .... }
The method can always write to the same output file, since that file name is tied to this particular method.
With the above concept, rather than writing to the output file (PB2Out.txt) while the PalindromesSpotted ArrayList is being built, I think it's best to write the file once your ArrayList or ArrayLists are complete. A separate method (writeListToFile()) is best suited to that task. To find out which Acc's share the same Pal, it is again a good idea to have yet another method (getPalMatches()) do that work.
Since the index locations of a Pal that occurs more than once in a given Seq were not being reported properly either, I have provided yet another method (findSubstringIndexes()) to take care of that.
It should be noted that the code below assumes that the Seq acquired from the trimHeader() method is all one single String with no Line Break characters within it.
The reworked PB2Scan() method and the other above mentioned methods are listed below:
The PB2Scan() Method:
public static ArrayList<String> PB2Scan(String filePath, ArrayList<String> Pal, int resultType)
throws FileNotFoundException, IOException {
// Make sure the supplied result type is either
// 0, 1, or 2. If not then default to 0.
if (resultType < 0 || resultType > 2) {
resultType = 0;
}
ArrayList<String> PalindromesSpotted = new ArrayList<>();
File file = new File(filePath);
Scanner sc = new Scanner(file);
sc.useDelimiter(">");
//initializes the ArrayList
ArrayList<String> Gene1 = new ArrayList<>();
//Loads the Array List
while (sc.hasNext()) {
Gene1.add(sc.next());
}
sc.close(); // Close the read in text file.
for (int i = 0; i < Gene1.size(); i++) {
//Acc breaks down the title so the element:
//>AX225014 Equine influenza virus H3N8 // 1 (PB2)
//ATGAAGACAACCATTATTTTGATACTACTGACCCATTGGGTCTACAGTCAAAACCCAACCAGTGGCAACA
//GGCATGTCCGCAAACGATTTGCAGACCAAGAACTGGGTGATGCCCCATTCCTTGACCGGCTTCGCCGAGA
//comes out as AX225014
String Acc = Accession(Gene1.get(i));
//seq takes the same element as above and returns only
//ATGAAGACAACCATTATTTTGATACTACTGACCCATTGGGTCTACAGTCAAAACCCAACCAGTGGCAACA
//GGCATGTCCGCAAACGATTTGCAGACCAAGAACTGGGTGATGCCCCATTCCTTGACCGGCTTCGCCGAGA
String seq = trimHeader(Gene1.get(i));
for (int x = 0; x < Pal.size(); x++) {
if (seq.contains(Pal.get(x))) {
String match = Pal.get(x) + " in organism: " + Acc +
" Was found in position(s): " +
findSubstringIndexes(seq, Pal.get(x));
PalindromesSpotted.add(match);
}
}
}
// If there is nothing to work with get outta here.
if (PalindromesSpotted.isEmpty()) {
return PalindromesSpotted;
}
// Sort the ArrayList
Collections.sort(PalindromesSpotted);
// Another ArrayList for matching Pal's to Acc's
ArrayList<String> accMatchingPal = new ArrayList<>();
switch (resultType) {
case 0: // resultType 0: simple result list
writeListToFile("PB2Out.txt", PalindromesSpotted);
return PalindromesSpotted;
case 1: // resultType 1: Acc's that match Pal's
accMatchingPal = getPalMatches(PalindromesSpotted);
writeListToFile("PB2Out.txt", accMatchingPal);
return accMatchingPal;
default: // resultType 2: simple result list followed by the matching Acc's
accMatchingPal = getPalMatches(PalindromesSpotted);
ArrayList<String> fullList = new ArrayList<>();
fullList.addAll(PalindromesSpotted);
// Create a Underline made of = signs in the list.
fullList.add(String.join("", Collections.nCopies(70, "=")));
fullList.addAll(accMatchingPal);
writeListToFile("PB2Out.txt", fullList);
return fullList;
}
}
The findSubstringIndexes() Method:
private static String findSubstringIndexes(String inputString, String stringToFind){
String indexes = "";
int index = inputString.indexOf(stringToFind);
while (index >= 0){
indexes+= (indexes.equals("")) ? String.valueOf(index) : ", " + String.valueOf(index);
index = inputString.indexOf(stringToFind, index + stringToFind.length()) ;
}
return indexes;
}
The getPalMatches() Method:
private static ArrayList<String> getPalMatches(ArrayList<String> Palindromes) {
ArrayList<String> accMatching = new ArrayList<>();
for (int i = 0; i < Palindromes.size(); i++) {
String matches = "";
String[] split1 = Palindromes.get(i).split("\\s+");
String pal1 = split1[0];
// Make sure the current Pal hasn't already been listed.
boolean alreadyListed = false;
for (int there = 0; there < accMatching.size(); there++) {
String[] th = accMatching.get(there).split("\\s+");
if (th[0].equals(pal1)) {
alreadyListed = true;
break;
}
}
if (alreadyListed) { continue; }
for (int j = 0; j < Palindromes.size(); j++) {
String[] split2 = Palindromes.get(j).split("\\s+");
String pal2 = split2[0];
if (pal1.equals(pal2)) {
// Using Ternary Operator to build the matches string
matches+= (matches.equals("")) ? pal1 + " was found in the following Accessions: "
+ split2[3] : ", " + split2[3];
}
}
if (!matches.equals("")) {
accMatching.add(matches);
}
}
return accMatching;
}
The writeListToFile() Method:
private static void writeListToFile(String filePath, ArrayList<String> list, boolean... appendToFile) {
boolean appendFile = false;
if (appendToFile.length > 0) { appendFile = appendToFile[0]; }
try {
try (BufferedWriter bw = new BufferedWriter(new FileWriter(filePath, appendFile))) {
for (int i = 0; i < list.size(); i++) {
bw.append(list.get(i) + System.lineSeparator());
}
}
} catch (IOException ex) {
ex.printStackTrace();
}
}
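One caveat about findSubstringIndexes(): it continues searching after the end of each match, so overlapping occurrences are skipped. If overlaps should be reported too (an assumption about what is wanted here), a variant could advance by one character instead. The class and method names below are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical variant of the index search, not part of the answer's code.
class OverlappingIndexSketch {

    // Collects every start index of needle in haystack, including
    // occurrences that overlap each other.
    static List<Integer> findAllIndexes(String haystack, String needle) {
        List<Integer> indexes = new ArrayList<>();
        if (needle.isEmpty()) {
            return indexes; // an empty needle would match at every position
        }
        int index = haystack.indexOf(needle);
        while (index >= 0) {
            indexes.add(index);
            index = haystack.indexOf(needle, index + 1); // step by one, not by needle length
        }
        return indexes;
    }

    public static void main(String[] args) {
        System.out.println(findAllIndexes("CAATTAATT", "AATT"));
        System.out.println(findAllIndexes("AAAAA", "AAA"));
    }
}
```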

You should probably create a Map<String, List<String>>. The example below uses each gene as the key and the Pals it contains as the value:
Map<String, List<String>> result = new HashMap<>();
for (String gene : Gene1) {
List<String> list = new ArrayList<>();
result.put(gene, list);
for (String pal : Pal) {
if (trimHeader(gene).contains(pal)) { // does this gene's sequence contain the pal?
list.add(pal);
}
}
}
Now you have a Map that you can query for the Pals every Gene contains:
List<String> containedPals = result.get(gene);
This is a very reasonable result for a function like this. What you do afterwards (i.e. writing to a file) is better done in another function that calls this one.
So, this is probably what you want to do:
List<String> genes = loadGenes(geneFile);
List<String> pals = loadPal(palFile);
Map<String, List<String>> genesToContainedPal = methodAbove(genes, pals);
switch (resultType) {
// ...
}
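To answer the original question directly (which Pal occurs in every Acc), the check can also be made per Pal. The sketch below assumes the sequences have already been collected into a map from accession to sequence, for example with the asker's own Accession() and trimHeader() helpers; accToSeq, the class name, and the method name are invented for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original answer.
class PalInAllAccsSketch {

    // accToSeq maps an accession such as "AX225014" to its sequence.
    // Returns the pals that occur in every one of those sequences.
    static List<String> palsFoundInAllAccs(Map<String, String> accToSeq, List<String> pals) {
        return pals.stream()
                .filter(pal -> !accToSeq.isEmpty()
                        && accToSeq.values().stream().allMatch(seq -> seq.contains(pal)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> accToSeq = new LinkedHashMap<>();
        accToSeq.put("AX225014", "GGAATATATTCCAATATT"); // made-up sequences
        accToSeq.put("AX225016", "TTAATATTGGAATT");
        List<String> pals = Arrays.asList("AATATATT", "AATATT", "AATT");
        for (String pal : palsFoundInAllAccs(accToSeq, pals)) {
            System.out.println(pal + " was found in all of the Acc.");
        }
    }
}
```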

Parse a document with a million words

I have implemented some code to find the anagrams in the sample.txt file and print them to the console. The text file contains one string (word) per line.
Is this the right approach if I want to find the anagrams in a text file with a million or 20 billion words? If not, which technology should I use in this case?
I appreciate any help.
Sample
abac
aabc
hddgfs
fjhfhr
abca
rtup
iptu
xyz
oifj
zyx
toeiut
yxz
jrgtoi
oupt
Output
abac aabc abca
xyz zyx yxz
Code
package org.reader;
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class Test {
// To store the anagram words
static List<String> match = new ArrayList<String>();
// Flag to check whether the checkWorld1InMatch() was invoked.
static boolean flagCheckWord1InMatch;
public static void main(String[] args) {
String fileName = "G:\\test\\sample2.txt";
StringBuilder sb = new StringBuilder();
// In case of matching, this flag is used to append the first word to
// the StringBuilder once.
boolean flag = true;
BufferedReader br = null;
try {
// convert the data in the sample.txt file to list
List<String> list = Files.readAllLines(Paths.get(fileName));
for (int i = 0; i < list.size(); i++) {
flagCheckWord1InMatch = true;
String word1 = list.get(i);
for (int j = i + 1; j < list.size(); j++) {
String word2 = list.get(j);
boolean isExist = false;
if (match != null && !match.isEmpty() && flagCheckWord1InMatch) {
isExist = checkWord1InMatch(word1);
}
if (isExist) {
// A word with the same characters was checked before
// and there is no need to check it again. Therefore, we
// jump to the next word in the list.
// flagCheckWord1InMatch = true;
break;
} else {
boolean result = isAnagram(word1, word2);
if (result) {
if (flag) {
sb.append(word1 + " ");
flag = false;
}
sb.append(word2 + " ");
}
if (j == list.size() - 1 && sb != null && !sb.toString().isEmpty()) {
match.add(sb.toString().trim());
sb.setLength(0);
flag = true;
}
}
}
}
} catch (
IOException e) {
e.printStackTrace();
} finally {
try {
if (br != null) {
br.close();
}
} catch (IOException ex) {
ex.printStackTrace();
}
}
for (String item : match) {
System.out.println(item);
}
// System.out.println("Sihwail");
}
private static boolean checkWord1InMatch(String word1) {
flagCheckWord1InMatch = false;
boolean isAvailable = false;
for (String item : match) {
String[] content = item.split(" ");
for (String word : content) {
if (word1.equals(word)) {
isAvailable = true;
break;
}
}
}
return isAvailable;
}
public static boolean isAnagram(String firstWord, String secondWord) {
char[] word1 = firstWord.toCharArray();
char[] word2 = secondWord.toCharArray();
Arrays.sort(word1);
Arrays.sort(word2);
return Arrays.equals(word1, word2);
}
}

For 20 billion words you will not be able to hold all of them in RAM, so you need an approach that processes them in chunks.
Consider 20,000,000,000 words. Java needs quite a lot of memory to store strings: you can count on 2 bytes per character plus at least 38 bytes of overhead per string.
This means 20,000,000,000 one-character words would need 800,000,000,000 bytes, or 800 GB, which is more than any computer I know of has.
Your file will contain much less than 20,000,000,000 different words, so you might avoid the memory problem if you store every word only once (e.g. in a Set).
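The 800 GB estimate in this answer can be reproduced with a few lines. This is only a sketch of the arithmetic: the 38-byte overhead and 2 bytes per character are the figures quoted above, the real numbers depend on the JVM and its settings, and the class and method names are invented.

```java
// Hypothetical back-of-the-envelope calculator.
class StringMemoryEstimateSketch {

    // Figures quoted in the answer; treat them as rough assumptions.
    static final long OVERHEAD_BYTES_PER_STRING = 38;
    static final long BYTES_PER_CHAR = 2;

    static long estimateBytes(long wordCount, long charsPerWord) {
        return wordCount * (OVERHEAD_BYTES_PER_STRING + BYTES_PER_CHAR * charsPerWord);
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(20_000_000_000L, 1);
        System.out.println(bytes + " bytes = " + bytes / 1_000_000_000L + " GB");
    }
}
```

Longer words only make the figure larger, which is why storing each distinct word once matters so much.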

First, for a smaller number of words.
Use a more powerful data structure, and do not read all lines into memory at once; read the file line by line instead.
Map<String, Set<String>> mapSortedToWords = new HashMap<>();
Path path = Paths.get(fileName);
try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
for (;;) {
String word = in.readLine();
if (word == null) {
break;
}
String key = sorted(word);
Set<String> words = mapSortedToWords.get(key);
if (words == null) {
words = new TreeSet<String>();
mapSortedToWords.put(key, words);
}
words.add(word);
}
}
for (Set<String> anagrams : mapSortedToWords.values()) {
if (anagrams.size() > 1) {
... anagrams
}
}
static String sorted(String word) {
char[] letters = word.toCharArray();
Arrays.sort(letters);
return new String(letters);
}
This stores a set of words in the map for each key, comparable to your output line abac aabc abca.
For a large number a database where you store (sortedLetters, word) would be better. An embedded database like Derby or H2 poses no installation problems.

For the file size you mention (20 billion words), there are two obvious problems with your code:
List<String> list = Files.readAllLines(Paths.get(fileName));
AND
for (int i = 0; i < list.size(); i++)
These two lines in your program raise two questions:
Do you have enough memory to read the full file in one go?
Is it OK to iterate 20 billion times?
For most systems, the answer to both questions is no.
So your goal is to cut down the memory footprint and reduce the number of iterations.
To do that, read your file chunk by chunk and use some kind of search data structure (like a trie) to store your words.
You will find numerous questions on SO about both topics, such as:
Fastest way to incrementally read a large file
Finding anagrams for a given word
The algorithm above requires that you first create a dictionary of your words.
Anyway, I believe there is no ready-made answer for you. Take a file with one billion words (that is a very difficult task in itself) and see what works and what doesn't; your current code will obviously not work.
Hope it helps!

Use a stream to read the file. That way you are only storing one word at a time.
FileReader file = new FileReader("file.txt"); // character stream
StringBuilder word = new StringBuilder();
int c;
while ((c = file.read()) != -1) { // read() returns -1 at the end of the stream
    if (c != '\n') {
        word.append((char) c);
    } else {
        process(word.toString()); // do whatever you want
        word.setLength(0);
    }
}
if (word.length() > 0) {
    process(word.toString()); // last word when the file has no trailing newline
}
file.close();

Update
You can use a map to find the anagrams, as shown below. For each word, sort its characters to obtain a sorted String; this is the key in your anagrams map. The values for that key are all the words that are anagrams of each other.
public void findAnagrams(String[] yourWords) {
Map<String, List<String>> anagrams = new HashMap<String, List<String>>();
for (String word : yourWords) {
String sortedWord = sortedString(word);
List<String> values = anagrams.get(sortedWord);
if (values == null)
values = new LinkedList<>();
values.add(word);
anagrams.put(sortedWord, values);
}
System.out.println(anagrams);
}
private static String sortedString(String originalWord) {
char[] chars = originalWord.toCharArray();
Arrays.sort(chars);
String sorted = new String(chars);
return sorted;
}
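The same grouping can be written with Collectors.groupingBy, as a sketch assuming Java 8 or later. It keeps only groups with at least two words, and it uses a TreeMap purely to make the order of the groups predictable. The class and method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the original answer.
class AnagramGroupingSketch {

    static String sortedLetters(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Groups words by their sorted letters and keeps only groups with
    // at least two members, i.e. actual anagrams.
    static List<List<String>> anagramGroups(Stream<String> words) {
        Map<String, List<String>> bySortedLetters = words.collect(
                Collectors.groupingBy(AnagramGroupingSketch::sortedLetters,
                        TreeMap::new, Collectors.toList()));
        return bySortedLetters.values().stream()
                .filter(group -> group.size() > 1)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<String> words = Stream.of("abac", "aabc", "hddgfs", "abca", "xyz", "zyx", "yxz");
        for (List<String> group : anagramGroups(words)) {
            System.out.println(String.join(" ", group));
        }
    }
}
```

Passing Files.lines(Paths.get(fileName)) instead of Stream.of(...) would read the words lazily from the file, although the resulting map still has to fit in memory.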

Finding duplicate values in row with Java

So I've a file with these rows
155, 490, 297, 490,
-45, 19, 45, 19,
-24, 80,-12,-69, 80,
12,-92, 28,-40,
I try to read the file and find these rows which contain duplicate elements. But something with my logic is wrong and I can't find the error. Any help ?
Here is the code:
public static void main(String[] args) throws IOException {
Scanner fileInput = null;
try {
fileInput = new Scanner( new File("array_list.csv"));
String line;
while (fileInput.hasNextLine()) {
line = fileInput.nextLine();
String[] lineArr = line.split(",");
// check for missing values
boolean contains = true;
for(int i=0; i<lineArr.length; i++) {
for(int j=0; j<lineArr.length; j++) {
if(lineArr[i]==lineArr[j]) {
contains = false;
break;
}
}
if(!contains) {
// print the row .....
}
else {
contains = true;
// print some thing ...
}
}
}
} finally {
if (null != fileInput) {
fileInput.close();
}
}
}

Since you're comparing strings, you need to use the equals() method:
lineArr[i].equals(lineArr[j])
That being said, there are a few other things I can see which may cause you problems:
Be careful of spaces after commas. The sample data is inconsistent, so it's best to call lineArr[i].trim() to get rid of leading/trailing whitespace.
You should set contains to false initially and try to find a match, then set it to true and break. Then if (contains), print the row.
The way your loops are set up, you will check each element with itself. So of course you will find a duplicate for each row!
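Putting the three points together, a row check might look like the sketch below. It assumes comma-separated values as in the sample file; the class and method names are made up for this example.

```java
// Hypothetical helper combining equals(), trim() and a corrected inner loop.
class RowDuplicateSketch {

    // Returns true when at least two comma-separated values in the row
    // are equal after trimming surrounding whitespace.
    static boolean hasDuplicate(String row) {
        String[] values = row.split(",");
        for (int i = 0; i < values.length; i++) {
            for (int j = i + 1; j < values.length; j++) { // never compare a value with itself
                if (values[i].trim().equals(values[j].trim())) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] rows = { "155, 490, 297, 490,", "12,-92, 28,-40," };
        for (String row : rows) {
            System.out.println(row + " -> " + hasDuplicate(row));
        }
    }
}
```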

The problem that stands out to me immediately is that you are working with Strings, and you are currently using the "==" operator to compare strings on this line:
if(lineArr[i]==lineArr[j]) {
This should instead be:
if(lineArr[i].equals(lineArr[j])) {

Try using
if(lineArr[i].equals(lineArr[j]))
instead of
if(lineArr[i]==lineArr[j])
The equals() method compares the actual content of the Strings, using the underlying Unicode representation, while == compares only the identity of the objects, using their address in memory.
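A short sketch of the difference, using two strings produced by split() as in the question. The class and method names are invented for illustration.

```java
// Hypothetical demo of identity versus content comparison.
class StringEqualitySketch {

    static boolean sameObject(String a, String b) {
        return a == b; // identity: are both references the same object?
    }

    static boolean sameContent(String a, String b) {
        return a.equals(b); // content: do both strings hold the same characters?
    }

    public static void main(String[] args) {
        String[] parts = "19,19".split(","); // two separate String objects
        System.out.println("==     : " + sameObject(parts[0], parts[1]));
        System.out.println("equals : " + sameContent(parts[0], parts[1]));
    }
}
```

String literals in source code are interned, so == can appear to work in small tests with literals and then fail on strings read from a file.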

Put all values into a set and check whether its size equals that of the original array. If it does, all values are unique; otherwise there are duplicates:
while (fileInput.hasNextLine()) {
line = fileInput.nextLine();
List<String> lineArr = Arrays.asList(line.split(","));
if (new HashSet<String>(lineArr).size() != lineArr.size()) {
System.out.println(line);
}
}

You should compare String using equals().
Yet there are other issues in your code. At some point, i and j are equal and thus lineArr[i]==lineArr[j] will always be true.
A simple way for checking duplicates is to use a Set and check its size:
Set<String> lineSet = new HashSet<>(lineArr.length); // initial capacity, not a type argument
for(String s : lineArr) {
lineSet.add(s);
}
if(lineSet.size() < lineArr.length) {
// there are duplicates
}

public static void main(String[] args) throws IOException{
Scanner fileInput = null;
try {
fileInput = new Scanner(new File("array_list.csv"));
String line;
while (fileInput.hasNextLine()) {
line = fileInput.nextLine();
String[] lineArr = line.split(",");
// check for missing values
boolean contains = true;
for(int i=0; i<lineArr.length; i++) {
for(int j=0; j<i; j++) {
if(lineArr[i].equals(lineArr[j])) {
contains = false;
break;
}
}
}
if(!contains) {
System.out.println(line);
}
else {
contains = true;
}
}
} finally {
if (null != fileInput) {
fileInput.close();
}
}
}

Here is my code for finding duplicate elements
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
public class StringManipulation {
public static void main(String[] args) {
StringManipulation manipulation=new StringManipulation();
manipulation.findDuplicateElementList();
//manipulation.findDuplicateElementbyMap();
}
private void findDuplicateElementList() {
String lineData = "ashish manish ashish manish sachin manish ashish neha manish";
String[] list = lineData.split(" ");
List<String> stringList = Arrays.asList(list);
// containingList=stringList;
Set<String> stringSet = new HashSet<String>();
for (int i = 0; i < stringList.size(); i++) {
int count = 0;
String currVal = stringList.get(i);
if (stringSet.contains(currVal)) {
continue;
} else {
for (String string : stringList) {
if (currVal.equals(string)) {
stringSet.add(currVal);
count++;
}
}
}
System.out.println("Occurrences of " + currVal + " " + count);
}
}
}

Keeping track of punctuation, spacing, when editing a file in Java

I'm writing a program to delete duplicate consecutive words from a text file, then replaces that text file without the duplicates. I know that my current code does not handle the case where a duplicate word is at the end of one line, and at the beginning of the next line since I read each line into an ArrayList, find the duplicate, and remove it. After writing it though, I wasn't sure if this was an 'ok' way to do it since now I don't know how to write it back out. I'm not sure how I can keep track of the punctuation for beginning and end of line sentences, as well as the correct spacing, and when there are line returns in the original text file. Is there a way to handle those things (spacing, punctuation, etc) with what I have so far? Or, do I need to do a redesign? The other thing I thought I could do is return an array of what indices of words I need deleted, but then I wasn't sure if that's much better. Anyway, here is my code: (thanks in advance!)
/** Removes consecutive duplicate words from text files.
It accepts only one argument, that argument being a text file
or a directory. It finds all text files in the directory and
its subdirectories and removes duplicate words from those files
as well. It replaces the original file. */
import java.io.*;
import java.util.*;
public class RemoveDuplicates {
public static void main(String[] args) {
if (args.length != 1) {
System.out.println("Program accepts one command-line argument. Exiting!");
System.exit(1);
}
File f = new File(args[0]);
if (!f.exists()) {
System.out.println("Does not exist!");
}
else if (f.isDirectory()) {
System.out.println("is directory");
}
else if (f.isFile()) {
System.out.println("is file");
String fileName = f.toString();
RemoveDuplicates dup = new RemoveDuplicates(f);
dup.showTextFile();
List<String> noDuplicates = dup.doDeleteDuplicates();
showTextFile(noDuplicates);
//writeOutputFile(fileName, noDuplicates);
}
else {
System.out.println("Shouldn't happen");
}
}
/** Reads in each line of the passed in .txt file into the lineOfWords array. */
public RemoveDuplicates(File fin) {
lineOfWords = new ArrayList<String>();
// try-with-resources closes the reader even if readLine() throws
try (BufferedReader in = new BufferedReader(new FileReader(fin))) {
for (String s = null; (s = in.readLine()) != null; ) {
lineOfWords.add(s);
}
}
catch (IOException e) {
e.printStackTrace();
}
}
public void showTextFile() {
for (String s : lineOfWords) {
System.out.println(s);
}
}
public static void showTextFile(List<String> list) {
for (String s : list) {
System.out.print(s);
}
}
public List<String> doDeleteDuplicates() {
List<String> noDup = new ArrayList<String>(); // List to be returned without duplicates
// go through each line and split each word into end string array
for (String s : lineOfWords) {
String endString[] = s.split("[\\s+\\p{Punct}]");
// add each word to the arraylist
for (String word : endString) {
noDup.add(word);
}
}
for (int i = 0; i < noDup.size() - 1; i++) {
if (noDup.get(i).toUpperCase().equals(noDup.get(i + 1).toUpperCase())) {
System.out.println("Removing: " + noDup.get(i+1));
noDup.remove(i + 1);
i--;
}
}
return noDup;
}
public static void writeOutputFile(String fileName, List<String> newData) {
try {
PrintWriter outputFile = new PrintWriter(new BufferedWriter(new FileWriter(fileName)));
for (String str : newData) {
outputFile.print(str + " ");
}
outputFile.close();
}
catch (IOException e) {
e.printStackTrace();
}
}
private List<String> lineOfWords;
}
An example.txt:
Hello hello this is a test test in order
order to see if it deletes duplicates Duplicates words.

How about something like this? In this case, I assume it is case insensitive.
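// Needs java.util.regex.Pattern and java.util.regex.Matcher.
// The \\b anchors keep the match on whole words, so "this is" is not mistaken for a repeated "is".
Pattern p = Pattern.compile("\\b(\\w+) \\1\\b");
String line = "Hello hello this is a test test in order\norder to see if it deletes duplicates Duplicates words.";
// Match on an upper-cased copy for a case-insensitive comparison; the indices still line up with the original.
Matcher m = p.matcher(line.toUpperCase());
StringBuilder sb = new StringBuilder(1000);
int idx = 0;
while (m.find()) {
    // Keep everything up to the end of the first word and skip its repeat.
    sb.append(line.substring(idx, m.end(1)));
    idx = m.end();
}
sb.append(line.substring(idx));
System.out.println(sb.toString());
Here's the output:
Hello this is a test in order
order to see if it deletes duplicates words.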
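
The snippet above matches a single space between the two words, so a duplicate split across a line break (order / order in the example) is left alone. If the whole file is treated as one string, a variant of the same idea can cover that case while leaving all other punctuation and spacing untouched. The following is only a sketch under that assumption; the class name ConsecutiveDuplicates and the method collapse are hypothetical, and dropping the line break together with the repeated word is a design choice, not something the question specifies.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
class ConsecutiveDuplicates {
    // A whole word followed by one or more repeats of itself, separated by
    // any whitespace (spaces, tabs, or line breaks), compared case-insensitively.
    private static final Pattern REPEATED_WORD =
            Pattern.compile("\\b(\\w+)(?:\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    // Replaces each run of repeated words with its first occurrence.
    // Text outside the matched runs is copied through unchanged.
    static String collapse(String text) {
        return REPEATED_WORD.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "Hello hello this is a test test in order\n"
                + "order to see if it deletes duplicates Duplicates words.";
        System.out.println(collapse(text));
    }
}
```

Because only the matched runs are rewritten, there is no need to split the text into words and reassemble it afterwards. To apply this to a file, the contents would first have to be read into a single String (for example with java.nio.file.Files.readString on Java 11 or later) and the result written back; that step is left out here.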
