I have a program written in Java that reads a file that is simply a list of strings into a LinkedHashMap. Then it takes a second file, which consists of two columns, and for each row checks whether the right-hand term matches one of the terms in the map. The problem is that it's running very slowly.
Here's a code snippet; this is where it compares the second file to the map terms:
String output = "";
infile = new File("2columns.txt");
try {
    in = new BufferedReader(new FileReader(infile));
} catch (FileNotFoundException e2) {
    System.out.println("2columns.txt" + " not found");
}
try {
    fw = new FileWriter("newfile.txt");
    out = new PrintWriter(fw);
    try {
        String str = in.readLine();
        while (str != null) {
            StringTokenizer strtok = new StringTokenizer(str);
            strtok.nextToken();
            String strDest = strtok.nextToken();
            System.out.println("Term = " + strDest);
            //if (uniqList.contains(strDest)) {
            if (uniqMap.get(strDest) != null) {
                output += str + "\r\n";
                System.out.println("Matched! Added: " + str);
            }
            str = in.readLine();
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    out.print(output);
I initially got a performance boost from switching from an ArrayList to the LinkedHashMap, but it's still taking a long time. What can I do to speed this up?
Your major bottleneck may be that you are recreating a StringTokenizer for every iteration of the while loop. Moving this outside the loop could help considerably. Minor speed ups can be obtained by moving the String definition outside the while loop.
The biggest speedup will probably come from using a StreamTokenizer. See below for an example.
Oh, and use a HashMap instead of a LinkedHashMap, as @Doug Ayers says in the comments above :)
And @MДΓΓ БДLL's suggestion of profiling your code is bang on. Check out this Eclipse Profiling Example.
Reader r = new BufferedReader(new FileReader(infile));
StreamTokenizer strtok = new StreamTokenizer(r);
String strDest = "";
while (strtok.nextToken() != StreamTokenizer.TT_EOF) {
    strDest = strtok.sval; // left-hand column; strtok.toString() might be safer, but slower
    strtok.nextToken();    // right-hand column
    System.out.println("Term = " + strtok.sval);
    //if (uniqList.contains(strDest)) {
    if (uniqMap.get(strtok.sval) != null) {
        output += strDest + " " + strtok.sval + "\r\n"; // rebuild the row; no readLine() needed here
        System.out.println("Matched! Added: " + strDest + " " + strtok.sval);
    }
}
One final thought (and I'm not confident on this one) is that writing to the file may also be faster if you do it in one go at the end, i.e. store all your matches in a buffer of some sort and do the writing in one hit.
StringTokenizer is a legacy class. The recommended replacement is the string "split" method.
Some of the trys might be consolidated. You can have multiple catches for a single try.
The suggestion to use HashMap instead of LinkedHashMap is a good one. Performance for gets and puts is a smidgen faster since there is no need to maintain a list structure.
The "output" string should be a StringBuilder rather than a String. That could help a lot.
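Putting those suggestions together, a rough sketch might look like the following (it reuses the variable and file names from the question, assumes uniqMap is the map built from the first file, and uses try-with-resources to consolidate the trys and close the streams):

StringBuilder output = new StringBuilder();
try (BufferedReader in = new BufferedReader(new FileReader("2columns.txt"));
     PrintWriter out = new PrintWriter(new FileWriter("newfile.txt"))) {
    String str;
    while ((str = in.readLine()) != null) {
        String[] cols = str.trim().split("\\s+");        // split() instead of StringTokenizer
        if (cols.length > 1 && uniqMap.containsKey(cols[1])) {
            output.append(str).append("\r\n");           // StringBuilder instead of String +=
        }
    }
    out.print(output);
} catch (FileNotFoundException e) {
    System.out.println("2columns.txt not found");
} catch (IOException e) {
    e.printStackTrace();
}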
Hopefully my explanation does me some justice. I am pretty new to Java. I have a text file that looks like this:
Java
The Java Tutorials
http://docs.oracle.com/javase/tutorial/
Python
Tutorialspoint Java tutorials
http://www.tutorialspoint.com/python/
Perl
Tutorialspoint Perl tutorials
http://www.tutorialspoint.com/perl/
I have properties for language name, website description, and website url. Right now, I just want to list the information from the text file exactly how it looks, but I need to assign those properties to them.
The problem I am getting is "index 1 is out of bounds for length 1"
try {
    BufferedReader in = new BufferedReader(new FileReader("Tutorials.txt"));
    while (in.readLine() != null) {
        TutorialWebsite tw = new TutorialWebsite();
        str = in.readLine();
        String[] fields = str.split("\\r?\\n");
        tw.setProgramLanguage(fields[0]);
        tw.setWebDescription(fields[1]);
        tw.setWebURL(fields[2]);
        System.out.println(tw);
    }
} catch (IOException e) {
    e.printStackTrace();
}
I wanted to test something, so I removed the newlines and put commas instead and used str.split(","), which printed it out just fine, but I'm sure I would get points taken off if I changed the format.
readLine returns a "string containing the contents of the line, not including any line-termination characters", so why are you trying to split each line on "\\r?\\n"?
Where is str declared? Why are you reading two lines for each iteration of the loop, and ignoring the first one?
I suggest you start from
String str;
while ((str = in.readLine()) != null) {
    System.out.println(str);
}
and work from there.
The first readline gets the language, the second gets the description, and the third gets the url, and then the pattern repeats. There is nothing to stop you using readline three times for each iteration of the while loop.
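For example, a minimal sketch (reusing the setters from your code and assuming every record is exactly three lines):

try (BufferedReader in = new BufferedReader(new FileReader("Tutorials.txt"))) {
    String language;
    while ((language = in.readLine()) != null) {
        String description = in.readLine();
        String url = in.readLine();
        if (description == null || url == null) {
            break;                               // incomplete record at the end of the file
        }
        TutorialWebsite tw = new TutorialWebsite();
        tw.setProgramLanguage(language);
        tw.setWebDescription(description);
        tw.setWebURL(url);
        System.out.println(tw);
    }
} catch (IOException e) {
    e.printStackTrace();
}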
You can read the whole file into a String like this:
// str will hold all the file contents
StringBuilder str = new StringBuilder();
// try with resources, to make sure BufferedReader is closed safely
try (BufferedReader in = new BufferedReader(new FileReader("Tutorials.txt"))) {
    String line;
    while ((line = in.readLine()) != null) {
        str.append(line);
        str.append("\n");
    }
} catch (IOException e) {
    e.printStackTrace();
}
Later you can split the string with
String[] fields = str.toString().split("[\\n\\r]+");
Why not try it like this?
Allocate a List to hold the TutorialWebsite instances.
Use try-with-resources to open the file, read the lines, and trim any whitespace.
Put the lines in an array.
Then iterate over the array, filling in the class instance.
Then print the list.
The loop ensures the array length is a multiple of nFields, discarding any remainder. So if your total lines are not divisible by nFields you will not read the remainder of the file. You would still have to adjust the setters if additional fields were added.
int nFields = 3;
List<TutorialWebsite> list = new ArrayList<>();
try (BufferedReader in = new BufferedReader(new FileReader("tutorials.txt"))) {
    String[] lines = in.lines().map(String::trim).toArray(String[]::new);
    for (int i = 0; i < (lines.length / nFields) * nFields; i += nFields) {
        TutorialWebsite tw = new TutorialWebsite();
        tw.setProgramLanguage(lines[i]);
        tw.setWebDescription(lines[i + 1]);
        tw.setWebURL(lines[i + 2]);
        list.add(tw);
    }
} catch (IOException ioe) {
    ioe.printStackTrace();
}
list.forEach(System.out::println);
An improvement would be to use a constructor and pass the strings to it when each instance is created.
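For example, assuming you add a matching constructor TutorialWebsite(String language, String description, String url), the loop body shrinks to:

list.add(new TutorialWebsite(lines[i], lines[i + 1], lines[i + 2]));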
And remember the file name as specified is relative to the directory in which the program is run.
Following this answer -->
How do I sort very large files
I need only the merge step for N already sorted files on disk.
I want to merge them into one big file. My limitation is memory: not more than K lines in memory (K < N), so I cannot fetch them all and then sort. Java is preferred.
So far I have tried the code below, but I need a good way to iterate over all N files line by line (with no more than K lines in memory) and store the sorted final file to disk.
public void run() {
    try {
        System.out.println(file1 + " Started Merging " + file2);
        FileReader fileReader1 = new FileReader(file1);
        FileReader fileReader2 = new FileReader(file2);
        //......TODO with N ?? ......
        FileWriter writer = new FileWriter(file3);
        BufferedReader bufferedReader1 = new BufferedReader(fileReader1);
        BufferedReader bufferedReader2 = new BufferedReader(fileReader2);
        String line1 = bufferedReader1.readLine();
        String line2 = bufferedReader2.readLine();
        //Merge 2 files based on which string is greater.
        while (line1 != null || line2 != null) {
            if (line1 == null || (line2 != null && line1.compareTo(line2) > 0)) {
                writer.write(line2 + "\r\n");
                line2 = bufferedReader2.readLine();
            } else {
                writer.write(line1 + "\r\n");
                line1 = bufferedReader1.readLine();
            }
        }
        System.out.println(file1 + " Done Merging " + file2);
        new File(file1).delete();
        new File(file2).delete();
        writer.close();
    } catch (Exception e) {
        System.out.println(e);
    }
}
regards,
You can use something like this
public static void mergeFiles(String target, String... input) throws IOException {
    String lineBreak = System.getProperty("line.separator");
    PriorityQueue<Map.Entry<String,BufferedReader>> lines
        = new PriorityQueue<>(Map.Entry.comparingByKey());
    try(FileWriter fw = new FileWriter(target)) {
        String header = null;
        for(String file: input) {
            BufferedReader br = new BufferedReader(new FileReader(file));
            String line = br.readLine();
            if(line == null) br.close();
            else {
                if(header == null) fw.append(header = line).write(lineBreak);
                line = br.readLine();
                if(line != null) lines.add(new AbstractMap.SimpleImmutableEntry<>(line, br));
                else br.close();
            }
        }
        for(;;) {
            Map.Entry<String, BufferedReader> next = lines.poll();
            if(next == null) break;
            fw.append(next.getKey()).write(lineBreak);
            final BufferedReader br = next.getValue();
            String line = br.readLine();
            if(line != null) lines.add(new AbstractMap.SimpleImmutableEntry<>(line, br));
            else br.close();
        }
    }
    catch(Throwable t) {
        for(Map.Entry<String,BufferedReader> br: lines) try {
            br.getValue().close();
        } catch(Throwable next) {
            if(t != next) t.addSuppressed(next);
        }
        throw t; // rethrow after the remaining readers have been closed
    }
}
Note that this code, unlike the code in your question, handles a header line: the first line of each input file is treated as a header, and only the first file's header is written to the output. Also, unlike your original code, it does not delete the input files; if you want that behaviour, delete each file right after its reader has been closed.
It holds exactly as many lines in memory as you have files.
While in principle it is possible to hold fewer line strings in memory and re-read them when needed, it would be a performance disaster for questionable little savings. For example, you already have N strings in memory when calling this method, simply because you have N file names.
However, when you want to reduce the number of lines held at the same time, at all costs, you can simply use the method shown in your question. Merge the first two files into a temporary file, merge that temporary file with the third to another temporary file, and so on, until merging the temporary file with the last input file to the final result. Then you have at most two line strings in memory (K == 2), saving less memory than the operating system will use for buffering, trying to mitigate the horrible performance of this approach.
Likewise, you can use the method shown above to merge K files into a temporary file, then merge the temporary file with the next K-1 files, and so on, until the temporary file is merged with the remaining K-1 or fewer files into the final result, giving a memory consumption that scales with K < N. This approach allows you to tune K to a reasonable ratio to N, trading memory for speed. I think in most practical cases K == N will work just fine.
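As a rough sketch of that chunked variant, a hypothetical driver could reuse the mergeFiles method above (it inherits its header handling; temporary-file clean-up is omitted and k is assumed to be at least 2):

public static void mergeWithLimit(String target, List<String> inputs, int k) throws IOException {
    if (inputs.size() <= k) {                        // one pass is enough
        mergeFiles(target, inputs.toArray(new String[0]));
        return;
    }
    // merge the first K inputs into a temporary file
    String temp = File.createTempFile("merge", ".txt").getPath();
    mergeFiles(temp, inputs.subList(0, k).toArray(new String[0]));
    int pos = k;
    while (pos < inputs.size()) {
        int end = Math.min(pos + k - 1, inputs.size());
        List<String> batch = new ArrayList<>();
        batch.add(temp);                             // the previous partial result counts toward K
        batch.addAll(inputs.subList(pos, end));
        boolean last = end == inputs.size();
        String out = last ? target : File.createTempFile("merge", ".txt").getPath();
        mergeFiles(out, batch.toArray(new String[0]));
        temp = out;
        pos = end;
    }
}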
@Holger gave a nice answer, assuming that K >= N.
You can extend it to the K < N case by using the mark(int) and reset() methods of BufferedInputStream.
The parameter of mark is how many bytes a single line can have.
The idea is as follows:
Instead of putting all N lines in the TreeSet, you keep only K of them. Whenever you put a new line into the set and it is already 'full', you evict the smallest one from it. Additionally, you reset the stream it came from, so that when you read it again the same data can pop up.
You have to keep track of the maximum line not kept in the TreeSet; let's call it the lower bound. Once there are no elements in the TreeSet greater than the maintained lower bound, you scan all the files once again and repopulate the set.
I'm not sure if this approach is optimal, but should be ok.
Moreover, you have to be aware that BufferedInputStream has an internal buffer at least the size of a single line, so that will consume a lot of your memory; perhaps it would be better to maintain buffering on your own.
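For illustration, the primitive this idea relies on looks like this (shown here on a BufferedReader, which has the same mark/reset methods; the readAheadLimit and file name are just assumptions):

BufferedReader br = new BufferedReader(new FileReader("part0.txt")); // hypothetical chunk file
br.mark(8192);                 // readAheadLimit: assumed upper bound on the length of one line
String first = br.readLine();
br.reset();                    // rewind, so the same line can be read again later
String again = br.readLine();  // equal to first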
I am writing an external sort to sort a large 2 GB file on disk.
I first split the file into chunks that fit into memory, sort each one individually, and rewrite them back to disk. However, during this process I am getting a GC overhead limit exceeded error in String.split in the function getModel. Below is my code.
private static List<Model> getModel(String file, long lineCount, final long readSize) {
    List<Model> modelList = new ArrayList<Model>();
    long read = 0L;
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
        //Skip lineCount lines;
        for (long i = 0; i < lineCount; i++)
            br.readLine();
        String line = "";
        while ((line = br.readLine()) != null) {
            read += line.length();
            if (read > readSize)
                break;
            String[] split = line.split("\t");
            String curvature = (split.length >= 7) ? split[6] : "";
            String heading = (split.length >= 8) ? split[7] : "";
            String slope = (split.length == 9) ? split[8] : "";
            modelList.add(new Model(split[0], split[1], split[2], split[3], split[4], split[5], curvature, heading, slope));
        }
        br.close();
        return modelList;
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return null;
}
private static void split(String inputDir, String inputFile, String outputDir, final long readSize) throws IOException {
    long lineCount = 0L;
    int count = 0;
    int writeSize = 100000;
    System.out.println("Reading...");
    List<Model> curModel = getModel(inputDir + inputFile, lineCount, readSize);
    System.out.println("Reading Complete");
    while (curModel.size() > 0) {
        lineCount += curModel.size();
        System.out.println("Sorting...");
        curModel.sort(new Comparator<Model>() {
            @Override
            public int compare(Model arg0, Model arg1) {
                return arg0.compareTo(arg1);
            }
        });
        System.out.println("Sorting Complete");
        System.out.println("Writing...");
        writeFile(curModel, outputDir + inputFile + count, writeSize);
        System.out.println("Writing Complete");
        count++;
        System.out.println("Reading...");
        curModel = getModel(inputDir + inputFile, lineCount, readSize);
        System.out.println("Reading Complete");
    }
}
It makes it through one pass and sorts ~250 MB of data from the file. However, on the second pass it throws the GC overhead exception in String.split. I do not want to use external libraries; I want to learn this on my own. The sorting and splitting work, but I cannot understand why the GC is throwing the overhead exception in String.split.
I'm not sure just what is causing the exception--manipulating large strings, in particular cutting and splicing them, is a huge memory/gc issue. StringBuilder can help, but in general you may have to take more direct control over the process.
To figure out more you probably want to run a profiler with your app. There is one built into the JDK (VisualVM) that is functional. It will show you what objects Java is holding on to... because of the nature of strings it's possible that you are holding onto a lot of redundant character array data.
Personally I'd try a completely different approach, for instance, what if you sorted the entire file in memory by loading the first 10(?) sortable characters of each line into an array along with the file location they were read from, sort the array and resolve any ties by loading more (the rest?) of those lines that were identical.
If you did something like that then you should be able to seek to each line and copy it to the destination file without ever caching more than one line in memory and only reading through the source file twice.
I suppose you could manufacture a file that would defeat this if all the strings were identical until the last couple of characters, so if that ever became an issue you might have to be able to flush the full strings you've cached (Java's SoftReference is designed to let the GC do this for you automatically; it's not particularly hard).
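A hypothetical sketch of that scheme (class and method names are made up; it assumes single-byte-per-character text, since RandomAccessFile.readLine decodes bytes that way, and it skips the soft-reference caching):

import java.io.*;
import java.util.*;

public class PrefixKeySort {
    public static void sort(String src, String dst) throws IOException {
        List<String> keys = new ArrayList<>();     // first 10 chars of each line
        List<Long> offsets = new ArrayList<>();    // byte offset of each line
        try (RandomAccessFile raf = new RandomAccessFile(src, "r")) {
            long pos = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                keys.add(line.substring(0, Math.min(10, line.length())));
                offsets.add(pos);
                pos = raf.getFilePointer();
            }
        }
        Integer[] order = new Integer[keys.size()];
        for (int i = 0; i < order.length; i++) order[i] = i;
        try (RandomAccessFile raf = new RandomAccessFile(src, "r");
             BufferedWriter out = new BufferedWriter(new FileWriter(dst))) {
            Arrays.sort(order, (a, b) -> {
                int c = keys.get(a).compareTo(keys.get(b));
                if (c != 0) return c;
                try {   // tie on the prefix: re-read and compare the full lines
                    raf.seek(offsets.get(a));
                    String la = raf.readLine();
                    raf.seek(offsets.get(b));
                    return la.compareTo(raf.readLine());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            for (int i : order) {                  // copy the lines out in sorted order
                raf.seek(offsets.get(i));
                out.write(raf.readLine());
                out.newLine();
            }
        }
    }
}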
Based on how I read your implementation, readSize only makes sure that you get the first block of the given size. You are not reading the 2nd or 3rd block. Hence it's not actually a complete external sort.
read += line.length();
if (read > readSize)
break;
String[] split = line.split("\t");
Even though you are splitting each line, you seem to be using only the first 9 fields, and then checking the number of fields in each line. This means your data is not uniform.
Hey guys, I have written this code for searching for a string in a txt file.
Is it possible to optimize the code so that it searches for the string in the fastest manner possible?
Assume the text file will be a large one (500 MB - 1 GB).
I don't want to use pattern matchers.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

public class StringFinder {
    public static void main(String[] args)
    {
        double count = 0, countBuffer = 0, countLine = 0;
        String lineNumber = "";
        String filePath = "C:\\Users\\allen\\Desktop\\TestText.txt";
        BufferedReader br;
        String inputSearch = "are";
        String line = "";
        try {
            br = new BufferedReader(new FileReader(filePath));
            try {
                while ((line = br.readLine()) != null)
                {
                    countLine++;
                    //System.out.println(line);
                    String[] words = line.split(" ");
                    for (String word : words) {
                        if (word.equals(inputSearch)) {
                            count++;
                            countBuffer++;
                        }
                    }
                    if (countBuffer > 0)
                    {
                        countBuffer = 0;
                        lineNumber += countLine + ",";
                    }
                }
                br.close();
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        System.out.println("Times found at--" + count);
        System.out.println("Word found at--" + lineNumber);
    }
}
There are fast string search algorithms, but a big part of the time will go into reading the file from external storage. If you can index the file ahead of time, you can save reading and scanning the entire file. If you can't, perhaps you can at least avoid reading the file from external storage, e.g. if the file came in from the network, then search it before or instead of writing it to storage.
Try Matcher.find; splitting is slow since it creates a lot of objects.
If you don't want to use Matcher.find for some reason, then at least go for using indexOf.
You can check on the whole line without breaking up the line into a lot of String Objects which then need iterating over.
int index = line.indexOf(inputSearch);
while (index != -1)
{
    count++;
    countBuffer++;
    index = line.indexOf(inputSearch, index + 1);
}
For a plain string, i.e., not a regex, and if you can't index the file first using some sophisticated engine (Lucene or Solr come to mind for such a large file) or database (?), you should check out the Rabin-Karp algorithm. It's a very clever algorithm that finds a simple string match in O(n+m) where n is the length of the text and m the length of the search string.
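For reference, a minimal Rabin-Karp sketch (a hypothetical helper that counts occurrences of pattern in a text string such as one line, or the whole file read into memory; the hash base and modulus are arbitrary choices):

static int countOccurrences(String text, String pattern) {
    int n = text.length(), m = pattern.length();
    if (m == 0 || n < m) return 0;
    final long BASE = 256, MOD = 1_000_000_007L;
    long patHash = 0, winHash = 0, pow = 1;            // pow = BASE^(m-1) % MOD
    for (int i = 0; i < m - 1; i++) pow = (pow * BASE) % MOD;
    for (int i = 0; i < m; i++) {
        patHash = (patHash * BASE + pattern.charAt(i)) % MOD;
        winHash = (winHash * BASE + text.charAt(i)) % MOD;
    }
    int count = 0;
    for (int i = 0; ; i++) {
        // verify on a hash hit to rule out collisions
        if (patHash == winHash && text.regionMatches(i, pattern, 0, m)) count++;
        if (i + m >= n) break;
        winHash = (winHash - text.charAt(i) * pow % MOD + MOD) % MOD;  // drop the left char
        winHash = (winHash * BASE + text.charAt(i + m)) % MOD;         // add the right char
    }
    return count;
}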
Your bottleneck may not be the time it takes to parse each line at all, but reading the actual file. Disk I/O is at least an order of magnitude slower than iterating through a char array. But you really won't know until you profile your code. Fire up VisualVM and use it to figure out where you are spending the most time. If you don't, you're just guessing.
I'm reading numbers from a txt file using BufferedReader for analysis. The way I'm going about this now is reading a line using readLine(), then splitting this string into an array of strings using split().
public InputFile() {
    fileIn = null;
    //stuff here
    fileIn = new FileReader((filename + ".txt"));
    buffIn = new BufferedReader(fileIn);
    return;
    //stuff here
}

public String ReadBigStringIn() {
    String line = null;
    try { line = buffIn.readLine(); }
    catch (IOException e) {};
    return line;
}

public ProcessMain() {
    initComponents();
    String[] stringArray;
    String line;
    try {
        InputFile stringIn = new InputFile();
        line = stringIn.ReadBigStringIn();
        stringArray = line.split("[^0-9.+Ee-]+");
        // analysis etc.
    }
}
This works fine, but what if the txt file has multiple lines of text? Is there a way to output a single long string, or perhaps another way of doing it? Maybe use while (buffIn.readLine() != null) {}? Not sure how to implement this.
Ideas appreciated,
thanks.
You are right, a loop would be needed here.
The usual idiom (using only plain Java) is something like this:
public String ReadBigStringIn(BufferedReader buffIn) throws IOException {
    StringBuilder everything = new StringBuilder();
    String line;
    while ((line = buffIn.readLine()) != null) {
        everything.append(line);
    }
    return everything.toString();
}
This removes the line breaks - if you want to retain them, don't use the readLine() method, but simply read into a char[] instead (and append this to your StringBuilder).
Please note that this loop will run until the stream ends (and will block if it doesn't end), so if you need a different condition to finish the loop, implement it in there.
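For instance, a sketch of that char[] variant (hypothetical helper name; it keeps the line breaks):

public String readAll(Reader reader) throws IOException {
    StringBuilder everything = new StringBuilder();
    char[] buf = new char[8192];
    int n;
    while ((n = reader.read(buf)) != -1) {
        everything.append(buf, 0, n);    // append exactly the characters that were read
    }
    return everything.toString();
}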
I would strongly advise using a library here, but since Java 8 you can also do this using streams.
try (InputStreamReader in = new InputStreamReader(System.in);
     BufferedReader buffer = new BufferedReader(in)) {
    final String fileAsText = buffer.lines().collect(Collectors.joining());
    System.out.println(fileAsText);
} catch (Exception e) {
    e.printStackTrace();
}
Note also that this is pretty efficient, as joining uses a StringBuilder internally. (Pass a separator, e.g. Collectors.joining("\n"), if you want to keep the line breaks.)
If you just want to read the entirety of a file into a string, I suggest you use Guava's Files class:
String text = Files.toString(new File("filename.txt"), Charsets.UTF_8);
Of course, that's assuming you want to maintain the linebreaks. If you want to remove the linebreaks, you could either load it that way and then use String.replace, or you could use Guava again:
List<String> lines = Files.readLines(new File("filename.txt"), Charsets.UTF_8);
String joined = Joiner.on("").join(lines);
Sounds like you want Apache IO FileUtils
String text = FileUtils.readFileToString(new File(filename + ".txt"));
String[] stringArray = text.split("[^0-9.+Ee-]+");
If you create a StringBuilder, then you can append every line to it, and return the String using toString() at the end.
You can replace your ReadBigStringIn() with
public String ReadBigStringIn() {
    StringBuilder b = new StringBuilder();
    try {
        String line = buffIn.readLine();
        while (line != null) {
            b.append(line);
            line = buffIn.readLine();
        }
    }
    catch (IOException e) {};
    return b.toString();
}
You have a file containing doubles. Looks like you have more than one number per line, and may have multiple lines.
Simplest thing to do is read lines in a while loop.
You could return null from your ReadBigStringIn method when last line is reached and terminate your loop there.
But more normal would be to create and use the reader in one method. Perhaps you could change to a method which reads the file and returns an array or list of doubles.
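A sketch of that idea (hypothetical method name; it reuses the number pattern from your code and assumes the tokens are well-formed numbers):

public static List<Double> readDoubles(String filename) throws IOException {
    List<Double> values = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(new FileReader(filename + ".txt"))) {
        String line;
        while ((line = in.readLine()) != null) {
            for (String token : line.split("[^0-9.+Ee-]+")) {   // same pattern as in the question
                if (!token.isEmpty()) {
                    values.add(Double.parseDouble(token));
                }
            }
        }
    }
    return values;
}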
BTW, could you simply split your strings by whitespace?
Reading a whole file into a single String may suit your particular case, but be aware that it could cause a memory explosion if your file was very large. Streaming approach is generally safer for such i/o.
This creates one long string; every line is separated by " " (one space):
public String ReadBigStringIn() {
    StringBuffer line = new StringBuffer();
    try {
        while (buffIn.ready()) {
            line.append(" " + buffIn.readLine());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return line.toString();
}