Writing huge string data to file in java, optimization options

I have a chat-like desktop Java Swing app in which I keep receiving String data. Over time the String variable grows larger and larger.
1) Is it wise to keep the large variable in memory and save it to disk only when the logging is finished?
2) If not, should I save every time I get a new string (about 30-40 characters long)?
How should I go about optimizing such a design?

I would use a buffered writer such as PrintWriter. It buffers the data for you and writes to the file whenever the buffer fills, which by default is every 8 KB (actually every 8192 characters). If you want to write more often you can call flush() or use a smaller buffer.
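// PrintWriter(String) buffers internally; try-with-resources closes (and flushes) it.
try (PrintWriter pw = new PrintWriter("my.log")) {
    // Actually writes to the OS only about 5 times (1000 * 40 / 8192).
    for (int i = 0; i < 1000; i++) {
        pw.printf("%39d%n", i); // a 39-character number plus a line separator
    }
    pw.flush(); // optional here: pushes buffered data out before close()
}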
or you can use
pw.println(lineOfText);
BTW: If you want to know what a really huge file looks like ;) This example writes an 8 TB file http://vanillajava.blogspot.com/2011/12/using-memory-mapped-file-for-huge.html
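To make the buffered-writer suggestion concrete for the chat case, here is a minimal sketch, not a definitive implementation. The class name ChatLog and its methods are hypothetical; it assumes one message per line, UTF-8 output, and that the log should be appended to rather than overwritten.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: appends each incoming chat message to a log file
// instead of accumulating everything in one ever-growing String.
class ChatLog implements AutoCloseable {
    private final BufferedWriter out;

    ChatLog(Path file) throws IOException {
        // CREATE + APPEND: create the file if needed, otherwise add to its end.
        out = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Buffered in memory; reaches the disk when the buffer fills, on flush(), or on close().
    void append(String message) throws IOException {
        out.write(message);
        out.newLine();
    }

    // Call this if recent messages must survive a crash.
    void flush() throws IOException {
        out.flush();
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

Typical use would be to create one ChatLog when the session starts, call append(message) for every incoming string, and close it when the session ends. With messages of 30-40 characters, the buffer keeps the number of real disk writes small.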

Perhaps you should use a StringBuilder. Append each new message to it, and at the end convert it to a string.
For example,
StringBuilder sb = new StringBuilder();
// Do your code that continuously adds new messages/strings.
sb.append(new_string);
// Then once you are done...
String result = sb.toString();
If you instead kept a String, say String message, and did message += new_string every time a new message arrived, each append would copy the whole string built so far, which costs more time and creates more garbage.
As Viruzzo suggested, keep only so much in memory and discard the earlier strings at some point. Don't hold on to every message forever.

Related

Java: What's the most efficient way to read relatively large txt files and store its data?

I was supposed to write a method that reads a DNA sequence in order to test some string matching algorithms on it.
I took some existing code I use to read text files (don't really know any others):
try {
FileReader fr = new FileReader(file);
BufferedReader br = new BufferedReader(fr);
while((line = br.readLine()) != null) {
seq += line;
}
br.close();
}
catch(FileNotFoundException e) { e.printStackTrace(); }
catch(IOException e) { e.printStackTrace(); }
This seems to work just fine for small text files with ~3000 characters, but it takes forever (I just cancelled it after 10 minutes) to read files containing more than 45 million characters.
Is there a more efficient way of doing this?

One thing I notice is that you are doing seq += line. seq is probably a String? If so, remember that strings are immutable, so you are in fact creating a new String each time you append a line to it. Please use StringBuilder instead. Also, if possible, avoid building the whole string first and processing it afterwards, because that means going over the data twice. Ideally you would process as you read, but I don't know your situation.

The main thing slowing you down is the "concatenation" of the Strings seq and line when you call seq += line. I put concatenation in quotes because in Java, Strings cannot be modified once they are created (i.e. they are immutable, as user1598503 mentioned). Initially this is not an issue, as the Strings are small. However, once they become very long, e.g. hundreds of thousands of characters, each += must allocate a new String and copy everything accumulated so far into it, which takes quite a bit of time. StringBuilder lets you append to a growable buffer instead, so you are not creating a new object on every append.
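The StringBuilder suggestion can be sketched as follows. The class and method names are made up for illustration, and the sketch assumes the sequence file is plain ASCII and that line breaks should be dropped, as in the original loop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper illustrating the StringBuilder approach.
class SequenceReader {
    // Reads the whole file into one String, joining lines without separators.
    static String readAll(Path file) throws IOException {
        StringBuilder seq = new StringBuilder();
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.US_ASCII)) {
            String line;
            while ((line = br.readLine()) != null) {
                seq.append(line); // amortized constant time per character, no full copy
            }
        }
        return seq.toString(); // one final copy into an immutable String
    }
}
```

If the approximate file size is known in advance, passing it to the StringBuilder constructor as the initial capacity avoids the intermediate buffer growth.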

Your problem is not that the reading takes too much time, but that the concatenation does. To verify this I ran your code (it didn't finish), then simply commented out the seq += line statement, and it ran in under a second. You could try seq = seq.concat(line), since it has been reported to be quite a bit faster most of the time, but I tried that too and it still did not finish in under 1-2 minutes (for a 9.6 MB input file). My solution would be to store your lines in an ArrayList (or a container of your choice), so that the body of your while loop becomes list.add(line);. The ArrayList version ran in about 2-3 seconds with the same input file. If you really, really want to store your entire file in a single string, you could do something like this (using the Scanner class):
String content = new Scanner(new File("input")).useDelimiter("\\Z").next();
^^This works in a matter of seconds as well. I should mention that "\Z" matches the end of the input, which is why it reads the whole thing in one go.

goto file line number in Java

I want to know how to jump directly to a particular line number of a text file in Java.
One method is this:
int line=0;
BufferedReader read=new BufferedReader(new FileReader(Filename));
while(read.readLine()!=null){
line++;
if(line==LIMIT) break;
}
But this will create a lot of String objects which won't be freed until the GC runs.
Please provide a solution that is fast and doesn't consume a lot of memory.
PS: I am reading from a file that has millions of lines.

Let's assume that the text file has variable-length lines, and that you haven't preprocessed it to create an index. (Otherwise, it should be possible to predetermine the position of the Nth line, and then "seek" to it.)
First observation is that (with the above assumptions), it is not possible to find the Nth line without examining every character before the start of the Nth line.
But you can still do this in a way that doesn't generate lots of garbage. Here's a simple version:
BufferedReader br = new BufferedReader(new FileReader(filename));
int ch;
// Skip the first LIMIT - 1 lines one character at a time, creating no Strings.
for (int i = 1; i < LIMIT; i++) {
    while ((ch = br.read()) != '\n') {
        if (ch == -1) {
            // reached the end of file too soon ...
            throw new IOException("The file has < " + LIMIT + " lines");
        }
    }
}
String line = br.readLine();
The trick is to skip over the lines without forming them into String objects.
Now there is a small flaw in the above. It is assuming that the lines of the text file are terminated by a newline character ('\n'), whereas the readLine can cope with 3 kinds of line separator. But that could be addressed ... without generating extra garbage. I'll leave it as "an exercise for the reader", along with investigating tweaks like using read(char[]) instead of read().
You could probably get better performance if you opened the file using a FileInputStream, obtained the FileChannel, read the bytes into a ByteBuffer and then searched it for (byte) '\n'. But the code is significantly more complicated.
However, I'd like to reinforce a point made in the comments. You are probably wasting your time with this. The chances are that your original version runs fast enough for your purposes, despite generating lots of garbage. In reality, GC is fast when the ratio of garbage to non-garbage is high, and for a program that reads and discards lines, that is pretty much guaranteed to be the case.
Rather than spending time figuring out how to make your program fast based on a false premise, you would be better off writing a simple version and measuring its performance on typical input files. Only optimize if the program is actually too slow.

Instead of reading strings, you can read the data in blocks (maybe 1024-byte blocks) and search for line-break characters. To read a block of data you can use a byte array, which is reused from block to block, so there are no memory issues. You have to take care of:
Handling of both \r and \n characters
Encoding of the file (like Unicode or other)
Reading data in blocks instead of byte by byte will be more efficient.
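A rough sketch of the block-reading idea follows. The names LineSeeker, offsetOfLine and readLine are invented for this example. It assumes an ASCII-compatible encoding such as UTF-8 (where the byte 0x0A only ever means a newline) and lines ending in \n or \r\n; files using a bare \r as separator would need extra handling.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: finds a line by scanning reusable byte blocks for '\n'.
class LineSeeker {
    // Byte offset where line `lineNumber` (1-based) starts, or -1 if the file
    // contains fewer than lineNumber - 1 newline bytes.
    static long offsetOfLine(Path file, int lineNumber) throws IOException {
        if (lineNumber <= 1) {
            return 0;
        }
        byte[] block = new byte[8192]; // reused for every read, so no per-line garbage
        long blockStart = 0;
        int newlinesSeen = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(block)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (block[i] == '\n' && ++newlinesSeen == lineNumber - 1) {
                        return blockStart + i + 1; // first byte after the newline
                    }
                }
                blockStart += n;
            }
        }
        return -1;
    }

    // Returns the requested line, or null if the file does not have that many lines.
    static String readLine(Path file, int lineNumber) throws IOException {
        long start = offsetOfLine(file, lineNumber);
        if (start < 0) {
            return null;
        }
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ch.position(start); // jump straight to the start of the line
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(Channels.newInputStream(ch), StandardCharsets.UTF_8));
            return reader.readLine(); // the only String created
        }
    }
}
```

Only the requested line is ever turned into a String; everything before it is scanned as raw bytes in the same 8 KB array.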

I think this should help (it uses LineIterator and IOUtils from Apache Commons IO):
FileReader fr = new FileReader("file1.txt");
BufferedReader br = new BufferedReader(fr);
LineIterator it = IOUtils.lineIterator(br);
for (int l = 0; it.hasNext(); l++) {
String line = (String) it.next();
if (l == LIMIT) {
return line;
}
}

File splitting loss of data

I wrote a program for splitting and joining files. When I break a file into smaller pieces, I find that the total size of the smaller files is not equal to that of the original; approximately 30-50 bytes of data are lost, and the combined file doesn't run correctly.
e.g. a file ABC has been broken into 2 parts, ABC1 and ABC2 but the problem is
sizeof(ABC) is not equal to sizeof(ABC1) + sizeof(ABC2). By sizeof(ABC) I mean from Windows's perspective, i.e. from the Windows property dialog box.
My code is:
for(int i =0;i<no_of_parts;i++)
{
copied_data = 0;// a variable that count the no of byte transferred in the part of file
fos = new FileOutputStream(jTextField2.getText()+"\\".part"+i);
bouts = new BufferedOutputStream(fos);
while((b = bins.read())!= -1)
{
bouts.write(b);
copied_data++;
if(copied_data==each_part_size_in_byte)
break;
}
}

What about closing your output stream? It will flush the buffer and free the file descriptor you use. Call bouts.close().
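As an illustration of the close() advice, here is a sketch of a splitter in which try-with-resources closes (and therefore flushes) every part. The class name, method name and the ".partN" naming scheme are assumptions made for this example, not the asker's actual code.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: splits a file into parts of at most partSize bytes.
class FileSplitter {
    // Returns the number of part files written into targetDir.
    static int split(Path source, Path targetDir, long partSize) throws IOException {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        int parts = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(source))) {
            int b = in.read();
            while (b != -1) {
                Path part = targetDir.resolve(source.getFileName() + ".part" + parts);
                try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(part))) {
                    long copied = 0;
                    while (b != -1 && copied < partSize) {
                        out.write(b);
                        copied++;
                        b = in.read();
                    }
                } // close() runs here and flushes the bytes still sitting in the buffer
                parts++;
            }
        }
        return parts;
    }
}
```

Without the close() (or at least a flush()), the last bytes written to each BufferedOutputStream can remain in its buffer and never reach the file, which would be consistent with a small number of bytes going missing per part.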

When a file is created, space for it is allocated on disk in fixed-size blocks rather than in individual bytes. So when you divide the file into two, each part occupies a whole number of blocks, and its size on disk may be larger than the size of the data actually written.

Reading big files and performing some operations in java

First of all I would try to explain what I need to do.
I need to read a file (whose size could be from 1 byte to 2 GB), 2 GB maximum because I try to use MappedByteBuffer for fast reading. Maybe later I will try to read file in chunks in order to read files of arbitrary size.
When I read the file I convert its bytes (using ASCII encoding) to chars, which I then put into a StringBuilder, and then I put this StringBuilder into an ArrayList.
However I also need to do the following:
User could type blockSize which is the number of chars I have to read into the StringBuilder (which is basically number of file bytes converted to chars)
Once I have collected the user defined char count, I create a copy of the String Builder and put it into an Array List
All steps are performed for every char read. The problem is with the StringBuilder: if the file is big (<500 MB), I get an OutOfMemoryError.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
at java.lang.StringBuilder.<init>(StringBuilder.java:80)
at java.lang.StringBuilder.<init>(StringBuilder.java:106)
at borrows.wheeler.ReadFile.readFile(ReadFile.java:43)
Java Result: 1
I post my code, maybe someone could suggest improvements to this code or suggest some alternatives.
public class ReadFile {
//matrix block size
public int blockSize = 100;
public int charCounter = 0;
public ArrayList readFile(File file) throws FileNotFoundException, IOException {
FileChannel fc = new FileInputStream(file).getChannel();
MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0, (int) fc.size());
ArrayList characters = new ArrayList();
int counter = 0;
StringBuilder sb = new StringBuilder();//blockSize-1
while (mbb.hasRemaining()) {
char charAscii = (char)mbb.get();
counter++;
charCounter++;
if (counter == blockSize){
sb.append(charAscii);
characters.add(new StringBuilder(sb));//new StringBuilder(sb)
sb.delete(0, sb.length());
counter = 0;
}else{
sb.append(charAscii);
}
if(!mbb.hasRemaining()){
characters.add(sb);
}
}
fc.close();
return characters;
}
}
EDIT:
I am doing a Burrows-Wheeler transform. For that I have to read each file and then, based on the block size, create as many matrices as needed. I believe the wiki explains it better than I can:
http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform

If you load large files, it's not entirely surprising that you run out of memory.
How much memory do you have? Are you on a 64-bit system with 64-bit Java? How much heap memory have you allocated (e.g using -Xmx setting)?
Bear in mind that you will need at least twice as much memory as the file size, because a Java char is a 2-byte UTF-16 code unit, whereas your input is one byte per character. (JVMs from Java 9 onwards can store Latin-1-only text more compactly, but don't count on it.) So to load a 2 GB file you will need at least 4 GB allocated to the heap just for storing this text data.
Also, you need to sort out the logic in your code: you do the same sb.append(charAscii) in both the if and the else, and you test !mbb.hasRemaining() in every iteration of a while (mbb.hasRemaining()) loop.
As I asked in your previous question, do you need to store StringBuilders, or would the resulting Strings be OK? Storing strings would save space because StringBuilder allocates memory in big chunks (I think it doubles in size every time it runs out of space!) so may waste a lot.
If you do have to use StringBuilders then pre-sizing them to the value of blockSize would make the code more memory-efficient (and faster).
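A sketch of the "store Strings, sized exactly to blockSize" idea is shown below. The class and method names are hypothetical, the input is assumed to be ASCII as stated in the question, and InputStream.readNBytes requires Java 9 or later.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a file into Strings of blockSize characters each.
class BlockReader {
    static List<String> readBlocks(Path file, int blockSize) throws IOException {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<String> blocks = new ArrayList<>();
        byte[] buf = new byte[blockSize]; // one reusable buffer, exactly one block long
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int n;
            // readNBytes keeps reading until the buffer is full or the file ends.
            while ((n = in.readNBytes(buf, 0, blockSize)) > 0) {
                // Each String is exactly as long as its block; no spare capacity is kept.
                blocks.add(new String(buf, 0, n, StandardCharsets.US_ASCII));
            }
        }
        return blocks;
    }
}
```

This still holds the whole file in memory, so the heap-size concerns above remain. If each block can be processed and discarded as soon as it is read, the list can be dropped entirely and memory use stays at about one block.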

I try to use MappedByteBuffer for fast reading. Maybe later I will try
to read file in chunks in order to read files of arbitrary size.
When i read file I convert its bytes and convert them (using ASCII
encoding) to chars which later I put into a StringBuilder and then I
put this String Builder into an ArrayList
This sounds more like a problem than a solution. I suggest to you that the file already is ASCII, or character data; that it could be read pretty efficiently using a BufferedReader; and that it can be processed one line at a time.
So do that. You won't get even double the speed by using a MappedByteBuffer, and everything you're doing including the MappedByteBuffer is consuming memory on a truly heroic scale.
If the file isn't such that it can be processed line by line, or record by record, there is something badly wrong upstream.

Can I optimize this code?

I am trying to retrieve the data from the table and convert each row into CSV format like
s12, james, 24, 1232, Salaried
The code below does the job, but it takes a long time for tables with more than 100,000 rows.
Please advise on optimizing technique:
while(rset1.next()!=false) {
sr=sr+"\n";
for(int j=1;j<=rsMetaData.getColumnCount();j++)
{
if(j< 5)
{
sr=sr+rset1.getString(j).toString()+",";
}
else
sr=sr+rset1.getString(j).toString();
}
}
/SR

Two approaches, in order of preference:
Stream the output
PrintWriter csvOut = ... // Construct a writer from an OutputStream, say to a file
while (rs.next())
csvOut.println(...) // Write a single line
(note that you should ensure that your Writer / OutputStream is buffered, although many are by default)
Use a StringBuilder
StringBuilder sb = new StringBuilder();
while (rs.next())
sb.append(...) // Write a single line
The idea here is that appending Strings in a loop is a bad idea. Imagine that you have a string. In Java, Strings are immutable. That means that to append to a string you have to copy the entire string and then write more to the end. Since you are appending things a little bit at a time, you will have many many copies of the string which aren't really useful.
If you're writing to a File, it's most efficient just to write directly out with a stream or a Writer. Otherwise you can use the StringBuilder which is tuned to be much more efficient for appending many small strings together.
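The streaming approach could look roughly like this. CsvStreamer and writeCsv are names made up for the example; the sketch assumes values contain no commas, quotes or line breaks (real CSV needs escaping) and writes SQL NULL as an empty field.

```java
import java.io.IOException;
import java.io.Writer;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper: streams a ResultSet to a Writer, one CSV line per row.
class CsvStreamer {
    // Returns the number of rows written. Pass a buffered Writer for file output.
    static int writeCsv(ResultSet rs, Writer out) throws SQLException, IOException {
        int columnCount = rs.getMetaData().getColumnCount(); // looked up once, not per row
        int rows = 0;
        while (rs.next()) {
            for (int j = 1; j <= columnCount; j++) {
                if (j > 1) {
                    out.write(',');
                }
                String value = rs.getString(j);
                out.write(value == null ? "" : value); // SQL NULL becomes an empty field
            }
            out.write('\n');
            rows++;
        }
        return rows;
    }
}
```

Because each row goes straight to the Writer, memory use stays roughly constant regardless of the number of rows. Wrapping the target in a BufferedWriter (for example via Files.newBufferedWriter) keeps the number of actual disk writes low.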

I'm no Java expert, but I think it's always bad practice to use something like getColumnCount() in a conditional check. This is because after each loop, it runs that function to see what the column count is, instead of just referencing a static number. Instead, set a variable equal to that number and use the variable to compare against j.

You might want to use a StringBuilder to build the string, that's much more efficient when you're doing a lot of concatenation. Also if you have that much data, you might want to consider writing it directly to wherever you're going to put it instead of building it in memory at first, if that's a file or a socket, for example.

StringBuilder sr = new StringBuilder();
int columnCount =rsMetaData.getColumnCount();
while (rset1.next()) {
sr.append('\n');
for (int j = 1; j <= columnCount; j++) {
sr.append(rset1.getString(j));
if (j < 5) {
sr.append(',');
}
}
}

As a completely different, but undoubtedly the most efficient alternative, use the DB-provided export facilities. It's unclear which DB you're using, but judging by your question history you seem to be doing a lot with Oracle. In that case, you can export a table into a CSV file using UTL_FILE.
See also:
Generating CSV files using Oracle
Stored procedure example on Ask Tom

As the other answers say, stop appending to a String. In Java, String objects are immutable, so each append must do a full copy of the string, turning this into an O(n^2) operation.
The other big slowdown is the fetch size. By default, the driver is likely to fetch only one row (or a handful of rows) per round trip. Even if a round trip takes just 1 ms, fetching one row at a time limits you to a thousand rows per second. A remote database, even on the same network, will be much worse. Try calling setFetchSize(1000) on the Statement. Beware that setting the fetch size too high can cause out-of-memory errors with some database drivers.
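For illustration, setting the fetch size might look like the sketch below. The class and method names are invented, the method only counts rows to keep the example short, and whether the hint has any effect depends on the JDBC driver.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper showing where setFetchSize goes.
class FetchSizeExample {
    // Runs the query with the given fetch-size hint and returns the number of rows.
    static int countRows(Connection conn, String sql, int fetchSize) throws SQLException {
        try (Statement st = conn.createStatement()) {
            // A hint for how many rows to transfer per round trip; set it before executing.
            st.setFetchSize(fetchSize);
            int rows = 0;
            try (ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    rows++; // real code would read the columns here
                }
            }
            return rows;
        }
    }
}
```

A value such as 1000 is a starting point to measure against, not a universally good setting; the memory the driver needs grows with the fetch size and the row width.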

I don't believe minor code changes are going to make a substantive difference. I'd surely use a StringBuffer however.
He's going to be reading a million rows over a wire, assuming his database is on a separate machine. First, if performance is unacceptable, I'd run that code on the database server and clip the network out of the equation. If it's the sort of code that gets run once a week as a batch job that may be ok.
Now, what are you going to do with the StringBuffer or String once it is fully loaded from the database? We're looking at a String that could be 50 Mbyte long.
This should be one iota faster since it removes the unneeded (j < 5) check.
StringBuilder sr = new StringBuilder();
int columnCount = rsMetaData.getColumnCount();
while (rset1.next()) {
    for (int j = 1; j < columnCount; j++) {
        sr.append(rset1.getString(j)).append(",");
    }
    // I suspect the 'if (j<5)' really meant, "if we aren't on the last
    // column then tack on a comma." So we always tack it on above and
    // write the last column and a newline now.
    sr.append(rset1.getString(columnCount)).append("\n");
}
Another option is to change the select so it returns a comma-separated string. Then we read the single-column result and append it to the StringBuffer.
I forget the exact syntax now, but something like:
select column1 || ',' || column2 || ',' ... from table;
Now we don't need the inner loop or the comma concatenation business.
StringBuilder sr = new StringBuilder();
while (rset1.next()) {
    sr.append(rset1.getString(1)).append("\n");
}
