Reduce memory imprint when Java application reads gigantic file in chunks


I am creating an application to upload data to a server. The data will be pretty huge, up to 60-70 GB. I am using Java since I need it to run in any browser.
My approach is something like this:
InputStream s = new FileInputStream(file);
byte[] chunk = new byte[20000000];
s.read(chunk);
s.close();
client.postToServer(chunk);
At the moment it uses a large amount of memory, which steadily climbs to about 1 GB, and when the garbage collector kicks in it is VERY obvious: a 5-6 second gap between chunks.
Is there any way to improve the performance of this and keep the memory footprint to a decent level?
EDIT:
This is not my real code. There are a lot of other things I do, like calculating a CRC, validating against the InputStream.read return value, et cetera.

You need to think about buffer reuse, something like this:
int size = 64*1024; // 64KiB
byte[] chunk = new byte[size];
int read = -1;
for( read = s.read(chunk); read != -1; read = s.read(chunk)) {
/*
* I do hope you have some API call like the thing below, or at least one with a wrapper object that
* exposes partially filled buffers. Because read might not be the size of the entire buffer if there
* are less than that amount of bytes available in the input stream until the end of the file...
*/
client.postToServer(chunk, 0, read);
}

The first step would be to re-use your buffer, if you don't already do so. Reading a huge file should not generally require a lot of memory unless you keep it all in memory.
Also: Why are you using such a huge buffer? There's nothing really to be gained from it (unless you have an insanely fast network connection & hard disk). Reducing it to about 64k should have no negative effect on performance and might help Java with the GC.
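Since the question mentions calculating a CRC and checking the return value of read, here is a minimal sketch of how a single reused 64 KiB buffer could cover both. ChunkSink is a hypothetical stand-in for client.postToServer (the real client API is not shown in the question), and the sketch assumes the sink does not hold on to the buffer after it returns.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

final class ChunkedReader {

    /** Hypothetical stand-in for the question's client.postToServer call. */
    interface ChunkSink {
        void accept(byte[] buffer, int offset, int length) throws IOException;
    }

    /**
     * Streams everything from in to sink through one reused buffer and
     * returns the CRC-32 of all bytes that were read.
     */
    static long stream(InputStream in, ChunkSink sink, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize]; // allocated once, reused for every chunk
        CRC32 crc = new CRC32();
        int read;
        while ((read = in.read(buffer)) != -1) {
            crc.update(buffer, 0, read);  // only the bytes actually read
            sink.accept(buffer, 0, read); // the sink must not keep a reference to buffer
        }
        return crc.getValue();
    }
}
```

Called as ChunkedReader.stream(new FileInputStream(file), sink, 64 * 1024), this allocates 64 KiB in total rather than 20 MB per chunk, so there is very little garbage for the collector to deal with.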

You can try to tune the garbage collector ( http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html , http://www.petefreitag.com/articles/gctuning/ )

Related

Efficiency of Code: Java Transfer File over TCP

I would like to know the difference in terms of performance between these two blocks that try to send a big file over a TCP socket.
I couldn't find many resources explaining their efficiency.
A-
byte[] buffer = new byte[1024];
int number;
while ((number = fileInputStream.read(buffer)) != -1) {
socketOutputStream.write(buffer, 0, number);
}
B-
byte[] mybytearray = new byte[filesize];
os.write(mybytearray);
Which one is better in terms of transfer delay?
Also, what is the difference if I set the buffer size to 1024 or 65536? How would that affect the performance?

The latency until the last byte of the file arrives is basically identical. However the first one is preferable, although with a much larger buffer, for the following reasons:
The data starts arriving sooner.
There is no assumption that the file size fits into an int.
There is no assumption that the entire file fits into memory, so
It scales to very large files without code changes.

Your MTU (Maximum Transmission Unit) size is likely to be around 1500 bytes. This means your data will be broken up into (or combined into) packets of this size no matter what you do. Any reasonable buffer size from 512 bytes up is likely to give you the same transfer speed.
How you send and receive data impacts the amount of CPU you use. Unless you have a fast network, e.g. 10 Gbit/s, your CPU will more than keep up with your network.
Writing the code in an efficient manner will ensure you don't waste CPU (which is a good thing), but it shouldn't make much difference to your transfer speed, which is limited by the bandwidth (and latency) of your network.

Java BufferedOutputStream: How many bytes to write

This is more a matter of conscience than a technological issue :p
I'm writing some Java code to download files from a server. For that, I'm using the BufferedOutputStream method write() and the BufferedInputStream method read().
So my question is: if I use a buffer to hold the bytes, how many bytes should I read at a time? Sure, I can read byte by byte using just int b = read() and then write(b), or I could use a buffer. If I take the second approach, are there any aspects I must pay attention to when choosing the number of bytes to read/write each time? What will this number affect in my program?
Thanks

Unless you have a really fast network connection, the size of the buffer will make little difference. I'd say that 4k buffers would be fine, though there's no harm in using buffers a bit bigger.
The same probably applies to using read() versus read(byte[]) ... assuming that you are using a BufferedInputStream.
Unless you have an extraordinarily fast / low-latency network connection, the bottleneck is going to be the data rate that the network and your computers' network interfaces can sustain. For a typical internet connection, the application can move the data two or more orders of magnitude faster than the network can. So unless you do something silly (like doing 1 byte reads on an unbuffered stream), your Java code won't be the bottleneck.

BufferedInputStream and BufferedOutputStream typically rely on System.arraycopy for their implementations. System.arraycopy has a native implementation, which likely relies on memmove or bcopy. The amount of memory that is copied will depend on the available space in your buffer, but regardless, the implementation down to the native code is pretty efficient, unlikely to affect the performance of your application regardless of how many bytes you are reading/writing.
However, with respect to BufferedInputStream, if you set a mark with a high limit, a new internal buffer may need to be created. If you do use a mark, reading more bytes than are available in the old buffer may cause a temporary performance hit, though the amortized performance is still linear.
As Stephen C mentioned, you are more likely to see performance issues due to the network.

What is the MTU (maximum transmission unit) of your network connection? If you are using UDP, for example, you can check this value and use a smaller byte array. If that doesn't matter, you need to check how much memory your program uses. I think 1024-4096 bytes would be a good size for holding the data while you continue to receive.

If you pump data you normally do not need to use any Buffered streams. Just make sure you use a decently sized (8-64k) temporary byte[] buffer passed to the read method (or use a pump method which does it). The default buffer size is too small for most uses (and if you use a larger temp array it will be ignored anyway).
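A pump method of the kind mentioned above might look like the following sketch. The name StreamPump is made up for illustration; the byte count is kept in a long so files larger than 2 GB are counted correctly.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class StreamPump {

    /** Copies in to out through one caller-sized buffer and returns the number of bytes copied. */
    static long pump(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize]; // e.g. 8 * 1024 to 64 * 1024
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // write only what was actually read
            total += n;
        }
        out.flush();
        return total;
    }
}
```

On Java 9 and later, InputStream.transferTo(OutputStream) does the same job with an internal buffer whose size you cannot choose.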

Transfer a File over a network using TCP (Speed up the transfer)

I have been trying to send a big file over a Socket connection, but it runs slowly and I was wondering if this code can be optimized in some way to improve the transfer speed.
This is my code for sending the file:
byte[] buffer = new byte[65536];
int number;
while ((number = fileInputStream.read(buffer)) != -1) {
socketOutputStream.write(buffer, 0, number);
}
socketOutputStream.close();
fileInputStream.close();
This is what I use to receive the file on the other machine:
byte[] buffer = new byte[65536];
InputStream socketStream= clientSocket.getInputStream();
File f=new File("C:\\output.dat");
OutputStream fileStream=new FileOutputStream(f);
while ((number = socketStream.read(buffer)) != -1) {
fileStream.write(buffer,0,number);
}
fileStream.close();
socketStream.close();
I think writing to the fileStream is taking the majority of the time. Could anyone offer any advice for speeding up this code?

There's nothing obviously wrong with that code, other than the lack of finally blocks for the close statements.
How long does it take for how much data? It's very unlikely that the FileOutputStream is what's taking the time - it's much more likely to be the network being slow. You could potentially read from the network and write to the file system in parallel, but that would be a lot of work to get right, and it's unlikely to give that much benefit, IMO.

You could try a BufferedOutputStream around the FileOutputStream. It would have the effect of block-aligning all disk writes, regardless of the count you read from the network. I wouldn't expect a major difference but it might help a bit.
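As a rough sketch, the receiving side with a BufferedOutputStream around the FileOutputStream could look like this. It also uses try-with-resources so both streams are closed even when an exception is thrown, which addresses the missing finally blocks mentioned in the other answer. The class name FileReceiver and the 64 KiB sizes are illustrative choices, not measured optima.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class FileReceiver {

    /** Writes everything from socketStream into target and returns the number of bytes received. */
    static long receive(InputStream socketStream, File target) throws IOException {
        byte[] buffer = new byte[65536];
        long total = 0;
        // Both streams are closed automatically, even if read or write throws.
        try (InputStream in = socketStream;
             OutputStream fileStream =
                     new BufferedOutputStream(new FileOutputStream(target), 65536)) {
            int number;
            while ((number = in.read(buffer)) != -1) {
                fileStream.write(buffer, 0, number); // short network reads are coalesced here
                total += number;
            }
        }
        return total;
    }
}
```

Network reads often return fewer bytes than the buffer holds, so the wrapper collects them into larger writes to the file. Whether that is measurable depends on the disk and the OS cache.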

I had a similar issue FTP'ing large files. I realized that using the same buffer for reading from the hard drive AND writing to the network was the issue. File system IO likes larger buffers because it is a lot less work for the hard drive to do all the seeking and reading. Networks, on the other hand, tend to prefer smaller buffers for optimizing throughput.
The solution is to read from the hard disk using a large buffer, then write this buffer to the network stream in smaller chunks.
I was able to max out my NIC at 100% utilization for the entire length of any file with 4 MB reads and 32 KB writes. You can then do the mirrored version on the server by reading in 32 KB at a time, storing it in memory, and then writing 4 MB at a time to the hard drive.
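The large-read, small-write idea can be sketched like this. TwoSizeCopier is a made-up name, and the 4 MB and 32 KB figures are simply the values that worked on my setup; other hardware may show no difference at all.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class TwoSizeCopier {

    /**
     * Reads from disk in blocks of up to readSize bytes and forwards each block
     * to network in slices of at most writeSize bytes.
     */
    static long copy(InputStream disk, OutputStream network, int readSize, int writeSize)
            throws IOException {
        if (readSize <= 0 || writeSize <= 0) {
            throw new IllegalArgumentException("buffer sizes must be positive");
        }
        byte[] buffer = new byte[readSize]; // e.g. 4 * 1024 * 1024
        long total = 0;
        int filled;
        while ((filled = disk.read(buffer)) != -1) {
            // Hand the large block to the network in small pieces, e.g. 32 * 1024 each.
            for (int off = 0; off < filled; off += writeSize) {
                network.write(buffer, off, Math.min(writeSize, filled - off));
            }
            total += filled;
        }
        return total;
    }
}
```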

Finding server internet bandwidth thru java for streaming

Following this thread: Streaming large files in a Java servlet.
Is it possible to find the total internet bandwidth available on the current machine through Java?
What I am trying to do is this: while streaming large files through a servlet, I want to reduce the BUFFER_SIZE of the stream for each request based on the number of parallel requests and the total bandwidth. Does that make sense?
Is there any pure Java way (without JNI)?

Maybe you can time how long the app needs to send one packet (the buffer), and if that takes longer than x milliseconds, make your buffer smaller. You can use other values for the initial bufferSize and for the threshold in if (stop - start > 700).
This is based on the thread you mentioned:
ServletOutputStream out = response.getOutputStream();
InputStream in = [ code to get source input stream ];
String mimeType = [ code to get mimetype of data to be served ];
int bufferSize = 1024 * 4;
byte[] bytes = new byte[bufferSize];
int bytesRead;
response.setContentType(mimeType);
while ((bytesRead = in.read(bytes)) != -1) {
long start = System.currentTimeMillis();
out.write(bytes, 0, bytesRead);
long stop = System.currentTimeMillis();
if (stop - start > 700)
{
bufferSize /= 2;
bytes = new byte[bufferSize];
}
}
// do the following in a finally block:
in.close();
out.close();

The only way to find available bandwidth is to monitor / measure it. On Windows you have access to Net.exe and can get the throughput on each NIC.

If you're serving the content through a servlet, then you could calculate how fast each servlet output stream is going. Collect that data for all streams for a user/session, and you could determine at least what the current bandwidth usage is.
A possible way to calculate the rate could be instead of writing the large files through the servlet output stream, write to a new FilterOutputStream that would keep track of your download rates.
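A FilterOutputStream that keeps track of the rate could be sketched like this. The class name and its accessor methods are made up for illustration, and the sketch is not thread-safe; it assumes one instance per response stream.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

final class RateTrackingOutputStream extends FilterOutputStream {
    private final long startNanos = System.nanoTime();
    private long bytesWritten;

    RateTrackingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        bytesWritten++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len); // delegate in bulk; the inherited version writes byte by byte
        bytesWritten += len;
    }

    long getBytesWritten() {
        return bytesWritten;
    }

    /** Average rate since this stream was created, in bytes per second. */
    double bytesPerSecond() {
        long elapsed = System.nanoTime() - startNanos;
        return elapsed <= 0 ? 0.0 : bytesWritten * 1e9 / elapsed;
    }
}
```

In a servlet you would wrap response.getOutputStream() in it and add up bytesPerSecond() across the active streams of a user or session.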

The concept of "total internet bandwidth available in current machine" is really hard to define. However, tweaking the local buffer size will not affect how much data you can push through to an individual client.
The rate at which a given client can take data from your server will vary with the client, and with time. For any given connection, you might be limited by your local upstream connection to the Internet (e.g., server on DSL) or you might be limited somewhere in the core (unlikely) or the remote end (e.g., server in a data center, client on a dialup line). When you have many connections, each individual connection may have a different bottleneck. Measuring this available bandwidth is a hard problem; see for example this list of research and tools on the subject.
In general, TCP will handle using all the available bandwidth fairly for any given connection (though sometimes it may react to changes in available bandwidth slower than you like). If the client can't handle more data, the write call will block.
You should only need to tweak the buffersize in the linked question if you find that you are seeing low bandwidth and the cause of that is insufficient data buffered to write to the network. Another reason you might tweak the buffer size is if you have so many active connections that you are running low on memory.
In any case, the real answer may be to not buffer at all but instead put your static files on a separate server and use something like thttpd to serve them (using a system call like sendfile) instead of a servlet. This helps ensure that the bottleneck is not on your server, but somewhere out in the Internet, beyond your control.

EDIT: Re-reading this, it's a little muddled because it's late here. Basically, you shouldn't have to do this from scratch; use one of the existing highly scalable java servers, since they'll do it better and easier.
You're not going to like this, but it actually doesn't make sense, and here's why:
Total bandwidth is independent of the number of connections (though there is some small overhead), so messing with buffer sizes won't help much
Your chunks of data are being broken into variable-sized packets anyway. Your network card and protocol will deal with this better than your servlet can
Resizing buffers regularly is expensive -- far better to re-use constant buffers from a fixed-size pool and have all connections queue up for I/O rights
There are a billion and a half libraries that assist with this sort of server
Were this me, I would start looking at multiplexed I/O using NIO. You can almost certainly find a library to do this for you. The IBM article here may be a useful starting point.
I think the smart money gives you one network I/O thread, and one disk I/O thread, with multiplexing. Each connection requests a buffer from a pool, fills it with data (from a shared network or disk Stream or Channel), processes it, then returns the buffer to the pool for re-use. No re-sizing of buffers, just a bit of a wait for each chunk of data. If you want latency to stay short, then limit how many transfers can be active at a time, and queue up the others.
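To make the pool idea concrete, here is a minimal sketch of a fixed-size buffer pool built on a blocking queue. BufferPool and its method names are made up, and a real server would more likely take this from an existing NIO library than hand-roll it.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class BufferPool {
    private final BlockingQueue<byte[]> free;
    private final int bufferSize;

    /** Pre-allocates a fixed number of equally sized buffers; nothing is resized later. */
    BufferPool(int buffers, int bufferSize) {
        this.bufferSize = bufferSize;
        this.free = new ArrayBlockingQueue<>(buffers);
        for (int i = 0; i < buffers; i++) {
            free.add(new byte[bufferSize]);
        }
    }

    /** Blocks until a buffer is free, which naturally queues connections for I/O. */
    byte[] acquire() throws InterruptedException {
        return free.take();
    }

    /** Hands a buffer obtained from acquire() back to the pool. */
    void release(byte[] buffer) {
        if (buffer.length != bufferSize) {
            throw new IllegalArgumentException("buffer does not belong to this pool");
        }
        if (!free.offer(buffer)) {
            throw new IllegalStateException("pool is already full");
        }
    }

    int available() {
        return free.size();
    }
}
```

Each connection would call acquire(), fill and process the buffer, then call release() in a finally block. The number of buffers then caps both memory use and the number of transfers in flight.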

How to allocate the memory from OS instead of increasing the JVM’s heap size?

I need to detect whether the file I am attaching to an email is exceeding the server limit. I am not allowed to increase the JVM heap size to do this since it is going to affect the application performance.
If I don’t increase the JVM heap size, I will run into OutOfMemoryError directly.
I would like to know how to allocate the memory from the OS instead of increasing the JVM's heap size.
Thanks a lot!

Are you really trying to read the entire file to determine its size to check if it is less than some configured value (your question is not too easy to understand)? If so, why are you not using File#length() instead?
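A minimal sketch of such a check, assuming the server limit is known to the client as a byte count (maxBytes is a made-up parameter name):

```java
import java.io.File;

final class AttachmentCheck {

    /** Returns true if the file is larger than maxBytes, without reading its contents. */
    static boolean exceedsLimit(File file, long maxBytes) {
        // File#length() asks the file system for the size; it returns 0 if the file does not exist.
        return file.length() > maxBytes;
    }
}
```

This needs no extra heap at all, whatever the size of the file.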

If you need to stream the file to the server in order to find out whether it's too big, you still don't need to read the whole file into memory.
Instead, read maybe 10-100k into memory. Fill the buffer, send it to the server. Repeat until the file is done or the server complains. Then you don't need enough memory for the whole file.

If you write your own stream handling code, you could create your own counter to track the number of bytes transmitted. I'd be surprised if there wasn't already some sort of Filter class that does this for you. Sun has a page about this. Search for 'CountReader'.

You could allocate the memory natively via native code and JNI. However that sounds a painful way to do this.
Instead can't you give the JVM suitable memory configurations (via -Xmx) ? If the document you're mailing is so big that you can't easily handle it, then I'm not sure email is the correct medium to transfer it, and you should instead host it and send a link to it, or perhaps FTP it.

If all the other solutions turn out to be unusable (and I would encourage you to find a better way than requiring the entire file to fit in memory!) you could consider using a direct ByteBuffer. It has the option of using mmap() or other system calls to map a file into your memory without actually reading / allocating space in the heap. You can do this by calling map() on a FileChannel -- API documentation. Note that this is potentially expensive and/or not supported on some platforms, so it should be considered suboptimal compared to any solution which does not require the entire file to be in memory.
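For illustration only, here is a sketch of mapping a file with FileChannel.map and walking over it without copying it into a heap array. Computing a CRC-32 is just an arbitrary way to touch every byte; the class name and the 64 MiB window are my own choices. A single mapping cannot exceed Integer.MAX_VALUE bytes, so large files have to be mapped in windows anyway.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

final class MappedChecksum {

    /** Computes the CRC-32 of a file by mapping it window by window. */
    static long crc32(Path file) throws IOException {
        CRC32 crc = new CRC32();
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            long window = 64L * 1024 * 1024; // map at most 64 MiB at a time
            for (long pos = 0; pos < size; pos += window) {
                long len = Math.min(window, size - pos);
                MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                crc.update(region); // consumes the remaining bytes of the mapped region
            }
        }
        return crc.getValue();
    }
}
```

The mapped memory lives outside the Java heap, but it still counts against the process's address space, and it is only released once the buffer is garbage collected.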

Socket s = /* go get your socket to the server */;
InputStream is = new FileInputStream("foo.txt");
OutputStream os = s.getOutputStream();
byte[] buf = new byte[4096];
for(int len=-1;(len=is.read(buf))!=-1;) os.write(buf,0,len);
os.close();
is.close();
Of course handle your Exceptions.

If you're not allowed to increase the heap size because of memory constraints, doing an "under the table" memory allocation would cause the same problems. It sounds like you're looking for a loophole in the rules. Like, "My doctor says to cut down on how much I eat at each meal, so I'm eating more between meals to make up for it."
The only way I know of to allocate memory without using the Java heap would be to write JNI calls to malloc the memory with C. But then how would you use this memory? You'd have to write more JNI calls to interact with it. I think you'd end up basically re-inventing Java.
If the goal here is to send a large file, just use buffered streams and read/write it one byte at a time. A buffered stream, as the name implies, will take care of buffering for you so you're not really hitting the hard drive one byte at a time. It will really read, I think the default is 8k at a time, and then pass these bytes to you as you ask for them. Likewise, on the write side it will save up a few KB and send them all in chunks.
So all you should have to do is open a BufferedInputStream and a BufferedOutputStream. Then write a loop that reads one byte from the input stream and writes it to the output stream until you hit end-of-file.
Something like:
OutputStream os=... however you're getting your socket ...
BufferedInputStream bis=new BufferedInputStream(new FileInputStream(fileObject));
BufferedOutputStream bos=new BufferedOutputStream(os);
int b;
while ((b=bis.read())!=-1)
bos.write(b);
bis.close();
bos.close();
No need to make life complicated for yourself by re-inventing buffering.
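If you want to convince yourself that the one-byte loop does not hit the underlying stream once per byte, a small counting wrapper will show it. This is only a sketch: ReadCallCounter is a made-up helper, and it uses an in-memory source so it can run anywhere.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadCallCounter extends FilterInputStream {
    int calls; // number of read requests that reached the underlying stream

    ReadCallCounter(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        calls++;
        return in.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        calls++;
        return in.read(b, off, len);
    }

    /** Reads data one byte at a time through a BufferedInputStream and returns the underlying call count. */
    static int countBufferedCalls(byte[] data) throws IOException {
        ReadCallCounter counter = new ReadCallCounter(new ByteArrayInputStream(data));
        try (InputStream in = new BufferedInputStream(counter)) {
            while (in.read() != -1) {
                // consume one byte per call, as in the loop above
            }
        }
        return counter.calls;
    }
}
```

For 100,000 bytes the count comes out at a handful of bulk reads rather than 100,000 single-byte reads, because the BufferedInputStream refills its internal buffer in blocks.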
