I'm trying to write a bulk downloader for images. Getting the InputStream from a URLConnection is easy enough, but downloading all the files takes a while. Multithreading certainly speeds it up, but having many threads downloading files could use a lot of memory. Here's what I found:
Let in be the InputStream, file the target File, and fos a FileOutputStream to file.
The simple way
fos.write(in.readAllBytes());
Reads the whole file, then writes the returned byte[]. Probably usable for fetching a website's source, but no good for bigger files such as images.
Writing chunks
byte[] buffer = new byte[bufsize];
int read;
while ((read = in.read(buffer, 0, bufsize)) >= 0) {
fos.write(buffer, 0, read);
}
Seems better to me.
in.transferTo(fos)
in.transferTo(fos);
Writes chunks internally, as seen above.
Files.copy()
Files.copy(in, file.toPath(), StandardCopyOption.REPLACE_EXISTING);
Appears to use native implementations.
Which one of these should I use to minimize memory usage when done dozens of times in parallel?
This is a small project for fun; external libraries are overkill for that, IMO. Also, I can't use ImageIO, since it can't handle webms, some pngs/jpgs, or animated gifs.
EDIT:
This question was based on the assumption that concurrent writing is possible. However, it doesn't seem like that is the case. I'll probably get the image links concurrently and then download them one after another. Thanks for the answers anyways!
The short answer is: from the memory usage perspective, the best solution is the version that reads and writes the data in chunks.
The buffer size should basically be chosen taking into account the number of simultaneous downloads, the available memory, the download speed, and the efficiency of the target drive in terms of transfer rate and IOPS.
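To make the short answer concrete, here is a self-contained sketch of the chunked copy; the class and method names are mine, and returning the byte count is my own addition for convenience:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ChunkedCopy {
    // Copies the stream in fixed-size chunks; only bufSize bytes are
    // ever held in memory per transfer, regardless of the file size.
    static long copy(InputStream in, OutputStream out, int bufSize) throws IOException {
        byte[] buffer = new byte[bufSize];
        long total = 0;
        int read;
        while ((read = in.read(buffer, 0, bufSize)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

With a few dozen parallel downloads, the worst-case extra memory is roughly (number of threads) x bufSize, which is easy to budget for.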
The long answer is that concurrent download of files doesn't necessarily mean the download will be faster.
Whether simultaneous downloads actually speed up the overall download time mostly depends on:
the number of hosts you're downloading from
the internet connection speed of the host you're downloading from, limited by the speed of that host's network adapter
the speed of your own internet connection, limited by the speed of your machine's network adapter
the IOPS of the storage on the host you're downloading from
the IOPS of the storage you're downloading onto
the transfer rate of the storage on the host you're downloading from
the transfer rate of the storage you're downloading onto
the performance of the local and remote hosts; for instance, an older or low-cost Android device could be limited by its CPU speed
For instance, it could turn out that if the source host has a single HDD and a single connection already gives the full connection speed, then using multiple connections is pointless, as it would make the download slower by creating the overhead of switching between transferred files.
It could also be that the source host has a speed limit per connection, in which case multiple connections could speed things up.
An HDD usually has an IOPS value around 80 and a transfer rate of about 80 MB/s, and these can limit the speed of download/upload. So practically you can't read or write more than about 80 files per second on such a disk, nor exceed the transfer limit of around 80 MB/s; of course, this depends heavily on the disk model.
An SSD usually has tens of thousands of IOPS and a transfer rate above 400 MB/s, so the limits are much higher, but for really fast internet connections they still matter.
I found a time-based (hence performance-oriented) comparison here: journaldev.com/861/java-copy-file
However, if you are focused on memory, you could measure the memory consumption yourself using something like the code proposed by #pasha701 here:
Runtime runtime = Runtime.getRuntime();
long usedMemoryBefore = runtime.totalMemory() - runtime.freeMemory();
System.out.println("Used Memory before: " + usedMemoryBefore);
// copy file method here
long usedMemoryAfter = runtime.totalMemory() - runtime.freeMemory();
System.out.println("Memory increased:" + (usedMemoryAfter-usedMemoryBefore));
Note that these values are in bytes; divide by 1000000 to get values in MB.
Related
Currently I am using this code on both the server and client side. The client is an Android device.
BufferedOutputStream os = new BufferedOutputStream(socket.getOutputStream(),10000000);
BufferedInputStream sin = new BufferedInputStream(socket.getInputStream(),10000000);
os.write("10000000\n".getBytes());
os.flush();
for (int i =0;i<10000000;i++){
os.write((sampleRead[i]+" ").getBytes());
}
os.flush();
The problem is that this code takes about 80 seconds to transfer data from the Android client to the server, while it takes only 8 seconds to transfer the data back from the server to the client. The code is the same on both sides, and the buffer is also the same. I also tried different buffer sizes, but the problem is with this segment:
for (int i =0;i<10000000;i++){
os.write((sampleRead[i]+" ").getBytes());
}
The buffering takes most of the time, while the actual transfer takes only about 6-7 seconds on a 150 Mbps hotspot connection. What could be the problem, and how can I solve it?
First of all, as a commenter has already noted, using a monstrously large buffer is likely to be counterproductive. Once your stream buffer is bigger than the size of a network packet, app-side buffering loses its effectiveness. (The data in your "big" buffer needs to be split into packet-sized chunks by the TCP/IP stack before it goes onto the network.) Indeed, if the app-side buffer is really large, you may find that your data gets stuck in the buffer for a long time waiting for the buffer to fill ... while the network is effectively idle.
(The Buffered... readers, writers and streams are primarily designed to avoid lots of syscalls that transfer tiny amounts of data. Above 10K or so, the buffering doesn't help performance much.)
The other thing to note is that in a lot of OS environments, the network throughput is actually limited by virtualization and default network stack tuning parameters. To get better throughput, you may need to tune at the OS level.
Finally, if your data is going over a network path that is congested, has high end-to-end latency, or has links with a constrained data rate, then you are unlikely to get fast transfers no matter how you tune things.
(Compression might help ... if you can afford the CPU overhead at both ends ... but some data links already do compression transparently.)
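Concretely, the per-sample writes in the question's loop can be batched into a single write. A minimal sketch, assuming sampleRead is an int[] as in the question (the helper name writeSamples is my own):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class BatchedWriter {
    // Instead of one tiny write() call per sample, format all samples
    // into a single byte[] and hand it to the stream in one call. The
    // formatting (number followed by a space) matches the question's code.
    static void writeSamples(int[] sampleRead, OutputStream os) throws IOException {
        StringBuilder sb = new StringBuilder(sampleRead.length * 4);
        for (int sample : sampleRead) {
            sb.append(sample).append(' ');
        }
        os.write(sb.toString().getBytes(StandardCharsets.UTF_8));
        os.flush();
    }
}
```

This turns 10,000,000 small writes into a handful of large ones, which is where most of the 80 seconds was probably going.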
You could compress the data transfer; it will save a lot of memory, and transferring a compressed stream of data is cheaper. For that you need to implement compression logic on the client side and decompression logic on the server side; see GZIPInputStream. Also try reducing the buffer size; it is huge for a mobile device.
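A minimal sketch of that compression idea using the JDK's built-in GZIP streams (the helper names are illustrative; in the real app you would wrap the socket streams rather than byte arrays):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipTransfer {
    // Client side: compress the payload before it goes on the wire.
    static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(baos)) {
            gz.write(data);
        }
        return baos.toByteArray();
    }

    // Server side: decompress what arrived.
    static byte[] decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        }
    }
}
```

Space-separated decimal samples are highly repetitive, so they compress well; the trade-off is the extra CPU cost on the mobile device.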
I'm creating an app which communicates with an ECG monitor. Data is read at a rate of 250 samples per second. Each package from the ECG monitor contains 80 bytes, and this is received 40 times per second.
I've tried using a RandomAccessFile, but packages were lost in both sync (RandomAccessFile(outputFile, "rws")) and async (RandomAccessFile(outputFile, "rw")) mode.
In a recent experiment I tried using a MappedByteBuffer. This should be extremely performant, but when I create the buffer I have to specify a size, e.g. map(FileChannel.MapMode.READ_WRITE, 0, 10485760) for a 10 MB buffer, and this results in a file that's always 10 MB in size. Is it possible to use a MappedByteBuffer where the file size is only the actual amount of data stored?
Or is there another way to achieve this? Is it naive to write to a file this often?
On a side note this wasn't an issue at all on iOS - this can be achieved with no buffering at all.
I have a program in Java that creates a log file about 1K in size. If I run a test that deletes the old log, and creates a new log, then saves it, repeated a million times, if the size of the file grows over time (up to a few mb's), will I risk damage to my SSD? Is there a size limit for the log file that could avoid this risk, or can anyone help me understand the mechanics of the risk?
In the case of constantly opening/closing the same file with a gradually increasing file size, there are two protection mechanisms, at the File System and SSD levels, that will prevent early disk failure.
First, on every file delete, the File System will issue a Trim (aka Discard, aka Logical Erase) command to the SSD. The Trim address range will cover the entire size of the deleted file. Trim greatly helps the SSD reclaim free space for new data. Using Trim in combination with Writes when accessing the same data range is the best operational mode for an SSD in terms of preserving its endurance. Just make sure that your OS has Trim enabled (usually it is by default). All modern SSDs should support it as well. An important note: Trim is a logical erase; it does not initiate an immediate Physical Media erase. The Physical Erase will happen later, indirectly, as part of the SSD's internal Garbage Collection.
Second, when accessing the same file, the File System will most likely issue Writes to the SSD at the same addresses; only the amount of writes will grow as the file size grows. Such a pattern is known as Hot Range access. It is a nasty pattern for an SSD in terms of endurance. The SSD has to allocate free resources (physical pages) on every file write, but the lifetime of the data is very short, as the data is deleted almost immediately. Overall, the amount of unique data on the SSD's Physical Media is very low, but the amount of allocated and processed resources (physical pages) is huge. Modern SSDs have protection from Hot Range access: they use Physical Media units in a round-robin manner, which evens out the wear.
I advise monitoring the SSD's SMART health data (the lifetime-left parameter), for example by using https://www.smartmontools.org/ or software provided by the SSD vendor. It will help you see how your access pattern affects endurance.
Like with any file, if the disk doesn't contain enough space to write to a file, the OS (or Java) won't allow the file to be written until space is cleared. The only way you can "screw up" a disk in this manner is if you mess around with addresses at the kernel level.
Following this thread.
Streaming large files in a java servlet.
Is it possible to find the total internet bandwidth available on the current machine through Java?
What I am trying to do is: while streaming large files through a servlet, based on the number of parallel requests and the total bandwidth, reduce the BUFFER_SIZE of the stream for each request. Does that make sense?
Is there any pure java way? (without JNI)
Maybe you can time how long the app needs to send one packet (the buffer), and if that takes longer than x milliseconds, make your buffer smaller. You can use other values for the original bufferSize and for the if (stop - start > 700) threshold.
This is based on the thread you noticed:
ServletOutputStream out = response.getOutputStream();
InputStream in = [ code to get source input stream ];
String mimeType = [ code to get mimetype of data to be served ];
int bufferSize = 1024 * 4;
byte[] bytes = new byte[bufferSize];
int bytesRead;
response.setContentType(mimeType);
while ((bytesRead = in.read(bytes)) != -1) {
    long start = System.currentTimeMillis();
    out.write(bytes, 0, bytesRead);
    long stop = System.currentTimeMillis();
    // halve the buffer when a write is slow, but keep a sane minimum
    // so the buffer can never shrink to zero
    if (stop - start > 700 && bufferSize > 1024)
    {
        bufferSize /= 2;
        bytes = new byte[bufferSize];
    }
}
// do the following in a finally block:
in.close();
out.close();
The only way to find available bandwidth is to monitor/measure it. On Windows you have access to Net.exe and can get the throughput on each NIC.
If you're serving the content through a servlet, then you could calculate how fast each servlet output stream is going. Collect that data for all streams for a user/session, and you could determine at least what the current bandwidth usage is.
A possible way to calculate the rate: instead of writing the large files directly through the servlet output stream, write to a new FilterOutputStream that keeps track of your download rates.
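A minimal sketch of such a rate-tracking wrapper (the class name and accessor methods are my own invention):

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// A FilterOutputStream that counts bytes so the caller can compute
// an observed transfer rate: bytes written divided by elapsed time.
public class RateTrackingOutputStream extends FilterOutputStream {
    private long bytesWritten = 0;
    private final long startNanos = System.nanoTime();

    public RateTrackingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len); // bypass FilterOutputStream's byte-at-a-time default
        bytesWritten += len;
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        bytesWritten++;
    }

    public long getBytesWritten() { return bytesWritten; }

    public double getBytesPerSecond() {
        double seconds = (System.nanoTime() - startNanos) / 1e9;
        return seconds > 0 ? bytesWritten / seconds : 0;
    }
}
```

Summing getBytesPerSecond() across all active streams gives the current outbound bandwidth usage for the server.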
The concept of "total internet bandwidth available in current machine" is really hard to define. However, tweaking the local buffer size will not affect how much data you can push through to an individual client.
The rate at which a given client can take data from your server will vary with the client, and with time. For any given connection, you might be limited by your local upstream connection to the Internet (e.g., server on DSL) or you might be limited somewhere in the core (unlikely) or the remote end (e.g., server in a data center, client on a dialup line). When you have many connections, each individual connection may have a different bottleneck. Measuring this available bandwidth is a hard problem; see for example this list of research and tools on the subject.
In general, TCP will handle using all the available bandwidth fairly for any given connection (though sometimes it may react to changes in available bandwidth slower than you like). If the client can't handle more data, the write call will block.
You should only need to tweak the buffer size in the linked question if you find that you are seeing low bandwidth and the cause is insufficient data buffered for writing to the network. Another reason to tweak the buffer size is if you have so many active connections that you are running low on memory.
In any case, the real answer may be to not buffer at all but instead put your static files on a separate server and use something like thttpd to serve them (using a system call like sendfile) instead of a servlet. This helps ensure that the bottleneck is not on your server, but somewhere out in the Internet, beyond your control.
EDIT: Re-reading this, it's a little muddled because it's late here. Basically, you shouldn't have to do this from scratch; use one of the existing highly scalable java servers, since they'll do it better and easier.
You're not going to like this, but it actually doesn't make sense, and here's why:
Total bandwidth is independent of the number of connections (though there is some small overhead), so messing with buffer sizes won't help much
Your chunks of data are being broken into variable-sized packets anyway. Your network card and protocol will deal with this better than your servlet can
Resizing buffers regularly is expensive -- far better to re-use constant buffers from a fixed-size pool and have all connections queue up for I/O rights
There are a billion and a half libraries that assist with this sort of server
Were this me, I would start looking at multiplexed I/O using NIO. You can almost certainly find a library to do this for you. The IBM article here may be a useful starting point.
I think the smart money gives you one network I/O thread, and one disk I/O thread, with multiplexing. Each connection requests a buffer from a pool, fills it with data (from a shared network or disk Stream or Channel), processes it, then returns the buffer to the pool for re-use. No re-sizing of buffers, just a bit of a wait for each chunk of data. If you want latency to stay short, then limit how many transfers can be active at a time, and queue up the others.
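The fixed-size buffer pool described above can be sketched with the JDK's blocking queue (the class name and direct-buffer choice are my own assumptions):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A fixed pool of reusable buffers: connections block waiting for a
// buffer instead of allocating (and resizing) their own, so memory
// use is bounded and no per-request allocation churn occurs.
public class BufferPool {
    private final BlockingQueue<ByteBuffer> pool;

    public BufferPool(int buffers, int bufferSize) {
        pool = new ArrayBlockingQueue<>(buffers);
        for (int i = 0; i < buffers; i++) {
            pool.add(ByteBuffer.allocate(bufferSize));
        }
    }

    public ByteBuffer acquire() throws InterruptedException {
        return pool.take(); // blocks if all buffers are in use
    }

    public void release(ByteBuffer buf) {
        buf.clear();        // reset position/limit for the next user
        pool.add(buf);
    }
}
```

The blocking take() is also what implements the "queue up for I/O rights" behavior: when the pool is exhausted, additional transfers simply wait their turn.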
I am building a java server that needs to scale. One of the servlets will be serving images stored in Amazon S3.
Recently under load, I ran out of memory in my VM and it was after I added the code to serve the images so I'm pretty sure that streaming larger servlet responses is causing my troubles.
My question is : is there any best practice in how to code a java servlet to stream a large (>200k) response back to a browser when read from a database or other cloud storage?
I've considered writing the file to a local temp drive and then spawning another thread to handle the streaming so that the tomcat servlet thread can be re-used. This seems like it would be io heavy.
Any thoughts would be appreciated. Thanks.
When possible, you should not store the entire contents of a file to be served in memory. Instead, acquire an InputStream for the data and copy the data to the Servlet OutputStream in pieces. For example:
ServletOutputStream out = response.getOutputStream();
InputStream in = [ code to get source input stream ];
String mimeType = [ code to get mimetype of data to be served ];
final int FILEBUFFERSIZE = 8 * 1024; // any modest chunk size works
byte[] bytes = new byte[FILEBUFFERSIZE];
int bytesRead;
response.setContentType(mimeType);
while ((bytesRead = in.read(bytes)) != -1) {
out.write(bytes, 0, bytesRead);
}
// do the following in a finally block:
in.close();
out.close();
I do agree with toby, you should instead "point them to the S3 url."
As for the OOM exception, are you sure it has to do with serving the image data? Let's say your JVM has 256MB of "extra" memory to use for serving image data. With Google's help, "256MB / 200KB" = 1310. For 2GB "extra" memory (these days a very reasonable amount) over 10,000 simultaneous clients could be supported. Even so, 1300 simultaneous clients is a pretty large number. Is this the type of load you experienced? If not, you may need to look elsewhere for the cause of the OOM exception.
Edit - Regarding:
In this use case the images can contain sensitive data...
When I read through the S3 documentation a few weeks ago, I noticed that you can generate time-expiring keys that can be attached to S3 URLs. So, you would not have to open up the files on S3 to the public. My understanding of the technique is:
Initial HTML page has download links to your webapp
User clicks on a download link
Your webapp generates an S3 URL that includes a key that expires in, let's say, 5 minutes.
Send an HTTP redirect to the client with the URL from step 3.
The user downloads the file from S3. This works even if the download takes more than 5 minutes - once a download starts it can continue through completion.
Why wouldn't you just point them to the S3 url? Taking an artifact from S3 and then streaming it through your own server to me defeats the purpose of using S3, which is to offload the bandwidth and processing of serving the images to Amazon.
I've seen a lot of code like john-vasilef's (currently accepted) answer, a tight while loop reading chunks from one stream and writing them to the other stream.
The argument I'd make is against needless code duplication, in favor of using Apache's IOUtils. If you are already using it elsewhere, or if another library or framework you're using is already depending on it, it's a single line that is known and well-tested.
In the following code, I'm streaming an object from Amazon S3 to the client in a servlet.
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.io.IOUtils;
InputStream in = null;
OutputStream out = null;
try {
in = object.getObjectContent();
out = response.getOutputStream();
IOUtils.copy(in, out);
} finally {
IOUtils.closeQuietly(in);
IOUtils.closeQuietly(out);
}
6 lines of a well-defined pattern with proper stream closing seems pretty solid.
toby is right, you should be pointing straight to S3, if you can. If you cannot, the question is a little vague to give an accurate response:
How big is your java heap? How many streams are open concurrently when you run out of memory?
How big is your read/write buffer (8K is good)?
You are reading 8K from the stream, then writing 8K to the output, right? You are not trying to read the whole image from S3, buffer it in memory, and then send the whole thing at once?
If you use 8K buffers, you could have 1000 concurrent streams going in ~8Megs of heap space, so you are definitely doing something wrong....
BTW, I did not pick 8K out of thin air; it is the default size for socket buffers. Send more data, say 1 Meg, and you will be blocking on the TCP/IP stack while holding a large amount of memory.
I agree strongly with both toby and John Vasileff: S3 is great for offloading large media objects if you can tolerate the associated issues. (An instance of our own app does that for 10-1000 MB FLVs and MP4s.) E.g.: no partial requests (byte range header), though; one has to handle that 'manually', plus occasional downtime, etc.
If that is not an option, John's code looks good. I have found that a byte buffer of 2k for FILEBUFFERSIZE is the most efficient in microbenchmarks. Another option might be a shared FileChannel. (FileChannels are thread-safe.)
That said, I'd also add that guessing at what caused an out of memory error is a classic optimization mistake. You would improve your chances of success by working with hard metrics.
Place -XX:+HeapDumpOnOutOfMemoryError into your JVM startup parameters, just in case
use jmap on the running JVM (jmap -histo <pid>) under load
Analyze the metrics (the jmap -histo output, or have jhat look at your heap dump). It may very well be that your out-of-memory error is coming from somewhere unexpected.
There are of course other tools out there, but jmap & jhat come with Java 5+ 'out of the box'
I've considered writing the file to a local temp drive and then spawning another thread to handle the streaming so that the tomcat servlet thread can be re-used. This seems like it would be io heavy.
Ah, I don't think you can do that. And even if you could, it sounds dubious. The tomcat thread that is managing the connection needs to stay in control. If you are experiencing thread starvation, then increase the number of available threads in ./conf/server.xml. Again, metrics are the way to detect this; don't just guess.
Question: Are you also running on EC2? What are your tomcat's JVM start up parameters?
You have to check two things:
Are you closing the stream? Very important
Maybe you're giving out stream connections "for free". A single stream is not large, but many, many streams at the same time can steal all your memory. Create a pool so that you cannot have more than a certain number of streams running at the same time.
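One simple way to enforce such a cap is a counting semaphore; a minimal sketch (the class and method names are mine, not from any particular library):

```java
import java.util.concurrent.Semaphore;

// Caps the number of simultaneously open streams; further requests
// block until an earlier stream is closed and its permit released.
public class StreamLimiter {
    private final Semaphore permits;

    public StreamLimiter(int maxConcurrentStreams) {
        permits = new Semaphore(maxConcurrentStreams);
    }

    public void beforeOpen() throws InterruptedException {
        permits.acquire(); // blocks once the limit is reached
    }

    public void afterClose() {
        permits.release();
    }

    public int available() {
        return permits.availablePermits();
    }
}
```

Call beforeOpen() before acquiring an InputStream and afterClose() in the finally block where the stream is closed; with 8K buffers and, say, 500 permits, the streaming memory is bounded at roughly 4 MB.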
In addition to what John suggested, you should repeatedly flush the output stream. Depending on your web container, it is possible that it caches parts or even all of your output and flushes it all at once (for example, to calculate the Content-Length header). That would burn quite a bit of memory.
If you can structure your files so that the static files are separate and in their own bucket, the fastest performance today can likely be achieved by using the Amazon S3 CDN, CloudFront.