I am developing a web application which reads very large files (> 50 MB) and displays their content. The Java Spring backend reads these files and serves their content over CXF. My problem is that after it reads a 50 megabyte file, the used heap grows by about 500 megabytes. I read the file as a String, and this String is sent to the frontend. Are there any ideas or tricks for reducing the Java heap usage? I tried NIO and Spring's Resource class, but nothing helped.
A dirty way to do this is to have the Spring @Controller method accept an OutputStream or Writer argument; the framework will supply the raw output stream of the HTTP response and you can write directly into it. This sidesteps all the nice logic of content type management and so on.
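A minimal sketch of that approach (the mapping and file path are made up):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;

@Controller
public class LargeFileController {

    // Spring injects the raw response OutputStream; the file is copied through
    // a small fixed-size buffer, so heap usage stays flat regardless of file size.
    @RequestMapping("/download")
    public void download(OutputStream out) throws IOException {
        try (InputStream in = new FileInputStream("/path/to/large-file.txt")) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}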
A better way is to define a custom type which will be returned from the controller method and a matching HttpMessageConverter which will (for example) use the information in that object to open the appropriate file and write its contents into the output stream.
In both cases you will not read the file into RAM; you'll use a single, fixed-size buffer to transfer the data directly from the disk to the HTTP response.
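A rough sketch of the converter approach, assuming a hypothetical FileDownload wrapper type returned from the controller (registering the converter with Spring is omitted here):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.springframework.http.HttpInputMessage;
import org.springframework.http.HttpOutputMessage;
import org.springframework.http.MediaType;
import org.springframework.http.converter.AbstractHttpMessageConverter;

// Hypothetical value type returned from the controller method.
class FileDownload {
    private final File file;
    FileDownload(File file) { this.file = file; }
    File getFile() { return file; }
}

// Converter that streams the referenced file into the response body.
class FileDownloadMessageConverter extends AbstractHttpMessageConverter<FileDownload> {

    FileDownloadMessageConverter() {
        super(MediaType.APPLICATION_OCTET_STREAM);
    }

    @Override
    protected boolean supports(Class<?> clazz) {
        return FileDownload.class.equals(clazz);
    }

    @Override
    protected FileDownload readInternal(Class<? extends FileDownload> clazz, HttpInputMessage input) {
        throw new UnsupportedOperationException("downloads only");
    }

    @Override
    protected void writeInternal(FileDownload download, HttpOutputMessage output) throws IOException {
        try (InputStream in = new FileInputStream(download.getFile())) {
            byte[] buffer = new byte[8192];   // single fixed-size buffer, nothing else on the heap
            int read;
            while ((read = in.read(buffer)) != -1) {
                output.getBody().write(buffer, 0, read);
            }
        }
    }
}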
I know that in BIO, the web container receives the request and wraps it in an HttpServletRequest object, from which we can get the headers and other data.
I think the HTTP message has already been copied into user space. So why is the request body still exposed as an InputStream? Can anybody explain that? Thanks a lot!
HttpServletRequest is an interface from the servlet specification. There are many servlet container implementations, including Jetty, Tomcat and WebSphere to name a few. Each has its own implementation of HttpServletRequest.
Using an InputStream gives freedom to the servlet implementation to source the request body value from wherever it likes. One implementation might use a local file and FileInputStream, another may have a byte array in memory and use ByteArrayInputStream, another might source from a cache or database etc.
Another benefit of an InputStream over a byte array is that you can stream over it while holding only a small chunk in memory at a time. There's no requirement to have a large byte array (e.g. gigabytes) in memory all at once.
Imagine a video sharing site where each user can upload 1 GB videos. If the servlet spec imposed a byte array instead of an InputStream, then a server with 8 GB of RAM could only support 8 concurrent uploads. With an InputStream you only need a small buffer of RAM per upload, so you could support hundreds or thousands of concurrent 1 GB uploads.
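For illustration, a servlet can drain such an upload through a small buffer so that memory use stays constant regardless of the body size (the destination path is a placeholder):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class UploadServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response) throws IOException {
        try (InputStream in = request.getInputStream();
             OutputStream out = new FileOutputStream("/tmp/upload.bin")) { // hypothetical destination
            byte[] buffer = new byte[8192];   // constant memory cost, no matter how large the body is
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}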
I have a Java project with a huge set of XML files (>500). Reading these files at runtime leads to performance issues.
Is there an option to load all the XML files to RAM and read from there instead of the disk?
I know there are products like RamDisk but this one is a commercial tool.
Can I copy XML files to main memory and read from main memory using any existing Java API / libraries?
I would first try memory-mapped files, as provided by RandomAccessFile and FileChannel in the standard Java library. This way the OS will be able to keep the frequently used file content in memory, effectively achieving what you want.
You can use an in-memory database to store the intermediate files (the XML files). This gives you the speed of RAM together with the convenience of a database.
For reference, see the following links:
http://www.mcobject.com/in_memory_database
Usage of H2 as an in-memory database:
http://www.javatips.net/blog/2014/07/h2-in-memory-database-example
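For example, with the H2 driver on the classpath (an assumption), an in-memory database can hold the XML documents as CLOBs:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class InMemoryXmlStore {
    public static void main(String[] args) throws SQLException {
        // "mem:" keeps the database entirely in RAM; DB_CLOSE_DELAY=-1 keeps it alive between connections.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:xmlcache;DB_CLOSE_DELAY=-1")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE xml_files(name VARCHAR(255) PRIMARY KEY, content CLOB)");
            }
            try (PreparedStatement ps = conn.prepareStatement("INSERT INTO xml_files VALUES (?, ?)")) {
                ps.setString(1, "example.xml");
                ps.setString(2, "<root><value>42</value></root>");
                ps.executeUpdate();
            }
            try (PreparedStatement ps = conn.prepareStatement("SELECT content FROM xml_files WHERE name = ?")) {
                ps.setString(1, "example.xml");
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(rs.getString(1));   // read back from RAM, not from disk
                    }
                }
            }
        }
    }
}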
Use java.io.RandomAccessFile class. It behaves like a large array of bytes stored in the file system. Instances of this class support both reading and writing to a random access file.
Also, I would suggest using a memory-mapped file (MappedByteBuffer) so that you can access the file contents directly instead of copying the whole file onto the heap.
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
// The file is opened read-only, so the mapping mode must be READ_ONLY as well.
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, 1024 * 50);
And then you can read the buffer as usual.
Have you considered creating an object structure for these files and serializing it? Java object serialization and deserialization is much faster than parsing XML. This again assumes that these 500 or so XML files don't get modified between reads.
Here is an article which talks about serializing and deserializing.
If the concern is loading file content into memory, then consider the ByteArrayInputStream and ByteArrayOutputStream classes, or even ByteBuffer; these can hold the bytes in memory.
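For instance, a minimal sketch of caching a file in memory and re-reading it from a ByteArrayInputStream (the path is a placeholder):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class InMemoryFile {
    public static void main(String[] args) throws IOException {
        byte[] content = Files.readAllBytes(Paths.get("config/example.xml")); // hit the disk once
        // Re-reading now touches only RAM, not the disk.
        try (InputStream in = new ByteArrayInputStream(content)) {
            // hand "in" to any XML parser that accepts an InputStream
            System.out.println("cached " + content.length + " bytes");
        }
    }
}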
Java object serialization/deserialization is not faster than XML writing and parsing in general. When large numbers of objects are involved Java serialization/deserialization can actually be very inefficient, because it tracks each individual object (so that repeated references aren't serialized more than once). This is great for networks of objects, but for simple tree structures it adds a lot of overhead with no gains.
Your best approach is probably to just use a fast technique for processing the XML (such as javax.xml.stream.XMLStreamReader). Unless the files are huge, a 30-40 second load time for the XML files is way out of line; you're probably using an inefficient approach to processing the XML, such as loading them into a DOM. You can also try reading multiple files in parallel (for example with Java 8 parallel Streams).
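A sketch of that combination, assuming the files sit in a single directory:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class ParallelXmlScan {
    public static void main(String[] args) throws Exception {
        try (Stream<Path> files = Files.list(Paths.get("data/xml"))) {   // hypothetical directory
            files.parallel().forEach(path -> {
                try (InputStream in = Files.newInputStream(path)) {
                    // Pull-parse the document without building a DOM.
                    XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(in);
                    while (reader.hasNext()) {
                        reader.next();   // inspect events here as needed
                    }
                    reader.close();
                } catch (Exception e) {
                    throw new RuntimeException(path.toString(), e);
                }
            });
        }
    }
}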
It looks like your main issue is the large number of files, and that RAM is not a constraint. Can you confirm?
Could you add a preprocessing step that concatenates all these files, using some kind of separator, into one big file? That way you can increase the block size of your reads and avoid the performance penalty of disk seeks.
Have you thought about compressing the XML files and reading in those compressed XML files? Compressed XML can be as little as 3-5% of the original size, or better. You can decompress a file when it needs to be shown to users and store it compressed again for further reading.
Here is a library I found that might help:
zip4j
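If you'd rather avoid a dependency, the standard library's GZIP streams cover the same compress-on-write, decompress-on-read cycle (a sketch; zip4j offers a richer API on top of the same idea):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class XmlCompression {
    static byte[] compress(String xml) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();   // typically a small fraction of the original size
    }

    static String decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] packed = compress("<root><value>42</value></root>");
        System.out.println(packed.length + " bytes compressed, round-trips to: " + decompress(packed));
    }
}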
It all depends on whether you read the data more than once or not.
Assume we use some sort of Java-based RAM disk (in practice it would be some kind of buffer or byte array).
Further assume that processing the data takes less time than reading it from disk. You still have to read the data at least once, so it makes no difference whether you first copy it from disk to memory and process it from there, or process it as you read it.
If you read a file more than once, you could read all the files into memory (there are various options: buffers, byte arrays, a custom FileSystem, ...).
In case processing takes longer than reading (which does not seem to be the case here), you could pre-fetch the files from disk on a separate thread and process the data from memory on another thread.
I am using Spring-MVC and I need to send a MP4 file back to the user. The MP4 files are, of course, very large in size (> 2 GB).
I found this SO thread Downloading a file from spring controllers, which shows how to stream back a binary file, which should theoretically work for my case. However, what I am concerned about is efficiency.
In one case, an answer implies loading all the bytes into memory.
byte[] data = SomeFileUtil.loadBytes(new File("somefile.mp4"));
In another case, an answer suggests using IOUtils.
InputStream is = new FileInputStream(new File("somefile.mp4"));
OutputStream os = response.getOutputStream();
IOUtils.copy(is, os);
I wonder if either of these are more memory efficient than simply defining a resource mapping?
<resources mapping="/videos/**" location="/path/to/videos/"/>
The resource mapping may work, except that I need to protect all requests to these videos, and I do not think resource mapping will lend itself to logic that protects the content.
Is there another way to stream back binary data (namely, MP4)? I'd like something that's memory efficient.
I would think that defining a resource mapping would be the cleanest way of handling it. With regards to protecting access, you can simply add /videos/** to your security configuration and define what access you allow for it via something like
<security:intercept-url pattern="/videos/**" access="ROLE_USER, ROLE_ADMIN"/>
or whatever access you desire.
Also, you might consider saving these large MP4s to cloud storage and/or a CDN such as Amazon S3 (with or without CloudFront).
Then you can generate unique URLs which last as long as you want them to. The download is then handled by Amazon rather than consuming the computing power, disk space, and memory of your web server to serve up the large resource files. Also, if you use something like CloudFront, you can configure it for streaming rather than download.
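A sketch of the expiring-URL idea using the AWS SDK for Java v1 (the bucket and key names are placeholders):

import java.net.URL;
import java.util.Date;
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

public class PresignedVideoUrl {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        Date expiration = new Date(System.currentTimeMillis() + 60 * 60 * 1000); // valid for one hour
        GeneratePresignedUrlRequest request =
                new GeneratePresignedUrlRequest("my-video-bucket", "videos/somefile.mp4")
                        .withMethod(HttpMethod.GET)
                        .withExpiration(expiration);
        URL url = s3.generatePresignedUrl(request);
        System.out.println(url); // hand this URL to the authenticated user; S3 serves the bytes
    }
}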
Loading the entire file into memory is worse: it uses more memory, it doesn't scale, and you don't transmit any data until you've loaded it all, which adds latency.
I am trying to send some very large files (>200MB) through an Http output stream from a Java client to a servlet running in Tomcat.
My protocol currently packages the file contents in a byte[], which is placed in a Map<String, Object> along with some metadata (filename, etc.), each part under a "standard" key ("FILENAME" -> "Foo", "CONTENTS" -> byte[], "USERID" -> 1234, etc.). The Map is written to the URL connection output stream (urlConnection.getOutputStream()). This works well when the file contents are small (<25MB), but I am running into Tomcat memory issues (OutOfMemoryError) when the file size is very large.
I thought of sending the metadata Map first, followed by the file contents, and finally by a checksum on the file data. The receiver servlet can then read the metadata from its input stream, then read bytes until the entire file is finished, finally followed by reading the checksum.
Would it be better to send the metadata in connection headers? If so, how? If I send the metadata down the socket first, followed by the file contents, is there some kind of standard protocol for doing this?
You will almost certainly want to use a multipart POST to send the data to the server. Then on the server you can use something like commons-fileupload to process the upload.
The good thing about commons-fileupload is that it understands that the server may not have enough memory to buffer large files and will automatically stream the uploaded data to disk once it exceeds a certain size, which is quite helpful in avoiding OutOfMemoryError type problems.
Otherwise you are going to have to implement something comparable yourself. It doesn't really make much difference how you package and send your data, so long as the server can (1) parse the upload and (2) redirect the data to a file so that it never has to buffer the entire request in memory at once. As mentioned, both of these come free if you use commons-fileupload, so that's definitely what I'd recommend.
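A sketch of the receiving side with commons-fileupload's streaming API, which never buffers a whole file in memory (the destination directory is a placeholder):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.commons.fileupload.FileItemIterator;
import org.apache.commons.fileupload.FileItemStream;
import org.apache.commons.fileupload.servlet.ServletFileUpload;

public class ReceiveUploadServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response) throws IOException {
        try {
            FileItemIterator items = new ServletFileUpload().getItemIterator(request);
            while (items.hasNext()) {
                FileItemStream item = items.next();
                if (item.isFormField()) {
                    continue; // metadata fields such as FILENAME or USERID would be read here
                }
                try (InputStream in = item.openStream();
                     OutputStream out = new FileOutputStream(new File("/tmp/uploads", item.getName()))) {
                    byte[] buffer = new byte[8192];
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                }
            }
        } catch (Exception e) {
            throw new IOException("upload failed", e);
        }
    }
}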
I don't have a direct answer for you but you might consider using FTP instead. Apache Mina provides FTPLets, essentially servlets that respond to FTP events (see http://mina.apache.org/ftpserver/ftplet.html for details).
This would allow you to push your data in any format without requiring the receiving end to accommodate the entire data in memory.
I'm writing a web application and want the user to be able to click a link and get a file download.
I have an interface in a third-party library that I can't alter:
writeFancyData(File file, Data data);
Is there an easy way that I can create a file object that I can pass to this method that when written to will stream to the HTTP response?
Notes:
Obviously I could just write a temporary file, read it back in, and then write it to the output stream of the HTTP response. However, what I'm looking for is a way to avoid the file system IO, ideally by creating a fake file that, when written to, instead writes to the output stream of the HTTP response.
e.g.
writeFancyData(new OutputStreamBackedFile(response.getOutputStream()), data);
I need to use the writeFancyData method as it writes a file in a very specific format that I can't reproduce.
Assuming writeFancyData is a black box, it's not possible. As a thought experiment, consider an implementation of writeFancyData that did something like this:
public void writeFancyData(File file, Data data) {
    File localFile = new File(file.getPath());
    ...
    // process data from the file
    ...
}
Given that the only useful thing a subclass of File can return is the path name, you're just not going to be able to get the data you want into that method. If the signature included some sort of stream, you would be in a much better position, but since all you can pass in is a File, this can't be done.
In practice the implementation is probably one of the FileInputStream or FileReader classes, which use the File object only for its name and then call out to native methods to obtain a file descriptor and handle the actual I/O.
As dlawrence writes, given only the API it is impossible to determine what it is doing with the File.
A non-Java approach is to create a named pipe. You could establish a reader for the pipe in your program, create a File on that path and pass it to the API.
Before doing anything so fancy, I would recommend analyzing performance and verifying that disk I/O is indeed a bottleneck.
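If it is, here is a rough sketch of the named-pipe idea (POSIX only; the pipe path is made up, and the write at the end stands in for the third-party writeFancyData call):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class NamedPipeBridge {
    public static void main(String[] args) throws Exception {
        File pipe = new File("/tmp/fancydata.pipe");   // hypothetical path
        Runtime.getRuntime().exec(new String[]{"mkfifo", pipe.getAbsolutePath()}).waitFor();

        // Reader thread drains the pipe and forwards the bytes, e.g. to the HTTP response stream.
        Thread reader = new Thread(() -> {
            try (InputStream in = new FileInputStream(pipe)) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    System.out.write(buffer, 0, read);   // stand-in for response.getOutputStream()
                }
                System.out.flush();
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        reader.start();

        // In the real application this block would be: writeFancyData(pipe, data);
        try (OutputStream out = new FileOutputStream(pipe)) {
            out.write("fancy data".getBytes());
        }

        reader.join();
        pipe.delete();
    }
}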
Given that API, the best you can do is to give it the File for a file in a RAM disk filesystem.
And lodge a bug / defect report against the API asking for an overload that takes a Writer or OutputStream argument.