Copying a growing file in Java

I'm trying to work out how I can copy a growing file using Java. An example of what I would like to work is the following:
A file is downloaded from an HTTP server.
I initiate a file copy before the file has finished downloading
The copying begins, and doesn't end until the file is completely downloaded and everything has been copied
I have used the following code:
InputStream is = new FileInputStream(sourceFile);
OutputStream os = new FileOutputStream(targetFile);
byte[] buf = new byte[8192];
int num;
while ((num = is.read(buf)) != -1) {
    os.write(buf, 0, num);
}
But that only copies the content that has so far been downloaded, so I end up with a broken target file.
I have also tested using BufferedInputStream and BufferedOutputStream, but that didn't work either.
Is there any way to achieve what I want?
Thanks

If you are in control of the file download via HTTP, then you could download to a temporary file and rename it once the download has completed, thus making the operation atomic.
The alternative is for your file copy process to periodically check the size of the source file (the one still being downloaded) and to only initiate the copy once the size has stabilised and is no longer increasing. For example, you may elect to record the file size every second and only initiate the copy if the size remains constant for 3 successive poll attempts.
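A minimal sketch of that polling approach, with an arbitrary poll interval and stability threshold (assumed values, not from the original answer):
import java.io.File;

public class StableSizePoller {
    // Block until the file's size has been unchanged (and non-zero) for
    // stablePolls consecutive checks, intervalMillis apart.
    static void waitUntilStable(File file, int stablePolls, long intervalMillis)
            throws InterruptedException {
        long lastSize = -1;
        int unchanged = 0;
        while (unchanged < stablePolls) {
            long size = file.length();
            if (size > 0 && size == lastSize) {
                unchanged++;
            } else {
                unchanged = 0;
                lastSize = size;
            }
            Thread.sleep(intervalMillis);
        }
    }
}
For the example above you would call waitUntilStable(sourceFile, 3, 1000L) and only then start the copy. Note that this is a heuristic: a stalled download looks the same as a finished one.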

This is going to be tricky, since the copying process has no reliable way of knowing when the download has finished. You could look at whether the file is growing, but the download could stall for a period of time, and you could erroneously conclude that it has finished. If the download fails in the middle, you also have no way of knowing that you're looking at an incomplete file.
I think your best bet is to work with the downloading process. If you control it, you could modify it to store the file in the other location, or in both locations, or to move/rename it at the end, depending on your requirements.
If you don't control the downloading process, and it's a simple HTTP download, you could replace it with something that you do control.
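As an illustration of the download-to-temp-then-rename idea (the URL and file names are placeholders):
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class DownloadThenRename {
    public static void main(String[] args) throws Exception {
        Path part = Paths.get("download.part");
        Path target = Paths.get("download.bin");
        try (InputStream in = new URL("http://example.com/file.bin").openStream()) {
            // Write the whole download to a temporary file first...
            Files.copy(in, part, StandardCopyOption.REPLACE_EXISTING);
        }
        // ...then publish it with a single rename, so a copier watching the
        // final path only ever sees a complete file.
        Files.move(part, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
The copy process then simply waits for the final file to appear instead of guessing when the download is done.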

Related

Zipping large files (>15GB) and uploading to S3 without OOM

I have a memory issue while zipping large files/folders (resulting zip >15GB) and uploading them to S3 storage. I can create the zip file on disk, append files/folders to it, and upload that file in parts to S3. But in my experience that is not a good way to resolve this issue. Do you know any good patterns for zipping large files/folders and uploading them to S3 without memory issues (such as OOM)? Ideally I could append these files/folders directly to an already uploaded zip on S3.
What I have tried: zipping files/folders to disk and uploading that zip file in parts to S3.
👋
The main reason why you are getting an OOM is just because of how the deflate algorithm of zlib works.
Imagine this setup:
It starts to read the whole file by opening a readable stream.
It creates a temporary 0-byte output file from the start.
It then reads the data in chunks (the dictionary size), sends each chunk to the CPU for compression, and the compressed output is propagated back to RAM.
When it has finished with one fixed-size dictionary it moves on to the next, and so on until it reaches the end-of-file terminator.
After that, it grabs all the deflated (compressed) bytes from RAM and writes them to the actual file.
You can observe and deduce that behaviour by initiating a deflate operation; an example is below.
(The file is created and 372 MB is processed, but nothing is written to the file until the last byte has been processed.)
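A minimal sketch of such a deflate run (the file names are placeholders) that you can use to watch the output file while compression is in progress:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;

public class DeflateObserver {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream("big-input.bin");
             OutputStream out = new DeflaterOutputStream(
                     new FileOutputStream("big-input.bin.deflate"))) {
            byte[] buf = new byte[8192];
            int n;
            // Feed the input through deflate chunk by chunk and keep an eye
            // on the size of the output file while this loop is running.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}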
You could technically grab all of the parts, archive them AGAIN into a tar.gz, and then upload that to AWS as one file, but you may run into the same memory problem, this time on the uploading side.
Here are the file size limitations:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html
If you use the CLI you can technically do that; if you need or have to use the REST API, that's not an option for you, as the limit there is only 5 GB per request.
Also, you have not specified the maximum size, so if it's even larger than 160 GB that's not an option even using the AWS CLI (which takes care of releasing the memory after each uploaded chunk). So your best bet would be multipart upload.
https://docs.aws.amazon.com/cli/latest/reference/s3api/create-multipart-upload.html
All the best!
You can use AWS Lambda to zip your files for you before uploading them to an S3 bucket. You can even configure Lambda to be triggered and zip your files on upload. Here is a Java example of a Lambda function for zipping large files. This library is limited to 10 GB, but this can be overcome by using EFS.
Lambda’s ephemeral storage is limited to 10 GB, but you can attach EFS storage to handle larger files. The cost should be close to none if you delete the files after use.
Also, remember to use Multipart Upload when uploading files larger than 100 MB to S3. If you are using the SDK, it should handle this for you.
Zipping the file in one go is not really the right way to go about it. I think the better way is to break the problem down so that you don't load the whole data in one go, but read it byte by byte and send it to your destination byte by byte. This way you not only gain speed (~10x) but also address those OOMs.
Your destination could be a web endpoint on an EC2 instance or an API Gateway fronted web service, depending on your architectural choice.
So essentially part 1 of the solution is to STREAM: zip it byte by byte and send it to an HTTP endpoint. Part 2 might be to use the multipart upload interfaces from the AWS SDK (at your destination) and push it to S3 in parallel.
// Requires Apache Commons Compress on the classpath.
import java.io.BufferedOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;

Path source = Paths.get("abc.huge");
Path target = Paths.get("abc.huge.gz");
try (InputStream in = Files.newInputStream(source);
     OutputStream fout = Files.newOutputStream(target);
     GzipCompressorOutputStream gzOut =
             new GzipCompressorOutputStream(new BufferedOutputStream(fout))) {
    // Read and write in small chunks so the whole file never sits in memory.
    final byte[] buffer = new byte[8192];
    int n;
    while (-1 != (n = in.read(buffer))) {
        gzOut.write(buffer, 0, n);
    }
}
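Part 2 could then look roughly like this with the AWS SDK for Java v1 TransferManager, which performs the multipart upload (and parallelism) for you; the bucket name, key and local file below are placeholder assumptions:
import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

public class MultipartUploadSketch {
    public static void main(String[] args) throws InterruptedException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        TransferManager tm = TransferManagerBuilder.standard()
                .withS3Client(s3)
                .build();
        try {
            // Large files are split into parts and uploaded in parallel,
            // so only a few parts are held in memory at any time.
            Upload upload = tm.upload("my-bucket", "abc.huge.gz", new File("abc.huge.gz"));
            upload.waitForCompletion();
        } finally {
            tm.shutdownNow();
        }
    }
}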

Reading an image in Java with threads

I have a question regarding reading images in Java. I am trying to read an image using threads and I was curious whether by doing this:
myInputFile = new FileInputStream(myFile);
I already read the whole data or not. I already read it in 4 chunks using threads, and I am curious whether I just read it twice, once with threads and once with FileInputStream, or what FileInputStream actually does. Thanks in advance!
The FileInputStream is not reading your file yet just by constructing it like myInputFile = new FileInputStream(myFile);.
It basically only gives you a handle to the underlying file and prepares to read from it by opening a connection to that file. It also runs some basic checks, including whether the file exists and whether it is a proper file and not a directory.
Following is stated in the JavaDocs which you can find here:
Creates a FileInputStream by opening a connection to an actual file,
the file named by the File object file in the file system. A new
FileDescriptor object is created to represent this file connection.
First, if there is a security manager, its checkRead method is called
with the path represented by the file argument as its argument.
If the named file does not exist, is a directory rather than a regular
file, or for some other reason cannot be opened for reading then a
FileNotFoundException is thrown.
Only by calling one of the FileInputStream.read methods does it start to read and return the contents of the file.
The no-argument FileInputStream.read() method reads only a single byte of the file, and the FileInputStream.read(byte[] b) method reads at most as many bytes as the size of the byte array b.
Edit:
Because reading a file byte by byte is pretty slow, and using the plain FileInputStream.read(byte[] b) method can be a bit cumbersome, it's good practice to use a BufferedInputStream to process files in Java.
By default it reads the next 8192 bytes of the file and buffers them in memory for faster access. The no-argument BufferedInputStream.read() method still returns only a single byte per call, but it is mostly served from an internal buffer. As long as the requested bytes are in that buffer, they are served from it; the underlying file is accessed again only when really needed (i.e. when the requested byte is no longer in the buffer). This drastically reduces the number of read accesses to the hardware (which is by far the slowest operation in this process) and therefore boosts reading performance a lot.
The initialization looks like this:
InputStream i = new BufferedInputStream(new FileInputStream(myFile));
The handling of it is exactly the same as with the 'plain' FileInputStream, since they share the same InputStream interface.
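For illustration, a small sketch that reads a file in chunks through a BufferedInputStream (the file name is a placeholder):
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedReadExample {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream("image.png"))) {
            byte[] buf = new byte[4096];
            int n;
            long total = 0;
            // Each read(buf) call is served from the internal 8192-byte buffer
            // where possible; the file itself is only hit when the buffer
            // needs refilling.
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            System.out.println("Read " + total + " bytes");
        }
    }
}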

How file manipulations behave during a power outage

Linux machine, Java standalone application
I have the following situation:
I have:
a file write (which creates the destination file and writes some content to it) followed immediately by a file move.
I also have a power outage problem, which instantly cuts off the computer's power during these operations.
As a result, I find that the file was created and moved, but its content is empty.
The question is: what under the hood could be causing this exact outcome? Considering the timing, maybe the hard drive loses power before the processor and RAM do, but in that case, how is it possible that the file is created and then moved, yet the write before the move does not succeed?
I tried catching and logging the exception and debug information, but the problem is that the power outage disables the logging (I/O) as well.
try {
    FileUtils.writeStringToFile(file, JsonUtils.toJson(object));
} finally {
    if (file.exists()) {
        FileUtils.moveFileToDirectory(file, new File(path), true);
    }
}
Linux file systems don't necessarily write things to disk immediately, or in exactly the order that you wrote them. That includes both file content and file / directory metadata.
So if you get a power failure at the wrong time, you may find that the file data and metadata are inconsistent.
Normally this doesn't matter. (If the power fails and you don't have a UPS, the applications go away without getting a chance to finish what they were doing.)
However, if it does matter, you can force the file to "sync" to disk before you move it:
FileOutputStream fos = ...
// write to file
fos.getFD().sync();
fos.close();
// now move it
You need to read the javadoc for sync() carefully to understand what the method actually does.
You also need to read the javadoc for the method you are using to move the file regarding atomicity.
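Putting both points together, a minimal sketch (the file names are placeholders; ATOMIC_MOVE can throw AtomicMoveNotSupportedException if the file system cannot honour it):
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SyncThenMove {
    public static void main(String[] args) throws IOException {
        File tmp = new File("data.json.tmp");
        try (FileOutputStream fos = new FileOutputStream(tmp)) {
            fos.write("{\"key\":\"value\"}".getBytes(StandardCharsets.UTF_8));
            // Force the written bytes down to the storage device before
            // treating the write as complete.
            fos.getFD().sync();
        }
        // Then ask for an atomic rename into the final location; on the same
        // file system this is a metadata-only operation.
        Files.move(tmp.toPath(), Paths.get("data.json"), StandardCopyOption.ATOMIC_MOVE);
    }
}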

How to manage the creation and deletion of temporary files

I'm adding code to a large JSP web application, integrating functionality to convert CGM files to PDFs (or PDFs to CGMs) to display to the user.
It looks like I can create the converted files and store them in the directory designated by System.getProperty("java.io.tmpdir"). How do I manage their deletion, though? The program resides on a Linux-based server. Will the OS automatically delete from /tmp or will I need to come up with functionality myself? If it's the latter scenario, what are good ways to go about doing it?
EDIT: I see I can use deleteOnExit() (relevant answer elsewhere), but I think the JVM runs more or less continuously in the background so I'm not sure if the exits would be frequent enough.
I don't think I need to cache any converted files--just convert a file anew every time it's needed.
You can do this
File file = File.createTempFile("base_name", ".tmp", new File(temporaryFolderPath));
file.deleteOnExit();
The file will be deleted when the virtual machine terminates.
Edit:
If you want to delete it after the job is done, just do it:
File file = null;
try {
    file = File.createTempFile("webdav", ".tmp", new File(temporaryFolderPath));
    // do something with the file
} finally {
    if (file != null) {
        file.delete();
    }
}
There are ways to have the JVM delete files when the JVM exits using deleteOnExit() but I think there are known memory leaks using that method. Here is a blog explaining the leak: http://www.pongasoft.com/blog/yan/java/2011/05/17/file-dot-deleteOnExit-is-evil/
A better solution would be either to delete old files using a cron job or, if you know you aren't going to use the file again, simply to delete it after processing.
From your comment :
Also, could I just create something that checks to see if the size of my files exceeds a certain amount, and then deletes the oldest ones if that's true? Or am I overthinking it?
You could create a class that keeps track of the created files with a size limit. When the size of the created files, after creating a new one, goes over the limit, it deletes the oldest one. Beware that this may delete a file that still needs to exist even if it is the oldest one. You might need a way to know which files still need to be kept and delete only those that are not needed anymore.
You could have a timer in the class to check periodically instead of after each creation. This solution is tied to your application while using a cron isn't.
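A rough sketch of such a tracker, purely to illustrate the idea (the size limit and the delete-oldest policy are arbitrary assumptions):
import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;

public class TempFileTracker {
    private final long maxTotalBytes;
    private final Deque<File> files = new ArrayDeque<File>();
    private long totalBytes = 0;

    public TempFileTracker(long maxTotalBytes) {
        this.maxTotalBytes = maxTotalBytes;
    }

    // Register a newly created temp file; once the combined size exceeds
    // the limit, delete the oldest files first.
    public synchronized void register(File f) {
        files.addLast(f);
        totalBytes += f.length();
        while (totalBytes > maxTotalBytes && files.size() > 1) {
            File oldest = files.removeFirst();
            totalBytes -= oldest.length();
            oldest.delete();
        }
    }
}
As noted above, this blindly deletes the oldest file, so you would still need some way of marking files that must be kept.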

Java FileLock: How to Load Dynamic Library From Locked File?

I have an applet that retrieves a byte array from a backend server. This byte array contains a dynamic library (DLL or SO, depending on which OS the applet is running on), that must be written to disk and subsequently loaded from the disk by a call to System.load().
I need to ensure that the file is not tampered with after it's been written to disk and before it's loaded by the OS through the call to System.load(). I obtain an exclusive lock on the file while it's written to disk, but my testing shows that I must release this lock before the call to System.load(), or it'll fail to load the library.
Is there some way I can keep the lock on the file while I load it?
Sample code:
File f = File.createTempFile("tmp", "");
RandomAccessFile raf = new RandomAccessFile(f, "rwd");
FileChannel channel = raf.getChannel();
FileLock lock = channel.lock(0, Long.MAX_VALUE, false);
// This would be where I write the DLL/SO from a byte array...
raf.write((int)65); // 'A'
raf.write((int)66); // 'B'
raf.write((int)67); // 'C'
System.out.println("Wrote dynamic library to file...");
// Close and release lock
raf.close();
System.out.println("File closed. Lock released.");
// This call fails if the file is still locked.
System.load(f.getAbsolutePath());
Any help is greatly appreciated. The solution (if there is any) must not be native to any OS, but work on all platforms supported by Java. It is also a requirement that the solution be compatible with Java 1.4.
In Java 7 you can implement an in-memory file system (see java.nio.file.spi.FileSystemProvider), so the library content will be held completely in memory, making an attacker's life much harder.
Another possible approach is to sign the library and let the OS do security checks after reading the file from disk; that might not be very portable though.
Most important - is it really the biggest security issue you're facing? The time between the two calls will be some micro- (ok, maybe milli-) seconds - one would have to hack really deep into the filesystem to do bad things. Wouldn't it be easier to alter the file while it is transferred over the network? Or don't you think an attacker that advanced might... let's say hack the JVM and substitute the content while the library is being written to disk? Nothing is bulletproof; maybe this is a risk you can accept?
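As a portable (but weaker) illustration of the integrity-check idea, rather than OS-level signing, you could re-hash the file just before loading it and compare it against the digest of the bytes you wrote; this narrows the window but does not close it, and the sketch below targets a modern JDK rather than the Java 1.4 requirement in the question:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class LoadWithDigestCheck {
    // Re-hash the file on disk, compare against the digest of the bytes
    // that were originally written, and only then load it.
    static void loadIfUnchanged(String path, byte[] expectedDigest)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new FileInputStream(path)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        if (!MessageDigest.isEqual(md.digest(), expectedDigest)) {
            throw new SecurityException("Library file was modified on disk");
        }
        System.load(path);
    }
}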
Just out of interest - what exactly is the error you're getting?
