Writing a ZIP on the fly? - java

At the moment I am tracing a load of files that come into my system to a directory. The issue is that I am running out of iNodes (the number of files I can store). For "replay" reasons (there are other reasons too), I would like the files to be separate files, so I can't just write to one file.
I am wondering whether I can replace this with code that will write the files to a ZIP on the fly. However, my concern is - what happens if the JVM crashes during procesing for whatever reason, do will I end up with a corrupt ZIP file? Or is there a way that I can ensure that the ZIP is valid after every "write"?

Related

Need help choosing a robust archive format

I'm working on a Java application that runs on Lubuntu on single-board computers and produces thousands of image files, which are then transferred over FTP. The transfer takes several times longer for multiple files than it does for a single file of the same size as the total of the multiple files, I'm assuming because the FTP client has to establish a new connection for every file. So I thought I'd have the application put the image files in a single archive file, but the problem with this is that sometimes the SBC won't shut down cleanly for various reasons, and the entire archive may be corrupted all the images will be lost. Archiving the files afterwards is not a great option basically because it takes a long time. An intermediate solution may be to create multiple midsize archives, but I'm not happy with it.
I wrote a simple unit test to experiment with ZipOutputStream, and if I cancel the test it before it closes the stream, the resulting zip file gets corrupted, unsurprisingly. Could anyone suggest a different widely recognized archive format and/or implementation that might be more robust?
The tar format, jtar implementation seem to work pretty well. If I cancel in the middle of writing, I can still open the archive at least with 7zip and even get the partially written last entry.

Does saving files in a ".zip" folder speed up file write time to network drive?

I know that when I write a new file to a folder that ends in ".zip" it compresses the file. This is when using BufferedOutputStream in JAVA and saving to a windows file system. I'm saving these files to a network drive, so the write time is dependent on network speed.
Will saving to a .zip folder speed up write time? In other words, does it transfer the data uncompressed and then compresses it (so it wouldn't speed up write time) or does it compress then write out the file? Sorry if this is an ignorant question.
There are so many misconceptions in the Question, I think it is worth going through them one at a time.
I know that when I write a new file to a folder that ends in ".zip" it compresses the file.
That is not correct. Creating a file with a ".zip" suffix does not automatically make it compressed. Writing files to a directory that has ".zip" as its filename suffix (?!?) doesn't either. Not in Java. Not in other languages.
In order to get compression, the application needs to take steps to make this happen. In Java you could use ZipOutputStream to write a file in ZIP file format. However, a ZIP file is actually an "archive" format that is designed to hold multiple files in a ZIP file. If you simply trying to compress a single file, there are better alternatives; e.g. GZIPOutputStream.
(It is also possible that this so-called "ZIP folder" you are talking about is a normal ZIP file that has been "mounted" as a loopback file system. You / someone else would have had to set that up explicitly. Anyhow, if this is what is going on here, it is nothing to do with Java. It is all happening in external software and in the operating system where the ZIP is "mounted".)
This is when using BufferedOutputStream in JAVA and saving to a windows file system.
Erm ... no. See above. However you are correct that it may be better to use a BufferedOutputStream to write files, though it only really helps if your application is writing the files in small chunks; e.g. a byte at a time. (Stream compression complicates the issue, so it is difficult to give a simple, general answer on this.)
I'm saving these files to a network drive, so the write time is dependent on network speed.
Correct. It is also dependent on network latency, the protocols used and the load on the remote file server. (If you have a ZIP "mounted", then that is going to add overheads too.)
Will saving to a .zip folder speed up write time?
Maybe. See above. It depends what you mean by a ZIP folder.
Ignoring that, writing the files (the right way) in compressed and / or archive form from Java may speed up writes. There are actually two things to consider:
For plain compression, you are trading off the time it takes the application (!!) to compress and decompress the data against the time (and disk space) you are saving by moving and storing less bytes.
For ZIP files (and similar archive formats) there is a second potential saving. Storing and retrieving lots of individual small files from a file system is slow compared with storing and retrieving a single ZIP file containing those files.
And if you are looking for optimal compression, then ZIP is not the best option.
In other words, does it transfer the data uncompressed and then compresses it (so it wouldn't speed up write time) or does it compress then write out the file?
There are so many variables that it is hard to say for sure. But unless you have done something odd, it is likely that the bytes are sent over the network in compressed form.
Finally, I would advise you NOT to try to combine mounted ZIP files and network shares:
The combination of the two could potentially interact in ways that makes performance worse.
There is a risk that you will end up with a corrupted ZIP or lost files if the network share goes offline at an inconvenient point.

TrueZip Random Access Functionality

I'm trying to understand how to randomly traverse a file/files in a .tar.gz using TrueZIP in a Java 6 environment( using the Files classes). I found instances where it uses Java 7's Path, however, I can't come up with an example on how to randomly read an archive on Java 6.
Additionally, does "random" reading mean that it first uncompresses the entire archive, or does it read sections in the compressed file? The purpose is that I want to retrieve some basic information from the file without having to uncompress the entire thing just to read it(ie username).
The method that gzip uses to compress a file (especially .tar.gz files) usually implies that the output file is not random-accessible - you need the symbol table and other context from the entire file up to the current block to even be able to uncompress that block to see what's in it. This is one of the ways it achieves (somewhat) better compression over ZIP/pkzip, which compress each file individually before adding them to a container archive, resulting in the ability to seek to a specific file and uncompress just that file.
So, in order to pick a .tar.gz apart, you will need to uncompress the whole thing, either to a temporary file or in memory (if it's not too large), then you can jump to specific entries in the underlying .tar file, although that has to be done sequentially by skipping from header to header, as tar does not include a central index/directory of files.
I am not aware of TrueZip in particular, but at least in terms of Zip, RAR and Tar you can access single files and retrieve details about them and even extract them without touching the rest of the package.
Additionally, does "random" reading mean that it first uncompresses
the entire archive
If TrueZip follows Zip/RAR/Tar format, then it does not uncompress the entire archive.
The purpose is that I want to retrieve some basic information from the
file without having to uncompress the entire thing just to read it(ie
username).
As previously, that should be fine -- I don't know TrueZip API in particular, but file container formats allow you to inspect file info without reading a single bit of the data, and optionally extract/read the file contents without touching any other file in the container.
The source code comment of zran describes how such tools are working:
http://svn.ghostscript.com/ghostscript/tags/zlib-1.2.3/examples/zran.c
In conclusion one can say that the complete file has to be processed for generating the necessary index.
That is much faster than actually decompress everything.
The index allows to split the file into blocks that can be decompressed without having to decompress the blocks before. That is used for emulating random access.

how to write into a text file in Java

I am doing a project in java and in that i need to add and modify my
text file at runtime,which is grouped in the jar.
I am using class.getResourceAsStream(filename) this method we
can read that file from class path.
i want to write into the same textfile.
What is the possible solution for this.
If i can't update the text file in jar what other solution is there?
Appreciate any help.
The easiest solution here is to not put the file in the jar. It sounds like you are putting files in your jar so that your user only needs to worry about one file that contains everything related to that program. This is an artificial constraint and just add headaches.
There is a simple solution that still allows you to distribute just the jar file. At start up, attempt to read the file from the file system. If you don't find it, use default values that are encoded in you program. Then when changes are made, you can write it to the file system.
In general, you can't update a file that you located using getResourceAsStream. It might be a file in a JAR/ZIP file ... and writing it would entail rewriting the entire JAR file. It might be a remote file served up by a Url classloader.
For your sanity (and good practice), you should not attempt to update files that you access via the classpath. If you need to, read the file out of the JAR file (or whatever), copy it into the regular file system, and then update the copy.
I'm not saying that it is impossible to do this in all cases. Indeed, in most normal cases you can do it with some effort. However, this is not supported, and there are no standard APIs for doing this.
Furthermore, attempts to update resources are liable to cause anomalies in the classloader. For example, I'd expect resources in JAR files to not update (from the perspective of the application) until the application restarted. But resources in exploded JAR files probably would update ... though new resources might not show up.
Finally, there are cases where updating a resource is impossible:
When the user doesn't have write access to the application's installation directory. This is typical for a properly administered UNIX / Linux machine.
When the JAR file is fetched from a remote server, you are likely not to be able to write the updates back.
When you are using an arbitrary custom classloader, you've got no way of knowing where the actual bytes of an updated resource should be stored, and no way of storing them.
All JAR rewriting techniques in Java look similar. Open the Jar file, read all of it's contents, and write a new Jar file containing the unmodified contents (and the modifications you whished to make). Such techniques are not advisable for a Jar file on the class path, much less a Jar file you're running from.
If you decide you must do it this way, Java World has a few articles:
Modifying Archives, Part 1
Modifying Archives, Part 2
A good solution that avoids the need to put your items into a Jar file is to read (if present) a properties file out of a hidden subdirectory in the user's home directory. The logic looks a bit like this:
if (the hidden directory named after my application doesn't exist) {
makeTheHiddenDirectory();
writeTheDefaultPropertiesFile();
}
Properties appProps = new Properties();
appProps.load(new FileInputStream(fileInHiddenDir));
...
... After the appProps have changed ...
...
appProps.store(new FileOutputStream(fileInHiddenDir), "Do not modify this file");
Look to java.util.Properties, and keep in mind that they have two different load and store formats (key = value based and XML based). Pick the one that suits you best.
If i can't update the text file in jar what other solution is there?
Store the information in any of:
Cookies
The server
Deploy the applet using 1.6.0_10+, launch it using JWS and use the PersistenceService to store the information. Here is my demo. of the PersistenceService.
Also, if your users will agree to a trusted applet (which seems overkill for this), you might write the information to a sub-directory of user.home.

How to be sure a file has been successfully written?

I'm adding autosave functionality to a graphics application in Java. The application periodically autosaves the current document and also autosaves on exit. When the user starts the application, the autosave file is reloaded.
If the autosave file is corrupted in any way (I assume a power cut when the file is in the middle of being saved would do this?), the user will lose their work. How can I prevent such situations and do all I can to guarantee that the autosave document is in a consistent state?
To further complicate matters, to autosave the document I need to save one .xml file and several .png files. Also, the .png saving occurs in C code over JNI.
My current strategy is to write each .png with the extension .png.tmp, write the .xml file with the extension .xml.tmp, and then rename each file to remove the .tmp part leaving the .xml until last. On startup, I only load the autosave document if I can find a .xml file and ignore .xml.tmp files. I also don't delete the previous autosave document until the .xml.tmp file for the new document is renamed.
I guess my knowledge of what happens when you write to disk is poor. I know you can have software read/write buffers when using files, as well as OS and hardware buffers and that all of these need to be flushed. I'm confused how I can know for sure when something really has been written to disk and what I can do to protect myself. Does the renaming operation do anything to make sure buffers are flushed?
If the autosave file is corrupted in any way (I assume a power cut when the file is in the middle of being saved would do this?), the user will lose their work. How can I prevent such situations and do all I can to guarantee that the autosave document is in a consistent state?
To prevent loss of data due to partially written autosave file, don't overwrite the autosave file. Instead, write to a new file each time, and then rename it once the file has been safely written.
To guard against not noticing that an autosave file has not been correctly written:
Pay attention to the exceptions thrown as the autosave file is written and closed in case a disc error, file system full, etc.
Keep a running checksum of the file as it is written and write it at the end of the file. Then when you load the autosave file, check that the checksum is there and is correct.
If the checkpointed state involves multiple files, make sure that you write the files in a well known order (without overwriting!), and write the checksum on the autosave file after all of the other files have been safely closed. You might want to create a directory for each checkpoint.
FOLLOW UP
No. I'm not saying that rename always succeeds. However, it is atomic - it either succeeds (and completes) or the file system is not changed. So, if you do this:
write "file.new" and close,
delete "file",
rename "file.new" to "file"
then provided the first step succeeds you are guaranteed to have the latest "file" safely on disc. And it is simple to add a couple of steps so that you have a backup of "file" at all times. (If the 3rd step fails, you are left with "file.new" and no "file". This can be recovered manually, or automatically by the application next time you run it.)
Also, I'm not saying that writes always succeed, or that applications don't crash, or that the power never goes off. And the point of the checksum is to allow you to detect the cases where these things have happened and the autosave file is incomplete.
Finally, it is a good idea to have two autosaves in case your application gets itself into a state where its data structures are messed up and the last autosave is nonsensical as a result. (The checksum won't protect against this.) Be cautious about autosaving when the application crashes for the same reason.
As an aside, since you have several different files as part of this one document, consider using either a project directory to hold them all together, or using some encapsulation format (like .zip) to put them all inside one file.
What you want to do is atomically replace the old backup files with new ones. Unfortunately, I don't believe that Java gives you enough control do this directly. You also need to reason about what operations are atomic in the underlying operating system. I know Linux file systems, so my answer will be biased towards a Java program running on that system. I would be shocked if Windows didn't do the same thing, but I can't say for certain.
Most Linux file systems (e.g. the meta-data journaled ones) let you rename files atomically. If the system crashes half-way through a rename, when you restart, it will be as if you never renamed a file in the first place. For this reason, a common way to atomically update an existing file F is to write your new data to a temporary file T and then rename T to F. Any system or application crash up to that rename will not affect F, so it will always be consistent.
Of course, before you rename, you need to make sure that your temporary file is consistent. Make sure that all streaming buffers for the file are flushed to the OS (Channel.force() or OutputStream.flush()) and the OS buffers are flushed to the disk (FileOutputStream.getFD.sync()). Of course, unless your OS disables the write cache on the hard disk itself (it probably hasn't), there's still a chance that your data can be corrupted. Add a checksum to the XML if you really want to be really sure. If you're truly paranoid, you should flush the OS and hard disk buffer caches and re-read the file to verify that it is consistent. This is beyond any reasonable expectation for normal consumer applications.
But that's just to atomically write write a single file. Your propblem is more complex: you have many files to update atomically. For example, I'll say that you have two files, img.png and main.xml. I'd do one of these:
The easy solution is to make a per-savefile directory. You wouldn't need to worry about renaming each individual file, and you could still atomically rename the new backup dir over the old backup dir you're replacing. That is, if your old backup is bak/img.png and bak/main.xml, write bak.tmp/img.png and bak.tmp/main.xml and rename bak.tmp to bak.
Name the new auxiliary files something else and let them coexist with the old ones for a little while. That is, write img.2.png and main.xml.tmp (which should refer to img.2.png, not img.png) and only rename main.xml.tmp to main.xml. Then delete img.png.
addition: If you don't have atomic renames, the next best thing extends on #2. Whenever you save the project, give it a new name (e.g. ver342.xml). When you load, just find the most recent XML that is consistent (i.e. its checksum verifies). Keep around 2 or 3 to be safe. Only delete an auto-save if you have successfully restored from a more-recent copy.

Categories