I'm trying to understand how to randomly traverse a file/files in a .tar.gz using TrueZIP in a Java 6 environment( using the Files classes). I found instances where it uses Java 7's Path, however, I can't come up with an example on how to randomly read an archive on Java 6.
Additionally, does "random" reading mean that it first uncompresses the entire archive, or does it read sections in the compressed file? The purpose is that I want to retrieve some basic information from the file without having to uncompress the entire thing just to read it(ie username).
The method that gzip uses to compress a file (especially .tar.gz files) usually implies that the output file is not random-accessible - you need the symbol table and other context from the entire file up to the current block to even be able to uncompress that block to see what's in it. This is one of the ways it achieves (somewhat) better compression over ZIP/pkzip, which compress each file individually before adding them to a container archive, resulting in the ability to seek to a specific file and uncompress just that file.
So, in order to pick a .tar.gz apart, you will need to uncompress the whole thing, either to a temporary file or in memory (if it's not too large), then you can jump to specific entries in the underlying .tar file, although that has to be done sequentially by skipping from header to header, as tar does not include a central index/directory of files.
I am not aware of TrueZip in particular, but at least in terms of Zip, RAR and Tar you can access single files and retrieve details about them and even extract them without touching the rest of the package.
Additionally, does "random" reading mean that it first uncompresses
the entire archive
If TrueZip follows Zip/RAR/Tar format, then it does not uncompress the entire archive.
The purpose is that I want to retrieve some basic information from the
file without having to uncompress the entire thing just to read it(ie
username).
As previously, that should be fine -- I don't know TrueZip API in particular, but file container formats allow you to inspect file info without reading a single bit of the data, and optionally extract/read the file contents without touching any other file in the container.
The source code comment of zran describes how such tools are working:
http://svn.ghostscript.com/ghostscript/tags/zlib-1.2.3/examples/zran.c
In conclusion one can say that the complete file has to be processed for generating the necessary index.
That is much faster than actually decompress everything.
The index allows to split the file into blocks that can be decompressed without having to decompress the blocks before. That is used for emulating random access.
Related
Summary:
I have a program I want to ship as a single jar file.
It depends on three big resource files (700MB each) in a binary format. The file content can easily be accessed via indexing, my parser therefore reads these files as RandomAccessFile-objects.
So my goal is to access resource files from a jar via File objects.
My problem:
When accessing the resource files from my file system, there is no issue, but I aim to pack them into the jar file of the program, so the user does not need to handle these files themselves.
The only way I found so far to access a file packed in a jar is via InputStream (generated by class.getResourceAsStream()), which is totally useless for my application as it would be much too slow reading these files from start to end instead of using RandomAccessFile.
Copying the file content into a file, reading it and deleting it in runtime is no option eigher for the same reason.
Can someone confirm that there is no way to achieve my goal or provide a solution (or a hint so I can work it out myself)?
What I found so far:
I found this answer and if I understand the answer it says that there is no way to solve my problem:
Resources in a .jar file are not files in the sense that the OS can access them directly via normal file access APIs.
And since java.io.File represents exactly that kind of file (i.e. a thing that looks like a file to the OS), it can't be used to refer to anything in a .jar file.
A possible workaround is to extract the resource to a temporary file and refer to that with a File.
I think I can follow the reasoning behind it, but it is over eight years old now and while I am not very educated when it comes to file systems and archives, I know that the Java language has evolved quite much since then, so maybe there is hope? :)
Probably useless background information:
The files are genomes in the 2bit format and I use the TwoBitParser from biojava via the wrapper class TwoBitFacade?. The Javadocs can be found here and here.
Resources are not files, and they live in a JAR file, which is not a random access medium.
I know that when I write a new file to a folder that ends in ".zip" it compresses the file. This is when using BufferedOutputStream in JAVA and saving to a windows file system. I'm saving these files to a network drive, so the write time is dependent on network speed.
Will saving to a .zip folder speed up write time? In other words, does it transfer the data uncompressed and then compresses it (so it wouldn't speed up write time) or does it compress then write out the file? Sorry if this is an ignorant question.
There are so many misconceptions in the Question, I think it is worth going through them one at a time.
I know that when I write a new file to a folder that ends in ".zip" it compresses the file.
That is not correct. Creating a file with a ".zip" suffix does not automatically make it compressed. Writing files to a directory that has ".zip" as its filename suffix (?!?) doesn't either. Not in Java. Not in other languages.
In order to get compression, the application needs to take steps to make this happen. In Java you could use ZipOutputStream to write a file in ZIP file format. However, a ZIP file is actually an "archive" format that is designed to hold multiple files in a ZIP file. If you simply trying to compress a single file, there are better alternatives; e.g. GZIPOutputStream.
(It is also possible that this so-called "ZIP folder" you are talking about is a normal ZIP file that has been "mounted" as a loopback file system. You / someone else would have had to set that up explicitly. Anyhow, if this is what is going on here, it is nothing to do with Java. It is all happening in external software and in the operating system where the ZIP is "mounted".)
This is when using BufferedOutputStream in JAVA and saving to a windows file system.
Erm ... no. See above. However you are correct that it may be better to use a BufferedOutputStream to write files, though it only really helps if your application is writing the files in small chunks; e.g. a byte at a time. (Stream compression complicates the issue, so it is difficult to give a simple, general answer on this.)
I'm saving these files to a network drive, so the write time is dependent on network speed.
Correct. It is also dependent on network latency, the protocols used and the load on the remote file server. (If you have a ZIP "mounted", then that is going to add overheads too.)
Will saving to a .zip folder speed up write time?
Maybe. See above. It depends what you mean by a ZIP folder.
Ignoring that, writing the files (the right way) in compressed and / or archive form from Java may speed up writes. There are actually two things to consider:
For plain compression, you are trading off the time it takes the application (!!) to compress and decompress the data against the time (and disk space) you are saving by moving and storing less bytes.
For ZIP files (and similar archive formats) there is a second potential saving. Storing and retrieving lots of individual small files from a file system is slow compared with storing and retrieving a single ZIP file containing those files.
And if you are looking for optimal compression, then ZIP is not the best option.
In other words, does it transfer the data uncompressed and then compresses it (so it wouldn't speed up write time) or does it compress then write out the file?
There are so many variables that it is hard to say for sure. But unless you have done something odd, it is likely that the bytes are sent over the network in compressed form.
Finally, I would advise you NOT to try to combine mounted ZIP files and network shares:
The combination of the two could potentially interact in ways that makes performance worse.
There is a risk that you will end up with a corrupted ZIP or lost files if the network share goes offline at an inconvenient point.
Background
I'm working on a Scala/Java project where we convert individual HTML files to PDF files. We then merge the individual files into one larger complete PDF file.
For the converting we are using sPDF which is built on top of wkhtmltopdf. For the merging we use PDFMergerUtility.
The reasons for making individual files is a bit complicated - but it should be noted we can't make one big PDF off the bat and have to make the individual files first.
The issue
Initially we had no problems with this approach - however as the system has grown - so have the final PDF files. We went from files that were 2MB-3MB to files that are 20MB. I would like to know if there is any obvious compression methods or techniques we could use?
There is a lot of shared content across the individual files but since we're just merging them as isolated/independent files (as in, none of the content that is the same across the individual files is being reused to save space) it doesn't make a difference in bringing down the file size.
If I manually ZIP the final PDF file it greatly reduces the file size -as obviously there is a lot of repeated content.
So one option might just be to zip the PDF after I've finished the merging, but I would prefer to compress it during the merger process or conversion process.
Any ideas?
You could try Sejda to merge, it's Java, open source and based on a fork of PDFBox. It can generate PDF files using object streams (PDFBox currently doesn't support that) and, in case it doesn't reduce the size that much, you can try to pipe its 'compress' task which goes through the document removing unused resources and compressing images.
It's battle tested as engine behind PDFsam so, if you want to give it a quick test and see what's the outcome, just download PDFsam, use the merge module with your files (and compression flag on) and the result is what Sejda will generate.
At the moment I am tracing a load of files that come into my system to a directory. The issue is that I am running out of iNodes (the number of files I can store). For "replay" reasons (there are other reasons too), I would like the files to be separate files, so I can't just write to one file.
I am wondering whether I can replace this with code that will write the files to a ZIP on the fly. However, my concern is - what happens if the JVM crashes during procesing for whatever reason, do will I end up with a corrupt ZIP file? Or is there a way that I can ensure that the ZIP is valid after every "write"?
I am doing a project in java and in that i need to add and modify my
text file at runtime,which is grouped in the jar.
I am using class.getResourceAsStream(filename) this method we
can read that file from class path.
i want to write into the same textfile.
What is the possible solution for this.
If i can't update the text file in jar what other solution is there?
Appreciate any help.
The easiest solution here is to not put the file in the jar. It sounds like you are putting files in your jar so that your user only needs to worry about one file that contains everything related to that program. This is an artificial constraint and just add headaches.
There is a simple solution that still allows you to distribute just the jar file. At start up, attempt to read the file from the file system. If you don't find it, use default values that are encoded in you program. Then when changes are made, you can write it to the file system.
In general, you can't update a file that you located using getResourceAsStream. It might be a file in a JAR/ZIP file ... and writing it would entail rewriting the entire JAR file. It might be a remote file served up by a Url classloader.
For your sanity (and good practice), you should not attempt to update files that you access via the classpath. If you need to, read the file out of the JAR file (or whatever), copy it into the regular file system, and then update the copy.
I'm not saying that it is impossible to do this in all cases. Indeed, in most normal cases you can do it with some effort. However, this is not supported, and there are no standard APIs for doing this.
Furthermore, attempts to update resources are liable to cause anomalies in the classloader. For example, I'd expect resources in JAR files to not update (from the perspective of the application) until the application restarted. But resources in exploded JAR files probably would update ... though new resources might not show up.
Finally, there are cases where updating a resource is impossible:
When the user doesn't have write access to the application's installation directory. This is typical for a properly administered UNIX / Linux machine.
When the JAR file is fetched from a remote server, you are likely not to be able to write the updates back.
When you are using an arbitrary custom classloader, you've got no way of knowing where the actual bytes of an updated resource should be stored, and no way of storing them.
All JAR rewriting techniques in Java look similar. Open the Jar file, read all of it's contents, and write a new Jar file containing the unmodified contents (and the modifications you whished to make). Such techniques are not advisable for a Jar file on the class path, much less a Jar file you're running from.
If you decide you must do it this way, Java World has a few articles:
Modifying Archives, Part 1
Modifying Archives, Part 2
A good solution that avoids the need to put your items into a Jar file is to read (if present) a properties file out of a hidden subdirectory in the user's home directory. The logic looks a bit like this:
if (the hidden directory named after my application doesn't exist) {
makeTheHiddenDirectory();
writeTheDefaultPropertiesFile();
}
Properties appProps = new Properties();
appProps.load(new FileInputStream(fileInHiddenDir));
...
... After the appProps have changed ...
...
appProps.store(new FileOutputStream(fileInHiddenDir), "Do not modify this file");
Look to java.util.Properties, and keep in mind that they have two different load and store formats (key = value based and XML based). Pick the one that suits you best.
If i can't update the text file in jar what other solution is there?
Store the information in any of:
Cookies
The server
Deploy the applet using 1.6.0_10+, launch it using JWS and use the PersistenceService to store the information. Here is my demo. of the PersistenceService.
Also, if your users will agree to a trusted applet (which seems overkill for this), you might write the information to a sub-directory of user.home.