I have a complicated requirement. We have a big .iso file on a remote server. The file contains a small .txt file with important information about the ISO (its version).
I need to check this file before(!) downloading the complete ISO file. This needs to be done in Java.
For ZIP files I found a solution here (How to extract a single file from a remote archive file?), but I am not sure whether that approach avoids downloading the whole file.
I had a look at the loopy (http://loopy.sourceforge.net/) library, but it seems to work only with File objects, which requires the file to be on a local drive.
The ISO is in ISO 9660 format.
What would also be possible is reading the "label" of the ISO within Java, because it also contains that version information.
Does anybody have a suggestion? :)
Thanks in advance
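One route that would cover the "label" idea without any ISO library: ISO 9660 puts the Primary Volume Descriptor at byte offset 32768 (sector 16 of 2048-byte sectors), and the 32-byte volume identifier sits at offset 40 inside that descriptor. A minimal sketch, assuming the server supports HTTP Range requests; the class and method names are made up:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class IsoLabelReader {

    // Offset of the Primary Volume Descriptor: sector 16 * 2048 bytes.
    private static final long PVD_OFFSET = 16L * 2048;

    public static String readVolumeLabel(String isoUrl) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(isoUrl).openConnection();
        // Ask only for the 2048-byte sector holding the Primary Volume Descriptor.
        con.setRequestProperty("Range", "bytes=" + PVD_OFFSET + "-" + (PVD_OFFSET + 2047));
        if (con.getResponseCode() != HttpURLConnection.HTTP_PARTIAL) {
            throw new IOException("Server does not support range requests");
        }
        byte[] sector = new byte[2048];
        try (DataInputStream in = new DataInputStream(con.getInputStream())) {
            in.readFully(sector);
        }
        // Sanity check: type code 1 and the "CD001" standard identifier.
        String id = new String(sector, 1, 5, StandardCharsets.US_ASCII);
        if (sector[0] != 1 || !"CD001".equals(id)) {
            throw new IOException("Not a Primary Volume Descriptor");
        }
        // The volume identifier (label) is 32 bytes at offset 40, space-padded.
        return new String(sector, 40, 32, StandardCharsets.US_ASCII).trim();
    }
}
```

If the small .txt file itself must be read, the same Range technique could be pointed at the ISO's directory records, but that needs a real ISO 9660 parser on top of ranged reads.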
Can someone please let me know if there is a memory-efficient way to append to .xls files? (The client is very insistent on an .xls file for the report, and I did all possible research, but in vain.) All I could find is that to append to an existing .xls, we first have to load the entire file into memory, append the data, and then write it back. Is that the only way? I can afford to give up time to optimize memory consumption.
I am afraid that is not possible using Apache POI, and I doubt it will be possible with other libraries either. Even Microsoft's own applications always need to open the whole file to be able to work with it.
All of the Microsoft Office file formats have a complex internal structure similar to a file system, and the parts of that internal system may have relations to each other. So one cannot simply stream data into those files and append data as is possible with plain text files, CSV files, or single XML files, for example. One always needs to consider the validity of the complete file system and its relations, so the complete file system always needs to be known. And where should it be known if not in memory?
The modern Microsoft Office file formats are Office Open XML. These are ZIP archives containing an internal file system with a directory structure of XML files and other files. So one can reduce the memory footprint by reading data parts directly from that ZIP file system instead of unzipping it and reading all the data into memory. This is what Apache POI does with XSSF and SAX (the Event API). But this is for reading only.
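A minimal sketch of that event-style reading with POI's XSSFReader (the file name and the cell handling are placeholders):

```java
import java.io.InputStream;
import java.util.Iterator;
import javax.xml.parsers.SAXParserFactory;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class LowMemoryRead {
    public static void main(String[] args) throws Exception {
        try (OPCPackage pkg = OPCPackage.open("big.xlsx")) {
            XSSFReader reader = new XSSFReader(pkg);
            Iterator<InputStream> sheets = reader.getSheetsData();
            while (sheets.hasNext()) {
                try (InputStream sheetXml = sheets.next()) {
                    // Stream the raw sheet XML through SAX instead of
                    // building the whole workbook in memory.
                    SAXParserFactory.newInstance().newSAXParser()
                        .parse(sheetXml, new DefaultHandler() {
                            @Override
                            public void startElement(String uri, String local,
                                    String name, Attributes attrs) {
                                if ("c".equals(name)) { // a cell element
                                    System.out.println("cell " + attrs.getValue("r"));
                                }
                            }
                        });
                }
            }
        }
    }
}
```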
For the writing approach, one could have parts of the data (single XML files) written to temporary files to keep them out of memory, then assemble the complete ZIP file system from those temporary files when all writing is complete. This is what SXSSF (the Streaming Usermodel API) does. But this is for writing only.
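For example, a sketch of the SXSSF write path (the window size, sheet name, and row count are arbitrary):

```java
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class LowMemoryWrite {
    public static void main(String[] args) throws Exception {
        // Keep only 100 rows in memory; older rows are flushed to temp files.
        SXSSFWorkbook wb = new SXSSFWorkbook(100);
        try {
            Sheet sheet = wb.createSheet("data");
            for (int r = 0; r < 1_000_000; r++) {
                Row row = sheet.createRow(r);
                row.createCell(0).setCellValue("value " + r);
            }
            try (FileOutputStream out = new FileOutputStream("big.xlsx")) {
                wb.write(out);
            }
        } finally {
            wb.dispose(); // delete the temporary files backing the stream
        }
    }
}
```

Rows older than the window are flushed to the temporary files and can no longer be touched, which is exactly why this approach cannot append to an existing file.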
When it comes to appending data to an existing Microsoft Office file, nothing of the above is usable, because, as said already, one always needs to consider the validity of the complete file system and its relations. So the whole file system always needs to be accessible in order to append data parts to it and update the relationships. One could think about keeping all data parts (single XML files) and relationship parts in temporary files to hold them out of memory. But I don't know of any library (maybe a closed-source one like Aspose) that does this, and I doubt it would be possible in a performant way. So you would pay in time for a lower memory footprint.
The older Microsoft Office file formats are binary file systems, but they also have a complex internal structure. The single parts are streams of binary records which may also have relations to each other. So the main problem is the same as with Office Open XML.
There is an Event API (HSSF only) which reads single record streams, similar to the event API for Office Open XML. But, of course, this is for reading only.
There is no streaming approach for writing HSSF up to now. The reason is that the old binary Excel worksheets only provide 65,536 rows and 256 columns, so the amount of data in one sheet cannot be that big, and a GB-sized *.xls file should not occur at all. You should not use Excel as a data-exchange format for database data; that is not what a spreadsheet application is made for.
But even if one were to program a streaming approach for writing HSSF, it would not solve your problem, because there is still nothing for appending data to an existing *.xls file. And the problems with this are the same as with the Office Open XML file formats.
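So the round trip you already found (load everything, append, write back) remains the practical route. A sketch with POI's usermodel; the file name and cell content are placeholders:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class AppendRows {
    public static void main(String[] args) throws Exception {
        Workbook wb;
        try (FileInputStream in = new FileInputStream("report.xls")) {
            wb = WorkbookFactory.create(in); // loads the whole file into memory
        }
        Sheet sheet = wb.getSheetAt(0);
        int next = sheet.getLastRowNum() + 1;
        Row row = sheet.createRow(next);
        row.createCell(0).setCellValue("appended");
        try (FileOutputStream out = new FileOutputStream("report.xls")) {
            wb.write(out); // writes the whole file back
        }
        wb.close();
    }
}
```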
Summary:
I have a program I want to ship as a single jar file.
It depends on three big resource files (700 MB each) in a binary format. The file content can easily be accessed via indexing, so my parser reads these files as RandomAccessFile objects.
So my goal is to access resource files from a jar via File objects.
My problem:
When accessing the resource files from my file system there is no issue, but I aim to pack them into the jar file of the program so the user does not need to handle these files themselves.
The only way I found so far to access a file packed in a jar is via InputStream (generated by class.getResourceAsStream()), which is totally useless for my application, as it would be much too slow to read these files from start to end instead of using RandomAccessFile.
Copying the file content into a file at runtime, reading it, and deleting it is no option either, for the same reason.
Can someone confirm that there is no way to achieve my goal or provide a solution (or a hint so I can work it out myself)?
What I found so far:
I found this answer, and if I understand it correctly, it says that there is no way to solve my problem:
Resources in a .jar file are not files in the sense that the OS can access them directly via normal file access APIs.
And since java.io.File represents exactly that kind of file (i.e. a thing that looks like a file to the OS), it can't be used to refer to anything in a .jar file.
A possible workaround is to extract the resource to a temporary file and refer to that with a File.
I think I can follow the reasoning behind it, but it is over eight years old now, and while I am not very educated when it comes to file systems and archives, I know that the Java language has evolved quite a lot since then, so maybe there is hope? :)
Probably useless background information:
The files are genomes in the 2bit format, and I use the TwoBitParser from biojava via the wrapper class TwoBitFacade. The Javadocs can be found here and here.
Resources are not files, and they live in a JAR file, which is not a random access medium.
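That means the temp-file workaround from the quoted answer is the only way to get a RandomAccessFile onto a packed resource. A sketch (ResourceExtractor and the temp-file names are made up); the one-time copy at startup is the unavoidable cost:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ResourceExtractor {
    // Copies a classpath resource to a temp file so it can be opened
    // with RandomAccessFile; the extraction cost is paid once per run.
    public static RandomAccessFile openResource(String name) throws IOException {
        Path temp = Files.createTempFile("genome-", ".2bit");
        temp.toFile().deleteOnExit();
        try (InputStream in = ResourceExtractor.class.getResourceAsStream(name)) {
            if (in == null) throw new IOException("Resource not found: " + name);
            Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
        }
        return new RandomAccessFile(temp.toFile(), "r");
    }
}
```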
I'm trying to understand how to randomly traverse a file or files in a .tar.gz using TrueZIP in a Java 6 environment (using the Files classes). I found instances where it uses Java 7's Path; however, I can't come up with an example of how to randomly read an archive in Java 6.
Additionally, does "random" reading mean that it first uncompresses the entire archive, or does it read sections of the compressed file? The purpose is that I want to retrieve some basic information from the file without having to uncompress the entire thing just to read it (e.g. a username).
The method that gzip uses to compress a file (and a .tar.gz file in particular) usually implies that the output file is not random-accessible: you need the symbol table and other context from the entire file up to the current block just to be able to uncompress that block and see what's in it. This is one of the ways it achieves (somewhat) better compression than ZIP/pkzip, which compresses each file individually before adding it to a container archive, making it possible to seek to a specific file and uncompress just that file.
So, in order to pick a .tar.gz apart, you will need to uncompress the whole thing, either to a temporary file or in memory (if it's not too large). Then you can jump to specific entries in the underlying .tar file, although that has to be done sequentially by skipping from header to header, as tar does not include a central index/directory of files.
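If TrueZIP does not work out, that header walk is straightforward to do directly. A sketch with Apache Commons Compress instead of TrueZIP, written Java 6 style without try-with-resources (the archive name is a placeholder); the gzip stream is still decompressed as you skip forward, but only the entry headers are materialized:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.util.zip.GZIPInputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public class TarGzScan {
    public static void main(String[] args) throws Exception {
        TarArchiveInputStream tar = new TarArchiveInputStream(
                new GZIPInputStream(new BufferedInputStream(
                        new FileInputStream("archive.tar.gz"))));
        try {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                // The 512-byte tar headers carry the metadata; entry
                // bodies are skipped when getNextTarEntry() advances.
                System.out.println(entry.getName() + " owner=" + entry.getUserName());
            }
        } finally {
            tar.close();
        }
    }
}
```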
I am not aware of TrueZip in particular, but at least for Zip, RAR, and Tar you can access single files, retrieve details about them, and even extract them without touching the rest of the package.
Additionally, does "random" reading mean that it first uncompresses the entire archive
If TrueZip follows Zip/RAR/Tar format, then it does not uncompress the entire archive.
The purpose is that I want to retrieve some basic information from the file without having to uncompress the entire thing just to read it (e.g. a username).
As above, that should be fine. I don't know the TrueZip API in particular, but file container formats allow you to inspect file info without reading a single bit of the data, and optionally extract/read the file contents without touching any other file in the container.
The source code comment of zran describes how such tools work:
http://svn.ghostscript.com/ghostscript/tags/zlib-1.2.3/examples/zran.c
In conclusion, one can say that the complete file has to be processed once to generate the necessary index. That is still much faster than actually decompressing everything.
The index allows splitting the file into blocks that can be decompressed without having to decompress the blocks before them. That is used to emulate random access.
While trying to come up with a servlet-based application to read files and manipulate them (image type conversion), here is a question that came up:
Is it possible to inspect a file's content and determine the file type?
Is there a standard that specifies that each file MUST provide some type of marker in its content, so that an application does not have to rely on the file extension?
Consider an application scenario:
I am creating an application that will be able to convert different file formats to a set of output formats. Say a user uploads a PDF; my application can suggest that the possible conversion formats are Microsoft Word, TIFF, JPEG, etc.
As my application will gradually support more file formats (over a period of time), I want it to inspect the input file instead of having the user specify the format, and then suggest the possible output formats to the user.
I understand this is an open-ended, broad question. Please let me know if it needs to be modified.
Thanks,
Ayusman
Yep, you can figure out the type without an extension by using the magic number.
Also, the way the file command figures it out is actually a 3-step check:
Check filesystem properties to identify empty files, folders, etc.
The said magic number
For text files, check which language they are in
Here's a library that'll help you with Magic Numbers: jmimemagic
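To illustrate step 2, a toy sniffer that checks a few well-known signatures by hand (the class name is made up; real detectors such as file or jmimemagic ship databases with hundreds of entries):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MagicSniffer {
    public static String sniff(String path) throws IOException {
        byte[] head = new byte[8];
        InputStream in = new FileInputStream(path);
        try {
            int n = in.read(head);
            // "%PDF" at offset 0 marks a PDF document.
            if (n >= 4 && head[0] == '%' && head[1] == 'P' && head[2] == 'D' && head[3] == 'F')
                return "application/pdf";
            // 0x89 'P' 'N' 'G' marks a PNG image.
            if (n >= 4 && (head[0] & 0xFF) == 0x89 && head[1] == 'P' && head[2] == 'N' && head[3] == 'G')
                return "image/png";
            // 0xFF 0xD8 marks a JPEG image.
            if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xD8)
                return "image/jpeg";
            return "unknown";
        } finally {
            in.close();
        }
    }
}
```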
I'm working on an application that syncs data. For Mac OS, files are uploaded, and if they contain resource fork information, the fork is read and stored as a string using: file/..namedfork/rsrc
Users can access their files using a Web application (Java) running on a Linux server. Is there a way I can generate a valid AppleDouble format file using only the data fork and the string I read from the named fork? I don't mind losing the Finder metadata.
Note: The generated file will be downloaded (using the Web Application) as a single file for Mac OS users.
Is this possible?
Regards
As far as I'm aware, OS 9/OS X can only natively access the resource forks on files served by AppleTalk shares. For other media, e.g. SMB (Microsoft Networking) or HTTP, the only way to preserve the resource fork is to place the file in an archive.
There are several Mac-specific archive formats that support this, for example StuffIt and HQX. I very much doubt the Linux binaries for StuffIt would allow packaging a resource fork from a separate file, but at least it is something for you to evaluate.
Looking at the AppleDouble Wikipedia entry, it seems it may be possible to create such a file from a non-Apple machine using an open source tool, and sending the resultant file using the multipart/appledouble MIME type. Perhaps you could call this binary from your Java code?
The Wikipedia article states:
AppleSingle combined both file forks and the related Finder meta-file information into a single file, whereas AppleDouble stored them as two separate files.
The Apple knowledge base article states:
The second new file has the name of the original file prefixed by a "._ " and contains the resource fork of the original file.
So I assume you just have to save the content of your resource fork string into an appropriately named file.
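A sketch of that two-file layout (ForkWriter and its parameters are made-up names). One caveat: a strictly conformant ._ companion begins with a small AppleDouble header (magic number 0x00051607 plus entry descriptors) rather than the raw fork bytes, so picky consumers may expect that wrapper:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ForkWriter {
    // Writes the data fork as-is and the resource fork bytes into the
    // "._"-prefixed companion file, as the AppleDouble scheme describes.
    public static void writePair(Path dir, String name,
                                 byte[] dataFork, byte[] resourceFork) throws IOException {
        Files.write(dir.resolve(name), dataFork);
        Files.write(dir.resolve("._" + name), resourceFork);
    }
}
```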
Edit:
After your comment I'm not sure what you want. Your question was how to
Create AppleDouble formatted file in Linux
and the documentation I linked to shows that you need to create two files to do that: one containing the data and one containing the resource fork, with a name prefixed by "._". If that is not what you want, then you need to ask a different question.