gz files in one directory. I want to combine them in one big .gz file and unzip it and load it into HDFS.
For example, the directory contains the files a.gz, b.gz, and c.gz. I want to combine them into one file called d.gz, then unzip it and load it into HDFS. The .gz files are compressed CSV files.
To unzip, I know I can use GZIPInputStream/GZIPOutputStream, but how do I combine the files into one big file in Java?
Please guide. Thanks in advance.
A gz file contains exactly one file. It's not meant to contain multiple files.
The best way to do this is TAR the files together then GZ the resulting TAR. TAR has command line options to automate this into a single operation. For Java, use jtar: https://code.google.com/p/jtar/
Alternatively, a ZIP file may be what you're looking for.
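Because the inputs here are CSV files, there is also an option that avoids TAR entirely: the gzip format (RFC 1952) allows a file to consist of several compressed members back to back, and decompressing such a file yields the members' contents concatenated. The following is a minimal sketch under that assumption, not the approach from the answer above. The class and method names (GzipConcat, concatenate, gunzip) are hypothetical, and loading the result into HDFS is not shown.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.zip.GZIPInputStream;

class GzipConcat {
    /** Appends the raw bytes of each .gz part to target, giving a multi-member gzip file. */
    static void concatenate(List<Path> parts, Path target) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // byte-for-byte copy, no recompression
            }
        }
    }

    /** Decompresses a (possibly multi-member) gzip file into a plain file. */
    static void gunzip(Path source, Path target) throws IOException {
        try (InputStream in = new GZIPInputStream(
                new BufferedInputStream(Files.newInputStream(source)))) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Calling GzipConcat.concatenate with a.gz, b.gz, and c.gz and then GzipConcat.gunzip on the resulting d.gz would give one CSV holding the rows of all three files in order. If each CSV starts with a header row, the header will appear once per input file, so it may need to be stripped separately.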
Related
I have many XML files on HDFS which I extracted from sequence files using a Java program.
Initially there were only a few files, so I copied the extracted XML files to my local machine and ran the Unix zip command to zip them into a single .zip file.
The number of XML files has now increased, and I can't copy them to local because I will run out of memory.
I just need to zip all of those XML files (on HDFS) into a single zip file (also on HDFS) without copying them to local.
I couldn't find any lead to start from. Can anyone provide a starting point or any code (even Java MR) so that I can go further? I can see this can be done using MapReduce, but I have never programmed in it, which is why I am trying other ways.
Thanks in advance..
I am creating a zip file using ZipOutputStream. There will also be a manifest file (a CSV file) which will have links to the entries in the zip file. How do I programmatically create links for the zip entries?
If you keep track of all the entries while you write them, you should be able to add another entry containing the "links" (but how should a CSV link to a file? Please specify what you are trying to achieve).
If you intend to use the file under Windows, you could create .lnk files programmatically, but this only works for one file per link. On Unix-like systems, ZipOutputStream cannot create symlinks, but ZipFileSystem can.
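The "keep track of the entries while you write them" idea can be sketched as follows. This is only an illustration under assumptions: the "link" is taken to be the entry's path inside the archive, the manifest name manifest.csv and its two columns are made up for the example, and entry names are assumed not to contain commas or line breaks. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipWithManifest {
    /** Writes every name/content pair as a zip entry, then a manifest.csv listing them. */
    static void write(Map<String, byte[]> files, OutputStream target) throws IOException {
        // Assumed manifest layout: entry path inside the zip, uncompressed size.
        StringBuilder manifest = new StringBuilder("entry,size_bytes\n");
        try (ZipOutputStream zip = new ZipOutputStream(target)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue());
                zip.closeEntry();
                // Record the entry as it is written.
                manifest.append(file.getKey()).append(',')
                        .append(file.getValue().length).append('\n');
            }
            // The manifest goes in last, once all entry names are known.
            zip.putNextEntry(new ZipEntry("manifest.csv"));
            zip.write(manifest.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
    }
}
```

A consumer can then read manifest.csv first and look up each listed path with ZipFile.getEntry. Passing a LinkedHashMap keeps the manifest rows in insertion order.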
How can I add, modify, delete, or merge directories recursively in a zip file (in Java) without using the file system?
Do I have to respect the order of zip entries?
Yes, I know merging directories is a very complex job.
If you need to add a whole directory with its files to a zip archive recursively using only core Java, you can use the example from Mkyong's blog. If you need to append files to an existing zip file, then you should use the link from @McDowell's comment: Appending files to a zip file with Java
There is no simple answer; you're going to need to write a fair bit of code. You can't use the JDK ZipFile class, as that only supports reading zip files.
Instead use Commons Compress. Have a look at the examples and the zip documentation to get going.
Basically you'll need to open an input zip file and an output zip file. Read each entry in turn, and decide whether to write it to the output, transform and write it, add a new entry, or skip it. When you get to the end, close both zip files.
When processing a zip file, it's not really recursive, as all the entries are just a linear list with a path and filename. The recursive part comes when a zip contains a zip, and that is quite easy to handle.
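The read-decide-write loop described above can be sketched like this. Note the swap: the answer recommends Commons Compress, whereas this sketch uses the JDK's java.util.zip streams so that it stays self-contained; the loop has the same shape either way. The class name, method name, and parameters are hypothetical. It works on streams only, so no files need to exist on disk.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipRewrite {
    /**
     * Copies entries from source to target, skipping the names in toDelete,
     * then adds (or replaces) one entry named newName with newContent.
     */
    static void rewrite(InputStream source, OutputStream target, Set<String> toDelete,
                        String newName, byte[] newContent) throws IOException {
        try (ZipInputStream in = new ZipInputStream(source);
             ZipOutputStream out = new ZipOutputStream(target)) {
            byte[] buf = new byte[8192];
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                String name = entry.getName();
                // "Delete" and "modify" both mean: do not copy the old entry.
                if (toDelete.contains(name) || name.equals(newName)) {
                    continue;
                }
                out.putNextEntry(new ZipEntry(name)); // data is recompressed on write
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                out.closeEntry();
            }
            // New entries are appended after the copied ones.
            out.putNextEntry(new ZipEntry(newName));
            out.write(newContent);
            out.closeEntry();
        }
    }
}
```

Deleting a whole directory would amount to skipping every entry whose name starts with that directory's path plus "/", since entries are a flat list of paths.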
I have a couple of zip files that contain multiple files and folders. They basically contain text files, say with extensions 'a' and 'b'.
I want to separate the extension 'a' files and extension 'b' files into separate zip files, using a Perl script or Java code.
Instead of unzipping the files, can I just pick the contents and put them into another zip file? Is this even possible? Any help would be great.
The reason I ask is that I have a large number of large zip files, so if this is possible my code will be much more efficient.
Any comments on whether to use Perl or Java would be a bonus.
Thank you.
You can use the Java ZipFile class to read the contents of a zip file, iterate over the entries in the zip, and obtain input streams for the relevant entries. Using ZipOutputStream it is possible to put the files directly into a new zip file, although they are decompressed and recompressed in between. I don't know of a tool that can copy the zipped contents directly.
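A minimal sketch of that ZipFile plus ZipOutputStream approach, assuming the split is by file-name suffix and that both outputs end up with at least one entry. The class name, method name, and parameters are hypothetical, and as noted above the data is decompressed and recompressed along the way.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class ZipSplitByExtension {
    /** Sends entries whose name ends with ext to matching, all other files to rest. */
    static void split(File source, String ext, File matching, File rest) throws IOException {
        try (ZipFile zip = new ZipFile(source);
             ZipOutputStream outMatch = new ZipOutputStream(new FileOutputStream(matching));
             ZipOutputStream outRest = new ZipOutputStream(new FileOutputStream(rest))) {
            byte[] buf = new byte[8192];
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // folders are implied by the paths of the files
                }
                ZipOutputStream out = entry.getName().endsWith(ext) ? outMatch : outRest;
                out.putNextEntry(new ZipEntry(entry.getName()));
                try (InputStream in = zip.getInputStream(entry)) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                }
                out.closeEntry();
            }
        }
    }
}
```

For the question's case this could be called as ZipSplitByExtension.split(new File("in.zip"), ".a", new File("a.zip"), new File("b.zip")), where the file names are placeholders.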
This might be done using the Perl Archive::Zip library:
The actual one-liner might look like the following:
perl -MArchive::Zip -e 'Archive::Zip->new("test.zip")->extractMember("testx.txt", "foo.txt");'
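But I would like to provide the full code, with some checks:
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);

# new() returns undef if the archive cannot be read.
my $zip = Archive::Zip->new("test.zip")
    or die "Cannot read test.zip";
my $file_path = "test.txt";
my $PARANOID = 1;

# Optionally verify that the member exists before extracting it.
if ($PARANOID) {
    my $file = $zip->memberNamed($file_path);
    unless ($file) {
        die "File '$file_path' not found in the archive";
    }
}

# extractMember returns AZ_OK on success.
$zip->extractMember($file_path, "extracted_file.txt") == AZ_OK
    or die "Could not extract '$file_path'";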
Please note that you need to have the Archive::Zip library installed:
cpan Archive::Zip
Or, if you are more comfortable with the Perl ecosystem and have the cpanm utility installed:
cpanm Archive::Zip
I just found that zip on Unix has a --copy option which copies a file from one archive to another. I don't believe it decompresses the file in the process so it should be just what you (and I) need.
Syntax is:
$ zip source.zip "*.c" --copy --out destination.zip
Referring to this post I have zipped and unzipped folders successfully - TrueZip - How to decompress inner jar/zip files without expanding them as directories?
Is there any way to split zip into parts using truezip, similar to 7z which allows us to create parts of zip file?
No, ZIP file splitting/spanning is not supported by TrueZIP.