mapreduce in java - gzip input files - java

I'm using Java, and I'm trying to write a MapReduce job that will receive as input a folder containing multiple gz files.
I've been looking all over, but all the tutorials I've found explain how to process a simple text file; I haven't found anything that solves my problem.
I've asked around at my workplace, but only got references to Scala, which I'm not familiar with.
Any help would be appreciated.

Hadoop checks the file extension to detect compressed files. The compression types supported by Hadoop are: gzip, bzip2, and LZO. You do not need to take any additional action to decompress files that use these types of compression; Hadoop handles it for you.
So all you have to do is write the logic as you would for a text file and pass in the directory that contains the .gz files as input.
The catch with gzip files is that they are not splittable. Imagine you have gzip files of 5GB each: each mapper will then process a whole 5GB file instead of working on a chunk of the default block size.
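To put numbers on the splittability point, here is a rough sketch in plain Java (no Hadoop dependency). The class name `SplitEstimate`, the 128 MB block size, and the ".gz" extension check are all assumptions made for illustration; Hadoop's real split logic is configurable and more involved than this.

```java
/** Rough estimate of how many map tasks a single non-empty input file yields. */
final class SplitEstimate {
    // Assumed block size: 128 MB is a common HDFS default, but it is configurable.
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    /** Simplified stand-in for Hadoop's extension-based codec detection. */
    static boolean isSplittable(String fileName) {
        return !fileName.endsWith(".gz");
    }

    /** One split per block for splittable files, one split in total for gzip. */
    static long estimateSplits(String fileName, long fileSizeBytes) {
        if (!isSplittable(fileName)) {
            return 1; // the whole file goes to a single mapper
        }
        long blocks = (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
        return Math.max(1, blocks);
    }

    public static void main(String[] args) {
        long fiveGb = 5L * 1024 * 1024 * 1024;
        System.out.println("plain text: " + estimateSplits("part-0000.txt", fiveGb));
        System.out.println("gzip:       " + estimateSplits("part-0000.gz", fiveGb));
    }
}
```

Under these assumptions a 5GB text file is spread over 40 mappers, while a 5GB .gz file is handled by one. If that becomes a bottleneck, producing many smaller .gz files is one way to get parallelism back.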

Related

Spark, ignore some input files

I have my data on hdfs, the folder structure is something like,
hdfs://ns1/abc/20200101/00/00/
hdfs://ns1/abc/20200101/00/01/
hdfs://ns1/abc/20200101/00/02/
......
Basically, we create a folder every minute and put hundreds of files in it.
We have a Spark (2.3) application (written in Java) which processes data on a daily basis, so the input path we use is something like hdfs://ns1/abc/20200101, simple and straightforward. But sometimes a few files are corrupt or zero-sized, and this causes the whole Spark job to fail.
So is there a simple way to just ignore any bad file? I have tried --conf spark.sql.files.ignoreCorruptFiles=true, but it doesn't help at all.
Or can we specify some 'file pattern' on the command line when submitting the Spark job, since those bad files usually have a different file extension?
Or, since I'm using JavaSparkContext#newAPIHadoopFile(path, ...) to read data from HDFS, is there any trick I can do with JavaSparkContext#newAPIHadoopFile(path, ...) so that it will ignore bad files?
Thanks.

Get RandomAccessFile from JAR archive

Summary:
I have a program I want to ship as a single jar file.
It depends on three big resource files (700MB each) in a binary format. The file content can easily be accessed via indexing, my parser therefore reads these files as RandomAccessFile-objects.
So my goal is to access resource files from a jar via File objects.
My problem:
When accessing the resource files from my file system, there is no issue, but I aim to pack them into the jar file of the program, so the user does not need to handle these files themselves.
The only way I found so far to access a file packed in a jar is via InputStream (generated by class.getResourceAsStream()), which is totally useless for my application as it would be much too slow reading these files from start to end instead of using RandomAccessFile.
Copying the file content into a file, reading it and deleting it at runtime is not an option either, for the same reason.
Can someone confirm that there is no way to achieve my goal or provide a solution (or a hint so I can work it out myself)?
What I found so far:
I found this answer and if I understand the answer it says that there is no way to solve my problem:
Resources in a .jar file are not files in the sense that the OS can access them directly via normal file access APIs.
And since java.io.File represents exactly that kind of file (i.e. a thing that looks like a file to the OS), it can't be used to refer to anything in a .jar file.
A possible workaround is to extract the resource to a temporary file and refer to that with a File.
I think I can follow the reasoning behind it, but it is over eight years old now, and while I am not very knowledgeable when it comes to file systems and archives, I know that the Java language has evolved quite a lot since then, so maybe there is hope? :)
Probably useless background information:
The files are genomes in the 2bit format and I use the TwoBitParser from biojava via the wrapper class TwoBitFacade. The Javadocs can be found here and here.
Resources are not files, and they live in a JAR file, which is not a random access medium.
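The temporary-file workaround quoted in the question can be sketched roughly as follows. This is only an illustration: `ResourceExtractor` and the resource name `/genome.2bit` are hypothetical, and the copy still costs one sequential pass over each resource, which may or may not be acceptable for 700MB files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ResourceExtractor {
    /** Copies the stream to a temp file (deleted on JVM exit) and opens it read-only. */
    static RandomAccessFile openRandomAccess(InputStream resource) throws IOException {
        Path tmp = Files.createTempFile("resource-", ".bin");
        tmp.toFile().deleteOnExit();
        try (InputStream in = resource) {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return new RandomAccessFile(tmp.toFile(), "r");
    }

    public static void main(String[] args) throws IOException {
        // "/genome.2bit" is a hypothetical resource name used only for illustration.
        InputStream in = ResourceExtractor.class.getResourceAsStream("/genome.2bit");
        if (in == null) {
            System.err.println("resource not found on the classpath");
            return;
        }
        try (RandomAccessFile raf = openRandomAccess(in)) {
            raf.seek(16); // jump straight to an arbitrary offset
            System.out.println("byte at offset 16: " + raf.read());
        }
    }
}
```

Once the copy exists, the RandomAccessFile behaves like one opened on any ordinary file, so an index-based parser can seek freely.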

Best way for Hadoop to handle lots of image files

I've successfully managed to handle multiple image files in 2 ways in Hadoop:
Using Java to stitch images together using a sequence file. This requires a text file that points to the locations of all the files.
Using Python and Hadoop streaming to cache files to each node using -cacheArchive in the form of a tar.gz archive.
Both methods seem a bit ropey to me. Let's say I have one million files: I do not want to have to create the text file or zip up so many files. Is there a way I can just point my mapper to an HDFS folder and have it read through that folder at runtime? I know input can be used, but this is for text files. Or am I missing something? Any pointers are most appreciated.

TrueZip Random Access Functionality

I'm trying to understand how to randomly traverse a file or files in a .tar.gz using TrueZIP in a Java 6 environment (using the Files classes). I found examples that use Java 7's Path; however, I can't come up with an example of how to randomly read an archive on Java 6.
Additionally, does "random" reading mean that it first uncompresses the entire archive, or does it read sections in the compressed file? The purpose is that I want to retrieve some basic information from the file without having to uncompress the entire thing just to read it (i.e. username).
The method that gzip uses to compress a file (especially .tar.gz files) usually implies that the output file is not random-accessible - you need the symbol table and other context from the entire file up to the current block to even be able to uncompress that block to see what's in it. This is one of the ways it achieves (somewhat) better compression over ZIP/pkzip, which compress each file individually before adding them to a container archive, resulting in the ability to seek to a specific file and uncompress just that file.
So, in order to pick a .tar.gz apart, you will need to uncompress the whole thing, either to a temporary file or in memory (if it's not too large), then you can jump to specific entries in the underlying .tar file, although that has to be done sequentially by skipping from header to header, as tar does not include a central index/directory of files.
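The header-to-header walk described above can be sketched with nothing but the JDK. This is a simplified illustration, not TrueZIP's API: `TarGzLister` is a hypothetical name, and the parser assumes classic tar headers (name in bytes 0-99, octal size in bytes 124-135) without long-name or checksum handling. Note that skipping an entry's data still decompresses it, because the gzip stream can only be read front to back.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

final class TarGzLister {
    private static final int BLOCK = 512; // tar works in 512-byte blocks

    /** Reads a NUL-terminated ASCII field from a tar header. */
    private static String field(byte[] header, int offset, int length) {
        int end = offset;
        while (end < offset + length && header[end] != 0) end++;
        return new String(header, offset, end - offset, StandardCharsets.US_ASCII);
    }

    /** Lists entry names by walking from header to header through the gzip stream. */
    static List<String> listEntryNames(InputStream tarGz) throws IOException {
        List<String> names = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(tarGz))) {
            byte[] header = new byte[BLOCK];
            byte[] scratch = new byte[8192];
            while (true) {
                try {
                    in.readFully(header);
                } catch (EOFException e) {
                    break; // stream ended without the usual zero blocks
                }
                if (header[0] == 0) break; // an all-zero block marks the end of the archive
                names.add(field(header, 0, 100));
                long size = Long.parseLong(field(header, 124, 12).trim(), 8); // octal size
                long toSkip = (size + BLOCK - 1) / BLOCK * BLOCK; // data is padded to 512
                while (toSkip > 0) {
                    int n = in.read(scratch, 0, (int) Math.min(scratch.length, toSkip));
                    if (n < 0) throw new EOFException("truncated archive");
                    toSkip -= n; // these bytes are inflated, then thrown away
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a .tar.gz file
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String name : listEntryNames(in)) System.out.println(name);
        }
    }
}
```

Nothing is written to disk and only one buffer is held in memory, but every byte before the entry of interest still passes through the inflater.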
I am not aware of TrueZip in particular, but at least in terms of Zip, RAR and Tar you can access single files and retrieve details about them and even extract them without touching the rest of the package.
Additionally, does "random" reading mean that it first uncompresses the entire archive
If TrueZip follows Zip/RAR/Tar format, then it does not uncompress the entire archive.
The purpose is that I want to retrieve some basic information from the file without having to uncompress the entire thing just to read it (i.e. username).
As previously, that should be fine -- I don't know TrueZip API in particular, but file container formats allow you to inspect file info without reading a single bit of the data, and optionally extract/read the file contents without touching any other file in the container.
The source code comment of zran describes how such tools work:
http://svn.ghostscript.com/ghostscript/tags/zlib-1.2.3/examples/zran.c
In conclusion, the complete file has to be processed once to generate the necessary index.
After that, reading from an arbitrary offset is much faster than decompressing everything before it.
The index splits the file into blocks that can be decompressed without first decompressing the preceding blocks. That is how random access is emulated.

Reading zip files stored in GAE Blobstore

I have followed the sample code below to upload a zip file to the blobstore. I am able to upload the zip file, but I have some concerns about reading it.
Sample Code http://code.google.com/appengine/docs/python/blobstore/overview.html#Complete_Sample_App
My zip file contains 6 CSV files; my system will read them and import the values into the datastore. However, I am aware that there is a restriction on reads, which must be less than 1MB.
Can anyone suggest how I can go about reading the zip file and processing the CSV files? What will happen if the data saved in the blobstore is more than 1MB?
Hope to hear from you. Thanks in advance.
Individual API calls to the blobstore API must be less than 1MB, but you can read as much data as you want with multiple calls. See this blog post for an example of using BlobReader to read the contents of a zip file from the blobstore; it's written using Python, but BlobReader is available in the Java SDK too, and the same technique applies.
