For non-splittable formats such as GZIP there will be only one map task, because a gzip file cannot be split. Is there any option or optimization to store all blocks of such a file on one data node, so we can at least save network bandwidth?
Increasing the HDFS block size for your gzip file so that it is larger than the file itself should do the trick. For more info on setting the HDFS block size per file, see this answer.
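For illustration, a minimal sketch of the per-file approach using Hadoop's Java FileSystem API; the paths and the 4 GB block size here are made up for the example, and the idea is simply to pick a block size larger than the gzip file:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path dst = new Path("/data/archive.gz");
// Block size larger than the gzip file, so the whole file lands in one block.
long blockSize = 4L * 1024 * 1024 * 1024;
try (InputStream in = new FileInputStream("archive.gz");
     FSDataOutputStream out = fs.create(dst,
             true,                                    // overwrite
             conf.getInt("io.file.buffer.size", 4096),
             fs.getDefaultReplication(dst),
             blockSize)) {
    IOUtils.copyBytes(in, out, conf);                 // stream the local file into HDFS
}

Equivalently, from the shell (assuming Hadoop 2.x property names): hdfs dfs -D dfs.blocksize=4294967296 -put archive.gz /data/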
Trying to figure out if it's possible to download a specific file, or a range of bytes, from an uncompressed TAR archive in S3.
The use case can be described like this:
The TAR file is generated by my application (so we have control of that)
The TAR file lives in an S3 bucket
The TAR file is named archive.tar
The TAR file contains two files: metadata.txt and payload.png
metadata.txt is guaranteed to always be of size "n" bytes, where "n" is relatively small
payload.png can be any size and thus can be a very large file (> 1 GB)
My application needs to be able to download metadata.txt to understand how to process the TAR file, and I don't want the application to have to download the whole TAR file just for the metadata.txt file
Ideally, at any given point, I should only ever have the metadata.txt file opened in memory and never the entire TAR archive or any part of payload.png. I don't want to incur the network or memory overhead of downloading a huge TAR archive just to be able to read the small metadata.txt file contained.
I've noticed S3ObjectInputStream in the AWS SDK, but I'm not sure how to use it with a TAR file for my use case.
Anyone ever implement something similar or have any pointers to references I can check out to help with this?
Yes, it's possible for an uncompressed tarball: the file format has a header record for each member file, which you can use to check the archive's contents.
I'm more of a Python than a Java guy, but take a look at my implementation of tarball range requests here and docs here.
In short, you can read the header (the file name always comes first, and each header is padded out to a 512-byte block with NULL b"\x00" bytes), read the size field from it to determine the entry's variable length, take that length modulo 512 to work out the end-of-entry padding, and then iterate until 1024 bytes before the end of the archive (you can send a HEAD request to get the total size in bytes, or it is returned when you execute a range request, a.k.a. a partial-content request). The 1024-before-the-end part is because a tar archive ends with at least two empty 512-byte blocks.
When iterating, it's probably sensible to check whether the filename field of each block you expect to be a file header is in fact all NULL bytes, as this indicates you've reached the end-of-archive blocks (the spec says "at least 2 empty blocks", so there may be more). But if you control the tar files being generated, maybe you wouldn't need to bother.
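For the concrete case in the question (metadata.txt is the first entry), a rough sketch using range requests with the AWS SDK for Java v1 might look like this; the bucket name is made up, and it assumes a plain ustar archive with no extended (pax/GNU long-name) headers:

import java.nio.charset.StandardCharsets;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.util.IOUtils;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

// Fetch only the first 512-byte tar header block (ranges are inclusive).
byte[] header = IOUtils.toByteArray(
        s3.getObject(new GetObjectRequest("my-bucket", "archive.tar").withRange(0, 511))
          .getObjectContent());

// ustar header layout: name is bytes 0-99 (NUL padded), size is bytes 124-135 in octal.
String name = new String(header, 0, 100, StandardCharsets.US_ASCII).split("\0")[0];
long size = Long.parseLong(new String(header, 124, 12, StandardCharsets.US_ASCII).trim(), 8);

// The entry's data starts right after its 512-byte header block.
byte[] metadata = IOUtils.toByteArray(
        s3.getObject(new GetObjectRequest("my-bucket", "archive.tar").withRange(512, 512 + size - 1))
          .getObjectContent());

If metadata.txt were not guaranteed to be the first entry, you would repeat the header read at the next offset (current header offset + 512 + size rounded up to a multiple of 512) until the name matches, which is exactly the iteration described above.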
I could have up to 100 thousand small files (each 10-50 KB). They are all stored in HDFS with a block size of 128 MB. I have to read them all at once with Apache Spark, as below:
// return a list of paths to the small files
List<String> paths = getAllPaths();
// read up to 100,000 small files at once into memory
sparkSession
    .read()
    .parquet(paths.toArray(new String[0]))
    .as(Encoders.kryo(SmallFileWrapper.class))
    .coalesce(numPartitions);
Problem
The number of small files is not a problem from the perspective of memory consumption. The problem is the speed of reading that many files: it takes 38 seconds to read 490 small files and 266 seconds to read 3,420 files, so I suppose it would take a very long time to read 100,000 files.
Questions
Will HAR or sequence files speed up an Apache Spark batch read of 10k-100k small files? Why?
Will HAR or sequence files slow down persisting those small files? Why?
P.S.
Batch read is the only operation required for these small files; I don't need to read them by id or anything else.
From that post: How does the number of partitions affect `wholeTextFiles` and `textFiles`?
wholeTextFiles uses WholeTextFileInputFormat ... Because it extends CombineFileInputFormat, it will try to combine groups of smaller files into one partition ... Each record in the RDD ... has the entire contents of the file
Confirmation in the Spark 1.6.3 Java API documentation for SparkContext
http://spark.apache.org/docs/1.6.3/api/java/index.html
RDD<scala.Tuple2<String,String>> wholeTextFiles(String path, int minPartitions)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
Confirmation in the source code (branch 1.6) comments for class WholeTextFileInputFormat
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala
A org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat for reading whole text files. Each file is read as key-value pair, where the key is the file path and the value is the entire content of file.
For the record, Hadoop CombineInputFormat is the standard way to stuff multiple small files into a single Mapper; it can be used in Hive with the properties hive.hadoop.supports.splittable.combineinputformat and hive.input.format.
Spark wholeTextFiles() reuses that Hadoop feature, with two drawbacks:
(a) you have to consume a whole directory and can't filter out files by name before loading them (you can only filter after loading)
(b) you have to post-process the RDD by splitting each file into multiple records, if required
That seems to be a viable solution nonetheless, cf. that post: Spark partitioning/cluster enforcing
Or, you can build your own custom file reader based on that same Hadoop CombineInputFormat, cf. that post: Apache Spark on YARN: Large number of input data files (combine multiple input files in spark)
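To illustrate the wholeTextFiles() route mentioned above, here is a minimal sketch; the directory path, partition count and the JavaSparkContext variable sc are assumptions, and the flatMap uses the Spark 2.x Java signature (where FlatMapFunction returns an Iterator):

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Each element is (filePath, entireFileContent); CombineFileInputFormat
// packs many small files into each of the requested partitions.
JavaPairRDD<String, String> files =
        sc.wholeTextFiles("hdfs:///data/small-files", 16);

// Drawback (b) from above: split each file's content into records yourself.
JavaRDD<String> records =
        files.flatMap(pair -> Arrays.asList(pair._2().split("\n")).iterator());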
I have a general question about Hadoop file splitting and multiple mappers. I am new to Hadoop and am trying to get a handle on how to set things up for optimal performance. My project currently processes WARC files, which are gzipped.
Using the current InputFileFormat, each file is sent to a single mapper and is not split. I understand this is the expected behavior for a compressed file. Would there be a performance benefit to decompressing the file as an intermediate step before running the job, so that the input can be split and thus use more mappers?
Would that be possible? Does having more mappers create more overhead in latency or is it better to have one mapper? Thanks for your help.
Although WARC files are gzipped they are splittable (cf. Best splittable compression for Hadoop input = bz2?) because every record has its own deflate block. But the record offsets must be known in advance.
But is this really necessary? The Common Crawl WARC files are all about 1 GB in size, and each one should normally be processed within at most 15 minutes. Given the overhead of launching a map task, that's a reasonable time for a mapper to run. If need be, a mapper could also process a few WARC files, but it's important that you have enough splits of the input WARC file list so that all nodes are running tasks. Splitting a single WARC file across mappers on Hadoop would mean a lot of unnecessary overhead.
Here's my use case:
I want to store many small entries of about 1K into archive files of about 8M.
I want to be able to read individual entries efficiently (without reading the whole file).
I want to be able to compress the archive efficiently. In the test I performed, a TAR archive compressed as a whole (TAR+ZIP) was 4x smaller than a plain ZIP archive of the same entries. This isn't surprising at all: there's not much opportunity to compress individual 1K entries.
I don't need to update the archive. Once created, it's immutable.
Is there any archive format which supports both (global compression + individual access)? Theoretically, the two goals are not mutually exclusive.
Note: This is for a Java project, so I am restricted to a format that also has a java library.
I am not aware of a canned solution for your problem, so you may need to write it yourself.
It certainly can be done. I would use the tar format, since it is simple and well understood, but it would require an auxiliary file with index information into the compressed archive. What you would do would be to control the compression of the tar file to create entry points that require no history. Those entry points would need to be much farther apart than 1K to enable good compression, but they can be close enough together to provide relatively fast random access to the 1K files.
The simplest approach would be to use gzip to separately compress chunks of the tar file representing sets of complete files that together are around 128K bytes. The gzip streams can be simply concatenated, and the resulting .tar.gz file would work normally with tar utilities. It is a property of the gzip format that valid gzip streams concatenated are also valid gzip streams.
The auxiliary file would contain a list of the files in the tar archive, their size and offset in the uncompressed tar file, and then separately the compressed and uncompressed offsets of each gzip stream starting point. Then when extracting a file you would look for its offset in the uncompressed tar file, find the gzip stream with the largest uncompressed offset less than or equal to that file's offset, and then start decompressing from the corresponding compressed offset until you get to that file.
For this example, on average you would only need to decompress 64K to get to any given file in the archive.
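As a rough illustration of the compression side, here is a sketch in Java; the class name and file paths are made up, it cuts at fixed ~128 KB boundaries rather than at whole-file boundaries (which still works for the random-access scheme described above), and it uses Commons IO's CountingOutputStream to record compressed offsets:

import java.io.*;
import java.util.*;
import java.util.zip.GZIPOutputStream;
import org.apache.commons.io.output.CountingOutputStream;

public class ChunkedTarGz {
    public static void main(String[] args) throws IOException {
        final int CHUNK = 128 * 1024;                 // ~128K of uncompressed tar per gzip member
        List<long[]> index = new ArrayList<>();       // {uncompressedOffset, compressedOffset}
        try (InputStream tar = new BufferedInputStream(new FileInputStream("archive.tar"));
             CountingOutputStream out = new CountingOutputStream(new FileOutputStream("archive.tar.gz"))) {
            byte[] buf = new byte[CHUNK];
            long uncompressedOffset = 0;
            int n;
            while ((n = tar.readNBytes(buf, 0, CHUNK)) > 0) {        // Java 9+
                index.add(new long[]{uncompressedOffset, out.getByteCount()});
                GZIPOutputStream gz = new GZIPOutputStream(out);
                gz.write(buf, 0, n);
                gz.finish();      // end this gzip member but keep 'out' open for the next one
                                  // (a production version would also release the member's Deflater)
                uncompressedOffset += n;
            }
        }
        // 'index', together with the tar's own entry table (name, size, offset),
        // would be written to the auxiliary file described above.
    }
}

To extract, you would look up the largest uncompressed offset in the index that is less than or equal to the target file's tar offset, seek to the corresponding compressed offset, and decompress from there with a fresh GZIPInputStream until the file is reached.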
In general, the compression table that gets built is interspersed with the compressed data referring to it.
If one wants to do compression oneself, one way would be:
[shared compression table] ...
[compression table additions specific to file 1] [file 1]
[compression table additions specific to file 2] [file 2]
...
And at the end, shuffle/share the compression table parts.
Whether one would gain anything over 7zip, bzip2 and others is the question.
BTW, Java's ZIP handling (still?) does not use the optional file index at the end of the file.
I am reading a ZIP file using Java as below:
Enumeration<? extends ZipEntry> zes = zip.entries();
while (zes.hasMoreElements()) {
    ZipEntry ze = zes.nextElement();
    // do stuff..
}
I am getting an out-of-memory error; the ZIP file size is about 160 MB. The stack trace is as below:
Exception in thread "Timer-0" java.lang.OutOfMemoryError: Java heap space
at java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:88)
at java.util.zip.ZipFile$1.<init>(ZipFile.java:229)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:229)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:197)
at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.zipFilePass2(DatToInsertDBBatch.java:250)
at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.processCompany(DatToInsertDBBatch.java:206)
at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.run(DatToInsertDBBatch.java:114)
at java.util.TimerThread.mainLoop(Timer.java:534)
at java.util.TimerThread.run(Timer.java:484)
How do I enumerate the contents of a big ZIP file without having to increase my heap size? Also, when I don't enumerate the contents and just access a single file like this:
ZipFile zip = new ZipFile(zipFile);
ZipEntry ze = zip.getEntry("docxml.xml");
then I don't get an out-of-memory error. Why does this happen? How does a ZipFile handle its entries? Another option would be to use a ZipInputStream; would that have a smaller memory footprint? I will eventually need to run this code on a micro EC2 instance on the Amazon cloud (613 MB RAM).
EDIT: Providing more information on how I process the ZIP entries after I get them:
Enumeration<? extends ZipEntry> zes = zip.entries();
while (zes.hasMoreElements()) {
    ZipEntry ze = zes.nextElement();
    S3Object s3Object = new S3Object(bkp.getCompanyFolder() + map.get(ze.getName()).getRelativeLoc());
    s3Object.setDataInputStream(zip.getInputStream(ze));
    s3Object.setStorageClass(S3Object.STORAGE_CLASS_REDUCED_REDUNDANCY);
    s3Object.addMetadata("x-amz-server-side-encryption", "AES256");
    s3Object.setContentType(Mimetypes.getInstance().getMimetype(s3Object.getKey()));
    s3Object.setContentDisposition("attachment; filename=" + FilenameUtils.getName(s3Object.getKey()));
    s3objs.add(s3Object);
}
I get the input stream from the zip entry and store it in the S3Object. I collect all the S3Objects in a list and then finally upload them to Amazon S3. For those who don't know Amazon S3: it's a file storage service; you upload files via HTTP.
I am thinking this may be happening because I collect all the individual input streams. Would it help if I batched it up, say 100 input streams at a time? Or would it be better to unzip first and then upload the unzipped files rather than storing streams?
It is very unlikely that you get an out-of-memory exception merely because of processing a ZIP file. The Java classes ZipFile and ZipEntry don't contain anything that could possibly fill up 613 MB of memory.
What could exhaust your memory is to keep the decompressed files of the ZIP archive in memory, or - even worse - keeping them as an XML DOM, which is very memory intensive.
Switching to another ZIP library will hardly help. Instead, you should look into changing your code so that it processes the ZIP archive and the contained files like streams and only keeps a limited part of each file in memory at a time.
BTW: It would be nice if you could provide more information about the huge ZIP files (do they contain many small files or a few large ones?) and about what you do with each ZIP entry.
Update:
Thanks for the additional information. It looks like you keep the contents of the ZIP file in memory (although it somewhat depends on the implementation of the S3Object class, which I don't know).
It's probably best to implement some sort of batching as you propose yourself. You could for example add up the decompressed size of each ZIP entry and upload the files every time the total size exceeds 100 MB.
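A rough sketch of that batching idea; the helper methods here are hypothetical, standing in for however the S3Object list is built and uploaded in the question's code:

long batchBytes = 0;
List<S3Object> batch = new ArrayList<>();
Enumeration<? extends ZipEntry> entries = zip.entries();
while (entries.hasMoreElements()) {
    ZipEntry ze = entries.nextElement();
    batch.add(buildS3Object(ze, zip.getInputStream(ze)));   // hypothetical helper
    batchBytes += Math.max(ze.getSize(), 0);                // uncompressed size; getSize() may be -1
    if (batchBytes > 100L * 1024 * 1024) {                  // flush roughly every 100 MB
        uploadAndClear(batch);                              // hypothetical helper: upload, then batch.clear()
        batchBytes = 0;
    }
}
if (!batch.isEmpty()) {
    uploadAndClear(batch);
}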
You're using the ZipFile class now, as I see. Probably using ZipInputStream would be a better option, because it has a closeEntry() method which (I hope) deallocates the memory resources used by an entry. But I haven't used it before; it's just a guess.
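For what it's worth, a minimal sketch of that approach, assuming the same local ZIP file as in the question:

try (ZipInputStream zin = new ZipInputStream(
        new BufferedInputStream(new FileInputStream(zipFile)))) {
    ZipEntry ze;
    while ((ze = zin.getNextEntry()) != null) {
        // read/upload this entry's bytes from 'zin' here, before moving on
        zin.closeEntry();   // finish the current entry and position at the next one
    }
}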
The default heap size of a JVM is 64 MB. You need to specify a larger size on the command line: use the -Xmx switch, e.g. -Xmx256m.
Indeed, java.util.zip.ZipFile has a size() method, but it doesn't provide a method to access entries by index. Perhaps you need to use a different ZIP library; as I remember, I used TrueZIP with rather large archives.