Apache Spark on HDFS: reading 10k-100k small files at once - java

I could have up to 100 thousand small files (each 10-50 KB). They are all stored on HDFS with a block size of 128 MB. I have to read them all at once with Apache Spark, as below:
// return a list of paths to small files
List<String> paths = getAllPaths();
// read up to 100,000 small files at once into memory
Dataset<SmallFileWrapper> smallFiles = sparkSession
    .read()
    .parquet(paths.toArray(new String[0]))
    .as(Encoders.kryo(SmallFileWrapper.class))
    .coalesce(numPartitions);
Problem
The number of small files is not a problem from the perspective of memory consumption. The problem is the speed of reading that many files. It takes 38 seconds to read 490 small files, and 266 seconds to read 3,420 files. I suppose it would take far too long to read 100,000 files.
Questions
Will HAR or sequence files speed up an Apache Spark batch read of 10k-100k small files? Why?
Will HAR or sequence files slow down persisting those small files? Why?
P.S.
Batch reading is the only operation required for those small files; I don't need to read them by id or anything else.

From this post: How does the number of partitions affect `wholeTextFiles` and `textFiles`?
wholeTextFiles uses WholeTextFileInputFormat ... Because it extends CombineFileInputFormat, it will try to combine
groups of smaller files into one partition ... Each record in the RDD
... has the entire contents of the file
Confirmation in the Spark 1.6.3 Java API documentation for SparkContext
http://spark.apache.org/docs/1.6.3/api/java/index.html
RDD<scala.Tuple2<String,String>> wholeTextFiles(String path, int minPartitions)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
Confirmation in the source code (branch 1.6) comments for class WholeTextFileInputFormat
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala
A org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat for
reading whole text files. Each file is read as key-value pair, where the key is the file path and
the value is the entire content of file.
For the record, Hadoop CombineInputFormat is the standard way to stuff multiple small files into a single Mapper; it can be used in Hive with the properties hive.hadoop.supports.splittable.combineinputformat and hive.input.format.
Spark wholeTextFiles() reuses that Hadoop feature, with two drawbacks:
(a) you have to consume a whole directory; you can't filter out files by name before loading them (you can only filter after loading)
(b) you have to post-process the RDD by splitting each file into multiple records, if required
That seems to be a viable solution nonetheless, cf. that post: Spark partitioning/cluster enforcing
Or, you can build your own custom file reader based on that same Hadoop CombineInputFormat, cf. that post: Apache Spark on YARN: Large number of input data files (combine multiple input files in spark)
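For reference, here is a minimal Java sketch of the wholeTextFiles() approach; the hdfs:///data/small-files directory, the app name, and the minPartitions value are placeholders you would adjust to your cluster:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class WholeTextFilesExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read-small-files")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Each record is (file path, entire file content); CombineFileInputFormat
        // packs many small files into each partition under the hood.
        int minPartitions = 64; // placeholder: tune to your cluster
        JavaPairRDD<String, String> files =
                jsc.wholeTextFiles("hdfs:///data/small-files", minPartitions);

        System.out.println("files read: " + files.count());
        spark.stop();
    }
}

Drawback (b) above still applies: if each file must become multiple records, you post-process this RDD, e.g. with a flatMap.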

Related

Google Bigquery load data with local file size limit

Is there any limit on loading data into Google BigQuery from a local file via the API?
As the Google BigQuery documentation mentions regarding the Web UI, a local file must be <=10 MB and at most 16,000 rows. Does the same limit apply to the API?
There is no BigQuery API to load local files directly. A local file load via the bq command or the Web UI works, I believe, by uploading the file to GCS on your behalf and then running a normal API load job from GCS; you can see this clearly in the UI. But because Google wants a reasonable user experience from the Web UI/bq command, additional, much stricter limits are in place for uploading "local" files.
My recommendation is to go the GCS path to load big files
(https://cloud.google.com/bigquery/docs/loading-data-cloud-storage)
Importantly, it is free (compare with streaming, where you pay for the streamed data).
The limits are the following (from https://cloud.google.com/bigquery/quotas):
Load jobs per table per day — 1,000 (including failures)
Maximum columns per table — 10,000
Maximum size per load job — 15 TB across all input files for CSV, JSON, and Avro
Maximum number of files per load job — 10 Million total files including all files matching all wildcard URIs
For CSV and JSON - 4 GB per compressed file, 5 TB uncompressed
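To make the GCS route concrete, here is a minimal sketch using the google-cloud-bigquery Java client; the bucket, dataset, and table names are placeholders, and CSV format is assumed:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class GcsLoadExample {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId tableId = TableId.of("my_dataset", "my_table"); // placeholders

        // Load directly from GCS; load jobs are free, unlike streaming inserts
        LoadJobConfiguration config =
                LoadJobConfiguration.newBuilder(tableId, "gs://my_bucket/big_file.csv")
                        .setFormatOptions(FormatOptions.csv())
                        .build();

        Job job = bigquery.create(JobInfo.of(config));
        job = job.waitFor(); // blocks until the load job finishes
        System.out.println("Load done, errors: " + job.getStatus().getError());
    }
}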
There are no special limits for local file uploads; the 10 MB / 16,000-row limit applies only to the UI. But I don't recommend uploading huge local files.

Hadoop Yarn write to local file system

I have a scenario where I process thousands of small files using Hadoop. The output of the Hadoop job is then to be used as input for a non-Hadoop algorithm. In the current workflow, data is read, converted to Sequence Files, and processed, and the resulting small files are output to HDFS as Sequence Files. However, the non-Hadoop algorithm cannot understand Sequence Files. Therefore, I've written another simple Hadoop job that reads the data from the resulting Sequence Files and creates the final small files that can be used by the non-Hadoop algorithm.
The catch is that for this final job I have to read the Sequence Files from HDFS and write to the local file system of each node, so the output can be processed by the non-Hadoop algorithm. I've tried setting the output path to file:///<local-fs-path> and using the Hadoop LocalFileSystem class. However, doing so writes the final results only to the namenode's local file system.
Just to complete the picture, I have a 10-node Hadoop setup with YARN. Is there a way, in Hadoop YARN mode, to read data from HDFS and write the results to the local file system of each processing node?
Thanks
Not really. While you can write to LocalFileSystem, you can't ask YARN to run your application on all nodes. Also, depending on how your cluster is configured, YARN's Node Managers might not be running on all nodes of your system.
A possible workaround is to keep your converted files in HDFS, then have your non-Hadoop process first call hdfs dfs -copyToLocal.
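In case you prefer to do that copy from Java instead of the shell, a minimal sketch with the Hadoop FileSystem API (both paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToLocal {
    public static void main(String[] args) throws Exception {
        // assumes core-site.xml / hdfs-site.xml on the classpath so fs.defaultFS points at HDFS
        FileSystem hdfs = FileSystem.get(new Configuration());
        // copy the converted output from HDFS onto this node's local disk
        hdfs.copyToLocalFile(new Path("/output/converted"),
                             new Path("file:///tmp/non-hadoop-input"));
        hdfs.close();
    }
}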

Hadoop process WARC files

I have a general question about Hadoop file splitting and multiple mappers. I am new to Hadoop and am trying to get a handle on how to set up for optimal performance. My project currently processes WARC files, which are gzipped.
Using the current InputFileFormat, the file is sent to one mapper and is not split. I understand this is the correct behavior for a compressed file. Would there be a performance benefit to decompressing the file as an intermediate step before running the job, to allow the job to be split and thus use more mappers?
Would that be possible? Does having more mappers create more overhead in latency or is it better to have one mapper? Thanks for your help.
Although WARC files are gzipped they are splittable (cf. Best splittable compression for Hadoop input = bz2?) because every record has its own deflate block. But the record offsets must be known in advance.
But is this really necessary? The Common Crawl WARC files are all about 1 GB in size, and each should normally be processed within 15 minutes at most. Given the overhead of launching a map task, that's a reasonable time for a mapper to run. Alternatively, a mapper could also process a few WARC files, but it's important that you have enough splits of the input WARC file list so that all nodes are running tasks. Processing a single WARC file on Hadoop would mean a lot of unnecessary overhead.
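If you go the file-list route, one way to get enough splits is NLineInputFormat over a text file that lists WARC paths. A rough sketch, where the list path, the output path, the lines-per-split value, and the mapper body are placeholders:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WarcListDriver {

    // Placeholder mapper: each input value is one line of the list, i.e. one WARC path.
    public static class WarcListMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text warcPath, Context ctx)
                throws IOException, InterruptedException {
            // open and process the WARC file named by warcPath here
            ctx.write(warcPath, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "warc-list");
        job.setJarByClass(WarcListDriver.class);

        // 10 list lines per split => 10 WARC files per mapper (tune so every node gets work)
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path("/input/warc.paths"));
        NLineInputFormat.setNumLinesPerSplit(job, 10);

        job.setMapperClass(WarcListMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/output/warc-results"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}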

loading data into hdfs in parallel

I have a Hadoop cluster consisting of 3 nodes. I want to load a 180 GB file into HDFS as fast as possible. I know neither -put nor -copyFromLocal is going to help me here, as they are single-threaded.
I'm thinking in terms of Map/Reduce. Is there a way to distribute the loading process to the nodes themselves, so that each node loads a part of the file, say 60 GB each? I don't want to do this manually from each node (that defeats the purpose). If there is a way to do this using Java and Map/Reduce, I would love to read about it. I know Hadoop can process wildcard input files. Say each 60 GB chunk is named like this: file_1, file_2, file_3. I can then use file_* for my next MR jobs. The trouble I'm having is understanding how to load the file into Hadoop in a fast, multi-threaded way in the first place.
Thanks in advance!
Edit:
distcp - seems to do parallel copying into HDFS, but only between clusters, not within a cluster. I wonder why they didn't think of that, and if they did, what the limitations or bottlenecks around it are.
Also http://blog.syncsort.com/2012/06/moving-data-into-hadoop-faster/ seems to document benchmarks around this topic but they're using DMExpress (commercial tool) to do the loading. It would be great to have an Open Source alternative.
With your configuration, I don't know whether parallelizing the writes will improve your performance, because you want to write a single file.
Suppose we have the default configuration. The default replication factor is 3, so your file is considered written when each of its blocks has been written to 3 machines of your cluster (in your case, to all machines of your cluster).
If you have more than one disk per machine, splitting your file into smaller parts (as many parts as disks used by HDFS on one machine) can help improve write performance, but only if your application is the only one using the cluster and you are not limited by your network. In that case your bottleneck is your disks.
If you can manage the split file on your clients, a simple way to be sure all parts of your file are copied to HDFS is to create a directory whose name is the name of your file concatenated with a suffix indicating that the file is being copied. This directory contains all parts of your file. When all copying threads have finished, you rename the directory to drop the suffix. Your clients can access the parts of the file only once the suffix is removed. The rename is a metadata operation on the Namenode, so it is very fast compared to a file copy.
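A minimal sketch of that pattern with the Hadoop FileSystem API (the local part paths, the _COPYING_ suffix, and the thread count are placeholders):

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelUpload {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path inProgress = new Path("/data/bigfile._COPYING_"); // directory with "in copy" suffix
        Path done = new Path("/data/bigfile");
        fs.mkdirs(inProgress);

        List<String> localParts =
                Arrays.asList("/local/part_1", "/local/part_2", "/local/part_3");
        ExecutorService pool = Executors.newFixedThreadPool(localParts.size());
        for (String part : localParts) {
            pool.submit(() -> {
                try {
                    // each thread copies one part into the in-progress directory
                    fs.copyFromLocalFile(new Path(part), inProgress);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        // metadata-only rename on the Namenode makes all parts visible at once
        fs.rename(inProgress, done);
    }
}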
Other solutions:
Using a marker file is not the best option because you waste an HDFS block (the default block size is 128 MB).
Recreating the file from its parts amounts to rewriting the data, so it is inefficient.

Does HDFS store non-splittable files on one data node?

For non-splittable files such as GZIP there will be only one map task, because GZIP files are not splittable. Is there any option or optimization to store all blocks of such a file on one data node, so we can at least save network bandwidth?
Increasing the HDFS block size for your gzip file so that it is larger than the file size should do the trick. For more info on setting the HDFS block size per file, see this answer.
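For illustration, a minimal sketch that copies a local gzip file into HDFS with a per-file block size larger than the file itself (paths and sizes are placeholders; the same thing should be achievable with hdfs dfs -D dfs.blocksize=... -put):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteWithLargeBlock {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path src = new Path("file:///local/input.gz"); // placeholder local file
        Path dst = new Path("/data/input.gz");         // placeholder HDFS target

        long blockSize = 2L * 1024 * 1024 * 1024;      // 2 GB, larger than the gzip file
        short replication = fs.getDefaultReplication(dst);
        int bufferSize = 64 * 1024;

        try (InputStream in = FileSystem.getLocal(conf).open(src);
             FSDataOutputStream out = fs.create(dst, true, bufferSize, replication, blockSize)) {
            IOUtils.copyBytes(in, out, bufferSize, false);
        }
    }
}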
