Hadoop Yarn write to local file system - java

I have a scenario where I process thousands of small files using Hadoop. The output of the Hadoop job is then used as input for a non-Hadoop algorithm. In the current workflow, data is read, converted to Sequence Files, processed, and the resulting small files are output to HDFS as Sequence Files. However, the non-Hadoop algorithm cannot understand Sequence Files. Therefore, I've written another simple Hadoop job that reads the resulting data from the Sequence Files and creates the final small files that the non-Hadoop algorithm can use.
The catch is that for this final job I have to read the Sequence Files from HDFS and write to the local file system of each node, so that the output can be processed there by the non-Hadoop algorithm. I've tried setting the output path to file:///<local-fs-path> and using the Hadoop LocalFileSystem class. However, doing so writes the final results only to the namenode's local file system.
Just to complete the picture, I have a 10-node Hadoop setup with YARN. Is there a way in Hadoop YARN mode to read data from HDFS and write the results to the local file system of each processing node?
Thanks

Not really. While you can write to LocalFileSystem, you can't ask YARN to run your application on all nodes. Also, depending on how your cluster is configured, YARN's Node Managers might not be running on all nodes of your system.
A possible workaround is to keep your converted files in HDFS and then have your non-Hadoop process first call hdfs dfs -copyToLocal.
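For illustration, a minimal sketch of that pull step using the Hadoop FileSystem API, assumed to run on each node that needs the data and inside a method declared to throw Exception; the HDFS output directory and the local target directory are placeholders:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);               // fs.defaultFS points at HDFS
Path src = new Path("/user/me/job-output");           // placeholder HDFS output directory
Path dst = new Path("/tmp/job-output");               // placeholder local target directory
// copy the whole output directory to this node's local file system
// (equivalent to: hdfs dfs -copyToLocal /user/me/job-output /tmp/job-output)
hdfs.copyToLocalFile(false, src, dst, true);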

Related

Apache Spark on HDFS: read 10k-100k of small files at once

I could have up to 100 thousand small files (each 10-50 KB). They are all stored in HDFS with a block size of 128 MB. I have to read them all at once with Apache Spark, as below:
// return a list of paths to small files
List<String> paths = getAllPaths();
// read up to 100,000 small files at once into memory
sparkSession
    .read()
    .parquet(paths.toArray(new String[0]))
    .as(Encoders.kryo(SmallFileWrapper.class))
    .coalesce(numPartitions);
Problem
The number of small files is not a problem from the perspective of memory consumption. The problem is the speed of reading that amount of files. It takes 38 seconds to read 490 small files and 266 seconds to read 3,420 files; I suppose it would take far longer to read 100,000 files.
Questions
Will HAR or sequence files speed up an Apache Spark batch read of 10k-100k small files? Why?
Will HAR or sequence files slow down persisting those small files? Why?
P.S.
Batch reading is the only operation required for these small files; I don't need to read them by id or anything else.
From that post: How does the number of partitions affect `wholeTextFiles` and `textFiles`?
wholeTextFiles uses WholeTextFileInputFormat ... Because it extends CombineFileInputFormat, it will try to combine groups of smaller files into one partition ... Each record in the RDD ... has the entire contents of the file
Confirmation in the Spark 1.6.3 Java API documentation for SparkContext
http://spark.apache.org/docs/1.6.3/api/java/index.html
RDD<scala.Tuple2<String,String>> wholeTextFiles(String path, int minPartitions): Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
Confirmation in the source code (branch 1.6) comments for class WholeTextFileInputFormat
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala
A org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat for reading whole text files. Each file is read as key-value pair, where the key is the file path and the value is the entire content of file.
For the record, Hadoop's CombineFileInputFormat is the standard way to stuff multiple small files into a single Mapper; it can be used in Hive with the properties hive.hadoop.supports.splittable.combineinputformat and hive.input.format.
Spark wholeTextFiles() reuses that Hadoop feature, with two drawbacks:
(a) you have to consume a whole directory; you can't filter out files by name before loading them (you can only filter after loading)
(b) you have to post-process the RDD by splitting each file into multiple records, if required
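To make the wholeTextFiles() route concrete, here is a minimal sketch against Spark 2.x's Java API; it reuses sparkSession and numPartitions from the question's snippet, and the input directory and the newline-based splitting are just placeholder assumptions:
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext jsc = new JavaSparkContext(sparkSession.sparkContext());
// one record per file: (file path, entire file content); note that the whole
// directory is consumed, which is drawback (a)
JavaPairRDD<String, String> files =
    jsc.wholeTextFiles("hdfs:///data/small-files", numPartitions);
// drawback (b): if one file holds several logical records, split it yourself,
// here naively on newlines (FlatMapFunction returns an Iterator in Spark 2.x)
JavaRDD<String> records =
    files.flatMap(pair -> Arrays.asList(pair._2().split("\n")).iterator());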
That seems to be a viable solution nonetheless, cf. that post: Spark partitioning/cluster enforcing
Or, you can build your own custom file reader based on that same Hadoop CombineInputFormat, cf. that post: Apache Spark on YARN: Large number of input data files (combine multiple input files in spark)

Hadoop process WARC files

I have a general question about Hadoop file splitting and multiple mappers. I am new to Hadoop and am trying to get a handle on how to set things up for optimal performance. My project is currently processing WARC files which are gzipped.
Using the current input format, each file is sent to one mapper and is not split. I understand this is the expected behavior for a gzip-compressed file. Would there be a performance benefit to decompressing the file as an intermediate step before running the job, so that the input can be split and thus use more mappers?
Would that be possible? Does having more mappers create more overhead and latency, or is it better to have one mapper? Thanks for your help.
Although WARC files are gzipped, they are splittable (cf. Best splittable compression for Hadoop input = bz2?) because every record has its own deflate block. However, the record offsets must be known in advance.
But is this really necessary? The Common Crawl WARC files are all about 1 GB in size, and each should normally be processed within at most 15 minutes. Given the overhead of launching a map task, that is a reasonable running time for a mapper. Alternatively, a mapper could process a few WARC files, but then it is important that you have enough splits of the input WARC file list so that all nodes are running tasks. Processing a single WARC file on Hadoop would mean a lot of unnecessary overhead.
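A hedged sketch of that idea, assuming the job input is a text file listing one WARC path per line, that WarcDriver and WarcMapper are hypothetical classes you provide, and that this runs inside a driver method declared to throw Exception; the paths and the lines-per-split value are placeholders:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "warc-processing");
job.setJarByClass(WarcDriver.class);       // hypothetical driver class
job.setMapperClass(WarcMapper.class);      // hypothetical mapper that opens each listed WARC file
job.setNumReduceTasks(0);                  // map-only job
// the input is a plain text file with one WARC path per line;
// NLineInputFormat hands every map task a fixed number of those lines
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.addInputPath(job, new Path("/warc/warc-file-list.txt"));   // placeholder
NLineInputFormat.setNumLinesPerSplit(job, 5);    // e.g. 5 WARC files per mapper, tune as needed
FileOutputFormat.setOutputPath(job, new Path("/warc/output"));              // placeholder
job.waitForCompletion(true);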

Hadoop Single-node vs Multi-node

I have set up a single-node and a multi-node (1 master and 1 slave) cluster. When I try to run my application, it takes the same time on both, i.e. single-node and multi-node. In my application, I am copying data from HDFS to the local file system and then performing processing on it. Is this because I have the files stored locally and they are not accessible to the other nodes in the cluster? I am providing a file which is actually divided into 3 chunks, so logically it should be processed faster on the multi-node cluster.
Any idea?
Thanks!
When I try to run my application, it takes the same time on both, i.e. single-node and multi-node.
Well, the difference in time taken will vary depending on the type of operation performed and the amount of load generated by your application. For example, copying a few MB of data will take almost the same time on both a single-node and a multi-node cluster. A single-node cluster might even show better results for a small data set than a multi-node cluster. The real power of Hadoop lies in processing colossal volumes of data by using multi-node clusters for parallel processing.
In my application, I am copying data from HDFS to the local file system and then performing processing on it.
I do not see any sense in copying data to the local file system for processing in a multi-node environment. By doing that you are limiting yourself and not using the power of distributed computing.
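If some step genuinely cannot run as a MapReduce task, it can at least stream its input straight from HDFS instead of copying it to the local disk first. A minimal sketch, assuming it runs inside a method declared to throw IOException and using a placeholder path:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path input = new Path("/user/me/data/part-00000");   // placeholder HDFS path
// read directly from HDFS; no copy to the local file system is needed
try (FSDataInputStream in = fs.open(input);
     BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // process each line here
    }
}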

loading data into hdfs in parallel

I have a Hadoop cluster consisting of 3 nodes. I want to load a 180 GB file into HDFS as fast as possible. I know neither -put nor -copyFromLocal is going to help me here, as they are single-threaded.
I'm thinking in terms of Map/Reduce. Is there a way to distribute the loading process across the nodes themselves, so that each node loads a part of the file, say 60 GB each? I don't want to do this manually from each node (that defeats the purpose). If there is a way to do this using Java and Map/Reduce, I would love to read about it. I know Hadoop can process wildcard input files. Say each 60 GB chunk is named like this: file_1, file_2, file_3... I can then use file_* for my next MR jobs. The trouble I'm having is understanding how to load the file into Hadoop in a fast, multi-threaded way in the first place.
Thanks in advance!
Edit:
distcp seems to do parallel copying into HDFS, but only between clusters, not within a cluster. I wonder why they didn't think of that, and if they did, what the limitations or bottlenecks around it are.
Also, http://blog.syncsort.com/2012/06/moving-data-into-hadoop-faster/ seems to document benchmarks on this topic, but they're using DMExpress (a commercial tool) to do the loading. It would be great to have an open-source alternative.
With your configuration, I don't know whether parallelizing the writes will improve your performance, because you want to write a single file.
Suppose we have the default configuration. The default replication factor is 3, so your file is considered written when each block of the file has been written to 3 machines of your cluster (in your case, to all machines of your cluster).
If you have more than one disk per machine, splitting your file into smaller parts (as many parts as there are disks used by HDFS on one machine) can help improve write performance, but only if your application is the only one using the cluster and you are not limited by your network. In that case your bottleneck is your disks.
If you can manage to split the file on your clients, a simple way to make sure all parts of the file end up in HDFS is to create a directory named after your file plus a suffix indicating that the file is being copied. This directory holds all parts of your file. When all copying threads are finished, you rename the directory to drop the suffix; clients can access the parts of the file only once the suffix is removed. A rename is a metadata-only operation on the Namenode, so it is very fast compared to a file copy.
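A minimal sketch of that rename-on-completion pattern, assuming the 180 GB file has already been split into local part files and that this runs inside a method declared to throw Exception; all paths, the suffix and the thread count are placeholders:
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path tmpDir = new Path("/data/bigfile._COPYING_");   // temporary directory, suffix marks "copy in progress"
Path finalDir = new Path("/data/bigfile");           // final name, visible to clients only after the rename
fs.mkdirs(tmpDir);

List<String> localParts = Arrays.asList("/local/part_1", "/local/part_2", "/local/part_3");
ExecutorService pool = Executors.newFixedThreadPool(localParts.size());
for (String part : localParts) {
    pool.submit(() -> {
        // each thread streams one local part into the temporary HDFS directory
        fs.copyFromLocalFile(new Path(part), tmpDir);
        return null;
    });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);

// metadata-only operation on the Namenode: all parts become visible at once
fs.rename(tmpDir, finalDir);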
Other solutions:
Using a marker file is not the best option because you lose an HDFS block (the default block size is 128 MB).
Recreating the file from its parts amounts to rewriting the data, so it is inefficient.

Accessing files from other filesystems along with hdfs files in a hadoop mapreduce application

I know that we can call a MapReduce job from a normal Java application. Now the MapReduce jobs in my case have to deal with files on HDFS and also with files on other file systems. Is it possible in Hadoop to access files from another file system while simultaneously using files on HDFS?
So basically my intention is that I have one large file which I want to put in HDFS for parallel computing, and then compare blocks of this file with some other files (which I do not want to put in HDFS because they need to be accessed as a full-length file at once).
It should be possible to access a non-HDFS file system from mapper/reducer tasks just like from any other task. One thing to note is that if there are, say, 1K mappers and each of them tries to open the non-HDFS file, this might lead to a bottleneck depending on the type of the external file system. The same applies to mappers pulling data from a database.
You can use the distributed cache to distribute the files to your mappers; they can open and read the files in their configure() method (don't read them in map(), as it will be called many times).
Edit:
In order to access files from the local file system in your MapReduce job, you can add those files to the distributed cache when you set up your job configuration.
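// register the file in the distributed cache (old mapred API); the fragment after '#' names the link under which tasks can access it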
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
The MapReduce framework will make sure those files are accessible by your mappers.
public void configure(JobConf job) {
// Get the cached archives/files
Path[] localFiles = DistributedCache.getLocalCacheFiles(job);
// open, read and store for use in the map phase.
}
and remove the files when your job is done.
