Hadoop process WARC files - java

I have a general question about Hadoop file splitting and multiple mappers. I am new to Hadoop and am trying to get a handle on how to set up for optimal performance. My project is currently processing WARC files which are gzipped.
Using the current InputFileFormat, the file is sent to one mapper and is not split. I understand this is the correct behavior for a compressed file. Would there be a performance benefit to decompressing the file as an intermediate step before running the job, to allow the job to be split and thus use more mappers?
Would that be possible? Does having more mappers create more overhead in latency or is it better to have one mapper? Thanks for your help.

Although WARC files are gzipped, they are splittable (cf. Best splittable compression for Hadoop input = bz2?) because every record has its own deflate block. But the record offsets must be known in advance.
But is this really necessary? The Common Crawl WARC files are each about 1 GB in size, and one should normally be processed within at most 15 minutes. Given the overhead of launching a map task, that's a reasonable running time for a mapper. Alternatively, a mapper could also process a few WARC files, but it's important that you have enough splits of the input WARC file list so that all nodes are running tasks. Processing a single WARC file on Hadoop would mean a lot of unnecessary overhead.
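If you go the file-list route, a minimal driver sketch of that idea (not from the original answer; the stub mapper, paths, and lines-per-split value are made up) could use NLineInputFormat so each map task receives a few WARC paths and reads those files itself:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WarcListDriver {

    // Stub mapper: receives one WARC path per record; a real implementation
    // would open the path with a WARC reader and emit per-record output.
    public static class WarcPathMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text warcPath, Context context)
                throws IOException, InterruptedException {
            context.write(warcPath, new LongWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "warc-batch");
        job.setJarByClass(WarcListDriver.class);

        // Input: a plain text file listing WARC paths, one per line (e.g. warc.paths)
        job.setInputFormatClass(NLineInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        NLineInputFormat.setNumLinesPerSplit(job, 3);   // ~3 WARC files per mapper; tune to keep all nodes busy

        job.setMapperClass(WarcPathMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}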

Related

Apache Beam / Google Dataflow Final step to run only once

I have a pipeline where I download thousands of files, then transform them and store them as CSV on google cloud storage, before running a load job on bigquery.
This works fine, but as I run thousands of load jobs (one per downloaded file), I reached the quota for imports.
I've changed my code so it lists all the files in a bucket and runs one job with all the files as parameters of the job.
So basically I need the final step to be run only once, when all the data has been processed. I guess I could use a groupBy transform to make sure all the data has been processed, but I'm wondering whether there is a better / more standard approach to it.
If I understood your question correctly, we might have had a similar problem in one of our dataflows - we were hitting the 'Load jobs per table per day' BigQuery limit due to the fact that the dataflow execution was triggered for each file in GCS separately and we had 1000+ files in the bucket.
In the end, the solution to our problem was quite simple - we modified our TextIO.read transform to use wildcards instead of individual file names,
i.e. TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**")
In this way only one dataflow job was executed and, as a consequence, all the data written to BigQuery was considered as a single load job, despite the fact that there were multiple sources.
Not sure if you can apply the same approach, though.
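For reference, a minimal sketch of that wildcard approach in the Beam Java SDK (the bucket, table, column names, and the CSV-to-TableRow conversion are made up for illustration): because the whole folder is read by one pipeline, the batch BigQuery write issues a single load job rather than one per file.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class SingleLoadJobPipeline {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadAllCsv", TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**"))
         .apply("CsvToRow", ParDo.of(new DoFn<String, TableRow>() {
             @ProcessElement
             public void processElement(ProcessContext c) {
                 String[] cols = c.element().split(",");
                 // Hypothetical two-column schema, purely for illustration.
                 c.output(new TableRow().set("id", cols[0]).set("value", cols[1]));
             }
         }))
         .apply("WriteBQ", BigQueryIO.writeTableRows()
             .to("my-project:my_dataset.my_table")
             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
             .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

        p.run();
    }
}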

How do I effectively process a large gzipped file in dataflow?

We have some batch jobs that process gzipped files that are ~10 GB zipped and ~30 GB unzipped.
Trying to process this in Java takes an unreasonable amount of time, and we are looking for how to do it more effectively. If we use TextIO or the native Java SDK for GCS to download the file, it takes more than 8 hours to process, and the reason is that it can't scale out for some reason. Most likely it won't split the file since it is gzipped.
If I unzip the file and process the unzipped file, the job takes roughly 10 minutes, so on the order of 100 times as fast.
I can totally understand that it might take some extra time to process a gzipped file, but 100 times as long is too much.
You're correct that gzipped files are not splittable, so Dataflow has no way to parallelize reading each gzipped input file. Storing uncompressed in GCS is the best route if it's possible for you.
Regarding the 100x performance difference: how many worker VMs did your pipeline scale up to in the uncompressed vs compressed versions of your pipeline? If you have a job id we can look it up internally to investigate further.

Apache Spark on HDFS: read 10k-100k of small files at once

I could have up to 100 thousand small files (each 10-50 KB). They are all stored in HDFS with a block size of 128 MB. I have to read them all at once with Apache Spark, as below:
// return a list of paths to small files
List<String> paths = getAllPaths();
// read up to 100000 small files at once into memory
sparkSession
    .read()
    .parquet(paths.toArray(new String[0]))   // DataFrameReader.parquet takes varargs
    .as(Encoders.kryo(SmallFileWrapper.class))
    .coalesce(numPartitions);
Problem
The number of small files is not a problem from the perspective of memory consumption. The problem is the speed of reading that number of files. It takes 38 seconds to read 490 small files, and 266 seconds to read 3,420 files. I suppose it would take a long time to read 100,000 files.
Questions
Will HAR or sequence files speed up an Apache Spark batch read of 10k-100k small files? Why?
Will HAR or sequence files slow down persisting those small files? Why?
P.S.
Batch read is the only operation required for those small files; I don't need to read them by id or anything else.
From that post: How does the number of partitions affect `wholeTextFiles` and `textFiles`?
wholeTextFiles uses WholeTextFileInputFormat ... Because it extends CombineFileInputFormat, it will try to combine groups of smaller files into one partition ... Each record in the RDD ... has the entire contents of the file
Confirmation in the Spark 1.6.3 Java API documentation for SparkContext
http://spark.apache.org/docs/1.6.3/api/java/index.html
RDD<scala.Tuple2<String,String>> wholeTextFiles(String path, int minPartitions) Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
Confirmation in the source code (branch 1.6) comments for class WholeTextFileInputFormat
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala
A org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat for reading whole text files. Each file is read as key-value pair, where the key is the file path and the value is the entire content of file.
For the record, Hadoop CombineInputFormat is the standard way to stuff multiple small files in a single Mapper; it can be used in Hive with properties hive.hadoop.supports.splittable.combineinputformat and hive.input.format.
Spark wholeTextFiles() reuses that Hadoop feature, with two drawbacks:
(a) you have to consume a whole directory, can't filter out files by name before loading them (you can only filter after loading)
(b) you have to post-process the RDD by splitting each file into multiple records, if required
That seems to be a viable solution nonetheless, cf. that post: Spark partitioning/cluster enforcing
Or, you can build your own custom file reader based on that same Hadoop CombineInputFormat, cf. that post: Apache Spark on YARN: Large number of input data files (combine multiple input files in spark)
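For illustration, a minimal wholeTextFiles() sketch in the Spark Java API (the HDFS path and the partition hint are made up): each record is a (path, content) pair, and Spark combines the small files into far fewer partitions under the hood.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SmallFilesBatchRead {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SmallFilesBatchRead");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Second argument is only a minimum-partitions hint; Spark combines
        // the small files so you do not get one partition per file.
        JavaPairRDD<String, String> files =
            sc.wholeTextFiles("hdfs:///data/small-files", 64);

        System.out.println("files read: " + files.count());
        sc.stop();
    }
}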

Maintain list of processed files to prevent duplicate file processing

I am looking for guidance in the design approach for resolving one of the problems that we have in our application.
We have scheduled jobs in our Java application and we use Quartz scheduler for it. Our application can have thousands of jobs that do the following:
Scan a folder location for any new files.
If there is a new file, then kick off the associated workflow to process it.
The requirement is to:
Process only new files.
If any duplicate file arrives (file with the same name), then don't process it.
As of now, we persist the list of processed files in the quartz job metadata. But this solution is not scalable because, over the years (and depending on the number of files received per day, which could range up to 100K per day), the job metadata (which persists the list of files processed) grows very large, and it started giving us problems with data truncation errors (while persisting job metadata in the quartz table) and slowness.
What is the best approach for implementing this requirement and ensuring that we don't process duplicate files that arrive with the same name? Should we consider persisting the processed file list in an external database instead of the job metadata? If we use a single external database table to persist the list of processed files for all those thousands of jobs, then the table size may grow huge over the years, which doesn't look like the best approach (however, proper indexing may help in this case).
Any guidance here would be appreciated. It looks like a common use case to me for applications that continuously process new files - therefore I'm looking for the best possible approach to address this concern.
If not processing duplicate files is critical for you, the best way to do it would be by storing the file names in a database. Keep in mind that this could be slow, since you would either be querying for each file name or running one large query for all the new file names.
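A minimal sketch of that database approach (the table, column, and class names here are hypothetical, assuming a table like CREATE TABLE processed_file (file_name VARCHAR(512) PRIMARY KEY)): the unique key makes the claim atomic, so a duplicate-key error on INSERT means the file was already seen and can be skipped.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

public class ProcessedFileRegistry {

    private final Connection connection;

    public ProcessedFileRegistry(Connection connection) {
        this.connection = connection;
    }

    /** Returns true if this file name was newly claimed and should be processed. */
    public boolean tryClaim(String fileName) throws Exception {
        try (PreparedStatement ps =
                 connection.prepareStatement("INSERT INTO processed_file (file_name) VALUES (?)")) {
            ps.setString(1, fileName);
            ps.executeUpdate();
            return true;                     // inserted: first time we see this name
        } catch (SQLIntegrityConstraintViolationException duplicate) {
            return false;                    // already present: duplicate file, skip it
        }
    }
}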
That said, if you're willing to process new files which may be a duplicate, there are a number of things that can be done as an alternative:
Move processed files to another folder, so that your input folder only ever contains unprocessed files
Add a custom attribute to your processed files, and process files that do not have that attribute. Be aware that this method is not supported by all file systems. See this answer for more information.
Keep a reference to the time when your last quartz job started, and process new files which were created after that time.

loading data into hdfs in parallel

I have a Hadoop cluster consisting of 3 nodes. I want to load a 180 GB file into HDFS as fast as possible. I know neither -put nor -copyFromLocal is going to help me here, as they are single-threaded.
I'm thinking in terms of Map/Reduce. Is there a way to distribute the loading process to the nodes themselves, so that each node loads a part of the file, say 60 GB each? I don't want to do this manually from each node (that defeats the purpose). If there is a way to do this using Java and Map/Reduce, I would love to read about it. I know Hadoop can process wildcard input files. Say each 60 GB chunk is named like this: file_1, file_2, file_3... I can then use file_* for my next MR jobs. The trouble I'm having is understanding how to efficiently load the file into Hadoop first in a fast / multi-threaded way.
Thanks in advance!
Edit:
distcp - seems to do parallel copying into HDFS, but only between clusters, not within a cluster. I wonder why they didn't think of that, and if they did, what the limitations or bottlenecks around this are.
Also, http://blog.syncsort.com/2012/06/moving-data-into-hadoop-faster/ seems to document benchmarks around this topic, but they're using DMExpress (a commercial tool) to do the loading. It would be great to have an open-source alternative.
With your configuration, I don't know whether parallelizing the writes will improve your performance, because you want to write a single file.
Suppose we have the default configuration. The default replication factor is 3, so your file is considered written once each block of your file has been written to 3 machines of your cluster (in your case, to all machines of your cluster).
If you have more than one disk per machine, splitting your file into smaller parts (as many parts as there are disks used by HDFS on each machine) can help to improve write performance, but only if your application is the only one using the cluster and you are not limited by your network. In this case your bottleneck is your disks.
If you can manage the divided file on your clients, a simple way to be sure all parts of your file have been copied to HDFS is to create a directory whose name is the name of your file concatenated with a suffix indicating that the file is being copied. This directory contains all the parts of your file. When all copying threads are finished, you can rename the directory to drop the suffix. Your clients can access the parts of the file only once the suffix is removed. A rename is a metadata operation on the NameNode, so it is much faster than a file copy.
Other solutions:
Using a marker file is not the best option because you lose an HDFS block (the default block size is 128 MB).
Recreating the file from its parts is similar to rewriting the data, so it is inefficient.
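For illustration, a minimal sketch of the rename-as-commit step described above, using the Hadoop FileSystem API (the paths and the suffix are made up): the rename is a NameNode metadata operation, so it acts as a cheap "commit" once all part files have been copied into the staging directory.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitUpload {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path staging  = new Path("/data/bigfile._COPYING_");  // parts written here by the copy threads
        Path finalDir = new Path("/data/bigfile");            // visible to readers only after rename

        // ... wait here until all copy threads have finished writing their part files ...

        if (!fs.rename(staging, finalDir)) {
            throw new IllegalStateException("rename failed: " + staging + " -> " + finalDir);
        }
        fs.close();
    }
}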
