Spark, ignore some input files - Java

I have my data on hdfs, the folder structure is something like,
hdfs://ns1/abc/20200101/00/00/
hdfs://ns1/abc/20200101/00/01/
hdfs://ns1/abc/20200101/00/02/
......
Basically, we create a folder every minute and put hundreds of files in it.
We have a Spark (2.3) application (written in Java) which processes data on a daily basis, so the input path we use is like hdfs://ns1/abc/20200101, simple and straightforward. But sometimes a few files are corrupt or zero-sized, and that causes the whole Spark job to fail.
So is there a simple way to just ignore any bad file? I have tried --conf spark.sql.files.ignoreCorruptFiles=true, but it doesn't help at all.
Or can we pass some 'file pattern' on the command line when submitting the Spark job, since those bad files usually have a different file extension?
Or, since I'm using JavaSparkContext#newAPIHadoopFile(path, ...) to read data from HDFS, is there any trick I can do with JavaSparkContext#newAPIHadoopFile(path, ...) so that it will ignore bad files?
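For reference, roughly what I had in mind (a sketch only, not verified on Spark 2.3): the Configuration passed to newAPIHadoopFile can carry a Hadoop PathFilter via the mapreduce.input.pathFilter.class key that FileInputFormat reads when listing input files, and the filter could reject the bad extension. The ".tmp" extension and the class name below are placeholders for whatever the bad files actually use.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter: reject files with the ".tmp" extension (placeholder for the bad files' extension).
// Rejecting the bad extension, rather than accepting only the good one, keeps directories visible
// to the input listing.
public class SkipBadFilesFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        return !path.getName().endsWith(".tmp");
    }
}
```

And on the driver side, assuming jsc is the existing JavaSparkContext and the job already lists the nested minute folders as it does today:

```java
Configuration hadoopConf = new Configuration();
// FileInputFormat consults this key when building the list of input files.
hadoopConf.setClass("mapreduce.input.pathFilter.class", SkipBadFilesFilter.class, PathFilter.class);

JavaPairRDD<LongWritable, Text> rdd = jsc.newAPIHadoopFile(
        "hdfs://ns1/abc/20200101",
        TextInputFormat.class, LongWritable.class, Text.class, hadoopConf);
```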
Thanks.

Related

Need help choosing a robust archive format

I'm working on a Java application that runs on Lubuntu on single-board computers and produces thousands of image files, which are then transferred over FTP. The transfer takes several times longer for multiple files than it does for a single file of the same total size, I'm assuming because the FTP client has to establish a new connection for every file. So I thought I'd have the application put the image files into a single archive file, but the problem with this is that sometimes the SBC won't shut down cleanly for various reasons, and then the entire archive may be corrupted and all the images lost. Archiving the files afterwards is not a great option, basically because it takes a long time. An intermediate solution may be to create multiple mid-size archives, but I'm not happy with it.
I wrote a simple unit test to experiment with ZipOutputStream, and if I cancel the test before it closes the stream, the resulting zip file gets corrupted, unsurprisingly. Could anyone suggest a different, widely recognized archive format and/or implementation that might be more robust?
The tar format with the jtar implementation seems to work pretty well. If I cancel in the middle of writing, I can still open the archive, at least with 7-Zip, and even recover the partially written last entry.
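For reference, a minimal sketch of the incremental approach, shown here with Apache Commons Compress's TarArchiveOutputStream since that is the tar API I'm sure of (jtar works along the same lines); the "images" directory and file names are placeholders. Because each tar entry is written as a self-contained header plus data, entries already written stay readable even if the process dies mid-archive:

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

public class IncrementalTarWriter {
    public static void main(String[] args) throws IOException {
        // Placeholder source directory; assumed to exist and contain only image files.
        File[] images = new File("images").listFiles();
        try (TarArchiveOutputStream tar = new TarArchiveOutputStream(
                new BufferedOutputStream(new FileOutputStream("images.tar")))) {
            for (File img : images) {
                TarArchiveEntry entry = new TarArchiveEntry(img, img.getName());
                tar.putArchiveEntry(entry);
                Files.copy(img.toPath(), tar);   // stream the file's bytes into the entry
                tar.closeArchiveEntry();
                // Flushing after each entry means a crash loses at most the entry in progress.
                tar.flush();
            }
        }
    }
}
```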

Large file transfer over the network

I have a requirement wherein large zipped files (GBs in size) arrive in a directory on a Unix server (let's say server1), and I have to write an application which will poll that directory and copy the files to another Unix server (let's say server2) as they come. I have a way to know when a single file has been completely copied into the directory (a corresponding metadata file only appears once the copy of that file is complete). Since there are hundreds of files, we don't want to wait for all the files to be copied. Once files are copied to server2, I have to do unzipping and some validations before landing those files in my final repository.
Questions
What would be the appropriate technology to use for this scenario, shell scripting or Java or something else, in terms of speed?
Since we will be doing the transfer operation file by file, how do we achieve parallelism (other than multithreading if we use Java)? A rough sketch of what I mean follows below.
Is there any existing library/package/tool available which fits this scenario?
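To make the parallelism question concrete, here is a rough sketch of the kind of thing I'm picturing in Java: poll the directory, and once a file's metadata marker appears, hand that file to a thread pool so transfers overlap. The paths, the ".meta" naming convention, and copyToServer2 are all hypothetical placeholders; the actual copy would be scp/sftp/rsync or an SFTP library.

```java
import java.nio.file.*;
import java.util.Set;
import java.util.concurrent.*;

public class TransferPoller {
    // Hypothetical hook: plug in scp/sftp/rsync or an SFTP client library here.
    static void copyToServer2(Path zip) { /* ... */ }

    public static void main(String[] args) throws Exception {
        Path inbox = Paths.get("/data/inbox");                 // placeholder source directory on server1
        ExecutorService pool = Executors.newFixedThreadPool(4); // number of parallel transfers
        Set<Path> done = ConcurrentHashMap.newKeySet();         // files already handed off

        while (true) {
            try (DirectoryStream<Path> zips = Files.newDirectoryStream(inbox, "*.zip")) {
                for (Path zip : zips) {
                    // Placeholder convention: "file.zip.meta" appears once file.zip is fully written.
                    Path marker = inbox.resolve(zip.getFileName() + ".meta");
                    if (Files.exists(marker) && done.add(zip)) {
                        pool.submit(() -> copyToServer2(zip)); // each file transfers on its own thread
                    }
                }
            }
            TimeUnit.SECONDS.sleep(30);                         // poll interval
        }
    }
}
```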

Writing mapreduce output directly onto a webpage

I have a MapReduce job which writes its output to a file in HDFS. But instead of writing it to HDFS, I want the output to be written directly to a webpage. I have created a web project in Eclipse and written the driver, mapper, and reducer classes in it, but when I run it with the Tomcat server it doesn't work.
So how can the output be displayed on a webpage?
If you are using the MapR distribution, you can write the output of your MapReduce job to the regular file system (not HDFS), but fixing your issue will require more info.
HDFS (by itself) is not really designed for low-latency random reads/writes. A few options you do have, however, are WebHDFS / HttpFS, which expose a REST API to HDFS: http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.6.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html and http://hadoop.apache.org/docs/r2.4.1/hadoop-hdfs-httpfs/. You could have the webserver pull whatever file you want and serve it on the webpage. I don't think this is a very good solution, however.
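For illustration only, reading a job's output file over WebHDFS from plain Java might look roughly like this; the host, port, path, and user are placeholders, and op=OPEN is the WebHDFS read operation (the client gets redirected to a datanode that streams the bytes):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsRead {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode host, output path, and user name.
        URL url = new URL("http://namenode:50070/webhdfs/v1/output/part-r-00000?op=OPEN&user.name=hadoop");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // follows the datanode redirect
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // a servlet would write this to its response instead
            }
        }
    }
}
```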
A better solution might be to have MapReduce output to HBase (http://hbase.apache.org/) and have your webserver pull from HBase. It is far better suited for low-latency random reads/writes.
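As a sketch of that direction (not a drop-in solution), the reducer could emit HBase Puts instead of writing files. The "results" table, "d" column family, and key/value types below are assumptions, and the API shown is the HBase 1.x-style client:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Writes one row per key into the hypothetical table "results", column family "d".
public class ResultsReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes(sum));
        context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
}

// In the driver, instead of a FileOutputFormat path:
// TableMapReduceUtil.initTableReducerJob("results", ResultsReducer.class, job);
```

The webserver would then do point reads against the same table, which is the low-latency access pattern HBase is built for.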

mapreduce in java - gzip input files

I'm using Java, and I'm trying to write a MapReduce job that will receive as input a folder containing multiple .gz files.
I've been looking all over, but all the tutorials I've found explain how to process a simple text file; I haven't found anything that solves my problem.
I've asked around at my workplace, but only got references to Scala, which I'm not familiar with.
Any help would be appreciated.
Hadoop checks the file extension to detect compressed files. The compression formats supported by Hadoop include gzip, bzip2, and LZO. You do not need to take any additional action to decompress files using these formats; Hadoop handles it for you.
So all you have to do is write the logic as you would for a plain text file and pass in the directory which contains the .gz files as input.
But the issue with gzip files is that they are not splittable: imagine you have gzip files of 5 GB each; then each mapper will process the whole 5 GB file instead of working with the default block size.
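As a minimal illustration of "write it as you would for text", here is a bare-bones driver sketch; the paths are placeholders, and your own mapper/reducer classes would be set on the job exactly as for uncompressed text (with none set, Hadoop's identity mapper/reducer simply echo the decompressed lines):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzipInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "gzip input example");
        job.setJarByClass(GzipInputDriver.class);
        // job.setMapperClass(...); job.setReducerClass(...);  // same classes you would use for plain text
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        // Point at the folder of .gz files; TextInputFormat (the default) picks the gzip codec
        // from the file extension and feeds decompressed lines to the mapper.
        FileInputFormat.addInputPath(job, new Path("/data/input-gz"));   // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));   // placeholder path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```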

Best way for Hadoop to handle lots of image files

I've successfully managed to handle multiple image files in 2 ways in Hadoop:
Using Java to stitch the images together into a sequence file. This requires a text file that points to the locations of all the files.
Using Python and Hadoop Streaming to cache the files to each node as a tar.gz archive via -cacheArchive.
Both methods seem a bit ropey to me. Let's say I have one million files: I do not want to have to create the text file or zip up so many files. Is there a way I can just point my mapper at an HDFS folder and have it read through that folder at runtime? I know input can be used, but that is for text files. Or am I missing something? Any pointers are most appreciated.
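On the "point it at a folder" idea, one way to avoid the pointer text file is to list the HDFS directory at runtime and pack the images straight into a SequenceFile (filename as key, raw bytes as value). A rough sketch with placeholder paths follows; it assumes a flat folder of images that each fit in memory:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path imagesDir = new Path("/data/images");      // placeholder HDFS folder of images
        Path seqFile = new Path("/data/images.seq");    // placeholder output sequence file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(seqFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(imagesDir)) {   // list the folder at runtime
                byte[] bytes = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    in.readFully(bytes);
                }
                writer.append(new Text(status.getPath().getName()), new BytesWritable(bytes));
            }
        }
    }
}
```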
