I'm a new Spark user trying to process a large set of XML files stored in HDFS. There are about 150k files, totalling about 28 GB, on a "development" cluster of one machine (actually a VM).
The files are organised into a directory structure in HDFS such that there are about a hundred subdirectories under a single parent directory. Each "child" directory contains anything between a couple of hundred and a couple of thousand XML files.
My task is to parse each XML file, extract a few values using XPath expressions, and save the result to HBase. I'm trying to do this with Apache Spark, and I'm not having much luck. My problem appears to be a combination of the Spark API, and the way RDDs work. At this point it might be prudent to share some pseudocode to express what I'm trying to do:
RDD[String] filePaths = getAllFilePaths()
RDD[Map<String,String>] parsedFiles = filePaths.map((filePath) => {
// Load the file denoted by filePath
// Parse the file and apply XPath expressions
})
// After calling map() above, I should have an RDD[Map<String,String>] where
// the map is keyed by a "label" for an xpath expression, and the
// corresponding value is the result of the expression applied to the file
So, setting aside the part where I write to HBase for a moment, let's focus on the above. I cannot load a file from within the RDD map() call.
I have tried this a number of different ways, and all have failed:
Using a call to SparkContext.textFile("/my/path") to load the file fails because SparkContext is not serializable.
Using a call to FileSystem.open(path) from the Hadoop API, where the FileSystem is instantiated outside the RDD, fails because FileSystem is not serializable.
Using a call to FileSystem.open(path) from the Hadoop API, where the FileSystem is instantiated inside the RDD, fails because the program runs out of file handles.
An alternative approach was to use SparkContext.wholeTextFiles("/my/path/*") so that I don't have to load the file from within the map() call; this fails because the program runs out of memory, presumably because it loads the files eagerly.
Has anyone attempted anything similar in their own work, and if so, what approach did you use?
Try using a wildcard to read the whole directory.
val lines = sc.textFile("hdfs://some-directory/*")
Actually, Spark can read a whole HDFS directory; quoting from the Spark documentation:
All of Spark’s file-based input methods, including textFile, support
running on directories, compressed files, and wildcards as well. For
example, you can use textFile("/my/directory"),
textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
Related
I am trying to save the contents of a dataset to CSV using
df.coalesce(1)
.write()
.format("csv")
.mode("append")
.save(PATH+"/trial.csv");
My aim is to keep appending the results of the dataset to the trial.csv file. However, it creates a folder called trial.csv and creates a CSV file inside of that. When I run it again, it creates another CSV file inside the trial.csv folder. But I just want it to keep appending to one CSV file, which I am unable to do.
I know we could do it with a script outside of the code (program), but can we achieve it from inside our code? I am using Java.
Appending to an existing file is hard for a distributed, multi-threaded application; it would turn a parallelised operation into a sequential task. The way this is usually handled is that each task in Spark persists a single file in the specified path, and that path will be a folder containing all the files. To read them, you read the complete folder.
If your data is not big and you really need a single file, try the repartition method with 1; this makes a single task write the new data, but it will never append the data to previous files.
You have to be careful, but you can do something like this:
df.union(spark.read().csv(PATH + "/trial.csv"))
  .coalesce(1)
  .write()
  .format("csv")
  .mode("append")
  .save(PATH + "/trial_auxiliar.csv");
Then move it to the previous folder, either with Spark or with a Hadoop move command. Never write to and read from the same folder in the same job, and keep in mind that this won't guarantee the data order.
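For the move step, a minimal sketch using the Hadoop FileSystem API could look like the following; the folder names follow the example above, and error handling is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplaceTrialFolder {
    public static void main(String[] args) throws Exception {
        String PATH = args[0]; // base path used in the snippets above

        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path(PATH + "/trial.csv");
        Path auxiliar = new Path(PATH + "/trial_auxiliar.csv");

        // Remove the old output folder, then move the auxiliary one into its place.
        fs.delete(target, true);      // recursive delete
        fs.rename(auxiliar, target);  // move within the same file system
    }
}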
In my project, I am trying to add some metadata to the data processed in my pipeline. The metadata is located in a DBF file in a subfolder called resources, next to the src folder.
The src folder contains the main class, and I have several packages (IO, processing, aggregation, utils).
I read and process the file with metadata in my main class where the pipeline is defined. The code I am using to access the file is as follows:
File temp1 = new File("resources/xxx.dbf");
I check if the file was found using:
LOG.info(temp1.exists())
which runs fine.
There are messages coming in as Strings which I read using PubSubIO. I use the contents of this file to fill a Map containing keys and values.
Map<String, ArrayList<Double>> sensorToCoordinates = coordinateData.getSensorLocations();
I then set a static variable in a custom class I made called 'SensorValue':
SensorValue.setKeyToCoordinates(sensorToCoordinates);
When parsing the incoming messages from Strings to the SensorValue class using a ParDo function (going from PCollection<String> to PCollection<SensorValue>), the map is used in the constructor of the SensorValue class.
Running this code using a DirectPipelineRunner works perfectly. However, when I use a DataflowPipelineRunner and try to access the map in the SensorValue constructor, I run into a NullPointerException.
Now I am wondering why the setter is not working when using a DataflowPipelineRunner (I'm guessing it has something to do with the execution being distributed among several workers) and what the best practice is for using static resource files to enrich a pipeline.
You're right that the problem is that the execution of the ParDo is distributed across multiple workers. They don't have the local file, and they may not have the contents of the map.
There are a few options here:
Put the file in GCS, and have the pipeline read the contents of the file (using TextIO or something like that) and use it as a side-input to your later processing.
Include the file in the resources for the pipeline and load that in the startBundle of the DoFn that needs it (in the future there will be ways to make this happen less often than every bundle).
You could serialize the contents of the map into the arguments of the DoFn, by putting it in as a non-static field passed to the constructor of that class.
Option 1 is better as the size of this file increases (since it can support splitting it into pieces and doing lookups), while Option 2 likely involves less network traffic to retrieve the file. Option 3 will only work if the file is extremely small, since it will significantly increase the size of the serialized DoFn, which may lead to the job being too large to submit to the Dataflow service.
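As a rough illustration of Option 3, in the Dataflow 1.x SDK style used in the question (the DoFn name and the SensorValue constructor shown here are assumptions; SensorValue is the asker's class and is assumed to be on the classpath):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

import com.google.cloud.dataflow.sdk.transforms.DoFn;

public class ParseSensorValueFn extends DoFn<String, SensorValue> {

    // A regular (non-static) serializable field, so it is shipped to the
    // workers along with the DoFn instance.
    private final Map<String, ArrayList<Double>> sensorToCoordinates;

    public ParseSensorValueFn(Map<String, ArrayList<Double>> sensorToCoordinates) {
        // Copy into a concrete serializable type to be safe.
        this.sensorToCoordinates = new HashMap<>(sensorToCoordinates);
    }

    @Override
    public void processElement(ProcessContext c) {
        // Hypothetical SensorValue constructor taking the raw message and the
        // map, instead of relying on a static field set on the driver.
        c.output(new SensorValue(c.element(), sensorToCoordinates));
    }
}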
I've written a MapReduce application that checks whether a very large set of test points (~3000 sets of x,y,z coordinates) fall within a set of polygons. The input files are formatted as follows:
{Polygon_1 Coords} {TestPointSet_1 Coords}
{Polygon_2 Coords} {TestPointSet_1 Coords}
...
{Polygon_1 Coords} {TestPointSet_2 Coords}
{Polygon_2 Coords} {TestPointSet_2 Coords}
...
There is only 1 input file per MR job, and each file ends up being about 500 MB in size. My code works great and the jobs run within seconds. However, there is a major bottleneck - the time it takes to transfer hundreds of these input files to my Hadoop cluster. I could cut down on the file size significantly if I could figure out a way to read in an auxiliary data file that contains one copy of each TestPointSet and then designate which set to use in my input files.
Is there a way to read in this extra data file and store it globally so that it can be accessed across multiple mapper calls?
This is my first time writing code in MR or Java, so I'm probably unaware of a very simple solution. Thanks in advance!
This can be achieved using Hadoop's DistributedCache feature. DistributedCache is a facility provided by the MapReduce framework to cache files (text, archives, jars, etc.) needed by applications. Search for it and you will find plenty of code examples.
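As a hedged sketch of the idea (the paths, class names, and parsing step are hypothetical): the driver registers the auxiliary file, and each mapper reads its local copy once in setup().

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class PolygonDriver {

    public static class PolygonMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void setup(Context context) throws IOException {
            // Each task node gets a local copy of the cached file.
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (cached != null && cached.length > 0) {
                try (BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // parse one TestPointSet per line and keep it in a field
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            // use the cached TestPointSets together with the polygon in 'value'
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "point-in-polygon");
        // Register the auxiliary file; it is copied to every task node before the tasks run.
        DistributedCache.addCacheFile(new URI("hdfs:///data/testpointsets.txt"), job.getConfiguration());
        // ... set mapper class, input/output paths, and submit as usual
    }
}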
Consider an application that wants to use Hadoop in order to process large amounts of proprietary binary-encoded text data in approximately the following simplified MapReduce sequence:
1. Gets a URL to a file or a directory as input
2. Reads the list of the binary files found under the input URL
3. Extracts the text data from each of those files
4. Saves the text data into new, extracted plain text files
5. Classifies extracted files into (sub)formats of special characteristics (say, "context")
6. Splits each of the extracted text files according to its context, if necessary
7. Processes each of the splits using the context of the original (unsplit) file
8. Submits the processing results to a proprietary data repository
The format-specific characteristics (context) identified in Step 5 are also saved in a (small) text file as key-value pairs, so that they are accessible by Step 6 and Step 7.
Splitting in Step 6 takes place using custom InputFormat classes (one per custom file format).
In order to implement this scenario in Hadoop, one could integrate Step 1 - Step 5 in a Mapper and use another Mapper for Step 7.
A problem with this approach is how to make a custom InputFormat know which extracted files to use in order to produce the splits. For example, format A may represent 2 extracted files with slightly different characteristics (e.g., different line delimiter), hence 2 different contexts, saved in 2 different files.
Based on the above, the getSplits(JobConf) method of each custom InputFormat needs to have access to the context of each file before splitting it. However, there can be (at most) 1 InputFormat class per format, so how would one correlate the appropriate set of extracted files with the correct context file?
A solution could be to use some specific naming convention for associating extracted files and contexts (or vice versa), but is there any better way?
This sounds more like a Storm (stream processing) problem, with a spout that loads the list of binary files from a URL and subsequent bolts in the topology performing each of the remaining steps.
I have an application which needs to read a file that is the serialized result of an ArrayList (ArrayList<String>, 50,000 records in the list, size: 20 MB).
I don't know exactly how to read the data into the Hadoop platform. I only have a vague sense that I need to override InputFormat and OutputFormat.
I'm a beginner on the Hadoop platform. Could you give me some advice?
Thanks,
Zheng.
To start with, you'll need to extend FileInputFormat, notably implementing the abstract FileInputFormat.createRecordReader method.
You can look through the source of something like the LineRecordReader (which is what TextInputFormat uses to process text files).
From there you're pretty much on your own (i.e. it depends on how your ArrayList has been serialized). Look through the source for the LineRecordReader and try and relate that to how your ArrayList has been serialized.
Some other points of note: is your file format splittable? That is, can you seek to an offset in the file and recover the stream from there? (Text files can, as they just scan forward to the end of the current line and then start from there.) If your file format uses compression, you also need to take this into account (you cannot, for example, randomly seek to a position in a gzip file). By default, FileInputFormat.isSplitable will return true, which you may want to initially override to be false. If you do stick with 'unsplittable', then note that your file will be processed by a single mapper (no matter its size).
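As a rough sketch of the shape this could take, assuming (and this is an assumption) that the file is a plain ObjectOutputStream dump of an ArrayList<String>; the RecordReader would need to match however the list was actually serialized:

import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SerializedArrayListInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // the whole file must be deserialized in one go
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new ArrayListRecordReader();
    }

    // Emits one record per list element: key = index, value = the String.
    public static class ArrayListRecordReader extends RecordReader<LongWritable, Text> {

        private List<String> items;
        private int pos = -1;
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        @SuppressWarnings("unchecked")
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            FileSplit fileSplit = (FileSplit) split;
            FileSystem fs = fileSplit.getPath().getFileSystem(context.getConfiguration());
            try (ObjectInputStream in = new ObjectInputStream(fs.open(fileSplit.getPath()))) {
                items = (List<String>) in.readObject();
            } catch (ClassNotFoundException e) {
                throw new IOException(e);
            }
        }

        @Override
        public boolean nextKeyValue() {
            pos++;
            if (items == null || pos >= items.size()) {
                return false;
            }
            key.set(pos);
            value.set(items.get(pos));
            return true;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() { return items == null ? 0f : (float) pos / items.size(); }
        @Override public void close() { }
    }
}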
Before processing data on Hadoop, you should upload the data to HDFS or another supported file system, of course, if it wasn't uploaded there by something else. If you are controlling the uploading process, you can convert the data at the upload stage into something you can easily process, like:
a simple text file (one line per array item)
a SequenceFile, if the array items can contain '\n'
This is the simplest solution since you don't have to interfere with Hadoop's internals.
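For example, a small one-off conversion could look like the following sketch (the file names are hypothetical, and it assumes no list item contains a newline):

import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.ObjectInputStream;
import java.util.List;

public class ConvertArrayListToText {
    public static void main(String[] args) throws Exception {
        // Deserialize the ArrayList<String> locally and rewrite it as plain
        // text, one item per line, which TextInputFormat can read directly.
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("arraylist.ser"));
             BufferedWriter out = new BufferedWriter(new FileWriter("arraylist.txt"))) {
            @SuppressWarnings("unchecked")
            List<String> items = (List<String>) in.readObject();
            for (String item : items) {
                out.write(item);
                out.newLine();
            }
        }
        // The resulting arraylist.txt can then be uploaded to HDFS, e.g. with
        // "hdfs dfs -put arraylist.txt /some/path/".
    }
}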