Complex MapReduce configuration scenario - java

Consider an application that wants to use Hadoop in order to process large amounts of proprietary binary-encoded text data in approximately the following simplified MapReduce sequence:
1. Gets a URL to a file or a directory as input
2. Reads the list of the binary files found under the input URL
3. Extracts the text data from each of those files
4. Saves the text data into new, extracted plain text files
5. Classifies extracted files into (sub)formats of special characteristics (say, "context")
6. Splits each of the extracted text files according to its context, if necessary
7. Processes each of the splits using the context of the original (unsplit) file
8. Submits the processing results to a proprietary data repository
The format-specific characteristics (context) identified in Step 5 are also saved in a (small) text file as key-value pairs, so that they are accessible by Step 6 and Step 7.
Splitting in Step 6 takes place using custom InputFormat classes (one per custom file format).
In order to implement this scenario in Hadoop, one could integrate Step 1 - Step 5 in a Mapper and use another Mapper for Step 7.
A problem with this approach is how to make a custom InputFormat know which extracted files to use in order to produce the splits. For example, format A may represent 2 extracted files with slightly different characteristics (e.g., different line delimiter), hence 2 different contexts, saved in 2 different files.
Based on the above, the getSplits(JobConf) method of each custom InputFormat needs to have access to the context of each file before splitting it. However, there can be (at most) 1 InputFormat class per format, so how would one correlate the appropriate set of extracted files with the correct context file?
A solution could be to use some specific naming convention for associating extracted files and contexts (or vice versa), but is there any better way?

This sounds more like a Storm (stream processing) problem, with a spout that loads the list of binary files from a URL and subsequent bolts in the topology performing each of the remaining steps.
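For illustration, a minimal topology sketch along those lines, using the org.apache.storm API of recent Storm releases; the spout and bolt classes (FileListSpout, ExtractTextBolt, ClassifyContextBolt, SplitBolt, ProcessBolt, RepositoryBolt) are hypothetical placeholders, one per step of the sequence above:
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class BinaryTextTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout emits one tuple per binary file found under the input URL.
        builder.setSpout("file-list", new FileListSpout(args[0]));
        // Each bolt handles one stage of the pipeline.
        builder.setBolt("extract", new ExtractTextBolt()).shuffleGrouping("file-list");
        builder.setBolt("classify", new ClassifyContextBolt()).shuffleGrouping("extract");
        builder.setBolt("split", new SplitBolt()).shuffleGrouping("classify");
        builder.setBolt("process", new ProcessBolt()).shuffleGrouping("split");
        builder.setBolt("store", new RepositoryBolt()).shuffleGrouping("process");
        StormSubmitter.submitTopology("binary-text", new Config(), builder.createTopology());
    }
}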

Related

get csv data using java and validate it against expected results

I have the following data in a CSV file:
video1duration,video2duration,video3duration
00:01:00, 00:00:24, 00:00:15
00:01:00, 00:00:24, 00:00:15
00:01:00, 00:00:24, 00:00:15
The file is stored in a folder locally in my computer.
I need help with writing code to do the following:
- pass the path of the CSV file to access its data, then read that data and validate each cell/value against the expected data, which will be defined in the IDE as follows:
video1duration,video2duration,video3duration
00:02:00, 00:05:24, 00:00:15
00:04:00, 00:10:24, 00:00:15
00:01:00, 00:00:24, 00:00:15
As I understand your question, you have a two-stage process. Trying to merge these two separate things into one will certainly result in less legible and harder-to-maintain code (everything as one giant package/class/function).
Your first stage is to import a .csv file and parse it, using any of these three methods:
- java.util.Scanner
- the String.split() function
- a third-party library such as OpenCSV
It is possible to validate that your .csv is valid, and that it contains tabular data without knowing or caring about what the data will later be used for.
In the second stage, take the tabular data (e.g. an array of arrays) and turn it into a tree. At this point, your hierarchy package will be doing validation, but it will only be validating the tree structure (e.g. every node except the root has one parent, etc.). If you want to delve further, this might be interesting: https://www.npmjs.com/package/csv-file-validator.
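For illustration, here is a minimal sketch of the first stage plus a simple cell-by-cell comparison, using only java.util.Scanner and String.split(); the local file path is a placeholder and the expected values are the ones from the question:
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class CsvValidator {
    public static void main(String[] args) throws Exception {
        // Stage 1: read the CSV into a list of rows (the path is a placeholder).
        List<String[]> actual = new ArrayList<>();
        try (Scanner scanner = new Scanner(new File("C:/data/durations.csv"))) {
            while (scanner.hasNextLine()) {
                actual.add(scanner.nextLine().split(","));
            }
        }

        // Stage 2: compare each cell against the expected data defined in code.
        String[][] expected = {
            {"video1duration", "video2duration", "video3duration"},
            {"00:02:00", "00:05:24", "00:00:15"},
            {"00:04:00", "00:10:24", "00:00:15"},
            {"00:01:00", "00:00:24", "00:00:15"},
        };
        for (int row = 0; row < expected.length; row++) {
            for (int col = 0; col < expected[row].length; col++) {
                String actualCell = actual.get(row)[col].trim();
                if (!actualCell.equals(expected[row][col])) {
                    System.out.printf("Mismatch at row %d, column %d: expected %s but found %s%n",
                            row, col, expected[row][col], actualCell);
                }
            }
        }
    }
}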

Accessing resource file when running as a DataflowPipelineRunner in Google Dataflow

In my project, I am trying to add some metadata to the data processed in my pipeline. The metadata is located in a DBF file in a subfolder called resources, next to the src folder.
The src-folder contains the main-class and I have several packages (IO, processing, aggregation, utils).
I read and process the file with metadata in my main class where the pipeline is defined. The code I am using to access the file is as follows:
File temp1 = new File("resources/xxx.dbf");
I check if the file was found using:
LOG.info(temp1.exists());
which runs fine.
There are messages coming in as Strings which I read using PubSubIO. I use the contents of this file to fill a Map containing keys and values.
Map<String, ArrayList<Double>> sensorToCoordinates = coordinateData.getSensorLocations();
I then set a static variable in a custom class I made called 'SensorValue':
SensorValue.setKeyToCoordinates(sensorToCoordinates);
When parsing the incoming messages from Strings to the SensorValue class I made, using a ParDo function (going from a PCollection<String> to a PCollection<SensorValue>), the map is used in the constructor of the SensorValue class.
Running this code using a DirectPipelineRunner works perfectly. However, when I use a DataflowPipelineRunner and try to access the map in the SensorValue constructor, I run into a NullPointerException.
Now I am wondering why the setter is not working when using a DataflowPipelineRunner (I'm guessing it has something to do with the execution being distributed among several workers), and what the best practice is for using static resource files to enrich a pipeline.
You're right that the problem is that the execution of the ParDo is distributed across multiple workers. They don't have the local file, and they may not have the contents of the map.
There are a few options here:
1. Put the file in GCS, and have the pipeline read the contents of the file (using TextIO or something like that) and use it as a side-input to your later processing.
2. Include the file in the resources for the pipeline and load that in the startBundle of the DoFn that needs it (in the future there will be ways to make this happen less often than every bundle).
3. You could serialize the contents of the map into the arguments of the DoFn, by putting it in as a non-static field passed to the constructor of that class.
Option 1 is better as the size of the file increases (since it can support splitting it into pieces and doing lookups), while Option 2 likely involves less network traffic to retrieve the file. Option 3 will only work if the file is extremely small, since it will significantly increase the size of the serialized DoFn, which may lead to the job being too large to submit to the Dataflow service.
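For illustration, a rough sketch of Option 1 as a side input, written against the Apache Beam-style API rather than the older Dataflow 1.x SDK used in the question (the 1.x equivalents are TextIO.Read, View.asMap() and ParDo.withSideInputs); the GCS path and the CSV layout are assumptions, since the original metadata lives in a DBF file:
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputExample {
    static PCollection<String> enrich(Pipeline pipeline, PCollection<String> messages) {
        // Metadata exported to GCS as "sensorId,coordinates" lines (placeholder path).
        PCollectionView<Map<String, String>> sensorView = pipeline
            .apply(TextIO.read().from("gs://my-bucket/sensor-metadata.csv"))
            .apply(MapElements.via(new SimpleFunction<String, KV<String, String>>() {
                @Override
                public KV<String, String> apply(String line) {
                    String[] parts = line.split(",", 2);
                    return KV.of(parts[0], parts[1]);
                }
            }))
            .apply(View.asMap());

        // The map reaches every worker through the side input, instead of a
        // static field that is only set on the machine that submits the job.
        return messages.apply(ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                Map<String, String> sensorToCoordinates = c.sideInput(sensorView);
                // ... parse c.element() into a SensorValue using the map ...
                c.output(c.element());
            }
        }).withSideInputs(sensorView));
    }
}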

spring batch structure for parallel processing

I am seeking some guidance please on how to structure a spring batch application to ingest a bunch of potentially large delimited files, each with a different format.
The requirements are clear:
- select the files to ingest from an external source: there can be multiple releases of some files each day, so the latest release must be picked
- turn each line of each file into JSON by combining the delimited fields with the column names from the first line (which is skipped)
- send each line of JSON to a RESTful API
We have one step which uses a MultiResourceItemReader which processes the files in sequence. The files are input streams, which time out.
Ideally I think we want to have
- a step which identifies the files to ingest
- a step which processes files in parallel
Thanks in advance.
This is a fun one. I'd implement a custom line tokenizer that extends DelimitedLineTokenizer and also implements LineCallbackHandler. I'd then configure your FlatFileItemReader to skip the first line (the list of column names) and pass that first line to your handler/tokenizer to set all your token names.
A custom FieldSetMapper would then receive a FieldSet with all your name/value pairs, which I'd just pass to the ItemProcessor. Your processor could then build your JSON strings and pass them off to your writer.
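For illustration, a minimal sketch of that tokenizer/callback wiring, assuming comma-delimited files; the class name, method and file path are illustrative, and PassThroughFieldSetMapper stands in for the custom FieldSetMapper since it simply hands the FieldSet on to the processor:
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.LineCallbackHandler;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.mapping.PassThroughFieldSetMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.core.io.FileSystemResource;

// Tokenizer that learns its column names from the skipped header line.
public class HeaderAwareTokenizer extends DelimitedLineTokenizer implements LineCallbackHandler {

    @Override
    public void handleLine(String headerLine) {
        // The reader hands the skipped first line here; its fields become
        // the token names used for every following line.
        setNames(headerLine.split(DELIMITER_COMMA));
    }

    // Wiring sketch: skip the header, but route it to the tokenizer.
    public static FlatFileItemReader<FieldSet> buildReader() {
        HeaderAwareTokenizer tokenizer = new HeaderAwareTokenizer();

        DefaultLineMapper<FieldSet> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(new PassThroughFieldSetMapper());

        FlatFileItemReader<FieldSet> reader = new FlatFileItemReader<>();
        reader.setName("delimitedFileReader");
        reader.setResource(new FileSystemResource("data/input.csv"));
        reader.setLinesToSkip(1);
        reader.setSkippedLinesCallback(tokenizer);
        reader.setLineMapper(lineMapper);
        return reader;
    }
}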
Obviously, your job falls into the typical reader -> processor -> writer category, with the writer being optional in your case (if you don't wish to persist the JSON before sending it to the RESTful API), or the step that sends the JSON to the REST service can act as the writer, if writing is considered done after receiving a response from the service.
Anyway, you don't need a separate step to just know the file name. Make it part of application initialization code.
Strategies to parallelize your application are listed here.
You just said a bunch of files. If those files have a similar number of lines, I would go with the partitioning approach (i.e. by implementing the Partitioner interface, hand over each file to a separate thread, and that thread executes a step: reader -> processor -> writer). You wouldn't need MultiResourceItemReader in this case but a simple single-file reader, since each file will have its own reader; see the Partitioning section of the Spring Batch docs, and the minimal sketch after these alternatives.
If the line counts vary a lot, i.e. if one file is going to take hours and another finishes in a few minutes, you can continue using MultiResourceItemReader but use the Multi-threaded Step approach to achieve parallelism. This is chunk-level parallelism, so you might have to make the reader thread safe.
The Parallel Steps approach doesn't look suitable for your case, since your steps are not independent.
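For illustration, a minimal sketch of the partitioning approach; the class and context-key names are illustrative, and Spring Batch's own MultiResourcePartitioner does essentially the same thing out of the box:
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.core.io.Resource;

// One partition (and hence one worker step execution) per input file.
public class PerFilePartitioner implements Partitioner {

    private final Resource[] inputFiles;

    public PerFilePartitioner(Resource[] inputFiles) {
        this.inputFiles = inputFiles;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < inputFiles.length; i++) {
            ExecutionContext context = new ExecutionContext();
            // The worker step's reader binds to this key, e.g. via
            // @Value("#{stepExecutionContext['inputFile']}").
            context.putString("inputFile", inputFiles[i].getDescription());
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}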
Hope it helps !!

How are large directory trees processed in using the Spark API?

I'm a new Spark user and I'm trying to process a large set of XML files sitting on an HDFS file system. There are about 150k files, totalling about 28 GB, on a "development" cluster of one machine (actually a VM).
The files are organised into a directory structure in HDFS such that there are about a hundred subdirectories under a single parent directory. Each "child" directory contains anything between a couple of hundred and a couple of thousand XML files.
My task is to parse each XML file, extract a few values using XPath expressions, and save the result to HBase. I'm trying to do this with Apache Spark, and I'm not having much luck. My problem appears to be a combination of the Spark API, and the way RDDs work. At this point it might be prudent to share some pseudocode to express what I'm trying to do:
RDD[String] filePaths = getAllFilePaths()
RDD[Map<String,String>] parsedFiles = filePaths.map((filePath) => {
// Load the file denoted by filePath
// Parse the file and apply XPath expressions
})
// After calling map() above, I should have an RDD[Map<String,String>] where
// the map is keyed by a "label" for an xpath expression, and the
// corresponding value is the result of the expression applied to the file
So, discounting the part where I write to HBase for a moment, let's focus on the above. I cannot load a file from within the RDD map() call.
I have tried this a number of different ways, and all have failed:
- Using a call to SparkContext.textFile("/my/path") to load the file fails because SparkContext is not serializable
- Using a call to FileSystem.open(path) from the Hadoop API, where the FileSystem is instantiated outside the RDD, fails because FileSystem is not serializable
- Using a call to FileSystem.open(path) from the Hadoop API, where the FileSystem is instantiated inside the RDD, fails because the program runs out of file handles.
Alternative approaches have included attempting to use SparkContext.wholeTextFiles("/my/path/*") so I don't have to do the file load from within the map() call; this fails because the program runs out of memory, presumably because it loads the files eagerly.
Has anyone attempted anything similar in their own work, and if so, what approach did you use?
Try using a wildcard to read the whole directory.
val errorCount = sc.textFile("hdfs://some-directory/*")
Actually, Spark can read a whole HDFS directory; quoting from the Spark documentation:
All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

Custom Binary Input - Hadoop

I am developing a demo application in Hadoop and my input is .mrc image files. I want to load them to hadoop and do some image processing over them.
These are binary files that contain a large header with metadata, followed by the data of a set of images. The information on how to read the images is also contained in the header (e.g. number_of_images, number_of_pixels_x, number_of_pixels_y, bytes_per_pixel), so after the header bytes, the first number_of_pixels_x * number_of_pixels_y * bytes_per_pixel bytes are the first image, then the second, and so on.
What is a good InputFormat for these kinds of files? I thought of two possible solutions:
1. Convert them to sequence files by placing the metadata in the sequence file header and having key-value pairs for each image. In this case, can I access the metadata from all mappers?
2. Write a custom InputFormat and RecordReader and create splits for each image while placing the metadata in the distributed cache.
I am new to Hadoop, so I may be missing something. Which approach do you think is better? Is there any other way that I am missing?
Without knowing your file formats, the first option seems to be the better one. Using sequence files, you can leverage a lot of SequenceFile-related tools to get better performance. However, there are two things that concern me with this approach:
1. How will you get your .mrc files into a .seq format?
2. You mentioned that the header is large; this may reduce the performance of SequenceFiles.
But even with those concerns, I think that representing your data as SequenceFiles is the best option.
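For illustration, a rough sketch of writing such a sequence file with the .mrc header fields stored in the SequenceFile's own metadata block (field names come from the question; the values, path and image bytes are placeholders); the metadata can be read back via SequenceFile.Reader#getMetadata(), e.g. by opening a reader on the split's file path in a mapper's setup:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MrcToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Header fields go into the SequenceFile metadata so every reader can see them.
        SequenceFile.Metadata metadata = new SequenceFile.Metadata();
        metadata.set(new Text("number_of_pixels_x"), new Text("1024"));
        metadata.set(new Text("number_of_pixels_y"), new Text("1024"));
        metadata.set(new Text("bytes_per_pixel"), new Text("2"));

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/images.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.metadata(metadata))) {

            // One record per image: key = image index, value = the raw image bytes
            // sliced out of the .mrc body.
            byte[] imageBytes = new byte[1024 * 1024 * 2];
            writer.append(new Text("image-0"), new BytesWritable(imageBytes));
        }
    }
}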
