I am using MapReduce (just map, really) to do a data-processing task in four phases. Each phase is one MapReduce job. I need them to run in sequence, that is, don't start phase 2 until phase 1 is done, etc. Does anyone have experience doing this that they can share?
Ideally we'd do this 4-job sequence overnight, so making it cron-able would be a fine thing as well.
Thank you.
As Daniel mentions, the appengine-pipeline library is meant to solve this problem. I go over chaining mapreduce jobs together in this blog post, under the section "Implementing your own Pipeline jobs".
For convenience, I'll paste the relevant section here:
Now that we know how to launch the predefined MapreducePipeline, let's take a look at implementing and running our own custom pipeline jobs. The pipeline library provides a low-level framework for launching arbitrary distributed computing jobs within appengine, but, for now, we'll talk specifically about how we can use this to help us chain mapreduce jobs together. Let's extend our previous example to also output a reverse index of characters and IDs.
First, we define the parent pipeline job.
class ChainMapReducePipeline(mapreduce.base_handler.PipelineBase):
    def run(self):
        deduped_blob_key = (
            yield mapreduce.mapreduce_pipeline.MapreducePipeline(
                "test_combiner",
                "main.map",
                "main.reduce",
                "mapreduce.input_readers.RandomStringInputReader",
                "mapreduce.output_writers.BlobstoreOutputWriter",
                combiner_spec="main.combine",
                mapper_params={
                    "string_length": 1,
                    "count": 500,
                },
                reducer_params={
                    "mime_type": "text/plain",
                },
                shards=16))

        char_to_id_index_blob_key = (
            yield mapreduce.mapreduce_pipeline.MapreducePipeline(
                "test_chain",
                "main.map2",
                "main.reduce2",
                "mapreduce.input_readers.BlobstoreLineInputReader",
                "mapreduce.output_writers.BlobstoreOutputWriter",
                # Pass output from first job as input to second job
                mapper_params=(yield BlobKeys(deduped_blob_key)),
                reducer_params={
                    "mime_type": "text/plain",
                },
                shards=4))
This launches the same job as the first example, takes the output from that job, and feeds it into the second job, which reverses each entry. Notice that the result of the first pipeline yield is passed in to mapper_params of the second job. The pipeline library uses magic to detect that the second pipeline depends on the first one finishing and does not launch it until the deduped_blob_key has resolved.
Next, I had to create the BlobKeys helper class. At first, I didn’t think this was necessary, since I could just do:
mapper_params={"blob_keys": deduped_blob_key},
But, this didn't work for two reasons. The first is that "generator pipelines cannot directly access the outputs of the child Pipelines that it yields". The code above would require the generator pipeline to create a temporary dict object with the output of the first job, which is not allowed. The second is that the string returned by BlobstoreOutputWriter is of the format "/blobstore/<key>", but BlobstoreLineInputReader expects simply "<key>". To solve these problems, I made a little helper BlobKeys class. You'll find yourself doing this for many jobs, and the pipeline library even includes a set of common wrappers, but they do not work within the MapreducePipeline framework, which I discuss at the bottom of this section.
class BlobKeys(third_party.mapreduce.base_handler.PipelineBase):
    """Returns a dictionary with the supplied keyword arguments."""

    def run(self, keys):
        # Remove the key from a string in this format:
        # /blobstore/<key>
        return {
            "blob_keys": [k.split("/")[-1] for k in keys]
        }
Here is the code for the map2 and reduce2 functions:
def map2(data):
    # BlobstoreLineInputReader.next() returns a tuple
    start_position, line = data
    # Split input based on previous reduce() output format
    elements = line.split(" - ")
    random_id = elements[0]
    char = elements[1]
    # Swap 'em
    yield (char, random_id)


def reduce2(key, values):
    # Create the reverse index entry
    yield "%s - %s\n" % (key, ",".join(values))
I'm unfamiliar with google-app-engine, but couldn't you put all of the job configurations in a single main program and then run them in sequence, something like the following? This works in normal MapReduce programs, so if google-app-engine code isn't too different it should work fine.
Configuration conf1 = getConf();
Configuration conf2 = getConf();
Configuration conf3 = getConf();
Configuration conf4 = getConf();
// whatever configuration you do for the jobs

Job job1 = new Job(conf1, "name1");
Job job2 = new Job(conf2, "name2");
Job job3 = new Job(conf3, "name3");
Job job4 = new Job(conf4, "name4");
// setup for the jobs here

// waitForCompletion(true) blocks until the job finishes, so each phase only
// starts after the previous one has completed; the return value tells you
// whether the phase succeeded, so a failed phase stops the chain.
if (!job1.waitForCompletion(true)) System.exit(1);
if (!job2.waitForCompletion(true)) System.exit(1);
if (!job3.waitForCompletion(true)) System.exit(1);
if (!job4.waitForCompletion(true)) System.exit(1);
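For completeness, plain Hadoop (again, not App Engine's mapreduce) also ships a small helper for exactly this kind of dependency chain. Below is a hedged sketch using JobControl/ControlledJob; the class name, job names, and polling interval are invented, and the per-job mapper/reducer/input/output setup is omitted. A cron entry could launch this driver overnight.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class FourPhaseDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Mapper/reducer/input/output setup omitted; job1..job4 are the four phases.
        Job job1 = Job.getInstance(conf, "phase1");
        Job job2 = Job.getInstance(conf, "phase2");
        Job job3 = Job.getInstance(conf, "phase3");
        Job job4 = Job.getInstance(conf, "phase4");

        ControlledJob c1 = new ControlledJob(job1, null);
        ControlledJob c2 = new ControlledJob(job2, null);
        ControlledJob c3 = new ControlledJob(job3, null);
        ControlledJob c4 = new ControlledJob(job4, null);

        // Each phase declares a dependency on the previous one, so phase 2
        // is not submitted until phase 1 has completed, and so on.
        c2.addDependingJob(c1);
        c3.addDependingJob(c2);
        c4.addDependingJob(c3);

        JobControl control = new JobControl("four-phase-chain");
        control.addJob(c1);
        control.addJob(c2);
        control.addJob(c3);
        control.addJob(c4);

        // JobControl is a Runnable; run it in a thread and poll for completion.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(5000);
        }
        control.stop();
    }
}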
You need the appengine-pipeline project, which is meant for exactly this.
Related
If I run an example Flink application like the one below:
DataStream ds;
ds.map(new MapFunction1()).print();
ds.map(new MapFunction2()).print();
Will Flink send each record from ds twice to the downstream operators (MapFunction1 and MapFunction2) internally?
I know that data exchange in Flink happens at the taskmanager level rather than at the operator level.
Yes, try:
StreamExecutionEnvironment environment =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Long> someIntegers = environment.generateSequence(0, 0);
someIntegers.map(aLong -> aLong + 1).print();
someIntegers.map(aLong -> aLong + 2).print();
environment.execute();
Output:
1> 1
1> 2
The job graph for this application has the single source feeding both map operators, and the whole application runs in a single thread, in one taskmanager. I disabled operator chaining to get the Flink web UI to generate a per-operator job graph, but if I hadn't done that, there would have been no networking involved at all.
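For reference, chaining is switched off on the execution environment itself; here is a minimal sketch of how that can be done, using the same program as above with the chaining switch added:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NoChainingExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // With chaining on, the source and maps collapse into one chained task
        // and no network is involved; disabling it makes each operator its own
        // task, so the web UI draws them as separate boxes.
        environment.disableOperatorChaining();

        DataStream<Long> someIntegers = environment.generateSequence(0, 0);
        someIntegers.map(aLong -> aLong + 1).print();
        someIntegers.map(aLong -> aLong + 2).print();
        environment.execute();
    }
}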
I have a simple Spark application running in cluster mode.
val funcGSSNFilterHeader = (x: String) => {
  println(!x.contains("servedMSISDN"))
  !x.contains("servedMSISDN")
}

val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds))

val ggsnFileLines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "C:\\Users\\Mbazarganigilani\\Documents\\RA\\GGSN\\Files1", filterF, false)

val ggsnArrays = ggsnFileLines
  .map(x => x._2.toString()).filter(x => funcGSSNFilterHeader(x))
ggsnArrays.foreachRDD(s => {println(x.toString()})
I need to print !x.contains("servedMSISDN") inside the map function for debugging purposes, but this doesn't print on the console
Your code contains a driver part (main/master) and executor parts (which run on the nodes in cluster mode).
Functions passed to a "map" run on the executors, i.e. when you are in cluster mode, a print inside the map function ends up on the consoles of the worker nodes (which you won't see).
In order to debug the program, you can:
Run the code in "local" mode, so that the prints in the "map function" are printed to the console of your "master/main node", since the executors run on the same machine.
Replace "print to console" with "save to file" / "save to elastic" / etc., or bring a small sample of each batch back to the driver and print it there (see the sketch below).
Note that in addition to the local vs cluster mode issue, it seems you have a typo in your code:
ggsnArrays.foreachRDD(s => {println(x.toString()})
Should be (with the RDD parameter s actually used):
ggsnArrays.foreachRDD(s => { println(s.toString) })
Two possibilities:
Your logs are on the worker nodes, so you must check the worker logs for these messages. As suggested before, you can run your application in local mode to see the logs on your own machine. By the way, it's better to use a logging framework such as SLF4J than plain println, but I assume it's only for learning :)
In the snippet there is no ssc.start() and no ssc.awaitTermination(). Did you run these? If not, foreachRDD will never be executed. If that part is fine, please add these lines at the end of the script and try again, but please do check the worker node logs :)
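For reference, the tail of a streaming driver typically looks like the sketch below, shown with the Java API purely for illustration (the Scala calls have the same names); the app name, batch interval, and master are placeholders and the stream/transformation setup is elided.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingDriverSkeleton {
    public static void main(String[] args) throws InterruptedException {
        // local[2] keeps everything on one machine, which also makes the
        // println debugging from the previous answer visible on your console.
        SparkConf conf = new SparkConf().setAppName("ggsn-debug").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // ... define the file stream, map and filter here ...

        // Nothing runs until start() is called; awaitTermination() keeps the
        // driver alive so batches keep being processed.
        jssc.start();
        jssc.awaitTermination();
    }
}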
Yaron has already explained that you cannot see the output of a print statement because it is not executed on your driver but on the worker nodes.
In addition to his answer, note that you can use the Spark UI or the Spark History Server.
In the Executors tab you can see which executors are running on which node, and (if your cluster is correctly configured) you have access to each executor's stdout and stderr, so you can check what is written there.
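On the "correctly configured" point: the History Server only shows applications whose event logs were written somewhere it can read. A hedged sketch of switching that on from the application side (the log directory is a placeholder; the same settings can go into spark-defaults.conf instead):

import org.apache.spark.SparkConf;

public class EventLogConf {
    // spark.eventLog.enabled / spark.eventLog.dir are the standard settings
    // the History Server relies on to find finished applications.
    static SparkConf withEventLog(SparkConf conf) {
        return conf
                .set("spark.eventLog.enabled", "true")
                .set("spark.eventLog.dir", "hdfs:///tmp/spark-events");  // placeholder path
    }
}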
Consider I have a job in Spark as follows:
CSV File ==> Filter By A Column ==> Taking Sample ==> Save As JSON
Now my requirement is: how do I know which step of the job (fetching the file, filtering, or sampling) is currently executing, programmatically (preferably using the Java API)? Is there any way to do this?
I can track jobs, stages, and tasks using the SparkListener class, for example by tracking a stage ID. But how do I know which stage ID belongs to which step in the job chain?
What I want is to send a notification to the user when, say, Filter By A Column has completed. For that I made a class that extends SparkListener, but I cannot find out where to get the name of the currently executing transformation. Is it possible to track this at all?
public class ProgressListener extends SparkListener {
    @Override
    public void onJobStart(SparkListenerJobStart jobStart)
    {
    }

    @Override
    public void onStageSubmitted(SparkListenerStageSubmitted stageSubmitted)
    {
        // System.out.println("Stage Name : " + stageSubmitted.stageInfo().getStatusString()); giving action name only
    }

    @Override
    public void onTaskStart(SparkListenerTaskStart taskStart)
    {
        // no such method like taskStart.name()
    }
}
You cannot exactly know when, e.g., the filter operation starts or finishes.
That's because you have transformations (filter, map, ...) and actions (count, foreach, ...). Spark will put as many operations into one stage as possible. Then the stage is executed in parallel on the different partitions of your input. And here comes the problem.
Assume you have several workers and the following program
LOAD ==> MAP ==> FILTER ==> GROUP BY + Aggregation
This program will probably have two stages: the first stage will load the file and apply the map and filter.
Then the output will be shuffled to create the groups. In the second stage the aggregation will be performed.
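To make that concrete, here is a rough Java sketch of such a program (the file path, key extraction, and aggregation are invented for illustration). Spark pipelines the map and filter into the stage that reads the input, and the reduceByKey marks the shuffle boundary where the second stage begins.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class TwoStageExample {
    static JavaPairRDD<String, Integer> run(JavaSparkContext sc, String path) {
        // Stage 1: LOAD, MAP and FILTER are pipelined over each input partition.
        JavaRDD<String> records = sc.textFile(path)
                .map(line -> line.trim())
                .filter(line -> !line.isEmpty());

        // Shuffle boundary: grouping by key starts stage 2, where the
        // aggregation runs on the shuffled partitions.
        return records
                .mapToPair(line -> new Tuple2<>(line.split(",")[0], 1))
                .reduceByKey((a, b) -> a + b);   // GROUP BY + aggregation
    }
}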
Now the problem is that you have several workers, and each will process a portion of your input data in parallel. That is, every executor in your cluster receives a copy of your program (the current stage) and executes it on its assigned partition.
You see, you will have multiple instances of your map and filter operators that are executed in parallel, but not necessarily at the same time. In an extreme case, worker 1 will finish stage 1 before worker 20 has started at all (and will therefore finish its filter operation before worker 20).
For RDDs, Spark uses the iterator model inside a stage. For Datasets in the latest Spark versions, however, a single loop over the partition is generated that executes all the transformations. This means that in this case Spark itself does not really know when a transformation operator has finished for a single task!
Long story short:
You are not able to know when an operation inside a stage finishes.
Even if you could, there are multiple instances that will finish at different times.
So, to the part where I already had the same problem:
In our Piglet project (please allow some advertisement ;-) ) we generate Spark code from Pig Latin scripts, and we wanted to profile those scripts. I ended up inserting a mapPartitions operator between all user operators that sends the partition ID and the current time to a server, which evaluates the messages. However, this solution also has its limitations... and I'm not completely satisfied yet.
However, unless you are able to modify the programs, I'm afraid you cannot achieve what you want.
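If you can modify the program, though, here is a rough Java sketch of that instrumentation idea; the stderr print is just a placeholder for "send the partition ID and the current time to a server", and Piglet's actual generated code will look different.

import org.apache.spark.api.java.JavaRDD;

public class StageProbe {
    // Wrap a user step with a pass-through operator that reports when each
    // partition reaches this point in the pipeline.
    static <T> JavaRDD<T> probe(JavaRDD<T> rdd, String stepName) {
        return rdd.mapPartitionsWithIndex((partitionId, records) -> {
            // Placeholder for posting (stepName, partitionId, timestamp) to a server.
            System.err.println(stepName + " reached partition " + partitionId
                    + " at " + System.currentTimeMillis());
            return records;
        }, true);  // preserve partitioning; the data itself is untouched
    }
}

You would then wrap each user step, e.g. probe(filtered, "filter"), so the server can see roughly when each partition reaches each operator.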
Did you consider this option: http://spark.apache.org/docs/latest/monitoring.html
It seems you can use the following REST API to get a certain job's state: /applications/[app-id]/jobs/[job-id]
You can set the JobGroupId and JobGroupDescription so you can track which job group is being handled, i.e. with setJobGroup.
Assuming you set the job group like this:
sc.setJobGroup("1", "Test job")
When you then call http://localhost:4040/api/v1/applications/[app-id]/jobs/[job-id]
you'll get a JSON response with a descriptive name and the job group for that job:
{
  "jobId" : 3,
  "name" : "count at <console>:25",
  "description" : "Test Job",
  "submissionTime" : "2017-02-22T05:52:03.145GMT",
  "completionTime" : "2017-02-22T05:52:13.429GMT",
  "stageIds" : [ 3 ],
  "jobGroup" : "1",
  "status" : "SUCCEEDED",
  "numTasks" : 4,
  "numActiveTasks" : 0,
  "numCompletedTasks" : 4,
  "numSkippedTasks" : 0,
  "numFailedTasks" : 0,
  "numActiveStages" : 0,
  "numCompletedStages" : 1,
  "numSkippedStages" : 0,
  "numFailedStages" : 0
}
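Putting the last two answers together, here is a hedged Java sketch (group IDs, the column check, paths, and saveAsTextFile standing in for "Save As JSON" are all invented) that tags each action with its own job group, so the REST response above can be matched back to a step via its jobGroup/description fields. Note this only works per action, so delimiting the filter step costs an extra pass (the count), which is exactly the limitation the first answer describes.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class StepTracking {
    static void run(JavaSparkContext sc, String inPath, String outPath) {
        JavaRDD<String> filtered = sc.textFile(inPath)
                .filter(line -> line.contains("someValue"));   // "Filter By A Column"

        // Tag the next action; the REST API's "jobGroup"/"description" fields
        // will carry these values for the jobs it triggers.
        sc.setJobGroup("filter-step", "Filter By A Column");
        long kept = filtered.count();   // extra action just to delimit the step
        // ... notify the user here that the filter step has finished ...

        sc.setJobGroup("sample-save-step", "Taking Sample + Save");
        filtered.sample(false, 0.1).saveAsTextFile(outPath);   // stands in for "Save As JSON"
    }
}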
I'm trying to get to grips with Spark Streaming but I'm having difficulty. Despite reading the documentation and analysing the examples, I wish to do something more than a word count on a text file/stream/Kafka queue, which is about all we get to learn from the docs.
I wish to listen to an incoming Kafka message stream, group messages by key and then process them. The code below is a simplified version of the process: get the stream of messages from Kafka, reduce by key to group messages by message key, and then process them.
JavaPairDStream<String, byte[]> groupByKeyList = kafkaStream.reduceByKey((bytes, bytes2) -> bytes);

groupByKeyList.foreachRDD(rdd -> {
    List<MyThing> myThingsList = new ArrayList<>();
    MyCalculationCode myCalc = new MyCalculationCode();

    rdd.foreachPartition(partition -> {
        while (partition.hasNext()) {
            Tuple2<String, byte[]> keyAndMessage = partition.next();
            MyThing aSingleMyThing = MyThing.parseFrom(keyAndMessage._2); // parse from protobuffer format
            myThingsList.add(aSingleMyThing);
        }
    });

    List<MyResult> results = myCalc.doTheStuff(myThingsList);
    // other code here to write results to file
});
When debugging, I see that inside the while (partition.hasNext()) loop the myThingsList has a different memory address than the List<MyThing> myThingsList declared in the outer foreachRDD.
When List<MyResult> results = myCalc.doTheStuff(myThingsList); is called, there are no results, because myThingsList is a different instance of the List.
I'd like a solution to this problem, but would prefer a reference to documentation to help me understand why this is not working (as anticipated) and how I can solve it for myself (I don't mean a link to the single page of Spark documentation, but a section/paragraph, or better still a link to JavaDoc that does not provide Scala examples with non-functional commented code).
The reason you're seeing different list addresses is that Spark doesn't execute foreachPartition locally on the driver; it has to serialize the function and send it over to the executor handling the processing of the partition. You have to remember that although working with the code feels like everything runs in a single location, the calculation is actually distributed.
The first problem I see with your code has to do with your reduceByKey, which takes two byte arrays and returns the first. Is that really what you want to do? It means you're effectively dropping parts of the data; perhaps you're looking for combineByKey, which will allow you to return a JavaPairDStream<String, List<byte[]>>.
Regarding the parsing of your protobuf, it looks to me like you don't want foreachRDD; you need an additional map to parse the data:
kafkaStream
    .combineByKey(/* implement logic */)
    .flatMap(x -> x._2)
    .map(proto -> MyThing.parseFrom(proto))
    .map(myThing -> myCalc.doStuff(myThing))
    .foreachRDD(/* After all the processing, do stuff with result */);
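For what it's worth, one possible shape of that combineByKey is sketched below; this is a hedged example, where the combiner simply accumulates each key's byte arrays into a list and the HashPartitioner with 4 partitions is an arbitrary choice.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.HashPartitioner;
import org.apache.spark.streaming.api.java.JavaPairDStream;

public class CombinePerKey {
    static JavaPairDStream<String, List<byte[]>> groupMessages(
            JavaPairDStream<String, byte[]> kafkaStream) {
        return kafkaStream.combineByKey(
                // createCombiner: the first value seen for a key starts a new list
                value -> {
                    List<byte[]> list = new ArrayList<>();
                    list.add(value);
                    return list;
                },
                // mergeValue: add another value for the same key
                (list, value) -> { list.add(value); return list; },
                // mergeCombiners: merge partial lists built on different partitions
                (left, right) -> { left.addAll(right); return left; },
                new HashPartitioner(4));
    }
}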
I have written a MapReduce program using Mahout. The map output value is ClusterWritable. When I run the code in Eclipse, it runs with no error, but when I run the jar file in the terminal, it shows the exception:
java.io.IOException: wrong value class: org.apache.mahout.math.VectorWritable is not class org.apache.mahout.clustering.iterator.ClusterWritable
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:988)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:74)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:498)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.mahout.clustering.canopy.CanopyMapper.cleanup(CanopyMapper.java:59)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
The output code in map is:
context.write(new Text(), new ClusterWritable());
but I don't know why it says that the value type is VectorWritable.
The mapper being run, which produces the stack trace above, is Mahout's CanopyMapper, and not a custom one you've written.
CanopyMapper's cleanup method outputs (key: Text, value: VectorWritable).
See CanopyMapper.java.
See also CanopyDriver.java and its buildClustersMR method, where the MR job is configured with its mapper, reducer, and the appropriate output key/value classes.
You didn't state it, so I'm guessing that you're using more than one MR job in a data-flow pipeline. Check that the outputs of each job in the pipeline are valid/expected input for the next job in the pipeline. Consider using Cascading/Scalding to define your data flow (see http://www.slideshare.net/melrief/scalding-programming-model-for-hadoop).
Consider using the Mahout user mailing list to post Mahout-related questions.