How does Hadoop actually accept MR jobs and input data?

All of the introductory tutorials and docs that I can find on Hadoop have simple/contrived (word count-style) examples, where each of them is submitted to MR by:
SSHing into the JobTracker node
Making sure that a JAR file containing the MR job is on HDFS
Running a Hadoop CLI command of the form bin/hadoop jar share/hadoop/mapreduce/my-map-reduce.jar <someArgs> that actually runs the MR job
Either reading the MR result from the command-line or opening a text file containing the result
Although these examples are great for showing total newbies how to work with Hadoop, they don't show me how Java code actually integrates with Hadoop/MR at the API level. I guess I am sort of expecting that:
Hadoop exposes some kind of client access/API for submitting MR jobs to the cluster
Once the jobs are complete, some asynchronous mechanism (callback, listener, etc.) reports the result back to the client
So, something like this (Groovy pseudo-code):
class Driver {
    static void main(String[] args) {
        new Driver().run(args)
    }

    void run(String[] args) {
        MapReduceJob myBigDataComputation = new SolveTheMeaningOfLifeJob(convertToHadoopInputs(args), new MapReduceCallback() {
            @Override
            void onResult() {
                // Now that you know the meaning of life, do nothing.
            }
        })

        HadoopClusterClient hadoopClient = new HadoopClusterClient("http://my-hadoop.example.com/jobtracker")
        hadoopClient.submit(myBigDataComputation)
    }
}
So I ask: surely the workflow in all the introductory tutorials, where you SSH into nodes, run Hadoop from the CLI, and open text files to view the results, can't be the way Big Data companies actually integrate with Hadoop. Surely something along the lines of my pseudo-code snippet above is used to kick off an MR job and fetch its results. What is it?

In a word, kicking off an MR job can be done using the Oozie scheduler. But before that, you write a MapReduce job. It has a driver class, which is the starting point of the job. In the driver class you provide all the information the job needs to run: the map input, the mapper class, any partitioners, configuration details, and the reducer details.
Once these are packaged in a JAR file and you start the job as above (hadoop jar) from the CLI (in reality Oozie does it), the rest is taken care of by the Hadoop ecosystem. Hope I answered your question.
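If you want to drive this from Java code rather than the shell, the closest thing to your pseudo-code is the client-side org.apache.hadoop.mapreduce.Job API: build a Job, call submit() (which returns immediately, unlike waitForCompletion(true)), and then poll or wait for the result. The sketch below is a minimal, hedged example assuming a Hadoop 2.x client; the fs.defaultFS address is a placeholder, and it uses the identity Mapper/Reducer purely to keep the snippet self-contained.

// Minimal sketch, assuming a Hadoop 2.x client. The cluster address below is a
// placeholder; normally it comes from core-site.xml/yarn-site.xml on the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProgrammaticDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder address

        Job job = Job.getInstance(conf, "solve-the-meaning-of-life");
        job.setJarByClass(ProgrammaticDriver.class);
        job.setMapperClass(Mapper.class);    // identity mapper, only to keep the sketch self-contained
        job.setReducerClass(Reducer.class);  // identity reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // submit() returns immediately; waitForCompletion(true) would block instead.
        job.submit();
        while (!job.isComplete()) {
            System.out.printf("map %.0f%% reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
    }
}

There is no built-in completion callback in this API, so in practice you either block on waitForCompletion(), poll as above, or hand scheduling and notification over to something like Oozie, as described above.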

Related

Google Cloud Dataflow: Submitted job is executing but using old code

I'm writing a Dataflow pipeline that should do 3 things:
Reading .csv files from GCP Storage
Parsing the data into BigQuery-compatible TableRows
Writing the data to a BigQuery table
Up until now this has all worked like a charm, and it still does, but when I change the source and destination variables nothing changes. The job that actually runs is an old one, not the recently changed (and committed) code. Somehow, when I run the code from Eclipse using the BlockingDataflowPipelineRunner, the code itself is not uploaded; an older version is used instead.
There is normally nothing wrong with the code, but to be as complete as possible:
public class BatchPipeline {
    public static void main(String[] args) {
        String source = "gs://sourcebucket/*.csv";
        String destination = "projectID:datasetID.testing1";

        // Creation of the pipeline with default arguments
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

        PCollection<String> line = p.apply(TextIO.Read.named("ReadFromCloudStorage")
                .from(source));

        @SuppressWarnings("serial")
        PCollection<TableRow> tablerows = line.apply(ParDo.named("ParsingCSVLines").of(new DoFn<String, TableRow>() {
            @Override
            public void processElement(ProcessContext c) {
                // processing code goes here
            }
        }));

        // Defining the BigQuery table schema
        List<TableFieldSchema> fields = new ArrayList<>();
        fields.add(new TableFieldSchema().setName("datetime").setType("TIMESTAMP").setMode("REQUIRED"));
        fields.add(new TableFieldSchema().setName("consumption").setType("FLOAT").setMode("REQUIRED"));
        fields.add(new TableFieldSchema().setName("meterID").setType("STRING").setMode("REQUIRED"));
        TableSchema schema = new TableSchema().setFields(fields);
        String table = destination;

        tablerows.apply(BigQueryIO.Write
                .named("BigQueryWrite")
                .to(table)
                .withSchema(schema)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withoutValidation());

        // Runs the pipeline
        p.run();
    }
}
This problem arose because I've just changed laptops and had to reconfigure everything. I'm working on a clean Ubuntu 16.04 LTS install with all the dependencies for GCP development set up (normally). Everything seems to be configured correctly, since I'm able to start a job (which shouldn't be possible if my config were broken, right?). I'm using Eclipse Neon, by the way.
So where could the problem lie? It seems to me that there is a problem uploading the code, but I've made sure that my cloud git repo is up to date and the staging bucket has been cleaned up ...
**** UPDATE ****
I never found out exactly what was going wrong, but when I checked the creation dates of the files in my deployed JAR, I saw that they had indeed never really been updated. The JAR file itself had a recent timestamp, however, which made me overlook that problem completely (rookie mistake).
I eventually got it all working again by simply creating a new Dataflow project in Eclipse and copying my .java files from the broken project into the new one. Everything worked like a charm from then on.
Once you submit a Dataflow job, you can check which artifacts were part of the job specification by inspecting the list of staged files, which is available via DataflowPipelineWorkerPoolOptions#getFilesToStage. The code snippet below gives a small example of how to get this information.
PipelineOptions myOptions = ...
myOptions.setRunner(DataflowPipelineRunner.class);
Pipeline p = Pipeline.create(myOptions);
// Build up your pipeline and run it.
p.apply(...)
p.run();
// At this point in time, the files which were staged by the
// DataflowPipelineRunner will have been populated into the
// DataflowPipelineWorkerPoolOptions#getFilesToStage
List<String> stagedFiles = myOptions.as(DataflowPipelineWorkerPoolOptions.class).getFilesToStage();
for (String stagedFile : stagedFiles) {
    System.out.println(stagedFile);
}
The above code should print out something like:
/my/path/to/file/dataflow.jar
/another/path/to/file/myapplication.jar
/a/path/to/file/alibrary.jar
It is likely that some of the resources that are part of the job you're uploading are out of date and still contain your old code. Look through all the directories and JARs in the staging list, find every instance of BatchPipeline, and verify its age. JAR files can be extracted using the jar tool or any zip file reader. Alternatively, use javap or any other class file inspector to validate that the BatchPipeline class file lines up with the changes you expect to have made.
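As a rough illustration of that last step (this snippet is mine, not part of the original answer), you can walk the staged JARs with java.util.jar.JarFile and print the entry timestamp of every class whose name contains BatchPipeline, which makes stale copies easy to spot:

import java.io.IOException;
import java.util.Date;
import java.util.List;
import java.util.jar.JarFile;

public class StagedJarInspector {
    // Prints the timestamp of every BatchPipeline class found in the staged JARs.
    public static void inspect(List<String> stagedFiles) throws IOException {
        for (String path : stagedFiles) {
            if (!path.endsWith(".jar")) {
                continue; // directories/non-JAR entries can be checked on disk directly
            }
            try (JarFile jar = new JarFile(path)) {
                jar.stream()
                   .filter(e -> e.getName().contains("BatchPipeline") && e.getName().endsWith(".class"))
                   .forEach(e -> System.out.println(
                           path + " -> " + e.getName() + " (" + new Date(e.getTime()) + ")"));
            }
        }
    }
}

Feed it the stagedFiles list obtained from getFilesToStage in the snippet above.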

Monitor progress and intermediate results in Spark

I have a simple Spark task, something like this:
JavaRDD<Solution> solutions = rdd.map(new Solve());
// Select best solution by some criteria
The solve routine takes some time. For a demo application, I need to get some property of each solution as soon as it is calculated, before the call to rdd.map terminates.
I've tried using accumulators and a SparkListener, overriding the onTaskEnd method, but it seems to be called only at the end of the mapping, not per thread. For example:
sparkContext.sc().addSparkListener(new SparkListener() {
    public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
        // do something with taskEnd.taskInfo().accumulables()
    }
});
How can I get an asynchronous message for each map function end?
Spark runs locally or in a standalone cluster mode.
Answers can be in Java or Scala, both are OK.

Using the org.apache.hadoop.util.Progressable interface

Can someone provide an example of how the Progressable interface might be implemented for use when calling FileSystem.create()? I saw the following code snippet in another post, but it did not show where bytesWritten came from:
OutputStream os = hdfs.create(file,
    new Progressable() {
        public void progress() {
            out.println("...bytes written: [ " + bytesWritten + " ]");
        }
    });
The documentation of this interface says it is for reporting progress to the Hadoop framework to avoid a timeout in the case of a lengthy operation, but "Hadoop: The Definitive Guide" says it is for notifying the application of the progress of the data being written to the data nodes, which doesn't make much sense to me, since this is a create.
Thanks, RF
If you have an implementation of Mapper where an invocation of map() may take a long time (like more than several minutes), then you can periodically call progress() on the provided context object to let Hadoop know that your code isn't hung. That's what they mean by "explicitly reporting progress": it works when you're using an object provided by the framework that implements Progressable; it obviously doesn't work that way when you write your own implementation of Progressable.
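To make that concrete, here is a small sketch of my own (hedged, not taken from the original answer) of a mapper whose map() does lengthy per-record work and calls context.progress() on each iteration so the framework knows the task is still alive; slowExternalLookup() is a hypothetical stand-in for the expensive operation.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SlowLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        for (String part : parts) {
            String result = slowExternalLookup(part); // hypothetical long-running call
            // Tell the framework we're still alive so the task isn't timed out.
            context.progress();
            context.write(new Text(part), new Text(result));
        }
    }

    // Hypothetical placeholder for an operation that can take minutes.
    private String slowExternalLookup(String input) {
        return input.toUpperCase();
    }
}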
I should have read the Hadoop book further -- here is the example they gave later on:
OutputStream out = fs.create(new Path(dst), new Progressable() {
    public void progress() {
        System.out.print(".");
    }
});
The accompanying text says: "We illustrate progress by printing a period every time the progress() method is called by Hadoop, which is after each 64 KB packet of data is written to the datanode pipeline."
I guess my question becomes, how does this "explicitly report progress to the Hadoop framework" as stated by the documentation of Progressable?

WrongValueClass in apache Mahout

I have written a MapReduce program using Mahout. The map output value is ClusterWritable. When I run the code in Eclipse, it runs with no error, but when I run the JAR file in the terminal, it shows the exception:
java.io.IOException: wrong value class: org.apache.mahout.math.VectorWritable is not class org.apache.mahout.clustering.iterator.ClusterWritable
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:988)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:74)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:498)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.mahout.clustering.canopy.CanopyMapper.cleanup(CanopyMapper.java:59)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
The output code in map is:
context.write(new Text(), new ClusterWritable());
but I don't know why it says that the value type is VectorWritable.
The mapper being run, which results in the stack trace above, is Mahout's CanopyMapper, not a custom one you've written.
The CanopyMapper.cleanup method outputs (key: Text, value: VectorWritable).
See CanopyMapper.java
See also CanopyDriver.java and its buildClustersMR method, where the MR job is configured with the mapper, reducer, and the appropriate output key/value classes.
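For comparison, this is roughly what a driver that really does emit ClusterWritable values would configure. It is a hedged, map-only sketch with a trivial stand-in mapper (not the actual Mahout driver code), but it shows the output key/value class settings that must agree with what the mapper writes:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.clustering.iterator.ClusterWritable;

public class MyClusterDriver {

    // Trivial stand-in mapper: emits an (empty) ClusterWritable per input line,
    // just to show the types. Your real mapper does the actual work.
    public static class MyClusterMapper extends Mapper<Object, Text, Text, ClusterWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(), new ClusterWritable());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my-clustering-job");
        job.setJarByClass(MyClusterDriver.class);

        job.setMapperClass(MyClusterMapper.class);
        job.setNumReduceTasks(0); // map-only, to keep the sketch short

        // These must match what the mapper actually writes via context.write(...);
        // a mismatch here is exactly what produces "wrong value class" at runtime.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(ClusterWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}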
You didn't state it, so I'm guessing that you're using more than one MR job in a data flow pipeline. Check that the outputs of each job in the pipeline are valid/expected inputs for the next job in the pipeline. Consider using Cascading/Scalding to define your data flow (see http://www.slideshare.net/melrief/scalding-programming-model-for-hadoop).
Consider using the Mahout user mailing list to post Mahout-related questions.

Getting all output data from the console when running a process with Apache Commons Exec

The thing is... I'm running a process with the DefaultExecutor class of the org.apache.commons.exec library, like this:
public class Main {
    public static void main(String[] args) throws IOException, InterruptedException {
        CommandLine cmd = new CommandLine("java");
        DefaultExecutor exec = new DefaultExecutor();
        exec.setExitValue(1);
        exec.execute(cmd);
    }
}
I need to capture that output "on the run" with another thread, to log it elsewhere. What is the best way of accomplishing that?
Use a PipedOutputStream and a PipedInputStream. You can find an example here. Don't forget to close your streams.
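A minimal sketch of that idea, assuming Commons Exec's PumpStreamHandler for the redirection (the java command and the System.out "logging" are placeholders): the process output is pumped into a PipedOutputStream, and a second thread reads the matching PipedInputStream line by line.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

import org.apache.commons.exec.CommandLine;
import org.apache.commons.exec.DefaultExecutor;
import org.apache.commons.exec.PumpStreamHandler;

public class Main {
    public static void main(String[] args) throws IOException, InterruptedException {
        PipedOutputStream processOut = new PipedOutputStream();
        PipedInputStream logIn = new PipedInputStream(processOut);

        // Reader thread: consumes the process output line by line and "logs" it.
        Thread logger = new Thread(() -> {
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(logIn))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println("LOG: " + line); // replace with your real logger
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        logger.start();

        CommandLine cmd = new CommandLine("java"); // prints its usage text and exits with 1
        DefaultExecutor exec = new DefaultExecutor();
        exec.setExitValue(1);
        // Redirect both stdout and stderr of the child process into the pipe.
        exec.setStreamHandler(new PumpStreamHandler(processOut));

        try {
            exec.execute(cmd);
        } finally {
            processOut.close(); // lets the reader thread see end-of-stream
        }
        logger.join();
    }
}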
You should probably look at log4j, a rather useful project from Apache. In a project I was recently working on, log4j was used to put all of the logs from various threads into one convenient file. Just make sure that you construct the logger in such a way that only one instance of it is available, and this should solve your problem.
Unfortunately, I was only an intern, and was not present when the team set up the logging system, so I can't actually help you with configuration. Luckily the project's website appears to have plenty of documentation to help you out.
