WrongValueClass in Apache Mahout - Java

I have written a MapReduce program using Mahout. The map output value is ClusterWritable. When I run the code in Eclipse, it runs with no errors, but when I run the jar file in the terminal, it throws this exception:
java.io.IOException: wrong value class: org.apache.mahout.math.VectorWritable is not class org.apache.mahout.clustering.iterator.ClusterWritable
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:988)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:74)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:498)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.mahout.clustering.canopy.CanopyMapper.cleanup(CanopyMapper.java:59)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
The output call in my mapper is:
context.write(new Text(), new ClusterWritable());
but I don't know why it says that the value type is VectorWritable.

The mapper being run, which produces the stack trace above, is Mahout's CanopyMapper, not the custom one you've written.
CanopyMapper's cleanup method outputs (key: Text, value: VectorWritable).
See CanopyMapper.java.
See also CanopyDriver.java and its buildClustersMR method, where the MR job is configured with its mapper, reducer, and appropriate output key/value classes.
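To make the mismatch concrete, here is a minimal, hypothetical driver excerpt (not Mahout's actual code): SequenceFileOutputFormat hands the job's configured output value class to SequenceFile.Writer, so if the mapper that actually runs emits a different Writable than the one configured, append() fails with exactly the "wrong value class" IOException shown above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.clustering.iterator.ClusterWritable;

// Hypothetical driver excerpt for illustration only.
public class ClusteringJobConfig {
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "clustering-step");
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        // This class is what SequenceFile.Writer enforces on append(); it must
        // match what the mapper that actually runs writes. If that mapper is
        // CanopyMapper (which writes VectorWritable in cleanup()), configuring
        // ClusterWritable here reproduces the exception above.
        job.setOutputValueClass(ClusterWritable.class);
        return job;
    }
}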
You didn't say, so I'm guessing that you're using more than one MR job in a data-flow pipeline. Check that the output of each job in the pipeline is valid/expected input for the next job. Consider using Cascading/Scalding to define your data flow (see http://www.slideshare.net/melrief/scalding-programming-model-for-hadoop).
Consider using the Mahout user mailing list to post Mahout-related questions.

Related

NetLogo API Controller - Get Table View

I am using the NetLogo API controller with Spring Boot.
This is my code (I got it from this link):
HeadlessWorkspace workspace = HeadlessWorkspace.newInstance();
try {
    workspace.open("models/Residential_Solar_PV_Adoption.nlogo", true);
    workspace.command("set number-of-residences 900");
    workspace.command("set %-similar-wanted 7");
    workspace.command("set count-years-simulated 14");
    workspace.command("set number-of-residences 500");
    workspace.command("set carbon-tax 13.7");
    workspace.command("setup");
    workspace.command("repeat 10 [ go ]");
    workspace.command("reset-ticks");
    workspace.dispose();
}
catch (Exception ex) {
    ex.printStackTrace();
}
I got this result in the console:
But I want to get the table view and save it to a database. Which command can I use to get the table view?
Table view:
Any help, please?
If you can clarify why you're trying to generate the data this way, I or others might be able to give better advice.
There is no single NetLogo command or NetLogo API method to generate that table; you have to use BehaviorSpace to get it. Here are some options, listed roughly from simplest to hardest.
Option 1
If possible, I'd recommend just running BehaviorSpace experiments from the command line to generate your table. This will get you exactly the output you're looking for. You can find information on how to do that in the NetLogo manual's BehaviorSpace guide. If necessary, you can run NetLogo headless from the command line from within a Java program; just look for resources on calling out to external programs from Java, maybe with ProcessBuilder, as sketched below.
If you're running from within Java in order to set up and change the parameters of your BehaviorSpace experiments in a way that you cannot do from within the program, you could instead generate experiment XML files in Java to pass to NetLogo at the command line. See the docs on the XML format.
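For example, here's a rough sketch of kicking off a headless BehaviorSpace run from Java with ProcessBuilder; the jar name, model path, and experiment name are assumptions, so check the BehaviorSpace guide for the exact flags and classpath of your NetLogo version:
import java.io.IOException;

// Rough sketch: shells out to NetLogo's headless BehaviorSpace runner.
// "netlogo-6.2.0.jar", the model path, and "my-experiment" are placeholders.
public class RunBehaviorSpace {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "java", "-Xmx1024m", "-Dfile.encoding=UTF-8",
                "-cp", "netlogo-6.2.0.jar",
                "org.nlogo.headless.Main",
                "--model", "models/Residential_Solar_PV_Adoption.nlogo",
                "--experiment", "my-experiment",
                "--table", "results-table.csv");
        pb.inheritIO();                       // stream NetLogo's output to this console
        int exitCode = pb.start().waitFor();
        System.out.println("NetLogo exited with code " + exitCode);
    }
}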
Option 2
You can recreate the contents of the table by using the CSV extension in your model and adding a few more commands to generate the data. This will not create exactly the same table, but it will get your data out in a computer- and human-readable format.
In pure NetLogo code, you'd want something like the code below. Note that you can control more of the behavior (like file names or the desired variables) by running other pre-experiment commands before running setup or go in your Java code. You could also run the CSV-specific file code from Java using the controlling API and leave the model unchanged, but you'll need to write your own NetLogo code version of the csv:to-row primitive; a rough Java-side sketch follows the NetLogo code.
extensions [ csv ]

globals [
  ;; your model globals here
  output-variables
]

to setup
  clear-all
  ;;; your model setup code here
  file-open "my-output.csv"
  ; the given variables should be valid reporters for the NetLogo model
  set output-variables [ "ticks" "current-price" "number-of-residences" "count-years-simulated" "solar-PV-cost" "%-lows" "k" ]
  file-print csv:to-row output-variables
  reset-ticks
end

to go
  ;;; the rest of your model code here
  file-print csv:to-row map [ v -> runresult v ] output-variables
  file-flush
  tick
end
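If you'd rather leave the model untouched, here is a rough Java-side sketch of that alternative: pull each reporter's value with workspace.report and write a naive comma-separated row yourself. The reporter names and output file are assumptions, and this is not a faithful csv:to-row replacement (it does no quoting or escaping):
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;
import org.nlogo.headless.HeadlessWorkspace;

// Rough sketch: drives the model from Java and writes one CSV row per tick.
public class ExportRows {
    static final List<String> REPORTERS =
            Arrays.asList("ticks", "number-of-residences", "carbon-tax");  // assumed reporters

    public static void main(String[] args) throws Exception {
        HeadlessWorkspace workspace = HeadlessWorkspace.newInstance();
        try (PrintWriter out = new PrintWriter(new FileWriter("my-output.csv"))) {
            workspace.open("models/Residential_Solar_PV_Adoption.nlogo", true);
            workspace.command("setup");
            out.println(String.join(",", REPORTERS));          // header row
            for (int i = 0; i < 10; i++) {
                workspace.command("go");
                StringBuilder row = new StringBuilder();
                for (String reporter : REPORTERS) {
                    if (row.length() > 0) row.append(",");
                    row.append(workspace.report(reporter));    // one column per reporter
                }
                out.println(row);                              // one data row per tick
            }
        } finally {
            workspace.dispose();
        }
    }
}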
Option 3
If you really need to reproduce the BehaviorSpace table export exactly, you can try to run a BehaviorSpace experiment directly from Java. The table is generated by this code, but as you can see it's tied in with the LabProtocol class, meaning you'll have to set up and run your model through BehaviorSpace instead of just step-by-step using a workspace as you've done in your sample code.
A good example of this might be the Main.scala object, which extracts some experiment settings from the expected command-line arguments and then uses them with the lab.run() method to run the BehaviorSpace experiment and generate the output. That's Scala code and not Java, but hopefully it isn't too hard to translate. You'd similarly have to set up an org.nlogo.nvm.LabInterface.Settings instance and pass that off to HeadlessWorkspace.newLab.run() to get things going.

How to see the spark-launcher command being submitted when using org.apache.spark.launcher.SparkLauncher?

Is there a way to see the spark-launcher CLI command that will be submitted to Spark when using the following line in Java to launch an already configured SparkLauncher?
new SparkLauncher().launch();
Edit: the createBuilder method within SparkLauncher.class seems to have it, as a list, during the .launch() process.
Now it's a matter of finding a way to extract that information from the point where new SparkLauncher().launch() is being submitted.
Given that you define your SparkLauncher as launcher,
you can get the spark-submit command this way:
String sparkSubmitCommand = StringUtils.join(launcher.createBuilder().command(), " ");
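For context, a fuller sketch of how that might look; the app resource, main class, and master below are placeholders, and note that createBuilder() may not be public in your Spark version, in which case you'd need to put this code in the org.apache.spark.launcher package or fall back to reflection:
import java.util.List;
import org.apache.spark.launcher.SparkLauncher;

// Sketch only: prints the command the launcher would run, then launches it.
SparkLauncher launcher = new SparkLauncher()
        .setAppResource("/path/to/my-spark-app.jar")   // placeholder jar
        .setMainClass("com.example.MySparkApp")        // placeholder main class
        .setMaster("yarn");                            // placeholder master

// createBuilder() returns the ProcessBuilder the launcher would use;
// String.join avoids the commons-lang dependency used above.
List<String> command = launcher.createBuilder().command();
System.out.println("spark-submit command: " + String.join(" ", command));

Process spark = launcher.launch();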

How does Hadoop actually accept MR jobs and input data?

All of the introductory tutorials and docs that I can find on Hadoop have simple/contrived (word count-style) examples, each of which is submitted to MR by:
SSHing into the JobTracker node
Making sure that a JAR file containing the MR job is on HDFS
Running a command of the form bin/hadoop jar share/hadoop/mapreduce/my-map-reduce.jar <someArgs> that actually runs the Hadoop/MR job
Either reading the MR result from the command-line or opening a text file containing the result
Although these examples are great for showing total newbies how to work with Hadoop, they don't show me how Java code actually integrates with Hadoop/MR at the API level. I guess I am sort of expecting that:
Hadoop exposes some kind of client access/API for submitting MR jobs to the cluster
Once the jobs are complete, some asynchronous mechanism (callback, listener, etc.) reports the result back to the client
So, something like this (Groovy pseudo-code):
class Driver {
    static void main(String[] args) {
        new Driver().run(args)
    }

    void run(String[] args) {
        MapReduceJob myBigDataComputation = new SolveTheMeaningOfLifeJob(convertToHadoopInputs(args), new MapReduceCallback() {
            @Override
            void onResult() {
                // Now that you know the meaning of life, do nothing.
            }
        })
        HadoopClusterClient hadoopClient = new HadoopClusterClient("http://my-hadoop.example.com/jobtracker")
        hadoopClient.submit(myBigDataComputation)
    }
}
So I ask: surely the simple examples in all the introductory tutorials, where you SSH into nodes, run Hadoop from the CLI, and open text files to view the results, can't be the way Big Data companies actually integrate with Hadoop. Surely, something along the lines of my pseudo-code snippet above is used to kick off an MR job and fetch its results. What is it?
In one word: kicking off an MR job can be done using the Oozie scheduler. But before that, you write a MapReduce job. It has a driver class, which is the starting point of the job. In the driver class you give all the information needed for the job to run: the map input, the mapper class, any partitioners, configuration details, and the reducer details.
Once these are in the jar file and you start a job as above (hadoop jar) using the CLI (in reality, Oozie does it), the rest is taken care of by the Hadoop ecosystem. I hope that answers your question.
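For reference, here is a hedged sketch of what such a driver class typically looks like; the identity Mapper/Reducer and the argument-based paths are placeholders rather than anything from the original answer:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a driver: configures the job and submits it to the cluster.
public class MyDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "my-job");
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(Mapper.class);        // replace with your own Mapper
        job.setReducerClass(Reducer.class);      // replace with your own Reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion blocks until the job finishes; job.submit() is the
        // non-blocking alternative if you want to poll for completion yourself.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}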

Mapreduce - sequence jobs?

I am using MapReduce (just map, really) to do a data processing task in four phases. Each phase is one MapReduce job. I need them to run in sequence, that is, don't start phase 2 until phase 1 is done, etc. Does anyone have experience doing this that can share?
Ideally we'd do this 4-job sequence overnight, so making it cron-able would be a fine thing as well.
Thank you.
As Daniel mentions, the appengine-pipeline library is meant to solve this problem. I go over chaining mapreduce jobs together in this blog post, under the section "Implementing your own Pipeline jobs".
For convenience, I'll paste the relevant section here:
Now that we know how to launch the predefined MapreducePipeline, let’s take a look at implementing and running our own custom pipeline jobs. The pipeline library provides a low-level library for launching arbitrary distributed computing jobs within appengine, but, for now, we’ll talk specifically about how we can use this to help us chain mapreduce jobs together. Let’s extend our previous example to also output a reverse index of characters and IDs.
First, we define the parent pipeline job.
class ChainMapReducePipeline(mapreduce.base_handler.PipelineBase):
    def run(self):
        deduped_blob_key = (
            yield mapreduce.mapreduce_pipeline.MapreducePipeline(
                "test_combiner",
                "main.map",
                "main.reduce",
                "mapreduce.input_readers.RandomStringInputReader",
                "mapreduce.output_writers.BlobstoreOutputWriter",
                combiner_spec="main.combine",
                mapper_params={
                    "string_length": 1,
                    "count": 500,
                },
                reducer_params={
                    "mime_type": "text/plain",
                },
                shards=16))

        char_to_id_index_blob_key = (
            yield mapreduce.mapreduce_pipeline.MapreducePipeline(
                "test_chain",
                "main.map2",
                "main.reduce2",
                "mapreduce.input_readers.BlobstoreLineInputReader",
                "mapreduce.output_writers.BlobstoreOutputWriter",
                # Pass output from first job as input to second job
                mapper_params=(yield BlobKeys(deduped_blob_key)),
                reducer_params={
                    "mime_type": "text/plain",
                },
                shards=4))
This launches the same job as the first example, takes the output from that job, and feeds it into the second job, which reverses each entry. Notice that the result of the first pipeline yield is passed in to mapper_params of the second job. The pipeline library uses magic to detect that the second pipeline depends on the first one finishing and does not launch it until the deduped_blob_key has resolved.
Next, I had to create the BlobKeys helper class. At first, I didn’t think this was necessary, since I could just do:
mapper_params={"blob_keys": deduped_blob_key},
But this didn’t work for two reasons. The first is that “generator pipelines cannot directly access the outputs of the child Pipelines that it yields”. The code above would require the generator pipeline to create a temporary dict object with the output of the first job, which is not allowed. The second is that the string returned by BlobstoreOutputWriter is of the format “/blobstore/<key>”, but BlobstoreLineInputReader expects simply “<key>”. To solve these problems, I made a little helper BlobKeys class. You’ll find yourself doing this for many jobs, and the pipeline library even includes a set of common wrappers, but they do not work within the MapreducePipeline framework, which I discuss at the bottom of this section.
class BlobKeys(third_party.mapreduce.base_handler.PipelineBase):
    """Returns a dictionary with the supplied keyword arguments."""
    def run(self, keys):
        # Remove the key from a string in this format:
        # /blobstore/<key>
        return {
            "blob_keys": [k.split("/")[-1] for k in keys]
        }
Here is the code for the map2 and reduce2 functions:
def map2(data):
    # BlobstoreLineInputReader.next() returns a tuple
    start_position, line = data
    # Split input based on previous reduce() output format
    elements = line.split(" - ")
    random_id = elements[0]
    char = elements[1]
    # Swap 'em
    yield (char, random_id)

def reduce2(key, values):
    # Create the reverse index entry
    yield "%s - %s\n" % (key, ",".join(values))
I'm unfamiliar with Google App Engine, but couldn't you put all of the job configurations in a single main program and then run them in sequence, something like the following? I think this works in normal MapReduce programs, so if the Google App Engine code isn't too different it should work fine.
Configuration conf1 = getConf();
Configuration conf2 = getConf();
Configuration conf3 = getConf();
Configuration conf4 = getConf();
//whatever configuration you do for the jobs
Job job1 = new Job(conf1,"name1");
Job job2 = new Job(conf2,"name2");
Job job3 = new Job(conf3,"name3");
Job job4 = new Job(conf4,"name4");
//setup for the jobs here
job1.waitForCompletion(true);
job2.waitForCompletion(true);
job3.waitForCompletion(true);
job4.waitForCompletion(true);
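One refinement, which is my addition rather than part of the answer above: waitForCompletion returns a boolean, so you can stop the chain as soon as an earlier phase fails, which gives exactly the phase-1-before-phase-2 ordering the question asks for:
// Each call blocks until that job finishes, so the phases run strictly in order.
if (!job1.waitForCompletion(true)) System.exit(1);
if (!job2.waitForCompletion(true)) System.exit(1);
if (!job3.waitForCompletion(true)) System.exit(1);
System.exit(job4.waitForCompletion(true) ? 0 : 1);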
You need the appengine-pipeline project, which is meant for exactly this.

Using FileInputFormat.addInputPaths to recursively add HDFS path

I've got an HDFS structure something like
a/b/file1.gz
a/b/file2.gz
a/c/file3.gz
a/c/file4.gz
I'm using the classic pattern of
FileInputFormat.addInputPaths(conf, args[0]);
to set my input path for a Java MapReduce job.
This works fine if I specify args[0] as a/b, but it fails if I specify just a (my intention being to process all 4 files), the error being:
Exception in thread "main" java.io.IOException: Not a file: hdfs://host:9000/user/hadoop/a
How do I recursively add everything under a ?
I must be missing something simple...
As Eitan Illuz mentioned here, in Hadoop 2.4.0 a mapreduce.input.fileinputformat.input.dir.recursive configuration property was introduced that, when set to true, instructs the input format to include files recursively.
In Java code it looks like this:
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);
Job job = Job.getInstance(conf);
// etc.
I've been using this new property and find that it works well.
EDIT: Better yet, use this new method on FileInputFormat that achieves the same result:
Job job = Job.getInstance();
FileInputFormat.setInputDirRecursive(job, true);
This is a bug in the current version of Hadoop. Here is the JIRA for it. It's still in the open state. Either make the changes in the code and build the binaries yourself, or wait for it to be fixed in a coming release. Recursive processing of files can be turned on/off; check the patch attached to the JIRA for more details.
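Until you're on a version with the fix, one possible workaround (my own sketch, using only standard FileSystem and FileInputFormat calls) is to walk the tree yourself and add every directory that actually contains files as an input path:
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: for the question's layout, calling this with new Path("a") would
// add a/b and a/c as input paths, covering all four .gz files.
public class RecursiveInputPaths {
    public static void addInputPathsRecursively(Job job, Path root) throws IOException {
        FileSystem fs = root.getFileSystem(job.getConfiguration());
        boolean hasFiles = false;
        for (FileStatus status : fs.listStatus(root)) {
            if (status.isDirectory()) {
                addInputPathsRecursively(job, status.getPath());
            } else {
                hasFiles = true;
            }
        }
        if (hasFiles) {
            FileInputFormat.addInputPath(job, root);   // only add dirs that hold files
        }
    }
}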
