Using the org.apache.hadoop.util.Progressable interface - Java

Can someone provide an example of how the Progressable interface might be implemented for use when calling FileSystem.create()? I saw the following code snippet in another post, but it did not show where bytesWritten came from:
OutputStream os = hdfs.create(file,
    new Progressable() {
        public void progress() {
            out.println("...bytes written: [ " + bytesWritten + " ]");
        }
    });
The documentation of this interface says it is for reporting progress to the Hadoop framework to avoid timeout in the case of a lengthy operation, but "Hadoop: The Definitive Guide" says it is for notifying the application of the progress of the data being written to the data nodes, which doesn't make much sense since it is a create.
Thanks, RF

If you have an implementation of Mapper where an invocation of map() may take a long time (more than several minutes, say), then you can periodically call progress() on the provided context object to let Hadoop know that your code isn't hung. That's what they mean by "explicitly reporting progress": it works when you're using an object provided by the framework that implements Progressable; it obviously doesn't work that way when you write your own implementation of Progressable.
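For example, here is a minimal sketch (mine, not from the answer above) of a Mapper whose map() does lengthy work per record and periodically calls context.progress(); the LongRunningMapper class name, the loop bound, and the slowTransform() helper are hypothetical placeholders:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LongRunningMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (int i = 0; i < 1000; i++) {
            Text result = slowTransform(value, i); // stand-in for a slow computation
            context.progress();                    // tell the framework the task is still alive
            context.write(new Text(key.toString()), result);
        }
    }

    private Text slowTransform(Text value, int i) {
        return value; // placeholder
    }
}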

I should have read the Hadoop book further -- here is the example they gave later on:
OutputStream out = fs.create(new Path(dst), new Progressable() {
    public void progress() {
        System.out.print(".");
    }
});
The accompanying text says "We illustrate progress by printing a period every time the progress() method is called by Hadoop, which is after each 64 KB packet of data is written to the datanode pipeline."
I guess my question becomes, how does this "explicitly report progress to the Hadoop framework" as stated by the documentation of Progressable?
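For reference, the book's snippet expands to something like the following self-contained program (a sketch along the lines of the book's FileCopyWithProgress example; the argument handling and the 4096-byte buffer size are placeholders):
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0]; // local file to upload
        String dst = args[1];      // destination URI, e.g. hdfs://namenode/user/rf/file

        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print("."); // invoked by the HDFS client as data is written
            }
        });
        IOUtils.copyBytes(in, out, 4096, true); // copy and close both streams
    }
}
As the quoted text suggests, it is the HDFS client rather than your own code that invokes progress() as the write advances, so the callback is simply a hook through which the application (or framework) is told that the lengthy write is still making progress.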

Related

Monitor progress and intermediate results in Spark

I have a simple Spark task, something like this:
JavaRDD<Solution> solutions = rdd.map(new Solve());
// Select best solution by some criteria
The solve routine takes some time. For a demo application, I need to get some property of each solution as soon as it is calculated, before the call to rdd.map terminates.
I've tried using accumulators and a SparkListener, overriding the onTaskEnd method, but it seems to be called only at the end of the mapping, not per thread, e.g.:
sparkContext.sc().addSparkListener(new SparkListener() {
    @Override
    public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
        // do something with taskEnd.taskInfo().accumulables()
    }
});
How can I get an asynchronous message for each map function end?
Spark runs locally or in a standalone cluster mode.
Answers can be in Java or Scala, both are OK.

How does Hadoop actually accept MR jobs and input data?

All of the introductory tutorials and docs that I can find on Hadoop have simple/contrived (word count-style) examples, each of which is submitted to MR by:
SSHing into the JobTracker node
Making sure that a JAR file containing the MR job is on HDFS
Running a command of the form bin/hadoop jar share/hadoop/mapreduce/my-map-reduce.jar <someArgs> that actually runs the Hadoop/MR job
Either reading the MR result from the command-line or opening a text file containing the result
Although these examples are great for showing total newbies how to work with Hadoop, they don't show me how Java code actually integrates with Hadoop/MR at the API level. I guess I am sort of expecting that:
Hadoop exposes some kind of client access/API for submitting MR jobs to the cluster
Once the jobs are complete, some asynchronous mechanism (callback, listener, etc.) reports the result back to the client
So, something like this (Groovy pseudo-code):
class Driver {
    static void main(String[] args) {
        new Driver().run(args)
    }

    void run(String[] args) {
        MapReduceJob myBigDataComputation = new SolveTheMeaningOfLifeJob(convertToHadoopInputs(args), new MapReduceCallback() {
            @Override
            void onResult() {
                // Now that you know the meaning of life, do nothing.
            }
        })

        HadoopClusterClient hadoopClient = new HadoopClusterClient("http://my-hadoop.example.com/jobtracker")
        hadoopClient.submit(myBigDataComputation)
    }
}
So I ask: Surely the simple examples in all the introductory tutorials, where you SSH into nodes and run Hadoop from the CLI, and open text files to view its results...surely that can't be the way Big Data companies actually integrate with Hadoop. Surely, something along the lines of my pseudo-code snippet above is used to kick off an MR job and fetch its results. What is it?
In a word, kicking off an MR job can be done using the Oozie scheduler. But before that, you write a MapReduce job. It has a driver class, which is the starting point of the job. In the driver class you give all the information needed for the job to run: the map input, the mapper class, any partitioners, configuration details, and the reducer details.
Once these are packaged in a jar file and you start the job as above (hadoop jar) from the CLI (in reality Oozie does it), the rest is taken care of by the Hadoop ecosystem. Hope I answered your question.
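For illustration, here is a minimal sketch of such a driver class using the standard org.apache.hadoop.mapreduce.Job API; MyMapper and MyReducer are hypothetical placeholders for your own mapper and reducer classes:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(MyMapper.class);    // your Mapper implementation (hypothetical name)
        job.setReducerClass(MyReducer.class);  // your Reducer implementation (hypothetical name)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion(true) blocks and reports progress; for asynchronous use,
        // call job.submit() instead and poll job.isComplete() / job.isSuccessful().
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
The jar containing this driver is what the hadoop jar command (or Oozie) launches, and the client-side submission API the question asks about is essentially this Job object.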

Get events from OS

I work on Windows but I am stuck here on Mac. I have the Canon SDK and have built a JNA wrapper over it. It works well on Windows and I need some help with Mac.
In the SDK, there is a function with which one can register a callback. Basically, when an event occurs in the camera, the SDK calls the callback function.
On Windows, after registering, I use User32 to fetch and dispatch the event:
private static final User32 lib = User32.INSTANCE;

WinUser.MSG msg = new WinUser.MSG(); // message structure (declaration not shown in the original snippet)
boolean hasMessage = lib.PeekMessage(msg, null, 0, 0, 1); // peek and remove
if (hasMessage) {
    lib.TranslateMessage(msg);
    lib.DispatchMessage(msg); // message gets dispatched and hence the callback function is called
}
In the JNA API, I cannot find a similar class for Mac. How do I go about this one?
PS: The JNA API for Unix is extensive and I could not figure out what to look for. The reference might help.
This solution uses the Carbon framework. Carbon is deprecated and I am not aware of any other alternative solution, but the code below works like a charm.
Finally I found the solution using the Carbon framework. Here is my MCarbon interface, which defines the calls I need:
public interface MCarbon extends Library {
    MCarbon INSTANCE = (MCarbon) Native.loadLibrary("Carbon", MCarbon.class);

    Pointer GetCurrentEventQueue();
    int SendEventToEventTarget(Pointer inEvent, Pointer intarget);
    int RemoveEventFromQueue(Pointer inQueue, Pointer inEvent);
    void ReleaseEvent(Pointer inEvent);
    Pointer AcquireFirstMatchingEventInQueue(Pointer inQueue, NativeLong inNumTypes, EventTypeSpec[] inList, NativeLong inOptions);
    // ... and so on
}
The problem is solved using the function below:
NativeLong ReceiveNextEvent(NativeLong inNumTypes, EventTypeSpec[] inList, double inTimeout, byte inPullEvent, Pointer outEvent);
This does the job. As per the documentation:
This routine tries to fetch the next event of a specified type. If no events in the event queue match, this routine will run the current event loop until an event that matches arrives, or the timeout expires. Except for timers firing, your application is blocked waiting for events to arrive when inside this function.
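For what it's worth, here is a minimal sketch (my own, not part of the original answer) of how this could be wired up, assuming the ReceiveNextEvent declaration above has been added to the MCarbon interface. The one-second timeout, the 0/null "match any event" arguments, the pointer-sized out-buffer, and the CarbonEventPump class name are all assumptions:
import com.sun.jna.Memory;
import com.sun.jna.Native;
import com.sun.jna.NativeLong;
import com.sun.jna.Pointer;

public class CarbonEventPump implements Runnable {
    private volatile boolean running = true;

    public void stop() { running = false; }

    @Override
    public void run() {
        MCarbon carbon = MCarbon.INSTANCE;
        Memory eventRefOut = new Memory(Native.POINTER_SIZE); // receives the returned EventRef
        while (running) {
            // 0/null = match any event type; 1.0 = wait up to one second;
            // (byte) 1 = pull the event off the queue once it arrives
            NativeLong status = carbon.ReceiveNextEvent(new NativeLong(0), null, 1.0, (byte) 1, eventRefOut);
            if (status.intValue() == 0) {             // noErr
                Pointer eventRef = eventRefOut.getPointer(0);
                carbon.ReleaseEvent(eventRef);        // running the loop is what lets the SDK callback fire
            }
        }
    }
}
Whether the callback may fire on a background thread or must run on the main thread depends on the SDK, so treat this strictly as a starting point.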
Also, if not ReceiveNextEvent, then the other functions declared in the MCarbon interface above could be used instead.
I think the Carbon framework documentation would give more insight and flexibility for solving the problem. Apart from Carbon, people in forums have mentioned solving this with Cocoa, but I am not aware of a working example.
Edit: Thanks to technomarge, more information here

Why is my Com4J interface hanging during iteration?

I have to interface a third-party COM API into a Java application. So I decided to use Com4j, and so far I've been satisfied, but now I've run into a problem.
After running the tlbgen I have an object called IAddressCollection which, according to the original API documentation, conforms to the IEnum interface definition. The object provides an iterator() function that returns a java.util.Iterator<Com4jObject>. The collection comes from another object, IMessage, when I want to find all the addresses for the message. So I would expect the code to work like this:
IAddressCollection adrCol = IMessage.getAddressees();
Iterator<Com4jObject> adrItr = adrCol.iterator();
while (adrItr.hasNext()) {
    Com4jObject adrC4j = adrItr.next();
    // normally here I would handle the queryInterface
    // and work with the rest of the API
}
My problem is that when I call adrItr.next(), nothing happens; the code simply hangs. No exception is thrown and I usually have to kill the process through the task manager. So I'm wondering: is this a common problem with Com4j, am I handling this wrong, or is it possibly a problem with the API?
OK, I hate answering my own question, but in this case I found the problem. The issue was the underlying API: IAddressCollection uses 1-based indexing instead of the 0-based indexing I expected, and the API documentation doesn't mention this. There is an item() function with which I can pull each object directly, so I can handle it like this:
IAddressCollection adrCol = IMessage.getAddressees();
for (int i = 1; i <= adrCol.count(); i++) {
    IAddress adr = adrCol.item(i);
    // IAddress is the actual interface that I wanted and this works
}
So sorry for the annoyance on this.

Suppressing console (disp) output for compiled MATLAB > Java program (using MATLAB Compiler)

Edit 2: After receiving a response from MathWorks support I've answered the question myself. In brief, there is an options class, MWComponentOptions, that is passed to the exported class when it is instantiated. This can, among other things, specify separate print streams for error output and regular output (i.e. from disp()-like functions). Thanks for all the responses nonetheless :)
====================================================================
Just a quick question - is there any way to prevent MATLAB code from outputting to the Java console with disp (and similar) functions once compiled? What is useful debugging information in MATLAB quickly becomes annoying extra text in the Java logs.
The compilation tool I'm using is MATLAB Compiler (which I think is not the same as MATLAB Builder JA, but I might be wrong). I can't find any good documentation on the mcc command, so I am not sure if there are any options for this.
Of course if this is impossible and a direct consequence of the compiler converting all MATLAB code to its Java equivalent then that's completely understandable.
Thanks in advance
Edit: This would also be useful for handling error reporting on the Java side alone - currently all MATLAB errors are sent to the console regardless of whether they are caught or not.
The isdeployed function returns true when running in a deployed application (e.g. one built with MATLAB Compiler or Builder JA) and false when running in live MATLAB.
You can surround your disp statements with an if ~isdeployed block so they only execute in live MATLAB.
I heard back from MathWorks support, and they provided the following solution:
When creating whatever class has been exported, you can specify an MWComponentOptions object. This is poorly documented in R2012b, but for what I wanted the following example works:
MWComponentOptions options = new MWComponentOptions();
PrintStream o = new PrintStream(new File("MATLAB log.log"));
options.setPrintStream(o); // send all standard disp() output to a log file
// the following ignores all error output (this will be caught by Java exception handling anyway)
options.setErrorStream((java.io.PrintStream)null);
// instantiate and use the exported class
myClass obj = new myClass(options);
obj.myMatlabFunction();
// etc...
Update
In case anyone does want to suppress all output, casting null to java.io.PrintStream ended up causing a NullPointerException in deployment. A better way to suppress all output is to create a dummy print stream, something like:
PrintStream dummy = new PrintStream(new OutputStream() {
    public void close() {}
    public void flush() {}
    public void write(byte[] b) {}
    public void write(byte[] b, int off, int len) {}
    public void write(int b) {}
});
Then use
options.setErrorStream(dummy);
Hope this helps :)
Another possible hack if you have a stand-alone application and don't want to bother with classes at all:
Use evalc inside a wrapper function and deploy the wrapper when you compile:
function my_wrap()
evalc('my_orig_func(''input_var'')');
end
And compile like
mcc -m my_wrap my_orig_func <...>
Well, it is obviously yet another hack.
