I have a simple Spark task, something like this:
JavaRDD<Solution> solutions = rdd.map(new Solve());
// Select best solution by some criteria
The solve routine takes some time. For a demo application, I need to get some property of each solution as soon as it is calculated, before the call to rdd.map terminates.
I've tried using accumulators and a SparkListener, overriding the onTaskEnd method, but it seems to be called only at the end of the mapping, not per thread. E.g.:
sparkContext.sc().addSparkListener(new SparkListener() {
    public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
        // do something with taskEnd.taskInfo().accumulables()
    }
});
How can I get an asynchronous message for each map function end?
Spark runs locally or in a standalone cluster mode.
Answers can be in Java or Scala, both are OK.
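For illustration, the kind of per-result side channel I have in mind would look something like this, where NotificationClient is a made-up stand-in for whatever the demo listens on (a socket, a queue), not a Spark API:

JavaRDD<Solution> solutions = rdd.map(problem -> {
    Solution solution = new Solve().call(problem);
    // hypothetical side channel back to the demo; runs on the executor
    NotificationClient.send(solution.getScore());
    return solution;
});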
I have a following flow that I would like to implement using Spring Integration Java DSL:
Poll a table in a database every 2 hours, which returns the ids of documents that need to be processed
For each id, process a document through an HTTP gateway
Store a response in a database
I have working Java code that does exactly these steps. An additional requirement that I'm struggling with is that polling for the next round of documents shouldn't happen until all the documents from the last poll have been processed and stored in the database.
Is there any pattern in Spring Integration that I could use for this additional requirement?
Here is simplified code - it will get more complex, and I'll split the processing of the documents (HTTP outbound and persisting) into separate classes / flows:
return IntegrationFlows.from(Jpa.inboundAdapter(this.targetEntityManagerFactory)
                .entityClass(ProcessingMetadata.class)
                .jpaQuery("select max(p.modifiedDate) from ProcessingMetadata p " +
                        "where p.status = com.test.ProcessingStatus.PROCESSED")
                .maxResults(1)
                .expectSingleResult(true),
        e -> e.poller(Pollers.fixedDelay(Duration.ofSeconds(10))))
        .handle(Jpa.retrievingGateway(this.sourceEntityManagerFactory)
                .entityClass(DocumentHeader.class)
                .jpaQuery("from DocumentHeader d where d.modified > :modified")
                .parameterExpression("modified", "payload"))
        .handle(Http.outboundGateway(uri)
                .httpMethod(HttpMethod.POST)
                .expectedResponseType(String.class))
        .handle(Jpa.outboundAdapter(this.targetEntityManagerFactory)
                .entityClass(ProcessingMetadata.class)
                .persistMode(PersistMode.PERSIST),
                e -> e.transactional(true))
        .get();
UPDATE
Following Artem's suggestion, I'm trying to implement it using a SimpleActiveIdleMessageSourceAdvice:
class WaitUntilCompleted extends SimpleActiveIdleMessageSourceAdvice {

    public WaitUntilCompleted(DynamicPeriodicTrigger trigger) {
        super(trigger);
    }

    @Override
    public boolean beforeReceive(MessageSource<?> source) {
        return false;
    }
}
If I understand it correctly, the above code would stop polling. Now I have no idea how to attach this advice to the Jpa.inboundAdapter... It doesn't seem to have a suitable method (neither an Advice nor a Spec Handler method). Am I missing something obvious here? I've tried attaching the advice to the Jpa.retrievingGateway, but it doesn't change the flow at all.
UPDATE 2
Check this question for a complete solution: Spring Integration: how to unit test an advice
I answered a similar question today: How to poll from a queue 1 message at a time after downstream flow is completed in Spring Integration.
You may also use a trick at the database level: don't let new records in the table become visible while others are locked. Or issue an UPDATE at the end of the flow, so that your SELECT won't see the appropriate records until they have been updated accordingly.
Either way, any of the approaches I suggest for that question should apply here as well.
Also, you can indeed consider relying on the SimpleActiveIdleMessageSourceAdvice, since your solution is already based on a MessageSource implementation.
UPDATE
For your use case it would probably be better to extend SimpleActiveIdleMessageSourceAdvice and override its beforeReceive() to check some state indicating whether you are able to read more data. The idlePollPeriod and activePollPeriod could be the same value: it doesn't look like it makes sense to change them in between, since you go into the idle state right after reading the next set of data.
The state to check really might be a simple AtomicBoolean bean which you flip after you have processed the current set of documents. That might happen after an aggregator or anything else you use in your solution.
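A rough, untested sketch of what that could look like, with the AtomicBoolean injected from outside and flipped back to true at the end of the flow:

import java.util.concurrent.atomic.AtomicBoolean;

import org.springframework.integration.aop.SimpleActiveIdleMessageSourceAdvice;
import org.springframework.integration.core.MessageSource;
import org.springframework.integration.util.DynamicPeriodicTrigger;

public class WaitUntilCompleted extends SimpleActiveIdleMessageSourceAdvice {

    private final AtomicBoolean idle; // true when the previous batch is done

    public WaitUntilCompleted(DynamicPeriodicTrigger trigger, AtomicBoolean idle) {
        super(trigger);
        this.idle = idle;
    }

    @Override
    public boolean beforeReceive(MessageSource<?> source) {
        // poll only when the previous batch has been fully processed;
        // reset to false so the next poll waits again
        return idle.compareAndSet(true, false);
    }
}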
UPDATE 2
To use a WaitUntilCompleted for your Jpa.inboundAdapter you should have a configuration like this:
IntegrationFlows.from(Jpa.inboundAdapter(this.targetEntityManagerFactory)
                .entityClass(ProcessingMetadata.class)
                .jpaQuery("select max(p.modifiedDate) from ProcessingMetadata p " +
                        "where p.status = com.test.ProcessingStatus.PROCESSED")
                .maxResults(1)
                .expectSingleResult(true),
        e -> e.poller(Pollers.fixedDelay(Duration.ofSeconds(10)).advice(waitUntilCompleted())))
Pay attention to the .advice(waitUntilCompleted()), which is part of the poller configuration and points to your advice bean.
I'm in the process of migrating an AsyncTaskLoader to RxJava, trying to understand all the details of the RxJava approach to concurrency. Simple things were running OK; however, I'm struggling with the following code:
This is the top level method that gets executed:
mCompositeDisposable.add(mDataRepository
        .getStuff()
        .subscribeOn(mSchedulerProvider.io())
        .subscribeWith(...));
mDataRepository.getStuff() looks like this:
public Observable<StuffResult> getStuff() {
    return mDataManager
            .listStuff()
            .flatMap(stuff -> Observable.just(new StuffResult(stuff)))
            .onErrorReturn(throwable -> new StuffResult(null));
}
And the final layer:
public Observable<Stuff> listStuff() {
    Log.d(TAG, ".listStuff() - " + Thread.currentThread().getName());
    String sql = <...>;
    return mBriteDatabase.createQuery(Stuff.TABLE_NAME, sql).mapToList(mStuffMapper);
}
So with the code above, the log will print out .listStuff() - main, which is not exactly what I'm looking for, and I'm not really sure why. I was under the impression that by setting subscribeOn, every event pulled through the chain would be processed on the thread specified in the subscribeOn method.
What I think is happening is that the source-aka-final-layer code, before reaching mBriteDatabase, is not from the RxJava world and therefore is not an event until createQuery is called. So I probably need some sort of a wrapper? I've tried applying .fromCallable, however that's a wrapper for non-Rx code, and my database layer returns an observable...
Your Log.d call happens immediately when listStuff gets called, which is immediately after getStuff gets called, which is the first thing happening in the top-level code fragment you show us.
If you need to do it when the subscription happens, you need to be explicit:
public Observable<Stuff> listStuff() {
    String sql = <...>;
    return mBriteDatabase.createQuery(Stuff.TABLE_NAME, sql)
            .mapToList(mStuffMapper)
            .doOnSubscribe(() -> Log.d(TAG, ".listStuff() - " + Thread.currentThread().getName()));
}
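An alternative sketch (relying only on the standard Observable.defer operator; not part of the original answer): deferring the whole assembly also pushes the SQL construction and the log call to subscription time, so they run on the subscribeOn scheduler:

public Observable<Stuff> listStuff() {
    return Observable.defer(() -> {
        // runs at subscription time, on the scheduler given to subscribeOn
        Log.d(TAG, ".listStuff() - " + Thread.currentThread().getName());
        String sql = <...>;
        return mBriteDatabase.createQuery(Stuff.TABLE_NAME, sql)
                .mapToList(mStuffMapper);
    });
}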
All of the introductory tutorials and docs that I can find on Hadoop have simple/contrived (word count-style) examples, where each of them is submitted to MR by:
SSHing into the JobTracker node
Making sure that a JAR file containing the MR job is on HDFS
Running a command of the form bin/hadoop jar share/hadoop/mapreduce/my-map-reduce.jar <someArgs> that actually runs Hadoop/MR
Either reading the MR result from the command-line or opening a text file containing the result
Although these examples are great for showing total newbies how to work with Hadoop, they don't show me how Java code actually integrates with Hadoop/MR at the API level. I guess I am sort of expecting that:
Hadoop exposes some kind of client access/API for submitting MR jobs to the cluster
Once the jobs are complete, some asynchronous mechanism (callback, listener, etc.) reports the result back to the client
So, something like this (Groovy pseudo-code):
class Driver {
    static void main(String[] args) {
        new Driver().run(args)
    }

    void run(String[] args) {
        MapReduceJob myBigDataComputation = new SolveTheMeaningOfLifeJob(convertToHadoopInputs(args), new MapReduceCallback() {
            @Override
            void onResult() {
                // Now that you know the meaning of life, do nothing.
            }
        })

        HadoopClusterClient hadoopClient = new HadoopClusterClient("http://my-hadoop.example.com/jobtracker")
        hadoopClient.submit(myBigDataComputation)
    }
}
So I ask: Surely the simple examples in all the introductory tutorials, where you SSH into nodes and run Hadoop from the CLI, and open text files to view its results...surely that can't be the way Big Data companies actually integrate with Hadoop. Surely, something along the lines of my pseudo-code snippet above is used to kick off an MR job and fetch its results. What is it?
In a word: kicking off an MR job can be done using the Oozie scheduler. But before that, you write a MapReduce job. It has a driver class, which is the starting point of the job. In the driver class you give all the information needed for the job to run: the map input, the mapper class, any partitioners, config details and the reducer details.
Once these are in the JAR file and you start a job as above (hadoop jar) using the CLI (in reality Oozie does it), the rest is taken care of by the Hadoop ecosystem. Hope I answered your question.
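For what it's worth, the standard org.apache.hadoop.mapreduce.Job API already supports programmatic, non-blocking submission close to the pseudo-code in the question. A minimal driver sketch (MyMapper, MyReducer and the paths are placeholders, not from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my-computation");
        job.setJarByClass(Driver.class);
        job.setMapperClass(MyMapper.class);      // placeholder
        job.setReducerClass(MyReducer.class);    // placeholder
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.submit();                 // returns immediately, no SSH involved
        while (!job.isComplete()) {   // poll for completion (no callback API)
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "done" : "failed");
    }
}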
I'm using LuaJ to run user-created Lua scripts in Java. However, running a Lua script that never returns causes the Java thread to freeze. This also renders the thread uninterruptible. I run the Lua script with:
JsePlatform.standardGlobals().loadFile("badscript.lua").call();
badscript.lua contains while true do end.
I'd like to be able to automatically terminate scripts which are stuck in unyielding loops and also allow users to manually terminate their Lua scripts while they are running. I've read about debug.sethook and pcall, though I'm not sure how I'd properly use them for my purposes. I've also heard that sandboxing is a better alternative, though that's a bit out of my reach.
This question might also be extended to Java threads alone. I've not found any definitive information on interrupting Java threads stuck in a while (true);.
The online Lua demo was very promising, but it seems the detection and termination of "bad" scripts is done in the CGI script and not Lua. Would I be able to use Java to call a CGI script which in turn calls the Lua script? I'm not sure that would allow users to manually terminate their scripts, though. I lost the link for the Lua demo source code but I have it on hand. This is the magic line:
tee -a $LOG | (ulimit -t 1 ; $LUA demo.lua 2>&1 | head -c 8k)
Can someone point me in the right direction?
Some sources:
Embedded Lua - timing out rogue scripts (e.g. infinite loop) - an example anyone?
Prevent Lua infinite loop
How to interrupt the Thread when it is inside some loop doing long task?
Killing thread after some specified time limit in Java
I struggled with the same issue and, after some digging through the debug library's implementation, I created a solution similar to the one proposed by David Lewis, but did so by providing my own DebugLib:
package org.luaj.vm2.lib;

import org.luaj.vm2.Varargs;

public class CustomDebugLib extends DebugLib {

    // volatile so a flag set from another thread is seen by the script thread
    public volatile boolean interrupted = false;

    @Override
    public void onInstruction(int pc, Varargs v, int top) {
        if (interrupted) {
            throw new ScriptInterruptException();
        }
        super.onInstruction(pc, v, top);
    }

    public static class ScriptInterruptException extends RuntimeException {}
}
Just execute your script from inside a new thread and set interrupted to true to stop the execution. The exception will be encapsulated as the cause of a LuaError when thrown.
There are problems, but this goes a long way towards answering your question.
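A hedged usage sketch for the above (not part of the original answer; installing the library mirrors the way JsePlatform loads its bundled libraries): run the chunk on its own thread and flip the flag from outside:

import org.luaj.vm2.Globals;
import org.luaj.vm2.LuaError;
import org.luaj.vm2.lib.CustomDebugLib;
import org.luaj.vm2.lib.jse.JsePlatform;

public class RunWithInterrupt {
    public static void main(String[] args) throws InterruptedException {
        CustomDebugLib debug = new CustomDebugLib();
        Globals globals = JsePlatform.standardGlobals();
        globals.load(debug); // install the instruction hook

        Thread scriptThread = new Thread(() -> {
            try {
                globals.loadfile("badscript.lua").call();
            } catch (LuaError e) {
                // cause should be a CustomDebugLib.ScriptInterruptException
                e.printStackTrace();
            }
        });
        scriptThread.start();

        Thread.sleep(1000);       // let the script run for a while
        debug.interrupted = true; // the next instruction throws and unwinds
        scriptThread.join();
    }
}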
The following proof-of-concept demonstrates a basic level of sandboxing and throttling of arbitrary user code. It runs ~250 instructions of poorly crafted 'user input' and then discards the coroutine. You could use a mechanism like the one in this answer to query Java and conditionally yield inside a hook function, instead of yielding every time.
SandboxTest.java:
public class SandboxTest {
    public static void main(String[] args) {
        Globals globals = JsePlatform.debugGlobals();
        LuaValue chunk = globals.loadfile("res/test.lua");
        chunk.call();
    }
}
res/test.lua:
function sandbox(fn)
    -- read script and set the environment
    f = loadfile(fn, "t")
    debug.setupvalue(f, 1, {print = print})

    -- create a coroutine and have it yield every 50 instructions
    local co = coroutine.create(f)
    debug.sethook(co, coroutine.yield, "", 50)

    -- demonstrate stepped execution, 5 'ticks'
    for i = 1, 5 do
        print("tick")
        coroutine.resume(co)
    end
end

sandbox("res/badfile.lua")
res/badfile.lua:
while 1 do
    print("", "badfile")
end
Unfortunately, while the control flow works as intended, something in the way the 'abandoned' coroutine should get garbage collected is not working correctly. The corresponding LuaThread in Java hangs around forever in a wait loop, keeping the process alive. Details here:
How can I abandon a LuaJ coroutine LuaThread?
I've never used LuaJ before, but could you not put your one line
JsePlatform.standardGlobals().loadFile("badscript.lua").call();
into a new thread of its own, which you can then terminate from the main thread?
This would require you to make some sort of a supervisor thread (class) and pass any started scripts to it to supervise and eventually terminate if they don't terminate on their own.
EDIT: I've not found any way to safely terminate LuaJ's threads without modifying LuaJ itself. The following was what I came up with, though it doesn't work with LuaJ. However, it can be easily modified to do its job in pure Lua. I may be switching to a Python binding for Java since LuaJ threading is so problematic.
--- I came up with the following, but it doesn't work with LuaJ ---
Here is a possible solution. I register a hook with debug.sethook that gets triggered on "count" events (these events occur even in a while true do end). I also pass a custom "ScriptState" Java object I created which contains a boolean flag indicating whether the script should terminate or not. The Java object is queried in the Lua hook which will throw an error to close the script if the flag is set (edit: throwing an error doesn't actually terminate the script). The terminate flag may also be set from inside the Lua script.
If you wish to automatically terminate unyielding infinite loops, it's straightforward enough to implement a timer system which records the last time a call was made to the ScriptState and then automatically terminates the script if sufficient time passes without an API call (edit: this only works if the thread can be interrupted). If you want to kill infinite loops but not interrupt certain blocking operations, you can adjust the ScriptState object to include other state information that allows you to temporarily pause auto-termination, etc.
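A rough sketch of that timer idea (not from the original answer; names are illustrative), building on the ScriptState class shown further below:

// Flips the ScriptState flag when the script goes quiet for too long.
public class ScriptWatchdog {
    private final ScriptState state;
    private volatile long lastCall = System.currentTimeMillis();

    public ScriptWatchdog(ScriptState state) {
        this.state = state;
    }

    // Call from every API entry point the script can invoke.
    public void touch() {
        lastCall = System.currentTimeMillis();
    }

    public void startWatching(long timeoutMillis) {
        Thread watchdog = new Thread(() -> {
            while (true) {
                if (System.currentTimeMillis() - lastCall > timeoutMillis) {
                    state.setDone(true); // the Lua hook will see this and exit
                    return;
                }
                try {
                    Thread.sleep(250);
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        watchdog.setDaemon(true);
        watchdog.start();
    }
}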
Here is my interpreter.lua which can be used to call another script and interrupt it if/when necessary. It makes calls to Java methods so it will not run without LuaJ (or some other Lua-Java library) unless it's modified (edit: again, it can be easily modified to work in pure Lua).
function hook_line(e)
    if jthread:getDone() then
        -- I saw someone else use error(), but an infinite loop still seems to evade it.
        -- os.exit() seems to take care of it well.
        os.exit()
    end
end

function inithook()
    -- the hook will run every 100 million instructions.
    -- the time it takes for 100 million instructions to occur
    -- is based on computer speed and the calling environment
    debug.sethook(hook_line, "", 1e8)
    local ret = dofile(jLuaScript)
    debug.sethook()
    return ret
end

args = { ... }
if jthread == nil then
    error("jthread object is nil. Please set it in the Java environment.", 2)
elseif jLuaScript == nil then
    error("jLuaScript not set. Please set it in the Java environment.", 2)
else
    local x, y = xpcall(inithook, debug.traceback)
end
Here's the ScriptState class that stores the flag and a main() to demonstrate:
public class ScriptState {
    private AtomicBoolean isDone = new AtomicBoolean(true);

    public boolean getDone() { return isDone.get(); }
    public void setDone(boolean v) { isDone.set(v); }

    public static void main(String[] args) {
        Thread t = new Thread() {
            public void run() {
                System.out.println("J: Lua script started.");
                ScriptState s = new ScriptState();
                Globals g = JsePlatform.debugGlobals();
                g.set("jLuaScript", "res/main.lua");
                g.set("jthread", CoerceJavaToLua.coerce(s));
                try {
                    g.loadFile("res/_interpreter.lua").call();
                } catch (Exception e) {
                    System.err.println("There was a Lua error!");
                    e.printStackTrace();
                }
            }
        };
        t.start();
        try {
            t.join();
        } catch (Exception e) {
            System.err.println("Error waiting for thread");
        }
        System.out.println("J: End main");
    }
}
res/main.lua contains the target Lua code to be run. Use environment variables or parameters to pass additional information to the script as usual. Remember to use JsePlatform.debugGlobals() instead of JsePlatform.standardGlobals() if you want to use the debug library in Lua.
EDIT: I just noticed that os.exit() not only terminates the Lua script but also the calling process. It seems to be the equivalent of System.exit(). error() will throw an error but will not cause the Lua script to terminate. I'm trying to find a solution for this now.
Thanks to @Seldon for suggesting the use of a custom DebugLib. I implemented a simplified version of that by just checking, before every instruction, whether a predefined amount of time has elapsed. This is of course not super accurate, because there is some time between class creation and script execution. It requires no separate threads.
class DebugLibWithTimeout(
    timeout: Duration,
) : DebugLib() {
    private val timeoutOn = Instant.now() + timeout

    override fun onInstruction(pc: Int, v: Varargs, top: Int) {
        val timeoutElapsed = Instant.now() > timeoutOn
        if (timeoutElapsed)
            throw Exception("Timeout")
        super.onInstruction(pc, v, top)
    }
}
Important note: if you sandbox an untrusted script by calling the load function on Lua code and passing a separate environment to it, this will not work. onInstruction() seems to be called only if the function environment is a reference to _G. I dealt with that by stripping everything from _G and then adding whitelisted items back.
-- whitelisted items
local sandbox_globals = {
    print = print
}

local original_globals = {}
for key, value in pairs(_G) do
    original_globals[key] = value
end

local sandbox_env = _G

-- Remove everything from _G
for key, _ in pairs(sandbox_env) do
    sandbox_env[key] = nil
end

-- Add whitelisted items back.
-- Global pairs-function cannot be used now.
for key, value in original_globals.pairs(sandbox_globals) do
    sandbox_env[key] = value
end

local function run_user_script(script)
    local script_function, message = original_globals.load(script, nil, 't', sandbox_env)
    if not script_function then
        return false, message
    end
    return pcall(script_function)
end
I need to find a way to execute mutually dependent tasks.
First task has to download a zip file from remote server.
Second tasks goal is to unzip the file downloaded by the first task.
Third task has to process files extracted from zip.
So, the third is dependent on the second, and the second on the first task.
Naturally, if one of the tasks fails, the others shouldn't be executed. Since the first task downloads files from a remote server, there should be a mechanism for restarting the task if the server is not available.
Tasks have to be executed daily.
Any suggestions, patterns or java API?
Regards!
It seems that you do not want to divide them into tasks; just do it like this:
process(unzip(download(uri)));
It depends a bit on external requirements. Is there any user involvement? Monitoring? Alerting?...
The simplest would obviously be just methods that check whether the previous step has done what it should:
download() downloads the file to a specified place.
unzip() extracts the file to a specified place if the downloaded file is in place.
process() processes the data if it has been extracted.
A more "formal" way of doing it would be to use a workflow engine. Depending on requirements, you can get some that do everything from fancy UIs, to some that follow formal standardised .XML-definitions of the workflow - and any in between.
http://java-source.net/open-source/workflow-engines
Create one public method to execute the full chain and private methods for the tasks:
public void doIt() {
    if (download() == false) {
        // download failed
    } else if (unzip() == false) {
        // unzip failed
    } else if (process() == false) {
        // processing failed
    }
}
private boolean download() {/* ... */}
private boolean unzip() {/* ... */}
private boolean process() {/* ... */}
So you have an API that guarantees that all steps are executed in the correct sequence and that a step is only executed if certain conditions are met (the above example just illustrates this pattern).
For daily execution you can use the Quartz Framework.
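A minimal sketch of such a daily schedule with Quartz 2.x (PipelineJob is a placeholder Job implementation that would run the three steps in order; the 02:00 timing is arbitrary):

import org.quartz.CronScheduleBuilder;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class DailySchedule {
    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = JobBuilder.newJob(PipelineJob.class)
                .withIdentity("pipeline")
                .build();
        Trigger trigger = TriggerBuilder.newTrigger()
                .withSchedule(CronScheduleBuilder.dailyAtHourAndMinute(2, 0)) // 02:00 daily
                .build();
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}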
As the tasks depend on each other, I would recommend evaluating the error codes or exceptions the tasks return, and only continuing if the previous task was successful.
The normal way to perform these tasks is to call each task in order and throw an exception when there is a failure which prevents the following tasks from being performed. Something like:
try {
    download();
    unzip();
    process();
} catch (Exception failed) {
    failed.printStackTrace();
}
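Since the question also asks for a restart mechanism when the server is unavailable, a bounded retry around the download step might look like this (assuming download() throws IOException when the server is unreachable; the retry count and delay are arbitrary):

private void downloadWithRetry() throws Exception {
    int attempts = 0;
    while (true) {
        try {
            download();
            return;
        } catch (IOException e) {
            if (++attempts >= 3) {
                throw e; // give up after three attempts
            }
            Thread.sleep(60_000); // wait a minute before retrying
        }
    }
}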
I think what you are interested in is some kind of transaction definition. I.e.:
- Define TaskA (e.g. download)
- Define TaskB (e.g. unzip)
- Define TaskC (e.g. process)
Assuming that your intention is to have tasks that can also work independently, e.g. only download a file (without also executing TaskB and TaskC), you should define Transaction1 composed of TaskA, TaskB and TaskC, and Transaction2 composed of only TaskA.
The semantics, e.g. that for Transaction1 the tasks TaskA, TaskB and TaskC should be executed sequentially and all-or-none, can be captured in your transaction definitions, as sketched below.
The definitions can live in XML configuration files, and you can use a framework such as Quartz for scheduling.
A higher-level construct then checks the transactions and executes them as defined.
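A loose sketch of that idea (the names are illustrative, not from any particular framework):

import java.util.List;

interface Task {
    void execute() throws Exception;
}

// A transaction is an ordered list of tasks executed sequentially;
// the first failure stops the whole run (rollback/compensation omitted).
class Transaction {
    private final List<Task> tasks;

    Transaction(List<Task> tasks) {
        this.tasks = tasks;
    }

    void run() throws Exception {
        for (Task task : tasks) {
            task.execute();
        }
    }
}

// Transaction1 = TaskA + TaskB + TaskC; Transaction2 = TaskA only:
// new Transaction(List.of(new DownloadTask(), new UnzipTask(), new ProcessTask())).run();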
Dependent task execution is made easy with Dexecutor.
Disclaimer: I am the owner of the library.
Basically, you need the following pattern, using the Dexecutor.addDependency method:
// newTaskExecutor() is a setup helper (see the Dexecutor docs) that wires up
// a TaskProvider and an ExecutorService.
DefaultDexecutor<Integer, Integer> executor = newTaskExecutor();

// Building: task 1 must finish before 2, 2 before 3, and so on
executor.addDependency(1, 2);
executor.addDependency(2, 3);
executor.addDependency(3, 4);
executor.addDependency(4, 5);

// Execution
executor.execute(ExecutionConfig.TERMINATING);