serial and parallel run workflow in cq5 - java

I have 2 workflows.
I want to combine them, so I want to know how to:
Launch 2 workflows in parallel
Launch the second workflow after the first workflow terminates

You can achieve this by having multiple workflow launchers, though you have to be careful what these workflows do to the workload, e.g. if they would concurrently change the same property.
There are multiple ways to do this:
Either write a property in the last step of the first workflow and have the second workflow be triggered with a launcher if this property is set. Or you can start another workflow from a custom step:
protected void processItem(WorkItem item, WorkflowSession wfSession, WorkflowData workflowData, String config) throws WorkflowException {
    // the id is the path of the workflow model to start, e.g. /etc/workflow/models/.../jcr:content/model
    String wfId = "myWorkflowId";
    WorkflowModel model = wfSession.getModel(wfId);
    wfSession.startWorkflow(model, workflowData);
    // optionally terminate the current workflow programmatically
    wfSession.terminateWorkflow(item.getWorkflow());
}
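For launching two workflows in parallel, a minimal sketch from such a custom step could simply start two independent instances; the model paths and payload below are placeholders, assuming the same WorkflowSession API as above:
WorkflowModel modelA = wfSession.getModel("/etc/workflow/models/model-a/jcr:content/model");
WorkflowModel modelB = wfSession.getModel("/etc/workflow/models/model-b/jcr:content/model");
WorkflowData payload = wfSession.newWorkflowData("JCR_PATH", "/content/some-page");
// the two instances run independently of each other, i.e. in parallel
wfSession.startWorkflow(modelA, payload);
wfSession.startWorkflow(modelB, payload);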

Related

Java Ledger API - return contractId in submit command

Is there a way to automatically return the contractId generated by a command like:
client.getCommandSubmissionClient().submit(...).blockingGet();
If not, what's the best way to do it?
A simple way to find the transaction you are looking for would be something like this:
client.getTransactionsClient()
    .getTransactions(LedgerOffset.LedgerBegin.getInstance(), new FiltersByParty(Collections.singletonMap(party, NoFilter.instance)), false)
    .filter(t -> "MyCommandId".equals(t.getCommandId()))
    .singleOrError()
    .blockingGet()
Note that here we are reading from LedgerBegin. Normally you would ask for the ledger end via client.getTransactionsClient().getLedgerEnd() before submitting the command and use that offset for subscribing to transactions.
There is no inbuilt synchronous API call that returns the resulting transaction of a (successful) command submission. The command service only returns the command completion (i.e. success/fail).
One way to do what you want is to use the commandId field. It allows the submitting party to correlate the command submission and resulting transaction. You would, however, have to build a wrapper combining the command and transaction services yourself.
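As a rough sketch of such a wrapper (assuming the same rxjava bindings as in the snippet above, and that the command is submitted with commandId "MyCommandId"), you could record the ledger end first and then block on the matching transaction; the created contract ids can then be read from that transaction's events:
// take the current ledger end before submitting, so only new transactions are scanned
LedgerOffset offset = client.getTransactionsClient().getLedgerEnd().blockingGet();
// ... submit the command here with commandId = "MyCommandId" ...
Transaction tx = client.getTransactionsClient()
    .getTransactions(offset, new FiltersByParty(Collections.singletonMap(party, NoFilter.instance)), false)
    .filter(t -> "MyCommandId".equals(t.getCommandId()))
    .firstOrError()
    .blockingGet();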

Run a command based program at a custom date-time (Add/modify/delete)

I have a python script which takes a few params as arguments, and I need to run tasks based on this script at a given date and time with other params. I am making a UI to add/modify/delete such tasks with all the given params. How do I do it? Is there any tool available? I don't think crontabs are the best solution here, especially due to the frequent need to modify/delete tasks. The requirement is for a Linux machine.
One solution could be: create an API that reads all the tasks stored in the DB and executes the python script on time, and call that API every few minutes via crontab.
But I am looking for a better solution. Suggestions are welcome.
I am assuming that all the (command line) arguments are known beforehand, in which case you have a couple of options:
Use a Python scheduler to programmatically schedule your tasks without cron. The scheduler script can run either as a daemon or be started via a cron job so that it runs all the time.
Use the python-crontab module to modify cron jobs from the Python program itself.
If the arguments to the scripts are generated dynamically at various schedule times (or are user provided), then the only option is to use a GUI to collect the updated arguments and run a Python script that modifies the cron jobs.
from datetime import datetime, timedelta
from threading import Timer

x = datetime.today()
# schedule for 01:00 the next day; timedelta avoids the month-end problem of day=x.day+1
y = (x + timedelta(days=1)).replace(hour=1, minute=0, second=0, microsecond=0)
delta_t = y - x
secs = delta_t.total_seconds() + 1

def hello_world():
    print("hello world")
    # ...

t = Timer(secs, hello_world)
t.start()
This will execute the function at 1 am the next day.
You could do it with timer units in systemd. What are the advantages over cron?
Dependencies on other services can be defined, so that other services either must be executed first for a service to be started at all (Requires), or a service is not started if it would conflict with another service currently running (Conflicts).
Relative times are possible: you can have a timer unit start a service ten minutes after its last run. This eliminates overlapping service calls that at some point block the CPU because the interval in cron is too low.
Since timer units are themselves services, they can be elegantly activated or deactivated, while cron jobs can only be deactivated by commenting them out or deleting them.
The way times and time spans are specified is easier to understand than in cron.
Here is an example:
File: /etc/systemd/system/testfile.service
[Unit]
Description=Description of your app.
[Service]
User=yourusername
ExecStart=/path/to/yourscript
The Timer Unit specifies that the service unit defined above is to be started 30 minutes after booting and then ten minutes after its last activity.
File: /etc/systemd/system/testfile.timer
[Unit]
Description=Some description of your task.
[Timer]
OnBootSec=30min
OnUnitInactiveSec=10min
Persistent=true
Unit=testfile.service
[Install]
WantedBy=timers.target
One solution would be to have a daemon running in the background, waking up regularly to execute the due tasks.
It would sleep x minutes, then query the database for all the not-yet-done tasks whose due datetime is earlier than the current datetime. It would execute the tasks, mark them as done, save the result and go back to sleep.
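A minimal sketch of such a daemon in Java; Task and TaskStore are hypothetical abstractions over your DB table, not an existing API:
import java.time.Instant;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TaskPoller {
    interface Task { void run() throws Exception; }
    interface TaskStore {
        List<Task> findPendingBefore(Instant now); // not-yet-done tasks whose due time has passed
        void markDone(Task task);
        void markFailed(Task task, String reason);
    }

    private final TaskStore store;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public TaskPoller(TaskStore store) { this.store = store; }

    public void start() {
        // wake up every minute and run everything that is due
        scheduler.scheduleAtFixedRate(this::runDueTasks, 0, 1, TimeUnit.MINUTES);
    }

    private void runDueTasks() {
        for (Task task : store.findPendingBefore(Instant.now())) {
            try {
                task.run(); // e.g. launch the python script with its stored arguments
                store.markDone(task);
            } catch (Exception e) {
                store.markFailed(task, e.getMessage());
            }
        }
    }
}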
You can also use serverless computation, such as AWS Lambda which can be triggered by scheduled events. It seems to support the crontab notation or similar but you could also add the next event every time you finish one run.
I found the answer to this myself, i.e. Timers. Since my experience and use case were in Java, I used them by creating a REST API in Spring and managing an in-memory cache of timers in the Java layer as a copy of the DB. One can use timers in any language to achieve something similar. Now I can run any console-based application and pass all the required arguments inside the respective timer. Similarly, I can update or delete any timer by simply calling the .cancel() method on that timer from the hashmap.
public static ConcurrentHashMap<String, Timer> PostCache = new ConcurrentHashMap<>();

public String Schedulepost(Igpost igpost) throws ParseException {
    String res = "";
    TimerTask task = new TimerTask() {
        public void run() {
            System.out.println("Sample Timer based task performed on: " + new Date() + "\nThread's name: " + Thread.currentThread().getName());
            System.out.println(igpost.getPostdate() + " " + igpost.getPosttime());
        }
    };
    DateFormat dateFormatter = new SimpleDateFormat("yyyy-MM-dd HH:mm");
    Date date = dateFormatter.parse(igpost.getPostdate() + " " + igpost.getPosttime());
    Timer timer = new Timer(igpost.getImageurl());          // one named Timer per post
    CacheHelper.PostCache.put(igpost.getImageurl(), timer); // keep it so it can be cancelled later
    timer.schedule(task, date);
    return res;
}
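Updating or deleting a scheduled post then only means looking the timer up in the map and cancelling it (imageUrl here stands for whatever key was used when scheduling):
Timer existing = CacheHelper.PostCache.remove(imageUrl);
if (existing != null) {
    existing.cancel(); // drops the pending TimerTask; schedule a new Timer afterwards if this is an update
}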
Thank you everybody for the suggestions.

How to know which stage of a job is currently running in Apache Spark?

Consider I have a job as follows in Spark:
CSV File ==> Filter By A Column ==> Taking Sample ==> Save As JSON
Now my requirement is: how do I know which step (fetching the file, filtering or sampling) of the job is currently executing, programmatically (preferably using the Java API)? Is there any way to do this?
I can track jobs, stages and tasks using the SparkListener class, e.g. by tracking a stage ID. But how do I know which stage ID belongs to which step in the job chain?
What I want is to send a notification to the user when, say, Filter By A Column is completed. For that I made a class that extends SparkListener. But I cannot find out where I can get the name of the currently executing transformation. Is it possible to track this at all?
public class ProgressListener extends SparkListener {
    @Override
    public void onJobStart(SparkListenerJobStart jobStart)
    {
    }

    @Override
    public void onStageSubmitted(SparkListenerStageSubmitted stageSubmitted)
    {
        //System.out.println("Stage Name : " + stageSubmitted.stageInfo().getStatusString()); gives the action name only
    }

    @Override
    public void onTaskStart(SparkListenerTaskStart taskStart)
    {
        //no such method as taskStart.name()
    }
}
You cannot exactly know when, e.g., the filter operation starts or finishes.
That's because you have transformations (filter,map,...) and actions (count, foreach,...). Spark will put as many operations into one stage as possible. Then the stage is executed in parallel on the different partitions of your input. And here comes the problem.
Assume you have several workers and the following program
LOAD ==> MAP ==> FILTER ==> GROUP BY + Aggregation
This program will probably have two stages: the first stage will load the file and apply the map and filter.
Then the output will be shuffled to create the groups. In the second stage the aggregation will be performed.
Now, the problem is that you have several workers and each will process a portion of your input data in parallel. That is, every executor in your cluster will receive a copy of your program (the current stage) and execute it on the assigned partition.
You see, you will have multiple instances of your map and filter operators that are executed in parallel, but not necessarily at the same time. In an extreme case, worker 1 will finish with stage 1 before worker 20 has started at all (and therefore finish with its filter operation before worker 20).
For RDDs, Spark uses the iterator model inside a stage. For Datasets, however, recent Spark versions generate a single loop over the partition and execute the transformations in it. This means that in this case Spark itself does not really know when a transformation operator finishes for a single task!
Long story short:
You are not able to know when an operation inside a stage finishes
Even if you could, there are multiple instances that will finish at different times.
So, now: I already had the same problem:
In our Piglet project (please allow some advertisement ;-) ) we generate Spark code from Pig Latin scripts and wanted to profile the scripts. I ended up inserting a mapPartitions operator between all user operators; it sends the partition ID and the current time to a server, which evaluates the messages. However, this solution also has its limitations... and I'm not completely satisfied yet.
However, unless you are able to modify the programs, I'm afraid you cannot achieve what you want.
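As a rough sketch of that workaround with the Java API: lines and the filter below are placeholders for the user operators, and reportProgress is a hypothetical callback (e.g. an HTTP POST to your monitoring server) that must be reachable from the executors:
JavaRDD<String> filtered = lines.filter(line -> line.contains("someValue"));
JavaRDD<String> instrumented = filtered.mapPartitionsWithIndex((partitionId, rows) -> {
    // materialise the partition so the upstream filter has fully run for it before reporting
    List<String> buffered = new ArrayList<>();
    rows.forEachRemaining(buffered::add);
    reportProgress("filter-done", partitionId, System.currentTimeMillis());
    return buffered.iterator();
}, true);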
Did you consider this option: http://spark.apache.org/docs/latest/monitoring.html
It seems you can use the following REST API to get a certain job's state: /applications/[app-id]/jobs/[job-id]
You can set the job group ID and description so you can track which job group is being handled, i.e. via setJobGroup.
Assuming you set the job group ID to "1" with the description "Test job":
sc.setJobGroup("1", "Test job")
When you then call http://localhost:4040/api/v1/applications/[app-id]/jobs/[job-id]
you'll get a JSON response with a descriptive name for that job:
{
"jobId" : 3,
"name" : "count at <console>:25",
"description" : "Test Job",
"submissionTime" : "2017-02-22T05:52:03.145GMT",
"completionTime" : "2017-02-22T05:52:13.429GMT",
"stageIds" : [ 3 ],
"jobGroup" : "1",
"status" : "SUCCEEDED",
"numTasks" : 4,
"numActiveTasks" : 0,
"numCompletedTasks" : 4,
"numSkippedTasks" : 0,
"numFailedTasks" : 0,
"numActiveStages" : 0,
"numCompletedStages" : 1,
"numSkippedStages" : 0,
"numFailedStages" : 0
}
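A rough sketch of polling that endpoint from Java (java.net.http, Java 11+): appId is a placeholder for your application id, and JSON parsing is left to your library of choice:
HttpClient http = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder(
        URI.create("http://localhost:4040/api/v1/applications/" + appId + "/jobs"))
    .GET()
    .build();
String json = http.send(request, HttpResponse.BodyHandlers.ofString()).body();
// filter the returned entries by "jobGroup" == "1" and inspect their "status"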

Monitor progress and intermediate results in Spark

I have a simple Spark task, something like this:
JavaRDD<Solution> solutions = rdd.map(new Solve());
// Select best solution by some criteria
The solve routine takes some time. For a demo application, I need to get some property of each solution as soon as it is calculated, before the call to rdd.map terminates.
I've tried using accumulators and a SparkListener, overriding the onTaskEnd method, but it seems to be called only at the end of the mapping, not per thread. E.g.:
sparkContext.sc().addSparkListener(new SparkListener() {
    public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
        // do something with taskEnd.taskInfo().accumulables()
    }
});
How can I get an asynchronous message for each map function end?
Spark runs locally or in a standalone cluster mode.
Answers can be in Java or Scala, both are OK.

Executing dependent tasks in java

I need to find a way to execute mutually dependent tasks.
The first task has to download a zip file from a remote server.
The second task's goal is to unzip the file downloaded by the first task.
The third task has to process the files extracted from the zip.
So, the third task depends on the second, and the second on the first.
Naturally, if one of the tasks fails, the others shouldn't be executed. Since the first task downloads files from a remote server, there should be a mechanism for restarting the task if the server is not available.
Tasks have to be executed daily.
Any suggestions, patterns or java API?
Regards!
It seems that you do not need to divide them into separate tasks; just do it like this:
process(unzip(download(uri)));
It depends a bit on external requirements. Is there any user involvement? Monitoring? Alerting?...
The simplest would obviously be to just have methods that check whether the previous one has done what it should:
download() downloads file to specified place.
unzip() extracts the file to a specified place if the downloaded file is in place.
process() processes the data if it has been extracted.
A more "formal" way of doing it would be to use a workflow engine. Depending on requirements, you can get some that do everything from fancy UIs, to some that follow formal standardised .XML-definitions of the workflow - and any in between.
http://java-source.net/open-source/workflow-engines
Create one public method to execute the full chain and private methods for the tasks:
public void doIt() {
    if (download() == false) {
        // download failed
    } else if (unzip() == false) {
        // unzip failed
    } else if (process() == false) {
        // processing failed
    }
}

private boolean download() {/* ... */}
private boolean unzip() {/* ... */}
private boolean process() {/* ... */}
So you have an API that guarantees that all steps are executed in the correct sequence and that a step is only executed if certain conditions are met (the above example just illustrates this pattern).
For daily execution you can use the Quartz Framework.
As the tasks depend on each other, I would recommend evaluating the error codes or exceptions the tasks return, and only continuing if the previous task was successful.
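A minimal Quartz sketch for the daily run (the class name, trigger identities and the 02:00 run time are placeholders):
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class DailyPipelineJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // call the download/unzip/process chain here and let exceptions mark the run as failed
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = JobBuilder.newJob(DailyPipelineJob.class)
                .withIdentity("dailyPipeline")
                .build();
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("dailyPipelineTrigger")
                .withSchedule(CronScheduleBuilder.dailyAtHourAndMinute(2, 0)) // every day at 02:00
                .build();
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}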
The normal way to perform these tasks is to call each task in order and throw an exception on failure, which prevents the following tasks from being performed. Something like:
try {
    download();
    unzip();
    process();
} catch (Exception failed) {
    failed.printStackTrace();
}
I think what you are interested in is some kind of transaction definition.
I.e.:
Define TaskA (e.g. download)
Define TaskB (e.g. unzip)
Define TaskC (e.g. process)
Assuming that your intention is to have tasks that can also work independently, e.g. only download a file (without also executing TaskB and TaskC), you should define a Transaction1 composed of TaskA, TaskB, TaskC and a Transaction2 composed of only TaskA.
The semantics, e.g. for Transaction1 that TaskA, TaskB and TaskC should be executed sequentially and all-or-none, can be captured in your transaction definitions.
The definitions can be kept in XML configuration files, and you can use a framework such as Quartz for scheduling.
A higher-level construct then checks for the transactions and executes them as defined.
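A rough sketch of that idea, assuming a transaction is simply an ordered, all-or-none list of steps (the lambdas stand in for TaskA/TaskB/TaskC):
List<Runnable> transaction1 = List.of(
        () -> System.out.println("download"), // TaskA
        () -> System.out.println("unzip"),    // TaskB
        () -> System.out.println("process")); // TaskC
try {
    transaction1.forEach(Runnable::run); // sequential; aborts at the first exception
} catch (RuntimeException e) {
    // the remaining tasks never run, so the whole transaction counts as failed
}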
Dependent tasks execution made easy with Dexecutor
Disclaimer : I am the owner of the library
Basically you need the following pattern, using the Dexecutor addDependency method:
DefaultDexecutor<Integer, Integer> executor = newTaskExecutor(); // helper (see the Dexecutor docs) that presumably builds a DefaultDexecutor with a task provider and execution engine
//Building: 1 -> 2 -> 3 -> 4 -> 5, each task runs after its dependency has finished
executor.addDependency(1, 2);
executor.addDependency(2, 3);
executor.addDependency(3, 4);
executor.addDependency(4, 5);
//Execution
executor.execute(ExecutionConfig.TERMINATING);
