How to retry a failed job on any node with Apache Ignite/GridGain

I'm experimenting with fault tolerance in Apache Ignite.
What I can't figure out is how to retry a failed job on any node. I have a use case where my jobs will be calling a third-party tool as a system process via ProcessBuilder to do some calculations. In some cases the tool may fail, but in most cases it's OK to retry the job on any node, including the one where it previously failed.
At the moment Ignite seems to reroute the job to another node which did not have this job before. So, after a while all nodes are gone and the task fails.
What I'm looking for is how to retry a job on any node.
Here's a test to demonstrate my problem.
Here's my randomly failing job:
public static class RandomlyFailingComputeJob implements ComputeJob {
    private static final long serialVersionUID = -8351095134107406874L;
    private final String data;

    public RandomlyFailingComputeJob(String data) {
        Validate.notNull(data);
        this.data = data;
    }

    public void cancel() {
    }

    // Fails roughly half of the time to simulate the flaky third-party tool.
    public Object execute() throws IgniteException {
        final double random = Math.random();
        if (random > 0.5) {
            throw new IgniteException();
        } else {
            return StringUtils.reverse(data);
        }
    }
}
And below is the task:
public static class RandomlyFailingComputeTask extends
        ComputeTaskSplitAdapter<String, String> {
    private static final long serialVersionUID = 6756691331287458885L;

    public ComputeJobResultPolicy result(ComputeJobResult res,
            List<ComputeJobResult> rcvd) throws IgniteException {
        // Always ask for failover when a job threw an exception.
        if (res.getException() != null) {
            return ComputeJobResultPolicy.FAILOVER;
        }
        return ComputeJobResultPolicy.WAIT;
    }

    public String reduce(List<ComputeJobResult> results)
            throws IgniteException {
        final Collection<String> reducedResults = new ArrayList<String>();
        for (ComputeJobResult result : results) {
            reducedResults.add(result.<String> getData());
        }
        return StringUtils.join(reducedResults, ' ');
    }

    protected Collection<? extends ComputeJob> split(int gridSize,
            String arg) throws IgniteException {
        final String[] args = StringUtils.split(arg, ' ');
        final Collection<ComputeJob> computeJobs = new ArrayList<ComputeJob>();
        for (String data : args) {
            computeJobs.add(new RandomlyFailingComputeJob(data));
        }
        return computeJobs;
    }
}
Test code:
final Ignite ignite = Ignition.start();
final String original = "The quick brown fox jumps over the lazy dog";
final String reversed = StringUtils.join(
ignite.compute().execute(new RandomlyFailingComputeTask(),
original), ' ');
As you can see, a failed job should always be failed over. Since the probability of failure != 1, I expect the task to terminate successfully at some point.
With a probability threshold of 0.5 and a total of 3 nodes this hardly ever happens. I'm getting an exception like class org.apache.ignite.cluster.ClusterTopologyException: Failed to failover a job to another node (failover SPI returned null). After some debugging I found out that this is because I eventually run out of nodes. All of them are gone.
I understand that I can write my own FailoverSpi to handle this.
But this just doesn't feel right.
First, it seems like overkill.
But then the SPI is a kind of global thing. I'd like to decide per job whether it should be retried or failed over. This may, for instance, depend on the exit code of the third-party tool I'm invoking. So configuring failover through the global SPI isn't right.

The current implementation of AlwaysFailoverSpi (which is the default one) doesn't fail over if it has already tried all nodes for a particular job. I believe this could become a configuration option, but for now you will have to implement your own failover SPI (it should be pretty simple: just pick a random node from the topology each time a job is trying to fail over).
As for the global nature of the SPI, you're right, but its failover() method takes a FailoverContext, which has information about the failed job (task name, attributes, exception, etc.), so you can make a decision based on this information.
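The "pick a random node each time" rule from this answer can be sketched in plain Java, without any Ignite dependency. This only illustrates the selection logic and is not a FailoverSpi implementation: RandomNodePicker, pick, and the string node IDs are hypothetical stand-ins, and a real SPI would make the same choice inside failover() over the nodes it is given.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper: chooses a failover target without remembering
// which nodes have already been tried for the job.
final class RandomNodePicker {
    private final Random random;

    RandomNodePicker(Random random) {
        this.random = random;
    }

    // Returns a random node ID from the topology, or null when there is no node at all.
    String pick(List<String> topology) {
        if (topology == null || topology.isEmpty()) {
            return null;
        }
        return topology.get(random.nextInt(topology.size()));
    }

    public static void main(String[] args) {
        RandomNodePicker picker = new RandomNodePicker(new Random());
        List<String> topology = Arrays.asList("node-a", "node-b", "node-c");
        // The node that just failed stays eligible, so retries never run out of nodes.
        for (int attempt = 1; attempt <= 5; attempt++) {
            System.out.println("attempt " + attempt + " -> " + picker.pick(topology));
        }
    }
}
```

Because the previously failed node is not excluded, a job that always fails would be retried forever, so a real implementation would probably also want an attempt limit.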


How to use non-keyed state with Kafka Consumer in Flink?

I'm trying to implement (I'm just starting to work with Java and Flink) non-keyed state in a KafkaConsumer object, since at this stage no keyBy() is called. This object is the front end and the first module to handle messages from Kafka.
SourceOutput is a proto file representing the message.
I have the KafkaConsumer object:
public class KafkaSourceFunction extends ProcessFunction<byte[], SourceOutput> implements Serializable {
    public void processElement(byte[] bytes, ProcessFunction<byte[], SourceOutput>.Context context,
            Collector<SourceOutput> collector) throws Exception {
        // Here, I want to call the sorting method
    }
}
I have an object (KafkaSourceSort) that does all the sorting. It should keep out-of-order messages in a priority queue in the state, and it is also responsible for delivering a message through the collector once it arrives in the right order.
class SessionInfo
public PriorityQueue<SourceOutput> orderedMessages = null;
public void putMessage(SourceOutput Msg)
if(orderedMessages == null)
orderedMessages = new PriorityQueue<SourceOutput>(new SequenceComparator());
public class KafkaSourceState implements Serializable
public TreeMap<String, SessionInfo> Sessions = new TreeMap<>();
I read that I need to use non-keyed state (ListState), which should contain a map of sessions, where each session contains a priority queue holding all messages related to that session.
I found an example, so I implemented this:
public class KafkaSourceSort implements SinkFunction<KafkaSourceSort>,
private transient ListState<KafkaSourceState> checkpointedState;
private KafkaSourceState state;
public void snapshotState(FunctionSnapshotContext functionSnapshotContext) throws Exception
public void initializeState(FunctionInitializationContext context) throws Exception
ListStateDescriptor<KafkaSourceState> descriptor =
new ListStateDescriptor<KafkaSourceState>(
TypeInformation.of(new TypeHint<KafkaSourceState>() {}));
checkpointedState = context.getOperatorStateStore().getListState(descriptor);
if (context.isRestored())
state = (KafkaSourceState) checkpointedState.get();
public void invoke(KafkaSourceState value, SinkFunction.Context contex) throws Exception
state = value;
// ...
I see that I need to implement an invoke() method, which will probably be called from processElement(), but the signature of invoke() doesn't contain the collector, and I don't understand how to do this, or even whether what I did so far is OK.
Any help will be appreciated.
A SinkFunction is a terminal node in the DAG that is your job graph. It doesn't have a Collector in its interface because it cannot emit anything downstream. It is expected to connect to an external service or data store and send data there.
If you share more about what you are trying to accomplish perhaps we can offer more assistance. There may be an easier way to go about this.
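Since only a function that owns a Collector can emit downstream, one option is to keep the reordering logic in a plain class that returns the messages that have become deliverable, and let the caller emit them. The following is a rough, Flink-free sketch of that idea. SessionReorderBuffer and Message are hypothetical (the real SourceOutput is not shown), and it assumes sequence numbers start at 0 per session and have no gaps. Checkpointing the buffer as operator state is not covered here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class SessionReorderBuffer {
    // Hypothetical message shape standing in for SourceOutput.
    static final class Message {
        final String sessionId;
        final long sequence;

        Message(String sessionId, long sequence) {
            this.sessionId = sessionId;
            this.sequence = sequence;
        }
    }

    private static final class Session {
        final PriorityQueue<Message> pending =
                new PriorityQueue<>(Comparator.comparingLong((Message m) -> m.sequence));
        long nextExpected = 0; // assumption: sequences start at 0 without gaps
    }

    private final Map<String, Session> sessions = new HashMap<>();

    // Buffers the message and returns every message of its session that is now in order.
    List<Message> accept(Message msg) {
        Session session = sessions.computeIfAbsent(msg.sessionId, id -> new Session());
        session.pending.add(msg);
        List<Message> ready = new ArrayList<>();
        while (!session.pending.isEmpty()
                && session.pending.peek().sequence <= session.nextExpected) {
            Message head = session.pending.poll();
            if (head.sequence == session.nextExpected) {
                ready.add(head);
                session.nextExpected++;
            }
            // Otherwise the head is a stale duplicate and is dropped.
        }
        return ready;
    }
}
```

A caller with access to a collector would loop over the returned list and emit each message, so the buffer itself never needs to know about the collector.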

How to reset metrics every X seconds?

I am trying to measure application- and JVM-level metrics in my application using the Dropwizard Metrics library.
Below is the metrics class which I use across my code. I call its increment and decrement methods to increment and decrement the metrics.
public class TestMetrics {
    private final MetricRegistry metricRegistry = new MetricRegistry();

    private static class Holder {
        private static final TestMetrics INSTANCE = new TestMetrics();
    }

    public static TestMetrics getInstance() {
        return Holder.INSTANCE;
    }

    private TestMetrics() {}

    public void increment(final Names... metricsName) {
        for (Names metricName : metricsName)
            metricRegistry.counter(name(TestMetrics.class, metricName.value())).inc();
    }

    public void decrement(final Names... metricsName) {
        for (Names metricName : metricsName)
            metricRegistry.counter(name(TestMetrics.class, metricName.value())).dec();
    }

    public MetricRegistry getMetricRegistry() {
        return metricRegistry;
    }

    public enum Names {
        // some more fields here
        INVALID_ID("invalid-id"), MESSAGE_DROPPED("drop-message");

        private final String value;

        private Names(String value) {
            this.value = value;
        }

        public String value() {
            return value;
        }
    }
}
And here is how I use the above TestMetrics class to increment the metrics depending on the case. The method below is called by multiple threads.
public void process(GenericRecord record) {
// ... some other code here
try {
String clientId = String.valueOf(record.get("clientId"));
String procId = String.valueOf(record.get("procId"));
if (Strings.isNullOrEmpty(clientId) && Strings.isNullOrEmpty(procId)
&& !NumberUtils.isNumber(clientId)) {
// .. other code here
} catch (Exception ex) {
Now I have another class which runs every 30 seconds (I am using the Quartz framework for that), from which I want to print out all the metrics and their counts. In general, I will send these metrics every 30 seconds to some other system, but for now I am printing them out here. Below is how I am doing it.
public class SendMetrics implements Job {
public void execute(final JobExecutionContext ctx) throws JobExecutionException {
MetricRegistry metricsRegistry = TestMetrics.getInstance().getMetricRegistry();
Map<String, Counter> counters = metricsRegistry.getCounters();
for (Entry<String, Counter> counter : counters.entrySet()) {
Now my question is: I want to reset all my metric counts every 30 seconds. That is, when my execute method prints out the metrics, it should print the counts for that 30-second interval only (for all the metrics) instead of the counts for the whole time the program has been running.
Is there any way for all my metrics to hold counts for the last 30 seconds only, i.e. a count of whatever has happened in the last 30 seconds?
Posting this as an answer because it is too long for a comment:
You want to reset the counters. There is no API for this. The reasons are discussed in the linked GitHub issue. The article describes a possible workaround. You keep your counters and use them as usual, incrementing and decrementing, but you can't reset them. So add a new Gauge whose value follows the counter you want to reset after it has been reported. The getValue() method of the Gauge is called when the counter value is reported. After storing the current value, the method decreases the counter by that value, which effectively resets the counter to 0. So you get your report and the counter is reset as well. This is described in Step 1.
Step 2 adds a filter that prevents the actual counter from being reported, because you are now reporting through the gauge.
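The read-then-subtract idea from Step 1 can be sketched in plain Java. This is only an illustration under stated assumptions: SimpleCounter is a hypothetical stand-in that models just the inc, dec, and getCount operations of a metrics counter, and ResettingGauge is an illustrative name, not a class from the Metrics library.

```java
import java.util.concurrent.atomic.LongAdder;

final class ResetOnReadDemo {
    // Hypothetical stand-in for a metrics counter.
    static final class SimpleCounter {
        private final LongAdder count = new LongAdder();

        void inc() { count.increment(); }
        void dec(long n) { count.add(-n); }
        long getCount() { return count.sum(); }
    }

    // Gauge-like wrapper: every read reports the current count and subtracts it again.
    static final class ResettingGauge {
        private final SimpleCounter counter;

        ResettingGauge(SimpleCounter counter) {
            this.counter = counter;
        }

        long getValue() {
            long current = counter.getCount();
            // Subtract only what was read, so increments that arrive in between
            // are kept for the next report instead of being lost.
            counter.dec(current);
            return current;
        }
    }

    public static void main(String[] args) {
        SimpleCounter dropped = new SimpleCounter();
        ResettingGauge gauge = new ResettingGauge(dropped);
        dropped.inc();
        dropped.inc();
        System.out.println("first report: " + gauge.getValue());
        dropped.inc();
        System.out.println("second report: " + gauge.getValue());
    }
}
```

Subtracting the value that was read, rather than setting the counter to zero, is what keeps concurrent increments from being dropped. The sketch assumes a single reporter calls getValue(), since two concurrent readers could subtract the same value twice.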

Join two streams using a count-based window

I am new to the Flink Streaming API and I want to complete the following simple (IMO) task. I have two streams and I want to join them using count-based windows. The code I have so far is the following:
public class BaselineCategoryEquiJoin {
private static final String recordFile = "some_file.txt";
private static class ParseRecordFunction implements MapFunction<String, Tuple2<String[], MyRecord>> {
public Tuple2<String[], MyRecord> map(String s) throws Exception {
MyRecord myRecord = parse(s);
return new Tuple2<String[], MyRecord>(myRecord.attributes, myRecord);
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironment();
ExecutionConfig config = environment.getConfig();
DataStream<Tuple2<String[], MyRecord>> dataStream = environment.readTextFile(recordFile)
.map(new ParseRecordFunction());
DataStream<Tuple2<String[], MyRecord>> dataStream1 = environment.readTextFile(recordFile)
.map(new ParseRecordFunction());
DataStreamSink<Tuple2<String[], String[]>> joinedStream = dataStream1
.where(new KeySelector<Tuple2<String[],MyRecord>, String[]>() {
public String[] getKey(Tuple2<String[], MyRecord> recordTuple2) throws Exception {
return recordTuple2.f0;
}).equalTo(new KeySelector<Tuple2<String[], MyRecord>, String[]>() {
public String[] getKey(Tuple2<String[], MyRecord> recordTuple2) throws Exception {
return recordTuple2.f0;
.apply(new JoinFunction<Tuple2<String[],MyRecord>, Tuple2<String[],MyRecord>, Tuple2<String[], String[]>>() {
public Tuple2<String[], String[]> join(Tuple2<String[], MyRecord> tuple1, Tuple2<String[], MyRecord> tuple2) throws Exception {
return new Tuple2<String[], String[]>(tuple1.f0, tuple1.f0);
My code runs without errors, but it does not produce any results. In fact, the apply method is never called (verified by adding a breakpoint in debug mode). I think the main reason is that my data do not have a time attribute, so windowing (materialized through window) is not done properly. Therefore, my question is: how can I indicate that I want my join to take place based on count windows? For instance, I want the join to materialize every 100 tuples from each stream. Is this feasible in Flink? If yes, what should I change in my code to achieve it?
I should also mention that I tried to call the countWindow() method, but for some reason it is not offered by Flink's JoinedStreams.
Thank you
Count-based joins are not supported. You could emulate count-based windows by using "event-time" semantics and applying a unique sequence ID as the timestamp of each record. A time window of "5" would then effectively be a count window of 5.
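To see why a sequence ID used as a timestamp turns a time window into a count window, here is a small Flink-free sketch. It assumes each input is numbered independently from 0 and that windows are tumbling; CountWindowJoinSketch, windowOf, and join are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CountWindowJoinSketch {
    // Window index of a record whose "timestamp" is its 0-based sequence ID.
    static long windowOf(long seqId, long windowSize) {
        return seqId / windowSize;
    }

    // Joins equal keys that fall into the same count window on both inputs.
    // The list index plays the role of the sequence ID.
    static List<String> join(List<String> left, List<String> right, int windowSize) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < left.size(); i++) {
            for (int j = 0; j < right.size(); j++) {
                boolean sameWindow = windowOf(i, windowSize) == windowOf(j, windowSize);
                if (sameWindow && left.get(i).equals(right.get(j))) {
                    out.add(left.get(i) + "@" + windowOf(i, windowSize));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a", "b", "c", "d");
        List<String> right = Arrays.asList("c", "a", "b", "d");
        // With a window of 2, "a" and "d" meet in the same window; "b" and "c" do not.
        System.out.println(join(left, right, 2));
    }
}
```

Records with sequence IDs 0..4 land in window 0, 5..9 in window 1, and so on, which is exactly a count window of 5 when the window length is 5.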

How to pass a countdown latch to an Apache Storm/Trident Filter without incurring a not-serializable exception

I'm trying to create some tests to verify data going through an Apache Storm topology (using the Trident API)
I've created this simple filter to access callbacks:
public class CallbackFilter extends BaseFilter {
private final TupleCallback callback;
public CallbackFilter(TupleCallback callback) {
this.callback = callback;
public boolean isKeep(TridentTuple tuple) {
if (callback != null) {
return true;
public interface TupleCallback extends Serializable{
void callback(TridentTuple tuple);
If I try this, I get a runtime exception saying CountDownLatch is not serializable:
public void testState() throws Exception {
CountDownLatch latch = new CountDownLatch(4);
TridentTopology tridentTopology = new TridentTopology();
FeederBatchSpout spout = ...
TridentState state = ...
// problematic code:
CallbackFilter.TupleCallback callback = (CallbackFilter.TupleCallback & Serializable) tuple -> {
System.out.println("tuple = " + tuple);
latch.countDown(); //latch is not serializable - exception!
CallbackFilter latchFilter = new CallbackFilter(callback);
.each(new Fields("foo", "bar"), latchFilter);
So it appears Storm is serializing all of the components of a topology and then submitting them in the serialized form, probably for clustering or whatnot.
Is there any way of getting a callback from Storm to the calling test? Maybe some sort of test mode that doesn't serialize the topology? It's kinda hard to see what is going on inside the topology from a test point of view, especially at each stage of a topology.
Even doing something like this doesn't work!
List<TridentTuple> tupleList = new ArrayList<>();
CallbackFilter.TupleCallback callback = (CallbackFilter.TupleCallback & Serializable) tuple -> {
I see the tupleList being added to in the debugger, but in the space of the test, the list stays zero. It's like the topology is running in its own JVM.

How to invoke a scheduler every day

When a servlet (invoked from a JSP) runs for the first time, it puts a "daily" entry for that service into a conf file. I want to run a scheduler which will invoke the program (a servlet which runs and sends mail) for that service daily at 10.
Below is the code I use to execute the task. The problem is that when I stop the server, the scheduler stops and nothing happens.
public class Schedule
public static final String CONF_PATH = "../webapps/selen/WEB-INF/";
public static Properties schProps = null;
public static FileInputStream sis = null;
public static long period;
public static Timer timer = new Timer();
public static String servicename = null;
public static String keyValues = null;
public static String reValues[] = null;
public static String schedulingValue = null;
public static String service_url = null;
public static String browserlist = null;
public static String testType = null;
public static String mailCheacked = null;
public static String toaddr = null;
public static HttpServletRequest request = null;
public static HttpServletResponse response = null;
public static String serversURL = null;
public static String contextPath = null;
public static Date delay = null;
public void scheduleLoad(String serviceValue) throws Exception
schProps = new Properties();
sis = new FileInputStream(CONF_PATH);
servicename = SServlet.serviceName;
keyValues = schProps.getProperty(serviceValue);
reValues = keyValues.split(",");
String request = reValues[0];
String response = reValues[1];
schedulingValue = reValues[2];
service_url = reValues[3];
browserlist = reValues[4];
testType = reValues[5];
mailCheacked = reValues[6];
toaddr = reValues[7];
serversURL = reValues[8];
contextPath = reValues[9];
Calendar cal =Calendar.getInstance();
delay = cal.getTime();
period = 1000 * 60 * 60 * 24;
else if(reValues[2].equals("Stop"))
catch(NullPointerException npe)
System.out.println("null point exception ");
if(sis !=null)
public static void schedule()
MyTimerTask mt = new MyTimerTask(request,response,servicename,service_url,browserlist,mailCheacked,testType,schedulingValue,toaddr,serversURL,contextPath);
public static void stop()
class MyTimerTask extends TimerTask
public HttpServletRequest request;
public HttpServletResponse response;
public String servicename;
public String service_url;
public String browserlist;
public String mailCheacked;
public String testType;
public String schedulingValue;
public String toaddr;
public String serversURL;
public String contextPath;
public MyTimerTask(HttpServletRequest request,HttpServletResponse response, String servicename,String service_url,String browserlist,String mailCheacked,String testType,String schedulingValue,String toaddr,String serversURL, String contextPath)
this.request = request;
this.response = response;
this.servicename = servicename;
this.service_url = service_url;
this.browserlist = browserlist;
this.mailCheacked = mailCheacked;
this.testType = testType;
this.schedulingValue = schedulingValue;
this.toaddr = toaddr;
this.serversURL = serversURL;
this.contextPath = contextPath;
public void run()
SServlet sservlet = new SServlet();
The JDK Timer runs in the JVM, not in the operating system. It's not cron or the Windows scheduler. So when you stop your server (Tomcat? JBoss? Glassfish?), you are effectively stopping the JVM that the Timer lives in, so of course it won't run any more. If you want a timer (scheduler) that runs independently of your server, you will have to start it in its own JVM, either as a standalone Java program using the java command or inside another server instance.
On a side note, if you're open to some critique, a small review of your code:
Avoid mixing static and non-static contexts if possible. Your Schedule class instance method scheduleLoad() makes heavy use of static member variables for stateful storage. Variables are either used only during the execution of a method (in which case they should be declared inside that method), or they describe the state of an object (in which case they should be private instance members of the class), or they are global constants or immutable global variables (in which case they should be declared static final). Exceptions to these exist, but are less common.
Avoid declaring member variables public if they are not also final. Adhere to the JavaBean pattern, use getters and setters. If a variable is, in reality, a constant then it should be public static final.
Avoid using classes or parameters out of scope. For instance, your MyTimerTask uses HttpServletRequest and HttpServletResponse as member variables and method parameters. This makes no sense as MyTimerTask is not used in the scope of a servlet request (and will subsequently always be null, right?). Or, if that is indeed the case, if you are explicitly setting the static members of the Schedule in some servlet and then invoking scheduleLoad(), see my first point about improper use of static context. Your code would not be thread-safe and concurrent invocation of whichever servlet that uses the Schedule would produce unpredictable behaviour.
It's hard to know where to start as I'm not sure what your level of expertise is in Java. If you are unfamiliar with how to execute stand-alone Java applications, I would suggest having a go at some tutorials. Oracle has a bunch; a good place to start is one that walks you through a very basic "hello world" type application with a main method and how to execute it using the java command, as well as some common mistakes and problems.
Once you've figured all that out, take a few minutes to figure out what your application should do, which resources it will require and whether it needs to call any "external" systems. You mentioned that it should "execute a servlet to send mail". Does that mean that it has to call a specific servlet, or is sending the mail what you are really after? In that case, maybe you can just move all the mail-sending logic to your standalone program. If not, you will have to call the servlet using an HTTP request (like a browser would). There are a number of existing frameworks for doing things like that. Apache HttpClient is a very popular one.
If you stop the program, it does not work. That is not a bug, it is a feature. BTW, if you shut down your computer, nothing happens either :).
But if your question is how to make the scheduled task more robust, e.g. how to make the task continue working when the server stops and then starts again, you have to persist the state of your scheduler somewhere, i.e. in your case the last time the task was executed. You can implement this yourself: create a special file and store the data there. You can also use the cross-platform, pure-Java Preferences API to do this: the data will be stored in the file system on Unix and in the Registry on Windows. You can save the state in a DB too.
But you can also use other products that have already implemented this functionality. The most popular and well-known is Quartz.
But Quartz still needs some Java process to be up and running. If you want to be able to run your tasks even if no Java process is running, use platform-dependent tools: crontab for Unix and the scheduler API for Windows (it is accessible via VBScript, JScript, and the command line).
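If you go the do-it-yourself route with a persisted last execution time, the scheduling arithmetic on restart might look roughly like this. It is a minimal sketch: DailyRunPlanner, untilNextRun, and missedRun are hypothetical names, where the last run time is stored (file, Preferences, DB) is left open, and daylight-saving changes are ignored because it works on local date-times.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;

final class DailyRunPlanner {
    // Time remaining from 'now' until the next occurrence of the daily run time.
    static Duration untilNextRun(LocalDateTime now, LocalTime runAt) {
        LocalDateTime next = now.toLocalDate().atTime(runAt);
        if (!next.isAfter(now)) {
            next = next.plusDays(1); // today's slot is already over
        }
        return Duration.between(now, next);
    }

    // True when the most recent scheduled slot is later than the persisted last run,
    // i.e. a run was skipped while the process was down.
    static boolean missedRun(LocalDateTime lastRun, LocalDateTime now, LocalTime runAt) {
        LocalDateTime lastSlot = now.toLocalDate().atTime(runAt);
        if (lastSlot.isAfter(now)) {
            lastSlot = lastSlot.minusDays(1); // today's slot has not arrived yet
        }
        return lastRun.isBefore(lastSlot);
    }
}
```

On startup, the application would load the stored last run time, run the task immediately if missedRun returns true, and then schedule the next execution after the delay returned by untilNextRun.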
Unix has cron
