Java Fork Join Pool Eating All Thread Resources

I have a string parser (parsing large text blobs) that needs to run in a Java fork/join pool. The pool is faster than other threading approaches and has cut my parsing time by over 30 minutes when using both regular expressions and XPath. However, the number of threads being created climbs dramatically, and I need to be able to terminate them since the thread pool is called multiple times. How can I rein in the growth in threads without limiting the pool to just one core on a four-core system?
My thread count is exceeding 40,000 and I need it to stay closer to 5,000, since the program runs 10 times and my user has a hard limit of 50,000 threads.
This issue is happening on both Windows and Linux.
I am:
setting the maximum parallelism to the number of available processors multiplied by a configurable factor, which is currently 1
cancelling tasks after get() is called
setting the fork/join pool to null before re-instantiating it, because I am desperate
Any help would be appreciated. Thanks.
Here is the code I am using to stop, get, and restart the pool. I should also note that I am submitting each task with fjp.submit(TASK) and then invoking them all at shutdown.
while(pages.size()>0)
{
log.info("Currently Active Threads: "+Thread.activeCount());
log.info("Pages Found in the Iteration "+j+": "+pages.size());
if(fjp.isShutdown())
{
fjp=new ForkJoinPool(Runtime.getRuntime().availableProcessors()*procnum);
}
i=0;
//if asked to generate a hash, do this first
if(getHash==true){
log.info("Generating Hash");
int s=pages.size();
while(i<s){
String withhash=null;
String str=pages.get(0);
if(str != null){
jmap=Json.read(str).asJsonMap();
jmap.put("offenderhash",Json.read(genHash(jmap.get("offenderhash").asString()+i)));
for(String k:jmap.keySet()){
withhash=(withhash==null)?"{\""+k+"\":\""+jmap.get(k).asString()+"\"":withhash+",\""+k+"\":\""+jmap.get(k).asString()+"\"";
}
if(withhash != null){
withhash+=",}";
}
pages.remove(0);
pages.add((pages.size()-1), withhash);
i++;
}
}
i=0;
}
if(singlepats != null)
{
log.info("Found Singlepats");
for(String row:pages)
{
String str=row;
str=str.replaceAll("\t|\r|\r\n|\n","");
jmap=Json.read(str).asJsonMap();
if(singlepats.containsKey("table"))
{
if(fjp.isShutdown())
{
fjp=new ForkJoinPool((Runtime.getRuntime().availableProcessors()*procnum));
}
fjp=new ForkJoinPool((Runtime.getRuntime().availableProcessors()*procnum));
if(jmap.get(column)!=null)
{
if(test){
System.out.println("//////////////////////HTML////////////////////////\n"+jmap.get(column).asString()+"\n///////////////////////////////END///////////////////////////\n\n");
}
if(mustcontain != null)
{
if(jmap.get(column).asString().contains(mustcontain))
{
if(cannotcontain != null)
{
if(jmap.get(column).asString().contains(cannotcontain)==false)
results.add(fjp.submit(new ParsePage(replacementPattern,singlepats.get("table"),jmap.get(column).asString().replaceAll("\\s\\s", " "),singlepats, Calendar.getInstance().getTime().toString(), jmap.get("offenderhash").asString())));
}
else
{
results.add(fjp.submit(new ParsePage(replacementPattern,singlepats.get("table"),jmap.get(column).asString().replaceAll("\\s\\s", " "),singlepats, Calendar.getInstance().getTime().toString(), jmap.get("offenderhash").asString())));
}
}
}
else if(cannotcontain != null)
{
if(jmap.get(column).asString().contains(cannotcontain)==false)
{
results.add(fjp.submit(new ParsePage(replacementPattern,singlepats.get("table"),jmap.get(column).asString().replaceAll("\\s\\s", " "),singlepats, Calendar.getInstance().getTime().toString(), jmap.get("offenderhash").asString())));
}
}
else
{
results.add(fjp.submit(new ParsePage(replacementPattern,singlepats.get("table"),jmap.get(column).asString().replaceAll("\\s\\s", " "),singlepats, Calendar.getInstance().getTime().toString(), jmap.get("offenderhash").asString())));
}
}
}
i++;
if(((i%commit_size)==0 & i != 0) | i==pages.size() |pages.size()==1 & singlepats != null)
{
log.info("Getting Regex Results");
log.info("Shutdown");
try {
fjp.awaitTermination(termtime, TimeUnit.MILLISECONDS);
} catch (InterruptedException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
fjp.shutdown();
while(fjp.isTerminated()==false)
{
try{
Thread.sleep(5);
}catch(InterruptedException e)
{
e.printStackTrace();
}
}
for(Future<String> r:results)
{
try {
add=r.get();
if(add.contains("No Data")==false)
{
parsedrows.add(add);
}
add=null;
if(r.isDone()==false)
{
r.cancel(true);
}
if(fjp.getActiveThreadCount()>0 && fjp.getRunningThreadCount()>0)
{
fjp.shutdownNow();
}
fjp=new ForkJoinPool(Runtime.getRuntime().availableProcessors()*procnum);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ExecutionException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
results=new ArrayList<ForkJoinTask<String>>();
if(parsedrows.size()>=commit_size)
{
if(parsedrows.size()>=SPLITSIZE)
{
sendToDb(parsedrows,true);
}
else
{
sendToDb(parsedrows,false);
}
parsedrows=new ArrayList<String>();
}
//hint to the gc in case it actually pays off (think if i were a gambling man)
System.gc();
Runtime.getRuntime().gc();
}
}
}
log.info("REMAINING ROWS TO COMMIT "+parsedrows.size());
log.info("Rows Left"+parsedrows.size());
if(parsedrows.size()>0)
{
if(parsedrows.size()>=SPLITSIZE)
{
sendToDb(parsedrows,true);
}
else
{
sendToDb(parsedrows,false);
}
parsedrows=new ArrayList<String>();
}
records+=i;
i=0;
//Query for more records to parse

It looks like you're making a new ForkJoinPool for every result. What you really want to do is make a single ForkJoinPool that all your tasks share. Extra pools won't make extra parallelism available, so one should be fine. When you get a task that is ready to run, take your fjp and call fjp.execute(ForkJoinTask), or ForkJoinTask.fork() if you're already inside a task.
Making multiple pools seems like a bookkeeping nightmare. Try to get away with just one that's shared.
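A minimal sketch of that idea (hypothetical class and method names, assuming the parsing work is already wrapped in ForkJoinTask instances like the ParsePage tasks above): create the pool once, feed every batch to it, and only shut it down when the whole run is over.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;

public class SharedPoolParser {

    // one pool for the whole run, sized once from the core count
    private static final ForkJoinPool POOL =
            new ForkJoinPool(Runtime.getRuntime().availableProcessors());

    // submit every batch to the same pool; never shut it down between batches
    public static List<String> parseBatch(List<ForkJoinTask<String>> tasks)
            throws InterruptedException, ExecutionException {
        for (ForkJoinTask<String> task : tasks) {
            POOL.submit(task);
        }
        List<String> results = new ArrayList<>();
        for (ForkJoinTask<String> task : tasks) {
            results.add(task.get());   // get() blocks until that task is done
        }
        return results;
    }

    // call POOL.shutdown() exactly once, when the entire program is finished parsing
}
With a single pool like this, Thread.activeCount() should stay near the parallelism level instead of climbing with every batch.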

You are probably using join() in Java 7. join() doesn't work: it requires a context switch, and Java programs can't do a context switch, so the framework creates "continuation threads" to keep things moving. I detailed that problem several years ago in this article: ForkJoin Calamity.

Related

Will Exceptions in Project Loom someday percolate up through ExecutorService contexts?

From loom-lab, given the code
var virtualThreadFactory = Thread.ofVirtual().factory();
try (var executorService = Executors.newThreadPerTaskExecutor(virtualThreadFactory)) {
IntStream.range(0, 15).forEach(item -> {
executorService.submit(() -> {
try {
var milliseconds = item * 1000;
System.out.println(Thread.currentThread() + " sleeping " + milliseconds + " milliseconds");
Thread.sleep(milliseconds);
System.out.println(Thread.currentThread() + " awake");
if (item == 8) throw new RuntimeException("task 8 is acting up");
} catch (InterruptedException e) {
System.out.println("Interrupted task = " + item + ", Thread ID = " + Thread.currentThread());
}
});
});
}
catch (RuntimeException e) {
System.err.println(e.getMessage());
}
My hope was that the code would catch the RuntimeException and print the message, but it does not.
Am I hoping for too much, or will this someday work as I hope?
In response to an amazing answer by Stephen C, which I can fully appreciate: upon further exploration I discovered, via
static String spawn(
ExecutorService executorService,
Callable<String> callable,
Consumer<Future<String>> consumer
) throws Exception {
try {
var result = executorService.submit(callable);
consumer.accept(result);
return result.get(3, TimeUnit.SECONDS);
}
catch (TimeoutException e) {
// The timeout expired...
return callable.call() + " - TimeoutException";
}
catch (ExecutionException e) {
// Why doesn't malcontent get caught here?
return callable.call() + " - ExecutionException";
}
catch (CancellationException e) { // future.cancel(false);
// Exception was thrown
return callable.call() + " - CancellationException";
}
catch (InterruptedException e) { // future.cancel(true);
return callable.call() + "- InterruptedException ";
}
}
and
try (var executorService = Executors.newThreadPerTaskExecutor(threadFactory)) {
Callable<String> malcontent = () -> {
Thread.sleep(Duration.ofSeconds(2));
throw new IllegalStateException("malcontent acting up");
};
System.out.println("\n\nresult = " + spawn(executorService, malcontent, (future) -> {}));
} catch (Exception e) {
e.printStackTrace(); // malcontent gets caught here
}
I was expecting malcontent to get caught in spawn as an ExecutionException per the documentation, but it does not. Consequently, I have trouble reasoning about my expectations.
Much of my hope for Project Loom was that, unlike Functional Reactive Programming, I could once again rely on Exceptions to do the right thing, and reason about them such that I could predict what would happen without having to run experiments to validate what really happens.
As Steve Jobs (at NeXT) used to say: "It just works"
So far, my posting on loom-dev@openjdk.java.net has not been responded to, which is why I have used Stack Overflow. I don't know the best way to engage the Project Loom developers.
This is speculation ... but I don't think so.
According to the provisional javadocs, ExecutorService now inherits AutoCloseable, and it is specified that the default behavior of the close() method is to perform a clean shutdown and wait for it to complete. (Note that this is described as default behavior, not required behavior!)
So why couldn't they change the behavior to catch and re-signal the exceptions on this thread's stack?
One problem is specifying patterns of behavior that are logically consistent both for this case and for the case where the ExecutorService is not used as a resource in a try-with-resources. To implement the behavior in this case, the close() method has to be informed by some other part of the executor service of the task's unhandled exception. But if nothing calls close(), then the exceptions can't be re-raised. And if close() is called in a finalizer or similar, there probably won't be anything there to handle them. At the very least, it is complicated.
A second problem is that it would be difficult to handle the exception(s) in the general case. What if more than one task failed with an exception? What if different tasks failed with different exceptions? How does the code that handles the exception (e.g. your catch (RuntimeException e) ...) figure out which task failed?
A third problem is that this would be a breaking change. In Java 17 and earlier, the above code would not propagate any exceptions from the tasks. In Java 18 and later it would. Java 17 code that assumed there were no "random" exceptions from failed tasks delivered to this thread would break.
A fourth point is that this would be a nuisance in use-cases where the Java 18+ programmer wants to treat the executor service as a resource, but does not want to deal with "stray" exceptions on this thread. (I suspect that would be the majority of use-cases for autoclosing an executor service.)
A fifth problem (if you want to call it that) is that it is a breaking change for early adopters of Loom. (I am reading your question as saying that you tried it with Loom and it currently doesn't behave as you proposed.)
The final problem is that there are already ways to capture a task's exception and deliver it; e.g. via the Future objects returned when you submit a task. This proposal is not filling a gap in ExecutorService functionality.
(Phew!)
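To illustrate that last point, a minimal sketch of retrieving a task's failure through the Future that submit() returns (works on any JDK, no Loom required):
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureExceptionDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(1);
        try {
            Callable<Void> failing = () -> {
                throw new IllegalStateException("task is acting up");
            };
            Future<Void> future = executor.submit(failing);
            try {
                future.get();                      // blocks, then rethrows the failure...
            } catch (ExecutionException e) {
                System.err.println(e.getCause());  // ...wrapped, with the original as the cause
            }
        } finally {
            executor.shutdown();
        }
    }
}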
Of course, I don't know what the Java developers will actually do. And we won't collectively know until Loom is finally released as a non-preview feature of mainstream Java.
Anyhow, if you want to lobby for this, you should email the Loom mailing list about it.
Loom has made many improvements, such as making ExecutorService AutoCloseable, which simplifies coding by eliminating calls to shutdown()/awaitTermination().
Your point about the expectation of neat exception handling applies to typical usage of ExecutorService in any JDK, not just the upcoming Loom release, so IMO it doesn't obviously need to be tied to the Loom work.
The error handling you wish for is quite easy to incorporate with any version of the JDK by adding a few lines of code around code blocks that use ExecutorService:
var ex = new AtomicReference<RuntimeException>();
try {
// add any use of ExecutorService here
// eg OLD JDK style:
// var executorService = Executors.newFixedThreadPool(5);
try (var executorService = Executors.newThreadPerTaskExecutor(virtualThreadFactory)) {
...
if (item == 8) {
// Save exception before sending:
ex.set(new RuntimeException("task 8 is acting up"));
throw ex.get();
}
...
}
// OR: not-LOOM JDK call executorService.shutdown/awaitTermination here
// Pass on any handling problem
if (ex.get() != null)
throw ex.get();
}
catch (Exception e) {
System.err.println("Exception was: "+e.getMessage());
}
Not as elegant as you hope for, but it works in any JDK release.
EDIT On your edited question:
You've put callable.call() inside catch (ExecutionException e) {, so you've lost the first exception and malcontent raises a second one. Add a System.out.println to see the original:
catch (ExecutionException e) {
System.out.println(Thread.currentThread()+" ExecutionException: "+e);
e.printStackTrace();
// Why doesn't malcontent get caught here?
return callable.call() + " - ExecutionException";
}
I think the closest to what you are trying to achieve is:
try(var executor = StructuredExecutor.open()) {
var handler = new StructuredExecutor.ShutdownOnFailure();
IntStream.range(0, 15).forEach(item -> {
executor.fork(() -> {
var milliseconds = item * 100;
System.out.println(Thread.currentThread()
+ "sleeping " + milliseconds + " milliseconds");
Thread.sleep(milliseconds);
System.out.println(Thread.currentThread() + " awake");
if(item == 8) {
throw new RuntimeException("task 8 is acting up");
}
return null;
}, handler);
});
executor.join();
handler.throwIfFailed();
}
catch(InterruptedException|ExecutionException ex) {
System.err.println("Caught in initiator thread");
ex.printStackTrace();
}
which will run all jobs in virtual threads and raise an exception in the initiator thread when one of the jobs fails. StructuredExecutor is a new tool introduced by Project Loom which allows diagnostic tools to show that the created virtual threads are owned by this specific job. But note that its close() won't wait for completion; rather, it requires the owner to do this before closing, and throws an exception if the developer failed to do so.
The behavior of classic ExecutorService implementations won’t change.
A solution for the ExecutorService would be
try(var executor = Executors.newVirtualThreadPerTaskExecutor()) {
var jobs = executor.invokeAll(IntStream.range(0, 15).<Callable<?>>mapToObj(item ->
() -> {
var milliseconds = item * 100;
System.out.println(Thread.currentThread()
+ " sleeping " + milliseconds + " milliseconds");
Thread.sleep(milliseconds);
System.out.println(Thread.currentThread() + " awake");
if(item == 8) {
throw new RuntimeException("task 8 is acting up");
}
return null;
}).toList());
for(var f: jobs) f.get();
}
catch(InterruptedException|ExecutionException ex) {
System.err.println("Caught in initiator thread");
ex.printStackTrace();
}
Note that while invokeAll waits for the completion of all jobs, we still need the loop calling get() to force an ExecutionException to be thrown in the initiating thread.

How can I know who caused an InterruptedException? (Java)

I'm using interrupt() in my code to signal from one thread to another that it should wake up from an "endless" (maximum-duration) sleep and check a condition in a while loop.
I'm also using monitors (synchronized blocks, notify and wait) and a synchronized method. I wrote my code so that some threads sleep until they get an interrupt, but some interrupts wake a thread when it should not be woken (the threads simulate doing other things by sleeping). The problem is that I'm not able to find the thread that calls interrupt() when it should not. How can I find it?
Is using interrupt() in this way a good way to code?
This is the code in which the sleep gets interrupted but should not:
private void medicalVisit(int number) {
long sleepTime = (long) ((Math.random() * 2 + 0.5) * 1000); // 500 <= sleepTime (in msec) <= 2500
try {
Thread.sleep(sleepTime);
} catch (InterruptedException e) {
System.out.println(this.getName()+" ERROR, interrupt from sleep, id: 2 (medicalVisit)");
e.printStackTrace();
}
System.out.println(this.getName()+" - "+number+"° medical visit ended");
}
This is an example of code that launches an interrupt:
private void handlerYellowPatient() {
Iterator<Patient> patientIt = yellows.iterator();
while(patientIt.hasNext()) {
Patient p = patientIt.next();
p.itsTurn = true;
p.interrupt();
yellows.remove(p);
}
}
And this is an example of code "consuming" an interrupt properly:
private void waitUntilItsTurn(int number) {
// simulating the period of time before entering in guard
long sleepTime = (long) ((Math.random() * 2 + 0.5) * 1000); // 500 <= sleepTime (in msec) <= 2500
try {
Thread.sleep(sleepTime);
} catch (InterruptedException e) {
// must not be awaken while here
System.out.println(this.getName()+" ERROR MAYBE, interrupt from sleep, id: 1");
e.printStackTrace();
}
WMan.addPatient(this, WMan);
while (!itsTurn) {
try {
Thread.sleep(Long.MAX_VALUE);
} catch (InterruptedException e) {
// WMan handlerRedPatient interrupt#1
System.out.println(this.getName()+" - the wait is over, it's my turn for the "+number+"° times");
}
}
itsTurn = false;
}
Hoping this code can help.
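One way to track down the stray interrupts (a minimal sketch, assuming all of the interrupting code is yours to modify) is to route every interrupt through a small helper that logs who is interrupting whom, together with a stack trace of the call site:
public final class TracedInterrupt {
    private TracedInterrupt() {}

    // Call this instead of target.interrupt() everywhere in your code.
    public static void interrupt(Thread target, String reason) {
        System.out.println(Thread.currentThread().getName()
                + " interrupting " + target.getName() + " (" + reason + ")");
        new Exception("interrupt() call site").printStackTrace(System.out);
        target.interrupt();
    }
}
Calling TracedInterrupt.interrupt(p, "handlerYellowPatient") instead of p.interrupt() makes every unexpected wake-up traceable to the line that triggered it.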

How to scan and delete millions of rows in HBase

What Happened
All the data from last month was corrupted due to a bug in the system. So we have to delete and re-input these records manually. Basically, I want to delete all the rows inserted during a certain period of time. However, I found it difficult to scan and delete millions of rows in HBase.
Possible Solutions
I found two ways to bulk delete:
The first is to set a TTL, so that all the outdated records would be deleted automatically by the system. But I want to keep the records inserted before last month, so this solution does not work for me.
The second option is to write a client using the Java API:
public static void deleteTimeRange(String tableName, Long minTime, Long maxTime) {
Table table = null;
Connection connection = null;
try {
Scan scan = new Scan();
scan.setTimeRange(minTime, maxTime);
connection = HBaseOperator.getHbaseConnection();
table = connection.getTable(TableName.valueOf(tableName));
ResultScanner rs = table.getScanner(scan);
List<Delete> list = getDeleteList(rs);
if (list.size() > 0) {
table.delete(list);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
if (null != table) {
try {
table.close();
} catch (IOException e) {
e.printStackTrace();
}
}
if (connection != null) {
try {
connection.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
private static List<Delete> getDeleteList(ResultScanner rs) {
List<Delete> list = new ArrayList<>();
try {
for (Result r : rs) {
Delete d = new Delete(r.getRow());
list.add(d);
}
} finally {
rs.close();
}
return list;
}
But in this approach, all the records are stored in the ResultScanner rs, so the heap size would be huge. And if the program crashes, it has to start from the beginning.
So, is there a better way to achieve the goal?
Don't know how many 'millions' you are dealing with in your table, but the simplest thing is to not try to put them all into a List at once but to do it in more manageable steps using the .next(n) function. Something like this:
for (Result row : rs.next(numRows))
{
Delete del = new Delete(row.getRow());
...
}
This way, you can control how many rows get returned from the server in a single RPC through the numRows parameter. Make sure it's large enough so as not to make too many round-trips to the server, but at the same time not so large that it kills your heap. You can also use the BufferedMutator to operate on multiple Deletes at once.
Hope this helps.
I would suggest two improvements:
Use BufferedMutator to batch your deletes; it does exactly what you need – it keeps an internal buffer of mutations and flushes it to HBase when the buffer fills up, so you do not have to worry about keeping your own list, sizing it, and flushing it.
Improve your scan:
Use KeyOnlyFilter – since you do not need the values, there is no need to retrieve them.
Use scan.setCacheBlocks(false) – since you are doing a full-table scan, caching all blocks on the region server does not make much sense.
Tune scan.setCaching(N) and scan.setBatch(N) – N will depend on the size of your keys; keep a balance between caching more and the memory it will require, but since you only transfer keys, N could be quite large, I suppose.
Here's an updated version of your code:
public static void deleteTimeRange(String tableName, Long minTime, Long maxTime) {
try (Connection connection = HBaseOperator.getHbaseConnection();
final Table table = connection.getTable(TableName.valueOf(tableName));
final BufferedMutator mutator = connection.getBufferedMutator(TableName.valueOf(tableName))) {
Scan scan = new Scan();
scan.setTimeRange(minTime, maxTime);
scan.setFilter(new KeyOnlyFilter());
scan.setCaching(1000);
scan.setBatch(1000);
scan.setCacheBlocks(false);
try (ResultScanner rs = table.getScanner(scan)) {
for (Result result : rs) {
mutator.mutate(new Delete(result.getRow()));
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
Note the use of "try with resource" – if you omit that, make sure to .close() mutator, rs, table, and connection.

Parallel processing using a collection of CompletableFuture supplyAsync calls, then collecting results

//Unit of logic I want to run in parallel
public PagesDTO convertOCRStreamToDTO(String pageId, Integer pageSequence) throws Exception {
LOG.info("Get OCR begin for pageId [{}] thread name {}",pageId, Thread.currentThread().getName());
OcrContent ocrContent = getOcrContent(pageId);
OcrDTO ocrData = populateOCRData(ocrContent.getInputStream());
PagesDTO pageDTO = new PagesDTO(pageId, pageSequence.toString(), ocrData);
return pageDTO;
}
Logic to execute convertOCRStreamToDTO(..) in parallel, then collect the results when each individual thread's execution is done:
List<PagesDTO> pageDTOList = new ArrayList<>();
//javadoc: Creates a work-stealing thread pool using all available processors as its target parallelism level.
ExecutorService newWorkStealingPool = Executors.newWorkStealingPool();
Instant start = Instant.now();
List<CompletableFuture<PagesDTO>> pendingTasks = new ArrayList<>();
List<CompletableFuture<PagesDTO>> completedTasks = new ArrayList<>();
CompletableFuture<PagesDTO> task = null;
for (InputPageDTO dcInputPageDTO : dcReqDTO.getPages()) {
String pageId = dcInputPageDTO.getPageId();
task = CompletableFuture
.supplyAsync(() -> {
try {
return convertOCRStreamToDTO(pageId, pageSequence.getAndIncrement());
} catch (HttpHostConnectException | ConnectTimeoutException e) {
LOG.error("Error connecting to Redis for pageId [{}]", pageId, e);
CaptureException e1 = new CaptureException(Error.getErrorCodes().get(ErrorCodeConstants.REDIS_CONNECTION_FAILURE),
" Connecting to the Redis failed while getting OCR for pageId ["+pageId +"] " + e.getMessage(), CaptureErrorComponent.REDIS_CACHE, e);
exceptionMap.put(pageId,e1);
} catch (CaptureException e) {
LOG.error("Error in Document Classification Engine Service while getting OCR for pageId [{}]",pageId,e);
exceptionMap.put(pageId,e);
} catch (Exception e) {
LOG.error("Error getting OCR content for the pageId [{}]", pageId,e);
CaptureException e1 = new CaptureException(Error.getErrorCodes().get(ErrorCodeConstants.TECHNICAL_FAILURE),
"Error while getting ocr content for pageId : ["+pageId +"] " + e.getMessage(), CaptureErrorComponent.REDIS_CACHE, e);
exceptionMap.put(pageId,e1);
}
return null;
}, newWorkStealingPool);
//collect all async tasks
pendingTasks.add(task);
}
//TODO: How to avoid unnecessary loops which is happening here just for the sake of waiting for the future tasks to complete???
//TODO: Looking for the best solutions
while(pendingTasks.size() > 0) {
for(CompletableFuture<PagesDTO> futureTask: pendingTasks) {
if(futureTask != null && futureTask.isDone()){
completedTasks.add(futureTask);
pageDTOList.add(futureTask.get());
}
}
pendingTasks.removeAll(completedTasks);
}
//Throw the exception caught while converting the OCR stream to DTO - for any of the pageIds
for(InputPageDTO dcInputPageDTO : dcReqDTO.getPages()) {
if(exceptionMap.containsKey(dcInputPageDTO.getPageId())) {
CaptureException e = exceptionMap.get(dcInputPageDTO.getPageId());
throw e;
}
}
LOG.info("Parallel processing time taken for {} pages = {}", dcReqDTO.getPages().size(),
org.springframework.util.StringUtils.deleteAny(Duration.between(Instant.now(), start).toString().toLowerCase(), "pt-"));
Please look at the TODO items in my code above. I have two concerns below for which I am looking for advice:
1) I want to avoid the unnecessary looping (happening in the while loop above). What is the best way to wait for all threads to complete their async execution and then collect the results? Does anybody have advice?
2) The ExecutorService instance is created at my service bean class level, with the idea that it will be re-used for every request, instead of creating it locally in the method and shutting it down in a finally block. Am I doing this right, or is there a correction to my thought process?
Simply remove the while and the if and you are good:
for(CompletableFuture<PagesDTO> futureTask: pendingTasks) {
completedTasks.add(futureTask);
pageDTOList.add(futureTask.get());
}
get() (as well as join()) will wait for the future to complete before returning a value. Also, there is no need to test for null since your list will never contain any.
You should, however, probably change the way you handle exceptions. CompletableFuture has a specific mechanism for handling them and rethrowing them when calling get()/join(). You might simply want to wrap your checked exceptions in CompletionException.
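A minimal sketch of that suggestion (convert below is a stand-in for convertOCRStreamToDTO, so the names are illustrative): wrap the checked exception in a CompletionException inside supplyAsync, and read the original back as the cause of the ExecutionException when calling get():
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CompletionExceptionDemo {

    // stand-in for convertOCRStreamToDTO: a method that declares a checked exception
    static String convert(String pageId) throws Exception {
        throw new Exception("OCR failed for " + pageId);
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newWorkStealingPool();
        CompletableFuture<String> task = CompletableFuture.supplyAsync(() -> {
            try {
                return convert("page-1");
            } catch (Exception e) {
                throw new CompletionException(e);   // surfaces again when get()/join() is called
            }
        }, pool);
        try {
            System.out.println(task.get());
        } catch (ExecutionException e) {
            System.err.println("task failed: " + e.getCause());   // the original checked exception
        } finally {
            pool.shutdown();
        }
    }
}
get() unwraps the CompletionException, so e.getCause() here is the original exception thrown by convert; join() would instead throw the CompletionException itself.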

How to catch "NotesException: Notes error: Remote system no longer responding" and retry?

I have a Java agent that processes a huge number of documents and could run overnight. The problem is that I need the agent to retry if the network suddenly disconnects briefly. The retry could have a maximum count.
int numberOfRetries = 0;
try {
while(nextdoc != null) {
// process documents
numberOfRetries = 0;
}
} catch (NotesException e) {
numberOfRetries++;
if (numberOfRetries > 4) {
// go back and reprocess current document
} else {
// message reached max number of retries. did not successfully finished
}
}
Also, of course I do not want to actually retry the whole process. Basically I need to continue on the document it was processing and move on to the next loop
You should do a retry loop around each piece of code that gets a document. Since the Notes classes generally require a getFirst and getNext paradigm, that means you need two separate retry loops. E.g.,
numberOfRetries = 0;
maxRetries = 4;
// get first document, with retries
needToRetry = true;
while (needToRetry)
{
try
{
while (needToRetry)
{
nextDoc = myView.getFirstDocument();
needToRetry=false;
}
}
catch (NotesException e)
{
numberOfRetries++;
if (numberOfRetries < maxRetries) {
// you might want to sleep here to wait for the network to recover
// you could use numberOfRetries as a factor to sleep longer on
// each failure
needToRetry = true;
} else {
// write "Max retries have been exceeded getting first document" to log
needToRetry = false;
nextDoc = null; // we won't go into the processing loop
}
}
}
// process all documents
while (nextDoc != null)
{
// process nextDoc
// insert your code here
// now get next document, with retries
needToRetry = true;
while (needToRetry)
{
try
{
nextDoc = myView.getNextDocument(nextDoc);
needToRetry=false;
}
catch (NotesException e)
{
numberOfRetries++;
if (numberOfRetries < maxRetries) {
// you might want to sleep here to wait for the network to recover
// you could use numberOfRetries as a factor to sleep longer on
// each failure
needToRetry = true;
} else {
// write "Max retries have been exceeded getting first document" to log
nextDoc = false; // we'lll be exiting the processing loop without finishing all docs
}
}
}
}
Note that I'm treating maxRetries as the max total retries across all documents in the data set, not the max for each document.
Also note that it's probably cleaner to break this up a little. E.g.
numberOfRetries = 0;
maxRetries = 4;
nextDoc = getFirstDocWithRetries(view); // this contains while loop and try-catch
while (nextDoc != null)
{
processOneDoc(nextDoc);
nextDoc = getNextDocWithRetries(view,nextDoc); // and so does this
}
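A minimal sketch of what one of those helpers might look like (hypothetical; it assumes the maxRetries field above and the usual lotus.domino imports), with a sleep that backs off a little more on each failure; getNextDocWithRetries would look the same but call view.getNextDocument(nextDoc):
private Document getFirstDocWithRetries(View view)
{
    int attempts = 0;
    while (true)
    {
        try
        {
            return view.getFirstDocument();
        }
        catch (NotesException e)
        {
            attempts++;
            if (attempts >= maxRetries)
            {
                // write "Max retries have been exceeded getting first document" to log
                return null; // caller will skip the processing loop
            }
            try
            {
                Thread.sleep(attempts * 5000L); // wait for the network to recover, longer each time
            }
            catch (InterruptedException ie)
            {
                Thread.currentThread().interrupt();
                return null;
            }
        }
    }
}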
I would not recommend what you are doing at all.
The NotesException can fire for a number of reasons, and there is no guarantee you will be returning to a safe state.
Also, the fact that the agent needs to run for such a long time means you need to change the server "Maximum execution timeout" to allow it to run correctly. Setting that to a very high value makes the server more prone to performance/deadlock issues.
A better solution would be to batch the workload and have the agent run for a set time on each batch. Update your progress as you go, so that when the agent comes back it knows to work on the next batch.
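A rough sketch of that batching idea (the time budget is illustrative, how you persist the resume point depends on your design, and the Notes calls would still sit inside your existing try/catch or the retry helpers above):
long budgetMs = 30L * 60L * 1000L; // let this invocation run for at most 30 minutes
long start = System.currentTimeMillis();
Document doc = view.getFirstDocument();
while (doc != null && (System.currentTimeMillis() - start) < budgetMs)
{
    // process doc here, then record doc.getUniversalID() somewhere durable
    // (for example a config/profile document) so the next scheduled run resumes after it
    Document next = view.getNextDocument(doc);
    doc.recycle(); // release the backend object as you go
    doc = next;
}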
