ParallelStream queues tasks in the common pool rather than the custom pool - java

I wanted to use a custom ThreadPool for a parallelStream, because I wanted the MDC context to be available inside the tasks. This is the code I wrote to use the custom ThreadPool:
final ExecutorService mdcPool = MDCExecutors.newCachedThreadPool();
mdcPool.submit(() -> ruleset.getOperationList().parallelStream().forEach(operation -> {
    log.info("Sample log line");
}));
The MDC context was not getting copied to the tasks, so I looked at the logs. These are the logs I found. The first log line is executed on "(pool-16-thread-1)" and carries the MDC context ID, but the other tasks are executed on "ForkJoinPool.commonPool-worker" threads. Since I am submitting the task to a custom ThreadPool, I expected all tasks to execute in that pool.
16 Oct 2018 12:46:58,298 [INFO] 8fcfa6ee-d141-11e8-b84a-7da6cd73aa0b (pool-16-thread-1) com.amazon.rss.activity.business.VariablesEvaluator: Sample log line
16 Oct 2018 12:46:58,298 [INFO] (ForkJoinPool.commonPool-worker-11) com.amazon.rss.activity.business.VariablesEvaluator: Sample log line
16 Oct 2018 12:46:58,298 [INFO] (ForkJoinPool.commonPool-worker-4) com.amazon.rss.activity.business.VariablesEvaluator: Sample log line
16 Oct 2018 12:46:58,298 [INFO] (ForkJoinPool.commonPool-worker-13) com.amazon.rss.activity.business.VariablesEvaluator: Sample log line
16 Oct 2018 12:46:58,298 [INFO] (ForkJoinPool.commonPool-worker-9) com.amazon.rss.activity.business.VariablesEvaluator: Sample log line
16 Oct 2018 12:46:58,299 [INFO] (ForkJoinPool.commonPool-worker-2) com.amazon.rss.activity.business.VariablesEvaluator: Sample log line
16 Oct 2018 12:46:58,299 [INFO] (ForkJoinPool.commonPool-worker-15) com.amazon.rss.activity.business.VariablesEvaluator: Sample log line
Is this supposed to happen or am I missing something?

There is no support for running a parallel stream in a custom thread pool. It happens to execute in a different Fork/Join pool when the operation is initiated from a worker thread of that pool, but this does not appear to be a planned feature, as the Stream implementation will still use artifacts of the common pool internally for some operations.
In your case, the ExecutorService returned by MDCExecutors.newCachedThreadPool() is apparently not a Fork/Join pool, so it does not exhibit this undocumented behavior at all; the stream's tasks end up in the common pool.
There is a feature request, JDK-8032512, regarding more thread control. It’s open and, as far as I can see, without much activity.
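For completeness, here is a minimal sketch of the undocumented trick mentioned above: initiating the terminal operation from inside a custom ForkJoinPool so the stream's tasks run (mostly) in that pool. The pool size and sample data are assumptions for illustration, and nothing here is guaranteed by the specification:
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;

public class CustomPoolParallelStream {
    public static void main(String[] args) throws Exception {
        List<String> items = Arrays.asList("a", "b", "c", "d");
        ForkJoinPool customPool = new ForkJoinPool(4); // assumed size
        try {
            // Tasks forked by the parallel stream inherit the submitting
            // worker's pool, so they largely run in customPool.
            customPool.submit(() ->
                items.parallelStream().forEach(item ->
                    System.out.println(Thread.currentThread().getName() + ": " + item))
            ).get(); // block until the stream completes
        } finally {
            customPool.shutdown();
        }
    }
}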


How to get scalable I/O-bound async multi-threading from the Java SDK or a similar SDK (i.e. RxJava, Project Reactor)?

Want
Many threads that each make a database call and block, in order to improve throughput and scale.
Problems:
The standard Java CompletableFuture API does not work well with blocking I/O tasks, even when using ManagedBlocker.
Even with a library that does not have this problem, too many async requests at the same time cause at least one scaling problem:
Too many threads created at once can lead to an out-of-memory error because of how much memory each thread needs, and there is no good default ThreadPoolExecutor factory that lets you set thread-pool parameters such as a maximum number of threads together with a queue where incoming tasks wait until a thread becomes available.
Example
I want to scale a program that needs to make 3000 async DB requests. Instead of issuing all 3000 at once, I want to limit it to 50 in flight at any given time, queue the remaining 2950, and start one queued request each time a task completes. Ideally I would like to do this using existing libraries rather than re-inventing it with custom code; I assume there is a way to do this, but I am unsure how to use the APIs of the various async Java SDKs that keep coming out.
I think there are a couple of ways of addressing the unbounded thread pool. One is, as others point out, to create an RxJava Scheduler from an Executor backed by a bounded thread pool. That is pretty straightforward and may very well be the best approach.
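For illustration, a minimal sketch of that first approach, assuming RxJava 2; the pool size of 50 matches the example in the question:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.reactivex.Scheduler;
import io.reactivex.schedulers.Schedulers;

public class BoundedSchedulerExample {
    public static void main(String[] args) {
        // At most 50 threads; tasks beyond that wait in the executor's queue.
        ExecutorService pool = Executors.newFixedThreadPool(50);
        Scheduler bounded = Schedulers.from(pool);

        // Use `bounded` with subscribeOn()/observeOn() in your pipelines,
        // and shut the pool down when the application is done with it:
        pool.shutdown();
    }
}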
But I do want to point out that RxJava's "parallelizing" operators (flatMap, concatMapEager) also take an optional maxConcurrency argument that allows us to decouple the number of swimlanes in a given Rx pipeline from the Scheduler being used to execute it.
Here's a hypothetical example. Let's say we have a Data Access Object that performs blocking queries; in this case it just sleeps for 1 second and returns the query itself, upper-cased, with a timestamp appended:
import java.util.Date;

public class MyDao
{
    public Object blockingGetData( String query ) throws InterruptedException
    {
        Thread.sleep( 1000 );
        return query.toUpperCase() + " - " + new Date().toString();
    }
}
Next, let's wrap the DAO in an async Service that maintains an Rx pipeline where each element represents a query and its async result:
import io.reactivex.Observable;
import io.reactivex.Single;
import io.reactivex.schedulers.Schedulers;
import io.reactivex.subjects.AsyncSubject;
import io.reactivex.subjects.PublishSubject;
import io.reactivex.subjects.Subject;

public class MyService
{
    private class QueryHolder
    {
        final String query;
        final Subject<Object> result;

        public QueryHolder( String query, Subject<Object> result )
        {
            this.query = query;
            this.result = result;
        }
    }

    private static final int MAX_CONCURRENCY = 2;
    private final Subject<QueryHolder> querySubject;
    private final MyDao dao;

    public MyService()
    {
        dao = new MyDao();
        querySubject = PublishSubject.<QueryHolder>create().toSerialized();

        querySubject
            .flatMap(
                // For each element in the pipeline, perform the blocking
                // get on the IO Scheduler, populating the result Subject:
                queryHolder -> Observable.just( queryHolder )
                    .subscribeOn( Schedulers.io() )
                    .doOnNext( __ -> {
                        Object data = dao.blockingGetData( queryHolder.query );
                        queryHolder.result.onNext( data );
                        queryHolder.result.onComplete();
                    } ),
                // With max concurrency limited:
                MAX_CONCURRENCY )
            .subscribe();
    }

    public Single<Object> getData( String query )
    {
        Subject<Object> result = AsyncSubject.create();
        // Emit pipeline element:
        querySubject.onNext( new QueryHolder( query, result ));
        return result.firstOrError();
    }
}
I recommend you google the different subject types and operators, etc. - there's tons of documentation available.
A simple manual test:
@Test
public void testService() throws InterruptedException
{
    MyService service = new MyService();

    // Issue 20 queries immediately, printing the results when they complete:
    for ( int i = 0; i < 20; i++ )
    {
        service.getData( "query #" + i )
            .subscribe( System.out::println );
    }

    // Sleep long enough for all queries to finish:
    Thread.sleep( 11000 );
}
Output:
QUERY #0 - Wed Mar 11 11:08:21 EDT 2020
QUERY #1 - Wed Mar 11 11:08:21 EDT 2020
QUERY #2 - Wed Mar 11 11:08:22 EDT 2020
QUERY #3 - Wed Mar 11 11:08:22 EDT 2020
QUERY #4 - Wed Mar 11 11:08:23 EDT 2020
QUERY #5 - Wed Mar 11 11:08:23 EDT 2020
QUERY #6 - Wed Mar 11 11:08:24 EDT 2020
QUERY #7 - Wed Mar 11 11:08:24 EDT 2020
QUERY #8 - Wed Mar 11 11:08:25 EDT 2020
QUERY #9 - Wed Mar 11 11:08:25 EDT 2020
QUERY #10 - Wed Mar 11 11:08:26 EDT 2020
QUERY #11 - Wed Mar 11 11:08:26 EDT 2020
QUERY #12 - Wed Mar 11 11:08:27 EDT 2020
QUERY #13 - Wed Mar 11 11:08:27 EDT 2020
QUERY #14 - Wed Mar 11 11:08:28 EDT 2020
QUERY #15 - Wed Mar 11 11:08:28 EDT 2020
QUERY #16 - Wed Mar 11 11:08:29 EDT 2020
QUERY #17 - Wed Mar 11 11:08:29 EDT 2020
QUERY #18 - Wed Mar 11 11:08:30 EDT 2020
QUERY #19 - Wed Mar 11 11:08:30 EDT 2020

Returning NOBODY because of SkipAdminCheck

[INFO] Oct 06, 2016 11:24:54 AM com.google.apphosting.utils.jetty.AppEngineAuthentication$AppEngineAuthenticator authenticate
[INFO] INFO: Returning NOBODY because of SkipAdminCheck.
It seems this log message is produced by the TaskQueue:
Queue qu = QueueFactory.getQueue(qname);
qu.add(TaskOptions.Builder.withUrl("/task/" + qname)
        .payload("{\"token\":\"asdf1234\"}", "UTF-8")
        .method(TaskOptions.Method.POST)
        .header("Host", ModulesServiceFactory.getModulesService().getVersionHostname(null, null)));
Any suggestions how to fix it? Of course I googled, but Google found just three pages about it, and the first two were about adding the Host header. As you can see above, I have already added the following, but the code still produces these messages in the log:
.header("Host", ModulesServiceFactory.getModulesService().getVersionHostname(null,null))

RetryHandler Exceptions while using MapOnlyMapper in google appengine

I have a very large dataset and want to update certain entity kinds. I am exploring the MapReduce library in Google App Engine. I have followed the examples listed here:
https://github.com/GoogleCloudPlatform/appengine-mapreduce/tree/master/java/example/src/com/google/appengine/demos/mapreduce/entitycount
What I am basically doing is this, in my MapSpecification:
MapSpecification<Entity, Entity, Void> spec = new MapSpecification.Builder<>(
        new DatastoreKeyInput(query, 2),
        new UrlFlattenMapper(),
        new DatastoreOutput())
    .setJobName("Flatten URLs entities")
    .build();
My Mapper performs its operations on the Entity and then emits it, for the DatastoreOutput writer to write back into the database.
My problem is that the Entities are getting updated fine, and endSlice is also being called in my mapper task, but the job is not completing. I keep getting these errors:
[INFO] INFO: RetryHelper(28.07 ms, 1 attempts, java.util.concurrent.Executors$RunnableAdapter@7f0264e0): Attempt #1 failed [java.lang.RuntimeException: Can't serialize object: MapOnlyShardTask[context=IncrementalTaskContext[jobId=3c041e68-5041-458c-994b-290cd941f8bb, shardNumber=1, shardCount=2, lastWorkItem=Topics("jzdh"), workerCallCount=297, workerTimeMillis=42513], inputExhausted=true, isFirstSlice=false]], sleeping for 1028 ms
[INFO] Apr 26, 2016 4:39:37 PM com.google.appengine.tools.cloudstorage.RetryHelper doRetry
[INFO] INFO: RetryHelper(1.085 s, 2 attempts, java.util.concurrent.Executors$RunnableAdapter@7f0264e0): Attempt #2 failed [java.lang.RuntimeException: Can't serialize object: MapOnlyShardTask[context=IncrementalTaskContext[jobId=3c041e68-5041-458c-994b-290cd941f8bb, shardNumber=1, shardCount=2, lastWorkItem=Topics("jzdh"), workerCallCount=297, workerTimeMillis=42513], inputExhausted=true, isFirstSlice=false]], sleeping for 2435 ms
[INFO] Apr 26, 2016 4:39:37 PM com.google.appengine.tools.cloudstorage.RetryHelper doRetry
[INFO] INFO: RetryHelper(3.562 s, 3 attempts, java.util.concurrent.Executors$RunnableAdapter@6d7fcd47): Attempt #3 failed [java.lang.RuntimeException: Can't serialize object: MapOnlyShardTask[context=IncrementalTaskContext[jobId=3c041e68-5041-458c-994b-290cd941f8bb, shardNumber=0, shardCount=2, lastWorkItem=Topics("jz63"), workerCallCount=289, workerTimeMillis=41536], inputExhausted=true, isFirstSlice=false]], sleeping for 3421 ms
[INFO] Apr 26, 2016 4:39:39 PM com.google.appengine.tools.cloudstorage.RetryHelper doRetry
[INFO] INFO: RetryHelper(3.567 s, 3 attempts, java.util.concurrent.Executors$RunnableAdapter@7f0264e0): Attempt #3 failed [java.lang.RuntimeException: Can't serialize object: MapOnlyShardTask[context=IncrementalTaskContext[jobId=3c041e68-5041-458c-994b-290cd941f8bb, shardNumber=1, shardCount=2, lastWorkItem=Topics("jzdh"), workerCallCount=297, workerTimeMillis=42513], inputExhausted=true, isFirstSlice=false]], sleeping for 3340 ms
[INFO] Apr 26, 2016 4:39:41 PM com.google.appengine.tools.cloudstorage.RetryHelper doRetry
[INFO] INFO: RetryHelper(7.015 s, 4 attempts, java.util.concurrent.Executors$RunnableAdapter@6d7fcd47): Attempt #4 failed [java.lang.RuntimeException: Can't serialize object: MapOnlyShardTask[context=IncrementalTaskContext[jobId=3c041e68-5041-458c-994b-290cd941f8bb, shardNumber=0, shardCount=2, lastWorkItem=Topics("jz63"), workerCallCount=289, workerTimeMillis=41536], inputExhausted=true, isFirstSlice=false]], sleeping for 6941 ms
[INFO] Apr 26, 2016 4:39:42 PM com.google.appengine.tools.cloudstorage.RetryHelper doRetry
I haven't been able to get around this issue; any help or pointers on what I could be doing wrong would be greatly appreciated.
The culprit in my case was a small Datastore field I had used in the Map job. I put a transient in front of the field, and the issue was solved.
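For illustration, a minimal sketch of that kind of fix. The mapper name matches the question's UrlFlattenMapper, but the field and the body are hypothetical; the point is that the shard task is serialized between slices, so any non-serializable state the mapper holds must be transient and rebuilt in beginSlice():
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.tools.mapreduce.MapOnlyMapper;

public class UrlFlattenMapper extends MapOnlyMapper<Entity, Entity> {

    // Hypothetical non-serializable helper. Without `transient`, serializing
    // the shard task fails with "Can't serialize object: MapOnlyShardTask...".
    private transient DatastoreService datastore;

    @Override
    public void beginSlice() {
        // Rebuild transient state at the start of every slice.
        datastore = DatastoreServiceFactory.getDatastoreService();
    }

    @Override
    public void map(Entity entity) {
        // ... flatten the URL properties on the entity ...
        emit(entity);
    }
}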

Stopping threads in a multi threaded application

I have created a service using procrun which launches certain jars through reflection. When the service is started, it starts a thread, and the rest of the execution happens in that thread. Each plugin then starts its own threads and does its execution there.
During service stop, I call the stop method of the plugins. Those methods have returned, and whatever threads I created for the plugins have been terminated. But even after that, the following threads are still running.
INFO: Thread No:0 = Timer-0
Jan 13, 2016 10:49:58 AM com.test.desktop.SdkMain stop
INFO: Thread No:1 = WebSocketWorker-14
Jan 13, 2016 10:49:58 AM com.test.desktop.SdkMain stop
INFO: Thread No:2 = WebSocketWorker-15
Jan 13, 2016 10:49:58 AM com.test.desktop.SdkMain stop
INFO: Thread No:3 = WebSocketWorker-16
Jan 13, 2016 10:49:58 AM com.test.desktop.SdkMain stop
INFO: Thread No:4 = WebSocketWorker-17
Jan 13, 2016 10:49:58 AM com.test.desktop.SdkMain stop
INFO: Thread No:5 = WebsocketSelector18
Jan 13, 2016 10:49:58 AM com.test.desktop.SdkMain stop
INFO: Thread No:6 = AWT-EventQueue-0
Jan 13, 2016 10:49:58 AM com.test.desktop.SdkMain stop
INFO: Thread No:7 = DestroyJavaVM
Jan 13, 2016 10:49:58 AM com.test.desktop.SdkMain stop
INFO: Thread No:8 = Thread-11
Jan 13, 2016 10:49:58 AM com.test.desktop.SdkMain stop
The following is how I printed those threads.
ThreadGroup currentGroup = Thread.currentThread().getThreadGroup();
int noThreads = currentGroup.activeCount();
Thread[] lstThreads = new Thread[noThreads];
currentGroup.enumerate(lstThreads);
for (int i = 0; i < noThreads; i++) {
    LOGGER.log(Level.INFO, "Thread No:" + i + " = " + lstThreads[i].getName());
}
Because of these threads, when I stop the service it takes forever and then times out, but when I call System.exit(0) the service stops quickly. What should I do to get rid of these threads? When I launch the jars through reflection, are separate threads created for each plugin? If so, could these be them? Please advise.
It looks like the plugins are themselves launching threads ("INFO: Thread No:1 = WebSocketWorker-14" - sockets are usually put in separate threads), and those threads will not be shut down when you kill the initiating thread. You'll have to require your plugins to kill every thread they started when they are shut down, so they don't leave anything behind. As bayou.io described it well: "some plugins don't do a good job cleaning up whatever they created. it's just sloppy programming."
Calling System.exit() will just kill the process, meaning it kills all threads created by the process as well.
The other way would be to manually iterate over all running threads, check whether each one is the main thread, and if not, proceed to kill it. You can get all running threads as a set using
Set<Thread> threadSet = Thread.getAllStackTraces().keySet();
And you can get your currently running Thread using
Thread currentThread = Thread.currentThread();
Still, this is not the way you would want to do it; it's more of a cleanup mechanism for plugins that leave stuff behind. The plugins themselves should take care of shutting down their threads when they get disabled, but if they don't, you can use the approach above to clean up manually, as in the sketch below.
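A minimal sketch of that manual cleanup, combining the two snippets above. Note that Thread.stop() is deprecated, so this uses interrupt(), which is cooperative: a thread that never checks its interrupt status will still not terminate.
// Run this from the thread performing the service shutdown.
Thread currentThread = Thread.currentThread();
for (Thread t : Thread.getAllStackTraces().keySet()) {
    // Skip ourselves and daemon threads (daemons don't block JVM shutdown).
    if (t != currentThread && !t.isDaemon()) {
        t.interrupt();
    }
}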

Apache Commons IO Tailer delivers old log messages

My code is given below.
public static void main(String[] args) {
    File pcounter_log = new File("c:\\development\\temp\\test.log");
    try {
        // Poll the file every 5000 ms, starting from the end of the file (true):
        Tailer tailer = new Tailer(pcounter_log,
                new FileListener("c:\\development\\temp\\test.log", getLogPattern()),
                5000, true);
        Thread thread = new Thread(tailer);
        thread.start();
    } catch (Exception e) {
        System.out.println(e);
    }
}

public class FileListener extends TailerListenerAdapter {
    // Declarations implied by the usage above; the original snippet omitted them:
    private final List<String> pattern;
    private static final Logger logger = Logger.getLogger(FileListener.class);

    public FileListener(String fileName, List<String> pattern) {
        this.pattern = pattern;
    }

    @Override
    public void handle(String line) {
        for (String logPattern : pattern) {
            if (line.contains(logPattern)) {
                logger.info(line);
            }
        }
    }
}
Here getLogPattern() returns an ArrayList containing values like [info, error, abc.catch, warning]. When running this code, I get old log messages delivered after new ones, i.e. the output is like this:
20 May 2011 07:06:02,305 INFO FileListener:? - 20 May 2011 07:06:01,230 DEBUG - exiting readScriptErrorStream()
20 May 2011 07:06:55,052 INFO FileListener:? - 20 May 2011 07:06:55,016 DEBUG - readScriptErrorStream()
20 May 2011 07:06:56,056 INFO FileListener:? - 20 May 2011 07:06:55,040 DEBUG - exiting readScriptErrorStream()
20 May 2011 07:07:01,241 INFO FileListener:? - 20 May 2011 07:07:01,219 DEBUG - readScriptErrorStream()
20 May 2011 07:07:02,245 INFO FileListener:? - 20 May 2011 07:07:01,230 DEBUG - exiting readScriptErrorStream()
20 May 2011 07:07:55,020 INFO FileListener:? - 20 May 2011 07:07:55,016 DEBUG - readScriptErrorStream()
20 May 2011 07:07:56,024 INFO FileListener:? - 20 May 2011 07:07:55,030 DEBUG - exiting readScriptErrorStream()
20 May 2011 07:08:01,269 INFO FileListener:? - 20 May 2011 07:08:01,227 DEBUG - readScriptErrorStream()
20 May 2011 07:08:02,273 INFO FileListener:? - 20 May 2011 07:08:01,230 DEBUG - exiting readScriptErrorStream()
20 May 2011 07:08:21,234 INFO FileListener:? - 20 May 2011 06:40:02,461 DEBUG - readScriptErrorStream()
20 May 2011 07:08:22,237 INFO FileListener:? - 20 May 2011 06:40:02,468 DEBUG - exiting readScriptErrorStream()
20 May 2011 07:08:23,242 INFO FileListener:? - 20 May 2011 06:41:01,224 DEBUG - readScriptErrorStream()
20 May 2011 07:08:24,250 INFO FileListener:? - 20 May 2011 06:41:01,232 DEBUG - exiting readScriptErrorStream()
20 May 2011 07:08:25,261 INFO FileListener:? - 20 May 2011 06:42:01,218 DEBUG - readScriptErrorStream()
20 May 2011 07:08:26,265 INFO FileListener:? - 20 May 2011 06:42:01,230 DEBUG - exiting readScriptErrorStream()
20 May 2011 07:08:27,272 INFO FileListener:? - 20 May 2011 06:43:01,223 DEBUG - readScriptErrorStream()
20 May 2011 07:08:28,275 INFO FileListener:? - 20 May 2011 06:43:01,231 DEBUG - exiting readScriptErrorStream()
How can I avoid getting old log messages from the log file like this?
Oh boy, I have wasted an entire day thinking it was my dodgy threading, but I now see others have shared my pain. Oh well, at least I won't waste another day looking at it.
But I did look at the source code. I am sure the error is occurring here in the Tailer.java file:
boolean newer = FileUtils.isFileNewer(file, last); // IO-279, must be done first
...
...
else if (newer) {
/*
* This can happen if the file is truncated or overwritten with the
* exact same length of information. In cases like this, the file
* position needs to be reset
*/
position = 0;
reader.seek(position);
...
It seems it's possible for the file modification date to change before the data is written. I'm no expert on why this would be; I am getting my log files over the network, so perhaps all sorts of caching is going on, which means you are not guaranteed that a newer file will contain any more data.
I have updated the source and removed this section. For me, the chances of a file getting truncated or recreated with exactly the same number of bytes are minimal; I'm tailing 10MB rolling log files.
I see that this is a known issue (IO-279, linked below). However, it's marked as resolved, and that's clearly not the case. I'll contact the developers to see if there's something in the pipeline. They seem to be of the same opinion as me about the fix.
https://issues.apache.org/jira/browse/IO-279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
What version of commons-io are (or were) you using?
I experienced this error with 2.0.1. I updated to 2.3, and it seems to be working properly (so far).
I know that this is a very old thread, but I just ran into a similar issue with Tailer. It turned out that Tailer had two threads reading the file concurrently.
I traced it back to how I had created the Tailer instance. Rather than using exactly one of the three recommended approaches (static helper method, Executor, or Thread), I had created the instance with the static helper method and then also fed that instance into a Thread, which resulted in two threads reading the file.
Once I corrected this (by removing the call to the static helper method and just using one of the overloaded Tailer constructors together with a Thread), the issue went away, as sketched below.
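For illustration, a sketch of the wrong and the corrected usage; file and listener are assumed to exist, and the 5000 ms delay and tail-from-end flag mirror the question's code:
// Wrong: Tailer.create(...) already starts its own thread, so wrapping
// the returned instance in another Thread produces two concurrent readers:
// Tailer tailer = Tailer.create(file, listener, 5000, true);
// new Thread(tailer).start(); // second reader!

// Correct: construct the Tailer yourself and run exactly one thread...
Tailer tailer = new Tailer(file, listener, 5000, true);
Thread thread = new Thread(tailer);
thread.setDaemon(true); // optional: don't keep the JVM alive
thread.start();

// ...or let the static helper manage its own thread, and don't start another:
// Tailer tailer = Tailer.create(file, listener, 5000, true);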
Hope this helps someone.
