Java: guide-line for when to use thread-pooling?

This is a high-volume production system; however, this particular code path is seldom used. It's an import feature that can potentially result in a lot of data coming in, but it's used only occasionally, perhaps a few times a month.
I'm having a (polite) debate with a colleague. The issue is whether a simple thread created the old-fashioned way:
    Runnable thread = new Runnable() {
        public void run() {
            //... do the import work ...
        }
    };
    new Thread(thread).start();
is sufficient, or whether this requires a thread pool.
This is happening in a service-layer class that is called from a servlet (providing a RESTful interface). The purpose is to allow the response to return and free the UI while the import happens.
As a follow-on: in this situation, is using a thread pool actually just going to add unnecessary overhead (in both coding and resource use)?
After EJP's comment - is there a good guideline for when it becomes 'worth having a discussion' about using pooling instead of straight thread creation?

A thread pool would only be useful if you were planning on starting a lot of these threads, so that you avoid thread-creation overhead by reusing them instead of killing and re-creating them for subsequent work.
Since this code path is used so rarely, you do not need a thread pool.
However, it sounds like you are doing this heavy work in the same process that serves your REST API? You may want to consider passing this work to a worker that runs in a separate process.
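For comparison, here is a minimal sketch of what the pooled variant could look like if you ever want more control over the import thread's lifecycle. The class and method names (ImportService, startImport, completedImports) are hypothetical and not from the question; the only assumption is that imports may run one at a time.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical service class; the names are illustrative, not from the question.
class ImportService {
    // One long-lived worker thread shared by all imports; imports run one at a time.
    private final ExecutorService importExecutor = Executors.newSingleThreadExecutor();
    private final AtomicInteger completedImports = new AtomicInteger();

    // Called from the request thread: queues the import and returns immediately.
    Future<?> startImport(Runnable importWork) {
        return importExecutor.submit(() -> {
            importWork.run();
            completedImports.incrementAndGet();
        });
    }

    int completedImports() {
        return completedImports.get();
    }

    // Call on application shutdown so the worker thread does not keep the JVM alive.
    // Returns true if all queued imports finished within the timeout.
    boolean shutdown(long timeoutSeconds) {
        importExecutor.shutdown();
        try {
            return importExecutor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ImportService service = new ImportService();
        service.startImport(() -> System.out.println("importing..."));
        System.out.println("request thread is free to return");
        service.shutdown(5);
    }
}
```

The extra code buys a Future for each import (so failures can be inspected) and a defined shutdown path. For something that runs a few times a month, the plain thread is still a reasonable choice.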

Related

How do I make a block aware execution context?

For some reason I can't wrap my head around implementing this. I've got an application running with Play that calls out to Elastic Search. As part of my design, my service uses the Java API wrapped with Scala Futures as shown in this blog post. I've updated the code from that post to hint to the ExecutionContext that it will be doing some blocking I/O, like so:
    import scala.concurrent.{blocking, Future, Promise}
    import org.elasticsearch.action.{ActionRequestBuilder, ActionListener, ActionResponse }
    def execute[RB <: ActionRequestBuilder[_, T, _, _]](request: RB): Future[T] = {
      blocking {
        request.execute(this)
        promise.future
      }
    }
My actual service that constructs the queries to send to ES takes an executionContext as a constructor parameter that it then uses for calls to Elastic Search. I did this so that the global execution context that Play uses won't have its threads tied down by the blocking calls to ES. This S.O. comment mentions that only the global context is blocking aware, so that leaves me having to create my own. In that same post/answer there's a lot of information about using a ForkJoin pool, but I'm not sure how to take what's written in those docs and combine it with the hints in the blocking documentation to create an execution context that responds to blocking hints.
I think one of the issues I have is that I'm not sure exactly how to respond to the blocking context in the first place? I was reading the best practices and the example it uses is an unbounded cache of threads:
Note that here I prefer to use an unbounded "cached thread-pool", so it doesn't have a limit. When doing blocking I/O the idea is that you've got to have enough threads that you can block. But if unbounded is too much, depending on use-case, you can later fine-tune it, the idea with this sample being that you get the ball rolling.
So does this mean that with my ForkJoin-backed thread pool, I should try to use a cached thread when dealing with non-blocking I/O and create a new thread for blocking I/O? Or something else? Pretty much every resource I find online about using separate thread pools tends to do what the Neophyte's Guide does, which is to say:
How to tune your various thread pools is highly dependent on your individual application and beyond the scope of this article.
I know it depends on your application, but in this case I just want to create some type of blocking-aware ExecutionContext and understand a decent strategy for managing the threads. If the context is specifically for a single part of the application, should I just make a fixed-size thread pool and ignore the blocking keyword in the first place?
I tend to ramble, so I'll try to break down what I'm looking for in an answer:
Code! Reading all these docs still leaves me feeling just out of reach of being able to code a blocking-aware context, and I'd really appreciate an example.
Any links or tips on how to handle blocking threads, e.g. endlessly make a new thread for them, check the number of threads available and reject if there are too many, or some other strategy.
I'm not looking for performance tips here. I know I'll only get that with testing, but I can't test if I can't figure out how to code the contexts in the first place! I did find an example of ForkJoins vs. thread pools, but I'm missing the crucial part about blocking there.
Sorry for the long question here, I'm just trying to give you a sense of what I'm looking at and that I have been trying to wrap my head around this for over a day and need some outside help.
Edit: Just to make this clear, the ElasticSearch Service's constructor signature is:
//Note that these are not implicit parameters!
class ElasticSearchService(otherParams ..., val executionContext: ExecutionContext)
And in my application start up code I have something like this:
    object Global extends GlobalSettings {
      val elasticSearchContext = //Custom Context goes here
      ...
      val elasticSearchService = new ElasticSearchService(params, elasticSearchContext);
      ...
    }
I am also reading through Play's recommendations for contexts, but have yet to see anything about blocking hints, and I suspect I might have to look into the source to see if they extend the BlockContext trait.

So I dug into the documentation, and Play's best practice for the situation I'm dealing with is:
In certain circumstances, you may wish to dispatch work to other thread pools. This may include CPU heavy work, or IO work, such as database access. To do this, you should first create a thread pool, this can be done easily in Scala:
And provides some code:
    object Contexts {
      implicit val myExecutionContext: ExecutionContext = Akka.system.dispatchers.lookup("my-context")
    }
The context is from Akka, so I ran down there searching for the defaults and types of Contexts they offer, which eventually led me to the documentation on dispatchers. The default is a ForkJoinPool whose default method for managing a block is to call the managedBlock(blocker). This led me to reading the documentation that stated:
Blocks in accord with the given blocker. If the current thread is a ForkJoinWorkerThread, this method possibly arranges for a spare thread to be activated if necessary to ensure sufficient parallelism while the current thread is blocked.
So it seems like if I have a ForkJoinWorkerThread then the behavior I think I want will take place. Looking at the source of ForkJoinPool some more I noted that the default thread factory is:
val defaultForkJoinWorkerThreadFactory: ForkJoinWorkerThreadFactory = juc.ForkJoinPool.defaultForkJoinWorkerThreadFactory
Which implies to me that if I use the defaults in Akka, that I'll get a context which handles blocking in the way I expect.
So reading the Akka documentation again it would seem that specifying my context something like this:
    my-context {
      type = Dispatcher
      executor = "fork-join-executor"
      fork-join-executor {
        parallelism-min = 8
        parallelism-factor = 3.0
        parallelism-max = 64
        task-peeking-mode = "FIFO"
      }
      throughput = 100
    }
would be what I want.
While I was searching in the source code I did some looking for uses of blocking or of calling managedBlock and found an example of overriding the ForkJoin behavior in ThreadPoolBuilder
    private[akka] class AkkaForkJoinWorkerThread(_pool: ForkJoinPool) extends ForkJoinWorkerThread(_pool) with BlockContext {
      override def blockOn[T](thunk: ⇒ T)(implicit permission: CanAwait): T = {
        val result = new AtomicReference[Option[T]](None)
        ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker {
          def block(): Boolean = {
            result.set(Some(thunk))
            true
          }
          def isReleasable = result.get.isDefined
        })
        result.get.get // Exception intended if None
      }
    }
This seems like what I originally asked for: an example of how to make something that implements the BlockContext. That file also has code showing how to make an ExecutorServiceFactory, which is what I believe is referenced by the executor part of the configuration. So I think that if I wanted a totally custom context, I would extend some type of WorkerThread, write my own ExecutorServiceFactory that uses the custom worker thread, and then specify the fully qualified class name in the property, as this post advises.
I'm probably going to go with using Akka's forkjoin :)
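For readers on the Java side, the JDK mechanism that the Akka snippet above builds on can be sketched directly. This is only an illustration of ForkJoinPool.managedBlock, not Akka's or Play's implementation; the ManagedBlocking class, blockingCall, demo and slowLookup are hypothetical names, and slowLookup merely simulates a blocking call such as an Elastic Search request.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class ManagedBlocking {
    // Runs a blocking call through ForkJoinPool.managedBlock so that, when the caller
    // is a ForkJoinWorkerThread, the pool may activate a spare thread in the meantime.
    static <T> T blockingCall(Supplier<T> blockingWork) {
        AtomicReference<T> result = new AtomicReference<>();
        boolean[] done = {false}; // only touched by the calling thread
        try {
            ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
                @Override
                public boolean block() {
                    result.set(blockingWork.get());
                    done[0] = true;
                    return true; // no further blocking is necessary
                }

                @Override
                public boolean isReleasable() {
                    return done[0];
                }
            });
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while blocking", e);
        }
        return result.get();
    }

    // Simulated blocking I/O; stands in for a real client call.
    static String slowLookup(String key) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "value-for-" + key;
    }

    // Submits the blocking call to a small ForkJoinPool and waits for the result.
    static String demo() {
        ForkJoinPool pool = new ForkJoinPool(2);
        try {
            Callable<String> task = () -> blockingCall(() -> slowLookup("key"));
            return pool.submit(task).join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Called from a thread that is not a ForkJoinWorkerThread, managedBlock simply runs the blocker on the calling thread, so the helper is safe to use from anywhere.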

Best practices with Akka in Scala and third-party Java libraries

I need to use the memcached Java API in my Scala/Akka code. This API gives you both synchronous and asynchronous methods. The asynchronous ones return java.util.concurrent.Future. There was a question here about dealing with Java Futures in Scala: How do I wrap a java.util.concurrent.Future in an Akka Future?. However, in my case I have two options:
Use the synchronous API, wrapping the blocking code in a Future and marking it as blocking:
    Future {
      blocking {
        cache.get(key) //synchronous blocking call
      }
    }
Use the asynchronous Java API and poll the Java Future every n ms to check whether it has completed (as described in one of the answers to the linked question).
Which one is better? I am leaning towards the first option because polling can dramatically impact response times. Shouldn't the blocking { } block prevent the call from blocking the whole pool?

I always go with the first option, but I do it in a slightly different way. I don't use the blocking feature. (Actually, I have not thought about it yet.) Instead, I provide a custom execution context to the Future that wraps the synchronous blocking call. It looks basically like this:
    val ecForBlockingMemcachedStuff = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(100)) // whatever number you think is appropriate
    // I create a separate ec for each blocking client/resource/api I use
    Future {
      cache.get(key) //synchronous blocking call
    }(ecForBlockingMemcachedStuff) // or mark the execution context implicit. I like to mention it explicitly.
So all the blocking calls use a dedicated execution context (= thread pool), which is kept separate from your main execution context responsible for non-blocking stuff.
This approach is also explained in an online training video for Play/Akka provided by Typesafe. There is a video in lesson 4 about how to handle blocking calls. It is explained by Nilanjan Raychaudhuri (hope I spelled it correctly), who is a well-known author of Scala books.
Update: I had a discussion with Nilanjan on Twitter. He explained the difference between the approach with blocking and a custom ExecutionContext. The blocking feature just creates a special ExecutionContext. It provides a naive answer to the question of how many threads you will need: it spawns a new thread whenever all the other existing threads in the pool are busy. So it is actually an uncontrolled ExecutionContext. It could create lots of threads and lead to problems like an out-of-memory error. So the solution with the custom execution context is actually better, because it makes this problem obvious. Nilanjan also added that you need to consider circuit breaking in case this pool gets overloaded with requests.
TLDR: Yeah, blocking calls suck. Use a custom/dedicated ExecutionContext for blocking calls. Also consider circuit breaking.
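The same idea carries over to plain Java. The sketch below is an assumption-laden analogue of the Scala snippet above, not the memcached client's API: BlockingCacheClient, blockingGet and getAsync are hypothetical names, blockingGet only simulates a synchronous cache.get(key), and the pool size of 10 is arbitrary.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BlockingCacheClient {
    // Dedicated, bounded pool used only for blocking cache calls (size is illustrative).
    private final ExecutorService blockingPool = Executors.newFixedThreadPool(10);

    // Stand-in for a synchronous client call such as cache.get(key).
    private String blockingGet(String key) {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "cached:" + key;
    }

    // Runs the blocking call on the dedicated pool instead of the common pool.
    CompletableFuture<String> getAsync(String key) {
        return CompletableFuture.supplyAsync(() -> blockingGet(key), blockingPool);
    }

    void close() {
        blockingPool.shutdown();
    }

    public static void main(String[] args) {
        BlockingCacheClient client = new BlockingCacheClient();
        System.out.println(client.getAsync("user:1").join());
        client.close();
    }
}
```

Passing the executor explicitly keeps the blocking work off whatever pool serves the non-blocking parts of the application.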

The Akka documentation provides a few suggestions on how to deal with blocking calls:
In some cases it is unavoidable to do blocking operations, i.e. to put a thread to sleep for an indeterminate time, waiting for an external event to occur. Examples are legacy RDBMS drivers or messaging APIs, and the underlying reason is typically that (network) I/O occurs under the covers. When facing this, you may be tempted to just wrap the blocking call inside a Future and work with that instead, but this strategy is too simple: you are quite likely to find bottlenecks or run out of memory or threads when the application runs under increased load.
The non-exhaustive list of adequate solutions to the “blocking problem” includes the following suggestions:
Do the blocking call within an actor (or a set of actors managed by a router), making sure to configure a thread pool which is either dedicated for this purpose or sufficiently sized.
Do the blocking call within a Future, ensuring an upper bound on the number of such calls at any point in time (submitting an unbounded number of tasks of this nature will exhaust your memory or thread limits).
Do the blocking call within a Future, providing a thread pool with an upper limit on the number of threads which is appropriate for the hardware on which the application runs.
Dedicate a single thread to manage a set of blocking resources (e.g. a NIO selector driving multiple channels) and dispatch events as they occur as actor messages.
The first possibility is especially well-suited for resources which are single-threaded in nature, like database handles which traditionally can only execute one outstanding query at a time and use internal synchronization to ensure this. A common pattern is to create a router for N actors, each of which wraps a single DB connection and handles queries as sent to the router. The number N must then be tuned for maximum throughput, which will vary depending on which DBMS is deployed on what hardware.
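The "upper bound" suggestions in the quote can be illustrated outside Akka with a plain java.util.concurrent pool. This is a sketch under stated assumptions, not Akka configuration: BoundedBlockingPool, create and countRejections are hypothetical names, and the latch only simulates calls that stay blocked.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedBlockingPool {
    // At most `threads` blocking calls run at once and at most `queued` wait;
    // anything beyond that is rejected instead of piling up in memory.
    static ThreadPoolExecutor create(int threads, int queued) {
        return new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queued),
                new ThreadPoolExecutor.AbortPolicy());
    }

    // Submits `tasks` tasks that all stay blocked, and returns how many were rejected.
    static int countRejections(int threads, int queued, int tasks) {
        ThreadPoolExecutor pool = create(threads, queued);
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(() -> {
                    try {
                        release.await(); // simulated blocking call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++; // the caller decides: fail fast, retry later, trip a circuit breaker
            }
        }
        release.countDown();
        pool.shutdown();
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected: " + countRejections(2, 3, 10));
    }
}
```

With 2 threads and a queue of 3, only 5 blocked tasks are accepted at a time, so the remaining submissions are rejected rather than exhausting memory or threads.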

Java synchronization performance

I would like opinions on this to settle a small dispute. Any help would be greatly appreciated.
I have written my own file handler that is attached to the logger. Since this is a file handler accessed by multiple threads, I am using synchronization to ensure that there is no collision during the writing process. Additionally, it is a rolling log, so I also close and open files, and do not want any problems there either.
His response to it was (as pasted from email):
I strongly believe that Synchronization is very bad in the Handler. It is too complex for such easy task. So, I would say why do not use one instance per Thread?
Which would you say is better from a performance and memory-management perspective?
Thank you very much for any response. Whenever writing and reading are involved in multithreaded applications, I have used synchronization in Java applications all my life, and have not heard of any severe performance issues.
So I would like to know whether there are any issues and whether I really should switch to one instance per thread.
And in general, what would be the downside of using synchronization?
EDIT: The reason I wrote a custom file handler (yes, I do love slf4j) is that my custom handler deals with two files at once, and I also have a few other functions I perform on top of writing to files.

Another solution would be to use a separate thread to do the writing (which is costly on its own) and use a concurrent queue to pass the log messages from the domain threads.
The key part here is that pushing to a queue is much less costly than writing to a file, which means there is less interference between concurrent log calls.
The call to log would then look like
    private static BlockingQueue<String> logQueue = //...
    public static void log(String message){
        //construct&filter message
        logQueue.add(message);
    }
then in the logger thread it will look like
    try {
        while(true){
            String message = logQueue.take(); // blocks until a message arrives; poll() would return null on an empty queue
            logFile.println(message);//or whatever you are doing
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // interruption stops the logger thread
    }

As with all I/O, you have little choice but mutual exclusion. You may theoretically build up a complex scheme with a lock-free queue which accumulates logging entries, but its utility, and especially its reliability, would be very questionable: without careful design you could get a logging-caused OOME, have the application hang because of threads which you didn't clean up, etc.
Keep in mind that, assuming you are using buffered I/O, you already have an equivalent of a queue, minimizing the time spent occupying the lock.
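If you do go the queue route anyway, the two risks named above (unbounded memory growth and a writer thread that is never cleaned up) can be addressed along these lines. This is only a sketch under assumptions: QueuedLogger is a hypothetical class, an in-memory list stands in for the log file, and dropping messages when the queue is full is one possible policy among several.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueuedLogger {
    // Unique sentinel ("poison pill") that tells the writer thread to stop.
    private static final String STOP = new String("STOP");

    private final BlockingQueue<String> queue;
    private final List<String> sink = new ArrayList<>(); // stand-in for the log file
    private final Thread writer;

    QueuedLogger(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity); // bounded, so it cannot grow without limit
        writer = new Thread(this::drain, "log-writer");
        writer.setDaemon(true); // a forgotten logger cannot keep the JVM alive
        writer.start();
    }

    // Never blocks the caller; returns false if the queue was full and the message was dropped.
    boolean log(String message) {
        return queue.offer(message);
    }

    private void drain() {
        try {
            while (true) {
                String message = queue.take();
                if (message == STOP) {
                    return; // identity comparison: only the sentinel object stops the loop
                }
                sink.add(message);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Lets the writer finish the queued messages, stops it, and returns what was written.
    List<String> close() {
        try {
            queue.put(STOP);
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sink;
    }

    public static void main(String[] args) {
        QueuedLogger logger = new QueuedLogger(1000);
        logger.log("first");
        logger.log("second");
        System.out.println(logger.close());
    }
}
```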

The downside of synchronisation is that only one thread can access that part of the code at any one time, meaning your code will see little benefit from multithreading there, i.e. the synchronised part of your application will only be as fast as a single thread. (There is a small overhead for handling the synchronised status too, so perhaps a little slower.)
However, in cases where you don't want the threads to interfere with one another, such as writing to files, the safety gained from synchronisation is paramount, and the performance loss should just be accepted.
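To make the trade-off concrete, here is a minimal sketch of a single shared handler guarded by synchronized. SynchronizedHandler and its methods are hypothetical, and a StringBuilder stands in for the file; the point is only that records from several threads never interleave or get lost.

```java
class SynchronizedHandler {
    private final StringBuilder out = new StringBuilder(); // stand-in for the log file

    // Only one thread at a time can be inside publish, so records never interleave.
    synchronized void publish(String record) {
        out.append(record).append('\n');
    }

    synchronized int lineCount() {
        int lines = 0;
        for (int i = 0; i < out.length(); i++) {
            if (out.charAt(i) == '\n') {
                lines++;
            }
        }
        return lines;
    }

    // Starts `threads` writers that each publish `recordsPerThread` records,
    // waits for them, and returns the number of complete lines written.
    static int writeConcurrently(int threads, int recordsPerThread) {
        SynchronizedHandler handler = new SynchronizedHandler();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < recordsPerThread; i++) {
                    handler.publish("thread " + id + " record " + i);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return handler.lineCount();
    }

    public static void main(String[] args) {
        System.out.println(writeConcurrently(4, 1000) + " lines written");
    }
}
```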

Java: TaskExecutor for Asynchronous Database Writes?

I'm thinking of using Java's TaskExecutor to fire off asynchronous database writes. Understandably threads don't come for free, but assuming I'm using a fixed threadpool size of say 5-10, how is this a bad idea?
Our application reads from a very large file using a buffer and flushes this information to a database after performing some data manipulation. Using asynchronous writes seems ideal here so that we can continue working on the file. What am I missing? Why doesn't every application use asynchronous writes?

Why doesn't every application use asynchronous writes?
It's often necessary/useful/easier to deal with a write failure in a synchronous manner.
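A small sketch of why: with an asynchronous write, the failure only surfaces later, when (and if) someone inspects the Future. The names here (AsyncWriteFailure, writeRow, writeAndCheck) are hypothetical, and writeRow is a stand-in that fails on empty input rather than a real database call.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class AsyncWriteFailure {
    // Hypothetical write that fails for empty rows; stands in for a real database call.
    static void writeRow(String row) {
        if (row.isEmpty()) {
            throw new IllegalArgumentException("empty row");
        }
    }

    // Submits the write, then checks its outcome.
    // Returns null on success, or the failure message if the write threw.
    static String writeAndCheck(String row) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> pending = executor.submit(() -> writeRow(row));
            // ... the caller could keep reading the file here ...
            pending.get(); // rethrows the task's exception wrapped in ExecutionException
            return null;
        } catch (ExecutionException e) {
            return e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(writeAndCheck("some data"));
        System.out.println(writeAndCheck(""));
    }
}
```

If nobody ever calls get() on the Future, the exception is silently lost, which is the kind of failure handling you have to design for explicitly once writes become asynchronous.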

I'm not sure a thread pool is even necessary. I would consider using a dedicated database-writer thread which does all the writing and error handling for you. Something like:
    public class AsyncDatabaseWriter implements Runnable {
        private LinkedBlockingQueue<Data> queue = ....
        private volatile boolean terminate = false;
        public void run() {
            try {
                while(!terminate) {
                    Data data = queue.take(); // blocks until there is something to write
                    // write to database
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // treat interruption as a request to stop
            }
        }
        public void scheduleWrite(Data data) {
            queue.add(data);
        }
    }
I personally fancy the style of using a Proxy for threading out operations which might take a long time. I'm not saying this approach is better than using executors in any way, just adding it as an alternative.

The idea is not bad at all. Actually, I just tried it yesterday because I needed to create a copy of an online database which has 5 different categories with about 60000 items each.
By moving the parse/save operation of each category into parallel tasks, and partitioning each category import into smaller batches run in parallel, I reduced the total import time from several hours (estimated) to 26 minutes. Along the way I found a good piece of code for splitting the collection: http://www.vogella.de/articles/JavaAlgorithmsPartitionCollection/article.html
I used ThreadPoolTaskExecutor to run the tasks. Your tasks are just simple implementations of the Callable interface.
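Roughly, the partition-and-run-in-parallel approach looks like the sketch below. It is not the code from that import: it swaps Spring's ThreadPoolTaskExecutor for a plain java.util.concurrent ExecutorService so that it is self-contained, and BatchImport, partition, importAll and saveBatch are hypothetical names, with saveBatch as a placeholder for the real parse/save step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchImport {
    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            batches.add(items.subList(start, Math.min(start + batchSize, items.size())));
        }
        return batches;
    }

    // Placeholder for the real parse/save step; returns the number of items saved.
    static int saveBatch(List<String> batch) {
        return batch.size();
    }

    // Saves every batch on a fixed pool and returns the total number of items saved.
    static int importAll(List<String> items, int batchSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (List<String> batch : partition(items, batchSize)) {
                tasks.add(() -> saveBatch(batch));
            }
            int saved = 0;
            for (Future<Integer> done : pool.invokeAll(tasks)) { // waits for all batches
                saved += done.get();
            }
            return saved;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("import interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>();
        for (int i = 0; i < 60000; i++) {
            items.add("item-" + i);
        }
        System.out.println(importAll(items, 500, 8) + " items saved");
    }
}
```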

Why doesn't every application use asynchronous writes? Erm, because every application does a different thing.
Can you believe some applications don't even use a database? OMG!
Seriously though, given that you don't say what your failure strategies are, it sounds like it could be reasonable. What happens if the write fails, or the DB goes away somehow?
Some databases, like Sybase, have (or at least had) a thing where they really don't like multiple writers to a single table: all the writers ended up blocking each other, so maybe it won't actually make much difference...

Manually Increasing the Amount of CPU a Java Application Uses

I've just made a program with Eclipse that takes a really long time to execute. It's taking even longer because it's loading my CPU to 25% only (I'm assuming that is because I'm using a quad-core and the program is only using one core). Is there any way to make the program use all 4 cores to max it out? Java is supposed to be natively multi-threaded, so I don't understand why it would only use 25%.

You still have to create and manage threads manually in your application. Java can't determine that two tasks can run asynchronously and automatically split the work into several threads.

This is a pretty vague question because we don't know much about what your program does. If your program is single-threaded, then no number of cores on your machine is going to make it run any faster. Java does have threading support, but it won't automatically parallelize your code for you. To speed it up, you'll need to identify parts of the computation that can be run in parallel with one another and add code as appropriate to split up and reconstitute the work. Without more info on what your program does, I can't help you out.
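Another detail to note is that Java threads are not required to be the same as system threads. Older JVMs with "green threads" ran their own scheduler on top of the operating system, whereas mainstream JVMs such as HotSpot map each ordinary Java thread directly onto an operating-system thread and leave scheduling to the OS. Either way, there's no actual guarantee about how fairly your threads will be scheduled.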
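As an illustration of "split up and reconstitute", here is a minimal sketch that sums a range of integers on several threads. It is a toy workload chosen only because the pieces are obviously independent; ParallelSum and sumTo are hypothetical names, and your own program will need its own way of cutting the work into chunks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelSum {
    // Sums the integers from 1 to n by splitting the range into one chunk per worker.
    static long sumTo(int n, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = (n + workers - 1) / workers; // ceiling division
            for (int w = 0; w < workers; w++) {
                final int from = w * chunk + 1;
                final int to = Math.min(n, (w + 1) * chunk);
                Callable<Long> task = () -> {
                    long partial = 0;
                    for (int i = from; i <= to; i++) {
                        partial += i;
                    }
                    return partial;
                };
                parts.add(pool.submit(task)); // split: each chunk runs on its own thread
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // reconstitute: combine the partial results
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("chunk failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(sumTo(1_000_000, cores));
    }
}
```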

Yes, Java is multi-threaded, but the multi-threading doesn't happen "by magic".
Have a look at either the Thread class or the Executor framework. Essentially you need to split your job into "subtasks", each of which can run on a single processor, then do something like this:
    Executor ex = Executors.newFixedThreadPool(4);
    while (thereAreMoreSubtasksToDo) {
        ex.execute(new Runnable() {
            public void run() {
                // ... do subtask ...
            }
        });
    }
Turning a serial routine/algorithm into a parallel one isn't necessarily trivial: you need to know in particular about a range of issues broadly termed "thread-safety". You may be interested in some material I've written about thread-safety in Java, and threading in general if you follow the links: the key thing to bear in mind is that if any data/objects are being shared among the different threads running, then you need to take special precautions. That said, for independent things that you just want to "run at the same time", then the above pattern will get you started.

Java is multi-threaded but if your application runs in only one thread, only one thread will be used. (Apart from the internal threads Java uses for finalization, garbage collection and so on.)
If you want your code to use multiple threads, you have to split it up manually, either by starting threads by yourself or using a third party thread pool. I'd suggest the latter option as it's safer but both can work equally well.

You've got a bit of learning ahead of you (actually, quite a bit of learning) - but it's learning you should do if you are going to be doing any serious programming.
Here's a starting point: http://download.oracle.com/javase/tutorial/essential/concurrency/
But you might want to look into a good book on Java multi-threading (I did this so long ago that any book I could recommend would be out of print). This sort of hard topic is well suited for learning from a text instead of online tutorials.
