I have a Java servlet that handles user requests using a heavy-weight, thread-unsafe resource. The resource is an object that takes a long time to instantiate (up to 10 seconds) and occupies a large amount of memory.
Once the object is allocated, though, the method I need to process a request runs quickly.
There can be several such resources, different from each other.
Each request comes with an ID that identifies the particular resource.
I wish to implement a pool of such resources, so that requests with the same ID do not instantiate a new object but pick the existing one from the pool.
The scheme is as follows:
after a request has been received, the servlet checks whether a resource with the requested ID is in the pool
if not, the servlet creates one and provides it
if the resource is already instantiated, the request joins a queue for that resource, and doPost waits for its turn.
The operation over different resources must be concurrent, but synchronized within the same resource.
I am new to multithreading in Java, and ThreadPoolExecutor does not seem usable as-is, so I would appreciate advice on how to implement the scheme described above. Thanks.
You are correct - ThreadPoolExecutor is not what you want. It is simply a pool of threads to run tasks with, not a collection of shared resources.
What you want is a cache. It needs to create a resource, return it to requesting threads to use, and reuse what it returned previously. Also, the resources returned must be thread-safe (so if your underlying resources are not, you may need to write synchronized wrappers for them).
There are a number of thread-safe caches around, quite a few of them open source. Try those out; it shouldn't be too difficult to configure them for your use case (it seems fairly typical).
It is possible and not too difficult to implement a makeshift cache of your own, but you're far better off using a third-party solution if you are new to multithreading.
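As a rough illustration, a minimal per-ID cache can be built on `ConcurrentHashMap.computeIfAbsent`, which guarantees the factory runs at most once per key. All names here (`ResourcePoolSketch`, `get`) are hypothetical stand-ins for your servlet's resource and its slow constructor:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class ResourcePoolSketch {
    // Hypothetical per-ID cache for heavy-weight resources.
    static final ConcurrentHashMap<String, Object> cache = new ConcurrentHashMap<>();

    // computeIfAbsent runs the factory at most once per ID, even when
    // several requests for the same ID arrive concurrently.
    static Object get(String id, Function<String, Object> factory) {
        return cache.computeIfAbsent(id, factory);
    }

    public static void main(String[] args) {
        Object a = get("42", id -> new Object()); // slow construction happens once
        Object b = get("42", id -> new Object()); // second request reuses the instance
        System.out.println(a == b); // true

        // In doPost: lock the resource so requests with the same ID run one
        // at a time, while requests for other IDs proceed in parallel.
        synchronized (a) {
            // a.process(request) would go here
        }
    }
}
```

A real pool would also need an eviction policy, since rarely-used 10-second-to-build objects would otherwise occupy memory forever.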
Related
As we know, Tomcat has approximately 200 threads and Jetty has its own default thread count in their respective thread pools. So if we set something in a ThreadLocal per request, will it stay in the thread for the thread's lifetime, or will Tomcat clear the ThreadLocal after each request?
If we set something in userContext in a filter, do we need to clear it every time the filter exits?
Or will the web server create a new thread for every request if we don't have a thread pool configured?
public static final ThreadLocal<UserContextDto> userContext = new ThreadLocal<>();
Yes, you need to clear ThreadLocal. Tomcat won't clear ThreadLocals.
No, a new thread is not created every time. A thread from the pool is used to serve a request and returned to the pool once the request is complete.
This applies not only to Tomcat but to Jetty and Undertow as well. Creating a thread for every request is expensive in terms of both resources and time.
No, Tomcat will not clear ThreadLocals that your code creates, which means they will remain and could pollute subsequent requests.
So whenever you create one, make sure you clear it before the request exits.
It should also be noted that subsequent requests - even ones using the identical URL - could well be executed in a totally different thread, so ThreadLocals are not a mechanism for saving state between requests. For that, something like SessionBeans could be used.
If you put something in a ThreadLocal in a Thread that is not 100% under your control (i.e. one in which you are invoked from other code, like for a HTTP request), you need to clear whatever you set before you leave your code.
A try/finally structure is a good way to do that.
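A sketch of that pattern, with a hypothetical `handleRequest` standing in for the filter body:

```java
import java.util.function.Supplier;

public class ContextFilterSketch {
    // Hypothetical per-request context holder, as in the question.
    static final ThreadLocal<String> userContext = new ThreadLocal<>();

    // Set the context, do the work, and always clear the ThreadLocal before
    // the (pooled) thread leaves our code, even if the work throws.
    static String handleRequest(String user, Supplier<String> work) {
        userContext.set(user);
        try {
            return work.get();
        } finally {
            userContext.remove(); // prevents leakage into the next request
        }
    }

    public static void main(String[] args) {
        System.out.println(handleRequest("alice", () -> "hello " + userContext.get()));
        System.out.println(userContext.get()); // null: cleared even though the thread lives on
    }
}
```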
A threadpool can't do it for you, because the Java API does not provide a way to clear a thread's ThreadLocal variables. (Which is arguably a shortcoming in the Java API)
Not doing so risks a memory leak, although it's bounded by the size of the thread pool if you have one.
Once the same thread gets assigned again to code that knows about the ThreadLocal, you'll see the old value from the previous request if you didn't remove it. It's not good to depend on that; it could lead to hard-to-trace bugs, security holes, etc.
Scenario
We are developing an API that will handle around 2-3 million hits per hour in a multi-threaded environment. The server is Apache Tomcat 7.0.64.
We have a custom object with a lot of data; let's call it XYZDataContext. When a new request comes in, we associate an XYZDataContext object with the request context: one XYZDataContext object per request. We spawn various threads in parallel to serve that request, collecting/processing data from/into the XYZDataContext object. The threads that process things in parallel need access to this XYZDataContext object, and
to avoid passing this object around everywhere in the application, to various objects/methods/threads,
we are thinking of making it a ThreadLocal. Threads will use data from the XYZDataContext object and will also update data in it.
When a thread finishes, we plan to merge the data from the updated XYZDataContext object in the spawned child thread into the main thread's XYZDataContext object.
My questions:
Is this a good approach?
Thread pool risks - Tomcat maintains a thread pool, and I have read that using ThreadLocals with thread pools is a disaster, because threads are not GCed per se but reused, so references to the ThreadLocal objects will not be GCed; this keeps huge objects in memory that we no longer need, eventually resulting in OutOfMemory issues...
UNLESS they are held as weak references so that they get GCed immediately.
We're using Java 1.7 OpenJDK. I looked at the source code for ThreadLocal, and although ThreadLocalMap.Entry is a WeakReference, it's not associated with a ReferenceQueue, and the comment on the Entry constructor says "since reference queues are not used, stale entries are guaranteed to be removed only when the table starts running out of space."
I guess this works great for caches but is not the best thing in our case. I would like the thread-local XYZDataContext object to be GCed immediately. Will the ThreadLocal.remove() method be effective here?
Is there any way to force that space to be freed in the next GC run?
Is this the right scenario for ThreadLocal objects? Or are we abusing the ThreadLocal concept and using it where it shouldn't be used?
My gut feeling tells me you're on the wrong path. Since you already have a central context object (one for all threads) and you want to access it from multiple threads at the same time, I would go with a singleton hosting the context object and providing thread-safe methods to access it.
Instead of manipulating multiple properties of your context object one by one, I would strongly suggest doing all manipulations at once. Best would be to pass a single object containing all the properties you want to change in your context object.
e.g.
Singleton.getInstance().adjustContext(ContextAdjuster contextAdjuster)
You might also want to consider using a thread-safe queue, filling it with ContextAdjuster objects from your threads and finally processing them in the context's thread.
Google for things like Concurrent, Blocking and Nonblocking Queue in Java. I am sure you'll find tons of example code.
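A minimal sketch of the queue idea, where worker threads enqueue adjustments and only the owning thread applies them. All names are hypothetical, with a plain `Consumer` playing the role of `ContextAdjuster`:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public class ContextQueueSketch {
    // The shared context; only the owning thread ever touches it directly.
    static final Map<String, Object> context = new HashMap<>();
    // Thread-safe queue of pending adjustments (the "ContextAdjuster" role).
    static final BlockingQueue<Consumer<Map<String, Object>>> adjusters =
            new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        // Workers submit adjustments instead of mutating the context themselves.
        Thread worker = new Thread(() -> adjusters.add(ctx -> ctx.put("visits", 42)));
        worker.start();
        worker.join();

        // The owning thread drains the queue and applies each adjustment;
        // since no other thread mutates the context, no locking is needed here.
        Consumer<Map<String, Object>> adj;
        while ((adj = adjusters.poll()) != null) {
            adj.accept(context);
        }
        System.out.println(context.get("visits")); // 42
    }
}
```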
My Java (Swing) application creates a new SwingWorker object whenever it needs to (e.g.) download data from the Internet while doing something else at the same time (think displaying a loader). However, monitoring the threads created, I see this can quickly reach ~100 threads.
Is this bad practice? If yes, what's the proper way to do it? Doesn't the GC automatically clean up unused threads?
Yes, it is bad practice when you put no upper bound on the number of threads (or resources generally).
In that case you are better off using a thread pool that contains at most a specific number of threads (say, 25). You can either create them all at startup or create them lazily on demand.
Implement a simple request-manager system for the pool, which hands resources to requesters (or, when resources run out, queues or simply denies the requests).
This way, cleaning them up at the end will also be easy and obvious.
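A bounded pool along those lines might look like this sketch; the pool size and the task body are placeholders:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedPoolSketch {
    public static void main(String[] args) throws Exception {
        // At most 25 tasks run at once; extra submissions wait in the
        // executor's queue instead of spawning new threads.
        ExecutorService pool = Executors.newFixedThreadPool(25);

        Future<String> result = pool.submit(() -> {
            return "downloaded"; // stand-in for the real download work
        });
        System.out.println(result.get()); // "downloaded"

        pool.shutdown(); // easy, obvious cleanup at the end
    }
}
```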
There should be a frontier object, holding the set of visited and waiting-to-crawl URLs.
There should be some threads responsible for crawling web pages.
There should also be some kind of controller object to create crawling threads.
I don't know which architecture would be faster and easier to extend, how to divide responsibilities so as to need as little synchronization as possible, and how to minimize the number of checks of whether the current URL has already been visited.
Should the controller object be responsible for providing new URLs to the working threads? That would mean the working threads crawl all the URLs they were given and then sleep for an undefined time; the controller would then be interrupting those threads, so the crawling threads would have to handle InterruptedException (how expensive is that in Java? It seems that exception handling is not very fast).
Or should the controller only start the threads and let the crawling threads fetch from the frontier themselves?
Create a shared, thread-safe list with the URLs to be crawled. Create an Executor with the number of threads corresponding to the number of crawlers you want to run concurrently. Start your crawlers as Runnables with a reference to the shared list and submit each of them to the Executor. Each crawler removes the next URL from the list and does whatever you need it to do, looping until the list is empty.
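A sketch of those steps, assuming a hypothetical `fetch` method in place of the real page download:

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlerSketch {
    public static void main(String[] args) throws InterruptedException {
        // Shared, thread-safe work list and visited set.
        Queue<String> frontier = new ConcurrentLinkedQueue<>();
        Set<String> visited = ConcurrentHashMap.newKeySet();
        frontier.add("http://example.com/a");
        frontier.add("http://example.com/b");

        // One Executor sized to the number of concurrent crawlers.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                String url;
                while ((url = frontier.poll()) != null) { // loop until the list is empty
                    if (visited.add(url)) {               // add() is atomic: true only for new URLs
                        fetch(url);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(visited.size()); // 2
    }

    static void fetch(String url) { /* download and parse here */ }
}
```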
It's been a few years since this question was asked, but as of Nov 2015 we are using frontera and scrapyd.
Scrapy uses Twisted, which makes it a good multithreaded crawler, and on multi-core machines that means we are limited only by the inbound bandwidth. Frontera-distributed uses HBase and Kafka to score links and keep all the data accessible to clients.
Create a central resource with a hash map that stores each URL as a key along with the last time it was scanned. Make this thread-safe. Then spawn threads with links in a queue, which can be picked up by the crawlers as starting points. Each thread then carries on crawling and updating the resource. A thread in the resource clears out outdated crawls. The in-memory resource can be serialized at startup, or it could live in a DB, depending on your app's needs.
You could make this resource accessible via remote services to allow multiple machines, or spread the resource itself over several machines by segregating URLs, etc.
You should use a blocking queue containing the URLs that need to be fetched. In this case you can create multiple consumers that fetch URLs in multiple threads. If the queue is empty, all fetchers simply block. So you can start all the threads at the beginning and do not need to control them afterwards.
You also need to maintain a list of already-downloaded pages in some persistent storage and check it before adding a URL to the queue.
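The blocking behaviour comes from `BlockingQueue.take()`, which parks the calling thread until an element arrives. A small sketch, using a "poison pill" string as one common (hypothetical) shutdown mechanism:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BlockingFetcherSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> urls = new LinkedBlockingQueue<>();

        Thread fetcher = new Thread(() -> {
            try {
                while (true) {
                    String url = urls.take();      // blocks while the queue is empty
                    if (url.equals("STOP")) break; // poison pill ends the loop
                    System.out.println("fetched " + url); // real download would go here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fetcher.start();

        urls.put("http://example.com");
        urls.put("STOP");
        fetcher.join(); // fetcher exits once it consumes the pill
    }
}
```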
If you don't want to re-invent the wheel, why not look at Apache Nutch.
I need to use the Stanford Parser in a web service. Since SentenceParser loads a big object, I will make sure it is a singleton, but in that case, is it thread-safe? (Not according to http://nlp.stanford.edu/software/parser-faq.shtml.) How else could this be done efficiently? One option is locking the object while it is in use.
Any idea how the people at Stanford are doing this for http://nlp.stanford.edu:8080/parser/ ?
If contention is not a factor, locking (synchronization) would be one option, as you mentioned, and it might be good enough.
If there are contentions, however, I see three general options.
(1) instantiating it every time
Just instantiate it as a local variable every time you perform parsing. Local variables are trivially thread-safe. The instantiation is not free, of course, but it may be acceptable depending on the specific situation.
(2) using threadlocals
If instantiation turns out to be costly, consider using thread-locals. Each thread retains its own copy of the parser, and the parser instance is reused on a given thread. Thread-locals are not without problems, however. First, a thread-local value may not be garbage collected until it is removed or the holding thread goes away, so there is a memory concern if there are too many of them. Second, beware of reuse: if these parsers are stateful, you need to make sure to clean up and restore the initial state so that subsequent use of the thread-local instance does not suffer from side effects of previous use.
(3) pooling
Pooling is in general no longer recommended, but if the objects are truly so large that you need a hard limit on the number of instances you can allow, then an object pool might be the best option.
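For option (2), the thread-local variant might look like this sketch; `ExpensiveParser` is a hypothetical stand-in for Stanford's SentenceParser, and its output format is made up:

```java
public class PerThreadParserSketch {
    // Stand-in for a parser that is slow to construct and not thread-safe.
    static class ExpensiveParser {
        ExpensiveParser() { /* slow model loading would happen here */ }
        String parse(String s) { return "(ROOT " + s + ")"; }
    }

    // Each thread lazily builds its own instance on first use and reuses it
    // on every subsequent call from that thread.
    static final ThreadLocal<ExpensiveParser> parsers =
            ThreadLocal.withInitial(ExpensiveParser::new);

    static String parse(String sentence) {
        return parsers.get().parse(sentence);
    }

    public static void main(String[] args) {
        System.out.println(parse("hello")); // (ROOT hello)
    }
}
```

Remember the caveat above: on pooled threads (e.g. in a servlet container), these instances stay alive with the threads unless you call `parsers.remove()`.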
I don't know how the people at Stanford have implemented their service, but I would build such a service on a message framework such as http://www.rabbitmq.com/. Your front-end service would receive documents and use a message queue to communicate (store documents and retrieve results) with several workers that execute the NLP parsing. The workers, after finishing processing, would store results into a queue that is consumed by the front-end service. This architecture lets you dynamically add new workers in case of high load, especially since NLP tagging takes some time - up to several seconds per document.