How to deal with threads hanging in a Java ThreadPoolExecutor?

I'm using Executors.newFixedThreadPool(THREADS) with THREADS = 20 to send events asynchronously through a REST service.
A couple of weeks ago I ran into some unexpected problems when the REST service I'm consuming went down or began to respond more slowly.
1. First, the LinkedBlockingQueue managed by the thread pool became a bottleneck, since tasks could not be finished faster than new ones were created. Eventually it approached its maximum capacity (Integer.MAX_VALUE) and a large part of the heap was taken up by the queue.
2. Second, although the REST service recovered and began responding in time again, the thread pool seemed to be stuck in a "hanging state" where no more events were sent. I had to restart the application to fix the problem and get events flowing again.
Here's the output of the Memory Analyzer Tool with the heap dump at issue:
One instance of "java.util.concurrent.ThreadPoolExecutor" loaded by "" occupies 2,518,687,016 (94.69%) bytes. The instance is referenced by com.despegar.flights.keeper.concurrent.trace.TraceAwareExecutorServiceImpl # 0x695cf3550, loaded by "sun.misc.Launcher$AppClassLoader # 0x694cfd810".
Keywords
java.util.concurrent.ThreadPoolExecutor
sun.misc.Launcher$AppClassLoader # 0x694cfd810
And the static state of the ThreadPoolExecutor:
Type |Name |Value
----------------------------------
int |TERMINATED |1610612736
int |TIDYING |1073741824
int |STOP |536870912
int |SHUTDOWN |0
int |RUNNING |-536870912
int |CAPACITY |536870911
int |COUNT_BITS |29
----------------------------------
Here's the code:
public AsyncDispatchManager() {
    this.executorService = new TraceAwareExecutorServiceImpl(Executors.newFixedThreadPool(THREADS));
}

public void execute(String name, Runnable runnable) {
    this.executorService.submit(this.adaptRunnable(name, runnable));
}
Regarding the queue, I don't want to spend more resources on it, so I would implement a bounded queue with a rejection policy for these situations, since I don't mind losing some events.
Do you think there's a more appropriate solution?
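For reference, this is roughly the bounded pool I have in mind (a sketch only; the capacity of 10,000 is an arbitrary example, and DiscardPolicy simply drops tasks once the queue is full):
public AsyncDispatchManager() {
    // Same THREADS workers as before, but with a bounded queue instead of the
    // Integer.MAX_VALUE LinkedBlockingQueue created by Executors.newFixedThreadPool.
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
            THREADS, THREADS,
            0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>(10_000),          // capacity is an example value
            new ThreadPoolExecutor.DiscardPolicy());    // silently drop events when full
    this.executorService = new TraceAwareExecutorServiceImpl(pool);
}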
As for the hanging threads, I'm stuck: I don't understand the underlying problem or how to resolve it.
Has anyone had to deal with problems like these?
I'd really appreciate any proposed solution or information.

Related

Using a disk-backed queue

Under certain conditions, one of our servers running legacy code in a Wildfly application server suffers thread starvation and needs to be restarted.
After an arduous investigation, I stumbled upon this silly code:
private void addToQueue(Item e) throws InterruptedException {
    if (!_queue.offer(e, 200L, TimeUnit.MILLISECONDS)) {
        ThreadService.getInstance().schedule("retry process", () -> {
            addToQueue(e);
            return null;
        }, 5L, TimeUnit.SECONDS);
    }
}
ThreadService is the Wildfly implementation of the Java SE ExecutorService, which provides a limited number of threads (16).
Sometimes, during reconnections between services, we receive a huge number of items to process (~100k) in a short time, and as you can see this spams the ThreadService with scheduled retry tasks.
An obvious solution would be to increase the queue capacity, which is currently 20k. However, I am afraid this would just lead to other problems. Obviously this task-scheduling spam needs to be eliminated.
Since processing these items is a non-critical task, I was thinking of using a disk-backed queue, so it can be done in a separate process at a slow pace.
Searching a bit, I have seen this project: Tape by Square
I would like to know your opinion about this solution, which reminds me a bit of the Linux pipes I used years ago. What do you think?
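Independently of the disk-backed idea, one way to stop the per-item rescheduling would be to park rejected items in a local overflow queue and drain it from a single periodic task, instead of scheduling one retry per item. A rough sketch, assuming _queue is a BlockingQueue<Item> (overflow and drainOverflow are made-up names for illustration):
private final ConcurrentLinkedQueue<Item> overflow = new ConcurrentLinkedQueue<>();

private void addToQueue(Item e) throws InterruptedException {
    if (!_queue.offer(e, 200L, TimeUnit.MILLISECONDS)) {
        // Park the item locally; one periodic drain task retries it later,
        // so the ThreadService no longer gets one scheduled task per rejected item.
        overflow.add(e);
    }
}

// Scheduled once at startup (e.g. every few seconds), not once per item.
private void drainOverflow() {
    Item e;
    while ((e = overflow.peek()) != null && _queue.offer(e)) {
        overflow.poll();
    }
}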

Concurrent logging to an SQL DB - threads not running in parallel

I tested this with Log4j2 2.14.0 and 2.13.3.
I used the JDBC Appender in combination with the DynamicThresholdFilter and tried out a normal Logger
as well as the AsyncLogger.
In the JDBC Appender I also tried both the PoolingDriver and the ConnectionFactory approach.
It turns out that the threads are not running in parallel because of Log4j2.
Using the AsyncLogger made it even worse, since the output said that the appender was not started, and of 15,000 expected logs only 13,517 ended up in the DB.
To reproduce the issue I made a GitHub repo, see here: https://github.com/stefanwendelmann/Log4j_JDBC_Test
EDIT
I replaced the mssql-jdbc driver with an H2 DB and the threads don't block.
The JMC automatic analysis says that there are locking instances of JdbcDatabaseManager.
Is there a configuration problem in my PoolableConnectionFactory for mssql-jdbc, or is there a general problem with DBCP / JDBC driver pooling?
Edit 2
Created Ticket on Apaches LOGJ2 Jira: https://issues.apache.org/jira/browse/LOG4J2-3022
Edit 3
Added longer flight recordings for mssql and h2:file:
https://github.com/stefanwendelmann/Log4j_JDBC_Test/blob/main/recording_local_docker_mssql_asynclogger_10000_runs.jfr
https://github.com/stefanwendelmann/Log4j_JDBC_Test/blob/main/recording_local_h2_file_asynclogger_10000_runs.jfr
Thanks for getting the flight recordings up. This is a pretty interesting scenario, but I'm afraid I can't give conclusive answers, mostly because:
1. The information in your flight recordings is, for some reason, weirdly incomplete. I'll explain a little more shortly.
2. There seem to be other things going on in your system that may be muddying the diagnosis. You might benefit from killing any other running processes on your machine.
So, what now? (TL;DR)
You need to be sure that your connection source to the database is pooled
Make sure you start your load test on a calm, clear-headed CPU
Configure your next flight recording to take sufficient, intermittent thread dumps. This is probably the most important next step, if you're interested in figuring out what exactly all these threads are waiting for. Don't post up another flight recording until you're positive it contains multiple thread dumps that feature all the live threads in your JVM.
Maybe 10k threads isn't reasonable for your local machine
I also noticed from the flight recording that you have a heap size maxed at 7 GB. If you're not on a 64-bit OS, that could actually be harmful: a 32-bit OS can address a max of 4 GB.
Make sure there aren't any actual database failures causing the whole thing to thrash. Are you running out of connections? Are there any SQLExceptions blowing up somewhere? Any exceptions at all?
Here's what I could tell from your recordings:
CPU
Both flight recordings show that your CPU was struggling for good chunks of both of your recordings:
The MSSQL recording (46 mins total)
JFR even warns in the MSSQL recording that:
An average CPU load of 42 % was caused by other processes during 1 min 17 s starting at 2/18/21 7:28:58 AM.
The H2 recording (20.3s total)
I noticed that your flight recordings are titled XXXX_10000. If this means "10k concurrent requests", it may simply mean that your machine can't deal with the load you're putting on it. You may also benefit from first ensuring that your cores don't have a bunch of other things hogging their time before you kick off another test. At any rate, hitting 100% CPU utilization is bound to cause lock contention as a matter of course, due to context switching. Your flight recording shows that you're running on an 8-core machine; but you noted that you're running a dockerized MSSQL. How many cores did you allocate to Docker?
Blocked Threads
There's a tonne of blocking in your setup, and there are smoking guns everywhere. The thread identified by Log4j2-TF-1-AsyncLoggerConfig-1 was blocked a lot by the garbage collector, just as the CPU was thrashing:
The H2 flight recording:
All but the last 3 ticks across that graph were blockings of the log4j2 thread. There was still significant blocking of the other pooled threads by GC (more on that further down)
The MSSQL flight recording had smoother GC, but both flight recordings featured blocking by GC and the consequent super-high CPU utilization. One thing was clear from both the MSSQL and H2 recordings: every other pooled thread was blocked, waiting for a lock on the same object ID.
For MSSQL, lock ID: 0x1304EA4F40; for H2, lock ID: 0x21A1100D7D0
Every thread except the main thread and pool-1-thread-1 (which was blocked by garbage collection) exhibits this behavior.
These 7 threads are all waiting for the same object. There is definitely some blocking or even a deadlock somewhere in your setup.
The small specks of green also corroborate the intermittent hand-off of monitor locks between the various threads, confirming that they're effectively gridlocked. The pane at the bottom that shows the threads gives a timeline of each thread's blockage. Red indicates blocked; green indicates running. If you hover over a thread's red portion, it shows you:
1. that the thread is blocked, waiting to acquire a lock (red)
2. the ID of the lock the thread is trying, and currently failing, to acquire
3. the ID of the thread that last held the lock
Green indicates a running, unblocked thread.
When you hover over the red slices in your flight recording, you'll see that they're all waiting to acquire the same lock. That lock is intermittently held between the various pooled threads.
MSSQL (threads blocked waiting for 0x1304EA4F40):
H2 (threads blocked waiting for 0x21A1100D7D0):
In both flight recordings, pool-1-thread-1 is the sole thread that isn't blocked while trying to acquire a lock. That blank row for pool-1-thread-1 is solely due to garbage collection, which I covered earlier.
Dumps
Ideally, your flight recordings should contain a bunch of thread dumps, especially the one that you ran for over 40 mins; never mind the 20s one. Unfortunately, both recordings contain just 2 thread dumps each; only one of them even contains the stack trace for pool-1-thread-1. Singular thread dumps are worthless. You'll need multiple snapshots over a length of time to make use of them. With a thread dump (or a heap dump), one could identify which objects the IDs 0x1304EA4F40 and 0x21A1100D7D0 refer to. The most I could figure out from the dumps is that they're all waiting for an instance of "Object":
It literally could be anything. Your very first flight recording at least showed that the threads were locked on org.apache.logging.log4j.core.appender.db.jdbc.JdbcDatabaseManager:
That very first recording shows the same pattern in the locks pane, that all the threads were waiting for that single object:
That first recording also shows us what pool-1-thread-1 was up to at that one instant:
From there, I would hazard a guess that that thread was in the middle of closing a database connection? Nothing conclusive can be said until multiple successive thread dumps show the thread activity over a span of time.
I tested on a MySQL DB and I found a lock on the following method:
org.apache.logging.log4j.core.appender.db.AbstractDatabaseManager.write(org.apache.logging.log4j.core.LogEvent, java.io.Serializable) (line: 261)
because in the source code you can see the synchronization on the write method:
/**
 * This method manages buffering and writing of events.
 *
 * @param event The event to write to the database.
 * @param serializable Serializable event
 */
public final synchronized void write(final LogEvent event, final Serializable serializable) {
    if (isBuffered()) {
        buffer(event);
    } else {
        writeThrough(event, serializable);
    }
}
I think that if you specify a buffer size it will increase throughput, because logs will be collected into batches and contention on the synchronized write will be much lower.
After updating the Log4j2 config file to use the AsyncLogger, you will see a lock on:
org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue(org.apache.logging.log4j.core.LogEvent, org.apache.logging.log4j.core.async.AsyncLoggerConfig) (line: 375)
and the implementation of that method:
private void enqueue(final LogEvent logEvent, final AsyncLoggerConfig asyncLoggerConfig) {
    if (synchronizeEnqueueWhenQueueFull()) {
        synchronized (queueFullEnqueueLock) {
            disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
        }
    } else {
        disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
    }
}
synchronizeEnqueueWhenQueueFull is true by default, and it serializes enqueues across threads, which is what produces the locks you see. You can manage this behaviour with these properties:
/**
* LOG4J2-2606: Users encountered excessive CPU utilization with Disruptor v3.4.2 when the application
* was logging more than the underlying appender could keep up with and the ringbuffer became full,
* especially when the number of application threads vastly outnumbered the number of cores.
* CPU utilization is significantly reduced by restricting access to the enqueue operation.
*/
static final boolean ASYNC_LOGGER_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL = PropertiesUtil.getProperties()
.getBooleanProperty("AsyncLogger.SynchronizeEnqueueWhenQueueFull", true);
static final boolean ASYNC_CONFIG_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL = PropertiesUtil.getProperties()
.getBooleanProperty("AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull", true);
But you should be aware of the side effects of changing these parameters, as mentioned in the code comment above.
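If you do want to experiment with this, the property names shown in the snippet above can be set before Log4j2 initialises, for example as system properties (a sketch; a log4j2.component.properties file on the classpath should also work):
// Must be set before the first LogManager.getLogger() call, i.e. before Log4j2 initialises.
// Disabling the synchronization trades the reduced CPU contention described in the Javadoc
// above for potentially higher CPU usage when the ring buffer is full.
System.setProperty("AsyncLogger.SynchronizeEnqueueWhenQueueFull", "false");
System.setProperty("AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull", "false");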
Ideas for why a DB can become a bottleneck:
a remote DB (VPN, etc.)
check which strategy is used for the ID column (SEQUENCE, TABLE, IDENTITY) to avoid extra DB calls
are there indexes on the columns? (these can trigger a re-index operation on each transaction commit)

Appengine Backend Task Dispatcher: A More Economical Version

I'm working on a job dispatcher for App Engine, and the default scheduler always winds up firing up 3-4 instances that do all the work, plus some overflow instances that might take thousands of tasks, or only a couple, and then sit there burning CPU doing nothing.
My task involves processing jobs for many different-sized domains; sometimes there's huge throughput, and other times it's one user with 10,000 models to update. If I turn the normal App Engine task scheduler loose, it fails in two ways: 1) backends never shut down, and when memory hits the cap, Java GC makes an instance thrash and act like it's almost a zombie, yet it never shuts down (and still takes/holds jobs); and 2) many domains have a single user that takes far longer than all the others to process, which keeps a backend alive long after the rest of the domain has finished.
These tasks must run throughout the day, and it takes multiple backends to handle the fanout, so I can't just dump them all on a B8 and call it a day; we need a dispatcher to manage how tasks get allocated to backends.
Now, I don't want to pay datastore ops on every task just to save a few minutes of CPU time, so my plan of attack (please critique) is to use a static ConcurrentHashMap in RAM: start each run() in a try, have every deferred task put its [hashcode, startTime] in at startup and remove(hashcode) in a finally. There will be one such map per backend instance that's running jobs, wrapped in a method, BackendCounter.addToLiveMap(this); its .size() serves as a running total of how many jobs are alive on that backend (with a timestamp to detect zombie jobs that run for more than 10 minutes). The job dispatcher can fire off a worker thread per instance to monitor how many jobs, excluding itself, are running in that instance, and keep a ranked list in memcache of which instances have how many tasks alive. If one instance drops below a threshold of X live tasks, pick an overflow instance to defer to, then have BackendCounter.addToLiveMap(this) throw an exception I can catch to tell jobs to reschedule themselves to a new instance (ChangeInstanceException#getNewTarget()). This way I can prevent barely-used instances from getting new jobs so they have a chance to shut down, paying only for some memcache ops, while fanout only pays a write and a delete to a static map.
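For concreteness, here is a rough sketch of the BackendCounter I have in mind (shouldRedirect() and pickOverflowInstance() are placeholders for the dispatcher logic described above; ChangeInstanceException is the exception mentioned earlier):
public final class BackendCounter {
    // One map per backend instance: task hashcode -> start time in ms.
    private static final ConcurrentHashMap<Integer, Long> LIVE = new ConcurrentHashMap<>();
    private static final long ZOMBIE_LIMIT_MS = TimeUnit.MINUTES.toMillis(10);

    private BackendCounter() {}

    // Called at the top of each deferred task's run(), inside the try.
    public static void addToLiveMap(Object task) throws ChangeInstanceException {
        if (shouldRedirect()) {
            // Tell the job to reschedule itself on another backend.
            throw new ChangeInstanceException(pickOverflowInstance());
        }
        LIVE.put(task.hashCode(), System.currentTimeMillis());
    }

    // Called from the finally block of run().
    public static void removeFromLiveMap(Object task) {
        LIVE.remove(task.hashCode());
    }

    // Running total of live jobs on this backend, ignoring zombies (>10 minutes).
    public static int liveCount() {
        long now = System.currentTimeMillis();
        int live = 0;
        for (long start : LIVE.values()) {
            if (now - start <= ZOMBIE_LIMIT_MS) {
                live++;
            }
        }
        return live;
    }

    // Placeholders: the per-instance monitor thread and the memcache ranking decide these.
    private static boolean shouldRedirect() { return false; }
    private static String pickOverflowInstance() { return "overflow-backend"; }
}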
That takes care of problem two, which is the instance-hour killer. As for problem one, which is how to prevent one instance (usually instance 0 or 1) from hitting peak memory and starting to turn toward the dark side, I am torn between two options.
On the one hand, I can use the expected BackendCounter.addToLiveMap(this) throws ChangeInstanceException call and simply check memory:
if ((float) Runtime.getRuntime().freeMemory() / Runtime.getRuntime().totalMemory() < 0.1f) { // less than ~10% of the heap free
    throw new ChangeInstanceException(getOverflowInstance());
}
This naive approach will simply tell any instance approaching its memory limit to send all new work elsewhere.
On the other hand, I could keep instances 0 and 1 for handling overflow (and toggle which of the two gets new jobs, to give them a chance to shut down), then send the fanout to instances 2+, which would only run until they drop to, say, 10 or 15 jobs in parallel. The fanout is pretty consistent and only takes a couple of minutes, so instances 2, 3 and, at most, 4 would need to turn on, and would be given time to turn off while a different instance gets hit with more load.
The only thing I'm afraid of is jobs starting to bounce from one instance to another, which can probably be prevented with a redirect limit after which ChangeInstanceException is no longer thrown.
Any thoughts or advice are greatly appreciated.

Sporadic problems in running a multi-threaded Java project in Win7

I am working on a project that is both memory- and computationally intensive. A significant portion of the execution uses multi-threading via a FixedThreadPool. In short, I have one thread fetching data from several remote locations (using URL connections) and populating a BlockingQueue with objects to be analyzed, and n threads that pick up these objects and run the analysis. Edit: see code below.
Now, this setup works like a charm on my Linux machine running openSUSE 11.3, but a colleague testing it on a very similar machine running Win7 is getting custom notifications of timeouts on the queue polling (see code below), lots of them actually. I have been monitoring the processor use on her machine, and it appears that the software never gets more than 15% of the CPUs, while on my machine the processor usage hits the roof, just as I intended.
My question, then, is: can this be a sign of "starvation" of the queue? Could it be that the producer thread is not getting enough CPU time? If so, how do I go about giving one particular thread in the pool a higher priority?
UPDATE:
I have been trying to pinpoint the problem, with no joy... I did however gain some new insights.
Profiling the execution of the code with JVisualVM demonstrates a very peculiar behavior: the methods are called in short bursts of CPU time with several seconds of no progress in between. To me this means that somehow the OS is hitting the brakes on the process.
Disabling the anti-virus and back-up daemons does not have any significant effect on the matter.
Changing the priority of java.exe (the only instance) through Task Manager (as advised here) does not change anything either. (That being said, I could not give "realtime" priority to java, and had to be content with "high" priority.)
Profiling the network usage shows a good flow of data in and out, so I am guessing that is not the bottleneck (it is a considerable part of the execution time of the process, but I know that already, and it is pretty much the same percentage as what I get on my Linux machine).
Any ideas as to how the Win7 OS might be limiting the CPU time of my project? If it's not the OS, what could be the limiting factor? I would like to stress yet again that the machine is NOT running anything else computation-intensive at the same time, and there is almost no load on the CPUs other than from my software. This is driving me crazy...
EDIT: relevant code
public ConcurrencyService(Dataset d, QueryService qserv, Set<MyObject> s) {
    timeout = 3;
    this.qs = qserv;
    this.bq = qs.getQueue();
    this.ds = d;
    this.analyzedObjects = s;
    this.drc = DebugRoutineContainer.getInstance();
    this.started = false;
    int nbrOfProcs = Runtime.getRuntime().availableProcessors();
    poolSize = nbrOfProcs;
    pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(poolSize);
    drc.setScoreLogStream(new PrintStream(qs.getScoreLogFile()));
}

public void serve() throws InterruptedException {
    try {
        this.ds.initDataset();
        this.started = true;
        pool.execute(new QueryingAction(qs));
        for (;;) {
            MyObject p = bq.poll(timeout, TimeUnit.MINUTES);
            if (p != null) {
                if (p.getId().equals("0"))
                    break;
                pool.submit(new AnalysisAction(ds, p, analyzedObjects, qs.getKnownAssocs()));
            } else
                drc.log("Timed out while waiting for an object...");
        }
    } catch (Exception ex) {
        ex.printStackTrace();
        String exit_msg = "Unexpected error in core analysis, terminating execution!";
    } finally {
        drc.log("--DEBUG: Termination criteria found, shutdown initiated..");
        drc.getMemoryInfo(true); // dump meminfo to log
        pool.shutdown();
        int mins = 2;
        int nCores = poolSize;
        long totalTasks = pool.getTaskCount(),
             compTasks = pool.getCompletedTaskCount(),
             tasksRemaining = totalTasks - compTasks,
             timeout = mins * tasksRemaining / nCores;
        drc.log("--DEBUG: Shutdown commenced, thread pool will terminate once all objects are processed, " +
                "or will timeout in : " + timeout + " minutes... \n" + compTasks + " of " + (totalTasks - 1) +
                " objects have been analyzed so far, " + "mean process time is: " +
                drc.getMeanProcTimeAsString() + " milliseconds.");
        pool.awaitTermination(timeout, TimeUnit.MINUTES);
    }
}
The class QueryingAction is a simple Runnable that calls the data acquisition method in the designated QueryService object which then populates a BlockingQueue. The AnalysisAction class does all the number-crunching for a single instance of MyObject.
I suspect the producer thread is not getting/loading the source data fast enough. This might not be a lack of CPU but an I/O-related issue. (I'm not sure why you have timeouts on your BlockingQueue.)
It might be worth having a thread which periodically logs things like the number of tasks added and the length of the queue (e.g. every 5-15 seconds)
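For example, something along these lines (a rough sketch; bq, pool and drc are the fields from your ConcurrencyService, and the 10-second interval is arbitrary):
ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
monitor.scheduleAtFixedRate(() ->
        drc.log("queue=" + bq.size()
                + ", submitted=" + pool.getTaskCount()
                + ", completed=" + pool.getCompletedTaskCount()
                + ", active=" + pool.getActiveCount()),
        10, 10, TimeUnit.SECONDS);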
So, if I understand your problem correctly, you have one thread to fetch data and several threads to analyse the fetched data. Your problem is that the threads are not correctly synchronized to run together and take full advantage of the processor.
You have a typical producer-consumer problem with a single producer and several consumers.
I advise you to rework your code a bit so that you instead have several independent consumer threads that are always waiting for resources to become available and only then run. This way you guarantee maximum processor use.
Consumer thread:
while (!terminate)
{
    synchronized (Producer.getLockObject())
    {
        try
        {
            // sleep (no processing at all) until the producer signals new data;
            // real code should re-check the queue in a loop here to cope with
            // spurious wake-ups and notifications that arrive before the wait
            Producer.getLockObject().wait();
        }
        catch (InterruptedException e)
        {
            // restore the interrupt flag and stop consuming
            Thread.currentThread().interrupt();
            break;
        }
    }
    MyObject p = Producer.getObjectFromQueue(); // this method should be synchronized
    // Analyse the fetched data and submit the result somewhere...
}
Producer thread:
while (!terminate)
{
    MyObject newData = fetchData();  // fetch data from the remote location
    addDataToQueue(newData);         // this should also be synchronized
    synchronized (getLockObject())
    {
        // wake up one consumer thread to deal with the data
        getLockObject().notify();
    }
}
You see that this way, your threads are always performing useful work or sleeping.
This is just draft code to exemplify.
See more explanation here: http://www.javamex.com/tutorials/wait_notify_how_to.shtml
and here: http://www.java-samples.com/showtutorial.php?tutorialid=306
Priority won't help, since the problem is not an issue of deciding who gets precious resources -- resource usage isn't maxed. The only way the producer thread would not be getting enough CPU time is if it wasn't ready-to-run.
How many cores does the machine have? It's possible that the producer thread is running full speed and there still just isn't enough CPU to go around. It's also possible the producer is I/O bound.
You can try to separate the producer thread from the pool (i.e. create a distinct Thread and set the pool to have -1 the current capacity) and then set its priority to maximum via setPriority. See what happens, although priority rarely accounts for such a difference in performance.
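Roughly like this (a sketch only; QueryingAction and qs come from your posted code):
// Run the producer outside the pool, at maximum priority...
Thread producer = new Thread(new QueryingAction(qs), "query-producer");
producer.setPriority(Thread.MAX_PRIORITY);
producer.start();

// ...and give the analysis pool one thread fewer to compensate.
int analysisThreads = Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(analysisThreads);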
When you say URL connection, do you mean local or remote? It could be that network speed is slowing your producer down
So after weeks of fiddling, wrestling with code and other types of suffering, I think I had a breakthrough, "a moment of clarity" if you will...
I managed to show that the program can exhibit the same slow behavior on my Linux machine, and can indeed run full throttle on the problematic Win7 machine. The crux of the problem appears to be some sort of corruption of the system/cache files that are used to store the results of previous queries and, overall, speed up the analysis. You have got to love the irony: in this case they appeared to be the reason for the EXTREMELY slow analysis. In retrospect, I should have known (a la Occam's razor)...
I am still not sure how the corruption occurs, but at least it is probably not related to the different OS. Using the system files from my machine increases the throughput on the Win7 host by only about 40%, however. Profiling the process further also revealed that, oddly enough, there is significantly more GC activity on Win7, which apparently took a lot of CPU time away from the number-crunching. Giving the JVM -Xmx2g takes care of the excessive garbage collection, the CPU usage for the process shoots up to 95-96%, and the threads run smoothly.
Now that my original question is answered, I have to say that overall Java responsiveness is definitely better in the Linux environment; even without allocating more heap memory, I can easily multi-task while an extensive analysis is running in the background. Things are not as smooth in Win7, e.g. resizing the GUI is significantly slower once the analysis takes off at full speed.
Thanks for all the replies, and I am sorry for the partially misleading problem description; I merely shared what I found out while debugging to the best of my abilities. Anyway, I believe the bounty goes to Peter Lawrey, since he pointed to an I/O issue early on, and it was his suggestion about a logger thread that eventually led me to the answer.
I would think it is some OS-specific issue, because that is the core difference between the two machines. More specifically, something is slowing down the data arriving through the remote connection.
Find a traffic analysis tool such as Wireshark and/or NetWorx and try to discover whether anything is throttling the Win PC. Perhaps it is going through a proxy that has some kind of rate cap configured.
Sorry, this is not really an answer, but it did not fit inside a comment and I still think it is worth the read:
I am not a Java person,
but I recently had the same problem with C++ projects for machine control through USB.
On XP or W2K everything runs perfectly for months of 24/7 operation on any machine with 2 or more cores.
On W7 and a strong enough machine everything is OK, but sometimes (roughly once every few hours) it freezes for a few seconds without an obvious reason.
On W7 and a relatively weak machine (2-core 1.66GHz T2300E notebook) the threads freeze for some time and then run again, which under/overflows the USB/Windows/app FIFOs and collapses communication...
It appears that nothing is blocked; the W7 scheduler just occasionally does not give CPU time to the right threads.
I thought the USB driver (Jungo) communication was freezing, but that is not true; I measured it and it is fine even during a freeze.
The freezes were about 6-15 seconds, roughly once per minute.
After adding some safety sleeps to the thread loops, the freezes shortened to about 0.5 s,
but they are still there.
Even when the app does not under/overflow its FIFOs, the Windows USB driver side does (a few times per minute, for a few ms).
Changing the exe/thread priority and class does not affect performance on W7 (on XP and W2K it works as it should).
As you can see, it seems we most likely have the same problem. In my case:
it is not I/O related (when I replace the USB thread with a simulation of the device it behaves similarly)
adding Sleep to time-critical code helps a lot
the error is also present with a low number of threads [2 fast (17 ms) + 1 slow (250 ms) + app code = 4]
my CPU consumption on the slow W7 machine is also not 100% but about 95%, which is OK because I have sleeps everywhere
my apps use about 40-100 MB of memory but are demanding in terms of CPU computation...
though not so much that they could not run safely on much slower machines,
but because of the USB driver connection and multiple-device support they need at least 2 cores
My next step is to add some kind of execution-time logging/analysis to see what is happening in more detail,
and also a little rewrite of the send/receive threads to see if it helps.
When I learn something new/useful I will add it.

Greedy threads are grabbing too many JMS messages under WebLogic

We encountered a problem under WebLogic 8.1 that we lived with but could never fix. We often queue up a hundred or more JMS messages, each of which represents a unit of work. Despite the fact that each message is of the same size and looks the same, one may take only seconds to complete while the next one represents 20 minutes of solid crunching.
Our problem is that each of the message driven beans we have doing the work of these messages ends up on a thread that seems to grab ten messages at a time (we think it is being done as a WebLogic optimization to keep from having to hit the queue over and over again for small messages). Then, as one thread after another finishes all of its small jobs and no new ones come in, we end up with a single thread log jammed on a long running piece of work with up to nine other items sitting waiting on it to finish, despite the fact that other threads are free and could start on those units of work.
Now we are at a point where we are converting to WebLogic 10 so it is a natural point to return to this problem and find out if there is any solution that we could implement so that either: a) each thread only grabs one JMS message at a time to process and leaves all the others waiting in the incoming queue, or b) it would automatically redistribute waiting messages (even ones already assigned to a particular thread) out to free threads. Any ideas?
Enable the Forward Delay and provide an appropriate value. This will cause the JMS queue to redistribute messages to its peers if they have not been processed within the configured time.
Taking a single message off the queue every time might be overkill - it's all a balance between the number of messages you are processing and what you gauge to be an issue.
There are also multiple issues with JMS on WebLogic 10 depending on your setup. You can save yourself a lot of time and trouble by using the latest MP right from the start.
A thread is in "starvation" when it cannot get the resources it needs in order to execute; in this scenario, the threads holding on to more messages than they can process are the "greedy threads".
