JBoss Netty - How to serve 3 connections using 2 worker threads - java

Just as a simple example, let's say I want to handle 3 simultaneous TCP client connections using only 2 worker threads in Netty. How would I do it?
Questions
A)
With the code below, my third connection doesn't get any data from the server - the connection just sits there. Notice how my worker executor and worker count are both 2.
So if I have 2 worker threads and 3 connections, shouldn't all three connections be served by the 2 threads?
B)
Another question is: does Netty use the CompletionService of java.util.concurrent? It doesn't seem to. Also, I didn't see any source code that calls executor.submit or future.get.
All this adds to my confusion about how it handles and serves data to connections that outnumber its worker threads.
C)
I'm lost on how Netty handles 10,000+ simultaneous TCP connections... will it create 10,000 threads? Thread-per-connection is not a scalable solution, so I'm confused, especially since my test code doesn't work as expected.
import java.net.InetSocketAddress;
import java.nio.channels.ClosedChannelException;
import java.util.Date;
import java.util.concurrent.Executors;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jboss.netty.bootstrap.ServerBootstrap;
import org.jboss.netty.channel.Channel;
import org.jboss.netty.channel.ChannelFuture;
import org.jboss.netty.channel.ChannelFutureListener;
import org.jboss.netty.channel.ChannelHandlerContext;
import org.jboss.netty.channel.ChannelPipeline;
import org.jboss.netty.channel.ChannelPipelineFactory;
import org.jboss.netty.channel.ChannelStateEvent;
import org.jboss.netty.channel.Channels;
import org.jboss.netty.channel.ExceptionEvent;
import org.jboss.netty.channel.SimpleChannelUpstreamHandler;
import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;
import org.jboss.netty.handler.codec.string.StringEncoder;
public class SRNGServer {
public static void main(String[] args) throws Exception {
// Configure the server.
ServerBootstrap bootstrap = new ServerBootstrap(
new NioServerSocketChannelFactory(
Executors.newCachedThreadPool(),
//Executors.newCachedThreadPool()
Executors.newFixedThreadPool(2), 2
));
// Configure the pipeline factory.
bootstrap.setPipelineFactory(new SRNGServerPipelineFactoryP());
// Bind and start to accept incoming connections.
bootstrap.bind(new InetSocketAddress(8080));
}
private static class SRNGServerHandlerP extends SimpleChannelUpstreamHandler {
private static final Logger logger = Logger.getLogger(SRNGServerHandlerP.class.getName());
@Override
public void channelConnected(
ChannelHandlerContext ctx, ChannelStateEvent e) throws Exception {
// Send greeting for a new connection.
Channel ch=e.getChannel();
System.out.printf("channelConnected with channel=[%s]%n", ch);
ChannelFuture writeFuture=e.getChannel().write("It is " + new Date() + " now.\r\n");
SRNGChannelFutureListener srngcfl=new SRNGChannelFutureListener();
System.out.printf("Registered listener=[%s] for future=[%s]%n", srngcfl, writeFuture);
writeFuture.addListener(srngcfl);
}
@Override
public void exceptionCaught(
ChannelHandlerContext ctx, ExceptionEvent e) {
logger.log(
Level.WARNING,
"Unexpected exception from downstream.",
e.getCause());
if(e.getCause() instanceof ClosedChannelException){
logger.log(Level.INFO, "****** Connection closed by client - Closing Channel");
}
e.getChannel().close();
}
}
private static class SRNGServerPipelineFactoryP implements ChannelPipelineFactory {
public ChannelPipeline getPipeline() throws Exception {
// Create a default pipeline implementation.
ChannelPipeline pipeline = Channels.pipeline();
pipeline.addLast("encoder", new StringEncoder());
pipeline.addLast("handler", new SRNGServerHandlerP());
return pipeline;
}
}
private static class SRNGChannelFutureListener implements ChannelFutureListener{
public void operationComplete(ChannelFuture future) throws InterruptedException{
Thread.sleep(1000*5);
Channel ch=future.getChannel();
if(ch!=null && ch.isConnected()){
ChannelFuture writeFuture=ch.write("It is " + new Date() + " now.\r\n");
//-- Add this instance as listener itself.
writeFuture.addListener(this);
}
}
}
}

I haven't analyzed your source code in detail, so I don't know exactly why it doesn't work properly. But this line in SRNGChannelFutureListener looks suspicious:
Thread.sleep(1000*5);
This will block the thread that executes it for 5 seconds; that thread will not be available to do any other processing during that time.
Question C: No, it will not create 10,000 threads; the whole point of Netty is that it doesn't do that, because that would indeed not scale very well. Instead, it uses a limited number of threads from a thread pool, generates events whenever something happens, and runs event handlers on the threads in the pool. So, threads and connections are decoupled from each other (there is not a thread for each connection).
To make this mechanism work properly, your event handlers should return as quickly as possible, to make the threads that they run on available for running the next event handler as quickly as possible. If you make a thread sleep for 5 seconds, then you're keeping the thread allocated, so it won't be available for handling other events.
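As an illustrative sketch (not part of the original answer), the 5-second delay could instead be scheduled with Netty 3's HashedWheelTimer, which releases the worker thread immediately and fires the next write from a timer thread:

import java.util.Date;
import java.util.concurrent.TimeUnit;
import org.jboss.netty.channel.Channel;
import org.jboss.netty.channel.ChannelFuture;
import org.jboss.netty.channel.ChannelFutureListener;
import org.jboss.netty.util.HashedWheelTimer;
import org.jboss.netty.util.Timeout;
import org.jboss.netty.util.Timer;
import org.jboss.netty.util.TimerTask;

// Sketch: replaces Thread.sleep(5000) in SRNGChannelFutureListener.
// The worker thread returns at once; the timer fires the next write later.
class NonBlockingFutureListener implements ChannelFutureListener {
    private static final Timer TIMER = new HashedWheelTimer();

    public void operationComplete(ChannelFuture future) {
        final Channel ch = future.getChannel();
        if (ch != null && ch.isConnected()) {
            TIMER.newTimeout(new TimerTask() {
                public void run(Timeout timeout) {
                    ChannelFuture writeFuture =
                            ch.write("It is " + new Date() + " now.\r\n");
                    // Re-register so the cycle repeats without blocking.
                    writeFuture.addListener(NonBlockingFutureListener.this);
                }
            }, 5, TimeUnit.SECONDS);
        }
    }
}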
Question B: If you really want to know, you could get the Netty source code and find out. It uses selectors and other java.nio classes to do asynchronous I/O.
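To illustrate the mechanism (a generic java.nio sketch, not Netty's actual source), a single thread can multiplex any number of connections with a Selector, which is why connections and threads are decoupled:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// One thread, many connections: the selector blocks until any registered
// channel is ready, then the loop dispatches the ready events.
public class SelectorSketch {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.socket().bind(new InetSocketAddress(8080));
        server.register(selector, SelectionKey.OP_ACCEPT);
        ByteBuffer buf = ByteBuffer.allocate(4096);
        while (true) {
            selector.select(); // blocks until some channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel ch = server.accept();
                    ch.configureBlocking(false);
                    ch.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    buf.clear();
                    ((SocketChannel) key.channel()).read(buf);
                    // hand the bytes to an event handler here
                }
            }
        }
    }
}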

Related

Apache Camel: async operation and backpressure

In Apache Camel 2.19.0, I want to produce messages and consume the result asynchronously on a concurrent seda queue while at the same time blocking if the executors on the seda queue are full.
The use case behind it: I need to process large files with many lines and need to create batches for it because a single message for each individual line is too much overhead, whereas I cannot fit the entire file into heap. But in the end, I need to know whether all batches I triggered have completed successfully.
So effectively, I need a backpressure mechanism that stops me from spamming the queue while still letting me leverage multi-threaded processing.
Here is a quick example in Camel and Spring. The route I configured:
package com.test;
import org.apache.camel.builder.RouteBuilder;
import org.springframework.stereotype.Component;
@Component
public class AsyncCamelRoute extends RouteBuilder {
public static final String ENDPOINT = "seda:async-queue?concurrentConsumers=2&size=2&blockWhenFull=true";
@Override
public void configure() throws Exception {
from(ENDPOINT)
.process(exchange -> {
System.out.println("Processing message " + (String)exchange.getIn().getBody());
Thread.sleep(10_000);
});
}
}
The producer looks like this:
package com.test;
import org.apache.camel.ProducerTemplate;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.event.ContextRefreshedEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
#Component
public class AsyncProducer {
public static final int MAX_MESSAGES = 100;
@Autowired
private ProducerTemplate producerTemplate;
@EventListener
public void handleContextRefresh(ContextRefreshedEvent event) throws Exception {
new Thread(() -> {
// Just wait a bit so everything is initialized
try {
Thread.sleep(5_000);
} catch (InterruptedException e) {
e.printStackTrace();
}
List<CompletableFuture> futures = new ArrayList<>();
System.out.println("Producing messages");
for (int i = 0; i < MAX_MESSAGES; i++) {
CompletableFuture future = producerTemplate.asyncRequestBody(AsyncCamelRoute.ENDPOINT, String.valueOf(i));
futures.add(future);
}
System.out.println("All messages produced");
System.out.println("Waiting for subtasks to finish");
futures.forEach(CompletableFuture::join);
System.out.println("Subtasks finished");
}).start();
}
}
The output of this code looks like:
Producing messages
All messages produced
Waiting for subtasks to finish
Processing message 6
Processing message 1
Processing message 2
Processing message 5
Processing message 8
Processing message 7
Processing message 9
...
Subtasks finished
So it seems that blockWhenFull is ignored and all messages are created and put onto the queue prior to processing.
Is there any way to create messages so that I can use async processing in camel while at the same time making sure that putting elements onto the queue will block if there are too many unprocessed elements?
I solved the problem by using streaming and a custom splitter. By doing this, I can split the source lines into chunks using an iterator that returns a list of lines instead of a single line only. With this, it seems to me that I can use Camel as required.
So the route contains the following portion:
.split().method(new SplitterBean(), "splitBody").streaming().parallelProcessing().executorService(customExecutorService)
with a custom-made splitter that behaves as described above.
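For illustration only, such a chunking splitter bean might look roughly like this, assuming the route feeds it an Iterator<String> of lines (the SplitterBean/splitBody names match the route above; the body is my sketch, not the poster's actual code):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Wraps a streaming line iterator and emits fixed-size batches, so the
// whole file is never materialized in memory at once.
public class SplitterBean {
    private static final int CHUNK_SIZE = 1000; // illustrative batch size

    public Iterator<List<String>> splitBody(final Iterator<String> lines) {
        return new Iterator<List<String>>() {
            public boolean hasNext() {
                return lines.hasNext();
            }
            public List<String> next() {
                List<String> chunk = new ArrayList<>(CHUNK_SIZE);
                while (lines.hasNext() && chunk.size() < CHUNK_SIZE) {
                    chunk.add(lines.next());
                }
                return chunk;
            }
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}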

Why do my threads always seem to be idle?

I have the following code:
import redis.clients.jedis.JedisPubSub;
import javax.sql.DataSource;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class MsgSubscriber extends JedisPubSub {
private final PersistenceService service;
private final ExecutorService pool;
public MsgSubscriber(DataSource dataSource) {
pool = Executors.newFixedThreadPool(4);
service = new PersistenceServiceImpl(dataSource);
}
public void onMessage(String channel, String message) {
pool.execute(new Handler(message, service));
}
}
It is subscribed to a Redis channel, which is receiving hundreds of messages a second.
I am processing each of these messages as they come along and saving them to a data store; the handler looks like this:
public class Handler implements Runnable {
private String msg;
private PersistenceService service;
public Handler(String msg, PersistenceService service) {
this.msg = msg;
this.service = service;
}
@Override
public void run() {
service.save(msg);
}
}
Things seem to be working OK - messages are being written to the database - but I have been running Java VisualVM and am seeing graphs like the following:
[VisualVM thread timeline: the pool threads shown in the "Parked" state nearly the whole time]
I'm concerned because the threads seem to be sitting in this "Parked" state and not running, although some logging statements show that the code is being run. I guess my question is: firstly, is there a problem with my code, and secondly, why does VisualVM show the threads as doing nothing?
hundreds of messages a second
Redis can easily handle 10K messages per second on a single thread. With 4 threads, your pool should be well under 1% busy; that is likely too little activity for VisualVM's sampling to detect, so it reports the threads as parked all the time.
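If you want to confirm the pool is actually doing work, one sketch (my illustration, not from the answer) is to cast the fixed pool to ThreadPoolExecutor and log its counters; the completed-task count climbs even while VisualVM shows the threads parked:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Periodically dump the pool's activity counters alongside VisualVM.
public class PoolMonitor {
    public static void main(String[] args) {
        final ThreadPoolExecutor pool =
                (ThreadPoolExecutor) Executors.newFixedThreadPool(4);
        ScheduledExecutorService monitor =
                Executors.newSingleThreadScheduledExecutor();
        monitor.scheduleAtFixedRate(new Runnable() {
            public void run() {
                System.out.printf("active=%d completed=%d queued=%d%n",
                        pool.getActiveCount(),
                        pool.getCompletedTaskCount(),
                        pool.getQueue().size());
            }
        }, 1, 1, TimeUnit.SECONDS);
        // submit Handler tasks to 'pool' here, as MsgSubscriber does
    }
}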

Singleton class to manage running tasks in multithreaded environment in Java

I have a similar situation to that described in this question:
Java email sending queue - fixed number of threads sending as many messages as are available
In that, I have a blocking queue that gets fed commands (ICommandTask extends Callable<Object>), which a thread pool takes tasks from and runs. The blocking queue provides thread synchronization and isolation between the calling threads and the executing thread. Different objects throughout the program can submit ICommandTasks to the command queue, which is why I've made addTask() static.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import com.mypackage.tasks.ICommandTask;
public enum CommandQueue
{
INSTANCE;
private final BlockingQueue<ICommandTask> commandQueue;
private final ExecutorService executor;
private CommandQueue()
{
commandQueue = new LinkedBlockingQueue<ICommandTask>();
executor = Executors.newCachedThreadPool();
}
public static void start()
{
new Thread(INSTANCE.new WaitForProducers()).start();
}
public static void addTask(ICommandTask command)
{
INSTANCE.commandQueue.add(command);
}
private class WaitForProducers implements Runnable
{
@Override
public void run()
{
ICommandTask command;
while(true)
{
try
{
command = INSTANCE.commandQueue.take();
executor.submit(command);
}
catch (InterruptedException e)
{
// logging etc.
}
}
}
}
}
In the main program, during startup, the command queue is started with the following, which initializes the CommandQueue singleton and starts WaitForProducers in a separate thread:
CommandQueue.start();
I wanted to ask whether this way of wiring multiple producers to a single executor via an enum singleton (so that different parts of the program can access it), with a separate thread taking tasks off the queue and submitting them to a thread pool, is a recommended approach, particularly in a heavily multithreaded environment.
So far it seems to be working OK, but I plan on creating similar objects to CommandQueue to handle different types of tasks, each stored in its own queue, e.g. OrderQueue, EventQueue, NegotiationQueue etc. So it needs to be somewhat scalable and thread-safe.
Thanks in advance.

Camel ActiveMQ Performance Tuning

Situation
At present, we use some custom code on top of ActiveMQ libraries for JMS messaging. I have been looking at switching to Camel, for ease of use, ease of maintenance, and reliability.
Problem
With my present configuration, Camel's ActiveMQ implementation is substantially slower than our old implementation, both in terms of delay per message sent and received, and time taken to send and receive a large flood of messages. I've tried tweaking some configuration (e.g. maximum connections), to no avail.
Test Approach
I have two applications, one using our old implementation and one using a Camel implementation. Each application sends JMS messages to a topic on a local ActiveMQ server, and also listens for messages on that topic. This is used to test two scenarios:
- Sending 100,000 messages to the topic in a loop, and seeing how long it takes from the start of sending to the end of handling all of them.
- Sending a message every 100 ms and measuring the delay (in ns) from sending to handling each message.
Question
Can I improve upon the implementation below, in terms of time sent to time processed for both floods of messages, and individual messages? Ideally, improvements would involve tweaking some config that I have missed, or suggesting a better way to do it, and not be too hacky. Explanations of improvements would be appreciated.
Edit: Now that I am sending messages asynchronously, I appear to have a concurrency issue. receivedCount does not reach 100,000. Looking at the ActiveMQ web interface, 100,000 messages are enqueued and 100,000 dequeued, so it's probably a problem on the message processing side. I've altered receivedCount to be an AtomicInteger and added some logging to aid debugging. Could this be a problem with Camel itself (or the ActiveMQ components), or is there something wrong with the message processing code? As far as I can tell, only ~99,876 messages are making it through to floodProcessor.process.
Test Implementation
Edit: Updated with async sending and logging for concurrency issue.
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.camel.component.ActiveMQComponent;
import org.apache.activemq.pool.PooledConnectionFactory;
import org.apache.camel.CamelContext;
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.ProducerTemplate;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.jms.JmsConfiguration;
import org.apache.camel.impl.DefaultCamelContext;
import org.apache.log4j.Logger;
public class CamelJmsTest{
private static final Logger logger = Logger.getLogger(CamelJmsTest.class);
private static final boolean flood = true;
private static final int NUM_MESSAGES = 100000;
private final CamelContext context;
private final ProducerTemplate producerTemplate;
private long timeSent = 0;
private final AtomicInteger sendCount = new AtomicInteger(0);
private final AtomicInteger receivedCount = new AtomicInteger(0);
public CamelJmsTest() throws Exception {
context = new DefaultCamelContext();
ActiveMQConnectionFactory connectionFactory = new ActiveMQConnectionFactory("tcp://localhost:61616");
PooledConnectionFactory pooledConnectionFactory = new PooledConnectionFactory(connectionFactory);
JmsConfiguration jmsConfiguration = new JmsConfiguration(pooledConnectionFactory);
logger.info(jmsConfiguration.isTransacted());
ActiveMQComponent activeMQComponent = ActiveMQComponent.activeMQComponent();
activeMQComponent.setConfiguration(jmsConfiguration);
context.addComponent("activemq", activeMQComponent);
RouteBuilder builder = new RouteBuilder() {
@Override
public void configure() {
Processor floodProcessor = new Processor() {
@Override
public void process(Exchange exchange) throws Exception {
int newCount = receivedCount.incrementAndGet();
//TODO: Why doesn't newCount hit 100,000? Remove this logging once fixed
logger.info(newCount + ":" + exchange.getIn().getBody());
if(newCount == NUM_MESSAGES){
logger.info("all messages received at " + System.currentTimeMillis());
}
}
};
Processor spamProcessor = new Processor() {
@Override
public void process(Exchange exchange) throws Exception {
long delay = System.nanoTime() - timeSent;
logger.info("Message received: " + exchange.getIn().getBody(List.class) + " delay: " + delay);
}
};
from("activemq:topic:test?exchangePattern=InOnly")//.threads(8) // Having 8 threads processing appears to make things marginally worse
.choice()
.when(body().isInstanceOf(List.class)).process(flood ? floodProcessor : spamProcessor)
.otherwise().process(new Processor() {
@Override
public void process(Exchange exchange) throws Exception {
logger.info("Unknown message type received: " + exchange.getIn().getBody());
}
});
}
};
context.addRoutes(builder);
producerTemplate = context.createProducerTemplate();
// For some reason, producerTemplate.asyncSendBody requires an Endpoint to be passed in, so the below is redundant:
// producerTemplate.setDefaultEndpointUri("activemq:topic:test?exchangePattern=InOnly");
}
public void send(){
int newCount = sendCount.incrementAndGet();
producerTemplate.asyncSendBody("activemq:topic:test?exchangePattern=InOnly", Arrays.asList(newCount));
}
public void spam(){
Executors.newSingleThreadScheduledExecutor().scheduleWithFixedDelay(new Runnable() {
@Override
public void run() {
timeSent = System.nanoTime();
send();
}
}, 1000, 100, TimeUnit.MILLISECONDS);
}
public void flood(){
logger.info("starting flood at " + System.currentTimeMillis());
for (int i = 0; i < NUM_MESSAGES; i++) {
send();
}
logger.info("flooded at " + System.currentTimeMillis());
}
public static void main(String... args) throws Exception {
CamelJmsTest camelJmsTest = new CamelJmsTest();
camelJmsTest.context.start();
if(flood){
camelJmsTest.flood();
}else{
camelJmsTest.spam();
}
}
}
It appears from your current JmsConfiguration that you are only consuming messages with a single thread. Was this intended?
If not, you need to set the concurrentConsumers property to something higher. This will create a threadpool of JMS listeners to service your destination.
Example:
JmsConfiguration config = new JmsConfiguration(pooledConnectionFactory);
config.setConcurrentConsumers(10);
This will create 10 JMS listener threads that will process messages concurrently from your queue.
EDIT:
For topics you can do something like this:
JmsConfiguration config = new JmsConfiguration(pooledConnectionFactory);
config.setConcurrentConsumers(1);
config.setMaxConcurrentConsumers(1);
And then in your route:
from("activemq:topic:test?exchangePattern=InOnly").threads(10)
Also, in ActiveMQ you can use a virtual destination. The virtual topic will act like a queue and then you can use the same concurrentConsumers method you would use for a normal queue.
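For example, assuming ActiveMQ's default virtual-destination naming convention (producers publish to VirtualTopic.*, each subscriber consumes from its own Consumer.*.VirtualTopic.* queue), the routes inside a RouteBuilder.configure() might look roughly like this:

// Producer side: publish to the virtual topic.
from("direct:publish").to("activemq:topic:VirtualTopic.test");

// Consumer side: each logical subscriber drains its own backing queue,
// so concurrentConsumers scales it exactly like a plain queue.
from("activemq:queue:Consumer.A.VirtualTopic.test?concurrentConsumers=10")
    .process(floodProcessor);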
Further Edit (For Sending):
You are currently doing a blocking send. You need to use producerTemplate.asyncSendBody() instead.
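Roughly, the difference looks like this (endpoint URI taken from the question; asyncSendBody hands the exchange off and returns a Future immediately instead of blocking):

// Blocking: returns only after the message has been sent.
producerTemplate.sendBody("activemq:topic:test?exchangePattern=InOnly", body);

// Non-blocking: returns a Future right away; the send happens asynchronously.
producerTemplate.asyncSendBody("activemq:topic:test?exchangePattern=InOnly", body);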
Edit
I just built a project with your code and ran it. I set a breakpoint in your floodProcessor method and newCount is reaching 100,000. I think you may be getting thrown off by your logging and the fact that you are sending and receiving asynchronously. On my machine newCount hit 100,000 and the "all messages received" message was logged well under 1 second after execution, but the program continued to log for another 45 seconds afterwards since the logging was buffered. You can see the effect of logging on how close your newCount number is to your body number by reducing the logging. I turned the logging to info, shutting off Camel logging, and the two numbers matched at the end of the log:
INFO CamelJmsTest - 99996:[99996]
INFO CamelJmsTest - 99997:[99997]
INFO CamelJmsTest - 99998:[99998]
INFO CamelJmsTest - 99999:[99999]
INFO CamelJmsTest - 100000:[100000]
INFO CamelJmsTest - all messages received at 1358778578422
I took over from the original poster in looking at this as part of another task, and found the problem with losing messages was actually in the ActiveMQ config.
We had a setting sendFailIfNoSpace=true, which was resulting in messages being dropped if we were sending fast enough to fill the publisher's cache. Playing around with the policyEntry topic cache size, I could vary the number of messages that disappeared, with as much reliability as can be expected of such a race condition. Setting sendFailIfNoSpace=false (the default), I could have any cache size I liked and never fail to receive all messages.
In theory sendFailIfNoSpace should throw a ResourceAllocationException when it drops a message, but that is either not happening(!) or is being ignored somehow. Also interesting is that our custom JMS wrapper code doesn't hit this problem despite running the throughput test faster than Camel. Maybe that code is faster in such a way that it means the publishing cache is being emptied faster, or else we are overriding sendFailIfNoSpace in the connection code somewhere that I haven't found yet.
On the question of speed, we have implemented all the suggestions mentioned here so far except for virtual destinations, but the Camel version test with 100K messages still runs in 16 seconds on my machine compared to 10 seconds for our own wrapper. As mentioned above, I have a sneaking suspicion that we are (implicitly or otherwise) overriding config somewhere in our wrapper, but I doubt it is anything that would cause that big a performance boost within ActiveMQ.
Virtual destinations as mentioned by gwithake might speed up this particular test, but most of the time with our real workloads it is not an appropriate solution.

Deadlock in ThreadPoolExecutor

I've encountered a situation where a thread calling ThreadPoolExecutor.execute(Runnable) is parked while all the pool's threads are waiting in getTask() and the workQueue is empty.
Does anybody have any ideas?
The ThreadPoolExecutor is created with ArrayBlockingQueue, and corePoolSize == maximumPoolSize = 4
[Edit] To be more precise, the thread is blocked inside ThreadPoolExecutor.execute(Runnable command). It has the task to execute, but doesn't do it.
[Edit2] The executor is blocked somewhere inside the working queue (ArrayBlockingQueue).
[Edit3] The callstack:
thread = front_end(224)
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1114)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
at java.util.concurrent.ArrayBlockingQueue.offer(ArrayBlockingQueue.java:224)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:653)
at net.listenThread.WorkersPool.execute(WorkersPool.java:45)
At the same time, the workQueue is empty (checked using remote debugging).
[Edit4] Code working with ThreadPoolExecutor:
public class WorkersPool {
private final ThreadPoolExecutor pool;
public WorkersPool(int size) {
pool = new ThreadPoolExecutor(size, size, IDLE_WORKER_THREAD_TIMEOUT, TimeUnit.SECONDS, new ArrayBlockingQueue<Runnable>(WORK_QUEUE_CAPACITY),
new ThreadFactory() {
@NotNull
private final AtomicInteger threadsCount = new AtomicInteger(0);
@NotNull
public Thread newThread(@NotNull Runnable r) {
final Thread thread = new Thread(r);
thread.setName("net_worker_" + threadsCount.incrementAndGet());
return thread;
}
},
new RejectedExecutionHandler() {
public void rejectedExecution(@Nullable Runnable r, @Nullable ThreadPoolExecutor executor) {
Verify.warning("new task " + r + " is discarded");
}
});
}
public void execute(@NotNull Runnable task) {
pool.execute(task);
}
public void stopWorkers() throws WorkersTerminationFailedException {
pool.shutdownNow();
try {
pool.awaitTermination(THREAD_TERMINATION_WAIT_TIME, TimeUnit.SECONDS);
} catch (InterruptedException e) {
throw new WorkersTerminationFailedException("Workers-pool termination failed", e);
}
}
}
It sounds like a bug in JVMs older than 6u21. There was an issue in the compiled native code for some (maybe all) OSes.
From the link:
The bug is caused by missing memory barriers in various Parker::park()
paths that can result in lost wakeups and hangs. (Note that
PlatformEvent::park used by built-in synchronization is not vulnerable
to the issue). -XX:+UseMembar constitues a work-around because the
membar barrier in the state transition logic hides the problem in
Parker::. (that is, there's nothing wrong with the use -UseMembar
mechanism, but +UseMembar hides the bug Parker::). This is a day-one
bug introduced with the addition of java.util.concurrent in JDK 5.0.
I developed a simple C mode of the failure and it seems more likely to
manifest on modern AMD and Nehalem platforms, likely because of deeper
store buffers that take longer to drain. I provided a tentative fix
to Doug Lea for Parker::park which appears to eliminate the bug. I'll
be delivering this fix to runtime. (I'll also augment the CR with
additional test cases and a longer explanation). This is likely a
good candidate for back-ports.
Link: JVM Bug
Workarounds are available, but you would probably be best off just getting the most recent copy of Java.
I don't see any locking in the code of ThreadPoolExecutor's execute(Runnable). The only variable there is the workQueue. What sort of BlockingQueue did you provide to your ThreadPoolExecutor?
On the topic of deadlocks:
You can confirm this is a deadlock by examining the Full Thread Dump, as provided by <ctrl><break> on Windows or kill -QUIT on UNIX systems.
Once you have that data, you can examine the threads. Here is a pertinent excerpt from Sun's article on examining thread dumps (suggested reading):
For hanging, deadlocked or frozen programs: If you think your program is hanging, generate a stack trace and examine the threads in states MW or CW. If the program is deadlocked then some of the system threads will probably show up as the current threads, because there is nothing else for the JVM to do.
On a lighter note: if you are running in an IDE, make sure there are no breakpoints enabled in these methods.
This deadlock is probably because you run tasks from within the executor itself. For example, you submit one task, and that one fires another 4 tasks. If your pool size equals 4, then you completely fill it: the forked tasks wait for a free thread, while the first task waits for all the forked tasks to complete.
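A minimal sketch of that starvation scenario (my illustration, not the poster's code): every worker holds a parent task that blocks on a child submitted to the same pool, so no thread is ever free to run a child.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolStarvationDemo {
    public static void main(String[] args) {
        final ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    // Child lands in the queue; all workers are busy parents.
                    Future<?> child = pool.submit(new Runnable() {
                        public void run() { /* never reached */ }
                    });
                    try {
                        child.get(); // parent blocks forever waiting for child
                    } catch (Exception ignored) {
                    }
                }
            });
        }
        System.out.println("submitted; the pool is now starved");
    }
}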
As someone already mentioned, this sounds like normal behaviour: the ThreadPoolExecutor is just waiting to do some work. If you want to stop it, you need to call:
executor.shutdown()
to get it to terminate, usually followed by a call to executor.awaitTermination.
The library source code is below (it's actually a class from http://spymemcached.googlecode.com/files/memcached-2.4.2-sources.zip).
It's a bit complicated - it adds protection against repeated runs of the FutureTask, if I'm not mistaken - but it doesn't seem deadlock-prone; it's a very simple thread-pool usage:
package net.spy.memcached.transcoders;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import net.spy.memcached.CachedData;
import net.spy.memcached.compat.SpyObject;
/**
* Asynchronous transcoder.
*/
public class TranscodeService extends SpyObject {
private final ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 10, 60L,
TimeUnit.MILLISECONDS, new ArrayBlockingQueue<Runnable>(100),
new ThreadPoolExecutor.DiscardPolicy());
/**
* Perform a decode.
*/
public <T> Future<T> decode(final Transcoder<T> tc,
final CachedData cachedData) {
assert !pool.isShutdown() : "Pool has already shut down.";
TranscodeService.Task<T> task = new TranscodeService.Task<T>(
new Callable<T>() {
public T call() {
return tc.decode(cachedData);
}
});
if (tc.asyncDecode(cachedData)) {
this.pool.execute(task);
}
return task;
}
/**
* Shut down the pool.
*/
public void shutdown() {
pool.shutdown();
}
/**
* Ask whether this service has been shut down.
*/
public boolean isShutdown() {
return pool.isShutdown();
}
private static class Task<T> extends FutureTask<T> {
private final AtomicBoolean isRunning = new AtomicBoolean(false);
public Task(Callable<T> callable) {
super(callable);
}
@Override
public T get() throws InterruptedException, ExecutionException {
this.run();
return super.get();
}
@Override
public T get(long timeout, TimeUnit unit) throws InterruptedException,
ExecutionException, TimeoutException {
this.run();
return super.get(timeout, unit);
}
@Override
public void run() {
if (this.isRunning.compareAndSet(false, true)) {
super.run();
}
}
}
}
Definitely strange.
But before writing your own TPE, try:
- another BlockingQueue implementation, e.g. LinkedBlockingQueue
- specifying fairness=true in the ArrayBlockingQueue, i.e. use new ArrayBlockingQueue(n, true)
Of those two options I would choose the second, because it's very strange that offer() is blocked; one reason that comes to mind is the thread scheduling policy on your Linux box. Just an assumption.
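A sketch of the second option (the capacity of 1024 is illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Same shape as WorkersPool's executor, but the fair ArrayBlockingQueue
// grants its internal lock in FIFO order, so a thread parked in offer()
// cannot be starved indefinitely by the pool workers.
ThreadPoolExecutor pool = new ThreadPoolExecutor(
        4, 4, 60L, TimeUnit.SECONDS,
        new ArrayBlockingQueue<Runnable>(1024, true)); // fairness = true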
