Continuous parsing and processing of text

Continuous parsing and processing of text - java

I have a class that's a listener to a log server. The listener gets notified whenever a log/text is spewed out. I store this text in an ArrayList.
I need to process this text (remove duplicate words, store it in a trie, compare it against some patterns, etc.).
My question is: should I be doing this as and when the listener is notified, or should I be creating a separate thread that handles the processing?
What is the best way to handle this situation?

Sounds like you're trying to solve the Producer Consumer Problem, in which case - Yes, you should be looking at threads.
If, however, you only need to do very basic operations that take less than a millisecond per entry, don't overcomplicate things. If you use a TreeSet in conjunction with an ArrayList, the set will automatically take care of keeping duplicates out. Simple, quick operations such as validating a log entry aren't such a big deal that they need a separate thread, unless new text is coming in at such a rapid rate that you need a thread to busy itself full time with processing new notifications.
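
A minimal sketch of the TreeSet-plus-ArrayList idea, assuming entries are split on whitespace and that the listener is only ever called from one thread (neither collection is thread-safe). The class name DedupingLog and its methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical listener-side helper: keeps first-seen order in a list,
// and uses a TreeSet only to detect duplicates.
class DedupingLog {
    private final Set<String> seen = new TreeSet<>();
    private final List<String> words = new ArrayList<>();

    // Called directly from the listener callback.
    void onLogEntry(String entry) {
        for (String word : entry.trim().split("\\s+")) {
            // Set.add returns false when the word is already present.
            if (!word.isEmpty() && seen.add(word)) {
                words.add(word);
            }
        }
    }

    List<String> words() {
        return words;
    }
}
```

If arrival order does not matter, the TreeSet alone would be enough.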

I always run processes that are not related to the UI in a separate thread so that they do not hang the app screen. So, from my point of view, you should go with a separate thread.

Such a situation can be solved using Queues. The simplest solution would be to have an unbounded blocking queue (a LinkedTransferQueue is tailored for such a case) and a limited size pool of worker threads.
You would add()/offer() the log entry from the listener's thread and take() for processing with worker threads. take() will block a thread if no log entries are available for processing.
P.S. A LinkedTransferQueue is designed for concurrent use, so no external synchronization is necessary; its iterators are weakly consistent, just like those of the other concurrent collections in java.util.concurrent.
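
The queue-plus-workers arrangement could look roughly like the sketch below. The class name LogPipeline, the "collect unique words" processing step, and the stop marker are assumptions made for illustration; only LinkedTransferQueue, offer() and take() come from the answer above.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.TimeUnit;

class LogPipeline {
    // Hypothetical stop marker; compared by identity, so no real entry can match it.
    private static final String POISON = new String("STOP");

    private final BlockingQueue<String> queue = new LinkedTransferQueue<>();
    private final Set<String> uniqueWords = new ConcurrentSkipListSet<>();
    private final ExecutorService workers;
    private final int workerCount;

    LogPipeline(int workerCount) {
        this.workerCount = workerCount;
        this.workers = Executors.newFixedThreadPool(workerCount);
        for (int i = 0; i < workerCount; i++) {
            workers.execute(this::drain);
        }
    }

    // Called on the listener's thread: enqueue and return immediately.
    void onLogEntry(String entry) {
        queue.offer(entry); // the queue is unbounded, so this never blocks
    }

    private void drain() {
        try {
            while (true) {
                String entry = queue.take(); // blocks while the queue is empty
                if (entry == POISON) {
                    return;
                }
                for (String word : entry.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        uniqueWords.add(word);
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Lets the workers finish everything queued so far, then returns the result.
    Set<String> shutdownAndGetWords() {
        for (int i = 0; i < workerCount; i++) {
            queue.offer(POISON); // one marker per worker
        }
        workers.shutdown();
        try {
            workers.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return uniqueWords;
    }
}
```

The listener only pays the cost of an enqueue; parsing and pattern matching happen on the pool threads.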

Related

Can we only use a BlockingQueue for a thread pool task queue, or can other data structures be used?

Hi, I am a newbie in concurrent programming with Java. In all the examples of concurrent programming I have seen, whenever a task queue is defined, people use some implementation of BlockingQueue.
Why only BlockingQueue? What are the advantages and disadvantages?
Why not any other data structure?

OK, I can't address exactly why the unspecified code you looked at uses certain data structures and not others. But blocking queues have nice properties. Holding only a fixed number of elements and forcing producers who would insert items over that limit to wait is actually a feature.
Limiting the queue size helps keep the application safe from a badly behaved producer, which otherwise could fill the queue with entries until the application ran out of memory. Obviously it's faster to insert a task into the task queue than it is to execute it, so an executor is at risk of getting bombarded with work.
Making the producer wait also applies back pressure to the system. That way the queue lets the producer know that the consumers are falling behind and that no more work is being accepted. It's better for the producer to wait than to keep hammering the queue; back pressure lets the system degrade gracefully.
So you have a data structure that is easy to understand, has practical benefits for building applications and seems like a natural fit for a task queue. Of course people are going to use it.
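
To make the back-pressure point concrete, here is a small sketch using a bounded ArrayBlockingQueue with no consumer attached. The class and method names are made up; the timed offer() is used instead of put() only so that the example terminates (put() would block until space becomes available).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class BackPressureDemo {
    // Tries to enqueue `attempts` tasks into a bounded queue that nobody drains
    // and returns how many were accepted.
    static int acceptedWithoutConsumer(int capacity, int attempts) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        int accepted = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                // Waits up to 10 ms for free space, then gives up and returns false.
                if (queue.offer(i, 10, TimeUnit.MILLISECONDS)) {
                    accepted++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return accepted;
    }
}
```

Once the queue is full, the producer is held up on each further attempt instead of growing the heap, which is the throttling behavior described above.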

How to integrate LMAX within a real financial application

I am also thinking of integrating the disruptor pattern in our application. I am a bit unsure about a few things before I start using the disruptor.
I have 3 producers: a FIX thread which de-serialises the requests, another thread which continuously modifies the order price as the market moves, and one more thread which is responsible for de-serialising the requests sent from a GUI application. All three threads currently write to a BlockingQueue (hence we see a lot of contention on the queue).
The disruptor talks about a single writer principle, and from what I have read that approach scales the best. Is there any way we could make the above three threads obey the single writer principle?
Also, in a typical request/response application, and especially in our case, we have contention on an in-memory cache, as we need to lock the cache when we update it with the response, whilst a request might be happening for the same order. How do we handle this through the disruptor, i.e. how do I tie up a response to a particular request? Can I eliminate the lock on the cache, and if yes, how?
Any suggestions/pointers would be highly appreciated. We are currently using Java 1.6.

I'm new to the Disruptor and am trying to understand as many use cases as possible. I have tried to answer your questions.
Yes, the Disruptor can be used to sequence calls from multiple producers. I understand that all three threads try to update the state of a shared object, and that a single consumer takes the necessary action on the shared object. Internally, you can have the single consumer delegate calls to the appropriate single-threaded handler based on responsibility.
The Disruptor does exactly this. It sequences the calls such that the state is accessed by only one thread at a time. If there's a specific order in which the event handlers are to be invoked, set up the memory barrier. The latest version of the Disruptor has a DSL that lets you set up the order easily.
The cache can be abstracted and accessed through the Disruptor. At any given time, only a reader or a writer would get access to the cache, since all calls to the cache are sequential.
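
The idea of funnelling every cache access through one thread can be illustrated without the Disruptor library at all. The sketch below uses a plain single-thread executor from java.util.concurrent as a stand-in for the sequencing the answer describes; it is not Disruptor API, and the SequencedCache class with its string keys and values is invented for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SequencedCache {
    // Plain HashMap, no lock: only the owner thread ever touches it.
    private final Map<String, String> cache = new HashMap<>();
    private final ExecutorService owner = Executors.newSingleThreadExecutor();

    // Any producer thread may call this; the write itself runs on the owner thread.
    void update(String orderId, String state) {
        owner.execute(() -> cache.put(orderId, state));
    }

    // Reads are sequenced behind earlier updates on the same owner thread.
    String read(String orderId) {
        Callable<String> lookup = () -> cache.get(orderId);
        try {
            return owner.submit(lookup).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    void close() {
        owner.shutdown();
    }
}
```

The executor's internal queue is still a contended structure, so this shows only the single-writer ownership of the cache, not the Disruptor's ring buffer or its performance characteristics.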

Java concurrency - Should block or yield?

I have multiple threads each one with its own private concurrent queue and all they do is run an infinite loop retrieving messages from it. It could happen that one of the queues doesn't receive messages for a period of time (maybe a couple seconds), and also they could come in big bursts and fast processing is necessary.
I would like to know what would be the most appropriate to do in the first case: use a blocking queue and block the thread until I have more input or do a Thread.yield()?
I want to have as much CPU resources available as possible at a given time, as the number of concurrent threads may increase with time, but I also don't want the message processing to fall behind, as there is no guarantee of when the thread will be rescheduled for execution after a yield(). I know that hardware, operating system and other factors play an important role here, but setting that aside and looking at it from a Java (JVM?) point of view, what would be optimal?

Always just block on the queues. Java yields in the queues internally.
In other words: You cannot get any performance benefit in the other threads if you yield in one of them rather than just block.

You certainly want to use a blocking queue - they are designed for exactly this purpose (you want your threads to not use CPU time when there is no work to do).
Thread.yield() is an extremely temperamental beast - the scheduler plays a large role in exactly what it does; and one simple but valid implementation is to simply do nothing.

Alternatively, consider converting your implementation to use one of the managed ExecutorService implementations - probably ThreadPoolExecutor.
This may not be appropriate for your use case, but if it is, it removes the whole burden of worrying about thread management from your own code - and these questions about yielding or not simply vanish.
In addition, if better thread management algorithms emerge in future - for example, something akin to Apple's Grand Central Dispatch - you may be able to convert your application to use it with almost no effort.

Another thing that you could do is use a ConcurrentHashMap for your queue. When you do a read it gives you a reference to the object you were looking for, so it is possible you may miss a message that was just put into the queue. But if all this is doing is listening for a message, you will catch it on the next iteration. It would be different if the messages could be updated by other threads. But there doesn't really seem to be a reason to block that I can see.

Multiple SingleThreadExecutors for a given application...a good idea?

This question is about the fallout of using SingleThreadExecutor (JDK 1.6). Related questions have been asked and answered in this forum before, but I believe the situation I am facing is a bit different.
Various components of the application (let's call the components C1, C2, C3 etc.) generate (outbound) messages, mostly in response to messages (inbound) that they receive from other components. These outbound messages are kept in queues which are usually ArrayBlockingQueue instances - fairly standard practice perhaps. However, the outbound messages must be processed in the order they are added. I guess use of a SingleThreadExecutor is the obvious answer here. We end up having a 1:1 situation - one SingleThreadExecutor for one queue (which is dedicated to messages emanating from one component).
Now, the number of components (C1,C2,C3...) is unknown at a given moment. They will come into existence depending on the need of the users (and will be eventually disposed of too). We are talking about 200-300 such components at the peak load. Following the 1:1 design principle stated above, we are going to arrange for 200 SingleThreadExecutors. This is the source of my query here.
I am uncomfortable with the thought of having to create so many SingleThreadExecutors. I would rather try and use a pool of SingleThreadExecutors, if that makes sense and is plausible (any ready-made, seen-before classes/patterns?). I have read many posts on recommended use of SingleThreadExecutor here, but what about a pool of the same?
What do learned women and men here think? I would like to be directed, corrected or simply, admonished :-).

If your requirement is that the messages be processed in the order that they're posted, then you want one and only one SingleThreadExecutor. If you have multiple executors, then messages will be processed out-of-order across the set of executors.
If messages need only be processed in the order that they're received for a single producer, then it makes sense to have one executor per producer. If you try pooling executors, then you're going to have to put a lot of work into ensuring affinity between producer and executor.
Since you indicate that your producers will have defined lifetimes, one thing that you have to ensure is that you properly shut down your executors when they're done.
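
A rough sketch of the one-executor-per-producer arrangement, including the shutdown step. The PerProducerExecutors class and the use of string ids for components are assumptions for illustration; it also assumes nothing posts for a component after it has been disposed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PerProducerExecutors {
    private final Map<String, ExecutorService> executors = new ConcurrentHashMap<>();

    // Messages posted for the same producer run one at a time, in posting order.
    void post(String producerId, Runnable message) {
        executors
            .computeIfAbsent(producerId, id -> Executors.newSingleThreadExecutor())
            .execute(message);
    }

    // Call when the component is disposed of. Already-queued messages still run.
    // Returns false if they did not finish within the timeout.
    boolean dispose(String producerId, long timeoutMillis) {
        ExecutorService executor = executors.remove(producerId);
        if (executor == null) {
            return true;
        }
        executor.shutdown();
        try {
            return executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Without the dispose step, each short-lived component would leave an idle thread behind.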

Messaging and batch jobs are problems that have been solved time and time again. I suggest not attempting to solve them again. Instead, look into Quartz, which maintains thread pools, persists tasks in a database, etc. Or, maybe even better, look into JMS/ActiveMQ. But at the very least look into Quartz, if you have not already. Oh, and Spring makes working with Quartz so much easier...

I don't see any problem there. Essentially you have independent queues and each has to be drained sequentially, so one thread for each is a natural design. Anything else you can come up with is essentially the same. As an example, when Java NIO first came out, frameworks were written trying to take advantage of it and get away from the thread-per-request model. In the end some authors admitted that, to provide a good programming model, they were just reimplementing threading all over again.

It's impossible to say whether 300 or even 3000 threads will cause any issues without knowing more about your application. I strongly recommend that you profile your application before adding more complexity.
The first thing you should check is that the number of concurrently running threads is not much higher than the number of cores available to run them. The more active threads you have, the more time is wasted managing those threads (context switches are expensive) and the less work gets done.
The easiest way to limit the number of running threads is to use a semaphore. Acquire the semaphore before starting work and release it after the work is done.
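
The semaphore approach might be sketched as follows. ThrottledRunner is a made-up name, and the permit count would typically be chosen near the number of available cores.

```java
import java.util.concurrent.Semaphore;

class ThrottledRunner {
    private final Semaphore permits;

    ThrottledRunner(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Called from any worker thread; at most maxConcurrent callers run work at once.
    void run(Runnable work) {
        permits.acquireUninterruptibly(); // wait for a free permit
        try {
            work.run();
        } finally {
            permits.release(); // always hand the permit back, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Threads waiting for a permit are parked rather than runnable, so they do not add to the context-switching load.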
Unfortunately, limiting the number of running threads may not be enough. While it may help, the overhead may still be too great if the time spent per context switch is a major part of the total cost of one unit of work. In this scenario, often the most efficient way is to have a fixed number of queues. Each component gets a queue from a global pool of queues when it initializes, using an algorithm such as round-robin for queue selection.
If you are in one of those unfortunate cases where the most obvious solutions do not work, I would start with something relatively simple: one thread pool, one concurrent queue, a lock, a list of queues, and a temporary queue for each thread in the pool.
Posting work to the queue is simple: add the payload and the identity of the producer.
Processing is relatively straightforward as well. First you get the next item from the queue. Then you acquire the lock. While you hold the lock, you check whether any other thread is running a task for the same producer. If not, you register the current thread by adding a temporary queue to the list of queues. Otherwise you add the task to the existing temporary queue. Finally you release the lock. Now you either run the task or poll for the next item and start over, depending on whether the current thread was registered to run tasks. After running the task, you acquire the lock again and see whether there is more work to be done in the temporary queue. If not, remove the queue from the list. Otherwise get the next task. Finally you release the lock. Again, you choose whether to run the task or to start over.
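
The following is a simplified variant of that scheme, not a literal transcription: the "is another thread already running this producer?" check is done under the lock at posting time rather than after dequeuing, and the pool's own work queue plays the role of the shared concurrent queue. The name KeyedSerialExecutor and the string producer ids are assumptions.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ExecutorService;

class KeyedSerialExecutor {
    private final ExecutorService pool;
    private final Object lock = new Object();
    // One temporary queue per producer that currently has a task in flight.
    private final Map<String, Queue<Runnable>> backlog = new HashMap<>();

    KeyedSerialExecutor(ExecutorService pool) {
        this.pool = pool;
    }

    void post(String producerId, Runnable task) {
        synchronized (lock) {
            Queue<Runnable> pending = backlog.get(producerId);
            if (pending != null) {
                pending.add(task); // a pool thread already owns this producer
                return;
            }
            backlog.put(producerId, new ArrayDeque<>()); // register the producer as active
        }
        pool.execute(() -> runChain(producerId, task));
    }

    // Runs the first task, then keeps draining this producer's temporary queue.
    private void runChain(String producerId, Runnable first) {
        Runnable next = first;
        while (next != null) {
            try {
                next.run();
            } catch (RuntimeException e) {
                e.printStackTrace(); // keep the chain alive if one task fails
            }
            synchronized (lock) {
                next = backlog.get(producerId).poll();
                if (next == null) {
                    backlog.remove(producerId); // nothing left: unregister
                }
            }
        }
    }
}
```

Tasks from one producer run in posting order on whichever pool thread owns that producer at the moment, while different producers proceed in parallel. A very busy producer can occupy a pool thread for a long time, so a real implementation might cap how many tasks one chain runs before yielding the thread.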

What are the advantages of Blocking Queue in Java?

I am working on a project that uses a queue that keeps information about the messages that need to be sent to remote hosts. In that case one thread is responsible for putting information into the queue and another thread is responsible for getting information from the queue and sending it. The 2nd thread needs to check the queue for the information periodically.
But later I found this is reinvention of the wheel :) I could use a blocking queue for this purpose.
What are the other advantages of using a blocking queue for the above application? (E.g. performance, maintainability of the code, any special tricks, etc.)

The main advantage is that a BlockingQueue provides a correct, thread-safe implementation. Developers have implemented this feature themselves for years, but it is tricky to get right. Now the runtime has an implementation developed, reviewed, and maintained by concurrency experts.
The "blocking" nature of the queue has a couple of advantages. First, on adding elements, if the queue capacity is limited, memory consumption is limited as well. Also, if the queue consumers get too far behind producers, the producers are naturally throttled since they have to wait to add elements. When taking elements from the queue, the main advantage is simplicity; waiting forever is trivial, and correctly waiting for a specified time-out is only a little more complicated.
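
As a small illustration of the two ways of taking elements: take() waits forever, while poll(timeout, unit) returns null when the time-out expires. The helper below is a hypothetical wrapper around the timed variant.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class TimedTake {
    // Waits up to timeoutMillis for a message; returns fallback if none arrives in time.
    static String nextOrDefault(BlockingQueue<String> queue, long timeoutMillis, String fallback) {
        try {
            String message = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            return message != null ? message : fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return fallback;
        }
    }
}
```

The sending thread can use the time-out to do periodic housekeeping without busy-waiting on an empty queue.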

The key thing you eliminate with the blocking queue is 'polling'. This is where you say:
"The 2nd thread needs to check the queue for the information periodically."
This can be very inefficient - using much unnecessary CPU time. It can also introduce unneeded latencies.
