I'm working on an ultra-low-latency, high-performance application.
The core is single-threaded, so I don't need to worry about concurrency there.
I'm developing a scheduled log function which logs messages periodically, to prevent the same message from flooding the log.
The log class contains a ConcurrentHashMap: one thread updates it (putting new keys or updating existing values), and another thread periodically loops through the map to log all the messages.
My concern is that logging while looping through the map may take time. Would it block the thread trying to update the map? Any blocking is unacceptable, since our application core is single-threaded.
Also, is there any data structure other than ConcurrentHashMap I could use to reduce the memory footprint?
Is there a thread-safe way to iterate over the map without blocking, in the read-only case? Even if the iterated data is stale, that is acceptable.
According to the Java API docs for ConcurrentHashMap:
[...] even though all operations are thread-safe, retrieval operations do not entail locking [...]
Moreover, the entrySet() method documentation tells us that:
The [returned] set is backed by the map, so changes to the map are reflected in the set, and vice-versa.
That means the map can be modified while an iteration over it is in progress; iteration does not lock or block the whole map. The iterators are weakly consistent: they may or may not reflect concurrent updates, but they never throw ConcurrentModificationException, so the stale reads you describe are exactly the behaviour you get.
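As a minimal sketch of that behaviour (the class and method names below are illustrative, not from your application): the single-threaded core updates the map without ever being blocked by the logger, and the logging thread iterates and drains it on its own schedule.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ScheduledLogSketch {
    private final ConcurrentHashMap<String, Long> pending = new ConcurrentHashMap<>();

    // Called from the single-threaded core: never blocks on the logger.
    void record(String message) {
        pending.merge(message, 1L, Long::sum); // count duplicates instead of re-logging
    }

    // Called periodically from the logging thread. Iteration is weakly
    // consistent: it may miss entries added mid-loop, which is acceptable here.
    void flush() {
        for (Map.Entry<String, Long> e : pending.entrySet()) {
            System.out.println(e.getKey() + " x" + e.getValue());
            // Safe: the iterator never throws ConcurrentModificationException.
            // A merge() racing between the read and this remove can lose one
            // count, which is acceptable for logging.
            pending.remove(e.getKey());
        }
    }
}
```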
There may be other structures that would allow you to reduce the memory footprint, reduce latency, and perhaps offer a more uniform & consistent performance profile.
One of these is a worker pattern where your main worker publishes to a non-blocking queue and the logger pulls from that queue. This decouples the two processes, allows for multiple concurrent publishers, and lets you scale out loggers.
One possible data structure for this is ConcurrentLinkedQueue. I have very little experience with Java, so I'm not sure how its performance profile differs from ConcurrentHashMap's, but this is a very common pattern in distributed systems and in Go.
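Here is a rough sketch of that worker pattern, assuming a ConcurrentLinkedQueue of plain strings (all names are placeholders): offer() on the hot path is lock-free and never blocks, and the logger drains on its own schedule.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class QueueLoggerSketch {
    private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();

    // Hot path: enqueue (CAS-based, lock-free) and return immediately.
    void publish(String message) {
        queue.offer(message);
    }

    // Logger thread: drain whatever is available, then sleep/park.
    void drainOnce() {
        String msg;
        while ((msg = queue.poll()) != null) {
            System.out.println(msg);
        }
    }
}
```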
Problem in Detail
Will it cause any issues to use a data structure such as ArrayBlockingQueue<ArrayList<MyClass>> with multiple threads?
Background
At a high level, I have one producer which produces a giant list. In order to speed up the processing, I decided to use multiple consumers (threads) to consume the giant list produced by the producer.
My Proposed Solution
I will convert the giant list into multiple relatively small lists, and in order to ensure thread safety, I will enqueue these smaller lists into a concurrent data structure. In the multithreaded scenario, each thread just polls the concurrent queue to get one list and works on it.
Problem Statement
In a multithreaded scenario, I understand we have to use concurrent data structures to avoid thread interference and to establish happens-before relationships.
But is it safe to use a non-thread-safe data structure as an element of a thread-safe data structure?
Will it cause any issues to use a data structure such as ArrayBlockingQueue<ArrayList<MyClass>> with multiple threads?
Will there be any impact on performance?
There shouldn't be an obvious problem with this approach.
Is it safe to use a non-thread-safe data structure as an element of a thread-safe data structure?
This is safe as long as you properly coordinate (or avoid) concurrent access to the non-thread-safe inner data structure. The ArrayBlockingQueue ensures that a happens-before relation is established when you access its elements via one of the peek, poll, or related methods.
Will it cause any issues to use a data structure such as ArrayBlockingQueue<ArrayList<MyClass>> with multiple threads?
No, this is what BlockingQueue is intended for, as long as you coordinate access to the inner lists (see above).
Will there be any impact on performance?
In general, the approach where the single producer partitions the list into sub-lists might not be optimal. The producer does not (and should not) know the number of consumers and their bandwidth, and thus in general does not know what partition sizes work well. A better approach might be to use an ArrayBlockingQueue<MyClass> and, from the consumer side, always consume multiple elements in one go by calling drainTo with a suitable maxElements.
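A sketch of that consumer-side batching, assuming MyClass elements and an arbitrary batch size of 64 (both placeholders to tune for the real workload):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DrainToConsumerSketch {
    static class MyClass { /* payload fields */ }

    private final BlockingQueue<MyClass> queue = new ArrayBlockingQueue<>(10_000);

    // Consumer loop: block for one element, then grab a batch without blocking.
    void consume() throws InterruptedException {
        List<MyClass> batch = new ArrayList<>(64);
        while (true) {
            batch.add(queue.take());  // wait for at least one element
            queue.drainTo(batch, 63); // take up to 63 more, non-blocking
            for (MyClass item : batch) {
                process(item);
            }
            batch.clear();
        }
    }

    private void process(MyClass item) { /* work on one element */ }
}
```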
Thanks for the answers from michid and Thilo.
Final Resolution
I ended up using LinkedBlockingQueue<List<MyObjClass>> and having multiple child threads poll from the queue. Each child thread takes a list of MyObjClass to work on.
This resolution does not slow down performance.
For why I chose LinkedBlockingQueue over ArrayBlockingQueue, see Link
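For illustration, a minimal sketch of that final setup (MyObjClass and the worker count are placeholders):

```java
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

public class FinalResolutionSketch {
    static class MyObjClass { /* payload */ }

    public static void main(String[] args) {
        LinkedBlockingQueue<List<MyObjClass>> queue = new LinkedBlockingQueue<>();
        int workers = Runtime.getRuntime().availableProcessors();
        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    while (true) {
                        // Blocks until a sub-list arrives; this thread then owns it.
                        List<MyObjClass> chunk = queue.take();
                        for (MyObjClass obj : chunk) {
                            // work on obj; no other thread touches this list
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
        // Producer side: queue.put(subList) for each partition of the giant list.
    }
}
```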
In a multithreaded context I want to use a map which will be updated. Which Map will be better considering performance: 1. HashMap or 2. ConcurrentHashMap? Also, will it perform slowly if I make it volatile?
It is going to be used in a Java batch for approx. 20 million records.
Currently I am not sharing this map among threads.
Will sharing the map among threads reduce performance?
HashMap will be better performance-wise, as it is not synchronized in any way. ConcurrentHashMap adds overhead to manage concurrent read and - especially - concurrent write access.
That being said, in a multithreaded environment, you are responsible for synchronizing access to HashMap as needed, which will cost performance, too.
Therefore, I would go for HashMap only if the use case allows for very specific optimization of the synchronization logic. Otherwise, ConcurrentHashMap will save you a lot of time working out the synchronization.
However, please note that even with ConcurrentHashMap you will need to carefully consider what level of synchronization you need. ConcurrentHashMap is thread-safe, but not fully synchronized. For instance, if you absolutely need to synchronize each read access with each write access, you will still need custom logic, since for a read operation ConcurrentHashMap will provide the state after the last successfully finished write operation. That is, there might still be an ongoing write operation which will not be seen by the read.
As for volatile, this only ensures that writes to that particular field (the reference) are made visible across threads; it does nothing to make the map's contents thread-safe. Since you will likely not change the reference to the HashMap / ConcurrentHashMap, but work on the instance, the performance overhead will be negligible.
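To make the trade-off concrete, a small sketch of both options, using a hypothetical record-counting job as a stand-in for the real batch: the HashMap version is only safe while a single thread owns the map, while ConcurrentHashMap's merge() is atomic per key and can be shared without external locking.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MapChoiceSketch {

    // Option 1: plain HashMap, safe only while one thread owns it.
    static Map<String, Integer> singleThreadedCount(Iterable<String> records) {
        Map<String, Integer> counts = new HashMap<>();
        for (String r : records) {
            counts.merge(r, 1, Integer::sum);
        }
        return counts;
    }

    // Option 2: ConcurrentHashMap, safe to share among threads.
    // merge() is atomic per key, so no external synchronization is needed.
    static void countShared(ConcurrentHashMap<String, Integer> counts, String record) {
        counts.merge(record, 1, Integer::sum);
    }
}
```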
I have a class that's a listener to a log server. The listener gets notified whenever a log/text is spewed out. I store this text in an ArrayList.
I need to process this text (remove duplicate words, store it in a trie, compare it against some patterns etc).
My question is: should I be doing this as and when the listener is notified? Or should I create a separate thread that handles the processing?
What is the best way to handle this situation?
Sounds like you're trying to solve the Producer Consumer Problem, in which case - Yes, you should be looking at threads.
If, however, you only need to do very basic operations that take less than milliseconds per entry, don't overly complicate things. If you use a TreeSet in conjunction with your ArrayList, it will automatically take care of keeping duplicates out. Simple atomic operations such as validating the log entry aren't such a big deal that they need a separate thread, unless new text is coming in at such a rapid rate that you need a thread busying itself full time with processing new notifications.
I always run processes that are not related to the UI in a separate thread, so they won't hang your app's screen. So from my point of view, you should go with a separate thread.
Such a situation can be solved using Queues. The simplest solution would be to have an unbounded blocking queue (a LinkedTransferQueue is tailored for such a case) and a limited size pool of worker threads.
You would add()/offer() the log entry from the listener's thread and take() for processing with worker threads. take() will block a thread if no log entries are available for processing.
P.S. A LinkedTransferQueue is designed for concurrent usage; no external synchronization is necessary. It's based on weakly consistent iterators, just like the rest of the concurrent collections family.
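A sketch of that listener/worker split, assuming string log entries and a pool of four workers (both placeholders):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedTransferQueue;

public class LogProcessingSketch {
    private static final int WORKERS = 4;
    private final LinkedTransferQueue<String> entries = new LinkedTransferQueue<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(WORKERS);

    // Called from the log server's listener thread; never blocks.
    public void onLogEntry(String text) {
        entries.offer(text);
    }

    public void start() {
        for (int i = 0; i < WORKERS; i++) {
            pool.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        String entry = entries.take(); // blocks when queue is empty
                        process(entry);                // dedupe, trie insert, pattern match
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    private void process(String entry) { /* application-specific work */ }
}
```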
I am also thinking of integrating the Disruptor pattern in our application, but I am a bit unsure about a few things before I start using the Disruptor.
I have 3 producers: a FIX thread which de-serialises requests, another thread which continuously modifies order prices as the market moves, and one more thread which is responsible for de-serialising the requests sent from a GUI application. All three threads currently write to a BlockingQueue (hence we see a lot of contention on the queue).
The Disruptor talks about a single-writer principle, and from what I have read that approach scales best. Is there any way we could make the above three threads obey the single-writer principle?
Also, in a typical request/response application, and especially in our case, we have contention on an in-memory cache: we need to lock the cache when we update it with the response, while a request might be happening for the same order. How do we handle this through the Disruptor, i.e. how do I tie a response to a particular request? Can I eliminate the lock on the cache? If yes, how?
Any suggestions/pointers would be highly appreciated. We are currently using Java 1.6.
I'm new to the Disruptor and am trying to understand as many use cases as possible. I have tried to answer your questions.
Yes, the Disruptor can be used to sequence calls from multiple producers. I understand that all 3 threads try to update the state of a shared object, and a single consumer takes the necessary action on the shared object. Internally, you can have the single consumer delegate calls to the appropriate single-threaded handler based on responsibility.
The Disruptor does exactly this: it sequences the calls such that the state is accessed by only one thread at a time. If there's a specific order in which the event handlers are to be invoked, set up a memory barrier. The latest version of the Disruptor has a DSL that lets you set up the order easily.
The cache can be abstracted and accessed through the Disruptor. At any given time only a reader or a writer gets access to the cache, since all calls to the cache are sequenced.
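For illustration, a hedged sketch against the Disruptor 3.x DSL (note that 3.x requires a newer JVM than the Java 1.6 mentioned in the question, and the 2.x API differs; OrderEvent and the handler body are placeholders):

```java
import com.lmax.disruptor.BlockingWaitStrategy;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
import java.util.concurrent.Executors;

public class DisruptorSketch {
    static class OrderEvent {
        String payload;
    }

    public static void main(String[] args) {
        // ProducerType.MULTI lets the FIX, price-update, and GUI threads
        // all publish safely; the consumer side stays single-threaded.
        Disruptor<OrderEvent> disruptor = new Disruptor<>(
                OrderEvent::new, 1024,
                Executors.defaultThreadFactory(),
                ProducerType.MULTI,
                new BlockingWaitStrategy());

        // Single consumer: all cache access happens here, so no cache lock.
        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                System.out.println("processing " + event.payload));

        RingBuffer<OrderEvent> ring = disruptor.start();

        // Any producer thread publishes like this:
        ring.publishEvent((event, seq, msg) -> event.payload = msg, "NewOrder");
    }
}
```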
Since thread execution happens in a pool and is not guaranteed to queue in any particular order, why would you ever create threads without the protection of synchronization and locks? In order to protect data attached to an object's state (what I understand to be the primary purpose of using threads), locking appears to be the only choice. Eventually you'll end up with race conditions and "corrupted" data if you don't synchronize. So if you're not interested in protecting that data, then why use threads at all?
If there's no shared mutable data, there's no need for synchronization or locks.
Delegation, just as one example. Consider a webserver that gets connect requests. It can delegate a particular request to a worker thread. The main thread can pass all the data it wants to the worker thread, as long as that data is immutable, and not have to worry at all about concurrent data access.
(For that matter, both the main thread and the worker thread can send each other all the immutable data they want; it just requires a messaging queue of some sort, so the queue may need synchronization, but not the data itself. And you don't even need a message queue to get data to a worker thread: just construct the data before the thread starts, and as long as the data is immutable at that point, you don't need any synchronization, locks, or concurrency management of any sort, other than the ability to run a thread.)
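A minimal sketch of that hand-off, assuming a hypothetical immutable Request type: the data is constructed before the thread starts, and Thread.start() provides the happens-before edge, so no locks are needed.

```java
public class ImmutableHandoffSketch {

    // All fields final, set once in the constructor: safe to share once published.
    static final class Request {
        final String path;
        final long timestamp;
        Request(String path, long timestamp) {
            this.path = path;
            this.timestamp = timestamp;
        }
    }

    public static void main(String[] args) {
        Request req = new Request("/index.html", System.currentTimeMillis());
        // The request is fully constructed before the thread starts; the
        // Thread.start() happens-before guarantee makes it visible to the
        // worker without any synchronization.
        new Thread(() -> System.out.println("serving " + req.path)).start();
    }
}
```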
Synchronization and locks protect shared state from conflicting concurrent updates. If there is no shared state to protect, you can run multiple threads without locking and synchronization. This might be the case in a web server with multiple independent worker threads serving incoming requests. Another way to avoid synchronization and locking is to have your threads only operate on immutable shared state: if a thread can't alter any data that another thread is operating on, concurrent unsynchronized access is fine.
Or you might be using an Actor-based system to handle concurrency. Actors communicate by message passing only, there is no shared state for them to worry about. So here you can have many threads running many Actors without locks. Erlang uses this approach, and there is a Scala Actors library that allows you to program this way on the JVM. In addition there are Actors-based libraries for Java.
In order to protect data attached to an object's state (what I understand to be the primary purpose of using threads), locking appears to be the only choice. ... So if you're not interested in protecting that data, then why use threads at all?
The highlighted bit of your question is incorrect, and since it is the root cause of your "doubts" about threads, it needs to be addressed explicitly.
In fact, the primary purpose of using threads is to allow tasks to proceed in parallel where possible. On a multiprocessor, the parallelism will (all things being equal) speed up your computations. But there are other benefits that apply on a uniprocessor as well. The most obvious one is that threads allow an application to do work while waiting for an IO operation to complete.
Threads don't actually protect object state in any meaningful way. The protection you are attributing to threads comes from:
declaring members with the right access,
hiding state behind getters / setters,
correct use of synchronization,
use of the Java security framework, and/or
sending requests to other servers / services.
You can do all of these independently of threading.
java.util.concurrent.atomic provides for some minimal operations that can be performed in a lock-free and yet thread-safe way. If you can arrange your concurrency entirely around such classes and operations, your performance can be vastly enhanced (as you avoid all the overhead connected with locking). Granted, it's unusual to be working on such a simplifiable problem (more often some locking will be needed), but, if and when you do find yourself in such a situation, well, then, that's exactly the use case you're asking about!-)
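For example, a small sketch of lock-free counting with java.util.concurrent.atomic (the counter and thread count are arbitrary): two threads perform atomic read-modify-write increments with no lock ever taken.

```java
import java.util.concurrent.atomic.AtomicLong;

public class AtomicCounterSketch {
    private static final AtomicLong hits = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        Runnable worker = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                hits.incrementAndGet(); // atomic, lock-free read-modify-write
            }
        };
        Thread a = new Thread(worker);
        Thread b = new Thread(worker);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(hits.get()); // always 2000000, never a lost update
    }
}
```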
There are other kinds of protection for shared data. Maybe you have atomic sections, monitors, software transactional memory, or lock-free data structures. All these ideas support parallel execution without explicit locking. You can Google any of these terms and learn something interesting. If your primary interest is Java, look up Tim Harris's work.
Threads allow multiple parallel units of work to progress concurrently. Synchronisation is simply there to protect shared resources from unsafe access; if it's not needed, you don't use it.
Processing on a thread becomes delayed when accessing certain resources, such as IO, and it may be desirable to keep the CPU processing other units of work while some are delayed.
As in the example in the other answer, listening for service requests may well be a unit of work that is kept independent of responding to a request, as the latter may block due to resource contention, say disk access or IO.