How to get an iterator from an akka streams Source? - java

I'm trying to create a flow that I can consume via something like an Iterator.
I'm implementing a library that exposes an iterator-like interface, so that would be the simplest thing for me to consume.
The graph I've designed so far is essentially a Source<Iterator<DataRow>>. One option I see is to flatten it to Source<DataRow> and then use http://doc.akka.io/japi/akka/current/akka/stream/javadsl/StreamConverters.html#asJavaStream-- followed by https://docs.oracle.com/javase/8/docs/api/java/util/stream/BaseStream.html#iterator--
But given that there will potentially be many rows, I'm wondering whether it would make sense to avoid the flattening step (at least within the akka streams context; I'm assuming there's some minor per-element overhead when elements are passed between stages), or whether there's a more direct way.
Also, I'm curious how backpressure works in the created stream, especially the child Iterator; does it only buffer one element?

Flattening Step
Flattening a Source<Iterator<DataRow>> to a Source<DataRow> does add some amount of overhead since you'll have to use flatMapConcat which does eventually create a new GraphStage.
However, if you have "many" rows then this separate stage may come in handy since it will provide concurrency for the flattening step.
Backpressure
If you look at the code of StreamConverters.asJavaStream you'll see that there is a QueueSink that is spawning a Future to pull the next element from the akka stream and then doing an Await.result(nextElementFuture, Inf) to wait on the Future to complete so the next element can be forwarded to the java Stream.
Answering your question: yes the child Iterator only buffers one element, but the QueueSink has a Future which may also have the next DataRow. Therefore the javaStream & Iterator may have 2 elements buffered, on top of however much buffering is going on in your original akka Source.

Alternatively, you may implement an Iterator using prefixAndTail(1) under the hood for implementing hasNext and next.

Related

Interface for Lazy Loaded List of Objects

We have a whole bunch of data sources where we consult some REST API or other and get back a list of objects. I'm trying to design an abstraction layer that doesn't need to know how to contact any specific API instance or how to semantically interpret the objects, but that guarantees that we get back a list of objects from whichever class implements the interface we need at the time.
I expect at times the numbers of results to be quite large (but always finite!) and often slow to retrieve, so I require something that does not load everything into memory all at once but allows the results of the list to be worked with as they become available. I'm fine if the list blocks on next or hasNext or whatever the appropriate analogue is.
What's the most appropriate abstraction / approach for achieving these goals and how is it implemented?
My gut tells me it ought to be some flavor of Java 8 Streams, possibly created via the Java 9 Stream.iterate method, but I'm not too familiar with functional programming paradigms and can't for the life of me figure out how one would populate the elements of the Stream as they became available from the REST calls and close it out when it's finished.
It turns out I was confusing myself by conflating two issues: how to provide an Iterator in an Interface (which is trivial), and how to populate that Iterator in the background. I ended up with roughly the following:
Create a custom abstract class which implements Iterator. That class has an internal BlockingQueue and an internal List. It also defines an abstract method which is intended to perform all the activities of population in a single invocation.
The first time hasNext() is called, kick off a daemon thread which invokes that abstract method. Then, while the thread is alive (meaning it's still populating the BlockingQueue) or the List isn't empty (meaning not all elements have been consumed via next()), poll against the BlockingQueue until it has at least one element in it. Once it does, remove that element and add it to the List. next() merely returns elements from the List.
This results in lazy loading (nothing occurs until hasNext() is called for the first time) that also happens asynchronously in the background -- the caller will be able to process things as soon as they're available (hasNext() will block if things aren't available), and it doesn't use up an unreasonable amount of memory (the BlockingQueue will block if it has too many elements).
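The approach described above could be sketched roughly as follows. This is a simplified illustration, not the original code: the class name LazyLoadingIterator, the emit helper, the capacity parameter, and the 50 ms poll interval are all assumptions made for the example, and a small ArrayDeque stands in for the internal List.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: an Iterator that is filled by a background daemon thread.
abstract class LazyLoadingIterator<T> implements Iterator<T> {
    // Bounded, so the producer blocks instead of filling up memory.
    private final BlockingQueue<T> queue;
    // Elements already moved out of the queue, waiting for next().
    private final Queue<T> ready = new ArrayDeque<>();
    private Thread producer;

    protected LazyLoadingIterator(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    // Performs all population in a single invocation, calling emit() per element.
    protected abstract void populate() throws InterruptedException;

    // Blocks while the queue is full. Null elements are not supported.
    protected final void emit(T item) throws InterruptedException {
        queue.put(item);
    }

    @Override
    public boolean hasNext() {
        if (!ready.isEmpty()) {
            return true;
        }
        if (producer == null) {
            // Lazy start: nothing happens until the first hasNext() call.
            producer = new Thread(() -> {
                try {
                    populate();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            producer.setDaemon(true);
            producer.start();
        }
        try {
            while (true) {
                // Read the liveness flag BEFORE polling: if the producer was already
                // dead and the poll still finds nothing, no more elements can arrive.
                boolean finished = !producer.isAlive();
                T item = queue.poll(50, TimeUnit.MILLISECONDS);
                if (item != null) {
                    ready.add(item);
                    return true;
                }
                if (finished) {
                    return false;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return ready.remove();
    }
}
```

A concrete subclass only has to implement populate(), for example by looping over paged REST responses and calling emit() for each object. Note that in this sketch a RuntimeException thrown inside populate() simply ends the iteration; a real implementation would probably want to surface it to the caller.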

How to find max element in array using LMAX disruptor

Could you please provide a link to a code example that implements parallel sort or parallel max finding using the LMAX Disruptor pattern?
It's not really applicable. The disruptor is essentially behaving like a pipe with a handler visiting every item in isolation, but it's implemented very differently for avoiding locks and improving locality of references.
To find the max, this handler would have to "leak" information in a central place, thus colliding with other threads trying to produce their own value. To sort, I wouldn't even know where to begin... you want each handler to do some insertion sort into a separate array somewhere else and merge later? That's just so not a good fit.
Besides, some thread has to put the data in the ring, which is pretty much the linear search you could have done in the first place. If the ring could be built directly over an existing array (to skip publishing), then what's the point of the disruptor? You would be better off with a bunch of threads given a sub range of the array.
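For comparison, here is a minimal sketch of the "bunch of threads given a sub range of the array" alternative, using a plain thread pool rather than the Disruptor. The class name ParallelMax and the workers parameter are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: each worker scans its own sub-range, results are merged at the end.
final class ParallelMax {
    private ParallelMax() {}

    static int max(int[] data, int workers) {
        if (data.length == 0 || workers < 1) {
            throw new IllegalArgumentException("need a non-empty array and at least one worker");
        }
        int parts = Math.min(workers, data.length);
        int chunk = (data.length + parts - 1) / parts; // ceiling division
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<Integer>> partials = new ArrayList<>();
            for (int start = 0; start < data.length; start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, data.length);
                // Each task only reads its own slice, so no locking is needed.
                Callable<Integer> task = () -> {
                    int best = data[from];
                    for (int i = from + 1; i < to; i++) {
                        best = Math.max(best, data[i]);
                    }
                    return best;
                };
                partials.add(pool.submit(task));
            }
            // Merge step: the only place where results from different threads meet.
            int best = Integer.MIN_VALUE;
            for (Future<Integer> partial : partials) {
                best = Math.max(best, partial.get());
            }
            return best;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Each worker keeps its running maximum in a local variable, so there is no shared "central place" to contend on until the final merge. On Java 8 and later, Arrays.stream(data).parallel().max() gives the same result with less code.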

Using Java Concurrent Collections in Scala [duplicate]

I have an Actor that - in its very essence - maintains a list of objects. It has three basic operations, an add, update and a remove (where sometimes the remove is called from the add method, but that aside), and works with a single collection. Obviously, that backing list is accessed concurrently, with add and remove calls interleaving each other constantly.
My first version used a ListBuffer, but I read somewhere it's not meant for concurrent access. I haven't gotten concurrent access exceptions, but I did note that finding & removing objects from it does not always work, possibly due to concurrency.
I was halfway rewriting it to use a var List, but removing items from Scala's default immutable List is a bit of a pain - and I doubt it's suitable for concurrent access.
So, basic question: What collection type should I use in a concurrent access situation, and how is it used?
(Perhaps secondary: Is an Actor actually a multithreaded entity, or is that just my wrong conception and does it process messages one at a time in a single thread?)
(Tertiary: In Scala, what collection type is best for inserts and random access (delete / update)?)
Edit: To the kind responders: Excuse my late reply, I'm making a nasty habit out of dumping a question on SO or mailing lists, then moving on to the next problem, forgetting the original one for the moment.
Take a look at the scala.collection.mutable.Synchronized* traits/classes.
The idea is that you mixin the Synchronized traits into regular mutable collections to get synchronized versions of them.
For example:
import scala.collection.mutable._
val syncSet = new HashSet[Int] with SynchronizedSet[Int]
val syncArray = new ArrayBuffer[Int] with SynchronizedBuffer[Int]
You don't need to synchronize the state of the actors. The aim of the actors is to avoid tricky, error prone and hard to debug concurrent programming.
The actor model ensures that an actor consumes messages one by one, and that you will never have two threads consuming messages for the same actor.
Scala's immutable collections are suitable for concurrent usage.
As for actors, a couple of things are guaranteed, as explained in the Akka documentation:
the actor send rule: where the send of the message to an actor happens before the receive of the same actor.
the actor subsequent processing rule: where processing of one message happens before processing of the next message by the same actor.
You are not guaranteed that the same thread processes the next message, but you are guaranteed that the current message will finish processing before the next one starts, and also that at any given time, only one thread is executing the receive method.
So that takes care of a given Actor's persistent state. With regard to shared data, the best approach as I understand it is to use immutable data structures and lean on the Actor model as much as possible. That is, "do not communicate by sharing memory; share memory by communicating."
What collection type should I use in a concurrent access situation, and how is it used?
See @hbatista's answer.
Is an Actor actually a multithreaded entity, or is that just my wrong conception and does it process messages one at a time in a single thread
The second (though the thread on which messages are processed may change, so don't store anything in thread-local data). That's how the actor can maintain invariants on its state.

Queue implementation with blocked 'take()' but with eviction policy

Is there a queue implementation that blocks on take but is bounded by a maximum size? When the size of the queue reaches the given maximum, instead of blocking put, it would remove the head element and insert the new one. So put() never blocks, but take() does.
One use case is a very slow consumer: rather than the system crashing (running out of memory), the oldest messages are dropped, and the producer is never blocked.
An example would be a stock trading system. When you get a spike in trade/quote data that you haven't consumed yet, you want to automatically throw away the old trades/quotes.
Java does not currently ship a thread-safe queue that does exactly what you are looking for. However, there is BlockingDeque (a double-ended queue), around which you can write a wrapper that takes from the head or the tail as you see fit.
This class, similar to a BlockingQueue, is thread safe.
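Such a wrapper could look roughly like this. It is only a sketch: the class name EvictingBlockingQueue and its method names are hypothetical, and it relies on LinkedBlockingDeque (the JDK's BlockingDeque implementation) for thread safety.

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;

// Hypothetical sketch: put() never blocks and evicts the oldest element when full;
// take() blocks until an element is available.
final class EvictingBlockingQueue<T> {
    private final BlockingDeque<T> deque;

    EvictingBlockingQueue(int capacity) {
        this.deque = new LinkedBlockingDeque<>(capacity);
    }

    // Never blocks: while the deque is full, drop the head and retry.
    void put(T item) {
        while (!deque.offerLast(item)) {
            deque.pollFirst(); // evict the oldest element (no-op if a consumer got there first)
        }
    }

    // Blocks until an element is available.
    T take() throws InterruptedException {
        return deque.takeFirst();
    }

    // Non-blocking variant; returns null when the queue is empty.
    T poll() {
        return deque.pollFirst();
    }

    int size() {
        return deque.size();
    }
}
```

With several producers racing on a full queue, this sketch may occasionally evict more than one element per put(); for the "throw away stale quotes" use case that is usually acceptable, but it is a trade-off to be aware of.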
Several strategies are provided in ThreadPoolExecutor. Search for "AbortPolicy" in this javadoc. You can also implement your own policy if you want. Perhaps one of the discard policies (DiscardPolicy or DiscardOldestPolicy) is similar to what you want. Personally I think CallerRuns is what you want in most cases.
I think using these is a better solution, but if you absolutely want to implement it at the queue, I'd probably do it by composition. Perhaps use a LinkedList or something and wrap it with synchronize keyword.
EDIT: (some clarifications)
"Executor" is basically a thread pool combined with a blocking queue. It is the recommended way to implement a producer/consumer pattern in Java. The authors of these libraries provide several strategies to cope with issues like the one you mentioned. If you are interested, here is another approach to specifically address the OOME issue (the source is framework specific and can't be used as is).

Java: Large collection and concurrent threads

I am facing this issue:
I have lots of threads (1024) that access one large collection - a Vector.
Question:
Is it possible to do something that would allow concurrent operations on it without having to synchronize everything (since that takes time)? What I mean is something like how a MySQL database works, where you don't have to worry about synchronization and thread-safety issues. Is there a collection like that in Java? Thanks
Vector is a very old Java class - predates the Collections API. It synchronizes on every operation, so you're not going to have any luck trying to speed it up.
You should consider reworking your code to use something like ConcurrentHashMap or a LinkedBlockingQueue, which are highly optimized for concurrent access.
Failing that, you mention that you'd like performance and access semantics similar to a database - why not use a dedicated database or a message queue? They are likely to implement it better than you ever will, and it's less code for you to write!
[edit] Given your comment:
all what thread does is adding elements to vector
(only if num of elements in vector = 0) &
removing elements from vector. (if vector size > 0)
it sounds very much like you should be using something much more like a queue than a list! A bounded queue with size 1 will give you these semantics - although I'd question why you can't add elements if there is already something there. When you've got thousands of threads this seems like a very inefficient design.
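As a rough illustration of those semantics, a bounded queue of capacity 1 could be wrapped like this. The class name SingleSlot and its method names are invented for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: "add only if empty, remove only if non-empty" via a 1-slot queue.
final class SingleSlot<T> {
    private final BlockingQueue<T> slot = new ArrayBlockingQueue<>(1);

    // Returns false (without blocking) if the slot is already occupied.
    boolean tryAdd(T item) {
        return slot.offer(item);
    }

    // Returns null (without blocking) if the slot is empty.
    T tryRemove() {
        return slot.poll();
    }
}
```

Both calls are atomic, so the "check size, then act" race that a Vector would need external synchronization for does not arise.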
Well first off, this design doesn't sound right. It sounds like you need to think about using a proper database rather than a simple data structure, even if this means just using something like an in-memory instance of HypersonicDB.
However, if you insist on doing things this way, then the java.util.concurrent package has a number of highly concurrent, non-locking data structures. One of them might suit your purpose (e.g. ConcurrentHashMap, if you can use a Map rather than a List)
Looks like you are implementing the producer consumer pattern, you should google "producer consumer java" or have a look at the BlockingQueue interface
I agree with skaffman about looking at java.util.concurrent.
ConcurrentHashMap is very scalable. However, the size() call on it returns only an approximation. So e.g. your app will occasionally be adding elements to it even if !(num of elements in vector = 0).
If you want to strictly enforce the condition you gave, there is no other way than to synchronize.
Instead of having tons of context switches, I guess you could let your user threads post a Callable on a queue and have only one thread dealing with the mutation. This will eliminate the need for synchronization on the collection. The user threads can wait on Future.get().
Just an idea.
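A minimal sketch of that idea, assuming a single-threaded executor as the "one thread dealing with the mutation". The class name SingleWriterList is hypothetical, and CompletableFuture is used here as a convenience (it implements Future, so callers can still wait with get()).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: all mutations are funneled through one writer thread.
final class SingleWriterList<T> {
    // Only ever touched by the writer thread, so it needs no synchronization.
    private final List<T> items = new ArrayList<>();

    private final ExecutorService writer = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "single-writer");
        t.setDaemon(true); // do not keep the JVM alive for this sketch
        return t;
    });

    // Adds the item only if the list is currently empty; completes with true if added.
    CompletableFuture<Boolean> addIfEmpty(T item) {
        return CompletableFuture.supplyAsync(() -> items.isEmpty() && items.add(item), writer);
    }

    // Removes and returns the first item, or completes with null if the list is empty.
    CompletableFuture<T> removeFirst() {
        return CompletableFuture.supplyAsync(() -> items.isEmpty() ? null : items.remove(0), writer);
    }

    void shutdown() {
        writer.shutdown();
    }
}
```

Because tasks run one at a time on the writer thread, the "is it empty?" check and the following mutation cannot interleave with another thread's operation.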
If you do not want to change your data structure and have only infrequent writes, you might also use one or more ReentrantReadWriteLock instances to synchronize access. Then many threads can read at the same time, but when a thread wants to write, all reads are blocked until the write is done.
But you should check whether the used data structure is appropriate for the task, or whether another of the many java.util or java.util.concurrent classes is more appropriate. java.util.Vector is synchronized, by the way.
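As a sketch of the read/write lock approach, assuming a plain ArrayList guarded by a single ReentrantReadWriteLock (the wrapper class ReadMostlyList is made up for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: many concurrent readers, exclusive writers.
final class ReadMostlyList<T> {
    private final List<T> items = new ArrayList<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Readers share the read lock and do not block each other.
    T get(int index) {
        lock.readLock().lock();
        try {
            return items.get(index);
        } finally {
            lock.readLock().unlock();
        }
    }

    int size() {
        lock.readLock().lock();
        try {
            return items.size();
        } finally {
            lock.readLock().unlock();
        }
    }

    // A writer takes the exclusive write lock; readers wait until it is released.
    void add(T item) {
        lock.writeLock().lock();
        try {
            items.add(item);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

This only pays off when reads clearly outnumber writes; with 1024 threads that mostly add and remove, the write lock would serialize nearly everything.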
