I'm trying to use parallel streams to call an API endpoint and get some data back. I am using an ArrayList<String> and sending each String to a method that uses it to make a call to my API. I have set up parallel streams to call a method that calls the endpoint and marshalls the data that comes back. The problem is that when viewing this in htop I see ALL the cores on the db server light up the second I hit this method, but as the first group finishes I only see 1 or 2 cores light up. So I think I am truly getting the result I want for the first set of calls only, and from monitoring it looks like the rest of the calls are made one at a time.
I think it may have something to do with the recursion but I'm not 100% sure.
private void generateObjectMap(Integer count) {
    ArrayList<String> myList = getMyList();
    myList.parallelStream().forEach(f -> performApiRequest(f, count));
}

private void performApiRequest(String myString, Integer count) {
    if (count < 10) {
        TreeMap<Integer, TreeMap<Date, MyObj>> tempMap = new TreeMap<>();
        try {
            tempMap = myJson.getTempMap(myRestClient.executeGet(myString));
        } catch (SocketTimeoutException e) {
            count += 1;
            performApiRequest(myString, count);
        }
        ...
    } else {
        System.exit(1);
    }
}
This seems an unusual use for parallel streams. In general the idea is that you are informing the JVM that the operations on the stream are truly independent and can run in any order, on one thread or several. The results are subsequently reduced or collected as part of the stream. The important point to remember here is that side effects are undefined (which is why variables changed in streams need to be final or effectively final), and you shouldn't rely on how the JVM organises execution of the operations.
I can imagine the following being a reasonable usage:
list.parallelStream().map(item -> getDataUsingApi(item))
    .collect(Collectors.toList());
Where the api returns data which is then handed to downstream operations with no side effects.
So, in conclusion, if you want tight control over how the API calls are executed, I would recommend not using parallel streams for this. Traditional Thread instances, possibly managed by a ThreadPoolExecutor, will serve you much better, as shown in the sketch below.
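As a rough illustration only (this reuses generateObjectMap, getMyList and performApiRequest from the question; the pool size and timeout are arbitrary choices, not recommendations):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

private void generateObjectMap(Integer count) throws InterruptedException {
    // A fixed-size pool gives explicit control over how many API calls run
    // concurrently, instead of relying on the common ForkJoinPool that
    // parallel streams use behind the scenes.
    ExecutorService pool = Executors.newFixedThreadPool(8); // pool size chosen arbitrarily

    for (String item : getMyList()) {
        pool.submit(() -> performApiRequest(item, count));
    }

    pool.shutdown();                            // stop accepting new tasks
    pool.awaitTermination(5, TimeUnit.MINUTES); // wait for in-flight calls to finish
}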
I am writing a program which uses a CompletionService to run threaded analyses on a bunch of different objects, where each "analysis" consists of taking in a string and doing some computation to give either true or false as an answer. My code looks essentially like this:
// tasks come from a different method and contain the strings + some other needed info
List<Future<Pair<Pieces, Boolean>>> futures = new ArrayList<>(tasks.size());
for (Task task : tasks) {
    futures.add(executorCompletionService.submit(task));
}

ArrayList<Pair<Pieces, Boolean>> pairs = new ArrayList<>();
int toComplete = tasks.size();
int received = 0;
int failed = 0;
while (received < toComplete) {
    Future<Pair<Pieces, Boolean>> resFuture = executorCompletionService.take();
    received++;
    Pair<Pieces, Boolean> res = resFuture.get();
    if (!res.getValue()) failed++;
    if (failed > 300) {
        // My problem is here
    }
    pairs.add(res);
}
// return pairs and go on to do something else
In the marked section, my goal is to have it abandon the computation if over 300 strings have failed, such that I can move on to a new analysis, calling this method again with some different data. The problem is that since the same CompletionService is used again, if I do not somehow clear the queue, then the worker queue will keep growing as I keep adding more to it every time I use it (since after 300 failures there are likely still many unprocessed strings left).
I have tried to loop through the futures list and cancel all unfinished tasks using something like futures.forEach(future -> future.cancel(true)), however when I next call the method I get a java.util.concurrent.CancellationException when I try to call resFuture.get().
(Edit: It seems that even though I call forEach(future -> future.cancel(true)), this does not guarantee that the worker queue is actually clear afterwards. I do not understand why this is. It almost seems as if it takes a while to clear the queue, and the code does not wait for this to happen before moving on to the next analysis, so occasionally get() is called on a future which has been cancelled.)
I have also tried to do
while (received < toComplete) {
    executorCompletionService.take();
    received++;
}
to empty the queue, and while this works it is barely faster than just running all of the analyses anyway, so it does not help much with efficiency.
My question is if there is a better way to empty the worker queue such that when I next call this code it is as if the CompletionService is new again.
Edit: Another method I have tried is just setting executorCompletionService = new CompletionService, which is slightly faster than my other solution but is still rather slow and definitely not good practice.
P.S.: I'm also happy to accept any other way of doing this; I am not attached to using a CompletionService, it has just been the easiest thing for what I've done so far.
This has since been resolved, but I have seen other similar questions with no good answer so here is my solution:
Previously, I was using an ExecutorService to create my ExecutorCompletionService(ExecutorService). I switched to declaring it as a ThreadPoolExecutor, and since behind the scenes the ExecutorService returned by Executors already is a ThreadPoolExecutor, the method signatures can be fixed with just a cast. Working with the ThreadPoolExecutor directly gives you much more freedom in the backend; specifically, you can call threadPoolExecutor.getQueue().clear(), which clears all tasks awaiting execution. Finally, I needed to make sure to "drain" the remaining running tasks, so my final cancelling code looked like this:
if (failed > maxFailures) {
    executorService.getQueue().clear();
    while (executorService.getActiveCount() > 0) {
        executorCompletionService.poll();
    }
}
At the end of this code block, the executor will be ready to run again.
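For anyone who wants the whole pattern in one place, here is a condensed sketch of the approach described above (the task type is simplified to Callable<Boolean>, and the class name, method name and pool size are mine rather than from the original code):

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;

public class AbandonableBatch {

    // Declared as ThreadPoolExecutor (not plain ExecutorService) so getQueue() is available.
    private static final ThreadPoolExecutor executorService =
            (ThreadPoolExecutor) Executors.newFixedThreadPool(8); // pool size chosen arbitrarily
    private static final CompletionService<Boolean> executorCompletionService =
            new ExecutorCompletionService<>(executorService);

    static int runBatch(List<Callable<Boolean>> tasks, int maxFailures) throws Exception {
        for (Callable<Boolean> task : tasks) {
            executorCompletionService.submit(task);
        }
        int received = 0;
        int failed = 0;
        while (received < tasks.size()) {
            Future<Boolean> resFuture = executorCompletionService.take();
            received++;
            if (!resFuture.get()) failed++;
            if (failed > maxFailures) {
                // Abandon this batch: drop queued tasks, then drain whatever is
                // still running so the next batch starts from a clean state.
                executorService.getQueue().clear();
                while (executorService.getActiveCount() > 0) {
                    executorCompletionService.poll();
                }
                break;
            }
        }
        return failed;
    }
}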
I have the following code:
Flux<String> flux = Flux.<String>never()
.doOnRequest(n -> System.out.println("Requested " + n));
It is a Flux that never emits any signal, but reports demand to the console.
Each of the following 3 lines
flux.take(3).next().block();
flux.next().block();
flux.blockFirst();
produces this output:
Requested 9223372036854775807
Looking at the code, I see the following.
BlockingSingleSubscriber (used both by Flux#blockFirst() and Mono#block()):
public final void onSubscribe(Subscription s) {
    this.s = s;
    if (!cancelled) {
        s.request(Long.MAX_VALUE);
    }
}
MonoNext.NextSubscriber:
public void request(long n) {
    if (WIP.compareAndSet(this, 0, 1)) {
        s.request(Long.MAX_VALUE);
    }
}
FluxTake.TakeSubscriber:
public void request(long n) {
    if (wip == 0 && WIP.compareAndSet(this, 0, 1)) {
        if (n >= this.n) {
            s.request(Long.MAX_VALUE);
        } else {
            s.request(n);
        }
        return;
    }
    s.request(n);
}
So Flux#blockFirst(), Flux#next() and Mono#block() always signal an unbounded demand to their upstream, and Flux#take() can do the same under some circumstances.
But Flux#blockFirst(), Flux#next() and Mono#block() each need at most one element from their upstream, and Flux#take() needs at most this.n.
Also, Flux#take() javadoc says the following:
Note that this operator doesn't manipulate the backpressure requested amount.
Rather, it merely lets requests from downstream propagate as is and cancels once
N elements have been emitted. As a result, the source could produce a lot of
extraneous elements in the meantime. If that behavior is undesirable and you do
not own the request from downstream (e.g. prefetching operators), consider
using {@link #limitRequest(long)} instead.
The question is: why do they signal an unbounded demand when they know the limit upfront? I had an impression that reactive backpressure was about only asking for what you are ready to consume. But in reality, it often works like this: shout 'produce all you can' to the upstream, and then cancel the subscription once satisfied. In cases where it is costly to produce a gazillion records upstream, this seems simply wasteful.
tl;dr - Requesting only what you need is usually ideal in a pull based system, but is very rarely ideal in a push based system.
I had an impression that reactive backpressure was about only asking for what you are ready to consume.
Not quite, it's what you are able to consume. The difference is subtle, but important.
In a pull based system, you'd be entirely correct - requesting more values than you know you'll ever need would almost never be a good thing, as the more values you request, the more work needs to happen to produce those values.
But note that reactive streams are inherently push based, not pull based. Most reactive frameworks, reactor included, are built with this in mind - and while hybrid, or pull based semantics are possible (using Flux.generate() to produce elements one at a time on demand for example) this is very much a secondary use case. The norm is to have a publisher that has a bunch of data it needs to offload, and it "wants" to push that to you as quickly as possible to be rid of it.
This is important as it flips the view as to what's ideal from a requesting perspective. It no longer becomes a question of "What's the most I'll ever need", but instead "What's the most I can ever deal with" - the bigger the number, the better.
As an example, let's say I have a database query returning 2000 records connected to a flux - but I only want 1. If I have a publisher that's pushing these 2000 records, and I call request(1), then I'm not "helping" things at all - I haven't caused less processing on the database side, those records are already there and waiting. Since I've only requested 1 however, the publisher must then decide whether it can buffer the remaining records, or it's best to skip some or all of them, or it should throw an exception if it can't keep up, or something else entirely. Whatever it does, I'm actually causing more work, and possibly even an exception in some cases, by requesting fewer records.
Granted, this is not always desirable - perhaps those extra elements in the Flux really do cause extra processing that's wasteful, perhaps network bandwidth is a primary concern, etc. In that case, you'd want to explicitly call limitRequest() (see the sketch below). In most cases though, that's probably not the behaviour you're after.
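As a small illustration, reusing the Flux from the question (the printed values assume the Reactor 3 behaviour quoted above; in newer Reactor versions take(n) itself caps the request and limitRequest is deprecated in its favour):

import reactor.core.publisher.Flux;

public class DemandDemo {
    public static void main(String[] args) {
        Flux<String> flux = Flux.<String>never()
                .doOnRequest(n -> System.out.println("Requested " + n));

        // take(1) lets the downstream's unbounded request through unchanged:
        flux.take(1).subscribe();         // prints: Requested 9223372036854775807

        // limitRequest(1) caps the demand signalled upstream at exactly 1:
        flux.limitRequest(1).subscribe(); // prints: Requested 1
    }
}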
(For completeness' sake, the best scenario is of course to limit the data at the source - put a LIMIT 1 on your database query if you only want a single value, for instance. Then you don't have to worry about any of this. But, of course, in real-world usage that's not always possible.)
I have multiple threads calling one method that writes contents from an object to a file, as below:
When I use 1 thread to test this method, the output in my file is as expected. However, with multiple threads, the output in the file is messy. How do I make this thread safe?
void writeResults(Document doc, BufferedWriter writer) {
    Map<Sentence, Set<Matrix>> matrixMap = doc.getMatrix();
    for (Sentence sentence : matrixMap.keySet()) {
        Set<Matrix> set = doc.getMatrix(sentence);
        for (Matrix matrix : set) {
            List<Result> results = ResultGenerator.getResult();
            writer.write(matrix, matrix.frequency());
            writer.write(results.toString());
            writer.write("\n");
        }
    }
}
Edit:
I added the line List<Result> results = ResultGenerator.getResult(). What I really want is to use multiple threads to process this method call, since this part is expensive and takes a lot of time. The writing part is very quick, so I don't really need multiple threads for it.
Given this change, is there a way to make this method call safe in a concurrent environment?
Essentially, you are limited by the single file at the end. There are no global variables and it publishes nothing, so the method itself is thread safe.
But if the processing does take a lot of time, you can use parallel streams and publish the results to a ConcurrentHashMap or a blocking queue. You would however still have a single consumer writing to the file, along the lines of the sketch below.
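For illustration, a minimal self-contained sketch of that split, with a made-up expensiveAnalysis method standing in for the costly ResultGenerator.getResult() step; the expensive part runs in parallel while a single thread owns the writer:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelProcessSequentialWrite {

    // Stand-in for the expensive per-element work from the question.
    static String expensiveAnalysis(String input) {
        return input.toUpperCase(); // imagine heavy computation here
    }

    public static void main(String[] args) throws IOException {
        List<String> inputs = Arrays.asList("alpha", "beta", "gamma");

        // Expensive part in parallel: each element is independent, no shared mutable state.
        List<String> results = inputs.parallelStream()
                .map(ParallelProcessSequentialWrite::expensiveAnalysis)
                .collect(Collectors.toList());

        // Cheap part sequential: one thread owns the writer, so nothing interleaves.
        try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("results.txt"))) {
            for (String line : results) {
                writer.write(line);
                writer.write("\n");
            }
        }
    }
}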
I am not well versed in Java so I am going to provide a language-agnostic answer.
What you want to do is transform matrices into results, then format them as strings and finally write them all into the stream.
Currently you are writing into the stream as soon as you process each result, so when you add multiple threads to your logic you end up with race conditions on the stream.
You have already figured out that only the calls to ResultGenerator.getResult() should be done in parallel, whilst the stream still needs to be accessed sequentially.
Now you only need to put this in practice. Do it in order:
1. Build a list where each item is what you need to generate a result
2. Process this list in parallel thus generating all results (this is a map operation). Your list of items will become a list of results.
3. Now you already have your results so you can iterate over them sequentially to format and write them into the stream.
I suspect Java 8 provides some tools to do all of this in a functional way, but as I said I am not a Java guy, so I cannot provide Java code samples. I hope this explanation will suffice.
#edit
This sample code in F# explains what I meant.
open System

// This is a pretty long and nasty operation!
let getResult doc =
    Threading.Thread.Sleep(1000)
    doc * 10

// This is writing into stdout, but it could be a stream...
let formatAndPrint =
    printfn "Got result: %O"

[<EntryPoint>]
let main argv =
    printfn "Starting..."
    [| 1 .. 10 |]                   // A list with some docs to be processed
    |> Array.Parallel.map getResult // Now that's doing the trick
    |> Array.iter formatAndPrint
    0
If you need the final file in a predetermined sequential order, do not multithread, or you will not get what you expect.
If you think that multithreading will make your program execute faster with regard to I/O output, you are likely mistaken; because of locking or synchronisation overhead, you will actually get worse performance than with a single thread.
If you are trying to write a very big file, the ordering of Document instances is not relevant, and you think your writer method will hit a CPU bottleneck instead (the only plausible cause I can see in your code is the frequency() method call), then you can have each thread hold its own BufferedWriter that writes to a temporary file, and add an additional step that waits for all of them and then generates the final file by concatenation, roughly as sketched below.
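A rough, self-contained sketch of that idea (all class, method and file names here are made up; the per-document work is faked with a single placeholder line):

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadTempFiles {

    // Each worker writes its own temporary file; the body stands in for the
    // expensive per-document work from the question.
    static Path writePart(String documentId) throws IOException {
        Path part = Files.createTempFile("results-", ".part");
        try (BufferedWriter writer = Files.newBufferedWriter(part)) {
            writer.write("results for " + documentId);
            writer.newLine();
        }
        return part;
    }

    public static void main(String[] args) throws Exception {
        List<String> documents = Arrays.asList("doc1", "doc2", "doc3");
        ExecutorService pool = Executors.newFixedThreadPool(4);

        List<Future<Path>> parts = new ArrayList<>();
        for (String doc : documents) {
            parts.add(pool.submit(() -> writePart(doc)));
        }

        // Final step: concatenate the temporary files into the real output file.
        try (OutputStream out = Files.newOutputStream(Paths.get("results.txt"))) {
            for (Future<Path> partFuture : parts) {
                Path part = partFuture.get(); // waits for that worker to finish
                Files.copy(part, out);
                Files.delete(part);
            }
        }
        pool.shutdown();
    }
}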
If your code is using distinct doc and writer objects, then your method is already thread-safe, as it does not access or use instance variables.
If you are passing the same writer object to the method, you could use one of these approaches, depending on your needs:
void writeResults(Document doc, BufferedWriter writer) {
    Map<Sentence, Set<Matrix>> matrixMap = doc.getMatrix();
    for (Sentence sentence : matrixMap.keySet()) {
        Set<Matrix> set = doc.getMatrix(sentence);
        for (Matrix matrix : set) {
            List<Result> results = ResultGenerator.getResult();
            // ensure that no other thread interferes while the following
            // three .write() statements are executed.
            synchronized (writer) {
                writer.write(matrix, matrix.frequency()); // from your example, but I doubt it compiles
                writer.write(results.toString());
                writer.write("\n");
            }
        }
    }
}
or lock-free, using a temporary StringBuilder object:
void writeResults(Document doc, BufferedWriter writer) {
    Map<Sentence, Set<Matrix>> matrixMap = doc.getMatrix();
    StringBuilder sb = new StringBuilder();
    for (Sentence sentence : matrixMap.keySet()) {
        Set<Matrix> set = doc.getMatrix(sentence);
        for (Matrix matrix : set) {
            List<Result> results = ResultGenerator.getResult();
            sb.append(matrix).append(matrix.frequency());
            sb.append(results.toString());
            sb.append("\n");
        }
    }
    // write everything at once
    writer.write(sb.toString());
}
I'd make it synchronized. That way, only one thread in your application is allowed to call this method at a time, so no messy output. If you have multiple applications running, you should consider something like file locking.
Example for a synchronized method:
public synchronized void myMethod() {
// ...
}
Only one thread at a time can execute this method.
You can lock down a method and then unlock it when you are finished with it. By putting synchronized before a method, you make sure only one thread at a time can execute it. Synchronization adds overhead, so it should only be used when necessary.
private final ReentrantLock lock = new ReentrantLock();

/* synchronized */
public void run() {
    lock.lock();
    try {
        System.out.print("Hello!");
    } finally {
        lock.unlock(); // always released, even if the body throws
    }
}
This locks down the method just like synchronized. You can use it instead of synchronized, that's why synchronized is commented out above.
In my application, my servlet (running on Tomcat) takes in a doPost request, returns the initial value of an API call to the user for presentation, and then does a ton of data analysis in the background with a lot more API calls. The results of the analysis go into my MongoDB. The problem arises when a new request comes in before the bulk API calls are finished: there are so many calls that they need at least 20 seconds. I don't want the user to wait 20 seconds for their initial data display, so I want the data analysis to pause and let the new request make its initial API call for display.
Here's the general structure of my function after the doPost (async'd, so this is in a Runnable). It's a bit long, so I abbreviated it for an easier read:
private void postMatches(ServletRequest req, ServletResponse res) {
    ... getting the necessary fields from the req ...

    /* read in values for arrays */
    String rankQueue = generateQueryStringArray("RankedQueues=", "rankedQueues", info);
    String season = generateQueryStringArray("seasons=", "seasons", info);
    String champion = generateQueryStringArray("championIds=", "championIds", info);

    /* first api call, "return" this and then start analysis */
    JSONObject recentMatches = caller.callRiotMatchHistory(region, "" + playerId);
    try {
        PrintWriter out = res.getWriter();
        out.write((new Gson()).toJson(recentMatches));
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }

    /* use array values to send more api calls */
    JSONObject matchList = caller.callRiotMatchList(playerId, region, rankQueue, season, champion);

    ... do a ton more api calls with the matchList's id's ...
}
So here are my two ideas:

1. Have two threads per client.

That way, there would be one thread making that single API call, and the other thread would be making the remaining 999 API calls. The single-call thread would just wait until another doPost from the same client comes in and call the API immediately, while the bulk API calls that come with it would just be appended to the other thread. By doing this, the two threads would compute in parallel.

2. Have a priority queue, and put the initial call on high priority.

This way, every URL would be passed through the queue and I could choose the compareTo of specific URLs to rank higher (maybe by wrapping them in a bean). However, I'm not sure how the API caller will be able to distinguish which call is which, because once the URL is added into the queue it loses its identity. Is there any way to fix that? I know callbacks aren't available in Java, so it's kind of hard to do that.
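For what it's worth, here is a hypothetical sketch of the wrapper idea from (2) above (every name here is invented): a small bean carries both the URL and its priority, so the call never loses its identity while sitting in a PriorityBlockingQueue-backed executor. Note that tasks have to be handed in with execute() rather than submit(), because submit() wraps them in a FutureTask, which is not Comparable and would break the priority queue:

import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PriorityApiQueue {

    // Hypothetical wrapper: keeps the URL's identity and its priority together,
    // so the caller can still tell which call is which when it leaves the queue.
    static class PrioritizedCall implements Runnable, Comparable<PrioritizedCall> {
        final String url;
        final int priority; // lower value = handled first

        PrioritizedCall(String url, int priority) {
            this.url = url;
            this.priority = priority;
        }

        @Override
        public void run() {
            // perform the API call for this.url and process the response
            System.out.println("Calling " + url);
        }

        @Override
        public int compareTo(PrioritizedCall other) {
            return Integer.compare(this.priority, other.priority);
        }
    }

    public static void main(String[] args) {
        // One worker thread plus a priority-ordered work queue: queued calls are
        // picked up in priority order rather than submission order.
        ThreadPoolExecutor apiExecutor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new PriorityBlockingQueue<Runnable>());

        apiExecutor.execute(new PrioritizedCall("bulk-analysis-call-1", 10));
        apiExecutor.execute(new PrioritizedCall("bulk-analysis-call-2", 10));
        apiExecutor.execute(new PrioritizedCall("initial-display-call", 0)); // overtakes queued bulk calls

        apiExecutor.shutdown();
    }
}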
Is either of these two ideas possible? No need for code, but it would be greatly appreciated!
PS: I'm using Jersey for API calls.
The best bet for you seems to be using the "two threads per client" solution. Or rather a variation of it.
I figure the API you're calling will have some rate-limiting in place, so that significant amounts of calls will get automatically blocked. That's problematic for you since that limit can probably be trivially reached with just a few requests you process simultaneously.
Additionally, you may hit I/O limits sooner rather than later, depending on how time-intensive your calculations are. This means you should have an explicit limit for your background API calls; the initial request should be fine without any limiting. As such, a fixed-size thread pool seems to be the perfect fit. Exposing it as a static ExecutorService in your service should be the simplest solution.
As such I'd propose that you expose a static service that takes the matchList as a parameter and then "does its thing".
It could look something like this:
public class MatchProcessor {
    private static final ExecutorService service = Executors.newFixedThreadPool(THREADS);

    public static void processMatchList(final JSONObject matchList) {
        service.submit(() -> runAnalysis(matchList));
    }

    private static void runAnalysis(final JSONObject matchList) {
        // processing goes here
    }
}
Sidenote: this code uses Java 8; it should be simple to convert the submitted lambda into a Runnable if you're on Java 7.
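For instance, a quick sketch of what that Java 7 version of the submit call might look like, assuming the same MatchProcessor class as above (matchList is already declared final, so the anonymous class can capture it):

// Java 7: anonymous Runnable instead of the lambda
service.submit(new Runnable() {
    @Override
    public void run() {
        runAnalysis(matchList);
    }
});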
I have a long-running Scala Future that operates as follows:
1. Calculate initial result
2. Improve result
3. If no improvement then terminate, else goto 2
After receiving an external signal (meaning the Future won't have any a priori knowledge about how long it's supposed to run for), I would like to be able to tell the Future to terminate and give me its intermediate result. I can do this using some sort of side channel (note: this is a Java program using Akka, hence the reason I'm creating a Scala future in Java along with all of the attendant boilerplate):
public void doCalculation(AtomicBoolean interrupt, AtomicReference output) {
    Futures.future(new Callable<Boolean>() {
        public Boolean call() {
            Object previous = // calculate initial value
            output.set(previous);
            while (!interrupt.get()) {
                Object next = // calculate next value
                if (/* next is better than previous */) {
                    previous = next;
                    output.set(previous);
                } else return true;
            }
            return false;
        }
    }, TypedActor.dispatcher());
}
This way whoever is calling doCalculation can get intermediate values via output and can terminate the Future via interrupt. However, I'm wondering if there is a way to do this without resorting to side channels, as this is going to make it somewhat difficult for somebody else to maintain my code. We're using Java 7.
This doesn't sound much like a Future to me. Consider instead a Runnable that simply updates an AtomicReference field. Your runnable can update the reference as often as needed, and callers can poll the field whenever they want.
You could have your class implement Supplier if you want it to expose a standard interface for getting a value.
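A minimal sketch of that idea (all names here are mine, the "improvement" step is faked, and Supplier is java.util.function.Supplier from Java 8 - on Java 7 you would use Guava's Supplier or a tiny interface of your own):

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Iteratively improves a result; callers can read the best-so-far value at any
// time via get(), and ask the loop to stop via stop().
public class ImprovingCalculation implements Runnable, Supplier<Integer> {

    private final AtomicReference<Integer> best = new AtomicReference<>();
    private final AtomicBoolean stopped = new AtomicBoolean(false);

    @Override
    public void run() {
        int current = 0;            // stand-in for "calculate initial result"
        best.set(current);
        while (!stopped.get()) {
            int next = current + 1; // stand-in for "improve result"
            if (next > current) {   // stand-in for "was it an improvement?"
                current = next;
                best.set(current);
            } else {
                return;             // no improvement: terminate on its own
            }
        }
    }

    public void stop() {
        stopped.set(true);
    }

    @Override
    public Integer get() {
        return best.get();          // intermediate (or final) result
    }
}

Run it on a plain Thread (or an executor), read get() whenever you want a progress snapshot, and call stop() when the external signal arrives.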