How to force parallel execution of BufferedReader.lines() with flatMap()? - java

I have some code that looks like this (simplified pseudo-code):
[...]
// stream constructed of series of web service calls
Stream<InputStream> slowExternalSources = StreamSupport.stream(spliterator, false);
[...]
then this
public Stream<String> getLines(Stream<InputStream> slowExternalSources) {
    return slowExternalSources.flatMap(is ->
            new BufferedReader(new InputStreamReader(is)).lines()
                    // close each source when its inner stream is closed
                    .onClose(() -> { try { is.close(); } catch (IOException e) { throw new UncheckedIOException(e); } }));
}
and later this
Stream<String> lineStream = getLines(slowExternalSources);
lineStream.parallel().forEach(line -> { /* ... do some fast CPU-intensive stuff here ... */ });
I've been struggling to make this code execute with some level of parallelisation.
Inspection in jps/jstack/jmc shows that all the InputStream reading is occurring in the main thread, and not paralleling at all.
Possible culprits:
BufferedReader.lines() uses a Spliterator with parallel=false to construct the stream (source: the JDK sources)
I think I read an article saying that flatMap does not interact well with parallel(), but I cannot locate it right now.
How can I fix this code so that it runs in parallel?
I would like to retain the Java 8 Streams if possible, to avoid rewriting existing code that expects a Stream.
NOTE I added java.util.concurrent to the tags because I suspect it might be part of the answer, even though it's not part of the question.
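For context (not part of the original question): a workaround that is sometimes suggested is to do the slow reading inside a parallel map() and only flatMap() the already-buffered results, so the I/O is not consumed sequentially through flatMap. A rough sketch, assuming it is acceptable to buffer each source's lines in memory and that the underlying spliterator can actually split the sources:
public Stream<String> getLinesBuffered(Stream<InputStream> slowExternalSources) {
    return slowExternalSources
            .parallel()
            .map(is -> {
                // read and buffer all lines of one source; runs on the common ForkJoinPool
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(is))) {
                    return reader.lines().collect(Collectors.toList());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            })
            .flatMap(List::stream);
}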

Related

Asynchronous File.copy in Java

Is there a way in Java to copy one file into another in an asynchronous way? Something similar to Stream.CopyToAsync in C# is what I'm trying to find.
What I'm trying to achieve is to download a series of ~40 files from the Internet, and this is the best I've come up with for each file:
CompletableFuture.allOf(myFiles.stream()
        .map(file -> CompletableFuture.runAsync(() -> syncDownloadFile(file)))
        .toArray(CompletableFuture[]::new))
    .thenRun(() -> doSomethingAfterAllDownloadsAreComplete());
Where syncDownloadFile is:
private void syncDownloadFile(MyFile file) {
try (InputStream is = file.mySourceUrl.openStream()) {
long actualSize = Files.copy(is, file.myDestinationNIOPath);
// size validation here
} catch (IOException e) {
throw new RuntimeException(e);
}
}
But that means I have some blocking calls inside of the task executors, and I'd like to avoid that so I don't block too many executors at once.
I'm not sure if the C# method internally does the same (I mean, something has to be downloading that file right?).
Is there a better way to accomplish this?
AsynchronousFileChannel (AFC for short) is the right way to manage files in Java with non-blocking IO. Unfortunately it does not provide a promise-based API (akin to Task in .NET) such as the CopyToAsync(Stream) of .NET.
The alternative RxIo library is built on top of the AFC and provides the AsyncFiles asynchronous API with different calling idioms: callback-based, CompletableFuture (equivalent to .NET's Task) and also reactive streams.
For instance, copying from one file to another asynchronously can be done through:
Path in = Paths.get("input.txt");
Path out = Paths.get("output.txt");
AsyncFiles
    .readAllBytes(in)
    .thenCompose(bytes -> AsyncFiles.writeBytes(out, bytes))
    .thenAccept(index -> { /* invoked on completion */ });
Note that continuations are invoked by a thread from the background AsynchronousChannelGroup.
Thus you may solve your problem using a non-blocking HTTP client with a CompletableFuture-based API, chained with the AsyncFiles use. For instance, AHC is a valid choice. See usage here: https://github.com/AsyncHttpClient/async-http-client#using-continuations
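For illustration only (not from the original answer): a rough sketch of chaining an AHC download with the AsyncFiles write shown above. The URL and destinationPath are placeholders, and AsyncFiles.writeBytes is assumed to behave as in the snippet above:
AsyncHttpClient client = Dsl.asyncHttpClient();        // org.asynchttpclient
client.prepareGet("https://example.com/some-file")     // placeholder URL
      .execute()
      .toCompletableFuture()
      .thenCompose(response ->
              // hand the downloaded bytes to the non-blocking file write
              AsyncFiles.writeBytes(destinationPath, response.getResponseBodyAsBytes()))
      .thenAccept(index -> { /* size validation / completion handling here */ });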

Can Spark Streaming do Anything Other Than Word Count?

I'm trying to get to grips with Spark Streaming but I'm having difficulty. Despite reading the documentation and analysing the examples, I want to do something more than a word count on a text file/stream/Kafka queue, which is all the examples seem to cover.
I wish to listen to an incoming Kafka message stream, group messages by key and then process them. The code below is a simplified version of the process: get the stream of messages from Kafka, reduce by key to group messages by message key, then process them.
JavaPairDStream<String, byte[]> groupByKeyList = kafkaStream.reduceByKey((bytes, bytes2) -> bytes);
groupByKeyList.foreachRDD(rdd -> {
List<MyThing> myThingsList = new ArrayList<>();
MyCalculationCode myCalc = new MyCalculationCode();
rdd.foreachPartition(partition -> {
while (partition.hasNext()) {
Tuple2<String, byte[]> keyAndMessage = partition.next();
MyThing aSingleMyThing = MyThing.parseFrom(keyAndMessage._2); //parse from protobuffer format
myThingsList.add(aSingleMyThing);
}
});
List<MyResult> results = myCalc.doTheStuff(myThingsList);
//other code here to write results to file
});
When debugging I see that inside the while (partition.hasNext()) loop, myThingsList has a different memory address than the List<MyThing> myThingsList declared in the outer foreachRDD.
When List<MyResult> results = myCalc.doTheStuff(myThingsList); is called, there are no results because myThingsList is a different instance of the List.
I'd like a solution to this problem, but would prefer a reference to documentation to help me understand why this is not working (as anticipated) and how I can solve it for myself (I don't mean a link to the single page of Spark documentation, but a section/paragraph or, better still, a link to JavaDoc that does not provide Scala examples with non-functional commented code).
The reason you're seeing different list addresses is that Spark doesn't execute foreachPartition locally on the driver; it has to serialize the function and send it over to the Executor handling the processing of the partition. You have to remember that although working with the code feels like everything runs in a single location, the calculation is actually distributed.
The first problem I see with your code has to do with your reduceByKey, which takes two byte arrays and returns the first. Is that really what you want to do? That means you're effectively dropping parts of the data; perhaps you're looking for combineByKey, which will allow you to return a JavaPairDStream<String, List<byte[]>>.
Regarding the parsing of your protobuf, it looks to me like you don't want foreachRDD; you need an additional map to parse the data:
kafkaStream
.combineByKey(/* implement logic */)
.flatMap(x -> x._2)
.map(proto -> MyThing.parseFrom(proto))
.map(myThing -> myCalc.doStuff(myThing))
.foreachRDD(/* After all the processing, do stuff with result */)
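Not part of the original answer: one way the /* implement logic */ placeholder could be filled in, assuming the goal is a JavaPairDStream<String, List<byte[]>> as suggested above. The Function/Function2 types are Spark's org.apache.spark.api.java.function interfaces, and the partition count is arbitrary:
Function<byte[], List<byte[]>> createCombiner = bytes -> {
    List<byte[]> list = new ArrayList<>();
    list.add(bytes);
    return list;
};
Function2<List<byte[]>, byte[], List<byte[]>> mergeValue = (list, bytes) -> {
    list.add(bytes);
    return list;
};
Function2<List<byte[]>, List<byte[]>, List<byte[]>> mergeCombiners = (left, right) -> {
    left.addAll(right);
    return left;
};
JavaPairDStream<String, List<byte[]>> grouped =
        kafkaStream.combineByKey(createCombiner, mergeValue, mergeCombiners,
                new HashPartitioner(4)); // org.apache.spark.HashPartitioner; 4 partitions chosen arbitrarily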

Reading stdout/stderr simultaneously from java.lang.Process with Java 8 CompletableFuture?

Suppose I have a java.lang.Process process object representing a sub-process I want to start from Java. I need to get both stdout and stderr output from the sub-process combined as a single String, and for the purpose of this question, I have chosen to store stdout first, followed by stderr. Based on my current understanding, I should be reading from them simultaneously. Sounds like a good task for CompletableFuture, I presume?
Hence, I have the following code snippets:
Getting the output
final CompletableFuture<String> output = fromStream(process.getInputStream()).thenCombine(
fromStream(process.getErrorStream()),
(stdout, stderr) -> Stream.concat(stdout, stderr).collect(Collectors.joining("\n")));
// to actually get the result, for example
System.out.println(output.get());
fromStream() helper method
public static CompletableFuture<Stream<String>> fromStream(final InputStream stream) {
return CompletableFuture.supplyAsync(() -> {
return new BufferedReader(new InputStreamReader(stream)).lines();
});
}
Is there a better/nicer Java-8-way of doing this task? I understand there are the redirectOutput() and redirectError() methods from ProcessBuilder, but I don't suppose I can use them to redirect to just a String?
As pointed out in the comments, I missed out on the redirectErrorStream(boolean) method that allows me to pipe stderr to stdout internally, so there's only one stream to deal with. In this case, using a CompletableFuture is completely overkill (pun unintended...?) and I'll probably be better off without it.
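Not from the original post: a minimal sketch of the redirectErrorStream(true) route mentioned above, with a placeholder command (exception handling omitted):
ProcessBuilder pb = new ProcessBuilder("some-command", "arg");
pb.redirectErrorStream(true);                 // merge stderr into stdout
Process process = pb.start();
String combined;
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
    combined = reader.lines().collect(Collectors.joining("\n"));
}
process.waitFor();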

How to chain Guava futures?

I'm trying to create a small service to accept file upload, unzip it and then delete the uploaded file. Those three steps should be chained as futures. I'm using Google Guava library.
Workflow is:
A future to download the file; when that completes, a future to unzip the file; when unzipping is done, a future to delete the original uploaded file.
But honestly, it isn't clear to me how I would chain the futures, or even how to create them in Guava's way. The documentation is simply terse and unclear. OK, there is a transform method, but no concrete example at all, and the chain method is deprecated.
I miss RxJava library.
Futures.transform is not fluently chainable like RxJava, but you can still use it to set up Futures that depend on one another. Here is a concrete example:
final ListeningExecutorService service = MoreExecutors.listeningDecorator(Executors.newCachedThreadPool());
final ListenableFuture<FileClass> fileFuture = service.submit(() -> fileDownloader.download());
final ListenableFuture<UnzippedFileClass> unzippedFileFuture = Futures.transform(fileFuture,
//need to cast this lambda
(Function<FileClass, UnzippedFileClass>) file -> fileUnzipper.unzip(file));
final ListenableFuture<Void> deletedFileFuture = Futures.transform(unzippedFileFuture,
(Function<UnzippedFileClass, Void>) unzippedFile -> fileDeleter.delete(unzippedFile));
deletedFileFuture.get(); //or however you want to wait for the result
This example assumes fileDownloader.download() returns an instance of FileClass, fileUnzipper.unzip() returns an UnzippedFileClass etc. If FileDownloader.download() instead returns a ListenableFuture<FileClass>, use AsyncFunction instead of Function.
This example also uses Java 8 lambdas for brevity. If you are not using Java 8, pass in anonymous implementations of Function or AsyncFunction instead:
Futures.transform(fileFuture, new AsyncFunction<FileClass, UnzippedFileClass>() {
    @Override
    public ListenableFuture<UnzippedFileClass> apply(final FileClass input) throws Exception {
        // wrap the synchronous result; with a truly asynchronous unzipper, return its future directly
        return Futures.immediateFuture(fileUnzipper.unzip(input));
    }
});
More info on transform here: http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/util/concurrent/Futures.html#transform (scroll or search for "transform" -- deep linking appears to be broken currently)
Guava extends the Future interface with ListenableFuture for this purpose.
Something like this should work:
Runnable downloader, unzipper;
ListeningExecutorService service = MoreExecutors.listeningDecorator(Executors.newCachedThreadPool());
service.submit(downloader).addListener(unzipper, service);
I would include deleting the file in the unzipper, since it is a near instantaneous action, and it would complicate the code to separate it.
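Not from the answer itself: a minimal sketch of that suggestion, with deletion folded into the unzip step; downloadFile and unzipAndDelete are hypothetical helpers:
ListeningExecutorService service =
        MoreExecutors.listeningDecorator(Executors.newCachedThreadPool());
Runnable downloader = () -> downloadFile();   // hypothetical download helper
Runnable unzipper = () -> unzipAndDelete();   // unzip, then delete the uploaded file
// submit(Runnable) returns a ListenableFuture; the listener runs once the download completes
service.submit(downloader).addListener(unzipper, service);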

Why does starting StreamingContext fail with “IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute”?

I'm trying to execute a Spark Streaming example with Twitter as the source as follows:
public static void main(String... args) {
SparkConf conf = new SparkConf().setAppName("Spark_Streaming_Twitter").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(2));
JavaSQLContext sqlCtx = new JavaSQLContext(sc);
String[] filters = new String[] {"soccer"};
JavaReceiverInputDStream<Status> receiverStream = TwitterUtils.createStream(jssc,filters);
jssc.start();
jssc.awaitTermination();
}
But I'm getting the following exception
Exception in thread "main" java.lang.AssertionError: assertion failed: No output streams registered, so nothing to execute
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:158)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:416)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:437)
at org.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:501)
at org.learning.spark.TwitterStreamSpark.main(TwitterStreamSpark.java:53)
Any suggestion how to fix this issue?
When an output operator is called, it triggers the computation of a stream.
Without an output operator on the DStream, no computation is invoked. Basically, you need to invoke one of the methods below on the stream:
print()
foreachRDD(func)
saveAsObjectFiles(prefix, [suffix])
saveAsTextFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations
You can also apply any transformations first and then use the output operations, if required.
Exception in thread "main" java.lang.AssertionError: assertion failed: No output streams registered, so nothing to execute
TL;DR Use one of the available output operators like print, saveAsTextFiles or foreachRDD (or less often used saveAsObjectFiles or saveAsHadoopFiles).
In other words, you have to use an output operator between the following lines in your code:
JavaReceiverInputDStream<Status> receiverStream = TwitterUtils.createStream(jssc,filters);
// --> The output operator here <--
jssc.start();
Quoting the Spark official documentation's Output Operations on DStreams (highlighting mine):
Output operations allow DStream's data to be pushed out to external systems like a database or a file systems. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
The point is that without an output operator you have "no output streams registered, so nothing to execute".
As one commenter has noticed, you have to use an output transformation, e.g. print or foreachRDD, before starting the StreamingContext.
Internally, whenever you use one of the available output operators, e.g. print or foreach, DStreamGraph is requested to add an output stream.
You can find the registration when a new ForEachDStream is created and registered afterwards (which is exactly to add it as an output stream).
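For illustration (not part of the answer above): with the code from the question, registering even a trivial output operation before start() is enough, e.g.:
JavaReceiverInputDStream<Status> receiverStream = TwitterUtils.createStream(jssc, filters);
// --> The output operator here <--
// e.g. print the text of each received tweet (any output operation would do):
receiverStream.map(Status::getText).print();
jssc.start();
jssc.awaitTermination();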
The context can also fail -misleadingly- with this error when the real cause is that the window/slide durations are not multiples of the streaming input's batch interval. In that case Spark only logs a warning about the durations: fix them and the context stops failing :D
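Not from the answer: a small illustration of the durations point, assuming a 2-second batch interval and a hypothetical JavaDStream<String> named lines:
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));
// window and slide durations should be multiples of the 2-second batch interval:
JavaDStream<String> windowed = lines.window(Durations.seconds(10), Durations.seconds(4));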
