Asynchronous File.copy in Java

Is there a way in Java to copy one file into another asynchronously? Something similar to Stream.CopyToAsync in C# is what I'm trying to find.
What I'm trying to achieve is to download a series of ~40 files from the Internet, and this is the best I've come up with for each file:
CompletableFuture.allOf(myFiles.stream()
        .map(file -> CompletableFuture.runAsync(() -> syncDownloadFile(file)))
        .toArray(CompletableFuture[]::new))
    .thenAccept(ignored -> doSomethingAfterAllDownloadsAreComplete());
Where syncDownloadFile is:
private void syncDownloadFile(MyFile file) {
    try (InputStream is = file.mySourceUrl.openStream()) {
        long actualSize = Files.copy(is, file.myDestinationNIOPath);
        // size validation here
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
But that means I have blocking calls inside the tasks submitted to the executor, and I'd like to avoid that so I don't block too many executor threads at once.
I'm not sure if the C# method internally does the same (I mean, something has to be downloading that file right?).
Is there a better way to accomplish this?

AsynchronousFileChannel (AFC for short) is the right way to manage files in Java with non-blocking IO. Unfortunately, it does not provide a promise-based API (the equivalent of Task in .NET), such as .NET's CopyToAsync(Stream).
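You can, however, bridge AFC's callback API to a CompletableFuture yourself. The following is only a minimal sketch of that idea (it adapts a single read; a real copy would loop until EOF and close the channel in both callbacks):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

// Adapts one AFC read into a CompletableFuture via a CompletionHandler.
static CompletableFuture<Integer> readAsync(Path path, ByteBuffer buffer, long position) {
    CompletableFuture<Integer> promise = new CompletableFuture<>();
    try {
        AsynchronousFileChannel channel = AsynchronousFileChannel.open(path, StandardOpenOption.READ);
        channel.read(buffer, position, null, new CompletionHandler<Integer, Void>() {
            @Override
            public void completed(Integer bytesRead, Void attachment) {
                promise.complete(bytesRead);
            }
            @Override
            public void failed(Throwable exc, Void attachment) {
                promise.completeExceptionally(exc);
            }
        });
    } catch (IOException e) {
        promise.completeExceptionally(e);
    }
    return promise;
}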
The alternative RxIo library is built on top of AFC and provides the AsyncFiles asynchronous API with different calling idioms: callback-based, CompletableFuture-based (the equivalent of a .NET Task), and reactive streams.
For instance, copying from one file to another asynchronously can be done through:
Path in = Paths.get("input.txt");
Path out = Paths.get("output.txt");
AsyncFiles
    .readAllBytes(in)
    .thenCompose(bytes -> AsyncFiles.writeBytes(out, bytes))
    .thenAccept(index -> { /* invoked on completion */ });
Note that continuations are invoked by a thread from the background AsynchronousChannelGroup.
Thus you may solve your problem using a non-blocking HTTP client with a CompletableFuture-based API, chained with AsyncFiles. For instance, AHC is a valid choice. See usage here: https://github.com/AsyncHttpClient/async-http-client#using-continuations
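For illustration only, a rough sketch of that combination (the URL and file name are placeholders, the AsyncFiles return value follows the RxIo example above, and a large download should stream the response body rather than buffer it in memory):

import static org.asynchttpclient.Dsl.asyncHttpClient;

import java.nio.file.Path;
import java.nio.file.Paths;
import org.asynchttpclient.AsyncHttpClient;
import org.javaync.io.AsyncFiles; // package name as published by the RxIo project

AsyncHttpClient client = asyncHttpClient();
Path destination = Paths.get("downloaded.bin"); // hypothetical destination

client.prepareGet("https://example.com/file.bin") // hypothetical URL
      .execute()
      .toCompletableFuture()
      .thenCompose(response -> AsyncFiles.writeBytes(destination, response.getResponseBodyAsBytes()))
      .thenAccept(bytesWritten -> System.out.println("download complete")); // per-file continuation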

Related

Java mono repeat call until collected results complete

I'm picking up Java/Reactor after moving over from C#. I'm well versed in the C# async-await approach to non-blocking calls and am struggling to adapt to Flux/Mono.
I'm implementing a solution where I need to make a call to ElasticSearch via the Java SDK, get results, apply additional filters to strip out ES results, and keep paging through ES until my final collection of results is complete.
The ES SDK doesn't support Reactor, but there are examples of Java adapter code that takes the ES callback and converts it to a Mono (I see a direct correlation to the C# async-await here, as this is a non-blocking call to ES). What I then struggle with is the next bit - I need to take the results from the ES Mono and filter them.
I do this by calling out to other external services to get additional data based on the results from the ES call, so I need to know the ids of each page of content from the ES Mono result before I can apply the filtering (effectively a kind of block), then apply the in-memory filters, and if I don't have enough content, go back to ES to get the next page... repeat until I have enough data or there are no more results from ES.
This appears to be very difficult to achieve compared to C# but I probably just don't understand the Java paradigm correctly.
My problem is that I can't use block() as this throws an error in Reactor 3.2, so I don't really know how to "wait" until the Mono calls to ES and the external services are complete before continuing. In C#, this would be as simple as calling an async method with an await to handle the implicit callbacks.
My blocking version (works in IntelliJ, fails when published via Maven and then run in a webserver) is effectively:
do {
    var sr = GetSearchRequest(xxxx);
    this.elasticsearch.results(sr)
        .map(r -> chunk.add(r))
        .block();
    if (chunk.size() == 0) {
        isComplete = true;
    } else {
        var filtered = postFilterResults(chunk);
        finalResults.add(filtered);
        if (finalResults.size() >= MAXIMUM_RESULTS) {
            isComplete = true;
        }
        esPage = esPage + 1;
    }
} while (isComplete == false);
If I try subscribe() or other non-blocking Reactor calls, then (obviously) the code skips over the "get ES" bit and hits the do-while, looping repeatedly until the callback from ES finally happens and the subscribed map is invoked.
I think I need to perform an "async block" for each ES call but I don't know how.
To answer my own question... The underlying issue, in my opinion, is that Flux/Mono simply is not like any existing programming style, in that it forces you to work within the fluent style that Reactor mandates. This is very similar to C# LINQ, but it's almost a "false friend", as even things like loops need to be expressed in Reactor operators.
In this case, the key issue to solve is paging, and keeping doing so within a loop. It is not at all obvious how to achieve this, because a subscription to a Flux "locks in" the original parameters, so repeating the subscription call simply gets the same page again. The solution is to use the Flux.defer method, which forces lazy building of the subscription on each repeated invocation. You then need AtomicIntegers to keep track of the page counter across the different calls. Again, this is something that C# handles for you, so it can catch a .NET developer out.
Something like:
// The response from the elasticsearch adapter is a Flux<T> but we do not want to filter
// results on a row-by-row basis as this incurs one call for each row to the DB/network
// (as appropriate). We choose to batch these up.
var result = new SearchResult();
var page = new AtomicInteger();
var chunkSize = new AtomicInteger();

// Use a defer so we recalculate the subscription to the search with the new page count
var results = Flux.defer(() -> elasticsearch.results(GetSearchRequest(request, lc, pf, page.get()))
        .doOnComplete(() -> {
            chunkSize.set(0);
            page.getAndAdd(1);
        })
        .collectList()
        .map(chunk -> {
            chunkSize.set(chunk.size());
            return chunk;
        })
        .map(chunk -> postFilterResults(request, chunk, pf))
        .map(filtered -> result.getDocuments().addAll(filtered)));

// Repeat the deferred flux (recalculating each time) until we have enough content
// or we don't get anything back from the search engine
return results
        .repeat()
        .takeUntil(r -> chunkSize.get() == 0 || result.getDocuments().size() >= this.elasticsearch.getMaximumSearchResults())
        .take(this.elasticsearch.getMaximumSearchResults())
        .collectList()
        .flatMap(r -> {
            result.setTotalHits(result.getDocuments().size());
            return Mono.just(result);
        });

Getting every file from directory in Java

I'm trying to write a method that, from a given directory, extracts every file (also in every subdirectory). I'm using Files.find for this. The problem is that whenever it finds a file that I can't access, it stops, but I want to continue the search and add the other files to the list.
This is my code
public static List<String> search(String dir) {
    List<String> listFiles = new ArrayList<>();
    try {
        Files.find(Paths.get(dir), Integer.MAX_VALUE, (filePath, fileAttr) -> fileAttr.isRegularFile())
             .forEach(file -> listFiles.add(file.toAbsolutePath().toString()));
    } catch (UncheckedIOException ue) {
        System.out.println("Can't access that directory");
    } catch (IOException e) {
        e.printStackTrace();
    }
    return listFiles;
}
How can I change it?
You're looking for the FileVisitor interface from Java's NIO package (java.nio.file, available since Java 7). It offers hooks such as preVisitDirectory and visitFileFailed to test directories before entering them and to handle errors, as well as FileVisitResult to control how the traversal proceeds.
Your specific problem would require creating some kind of list (e.g. outside of the FileVisitor) which you can then fill from inside the visit methods using Collection::add.
Sadly, Java's Stream API cannot handle checked exceptions on its own, so any attempt to solve your problem with Streams would require a lot of unnecessary work, considering that NIO offers the more verbose but far superior FileVisitor solution.
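A minimal sketch of that approach, reusing the signature from the question (the walk simply skips anything it cannot access):

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

public static List<String> search(String dir) throws IOException {
    List<String> listFiles = new ArrayList<>();
    Files.walkFileTree(Paths.get(dir), new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
            if (attrs.isRegularFile()) {
                listFiles.add(file.toAbsolutePath().toString());
            }
            return FileVisitResult.CONTINUE;
        }
        @Override
        public FileVisitResult visitFileFailed(Path file, IOException exc) {
            // Inaccessible file or directory: skip it and keep walking
            return FileVisitResult.CONTINUE;
        }
    });
    return listFiles;
}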

How to force parallel execution of BufferedReader.lines() with flatMap()?

I have some code that looks like this (simplified pseudo-code):
[...]
// stream constructed of series of web service calls
Stream<InputStream> slowExternalSources = StreamSupport.stream(spliterator, false);
[...]
then this
public Stream<String> getLines(Stream<InputStream> slowExternalSources) {
    return slowExternalSources.flatMap(is ->
            new BufferedReader(new InputStreamReader(is)).lines()
                    .onClose(() -> {
                        try { is.close(); } catch (IOException e) { throw new UncheckedIOException(e); }
                    }));
}
and later this
Stream<String> lineStream = getLines(slowExternalSources);
lineStream.parallel().forEach( /* ... do some fast CPU-intensive stuff here ... */ );
I've been struggling to make this code execute with some level of parallelisation.
Inspection with jps/jstack/jmc shows that all the InputStream reading is occurring in the main thread, with no parallelism at all.
Possible culprits:
BufferedReader.lines() uses a Spliterator with parallel=false to construct the stream (source: see Java sources)
I think I read some articles that said flatMap does not interact well with parallel(). I am not able to locate that article right now.
How can I fix this code so that it runs in parallel?
I would like to retain the Java8 Streams if possible, to avoid rewriting existing code that expects a Stream.
NOTE I added java.util.concurrent to the tags because I suspect it might be part of the answer, even though it's not part of the question.
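One possible workaround, offered here only as a sketch rather than a confirmed answer: make the outer stream parallel and drain each InputStream eagerly inside map(), so the slow IO can be spread across worker threads, then flatten the already-read lines. Whether it actually runs in parallel still depends on how well the source spliterator splits.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public Stream<String> getLines(Stream<InputStream> slowExternalSources) {
    return slowExternalSources
            .parallel()
            .map(is -> {
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(is))) {
                    return reader.lines().collect(Collectors.toList()); // slow IO happens here, per source
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            })
            .flatMap(List::stream);
}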

How to chain Guava futures?

I'm trying to create a small service to accept a file upload, unzip it and then delete the uploaded file. Those three steps should be chained as futures. I'm using the Google Guava library.
The workflow is:
A future to download the file; if that operation completes, then a future to unzip the file. If unzipping is done, a future to delete the original uploaded file.
But honestly, it isn't clear to me how I would chain the futures, or even how to create them in Guava's way. The documentation is simply terse and unclear. OK, there is the transform method, but no concrete example at all, and the chain method is deprecated.
I miss RxJava library.
Futures.transform is not fluently chainable like RxJava, but you can still use it to set up Futures that depend on one another. Here is a concrete example:
final ListeningExecutorService service = MoreExecutors.listeningDecorator(Executors.newCachedThreadPool());
final ListenableFuture<FileClass> fileFuture = service.submit(() -> fileDownloader.download());
final ListenableFuture<UnzippedFileClass> unzippedFileFuture = Futures.transform(fileFuture,
        // need to cast this lambda
        (Function<FileClass, UnzippedFileClass>) file -> fileUnzipper.unzip(file));
final ListenableFuture<Void> deletedFileFuture = Futures.transform(unzippedFileFuture,
        (Function<UnzippedFileClass, Void>) unzippedFile -> fileDeleter.delete(unzippedFile));
deletedFileFuture.get(); // or however you want to wait for the result
This example assumes fileDownloader.download() returns an instance of FileClass, fileUnzipper.unzip() returns an UnzippedFileClass, and so on. If a step is itself asynchronous (for example, if fileUnzipper.unzip() returns a ListenableFuture<UnzippedFileClass>), use AsyncFunction instead of Function.
This example also uses Java 8 lambdas for brevity. If you are not using Java 8, pass in anonymous implementations of Function or AsyncFunction instead:
Futures.transform(fileFuture, new AsyncFunction<FileClass, UnzippedFileClass>() {
    @Override
    public ListenableFuture<UnzippedFileClass> apply(final FileClass input) throws Exception {
        return fileUnzipper.unzip(input);
    }
});
More info on transform here: http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/util/concurrent/Futures.html#transform (scroll or search for "transform" -- deep linking appears to be broken currently)
Guava extends the Future interface with ListenableFuture for this purpose.
Something like this should work:
Runnable downloader, unzipper;
ListeningExecutorService service = MoreExecutors.listeningDecorator(Executors.newCachedThreadPool());
service.submit(downloader).addListener(unzipper, service);
I would include deleting the file in the unzipper, since it is a near instantaneous action, and it would complicate the code to separate it.

Executing dependent tasks in java

I need to find a way to execute mutually dependent tasks.
First task has to download a zip file from remote server.
Second tasks goal is to unzip the file downloaded by the first task.
Third task has to process files extracted from zip.
So, the third task depends on the second, and the second on the first.
Naturally, if one of the tasks fails, the others shouldn't be executed. Since the first task downloads files from a remote server, there should be a mechanism for restarting the task if the server is not available.
Tasks have to be executed daily.
Any suggestions, patterns or Java APIs?
Regards!
It seems that you do not need to divide them into tasks; just do it like this:
process(unzip(download(uri)));
It depends a bit on external requirements. Is there any user involvement? Monitoring? Alerting?...
The simplest would obviously be just methods that check if the previous has done what it should.
download() downloads file to specified place.
unzip() extracts the file to a specified place if the downloaded file is in place.
process() processes the data if it has been extracted.
A more "formal" way of doing it would be to use a workflow engine. Depending on requirements, you can get some that do everything from fancy UIs, to some that follow formal standardised .XML-definitions of the workflow - and any in between.
http://java-source.net/open-source/workflow-engines
Create one public method to execute the full chain and private methods for the tasks:
public void doIt() {
    if (download() == false) {
        // download failed
    } else if (unzip() == false) {
        // unzip failed
    } else if (process() == false) {
        // processing failed
    }
}

private boolean download() { /* ... */ }
private boolean unzip() { /* ... */ }
private boolean process() { /* ... */ }
So you have an API that guarantees that all steps are executed in the correct sequence and that a step is only executed if certain conditions are met (the above example just illustrates this pattern).
For daily execution you can use the Quartz Framework.
As the tasks depend on each other, I would recommend evaluating the error codes or exceptions the tasks return, and only continuing if the previous task was successful.
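A rough Quartz 2.x sketch of the daily scheduling mentioned above (job and trigger names are made up here; the job body would call the download/unzip/process chain):

import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class DailyChainJob implements Job {
    @Override
    public void execute(JobExecutionContext context) {
        // call the download/unzip/process chain here
    }
}

public class DailySchedulerSetup {
    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = JobBuilder.newJob(DailyChainJob.class).withIdentity("dailyChain").build();
        Trigger trigger = TriggerBuilder.newTrigger()
                .withSchedule(CronScheduleBuilder.dailyAtHourAndMinute(3, 0)) // every day at 03:00
                .build();
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}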
The normal way to perform these tasks is to call each task in order and throw an exception when there is a failure which prevents the following tasks from being performed. Something like:
try {
    download();
    unzip();
    process();
} catch (Exception failed) {
    failed.printStackTrace();
}
I think what you are interested in is some kind of transaction definition.
I.e.
- Define TaskA (e.g. download)
- Define TaskB (e.g. unzip)
- Define TaskC (e.g. process)
Assuming that your intention is to also have tasks working independently, e.g. only downloading a file (without also executing TaskB and TaskC), you should define Transaction1 composed of TaskA, TaskB and TaskC, or Transaction2 composed of only TaskA.
The semantics, e.g. that for Transaction1 TaskA, TaskB and TaskC should be executed sequentially and all-or-none, can be captured in your transaction definitions.
The definitions can live in XML configuration files, and you can use a framework, e.g. Quartz, for scheduling.
A higher-level construct then checks the transactions and executes them as defined, as sketched below.
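As a rough illustration of that idea (the names below are invented for this sketch, not taken from a specific framework), a "transaction" can be modelled as an ordered list of tasks where the first failure aborts the rest:

import java.util.List;

interface Task {
    void run() throws Exception;
}

class Transaction {
    private final List<Task> tasks;

    Transaction(List<Task> tasks) {
        this.tasks = tasks;
    }

    // Sequential and all-or-none in the sense that the first failure aborts the remaining tasks
    void execute() throws Exception {
        for (Task task : tasks) {
            task.run();
        }
    }
}

// Transaction1 = TaskA + TaskB + TaskC; Transaction2 = TaskA only
// new Transaction(List.of(taskA, taskB, taskC)).execute();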
Dependent task execution made easy with Dexecutor
Disclaimer: I am the owner of the library
Basically you need the following pattern, using the Dexecutor.addDependency method:
DefaultDexecutor<Integer, Integer> executor = newTaskExecutor();
//Building
executor.addDependency(1, 2);
executor.addDependency(2, 3);
executor.addDependency(3, 4);
executor.addDependency(4, 5);
//Execution
executor.execute(ExecutionConfig.TERMINATING);
