Java - processing documents in parallel

I have (say) 5 documents and I need to do some processing on each of them. Processing here includes opening the document/file, reading the data, and doing some document manipulation (editing text etc.). For the document manipulation I will probably be using docx4j or Apache POI. But my use case is this - I want to somehow process these 4-5 documents in parallel, utilizing the multiple cores available to me on my CPU. The processing on each document is independent of the others.
What would be the best way to achieve this parallel processing in Java? I have used ExecutorService and the Thread class before, but I don't have much idea about the newer concepts like Streams or RxJava. Can this task be achieved by using parallel streams as introduced in Java 8? What would be better to use - Executors, Streams, the Thread class, etc.? If streams can be used, please provide a link to a tutorial on how to do that. Thanks for your help!

You can process the files in parallel using Java streams with the following pattern:
List<File> files = ...
files.parallelStream().forEach(f -> process(f));
or
File[] files = dir.listFiles();
Stream.of(files).parallel().forEach(f -> process(f));
Note: process cannot throw a checked exception in this example. I suggest you either catch and log it inside process, or return a result object.
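For illustration, a minimal sketch of that suggestion (Result, its factory methods, and LOGGER are hypothetical placeholders, not part of the original answer): process catches the checked exception itself, so the lambda passed to forEach or map compiles.
static Result process(File f) {
    try (InputStream in = new FileInputStream(f)) {
        // open the document, read it, and manipulate it (e.g. with docx4j or Apache POI)
        return Result.success(f);
    } catch (IOException e) {
        LOGGER.warn("failed to process {}", f, e); // or carry the exception inside the Result
        return Result.failure(f, e);
    }
}
You can then switch forEach for map and collect the outcomes, e.g. files.parallelStream().map(YourClass::process).collect(Collectors.toList()), to see which documents failed.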

If you want to learn about ReactiveX, I would recommend using RxJava's Observable.zip: http://reactivex.io/documentation/operators/zip.html
It lets you run multiple processes in parallel; here is an example:
public class ObservableZip {
    private Scheduler scheduler;
    private Scheduler scheduler1;
    private Scheduler scheduler2;

    @Test
    public void testAsyncZip() {
        scheduler = Schedulers.newThread();  //Thread to open and read 1 file
        scheduler1 = Schedulers.newThread(); //Thread to open and read 1 file
        scheduler2 = Schedulers.newThread(); //Thread to open and read 1 file
        Observable.zip(obAsyncString(file1), obAsyncString1(file2), obAsyncString2(file3),
                (s, s2, s3) -> s.concat(s2).concat(s3))
                .subscribe(result -> showResult("All files in one:", result));
    }

    public void showResult(String transactionType, String result) {
        System.out.println(result + " " + transactionType);
    }

    public Observable<String> obAsyncString(File file) {
        return Observable.just(file)
                .observeOn(scheduler)
                .map(f -> {
                    //Here you read your file and return its contents
                    return f.getAbsolutePath(); // placeholder for the real read
                });
    }

    public Observable<String> obAsyncString1(File file) {
        return Observable.just(file)
                .observeOn(scheduler1)
                .map(f -> {
                    //Here you read your file 2 and return its contents
                    return f.getAbsolutePath(); // placeholder for the real read
                });
    }

    public Observable<String> obAsyncString2(File file) {
        return Observable.just(file)
                .observeOn(scheduler2)
                .map(f -> {
                    //Here you read your file 3 and return its contents
                    return f.getAbsolutePath(); // placeholder for the real read
                });
    }
}
Like I said, this is only worthwhile if you want to learn ReactiveX; otherwise, adding the framework to your stack just to solve this issue would be a little overkill, and I would much rather use the previous parallel-stream solution.

Related

How to copy a large number of files from one S3 folder to another

I'm trying to move a large number of files (around 300 KB max each) from one S3 folder to another.
I'm using the AWS SDK for Java, and tried to move around 1,500 files.
It took too much time, and the number of files may increase to 10,000.
For each file, a copy and then a delete from the source folder is needed, as there is no method to move a file.
This is what I tried:
public void moveFiles(String fromKey, String toKey) {
Stream<S3ObjectSummary> objectSummeriesStream = this.getObjectSummeries(fromKey);
objectSummeriesStream.forEach(file ->
{
this.s3Bean.copyObject(bucketName, file.getKey(), bucketName, toKey);
this.s3Bean.deleteObject(bucketName, file.getKey());
});
}
private Stream<S3ObjectSummary> getObjectSummeries(String key) {
// get the files that their prefix is "key" (can be consider as Folders).
ListObjectsRequest listObjectsRequest = new ListObjectsRequest().withBucketName(this.bucketName)
.withPrefix(key);
ObjectListing outFilesList = this.s3Bean.listObjects(listObjectsRequest);
return outFilesList.getObjectSummaries()
.stream()
.filter(x -> !x.getKey()
.equals(key));
}
If you are using a Java application, you can try to use several threads to copy the files:
private ExecutorService executorService = Executors.newFixedThreadPool(20);

public void moveFiles(String fromKey, String toKey) {
    Stream<S3ObjectSummary> objectSummeriesStream = this.getObjectSummeries(fromKey);
    objectSummeriesStream.forEach(file ->
        executorService.submit(() -> {
            this.s3Bean.copyObject(bucketName, file.getKey(), bucketName, toKey);
            this.s3Bean.deleteObject(bucketName, file.getKey());
        })
    );
}
This should speed up the process.
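One caveat I would add (not from the original answer): moveFiles returns as soon as the tasks are submitted. If the caller needs to know when the move has finished, shut the executor down and wait for it, for example:
executorService.shutdown();
try {
    if (!executorService.awaitTermination(30, TimeUnit.MINUTES)) { // the timeout is an arbitrary assumption
        executorService.shutdownNow();
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    executorService.shutdownNow();
}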
An alternative might be using AWS Lambda. Once a file appears in the source bucket you can, for example, put an event in an SQS FIFO queue. The Lambda will copy a single file in response to this event. If I am not mistaken, you can run up to 500 Lambda instances in parallel. It should be fast.
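A rough sketch of what such a Lambda handler could look like, assuming each SQS message body carries the key of one source file (the bucket name and target prefix below are made-up placeholders, not from the original answer):
public class MoveFileHandler implements RequestHandler<SQSEvent, Void> {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private static final String BUCKET = "my-bucket";    // placeholder
    private static final String TO_PREFIX = "archive/";  // placeholder

    @Override
    public Void handleRequest(SQSEvent event, Context context) {
        for (SQSEvent.SQSMessage msg : event.getRecords()) {
            String sourceKey = msg.getBody();             // one file key per message
            s3.copyObject(BUCKET, sourceKey, BUCKET, TO_PREFIX + sourceKey);
            s3.deleteObject(BUCKET, sourceKey);
        }
        return null;
    }
}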

Using Reactor's Flux.buffer to batch work only works for a single item

I'm trying to use Flux.buffer() to batch up loads from a database.
The use case is that loading records from a DB may be 'bursty', and I'd like to introduce a small buffer to group together loads where possible.
My conceptual approach has been to use some form of processor, publish to its sink, let that buffer, and then subscribe and filter for the result I want.
I've tried multiple different approaches (different types of processors, creating the filtered Mono in different ways).
Below is where I've gotten so far, largely by stumbling.
Currently, this returns a single result, but subsequent calls are dropped (though I'm unsure where).
class BatchLoadingRepository {
// I've tried all manner of different processors here. I'm unsure if
// TopicProcessor is the correct one to use.
private val bufferPublisher = TopicProcessor.create<String>()
private val resultsStream = bufferPublisher
.bufferTimeout(50, Duration.ofMillis(50))
// I'm unsure if concatMapIterable is the correct operator here,
// but it seems to work.
// I'm really trying to turn the List<MyEntity>
// into a stream of MyEntity, published on the Flux<>
.concatMapIterable { requestedIds ->
// this is a Spring Data repository. It returns List<MyEntity>
repository.findAllById(requestedIds)
}
// Multiple callers will invoke this method, and then subscribe to receive
// their entity back.
fun findByIdAsync(id: String): Mono<MyEntity> {
// Is there a potential race condition here, caused by a result
// on the resultsStream, before I've subscribed?
return Mono.create<MyEntity> { sink ->
bufferPublisher.sink().next(id)
resultsStream.filter { it.id == id }
.subscribe { next ->
sink.success(next)
}
}
}
}
Hi, I was testing your code and I think the best way is to use a shared EmitterProcessor. I did a test with EmitterProcessor and it seems to work.
Flux<String> fluxi;
EmitterProcessor<String> emitterProcessor;

@Override
public void run(String... args) throws Exception {
    emitterProcessor = EmitterProcessor.create();
    fluxi = emitterProcessor.share()
            .bufferTimeout(500, Duration.ofMillis(500))
            .concatMapIterable(o -> o);
    Flux.range(0, 1000)
            .flatMap(integer -> findByIdAsync(integer.toString()))
            .map(s -> {
                System.out.println(s);
                return s;
            }).subscribe();
}

private Mono<String> findByIdAsync(String id) {
    return Mono.create(monoSink -> {
        fluxi.filter(s -> s.equals(id)).subscribe(value -> monoSink.success(value));
        emitterProcessor.onNext(id);
    });
}

Dataflow Splittable ReadFn not using multiple workers

I have a particularly simple Dataflow pipeline in which I want to read a file and output its parsed records as Avro. This works in most cases, except when the source file is particularly large (20+ GB), which causes OOMs even on machines with particularly large memory. I am pretty sure this happens because the non-splittable source is read in its entirety by Beam, so I implemented a splittable DoFn<FileIO.ReadableFile, GenericRecord>.
This functionally works in that the pipeline now succeeds, which seems to validate my assumption that the single large batch from a non-splittable file is the cause. However, this does not seem to spread the work across multiple workers. I tried the following:
Disabled autoscaling (autoscalingAlgorithm=NONE) and set numWorkers to 10. This had the same throughput as numWorkers 1
Left autoscaling on with a high maxWorkers. This went briefly up to 2, and then came back down to 1
Added a shuffle (Reshuffle.viaRandomKey) after the DoFn, but before the Avro write
Any ideas? The exact code is difficult to share because of company policy, but overall it is pretty simple. I implemented the following:
public class SplittableReadFn extends DoFn<FileIO.ReadableFile, GenericRecord> {
    // ...
    @ProcessElement
    public void process(final ProcessContext c, final OffsetRangeTracker tracker) {
        final FileIO.ReadableFile file = c.element();
        // Followed by something like
        ReadableByteStream in = file.open();
        in.seek(tracker.from());
        Parser parser = new Parser(in);
        while (parser.next()) {
            if (parser.getOffset() > tracker.to()) {
                break;
            }
            tracker.tryClaim(parser.getOffset());
            c.output(parser.item());
        }
        tracker.markDone();
    }

    @GetInitialRestriction
    public OffsetRange getInitialRestriction(final FileIO.ReadableFile file) {
        return new OffsetRange(0, getSize(file) - 1);
    }

    @SplitRestriction
    public void splitRestriction(final FileIO.ReadableFile file, final OffsetRange restriction, final DoFn.OutputReceiver<OffsetRange> receiver) {
        // chunkRange for test purposes just breaks into at most 500MB chunks
        for (final OffsetRange chunk : chunkRange(restriction)) {
            receiver.output(chunk);
        }
    }
}
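For context, a sketch of how such a DoFn might be wired into the pipeline, with the Reshuffle mentioned above sitting between the splittable read and the Avro write (the file pattern, schema, options and output path are placeholders I made up):
Pipeline p = Pipeline.create(options); // options: your PipelineOptions
p.apply(FileIO.match().filepattern("gs://my-bucket/input/*.dat")) // placeholder pattern
 .apply(FileIO.readMatches())
 .apply(ParDo.of(new SplittableReadFn()))
 .apply(Reshuffle.viaRandomKey()) // breaks fusion so the splits can be rebalanced across workers
 .apply(AvroIO.writeGenericRecords(schema).to("gs://my-bucket/output/records")); // placeholder schema and path
p.run();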

How to handle split Streams functionally

Given the following code, how can I simplify it to a single, functional line?
// DELETE CSV TEMP FILES
final Map<Boolean, List<File>> deleteResults = Stream.of(tmpDir.listFiles())
.filter(tempFile -> tempFile.getName().endsWith(".csv"))
.collect(Collectors.partitioningBy(File::delete));
// LOG SUCCESSES AND FAILURES
deleteResults.entrySet().forEach(entry -> {
if (entry.getKey() && !entry.getValue().isEmpty()) {
LOGGER.debug("deleted temporary files, {}",
entry.getValue().stream().map(File::getAbsolutePath).collect(Collectors.joining(",")));
} else if (!entry.getValue().isEmpty()) {
LOGGER.debug("failed to delete temporary files, {}",
entry.getValue().stream().map(File::getAbsolutePath).collect(Collectors.joining(",")));
}
});
This is a common pattern I run into, where I have a stream of things, and I want to filter this stream, creating two streams based off that filter, where I can then do one thing to Stream A and another thing to Stream B. Is this an anti-pattern, or is it supported somehow?
If you particularly don't want the explicit variable referencing the interim map then you can just chain the operations:
.collect(Collectors.partitioningBy(File::delete))
.forEach((del, files) -> {
    if (del) {
        LOGGER.debug(... files.stream()...);
    } else {
        LOGGER.debug(... files.stream()...);
    }
});
If you want to log all files of either category together, there is no way around collecting them into a data structure until all elements are known. Still, you can simplify your code:
Stream.of(tmpDir.listFiles())
.filter(tempFile -> tempFile.getName().endsWith(".csv"))
.collect(Collectors.partitioningBy(File::delete,
Collectors.mapping(File::getAbsolutePath, Collectors.joining(","))))
.forEach((success, files) -> {
if (!files.isEmpty()) {
LOGGER.debug(success? "deleted temporary files, {}":
"failed to delete temporary files, {}",
files);
}
});
This doesn’t collect the files into a List but directly into the intended String for the subsequent logging action. The logging action is also identical for both cases and only differs in the message.
Still, the most interesting thing is why deleting a file failed, which a boolean doesn’t tell. Since Java 7, the nio package provides a better alternative:
Create a helper method:
public static String deleteWithReason(Path p) {
String problem;
IOException ioEx;
try {
Files.delete(p);
return "";
}
catch(FileSystemException ex) {
problem = ex.getReason();
ioEx = ex;
}
catch(IOException ex) {
ioEx = ex;
problem = null;
}
return problem!=null? problem.replaceAll("\\.?\\R", ""): ioEx.getClass().getName();
}
and use it like
Files.list(tmpDir.toPath())
.filter(tempFile -> tempFile.getFileName().toString().endsWith(".csv"))
.collect(Collectors.groupingBy(YourClass::deleteWithReason,
Collectors.mapping(p -> p.toAbsolutePath().toString(), Collectors.joining(","))))
.forEach((failure, files) ->
LOGGER.debug(failure.isEmpty()? "deleted temporary files, {}":
"failed to delete temporary files, "+failure+ ", {}",
files)
);
The disadvantage, if you want to call it that, is that it does not produce a single log entry for all failed files if they have different failure reasons. But that’s obviously unavoidable if you want to log them together with the reason why they couldn’t be deleted.
Note that if you want to exclude “being deleted by someone else concurrently” from the failures, you can simply use Files.deleteIfExists(p) instead of Files.delete(p) and being already deleted will be treated as success.

RxJava Combine Sequence Of Requests

The Problem
I have two APIs. Api 1 gives me a list of items and Api 2 gives me more detailed information for each of the items I got from Api 1. The way I have solved it so far results in bad performance.
The Question
What is an efficient and fast way to solve this problem with the help of Retrofit and RxJava?
My Approach
At the moment my solution looks like this:
Step 1: Retrofit executes Single<ArrayList<Information>> from Api 1.
Step 2: I iterate through these items and make a request for each one to Api 2.
Step 3: Retrofit sequentially executes Single<ExtendedInformation> for each item.
Step 4: After all calls to Api 2 have completed, I create a new object for each item, combining the Information and the ExtendedInformation.
My Code
public void addExtendedInformations(final Information[] informations) {
final ArrayList<InformationDetail> informationDetailArrayList = new ArrayList<>();
final JSONRequestRatingHelper.RatingRequestListener ratingRequestListener = new JSONRequestRatingHelper.RatingRequestListener() {
@Override
public void onDownloadFinished(Information baseInformation, ExtendedInformation extendedInformation) {
informationDetailArrayList.add(new InformationDetail(baseInformation, extendedInformation));
if (informationDetailArrayList.size() >= informations.length){
listener.onAllExtendedInformationLoadedAndCombined(informationDetailArrayList);
}
}
};
for (Information information : informations) {
getExtendedInformation(ratingRequestListener, information);
}
}
public void getRatingsByTitle(final JSONRequestRatingHelper.RatingRequestListener ratingRequestListener, final Information information) {
Single<ExtendedInformation> repos = service.findForTitle(information.title);
disposable.add(repos.subscribeOn(Schedulers.io()).observeOn(AndroidSchedulers.mainThread()).subscribeWith(new DisposableSingleObserver<ExtendedInformation>() {
@Override
public void onSuccess(ExtendedInformation extendedInformation) {
ratingRequestListener.onDownloadFinished(information, extendedInformation);
}
@Override
public void onError(Throwable e) {
ExtendedInformation extendedInformation = new ExtendedInformation();
ratingRequestListener.onDownloadFinished(information, extendedInformation);
}
}));
}
public interface RatingRequestListener {
void onDownloadFinished(Information information, ExtendedInformation extendedInformation);
}
tl;dr: use concatMapEager or flatMap and execute the sub-calls asynchronously or on a scheduler.
long story
I'm not an Android developer, so my answer will be limited to pure RxJava (version 1 and version 2).
If I get the picture right, the needed flow is:
some query param
 \--> Execute query on API_1 -> list of items
       |-> Execute query for item 1 on API_2 -> extended info of item 1
       |-> Execute query for item 2 on API_2 -> extended info of item 2
       |-> Execute query for item 3 on API_2 -> extended info of item 3
       ...
       \-> Execute query for item n on API_2 -> extended info of item n
       \------------------------------------------------------------------/
                           |
                           \--> stream (or list) of extended item info for the query param
Assuming Retrofit generated the clients for
interface Api1 {
    @GET("/api1") Observable<List<Item>> items(@Query("param") String param);
}
interface Api2 {
    @GET("/api2/{item_id}") Observable<ItemExtended> extendedInfo(@Path("item_id") String item_id);
}
If the order of the items is not important, then it is possible to use flatMap only:
api1.items(queryParam)
.flatMap(itemList -> Observable.fromIterable(itemList))
.flatMap(item -> api2.extendedInfo(item.id()))
.subscribe(...)
But only if the retrofit builder is configured with
Either with the async adapter (calls will be queued in the okhttp internal executor). I personally think this is not a good idea, because you don't have control over this executor.
.addCallAdapterFactory(RxJava2CallAdapterFactory.createAsync())
Or with the scheduler-based adapter (calls will be scheduled on the RxJava scheduler). It would be my preferred option, because you explicitly choose which scheduler is used; it will most likely be the IO scheduler, but you are free to try a different one.
.addCallAdapterFactory(RxJava2CallAdapterFactory.createWithScheduler(Schedulers.io()))
The reason is that flatMap will subscribe to each observable created by api2.extendedInfo(...) and merge them in the resulting observable. So results will appear in the order they are received.
If the retrofit client is not set to be async or set to run on a scheduler, it is possible to set one:
api1.items(queryParam)
.flatMap(itemList -> Observable.fromIterable(itemList))
.flatMap(item -> api2.extendedInfo(item.id()).subscribeOn(Schedulers.io()))
.subscribe(...)
This structure is almost identical to the previous one, except that it indicates locally on which scheduler each api2.extendedInfo is supposed to run.
It is possible to tune the maxConcurrency parameter of flatMap to control how many requests you want to perform at the same time. Although I'd be cautious on this one; you don't want to run all queries at the same time. Usually the default maxConcurrency is good enough (128).
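For illustration, the same pipeline with an explicit limit (the value 8 is an arbitrary choice of mine):
api1.items(queryParam)
.flatMap(itemList -> Observable.fromIterable(itemList))
.flatMap(item -> api2.extendedInfo(item.id()).subscribeOn(Schedulers.io()),
        8) // maxConcurrency: at most 8 api2 calls in flight at a time
.subscribe(...)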
Now, if the order of the original query matters: concatMap is usually the operator that does the same thing as flatMap but in order; however, it is sequential, which turns out to be slow if the code needs to wait for all sub-queries to be performed. The solution is one step further: concatMapEager, which subscribes to the observables in order and buffers the results as needed.
Assuming the retrofit clients are async or run on a specific scheduler:
api1.items(queryParam)
.flatMap(itemList -> Observable.fromIterable(itemList))
.concatMapEager(item -> api2.extendedInfo(item.id()))
.subscribe(...)
Or if the scheduler has to be set locally:
api1.items(queryParam)
.flatMap(itemList -> Observable.fromIterable(itemList))
.concatMapEager(item -> api2.extendedInfo(item.id()).subscribeOn(Schedulers.io()))
.subscribe(...)
It is also possible to tune the concurrency in this operator.
Additionally, if the API returns a Flowable, it is possible to use .parallel, which is still in beta at this time in RxJava 2.1.7. But then the results are not in order and I don't know a way (yet?) to order them without sorting afterwards.
api.items(queryParam) // Flowable<Item>
.parallel(10)
.runOn(Schedulers.io())
.flatMap(item -> api2.extendedInfo(item.id()))
.sequential(); // Flowable<ItemExtended>
The flatMap operator is designed to cater to these types of workflows.
I'll outline the broad strokes with a simple five-step example; hopefully you can easily reconstruct the same principles in your code:
@Test fun flatMapExample() {
// (1) constructing a fake stream that emits a list of values
Observable.just(listOf(1, 2, 3, 4, 5))
// (2) convert our List emission into a stream of its constituent values
.flatMap { numbers -> Observable.fromIterable(numbers) }
// (3) subsequently convert each individual value emission into an Observable of some
// newly calculated type
.flatMap { number ->
when(number) {
1 -> Observable.just("A1")
2 -> Observable.just("B2")
3 -> Observable.just("C3")
4 -> Observable.just("D4")
5 -> Observable.just("E5")
else -> throw RuntimeException("Unexpected value for number [$number]")
}
}
// (4) collect all the final emissions into a list
.toList()
.subscribeBy(
onSuccess = {
// (5) handle all the combined results (in list form) here
println("## onNext($it)")
},
onError = { error ->
println("## onError(${error.message})")
}
)
}
(Incidentally, if the order of the emissions matters, look at using concatMap instead.)
I hope that helps.
Check below; it's working.
Say you have multiple network calls you need to make, for example calls to get GitHub user information and GitHub user events.
And you want to wait for each to return before updating the UI. RxJava can help you here.
Let’s first define our Retrofit object to access GitHub’s API, then set up two observables for the two network request calls.
Retrofit repo = new Retrofit.Builder()
.baseUrl("https://api.github.com")
.addConverterFactory(GsonConverterFactory.create())
.addCallAdapterFactory(RxJavaCallAdapterFactory.create())
.build();
Observable<JsonObject> userObservable = repo
.create(GitHubUser.class)
.getUser(loginName)
.subscribeOn(Schedulers.newThread())
.observeOn(AndroidSchedulers.mainThread());
Observable<JsonArray> eventsObservable = repo
.create(GitHubEvents.class)
.listEvents(loginName)
.subscribeOn(Schedulers.newThread())
.observeOn(AndroidSchedulers.mainThread());
The interfaces used for it look like this:
public interface GitHubUser {
    @GET("users/{user}")
    Observable<JsonObject> getUser(@Path("user") String user);
}
public interface GitHubEvents {
    @GET("users/{user}/events")
    Observable<JsonArray> listEvents(@Path("user") String user);
}
Then we use RxJava’s zip method to combine our two Observables and wait for them to complete before creating a new Observable:
Observable<UserAndEvents> combined = Observable.zip(userObservable, eventsObservable, new Func2<JsonObject, JsonArray, UserAndEvents>() {
@Override
public UserAndEvents call(JsonObject jsonObject, JsonArray jsonElements) {
return new UserAndEvents(jsonObject, jsonElements);
}
});
Finally let’s call the subscribe method on our new combined Observable:
combined.subscribe(new Subscriber<UserAndEvents>() {
...
@Override
public void onNext(UserAndEvents o) {
// You can access the results of the
// two observables via the POJO now
}
});
No more waiting in threads etc for network calls to finish. RxJava has done all that for you in zip().
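Side note (my addition, not part of the original answer): with Java 8 lambdas the same zip and subscribe can be written more compactly, since Func2 and the subscribe callbacks are functional interfaces:
Observable<UserAndEvents> combined = Observable.zip(userObservable, eventsObservable,
        (jsonObject, jsonElements) -> new UserAndEvents(jsonObject, jsonElements));
combined.subscribe(
        userAndEvents -> { /* both results are available here */ },
        throwable -> { /* handle the error */ });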
I hope my answer helps you.
I solved a similar problem with RxJava2. Execution of requests for Api 2 in parallel slightly speeds up the work.
private InformationRepository informationRepository;
//init....
public Single<List<FullInformation>> getFullInformation() {
return informationRepository.getInformationList()
.subscribeOn(Schedulers.io())//I usually write subscribeOn() in the repository, here - for clarity
.flatMapObservable(Observable::fromIterable)
.flatMapSingle(this::getFullInformation)
.collect(ArrayList::new, List::add);
}
private Single<FullInformation> getFullInformation(Information information) {
return informationRepository.getExtendedInformation(information)
.map(extendedInformation -> new FullInformation(information, extendedInformation))
.subscribeOn(Schedulers.io());//execute requests in parallel
}
InformationRepository is just an interface; its implementation is not interesting for us.
public interface InformationRepository {
Single<List<Information>> getInformationList();
Single<ExtendedInformation> getExtendedInformation(Information information);
}
FullInformation is a container for the result.
public class FullInformation {
private Information information;
private ExtendedInformation extendedInformation;
public FullInformation(Information information, ExtendedInformation extendedInformation) {
this.information = information;
this.extendedInformation = extendedInformation;
}
}
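A hypothetical usage example of the repository above (the CompositeDisposable, main-thread scheduler and callback names are my assumptions, not from the original answer):
disposable.add(getFullInformation()
        .observeOn(AndroidSchedulers.mainThread()) // assumption: results consumed on the UI thread
        .subscribe(
                fullInformations -> showResults(fullInformations), // showResults is a placeholder
                throwable -> showError(throwable)));               // showError is a placeholder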
Try using the Observable.zip() operator. It will wait until both API calls are finished before continuing the stream. Then you can insert some logic by calling flatMap() afterwards.
http://reactivex.io/documentation/operators/zip.html
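A minimal sketch of that idea (the two source observables, combine and doMoreWork are placeholders, not names from the question):
Observable.zip(firstApiCall, secondApiCall,
        (listFromApi1, detailFromApi2) -> combine(listFromApi1, detailFromApi2)) // combine is hypothetical
    .flatMap(combined -> doMoreWork(combined)) // hypothetical follow-up logic, must return an Observable
    .subscribe(result -> { /* use the combined result */ },
               throwable -> { /* handle the error */ });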
