While writing a data synchronization job with RxJava I discovered a strange behavior that I cannot explain. I'm quite new to RxJava and would appreciate some help.
Briefly, my job is quite simple: I have a list of element IDs; I call a web service to get each element by ID, do some processing, and make multiple calls to push the data to the DB.
Data loading is faster than data storing, so I ran into OutOfMemory errors.
My code looks pretty much like the "failing" test below, but while experimenting I realized that removing the line
flatMap(dt -> Observable.just(dt))
makes it work.
The failing test's output clearly shows unconsumed items stacking up, which leads to OutOfMemory. The working test's output shows the producer always waiting for the consumer, so it never runs out of memory.
public static class DataStore {
    public Integer myVal;
    public byte[] myBigData;

    public DataStore(Integer myVal) {
        this.myVal = myVal;
        this.myBigData = new byte[1000000];
    }
}
@Test
public void working() {
    int MAX_CONCURRENT_LOAD = 1;
    int MAX_CONCURRENT_STORE = 2;
    AtomicInteger nbUnconsumed = new AtomicInteger(0);
    List<Integer> ids = IntStream.range(0, 1000).boxed().collect(Collectors.toList());
    Observable.from(ids)
        .flatMap(this::produce, MAX_CONCURRENT_LOAD)
        .doOnNext(s -> logger.info("+1 Total unconsumed values: " + nbUnconsumed.incrementAndGet()))
        .flatMap(this::consume, MAX_CONCURRENT_STORE)
        .doOnNext(s -> logger.info("-1 Total unconsumed values: " + nbUnconsumed.decrementAndGet()))
        .toBlocking().forEach(s -> {});
    logger.info("Finished");
}
@Test
public void failing() {
    int MAX_CONCURRENT_LOAD = 1;
    int MAX_CONCURRENT_STORE = 2;
    AtomicInteger nbUnconsumed = new AtomicInteger(0);
    List<Integer> ids = IntStream.range(0, 1000).boxed().collect(Collectors.toList());
    Observable.from(ids)
        .flatMap(this::produce, MAX_CONCURRENT_LOAD)
        .doOnNext(s -> logger.info("+1 Total unconsumed values: " + nbUnconsumed.incrementAndGet()))
        .flatMap(dt -> Observable.just(dt))
        .flatMap(this::consume, MAX_CONCURRENT_STORE)
        .doOnNext(s -> logger.info("-1 Total unconsumed values: " + nbUnconsumed.decrementAndGet()))
        .toBlocking().forEach(s -> {});
    logger.info("Finished");
}
private Observable<DataStore> produce(final int value) {
    return Observable.<DataStore>create(s -> {
        try {
            if (!s.isUnsubscribed()) {
                Thread.sleep(200); // here I synchronously call a WS to retrieve the data
                s.onNext(new DataStore(value));
                s.onCompleted();
            }
        } catch (Exception e) {
            s.onError(e);
        }
    }).subscribeOn(Schedulers.io());
}
private Observable<Boolean> consume(DataStore value) {
    return Observable.<Boolean>create(s -> {
        try {
            if (!s.isUnsubscribed()) {
                Thread.sleep(1000); // here I synchronously call the DB to store the data
                s.onNext(true);
                s.onCompleted();
            }
        } catch (Exception e) {
            s.onNext(false);
            s.onCompleted();
        }
    }).subscribeOn(Schedulers.io());
}
What is the explanation behind this behavior? How can I fix my failing test without removing the Observable.just(dt), which in my real case is an Observable.from(someListOfItems)?
flatMap by default merges an unlimited number of sources, and by applying that identity lambda without the maxConcurrent parameter you essentially unbounded the upstream, which can now run at full speed and overwhelm the internal buffers of the other operators. The fix is to pass a maxConcurrent argument to that inner flatMap as well, e.g. flatMap(dt -> Observable.just(dt), MAX_CONCURRENT_STORE).
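To see why bounding the hand-off matters, here is the same effect in plain java.util.concurrent terms (a stdlib sketch, not RxJava; the 20-item loop and the Semaphore are illustrative stand-ins for the producer and for maxConcurrent): the bounded hand-off blocks the fast producer until the slow consumer frees a slot, so in-flight items can never pile up.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedHandoff {
    // Returns the maximum number of items that were ever "in flight"
    // between the producer loop and the consumer thread.
    public static int maxInFlight(int bound) {
        Semaphore permits = new Semaphore(bound); // plays the role of maxConcurrent
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();
        ExecutorService consumer = Executors.newSingleThreadExecutor();
        try {
            for (int i = 0; i < 20; i++) {
                permits.acquire(); // producer blocks once `bound` items are in flight
                maxSeen.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                consumer.submit(() -> {
                    inFlight.decrementAndGet();
                    permits.release(); // consuming an item frees a slot
                });
            }
            consumer.shutdown();
            consumer.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return maxSeen.get();
    }

    public static void main(String[] args) {
        System.out.println("max in flight: " + maxInFlight(2)); // never exceeds 2
    }
}
```

Without the Semaphore (the analogue of omitting maxConcurrent), the producer loop would submit all 20 items immediately and they would queue up unconsumed, which is exactly the OutOfMemory pattern in the failing test.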
Problem statement:
Do I/O in chunks. Start processing chunks as soon as one becomes available, while further chunks are being read in background (but not more than X chunks are read ahead). Process chunks in parallel as they are being received. Consume each processed chunk in-order-of-reading, i.e. in original order of the chunk being read.
What I've done:
I've set up an MWE class to imitate the situation, and it works to an extent:
The "prefetch" part doesn't seem to work as I expect it to: the "generator", which simulates the IO, produces arbitrarily many items before "processing" needs more elements, depending on the time delays I set.
Final consumption is not in order (expected, but I don't yet know what to do about it).
Pseudo-Rx code explanation of what I'd like to achieve:
Flux.fromFile(path, some-function-to-define-chunk)
// done with Flux.generate in MWE below
.prefetchOnIoThread(x-count: int)
// at this point we try to maintain a buffer filled with x-count pre-read chunks
.parallelMapOrdered(n-threads: int, limit-process-ahead: int)
// n-threads: are constantly trying to drain the x-count buffer, doing some transformation
// limit-process-ahead: as the operation results are needed in order, if we encounter an
// input element that takes a while to process, we don't want the pipeline to run too far
// ahead of this problematic element (to not overflow the buffers and use too much memory)
.consume(TMapped v)
Current attempt with Reactor (MWE):
Dependency: implementation 'io.projectreactor:reactor-core:3.3.5.RELEASE'
import reactor.core.Disposable;
import reactor.core.publisher.Flux;
import reactor.core.publisher.ParallelFlux;
import reactor.core.scheduler.Schedulers;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.atomic.AtomicInteger;
public class Tmp {
    static final SimpleDateFormat fmt = new SimpleDateFormat("HH:mm:ss.SSS");
    static long millisRead = 1;      // time taken to "read" a chunk
    static long millisProcess = 100; // time taken to "process" a chunk

    public static void main(String[] args) {
        log("Before flux construct");
        // Step 1: Generate / IO
        Flux<Integer> f = Flux.generate( // imitate IO
            AtomicInteger::new,
            (atomicInteger, synchronousSink) -> {
                sleepQuietly(millisRead);
                Integer next = atomicInteger.getAndIncrement();
                if (next > 50) {
                    synchronousSink.complete();
                    log("Emitting complete");
                } else {
                    log("Emitting next : %d", next);
                    synchronousSink.next(next);
                }
                return atomicInteger;
            },
            atomicInteger -> log("State consumer called: pos=%s", atomicInteger.get()));
        f = f.publishOn(Schedulers.elastic());
        f = f.subscribeOn(Schedulers.elastic());
        ParallelFlux<Integer> pf = f.parallel(2, 2);
        pf = pf.runOn(Schedulers.elastic(), 2);
        // Step 2: transform in parallel
        pf = pf.map(i -> { // imitate processing steps
            log("Processing begin: %d", i);
            sleepQuietly(millisProcess); // 100x the time it takes to create an input for this operation
            log("Processing done : %d", i);
            return 1000 + i;
        });
        // Step 3: use transformed data, preferably in order of generation
        Disposable sub = pf.sequential(3).subscribe(
            next -> log(String.format("Finally got: %d", next)),
            err -> err.printStackTrace(),
            () -> log("Complete!"));
        while (!sub.isDisposed()) {
            log("Waiting pipeline completion...");
            sleepQuietly(500);
        }
        log("Main done");
    }

    public static void log(String message) {
        Thread t = Thread.currentThread();
        Date d = new Date();
        System.out.printf("[%s] # [%s]: %s\n", t.getName(), fmt.format(d), message);
    }

    public static void log(String format, Object... args) {
        log(String.format(format, args));
    }

    public static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            throw new IllegalStateException();
        }
    }
}
Considering the lack of answers, I'll post what I came up with.
final int threads = 2;
final int prefetch = 3;
Flux<Integer> gen = Flux.generate(AtomicInteger::new, (ai, sink) -> {
    int i = ai.incrementAndGet();
    if (i > 10) {
        sink.complete();
    } else {
        sink.next(i);
    }
    return ai;
});
gen.parallel(threads, prefetch)             // switch to parallel processing after the generator
    .runOn(Schedulers.parallel(), prefetch) // if you don't do this, it won't run in parallel
    .map(i -> i + 1000)                     // this is done in parallel
    .ordered(Integer::compareTo)            // you can do just .sequential(), which is faster
    .subscribeOn(Schedulers.elastic())      // generator will run on this scheduler (once subscribed)
    .subscribe(i -> {
        // do something with the generated and processed item
        System.out.println("Transformed integer: " + i);
    });
Researching this has been a little difficult because I'm not precisely sure how to word the question. Here is some pseudo code summarizing my goal.
public class TestService {
    Object someBigMethod(String A, Integer I) {
        { // block A
            // do some long database read
        }
        { // block B
            // do another long database read at the same time as block A
        }
        { // block C
            // get in this block when both A & B are complete
            // and access the results returned or pushed from A & B
            // to build up some data object to push out to the class that called
            // this service or has subscribed to it
            return null;
        }
    }
}
I am thinking I can use RxJava or Spring Integration to accomplish this, or maybe just instantiate multiple threads and run them. The layout alone makes me think Rx has the solution, because data seems to be pushed to block C. Thanks in advance for any advice you might have.
You can do this with CompletableFuture. In particular, its thenCombine method, which waits for two tasks to complete.
CompletableFuture<A> fa = CompletableFuture.supplyAsync(() -> {
    // do some long database read
    return a;
});
CompletableFuture<B> fb = CompletableFuture.supplyAsync(() -> {
    // do another long database read
    return b;
});
CompletableFuture<C> fc = fa.thenCombine(fb, (a, b) -> {
    // use a and b to build object c
    return c;
});
return fc.join();
These methods will all execute on ForkJoinPool.commonPool(). You can control where they run by passing an optional Executor to each of them.
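For instance, a minimal runnable sketch of the same pipeline on an explicit pool (the fixed values 1 and 2 are placeholders standing in for the two database reads):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CombineOnPool {
    // Runs both "reads" and the combine step on a dedicated pool
    // instead of ForkJoinPool.commonPool().
    public static int combined() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<Integer> fa =
                CompletableFuture.supplyAsync(() -> 1, pool); // "read A"
            CompletableFuture<Integer> fb =
                CompletableFuture.supplyAsync(() -> 2, pool); // "read B"
            // combine runs only once both futures have completed
            return fa.thenCombineAsync(fb, Integer::sum, pool).join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(CombineOnPool.combined()); // 3
    }
}
```

Using a dedicated pool matters when the reads block on I/O: blocking tasks on the common pool can starve unrelated parallel streams in the same JVM.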
You can use the zip operator from RxJava. This operator can run multiple processes in parallel and then zip their results.
Documentation: http://reactivex.io/documentation/operators/zip.html
And here is an example of how it works: https://github.com/politrons/reactive/blob/master/src/test/java/rx/observables/combining/ObservableZip.java
For now I just went with John's suggestion, and it gets the desired effect. I mix RxJava 1 and RxJava 2 syntax a bit, which is probably poor practice. It looks like I have some reading cut out for me on the java.util.concurrent package. Time permitting, I would like to try the zip solution.
@Test
public void myBigFunction() {
    System.out.println("starting");
    CompletableFuture<List<String>> fa = CompletableFuture.supplyAsync(() -> {
        // block A
        // do some long database read
        try {
            Thread.sleep(3000);
            System.out.println("part A");
            return asList(new String[] {"abc", "def"});
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return null;
    });
    CompletableFuture<List<Integer>> fb = CompletableFuture.supplyAsync(() -> {
        // block B
        // do some long database read
        try {
            Thread.sleep(6000);
            System.out.println("Part B");
            return asList(new Integer[] {123, 456});
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return null;
    });
    CompletableFuture<List<String>> fc = fa.thenCombine(fb, (a, b) -> {
        // block C
        // get in this block when both A & B are complete
        int sum = b.stream().mapToInt(i -> i.intValue()).sum();
        return a.stream().map(new Function<String, String>() {
            @Override
            public String apply(String s) {
                return s + sum;
            }
        }).collect(Collectors.toList());
    });
    System.out.println(fc.join());
}
It only takes 6 seconds to run: the two reads overlap, so the total is max(3s, 6s) rather than 3s + 6s.
The crawler has a urlQueue recording the URLs to crawl and a mock asynchronous URL fetcher.
I am trying to write it in RxJava style.
At first, I tried Flowable.generate like this:
Flowable.generate((Consumer<Emitter<Integer>>) e -> {
    final Integer poll = demo.urlQueue.poll();
    if (poll != null) {
        e.onNext(poll);
    } else if (runningCount.get() == 0) {
        e.onComplete();
    }
}).flatMap(i -> {
    runningCount.incrementAndGet();
    return demo.urlFetcher.asyncFetchUrl(i);
}, 10)
.doOnNext(page -> demo.onSuccess(page))
.subscribe(page -> runningCount.decrementAndGet());
but it doesn't work, because at the beginning there may be only one seed in urlQueue, so generate is called 10 times but only one e.onNext is emitted. Only when that fetch finishes does the next request(1) -> generate get called.
Although the code specifies a flatMap maxConcurrency of 10, it crawls one URL at a time.
After that I modified the code as follows, and it works as expected.
But in this code I have to track how many tasks are currently running and calculate how many more should be fetched from the queue, which I think RxJava ought to do for me.
I am not sure whether the code can be rewritten in a simpler way.
public class CrawlerDemo {
    private static Logger logger = LoggerFactory.getLogger(CrawlerDemo.class);
    // it can be a redis queue or another queue
    private BlockingQueue<Integer> urlQueue = new LinkedBlockingQueue<>();
    private static AtomicInteger runningCount = new AtomicInteger(0);
    private static final int MAX_CONCURRENCY = 5;
    private UrlFetcher urlFetcher = new UrlFetcher();

    private void addSeed(int i) {
        urlQueue.offer(i);
    }

    private void onSuccess(Page page) {
        page.links.forEach(i -> {
            logger.info("offer more url " + i);
            urlQueue.offer(i);
        });
    }

    private void start(BehaviorProcessor processor) {
        final Integer poll = urlQueue.poll();
        if (poll != null) {
            processor.onNext(poll);
        } else {
            processor.onComplete();
        }
    }

    private int dispatchMoreLink(BehaviorProcessor processor) {
        int links = 0;
        while (runningCount.get() <= MAX_CONCURRENCY) {
            final Integer poll = urlQueue.poll();
            if (poll != null) {
                processor.onNext(poll);
                links++;
            } else {
                if (runningCount.get() == 0) {
                    processor.onComplete();
                }
                break;
            }
        }
        return links;
    }

    private Flowable<Page> asyncFetchUrl(int i) {
        return urlFetcher.asyncFetchUrl(i);
    }

    public static void main(String[] args) throws InterruptedException {
        CrawlerDemo demo = new CrawlerDemo();
        demo.addSeed(1);
        BehaviorProcessor<Integer> processor = BehaviorProcessor.create();
        processor
            .flatMap(i -> {
                runningCount.incrementAndGet();
                return demo.asyncFetchUrl(i)
                    .doFinally(() -> runningCount.decrementAndGet())
                    .doFinally(() -> demo.dispatchMoreLink(processor));
            }, MAX_CONCURRENCY)
            .doOnNext(page -> demo.onSuccess(page))
            .subscribe();
        demo.start(processor);
    }
}
class Page {
    public List<Integer> links = new ArrayList<>();
}

class UrlFetcher {
    static Logger logger = LoggerFactory.getLogger(UrlFetcher.class);
    final ScheduledExecutorService scheduledExecutorService = Executors.newSingleThreadScheduledExecutor();

    public Flowable<Page> asyncFetchUrl(Integer url) {
        logger.info("start async get " + url);
        return Flowable.defer(() -> emitter ->
            scheduledExecutorService.schedule(() -> {
                Page page = new Page();
                // the website has no more than 1000 urls
                if (url < 1000) {
                    page.links = IntStream.range(1, 5).boxed().map(j -> 10 * url + j).collect(Collectors.toList());
                }
                logger.info("finish async get " + url);
                emitter.onNext(page);
                emitter.onComplete();
            }, 5, TimeUnit.SECONDS)); // it takes 5 seconds to access a url
    }
}
You are trying to use regular (non-Rx) code with RxJava and not getting the results you want.
The first thing to do is to convert the urlQueue.poll() into a Flowable<Integer>:
Flowable.generate((Consumer<Emitter<Integer>>) e -> {
    final Integer take = demo.urlQueue.take(); // Note 1
    e.onNext(take);                            // Note 2
})
.observeOn(Schedulers.io(), 1) // Note 3
.flatMap(i -> demo.urlFetcher.asyncFetchUrl(i), 10)
.subscribe(page -> demo.onSuccess(page));
Note 1: Reading the queue in a reactive way means a blocking wait, so use take(). Trying to poll() the queue adds a layer of complexity that RxJava allows you to skip over.
Note 2: Pass the received value on to any subscribers. If you need to indicate completion, you will need to add an external boolean, or use an in-band indicator (such as a negative integer).
Note 3: The observeOn() operator will subscribe to the generator. The value 1 limits the prefetch to a single element, since there is no point in reading further ahead.
The rest of the code is similar to what you have. The issues you ran into arose because the flatMap(..., 10) operation requests 10 items from the generator up front, which is not what you wanted: you want to limit the number of simultaneous fetches. Adding the runningCount was a kludge to prevent exiting the generator early, but it is not a substitute for a proper way to signal end-of-data on the urlQueue.
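The in-band completion indicator mentioned in Note 2 is essentially the classic poison-pill pattern. A plain java.util.concurrent sketch, with DONE and the integer "urls" as illustrative stand-ins:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PoisonPill {
    // Sentinel value: real url ids are non-negative, so -1 is safely in-band.
    static final int DONE = -1;

    // Blocks on the queue (like take() in the answer) until the sentinel
    // arrives, so no runningCount bookkeeping is needed.
    public static int drain(BlockingQueue<Integer> queue) {
        int consumed = 0;
        try {
            while (true) {
                int item = queue.take(); // blocking wait
                if (item == DONE) break; // in-band end-of-data signal
                consumed++;              // "fetch" the url here
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return consumed;
    }

    public static void main(String[] args) {
        BlockingQueue<Integer> q = new LinkedBlockingQueue<>();
        q.offer(1);
        q.offer(2);
        q.offer(3);
        q.offer(DONE); // whoever knows the crawl is finished enqueues the pill
        System.out.println(drain(q)); // 3
    }
}
```

The same sentinel can terminate the Flowable.generate loop above: on seeing it, call e.onComplete() instead of e.onNext().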
I recently did a coding interview on a Java concurrency task and unfortunately didn't get the job. The worst part is I gave it my best, but now I'm not even sure where it went wrong. Can anyone give me some ideas about what I can improve in the code below? Thanks.
The question is pretty vague. Given 4 generic interfaces which, at a high level, divide a task into small pieces, work on each piece, and combine the partial results into a final result, I was asked to implement the central controller piece of the interface. The only requirements were to use concurrency in the partial-result processing and that the "code must be production quality".
My code is below (the interfaces were given). I did put in a lot of comments to explain my assumptions, which are removed here.
// adding V,W in order to use them in private field types
public class ControllerImpl<T, U, V, W> implements Controller<T, U> {
    private static Logger logger = LoggerFactory.getLogger(ControllerImpl.class);
    private static int BATCH_SIZE = 100;
    private Preprocessor<T, V> preprocessor;
    private Processor<V, W> processor;
    private Postprocessor<U, W> postprocessor;

    public ControllerImpl() {
        this.preprocessor = new PreprocessorImpl<>();
        this.processor = new ProcessorImpl<>();
        this.postprocessor = new PostprocessorImpl<>();
    }

    public ControllerImpl(Preprocessor preprocessor, Processor processor, Postprocessor postprocessor) {
        this.preprocessor = preprocessor;
        this.processor = processor;
        this.postprocessor = postprocessor;
    }

    @Override
    public U process(T arg) {
        if (arg == null) return null;
        final V[] parts = preprocessor.split(arg);
        final W[] partResult = (W[]) new Object[parts.length];
        final int poolSize = Runtime.getRuntime().availableProcessors();
        final ExecutorService executor = getExecutor(poolSize);
        int i = 0;
        while (i < parts.length) {
            final List<Callable<W>> tasks = IntStream.range(i, i + BATCH_SIZE)
                .filter(e -> e < parts.length)
                .mapToObj(e -> (Callable<W>) () -> partResult[e] = processor.processPart(parts[e]))
                .collect(Collectors.toList());
            i += tasks.size();
            try {
                logger.info("invoking batch of {} tasks to workers", tasks.size());
                long start = System.currentTimeMillis();
                final List<Future<W>> futures = executor.invokeAll(tasks);
                long end = System.currentTimeMillis();
                logger.info("done batch processing took {} ms", end - start);
                for (Future future : futures) {
                    future.get();
                }
            } catch (InterruptedException e) {
                logger.error("{}", e); // have comments to explain better handling according to real business requirements
            } catch (ExecutionException e) {
                logger.error("error: ", e);
            }
        }
        MoreExecutors.shutdownAndAwaitTermination(executor, 60, TimeUnit.SECONDS);
        return postprocessor.aggregate(partResult);
    }

    private ExecutorService getExecutor(int poolSize) {
        final ThreadFactory threadFactory = new ThreadFactoryBuilder()
            .setNameFormat("Processor-%d")
            .setDaemon(true)
            .build();
        return new ThreadPoolExecutor(poolSize, poolSize, 60, TimeUnit.SECONDS, new LinkedBlockingDeque<>(), threadFactory);
    }
}
So, if I understand correctly, you have a Preprocessor that takes a T and splits it into an array of V[]. Then you have a processor which transforms a V into a W. And then a postprocessor which transforms a W[] into a U, right? And you must assemble those things.
First of all, arrays and generics really don't mix, so it's bizarre for those methods to return arrays rather than lists. In production-quality code, generic arrays shouldn't be used.
So, to recap:
T --> V1 --> W1 --> U
      V2 --> W2
      .     .
      .     .
      Vn --> Wn
So you could do this:
V[] parts = preprocessor.split(t);
W[] transformedParts =
    (W[]) Arrays.stream(parts)           // unchecked cast due to the use of generic arrays
        .parallel()                      // this is where concurrency happens
        .map(processor::processPart)
        .toArray();
U result = postProcessor.aggregate(transformedParts);
If you use lists instead of arrays, you can write it as a single expression:
U result =
    postProcessor.aggregate(
        preprocessor.split(t)
            .parallelStream()
            .map(processor::processPart)
            .collect(Collectors.toList()));
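For comparison, here is a concrete, runnable toy instantiation of that list-based pipeline; a word split, an uppercase map, and a join are hypothetical stand-ins for the interview's Preprocessor, Processor and Postprocessor:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class Pipeline {
    // "Preprocessor": split the input into independent parts
    static List<String> split(String t) { return Arrays.asList(t.split(" ")); }

    // "Processor": transform one part (runs concurrently)
    static String processPart(String v) { return v.toUpperCase(); }

    // "Postprocessor": aggregate the transformed parts, in order
    static String aggregate(List<String> ws) { return String.join(" ", ws); }

    public static String process(String t) {
        return aggregate(split(t).parallelStream()   // concurrency via the common pool
                                 .map(Pipeline::processPart)
                                 .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        System.out.println(process("hello parallel world")); // HELLO PARALLEL WORLD
    }
}
```

Note that collect(Collectors.toList()) on a parallel stream preserves encounter order, so the aggregate step sees the parts in their original order even though they were processed concurrently.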
I have a Java method that returns a string template. I want to make 2 async calls to a remote API, each call returning a number, then compute the sum of these 2 numbers and put it into the template before returning it.
So I have this Java code to achieve the task:
private Observable<Integer> createObservable() {
    Observable<Integer> obs = Observable.create(new OnSubscribe<Integer>() {
        public void call(Subscriber<? super Integer> t) {
            System.out.println("Call with thread : " + Thread.currentThread().getName());
            // fake call to a remote API => the thread sleeps for 4 seconds
            try {
                Thread.sleep(4000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            t.onNext(new Random().nextInt(10));
            t.onCompleted();
        }
    }).subscribeOn(Schedulers.newThread());
    return Observable
        .merge(obs, obs)
        .reduce(new Func2<Integer, Integer, Integer>() {
            public Integer call(Integer t1, Integer t2) {
                return t1 + t2;
            }
        });
}
public String retrieveTemplate() {
    // I want to start the work of the Observable here but I don't know how
    // do things in the main thread
    // here I just initialize a string, but we could imagine doing more
    String s = "The final Number is {0}";
    System.out.println(Thread.currentThread().getName() + " : the string is initialized");
    // I wait for the Observable result here
    int result = createObservable().toBlocking().first();
    return MessageFormat.format(s, result);
}
The output of this code is correct (two threads are created to call the remote API):
main : the string is initialized
Call with thread : RxNewThreadScheduler-1
Call with thread : RxNewThreadScheduler-2
The final Number is 2
I want to start the RxJava Observable at the beginning of retrieveTemplate (in order to call the remote API as soon as possible) and wait for the result only just before the call to MessageFormat.format, but I don't know how to do it.
Assuming the whole creation process works, you may want to bind the computation to the subscription moment by transforming the source observable:
public Observable<String> retrieveTemplate() {
    return createObservable().map(result -> {
        String s = "The final Number is {0}";
        System.out.println(Thread.currentThread().getName() + " : the string is initialized");
        return MessageFormat.format(s, result);
    });
}
When you subscribe to the observable returned by retrieveTemplate, you actually start the whole computation:
// some other place in the code
retrieveTemplate().subscribe(template -> doStuffWithTemplate(template))
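If you specifically want the "start early, block only just before formatting" shape from the question, the same idea can be sketched with plain CompletableFuture (an assumption-laden stand-in: the fixed 1 + 1 replaces the two random remote calls and the merge/reduce):

```java
import java.text.MessageFormat;
import java.util.concurrent.CompletableFuture;

public class EagerTemplate {
    // Stands in for the two remote calls plus the reduce; the work starts
    // the moment this future is created, not when it is joined.
    static CompletableFuture<Integer> startSum() {
        return CompletableFuture.supplyAsync(() -> 1 + 1);
    }

    public static String retrieveTemplate() {
        CompletableFuture<Integer> sum = startSum(); // async work kicks off here
        String s = "The final Number is {0}";        // main-thread work proceeds meanwhile
        return MessageFormat.format(s, sum.join()); // block only when the value is needed
    }

    public static void main(String[] args) {
        System.out.println(retrieveTemplate()); // The final Number is 2
    }
}
```

In RxJava terms the equivalent trick is to make the source hot up front (e.g. cache() or replay().autoConnect() and subscribe immediately), then block on it later; the CompletableFuture version just makes the two phases explicit.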