How to rewrite the following RxJava crawler (Java)

The crawler has a urlQueue that records the URLs to crawl and a mock asynchronous URL fetcher.
I am trying to write it in RxJava style.
At first I tried Flowable.generate, like this:
Flowable.generate((Consumer<Emitter<Integer>>) e -> {
    final Integer poll = demo.urlQueue.poll();
    if (poll != null) {
        e.onNext(poll);
    } else if (runningCount.get() == 0) {
        e.onComplete();
    }
}).flatMap(i -> {
    runningCount.incrementAndGet();
    return demo.urlFetcher.asyncFetchUrl(i);
}, 10)
        .doOnNext(page -> demo.onSuccess(page))
        .subscribe(page -> runningCount.decrementAndGet());
but it won't work: at the beginning there may be only one seed in urlQueue, so generate is called 10 times but only one e.onNext is emitted. Only when that fetch finishes does flatMap request(1) more, causing generate to be called again.
So although we specify a flatMap maxConcurrency of 10, it crawls URLs one by one.
After that I modified the code as follows, and it works as expected.
But in this code I have to track how many tasks are currently running and work out how many more should be fetched from the queue, which I think RxJava itself should be doing.
I am not sure whether the code can be rewritten in a simpler way.
public class CrawlerDemo {

    private static Logger logger = LoggerFactory.getLogger(CrawlerDemo.class);
    // it can be a redis queue or another queue
    private BlockingQueue<Integer> urlQueue = new LinkedBlockingQueue<>();
    private static AtomicInteger runningCount = new AtomicInteger(0);
    private static final int MAX_CONCURRENCY = 5;
    private UrlFetcher urlFetcher = new UrlFetcher();

    private void addSeed(int i) {
        urlQueue.offer(i);
    }

    private void onSuccess(Page page) {
        page.links.forEach(i -> {
            logger.info("offer more url " + i);
            urlQueue.offer(i);
        });
    }

    private void start(BehaviorProcessor<Integer> processor) {
        final Integer poll = urlQueue.poll();
        if (poll != null) {
            processor.onNext(poll);
        } else {
            processor.onComplete();
        }
    }

    private int dispatchMoreLink(BehaviorProcessor<Integer> processor) {
        int links = 0;
        while (runningCount.get() <= MAX_CONCURRENCY) {
            final Integer poll = urlQueue.poll();
            if (poll != null) {
                processor.onNext(poll);
                links++;
            } else {
                if (runningCount.get() == 0) {
                    processor.onComplete();
                }
                break;
            }
        }
        return links;
    }

    private Flowable<Page> asyncFetchUrl(int i) {
        return urlFetcher.asyncFetchUrl(i);
    }

    public static void main(String[] args) throws InterruptedException {
        CrawlerDemo demo = new CrawlerDemo();
        demo.addSeed(1);

        BehaviorProcessor<Integer> processor = BehaviorProcessor.create();
        processor
                .flatMap(i -> {
                    runningCount.incrementAndGet();
                    return demo.asyncFetchUrl(i)
                            .doFinally(() -> runningCount.decrementAndGet())
                            .doFinally(() -> demo.dispatchMoreLink(processor));
                }, MAX_CONCURRENCY)
                .doOnNext(page -> demo.onSuccess(page))
                .subscribe();

        demo.start(processor);
    }
}

class Page {
    public List<Integer> links = new ArrayList<>();
}

class UrlFetcher {

    static Logger logger = LoggerFactory.getLogger(UrlFetcher.class);
    final ScheduledExecutorService scheduledExecutorService = Executors.newSingleThreadScheduledExecutor();

    public Flowable<Page> asyncFetchUrl(Integer url) {
        logger.info("start async get " + url);
        return Flowable.defer(() -> emitter ->
                scheduledExecutorService.schedule(() -> {
                    Page page = new Page();
                    // the website urls are no more than 1000
                    if (url < 1000) {
                        page.links = IntStream.range(1, 5).boxed().map(j -> 10 * url + j).collect(Collectors.toList());
                    }
                    logger.info("finish async get " + url);
                    emitter.onNext(page);
                    emitter.onComplete();
                }, 5, TimeUnit.SECONDS)); // accessing a url costs 5 seconds
    }
}

You are trying to use regular (non-Rx) code with RxJava and not getting the results you want.
The first thing to do is to convert the urlQueue.poll() into a Flowable<Integer>:
Flowable.generate((Consumer<Emitter<Integer>>) e -> {
    final Integer take = demo.urlQueue.take(); // Note 1
    e.onNext(take);                            // Note 2
})
        .observeOn(Schedulers.io(), 1)         // Note 3
        .flatMap(i -> demo.urlFetcher.asyncFetchUrl(i), 10)
        .subscribe(page -> demo.onSuccess(page));
Note 1: Reading the queue in a reactive way means a blocking wait. Trying to poll() the queue adds a layer of complexity that RxJava allows you to skip over.
Note 2: Pass the received value on to any subscribers. If you need to indicate completion, you will need to add an external boolean, or use an in-band indicator (such as a negative integer).
Note 3: The observeOn() operator will subscribe to the generator. The value 1 will cause only one subscription, since there is no point in having more than one.
The rest of the code is similar to what you have. The issues you saw arose because the flatMap(..., 10) operation requests up to 10 values from the generator at once, which is not what you wanted: you want to limit the number of simultaneous fetches. Adding the runningCount was a kludge to prevent exiting the generator early, but it is not a substitute for a proper way to signal end-of-data on the urlQueue.
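For completeness, a minimal sketch of the in-band completion idea from Note 2, assuming the same demo object as above and using a hypothetical -1 marker that the crawler offers onto urlQueue once it detects there is nothing left to do:

static final int POISON_PILL = -1; // hypothetical in-band "no more urls" marker

Flowable.generate((Consumer<Emitter<Integer>>) e -> {
    final Integer take = demo.urlQueue.take(); // blocking wait, as in Note 1
    if (take == POISON_PILL) {
        e.onComplete();                        // end-of-data signalled in-band
    } else {
        e.onNext(take);
    }
})
        .observeOn(Schedulers.io(), 1)
        .flatMap(i -> demo.urlFetcher.asyncFetchUrl(i), 10)
        .subscribe(page -> demo.onSuccess(page));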

Related

Reactive processing: async IO producer with prefetch and in-order consumers (MWE provided) (Java, Project Reactor 3.x)

Problem statement:
Do I/O in chunks. Start processing chunks as soon as one becomes available, while further chunks are read in the background (but no more than X chunks are read ahead). Process chunks in parallel as they are received. Consume each processed chunk in order of reading, i.e. in the original order in which the chunks were read.
What I've done:
I've set up an MWE class to imitate the situation, and it works to an extent:
- The "prefetch" part doesn't seem to be working as I expect: the "generator", which simulates the IO, produces arbitrarily many items before "processing" needs more elements, depending on the time delays I set.
- Final consumption is not in order (expected, but I don't yet know what to do about it).
Pseudo-Rx code explanation of what I'd like to achieve:
Flux.fromFile(path, some-function-to-define-chunk)
    // done with Flux.generate in the MWE below
    .prefetchOnIoThread(x-count: int)
    // at this point we try to maintain a buffer filled with x-count pre-read chunks
    .parallelMapOrdered(n-threads: int, limit-process-ahead: int)
    // n-threads: constantly try to drain the x-count buffer, doing some transformation
    // limit-process-ahead: as the operation results are needed in order, if we encounter an
    //   input element that takes a while to process, we don't want the pipeline to run too far
    //   ahead of this problematic element (to not overflow the buffers and use too much memory)
    .consume(TMapped v)
Current attempt with Reactor (MWE):
Dependency: implementation 'io.projectreactor:reactor-core:3.3.5.RELEASE'
import reactor.core.Disposable;
import reactor.core.publisher.Flux;
import reactor.core.publisher.ParallelFlux;
import reactor.core.scheduler.Schedulers;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.atomic.AtomicInteger;

public class Tmp {

    static final SimpleDateFormat fmt = new SimpleDateFormat("HH:mm:ss.SSS");
    static long millisRead = 1;      // time taken to "read" a chunk
    static long millisProcess = 100; // time taken to "process" a chunk

    public static void main(String[] args) {
        log("Before flux construct");

        // Step 1: Generate / IO
        Flux<Integer> f = Flux.generate( // imitate IO
                AtomicInteger::new,
                (atomicInteger, synchronousSink) -> {
                    sleepQuietly(millisRead);
                    Integer next = atomicInteger.getAndIncrement();
                    if (next > 50) {
                        synchronousSink.complete();
                        log("Emitting complete");
                    } else {
                        log("Emitting next : %d", next);
                        synchronousSink.next(next);
                    }
                    return atomicInteger;
                },
                atomicInteger -> log("State consumer called: pos=%s", atomicInteger.get()));
        f = f.publishOn(Schedulers.elastic());
        f = f.subscribeOn(Schedulers.elastic());

        ParallelFlux<Integer> pf = f.parallel(2, 2);
        pf = pf.runOn(Schedulers.elastic(), 2);

        // Step 2: transform in parallel
        pf = pf.map(i -> { // imitate processing steps
            log("Processing begin: %d", i);
            sleepQuietly(millisProcess); // 100x the time it takes to create an input for this operation
            log("Processing done : %d", i);
            return 1000 + i;
        });

        // Step 3: use transformed data, preferably in order of generation
        Disposable sub = pf.sequential(3).subscribe(
                next -> log(String.format("Finally got: %d", next)),
                err -> err.printStackTrace(),
                () -> log("Complete!"));

        while (!sub.isDisposed()) {
            log("Waiting pipeline completion...");
            sleepQuietly(500);
        }
        log("Main done");
    }

    public static void log(String message) {
        Thread t = Thread.currentThread();
        Date d = new Date();
        System.out.printf("[%s] # [%s]: %s\n", t.getName(), fmt.format(d), message);
    }

    public static void log(String format, Object... args) {
        log(String.format(format, args));
    }

    public static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            throw new IllegalStateException();
        }
    }
}
Considering the lack of answers, I'll post what I came up with.
final int threads = 2;
final int prefetch = 3;

Flux<Integer> gen = Flux.generate(AtomicInteger::new, (ai, sink) -> {
    int i = ai.incrementAndGet();
    if (i > 10) {
        sink.complete();
    } else {
        sink.next(i);
    }
    return ai;
});

gen.parallel(threads, prefetch)                 // switch to parallel processing after the generator
        .runOn(Schedulers.parallel(), prefetch) // if you don't do this, it won't run in parallel
        .map(i -> i + 1000)                     // this is done in parallel
        .ordered(Integer::compareTo)            // you can do just .sequential(), which is faster
        .subscribeOn(Schedulers.elastic())      // the generator will run on this scheduler (once subscribed)
        .subscribe(i -> {
            System.out.println("Transformed integer: " + i); // do something with the generated and processed item
        });
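If the generator still reads too far ahead for your taste, one complementary knob (my own addition, not something benchmarked here) is Reactor's limitRate operator, which caps how many items are requested from upstream at a time:

gen.limitRate(prefetch)                         // request at most `prefetch` items from the generator at once
        .parallel(threads, prefetch)
        .runOn(Schedulers.parallel(), prefetch)
        .map(i -> i + 1000)
        .ordered(Integer::compareTo)
        .subscribeOn(Schedulers.elastic())
        .subscribe(i -> System.out.println("Transformed integer: " + i));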

What is the way to get the first finished future from a list of futures in Java?

The following way of iterating over a list of futures always waits for the first job in the list to be done, even if other jobs finish earlier:
for (Future<MyFutureResult> future : list) {
    MyFutureResult result = future.get();
}
Is there a way to iterate over the jobs in the order they finish?
Getting the first completed Future from the list of futures is not possible directly, since those are processed in parallel and you would have to block anyway to find the result.
However, you can take control over task completion by using ExecutorCompletionService for your parallel processing. This class has take and poll methods that return the Future of the next completed task:
A CompletionService that uses a supplied Executor to execute tasks. This class arranges that submitted tasks are, upon completion, placed on a queue accessible using take. The class is lightweight enough to be suitable for transient use when processing groups of tasks.
ExecutorService threadPool = Executors.newCachedThreadPool();
CompletionService<Integer> ecs = new ExecutorCompletionService<>(threadPool);

int tasks = 10;
IntStream.range(0, tasks)
        .forEach(i -> ecs.submit(() -> i)); // submit tasks

for (int i = 0; i < tasks; i++) {
    // take() blocks, but futures are returned in completion order;
    // you will also have to handle InterruptedException
    Future<Integer> take = ecs.take();
}
// remember to shut down the ExecutorService after you are done
Have a look at ExecutorService.invokeAny(..), which returns the first result, or ExecutorService.invokeAll(..), which returns all completed tasks (within a timeout).
class InvokeAnyAllTest {

    ExecutorService es = Executors.newCachedThreadPool();

    // create some Callable tasks
    List<Callable<MyFutureResult>> tasks = IntStream.range(0, 10)
            .mapToObj(this::createTask)
            .collect(toList());

    private Callable<MyFutureResult> createTask(int i) {
        return () -> new MyFutureResult(i);
    }

    @Test
    void testFirstCallable() throws Exception {
        MyFutureResult result = es.invokeAny(tasks);
        assertTrue(result.i >= 0 && result.i < 10);
    }

    @Test
    void testAllCompleted() throws Exception {
        List<Future<MyFutureResult>> results = es.invokeAll(tasks, 5, TimeUnit.SECONDS);
        // all futures that are done within 5s, either normally or by throwing an exception
        Set<Integer> values = results.stream().map(f -> {
            try {
                return f.get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (ExecutionException e) {
                // are we interested in failed ones too?
            }
            return null;
        }).filter(Objects::nonNull).map(result -> result.i).collect(toSet());
        IntStream.range(0, 10).forEach(i -> assertTrue(values.contains(i)));
    }

    // in case we have only futures, not callables
    @Test
    void testFirstFuture() throws Exception {
        // create futures
        List<Future<MyFutureResult>> futures = IntStream.range(0, 10)
                .mapToObj(i -> es.submit(createTask(i)))
                .collect(toList());
        // turn futures into callables
        List<Callable<MyFutureResult>> callables = futures.stream()
                .map(f -> (Callable<MyFutureResult>) () -> f.get())
                .collect(toList());
        MyFutureResult result = es.invokeAny(callables);
        assertTrue(result.i >= 0 && result.i < 10);
    }

    private static class MyFutureResult {
        int i;

        public MyFutureResult(int i) {
            this.i = i;
        }
    }
}
If for some reason you are not supposed to use ExecutorCompletionService (as mentioned by Michał Krzywański), you can replace ecs.take() or future.get() with the method below:
getCompletedFuture(futureSet, 1000).get();
...
private static <V> Future<V> getCompletedFuture(Set<Future<V>> futureSet, long pollInterval)
        throws ExecutionException, InterruptedException {
    Iterator<Future<V>> iterator = futureSet.iterator();
    while (!Thread.currentThread().isInterrupted()) {
        if (!iterator.hasNext()) {
            iterator = futureSet.iterator(); // wrap around and keep polling
        }
        Future<V> future = iterator.next();
        try {
            future.get(pollInterval, TimeUnit.MILLISECONDS);
            iterator.remove();
            return future; // this future completed within the poll interval
        } catch (TimeoutException e) {
            // not done yet; try the next future in the set
        }
    }
    throw new InterruptedException();
}
You can put the futures onto a BlockingQueue in order of their completion.
public static <T> BlockingQueue<CompletableFuture<T>> collect(Stream<CompletableFuture<T>> futures) {
    var queue = new LinkedBlockingQueue<CompletableFuture<T>>();
    futures.forEach(future ->
            future.handle((success, failure) -> queue.add(future)));
    return queue;
}
Each call to BlockingQueue.take on the queue returned by collect will block until the next future becomes available by completing, and will return that future.
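A short usage sketch of the collect helper above (slowTask is a hypothetical method returning an Integer):

BlockingQueue<CompletableFuture<Integer>> completed = collect(
        IntStream.range(0, 10)
                .mapToObj(i -> CompletableFuture.supplyAsync(() -> slowTask(i))));

// blocks until some future finishes; note that take() throws InterruptedException
CompletableFuture<Integer> first = completed.take();
Integer result = first.join();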

Java concurrency code improvement ideas

I recently did a coding interview on a Java concurrency task and unfortunately didn't get the job. The worst part is that I gave it my best, but now I'm not even sure where it went wrong. Can anyone give me some ideas about things I can improve in the code below? Thanks.
The question is pretty vague. Given 4 generic interfaces which, on a high level, divide a task into small pieces, work on each piece, and combine the partial results into a final result, I was asked to implement the central controller piece. The only requirements were to use concurrency in the partial-result processing and that the "code must be production quality".
My code is below (the interfaces were given). I put in a lot of comments to explain my assumptions; they are removed here.
// adding V,W in order to use them in private field types
public class ControllerImpl<T, U, V, W> implements Controller<T, U> {

    private static Logger logger = LoggerFactory.getLogger(ControllerImpl.class);
    private static int BATCH_SIZE = 100;

    private Preprocessor<T, V> preprocessor;
    private Processor<V, W> processor;
    private Postprocessor<U, W> postprocessor;

    public ControllerImpl() {
        this.preprocessor = new PreprocessorImpl<>();
        this.processor = new ProcessorImpl<>();
        this.postprocessor = new PostprocessorImpl<>();
    }

    public ControllerImpl(Preprocessor<T, V> preprocessor, Processor<V, W> processor, Postprocessor<U, W> postprocessor) {
        this.preprocessor = preprocessor;
        this.processor = processor;
        this.postprocessor = postprocessor;
    }

    @Override
    public U process(T arg) {
        if (arg == null) return null;

        final V[] parts = preprocessor.split(arg);
        final W[] partResult = (W[]) new Object[parts.length];
        final int poolSize = Runtime.getRuntime().availableProcessors();
        final ExecutorService executor = getExecutor(poolSize);

        int i = 0;
        while (i < parts.length) {
            final List<Callable<W>> tasks = IntStream.range(i, i + BATCH_SIZE)
                    .filter(e -> e < parts.length)
                    .mapToObj(e -> (Callable<W>) () -> partResult[e] = processor.processPart(parts[e]))
                    .collect(Collectors.toList());
            i += tasks.size();
            try {
                logger.info("invoking batch of {} tasks to workers", tasks.size());
                long start = System.currentTimeMillis();
                final List<Future<W>> futures = executor.invokeAll(tasks);
                long end = System.currentTimeMillis();
                logger.info("done batch processing took {} ms", end - start);
                for (Future<W> future : futures) {
                    future.get();
                }
            } catch (InterruptedException e) {
                logger.error("{}", e); // comments explain better handling according to real business requirements
            } catch (ExecutionException e) {
                logger.error("error: ", e);
            }
        }
        MoreExecutors.shutdownAndAwaitTermination(executor, 60, TimeUnit.SECONDS);
        return postprocessor.aggregate(partResult);
    }

    private ExecutorService getExecutor(int poolSize) {
        final ThreadFactory threadFactory = new ThreadFactoryBuilder()
                .setNameFormat("Processor-%d")
                .setDaemon(true)
                .build();
        return new ThreadPoolExecutor(poolSize, poolSize, 60, TimeUnit.SECONDS, new LinkedBlockingDeque<>(), threadFactory);
    }
}
So, if I understand correctly, you have a Preprocessor that takes a T and splits it into a V[], a Processor that transforms each V into a W, and a Postprocessor that aggregates a W[] into a U, right? And you must assemble those pieces.
First of all, arrays and generics really don't mix, so it's bizarre for those methods to return arrays rather than lists. For production-quality code, generic arrays shouldn't be used.
So, to recap:

T --> V1 --> W1 --> U
      V2 --> W2
       .     .
       .     .
      Vn --> Wn
So you could do this:
V[] parts = preprocessor.split(t);
W[] transformedParts =
        (W[]) Arrays.stream(parts) // unchecked cast due to the use of generic arrays
                .parallel()        // this is where concurrency happens
                .map(processor::processPart)
                .toArray();
U result = postProcessor.aggregate(transformedParts);
If you use lists instead of arrays, it can be written as a single expression:
U result =
        postProcessor.aggregate(
                preprocessor.split(t)
                        .parallelStream()
                        .map(processor::processPart)
                        .collect(Collectors.toList()));
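One caveat for "production quality": parallelStream() runs on the shared ForkJoinPool.commonPool(). If you want a dedicated pool size, as the original getExecutor() provided, a common workaround (a sketch, assuming the same preprocessor/processor/postProcessor objects as above) is to submit the whole pipeline to your own ForkJoinPool, because stream tasks forked from inside a ForkJoinPool run on that pool:

ForkJoinPool pool = new ForkJoinPool(Runtime.getRuntime().availableProcessors());
try {
    // submit(...).get() throws InterruptedException/ExecutionException
    U result = pool.submit(() ->
            postProcessor.aggregate(
                    preprocessor.split(t)
                            .parallelStream()
                            .map(processor::processPart)
                            .collect(Collectors.toList())))
            .get();
} finally {
    pool.shutdown();
}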

do...while() using Java 8 stream?

I want to convert this Java do...while() loop to Java 8 streams.
private static final Integer PAGE_SIZE = 200;

int offset = 0;
Page page = null;
do {
    // Get the next page of items.
    page = apiService.get(selector);
    // Display the items.
    if (page.getEntries() != null) {
        for (Item item : page.getEntries()) {
            System.out.printf("Item with name '%s' and ID %d was found.%n", item.getName(),
                    item.getId());
        }
    } else {
        System.out.println("No items were found.");
    }
    offset += PAGE_SIZE;
    selector = builder.increaseOffsetBy(PAGE_SIZE).build();
} while (offset < page.getTotalNumEntries());
This code makes an API call to apiService and retrieves data, looping while offset is less than totalNumEntries.
What prohibits me from using while(), a foreach with a step, or any other kind of loop is that I don't know totalNumEntries without making the API call (which is done inside the loop).
One option I can think of is making the API call once just to get totalNumEntries and then proceeding with the loop.
If you really want/need a stream API for retrieving pages, you can create your own stream by implementing a Spliterator that retrieves each page in its tryAdvance() method.
It would look something like this:
public class PageSpliterator implements Spliterator<Page> {

    private static final Integer PAGE_SIZE = 200;

    int offset;
    ApiService apiService;
    int selector;
    Builder builder;
    Page page;

    public PageSpliterator(ApiService apiService) {
        // initialize Builder?
    }

    @Override
    public boolean tryAdvance(Consumer<? super Page> action) {
        if (page == null || offset < page.getTotalNumEntries()) {
            Objects.requireNonNull(action);
            page = apiService.get(selector);
            action.accept(page);
            offset += PAGE_SIZE;
            selector = builder.increaseOffsetBy(PAGE_SIZE).build();
            return true;
        } else {
            // Maybe close/clean up apiService?
            return false;
        }
    }

    @Override
    public Spliterator<Page> trySplit() {
        return null; // can't split
    }

    @Override
    public long estimateSize() {
        return Long.MAX_VALUE; // don't know in advance
    }

    @Override
    public int characteristics() {
        return IMMUTABLE; // return appropriate characteristics
    }
}
Then you could use it like this:
StreamSupport.stream(new PageSpliterator(apiService), false)
        .flatMap(page -> page.getEntries()
                .stream())
        .forEach(item -> System.out.printf("Item with name '%s' and ID %d was found.%n",
                item.getName(), item.getId()));
In my opinion there are not many scenarios where a do...while loop is the best choice; this, however, is one of them.
Just because there is new stuff in Java 8 does not mean you have to use it.
If you still want to implement it without do...while, for whatever reason, I would go for the option you mentioned: do the API call once at the beginning and then loop, as sketched below.
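A minimal sketch of that option, reusing the question's own apiService, selector, and builder (printEntries is a hypothetical helper standing in for the display block from the question):

Page page = apiService.get(selector);   // first call, made unconditionally
int total = page.getTotalNumEntries();
printEntries(page);                     // hypothetical helper: the printf/else block above
for (int offset = PAGE_SIZE; offset < total; offset += PAGE_SIZE) {
    selector = builder.increaseOffsetBy(PAGE_SIZE).build();
    printEntries(apiService.get(selector));
}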

RxJava flatMap and backpressure strange behavior

While writing a data synchronization job with RxJava I discovered a strange behavior that I cannot explain. I'm quite a novice with RxJava and would appreciate help.
Briefly, my job is quite simple: I have a list of element IDs, I call a webservice to get each element by ID, do some processing, and make multiple calls to push data to the DB.
Data loading is faster than data storing, so I encountered OutOfMemory errors.
My code pretty much looks like the "failing" test below, but while doing some tests I realized that removing the line
flatMap(dt -> Observable.just(dt))
makes it work.
The failing test's output clearly shows unconsumed items stacking up, which leads to OutOfMemory. The working test's output shows that the producer always waits for the consumer, so it never leads to OutOfMemory.
public static class DataStore {
    public Integer myVal;
    public byte[] myBigData;

    public DataStore(Integer myVal) {
        this.myVal = myVal;
        this.myBigData = new byte[1000000];
    }
}

@Test
public void working() {
    int MAX_CONCURRENT_LOAD = 1;
    int MAX_CONCURRENT_STORE = 2;

    AtomicInteger nbUnconsumed = new AtomicInteger(0);
    List<Integer> ids = IntStream.range(0, 1000).boxed().collect(Collectors.toList());
    Observable.from(ids)
            .flatMap(this::produce, MAX_CONCURRENT_LOAD)
            .doOnNext(s -> logger.info("+1 Total unconsumed values: " + nbUnconsumed.incrementAndGet()))
            .flatMap(this::consume, MAX_CONCURRENT_STORE)
            .doOnNext(s -> logger.info("-1 Total unconsumed values: " + nbUnconsumed.decrementAndGet()))
            .toBlocking().forEach(s -> {});
    logger.info("Finished");
}

@Test
public void failing() {
    int MAX_CONCURRENT_LOAD = 1;
    int MAX_CONCURRENT_STORE = 2;

    AtomicInteger nbUnconsumed = new AtomicInteger(0);
    List<Integer> ids = IntStream.range(0, 1000).boxed().collect(Collectors.toList());
    Observable.from(ids)
            .flatMap(this::produce, MAX_CONCURRENT_LOAD)
            .doOnNext(s -> logger.info("+1 Total unconsumed values: " + nbUnconsumed.incrementAndGet()))
            .flatMap(dt -> Observable.just(dt))
            .flatMap(this::consume, MAX_CONCURRENT_STORE)
            .doOnNext(s -> logger.info("-1 Total unconsumed values: " + nbUnconsumed.decrementAndGet()))
            .toBlocking().forEach(s -> {});
    logger.info("Finished");
}

private Observable<DataStore> produce(final int value) {
    return Observable.<DataStore>create(s -> {
        try {
            if (!s.isUnsubscribed()) {
                Thread.sleep(200); // here I synchronously call a WS to retrieve data
                s.onNext(new DataStore(value));
                s.onCompleted();
            }
        } catch (Exception e) {
            s.onError(e);
        }
    }).subscribeOn(Schedulers.io());
}

private Observable<Boolean> consume(DataStore value) {
    return Observable.<Boolean>create(s -> {
        try {
            if (!s.isUnsubscribed()) {
                Thread.sleep(1000); // here I synchronously call the DB to store data
                s.onNext(true);
                s.onCompleted();
            }
        } catch (Exception e) {
            s.onNext(false);
            s.onCompleted();
        }
    }).subscribeOn(Schedulers.io());
}
What is the explanation behind this behavior? How could I solve my failing test without removing the Observable.just(dt), which in my real case is an Observable.from(someListOfItems)?
flatMap by default merges an unlimited number of sources. By applying that specific lambda without the maxConcurrent parameter, you essentially unbounded the upstream, which can now run at full speed, overwhelming the internal buffers of the other operators.
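A hedged sketch of one way to keep that middle stage without unbounding the stream: give the pass-through flatMap its own maxConcurrent argument, so backpressure is preserved end to end (the bound of 1 here is just illustrative):

.flatMap(this::produce, MAX_CONCURRENT_LOAD)
.doOnNext(s -> logger.info("+1 Total unconsumed values: " + nbUnconsumed.incrementAndGet()))
.flatMap(dt -> Observable.just(dt), 1)          // bounded pass-through keeps backpressure intact
// or, in the real case:
// .flatMap(dt -> Observable.from(someListOfItems), 1)
.flatMap(this::consume, MAX_CONCURRENT_STORE)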
