Observable does IO and the following flow runs on the IO scheduler - Java

I saw some odd behavior in RxJava with the following code:
package com.hotels.guestreview.infrastructure.repository;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import rx.Observable;
import rx.functions.Action1;
import rx.schedulers.Schedulers;
import org.apache.commons.lang.RandomStringUtils;
public class Main {
public static void main(String[] args) {
final Main m = new Main();
m.run();
}
public void run() {
final List<String> result = Observable.from(new Integer[]{4, 5, 6, 6, 7, 3})
.doOnNext(debug("Init"))
.flatMap(i -> Observable.defer(() -> toRandomList(i)).subscribeOn(Schedulers.io()))
.doOnNext(debug("defer"))
.flatMap(this::chooseString)
.doOnNext(debug("chooseString"))
.toList()
.doOnNext(debug("list"))
.toBlocking()
.single();
System.out.println("\nresult = " + result);
}
public static Observable<List<String>> toRandomList(Integer n) {
debug("perform IO").call(n);
try {
Thread.sleep(new Random().nextInt(3000));
} catch (InterruptedException e) {
e.printStackTrace();
}
debug("IO done").call(n);
final List<String> result = Stream.iterate(0, t -> t + 1)
.map(i -> RandomStringUtils.randomAlphanumeric(n))
.limit(n)
.collect(Collectors.toList());
return Observable.just(result);
}
public Observable<String> chooseString(List<String> list) {
// guilty code
/*
try {
Thread.sleep(new Random().nextInt(3000));
} catch (InterruptedException e) {
e.printStackTrace();
}
*/
// end guilty code
if (Math.random() > .3) {
return Observable.just(list.get(new Random().nextInt(list.size())));
}
else {
return Observable.empty();
}
}
public static <T> Action1<T> debug(String s) {
return o -> System.out.println(o + " | " + s + " | " + Thread.currentThread().getName());
}
}
I'm trying to execute the method toRandomList on the io scheduler. With the guilty code commented out, everything works fine: each toRandomList emission, and the flow that follows it, runs on a separate thread.
If I uncomment the guilty code (adding the sleep to the chooseString method), every step after toRandomList is executed on the same thread.
Why is this happening? What am I doing wrong?
Thanks in advance

The problem is in the flatMap, which should be refactored as:
Observable.from(new Integer[]{4, 5, 6, 6, 7, 3})
.doOnNext(debug("Init"))
.flatMap(i -> Observable.defer(() -> toRandomList(i))
.doOnNext(debug("defer"))
.flatMap(this::chooseString)
.subscribeOn(Schedulers.io())
)
This way, the whole sub-flow defined inside the flatMap, where subscribeOn is called, is executed on a thread of the chosen Scheduler.
Also, as @Dmitry pointed out in his answer, a better approach is to use fromCallable instead of the combination of defer and just/empty.

That's because you are using Observable.just to create your stream inside toRandomList. Observable.just creates a stream from an already computed value. What you want is to do some computation before returning a value, so you need a different operator, for example Observable.fromCallable:
public static Observable<List<String>> toRandomList(Integer n) {
return Observable.fromCallable(() -> {
debug("perform IO").call(n);
try {
Thread.sleep(new Random().nextInt(3000));
} catch (InterruptedException e) {
e.printStackTrace();
}
debug("IO done").call(n);
return Stream.iterate(0, t -> t + 1)
.map(i -> "1")
.limit(n)
.collect(Collectors.toList());
});
}
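The difference between wrapping an already computed value and deferring the computation can be illustrated without RxJava. The sketch below is only an analogy built on plain JDK types: the class EagerVsDeferred, its helper methods, and the "io-worker" thread name are hypothetical and not part of the original code. A Supplier stands in for the Observable, and a single-thread executor stands in for Schedulers.io().

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

class EagerVsDeferred {

    // Stand-in for the slow work: reports the name of the thread that actually ran it.
    static String slowWork() {
        return Thread.currentThread().getName();
    }

    // Eager (analogous to Observable.just): the work runs here, on the caller's thread,
    // and the supplier only hands back the stored result.
    static Supplier<String> eager() {
        String alreadyComputed = slowWork();
        return () -> alreadyComputed;
    }

    // Deferred (analogous to Observable.fromCallable): nothing runs until get() is called,
    // so the work happens on whichever thread calls get().
    static Supplier<String> deferred() {
        return EagerVsDeferred::slowWork;
    }

    // Evaluates the supplier on a dedicated worker thread and returns its result.
    static String runOnWorker(Supplier<String> supplier) {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> new Thread(r, "io-worker"));
        try {
            return CompletableFuture.supplyAsync(supplier, worker).join();
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("caller thread   = " + Thread.currentThread().getName());
        System.out.println("eager ran on    = " + runOnWorker(eager()));
        System.out.println("deferred ran on = " + runOnWorker(deferred()));
    }
}
```

In this analogy the eager variant reports the caller's thread even though it is handed to the worker, because the work had already finished before the worker was involved.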


Printing the progress/lag of my Kafka Streams application

Is there a nice way to print the progress in a Kafka Streams app? I feel that my app is falling behind, and I want a nice way to show its progress in processing events.
Out of the box, not within the Streams API.
You're more than welcome to import the methods that ConsumerGroupCommand.scala uses to get the group lag, and calculate/print from there.
Or you can externally install a tool like Burrow or Remora, which have REST APIs for accessing lag information.
I wrote the following class to help me print the lag/progress easily:
package util;
import lombok.extern.slf4j.Slf4j;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListConsumerGroupOffsetsResult;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Function;
import java.util.stream.Collectors;
@Slf4j
public class LagLogger implements AutoCloseable {
private ScheduledExecutorService scheduledExecutorService = Executors.newScheduledThreadPool(1);
private String topic;
private String consumerGroupName;
private int logDelayInMilliSeconds;
private Properties kafkaStreamsProperties;
private boolean closed;
private AdminClient adminClient;
public LagLogger(String topic, String consumerGroupName, Properties kafkaStreamProperties, int logDelayInMilliSeconds) {
this.topic = topic;
this.kafkaStreamsProperties = kafkaStreamProperties;
this.logDelayInMilliSeconds = logDelayInMilliSeconds;
this.consumerGroupName = consumerGroupName;
adminClient = AdminClient.create(LagLogger.this.kafkaStreamsProperties);
}
public class LagVisualizerTask implements AutoCloseable, Runnable {
public LagVisualizerTask() {
}
public void run() {
ListConsumerGroupOffsetsResult listConsumerGroupOffsetsResult = adminClient.listConsumerGroupOffsets(LagLogger.this.consumerGroupName);
// Current offsets.
Map<TopicPartition, OffsetAndMetadata> topicPartitionOffsetAndMetadataMap = null;
try {
topicPartitionOffsetAndMetadataMap = listConsumerGroupOffsetsResult.partitionsToOffsetAndMetadata().get();
} catch (InterruptedException | ExecutionException e) {
e.printStackTrace();
// Offsets are unavailable; skip this round rather than dereference a null map below.
return;
}
// all topic partitions.
Set<TopicPartition> topicPartitions = topicPartitionOffsetAndMetadataMap.keySet();
// list of end offsets for each partitions.
ListOffsetsResult listOffsetsResult = adminClient.listOffsets(topicPartitions.stream()
.collect(Collectors.toMap(Function.identity(), tp -> OffsetSpec.latest())));
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.append(topic+": ");
for (var entry : topicPartitionOffsetAndMetadataMap.entrySet()) {
String finalString = stringBuilder.toString();
if (entry.getKey().topic().equals(LagLogger.this.topic)) {
long current_offset = entry.getValue().offset();
long end_offset = 0;
try {
end_offset = listOffsetsResult.partitionResult(entry.getKey()).get().offset();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
stringBuilder.append(current_offset);
stringBuilder.append(" --> ");
stringBuilder.append(end_offset);
stringBuilder.append(" ("+String.format("%.2f", ((double)current_offset/end_offset)*100) +"%)");
stringBuilder.append(" / ");
}
}
log.info(stringBuilder.toString());
}
public void close() {
closed = true;
}
}
public LagVisualizerTask startNewLagVisualizerTask() {
LagVisualizerTask lagVisualizerTask = new LagVisualizerTask();
scheduledExecutorService.scheduleWithFixedDelay(lagVisualizerTask,0, LagLogger.this.logDelayInMilliSeconds, TimeUnit.MILLISECONDS);
return lagVisualizerTask;
}
public void close() {
if (scheduledExecutorService != null) {
scheduledExecutorService.shutdownNow();
scheduledExecutorService = null;
}
}
}
Which can be used as follows:
LagLogger lagVisualizer = new LagLogger(INPUT_TOPIC_NAME,APPLICATION_ID,configuration.getKafkaStreamsProperties(),DELY_BETWEEN_LOGS);
lagVisualizer.startNewLagVisualizerTask();
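The percentage that LagLogger prints is derived from two numbers per partition: the committed offset and the end offset. A minimal sketch of that arithmetic, independent of the Kafka client, might look like the following. The class LagMath and its methods are hypothetical, and partitions are keyed by plain strings here rather than TopicPartition.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LagMath {

    // Lag is the number of records between the committed offset and the end of the partition.
    static long lag(long committedOffset, long endOffset) {
        return Math.max(0L, endOffset - committedOffset);
    }

    // Progress in percent. An empty partition (end offset 0) is reported as fully
    // processed so that the division never sees a zero denominator.
    static double progressPercent(long committedOffset, long endOffset) {
        if (endOffset <= 0) {
            return 100.0;
        }
        return Math.min(100.0, (double) committedOffset / endOffset * 100.0);
    }

    // Sums the lag over every partition that appears in both maps.
    static long totalLag(Map<String, Long> committed, Map<String, Long> end) {
        long total = 0;
        for (Map.Entry<String, Long> entry : committed.entrySet()) {
            Long endOffset = end.get(entry.getKey());
            if (endOffset != null) {
                total += lag(entry.getValue(), endOffset);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up offsets for two hypothetical partitions.
        Map<String, Long> committed = new LinkedHashMap<>();
        Map<String, Long> end = new LinkedHashMap<>();
        committed.put("input-0", 750L);
        end.put("input-0", 1000L);
        committed.put("input-1", 0L);
        end.put("input-1", 0L);

        for (String partition : committed.keySet()) {
            long c = committed.get(partition);
            long e = end.get(partition);
            System.out.println(partition + ": lag=" + lag(c, e)
                    + " progress=" + String.format("%.2f", progressPercent(c, e)) + "%");
        }
        System.out.println("total lag = " + totalLag(committed, end));
    }
}
```

Note that a percentage measured from offset 0 assumes the partition still starts at offset 0; the absolute lag is usually the more robust number to watch.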

How to get server status periodically using multiple threads

The code below works fine: it connects to a given server (host, port) and gets the connection status.
What it does:
PollService implements the Callable interface, connects to a server (host, port), and returns the status.
Since this should happen periodically, it iterates over the HashMap entries in an infinite while(true) loop.
The problem: on the server side, I see that it takes 2 or 3 seconds for a connection to reach the thread, whereas with a Runnable and a periodic implementation it connects within 1 second. It looks like iterating over the HashMap infinitely is a slow approach.
However, I cannot use Runnable, as it doesn't return the connection status, which I need later.
Below is the ServicesMonitor class (client), which connects to the server.
package org.example;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.stream.Collectors;
public class ServicesMonitor {
private ExecutorService scheduledExecutorService = null;
private static Logger logger = Logger.getLogger(ServicesMonitor.class.getName());
private final Map<ServiceType, List<ClientMonitorService>> clientMonitorServicesMap = new HashMap<>();
public void registerInterest(ClientMonitorService clientMonitorService) {
clientMonitorServicesMap.computeIfAbsent(clientMonitorService.getServiceToMonitor().getServiceType(), v -> new ArrayList<>()).add(clientMonitorService);
}
public Map<ServiceType, List<ClientMonitorService>> getClineMonitorService() {
return clientMonitorServicesMap;
}
public void poll(){
//Observable.interval(1, TimeUnit.SECONDS).st
}
public void pollServices() {
scheduledExecutorService = Executors.newFixedThreadPool(clientMonitorServicesMap.size());
try {
while (true) {
clientMonitorServicesMap.forEach((k, v) -> {
Future<Boolean> val = scheduledExecutorService.submit(new PollService(k));
try {
boolean result = val.get();
System.out.println("service " + k.getHost() + ":" + k.getPort() + " status is " + result);
if (result) {
List<ClientMonitorService> list = v.stream().filter(a -> LocalDateTime.now().getSecond() % a.getServiceToMonitor().getFreqSec() == 0)
.collect(Collectors.toList());
list.stream().forEach(a -> System.out.println(a.getClientId()));
}
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
});
}
} catch (Exception e) {
logger.log(Level.SEVERE, e.getMessage());
} finally {
scheduledExecutorService.shutdown();
}
}
}
How can I improve the performance of this code by reducing the time it takes to connect to the server?
How can I improve this code in general?
After switching to get(1, TimeUnit.SECONDS), I started to see an improvement on the server side as well (the threads are reached in less than 1 second), since the client no longer waits more than 1 second for each result.
while (true) {
clientMonitorServicesMap.forEach((k, v) -> {
Future<Boolean> val = scheduledExecutorService.submit(new PollService(k));
try {
boolean result = val.get(1, TimeUnit.SECONDS);
System.out.println("service " + k.getHost() + ":" + k.getPort() + " status is " + result);
if (result) {
List<ClientMonitorService> list = v.stream()
//.filter(a -> LocalDateTime.now().getSecond() % a.getServiceToMonitor().getFreqSec() == 0)
.collect(Collectors.toList());
list.stream().forEach(a -> System.out.println(a.getClientId()));
}
} catch (InterruptedException e) {
logger.log(Level.WARNING,"Interrupted -> " + k.getHost()+":"+k.getPort());
} catch (ExecutionException e) {
logger.log(Level.INFO,"ExecutionException exception -> "+ k.getHost()+":"+k.getPort());
} catch (TimeoutException e) {
logger.log(Level.INFO,"TimeoutException exception -> "+ k.getHost()+":"+k.getPort());
}
});
}
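In the loop above, each task is submitted and then immediately waited on, so the services are still polled one after another. One possible alternative, sketched here with JDK classes only, is to submit all checks at once with invokeAll and a shared timeout. The class ConcurrentPoll, the method pollAll, and the string keys are hypothetical stand-ins for PollService and ServiceType, whose definitions are not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class ConcurrentPoll {

    // Runs every check concurrently; a check that fails or misses the timeout counts as "down".
    static Map<String, Boolean> pollAll(Map<String, Callable<Boolean>> checks, long timeoutMillis) {
        List<String> names = new ArrayList<>(checks.keySet());
        List<Callable<Boolean>> tasks = new ArrayList<>();
        for (String name : names) {
            tasks.add(checks.get(name));
        }
        Map<String, Boolean> status = new LinkedHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        try {
            // invokeAll returns the futures in task order and cancels whatever is unfinished at the timeout.
            List<Future<Boolean>> futures = pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            for (int i = 0; i < names.size(); i++) {
                status.put(names.get(i), resultOrFalse(futures.get(i)));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return status;
    }

    // Maps a finished future to a status flag, treating cancellation and failure as "down".
    static boolean resultOrFalse(Future<Boolean> future) {
        try {
            return Boolean.TRUE.equals(future.get());
        } catch (CancellationException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, Callable<Boolean>> checks = new LinkedHashMap<>();
        checks.put("fast-service", () -> true);
        checks.put("slow-service", () -> { Thread.sleep(5000); return true; });
        System.out.println(pollAll(checks, 500));
    }
}
```

A long-running monitor would more likely keep one pool alive and call a method like this from a ScheduledExecutorService; the pool is created per call here only to keep the sketch self-contained.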

How to wait for full completion of a completable future with runAsync?

This test fails:
package com.stackoverflow.demo;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ForkJoinPool;
import org.junit.Assert;
import org.junit.Test;
public class AsyncTest {
@Test
public void test1() {
Assert.assertTrue("please run this test in a machine with 2 or more cores", ForkJoinPool.getCommonPoolParallelism() > 1);
CompletableFuture<String> cf = CompletableFuture.completedFuture("ok");
ConcurrentLinkedQueue<String> out = new ConcurrentLinkedQueue<>();
cf.thenRunAsync(() -> {
out.add("one");
try {
Thread.sleep(2000);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
out.add("two");
}, ForkJoinPool.commonPool());
cf.join();
Assert.assertEquals(2, out.size());
}
}
I was surprised because I expected cf.join() to take all attached tasks into account. I am sure the documentation says somewhere that join only waits for the initial task, but somehow I missed it.
How can I get the behavior I want: wait for a CompletableFuture and all its attached subtasks to complete?
Fixed it while proofreading my post: thenRunAsync returns a new CompletableFuture for the dependent stage, and joining that one waits for the task:
public class AsyncTest {
@Test
public void test1() {
Assert.assertTrue("please run this test in a machine with 2 or more cores", ForkJoinPool.getCommonPoolParallelism() > 1);
CompletableFuture<String> cf = CompletableFuture.completedFuture("ok");
ConcurrentLinkedQueue<String> out = new ConcurrentLinkedQueue<>();
CompletableFuture<Void> cf2 = cf.thenRunAsync(() -> {
out.add("one");
try {
Thread.sleep(2000);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
out.add("two");
}, ForkJoinPool.commonPool());
cf2.join();
Assert.assertEquals(2, out.size());
}
}
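When several stages hang off the same future, one way to wait for all of them is to keep each returned stage and combine them with CompletableFuture.allOf. The following is a minimal sketch; the class JoinDependents and its methods are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

class JoinDependents {

    static List<String> runAndWait() {
        CompletableFuture<String> source = CompletableFuture.completedFuture("ok");
        ConcurrentLinkedQueue<String> out = new ConcurrentLinkedQueue<>();

        // Each then...Async call returns a new future that represents the dependent stage.
        CompletableFuture<Void> first = source.thenRunAsync(() -> {
            pause(200);
            out.add("one");
        });
        CompletableFuture<Void> second = source.thenAcceptAsync(value -> {
            pause(100);
            out.add("two:" + value);
        });

        // allOf completes once every listed stage has completed; join blocks until then.
        CompletableFuture.allOf(first, second).join();
        return new ArrayList<>(out);
    }

    // Sleeps without forcing callers to handle the checked exception.
    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAndWait());
    }
}
```

This only covers stages whose futures are kept; a dependent stage whose returned future is discarded cannot be waited on through the source future.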

Second parallel stream starts before the first completes

import java.util.Arrays;
import java.util.List;
import java.util.Random;
public class Main {
public static void main(String[] args) {
List<Integer> fullList = Arrays.asList(1,2,3,4,5,6,7,8,9,10,11,12,13,14);
List<Integer> toBeLast = Arrays.asList(9,10,11,12);
Random r = new Random();
fullList.parallelStream().filter(l -> !toBeLast.contains(l)).forEach(l -> {
System.out.println("L1 : " + l);
try {
Thread.sleep(Math.abs(r.nextLong() % 1000));
System.out.println(l);
}
catch(InterruptedException i) {
}
});
toBeLast.parallelStream().forEach(l -> {
System.out.println("L2 : " + l);
try {
Thread.sleep(Math.abs(r.nextLong() % 1000));
System.out.println(l);
}
catch(InterruptedException i) {
}
});
}
}
Expectation: complete 1-8 and 13-14, then start 9-12.
The REST call triggers a shell script on the server, which takes 15-90 seconds each time.
Actual: at one point I see the scripts for 2 and 11 running on the server. I don't see the sysout for 2 yet, and there is no exception on the server or in the program.
I'm wondering how it was possible to trigger 11 before completing 2?
Something is not right in the question. Here is the code that I wrote; in my example, the first stream always completes before the second stream starts.
import java.util.Arrays;
import java.util.List;
import java.util.Random;
public class Main {
public static void main(String[] args) {
List<Integer> l1 = Arrays.asList(1,2,3,4,5,6,7,8,9,10,11,12,13,14,1,2,3,4,5,6,7,8,9,10,11,12,13,14,1,2,3,4,5,6,7,8,9,10,11,12,13,14,1,2,3,4,5,6,7,8,9,10,11,12,13,14);
List<Integer> l2 = Arrays.asList(21,22,23,24,25,26,27,28,29,30,31,32,33,34,21,22,23,24,25,26,27,28,29,30,31,32,33,34,21,22,23,24,25,26,27,28,29,30,31,32,33,34,21,22,23,24,25,26,27,28,29,30,31,32,33,34);
Random r = new Random();
l1.parallelStream().forEach(e -> {
System.out.println("L1 : " + e);
try {
Thread.sleep(Math.abs(r.nextLong() % 1000));
}
catch(InterruptedException i) {
}
});
l2.parallelStream().forEach(e -> {
System.out.println("L2 : " + e);
try {
Thread.sleep(Math.abs(r.nextLong() % 1000));
}
catch(InterruptedException i) {
}
});
}
}
My guess is that you are using an HTTP client library that does the work in the background, which is why the second stream starts before the first one finishes.
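forEach is a terminal operation, so the call returns only after the action has returned for every element; what it cannot know about is work that the action merely starts and leaves running. The sketch below records events from two consecutive parallel streams with a blocking action and checks that no element of the first stream is logged after one of the second. The class StreamOrdering and its helpers are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class StreamOrdering {

    // Runs two parallel streams back to back and returns the events in the order they were logged.
    static List<String> runBothStreams() {
        Queue<String> events = new ConcurrentLinkedQueue<>();
        List<Integer> first = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 13, 14);
        List<Integer> second = Arrays.asList(9, 10, 11, 12);

        // The action blocks until its "work" is done, so forEach returns only after all elements finish.
        first.parallelStream().forEach(n -> {
            pause(20);
            events.add("first:" + n);
        });
        second.parallelStream().forEach(n -> {
            pause(20);
            events.add("second:" + n);
        });
        return new ArrayList<>(events);
    }

    // True if no "first" event appears after a "second" event.
    static boolean firstFinishedBeforeSecond(List<String> events) {
        boolean secondSeen = false;
        for (String event : events) {
            if (event.startsWith("second:")) {
                secondSeen = true;
            } else if (secondSeen) {
                return false;
            }
        }
        return true;
    }

    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        List<String> events = runBothStreams();
        System.out.println(events);
        System.out.println("first finished before second: " + firstFinishedBeforeSecond(events));
    }
}
```

If the action instead handed the request to an asynchronous client and returned immediately, the log order would still look correct while the remote scripts could overlap, which matches the behavior described in the question.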

Spark application does not stop when multiple threads share the same Spark context

I have tried to reproduce the problem I am facing. Problem statement: a folder contains multiple files. I need to do a word count for each file and print the result. The files should be processed in parallel, though of course there is a limit to the parallelism. I have written the following code to accomplish this, and it runs fine. The cluster has a MapR Spark installation, with spark.scheduler.mode = FIFO.
Q1: Is there a better way to accomplish the task described above?
Q2: I have observed that the application does not stop even after it has finished counting the words in the available files. I am unable to figure out how to deal with this.
package groupId.artifactId;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
public class Executor {
/**
* @param args
*/
public static void main(String[] args) {
final int threadPoolSize = 5;
SparkConf sparkConf = new SparkConf().setMaster("yarn-client").setAppName("Tracker").set("spark.ui.port","0");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
ExecutorService executor = Executors.newFixedThreadPool(threadPoolSize);
List<Future> listOfFuture = new ArrayList<Future>();
for (int i = 0; i < 20; i++) {
if (listOfFuture.size() < threadPoolSize) {
FlexiWordCount flexiWordCount = new FlexiWordCount(jsc, i);
Future future = executor.submit(flexiWordCount);
listOfFuture.add(future);
} else {
boolean allFutureDone = false;
while (!allFutureDone) {
allFutureDone = checkForAllFuture(listOfFuture);
System.out.println("Threads not completed yet!");
try {
Thread.sleep(2000);//waiting for 2 sec, before next check
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
printFutureResult(listOfFuture);
System.out.println("printing of future done");
listOfFuture.clear();
System.out.println("future list got cleared");
}
}
try {
executor.awaitTermination(5, TimeUnit.MINUTES);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private static void printFutureResult(List<Future> listOfFuture) {
Iterator<Future> iterateFuture = listOfFuture.iterator();
while (iterateFuture.hasNext()) {
Future tempFuture = iterateFuture.next();
try {
System.out.println("Future result " + tempFuture.get());
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ExecutionException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
private static boolean checkForAllFuture(List<Future> listOfFuture) {
boolean status = true;
Iterator<Future> iterateFuture = listOfFuture.iterator();
while (iterateFuture.hasNext()) {
Future tempFuture = iterateFuture.next();
if (!tempFuture.isDone()) {
status = false;
break;
}
}
return status;
}
package groupId.artifactId;
import java.io.Serializable;
import java.util.Arrays;
import java.util.concurrent.Callable;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;
public class FlexiWordCount implements Callable<Object>,Serializable {
private static final long serialVersionUID = 1L;
private JavaSparkContext jsc;
private int fileId;
public FlexiWordCount(JavaSparkContext jsc, int fileId) {
super();
this.jsc = jsc;
this.fileId = fileId;
}
private static class Reduction implements Function2<Integer, Integer, Integer>{
@Override
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
}
private static class KVPair implements PairFunction<String, String, Integer>{
@Override
public Tuple2<String, Integer> call(String paramT)
throws Exception {
return new Tuple2<String, Integer>(paramT, 1);
}
}
private static class Flatter implements FlatMapFunction<String, String>{
@Override
public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
}
@Override
public Object call() throws Exception {
JavaRDD<String> jrd = jsc.textFile("/root/folder/experiment979/" + fileId +".txt");
System.out.println("inside call() for fileId = " + fileId);
JavaRDD<String> words = jrd.flatMap(new Flatter());
JavaPairRDD<String, Integer> ones = words.mapToPair(new KVPair());
JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Reduction());
return counts.collect();
}
}
}
Why is the program not closing automatically?
Ans: You have not closed the SparkContext. Try changing the main method to this:
public static void main(String[] args) {
final int threadPoolSize = 5;
SparkConf sparkConf = new SparkConf().setMaster("yarn-client").setAppName("Tracker").set("spark.ui.port","0");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
ExecutorService executor = Executors.newFixedThreadPool(threadPoolSize);
List<Future> listOfFuture = new ArrayList<Future>();
for (int i = 0; i < 20; i++) {
if (listOfFuture.size() < threadPoolSize) {
FlexiWordCount flexiWordCount = new FlexiWordCount(jsc, i);
Future future = executor.submit(flexiWordCount);
listOfFuture.add(future);
} else {
boolean allFutureDone = false;
while (!allFutureDone) {
allFutureDone = checkForAllFuture(listOfFuture);
System.out.println("Threads not completed yet!");
try {
Thread.sleep(2000);//waiting for 2 sec, before next check
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
printFutureResult(listOfFuture);
System.out.println("printing of future done");
listOfFuture.clear();
System.out.println("future list got cleared");
}
}
try {
executor.awaitTermination(5, TimeUnit.MINUTES);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
jsc.stop(); // release the SparkContext once all jobs are done
}
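Independently of Spark, the thread pool itself can keep the JVM alive: awaitTermination only waits and does not initiate a shutdown, and the threads created by Executors.newFixedThreadPool are non-daemon. A minimal JDK-only sketch of the usual pattern follows; the class PoolShutdown and its method are hypothetical, and in the main method above the corresponding shutdown() call would go just before executor.awaitTermination.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class PoolShutdown {

    // Runs taskCount small tasks on a fixed pool and reports whether the pool terminated in time.
    static boolean runAndShutDown(int taskCount, int poolSize) {
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            final int id = i;
            Callable<Integer> task = () -> id * id;
            futures.add(executor.submit(task));
        }
        try {
            // Wait for every result before shutting the pool down.
            for (Future<Integer> future : futures) {
                future.get();
            }
            // shutdown() stops the pool from accepting new work and lets idle workers exit;
            // without it, awaitTermination would simply block for the full timeout.
            executor.shutdown();
            return executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        } finally {
            // Safety net for the error paths; a no-op if the pool has already terminated.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("pool terminated: " + runAndShutDown(20, 5));
    }
}
```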
Is there a better way?
Ans: Yes. You should pass the directory containing the files to the SparkContext and use .textFile on the directory; in that case Spark parallelizes the reads from the directory across the executors. If you create threads yourself and then use the same Spark context to resubmit a job for each file, you add the extra overhead of submitting a separate job for every file.
I think the fastest approach is to pass the entire directory directly, create an RDD from it, and then let Spark launch parallel tasks to process all the files on different executors. You can experiment with the .repartition() method on the RDD, as it would launch that many tasks to run in parallel.
