NPE while calling context.forward() with the low-level Kafka Streams API

I have built a plain Kafka Streams application using the low-level Processor API. The topology is linear:
p1 -> p2 -> p3
When I call context.forward(), I get an NPE. Here is the stack trace:
java.lang.NullPointerException: null
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:178)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:133)
...
I am using Kafka Streams 2.3.0.
I found a similar SO question [here][1], but it concerns a much older version, so I am not sure whether this is the same error.
Edit
Here is some more information, showing the gist of what I am doing:
public class SP1Processor implements StreamProcessor {
    private StreamProcessingContext ctxt;
    // In init(), create a single thread pool
    // which does some processing and sends the
    // data to next processor
    @Override
    void init(StreamProcessingContext ctxt) {
        this.ctxt = ctxt;
        // Create a thread pool, do some work
        // and then do this.ctxt.forward(K,V)
        // Not showing code of Thread pool
        // Strangely, inside this thread pool,
        // this.ctxt isn't same what I see in process()
        // shouldn't it be same? ctxt is member variable
        // and shouldn't it be same
        // this.ctxt.forward(K,V) here in this thread pool is causing NPE
        // why does it happen?
        this.ctxt.forward(K,V);
    }
    @Override
    void process(K,V) {
        // Here do some processing and go to the next processor chain
        // This works fine
        this.ctxt.forward(K,V);
    }
}
[1]: https://stackoverflow.com/questions/39067846/periodic-npe-in-kafka-streams-processor-context

It looks like it could be the same issue as the linked question, although your version is much more recent.
Make sure that ProcessorSupplier.get() returns a new instance each time it is called.

You shouldn't create a thread pool inside a Processor or inside DSL calls.
Parallelism in Kafka Streams is governed by num.stream.threads, the number of partitions, and the number of application instances.
ctxt is the same object, but its fields (for example currentNode) may differ, because they can be changed by different threads.
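If you really do need background work, one possible shape is sketched below in plain Java, without the Kafka dependency. Worker threads never touch the context; they only put results on a thread-safe queue, and the thread that runs process() drains the queue and forwards. QueueDrainingProcessor and its methods are hypothetical names, and the BiConsumer merely stands in for context.forward(key, value). Whether buffering like this fits your ordering and delivery requirements is an assumption you would have to check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.BiConsumer;

// Hypothetical stand-in for a processor: workers enqueue, the owning thread forwards.
class QueueDrainingProcessor {
    private final ConcurrentLinkedQueue<String[]> pending = new ConcurrentLinkedQueue<>();
    private final BiConsumer<String, String> forwarder; // stands in for context.forward(k, v)

    QueueDrainingProcessor(BiConsumer<String, String> forwarder) {
        this.forwarder = forwarder;
    }

    // Safe to call from any background thread: it only touches the queue.
    void submitFromWorker(String key, String value) {
        pending.add(new String[] {key, value});
    }

    // Called only by the thread that owns the context (the one that calls process()).
    void process(String key, String value) {
        drainPending();
        forwarder.accept(key, value);
    }

    // Forwards everything the workers have produced so far, in queue order.
    private void drainPending() {
        String[] kv;
        while ((kv = pending.poll()) != null) {
            forwarder.accept(kv[0], kv[1]);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> forwarded = new ArrayList<>();
        QueueDrainingProcessor p =
                new QueueDrainingProcessor((k, v) -> forwarded.add(k + "=" + v));
        Thread worker = new Thread(() -> p.submitFromWorker("async", "1"));
        worker.start();
        worker.join();
        p.process("sync", "2");
        System.out.println(forwarded); // [async=1, sync=2]
    }
}
```

The point of the design is that forward is only ever called from one thread, so the context's internal fields are never read while another thread is changing them.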


How to manually read data from a Flink checkpoint file and keep it in Java memory

We need to read data from our checkpoints manually for various reasons (say we need to change the structure of our state object/class, so we want to read the state, restore it, and copy the data to a new type of object).
Reading works fine, but when we try to keep the data in memory and deploy to the Flink cluster, we get an empty list/map. In the log we can see that we read all our data and add it to the list/map properly, but as soon as our method finishes its work the data is lost and the list/map is empty :(
val env = ExecutionEnvironment.getExecutionEnvironment();
val savepoint = Savepoint.load(env, checkpointSavepointLocation, new HashMapStateBackend());
private List<KeyedAssetTagWithConfig> keyedAssetsTagWithConfigs = new ArrayList<>();
val keyedStateReaderFunction = new KeyedStateReaderFunctionImpl();
savepoint.readKeyedState("my-uuid", keyedStateReaderFunction)
        .setParallelism(1)
        .output(new MyLocalCollectionOutputFormat<>(keyedAssetsTagWithConfigs));
env.execute("MyJobName");
private static class KeyedStateReaderFunctionImpl extends KeyedStateReaderFunction<String, KeyedAssetTagWithConfig> {
    private MapState<String, KeyedAssetTagWithConfig> liveTagsValues;
    private Map<String, KeyedAssetTagWithConfig> keyToValues = new ConcurrentHashMap<>();
    @Override
    public void open(final Configuration parameters) throws Exception {
        liveTagsValues = getRuntimeContext().getMapState(ExpressionsProcessor.liveTagsValuesStateDescriptor);
    }
    @Override
    public void readKey(final String key, final Context ctx, final Collector<KeyedAssetTagWithConfig> out) throws Exception {
        liveTagsValues.iterator().forEachRemaining(entry -> {
            keyToValues.put(entry.getKey(), entry.getValue());
            log.info("key {} -> {} val", entry.getKey(), entry.getValue());
            out.collect(entry.getValue());
        });
    }
    public Map<String, KeyedAssetTagWithConfig> getKeyToValues() {
        return keyToValues;
    }
}
As soon as this code executes, I expect to have all the values inside the map returned by keyedStateReaderFunction.getKeyToValues(). But it returns an empty map, even though the logs show that we read all of them properly. The keyedAssetsTagWithConfigs list, which collects the output, is empty too.
Any ideas would be very helpful, because I am lost; I have never put data into a map and then lost it before :) When I serialize my map or list, write it to a text file, and then deserialize it from there (using Jackson), my data is there, but this is a workaround rather than a solution.
Thanks in advance
The code you show creates and submits a Flink job to be executed in its own environment orchestrated by the Flink framework: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/#flink-application-execution
The job runs independently of the code that builds and submits it, so when you call keyedStateReaderFunction.getKeyToValues(), you are calling the method on the object that was used to build the job, not on the object that actually ran in the Flink execution environment.
Your workaround seems like a valid option to me. You can then submit the file with your savepoint contents to your new job to recreate its state as you'd like.
You have an instance of KeyedStateReaderFunctionImpl in the Flink client which gets serialized and sent to each task manager. Each task manager then deserializes a copy of that KeyedStateReaderFunctionImpl and calls its open and readKey methods, and gradually builds up a private Map containing its share of the data extracted from the savepoint/checkpoint.
Meanwhile the original KeyedStateReaderFunctionImpl back in the Flink client has never had its open or readKey methods called, and doesn't hold any data.
In your case the parallelism is one, so there is only one task manager, but in general you will need to collect the output from each task manager and assemble the complete results from these pieces. These results are not available in the Flink client process because the work hasn't been done there.
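The effect is easy to reproduce in plain Java, without Flink. The sketch below round-trips an object through Java serialization, which is a simplified stand-in for shipping the function to a task manager, and then does the work on the copy. CollectingFunction and shipToWorker are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical reader function that accumulates values in a private map, like keyToValues.
class CollectingFunction implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Map<String, String> seen = new HashMap<>();

    void readKey(String key, String value) {
        seen.put(key, value);
    }

    Map<String, String> getSeen() {
        return seen;
    }

    // Serializes and deserializes the function, producing an independent copy.
    static CollectingFunction shipToWorker(CollectingFunction f) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(f);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CollectingFunction) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        CollectingFunction clientSide = new CollectingFunction();
        CollectingFunction workerSide = shipToWorker(clientSide);
        workerSide.readKey("tag-1", "value-1");           // the work happens on the copy
        System.out.println(workerSide.getSeen().size());  // 1
        System.out.println(clientSide.getSeen().size());  // 0: the original never ran readKey
    }
}
```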
I found a solution: I started the job in attached mode and collected the results in the main thread.
val env = ExecutionEnvironment.getExecutionEnvironment();
val configuration = env.getConfiguration();
configuration
        .setBoolean(DeploymentOptions.ATTACHED, true);
...
val myresults = dataSource.collect();
I hope this helps somebody else, because I wasted a couple of days trying to find a solution.

Thread safety for method that returns Mono based on mutable attribute in Java

In my Spring Boot application I have a component that is supposed to monitor the health status of another, external system. This component also offers a public method that reactive chains can subscribe to in order to wait for the external system to be up.
@Component
public class ExternalHealthChecker {
    private static final Logger LOG = LoggerFactory.getLogger(ExternalHealthChecker.class);
    private final WebClient externalSystemWebClient = WebClient.builder().build(); // config omitted
    private volatile boolean isUp = true;
    private volatile CompletableFuture<String> completeWhenUp = new CompletableFuture<>();
    @Scheduled(cron = "0/10 * * ? * *")
    private void checkExternalSystemHealth() {
        externalSystemWebClient.get() //
                .uri("/health") //
                .retrieve() //
                .bodyToMono(Void.class) //
                .doOnError(this::handleHealthCheckError) //
                .doOnSuccess(nothing -> this.handleHealthCheckSuccess()) //
                .subscribe(); //
    }
    private void handleHealthCheckError(final Throwable error) {
        if (this.isUp) {
            LOG.error("External System is now DOWN. Health check failed: {}.", error.getMessage());
        }
        this.isUp = false;
    }
    private void handleHealthCheckSuccess() {
        // the status changed from down -> up, which has to complete the future that might be currently waited on
        if (!this.isUp) {
            LOG.warn("External System is now UP again.");
            this.isUp = true;
            this.completeWhenUp.complete("UP");
            this.completeWhenUp = new CompletableFuture<>();
        }
    }
    public Mono<String> waitForExternalSystemUPStatus() {
        if (this.isUp) {
            LOG.info("External System is already UP!");
            return Mono.empty();
        } else {
            LOG.warn("External System is DOWN. Requesting process can now wait for UP status!");
            return Mono.fromFuture(completeWhenUp);
        }
    }
}
The method waitForExternalSystemUPStatus is public and may be called from many different threads. The idea is to give some of the reactive flux chains in the application a way of pausing their processing until the external system is up. These chains cannot process their elements when the external system is down.
someFlux
        .doOnNext(record -> LOG.info("Next element"))
        .delayUntil(record -> externalHealthChecker.waitForExternalSystemUPStatus())
        ... // starting processing
The issue is that I can't really wrap my head around which parts of this code need to be synchronised. I don't think there is a problem with multiple threads calling waitForExternalSystemUPStatus at the same time, as this method does not write anything, so I feel it does not need to be synchronised. However, the method annotated with @Scheduled runs on its own thread, writes the value of isUp, and may also change the reference of completeWhenUp to a new, uncompleted future instance. I have marked these two mutable attributes as volatile because, from what I have read about this keyword in Java, it should help guarantee that threads reading these two values see the latest value. However, I am unsure whether I also need to add the synchronized keyword to parts of the code. I am also unsure whether synchronized plays well with Reactor code; I have a hard time finding information on this. Maybe there is also a way of providing the functionality of ExternalHealthChecker in a more complete, reactive way, but I cannot think of any.
I'd strongly advise against this approach. The problem with threaded code like this is it becomes immensely difficult to follow & reason about. I think you'd at least need to synchronise the parts of handleHealthCheckSuccess() and waitForExternalSystemUPStatus() that reference your completeWhenUp field otherwise you could have a race hazard on your hands (only one writes to it, but it might be read out-of-order after that write) - but there could well be something else I'm missing, and if so it may show as one of these annoying "one in a million" type bugs that's almost impossible to pin down.
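For reference, the synchronisation described above could look roughly like the following. It is a minimal sketch in plain Java, reduced to the two fields that race; HealthGate, markDown, markUp and waitForUp are hypothetical names, and unlike the original it returns an already completed future (rather than Mono.empty()) when the system is up.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical reduction of ExternalHealthChecker to the state that needs guarding.
class HealthGate {
    private final Object lock = new Object();
    private boolean up = true;                                                     // guarded by lock
    private CompletableFuture<String> completeWhenUp = new CompletableFuture<>(); // guarded by lock

    void markDown() {
        synchronized (lock) {
            up = false;
        }
    }

    void markUp() {
        CompletableFuture<String> toComplete = null;
        synchronized (lock) {
            if (!up) {
                up = true;
                toComplete = completeWhenUp;
                completeWhenUp = new CompletableFuture<>();
            }
        }
        // Complete outside the lock so waiter callbacks do not run while holding it.
        if (toComplete != null) {
            toComplete.complete("UP");
        }
    }

    // Reads the flag and the future under one lock, so a caller never sees
    // "down" paired with a future that belongs to the next outage.
    CompletableFuture<String> waitForUp() {
        synchronized (lock) {
            return up ? CompletableFuture.completedFuture("UP") : completeWhenUp;
        }
    }

    public static void main(String[] args) {
        HealthGate gate = new HealthGate();
        gate.markDown();
        CompletableFuture<String> waiter = gate.waitForUp();
        System.out.println(waiter.isDone()); // false
        gate.markUp();
        System.out.println(waiter.join());   // UP
    }
}
```

A reactive caller could wrap the result with Mono.fromFuture(gate.waitForUp()), as the original method does.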
There should be a much more reliable & simple way of achieving this though. Instead of using the Spring scheduler, I'd create a flux when your ExternalHealthChecker component is created as follows:
healthCheckStream = Flux.interval(Duration.ofMinutes(10))
        .flatMap(i ->
                webClient.get().uri("/health")
                        .retrieve()
                        .bodyToMono(String.class)
                        .map(s -> true)
                        .onErrorResume(e -> Mono.just(false)))
        .cache(1);
...where healthCheckStream is a field of type Flux<Boolean>. (Note it doesn't need to be volatile, as you'll never replace it so cross-thread worries don't apply - it's the same stream that will be updated with different results every 10 minutes based on the healthcheck status, whatever thread you'll access it from.)
This essentially creates a stream of healthcheck response values every 10 minutes, always caches the latest response, and shares it between subscribers. As far as I know, cache(1) connects to the interval when the first subscriber arrives; from then on the flux keeps running regardless of later subscribers, and any new subscriber on any thread immediately gets the latest result, be that a pass or a fail. handleHealthCheckSuccess(), handleHealthCheckError(), isUp, and completeWhenUp are then all redundant and can go, and your waitForExternalSystemUPStatus() can become a single line:
return healthCheckStream.filter(x -> x).next();
...then job done, you can call that from anywhere and you'll have a Mono that will only complete when the system is up.

Creating a deep copy of cache in multithreaded Java application

Setup
I have a multithreaded Java application which receives 200-300 requests per second, each asking it to perform a task 'A' (which takes approximately 30 milliseconds) on the input received in the request.
The application has a cache (max size = 1MB) which is read by each thread to perform task 'A' on the input received:
public class DataProvider {
    private HashMap<KeyObject, ValueObject> cache;
    private Database database;
    // Scheduled to run in interval of 15 seconds by a background thread
    public synchronized void updateData() {
        this.cache = database.getData();
    }
    public HashMap<KeyObject, ValueObject> getCache() {
        return this.cache;
    }
}
KeyObject and ValueObject are POJOs. ValueObject contains a List of another POJO.
For every request received, the task is done in the following way:
public class TaskExecutor {
    private DataProvider dataProvider;
    public boolean doTask(final InputObject input) {
        final HashMap<KeyObject, ValueObject> data = dataProvider.getCache(); // shallow copy I think
        // Do Task 'A' using data
    }
}
Problem
One of the threads starts executing task 'A' at timestamp 't' using data 'd1' from the cache. At time 't + t1' the cache data is updated to 'd2'. The thread now uses data 'd2' to finish the rest of the task, which completes at 't + t1 + t2'. Half of the task was completed with different data, which leads to an invalid outcome.
Current Approach
Each thread creates a deep copy of the cache and uses it to perform the task, using whichever of the following approaches performs best:
How do you make a deep copy of an object in Java?
Deep clone utility recommendation
Limitation
Cloning using deep copy will create thousands of objects, which may crash the JVM.
None of the cloning approaches looks good in terms of performance.
For your use case, returning a new cache object from database.getData() is a much better choice. That way you only create a new cache object once every 15 seconds. If you cloned the cache in each task instead, you would create roughly 4,500 copies in the same 15 seconds (300 requests per second × 15 seconds). Returning a new cache object is clearly the right choice.
If the code you posted is the same as in your project, I suspect that database.getData() changes the content of a single cache object instead of returning a new one. If it returns a new cache object each time, your problem is solved, because a task that read the old reference keeps working on the old, unchanged map.
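A minimal sketch of that idea, assuming the refresh always builds a fresh map and never mutates a published one. SnapshotCache is a hypothetical simplification that uses String keys and values instead of KeyObject and ValueObject; the field is volatile so that readers see the newly published map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical simplification of DataProvider: publish a new read-only map on every refresh.
class SnapshotCache {
    private volatile Map<String, String> cache = Collections.emptyMap();

    // Background refresh: copy into a brand-new map, then swap the reference in one write.
    void updateData(Map<String, String> freshFromDatabase) {
        cache = Collections.unmodifiableMap(new HashMap<>(freshFromDatabase));
    }

    Map<String, String> getCache() {
        return cache;
    }

    // A task reads the reference once and uses only that snapshot until it finishes.
    static String doTask(Map<String, String> snapshot, String key) {
        return snapshot.get(key);
    }

    public static void main(String[] args) {
        SnapshotCache provider = new SnapshotCache();
        provider.updateData(Collections.singletonMap("k", "d1"));
        Map<String, String> snapshot = provider.getCache();       // taken at time t
        provider.updateData(Collections.singletonMap("k", "d2")); // refresh at t + t1
        System.out.println(doTask(snapshot, "k"));                // d1
        System.out.println(doTask(provider.getCache(), "k"));     // d2
    }
}
```

Note that this only protects the map itself. If the values are mutable objects that the refresh modifies in place, a running task can still observe the change.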

Code running on main thread even with subscribeOn specified

I'm in the process of migrating an AsyncTaskLoader to RxJava, trying to understand all the details about the RxJava approach to concurrency. Simple things were running ok, however I'm struggling with the following code:
This is the top level method that gets executed:
mCompositeDisposable.add(mDataRepository
        .getStuff()
        .subscribeOn(mSchedulerProvider.io())
        .subscribeWith(...));
mDataRepository.getStuff() looks like this:
public Observable<StuffResult> getStuff() {
    return mDataManager
            .listStuff()
            .flatMap(stuff -> Observable.just(new StuffResult(stuff)))
            .onErrorReturn(throwable -> new StuffResult(null));
}
And the final layer:
public Observable<Stuff> listStuff() {
    Log.d(TAG, ".listStuff() - " + Thread.currentThread().getName());
    String sql = <...>;
    return mBriteDatabase.createQuery(Stuff.TABLE_NAME, sql).mapToList(mStuffMapper);
}
So with the code above, the log prints .listStuff() - main, which is not what I'm looking for, and I'm not sure why. I was under the impression that by setting subscribeOn, every event pulled from the chain would be processed on the thread specified in the subscribeOn method.
What I think is happening is that the source (final layer) code, before reaching mBriteDatabase, is not part of the RxJava world and therefore is not an event until createQuery is called. So I probably need some sort of wrapper? I've tried applying .fromCallable, however that's a wrapper for non-Rx code, and my database layer returns an observable...
Your Log.d call happens immediately when listStuff gets called, which is immediately after getStuff gets called, which is the first thing that happens in the top-level code fragment you show us.
If you need to do it when the subscription happens, you need to be explicit:
public Observable<Stuff> listStuff() {
    String sql = <...>;
    return mBriteDatabase.createQuery(Stuff.TABLE_NAME, sql)
            .mapToList(mStuffMapper)
            // RxJava 2: doOnSubscribe receives the Disposable
            .doOnSubscribe(disposable -> Log.d(TAG, ".listStuff() - " + Thread.currentThread().getName()));
}
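The same distinction can be reproduced without RxJava. The sketch below is only an analogy in plain Java: the body of listStuff() runs on whichever thread calls it (the "assembly" step), while the Supplier it returns runs on whichever thread later executes it (the "subscription" step). AssemblyVsSubscription, newIoExecutor, runOn and the thread name io-worker are hypothetical names chosen for the illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

class AssemblyVsSubscription {
    // Name of the thread that ran the body of listStuff().
    static volatile String assemblyThread;

    // The method body runs eagerly on the caller's thread; only the returned work is deferred.
    static Supplier<String> listStuff() {
        assemblyThread = Thread.currentThread().getName();
        return () -> Thread.currentThread().getName();
    }

    // Single daemon thread with a recognisable name, standing in for an io scheduler.
    static ExecutorService newIoExecutor() {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "io-worker");
            t.setDaemon(true);
            return t;
        });
    }

    // Runs the deferred work on the given executor and waits for its result.
    static String runOn(ExecutorService executor, Supplier<String> work) {
        return CompletableFuture.supplyAsync(work, executor).join();
    }

    public static void main(String[] args) {
        ExecutorService io = newIoExecutor();
        Supplier<String> work = listStuff();      // body runs here, on the main thread
        System.out.println(assemblyThread);       // main
        System.out.println(runOn(io, work));      // io-worker
        io.shutdown();
    }
}
```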

Spark on Java - What is the right way to have a static object on all workers

I need to use a non-serialisable 3rd party class in my functions on all executors in Spark, for example:
JavaRDD<String> resRdd = origRdd
        .flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String t) throws Exception {
                //A DynamoDB mapper I don't want to initialise every time
                DynamoDBMapper mapper = new DynamoDBMapper(new AmazonDynamoDBClient(credentials));
                Set<String> userFav = mapper.load(userDataDocument.class, userId).getFav();
                return userFav;
            }
        });
I would like to have a static DynamoDBMapper mapper which I initialise once per executor and can use over and over again.
Since it's not serialisable, I can't initialise it once in the driver and broadcast it.
Note: there is an answer here (What is the right way to have a static object on all workers), but it's only for Scala.
You can use mapPartitions or foreachPartition. Here is a snippet taken from Learning Spark:
By using partition-based operations, we can share a connection pool
to this database to avoid setting up many connections, and reuse our
JSON parser. As Examples 6-10 through 6-12 show, we use the
mapPartitions() function, which gives us an iterator of the elements
in each partition of the input RDD and expects us to return an
iterator of our results.
This allows us to initialize one connection per partition rather than per element, and then iterate over the elements in the partition however we like. This is very useful for saving data into an external database or for creating expensive reusable objects.
Here is a simple Scala example taken from the linked book. It can be translated to Java if needed; it is only here to show a simple use case of mapPartitions and foreachPartition.
ipAddressRequestCount.foreachRDD { rdd => rdd.foreachPartition { partition =>
    // Open connection to storage system (e.g. a database connection)
    partition.foreach { item =>
      // Use connection to push item to system
    }
    // Close connection
  }
}
Here is a link to a java example.
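To make the pattern concrete without pulling in Spark, here is a plain-Java sketch of the function body you would hand to mapPartitions. It creates the expensive object once and reuses it for every element of the partition. ExpensiveClient is a hypothetical stand-in for something like DynamoDBMapper, and PerPartitionInit and processPartition are illustrative names; whether the real function must return an Iterator or an Iterable depends on your Spark version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class PerPartitionInit {
    // Hypothetical stand-in for an expensive, non-serializable client.
    static class ExpensiveClient {
        static final AtomicInteger CREATED = new AtomicInteger();

        ExpensiveClient() {
            CREATED.incrementAndGet();
        }

        String lookup(String id) {
            return "fav-of-" + id;
        }
    }

    // Shape of a per-partition function: one client per partition, not per element.
    static Iterator<String> processPartition(Iterator<String> partition) {
        ExpensiveClient client = new ExpensiveClient(); // created once per partition
        List<String> results = new ArrayList<>();
        while (partition.hasNext()) {
            results.add(client.lookup(partition.next()));
        }
        return results.iterator();
    }

    public static void main(String[] args) {
        Iterator<String> out = processPartition(Arrays.asList("u1", "u2", "u3").iterator());
        while (out.hasNext()) {
            System.out.println(out.next());
        }
        System.out.println(ExpensiveClient.CREATED.get()); // 1: three elements, one client
    }
}
```

Because the client is constructed inside the function, it is created on the executor and never has to be serialized.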
