Kafka Stream with multiple consumers / processors not persisting data concurrently - java

I'm new to Kafka and want to persist data from Kafka topics to database tables (each topic flows to a specific table). I know Kafka Connect exists and could be used to achieve this, but there are reasons why this approach is preferred.
Unfortunately only one topic is written to the database. Kafka does not seem to run the process() of all processors concurrently: either MyFirstData is written to the database or MySecondData, but never both at the same time.
According to my reading, there is the option of overriding init() from the Kafka Streams Processor interface, which offers context.forward(); I'm not sure whether this will help or how to use it in my use case.
I use Spring Cloud Stream (but got the same behaviour with Kafka DSL and Processor API implementations).
My code snippet:
Configuring the consumers:
@Configuration
@RequiredArgsConstructor
public class DatabaseProcessorConfiguration {

    private final MyFirstDao myFirstDao;
    private final MySecondDao mySecondDao;

    @Bean
    public Consumer<KStream<GenericData.Record, GenericData.Record>> myFirstDbProcessor() {
        return stream -> stream.process(() -> new MyFirstDbProcessor(myFirstDao));
    }

    @Bean
    public Consumer<KStream<GenericRecord, GenericRecord>> mySecondDbProcessor() {
        return stream -> stream.process(() -> new MySecondDbProcessor(mySecondDao));
    }
}
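For context, with the Spring Cloud Stream Kafka Streams binder, two function beans like these are normally declared together via spring.cloud.function.definition, each with its own input binding, roughly like the sketch below (the topic names are placeholders, not taken from this setup):

spring.cloud.function.definition=myFirstDbProcessor;mySecondDbProcessor
spring.cloud.stream.bindings.myFirstDbProcessor-in-0.destination=my-first-topic
spring.cloud.stream.bindings.mySecondDbProcessor-in-0.destination=my-second-topic

Each function runs as its own KafkaStreams instance, so the binder derives (or can be given) a separate application id per binding.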
MyFirstDbProcessor and MySecondDbProcessor are implemented analogously to this:
@Slf4j
@RequiredArgsConstructor
public class MyFirstDbProcessor implements Processor<GenericData.Record, GenericData.Record, Void, Void> {

    private final MyFirstDao myFirstDao;

    @Override
    public void process(Record<GenericData.Record, GenericData.Record> record) {
        CdcRecordAdapter adapter = new CdcRecordAdapter(record.key(), record.value());
        MyFirstTopicKey myFirstTopicKey = adapter.getKeyAs(MyFirstTopicKey.class);
        MyFirstTopicValue myFirstTopicValue = adapter.getValueAs(MyFirstTopicValue.class);
        MyFirstData data = PersistenceMapper.map(myFirstTopicKey, myFirstTopicValue);
        switch (myFirstTopicValue.getCrudOperation()) {
            case UPDATE, INSERT -> myFirstDao.persist(data);
            case DELETE -> myFirstDao.delete(data);
            default -> System.err.println("unimplemented CDC operation streamed by kafka");
        }
    }
}
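For reference, my understanding of the init()/context.forward() option mentioned above is roughly the sketch below. This is only an illustration of the Processor API; the output key/value types and the existence of a downstream processor are assumptions, since the processors above declare Void outputs and forward nothing.

import org.apache.avro.generic.GenericData;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

// Sketch only: keep the ProcessorContext from init() so process() can forward
// records downstream via context.forward().
public class ForwardingDbProcessor
        implements Processor<GenericData.Record, GenericData.Record, GenericData.Record, GenericData.Record> {

    private ProcessorContext<GenericData.Record, GenericData.Record> context;

    @Override
    public void init(ProcessorContext<GenericData.Record, GenericData.Record> context) {
        this.context = context; // keep a handle for forwarding (or scheduling punctuations)
    }

    @Override
    public void process(Record<GenericData.Record, GenericData.Record> record) {
        // ... map/persist as in MyFirstDbProcessor ...
        context.forward(record); // pass the record on to downstream processors, if any
    }
}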
My DAO implementations: I tried implementing MyFirstRepository with both JpaRepository and ReactiveCrudRepository, but got the same behaviour. MySecondRepository is implemented analogously to MyFirstRepository.
@Component
@RequiredArgsConstructor
public class MyFirstDaoImpl implements MyFirstDao {

    private final MyFirstRepository myFirstRepository;

    @Override
    public MyFirstData persist(MyFirstData myFirstData) {
        Optional<MyFirstData> dataOptional = myFirstRepository.findById(myFirstData.getId());
        if (dataOptional.isPresent()) {
            var data = dataOptional.get();
            myFirstData.setCreatedDate(data.getCreatedDate());
        }
        return myFirstRepository.save(myFirstData);
    }

    @Override
    public void delete(MyFirstData myFirstData) {
        System.out.println("delete() from transaction detail dao called");
        myFirstRepository.delete(myFirstData);
    }
}

Related

Sequential processing of multi-threaded results

I am setting up a Spring Boot application (DAO pattern with @Repositories) where I am attempting to write a @Service to asynchronously pull data from a database in multiple threads and merge-process the incoming payloads sequentially, preferably on arrival.
The goal is to utilize parallel database access for requests where multiple non-overlapping sets of filter conditions need to be queried individually, but post-processed (transformed, e.g. aggregated) into a combined result.
Being rather new to Java, and coming from Golang and its comparably trivial syntax for multi-threading and task-communication, I struggle to identify a preferable API in Java and Spring Boot - or determine if this approach is even favorable to begin with.
Question:
Given
a Controller:
@RestController
@RequestMapping("/api")
public class MyController {

    private final MyService myService;

    @Autowired
    public MyController(MyService myService) {
        this.myService = myService;
    }

    @PostMapping("/processing")
    public DeferredResult<MyResult> myHandler(@RequestBody MyRequest myRequest) {
        DeferredResult<MyResult> myDeferredResult = new DeferredResult<>();
        myService.myProcessing(myRequest, myDeferredResult);
        return myDeferredResult;
    }
}
a Service:
import com.acme.parallel.util.MyDataTransformer;

@Service
public class MyServiceImpl implements MyService {

    private final MyRepository myRepository;

    @Autowired
    public MyServiceImpl(MyRepository myRepository) {
        this.myRepository = myRepository;
    }

    public void myProcessing(MyRequest myRequest, MyDeferredResult myDeferredResult) {
        MyDataTransformer myDataTransformer = new MyDataTransformer();

        /* PLACEHOLDER CODE
        for (MyFilter myFilter : myRequest.getMyFilterList()) {
            // MyPartialResult myPartialResult = myRepository.myAsyncQuery(myFilter);
            // myDataTransformer.transformMyPartialResult(myPartialResult);
        }
        */

        myDeferredResult.setResult(myDataTransformer.getMyResult());
    }
}
a Repository:
@Repository
public class MyRepository {

    public MyPartialResult myAsyncQuery(MyFilter myFilter) {
        // for the sake of an example
        return new MyPartialResult(myFilter, TakesSomeAmountOfTimeToQuery.TRUE);
    }
}
as well as a MyDataTransformer helper class:
public class MyDataTransformer {

    private final MyResult myResult = new MyResult(); // e.g. a Map

    public void transformMyPartialResult(MyPartialResult myPartialResult) {
        /* PLACEHOLDER CODE
        this.myResult.transformAndMergeIntoMe(myPartialResult);
        */
    }
}
how can I implement
the MyService.myProcessing method asynchronously and multi-threaded, and
the MyDataTransformer.transformMyPartialResult method sequential/thread-safe
(or redesign the above)
most performantly, to merge incoming MyPartialResult into one single MyResult?
Attempts:
The easiest solution seems to be to skip the "on arrival" part, and a commonly preferred implementation might e.g. be:
public void myProcessing(MyRequest myRequest, MyDeferredResult myDeferredResult) {
    MyDataTransformer myDataTransformer = new MyDataTransformer();
    List<CompletableFuture<MyPartialResult>> myPartialResultFutures = new ArrayList<>();

    for (MyFilter myFilter : myRequest.getMyFilterList()) { // Stream is the way they say, but I like for
        myPartialResultFutures.add(CompletableFuture.supplyAsync(() -> myRepository.myAsyncQuery(myFilter)));
    }

    myPartialResultFutures.stream()
            .map(CompletableFuture::join)
            .forEach(myDataTransformer::transformMyPartialResult);

    myDeferredResult.setResult(myDataTransformer.getMyResult());
}
However, if feasible I'd like to benefit from sequentially processing incoming payloads when they arrive, so I am currently experimenting with something like this:
public void myProcessing(MyRequest myRequest, MyDeferredResult myDeferredResult) {
    MyDataTransformer myDataTransformer = new MyDataTransformer();
    List<CompletableFuture<Void>> myPartialResultFutures = new ArrayList<>();

    for (MyFilter myFilter : myRequest.getMyFilterList()) {
        myPartialResultFutures.add(CompletableFuture.supplyAsync(() -> myRepository.myAsyncQuery(myFilter))
                .thenAccept(myDataTransformer::transformMyPartialResult));
    }

    myPartialResultFutures.forEach(CompletableFuture::join);
    myDeferredResult.setResult(myDataTransformer.getMyResult());
}
but I don't understand whether I need to implement any thread-safety protocols when calling myDataTransformer.transformMyPartialResult, and if so how - or whether this even makes sense, performance-wise.
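For illustration, the minimal thread-safety measure I can think of is to synchronize the transform method (sketch only, assuming MyResult itself is not thread-safe); the ConcurrentHashMap variant in the update below achieves the same with finer-grained locking:

public class MyDataTransformer {

    private final MyResult myResult = new MyResult();

    // Sketch: synchronizing serializes the merge step even when thenAccept callbacks
    // arrive on different pool threads.
    public synchronized void transformMyPartialResult(MyPartialResult myPartialResult) {
        this.myResult.transformAndMergeIntoMe(myPartialResult); // placeholder merge from above
    }

    public synchronized MyResult getMyResult() {
        return this.myResult;
    }
}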
Update:
Based on the assumption that
myRepository.myAsyncQuery takes slightly varying amounts of time, and
myDataTransformer.transformMyPartialResult takes an ever increasing amount of time with each call,
implementing a thread-safe/atomic type/Object, e.g. a ConcurrentHashMap:
public class MyDataTransformer {

    private final ConcurrentMap<K, V> myResult = new ConcurrentHashMap<K, V>();

    public void transformMyPartialResult(MyPartialResult myPartialResult) {
        myPartialResult.myRows
                .forEach(row -> this.myResult.merge(row[0], row[1], Integer::sum));
    }
}
into the latter Attempt (processing "on arrival"):
public void myProcessing(MyRequest myRequest, MyDeferredResult myDeferredResult) {
    MyDataTransformer myDataTransformer = new MyDataTransformer();
    List<CompletableFuture<Void>> myPartialResultFutures = new ArrayList<>();

    for (MyFilter myFilter : myRequest.getMyFilterList()) {
        myPartialResultFutures.add(CompletableFuture.supplyAsync(() -> myRepository.myAsyncQuery(myFilter))
                .thenAccept(myDataTransformer::transformMyPartialResult));
    }

    myPartialResultFutures.forEach(CompletableFuture::join);
    myDeferredResult.setResult(myDataTransformer.getMyResult());
}
is up to one order of magnitude faster than waiting on all threads first, even with atomicity protocol overhead.
Now, this may have been obvious (though not necessarily, as async/multi-threaded processing is by far not always the better choice), and I am glad this approach is a valid choice.
What remains looks to me like a hacky, flexibility-lacking solution - or at least an ugly one. Is there a better approach?

Disable Kafka connection in SpringBoot tests

I'm working on a Spring Boot project following a microservice architecture, and I use Kafka as an event bus to exchange data between some of the services. I also have JUnit tests, some of which test parts of my application that don't require the bus, and others that do require it by using an embedded Kafka broker.
The problem I have is that when I launch all my tests, they take a lot of time and they fail, because each of them tries to connect to the embedded Kafka broker (connection not available) even though they don't need the Kafka bus to achieve their task.
Is it possible to disable the loading of Kafka components for these tests and only enable them for the ones that require it?
This is how I usually write my JUnit test classes so that they don't connect to Kafka brokers for each test.
Mock the REST API, if your Kafka client (producer/consumer) is integrated with a REST API:
public class MyMockedRESTAPI {

    public MyMockedRESTAPI() {
    }

    public APIResponseWrapper apiResponseWrapper(parameters..) throws RestClientException {
        if (throwException) {
            throw new RestClientException(....);
        }
        return new APIResponseWrapper();
    }
}
A factory class to generate an incoming Kafka event and REST API request and response wrappers:
public class MockFactory {

    private static final Gson gson = new Gson();

    public static KafkaEvent generateKafkaEvent() {
        KafkaEvent kafkaEvent = new KafkaEvent();
        kafkaEvent.set...
        kafkaEvent.set...
        kafkaEvent.set...
        return kafkaEvent;
    }

    public static ResponseEntity<APIResponse> createAPIResponse() {
        APIResponse response = new APIResponse();
        return new ResponseEntity<>(response, HttpStatus.OK);
    }
}
A test runner class:
@RunWith(SpringJUnit4ClassRunner.class)
public class KAFKAJUnitTest {
    // your assertions should be declared here
}
You can also refer : https://www.baeldung.com/spring-boot-kafka-testing
A good practice is to avoid sending messages to Kafka while testing code in your isolated microservice scope, but when you need to run an integration test (many microservices at the same time) you sometimes need to activate Kafka messaging.
So my proposal is:
1. Activate/deactivate loading the Kafka configuration as required
@ConditionalOnProperty(prefix = "my.kafka.consumer", value = "enabled", havingValue = "true", matchIfMissing = false)
@Configuration
public class KafkaConsumerConfiguration {
    ...
}

@ConditionalOnProperty(prefix = "my.kafka.producer", value = "enabled", havingValue = "true", matchIfMissing = false)
@Configuration
public class KafkaProducerConfiguration {
    ...
}
Then you will be able to activate/deactivate loading the consumer and producer configuration as you need.
Examples:
@SpringBootApplication
@Import(KafkaConsumerConfiguration.class)
public class MyMicroservice_1 {
    public static void main(String[] args) {
        SpringApplication.run(MyMicroservice_1.class, args);
    }
}
or
@SpringBootApplication
@Import(KafkaProducerConfiguration.class)
public class MyMicroservice_2 {
    public static void main(String[] args) {
        SpringApplication.run(MyMicroservice_2.class, args);
    }
}
or maybe a microservice that needs both configurations:
@SpringBootApplication
@Import(value = { KafkaProducerConfiguration.class, KafkaConsumerConfiguration.class })
public class MyMicroservice_3 {
    public static void main(String[] args) {
        SpringApplication.run(MyMicroservice_3.class, args);
    }
}
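Building on that, a test that does not need the bus can simply leave those properties unset (or disable them explicitly), so the conditional configurations above are never loaded. A rough sketch (class name and property values here are just an example):

import org.junit.jupiter.api.Test;
import org.springframework.boot.test.context.SpringBootTest;

// With the @ConditionalOnProperty guards above, disabling (or omitting) the properties
// keeps the Kafka consumer/producer beans out of the test context, so no broker
// connection is attempted.
@SpringBootTest(properties = {
        "my.kafka.consumer.enabled=false",
        "my.kafka.producer.enabled=false"
})
class WithoutKafkaIntegrationTest {

    @Test
    void contextLoadsWithoutKafka() {
        // assertions against non-Kafka components go here
    }
}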
2. You also need to make sending messages depend on the current Spring profile. To do that you can override the send method of the KafkaTemplate object:
@ConditionalOnProperty(prefix = "my.kafka.producer", value = "enabled", havingValue = "true", matchIfMissing = false)
@Configuration
public class KafkaProducerConfiguration {

    ...

    @Resource
    Environment environment;

    @Bean
    public KafkaTemplate<String, String> kafkaTemplate() {
        return new KafkaTemplate<>(producerFactory()) {
            @Override
            protected ListenableFuture<SendResult<String, String>> doSend(ProducerRecord<String, String> producerRecord) {
                if (Arrays.asList(environment.getActiveProfiles()).contains("test")) {
                    return null;
                }
                return super.doSend(producerRecord);
            }
        };
    }

    @Bean
    public ProducerFactory<String, String> producerFactory() {
        Map<String, Object> props = new HashMap<>();
        ...
        return new DefaultKafkaProducerFactory<>(props);
    }
}

AWS SQS Listening vs Polling

I currently have an SQS listener implemented in a Spring Boot project running on Fargate.
It's possible that, under the hood, the SqsAsyncClient that appears to be a listener is actually polling, though.
Separately, as a PoC, I implemented a Lambda function trigger on a different queue. It would be invoked when there are items in the queue and would post to my service. This seems unnecessarily complex to me, but it removes a single point of failure if I were to only have one instance of the service.
I guess my major point of confusion is whether I am needlessly worrying about polling vs listening on an SQS queue, and whether it matters.
Code for example purposes:
@Component
@Slf4j
@RequiredArgsConstructor
public class SqsListener {

    private final SqsAsyncClient sqsAsyncClient;
    private final Environment environment;
    private final SmsMessagingServiceImpl smsMessagingService;

    @PostConstruct
    public void continuousListener() {
        String queueUrl = environment.getProperty("aws.sqs.sms.queueUrl");
        Mono<ReceiveMessageResponse> responseMono = receiveMessage(queueUrl);
        Flux<Message> messages = getItems(responseMono);
        messages.subscribe(message -> disposeOfFlux(message, queueUrl));
    }

    protected Flux<Message> getItems(Mono<ReceiveMessageResponse> responseMono) {
        return responseMono.repeat().retry()
                .map(ReceiveMessageResponse::messages)
                .map(Flux::fromIterable)
                .flatMap(messageFlux -> messageFlux);
    }

    protected void disposeOfFlux(Message message, String queueUrl) {
        log.info("Inbound SMS Received from SQS with MessageId: {}", message.messageId());
        if (someConditionIsMet())
            deleteMessage(queueUrl, message);
    }

    protected Mono<ReceiveMessageResponse> receiveMessage(String queueUrl) {
        return Mono.fromFuture(() -> sqsAsyncClient.receiveMessage(
                ReceiveMessageRequest.builder()
                        .maxNumberOfMessages(5)
                        .messageAttributeNames("All")
                        .queueUrl(queueUrl)
                        .waitTimeSeconds(10)
                        .visibilityTimeout(30)
                        .build()));
    }

    protected void deleteMessage(String queueUrl, Message message) {
        sqsAsyncClient.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .receiptHandle(message.receiptHandle())
                        .build())
                .thenAccept(deleteMessageResponse -> log.info("deleted message with handle {}", message.receiptHandle()));
    }
}

Spring Kafka Event after all listeners are started

I'm using Spring Kafka to connect to a Kafka cluster, and I have a requirement to execute a piece of code after all the topic partitions of all the topics are assigned.
Take the following code for example:
@Component
public class MyKafkaMessageListener1 {

    @KafkaListener(topics = "topic1", groupId = "cg1")
    public void handleMessage(String message) {
        System.out.println("Msg:" + message);
    }
}

@Component
public class MyKafkaMessageListener2 {

    @KafkaListener(topics = {"topic2", "topic3"}, groupId = "cg2")
    public void handleMessage(String message) {
        System.out.println("Msg:" + message);
    }
}
I need to execute it after all topic partitions of topic1, topic2 and topic3 are assigned to threads. Is this possible? Is there such an event handler in Kafka or spring-kafka?
Implement ConsumerSeekAware (or extend AbstractConsumerSeekAware), or add a ConsumerRebalanceListener and you will receive a call to onPartitionsAssigned().
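A rough sketch for one listener (class and topic names reuse the example above): extending AbstractConsumerSeekAware gives you onPartitionsAssigned() per listener. Note the callback only covers this listener's assignment, so deciding that all listeners are assigned still needs your own bookkeeping (e.g. counting callbacks across the expected listeners).

import java.util.Map;
import org.apache.kafka.common.TopicPartition;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.listener.AbstractConsumerSeekAware;
import org.springframework.kafka.listener.ConsumerSeekAware.ConsumerSeekCallback;
import org.springframework.stereotype.Component;

@Component
public class MyKafkaMessageListener1 extends AbstractConsumerSeekAware {

    @KafkaListener(topics = "topic1", groupId = "cg1")
    public void handleMessage(String message) {
        System.out.println("Msg:" + message);
    }

    @Override
    public void onPartitionsAssigned(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {
        super.onPartitionsAssigned(assignments, callback);
        // runs every time this listener's partitions are (re)assigned
        System.out.println("Assigned to this listener: " + assignments.keySet());
    }
}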

Is there a better way to implement multitenant using kafka?

I'm trying to implement a multi-tenant microservice using Spring Boot. I have already implemented the web layer and the persistence layer. In the web layer, I've implemented a filter which sets the tenant id in a prototype bean (using ThreadLocalTargetSource); in the persistence layer I've used the Hibernate multi-tenancy configuration (schema per tenant). They work fine; data is persisted in the appropriate schema. Currently I am implementing the same behaviour in the messaging layer, using the spring-kafka library. So far it works the way I expected, but I'd like to know if there is a better way to do it.
Here is my code:
This is the class that manages a KafkaMessageListenerContainer:
@Component
public class MessagingListenerContainer {

    private final MessagingProperties messagingProperties;
    private KafkaMessageListenerContainer<String, String> container;

    @PostConstruct
    public void init() {
        ContainerProperties containerProps = new ContainerProperties(
                messagingProperties.getConsumer().getTopicsAsList());
        containerProps.setMessageListener(buildCustomMessageListener());
        container = createContainer(containerProps);
        container.start();
    }

    @Bean
    public MessageListener<String, String> buildCustomMessageListener() {
        return new CustomMessageListener();
    }

    private KafkaMessageListenerContainer<String, String> createContainer(
            ContainerProperties containerProps) {
        Map<String, Object> props = consumerProps();
        …
        return container;
    }

    private Map<String, Object> consumerProps() {
        Map<String, Object> props = new HashMap<>();
        …
        return props;
    }

    @PreDestroy
    public void finish() {
        container.stop();
    }
}
This is the CustomMessageListener:
@Slf4j
public class CustomMessageListener implements MessageListener<String, String> {

    @Autowired
    private TenantStore tenantStore; // Prototype Bean

    @Autowired
    private List<ServiceListener> services;

    @Override
    public void onMessage(ConsumerRecord<String, String> record) {
        log.info("Tenant {} | Payload: {} | Record: {}", record.key(),
                record.value(), record.toString());
        tenantStore.setTenantId(record.key()); // currently the tenant is set as the key
        services.stream().forEach(sl -> sl.onMessage(record.value()));
    }
}
This is a test service which would use the message data and tenant:
@Slf4j
@Service
public class ConsumerService implements ServiceListener {

    private final MessagesRepository messages;
    private final TenantStore tenantStore;

    @Override
    public void onMessage(String message) {
        log.info("ConsumerService {}, tenant {}", message, tenantStore.getTenantId());
        messages.save(new Message(message));
    }
}
Thanks for your time!
Just to be clear (correct me if I'm wrong): you are using the same topic(s) for all your tenants. The way you distinguish the messages of each tenant is by the message key, which in your case is the tenant id.
A slight improvement can be made by using message headers to store the tenant id instead of the key. By doing this you will not be limited to partitioning messages based on tenants.
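As a rough sketch of that idea (the header name "tenant-id" is just an example), the listener would read the tenant from the record headers instead of the key:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.header.Header;

// Inside CustomMessageListener: take the tenant id from a header rather than from the key.
@Override
public void onMessage(ConsumerRecord<String, String> record) {
    Header tenantHeader = record.headers().lastHeader("tenant-id"); // example header name
    if (tenantHeader != null) {
        tenantStore.setTenantId(new String(tenantHeader.value(), StandardCharsets.UTF_8));
    }
    services.forEach(sl -> sl.onMessage(record.value()));
}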
Although the model described by you works, it has a major security issue: if someone gets access to your topic, you will be leaking the data of all your tenants.
A more secure approach is using topic naming conventions and ACLs (access control lists). You can find a short explanation here. In a nutshell, you can include the name of the tenant in the topic's name by using either a suffix or a prefix,
e.g.: orders_tenantA, orders_tenantB or tenantA_orders, tenantB_orders.
Then, using ACLs, you can restrict which applications can connect to those specific topics. This scenario is also helpful if one of your tenants needs to connect one of their applications directly to your Kafka cluster.
