Commit Offsets to Kafka on Spark Executors - java

I am getting events from Kafka, enriching/filtering/transforming them in Spark, and then storing them in ES. I am committing the offsets back to Kafka.
I have two questions/problems:
(1) My current Spark job is VERY slow
I have 50 partitions for a topic and 20 executors. Each executor has 2 cores and 4g of memory, and my driver has 8g. I am consuming 1,000 events/partition/second and my batch interval is 10 seconds, which means I am consuming 500,000 events per batch.
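For reference, 50 partitions x 1,000 events/partition/second x 10 seconds is where the 500,000 events per batch comes from. A minimal sketch of how this per-partition rate is typically pinned down for a direct stream (illustrative settings, not necessarily my exact configuration):
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Illustrative sketch only: cap the direct stream at 1,000 records/partition/second
// and let backpressure throttle ingestion when batches fall behind. App name is a placeholder.
SparkConf conf = new SparkConf()
    .setAppName("kafka-to-es")
    .set("spark.streaming.kafka.maxRatePerPartition", "1000")
    .set("spark.streaming.backpressure.enabled", "true");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10)); // 10-second batches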
My ES cluster is as follows:
20 shards / index
3 master instances c5.xlarge.elasticsearch
12 instances m4.xlarge.elasticsearch
disk / node = 1024 GB so 12 TB in total
And I am getting huge scheduling and processing delays.
(2) How can I commit offsets on executors?
Currently, I enrich/transform/filter my events on the executors and then send everything to ES using a BulkRequest. It's a synchronous process: if I get positive feedback, I send the offset list to the driver; if not, I send back an empty list. On the driver, I commit the offsets to Kafka. I believe there should be a way to commit offsets on the executors, but I don't know how to pass the Kafka stream to the executors:
((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges, this::onComplete);
This is the code for committing offsets to Kafka, and it requires the Kafka stream.
Here is my overall code:
kafkaStream.foreachRDD( // kafka topic
    rdd -> { // runs on driver
        rdd.cache();
        String batchIdentifier =
            Long.toHexString(Double.doubleToLongBits(Math.random()));

        LOGGER.info("## [" + batchIdentifier + "] Starting batch ...");
        Instant batchStart = Instant.now();

        List<OffsetRange> offsetsToCommit =
            rdd.mapPartitionsWithIndex( // kafka partition
                (index, eventsIterator) -> { // runs on worker
                    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                    LOGGER.info(
                        "## Consuming " + offsetRanges[index].count() + " events" + " partition: " + index
                    );
                    if (!eventsIterator.hasNext()) {
                        return Collections.emptyIterator();
                    }
                    // get single ES documents
                    List<SingleEventBaseDocument> eventList = getSingleEventBaseDocuments(eventsIterator);
                    // build request wrappers
                    List<InsertRequestWrapper> requestWrapperList = getRequestsToInsert(eventList, offsetRanges[index]);
                    LOGGER.info(
                        "## Processed " + offsetRanges[index].count() + " events" + " partition: " + index + " list size: " + eventList.size()
                    );
                    BulkResponse bulkItemResponses = elasticSearchRepository.addElasticSearchDocumentsSync(requestWrapperList);
                    if (!bulkItemResponses.hasFailures()) {
                        return Arrays.asList(offsetRanges).iterator();
                    }
                    elasticSearchRepository.close();
                    return Collections.emptyIterator();
                },
                true
            ).collect();

        LOGGER.info(
            "## [" + batchIdentifier + "] Collected all offsets in " + (Instant.now().toEpochMilli() - batchStart.toEpochMilli()) + "ms"
        );

        OffsetRange[] offsets = new OffsetRange[offsetsToCommit.size()];
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = offsetsToCommit.get(i);
        }

        try {
            offsetManagementMapper.commit(offsets);
        } catch (Exception e) {
            // ignore
        }

        LOGGER.info(
            "## [" + batchIdentifier + "] Finished batch of " + offsetsToCommit.size() + " messages " +
                "in " + (Instant.now().toEpochMilli() - batchStart.toEpochMilli()) + "ms"
        );
        rdd.unpersist();
    });

You can move the offset logic above the RDD loop. I am using the template below for better offset handling and performance:
JavaInputDStream<ConsumerRecord<String, String>> kafkaStream = KafkaUtils.createDirectStream(jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

kafkaStream.foreachRDD(kafkaStreamRDD -> {
    // fetch Kafka offsets for manually committing them later
    OffsetRange[] offsetRanges = ((HasOffsetRanges) kafkaStreamRDD.rdd()).offsetRanges();

    // filter unwanted data
    kafkaStreamRDD.filter(
        new Function<ConsumerRecord<String, String>, Boolean>() {
            @Override
            public Boolean call(ConsumerRecord<String, String> kafkaRecord) throws Exception {
                if (kafkaRecord != null) {
                    if (!StringUtils.isAnyBlank(kafkaRecord.key(), kafkaRecord.value())) {
                        return Boolean.TRUE;
                    }
                }
                return Boolean.FALSE;
            }
        }).foreachPartition(kafkaRecords -> {
            // init connections here
            while (kafkaRecords.hasNext()) {
                ConsumerRecord<String, String> kafkaConsumerRecord = kafkaRecords.next();
                // work here
            }
        });

    // commit offsets
    ((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);
});
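If you still want your original "commit only when the ES bulk insert succeeded" behaviour with this template, one possible sketch is to count failures in an accumulator on the executors and skip the commit when any partition failed. The accumulator and the failure check below are my assumptions, not part of the template above:
// Sketch only: skip the offset commit when any partition reported a bulk failure.
LongAccumulator bulkFailures = jssc.sparkContext().sc().longAccumulator("es-bulk-failures");

kafkaStream.foreachRDD(kafkaStreamRDD -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) kafkaStreamRDD.rdd()).offsetRanges();
    bulkFailures.reset();

    kafkaStreamRDD.foreachPartition(kafkaRecords -> {
        // build and send the bulk request from kafkaRecords here ...
        // if (bulkResponse.hasFailures()) { bulkFailures.add(1); }
    });

    // foreachPartition is an action, so every partition has finished by this point
    if (bulkFailures.value() == 0) {
        ((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);
    }
});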

Related

How to sync enqueued data before getting data from aws amplify

In my app's login process, I have a service that gets the latest data from AWS Amplify.
DataStore Events
private String processName = "Checking network status...";
private SubscriptionToken subscriptionToken;

public void sync() {
    AmplifyDataStoreManager.start();
    subscriptionToken = Amplify.Hub.subscribe(
        HubChannel.DATASTORE,
        hubEvent -> DataStoreChannelEventName.NETWORK_STATUS.toString().equals(hubEvent.getName()) ||
                DataStoreChannelEventName.SUBSCRIPTION_DATA_PROCESSED.toString().equals(hubEvent.getName()) ||
                DataStoreChannelEventName.MODEL_SYNCED.toString().equals(hubEvent.getName()) ||
                DataStoreChannelEventName.READY.toString().equals(hubEvent.getName()),
        hubEvent -> {
            Log.d("DataStore - Hub Event Name: " + hubEvent.getName());
            if (hubEvent.getData() != null) {
                Log.d("DataStore - Hub Event Data: " + hubEvent.getData());
            }
            if (DataStoreChannelEventName.NETWORK_STATUS.toString().equals(hubEvent.getName())) {
                NetworkStatusEvent networkStatusEvent = (NetworkStatusEvent) hubEvent.getData();
                if (networkStatusEvent != null && !networkStatusEvent.getActive()) {
                    onProcessError("Device not connected to internet");
                }
            } else if (DataStoreChannelEventName.SUBSCRIPTION_DATA_PROCESSED.toString().equals(hubEvent.getName())) {
                ModelWithMetadata modelWithMetadata = ((ModelWithMetadata) hubEvent.getData());
                ModelMetadata modelMetadata = modelWithMetadata.getSyncMetadata();
                Log.d("DataStore - Model ID: " + modelWithMetadata.getModel().getId());
                Log.d("DataStore - Model Name: " + modelWithMetadata.getModel().getModelName());
                processName = "Syncing " + modelWithMetadata.getModel().getModelName() + "...";
                EventBus.getDefault().post(new ProcessEvent(this));
                if (TextUtils.equals(AmplifyDataModel.Transaction.name(), modelWithMetadata.getModel().getModelName()) && (modelMetadata.isDeleted() == null || !modelMetadata.isDeleted())) {
                    AppAmplifyDataAccessManager.saveAppTransaction(AppAmplifyConfiguration.getDataSyncManager().mapAppTransaction(modelWithMetadata.getModel()));
                } else if (TextUtils.equals(AmplifyDataModel.Configuration.name(), modelWithMetadata.getModel().getModelName()) && (modelMetadata.isDeleted() == null || !modelMetadata.isDeleted())) {
                    AppAmplifyConfiguration.getDataSyncManager().mapConfigurations(modelWithMetadata.getModel());
                }
            } else if (DataStoreChannelEventName.MODEL_SYNCED.toString().equals(hubEvent.getName())) {
                ModelSyncedEvent modelSyncedEvent = (ModelSyncedEvent) hubEvent.getData();
                if (modelSyncedEvent != null) {
                    processName = "Syncing " + modelSyncedEvent.getModel() + "...";
                    EventBus.getDefault().post(new ProcessEvent(this));
                }
            } else {
                Amplify.Hub.unsubscribe(subscriptionToken);
                onProcessCompleted();
            }
        }
    );
}
Configuration Model data sample:
double grandTotal;
aws configuration: grandTotal = 100;
local configuration: grandTotal = 100;
The scenario is: the device is offline, a transaction is made, and the local configuration grandTotal is now 200. The configuration sync is called and enqueued because the device is offline.
The app is closed, the internet comes back, and the app is opened again; in the login process, the sync() method is called. What happens is that the local configuration grandTotal is 100 again because of sync(), and after that the sync that was enqueued while the device was offline runs (but I was not able to debug this, it does not go into the configuration's sync method). And the result is:
aws configuration: grandTotal = 200;
local configuration: grandTotal = 100;
What I want is: if there is a pending sync, run that first before syncing the AWS data to local.
After reading the DataStoreChannelEventName enum descriptions, I tried adding OUTBOX_STATUS and it fixed my problem. I'm not sure how, but I tested all the scenarios and it's working.
} else if (DataStoreChannelEventName.OUTBOX_STATUS.toString().equals(hubEvent.getName())) {
    OutboxStatusEvent outBoxStatusEvent = (OutboxStatusEvent) hubEvent.getData();
    if (outBoxStatusEvent != null) {
        Log.d("DataStore - Hub Event Data: " + hubEvent.getData());
    }
}
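Note that the hub subscription in sync() above filters events by name before they reach the handler, so for this branch to ever fire the predicate presumably also has to let OUTBOX_STATUS through. A sketch of the extended filter (the extra line is my assumption, not something shown in the original post):
hubEvent -> DataStoreChannelEventName.NETWORK_STATUS.toString().equals(hubEvent.getName()) ||
        DataStoreChannelEventName.SUBSCRIPTION_DATA_PROCESSED.toString().equals(hubEvent.getName()) ||
        DataStoreChannelEventName.MODEL_SYNCED.toString().equals(hubEvent.getName()) ||
        DataStoreChannelEventName.OUTBOX_STATUS.toString().equals(hubEvent.getName()) || // assumed addition
        DataStoreChannelEventName.READY.toString().equals(hubEvent.getName()),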

Akka references increasing constantly with Play Framework

I changed all the multi-threaded actions in my application over to Akka a few weeks ago.
However, it seems that I am starting to run out of heap space (after a week or so).
By basically looking at all actors with
ActorSelection selection = getContext().actorSelection("/*");
the number of actors seems to increase all the time. After an hour of running I have more than 2200. They are named like:
akka://application/user/$Aic
akka://application/user/$Alb
akka://application/user/$Alc
akka://application/user/$Am
akka://application/user/$Amb
I also noticed that when opening websockets (and closing them) there are these:
akka://application/system/Materializers/StreamSupervisor-2/flow-21-0-unnamed
akka://application/system/Materializers/StreamSupervisor-2/flow-2-0-unnamed
akka://application/system/Materializers/StreamSupervisor-2/flow-27-0-unnamed
akka://application/system/Materializers/StreamSupervisor-2/flow-23-0-unnamed
Is there something specific that I need to do to close them and let them be cleaned?
I am not sure the memory issue is related, but given how many of them there are after an hour on the production server, it could be.
[EDIT: added the code to analyse/count the actors]
public class RetrieveActors extends AbstractActor {

    private String identifyId;
    private List<String> list;

    public RetrieveActors(String identifyId) {
        Logger.debug("Actor retriever identity: " + identifyId);
        this.identifyId = identifyId;
    }

    @Override
    public Receive createReceive() {
        Logger.info("RetrieveActors");
        return receiveBuilder()
            .match(String.class, request -> {
                //Logger.info("Message: " + request + " " + new Date());
                if (request.equalsIgnoreCase("run")) {
                    list = new ArrayList<>();
                    ActorSelection selection = getContext().actorSelection("/*");
                    selection.tell(new Identify(identifyId), getSelf());
                    //ask(selection, new Identify(identifyId), 1000).thenApply(response -> (Object) response).toCompletableFuture().get();
                } else if (request.equalsIgnoreCase("result")) {
                    //Logger.debug("Run list: " + list + " " + new Date());
                    sender().tell(list, self());
                } else {
                    sender().tell("Wrong command: " + request, self());
                }
            }).match(ActorIdentity.class, identity -> {
                if (identity.correlationId().equals(identifyId)) {
                    ActorRef ref = identity.getActorRef().orElse(null);
                    if (ref != null) { // to avoid NullPointerExceptions
                        // Log or store the identity of the actor that replied
                        //Logger.info("The actor " + ref.path().toString() + " exists and has replied!");
                        list.add(ref.path().toString());
                        // We want to discover all children of the received actor (recursive traversal)
                        ActorSelection selection = getContext().actorSelection(ref.path().toString() + "/*");
                        selection.tell(new Identify(identifyId), getSelf());
                    }
                }
                sender().tell(list.toString(), self());
            }).build();
    }
}
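For context, this is roughly how the diagnostic actor above gets driven to produce the counts quoted earlier. This is a hypothetical usage sketch; the actor name, correlation id, and timeout are assumptions, and "system" stands for the ActorSystem that Play already provides:
import akka.actor.ActorRef;
import akka.actor.Props;
import akka.pattern.Patterns;
import java.time.Duration;

// system: the ActorSystem injected by Play
ActorRef retriever = system.actorOf(Props.create(RetrieveActors.class, "count-actors"), "retrieveActors");
retriever.tell("run", ActorRef.noSender());          // kicks off the recursive Identify traversal
// give the Identify round-trips a moment, then ask for the collected paths
Patterns.ask(retriever, "result", Duration.ofSeconds(1))
        .thenAccept(paths -> System.out.println("Live actors: " + paths));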

Read messages from Kafka topic between a range of offsets

I am looking for a way to consume a set of messages from my Kafka topic within a specific offset range (assume my partition has offsets from 200 - 300, and I want to consume the messages from offsets 250-270).
I am using the code below, where I can specify the initial offset, but it consumes all the messages from 250 till the end. Is there any way/attribute available to set an end offset so that consumption stops at that point?
@KafkaListener(id = "KafkaListener",
    topics = "${kafka.topic.name}",
    containerFactory = "kafkaManualAckListenerContainerFactory",
    errorHandler = "${kafka.error.handler}",
    topicPartitions = @TopicPartition(topic = "${kafka.topic.name}",
        partitionOffsets = {
            @PartitionOffset(partition = "0", initialOffset = "250"),
            @PartitionOffset(partition = "1", initialOffset = "250")
        }))
You can use seek() in order to force the consumer to start consuming from a specific offset and then poll() until you reach the target end offset.
public void seek(TopicPartition partition, long offset)
Overrides the fetch offsets that the consumer will use on the next poll(timeout). If this API is invoked for the
same partition more than once, the latest offset will be used on the
next poll(). Note that you may lose data if this API is arbitrarily
used in the middle of consumption, to reset the fetch offsets
For example, let's assume you want to start from offset 200:
TopicPartition tp = new TopicPartition("myTopic", 0);
Long startOffset = 200L;
Long endOffset = 300L;

List<TopicPartition> topics = Arrays.asList(tp);

consumer.assign(topics);
consumer.seek(tp, startOffset);
now you just need to keep poll()ing until endOffset is reached:
boolean run = true;

while (run) {
    ConsumerRecords<String, String> records = consumer.poll(1000);
    for (ConsumerRecord<String, String> record : records) {
        // Do whatever you want to do with `record`

        // Check if end offset has been reached
        if (record.offset() == endOffset) {
            run = false;
            break;
        }
    }
}
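One caveat: offsets in a partition are not always contiguous (log compaction and transactional control records can leave gaps), so the exact endOffset may never show up in a record. A slightly more defensive variant of the same loop, using the same consumer and endOffset variables as above (my assumption, not part of the original answer), compares with >= instead of ==:
boolean run = true;
while (run) {
    ConsumerRecords<String, String> records = consumer.poll(1000);
    for (ConsumerRecord<String, String> record : records) {
        // Do whatever you want to do with `record`

        // Stop once we are at or past the target end offset,
        // even if the exact endOffset was skipped in the log.
        if (record.offset() >= endOffset) {
            run = false;
            break;
        }
    }
}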
KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<String, String>(properties);

// partition and offset to read the data from
TopicPartition partitionToReadFrom = new TopicPartition(topic, 0);
long offsetToReadFrom = 250L;
long numberOfMessagesRead = 0;

// the partition must be assigned before seek() can be used
kafkaConsumer.assign(Collections.singletonList(partitionToReadFrom));

// seek is mostly used to replay data or fetch a specific message
kafkaConsumer.seek(partitionToReadFrom, offsetToReadFrom);

boolean keepOnReading = true;
while (keepOnReading) {
    ConsumerRecords<String, String> records = kafkaConsumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        numberOfMessagesRead++;
        logger.info("Key: " + record.key() + ", Value: " + record.value());
        logger.info("Partition: " + record.partition() + ", Offset: " + record.offset());
        if (record.offset() == 270L) {
            keepOnReading = false;
            break;
        }
    }
}
I hope this helps!

Kafka is slow to produce messages in first seconds

I'm working with Kafka, and I made a producer like this:
synchronized (obj) {
    while (true) {
        long start = Instant.now().toEpochMilli();
        for (int i = 0; i < NUM_MSG_SEC; i++) {
            PriceStreamingData data = PriceStreamingData.newBuilder()
                .setUser(getRequest().getUser())
                .setSecurity(getRequest().getSecurity())
                .setTimestamp(Instant.now().toEpochMilli())
                .setPrice(new Random().nextDouble() * 200)
                .build();
            record = new ProducerRecord<>(topic, keyBuilder.build(data), data);
            producer.send(record, new Callback() {
                @Override
                public void onCompletion(RecordMetadata arg0, Exception arg1) {
                    counter.incrementAndGet();
                    if (arg1 != null) {
                        arg1.printStackTrace();
                    }
                }
            });
        }
        long diffCiclo = Instant.now().toEpochMilli() - start;
        long diff = Instant.now().toEpochMilli() - startTime;
        System.out.println("Number of sent: " + counter.get() +
            " Millisecond:" + (diff) + " - NumberOfSent/Diff(K): " + counter.get() / diff);
        try {
            if (diffCiclo >= 1000) {
                System.out.println("over 1 second: " + diffCiclo);
            } else {
                obj.wait(1000 - diffCiclo);
            }
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
As you can see it is extremely simple: it just builds a new message and sends it.
If I look at the logs for
NumberOfSent/Diff(K)
in the first 10 seconds it performs very badly, just
30k per second
while after 60 seconds I get
180k per second
Why? And how can I make the process start out at 180k right away?
My Kafka producer configuration is the following:
Async producer (but the situation does not change with a sync producer either)
ACKS_CONFIG = 0
BATCH_SIZE_CONFIG = 20000
COMPRESSION_TYPE_CONFIG = none
LINGER_MS_CONFIG = 0
One last detail:
NUM_MSG_SEC is set to 200000 or a bigger number
I found the solution by myself and I hope this post can be useful for other people too.
The problem lies in
ProducerConfig.BATCH_SIZE_CONFIG
and
ProducerConfig.LINGER_MS_CONFIG
My parameters were 20000 and 0; to fix the issue I set them to the higher values 200000 and 1000. Finally, I started the JVM with the parameters:
-XX:MinMetaspaceFreeRatio=100
-XX:MaxMetaspaceFreeRatio=100
because I saw it was taking a while for the metaspace to grow to a decent size.
Now the producer starts directly at 140k and within a second it is already at 180k.
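For reference, a minimal sketch of the adjusted producer configuration described above. The broker address and the String serializers are placeholders (the original code sends PriceStreamingData values with their own serializer); the batch size and linger values are the ones reported to fix the warm-up:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");          // placeholder broker list
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());   // placeholder
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); // placeholder
props.put(ProducerConfig.ACKS_CONFIG, "0");
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none");
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 200000);  // was 20000
props.put(ProducerConfig.LINGER_MS_CONFIG, 1000);     // was 0
KafkaProducer<String, String> producer = new KafkaProducer<>(props);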

Only half of the BinaryDocument(s) are getting inserted during bulk insert

I am having a weird problem during insertion. I have two types of documents: JSON and BinaryDocument. I am performing a bulk insert operation restricted to a batch size.
The operation works fine for JSON documents. But if I upload, say, 100 documents, then only 50 get uploaded in the case of BinaryDocument. Every time, only half the number of documents get loaded into the database.
Here is my code for JSON document insertion:
public void createMultipleCustomerDocuments(String docId, Customer myCust, long numDocs, int batchSize) {
    Gson gson = new GsonBuilder().create();
    JsonObject content = JsonObject.fromJson(gson.toJson(myCust));
    JsonDocument document = JsonDocument.create(docId, content);
    jsonDocuments.add(document);
    documentCounter.incrementAndGet();
    System.out.println("Batch size: " + batchSize + " Document Counter: " + documentCounter.get());

    if (documentCounter.get() >= batchSize) {
        System.out.println("Document counter: " + documentCounter.get());

        Observable
            .from(jsonDocuments)
            .flatMap(new Func1<JsonDocument, Observable<JsonDocument>>() {
                public Observable<JsonDocument> call(final JsonDocument docToInsert) {
                    return theBucket.async().upsert(docToInsert);
                }
            })
            .last()
            .toList()
            .toBlocking()
            .single();

        jsonDocuments.clear();
        documentCounter.set(0);
    }
}
This works completely fine. I have no problem in insertion.
Here is the code for my BinaryDocument insertion:
public void createMultipleCustomerDocuments(final String docId, ByteBuffer myCust, long numDocs, int batchSize) throws BackpressureException, InterruptedException {
    ByteBuf buffer = Unpooled.wrappedBuffer(myCust);
    binaryDocuments.add(buffer);
    documentCounter.incrementAndGet();
    System.out.println("Batch size: " + batchSize + " Document Counter: " + documentCounter.get());

    if (documentCounter.get() >= batchSize) {
        System.out.println("Document counter: " + documentCounter.get() + " Binary Document list size: " + binaryDocuments.size());

        Observable
            .from(binaryDocuments)
            .flatMap(new Func1<ByteBuf, Observable<BinaryDocument>>() {
                public Observable<BinaryDocument> call(final ByteBuf docToInsert) {
                    //docToInsert.retain();
                    return theBucket.async().upsert(BinaryDocument.create(docId, docToInsert));
                }
            })
            .last()
            .toList()
            .toBlocking()
            .single();

        binaryDocuments.clear();
        documentCounter.set(0);
    }
}
This fails. Exactly half the number of documents get inserted. The counters are printed exactly the same way as in the JSON document function, and documentCounter shows the correct number, but the number of documents that actually end up in the DB is only half of that.
Can someone please help me with this?
You seem to be using the same document ID (i.e. the docId of the last member of the batch) to create all the documents in the same batch:
BinaryDocument.create(docId, docToInsert)
You should build up your list of BinaryDocument objects outside the if statement (like you did with the JsonDocument version). Something like:
public void createMultipleCustomerDocuments(final String docId, ByteBuffer myCust, int batchSize) throws BackpressureException, InterruptedException {
    // numDocs is redundant
    ByteBuf buffer = Unpooled.wrappedBuffer(myCust);
    binaryDocuments.add(BinaryDocument.create(docId, buffer)); // ArrayList<BinaryDocument> type
    documentCounter.incrementAndGet();
    System.out.println("Batch size: " + batchSize + " Document Counter: " + documentCounter.get());

    if (documentCounter.get() >= batchSize) {
        System.out.println("Document counter: " + documentCounter.get() + " Binary Document list size: " + binaryDocuments.size());

        Observable
            .from(binaryDocuments)
            .flatMap(new Func1<BinaryDocument, Observable<BinaryDocument>>() {
                public Observable<BinaryDocument> call(final BinaryDocument docToInsert) {
                    return theBucket.async().upsert(docToInsert);
                }
            })
            .last()
            .toBlocking()
            .single();

        binaryDocuments.clear();
        documentCounter.set(0);
    }
}
should work.
