SparkStreaming/Kafka Offset handling - java

I am trying to integrate Spark/Kafka to build a streaming app.
Kafka version: 0.9
Spark: 1.6.2
How do I handle offsets after processing the data in each RDD batch?
Can you give me more insight on handling offsets?
Does Spark have a built-in mechanism to store and read offsets automatically, or do I need to point Spark at an external store like MongoDB or Oracle?
JavaInputDStream<String> directKafkaStream = KafkaUtils.createDirectStream(jsc,
        String.class, String.class, StringDecoder.class, StringDecoder.class, String.class,
        kafkaParams, fromOffsets,   // Map<TopicAndPartition, Long> of starting offsets for this overload
        (Function<MessageAndMetadata<String, String>, String>) MessageAndMetadata::message);

directKafkaStream.foreachRDD(rdd -> {
    // process the batch here ... but where do the offsets go afterwards?
});

The answer to your question depends on your desired message delivery semantics:
at most once: each message will be processed at most once
at least once: each message will be processed at least once
exactly once: each message will be processed once and only once (at most once and at least once at the same time)
First of all, I would recommend reading those slides as well as this blog post.
I am assuming that you are pursuing exactly-once, since the remaining ones are pretty easy to figure out. Anyway, a couple of approaches to consider:
Checkpointing
Spark Streaming allows you to checkpoint your DStreams. If you use the direct stream from KafkaUtils, the offsets are checkpointed as well. The streaming job might fail anywhere between checkpoints, so some messages may get replayed. To achieve exactly-once semantics with this approach, you would have to use an idempotent output operation (in other words, the downstream system must be able to distinguish/ignore replayed messages). A minimal setup sketch follows the pros/cons below.
Pros: easy to achieve; comes out-of-the-box
Cons: at least once semantics; checkpoints become invalidated after code change; offsets are stored in Spark, not in Zookeeper
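For illustration, a minimal setup sketch of checkpoint-based recovery (assuming the Spark 1.6 Java API; the checkpoint path, app name and 10-second batch interval are placeholders). The direct stream and all transformations have to be defined inside the factory so they can be restored from the checkpoint on restart:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

String checkpointDir = "hdfs:///checkpoints/offset-demo";   // placeholder path
JavaStreamingContext jsc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
    SparkConf conf = new SparkConf().setAppName("offset-demo");
    JavaStreamingContext created = new JavaStreamingContext(conf, Durations.seconds(10));
    created.checkpoint(checkpointDir);   // enables checkpointing of the DAG and Kafka offsets
    // build KafkaUtils.createDirectStream(...) and all transformations here
    return created;
});
jsc.start();
jsc.awaitTermination();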
Transactional data storage
You might want to store the offsets yourself in a custom data store that supports transactions, e.g. a relational database like MySQL. In this case you need to make sure that processing the stream and saving the offsets happen within a single transaction; see the sketch after the pros/cons below.
Pros: exactly once semantics
Cons: harder to set up, requires a transactional data store
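A minimal sketch of that idea against the question's directKafkaStream, assuming MySQL and a made-up kafka_offsets(topic, partition_id, until_offset) table; the batch results would be written on the same connection before the commit. On restart you would read this table and pass the stored offsets as the fromOffsets argument of createDirectStream:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

directKafkaStream.foreachRDD(rdd -> {
    // offsets are only available on the RDD produced directly by the direct stream
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    try (Connection conn = DriverManager.getConnection("jdbc:mysql://db-host/streaming")) {
        conn.setAutoCommit(false);
        // 1) write the processed batch results here, using the same connection
        // 2) save the offsets in the very same transaction
        try (PreparedStatement ps = conn.prepareStatement(
                "REPLACE INTO kafka_offsets(topic, partition_id, until_offset) VALUES (?, ?, ?)")) {
            for (OffsetRange range : offsetRanges) {
                ps.setString(1, range.topic());
                ps.setInt(2, range.partition());
                ps.setLong(3, range.untilOffset());
                ps.addBatch();
            }
            ps.executeBatch();
        }
        conn.commit();   // data and offsets become visible atomically, or not at all
    }
});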
WAL-based Receiver
You can use the older Kafka connector based on WAL.
Pros: works with other data sources as well; stores offsets in Zookeeper
Cons: it depends on HDFS; you cannot access offsets directly; it makes parallelism harder to achieve
To sum up, it all depends on your requirements - perhaps you can lift some restrictions to simplify this problem.

When you want to consume data from a Kafka topic using Spark Streaming, there are two ways you can do that.
1. Receiver-based approach
In this approach, offsets are managed in Zookeeper, and Spark updates them there automatically. For more information:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-1-receiver-based-approach
2. Direct approach (no receivers)
This approach does not update offsets in Zookeeper, hence Zookeeper-based Kafka monitoring tools will not show progress. However, you can access the offsets processed in each batch and update Zookeeper yourself, as in the sketch below.
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
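For example, a short sketch (using the question's directKafkaStream) of reading the offsets processed in each batch; the actual ZooKeeper update is only indicated as a comment, since it depends on the client library you use:

import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

directKafkaStream.foreachRDD(rdd -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    // ... process the batch first ...
    for (OffsetRange range : offsetRanges) {
        System.out.printf("topic=%s partition=%d from=%d until=%d%n",
                range.topic(), range.partition(), range.fromOffset(), range.untilOffset());
        // write range.untilOffset() to ZooKeeper (e.g. via Curator) or another store here
    }
});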

Related

How to get record creation time from Kafka topic

We have a Kafka Streams application that reads a record from a source Kafka topic, does some validation and integration with other systems via API calls, and based on the integration results builds another record and publishes it to a destination Kafka topic.
Both the source and destination Kafka topics are created with the CreateTime timestamp type. I am not sure whether this timestamp type was chosen because of some business need or simply because it is the default if you don't set it yourself.
Now we have a performance measuring tool that can inject a big load of messages into the Kafka source topic and consume the processing results from the Kafka destination topic. While doing this it records the timestamp of each record into an embedded database, per key, and then evaluates the various percentiles we measure performance against.
Because the topics' timestamp type is set to CreateTime, this does not work: the start time and end time have exactly the same value. Changing the timestamp type of the destination Kafka topic to LogAppendTime would solve our problem. However, even though I am not aware of any business requirement for CreateTime as the topics' timestamp type, that does not mean there could not be one. Modifying your infrastructure design to satisfy your testing needs sounds like a bad approach to me.
Wondering if there is another more elegant way of achieving this.
Thank you in advance for your inputs/suggestions.

Best Practice for Kafka rollback scenario in microservices [duplicate]

We have a micro-services architecture, with Kafka used as the communication mechanism between the services. Some of the services have their own databases. Say the user makes a call to Service A, which should result in a record (or set of records) being created in that service’s database. Additionally, this event should be reported to other services, as an item on a Kafka topic. What is the best way of ensuring that the database record(s) are only written if the Kafka topic is successfully updated (essentially creating a distributed transaction around the database update and the Kafka update)?
We are thinking of using spring-kafka (in a Spring Boot WebFlux service), and I can see that it has a KafkaTransactionManager, but from what I understand this is more about Kafka transactions themselves (ensuring consistency across the Kafka producers and consumers), rather than synchronising transactions across two systems (see here: “Kafka doesn't support XA and you have to deal with the possibility that the DB tx might commit while the Kafka tx rolls back.”). Additionally, I think this class relies on Spring’s transaction framework which, at least as far as I currently understand, is thread-bound, and won’t work if using a reactive approach (e.g. WebFlux) where different parts of an operation may execute on different threads. (We are using reactive-pg-client, so are manually handling transactions, rather than using Spring’s framework.)
Some options I can think of:
Don’t write the data to the database: only write it to Kafka. Then use a consumer (in Service A) to update the database. This seems like it might not be the most efficient, and will have problems in that the service which the user called cannot immediately see the database changes it should have just created.
Don’t write directly to Kafka: write to the database only, and use something like Debezium to report the change to Kafka. The problem here is that the changes are based on individual database records, whereas the business significant event to store in Kafka might involve a combination of data from multiple tables.
Write to the database first (if that fails, do nothing and just throw the exception). Then, when writing to Kafka, assume that the write might fail. Use the built-in auto-retry functionality to get it to keep trying for a while. If that eventually completely fails, try to write to a dead letter queue and create some sort of manual mechanism for admins to sort it out. And if writing to the DLQ fails (i.e. Kafka is completely down), just log it some other way (e.g. to the database), and again create some sort of manual mechanism for admins to sort it out.
Anyone got any thoughts or advice on the above, or able to correct any mistakes in my assumptions above?
Thanks in advance!
I'd suggest using a slightly altered variant of approach 2.
Write into your database only, but in addition to the actual table writes, also write "events" into a special table within that same database; these event records contain the aggregations you need. In the simplest case, you'd just insert another entity, e.g. mapped by JPA, which contains a JSON property with the aggregate payload. Of course this could be automated by some means of a transaction listener / framework component.
Then use Debezium to capture the changes just from that table and stream them into Kafka. That way you have both: eventually consistent state in Kafka (the events in Kafka may trail behind or you might see a few events a second time after a restart, but eventually they'll reflect the database state) without the need for distributed transactions, and the business level event semantics you're after.
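To make that concrete, a rough sketch with JPA/Spring; the entity, its fields, the Order type, entityManager and the toJson helper are all made up for illustration:

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Lob;
import org.springframework.transaction.annotation.Transactional;

@Entity
public class OutboxEvent {
    @Id @GeneratedValue
    private Long id;
    private String aggregateType;   // e.g. "Order"
    private String aggregateId;
    private String type;            // e.g. "OrderCreated"
    @Lob
    private String payload;         // aggregated JSON for downstream consumers

    protected OutboxEvent() { }     // required by JPA

    public OutboxEvent(String aggregateType, String aggregateId, String type, String payload) {
        this.aggregateType = aggregateType;
        this.aggregateId = aggregateId;
        this.type = type;
        this.payload = payload;
    }
}

// in the service: both writes share one local transaction, so they commit or roll back together
@Transactional
public void placeOrder(Order order) {
    entityManager.persist(order);                                         // the actual table write
    entityManager.persist(new OutboxEvent("Order", order.getId().toString(),
            "OrderCreated", toJson(order)));                              // the event record
}

Debezium would then be configured to capture only this outbox table and stream its payload into Kafka.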
(Disclaimer: I'm the lead of Debezium; funnily enough I'm just in the process of writing a blog post discussing this approach in more detail)
Here are the posts
https://debezium.io/blog/2018/09/20/materializing-aggregate-views-with-hibernate-and-debezium/
https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
First of all, I have to say that I'm neither a Kafka nor a Spring expert, but I think this is more of a conceptual challenge when writing to independent resources, and the solution should be adaptable to your technology stack. Furthermore, I should say that this solution tries to solve the problem without an external component like Debezium, because in my opinion each additional component brings challenges in testing, maintaining and running an application, which is often underestimated when choosing such an option. Also, not every database can be used as a Debezium source.
To make sure that we are talking about the same goals, let's clarify the situation with a simplified airline example where customers can buy tickets. After a successful order the customer receives a message (mail, push notification, …) sent by an external messaging system (the system we have to talk to).
In a traditional JMS world with an XA transaction between our database (where we store orders) and the JMS provider, it would look like the following: the client sends the order to our app, where we start a transaction. The app stores the order in its database. Then the message is sent to JMS and we commit the transaction. Both operations participate in the transaction even though they're talking to their own resources. As the XA transaction guarantees ACID, we're fine.
Let's bring Kafka (or any other resource that is not able to participate in the XA transaction) into the game. As there is no coordinator that syncs both transactions anymore, the main idea of the following is to split processing into two parts with a persistent state.
When you store the order in your database you can also store the message (with aggregated data) in the same database (e.g. as JSON in a CLOB column) that you want to send to Kafka afterwards. Same resource, ACID guaranteed, everything fine so far. Now you need a mechanism that polls your "KafkaTasks" table for new tasks that should be sent to a Kafka topic (e.g. with a timer service; maybe the @Scheduled annotation can be used in Spring). After the message has been successfully sent to Kafka you can delete the task entry. This ensures that the message to Kafka is only sent when the order is also successfully stored in the application database. Did we achieve the same guarantees as with an XA transaction? Unfortunately no, as there is still the chance that writing to Kafka works but the deletion of the task fails. In this case the retry mechanism (you would need one, as mentioned in your question) would reprocess the task and send the message twice. If your business case is happy with this "at-least-once" guarantee, you're done here with an (IMHO) semi-complex solution that could easily be implemented as framework functionality so not everyone has to bother with the details.
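A rough sketch of such a poller with Spring's @Scheduled and spring-kafka; the kafka_tasks table, topic name and polling interval are assumptions, and @EnableScheduling is expected on a configuration class:

import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class KafkaTaskRelay {

    private final JdbcTemplate jdbc;
    private final KafkaTemplate<String, String> kafka;

    public KafkaTaskRelay(JdbcTemplate jdbc, KafkaTemplate<String, String> kafka) {
        this.jdbc = jdbc;
        this.kafka = kafka;
    }

    @Scheduled(fixedDelay = 1000)
    public void relayPendingTasks() {
        List<Map<String, Object>> tasks =
                jdbc.queryForList("SELECT id, payload FROM kafka_tasks ORDER BY id LIMIT 100");
        for (Map<String, Object> task : tasks) {
            try {
                // block until the broker acknowledges the record
                kafka.send("order-events", (String) task.get("payload")).get(10, TimeUnit.SECONDS);
                // delete only after a successful send; if this delete fails, the task is sent
                // again on the next run, which is exactly the at-least-once case described above
                jdbc.update("DELETE FROM kafka_tasks WHERE id = ?", task.get("id"));
            } catch (Exception e) {
                return;   // stop this run; remaining tasks are retried by the next scheduled run
            }
        }
    }
}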
If you need “exactly-once” then you cannot store your state in the application database (in this case “deletion of a task” is the “state”) but instead you must store it in Kafka (assuming that you have ACID guarantees between two Kafka topics). An example: Let’s say you have 100 tasks in the table (IDs 1 to 100) and the task job processes the first 10. You write your Kafka messages to their topic and another message with the ID 10 to “your topic”. All in the same Kafka-transaction. In the next cycle you consume your topic (value is 10) and take this value to get the next 10 tasks (and delete the already processed tasks).
If there are easier (in-application) solutions with the same guarantees I’m looking forward to hear from you!
Sorry for the long answer but I hope it helps.
All the approaches described above are good ways to approach the problem and are well-defined patterns. You can explore them in the links provided below.
Pattern: Transactional outbox
Publish an event or message as part of a database transaction by saving it in an OUTBOX in the database.
http://microservices.io/patterns/data/transactional-outbox.html
Pattern: Polling publisher
Publish messages by polling the outbox in the database.
http://microservices.io/patterns/data/polling-publisher.html
Pattern: Transaction log tailing
Publish changes made to the database by tailing the transaction log.
http://microservices.io/patterns/data/transaction-log-tailing.html
Debezium is a valid answer, but (as I've experienced) it can add the overhead of running an extra pod and making sure that pod doesn't fall over. This could just be me griping about a few back-to-back instances where pods OOM-errored and didn't come back up, networking rule rollouts dropped some messages, and WAL access to an AWS Aurora DB started behaving oddly... it seemed that everything that could go wrong, did. Not saying Debezium is bad, it's fantastically stable, but for devs, running it often becomes a networking skill rather than a coding skill.
A KISS solution using normal coding techniques that will work 99.99% of the time (and inform you of the 0.01%) would be (a sketch follows the steps):
Start transaction.
Sync save to DB.
-> If that fails, bail out.
Async send the message to Kafka and block until the topic reports that it has received the message.
-> If it times out or fails, abort the transaction.
-> If it succeeds, commit the transaction.
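A minimal sketch of those steps with Spring's @Transactional and spring-kafka; OrderRepository, Order and the topic name are made up, and toJson stands in for whatever serializer you use. Note that the Kafka ack happens before the DB commit, so a commit failure after a successful send still needs the manual clean-up mentioned above:

import java.util.concurrent.TimeUnit;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OrderService {

    private final OrderRepository orders;                    // hypothetical Spring Data repository
    private final KafkaTemplate<String, String> kafka;

    public OrderService(OrderRepository orders, KafkaTemplate<String, String> kafka) {
        this.orders = orders;
        this.kafka = kafka;
    }

    @Transactional   // start transaction; rolled back if anything below throws
    public void createOrder(Order order) throws Exception {
        orders.save(order);                                  // sync save to DB
        kafka.send("order-events", toJson(order))            // async send to Kafka ...
             .get(5, TimeUnit.SECONDS);                      // ... block until the broker acks
        // returning normally commits; a timeout or send failure throws and aborts the transaction
    }

    private String toJson(Order order) {
        return "{}";   // placeholder; serialize with your mapper of choice
    }
}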
I'd suggest using a newer approach: the 2-phase message. With this approach much less code is needed, and you don't need Debezium any more.
https://betterprogramming.pub/an-alternative-to-outbox-pattern-7564562843ae
For this new approach, what you need to do is:
When writing to your database, also write an event record to an auxiliary table.
Submit a 2-phase message to DTM
Write a service to query whether an event is saved in the auxiliary table.
With the help of the DTM SDK, you can accomplish the above three steps with about 8 lines of Go, much less code than other solutions.
// prepare the 2-phase message and submit it together with the local database transaction
msg := dtmcli.NewMsg(DtmServer, gid).
    Add(busi.Busi+"/TransIn", &TransReq{Amount: 30})
err := msg.DoAndSubmitDB(busi.Busi+"/QueryPrepared", db, func(tx *sql.Tx) error {
    return AdjustBalance(tx, busi.TransOutUID, -req.Amount)
})

// the endpoint DTM calls back to check whether the local transaction committed
app.GET(BusiAPI+"/QueryPrepared", dtmutil.WrapHandler2(func(c *gin.Context) interface{} {
    return MustBarrierFromGin(c).QueryPrepared(db)
}))
Each of your original options has its disadvantages:
The user cannot immediately see the database changes they have just created.
Debezium will capture the database's log, which may be much larger than the events you want; deployment and maintenance of Debezium are also not trivial.
The "built-in auto-retry functionality" is not cheap; it may require a lot of code or maintenance effort.

How to use MongoDB change stream processing in a scalable and fault-tolerant (preferably live-live) fashion

I am using a MongoDB change stream which listens for changes in a collection according to particular match logic. https://spring.io/blog/2018/09/27/what-s-new-in-spring-data-lovelace-for-mongodb
While the above works great, I am not sure how we can make this processing resilient and scalable. I tried searching around but could not find any solution for that.
How can we start multiple threads/processes listening for changes with the same match criteria, so that we can process changes in parallel, while also ensuring that changes for a particular key are picked up by the same thread/process to prevent out-of-order processing (like Kafka partitions), so that the application is both resilient and scalable?
Thanks for answering my question

Write-through cache Redis

I have been pondering how to reliably implement a write-through caching mechanism to store realtime data.
Basically what we need is this:
Save data to Redis -> Save to database (underlying)
Read data from Redis <- Read from database in case unavailable in cache
The resources online to help in the implementation of this caching strategy seem scarce.
The problem is:
1) No built-in transaction possibility between Redis and the database (Mongo in my case).
2) No transactions mean that writes to the underlying database are unreliable.
The most straightforward way I see how this can be implemented is by using a broker like Kafka and putting messages on a persistent queue to be processed later.
Therefore Kafka would be the responsible entity for reliable processing.
Another way would be a custom implementation in a scheduler that checks the Redis database for dirty records. On first thought there seem to be some tradeoffs to this approach, and I would rather not go down this road if possible.
I am looking on some options on how this can be implemented otherwise.
Or whether this is in fact the most viable approach.
So a better approach, as you mentioned above, is to use Kafka and a consumer which will store the data in Mongo. But read about its delivery guarantees: as I remember, exactly-once is guaranteed only in Kafka Streams (between two topics), so in your case your database writes should be idempotent because you get an at-least-once guarantee. Also don't forget to turn AOF on with Redis, so as not to lose data, and keep in mind that in this case you get eventual consistency in the DB, with all the consequences.
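To illustrate the idempotency point, a rough sketch of a consumer that upserts into Mongo by record key, so a replayed message (at-least-once delivery) simply overwrites the same document instead of creating a duplicate; the topic, database/collection names and consumerProps (with enable.auto.commit=false) are placeholders:

import java.time.Duration;
import java.util.Collections;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.bson.Document;

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(Collections.singletonList("realtime-data"));
MongoCollection<Document> collection = MongoClients.create("mongodb://localhost:27017")
        .getDatabase("app").getCollection("data");

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        // idempotent write: the document id is the record key, so replays overwrite rather than duplicate
        collection.replaceOne(
                Filters.eq("_id", record.key()),
                Document.parse(record.value()).append("_id", record.key()),
                new ReplaceOptions().upsert(true));
    }
    consumer.commitSync();   // commit offsets only after the Mongo writes succeeded
}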
On review I will use MongoDB as a single datastore without Redis at all.
Premature optimization is evil I guess.
Anyhow, I can add additional architecture afterwards after benchmarking.
Plans to refactor towards a cache shouldn't be too hard.
Scaling is additional concern so I shouldn't be bothered with that during development right now.
Accepted @Ipave's answer, going with a single datastore for the moment.

Proper way of testing Kafka, Spark and ES integration

I have a very common problem, yet I am unable to find the "right" or "correct" way to test this.
I have a simple Spark job that gets events from Kafka (the events are in protobuf format), applies some transformations to them and then stores the result in ES. I am saving single events.
I need to know how to test this properly. I am using BulkProcessor, and therefore I am manually committing the offsets when I think they should be committed. So it makes sense to test this workflow properly because I don't want to lose events.
My understanding is that I need a mock Kafka instance, need to call the appropriate function that handles all the transformations, and then store the result in ES. However, I don't know how to do all of this. Also, I don't know how to write test events in protobuf format into Kafka topics.
P.S. I am NOT using Spring framework
