I have been pondering how to reliably implement a write-through caching mechanism to store realtime data.
Basically what we need is this:
Save data to Redis -> Save to database (underlying)
Read data from Redis <- Read from database in case unavailable in cache
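As a rough illustration of the two flows above, here is a minimal sketch in Python. Plain dicts stand in for Redis and the database, and the class and method names are made up for the example. Note that the two writes in `write` are not atomic, which is exactly the problem described below.

```python
class WriteThroughCache:
    """Toy write-through / read-through cache; dicts stand in for Redis and the database."""

    def __init__(self, cache, store):
        self.cache = cache   # stand-in for Redis
        self.store = store   # stand-in for the underlying database

    def write(self, key, value):
        # Write-through: update the cache, then synchronously update the store.
        self.cache[key] = value
        self.store[key] = value

    def read(self, key):
        # Read-through: serve from the cache, fall back to the store on a miss.
        if key in self.cache:
            return self.cache[key]
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value   # repopulate the cache for later reads
        return value
```

If the process dies between the two assignments in `write`, the cache and the store disagree; nothing in this sketch repairs that.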
The resources online to help in the implementation of this caching strategy seem scarce.
The problem is:
1) No built-in transaction possibility between Redis and the database (Mongo in my case).
2) No transactions mean that writes to the underlying database are unreliable.
The most straightforward way I see how this can be implemented is by using a broker like Kafka and putting messages on a persistent queue to be processed later.
Therefore Kafka would be the responsible entity for reliable processing.
Another way would be a custom implementation in a scheduler that checks the Redis database for dirty records. At first glance there seem to be some tradeoffs to this approach, and I would rather not go down this road if possible.
I am looking for other options for implementing this,
or for confirmation that this is in fact the most viable approach.
The better approach, as you mentioned above, is to use Kafka with a consumer that stores the data in Mongo. But read up on its delivery guarantees: as far as I remember, exactly-once is guaranteed only in Kafka Streams (between two topics). In your case the writes to your database should be idempotent, because you only get an at-least-once guarantee. Don't forget to turn on AOF in Redis so that you don't lose data. And don't forget that in this case you get eventual consistency in the database, with all the consequences.
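To make the idempotency point concrete, here is a minimal sketch in Python, assuming each message carries its own record id and a monotonically increasing version (both are assumptions for the example, not something Kafka provides for you). A dict stands in for the Mongo collection.

```python
def apply_message(store, message):
    """Upsert keyed by the message's record id, so replays overwrite instead of duplicating.

    Returns True if the store changed, False if the message was stale or a replay.
    """
    current = store.get(message["id"])
    # Ignore stale or replayed messages using the per-record version number.
    if current is not None and current["version"] >= message["version"]:
        return False
    store[message["id"]] = {"version": message["version"], "value": message["value"]}
    return True
```

With a consumer written this way, receiving the same message twice leaves the database in the same state as receiving it once, which is what at-least-once delivery requires.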
On review I will use MongoDB as a single datastore without Redis at all.
Premature optimization is evil I guess.
Anyhow, I can add more architecture later, after benchmarking.
Refactoring towards a cache later shouldn't be too hard.
Scaling is an additional concern, so I shouldn't worry about it during development right now.
Accepted @Ipave's answer; going with a single datastore for the moment.
Related
We have a micro-services architecture, with Kafka used as the communication mechanism between the services. Some of the services have their own databases. Say the user makes a call to Service A, which should result in a record (or set of records) being created in that service’s database. Additionally, this event should be reported to other services, as an item on a Kafka topic. What is the best way of ensuring that the database record(s) are only written if the Kafka topic is successfully updated (essentially creating a distributed transaction around the database update and the Kafka update)?
We are thinking of using spring-kafka (in a Spring Boot WebFlux service), and I can see that it has a KafkaTransactionManager, but from what I understand this is more about Kafka transactions themselves (ensuring consistency across the Kafka producers and consumers), rather than synchronising transactions across two systems (see here: “Kafka doesn't support XA and you have to deal with the possibility that the DB tx might commit while the Kafka tx rolls back.”). Additionally, I think this class relies on Spring’s transaction framework which, at least as far as I currently understand, is thread-bound, and won’t work if using a reactive approach (e.g. WebFlux) where different parts of an operation may execute on different threads. (We are using reactive-pg-client, so are manually handling transactions, rather than using Spring’s framework.)
Some options I can think of:
1. Don’t write the data to the database: only write it to Kafka. Then use a consumer (in Service A) to update the database. This seems like it might not be the most efficient, and will have problems in that the service which the user called cannot immediately see the database changes it should have just created.
2. Don’t write directly to Kafka: write to the database only, and use something like Debezium to report the change to Kafka. The problem here is that the changes are based on individual database records, whereas the business significant event to store in Kafka might involve a combination of data from multiple tables.
3. Write to the database first (if that fails, do nothing and just throw the exception). Then, when writing to Kafka, assume that the write might fail. Use the built-in auto-retry functionality to get it to keep trying for a while. If that eventually completely fails, try to write to a dead letter queue and create some sort of manual mechanism for admins to sort it out. And if writing to the DLQ fails (i.e. Kafka is completely down), just log it some other way (e.g. to the database), and again create some sort of manual mechanism for admins to sort it out.
Anyone got any thoughts or advice on the above, or able to correct any mistakes in my assumptions above?
Thanks in advance!
I'd suggest using a slightly altered variant of approach 2.
Write into your database only, but in addition to the actual table writes, also write "events" into a special table within that same database; these event records would contain the aggregations you need. In the easiest way, you'd simply insert another entity e.g. mapped by JPA, which contains a JSON property with the aggregate payload. Of course this could be automated by some means of transaction listener / framework component.
Then use Debezium to capture the changes just from that table and stream them into Kafka. That way you have both: eventually consistent state in Kafka (the events in Kafka may trail behind or you might see a few events a second time after a restart, but eventually they'll reflect the database state) without the need for distributed transactions, and the business level event semantics you're after.
(Disclaimer: I'm the lead of Debezium; funnily enough I'm just in the process of writing a blog post discussing this approach in more detail)
Here are the posts
https://debezium.io/blog/2018/09/20/materializing-aggregate-views-with-hibernate-and-debezium/
https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
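The "write the event in the same transaction" step can be sketched as follows. This is a minimal illustration in Python with SQLite standing in for the service's database; the table names, column names and the `OrderPlaced` event type are invented for the example, and the change-capture side (Debezium) is not shown.

```python
import json
import sqlite3


def create_schema(conn):
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "event_type TEXT NOT NULL, payload TEXT NOT NULL)"
    )


def place_order(conn, order_id, customer):
    # "with conn" commits on success and rolls back if an exception escapes,
    # so the business row and the event row are written atomically.
    with conn:
        conn.execute("INSERT INTO orders (id, customer) VALUES (?, ?)", (order_id, customer))
        payload = json.dumps({"orderId": order_id, "customer": customer})
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", payload),
        )
```

Because both inserts share one local transaction, there is never an order without its event or an event without its order; only the outbox table then needs to be streamed to Kafka.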
First of all, I have to say that I’m neither a Kafka nor a Spring expert, but I think this is more of a conceptual challenge when writing to independent resources, and the solution should be adaptable to your technology stack. Furthermore, I should say that this solution tries to solve the problem without an external component like Debezium, because in my opinion each additional component brings challenges in testing, maintaining and running an application, which are often underestimated when choosing such an option. Also, not every database can be used as a Debezium source.
To make sure that we are talking about the same goals, let’s clarify the situation with a simplified airline example, where customers can buy tickets. After a successful order the customer will receive a message (mail, push notification, …) that is sent by an external messaging system (the system we have to talk to).
In a traditional JMS world with an XA transaction between our database (where we store orders) and the JMS provider, it would look like the following: the client sends the order to our app, where we start a transaction. The app stores the order in its database. Then the message is sent to JMS and you can commit the transaction. Both operations participate in the transaction even though they’re talking to their own resources. As the XA transaction guarantees ACID, we’re fine.
Let’s bring Kafka (or any other resource that is not able to participate in the XA transaction) into the game. As there is no longer a coordinator that syncs both transactions, the main idea of the following is to split processing into two parts with a persistent state.
When you store the order in your database, you can also store the message (with aggregated data) that you want to send to Kafka afterwards in the same database (e.g. as JSON in a CLOB column). Same resource, ACID guaranteed, everything fine so far. Now you need a mechanism that polls your “KafkaTasks” table for new tasks that should be sent to a Kafka topic (e.g. with a timer service; in Spring, the @Scheduled annotation could probably be used). After the message has been successfully sent to Kafka, you can delete the task entry. This ensures that the message is only sent to Kafka when the order has also been successfully stored in the application database. Did we achieve the same guarantees as with an XA transaction? Unfortunately, no, as there is still the chance that writing to Kafka works but the deletion of the task fails. In this case the retry mechanism (you would need one, as mentioned in your question) would reprocess the task and send the message twice. If your business case is happy with this “at-least-once” guarantee, you’re done here, with an IMHO semi-complex solution that could easily be implemented as framework functionality so that not everyone has to bother with the details.
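The polling step described above could look roughly like this. It is only a sketch in Python with SQLite standing in for the application database; the `kafka_tasks` table layout is an assumption, and `send` is a hypothetical callable wrapping whatever Kafka producer you use (it should return only after the broker has acknowledged the message, and raise otherwise).

```python
import sqlite3


def create_schema(conn):
    conn.execute("CREATE TABLE kafka_tasks (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)")


def publish_pending(conn, send, batch_size=10):
    """Send pending tasks in id order; delete each task only after send() returned."""
    rows = conn.execute(
        "SELECT id, payload FROM kafka_tasks ORDER BY id LIMIT ?", (batch_size,)
    ).fetchall()
    sent = 0
    for task_id, payload in rows:
        try:
            send(payload)   # hypothetical producer wrapper, raises on failure
        except Exception:
            break           # leave this task and the later ones for the next poll
        with conn:
            conn.execute("DELETE FROM kafka_tasks WHERE id = ?", (task_id,))
        sent += 1
    return sent
```

If the process crashes after `send` but before the `DELETE` commits, the task is still in the table and will be sent again on the next poll, which is the at-least-once behaviour described above.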
If you need “exactly-once” then you cannot store your state in the application database (in this case “deletion of a task” is the “state”) but instead you must store it in Kafka (assuming that you have ACID guarantees between two Kafka topics). An example: Let’s say you have 100 tasks in the table (IDs 1 to 100) and the task job processes the first 10. You write your Kafka messages to their topic and another message with the ID 10 to “your topic”. All in the same Kafka-transaction. In the next cycle you consume your topic (value is 10) and take this value to get the next 10 tasks (and delete the already processed tasks).
If there are easier (in-application) solutions with the same guarantees, I’m looking forward to hearing from you!
Sorry for the long answer but I hope it helps.
The approaches described above are well-defined patterns for this problem. You can explore them in the links provided below.
Pattern: Transactional outbox
Publish an event or message as part of a database transaction by saving it in an OUTBOX in the database.
http://microservices.io/patterns/data/transactional-outbox.html
Pattern: Polling publisher
Publish messages by polling the outbox in the database.
http://microservices.io/patterns/data/polling-publisher.html
Pattern: Transaction log tailing
Publish changes made to the database by tailing the transaction log.
http://microservices.io/patterns/data/transaction-log-tailing.html
Debezium is a valid answer but (as I've experienced) it can require some extra overhead of running an extra pod and making sure that pod doesn't fall over. This could just be me griping about a few back to back instances where pods OOM errored and didn't come back up, networking rule rollouts dropped some messages, WAL access to an aws aurora db started behaving oddly... It seems that everything that could have gone wrong, did. Not saying Debezium is bad, it's fantastically stable, but often for devs running it becomes a networking skill rather than a coding skill.
A KISS solution using ordinary code that will work 99.99% of the time (and inform you of the other 0.01%) would be:
Start Transaction
Sync save to DB
  -> If it fails, bail out.
Async send message to Kafka.
Block until the topic reports that it has received the message.
  -> If it times out or fails, Abort Transaction.
  -> If it succeeds, Commit Transaction.
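A minimal sketch of these steps in Python, with SQLite standing in for the database. The `records` table is invented for the example, and `send_and_wait` is a hypothetical callable that blocks until the broker acknowledges the message and raises on failure or timeout.

```python
import sqlite3


def create_schema(conn):
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)")


def save_and_publish(conn, record, send_and_wait):
    """Commit the row only if the broker acknowledged the message."""
    try:
        # The INSERT opens a transaction that stays uncommitted for now.
        conn.execute("INSERT INTO records (payload) VALUES (?)", (record,))
        # Hypothetical: blocks until the ack arrives, raises on failure or timeout.
        send_and_wait(record)
    except Exception:
        conn.rollback()   # abort: nothing is persisted
        raise
    conn.commit()         # the message was acknowledged, so persist the row
```

The remaining gap is the window between the acknowledgement and `commit()`: if the commit itself fails, the message is already in Kafka without a matching row. The transaction is also held open while waiting on the broker, which may matter under load.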
I'd suggest a newer approach: the 2-phase message. With this approach much less code is needed, and you don't need Debezium any more.
https://betterprogramming.pub/an-alternative-to-outbox-pattern-7564562843ae
For this new approach, what you need to do is:
1. When writing to your database, write an event record to an auxiliary table.
2. Submit a 2-phase message to DTM.
3. Write a service to query whether an event is saved in the auxiliary table.
With the help of the DTM SDK, you can accomplish the above 3 steps with 8 lines of Go, much less code than other solutions.
msg := dtmcli.NewMsg(DtmServer, gid).
    Add(busi.Busi+"/TransIn", &TransReq{Amount: 30})
err := msg.DoAndSubmitDB(busi.Busi+"/QueryPrepared", db, func(tx *sql.Tx) error {
    return AdjustBalance(tx, busi.TransOutUID, -req.Amount)
})

app.GET(BusiAPI+"/QueryPrepared", dtmutil.WrapHandler2(func(c *gin.Context) interface{} {
    return MustBarrierFromGin(c).QueryPrepared(db)
}))
Each of your original options has its disadvantages:
1. The service the user called cannot immediately see the database changes it has just created.
2. Debezium captures the database log, which may be much larger than the events you want. Deploying and maintaining Debezium is also not an easy job.
3. "Built-in auto-retry functionality" is not cheap; it may require a lot of code or maintenance effort.
I am currently developing an application using Spring MVC4 and hibernate 4. I have implemented hibernate second level cache for performance improvement. If I use Redis which is an in-memory data structure store, used as a database, cache etc, the performance will increase but will it be a drastic change?
You can expect drastic differences if you cache what is worth caching and avoid caching data that should not be cached at all. As beauty is in the eye of the beholder, so is performance. Here are several aspects to keep in mind when using the Hibernate second-level cache:
No Custom serialization - Memory intensive
If you use second-level caching, you will not be able to use fast serialization frameworks such as Kryo and will have to stick to Java serialization, which sucks.
On top of this for each entity type you will have a separate region and within each region, you will have an entry for each key of each entity.
In terms of memory efficiency, this is inefficient.
Lacks the ability to store and distribute rich objects
Most modern caches also offer compute grid functionality. Having your objects fragmented into many small pieces decreases your ability to execute distributed tasks with guaranteed data co-location. That depends a little on the grid provider, but for many it would be a limitation.
Sub optimal performance
Depending on how much performance you need and what type of application you have, using the Hibernate second-level cache might be a good or a bad choice. Good in that it is plug and play (kind of); bad because you will never squeeze out the performance you could have gained. Also, designing rich models means more upfront work and more OOP.
Limited querying capabilities ON the Cache itself
That depends on the cache provider, but some providers are really not good at doing JOINs with a WHERE clause on anything other than the ID. If you try to build an in-memory index for a query on Hazelcast, for example, you will see what I mean.
Yes, if you use Redis, it will improve your performance.
No, it will not be a drastic change. :)
https://memorynotfound.com/spring-redis-application-configuration-example/
http://www.baeldung.com/spring-data-redis-tutorial
The links above will help you integrate Redis with your project.
It depends on the traffic.
If you have 1000 or more requests per second and you are low on RAM, then yes, use Redis nodes on another machine to take some of the load. It will greatly relieve your RAM usage and improve request speed.
Otherwise, do not use it.
Remember that you can adopt this approach later, once you see what the RAM and database connection pool usage looks like.
Your question was already discussed here. Check this link: Application cache v.s. hibernate second level cache, which to use?
This was the most accepted answer, which I agree with:
It really depends on your application querying model and the traffic
demands.
Using Redis/Hazelcast may yield the best performance since there won't
be any round-trip to DB anymore, but you end up having a normalized
data in DB and denormalized copy in your cache which will put pressure
on your cache update policies. So you gain the best performance at the
cost of implementing the cache update whenever the persisted data
changes.
Using 2nd level cache is easier to set up but it only stores
entities by id. There is also a query cache, storing ids returned by a
given query. So the 2nd level cache is a two-step process that you
need to fine tune to get the best performance. When you execute
projection queries the 2nd level object cache won't help you, since it
only operates on entity load. The main advantage of 2nd level cache is
that it's easier to keep it in sync whenever data changes, especially
if all your data is persisted by hibernate.
So, if you need ultimate
performance and you don't mind implementing your cache update logic
that ensures a minimum eventual consistency window, then go with an
external cache.
If you only need to cache entities (that usually don't change that
frequently) and you mostly access those through Hibernate entity
loading, then 2nd level cache can help you.
Hope it helps!
I am trying to integrate Spark/Kafka to build a streaming app.
Kafka version: 0.9
Spark: 1.6.2
How do I handle offsets after processing the data in an RDD batch?
Can you give me more insight into handling offsets?
Does Spark have a built-in way to store and read offsets automatically, or do I need to tell Spark to read offsets from some store like Mongo or Oracle?
JavaInputDStream<String> directKafkaStream = KafkaUtils.createDirectStream(
        jsc, String.class, String.class,
        StringDecoder.class, StringDecoder.class, String.class,
        kafkaParams, topicMap,
        (Function<MessageAndMetadata<String, String>, String>) MessageAndMetadata::message);
directKafkaStream.foreachRDD(rdd -> {
The answer to your question depends on your desired message delivery semantics:
at most once: each message will be processed at most once
at least once: each message will be processed at least once
exactly once: at most once and at least once at the same time
First of all, I would recommend reading those slides as well as this blog post.
I am assuming that you are pursuing exactly-once, since the remaining ones are pretty easy to figure out. Anyway, a couple of approaches to consider:
Checkpointing
Spark Streaming allows you to checkpoint your DStreams. If you use the direct stream from KafkaUtils, the offsets will be checkpointed as well. The streaming job might fail anywhere between checkpoints, so some messages might get replayed. To achieve exactly-once semantics with this approach, one would have to use an idempotent output operation (in other words, the downstream system must be able to distinguish and ignore replayed messages).
Pros: easy to achieve; comes out-of-the-box
Cons: at least once semantics; checkpoints become invalidated after code change; offsets are stored in Spark, not in Zookeeper
Transactional data storage
You might want to store the offsets yourself in a custom data store that supports transactions, e.g. a relational database like MySQL. In this case you need to make sure that processing the stream and saving the offsets are contained in a single transaction.
Pros: exactly once semantics
Cons: harder to set up, requires a transactional data store
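To illustrate the transactional approach, here is a minimal sketch in Python with SQLite standing in for the transactional store. It is not Spark code: the table names and the shape of `messages` (a list of `(offset, value)` pairs sorted by offset, for one topic partition) are assumptions for the example.

```python
import sqlite3


def create_schema(conn):
    conn.execute("CREATE TABLE results (value TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE offsets (topic TEXT, part INTEGER, next_offset INTEGER, "
        "PRIMARY KEY (topic, part))"
    )


def process_batch(conn, topic, partition, messages):
    """Store the batch output and the new offset in one transaction.

    messages: list of (offset, value) pairs, sorted by offset.
    Returns the number of messages actually processed.
    """
    row = conn.execute(
        "SELECT next_offset FROM offsets WHERE topic = ? AND part = ?", (topic, partition)
    ).fetchone()
    next_offset = row[0] if row else 0
    # Skip anything already covered by the stored offset (replayed messages).
    fresh = [(o, v) for o, v in messages if o >= next_offset]
    if not fresh:
        return 0
    with conn:  # one transaction: output rows plus the new offset
        conn.executemany("INSERT INTO results (value) VALUES (?)", [(v,) for _, v in fresh])
        conn.execute(
            "INSERT OR REPLACE INTO offsets (topic, part, next_offset) VALUES (?, ?, ?)",
            (topic, partition, fresh[-1][0] + 1),
        )
    return len(fresh)
```

Since the results and the offset either both commit or both roll back, a batch that is replayed after a failure is recognised by its offsets and not applied twice.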
WAL-based Receiver
You can use the older Kafka connector based on WAL.
Pros: works with other data sources as well; stores offsets in Zookeeper
Cons: it depends on HDFS; you cannot access offsets directly; it makes parallelism harder to achieve
To sum up, it all depends on your requirements - perhaps you can lift some restrictions to simplify this problem.
When you want to consume data from Kafka topic using Spark Streaming, there are two ways you can do that.
1. Receiver-based approach
In this approach, offsets are managed in Zookeeper and are updated there automatically. For more information:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-1-receiver-based-approach
2. Direct approach (no receiver)
This approach does not update offsets in Zookeeper, hence Zookeeper-based Kafka monitoring tools will not show progress. However, you can access the offsets processed by this approach in each batch and update Zookeeper yourself.
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
I'm designing an application that has to consume live data from several sources and periodically report on it. Consumed data will be added to an Ehcache cache and reports will query it. Once the live data is consumed it needs to be persisted for recovery purposes only. If the application restarts it will prime the cache with historical data from the DB before connecting to the live data sources (which queue new data).
I'm leaning toward implementing it as a cache-as-sor with JDBC caching:
1. Receive data from source
2. Persist to DB
3. Add to cache
4. Confirm receipt with source
with 2-4 wrapped in a JTA transaction.
I also looked into Hibernate with Ehcache as a 2nd level cache, but that doesn't seem appropriate.
I'm relatively new to Ehcache so would like some advice on the right design.
For persistence, rather than do a "cache-aside", you probably would want to configure your caches to use read-through and some cache writer (either write-through, or write-behind). You can read about these here: http://ehcache.org/documentation/user-guide/concepts#cache-as-sor
Now I'd avoid JTA, as I fear the overhead might be overkill (unless you really need XA transaction recovery), and rather opt for a fault-tolerant approach. If you opt for asynchronous persistence (write-behind), clustering your cache with Terracotta (the write-behind queue would automatically be persistent, recoverable and even HA if multiple nodes are available) is one approach to ensuring every element gets written out to the underlying SoR... All depending on your needs, I guess.
Ehcache would let you start with a single node, unclustered approach, simply using read- & write-through caches, that you could grow and fine tune to meet your SLA. As data grows, you'd then be able to move to clustered caches and asynchronous writers (should writes become the issues) or grow your cache sizes (if reads remain the issue). Obviously, you should measure (or at least know what the bottlenecks are you foresee) and choose accordingly. But putting a Cache in front of your RDBMS is a common and well understood pattern to scale read (and write) access to these "slower" stores...
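To show what read-through plus write-behind means in practice, here is a toy sketch in Python. It is not the Ehcache API; a dict stands in for the system of record and the class is invented for the example. Unlike a Terracotta-backed queue, the pending writes here live only in memory and would be lost on a crash.

```python
from collections import OrderedDict


class WriteBehindCache:
    """Toy read-through cache with a write-behind queue; a dict stands in for the SoR."""

    def __init__(self, store):
        self.store = store           # stand-in for the system of record
        self.cache = {}
        self.dirty = OrderedDict()   # pending writes, in first-write order

    def put(self, key, value):
        self.cache[key] = value
        self.dirty[key] = value      # repeated writes to one key are coalesced

    def get(self, key):
        if key not in self.cache and key in self.store:
            self.cache[key] = self.store[key]   # read-through on a miss
        return self.cache.get(key)

    def flush(self):
        """Write queued entries to the store; returns how many were written."""
        written = 0
        while self.dirty:
            key, value = self.dirty.popitem(last=False)
            self.store[key] = value
            written += 1
        return written
```

In a real deployment `flush` would run on a timer or background thread; calling it synchronously at the end of `put` would turn this into write-through.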
If you want to have data in a cache, Hibernate looks like overkill. All you need is JDBC, both to implement a cache loader for cache initialization and to save the data to a database periodically. Or just set up your cache to persist to disk.
Then Ehcache + Hibernate is not the solution. What you are describing here is an asynchronous event processing system in which one of the listeners awaits an "event processed successfully" signal before persisting.
NoSQL databases are a far better option in this case, unless you strictly need to rely on a relational database.
I am going through apache cassandra and working on sample data insertion, retrieving etc.
The documentation is very limited.
I am interested in knowing
1. Can we completely replace a relational DB like MySQL/Oracle with Cassandra?
2. Does Cassandra support rollback/commit?
3. Do Cassandra clients (Thrift/Hector) support fetching associated objects (objects where we save one super column's key in another super column family)?
This will help me a lot to proceed further.
thank you in advance.
Short answer: No.
By design, Cassandra values availability and partition tolerance over consistency. Basically, it's not possible to get acceptable latency while maintaining all three qualities: one has to be sacrificed. This is called the CAP theorem.
The amount of consistency is configurable in Cassandra using consistency levels, but there doesn't exist any semantics for rollback. There's no guarantee that you'll be able to roll back your changes even if the first write succeeds.
If you want to build application with transactions or locks on top of Cassandra, you probably want to look at Zookeeper, which can be used to provide distributed synchronization.
You might've already guessed this, but Cassandra doesn't have foreign keys or anything like that. This has to be handled manually. I'm not that familiar with Hector, but a higher-level client could be able to do this semi-automatically.
Whether or not you can use Cassandra to easily replace a RDBMS depends on your specific use case. In your use case (based on your questions), it might be hard to do so.
In version 2.x you can combine CQL statements in a logged batch, which is atomic: either all of the statements succeed or none do. You can also read about lightweight transactions.
More than that, there are several persistence managers for Cassandra, for example Achilles and Kundera. With them you can achieve foreign-key behavior at the client level.
If Zookeeper is able to handle transactions of Oracle quality, then it's a done deal. Relations and relational integrity are no problem to implement on top of ANY database. A foreign key is just another data field. ACID/transactions are the key issue.
Instead of commit and rollback, you must use a batch.
A batch is atomic: either all of the records across multiple tables are written, or none are.
for example :
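var batch = new BatchStatement();
// Prepare returns a PreparedStatement; Bind turns it into a statement the batch accepts.
var prepared = session.Prepare(stringCommand);
batch.Add(prepared.Bind(/* parameter values, if any */));
// ExecuteAsync returns a Task<RowSet>; await it to observe success or failure.
var result = session.ExecuteAsync(batch);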
Of course you can, but it depends completely on your use case. If you don't pick the right DB for your use case, you will have to take care of a lot of things on your own. For example, an RDBMS does not provide geographical distribution, so you need to find a way to do it yourself. In Cassandra, you lack some ACID properties under some conditions, and you need to handle those properties on the application side.
Yes, but limited to certain use cases. You can use the batch feature. It supports rollback, but you lack isolation. I am not sure this feature exists in OSS Cassandra. For more info look
I don't understand what you mean by super column. If you are asking about finding an id in another table's columns, yes, you can do it, why not. But I definitely do not understand what you mean by super column.
Overall, Cassandra is not ACID compliant, but there are some features, like batches and lightweight transactions, that help you be ACID compliant under some conditions.