I have a Kafka topic and a Spark application. The Spark application reads data from the Kafka topic, pre-aggregates it, and stores it in Elasticsearch. Sounds simple, right?
Everything works fine as expected, but the minute I set the "spark.cores" property to something other than 1, I start getting
version conflict, current version [2] is different than the one provided [1]
After researching a bit, I think the error occurs because multiple cores can hold the same document at the same time, and thus, when one core finishes its part of the aggregation and tries to write back to the document, it gets this error.
TBH, I am a bit surprised by this behaviour, because I thought Spark and ES would handle it on their own. This leads me to believe that maybe there is something wrong with my approach.
How can I fix this? Is there some sort of "synchronized" or "lock" concept that I need to apply?
Cheers!
It sounds like you have several messages in the queue that all update the same ES document, and these messages are being processed concurrently. There are two possible solutions:
First, you can use Kafka partitions to ensure that all the messages that update the same ES document are handled in sequence. This assumes there's some property in your message that Kafka can use to determine how messages map to ES documents.
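For instance, a minimal sketch with the plain Java Kafka producer (the topic name and the documentId/payload variables are made up): keying each message by the target ES document id makes Kafka route all updates for that document to the same partition, where they are consumed in order.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
// messages with the same key always land in the same partition, so all
// updates for one ES document are consumed in order by a single consumer
producer.send(new ProducerRecord<>("events", documentId, payload));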
The other way is the standard way of handling optimistic concurrency conflicts: retry the transaction. If you have some data from a Kafka message that you need to add to an ES document, and the current document in ES is version 1, then you can try to update it and save back version 2. But if someone else already wrote version 2, you can retry by using version 2 as a starting point, adding your new data, and saving version 3.
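As a sketch with the Elasticsearch 7.x Java REST high-level client, where optimistic locking uses seq_no/primary_term rather than the old version field (the index name, the docId variable, and the counter field are all made up, and client is assumed to be a connected RestHighLevelClient):

import java.util.Map;
import org.elasticsearch.ElasticsearchStatusException;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.rest.RestStatus;

while (true) {
    GetResponse current = client.get(new GetRequest("counters", docId), RequestOptions.DEFAULT);
    Map<String, Object> source = current.getSourceAsMap();
    source.put("count", ((Number) source.get("count")).intValue() + 1); // apply your change
    IndexRequest write = new IndexRequest("counters").id(docId)
            .source(source)
            .setIfSeqNo(current.getSeqNo())              // fail if someone wrote in between
            .setIfPrimaryTerm(current.getPrimaryTerm());
    try {
        client.index(write, RequestOptions.DEFAULT);
        break;                                           // no conflict: we are done
    } catch (ElasticsearchStatusException e) {
        if (e.status() != RestStatus.CONFLICT) throw e;  // retry only on version conflicts
    }
}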
If either of these approaches destroys the concurrency you were expecting to get from Kafka and Spark, then you may need to rethink your approach. You may have to introduce a new processing stage that does some heavy lifting but doesn’t actually write to ES, then do the ES updates in a separate step.
I would like to answer my own question. In my use case, I was updating a document counter. So all I had to do was retry whenever a conflict arose, because I just needed to aggregate my counter.
My use case was essentially the one described here:
For many uses of partial update, it doesn’t matter that a document has been changed. For instance, if two processes are both incrementing the page-view counter, it doesn’t matter in which order it happens; if a conflict occurs, the only thing we need to do is reattempt the update.
This can be done automatically by setting the retry_on_conflict parameter to the number of times that update should retry before failing; it defaults to 0.
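With the Java REST high-level client, that parameter looks like this (index, id, and script are illustrative, and client is assumed to be a connected RestHighLevelClient):

import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.script.Script;

UpdateRequest request = new UpdateRequest("pageviews", "page-1")
        .script(new Script("ctx._source.views += 1")) // order-independent increment
        .retryOnConflict(5);                          // re-read and re-apply up to 5 times
client.update(request, RequestOptions.DEFAULT);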
Thanks to Willis and this blog, I was able to configure the Elasticsearch settings, and now I am not having any problems at all.
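For reference, when writing from Spark through the elasticsearch-hadoop connector, the equivalent knob is the es.update.retry.on.conflict setting; a sketch (the values are illustrative):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        .set("es.write.operation", "upsert")          // update existing documents
        .set("es.update.retry.on.conflict", "3");     // retry conflicting updates up to 3 times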
Below are my assumptions/queries. Please point out anything wrong in my understanding.
From reading the documentation, I understood the following:
1. ZooKeeper writes go to the leader, and they are replicated to the followers. A read request can be served by a follower itself, and hence reads can be stale. Why can't we use ZooKeeper as a cache system?
2. Since a write request is always made to (or redirected to) the leader, node creation is consistent. When two clients send a write request for the same node name, one of them will ALWAYS get an error (NodeExistsException).
3. If the above is true, can we use ZooKeeper to keep track of duplicate requests by creating a znode with the requestId? (See the sketch after this list.)
4. For generating a sequence number in a distributed system, we can use sequential node creation.
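For what it's worth, here is a sketch of what points 3 and 4 could look like with Apache Curator (the connection string, paths, and the requestId variable are made up):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;

CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
client.start();

// point 3, deduplication: the first creator wins, everyone else gets NodeExistsException
try {
    client.create().creatingParentsIfNeeded().forPath("/requests/" + requestId);
    // we own this request: process the refund
} catch (KeeperException.NodeExistsException e) {
    // someone already created the znode: duplicate request, skip it
}

// point 4, sequence numbers: ZooKeeper appends a monotonically increasing suffix
String path = client.create()
        .withMode(CreateMode.PERSISTENT_SEQUENTIAL)
        .forPath("/sequence/id-");
// path is e.g. /sequence/id-0000000042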
Based on what information is available in the question and the comments, it appears that the basic question is:
In a stateless multi server architecture, how best to prevent data duplication, here the data is "has this refund been processed?"
This qualifies as "primarily opinion based". There are multiple ways to do this and no one way is the best. You can do it with MySQL and you can do it with Zookeeper.
Now comes pure opinion and speculation:
To process a refund, there must be some database somewhere? Why not just check against it? The duplicate-request scenario that you are preparing against seems like a rare occurrence; this won't be happening a hundred times per second. If so, this scenario does not warrant a high-performance implementation. A simple database lookup should be fine.
Your workload seems to be a 1:1 ratio of read:write. Every time a refund is processed, you check whether it has already been processed, and if not, you process it and make an entry for it. Now, ZooKeeper itself says it works best for something like a 10:1 ratio of read:write. While no such metric is available for MySQL, it does not need to make certain* guarantees that ZooKeeper makes for write activity. Hence, I expect it to be better for write-intensive loads. (* Guarantees like sequentiality, broadcast, consensus, etc.)
Just a nitpick, but your data is a linear list of hundreds (thousands? millions?) of transaction ids. This is exactly what MySQL (or any database) and its primary key are built for. ZooKeeper is made for more complex/powerful hierarchical data, which you do not need.
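As a sketch of that idea with plain JDBC (table and column names are made up): insert the transaction id into a table whose primary key is the id, and treat a duplicate-key error as "already processed".

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

// assumes: CREATE TABLE processed_refunds (refund_id VARCHAR(64) PRIMARY KEY)
try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
     PreparedStatement ps = conn.prepareStatement(
             "INSERT INTO processed_refunds (refund_id) VALUES (?)")) {
    ps.setString(1, refundId);
    ps.executeUpdate();      // success: first time we've seen this refund
} catch (SQLIntegrityConstraintViolationException e) {
    // primary-key collision: the refund was already processed, skip it
}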
I have no experience with either Flink or Spark, and I would like to use one of them for my use case. I'd like to present my use case and hopefully get some insight into whether it can be done with either, and if both can do it, which one would work best.
I have a bunch of entities A stored in a data store (Mongo, to be precise, but it doesn't really matter). I have a Java application that can load these entities and run some logic on them to generate a Stream of some data type E (to be 100% clear, I don't have the Es in any data set; I need to generate them in Java after I load the As from the DB).
So I have something like this
A1 -> Stream<E>
A2 -> Stream<E>
...
An -> Stream<E>
The data type E is a bit like a long row in Excel: it has a bunch of columns. I need to collect all the Es and run some sort of pivot aggregation, like you would do in Excel. I can see how I could do that easily in either Spark or Flink.
Now comes the part I cannot figure out.
Imagine that one of the entities, A1, is changed (by a user or a process); that means all the Es for A1 need updating. Of course I could reload all my As, recompute all the Es, and then re-run the whole aggregation, but I'm wondering if it's possible to be a bit more clever here.
Would it be possible to only recompute the Es for A1 and do the minimum amount of processing?
For Spark, would it be possible to persist the RDD and only update part of it when needed (here, that would be the Es for A1)?
For Flink, in the case of streaming, is it possible to update data points that have already been processed? Can it handle that sort of case? Or could I perhaps generate negative events for A1's old Es (i.e., events that would remove them from the result) and then add the new ones?
Is that a common use case? Is that even something that Flink or Spark are designed to do? I would think so but again I haven't used either so my understanding is very limited.
I think your question is very broad and depends on many conditions. In Flink you could keep a MapState<A, E>, only update the values for the changed As, and then, depending on your use case, either emit the updated Es downstream or emit the difference (a retraction stream).
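A rough sketch of that pattern, where A and E are your own types and computeEs, id(), and asRetraction are stand-ins for your own logic:

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// keyed by the id of A, so each A keeps its own isolated state
public class RecomputeOnChange extends KeyedProcessFunction<String, A, E> {
    private transient MapState<String, E> previous; // the Es last emitted for this A

    @Override
    public void open(Configuration parameters) throws Exception {
        previous = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("previous-es", String.class, E.class));
    }

    @Override
    public void processElement(A changed, Context ctx, Collector<E> out) throws Exception {
        for (E old : previous.values()) {
            out.collect(old.asRetraction());   // undo the old E downstream
        }
        previous.clear();
        for (E fresh : computeEs(changed)) {   // your existing generation logic
            previous.put(fresh.id(), fresh);
            out.collect(fresh);                // emit the recomputed E
        }
    }
}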
Flink also has the concepts of Dynamic Tables and retraction streams that may inspire you, or maybe the Table API already covers your use case. You can check out the docs here.
There doesn't seem to be any direct way to know the number of affected rows for update and delete statements in Cassandra.
For example if I have a query like this:
DELETE FROM xyztable WHERE PKEY IN (1,2,3,4,5,6);
Now, of course, since I've passed 6 keys, it is obvious that 6 rows will be affected.
But, as in the RDBMS world, is there any way to know the number of affected rows for update/delete statements in the DataStax driver?
I've read here that Cassandra gives no feedback on write operations.
Beyond that, I could not find any other discussion on this topic through Google.
If that's not possible, can I be sure that with the type of query given above, it will either delete all or fail to delete all?
In the eventually consistent world, you can look at these operations as saving a delete request and, depending on the requested consistency level, waiting for confirmation from several nodes that this request has been accepted. The request is then delivered to the other nodes asynchronously.
Since there is no dependency on anything like foreign keys, then nothing should stop data from being deleted if the request was successfully accepted by the cluster.
However, there are a lot of ifs. For example, deleting data with consistency level ONE, successfully accepted by one node, followed by an immediate hard failure of that node, may result in the loss of that delete if it was not replicated before the failure.
Another example: during the deletion, one node was down and stayed down for a significant amount of time, longer than gc_grace_seconds, i.e., longer than tombstones are kept before being removed along with the deleted data. If this node then recovers, all the data that was deleted from the rest of the cluster, but not from this node, will suddenly be brought back to the cluster.
So in order to avoid these situations and consider operations successful and final, a Cassandra admin needs to implement some measures, including regular repair jobs (to make sure all nodes are up to date). Applications also need to decide what is better: faster performance with consistency level ONE at the expense of possible data loss, or lower performance with higher consistency levels but less possibility of data loss.
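For illustration, with the DataStax Java driver 3.x that choice is made per statement (the session variable is assumed to be already connected):

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

Statement delete = new SimpleStatement("DELETE FROM xyztable WHERE pkey = 1")
        .setConsistencyLevel(ConsistencyLevel.QUORUM); // slower, but survives a single node failure
session.execute(delete);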
There is no way to do this in Cassandra because the model for writes, deletes, and updates in Cassandra is basically the same. In all of those cases a cell is added to the table which has either the new information or information about the delete. This is done without any inspection of the current DB state.
Without checking the rest of the replicas and doing a full merge on the row, there is no way to tell whether any operation will actually affect the current read state of the database.
This leads to the oft-cited anti-pattern of "read before write." In Cassandra you are meant to write as fast as possible, and if you need history, use a data structure that preserves a log of modifications rather than just the current state.
There is one option for doing queries like this: the compare-and-set (CAS) syntax of lightweight transactions (IF clauses on writes), but this is a very expensive operation compared to a normal write and should be used sparingly.
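For completeness, a sketch of such a lightweight transaction with the DataStax Java driver 3.x; wasApplied() tells you whether the conditional delete actually found a live row (the keyspace name is made up):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("your_keyspace");
ResultSet rs = session.execute("DELETE FROM xyztable WHERE pkey = 1 IF EXISTS");
System.out.println("row existed and was deleted: " + rs.wasApplied());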
During localhost development, the IDs generated by GAE start with 1.
However, in a real GAE deployment in the cloud, the IDs generated even for the first entities are quite long, like 5639412304721232. Is there a workaround to make the first entities start with 1, 2, 3, and so on?
One might suggest using sharded counters, and yes, I've used those; however, some suggest that sharded counters should not be used for this, since the app might get the same count twice, as they are eventually consistent.
In this case what could be the best solution?
The official post explaining the switch from sequential to 'scattered' ids is here.
The instructions for reverting to sequential behaviour are here, but note the warning that this option will eventually be removed.
The 'best' solution depends on what you need and why. You'll get better datastore performance with scattered ids, but honestly, you might not notice much difference if your app gets a small number of requests and makes light use of the datastore. If that's the case, you can roll your own sequential ids based on a simple entity with a property that holds the current high-watermark id, and rely on having a low transaction rate to keep you from running into limits on the number of transactions per entity.
Reliably handing out sequential ids without gaps in a distributed system is challenging.
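A minimal sketch of that high-watermark approach with the GAE Java low-level datastore API (kind and property names are made up); the transaction makes concurrent increments safe, at the cost of write throughput on the counter entity:

import com.google.appengine.api.datastore.*;

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Key key = KeyFactory.createKey("Counter", "id-sequence");
Transaction txn = ds.beginTransaction();
try {
    Entity counter;
    try {
        counter = ds.get(txn, key);
    } catch (EntityNotFoundException e) {
        counter = new Entity(key);          // first run: start the sequence at 0
        counter.setProperty("value", 0L);
    }
    long next = (Long) counter.getProperty("value") + 1;
    counter.setProperty("value", next);     // next is your new sequential id
    ds.put(txn, counter);
    txn.commit();
} finally {
    if (txn.isActive()) txn.rollback();
}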
Be aware that you may run into problems if you create a lot of entities very quickly with sequential Long IDs. This post explains why.
In theory there's a choice of auto ID generation policies, with scattered IDs being the default since 1.8.1, but the old monotonically increasing legacy policy is to be deprecated for the reasons discussed in the linked post.
If you're using a sharded counter, you will avoid this but, as you say, you may encounter other issues.
You might try using allocate_ids. We use this to get smaller integer values for system-generated ids. In Python, using a db kind:
# reserve one id for 'your_kind_name'; allocate_ids returns a (start, end) range
model_key = db.Key.from_path('your_kind_name', 1)
key_batch = db.allocate_ids(model_key, 1)
id_new = key_batch[0]  # the first (and here, only) id in the allocated range
idkey = db.Key.from_path('your_kind_name', id_new)  # a key with the small numeric id
I would assign the key's identifier as the strings "1", "2", "3", and so on, generating them from a sequencer. You can check whether the entity already exists with the get_or_insert() function.
Similarly, you can use the auto-increment solution by storing the sequence number in an entity.
spring-batch newbie: I have a series of batches that
read all new records (since the last execution) from some sql tables
upload all the new records to hadoop
run a series of map-reduce (pig) jobs on all the data (old and new)
download all the output to local and run some other local processing on all the output
The point is, I don't have any obvious "item". I don't want to deal with the specific lines of text in my data; I work with all of it as one big chunk and don't want any commit intervals and such...
However, I do want to keep all these steps loosely coupled, as in: steps a+b+c might succeed for several days and accumulate processed data while step d keeps failing, and then when it finally succeeds it will read and process all the output of its previous steps.
So: is my "item" a fictive "work item" that signifies the entire batch of new data? Do I maintain a series of queues myself and pass these fictive work items between them?
thanks!
People always assume that the only use of Spring Batch is chunk processing. That is a huge feature, but what's overlooked is the visibility of the processing and the job control.
Give 5 people the same task without Spring Batch and they're going to implement flow control and visibility their own way. Give 5 people the same task with Spring Batch and you may end up with custom tasklets all done differently, but getting access to the job metadata and starting and stopping jobs is going to be consistent. From my perspective, it's a great tool for job management. If you already have your jobs written, you can implement them as custom tasklets if you don't want to rewrite them to conform to the "item" paradigm; you'll still see the benefits.
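For example, an existing job can be dropped into a step as a custom tasklet with very little ceremony (uploadNewRecords is a stand-in for your current code):

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class UploadToHadoopTasklet implements Tasklet {
    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        uploadNewRecords();            // your existing upload logic, unchanged
        return RepeatStatus.FINISHED;  // the step outcome is recorded in the job metadata
    }
}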
I don't see the problem. Your scenario seems like a classic application of Spring Batch to me.
read all new records (since the last execution) from some sql tables
Here, an item is a record
upload all the new records to hadoop
Same here
run a series of map-reduce (pig) jobs on all the data (old and new)
Sounds like a StepListener or ChunkListener
download all the output to local and run some other local processing on all the output
That's the next step.
The only problem I see is if you don't have Domain Objects for your records. But even then, you can work with maps or arrays, while still using ItemReaders and ItemWriters.
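For instance, a reader that produces plain maps instead of domain objects might look like this sketch (the table, query, and dataSource variable are made up):

import java.util.Map;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.jdbc.core.ColumnMapRowMapper;

JdbcCursorItemReader<Map<String, Object>> reader = new JdbcCursorItemReader<>();
reader.setDataSource(dataSource);                        // your configured DataSource
reader.setSql("SELECT * FROM records WHERE processed = 0");
reader.setRowMapper(new ColumnMapRowMapper());           // each row becomes a Map<String, Object>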