I'm no expert in databases, so what I know about queries is that they are the way to read from or write to a database.
With eventual consistency, a read may return stale data.
With a write query, the first data node is updated, but the other nodes need some time before they are updated too.
With strong consistency, a read is blocked until the data has been updated to its latest version (I'm really not sure about this part, so please correct me if I got it wrong).
With a write query, all read operations are blocked until the data node has been updated to its latest version.
So if I write data with eventual consistency and then use an ancestor query to read it, will I get the latest version?
If I use an ancestor query to update, will all eventually consistent read operations get the latest version?
Update
I think transactions exist so that if there are multiple modification requests for the same data, one will succeed and the others will fail. After that, the modified data will take some time to be replicated to all data centers, so a successful transaction does not mean that all read queries will return the latest version (correct me if I'm wrong).
If you use what you call an "ancestor query", you're working in a transaction: either the transaction terminates successfully, in which case all subsequent reads will get the values as updated by the transaction, or else the transaction fails, in which case none of the changes made by the transaction will be seen (this all-or-nothing property is often referred to as a transaction being "atomic"). In particular, you do get strong consistency this way, not just eventual consistency.
The cost can be large, in terms of performance and scalability. In particular, an application should not update an entity group (any and all entities descending from a common ancestor) more than once a second, which can be a very constraining limit for a highly scalable application.
The online docs include a large variety of tips, tricks and advice on how to deal with this -- you could start at https://cloud.google.com/datastore/docs/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore/ and continue with the "additional resources" this article lists at the end.
One simple idea that often suffices: differently from queries, getting a specific entity by its key is strongly consistent without needing transactions, and memcache is also strongly consistent. Writing a modified entity gives you its key, so you can stash that key into memcache and have other parts of your code fetch the modified entity by that key rather than relying on queries. This has limits, of course, because memcache doesn't give you unbounded space -- but it's a useful idea to keep in mind, nevertheless, in many practical cases.
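To make that last point concrete, here is a minimal sketch using the GAE Java low-level Datastore and Memcache APIs; the "Order" entities and the "latest-order" memcache key are made-up names for illustration, not anything from the question:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class LatestOrderAccess {
    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    private final MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();

    // After a write, remember the key of the freshly written entity under a well-known name.
    public Key saveOrder(Entity order) {
        Key key = datastore.put(order);                            // put() returns the entity's key
        memcache.put("latest-order", KeyFactory.keyToString(key));
        return key;
    }

    // Elsewhere, fetch by key instead of querying: get-by-key is strongly consistent.
    public Entity readLatestOrder() throws EntityNotFoundException {
        String encodedKey = (String) memcache.get("latest-order");
        if (encodedKey == null) {
            return null;  // treat as a miss, or fall back to a (possibly stale) query
        }
        return datastore.get(KeyFactory.stringToKey(encodedKey));
    }
}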
With GAE the only way to be consistent is to use a transaction; inside a transaction you can update and then query the last update, but it's slower.
For me, using ancestors just composes the primary key, and that's all.
For my website, I'm creating a book database. I have a catalog with a root node; each node has subnodes, each subnode has documents, each document has versions, and each version is made of several paragraphs.
In order to create this database as fast as possible, I first build the entire tree model in memory and then call session.save(rootNode).
This single save populates my entire database (when I do a mysqldump on the database at the end, it weighs 1 GB).
The save costs a lot (more than an hour), and since the database grows with new books and new versions of existing books, it costs more and more. I would like to optimize this save.
I've tried to increase the batch_size, but it changes nothing since it's a single save. When I mysqldump the database and insert the dump back into MySQL, the operation takes 2 minutes or less.
And when I run htop on the Ubuntu machine, I can see that MySQL is only using 2 or 3% CPU, which means that it's Hibernate that's slow.
If someone could give me possible techniques that I could try, or possible leads, that would be great. I already know some of the reasons why it takes time; if someone wants to discuss them with me, thanks for the help.
Here are some of my problems (I think). For example, I have self-assigned ids for most of my entities. Because of that, Hibernate checks each time whether the row exists before it saves it. I don't need this because the batch I'm executing runs only once, when I create the database from scratch. The best would be to tell Hibernate to ignore the primary key checks (like mysqldump does) and re-enable key checking once the database has been created. It's just a one-shot batch to initialize my database.
The second problem is again about foreign keys. Hibernate inserts rows with null values, then issues an update to make the foreign keys work.
About using another technology: I would like to make this batch work with Hibernate because the rest of my website works very well with Hibernate, and if it's Hibernate that creates the database, I'm sure the naming rules and all the foreign keys will be created correctly.
Finally, it's a read-only database. (I have a user database using InnoDB, where I do updates and inserts while my website is running, but the document database is read-only and MyISAM.)
Here is an example of what I'm doing:
TreeNode rootNode = new TreeNode();
recursiveLoadSubNodes(rootNode); // This method creates my big tree, in memory only.
hibernateSession.beginTransaction();
hibernateSession.save(rootNode); // takes more than an hour; saves 1 GB of data: hundreds of sub tree nodes, thousands of documents, tens of thousands of paragraphs.
hibernateSession.getTransaction().commit();
It's a little hard to guess what could be the problem here but I could think of 3 things:
Increasing batch_size alone might not help because - depending on your model - inserts might be interleaved (i.e. A B A B ...). You can allow Hibernate to reorder inserts and updates so that they can be batched (i.e. A A ... B B ...). Depending on your model this might not work because the inserts might not be batchable. The necessary properties are hibernate.order_inserts and hibernate.order_updates; a blog post that describes the situation can be found here: https://vladmihalcea.com/how-to-batch-insert-and-update-statements-with-hibernate/
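As a rough sketch (my guess at a setup, not the poster's actual configuration), those properties plus a JDBC batch size could be set like this when building the SessionFactory programmatically; the batch size of 50 is just a starting point to tune:

import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

public class BatchingSessionFactory {
    public static SessionFactory build() {
        Configuration cfg = new Configuration().configure(); // reads the existing hibernate.cfg.xml
        cfg.setProperty("hibernate.jdbc.batch_size", "50");  // enable JDBC batching
        cfg.setProperty("hibernate.order_inserts", "true");  // group inserts per entity type
        cfg.setProperty("hibernate.order_updates", "true");  // same for updates
        return cfg.buildSessionFactory();
    }
}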
If the entities don't already exist (which seems to be the case) then the problem might be the first-level cache. This cache will cause Hibernate to get slower and slower, because each time it wants to flush changes it will check all entries in the cache by iterating over them and calling equals() (or something similar). As you can see, that will take longer with each new entity that's created. To fix that you could either try to disable the first-level cache (I'd have to look up whether that's possible for write operations and how it's done - or you do that :) ) or try to keep the cache small, e.g. by inserting the books yourself and evicting each book from the first-level cache after the insert (you could also go deeper and do that on the document or paragraph level).
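A sketch of the "keep the cache small" option, assuming a hypothetical per-book loop (the Book entity and its cascades are invented for illustration): flush and clear the Session periodically so it never dirty-checks more than a small window of entities.

int count = 0;
for (Book book : books) {                 // hypothetical: save book by book instead of one big save(rootNode)
    hibernateSession.save(book);          // cascades to documents/versions/paragraphs as mapped
    if (++count % 50 == 0) {              // same order of magnitude as the JDBC batch size
        hibernateSession.flush();         // push the pending inserts to the database
        hibernateSession.clear();         // drop them from the first-level cache
    }
}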
It might not actually be Hibernate (or at least not alone) but your DB as well. Note that restoring dumps often removes/disables constraint checks and indices along with other optimizations so comparing that with Hibernate isn't that useful. What you'd need to do is create a bunch of insert statements and then just execute those - ideally via a JDBC batch - on an empty database but with all constraints and indices enabled. That would provide a more accurate benchmark.
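For the benchmark described above, a bare-bones JDBC batch insert might look like the following; the connection URL, table, columns and Paragraph accessors are placeholders, not the real schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/books", "user", "password");
     PreparedStatement ps = con.prepareStatement("INSERT INTO paragraph (id, document_id, content) VALUES (?, ?, ?)")) {
    con.setAutoCommit(false);
    for (Paragraph p : paragraphs) {      // the same in-memory data the Hibernate version would save
        ps.setLong(1, p.getId());
        ps.setLong(2, p.getDocumentId());
        ps.setString(3, p.getContent());
        ps.addBatch();
    }
    ps.executeBatch();                    // one round-trip for the whole batch (chunk it for very large sets)
    con.commit();
}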
Assuming that comparison shows that the plain SQL insert isn't that much faster then you could decide to either keep what you have so far or refactor your batch insert to temporarily disable (or remove and re-create) constraints and indices.
Alternatively you could try not to use Hibernate at all or change your model - if that's possible given your requirements which I don't know. That means you could try to generate and execute the SQL queries yourself, use a NoSQL database or NoSQL storage in a SQL database that supports it - like Postgres.
We're doing something similar, i.e. we have Hibernate entities that contain some complex data which is stored in a JSONB column. Hibernate can read and write that column via a custom usertype but it can't filter (Postgres would support that but we didn't manage to enable the necessary syntax in Hibernate).
Some of the RDBMS tables have millions of records and some have a few thousand. I am already caching those records in Ehcache. Say I have millions of customers already cached in Ehcache from a DB table. Now I have to search/filter customers on multiple attributes, which are decided at run time.
One approach is to apply the filtering on the cached data. The good thing here is that I save I/O calls, which are costly; the bad thing is that I need to do the filtering in the application (Java).
The second approach is to fetch the data from the DB using a DB index. The good thing is that I can use the DB index, which eliminates scanning through all records; the bad thing is that I need to make I/O calls.
Which approach is better performance-wise?
One approach is to apply the filtering on the cached data. The good thing here is that I save I/O calls, which are costly; the bad thing is that I need to do the filtering in the application (Java).
You cannot be sure that your cache contains all the data, and that it is consistent. Keeping your cache in sync with the database, possibly honoring transactions, leads you to many other problems.
If we are talking about a read-only, analytical use case and the data fits completely in memory, you can load everything into appropriate data structures (HashMap, tree, etc.). Then you don't need a cache.
Filtering on cached data typically means a sequential scan through the data. This might not be very fast. Some caches provide indexing, but then you are locked in to very vendor-specific extensions.
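As a toy sketch of option one, the application-side filter is essentially a full scan over whatever sits in the cache; the Customer type, its getters and the concrete criteria below are invented for illustration:

import java.util.List;
import java.util.stream.Collectors;

List<Customer> matches = cachedCustomers.stream()      // the list previously loaded from Ehcache
        .filter(c -> "DE".equals(c.getCountry()))       // runtime criteria, hard-coded here
        .filter(c -> c.getOrdersCount() > 10)
        .collect(Collectors.toList());                  // every cached customer is visited once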
The second approach is to fetch the data from the DB using a DB index. The good thing is that I can use the DB index, which eliminates scanning through all records; the bad thing is that I need to make I/O calls.
If not all your data is in the cache, you need to do a DB request anyway, and the DB needs to do the index access anyway, too. A database query can also return just the IDs, so you can save the redundant transfer of the row data. Consistency may be an issue here.
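A rough sketch of that hybrid, assuming an Ehcache 3 Cache<Long, Customer> and a hypothetical index on customer.country; the SQL, table and names are illustrative only:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import org.ehcache.Cache;

List<Customer> findByCountry(Connection con, Cache<Long, Customer> cache, String country) throws Exception {
    List<Customer> result = new ArrayList<>();
    try (PreparedStatement ps = con.prepareStatement("SELECT id FROM customer WHERE country = ?")) { // the index does the filtering
        ps.setString(1, country);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                Customer c = cache.get(rs.getLong(1));   // resolve the row locally instead of transferring it
                if (c != null) {                         // on a cache miss you would fall back to the DB
                    result.add(c);
                }
            }
        }
    }
    return result;
}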
Which approach is better performance-wise?
Also keep in mind that there is also your personal performance as a programmer. Building overly complex solutions will not make you happy or look good in the long run.
What you need to do depends on the cost of database I/O and your problem domain.
Is it possible to execute a batch of SELECT statements using DSE Cassandra, or should I consider a design change?
The reason is that I have a lot of SELECT queries I wish to execute against my DB cluster and I'm not sure how to go about it. I have deleted all my secondary indexes, so I'm not using those anymore.
That won't work, and even if it did, it wouldn't be advisable.
You won't receive the results in a way that you can use; there is no result set.
Even if that worked, the batch query would be much less performant than doing the queries serially, due to the way Cassandra batching is implemented.
Batching only works well if the keys (write executions) are distributed evenly, and it is only worth it if you want to apply all the updates as a unit.
So, in summary, you should definitely consider a design change.
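For reference, "doing them serially" with the DataStax Java driver (3.x) might look roughly like the sketch below; the keyspace, table, columns and idsToLookUp are placeholders, and the loop could equally be turned into executeAsync calls:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
     Session session = cluster.connect("my_keyspace")) {
    PreparedStatement ps = session.prepare("SELECT name FROM items WHERE id = ?");
    for (long id : idsToLookUp) {                         // one partition-key lookup per query
        Row row = session.execute(ps.bind(id)).one();
        if (row != null) {
            System.out.println(row.getString("name"));
        }
    }
}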
There doesn't seem to be any direct way to know the number of affected rows in Cassandra for UPDATE and DELETE statements.
For example if I have a query like this:
DELETE FROM xyztable WHERE PKEY IN (1,2,3,4,5,6);
Now, of course, since I've passed 6 keys, it is obvious that 6 rows will be affected.
But, as in the RDBMS world, is there any way to know the number of affected rows for update/delete statements with the DataStax driver?
I've read here that Cassandra gives no feedback on write operations, but apart from that I could not find any other discussion on this topic through Google.
If that's not possible, can I be sure that with the type of query given above, it will either delete them all or fail to delete them all?
In the eventually consistent world you can look at these operations as saving a delete request and, depending on the requested consistency level, waiting for confirmation from several nodes that this request has been accepted. The request is then delivered to the other nodes asynchronously.
Since there is no dependency on anything like foreign keys, then nothing should stop data from being deleted if the request was successfully accepted by the cluster.
However, there are a lot of ifs. For example, deleting data with a consistency level one, successfully accepted by one node, followed by an immediate node hard failure may result in the loss of that delete if it was not replicated before the failure.
Another example: during the deletion, one node was down and stayed down for a significant amount of time, longer than gc_grace_seconds, i.e., longer than is required for the tombstones to be removed along with the deleted data. If this node is then recovered, all the data that has been deleted from the rest of the cluster, but not from this node, will suddenly be brought back into the cluster.
So in order to avoid these situations and to consider operations successful and final, a Cassandra admin needs to implement some measures, including regular repair jobs (to make sure all nodes are up to date). Applications also need to decide what is better: faster performance with consistency level ONE at the expense of possible data loss, or lower performance with higher consistency levels but less possibility of data loss.
There is no way to do this in Cassandra because the model for writes, deletes, and updates in Cassandra is basically the same. In all of those cases a cell is added to the table which has either the new information or information about the delete. This is done without any inspection of the current DB state.
Without checking the rest of the replicas and doing a full merge on the row, there is no way to tell whether any operation will actually affect the current read state of the database.
This leads to the oft-cited anti-pattern of "reading before a write." In Cassandra you are meant to write as fast as possible, and if you need history, use a data structure which preserves a log of modifications rather than just the current state.
There is one option for doing queries like this, using the CAS (lightweight transaction) syntax of "IF a value matches, then do the other thing", but this is a very expensive operation compared to a normal write and should be used sparingly.
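As a small illustration of that route (my example, using the DataStax Java driver): a lightweight transaction reports via wasApplied() whether the conditional write actually took effect, which is the closest Cassandra gets to an "affected rows" signal.

import com.datastax.driver.core.ResultSet;

// 'session' is an already-connected com.datastax.driver.core.Session; table and key are placeholders.
ResultSet rs = session.execute("DELETE FROM xyztable WHERE pkey = 1 IF EXISTS");
boolean deleted = rs.wasApplied();   // true only if a row with pkey = 1 existed and was removed
System.out.println(deleted ? "row deleted" : "nothing to delete");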
Modern databases provide caching support. Most ORM frameworks cache retrieved data too. Why is this duplication necessary?
Because to get the data from the database's cache, you still have to:
Generate the SQL from the ORM's "native" query format
Do a network round-trip to the database server
Parse the SQL
Fetch the data from the cache
Serialise the data to the database's over-the-wire format
Deserialize the data into the database client library's format
Convert the database client library's format into language-level objects (i.e. a collection of whatevers)
By caching at the application level, you don't have to do any of that. Typically, it's a simple lookup of an in-memory hashtable. Sometimes (if caching with memcache) there's still a network round-trip, but all of the other stuff no longer happens.
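In the simplest case, that application-level lookup is nothing more than a map access with a fall-through to the database on a miss; the Customer type and loadCustomerFromDb() below are placeholders for illustration:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CustomerLookup {
    private final Map<Long, Customer> customerCache = new ConcurrentHashMap<>();

    Customer findCustomer(long id) {
        // A hit costs a hash lookup: no SQL generation, no network round-trip, no (de)serialization.
        return customerCache.computeIfAbsent(id, this::loadCustomerFromDb);
    }

    private Customer loadCustomerFromDb(long id) {
        // placeholder for the real ORM/JDBC call, executed only on a cache miss
        return null;
    }
}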
Here are a couple of reasons why you may want this:
An application caches just what it needs so you should get a better cache hit ratio
Accessing a local cache will probably be a couple of orders of magnitude faster than accessing the database due to network latency - even with a fast network
Scaling read-write transactions using a strongly consistent cache
Scaling read-only transactions can be done fairly easily by adding more Replica nodes.
However, that does not work for the Primary node, since that can only be scaled vertically.
And that's where a cache comes into play. For read-write database transactions that need to be executed on the Primary node, the cache can help you reduce the query load by directing it to a strongly consistent cache, like the Hibernate second-level cache.
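For example, enabling the Hibernate second-level cache for an entity might look like the following sketch (the Product entity is made up, and a cache region factory such as the Ehcache one still has to be configured separately):

import javax.persistence.Cacheable;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

@Entity
@Cacheable
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)  // reads of cached Products no longer hit the Primary node
public class Product {
    @Id
    private Long id;
    private String name;
    // getters and setters omitted for brevity
}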
Using a distributed cache
Storing an application-level cache in the memory of the application is problematic for several reasons.
First, the application memory is limited, so the volume of data that can be cached is limited as well.
Second, when traffic increases and we want to start new application nodes to handle the extra traffic, the new nodes would start with a cold cache, making the problem even worse as they incur a spike in database load until the cache is populated with data.
To address this issue, it's better to have the cache running as a distributed system, like Redis. This way, the amount of cached data is not limited by the memory size on a single node since sharding can be used to split the data among multiple nodes.
And, when a new application node is added by the auto-scaler, the new node will load data from the same distributed cache. Hence, there's no cold cache issue anymore.
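A short read-through sketch against such a distributed cache, here Redis via the Jedis client; the key naming, TTL and the fallback loader are assumptions for illustration:

import redis.clients.jedis.Jedis;

String findCustomerJson(long id) {
    try (Jedis jedis = new Jedis("redis-host", 6379)) {
        String key = "customer:" + id;
        String json = jedis.get(key);
        if (json == null) {                       // a miss is a miss for every app node, but a hit is shared by all
            json = loadCustomerJsonFromDb(id);    // placeholder for the real database call
            jedis.setex(key, 3600, json);         // keep it cached for an hour
        }
        return json;
    }
}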
Even if a database engine caches data, indexes, or query result sets, it still takes a round-trip to the database for your application to benefit from that cache.
An ORM framework runs in the same space as your application. So there's no round-trip. It's just a memory access, which is generally a lot faster.
The framework can also decide to keep data in cache as long as it needs it. The database may decide to expire cached data at unpredictable times, when other concurrent clients make requests that utilize the cache.
Your application-side ORM framework may also cache data in a form that the database can't return. E.g. in the form of a collection of java objects instead of a stream of raw data. If you rely on database caching, your ORM has to repeat that transformation into objects, which adds to overhead and decreases the benefit of the cache.
Also, the database's cache might not be as practical as one might think. I copied the following from http://highscalability.com/bunch-great-strategies-using-memcached-and-mysql-better-together -- it's MySQL specific, though.
Given that MySQL has a cache, why is memcached needed at all?
The MySQL cache is associated with just one instance. This limits the cache to the addressable memory of one server. If your system is larger than the memory of one server, then using the MySQL cache won't work. And if the same object is read from another instance, it's not cached.
The query cache invalidates on writes. You build up all that cache and it goes away when someone writes to it. Your cache may not be much of a cache at all depending on usage patterns.
The query cache is row based. Memcached can cache any type of data you want and it isn't limited to caching database rows. Memcached can cache complex objects that are directly usable without a join.
The performance considerations related to the network roundtrips have correctly been pointed out.
To that, it must be added that caching data anywhere other than in the DBMS (NOT the "database") creates the problem of potentially obsolete data still being presented as "up to date".
Giving in to the temptation of performance improvement comes at the expense of losing the guarantee (watertight, or at least close to it) of absolutely reliable, correct, and consistent data.
Consider this every time accuracy and consistency are crucial.
A lot of good answers here. I'll add one other point: I know my access pattern, the database doesn't.
Depending on what I'm doing, I know that if the data ends up stale, that's not really a problem. The DB doesn't know that, and would have to reload the cache with the new data.
I know that I'll come back to a piece of data a few times over the next while, so it's important to keep it around. The DB has to guess at what to keep in its cache; it doesn't have the information I do. So if I fetch it from the DB over and over, it may not be in the cache if the server is busy, and I could get a cache miss. With my own cache, I can be sure I get a hit. This is especially true for data that is non-trivial to get (i.e. a few joins, some group functions) as opposed to just a single row. Getting a row with primary key 7 is easy for the DB, but if it has to do some real work, the cost of a cache miss is much higher.
No doubt modern databases provide caching facilities, but when your site has more traffic and you need to perform many database transactions, you will not get high performance. To increase performance in this case, the Hibernate cache will help you by optimizing the database application. The cache stores data already loaded from the database, so that the traffic between the application and the database is reduced when the application wants to access that data again; both the access time and the traffic between the application and the database are reduced.
That said, caches can sometimes become a burden and actually slow down the server. When you have high load, the algorithm deciding what is cached and what is not might not fit the requests coming in... what you get is a cache that starts to behave like a FIFO working overtime... this begins to show when the table that sits behind the cache has significantly more records than are ever going to be cached in memory...
A good trade-off would be to cluster the data you want to cache. Have a main server which pushes updates out to the clusters; when the updates are sent/pushed should be tunable per table depending on TTL (time-to-live) settings.
Your logic and data on the user node can then sit on the same server, which opens up in-memory databases, or, if it does have to fetch data, you could set it up to use a pipe instead of a network call...
This takes some thought about how you want to use the data, and when/if you cluster you have to be aware of distributed transactions (transactions spanning more than one database)... but if the data being cached is updated on its own, without links into other DB spaces, then you can get away with this...
The problem with ORM caching is that if the database is updated independently through another application, the ORM cache can become out of date... It also gets tricky if you update a set... the update might touch something that is in your cache, and you need some sort of algorithm to identify which records need to be removed/updated in memory (slowing down the update!?) - and that algorithm becomes incredibly tricky and bug-prone!
If you use ORM caching, then keep to a simple rule... cache simple objects that hardly ever change (user/role details, for example), that are small in size and are hit many times in a request... if it falls outside of this, then I suggest clustering the data for performance.