JGroups ReplicatedHashMap with Partial State Transfer

JGroups ReplicatedHashMap with Partial State Transfer - java

I'm looking to extend the JGroups ReplicatedHashMap demo with additional functionality - the ability to support named submaps to be replicated across different instances within the same cluster.
The basic idea is that not all clients need to have a local copy of the entire hashmap, but might need to request additional chunks of the hashmap on demand. Each client would start out with a relatively small base set of data, say, the state associated with the state id "base_data". As they required more specialized data, they would perform a partial state transfer requesting the exact data they required; the state associated with state id "specialized_data_1". This creates a kind of localized caching service where updates to the cache propogate to appropriate clients within the cluster.
Is this an appropriate use of Partial State Transfer with JGroups? Is there a better way to do this? Am I completely misunderstanding partial state transfer? Since JGroups 3.x doesn't support partial state transfer, how could this be implemented there? I haven't found very much documentation on partial state transfer, beyond this small section in the documentation (scroll/search for "3.6.15. Partial state transfer"), so I'd appreciate any other good references you might recommend.
Thanks

Partial state transfer was removed some time ago, as it was broken, see the link below for details. You could probably do this with messages. What you want to do sounds a bit like what Infinispan already provides, so you may want to take a look at their DIST mode.
http://jgroups.1086181.n5.nabble.com/Partial-state-transfer-removed-in-3-0-td3173.html

Related

Is there a way to ensure uniqueness of a field without relying on database

Without relying on the database, is there a way to ensure a field (let's say a User's emailAddress) is unique.
Some common failure attempts:
Check first if emailAddress exists (by querying the DB) and if not then create the user. Now obviously in the window of check-then-act some other thread can create a user with same email. Hence this solution is no good.
Apply a language-level lock on the method responsible for creating the user. This solution fails as we need redundancy of the service for performance reasons and lock is on a single JVM.
Use an Event store (like an Akka actor's mailbox), event being an AddUser message, but since the actor behavior is asynchronous, the requestor(sender) can't be notified that user creation with unique email was successful. Moreover, how do 2 requests (with same email) know they contain a unique email? This may get complicated.
Database, being a single source of data that every thread and every service instance will write to, makes sense to implement the unique constraint here. But this holds true for Relational databases.
Then what about NoSql databases? some do allow for a unique constraint, but it's not their native behavior, or maybe it is.
But the question of not using the database to implement uniqueness of a field, what could be the options?

I think your question is more generic - "how do I ensure a database write action succeeded, and how do I handle cases where it didn't?". Uniqueness is just one failure mode - you may be attempting to insert a value that's too big, or of the wrong data type, or that doesn't match a foreign key constraint.
Relational databases solve this through being ACID-compliant, and throwing errors for the client to deal with when a transaction fails.
You want (some of) the benefits of ACID without the relational database. That's a fairly big topic of conversation. The obvious way to solve this is to introduce the concept of "transaction" in your application layer. For instance, in your case, you might send a "create account(emailAddress, name, ...)" message, and have the application listen for either an "accountCreated" or "accountCreationFailed" response. The recipient of that message is responsible for writing to the database; you have a couple of options. One is to lock that thread (so only one process can write to the database at any time); that's not super scalable. The other mechanism I've used is introducing status flags - you write the account data to the database with a "draft" flag, then check for your constraints (including uniqueness), and set the "draft" flag to "validated" if the constraints are met (i.e. there is no other record with the same email address), and "failed" if they are not.

to check for uniquness you need to store the "state" of the program. for safety you need to be able to apply changes to the state transactionally.
you can use database transactions. a few of the NoSQL databases support transactions too, for example, redis and MongoDB. you have to check for each vendor separately to see how they support transactions. in this setup, each client will connect to the database and it will handle all of the details for you. also depending on your use case you should be careful about the isolation level configuration.
if durability is not a concern then you can use in memory databases that support transactions.
which state store you choose, it should support transactions. there are several ways to implement transactions and achieve consistency. many relational databases like PostgresSQL achieve this by implementing the MVCC algorithm. in a distributed environment you have to look for distributed transactions such as 2PC, Paxos, etc.
normally everybody relies on availabe datastore solutions unless there is a weird or specific requirement for the project.
final note, the communication pattern is not related to the underlying problem here. for example, in the Actor case you mentioned, at the end of the day, each actor has to query the state to find if a email exists or not. if your state store supports Serializability then there is no problem and conflicts will not happen (communicating the error to the client is another issue). suppose that you are using PostgreSQL. when a insert/update query is issued, it is wrapped around a transaction and the underlying MVCC algorithm will take care of everything. in an advanced and distrbiuted environment you can use data stores that support distributed transactions, like CockroachDB.
if you want to dive deep you can research these keywords: ACID, isolation levels, atomicity, serializability, CAP theorem, 2PC, MVCC, distributed transacitons, distributed locks, ...

NoSQL databases provide different, weaker, guarantees than relational databases. Generally, the tradeoff is you give up ACID guarantees in exchange for increased scalability in the dimensions that matter for your application.
It's possible to provide some kind of uniqueness guarantee, but subject to certain tradeoffs. With NoSQL, there are always tradeoffs.
If your NoSQL store supports optimistic concurrency control, maybe this approach will work:
Store a separate document that contains the set of all emailAddress values, across all documents in your NoSQL table. This is one instance of this document at a given time.
Each time you want to save a document containing emailAddress, first confirm email address uniqueness:
Perform the following actions, protected by optimistic locking. You can on the backend if this due to a concurrent update:
Read this "all emails" document.
Confirm the email isn't present.
If not present, add the email address to the "all emails document"
Save it.
You've now traded one problem ... the lack of unique constraints, for another ... the inability to synchronise updates across your original document and this new "all emails" document. This may or may not be acceptable, it depends on the guarantees that your application needs to provide.
e.g. Maybe you can accept that an email may be added to "all emails", that saving the related document to your other "table" subsequently fails, and that that email address is now not able to be used. You could clean this up with a batch job somehow. Not sure.
The index of emails could be stored in some other service (e.g. a persistent cache). The same problem exists, you need to keep the index and your document store in sync somehow.
There's no easy solution. For a detailed overview of the relevant concepts, I'd recommend Designing Data-Intensive Applications by Martin Kleppmann.

Write-through cache Redis

I have been poundering on how to reliably implement a write-through caching mechanism to store realtime data.
Basically what we need is this:
Save data to Redis -> Save to database (underlying)
Read data from Redis <- Read from database in case unavailable in cache
The resources online to help in the implementation of this caching strategy seem scarce.
The problem is:
1) No built-in transaction possibility between Redis and the database (Mongo in my case).
2) No transactions mean that writes to the underlying database are unreliable.
The most straightforward way I see how this can be implemented is by using a broker like Kafka and putting messages on a persistent queue to be processed later.
Therefore Kafka would be the responsible entity for reliable processing.
Another way would be by having a custom implementation in a scheduler that checks the Redis database for dirty records. On first thought there seem to be some tradeoffs to this approach and I would like not having to go this road if possible.
I am looking on some options on how this can be implemented otherwise.
Or whether this is in fact the most viable approach.

So better approach than is as u mentioned above is to use kafka and consumer which will store data to mongo. But read about it delivery guarantee, as i remember exactly once is guaranteed in kafka streams only (between two topics), in your case your database should be idempotent because u get at least once guarantee. And don't forget to turn AOF on with Redis, not to loose data. And don't forget that in this case u get eventual consistency in db with all the consequences.

On review I will use MongoDB as a single datastore without Redis at all.
Premature optimization is evil I guess.
Anyhow, I can add additional architecture afterwards after benchmarking.
Plans to refactor towards a cache shouldn't be too hard.
Scaling is additional concern so I shouldn't be bothered with that during development right now.
Accepted #Ipave answer, going with a single datastore for the moment.

Create a transaction like to use EJB container administration

I have a code in my business layer that updates data on database and also in a rest service.
The question is that if it doesn't fail data must be save in both places and, in other hand, if it fails it must to rollback in database and send another requisition to rest api.
So, what I'm looking for is a way to use transaction management of EJB to also orchestrait calls to api. When in commit time, send a set requisition to api and, when in rollback time, send delete requisition to api.
In fact I need to maintain consistency and make both places syncronous.
I have read about UserTransactions and managedbeans but I don't have a clue about what is the best way to do that.

You can use regular distributed transactions, depending on your infrastructure and participants. This might be possible e.g. if all participants are EJBs and the data stores are capable to handle distributed transactions.
This won't work with loosely coupled componentes, and your setup looks like this.
I do not recommend to create your own distributed transaction protocol. Regarding the edge and corner cases, you will probably not end up with consistent data in the end.
I would suggest to think about using event sourcing and eventually consistency for things like that. For example, you could emit an event (command) for writing data. If your "rollback" is needed, you can emit an event (command) to delete the date written before. After all events are processed, the data is consistent.
Some interesting links might be:
Martin Fowler - Event Sourcing
Martin Fowler - CQRS
Apache Kafka

maintaining cache state in different servers

This may be a dumb question, but i am not getting what to google even.
I have a server which fetches the some data from DB, caches this data and when ever any request involves this data, then data is fetched from cache instead of from DB.There by reducing the time taken to serve the request.
This cache can be modified, i.e may be some key can get added to it or deleted or updated.
Any change which occurs in cache will also happen on DB.
The Problem is now due to heavy rush in traffic we want to add a load balancer infront of my server. Lets say i add one more server. Then the two servers will have two different cache. if some thing gets added in the first server cache, how should i inform the second server cache to get it refreshed??

If you ultimately decide to move the cache outside your main webserver process, then you could also take a look at consistent hashing. This would be a alternative to a replicated cache.
The problem with replicated caches, is they scale inversely proportional to the number of nodes participating in the cache. i.e. their performance degrades as you add additional nodes. They work fine when there is a small number of nodes. If data is to be replicated between N nodes (or you need to send eviction messages to N nodes), then every write requires 1 write to the cache on the originating node, and N-1 writes to the other nodes.
In consistent hashing, you instead define a hashing function, which takes the key of the data you want to store or retrieve as input, and it returns the id of the server in the cluster which is responsible for caching the data for that key. So each caching server is responsible for a fraction of the overall keys, the client can determine which server will contain the sought data without any lookup, and data and eviction messages do not need to be replicated between caching servers.
The "consistent" part of consistent hashing, refers to how your hashing function handles new servers being added to or removed from the cluster: some re-distribution of keys between servers is required, but the function is designed to minimize the amount of such disruption.
In practice, you do not actually need a dedicated caching cluster, as your caches could run in-process in your web servers; each web server being able to determine the other webserver which should store cache data for a key.
Consistent hashing is used at large scale. It might be overkill for you at this stage. But just be aware of the scalability bottleneck inherent in O(N) messaging architectures. A replicated cache is possibly a good idea to start with.
EDIT: Take a look at Infinispan, a distributed cache which indeed uses consistent hashing out of box.

Any way you like ;) If you have no idea, I suggest you look at or use ehcache or Hazelcast. It may not be the best solutions for you but it is some of the most widely used. (And CV++ ;) I suggest you understand what it does first.

Denormalization in Google App Engine?

Background::::
I'm working with google app engine (GAE) for Java. I'm struggling to design a data model that plays to big table's strengths and weaknesses, these are two previous related posts:
Database design - google app engine
Appointments and Line Items
I've tentatively decided on a fully normalized backbone with denormalized properties added into entities so that most client requests can be serviced with only one query.
I reason that a fully normalized backbone will:
Help maintain data integrity if I code a mistake in the denormalization
Enable writes in one operation from a client's perspective
Allow for any type of unanticipated query on the data (provided one is willing to wait)
While the denormalized data will:
Enable most client requests to be serviced very fast
Basic denormalization technique:::
I watched an app engine video describing a technique referred to as "fan-out." The idea is to make quick writes to normalized data and then use the task queue to finish up the denormalization behind the scenes without the client having to wait. I've included the video here for reference, but its an hour long and theres no need to watch it in order to understand this question:
http://code.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
If I use this "fan-out" technique, every time the client modifies some data, the application would update the normalized model in one quick write and then fire off the denormalization instructions to the task queue so the client does not have to wait for them to complete as well.
Problem:::
The problem with using the task queue to update the denormalized version of the data is that the client could make a read request on data that they just modified before the task queue has completed the denormalization on that data. This would provide the client with stale data that is incongruent with their recent request confusing the client and making the application appear buggy.
As a remedy, I propose fanning out denormalization operations in parallel via asynchronous calls to other URLS in the application via URLFetch: http://code.google.com/appengine/docs/java/urlfetch/ The application would wait until all of the asynchronous calls had been completed before responding to the client request.
For example, if I have an "Appointment" entity and a "Customer" entity. Each appointment would include a denormalized copy of the customer information for who its scheduled for. If a customer changed their first name, the application would make 30 asynchronous calls; one to each affected appointment resource in order to change the copy of the customer's first name in each one.
In theory, this could all be done in parallel. All of this information could be updated in roughly the time it takes to make 1 or 2 writes to the datastore. A timely response could be made to the client after the denormalization was completed eliminating the possibility of the client being exposed to incongruent data.
The biggest potential problem I see with this is that the application can not have more than 10 asynchronous request calls going at any one time (documented here): http://code.google.com/appengine/docs/java/urlfetch/overview.html).
Proposed denormalization technique (recursive asynchronous fan-out):::
My proposed remedy is to send denormalization instructions to another resource that recursively splits the instructions into equal-sized smaller chunks, calling itself with the smaller chunks as parameters until the number of instructions in each chunk is small enough to be executed outright. For example, if a customer with 30 associated appointments changed the spelling of their first name. I'd call the denormalization resource with instructions to update all 30 appointments. It would then split those instructions up into 10 sets of 3 instructions and make 10 asynchronous requests to its own URL with each set of 3 instructions. Once the instruction set was less than 10, the resource would then make asynchronous requests outright as per each instruction.
My concerns with this approach are:
It could be interpreted as an attempt to circumvent app engine's rules, which would cause problems. (its not even allowed for a URL to call itself, so I'd in fact have to have two URL resources that handle the recursion that would call each other)
It is complex with multiple points of potential failure.
I'd really appreciate some input on this approach.

This sounds awfully complicated, and the more complicated the design the more difficult it is to code and maintain.
Assuming you need to denormalize your data, I'd suggest just using the basic denormalization technique, but keep track of which objects are being updated. If a client requests an object which is being updated, you know you need to query the database to get the updated data; if not, you can rely on the denormalized data. Once the task queue finishes, it can remove the object from the "being updated" list, and everything can rely on the denormalized data.
A sophisticated version could even track when each object was edited, so a given object would know if it had already been updated by the task queue.

It sounds like you are re-implemeting Materialized Views http://en.wikipedia.org/wiki/Materialized_view.

I suggest you the easy solution with Memcache. Uppon update from your client, you could save an Entity in the Memcache storing the Key of the updated Entity with the status 'updating'. When you task finisches, it will delete the Memcached status. Then you would check the status before a read, allowing the user to be correctly informed if the Entity is still 'locked'.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.