Hazelcast data affinity with preferred member as primary - java

I have a clustered system set up with Hazelcast to store my data. Each node in the cluster is responsible for connecting to a service on localhost and piping data from this service into the Hazelcast cluster.
I would like this data to be stored primarily on the node that received it, and also processed on that node. I'd like the data to remain readable and writable from the other nodes, with somewhat relaxed performance expectations.
I started with a naive implementation that does exactly as described, with no special considerations. I noticed performance suffered quite a bit (we had a separate Infinispan implementation to compare against). Generally speaking, there is little logical intersection between the data I'm processing from each individual service. It's stored in a Hazelcast cluster so it can be read, and occasionally written, from all nodes, and for failover scenarios: I still need to read the last good state of a failed node, whether the Hazelcast member on that node fails or the local service does.
So my first attempt at co-locating the data and reducing network chatter was to key much of the data with a serverId (a number from 1 to 3 on, say, a 3-node system) and have the key implement PartitionAware. I didn't notice an improvement in performance, so I then executed the logic itself on the cluster and keyed it the same way (with a PartitionAware Runnable submitted to a DurableExecutorService). I figured that if I couldn't select which member the logic ran on, I could at least run it consistently on the same member, co-located with its data.
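For reference, a minimal sketch of the kind of key described above (the class and field names are illustrative, and the Hazelcast 3.x package is assumed):

import java.io.Serializable;

import com.hazelcast.core.PartitionAware;

// Hazelcast hashes only getPartitionKey(), so every entry for a given
// serverId lands in the same partition.
public class ServerDataKey implements PartitionAware<Integer>, Serializable {
    private final int serverId;  // 1..3 on a 3-node system, as described above
    private final String dataId; // identifies the individual entry

    public ServerDataKey(int serverId, String dataId) {
        this.serverId = serverId;
        this.dataId = dataId;
    }

    @Override
    public Integer getPartitionKey() {
        return serverId;
    }

    // equals() and hashCode() over both fields omitted for brevity
}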
That made performance even worse, as all data and all execution tasks ended up stored and run on a single node. My guess is that node #1 was getting partitions 1 to 90, node #2 was getting 91 to 180, and node #3 was getting 181 to 271 (or some variant of this; I don't know the exact key hash algorithm or how my int serverId translates to a partition number). So hashing serverIds 1, 2, and 3 resulted in, e.g., the oldest member getting all the data and execution tasks.
My next attempt was to set the backup count to (member count) - 1 and enable backup reads. That improved things a little.
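That step corresponds to a map configuration roughly like the following (the map name is illustrative; note that reads served from backups may be slightly stale):

import com.hazelcast.config.Config;

Config config = new Config();
config.getMapConfig("serverData")   // illustrative map name
      .setBackupCount(2)            // (member count) - 1 on a 3-node cluster
      .setReadBackupData(true);     // allow reads from local backup replicas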
I then looked into ReplicatedMap, but it doesn't support indexing or predicates. One of my motivations for moving to Hazelcast was its more comprehensive support for indexing and querying map data (and, from what I've seen, its better performance at this).
I'm not convinced any of these are the right approach (especially since mapping 3 node numbers onto partition numbers doesn't match how partitions were intended to be used). Is there anything else I can look at that would provide this kind of layout, with one member being the preferred primary for its data while still keeping readable backups on one or more other members for use after a failure?
Thanks!

Data grids provide scalability: you can add or remove storage nodes to adjust capacity, and for this to work the grid needs to be able to rebalance the data load. Rebalancing means moving some of the data from one place to another. So, as a general rule, the placement of data is out of your control and may change while the grid runs.
Partition awareness will keep related items together; if they move, they move together. A runnable/callable that accesses several related items can then do so from a single JVM, which is more efficient.
There are two possible improvements if you really need data local to a particular node: read-backup-data or near-cache. See this answer.
Both (or either) will help reads, but not writes.
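A minimal near-cache sketch for an embedded member, assuming the Hazelcast 3.x API and an illustrative map name:

import com.hazelcast.config.Config;
import com.hazelcast.config.NearCacheConfig;

Config config = new Config();
config.getMapConfig("serverData")
      .setNearCacheConfig(new NearCacheConfig()
              .setInvalidateOnChange(true)); // drop locally cached entries when they change remotely

This keeps recently read entries in the local JVM; writes still go to the partition owner.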

Related

Does Hazelcast store MultiMap values in the local instance when backup is disabled?

I am configuring a Hazelcast MultiMap without backups (on purpose):
// No backups: entries should vanish along with the member that owns them
config.getMultiMapConfig(SESSIONS_MAP)
      .setBackupCount(0)
      .setAsyncBackupCount(0)
      .setValueCollectionType(MultiMapConfig.ValueCollectionType.SET);
My goal is that each instance stores its own values in the MultiMap, so that when a server disappears, those values are lost. Is the above configuration correct?
Example: server instances in a cluster host user sessions. I want to store users in a MultiMap, so that each user is physically stored on the local instance, but other instances can look up where a user session exists. When a server crashes, the user sessions disappear, and so should the entries in the MultiMap. [Users are actually stored in rooms, like MultiMap<roomId, Set<userId>>, where a room may span multiple instances. If one instance goes down, the room may survive, but I want the users on that instance to become unavailable in the MultiMap as well.]
Only if the above is guaranteed: in a controlled shutdown, is it worth cleaning up the local entries before shutting down, or is it cheaper to just let the instance disappear?
The manual at https://docs.hazelcast.org/docs/latest-dev/manual/html-single/index.html#configuring-multimap doesn't clearly spell out what actually happens (or I am too blind to find it).
If you set the backup counts to zero, each entry will be stored in only one partition (the primary). But that doesn't mean the partition will be hosted on the "local" cluster node.
The partition where any entry is stored is determined by a hashing algorithm, but the mapping of partitions to cluster nodes will change as cluster membership changes (nodes are added or removed). So I don't think trying to manipulate the hashcode is a good way to go.
Since you mention the "local instance", I'm guessing you're using Hazelcast in embedded mode, with the Hazelcast cluster nodes on the same servers that host the "rooms". You might want to configure a MembershipListener; it would be notified whenever a node leaves the cluster and could then remove the map entries for user sessions hosted in rooms on that node.
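A rough sketch of that idea, assuming the Hazelcast 3.x API. The "usersByMember" bookkeeping MultiMap (member UUID -> "roomId:userId") is hypothetical and would have to be maintained by each instance as it adds users:

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.MemberAttributeEvent;
import com.hazelcast.core.MembershipEvent;
import com.hazelcast.core.MembershipListener;
import com.hazelcast.core.MultiMap;

public class SessionCleanupListener implements MembershipListener {
    private final HazelcastInstance hz;

    public SessionCleanupListener(HazelcastInstance hz) { this.hz = hz; }

    @Override
    public void memberRemoved(MembershipEvent event) {
        String gone = event.getMember().getUuid();
        MultiMap<String, String> bookkeeping = hz.getMultiMap("usersByMember");
        MultiMap<String, String> rooms = hz.getMultiMap("rooms"); // roomId -> userId
        // Drop every user that was hosted on the departed member
        for (String entry : bookkeeping.remove(gone)) {
            String[] parts = entry.split(":", 2);
            rooms.remove(parts[0], parts[1]);
        }
    }

    @Override
    public void memberAdded(MembershipEvent event) { }

    @Override
    public void memberAttributeChanged(MemberAttributeEvent event) { }
}

Register it with hz.getCluster().addMembershipListener(new SessionCleanupListener(hz));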
That's the wrong use case for a partition-based distributed system. When you store data in a partitioned, distributed data structure such as a Map or MultiMap, you do not have control over which partition hosts your key-value data. The host partition for your data is determined by a consistent hashing algorithm applied to the key. This applies to both write and read operations. And with backups enabled, the data is replicated to backup partitions on other nodes so that it can be recovered in case of a node failure.
So in your case, you don't even know whether a particular entry is local to your instance (unless you manually record the key-to-partition mapping using Hazelcast APIs). You are looking up an entry hoping it is local to that instance because you executed the write from that same node, but in reality the entry may be stored in a partition on some other node in the cluster.
I believe what you want is a Near Cache, which is essentially an L1 cache local to your application: if you lose the app instance, you lose the Near Cache. Note, however, that Near Cache is not available for MultiMap. And even with a Near Cache you would never receive "null" or "data not found", because a Near Cache loads the data from the partition owner (a cluster node) whenever the data is not found locally.
You can also turn off backups, but that means losing data on the lost node, and that data may not have been local to your application anyway.
Hope that helps.

Zookeeper reads are not fully consistent per the documentation, but is creating a znode fully consistent?

Below are my assumptions/questions. Please point out anything wrong in my understanding.
From reading the documentation, I understand that:
ZooKeeper writes go to the leader and are replicated to the followers. A read request can be served by a follower (slave) itself, and hence reads can be stale.
Why can't we use ZooKeeper as a cache system?
Since write requests are always made to (or redirected to) the leader, node creation is consistent. When two clients send a write request for the same node name, one of them will ALWAYS get an error (NodeExistsException).
If the above is true, can we use ZooKeeper to keep track of duplicate requests by creating a znode with the requestId? (A sketch follows this list.)
For generating a sequence number in a distributed system, we can use sequential node creation.
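To make the deduplication point concrete, here is a minimal sketch using the plain ZooKeeper Java client (the /requests parent path and the helper name are illustrative; /requests is assumed to exist and zk to be connected):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class RequestDedup {
    static boolean claimRequest(ZooKeeper zk, String requestId) throws Exception {
        try {
            // Creation is arbitrated by the leader, so exactly one client wins
            zk.create("/requests/" + requestId, new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            return true;   // this client created the znode first; safe to process
        } catch (KeeperException.NodeExistsException dup) {
            return false;  // another client already claimed this requestId
        }
    }
}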
Based on the information available in the question and the comments, it appears that the basic question is:
In a stateless multi server architecture, how best to prevent data duplication, here the data is "has this refund been processed?"
This qualifies as "primarily opinion based". There are multiple ways to do this, and no single way is the best. You can do it with MySQL, and you can do it with ZooKeeper.
Now comes pure opinion and speculation:
To process a refund, there must be some database somewhere; why not just check against it? The duplicate-request scenario you are preparing for seems like a rare occurrence; this won't be happening hundreds of times per second. If so, this scenario does not warrant a high-performance implementation. A simple database lookup should be fine.
Your workload seems to have a 1:1 read:write ratio: every time a refund is processed, you check whether it has already been processed, and if not, you process it and make an entry for it. ZooKeeper itself says it works best at something like a 10:1 read:write ratio. While there is no such metric available for MySQL, it does not need to make certain* guarantees that ZooKeeper makes for write activity, so I expect it to be better for write-intensive loads. (* Guarantees like sequentiality, broadcast, consensus, etc.)
Just a nitpick, but your data is a linear list of hundreds (thousands? millions?) of transaction ids. This is exactly what MySQL (or any database) and its primary key are built for. ZooKeeper is made for more complex/powerful hierarchical data, which you do not need.

Enforce that a partition is stored on a specific executor

I have a 5-partition RDD and 5 workers/executors.
How can I ask Spark to save each of the RDD's partitions on a different worker (IP)?
Am I right in saying that Spark can save a few partitions on one worker and zero partitions on other workers?
That is, I can specify the number of partitions, but Spark may still cache everything on a single node.
Replication is not an option, since the RDD is huge.
Workarounds I have found
getPreferredLocations
Overriding RDD's getPreferredLocations method does not provide a 100% guarantee that a partition will be stored on the specified node. Spark will try during spark.locality.wait, but afterward it will cache the partition on a different node.
As a workaround, you can set a very high value for spark.locality.wait and override getPreferredLocations. The bad news: you cannot do that from Java; you need to write Scala code (or at least Scala internals wrapped with Java code). For example:
import scala.reflect.ClassTag
import org.apache.spark.Partition
import org.apache.spark.rdd.RDD

// Sketch only: a concrete RDD must also override compute() and getPartitions()
class NodeAffinityRDD[U: ClassTag](prev: RDD[U]) extends RDD[U](prev) {
  val nodeIPs = Array("192.168.2.140", "192.168.2.157", "192.168.2.77")
  // Pin each partition to a node, round-robin by partition index
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(nodeIPs(split.index % nodeIPs.length))
}
SparkContext's makeRDD
SparkContext has a makeRDD method, which lacks documentation. As I understand it, I can specify preferred locations and then set a high value for spark.locality.wait. The bad news: the preferred locations will be discarded on the first shuffle/join/cogroup operation.
Both approaches share the drawback that a very high spark.locality.wait can cause your cluster to starve if some of the nodes become unavailable.
P.S. More context
I have up to 10,000 sales-XXX.parquet files, each representing sales of different goods in different regions. Each sales-XXX.parquet can vary from a few KB to a few GB. All of them together can take up tens or hundreds of GB in HDFS.
I need full-text search across all sales, so I have to index each sales-XXX.parquet one by one with Lucene. Now I have two options:
Keep the Lucene indexes in Spark. There is already a solution for this, but it looks pretty suspicious. Are there any better solutions?
Keep the Lucene indexes on the local file system. Then I can map-reduce over the results of each worker's index lookup. But this approach requires that each worker node keep an equal amount of data. How can I ensure that Spark keeps an equal amount of data on each worker node?

How to know affected rows in Cassandra (CQL)?

There doesn't seem to be any direct way to know the affected rows in Cassandra for update and delete statements.
For example if I have a query like this:
DELETE FROM xyztable WHERE PKEY IN (1,2,3,4,5,6);
Now, of course, since I've passed 6 keys, it is obvious that 6 rows will be affected.
But, as in the RDBMS world, is there any way to know the affected rows for update/delete statements in the DataStax driver?
I've read here that Cassandra gives no feedback on write operations.
Beyond that, I could not find any other discussion of this topic through Google.
If that's not possible, can I at least be sure that a query like the one above will either delete all of the rows or fail to delete all of them?
In an eventually consistent world, you can look at these operations as saving a delete request and, depending on the requested consistency level, waiting for confirmation from several nodes that the request has been accepted. The request is then delivered to the other nodes asynchronously.
Since there are no dependencies such as foreign keys, nothing should stop data from being deleted once the request has been successfully accepted by the cluster.
However, there are a lot of ifs. For example, a delete with consistency level ONE that is successfully accepted by one node, followed by an immediate hard failure of that node, may be lost if it was not replicated before the failure.
Another example: during the deletion, one node was down, and it stayed down for a significant amount of time, longer than gc_grace_seconds, i.e., longer than tombstones are kept before being removed along with the deleted data. If that node then recovers, all the data that was deleted from the rest of the cluster, but not from that node, will suddenly be brought back to the cluster.
So, to avoid these situations and to consider operations successful and final, a Cassandra admin needs to implement certain measures, including regular repair jobs (to make sure all nodes are up to date). Applications also need to decide what is better: faster performance with consistency level ONE at the expense of possible data loss, or lower performance with higher consistency levels but less chance of data loss.
There is no way to do this in Cassandra, because the model for writes, deletes, and updates in Cassandra is basically the same. In all of those cases a cell is added to the table containing either the new information or information about the delete. This is done without any inspection of the current DB state.
Without checking the rest of the replicas and doing a full merge of the row, there is no way to tell whether an operation will actually affect the current read state of the database.
This leads to the oft-cited anti-pattern of "reading before a write." In Cassandra you are meant to write as fast as possible, and if you need history, use a data structure that preserves a log of modifications rather than just the current state.
There is one option for doing queries like this: the lightweight transaction (CAS) syntax, e.g. IF EXISTS or IF value = .... But this is a very expensive operation compared to a normal write and should be used sparingly.
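For example, with the DataStax Java driver (3.x assumed; the contact point and keyspace are illustrative), a conditional delete reports whether it was applied:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class LwtDeleteExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {
            // IF EXISTS makes Cassandra report whether the row was actually there
            ResultSet rs = session.execute("DELETE FROM xyztable WHERE pkey = 1 IF EXISTS");
            System.out.println("Row existed and was deleted: " + rs.wasApplied());
        }
    }
}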

Is the IN relation in Cassandra bad for queries?

Given the following example SELECT in CQL:
SELECT * FROM tickets WHERE ID IN (1,2,3,4)
Given that ID is the partition key, is using the IN relation better than doing multiple queries, or is there no difference?
I remember seeing someone answer this question on the Cassandra user mailing list a short while back, but I cannot find the exact message right now. Ironically, Cassandra evangelist Rebecca Mills just posted an article that addresses this issue (Things you should be doing when using Cassandra drivers...points #13 and #22). But the answer is "yes": in some cases multiple, parallel queries will be faster than using IN. The underlying reason can be found in the DataStax SELECT documentation:
When not to use IN
...Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range.
So based on that, it would seem that this becomes more of a problem as your cluster gets larger.
Therefore, the best way to solve this problem (and avoid using IN at all) is to rethink your data model for this query. Without knowing too much about your schema, perhaps there are attributes (column values) shared by ticket IDs 1, 2, 3, and 4. Maybe use something like level or group (if tickets are for a particular venue), or maybe even an event (id), instead.
Basically, while using a unique, high-cardinality identifier to partition your data sounds like a good idea, it actually makes it harder to query your data (in Cassandra) later on. If you can come up with a different column to partition your data on, that will certainly help in this case. Regardless, creating a new, query-specific column family (table) to handle those rows is a better approach than using IN or multiple queries.
Yes, it's better to query individually than to use IN in Cassandra.
For this query, the coordinator has to get the data from 4 different partitions, and if each partition is very big, the data piles up in the coordinator's JVM, which can cause problems.
Querying the data with multiple individual queries is better, as each query is independent and doesn't have to wait for the other partitions' data before sending results back to the user.
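As a sketch of the multiple-queries approach with the DataStax Java driver (3.x assumed; an open Session is presumed), firing the per-key queries asynchronously keeps them parallel:

import java.util.ArrayList;
import java.util.List;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

class ParallelTicketFetch {
    // One async query per key; each goes only to the replicas for that partition
    static List<Row> fetchTickets(Session session, List<Integer> ids) {
        PreparedStatement ps = session.prepare("SELECT * FROM tickets WHERE id = ?");
        List<ResultSetFuture> futures = new ArrayList<>();
        for (Integer id : ids) {
            futures.add(session.executeAsync(ps.bind(id)));
        }
        List<Row> rows = new ArrayList<>();
        for (ResultSetFuture f : futures) {
            rows.addAll(f.getUninterruptibly().all()); // block for each result in turn
        }
        return rows;
    }
}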
