How should I use Guava's Hashing#consistentHash? - java

I'm looking into using a consistent hash algorithm in some java code I'm writing. The guava Hashing library has a consistentHash(HashCode, int) method, but the documentation is rather lacking. My initial hope was that I could just use consistentHash() for simple session affinity to efficiently distribute load across a set of backend servers.
Does anyone have a real-world example of how to use this method? In particular I'm concerned with managing the removal of a bucket from the target range.
For example:
@Test
public void testConsistentHash() {
    List<String> servers = Lists.newArrayList("server1", "server2", "server3", "server4", "server5");
    int bucket = Hashing.consistentHash(Hashing.md5().hashString("someId"), servers.size());
    System.out.println("First time routed to: " + servers.get(bucket));
    // one of the back end servers is removed from the (middle of the) pool
    servers.remove(1);
    bucket = Hashing.consistentHash(Hashing.md5().hashString("someId"), servers.size());
    System.out.println("Second time routed to: " + servers.get(bucket));
}
Leads to the output:
First time routed to: server4
Second time routed to: server5
What I want is for that identifier ("someId") to map to the same server after removal of a server earlier in the list. So in the sample above, after removal I guess I'd want bucket 0 to map to "server1", bucket 1 to map to "server3", bucket 2 to map to "server4" and bucket 3 to map to "server5".
Am I supposed to maintain a separate (more complicated than a list) data structure to manage bucket removal and addition? I guess I had envisioned perhaps a more complicated Hashing API that would manage the remapping after adding and removal of particular buckets for me.
Note: I know the sample code is using a small input and bucket set. I tried this with 1000s of input across 100 buckets and the result is the same. Inputs that map to buckets 0-98 stay the same when I change the buckets to 99 and bucket 99 gets distributed across the remaining 99 buckets.

I'm afraid that no data structure can do it really right with the current consistentHash. As the method accepts only the list size, nothing but appending to and removing from the end can be supported. Currently, the best solution probably consists of replacing
servers.remove(n)
with
servers.set(n, servers.get(servers.size() - 1));
servers.remove(servers.size() - 1);
This way you sort of swap the failed server and the very last one. This looks bad, as it makes the assignments to the two swapped servers wrong, but the problem is only half as bad since one of them has failed anyway. And it makes sense: after the removal of the last list element, everything is fine except for the assignments to the failed server and to the previously last server.
So only twice as many assignments as necessary change. Not optimal, but hopefully usable?

I don't think there's a good way to do this at the moment. consistentHash in its current form is useful only in simple cases -- basically, where you have a knob to increase or decrease the number of servers... but always by adding and removing at the end.
There's some work underway to add a class like this:
public final class WeightedConsistentHash<B, I> {
    /** Initially, all buckets have weight zero. */
    public static <B, I> WeightedConsistentHash<B, I> create(
            Funnel<B> bucketFunnel, Funnel<I> inputFunnel);

    /**
     * Sets the weight of bucket "bucketId" to "weight".
     * Requires "weight" >= 0.0.
     */
    public void setBucketWeight(B bucketId, double weight);

    /**
     * Returns the bucket id that "input" maps to.
     * Requires that at least one bucket has a non-zero weight.
     */
    public B hash(I input);
}
Then you would write:
WeightedConsistentHash<String, String> serverChooser =
WeightedConsistentHash.create(stringFunnel(), stringFunnel());
serverChooser.setBucketWeight("server1", 1);
serverChooser.setBucketWeight("server2", 1);
// etc.
System.out.println("First time routed to: " + serverChooser.hash("someId"));
// one of the back end servers is removed from the (middle of the) pool
serverChooser.setBucketWeight("server2", 0);
System.out.println("Second time routed to: " + serverChooser.hash("someId"));
And you should get the same server each time. Does that API look suitable?

The guava API does not have any knowledge of your server list. It can only guarantee this:
int bucket1 = Hashing.consistentHash(Hashing.md5().hashString("server1"), N);
int bucket2 = Hashing.consistentHash(Hashing.md5().hashString("server1"), N - 1);
// bucket2 == bucket1 is guaranteed whenever bucket1 != N - 1;
// only keys that hashed to the removed last bucket get remapped.
assertThat(bucket2, is(equalTo(bucket1)));
You need to manage the mapping from buckets to your server list yourself.

The answer proposed in the question is the correct one:
Am I supposed to maintain a separate (more complicated than a list) data structure to manage bucket removal and addition?
Guava is hashing into a ring with ordinal numbers. The mapping from those ordinal numbers to the server ids has to be maintained externally:
Given N servers initially, one can choose some arbitrary mapping for each ordinal number 0..N-1 to server ids A..K (0->A, 1->B, .., N-1->K). A reverse mapping from the server id to its ordinal number is also required (A->0, B->1, ..).
On the deletion of a server - the ordinal number space shrinks by one. All the ordinal numbers starting with the one for the deleted server need to be remapped to the next server (shift by one).
So for example, after the initial mapping, say server C (corresponding to ordinal number 2) got deleted. Now the new mappings become: (0->A, 1->B, 2->D, 3->E, .., N-2->K)
On the addition of a server L (say going from N to N+1 servers) - a new mapping can be added from N->L.
What we are doing here is mimicking how nodes would move in a ring as they are added and deleted. While the ordering of the nodes remains the same, their ordinal numbers (on which Guava operates) can change as nodes come and go.
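A minimal sketch of that external bookkeeping, keeping just the ordinal list for brevity (the reverse map can be added as described); the class and method names are illustrative, not from Guava:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import com.google.common.hash.Hashing;

class BucketToServerMap {
    private final List<String> ordinalToServer = new ArrayList<>();

    void addServer(String serverId) {
        ordinalToServer.add(serverId);            // a new server gets ordinal N
    }

    void removeServer(String serverId) {
        ordinalToServer.remove(serverId);         // ordinals after it shift down by one
    }

    String serverFor(String key) {
        int bucket = Hashing.consistentHash(
                Hashing.md5().hashString(key, StandardCharsets.UTF_8),
                ordinalToServer.size());
        return ordinalToServer.get(bucket);
    }
}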

Related

Allocate Algorithm/Availability Algorithm

I am wondering if anyone knows of an Algorithm I could use to help me solve the following problem:
Allocate people (n) to certain events (m). An event m can have only one person attached to it, and the allocation must be randomized each time (the same person is allowed if there is only one option). Each person n has properties such as the time and day they are available. For n to be matched to m, the available time and day must match for both n and m. Multiple people may match the times of an event, but the allocation has to be the best fit so that the rest of the events can still be allocated. The diagram below will more than likely explain it better (sorry). A person can be allocated to more than one event, but this should be done fairly so that one person doesn't take all of the available events.
As you can see, Person A could be attached to Event A, but due to the need to have them all matching (the best attempt to match), they are attached to Event B to allow Person C to be allocated to Event A and Person B to Event C.
I am simply wondering if anyone knows the name of this type of problem and how I could go about solving it. I am coding the program in Java.
This is a variant of the Max Flow Problem. There are many algorithms tailor-made to solve max-flow problems, including the Ford-Fulkerson Algorithm or its refinement, the Edmonds-Karp Algorithm. Once you are able to transform your problem into a max-flow problem, solving it is fairly simple. But what is the max flow problem?
The problem takes in a weighted, directed graph and asks the question "What is the maximum amount of flow that can be directed from the source (a node) to the sink (another node)?".
There are a few constraints that make logical sense when thinking of the graph as a network of water flows.
The amount of flow through each edge must be less than or equal to the "capacity" (weight) of that edge, for every edge in the graph. Flows must also be non-negative.
The amount of flow into each node must equal the amount of flow leaving that node, for every node except the source and sink. There is no limit to the amount of flow that passes through a node.
Consider the following graph, with s as the source and t as the sink.
The solution to the max flow problem would be a total flow of 25, with the following flow amounts:
It is simple to transform your problem into a max flow problem. Assuming your inputs are:
N people, plus associated information on when person p_i is available time and date wise.
M events with a time and place.
Create a graph with the following properties:
A super source s
N person nodes p_1 ... p_n, with an edge of capacity infinity connecting s to p_i for all i in 1 ... n.
A super sink t
M event nodes e_1 ... e_m, with an edge of capacity 1 connecting e_i to t for all i in 1 ... m
For every combination of a person and an event (p_i, e_j), an edge with capacity infinity connecting p_i to e_j iff p_i can validly attend event e_j (otherwise no edge connecting p_i and e_j).
Constructing a graph to these specifications has O(1) + O(N) + O(N) + O(M) + O(M) + O(1) + O(NM) = O(NM) runtime.
For your example the constructed graph would look like the following (with unlabeled edges having capacity infinity):
You correctly noticed that there is a Max Flow with value 4 for this example, and any max flow would return the same value. Once you can perform this transformation, you can run any max flow algorithm and solve your problem.
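As a hedged, concrete sketch of that construction (the node numbering and the compatible matrix are assumptions made for the example, not part of the question):

// Node layout: 0 = source s, 1..n = people, n+1..n+m = events, n+m+1 = sink t.
// compatible[i][j] is assumed to mean "person i can validly attend event j".
static int[][] buildCapacityMatrix(boolean[][] compatible) {
    int n = compatible.length;
    int m = n == 0 ? 0 : compatible[0].length;
    int source = 0, sink = n + m + 1;
    int inf = Integer.MAX_VALUE / 2;                            // stand-in for "infinite" capacity
    int[][] capacity = new int[n + m + 2][n + m + 2];
    for (int i = 1; i <= n; i++) capacity[source][i] = inf;     // s -> p_i
    for (int j = 1; j <= m; j++) capacity[n + j][sink] = 1;     // e_j -> t, capacity 1
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            if (compatible[i - 1][j - 1]) capacity[i][n + j] = inf;  // p_i -> e_j
    return capacity;   // feed this into any max-flow implementation, e.g. Edmonds-Karp
}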
Create a class called AllocatePerson that has a Person and a list of Events as an attribute called lsInnerEvents (you have to define the Person and Event classes first, both with a list of times and days).
In the constructor of AllocatePerson you feed a Person and a list of Events; the constructor cycles through the events and adds to the internal list only the ones that match the parameters of the Person.
The main code creates an AllocatePerson for each Person (one at a time), implementing the following logic:
If the newly created object "objAllocatePerson" has an lsInnerEvents list of size 1, remove the element contained in lsInnerEvents from the list of events to allocate and fire a procedure called MaintainEvents(Events removedEvents), passing the event just allocated (the one inside lsInnerEvents).
The function MaintainEvents cycles through the current array of AllocatePersons and removes the "removedEvents" from their lsInnerEvents; if after that the size of lsInnerEvents is 1, it recursively invokes MaintainEvents() with the newly removed event and removes that event from the main list of events to allocate.
At the end of the execution you will have all the associations simply by cycling through the array of AllocatePersons whose lsInnerEvents size is 1.
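A hedged sketch of that propagation, using plain maps and strings instead of the classes described above (and, like the description, assuming no two people end up forced onto the same event):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

static Map<String, String> resolveForcedAllocations(Map<String, Set<String>> candidates) {
    Map<String, String> allocated = new HashMap<>();
    Deque<String> takenEvents = new ArrayDeque<>();
    // Seed: anyone with exactly one candidate event is allocated immediately.
    for (Map.Entry<String, Set<String>> e : candidates.entrySet()) {
        if (e.getValue().size() == 1) {
            String event = e.getValue().iterator().next();
            allocated.put(e.getKey(), event);
            takenEvents.push(event);
        }
    }
    // "MaintainEvents": removing a taken event may force someone else down to one option.
    while (!takenEvents.isEmpty()) {
        String taken = takenEvents.pop();
        for (Map.Entry<String, Set<String>> e : candidates.entrySet()) {
            if (allocated.containsKey(e.getKey())) continue;
            if (e.getValue().remove(taken) && e.getValue().size() == 1) {
                String onlyOption = e.getValue().iterator().next();
                allocated.put(e.getKey(), onlyOption);
                takenEvents.push(onlyOption);
            }
        }
    }
    return allocated;   // people whose event was forced; the rest still have a free choice
}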
An approach that you can consider is as follows:
Create Java Objects for Persons and Events.
Place all Events in a pool (Java Collection)
Have each person select an event from the pool. Since each person can only select events on specific days, create a subset of events that will be in the pool for that person to select from.
Add the necessary attributes to the events to ensure that each can only be selected once by a person.

Data too big for int indexing

Suppose I'm into Big Data (as in bioinformatics), and I've chosen to analyze it in Java using the wonderful Collections Map-Reduce framework on HPC. How can I work with datasets of more than 2^31 - 1 items? For example,
final List<Gene> genome = getHugeData();
profit.log(genome.parallelStream().collect(magic));
Wrap your data so it consists of many chunks -- once you exceed 2^31 - 1 items you move on to the next one. A sketch:
class Wrapper {
    private List<List<Gene>> chunks;

    Gene get(long id) {
        // Integer.MAX_VALUE is the chunk size: first locate the chunk, then the offset inside it
        int chunkId = (int) (id / Integer.MAX_VALUE);
        int itemId = (int) (id % Integer.MAX_VALUE);
        List<Gene> chunk = chunks.get(chunkId);
        return chunk.get(itemId);
    }
}
In this case you have multiple problems. How big is your data?
The simplest solution is to use another structure, such as a LinkedList (but only if you are interested in sequential access) or a HashMap, which may have a high insertion cost. A LinkedList does not allow any random access whatsoever: if you want to access the 5th element you have to traverse all 4 previous elements first.
Here is another thought:
Let us assume that each gene has an id number (long). You can use an index structure such as a B+-tree and index your data using the tree. The index does not have to be stored on disk; it can stay in memory, and it does not have much overhead either. You can find many implementations of it online.
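Not a B+-tree, but a hedged sketch of the same in-memory index idea using the JDK's sorted map (the long-id scheme is an assumption; Gene is the type from the question):

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class GeneIndex {
    private final NavigableMap<Long, Gene> byId = new TreeMap<>();

    void put(long id, Gene gene) { byId.put(id, gene); }

    Gene get(long id) { return byId.get(id); }        // O(log n) lookup, keys are longs

    Gene nearestAtOrBelow(long id) {                   // range-style access
        Map.Entry<Long, Gene> e = byId.floorEntry(id);
        return e == null ? null : e.getValue();
    }
}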
Another solution would be to create a container class which would contain either other containers or Genes. In order to achieve that, both should implement an interface called e.g. Containable; that way both Gene and Container are Containables. Once a container reaches its maximum size it can be inserted into another container, and so on; you can create multiple levels that way.
I would suggest you look online (e.g. Wikipedia) for the B+-tree if you are not familiar with it.
An array of 2^31 object references would consume about 17 GB of memory...
You should store the data in a database.

Looking for a table-like data structure

I have 2 sets of data.
Let's say one is people, the other is groups.
A person can be in multiple groups while a group can have multiple people.
My operations will basically be CRUD on groups and people,
as well as a method that makes sure a list of people are all in different groups (which gets called a lot).
Right now I'm thinking of making a table of binary 0's and 1's, with the rows representing people and the columns representing groups.
I can perform the method in O(n) time by adding the rows together and comparing the sum with the bitwise OR of the rows.
E.g
Group A B C D
ppl1 1 0 0 1
ppl2 0 1 1 0
ppl3 0 0 1 0
ppl4 0 1 0 0
check (ppl1, ppl2) = (1001 + 0110) == (1001 | 0110)
                   = 1111 == 1111
                   = true
check (ppl2, ppl3) = (0110 + 0010) == (0110 | 0010)
                   = 1000 == 0110
                   = false
I'm wondering if there is a data structure that does something similar already so I don't have to write my own and maintain O(n) runtime.
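For reference, a minimal sketch of that check using java.util.BitSet (one bit per group, in a fixed group order):

import java.util.BitSet;
import java.util.List;

// Each row is one person's group memberships, one bit per group in a fixed group order.
static boolean allInDifferentGroups(List<BitSet> rows) {
    BitSet union = new BitSet();
    int totalMemberships = 0;
    for (BitSet row : rows) {
        totalMemberships += row.cardinality();
        union.or(row);
    }
    // If any group bit was set by more than one person, the union has fewer bits than the total.
    return union.cardinality() == totalMemberships;
}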
I don't know all of the details of your problem, but my gut instinct is that you may be over thinking things here. How many objects are you planning on storing in this data structure? If you have really large amounts of data to store here, I would recommend that you use an actual database instead of a data structure. The type of operations you are describing here are classical examples of things that relational databases are good at. MySQL and PostgreSQL are examples of large scale relational databases that could do this sort of thing in their sleep. If you'd like something lighter-weight SQLite would probably be of interest.
If you do not have large amounts of data that you need to store in this data structure, I'd recommend keeping it simple, and only optimizing it when you are sure that it won't be fast enough for what you need to do. As a first shot, I'd just recommend using java's built in List interface to store your people and a Map to store groups. You could do something like this:
// Use a list to keep track of People
List<Person> myPeople = new ArrayList<Person>();
Person steve = new Person("Steve");
myPeople.add(steve);
myPeople.add(new Person("Bob"));
// Use a Map to track Groups
Map<String, List<Person>> groups = new HashMap<String, List<Person>>();
groups.put("Everybody", myPeople);
groups.put("Developers", Arrays.asList(steve));
// Does a group contain everybody?
groups.get("Everybody").containsAll(myPeople); // returns true
groups.get("Developers").containsAll(myPeople); // returns false
This definitely isn't the fastest option available, but if you do not have a huge number of People to keep track of, you probably won't even notice any performance issues. If you do have some special conditions that would make the speed of using regular Lists and Maps unfeasible, please post them and we can make suggestions based on those.
EDIT:
After reading your comments, it appears that I misread your issue on the first run through. It looks like you're not so much interested in mapping groups to people, but instead mapping people to groups. What you probably want is something more like this:
Map<Person, List<String>> associations = new HashMap<Person, List<String>>();
Person steve = new Person("Steve");
Person ed = new Person("Ed");
associations.put(steve, Arrays.asList("Everybody", "Developers"));
associations.put(ed, Arrays.asList("Everybody"));
// This is the tricky part
boolean sharesGroups = checkForSharedGroups(associations, Arrays.asList(steve, ed));
So how do you implement the checkForSharedGroups method? In your case, since the numbers surrounding this are pretty low, I'd just try out the naive method and go from there.
public boolean checkForSharedGroups(
        Map<Person, List<String>> associations,
        List<Person> peopleToCheck) {
    List<String> groupsThatHaveMembers = new ArrayList<String>();
    for (Person p : peopleToCheck) {
        List<String> groups = associations.get(p);
        for (String s : groups) {
            if (groupsThatHaveMembers.contains(s)) {
                // We've already seen this group, so we can return
                return false;
            } else {
                groupsThatHaveMembers.add(s);
            }
        }
    }
    // If we've made it to this point, nobody shares any groups.
    return true;
}
This method probably doesn't have great performance on large datasets, but it is very easy to understand. Because it's encapsulated in its own method, it should also be easy to update if it turns out you need better performance. If you do need to increase performance, I would look at overriding the equals (and hashCode) methods of Person, which would make lookups in the associations map faster. From there you could also look at a custom type instead of String for groups, also with overridden equals and hashCode methods. This would considerably speed up the contains method used above.
The reason why I'm not too concerned about performance is that the numbers you've mentioned aren't really that big as far as algorithms are concerned. Because this method returns as soon as it finds two matching groups, in the very worst case you will call ArrayList.contains a number of times equal to the number of groups that exist. In the very best case scenario, it only needs to be called twice. Performance will likely only be an issue if you call checkForSharedGroups very, very often, in which case you might be better off finding a way to call it less often instead of optimizing the method itself.
Have you considered a HashTable? If you know all of the keys you'll be using, it's possible to use a Perfect Hash Function which will allow you to achieve constant time.
How about having two separate entities for People and Group. Inside People have a set of Group and vice versa.
class People {
    Set<Group> groups;
    // API for addGroup, getGroup
}
class Group {
    Set<People> people;
    // API for addPeople, getPeople
}
check(People p1, People p2):
1) Call getGroup on both p1 and p2.
2) Check the size of both sets.
3) Iterate over the smaller set, and check whether each group is present in the other set of groups.
Now, you can basically store the People objects in any data structure, preferably a linked list if the size is not fixed, otherwise an array.
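A hedged sketch of that check, assuming a getGroups() accessor on People (the accessor name is an assumption):

import java.util.Set;

static boolean check(People p1, People p2) {
    Set<Group> g1 = p1.getGroups();
    Set<Group> g2 = p2.getGroups();
    // Iterate the smaller set and probe the larger one, as in step 3 above.
    Set<Group> smaller = g1.size() <= g2.size() ? g1 : g2;
    Set<Group> larger = (smaller == g1) ? g2 : g1;
    for (Group g : smaller) {
        if (larger.contains(g)) {
            return false;   // they share at least one group
        }
    }
    return true;            // the two people are in different groups
}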

Java find nearest (or equal) value in collection

I have a class along the lines of:
public class Observation {
private String time;
private double x;
private double y;
//Constructors + Setters + Getters
}
I can choose to store these objects in any type of collection (Standard class or 3rd party like Guava). I have stored some example data in an ArrayList below, but like I said I am open to any other type of collection that will do the trick. So, some example data:
ArrayList<Observation> ol = new ArrayList<Observation>();
ol.add(new Observation("08:01:23",2.87,3.23));
ol.add(new Observation("08:01:27",2.96,3.17));
ol.add(new Observation("08:01:27",2.93,3.20));
ol.add(new Observation("08:01:28",2.93,3.21));
ol.add(new Observation("08:01:30",2.91,3.23));
The example assumes a matching constructor in Observation. The timestamps are stored as String objects as I receive them as such from an external source but I am happy to convert them into something else. I receive the observations in chronological order so I can create and rely on a sorted collection of observations. The timestamps are NOT unique (as can be seen in the example data) so I cannot create a unique key based on time.
Now to the problem. I frequently need to find one (1) observation with a time equal to or nearest to a certain time, e.g. if my time was 08:01:29 I would like to fetch the 4th observation in the example data, and if the time is 08:01:27 I want the 3rd observation.
I can obviously iterate through the collection until I find the time that I am looking for, but I need to do this frequently and at the end of the day I may have millions of observations so I need to find a solution where I can locate the relevant observations in an efficient manner.
I have looked at various collection-types including ones where I can filter the collections with Predicates but I have failed to find a solution that would return one value, as opposed to a subset of the collection that fulfills the "<="-condition. I am essentially looking for the SQL equivalent of SELECT * FROM ol WHERE time <= t LIMIT 1.
I am sure there is a smart and easy way to solve my problem so I am hoping to be enlightened. Thank you in advance.
Try a TreeSet, providing a comparator that compares the time. It maintains an ordered set, and you can call TreeSet.floor(E) to find the greatest element less than or equal to the one you are looking for (you should pass in a dummy Observation with the time you are looking for). You also have headSet and tailSet for ordered subsets.
It has O(log n) time for adding and retrieving. I think it is very suitable for your needs.
If you prefer a Map you can use a TreeMap with similar methods.
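A minimal sketch of that TreeMap variant (it assumes a getTime() getter on Observation and keeps one Observation per exact timestamp, e.g. the last one received):

import java.time.LocalTime;
import java.util.Map;
import java.util.TreeMap;

class ObservationIndex {
    private final TreeMap<LocalTime, Observation> byTime = new TreeMap<>();

    void add(Observation o) {
        byTime.put(LocalTime.parse(o.getTime()), o);   // "08:01:28" parses as an ISO local time
    }

    // The SQL-ish "WHERE time <= t", picking the latest such observation.
    Observation latestAtOrBefore(LocalTime t) {
        Map.Entry<LocalTime, Observation> e = byTime.floorEntry(t);
        return e == null ? null : e.getValue();
    }
}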
Sort your collection (an ArrayList will probably work best here) and use binarySearch, which returns an integer index of either a match or the "closest" possible match, i.e. it returns the...
index of the search key, if it is contained in the list; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the list: the index of the first element greater than the key, or list.size(),
Have the Observation class implement Comparable and use a TreeSet to store the objects, which will keep the elements sorted. TreeSet implements SortedSet, so you can use headSet or tailSet to get a view of the set before or after the element you're searching for. Use the first or last method on the returned set to get the element you're seeking.
If you are stuck with ArrayList, but can keep the elements sorted yourself, use Collections.binarySearch to search for the element. It returns a positive number if the exact element is found, or a negative number that can be used to determine the closest element. http://download.oracle.com/javase/1.4.2/docs/api/java/util/Collections.html#binarySearch(java.util.List,%20java.lang.Object)
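A hedged sketch of decoding that negative return value (a comparator and an already-sorted list are assumed):

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

static Observation latestAtOrBefore(List<Observation> sorted, Observation key,
                                    Comparator<Observation> byTime) {
    int idx = Collections.binarySearch(sorted, key, byTime);
    if (idx < 0) {
        int insertionPoint = -idx - 1;   // index of the first element greater than the key
        idx = insertionPoint - 1;        // so the element just before it is the closest one <= key
    }
    return idx >= 0 ? sorted.get(idx) : null;   // null if every observation is later than the key
}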
If you are lucky enough to be using Java 6, and the performance overhead of keeping a SortedSet is not a big deal for you, take a look at the TreeSet ceiling, floor, higher and lower methods.

How does consistent hashing work?

I am trying to understand how consistent hashing works. This is the article I am trying to follow, but I have not been able to; to start with, my questions are:
I understand that servers are mapped into ranges of hashcodes, so the data distribution is more fixed and lookup becomes easy. But how does this deal with the problem of a new node being added to the cluster?
The sample Java code is not working; any suggestions for a simple Java-based consistent hashing implementation?
Update
Any alternatives to consistent hashing?
For a Python implementation, refer to my GitHub repo.
Simplest Explanation
What is normal hashing ?
Let's say we have to store the following key-value pairs in a distributed memory store like Redis.
Let's say we have a hash function f(id) that takes the above ids and creates hashes from them.
Assume we have 3 servers: s1, s2 and s3.
We can take the hash modulo the number of servers, i.e. 3, to map each key to a server, and we are left with the following.
We could retrieve the value for a key by a simple lookup using f(). Say for key Jackson, f("Jackson") % (no of servers) => 1211 % 3 = 2 (node-2).
This looks perfect... well, close, but no cigar!
But what if a server, say node-1, goes down? Applying the same formula, i.e. f(id) % (no of servers), for user Jackson gives 1211 % 2 = 1,
i.e. we get node-1 when the key actually hashed to node-2 in the table above.
We could do a remapping here, but what if we have a billion keys? In that case we would have to remap a large number of keys, which is tedious :(
This is a major flaw in the traditional hashing technique.
What is Consistent Hashing ?
In consistent hashing, we visualize the list of all nodes on a circular ring (basically a sorted array).
start func
  For each node:
    Find f(node), where f is the hash function
    Append f(node) to a sorted array
  For any key:
    Compute the hash f(key)
    Find the first f(node) > f(key)
    Map the key to that node
end func
For example, if we have to hash the key smith, we compute the hash value 1123, then find the immediate node having a hash value > 1123, i.e. node-3 with hash value 1500.
Now, what if we lose a server? Say we lose node-2: all its keys can be mapped to the next server, node-3 :)
Yes, we only have to remap the keys of node-2.
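A hedged Java sketch of that ring, using a TreeMap as the sorted array (the hash function and class name here are illustrative choices, not from the article):

import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;
import com.google.common.hash.Hashing;

class SimpleHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    private int hash(String s) {
        // mask the sign bit so ring positions are non-negative
        return Hashing.md5().hashString(s, StandardCharsets.UTF_8).asInt() & 0x7fffffff;
    }

    void addNode(String node)    { ring.put(hash(node), node); }

    void removeNode(String node) { ring.remove(hash(node)); }

    String nodeFor(String key) {
        // first node clockwise from the key's hash; wrap around to the start if there is none
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }
}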
I will answer the first part of your question. First of all, there are some errors in that code, so I would look for a better example.
Using a cache server as the example here.
When you think about consistent hashing, you should think of it as a circular ring, as the article you linked to does. When a new server is added, it will have no data on it to start with. When a client fetches data that should be on that server and does not find it, a cache miss occurs. The program should then fill in the data on the new node, so future requests will be cache hits. And that is about it, from a caching point of view.
Overview
I wrote a blog post on how consistent hashing works; here, to answer those original questions, is a quick summary.
Consistent hashing is most commonly used for data partitioning, and we usually see it in components like
Load balancer
API gateway
...
To answer the questions, the below will cover:
How it works
How to add/find/remove server nodes
How to implement it
Any alternatives to consistent hashing?
Let's use a simple example here: the load balancer. The load balancer maps 2 origin nodes (servers behind the load balancer) and the incoming requests onto the same hash ring circle (let's say the hash ring range is [0, 255]).
Initial State
For the server nodes, we have a table:
Find Node
For any incoming request, we apply the same hash function. Assume we get a request with hashcode = 120: from the table, we find the next closest node in clockwise order, so node 2 is the target node in this case.
Similarly, what if we get a request with hashcode = 220? Because the hash ring is a circle, we wrap around and return the first node.
Add Node
Now let's add one more node into the cluster: node 3 (hashcode 150), then our table will be updated to:
Then we use the same algorithm in the Find Node section to find the next nearest node. Say, the request with hashcode = 120, now it will be routed to node-3.
Remove Node
Removal is also straightforward: just remove the entry <node, hashcode> from the table. Let's say we remove node-1; then the table will be updated to:
Then all the requests with:
Hashcodes in [201, 255] and [0, 150] will be routed to node-3
Hashcodes in [151, 200] will be routed to node-2
Implementation
Below is a simple C++ version (with virtual nodes enabled), which is quite similar to what it would look like in Java.
#include <functional>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

#define HASH_RING_SZ 256

struct Node {
  int id_;
  int repl_;
  int hashCode_;
  Node(int id, int replica) : id_(id), repl_(replica) {
    hashCode_ =
        hash<string>{}(to_string(id_) + "_" + to_string(repl_)) % HASH_RING_SZ;
  }
};

class ConsistentHashing {
 private:
  unordered_map<int/*nodeId*/, vector<Node*>> id2node;
  map<int/*hash code*/, Node*> ring;

 public:
  ConsistentHashing() {}

  // Allow dynamically assigning the node replicas
  // O(Rlog(RN)) time
  void addNode(int id, int replica) {
    for (int i = 0; i < replica; i++) {
      Node* repl = new Node(id, i);  // each replica gets its own index, hence its own hash
      id2node[id].push_back(repl);
      ring.insert({repl->hashCode_, repl});
    }
  }

  // O(Rlog(RN)) time
  void removeNode(int id) {
    auto repls = id2node[id];
    if (repls.empty()) return;
    for (auto* repl : repls) {
      ring.erase(repl->hashCode_);
      delete repl;
    }
    id2node.erase(id);
  }

  // Return the nodeId
  // O(log(RN)) time
  int findNode(const string& key) {
    int h = hash<string>{}(key) % HASH_RING_SZ;
    auto it = ring.lower_bound(h);
    if (it == ring.end()) it = ring.begin();  // wrap around the ring
    return it->second->id_;
  }
};
Alternatives
If I understand the question correctly, it is asking for alternatives to consistent hashing for the purpose of data partitioning. There are actually a lot; depending on the actual use case we may choose different approaches, like:
Random
Round-robin
Weighted round-robin
Mod-hashing
Consistent-hash
Weighted(Dynamic) consistent-hashing
Range-based
List-based
Dictionary-based
...
And specifically in the load balancing domain, there are also some approaches like:
Least Connection
Least Response Time
Least Bandwidth
...
All of the above approaches have their own pros and cons; there is no single best solution, we have to make the trade-off accordingly.
Last
The above is just a quick summary of the original questions. For further reading on topics like:
Consistent-hashing unbalancing issue
Snow crash issue
Virtual node concept
How to tweak the replica number
I've covered them in the blog already, below are the shortcuts:
Blog - Data partitioning: Consistent-hashing
Youtube video - Consistent-hashing replica tweaking
Wikipedia - Consistent-hashing
Have fun :)
I'll answer the first part of your question.
The question that arises is: how does consistent hashing actually work?
As we know, in a client-server model there is a load balancer that routes requests to different servers depending on the network traffic to the servers.
So, the purpose of hashing is to assign a number to every requesting client and take it modulo the number of servers we have. The remainder tells us which particular server the request is assigned to.
In the consistent hashing strategy, a hash function is used to position clients and servers on a circular path. Moving in the clockwise direction, a client's request is accepted by the first server that comes along the path.
What if one of our servers dies?
Earlier, with the simple hashing strategy, we would need to redo the calculation and route requests according to the new remainders, and we would lose most of our cache hits.
With the consistent hashing strategy, if any server dies, the requests of its clients move to the next server along the path in the same clockwise direction. That means the other servers are not affected, and cache hits and consistency are maintained.
You say that...
I understand that servers are mapped into ranges of hashcodes, so the data distribution is more fixed
... but that is not how consistent hashing works.
In fact, the opposite: consistent hashing's physical_node:virtual_node mapping is dynamically random while still being "evenly" distributed.
I answer in detail here how this randomness is implemented.
Give that a read, and make sure that you understand how it all fits together. Once you have the mental model, the Java blog article you linked to should conceptually make much more sense:
It would be nice if, when a cache machine was added, it took its fair share of objects from all the other cache machines. Equally, when a cache machine was removed, it would be nice if its objects were shared between the remaining machines. This is exactly what consistent hashing does
