I am trying to understand how consistent hashing works. This is the article I am trying to follow, but I am not able to follow it. To start with, my questions are:
I understand that servers are mapped into ranges of hash codes and that the data distribution is more or less fixed, so lookup becomes easy. But how does this deal with the problem of a new node being added to the cluster?
The sample Java code is not working. Any suggestion for a simple Java-based consistent hashing implementation?
Update
Any alternatives to consistent hashing?
For a Python implementation, refer to my GitHub repo.
Simplest Explanation
What is normal hashing ?
Let's say we have to store the following key-value pairs in a distributed memory store like Redis.
Let's say we have a hash function f(id) that takes the above ids and creates hashes from them.
Assume we have 3 servers - s1, s2 and s3.
We can take the hash modulo the number of servers, i.e. 3, to map each key to a server, and we are left with the following.
We can retrieve the value for a key by a simple lookup using f(). Say for the key Jackson, f("Jackson") % (number of servers) => 1211 % 3 = 2 (node-2).
This looks perfect - well, close, but no cigar!
But what if a server, say node-1, goes down? Applying the same formula, i.e. f(id) % (number of servers), for the user Jackson gives 1211 % 2 = 1,
i.e. we get node-1, while the key was actually hashed to node-2 in the table above.
We could remap the keys here, but what if we have a billion keys? In that case we would have to remap a huge number of keys, which is tedious :(
This is a major flaw in the traditional hashing technique.
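To make the remapping problem concrete, here is a minimal Java sketch of mod-based hashing. The key names are just examples, and String.hashCode() stands in for the f(id) above; it is only an illustration, not anyone's production code. Note how going from 3 servers to 2 changes the target index for most keys.

import java.util.Arrays;
import java.util.List;

public class ModHashingDemo {
    // Map a key to a server index using plain modulo hashing.
    static int serverFor(String key, int serverCount) {
        return Math.floorMod(key.hashCode(), serverCount);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Jackson", "Smith", "John", "Doe");

        for (String key : keys) {
            int before = serverFor(key, 3); // 3 servers: s1, s2, s3
            int after  = serverFor(key, 2); // one server removed
            System.out.printf("%s: server %d -> server %d%n", key, before, after);
        }
        // Most keys land on a different server index after the change,
        // which is exactly the mass-remapping problem described above.
    }
}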
What is Consistent Hashing ?
In consistent hashing, we visualize the list of all nodes on a circular ring (basically, a sorted array of their hash values).
start func
  For each node:
    Compute f(node), where f is the hash function
    Insert f(node) into a sorted array
  For any key:
    Compute the hash f(key)
    Find the first f(node) > f(key); if none exists, wrap around to the first node
    Map the key to that node
end func
For example, if we have to hash the key smith, we compute its hash value 1123, then find the first node with a hash value > 1123, i.e. node-3 with hash value 1500.
Now, what if we lose a server, say node-2? All of its keys can simply be mapped to the next server, node-3 :)
Yes, we only have to remap the keys of node-2.
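Since the original question also asked for a simple Java-based consistent hashing example, here is a minimal sketch of the ring above using a TreeMap as the sorted array. The MD5-based hash function, the node names and the absence of virtual nodes are simplifying assumptions for illustration, not a production design.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

public class SimpleConsistentHash {
    // Sorted ring: hash of node -> node name
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Illustrative hash function: first 8 bytes of MD5, taken as a non-negative long.
    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return h & Long.MAX_VALUE;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public void addNode(String node)    { ring.put(hash(node), node); }
    public void removeNode(String node) { ring.remove(hash(node)); }

    // Find the first node at or after hash(key) on the ring; wrap around if none.
    public String nodeFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no nodes");
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null) ? e.getValue() : ring.firstEntry().getValue();
    }

    public static void main(String[] args) {
        SimpleConsistentHash ch = new SimpleConsistentHash();
        ch.addNode("node-1");
        ch.addNode("node-2");
        ch.addNode("node-3");
        System.out.println("smith -> " + ch.nodeFor("smith"));
        ch.removeNode("node-2"); // only keys that were on node-2 move
        System.out.println("smith -> " + ch.nodeFor("smith"));
    }
}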
I will answer the first part of your question. First of all, there are some errors in that code, so I would look for a better example.
Using a cache server as the example here.
When you think about consistent hashing, you should think of it as a circular ring, as the article you linked to does. When a new server is added, it will have no data on it to start with. When a client fetches data that should be on that server and does not find it, a cache-miss will occur. The program should then fill in the data on the new node, so future requests will be cache-hits. And that is about it, from a caching point of view.
Overview
I wrote a blog post about how consistent hashing works; here is a quick summary to answer the original questions.
Consistent hashing is most commonly used for data partitioning, and we usually see it in components like:
Load balancer
API gateway
...
To answer the questions, the sections below cover:
How it works
How to add/find/remove server nodes
How to implement it
Any alternatives to consistent hashing?
Let's use the load balancer as a simple example here: the load balancer maps 2 origin nodes (the servers behind the load balancer) and the incoming requests onto the same hash ring (let's say the hash ring range is [0, 255]).
Initial State
For the server nodes, we have a table:
Find Node
For any incoming request, we apply the same hash function. Assume we get a request whose hashcode = 120. From the table, we find the next closest node in clockwise order, so node-2 is the target node in this case.
Similarly, what if we get a request with hashcode = 220? Because the hash ring is a circle, we wrap around and return the first node.
Add Node
Now let's add one more node into the cluster: node 3 (hashcode 150), then our table will be updated to:
Then we use the same algorithm from the Find Node section to find the next nearest node. The request with hashcode = 120 will now be routed to node-3.
Remove Node
Removal is also straightforward: just remove the entry <node, hashcode> from the table. Let's say we remove node-1; then the table will be updated to:
Then all the requests with:
Hashcode in [201, 255] and [0, 150] will be routed to the node-3
Hashcode in [151, 200] will be routed to node-2
Implementation
Below is a simple C++ version (with virtual nodes enabled), which is quite similar to what it would look like in Java.
#include <functional>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

#define HASH_RING_SZ 256

struct Node {
  int id_;
  int repl_;
  int hashCode_;
  Node(int id, int replica) : id_(id), repl_(replica) {
    hashCode_ =
      hash<string>{}(to_string(id_) + "_" + to_string(repl_)) % HASH_RING_SZ;
  }
};

class ConsistentHashing {
private:
  unordered_map<int/*nodeId*/, vector<Node*>> id2node;
  map<int/*hash code*/, Node*> ring;

public:
  ConsistentHashing() {}

  // Allow dynamically assigning the node replicas
  // O(Rlog(RN)) time
  void addNode(int id, int replica) {
    for (int i = 0; i < replica; i++) {
      Node* repl = new Node(id, i);  // use the replica index so each virtual node gets its own hash
      id2node[id].push_back(repl);
      ring.insert({repl->hashCode_, repl});
    }
  }

  // O(Rlog(RN)) time
  void removeNode(int id) {
    auto repls = id2node[id];
    if (repls.empty()) return;
    for (auto* repl : repls) {
      ring.erase(repl->hashCode_);
      delete repl;
    }
    id2node.erase(id);
  }

  // Return the nodeId
  // O(log(RN)) time
  int findNode(const string& key) {
    int h = hash<string>{}(key) % HASH_RING_SZ;
    auto it = ring.lower_bound(h);
    if (it == ring.end()) it = ring.begin();  // wrap around the ring
    return it->second->id_;
  }
};
Alternatives
If I understand the question correctly, it is asking about alternatives to consistent hashing for data partitioning. There are a lot, actually; depending on the actual use case we may choose different approaches, like:
Random
Round-robin
Weighted round-robin
Mod-hashing
Consistent-hash
Weighted(Dynamic) consistent-hashing
Range-based
List-based
Dictionary-based
...
And specifically in the load balancing domain, there are also some approaches like:
Least Connection
Least Response Time
Least Bandwidth
...
All the above approaches have their own pros and cons; there is no single best solution, so we have to make the trade-off accordingly.
Last
The above is just a quick summary for the original questions. For further reading on topics like:
Consistent-hashing unbalancing issue
Snow crash issue
Virtual node concept
How to tweak the replica number
I've covered them in the blog already, below are the shortcuts:
Blog - Data partitioning: Consistent-hashing
Youtube video - Consistent-hashing replica tweaking
Wikipedia - Consistent-hashing
Have fun :)
I'll answer the first part of your question.
The question that arises is: how does consistent hashing actually work?
As we know, in a client-server model there is a load balancer that routes requests to different servers depending on the network traffic to the servers.
So the purpose of hashing is to assign a number to each client that makes a request and take it modulo the number of servers we have. The remainder we get after the modulo tells us which server to assign the request to.
The consistent hashing strategy uses a hashing function to position clients and servers on a circular path. A client's request is routed, moving in the clockwise direction, to the first server it reaches on that path.
What if one of our servers dies?
Earlier, with the simple hashing strategy, we would need to redo the calculation and route requests according to the new remainders, and we would lose most of our cache hits.
With the consistent hashing strategy, if a server dies, a client's request simply moves on to the next server on the path, in the same clockwise direction. That means the other servers are not affected, and cache hits and consistency are maintained.
You say that...
I understand, servers are mapped into ranges of hashcodes and the data distribution is more fixed
... but that is not how consistent hashing works.
In fact, the opposite: consistent hashing's physical_node:virtual_node mapping is dynamically random while still being "evenly" distributed.
I answer in detail here how this randomness is implemented.
Give that a read, and make sure that you understand how it all fits together. Once you have the mental model, the Java blog article you linked to should conceptually make much more sense:
It would be nice if, when a cache machine was added, it took its fair share of objects from all the other cache machines. Equally, when a cache machine was removed, it would be nice if its objects were shared between the remaining machines. This is exactly what consistent hashing does
Related
My dataset is made up of data points which are 5000-element arrays (of Doubles) and each data point has a clusterId assigned to it.
For the purposes of the problem I am solving, I need to aggregate those arrays (element-wise) per clusterId and then do a dot product calculation between each data point and its respective aggregate cluster array.
The total number of data points I am dealing with is 4.8 million, and they are split across ~50k clusters.
I use 'reduceByKey' to get the aggregated arrays per clusterId (which is my key) - using this dataset, I have two distinct options:
join the aggregate (clusterId, aggregateVector) pairs to the original dataset - so that each aggregateVector is available to each partition
collect the rdd of (clusterId, aggregateVector) locally and serialize it back to my executors - again, so that I can make the aggregateVectors available to each partition
My understanding is that joins cause re-partitioning based on the join key, so in my case, the unique values of my key are ~50k, which will be quite slow.
What I tried is the 2nd approach - I managed to collect the RDD locally and turn it into a Map with clusterId as the key and a 5000-element Array[Double] as the value.
However, when I try to broadcast/serialize this variable into a Closure, I am getting a ''java.lang.OutOfMemoryError: Requested array size exceeds VM limit''.
My question is - given the nature of my problem where I need to make aggregate data available to each executor, what is the best way to approach this, given that the aggregate dataset (in my case 50k x 5000) could be quite large?
Thanks
I highly recommend the join. 5000 values x 50,000 elements x 8 bytes per value is already 2 GB, which is manageable, but it's definitely in the "seriously slow things down, and maybe break some stuff" ballpark.
You are right that repartitioning can sometimes be slow, but I think you are more concerned about it than necessary. It's still an entirely parallel/distributed operation, which makes it essentially infinitely scalable. Collecting things into the driver is not.
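To make the recommended join concrete, here is a minimal sketch using Spark's Java API (the question appears to use the Scala API, but the shape is the same). The class name, toy data and the local[*] master are assumptions for illustration: it sums the vectors per clusterId, joins each point with its aggregate, and computes the dot product.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class JoinSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("join-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Toy data standing in for the real (clusterId, vector) dataset
            JavaPairRDD<Integer, double[]> points = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>(1, new double[]{1.0, 2.0}),
                    new Tuple2<>(1, new double[]{3.0, 4.0}),
                    new Tuple2<>(2, new double[]{5.0, 6.0})));

            // Element-wise sum per clusterId
            JavaPairRDD<Integer, double[]> aggregates = points.reduceByKey((a, b) -> {
                double[] sum = new double[a.length];
                for (int i = 0; i < a.length; i++) sum[i] = a[i] + b[i];
                return sum;
            });

            // Join each point with its cluster aggregate and compute the dot product
            JavaPairRDD<Integer, Double> dots = points.join(aggregates).mapValues(pair -> {
                double dot = 0.0;
                for (int i = 0; i < pair._1().length; i++) dot += pair._1()[i] * pair._2()[i];
                return dot;
            });

            dots.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        }
    }
}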
I understand that hash tables are designed to have easy sorting and retrieval of data when storing massive amounts of them. However, when retrieving a specific piece of data, how do they retrieve it if they were stored in an alternative location due to collision?
Say there are 10 indexes and data A was stored in index 3 and data E runs into collision because data A is stored in index 3 already and collision prevention puts it in index 7 instead. When it comes time to retrieve data E, how does it retrieve E instead of using the first hash function and retrieving A instead?
Sorry if this is dumb question. I'm still somewhat new to programming.
I don't believe that Java will resolve a hashing collision by moving an item to a different bucket. Doing that would make it difficult if not impossible to determine the correct bucket into which it should have been hashed. If you read this SO article carefully, you will note that it points out two tools which Java has at its disposal. First, it maintains a list of values for each bucket* (read note below). Second, if the list becomes too large it can increase the number of buckets.
I believe that the list has now been replaced with a tree. This ensures O(lg n) performance for lookup, insertion, etc. within a bucket in the worst case, whereas with a list the worst-case performance was O(n).
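To illustrate the separate-chaining idea described above, here is a minimal, simplified Java sketch. It is not the real java.util.HashMap implementation, just an illustration of the mechanism: colliding keys share a bucket, and lookup walks the bucket comparing the stored keys with equals().

import java.util.ArrayList;
import java.util.List;

class ChainedTable<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    // Each bucket holds a list of (key, value) entries.
    private final List<List<Entry<K, V>>> buckets = new ArrayList<>();

    ChainedTable(int capacity) {
        for (int i = 0; i < capacity; i++) buckets.add(new ArrayList<>());
    }

    private int indexFor(K key) {
        return Math.floorMod(key.hashCode(), buckets.size());
    }

    void put(K key, V value) {
        List<Entry<K, V>> bucket = buckets.get(indexFor(key));
        for (Entry<K, V> e : bucket) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        bucket.add(new Entry<>(key, value));
    }

    V get(K key) {
        for (Entry<K, V> e : buckets.get(indexFor(key))) {
            if (e.key.equals(key)) return e.value; // colliding keys are disambiguated here
        }
        return null;
    }
}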
I am implementing a crawler and I want to generate a unique hash code for every URL crawled by my system. This will help me check for duplicate URLs, since matching complete URLs can be expensive. The crawler will crawl millions of pages daily, so the output of this hash function should be unique.
Unless you know every address ahead of time, and there happens to be a perfect hash for said set of addresses, this task is theoretically impossible.
By the pigeonhole principle, there must exist at least two strings that have the same Integer value no matter what technique you use for conversion, considering that Integers have a finite range, and strings do not. While addresses, in reality, are not infinitely long, you're still going to get multiple addresses that map to the same hash value. In theory, there are infinitely many strings that will map to the same Integer value.
So, in conclusion, you should probably just use a standard HashMap.
Additionally, you need to worry about the following:
www.stackoverflow.com
http://www.stackoverflow.com
http://stackoverflow.com
stackoverflow.com
...
which are all equivalent, so you would need to normalize first, then hash. While there are some algorithms that given the set first will generate a perfect hash, I doubt that that is necessary for your purposes.
I think the solution is to normalize URLs first, by removing parts like http:// or http://www. from the beginning, and parts like /, ?... or #... from the end.
After this cleaning, you should have a clean domain URL, and you can do a hash for it.
But the best solution is to use a Bloom filter (a probabilistic data structure), which can tell you whether a URL has probably been visited or has definitely not been visited.
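Here is a rough Java sketch of that idea, combining a very naive normalization step with Guava's BloomFilter. The normalization rules, the expected insertion count and the false-positive rate are illustrative assumptions, not a complete URL canonicalization.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class SeenUrls {
    // Expected number of URLs and target false-positive rate are assumptions.
    private final BloomFilter<CharSequence> seen =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.001);

    // Very rough normalization: drop scheme, "www." prefix, query, fragment and trailing slash.
    static String normalize(String url) {
        URI uri = URI.create(url.contains("://") ? url : "http://" + url);
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase();
        if (host.startsWith("www.")) host = host.substring(4);
        String path = uri.getPath() == null ? "" : uri.getPath();
        if (path.endsWith("/")) path = path.substring(0, path.length() - 1);
        return host + path;
    }

    /** Returns true if the URL was definitely not seen before (and records it). */
    public boolean markIfNew(String url) {
        String key = normalize(url);
        if (seen.mightContain(key)) return false; // probably a duplicate
        seen.put(key);
        return true;
    }
}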
At my job I was to develop and implement a solution for the following problem:
Given a dataset of 30M records extract (key, value) tuples from the particular dataset field, group them by key and value storing the number of same values for each key. Write top 5000 most frequent values for each key to a database. Each dataset row contains up to 100 (key, value) tuples in a form of serialized XML.
I came up with the solution like this (using Spring-Batch):
Batch job steps:
Step 1. Iterate over the dataset rows and extract (key, value) tuples. Upon getting some fixed number of tuples, dump them to disk. Each tuple goes to a file with the name pattern '<key>/chunk-<N>', so all values for a given key are stored in one directory. Within one file, values are stored sorted.
Step 2. Iterate over all '<key>' directories and merge their chunk files into one, grouping the same values. Since the values are stored sorted, it is trivial to merge them with O(n * log k) complexity, where 'n' is the number of values in a chunk file and 'k' is the initial number of chunks.
Step 3. For each merged file (in other words for each key) sequentially read its values using PriorityQueue to maintain top 5000 values without loading all the values into memory. Write queue content to the database.
I spent about a week on this task, mainly because I had not worked with Spring-Batch previously and because I tried to put emphasis on scalability, which requires a careful implementation of the multi-threading part.
The problem is that my manager considers this task way too easy to spend that much time on.
So the question is - do you know of a more efficient solution, or maybe a less efficient one that would be easier to implement? And how much time would you need to implement my solution?
I am aware about MapReduce-like frameworks, but I can't use them because the application is supposed to be run on a simple PC with 3 cores and 1GB for Java heap.
Thank you in advance!
UPD: I think I did not state my question clearly. Let me ask it another way:
Given the problem, and being the project manager or at least the task reviewer, would you accept my solution? And how much time would you dedicate to this task?
Are you sure this approach is faster than doing a pre-scan of the XML file to extract all keys, and then parsing the XML file over and over for each key? You are doing a lot of file management tasks in this solution, which is definitely not free.
As you have three cores, you could parse three keys at the same time (as long as the file system can handle the load).
Your solution seems reasonable and efficient; however, I'd probably use SQL.
While parsing the Key/Value pairs I'd insert/update into a SQL table.
I'd then query the table for the top records.
Here's an example using only T-SQL (SQL 2008, but the concept should be workable in almost any modern RDBMS).
The SQL between /* START */ and /* END */ would be the statements you need to execute in your code.
BEGIN
-- database table
DECLARE @tbl TABLE (
    k INT -- key
    , v INT -- value
    , c INT -- count
    , UNIQUE CLUSTERED (k, v)
)
-- insertion loop (for testing)
DECLARE @x INT
SET @x = 0
SET NOCOUNT OFF
WHILE (@x < 1000000)
BEGIN
    --
    SET @x = @x + 1
    DECLARE @k INT
    DECLARE @v INT
    SET @k = CAST(RAND() * 10 as INT)
    SET @v = CAST(RAND() * 100 as INT)
    -- the INSERT / UPDATE code
    /* START this is the sql you'd run for each row */
    UPDATE @tbl SET c = c + 1 WHERE k = @k AND v = @v
    IF @@ROWCOUNT = 0
        INSERT INTO @tbl VALUES (@k, @v, 1)
    /* END */
    --
END
SET NOCOUNT ON
-- final select
DECLARE @topN INT
SET @topN = 50
/* START this is the sql you'd run once at the end */
SELECT
    a.k
    , a.v
FROM (
    SELECT
        ROW_NUMBER() OVER (PARTITION BY k ORDER BY k ASC, c DESC) [rid]
        , k
        , v
    FROM @tbl
) a
WHERE a.rid < @topN
/* END */
END
Gee, it doesn't seem like much work to try the old fashioned way of just doing it in-memory.
I would try just doing it first; then, if you run out of memory, try one key per run (as per @Storstamp's answer).
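For what it's worth, here is a minimal Java sketch of that in-memory approach (the class and method names are made up for illustration, and it assumes the counts fit in the heap): count the (key, value) pairs in nested maps, then keep the top N values per key with a small priority queue.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopValues {
    private final Map<String, Map<String, Long>> counts = new HashMap<>();

    void add(String key, String value) {
        counts.computeIfAbsent(key, k -> new HashMap<>())
              .merge(value, 1L, Long::sum);
    }

    List<String> topN(String key, int n) {
        Map<String, Long> byValue = counts.getOrDefault(key, Collections.emptyMap());
        // Min-heap on count: keep only the n most frequent values.
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Long> e : byValue.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll();
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result); // most frequent first
        return result;
    }
}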
If using the "simple" solution is not an option due to the size of the data, my next choice would be to use an SQL database. However, as most of these require quite a lot of memory (and slow to a crawl when RAM is heavily overloaded), maybe you should redirect your search towards something like a NoSQL database such as MongoDB, which can be quite efficient even when mostly disk-based. (Which your environment basically requires, having only 1GB of heap available.)
The NoSQL database will do all the basic bookkeeping for you (storing the data, keeping track of indexes, sorting), and will probably do it a bit more efficiently than your solution, because all the data may be sorted and indexed already when inserted, removing the extra steps of sorting the lines in the chunk files, merging them, etc.
You will end up with a solution that is probably much easier to administer, and it will also allow you to set up different kinds of queries, instead of being optimized only for this specific case.
As a project manager I would not oppose your current solution. It is already fast and solves the problem. As an architect, however, I would object, due to the solution being a bit hard to maintain and not using proven technologies that basically do partially the same thing as what you have coded on your own. It is hard to beat the tree and hash implementations of modern databases.
I'm looking into using a consistent hash algorithm in some java code I'm writing. The guava Hashing library has a consistentHash(HashCode, int) method, but the documentation is rather lacking. My initial hope was that I could just use consistentHash() for simple session affinity to efficiently distribute load across a set of backend servers.
Does anyone have a real-world example of how to use this method? In particular I'm concerned with managing the removal of a bucket from the target range.
For example:
@Test
public void testConsistentHash() {
    List<String> servers = Lists.newArrayList("server1", "server2", "server3", "server4", "server5");
    int bucket = Hashing.consistentHash(Hashing.md5().hashString("someId"), servers.size());
    System.out.println("First time routed to: " + servers.get(bucket));

    // one of the back end servers is removed from the (middle of the) pool
    servers.remove(1);

    bucket = Hashing.consistentHash(Hashing.md5().hashString("blah"), servers.size());
    System.out.println("Second time routed to: " + servers.get(bucket));
}
Leads to the output:
First time routed to: server4
Second time routed to: server5
What I want is for that identifier ("someId") to map to the same server after removal of a server earlier in the list. So in the sample above, after removal I guess I'd want bucket 0 to map to "server1", bucket 1 to map to "server3", bucket 2 to map to "server4" and bucket 3 to map to "server5".
Am I supposed to maintain a separate (more complicated than a list) data structure to manage bucket removal and addition? I guess I had envisioned perhaps a more complicated Hashing API that would manage the remapping after adding and removal of particular buckets for me.
Note: I know the sample code is using a small input and bucket set. I tried this with 1000s of input across 100 buckets and the result is the same. Inputs that map to buckets 0-98 stay the same when I change the buckets to 99 and bucket 99 gets distributed across the remaining 99 buckets.
I'm afraid that no data structure can do it really right with the current consistentHash. As the method accepts only the list size, nothing but appending to and removing from the end can be supported. Currently, the best solution probably consists of replacing
servers.remove(n)
by
servers.set(n, servers.get(servers.size() - 1));
servers.remove(servers.size() - 1);
This way you sort of swap the failed server with the very last server. This looks bad, as it makes the assignments to the two swapped servers wrong. The problem is only half as bad, since one of them has failed anyway. And it makes sense: after the subsequent removal of the last list element, everything is fine, except for the assignments to the failed server and to the previously last server.
So twice as many assignments as needed change. Not optimal, but hopefully usable?
I don't think there's a good way to do this at the moment. consistentHash in its current form is useful only in simple cases -- basically, where you have a knob to increase or decrease the number of servers... but always by adding and removing at the end.
There's some work underway to add a class like this:
public final class WeightedConsistentHash<B, I> {
  /** Initially, all buckets have weight zero. */
  public static <B, I> WeightedConsistentHash<B, I> create(
      Funnel<B> bucketFunnel, Funnel<I> inputFunnel);

  /**
   * Sets the weight of bucket "bucketId" to "weight".
   * Requires "weight" >= 0.0.
   */
  public void setBucketWeight(B bucketId, double weight);

  /**
   * Returns the bucket id that "input" maps to.
   * Requires that at least one bucket has a non-zero weight.
   */
  public B hash(I input);
}
Then you would write:
WeightedConsistentHash<String, String> serverChooser =
WeightedConsistentHash.create(stringFunnel(), stringFunnel());
serverChooser.setBucketWeight("server1", 1);
serverChooser.setBucketWeight("server2", 1);
// etc.
System.out.println("First time routed to: " + serverChooser.hash("someId"));
// one of the back end servers is removed from the (middle of the) pool
serverChooser.setBucketWeight("server2", 0);
System.out.println("Second time routed to: " + serverChooser.hash("someId"));
And you should get the same server each time. Does that API look suitable?
The guava API does not have any knowledge of your server list. It can only guarantee this:
int bucket1 = Hashing.consistentHash(Hashing.md5().hashString("server1"), N);
int bucket2 = Hashing.consistentHash(Hashing.md5().hashString("server1"), N - 1);
assertThat(bucket1, is(equalTo(bucket2))); // guaranteed to hold whenever bucket1 != N - 1
You need to manage the mapping from buckets to your server list yourself.
The answer proposed in the question is the correct one:
Am I supposed to maintain a separate (more complicated than a list) data structure to manage bucket removal and addition?
Guava is hashing into a ring with ordinal numbers. The mapping from those ordinal numbers to the server ids has to be maintained externally:
Given N servers initially, one can choose some arbitrary mapping from each ordinal number 0..N-1 to server ids A..K (0->A, 1->B, .., N-1->K). A reverse mapping from the server id to its ordinal number is also required (A->0, B->1, ..).
On the deletion of a server, the ordinal number space shrinks by one. All the ordinal numbers starting with the one for the deleted server need to be remapped to the next server (shifted by one).
So, for example, after the initial mapping, say server C (corresponding to ordinal number 2) got deleted. The new mappings become: (0->A, 1->B, 2->D, 3->E, .., N-2->K).
On the addition of a server L (say going from N to N+1 servers), a new mapping can be added from N->L.
What we are doing here is mimicking how nodes would move in a ring as they are added and deleted. While the ordering of the nodes remains the same, their ordinal numbers (on which Guava operates) can change as nodes come and go.
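As a rough illustration of the external mapping this answer describes, one way to wire it to Guava's consistentHash might look like the sketch below. The class and method names are made up, and the choice of murmur3_128 as the key hash is an assumption; it is only a sketch of the bookkeeping, not a vetted implementation.

import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class OrdinalServerMapping {
    // Index in this list is the ordinal number that Guava's bucket maps to.
    private final List<String> servers = new ArrayList<>();
    private final HashFunction hf = Hashing.murmur3_128();

    void addServer(String serverId) {     // new server gets ordinal N
        servers.add(serverId);
    }

    void removeServer(String serverId) {  // ordinals after it shift down by one
        servers.remove(serverId);
    }

    String serverFor(String key) {
        int bucket = Hashing.consistentHash(
                hf.hashString(key, StandardCharsets.UTF_8), servers.size());
        return servers.get(bucket);
    }
}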