Does Apache Helix support partition split and merge? - java

I understand that Apache Helix allows dynamic cluster expansion/shrinkage (e.g., adding/failing/removing physical nodes). However, in the case where a single physical node cannot handle a single partition replica, I need to split a partition into two. I understand that we should pre-estimate the workload so we can set up a sufficient number of partitions up-front. However, as traffic grows unpredictably, it is almost impossible to make such a pre-estimation. Can anyone tell me whether Helix supports re-partitioning out of the box? If I need to customize it to add re-partitioning functionality, how large is the effort, and how would it be done in principle? I want to have a quick estimate. Thanks.

Helix does not support partition splitting/merging out of the box. We could not come up with a generic way to support this without understanding the underlying system.
Having said that, it's possible to build a custom solution using the primitives provided by Helix. If you can provide additional information about your system, I might be able to suggest something.

I would suggest starting with a high number of (logical) partitions and assigning multiple partitions to each node. When the service needs more resources, add nodes and move some partitions from the existing nodes to the new ones.
For instance, if you start with 50 nodes, you could split your workload's key space into 50,000 logical partitions and assign 1,000 partitions to each node. Then, when increasing to, say, 75 nodes, redistribute the partitions so that each node holds roughly 667.
Depending on the actual scenario, you might want to minimize the number of partitions that get reallocated, for example by using a consistent hashing algorithm.
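
Here is a minimal sketch of that idea (the partition count, key type, and modulo assignment are illustrative assumptions, not anything Helix prescribes; a consistent-hashing ring would reduce how many partitions move when nodes are added):

import java.util.List;

public class LogicalPartitioning {
    // Chosen up-front and never changed; only the partition -> node assignment evolves.
    static final int NUM_LOGICAL_PARTITIONS = 50_000;

    // Stable key -> logical partition mapping.
    static int partitionOf(String key) {
        return Math.floorMod(key.hashCode(), NUM_LOGICAL_PARTITIONS);
    }

    // Naive modulo assignment of logical partitions to nodes; swapping this for a
    // consistent-hashing ring minimizes the partitions that move when `nodes` changes.
    static String nodeFor(int partition, List<String> nodes) {
        return nodes.get(partition % nodes.size());
    }
}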

Related

Fast indexing using multiple ES nodes?

All I have read and understood about running multiple ES nodes is that it enables index replication and scaling. I was wondering whether it could also help us make indexing faster for a large number of files. I have two questions:
Question 1: Would it be accurate to think that using multiple ES nodes would allow us to index multiple times faster?
Question 2: What effect does it have on indexing if I keep all nodes enabled as data nodes? On the other hand, what effect does it have if I have a few non-data nodes (e.g. one dedicated master and one dedicated client node) alongside a few data nodes? Which is better in terms of speed and scaling?
Answer 1: No.
The speed of indexing will in fact decrease if you enable replication (though it may increase search performance). You can look at this question for improving indexing performance.
Answer 2: It depends (with no replicas, the behavior is the same).
During indexing the data goes only to the data nodes. The cluster state contains information about which nodes are data nodes, and requests are routed accordingly. The only performance impact comes from the node receiving the request having to reroute/forward it to the data nodes.
If you are adding machines without increasing the number of replicas you will get a better performance during indexing. It is not surprising since you are adding more resources while the amount of work to be done remains pretty much the same.
In our environment we are using 20 nodes in production and 5-10 nodes in debug. Both environments hold the same volume of data. Since ES update speed is our primary bottleneck (we are using Groovy scripts to merge new documents into existing documents), we see much better performance in our production environment than in the others.
You already got some useful links in other answers to your question. I can add that in our case the three most significant factors in improving data upload were: increasing the refresh_interval (i.e. refreshing less often), increasing the merge_factor, and using the Elasticsearch-Hadoop plugin (we upload the data from Spark), which handles the major data-transfer optimizations at the application level.
Each of those steps has its own disadvantages, so read the manuals carefully before changing the configuration.
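
As an illustration of the refresh_interval point, here is a minimal sketch that temporarily disables refresh on an index before a bulk load via the plain HTTP settings API (the host and index name are assumptions; restore a sensible value once the load finishes):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RefreshIntervalTuning {
    public static void main(String[] args) throws Exception {
        // -1 disables refresh entirely while we bulk index; set it back afterwards.
        String body = "{\"index\":{\"refresh_interval\":\"-1\"}}";
        URL url = new URL("http://localhost:9200/my_index/_settings");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}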

How to get the logical data usage within a Cassandra cluster

We could look at the physical bytes on disk, but that number includes all the replicas. So I am wondering whether there is a good approach to get the logical data usage (the real, meaningful data size) within the cluster without iterating over all the nodes? Thanks.
Unfortunately, no. Even nodetool cfstats, which shows the live sstable size, includes the replicas. You can iterate over all the nodes, add the sizes up, and divide by the number of replicas to get a rough estimate.
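
A trivial sketch of that estimate (the per-node inputs would come from nodetool cfstats; it stays rough because compaction lag, tombstones, and imperfect replica placement all skew the numbers):

public class LogicalSizeEstimate {
    // Sum the live sstable bytes reported by each node, then divide by the replication factor.
    static long estimateLogicalBytes(long[] liveBytesPerNode, int replicationFactor) {
        long total = 0;
        for (long b : liveBytesPerNode) {
            total += b;
        }
        return total / replicationFactor; // rough estimate only
    }
}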

In Hadoop Map-Reduce, does any class see the whole list of keys after sorting and before partitioning?

I am using Hadoop to analyze a very uneven distribution of data. Some keys have thousands of values, but most have only one. For example, network traffic associated with IP addresses would have many packets associated with a few talkative IPs and just a few with most IPs. Another way of saying this is that the Gini index is very high.
To process this efficiently, each reducer should either get a few high-volume keys or a lot of low-volume keys, in such a way as to get a roughly even load. I know how I would do this if I were writing the partition process: I would take the sorted list of keys (including all duplicate keys) that was produced by the mappers as well as the number of reducers N and put splits at
split[i] = keys[floor(i*len(keys)/N)]
Reducer i would get keys k such that split[i] <= k < split[i+1] for 0 <= i < N-1 and split[i] <= k for i == N-1.
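
For illustration, a literal translation of that split rule (assuming the full sorted key list were actually available in one place, which is exactly the problem):

import java.util.ArrayList;
import java.util.List;

public class SplitComputation {
    // split[i] = keys[floor(i * len(keys) / N)] for i in 0..N-1
    static <K> List<K> computeSplits(List<K> sortedKeys, int numReducers) {
        List<K> splits = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            splits.add(sortedKeys.get((int) ((long) i * sortedKeys.size() / numReducers)));
        }
        return splits;
    }
}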
I'm willing to write my own partitioner in Java, but the Partitioner<KEY,VALUE> class only seems to have access to one key-value record at a time, not the whole list. I know that Hadoop sorts the records that were produced by the mappers, so this list must exist somewhere. It might be distributed among several partitioner nodes, in which case I would do the splitting procedure on one of the sublists and somehow communicate the result to all other partitioner nodes. (Assuming that the chosen partitioner node sees a randomized subset, the result would still be approximately load-balanced.) Does anyone know where the sorted list of keys is stored, and how to access it?
I don't want to write two map-reduce jobs, one to find the splits and another to actually use them, because that seems wasteful. (The mappers would have to do the same job twice.) This seems like a general problem: uneven distributions are pretty common.
I've been thinking about this problem, too. This is the high-level approach I would take if someone forced me.
In addition to the mapper logic you have in place to solve your business problem, code some logic to gather whatever statistics you'll need in the partitioner to distribute key-value pairs in a balanced manner. Of course, each mapper will only see some of the data.
Each mapper can find out its task ID and use that ID to build a unique filename in a designated HDFS folder to hold the gathered statistics. Write this file out in the cleanup() method, which runs at the end of the task.
Use lazy initialization in the partitioner to read all the files in that HDFS directory. This gets you all of the statistics gathered during the mapper phase. From there you're left with implementing whatever partitioning logic you need to partition the data correctly.
This all assumes that the partitioner isn't called until all mappers have finished, but that's the best I've been able to do so far.
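
A rough sketch of that flow, assuming Text keys and a stats directory of our own choosing (/tmp/key-stats is an assumption, not a Hadoop convention), with the actual balancing logic left as a stub:

import java.io.IOException;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;

class StatsGatheringMapper extends Mapper<Object, Text, Text, Text> {
    private final java.util.Map<String, Long> keyCounts = new java.util.HashMap<>();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // ... real business logic goes here; counting keys is the side effect we need later
        keyCounts.merge(value.toString(), 1L, Long::sum);
        context.write(value, value);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        // One stats file per map task, named by task ID so writers never collide.
        String taskId = context.getTaskAttemptID().getTaskID().toString();
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (FSDataOutputStream os = fs.create(new Path("/tmp/key-stats/" + taskId), true)) {
            for (java.util.Map.Entry<String, Long> e : keyCounts.entrySet()) {
                os.writeBytes(e.getKey() + "\t" + e.getValue() + "\n");
            }
        }
    }
}

class StatsAwarePartitioner extends Partitioner<Text, Text> implements Configurable {
    private Configuration conf;
    private volatile boolean initialized = false;

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        if (!initialized) {
            loadStats(); // lazy init: read every mapper's stats file once
        }
        // ... use the gathered statistics to pick a balanced partition;
        // a plain hash partition is used here only to keep the sketch short.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    private synchronized void loadStats() {
        if (initialized) return;
        try {
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus stat : fs.listStatus(new Path("/tmp/key-stats"))) {
                // parse each stats file and merge into an in-memory histogram
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        initialized = true;
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
}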
To the best of my understanding, there is no single place in MR processing where all keys are present. More than that, there is no guarantee that a single machine could store this data.
I don't think this problem has an ideal solution in the current MR framework, because an ideal solution would have to wait for the last mapper to finish and only then analyze the key distribution and parameterize the partitioner with that knowledge.
That approach would significantly complicate the system and increase latency.
I think a good approximation might be to do random sampling over the data to get an idea of the key distribution and then make the partitioner work according to it.
As far as I understand, the TeraSort implementation does something very similar: http://sortbenchmark.org/YahooHadoop.pdf
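
For reference, Hadoop already ships the two pieces that TeraSort-style sampling needs: InputSampler, which draws a random sample of input keys at job-submission time, and TotalOrderPartitioner, which partitions on the resulting split points. A minimal wiring sketch, with key/value types, paths, and sample sizes as illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class SampledPartitioningJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sampled-partitioning");
        job.setNumReduceTasks(20);
        job.setPartitionerClass(TotalOrderPartitioner.class);

        // Split points computed from the sample are written here and read by the partitioner.
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path("/tmp/partition-splits"));

        // (Input format and input paths must already be configured on the job at this point,
        // since the sampler reads the real input.) Heavy keys show up more often in the
        // sample, so the splits reflect load, not just the number of distinct keys.
        InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<>(0.01, 10_000, 100);
        InputSampler.writePartitionFile(job, sampler);

        // ... then submit with job.waitForCompletion(true);
    }
}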

Is Cleo (linkedin's autocomplete solution) suitable for billions of elements?

Cleo has several different types of lookahead searches, which are backed by some very clever indexing strategies. The GenericTypeahead is presumably intended for the largest datasets.
From http://sna-projects.com/cleo/design.php:
"The GenericTypeahead is designed for large data sets, which may contain millions of elements..."
Unfortunately the documentation doesn't go into how, or how well, the Typeaheads scale. Has anyone used Cleo for very large datasets who might have some insight?
Cleo is for a single instance/node (i.e. a single JVM) and does not have any routing or broker logic. Within a single Cleo instance, you can have multiple logical partitions to take advantage of multi-core CPUs. On a typical commodity box with 32G - 64G of memory, you can easily support tens of millions of elements by setting up 2 or 3 Cleo GenericTypeahead instances.
To support billions of elements, you will have to use horizontal partitioning to set up many Cleo instances on many commodity boxes and then do scatter-and-gather.
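
The scatter-and-gather layer is not part of Cleo itself; below is a generic sketch where TypeaheadShard is a hypothetical wrapper around one remote Cleo instance, and the merge/ranking step is left application-specific:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

interface TypeaheadShard {
    List<String> search(String prefix, int maxResults); // queries one Cleo node
}

public class ScatterGatherTypeahead {
    private final List<TypeaheadShard> shards;

    public ScatterGatherTypeahead(List<TypeaheadShard> shards) {
        this.shards = shards;
    }

    public List<String> search(String prefix, int maxResults) {
        // Scatter: query every shard in parallel.
        List<CompletableFuture<List<String>>> futures = new ArrayList<>();
        for (TypeaheadShard shard : shards) {
            futures.add(CompletableFuture.supplyAsync(() -> shard.search(prefix, maxResults)));
        }
        // Gather: merge the per-shard results and keep the top maxResults
        // (real ranking would be application-specific, not alphabetical).
        List<String> merged = new ArrayList<>();
        for (CompletableFuture<List<String>> f : futures) {
            merged.addAll(f.join());
        }
        merged.sort(Comparator.naturalOrder());
        return merged.size() > maxResults ? merged.subList(0, maxResults) : merged;
    }
}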
Check out https://github.com/jingwei/cleo-primer to see how to set up a single Cleo GenericTypeahead instance within minutes.
Cheers.

MongoDB EC2 BenchMark Configuration

I am planning to use MongoDB on EC2 for my web application. Right now I am getting 8000 r/w on MongoDB. My MongoDB instance type is m1.large.
For optimum performance I have followed these sites:
site-1 site-2
I have tried a lot but failed to achieve the performance mentioned on those sites. I want to know whether there is any other resource where I can find an optimum EC2 performance benchmark and some sort of configuration.
Here are some things you can do to increase write performance on EC2:
RAID stuff
RAID 0 will get you better speed, but make sure your application doesn't require the mirroring provided by RAID 10 (the recommended setup). RAID 10 is mirroring (RAID 1) and striping (RAID 0) together.
Note: there's a 2 Gbps rate limit between an individual EC2 instance and the EBS service as a whole, so that is the maximum rate you could ever possibly hope to write to EBS from a single node, no matter how many volumes you have.
Use sharding
The more the writes are spread around to different boxen, the less write load each one has.
Index only the important stuff
Try to limit/reduce the amount of indexes you have. With each additional index, an insert or update will also incur additional writes (and therefore worse write performance).
Control document size
Try to control document size so that documents don't grow and therefore have to be moved to the end of the collection very often. If you have an array you are pushing to, try to fix (limit) the number of items and their individual size so that Mongo can be smart about the padding factor and help you out here.
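
As one illustration of the bounded-array idea, here is a sketch using the MongoDB Java driver's $push with $each/$slice, which caps the array so the document stops growing (the connection string, collection, and field names are assumptions):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.PushOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

import java.util.Collections;

public class BoundedArrayUpdate {
    public static void main(String[] args) {
        MongoCollection<Document> events = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("app")
                .getCollection("events");

        // Push one new entry but keep only the most recent 100, so the document
        // size stays stable and mongod does not have to relocate it.
        events.updateOne(
                Filters.eq("_id", "user-42"),
                Updates.pushEach("recent",
                        Collections.singletonList(new Document("ts", System.currentTimeMillis())),
                        new PushOptions().slice(-100)));
    }
}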
