HBase load balancing by splitting regions - java

I have a 5-node HBase cluster, and most of my input requests fetch sequential data.
To optimize storage, I ran manual region splits on the most heavily loaded regions, but this doesn't help much: the region gets split, but the resulting regions mostly stay on the same regionserver.
How can I control region splitting so that the layout looks like this:
r-1(k1 to k2) on server s1,
r-2(k2 to k3) on server s2,
r-3(k3 to k4) on server s3,
r-4(k4 to k5) on server s4,
r-5(k5 to k6) on server s5,
r-6(k6 to k7) on server s1,
I.e., after splitting, no two consecutive regions end up on the same server, so that the load is not concentrated on a single server.

I am assuming that by server you mean RegionServer. Regions are assigned to regionservers randomly, so if your cluster is big enough, this situation should not occur (or should occur rarely). The idea is that you shouldn't need to worry about this. Also, understand that the regionserver is only a gateway for the data. It relies on HDFS to fetch the actual data, and where the data comes from is decided by HDFS.
Besides, even if consecutive regions end up being served by the same RS, you should be able to use multithreading to get the data faster. HBase already runs a separate thread internally for each region, AFAIK. Usually this doesn't lead to too much load. Have you observed actual excessive load due to this? Have you done any profiling to see what is causing the load?
So there should really be no need to do this, but in special cases you can use the HBaseAdmin.move method to achieve it. You could write some code that goes through all the regions of a table using HTable.getRegionLocations(), sorts the regions by start key, and manually (using HBaseAdmin.move()) ensures that no two consecutive regions are on the same regionserver. But I strongly doubt that this is actually a problem, and I would advise you to confirm that it is before going for this approach.
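A rough sketch of that idea against the older client API (HBaseAdmin/HTable); the table name is a placeholder and the greedy choice of destination server is only illustrative:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class SpreadConsecutiveRegions {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table");              // placeholder table name
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Keyed by HRegionInfo, which sorts by start key, so iteration order is key order.
        NavigableMap<HRegionInfo, ServerName> locations = table.getRegionLocations();
        List<ServerName> servers = new ArrayList<ServerName>(admin.getClusterStatus().getServers());

        ServerName previous = null;
        for (Map.Entry<HRegionInfo, ServerName> entry : locations.entrySet()) {
            HRegionInfo region = entry.getKey();
            ServerName current = entry.getValue();
            if (previous != null && current.equals(previous)) {
                // This region sits on the same RS as its predecessor: move it elsewhere.
                for (ServerName candidate : servers) {
                    if (!candidate.equals(previous)) {
                        admin.move(region.getEncodedNameAsBytes(),
                                   Bytes.toBytes(candidate.getServerName()));
                        current = candidate;
                        break;
                    }
                }
            }
            previous = current;
        }
        admin.close();
        table.close();
    }
}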

Related

DynamoDB write operations

DynamoDB allows only 25 requests per batch. Is there any way to increase this in Java, given that I have to process thousands of records per second? Is there any solution better than dividing the records into batches and processing them?
The limit of 25 items per BatchWriteItem is a hard DynamoDB limit, as documented here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
There is nothing preventing you from doing multiple BatchWrites in parallel. What will gate how much you can write is the provisioned write throughput on the table.
Batch writes were introduced in DynamoDB to reduce the number of round trips required to perform multiple write operations, especially for languages such as PHP that do not make it easy to do the work in parallel threads.
While you will still get better performance from the reduced round trips when using the batch API, individual writes within a batch can fail, and your code will need to look for those. A robust way to perform massively parallel writes in Java is to use the ExecutorService class, which provides a simple mechanism for using multiple threads to perform the inserts. However, just as individual items within a batch can fail, you will want to track the Future objects to ensure the writes are performed successfully.
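For illustration, a sketch of that approach using the AWS SDK for Java v1; the thread-pool size, table name, and minimal retry loop are placeholders rather than production-ready choices:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.BatchWriteItemRequest;
import com.amazonaws.services.dynamodbv2.model.BatchWriteItemResult;
import com.amazonaws.services.dynamodbv2.model.WriteRequest;

public class ParallelBatchWriter {
    private static final int BATCH_SIZE = 25;  // hard DynamoDB limit per BatchWriteItem

    public static void writeAll(AmazonDynamoDB dynamo, String tableName,
                                List<WriteRequest> allWrites) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);  // tuning knob, not a recommendation
        List<Future<?>> futures = new ArrayList<Future<?>>();

        for (int i = 0; i < allWrites.size(); i += BATCH_SIZE) {
            final List<WriteRequest> batch = new ArrayList<WriteRequest>(
                    allWrites.subList(i, Math.min(i + BATCH_SIZE, allWrites.size())));
            futures.add(pool.submit(() -> {
                Map<String, List<WriteRequest>> pending = new HashMap<>();
                pending.put(tableName, batch);
                // Retry until DynamoDB reports no unprocessed items for this batch.
                while (!pending.isEmpty()) {
                    BatchWriteItemResult result = dynamo.batchWriteItem(new BatchWriteItemRequest(pending));
                    pending = result.getUnprocessedItems();
                }
                return null;
            }));
        }
        for (Future<?> f : futures) {
            f.get();  // surfaces any failure thrown inside the worker threads
        }
        pool.shutdown();
    }
}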
Another way to improve throughput is to run your code on EC2. If you are calling DynamoDB from your laptop or from a datacenter outside of AWS, the round-trip time will be longer and the requests will be slightly slower.
The bottom line is to use standard Java multi-threading techniques to get the performance you want. However, past a certain point you may need to fan out and use additional hardware to drive even higher write OPS.
Whenever you've got a large stream of real-time data that needs to end up in AWS, Kinesis Streams are probably the way to go. Particularly with AWS Kinesis Firehose, you can pipe your data to S3 at massive scale with no administrative overhead. You can then use DataPipeline to move it to Dynamo.

Does the JVM intercept disk transactions/have its own disk buffer?

Question:
Do you guys know if disk write calls are intercepted by the JVM? Does it have its own buffer between the application and the OS? More specifically, can the JVM make an asynchronous disk write operation look synchronous to the application?
Background:
I've been running some applications with Berkeley DB in sync mode; that is, the database is supposed to return from calls to db.put(key, value) only after the (key, value) pair has been safely persisted to disk. To set such options, I do:
envConfig.setDurability(Durability.COMMIT_SYNC);
dbConfig.setDeferredWrite(false);
Above, envConfig is an EnvironmentConfig object and dbConfig is a DatabaseConfig object, which I use to adjust the behavior of the database.
Anyway, the above configuration is supposed to make every put(...) call cause a disk transaction (which you can measure, e.g., with iostat on Linux), right? This is because the alternative (COMMIT_NO_SYNC with deferred write) would return from put calls without waiting for the disk, so that a good amount of data could be buffered and written all at once, improving performance at the expense of safety.
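For reference, a minimal sketch of the surrounding setup (the environment directory and database name are placeholders, not my exact code):

import java.io.File;
import java.nio.charset.StandardCharsets;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Durability;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class SyncPutSketch {
    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setTransactional(true);
        envConfig.setDurability(Durability.COMMIT_SYNC);          // fsync on every commit
        Environment env = new Environment(new File("/tmp/bdb-env"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(true);
        dbConfig.setDeferredWrite(false);                         // no deferred (buffered) writes
        Database db = env.openDatabase(null, "mydb", dbConfig);

        DatabaseEntry key = new DatabaseEntry("k1".getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry("v1".getBytes(StandardCharsets.UTF_8));
        // With COMMIT_SYNC this call should not return until the data is on disk.
        db.put(null, key, value);

        db.close();
        env.close();
    }
}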
Problem:
I'm making several thousand calls to put per second, but the number of disk transactions per second barely changes, whether or not I set the above options on the database.
I am not providing an exact answer to the problem, but here is my experience with disk operations. In the past I have faced situations where my expectations about a disk operation were not fulfilled within the stipulated time.
A disk write is always much, much slower than writing to memory. Disk writes depend on the hardware, the native OS API, and the CPU time allocated to the write operation, so you should not expect a disk write to be as fast as your method call. In fact, no program logic should be written assuming particular performance characteristics of devices such as printers or disks.
If you need that, you must have a reconciliation step that ensures the operation is 100% complete before the next operation is allowed to proceed.

java fastest concurrent random file R/W method for SSDs without memory swap

I have a Linux box with 32 GB of RAM and a set of 4 SSDs in a RAID 0 config that maxes out at about 1 GB/s of throughput (random 4k reads), and I am trying to determine the best way of accessing files on them randomly and concurrently using Java. The two main approaches I have seen so far are RandomAccessFile and mapped direct byte buffers.
Here's where it gets tricky, though. I have my own in-memory cache for objects, so any access to the objects stored in a file should go through to disk and not paged memory (I have disabled the swap space on my Linux box to prevent this). Whilst mapped direct memory buffers are supposedly the fastest, they rely on swapping, which is not good because: A) I am using all the free memory for the object cache, and using MappedByteBuffers instead would incur a massive serialization overhead, which is exactly what the object cache is there to prevent (my program is already CPU-limited); B) with MappedByteBuffers the OS handles the details of when data is written to disk, whereas I need to control this myself, i.e. when I write(byte[]) it should go straight out to disk instantly, to prevent data corruption in case of power failure, since I am not using ACID transactions.
On the other hand, I need massive concurrency, i.e. I need to read and write multiple locations in the same file at the same time (whilst using offset/range locks to prevent data corruption). I'm not sure how I can do this without MappedByteBuffers; I could always just queue the reads/writes, but I'm not sure how badly that would affect my throughput.
Finally, I cannot have a situation where I am creating new byte[] objects for reads or writes, because I perform almost 100,000 read/write operations per second; allocating and garbage-collecting all those objects would kill my program, which is time-sensitive and already CPU-limited. Reusing byte[] objects is fine, though.
Please do not suggest any DB software, as I have tried most of them and they add too much complexity and CPU overhead.
Anybody had this kind of dilemma?
Whilst mapped direct memory buffers are supposedly the fastest they rely on swapping
No, not if you have enough RAM. The mapping associates pages in memory with pages on disk. Unless the OS decides that it needs to recover RAM, the pages won't be swapped out. And if you are running short of RAM, all that disabling swap does is cause a fatal error rather than a performance degradation.
I am using all the free memory for the object cache
Unless your objects are extremely long-lived, this is a bad idea because the garbage collector will have to do a lot of work when it runs. You'll often find that a smaller cache results in higher overall throughput.
with mappedbytebuffers the OS handles the details of when data is written to disk, I need to control this myself, ie. when I write(byte[]) it goes straight out to disk instantly
Actually, it doesn't, unless you've mounted your filesystem with the sync option. And then you still run the risk of data loss from a failed drive (especially in RAID 0).
I'm not sure how I can do this without mappedbytebuffers
A RandomAccessFile will do this. However, you'll be paying for at least a kernel context switch on every write (and if you have the filesystem mounted for synchronous writes, each of those writes will involve a disk round-trip).
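As a rough sketch of that route, with a placeholder file name and offsets ("rwd" asks for the file content to be written synchronously on each update):

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class PositionalIoSketch {
    public static void main(String[] args) throws Exception {
        // "rwd" requires each update to the file's content to be written synchronously.
        RandomAccessFile raf = new RandomAccessFile("data.bin", "rwd");
        FileChannel channel = raf.getChannel();

        // Pre-allocated, reused buffer: no new byte[] per operation.
        ByteBuffer buf = ByteBuffer.allocateDirect(4096);

        // Positional read: does not touch the channel's position, so threads don't need to seek.
        buf.clear();
        channel.read(buf, 42L * 4096);

        // Positional write of the same record at another offset.
        buf.flip();
        channel.write(buf, 7L * 4096);

        channel.close();
        raf.close();
    }
}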
I am not using ACID transactions
Then I guess the data isn't really that valuable. So stop worrying about the possibility that someone will trip over a power cord.
Your objections to mapped byte buffers don't hold up. Your mapped files will be distinct from your object cache, and though they take address space they don't consume RAM. You can also sync your mapped byte buffers whenever you want (at the cost of some performance). Moreover, random access files end up using the same apparatus under the covers, so you can't save any performance there.
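As a sketch of the explicit-sync point, assuming a placeholder file and a 64 MB mapping:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedSyncSketch {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "rw");
             FileChannel channel = raf.getChannel()) {
            // The mapping lives in the page cache and address space, not on the Java heap.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 64L << 20);

            byte[] record = new byte[4096];        // reused buffer, no per-operation allocation
            buf.position(12 * 4096);               // random offset within the mapped region
            buf.put(record);

            // Force the dirty pages of this mapping out to the storage device.
            buf.force();
        }
    }
}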
If mapped byte buffers aren't getting you the performance you need, you might have to bypass the filesystem and write directly to raw partitions (which is what DBMSs do). To do that, you would probably need to write C++ code for your data handling and access it through JNI.

memcached and performance

I might be asking a very basic question, but I could not find a clear answer by googling, so I am putting it here.
Memcached caches information in a separate process. Thus, getting the cached information requires inter-process communication (which in Java generally means serialization). That means that to fetch a cached object, we generally need to get a serialized object and transport it over the network.
Both serialization and network communication are costly operations. If memcached needs both of these (generally speaking; there might be cases where network communication is not required), then how is memcached fast? Isn't replication a better solution?
Or is this a tradeoff of distribution/platform independence/scalability vs. performance?
You are right that looking something up in a shared cache (like memcached) is slower than looking it up in a local cache (which is what I think you mean by "replication").
However, the advantage of a shared cache is that it is shared, which means each user of the cache has access to more cache than if the memory was used for a local cache.
Consider an application with a 50 GB database, with ten app servers, each dedicating 1 GB of memory to caching. If you used local caches, then each machine would have 1 GB of cache, equal to 2% of the total database size. If you used a shared cache, then you have 10 GB of cache, equal to 20% of the total database size. Cache hits would be somewhat faster with the local caches, but the cache hit rate would be much higher with the shared cache. Since cache misses are astronomically more expensive than either kind of cache hit, slightly slower hits are a price worth paying to reduce the number of misses.
Now, the exact tradeoff does depend on the exact ratio of the costs of a local hit, a shared hit, and a miss, and also on the distribution of accesses over the database. For example, if all the accesses were to a set of 'hot' records that were under 1 GB in size, then the local caches would give a 100% hit rate, and would be just as good as a shared cache. Less extreme distributions could still tilt the balance.
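To make that tradeoff concrete, here is a back-of-the-envelope calculation with assumed latencies (the figures are illustrative only, not measurements):

public class CacheCostSketch {
    public static void main(String[] args) {
        // Assumed per-lookup latencies in milliseconds: local hit, shared hit, cache miss.
        double localHit = 0.1, sharedHit = 1.0, miss = 10.0;
        // Hit rates from the 1 GB vs 10 GB example above: 2% local, 20% shared.
        double localExpected  = 0.02 * localHit  + 0.98 * miss;   // ~9.80 ms per lookup
        double sharedExpected = 0.20 * sharedHit + 0.80 * miss;   // ~8.20 ms per lookup
        System.out.printf("local: %.2f ms, shared: %.2f ms%n", localExpected, sharedExpected);
    }
}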
In practice, the optimum configuration will usually (IMHO!) be to have a small but very fast local cache for the hottest data, then a larger and slower cache for the long tail. You will probably recognise that as the shape of other cache hierarchies: consider the way that processors have small, fast L1 caches for each core, then slower L2/L3 caches shared between all the cores on a single die, then perhaps yet slower off-chip caches shared by all the dies in a system (do any current processors actually use off-chip caches?).
You are neglecting the cost of disk I/O in your consideration, which is generally going to be the slowest part of any process, and is the main driver IMO for using in-memory caching like memcached.
Memory caches like memcached serve data from RAM over the network. Replication uses both RAM and persistent disk storage to fetch data. Their purposes are very different.
If you're only thinking of using memcached to store easily obtainable data, such as a 1-to-1 mapping of table records, you're going to have a bad time.
On the other hand, if your data is the entire result set of a complex SQL query that may even overflow the SQL memory pool (and need to be temporarily written to disk to be fetched), you're going to see a big speed-up.
The previous example mentions needing to write data to disk for a read operation. Yes, that happens if the result set is too big for memory (imagine a CROSS JOIN), which means you both read from and write to that drive (thrashing comes to mind).
In a highly optimized application written in C, for example, you may have a total processing time of 1 microsecond and may need to wait for networking and/or serialization/deserialization (marshalling/unmarshalling) for much longer than the application's execution time itself. That's when you'll begin to feel the limitations of memory caching over the network.
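As a sketch of the cache-the-expensive-result-set idea above, using the spymemcached client (the host, key, TTL, and stubbed query are placeholders):

import java.net.InetSocketAddress;
import java.util.ArrayList;
import net.spy.memcached.MemcachedClient;

public class QueryResultCacheSketch {
    public static void main(String[] args) throws Exception {
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        String cacheKey = "report:q42";                    // derive from the query and its parameters
        @SuppressWarnings("unchecked")
        ArrayList<String> rows = (ArrayList<String>) client.get(cacheKey);

        if (rows == null) {
            rows = runExpensiveQuery();                    // the costly SQL would go here (stubbed out)
            client.set(cacheKey, 600, rows);               // cache for 10 minutes; value must be Serializable
        }
        System.out.println("rows: " + rows.size());
        client.shutdown();
    }

    private static ArrayList<String> runExpensiveQuery() {
        return new ArrayList<String>();                    // placeholder for the real result set
    }
}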

Can I use Terracotta to scale a RAM-intensive application?

I'm evaluating Terracotta to help me scale up an application which is currently RAM-bound. It is a collaborative filter and stores about 2 kilobytes of data per user. I want to use Amazon's EC2, which means I'm limited to 14 GB of RAM, giving me an effective per-server upper bound of around 7 million users. I need to be able to scale beyond this.
Based on my reading so far, I gather that Terracotta can have a clustered heap larger than the available RAM on each server. Would it be viable to have an effective clustered heap of 30 GB or more, where each of the servers only supports 14 GB?
The per-user data (the bulk of which are arrays of floats) changes very frequently, potentially hundreds of thousands of times per minute. It isn't necessary for every single one of these changes to be synchronized to other nodes in the cluster the moment they occur. Is it possible to only synchronize some object fields periodically?
I'd say the answer is a qualified yes for this. Terracotta does allow you to work with clustered heaps larger than the size of a single JVM although that's not the most common use case.
You still need to keep in mind a) the working set size and b) the amount of data traffic. For a), there is some set of data that must be in memory to perform the work at any given time and if that working set size > heap size, performance will obviously suffer. For b), each piece of data added/updated in the clustered heap must be sent to the server. Terracotta is best when you are changing fine-grained fields in pojo graphs. Working with big arrays does not take the best advantage of the Terracotta capabilities (which is not to say that people don't use it that way sometimes).
If you are creating a lot of garbage, then the Terracotta memory managers and distributed garbage collector have to be able to keep up with that. It's hard to say without trying it whether your data volumes exceed the available bandwidth there.
Your application will benefit enormously if you run multiple servers and data is partitioned by server or has some amount of locality of reference. In that case, you only need the data for one server's partition in heap and the rest does not need to be faulted into memory. It will of course be faulted if necessary for failover/availability if other servers go down. What this means is that in the case of partitioned data, you are not broadcasting to all nodes, only sending transactions to the server.
From a numbers point of view, it is possible to index 30GB of data, so that's not close to any hard limit.
