How to get the logical data usage within a Cassandra cluster - java

We could look at the physical bytes on disk, but that number includes all the replicas. So I am wondering: is there a good approach to get the logical data usage (the real, meaningful data size) within the cluster without iterating over all the nodes? Thanks.

Unfortunately no. Even nodetool cfstats, which shows the live SSTable size, includes the replicas. You can iterate over all the nodes, add the sizes up, and divide by your replication factor to get a rough estimate of a single copy.
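For illustration, here is a minimal Java sketch of that arithmetic, assuming the per-node "Space used (live)" figures have already been collected separately (for example by running nodetool cfstats on each host) and that a single replication factor applies to the keyspace of interest; the class and method names are invented for the example.

```java
import java.util.Map;

public class LogicalSizeEstimate {

    // Sum the live bytes reported by every node, then divide by the replication
    // factor. This is only a rough estimate: compaction state and uneven data
    // distribution skew the per-node numbers.
    static long estimateLogicalBytes(Map<String, Long> liveBytesPerNode, int replicationFactor) {
        long totalReplicatedBytes = 0;
        for (long bytes : liveBytesPerNode.values()) {
            totalReplicatedBytes += bytes;
        }
        return totalReplicatedBytes / replicationFactor;
    }

    public static void main(String[] args) {
        // Example figures for a 3-node cluster with RF = 3.
        Map<String, Long> liveBytes = Map.of(
                "10.0.0.1", 120L * 1024 * 1024 * 1024,
                "10.0.0.2", 115L * 1024 * 1024 * 1024,
                "10.0.0.3", 125L * 1024 * 1024 * 1024);
        long logicalGb = estimateLogicalBytes(liveBytes, 3) / (1024L * 1024 * 1024);
        System.out.println("~" + logicalGb + " GB of logical data");
    }
}
```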

Related

Does Apache Helix support partition split and merge?

I understand that Apache Helix allows dynamic cluster expansion/shrinkage (e.g., adding/failing/removing physical nodes). However, in the case that a single physical node cannot handle a single partition replica, I need to split a partition into two. I understand that we need to pre-estimate the workload so that we can set up a sufficient number of partitions up front. However, as traffic goes up unpredictably, it is almost impossible to do such a pre-estimation. Can anyone tell me whether Helix supports re-partitioning out of the box? If I need to customize it to add repartitioning functionality, how large is the effort and how would it be done in principle? I want a quick estimate. Thanks
Helix does not support partition splitting/merging out of the box. We could not come up with a generic way to support this without understanding the underlying system.
Having said that, it's possible to build a custom solution using the primitives provided by Helix. If you can provide additional information about your system, I might be able to suggest something.
I would suggest starting with a high number of (logical) partitions and assigning each node multiple partitions. When the service needs more resources, add nodes and move some partitions from the existing nodes to the new ones.
For instance, if you started with 50 nodes, you would split your workload's key space into 50,000 logical partitions and assign 1,000 partitions to each node. When increasing to, say, 75 nodes, you would redistribute so that each node holds roughly 667 partitions.
Depending on the actual scenario, you might want to minimize the reallocated partitions, for example using a consistent hashing algorithm.
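As an illustration of stable assignment (this is not a Helix API), here is a minimal Java sketch using rendezvous (highest-random-weight) hashing, which keeps most partition-to-node assignments unchanged when a node is added; the node names and partition count are just example values.

```java
import java.util.List;

public class PartitionAssigner {

    // SplitMix64-style finalizer used as a cheap, deterministic mixing function.
    static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    // Each partition is owned by the node with the highest hash(partition, node)
    // score, so adding or removing a node only moves the partitions it must own.
    static String ownerOf(int partition, List<String> nodes) {
        String best = null;
        long bestScore = Long.MIN_VALUE;
        for (String node : nodes) {
            long score = mix64(((long) partition << 32) ^ node.hashCode());
            if (score > bestScore) {
                bestScore = score;
                best = node;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> before = List.of("node-1", "node-2", "node-3");
        List<String> after = List.of("node-1", "node-2", "node-3", "node-4");
        int totalPartitions = 50_000;
        int moved = 0;
        for (int p = 0; p < totalPartitions; p++) {
            if (!ownerOf(p, before).equals(ownerOf(p, after))) {
                moved++;
            }
        }
        // Roughly 1/4 of the partitions move when going from 3 to 4 nodes,
        // instead of nearly all of them with a plain modulo assignment.
        System.out.println("Partitions moved: " + moved + " of " + totalPartitions);
    }
}
```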

Processing large arrays that do not fit in RAM in Java

I am developing a text analysis program that represents documents as arrays of "feature counts" (e.g., occurrences of a particular token) within some pre-defined feature space. These arrays are stored in an ArrayList after some processing.
I am testing the program on a 64 MB dataset with 50,000 records. The program worked fine with small data sets, but it now consistently throws an "out of memory" Java heap exception when I start loading the arrays into an ArrayList object (using the .add(double[]) method). Depending on how much memory I allocate to the heap, I get this exception somewhere between the 1,000th and 3,000th addition to the ArrayList, far short of my 50,000 entries. It became clear to me that I cannot store all of this data in RAM and operate on it as usual.
However, I'm not sure which data structures are best suited to accessing and performing calculations on the entire dataset when only part of it can be loaded into RAM.
I was thinking that serializing the data to disk and storing the locations in a hashmap in RAM would be useful. However, I have also seen discussions on caching and buffered processing.
I'm 100% sure this is a common CS problem, so I'm sure there are several clever ways that this has been addressed. Any pointers would be appreciated :-)
You have plenty of choices:
Increase the heap size (-Xmx) to several gigabytes.
Do not use boxed collections; use fastutil - that should decrease your memory use about 4x. http://fastutil.di.unimi.it/
Process your data in batches or sequentially - do not keep the whole dataset in memory at once (see the sketch after this list).
Use a proper database. There are even in-process databases like HSQLDB; your mileage may vary.
Process your data via map-reduce, perhaps something local like Pig.
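To make the batching option concrete, here is a minimal Java sketch. It assumes a hypothetical features.csv file with one comma-separated row of feature counts per document and a fixed feature space; only one batch of primitive double[] rows is held in memory at a time while a running aggregate is updated.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BatchedFeatureProcessor {

    public static void main(String[] args) throws IOException {
        Path input = Path.of("features.csv");   // hypothetical input file
        int batchSize = 1_000;
        double[] featureTotals = null;           // running aggregate across all batches

        try (BufferedReader reader = Files.newBufferedReader(input)) {
            double[][] batch = new double[batchSize][];
            int filled = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                double[] row = new double[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    row[i] = Double.parseDouble(parts[i]);
                }
                batch[filled++] = row;
                if (filled == batchSize) {
                    featureTotals = accumulate(featureTotals, batch, filled);
                    filled = 0;                   // reuse the batch array; old rows become garbage
                }
            }
            featureTotals = accumulate(featureTotals, batch, filled); // last partial batch
        }
        System.out.println("Features seen: " + (featureTotals == null ? 0 : featureTotals.length));
    }

    // Example per-batch computation: add each row into per-feature totals.
    // Assumes every row has the same number of features (the pre-defined feature space).
    static double[] accumulate(double[] totals, double[][] batch, int count) {
        for (int r = 0; r < count; r++) {
            if (totals == null) {
                totals = new double[batch[r].length];
            }
            for (int i = 0; i < batch[r].length; i++) {
                totals[i] += batch[r][i];
            }
        }
        return totals;
    }
}
```

The same loop structure works for any computation that can be expressed as a sequence of per-batch updates to a small in-memory summary.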
How about using Apache Spark (great for in-memory cluster computing)? It would help you scale your infrastructure as your data set gets larger.

Fast indexing using multiple ES nodes?

Everything I read and understand about running multiple ES nodes is that it enables index replication and scaling. I was wondering whether it could also help us make indexing faster for a large number of files. I have two questions:
Question 1: Would it be accurate to think that using multiple ES nodes would allow us to index multiple times faster?
Question 2: What effect does it have on indexing if I enable all nodes as data nodes? On the other hand, what effect does it have if I keep a few non-data nodes (e.g., one dedicated master and one dedicated client node) alongside a few data nodes? Which is better in terms of speed and scaling?
Answer1: No.
The speed of indexing will in fact decrease if you enable replication (though it may increase search performance). You can look at this question for improving indexing performance.
Answer2: It depends (if there are no replicas, it is the same).
During indexing, the data goes only to the data nodes. The cluster state contains information about which nodes are data nodes, and requests are routed accordingly. The only performance impact is that a node receiving the request may have to reroute/forward it to the data nodes.
If you are adding machines without increasing the number of replicas, you will get better indexing performance. That is not surprising, since you are adding more resources while the amount of work to be done remains pretty much the same.
In our environment we are using 20 nodes in production and 5-10 nodes in debug. Both environments hold the same volume of data. Since ES update speed (we are using Groovy scripts to merge new documents into existing documents) is our primary bottleneck, we see much better performance in our production environment as opposed to the other environments.
You already got some useful links in the other answers to your question. I can add that in our case the three most significant factors in improving data upload were: reducing the refresh_interval, increasing the merge_factor, and using the Elasticsearch-Hadoop plugin (we upload the data from Spark), which handles the major data-transfer optimisations at the application level.
Each of those steps has its own disadvantages, so read the manuals carefully before changing the configuration.
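As a concrete illustration of the refresh_interval tweak, here is a rough Java sketch that toggles it through Elasticsearch's index settings REST API using the plain JDK HTTP client; the host, port, and index name are placeholders, and whether disabling refresh is appropriate depends on your version and setup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RefreshIntervalDuringBulkLoad {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Turn off periodic refreshes while bulk indexing ("-1" disables them)...
        putIndexSettings(client, "{\"index\":{\"refresh_interval\":\"-1\"}}");

        // ... run the bulk load here ...

        // ... then restore a normal refresh interval afterwards.
        putIndexSettings(client, "{\"index\":{\"refresh_interval\":\"1s\"}}");
    }

    static void putIndexSettings(HttpClient client, String body) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/my_index/_settings")) // placeholder index
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```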

Replacing a huge dump file with an efficient lookup Java key-value text store

I have a huge dump file - 12GB of text containing millions of entries. Each entry has a numeric id, some text, and other irrelevant properties. I want to convert this file into something that provides efficient lookup: given an id, it should return the text quickly. The limitations:
Embedded in Java, preferably without an external server or foreign language dependencies.
Read and writes to the disk, not in-memory - I don't have 12GB of RAM.
Does not blow up too much - I don't want to turn a 12GB file into a 200GB index. I don't need full-text search, sorting, or anything fancy - just key-value lookup.
Efficient - It's a lot of data and I have just one machine, so speed is an issue. Tools that can store large batches and/or work well with several threads are preferred.
Storing more than one field is nice, but not a must. The main concern is the text.
Your recommendations are welcomed!
I would use Java Chronicle or something like it (partly because I wrote it), because it is designed to access large amounts of data (larger than your machine's memory) somewhat randomly.
It can store any number of fields in text or binary formats (or a combination if you wish). It adds 8 bytes per record that you want to be able to access randomly. It doesn't support deleting records (you can mark them for reuse), but you can update and add new records.
It can only have a single writer thread, but it can be read by any number of threads on the same machine (even in different processes).
It doesn't support batching, but it can read/write millions of entries per second with typical sub-microsecond latency (except for random reads/writes of data that is not in memory).
It uses next to no heap (<1 MB for TBs of data).
It uses a sequential id, but you can build a table to translate your own ids to those sequential ids.
BTW: You can buy 32 GB of RAM for less than $200. Perhaps it's time to get more memory ;)
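To illustrate the translation table mentioned above (this is not Chronicle code, just generic bookkeeping), here is a minimal Java sketch that maps arbitrary dump ids to the sequential record indexes such a store would use; the class and method names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class IdTranslationTable {

    private final Map<Long, Long> idToRecordIndex = new HashMap<>();
    private long nextRecordIndex = 0;

    // Call while writing records in dump order; returns the sequential index used by the store.
    long register(long dumpId) {
        long index = nextRecordIndex++;
        idToRecordIndex.put(dumpId, index);
        return index;
    }

    // Call at lookup time to find which sequential record to read; null if the id is unknown.
    Long lookup(long dumpId) {
        return idToRecordIndex.get(dumpId);
    }
}
```

For millions of entries, a primitive-keyed map (e.g. fastutil's Long2LongOpenHashMap) or a persisted index would keep heap usage lower than a boxed HashMap.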
Why not use Java DB - the database that comes with Java?
It'll store the info on disk and be efficient in terms of lookups, provided you index properly. It runs in-JVM, so you don't need a separate server/service. You talk to it using standard JDBC.
I suspect it'll be pretty efficient. This database has a long history (it started life as IBM's Cloudscape and is now Apache Derby), and a lot of effort has been spent on its robustness and efficiency.
You'll obviously need to do an initial onboarding of the data to create the database, but that's a one-off task.
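As a rough sketch of what the JDBC side could look like with embedded Java DB / Derby (assuming the Derby jars are on the classpath; the table and column names are placeholders, and a real onboarding run would insert the 12GB dump in large batches inside transactions):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedDerbyLookup {

    public static void main(String[] args) throws Exception {
        // ";create=true" creates the on-disk database directory on first run.
        try (Connection conn = DriverManager.getConnection("jdbc:derby:dumpdb;create=true")) {

            try (Statement stmt = conn.createStatement()) {
                // PRIMARY KEY gives us the index needed for fast lookups by id.
                // (Fails if the table already exists; a real onboarding step would check first.)
                stmt.executeUpdate("CREATE TABLE entries (id BIGINT PRIMARY KEY, body CLOB)");
            }

            try (PreparedStatement insert =
                         conn.prepareStatement("INSERT INTO entries (id, body) VALUES (?, ?)")) {
                insert.setLong(1, 42L);
                insert.setString(2, "some entry text from the dump");
                insert.executeUpdate();
            }

            try (PreparedStatement query =
                         conn.prepareStatement("SELECT body FROM entries WHERE id = ?")) {
                query.setLong(1, 42L);
                try (ResultSet rs = query.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(rs.getString("body"));
                    }
                }
            }
        }
    }
}
```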

MongoDB EC2 BenchMark Configuration

I am planning to use MongoDB on EC2 for my web application. Right now I am getting 8000 r/w on MongoDB. My MongoDB instance type is m1.large.
For optimum performance I have followed these sites:
site-1 site-2
I have tried a lot but failed to achieve the performance mentioned in the sites above. I want to know whether there is any other resource where I can find an optimum EC2 performance benchmark and some sort of recommended configuration.
Here are some things you can do to increase write performance on EC2:
RAID setup
RAID 0 will get you better speed, but make sure your application doesn't require the mirroring provided by RAID 10 (the recommended setup). RAID 10 is mirroring (RAID 1) and striping (RAID 0) together.
Note: there's a 2 Gb/s rate limit between an individual EC2 instance and the EBS service as a whole, so this is the maximum rate you could ever possibly hope to write to EBS from a single node, no matter how many volumes you have.
Use sharding
The more the writes are spread across different boxes, the less write load each one has.
Index only the important stuff
Try to limit/reduce the number of indexes you have. With each additional index, an insert or update will incur additional writes (and therefore worse write performance).
Control document size
Try to control document size so that documents don't grow and therefore have to be moved to the end of the collection too often. If you have an array you are pushing to, try to fix (limit) the number of items and their individual size so that Mongo can be smart about the padding factor and help you out here.
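As one way to keep such arrays bounded, here is a hedged sketch using the MongoDB Java driver's $push with $slice so the array never grows past a fixed length; the connection string, database, collection, and field names are placeholders, and the exact driver API may differ between versions.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.PushOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

import java.util.List;

public class CappedArrayPush {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events =
                    client.getDatabase("app").getCollection("user_events"); // placeholder names

            events.updateOne(
                    Filters.eq("_id", "user-123"),
                    Updates.pushEach(
                            "recentEvents",
                            List.of(new Document("type", "login")
                                    .append("ts", System.currentTimeMillis())),
                            // Keep only the newest 100 entries so the array has a fixed upper bound
                            // and the document size stays roughly constant.
                            new PushOptions().slice(-100)));
        }
    }
}
```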
