Cleo has several different types of typeahead searches, which are backed by some very clever indexing strategies. The GenericTypeahead is presumably intended for the largest datasets.
From http://sna-projects.com/cleo/design.php:
"The GenericTypeahead is designed for large data sets, which may contain millions of elements..."
Unfortunately, the documentation doesn't go into how, or how well, the Typeaheads scale. Has anyone used Cleo for very large datasets and can share some insight?
Cleo is for a single instance/node (i.e. a single JVM) and does not have any routing or broker logic. Within a single Cleo instance, you can have multiple logical partitions to take advantage of multi-core CPUs. On a typical commodity box with 32G - 64G memory, you can easily support tens of millions of elements by setting up 2 or 3 Cleo GenericTypeahead instances.
To support billions of elements, you will have to use horizontal partitioning to set up many Cleo instances on many commodity boxes and then do scatter-and-gather.
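A rough sketch of the scatter-and-gather step is shown below. Note that TypeaheadClient, Element, and the merge logic are hypothetical placeholders for illustration, not part of the actual Cleo API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.*;

// Hypothetical scatter-and-gather over several typeahead shards.
// TypeaheadClient and Element are placeholders, not real Cleo classes.
public class ScatterGather {
    public interface TypeaheadClient {
        List<Element> search(String prefix, int limit);   // one remote instance
    }

    public record Element(String value, double score) {}

    private final List<TypeaheadClient> shards;
    private final ExecutorService pool = Executors.newCachedThreadPool();

    public ScatterGather(List<TypeaheadClient> shards) {
        this.shards = shards;
    }

    public List<Element> search(String prefix, int limit) throws Exception {
        // Scatter: query every shard in parallel.
        List<Future<List<Element>>> futures = new ArrayList<>();
        for (TypeaheadClient shard : shards) {
            futures.add(pool.submit(() -> shard.search(prefix, limit)));
        }
        // Gather: merge the partial results and keep the top hits.
        List<Element> merged = new ArrayList<>();
        for (Future<List<Element>> f : futures) {
            merged.addAll(f.get(200, TimeUnit.MILLISECONDS));
        }
        merged.sort(Comparator.comparingDouble(Element::score).reversed());
        return merged.subList(0, Math.min(limit, merged.size()));
    }
}
```

In practice you would also handle slow or failed shards gracefully instead of letting a single timeout fail the whole query.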
Check out https://github.com/jingwei/cleo-primer to see how to set up a single Cleo GenericTypeahead instance within minutes.
Cheers.
I understand that Apache Helix allows dynamic cluster expansion/shrinkage (e.g., adding/failing/removing physical nodes). However, in the case that a single physical node cannot handle a single partition replica, I need to split a partition into two. I understand that we need to pre-estimate the workload so that we can set up a sufficient number of partitions up front. However, as traffic grows unpredictably, such a pre-estimation is almost impossible to make. Can anyone tell me whether Helix supports re-partitioning out of the box? If I need to customize it to add repartitioning functionality, how large is the effort and how would it be done in principle? I would like a quick estimate. Thanks
Helix does not support partition splitting/merging out of the box. We could not come up with a generic way to support this without understanding the underlying system.
Having said that, it's possible to build a custom solution using the primitives provided by Helix. If you can provide additional information about your system, I might be able to suggest something.
I would suggest starting with a high number of (logical) partitions and assigning each node multiple partitions. When the service needs more resources, add nodes and move some partitions from the existing nodes to the new ones.
For instance, assuming you start with 50 nodes, you would split your workload into 50,000 logical partitions and assign 1,000 partitions to each node. When growing to, say, 75 nodes, you would redistribute them, ending up with about 667 partitions per node.
Depending on the actual scenario, you might want to minimize the number of reallocated partitions, for example by using a consistent hashing algorithm.
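As an illustration, a minimal consistent-hash ring that maps logical partition ids to nodes might look like the sketch below (class and method names are made up); when a node is added, only the partitions whose hashes fall on the new node's tokens move.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring: maps logical partition ids to nodes.
// Adding a node only reassigns the partitions that land on its new tokens.
public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodesPerNode;

    public ConsistentHashRing(int virtualNodesPerNode) {
        this.virtualNodesPerNode = virtualNodesPerNode;
    }

    public void addNode(String node) {
        for (int i = 0; i < virtualNodesPerNode; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodesPerNode; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    /** Returns the node that owns the given logical partition. */
    public String nodeFor(String partitionId) {
        long h = hash(partitionId);
        SortedMap<Long, String> tail = ring.tailMap(h);
        Long key = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(key);
    }

    private static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);  // use the first 8 digest bytes
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```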
Everything I have read and understood about running multiple ES nodes is that it enables index replication and scaling. I was wondering whether it could also help us make indexing faster for a large number of files. I have two questions, as follows:
Question 1: Would it be accurate to think that using multiple ES nodes would allow us to index multiple times faster?
Question 2: What effect does it have on indexing if I keep all nodes enabled as data nodes? On the other hand, what effect does it have if I use a few non-data nodes (e.g. one dedicated master and one dedicated client node) together with a few data nodes? Which is better in terms of speed and scaling?
Answer1: No.
The speed of indexing will in fact decrease if you enable replication (though it may increase search performance). You can look at this question for improving indexing performance.
Answer2: It depends (with no replicas, it is about the same).
During indexing, the data will go only to data nodes. The cluster state contains information about which nodes are data nodes, and requests are routed accordingly. The only performance impact is that the node receiving the request has to reroute/forward it to the data nodes.
If you are adding machines without increasing the number of replicas, you will get better indexing performance. That is not surprising, since you are adding more resources while the amount of work to be done remains pretty much the same.
In our environment we use 20 nodes in production and 5-10 nodes in debug. Both environments hold the same volume of data. Since ES update speed (we use Groovy scripts to merge new documents into existing documents) is our primary bottleneck, we see much better performance in our production environment as opposed to the other environments.
You already got some useful links in the other answers to your question. I can add that in our case the three most significant factors in improving data upload were: increasing the refresh_interval (i.e. refreshing less often), increasing the merge_factor, and using the Elasticsearch-Hadoop plugin (we upload the data from Spark), which handles the major data-transfer optimisations at the application level.
Each of those steps has its own disadvantages, so read the manuals carefully before changing the configuration.
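For illustration, relaxing the refresh interval around a bulk load can be done through the index settings REST API; the index name, host, and values below are placeholders, and this is only a minimal sketch using the JDK HTTP client.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Example: relax the refresh interval on an index before a bulk load,
// then restore it afterwards. Index name, host, and values are placeholders.
public class RefreshIntervalTuning {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    static void setRefreshInterval(String index, String interval) throws Exception {
        String body = "{\"index\": {\"refresh_interval\": \"" + interval + "\"}}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/" + index + "/_settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }

    public static void main(String[] args) throws Exception {
        setRefreshInterval("my-index", "-1");   // disable refresh during the bulk upload
        // ... run the bulk upload ...
        setRefreshInterval("my-index", "1s");   // restore the default afterwards
    }
}
```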
I have been reading several SO posts regarding K-D Trees vs. R-Trees but I still have some questions regarding my specific application.
For my Java application, I want to maintain a relatively small number of spatial data points (a few hundred thousand). The key is that data insertion will not be bulk loaded, but rather, frequently and incrementally inserted. I should also mention that I will be performing a good number of periodic range queries on sub-regions of the spatial domain.
I have read that K-D Trees do not typically support incremental building and that R-trees are more suitable for this since they maintain a balanced state.
However, after looking into the solutions suggested here:
Java commercial-friendly R-tree implementation?
I did not find the implementations easy to work with for returning a list of points in range searches. However, I found http://java-ml.sourceforge.net/ to have a very nice implementation of a K-D Tree that works quickly and outperforms standard array storage for a test set of points (~25K). Additionally, I have read that R-trees store redundant information when dealing with points (since a point is a rectangle with min=max).
Since I am working with a smaller number of points, are the differences between the two structures less important than, say, if I was working with a database application storing millions of points?
It is incorrect that R-trees can't store points. They are designed to support rectangles, and need to do so at the inner nodes. But a good implementation should store points at the leaf level, and roughly have double the data capacity there.
You can trivially store points and expose them as "rectangles" with min=max to the tree management code.
Your data isn't small. Small would be something like 100 objects. For 100 objects, an R-tree won't make much sense, as it would likely consist of a single leaf only. For good performance, an R-tree needs a good fan-out. k-d-trees always have a fan-out of 2; they are binary trees. At 100k objects, a k-d-tree will be pretty deep. Assuming you have a fan-out of 100 (for dynamic R-trees, you should then allow up to 200 objects per page), you can store 1 million points in a 3-level tree.
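To make the fan-out argument concrete, here is a small back-of-the-envelope sketch (the helper name is made up) that reproduces the numbers above:

```java
// Back-of-the-envelope: how many tree levels are needed for n entries at a given fan-out.
public class TreeDepth {
    static int levelsNeeded(long n, long fanout) {
        int levels = 0;
        long capacity = 1;
        while (capacity < n) {       // each level multiplies the capacity by the fan-out
            capacity *= fanout;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(levelsNeeded(100_000, 2));     // k-d-tree (fan-out 2): 17 levels
        System.out.println(levelsNeeded(1_000_000, 100)); // R-tree (fan-out 100): 3 levels
    }
}
```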
I've used the ELKI R*-tree, and it is really fast. But it's not commercial-friendly unless you get a different license: it's AGPL-3 licensed, which is a copyleft license.
Furthermore, the API isn't designed for standalone use. If you want to use them, the best way is to work with the full ELKI framework, instead of trying to rip out the R*-tree.
If your data is low dimensional (say, 3-dimensional) and has finite bounds, don't underestimate the performance of simple grid-based approaches, in particular for in-memory operations. In many cases I wouldn't even go to an octree, but just define the optimal grid for my use case and then implement it using object lists. Keeping the points within each grid cell sorted by one coordinate can further accelerate performance.
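A minimal sketch of such a grid index, assuming 2D points with known bounds (the class and parameter names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal uniform-grid index for 2D points with known bounds.
// Each cell keeps a plain list of points; range queries only visit overlapping cells.
public class GridIndex {
    public record Point(double x, double y) {}

    private final double minX, minY, cellSize;
    private final int cols, rows;
    private final List<Point>[] cells;

    @SuppressWarnings("unchecked")
    public GridIndex(double minX, double minY, double maxX, double maxY, double cellSize) {
        this.minX = minX;
        this.minY = minY;
        this.cellSize = cellSize;
        this.cols = (int) Math.ceil((maxX - minX) / cellSize);
        this.rows = (int) Math.ceil((maxY - minY) / cellSize);
        this.cells = new List[cols * rows];
        for (int i = 0; i < cells.length; i++) cells[i] = new ArrayList<>();
    }

    private int col(double x) { return Math.min(cols - 1, Math.max(0, (int) ((x - minX) / cellSize))); }
    private int row(double y) { return Math.min(rows - 1, Math.max(0, (int) ((y - minY) / cellSize))); }

    public void insert(Point p) {
        cells[row(p.y()) * cols + col(p.x())].add(p);
    }

    /** Returns all points inside the axis-aligned query rectangle. */
    public List<Point> rangeQuery(double qMinX, double qMinY, double qMaxX, double qMaxY) {
        List<Point> result = new ArrayList<>();
        for (int r = row(qMinY); r <= row(qMaxY); r++) {
            for (int c = col(qMinX); c <= col(qMaxX); c++) {
                for (Point p : cells[r * cols + c]) {
                    if (p.x() >= qMinX && p.x() <= qMaxX && p.y() >= qMinY && p.y() <= qMaxY) {
                        result.add(p);
                    }
                }
            }
        }
        return result;
    }
}
```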
If you want to frequently add/remove/update data points, you may want to look at the PH-Tree. There is an open source Java version available: www.phtree.org
It works a bit like a quadtree, but is much more efficient because it uses binary hypercubes and prefix sharing.
It has excellent update performance (no rebalancing required) and is quite memory efficient. It works better with larger datasets, but 100K should be fine for 2 or 3 dimensions.
I'm writing a library where:
It will need to run on a wide range of different platforms / Java implementations (the common case is likely to be OpenJDK or Oracle Java on Intel 64 bit machines with Windows or Linux)
Achieving high performance is a priority, to the extent that I care about CPU cache line efficiency in object access
In some areas, quite large graphs of small objects will be traversed / processed (let's say around 1GB scale)
The main workload is almost exclusively reads
Reads will be scattered across the object graph, but not totally randomly (i.e. there will be significant hotspots, with occasional reads to less frequently accessed areas)
The object graph will be accessed concurrently (but not modified) by multiple threads. There is no locking, on the assumption that concurrent modification will not occur.
Are there some rules of thumb / guidelines for designing small objects so that they utilise CPU cache lines effectively in this kind of environment?
I'm particularly interested in sizing and structuring the objects correctly, so that e.g. the most commonly accessed fields fit in the first cache line etc.
Note: I am fully aware that this is implementation dependent, that I will need to benchmark, and of the general risks of premature optimization. No need to waste any further bandwidth pointing this out. :-)
A first step towards cache line efficiency is to provide for referential locality (i.e. keeping your data close together). This is hard to do in Java, where almost everything is heap-allocated and accessed by reference.
To avoid references, the following might be obvious:
- have non-reference types (i.e. int, char, etc.) as fields in your objects
- keep your objects in arrays
- keep your objects small
These rules will at least ensure some referential locality when working on a single object and when traversing the object references in your object graph.
Another approach might be not to use objects for your data at all, but instead to have global non-reference typed arrays (of the same size), one for each item that would normally be a field in your class; each instance would then be identified by a common index into these arrays.
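A minimal sketch of this parallel-arrays idea, with made-up field names, might look like:

```java
// Structure-of-arrays layout: an "object" is just an index into parallel primitive arrays.
// Fields that are scanned together stay contiguous in memory, which is cache-friendly.
public class NodeStore {
    private final int[] degree;      // frequently read field
    private final double[] weight;   // frequently read field
    private final long[] flags;      // rarely read field lives in its own array
    private int count = 0;

    public NodeStore(int capacity) {
        degree = new int[capacity];
        weight = new double[capacity];
        flags = new long[capacity];
    }

    /** Adds a node and returns its index, which plays the role of an object reference. */
    public int add(int d, double w, long f) {
        degree[count] = d;
        weight[count] = w;
        flags[count] = f;
        return count++;
    }

    /** Hot read-only loop: touches only the two arrays it needs, in sequential order. */
    public double weightedDegreeSum() {
        double sum = 0;
        for (int i = 0; i < count; i++) {
            sum += degree[i] * weight[i];
        }
        return sum;
    }
}
```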
Then, for optimizing the size of the arrays or chunks thereof, you have to know the MMU characteristics (page/cache size, number of cache lines, etc.). I don't know if Java provides this in the System or Runtime classes, but you could pass this information as system properties on startup.
Of course this is totally orthogonal to what you should normally be doing in Java :)
Best regards
You may require information about the various caches of your CPU; you can access it from Java using Cachesize (currently supporting Intel CPUs). This can help in developing cache-aware algorithms.
Disclaimer: author of the lib.
I was wondering whether the Structure of Arrays (SoA) data layout is always faster than an Array of Structs (AoS) or an Array of Pointers (AoP) for problems whose input fits only in RAM, programmed in C/Java.
A few days ago I was improving the performance of a Molecular Dynamics algorithm (in C). In short, the algorithm calculates the force interactions among particles based on their forces and positions.
Originally, the particles were represented by a struct containing 9 doubles: 3 for the forces (Fx, Fy, Fz), 3 for the positions and 3 for the velocity. The algorithm used an array containing pointers to all the particles (AoP). I decided to change the layout from AoP to SoA to improve cache use.
So now I have a struct with 9 arrays, where the arrays store the forces, velocities and positions (x, y, z) of the particles, and each particle is accessed by its own array index.
I got a performance gain of about 1.9x (for an input that fits only in RAM), so I was wondering whether changing from AoP or AoS to SoA will always perform better, and if not, in which types of algorithms it will not.
Much depends on how useful all the fields are. If you have a data structure where using one field means you are likely to use all of them, then an array of structs is more efficient, as it keeps together all the things you are likely to need.
Say you have time-series data where you only need a small selection of the possible fields. You might have all sorts of data about an event or point in time, but you only need, say, 3-5 of them. In this case a structure of arrays is more efficient because (a) you don't need to cache the fields you don't use and (b) you often access values in order, i.e. caching a field, its next value and the one after that is useful.
For this reason, time-series information is often stored as a collection of columns.
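A small sketch of that difference for time-series data (the Event fields are made up): reading a single field with an object-per-event layout drags whole objects through the cache, while a column array touches only the bytes actually needed.

```java
import java.util.List;

// Contrast: reading one field across many events.
// AoS: each Event object pulls all of its fields (and object header) through the cache.
// SoA: a single contiguous double[] column is read sequentially.
public class TimeSeriesLayouts {

    // Array-of-structs style: one object per event.
    public record Event(long timestamp, double price, double volume, int venue, long flags) {}

    static double sumPricesAos(List<Event> events) {
        double sum = 0;
        for (Event e : events) {
            sum += e.price();          // loads the whole Event just to read one field
        }
        return sum;
    }

    // Struct-of-arrays style: one column per field.
    public static class EventColumns {
        final long[] timestamp;
        final double[] price;
        final double[] volume;
        final int[] venue;
        final long[] flags;

        EventColumns(int capacity) {
            timestamp = new long[capacity];
            price = new double[capacity];
            volume = new double[capacity];
            venue = new int[capacity];
            flags = new long[capacity];
        }
    }

    static double sumPricesSoa(EventColumns columns) {
        double sum = 0;
        for (double p : columns.price) {  // sequential scan over just the needed column
            sum += p;
        }
        return sum;
    }
}
```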
This will depend on how exactly you access the data.
Try to imagine what exactly happens in the hardware when you access your data, in either SoA or AoS.
To reason about your question, you must consider the following things:
If there were no cache, the performance would be the same, assuming that the memory access latency is equal for all the elements of the data.
Now, with a cache, if you access consecutive address locations you will definitely get a performance improvement. This is exactly what happens in your case: with AoS (or AoP), the locations of a given field are not consecutive in memory, so you lose some performance there.
You are presumably accessing your data in for loops like for (int i = 0; i < 1000000; i++) Fx[i] = 0;. So if the amount of data is large, you will easily see the performance benefit; if your data were small, it would not matter much.
Finally, there is also the DRAM you are using: it gives an additional benefit when you access consecutive data. To understand why, you can refer to the Wikipedia article on DRAM.