I'm looking for a lightweight Java library that supports Nearest Neighbor Searches by Locality Sensitive Hashing for nearly equally distributed data in a high dimensional (in my case 32) dataset with some hundreds of thousands data points.
It's totally good enough to get all entries in a bucket for a query. Which ones i really need could then be processed in a different way under consideration of some filter parameters my problem include.
I already found likelike but hope that there is something a bit smaller and without need of any other tools (like Apache Hadoop in the case of likelike).
Maybe this one:
"TarsosLSH is a Java library implementing Locality-sensitive Hashing (LSH), a practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time. It supports several Locality Sensitive Hashing (LSH) families: the Euclidean hash family (L2), city block hash family (L1) and cosine hash family. The library tries to hit the sweet spot between being capable enough to get real tasks done, and compact enough to serve as a demonstration on how LSH works."
Code can be found here
Apache Spark has an LSH implementation: https://spark.apache.org/docs/2.1.0/ml-features.html#locality-sensitive-hashing (API).
After having played with both the tdebatty and TarsosLSH implementations, I'll likely use Spark, as it supports sparse vectors as input. The tdebatty requires a non-sparse array of booleans or int's, and the TarsosLSH Vector implementation is a non-sparse array of doubles. This severely limits the number of dimensions one can reasonably support.
This page provides links to more projects, as well as related papers and information: https://janzhou.org/lsh/.
There is this one:
http://code.google.com/p/lsh-clustering/
I haven't had time to test it but at least it compiles.
Here another one:
https://github.com/allenlsy/knn
It uses LSH for KNN. I'm currently investigating it's usability =)
The ELKI data mining framework comes with an LSH index. It can be used with most algorithms included (anything that uses range or nn searches) and sometimes works very well.
In other cases, LSH doesn't seem to be a good approach. It can be quite tricky to get the LSH parameters right: if you choose some parameters too high, runtime grows a lot (all the way to a linear scan). If you choose them too low, the index becomes too approximative and loses to many neighbors.
It's probably the biggest challenge with LSH: finding good parameters, that yield the desired speedup and getting a good enough accuracy out of the index...
Related
I'm building a system that does text classification. I'm building the system in Java. As features I'm using the bag-of-words model. However one problem with such a model is that the number of features is really high, which makes it impossible to fit the data in memory.
However, I came across this tutorial from Scikit-learn which uses specific data structures to solve the issue.
My questions:
1 - How do people solve such an issue using Java in general?
2- Is there a solution similar to the solution given in scikit-learn?
Edit: the only solution I've found so far is to personally write a Sparse Vector implementation using HashTables.
If you want to build this system in Java, I suggest you use Weka, which is a machine learning software similar to sklearn. Here is a simple tutorial about text classification with Weka:
https://weka.wikispaces.com/Text+categorization+with+WEKA
You can download Weka from:
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
HashSet/HashMap are the usual way people store bag-of-words vectors in Java - they are naturally sparse representations that grow not with the size of dictionary but with the size of document, and the latter is usually much smaller.
If you deal with unusual scenarios, like very big document/representations, you can look for a few sparse bitset implementations around, they may be slightly more economical in terms of memory and are used for massive text classification implementations based on Hadoop, for example.
Most NLP frameworks make this decision for you anyway - you need to supply things in the format the framework wants them.
I have been reading several SO posts regarding K-D Trees vs. R-Trees but I still have some questions regarding my specific application.
For my Java application, I want to maintain a relatively small number of spatial data points (a few hundred thousand). The key is that data insertion will not be bulk loaded, but rather, frequently and incrementally inserted. I should also mention that I will be performing a good number of periodic range queries on sub-regions of the spatial domain.
I have read that K-D Trees do not typically support incremental building and that R-trees are more suitable for this since they maintain a balanced state.
However, after looking into the solutions suggested here:
Java commercial-friendly R-tree implementation?
I did not find that the implementations were easy to work with for returning a list of points in range searches. However, I have found: http://java-ml.sourceforge.net/ to have a very nice implementation of a K-D Tree that works quickly and outperforms standard array storage for a test set of points (~25K). Additionally, I have read that R-trees store redundant information when dealing with points (since a point is a rectangle with min=max).
Since I am working with a smaller number of points, are the differences between the two structures less important than, say, if I was working with a database application storing millions of points?
It is incorrect that R-trees can't store points. They are designed to support rectangles, and will need to do so at inner nodes. But a good implementation should store points at the leaf level, and roughly have the double data capacity there.
You can trivially store point, and expose them as a "rectangles" with min=max to the tree management code.
Your data isn't small. Small would be like 100 objects. For 100 objects, an R-tree won't make much sense, as it would likely consists of a single leaf only. For good performance, an R-tree needs a good fan-out. k-d-tree always have a fan-out of 2; they are binary trees. At 100k objects, a k-d-tree will be pretty deep. Assuming that you have a fanout of 100 (for dynamic r-trees, you then should allow up to 200 objects per page), you can store 1 million points in a 3-level tree.
I've used the ELKI R*-tree, and it is really fast. But it's not commercial friendly, unless you get a different license: it's AGPL-3 licensed, which is a copyleft license.
Furthermore, the API isn't designed for standalone use. If you want to use them, the best way is to work with the full ELKI framework, instead of trying to rip out the R*-tree.
If your data is low dimensional (say, 3-dimensional) and has a finite bound, don't underestimate the performance of simple grid-based approaches. In particular for in-memory operations. In many cases, I wouldn't even go to an Octree, but just define the optimal grid for my use case, and then implement it using object lists. Keep sorted by one coordinate within each grid cell to further accelerate performance.
If you want to frequently add/remove/update data points, you may want to look at the PH-Tree. The is on open source Java version available: www.phtree.org
It works a bit like a quadtree, but is much more efficient by using binary hypercubes and prefix-sharing.
It has excellent update performance (no rebalancing required) and is quite memory efficient. It works better with larger datasets, but 100K should be fine for 2 or 3 dimensions.
The problem I have to solve is that I have to input IP address prefixes and that data associated with them in a tree so they can be queried later. I'm reading these addresses from a file and the file may contain as many as 16 million records and the file could have duplicates and i have to store those too.
I wrote my own binary search tree, but learned that a TreeMap in Java is implemented using a Red Black tree, but a TreeMap can't contain duplicates.
I want the query to take O(logn) time.
The data structure needs to be in Ram, so I'm also not sure how I'm going to store 16 million nodes.
I wanted to ask: Would it be too much of a performance hit using a library like guava to insert the Ips in Multi-maps? Or is there a better way to do this?
Using a built in library, which is tested documented and well maintained is usually a good practice.
It will also help you learn more about guava. Once you start using it "for just one thing", you will most likely realize there is much more you can use to make your life a bit easier.
Also, an alternative is using a TreeMap<Key,List<MyClass>> rather then TreeMap<Key,MyClass> as a custom implementation of a Multimap.
Regarding memory - you should try to minimize your data as much as possible (use efficient data structures, no need for "wasty" String, for example for storing IPs, there are cheaper alternatives, exploit them.
Also note - the OS will be able to offer you more memory then the RAM you have, by using virtual memory (practically for 64bits machine - it is most likely to be more then enough). However, it will most likely be less efficient then a DS dedicated for disk (such as B+ trees, for example).
Alternatives:
As alternatives to the TreeMap - you might be interested in other data structures (each with its advantages and disadvantages):
hash table - implemented as HashMap in java. Your type will then beHashMap<Key,List<Value>>. It allows O(1) average case query, but might decay to O(n) worst case. It also does not allow efficient range queries.
trie or its more space efficient version - radix tree. Allows O(1) access to each key, but is usually less space efficient then the alternatives. With this approach, you will implement the Map interface with the DS, and your type will be Map<Key,List<Value>>
B+ tree, which is much more optimized for disk - if your data is too large to fit in RAM after all.
My understanding is to calculate percentiles, the data needs to be sorted. Would this be possible with a huge amount of data spread across multiple servers, without moving it around?
While MapReduce as a paradigm does not looks suited for the problem, hadoop's implementation of MR - is.
Hadoop's implementation of map reduce is based on distributed sort - and it is what you need. Hadoop is doing sort by moving data between servers only once - not that bad.
I would suggest to look onto hadoop terasort implementaiton which illustrate the good (and probabbly the best) way to sort massive data with hadoop. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html
I would first create a histogram, either on one machine or multiple machines. Once you have a count for each possible value of buckets of possible values you can combine these if needed. The gain for using a histogram is that it has O(1) insertion/sort time instead of O(log n) and uses O(M) space where M is the number of possible values or buckets instead of O(N) where N is the number of sample.
A histogram is naturally sorted so you can get a total count and find the percentiles by counting from either end.
The answer to your question is yes, it is possible. But Map-Reduce isn't really designed for this kind of task. Map-Reduce (as is used in a Hadoop cluster, for instance) shines on unstructured or semi-structured data. While it has the ability to process other kinds, it is not best suited for it. (I had one project at a company where they wanted to analyze XML in a Hadoop cluster... it wasn't the most fun thing.)
This scholarly article describes some of the issues with Map-Reduce on structured data and offers an alternative approach with "Clydesdale". (I have never heard of or used this, so I can neither endorse it or speak to its strengths/weaknesses.)
I'm looking for more links that offer explanations and alternatives.
I need to solve nonlinear minimization (least residual squares of N unknowns) problems in my Java program. The usual way to solve these is the Levenberg-Marquardt algorithm. I have a couple of questions
Does anybody have experience on the different LM implementations available? There exist slightly different flavors of LM, and I've heard that the exact implementation of the algorithm has a major effect on the its numerical stability. My functions are pretty well-behaved so this will probably not be a problem, but of course I'd like to choose one of the better alternatives. Here are some alternatives I've found:
FPL Statistics Group's Nonlinear Optimization Java Package. This includes a Java translation of the classic Fortran MINPACK routines.
JLAPACK, another Fortran translation.
Optimization Algorithm Toolkit.
Javanumerics.
Some Python implementation. Pure Python would be fine, since it can be compiled to Java with jythonc.
Are there any commonly used heuristics to do the initial guess that LM requires?
In my application I need to set some constraints on the solution, but luckily they are simple: I just require that the solutions (in order to be physical solutions) are nonnegative. Slightly negative solutions are result of measurement inaccuracies in the data, and should obviously be zero. I was thinking to use "regular" LM but iterate so that if some of the unknowns becomes negative, I set it to zero and resolve the rest from that. Real mathematicians will probably laugh at me, but do you think that this could work?
Thanks for any opinions!
Update: This is not rocket science, the number of parameters to solve (N) is at most 5 and the data sets are barely big enough to make solving possible, so I believe Java is quite efficient enough to solve this. And I believe that this problem has been solved numerous times by clever applied mathematicians, so I'm just looking for some ready solution rather than cooking my own. E.g. Scipy.optimize.minpack.leastsq would probably be fine if it was pure Python..
The closer your initial guess is to the solution, the faster you'll converge.
You said it was a non-linear problem. You can do a least squares solution that's linearized. Maybe you can use that solution as a first guess. A few non-linear iterations will tell you something about how good or bad an assumption that is.
Another idea would be trying another optimization algorithm. Genetic and ant colony algorithms can be a good choice if you can run them on many CPUs. They also don't require continuous derivatives, so they're nice if you have discrete, discontinuous data.
You should not use an unconstrained solver if your problem has constraints. For
instance if know that some of your variables must be nonnegative you should tell
this to your solver.
If you are happy to use Scipy, I would recommend scipy.optimize.fmin_l_bfgs_b
You can place simple bounds on your variables with L-BFGS-B.
Note that L-BFGS-B takes a general nonlinear objective function, not just
a nonlinear least-squares problem.
I agree with codehippo; I think that the best way to solve problems with constraints is to use algorithms which are specifically designed to deal with them. The L-BFGS-B algorithm should probably be a good solution in this case.
However, if using python's scipy.optimize.fmin_l_bfgs_b module is not a viable option in your case (because you are using Java), you can try using a library I have written: a Java wrapper for the original Fortran code of the L-BFGS-B algorithm. You can download it from http://www.mini.pw.edu.pl/~mkobos/programs/lbfgsb_wrapper and see if it matches your needs.
The FPL package is quite reliable but has a few quirks (array access starts at 1) due to its very literal interpretation of the old fortran code. The LM method itself is quite reliable if your function is well behaved. A simple way to force non-negative constraints is to use the square of parameters instead of the parameters directly. This can introduce spurious solutions but for simple models, these solutions are easy to screen out.
There is code available for a "constrained" LM method. Look here http://www.physics.wisc.edu/~craigm/idl/fitting.html for mpfit. There is a python (relies on Numeric unfortunately) and a C version. The LM method is around 1500 lines of code, so you might be inclined to port the C to Java. In fact, the "constrained" LM method is not much different than the method you envisioned. In mpfit, the code adjusts the step size relative to bounds on the variables. I've had good results with mpfit as well.
I don't have that much experience with BFGS, but the code is much more complex and I've never been clear on the licensing of the code.
Good luck.
I haven't actually used any of those Java libraries so take this with a grain of salt: based on the backends I would probably look at JLAPACK first. I believe LAPACK is the backend of Numpy, which is essentially the standard for doing linear algebra/mathematical manipulations in Python. At least, you definitely should use a well-optimized C or Fortran library rather than pure Java, because for large data sets these kinds of tasks can become extremely time-consuming.
For creating the initial guess, it really depends on what kind of function you're trying to fit (and what kind of data you have). Basically, just look for some relatively quick (probably O(N) or better) computation that will give an approximate value for the parameter you want. (I recently did this with a Gaussian distribution in Numpy and I estimated the mean as just average(values, weights = counts) - that is, a weighted average of the counts in the histogram, which was the true mean of the data set. It wasn't the exact center of the peak I was looking for, but it got close enough, and the algorithm went the rest of the way.)
As for keeping the constraints positive, your method seems reasonable. Since you're writing a program to do the work, maybe just make a boolean flag that lets you easily enable or disable the "force-non-negative" behavior, and run it both ways for comparison. Only if you get a large discrepancy (or if one version of the algorithm takes unreasonably long), it might be something to worry about. (And REAL mathematicians would do least-squares minimization analytically, from scratch ;-P so I think you're the one who can laugh at them.... kidding. Maybe.)