I'm looking for a matrix / linear algebra library in Java that provides a sparse matrix that can be written to concurrently from different threads. Most of the libraries I've come across either do not provide sparse matrices at all, or 1.) back them with an open addressed hash map, or 2.) store them in CSR or CSC format, which is not at all amenable to multithreaded construction. Right now I'm gathering the entries in parallel using a concurrent hash map and then populating the sparse matrix from a single thread, but this seems like a waste of resources (space to store the concurrent hash map, and time to essentially fill in the matrix twice).
You can't just magically make sparse matrix algebra routines scalably parallel. Tackling these issues involves some of the most complex numerical analysis algorithms around and is still the subject of intense research.
You don't say what you want to do with these matrices, but I imagine that you want the solution to systems of linear equations. If you want that in parallel then you'll need a 3rd party library, very large matrices, and likely some money.
The most common way to assemble sparse matrices is to assemble them in triplet format and convert to compressed row or column format. The assembly can be expensive but it is easy to do in parallel. Just let each thread have its own list of triplets and splice them together before converting to compressed format.
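As a rough illustration of the per-thread triplet idea, here is a minimal sketch. The class and method names (TripletAssembly, assembleChunk, toCsr, build) are made up for this example and do not come from any library, and the tridiagonal entries are just placeholder data. Each worker fills its own private list, and a single pass at the end splices the lists into CSR arrays with a counting sort on the row index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper, not taken from any library.
class TripletAssembly {
    // One (row, column, value) entry.
    static final class Triplet {
        final int row, col;
        final double val;
        Triplet(int row, int col, double val) { this.row = row; this.col = col; this.val = val; }
    }

    // Compressed sparse row storage.
    static final class Csr {
        final int[] rowPtr, colIdx;
        final double[] vals;
        Csr(int[] rowPtr, int[] colIdx, double[] vals) {
            this.rowPtr = rowPtr; this.colIdx = colIdx; this.vals = vals;
        }
    }

    // Each worker fills its own private list, so no locking is needed.
    // The entries (a tridiagonal pattern without the upper diagonal) are placeholder data.
    static List<Triplet> assembleChunk(int firstRow, int lastRow) {
        List<Triplet> local = new ArrayList<>();
        for (int r = firstRow; r < lastRow; r++) {
            local.add(new Triplet(r, r, 2.0));
            if (r > 0) local.add(new Triplet(r, r - 1, -1.0));
        }
        return local;
    }

    // Splice the per-thread lists into CSR with a counting sort on the row index.
    static Csr toCsr(int nRows, List<List<Triplet>> perThread) {
        int[] rowPtr = new int[nRows + 1];
        int nnz = 0;
        for (List<Triplet> list : perThread)
            for (Triplet t : list) { rowPtr[t.row + 1]++; nnz++; }
        for (int r = 0; r < nRows; r++) rowPtr[r + 1] += rowPtr[r];

        int[] next = rowPtr.clone();          // next free slot in each row
        int[] colIdx = new int[nnz];
        double[] vals = new double[nnz];
        for (List<Triplet> list : perThread)
            for (Triplet t : list) {
                int pos = next[t.row]++;
                colIdx[pos] = t.col;
                vals[pos] = t.val;
            }
        return new Csr(rowPtr, colIdx, vals);
    }

    // Assemble row chunks in parallel, then convert once on the calling thread.
    static Csr build(int nRows, int nThreads) {
        int chunk = (nRows + nThreads - 1) / nThreads;
        List<List<Triplet>> perThread = IntStream.range(0, nThreads)
            .parallel()
            .mapToObj(t -> assembleChunk(Math.min(nRows, t * chunk), Math.min(nRows, (t + 1) * chunk)))
            .collect(Collectors.toList());
        return toCsr(nRows, perThread);
    }
}
```

Note that this sketch neither sums duplicate (row, column) entries nor sorts the columns within a row; a real conversion would usually do both.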
I remember the matrices in Parallel Colt being thread safe. The library is a multithreaded version of Colt.
I am working on a Java project which has thousands of matrix calculations. But the matrices are at most 10x10 matrices.
I wonder if it is better to use a matrix library or write the simple functions (determinant(), dotproduct() etc.) myself. When small matrices are used, it is often advised not to use libraries but to do the operations with custom functions.
I know that matrix libraries like JAMA provide high performance when it comes to 10000x10000 matrices or so.
Instead of making 5-6 calculations with 10000x10000 matrices, I make 100000 calculations with 10x10 matrices. The number of primitive operations is nearly the same.
Are both cases same in terms of performance? Should I treat myself as if I'm working with huge matrices and use a library?
I suspect for a 10x10 matrix you won't see much difference.
In tests I have done for hand coding a 4x4 matrix the biggest overhead was loading the data into the L1 cache and how you did it didn't matter very much. For a 3x3 matrix and smaller it did appear to make a significant difference.
Getting the maximum possible speed (with lots of effort)
For maximum possible speed I would suggest writing a C function that uses vector math intrinsics such as Streaming SIMD Extensions (SSE) or Advanced Vector Extensions (AVX) operations, together with multithreading (e.g. via OpenMP).
Your Java program would pass all 100k matrices to this native function, which would then handle all the calculations. Portability becomes an issue, e.g. AVX instructions are only supported on recent CPUs. Developer effort also increases a lot, especially if you are not familiar with SSE/AVX.
Reasonable speed without too much effort
You should use multiple threads by creating a class that extends java.lang.Thread or implements java.lang.Runnable. Each thread iterates through a subset of the matrices, calling your maths routine(s) for each matrix. This part is key to getting decent speed on multi-core CPUs. The maths could be your own Java function to do the calculations on a single matrix, or you could use a library's functions.
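A minimal sketch of that approach, assuming the per-matrix routine is a determinant. The names SmallMatrixBatch, determinant and determinants are invented for this example, and it uses a fixed thread pool with Runnable tasks rather than subclassing java.lang.Thread directly. Each task handles one contiguous slice of the batch and writes only to its own slice of the output array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class, not part of any library.
class SmallMatrixBatch {
    // Determinant by Gaussian elimination with partial pivoting; works on a copy.
    static double determinant(double[][] m) {
        int n = m.length;
        double[][] a = new double[n][];
        for (int i = 0; i < n; i++) a[i] = m[i].clone();
        double det = 1.0;
        for (int k = 0; k < n; k++) {
            int p = k;                                   // row with the largest pivot
            for (int i = k + 1; i < n; i++)
                if (Math.abs(a[i][k]) > Math.abs(a[p][k])) p = i;
            if (a[p][k] == 0.0) return 0.0;              // singular matrix
            if (p != k) { double[] t = a[p]; a[p] = a[k]; a[k] = t; det = -det; }
            det *= a[k][k];
            for (int i = k + 1; i < n; i++) {
                double f = a[i][k] / a[k][k];
                for (int j = k + 1; j < n; j++) a[i][j] -= f * a[k][j];
            }
        }
        return det;
    }

    // Splits the batch into one contiguous slice per thread.
    static double[] determinants(double[][][] matrices, int nThreads) {
        double[] out = new double[matrices.length];
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            int chunk = (matrices.length + nThreads - 1) / nThreads;
            List<Future<?>> pending = new ArrayList<>();
            for (int t = 0; t < nThreads; t++) {
                final int from = Math.min(matrices.length, t * chunk);
                final int to = Math.min(matrices.length, from + chunk);
                pending.add(pool.submit(() -> {
                    for (int i = from; i < to; i++) out[i] = determinant(matrices[i]);
                }));
            }
            for (Future<?> f : pending) f.get();         // wait for every slice
        } catch (Exception e) {
            throw new IllegalStateException("batch failed", e);
        } finally {
            pool.shutdown();
        }
        return out;
    }
}
```

The determinant call could be swapped for a library routine without changing the threading part.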
I wonder if it is better to use a matrix library or use write the
simple functions (determinant(), dotproduct() etc.) Because when small
matrices are used, it is advised not to use libraries but do the
operations by custom functions.
...
Are both cases same in terms of performance? Should I treat myself as
if I'm working with huge matrices and use a library?
No, using a library and writing your own function for the maths are not the same performance-wise. You may be able to write a faster function that is specialised to your application, but consider this:
The library functions should have fewer bugs than code you will write.
A good library will use implementations that are efficient (i.e. least amount of operations). Do you have the time to research and implement the most efficient algorithms?
You might find the Apache Commons Math library useful. I would encourage you to benchmark Apache Commons Math and JAMA to choose the fastest.
I have been reading several SO posts regarding K-D Trees vs. R-Trees but I still have some questions regarding my specific application.
For my Java application, I want to maintain a relatively small number of spatial data points (a few hundred thousand). The key is that data insertion will not be bulk loaded, but rather, frequently and incrementally inserted. I should also mention that I will be performing a good number of periodic range queries on sub-regions of the spatial domain.
I have read that K-D Trees do not typically support incremental building and that R-trees are more suitable for this since they maintain a balanced state.
However, after looking into the solutions suggested here:
Java commercial-friendly R-tree implementation?
I did not find that the implementations were easy to work with for returning a list of points in range searches. However, I have found: http://java-ml.sourceforge.net/ to have a very nice implementation of a K-D Tree that works quickly and outperforms standard array storage for a test set of points (~25K). Additionally, I have read that R-trees store redundant information when dealing with points (since a point is a rectangle with min=max).
Since I am working with a smaller number of points, are the differences between the two structures less important than, say, if I was working with a database application storing millions of points?
It is incorrect that R-trees can't store points. They are designed to support rectangles, and will need to do so at inner nodes. But a good implementation should store points at the leaf level, and have roughly double the data capacity there.
You can trivially store points, and expose them as "rectangles" with min=max to the tree management code.
Your data isn't small. Small would be like 100 objects. For 100 objects, an R-tree won't make much sense, as it would likely consist of a single leaf only. For good performance, an R-tree needs a good fan-out. k-d-trees always have a fan-out of 2; they are binary trees. At 100k objects, a k-d-tree will be pretty deep. Assuming that you have a fan-out of 100 (for dynamic R-trees, you then should allow up to 200 objects per page), you can store 1 million points in a 3-level tree.
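To put numbers on the fan-out argument, here is a tiny sketch that counts the levels an idealised, perfectly filled tree needs. The class and method names are made up, and real trees are rarely perfectly filled, so treat the results as lower bounds.

```java
// Hypothetical helper for back-of-the-envelope depth estimates.
class TreeDepth {
    // Smallest number of levels such that fanOut^levels >= n.
    static int levels(long n, long fanOut) {
        int levels = 0;
        long capacity = 1;
        while (capacity < n) {
            capacity *= fanOut;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(levels(1_000_000, 100)); // fan-out 100, 1 million points: 3
        System.out.println(levels(100_000, 2));     // binary tree, 100k points: 17
    }
}
```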
I've used the ELKI R*-tree, and it is really fast. But it's not commercial friendly, unless you get a different license: it's AGPL-3 licensed, which is a copyleft license.
Furthermore, the API isn't designed for standalone use. If you want to use them, the best way is to work with the full ELKI framework, instead of trying to rip out the R*-tree.
If your data is low dimensional (say, 3-dimensional) and has a finite bound, don't underestimate the performance of simple grid-based approaches, in particular for in-memory operations. In many cases, I wouldn't even go to an Octree, but just define the optimal grid for my use case, and then implement it using object lists. Keep the list in each grid cell sorted by one coordinate to further accelerate performance.
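A minimal 2-D sketch of such a grid, with one object list per cell. GridIndex and its methods are hypothetical names, the cell size is something you would tune for your data, and the per-cell sorting mentioned above is left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical uniform grid over a bounded 2-D domain; points are {x, y} arrays.
class GridIndex {
    private final double minX, minY, cellSize;
    private final int nx, ny;
    private final List<double[]>[] cells;

    @SuppressWarnings("unchecked")
    GridIndex(double minX, double minY, double maxX, double maxY, double cellSize) {
        this.minX = minX;
        this.minY = minY;
        this.cellSize = cellSize;
        this.nx = Math.max(1, (int) Math.ceil((maxX - minX) / cellSize));
        this.ny = Math.max(1, (int) Math.ceil((maxY - minY) / cellSize));
        this.cells = new List[nx * ny];       // cells are allocated lazily
    }

    // Coordinates outside the bounds are clamped into the edge cells.
    private static int clamp(int v, int n) { return Math.max(0, Math.min(n - 1, v)); }
    private int cellX(double x) { return clamp((int) Math.floor((x - minX) / cellSize), nx); }
    private int cellY(double y) { return clamp((int) Math.floor((y - minY) / cellSize), ny); }

    // Incremental insert: append to the list of the cell containing the point.
    void insert(double x, double y) {
        int c = cellY(y) * nx + cellX(x);
        if (cells[c] == null) cells[c] = new ArrayList<>();
        cells[c].add(new double[] {x, y});
    }

    // Range query: visit only the cells overlapping the query box, then filter exactly.
    List<double[]> range(double x0, double y0, double x1, double y1) {
        List<double[]> hits = new ArrayList<>();
        for (int j = cellY(y0); j <= cellY(y1); j++) {
            for (int i = cellX(x0); i <= cellX(x1); i++) {
                List<double[]> cell = cells[j * nx + i];
                if (cell == null) continue;
                for (double[] p : cell)
                    if (p[0] >= x0 && p[0] <= x1 && p[1] >= y0 && p[1] <= y1) hits.add(p);
            }
        }
        return hits;
    }
}
```

Inserts need no rebalancing, which suits frequent incremental updates; the trade-off is that very skewed data leaves some cells overloaded and others empty.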
If you want to frequently add/remove/update data points, you may want to look at the PH-Tree. There is an open source Java version available: www.phtree.org
It works a bit like a quadtree, but is much more efficient by using binary hypercubes and prefix-sharing.
It has excellent update performance (no rebalancing required) and is quite memory efficient. It works better with larger datasets, but 100K should be fine for 2 or 3 dimensions.
My understanding is to calculate percentiles, the data needs to be sorted. Would this be possible with a huge amount of data spread across multiple servers, without moving it around?
While MapReduce as a paradigm does not look suited to the problem, Hadoop's implementation of MR is.
Hadoop's implementation of map reduce is based on distributed sort - and it is what you need. Hadoop is doing sort by moving data between servers only once - not that bad.
I would suggest looking at the Hadoop TeraSort implementation, which illustrates a good (and probably the best) way to sort massive data with Hadoop. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html
I would first create a histogram, either on one machine or on multiple machines. Once you have a count for each possible value, or for each bucket of possible values, you can combine the counts if needed. The gain from using a histogram is that it has O(1) insertion/sort time per sample instead of O(log n), and uses O(M) space, where M is the number of possible values or buckets, instead of O(N), where N is the number of samples.
A histogram is naturally sorted so you can get a total count and find the percentiles by counting from either end.
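A minimal sketch of the histogram approach, assuming the values are small non-negative integers so that each value can be its own bucket. The names are made up for the example; each server would build its own histogram, only the counts would be sent over the network, and the percentile uses the nearest-rank definition.

```java
// Hypothetical example: percentiles from mergeable per-server histograms.
class HistogramPercentile {
    // Counts how often each value in [0, buckets) occurs; O(1) work per sample.
    static long[] histogram(int[] values, int buckets) {
        long[] h = new long[buckets];
        for (int v : values) h[v]++;
        return h;
    }

    // Histograms with the same buckets combine by adding the counts.
    static long[] merge(long[] a, long[] b) {
        long[] sum = new long[a.length];
        for (int i = 0; i < a.length; i++) sum[i] = a[i] + b[i];
        return sum;
    }

    // Nearest-rank percentile: first bucket whose cumulative count reaches p percent.
    static int percentile(long[] h, double p) {
        long total = 0;
        for (long c : h) total += c;
        long rank = Math.max(1, (long) Math.ceil(p / 100.0 * total));
        long seen = 0;
        for (int b = 0; b < h.length; b++) {
            seen += h[b];
            if (seen >= rank) return b;
        }
        return h.length - 1;                  // only reached for an empty histogram
    }
}
```

With wider buckets the answer is only accurate to the bucket width, so the bucket layout has to match the precision you need.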
The answer to your question is yes, it is possible. But Map-Reduce isn't really designed for this kind of task. Map-Reduce (as is used in a Hadoop cluster, for instance) shines on unstructured or semi-structured data. While it has the ability to process other kinds, it is not best suited for it. (I had one project at a company where they wanted to analyze XML in a Hadoop cluster... it wasn't the most fun thing.)
This scholarly article describes some of the issues with Map-Reduce on structured data and offers an alternative approach with "Clydesdale". (I have never heard of or used this, so I can neither endorse it nor speak to its strengths/weaknesses.)
I'm looking for more links that offer explanations and alternatives.
I'm looking for a lightweight Java library that supports Nearest Neighbor Searches by Locality Sensitive Hashing for nearly equally distributed data in a high dimensional (in my case 32) dataset with some hundreds of thousands data points.
It's totally good enough to get all entries in a bucket for a query. The ones I really need could then be selected in a separate step, taking into account some filter parameters that my problem includes.
I already found likelike but hope that there is something a bit smaller and without need of any other tools (like Apache Hadoop in the case of likelike).
Maybe this one:
"TarsosLSH is a Java library implementing Locality-sensitive Hashing (LSH), a practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time. It supports several Locality Sensitive Hashing (LSH) families: the Euclidean hash family (L2), city block hash family (L1) and cosine hash family. The library tries to hit the sweet spot between being capable enough to get real tasks done, and compact enough to serve as a demonstration on how LSH works."
Code can be found here
Apache Spark has an LSH implementation: https://spark.apache.org/docs/2.1.0/ml-features.html#locality-sensitive-hashing (API).
After having played with both the tdebatty and TarsosLSH implementations, I'll likely use Spark, as it supports sparse vectors as input. The tdebatty implementation requires a non-sparse array of booleans or ints, and the TarsosLSH Vector implementation is a non-sparse array of doubles. This severely limits the number of dimensions one can reasonably support.
This page provides links to more projects, as well as related papers and information: https://janzhou.org/lsh/.
There is this one:
http://code.google.com/p/lsh-clustering/
I haven't had time to test it but at least it compiles.
Here is another one:
https://github.com/allenlsy/knn
It uses LSH for KNN. I'm currently investigating its usability =)
The ELKI data mining framework comes with an LSH index. It can be used with most algorithms included (anything that uses range or nn searches) and sometimes works very well.
In other cases, LSH doesn't seem to be a good approach. It can be quite tricky to get the LSH parameters right: if you choose some parameters too high, runtime grows a lot (all the way to a linear scan). If you choose them too low, the index becomes too approximate and loses too many neighbors.
It's probably the biggest challenge with LSH: finding good parameters, that yield the desired speedup and getting a good enough accuracy out of the index...
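To make the parameter trade-off a bit more concrete, here is a minimal single-table sketch of the cosine hash family (sign of random projections). It is not the API of TarsosLSH, ELKI, Spark or any other library mentioned here; all names are invented. The number of bits is one of the parameters in question: with few bits the buckets are large and a query degenerates towards a scan, while with many bits true neighbours tend to fall into different buckets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical single hash table using sign-of-random-projection hashing.
class CosineLshTable {
    private final double[][] planes;  // one random hyperplane per hash bit
    private final Map<Integer, List<double[]>> buckets = new HashMap<>();

    // bits must be at most 31 so that the key fits into an int.
    CosineLshTable(int dimensions, int bits, long seed) {
        Random rnd = new Random(seed);
        planes = new double[bits][dimensions];
        for (double[] plane : planes)
            for (int d = 0; d < dimensions; d++) plane[d] = rnd.nextGaussian();
    }

    // Bit b is set when the vector lies on the non-negative side of hyperplane b.
    int hash(double[] v) {
        int key = 0;
        for (int b = 0; b < planes.length; b++) {
            double dot = 0.0;
            for (int d = 0; d < v.length; d++) dot += planes[b][d] * v[d];
            if (dot >= 0) key |= 1 << b;
        }
        return key;
    }

    void add(double[] v) {
        buckets.computeIfAbsent(hash(v), k -> new ArrayList<>()).add(v);
    }

    // All entries that share the query's bucket; filter or rank them afterwards.
    List<double[]> candidates(double[] query) {
        return buckets.getOrDefault(hash(query), Collections.<double[]>emptyList());
    }
}
```

Practical indexes usually keep several such tables with independent hyperplanes and take the union of the candidate buckets to recover neighbours that a single table misses.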
I have noticed that MATLAB does some matrix functions really fast. For example, adding 5 to all elements of an n*n array happens almost instantly even if the matrix is large, because you don't have to loop through every element. Doing the same in Java with a for loop takes forever if the matrix is large.
I have two questions. First, are there efficient built-in classes in Java for doing matrix operations? Second, how can I code something to update all elements of a big matrix in Java more efficiently?
Just stumbled into this posting and thought I would throw my two cents in. I am the author of EJML and I am also working on a performance and stability benchmark for Java libraries. While several issues go into determining how fast an algorithm is, Mikhail is correct that caching is a very important issue in the performance of large matrices. For smaller matrices the library's overhead becomes more important.
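As a small illustration of the caching point, here is a sketch of a dense matrix stored as one flat row-major array, so that an element-wise update walks memory in storage order. FlatMatrix is a made-up class for this example, not EJML's API, and no speed-up figure is claimed; you would have to benchmark it against a double[][] loop on your own data.

```java
// Hypothetical dense matrix backed by a single row-major array.
class FlatMatrix {
    final int rows, cols;
    final double[] data;   // element (r, c) lives at index r * cols + c

    FlatMatrix(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new double[rows * cols];
    }

    double get(int r, int c) { return data[r * cols + c]; }

    void set(int r, int c, double v) { data[r * cols + c] = v; }

    // One sequential pass over contiguous memory.
    void addScalar(double s) {
        for (int i = 0; i < data.length; i++) data[i] += s;
    }
}
```

With a double[][] the same effect comes from keeping the row index in the outer loop and the column index in the inner loop, since each row is its own contiguous array.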
Due to overhead in array access, pure Java libraries are slower than highly optimized C libraries, even if the algorithms are exactly the same. Some libraries get around this issue by making calls to native code. You might want to check out
http://code.google.com/p/matrix-toolkits-java/
which does exactly that. There will be some overhead in copying memory from java to the native library, but for large matrices this is insignificant.
For a benchmark on pure java performance (the one that I'm working on) check out:
http://code.google.com/p/java-matrix-benchmark/
Another benchmark is here:
http://www.ujmp.org/java-matrix/benchmark/
Either of these benchmarks should give you a good idea of performance for large matrices.
Colt may be the fastest.
"Colt provides a set of Open Source Libraries for High Performance Scientific and Technical Computing in Java. " "For example, IBM Watson's Ninja project showed that Java can indeed perform BLAS matrix computations up to 90% as fast as optimized Fortran."
JAMA!
"JAMA is a basic linear algebra package for Java. It provides user-level classes for constructing and manipulating real, dense matrices."
Or the Efficient Java Matrix Library
"Efficient Java Matrix Library (EJML) is a linear algebra library for manipulating dense matrices. Its design goals are; 1) to be as computationally efficient as possible for both small and large matrices, and 2) to be accessible to both novices and experts."