Java implementation of singular value decomposition for large sparse matrices

I'm just wondering if anyone out there knows of a Java implementation of singular value decomposition (SVD) for large sparse matrices? I need this implementation for latent semantic analysis (LSA).
I tried the packages from UJMP and JAMA, but they choke when the number of rows >= 1000 and the number of columns >= 500. If anyone can point me to pseudocode or something out there, that would be greatly appreciated.

There's a list of Java numerical libraries at Wikipedia. The NIST library, which is quite good, unfortunately does not deal with sparse matrices. I'm not too familiar with the other packages. You might take a look at Colt; it's also quite high quality and does handle sparse matrices for some operations; I don't know about SVD, although I imagine it does. I've heard that UJMP is also worth a look.
EDIT: Sorry to hear that UJMP doesn't handle your problem. I had heard that it was worth a look.
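If Colt does turn out to support what you need, usage would look roughly like the sketch below. This is a minimal sketch assuming Colt's SparseDoubleMatrix2D and SingularValueDecomposition classes; note that Colt's SVD requires rows >= columns and may densify internally, so test it at your problem sizes before committing to it.

    import cern.colt.matrix.DoubleMatrix2D;
    import cern.colt.matrix.impl.SparseDoubleMatrix2D;
    import cern.colt.matrix.linalg.SingularValueDecomposition;

    public class SparseSvdSketch {
        public static void main(String[] args) {
            // Build a sparse term-document matrix.
            // Colt's SVD requires rows >= columns.
            DoubleMatrix2D a = new SparseDoubleMatrix2D(1000, 500);
            a.set(0, 0, 3.0);   // term 0 occurs 3 times in document 0
            a.set(42, 7, 1.0);  // term 42 occurs once in document 7

            // Colt's SVD; it may fall back to dense computation internally.
            SingularValueDecomposition svd = new SingularValueDecomposition(a);
            double[] sigma = svd.getSingularValues();
            DoubleMatrix2D u = svd.getU();  // left singular vectors, i.e. the LSA term space
            System.out.println("largest singular value: " + sigma[0]);
            System.out.println("U is " + u.rows() + "x" + u.columns());
        }
    }

For LSA you would then keep only the top-k singular values and the corresponding columns of U and V to get the reduced semantic space.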

Related

Optimization: Oj algorithms (java) versus SCIP (python)

Does anybody know how these 2 solvers, (Oj algorithms) from Java and SCIP for Python, relate to each other performance wise (as in: which one is the fastest), when dealing with a typical MILP (Mixed Integer Linear Programming) problem? On first sight, I can't seem to find anything online that can point me in the right direction, and I'm curious!
Thanks in advance!
The SCIP Optimization Suite is one of the fastest MIP and MINLP solvers available in source code. PySCIPOpt, its interface to Python, might be a bit slower when constructing the model but solving times are still good since it's running the pure SCIP C library in the background.
To be honest, I have no experience with oj! Algorithms and cannot say how good this solver is. Apparently it allows linking to Gurobi or CPLEX, so I guess in that case it's mainly a modelling wrapper around those high-performance APIs.
In the end it comes down to your modelling preferences/requirements and your specific problem instances.

LSH Libraries in Java

I'm looking for a lightweight Java library that supports Nearest Neighbor Searches by Locality Sensitive Hashing for nearly equally distributed data in a high dimensional (in my case 32) dataset with some hundreds of thousands data points.
It's totally good enough to get all entries in a bucket for a query. Which ones I really need can then be processed in a different way, taking into account some filter parameters specific to my problem.
I already found likelike but hope that there is something a bit smaller and without need of any other tools (like Apache Hadoop in the case of likelike).
Maybe this one:
"TarsosLSH is a Java library implementing Locality-sensitive Hashing (LSH), a practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time. It supports several Locality Sensitive Hashing (LSH) families: the Euclidean hash family (L2), city block hash family (L1) and cosine hash family. The library tries to hit the sweet spot between being capable enough to get real tasks done, and compact enough to serve as a demonstration on how LSH works."
Code can be found here
Apache Spark has an LSH implementation: https://spark.apache.org/docs/2.1.0/ml-features.html#locality-sensitive-hashing (API).
After playing with both the tdebatty and TarsosLSH implementations, I'll likely use Spark, as it supports sparse vectors as input. The tdebatty library requires a non-sparse array of booleans or ints, and the TarsosLSH Vector implementation is a non-sparse array of doubles. This severely limits the number of dimensions one can reasonably support.
This page provides links to more projects, as well as related papers and information: https://janzhou.org/lsh/.
There is this one:
http://code.google.com/p/lsh-clustering/
I haven't had time to test it but at least it compiles.
Here another one:
https://github.com/allenlsy/knn
It uses LSH for KNN. I'm currently investigating its usability =)
The ELKI data mining framework comes with an LSH index. It can be used with most algorithms included (anything that uses range or nn searches) and sometimes works very well.
In other cases, LSH doesn't seem to be a good approach. It can be quite tricky to get the LSH parameters right: if you choose some parameters too high, runtime grows a lot (all the way to a linear scan). If you choose them too low, the index becomes too approximate and loses too many neighbors.
That's probably the biggest challenge with LSH: finding good parameters that yield the desired speedup while getting good enough accuracy out of the index...
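To make that parameter trade-off concrete, here is a minimal plain-Java sketch of cosine LSH (random hyperplane hashing), not tied to any of the libraries above; NUM_BITS plays the role of the precision parameter discussed here (more bits give smaller, purer buckets; fewer bits give larger, noisier ones):

    import java.util.*;

    public class CosineLshSketch {
        static final int DIM = 32;       // dimensionality of the data points
        static final int NUM_BITS = 12;  // hash length: the key LSH tuning parameter

        final double[][] hyperplanes = new double[NUM_BITS][DIM];
        final Map<Integer, List<double[]>> buckets = new HashMap<>();

        CosineLshSketch(Random rnd) {
            for (double[] h : hyperplanes)
                for (int d = 0; d < DIM; d++)
                    h[d] = rnd.nextGaussian();  // random hyperplane normals
        }

        // One bit per hyperplane: which side of the plane the vector falls on.
        int hash(double[] v) {
            int key = 0;
            for (int b = 0; b < NUM_BITS; b++) {
                double dot = 0;
                for (int d = 0; d < DIM; d++) dot += hyperplanes[b][d] * v[d];
                if (dot >= 0) key |= 1 << b;
            }
            return key;
        }

        void add(double[] v) {
            buckets.computeIfAbsent(hash(v), k -> new ArrayList<>()).add(v);
        }

        // Returns all candidates in the query's bucket; filter/rank them afterwards.
        List<double[]> query(double[] q) {
            return buckets.getOrDefault(hash(q), Collections.emptyList());
        }
    }

A real implementation would use several independent hash tables and merge their buckets to boost recall; this sketch keeps a single table for brevity.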

matlab matrix functions in java

I have noticed that Matlab does some matrix operations really fast. For example, adding 5 to all elements of an n*n array happens almost instantly even if the matrix is large, because you don't have to loop through every element; doing the same in Java, the for loop takes forever if the matrix is large.
I have two questions: first, are there efficient built-in classes in Java for doing matrix operations? Second, how can I code something to update all elements of a big matrix in Java more efficiently?
Just stumbled into this posting and thought I would throw my two cents in. I am the author of EJML, and I am also working on a performance and stability benchmark for Java libraries. While several issues go into determining how fast an algorithm is, Mikhail is correct that caching is a very important issue in the performance of large matrices. For smaller matrices the library's overhead becomes more important.
Due to overhead in array access, pure Java libraries are slower than highly optimized C libraries, even if the algorithms are exactly the same. Some libraries get around this issue by making calls to native code. You might want to check out
http://code.google.com/p/matrix-toolkits-java/
which does exactly that. There will be some overhead in copying memory from Java to the native library, but for large matrices this is insignificant.
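On the second part of the question: even in pure Java, elementwise updates are fast if the matrix is backed by a single flat array rather than a double[][], since that keeps the access pattern sequential and cache-friendly (this is how most dense-matrix libraries, EJML included, store their data). A minimal sketch:

    public class FlatMatrix {
        final int rows, cols;
        final double[] data;  // row-major: element (r, c) lives at r * cols + c

        FlatMatrix(int rows, int cols) {
            this.rows = rows;
            this.cols = cols;
            this.data = new double[rows * cols];
        }

        // Add a scalar to every element: one sequential pass over contiguous memory.
        void addScalar(double v) {
            for (int i = 0; i < data.length; i++) data[i] += v;
        }
    }

A single pass like this over a 5000x5000 matrix (25 million doubles) takes on the order of tens of milliseconds; Matlab feels instant for the same operation because it runs exactly this kind of loop in optimized native code rather than interpreting an elementwise loop.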
For a benchmark on pure java performance (the one that I'm working on) check out:
http://code.google.com/p/java-matrix-benchmark/
Another benchmark is here:
http://www.ujmp.org/java-matrix/benchmark/
Either of these benchmarks should give you a good idea of performance for large matrices.
Colt may be the fastest.
"Colt provides a set of Open Source Libraries for High Performance Scientific and Technical Computing in Java. " "For example, IBM Watson's Ninja project showed that Java can indeed perform BLAS matrix computations up to 90% as fast as optimized Fortran."
JAMA!
"JAMA is a basic linear algebra package for Java. It provides user-level classes for constructing and manipulating real, dense matrices."
Or the Efficient Java Matrix Library
"Efficient Java Matrix Library (EJML) is a linear algebra library for manipulating dense matrices. Its design goals are; 1) to be as computationally efficient as possible for both small and large matrices, and 2) to be accessible to both novices and experts."

Easiest to code algorithm for Rubik's cube?

What would be a relatively easy algorithm to code in Java for solving a Rubik's cube? Efficiency is also important, but it is a secondary consideration.
Perform random operations until you get the right solution. The easiest algorithm and the least efficient.
The simplest non-trivial algorithm I've found is this one:
http://www.chessandpoker.com/rubiks-cube-solution.html
It doesn't look too hard to code up. The link mentioned in Yannick M.'s answer looks good too, but the solution of 'the cross' step looks like it might be a little more complex to me.
There are a number of open source solver implementations which you might like to take a look at. Here's a Python implementation. This Java applet also includes a solver, and the source code is available. There's also a Javascript solver, also with downloadable source code.
Anthony Gatlin's answer makes an excellent point about the well-suitedness of Prolog for this task. Here's a detailed article about how to write your own Prolog solver. The heuristics it uses are particularly interesting.
Might want to check out: http://peter.stillhq.com/jasmine/rubikscubesolution.html
Has a graphical representation of an algorithm to solve a 3x3x3 Rubik's cube
I understand your question is related to Java, but on a practical note, languages like Prolog are much better suited to problems like solving a Rubik's cube. I assume this is probably for a class, though, and you may have no leeway in your choice of tool.
You can do it with BFS (breadth-first search). I think the implementation is not that hard (it is one of the simplest graph algorithms). Using the data structure called a queue, what you will really be doing is building a BFS tree and finding a shortest path from the given configuration to the desired configuration. The drawback of this algorithm is that it is not efficient enough (without any modification, even solving a 2x2x2 cube takes ~5 minutes). But you can always find some tricks to boost the speed.
To be honest, it is one of the homework assignments of the MIT course "Introduction to Algorithms". Here is the assignment's link: http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/assignments/MIT6_006F11_ps6.pdf. They have a few libraries to help you visualize it and to avoid unnecessary effort.
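Here is a minimal sketch of the BFS skeleton described above, with a toy state space (integers with two "moves") standing in for cube states, which are an assumption here, not part of the assignment. For the real thing you would replace the neighbor function with the face turns and make the states hashable cube configurations:

    import java.util.*;
    import java.util.function.Function;

    public class BfsSolver {
        // Shortest move sequence from start to goal, or null if unreachable.
        static <S> List<S> solve(S start, S goal, Function<S, List<S>> moves) {
            Map<S, S> parent = new HashMap<>();  // also serves as the visited set
            Deque<S> queue = new ArrayDeque<>();
            parent.put(start, start);
            queue.add(start);
            while (!queue.isEmpty()) {
                S state = queue.poll();
                if (state.equals(goal)) {
                    // Walk parents back to the start to reconstruct the path.
                    LinkedList<S> path = new LinkedList<>();
                    for (S s = state; !s.equals(start); s = parent.get(s)) path.addFirst(s);
                    path.addFirst(start);
                    return path;
                }
                for (S next : moves.apply(state))
                    if (parent.putIfAbsent(next, state) == null) queue.add(next);
            }
            return null;
        }

        public static void main(String[] args) {
            // Toy stand-in for cube states: reach 37 from 1 using moves +1 and *2.
            List<Integer> path = solve(1, 37, n -> List.of(n + 1, n * 2));
            System.out.println(path);  // e.g. [1, 2, 4, 8, 9, 18, 36, 37]
        }
    }

The parent map doubles as the visited set, which is what keeps the search from revisiting states; for a cube, the memory this map consumes is exactly why plain BFS stops scaling beyond small cases.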
For your reference, you can certainly look at this Java implementation. It uses a two-phase algorithm to solve the Rubik's cube. I have tried this code and it works as well.
One solution is, I guess, to simultaneously run all possible routes. That sounds stupid, but here's the logic: over 99% of possible scrambles are solvable in under 20 moves, which means that although you cycle through huge numbers of possibilities, you will still get there eventually. Essentially, this works by taking the scrambled cube as your first step. Then you store a new cube in a variable for each possible move on that first cube, and for each of these new cubes you do the same thing. After each possible move you check whether the cube is complete; if so, that is the solution. To make sure you can recover the solution, each cube needs an extra piece of data recording the moves taken to reach that state.

Solving nonlinear equations numerically

I need to solve nonlinear minimization (least residual squares of N unknowns) problems in my Java program. The usual way to solve these is the Levenberg-Marquardt algorithm. I have a couple of questions:
Does anybody have experience with the different LM implementations available? There exist slightly different flavors of LM, and I've heard that the exact implementation of the algorithm has a major effect on its numerical stability. My functions are pretty well-behaved, so this will probably not be a problem, but of course I'd like to choose one of the better alternatives. Here are some alternatives I've found:
FPL Statistics Group's Nonlinear Optimization Java Package. This includes a Java translation of the classic Fortran MINPACK routines.
JLAPACK, another Fortran translation.
Optimization Algorithm Toolkit.
Javanumerics.
Some Python implementation. Pure Python would be fine, since it can be compiled to Java with jythonc.
Are there any commonly used heuristics to do the initial guess that LM requires?
In my application I need to set some constraints on the solution, but luckily they are simple: I just require that the solutions (in order to be physical solutions) are nonnegative. Slightly negative solutions are a result of measurement inaccuracies in the data, and should obviously be zero. I was thinking of using "regular" LM but iterating so that if one of the unknowns becomes negative, I set it to zero and re-solve the rest from that. Real mathematicians will probably laugh at me, but do you think this could work?
Thanks for any opinions!
Update: This is not rocket science. The number of parameters to solve (N) is at most 5, and the data sets are barely big enough to make solving possible, so I believe Java is quite efficient enough to solve this. And I believe that this problem has been solved numerous times by clever applied mathematicians, so I'm just looking for some ready solution rather than cooking my own. E.g., scipy.optimize.minpack.leastsq would probably be fine if it were pure Python...
The closer your initial guess is to the solution, the faster you'll converge.
You said it was a non-linear problem. You can do a least squares solution that's linearized. Maybe you can use that solution as a first guess. A few non-linear iterations will tell you something about how good or bad an assumption that is.
Another idea would be trying another optimization algorithm. Genetic and ant colony algorithms can be a good choice if you can run them on many CPUs. They also don't require continuous derivatives, so they're nice if you have discrete, discontinuous data.
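As a concrete example of the linearization idea: if you are fitting something like y = a * exp(b * x) (the exponential model is my assumption here, purely for illustration), taking logs gives ln y = ln a + b * x, which is an ordinary linear least-squares fit. A sketch of using that fit as the LM starting point:

    public class LinearizedGuess {
        // Fit ln(y) = c0 + c1 * x by ordinary least squares and map back to (a, b).
        static double[] initialGuess(double[] x, double[] y) {
            int n = x.length;
            double sx = 0, sz = 0, sxx = 0, sxz = 0;
            for (int i = 0; i < n; i++) {
                double z = Math.log(y[i]);  // assumes y[i] > 0
                sx += x[i]; sz += z; sxx += x[i] * x[i]; sxz += x[i] * z;
            }
            double b = (n * sxz - sx * sz) / (n * sxx - sx * sx);  // slope -> b
            double a = Math.exp((sz - b * sx) / n);                // intercept -> a
            return new double[] { a, b };  // starting point for the LM iteration
        }
    }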
You should not use an unconstrained solver if your problem has constraints. For instance, if you know that some of your variables must be nonnegative, you should tell this to your solver.
If you are happy to use Scipy, I would recommend scipy.optimize.fmin_l_bfgs_b. You can place simple bounds on your variables with L-BFGS-B.
Note that L-BFGS-B takes a general nonlinear objective function, not just a nonlinear least-squares problem.
I agree with codehippo; I think that the best way to solve problems with constraints is to use algorithms which are specifically designed to deal with them. The L-BFGS-B algorithm should probably be a good solution in this case.
However, if using python's scipy.optimize.fmin_l_bfgs_b module is not a viable option in your case (because you are using Java), you can try using a library I have written: a Java wrapper for the original Fortran code of the L-BFGS-B algorithm. You can download it from http://www.mini.pw.edu.pl/~mkobos/programs/lbfgsb_wrapper and see if it matches your needs.
The FPL package is quite reliable but has a few quirks (array access starts at 1) due to its very literal interpretation of the old Fortran code. The LM method itself is quite reliable if your function is well behaved. A simple way to enforce nonnegativity constraints is to use the squares of the parameters instead of the parameters directly. This can introduce spurious solutions, but for simple models these are easy to screen out.
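The squaring trick looks like this in code: the solver works with an unconstrained parameter t, and the model only ever sees x = t * t, which can never go negative. The residual function below is a hypothetical stand-in, not from any of the libraries mentioned:

    public class SquaredParams {
        // Hypothetical model that is only defined for a nonnegative parameter x.
        static double model(double x) {
            return Math.sqrt(x) + x;  // placeholder model, x >= 0 only
        }

        static double residual(double x, double observed) {
            return model(x) - observed;
        }

        // What the unconstrained LM solver actually optimizes: t ranges over all
        // reals, but the model only ever sees x = t * t >= 0.
        static double residualUnconstrained(double t, double observed) {
            return residual(t * t, observed);
        }
    }

After convergence you recover the physical parameter as x = t * t; the spurious-solution caveat comes from the sign ambiguity (t and -t give the same x).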
There is code available for a "constrained" LM method: look at http://www.physics.wisc.edu/~craigm/idl/fitting.html for mpfit. There are a Python version (which unfortunately relies on the old Numeric package) and a C version. The LM method is around 1500 lines of code, so you might be inclined to port the C to Java. In fact, the "constrained" LM method is not much different from the method you envisioned: in mpfit, the code adjusts the step size relative to bounds on the variables. I've had good results with mpfit as well.
I don't have that much experience with BFGS, but the code is much more complex and I've never been clear on the licensing of the code.
Good luck.
I haven't actually used any of those Java libraries so take this with a grain of salt: based on the backends I would probably look at JLAPACK first. I believe LAPACK is the backend of Numpy, which is essentially the standard for doing linear algebra/mathematical manipulations in Python. At least, you definitely should use a well-optimized C or Fortran library rather than pure Java, because for large data sets these kinds of tasks can become extremely time-consuming.
For creating the initial guess, it really depends on what kind of function you're trying to fit (and what kind of data you have). Basically, just look for some relatively quick (probably O(N) or better) computation that will give an approximate value for the parameter you want. (I recently did this with a Gaussian distribution in Numpy and I estimated the mean as just average(values, weights = counts) - that is, a weighted average of the counts in the histogram, which was the true mean of the data set. It wasn't the exact center of the peak I was looking for, but it got close enough, and the algorithm went the rest of the way.)
As for keeping the constraints positive, your method seems reasonable. Since you're writing a program to do the work, maybe just make a boolean flag that lets you easily enable or disable the "force-non-negative" behavior, and run it both ways for comparison. Only if you get a large discrepancy (or if one version of the algorithm takes unreasonably long) might it be something to worry about. (And REAL mathematicians would do least-squares minimization analytically, from scratch ;-P so I think you're the one who can laugh at them.... kidding. Maybe.)
