Using Java with Nvidia GPUs (CUDA)

I'm working on a business project that is done in Java, and it needs huge computation power to compute business markets. The math is simple, but there is a huge amount of data.
We ordered some CUDA GPUs to try it with, and since Java is not supported by CUDA, I'm wondering where to start. Should I build a JNI interface? Should I use JCUDA, or are there other ways?
I don’t have experience in this field, and I would like it if someone could direct me to something so I can start researching and learning.

First of all, you should be aware of the fact that CUDA will not automagically make computations faster. On the one hand, because GPU programming is an art, and it can be very, very challenging to get it right. On the other hand, because GPUs are well-suited only for certain kinds of computations.
This may sound confusing, because you can basically compute anything on the GPU. The key point is, of course, whether you will achieve a good speedup or not. The most important classification here is whether a problem is task parallel or data parallel. The first one refers, roughly speaking, to problems where several threads are working on their own tasks, more or less independently. The second one refers to problems where many threads are all doing the same - but on different parts of the data.
The latter is the kind of problem that GPUs are good at: They have many cores, and all the cores do the same, but operate on different parts of the input data.
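As a rough CPU-side illustration of the data-parallel pattern described above, here is a minimal sketch in plain Java. The class and method names are made up for this example, and it runs on CPU threads via parallel streams rather than on a GPU. It only shows the shape of the problem: the same operation is applied independently to every element.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical example class; not part of any GPU library.
final class DataParallelSketch {

    // Applies the same operation (square, then scale) to every element.
    // Each index is processed independently of all others, which is
    // what makes the problem data parallel.
    static double[] squareAndScale(double[] input, double factor) {
        double[] output = new double[input.length];
        IntStream.range(0, input.length)
                 .parallel()
                 .forEach(i -> output[i] = input[i] * input[i] * factor);
        return output;
    }

    public static void main(String[] args) {
        double[] result = squareAndScale(new double[] {1.0, 2.0, 3.0}, 0.5);
        System.out.println(Arrays.toString(result)); // prints [0.5, 2.0, 4.5]
    }
}
```

On a GPU, the body of the lambda would roughly correspond to the kernel, and the index i to the thread index.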
You mentioned that you have "simple math but with huge amount of data". Although this may sound like a perfectly data-parallel problem and thus like it was well-suited for a GPU, there is another aspect to consider: GPUs are ridiculously fast in terms of theoretical computational power (FLOPS, Floating Point Operations Per Second). But they are often throttled down by the memory bandwidth.
This leads to another classification of problems. Namely whether problems are memory bound or compute bound.
The first one refers to problems where the number of instructions that are done for each data element is low. For example, consider a parallel vector addition: You'll have to read two data elements, then perform a single addition, and then write the sum into the result vector. You will not see a speedup when doing this on the GPU, because the single addition does not compensate for the efforts of reading/writing the memory.
The second term, "compute bound", refers to problems where the number of instructions is high compared to the number of memory reads/writes. For example, consider a matrix multiplication: The number of instructions will be O(n^3) when n is the size of the matrix. In this case, one can expect that the GPU will outperform a CPU at a certain matrix size. Another example could be when many complex trigonometric computations (sine/cosine etc) are performed on "few" data elements.
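To put rough numbers on the memory-bound versus compute-bound distinction, the following sketch counts arithmetic operations per memory access for the two examples above. It assumes a deliberately simplified model (naive algorithms, every input element read once, every output element written once, no caches), and the class name is made up for illustration.

```java
// Hypothetical helper for back-of-the-envelope estimates only.
final class ArithmeticIntensitySketch {

    // Vector addition of n elements: n additions,
    // 2n element reads plus n element writes.
    static double vectorAdd(long n) {
        double operations = n;
        double memoryAccesses = 3.0 * n;
        return operations / memoryAccesses;
    }

    // Naive n x n matrix multiplication: about 2 * n^3 operations
    // (n^3 multiplications and roughly n^3 additions). In the ideal case
    // each of the 2 * n^2 inputs is read once and each of the n^2 outputs
    // is written once.
    static double matrixMultiply(long n) {
        double operations = 2.0 * n * n * n;
        double memoryAccesses = 3.0 * n * n;
        return operations / memoryAccesses;
    }

    public static void main(String[] args) {
        for (long n : new long[] {10, 1_000, 100_000}) {
            System.out.printf("n=%d  vectorAdd=%.3f  matrixMultiply=%.1f%n",
                    n, vectorAdd(n), matrixMultiply(n));
        }
    }
}
```

Under this model the ratio for vector addition stays at 1/3 regardless of n, while the ratio for matrix multiplication grows linearly with n, which is why only the latter can be expected to keep the GPU cores busy.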
As a rule of thumb: You can assume that reading/writing one data element from the "main" GPU memory has a latency of about 500 instructions....
Therefore, another key point for the performance of GPUs is data locality: If you have to read or write data (and in most cases, you will have to ;-)), then you should make sure that the data is kept as close as possible to the GPU cores. GPUs thus have certain memory areas (referred to as "local memory" or "shared memory") that are usually only a few KB in size, but particularly efficient for data that is about to be involved in a computation.
So to emphasize this again: GPU programming is an art, that is only remotely related to parallel programming on the CPU. Things like Threads in Java, with all the concurrency infrastructure like ThreadPoolExecutors, ForkJoinPools etc. might give the impression that you just have to split your work somehow and distribute it among several processors. On the GPU, you may encounter challenges on a much lower level: Occupancy, register pressure, shared memory pressure, memory coalescing ... just to name a few.
However, when you have a data-parallel, compute-bound problem to solve, the GPU is the way to go.
A general remark: You specifically asked for CUDA. But I'd strongly recommend that you also have a look at OpenCL. It has several advantages. First of all, it's a vendor-independent, open industry standard, and there are implementations of OpenCL by AMD, Apple, Intel and NVIDIA. Additionally, there is much broader support for OpenCL in the Java world. The only case where I'd rather settle for CUDA is when you want to use the CUDA runtime libraries, like CUFFT for FFT or CUBLAS for BLAS (Matrix/Vector operations). Although there are approaches for providing similar libraries for OpenCL, they cannot be used directly from the Java side, unless you create your own JNI bindings for these libraries.
You might also find it interesting to hear that in October 2012, the OpenJDK HotSpot group started the project "Sumatra": http://openjdk.java.net/projects/sumatra/ . The goal of this project is to provide GPU support directly in the JVM, with support from the JIT. The current status and first results can be seen in their mailing list at http://mail.openjdk.java.net/mailman/listinfo/sumatra-dev
However, a while ago, I collected some resources related to "Java on the GPU" in general. I'll summarize these again here, in no particular order.
(Disclaimer: I'm the author of http://jcuda.org/ and http://jocl.org/ )
(Byte)code translation and OpenCL code generation:
https://github.com/aparapi/aparapi : An open-source library that is created and actively maintained by AMD. In a special "Kernel" class, one can override a specific method which should be executed in parallel. The byte code of this method is loaded at runtime using its own bytecode reader. The code is translated into OpenCL code, which is then compiled using the OpenCL compiler. The result can then be executed on the OpenCL device, which may be a GPU or a CPU. If the compilation into OpenCL is not possible (or no OpenCL is available), the code will still be executed in parallel, using a Thread Pool.
https://github.com/pcpratts/rootbeer1 : An open-source library for converting parts of Java into CUDA programs. It offers dedicated interfaces that may be implemented to indicate that a certain class should be executed on the GPU. In contrast to Aparapi, it tries to automatically serialize the "relevant" data (that is, the complete relevant part of the object graph!) into a representation that is suitable for the GPU.
https://code.google.com/archive/p/java-gpu/ : A library for translating annotated Java code (with some limitations) into CUDA code, which is then compiled into a library that executes the code on the GPU. The library was developed in the context of a PhD thesis, which contains profound background information about the translation process.
https://github.com/ochafik/ScalaCL : Scala bindings for OpenCL. Allows special Scala collections to be processed in parallel with OpenCL. The functions that are called on the elements of the collections can be usual Scala functions (with some limitations) which are then translated into OpenCL kernels.
Language extensions
http://www.ateji.com/px/index.html : A language extension for Java that allows parallel constructs (e.g. parallel for loops, OpenMP style) which are then executed on the GPU with OpenCL. Unfortunately, this very promising project is no longer maintained.
http://www.habanero.rice.edu/Publications.html (JCUDA) : A library that can translate special Java Code (called JCUDA code) into Java- and CUDA-C code, which can then be compiled and executed on the GPU. However, the library does not seem to be publicly available.
https://www2.informatik.uni-erlangen.de/EN/research/JavaOpenMP/index.html : Java language extension for OpenMP constructs, with a CUDA backend
Java OpenCL/CUDA binding libraries
https://github.com/ochafik/JavaCL : Java bindings for OpenCL: An object-oriented OpenCL library, based on auto-generated low-level bindings
http://jogamp.org/jocl/www/ : Java bindings for OpenCL: An object-oriented OpenCL library, based on auto-generated low-level bindings
http://www.lwjgl.org/ : Java bindings for OpenCL: Auto-generated low-level bindings and object-oriented convenience classes
http://jocl.org/ : Java bindings for OpenCL: Low-level bindings that are a 1:1 mapping of the original OpenCL API
http://jcuda.org/ : Java bindings for CUDA: Low-level bindings that are a 1:1 mapping of the original CUDA API
Miscellaneous
http://sourceforge.net/projects/jopencl/ : Java bindings for OpenCL. Seems to have been unmaintained since 2010
http://www.hoopoe-cloud.com/ : Java bindings for CUDA. Seems to be no longer maintained

From the research I have done, if you are targeting Nvidia GPUs and have decided to use CUDA over OpenCL, I found three ways to use the CUDA API in Java.
JCuda (or an alternative) - http://www.jcuda.org/. This seems like the best solution for the problems I am working on. Many libraries, such as CUBLAS, are available in JCuda. Kernels are still written in C, though.
JNI - JNI interfaces are not my favorite to write, but they are very powerful and would allow you to do anything CUDA can do.
JavaCPP - This basically lets you make a JNI interface in Java without writing C code directly. There is an example of how to use this with CUDA Thrust in the question "What is the easiest way to run working CUDA code in Java?". To me, this seems like you might as well just write a JNI interface.
All of these answers basically are just ways of using C/C++ code in Java. You should ask yourself why you need to use Java and if you can't do it in C/C++ instead.
If you like Java and know how to use it and don't want to work with all the pointer management and what-not that comes with C/C++ then JCuda is probably the answer. On the other hand, the CUDA Thrust library and other libraries like it can be used to do a lot of the pointer management in C/C++ and maybe you should look at that.
If you like C/C++ and don't mind pointer management, but there are other constraints forcing you to use Java, then JNI might be the best approach. Though, if your JNI methods are just going to be wrappers for kernel commands, you might as well just use JCuda.
There are a few alternatives to JCuda, such as Cuda4J and Rootbeer, but those do not seem to be maintained, whereas at the time of writing JCuda supports CUDA 10.1, which is the most up-to-date CUDA SDK.
Additionally, there are a few Java libraries that use CUDA, such as deeplearning4j and Hadoop, that may be able to do what you are looking for without requiring you to write kernel code directly. I have not looked into them too much, though.

I'd start by using one of the projects out there for Java and CUDA: http://www.jcuda.org/

Marco13 already provided an excellent answer.
In case you are in search of a way to use the GPU without implementing CUDA/OpenCL kernels, I would like to add a reference to the finmath-lib-cuda-extensions (finmath-lib-gpu-extensions) http://finmath.net/finmath-lib-cuda-extensions/ (disclaimer: I am the maintainer of this project).
The project provides an implementation of "vector classes", to be precise, an interface called RandomVariable, which provides arithmetic operations and reduction on vectors. There are implementations for the CPU and GPU. There are implementations using algorithmic differentiation or plain valuations.
The performance improvements on the GPU are currently small (but for vectors of size 100,000 you may get a performance improvement of a factor > 10). This is due to the small kernel sizes. This will improve in a future version.
The GPU implementations use JCuda and JOCL and are available for Nvidia and ATI GPUs.
The library is Apache 2.0 and available via Maven Central.

There is not much information on the nature of the problem and the data, so it is difficult to advise. However, I would recommend assessing the feasibility of other solutions that may be easier to integrate with Java and that enable horizontal as well as vertical scaling. The first I would suggest looking at is an open-source analytical engine called Apache Spark https://spark.apache.org/ that is available on Microsoft Azure, and probably on other cloud IaaS providers too. If you stick with using your GPU, then the suggestion is to look at the GPU-supported analytical databases on the market that fit within the budget of your organisation.

Related

Java best practices for vectorized computations

I'm researching methods for computing expensive vector operations in Java, e.g. dot-products or multiplications between large matrices. There are a few good threads on here on this topic. It appears that there is no reliable way of having the JIT compile code to use CPU vector instructions (SSE2, AVX, MMX...). Moreover, high-performance linear algebra libraries (ND4J, jblas, ...) do in fact make JNI calls to BLAS/LAPACK libraries for the core routines. And I understand BLAS/LAPACK packages to be the de facto standard choices for native linear algebra computations.
On the other hand others (JAMA, ...) implement algorithms in pure Java without native calls.
My questions are:
What are the best practices here?
Is making native calls to BLAS/LAPACK actually a recommended choice? Are there other libraries worth considering?
Is the overhead of JNI calls negligible compared to the performance gain? Does anyone have experience as to where the threshold lies (e.g. how small an input should be to make JNI calls more expensive than a pure Java routine?)
How big is the portability tradeoff?
I hope this question could be of help both for those who develop their own computation routines, and for those who just want to make an educated choice between different implementations.
Insights are appreciated!

There are no clear best practices for every case. Whether you could/should use a pure Java solution (not using SIMD instructions) or (optimized with SIMD) native code through JNI depends on your particular application and specifically the size of your arrays and possible restrictions on the target system.
There could be a requirement that you are not allowed to install specific native libraries in the target system and BLAS is not already installed. In that case you simply have to use a Java library.
Pure Java libraries tend to perform better for arrays with length much smaller than 100 and at some point after that you get better performance using native libraries through JNI. As always, your mileage may vary.
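The size-based trade-off above could be sketched as a simple dispatch. This is only an illustration under stated assumptions: the class name, the threshold constant and the nativeDot parameter are hypothetical, the crossover length has to be measured on the target system, and the native implementation is merely represented by a function parameter rather than a real JNI binding.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical dispatcher; no real BLAS binding is used here.
final class DotProductDispatchSketch {

    // Assumed crossover length; the right value must be benchmarked.
    static final int NATIVE_THRESHOLD = 100;

    // Plain Java dot product, no native calls.
    static double pureJavaDot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Picks an implementation by input size. 'nativeDot' stands in for a
    // JNI-backed BLAS call; pass null if no native library is available.
    static double dot(double[] a, double[] b,
                      ToDoubleBiFunction<double[], double[]> nativeDot) {
        if (nativeDot != null && a.length >= NATIVE_THRESHOLD) {
            return nativeDot.applyAsDouble(a, b);
        }
        return pureJavaDot(a, b);
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0};
        double[] b = {4.0, 5.0, 6.0};
        System.out.println(dot(a, b, null)); // prints 32.0
    }
}
```

Passing null for nativeDot also covers the case where native libraries may not be installed on the target system.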
Pertinent benchmarks have been performed (in random order):
http://ojalgo.org/performance_ejml.html
http://lessthanoptimal.github.io/Java-Matrix-Benchmark/
Performance of Java matrix math libraries?
These benchmarks can be as confusing as they are informative. One library may be faster for some operations and slower for others. Also keep in mind that there may be more than one implementation of BLAS available for your system. I currently have 3 installed on my system: blas, atlas and openblas. Apart from choosing a Java library wrapping a BLAS implementation, you also have to choose the underlying BLAS implementation.
This answer has a fairly up-to-date list, except that it doesn't mention nd4j, which is rather new. Keep in mind that jeigen depends on eigen, so not on BLAS.

Is it possible to create a Hybrid GPU accelerated application in Java that utilizes CUDA & OpenCL?

In my experience, applications written in CUDA run faster than those written in OpenCL when run on the same NVidia hardware.
How can this capability be utilized without losing the cross-platform capabilities of OpenCL?
I suspect it may be possible to create a "fallback" system where, if there are no NVidia devices available and/or no CUDA version of the requested kernel, the system would fall back to utilizing the OpenCL version. Alternatively, large tasks could be load balanced across NVidia and non-NVidia hardware. Ideally such an application would need to be cross platform and also function on machines that don't have NVidia hardware available.
As far as I can tell, this boils down to being able to utilize CUDA support as dynamic libraries (dll/.so). I am already using JOCL to access OpenCL but I don't see how I would be able to bind to kernels generated with CUDA as all examples I'm able to find are stand-alone applications.
Are there any open-source examples of such systems?
Are there any technical limitations that make developing such a hybrid application impossible?

Answering the question:
The development is possible, and you can do it without any problem with the tools mentioned in the comments (for example, JCUDA and JOCL). A quick Google search will bring up many free wrappers that bring CUDA and OpenCL to Java.
As for the failsafe, CUDA_ERROR_INVALID_DEVICE will be returned when initializing CUDA on a non-CUDA system in JCUDA. JOCL will give a similar error at the initialization stage. Then you can simply select the one that didn't fail, or the best one for you (or, as a last resort, CPU-only code in Java).
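The selection logic could look roughly like the following sketch. It is a simulation only: the Backend interface and all names are hypothetical, and the failing candidates stand in for real JCUDA or JOCL initialization code, which is not shown here.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of "try CUDA, then OpenCL, then plain Java".
final class BackendFallbackSketch {

    // Hypothetical abstraction; a real version would wrap JCUDA or JOCL calls.
    interface Backend {
        String name();
    }

    // Small helper that creates a backend with the given name.
    static Backend named(String name) {
        return () -> name;
    }

    // Tries each candidate in order and returns the first one whose
    // initialization does not throw. The last candidate should be a
    // plain-Java CPU implementation that always succeeds.
    static Backend selectBackend(List<Supplier<Backend>> candidates) {
        for (Supplier<Backend> candidate : candidates) {
            try {
                return candidate.get();
            } catch (RuntimeException | LinkageError e) {
                // For example: no device found, or the native library
                // could not be loaded. Move on to the next candidate.
            }
        }
        throw new IllegalStateException("no backend available");
    }

    public static void main(String[] args) {
        Supplier<Backend> cuda = () -> {
            throw new IllegalStateException("CUDA init failed (simulated)");
        };
        Supplier<Backend> openCl = () -> {
            throw new UnsatisfiedLinkError("OpenCL library missing (simulated)");
        };
        Supplier<Backend> cpu = () -> named("cpu");

        Backend chosen = selectBackend(List.of(cuda, openCl, cpu));
        System.out.println(chosen.name()); // prints cpu
    }
}
```

Catching LinkageError as well as RuntimeException is a design choice for this sketch: on a machine without the native libraries, loading a binding may fail before any device query is made.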
However, I cannot understand the background of your question, since I couldn't find any situation where OpenCL was slower than CUDA. At least, not in the latest versions of the standards. And my personal usage has shown that in some cases OpenCL is even faster (+-5%). Of course you need to implement both properly; otherwise, one of them will be heavily penalized by a wrong deployment.
You would be better off using just one of the two options, either CUDA (if you find it easier and it gives you good performance without any headache) or OpenCL (for flexibility). Using both, maintaining both, and properly selecting the useful one for each case, as well as having to deal with the failsafe code, will make your project terribly difficult.

Maybe also have a look at OpenCL, which, in theory, should be a bit more cross-platform and also allows you to run transparently on different processors (read: GPU and/or CPU as available).

Is there a Java library for accelerated vector computations?

I'm looking for a Java lib that allows fast computations with vectors (and maybe matrices too).
By fast I mean that it takes advantage of GPU processing and/or SSE instructions. I'm wondering if it is possible to find something that is as portable as possible. I recognize that the JVM provides a thick abstraction layer over the hardware.
I've come across JCUDA, but there's a drawback: on a computer without an Nvidia graphics card it has to be run in emulation mode (so I have come to believe it will not be as efficient as expected). Has anyone already tried it?

What about OpenCL? It should provide you with a good starting point for this kind of optimized operation.
There exist many bindings for Java, starting with jocl (but also take a look at JavaCL or LWJGL, which added support in 2.6).

If by fast you mean high speed rather than requiring support for your particular hardware, I'd recommend Colt. Vectors are called 1-d matrices in this library.

I'd recommend using UJMP (wraps most if not all of the high-speed Java matrix libraries) and wait for a decent GPGPU implementation to be written for it (I started hacking it with JavaCL a while ago, but it needs some serious rewrite, maybe using ScalaCLv2 that's in the works).

Best approach for GPGPU/CUDA/OpenCL in Java?

General-purpose computing on graphics processing units (GPGPU) is a very attractive concept to harness the power of the GPU for any kind of computing.
I'd love to use GPGPU for image processing, particles, and fast geometric operations.
Right now, it seems the two contenders in this space are CUDA and OpenCL. I'd like to know:
Is OpenCL usable yet from Java on Windows/Mac?
What libraries or other ways are there to interface with OpenCL/CUDA?
Is using JNA directly an option?
Am I forgetting something?
Any real-world experience/examples/war stories are appreciated.

AFAIK, JavaCL / OpenCL4Java is the only OpenCL binding that is available on all platforms right now (including MacOS X, FreeBSD, Linux, Windows, Solaris, all in Intel 32, 64 bits and ppc variants, thanks to its use of JNA).
It has demos that actually run fine from Java Web Start at least on Mac and Windows (to avoid random crashes on Linux, please see this wiki page), such as this Particles Demo.
It also comes with a few utilities (GPGPU random number generation, basic parallel reduction, linear algebra) and a Scala DSL.
Finally, these are the oldest bindings available (since June 2009), and they have an active user community.
(Disclaimer: I'm JavaCL's author :-))

You may also consider Aparapi. It allows you to write your code in Java and will attempt to convert bytecode to OpenCL at runtime.
Full disclosure. I am the Aparapi developer.

Well, CUDA is a modification of C. To write a CUDA kernel you have to code in C and then compile it to executable form with Nvidia's CUDA compiler. The produced native code could then be linked with Java using JNI. So technically you can't write kernel code from Java. There is JCUDA http://www.jcuda.de/jcuda/JCuda.html, which provides you with CUDA's APIs for general memory/device management and some Java methods that are implemented in CUDA and JNI-wrapped (FFT, some linear algebra methods, etc.).
On the other hand, OpenCL is just an API. OpenCL kernels are plain strings passed to the API, so using OpenCL from Java you should be able to specify your own kernels. OpenCL bindings for Java can be found here http://www.jocl.org/.
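To illustrate the point that OpenCL kernels are plain strings, here is a minimal sketch that keeps the OpenCL C source of an element-wise vector addition in an ordinary Java string constant. The class and constant names are made up for this example, the kernel assumes that one work-item is launched per element, and the actual calls into a binding such as JOCL (creating a context, building the program, launching the kernel) are intentionally left out.

```java
// Hypothetical holder class; it only stores and prints the kernel source.
final class KernelSourceSketch {

    // OpenCL C source for an element-wise vector addition, held as an
    // ordinary Java string. With an OpenCL binding, this string would be
    // handed to the OpenCL API (clCreateProgramWithSource) and compiled
    // by the OpenCL driver at run time.
    static final String VECTOR_ADD_KERNEL =
            "__kernel void vectorAdd(__global const float *a,\n" +
            "                        __global const float *b,\n" +
            "                        __global float *c)\n" +
            "{\n" +
            "    int gid = get_global_id(0);\n" +
            "    c[gid] = a[gid] + b[gid];\n" +
            "}\n";

    public static void main(String[] args) {
        System.out.println(VECTOR_ADD_KERNEL);
    }
}
```

Because the source is just a string, it can also be loaded from a file or generated at run time, without any separate native compilation step in the Java build.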

I've been using JOCL and I'm very happy with it.
The main disadvantage of OpenCL compared to CUDA (at least for me) is the lack of available libraries (Thrust, CUDPP, etc.). However, CUDA code can easily be ported to OpenCL, and looking at how those libraries work (algorithms, strategies, etc.) is actually very nice, as you learn a lot from it.

I know it's late but take a look at this: https://github.com/pcpratts/rootbeer1
I have not worked with it, but it seems much easier to use than other solutions.
From the project page:
Rootbeer is more advanced than CUDA or OpenCL Java Language Bindings. With bindings the developer must serialize complex graphs of objects into arrays of primitive types. With Rootbeer this is done automatically. Also with language bindings, the developer must write the GPU kernel in CUDA or OpenCL. With Rootbeer a static analysis of the Java Bytecode is done (using Soot) and CUDA code is automatically generated.

I can also recommend JOCL by jogamp.org, which works on Linux, Mac, and Windows. CONRAD, for example, makes heavy use of OpenCL in combination with JOCL.

If you want to do some image processing or geometric operations, you may want a linear algebra library with GPU support (with CUDA, for instance). I would suggest ND4J, which is the linear algebra library with CUDA GPU support on which DeepLearning4J is built. With that you don't have to deal with CUDA directly or write low-level code in C. Plus, if you want to do more with images using DL4J, you will have access to specific image processing operations such as convolution.

You can take a look at the CUDA4J API
http://sett.com/gpgpu/the-cuda4j-api

High volume SVM (machine learning) system

I'm working on a possible machine learning project that would be expected to do high-speed computations for machine learning using SVM (support vector machines) and possibly some ANN.
I'm reasonably comfortable working in matlab with these, but primarily on small datasets, just for experimentation. I'm wondering if this matlab-based approach will scale, or should I be looking into something else? C++ / GPU-based computing? Java wrapping of the matlab code and pushing it onto App Engine?
Incidentally, there seems to be a lot of literature on GPUs, but not much on how useful they are for machine learning applications using matlab, or on the cheapest CUDA-enabled GPU money can buy. Is it even worth the trouble?

I work on Pattern Recognition problems. Please let me give you some advice if you plan to work effectively on SVM/ANN problems and if you really don't have access to a computer cluster:
1) Don't use Matlab. Use Python and its large number of numerical libraries instead for visualisation/analysis of your computations.
2) Critical sections are better implemented in C. You can then integrate them with your Python scripts very easily.
3) CUDA/GPU is not a solution if you mostly deal with non-polynomial time complexity problems, which is typical in Machine Learning, so it brings no great speed-up; dot/matrix products are only a tiny part of SVM calculations, and you will still have to deal with feature extraction and list/object processing. Try instead to optimize your algorithms and devise effective algorithmic methods. If you need parallelism (e.g. for ANNs), use threads or processes.
4) Use the GCC compiler to compile your C program; it will build very fast executable code. To speed up numerical computations you can try GCC optimization flags (e.g. Streaming SIMD Extensions).
5) Run your program on any modern CPU under Linux.
For really good performance, use Linux clusters.

Both libsvm and SVM light have matlab interfaces. Besides, most learning tasks are trivially parallelizable, so take a look at matlab commands like parfor and the rest of the Parallel Computing Toolbox.

I would advise against using Matlab for anything beyond prototyping.
When the project becomes more complex and extensive, the proportion of your own code will grow versus the functionality provided by matlab and its toolboxes. The more developed the project becomes, the less you benefit from matlab and the more you need the features, libraries and, more importantly, practices, processes and tools of general-purpose languages.
Scaling of a matlab solution is achieved by interfacing with non-matlab code, and I've seen matlab projects turn into nothing more than glue calling modules written in general-purpose languages, causing everyday pains for everyone involved.
If you are comfortable with Java, I'd recommend using it together with some good math library (at least, you can always interface MKL). Even with recent Matlab optimisations, MKL + JVM are much faster, and scaling and maintainability are beyond comparison.
C++ with processor-specific intrinsics can provide better performance, but at the price of development time and maintainability. Adding CUDA improves performance further, but the amount of work and specific knowledge is hardly worth it, certainly not if you don't have prior experience with GPU calculations. As soon as you go beyond a single processor, it's much more effective to add another CPU or two to the system than to struggle with GPU calculations.

Nothing as of now will scale beyond a limit. libsvm has a tool for subset selection, to select a set of data points for training. Forget about ANN: it will not generalize, and there is no theory that helps to choose the number of hidden nodes, etc. It has to be manually optimized a lot and can get trapped in local minima. Go with SVM only.

Here you can find some semiparametric approximations that can work with a high volume of data very fast:
http://www.dabi.temple.edu/budgetedsvm/
https://robedm.github.io/LIBIRWLS/
