Java bindings for the CUDA runtime and driver API
With JCuda it is possible to interact with the CUDA runtime and driver
API from Java programs. JCuda is the common platform for all libraries
on this site.
You may obtain the latest version of JCuda and the JCuda source
code from the Downloads section.
The following features are currently provided by JCuda:
- Support for the CUDA driver API
- Possibility to load own modules in the driver API
- Support for the CUDA runtime API
- Full interoperability among different CUDA based libraries, namely
- JCublas - Java bindings for CUBLAS, the NVIDIA CUDA BLAS library
- JCufft - Java bindings for CUFFT, the NVIDIA CUDA FFT library
- JCudpp - Java bindings for the CUDA Data Parallel Primitives Library
- JCurand - Java bindings for CURAND, the NVIDIA CUDA random number generator
- JCusparse - Java bindings for CUSPARSE, the NVIDIA CUDA sparse matrix library
- Comprehensive API documentation extracted from the documentations of the native libraries
- OpenGL interoperability
- Convenient error handling (see the example below)
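Regarding the convenient error handling: by default, the JCuda methods return
error codes like their native counterparts, but they may be configured to throw
a CudaException instead. The following is only a minimal sketch of this mechanism
(the library classes such as JCufft offer the same method):
// Let the methods throw a CudaException instead of returning an error code
JCuda.setExceptionsEnabled(true);
try
{
    Pointer data = new Pointer();
    JCuda.cudaMalloc(data, 4 * Sizeof.FLOAT);
    JCuda.cudaFree(data);
}
catch (CudaException e)
{
    System.err.println("A CUDA operation failed: " + e.getMessage());
}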
Please note that not all functionalities have been tested extensively on all
operating systems, GPU devices and host architectures. There certainly are
more limitations, which will be documented here as soon as I
become aware of them.
You may either browse the JCuda API documentation online, or download the
JCuda documentation as a ZIP file from the Downloads section.
Most of the documentation is directly taken from the CUDA Reference Manual and
the CUDA Programming Guide from NVIDIA.
The main application of the JCuda runtime bindings is the interaction
with existing libraries that are built based upon the CUDA runtime API.
Some Java bindings for libraries using the CUDA runtime API are available
on this web site, namely,
- JCublas, the Java bindings for
CUBLAS, the NVIDIA CUDA BLAS library
- JCufft, the Java bindings for
CUFFT, the NVIDIA CUDA FFT library, and
- JCudpp, the Java bindings for
CUDPP, the CUDA Data Parallel Primitives Library
- JCurand, the Java bindings for
CURAND, the NVIDIA CUDA random number generator
- JCusparse, the Java bindings for
CUSPARSE, the NVIDIA CUDA sparse matrix library
The following snippet illustrates how one of these libraries may be used
with the JCuda runtime API.
You may also want to download the complete, compilable
JCuda runtime API sample from the samples page
that shows how to use
the runtime libraries.
// Allocate memory on the device and copy the host data to the device
float hostData[] = createInputData();
Pointer deviceData = new Pointer();
cudaMalloc(deviceData, memorySize);
cudaMemcpy(deviceData, Pointer.to(hostData), memorySize,
    cudaMemcpyKind.cudaMemcpyHostToDevice);

// Perform in-place complex-to-complex 1D transforms using JCufft
cufftHandle plan = new cufftHandle();
JCufft.cufftPlan1d(plan, complexElements, cufftType.CUFFT_C2C, 1);
JCufft.cufftExecC2C(plan, deviceData, deviceData, JCufft.CUFFT_FORWARD);

// Copy the result from the device to the host and clean up
cudaMemcpy(Pointer.to(hostData), deviceData, memorySize,
    cudaMemcpyKind.cudaMemcpyDeviceToHost);
JCufft.cufftDestroy(plan);
cudaFree(deviceData);
The main usage of the JCuda driver bindings is to load PTX and CUBIN
modules and execute kernels from a Java application.
The following code snippet illustrates the basic steps of how to load a
module from a PTX or CUBIN file using the JCuda driver bindings, and how to
execute a kernel from the module.
You may also want to download a complete
JCuda driver sample from
the samples page.
// Initialize the driver and create a context for the first device.
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);

// Load the module (e.g. a PTX file) that contains the kernel.
CUmodule module = new CUmodule();
cuModuleLoad(module, "module.ptx");

// Obtain a handle to the kernel function.
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "functionName");

// Allocate the device input data, and copy the
// host input data to the device
CUdeviceptr deviceData = new CUdeviceptr();
cuMemAlloc(deviceData, memorySize);
cuMemcpyHtoD(deviceData, Pointer.to(hostData), memorySize);

// Set up the kernel parameters: A pointer to an array of
// pointers which point to the actual parameter values
Pointer kernelParameters = Pointer.to(
    Pointer.to(deviceData)
);

// Call the kernel function.
cuLaunchKernel(function,
    gx, gy, gz,               // Grid dimension
    bx, by, bz,               // Block dimension
    sharedMemorySize, stream, // Shared memory size and stream
    kernelParameters, null    // Kernel- and extra parameters
);
cuCtxSynchronize();

// Copy the data back from the device to the host and clean up
cuMemcpyDtoH(Pointer.to(hostData), deviceData, memorySize);
cuMemFree(deviceData);
Just as CUDA supports interoperability with OpenGL, JCuda supports interoperability
with OpenGL in Java, for example via JOGL or LWJGL.
The OpenGL interoperability makes it possible to access memory that is bound
to OpenGL from JCuda. Thus, JCuda can be used to write vertex coordinates
that are computed in a CUDA kernel into Vertex Buffer Objects
(VBO), or pixel data into Pixel Buffer Objects (PBO). These objects may then be
rendered efficiently using JOGL or LWJGL. Additionally, JCuda allows CUDA kernels
to access data that is created on Java side efficiently via texture references.
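The basic access pattern follows the CUDA graphics interoperability API. The
following is only a rough sketch, assuming that a VBO has already been created
with JOGL or LWJGL and that its OpenGL buffer name is stored in vboId:
// Register the OpenGL VBO for access by CUDA
cudaGraphicsResource vboResource = new cudaGraphicsResource();
cudaGraphicsGLRegisterBuffer(vboResource, vboId,
    cudaGraphicsRegisterFlags.cudaGraphicsRegisterFlagsNone);

// Map the buffer and obtain a device pointer that a kernel may write to
cudaGraphicsMapResources(1, new cudaGraphicsResource[]{ vboResource }, null);
Pointer vboPointer = new Pointer();
long size[] = { 0 };
cudaGraphicsResourceGetMappedPointer(vboPointer, size, vboResource);

// ... launch a kernel that writes vertex data to vboPointer ...

// Unmap the buffer so that it may be rendered with OpenGL again
cudaGraphicsUnmapResources(1, new cudaGraphicsResource[]{ vboResource }, null);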
There are some samples for JCuda OpenGL
interaction on the samples page.
The following image is a screenshot of one of the sample applications that reads
volume data from an input file, copies it into a 3D texture, uses a CUDA kernel
to render the volume data into a PBO, and displays the resulting PBO with JOGL.
It uses the kernels from the volume rendering
sample from the NVIDIA CUDA samples web site.
The most obvious limitation of Java compared to C is the lack of real pointers.
All objects in Java are implicitly accessed via references. Arrays or objects
are created using the new keyword, as it is done in C++. References may not be
dereferenced explicitly, as pointers may be in C/C++. So there are similarities
between C/C++ pointers and Java references (and the name of the
NullPointerException is not a coincidence). But nevertheless,
references are not suitable for emulating native pointers, since they do not allow
pointer arithmetic, and may not be passed to the native libraries. Additionally,
"references to references" are not possible.
To overcome these limitations, the Pointer class
has been introduced in JCuda. It may be treated similar to a void
pointer in C, and thus may be used for native
host or device memory, and for Java memory:
Pointer devicePointer = new Pointer();
JCuda.cudaMalloc(devicePointer, 4 * Sizeof.FLOAT);
float array[] = new float[8];
Pointer hostPointer = Pointer.to(array);
Pointer hostPointerWithOffset = hostPointer.withByteOffset(2 * Sizeof.FLOAT);
// Copy 4 float values, starting at index 2 of the array, to the device
JCuda.cudaMemcpy(devicePointer, hostPointerWithOffset, 4 * Sizeof.FLOAT,
    cudaMemcpyKind.cudaMemcpyHostToDevice);
Pointers may either be created by instantiating a new Pointer, which initially
will be a null pointer, or by passing either a (direct or
array-based) Buffer or a primitive Java array to one of the "to(...)" methods
of the Pointer class. Additionally, one may pass an array of Pointer objects
to the "to(...)" method, which is important to be able to allocate a 2D array
(i.e. an array of pointers) on the device, which may then be passed to a
library or kernel. See the JCuda driver API
example on the samples page for how to pass a 2D array to a kernel.
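As a rough sketch of this pattern (using the driver API, and assuming that
numColumns is the length of each row), a 2D array may be built by allocating
the rows individually and copying the row pointers to the device:
// Allocate two rows of device memory
CUdeviceptr rowA = new CUdeviceptr();
CUdeviceptr rowB = new CUdeviceptr();
cuMemAlloc(rowA, numColumns * Sizeof.FLOAT);
cuMemAlloc(rowB, numColumns * Sizeof.FLOAT);

// Allocate device memory for the array of row pointers, and copy the
// row pointers to the device. The resulting deviceRows pointer may be
// passed to a kernel that expects a float** parameter.
CUdeviceptr deviceRows = new CUdeviceptr();
cuMemAlloc(deviceRows, 2 * Sizeof.POINTER);
cuMemcpyHtoD(deviceRows, Pointer.to(new CUdeviceptr[]{ rowA, rowB }),
    2 * Sizeof.POINTER);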
NOTE: This section will be updated soon, according to the notes
about asynchronous operations that have been added in the final
releases of CUDA 4.1, and information about stream callbacks
that have been introduced in CUDA 5.0.
In previous versions of JCuda, asynchronous operations had not officially
been supported. This limitation actually applied only to a specific group
of operations, while other asynchronous operations have already been
working properly. The current version of JCuda may be used to perform
specific kinds of asynchronous operations, but note that
not all of them have been tested extensively under all conditions.
This section will summarize the different kinds of asynchronous operations,
and detail which limitations still exist in the current version.
Asynchronous operations in CUDA
CUDA offers various types of asynchronous operations. The most important ones are:
- the cudaMemcpyAsync functions in the Runtime API
- the cuMemcpy*Async functions in the Driver API
- the cuLaunchKernel function in the Driver API
Additionally, the runtime libraries offer methods to set a stream
that should be associated with the functions of the respective library, for example
cublasSetStream in CUBLAS. In general, all
these asynchronous functions return immediately when they are called,
although the result of the computation may not yet be available.
The stream and event handling functions may be used to achieve proper
synchronization between different tasks that may be associated with
different streams.
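As a sketch of this pattern (using JCublas2, and assuming static imports of the
JCuda and JCublas2 methods), a stream may be created and attached to a CUBLAS
handle like this:
// Create a stream and associate it with a CUBLAS handle, so that
// subsequent CUBLAS calls are enqueued into this stream
cudaStream_t stream = new cudaStream_t();
cudaStreamCreate(stream);
cublasHandle handle = new cublasHandle();
cublasCreate(handle);
cublasSetStream(handle, stream);

// ... enqueue CUBLAS operations and asynchronous copies here ...

// Wait until all operations in the stream have completed
cudaStreamSynchronize(stream);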
Limitations of the support for asynchronous operations in JCuda
The main difficulties with asynchronous operations arise when host
memory is involved. When an array is passed from Java to CUDA via JNI,
then the array may be copied, and CUDA may internally be working with
a copy of the array. The decision whether the memory is copied or not
is made by the Java Virtual Machine and can not be influenced by
the programmer. This basically means that the function either has
to block until the respective operation is completed, or the result
of the operation may be undefined.
Therefore, the following asynchronous operations are NOT supported
by JCuda, and may result in undefined behavior:
- cudaMemcpyAsync with a pointer to a Java array
- cuMemcpy*Async with a pointer to a Java array
Future versions of JCuda may even throw an exception when an attempt
is made to call such a function with invalid parameters.
Note that calling these functions with pointers to device memory is
allowed. Also note that this limitation explicitly refers to Java arrays,
and not to direct buffers.
It is also possible to create a pointer to a direct buffer. Such a buffer
may be created using the allocateDirect method of
the Java NIO ByteBuffer class, and consists of host memory that may directly
be accessed by the native functions. Therefore, it should be possible
to use these direct buffers in asynchronous operations, but this
has not been tested extensively yet.
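With the above caveat in mind, such an asynchronous copy from a direct buffer
might, as a sketch, look as follows (memorySize is assumed, and static imports
of the JCuda runtime methods are used):
// A direct buffer consists of native host memory that is not copied by JNI
ByteBuffer hostBuffer = ByteBuffer.allocateDirect(memorySize);
hostBuffer.order(ByteOrder.nativeOrder());

// Copy the buffer contents to the device asynchronously
Pointer deviceData = new Pointer();
cudaMalloc(deviceData, memorySize);
cudaStream_t stream = new cudaStream_t();
cudaStreamCreate(stream);
cudaMemcpyAsync(deviceData, Pointer.to(hostBuffer), memorySize,
    cudaMemcpyKind.cudaMemcpyHostToDevice, stream);

// Synchronize with the stream before accessing the buffer or the device data
cudaStreamSynchronize(stream);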
Asynchronous operations in CUBLAS and CUSPARSE
The most recent versions of CUBLAS and CUSPARSE (as defined in the headers
"cublas_v2.h" and "cusparse_v2.h") are
inherently asynchronous. This means that all functions return immediately
when they are called, although the result of the computation may not yet
be available. This does not impose any problems as long as the functions
do not involve host memory. However, in the newest versions of CUBLAS and
CUSPARSE, several functions have been introduced that may accept parameters
or return results of computations either via pointers to device memory or
via pointers to host memory.
These functions are also offered in JCublas2 and JCusparse2. When they
are called with pointers to device memory, they are executed asynchronously
and return immediately, writing the result to the device memory as soon
as the computation is finished. But due to the limitations described above,
this is not possible when they are called with pointers to Java arrays.
In this case, the functions will block until the computation has completed.
Note that the functions will not block when they receive a pointer to
a direct buffer, but this has not been tested extensively.
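As an illustration of the difference (only a sketch, assuming two device vectors
deviceX and deviceY of length n that have already been filled, and static imports
of the JCuda and JCublas2 methods), the dot product computation with JCublas2 may
either write its result to device memory or to a Java array:
cublasHandle handle = new cublasHandle();
cublasCreate(handle);

// Result written to device memory: the call returns immediately, and the
// result becomes available in deviceResult once the computation is finished
cublasSetPointerMode(handle, cublasPointerMode.CUBLAS_POINTER_MODE_DEVICE);
Pointer deviceResult = new Pointer();
cudaMalloc(deviceResult, Sizeof.FLOAT);
cublasSdot(handle, n, deviceX, 1, deviceY, 1, deviceResult);

// Result written to a Java array: the call blocks until the result
// has been computed and copied to the host
cublasSetPointerMode(handle, cublasPointerMode.CUBLAS_POINTER_MODE_HOST);
float hostResult[] = new float[1];
cublasSdot(handle, n, deviceX, 1, deviceY, 1, Pointer.to(hostResult));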