R has featured packages to support GPU programming for over five years. Beginning with the Gputools package, developers continue to introduce new and more sophisticated tools to take advantage of these powerful coprocessors. The gmatrix package is a recent continuation of this trend which offers some significant new features.

Background

GPUs, or graphical processing units, are the numerical engines powering the visual display of modern computers. They are a type of coprocessor, or computational unit distinct from the central processing unit, or CPU. When first introduced GPUs were discrete components, or cards. While powerful and memory-rich discrete cards are still popular, many processors now ship with GPUs contained in the same physical hardware (“die”) as the CPU. Whether discrete or resident on-die, though, the GPU offers high numerical performance, owing to the parallel computation at which it excels.

GPUs perform the computation necessary, for example, to render high quality three-dimensional moving images. While their success has largely been driven by the gaming industry, these coprocessors have come into their own as indispensible hardware for parallel computation. As numerical modellers discovered over a decade ago, the GPU’s vector processing capabilities can offer significant speedups for a wide variety of computational problems.

Programming the GPU can require specialized skills, but tools continue to be introduced to make their use more appealing. Although there are numer- ous GPU vendors and licensors - Nvidia, AMD, Qualcomm and Imagination Technologies, to name a few - there are two main families of programming tools: CUDA and OpenCL. CUDA is Nvidia’s proprietary software, tailored for its own products, while OpenCL is an open standard, intended to be vendor- agnostic. It is not necessary to be familiar with either family, or with parallel programming, to exploit the GPU’s power.

Coprocessors and R

Successive releases of R, and packages, have offerred increasingly improved support for parallel execution, both at the multicore and multiprocessor levels. Coprocessors, however, continue to represent a specialized niche. This may be due in part not only to their relatively arcane programming models, but also to the difficulty in knowing just when to use them. Moreover, while the advantages of high-memory, discrete GPUs are relatively well understood, on-die GPUs continue to evolve rapidly, and their ultimate utility as general-purpose coprocessors remains to be seen.

Some GPU-based R packages offer tuned versions of well-known operations, for example lm and matrix multiplication. The coprocessor implementations may merely be wrappers for OpenCL or CUDA library calls, they may be whole- cloth implementations, or they may be a combination of the two. In any case, the user merely invokes a command, and some or all of the computation is performed on the GPU, with the result sent back to the CPU.

Dispatch

While the GPU can accelerate such operations ten- or even hundred-fold, the actual advantage varies with characteristics of both the operands involved and the underlying hardware. There is no way to know in advance where the tradeoff lies, i.e., whether execution on the CPU or the GPU will be the better choice. Even when the tradeoff can be roughly estimated, there is no general dispatch mechanism to assign the operation to one piece of hardware versus another. A user seeking the best performance must typically make the choice explicitly, based on empirical evidence. In the case of numerical linear algebra, at least, the HiPLARM packages offers a solution, employing autotuned software capable of characterizing these tradeoffs on the resident hardware and dispatching the appropriate version of the command.

There are also packages which feature specialized solutions or workflows optimized for execution on the GPU. These include, for example, CudaBayesReg and WideLM. These packages were written to accelerate specific types of problems. Dispatch is less of an issue for such packages, as the user is essentially buying in to the higher-level functionality of the software.

API-centered packages

Two packages, RCuda and OpenCL, offer the user API-oriented interfaces to low-level GPU programming. As their respective names imply, these packages expose features of CUDA and OpenCL directly to the user. Thus developers of low-level GPU code can now work within the R environment.

The Rth package, while vendor-specific, has a different goal. Here the user is provided with wrappers to useful functions implemented with Thrust, a computational package built on top of CUDA. As Thrust remains under active development, the Rth package is able to leverage continuing improvements in high-performance GPU implementations.

The cost of communication

For discrete coprocessors at least, communication speed is the key overhead to performance. This is because the path travelled by the data when moving to or from the coprocessor is many times longer than when moving to or from the processor. As just suggested, a well-trained dispatch mechanism can compensate for communication costs by invoking the appropriate version (i.e., CPU or GPU), at least for certain well-understood computations. A major limitation of current dispatch-based approaches, though, is the assumption that the ultimate destination of the computation is the CPU. This may not always be the case.

Iteratively-reweighted least squares is an accessible example of why this assumption need not hold. Each iteration of the algorithm involves operations which can be performed on either of the CPU or the GPU. While these operations might individually be faster on one or the other type of hardware, it is more likely that their aggregate will execute faster on the GPU, as communication costs need only be paid for the first operation, following which all intermediate results reside on the GPU. Clearly, if the user is given the ability to specify that a computation take place on the GPU, and that the result need not be sent back, the potential is enhanced for accelerating more complex algorithms.

The gmatrix package

The gmatrix package, by Nathan Morris, offers a straightforward solution. By leveraging R’s type system, the package offers new classes, the gvector and eponymous gmatrix, for which GPU storage and execution is the default. As an S4 class, for example, the gmatrix is similar in spirit to Matrix, a derived class with specialized features. To perform a supported operation, such as transpose, on the GPU there is no need to invoke a function with a contrived name, such as gpuTranspose(). One instead just uses t() and, assuming that the object has type gmatrix, both the tranpose operation and its result reside on the GPU.

This is more than just an exercise in namespace binding. The type system is doing the heavy lifting here, dispatching commands based on class. It remains the responsibility of the user to determine where the work is best performed. Creating “g-variant” objects is an elegant way to go about it. A large collection of matrix and vector operations are supplied, similar to what is available from other packages, along with additional features such as GPU-based generation of common distributions. Some special syntactic sugar is needed when constructing new objects directly on the GPU, but transfer between host and GPU are as straightforward as invoking commands h() and g(). Once objects are appropriately typed, again, S4 does most of the work.

gmatrix is built atop CUDA and much of the GPU functionality derives from a proprietary BLAS implementation, cuBLAS, as well as from Thrust. Although the low-level back end uses a proprietary implementation, however, the overall programming model used by gmatrix is essentially vendor-agnostic. That is, although the user must contend with what operations to call and whether the data resides on the host or the GPU, the under-the-hood implementation is completely hidden. In particular, the back end could be re-engineered for some other API, leaving the look-and-feel of the package unchanged. This is one reason that gmatrix could have some staying power.

Other authors have noted the possibility of using the type system to dispatch hardware-variant commands to achieve high performance. Christopher Brown reported early work using OpenCL at Nvidia’s annual GTC in 2010, for example. The pdbR series of packages appear to use similar ideas for large clusters. gmatrix, though, is the first fully-implemented package to exploit this idea successfully for the GPU.

In several presentations of the parmatpows matrix exponentiation package, Norm Matloff discusses speedups offered by using gmatrix to provide core linear algebra functionality on the GPU. This includes his 2014 talk at UseR!

Conclusions

There are a variety of packages available to the R programmer to facilitate high- performance development on the GPU. Their functionality ranges from low-level interface to highly tuned command-level and workflow-level implementations. The gmatrix package allows the user to guide optimization of an algorithm by specifying where the data is to reside at any step within a computation. Because of its reliance on object typing, as opposed to syntax, it does so at minimal imposition to the user. The package is a welcome addition to the toolset, and should be seriously considered by anyone wishing to accelerate their R work using a GPU.