Multi-Dimensional Finite Difference Methods on a GPU

One key aspect for the performance of multi-dimensional finite difference methods based on operator splitting is the performance of the underlying tridiagonal system solver [1]. The authors in [2] have analysed different solver strategies and they have reported a speed-up factor of around 15 between GPU and CPU for the best solver strategy (cyclic reduction) on very large systems. CUDA’s sparse matrix library cuSPARSE contains a routine to solve tridiagonal systems based on cyclic reduction. nVIDIA claims a speed-up factor of  around ten compared against Intel’s MKL library [3].

These factors are smaller than the speed-up factors reported for pure Monte-Carlo pricing algorithms. Main reason is that a tridiagonal system solver can not be parallelised by a simple divide and conquer approach like many Monte-Carlo pricing algorithms.

Main classes of the GPU implementation of the Hundsdorfer-Verwer operator splitting scheme are:

  1. GPUFdmLinearOp: implements the FdmLinearOp interface and can be initialized from any instance of FdmLinearOp/FdmLinearOpComposite via the toMatrix()/toMatrixDecomp() methods.
  2. GPUTripleBandLinearOp: GPU based implementation of TripleBandLinearOp. This linear operator can be initialized from any instance of the classes FdmLinearOp/FdmLinearOpComposite via the toMatrix/toMatrixDecomp() methods. The solver for the tridiagonal system is based on cuSPARSE.
  3. GPUHundsdorferScheme: GPU based implementation of the Hundsdorfer-Verwer operator splitting scheme.


plotThe code is available here and it is based on the latest QuantLib version from the trunk, CUDA 4.1 or higher and Cusp 0.3 or higher.  As can be seen from the diagrams above the speed-up factor depends on the problem size. Speed-up factors above ten can only be achieved for very large two-dimensional or for medium-sized three-dimensional problems. The GPU precision is defined in cudatypes.hpp by the typedef for the type “real”.

[1] Karel in ’t Hout, ADI finite difference schemes for the Heston model with correlation.

[2] Pablo Quesada-Barriuso, Julián Lamas-Rodríguez, Dora B. Heras, Montserrat Bóo, Francisco Argüello, Selecting the Best Tridiagonal System Solver Projected on Multi-Core CPU and GPU Platforms.



The Sobol Brownian Bridge Generator on a GPU

The biggest mistake that can be made with quasi random numbers is to just use them in the same way as one uses pseudo random numbers. The real advantage of quasi Monte-Carlo shows up only after

N\sim e^d

samples, where d is the number of dimensions of the problem. If one is using significantly less samples then it becomes very likely that the results are totally wrong. Therefore a dimensional reduction of the problem is often the first step when applying quasi Monte-Carlo methods.

The Brownian bridge is tailor-made to reduce the number of significant dimensions of a Monte-Carlo integration of a stochastic differential equation  In his recent text-book [1] Mark Joshi shows how to use Brownian bridges efficiently for quasi Monte-Carlo pricing and he also outlines the QuantLib implementation. It has been shown in [3] that the overall statistical error of a structured equity portfolio has been reduced by a factor of three when switching from a pseudo random number generator to a Sobol Brownian bridge generator, which in turn means a speed-up factor of nine leaving the Monte Carlo error unchanged.

An efficient implementation of a Sobol Brownian bridge generator is therefore a standard building block for quasi Monte-Carlo option pricing. A CUDA 4.2 implementation for nVIDIA GPUs based on QuantLib 1.2 is available here. The interface and usage are closely related to QuantLib’s Sobol Brownian bridge generator for CPUs. The memory layout of the results on the GPU is CURAND_ORDERING_QUASI_DEFAULT. The zip file also contains a validation program, which checks that both versions of the Sobol Brownian bridge are generating the same results.

As the diagram above shows, the GPU (GTX 560) out-performance the CPU (i3@3.0GHz) on average by a factor of 50 for single precision and by a factor of 12 for double precision.

[1] Mark Joshi, More Mathematical Finance
[2] Sebasien Gurrieri, An Analysis of Sobol Sequence and the Brownian Bridge
[3] Andre Bernemann, Raplh Schreyer, Klaus Spanderen, Accelerating Exotic Option Pricing and Model Calibration Using GPUs

Accelerate VPP Pricing on the GPU using Preconditioner

As outlined in Solving Sparse Linear Systems using CUSP and CUDA a VPP pricing problem can be solved on a GPU using finite difference methods based on the implicit Euler scheme. The matrix inversion is carried out on the GPU using the BiCGStab solver of the numerical library cusp [1]. The  optimal exercise problem for the VPP can be solved using dynamic programming for every time slice on the GPU (see [2]).

The lower convergence order in \Delta t of the implicit Euler scheme is not a disadvantage because the hourly optimization step forces \Delta t = 1h anyhow. This step size is small enough for the implicit Euler scheme to obtain a very good accuracy. The explicit Euler scheme is  numerically unstable, even if \Delta t = 1h. It converged only for large \Delta x, \Delta y and \Delta u.

Clearly the matrix inversion dominates the overall runtime of the pricing routine. Preconditioning is a standard technique to improve the performance of the involved BiCGStab algorithm. The cusp library supports a set of suited preconditioner for this problem.

  • Diagonal: preconditioning is very inexpensive to use, but has often limited effectiveness.
  • Approximate Inverse: Non-symmetric Bridson with different dropping strategies.
  • Smoothed Aggregation: Algebraic Multi-grid preconditioner based on smoothed aggregation.

On a CPU QuantLib supports preconditioning based on the Thomas algorithm. This algorithm solves a tridiagonal matrix with O(8n) operations. The corresponding preconditioner can be seen as a natural extension of the diagonal preconditioner. Experiments on a CPU indicate that  the tridiagonal preconditioner leads to a significant speed-up for our VPP pricing problem. The upcoming CUSPARSE library of the CUDA 4.1 release contains a tridiagonal solver for the GPU. With the help of a small interface class the CUDA tridiagonal solver can be used as a cusp preconditioner. The source code is available here. The code depends on the latest QuantLib version from the SVN trunk.

The testbed is a VPP contract with t_{minUp} = t_{minDown} = 2h and a maturity of 4 weeks (corresponds to 4*7*24=672 exercise steps). Therefore in addition to the three dimensions for the stochastic model the problem has a fourth dimension of size 2t_{minUp} + t_{minDown} = 6  to solve the optimization problem via dynamic programming.

The following diagram shows the different run-times of the algorithms on a GPU and a CPU.

The fastest algorithm on the GPU (implicit Euler scheme with BiCGStab and tridiagonal preconditioner) outperforms the best CPU implementation (operator splitting based on the Douglas scheme) roughly by a factor of ten. The non-symmetric Bridson preconditioner with the dropping strategy from Lin & More on the GPU is the second best algorithm but it ran out of memory on larger grids. The diagonal preconditioner slows down the overall performance on the GPU and the BiCGStab algorithm with the smoothed aggregation preconditioner did not converge.

[1] cusp, Generic Parallel Algorithms for Sparse Matrix and Graph Computations

[2] Pricing of a Virtual Power Plant on a GPU

Solving Sparse Linear Systems using CUSP and CUDA

A finite difference equation can be represented and solved based on a sparse linear system. When using schemes with implicit parts to solve the equation one needs to calculate the inverse of this sparse matrix. Iterative solvers like the BiCGStab algorithm (plus preconditioner) are tailor-made for these kind of problems. In addition these solvers can easily be adapted to parallel computing architectures like modern GPUs. The cusp library [1] implements a set of common iterative solvers for sparse matrices based on CUDA and thrust [2].

Alternatively operator splitting techniques can be used to solve the problem. In this case only tridiagonal linear systems have to be inverted. The Thomas algorithm (named after Liewellyn Thomas) can solve such systems in O(n) operations instead of O(n^3) operations required by the Gaussian elimination. But it is not easy to parallelize the Thomas algorithm.

On a CPU operator splitting usually outperforms iterative solvers like BiCGStab for the implicit part. The VPP pricing problem from the previous blog entries will now serve as a benchmark to compare the solver performance on the CPU against a GPU. As can be seen in the diagram below on a CPU operator splitting is faster than the BiCGStab solver plus preconditioner. But the BiCGStab on the GPU clearly gains the lead even though the corresponding preconditioner for the GPU has not been implemented yet. Please find the interface class between QuantLib/boost::ublas and cusp here.

[1] cusp, Generic Parallel Algorithms for Sparse Matrix and Graph Computations

[2] thrust, Code at the speed of light

Pricing of a Virtual Power Plant on a GPU

Even the pricing of  a simple virtual power plant (VPP) is challenging. Main reasons are the high number of possible states of the VPP and the large number of possible exercise dates because often a VPP is priced as a bermudan-style option with hourly exercise rights. The implementation effort for an exact pricing engine based on finite difference methods (see e.g.[1]) or based on least squares Monte-Carlo is comparable large. As shown in [1] Monte-Carlo combined with  perfect foresight optimization can result in a very good approximation. The algorithm consists of a Monte-Carlo path generator and a dynamic programming optimization part, which calculates the optimal load schedule plan for each path separately. The stochastic processes involved are outlined in [1].

The CUDA based GPU implementation is available here.  It depends on the latest QuantLib version from the SVN trunk or the upcoming QuantLib 1.2 release and CUDA 4.0. The corresponding C++ implementation is a speed-optimized version of the test-case VPPTest::testVPPPricing. This version also supports multi-threading. The following hardware was used to compare both implementations:

  • CPU: Core i5@3.0 GHz, quad-core
  • GPU: GTX560@810/1620MHz, 336 cores

As can be seen in the diagram below the GPU outperforms the CPU roughly by a factor 100 for single precision and a factor of 50 if the GPU is using double precision.

The CUDA implementation consists of the following files:

gpuvpppricingengine.hpp / gpuvpppricingengine.cpp
A QuantLib pricing engine for a simple VPP based on a Monte-Carlo simulation and perfect foresight optimization via dynamic programming. The physical size of the Monte-Carlo simulation is controlled by the following parameters of the constructor

  1. Size nSimulations: number of Monte-Carlo simulations carried out.
  2. bool antithetic: enables/disables antithetic sampling
  3. Size blockSize: number of threads in a CUDA block.
  4. Size gridSize: number of CUDA blocks that are grouped together in a simulation kernel.

gpuvpppricingengine_kernel.hpp gpuvpppricingengine_kernel.def
The CUDA implementation consists of two kernels. The first kernel is the Monte-Carlo path generator, which calculates the paths on hourly granularity and stores them in the global memory of the graphic card.. The technics used are outlined e.g. in [2], [3] and [4]. The second kernel performs the optimization of the load schedule based on dynamic programming. The memory layout of this step depends on the number of possible states of the VPP because every possible state is stored in the shared memory of the GPU. The number of states is given by N_{states} = 2t_{up}+t_{down}. CUDA does not support efficient dynamic shared memory allocation. Therefore the sizes of all shared memory arrays must be given at compile time. To allow an optimal use of the limited shared memory capacity different kernels with different N_{states} values are generated using X-macros and the appropriate kernel is chosen at runtime.

defines basic CUDA types, especially the typedef for the type “real” can be used to compile the code either for single or double precision.

C++ interface for a GPU random number generator

gpucurand.hpp / gpucurand.cpp / gpucurand_kernel.hpp /
implementation of the GPURand interface based on the CURAND library, which is part of CUDA 4.0.

[1] this blog, VPP Pricing III: Exact Pricing based on Finite Difference Methods.

[2] L. Howes, D. Thomas, Efficient Random Number Generation and Application Using CUDA.

[3] A. Bernemann, R. Schreyer, K. Spanderen, Accelerating Exotic Option Pricing and Model Calibration Using GPUs

[4] M. Joshi, Graphical Asian Options

Calibrate the Heston Model on a GPU: CUDA 3.x and 4.0 Update

Please find here a CUDA based program (upgraded for CUDA 3.x and 4.0) to price and calibrate a piecewise constant time dependent Heston model based on the QuantLib 1.0. The example code is in test-suite/hestonmodel.cpp. Some implementation details are outlined in the following paper, A. Bernemann, R. Schreyer, K. Spanderen: Accelerating Exotic Option Pricing and Model Calibration Using GPUs.

Scrambled Sobol Sequences

The authors of [1][2] presented an efficient way to create scrambled Sobol numbers and have shown that their scrambling algorithm can lead to a significantly higher convergence speed. The upcoming CUDA 4.0 Release provides an implementation of the scrambling algorithm outlined in [2].

The aim is now to compare the order of convergence of a PRNG (Merssenne-Twister),  a QRNG (Sobol) and this scrambled-QRNG for the pricing of a portfolio of geometric Asian options. To number of observation dates of the Asian options in the benchmark portfolio varies between 45 and 120 observation dates. The dimensionality of the pricing problem is considerably large. We’ll use a Brownian bridge to construct the paths. This ensures that the first coordinates of a Sobol sequence are used to sample the most important modes.

The error analysis is not an easy task.  Simply applying another randomized shift is not sufficient as this transformation changes the discrepancy of a point-set (A shifted (t,m,d)-net does not need to be a (t,m,d)-net.). In [3] this problem was overcome by looking on the RMS relative error

\sqrt{\frac{1}{m}\sum_{i=1}^m\left(\frac{\hat{C}_i -C_i}{C_i}\right)^2}

of a set of benchmark options with true prices given by C_1,...,C_m and the Monte-Carlo approximations \hat{C}_1,...,\hat{C}_m.

We are in particular interest in ratios between the PRNG, QRNG and scrambled QRNG errors and in the order \alpha of the convergence.  For PRNG we are expecting a convergence of order err \simeq N^{\alpha} with  \alpha=-0.5 and N being the number of paths.

As can be seen in the diagram the scrambled Sobol sequence provides the smallest overall error bounds but the convergence order between scrambled and non-scrambled Sobol does not differ significantly. As the implementation comes more or less for free it’s definitely worth to check scrambled Sobol sequences out.

C++ and CUDA code is available here.  It depends on QuantLib 1.1 and CUDA 4.0. If you want to generate the plot directly out of the C++ program you also need R, RCPP and RInside.

[1] J. Dick, Higher order scrambled digital nets achieve the optimal rate of the root mean square error for smooth integrands. ArXiv, 8 July 2010.

[2] A.B. Owen, Local antithetic sampling with scrambled nets. ArXiv, 28 May 2008

[3] P. Glasserman, Monte Carlo Methods in Financial Engineering.  ISBN-0387004513