Accelerate VPP Pricing on the GPU using Preconditioner

As outlined in Solving Sparse Linear Systems using CUSP and CUDA a VPP pricing problem can be solved on a GPU using finite difference methods based on the implicit Euler scheme. The matrix inversion is carried out on the GPU using the BiCGStab solver of the numerical library cusp [1]. The  optimal exercise problem for the VPP can be solved using dynamic programming for every time slice on the GPU (see [2]).

The lower convergence order in \Delta t of the implicit Euler scheme is not a disadvantage because the hourly optimization step forces \Delta t = 1h anyhow. This step size is small enough for the implicit Euler scheme to obtain a very good accuracy. The explicit Euler scheme is  numerically unstable, even if \Delta t = 1h. It converged only for large \Delta x, \Delta y and \Delta u.

Clearly the matrix inversion dominates the overall runtime of the pricing routine. Preconditioning is a standard technique to improve the performance of the involved BiCGStab algorithm. The cusp library supports a set of suited preconditioner for this problem.

  • Diagonal: preconditioning is very inexpensive to use, but has often limited effectiveness.
  • Approximate Inverse: Non-symmetric Bridson with different dropping strategies.
  • Smoothed Aggregation: Algebraic Multi-grid preconditioner based on smoothed aggregation.

On a CPU QuantLib supports preconditioning based on the Thomas algorithm. This algorithm solves a tridiagonal matrix with O(8n) operations. The corresponding preconditioner can be seen as a natural extension of the diagonal preconditioner. Experiments on a CPU indicate that  the tridiagonal preconditioner leads to a significant speed-up for our VPP pricing problem. The upcoming CUSPARSE library of the CUDA 4.1 release contains a tridiagonal solver for the GPU. With the help of a small interface class the CUDA tridiagonal solver can be used as a cusp preconditioner. The source code is available here. The code depends on the latest QuantLib version from the SVN trunk.

The testbed is a VPP contract with t_{minUp} = t_{minDown} = 2h and a maturity of 4 weeks (corresponds to 4*7*24=672 exercise steps). Therefore in addition to the three dimensions for the stochastic model the problem has a fourth dimension of size 2t_{minUp} + t_{minDown} = 6  to solve the optimization problem via dynamic programming.

The following diagram shows the different run-times of the algorithms on a GPU and a CPU.

The fastest algorithm on the GPU (implicit Euler scheme with BiCGStab and tridiagonal preconditioner) outperforms the best CPU implementation (operator splitting based on the Douglas scheme) roughly by a factor of ten. The non-symmetric Bridson preconditioner with the dropping strategy from Lin & More on the GPU is the second best algorithm but it ran out of memory on larger grids. The diagonal preconditioner slows down the overall performance on the GPU and the BiCGStab algorithm with the smoothed aggregation preconditioner did not converge.

[1] cusp, Generic Parallel Algorithms for Sparse Matrix and Graph Computations

[2] Pricing of a Virtual Power Plant on a GPU

Solving Sparse Linear Systems using CUSP and CUDA

A finite difference equation can be represented and solved based on a sparse linear system. When using schemes with implicit parts to solve the equation one needs to calculate the inverse of this sparse matrix. Iterative solvers like the BiCGStab algorithm (plus preconditioner) are tailor-made for these kind of problems. In addition these solvers can easily be adapted to parallel computing architectures like modern GPUs. The cusp library [1] implements a set of common iterative solvers for sparse matrices based on CUDA and thrust [2].

Alternatively operator splitting techniques can be used to solve the problem. In this case only tridiagonal linear systems have to be inverted. The Thomas algorithm (named after Liewellyn Thomas) can solve such systems in O(n) operations instead of O(n^3) operations required by the Gaussian elimination. But it is not easy to parallelize the Thomas algorithm.

On a CPU operator splitting usually outperforms iterative solvers like BiCGStab for the implicit part. The VPP pricing problem from the previous blog entries will now serve as a benchmark to compare the solver performance on the CPU against a GPU. As can be seen in the diagram below on a CPU operator splitting is faster than the BiCGStab solver plus preconditioner. But the BiCGStab on the GPU clearly gains the lead even though the corresponding preconditioner for the GPU has not been implemented yet. Please find the interface class between QuantLib/boost::ublas and cusp here.

[1] cusp, Generic Parallel Algorithms for Sparse Matrix and Graph Computations

[2] thrust, Code at the speed of light