Please find here my talk “*Beyond Simple Monte-Carlo: Parallel Computing with QuantLib*” at this year’s QuantLib User Meeting in Düsseldorf.

# Category: Finite Difference Methods

# Fokker-Planck Equation, Feller Constraint and Boundary Conditions

The Fokker-Planck forward equation is an important tool to calibrate local volatility extensions of stochastic volatility models, e.g. the local vol extension of the Heston model. But the treatment of the boundary conditions – especially at zero instantaneous variance – is notoriously difficult if the Feller constraint

$$\sigma^2 \le 2\kappa\theta$$

of the square root process for the variance

$$\mathrm{d}v_t = \kappa(\theta - v_t)\,\mathrm{d}t + \sigma\sqrt{v_t}\,\mathrm{d}W_t$$

is violated. The corresponding Fokker-Planck forward equation

$$\frac{\partial p}{\partial t} = \frac{\sigma^2}{2}\frac{\partial^2}{\partial v^2}\left(v p\right) + \frac{\partial}{\partial v}\big(\kappa(v-\theta)\, p\big)$$

describes the time evolution of the probability density function *p*, and the boundary at the origin is instantaneously attainable if the Feller constraint is violated. In this case the stationary solution [1] of the Fokker-Planck equation

$$p_{st}(v) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, v^{\alpha-1} e^{-\beta v}, \qquad \alpha = \frac{2\kappa\theta}{\sigma^2},\ \beta = \frac{2\kappa}{\sigma^2}$$

diverges with

$$p_{st}(v) \sim v^{\alpha-1} \to \infty \quad \text{for } v \to 0 \text{ if } \alpha < 1$$

and the zero flow boundary condition at the origin

$$\left.\frac{\sigma^2}{2}\frac{\partial}{\partial v}\left(v p\right) + \kappa(v-\theta)\, p\,\right|_{v=0} = 0$$

becomes hard to track numerically. Rewriting the partial differential equation in terms of $q = v^{1-\alpha} p$, which stays bounded at the origin, helps to mitigate the problem [2].

The corresponding zero flow boundary condition in *q* reads as follows

$$\left.\frac{\sigma^2}{2}\frac{\partial q}{\partial v} + \kappa q\,\right|_{v=0} = 0$$

The first step in order to apply the boundary condition is to discretize the transformed Fokker-Planck equation in space direction on a non-uniform grid $\{v_0, v_1, \ldots, v_n\}$

$$\frac{\partial q}{\partial v}\bigg|_{v_i} \approx -\frac{h_i}{h_{i-1}(h_{i-1}+h_i)}\, q_{i-1} + \frac{h_i - h_{i-1}}{h_{i-1} h_i}\, q_i + \frac{h_{i-1}}{h_i(h_{i-1}+h_i)}\, q_{i+1}$$

$$\frac{\partial^2 q}{\partial v^2}\bigg|_{v_i} \approx \frac{2}{h_{i-1}(h_{i-1}+h_i)}\, q_{i-1} - \frac{2}{h_{i-1} h_i}\, q_i + \frac{2}{h_i(h_{i-1}+h_i)}\, q_{i+1}$$

with

$$h_i = v_{i+1} - v_i, \qquad q_i = q(v_i).$$

The next step is to discretize the zero flow boundary condition using a second order forward discretization [3] for the derivative $\partial q/\partial v$ at $v_0$

$$\frac{\partial q}{\partial v}\bigg|_{v_0} \approx a_0 q_0 + a_1 q_1 + a_2 q_2$$

with

$$a_0 = -\frac{2h_0 + h_1}{h_0(h_0 + h_1)}, \qquad a_1 = \frac{h_0 + h_1}{h_0 h_1}, \qquad a_2 = -\frac{h_0}{h_1(h_0 + h_1)}.$$

Now the value at the boundary can be expressed in terms of $q_1$ and $q_2$.

$$q_0 = -\frac{\frac{\sigma^2}{2}\left(a_1 q_1 + a_2 q_2\right)}{\kappa + \frac{\sigma^2}{2} a_0}$$

This equation can be used to remove the boundary value from the discretization of the transformed Fokker-Planck equation. In the same manner the zero flow boundary condition can be implemented for the upper boundary and for the original Fokker-Planck equation.
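As a concrete illustration, the elimination step can be sketched in a few lines of C++. This is a standalone sketch, not QuantLib code; the function names and the spacing variables $h_0 = v_1 - v_0$, $h_1 = v_2 - v_1$ are made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Coefficients a0, a1, a2 of the second order forward difference
// q'(v0) ~ a0*q0 + a1*q1 + a2*q2 on a non-uniform grid with
// spacings h0 = v1 - v0 and h1 = v2 - v1.
std::array<double, 3> forwardDiffCoeffs(double h0, double h1) {
    return {-(2*h0 + h1)/(h0*(h0 + h1)),
            (h0 + h1)/(h0*h1),
            -h0/(h1*(h0 + h1))};
}

// Express the boundary value q0 in terms of q1 and q2 using the
// zero flow condition sigma^2/2 * q'(0) + kappa * q(0) = 0.
double boundaryValue(double kappa, double sigma,
                     double h0, double h1, double q1, double q2) {
    const std::array<double, 3> a = forwardDiffCoeffs(h0, h1);
    const double s = 0.5*sigma*sigma;
    return -s*(a[1]*q1 + a[2]*q2)/(kappa + s*a[0]);
}
```

With $q_0$ expressed as a linear combination of $q_1$ and $q_2$, the first row of the discretized operator can be folded into the interior rows and the boundary value removed from the set of unknowns.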

The following parameters of the square-root process will be used to test the different boundary conditions.

The uniform grid has 100 or 1000 grid points and the stationary solution of the Fokker-Planck equation is chosen as the start configuration on the grid points. The boundaries are defined via the inverse cumulative distribution function $F^{-1}$ such that

$$v_{\min} = F^{-1}(0.01), \qquad v_{\max} = F^{-1}(0.99).$$

The grid therefore covers 98% of the distribution, and this value would not change over time if the boundary conditions were satisfied without discretization errors. To test the magnitude of the discretization errors we let the initial solution evolve over one year using 100 time steps with the Crank-Nicolson scheme. The resulting solution is interpolated using cubic splines to get a function $\tilde p(v)$. The value

$$\int_{v_{\min}}^{v_{\max}} \tilde p(v)\,\mathrm{d}v$$

serves as a quality indicator for the discretization errors of the boundary condition.
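The mass on the grid can be computed with a simple quadrature. Below is a minimal stand-in using the trapezoidal rule directly on the grid values instead of the cubic spline interpolation; the function name is illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Trapezoidal approximation of the integral of a density p over a
// (possibly non-uniform) grid v; serves as a quality indicator for
// how well the boundary conditions preserve probability mass.
double gridMass(const std::vector<double>& v, const std::vector<double>& p) {
    double sum = 0.0;
    for (std::size_t i = 1; i < v.size(); ++i)
        sum += 0.5*(p[i] + p[i-1])*(v[i] - v[i-1]);
    return sum;
}
```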

As can be seen in the diagram below, if the volatility of variance $\sigma$ is small the original Fokker-Planck equation shows smaller discretization errors than the transformed Fokker-Planck equation in $q$. But if $\sigma$ becomes larger, and especially if the Feller constraint is violated, the solution of the transformed Fokker-Planck equation clearly outperforms the original solution. As a rule of thumb based on this example, one should use the original Fokker-Planck equation if the Feller constraint is satisfied by a comfortable margin and the transformed equation otherwise.

The same techniques can be used to solve the Fokker-Planck equation for the Heston model. The code for the numerical tests is available in the latest QuantLib version from the SVN trunk. The diagram is based on the test FDHestonTest::testSquareRootEvolveWithStationaryDensity.

[1] A. Dragulescu, V. Yakovenko, Probability distribution of returns in the Heston model with stochastic volatility

[2] V. Lucic, Boundary Conditions for Computing Densities in Hybrid Models via PDE Methods

[3] P. Lermusiaux, Numerical Fluid Mechanics, Lecture Note 14, Page 5

# Fokker-Planck Forward Equation for the Heston Model

Consider the stochastic differential equation of the Heston model for the dynamics of the log-spot $x_t = \ln S_t$

$$\begin{aligned}
\mathrm{d}x_t &= \left(r - q - \frac{v_t}{2}\right)\mathrm{d}t + \sqrt{v_t}\,\mathrm{d}W^x_t \\
\mathrm{d}v_t &= \kappa(\theta - v_t)\,\mathrm{d}t + \sigma\sqrt{v_t}\,\mathrm{d}W^v_t \\
\mathrm{d}W^x_t\,\mathrm{d}W^v_t &= \rho\,\mathrm{d}t
\end{aligned}$$

The Fokker-Planck partial differential forward equation describes the time evolution of the probability density function $p(x, v, t)$

$$\frac{\partial p}{\partial t} = \frac{1}{2}\frac{\partial^2}{\partial x^2}\left(v p\right) + \rho\sigma\frac{\partial^2}{\partial x\,\partial v}\left(v p\right) + \frac{\sigma^2}{2}\frac{\partial^2}{\partial v^2}\left(v p\right) - \frac{\partial}{\partial x}\left[\left(r - q - \frac{v}{2}\right) p\right] + \frac{\partial}{\partial v}\big[\kappa(v-\theta)\, p\big]$$

with the initial condition

$$p(x, v, 0) = \delta(x - x_0)\,\delta(v - v_0)$$

where $\delta$ denotes the Dirac delta distribution. A semi-closed form solution for this problem is presented in [1]. When solving the Fokker-Planck forward equation special care must be taken with respect to the boundary conditions, especially if the Feller constraint

$$\sigma^2 \le 2\kappa\theta$$

is violated. In this case the boundary at the origin is instantaneously attainable. A generalisation of Feller’s zero-flux boundary condition should be applied at the origin [2].

A three point forward differentiation formula can be used to calculate a second order accurate approximation of the partial derivative $\partial p/\partial v$ at the boundary $v = 0$ [3]. On a uniform grid with spacing $\Delta v$ it reads

$$\frac{\partial p}{\partial v}\bigg|_{v=0} \approx \frac{-3 p_0 + 4 p_1 - p_2}{2\Delta v}.$$
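A minimal sketch of such a second order one-sided derivative on a uniform grid (standalone illustration code, not QuantLib; exact for polynomials up to second degree):

```cpp
#include <cassert>
#include <cmath>

// Second order accurate three point forward difference for the first
// derivative at the left boundary of a uniform grid with spacing dv.
double forwardDerivative(double f0, double f1, double f2, double dv) {
    return (-3.0*f0 + 4.0*f1 - f2)/(2.0*dv);
}
```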

The diagram below shows the solution of the forward equation for an example model parameterisation. When the Feller constraint is violated, the shape of the solution changes completely.

The code for this example is available here and it is based on the latest QuantLib version from the SVN trunk. It also depends on RInside and Rcpp to generate the plots. In addition the zip contains a short movie clip of the time evolution of the solution.

[1] A. Dragulescu, V. Yakovenko, Probability distribution of returns in the Heston model with stochastic volatility

[2] V. Lucic, Boundary Conditions for Computing Densities in Hybrid Models via PDE Methods

[3] K. A. Kopecky, Numerical Differentiation

# Multi-Dimensional Finite Difference Methods on a GPU

One key aspect of the performance of multi-dimensional finite difference methods based on operator splitting is the performance of the underlying tridiagonal system solver [1]. The authors of [2] have analysed different solver strategies and report a speed-up factor of around 15 between GPU and CPU for the best solver strategy (cyclic reduction) on very large systems. CUDA’s sparse matrix library cuSPARSE contains a routine to solve tridiagonal systems based on cyclic reduction, and nVIDIA claims a speed-up factor of around ten compared with Intel’s MKL library [3].

These factors are smaller than the speed-up factors reported for pure Monte-Carlo pricing algorithms. The main reason is that a tridiagonal system solver cannot be parallelised by a simple divide-and-conquer approach in the way many Monte-Carlo pricing algorithms can.

The main classes of the GPU implementation of the Hundsdorfer-Verwer operator splitting scheme are:

- **GPUFdmLinearOp**: implements the FdmLinearOp interface and can be initialized from any instance of FdmLinearOp/FdmLinearOpComposite via the *toMatrix()/toMatrixDecomp()* methods.
- **GPUTripleBandLinearOp**: GPU based implementation of TripleBandLinearOp. This linear operator can also be initialized from any instance of FdmLinearOp/FdmLinearOpComposite via the *toMatrix()/toMatrixDecomp()* methods. The solver for the tridiagonal system is based on cuSPARSE.
- **GPUHundsdorferScheme**: GPU based implementation of the Hundsdorfer-Verwer operator splitting scheme.

The code is available here and it is based on the latest QuantLib version from the trunk, CUDA 4.1 or higher and Cusp 0.3 or higher. As can be seen from the diagrams above the speed-up factor depends on the problem size. Speed-up factors above ten can only be achieved for very large two-dimensional or for medium-sized three-dimensional problems. The GPU precision is defined in cudatypes.hpp by the typedef for the type “real”.

[1] Karel in ’t Hout, ADI finite difference schemes for the Heston model with correlation.

[2] Pablo Quesada-Barriuso, Julián Lamas-Rodríguez, Dora B. Heras, Montserrat Bóo, Francisco Argüello, Selecting the Best Tridiagonal System Solver Projected on Multi-Core CPU and GPU Platforms.

[3] nVIDIA, cuSPARSE

# Richardson Extrapolation for American Options

Popular finite difference schemes like the one-dimensional Crank-Nicolson scheme or multi-dimensional operator splitting methods like the Hundsdorfer-Verwer scheme often achieve second order convergence in $\Delta t$ for typical partial differential equations in financial engineering. Unfortunately, if these schemes are used together with dynamic programming to price American options, the second order convergence is destroyed by the first order convergence of the dynamic programming step.

The Richardson extrapolation is a simple numerical gem to increase the convergence order of a numerical method. Say a numerical method $f(h)$ has the following convergence behaviour

$$f(h) = \tilde f + \alpha h^n + O\left(h^{n+1}\right).$$

Please note that this is a more restrictive requirement than assuming that the numerical method is of order *n*, because $\alpha$ does not depend on $h$. Now the formula

$$f_R(h) = \frac{x^n f\!\left(\frac{h}{x}\right) - f(h)}{x^n - 1}$$

also has

$$\lim_{h \to 0} f_R(h) = \tilde f$$

but the order of convergence of $f_R(h)$ is *n+1*. The order does not depend on the particular value of *x*, but often *x* is chosen to be two. This simple trick can be extended to the case where *n* is unknown, or the Richardson extrapolation can be applied *m* times to increase the order of convergence to *n+m* [2].
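A minimal sketch of the extrapolation formula in C++ (standalone illustration code; the function name is made up):

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Richardson extrapolation: combine f(h) and f(h/x) to cancel the
// leading alpha*h^n error term, which raises the order of
// convergence from n to n+1.
double richardson(const std::function<double(double)>& f,
                  double h, double x, double n) {
    const double xn = std::pow(x, n);
    return (xn*f(h/x) - f(h))/(xn - 1.0);
}
```

For instance, the forward difference quotient $(e^h - 1)/h$ approximates $\exp'(0) = 1$ with a first order error; one extrapolation step with $x = 2$ and $n = 1$ reduces the error from $O(h)$ to $O(h^2)$.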

Let’s apply the Richardson extrapolation to the pricing of an American put under the Heston stochastic differential equation. The corresponding partial differential equation

$$\frac{\partial u}{\partial t} = \frac{s^2 v}{2}\frac{\partial^2 u}{\partial s^2} + \rho\sigma s v\frac{\partial^2 u}{\partial s\,\partial v} + \frac{\sigma^2 v}{2}\frac{\partial^2 u}{\partial v^2} + r s\frac{\partial u}{\partial s} + \kappa(\theta - v)\frac{\partial u}{\partial v} - r u$$

is solved using the Hundsdorfer-Verwer scheme [1]. The early exercise condition is imposed via dynamic programming after every time step. The parameters of the experiment can be found in the source code.

As can be seen in the diagram below, even though the Richardson extrapolation is a simple formula, it increases the convergence by one order, from *1.0* to approximately *2.1*. If the order is not known a priori, the resulting order is still *1.4*.

The source code is available here. It depends on the latest QuantLib version from the SVN trunk.

[1] Karel in ’t Hout, ADI finite difference schemes for the Heston model with correlation.

[2] Eric Hung-Lin Liu, Fundamental Methods of Numerical Extrapolation with Applications.

# Accelerate VPP Pricing on the GPU using Preconditioner

As outlined in Solving Sparse Linear Systems using CUSP and CUDA a VPP pricing problem can be solved on a GPU using finite difference methods based on the implicit Euler scheme. The matrix inversion is carried out on the GPU using the BiCGStab solver of the numerical library cusp [1]. The optimal exercise problem for the VPP can be solved using dynamic programming for every time slice on the GPU (see [2]).

The lower convergence order in $\Delta t$ of the implicit Euler scheme is not a disadvantage, because the hourly optimization step forces $\Delta t \le 1$ hour anyhow. This step size is small enough for the implicit Euler scheme to obtain a very good accuracy. The explicit Euler scheme turned out to be numerically unstable for the step sizes and grid resolutions of interest.
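The stability gap between the two schemes is easy to reproduce on a toy problem. The sketch below applies both schemes to the one-dimensional heat equation $u_t = u_{xx}$ with Dirichlet boundaries and a ratio $\lambda = \Delta t/\Delta x^2$ well above the explicit stability bound of $1/2$; it is standalone illustration code, not part of the VPP pricer.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One explicit Euler step for u_t = u_xx with Dirichlet boundaries,
// lambda = dt/dx^2. Unstable for lambda > 1/2.
void explicitStep(std::vector<double>& u, double lambda) {
    const std::vector<double> old(u);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        u[i] = old[i] + lambda*(old[i-1] - 2.0*old[i] + old[i+1]);
}

// One implicit Euler step: solve (I - lambda*D_xx) u_new = u_old with
// the Thomas algorithm. Unconditionally stable.
void implicitStep(std::vector<double>& u, double lambda) {
    const std::size_t n = u.size();
    std::vector<double> c(n, 0.0), d(n, 0.0);
    d[0] = u[0];                       // Dirichlet row: identity
    for (std::size_t i = 1; i + 1 < n; ++i) {
        const double a = -lambda, b = 1.0 + 2.0*lambda;
        const double m = b - a*c[i-1];
        c[i] = a/m;
        d[i] = (u[i] - a*d[i-1])/m;
    }
    d[n-1] = u[n-1];                   // Dirichlet row: identity
    u[n-1] = d[n-1];
    for (std::size_t i = n-1; i-- > 0; )
        u[i] = d[i] - c[i]*u[i+1];
}
```

Starting from a unit spike, a few dozen explicit steps with $\lambda = 2$ blow up by many orders of magnitude while the implicit solution simply decays.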

Clearly the matrix inversion dominates the overall runtime of the pricing routine. Preconditioning is a standard technique to improve the performance of the BiCGStab algorithm involved. The cusp library supports a set of preconditioners suited to this problem:

- Diagonal: preconditioning is very inexpensive to use, but often has limited effectiveness.
- Approximate Inverse: non-symmetric Bridson with different dropping strategies.
- Smoothed Aggregation: algebraic multi-grid preconditioner based on smoothed aggregation.

On a CPU QuantLib supports preconditioning based on the Thomas algorithm. This algorithm solves a tridiagonal system with $O(n)$ operations. The corresponding preconditioner can be seen as a natural extension of the diagonal preconditioner. Experiments on a CPU indicate that the tridiagonal preconditioner leads to a significant speed-up for our VPP pricing problem. The upcoming CUSPARSE library of the CUDA 4.1 release contains a tridiagonal solver for the GPU. With the help of a small interface class the CUDA tridiagonal solver can be used as a cusp preconditioner. The source code is available here. The code depends on the latest QuantLib version from the SVN trunk.
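For reference, a minimal sketch of the Thomas algorithm (standalone illustration code; no pivoting, so it assumes diagonally dominant systems such as those produced by the implicit Euler discretization):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Thomas algorithm: solves a tridiagonal system with sub-diagonal a,
// diagonal b, super-diagonal c and right hand side d in O(n)
// operations. a[0] and c[n-1] are ignored; no pivoting is performed.
std::vector<double> thomasSolve(std::vector<double> a,
                                std::vector<double> b,
                                std::vector<double> c,
                                std::vector<double> d) {
    const std::size_t n = b.size();
    for (std::size_t i = 1; i < n; ++i) {      // forward elimination
        const double m = a[i]/b[i-1];
        b[i] -= m*c[i-1];
        d[i] -= m*d[i-1];
    }
    std::vector<double> x(n);
    x[n-1] = d[n-1]/b[n-1];
    for (std::size_t i = n-1; i-- > 0; )       // back substitution
        x[i] = (d[i] - c[i]*x[i+1])/b[i];
    return x;
}
```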

The testbed is a VPP contract with a maturity of 4 weeks, which corresponds to $4 \times 7 \times 24 = 672$ hourly exercise steps. Therefore, in addition to the three dimensions for the stochastic model, the problem has a fourth dimension to solve the optimization problem via dynamic programming.

The following diagram shows the different run-times of the algorithms on a GPU and a CPU.

The fastest algorithm on the GPU (implicit Euler scheme with BiCGStab and tridiagonal preconditioner) outperforms the best CPU implementation (operator splitting based on the Douglas scheme) roughly by a factor of ten. The non-symmetric Bridson preconditioner with the dropping strategy from Lin & Moré is the second best algorithm on the GPU, but it ran out of memory on larger grids. The diagonal preconditioner slows down the overall performance on the GPU, and the BiCGStab algorithm with the smoothed aggregation preconditioner did not converge.

[1] cusp, Generic Parallel Algorithms for Sparse Matrix and Graph Computations

# Bermudan Swaption Pricing based on Finite Difference Methods

Even though short rate models like the Hull-White model or the G2++ model are getting a bit long in the tooth, they are still used for risk management or as benchmark models. Since the early days QuantLib has supported the pricing of Bermudan swaptions based on trinomial trees. It’s time to compare the performance and accuracy of the trinomial tree pricing algorithm with finite difference methods.

The G2++ model is defined by the following stochastic differential equation

$$\begin{aligned}
\mathrm{d}x_t &= -a x_t\,\mathrm{d}t + \sigma\,\mathrm{d}W^1_t \\
\mathrm{d}y_t &= -b y_t\,\mathrm{d}t + \eta\,\mathrm{d}W^2_t \\
\mathrm{d}W^1_t\,\mathrm{d}W^2_t &= \rho\,\mathrm{d}t \\
r_t &= x_t + y_t + \varphi(t)
\end{aligned}$$

where the deterministic shift $\varphi(t)$ is calibrated to the initial term structure via the market instantaneous forward rate $f^M(0,T)$ at time 0 for the maturity $T$ (see e.g. [1]). The Hull-White model can be seen as a one-dimensional simplification of the G2++ model. The corresponding partial differential equation (PDE) can be derived using the Feynman-Kac theorem.

QuantLib supports several operator splitting schemes to solve multi-dimensional PDEs. The SVN trunk contains FDM engines to price Bermudan swaptions under the Hull-White and the G2++ model. Other short rate models like the CIR++ or Black-Karasinski model can be implemented in the same manner.

As can be seen in the diagrams below the finite difference method based pricing engines outperform the trinomial tree based engines.

[1] D. Brigo, F. Mercurio, Interest Rate Models – Theory and Practice