Inter-Thread Communication using Lock-Free Algorithms

In his blog Martin Fowler discusses the LMAX architecture, a high throughput retail financial trading platform. A remarkable detail of the architecture is the central business logic processor, which is implemented as a single threaded Java program. The supporting pre- and post-processing is running as a multi threaded application using lock-free ring buffers.

In general lock-free algorithms are implemented using atomic read-modify-write primitives which are provided by modern CPUs. Probably the most used lock-free algorithm in the boost library is the reference counting in boost::shared_ptr. On popular hardware platforms these reference count updates are based on atomic increments/decrements using  compare-and-swap (CAS) operations. The performance improvements over normal mutex exclusion locks are significant as the following little experiment shows. Basis for this experiment is a function which increments an integer counter 500 million times in a loop [1]. This loop is carried out in

  • simple single threaded loop
  • single thread loop using atomic increments
  • single thread loop acquiring a mutex exclusion lock for every pass.
  • two threads using atomic increments
  • two threads using mutex exclusion locks

A Java and the C++ implementation is available here. The Java code uses the package java.util.concurrent.atomic and the C++ code uses boost::detail::atomic_count to implement the atomic increments. The run times are measured on a Core i3@3GHz.

Lock-free algorithms are difficult to implement and to debug. Tim Blechmann has written a little gem library boost::lockfree (Unfortunately the library is not part of the official boost library.). Among others this library contains a wait-free single-producer/single-consumer ring buffer. This ring buffer is e.g. tailor-made to separate the Monte-Carlo path generation from the pricing of a derivate. With a few lines of code the path generation can then run in a different thread.

The code available here contains a BufferedDataFactory based on the boost::lockfree::ringbuffer class and a slightly modified version of the QuantLib test case HybridHestonHullWhiteProcessTest::testZeroBondPricing. In this version of the test case the path generation is running in a separated thread and uses the BufferedDataFactory to hand over the data to the pricing thread.

[1] Disruptor: High performance alternative to bounded queues for exchanging data between concurrent threads.

Parallel Model Calibration using MPI and Boost.MPI

The message passing standard MPI is a language-independent communication protocol. MPI supports the parallelization of numerical algorithms on both massive parallel computers and on symmetric multi processor systems. MPI is standardized, highly portable and the de facto standard on massive parallel supercomputers. Even though MPI can be used in a multi-threading environment it is normally used in a multi-process environment. Therefore MPI is tailor-made to parallelize algorithms based on the non thread-safe QuantLib.

The roots of the MPI specification are going back to the early 90′s and you will feel the age if you use the C-API, which is designed to achieve maximum performance. The Boost.MPI library – quoting from the web page – “is a C++ friendly interface to the standard Message Passing Interface… Boost.MPI can build MPI data types for user-defined types using the Boost.Serialization library”.

Model calibration can be a very time-consuming task, e.g. the calibration of a Heston or a Heston-Hull-White model using American puts with discrete dividends. The class MPICalibrationHelper acts as a MPI wrapper for a given CalibrationHelper and allows to parallelize an existing model calibration routine (hopefully with minimal impact/effort). The source code is available here. It contains the MPICalibrationHelper class and as an example the parallel version of the DAXCalibration test case (part of test-suite/hestonmodel.cpp). The code depends on QuantLib 1.0 or higher, Boost.Thread and Boost.MPI.

The diagram above shows the speed-up of a Heston-Hull-White calibration with discrete dividends on an eight core machine using a finite difference pricing engine. The main reason for the sub-linear scaling is the limited memory bandwidth between the CPUs and the main memory and not the MPI communication overhead.

QuantLib-SWIG and a Thread-Safe Observer Pattern in C++

The QuantLib library is not really thread-safe. Successful usage in multi-core and parallel environments is often achieved by message passing between multiple processes rather than shared memory and multi-threading. If different threads are acting on distinct objects then QuantLib can also be used in the multi-threading environment (define QL_ENABLE_SESSIONS to make the singletons thread dependent.).

If you are using QuantLib in Java or Scala via SWIG [1] then the QuantLib routines are automatically executed in a multi-threading environment even if your main routine is single threaded. The garbage collector of the JVM is usually running in different thread. In conjunction with QuantLib’s implementation of the Observer pattern this can lead to serious problems, e.g. the following single threaded Scala code crashes on a multi-core computer after a short period of time.

import org.quantlib.{Array => QArray, _}
object ObserverTest {
    def main(args: Array[String]) : Unit = {
        System.loadLibrary("QuantLibJNI");
        val aSimpleQuote = new SimpleQuote(0)

        while (true) {
            (0 until 10).foreach(_ => {
                new QuoteHandle(aSimpleQuote)
                aSimpleQuote.setValue(aSimpleQuote.value + 1)
            })
            System.gc
        }
    }
}

The garbage collector might call the destructor of an observer (QuoteHandle) at the same point in time when the update method on the same object is invoked by aSimpleQuote.setValue. (Find here the original postings on the QuantLib mailing list regarding this problem.). For a greenfield project the author in [2] has a nice solution for this problem in C++. Unfortunately this solution can not be applied to the QuantLib because the solution involves a change of signature of the Observer interface and thereafter would lead to a lot of change in the QuantLib library.

The main problem here is that we have to disable the Observer before the destructor starts to work. This could in theory be achieved by adding a special “Deleter” to every new instance of a boost::shared_ptr. But again this would lead to a lot of changes in the library.

When compiled with the preprocessor directive BOOST_SP_ENABLE_DEBUG_HOOKS the boost library adds a callback hook before a destructor is called by any boost smart pointer. This hook can be utilized to disable an Observer before the actual destructor is executed. In addition the boost::signals2 library [3] provides a simple and thread-safe notification mechanism.

The corresponding thread-safe implementation of the original QuantLib Observer/Observable interface can be found here. Set the preprocessor directive BOOST_SP_ENABLE_DEBUG_HOOKS for every source file. Replace/Add the files observable.hpp and observable.cpp and recompile the QuantLib library and the QuantLib-SWIG module.

[1] Simplified Wrapper and Interface Generator, SWIG

[2] Shuo Chen, Where Destructors meet Threads

[3] Douglas G., Mori Hess F., Boost.Signals2

Accelerate VPP Pricing on the GPU using Preconditioner

As outlined in Solving Sparse Linear Systems using CUSP and CUDA a VPP pricing problem can be solved on a GPU using finite difference methods based on the implicit Euler scheme. The matrix inversion is carried out on the GPU using the BiCGStab solver of the numerical library cusp [1]. The  optimal exercise problem for the VPP can be solved using dynamic programming for every time slice on the GPU (see [2]).

The lower convergence order in \Delta t of the implicit Euler scheme is not a disadvantage because the hourly optimization step forces \Delta t = 1h anyhow. This step size is small enough for the implicit Euler scheme to obtain a very good accuracy. The explicit Euler scheme is  numerically unstable, even if \Delta t = 1h. It converged only for large \Delta x, \Delta y and \Delta u.

Clearly the matrix inversion dominates the overall runtime of the pricing routine. Preconditioning is a standard technique to improve the performance of the involved BiCGStab algorithm. The cusp library supports a set of suited preconditioner for this problem.

  • Diagonal: preconditioning is very inexpensive to use, but has often limited effectiveness.
  • Approximate Inverse: Non-symmetric Bridson with different dropping strategies.
  • Smoothed Aggregation: Algebraic Multi-grid preconditioner based on smoothed aggregation.

On a CPU QuantLib supports preconditioning based on the Thomas algorithm. This algorithm solves a tridiagonal matrix with O(8n) operations. The corresponding preconditioner can be seen as a natural extension of the diagonal preconditioner. Experiments on a CPU indicate that  the tridiagonal preconditioner leads to a significant speed-up for our VPP pricing problem. The upcoming CUSPARSE library of the CUDA 4.1 release contains a tridiagonal solver for the GPU. With the help of a small interface class the CUDA tridiagonal solver can be used as a cusp preconditioner. The source code is available here. The code depends on the latest QuantLib version from the SVN trunk.

The testbed is a VPP contract with t_{minUp} = t_{minDown} = 2h and a maturity of 4 weeks (corresponds to 4*7*24=672 exercise steps). Therefore in addition to the three dimensions for the stochastic model the problem has a fourth dimension of size 2t_{minUp} + t_{minDown} = 6  to solve the optimization problem via dynamic programming.

The following diagram shows the different run-times of the algorithms on a GPU and a CPU.

The fastest algorithm on the GPU (implicit Euler scheme with BiCGStab and tridiagonal preconditioner) outperforms the best CPU implementation (operator splitting based on the Douglas scheme) roughly by a factor of ten. The non-symmetric Bridson preconditioner with the dropping strategy from Lin & More on the GPU is the second best algorithm but it ran out of memory on larger grids. The diagonal preconditioner slows down the overall performance on the GPU and the BiCGStab algorithm with the smoothed aggregation preconditioner did not converge.

[1] cusp, Generic Parallel Algorithms for Sparse Matrix and Graph Computations

[2] Pricing of a Virtual Power Plant on a GPU

Bermudan Swaption Pricing based on Finite Difference Methods

Even though short rate models like the Hull-White model or the G2++ model are getting a bit long in the tooth these models are still used for risk management or as benchmark models. Since the early days QuantLib supports the pricing of Bermudan swaptions based on trinomial trees. It’s time to compare the performance and accuracy of the trinomial tree pricing algorithm with finite difference methods.

The G2++ model is defined by the following stochastic differential equation

where f^M(0,T) is denoting the market instantaneous forward rate at time 0 for the maturity T (see. e.g. [1]). The Hull-White model can be seen as a one-dimensional simplification of the G2++ model. The corresponding partial differential equation (PDE) can be derived using the Feynman-Kac theorem.

QuantLib supports several operator splitting schemes to solve multi-dimensional PDEs.  The SVN trunk contains FDM engines to price Bermudan swaptions under the Hull-White and the G2++ model. Other short rate models like the CIR++ or Black-Karasinski model can be implemented in the same manner.

As can be seen in the diagrams below the finite difference method based pricing engines outperform the trinomial tree based engines.

[1] D. Brigo, F. Mercurio, Interest Rate Models – Theory and Practice

Solving Sparse Linear Systems using CUSP and CUDA

A finite difference equation can be represented and solved based on a sparse linear system. When using schemes with implicit parts to solve the equation one needs to calculate the inverse of this sparse matrix. Iterative solvers like the BiCGStab algorithm (plus preconditioner) are tailor-made for these kind of problems. In addition these solvers can easily be adapted to parallel computing architectures like modern GPUs. The cusp library [1] implements a set of common iterative solvers for sparse matrices based on CUDA and thrust [2].

Alternatively operator splitting techniques can be used to solve the problem. In this case only tridiagonal linear systems have to be inverted. The Thomas algorithm (named after Liewellyn Thomas) can solve such systems in O(n) operations instead of O(n^3) operations required by the Gaussian elimination. But it is not easy to parallelize the Thomas algorithm.

On a CPU operator splitting usually outperforms iterative solvers like BiCGStab for the implicit part. The VPP pricing problem from the previous blog entries will now serve as a benchmark to compare the solver performance on the CPU against a GPU. As can be seen in the diagram below on a CPU operator splitting is faster than the BiCGStab solver plus preconditioner. But the BiCGStab on the GPU clearly gains the lead even though the corresponding preconditioner for the GPU has not been implemented yet. Please find the interface class between QuantLib/boost::ublas and cusp here.

[1] cusp, Generic Parallel Algorithms for Sparse Matrix and Graph Computations

[2] thrust, Code at the speed of light

VPP Pricing V: Least-Squares Monte-Carlo

For the sake of completeness please find here the code for the evaluation of the virtual power plant (VPP) using a least-squares Monte-Carlo algorithm. The code depends on the latest QuantLib version from the SVN trunk or the upcoming QuantLib 1.2 release. The model and power plant specifications can be found in the previous blog entries. A more general description  of the problem and the algorithms can be found e.g. here [1]. Test forward curves can be taken e.g. from the Kyos example download page.

The regression polynomials are of third order in the spark spread x and the stochastic component of the gas price y.

p_a(x,y)=a_0 + a_1 x + a_2 x^2 +a_3x^3 + a_4y + a_5y^2 + a_6y^3 + a_7xy+a_8x^2y+a_9xy^2

The regression is carried out for every exercise right (every hour) and every possible VPP state separately. The calibration phase is based on ordinary Monte-Carlo scenarios, whereas the pricing is done using Quasi Monte-Carlo scenarios (Sobol sequence) and a Brownian Bridge (BB).

The following table summarizes the performance of the different pricing algorithms for the example contract and maturity of six month. Target accuracy is around 1% relative error in the NPV. The timings are given for a Core i5@3GHz CPU using four threads or a GTX560@0.8/1.6GHz GPU with 336 cores.

I don’t know the reason for the bad performance of the Gnu Linear Programming Kit for these kind of problems. Some commercial linear optimizer are much faster but they can not compete with dynamic programming for a simple VPP. As soon as e.g. time integral constraints are involved linear programming might become the method of choice.

[1] H. van Dijken, The value of starting up the power plant.

Pricing of a Virtual Power Plant on a GPU

Even the pricing of  a simple virtual power plant (VPP) is challenging. Main reasons are the high number of possible states of the VPP and the large number of possible exercise dates because often a VPP is priced as a bermudan-style option with hourly exercise rights. The implementation effort for an exact pricing engine based on finite difference methods (see e.g.[1]) or based on least squares Monte-Carlo is comparable large. As shown in [1] Monte-Carlo combined with  perfect foresight optimization can result in a very good approximation. The algorithm consists of a Monte-Carlo path generator and a dynamic programming optimization part, which calculates the optimal load schedule plan for each path separately. The stochastic processes involved are outlined in [1].

The CUDA based GPU implementation is available here.  It depends on the latest QuantLib version from the SVN trunk or the upcoming QuantLib 1.2 release and CUDA 4.0. The corresponding C++ implementation is a speed-optimized version of the test-case VPPTest::testVPPPricing. This version also supports multi-threading. The following hardware was used to compare both implementations:

  • CPU: Core i5@3.0 GHz, quad-core
  • GPU: GTX560@810/1620MHz, 336 cores

As can be seen in the diagram below the GPU outperforms the CPU roughly by a factor 100 for single precision and a factor of 50 if the GPU is using double precision.

The CUDA implementation consists of the following files:

gpuvpppricingengine.hpp / gpuvpppricingengine.cpp
A QuantLib pricing engine for a simple VPP based on a Monte-Carlo simulation and perfect foresight optimization via dynamic programming. The physical size of the Monte-Carlo simulation is controlled by the following parameters of the constructor

  1. Size nSimulations: number of Monte-Carlo simulations carried out.
  2. bool antithetic: enables/disables antithetic sampling
  3. Size blockSize: number of threads in a CUDA block.
  4. Size gridSize: number of CUDA blocks that are grouped together in a simulation kernel.

gpuvpppricingengine_kernel.hpp
gpuvpppricingengine_kernel.cu/ gpuvpppricingengine_kernel.def
The CUDA implementation consists of two kernels. The first kernel is the Monte-Carlo path generator, which calculates the paths on hourly granularity and stores them in the global memory of the graphic card.. The technics used are outlined e.g. in [2], [3] and [4]. The second kernel performs the optimization of the load schedule based on dynamic programming. The memory layout of this step depends on the number of possible states of the VPP because every possible state is stored in the shared memory of the GPU. The number of states is given by N_{states} = 2t_{up}+t_{down}. CUDA does not support efficient dynamic shared memory allocation. Therefore the sizes of all shared memory arrays must be given at compile time. To allow an optimal use of the limited shared memory capacity different kernels with different N_{states} values are generated using X-macros and the appropriate kernel is chosen at runtime.

cudatype.hpp
defines basic CUDA types, especially the typedef for the type “real” can be used to compile the code either for single or double precision.

gpurand.hpp
C++ interface for a GPU random number generator

gpucurand.hpp / gpucurand.cpp / gpucurand_kernel.hpp / gpucurand_kernel.cu
implementation of the GPURand interface based on the CURAND library, which is part of CUDA 4.0.

[1] this blog, VPP Pricing III: Exact Pricing based on Finite Difference Methods.

[2] L. Howes, D. Thomas, Efficient Random Number Generation and Application Using CUDA.

[3] A. Bernemann, R. Schreyer, K. Spanderen, Accelerating Exotic Option Pricing and Model Calibration Using GPUs

[4] M. Joshi, Graphical Asian Options

Finite Difference Schemes for the Heston-Hull-White Model

The hybrid Heston-Hull-White model is tailor-made to analyse the impact of stochastic interest rates on structured equity notes like e.g. auto-callables.

Unfortunately a semi-closed solution for european options exists only if at least two correlations are equal to zero which is in general unrealistic. A set of semi-closed approximations for this model can be found here [1]. The QuantLib has a finite difference pricing engine for american, bermudan and european options for the Heston-Hull-White model. This pricing engine supports cash dividends, control variate via the semi-closed Heston Model and pricing different strikes of european options of the same maturity using one backward solver run (especially useful to gain a large speed-up during model calibration.).

The Heston-Hull-White model is a good testbed to test the efficiency of the finite difference schemes based on operator splitting which are implemented in the QuantLib :

  • Douglas
  • Craig-Sneyd
  • Modified Craig-Sneyd
  • Hundsdorfer-Verwer
  • Modified Hundsdorfer-Verwer

These operator splitting methods are described here [2]. The testbed contains ten parameter sets of the Heston-Model taken from different publications.


The Hull-White parameters are set to \kappa_r=0.00883 and \sigma_r=0.00631.  The equity interest rate correlation is \rho_{Sr}=-0.4, interest rates and stochastic volatility aren’t correlated \rho_{vr}=0. The benchmark call options have maturity of 5 years, underlying at time t=0 is S_0=100 and possible strikes are K \in \{ 75,85,90,95,100,105,110,115,120,125,130,140,150 \}. The benchmark value is the average relative difference between the reference values and the option prices on the lattice for the different N_s strikes and N_m models.

The diagram below shows the results of the “Equity Case” for different grid sizes \{ t, S, v, r\}. and with control variate based on the semi–closed Heston Model. Clearly the Douglas scheme is the worst performer, the (modified) Craig-Sneyd and modified Hundsdorfer-Verwer scheme are the winner.

The overall average over the ten models is dominated by the two “pathological” parameter sets  “Ikonen-Toivanen” and “Kahl-Jäckel”. Again the Douglas scheme can not compete with the other schemes. The differences between the other schemes are comparable small except for the largest grid where the modified Hundsdorfer-Verwer scheme performs badly.

The relative pricing error with and without control variate is shown in the diagram below for the “Equity Case” Heston model and the modified Craig-Sneyd scheme. On average the usage of the control variate reduces the relative pricing error by a factor of 15.

The source code is available here. It depends on QuantLib 1.1 or higher.If you want to generate the plots you’ll also need R.

[1] L. A. Grzelak, C. W. Oosterleea, Lech A., On the Heston Model with Stochastic Interest Rates.

[2] K.J. in ‘t Hout, S. Foulon, ADI finite difference schemes for option pricing in the Heston model with correlation. Int. J. Numer. Anal. Mod. 7, 303-320 (2010).

[3]  S. Ikonen, J. Toivanen, Operator Splitting Methods for American Options with Stochastic Volatility.

[4] C. Kahl, P. Jäckel Not-so-complex logarithms in the Heston model.

Using Scala for Payoff Scripting

The advantages of payoff scripting based on a build-in interpreter or “on-the-fly compiler” instead of implementing the payoffs in C++ are obvious. Faster time-to-market because there is no need to recompile and deploy the C++ pricing library and people without deep C++ knowledge are able to develop and test new structured products. One disadvantage is often the execution speed of the chosen scripting language. Examples of languages I have seen/used for payoff scripting are Python  (C++ interface boost::python), Lua, tinycc and CINT. When it comes to execution speed none of these are suited to build high performance solutions, see. e.g. [1].  This is especially true if the Monte-Carlo scenario generator is running on a GPU. The payoff scripting on the CPU can then easily become the bottleneck of your pricing library.

Scala is a modern programming language that integrates object-oriented and functional language features. The Scala compiler generates byte code for the Java VM. Therefore the execution speed of a Scala script is comparable with Java and roughly a factor of two slower than C++ [1].

The Scala compiler itself is a Scala object and can be used at runtime to compile and link new scripts or classes. In addition using JNI it is fairly easy to attach a Java VM to a C++ process and to exchange data between C++ and Scala. Also Scala offers a lot of features to design an “internal”, user-friendly domain specific language (DSL) for payoff scripting.

The code for a small QuantLib/Scala Monte-Carlo simulation in action is available here. It depends on QuantLib 1.0 or higher, a Java 1.6 VM and Scala 2.8/9. Overwrite the PayoffImpl.scala class to implement different payoffs without recompiling the C++ code.

[1] Computer Language Benchmark Game

Follow

Get every new post delivered to your Inbox.