Update 21.02.2016: Added values for QR decomposition with pivoting and QuantLib performance improvements.
Least Squares Monte Carlo (LSMC) simulations spend a significant share of the total computation time on the linear least-squares regression, especially if the problem itself has a high-dimensional state. Preferred techniques to solve the normal equations $A^T A x = A^T b$ are the QR decomposition
$$A = QR$$
where $Q$ is orthogonal and $R$ is upper triangular, or the singular value decomposition (SVD)
$$A = U \Sigma V^T$$
where $U$ is column-orthogonal, $V$ is an orthogonal matrix and $\Sigma$ is a positive semi-definite diagonal matrix. The Cholesky factorization
$$A^T A = L L^T$$
where $L$ is lower triangular is the fastest method but often numerically unstable. The author of [1] summarizes all methods and also outlines the computational effort involved for $m$ observations and $n$ parameters as
- Cholesky factorization: costs $mn^2 + \frac{1}{3}n^3$ flops
- QR decomposition: costs $2mn^2 - \frac{2}{3}n^3$ flops
- Singular value decomposition: costs $2mn^2 + 11n^3$ flops
For LSMC simulations we have $m \gg n$ and therefore the QR decomposition has no computational advantage over the SVD. Since a QR decomposition without column pivoting has numerical problems if $A$ is rank-deficient, the singular value decomposition is often the method of choice for LSMC simulations.
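The effect of many observations and few parameters on the operation counts can be checked with a few lines of arithmetic. The sketch below uses the leading-order flop counts commonly quoted for the three methods (Cholesky on the normal equations, Householder QR, SVD); the function names and the sample sizes are illustrative assumptions, not values from the benchmark.

```python
# Leading-order flop counts for an m x n least-squares problem.
# m = number of observations (paths), n = number of regression parameters.

def cholesky_flops(m, n):
    return m * n**2 + n**3 / 3.0

def qr_flops(m, n):
    return 2.0 * m * n**2 - 2.0 * n**3 / 3.0

def svd_flops(m, n):
    return 2.0 * m * n**2 + 11.0 * n**3

# Hypothetical LSMC-like sizes: many paths, few basis functions.
m, n = 100_000, 10
ratio_svd_qr = svd_flops(m, n) / qr_flops(m, n)
ratio_qr_chol = qr_flops(m, n) / cholesky_flops(m, n)

print(f"SVD / QR flops:      {ratio_svd_qr:.6f}")
print(f"QR / Cholesky flops: {ratio_qr_chol:.6f}")
```

For these sizes the SVD and QR counts differ by well under one percent, while both cost roughly twice as much as the Cholesky route.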
All three decomposition methods are available in QuantLib, in LAPACK (with or without the optimized OpenBLAS library) and in Intel’s MKL. A standard swing option valuation via LSMC serves as a test bed to measure the performance of the QR decomposition with and without column pivoting and of the SVD algorithm. For LAPACK and MKL the routines dgels and dgesvd have been used to implement the QR decomposition without pivoting and the SVD, respectively, whereas QR with pivoting is based on dgeqp3, dormqr and dtrsm. In order to keep the results comparable, single-thread performance was measured in all cases. The reference prices are calculated via finite difference methods, and all LSMC implementations led to the same price, in line with the reference price. The current QR implementation in QuantLib 1.7 has a performance issue if the number of rows is much larger than the number of columns. For these tests an improved version of QuantLib’s QR decomposition has been used.
As expected, MKL is often the fastest library, but the difference between MKL and LAPACK plus OpenBLAS is small.
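To make concrete what a QR-based least-squares solve without pivoting does, here is a minimal pure-Python sketch. It uses modified Gram-Schmidt rather than the Householder reflections used by LAPACK, assumes full column rank, and is meant as an illustration only; the function name and the test data are assumptions.

```python
import math

def qr_least_squares(A, b):
    """Solve min ||A x - b||_2 via a modified Gram-Schmidt QR (no pivoting)."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]  # column copies of A
    Q = []                                   # orthonormal columns
    R = [[0.0] * n for _ in range(n)]        # upper triangular factor
    for j in range(n):
        R[j][j] = math.sqrt(sum(v * v for v in cols[j]))
        if R[j][j] == 0.0:
            raise ValueError("rank-deficient matrix: use pivoting or the SVD")
        q = [v / R[j][j] for v in cols[j]]
        Q.append(q)
        # Remove the q-component from all remaining columns.
        for k in range(j + 1, n):
            R[j][k] = sum(qi * ai for qi, ai in zip(q, cols[k]))
            cols[k] = [ai - R[j][k] * qi for qi, ai in zip(q, cols[k])]
    # Right-hand side c = Q^T b, then back substitution for R x = c.
    c = [sum(qi * bi for qi, bi in zip(q, b)) for q in Q]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = c[i] - sum(R[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / R[i][i]
    return x

# Regression of y = 1 + 2x + 3x^2 on the basis (1, x, x^2).
xs = [0.1 * i for i in range(50)]
A = [[1.0, x, x * x] for x in xs]
b = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
coeffs = qr_least_squares(A, b)
print(coeffs)
```

The explicit check on a zero diagonal entry is where a rank-deficient regression matrix would fail, which is the case column pivoting or the SVD is meant to handle.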
[1] Do Q Lee, 2012, Numerically Efficient Methods for Solving Least Squares Problems
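The Heston model is defined by the following stochastic differential equation of the log spot $x_t = \ln S_t$
$$dx_t = \left(r_t - q_t - \frac{\nu_t}{2}\right) dt + \sqrt{\nu_t}\, dW^x_t$$
$$d\nu_t = \kappa(\theta - \nu_t)\, dt + \sigma \sqrt{\nu_t}\, dW^\nu_t$$
$$\rho\, dt = dW^x_t\, dW^\nu_t$$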
To a significant extent the popularity of the Heston model is based on the fact that semi-closed formulas for vanilla European options exist using the characteristic function of the model. The time evolution of the probability density function is given by the corresponding Fokker-Planck equation [1]
with the initial condition
The reduced probability density function
for this initial value problem can be calculated using a semi-closed integral formula [2]
This gives the opportunity to write a pricing engine for arbitrary European payoffs. The value of a European option with payoff function at maturity is given by
The calculation needs two nested integrations, which can be carried out efficiently using e.g. the Gauss-Lobatto algorithm. The solution of the equation
determines the upper boundary for the integration over . The boundaries for the integration over are chosen such that the interval covers ten times the expected variance
Obviously the nested integration makes this algorithm trickier than the standard ways to price plain vanilla European options, but it is not limited to vanilla payoffs. The implementation of this algorithm can be found here within the QuantLib trunk on GitHub.
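Broadie and Kaya [3] have outlined an algorithm to sample from the full probability density function instead of the reduced density function. The starting point for this algorithm is the exact solution of the Heston stochastic differential equation
$$\ln S_t = \ln S_s + \int_s^t (r_u - q_u)\, du - \frac{1}{2}\int_s^t \nu_u\, du + \rho \int_s^t \sqrt{\nu_u}\, dW^\nu_u + \sqrt{1-\rho^2} \int_s^t \sqrt{\nu_u}\, dW_u$$
$$\nu_t = \nu_s + \kappa\theta (t-s) - \kappa \int_s^t \nu_u\, du + \sigma \int_s^t \sqrt{\nu_u}\, dW^\nu_u$$
where $W$ is a Brownian motion independent of $W^\nu$. The probability density function of the variance process is given by a noncentral chi-squared distribution
$$\nu_t = \frac{\sigma^2 \left(1 - e^{-\kappa (t-s)}\right)}{4\kappa}\, \chi'^2_d(\lambda), \qquad d = \frac{4\theta\kappa}{\sigma^2}, \qquad \lambda = \frac{4\kappa\, e^{-\kappa (t-s)}\, \nu_s}{\sigma^2 \left(1 - e^{-\kappa (t-s)}\right)}$$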
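The variance step can be sketched with the standard library alone. The example assumes the usual Heston parameter names (kappa, theta, sigma) and uses the representation of a noncentral chi-squared variable with more than one degree of freedom as a shifted squared normal plus a central chi-squared variable; it is not QuantLib's implementation, and the parameter values are made up.

```python
import math
import random

def sample_variance(nu_s, dt, kappa, theta, sigma, rng):
    """Draw nu_t given nu_s from the noncentral chi-squared transition law."""
    decay = math.exp(-kappa * dt)
    scale = sigma * sigma * (1.0 - decay) / (4.0 * kappa)
    d = 4.0 * kappa * theta / (sigma * sigma)   # degrees of freedom
    lam = nu_s * decay / scale                  # non-centrality parameter
    if d <= 1.0:
        raise ValueError("this sketch only covers d > 1")
    # chi'^2_d(lam) = (Z + sqrt(lam))^2 + chi^2_{d-1} for d > 1
    z = rng.gauss(0.0, 1.0)
    central = rng.gammavariate((d - 1.0) / 2.0, 2.0)  # chi^2 with d-1 dof
    return scale * ((z + math.sqrt(lam)) ** 2 + central)

# Hypothetical parameters; the exact mean is theta + (nu_s - theta)*exp(-kappa*dt).
rng = random.Random(42)
samples = [sample_variance(0.04, 1.0, 2.0, 0.04, 0.3, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(f"sample mean of nu_t: {mean:.5f}")
```

Because the draw is exact, the time step can be arbitrarily large without introducing a discretization bias in the variance.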
The distribution of the integral $\int_s^t \nu_u\, du$ conditional on $\nu_s$ and $\nu_t$ can be calculated via the characteristic function
The modified Bessel function of the first kind can be evaluated using a series expansion for small and medium arguments or an asymptotic approximation for large arguments. Unfortunately Boost provides only real versions of the Bessel functions, and the copyright status of older complex-valued Fortran77 routines is unclear. Therefore QuantLib comes with its own implementation.
Please notice that is already a continuous version of the characteristic function and therefore the integration does not need to track the branches of when calculating the complex valued Bessel function .
The integration over the characteristic function is best done using either the Gauss-Laguerre, the Gauss-Lobatto or the trapezoidal rule. The latter two algorithms need to truncate the integration at some upper bound. A first guess for the truncation limit can be taken from the Cornish-Fisher expansion for some very small tail probability. The moment-generating function can be used to get the first, second and third moments of the distribution via finite difference quotients.
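The moment and Cornish-Fisher steps can be sketched as follows. Since the moment-generating function of the integrated variance is not reproduced here, the example uses the exponential distribution with unit rate as a stand-in; the function names, the step size h and the tail probability eps are assumptions.

```python
import math
from statistics import NormalDist

def moments_from_mgf(mgf, h=1e-2):
    """First three raw moments via central finite differences of the MGF at 0."""
    m1 = (mgf(h) - mgf(-h)) / (2.0 * h)
    m2 = (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / (h * h)
    m3 = (mgf(2 * h) - 2.0 * mgf(h) + 2.0 * mgf(-h) - mgf(-2 * h)) / (2.0 * h**3)
    return m1, m2, m3

def cornish_fisher_quantile(mgf, eps=1e-6, h=1e-2):
    """Skewness-corrected estimate of the (1 - eps) quantile."""
    m1, m2, m3 = moments_from_mgf(mgf, h)
    std = math.sqrt(m2 - m1 * m1)
    skew = (m3 - 3.0 * m1 * m2 + 2.0 * m1**3) / std**3
    z = NormalDist().inv_cdf(1.0 - eps)
    return m1 + std * (z + (z * z - 1.0) * skew / 6.0)

# Stand-in distribution: exponential with rate 1, MGF 1/(1 - s) for s < 1.
def exp_mgf(s):
    return 1.0 / (1.0 - s)

m1, m2, m3 = moments_from_mgf(exp_mgf)
limit = cornish_fisher_quantile(exp_mgf, eps=1e-6)
print(m1, m2, m3, limit)
```

For the stand-in distribution the exact raw moments are 1, 2 and 6, and the exact quantile is about 13.8, so the expansion is only a starting guess that may need to be widened.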
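The next term is now fairly easy to calculate
$$\int_s^t \sqrt{\nu_u}\, dW^\nu_u = \frac{1}{\sigma}\left(\nu_t - \nu_s - \kappa\theta (t-s) + \kappa \int_s^t \nu_u\, du\right)$$
The log spot process can now be sampled using a standard normal random variable $Z$ and
$$\ln S_t = \ln S_s + \int_s^t (r_u - q_u)\, du - \frac{1}{2}\int_s^t \nu_u\, du + \rho \int_s^t \sqrt{\nu_u}\, dW^\nu_u + \sqrt{(1-\rho^2) \int_s^t \nu_u\, du}\;\, Z$$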
This sampling algorithm is exact even for very large time steps and therefore offers some advantages for quasi-random Monte Carlo methods, but the numerical inversion of the characteristic function makes it very slow. The algorithm is implemented within the HestonProcess class.
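Once the end-point variance and the integrated variance have been drawn, the final log-spot update of the Broadie-Kaya scheme is a one-liner. The sketch below assumes constant rates r and q over the step and takes the integrated variance as an input; the function and argument names are hypothetical and this is not the HestonProcess code.

```python
import math

def log_spot_step(x_s, nu_s, nu_t, int_nu, dt, r, q, kappa, theta, sigma, rho, z):
    """Exact log-spot update given nu_s, nu_t and int_nu = integral of nu over the step."""
    # Stochastic integral of sqrt(nu) dW^nu, recovered from the variance equation.
    int_sqrt_nu_dw = (nu_t - nu_s - kappa * theta * dt + kappa * int_nu) / sigma
    drift = (r - q) * dt - 0.5 * int_nu + rho * int_sqrt_nu_dw
    diffusion = math.sqrt((1.0 - rho * rho) * int_nu)
    return x_s + drift + diffusion * z   # z: standard normal draw

# Made-up inputs for a one-year step.
x0 = math.log(100.0)
x1 = log_spot_step(x0, 0.04, 0.04, 0.04, 1.0, 0.05, 0.0,
                   kappa=2.0, theta=0.04, sigma=0.3, rho=-0.5, z=1.0)
print(math.exp(x1))
```

With these inputs the stochastic integral term vanishes, so the step reduces to the drift plus one scaled normal draw.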
[1] I. Clark, Foreign Exchange Option Pricing: A Practitioners Guide, p. 113
[2] A. Dragulescu, V. Yakovenko, Probability distribution of returns in the Heston model with stochastic volatility
[3] M. Broadie, Ö. Kaya, Exact Simulation of Stochastic Volatility and other Affine Jump Diffusion Processes
[4] R. Lord, Efficient pricing algorithms for exotic derivatives, p. 40
[5] J.R. Culham, Bessel Functions of the First and Second Kind
In his blog Martin Fowler discusses the LMAX architecture, a high-throughput retail financial trading platform. A remarkable detail of the architecture is the central business logic processor, which is implemented as a single-threaded Java program. The supporting pre- and post-processing runs as a multi-threaded application using lock-free ring buffers [1].
In general lock-free algorithms are implemented using atomic read-modify-write primitives, which are provided by modern CPUs. Probably the most used lock-free algorithm in the Boost library is the reference counting in boost::shared_ptr. On popular hardware platforms these reference count updates are based on atomic increments/decrements using compare-and-swap (CAS) operations. The performance improvements over normal mutex locks are significant, as the following little experiment shows. The basis for this experiment is a function which increments an integer counter 500 million times in a loop. This loop is carried out as
- a simple single-threaded loop
- a single-threaded loop using atomic increments
- a single-threaded loop acquiring a mutex lock for every pass
- two threads using atomic increments
- two threads using mutex locks
A Java and a C++ implementation are available here. The Java code uses the package java.util.concurrent.atomic and the C++ code uses boost::detail::atomic_count to implement the atomic increments. The run times are measured on a Core i3@3GHz.
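The structure of the experiment can be sketched in Python as well, with two caveats stated up front: the Python standard library has no atomic integer type, so the atomic variants are left out, and because of the interpreter's global lock the timings say nothing about the Java or C++ numbers. The iteration count is reduced and all names are illustrative.

```python
import threading
import time

def plain_loop(n):
    """Single-threaded increment loop without any synchronization."""
    counter = 0
    for _ in range(n):
        counter += 1
    return counter

class LockedCounter:
    """Counter whose increments are protected by a mutex."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, n):
        for _ in range(n):
            with self._lock:        # acquire the lock for every pass
                self.value += 1

def locked_loop(n, n_threads):
    counter = LockedCounter()
    threads = [threading.Thread(target=counter.add, args=(n // n_threads,))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

N = 200_000   # far fewer passes than the 500 million of the original experiment
results = {
    "single thread, plain": timed(plain_loop, N),
    "single thread, mutex": timed(locked_loop, N, 1),
    "two threads, mutex": timed(locked_loop, N, 2),
}
for name, (count, seconds) in results.items():
    print(f"{name}: count={count}, {seconds:.3f}s")
```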
Lock-free algorithms are difficult to implement and to debug. Tim Blechmann has written a little gem of a library, boost::lockfree (parts of the library are now in the Boost release 1.53.0). Among other things this library contains a wait-free single-producer/single-consumer ring buffer. This ring buffer is well suited, for example, to separating the Monte-Carlo path generation from the pricing of a derivative. With a few lines of code the path generation can then run in a different thread.
The code available here contains a BufferedDataFactory based on the boost::lockfree::ringbuffer class and a slightly modified version of the QuantLib test case HybridHestonHullWhiteProcessTest::testZeroBondPricing. In this version of the test case the path generation runs in a separate thread and uses the BufferedDataFactory to hand over the data to the pricing thread.
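The producer/consumer split itself can be sketched in a few lines of Python. Note the substitution: queue.Queue is a bounded, lock-based queue, not a wait-free ring buffer, and the path generator below is a toy one-step lognormal model rather than the hybrid Heston-Hull-White process of the test case. All names and parameters are assumptions.

```python
import math
import queue
import random
import threading

SENTINEL = None   # marks the end of the path stream

def generate_terminal_values(n_paths, seed, out):
    """Producer: simulate terminal spot values and push them into the buffer."""
    rng = random.Random(seed)
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        # Toy model: S_0 = 100, zero rates, 20% volatility, one year.
        out.put(100.0 * math.exp(-0.02 + 0.2 * z))
    out.put(SENTINEL)

def price_call(strike, n_paths, seed, buffer_size=1024):
    """Consumer: average the call payoff while paths are generated in another thread."""
    buf = queue.Queue(maxsize=buffer_size)   # bounded buffer between the threads
    producer = threading.Thread(target=generate_terminal_values,
                                args=(n_paths, seed, buf))
    producer.start()
    total = 0.0
    while True:
        s = buf.get()
        if s is SENTINEL:
            break
        total += max(s - strike, 0.0)
    producer.join()
    return total / n_paths

price = price_call(100.0, 5000, seed=7)
print(f"call price estimate: {price:.4f}")
```

With a single producer and a single consumer the values arrive in generation order, so the result is identical to a sequential run with the same seed.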
[1] Disruptor: High performance alternative to bounded queues for exchanging data between concurrent threads.
For the sake of completeness, please find here the code for the evaluation of the virtual power plant (VPP) using a least-squares Monte-Carlo algorithm. The code depends on the latest QuantLib version from the SVN trunk or the upcoming QuantLib 1.2 release. The model and power plant specifications can be found in the previous blog entries. A more general description of the problem and the algorithms can be found e.g. here [1]. Test forward curves can be taken e.g. from the Kyos example download page.
The regression polynomials are of third order in the spark spread and in the stochastic component of the gas price.
The regression is carried out for every exercise right (every hour) and every possible VPP state separately. The calibration phase is based on ordinary Monte-Carlo scenarios, whereas the pricing is done using Quasi Monte-Carlo scenarios (Sobol sequence) and a Brownian Bridge (BB).
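As an illustration of such a regression basis, the sketch below builds all monomials of total degree up to three in two variables. Reading "third order" as total degree is an assumption, as are the function names; the actual engine may use a different polynomial family.

```python
from itertools import product

def monomial_exponents(max_order=3):
    """Exponent pairs (i, j) with i + j <= max_order."""
    return [(i, j) for i, j in product(range(max_order + 1), repeat=2)
            if i + j <= max_order]

def basis_values(spark_spread, gas_component, max_order=3):
    """Evaluate every basis monomial for one scenario and one exercise date."""
    return [spark_spread**i * gas_component**j
            for i, j in monomial_exponents(max_order)]

exponents = monomial_exponents()
row = basis_values(2.0, 3.0)
print(len(exponents), row)
```

One such row per path forms the regression matrix for a given exercise date and VPP state, so a third-order basis in two variables gives ten regression coefficients per regression.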
The following table summarizes the performance of the different pricing algorithms for the example contract and a maturity of six months. The target accuracy is around 1% relative error in the NPV. The timings are given for a Core i5@3GHz CPU using four threads or a GTX560@0.8/1.6GHz GPU with 336 cores.
I don’t know the reason for the bad performance of the GNU Linear Programming Kit for this kind of problem. Some commercial linear optimizers are much faster, but they cannot compete with dynamic programming for a simple VPP. As soon as e.g. time integral constraints are involved, linear programming might become the method of choice.
[1] H. van Dijken, The value of starting up the power plant.
Even the pricing of a simple virtual power plant (VPP) is challenging. The main reasons are the high number of possible states of the VPP and the large number of possible exercise dates, because a VPP is often priced as a Bermudan-style option with hourly exercise rights. The implementation effort for an exact pricing engine based on finite difference methods (see e.g.) or based on least squares Monte-Carlo is comparably large. As shown in  Monte-Carlo combined with perfect foresight optimization can result in a very good approximation. The algorithm consists of a Monte-Carlo path generator and a dynamic programming optimization part, which calculates the optimal load schedule for each path separately. The stochastic processes involved are outlined in .
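The dynamic programming part can be illustrated with a deliberately simplified plant: two states (off and on), a fixed start-up cost and no minimum up or down times. This is a sketch under those assumptions, not the state model of the actual engine, and the function names and sample paths are made up.

```python
def optimal_schedule_value(spark_spreads, startup_cost, initially_on=False):
    """Value of the optimal on/off schedule along one known hourly spark-spread path."""
    # value_off / value_on: value-to-go if the plant was off / on in the previous hour.
    value_off, value_on = 0.0, 0.0
    for s in reversed(spark_spreads):        # backward induction over the hours
        run = s + value_on                   # produce during this hour
        idle = value_off                     # stay switched off during this hour
        value_off, value_on = max(idle, run - startup_cost), max(idle, run)
    return value_on if initially_on else value_off

def perfect_foresight_price(paths, startup_cost):
    """Monte-Carlo average of the per-path optimal values."""
    return sum(optimal_schedule_value(p, startup_cost) for p in paths) / len(paths)

# Two hand-made paths with a start-up cost of 3.
paths = [[5.0, -1.0, 4.0], [5.0, -10.0, 4.0]]
values = [optimal_schedule_value(p, 3.0) for p in paths]
price = perfect_foresight_price(paths, 3.0)
print(values, price)
```

On the first path it is optimal to run through the slightly negative hour, while on the second path it pays to shut down and start up again. Because each path is optimized with full knowledge of its future, the average is biased upwards relative to a strategy that only uses information available at each exercise date.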
The CUDA-based GPU implementation is available here. It depends on the latest QuantLib version from the SVN trunk or the upcoming QuantLib 1.2 release and on CUDA 4.0. The corresponding C++ implementation is a speed-optimized version of the test case VPPTest::testVPPPricing. This version also supports multi-threading. The following hardware was used to compare both implementations:
- CPU: Intel Core, quad-core
- GPU: GTX560@810/1620MHz, 336 cores
As can be seen in the diagram below, the GPU outperforms the CPU by roughly a factor of 100 for single precision and by a factor of 50 if the GPU is using double precision.
The CUDA implementation consists of the following files:
gpuvpppricingengine.hpp / gpuvpppricingengine.cpp
A QuantLib pricing engine for a simple VPP based on a Monte-Carlo simulation and perfect foresight optimization via dynamic programming. The size of the Monte-Carlo simulation is controlled by the following parameters of the constructor:
- Size nSimulations: number of Monte-Carlo simulations carried out.
- bool antithetic: enables/disables antithetic sampling
- Size blockSize: number of threads in a CUDA block.
- Size gridSize: number of CUDA blocks that are grouped together in a simulation kernel.
The CUDA implementation consists of two kernels. The first kernel is the Monte-Carlo path generator, which calculates the paths on hourly granularity and stores them in the global memory of the graphics card. The techniques used are outlined e.g. in [1], [2] and [3]. The second kernel performs the optimization of the load schedule based on dynamic programming. The memory layout of this step depends on the number of possible states of the VPP because every possible state is stored in the shared memory of the GPU. The number of states is given by . CUDA does not support efficient dynamic shared memory allocation. Therefore the sizes of all shared memory arrays must be given at compile time. To allow an optimal use of the limited shared memory capacity, different kernels with different values are generated using X-macros and the appropriate kernel is chosen at runtime.
defines basic CUDA types, especially the typedef for the type “real” can be used to compile the code either for single or double precision.
C++ interface for a GPU random number generator
gpucurand.hpp / gpucurand.cpp / gpucurand_kernel.hpp / gpucurand_kernel.cu
implementation of the GPURand interface based on the CURAND library, which is part of CUDA 4.0.
[1] L. Howes, D. Thomas, Efficient Random Number Generation and Application Using CUDA.
[2] A. Bernemann, R. Schreyer, K. Spanderen, Accelerating Exotic Option Pricing and Model Calibration Using GPUs
[3] M. Joshi, Graphical Asian Options