Update 21.02.2016: Added values for QR decomposition with pivoting and QuantLib performance improvements.
Least Squares Monte Carlo (LSMC) simulations spend a significant amount of the total computation time on the generalized least squares regression, especially if the problem itself has a high-dimensional state. Preferred techniques to solve the normal equations $A^T A x = A^T b$ are the QR decomposition
$$A = QR$$
where $Q$ is orthogonal and $R$ is upper triangular, or the singular value decomposition (SVD)
$$A = U \Sigma V^T$$
where $U$ is column-orthogonal, $V$ is an orthogonal matrix and $\Sigma$ is a positive semi-definite diagonal matrix. The Cholesky factorization
$$A^T A = L L^T$$
where $L$ is lower triangular is the fastest method but often numerically unstable. The author in [1] summarizes all methods and also outlines the computational effort involved for $m$ observations and $n$ parameters as
- Cholesky factorization: costs $mn^2 + n^3/3$ flops
- QR decomposition: costs $2mn^2 - 2n^3/3$ flops
- Singular value decomposition: costs $2mn^2 + 11n^3$ flops
For LSMC simulations we have $m \gg n$ and therefore the QR decomposition has no computational advantage over the SVD. Since a QR decomposition without column pivoting has numerical problems if $A$ is rank-deficient, the singular value decomposition is often the method of choice for LSMC simulations.
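To make the flop comparison concrete, here is a small sketch that evaluates the leading-order operation counts commonly quoted for the three methods. The function name and the problem size (100,000 paths, 20 basis functions) are hypothetical and only serve as an illustration.

```python
def lsq_flops(m, n):
    """Leading-order flop counts for an m x n least squares problem (m >= n)."""
    return {
        "cholesky": m * n**2 + n**3 / 3.0,
        "qr": 2.0 * m * n**2 - 2.0 * n**3 / 3.0,
        "svd": 2.0 * m * n**2 + 11.0 * n**3,
    }

# Hypothetical LSMC size: many more observations (paths) than parameters.
flops = lsq_flops(100_000, 20)
svd_over_qr = flops["svd"] / flops["qr"]
print(f"SVD/QR flop ratio: {svd_over_qr:.4f}")
```

For such a tall and thin matrix the $2mn^2$ term dominates both counts, so the ratio is very close to one.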
All three decomposition methods are available in QuantLib, in LAPACK (with or without an optimized OpenBLAS library) and in Intel’s MKL library. A standard swing option valuation via LSMC serves as a test bed to measure the performance of the QR decomposition with and without column pivoting and of the SVD algorithm. For LAPACK and MKL the routines dgels and dgesvd have been used to implement the QR decomposition without pivoting and the SVD, respectively, whereas QR with pivoting is based on dgeqp3, dormqr and dtrsm. In order to keep the results comparable, the single-thread performance was measured in all cases. The reference prices are calculated via finite difference methods, and all LSMC implementations have led to the same price, in line with the reference price. The current QR implementation in QuantLib 1.7 has a performance issue if the number of rows is much larger than the number of columns. For these tests an improved version of QuantLib’s QR decomposition has been used.
As expected, MKL is often the fastest library, but the difference between MKL and LAPACK plus OpenBLAS is small.
[1] Do Q Lee, 2012, Numerically Efficient Methods for Solving Least Squares Problems
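The Heston model is defined by the following stochastic differential equation of the log spot $x_t = \ln S_t$
$$\begin{aligned} dx_t &= \left(r_t - q_t - \frac{\nu_t}{2}\right) dt + \sqrt{\nu_t}\, dW^x_t \\ d\nu_t &= \kappa(\theta - \nu_t)\, dt + \sigma \sqrt{\nu_t}\, dW^\nu_t \\ \rho\, dt &= dW^x_t\, dW^\nu_t \end{aligned}$$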
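To a significant extent the popularity of the Heston model is based on the fact that semi-closed formulas for vanilla European options exist using the characteristic function of the model. The time evolution of the probability density function $p(x, \nu, t)$ is given by the corresponding Fokker-Planck equation [1]
$$\frac{\partial p}{\partial t} = \frac{1}{2}\frac{\partial^2}{\partial x^2}(\nu p) + \rho\sigma\frac{\partial^2}{\partial x\, \partial \nu}(\nu p) + \frac{\sigma^2}{2}\frac{\partial^2}{\partial \nu^2}(\nu p) - \frac{\partial}{\partial x}\left[\left(r - q - \frac{\nu}{2}\right)p\right] - \frac{\partial}{\partial \nu}\left[\kappa(\theta - \nu)\,p\right]$$
with the initial condition
$$p(x, \nu, t=0) = \delta(x - x_0)\,\delta(\nu - \nu_0)$$
The reduced probability density function
$$\tilde{p}(x, t) = \int_0^\infty p(x, \nu, t)\, d\nu$$
for this initial value problem can be calculated using a semi-closed integral formula [2].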
This gives the opportunity to write a pricing engine for arbitrary European payoffs. The value of a European option with a given payoff function at maturity is given by the discounted integral of the payoff weighted with the reduced probability density at maturity.
The calculation needs two nested integrations, which can be carried out efficiently using e.g. the Gauss-Lobatto algorithm. The solution of an auxiliary equation determines the upper boundary for the inner integration of the density formula.
The boundaries for the outer integration over the log spot are chosen such that the interval covers ten times the expected variance.
Obviously the nested integration makes this algorithm trickier than the standard ways to price plain vanilla European options, but it is not limited to vanilla payoffs. The implementation of this algorithm can be found here within the QuantLib trunk on GitHub.
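Broadie and Kaya [3] have outlined an algorithm to sample from the full probability density function $p(x, \nu, t)$ instead of the reduced density function. The starting point for this algorithm is the exact solution of the Heston stochastic differential equation
$$\begin{aligned} x_t &= x_s + \int_s^t (r_u - q_u)\, du - \frac{1}{2}\int_s^t \nu_u\, du + \rho\int_s^t \sqrt{\nu_u}\, dW^\nu_u + \sqrt{1-\rho^2}\int_s^t \sqrt{\nu_u}\, dW_u \\ \nu_t &= \nu_s + \kappa\theta\,(t-s) - \kappa\int_s^t \nu_u\, du + \sigma\int_s^t \sqrt{\nu_u}\, dW^\nu_u \end{aligned}$$
where $W$ is a Brownian motion independent of $W^\nu$. The probability density function of the variance process is given by a noncentral chi-squared distribution
$$\nu_t = \frac{\sigma^2\left(1 - e^{-\kappa(t-s)}\right)}{4\kappa}\, \chi'^2_d(\lambda), \quad d = \frac{4\kappa\theta}{\sigma^2}, \quad \lambda = \frac{4\kappa\, e^{-\kappa(t-s)}}{\sigma^2\left(1 - e^{-\kappa(t-s)}\right)}\, \nu_s$$
The distribution of the integral $\int_s^t \nu_u\, du$ conditional on $\nu_s$ and $\nu_t$ can be calculated via its characteristic function, which is given in [3] in terms of modified Bessel functions of the first kind.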
The modified Bessel function of the first kind can be evaluated using a series expansion for small and medium arguments or an asymptotic approximation for large arguments [5]. Unfortunately Boost provides only real versions of the Bessel functions and the copyright status of the older complex valued Fortran77 routines is vague. Therefore QuantLib comes with its own implementation.
Please note that this characteristic function is already a continuous version [4] and therefore the integration does not need to track branch switches when calculating the complex valued Bessel function.
The integration over the characteristic function is best done using either the Gauss-Laguerre, the Gauss-Lobatto or the trapezoidal rule. The latter two algorithms need to truncate the integration at some upper bound. A first guess for the truncation limit can be taken from the Cornish-Fisher expansion for some very small tail probability. The moment-generating function can be used to get the first, second and third moment of the distribution via finite difference quotients.
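The moment and truncation step can be sketched as follows. This is only an illustration under stated assumptions: the function names, the step size `h` and the tail probability `alpha` are hypothetical, the Cornish-Fisher expansion is cut off after the skewness term, and a Gamma distribution stands in for the real moment-generating function.

```python
import math
from statistics import NormalDist

def moments_from_mgf(mgf, h=1e-2):
    """First three raw moments from central finite differences of the MGF at 0."""
    m1 = (mgf(h) - mgf(-h)) / (2.0 * h)
    m2 = (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / h**2
    m3 = (mgf(2 * h) - 2.0 * mgf(h) + 2.0 * mgf(-h) - mgf(-2 * h)) / (2.0 * h**3)
    return m1, m2, m3

def cornish_fisher_upper_bound(mgf, alpha=1e-6, h=1e-2):
    """Approximate (1 - alpha) quantile using mean, variance and skewness only."""
    m1, m2, m3 = moments_from_mgf(mgf, h)
    std = math.sqrt(m2 - m1**2)
    skew = (m3 - 3.0 * m1 * m2 + 2.0 * m1**3) / std**3
    z = NormalDist().inv_cdf(1.0 - alpha)
    return m1 + std * (z + (z * z - 1.0) * skew / 6.0)

# Stand-in example: Gamma(shape=4, scale=0.5) with MGF (1 - 0.5 t)^-4.
gamma_mgf = lambda t: (1.0 - 0.5 * t) ** -4.0
upper_bound = cornish_fisher_upper_bound(gamma_mgf)
```

The finite difference quotients lose accuracy if `h` is chosen too small, so the step size is a trade-off between truncation and round-off error.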
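The next term is now fairly easy to calculate
$$\int_s^t \sqrt{\nu_u}\, dW^\nu_u = \frac{1}{\sigma}\left(\nu_t - \nu_s - \kappa\theta\,(t-s) + \kappa\int_s^t \nu_u\, du\right)$$
The log spot process can now be sampled using a standard normal random variable $Z$ and
$$x_t = x_s + \int_s^t (r_u - q_u)\, du - \frac{1}{2}\int_s^t \nu_u\, du + \rho\int_s^t \sqrt{\nu_u}\, dW^\nu_u + \sqrt{(1-\rho^2)\int_s^t \nu_u\, du}\;\, Z$$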
This sampling algorithm is exact even for very large time steps and therefore gives some advantages for quasi Monte-Carlo methods, but the numerical inversion of the characteristic function is also very slow. The algorithm is implemented within the HestonProcess class.
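Two of the steps can be sketched with the Python standard library alone: the variance draw from the scaled noncentral chi-squared distribution (written as a Poisson mixture of central chi-squared laws) and the final log spot step. This is a minimal sketch and not the QuantLib implementation. All function names are hypothetical, constant rates `r` and `q` are assumed, and the conditional integrated variance `int_nu` is taken as an input because its sampling requires the inversion of the characteristic function, which is not shown.

```python
import math
import random

def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for moderate means."""
    limit, k, p = math.exp(-mean), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def sample_variance(nu_s, dt, kappa, theta, sigma, rng):
    """Draw nu_t given nu_s from the scaled noncentral chi-squared law."""
    decay = math.exp(-kappa * dt)
    scale = sigma**2 * (1.0 - decay) / (4.0 * kappa)
    dof = 4.0 * kappa * theta / sigma**2
    ncp = nu_s * decay / scale
    n = sample_poisson(0.5 * ncp, rng)
    # Chi-squared with (dof + 2n) degrees of freedom as a scaled gamma draw.
    return scale * 2.0 * rng.gammavariate(0.5 * dof + n, 1.0)

def log_spot_step(x_s, nu_s, nu_t, int_nu, dt, r, q, kappa, theta, sigma, rho, z):
    """Log spot at the end of the step, given int_nu and z ~ N(0, 1)."""
    int_sqrt_nu_dw = (nu_t - nu_s - kappa * theta * dt + kappa * int_nu) / sigma
    mean = x_s + (r - q) * dt - 0.5 * int_nu + rho * int_sqrt_nu_dw
    return mean + math.sqrt((1.0 - rho**2) * int_nu) * z

rng = random.Random(42)
draws = [sample_variance(0.09, 1.0, 2.0, 0.04, 0.5, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
exact_mean = 0.04 + (0.09 - 0.04) * math.exp(-2.0)
```

The sample mean of the variance draws can be compared with the known conditional expectation $\theta + (\nu_s - \theta)e^{-\kappa(t-s)}$ as a quick sanity check.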
[1] I. Clark, Foreign Exchange Option Pricing: A Practitioner's Guide, p. 113
[2] A. Dragulescu, V. Yakovenko, Probability distribution of returns in the Heston model with stochastic volatility
[3] M. Broadie, Ö. Kaya, Exact Simulation of Stochastic Volatility and other Affine Jump Diffusion Processes
[4] R. Lord, Efficient pricing algorithms for exotic derivatives, p. 40
[5] J.R. Culham, Bessel Functions of the First and Second Kind
In his blog Martin Fowler discusses the LMAX architecture, a high throughput retail financial trading platform. A remarkable detail of the architecture is the central business logic processor, which is implemented as a single threaded Java program. The supporting pre- and post-processing is running as a multi threaded application using lock-free ring buffers.
In general lock-free algorithms are implemented using atomic read-modify-write primitives, which are provided by modern CPUs. Probably the most used lock-free algorithm in the Boost library is the reference counting in boost::shared_ptr. On popular hardware platforms these reference count updates are based on atomic increments/decrements using compare-and-swap (CAS) operations. The performance improvements over normal mutex exclusion locks are significant, as the following little experiment shows. The basis for this experiment is a function which increments an integer counter 500 million times in a loop [1]. This loop is carried out as
- a simple single-threaded loop
- a single-threaded loop using atomic increments
- a single-threaded loop acquiring a mutex exclusion lock for every pass
- two threads using atomic increments
- two threads using mutex exclusion locks
A Java and a C++ implementation are available here. The Java code uses the package java.util.concurrent.atomic and the C++ code uses boost::detail::atomic_count to implement the atomic increments. The run times are measured on a Core i3@3GHz.
Lock-free algorithms are difficult to implement and to debug. Tim Blechmann has written a little gem of a library, boost::lockfree (parts of the library are now in the Boost release 1.53.0). Among others this library contains a wait-free single-producer/single-consumer ring buffer. This ring buffer is e.g. tailor-made to separate the Monte-Carlo path generation from the pricing of a derivative. With a few lines of code the path generation can then run in a different thread.
The code available here contains a BufferedDataFactory based on the boost::lockfree::ringbuffer class and a slightly modified version of the QuantLib test case HybridHestonHullWhiteProcessTest::testZeroBondPricing. In this version of the test case the path generation is running in a separate thread and uses the BufferedDataFactory to hand over the data to the pricing thread.
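The hand-over pattern can be sketched in a few lines. The following is not the boost::lockfree ring buffer and not the BufferedDataFactory. It is a hypothetical Python stand-in that only illustrates the index discipline of a single-producer/single-consumer ring buffer: the producer alone writes the tail index, the consumer alone writes the head index. Python threads are serialized by the interpreter lock, so this says nothing about real lock-free performance. The toy path generator and payoff are assumptions as well.

```python
import threading
import time

class SpscRingBuffer:
    """Fixed-size single-producer/single-consumer ring buffer."""

    def __init__(self, capacity):
        self._slots = [None] * (capacity + 1)  # one slot always stays empty
        self._head = 0  # next slot to read, written only by the consumer
        self._tail = 0  # next slot to write, written only by the producer

    def push(self, item):
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False  # buffer is full
        self._slots[self._tail] = item
        self._tail = nxt  # publish only after the slot is filled
        return True

    def pop(self):
        if self._head == self._tail:
            return False, None  # buffer is empty
        item = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        return True, item

def generate_paths(buffer, n_paths):
    """Producer: toy stand-in for a Monte-Carlo path generator."""
    for i in range(n_paths):
        path = [float(i), float(i) + 1.0]
        while not buffer.push(path):
            time.sleep(0)  # yield until the consumer frees a slot

def price(buffer, n_paths):
    """Consumer: averages a toy payoff (last path value) over all paths."""
    total = 0.0
    for _ in range(n_paths):
        ok, path = buffer.pop()
        while not ok:
            time.sleep(0)  # yield until the producer delivers a path
            ok, path = buffer.pop()
        total += path[-1]
    return total / n_paths

def run(n_paths=1000, capacity=16):
    buffer = SpscRingBuffer(capacity)
    producer = threading.Thread(target=generate_paths, args=(buffer, n_paths))
    producer.start()
    result = price(buffer, n_paths)
    producer.join()
    return result
```

Calling `run()` starts the path generation in its own thread while the calling thread does the pricing, which mirrors the split described above.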
[1] Disruptor: High performance alternative to bounded queues for exchanging data between concurrent threads.
For the sake of completeness please find here the code for the evaluation of the virtual power plant (VPP) using a least-squares Monte-Carlo algorithm. The code depends on the latest QuantLib version from the SVN trunk or the upcoming QuantLib 1.2 release. The model and power plant specifications can be found in the previous blog entries. A more general description of the problem and the algorithms can be found e.g. here [1]. Test forward curves can be taken e.g. from the Kyos example download page.
The regression polynomials are of third order in the spark spread and the stochastic component of the gas price.
The regression is carried out for every exercise right (every hour) and every possible VPP state separately. The calibration phase is based on ordinary Monte-Carlo scenarios, whereas the pricing is done using Quasi Monte-Carlo scenarios (Sobol sequence) and a Brownian Bridge (BB).
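As an illustration of such a regression basis, the following sketch enumerates all monomials of two variables up to third order. It assumes that "third order" means a total degree of at most three, which gives ten basis functions. The original code may use a different set, and the function names are hypothetical.

```python
from itertools import product

def polynomial_basis(max_order=3):
    """Exponent pairs (i, j) of all monomials s**i * g**j with i + j <= max_order."""
    return [(i, j) for i, j in product(range(max_order + 1), repeat=2)
            if i + j <= max_order]

def evaluate_basis(spark_spread, gas_component, exponents):
    """Values of all basis monomials for one simulated state."""
    return [spark_spread**i * gas_component**j for i, j in exponents]

exponents = polynomial_basis()
row = evaluate_basis(2.0, 3.0, exponents)  # one row of the regression matrix
```

One such row per path forms the regression matrix for a given exercise date and VPP state.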
The following table summarizes the performance of the different pricing algorithms for the example contract and a maturity of six months. The target accuracy is around 1% relative error in the NPV. The timings are given for a Core i5@3GHz CPU using four threads or a GTX560@0.8/1.6GHz GPU with 336 cores.
I don’t know the reason for the bad performance of the GNU Linear Programming Kit for this kind of problem. Some commercial linear optimizers are much faster, but they cannot compete with dynamic programming for a simple VPP. As soon as e.g. time-integral constraints are involved, linear programming might become the method of choice.
[1] H. van Dijken, The value of starting up the power plant.
Even the pricing of a simple virtual power plant (VPP) is challenging. The main reasons are the high number of possible states of the VPP and the large number of possible exercise dates, because a VPP is often priced as a Bermudan-style option with hourly exercise rights. The implementation effort for an exact pricing engine based on finite difference methods or on least squares Monte-Carlo is comparably large. Monte-Carlo combined with perfect foresight optimization can result in a very good approximation. The algorithm consists of a Monte-Carlo path generator and a dynamic programming optimization part, which calculates the optimal load schedule for each path separately. The stochastic processes involved are outlined in the earlier entries of this series.
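The dynamic programming part can be sketched for a single path as a backward induction over the hours. This is a deliberately simplified, hypothetical version: the plant has only the two states off and on, minimum up-time and down-time are ignored, and the hourly margins (power price minus fuel and emission costs per MWh) are assumed to be precomputed. The real engine has to track more states.

```python
def optimal_schedule_value(margins, p_min, p_max, start_cost):
    """Perfect foresight value of one path by backward dynamic programming.

    margins[t] is the margin per MWh in hour t; the plant starts switched off.
    """
    off, on = 0.0, 0.0  # value-to-go after the last hour for both states
    for margin in reversed(margins):
        run = max(p_min * margin, p_max * margin)  # best load when running
        new_off = max(off, run + on - start_cost)  # stay off or start up
        new_on = max(off, run + on)                # shut down or keep running
        off, on = new_off, new_on
    return off

# Toy path: running through the slightly negative hour avoids a second start.
value = optimal_schedule_value([-5.0, 10.0, 10.0, -1.0, 10.0], 1.0, 2.0, 15.0)
```

Averaging this value over all Monte-Carlo paths gives the perfect foresight estimate. Each path is independent of the others, which is what makes the problem a good fit for a GPU.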
The CUDA based GPU implementation is available here. It depends on the latest QuantLib version from the SVN trunk or the upcoming QuantLib 1.2 release and CUDA 4.0. The corresponding C++ implementation is a speed-optimized version of the test-case VPPTest::testVPPPricing. This version also supports multi-threading. The following hardware was used to compare both implementations:
- CPU: Core i5@3GHz, quad-core
- GPU: GTX560@810/1620MHz, 336 cores
As can be seen in the diagram below, the GPU outperforms the CPU by a factor of roughly 100 for single precision and by a factor of roughly 50 if the GPU is using double precision.
The CUDA implementation consists of the following files:
gpuvpppricingengine.hpp / gpuvpppricingengine.cpp
A QuantLib pricing engine for a simple VPP based on a Monte-Carlo simulation and perfect foresight optimization via dynamic programming. The physical size of the Monte-Carlo simulation is controlled by the following parameters of the constructor
- Size nSimulations: number of Monte-Carlo simulations carried out.
- bool antithetic: enables/disables antithetic sampling
- Size blockSize: number of threads in a CUDA block.
- Size gridSize: number of CUDA blocks that are grouped together in a simulation kernel.
The CUDA implementation consists of two kernels. The first kernel is the Monte-Carlo path generator, which calculates the paths on hourly granularity and stores them in the global memory of the graphics card. The techniques used are outlined e.g. in the references by Howes and Thomas, by Bernemann, Schreyer and Spanderen, and by Joshi listed below. The second kernel performs the optimization of the load schedule based on dynamic programming. The memory layout of this step depends on the number of possible states of the VPP because every possible state is stored in the shared memory of the GPU. The number of states follows from the minimum up-time and the minimum down-time of the plant. CUDA does not support efficient dynamic shared memory allocation. Therefore the sizes of all shared memory arrays must be given at compile time. To allow an optimal use of the limited shared memory capacity, different kernels for different numbers of states are generated using X-macros and the appropriate kernel is chosen at runtime.
defines basic CUDA types, especially the typedef for the type “real” can be used to compile the code either for single or double precision.
C++ interface for a GPU random number generator
gpucurand.hpp / gpucurand.cpp / gpucurand_kernel.hpp / gpucurand_kernel.cu
implementation of the GPURand interface based on the CURAND library, which is part of CUDA 4.0.
 L. Howes, D. Thomas, Efficient Random Number Generation and Application Using CUDA.
 A. Bernemann, R. Schreyer, K. Spanderen, Accelerating Exotic Option Pricing and Model Calibration Using GPUs
 M. Joshi, Graphical Asian Options
The advantages of payoff scripting based on a built-in interpreter or “on-the-fly compiler” instead of implementing the payoffs in C++ are obvious: faster time-to-market, because there is no need to recompile and deploy the C++ pricing library, and people without deep C++ knowledge are able to develop and test new structured products. One disadvantage is often the execution speed of the chosen scripting language. Examples of languages I have seen/used for payoff scripting are Python (C++ interface boost::python), Lua, tinycc and CINT. When it comes to execution speed none of these is suited to build high performance solutions. This is especially true if the Monte-Carlo scenario generator is running on a GPU. The payoff scripting on the CPU can then easily become the bottleneck of your pricing library.
Scala is a modern programming language that integrates object-oriented and functional language features. The Scala compiler generates byte code for the Java VM. Therefore the execution speed of a Scala script is comparable with Java and roughly a factor of two slower than C++.
The Scala compiler itself is a Scala object and can be used at runtime to compile and link new scripts or classes. In addition using JNI it is fairly easy to attach a Java VM to a C++ process and to exchange data between C++ and Scala. Also Scala offers a lot of features to design an “internal”, user-friendly domain specific language (DSL) for payoff scripting.
The code for a small QuantLib/Scala Monte-Carlo simulation in action is available here. It depends on QuantLib 1.0 or higher, a Java 1.6 VM and Scala 2.8/9. Overwrite the PayoffImpl.scala class to implement different payoffs without recompiling the C++ code.
Even though perfect foresight provides only an upper bound to the real VPP value, the differences are often negligible and the implementation effort is small compared with “exact” pricing based on finite difference methods or least squares Monte-Carlo. Perfect foresight is the method of choice in conjunction with a linear programming optimizer if the problem contains time-integral constraints. Therefore it is worth testing the efficiency of two standard variance reduction techniques, namely antithetic sampling and quasi Monte-Carlo (QMC) together with a Brownian Bridge. Both methods are explained in [1], antithetic sampling in chapter 4.2 and quasi Monte-Carlo in chapter 5. Randomized QMC is used to calculate the error estimates for QMC as outlined in chapter 5.4.
Using the parameterization of the previous section VPP Pricing III, QMC in conjunction with a Brownian Bridge clearly outperforms the other algorithms for a six-month contract, as can be seen in the diagrams below. The code is available here. It depends on the latest QuantLib version from the SVN trunk or the upcoming QuantLib 1.2 release. If you want to generate the plots you’ll also need R.
[1] P. Glasserman, Monte Carlo Methods in Financial Engineering. ISBN 0387004513
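Antithetic sampling itself is easy to illustrate outside the VPP context. The sketch below uses a hypothetical toy payoff (an undiscounted at-the-money call on a lognormal variable with 20% volatility) rather than the VPP. It compares averages over two independent draws with averages over a draw and its mirror image, so that both estimators use the same number of payoff evaluations.

```python
import math
import random
from statistics import mean, variance

def payoff(z):
    """Toy payoff: at-the-money call on a lognormal variable with forward 100."""
    return max(100.0 * math.exp(0.2 * z - 0.02) - 100.0, 0.0)

def plain_pairs(f, n_pairs, rng):
    """Averages over two independent standard normal draws."""
    return [0.5 * (f(rng.gauss(0.0, 1.0)) + f(rng.gauss(0.0, 1.0)))
            for _ in range(n_pairs)]

def antithetic_pairs(f, n_pairs, rng):
    """Averages over a draw z and its mirror image -z."""
    samples = []
    for _ in range(n_pairs):
        z = rng.gauss(0.0, 1.0)
        samples.append(0.5 * (f(z) + f(-z)))
    return samples

rng = random.Random(1234)
plain = plain_pairs(payoff, 20000, rng)
anti = antithetic_pairs(payoff, 20000, rng)
print(mean(plain), variance(plain))
print(mean(anti), variance(anti))
```

For a monotone payoff the two halves of an antithetic pair are negatively correlated, which is the source of the variance reduction. How much is gained for the VPP depends on the payoff structure and has to be measured.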
The next two steps are defining a simple VPP contract (or a simplified gas-run power plant) and setting up a mixed integer linear programming optimization (MIP) to calculate the intrinsic value and an upper bound for the extrinsic value based on a Monte-Carlo simulation and assuming perfect foresight. The third step outlined in the next part will then be the “exact” pricing of the extrinsic value using dynamic programming and finite difference methods.
The set-up of the simplified gas-run power plant is similar to the one explained in chapter 4.2.3 of the textbook [1]. In general the power plant has three power output levels:
- Plant is off
- Generation at minimum load
- Generation at maximum load
The power plant has a fixed efficiency rate.
Ramp rates will be neglected, but the power plant has a minimum up-time and a minimum down-time. The start-up costs are given by a fixed start-up cost (in €) and the price of the gas needed to produce the start-up heat (in MWh).
The mixed integer linear optimization is running in one-hour blocks and is using three decision variables per hour. The binary on/off decision variable is true if the power plant is running at minimum load or at maximum load and is false if the plant is off. The real start-up decision variable is equal to one if the plant is started in the given hour, which is implied by a constraint linking it to the on/off variables of the current and the previous hour.
The minimum up-time and the minimum down-time are a consequence of further linear constraints on these variables.
The real maximum-load decision variable is equal to one if the power plant is running at maximum load and zero if the power plant is either running at minimum load or if the plant is off, that means it is bounded from above by the on/off variable.
Given the power price, the gas price and the carbon dioxide price at every hour, the objective function is then given by the sum over all hours of the power revenues minus the gas, carbon dioxide and start-up costs.
For a one-year span the problem consists of $3 \times 8760 = 26280$ decision variables and the corresponding constraints. This comparably small problem can be solved using e.g. the GNU Linear Programming Kit (GLPK).
The model parameters and the example forward curves are outlined in the previous entry VPP Pricing I. The diagram below shows the intrinsic value and the upper bound for the total value (intrinsic plus extrinsic value) based on Monte-Carlo, perfect foresight and MIP for different power plant efficiencies. The parameters of the VPP contract are given by
It is now quite easy to add and price time-integral constraints, e.g. the following constraint restricts the number of starts within a year to be less than or equal to a given number
The following diagram shows the results for a restricted number of starts and a given minimum load.
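The logic behind the start-up, minimum up-time/down-time and start-count constraints can be checked on a given on/off schedule with a few lines of code. This is only a hypothetical feasibility checker and not the MIP formulation itself. It assumes that the plant is off before the first hour and that blocks cut off by the start or the end of the horizon are not tested against the minimum times.

```python
def count_starts(on):
    """Number of starts: sum of max(0, on[t] - on[t-1]), plant off before hour 0."""
    previous, starts = 0, 0
    for state in on:
        starts += max(0, state - previous)
        previous = state
    return starts

def run_lengths(on):
    """Consecutive blocks of equal state as (state, length) pairs."""
    blocks = []
    for state in on:
        if blocks and blocks[-1][0] == state:
            blocks[-1][1] += 1
        else:
            blocks.append([state, 1])
    return [(state, length) for state, length in blocks]

def is_feasible(on, t_min_up, t_min_down, max_starts):
    """Check minimum up/down times of inner blocks and the start-count limit."""
    for state, length in run_lengths(on)[1:-1]:
        if length < (t_min_up if state == 1 else t_min_down):
            return False
    return count_starts(on) <= max_starts

schedule = [0, 1, 1, 0, 0, 1]  # toy hourly on/off schedule
feasible = is_feasible(schedule, t_min_up=2, t_min_down=2, max_starts=2)
```

In the MIP the same conditions are expressed as linear inequalities so that the solver can search over all schedules instead of testing a single one.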
[1] M. Burger, B. Graeber, G. Schindlmayr, Managing Energy Risk, ISBN 978-0-470-02962-6