Using Rcpp and QuantLib to build new R packages is straightforward on Unix/Linux. More steps are needed to achieve the same on Windows. Please find here a short manual, taking the RHestonSLV package as an example.
Update 21.02.2016: Added values for QR decomposition with pivoting and QuantLib performance improvements.
Least Squares Monte Carlo simulations spend a significant amount of the total computation time on the generalized least squares problem, especially if the problem itself has a high dimensional state. Preferred techniques to solve the normal equations are the QR decomposition

$$A = QR,$$

where $Q$ is orthogonal and $R$ is upper triangular, or the singular value decomposition (SVD)

$$A = U\Sigma V^\top,$$

where $U$ is column-orthogonal, $V$ is an orthogonal matrix and $\Sigma$ is a positive semi-definite diagonal matrix. The Cholesky factorization

$$A^\top A = LL^\top,$$

where $L$ is lower triangular, is the fastest method but often numerically unstable. The author in [1] summarizes all methods and also outlines the computational effort involved for $n$ observations and $m$ parameters as
- Cholesky factorization: costs $nm^2 + m^3/3$ flops
- QR decomposition: costs $2nm^2 - \tfrac{2}{3}m^3$ flops
- Singular value decomposition: costs $2nm^2 + 11m^3$ flops
For LSMC simulations we have $n \gg m$ and therefore the QR decomposition has no computational advantages over the SVD. Since a QR decomposition without column pivoting runs into numerical problems if $A$ is rank-deficient, the singular value decomposition is often the method of choice for LSMC simulations.
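As a small illustration, the sketch below solves a toy least squares problem with QuantLib's matrix utilities, once via `qrSolve` from `ql/math/matrixutilities/qrdecomposition.hpp` and once via the `SVD` class, where the solution $x = V\Sigma^{-1}U^\top b$ is assembled by hand and small singular values are truncated to cope with (near) rank deficiency. The data points are made up for illustration.

```cpp
#include <ql/math/matrix.hpp>
#include <ql/math/matrixutilities/qrdecomposition.hpp>
#include <ql/math/matrixutilities/svd.hpp>
#include <iostream>

using namespace QuantLib;

int main() {
    // n=4 observations, m=2 parameters: fit y = c0 + c1*x
    Matrix A(4, 2);
    Array b(4);
    const Real xs[] = {0.0, 1.0, 2.0, 3.0};
    const Real ys[] = {1.1, 2.9, 5.2, 6.8};
    for (Size i = 0; i < 4; ++i) {
        A[i][0] = 1.0; A[i][1] = xs[i];
        b[i] = ys[i];
    }

    // QR decomposition with column pivoting
    const Array xQR = qrSolve(A, b, true);

    // SVD: x = V * diag(1/s_i) * U^T * b, small singular values truncated
    const SVD svd(A);
    const Array& s = svd.singularValues();
    const Array utb = transpose(svd.U()) * b;
    Array w(s.size(), 0.0);
    for (Size i = 0; i < s.size(); ++i)
        if (s[i] > 1e-12*s[0]) w[i] = utb[i]/s[i];
    const Array xSVD = svd.V() * w;

    std::cout << "QR : " << xQR << "\nSVD: " << xSVD << std::endl;
    return 0;
}
```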
All three decomposition methods are available in QuantLib, in LAPACK (with or without the optimized OpenBLAS library) and in Intel's MKL library. A standard Swing option valuation via LSMC should serve as a test bed to measure the performance of the QR decomposition with and without column pivoting and of the SVD algorithm. For LAPACK and MKL, the routines dgesvd and dgels have been used to implement the SVD and the QR decomposition without pivoting respectively, whereas QR with pivoting is based on dgeqp3, dormqr and dtrsm. In order to keep the results comparable, single threaded performance was measured in all cases. The reference prices are calculated via finite difference methods and all LSMC implementations have led to the same prices, in line with the reference prices. The current QR implementation in QuantLib 1.7 has a performance issue if the number of rows is much larger than the number of columns. For these tests an improved version of QuantLib's QR decomposition has been used.
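For reference, a minimal call of the QR based least squares driver dgels via the LAPACKE C interface might look like the sketch below; the matrix content is made up for illustration.

```cpp
#include <lapacke.h>
#include <iostream>

int main() {
    // Solve min ||A x - b||_2 for a 4x2 column-major matrix A via
    // QR decomposition without pivoting (LAPACK driver dgels).
    const lapack_int m = 4, n = 2, nrhs = 1;
    double A[] = {1.0, 1.0, 1.0, 1.0,    // first column
                  0.0, 1.0, 2.0, 3.0};   // second column
    double b[] = {1.1, 2.9, 5.2, 6.8};   // overwritten with the solution

    const lapack_int info = LAPACKE_dgels(
        LAPACK_COL_MAJOR, 'N', m, n, nrhs, A, m, b, m);

    if (info == 0)
        std::cout << "x = (" << b[0] << ", " << b[1] << ")" << std::endl;
    return info;
}
```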
As expected MKL is often the fastest library, but the difference between MKL and LAPACK plus OpenBLAS is small.
[1] Do Q Lee, 2012, Numerically Efficient Methods for Solving Least Squares Problems
Solving the Fokker-Planck equation via finite difference methods is not the only way to calibrate the Heston stochastic local volatility model

$$\begin{aligned}
dS_t &= (r_t - q_t)S_t\,dt + L(S_t,t)\sqrt{\nu_t}\,S_t\,dW_t^S \\
d\nu_t &= \kappa(\theta - \nu_t)\,dt + \sigma\sqrt{\nu_t}\,dW_t^\nu \\
\rho\,dt &= dW_t^S\,dW_t^\nu.
\end{aligned}$$
The basic equation to calibrate the leverage function $L(S_t,t)$ for a local volatility surface $\sigma_{LV}(S_t,t)$ and a set of Heston parameters is given by

$$L(S_t,t) = \frac{\sigma_{LV}(S_t,t)}{\sqrt{E[\nu_t \mid S_t]}}.$$
The key problem here is to calculate the expectation value $E[\nu_t \mid S_t]$. This can either be done via the Fokker-Planck equation as outlined in [3] and the references therein, or via Monte-Carlo simulations as shown in [2]. Getting the solution of the Fokker-Planck equation via finite difference methods is anything but easy if the Feller constraint is violated. In this case the Monte-Carlo approach promises fewer numerical problems to overcome. The starting point for an efficient Monte-Carlo calibration is a fast and accurate simulation scheme for a stochastic local volatility (SLV) model. The variance part of the SLV model can be sampled exactly using the non-central $\chi^2$ distribution.
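A minimal sketch of this exact sampling step, using the non-central $\chi^2$ transition density of the square root process (the function name and parameter list are mine, not part of any library):

```cpp
#include <boost/math/distributions/non_central_chi_squared.hpp>
#include <cmath>
#include <random>

// Exact sampling of the square root process
//   d nu_t = kappa*(theta - nu_t) dt + sigma*sqrt(nu_t) dW_t
// via its non-central chi-squared transition density.
double exactVarianceStep(double nu, double kappa, double theta, double sigma,
                         double dt, std::mt19937& rng) {
    const double c   = sigma*sigma*(1.0 - std::exp(-kappa*dt))/(4.0*kappa);
    const double df  = 4.0*kappa*theta/(sigma*sigma);  // degrees of freedom
    const double ncp = nu*std::exp(-kappa*dt)/c;       // non-centrality

    const double u = std::uniform_real_distribution<double>(0.0, 1.0)(rng);
    const boost::math::non_central_chi_squared_distribution<double> dist(df, ncp);
    return c*quantile(dist, u);  // exact, but the inverse CDF is slow
}
```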
The boost library supports the inverse of the cumulative non-central $\chi^2$ distribution but the calculation itself is rather slow. The Heston QE scheme outlined in [1] comes with a fast and accurate sampling scheme for the square root process. With

$$m = \theta + (\nu_t - \theta)e^{-\kappa\Delta t},\qquad s^2 = \frac{\nu_t\sigma^2 e^{-\kappa\Delta t}}{\kappa}\left(1 - e^{-\kappa\Delta t}\right) + \frac{\theta\sigma^2}{2\kappa}\left(1 - e^{-\kappa\Delta t}\right)^2,\qquad \psi = \frac{s^2}{m^2},$$

in the case $\psi \le \psi_c$ it reads

$$\nu_{t+\Delta t} = a(b + Z)^2,\qquad b^2 = \frac{2}{\psi} - 1 + \sqrt{\frac{2}{\psi}}\sqrt{\frac{2}{\psi} - 1},\qquad a = \frac{m}{1+b^2},\qquad Z \sim N(0,1),$$

whereas for $\psi > \psi_c$ the exponential branch of the QE scheme is used.
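Put together, one time step of this variance sampling might look like the following sketch, following Andersen's QE scheme with the customary critical value $\psi_c = 1.5$ (the function name and interface are illustrative):

```cpp
#include <cmath>
#include <random>

// One time step for the square root process
//   d nu_t = kappa*(theta - nu_t) dt + sigma*sqrt(nu_t) dW_t
// following Andersen's quadratic-exponential (QE) scheme.
double qeVarianceStep(double nu, double kappa, double theta, double sigma,
                      double dt, std::mt19937& rng) {
    const double psiC = 1.5;  // customary critical value
    const double e = std::exp(-kappa*dt);

    const double m  = theta + (nu - theta)*e;           // conditional mean
    const double s2 = nu*sigma*sigma*e*(1.0 - e)/kappa  // conditional variance
                    + 0.5*theta*sigma*sigma*(1.0 - e)*(1.0 - e)/kappa;
    const double psi = s2/(m*m);

    if (psi <= psiC) {
        // quadratic branch: nu' = a*(b + Z)^2
        const double b = std::sqrt(2.0/psi - 1.0
                       + std::sqrt(2.0/psi)*std::sqrt(2.0/psi - 1.0));
        const double a = m/(1.0 + b*b);
        const double z = std::normal_distribution<double>(0.0, 1.0)(rng);
        return a*(b + z)*(b + z);
    } else {
        // exponential branch: probability mass at zero plus exponential tail
        const double p = (psi - 1.0)/(psi + 1.0);
        const double beta = (1.0 - p)/m;
        const double u = std::uniform_real_distribution<double>(0.0, 1.0)(rng);
        return (u <= p) ? 0.0 : std::log((1.0 - p)/(1.0 - u))/beta;
    }
}
```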
The authors in [2] suggest a sampling scheme for $x_t = \ln S_t$. The scheme used here stays a bit closer to the original QE scheme and, without a rigorous proof, it has shown slightly smaller discretisation errors for the examples.
To get from the Monte-Carlo paths to the leverage function, the authors in [2] use a binning technique in every time step: the conditional expectation $E[\nu_t \mid S_t]$ is approximated by averaging the variance over all paths whose spot values fall into the same bin. QuantLib supports two ways to generate the random draws, either based on i.i.d. normal variables or based on Sobol quasi Monte-Carlo simulations with Brownian bridges.
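A minimal sketch of one such binning step, assuming the paths at a fixed time are given as arrays of spot and variance values (all names are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Estimate E[nu_t | S_t] at one time step by sorting the paths by their
// spot value and averaging the variance within equally sized bins. The
// leverage function then follows as L = sigma_LV / sqrt(E[nu_t | S_t]).
std::vector<double> conditionalVariance(const std::vector<double>& spots,
                                        const std::vector<double>& vars,
                                        std::size_t nBins) {
    std::vector<std::size_t> idx(spots.size());
    std::iota(idx.begin(), idx.end(), std::size_t(0));
    std::sort(idx.begin(), idx.end(),
              [&](std::size_t a, std::size_t b) { return spots[a] < spots[b]; });

    const std::size_t binSize = idx.size()/nBins;
    std::vector<double> avgVar(nBins, 0.0);
    for (std::size_t b = 0; b < nBins; ++b) {
        for (std::size_t j = 0; j < binSize; ++j)
            avgVar[b] += vars[idx[b*binSize + j]];
        avgVar[b] /= binSize;  // E[nu_t | S_t in bin b]
    }
    return avgVar;
}
```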
Test case definition: the Heston parameters are chosen such that the Feller constraint $2\kappa\theta \ge \sigma^2$ is violated.
Effectively the leverage function will remove the skew and term structure introduced by the Heston model and shift the volatility to 30%. In order to measure the calibration quality, the price mismatches expressed in volatility basis point differences are calculated for OTM calls and puts with prices above 0.02 and for maturities from one month up to two years. The SLV prices are calculated by solving the Feynman-Kac backward equation using finite difference methods.
The diagram above shows the average pricing error for different numbers of Monte-Carlo calibration paths and for different bin sizes. As expected, the optimal bin size depends on the number of Monte-Carlo simulations. Especially deep OTM options benefit from a large number of bins. The usage of Sobol quasi Monte-Carlo simulations with Brownian bridges does not give a significant advantage.
The average pricing error decreases with the square root of the number of Monte-Carlo calibration paths if one always chooses the optimal bin size, as shown in the diagram below. Quasi Monte-Carlo path generation with Brownian bridges gives a small advantage.
As expected, the leverage function itself exhibits some Monte-Carlo noise, see the diagram below. The corresponding leverage function based on the finite difference method is smooth, but the overall calibration error is of the same size.
The calibration for different mixing factors is straightforward. The diagram below e.g. shows the prices of a double no touch option for different distances of the (symmetric) lower and upper barriers, expressed in terms of the corresponding Black-Scholes prices. This diagram type is derived from figure 8.8 in [4].
The source code is available on GitHub as part of the test suite HestonSLVModelTest.
[1] Leif Andersen, Efficient Simulation of the Heston Stochastic Volatility Model
[2] Anthonie Van der Stoep, Lech Grzelak, Cornelis Oosterlee, The Heston Stochastic-Local Volatility Model: Efficient Monte Carlo Simulation
[3] Johannes Goettker-Schnetmann, Klaus Spanderen, Calibrating the Heston Stochastic Local Volatility Model using the Fokker-Planck Equation
[4] Iain J. Clark, Foreign Exchange Option Pricing: A Practitioner's Guide
Update 10.03.2016: Added performance results of the latest QuantLib test-suite build on a 32 core SMP machine using boost interprocess instead of MPI.
Running QuantLib's test suite on a recent computer takes around 10 minutes. The situation would improve a lot if the test runner utilised more than one core of today's multi-core CPUs to run the tests in parallel. Unfortunately multi-threading won't work because the boost unit test framework is not thread safe. A reasonable way forward to speed up the test suite is via multiprocessing, using message passing between the compute nodes based on the master-slave paradigm. The cross platform standard MPI together with boost::mpi is tailor-made for this task.
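The master-slave pattern with boost::mpi boils down to a few lines. The sketch below distributes made-up work items (here: test names) from rank 0 to the worker ranks and is only meant to illustrate the message flow, not the actual test runner.

```cpp
#include <boost/mpi.hpp>
#include <iostream>
#include <string>
#include <vector>

namespace mpi = boost::mpi;

// tags for the simple master-slave protocol
enum { tagWork = 0, tagResult = 1, tagDone = 2 };

int main(int argc, char* argv[]) {
    mpi::environment env(argc, argv);
    mpi::communicator world;

    if (world.rank() == 0) {
        // master: hand out work items (longest jobs first in the real runner)
        std::vector<std::string> jobs = {"testA", "testB", "testC"};
        int active = world.size() - 1;
        for (int r = 1; r < world.size(); ++r) {
            if (!jobs.empty()) {
                world.send(r, tagWork, jobs.back());
                jobs.pop_back();
            } else {
                world.send(r, tagDone, std::string());
                --active;
            }
        }
        while (active > 0) {
            std::string result;
            const mpi::status s = world.recv(mpi::any_source, tagResult, result);
            std::cout << result << std::endl;
            if (!jobs.empty()) {
                world.send(s.source(), tagWork, jobs.back());
                jobs.pop_back();
            } else {
                world.send(s.source(), tagDone, std::string());
                --active;
            }
        }
    } else {
        // worker: receive test names until the master signals completion
        for (;;) {
            std::string job;
            const mpi::status s = world.recv(0, mpi::any_tag, job);
            if (s.tag() == tagDone) break;
            world.send(0, tagResult, job + " passed");  // run the test here
        }
    }
    return 0;
}
```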
The missing piece is an external, parallel unit test runner, which uses MPI for load balancing and to collect the test results. The runner for QuantLib's test suite needs boost version 1.59 or higher and can be found here. Please replace in quantlibtestsuite.cpp line 24 the include of the boost unit test framework header by the header of the parallel test runner
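The swap presumably looks as follows, assuming the standard boost test include sits on that line and `paralleltestrunner.hpp` is the name of the runner's header file:

```cpp
// quantlibtestsuite.cpp, line 24: replace (assumed original include)
#include <boost/test/unit_test.hpp>
// by the header of the MPI parallel test runner
#include "paralleltestrunner.hpp"
```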
and do not link the executable with libboost_unit_test_framework.so because the new header file includes the header only version of the boost test framework (thanks to Peter Caspers for the hint). Load balancing is a crucial point for the overall speed-up because QuantLib's test suite has a handful of long running tests (max. around 90 seconds), which should be scheduled first. Therefore the MPI test runner collects the statistics of every unit test's runtime and stores these in a local file to plan the schedule of the next runs.
The diagram above shows the runtime of QuantLib's test suite for different numbers of parallel processes and CPU configurations. The minimum runtime is set by the longest running test case, which is around 50 seconds. On a single CPU the performance scales pretty linearly with the number of cores being utilised, and hyper-threading cores also help. Using more than sixteen real cores does not improve the situation any further because the overall runtime is already dominated by the longest running test case.