Almost exact SABR Interpolation using Neural Networks and Gradient Boosted Trees

Update 03-11-2019: Added arbitrage free SABR calibration based on neural networks.

Very efficient approximations exist for the SABR model

\begin{array}{rcl} \displaystyle dF_t &=& \alpha_t F_t^\beta dW_t \\ \nonumber d\alpha_t &=& \nu \alpha_t dZ_t \\ \nonumber \rho dt &=& <dW_t, dZ_t> \end{array}

like the original Hagan et. al. formula [1] or variants of it [2] but these analytic formulas are in general not arbitrage free. Solving the corresponding partial differential equation leads to an arbitrage free solution

\displaystyle \frac{\partial u}{\partial t} - \frac{\nu^2}{2}\frac{\partial u}{\partial x} + \frac{1}{2}e^{2x}F_t^{2\beta}\frac{\partial^2 u}{\partial F^2}+\frac{\nu^2}{2}\frac{\partial^2 u}{\partial x^2}+\rho\nu e^{x_t}F_t^\beta\frac{\partial^2 u}{\partial F \partial x} -ru = 0

but is computationally demanding. The basic idea here is to use a neural network or gradient boosted trees to interpolate (predict) the difference between the analytic approximation and the exact result from the partial differential equation for a large variate of model parameters.

First step is to reduce the number of dimensions of the parameter space \{F_0,  \alpha, \beta, \nu, \rho \} by utilizing the scaling symmetry of the SABR model [3]

\begin{array}{rcl} \displaystyle F_t &\rightarrow& \lambda F_t \\ \nonumber \alpha_t &\rightarrow& \lambda^{1-\beta}\alpha_t. \end{array}

so that we can focus on the case F_0=1.0 without lose of generality

\displaystyle  \sigma_{BS}\left( K, F_0, T, \alpha, \beta, \nu, \rho \right) = \sigma_{BS}\left(\lambda K, \lambda F_0, T, \lambda^{1-\beta}\alpha, \beta, \nu, \rho \right).

This in turns also limits the “natural” parameter space for \{ \alpha, \beta, \nu \} which will be set to

\displaystyle \alpha \in [0, 1], \ \beta \in [0, 1], \ \nu \in [0, 1].

Next on the list is to set-up an efficient PDE solver to prepare the training data. The QuantLib solver supports already the two standard error reduction techniques, namely adaptive grid refinement around important points and cell averaging around special points of the payoff. The latter one ensure a smooth second order convergence in spatial direction [4]. The Hundsdorfer-Viewer ADI scheme is also of second order in the time direction and additional Rannacher smoothing steps at the beginning will ensure a smooth convergence in the time direction as well [5].  Hence the Richardson extrapolation can be used to improve the convergence order of the overall algorithm. An example pricing for

F_0=1.0, K=0.466, T=0.6, \alpha = 0.825, \beta=0.299, \nu=0.410, \rho=-0.166

is shown in the diagram below to demonstrate the efficiency of the Richardson extrapolation. The original grid size for scaling factor 1.0 is \{ T, S, v \} = \{20, 200, 25\}.

The training data was generated by a five dimensional quasi Monte-Carlo Sobol sequence for the parameter ranges

\displaystyle \alpha \in [0, 1], \ \beta \in [0, 1], \ \rho \in [-1, 1], \ \nu \in [0, 1], T\in [\frac{1}{12}, 1].

The strikes are equally distributed between the 1\% and 99\% quantile of the risk neutral density distribution w.r.t to the ATM volatility of the SABR model. The PDE solver will not only calculate the fair value for F_0=1.0 but for a range of spot values around F_0. Using the scaling symmetry of the SABR model this can be utilized to calculate more prices with new K'=\lambda K and \alpha'=\lambda^{1-\beta}\alpha values for F_0=1.0.

The training set includes 617K samples values.  The network is trained to fit the difference between the correct SABR volatility from the solution of the partial differential equation and the Floc’h-Kennedy approximation. It does not need a large neural network to interpolate the parameter space, e.g. the following Tensorflow/Keras model definition with 46K parameters has been used in the examples below

model = Sequential()
model.add(Dense(20, activation='linear', input_shape=(7, )))
model.add(Dense(100, activation='linear'))
model.add(Dense(400, activation='sigmoid'))
model.add(Dense(10, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mae', optimizer='adam')

As always it is important for the predictive power of the neural network to normalize the input data e.g. by using sklearn.preprocessing.MinMaxScaler. The out-of-sample mean absolute error of the neural network is around 0.00025 in annualized volatility, far better than the Kennedy-Floc’h or Hagan et al approximation.sabr_vol_dist

The diagram below shows the difference between the correct volatility and the different approximations using the parameter example from the previous post. sabr_volsOne could also used gradient tree boosting algorithms like XGBoost or LightGBM. For example the models

xgb_model = xgb.XGBRegressor(nthread=-1,
                             eval_metric ="mae")

gbm_model = lgb.train({'objective': 'mae',
                       'num_leaves': 500 }
                      lgb.Dataset(train_X, train_Y),

result in similar out-of-sample mean absolute errors of 0.00030 for XGBoost and 0.00035 for LightGBM. On the first glance the interpolation looks smooth as can be seen in the diagram below using the same SABR model parameters, but zooming into it exposes non differentiable points, which defeats the object of stable greeks.

xgboost.pngThe average run time for the different approximations is shown in the tabular below.

\begin{tabular}{|l|c|c|} \hline \textbf{Algorithm} & \textbf{Run Time}  \\ \hline Hagan et al & 0.16us  \\ \hline Floc'h-Kennedy & 1.86us  \\ \hline Tensorflow DNN & 12.39us  \\ \hline XGBoost & 4.61us \\ \hline LightGBM & 21.84 us \\ \hline PDE \& Richardson & 1.23s \\ \hline\end{tabular}

With this highly efficient pricing routines calibration of the full SABR model can be done in a fraction of a second. To test this approach several Heston parameter configurations have been used to calculated the implied volatility of 15 benchmark options for a single expiry. The full SABR model has been calibrated against these volatility sets with help of a standard Levenberg-Marquardt optimizer by either using the PDE pricer or the neural network pricer. As expected the neural network calibration routine has only taken 0.2 seconds but the PDE calibration has taken over half an hour on average.

[1] P. Hagan, D. Kumar, A. Lesnieski, D. Woodward: Managing Smile Risk.
[2] F. Le Floc’h, G. Kennedy: Explicit SABR Calibration through Simple Expansions.
[3] H. Park: Efficient valuation method for the SABR model.
[4] K. in’t Hout: Numerical Partial Differential Equations in Finance explained.
[5] K. in’t Hout, M. Wyns: Convergence of the Hundsdorfer–Verwer scheme for two-dimensional convection-diffusion equations with mixed derivative term