Volatility Forecasts (Part 2 - XGBSTES Algorithm 1)

2024-07-18

(Code snippets for this post can be found in the Github repository.)

1 Recap

In the previous post, we replicated the Smooth Transition Exponential Smoothing (STES) model from (Taylor 2004) and (Liu, Taylor, Choo 2020). In a small replication study on SPY returns, STES delivered modest improvements in out-of-sample RMSE relative to simple Exponential Smoothing (ES). Here, we ask a natural follow-up: if we keep the same feature set X_t (lag returns, lag squared returns, and lag absolute returns), can a more flexible gate improve the forecast further?

This post is motivated by (Coulombe 2020), which shows how tree-based models can be used to predict time-varying coefficients in otherwise classical econometric settings. Many traditional models accommodate regimes and nonlinearity via Markov switching, time-varying parameters, or thresholds; Coulombe’s key point is that decision trees can often approximate these regime mappings directly. The paper considers a regression of the form:

\begin{equation} \begin{aligned} y_t &= X_t \beta_t + \epsilon_t \\ \beta_t &= \mathcal{F}(X_t) \\ \end{aligned} \end{equation}

where \mathcal{F} is the customized random forest that maps the features X_t to the coefficients \beta_t. The customized tree model uses a bagging technique appropriate for time-series data and adds temporal smoothness regularization to the predicted \beta_t.

2 XGBSTES

This paper got me interested in replacing the linear index inside the sigmoid function of STES with a tree ensemble. Recall that STES has the following form:

\begin{equation} \begin{aligned} \sigma_t^2 &= \alpha_{t-1} r_{t-1}^2 + (1-\alpha_{t-1})\widehat{\sigma}_{t-1}^2 \\ \alpha_{t-1} &= \mathrm{sigmoid}\left(X_{t-1}^\top\beta\right) \\ \end{aligned} \end{equation}

By replacing the linear index X^\top\beta with a tree ensemble, we can perhaps better separate the feature space into regions where recent realized volatility plays a more critical role in forecasting future realized volatility and regions where long-term historical realized volatility is the better forecast. Volatility time series exhibit non-linear behaviour, and several existing models already use modern machine-learning architectures (such as the Neural Network Heterogeneous Autoregressive Model).

The XGBSTES model is formulated as

\begin{equation} \begin{aligned} \sigma_t^2 &= \alpha_{t-1} r_{t-1}^2 + (1-\alpha_{t-1})\widehat{\sigma}_{t-1}^2 \\ \alpha_{t-1} &= \mathrm{sigmoid}\left(\mathcal{F}(X_{t-1})\right) \\ \end{aligned} \end{equation}

We fit the model by minimizing mean squared error (MSE) between the forecasted variance \sigma_t^2 and realized squared returns.

\mathtt{loss} = \frac{1}{T}\sum_{t=1}^{T}\left(r_t^2 - \widehat{\sigma}_t^2\right)^2

For each time step t, the forecasted variance \sigma_t^2 is a weighted combination of the previous period’s squared return r_{t-1}^2 and the previous forecast \widehat{\sigma}_{t-1}^2. The weight \alpha_{t-1} is produced by an XGBoost model applied to the transition variable X_{t-1} and passed through a sigmoid so it lies in (0,1). In other words, we do not predict next-day variance directly; instead, we predict a time-varying update weight and then generate the variance forecast recursively over contiguous blocks of the feature matrix.
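To make the recursion concrete, here is a minimal sketch (the function name, argument layout, and toy inputs are illustrative, not the repository code):

```python
import numpy as np
from scipy.special import expit  # sigmoid

def xgbstes_forecast(scores, r2_prev, v0):
    """Run the variance recursion given raw gate scores.

    scores[t]  : tree-ensemble output F(X_{t-1}) for step t
    r2_prev[t] : previous squared return r_{t-1}^2
    v0         : initial variance level
    """
    alpha = expit(scores)  # gate weights in (0, 1)
    v = np.empty_like(alpha)
    prev = v0
    for t in range(len(alpha)):
        # weighted update between last squared return and last forecast
        v[t] = alpha[t] * r2_prev[t] + (1 - alpha[t]) * prev
        prev = v[t]
    return v
```

Pushing the raw scores to large positive values saturates the gate at 1, so the forecast tracks r_{t-1}^2 exactly; large negative scores pin it to the initial level.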

Below we discuss how it can be implemented.

3 Implementation

3.1 Naive Implementation

When we first implemented XGBSTES, we treated it as a direct drop-in replacement for the linear gate in STES:

\widehat{\sigma}_t^2=\alpha_{t-1} r_{t-1}^2+(1-\alpha_{t-1})\widehat{\sigma}_{t-1}^2, \qquad \alpha_{t-1}=\mathrm{expit}(f_{t-1}), \qquad f_{t-1}=\mathcal{F}(X_{t-1}),

and we trained the tree model \mathcal{F} by minimizing the same MSE objective used in Part 1:

L=\sum_{t=1}^T\left(y_t-\widehat{\sigma}_t^2\right)^2,\qquad y_t=r_t^2

This approach is tempting because it requires no new machinery: train the gate to minimize the same forecasting loss used for STES, and change nothing else.

However, this implementation is wrong.

It is wrong because there is a structural mismatch between (i) the STES recursion and (ii) how gradient-boosted trees are optimized.

  1. The model’s cumulative error is path-dependent, yet XGBoost optimizes by additively combining the contributions of each training instance (row).

XGBoost’s optimization machinery is designed for objectives that decompose approximately as

L \approx \sum_{t=1}^T \ell_t(y_t,f_{t-1})

so each row t contributes a gradient and Hessian that depend on f_{t-1} locally.

In STES, the prediction at time t depends on the entire history of gates:

\widehat{\sigma}_t^2 = g(\alpha_{t-1},\dots,\alpha_1), \qquad \alpha_s=\mathrm{expit}(f_s)

So the true loss is not row-separable:

L = \sum_{t=1}^T \ell_t\left(y_t; f_1,f_2,\dots,f_t\right)

This matters because tree building assumes that splitting row t affects the loss mostly through that row’s own gradient signal. However, here, changing f_t changes the state recursion and can alter many later terms \ell_{t+1},\ell_{t+2},\dots. The objective is effectively coupled sequentially, while the booster is designed for per-row aggregation.

  2. The gradients used are typically truncated

Our previous implementation used this objective function:

def stes_variance_objective(self, preds, dtrain):
    labels = dtrain.get_label()
    alphas = expit(preds)

    returns2 = self.data['returns^2'].to_numpy()

    grads = np.zeros_like(preds)
    hesss = np.zeros_like(preds)
    varhs = np.zeros_like(preds)

    assert len(preds) == len(labels), "Mismatch between lengths of preds and labels"

    for t in range(len(alphas)):
        if t == 0:
            # warm-start variance: mean of the first 500 squared returns
            lvar_f = np.mean(returns2[:500])
            varhs[t] = alphas[t] * returns2[t] + (1 - alphas[t]) * lvar_f
            d_alpha = returns2[t] - lvar_f
        else:
            varhs[t] = alphas[t] * returns2[t] + (1 - alphas[t]) * varhs[t-1]
            d_alpha = returns2[t] - varhs[t-1]

        # truncated (row-local) chain rule through this row's gate only
        d_pred = -alphas[t] * (1 - alphas[t]) * d_alpha
        grads[t] = 2 * (labels[t] - varhs[t]) * d_pred
        hesss[t] = 2 * d_pred**2

    return grads, hesss

It computes a local derivative:

\frac{\partial \widehat{\sigma}_t^2}{\partial f_t} = \frac{\partial \widehat{\sigma}_t^2}{\partial \alpha_t}\cdot\frac{\partial \alpha_t}{\partial f_t} = \left(r_{t-1}^2-\widehat{\sigma}_{t-1}^2\right)\cdot \alpha_t(1-\alpha_t)

and then uses

\frac{\partial \ell_t}{\partial f_t} = -2\left(y_t-\widehat{\sigma}_t^2\right)\cdot \left(r_{t-1}^2-\widehat{\sigma}_{t-1}^2\right)\cdot \alpha_t(1-\alpha_t)

However, in the true objective, f_t affects all future states, so the true gradient is

\frac{\partial L}{\partial f_t} = \sum_{k=t}^T \frac{\partial \ell_k}{\partial f_t}

The missing terms k>t can be non-negligible when volatility persistence is high, because the recursion propagates forward with factors roughly proportional to (1-\alpha_s). In other words, the booster is not minimizing the stated loss L; it is minimizing a surrogate that only partially accounts for temporal effects. That makes training sensitive to hyperparameters, window length, and initialization choices.
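This can be checked numerically. The sketch below (toy data and scales are illustrative) compares the truncated per-row gradient against a central finite-difference gradient of the full recursive loss: the two agree on the final row, which has no future terms, and disagree elsewhere.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
T = 40
r2 = rng.standard_exponential(T)   # toy "squared returns" (unit scale)
y = r2                             # labels, as in the objective above
s = rng.normal(size=T)             # current raw scores
v0 = r2.mean()                     # warm-start variance level

def variance_path(s):
    a, v, prev = expit(s), np.empty(T), v0
    for t in range(T):
        v[t] = a[t] * r2[t] + (1 - a[t]) * prev
        prev = v[t]
    return v

def full_loss(s):
    return np.sum((y - variance_path(s)) ** 2)

# truncated per-row gradient, as the naive objective computes it
a = expit(s)
v = variance_path(s)
v_prev = np.concatenate(([v0], v[:-1]))
trunc_grad = -2 * (y - v) * a * (1 - a) * (r2 - v_prev)

# true gradient of the full loss, via central finite differences
eps, true_grad = 1e-6, np.empty(T)
for t in range(T):
    sp, sm = s.copy(), s.copy()
    sp[t] += eps
    sm[t] -= eps
    true_grad[t] = (full_loss(sp) - full_loss(sm)) / (2 * eps)

# last row: no k > t terms, so truncated == true;
# earlier rows: the truncated gradient drops the future contributions
```

The gap between the two gradient vectors is exactly the sum of the dropped \partial\ell_k/\partial f_t terms for k>t.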

Even if we correct the gradient to include \frac{\partial \ell_k}{\partial f_t} for k>t, training may still suffer from long-range dependencies, much like an RNN. The sigmoid can also introduce a vanishing-gradient problem, causing training to stall.

The Hessian computation is even more problematic: XGBoost accepts only a diagonal, per-row Hessian, while the second derivatives of the true loss couple pairs of time steps through the shared variance state.

3.2 Iterative Implementation (Algorithm 1)

We sidestep this by turning the problem into an alternating fit between:

  1. inferring a “helpful” gate sequence given the current variance state, and
  2. fitting a flexible model (tree ensemble) to predict that gate sequence from features.

We keep the STES recursion as the definition of the forecast path:

\widehat{\sigma}_t^2 = \alpha_t r_{t-1}^2 + (1-\alpha_t)\widehat{\sigma}^2_{t-1},

We parametrize the gate using a tree model score passed through a sigmoid:

\alpha_t = \mathrm{expit}(s_t) = \frac{1}{1+\exp(-s_t)}, \qquad s_t = \mathcal{F}(X_t).

We first invert the recursion to get a pseudo-target for \alpha_t.

If \widehat{\sigma}_t^2 and \widehat{\sigma}^2_{t-1} were known, then the recursion can be solved for \alpha_t algebraically:

\widehat{\sigma}_t^2 - \widehat{\sigma}^2_{t-1} = \alpha_t \left(r_{t-1}^2 - \widehat{\sigma}^2_{t-1}\right) \quad\Longrightarrow\quad \alpha_t = \frac{\widehat{\sigma}^2_t - \widehat{\sigma}^2_{t-1}}{r_{t-1}^2 - \widehat{\sigma}^2_{t-1}}.

\widehat{\sigma}^2_t is not observed; instead, our supervision target is:

y_t = r_t^2

We use the current estimate of the previous state \widehat{\sigma}^{2,(m)}_{t-1} (iteration m) to form a pseudo-label:

\widetilde{\alpha}_t^{(m)} = \mathrm{clip}\left( \frac{y_t - \widehat{\sigma}^{2,(m)}_{t-1}}{r_{t-1}^2 - \widehat{\sigma}^{2,(m)}_{t-1}}, \varepsilon, 1-\varepsilon \right).

Since the denominator can become small or change sign, and because we need \widetilde{\alpha}\in(0,1) to match the expit gate, we clip the pseudo-label to [\varepsilon, 1-\varepsilon].
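A minimal sketch of this pseudo-label construction (the denominator guard and the default epsilon are illustrative choices, not necessarily the repository's exact handling):

```python
import numpy as np

def pseudo_gate(y, r2_prev, v_prev, eps=1e-3):
    """Invert v_t = a * r_{t-1}^2 + (1 - a) * v_{t-1} for the gate a.

    y       : supervision targets r_t^2 (stand-in for the unobserved v_t)
    r2_prev : previous squared returns r_{t-1}^2
    v_prev  : current estimate of the previous variance state
    """
    denom = r2_prev - v_prev
    # guard against a vanishing denominator before dividing
    denom = np.where(np.abs(denom) < 1e-12, 1e-12, denom)
    return np.clip((y - v_prev) / denom, eps, 1 - eps)
```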

Given the pseudo-label, we can now fit the gate model as a standard supervised problem. The tree model is trained in the usual way, with row-wise targets:

  - inputs: X_t
  - labels: \widetilde{\alpha}_t^{(m)} (or equivalently \mathrm{logit}(\widetilde{\alpha}_t^{(m)}))

XGBoost now sees a normal regression task, instead of a recursive loss that couples all rows together.

After fitting \mathcal{F}^{(m)}, form:

\widehat \alpha_t^{(m)} = \mathrm{expit}\left(\mathcal{F}^{(m)}(X_t)\right),

and update the variance path by running the recursion forward:

\widehat v_t^{(m+1)} = \widehat \alpha_t^{(m)} r_{t-1}^2 + \left(1-\widehat \alpha_t^{(m)}\right)\widehat v_{t-1}^{(m+1)}.

We will repeat the above until the variance path stabilizes, e.g.

\frac{1}{T}\sum_{t=1}^T\left(\widehat v_t^{(m+1)} - \widehat v_t^{(m)}\right)^2 < \delta

or until we hit a maximum number of outer iterations. This is an alternating scheme:

  - Given a variance state estimate, infer what gate would have reduced error at each time step.
  - Given that inferred gate sequence, fit a nonlinear model to reproduce it from X_t.
  - Then recompute the variance path implied by the learned gates.

This avoids needing to differentiate through the entire recursion inside XGBoost (which is the source of the “non-separable objective” mismatch).

3.2.1 Objective Function: Computing the Gradient and Hessian for the XGBSTES Model

XGBoost allows users to pass in a custom objective function with the signature obj: (preds, dtrain) -> (grad, hess). The gradient and Hessian are the components that guide the optimization during training: the gradient gives the direction and rate of steepest ascent of the loss, and the Hessian measures its curvature, indicating how the step should be adjusted.

Algorithm (XGBSTES).

We use the gate \alpha_t = \mathrm{sigmoid}(s_t), \qquad s_t = \mathcal{F}(X_t), \qquad \mathrm{sigmoid}(z)=\frac{1}{1+e^{-z}}

and the one-step-ahead variance recursion \widehat{\sigma}_t^2 = \alpha_{t-1} r_{t-1}^2 + (1-\alpha_{t-1})\widehat{\sigma}_{t-1}^2 with labels y_t=r_t^2 and per-time loss \ell_t=(y_t-\widehat{\sigma}_t^2)^2.

  1. Inputs (for one objective call). A contiguous training block t=1,\dots,T with feature rows X_t (inside the DMatrix), labels y_t (from dtrain.get_label()), and an aligned array r_{t-1}^2. The objective receives the current raw predictions s_t from XGBoost.

  2. Initialization (variance state). Choose an initial variance level \widehat{\sigma}_0^2. In our implementation we use a fixed warm-start such as

\widehat{\sigma}_0^2 = \frac{1}{M}\sum_{i=0}^{M-1} r_i^2

for some window length M (e.g. M=500).
  3. Forward pass (compute gate and variance forecast). For t=1,\dots,T:

    3.1. Compute the gate output for this row: \alpha_t = \mathrm{sigmoid}(s_t).

    3.2. Update the one-step variance forecast using the previous gate value \alpha_{t-1}: \widehat{\sigma}_t^2 = \alpha_{t-1} r_{t-1}^2 + (1-\alpha_{t-1})\widehat{\sigma}_{t-1}^2.

    3.3. Define the “innovation” entering the recursion: \Delta_{t-1} := r_{t-1}^2 - \widehat{\sigma}_{t-1}^2.

  4. Local derivatives (per-row gradient/Hessian returned to XGBoost). The gate derivative is

\frac{\partial \alpha_{t-1}}{\partial s_{t-1}} = \alpha_{t-1}(1-\alpha_{t-1})

Since \widehat{\sigma}_t^2 depends on s_{t-1} only through \alpha_{t-1},

\frac{\partial \widehat{\sigma}_t^2}{\partial s_{t-1}} = \Delta_{t-1}\,\alpha_{t-1}(1-\alpha_{t-1})

Therefore, using \ell_t=(y_t-\widehat{\sigma}_t^2)^2, we return the per-row quantities (aligned so that g_{t-1} and h_{t-1} correspond to row t-1):

g_{t-1} = \frac{\partial \ell_t}{\partial s_{t-1}} = 2(\widehat{\sigma}_t^2-y_t)\,\Delta_{t-1}\,\alpha_{t-1}(1-\alpha_{t-1})

h_{t-1} \approx 2\Big(\Delta_{t-1}\,\alpha_{t-1}(1-\alpha_{t-1})\Big)^2

(The Hessian uses a Gauss–Newton-style approximation.)
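To spell out that approximation: the exact second derivative of \ell_t=(y_t-\widehat{\sigma}_t^2)^2 with respect to s_{t-1} is

\frac{\partial^2 \ell_t}{\partial s_{t-1}^2} = 2\left(\frac{\partial \widehat{\sigma}_t^2}{\partial s_{t-1}}\right)^2 - 2\left(y_t-\widehat{\sigma}_t^2\right)\frac{\partial^2 \widehat{\sigma}_t^2}{\partial s_{t-1}^2}

Keeping only the first, always non-negative, term yields h_{t-1} above and guarantees a positive Hessian for the booster.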

  5. Termination (outer boosting loop). XGBoost repeatedly calls this objective across boosting rounds until reaching num_boost_round or an early-stopping criterion (if enabled). Each call performs Steps 2–4 on the current raw prediction vector \{s_t\}.

We fit the model on the same data as in my (now updated) last post. The results are presented in the next section.

4 Results

When we first ran this experiment, we tried several nearby variants of XGBSTES. They were useful while we were exploring the idea, but they clutter the post now and they do not change the main conclusion. To keep the story focused, we will only report one representative XGBSTES specification below: the baseline gblinear model fit with the alternating procedure described above.

On SPY, we keep the same fixed train/test split as in Part 1. Table 1 compares the constant ES baseline, the best-performing STES specification from the previous post, and the baseline XGBSTES fit from this post.

Model          RMSE       MAE        MedAE
ES             4.64e-04   1.40e-04   5.00e-05
STES_EAESE     4.49e-04   1.38e-04   5.11e-05
XGBSTES_BASE   5.01e-04   1.25e-04   3.06e-05

[Table 1: SPY fixed split results.]

The result is a little mixed. XGBSTES improves MAE and MedAE, so on a typical day it tracks volatility more closely than STES. At the same time, it underperforms STES on RMSE and even falls behind the constant ES benchmark. The more flexible gate helps in ordinary periods, but it also creates larger misses when the timing is wrong.

That tradeoff was the reason we kept working on this model family. The alternating fit was clearly learning something, yet it was not learning the right thing consistently enough to replace STES.

5 Wrapping Up

This version of XGBSTES was still a useful step for us. The naive implementation was wrong, and working through why it was wrong clarified the mismatch between recursive state-space models and row-wise boosting objectives. The alternating fit repaired that mismatch enough to make the model trainable, but not enough to make it better than STES on the metric we cared about most.

That is where we will leave Part 2. XGBSTES with Algorithm 1 is interesting: it improves typical errors, and it produces a visibly different gate from STES. But it does not become the model we keep using later in the series. The next post picks up from exactly that point and moves from the alternating fit to an end-to-end optimization scheme.