Signature Methods (Part 4 — Rolling Signatures in Volatility Forecasts)
Steve Yang
2026-02-06
In Part 3 of the signature series we built intuition for which functionals signatures represent efficiently (variance, averages) and which they do not (running maximum, drawdown). The central result from that investigation was that signatures provide a principled feature hierarchy for ordered interactions, but augmentation design determines what information is accessible at low depth, and some targets are structurally incompatible with low-order truncation.
In this post we put that theory to work. We take the SPY volatility forecasting setup from our volatility series — Exponential Smoothing (ES), Smooth Transition Exponential Smoothing (STES), and XGBoost-STES — and ask a concrete question: can rolling signature features, computed over a trailing window of daily returns, improve out-of-sample variance forecasts within the STES framework?
The answer is negative. However, the experiment is worth reporting in detail because the way it fails is informative. We show that:
Raw signature features from augmented return paths are highly redundant — 20+ coordinates collapse into 6 independent clusters. Level-2 features built from augmented multi-channel data produce 50+ correlated features.
Adding those 6 features to STES increases the parameter count and requires L2 regularization via cross-validation to avoid overfitting. The CV-selected penalty shrinks the signature weights toward zero.
Enriching the path to four channels (log-price, |r|, r^2, r \cdot \mathbb{1}_{r<0}) yields no improvement — the signature of the multi-channel path still fails to beat baseline features.
Inspecting the predicted smoothing parameter \alpha_t reveals that signature features make the models more reactive but not more accurate — the extra variation in \alpha is noise, not signal.
Note
This is my first post published as a Quarto notebook. Most code blocks are folded by default — expand them if you want to see the implementation details. The full notebook is available in the blog repository. This post depends on the alphaforge project (data + dataset plumbing) and the volatility-forecast project (models + evaluation).
1 Setup
1.1 Models
We forecast next-day squared returns y_t = r_{t+1}^2 using three model architectures from our volatility series:
ES (Exponential Smoothing): A fixed-decay exponential moving average v_{t+1} = \alpha r_t^2 + (1-\alpha) v_t with constant \alpha. Simple and hard to beat.
STES: The smoothing parameter \alpha_t = \sigma(\beta^\top X_t) is a logistic function of features X_t (current-day return, absolute return, squared return). The model learns to adapt its reaction speed — going “fast” after shocks and “slow” during calm.
XGBoost-STES: Replaces the linear map \beta^\top X_t with an XGBoost regressor trained end-to-end by backpropagating through the recursive variance filter. See Part 3 for the derivation.
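The ES/STES recursions above fit in a few lines. The sketch below is a minimal illustration of the update rule, not the volatility-forecast implementation; the feature matrix and `beta` are hypothetical placeholders:

```python
import numpy as np

def stes_forecast(returns, X, beta, v0):
    """One-step-ahead variance forecasts via the STES recursion.

    alpha_t = sigmoid(beta . x_t);  v_{t+1} = alpha_t * r_t^2 + (1 - alpha_t) * v_t.
    With a single constant column, beta = [logit(alpha)] recovers plain ES.
    """
    alphas = 1.0 / (1.0 + np.exp(-(X @ beta)))  # logistic link per time step
    v = v0
    forecasts = np.empty(len(returns))
    for t, (a, r) in enumerate(zip(alphas, returns)):
        v = a * r**2 + (1.0 - a) * v  # exponential-smoothing update
        forecasts[t] = v              # forecast of r_{t+1}^2
    return forecasts, alphas
```

With a constant-only feature matrix and `beta = [0]`, every `alpha_t` is 0.5 and the recursion reduces to fixed-decay ES.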
Our experiment adds rolling signature features to X_t and asks whether the models make better \alpha_t decisions with access to the shape of the recent price path.
Code
# SPY settings
SPY_TICKER = "SPY"
SPY_START = pd.Timestamp("2000-01-01", tz="UTC")
SPY_END = pd.Timestamp("2023-12-31", tz="UTC")
SPY_BDAYS = pd.bdate_range(SPY_START, SPY_END, tz="UTC")
SPY_START_BDAY = SPY_BDAYS.min()
SPY_END_BDAY = SPY_BDAYS.max()

# Fixed split indices
SPY_IS_INDEX = 200    # In-sample start
SPY_OS_INDEX = 4000   # Out-of-sample start

# Feature configuration
N_LAGS = 0            # Only use current day's return
INIT_WINDOW = 500     # Initialization window for STES models

# XGBoost parameters
XGB_PARAMS = {
    "booster": "gblinear",
    "updater": "coord_descent",
    "max_depth": 5,
    "learning_rate": 0.01,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
    "min_child_weight": 1.0,
    "reg_lambda": 1.0,
    "reg_alpha": 0.0,
    "verbosity": 0,
}
XGB_ROUNDS = 200
SCALE_FACTOR = 100.0  # Scale returns to volatility points

print(f"Data range: {SPY_START_BDAY.date()} to {SPY_END_BDAY.date()}")
print(f"In-sample starts at index: {SPY_IS_INDEX}")
print(f"Out-of-sample starts at index: {SPY_OS_INDEX}")
Data range: 2000-01-03 to 2023-12-29
In-sample starts at index: 200
Out-of-sample starts at index: 4000
1.2 Signature Configuration
We compute level-2 signatures over a rolling 20-day window with the following augmentation pipeline, using sktime's SignatureTransformer, which ships with a set of supported augmentations. Note that the order of the augmentations matters, because each step transforms what the next step sees:
cumsum — integrate returns into a log-price path so the signature sees geometry, not just increments.
base point — anchor each window at zero. Since standard signatures are translation-invariant (S(X) = S(X+c)), this augmentation allows the signature to capture the level of the cumulative return relative to the start of the window.
time — append a monotone time channel so the signature can distinguish fast from slow moves.
lead–lag — lift the 1D path to 2D, exposing quadratic variation at depth 2 (see Part 3 for the derivation).
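To see concretely why lead–lag exposes quadratic variation at depth 2, here is a self-contained sketch (plain NumPy, not the sktime pipeline): a direct level-2 signature of a piecewise-linear path, plus the lead–lag lift. The antisymmetric part S^{(lead,lag)} - S^{(lag,lead)} of the lifted path's level-2 signature equals the sum of squared increments:

```python
import numpy as np

def sig_level2(path):
    """Level-1 and level-2 signature terms of a piecewise-linear path.

    path: (n_points, d) array. Returns (S1, S2) where S1_i = X^i_T - X^i_0
    and S2_ij = integral of (X^i_t - X^i_0) dX^j_t.
    """
    inc = np.diff(path, axis=0)   # step increments
    dev = path[:-1] - path[0]     # X_t - X_0 at the start of each step
    S1 = inc.sum(axis=0)
    # Piecewise-linear: S2_ij = sum_k dev^i_k * inc^j_k + 0.5 * inc^i_k * inc^j_k
    S2 = dev.T @ inc + 0.5 * (inc.T @ inc)
    return S1, S2

def lead_lag(x):
    """Lift a scalar sequence to the 2D lead-lag path."""
    pts = []
    for k in range(len(x) - 1):
        pts.append((x[k], x[k]))      # both channels at x_k
        pts.append((x[k + 1], x[k]))  # lead moves first
    pts.append((x[-1], x[-1]))        # lag catches up
    return np.asarray(pts, dtype=float)

x = np.array([0.0, 1.0, 3.0, 2.0])
S1, S2 = sig_level2(lead_lag(x))
qv = np.sum(np.diff(x) ** 2)
# Levy-area (antisymmetric) part of the lead-lag signature = quadratic variation
assert np.isclose(S2[0, 1] - S2[1, 0], qv)  # 6.0 for this path
```

This is the identity used throughout: second-moment information that would otherwise require depth 4 on the raw path sits at depth 2 on the lifted path.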
This ordering mirrors our earlier theory: make second-moment information visible at low depth after stabilizing the path representation. We can specify a signature computation using the following config:
Code
@dataclass(frozen=True)
class SignatureConfig:
    """Configuration for signature feature generation."""
    name: str
    lookback: int             # Number of days to look back
    level: int                # Signature truncation level
    augmentations: list[str]  # Augmentation pipeline
    sig_tfm: str              # 'signature' or 'logsignature'

# Example configuration
example_sig = SignatureConfig(
    name="L20_L2",
    lookback=20,
    level=2,
    augmentations=["cumsum", "basepoint", "addtime", "leadlag"],
    sig_tfm="signature",
)
print(f"Example: {example_sig.name}")
print(f"  Lookback: {example_sig.lookback} days")
print(f"  Level: {example_sig.level}")
print(f"  Augmentations: {' -> '.join(example_sig.augmentations)}")
# — from volatility_forecast/features/signature_features.py —
class SignatureFeaturesTemplate(_PriceTemplate):
    def transform(self, ctx, params, slice, state):
        lags = int(params["lags"])            # rolling window length
        sig_level = int(params["sig_level"])  # signature truncation depth

        # 1) Build rolling path: one fixed-length window per timestamp
        logret = self._logret(panel, price_col)  # log-returns series
        path_df = pd.concat(
            {
                f"lag_{i}": logret.groupby(level="entity_id").shift(i)
                for i in range(lags, -1, -1)
            },
            axis=1,
        ).dropna()

        # 2) Resolve augmentation pipeline (order matters)
        augmentation_list = self._resolve_augmentation_pipeline(
            params.get("augmentation_list", "all")
        )

        # 3) Instantiate sktime SignatureTransformer
        sig_transformer = SignatureTransformer(
            augmentation_list=augmentation_list,  # e.g. ("cumsum","basepoint","addtime","leadlag")
            depth=sig_level,       # truncation level
            window_name="global",  # one window = one pre-built path
            rescaling="post",      # factorial rescaling
            sig_tfm="signature",   # full (not log-) signature
            backend="esig",
        )

        # 4) Reshape to sktime Panel format and compute
        paths = path_df.to_numpy(dtype=float)  # (n_samples, n_timepoints)
        X_panel = paths[:, None, :]            # (n_samples, 1_channel, n_timepoints)
        sig_df = sig_transformer.fit_transform(X_panel)
        sig_df.index = path_df.index           # realign (entity, time) index
        # → sig_df columns are the signature coefficients used as features
1.3 Data
We use SPY daily data from January 2000 to December 2023. The first 200 business days are reserved for model initialization, the next 3,800 form the in-sample training window, and the remaining ~2,000 days are the out-of-sample test set. Baseline features are the current day’s log return, absolute return, and squared return, plus a constant.
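The baseline feature block is small; a sketch of what X_t contains (illustrative column names and a toy price series, not the alphaforge feature templates):

```python
import numpy as np
import pandas as pd

def baseline_features(close: pd.Series) -> pd.DataFrame:
    """Current-day log return, absolute return, squared return, plus a constant."""
    r = np.log(close).diff()
    X = pd.DataFrame({
        "ret": r,            # signed log return
        "abs_ret": r.abs(),  # shock magnitude
        "sq_ret": r**2,      # realized-variance proxy
        "const": 1.0,        # intercept for the logistic link
    })
    return X.dropna()

# Target: next-day squared return y_t = r_{t+1}^2, aligned to the feature index
prices = pd.Series([100.0, 101.0, 99.5, 100.2, 100.9])
X = baseline_features(prices)
y = (np.log(prices).diff() ** 2).shift(-1).reindex(X.index).rename("y")
```

The last row of `y` is NaN (no next-day return), which is why the final-row missingness policy in the dataset spec drops it.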
def build_spy_spec_with_signature(sig: SignatureConfig, lags: int = N_LAGS):
    """Spec that adds signature features on top of the baseline features."""
    base = list(_base_feature_requests(lags))

    # Add signature feature template
    sig_req = FeatureRequest(
        template=SignatureFeaturesTemplate(),
        params={
            "lags": int(sig.lookback),
            "sig_level": int(sig.level),
            "sig_tfm": str(sig.sig_tfm),
            "augmentation_list": (
                "none" if not sig.augmentations else "->".join(sig.augmentations)
            ),
            "source": "tiingo",
            "table": "market.ohlcv",
            "price_col": "close",
        },
    )
    features = tuple(base + [sig_req])
    target = TargetRequest(
        template=NextDaySquaredReturnTarget(),
        params={
            "source": "tiingo",
            "table": "market.ohlcv",
            "price_col": "close",
            "scale": 1.0,
        },
        horizon=1,
        name="y",
    )
    return VolDatasetSpec(
        universe=UniverseSpec(entities=[SPY_TICKER]),
        time=TimeSpec(
            start=SPY_START_BDAY,
            end=SPY_END_BDAY,
            calendar="XNYS",
            grid="B",
            asof=None,
        ),
        features=features,
        target=target,
        join_policy=JoinPolicy(how="inner", sort_index=True),
        missingness=MissingnessPolicy(final_row_policy="drop_if_any_nan"),
    )


def build_spy_dataset_with_signature(
    ctx,
    *,
    sig: SignatureConfig,
    baseline_X: pd.DataFrame,
    baseline_index: pd.Index,
    baseline_columns: list[str],
    lags: int = N_LAGS,
) -> pd.DataFrame:
    """Build signature dataset and align to baseline index.

    Key steps:
    1. Build dataset with signature features (drops early rows)
    2. Reindex to baseline index (restores dropped rows)
    3. Restore baseline columns from baseline dataset
    4. Fill signature columns with 0.0 for warmup rows
    """
    spec = build_spy_spec_with_signature(sig, lags)
    X, _y, _returns, _catalog = build_vol_dataset(ctx, spec, persist=False)
    X1 = _split_entity_frame(X, SPY_TICKER)
    if isinstance(X1, pd.Series):
        X1 = X1.to_frame()
    if "const" not in X1.columns:
        X1["const"] = 1.0

    # Align to baseline index
    X_aligned = X1.reindex(baseline_index)

    # Restore baseline columns
    X_aligned.loc[baseline_index, baseline_columns] = baseline_X.loc[
        baseline_index, baseline_columns
    ].to_numpy()

    # Fill signature features for warmup rows
    sig_cols = [c for c in X_aligned.columns if c not in set(baseline_columns)]
    if sig_cols:
        X_aligned[sig_cols] = X_aligned[sig_cols].fillna(0.0)
    return X_aligned
2 Experiment 1: Single-Channel Signatures
2.1 Feature Exploration
Before feeding signatures into the models, we characterize the feature set. The level-2 signature of the augmented 1D return path produces 20 coordinates in this configuration. We examine their mutual correlation structure and evaluate how much independent information they carry.
Most coordinates of the level-2 signature of a 20-day augmented return path have near-zero correlation with the target (next-day squared return), and many are highly correlated with each other — a direct consequence of the augmentation pipeline generating algebraically related terms.
The clustered correlation matrix below makes the redundancy structure visible. Large off-diagonal blocks of near-perfect correlation (|\rho| > 0.95) indicate groups of features that carry the same information.
Rather than manually pruning augmentations, we keep the full pipeline and select one representative per correlation cluster:
Compute the pairwise correlation matrix on the training set only (no look-ahead).
Hierarchically cluster features using d_{ij} = 1 - |\rho_{ij}|.
Cut the dendrogram at threshold t (e.g. t = 0.05 → features with |\rho| \geq 0.95 are merged).
From each cluster, keep the feature with the highest |\text{corr}| with the target y.
If a max_features budget is set, the threshold is progressively lowered until the cluster count falls within budget. This is preferable to PCA because each selected column remains an interpretable signature coordinate, and the selection is fit on training data only.
The pair of heatmaps after the cluster output shows the correlation structure before and after selection: the left panel displays the full signature feature set with the dense off-diagonal blocks discussed above, while the right panel shows the 6 surviving representatives with near-zero mutual correlation.
Cluster-based feature selection
def select_features_by_cluster(
    X: pd.DataFrame,
    y: pd.Series,
    feature_cols: list[str],
    train_slice: slice,
    corr_threshold: float = 0.95,
    max_features: int | None = None,
) -> tuple[list[str], dict]:
    """Select one feature per correlation cluster, fit on training data only.

    Parameters
    ----------
    X : DataFrame with all features
    y : target series (aligned to X index)
    feature_cols : columns to cluster and select from
    train_slice : training indices (causal — clustering uses train only)
    corr_threshold : features with |corr| >= this are grouped together.
        This is the *starting* threshold — if ``max_features`` is set and the
        initial clustering yields too many clusters, the threshold is
        progressively lowered (i.e. clusters are merged more aggressively)
        until the feature count is within budget.
    max_features : int or None
        Hard cap on the number of selected features. When set, the correlation
        threshold is lowered in steps of 0.05 until ``n_clusters <= max_features``.

    Returns
    -------
    selected : list of str
        Selected feature column names (one per cluster).
    cluster_map : dict
        Mapping from cluster id to list of member feature names.
    """
    X_train = X.iloc[train_slice][feature_cols]
    y_train = y.iloc[train_slice]

    # 0. Drop zero-variance features (e.g. warmup zeros that never vary in training)
    variances = X_train.var()
    live_cols = [c for c in feature_cols if variances[c] > 0]
    dropped_cols = [c for c in feature_cols if variances[c] <= 0]
    if dropped_cols:
        print(f"  Dropped {len(dropped_cols)} zero-variance features in training window")
    X_train = X_train[live_cols]

    # 1. Correlation matrix on training data
    #    NaN correlations (e.g. near-constant columns) treated as perfectly correlated
    #    so they get grouped together and only one survives
    corr_mat = X_train.corr().replace([np.inf, -np.inf], np.nan)
    corr_mat = corr_mat.fillna(1.0)

    # 2. Distance = 1 - |corr|
    dist = 1.0 - np.abs(corr_mat.values)
    np.fill_diagonal(dist, 0.0)
    dist = np.clip(dist, 0.0, None)  # numerical safety

    # 3. Hierarchical clustering — adapt threshold to meet max_features budget
    Z = linkage(squareform(dist, checks=False), method="average")
    threshold = corr_threshold
    while True:
        cut_distance = 1.0 - threshold
        labels = fcluster(Z, t=cut_distance, criterion="distance")
        n_clusters = len(set(labels))
        if max_features is None or n_clusters <= max_features:
            break
        # Lower threshold (merge more aggressively) and retry
        threshold -= 0.05
        if threshold < 0.30:
            # Safety floor — don't merge features with |corr| < 0.30
            break
    if threshold < corr_threshold:
        print(
            f"  Lowered corr_threshold {corr_threshold:.2f} → {threshold:.2f} "
            f"to meet max_features={max_features} (got {n_clusters} clusters)"
        )

    # 4. For each cluster, pick feature with highest |corr| with target
    target_corr = X_train.corrwith(y_train).abs().fillna(0.0)
    selected = []
    cluster_map = {}  # cluster_id -> list of features
    for col, cid in zip(live_cols, labels):
        cluster_map.setdefault(cid, []).append(col)
    for cid, members in sorted(cluster_map.items()):
        best = max(members, key=lambda c: target_corr.get(c, 0.0))
        selected.append(best)
    return selected, cluster_map
Apply cluster-based selection
train_sl = slice(SPY_IS_INDEX, SPY_OS_INDEX)
selected_sig_cols, cluster_map = select_features_by_cluster(
    X_sig,
    y_base,
    sig_cols,
    train_slice=train_sl,
    corr_threshold=0.95,
)
print(f"Original signature features: {len(sig_cols)}")
print(f"Clusters found: {len(cluster_map)}")
print(f"Selected features (one per cluster): {len(selected_sig_cols)}")
print()
for cid, members in sorted(cluster_map.items(), key=lambda x: -len(x[1])):
    chosen = [m for m in members if m in selected_sig_cols][0]
    if len(members) > 1:
        dropped = [m for m in members if m != chosen]
        print(f"  Cluster {cid} ({len(members)} features): keep {chosen}, drop {dropped}")
    else:
        print(f"  Cluster {cid} (1 feature): keep {chosen}")
Dropped 4 zero-variance features in training window
Original signature features: 20
Clusters found: 6
Selected features (one per cluster): 6
Cluster 5 (5 features): keep sig.2.1ec5eb462c34, drop ['sig.2.2369463372ff', 'sig.2.8eed169e5d53', 'sig.2.67b9c580bd5e', 'sig.2.2df25188d3c4']
Cluster 3 (4 features): keep sig.2.af47c4edd59d, drop ['sig.2.aa596a5e93f6', 'sig.2.0b4c29be55bf', 'sig.2.d61527266162']
Cluster 4 (3 features): keep sig.2.49bf7214fc3f, drop ['sig.2.f030fdc202ec', 'sig.2.afaedaad902a']
Cluster 1 (2 features): keep sig.2.59a44f852829, drop ['sig.2.04d7e305d350']
Cluster 2 (1 feature): keep sig.2.7b3cf4c020ee
Cluster 6 (1 feature): keep sig.2.f74613e2b353
We test whether appending the 6 selected signature coordinates to the model feature set improves out-of-sample forecasts. The three model tiers are the same as in Part 3:
ES — fixed smoothing parameter \alpha.
STES — \alpha_t = \sigma(\mathbf{x}_t^\top \boldsymbol{\beta}), with or without signature features in \mathbf{x}_t. When signatures are included, an L2 penalty on \boldsymbol{\beta} is tuned via cross-validation.
XGBoost-STES — \alpha_t = \sigma\bigl(f(\mathbf{x}_t)\bigr), with hyperparameters selected by 5-fold CV on the training set.
All models are evaluated on a single fixed split (IS: 2000–2015, OOS: 2016–2023), averaged across 10 random seeds.
Model creation and helper functions
"""Taken from volatility_forecast/example/volatility_forecast_2.py"""

class ModelName(str, Enum):
    ES = "ES"
    STES_EAESE = "STES_EAESE"
    XGBSTES_E2E = "XGBSTES_E2E"


def _make_xgb_stes_model(
    *,
    seed: int | None,
    fit_method: str = "end_to_end",
    loss: str = "mse",
    xgb_params: dict | None = None,
    monotonic_constraints: dict[str, int] | None = None,
) -> XGBoostSTESModel:
    """Create XGBoost STES model with specified configuration."""
    params = dict(XGB_PARAMS)
    if xgb_params:
        params.update(xgb_params)
    # Strip gblinear-specific params when using tree booster
    if params.get("booster", "gbtree") != "gblinear":
        params.pop("updater", None)
    e2e_kwargs = {}
    if fit_method == "end_to_end":
        e2e_kwargs = {
            "e2e_grad_hess_scale": 1.0,
            "e2e_debug": True,
            "e2e_debug_print_once": True,
        }
    return XGBoostSTESModel(
        xgb_params=params,
        num_boost_round=XGB_ROUNDS,
        init_window=INIT_WINDOW,
        fit_method=fit_method,
        loss=loss,
        random_state=seed,
        monotonic_constraints=monotonic_constraints,
        **e2e_kwargs,
    )


def _infer_monotone_constraints(cols: list[str]) -> dict[str, int]:
    """Enforce -1 monotonicity on shock-magnitude features."""
    out: dict[str, int] = {}
    for c in cols:
        cl = c.lower()
        if any(k in cl for k in ["abs", "squared", "sq", "r2", "vol", "rv"]):
            out[c] = -1
    return out


def _xgb_variant_overrides(variant: str, cols: list[str]) -> dict:
    """Translate variant names into model configuration."""
    v = variant.upper()
    fit_method = "end_to_end" if "E2E" in v or "END_TO_END" in v else "alternating"
    loss = "pseudohuber" if "HUBER" in v else "mse"
    mono = _infer_monotone_constraints(cols) if "MONO" in v else None
    return {"fit_method": fit_method, "loss": loss, "monotonic_constraints": mono}


def _is_signature_col(col: str) -> bool:
    return col.startswith("sig.") or col.startswith("mcsig.")


def _scale_train_test(
    X: pd.DataFrame, train_slice: slice, test_slice: slice
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Standardize all features (including signatures) on train, apply to test.

    Matches volatility_forecast_2.py: every column except 'const' is scaled.
    """
    X_tr = X.iloc[train_slice].copy()
    X_te = X.iloc[test_slice].copy()
    cols_to_scale = [c for c in X.columns if c != "const"]
    if not cols_to_scale:
        return X_tr, X_te
    scaler = StandardScaler().fit(X_tr[cols_to_scale])
    X_tr.loc[:, cols_to_scale] = pd.DataFrame(
        scaler.transform(X_tr[cols_to_scale]),
        index=X_tr.index,
        columns=cols_to_scale,
    )
    X_te.loc[:, cols_to_scale] = pd.DataFrame(
        scaler.transform(X_te[cols_to_scale]),
        index=X_te.index,
        columns=cols_to_scale,
    )
    return X_tr, X_te
Fit and predict implementation
"""Taken from volatility_forecast/example/volatility_forecast_2.py"""

def _fit_predict_oos(
    *,
    variant: str,
    X: pd.DataFrame,
    y: pd.Series,
    r: pd.Series,
    train_slice: slice,
    test_slice: slice,
    seed: int | None = None,
    xgb_params: dict | None = None,
    extra_cols: list[str] | None = None,
    return_alpha: bool = False,
    cv_splits: int = 5,
) -> tuple[pd.Index, np.ndarray] | tuple[pd.Index, np.ndarray, pd.Series]:
    """Fit model on train slice and predict on test slice with warm-start.

    For XGBoost-STES variants, supports three fitting modes via suffix:
    - No suffix or _COLD: cold-start (default)
    - _WS: warm-start — initialize from baseline z-scores via base_margin
    - _2S: two-stage — freeze baseline, train only on signature features

    Tree booster mode: include _TREE in the variant name to use gbtree
    instead of gblinear. Warm-start / two-stage are not supported with
    gbtree (the tree builder is greedy, so does not suffer from gblinear's
    non-convexity issues).
    """
    from scipy.special import expit as _expit

    variant_name = variant.value if isinstance(variant, Enum) else str(variant)

    # Detect warm-start / two-stage mode from variant suffix
    fit_mode = "cold"
    clean_variant = variant_name
    if variant_name.endswith("_WS"):
        fit_mode = "warm_start"
        clean_variant = variant_name[:-3]  # strip _WS
    elif variant_name.endswith("_2S"):
        fit_mode = "two_stage"
        clean_variant = variant_name[:-3]  # strip _2S

    # Detect tree booster from variant name
    is_tree = "_TREE" in clean_variant.upper()
    if is_tree and fit_mode != "cold":
        logger.warning(
            f"[{variant_name}] Warm-start/two-stage not supported for "
            f"gbtree — falling back to cold-start"
        )
        fit_mode = "cold"

    # For tree variants, inject gbtree-specific XGBoost parameters
    if is_tree:
        tree_defaults = {
            "booster": "gbtree",
            "max_depth": 2,
            "eta": 0.05,
            "subsample": 1.0,  # keep 1.0 for temporal STES
            "colsample_bytree": 1.0,
            "min_child_weight": 5.0,
            "reg_lambda": 1.0,
            "reg_alpha": 0.0,
            "verbosity": 0,
        }
        # User's xgb_params override tree defaults
        if xgb_params:
            tree_defaults.update(xgb_params)
        xgb_params = tree_defaults

    # Prepare data
    if clean_variant.startswith("XGBSTES_"):
        # Drop 'const' — gblinear already has an unpenalized bias term;
        # keeping const creates a redundant, partly-penalized intercept.
        if extra_cols:
            # Use baseline (non-sig) columns + ONLY the specified extra columns
            base_only = [c for c in X.columns if not _is_signature_col(c)]
            cols = base_only + [c for c in extra_cols if c in X.columns and c not in base_only]
        else:
            cols = [c for c in X.columns if not _is_signature_col(c)]
        cols = [c for c in cols if c != "const"]
        X_sel = X[cols]
        X_tr, X_te = _scale_train_test(X_sel, train_slice, test_slice)
        y_tr, r_tr = y.iloc[train_slice], r.iloc[train_slice]
        r_te = r.iloc[test_slice]
        over = _xgb_variant_overrides(clean_variant, cols)
        model = _make_xgb_stes_model(seed=None, xgb_params=xgb_params, **over)
    else:
        cols = select_variant_columns(X, clean_variant) or ["const"]
        cols = [c for c in cols if not _is_signature_col(c)]
        if extra_cols:
            cols = cols + [c for c in extra_cols if c not in set(cols)]
        X_sel = X[cols]
        X_tr, X_te = _scale_train_test(X_sel, train_slice, test_slice)
        y_tr, r_tr = y.iloc[train_slice], r.iloc[train_slice]
        r_te = r.iloc[test_slice]
        if clean_variant == ModelName.ES.value:
            model = ESModel(random_state=seed) if seed is not None else ESModel()
        else:
            model = STESModel(random_state=seed) if seed is not None else STESModel()

    # Fit model
    if clean_variant.startswith("XGBSTES_"):
        r_tr_scaled = r_tr * SCALE_FACTOR
        y_tr_scaled = y_tr * (SCALE_FACTOR**2)

        # CV grid: tree models need different hyperparameters
        if is_tree:
            grid_e2e = [
                {"max_depth": 2, "min_child_weight": 5.0, "eta": 0.05},
                {"max_depth": 3, "min_child_weight": 10.0, "eta": 0.03},
                {"max_depth": 2, "min_child_weight": 20.0, "eta": 0.1},
            ]
        else:
            grid_e2e = [
                {"min_child_weight": 5.0, "learning_rate": 0.05, "max_depth": 3},
                {"min_child_weight": 20.0, "learning_rate": 0.1, "max_depth": 3},
                {"min_child_weight": 50.0, "learning_rate": 0.05, "max_depth": 2},
            ]

        # ── WARM-START / TWO-STAGE: fit baseline first ──────────────────
        base_margin_train = None
        base_margin_all = None
        X_tr_fit = X_tr   # features used for fitting
        X_all_fit = None  # will be set later for prediction
        if fit_mode in ("warm_start", "two_stage") and extra_cols:
            # 1. Fit baseline model (without signature features)
            base_cols = [c for c in cols if not _is_signature_col(c)]
            X_tr_base_ws, X_te_base_ws = _scale_train_test(X[base_cols], train_slice, test_slice)
            model_base = _make_xgb_stes_model(seed=None, xgb_params=xgb_params, **over)
            model_base.fit(
                X_tr_base_ws,
                y_tr_scaled,
                returns=r_tr_scaled,
                perform_cv=True,
                cv_grid=grid_e2e,
                cv_splits=cv_splits,
                start_index=0,
                end_index=len(X_tr_base_ws),
            )

            # 2. Extract baseline z-scores: z = bias + Σ w_j * x_j
            dump_str = model_base.model_.get_dump()[0]
            b_bl, w_bl = _parse_gblinear_dump(dump_str)

            # Map baseline weights to the full (augmented) feature space
            w_embedded = np.zeros(len(cols))
            for j, c in enumerate(cols):
                if c in base_cols:
                    base_j = base_cols.index(c)
                    w_embedded[j] = w_bl[base_j]

            # Compute z_baseline on scaled augmented features
            X_tr_np = X_tr.values.astype(float)
            z_baseline_train = b_bl + X_tr_np @ w_embedded
            # Also for test (needed at prediction time)
            X_te_np = X_te.values.astype(float)
            z_baseline_test = b_bl + X_te_np @ w_embedded
            base_margin_train = z_baseline_train
            base_margin_all = np.concatenate([z_baseline_train, z_baseline_test])

            if fit_mode == "two_stage":
                # Only train on signature features; baseline is frozen in base_margin
                sig_fit_cols = [c for c in cols if _is_signature_col(c)]
                X_tr_fit = X_tr[sig_fit_cols]
                # Create separate model for sig-only features
                model = _make_xgb_stes_model(seed=None, xgb_params=xgb_params, **over)

            logger.info(
                f"[{fit_mode}] Baseline fitted (bias={b_bl:.4f}). "
                f"base_margin range: [{z_baseline_train.min():.3f}, {z_baseline_train.max():.3f}]"
            )

        model.fit(
            X_tr_fit,
            y_tr_scaled,
            returns=r_tr_scaled,
            perform_cv=True,
            cv_grid=grid_e2e,
            cv_splits=cv_splits,
            start_index=0,
            end_index=len(X_tr_fit),
            base_margin=base_margin_train,
        )
    elif extra_cols and isinstance(model, STESModel):
        # Group L2: penalize only signature features, leave baseline free.
        # Uniform L2 shrinks useful baseline weights when trying to suppress
        # noisy signature weights — group-specific penalties avoid this.
        from sklearn.model_selection import TimeSeriesSplit
        from scipy.optimize import least_squares as _lsq

        X_tr_np = X_tr.values.astype(float)
        y_tr_np = y_tr.values.astype(float)
        r_tr_np = r_tr.values.astype(float)
        p = X_tr_np.shape[1]
        col_list = list(X_tr.columns)

        # Mask: 1.0 for signature features, 0.0 for baseline / const
        sig_mask = np.array(
            [1.0 if _is_signature_col(c) else 0.0 for c in col_list]
        )

        l2_sig_grid = [0.0, 0.01, 0.1, 1.0, 10.0]
        tscv = TimeSeriesSplit(n_splits=cv_splits)
        best_l2, best_mse = 0.0, np.inf
        for l2_cand in l2_sig_grid:
            fold_mses = []
            for tr_idx, va_idx in tscv.split(X_tr_np):
                Xf_tr, Xf_va = X_tr_np[tr_idx], X_tr_np[va_idx]
                yf_tr, yf_va = y_tr_np[tr_idx], y_tr_np[va_idx]
                rf_tr, rf_va = r_tr_np[tr_idx], r_tr_np[va_idx]
                scale_tr = np.linalg.norm(yf_tr)
                p_vec = (
                    np.sqrt(l2_cand) * scale_tr * sig_mask
                    if l2_cand > 0
                    else None
                )
                rng = np.random.default_rng(seed)
                x0 = rng.normal(0, 1, size=p)
                res = _lsq(
                    STESModel._objective,
                    x0=x0,
                    jac=STESModel._jacobian,
                    args=(rf_tr, Xf_tr, yf_tr, 0, len(Xf_tr), p_vec),
                )
                beta = res.x
                alphas_va = _expit(Xf_va @ beta)
                r2_va = rf_va**2
                v = (
                    float(np.mean(rf_tr[-20:] ** 2))
                    if len(rf_tr) >= 20
                    else float(r2_va[0])
                )
                yhat_va = np.empty(len(yf_va))
                for t in range(len(yf_va)):
                    v_next = alphas_va[t] * r2_va[t] + (1 - alphas_va[t]) * v
                    yhat_va[t] = v_next
                    v = v_next
                fold_mses.append(float(np.mean((yf_va - yhat_va) ** 2)))
            avg_mse = float(np.mean(fold_mses))
            if avg_mse < best_mse:
                best_mse, best_l2 = avg_mse, l2_cand
        logger.info(
            f"Group-L2 CV selected λ_sig={best_l2:.4f} (MSE={best_mse:.2e})"
        )

        # Final fit with best group penalty
        scale = np.linalg.norm(y_tr_np)
        p_vec_final = (
            np.sqrt(best_l2) * scale * sig_mask if best_l2 > 0 else None
        )
        rng = np.random.default_rng(seed)
        x0 = rng.normal(0, 1, size=p)
        res = _lsq(
            STESModel._objective,
            x0=x0,
            jac=STESModel._jacobian,
            args=(r_tr_np, X_tr_np, y_tr_np, 0, len(X_tr_np), p_vec_final),
        )
        model.params = res.x
        model._set_schema(col_list, p)
    else:
        model.fit(X_tr, y_tr, returns=r_tr, start_index=0, end_index=len(X_tr))

    # Warm-start prediction: concatenate train+test to avoid cold start
    n_tr = len(X_tr)
    X_all = pd.concat([X_tr, X_te], axis=0)
    r_all = pd.concat([r_tr, r_te], axis=0)
    if clean_variant.startswith("XGBSTES_"):
        r_all_scaled = r_all * SCALE_FACTOR
        # For two-stage prediction, use sig-only features
        if fit_mode == "two_stage" and extra_cols:
            sig_fit_cols = [c for c in cols if _is_signature_col(c)]
            X_all_pred = pd.concat([X_tr[sig_fit_cols], X_te[sig_fit_cols]], axis=0)
        else:
            X_all_pred = X_all
        y_hat_all_scaled = model.predict(
            X_all_pred,
            returns=r_all_scaled,
            base_margin=base_margin_all,
        )
        y_hat_all = np.asarray(y_hat_all_scaled, dtype=float) / (SCALE_FACTOR**2)
        alpha_all = model.get_alphas(X_all_pred, base_margin=base_margin_all)
    else:
        y_hat_all = model.predict(X_all, returns=r_all)
        if isinstance(model, STESModel):
            _yhat_vals, alpha_vals = model.predict_with_alpha(X_all, returns=r_all)
            alpha_all = pd.Series(alpha_vals, index=X_all.index, name="alpha")
        else:
            alpha_val = getattr(model, "alpha_", np.nan)
            alpha_all = pd.Series(alpha_val, index=X_all.index, name="alpha")

    # Extract test predictions
    y_hat_arr = np.asarray(y_hat_all, dtype=float)[n_tr:]
    keep = np.isfinite(y_hat_arr)
    if return_alpha:
        return X_te.index[keep], y_hat_arr[keep], alpha_all
    return X_te.index[keep], y_hat_arr[keep]
First the baseline — identical to Part 3. XGBoost-STES dominates on RMSE and MAE.
Evaluating baseline models...
variant          rmse      mae       medae
XGBSTES_E2E      0.000443  0.000132  0.000045
STES_EAESE       0.000448  0.000138  0.000051
ES               0.000464  0.000140  0.000050
3.2 With Signature Features
Now we append the 6 cluster-selected signature coordinates as additional features. For STES, an L2 regularization penalty is tuned via 5-fold CV (grid: \lambda \in \{0, 0.01, 0.1, 1.0, 10.0\}).
variant                  rmse      mae       medae
XGBSTES_E2E              0.000443  0.000132  0.000045
STES_EAESE               0.000448  0.000138  0.000051
STES_EAESE_SIG_L20_L2    0.000453  0.000134  0.000046
XGBSTES_E2E_SIG_L20_L2   0.000457  0.000139  0.000046
ES                       0.000464  0.000140  0.000050
The table above suggests that signature features carry little useful information for determining the smoothing parameter \alpha_t. Both STES and XGBoost-STES degrade slightly when signature features are added, and the L2 cross-validation for STES selects a high penalty — the optimizer effectively shrinks the signature coefficients toward zero. It is equally informative to examine the predicted smoothing parameter \alpha_t with and without signature features.
3.3 Diagnosing via \alpha_t Paths
The plot below overlays the predicted \alpha_t for STES and XGBoost-STES, with and without signature features. If signatures carried genuine signal, we would expect the augmented \alpha_t to track volatility regimes more tightly. Instead, the signature-augmented paths are noisier — the additional features introduce high-frequency variation in \alpha_t that does not correspond to actual regime changes. Moreover, there appears to be an \alpha level shift. This is a more serious problem, which we examine later in this post.
# ── Quantify the alpha level shift visible in the chart ──────────────────
print("OOS SMOOTHING PARAMETER (α) — LEVEL SHIFT ANALYSIS")
print("=" * 72)
print(f"\n{'Model':>25s}{'mean(α)':>10s}{'std(α)':>10s}{'median':>10s}{'IQR':>12s}")
print("-" * 72)
for label in ["STES (no sig)", "STES + sig", "XGB-STES (no sig)", "XGB-STES + sig"]:
    a = alphas[label].values
    q25, q75 = np.percentile(a, [25, 75])
    print(f"{label:>25s}{a.mean():>10.4f}{a.std():>10.4f}{np.median(a):>10.4f}  [{q25:.4f}, {q75:.4f}]")

# ── Level shift magnitudes ───────────────────────────────────────────────
a_stes_base = alphas["STES (no sig)"].values
a_stes_sig = alphas["STES + sig"].values
a_xgb_base = alphas["XGB-STES (no sig)"].values
a_xgb_sig = alphas["XGB-STES + sig"].values

print(f"\nLevel shifts:")
print(f"  STES:     mean α jumps from {a_stes_base.mean():.4f} → {a_stes_sig.mean():.4f} "
      f"(Δ = {a_stes_sig.mean() - a_stes_base.mean():+.4f}, "
      f"{(a_stes_sig.mean()/a_stes_base.mean()-1)*100:+.1f}%)")
print(f"  XGB-STES: mean α jumps from {a_xgb_base.mean():.4f} → {a_xgb_sig.mean():.4f} "
      f"(Δ = {a_xgb_sig.mean() - a_xgb_base.mean():+.4f}, "
      f"{(a_xgb_sig.mean()/a_xgb_base.mean()-1)*100:+.1f}%)")

# ── Correlation between baseline and sig alpha paths ─────────────────────
corr_stes = np.corrcoef(a_stes_base, a_stes_sig)[0, 1]
corr_xgb = np.corrcoef(a_xgb_base, a_xgb_sig)[0, 1]
print(f"\nCorrelation between baseline α and sig α:")
print(f"  STES:     ρ = {corr_stes:.4f}")
print(f"  XGB-STES: ρ = {corr_xgb:.4f}")

# ── What does the level shift imply for the ES recursion? ────────────────
# α controls the weight on new shocks: σ²_{t+1} = α·r²_t + (1-α)·σ²_t
# Higher α → faster reaction to shocks, less smoothing, more volatile forecasts
# Lower α → more smoothing, slower reaction, stabler forecasts
mean_shift_stes = a_stes_sig.mean() - a_stes_base.mean()
mean_shift_xgb = a_xgb_sig.mean() - a_xgb_base.mean()

# Implied half-life of a shock: h = log(0.5) / log(1-α)
hl_stes_base = np.log(0.5) / np.log(1 - a_stes_base.mean())
hl_stes_sig = np.log(0.5) / np.log(1 - a_stes_sig.mean())
hl_xgb_base = np.log(0.5) / np.log(1 - a_xgb_base.mean())
hl_xgb_sig = np.log(0.5) / np.log(1 - a_xgb_sig.mean())

print(f"\nImplied shock half-life (days):")
print(f"  STES:     {hl_stes_base:.1f} days → {hl_stes_sig:.1f} days")
print(f"  XGB-STES: {hl_xgb_base:.1f} days → {hl_xgb_sig:.1f} days")
print(f"\n  The signature model reacts faster to shocks (shorter half-life).")
print(f"  This systematically changes the forecast character.")
OOS SMOOTHING PARAMETER (α) — LEVEL SHIFT ANALYSIS
========================================================================
Model mean(α) std(α) median IQR
------------------------------------------------------------------------
STES (no sig) 0.1150 0.0236 0.1051 [0.1043, 0.1185]
STES + sig 0.1492 0.0074 0.1513 [0.1467, 0.1539]
XGB-STES (no sig) 0.1910 0.0136 0.1970 [0.1869, 0.1995]
XGB-STES + sig 0.2203 0.0308 0.2268 [0.2072, 0.2383]
Level shifts:
STES: mean α jumps from 0.1150 → 0.1492 (Δ = +0.0342, +29.7%)
XGB-STES: mean α jumps from 0.1910 → 0.2203 (Δ = +0.0294, +15.4%)
Correlation between baseline α and sig α:
STES: ρ = -0.2254
XGB-STES: ρ = 0.6658
Implied shock half-life (days):
STES: 5.7 days → 4.3 days
XGB-STES: 3.3 days → 2.8 days
The signature model reacts faster to shocks (shorter half-life).
This systematically changes the forecast character.
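The half-life figures above follow directly from the EWMA recursion; a minimal standalone sketch of the calculation (pure NumPy, separate from the analysis cell above):

```python
import numpy as np

def shock_half_life(alpha: float) -> float:
    """Days until a shock's weight in the EWMA recursion decays to 50%.

    In σ²_{t+1} = α·r²_t + (1-α)·σ²_t, a shock's weight after h days is
    (1-α)^h, so the half-life solves (1-α)^h = 0.5:
        h = log(0.5) / log(1 - α)
    """
    return np.log(0.5) / np.log(1.0 - alpha)

# Mean α values from the table above reproduce the reported half-lives
assert abs(shock_half_life(0.1150) - 5.7) < 0.1   # STES baseline
assert abs(shock_half_life(0.1492) - 4.3) < 0.1   # STES + sig
assert abs(shock_half_life(0.2203) - 2.8) < 0.1   # XGB-STES + sig
```

A +0.03 shift in mean α therefore shaves roughly a day and a half off the STES shock half-life.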
4 Experiment 2: Multi-Channel Signatures
The single-channel signature only sees the log-return path. Volatility is a second-moment phenomenon — it cares about magnitudes and asymmetry, not direction alone. We can build a richer path by stacking multiple return transformations as channels before computing the signature:
| Channel         | Formula                          | Captures                 |
|-----------------|----------------------------------|--------------------------|
| Log-price       | X_t = \sum_{s=1}^t r_s           | Path geometry, drift     |
| Absolute return | \lvert r_t \rvert                | Unsigned shock magnitude |
| Squared return  | r_t^2                            | Variance proxy           |
| Negative return | r_t \cdot \mathbb{1}_{r_t < 0}   | Leverage effect          |
The signature of this 4D path produces cross-channel iterated integrals at level 2 — for example, terms that capture how the price path covaries with downside shocks, directly encoding the leverage effect. Augmentations are basepoint → addtime → leadlag (no cumsum — we integrate channel 0 manually).
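To make the cross-channel terms concrete, here is a minimal pure-NumPy sketch of a level-2 signature for a piecewise-linear multi-channel path. This is a simplification of the sktime/esig pipeline used in the experiments — no augmentations, illustrative only:

```python
import numpy as np

def level2_signature(path: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Level-1 and level-2 signature terms of a path of shape (L, d).

    s1[i]    = total increment of channel i
    s2[i, j] = iterated integral ∫ (X^i_s − X^i_0) dX^j_s, computed with the
               midpoint rule (exact for piecewise-linear paths)
    """
    dX = np.diff(path, axis=0)                                 # (L-1, d) increments
    s1 = dX.sum(axis=0)
    # Channel values (rebased to 0) just before each increment
    X_rel = np.vstack([np.zeros(path.shape[1]), np.cumsum(dX, axis=0)])[:-1]
    mid = X_rel + 0.5 * dX                                     # segment midpoints
    return s1, mid.T @ dX

# 4-channel window path as in the builder: log-price, |r|, r², r·1_{r<0}
rng = np.random.default_rng(0)
r = rng.normal(0.0, 0.01, 20)
path = np.column_stack([np.cumsum(r), np.abs(r), r**2, r * (r < 0)])
s1, s2 = level2_signature(path)

# s2[0, 3] is the price-vs-downside-shock cross term (the leverage-style entry);
# the shuffle identity S^{ij} + S^{ji} = S^i·S^j holds exactly here.
assert np.allclose(s2 + s2.T, np.outer(s1, s1))
```

The off-diagonal entries are exactly the ordered cross-channel interactions the text describes; the shuffle identity also shows why the symmetric parts are redundant with products of level-1 terms.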
The clustered correlation heatmap below visualizes the redundancy structure of the resulting 80+ multi-channel signature features. As with the single-channel case, we apply the same cluster-based selection pipeline, now capped at 20 features via max_features. The before/after heatmaps confirm that selection eliminates the dense correlation blocks while retaining diverse representatives.
Multi-channel rolling signature builder
from sktime.transformations.panel.signature_based import SignatureTransformer

def build_multichannel_rolling_signatures(
    returns: pd.Series,
    lookback: int = 20,
    level: int = 2,
    augmentations: tuple[str, ...] = ("basepoint", "addtime", "leadlag"),
    sig_tfm: str = "signature",
    rescaling: str = "post",
) -> pd.DataFrame:
    """Build signature features from a 4-channel return path over rolling windows.

    Channels:
        0: cumsum(r_t)       — log-price path (rebased per window)
        1: |r_t|             — absolute return
        2: r_t^2             — squared return
        3: r_t * 1_{r_t < 0} — negative return (leverage proxy)

    Channel 0 is computed *within each window* so the path is rebased to 0.
    This removes drift/level effects that would otherwise leak non-stationarity
    into the signature features.

    Parameters
    ----------
    returns : pd.Series with DatetimeIndex
    lookback : rolling window length
    level : signature truncation depth
    augmentations : augmentation pipeline (no cumsum — handled manually)
    sig_tfm : 'signature' or 'logsignature'
    rescaling : 'post', 'pre', or None

    Returns
    -------
    pd.DataFrame with signature features, indexed by date.
    First valid date is returns.index[lookback - 1].
    """
    r = returns.values.astype(float)
    n = len(r)

    # Roll into panel: one instance per window
    n_windows = n - lookback + 1
    # sktime numpy3D convention: (N, C, L) = (instances, channels, time)
    panel = np.zeros((n_windows, 4, lookback))
    for i in range(n_windows):
        window = r[i : i + lookback]
        ch_logprice = np.cumsum(window)  # rebased log-price path
        ch_abs_r = np.abs(window)
        ch_r2 = window ** 2
        ch_r_neg = window * (window < 0).astype(float)
        panel[i] = np.vstack([ch_logprice, ch_abs_r, ch_r2, ch_r_neg])

    # Compute signatures
    sig_transformer = SignatureTransformer(
        augmentation_list=augmentations,
        depth=level,
        window_name="global",
        rescaling=rescaling,
        sig_tfm=sig_tfm,
        backend="esig",
    )
    sig_features = sig_transformer.fit_transform(panel)

    # Align to dates: window [i : i+lookback] corresponds to date index[i+lookback-1]
    valid_dates = returns.index[lookback - 1 :]
    assert len(valid_dates) == n_windows, (
        f"Date alignment mismatch: {len(valid_dates)} dates vs {n_windows} windows"
    )
    sig_features = pd.DataFrame(
        sig_features.values if hasattr(sig_features, "values") else sig_features,
        index=valid_dates,
    )
    sig_features.columns = [f"mcsig.{i}" for i in range(sig_features.shape[1])]
    return sig_features

def build_multichannel_dataset(
    baseline_X: pd.DataFrame,
    baseline_index: pd.Index,
    baseline_columns: list[str],
    returns: pd.Series,
    lookback: int = 20,
    level: int = 2,
) -> tuple[pd.DataFrame, list[str]]:
    """Build aligned dataset with baseline features + multi-channel signatures.

    Returns (X_mc, mc_sig_cols).
    """
    sig_df = build_multichannel_rolling_signatures(
        returns, lookback=lookback, level=level,
    )
    # Reindex to baseline, fill warmup with 0
    sig_df = sig_df.reindex(baseline_index).fillna(0.0)
    mc_sig_cols = list(sig_df.columns)
    X_mc = baseline_X.copy()
    X_mc[mc_sig_cols] = sig_df[mc_sig_cols].values
    return X_mc, mc_sig_cols
Cluster-based selection (multi-channel)
selected_mc_cols, mc_cluster_map = select_features_by_cluster(
    X_mc, y_base, mc_sig_cols,
    train_slice=train_sl,
    corr_threshold=0.95,
    max_features=20,
)
print(f"Multi-channel signature features: {len(mc_sig_cols)}")
print(f"Clusters found: {len(mc_cluster_map)}")
print(f"Selected features (one per cluster): {len(selected_mc_cols)}")
print()
for cid, members in sorted(mc_cluster_map.items(), key=lambda x: -len(x[1])):
    chosen = [m for m in members if m in selected_mc_cols][0]
    if len(members) > 1:
        dropped = [m for m in members if m != chosen]
        print(f"  Cluster {cid} ({len(members)} features): keep {chosen}, drop {dropped}")
    else:
        print(f"  Cluster {cid} (1 feature): keep {chosen}")
Selected features: 20
Pairs with |corr| >= 0.90: 0
5 Experiment 2: Evaluation
With 20 selected multi-channel features (capped via max_features), the correlation threshold auto-adjusted from 0.95 to a lower value to stay within budget. The hypothesis is that cross-channel terms — e.g. the iterated integral of price against downside shocks — provide genuinely new information for volatility prediction.
| variant            | rmse     | mae      | medae    |
|--------------------|----------|----------|----------|
| XGBSTES_E2E        | 0.000443 | 0.000132 | 0.000045 |
| STES_EAESE         | 0.000448 | 0.000138 | 0.000051 |
| XGBSTES_E2E_SIG_MC | 0.000458 | 0.000132 | 0.000043 |
| ES                 | 0.000464 | 0.000140 | 0.000050 |
| STES_EAESE_SIG_MC  | 0.000465 | 0.000134 | 0.000046 |
We see the same pattern as in the single-channel case. Including signature features on the multi-channel path yields no improvement. Despite encoding cross-channel interactions (leverage effect, volatility clustering), the multi-channel signatures add noise rather than signal. The baseline models remain dominant.
5.1 Diagnosing via \alpha_t Paths (Multi-Channel)
As with Experiment 1, we overlay the predicted \alpha_t with and without multi-channel signature features. The same pattern emerges: the signature-augmented paths are noisier without being more accurate. Moreover, for the XGBoost-STES model, there is a significant \alpha level shift. This could reflect a mix of score-distribution rebalancing (signatures change the mean/variance of z_t) and training path dependence in the end-to-end fit.
Both experiments show that including signatures degrades out-of-sample forecasts. Since the baseline model is nested within the augmented model (set signature coefficients to zero), any degradation must come from estimation/regularization/optimization rather than from a hard restriction of the feature set. We investigate three potential mechanisms: (i) signature features may be redundant with baseline features; (ii) the STES regularization architecture distorts baseline weights; and (iii) the XGBoost linear booster assigns excessive weight to noisy signature features.
6.1 Cross-Correlation with Baseline Features
If signature features are strongly correlated with the existing baseline features — lagged log-return, absolute log-return, and squared log-return — then adding them provides little novel information while increasing estimation noise. The heatmap below shows the training-set cross-correlation between the six selected single-channel signature features and the three baseline features. Several of the signature features exhibit correlations exceeding 30% with the baseline features.
Cross-correlation heatmap: signature vs baseline features
baseline_feat_cols = [c for c in baseline_columns if c != "const"]
train_sl_diag = slice(SPY_IS_INDEX, SPY_OS_INDEX)

# X_sig has both baseline + sig columns
all_cols = baseline_feat_cols + selected_sig_cols
cross_corr = X_sig[all_cols].iloc[train_sl_diag].corr()
cross_block = cross_corr.loc[selected_sig_cols, baseline_feat_cols]

# Short labels for display
short_base = {c: c.split(".")[1][:12] for c in baseline_feat_cols}
short_sig = {c: c.replace("sig.2.", "S_")[:14] for c in selected_sig_cols}

fig, ax = plt.subplots(figsize=(8, 5))
sns.heatmap(
    cross_block.rename(index=short_sig, columns=short_base),
    annot=True, fmt=".2f", cmap="RdBu_r", center=0,
    ax=ax, cbar_kws={"shrink": 0.8}, vmin=-1, vmax=1,
)
ax.set_title("Cross-correlation: selected signature features vs baseline features\n(training set)")
plt.tight_layout()
plt.show()

# Flag notable correlations
print("\nPairs with |corr| > 0.3:")
for sig_col in selected_sig_cols:
    for base_col in baseline_feat_cols:
        c = cross_block.loc[sig_col, base_col]
        if abs(c) > 0.3:
            print(f"  {sig_col} <-> {base_col}: {c:.3f}")

max_abs = cross_block.abs().max().max()
print(f"\nMax |cross-corr|: {max_abs:.3f}")
The STES model uses L2 regularization to prevent overfitting when extra features are added. Cross-validation selects a penalty \lambda large enough to suppress the noisy signature coefficients, but a uniform penalty simultaneously shrinks the three baseline feature weights. This is a standard bias–variance tradeoff in ridge regression: the penalty shrinks all coefficients toward zero, so a \lambda chosen to neutralize a noisy feature block can also move well-tuned baseline coefficients away from their near-optimal values.
Note
Why L2 shrinkage hits some directions harder (SVD intuition). In the linear model y = X\beta + \varepsilon, write X = U\Sigma V^\top (SVD). Ridge shrinks the OLS coefficients in the right-singular-vector basis:
\hat\beta^{\mathrm{ridge}} = V\,\mathrm{diag}\!\left(\frac{s_j^2}{s_j^2+\lambda}\right)V^\top\hat\beta^{\mathrm{OLS}}.
Directions with small singular values s_j (weakly identified directions of X) are shrunk the most. Adding many correlated signature features tends to introduce such weak directions, so the same \lambda that stabilizes the signature block can spill over and attenuate baseline directions as well.
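The shrinkage identity in the note is easy to verify numerically; a small self-contained check on synthetic correlated data (not the STES design matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 6
# Mix columns so the design has both strong and weak singular directions
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
beta = rng.normal(size=p)
y = X @ beta + 0.1 * rng.normal(size=n)
lam = 5.0

# Direct ridge solution: (XᵀX + λI)⁻¹ Xᵀy
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ridge as SVD shrinkage of the OLS coefficients:
# β_ridge = V diag(s²/(s²+λ)) Vᵀ β_ols
U, s, Vt = np.linalg.svd(X, full_matrices=False)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_svd = Vt.T @ ((s**2 / (s**2 + lam)) * (Vt @ b_ols))

assert np.allclose(b_ridge, b_svd)
# Shrinkage factors: weak directions (small s_j) are shrunk hardest
print(np.round(s**2 / (s**2 + lam), 3))
```

Printing the shrinkage factors s_j^2/(s_j^2+\lambda) makes the point directly: they are near 1 for the dominant directions and much smaller for the weakly identified ones.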
The table below compares the fitted \beta vectors with and without signature features, and reveals that the baseline weights shift by 88–103% when signatures are added.
The regularization path sweeps \lambda from 10^{-3} to 10^{2}. The characteristic U-shaped RMSE indicates that too little penalty allows signature noise to dominate, while too much penalty suppresses the baseline weights below their optima.
6.3 Recovering the Baseline with Group-Specific Regularization
Our initial STES model penalizes every feature, including the baseline parameters. If we instead penalize only the signature features while leaving the baseline weights free, the results differ. With this group-specific penalty (\lambda_{\text{base}} = 0, \lambda_{\text{sig}} varies), the augmented model converges to the exact baseline RMSE as \lambda_{\text{sig}} \to \infty — the baseline is correctly recovered as a limiting case. At intermediate \lambda_{\text{sig}}, the augmented model performs on par with the baseline but does not improve upon it.
Group-penalized STES: λ_base = 0, λ_sig varies
def fit_stes_group_l2(X_tr, y_tr, r_tr, cols, l2_sig, l2_base=0.0, seed=0):
    """Fit STES with group-specific L2 penalties."""
    X_np = X_tr.values.astype(float)
    y_np = y_tr.values.astype(float)
    r_np = r_tr.values.astype(float)
    n, p = X_np.shape

    # Build group penalty vector
    mask = np.zeros(p)
    for i, c in enumerate(cols):
        if _is_signature_col(c):
            mask[i] = l2_sig
        elif c != "const":
            mask[i] = l2_base
    scale = np.linalg.norm(y_np)
    penalty_vec = np.sqrt(mask) * scale if np.any(mask > 0) else None

    rng = np.random.default_rng(seed)
    x0 = rng.normal(0, 1, size=p)
    result = least_squares(
        STESModel._objective,
        x0=x0,
        jac=STESModel._jacobian,
        args=(r_np, X_np, y_np, 0, len(X_np), penalty_vec),
    )
    return result.x

# Sweep l2_sig with l2_base = 0
l2_sig_values = [0.0, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0]
group_results = []
for l2_sig in l2_sig_values:
    params = fit_stes_group_l2(
        X_tr_stes_sig, y_tr, r_tr, cols_stes_sig, l2_sig=l2_sig, l2_base=0.0
    )
    # Predict
    alphas = expit(X_all_sig.values @ params)
    r_all_np = r_all.values
    returns2 = r_all_np ** 2
    n = len(r_all_np)
    v = np.empty(n + 1)
    v[0] = returns2[0]
    yhat = np.empty(n)
    for t in range(n):
        yhat[t] = alphas[t] * returns2[t] + (1 - alphas[t]) * v[t]
        v[t + 1] = yhat[t]
    yhat_os = yhat[len(X_tr_stes_sig):]
    rmse_os = calc_rmse(y_te.values, yhat_os)
    sig_params = [p for c, p in zip(cols_stes_sig, params) if _is_signature_col(c)]
    base_params = [
        p for c, p in zip(cols_stes_sig, params)
        if not _is_signature_col(c) and c != "const"
    ]
    group_results.append({
        "l2_sig": l2_sig,
        "rmse_os": rmse_os,
        "norm_sig": np.sqrt(np.sum(np.array(sig_params) ** 2)),
        "norm_base": np.sqrt(np.sum(np.array(base_params) ** 2)),
    })
group_df = pd.DataFrame(group_results)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
ax = axes[0]
ax.semilogx(group_df["l2_sig"].replace(0, 1e-4), group_df["rmse_os"],
            "o-", label="OS RMSE (group L2)", color="C2")
ax.semilogx(path_df["l2"].replace(0, 1e-4), path_df["rmse_os"],
            "s--", label="OS RMSE (uniform L2)", color="C1", alpha=0.6)
ax.axhline(rmse_base, color="C0", ls="--", lw=1.5,
           label=f"Baseline OS RMSE = {rmse_base:.6f}")
ax.set_xlabel("L2 penalty on signature features (λ_sig)")
ax.set_ylabel("RMSE")
ax.set_title("Group L2 vs Uniform L2")
ax.legend(fontsize=8)
ax.grid(True, alpha=0.3)

ax = axes[1]
ax.semilogx(group_df["l2_sig"].replace(0, 1e-4), group_df["norm_sig"],
            "o-", label="‖β_sig‖", color="C3")
ax.semilogx(group_df["l2_sig"].replace(0, 1e-4), group_df["norm_base"],
            "s-", label="‖β_base‖ (unpenalized)", color="C0")
ax.set_xlabel("L2 penalty on signature features (λ_sig)")
ax.set_ylabel("Parameter L2 norm")
ax.set_title("Group L2: parameter norms")
ax.legend(fontsize=8)
ax.grid(True, alpha=0.3)

plt.suptitle("Group-penalized STES: penalize only signature features (λ_base = 0)", fontsize=13)
plt.tight_layout()
plt.show()

print(f"\n{'λ_sig':>8s}{'OS RMSE':>10s}{'‖β_sig‖':>10s}{'‖β_base‖':>10s}")
for _, row in group_df.iterrows():
    print(f"{row['l2_sig']:>8.3f}{row['rmse_os']:>10.6f}{row['norm_sig']:>10.6f}{row['norm_base']:>10.6f}")
print(f"\nBaseline OS RMSE: {rmse_base:.6f}")
print(f"Best group-L2 OS RMSE: {group_df['rmse_os'].min():.6f} "
      f"at λ_sig={group_df.loc[group_df['rmse_os'].idxmin(), 'l2_sig']:.3f}")
Under both regularization schemes, even the best regularized OOS RMSE is only on par with the baseline model, suggesting that level-2 signature features from the rolling 20-day window carry little residual value. This outcome is expected once we consider two properties of the volatility process:
Volatility exhibits long memory. Empirically, autocorrelations of |r_t| and r_t^2 decay slowly, so relevant dependence can extend across multiple horizons (weeks to months). A single 20-day rolling window is a local functional, so it can easily miss longer-horizon regime information.
Note
Can any finite “path compression” be sufficient under fractional (power-law) memory? If the conditional variance behaves like a fractionally integrated process (equivalently, it admits a representation with a power-law kernel / fractional differencing), then the optimal predictor generally depends on an unbounded history: power-law memory is not exactly Markov in any finite dimension. In that sense there is no exact finite-dimensional state that is sufficient for prediction. What practitioners do instead is approximate the long-memory kernel over the horizon that matters (e.g. months) with a finite set of summaries: HAR-style components (daily/weekly/monthly realized variance), or a small bank of EWMAs with different half-lives (a mixture-of-exponentials approximation to a power-law kernel), or truncated fractional differencing weights.
In our setup, STES already maintains an exponentially weighted summary of past r_t^2 through v_{t+1} = v_t + \alpha_t(r_t^2 - v_t); adding a 20-day signature window is therefore not obviously adding a longer-memory statistic. If anything, it injects a shorter-horizon summary alongside a model state that already aggregates the full past (with exponential forgetting).
Low-order signatures re-encode moment information. With truncation at level 2 and augmentations like lead–lag and addtime, most signature coordinates are algebraically related to first- and second-order path statistics such as cumulative return, quadratic variation, and their cross-terms. The baseline already includes r_t, |r_t|, and r_t^2, which are direct proxies for these quantities. The cross-correlation heatmap above shows that selected signature features have |\rho| up to 0.53 with baseline features. Any genuinely novel content, such as the sequencing and smoothness of returns within the window, is too weak at daily frequency to overcome the statistical cost of additional parameters.
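The second point can be checked directly: for a one-channel (log-price) window, the level-2 coordinate is pure second-moment information. A small sketch with an illustrative random window (pure NumPy, not the experiment's feature pipeline):

```python
import numpy as np

rng = np.random.default_rng(1)
r = rng.normal(0.0, 0.01, 20)
X = np.cumsum(r)               # log-price path over the window
total = X[-1]                  # cumulative return (path rebased at 0)

# Level-2 self-integral of the piecewise-linear path: ∫ (X_s − X_0) dX_s,
# computed exactly with segment midpoints
X0 = np.concatenate([[0.0], X])          # include the basepoint
mid = 0.5 * (X0[:-1] + X0[1:])           # midpoint value on each segment
s11 = np.sum(mid * np.diff(X0))

# S^(1,1) is (cumulative return)²/2 — no information beyond the level-1 term
assert np.isclose(s11, total**2 / 2)
```

So for a single channel, depth 2 adds nothing beyond the cumulative return; only the augmentations (lead–lag, addtime) and extra channels introduce new coordinates, and those are close relatives of |r_t| and r_t^2.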
6.4 XGBoost Feature Weights
For the gblinear booster, each boosting round fits an additive linear update to the prediction. After M rounds the model output is
\hat{y}(\mathbf{x}) \;=\; b + \sum_{j=1}^{p} w_j\, x_j,
where b is the bias (intercept) and \mathbf{w} = (w_1,\dots,w_p) are the feature weights accumulated across all rounds via coordinate descent, each shrunk by the L2 penalty reg_lambda. Because the input features are standardized (zero mean, unit variance) before fitting, the absolute weight |w_j| is directly comparable across features and serves as a natural measure of importance: a larger |w_j| means feature j contributes more to the smoothing-parameter prediction inside XGBoost-STES.
The bar chart below displays these weights, colour-coded by feature group (blue = baseline, red = signature).
XGBoost gblinear feature weight extraction
grid_e2e = [
    {"min_child_weight": 5.0, "learning_rate": 0.05, "max_depth": 3},
    {"min_child_weight": 20.0, "learning_rate": 0.1, "max_depth": 3},
    {"min_child_weight": 50.0, "learning_rate": 0.05, "max_depth": 2},
]
r_tr_scaled = r_tr * SCALE_FACTOR
y_tr_scaled = y_tr * (SCALE_FACTOR ** 2)

def _parse_gblinear_dump(dump_str):
    """Parse gblinear dump → (bias, {feature_index: weight})."""
    raw = dump_str.strip()
    bias_sec, weight_sec = raw.split("weight:")
    bias = float(bias_sec.replace("bias:", "").strip())
    weights = [float(v) for v in weight_sec.strip().split("\n")]
    return bias, weights

# ── 1) Baseline-only model ─────────────────────────────────────────
# Drop 'const' — gblinear has its own unpenalized bias term
cols_base_xgb = [c for c in baseline_columns if c != "const"]
X_sel_base = X_base[cols_base_xgb]
X_tr_base, X_te_base = _scale_train_test(X_sel_base, train_slice, test_slice)
over = _xgb_variant_overrides("XGBSTES_E2E", cols_base_xgb)
model_xgb_base = _make_xgb_stes_model(seed=None, **over)
model_xgb_base.fit(
    X_tr_base, y_tr_scaled,
    returns=r_tr_scaled,
    perform_cv=True, cv_grid=grid_e2e, cv_splits=5,
    start_index=0, end_index=len(X_tr_base),
)
bias_base, weights_base = _parse_gblinear_dump(model_xgb_base.model_.get_dump()[0])
assert len(weights_base) == len(cols_base_xgb)
w_base = dict(zip(cols_base_xgb, weights_base))

# ── 2) Baseline + signature model ─────────────────────────────────
cols_xgb_sig = [c for c in baseline_columns + selected_sig_cols if c != "const"]
X_sel_xgb_sig = X_sig[cols_xgb_sig]
X_tr_xgb_sig, X_te_xgb_sig = _scale_train_test(X_sel_xgb_sig, train_slice, test_slice)
over_sig = _xgb_variant_overrides("XGBSTES_E2E", cols_xgb_sig)
model_xgb_sig = _make_xgb_stes_model(seed=None, **over_sig)
model_xgb_sig.fit(
    X_tr_xgb_sig, y_tr_scaled,
    returns=r_tr_scaled,
    perform_cv=True, cv_grid=grid_e2e, cv_splits=5,
    start_index=0, end_index=len(X_tr_xgb_sig),
)
bias_sig, weights_sig = _parse_gblinear_dump(model_xgb_sig.model_.get_dump()[0])
assert len(weights_sig) == len(cols_xgb_sig)
w_sig = dict(zip(cols_xgb_sig, weights_sig))

# ── 3) Comparison table ───────────────────────────────────────────
print(f"{'Feature':>42s}{'Baseline':>10s}{'+ Sig':>10s}{'Δw':>10s}{'Δ%':>8s}")
print("-" * 86)
for c in cols_base_xgb:
    wb = w_base[c]
    ws = w_sig[c]
    dw = ws - wb
    dp = 100.0 * dw / wb if abs(wb) > 1e-12 else float("nan")
    print(f"  {c:>40s}{wb:>10.6f}{ws:>10.6f}{dw:>+10.6f}{dp:>+7.1f}%")
print(f"  {'bias':>40s}{bias_base:>10.6f}{bias_sig:>10.6f}{bias_sig - bias_base:>+10.6f}")
print()
for c in cols_xgb_sig:
    if _is_signature_col(c):
        print(f"  {c:>40s}{'—':>10s}{w_sig[c]:>10.6f}  ← SIG")

# ── 4) Side-by-side bar chart ─────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=False)

# Left panel: baseline-only weights
ax = axes[0]
pairs_b = sorted(w_base.items(), key=lambda x: abs(x[1]))
names_b = [p[0].split(".")[-1][:12] for p in pairs_b]
vals_b = [p[1] for p in pairs_b]
ax.barh(range(len(names_b)), vals_b, color="#1f77b4")
ax.set_yticks(range(len(names_b)))
ax.set_yticklabels(names_b, fontsize=9)
ax.set_xlabel("gblinear weight")
ax.set_title("Baseline only")
ax.axvline(0, color="k", lw=0.5)

# Right panel: baseline + signature weights
ax = axes[1]
pairs_s = sorted(w_sig.items(), key=lambda x: abs(x[1]))
names_s = [p[0].split(".")[-1][:12] for p in pairs_s]
vals_s = [p[1] for p in pairs_s]
colors = ["#d62728" if _is_signature_col(p[0]) else "#1f77b4" for p in pairs_s]
ax.barh(range(len(names_s)), vals_s, color=colors)
ax.set_yticks(range(len(names_s)))
ax.set_yticklabels(names_s, fontsize=9)
ax.set_xlabel("gblinear weight")
ax.set_title("Baseline + Signature")
ax.axvline(0, color="k", lw=0.5)

fig.suptitle("XGBoost gblinear feature weights (blue = baseline, red = signature)", fontsize=12)
plt.tight_layout()
plt.show()

print(f"\nreg_lambda: {XGB_PARAMS['reg_lambda']}")
Feature Baseline + Sig Δw Δ%
--------------------------------------------------------------------------------------
lag.logret.fc5d3612d9c7 -0.084265 -0.121535 -0.037270 +44.2%
lag.abslogret.1ad490bcb584 -0.060445 -0.038866 +0.021579 -35.7%
lag.sqlogret.c2cf3aa70c10 -0.025341 -0.027290 -0.001949 +7.7%
bias -1.951480 -1.808500 +0.142980
sig.2.59a44f852829 — -0.023961 ← SIG
sig.2.7b3cf4c020ee — -0.050910 ← SIG
sig.2.af47c4edd59d — 0.026128 ← SIG
sig.2.49bf7214fc3f — 0.060890 ← SIG
sig.2.1ec5eb462c34 — 0.055686 ← SIG
sig.2.f74613e2b353 — 0.109561 ← SIG
reg_lambda: 1.0
6.5 Diagnosing the Smoothing-Parameter Shift
When signature features are added, the model’s typical smoothing parameter shifts upward. The goal here is to show where that shift can come from in XGBoost-STES.
For the gblinear booster, the model produces a linear score
z_t = b + \sum_{j=1}^{p} w_j\,x_{t,j},
and turns this into a time-varying smoothing parameter via a sigmoid gate
\alpha_t = \sigma(z_t) = \frac{1}{1+e^{-z_t}},
which then enters the variance recursion
v_{t+1} = v_t + \alpha_t\,(r_t^2 - v_t).
So we are fitting through a nonlinear gate and a recursive filter, not doing an ordinary one-step regression on r_{t+1}^2.
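A minimal sketch of that gate-plus-filter structure, mirroring the equations above (an illustration, not the repo's STESModel implementation):

```python
import numpy as np

def stes_forecast(z: np.ndarray, r: np.ndarray, v0: float) -> np.ndarray:
    """One-step variance forecasts from linear scores via the sigmoid gate.

    alpha_t = σ(z_t);  v_{t+1} = v_t + alpha_t · (r_t² − v_t).
    Returns yhat with yhat[t] = v_{t+1}, the forecast formed at time t.
    """
    alpha = 1.0 / (1.0 + np.exp(-z))        # sigmoid gate in (0, 1)
    v = v0
    yhat = np.empty(len(r))
    for t in range(len(r)):
        # Equivalent form: alpha·r² + (1−alpha)·v
        yhat[t] = alpha[t] * r[t] ** 2 + (1.0 - alpha[t]) * v
        v = yhat[t]                          # recursive filter state
    return yhat

# With z ≡ 0 (alpha = 0.5 everywhere) a unit shock halves each day
out = stes_forecast(np.zeros(3), np.array([1.0, 0.0, 0.0]), v0=0.0)
assert np.allclose(out, [0.5, 0.25, 0.125])
```

Note that gradients of the training loss flow through both the sigmoid and the recursion, which is what makes the end-to-end objective non-convex.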
Because all features are centered by StandardScaler on the training set, \mathbb{E}[x_{t,j}]\approx 0 and
\mathbb{E}[z_t] \approx b.
That makes b a useful proxy for the typical gate level. If b becomes less negative, the model’s baseline tendency is to react faster (higher typical \alpha_t), i.e., a shorter effective half-life in the EWMA recursion.
The key subtlety is that the average gate level is
\bar\alpha = \mathbb{E}[\sigma(z_t)],
and because \sigma is nonlinear, \bar\alpha depends on more than just \mathbb{E}[z_t]. In particular, changes in the spread of z_t matter. A simple second-order expansion makes this explicit:
\mathbb{E}[\sigma(Z)] \approx \sigma(\mu) + \tfrac{1}{2}\,\sigma''(\mu)\,\mathrm{Var}(Z).
Around b\approx-2, the curvature of \sigma is not negligible, so increasing \mathrm{Var}(z_t) can move \bar\alpha even if the mean stayed fixed.
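A quick Monte Carlo check of the expansion at the score levels seen here (μ ≈ −2; the spread values are illustrative, plain NumPy):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

mu = -2.0            # roughly the fitted bias level b
s = 0.25             # assumed std of the linear score z_t

# σ''(x) = σ(x)(1−σ(x))(1−2σ(x)) — positive for x < 0
sig = sigmoid(mu)
second_order = sig + 0.5 * sig * (1 - sig) * (1 - 2 * sig) * s**2

rng = np.random.default_rng(0)
z = rng.normal(mu, s, 1_000_000)
mc = sigmoid(z).mean()

# The expansion tracks the Monte Carlo mean closely at this curvature,
# and E[σ(Z)] > σ(μ): widening z alone raises the mean gate level
assert abs(mc - second_order) < 1e-3
assert mc > sig
```

Because σ'' > 0 left of zero, increasing \mathrm{Var}(z_t) at fixed mean pushes \bar\alpha upward, which is the direction of the level shift observed in the fits.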
Note
Intuition.b sets the “centre” of the gate. The weights w_j (together with feature variability) determine how much the model can make z_t move over time, i.e., how wide the range of \alpha_t is. When signatures are added, the model has more directions to create variation in z_t; once \mathrm{Var}(z_t) changes, it is natural to also see b move as the optimizer rebalances the overall gate level.
In our fit, adding signatures shifts the bias from b=-1.95 to b=-1.81, and the unconditional level \sigma(b) rises from 12.4% to 14.1%. This is the concrete symptom of the rebalancing described above.
Finally, note that although the baseline is nested in the augmented feature set (signature weights could be zero), adding signatures can still degrade out-of-sample RMSE because the end-to-end objective is non-convex and the training procedure can land in a different solution when the feature space changes.
The next code cells quantify how the distribution of z_t changes with signatures (and which features drive it), and then test whether warm-starting/two-stage fits can recover baseline-like behaviour.
6.5.1 Signatures change z_t and \alpha_t
Mechanically, the signature features can move \alpha_t in two ways. They can shift the overall level of the linear score z_t = b + \sum_j w_j x_{t j}, and they can change the dispersion of z_t over time.
Since the features are standardized so that \mathbb{E}[x_{t,j}] = 0 on the training set, and \mathbb{E}[z_t] \approx b, we might expect the intercept b — and with it the overall level of z_t and \alpha_t — to remain unchanged. That expectation is correct in a standard linear regression. However, STES estimates \alpha_t through a logistic link, and for a logistic model the intercept is not determined by \bar z alone, so adding predictors can change it even when the new predictors are centered.
Relatedly, the dispersion matters because \alpha_t = \sigma(z_t) is a nonlinear transformation of z_t. Once signatures change the distribution of z_t (especially \mathrm{Var}(z_t)), the best global “shift” of that distribution can change too, and the model expresses that through b.
We report below \mathrm{mean}(z) and \mathrm{std}(z) (and their implied \mathrm{mean}(\alpha) and \mathrm{std}(\alpha)), along with the IS vs OOS RMSE comparison.
Bias investigation: linear score, alpha level, and optimality
from scipy.special import expit as _expit

# ── 4) Mean linear score and resulting alpha ──────────────────────────────
X_tr_base_np = X_tr_base.to_numpy()
X_tr_sig_np = X_tr_xgb_sig.to_numpy()
z_base = bias_base + X_tr_base_np @ np.array(weights_base)
z_sig = bias_sig + X_tr_sig_np @ np.array(weights_sig)
alpha_base = _expit(z_base)
alpha_sig = _expit(z_sig)

print("4. MEAN LINEAR SCORE z AND RESULTING α = σ(z)")
print(f"   {'':>20s}{'mean(z)':>12s}{'std(z)':>12s}{'mean(α)':>12s}{'std(α)':>12s}")
print("   " + "-" * 64)
print(f"   {'Baseline':>20s}{z_base.mean():>12.6f}{z_base.std():>12.6f}{alpha_base.mean():>12.6f}{alpha_base.std():>12.6f}")
print(f"   {'+ Signature':>20s}{z_sig.mean():>12.6f}{z_sig.std():>12.6f}{alpha_sig.mean():>12.6f}{alpha_sig.std():>12.6f}")
print(f"   {'Δ':>20s}{z_sig.mean() - z_base.mean():>+12.6f}{z_sig.std() - z_base.std():>+12.6f}"
      f"{alpha_sig.mean() - alpha_base.mean():>+12.6f}{alpha_sig.std() - alpha_base.std():>+12.6f}")

# ── 5) Decompose mean(z) into bias + feature contributions ───────────────
feat_contrib_base = X_tr_base_np @ np.array(weights_base)
feat_contrib_sig = X_tr_sig_np @ np.array(weights_sig)

# Per-feature contribution to mean(z): w_j · mean(x_j)
print(f"\n5. DECOMPOSITION OF mean(z) = bias + Σ w_j·mean(x_j)")
print(f"   {'':>20s}{'bias':>12s}{'Σw·x̄':>12s}{'mean(z)':>12s}")
print("   " + "-" * 52)
print(f"   {'Baseline':>20s}{bias_base:>12.6f}{feat_contrib_base.mean():>12.6f}{z_base.mean():>12.6f}")
print(f"   {'+ Signature':>20s}{bias_sig:>12.6f}{feat_contrib_sig.mean():>12.6f}{z_sig.mean():>12.6f}")
print(f"\n   Since training features are centered, Σw·x̄ ≈ 0 and mean(z) ≈ bias.")
print(f"   The bias determines the long-run alpha level: σ({bias_base:.3f}) = {_expit(bias_base):.4f}")
print(f"   vs σ({bias_sig:.3f}) = {_expit(bias_sig):.4f}")
print(f"   Δα (from bias alone) = {_expit(bias_sig) - _expit(bias_base):+.4f}")

# ── 6) In-sample / out-of-sample RMSE comparison ─────────────────────────
r_tr_np = r_tr_scaled.to_numpy().reshape(-1)
y_tr_np = y_tr_scaled.to_numpy().reshape(-1)
yhat_base_all = model_xgb_base.predict(
    pd.concat([X_tr_base, X_te_base]),
    returns=pd.concat([r_tr_scaled, r_te * SCALE_FACTOR]),
)
yhat_sig_all = model_xgb_sig.predict(
    pd.concat([X_tr_xgb_sig, X_te_xgb_sig]),
    returns=pd.concat([r_tr_scaled, r_te * SCALE_FACTOR]),
)
yhat_base_is = np.asarray(yhat_base_all)[:len(X_tr_base)]
yhat_sig_is = np.asarray(yhat_sig_all)[:len(X_tr_xgb_sig)]
yhat_base_os = np.asarray(yhat_base_all)[len(X_tr_base):]
yhat_sig_os = np.asarray(yhat_sig_all)[len(X_tr_xgb_sig):]
y_te_np = y_te.to_numpy().reshape(-1) * (SCALE_FACTOR ** 2)

rmse_is_base = np.sqrt(np.mean((y_tr_np - yhat_base_is) ** 2))
rmse_is_sig = np.sqrt(np.mean((y_tr_np - yhat_sig_is) ** 2))
rmse_os_base = np.sqrt(np.mean((y_te_np - yhat_base_os) ** 2))
rmse_os_sig = np.sqrt(np.mean((y_te_np - yhat_sig_os) ** 2))

print(f"\n6. IN-SAMPLE vs OUT-OF-SAMPLE RMSE (scaled units)")
print(f"   {'':>20s}{'IS RMSE':>12s}{'OOS RMSE':>12s}{'OOS/IS':>8s}")
print("   " + "-" * 56)
print(f"   {'Baseline':>20s}{rmse_is_base:>12.6f}{rmse_os_base:>12.6f}{rmse_os_base/rmse_is_base:>8.4f}")
print(f"   {'+ Signature':>20s}{rmse_is_sig:>12.6f}{rmse_os_sig:>12.6f}{rmse_os_sig/rmse_is_sig:>8.4f}")
print(f"   {'Δ (sig-base)':>20s}{rmse_is_sig - rmse_is_base:>+12.6f}{rmse_os_sig - rmse_os_base:>+12.6f}")

if rmse_is_sig > rmse_is_base:
    print(f"\n   ⚠ Signature model has HIGHER in-sample loss → possible convergence issue")
else:
    print(f"\n   ✓ Signature model has lower IS loss (as expected with more features)")
    print(f"     But higher OOS loss → overfitting, not a convergence problem")
4. MEAN LINEAR SCORE z AND RESULTING α = σ(z)
mean(z) std(z) mean(α) std(α)
----------------------------------------------------------------
Baseline -1.951480 0.115524 0.124874 0.010263
+ Signature -1.808500 0.252278 0.143133 0.023206
Δ +0.142980 +0.136753 +0.018259 +0.012943
5. DECOMPOSITION OF mean(z) = bias + Σ w_j·mean(x_j)
bias Σw·x̄ mean(z)
----------------------------------------------------
Baseline -1.951480 -0.000000 -1.951480
+ Signature -1.808500 0.000000 -1.808500
Since training features are centered, Σw·x̄ ≈ 0 and mean(z) ≈ bias.
The bias determines the long-run alpha level: σ(-1.951) = 0.1244
vs σ(-1.808) = 0.1408
Δα (from bias alone) = +0.0164
6. IN-SAMPLE vs OUT-OF-SAMPLE RMSE (scaled units)
IS RMSE OOS RMSE OOS/IS
--------------------------------------------------------
Baseline 5.017151 4.426015 0.8822
+ Signature 4.999607 4.569425 0.9140
Δ (sig-base) -0.017544 +0.143410
✓ Signature model has lower IS loss (as expected with more features)
But higher OOS loss → overfitting, not a convergence problem
We see that signatures improve in-sample fit (IS RMSE \downarrow 0.018) but hurt out-of-sample performance (OOS RMSE \uparrow 0.143). With a larger feature set, a lower training error is not surprising; the question is what the extra features are doing to the smoothing dynamics \alpha_t that ultimately generate v_t.
6.5.2 \alpha variation is higher
A convenient way to see this is to look at the scale of each feature’s contribution to the score variation, |w_j|\,\mathrm{std}(x_j). In the augmented model, a large fraction of that total comes from signature coordinates. As \mathrm{std}(z_t) increases substantially, the sigmoid maps that wider range of z_t into a wider range of \alpha_t. Whether that additional variation reflects genuine regime adaptation or just extra flexibility that does not generalize is exactly what the rest of the section is trying to pin down.
Bias investigation: feature contributions to α dynamics
```python
from scipy.special import expit as _expit

X_tr_base_np = X_tr_base.to_numpy()
X_tr_sig_np = X_tr_xgb_sig.to_numpy()
z_base = bias_base + X_tr_base_np @ np.array(weights_base)
z_sig = bias_sig + X_tr_sig_np @ np.array(weights_sig)

# ── 8) COUNTERFACTUAL: sig model weights but baseline bias ───────────────
print("8. COUNTERFACTUAL: SIG MODEL WITH BASELINE BIAS")
z_sig_cf_tr = bias_base + X_tr_sig_np @ np.array(weights_sig)
z_sig_cf_te = bias_base + X_te_xgb_sig.to_numpy() @ np.array(weights_sig)
z_sig_te = bias_sig + X_te_xgb_sig.to_numpy() @ np.array(weights_sig)
print(f" α with actual sig bias ({bias_sig:.3f}): train={_expit(z_sig).mean():.4f} test={_expit(z_sig_te).mean():.4f}")
print(f" α with baseline bias ({bias_base:.3f}): train={_expit(z_sig_cf_tr).mean():.4f} test={_expit(z_sig_cf_te).mean():.4f}")
print(f" Δα from bias shift: train={_expit(z_sig).mean() - _expit(z_sig_cf_tr).mean():+.4f} test={_expit(z_sig_te).mean() - _expit(z_sig_cf_te).mean():+.4f}")

# ── 9) Feature contribution to z dynamics ────────────────────────────────
print(f"\n9. |w·std(x)| — FEATURE CONTRIBUTION TO α DYNAMICS")

# Short labels for compact table
def _short_label(c):
    if _is_signature_col(c):
        return c.split(".")[-1][:10]
    return c.split(".")[-2][:10] if "." in c else c[:10]

# Augmented model contributions (sorted by |w·std|)
scores_sig = [
    (c, w_sig[c], X_tr_xgb_sig[c].std(), abs(w_sig[c] * X_tr_xgb_sig[c].std()))
    for c in cols_xgb_sig
]
total_sig = sum(s[3] for s in scores_sig)
print(f"\n{'Feature':>14s}{'w':>9s}{'|w·std|':>9s}{'%':>6s}{'type':>5s}")
print(" " + "-" * 50)
for c, w, s, ws in sorted(scores_sig, key=lambda x: -x[3]):
    tag = "SIG" if _is_signature_col(c) else "base"
    print(f" {_short_label(c):>14s}{w:>+9.4f}{ws:>9.4f}{100*ws/total_sig:>5.1f}% {tag:>5s}")
sig_pct = sum(s[3] for s in scores_sig if _is_signature_col(s[0])) / total_sig * 100
base_pct = 100 - sig_pct
print(f"\n Baseline features: {base_pct:.1f}% of z dynamics")
print(f" Signature features: {sig_pct:.1f}% of z dynamics")
print(f" std(z): baseline-only = {z_base.std():.4f}, +sig = {z_sig.std():.4f} ({z_sig.std()/z_base.std():.2f}x)")
print(f"\n INTERPRETATION:")
print(f" The bias is NOT suboptimal — it is correctly compensating for the")
print(f" increased z variance that signatures introduce. The sig features")
print(f" contribute {sig_pct:.0f}% of α dynamics, increasing std(z) by {z_sig.std()/z_base.std():.2f}x.")
print(f" The bias shifts up by {bias_sig - bias_base:+.3f} to keep the mean α at a")
print(f" sensible level ({_expit(bias_base):.2%} → {_expit(bias_sig):.2%}), because the wider")
print(f" z distribution pushes more mass into the sigmoid tails.")
```
```
8. COUNTERFACTUAL: SIG MODEL WITH BASELINE BIAS
   α with actual sig bias (-1.808): train=0.1431 test=0.1466
   α with baseline bias   (-1.951): train=0.1265 test=0.1297
   Δα from bias shift:              train=+0.0166 test=+0.0169

9. |w·std(x)| — FEATURE CONTRIBUTION TO α DYNAMICS

          Feature         w   |w·std|      %  type
   --------------------------------------------------
           logret   -0.1215    0.1216  23.6%  base
       f74613e2b3   +0.1096    0.1096  21.3%   SIG
       49bf7214fc   +0.0609    0.0609  11.8%   SIG
       1ec5eb462c   +0.0557    0.0557  10.8%   SIG
       7b3cf4c020   -0.0509    0.0509   9.9%   SIG
        abslogret   -0.0389    0.0389   7.5%  base
         sqlogret   -0.0273    0.0273   5.3%  base
       af47c4edd5   +0.0261    0.0261   5.1%   SIG
       59a44f8528   -0.0240    0.0240   4.7%   SIG

   Baseline features:  36.5% of z dynamics
   Signature features: 63.5% of z dynamics
   std(z): baseline-only = 0.1155, +sig = 0.2523 (2.18x)

   INTERPRETATION:
   The bias is NOT suboptimal — it is correctly compensating for the
   increased z variance that signatures introduce. The sig features
   contribute 64% of α dynamics, increasing std(z) by 2.18x.
   The bias shifts up by +0.143 to keep the mean α at a
   sensible level (12.44% → 14.08%), because the wider
   z distribution pushes more mass into the sigmoid tails.
```
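The last point, that a wider z distribution moves the mean of \sigma(z) even when the mean of z is fixed, is just convexity: the sigmoid is convex for z < 0, so spreading z around a negative mean pushes \mathrm{mean}(\alpha) up. A quick Monte Carlo check (standalone sketch; the spreads are made-up values near the fitted ones):

```python
import numpy as np
from scipy.special import expit  # the sigmoid σ(z)

rng = np.random.default_rng(42)
mu = -1.95  # roughly the baseline bias from the tables above
for s in (0.12, 0.25, 0.50):  # std(z): baseline-like, signature-like, wider still
    z = rng.normal(mu, s, size=1_000_000)
    print(f"std(z)={s:.2f}  sigma(mu)={expit(mu):.4f}  mean(sigma(z))={expit(z).mean():.4f}")
```

Holding the bias fixed while adding score variance therefore already raises the long-run \alpha; the fitted bias then moves to set the mean level back where the loss wants it.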
6.5.3 Model nesting does not protect out-of-sample performance
The nonlinearity of the sigmoid helps explain why the bias changes, but it still leaves a more basic question: how can the augmented model fit better in-sample and still do worse out-of-sample, even though the baseline is nested (we could set all signature weights to zero and recover the baseline form)?
One reason is that the end-to-end STES loss is not convex. The objective is
L = \sum_t (y_t - v_t)^2, \qquad v_{t+1} = v_t + \sigma(z_t)(r_t^2 - v_t),
and the composition of a sigmoid gate with a recursive filter can produce multiple locally optimal solutions. Changing the feature set and re-running CV can change where the optimizer lands.
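To make the objective concrete, here is a minimal NumPy sketch of the recursion above (illustrative only; the notebook's actual evaluation goes through `_stes_e2e_loss` and the library code):

```python
import numpy as np

def stes_e2e_loss_sketch(bias, w, X, r, y, v0):
    """L = sum_t (y_t - v_t)^2 with v_{t+1} = v_t + sigma(z_t) * (r_t^2 - v_t)."""
    z = bias + X @ w                       # linear score z_t
    alpha = 1.0 / (1.0 + np.exp(-z))       # sigmoid gate sigma(z_t)
    v, total = float(v0), 0.0
    for t in range(len(y)):
        v = v + alpha[t] * (r[t] ** 2 - v)  # recursive ES filter
        total += (y[t] - v) ** 2
    return total
```

With all feature weights at zero this collapses to plain ES with constant \alpha = \sigma(\text{bias}), which is the precise sense in which the baseline is nested in the augmented model.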
The next section addresses this instability with warm-start and two-stage fits. Both techniques anchor the augmented training near the baseline solution, which makes the incremental value of the signatures easier to isolate.
6.6 Warm-Start and Two-Stage Fitting
If adding new features leads to significant changes in the objective function landscape, one can keep the optimizer anchored near the baseline solution and only let the new features make incremental changes. We use two simple strategies that are common in transfer-learning style setups:
Warm-start. Set XGBoost’s base_margin to the baseline model’s fitted score z_t^{(0)} = \hat b_{\text{base}} + \hat w_{\text{base}}^\top x_t^{\text{base}}. With base_score = 0, the booster starts from z_t^{(0)} and learns only a correction \Delta z_t, so the total score is z_t = z_t^{(0)} + \Delta z_t. This is a practical way to bias training toward the baseline basin.
Two-stage (adapter). Freeze the baseline entirely via base_margin and train only on the signature features. In other words, the baseline model becomes a fixed “backbone” and the signatures can only contribute additively; they cannot distort the baseline weights.
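Stripped of the recursive loss, the two-stage structure is just an additive correction on top of a frozen score. A toy NumPy sketch (hypothetical names; ordinary least squares stands in for the booster):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_base, p_sig = 500, 3, 6
X_base = rng.normal(size=(n, p_base))
X_sig = rng.normal(size=(n, p_sig))
z_target = X_base @ np.array([0.5, -0.3, 0.2])  # toy target score

# Stage 1: "backbone" score from the baseline model (here: OLS on base features)
w_base, *_ = np.linalg.lstsq(X_base, z_target, rcond=None)
z_backbone = X_base @ w_base

# Stage 2: freeze the backbone; the adapter sees only the residual and sig features
residual = z_target - z_backbone
w_adapter, *_ = np.linalg.lstsq(X_sig, residual, rcond=None)

# Total score is additive: backbone + adapter. Baseline weights are untouched.
z_total = z_backbone + X_sig @ w_adapter
```

Because the backbone is frozen, the adapter cannot distort the baseline weights; if the extra features carry no incremental signal, its weights stay near zero, which is essentially what the two-stage XGBoost fit finds in this experiment.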
It helps to separate two different issues that show up in this notebook.
STES (SciPy): the main OOS degradation we see under the naive “add signatures” setup comes from regularization structure (a uniform \ell_2 penalty shrinking baseline weights while trying to suppress signature coefficients), and from the fact that the signatures do not appear to carry incremental signal at this frequency. Warm-starting or multi-starting does not change that, because they are still solving the same regularized objective; if the optimum is degraded by the penalty geometry, changing initialization won’t fix it. That is why the group-penalized STES experiment (penalize signatures without simultaneously crushing baseline weights) is the relevant remedy on the STES side.
XGBoost-STES (gblinear): here warm-start / two-stage are useful because training is more path-dependent. The end-to-end objective is non-convex, and gblinear coordinate descent + CV can land in different solutions when the feature set changes. Anchoring the fit via base_margin asks a cleaner question: do signatures add value once we constrain ourselves to stay close to the baseline behavior?
Warm-start and residual model experiments
```python
import xgboost as xgb
from scipy.special import expit as _expit
from scipy.optimize import least_squares as _lsq

# ═══════════════════════════════════════════════════════════════════════
# SETUP: data already in kernel from cell 63
# ═══════════════════════════════════════════════════════════════════════
X_aug_np = X_tr_xgb_sig.to_numpy()
X_base_np_diag = X_tr_base.to_numpy()
r2_tr = (r_tr_scaled.to_numpy() ** 2).reshape(-1)
y_tr_np_sc = y_tr_scaled.to_numpy().reshape(-1)
init_var_val = float(np.mean(r2_tr[:INIT_WINDOW]))
n_train = len(y_tr_np_sc)

# OOS data
X_aug_te_np = X_te_xgb_sig.to_numpy()
X_base_te_np = X_te_base.to_numpy()
r_te_scaled = r_te * SCALE_FACTOR
r2_te = (r_te_scaled.to_numpy() ** 2).reshape(-1)
y_te_np_sc = y_te.to_numpy().reshape(-1) * (SCALE_FACTOR ** 2)

# Baseline z-scores on the augmented feature set (embedded: w_sig=0)
# NOTE: must include base_score=0.5 since get_dump() doesn't include it
w_embedded_arr = np.array([w_base.get(c, 0.0) for c in cols_xgb_sig])
z_baseline_tr = 0.5 + bias_base + X_aug_np @ w_embedded_arr

# ═══════════════════════════════════════════════════════════════════════
# E2E OBJECTIVE (standalone, matching library's _make_e2e_objective)
# ═══════════════════════════════════════════════════════════════════════
def make_e2e_obj(returns2, y, init_var):
    """Create (obj_fn, eval_fn) for xgb.train."""
    r2 = np.asarray(returns2, dtype=float)
    y_ = np.asarray(y, dtype=float)
    n = len(y_)
    v0 = float(init_var)

    def obj_fn(preds, dtrain):
        z = preds.astype(float)
        a = _expit(z)
        v = np.empty(n + 1, dtype=float)
        v[0] = v0
        for t in range(n):
            v[t + 1] = v[t] + a[t] * (r2[t] - v[t])
        yhat = v[1:]
        e = 2.0 * (yhat - y_) / n   # MSE derivative
        w = np.full(n, 2.0 / n)     # MSE weight
        g = np.zeros(n + 1); S = np.zeros(n + 1)
        grad = np.zeros(n); hess = np.zeros(n)
        for t in range(n - 1, -1, -1):
            g[t + 1] += e[t]; S[t + 1] += w[t]
            aa = float(a[t]); ap = aa * (1.0 - aa)
            dv = (float(r2[t]) - v[t]) * ap
            grad[t] = g[t + 1] * dv
            h = S[t + 1] * dv * dv
            hess[t] = h if np.isfinite(h) and h > 1e-12 else 1e-12
            g[t] += (1.0 - aa) * g[t + 1]
            S[t] += (1.0 - aa) ** 2 * S[t + 1]
        return grad, hess

    def eval_fn(preds, dtrain):
        z = preds.astype(float); a = _expit(z)
        v = v0
        total = 0.0
        for t in range(n):
            v = v + float(a[t]) * (float(r2[t]) - v)
            total += (float(y_[t]) - v) ** 2
        return "mse_e2e", total / n

    return obj_fn, eval_fn

# ═══════════════════════════════════════════════════════════════════════
# OOS EVALUATION HELPER
# ═══════════════════════════════════════════════════════════════════════
def _oos_rmse(bias, weights, X_tr, X_te, y_te, r2_tr, r2_te, init_var, base_score=0.5):
    """Warm-start OOS RMSE: run ES filter through train, evaluate on test."""
    z_tr = base_score + bias + X_tr @ np.array(weights)
    a_tr = _expit(z_tr)
    v = float(init_var)
    for t in range(len(z_tr)):
        v = v + float(a_tr[t]) * (float(r2_tr[t]) - v)
    # Now v is warm-started. Run on test.
    z_te = base_score + bias + X_te @ np.array(weights)
    a_te = _expit(z_te)
    total = 0.0
    for t in range(len(z_te)):
        v = v + float(a_te[t]) * (float(r2_te[t]) - v)
        total += (float(y_te[t]) - v) ** 2
    return np.sqrt(total / len(y_te))

# ═══════════════════════════════════════════════════════════════════════
# TRAIN/VALID SPLIT + CV ON AUGMENTED FEATURES
# ═══════════════════════════════════════════════════════════════════════
valid_n = max(1, int(np.floor(0.10 * n_train)))
split = max(1, n_train - valid_n)
X_e2e_tr, X_e2e_va = X_aug_np[:split], X_aug_np[split:]
y_e2e_tr, y_e2e_va = y_tr_np_sc[:split], y_tr_np_sc[split:]
r2_e2e_tr, r2_e2e_va = r2_tr[:split], r2_tr[split:]
z_bl_tr, z_bl_va = z_baseline_tr[:split], z_baseline_tr[split:]

print("\n--- CV: augmented feature set ---")
over_cv = _xgb_variant_overrides("XGBSTES_E2E", cols_xgb_sig)
grid_cv = [{"reg_lambda": v} for v in [0.1, 1.0, 10.0, 100.0]]
cv_splits = 5
m_cv = _make_xgb_stes_model(seed=None, xgb_params=dict(XGB_PARAMS), **over_cv)
m_cv.fit(
    X_tr_xgb_sig, y_tr_scaled,
    returns=r_tr_scaled,
    perform_cv=True,
    cv_grid=grid_cv,
    cv_splits=cv_splits,
    start_index=0,
    end_index=len(X_tr_xgb_sig),
)
params_cv = dict(m_cv.fit_result_.params_used)
print(f" Selected params: reg_lambda={params_cv.get('reg_lambda', 'n/a')}")

# Warm-start for validation init_var
lookback = 20
v0_val = float(np.mean(r2_e2e_tr[-lookback:])) if len(r2_e2e_tr) >= lookback else init_var_val
obj_tr, _ = make_e2e_obj(r2_e2e_tr, y_e2e_tr, init_var_val)
_, feval_va = make_e2e_obj(r2_e2e_va, y_e2e_va, v0_val)

# ═══════════════════════════════════════════════════════════════════════
# METHOD 0: COLD-START (CV-TUNED PARAMS)
# ═══════════════════════════════════════════════════════════════════════
print("=" * 80)
print("XGBoost-STES EXPERIMENTS")
print("=" * 80)
print("\n--- Method 0: Cold-start (CV-tuned params) ---")
params_cold = dict(params_cv)
params_cold.setdefault("base_score", 0.5)
dtrain_cold = xgb.DMatrix(X_e2e_tr, label=y_e2e_tr, feature_names=list(cols_xgb_sig))
dvalid_cold = xgb.DMatrix(X_e2e_va, label=y_e2e_va, feature_names=list(cols_xgb_sig))
booster_cold = xgb.train(
    params=params_cold,
    dtrain=dtrain_cold,
    num_boost_round=XGB_ROUNDS,
    obj=obj_tr,
    custom_metric=feval_va,
    evals=[(dtrain_cold, "train"), (dvalid_cold, "valid")],
    early_stopping_rounds=10,
    verbose_eval=False,
)
b_cold, w_cold = _parse_gblinear_dump(booster_cold.get_dump()[0])
bias_sig_cv = b_cold
weights_sig_cv = w_cold
loss_cold = _stes_e2e_loss(bias_sig_cv, weights_sig_cv, X_aug_np, y_tr_np_sc, r2_tr, init_var_val)
rmse_cold = _oos_rmse(bias_sig_cv, weights_sig_cv, X_aug_np, X_aug_te_np, y_te_np_sc, r2_tr, r2_te, init_var_val)

# Baseline reference
loss_base = _stes_e2e_loss(bias_base, weights_base, X_base_np_diag, y_tr_np_sc, r2_tr, init_var_val)
rmse_base = _oos_rmse(bias_base, weights_base, X_base_np_diag, X_base_te_np, y_te_np_sc, r2_tr, r2_te, init_var_val)

# ═══════════════════════════════════════════════════════════════════════
# METHOD 1: WARM-START VIA base_margin
# Initialize augmented model from baseline z-scores.
# XGBoost sees preds = base_margin + booster_output.
# Booster starts at 0 → preds start at z_baseline.
# ═══════════════════════════════════════════════════════════════════════
print("\n--- Method 1: Warm-start via base_margin ---")
params_ws = dict(params_cv)
params_ws["base_score"] = 0.0  # disable default base_score since we use base_margin
dtrain_ws = xgb.DMatrix(X_e2e_tr, label=y_e2e_tr, feature_names=list(cols_xgb_sig))
dtrain_ws.set_base_margin(z_bl_tr)
dvalid_ws = xgb.DMatrix(X_e2e_va, label=y_e2e_va, feature_names=list(cols_xgb_sig))
dvalid_ws.set_base_margin(z_bl_va)
booster_ws = xgb.train(
    params=params_ws,
    dtrain=dtrain_ws,
    num_boost_round=XGB_ROUNDS,
    obj=obj_tr,
    custom_metric=feval_va,
    evals=[(dtrain_ws, "train"), (dvalid_ws, "valid")],
    early_stopping_rounds=10,
    verbose_eval=False,
)
# Extract warm-start weights: total prediction = base_margin + booster_output
# So total bias = z_baseline_bias + booster_bias, total w = w_embedded + booster_w
b_ws_corr, w_ws_corr = _parse_gblinear_dump(booster_ws.get_dump()[0])
bias_ws = bias_base + b_ws_corr  # original baseline bias + learned correction
weights_ws = [w_embedded_arr[j] + w_ws_corr[j] for j in range(len(cols_xgb_sig))]
loss_ws = _stes_e2e_loss(bias_ws, weights_ws, X_aug_np, y_tr_np_sc, r2_tr, init_var_val)
rmse_ws = _oos_rmse(bias_ws, weights_ws, X_aug_np, X_aug_te_np, y_te_np_sc, r2_tr, r2_te, init_var_val)
print(f" Correction bias: {b_ws_corr:+.6f}")
print(f" Total bias: {bias_ws:.4f} (baseline={bias_base:.4f})")
print(f" Δbias from base: {bias_ws - bias_base:+.4f}")
w_sig_ws = {c: w for c, w in zip(cols_xgb_sig, weights_ws) if _is_signature_col(c)}
print(f" Σ|w_sig|: {sum(abs(v) for v in w_sig_ws.values()):.4f}")

# ═══════════════════════════════════════════════════════════════════════
# METHOD 2: RESIDUAL / TWO-STAGE (adapter/LoRA approach)
# Freeze baseline model. Only train on sig features.
# z_total = z_baseline (frozen) + booster(x_sig)
# ═══════════════════════════════════════════════════════════════════════
print("\n--- Method 2: Two-stage residual (baseline frozen, train sig-only) ---")
# Extract only signature columns
sig_col_indices = [i for i, c in enumerate(cols_xgb_sig) if _is_signature_col(c)]
sig_col_names = [cols_xgb_sig[i] for i in sig_col_indices]
X_sig_only_tr = X_e2e_tr[:, sig_col_indices]
X_sig_only_va = X_e2e_va[:, sig_col_indices]
dtrain_res = xgb.DMatrix(X_sig_only_tr, label=y_e2e_tr, feature_names=sig_col_names)
dtrain_res.set_base_margin(z_bl_tr)
dvalid_res = xgb.DMatrix(X_sig_only_va, label=y_e2e_va, feature_names=sig_col_names)
dvalid_res.set_base_margin(z_bl_va)
booster_res = xgb.train(
    params=params_ws,
    dtrain=dtrain_res,
    num_boost_round=XGB_ROUNDS,
    obj=obj_tr,
    custom_metric=feval_va,
    evals=[(dtrain_res, "train"), (dvalid_res, "valid")],
    early_stopping_rounds=10,
    verbose_eval=False,
)
b_res_corr, w_res_corr = _parse_gblinear_dump(booster_res.get_dump()[0])
bias_res = bias_base + b_res_corr
# Total weights: baseline original + 0 correction for base features, correction for sig
weights_res = list(w_embedded_arr)  # start from embedded
for j_local, j_global in enumerate(sig_col_indices):
    weights_res[j_global] = w_res_corr[j_local]
loss_res = _stes_e2e_loss(bias_res, weights_res, X_aug_np, y_tr_np_sc, r2_tr, init_var_val)
rmse_res = _oos_rmse(bias_res, weights_res, X_aug_np, X_aug_te_np, y_te_np_sc, r2_tr, r2_te, init_var_val)
print(f" Correction bias: {b_res_corr:+.6f}")
print(f" Total bias: {bias_res:.4f} (baseline={bias_base:.4f})")
print(f" Δbias from base: {bias_res - bias_base:+.4f}")
w_sig_res = {c: weights_res[i] for i, c in enumerate(cols_xgb_sig) if _is_signature_col(c)}
print(f" Σ|w_sig|: {sum(abs(v) for v in w_sig_res.values()):.4f}")

# ═══════════════════════════════════════════════════════════════════════
# COMPARISON TABLE
# ═══════════════════════════════════════════════════════════════════════
print("\n" + "=" * 80)
print("COMPARISON: XGBoost-STES")
print("=" * 80)
print(f"\n{'Method':>30s}{'bias':>8s}{'σ(b)':>7s}{'Δb':>8s}{'IS loss':>9s}{'OOS RMSE':>9s}{'ΔRMSE%':>8s}")
print(" " + "-" * 87)
methods = [
    ("Baseline (reference)", bias_base, weights_base, loss_base, rmse_base, X_base_np_diag),
    ("Cold-start (current)", bias_sig_cv, weights_sig_cv, loss_cold, rmse_cold, X_aug_np),
    ("Warm-start (transfer)", bias_ws, weights_ws, loss_ws, rmse_ws, X_aug_np),
    ("Two-stage (adapter)", bias_res, weights_res, loss_res, rmse_res, X_aug_np),
]
for name, b, w, l, r, _ in methods:
    delta_b = b - bias_base
    delta_r = (r / rmse_base - 1) * 100
    print(f" {name:>30s}{b:>8.4f}{_expit(b):>.4f}{delta_b:>+8.4f}{l:>9.4f}{r:>9.4f}{delta_r:>+7.2f}%")

# ═══════════════════════════════════════════════════════════════════════
# STES WARM-START EXPERIMENT
# ═══════════════════════════════════════════════════════════════════════
print("\n" + "=" * 80)
print("STES EXPERIMENTS")
print("=" * 80)

# Setup STES data (from cell 57)
X_tr_stes_np = X_tr_stes_sig.values.astype(float)
X_te_stes_np = X_te_stes_sig.values.astype(float)
y_tr_stes_np = y_tr.values.astype(float)
y_te_stes_np = y_te.values.astype(float)
r_tr_stes_np = r_tr.values.astype(float)
r_te_stes_np = r_te.values.astype(float)
p_stes = X_tr_stes_np.shape[1]
sig_mask_stes = np.array([1.0 if _is_signature_col(c) else 0.0 for c in cols_stes_sig])

# Best l2 from model_stes_sig (already fitted in cell 57)
best_l2_stes = model_stes_sig.l2_reg
scale_stes = np.linalg.norm(y_tr_stes_np)
p_vec_stes = np.sqrt(best_l2_stes) * scale_stes * sig_mask_stes if best_l2_stes > 0 else None

# STES Method 0: Cold-start (random init) — current approach
print("\n--- STES Method 0: Cold-start (random init) ---")
rng = np.random.default_rng(0)
x0_cold = rng.normal(0, 1, size=p_stes)
res_cold = _lsq(
    STESModel._objective,
    x0=x0_cold,
    jac=STESModel._jacobian,
    args=(r_tr_stes_np, X_tr_stes_np, y_tr_stes_np, 0, len(X_tr_stes_np), p_vec_stes),
)
params_cold = res_cold.x
print(f" Cost: {res_cold.cost:.6f} Status: {res_cold.message}")

# STES Method 1: Warm-start from baseline
print("\n--- STES Method 1: Warm-start from baseline ---")
# Map baseline params to augmented feature space
x0_warm = np.zeros(p_stes)
for i, c in enumerate(cols_stes_sig):
    if c in cols_stes_base:
        base_idx = cols_stes_base.index(c)
        x0_warm[i] = model_stes_base.params[base_idx]
# sig features stay at 0
res_warm = _lsq(
    STESModel._objective,
    x0=x0_warm,
    jac=STESModel._jacobian,
    args=(r_tr_stes_np, X_tr_stes_np, y_tr_stes_np, 0, len(X_tr_stes_np), p_vec_stes),
)
params_warm = res_warm.x
print(f" Cost: {res_warm.cost:.6f} Status: {res_warm.message}")

# STES Method 2: Multi-start (5 random + 1 warm, pick best)
print("\n--- STES Method 2: Multi-start (5 random + warm) ---")
starts = [(x0_warm, "warm")]
for s in range(5):
    rng_s = np.random.default_rng(s)
    starts.append((rng_s.normal(0, 1, size=p_stes), f"seed={s}"))
best_cost = np.inf
best_params_ms = None
best_tag = ""
for x0_ms, tag in starts:
    res_ms = _lsq(
        STESModel._objective,
        x0=x0_ms,
        jac=STESModel._jacobian,
        args=(r_tr_stes_np, X_tr_stes_np, y_tr_stes_np, 0, len(X_tr_stes_np), p_vec_stes),
    )
    if res_ms.cost < best_cost:
        best_cost = res_ms.cost
        best_params_ms = res_ms.x
        best_tag = tag
print(f" Best start: {best_tag} Cost: {best_cost:.6f}")

# STES OOS evaluation
def _stes_oos_rmse(params_vec, X_tr, X_te, y_te, r_tr, r_te):
    """Warm-start STES prediction through train, evaluate on test."""
    a_tr = _expit(X_tr @ params_vec)
    r2_tr_raw = r_tr ** 2
    v = float(r2_tr_raw[0])
    for t in range(len(a_tr)):
        v = a_tr[t] * r2_tr_raw[t] + (1 - a_tr[t]) * v
    # Run on test
    a_te = _expit(X_te @ params_vec)
    r2_te_raw = r_te ** 2
    total = 0.0
    for t in range(len(a_te)):
        v_next = a_te[t] * r2_te_raw[t] + (1 - a_te[t]) * v
        total += (y_te[t] - v_next) ** 2
        v = v_next
    return np.sqrt(total / len(y_te))

rmse_stes_base = _stes_oos_rmse(
    model_stes_base.params,
    X_tr_stes_base.values.astype(float),
    X_te_stes_base.values.astype(float),
    y_te_stes_np, r_tr_stes_np, r_te_stes_np,
)
rmse_stes_cold = _stes_oos_rmse(params_cold, X_tr_stes_np, X_te_stes_np, y_te_stes_np, r_tr_stes_np, r_te_stes_np)
rmse_stes_warm = _stes_oos_rmse(params_warm, X_tr_stes_np, X_te_stes_np, y_te_stes_np, r_tr_stes_np, r_te_stes_np)
rmse_stes_ms = _stes_oos_rmse(best_params_ms, X_tr_stes_np, X_te_stes_np, y_te_stes_np, r_tr_stes_np, r_te_stes_np)

# STES bias = intercept = first entry if "const" is first column, or we compute mean(z)
def _stes_mean_alpha(params_vec, X_tr):
    z = X_tr @ params_vec
    return float(_expit(z).mean())

alpha_stes_base = _stes_mean_alpha(model_stes_base.params, X_tr_stes_base.values.astype(float))
alpha_stes_cold = _stes_mean_alpha(params_cold, X_tr_stes_np)
alpha_stes_warm = _stes_mean_alpha(params_warm, X_tr_stes_np)
alpha_stes_ms = _stes_mean_alpha(best_params_ms, X_tr_stes_np)

print(f"\n{'Method':>30s}{'mean(α)':>8s}{'IS cost':>10s}{'OOS RMSE':>9s}{'ΔRMSE%':>8s}")
print(" " + "-" * 71)
for name, alpha_m, cost_m, rmse_m in [
    ("Baseline (reference)", alpha_stes_base, None, rmse_stes_base),
    ("Cold-start (current)", alpha_stes_cold, res_cold.cost, rmse_stes_cold),
    ("Warm-start (transfer)", alpha_stes_warm, res_warm.cost, rmse_stes_warm),
    (f"Multi-start (best={best_tag})", alpha_stes_ms, best_cost, rmse_stes_ms),
]:
    delta_r = (rmse_m / rmse_stes_base - 1) * 100
    cost_str = f"{cost_m:>10.4f}" if cost_m is not None else f"{'—':>10s}"
    print(f" {name:>30s}{alpha_m:>8.4f}{cost_str}{rmse_m:>9.6f}{delta_r:>+7.2f}%")
```
It’s tempting to argue that signatures are only “failing” here because we are truncating too aggressively. The universality theorem is an existence result: if we take signatures to a high enough level, then linear functionals of those signatures can approximate any continuous path functional. However, in our experiment above we truncate at level 2, which only captures up to pairwise interactions of increments.
There are two obvious ways to try to recover some of what truncation throws away. One is to keep level 2 but add a nonlinear model on top (a tree ensemble can form products and threshold-type interactions of low-level terms, which can mimic some higher-order effects). The other is to stay linear but raise the truncation to level 3, which supplies cubic interaction terms directly.
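One reason to be cautious about raising the level is sheer feature count: a d-channel path contributes d^k coordinates at level k, so the truncated signature grows geometrically with depth. A quick sketch for the four-channel path used in this post:

```python
def sig_dim(d: int, m: int) -> int:
    """Number of coordinates in a depth-m truncated signature of a d-channel path
    (excluding the constant level-0 term): d + d^2 + ... + d^m."""
    return sum(d ** k for k in range(1, m + 1))

print(sig_dim(4, 2))  # level 2: 20 coordinates
print(sig_dim(4, 3))  # level 3: 84 coordinates
```

Going from level 2 to level 3 quadruples the feature count before any redundancy filtering, so the level-3 variant has to earn back a much larger variance cost.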
That motivates a simple 2\times 3 grid: two boosters (gblinear, gbtree) crossed with three feature sets (baseline only, +L2 signatures, +L3 signatures). The custom end-to-end objective is the same in all cases — it differentiates L with respect to z_t through the recursion v_{t+1} = v_t + \sigma(z_t)(r_t^2 - v_t) — so swapping the booster changes only how z_t is parameterized, not how the loss is defined.
The results below report OOS RMSE for the six variants (plus ES and STES benchmarks), averaged across 10 random seeds.

```
Relative to gblinear baseline (RMSE=0.0004):
  A-WS  gblinear + L2 sig WS:  +1.57%
  B     gbtree baseline:       +0.34%
  C     gbtree + L2 sig:       +9.02%
  D     gblinear + L3 sig WS:  +5.95%
  E     gbtree + L3 sig:       +3.79%

Key comparisons:
  C vs A-WS (tree on L2 vs linear on L2): +7.45pp
  C vs B    (do L2 sigs help trees?):     +8.67pp
  D vs C    (higher sigs vs trees?):      -3.06pp
  E vs D    (trees + L3 vs linear + L3):  -2.16pp
```
6.8 Interpretation
The pattern is consistent: every variant that includes signatures — linear or tree-based, level 2 or level 3 — worsens OOS RMSE relative to the gblinear baseline. Nonlinearity by itself is not the issue: the gbtree baseline is roughly on par with gblinear (+0.34%), so simply moving from linear to tree does not change much.
What does change things is adding signatures. Trees plus L2 signatures are substantially worse (+9.02%), which fits what we have seen throughout: the extra degrees of freedom reduce training error but do not translate into better forecasts at this frequency.
Raising the truncation level does not rescue it. Level-3 signatures with gblinear (+5.95%) perform worse than level-2 (+1.57%), even though level 3 supplies the cubic interaction terms you might hope were missing. That makes it hard to blame “not enough expressiveness”: the limiting factor looks like signal rather than function class.
A genuinely favorable nonlinearity test would be one where signatures receive meaningful weight and improve OOS performance, or where we change the path construction (higher-frequency sampling, richer channels) so that the signatures have something new to say beyond the baseline volatility features. In this daily-SPY setup, neither seems to be the case.
7 Conclusion
We tested whether rolling path signatures — single-channel and multi-channel — improve variance forecasts when fed into STES and XGBoost-STES. Across all the diagnostics in this notebook, the answer is consistent: in this daily-SPY setup, signature features do not improve out-of-sample accuracy, and they often make it slightly worse.
For STES, the main issue is regularization geometry. When we add signature features and cross-validation chooses a large \lambda to suppress their coefficients, that same uniform \ell_2 penalty also shrinks the baseline feature weights far away from their unregularized optima. Once we separate the penalties (regularize signatures without simultaneously crushing the baseline), the augmented model can match the baseline RMSE — but it still does not beat it. On the optimization side, the SciPy Gauss–Newton fit with an analytical Jacobian is stable: warm-start, cold-start, and multi-start all converge to the same solution in the small-p regime considered here.
For XGBoost-STES, the picture is different because the end-to-end objective is non-convex. Fitting a linear score through a sigmoid gate and a recursive filter gives a loss surface with multiple basins, and gblinear coordinate descent can land in different local optima when the feature set changes. Cold-starting with signatures can produce a lower in-sample loss and a worse out-of-sample RMSE at the same time. Warm-start and two-stage fits keep training closer to the baseline behavior, but even after tuning on the augmented feature set they still do not beat the baseline — and in the warm-started solutions the signature weights are effectively negligible.
The nonlinearity experiment reinforces the same point. Switching the booster to gbtree does not unlock value: the tree baseline matches gblinear, and adding signatures to trees is worse. Raising the truncation level to 3 does not help either.
Our takeaways are as follows. First, generic signatures on daily return paths do not automatically add incremental signal beyond standard volatility features. Universality guarantees that a rich enough signature expansion can represent many functionals, but it does not guarantee that a low-level truncation will be a statistically efficient representation in finite samples. Second, once we augment a strong baseline, regularization design and optimization path-dependence can matter as much as the feature engineering. Third, when we do have a non-convex end-to-end loss, warm-starting from a good solution is a sensible default — even if, as in this case, it mostly confirms that the new features are not predictive.
8 Appendix: Analytical Jacobian for STES
An earlier implementation of the STES model relied on the finite-difference Jacobian built into the SciPy solver. With 20+ signature features, fitting becomes uncomfortably slow: by default, scipy.optimize.least_squares approximates the Jacobian numerically, which means that for p parameters it must run the full sequential T-step recursion about p+1 times per optimizer iteration.
For this model, we can write down the Jacobian analytically and compute it alongside the forward recursion. If we define
J_{t,j} = \frac{\partial v_{t+1}}{\partial \beta_j},
then the recursion takes the form
J_{t,j} = \alpha_t(1-\alpha_t)\, X_{t,j}\,(r_t^2 - v_t) \;+\; (1-\alpha_t)\, J_{t-1,j}.
Computationally this is still just one pass through time per iteration — the same loop as the forward model, plus bookkeeping — instead of p+1 full passes. The implementation is now in STESModel._jacobian.
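A self-contained sketch of the forward recursion together with this analytical Jacobian, checked against finite differences (illustrative; the production version is STESModel._jacobian):

```python
import numpy as np

def stes_forward_and_jac(beta, X, r, v0):
    """Forward STES recursion plus analytical Jacobian J[t, j] = d v_{t+1} / d beta_j."""
    T, p = X.shape
    z = X @ beta
    a = 1.0 / (1.0 + np.exp(-z))            # alpha_t = sigma(z_t)
    v = np.empty(T + 1); v[0] = float(v0)
    J = np.zeros((T, p))
    J_prev = np.zeros(p)                     # J_{-1} = 0
    for t in range(T):
        # J_t = a_t (1 - a_t) X_t (r_t^2 - v_t) + (1 - a_t) J_{t-1}
        J[t] = a[t] * (1 - a[t]) * X[t] * (r[t] ** 2 - v[t]) + (1 - a[t]) * J_prev
        v[t + 1] = v[t] + a[t] * (r[t] ** 2 - v[t])
        J_prev = J[t]
    return v[1:], J
```

As the text notes, this is a single pass through time per iteration: the Jacobian rows are filled in alongside the forward filter instead of re-running the recursion once per parameter.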