Volatility Forecasts (Part 8 - What VolGRU Taught Us)
2026-03-17
1 Scope
This notebook is a synthesis and close-out, not a new heavy experiment. It uses the canonical Part 7 log and the cached Part 7 artifacts already on disk. The dataset, target definition, train and out-of-sample protocol, GARCH(1,1) benchmark, and DM test on QLIKE all remain fixed. No full VolGRU cross-validation rerun is launched here.
That choice is deliberate. The purpose of Part 8 is to say what the VolGRU branch taught the series, including the negative result, and to record what remains unrun without implying that more compute would automatically have changed the conclusion. The owning implementation still lives in volatility_forecast/model/volgru_model.py and volatility_forecast/model/volgru_torch.py. The evidence summarized below is the Part 7 canonical record, not a fresh search.
2 Why This Branch Was Worth Testing
VolGRU was a reasonable branch to test. STES already has the form of a constrained GRU-like update: a running variance state is blended with new information through a time-varying gate. Once that connection is explicit, it is natural to ask whether a learned candidate state, an active reset mechanism, or a multi-dimensional latent state can use richer information more effectively than the tightly structured STES recursion.
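To make the STES-to-GRU analogy concrete, here is a minimal sketch of the gated variance recursion: a logistic gate, driven by a transition variable, blends the previous variance state with the newest squared shock. The parameter names (beta0, beta1) and the warm-up initialisation are illustrative assumptions, not the API of volgru_model.py or the STES code in the repo.

```python
import numpy as np

def stes_variance_path(eps, x, beta0, beta1):
    """STES-style recursion: var_t = alpha_t * eps_{t-1}^2 + (1 - alpha_t) * var_{t-1},
    where alpha_t is a logistic gate driven by the transition variable x.
    The gate plays the same role as a GRU update gate: it blends the old
    state with new information."""
    var = np.empty_like(eps, dtype=float)
    var[0] = eps[:20].var()  # simple warm-up initialisation (an assumption)
    for t in range(1, len(eps)):
        alpha = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x[t - 1])))
        var[t] = alpha * eps[t - 1] ** 2 + (1.0 - alpha) * var[t - 1]
    return var
```

Written this way, the VolGRU question is exactly the one posed above: does replacing the scalar state and the fixed candidate eps_{t-1}^2 with a learned multidimensional state and candidate pay for itself?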
The expanded-feature design made that possibility look even more plausible. If the data contained multiple volatility channels, or if the useful signal could not be compressed into one disciplined recursion, then a richer recurrent state model was the obvious place to look. Part 8 is not an apology for having tried that idea. It is the record of why the idea was worth testing and why the evidence still moved the series elsewhere.
3 What Was Tried
The canonical record for this tier is narrower than the old multi-post plan implied. On the fixed expanded-feature split, the VolGRU branch first attempted a multidimensional search under the same broad QLIKE protocol used elsewhere in Part 7. Two state_dim = 4 candidates completed in the logged partial cross-validation pass: a reset-gate specification with a positive linear candidate and a reset-gate specification with a positive MLP candidate.
The wider search was not then carried through as a full expensive campaign. Instead, a coarse quick-search pruning pass was used to test whether the broader multidimensional direction was promising enough to justify that cost. That pass covered state_dim = 4 and state_dim = 8, each with reset-gate linear-candidate and reset-gate MLP-candidate variants, under a fixed regularization setup. This is the factual core of what was tried on the VolGRU side of the Part 7 search.
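For readers who have not followed the branch, one step of the reset-gate cell family described above can be sketched as follows. The weight names, the tanh used for the MLP-style candidate, and the softplus read-out that keeps the variance positive are all illustrative assumptions; the actual specifications live in volgru_model.py and volgru_torch.py.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def volgru_step(h, x, p, mlp_candidate=False):
    """One step of a GRU-style cell over a small latent variance state
    (state_dim = len(h)), in the spirit of the MD4/MD8 variants.
    p is a dict of weight matrices; names are illustrative assumptions."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)        # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)        # reset gate
    c = p["Wc"] @ x + p["Uc"] @ (r * h)           # candidate pre-activation
    if mlp_candidate:
        c = np.tanh(c)                            # nonlinear (MLP-style) candidate
    h_new = (1.0 - z) * h + z * c                 # blended latent state
    var = np.log1p(np.exp(p["w_out"] @ h_new))    # softplus read-out keeps variance > 0
    return h_new, var
```

The contrast with the STES recursion is the point: the state is a vector rather than a scalar, and the candidate is learned rather than fixed at the squared shock.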
4 What Happened
The conservative result is simple. The best logged partial candidate, VolGRU-MD4-MLP, still sat above GARCH(1,1) on mean QLIKE, with an out-of-sample QLIKE of 1.6144 against the benchmark's 1.5985. The paired VolGRU-MD4-Reset result was materially worse, and the quick-search pruning run made the broader direction look clearly adverse: every MD4 and MD8 quick candidate landed well above the benchmark, and most were significantly worse under the DM test.
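For reference, the two evaluation pieces behind these numbers, the QLIKE loss and the DM test on the loss differential, can be sketched as below. The exact QLIKE parameterisation and the Newey-West lag choice here are assumptions, not a reproduction of the Part 7 evaluation code.

```python
import numpy as np

def qlike(var_fc, r2):
    """Per-period QLIKE loss under one common parameterisation
    (the series' exact variant may differ): log(h_t) + r_t^2 / h_t."""
    return np.log(var_fc) + r2 / var_fc

def dm_stat(loss_a, loss_b, lags=5):
    """Diebold-Mariano statistic on the loss differential d_t = L_a - L_b,
    with a Bartlett-weighted Newey-West long-run variance.
    The lag choice is an assumption. Positive values mean model a is worse."""
    d = np.asarray(loss_a) - np.asarray(loss_b)
    n = len(d)
    mean_d = d.mean()
    dc = d - mean_d
    lrv = dc @ dc / n                              # gamma_0
    for j in range(1, lags + 1):
        gamma = dc[j:] @ dc[:-j] / n               # autocovariance at lag j
        lrv += 2.0 * (1.0 - j / (lags + 1)) * gamma
    return mean_d / np.sqrt(lrv / n)
```

Under this sign convention, a candidate whose per-period QLIKE series yields a large positive statistic against the GARCH(1,1) loss series is significantly worse than the benchmark, which is the pattern the quick-search pass produced.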
That combination is enough for the series decision. VolGRU did not justify itself on this tier under the current protocol. The branch produced a real negative result rather than a near miss that obviously deserved another multi-post campaign.
5 Why This Negative Result Is Useful
The negative result still matters because it narrows the search space in a disciplined way. A more flexible recurrent state is not automatically better. On this dataset and evaluation design, the evidence points in the opposite direction: the useful gains came from controlling the feature set and keeping the recursion disciplined, not from opening a much richer latent-state architecture.
That is informative for mechanism as well as for workflow. If recurrence still matters, it probably needs a much more targeted inductive bias than the broad multidimensional expansions tested here. A residual-state formulation, a very specific low-dimensional decomposition, or a tightly constrained nonlinear increment could still be interesting later. What the current evidence does not support is spending another long series of posts on the hope that generic recurrent flexibility will sort itself out with enough search.
6 Screened VolGRU Status
Screened VolGRU was not run in this pass. The canonical log records a leakage-safe feature screen for XGBPGARCHModel, a screened top-K stress test for the nonlinear PGARCH branch, and screened PGARCH-L diagnostics. The stored Part 7 stress-test driver, posts/volatility-forecasts-7/part7_stress_tests.py, also builds only the screened PGARCH diagnostic. The cache contains part7_screened_pgarch_diagnostic.*, but there is no screened VolGRU companion artifact.
That absence should be stated plainly rather than papered over. Screened VolGRU remained a plausible follow-up idea, but it was intentionally deferred in this wrap-up phase because VolGRU is expensive, the existing VolGRU evidence was already weak, and the best-supported result had already shifted toward screened structured recursion elsewhere. It remains an unrun deferred follow-up, not an implied hidden result.
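For concreteness, the deferred screening idea amounts to a training-only top-K feature selection of the kind used for the screened PGARCH branch. The ranking criterion below (absolute correlation with the target) is an assumption for illustration, not the repo's actual screen.

```python
import numpy as np

def leakage_safe_topk(X_train, y_train, k=10):
    """Rank features by absolute correlation with the target using
    TRAINING rows only, so the screen never sees out-of-sample data.
    The correlation criterion is an illustrative assumption; the actual
    Part 7 screen may rank differently."""
    Xc = X_train - X_train.mean(axis=0)
    yc = y_train - y_train.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.abs(Xc.T @ yc) / denom
    return np.argsort(corr)[::-1][:k]  # indices of the top-K features
```

A screened VolGRU run would feed only these K columns into the recurrent cell; the point recorded above is simply that this combination was deferred, not silently tried and discarded.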
7 Success and Failure Together
Part 7 and Part 8 are stronger together than they would be separately. The success side of the record is that benchmark-clearing specifications exist once the feature set is compressed and the recursion is kept disciplined. The failure side of the record is that broad recurrent-state expansion did not earn its complexity on the same tier.
Those two results point the series in the same direction. The next productive branch should stay close to screened structured recursions and add any extra nonlinear capacity only in small, well-motivated increments. XGBPGARCHModel remains useful as the discovery step that showed the family could clear the benchmark. VolGRU remains useful as the negative result that showed where extra flexibility stopped paying for itself.
8 Handoff
This notebook ends the VolGRU detour as one consolidated lessons post. There is no need to open separate Parts 9 and 10 to revisit the same line under the current protocol.
What comes next should stay near the best-supported branch: screened PGARCH-L around the K = 10 to 15 band, targeted regularization work inside that branch, and only then carefully chosen nonlinear increments if they can beat the screened linear baseline cleanly. That is a sharper direction precisely because the VolGRU branch was tested honestly and did not work well enough here.