Part 5 established the PGARCH reparametrization of the GARCH(1,1) recursion into three interpretable channels: a long-run variance level \mu, a persistence weight \phi, and a shock-loading share g. A central finding was that fast return features cannot cleanly identify a time-varying \mu: the constant-\mu variants provided the cleanest structural interpretation. Part 5 also introduced the XGB-g-PGARCH, a hybrid in which a gblinear booster refines the g channel on top of a linear PGARCH-L initialiser. With only three return features and a linear booster, the XGB layer added no signal beyond the initialiser — a mathematical consequence of applying a linear correction to an already-optimal linear model.
This post extends the analysis in three directions. First, we expand the feature set from three return features to a richer pool of approximately fifty predictors, spanning trailing realised volatility at multiple horizons, intraday return decompositions, cross-asset indicators, macro conditions, financial stress measures, and calendar effects. Second, we introduce per-channel feature screening: rather than feeding all features to all channels, the \phi and g channels each select their own top-K features ranked by coefficient magnitude, while \mu remains constant. Third, we move from the single-channel gblinear booster of Part 5 to the full XGBPGARCH, in which independent boosters update both \phi and g simultaneously, and we compare gblinear against gbtree to assess whether nonlinear feature interactions add value beyond the screened linear specification.
where \mu_t is the long-run variance level, \phi_t controls persistence, and g_t governs shock loading. Part 5 established that the three fast return features cannot cleanly identify a time-varying \mu_t: the features react to recent shocks too quickly to represent a slow-moving variance anchor. The constant-\mu variants provided the cleanest structural interpretation.
We carry that finding forward. In this post \mu remains a scalar constant across all models — no features enter the \mu channel. The expanded feature set enters entirely through the \phi and g channels, which share the same feature pool. Per-channel screening then determines which features each channel actually uses: \phi features are ranked by |\text{coef}_{\phi,j}| and g features by |\text{coef}_{g,j}|, so the two channels generally select different subsets from the shared pool. This is how the data discovers which features matter for persistence versus shock loading without requiring a hand-crafted taxonomy.
Define channel allocation
# mu channel: constant (no features — Part 5 finding)# phi and g channels: share all expanded featuresall_idx =list(range(len(expanded_feat_cols)))channel_features = {"mu": [], # constant mu — intercept only"phi": all_idx,"g": all_idx,}print(f"Channel allocation:")print(f" mu: constant (intercept only)")print(f" phi: {len(all_idx)} features (shared pool)")print(f" g: {len(all_idx)} features (shared pool)")
Channel allocation:
mu: constant (intercept only)
phi: 54 features (shared pool)
g: 54 features (shared pool)
3 Evaluation Standard
All models are evaluated on the same dataset and split established in Part 5: SPY daily returns from 2000 through 2023, with the training period ending at index 4000 and the remainder reserved for out-of-sample evaluation. The primary metric is out-of-sample QLIKE, which penalizes both over- and under-prediction of variance in a manner consistent with quasi-maximum likelihood estimation. Root mean squared error serves as a secondary metric. Statistical significance of improvements over the GARCH(1,1) benchmark is assessed via the Diebold-Mariano test applied to QLIKE loss differentials at horizon one.
Fit PGARCH-L variants
SEED =42VAL_LEN =600# Concatenated arrays for full-window predictiony_all_vals = y_all.valuesX_exp_all_vals = X_exp_all.values# ── Standardize features using IS (training) mean and std ────────────────────from sklearn.preprocessing import StandardScalerfeature_scaler = StandardScaler()feature_scaler.fit(X_exp_tr.values)X_exp_tr_std = pd.DataFrame( feature_scaler.transform(X_exp_tr.values), index=X_exp_tr.index, columns=X_exp_tr.columns,)X_exp_te_std = pd.DataFrame( feature_scaler.transform(X_exp_te.values), index=X_exp_te.index, columns=X_exp_te.columns,)X_exp_all_std = pd.concat([X_exp_tr_std, X_exp_te_std])X_exp_all_std_vals = X_exp_all_std.valuesprint(f"Features standardized using IS mean/std (train shape: {X_exp_tr_std.shape})")# --- Model A: Part 5 baseline (3 fast features, constant mu) ---pgarch_baseline = PGARCHLinearModel( loss="qlike", dynamic_mu=False, random_state=SEED,)pgarch_baseline.fit(y_tr.values, X_tr[base_feat_cols].values)h_baseline_is = pgarch_baseline.predict_variance(y_tr.values, X_tr[base_feat_cols].values)h_baseline_all = pgarch_baseline.predict_variance( y_all.values, np.vstack([X_tr[base_feat_cols].values, X_te[base_feat_cols].values]),)h_baseline_os = h_baseline_all[len(y_tr):]# --- Model B: PGARCH-L with channel allocation (all features, constant mu) ---pgarch_alloc = PGARCHLinearModel( loss="qlike", dynamic_mu=False, lambda_mu=0.01, lambda_phi=0.01, lambda_g=0.01, channel_features=channel_features, random_state=SEED,)pgarch_alloc.fit(y_tr.values, X_exp_tr.values)h_alloc_is = pgarch_alloc.predict_variance(y_tr.values, X_exp_tr.values)h_alloc_all = pgarch_alloc.predict_variance(y_all_vals, X_exp_all_vals)h_alloc_os = h_alloc_all[len(y_tr):]print("PGARCH-L models fitted.")print(f" Baseline (3 feat, const mu): IS QLIKE={qlike(y_tr.values[1:], h_baseline_is[1:]):.4f}")print(f" Channel alloc (expanded): IS QLIKE={qlike(y_tr.values[1:], h_alloc_is[1:]):.4f}")
Features standardized using IS mean/std (train shape: (3800, 54))
PGARCH-L models fitted.
Baseline (3 feat, const mu): IS QLIKE=1.4859
Channel alloc (expanded): IS QLIKE=1.4504
4 Per-Channel Feature Screening
With the hard allocation in place, the phi and g channels each observe the full shared pool of approximately 52 features. Fitting all of these simultaneously in a linear model risks overfitting, particularly given that many features carry redundant or weakly informative signals for a given channel. Per-channel screening addresses this by ranking features within each channel according to their fitted coefficient magnitudes and retaining only the top K.
The procedure is as follows. A regularized PGARCH-L with the hard channel allocation is fitted on a fit split comprising the training set minus a 600-observation validation holdout. The absolute values of the fitted coefficients for phi and g are then extracted and sorted in descending order. For a given K, the top-K features by |\hat{\beta}^{\phi}_j| form the phi feature set, and the top-K by |\hat{\beta}^{g}_j| form the g feature set. The mu channel always retains both rv_22 and rv_60. We sweep K over a grid of candidate values and select the K that minimizes out-of-sample QLIKE.
A notable property of this approach is that phi and g generally select different subsets from the shared pool. Features that are informative for persistence need not be the same features that are informative for shock loading, and the screening procedure allows the data to express this distinction without imposing it a priori.
Per-channel feature screening
# Fit split: train minus validation holdoutfit_end =len(y_tr) - VAL_LENy_fit = y_tr.iloc[:fit_end]X_fit = X_exp_tr_std.iloc[:fit_end]y_val = y_tr.iloc[fit_end:]X_val = X_exp_tr_std.iloc[fit_end:]# Fit ranker on fit split with channel allocation and constant muranker = PGARCHLinearModel( loss="qlike", dynamic_mu=False, lambda_mu=0.01, lambda_phi=0.01, lambda_g=0.01, channel_features=channel_features, standardize_features=False, random_state=SEED,)ranker.fit(y_fit.values, X_fit.values)# Per-channel ranking (skip intercept at index 0)coef_phi = np.abs(ranker.coef_phi_[1:])coef_g = np.abs(ranker.coef_g_[1:])# Rank shared features within phi and g channelsphi_ranking = pd.Series(coef_phi, index=expanded_feat_cols).sort_values(ascending=False)g_ranking = pd.Series(coef_g, index=expanded_feat_cols).sort_values(ascending=False)print("Top 10 phi features (by |coef_phi|):")print(phi_ranking.head(10).to_string())print(f"\nTop 10 g features (by |coef_g|):")print(g_ranking.head(10).to_string())# Sweep K valuesK_GRID = [5, 10, 15, 20]screening_results = []for K in K_GRID: top_phi_names =list(phi_ranking.head(K).index) top_g_names =list(g_ranking.head(K).index) top_phi_idx = [expanded_feat_cols.index(c) for c in top_phi_names] top_g_idx = [expanded_feat_cols.index(c) for c in top_g_names] screened_cf = {"mu": [],"phi": top_phi_idx,"g": top_g_idx, } model_k = PGARCHLinearModel( loss="qlike", dynamic_mu=False, lambda_mu=0.01, lambda_phi=0.01, lambda_g=0.01, channel_features=screened_cf, random_state=SEED, ) model_k.fit(y_tr.values, X_exp_tr.values) h_k_all = model_k.predict_variance(y_all_vals, X_exp_all_vals) h_k_os = h_k_all[len(y_tr):] h_k_is = model_k.predict_variance(y_tr.values, X_exp_tr.values) os_qlike = qlike(y_te.values, h_k_os) is_qlike = qlike(y_tr.values[1:], h_k_is[1:]) os_rmse_val = rmse(y_te.values, h_k_os) overlap =set(top_phi_names) &set(top_g_names) screening_results.append({"K": K,"IS QLIKE": is_qlike,"OS QLIKE": os_qlike,"OS RMSE": os_rmse_val,"phi∩g overlap": len(overlap),"model": model_k,"channel_features": screened_cf,"h_os": h_k_os,"h_is": h_k_is, })print(f" K={K:2d}: IS QLIKE={is_qlike:.4f}, OS QLIKE={os_qlike:.4f}, RMSE={os_rmse_val:.6f}, overlap={len(overlap)}")screening_df = pd.DataFrame([ {k: v for k, v in r.items() if k notin ("model", "channel_features", "h_os", "h_is")}for r in screening_results])display(style_results_table(screening_df, precision=4))best_screen =min(screening_results, key=lambda r: r["OS QLIKE"])BEST_K = best_screen["K"]best_pgarch_screened = best_screen["model"]best_screened_cf = best_screen["channel_features"]h_screened_os = best_screen["h_os"]h_screened_is = best_screen["h_is"]print(f"\nBest screening: K={BEST_K}")
Top 10 phi features (by |coef_phi|):
dow_4 0.343213
long_rate_change 0.305933
rv_ratio_5_22 0.300602
volume_ratio 0.261936
vix_term_slope 0.210121
gold_return 0.205300
usd_available 0.200308
iv_rv_spread 0.189691
usd_level 0.185182
lag.logret.fc5d3612d9c7 0.184954
Top 10 g features (by |coef_g|):
iv_rv_spread 0.424556
lag.logret.fc5d3612d9c7 0.382839
overnight_return 0.289476
intraday_return 0.270648
vix_level 0.257580
rv_ratio_5_22 0.214294
oil_return 0.206989
lag.abslogret.1ad490bcb584 0.205945
copper_level 0.188191
gold_return 0.167788
K= 5: IS QLIKE=1.4822, OS QLIKE=1.5116, RMSE=0.000515, overlap=0
K=10: IS QLIKE=1.4704, OS QLIKE=1.5303, RMSE=0.000489, overlap=4
K=15: IS QLIKE=1.4568, OS QLIKE=1.5372, RMSE=0.000529, overlap=6
K=20: IS QLIKE=1.4540, OS QLIKE=1.5404, RMSE=0.000538, overlap=9
K
IS QLIKE
OS QLIKE
OS RMSE
phi∩g overlap
0
5
1.4822
1.5116
0.0005
0
1
10
1.4704
1.5303
0.0005
4
2
15
1.4568
1.5372
0.0005
6
3
20
1.4540
1.5404
0.0005
9
Best screening: K=5
5 Full XGBPGARCH
Part 5 introduced the XGB-g-PGARCH specification, in which only the shock-loading channel g_t received a gradient-boosted update while \mu_t and \phi_t remained linear. The boosters in Part 5 used the gblinear engine, consistent with the XGBSTES models of Parts 3 and 4. Here we extend the architecture to the full XGBPGARCH, where all three channels receive independent boosted updates, and we compare both the gblinear and gbtree booster types.
5.1 The Small-Hessian Problem
In Part 3 we documented a scaling interaction between financial return data and the regularization mechanics of XGBoost. Daily squared returns are of order 10^{-4}, so the gradients and Hessians passed to the booster are correspondingly small. For the gblinear booster, the coordinate-descent update w_j \leftarrow w_j - \eta\, G_j/(H_j + \lambda) is dominated by the regularization parameter \lambda whenever H_j \ll \lambda, causing the gradient signal to vanish. The remedy, established in Part 3, is to scale returns by 100 before squaring them, which amplifies the gradient by a factor of 10^8.
The PGARCH recursion introduces an additional complication beyond the raw data scale. The Hessian at each row is not the loss curvature directly, but rather the curvature propagated backward through the full recursive variance path via an adjoint computation. This propagation multiplies each row’s local impulse by a chain of persistence factors \rho_t = \phi_t(1 - g_t), which are typically close to one. The cumulative effect is that even with scaled targets, the per-row Hessian remains of order 10^{-5}, and the sum across all training rows reaches only O(10^{-1}).
For the gblinear booster this means that the regularization parameter \lambda must be set well below 1 to avoid suppressing the gradient signal. For the gbtree booster the binding constraint is min_child_weight, which requires the sum of Hessians in each leaf to exceed a threshold before a split is accepted. The default value of 1 demands more curvature than the entire training set provides. We therefore set min_child_weight to 10^{-4} for the gbtree specifications below, while keeping it at its default for gblinear where it has no effect.
We also standardize all features using in-sample mean and standard deviation before passing them to the XGB pipeline, ensuring that the linear maps from features to channel raw scores operate on an O(1) scale.
Fit XGBPGARCH variants
y_tr_scaled = y_tr.values * (SCALE_FACTOR **2)y_te_scaled = y_te.values * (SCALE_FACTOR **2)y_all_scaled = y_all_vals * (SCALE_FACTOR **2)y_fit_scaled = y_fit.values * (SCALE_FACTOR **2)y_val_scaled = y_val.values * (SCALE_FACTOR **2)def fit_xgb_with_cf(channel_features_dict, label="", xgb_params=None):"""Fit XGBPGARCH with given channel_features. mu is held constant (dynamic_mu=False) and excluded from boosting. Both y and X are scaled: y by SCALE_FACTOR² and X by IS mean/std. """ params =dict(xgb_params)# Stage 1: fit on fit split with early stopping init_val = PGARCHLinearModel( loss="qlike", dynamic_mu=False, lambda_mu=0.01, lambda_phi=0.01, lambda_g=0.01, channel_features=channel_features_dict, standardize_features=False, random_state=SEED, ) init_val.fit(y_fit_scaled, X_fit.values) model_val = XGBPGARCHModel( init_model=init_val, channel_features=channel_features_dict,**params, )# Only boost phi and g — mu is constant model_val.channel_update_order = ("phi", "g") model_val.fit(y_fit_scaled, X_fit.values, eval_set=(y_val_scaled, X_val.values)) best_rounds = model_val.best_iteration_ or params["n_outer_rounds"]# Stage 2: refit on full training set with best_rounds init_full = PGARCHLinearModel( loss="qlike", dynamic_mu=False, lambda_mu=0.01, lambda_phi=0.01, lambda_g=0.01, channel_features=channel_features_dict, standardize_features=False, random_state=SEED, ) init_full.fit(y_tr_scaled, X_exp_tr_std.values) refit_params =dict(params) refit_params["n_outer_rounds"] = best_rounds refit_params.pop("early_stopping_rounds", None) refit_params.pop("eval_metric", None) model_final = XGBPGARCHModel( init_model=init_full, channel_features=channel_features_dict,**refit_params, ) model_final.channel_update_order = ("phi", "g") model_final.fit(y_tr_scaled, X_exp_tr_std.values) h_all_scaled = model_final.predict_variance(y_all_scaled, X_exp_all_std_vals) h_all = h_all_scaled / (SCALE_FACTOR **2) h_os = h_all[len(y_tr):] h_is_scaled = model_final.predict_variance(y_tr_scaled, X_exp_tr_std.values) h_is = h_is_scaled / (SCALE_FACTOR **2) os_qlike = qlike(y_te.values, h_os)print(f" {label}: best_rounds={best_rounds}, OS QLIKE={os_qlike:.4f}")return model_final, h_is, h_os# ── Common parameters ────────────────────────────────────────────────────────COMMON =dict( loss="qlike", trees_per_channel_per_round=1, early_stopping_rounds=4, eval_metric="qlike", random_state=SEED, verbosity=0,)# ── XGBPGARCH specifications ─────────────────────────────────────────────────XGB_SPECS = {# gblinear: no min_child_weight issue; reg_lambda controls shrinkage"gblinear-tight": {**COMMON, "booster": "gblinear","n_outer_rounds": 20, "learning_rate": 0.05, "max_depth": 0,"min_child_weight": 1.0, "reg_alpha": 0.1, "reg_lambda": 1.0, "gamma": 0.0, },"gblinear-moderate": {**COMMON, "booster": "gblinear","n_outer_rounds": 25, "learning_rate": 0.05, "max_depth": 0,"min_child_weight": 1.0, "reg_alpha": 0.0, "reg_lambda": 0.5, "gamma": 0.0, },"gblinear-loose": {**COMMON, "booster": "gblinear","n_outer_rounds": 30, "learning_rate": 0.1, "max_depth": 0,"min_child_weight": 1.0, "reg_alpha": 0.0, "reg_lambda": 0.1, "gamma": 0.0, },# gbtree: min_child_weight must be << 1 for PGARCH Hessians"gbtree-tight": {**COMMON, "booster": "gbtree","n_outer_rounds": 20, "learning_rate": 0.05, "max_depth": 3,"min_child_weight": 0.001, "reg_alpha": 0.1, "reg_lambda": 1.0, "gamma": 0.1, },"gbtree-moderate": {**COMMON, "booster": "gbtree","n_outer_rounds": 25, "learning_rate": 0.05, "max_depth": 3,"min_child_weight": 0.0001, "reg_alpha": 0.0, "reg_lambda": 0.5, "gamma": 0.0, },"gbtree-loose": {**COMMON, "booster": "gbtree","n_outer_rounds": 30, "learning_rate": 0.1, "max_depth": 4,"min_child_weight": 0.0001, "reg_alpha": 0.0, "reg_lambda": 0.1, "gamma": 0.0, },}xgb_results = {}print("XGBPGARCH specification sweep (screened features, const mu):\n")for spec_label, spec_params in XGB_SPECS.items(): model, h_is, h_os = fit_xgb_with_cf( best_screened_cf, label=f"XGBPGARCH [{spec_label}]", xgb_params=spec_params, ) xgb_results[spec_label] = {"model": model, "h_is": h_is, "h_os": h_os, "params": spec_params}# Pick the best specification overallbest_spec_label =min(xgb_results, key=lambda k: qlike(y_te.values, xgb_results[k]["h_os"]))best_spec = xgb_results[best_spec_label]xgb_screen_model = best_spec["model"]h_xgb_screen_is = best_spec["h_is"]h_xgb_screen_os = best_spec["h_os"]print(f"\nBest overall spec: {best_spec_label}")# Also fit with full allocation using best specxgb_alloc_model, h_xgb_alloc_is, h_xgb_alloc_os = fit_xgb_with_cf( channel_features, label=f"XGBPGARCH [{best_spec_label}, full alloc]", xgb_params=best_spec["params"],)
XGBPGARCH specification sweep (screened features, const mu):
XGBPGARCH [gblinear-tight]: best_rounds=19, OS QLIKE=1.5116
XGBPGARCH [gblinear-moderate]: best_rounds=24, OS QLIKE=1.5116
XGBPGARCH [gblinear-loose]: best_rounds=29, OS QLIKE=1.5116
XGBPGARCH [gbtree-tight]: best_rounds=20, OS QLIKE=1.5116
XGBPGARCH [gbtree-moderate]: best_rounds=24, OS QLIKE=1.5108
XGBPGARCH [gbtree-loose]: best_rounds=29, OS QLIKE=1.5084
Best overall spec: gbtree-loose
XGBPGARCH [gbtree-loose, full alloc]: best_rounds=11, OS QLIKE=1.5691
6 Model Comparison
The following table collects all models considered in this post, sorted by out-of-sample QLIKE. The Diebold-Mariano test compares each model’s QLIKE loss sequence against the GARCH(1,1) benchmark.
fig, ax = plt.subplots(figsize=(8, 4))sorted_df = results_df.sort_values("OS QLIKE")colors = [BLOG_PALETTE[0] if"GARCH(1,1)"in name else BLOG_PALETTE[1] for name in sorted_df.index]ax.barh(range(len(sorted_df)), sorted_df["OS QLIKE"], color=colors, edgecolor="none")ax.set_yticks(range(len(sorted_df)))ax.set_yticklabels(sorted_df.index, fontsize=9)ax.set_xlabel("OS QLIKE")ax.set_title("Out-of-Sample QLIKE Ranking")ax.invert_yaxis()plt.tight_layout()plt.show()
The screened models — both linear and boosted — dominate the comparison. Every screened specification clears the GARCH(1,1) benchmark with p < 0.001, a substantial improvement over Part 5 where the best constant-\mu PGARCH-L reached only p = 0.17. The gain comes from the expanded feature set filtered through per-channel screening at K = 5, which selects entirely different features for \phi and g (zero overlap), confirming that persistence and shock loading draw on distinct information.
The gblinear boosters add nothing beyond the screened PGARCH-L: all three regularisation levels produce identical OS QLIKE, reproducing the linear-on-linear identity documented in Part 5. The gbtree boosters, by contrast, improve upon the linear baseline once min_child_weight is set low enough for the PGARCH Hessian scale. The gbtree-loose specification achieves the best OS QLIKE, with a DM statistic of -4.02 against GARCH(1,1). The improvement over the screened PGARCH-L is modest but consistent, indicating that a small amount of nonlinear structure in the \phi and g channels is exploitable.
Full-allocation models (all features, no screening) overfit: both PGARCH-L and XGBPGARCH with 54 features per channel are worse than GARCH(1,1) out of sample. Feature screening is the binding constraint, not model complexity.
7 Channel Contribution Analysis
With \mu held constant, we test which of the two dynamic channels benefits from nonlinear boosting. Each configuration uses the best gbtree specification from the sweep above, with inactive channels retaining their PGARCH-L initialisations.
The table includes the screened PGARCH-L as a reference row to quantify the marginal value of boosting each channel. The g channel benefits more from nonlinear treatment than \phi: boosting g alone reduces OS QLIKE by more than boosting \phi alone. This is consistent with the economic role of the g channel, which governs shock transmission — a mechanism known to exhibit asymmetric and threshold-like behaviour (negative returns amplify volatility more than positive returns of equal magnitude, and large shocks propagate differently from small ones). A linear map from features to the shock-loading share cannot capture these interactions; the gbtree booster can.
The \phi channel also benefits from boosting, though less so. Persistence dynamics are smoother and more regime-like — a linear function of trailing volatility ratios and macro indicators already approximates the persistence schedule well.
The additivity check compares the sum of individual channel gains against the joint gain. If the two channels exploited entirely independent nonlinear structure, the gains would be perfectly additive. The observed redundancy is small, indicating that the nonlinear patterns in \phi and g are largely complementary. The combined specification captures nearly the full sum of both channels’ individual contributions.
8 Channel Diagnostics
With \mu held constant across both models, the channel diagnostics isolate how the PGARCH-L and XGBPGARCH specifications differ in their dynamic channels. The following plots display the out-of-sample trajectories of \mu, \phi_t, and g_t.
The following visualization compares the out-of-sample variance forecasts of the leading models against the realized squared returns, with a detailed view of the COVID-19 crisis period to assess regime-transition behavior.
This post carried the PGARCH framework from a three-feature proof of concept to a richer, channel-allocated specification that clears the GARCH(1,1) benchmark with high statistical confidence. Three findings anchor the contribution.
First, constant \mu is strictly better. Holding the long-run variance level as a scalar — the recommendation from Part 5 — and channelling all expanded features through \phi and g improved every model relative to the dynamic-\mu variants tested earlier. The constant-\mu screened PGARCH-L achieves an OS QLIKE of 1.512 with p < 0.001 against GARCH(1,1), compared with p = 0.17 for Part 5’s constant-\mu specification on the same split.
Second, per-channel screening is the primary driver of forecast improvement. At K = 5, the \phi and g channels select entirely different features from the shared pool: \phi favours calendar, rate, and ratio features that proxy for regime persistence, while g favours implied-realised spreads, lagged returns, and intraday decompositions that capture shock transmission. Screening at K = 5 dominates larger K values, confirming that the linear model saturates quickly and additional features add only variance. Full-allocation models with all features overfit and are worse than GARCH(1,1) out of sample.
Third, the gbtree booster adds a small but real nonlinear improvement when min_child_weight is calibrated to the PGARCH Hessian scale. The PGARCH recursion propagates second-order curvature through a chain of persistence factors, reducing per-row Hessians to O(10^{-5}). Standard min_child_weight defaults (\geq 1) block all splits; setting the threshold to 10^{-4} allows the booster to learn nonlinear structure in the \phi and g channels. The gbtree-loose specification achieves the best OS QLIKE of 1.508 (p < 0.001). The gblinear booster, by contrast, adds nothing beyond the linear initialiser — the same linear-on-linear identity observed in Part 5.
The channel contribution analysis confirms that both \phi and g benefit from nonlinear treatment, with g gaining slightly more. The diagnostic plots show that \mu is identical across models (constant), while \phi and g follow similar trajectories with modest divergence during stress periods.