Phase-2 Update: The Important Part Is Not “Gotcha,” It’s Regime Control
mongoobi, Feb 2026
This is a direct update to Part I, where I reported a mid-layer low-$$k$$ proxy mismatch: $$\Delta R^2 > 0$$ while $$\Delta CE_{rec} < 0$$ when scaling from Pythia-70M to Pythia-410M.
Short version:
- The mismatch is real at low token budgets.
- In the phase-2 runs, it shrinks and reverses by higher token budgets.
- So in the tested regime, this now looks primarily optimization-mediated, not a purely intrinsic geometric effect.
That does not rescue $$R^2$$ as a behavioral metric. It means the failure mode is regime-dependent, not immutable.
Epistemic Status
- Strong on sign-level behavior in this narrow setting (Pythia 70M vs 410M, mid-layer, low-$$k$$, current training setup).
- Moderate on mechanism attribution (optimization clearly matters; full geometry contribution still open).
- Weak on universality (not yet across model families, broader hooks, or deep seed grids).
If you only keep one sentence from this post, keep this:
Fixed-budget proxy gaps are not enough to claim intrinsic failure; run the token-budget sweep first.
What Changed
Part I was mostly 10M-token/SAE evidence. This update adds phase-2 low-$$k$$ mid-layer runs at larger token budgets and anchor repeats.
Setup (same scope as before):
- Models: `pythia-70m` vs `pythia-410m`
- Hookpoint: mid-layer `hook_resid_post` (L3 vs L12)
- SAE class: TopK, expansion 32x
- Focus: $$k \in \{8, 16\}$$
- Token budgets: 10M, 50M, 100M
Primary statistic remains:
$$ \Delta CE_{rec}(k,T) := CE_{rec}^{410M}(k,T) - CE_{rec}^{70M}(k,T). $$
Mismatch indicator:
$$ I_{mismatch} = \mathbb{1}\left[\Delta R^2 > 0 \land \Delta CE_{rec} < 0\right]. $$
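The two statistics above are simple to operationalize. The sketch below is illustrative (field names and values are not the project's actual schema); it just encodes the definitions of $$\Delta CE_{rec}$$ and the mismatch indicator:

```python
def delta_ce(ce_rec_410m, ce_rec_70m):
    """Delta CE_rec(k, T) := CE_rec^410M(k, T) - CE_rec^70M(k, T)."""
    return ce_rec_410m - ce_rec_70m

def is_mismatch(delta_r2, delta_ce_rec):
    """Mismatch indicator: Delta R^2 > 0 while Delta CE_rec < 0,
    i.e. the reconstruction proxy and the behavioral metric disagree in sign."""
    return delta_r2 > 0 and delta_ce_rec < 0

# Sign pattern matching the k=8, 10M-token cell reported in this post:
print(is_mismatch(delta_r2=0.150, delta_ce_rec=-0.099))  # True
# Sign pattern matching the k=8, 100M-token cell:
print(is_mismatch(delta_r2=0.096, delta_ce_rec=0.009))   # False
```
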
Core Result
From the phase-2 aggregate:
| $$k$$ | tokens | n | $$\Delta CE_{rec}$$ mean (95% CI) | $$\Delta R^2$$ mean (95% CI) | mismatch rate |
|---|---|---|---|---|---|
| 8 | 10M | 4 | -0.099 [-0.117, -0.082] | +0.150 [0.147, 0.153] | 1.00 |
| 8 | 50M | 1 | +0.018 [0.018, 0.018] | +0.108 [0.108, 0.108] | 0.00 |
| 8 | 100M | 3 | +0.009 [0.004, 0.013] | +0.096 [0.096, 0.097] | 0.00 |
| 16 | 10M | 4 | -0.028 [-0.064, 0.009] | +0.111 [0.109, 0.112] | 0.50 |
| 16 | 50M | 1 | +0.018 [0.018, 0.018] | +0.072 [0.072, 0.072] | 0.00 |
| 16 | 100M | 3 | +0.020 [0.016, 0.024] | +0.067 [0.067, 0.067] | 0.00 |
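For the per-cell means and intervals above, I don't know that readers share the runbook's exact CI construction, so here is a minimal normal-approximation sketch (an assumption, not necessarily the aggregation the runbook uses) that also reproduces the degenerate `[x, x]` intervals shown for the $$n=1$$ cells:

```python
import math

def mean_ci95(xs):
    """Mean with a normal-approximation 95% CI.

    With a single run the interval degenerates to [x, x], which is how
    the n=1 (50M-token) cells appear in the table.
    """
    n = len(xs)
    m = sum(xs) / n
    if n == 1:
        return m, (m, m)
    var = sum((x - m) ** 2 for x in xs) / (n - 1)  # unbiased sample variance
    half = 1.96 * math.sqrt(var / n)               # 1.96 ~ z for 95%
    return m, (m - half, m + half)

print(mean_ci95([0.018]))  # single run: (0.018, (0.018, 0.018))
```

For small $$n$$ (3-4 runs per cell) a $$t$$-based interval would be wider and arguably more honest; the normal approximation is used here only to keep the sketch dependency-free.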
What matters:
- At 10M, low-$$k$$ mismatch is present (especially $$k=8$$).
- By 100M, $$\Delta CE_{rec}$$ is positive for both $$k=8,16$$ while $$\Delta R^2$$ remains positive.
- Mismatch rate collapses from high to zero.
Interpretation in plain English:
- The original “bigger model looks solved in $$R^2$$ but behaves worse” effect is strongest in the low-budget regime.
- In this tested low-$$k$$ mid-layer setup, training budget can remove the negative CE gap.
What I’m Updating (Belief-Level)
The strongest version of my Part I story was:
- low-$$k$$ proxy mismatch is real,
- and plausibly intrinsic.
After phase-2, I would update that to:
- low-$$k$$ proxy mismatch is real,
- but in this tested regime it is mostly a training-state effect unless shown otherwise.
That is a real belief update, not just rewording.
The Update to the Claim (and Why It’s Better)
Part I headline (fair for the data then): reconstruction proxies can disagree with behavior.
Part III headline (better):
Proxy validity is regime-dependent. At low token budgets, $$R^2$$ can point the wrong way for behavior. As SAE training improves, that disagreement can shrink or reverse.
This is a stronger scientific position, not a weaker one:
- It keeps the empirical warning (do not accept SAEs on $$R^2$$ alone).
- It avoids overclaiming intrinsic impossibility.
- It gives a concrete operational rule: always run token-budget sweeps before mechanism claims.
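The operational rule in the last bullet can be stated as a tiny classifier over a token-budget sweep. Budget keys and thresholds here are illustrative; the logic is just "a fixed-budget gap only counts as a candidate intrinsic failure if it survives the whole sweep":

```python
def classify_gap(delta_ce_by_budget):
    """Classify a cross-scale proxy gap from a token-budget sweep.

    delta_ce_by_budget maps a budget label (e.g. '10M') to Delta CE_rec.
    A negative value is the 'worrying' sign from Part I.
    """
    negative = [dce < 0 for dce in delta_ce_by_budget.values()]
    if all(negative):
        return "persists: candidate intrinsic (still needs mechanism work)"
    if any(negative):
        return "regime-dependent: likely optimization-mediated"
    return "no behavioral gap in tested budgets"

# Phase-2 k=8 pattern: negative at 10M, positive by 50M/100M.
print(classify_gap({"10M": -0.099, "50M": 0.018, "100M": 0.009}))
# → regime-dependent: likely optimization-mediated
```
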
What This Does Not Mean
It does not mean:
- “Geometry is irrelevant.”
- “$$R^2$$ is fine after all.”
- “The original mismatch was fake.”
Why not:
- $$\Delta R^2$$ stays positive throughout, even when $$\Delta CE_{rec}$$ changes sign.
- So reconstruction-space and behavior-space are still not equivalent objectives.
- What changed is where the training dynamics place remaining error mass.
I still expect sensitivity-aware metrics (SWD/pullback approximations) to matter. I’m just no longer treating them as the only explanation for the observed low-$$k$$ gap in this regime.
A More Honest Decomposition
Observed proxy disagreement can be decomposed as:
$$ \text{proxy gap} = \text{metric mismatch} + \text{optimization state} + \text{estimation noise}. $$
Current evidence weights those terms roughly as:
- Optimization state: large.
- Metric mismatch: real, but not enough by itself to explain token-budget dependence.
- Noise: nontrivial (especially where $$n=1$$ at 50M).
That is where we are.
Methodology Change Going Forward
For this project, metric reporting is now explicit:
- Primary endpoint: $$\Delta CE_{rec}$$ (behavior-preserving patch metric).
- Secondary diagnostics: cosine similarity, relative error norm.
- Tertiary diagnostic: $$\Delta R^2$$ for comparability only.
If these disagree, the behavioral metric wins.
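The precedence rule is easy to mis-apply in the moment, so here is the hierarchy as a one-screen sketch (function name and thresholds are hypothetical, not project code):

```python
def accept_sae(delta_ce_rec, delta_r2, secondary_ok=True):
    """Endpoint hierarchy from this section.

    Primary endpoint: Delta CE_rec (behavior-preserving patch metric).
    Secondary diagnostics (cosine sim, relative error norm) gate via
    secondary_ok. Delta R^2 is tertiary: reported for comparability,
    never allowed to override the behavioral endpoint.
    """
    if delta_ce_rec < 0:
        return False      # behavioral regression: reject, whatever R^2 says
    return secondary_ok   # R^2 deliberately absent from the decision
```

Note that `delta_r2` is accepted as an argument but never consulted: that is the point of the rule, made explicit in code.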
Decision Status vs Plan
Relative to the pre-registered gates:
D1 (optimization vs intrinsic)
In the tested low-$$k$$ mid-layer regime, current evidence supports optimization-dominant behavior.
- Negative CE gap at 10M
- Near-zero/positive CE gap by 50M/100M
- Mismatch rate drops to zero
D2 (geometry-aware proxy lift)
Still pending. Need SWD/pullback-style proxy leaderboard with held-out prediction.
D3 (dimensionality thread)
Still optional/secondary. Not needed to explain the current phase-2 reversal.
Caveats (Same Standards as Part I)
- 50M points currently have $$n=1$$ in this table. Treat as directional.
- Reported CIs reflect the aggregate used in this runbook; avoid duplicate-source pseudo-repeats in final stats (dedupe by `(k, tokens, seed)`).
- Scope is still narrow: two model sizes, one family, one hook class.
- This update is about low-$$k$$ mid-layer behavior; it does not settle all depth/model-class regimes.
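The dedupe caveat above amounts to keeping one record per `(k, tokens, seed)` before computing final statistics. A minimal sketch (record fields are illustrative, not the project's actual schema):

```python
def dedupe_runs(rows):
    """Keep the first record per (k, tokens, seed) key, dropping
    duplicate-source pseudo-repeats before aggregation."""
    seen = {}
    for row in rows:
        key = (row["k"], row["tokens"], row["seed"])
        seen.setdefault(key, row)  # first occurrence wins
    return list(seen.values())

rows = [
    {"k": 8, "tokens": "10M", "seed": 0, "delta_ce": -0.10},
    {"k": 8, "tokens": "10M", "seed": 0, "delta_ce": -0.10},  # duplicate source
    {"k": 8, "tokens": "10M", "seed": 1, "delta_ce": -0.09},
]
print(len(dedupe_runs(rows)))  # 3 rows -> 2 unique runs
```
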
Why I Think This Is Still Useful Work
Because the output is decision-useful either way:
- If mismatch were irreducible, we’d need geometry-aware objectives/metrics by default.
- If mismatch is largely optimization-mediated (what we now see here), we still need better evaluation protocol and budget controls before claiming cross-scale failures.
Both outcomes are useful. The less-wrong description, for this regime, is the second one.
Next High-Leverage Steps
- Add 50M repeats for tighter uncertainty.
- Add one $$k=32$$ boundary check at 100M (tests low-$$k$$ specificity).
- Run SWD-vs-$$R^2$$ predictive comparison (D2) with held-out cells.
- Freeze claim table: supported / pending / rejected.
Repro Paths
Representative outputs:
- workspace/results/proxy_gap_lowk_mid_50M/
- workspace/results/proxy_gap_lowk_mid_100M/
- workspace/results/proxy_gap_anchor_repeats/
- info-geo/outputs/phase2_repeat_analysis.md
- info-geo/outputs/phase2_repeat_analysis.csv
Bottom Line
Part I showed a worrying mismatch.
Part III says: yes, that mismatch exists, but in this regime it is heavily budget-dependent.
So the methodological correction is:
Evaluate SAEs behavior-first, and never interpret fixed-budget cross-scale proxy gaps as intrinsic until you run the token-budget sweep.