Phase-2 Update: The Important Part Is Not “Gotcha,” It’s Regime Control

mongoobi, Feb 2026


This is a direct update to Part I, where I reported a mid-layer low-$$k$$ proxy mismatch: $$\Delta R^2 > 0$$ while $$\Delta CE_{rec} < 0$$ when scaling from Pythia-70M to Pythia-410M.

Short version:

  1. The mismatch is real at low token budgets.
  2. In the phase-2 runs, it shrinks and reverses by higher token budgets.
  3. So in the tested regime, the mismatch now looks primarily optimization-mediated rather than a purely intrinsic property of the geometry.

That does not rescue $$R^2$$ as a behavioral metric. It means the failure mode is regime-dependent, not immutable.


Epistemic Status

  1. Strong on sign-level behavior in this narrow setting (Pythia 70M vs 410M, mid-layer, low-$$k$$, current training setup).
  2. Moderate on mechanism attribution (optimization clearly matters; full geometry contribution still open).
  3. Weak on universality (not yet across model families, broader hooks, or deep seed grids).

If you only keep one sentence from this post, keep this:

Fixed-budget proxy gaps are not enough to claim intrinsic failure; run the token-budget sweep first.


What Changed

Part I rested mostly on evidence from SAEs trained at a 10M-token budget. This update adds phase-2 low-$$k$$ mid-layer runs at larger token budgets and anchor repeats.

Setup (same scope as before):

  • Models: pythia-70m vs pythia-410m
  • Hookpoint: mid-layer hook_resid_post (L3 vs L12)
  • SAE class: TopK, expansion 32x
  • Focus: $$k \in \{8,16\}$$
  • Token budgets: 10M, 50M, 100M
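As a concrete sketch, the setup above enumerates twelve cells per seed (model × $$k$$ × budget). The names and record layout below are illustrative assumptions, not the actual run config:

```python
from itertools import product

# Illustrative sweep grid matching the setup above; field names and
# structure are assumptions, not the real runbook config.
MODELS = ("pythia-70m", "pythia-410m")
KS = (8, 16)                      # TopK sparsity levels under study
BUDGETS = ("10M", "50M", "100M")  # SAE training token budgets

cells = [
    {"model": m, "k": k, "tokens": t,
     "hook": "hook_resid_post", "expansion": 32}
    for m, k, t in product(MODELS, KS, BUDGETS)
]
print(len(cells))  # 12 cells per seed
```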

Primary statistic remains:

$$ \Delta CE_{rec}(k,T) := CE_{rec}^{410M}(k,T) - CE_{rec}^{70M}(k,T). $$

Mismatch indicator:

$$ I_{mismatch} = \mathbb{1}\left[\Delta R^2 > 0 \land \Delta CE_{rec} < 0\right]. $$
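Both statistics are cheap to compute per cell. A minimal sketch (function name and signature are mine, not from the runbook):

```python
def proxy_gap(ce_rec_big, ce_rec_small, r2_big, r2_small):
    """Cross-scale proxy gaps for one (k, T) cell.

    ce_rec_*: CE under reconstruction patching for the 410M / 70M model.
    r2_*: variance explained by the corresponding SAE.
    """
    delta_ce = ce_rec_big - ce_rec_small
    delta_r2 = r2_big - r2_small
    # Mismatch: the R^2 gap and the behavioral CE gap point in
    # opposite directions (Delta R^2 > 0 while Delta CE_rec < 0).
    mismatch = delta_r2 > 0 and delta_ce < 0
    return delta_ce, delta_r2, mismatch
```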


Core Result

From the phase-2 aggregate:

| $$k$$ | tokens | n | $$\Delta CE_{rec}$$ mean (95% CI) | $$\Delta R^2$$ mean (95% CI) | mismatch rate |
|---|---|---|---|---|---|
| 8 | 10M | 4 | -0.099 [-0.117, -0.082] | +0.150 [0.147, 0.153] | 1.00 |
| 8 | 50M | 1 | +0.018 [0.018, 0.018] | +0.108 [0.108, 0.108] | 0.00 |
| 8 | 100M | 3 | +0.009 [0.004, 0.013] | +0.096 [0.096, 0.097] | 0.00 |
| 16 | 10M | 4 | -0.028 [-0.064, 0.009] | +0.111 [0.109, 0.112] | 0.50 |
| 16 | 50M | 1 | +0.018 [0.018, 0.018] | +0.072 [0.072, 0.072] | 0.00 |
| 16 | 100M | 3 | +0.020 [0.016, 0.024] | +0.067 [0.067, 0.067] | 0.00 |
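The mismatch-rate column is just the fraction of seeds per (k, tokens) cell that trip the indicator. A sketch of that aggregation; the record layout is an assumption and the numbers are illustrative, shaped like the $$k=16$$ rows above rather than taken from the real runs:

```python
from collections import defaultdict

def mismatch_rate(runs):
    """Per-(k, tokens) mismatch rate from per-seed records.

    runs: iterable of dicts with keys k, tokens, delta_ce, delta_r2
    (one entry per seed). Returns {(k, tokens): rate in [0, 1]}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in runs:
        cell = (r["k"], r["tokens"])
        totals[cell] += 1
        if r["delta_r2"] > 0 and r["delta_ce"] < 0:
            hits[cell] += 1
    return {cell: hits[cell] / n for cell, n in totals.items()}

# Illustrative per-seed numbers (not the real run data).
runs = [
    {"k": 16, "tokens": "10M", "delta_ce": -0.06, "delta_r2": 0.11},
    {"k": 16, "tokens": "10M", "delta_ce": 0.01, "delta_r2": 0.11},
    {"k": 16, "tokens": "100M", "delta_ce": 0.02, "delta_r2": 0.07},
]
print(mismatch_rate(runs))  # {(16, '10M'): 0.5, (16, '100M'): 0.0}
```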

What matters:

  1. At 10M, low-$$k$$ mismatch is present (especially $$k=8$$).
  2. By 100M, $$\Delta CE_{rec}$$ is positive for both $$k=8$$ and $$k=16$$ while $$\Delta R^2$$ remains positive.
  3. Mismatch rate collapses from high to zero.

Interpretation in plain English:

  • The original “bigger model looks solved in $$R^2$$ but behaves worse” effect is strongest in the low-budget regime.
  • In this tested low-$$k$$ mid-layer setup, training budget can remove the negative CE gap.

What I’m Updating (Belief-Level)

The strongest version of my Part I story was:

  1. low-$$k$$ proxy mismatch is real,
  2. and plausibly intrinsic.

After phase-2, I would update that to:

  1. low-$$k$$ proxy mismatch is real,
  2. but in this tested regime it is mostly a training-state effect unless shown otherwise.

That is a real belief update, not just rewording.


The Update to the Claim (and Why It’s Better)

Part I headline (fair for the data then): reconstruction proxies can disagree with behavior.

Part III headline (better):

Proxy validity is regime-dependent. At low token budgets, $$R^2$$ can point the wrong way for behavior. As SAE training improves, that disagreement can shrink or reverse.

This is a stronger scientific position, not a weaker one:

  1. It keeps the empirical warning (do not accept SAEs on $$R^2$$ alone).
  2. It avoids overclaiming intrinsic impossibility.
  3. It gives a concrete operational rule: always run token-budget sweeps before mechanism claims.
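Rule 3 can be made mechanical. A toy classifier over the sign pattern of $$\Delta CE_{rec}$$ across budgets; the thresholds and labels are mine, not a pre-registered rule:

```python
def budget_sweep_verdict(delta_ce_by_budget):
    """Toy decision rule for one cell of the token-budget sweep.

    delta_ce_by_budget: maps token budget (int) -> mean Delta CE_rec.
    Labels are illustrative, not pre-registered terminology.
    """
    negative = [delta_ce_by_budget[t] < 0 for t in sorted(delta_ce_by_budget)]
    if all(negative):
        return "intrinsic-candidate"    # behavioral gap never closes
    if negative[0] and not negative[-1]:
        return "optimization-mediated"  # gap closes as budget grows
    return "no low-budget mismatch"

# The k=8 cell from the phase-2 table:
print(budget_sweep_verdict({10_000_000: -0.099,
                            50_000_000: +0.018,
                            100_000_000: +0.009}))  # optimization-mediated
```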

What This Does Not Mean

It does not mean:

  1. “Geometry is irrelevant.”
  2. “$$R^2$$ is fine after all.”
  3. “The original mismatch was fake.”

Why not:

  • $$\Delta R^2$$ stays positive throughout, even when $$\Delta CE_{rec}$$ changes sign.
  • So reconstruction-space and behavior-space are still not equivalent objectives.
  • What changed is where the training dynamics place remaining error mass.

I still expect sensitivity-aware metrics (SWD/pullback approximations) to matter. I’m just no longer treating them as the only explanation for the observed low-$$k$$ gap in this regime.


A More Honest Decomposition

Observed proxy disagreement can be decomposed as:

$$ \text{proxy gap} = \text{metric mismatch} + \text{optimization state} + \text{estimation noise}. $$

Current evidence weights those terms roughly as:

  1. Optimization state: large.
  2. Metric mismatch: real, but not enough by itself to explain token-budget dependence.
  3. Noise: nontrivial (especially where $$n=1$$ at 50M).

That is where we are.


Methodology Change Going Forward

For this project, metric reporting is now explicit:

  1. Primary endpoint: $$\Delta CE_{rec}$$ (behavior-preserving patch metric).
  2. Secondary diagnostics: cosine similarity, relative error norm.
  3. Tertiary diagnostic: $$\Delta R^2$$ for comparability only.

If these disagree, the behavioral metric wins.
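The secondary diagnostics in item 2 are per-token quantities averaged over a batch. A minimal numpy sketch (function name and the epsilon guard are mine):

```python
import numpy as np

def secondary_diagnostics(acts, recon, eps=1e-8):
    """Mean cosine similarity and mean relative error norm between
    original activations and SAE reconstructions, both [n_tokens, d]."""
    a_norm = np.linalg.norm(acts, axis=-1)
    r_norm = np.linalg.norm(recon, axis=-1)
    cos = (acts * recon).sum(-1) / (a_norm * r_norm + eps)
    rel_err = np.linalg.norm(acts - recon, axis=-1) / (a_norm + eps)
    return float(cos.mean()), float(rel_err.mean())
```

A perfect reconstruction gives cosine 1 and relative error 0; degradation in either is a cheap early warning before running the full CE patch.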


Decision Status vs Plan

Relative to the pre-registered gates:

D1 (optimization vs intrinsic)

In the tested low-$$k$$ mid-layer regime, current evidence supports optimization-dominant behavior.

  • Negative CE gap at 10M
  • Near-zero/positive CE gap by 50M/100M
  • Mismatch rate drops to zero

D2 (geometry-aware proxy lift)

Still pending. Need SWD/pullback-style proxy leaderboard with held-out prediction.

D3 (dimensionality thread)

Still optional/secondary. Not needed to explain the current phase-2 reversal.


Caveats (Same Standards as Part I)

  1. 50M points currently have $$n=1$$ in this table. Treat as directional.
  2. Reported CIs reflect the aggregate used in this runbook; avoid duplicate-source pseudo-repeats in final stats (dedupe by (k, tokens, seed)).
  3. Scope is still narrow: two model sizes, one family, one hook class.
  4. This update is about low-$$k$$ mid-layer behavior; it does not settle all depth/model-class regimes.
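Caveat 2's dedupe step, as a stdlib-only sketch (the row keys are assumed from the (k, tokens, seed) convention above):

```python
def dedupe_runs(rows):
    """Keep the first record per (k, tokens, seed) so the same run pulled
    in from multiple result directories is not counted twice."""
    seen = {}
    for row in rows:
        key = (row["k"], row["tokens"], row["seed"])
        seen.setdefault(key, row)  # first occurrence wins
    return list(seen.values())
```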

Why I Think This Is Still Useful Work

Because the output is decision-useful either way:

  1. If mismatch were irreducible, we’d need geometry-aware objectives/metrics by default.
  2. If mismatch is largely optimization-mediated (what we now see here), we still need better evaluation protocol and budget controls before claiming cross-scale failures.

Both outcomes are useful. The less-wrong description, for this regime, is the second one.


Next High-Leverage Steps

  1. Add 50M repeats for tighter uncertainty.
  2. Add one $$k=32$$ boundary check at 100M (tests low-$$k$$ specificity).
  3. Run SWD-vs-$$R^2$$ predictive comparison (D2) with held-out cells.
  4. Freeze claim table: supported / pending / rejected.

Repro Paths

Representative outputs:

  • workspace/results/proxy_gap_lowk_mid_50M/
  • workspace/results/proxy_gap_lowk_mid_100M/
  • workspace/results/proxy_gap_anchor_repeats/
  • info-geo/outputs/phase2_repeat_analysis.md
  • info-geo/outputs/phase2_repeat_analysis.csv

Bottom Line

Part I showed a worrying mismatch.

Part III says: yes, that mismatch exists, but in this regime it is heavily budget-dependent.

So the methodological correction is:

Evaluate SAEs behavior-first, and never interpret fixed-budget cross-scale proxy gaps as intrinsic until you run the token-budget sweep.