Phase-2 Update: The Important Part Is Not “Gotcha,” It’s Regime Control

mongoobi, Feb 2026


This is a direct update to Part I, where I reported a mid-layer low-$$k$$ proxy mismatch: $$\Delta R^2 > 0$$ while $$\Delta CE_{rec} < 0$$ when scaling from Pythia-70M to Pythia-410M.

Short version:

  1. The mismatch is real at low token budgets.
  2. In the phase-2 runs, it shrinks and reverses by higher token budgets.
  3. So in the tested regime, the mismatch now looks primarily optimization-mediated rather than a purely intrinsic property of the geometry.

That does not rescue $$R^2$$ as a behavioral metric. It means the failure mode is regime-dependent, not immutable.


Epistemic Status

  1. Strong on sign-level behavior in this narrow setting (Pythia 70M vs 410M, mid-layer, low-$$k$$, current training setup).
  2. Moderate on mechanism attribution (optimization clearly matters; full geometry contribution still open).
  3. Weak on universality (not yet across model families, broader hooks, or deep seed grids).

If you only keep one sentence from this post, keep this:

Fixed-budget proxy gaps are not enough to claim intrinsic failure; run the token-budget sweep first.


What Changed

Part I rested mostly on evidence from SAEs trained at a 10M-token budget. This update adds phase-2 low-$$k$$ mid-layer runs at larger token budgets and anchor repeats.

Setup (same scope as before):

  • Models: pythia-70m vs pythia-410m
  • Hookpoint: mid-layer hook_resid_post (L3 vs L12)
  • SAE class: TopK, expansion 32x
  • Focus: $$k \in \{8,16\}$$
  • Token budgets: 10M, 50M, 100M
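As a concrete sketch, the setup above enumerates twelve cells per seed (model × $$k$$ × budget). The names and record layout below are illustrative assumptions, not the actual run config:

```python
from itertools import product

# Illustrative sweep grid matching the setup above; field names and
# structure are assumptions, not the real runbook config.
MODELS = ("pythia-70m", "pythia-410m")
KS = (8, 16)                      # TopK sparsity levels under study
BUDGETS = ("10M", "50M", "100M")  # SAE training token budgets

cells = [
    {"model": m, "k": k, "tokens": t,
     "hook": "hook_resid_post", "expansion": 32}
    for m, k, t in product(MODELS, KS, BUDGETS)
]
print(len(cells))  # 12 cells per seed
```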

Primary statistic remains:

$$ \Delta CE_{rec}(k,T) := CE_{rec}^{410M}(k,T) - CE_{rec}^{70M}(k,T). $$

Mismatch indicator:

$$ I_{mismatch} = \mathbb{1}\left[\Delta R^2 > 0 \land \Delta CE_{rec} < 0\right]. $$
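Both statistics are cheap to compute per cell. A minimal sketch (function name and signature are mine, not from the runbook):

```python
def proxy_gap(ce_rec_big, ce_rec_small, r2_big, r2_small):
    """Cross-scale proxy gaps for one (k, T) cell.

    ce_rec_*: CE under reconstruction patching for the 410M / 70M model.
    r2_*: variance explained by the corresponding SAE.
    """
    delta_ce = ce_rec_big - ce_rec_small
    delta_r2 = r2_big - r2_small
    # Mismatch: the R^2 gap and the behavioral CE gap point in
    # opposite directions (Delta R^2 > 0 while Delta CE_rec < 0).
    mismatch = delta_r2 > 0 and delta_ce < 0
    return delta_ce, delta_r2, mismatch
```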


Core Result

From the phase-2 aggregate:

| $$k$$ | tokens | n | $$\Delta CE_{rec}$$ mean (95% CI) | $$\Delta R^2$$ mean (95% CI) | mismatch rate |
|---|---|---|---|---|---|
| 8 | 10M | 4 | -0.099 [-0.117, -0.082] | +0.150 [0.147, 0.153] | 1.00 |
| 8 | 50M | 1 | +0.018 [0.018, 0.018] | +0.108 [0.108, 0.108] | 0.00 |
| 8 | 100M | 3 | +0.009 [0.004, 0.013] | +0.096 [0.096, 0.097] | 0.00 |
| 16 | 10M | 4 | -0.028 [-0.064, 0.009] | +0.111 [0.109, 0.112] | 0.50 |
| 16 | 50M | 1 | +0.018 [0.018, 0.018] | +0.072 [0.072, 0.072] | 0.00 |
| 16 | 100M | 3 | +0.020 [0.016, 0.024] | +0.067 [0.067, 0.067] | 0.00 |
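The mismatch-rate column is just the fraction of seeds per (k, tokens) cell that trip the indicator. A sketch of that aggregation; the record layout is an assumption and the numbers are illustrative, shaped like the $$k=16$$ rows above rather than taken from the real runs:

```python
from collections import defaultdict

def mismatch_rate(runs):
    """Per-(k, tokens) mismatch rate from per-seed records.

    runs: iterable of dicts with keys k, tokens, delta_ce, delta_r2
    (one entry per seed). Returns {(k, tokens): rate in [0, 1]}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in runs:
        cell = (r["k"], r["tokens"])
        totals[cell] += 1
        if r["delta_r2"] > 0 and r["delta_ce"] < 0:
            hits[cell] += 1
    return {cell: hits[cell] / n for cell, n in totals.items()}

# Illustrative per-seed numbers (not the real run data).
runs = [
    {"k": 16, "tokens": "10M", "delta_ce": -0.06, "delta_r2": 0.11},
    {"k": 16, "tokens": "10M", "delta_ce": 0.01, "delta_r2": 0.11},
    {"k": 16, "tokens": "100M", "delta_ce": 0.02, "delta_r2": 0.07},
]
print(mismatch_rate(runs))  # {(16, '10M'): 0.5, (16, '100M'): 0.0}
```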

What matters:

  1. At 10M, low-$$k$$ mismatch is present (especially $$k=8$$).
  2. By 100M, $$\Delta CE_{rec}$$ is positive for both $$k=8$$ and $$k=16$$ while $$\Delta R^2$$ remains positive.
  3. Mismatch rate collapses from high to zero.

Interpretation in plain English:

  • The original “bigger model looks solved in $$R^2$$ but behaves worse” effect is strongest in the low-budget regime.
  • In this tested low-$$k$$ mid-layer setup, training budget can remove the negative CE gap.

What I’m Updating (Belief-Level)

The strongest version of my Part I story was:

  1. low-$$k$$ proxy mismatch is real,
  2. and plausibly intrinsic.

After phase-2, I would update that to:

  1. low-$$k$$ proxy mismatch is real,
  2. but in this tested regime it is mostly a training-state effect unless shown otherwise.

That is a real belief update, not just rewording.


The Update to the Claim (and Why It’s Better)

Part I headline (fair for the data then): reconstruction proxies can disagree with behavior.

Part III headline (better):

Proxy validity is regime-dependent. At low token budgets, $$R^2$$ can point the wrong way for behavior. As SAE training improves, that disagreement can shrink or reverse.

This is a stronger scientific position, not a weaker one:

  1. It keeps the empirical warning (do not accept SAEs on $$R^2$$ alone).
  2. It avoids overclaiming intrinsic impossibility.
  3. It gives a concrete operational rule: always run token-budget sweeps before mechanism claims.
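Rule 3 can be made mechanical. A toy classifier over the sign pattern of $$\Delta CE_{rec}$$ across budgets; the thresholds and labels are mine, not a pre-registered rule:

```python
def budget_sweep_verdict(delta_ce_by_budget):
    """Toy decision rule for one cell of the token-budget sweep.

    delta_ce_by_budget: maps token budget (int) -> mean Delta CE_rec.
    Labels are illustrative, not pre-registered terminology.
    """
    negative = [delta_ce_by_budget[t] < 0 for t in sorted(delta_ce_by_budget)]
    if all(negative):
        return "intrinsic-candidate"    # behavioral gap never closes
    if negative[0] and not negative[-1]:
        return "optimization-mediated"  # gap closes as budget grows
    return "no low-budget mismatch"

# The k=8 cell from the phase-2 table:
print(budget_sweep_verdict({10_000_000: -0.099,
                            50_000_000: +0.018,
                            100_000_000: +0.009}))  # optimization-mediated
```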

What This Does Not Mean

It does not mean:

  1. “Geometry is irrelevant.”
  2. “$$R^2$$ is fine after all.”
  3. “The original mismatch was fake.”

Why not:

  • $$\Delta R^2$$ stays positive throughout, even when $$\Delta CE_{rec}$$ changes sign.
  • So reconstruction-space and behavior-space are still not equivalent objectives.
  • What changed is where the training dynamics place remaining error mass.

I still expect sensitivity-aware metrics (SWD/pullback approximations) to matter. I’m just no longer treating them as the only explanation for the observed low-$$k$$ gap in this regime.


A More Honest Decomposition

Observed proxy disagreement can be decomposed as:

$$ \text{proxy gap} = \text{metric mismatch} + \text{optimization state} + \text{estimation noise}. $$

Current evidence weights those terms roughly as:

  1. Optimization state: large.
  2. Metric mismatch: real, but not enough by itself to explain token-budget dependence.
  3. Noise: nontrivial (especially where $$n=1$$ at 50M).

That is where we are.


Methodology Change Going Forward

For this project, metric reporting is now explicit:

  1. Primary endpoint: $$\Delta CE_{rec}$$ (behavior-preserving patch metric).
  2. Secondary diagnostics: cosine similarity, relative error norm.
  3. Tertiary diagnostic: $$\Delta R^2$$ for comparability only.

If these disagree, the behavioral metric wins.
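The secondary diagnostics in item 2 are per-token quantities averaged over a batch. A minimal numpy sketch (function name and the epsilon guard are mine):

```python
import numpy as np

def secondary_diagnostics(acts, recon, eps=1e-8):
    """Mean cosine similarity and mean relative error norm between
    original activations and SAE reconstructions, both [n_tokens, d]."""
    a_norm = np.linalg.norm(acts, axis=-1)
    r_norm = np.linalg.norm(recon, axis=-1)
    cos = (acts * recon).sum(-1) / (a_norm * r_norm + eps)
    rel_err = np.linalg.norm(acts - recon, axis=-1) / (a_norm + eps)
    return float(cos.mean()), float(rel_err.mean())
```

A perfect reconstruction gives cosine 1 and relative error 0; degradation in either is a cheap early warning before running the full CE patch.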


Decision Status vs Plan

Relative to the pre-registered gates:

D1 (optimization vs intrinsic)

In the tested low-$$k$$ mid-layer regime, current evidence supports optimization-dominant behavior.

  • Negative CE gap at 10M
  • Near-zero/positive CE gap by 50M/100M
  • Mismatch rate drops to zero

D2 (geometry-aware proxy lift)

Still pending. Need SWD/pullback-style proxy leaderboard with held-out prediction.

D3 (dimensionality thread)

Still optional/secondary. Not needed to explain the current phase-2 reversal.


Caveats (Same Standards as Part I)

  1. 50M points currently have $$n=1$$ in this table. Treat as directional.
  2. Reported CIs reflect the aggregate used in this runbook; avoid duplicate-source pseudo-repeats in final stats (dedupe by (k, tokens, seed)).
  3. Scope is still narrow: two model sizes, one family, one hook class.
  4. This update is about low-$$k$$ mid-layer behavior; it does not settle all depth/model-class regimes.
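Caveat 2's dedupe step, as a stdlib-only sketch (the row keys are assumed from the (k, tokens, seed) convention above):

```python
def dedupe_runs(rows):
    """Keep the first record per (k, tokens, seed) so the same run pulled
    in from multiple result directories is not counted twice."""
    seen = {}
    for row in rows:
        key = (row["k"], row["tokens"], row["seed"])
        seen.setdefault(key, row)  # first occurrence wins
    return list(seen.values())
```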

Why I Think This Is Still Useful Work

Because the output is decision-useful either way:

  1. If mismatch were irreducible, we’d need geometry-aware objectives/metrics by default.
  2. If mismatch is largely optimization-mediated (what we now see here), we still need better evaluation protocol and budget controls before claiming cross-scale failures.

Both outcomes are useful. The less-wrong description, for this regime, is the second one.


Next High-Leverage Steps

  1. Add 50M repeats for tighter uncertainty.
  2. Add one $$k=32$$ boundary check at 100M (tests low-$$k$$ specificity).
  3. Run SWD-vs-$$R^2$$ predictive comparison (D2) with held-out cells.
  4. Freeze claim table: supported / pending / rejected.

Repro Paths

Representative outputs:

  • workspace/results/proxy_gap_lowk_mid_50M/
  • workspace/results/proxy_gap_lowk_mid_100M/
  • workspace/results/proxy_gap_anchor_repeats/
  • info-geo/outputs/phase2_repeat_analysis.md
  • info-geo/outputs/phase2_repeat_analysis.csv

Bottom Line

Part I showed a worrying mismatch.

Part III says: yes, that mismatch exists, but in this regime it is heavily budget-dependent.

So the methodological correction is:

Evaluate SAEs behavior-first, and never interpret fixed-budget cross-scale proxy gaps as intrinsic until you run the token-budget sweep.