Live Experiment: MC-Linear-Attention vs Hybrid MC-LA 1:3

mongoobi, 3 Mar 2026 — this post will be updated as runs complete


Status: IN PROGRESS — both runs are live on an A100-80GB.


Experiment Setup

Two variants of memory-cached linear attention, head-to-head on the same dataset:

| Setting | MC-Linear-Attention | Hybrid MC-LA 1:3 |
|---|---|---|
| Params | 41M | 49M |
| MC layers | 12/12 (all) | 3/12 (every 4th) |
| MC overhead | 12.0% | 2.5% |
| d_model | 640 | 640 |
| n_layers | 12 | 12 |
| LA heads | 10 | 10 |
| Segment size | 256 | 256 |
| Backend | FLA kernel | FLA kernel |
| LR | 1e-4 | 1e-4 |
| Batch size | 8 (eff. 32) | 8 (eff. 32) |
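
As a rough illustration of the two layouts, the sketch below lists which of the 12 layers would carry MC under each variant. The helper `mc_layer_indices` is hypothetical, and placing MC on the last layer of each group of four (0-based indices 3, 7, 11) is an assumption; the setup above only says "every 4th", so the actual offset may differ.

```python
def mc_layer_indices(n_layers: int, every: int) -> list[int]:
    """Return 0-based indices of layers that carry MC.

    Assumption (not confirmed by the post): within each group of
    `every` layers, the last one is the MC-enhanced layer.
    """
    return [i for i in range(n_layers) if (i + 1) % every == 0]


full = mc_layer_indices(n_layers=12, every=1)    # every layer has MC
hybrid = mc_layer_indices(n_layers=12, every=4)  # one MC layer per group of 4

print("full MC layers:  ", full)
print("hybrid MC layers:", hybrid)
print("MC : plain ratio in hybrid =", len(hybrid), ":", 12 - len(hybrid))
```

Under this assumption the hybrid has 3 MC layers and 9 plain layers, which matches the 1:3 in its name.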

Dataset: ~25k FMA tracks (24,985 total: 22,486 train / 2,499 val), tokenized with DAC at 44.1 kHz.

Target: 35,000 steps each.


Training Curves (as of step 800)

MC-Linear-Attention (full MC)

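| Step | Loss | LR | Grad Norm | Tok/s | GRM Entropy | Cache |
|---|---|---|---|---|---|---|
| 100 | 6.9854 | 9.9e-6 | 0.15 | 168k | 1.155 | 8 |
| 200 | 6.9201 | 2.0e-5 | 0.09 | 166k | 1.111 | 8 |
| 300 | 6.9052 | 3.0e-5 | 0.09 | 165k | 1.096 | 8 |
| 400 | 6.8906 | 4.0e-5 | 0.12 | 150k | 1.051 | 8 |
| 500 | 6.8862 | 5.0e-5 | 0.16 | 120k | 0.983 | 8 |
| 600 | 6.8831 | 6.0e-5 | 0.09 | 119k | 0.957 | 8 |
| 700 | 6.8716 | 7.0e-5 | 0.11 | 121k | 0.935 | 8 |
| 800 | 6.8611 | 8.0e-5 | 0.08 | 118k | 0.842 | 8 |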

Hybrid MC-LA 1:3

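| Step | Loss | LR | Grad Norm | Tok/s | GRM Entropy | Cache |
|---|---|---|---|---|---|---|
| 100 | 7.0000 | 9.9e-6 | 0.12 | 221k | 1.255 | 8 |
| 200 | 6.9129 | 2.0e-5 | 0.10 | 225k | 1.291 | 8 |
| 300 | 6.9046 | 3.0e-5 | 0.08 | 225k | 1.159 | 8 |
| 400 | 6.9033 | 4.0e-5 | 0.07 | 224k | 0.870 | 8 |
| 500 | 6.8919 | 5.0e-5 | 0.09 | 225k | 0.841 | 8 |
| 600 | 6.8750 | 6.0e-5 | 0.12 | 221k | 0.789 | 8 |
| 700 | 6.8580 | 7.0e-5 | 0.20 | 222k | 0.805 | 8 |
| 800 | 6.8540 | 8.0e-5 | 0.07 | 219k | 0.776 | 8 |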

Early Observations

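1. The hybrid is narrowly ahead

At step 800, the hybrid's loss (6.854) is 0.007 below full MC-LA's (6.861). This is surprising: fewer MC layers, lower loss. The lead is small and has not been consistent (full MC-LA was ahead at steps 100, 400, and 500), and the LR is still ramping toward 1e-4, so it is too early to call. The hybrid also has 8M more parameters overall (49M vs 41M) due to more backbone capacity in the non-MC layers, which may explain the gap.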
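
To see how stable the lead is, the loss columns from the two tables above can be differenced step by step. This is only a small sketch over the eight logged points; the variable names are illustrative.

```python
steps = [100, 200, 300, 400, 500, 600, 700, 800]

# Loss values copied from the training-curve tables above.
full_mc_loss = [6.9854, 6.9201, 6.9052, 6.8906, 6.8862, 6.8831, 6.8716, 6.8611]
hybrid_loss = [7.0000, 6.9129, 6.9046, 6.9033, 6.8919, 6.8750, 6.8580, 6.8540]

# Negative gap means the hybrid has the lower loss at that step.
gaps = {
    step: round(h - f, 4)
    for step, f, h in zip(steps, full_mc_loss, hybrid_loss)
}
hybrid_ahead = [step for step, gap in gaps.items() if gap < 0]

for step, gap in gaps.items():
    print(f"step {step:4d}: hybrid - full = {gap:+.4f}")
print("hybrid ahead at steps:", hybrid_ahead)
```

The hybrid leads at five of the eight logged steps, including the last three, with a gap of at most about 0.014 in either direction.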

2. Throughput gap is massive

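From step 500 onward the hybrid runs at 219-225k tok/s vs 118-121k tok/s for MC-LA, a speedup of roughly 1.85x (219k / 118k ≈ 1.86 at step 800). The gap was smaller early on: full MC-LA ran at 165-168k tok/s for the first 300 steps before slowing down, while the hybrid stayed flat. MC-enhanced layers are expensive (the softmax gating over cached segments adds overhead), so having only 3/12 layers with MC keeps throughput high.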
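
The per-step throughput ratio can be checked directly from the Tok/s columns of the two tables above. A minimal sketch (values in thousands of tokens per second, names illustrative):

```python
steps = [100, 200, 300, 400, 500, 600, 700, 800]

# Tok/s columns from the tables above, in thousands of tokens per second.
full_mc_ktok_s = [168, 166, 165, 150, 120, 119, 121, 118]
hybrid_ktok_s = [221, 225, 225, 224, 225, 221, 222, 219]

# Speedup of the hybrid over full MC-LA at each logged step.
ratios = [h / f for f, h in zip(full_mc_ktok_s, hybrid_ktok_s)]

for step, ratio in zip(steps, ratios):
    print(f"step {step:4d}: hybrid / full = {ratio:.2f}x")
```

On these numbers the ratio is roughly 1.3-1.4x for the first 300 steps and settles between 1.8x and 1.9x from step 500 onward.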

3. GRM entropy tells a story

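The hybrid's GRM entropy dropped faster (1.255 → 0.776) than full MC-LA's (1.155 → 0.842), though not monotonically (it ticked up at steps 200 and 700). One reading is that with fewer MC layers, each one needs to be more decisive about what to retrieve, and the numbers so far are consistent with that. In both runs the gating is becoming selective.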

For reference, minimum possible entropy (perfect selectivity, all weight on one segment) approaches 0. Maximum (uniform over 8+1 segments) is $$\ln(9) \approx 2.20$$.
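
The bounds above can be reproduced with a few lines. This sketch assumes the logged GRM entropy is the plain Shannon entropy, in nats, of the normalized gate weights over 8 cached segments plus the current one; how the value is averaged across heads and layers is not stated in the post. The function name `gating_entropy` is illustrative.

```python
import math


def gating_entropy(weights: list[float]) -> float:
    """Shannon entropy in nats of a normalized gating distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)


n_slots = 8 + 1  # 8 cached segments + the current segment

uniform = [1.0 / n_slots] * n_slots      # no selectivity at all
one_hot = [1.0] + [0.0] * (n_slots - 1)  # all weight on one segment

max_entropy = gating_entropy(uniform)  # equals ln(9)
min_entropy = gating_entropy(one_hot)  # equals 0

# exp(entropy) is the "effective number" of segments being attended to.
effective_hybrid = math.exp(0.776)  # hybrid at step 800
effective_full = math.exp(0.842)    # full MC-LA at step 800

print(f"max entropy: {max_entropy:.3f}")
print(f"effective segments, hybrid:  {effective_hybrid:.2f}")
print(f"effective segments, full MC: {effective_full:.2f}")
```

Read this way, the step-800 values correspond to the gate spreading its weight over roughly 2.2 (hybrid) and 2.3 (full MC-LA) segments out of 9.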

4. Stability

Both runs are clean so far: no gradient spikes (grad norm stays between 0.07 and 0.20) and no NaNs. Moving to a 25k-track dataset (vs 10k) and lr=1e-4 (vs 3e-4) resolved the instability seen in earlier attempts.


Parallel Work: FMA-Large Tokenization

While these runs train on 25k tracks, we're tokenizing the full FMA-Large dataset (106k tracks) in the background. ~20% complete as of this writing. Once done, we'll retrain the best architecture on the full dataset.


What's Next

  • [ ] Let both runs finish (35k steps)
  • [ ] Evaluate per-codebook metrics on val set
  • [ ] Compare against transformer and hybrid 1:3 baselines
  • [ ] Generate and listen to audio samples from best checkpoint
  • [ ] Retrain winner on FMA-Large (106k tracks)
  • [ ] Ablation: segment size sweep (128, 256, 512)
  • [ ] Ablation: MC on different layer positions (early vs late vs uniform)

Last updated: 3 Mar 2026, step 800