Live Experiment: MC-Linear-Attention vs Hybrid MC-LA 1:3
mongoobi, 3 Mar 2026 — this post will be updated as runs complete
Status: IN PROGRESS — both runs are live on an A100-80GB.
Experiment Setup
Two variants of memory-cached linear attention, head-to-head on the same dataset:
| | MC-Linear-Attention | Hybrid MC-LA 1:3 |
|---|---|---|
| Params | 41M | 49M |
| MC layers | 12/12 (all) | 3/12 (every 4th) |
| MC overhead | 12.0% | 2.5% |
| d_model | 640 | 640 |
| n_layers | 12 | 12 |
| LA heads | 10 | 10 |
| Segment size | 256 | 256 |
| Backend | FLA kernel | FLA kernel |
| LR | 1e-4 | 1e-4 |
| Batch size | 8 (eff. 32) | 8 (eff. 32) |
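As a rough illustration of the "MC layers" row, here is a minimal sketch of how the two layouts could be selected. The helper name `mc_layer_indices` is hypothetical, and the choice of placing the MC layer last in each group of four (0-based indices 3, 7, 11) is an assumption; the table only says "every 4th".

```python
def mc_layer_indices(n_layers: int, every: int) -> list[int]:
    """Return 0-based indices of the layers that get memory caching.

    Assumption: the MC layer is the last one in each group of `every`
    layers. The actual offset used in the runs is not stated in the post.
    """
    return [i for i in range(n_layers) if (i + 1) % every == 0]


full_mc_layers = mc_layer_indices(12, every=1)  # MC on all 12 layers
hybrid_layers = mc_layer_indices(12, every=4)   # MC on 3 of 12 layers

print(len(full_mc_layers), "MC layers in the full variant")
print(len(hybrid_layers), "MC layers in the hybrid:", hybrid_layers)
```

The remaining 9 layers of the hybrid would be plain linear-attention layers under this reading.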
Dataset: 25k FMA tracks tokenized with DAC 44.1kHz (22,486 train / 2,499 val).
Target: 35,000 steps each.
Training Curves (as of step 800)
MC-Linear-Attention (full MC)
| Step | Loss | LR | Grad Norm | Tok/s | GRM Entropy | Cache |
|---|---|---|---|---|---|---|
| 100 | 6.9854 | 9.9e-6 | 0.15 | 168k | 1.155 | 8 |
| 200 | 6.9201 | 2.0e-5 | 0.09 | 166k | 1.111 | 8 |
| 300 | 6.9052 | 3.0e-5 | 0.09 | 165k | 1.096 | 8 |
| 400 | 6.8906 | 4.0e-5 | 0.12 | 150k | 1.051 | 8 |
| 500 | 6.8862 | 5.0e-5 | 0.16 | 120k | 0.983 | 8 |
| 600 | 6.8831 | 6.0e-5 | 0.09 | 119k | 0.957 | 8 |
| 700 | 6.8716 | 7.0e-5 | 0.11 | 121k | 0.935 | 8 |
| 800 | 6.8611 | 8.0e-5 | 0.08 | 118k | 0.842 | 8 |
Hybrid MC-LA 1:3
| Step | Loss | LR | Grad Norm | Tok/s | GRM Entropy | Cache |
|---|---|---|---|---|---|---|
| 100 | 7.0000 | 9.9e-6 | 0.12 | 221k | 1.255 | 8 |
| 200 | 6.9129 | 2.0e-5 | 0.10 | 225k | 1.291 | 8 |
| 300 | 6.9046 | 3.0e-5 | 0.08 | 225k | 1.159 | 8 |
| 400 | 6.9033 | 4.0e-5 | 0.07 | 224k | 0.870 | 8 |
| 500 | 6.8919 | 5.0e-5 | 0.09 | 225k | 0.841 | 8 |
| 600 | 6.8750 | 6.0e-5 | 0.12 | 221k | 0.789 | 8 |
| 700 | 6.8580 | 7.0e-5 | 0.20 | 222k | 0.805 | 8 |
| 800 | 6.8540 | 8.0e-5 | 0.07 | 219k | 0.776 | 8 |
Early Observations
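1. The hybrid is slightly ahead
At step 800, the hybrid's loss (6.854) is slightly below full MC-LA's (6.861), a gap of 0.007. This is surprising: fewer MC layers, lower loss. Two caveats apply. The lead has changed hands several times in the tables above (full MC-LA was ahead at steps 100, 400, and 500), so a gap this small at step 800 of 35,000 is an early signal, not a result. And the hybrid has 8M more parameters overall (49M vs 41M) due to more backbone capacity in the non-MC layers, which may explain the difference on its own.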
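The per-step gap can be checked directly from the two tables above. This is just arithmetic on the logged values, nothing from the training code:

```python
# Loss values copied from the two training tables above.
steps = [100, 200, 300, 400, 500, 600, 700, 800]
full_mc = [6.9854, 6.9201, 6.9052, 6.8906, 6.8862, 6.8831, 6.8716, 6.8611]
hybrid = [7.0000, 6.9129, 6.9046, 6.9033, 6.8919, 6.8750, 6.8580, 6.8540]

# A negative gap means the hybrid has the lower loss at that step.
gaps = {s: round(h - f, 4) for s, h, f in zip(steps, hybrid, full_mc)}
hybrid_ahead = [s for s, g in gaps.items() if g < 0]

for s, g in gaps.items():
    print(f"step {s}: hybrid - full MC = {g:+.4f}")
print("hybrid ahead at steps:", hybrid_ahead)
```

The sign of the gap flips more than once in the first 800 steps, which is why the comparison is worth revisiting as the runs progress.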
2. Throughput gap is massive
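The hybrid runs at 219-225k tok/s throughout. Full MC-LA started at 165-168k tok/s (steps 100-300) but has settled at 118-121k since step 500. Against that settled rate, the hybrid is roughly 1.8-1.9x faster. MC-enhanced layers are expensive (the softmax gating over cached segments adds overhead), so having only 3/12 layers with MC keeps throughput high.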
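For anyone who wants to reproduce the ratio, a small sketch using the Tok/s columns for steps 500-800 (values in thousands of tokens per second). Restricting to those steps is my choice here, since that is where full MC-LA's throughput has levelled off:

```python
from statistics import mean

# Tok/s (in thousands) for steps 500, 600, 700, 800 from the tables above.
full_mc_toks = [120, 119, 121, 118]
hybrid_toks = [225, 221, 222, 219]

speedup = mean(hybrid_toks) / mean(full_mc_toks)
low = min(hybrid_toks) / max(full_mc_toks)    # most conservative pairing
high = max(hybrid_toks) / min(full_mc_toks)   # most generous pairing

print(f"mean speedup: {speedup:.2f}x (range {low:.2f}x to {high:.2f}x)")
```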
3. GRM entropy tells a story
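The hybrid's GRM entropy has dropped further (1.255 → 0.776) than full MC-LA's (1.155 → 0.842) over the same 700 steps, though not monotonically (it rose at steps 200 and 700). One plausible reading: with fewer MC layers, each one needs to be more decisive about what to retrieve. Either way, the gating is becoming selective in both runs.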
For reference, minimum possible entropy (perfect selectivity, all weight on one segment) approaches 0. Maximum (uniform over 8+1 segments) is $$\ln(9) \approx 2.20$$.
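To make those bounds concrete, here is a minimal sketch of the entropy calculation. It assumes the logged number is the Shannon entropy (in nats) of a softmax gate over 8 cached segments plus 1 current one. The `softmax` and `entropy` helpers and the example scores are illustrative, not taken from the training code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability slots contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


n_slots = 8 + 1  # 8 cached segments + 1 current segment (assumed layout)

uniform = [1.0 / n_slots] * n_slots          # no preference at all
one_hot = [1.0] + [0.0] * (n_slots - 1)      # all weight on one segment
peaked = softmax([4.0] + [0.0] * 8)          # hypothetical selective gate

print("uniform:", entropy(uniform), "vs ln(9) =", math.log(n_slots))
print("one-hot:", abs(entropy(one_hot)))
print("peaked: ", entropy(peaked))

# exp(entropy) can be read as an "effective number of slots" in use.
for name, h in [("full MC-LA", 0.842), ("hybrid", 0.776)]:
    print(f"{name}: exp({h}) = {math.exp(h):.2f} effective slots")
```

Under that assumption, the step-800 entropies correspond to a little over two effective slots out of nine in both runs. If the logged value is averaged over heads or layers, this is only an approximate reading.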
4. Stability
Both runs are clean — no gradient spikes, no NaN. The move to a 25k-sample dataset (vs 10k) and lr=1e-4 (vs 3e-4) solved the instability issues from earlier attempts.
Parallel Work: FMA-Large Tokenization
While these runs train on 25k tracks, we're tokenizing the full FMA-Large dataset (106k tracks) in the background. ~20% complete as of this writing. Once done, we'll retrain the best architecture on the full dataset.
What's Next
- [ ] Let both runs finish (35k steps)
- [ ] Evaluate per-codebook metrics on val set
- [ ] Compare against transformer and hybrid 1:3 baselines
- [ ] Generate and listen to audio samples from best checkpoint
- [ ] Retrain winner on FMA-Large (106k tracks)
- [ ] Ablation: segment size sweep (128, 256, 512)
- [ ] Ablation: MC on different layer positions (early vs late vs uniform)
Last updated: 3 Mar 2026, step 800