SSMs for Music Generation: Do They Work?
mongoobi, Feb 2026
The question is straightforward: can selective state space models (Mamba) match transformers for autoregressive music generation? And if so, does a hybrid Mamba-Attention architecture — which Jamba showed works well for language — translate to the audio domain?
This post documents the baseline experiments that set the stage for everything that follows.
Setup
Codec: DAC 44.1kHz — 9 codebooks at ~86Hz frame rate, 1024 codes per codebook. Every audio frame is represented by 9 discrete tokens (one per codebook), arranged with a MusicGen-style delay pattern so the model can predict all codebooks at each step.
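For concreteness, here's what the delay pattern does: codebook k is shifted right by k frames, so the codebooks the model predicts at a given position come from staggered frames. A minimal sketch; the function name and pad token are illustrative, not lifted from the training code.

```python
import torch

def apply_delay_pattern(codes: torch.Tensor, pad_id: int = 1024) -> torch.Tensor:
    """Shift codebook k right by k frames (MusicGen-style delay pattern).

    codes: (K, T) integer DAC codes, K=9 codebooks, T frames.
    Returns (K, T + K - 1); delayed positions are filled with `pad_id`
    (an assumed special token outside the 0..1023 code range).
    """
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad_id, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out
```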
Dataset: ~10k tracks from FMA, tokenized into DAC tokens. Split 90/10 train/val (~8,915 / 986 samples).
Architectures compared:
1. Transformer — standard causal self-attention, 162M params (d_model=768, 20 layers, 12 heads)
2. Hybrid 1:3 — 1 attention layer per 3 Mamba-1 layers (Jamba-style), 116M params; see the layer-interleaving sketch below
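A sketch of the 1:3 interleaving, assuming the `mamba-ssm` package's Mamba-1 block and plain pre-norm residual connections. MLP sub-blocks, dropout, and the hybrid's exact depth/width are omitted or assumed (the defaults below just mirror the transformer config), since only parameter counts are reported above.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm


class HybridBackbone(nn.Module):
    """Jamba-style stack: 1 causal-attention layer per 3 Mamba-1 layers."""

    def __init__(self, d_model: int = 768, n_layers: int = 20, n_heads: int = 12):
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.layers = nn.ModuleList()
        self.is_attn = []
        for i in range(n_layers):
            use_attn = (i % 4 == 3)  # every 4th layer is attention -> 1:3 ratio
            self.is_attn.append(use_attn)
            if use_attn:
                self.layers.append(nn.MultiheadAttention(d_model, n_heads, batch_first=True))
            else:
                self.layers.append(Mamba(d_model=d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model); boolean mask where True = "not allowed to attend"
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        for norm, layer, use_attn in zip(self.norms, self.layers, self.is_attn):
            h = norm(x)
            if use_attn:
                h, _ = layer(h, h, h, attn_mask=causal, need_weights=False)
            else:
                h = layer(h)
            x = x + h  # residual connection
        return x
```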
Both use the same RVQ embedding (sum across codebooks) and output head (per-codebook linear projections); a sketch of this interface follows below. Both were trained with AdamW, cosine schedule, lr=3e-4, bf16, and an effective batch size of 64, for 31k steps.
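The shared input/output interface, roughly: per-codebook embeddings summed into one vector per frame on the way in, and one linear projection per codebook on the way out. Shapes and the extra pad slot are assumptions, not the exact training code.

```python
import torch
import torch.nn as nn


class RVQInterface(nn.Module):
    """Sum-of-codebook embeddings in, per-codebook logits out (9 x 1024 DAC codes)."""

    def __init__(self, d_model: int = 768, n_q: int = 9, codebook_size: int = 1024):
        super().__init__()
        # +1 embedding slot for the assumed pad/delay token
        self.embeds = nn.ModuleList([nn.Embedding(codebook_size + 1, d_model) for _ in range(n_q)])
        self.heads = nn.ModuleList([nn.Linear(d_model, codebook_size) for _ in range(n_q)])

    def embed(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (B, n_q, T) -> (B, T, d_model), summing codebook embeddings per frame
        return sum(emb(codes[:, q]) for q, emb in enumerate(self.embeds))

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, d_model) -> (B, n_q, T, codebook_size)
        return torch.stack([head(hidden) for head in self.heads], dim=1)
```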
Results
Training Curves
Transformer loss trajectory (31k steps):
- Step 1000: 6.76 → step 10k: ~5.8 → step 31k: ~5.30
- Val loss at step 30k: 5.5505 (ppl 257.4)
The Hybrid 1:3 followed a similar trajectory, converging to the same loss regime.
Evaluation: Per-Codebook Metrics
| Codebook | Transformer Loss | Hybrid 1:3 Loss | Δ |
|---|---|---|---|
| 0 (coarsest) | 3.993 | 3.980 | -0.013 |
| 1 | 4.928 | 4.922 | -0.006 |
| 2 | 5.296 | 5.291 | -0.005 |
| 3 | 5.484 | 5.477 | -0.007 |
| 4 | 5.515 | 5.510 | -0.005 |
| 5 | 5.541 | 5.534 | -0.007 |
| 6 | 5.559 | 5.548 | -0.011 |
| 7 | 5.614 | 5.601 | -0.013 |
| 8 (finest) | 5.717 | 5.691 | -0.026 |
| Mean | 5.294 | 5.284 | -0.010 |
The hybrid wins on every single codebook. The margin is small but consistent, and the hybrid does it with ~28% fewer parameters (116M vs 162M).
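For reference, the per-codebook numbers are just cross-entropy restricted to each codebook's logits; the "Mean" row averages them. A quick sketch with assumed tensor shapes:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def per_codebook_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Mean cross-entropy per codebook.

    logits:  (B, n_q, T, V) per-codebook predictions
    targets: (B, n_q, T) ground-truth DAC codes
    Returns a length-n_q tensor; its mean is the "Mean" row above.
    """
    vocab = logits.size(-1)
    return torch.stack([
        F.cross_entropy(logits[:, q].reshape(-1, vocab), targets[:, q].reshape(-1))
        for q in range(logits.size(1))
    ])
```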
Top-k Accuracy
| Codebook | Transformer Top-1 | Hybrid Top-1 | Transformer Top-5 | Hybrid Top-5 |
|---|---|---|---|---|
| 0 | 16.62% | 16.67% | 40.73% | 40.90% |
| 8 | 3.72% | 3.83% | 12.07% | 12.35% |
Same story: the hybrid matches or slightly beats the transformer on both the coarsest and finest codebooks.
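Top-k accuracy here means the fraction of frames whose ground-truth code lands among the model's k most likely codes. A per-codebook sketch under the same assumed shapes as above:

```python
import torch


@torch.no_grad()
def topk_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Per-codebook top-k accuracy for logits (B, n_q, T, V) and targets (B, n_q, T)."""
    topk = logits.topk(k, dim=-1).indices                # (B, n_q, T, k)
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1)   # (B, n_q, T)
    return hits.float().mean(dim=(0, 2))                 # (n_q,)
```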
What This Tells Us
- Mamba works for music. The hybrid architecture doesn't degrade on any metric versus pure attention — it's strictly better here despite fewer params.
- The Jamba finding transfers to audio. The 1:3 attention-to-Mamba ratio is a good default. The attention layers likely handle the "structural recall" that music needs (returning to motifs, choruses) while Mamba handles local sequential dynamics efficiently.
- Fine-grained codebooks benefit most from the hybrid. The single largest gain is on codebook 8 (-0.026, the finest acoustic detail), with codebooks 6-7 also above the mean improvement, suggesting Mamba's sequential inductive bias helps model low-level audio structure.
This sets the baseline. The next question: can we improve long-range coherence by adding explicit memory?