SSMs for Music Generation: Do They Work?

mongoobi, Feb 2026


The question is straightforward: can selective state space models (Mamba) match transformers for autoregressive music generation? And if so, does a hybrid Mamba-Attention architecture — which Jamba showed works well for language — translate to the audio domain?

This post documents the baseline experiments that set the stage for everything that follows.


Setup

Codec: DAC 44.1kHz — 9 codebooks at ~86Hz frame rate, 1024 codes per codebook. Every audio frame becomes 9 discrete tokens via a delay pattern (MusicGen-style).
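The delay pattern can be sketched in a few lines. This is a minimal version of the MusicGen-style scheme (codebook k shifted right by k frames); the function name and the pad sentinel are mine, not DAC's:

```python
import numpy as np

def apply_delay_pattern(codes, pad_token=1024):
    """Shift codebook k right by k frames (MusicGen-style delay pattern).

    codes: (n_codebooks, n_frames) integer array of DAC tokens.
    Returns (n_codebooks, n_frames + n_codebooks - 1), with pad_token
    (a sentinel outside the 0..1023 code range) at empty positions.
    """
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad_token, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out
```

The point of the delay: at each sequence position, codebook k is conditioned on the coarser codebooks from earlier positions, so all 9 tokens of a step can be predicted in one forward pass.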

Dataset: ~10k tracks from FMA, tokenized into DAC tokens. Split 90/10 train/val (~8,915 / 986 samples).

Architectures compared:
  1. Transformer — standard causal self-attention, 162M params (d_model=768, 20 layers, 12 heads)
  2. Hybrid 1:3 — 1 attention layer per 3 Mamba-1 layers (Jamba-style), 116M params
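The 1:3 interleave can be written as a simple layer schedule. Note the hedge: which slot within each block of four holds the attention layer is an assumption here; the claim in the post is only the ratio.

```python
def layer_schedule(n_layers, period=4):
    """Jamba-style 1:3 interleave: one attention layer per block of `period` layers.

    Placing attention last in each block is an illustrative choice,
    not necessarily this repo's exact placement.
    """
    return ["attn" if i % period == period - 1 else "mamba"
            for i in range(n_layers)]
```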

Both use the same RVQ embedding (sum across codebooks) and output head (per-codebook linear projections). Trained with AdamW, cosine schedule, lr=3e-4, bf16, effective batch size 64. 31k steps.
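The shared embedding and head scheme, sketched in plain NumPy. Names, shapes, and the random weights are illustrative, not the repo's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_codebooks, codebook_size, d_model = 9, 1024, 768

# One embedding table and one output projection per codebook.
embed_tables = rng.normal(size=(n_codebooks, codebook_size, d_model)) * 0.02
head_weights = rng.normal(size=(n_codebooks, d_model, codebook_size)) * 0.02

def embed(tokens):
    # tokens: (n_codebooks, seq_len) -> (seq_len, d_model)
    # RVQ embedding: look up each codebook's token, sum across codebooks.
    return sum(embed_tables[k][tokens[k]] for k in range(n_codebooks))

def project(hidden):
    # hidden: (seq_len, d_model) -> (n_codebooks, seq_len, codebook_size)
    # Per-codebook linear logit heads.
    return np.stack([hidden @ head_weights[k] for k in range(n_codebooks)])
```

Summing across codebooks keeps the sequence length at one position per frame (after the delay pattern), rather than 9x.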


Results

Training Curves

Transformer loss trajectory (31k steps):
  - Step 1000: 6.76
  - Step 10k: ~5.8
  - Step 31k: ~5.30
  - Val loss at step 30k: 5.5505 (ppl 257.4)
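For reference, the reported perplexity is just the exponential of the validation cross-entropy:

```python
import math

val_loss = 5.5505
ppl = math.exp(val_loss)  # ~257.4, matching the reported value
```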

The Hybrid 1:3 trained on the same schedule and converged to the same loss regime.

Evaluation: Per-Codebook Metrics

Codebook       Transformer Loss   Hybrid 1:3 Loss        Δ
0 (coarsest)              3.993             3.980   -0.013
1                         4.928             4.922   -0.006
2                         5.296             5.291   -0.005
3                         5.484             5.477   -0.007
4                         5.515             5.510   -0.005
5                         5.541             5.534   -0.007
6                         5.559             5.548   -0.011
7                         5.614             5.601   -0.013
8 (finest)                5.717             5.691   -0.026
Mean                      5.294             5.284   -0.010

The hybrid wins on every single codebook. The margin is small but consistent — and the hybrid does it with 29% fewer parameters (116M vs 162M).
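As a sanity check, the Δ column and the means follow directly from the two loss columns:

```python
import numpy as np

transformer = np.array([3.993, 4.928, 5.296, 5.484, 5.515,
                        5.541, 5.559, 5.614, 5.717])
hybrid = np.array([3.980, 4.922, 5.291, 5.477, 5.510,
                   5.534, 5.548, 5.601, 5.691])

delta = hybrid - transformer  # negative on every codebook: hybrid is lower
means = (transformer.mean(), hybrid.mean())  # ~5.294 vs ~5.284
```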

Top-k Accuracy

Codebook   Transformer Top-1   Hybrid Top-1   Transformer Top-5   Hybrid Top-5
0                     16.62%         16.67%              40.73%         40.90%
8                      3.72%          3.83%              12.07%         12.35%

Same story: hybrid matches or beats transformer across the board.
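For completeness, top-k accuracy here means the fraction of positions where the target token lands in the model's k highest logits. A minimal sketch (the post's numbers are this metric computed per codebook over the val set):

```python
import numpy as np

def topk_accuracy(logits, targets, k=5):
    """Fraction of positions whose target is among the top-k logits.

    logits: (n_positions, vocab_size) float array.
    targets: (n_positions,) integer array of true token ids.
    """
    # argpartition gives the indices of the k largest logits per row
    # (unordered within the top-k, which is all we need here).
    topk = np.argpartition(logits, -k, axis=-1)[:, -k:]
    return float((topk == targets[:, None]).any(axis=-1).mean())
```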


What This Tells Us

  1. Mamba works for music. The hybrid architecture doesn't degrade on any metric versus pure attention — it's strictly better here despite fewer params.
  2. The Jamba finding transfers to audio. The 1:3 attention-to-Mamba ratio is a good default. The attention layers likely handle the "structural recall" that music needs (returning to motifs, choruses) while Mamba handles local sequential dynamics efficiently.
  3. Fine-grained codebooks benefit most from the hybrid. The biggest gains are on codebooks 6-8 (the finest acoustic detail), suggesting Mamba's sequential inductive bias helps model low-level audio structure.

This sets the baseline. The next question: can we improve long-range coherence by adding explicit memory?


Next Steps

Memory Caching for SSMs: From Paper to Implementation