SSMs for Music Generation: Do They Work?

mongoobi, Feb 2026


The question is straightforward: can selective state space models (Mamba) match transformers for autoregressive music generation? And if so, does a hybrid Mamba-Attention architecture — which Jamba showed works well for language — translate to the audio domain?

This post documents the baseline experiments that set the stage for everything that follows.


Setup

Codec: DAC 44.1kHz — 9 codebooks at ~86Hz frame rate, 1024 codes per codebook. Every audio frame becomes 9 discrete tokens via a delay pattern (MusicGen-style).
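The delay pattern can be sketched in a few lines. This is a minimal version of the MusicGen-style scheme (codebook k shifted right by k frames); the function name and the pad sentinel are mine, not DAC's:

```python
import numpy as np

def apply_delay_pattern(codes, pad_token=1024):
    """Shift codebook k right by k frames (MusicGen-style delay pattern).

    codes: (n_codebooks, n_frames) integer array of DAC tokens.
    Returns (n_codebooks, n_frames + n_codebooks - 1), with pad_token
    (a sentinel outside the 0..1023 code range) at empty positions.
    """
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad_token, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out
```

The point of the delay: at each sequence position, codebook k is conditioned on the coarser codebooks from earlier positions, so all 9 tokens of a step can be predicted in one forward pass.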

Dataset: ~10k tracks from FMA, tokenized into DAC tokens. Split 90/10 train/val (~8,915 / 986 samples).

Architectures compared:
  1. Transformer — standard causal self-attention, 162M params (d_model=768, 20 layers, 12 heads)
  2. Hybrid 1:3 — 1 attention layer per 3 Mamba-1 layers (Jamba-style), 116M params
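The 1:3 interleave can be written as a simple layer schedule. Note the hedge: which slot within each block of four holds the attention layer is an assumption here; the claim in the post is only the ratio.

```python
def layer_schedule(n_layers, period=4):
    """Jamba-style 1:3 interleave: one attention layer per block of `period` layers.

    Placing attention last in each block is an illustrative choice,
    not necessarily this repo's exact placement.
    """
    return ["attn" if i % period == period - 1 else "mamba"
            for i in range(n_layers)]
```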

Both use the same RVQ embedding (sum across codebooks) and output head (per-codebook linear projections). Trained with AdamW, cosine schedule, lr=3e-4, bf16, effective batch size 64. 31k steps.
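The shared embedding and head scheme, sketched in plain NumPy. Names, shapes, and the random weights are illustrative, not the repo's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_codebooks, codebook_size, d_model = 9, 1024, 768

# One embedding table and one output projection per codebook.
embed_tables = rng.normal(size=(n_codebooks, codebook_size, d_model)) * 0.02
head_weights = rng.normal(size=(n_codebooks, d_model, codebook_size)) * 0.02

def embed(tokens):
    # tokens: (n_codebooks, seq_len) -> (seq_len, d_model)
    # RVQ embedding: look up each codebook's token, sum across codebooks.
    return sum(embed_tables[k][tokens[k]] for k in range(n_codebooks))

def project(hidden):
    # hidden: (seq_len, d_model) -> (n_codebooks, seq_len, codebook_size)
    # Per-codebook linear logit heads.
    return np.stack([hidden @ head_weights[k] for k in range(n_codebooks)])
```

Summing across codebooks keeps the sequence length at one position per frame (after the delay pattern), rather than 9x.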


Results

Training Curves

Transformer loss trajectory (31k steps):
  - Step 1000: 6.76
  - Step 10k: ~5.8
  - Step 31k: ~5.30
  - Val loss at step 30k: 5.5505 (ppl 257.4)
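For reference, the reported perplexity is just the exponential of the validation cross-entropy:

```python
import math

val_loss = 5.5505
ppl = math.exp(val_loss)  # ~257.4, matching the reported value
```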

The Hybrid 1:3 trained on the same schedule and converged to the same loss regime.

Evaluation: Per-Codebook Metrics

Codebook       Transformer Loss   Hybrid 1:3 Loss        Δ
0 (coarsest)              3.993             3.980   -0.013
1                         4.928             4.922   -0.006
2                         5.296             5.291   -0.005
3                         5.484             5.477   -0.007
4                         5.515             5.510   -0.005
5                         5.541             5.534   -0.007
6                         5.559             5.548   -0.011
7                         5.614             5.601   -0.013
8 (finest)                5.717             5.691   -0.026
Mean                      5.294             5.284   -0.010

The hybrid wins on every single codebook. The margin is small but consistent — and the hybrid does it with 29% fewer parameters (116M vs 162M).
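As a sanity check, the Δ column and the means follow directly from the two loss columns:

```python
import numpy as np

transformer = np.array([3.993, 4.928, 5.296, 5.484, 5.515,
                        5.541, 5.559, 5.614, 5.717])
hybrid = np.array([3.980, 4.922, 5.291, 5.477, 5.510,
                   5.534, 5.548, 5.601, 5.691])

delta = hybrid - transformer  # negative on every codebook: hybrid is lower
means = (transformer.mean(), hybrid.mean())  # ~5.294 vs ~5.284
```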

Top-k Accuracy

Codebook   Transformer Top-1   Hybrid Top-1   Transformer Top-5   Hybrid Top-5
0                     16.62%         16.67%              40.73%         40.90%
8                      3.72%          3.83%              12.07%         12.35%

Same story: hybrid matches or beats transformer across the board.
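For completeness, top-k accuracy here means the fraction of positions where the target token lands in the model's k highest logits. A minimal sketch (the post's numbers are this metric computed per codebook over the val set):

```python
import numpy as np

def topk_accuracy(logits, targets, k=5):
    """Fraction of positions whose target is among the top-k logits.

    logits: (n_positions, vocab_size) float array.
    targets: (n_positions,) integer array of true token ids.
    """
    # argpartition gives the indices of the k largest logits per row
    # (unordered within the top-k, which is all we need here).
    topk = np.argpartition(logits, -k, axis=-1)[:, -k:]
    return float((topk == targets[:, None]).any(axis=-1).mean())
```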


What This Tells Us

  1. Mamba works for music. The hybrid architecture doesn't degrade on any metric versus pure attention — it's strictly better here despite fewer params.
  2. The Jamba finding transfers to audio. The 1:3 attention-to-Mamba ratio is a good default. The attention layers likely handle the "structural recall" that music needs (returning to motifs, choruses) while Mamba handles local sequential dynamics efficiently.
  3. Fine-grained codebooks benefit most from the hybrid. The biggest gains are on codebooks 6-8 (the finest acoustic detail), suggesting Mamba's sequential inductive bias helps model low-level audio structure.

This sets the baseline. The next question: can we improve long-range coherence by adding explicit memory?


Next Steps

Memory Caching for SSMs: From Paper to Implementation