Completed 12k Run: Early Speech Emergence in a 21.9M OLMo-Hybrid Speech LM

The Full Run Finished Cleanly

This page supersedes the earlier 1800-step pilot. The same 21.9M-parameter OLMo-hybrid speech codec LM was resumed from the pilot checkpoint in a stable A100 configuration and trained through to step 12000.

Final result:

best checkpoint: step 12000
EMA validation loss: 3.8207
perplexity: 45.63
dataset: LJ Speech
tokenizer: EnCodec 24 kHz, 8 codebooks
hardware: A100-SXM4-80GB

The qualitative result is stronger than the pilot: the model still does not produce semantic speech, but it now stays in a much cleaner voice-like regime and the later samples are noticeably less rough than the early ones.


What I Actually Trained

This is still an unconditional speech codec language model, not yet a text-conditioned TTS system.

Item	Value
Model	OLMo-hybrid speech codec LM
Parameters	`21,904,648`
Backbone	`8` layers = `6` Gated DeltaNet blocks + `2` attention blocks
Width	`d_model=384`, `d_ff=1024`
Attention	`6` query heads, `2` KV heads
Hybrid schedule	attention every `4th` block, final block forced to attention
Tokenizer	`EnCodec 24 kHz`, `8` codebooks, vocab `1027`
Dataset	`LJ Speech`
Chunking	`8.0s` chunks
Split	`12,624` train / `666` val
Context	`1024` delayed steps
Hardware	`A100-SXM4-80GB`
Precision	`bf16`
Optimizer	fused AdamW
Attention backend	CUDA SDPA flash path enabled
Recurrent kernel	fused FLA GDN path disabled in the stable run
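The hybrid schedule in the table can be sketched in a few lines. This is a minimal illustration, not the training code: the function name `hybrid_layout`, the 1-indexed counting of blocks, and the `head_dim = d_model / query heads` convention are assumptions. Under 1-indexed counting, "every 4th block" lands on blocks 4 and 8, so the "final block forced to attention" rule changes nothing here and the split comes out as the 6 + 2 reported above.

```python
# Values taken from the table above.
N_LAYERS = 8
ATTN_EVERY = 4
D_MODEL = 384
N_QUERY_HEADS = 6
N_KV_HEADS = 2


def hybrid_layout(n_layers: int = N_LAYERS, attn_every: int = ATTN_EVERY) -> list:
    """Return one block type per layer (hypothetical helper, 1-indexed rule)."""
    layout = []
    for i in range(1, n_layers + 1):
        # Attention on every attn_every-th block, and always on the final block.
        is_attention = (i % attn_every == 0) or (i == n_layers)
        layout.append("attention" if is_attention else "gdn")
    return layout


layout = hybrid_layout()
n_gdn = layout.count("gdn")
n_attn = layout.count("attention")

# Grouped-query attention bookkeeping (assumed convention, see lead-in).
head_dim = D_MODEL // N_QUERY_HEADS
queries_per_kv = N_QUERY_HEADS // N_KV_HEADS

print(layout)
print(f"{n_gdn} GDN blocks, {n_attn} attention blocks")
print(f"head_dim={head_dim}, query heads per KV head={queries_per_kv}")
```

With 0-indexed counting the same rule would yield three attention blocks, which would not match the table, so the 1-indexed reading is the one consistent with the reported 6 + 2 split.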

The architecture question here was specific: can an OLMo-Hybrid / Gated-DeltaNet-style recurrent-attention decoder model speech codec tokens well enough to stay in a speech-like basin?

At this scale and on this corpus, the answer is yes.
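The Context row in the table counts "delayed steps". A rough sketch of what that could mean, assuming a MusicGen-style delay pattern in which codebook k is shifted right by k steps: the helper names, the `PAD` placeholder, and the delay layout itself are assumptions, not the run's actual code. The 75 frames per second figure is the EnCodec 24 kHz frame rate.

```python
FRAME_RATE_HZ = 75      # EnCodec 24 kHz produces 75 code frames per second
N_CODEBOOKS = 8         # from the table above
CHUNK_SECONDS = 8.0     # from the table above
CONTEXT_STEPS = 1024    # from the table above
PAD = -1                # hypothetical placeholder for positions with no code


def apply_delay(codes, pad=PAD):
    """Shift codebook k right by k steps (assumed delay layout).

    codes: list of K rows, each a list of T code indices.
    Returns K rows of length T + K - 1.
    """
    n_books = len(codes)
    return [
        [pad] * k + list(row) + [pad] * (n_books - 1 - k)
        for k, row in enumerate(codes)
    ]


def undo_delay(delayed):
    """Invert apply_delay, recovering K aligned rows of length T."""
    n_books = len(delayed)
    n_frames = len(delayed[0]) - (n_books - 1)
    return [row[k:k + n_frames] for k, row in enumerate(delayed)]


frames = int(CHUNK_SECONDS * FRAME_RATE_HZ)
delayed_steps = frames + N_CODEBOOKS - 1

print(f"{frames} frames per chunk, {delayed_steps} delayed steps")
print(f"fits in context: {delayed_steps <= CONTEXT_STEPS}")
```

Under these assumptions an 8.0s chunk occupies well under the 1024-step context, so a full chunk fits in a single training sequence.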

Run Chronology

The full story ended up being two phases:

an initial A100 pilot that reached step 1800 and already produced clear voice-like babble
a resumed, stable A100 continuation from step 1800 to step 12000, during which EMA validation loss improved monotonically all the way to the end

The stable completed run was:

ljspeech_olmo_hybrid_a100_no_fla_12k_b24a1

I had to search for a stable A100 operating point first because the fused recurrent kernel path was not usable on the available pod stack.

Configuration (batch x accumulation)	Effective batch	Outcome
`8 x 4`	`32`	stable but underutilized
`16 x 2`	`32`	stable, better utilization
`24 x 1`	`24`	final successful `12k` run
`32 x 1`	`32`	OOM / rejected
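The table rows can be read as micro-batch size times gradient-accumulation steps, which is an assumption, though one consistent with the `b24a1` suffix of the run name and with the listed products. The sketch below recomputes the effective batch for each row and gives a rough epoch count for the continuation, assuming each sequence in a batch is one training chunk.

```python
# (micro_batch, accumulation_steps, outcome) as listed in the table above.
CONFIGS = [
    (8, 4, "stable but underutilized"),
    (16, 2, "stable, better utilization"),
    (24, 1, "final successful 12k run"),
    (32, 1, "OOM / rejected"),
]

TRAIN_CHUNKS = 12_624               # train split size from the setup table
RESUME_STEP, FINAL_STEP = 1800, 12000


def effective_batch(micro_batch: int, accumulation_steps: int) -> int:
    """Sequences contributing to one optimizer step (assumed interpretation)."""
    return micro_batch * accumulation_steps


for micro, accum, outcome in CONFIGS:
    print(f"{micro} x {accum} -> {effective_batch(micro, accum)} ({outcome})")

# Rough data budget of the continuation at the final 24 x 1 setting.
continuation_steps = FINAL_STEP - RESUME_STEP
sequences_seen = continuation_steps * effective_batch(24, 1)
approx_epochs = sequences_seen / TRAIN_CHUNKS
print(f"{sequences_seen} sequences, about {approx_epochs:.1f} passes over the train split")
```

Peak memory is driven by the micro-batch rather than the effective batch, which would explain why `32 x 1` ran out of memory while `8 x 4` and `16 x 2` reached the same effective batch without trouble.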

Validation Progression

The important thing about the completed run is that EMA validation loss never turned back upward. It kept improving through the full 12000-step budget, with diminishing returns toward the end.

Step	Train Loss	EMA Val Loss	PPL
200	5.0049	6.7878	886.99
1000	4.1367	4.8510	127.87
1800	4.2224	4.2847	72.58
3200	4.2821	4.0500	57.40
5200	3.7050	3.9194	50.37
7400	3.8102	3.8551	47.23
10000	3.8445	3.8267	45.91
12000	3.7043	3.8207	45.63

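So the earlier step-1800 result was real, but it was not the end of the useful training regime. EMA validation loss fell from 4.2847 at step 1800 to 3.8207 at step 12000. The model kept getting better, just more slowly: the last 2000 steps improved it by only 0.0060.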
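As a sanity check on the table, the PPL column appears to be the exponential of the EMA validation loss (loss in nats). That relationship is inferred from the numbers rather than stated by the run logs, so treat it as an assumption. The sketch below recomputes perplexity for each row and prints the loss improvement between consecutive checkpoints.

```python
import math

# (step, EMA val loss, reported PPL) copied from the table above.
ROWS = [
    (200, 6.7878, 886.99),
    (1000, 4.8510, 127.87),
    (1800, 4.2847, 72.58),
    (3200, 4.0500, 57.40),
    (5200, 3.9194, 50.37),
    (7400, 3.8551, 47.23),
    (10000, 3.8267, 45.91),
    (12000, 3.8207, 45.63),
]

# Recompute perplexity as exp(loss) and compare against the reported value.
recomputed = [(step, math.exp(loss), ppl) for step, loss, ppl in ROWS]
for step, ppl_calc, ppl_reported in recomputed:
    print(f"step {step:>5}: exp(loss)={ppl_calc:8.2f}  reported={ppl_reported:8.2f}")

# Loss improvement between consecutive logged checkpoints.
gains = [
    (prev[0], cur[0], prev[1] - cur[1])
    for prev, cur in zip(ROWS, ROWS[1:])
]
for start, end, gain in gains:
    print(f"{start:>5} -> {end:>5}: EMA val loss improved by {gain:.4f}")
```

Every gain is positive, which is what "never turned back upward" means in practice, and the shrinking gains show the diminishing returns late in the run.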

Listen To The Progression

These are fixed-protocol local samples decoded with the improved delayed-space sampler. The point is not that they are already "good TTS." The point is that the model clearly stays on the speech side of the line and gets cleaner with training.

Audio samples (Sample 1, Sample 2, Sample 3) at three checkpoints: step 1800 (earlier pilot), step 7400 (mid-run), and step 12000 (final-best).

Qualitatively, the late samples still babble, but they are much less in the "barely-holding-together" regime than the early pilot. That is exactly the kind of progression I hoped this architecture would show.

What This Result Actually Says

The claim I am comfortable making is narrow:

A small OLMo-hybrid / Gated-DeltaNet-style speech codec language model can learn enough local speech structure on a clean single-speaker corpus to produce stable, clearly voice-like audio, and it keeps improving well past the earliest emergence point.

What I am not claiming:

that this is a finished TTS system
that this beats a matched transformer baseline
that the no-FLA A100 path is the architecture's ideal implementation
that these samples are semantically meaningful speech

This is still a codec-LM architecture result, not a production speech system result.

Systems Reality

The training result is clean. The systems story was messier.

Main issues:

the intended fused FLA Gated DeltaNet path was not usable on the available Triton / pod stack
the successful run used the plain PyTorch recurrent fallback for the hybrid blocks
cheap pod infrastructure was volatile and SSH endpoints changed repeatedly
the only reason the final run survived cleanly is that I started pulling artifacts off-pod aggressively

So this validates the modeling story much more than it validates the final optimized training stack.

Why This Matters For The Next Step

This is enough to justify moving on to text conditioning.

The question is no longer:

can this hybrid backbone model speech codec tokens at all?

The question is now:

can I inject text cleanly enough to make the decoder read controllable content instead of only producing speech-like babble?

That means the next version should keep the audio decoder and add:

a small text encoder
cross-attention from selected decoder blocks into text states
sentence-level training on LJ Speech transcripts

That is the project from here.