The Full Run Finished Cleanly

This page supersedes the earlier 1800-step pilot. The same 21.9M-parameter OLMo-hybrid speech codec LM was resumed in a stable A100 configuration and trained through step 12000.

Final result:

  • best checkpoint: step 12000
  • EMA validation loss: 3.8207
  • perplexity: 45.63
  • dataset: LJ Speech
  • tokenizer: EnCodec 24 kHz, 8 codebooks
  • hardware: A100-SXM4-80GB

The qualitative result is stronger than the pilot: the model still does not produce semantically meaningful speech, but it now stays in a much cleaner voice-like regime, and the later samples are noticeably less rough than the early ones.


What I Actually Trained

This is still an unconditional speech codec language model, not a text-conditioned TTS system yet.

Item               Value
Model              OLMo-hybrid speech codec LM
Parameters         21,904,648
Backbone           8 layers = 6 Gated DeltaNet blocks + 2 attention blocks
Width              d_model=384, d_ff=1024
Attention          6 query heads, 2 KV heads
Hybrid schedule    attention every 4th block, final block forced to attention
Tokenizer          EnCodec 24 kHz, 8 codebooks, vocab 1027
Dataset            LJ Speech
Chunking           8.0 s chunks
Split              12,624 train / 666 val
Context            1024 delayed steps
Hardware           A100-SXM4-80GB
Precision          bf16
Optimizer          fused AdamW
Attention backend  CUDA SDPA flash path enabled
Recurrent kernel   fused FLA GDN path disabled in the stable run
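
To make the hybrid schedule concrete, here is a minimal sketch of how the block layout in the table could be derived. It assumes "every 4th block" counts from 1 (so blocks 4 and 8 are attention), which is the reading that matches the 6 + 2 split above. The function name and the "gdn" / "attention" labels are hypothetical, not taken from the training code. The head arithmetic assumes d_model is split evenly across query heads.

```python
def build_schedule(n_layers=8, attn_every=4):
    """Return one block type per layer, first block first.

    Assumption: block positions are counted from 1, and the final block
    is always attention regardless of the period.
    """
    schedule = []
    for i in range(n_layers):
        position = i + 1  # 1-indexed block position
        is_attention = position % attn_every == 0 or position == n_layers
        schedule.append("attention" if is_attention else "gdn")
    return schedule


schedule = build_schedule()

# Grouped-query attention arithmetic from the table values.
D_MODEL, N_Q_HEADS, N_KV_HEADS = 384, 6, 2
head_dim = D_MODEL // N_Q_HEADS      # width of each attention head
q_per_kv = N_Q_HEADS // N_KV_HEADS   # query heads sharing one KV head

print(schedule)
print("gdn blocks:", schedule.count("gdn"))
print("attention blocks:", schedule.count("attention"))
print("head_dim:", head_dim, "query heads per KV head:", q_per_kv)
```

With 8 layers and a period of 4, the forced-attention rule for the final block is redundant, since block 8 is already an attention block. It would only matter for depths that are not a multiple of the period.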

The architecture question here was specific: can an OLMo-Hybrid / Gated-DeltaNet-style recurrent-attention decoder model speech codec tokens well enough to stay in a speech-like basin?

On the evidence below, the answer is yes.


Run Chronology

The full story ended up being two phases:

  1. an initial A100 pilot that reached step 1800 and already produced clear voice-like babble
  2. a resumed stable A100 continuation from 1800 -> 12000, with EMA validation loss improving monotonically all the way to the end

The stable completed run was:

ljspeech_olmo_hybrid_a100_no_fla_12k_b24a1

I had to search for a stable A100 operating point first because the fused recurrent kernel path was not usable on the available pod stack.

Configuration  Effective batch  Outcome
8 x 4          32               stable but underutilized
16 x 2         32               stable, better utilization
24 x 1         24               final successful 12k run
32 x 1         32               OOM / rejected
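
The arithmetic behind the table can be sketched as follows. This assumes the "Configuration" column means micro-batch size x gradient-accumulation steps, which is consistent with the effective-batch column and with the "b24a1" suffix in the run name, but is not stated explicitly above. The function names are hypothetical, and the steps-per-epoch figure assumes a partial final batch is dropped.

```python
TRAIN_CHUNKS = 12_624  # train split size from the model table

# (micro-batch size, gradient-accumulation steps): assumed meaning of "A x B"
CONFIGS = [(8, 4), (16, 2), (24, 1), (32, 1)]


def effective_batch(micro_batch, accum_steps):
    """Examples contributing to one optimizer step."""
    return micro_batch * accum_steps


def steps_per_epoch(micro_batch, accum_steps, n_examples=TRAIN_CHUNKS):
    """Optimizer steps per pass over the data, dropping any partial batch."""
    return n_examples // effective_batch(micro_batch, accum_steps)


for mb, acc in CONFIGS:
    print(
        f"{mb} x {acc}: effective batch {effective_batch(mb, acc)}, "
        f"{steps_per_epoch(mb, acc)} optimizer steps per epoch"
    )
```

Under these assumptions, the final 24 x 1 configuration covers the train split in 526 optimizer steps, since 12,624 divides evenly by 24.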

Validation Progression

The important thing about the completed run is that validation loss never turned back up. It kept improving through the full 12000-step budget.

Step   Train Loss  EMA Val Loss  PPL
200    5.0049      6.7878        886.99
1000   4.1367      4.8510        127.87
1800   4.2224      4.2847        72.58
3200   4.2821      4.0500        57.40
5200   3.7050      3.9194        50.37
7400   3.8102      3.8551        47.23
10000  3.8445      3.8267        45.91
12000  3.7043      3.8207        45.63

So the earlier step-1800 result was real, but it was not the end of the useful training regime. The model kept getting better, just more slowly.
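
As a sanity check on the table, the PPL column appears to be the exponential of the EMA validation loss, with the loss measured in nats. That relationship is an assumption on my part here, not something read off the training code, but it reproduces every row to within rounding. A minimal sketch:

```python
import math

# (step, EMA validation loss, reported perplexity) copied from the table
ROWS = [
    (200, 6.7878, 886.99),
    (1000, 4.8510, 127.87),
    (1800, 4.2847, 72.58),
    (3200, 4.0500, 57.40),
    (5200, 3.9194, 50.37),
    (7400, 3.8551, 47.23),
    (10000, 3.8267, 45.91),
    (12000, 3.8207, 45.63),
]


def perplexity(loss_nats):
    """Perplexity of a cross-entropy loss expressed in nats."""
    return math.exp(loss_nats)


for step, loss, reported in ROWS:
    recomputed = perplexity(loss)
    # The loss is only given to 4 decimals, so allow a small relative gap.
    assert abs(recomputed - reported) / reported < 1e-3
    print(f"step {step:>5}: exp({loss:.4f}) = {recomputed:.2f} (reported {reported})")

# EMA validation loss decreases at every listed checkpoint.
losses = [loss for _, loss, _ in ROWS]
is_monotone = all(later < earlier for earlier, later in zip(losses, losses[1:]))
print("monotonically decreasing:", is_monotone)
```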


Listen To The Progression

These are fixed-protocol local samples decoded with the improved delayed-space sampler. The point is not that they are already "good TTS." The point is that the model clearly stays on the speech side of the line and gets cleaner with training.
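
For readers unfamiliar with "delayed space": with multiple codebooks per frame, a common layout (the one popularized by MusicGen) shifts codebook k right by k steps so that one sequence step never contains all codebooks of the same frame. The sketch below assumes that layout. The exact layout, padding id, and function names used in this project are not shown on this page, so everything here is illustrative. The 75 Hz frame rate is the standard EnCodec 24 kHz rate.

```python
PAD = -1  # hypothetical padding id, for illustration only


def apply_delay(codes, pad=PAD):
    """Shift codebook k right by k steps.

    codes: list of K rows, each a list of T token ids.
    Returns K rows of length T + K - 1.
    """
    n_codebooks = len(codes)
    return [
        [pad] * k + list(row) + [pad] * (n_codebooks - 1 - k)
        for k, row in enumerate(codes)
    ]


def undo_delay(delayed):
    """Invert apply_delay, recovering the frame-aligned K x T grid."""
    n_codebooks = len(delayed)
    n_frames = len(delayed[0]) - (n_codebooks - 1)
    return [row[k:k + n_frames] for k, row in enumerate(delayed)]


# Tiny example: 3 codebooks, 4 frames.
codes = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
delayed = apply_delay(codes)
for row in delayed:
    print(row)

# Sequence length for one 8.0 s chunk under this layout.
FRAME_RATE_HZ = 75   # EnCodec 24 kHz produces 75 frames per second
N_CODEBOOKS = 8
frames_per_chunk = int(8.0 * FRAME_RATE_HZ)
delayed_steps = frames_per_chunk + N_CODEBOOKS - 1
print("frames:", frames_per_chunk, "delayed steps:", delayed_steps)
```

Under these assumptions, an 8.0 s chunk occupies 607 delayed steps, which fits inside the 1024-step context.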

[Audio players: earlier pilot checkpoint (step 1800), mid-run checkpoint (step 7400), final-best checkpoint (step 12000); Samples 1-3]

Qualitatively, the late samples still babble, but they are much further from the "barely holding together" regime of the early pilot. That is exactly the kind of progression I hoped this architecture would show.


What This Result Actually Says

The claim I am comfortable making is narrow:

A small OLMo-hybrid / Gated-DeltaNet-style speech codec language model can learn enough local speech structure on a clean single-speaker corpus to produce stable, clearly voice-like audio, and it keeps improving well past the earliest emergence point.

What I am not claiming:

  • that this is a finished TTS system
  • that this beats a matched transformer baseline
  • that the no-FLA A100 path is the architecture's ideal implementation
  • that these samples are semantically meaningful speech

This is still a codec-LM architecture result, not a production speech system result.


Systems Reality

The training result is clean. The systems story was messier.

Main issues:

  • the intended fused FLA Gated DeltaNet path was not usable on the available Triton / pod stack
  • the successful run used the plain PyTorch recurrent fallback for the hybrid blocks
  • cheap pod infrastructure was volatile and SSH endpoints changed repeatedly
  • the only reason the final run survived cleanly is that I started pulling artifacts off-pod aggressively

So this validates the modeling story much more than it validates the final optimized training stack.


Why This Matters For The Next Step

This is enough to justify moving on to text conditioning.

The question is no longer:

  • can this hybrid backbone model speech codec tokens at all?

The question is now:

  • can I inject text cleanly enough to make the decoder read controllable content instead of only producing speech-like babble?

That means the next version should keep the audio decoder and add:

  • a small text encoder
  • cross-attention from selected decoder blocks into text states
  • sentence-level training on LJ Speech transcripts
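
To pin down what "cross-attention into text states" means mechanically, here is a minimal single-head sketch in plain Python. It is illustrative only: it uses identity projections in place of learned query/key/value weights, the function names are hypothetical, and it says nothing about which decoder blocks would actually get cross-attention in the next version.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, text_states):
    """Each decoder step attends over all text states.

    decoder_states: list of T_audio vectors (used as queries)
    text_states:    list of T_text vectors (used as keys and values)
    Returns (outputs, weights): T_audio context vectors, and the
    T_audio x T_text attention weights.
    """
    d = len(text_states[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in decoder_states:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_states]
        w = softmax(scores)
        context = [sum(wj * v[i] for wj, v in zip(w, text_states)) for i in range(d)]
        outputs.append(context)
        weights.append(w)
    return outputs, weights


# Toy example: 3 audio steps attending over 2 text states.
decoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_states = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = cross_attention(decoder_states, text_states)
for step, w in enumerate(weights):
    print(f"audio step {step}: weights over text = {[round(x, 3) for x in w]}")
```

Unlike the decoder's self-attention, this needs no causal mask over the text axis: the full transcript is available at every audio step.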

That is the project from here.