The Full Run Finished Cleanly
This page supersedes the earlier 1800-step pilot. The same 21.9M OLMo-hybrid speech codec LM was resumed on a stable A100 run and trained all the way to 12000 steps.
Final result:
- best checkpoint: step 12000
- EMA validation loss: 3.8207
- perplexity: 45.63
- dataset: LJ Speech
- tokenizer: EnCodec 24 kHz, 8 codebooks
- hardware: A100-SXM4-80GB
The qualitative result is stronger than the pilot: the model still does not produce semantic speech, but it now stays in a much cleaner voice-like regime and the later samples are noticeably less rough than the early ones.
What I Actually Trained
This is still an unconditional speech codec language model, not a text-conditioned TTS system yet.
| Item | Value |
|---|---|
| Model | OLMo-hybrid speech codec LM |
| Parameters | 21,904,648 |
| Backbone | 8 layers = 6 Gated DeltaNet blocks + 2 attention blocks |
| Width | d_model=384, d_ff=1024 |
| Attention | 6 query heads, 2 KV heads |
| Hybrid schedule | attention every 4th block, final block forced to attention |
| Tokenizer | EnCodec 24 kHz, 8 codebooks, vocab 1027 |
| Dataset | LJ Speech |
| Chunking | 8.0s chunks |
| Split | 12,624 train / 666 val |
| Context | 1024 delayed steps |
| Hardware | A100-SXM4-80GB |
| Precision | bf16 |
| Optimizer | fused AdamW |
| Attention backend | CUDA SDPA flash path enabled |
| Recurrent kernel | fused FLA GDN path disabled in the stable run |
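The backbone rows of the table can be cross-checked with a few lines of arithmetic. This is only a sketch of how the stated schedule could be expanded. The function name `hybrid_schedule`, the 1-indexed block numbering, and the assumption that the head dimension is `d_model / n_q_heads` are mine, not taken from the training code.

```python
def hybrid_schedule(n_layers: int = 8, attn_every: int = 4) -> list[str]:
    """Expand 'attention every 4th block, final block forced to attention'.

    Blocks are numbered from 1 (an assumption); every other block is a
    Gated DeltaNet ("gdn") block.
    """
    kinds = []
    for i in range(1, n_layers + 1):
        is_attn = (i % attn_every == 0) or (i == n_layers)
        kinds.append("attention" if is_attn else "gdn")
    return kinds


schedule = hybrid_schedule()
n_gdn = schedule.count("gdn")          # 6 with the defaults above
n_attn = schedule.count("attention")   # 2 with the defaults above

# Grouped-query attention shapes implied by the table.
d_model, n_q_heads, n_kv_heads = 384, 6, 2
head_dim = d_model // n_q_heads        # assumed: width split evenly over query heads
q_per_kv = n_q_heads // n_kv_heads     # query heads sharing each KV head

print(schedule, head_dim, q_per_kv)
```

With 8 layers the forced final attention block coincides with the "every 4th" rule, which is consistent with the 6 + 2 split listed in the table.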
The architecture question here was specific: can an OLMo-Hybrid / Gated-DeltaNet-style recurrent-attention decoder model speech codec tokens well enough to stay in a speech-like basin?
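On LJ Speech at this scale, the answer is yes: the samples stay voice-like and the EMA validation loss keeps improving through all 12000 steps.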
Run Chronology
The full story ended up being two phases:
- an initial A100 pilot that reached step 1800 and already produced clear voice-like babble
- a resumed stable A100 continuation from 1800 -> 12000 that improved monotonically all the way to the end
The stable completed run was:
ljspeech_olmo_hybrid_a100_no_fla_12k_b24a1
I had to search for a stable A100 operating point first because the fused recurrent kernel path was not usable on the available pod stack.
| Configuration | Effective batch | Outcome |
|---|---|---|
| 8 x 4 | 32 | stable but underutilized |
| 16 x 2 | 32 | stable, better utilization |
| 24 x 1 | 24 | final successful 12k run |
| 32 x 1 | 32 | OOM / rejected |
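The effective-batch column is consistent with reading each configuration as micro-batch size times gradient-accumulation steps. That reading is an assumption on my part (the run name suffix `b24a1` points the same way), and the helper name below is hypothetical.

```python
def effective_batch(micro_batch: int, accum_steps: int) -> int:
    """Sequences contributing to one optimizer step on a single GPU."""
    return micro_batch * accum_steps


# (micro_batch, accum_steps) pairs, read off the table under the assumption above.
configs = [(8, 4), (16, 2), (24, 1), (32, 1)]
effective = {cfg: effective_batch(*cfg) for cfg in configs}

for (micro, accum), total in effective.items():
    print(f"{micro} x {accum} -> effective batch {total}")
```

Under this reading the final run trades a smaller effective batch (24 instead of 32) for a single large micro-batch with no accumulation, while the next size up no longer fits in memory.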
Validation Progression
The important thing about the completed run is that it never rolled over. Validation kept improving through the full budget.
| Step | Train Loss | EMA Val Loss | PPL |
|---|---|---|---|
| 200 | 5.0049 | 6.7878 | 886.99 |
| 1000 | 4.1367 | 4.8510 | 127.87 |
| 1800 | 4.2224 | 4.2847 | 72.58 |
| 3200 | 4.2821 | 4.0500 | 57.40 |
| 5200 | 3.7050 | 3.9194 | 50.37 |
| 7400 | 3.8102 | 3.8551 | 47.23 |
| 10000 | 3.8445 | 3.8267 | 45.91 |
| 12000 | 3.7043 | 3.8207 | 45.63 |
So the earlier 1800 result was real, but it was not the end of the useful training regime. The model kept getting better, just more slowly.
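The PPL column matches the exponential of the EMA validation loss, assuming the loss is a mean per-token cross-entropy in nats. A quick check, with the values copied from the table:

```python
import math

# step -> (EMA validation loss, reported perplexity), copied from the table.
rows = {
    200: (6.7878, 886.99),
    1000: (4.8510, 127.87),
    1800: (4.2847, 72.58),
    3200: (4.0500, 57.40),
    5200: (3.9194, 50.37),
    7400: (3.8551, 47.23),
    10000: (3.8267, 45.91),
    12000: (3.8207, 45.63),
}

# Recompute perplexity as exp(loss); small differences come from rounding
# the loss to four decimals.
recomputed = {step: math.exp(loss) for step, (loss, _) in rows.items()}

for step, (loss, reported) in rows.items():
    print(f"step {step:>5}: loss {loss:.4f} -> ppl {recomputed[step]:.2f} (reported {reported})")

# Gain from continuing past the pilot: 1800 -> 12000.
loss_drop = rows[1800][0] - rows[12000][0]
ppl_ratio = rows[12000][1] / rows[1800][1]
print(f"loss drop {loss_drop:.4f} nats, perplexity ratio {ppl_ratio:.3f}")
```

The last two numbers put the slow late-stage gains in context: most of the improvement after step 1800 arrives before step 5200, with smaller but still positive gains afterwards.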
Listen To The Progression
These are fixed-protocol local samples decoded with the improved delayed-space sampler. The point is not that they are already "good TTS." The point is that the model clearly stays on the speech side of the line and gets cleaner with training.
Audio samples (Sample 1, Sample 2, Sample 3) are given for each of three checkpoints:
- earlier pilot checkpoint: step 1800
- mid-run checkpoint: step 7400
- final-best checkpoint: step 12000
Qualitatively, the late samples still babble, but they are much less in the "barely-holding-together" regime than the early pilot. That is exactly the kind of progression I hoped this architecture would show.
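The "delayed" layout behind the context length and the sampler is not spelled out in this post. One common convention for multi-codebook codec LMs shifts codebook k right by k steps, so that at each step the model predicts one token per codebook, and the shift has to be undone before the tokens go back to the codec decoder. The sketch below illustrates that convention only. It is an assumption about the layout, and `PAD`, `apply_delay`, and `undo_delay` are hypothetical names, not the author's code.

```python
PAD = -1  # hypothetical padding id used only for this illustration


def apply_delay(codes: list[list[int]], pad: int = PAD) -> list[list[int]]:
    """Shift codebook k right by k steps.

    codes: K rows (one per codebook), each of length T.
    Returns K rows of length T + K - 1, padded where no token exists.
    """
    n_books = len(codes)
    return [
        [pad] * k + list(row) + [pad] * (n_books - 1 - k)
        for k, row in enumerate(codes)
    ]


def undo_delay(delayed: list[list[int]]) -> list[list[int]]:
    """Invert apply_delay, recovering K aligned rows of length T."""
    n_books = len(delayed)
    t_len = len(delayed[0]) - (n_books - 1)
    return [row[k:k + t_len] for k, row in enumerate(delayed)]


# Tiny example with 3 codebooks and 4 frames.
codes = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
delayed = apply_delay(codes)
# delayed == [[1, 2, 3, 4, -1, -1],
#             [-1, 5, 6, 7, 8, -1],
#             [-1, -1, 9, 10, 11, 12]]
restored = undo_delay(delayed)
```

With 8 codebooks this layout adds 7 extra steps to each sequence, which is why a context budget would be counted in delayed steps rather than raw codec frames.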
What This Result Actually Says
The claim I am comfortable making is narrow:
A small OLMo-hybrid / Gated-DeltaNet-style speech codec language model can learn enough local speech structure on a clean single-speaker corpus to produce stable, clearly voice-like audio, and it keeps improving well past the earliest emergence point.
What I am not claiming:
- that this is a finished TTS system
- that this beats a matched transformer baseline
- that the no-FLA A100 path is the architecture's ideal implementation
- that these samples are semantically meaningful speech
This is still a codec-LM architecture result, not a production speech system result.
Systems Reality
The training result is clean. The systems story was messier.
Main issues:
- the intended fused FLA Gated DeltaNet path was not usable on the available Triton / pod stack
- the successful run used the plain PyTorch recurrent fallback for the hybrid blocks
- cheap pod infrastructure was volatile and SSH endpoints changed repeatedly
- the only reason the final run survived cleanly is that I started pulling artifacts off-pod aggressively
So this validates the modeling story much more than it validates the final optimized training stack.
Why This Matters For The Next Step
This is enough to justify moving on to text conditioning.
The question is no longer:
- can this hybrid backbone model speech codec tokens at all?
The question is now:
- can I inject text cleanly enough to make the decoder read controllable content instead of only producing speech-like babble?
That means the next version should keep the audio decoder and add:
- a small text encoder
- cross-attention from selected decoder blocks into text states
- sentence-level training on LJ Speech transcripts
That is the project from here.