Udani is the bioacoustic representation-learning line.
The research question is: can hybrid recurrent sequence models learn useful bioacoustic representations with less data, less compute, and fewer parameters than standard transformer-style encoders?
The current model family uses HuBERT-style masked acoustic cluster prediction over animal vocalization corpora, but swaps the usual transformer-only backbone for efficient recurrent and hybrid sequence blocks.
Architecture
The main branch is a depth-recurrent hybrid:
- shared core: [GDN, GDN, GDN, local attention]
- recurrence passes: 3
- logical depth: 12
- trainable parameters: about 41.8M
- frontend: convolutional waveform encoder
- objective: masked prediction of 500 acoustic cluster IDs
The recurrence embedding lets each pass specialize even though the backbone core is shared.
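To make the depth arithmetic concrete, the sketch below unrolls the shared four-block core over three passes and shows one plausible way a recurrence embedding could condition each pass. The names `LogicalLayer`, `unroll`, and `forward` are hypothetical, and adding the embedding to the hidden state before each pass is an assumption; the Udani code may inject it differently.

```python
from dataclasses import dataclass

# Values taken from the architecture list above.
SHARED_CORE = ["GDN", "GDN", "GDN", "local_attention"]
RECURRENCE_PASSES = 3


@dataclass(frozen=True)
class LogicalLayer:
    depth_index: int   # position in the unrolled stack, 0-based
    pass_index: int    # recurrence pass this layer belongs to
    block_index: int   # index into the shared core (weights are reused)
    block_type: str


def unroll(core, passes):
    """Expand a shared core into its logical layer sequence."""
    layers = []
    for p in range(passes):
        for b, block_type in enumerate(core):
            layers.append(LogicalLayer(len(layers), p, b, block_type))
    return layers


def forward(x, core_fns, pass_embeddings):
    """Apply the same core once per pass, adding a pass embedding first.

    x: list of floats, a toy stand-in for a hidden state.
    core_fns: one callable per block in the shared core.
    pass_embeddings: one scalar offset per pass (hypothetical; a real
        model would use learned vectors).
    """
    for emb in pass_embeddings:
        x = [v + emb for v in x]   # assumption: additive conditioning per pass
        for fn in core_fns:        # identical weights on every pass
            x = fn(x)
    return x


layers = unroll(SHARED_CORE, RECURRENCE_PASSES)
print(len(layers))  # 12
print(sum(1 for layer in layers if layer.block_type == "local_attention"))  # 3
```

Four shared blocks applied three times give the logical depth of 12, while the parameter count only pays for the four blocks plus the per-pass embeddings.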
Training snapshot
The April run udani_depth_recurrent_3x_curriculum_live_fast_stable completed
25,000 pretraining steps. The final checkpoint retained clean masked-prediction
behavior on the calibration path:
- latest train loss: 0.6102
- latest masked accuracy: 0.8155
- best observed train loss: 0.2247
- best observed masked accuracy: 0.9282
- calibration sentinel at step 25,000: loss 1.4160, accuracy 0.6044
A source-weighted variant also completed 22,000 steps, giving a second stable
training trajectory for the same family.
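For readers unfamiliar with how such numbers are usually produced, here is a minimal sketch of HuBERT-style metrics: cross-entropy and accuracy averaged over masked frames only. It assumes the loss and masked-accuracy figures above are computed this way, which is not confirmed here; `masked_metrics` is a hypothetical helper, not a function from the Udani codebase.

```python
import math

NUM_CLUSTERS = 500  # acoustic cluster vocabulary size, from the objective


def masked_metrics(logits, targets, mask):
    """Mean cross-entropy and accuracy over masked frames only.

    logits:  list of per-frame score lists, each of length NUM_CLUSTERS
    targets: list of integer cluster IDs, one per frame
    mask:    list of bools, True where the frame was masked in the input
    """
    total_loss, correct, count = 0.0, 0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked frames do not contribute
        # Numerically stable log-sum-exp for the softmax normaliser.
        peak = max(frame_logits)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in frame_logits))
        total_loss += log_norm - frame_logits[target]
        predicted = max(range(len(frame_logits)), key=frame_logits.__getitem__)
        correct += int(predicted == target)
        count += 1
    if count == 0:
        raise ValueError("no masked frames to score")
    return total_loss / count, correct / count


# Uniform scores over all clusters give the chance-level loss ln(500).
uniform = [[0.0] * NUM_CLUSTERS]
chance_loss, _ = masked_metrics(uniform, [0], [True])
print(round(chance_loss, 4))  # 6.2146
```

Under that assumption, chance level for a 500-way prediction is a loss of about 6.21 and an accuracy of 0.002, which gives a reference point for reading the training and sentinel numbers.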
What is still open
The current evidence is strongest on engineering viability: the data pipeline, cluster objective, fused GDN backend, local-attention hybrid, depth recurrence, checkpointing, and sentinel evaluation all work.
The next research-facing step is transfer evaluation: frozen probes and retrieval-style tests against public bioacoustic encoders such as AVES/Perch under a compute and parameter budget. That is where the representation-quality claim should actually live.
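A retrieval-style test can be as simple as leave-one-out nearest-neighbour lookup on frozen clip embeddings. The sketch below is one possible form of such a test, not the planned evaluation protocol: `retrieval_precision_at_1` is a hypothetical helper, and the toy embeddings and species labels are made up for illustration. The same function could be fed embeddings from Udani or from a baseline encoder for a like-for-like comparison.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieval_precision_at_1(embeddings, labels):
    """Fraction of clips whose nearest other clip shares their label."""
    if len(embeddings) < 2:
        raise ValueError("need at least two clips for leave-one-out retrieval")
    hits = 0
    for i, query in enumerate(embeddings):
        # Nearest neighbour by cosine similarity, excluding the query itself.
        best_j = max(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(query, embeddings[j]),
        )
        hits += int(labels[best_j] == labels[i])
    return hits / len(embeddings)


# Toy frozen embeddings (illustrative only), two clips per label.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = ["species_a", "species_a", "species_b", "species_b"]
print(retrieval_precision_at_1(embeddings, labels))  # 1.0
```

Because the encoder stays frozen and the metric has no trainable parameters, differences in this score reflect the representations themselves rather than the capacity of a downstream classifier.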
Links
- Udani repository
- source code path: /Users/oboh/ssm-bio-pilot
- pilot spec path: /Users/oboh/Downloads/SSM_bioacoustic_pilot_spec.md