Udani is the bioacoustic representation-learning line.

The research question is:

Can hybrid recurrent sequence models learn useful bioacoustic representations with less data, less compute, and fewer parameters than standard transformer-style encoders?

The current model family uses HuBERT-style masked acoustic cluster prediction over animal vocalization corpora, but swaps the usual transformer-only backbone for efficient recurrent and hybrid sequence blocks.

Architecture

The main branch is a depth-recurrent hybrid:

  • shared core: [GDN, GDN, GDN, local attention]
  • recurrence passes: 3
  • logical depth: 12 (4 shared blocks × 3 passes)
  • trainable parameters: about 41.8M
  • frontend: convolutional waveform encoder
  • objective: masked prediction of 500 acoustic cluster IDs

A recurrence embedding, which tells the shared core which pass it is on, lets each pass specialize even though the core weights are shared.
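
The layer schedule described above can be sketched in a few lines. This is an illustration only, not the Udani implementation: the names SHARED_CORE, LogicalLayer, unroll, and toy_forward are hypothetical, the blocks are stand-in functions rather than real GDN or attention layers, and adding the recurrence embedding to the hidden state at the start of each pass is an assumption, since the text does not say where it is injected.

```python
from dataclasses import dataclass

# Block types in the shared core, in the order listed above.
SHARED_CORE = ["GDN", "GDN", "GDN", "local_attention"]
RECURRENCE_PASSES = 3


@dataclass(frozen=True)
class LogicalLayer:
    depth_index: int  # position in the unrolled stack, 0-based
    pass_index: int   # which recurrence pass this layer belongs to
    core_index: int   # which shared block supplies the weights
    block_type: str


def unroll(core, passes):
    """Unroll a shared core into its sequence of logical layers."""
    layers = []
    for p in range(passes):
        for c, block_type in enumerate(core):
            layers.append(LogicalLayer(len(layers), p, c, block_type))
    return layers


def toy_forward(hidden, core_fns, pass_embeddings):
    """Apply the same block functions once per pass.

    ASSUMPTION: the per-pass embedding is added to the hidden state before
    each pass. The source does not specify the injection point.
    """
    for embedding in pass_embeddings:
        hidden = [h + e for h, e in zip(hidden, embedding)]
        for fn in core_fns:
            hidden = fn(hidden)
    return hidden


layers = unroll(SHARED_CORE, RECURRENCE_PASSES)
logical_depth = len(layers)                               # 4 blocks x 3 passes
unique_blocks = len({layer.core_index for layer in layers})  # weights exist once
```

The point of the design is visible in the last two lines: the unrolled stack is 12 layers deep, but only 4 blocks hold trainable weights, so depth grows without a matching growth in parameter count.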

Training snapshot

The April run udani_depth_recurrent_3x_curriculum_live_fast_stable completed 25,000 pretraining steps. The final checkpoint retained clean masked-prediction behavior on the calibration path:

  • latest train loss: 0.6102
  • latest masked accuracy: 0.8155
  • best observed train loss: 0.2247
  • best observed masked accuracy: 0.9282
  • calibration sentinel at step 25,000: loss 1.4160, accuracy 0.6044
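
As an aid to reading these numbers, here is a minimal sketch of how masked loss and masked accuracy are commonly computed. It assumes the reported loss is mean cross-entropy in nats over masked frames and that accuracy is top-1 over the same frames. The source does not state the exact definitions, so treat both as assumptions. masked_metrics is a hypothetical helper, not a function from the Udani code. Under these assumptions, a uniform guess over 500 clusters gives a loss of ln 500, about 6.21, and an accuracy of 0.002.

```python
import math

NUM_CLUSTERS = 500  # size of the acoustic cluster vocabulary named in the objective


def masked_metrics(logits, targets, mask):
    """Mean cross-entropy (nats) and top-1 accuracy over masked frames only.

    logits:  list of per-frame score lists, one score per cluster
    targets: list of integer cluster IDs, one per frame
    mask:    list of bools, True where the frame was masked
    """
    total_loss, correct, count = 0.0, 0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked frames do not contribute
        peak = max(frame_logits)
        # Numerically stable log-sum-exp for the softmax normalizer.
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        total_loss += log_norm - frame_logits[target]
        correct += int(frame_logits.index(peak) == target)
        count += 1
    if count == 0:
        raise ValueError("no masked frames to score")
    return total_loss / count, correct / count


# Reference points for an untrained model that guesses uniformly.
chance_loss = math.log(NUM_CLUSTERS)
chance_accuracy = 1.0 / NUM_CLUSTERS
```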

A source-weighted variant also completed 22,000 steps, giving a second stable training trajectory for the same family.

What is still open

The current evidence is strongest on engineering viability: the data pipeline, cluster objective, fused GDN backend, local-attention hybrid, depth recurrence, checkpointing, and sentinel evaluation all work.

The next research-facing step is transfer evaluation: frozen probes and retrieval-style tests against public bioacoustic encoders such as AVES/Perch under a compute and parameter budget. That is where the representation-quality claim should actually live.
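
One possible shape for a retrieval-style test is leave-one-out nearest-neighbor retrieval over frozen embeddings, scored as recall@1. The sketch below is an assumption about what such a test could look like, not a description of a planned Udani evaluation. cosine and recall_at_1 are hypothetical helpers, and the embeddings are assumed to be plain lists of floats already extracted from whichever frozen encoder is being compared.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_1(embeddings, labels):
    """Fraction of clips whose nearest other clip shares their label.

    Requires at least two clips, since each clip is excluded from its own
    candidate set.
    """
    hits = 0
    for i, query in enumerate(embeddings):
        best_j = max(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(query, embeddings[j]),
        )
        hits += int(labels[best_j] == labels[i])
    return hits / len(embeddings)


# Toy data: two well-separated classes in 2-D.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["a", "a", "b", "b"]
toy_score = recall_at_1(toy_embeddings, toy_labels)
```

Running the same function on embeddings from each encoder, with the encoders frozen and the clip set held fixed, gives one number per model that can be set against its parameter count and training compute.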

  • Udani repository
  • source code path: /Users/oboh/ssm-bio-pilot
  • pilot spec path: /Users/oboh/Downloads/SSM_bioacoustic_pilot_spec.md