Udani: Depth-Recurrent Hybrids for Bioacoustic Representation Learning

Udani is the bioacoustic representation-learning line.

The research question is: can hybrid recurrent sequence models learn useful bioacoustic representations with less data, less compute, and fewer parameters than standard transformer-style encoders?

The current model family uses HuBERT-style masked acoustic cluster prediction over animal vocalization corpora, but swaps the usual transformer-only backbone for efficient recurrent and hybrid sequence blocks.
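
To make the objective concrete, here is a minimal sketch of masked cluster prediction metrics in plain Python. It assumes that loss and accuracy are computed over masked frames only, which is the common HuBERT-style choice but is not confirmed for Udani. The function name and argument layout are hypothetical.

```python
import math

def masked_prediction_metrics(logits, targets, mask):
    """Cross-entropy and accuracy over masked frames only (assumed convention).

    logits:  list of frames, each a list of per-cluster scores
    targets: list of integer cluster IDs, one per frame
    mask:    list of bools, True where the frame was masked at the input
    """
    total_loss, correct, count = 0.0, 0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked frames do not contribute in this sketch
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        peak = max(frame_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        total_loss += log_norm - frame_logits[target]  # negative log-likelihood
        predicted = max(range(len(frame_logits)), key=frame_logits.__getitem__)
        correct += int(predicted == target)
        count += 1
    if count == 0:
        raise ValueError("no masked frames in this batch")
    return total_loss / count, correct / count

# Toy example with 3 clusters; the third frame is unmasked and ignored.
toy_loss, toy_acc = masked_prediction_metrics(
    logits=[[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0]],
    targets=[0, 2, 2],
    mask=[True, True, False],
)
```

With 500 clusters, a model that outputs uniform scores would sit at a loss of ln(500), roughly 6.21, which gives a chance-level reference point for reading the loss values reported for a run.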

Architecture

The main branch is a depth-recurrent hybrid:

shared core: [GDN, GDN, GDN, local attention]
recurrence passes: 3
logical depth: 12 (4 core blocks × 3 passes)
trainable parameters: about 41.8M
frontend: convolutional waveform encoder
objective: masked prediction of 500 acoustic cluster IDs

A per-pass recurrence embedding lets each pass specialize even though the core weights are shared.
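
The unrolling can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Udani implementation: the block functions are toy stand-ins rather than real GDN or local-attention layers, and adding the recurrence embedding to the hidden state at the start of each pass is an assumed injection point. All names are hypothetical.

```python
CORE_BLOCKS = ["GDN", "GDN", "GDN", "local_attention"]  # shared core from the spec
NUM_PASSES = 3

def unrolled_schedule(core_blocks, num_passes):
    """List (pass_index, block_index, block_type) in execution order."""
    return [
        (p, b, kind)
        for p in range(num_passes)
        for b, kind in enumerate(core_blocks)
    ]

def forward(hidden, block_fns, pass_embeddings):
    """Run the shared blocks once per pass, adding a pass embedding first.

    hidden:          list of floats (one toy feature vector)
    block_fns:       one callable per core block, reused on every pass
    pass_embeddings: one vector per pass; additive injection is an assumption
    """
    for embedding in pass_embeddings:
        hidden = [h + e for h, e in zip(hidden, embedding)]
        for block in block_fns:
            hidden = block(hidden)  # same callables (shared weights) every pass
    return hidden

schedule = unrolled_schedule(CORE_BLOCKS, NUM_PASSES)
logical_depth = len(schedule)  # 4 blocks x 3 passes = 12

# Toy stand-in blocks: each adds 1.0, so 12 block applications add 12.0.
toy_blocks = [lambda h: [x + 1.0 for x in h]] * len(CORE_BLOCKS)
toy_out = forward([0.0], toy_blocks, pass_embeddings=[[1.0], [2.0], [3.0]])
```

Because the same four blocks are reused on every pass, the parameter count is set by the core (plus the small per-pass embeddings in this sketch), not by the logical depth.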

Training snapshot

The April run udani_depth_recurrent_3x_curriculum_live_fast_stable completed 25,000 pretraining steps. The final checkpoint retained clean masked-prediction behavior on the calibration path:

latest train loss: 0.6102
latest masked accuracy: 0.8155
best observed train loss: 0.2247
best observed masked accuracy: 0.9282
calibration sentinel at step 25,000: loss 1.4160, accuracy 0.6044

A source-weighted variant also completed 22,000 steps, giving a second stable training trajectory for the same family.

What is still open

The current evidence is strongest on engineering viability: the data pipeline, cluster objective, fused GDN backend, local-attention hybrid, depth recurrence, checkpointing, and sentinel evaluation all work.

The next research-facing step is transfer evaluation: frozen probes and retrieval-style tests against public bioacoustic encoders such as AVES and Perch, under a compute and parameter budget. That is where the representation-quality claim should actually live.
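
As one possible shape for a retrieval-style test, the sketch below scores frozen embeddings by leave-one-out top-1 retrieval with cosine similarity. The protocol, function names, and toy data are assumptions for illustration; the project's actual evaluation design is not specified here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def retrieval_top1(embeddings, labels):
    """Leave-one-out top-1 retrieval accuracy over frozen embeddings.

    Each clip is used as a query against all other clips; a hit means the
    nearest neighbour shares the query's label. Needs at least two clips.
    """
    hits = 0
    for i, query in enumerate(embeddings):
        best_j = max(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(query, embeddings[j]),
        )
        hits += int(labels[best_j] == labels[i])
    return hits / len(embeddings)

# Hypothetical 2-D embeddings for four clips from two classes.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["a", "a", "b", "b"]
toy_score = retrieval_top1(toy_embeddings, toy_labels)
```

Running the same function on embeddings from each encoder, over the same clips and labels, gives a like-for-like comparison that requires no additional training.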
