Research
Melody Diffuser
A 32M-parameter Discrete Diffusion Transformer for symbolic music generation, conditioned on real-time hand gesture input via cross-attention.
Overview
The problem
Text-to-music models generate convincing style and timbre, but offer little control over melodic and rhythmic structure. Melody Diffuser treats gesture as a first-class conditioning signal instead.
Dataset
No gesture-labeled music dataset existed, so a self-supervised pipeline was built. Each pitch interval across 10M+ symbolic melodies was mapped to one of 8 discrete gesture tokens: small, medium, and large moves up and down (six tokens), plus hold and repeat.
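The interval-to-gesture labeling described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the token names, the semitone thresholds (`SMALL_MAX`, `MEDIUM_MAX`), and the way "hold" and "repeat" are told apart (a sustained note versus the same pitch struck again) are all assumptions, since the source only names the eight categories.

```python
from typing import List, Optional

# Hypothetical token names; the source lists the eight categories, not their labels.
GESTURE_TOKENS = [
    "up_small", "up_medium", "up_large",
    "down_small", "down_medium", "down_large",
    "hold", "repeat",
]

# Assumed thresholds in semitones (not taken from the source).
SMALL_MAX = 2   # steps up to a major second
MEDIUM_MAX = 5  # leaps up to a perfect fourth


def interval_to_gesture(interval: int) -> str:
    """Map one melodic interval in semitones to a gesture token.

    An interval of 0 is assumed to mean the same pitch is struck again.
    """
    if interval == 0:
        return "repeat"
    direction = "up" if interval > 0 else "down"
    size = abs(interval)
    if size <= SMALL_MAX:
        bucket = "small"
    elif size <= MEDIUM_MAX:
        bucket = "medium"
    else:
        bucket = "large"
    return f"{direction}_{bucket}"


def melody_to_gestures(pitches: List[Optional[int]]) -> List[str]:
    """Label each transition in a melody of MIDI pitches.

    None is assumed to mark a time step where the previous note is sustained.
    """
    gestures = []
    last_pitch = None
    for pitch in pitches:
        if pitch is None:
            gestures.append("hold")
            continue
        if last_pitch is not None:
            gestures.append(interval_to_gesture(pitch - last_pitch))
        last_pitch = pitch
    return gestures
```

Because the labels are derived from the melody itself, every melody in the corpus yields a paired (gesture sequence, note sequence) training example without manual annotation.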
Inference
At runtime, MediaPipe tracks index-finger coordinates from a webcam feed and tokenizes them through the same rule-based system. These tokens condition the diffusion process, which runs on a cloud GPU (NVIDIA T4).
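A rough sketch of how tracked fingertip positions might be turned into the same kind of tokens is shown below. It assumes the input is a list of normalized vertical coordinates, such as the `y` value of the index fingertip landmark (index 8 in MediaPipe Hands), where `y` grows downward in image space. The dead zone and bucket edges are hypothetical values, and how the real system produces a "repeat" token from hand motion is not stated in the source, so this sketch omits it.

```python
from typing import List

# Assumed thresholds in normalized image units (0..1); not from the source.
DEAD_ZONE = 0.01   # movement smaller than this counts as no movement
SMALL_MAX = 0.05
MEDIUM_MAX = 0.15


def finger_to_gestures(y_coords: List[float]) -> List[str]:
    """Tokenize successive vertical fingertip positions.

    Image-space y grows downward, so a decrease in y is an upward motion.
    """
    tokens = []
    for prev, curr in zip(y_coords, y_coords[1:]):
        delta = prev - curr  # positive when the finger moves up
        size = abs(delta)
        if size < DEAD_ZONE:
            tokens.append("hold")
            continue
        direction = "up" if delta > 0 else "down"
        if size <= SMALL_MAX:
            bucket = "small"
        elif size <= MEDIUM_MAX:
            bucket = "medium"
        else:
            bucket = "large"
        tokens.append(f"{direction}_{bucket}")
    return tokens
```

The dead zone is there to keep tracking jitter from being read as deliberate small movements.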
Architecture
Model
32M-parameter Discrete Diffusion Transformer. Tokens are corrupted over a 64-step categorical schedule and iteratively reconstructed. Each transformer block attends to gesture embeddings via cross-attention and uses RMSNorm normalization and SwiGLU feed-forward activations for training stability.
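To make the corruption step concrete, here is a minimal sketch of a 64-step categorical forward process. The source does not say which corruption kernel or schedule is used, so this assumes an absorbing-state variant (tokens are replaced by a mask symbol) with a linear schedule; `MASK_ID` and both function names are hypothetical.

```python
import random
from typing import List

NUM_STEPS = 64  # number of diffusion steps, from the text
MASK_ID = 0     # hypothetical id reserved for the absorbing mask state


def mask_probability(t: int, num_steps: int = NUM_STEPS) -> float:
    """Probability that a token is masked by step t (assumed linear schedule)."""
    return t / num_steps


def corrupt(tokens: List[int], t: int, rng: random.Random) -> List[int]:
    """Sample a corrupted sequence at step t by masking each token independently."""
    p = mask_probability(t)
    return [MASK_ID if rng.random() < p else tok for tok in tokens]
```

At `t = 0` the sequence is untouched and at `t = 64` every token is masked. Generation runs the other way: the model starts from a fully corrupted sequence and predicts the original tokens a little at a time.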
Conditioning
cond = self.gesture_embed(gesture)
attn, _ = self.cross_attn(x, cond, cond)
x = x + attn
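For context, the snippet above could sit inside a transformer block along these lines. This is a sketch under several assumptions rather than the model's actual code: the pre-norm layout, the dimensions, the `GestureBlock`, `RMSNorm`, and `SwiGLU` class names, and the use of PyTorch's `nn.MultiheadAttention` are all illustrative. Only the three conditioning lines come from the source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Scale activations by their root mean square, with a learned gain."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


class SwiGLU(nn.Module):
    """Gated feed-forward layer: down(silu(gate(x)) * up(x))."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden)
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


class GestureBlock(nn.Module):
    """One block: self-attention, gesture cross-attention, SwiGLU feed-forward."""

    def __init__(self, dim: int = 256, heads: int = 4, num_gestures: int = 8):
        super().__init__()
        self.gesture_embed = nn.Embedding(num_gestures, dim)
        self.norm1 = RMSNorm(dim)
        self.norm2 = RMSNorm(dim)
        self.norm3 = RMSNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = SwiGLU(dim, 4 * dim)

    def forward(self, x: torch.Tensor, gesture: torch.Tensor) -> torch.Tensor:
        # Self-attention over the (noisy) note tokens.
        h = self.norm1(x)
        attn, _ = self.self_attn(h, h, h)
        x = x + attn
        # Cross-attention: note tokens query the gesture embeddings.
        cond = self.gesture_embed(gesture)
        attn, _ = self.cross_attn(self.norm2(x), cond, cond)
        x = x + attn
        # Position-wise feed-forward with a residual connection.
        return x + self.ffn(self.norm3(x))
```

Here `x` has shape `(batch, notes, dim)` and `gesture` holds integer token ids of shape `(batch, gestures)`. The two sequence lengths do not need to match, because the gestures only supply keys and values.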
Next steps
Expanding to polyphonic generation by training a variational autoencoder on Bach chorales, enabling latent diffusion over 4-part harmony.
Independent research conducted by Duncan Larzelere (East Lansing High School). Trained on Google Colab (NVIDIA A100). No institutional supervision.
Questions or collaboration? duncan.larzelere@gmail.com