Research

Melody Diffuser

A 32M-parameter Discrete Diffusion Transformer for symbolic music generation, conditioned on real-time hand gesture input via cross-attention.

Read the paper | Model weights | Try the demo

Overview

The problem

Text-to-music models generate convincing style and timbre, but offer little control over melodic and rhythmic structure. Melody Diffuser treats gesture as a first-class conditioning signal instead.

Dataset

No gesture-labeled music dataset existed, so a self-supervised pipeline was built to create one. Pitch intervals across 10M+ symbolic melodies were converted into 8 discrete gesture tokens: small, medium, and large steps up and down (six tokens), plus hold and repeat.
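The labeling rule can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the token names, the semitone thresholds, and the reading of "hold" as a sustained note and "repeat" as the same pitch struck again are all assumptions, since the text names the eight categories but not their boundaries.

```python
from typing import List, Optional

# Hypothetical thresholds in semitones; the real boundaries are not stated.
SMALL_MAX = 2   # assumed: 1-2 semitones is a "small" step
MEDIUM_MAX = 5  # assumed: 3-5 is "medium", anything larger is "large"


def interval_to_gesture(interval: int) -> str:
    """Map a signed pitch interval in semitones to a gesture token name."""
    if interval == 0:
        return "REPEAT"  # assumed meaning: same pitch struck again
    size = abs(interval)
    if size <= SMALL_MAX:
        magnitude = "SMALL"
    elif size <= MEDIUM_MAX:
        magnitude = "MEDIUM"
    else:
        magnitude = "LARGE"
    direction = "UP" if interval > 0 else "DOWN"
    return f"{magnitude}_{direction}"


def melody_to_gestures(steps: List[Optional[int]]) -> List[str]:
    """Label every time step after the first sounding note.

    `steps` holds one MIDI pitch per time step; None means the previous
    note is still sounding (the assumed meaning of "hold").
    """
    gestures: List[str] = []
    previous: Optional[int] = None
    for pitch in steps:
        if pitch is None:
            if previous is not None:
                gestures.append("HOLD")
            continue
        if previous is not None:
            gestures.append(interval_to_gesture(pitch - previous))
        previous = pitch
    return gestures
```

Under these assumed thresholds, `melody_to_gestures([60, 62, None, 62, 67])` labels the melody as a small step up, a hold, a repeat, and a medium step up. Because the labels come from the melody itself, no human annotation is needed.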

Inference

At runtime, MediaPipe tracks index finger coordinates from a webcam feed and tokenizes them through the same rule-based system. These tokens condition the diffusion process on a cloud GPU (NVIDIA T4).
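A rough sketch of the runtime tokenization step, assuming the fingertip heights have already been read from MediaPipe's hand landmarks (the index fingertip is landmark 8, and coordinates are normalized to the frame with y = 0 at the top). The function name and displacement thresholds are hypothetical, and how a "repeat" gesture is detected from finger motion is not described in the text, so this sketch emits only the other seven tokens.

```python
from typing import List

# Hypothetical thresholds on the change in normalized y between samples.
HOLD_EPS = 0.01    # assumed: smaller movements count as holding still
SMALL_MAX = 0.05   # assumed upper bound for a "small" movement
MEDIUM_MAX = 0.15  # assumed upper bound for a "medium" movement


def fingertip_to_gestures(ys: List[float]) -> List[str]:
    """Tokenize successive fingertip heights.

    `ys` are normalized image y-coordinates in [0, 1] with 0 at the top of
    the frame, so a decreasing y means the finger moved up.
    """
    gestures: List[str] = []
    for prev, cur in zip(ys, ys[1:]):
        delta = prev - cur  # positive when the finger rises
        size = abs(delta)
        if size < HOLD_EPS:
            gestures.append("HOLD")
            continue
        if size <= SMALL_MAX:
            magnitude = "SMALL"
        elif size <= MEDIUM_MAX:
            magnitude = "MEDIUM"
        else:
            magnitude = "LARGE"
        gestures.append(f"{magnitude}_{'UP' if delta > 0 else 'DOWN'}")
    return gestures
```

Flipping the sign of the y difference matters here: image coordinates grow downward, while a rising finger should map to a rising melody.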

Architecture

Model

32M-parameter Discrete Diffusion Transformer. Tokens are corrupted over a 64-step categorical schedule and iteratively reconstructed. Each transformer block attends to gesture embeddings via cross-attention and uses RMSNorm normalization and SwiGLU activations for training stability.
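To make the corruption step concrete, here is a minimal sketch of a forward process for discrete tokens. The text gives only the step count, so the rest is assumed: this version uses absorbing-state (masking) corruption, one common choice for discrete diffusion, with a linear schedule, and `MASK_ID` is a hypothetical token id. The actual model may use a different transition (for example, uniform resampling) or a different schedule.

```python
import random
from typing import List

NUM_STEPS = 64  # from the text: a 64-step schedule
MASK_ID = -1    # hypothetical id for the absorbing "masked" state


def mask_probability(t: int) -> float:
    """Assumed linear schedule: chance that a token is masked by step t."""
    return t / NUM_STEPS


def corrupt(tokens: List[int], t: int, rng: random.Random) -> List[int]:
    """Sample a noisy sequence at step t by masking each token independently."""
    p = mask_probability(t)
    return [MASK_ID if rng.random() < p else tok for tok in tokens]
```

At `t = 0` the sequence is untouched and at `t = 64` every token is masked. During training the model would see a sequence corrupted at a random step and learn to predict the original tokens; generation then runs the process in reverse, filling in masked positions step by step.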

Conditioning

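# Embed the gesture tokens; music tokens (queries) attend to them (keys and values).
cond = self.gesture_embed(gesture)
attn, _ = self.cross_attn(x, cond, cond)
# Residual connection: add the gesture-conditioned update back onto x.
x = x + attn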
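The lines above can be wrapped in a small, self-contained PyTorch module to show the tensor shapes involved. This is a sketch under stated assumptions rather than the model's actual block: the class name, `d_model`, `num_heads`, and the use of `nn.Embedding` and `nn.MultiheadAttention` for `gesture_embed` and `cross_attn` are guesses, and normalization and the feed-forward layer are omitted.

```python
import torch
from torch import nn


class GestureCrossAttention(nn.Module):
    """Residual cross-attention from music tokens to gesture tokens."""

    def __init__(self, d_model: int = 64, num_heads: int = 4, num_gestures: int = 8):
        super().__init__()
        # One learned vector per gesture token (8 categories in the text).
        self.gesture_embed = nn.Embedding(num_gestures, d_model)
        # batch_first=True makes inputs (batch, sequence, d_model).
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, gesture: torch.Tensor) -> torch.Tensor:
        # x: (batch, music_len, d_model); gesture: (batch, gesture_len) integer ids
        cond = self.gesture_embed(gesture)
        # Queries come from the music stream, keys and values from the gestures.
        attn, _ = self.cross_attn(x, cond, cond)
        return x + attn


block = GestureCrossAttention()
x = torch.randn(2, 16, 64)              # 2 sequences of 16 music tokens
gesture = torch.randint(0, 8, (2, 10))  # 10 gesture tokens per sequence
out = block(x, gesture)
```

The output keeps the shape of `x`, so the music and gesture sequences can have different lengths and the block can be stacked without changing the rest of the transformer.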

Next steps

The next step is polyphonic generation: training a variational autoencoder on Bach chorales, which would enable latent diffusion over 4-part harmony.

Independent research conducted by Duncan Larzelere (East Lansing High School). Trained on Google Colab (NVIDIA A100). No institutional supervision.

Questions or collaboration? duncan.larzelere@gmail.com