Music begins with gesture

Your hands already know how to conduct. We built AI that listens.

How it works

Discrete diffusion meets gesture control

Unlike image diffusion models, ours operates directly on symbolic music tokens. Your gestures condition the denoising process in real time.
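
The page doesn't spell out the noising process, but a common choice for discrete diffusion over token sequences is an absorbing-state (masking) corruption. Below is a minimal sketch, assuming a hypothetical MASK_ID token and a linear masking schedule; none of these names come from the actual model.

import torch

# Hypothetical vocabulary constants; the real token set isn't published here.
MASK_ID = 0          # absorbing "noise" token
VOCAB_SIZE = 512     # pitch / duration / velocity tokens

def corrupt(tokens: torch.Tensor, t: float) -> torch.Tensor:
    # Mask each melody token independently with probability t in [0, 1].
    # At t = 0 the sequence is clean; at t = 1 it is fully masked.
    # Generation runs this in reverse, filling masked positions step by step.
    mask = torch.rand(tokens.shape) < t
    return torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)

# Example: corrupt a short melody halfway through the schedule.
melody = torch.randint(1, VOCAB_SIZE, (1, 32))   # (batch, sequence)
noisy = corrupt(melody, t=0.5)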

👆

Gesture Input

8 gesture types captured from hand movement—up, down, hold, accent, and more.

Cross-Attention

Gestures are embedded and attended to at each diffusion step, shaping the melody.

🎵

Melody Output

Discrete tokens decoded to MIDI—pitch, duration, velocity all controlled by you (sketched below).
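
Neither the gesture vocabulary nor the tokenizer is documented on this page; the sketch below shows one plausible shape for both, assuming a simple id table for the eight gestures (only four names are given above) and a factored three-tokens-per-note layout. All names and grids are illustrative.

from dataclasses import dataclass

# Hypothetical gesture id table; only four of the eight names appear above.
GESTURES = {"up": 0, "down": 1, "hold": 2, "accent": 3}  # ...plus 4 more ids

@dataclass
class Note:
    pitch: int       # MIDI pitch, 0-127
    duration: float  # length in beats
    velocity: int    # MIDI velocity, 0-127

def decode(tokens: list[int]) -> list[Note]:
    # Assumes every note is three consecutive tokens: pitch, duration index,
    # velocity bucket. The real tokenizer isn't documented here.
    durations = [0.25, 0.5, 1.0, 2.0]  # hypothetical duration grid, in beats
    notes = []
    for p, d, v in zip(tokens[0::3], tokens[1::3], tokens[2::3]):
        notes.append(Note(pitch=p % 128,
                          duration=durations[d % len(durations)],
                          velocity=(v % 8) * 16 + 15))
    return notes

# Example: a two-note decode.
print(decode([60, 2, 5, 64, 1, 7]))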

Architecture

import torch.nn as nn

class MelodyDiffusor(nn.Module):
    def __init__(self, dim, heads, depth, num_gestures):
        super().__init__()

        # Transformer backbone
        self.blocks = nn.ModuleList([
            TransformerBlock(dim, heads)
            for _ in range(depth)
        ])

        # Gesture conditioning
        self.cond_embed = nn.Embedding(
            num_gestures, dim
        )
        self.cross_attn = nn.MultiheadAttention(
            dim, heads, batch_first=True  # (batch, seq, dim) throughout
        )

    def forward(self, x, t, gesture):
        # Embed gesture sequence: (batch, gestures) -> (batch, gestures, dim)
        c = self.cond_embed(gesture)

        # Diffusion with conditioning: each block consumes the timestep,
        # then the melody tokens cross-attend to the gesture embeddings
        for block in self.blocks:
            x = block(x, t)
            x = x + self.cross_attn(x, c, c)[0]

        return x
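
To show how this module would be called, here is a hypothetical usage sketch. TransformerBlock isn't defined on this page, so a minimal stand-in is included; its internals, the dimensions, and the timestep shape are assumptions rather than details of the real system.

import torch
import torch.nn as nn

# Minimal stand-in for the TransformerBlock referenced above; how the real
# block consumes the timestep t isn't shown on this page.
class TransformerBlock(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_proj = nn.Linear(1, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, t):
        x = x + self.time_proj(t)      # inject the diffusion timestep
        x = x + self.attn(x, x, x)[0]  # self-attention over melody positions
        return self.norm(x)

model = MelodyDiffusor(dim=256, heads=4, depth=6, num_gestures=8)

x = torch.randn(1, 64, 256)             # (batch, seq, dim) embedded melody tokens
t = torch.full((1, 1, 1), 0.5)          # diffusion timestep, broadcast over the sequence
gesture = torch.randint(0, 8, (1, 16))  # sixteen captured gesture ids

out = model(x, t, gesture)
print(out.shape)                        # torch.Size([1, 64, 256])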

The key insight

Gesture as musical intent

Traditional music AI generates from prompts or examples. We take a different approach: your physical movement becomes the conditioning signal.

Each gesture is embedded into a learned vector space. During denoising, the model attends to these embeddings, allowing your movement to directly influence pitch direction, rhythmic density, and melodic contour.
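
The sampler itself isn't described here; below is a minimal sketch of one plausible conditional denoising loop for a masked discrete-diffusion model. It assumes a hypothetical denoiser(tokens, t, gesture) that returns per-position logits over the token vocabulary (the backbone above would sit inside it, behind a token embedding and an output projection) and a MASK_ID absorbing token.

import torch

@torch.no_grad()
def sample(denoiser, gesture, seq_len=64, steps=10, mask_id=0):
    # Start from a fully masked sequence and unmask it over `steps` passes,
    # conditioning every pass on the same gesture sequence.
    tokens = torch.full((1, seq_len), mask_id)
    for step in range(steps, 0, -1):
        t = torch.tensor([step / steps])       # current noise level in (0, 1]
        logits = denoiser(tokens, t, gesture)  # (1, seq_len, vocab_size)
        predicted = logits.argmax(dim=-1)      # greedy fill-in of every position
        # Re-mask a shrinking fraction so later passes can refine early guesses;
        # at the final pass nothing is re-masked and the melody is complete.
        remask = torch.rand(1, seq_len) < (step - 1) / steps
        tokens = torch.where(remask, torch.full_like(tokens, mask_id), predicted)
    return tokens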

Trained on 1 million+ synthetic melodies with paired gesture annotations.

Available now
Melody Diffuser
Try it →