Music begins with gesture
Your hands already know how to conduct. We built AI that listens.
Discrete diffusion meets gesture control
Unlike image diffusion models, ours operates directly on symbolic music tokens. Your gestures condition the denoising process in real time.
Gesture Input
8 gesture types captured from hand movement—up, down, hold, accent, and more.
Cross-Attention
Gestures are embedded and attended to at each diffusion step, shaping the melody.
Melody Output
Discrete tokens are decoded to MIDI, with pitch, duration, and velocity all controlled by you (see the decoding sketch below).
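To make the two ends of this pipeline concrete, here is a minimal sketch of an assumed gesture vocabulary and a decoder that turns (pitch, duration, velocity) token triples into a MIDI file with pretty_midi. Only up, down, hold, and accent are named above; the remaining four gesture labels, the token layout, and the helper names are illustrative assumptions, not the actual implementation.

# Sketch only: gesture vocabulary and token-to-MIDI decoding.
# The last four gesture labels and the token layout are assumptions.
import pretty_midi

GESTURES = ["up", "down", "hold", "accent",
            "swell", "release", "pulse", "rest"]  # 8 types; last 4 hypothetical
GESTURE_TO_ID = {g: i for i, g in enumerate(GESTURES)}

def decode_to_midi(notes, path="melody.mid", tempo=120.0):
    """notes: list of (pitch, duration_in_beats, velocity) triples."""
    pm = pretty_midi.PrettyMIDI()
    inst = pretty_midi.Instrument(program=0)  # acoustic grand piano
    sec_per_beat = 60.0 / tempo
    t = 0.0
    for pitch, dur_beats, velocity in notes:
        dur = dur_beats * sec_per_beat
        inst.notes.append(pretty_midi.Note(velocity=velocity, pitch=pitch,
                                           start=t, end=t + dur))
        t += dur
    pm.instruments.append(inst)
    pm.write(path)

# e.g. decode_to_midi([(60, 1.0, 90), (62, 0.5, 80), (64, 0.5, 100)])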
Architecture
import torch.nn as nn

class MelodyDiffusor(nn.Module):
    def __init__(self, dim, heads, depth, num_gestures):
        super().__init__()
        # Transformer backbone (TransformerBlock is a timestep-conditioned
        # self-attention block, defined elsewhere)
        self.blocks = nn.ModuleList([
            TransformerBlock(dim, heads)
            for _ in range(depth)
        ])
        # Gesture conditioning
        self.cond_embed = nn.Embedding(num_gestures, dim)
        self.cross_attn = nn.MultiheadAttention(
            dim, heads, batch_first=True
        )

    def forward(self, x, t, gesture):
        # Embed the gesture sequence: (batch, n_gestures) -> (batch, n_gestures, dim)
        c = self.cond_embed(gesture)
        # Denoise while attending to the gesture embeddings
        for block in self.blocks:
            x = block(x, t)
            x = x + self.cross_attn(x, c, c)[0]
        return x

The key insight
Gesture as musical intent
Traditional music AI generates from prompts or examples. We take a different approach: your physical movement becomes the conditioning signal.
Each gesture is embedded into a learned vector space. During denoising, the model attends to these embeddings, allowing your movement to directly influence pitch direction, rhythmic density, and melodic contour.
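As a rough illustration of how the conditioning enters sampling, the sketch below runs the MelodyDiffusor from the Architecture section for a few refinement steps over a random token sequence. The stand-in TransformerBlock, the token embedding and output head, the gesture sequence, the step count, and the argmax re-sampling loop are all assumptions for illustration; the actual block design and sampling schedule are not specified above.

# Illustrative only: a stand-in TransformerBlock plus a crude iterative
# refinement loop conditioned on a gesture sequence.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal stand-in: self-attention + MLP, ignoring the timestep."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, t):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))

dim, heads, depth, num_gestures = 128, 4, 4, 8
vocab_size, seq_len, steps = 256, 32, 8             # assumed sizes

model = MelodyDiffusor(dim, heads, depth, num_gestures)
token_embed = nn.Embedding(vocab_size, dim)          # assumed token embedding
to_logits = nn.Linear(dim, vocab_size)               # assumed output head

gesture = torch.tensor([[0, 0, 2, 3, 1, 1, 2, 0]])   # e.g. up, up, hold, accent, ...
tokens = torch.randint(vocab_size, (1, seq_len))     # start from noise tokens

for step in range(steps):
    t = torch.full((1,), steps - step, dtype=torch.long)
    h = model(token_embed(tokens), t, gesture)
    tokens = to_logits(h).argmax(dim=-1)             # re-sample the sequence
print(tokens.shape)  # (1, seq_len) discrete melody tokens, ready to decode to MIDI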
Trained on 1 million+ synthetic melodies with paired gesture annotations.
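The paired training data can be pictured as aligned melody and gesture token sequences; a minimal PyTorch Dataset along those lines is sketched below. The class name, field layout, and example values are assumptions, since the annotation format is not described here.

# Hypothetical layout for the (melody, gesture) training pairs.
import torch
from torch.utils.data import Dataset

class PairedMelodyGestures(Dataset):
    """Each item: melody tokens plus the gesture sequence annotated for it."""
    def __init__(self, pairs):
        # pairs: list of (melody_tokens, gesture_ids) integer lists
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        melody, gesture = self.pairs[idx]
        return torch.tensor(melody), torch.tensor(gesture)

# e.g. one synthetic pair: a 4-note phrase annotated up, up, hold, accent
ds = PairedMelodyGestures([([60, 62, 64, 64], [0, 0, 2, 3])])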