Seeing Voices: Mirage

The power of video depends on the harmonious integration of what we hear with what we see. Creators and consumers of content connect with the precise details of audiovisual performances, especially when it comes to A-roll: the primary, narrative-advancing portions of real-world video that foreground people both on screen and on the soundtrack.

We introduce Mirage, an audio-to-video foundation model designed for A-roll. Mirage generates expressive video of people from images, text, and audio, producing compelling performances with the quality and realism needed for content with real-world engagement. Try it now at mirage.app and read the full paper here.

A-Roll Video Generation

Mirage excels at generating A-roll: matching emotional nuance and affect, precisely synchronizing lips and gestures, and producing people who look the way they sound and behave naturally. Mirage coaxes precisely coordinated performances out of audio, images, and text, making it easier for you to tell your story with realistic A-roll video.

Capabilities

Plosive and Viseme Dynamics

While other systems struggle to time visual dynamics with the fidelity needed for realistic human vocal delivery, Mirage handles demanding lip movements such as plosives precisely across variations in language, tone, volume, and visual appearance.

/t/ sound, ‘take’

/b/ sound, ‘back’

/p/ sound, ‘peter’

Eye Blinking and Gaze Behavior

People in Mirage performances look around the way humans do, combining blinks, saccades, and gaze adjustments into lifelike visual performances.

Mirage outputs feature natural blinking and eye movement

Emotional Nuance and Affect Matching

Mirage produces clarity and continuity in mouth motion, accurately timing lip closures and releases in sync with audio. Performances move fluently from slurred, slow speech to rapid, rhythmically demanding rapping.

Mirage behavior when expressing fear

Mirage behavior when expressing shock

Mirage behavior when expressing sadness

Coarticulation and Speech Blending

Mirage samples exhibit strong coarticulation across diverse speech patterns, such as rapping and tongue twisters.

Paralinguistic Generation

Mirage captures the moments that don’t happen in words, from laughter and tears to sneezing, coughing, and grunting.

Mirage behavior when audio features coughing

Mirage behavior when audio features sneezing

Mirage behavior when audio features laughing

Audio-Only Generation

Mirage works with text, images, and audio, alone or together. When generating A-roll from audio alone, Mirage matches who’s performing in the output to what can be heard.

Mirage output conditioned on a male-presenting voice without a text prompt

The soundscape of the audio input tells Mirage about the world it should generate, from intimate indoor settings to the great outdoors.

Mirage output conditioned on voices recorded in settings ranging from intimate indoor scenes to the great outdoors

Mirage matches sight and sound, avoiding the uncanny valley effects found in many models that pair generated people with voices that don’t match them.

Mirage output conditioned on a female-presenting voice without a text prompt

Mismatched Text and Audio Results

When text and audio are deliberately pushed in different directions, Mirage can produce compelling blends of the two.

Mirage output when conditioned using a female-presenting voice and a male-presenting text prompt

Mirage output when conditioned using a male-presenting voice and a female-presenting text prompt

Gesture-Semantic Alignment

Training for video generation lets Mirage connect what it knows about audio to what happens in visual behavior. Mirage makes images that move along with what’s being said.

Head nodding indicating agreement. No mention of body motion in prompts.

Head nodding in response to agreement. Prompts contained no explicit motion references.

Head shaking indicating disagreement. No mention of body motion in prompts.

Purposeful hand movements in response to explanations. Again, no mention of body motion in prompts.

Subject Appearance Accuracy

Mirage gives text control over all the details of how someone appears, from facial structure to makeup to clothing and lighting, while still ensuring they look how they sound when they perform.

A young woman with a long, dark, messy braid that gives her a carefree vibe, complemented by large hoop earrings and striking long nails that reflect her personal style. She interacts with the camera, her lips slightly parted in a thoughtful expression, suggesting she is about to share something meaningful while exuding confidence. The background is softly blurred, showcasing a minimalistic indoor space with soft lighting, a few plants, and simple decor that creates a calm and inviting atmosphere. The shot is a medium, stationary frame at eye level, illuminated by bright, even lighting that enhances the clarity of her features and the overall composition.

A young woman with a long, dark, messy braid adorned with a colorful silk scarf, her lips slightly parted, exuding a relaxed and minimal aesthetic. She engages softly with the camera, her expression calm and inviting, as she occasionally tilts her head slightly, enhancing her approachable demeanor. The background is softly blurred, featuring a minimalist indoor space with neutral-colored walls, a few carefully arranged decorative items, and natural light filtering through a nearby window, creating a serene atmosphere. The shot is a medium, stationary frame at eye level, under bright, even lighting that creates a clear and engaging visual of the subject while maintaining a polished and inviting composition.

A young woman with a long, dark, messy braid cascading over her shoulder, her face adorned with subtle glitter that catches the light, and her lips slightly parted, exuding a relaxed yet captivating vibe. She engages with the camera in a serene manner, her expression inviting and thoughtful, as she occasionally tilts her head slightly, enhancing her minimal aesthetic. The background is softly blurred, featuring a minimalist indoor space with soft neutral tones, a few carefully placed decorative items, and gentle lighting that enhances the tranquil atmosphere. The shot is a medium, stationary frame at eye level, under bright, even lighting that creates a clear and engaging visual of the subject, highlighting her features and the simplicity of the setting.

Background and Prop Fidelity

Mirage generates environments with specific contexts (e.g., “sunlit café,” “futuristic lab”) and props (e.g., “holding a vintage microphone,” “surrounded by holograms”) that closely match input prompts.

A young man with short, dark brown hair and round green-framed glasses, dressed in a stylish black shirt that highlights his engaged demeanor. He maintains eye contact with the camera, his lips slightly pursed as he speaks, prominently displaying a vibrant polka-dot cycling jersey in front of him. The background is softly blurred, featuring a cozy indoor scene with a lamp and a plant, adding to the inviting ambiance. The shot is a close-up at eye level with stationary framing, illuminated by warm, diffused lighting that enhances his features and creates a polished, intimate atmosphere.

A woman with long, dark hair styled straight wears a brown ribbed sweater, complemented by gold hoop earrings and a nose ring, exuding a modern and fashionable vibe. She presents a misty crystal ball to the camera, her serious expression and direct engagement with a professional Shure microphone highlighting her focus and passion. The background is softly blurred, featuring a stylish, dimly lit setting with wooden slat paneling, warm ambient lighting, and a piano partially visible on the left, contributing to the sophisticated atmosphere. The shot is a medium close-up at eye level, illuminated by soft, directional lighting that creates subtle shadows, adding depth and a professional yet inviting tone.

A young man with fair skin, shoulder-length blonde hair, and a light beard wears a fitted dark gray t-shirt that accentuates his athletic build. He looks directly into the camera, speaking animatedly with an engaged expression, his mouth slightly open mid-sentence while both hands are raised to emphasize his points. The background is softly blurred, showcasing a bustling gym filled with exercise equipment and several gymgoers, including multiple individuals clearly visible squatting at the squat racks. The shot is a medium close-up at eye level, captured in a stationary frame under harsh fluorescent lighting that creates a bright, clear visual of the subject.

Mirage generates photorealistic, engaging performances with precisely synchronized audio and video, and we’re constantly adding new capabilities. Try it now at mirage.app and read the full paper here.