Seeing Voices: Mirage

The power of video depends on the harmonious integration of what we hear with what we see. Creators and consumers of content connect with the precise details of audiovisual performances, especially when it comes to A-roll: the primary, narrative-advancing portions of real-world video that foreground people both on screen and on the soundtrack.

We introduce Mirage, an audio-to-video foundation model designed for A-roll. Mirage generates expressive video of people from images, text, and audio, producing compelling performances with the quality and realism needed for content with real-world engagement. Try it now at mirage.app and read the full paper here.

A-Roll Video Generation

Mirage excels at generating A-roll: it matches emotional nuance and affect, precisely synchronizes lips and gestures, and produces people who look the way they sound and behave naturally. Mirage coaxes precisely coordinated performances out of audio, images, and text, making it easier for you to tell your story with realistic A-roll video.

Capabilities

Plosive and Viseme Dynamics

While other systems struggle to time visual dynamics with the fidelity needed for realistic human vocal delivery, Mirage handles demanding lip movements such as plosives precisely across variations in language, tone, volume, and visual appearance.

/t/ sound, ‘take’

/b/ sound, ‘back’

/p/ sound, ‘Peter’

Eye Blinking and Gaze Behavior

Mirage's subjects look around the way people do, coordinating blinks, saccades, and gaze shifts to create lifelike performances.

Mirage outputs feature natural blinking and eye movement

Emotional Nuance and Affect Matching

Mirage produces clarity and continuity in mouth motion, accurately timing lip closures and releases in sync with audio. Performances move fluently from slurred, slow speech to rapid, rhythmically demanding rapping.

Mirage behavior when expressing fear

Mirage behavior when expressing shock

Mirage behavior when expressing sadness

Coarticulation and Speech Blending

Mirage samples exhibit strong coarticulation across diverse speech patterns, such as rapping and tongue twisters.

Paralinguistic Generation

Mirage captures the moments that don’t happen in words, from laughter and tears to sneezing, coughing, and grunting.

Mirage behavior when audio features coughing

Mirage behavior when audio features sneezing

Mirage behavior when audio features laughing

Audio-Only Generation

Mirage works with text, images, and audio, alone or together. When generating A-roll from audio alone, Mirage matches who’s performing in the output to what can be heard.

Mirage output conditioned on a male-presenting voice without a text prompt

The soundscape of the audio input tells Mirage what kind of world to generate, from intimate indoor settings to the great outdoors.

Mirage output conditioned on voices recorded in settings ranging from intimate indoor scenes to the great outdoors

Mirage matches sight and sound, avoiding the uncanny valley effects found in many models that pair generated people with voices that don’t match them.

Mirage output conditioned on a female-presenting voice without a text prompt

Mismatched Text and Audio Results

But when text and audio are deliberately pushed in different directions, Mirage can develop compelling blends of the two.

Mirage output when conditioned on a male-presenting voice and a female-presenting text prompt

Mirage output when conditioned on a female-presenting voice and a male-presenting text prompt

Mirage output when conditioned on a male-presenting voice and a female-presenting text prompt

Gesture-Semantic Alignment

Training for video generation lets Mirage connect what it knows about audio to what happens in visual behavior. Mirage makes images that move along with what’s being said.

Head nodding indicating agreement. No mention of body motion in prompts.

Head nodding in response to agreement. Prompts contained no explicit motion references.

Head shaking indicating disagreement. No mention of body motion in prompts.

Purposeful hand movements in response to explanations. Again, no mention of body motion in prompts.

Subject Appearance Accuracy

Mirage gives text control over all the details of how someone appears, from facial structure to makeup to clothing and lighting, while still ensuring they look how they sound when they perform.

A middle-aged woman with a long, dark, messy braid and gentle features wears a plain white blouse, embodying a minimalistic style. She looks directly at the camera with her lips slightly parted, suggesting a moment of reflection and engagement with the viewer. The background is softly blurred, showcasing a simple indoor setting with light-colored walls, a few art pieces, and a cozy armchair, contributing to a calm and minimalist vibe. The shot is a medium, stationary angle at eye level, illuminated by bright, even lighting that creates a clear and engaging visual of her presence.

A young woman with a long, dark, messy braid and striking holographic makeup, her lips slightly parted, embodying a minimal aesthetic that captivates the viewer. She interacts with the camera with a gentle smile, her eyes sparkling with enthusiasm as she gestures subtly, inviting the audience into her world. The background is softly blurred, showcasing a chic indoor setting with soft pastel tones, a few art pieces on the walls, and a cozy ambiance that complements her style. The shot is a medium, stationary angle at eye level, under bright, even lighting that highlights her features and creates an engaging visual experience.

A young woman with a long, dark, messy braid and braces has her lips slightly parted, dressed in a casual, minimalistic top that complements her laid-back style. She interacts with the camera with a gentle smile, her expression relaxed and approachable, occasionally using subtle hand gestures to emphasize her thoughts. The background is softly blurred, showcasing a minimalistic indoor setting with light-colored walls, a few decorative items on a shelf, and a soft rug, contributing to a cozy yet modern vibe. The shot is a medium, stationary frame at eye level, illuminated by bright, even lighting that enhances the clarity of her features while maintaining a polished and inviting composition.

Background and Prop Fidelity

Mirage generates environments with specific contexts (e.g., “sunlit café,” “futuristic lab”) and props (e.g., “holding a vintage microphone,” “surrounded by holograms”) that closely match input prompts.

A young woman with fair skin and long, flowing brown hair, adorned with soft makeup that emphasizes her eyes and lips, wears a bright blue fuzzy hat with pointed ears, complemented by a black-and-gray striped sweater and glossy pink nail polish. She smiles brightly while speaking into a small gray microphone held close to her mouth, conveying a lively and animated presence that captivates the audience. The background is softly blurred, featuring a warm and inviting indoor space, prominently displaying a giant, gnarled saguaro cactus that adds an intriguing element to the cozy atmosphere. The shot is a medium close-up at eye level, illuminated by natural daylight that highlights her features and enhances the cheerful, playful mood of the scene.

An African-American man with a clean-shaven head, smooth dark skin, and a strong jawline is dressed in a charcoal gray button-up shirt with the top button unfastened, conveying a professional yet approachable look. He engages with the camera, maintaining steady eye contact while speaking in a calm and authoritative manner, his serious expression reflecting confidence and focus. The background is softly blurred, showcasing a series of computer monitors filled with cascading green digits and data, creating a visually captivating and high-tech environment. The shot is a medium close-up at eye level, stationary, under bright fluorescent lighting that clearly illuminates his face and enhances the vivid colors of the digital rain behind him.

Mirage generates photorealistic, engaging performances with precisely synchronized audio and video — and we’re constantly adding new capabilities. Try it now at mirage.app and read the full paper here.