The power of video depends on the harmonious integration of what we hear with what we see. Creators and consumers of content connect with the precise details of audiovisual performances, especially when it comes to A-roll: the primary, narrative-advancing portions of real-world video that foreground people both on screen and on the soundtrack.
We introduce Mirage, an audio-to-video foundation model designed for A-roll. Mirage generates expressive video of people from images, text, and audio, producing compelling performances with the quality and realism needed for content with real-world engagement. Try it now at mirage.app and read the full paper here.
A-Roll Video Generation
Mirage excels at generating A-roll: it matches emotional nuance and affect, precisely synchronizes lips and gestures with audio, and produces people who look the way they sound and behave naturally. Mirage coaxes precisely coordinated performances out of audio, images, and text, making it easier for you to tell your story with realistic A-roll video.
Capabilities
Plosive and Viseme Dynamics
While other systems struggle to time visual dynamics with the fidelity that realistic vocal delivery demands, Mirage renders fine-grained lip movements such as plosives accurately across variations in language, tone, volume, and visual appearance.

/t/ sound, ‘take’

/b/ sound, ‘back’

/p/ sound, ‘peter’

/d/ sound, ‘don’t’

/g/ sound, ‘go’
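
To make the link between plosives and visible mouth shapes concrete, here is a minimal sketch that groups the five plosives above by place of articulation and maps each group to a coarse viseme label. The groupings are standard phonetics; the label names and helper functions are hypothetical and chosen for illustration only, and nothing here describes how Mirage works internally.

```python
# Illustrative only: a coarse plosive-to-viseme lookup for the five sounds
# shown in the examples. The viseme label names are hypothetical and are
# not taken from Mirage.

# Place of articulation for each plosive (standard phonetics).
PLOSIVE_PLACE = {
    "p": "bilabial",   # both lips close, then release
    "b": "bilabial",
    "t": "alveolar",   # tongue tip against the ridge behind the upper teeth
    "d": "alveolar",
    "g": "velar",      # back of the tongue against the soft palate
}

# Hypothetical coarse viseme label for each place of articulation.
VISEME_BY_PLACE = {
    "bilabial": "lips_closed",
    "alveolar": "tongue_tip_raised",
    "velar": "tongue_back_raised",
}


def viseme_for(phoneme: str) -> str:
    """Return the coarse viseme label for a plosive written as 'p' or '/p/'."""
    key = phoneme.strip("/").lower()
    return VISEME_BY_PLACE[PLOSIVE_PLACE[key]]


def requires_lip_closure(phoneme: str) -> bool:
    """True when the plosive needs a full, visible lip closure."""
    return viseme_for(phoneme) == "lips_closed"


if __name__ == "__main__":
    for sound in ["/t/", "/b/", "/p/", "/d/", "/g/"]:
        print(sound, viseme_for(sound), requires_lip_closure(sound))
```

Only /p/ and /b/ need a full lip closure, which is why a mistimed frame is most noticeable on those two sounds; /t/, /d/, and /g/ are articulated inside the mouth and leave subtler visual cues.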
Eye Blinking and Gaze Behavior
Subjects in Mirage performances look around the way people do, coordinating blinks, saccades, and shifts in gaze to create lifelike results.

Mirage outputs feature natural blinking and eye movement

Emotional Nuance and Affect Matching
Mirage matches facial expression and affect to the emotion carried by the audio, with performances ranging from fear, shock, and sadness to neutral, flat delivery and happiness.

Mirage behavior when expressing fear

Mirage behavior when expressing shock

Mirage behavior when expressing sadness

Mirage behavior when expressing neutral/flat emotion

Mirage behavior when expressing happiness
Coarticulation and Speech Blending
Mirage produces clarity and continuity in mouth motion, accurately timing lip closures and releases in sync with audio. Performances move fluently from slurred, slow speech to rapid, rhythmically demanding rapping.

Mirage samples exhibit strong co-articulation across diverse speech patterns, such as rapping and tongue twisters.

Paralinguistic Generation
Mirage captures the moments that don’t happen in words, from laughter and tears to sneezing, coughing, and grunting.

Mirage behavior when audio features coughing

Mirage behavior when audio features sneezing

Mirage behavior when audio features laughing

Audio-Only Generation
Mirage works with text, images, and audio, alone or together. When generating A-roll from audio alone, Mirage matches who’s performing in the output to what can be heard.

Mirage output conditioned on a male-presenting voice, with no text prompt

The sound world in the audio input tells Mirage about the world it should generate, from intimate indoor settings to the great outdoors.

Mirage output conditioned on voices recorded in settings ranging from intimate indoor scenes to the great outdoors

Mirage matches sight and sound, avoiding the uncanny-valley effect common in models that pair generated people with voices that do not fit them.

Mirage output conditioned on a female-presenting voice, with no text prompt

Mismatched Text and Audio Results
When text and audio are deliberately pushed in different directions, however, Mirage can produce compelling blends of the two.

Mirage output when conditioned on a male-presenting voice and a female-presenting text prompt

Mirage output when conditioned on a female-presenting voice and a male-presenting text prompt

Gesture-semantic Alignment
Training for video generation lets Mirage connect what it knows about audio to what happens in visual behavior. Mirage makes images that move along with what’s being said.

Head nodding in response to agreement. Prompts contained no explicit motion references.

Head shaking indicating disagreement. No mention of body motion in prompts.


Purposeful hand movements in response to explanations. Again, no mention of body motion in prompts.

Subject Appearance Accuracy
Mirage gives text control over every detail of how someone appears, from facial structure and makeup to clothing and lighting, while still ensuring they look the way they sound when they perform.

A middle-aged woman with a long, dark, messy braid and gentle features wears a plain white blouse, embodying a minimalistic style. She looks directly at the camera with her lips slightly parted, suggesting a moment of reflection and engagement with the viewer. The background is softly blurred, showcasing a simple indoor setting with light-colored walls, a few art pieces, and a cozy armchair, contributing to a calm and minimalist vibe. The shot is a medium, stationary angle at eye level, illuminated by bright, even lighting that creates a clear and engaging visual of her presence.

A young woman with a long, dark, messy braid and striking holographic makeup, her lips slightly parted, embodying a minimal aesthetic that captivates the viewer. She interacts with the camera with a gentle smile, her eyes sparkling with enthusiasm as she gestures subtly, inviting the audience into her world. The background is softly blurred, showcasing a chic indoor setting with soft pastel tones, a few art pieces on the walls, and a cozy ambiance that complements her style. The shot is a medium, stationary angle at eye level, under bright, even lighting that highlights her features and creates an engaging visual experience.

A young woman with a long, dark, messy braid and braces has her lips slightly parted, dressed in a casual, minimalistic top that complements her laid-back style. She interacts with the camera with a gentle smile, her expression relaxed and approachable, occasionally using subtle hand gestures to emphasize her thoughts. The background is softly blurred, showcasing a minimalistic indoor setting with light-colored walls, a few decorative items on a shelf, and a soft rug, contributing to a cozy yet modern vibe. The shot is a medium, stationary frame at eye level, illuminated by bright, even lighting that enhances the clarity of her features while maintaining a polished and inviting composition.

A young woman with a long, dark, messy braid and prominent high cheekbones wears a casual greyish-blue hoodie, her lips slightly parted, suggesting a thoughtful demeanor. She engages with the camera in a serene manner, her expression soft and contemplative, occasionally glancing away as if lost in thought. The background is softly blurred, showcasing a minimalistic indoor setting with light-colored walls, a few decorative items, and a warm, inviting ambiance. The shot is a medium, stationary frame at eye level, illuminated by bright, even lighting that enhances the clarity of her features while maintaining a clean, aesthetic composition.

A young adult woman with a long, dark, messy braid and wearing a greyish-blue hoodie, complemented by stylish glasses that frame her face. She engages thoughtfully with the camera, her lips slightly parted as if mid-sentence, conveying a sense of contemplation and openness. The background is softly blurred, featuring minimalistic indoor decor with soft neutral tones, a few potted plants, and a simple wall art piece that enhances the serene atmosphere. The shot is a medium, stationary frame at eye level, under bright, even lighting that creates a clear and engaging visual of the subject while maintaining a polished aesthetic.
Background and Prop Fidelity
Mirage generates environments with specific contexts (e.g., “sunlit café,” “futuristic lab”) and props (e.g., “holding a vintage microphone,” “surrounded by holograms”), to closely match input prompts.

A young woman with fair skin and long, flowing brown hair, adorned with soft makeup that emphasizes her eyes and lips, wears a bright blue fuzzy hat with pointed ears, complemented by a black-and-gray striped sweater and glossy pink nail polish. She smiles brightly while speaking into a small gray microphone held close to her mouth, conveying a lively and animated presence that captivates the audience. The background is softly blurred, featuring a warm and inviting indoor space, prominently displaying a giant, gnarled saguaro cactus that adds an intriguing element to the cozy atmosphere. The shot is a medium close-up at eye level, illuminated by natural daylight that highlights her features and enhances the cheerful, playful mood of the scene.

An African-American man with a clean-shaven head, smooth dark skin, and a strong jawline is dressed in a charcoal gray button-up shirt with the top button unfastened, conveying a professional yet approachable look. He engages with the camera, maintaining steady eye contact while speaking in a calm and authoritative manner, his serious expression reflecting confidence and focus. The background is softly blurred, showcasing a series of computer monitors filled with cascading green digits and data, creating a visually captivating and high-tech environment. The shot is a medium close-up at eye level, stationary, under bright fluorescent lighting that clearly illuminates his face and enhances the vivid colors of the digital rain behind him.

A young woman with fair skin and long, straight brown hair, featuring soft makeup that accentuates her eyes and lips, wears a bright blue fuzzy hat with pointed ears, along with a black-and-gray striped sweater and glossy pink nail polish. She engages with the camera, smiling mid-sentence while holding a small gray microphone close to her mouth, radiating a vibrant and expressive energy. The background is softly blurred, revealing a cozy indoor setting with gentle lighting, where a large, fuzzy spider web creates a unique and playful focal point. The shot is a medium close-up at eye level, held stationary, with bright, even lighting that highlights her features and contributes to the cheerful, casual ambiance.

A young man with short, dark brown hair and round green-framed glasses, dressed in a stylish black shirt that highlights his engaged demeanor. He maintains eye contact with the camera, his lips slightly pursed as he speaks, prominently displaying a vibrant polka-dot cycling jersey in front of him. The background is softly blurred, featuring a cozy indoor scene with a lamp and a plant, adding to the inviting ambiance. The shot is a close-up at eye level with stationary framing, illuminated by warm, diffused lighting that enhances his features and creates a polished, intimate atmosphere.
Mirage generates photorealistic, engaging performances with precisely synchronized audio and video — and we’re constantly adding new capabilities. Try it now at mirage.app and read the full paper here.