Seeing Voices
Generating A-roll Video from Audio with Mirage

The power of video depends on the harmonious integration of what we hear with what we see. Creators and consumers of content connect with the precise details of audiovisual performances, especially when it comes to A-roll: the primary, narrative-advancing portions of real-world video that foreground people on screen and on the soundtrack.
We introduce Mirage, an audio-to-video foundation model designed for A-roll. Mirage generates expressive video of people from images, text, and audio, producing compelling performances with the quality and realism needed for content with real-world engagement.
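To make the input modalities concrete, here is a minimal sketch of what a generation request combining audio, an image, and text might look like. The endpoint URL, field names, and parameters are illustrative assumptions for this sketch, not Mirage's actual API.

```python
# Hypothetical sketch only: the endpoint, field names, and parameters below
# are assumptions for illustration, not Mirage's documented API.
import requests

PROMPT = "a presenter at a desk, warm studio lighting"

# Audio drives the performance; the image anchors the subject's appearance;
# text steers details of the scene.
with open("narration.wav", "rb") as audio, open("presenter.png", "rb") as image:
    response = requests.post(
        "https://api.example.com/v1/generate",  # placeholder URL
        files={"audio": audio, "image": image},
        data={"prompt": PROMPT},
        timeout=600,
    )

response.raise_for_status()
with open("a_roll.mp4", "wb") as f:
    f.write(response.content)  # save the generated A-roll clip
```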
A-Roll Video Generation
Mirage excels at generating A-roll: matching emotional nuance and affect, precisely executing lip and gesture synchronization, and producing people who look the way they sound as they deliver natural behaviors. Mirage coaxes precisely coordinated performances out of audio, images, and text, making it easier for you to tell your story with realistic A-roll video.
Capabilities
Plosive and Viseme Dynamics
While other systems struggle to time visual dynamics with the fidelity needed for realistic human vocal delivery, Mirage renders demanding lip movements such as plosives with precision, across variations in language, tone, volume, and visual appearance.
Eye Blinking and Gaze Behavior
Mirage’s performances look around the way humans do, combining blinks, saccades, and gaze adjustments into lifelike visual behavior.
Emotional Nuance and Affect Matching
Mirage matches the emotional nuance and affect in the voice, so facial expressions and delivery reflect the feeling behind the words.
Coarticulation and Speech Blending
Mirage produces clarity and continuity in mouth motion, accurately timing lip closures and releases in sync with audio. Performances move fluently from slurred, slow speech to rapid, rhythmically demanding rapping.
Paralinguistic Generation
Mirage captures the moments that don’t happen in words, from laughter and tears to sneezing, coughing, and grunting.
Audio-Only Generation
Mirage works with text, images, and audio, alone or together. When generating A-roll from audio alone, Mirage matches the performer in the output to what can be heard.
The sound world in the input audio tells Mirage about the world it should generate, from intimate indoor settings to the great outdoors.
Mirage matches sight and sound, avoiding the uncanny-valley effects found in many models that pair generated people with voices that don’t match.
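As a rough illustration of this mode, the sketch below drops the image and text inputs and sends audio alone. As in the earlier sketch, the endpoint and field names are hypothetical assumptions, not Mirage's documented API.

```python
# Hypothetical audio-only request; endpoint and fields are illustrative
# assumptions, not Mirage's documented API. With no image or prompt,
# the model infers performer and setting from the sound alone.
import requests

with open("street_interview.wav", "rb") as audio:
    response = requests.post(
        "https://api.example.com/v1/generate",  # placeholder URL
        files={"audio": audio},                 # audio is the only input
        timeout=600,
    )

response.raise_for_status()
with open("audio_only_a_roll.mp4", "wb") as f:
    f.write(response.content)
```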
Mismatched Text and Audio Results
When text and audio are deliberately pushed in different directions, Mirage can develop compelling blends of the two.
Gesture-Semantic Alignment
Training for video generation lets Mirage connect what it knows about audio to what happens in visual behavior, generating motion that moves along with what’s being said.
Subject Appearance Accuracy
Mirage gives text control over the details of how someone appears, from facial structure to makeup, clothing, and lighting, while still ensuring they look how they sound when they perform.
Background and Prop Fidelity
Mirage generates environments with specific contexts (e.g., “sunlit café,” “futuristic lab”) and props (e.g., “holding a vintage microphone,” “surrounded by holograms”) that closely match input prompts.
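One way to keep context and prop details explicit is to assemble the prompt from labeled parts before submitting it. The helper below is a hypothetical convenience for illustration, not part of any documented Mirage interface.

```python
# Hypothetical prompt-assembly helper: a convenience for keeping setting and
# prop details explicit, not part of any documented Mirage interface.
def build_prompt(subject: str, setting: str, props: list[str]) -> str:
    """Join subject, setting, and prop phrases into a single text prompt."""
    details = ", ".join(props)
    return f"{subject} in a {setting}, {details}"

prompt = build_prompt(
    subject="a scientist",
    setting="futuristic lab",
    props=["holding a vintage microphone", "surrounded by holograms"],
)
print(prompt)
# a scientist in a futuristic lab, holding a vintage microphone, surrounded by holograms
```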