The power of video depends on the harmonious integration of what we hear with what we see. Creators and consumers of content connect with the precise details of audiovisual performances, especially when it comes to A-roll: the primary, narrative-advancing portions of real-world video that foreground people on screen and on the soundtrack.
We introduce Mirage, an audio-to-video foundation model designed for A-roll. Mirage generates expressive video of people from images, text, and audio, producing compelling performances with the quality and realism needed for content with real-world engagement. Try it now at mirage.app and read the full paper here.
Mirage excels at generating A-roll, matching emotional nuance and affect, precisely executing lip and gesture synchronization, and producing people who look the way they sound while delivering natural behaviors. Mirage coaxes precisely coordinated performances out of audio, images, and text, making it easier for you to tell your story with realistic A-roll video.
While other systems struggle to time visual dynamics with the fidelity needed for realistic human vocal delivery, Mirage handles demanding lip movements such as plosives with precision across variations in language, tone, volume, and visual appearance.
Mirage performances look around the way humans do, combining blinks, saccades, and gaze adjustments into lifelike visual behavior.
Mirage produces clarity and continuity in mouth motion, accurately timing lip closures and releases in sync with audio. Performances move fluently from slurred, slow speech to rapid, rhythmically demanding rapping.
Mirage captures the moments that happen without words, from laughter and tears to sneezing, coughing, and grunting.
Mirage works with text, images, and audio, alone or together. When generating A-roll from audio alone, Mirage matches the performer it generates to what can be heard.
The sound world of the audio input tells Mirage about the world it should generate, from intimate indoor settings to the great outdoors.
Mirage matches sight and sound, avoiding the uncanny valley effects found in many models that pair generated people with voices that don’t match.
But when text and audio are deliberately pushed in different directions, Mirage can develop compelling blends of the two.
Training for video generation lets Mirage connect what it knows about audio to how visual behavior unfolds. Mirage makes images move in time with what’s being said.
Mirage gives text control over all the details of how someone appears, from facial structure to makeup to clothing and lighting, while still ensuring they look how they sound when they perform.
Mirage generates environments with specific contexts (e.g., “sunlit café,” “futuristic lab”) and props (e.g., “holding a vintage microphone,” “surrounded by holograms”) that closely match input prompts.
Mirage generates photorealistic, engaging performances with precisely synchronized audio and video — and we’re constantly adding new capabilities. Try it now at mirage.app and read the full paper here.