We Build Synthetic Humans. Here’s What’s Keeping Us Up at Night
The state of deepfakes from an insider POV.

I. Deepfakes are everywhere
We build AI models that generate lifelike people on screen.
Our north star is ultimately realism—and our models have achieved a level of proficiency that makes it possible for subscribers to instantly create videos that look polished and human.
We’ve seen our products help people tell stories, turn ideas into businesses, and communicate better (even fall in love): people who, without this technology, wouldn’t have had that chance.
That said, we’re acutely aware that we’re building in a sector where the same technology that’s so dramatically accelerating human capabilities is also being used to manipulate the truth around current events, all while eroding trust in the very idea of video as “evidence” or “proof.”
This isn’t a product update or a research paper. It’s a sort of reflection, from a team inside the AI video industry, on how deepfakes are evolving and how we’re thinking about developing our models in this climate.
II. Deepfakes have dramatically evolved — but the most advanced models aren’t the only ones doing damage
Internally, we classify AI video models used specifically to generate people into four types. Each successive type generates a larger share of the video, and each has different implications and different giveaways for determining the technology powering it.
While reviewing the list, one of the more striking patterns we noticed was that even the earliest deepfakes were already quite hard to detect—and they continue to be among the most damaging forms of video content today.
1st Generation: Face Overlay on Real Footage
At this stage, one would need to record base footage, typically using a body double to mimic the movements and voice of the person being impersonated, and then overlay a generated face so that the impersonator’s expressions and lip movements are mirrored onto the generated face.
With this model, you can’t change what’s said in the video, but the result appears shockingly realistic because most of the video actually is real.
This kind of deepfake is particularly difficult to detect. The only tell is whether the impersonated person’s body (posture, gestures, or proportions) is inconsistent with their real-life appearance. The Tom Cruise impersonator, for example, nailed it. But the Leonardo DiCaprio impersonator’s proportions are off.


2nd Generation: Lip Sync on Real Footage
This was the first time you could make a real person say something they never said.
Using GAN-based lip-sync and facial reenactment techniques, one could map any audio track onto the face of a real or generated person. This means an existing video of a real person (or an AI-generated video of one) can be changed to speak convincingly in sync with entirely new dialogue. The process is faster and cheaper than any previous method, which has driven its widespread adoption. As of now, an estimated 65–75 providers offer lip-sync tools, many with minimal or no content moderation.
Below, you can see the dangers of “2nd-generation” models. In one case, the technology was used to spread misinformation during the most recent U.S. election; in another, it was applied to a generated base video to spread misinformation in Iran.


The primary “tell” that these videos are synthetic is that the body language and expression don’t align with what’s being said, since only the lips are being generated. Because these models can’t invent new body language, pay attention to unnatural hand or head movements: all physical gestures are assembled from pre-existing footage. The caveat is that sometimes, by luck, the body language matches up with the audio, making the video harder to identify as a deepfake. The best way to prove one of these deepfakes was generated is to find the original footage, which is often available somewhere on the internet.
For instance, there have been allegations that Captions’ and other companies’ earlier AI models have been used to promote products via misinformation. In response, we stepped up our trust and safety efforts and are always seeking to improve our work in this space. We know that these models can be used by bad actors online, and we’re always considering safety when designing and updating our products.

3rd Generation: Full-Frame Generation, Short Form
Diffusion-based models marked the first time a model could generate people from nothing (face, body, voice, background, even camera motion) without relying on existing footage. In other words, you could generate people and environments from scratch. It’s worth noting that these generated videos can also feature a real person’s likeness.
Below, you can see this model-type used to create a fake interview related to current events.

A subtle sign that these are “3rd-generation” deepfakes? The video cuts off right before the 8-second mark, the limit for the model provider, Veo 3. With these models, you can’t generate clips longer than a few seconds (usually 4, 6, 8, or 12), so clips tend to run about that length or are stitched together to simulate longer moments.
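To make that duration heuristic concrete, here’s a minimal sketch in Python. It assumes ffprobe (part of FFmpeg) is installed; the clip-length caps and tolerance are illustrative values based on the limits mentioned above, not figures from any provider’s documentation, and the file name is a placeholder.

```python
import subprocess

# Typical per-clip caps (in seconds) for current full-frame generators,
# mirroring the "usually 4, 6, 8, or 12 seconds" limits described above.
COMMON_CLIP_LIMITS = [4.0, 6.0, 8.0, 12.0]
TOLERANCE = 0.5  # how close to a cap (from below) we treat as suspicious

def clip_duration(path: str) -> float:
    """Read a video's duration using ffprobe (bundled with FFmpeg)."""
    out = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

def near_generation_limit(path: str) -> bool:
    """True if the clip ends just shy of a common single-generation cap."""
    duration = clip_duration(path)
    return any(cap - TOLERANCE <= duration <= cap for cap in COMMON_CLIP_LIMITS)

if __name__ == "__main__":
    print(near_generation_limit("suspect_clip.mp4"))  # hypothetical file name
```

A near-limit duration is circumstantial at best: plenty of real clips happen to run eight seconds, and stitched-together videos won’t trip this check at all, so treat it as one signal among many.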
It’s worth noting that these models today can’t maintain perfect consistency across clips. Partial consistency can, however, be achieved with jump cuts (evident in this video we created to demonstrate an earlier version of Mirage).
Because 3rd-generation models were, for the first time, creating people and environments entirely from scratch, they’re still developing in terms of realism. Pay attention to the subject’s skin texture: 3rd-generation output often takes on an AI sheen or plastic-like appearance. Body movements can also be a bit jittery.
The most dangerous aspect of this model-type is that there’s no original footage to validate whether a video has been manipulated.
4th Generation: Long-Form, Multi-Person, Transformative
No model is fully “4th-generation” yet, but most “3rd-generation” models are coming close and already show characteristics of this category. Very soon, most models will allow you to generate a person (real or synthetic) in any situation, without a duration constraint (longer than 12 seconds) and featuring multiple people in one shot. TL;DR: these models will be able to produce extended, complex videos with more realism than any synthetic media that’s come before.
One of the most distinctive features of this model-type is the ability to take one character and put them in an imagined situation (clothing, environment, objects, etc.). We’ve generated Jon Stewart reviewing Dune 2, in military uniform, to demonstrate what this entails.

You can see viral “fake” street interviews created with Veo 3, featuring two people in a single shot. To be clear, multiple people in one shot is unique to this model-type.

Lastly, we’ve taken the audio from the deepfake shared earlier (reported by The New York Times) and re-created it to demonstrate how rapidly realism is improving.

Identifying 4th-generation deepfakes is only going to get harder. That said, while AI models can now place anyone in entirely imagined scenarios, they still struggle with audio realism. The voice often doesn’t completely align with the visual context—feeling inconsistent, flat, or just slightly off—especially when generating imagined characters in specific environments.
Beyond audio, subtle visual “tells” remain: watch for background inconsistencies, like mismatched lighting or inconsistent faces in a crowd, or small anatomical errors—hands, for instance, are often still imperfect.
III. So where does this leave us?
We’ve thoughtfully implemented a moderation system that limits how our technology can be used: restricting impersonation, requiring likeness consent, and actively moderating abuse. But product design isn’t a catch-all.
We’ve explored technical solutions like provenance metadata and watermarking integration. While promising, most of these systems are still limited in scope, fragile under compression, and easily stripped. Initiatives like SynthID aren’t really solving the problem; they risk creating a false sense of security. Public awareness of these systems just isn’t where it needs to be for them to actually be useful.
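For a sense of what a provenance check amounts to in practice, here’s a minimal sketch. It assumes the open-source c2patool CLI (from the Content Authenticity Initiative) is installed and on the PATH; the file name is a placeholder. The limitation is visible in the code itself: if the manifest was stripped, or never attached in the first place, the check simply comes back empty, and an empty result proves nothing about whether the video is real or synthetic.

```python
import subprocess

def read_provenance(path: str) -> str | None:
    """Attempt to read a C2PA provenance manifest from a media file.

    Returns the manifest report as text if one is present, or None if not,
    which is what you'll typically see for re-encoded or stripped files.
    """
    result = subprocess.run(
        ["c2patool", path],  # prints the manifest store if the file carries one
        capture_output=True, text=True,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return None  # no manifest: could be synthetic, edited, or simply never tagged
    return result.stdout

manifest = read_provenance("suspect_clip.mp4")  # hypothetical file name
print(manifest or "No provenance found; that tells you nothing either way.")
```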
People just aren’t looking for fake videos. In fact, most aren’t aware that fake videos even exist. As AI video grows, the biggest challenge is cultural.
We need a new kind of media literacy, one where we treat video with the same skepticism we already apply to headlines. It’ll be a transition, but we strongly believe the doors this technology opens will outweigh these potential pitfalls.
Today, more than ever, your source matters. We’re not sure how to promote awareness other than by talking about it as much as we can.
IV. Eyes open
We believe in the future of AI video. We see the good it can do: making storytelling more accessible, understanding more vivid, and production more equitable. But we also know that “anyone can say anything” is a real risk, not just a philosophical one.
We don’t take lightly the fact that AI video can be used to shape belief. But with the right structure, the right culture, and the right boundaries, we believe synthetic video can serve something deeper: a broader, more inclusive kind of communication—one that amplifies voices, not noise.