What Is Gemini Omni? Google's New AI Video Model, Explained

About two hours ago at Google I/O 2026, Demis Hassabis walked on stage and announced Gemini Omni, a single model that takes any combination of text, images, audio, and video as input, generates high-fidelity video, and lets you keep editing it just by talking to it. I’ve spent the time since pushing it through real work inside the Gemini app and Google Flow. First impressions are further down, but first, the context.

Three weeks ago, OpenAI shut down the Sora consumer app. Two days ago, the AI video category had no clear leader. As of today, May 19, 2026, it does.

This isn’t Veo 4. It’s a different kind of model. Here’s what it actually does, what Google is officially promising on the DeepMind product page, what I’m finding after hands-on testing, and what every creator using AI video needs to understand before the noise gets louder.

What Gemini Omni Actually Is

Gemini Omni is what DeepMind calls a “world model.” That’s not marketing. It’s a technical claim about how the model works under the hood.

Traditional text-to-video models (Veo 3.1, the discontinued Sora, Runway Gen-4) generate a clip by predicting what pixels should appear in the next frame. Pattern-matching at massive scale. The output can look incredible, but the model doesn’t really understand gravity, momentum, or what a character was doing two seconds ago. That’s why characters morph between cuts and the physics breaks on close inspection.

Gemini Omni works differently. It fuses the reasoning capabilities of the main Gemini model with the generative capabilities of Veo, Nano Banana (image editing), and Project Genie (interactive world simulation) into one architecture. Google’s framing of the result: a model that can create anything from anything, starting with video.

The first shipping variant is Gemini Omni Flash. It launched today across the Gemini app, Google Flow, Google Flow Music, and (the part most people are sleeping on) free inside YouTube Shorts and the YouTube Create app. API access for developers rolls out “in the coming weeks.”

The Three Promises Google Makes on the DeepMind Page

The official page organises Gemini Omni around three core capabilities. Each one matters for a different reason.

Promise 1: Edit Any Video Through Natural, Step-by-Step Conversation

This is the headline feature. DeepMind’s positioning is direct: think of Gemini Omni as Nano Banana, but for video.

Every instruction builds on the last. The scene remembers what came before. Characters stay consistent. The physics holds up across edits. You’re not regenerating from scratch each time. You’re refining one coherent scene through multi-turn dialogue.

Here’s their demo of a violinist whose entire environment and camera angle get rewritten across multiple turns, while the performer stays consistent:

Prompt: Change the camera angle to be over the violinist’s shoulder. Source: Google DeepMind

And the “transformation through a trigger” demo, where touching a mirror morphs the subject. Same shot, same lighting, completely different physical form:

Prompt: When the person touches the mirror, they transform into a felted stuffed puppet version with large googly eyes and glasses. Source: Google DeepMind

If you’ve fought with timeline editors, masking tools, or single-shot prompt regeneration, this is the biggest workflow change in AI video since the category began.

Promise 2: Apply Real-World Knowledge (Physics + Science + Culture)

This is where the “world model” framing earns its keep. Google’s page promises that Omni combines an intuitive grasp of physics with Gemini’s knowledge of history, science, and cultural context, bridging the gap from photorealism to meaningful storytelling.

Translation: the model knows how gravity, kinetic energy, and fluid dynamics work well enough to simulate them. And it knows enough about the world to explain real things visually.

Here’s their physics demo: a marble running a continuous chain-reaction track in a single smooth shot. Watch how the marble’s velocity and the track’s transfer of momentum stay physically consistent:

Prompt: A marble rolling fast on a chain-reaction style track, continuous smooth shot. Source: Google DeepMind

And the one that’s been circulating from the keynote: a claymation explainer of protein folding, generated from a single prompt. The clay aesthetic is stylised, but the molecular biology being depicted is scientifically accurate:

Prompt: Claymation explainer of protein folding, everything made out of clay, no hands, stop motion, accurate. Source: Google DeepMind

Previous text-to-video models could fake the look of clay. None of them, until now, could get the biology right at the same time. That’s the world-model claim made concrete.

Promise 3: Reference Anything to Combine Inputs Into a Single Output

This is the “any-to-any” promise. You can feed Gemini Omni a mix of inputs and ask it to weave them into one cohesive clip:

An image of a character + a video of a motion you want them to perform + an audio track for pacing → a new video where the character performs that motion in sync with the music.
A rough sketch + a reference photo for style → a polished cinematic clip that uses the sketch only as a movement guide.
Your own voice (via the new Avatars feature) + a script → a video of a digital version of yourself delivering it.
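If you want to keep input mixes like the ones above organised before you upload anything, a small “reference pack” structure helps. The sketch below is purely illustrative: Omni has no public API yet, so `ReferencePack`, `Reference`, the role names, and `describe()` are hypothetical names of my own, and nothing here calls a real Google endpoint. It only shows the idea of tagging each file with the job it does in the final clip.

```python
from dataclasses import dataclass, field

# Hypothetical role names, chosen for illustration. None of this comes from Google.
ROLES = {"character", "motion", "audio", "style", "sketch"}


@dataclass
class Reference:
    path: str  # local file you plan to upload
    role: str  # what the file should contribute to the final clip

    def __post_init__(self):
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role}")


@dataclass
class ReferencePack:
    instruction: str
    references: list[Reference] = field(default_factory=list)

    def add(self, path: str, role: str) -> "ReferencePack":
        self.references.append(Reference(path, role))
        return self

    def describe(self) -> str:
        """Render the pack as plain text you could paste next to the uploads."""
        lines = [self.instruction]
        for ref in self.references:
            lines.append(f"- use {ref.path} as the {ref.role} reference")
        return "\n".join(lines)


pack = (
    ReferencePack("The character performs the dance in time with the music.")
    .add("hero.png", "character")
    .add("dance.mp4", "motion")
    .add("track.wav", "audio")
)
print(pack.describe())
```

The file names are placeholders. The point is that every input gets an explicit role, so the instruction never has to guess which file is the character and which is the motion guide.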

Here’s their demo combining a real video of fern leaves with harp audio and a stylistic prompt. The result is bioluminescent plant life that reacts to touch, with harp sounds synchronised to the motion:

Prompt: Add harp sounds synchronised to when I touch each fern leaf. Change the leaves to semi-translucent bioluminescent plant life with fireflies reacting in sync to the sound. Source: Google DeepMind

For creators, this collapses what used to be a multi-app, multi-day pipeline into a single conversation.

My First Impressions After a Few Hours of Testing

I’ve had Omni Flash open in the Gemini app and inside Google Flow since the keynote ended. Quick read on what’s standing out so far:

The speed is the headline. Generations come back in a fraction of the time I’m used to with Veo 3.1 or Runway. For me that’s the biggest practical unlock. Fast iteration is how you actually find the shot you wanted.

Multi-shot scenes hold up. Characters stay consistent across cuts. Backgrounds don’t drift. The world-model framing isn’t just marketing copy. It shows up in the output.

Native audio is clean and properly synced. Dialogue, ambient sound, and music get generated and locked to the visual in a single pass. No after-the-fact alignment work.

4K upscaling adds real detail. It’s closer to a redraw than a stretch. Not the smeared mess you get from older upscalers.

The Flow Agent is the workflow change. You sit in one conversation, tweak shots, rewrite prompts, batch variations, and refine the whole piece without leaving the chat. After a few hours with it, going back to manual timeline editing feels prehistoric.

I’ll publish a longer hands-on once I’ve stress-tested the rough edges, but the early signal is clear: this is the first model where conversational editing actually keeps up with how fast you think.

What’s New Inside Google Flow (The Details That Matter)

When you open Flow and try to access Omni, a “What’s New” panel walks you through what just shipped. Here are the practical points every creator should know before they start:

Gemini Omni Flash in Flow is paid-only. You need a Google AI Plus, Pro, or Ultra subscription to use it inside Flow. Features vary by tier and region.

Clip length is capped at 10 seconds for the Flash variant. That’s the current ceiling. Long enough for a strong single shot, short enough that iteration cycles stay tight. Expect this to grow with future Omni models.
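With a 10-second cap, a longer piece has to be planned as several shots. Here is a minimal sketch of that arithmetic. The function name `split_into_shots` is my own, and it assumes you assemble the finished clips yourself afterwards, which the Flow panel does not spell out.

```python
from math import ceil

CLIP_CAP_SECONDS = 10  # cap for the Flash variant, per the Flow panel described above


def split_into_shots(total_seconds: float, cap: float = CLIP_CAP_SECONDS) -> list[float]:
    """Split a target runtime into near-equal shot lengths, none above the cap."""
    if total_seconds <= 0 or cap <= 0:
        raise ValueError("total_seconds and cap must be positive")
    count = ceil(total_seconds / cap)
    return [total_seconds / count] * count


print(split_into_shots(45))  # five shots of 9.0 seconds each
```

Equal-length shots are just a starting point. In practice you would adjust individual shot lengths to land on natural cut points, as long as each stays under the cap.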

Characters: build a recurring cast with @character_name. This is the feature that’s going to change long-form workflows. You design a character once (from text prompts or reference images, with a custom voice attached), then summon them into any future scene by typing @character_name in your prompt. Visual and vocal consistency is preserved across every shot. If you’ve been hacking together character consistency with seeds, LoRAs, or reference image stacks, this collapses all of that into a native primitive.

Avatars: drop your own face and voice into scenes with @me. Set this up in Account Settings → Create Avatar. You record a short selfie video plus a few spoken words, and after that, typing @me in any prompt casts you into the scene. One important caveat: Avatars are not available in the EEA, the UK, or Switzerland. If you’re in Europe, this feature is off the table for now. Likely a GDPR / AI Act compliance call.
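Once you have more than a handful of characters, it is easy to type a handle you never created. The sketch below checks a prompt locally before you send it. It is not part of Flow: `find_mentions` and `undefined_mentions` are hypothetical helpers, and the pattern assumes handles are made of letters, digits, and underscores, which Google has not documented here.

```python
import re

# Assumption: handles are letters, digits, and underscores. Flow's real rules may differ.
MENTION = re.compile(r"@([A-Za-z0-9_]+)")


def find_mentions(prompt: str) -> list[str]:
    """Return the @handles in a prompt, in order, without duplicates."""
    seen = []
    for name in MENTION.findall(prompt):
        if name not in seen:
            seen.append(name)
    return seen


def undefined_mentions(prompt: str, cast: set[str], avatar_ready: bool) -> list[str]:
    """Handles used in the prompt that you have not set up yet."""
    known = set(cast)
    if avatar_ready:
        known.add("me")  # @me is the Avatars handle described above
    return [name for name in find_mentions(prompt) if name not in known]


cast = {"captain_rhea", "old_tom"}  # made-up character names
prompt = "@captain_rhea hands the map to @old_tom while @me watches from the dock."
print(undefined_mentions(prompt, cast, avatar_ready=False))  # → ['me']
```

One known limitation of this simple pattern: it also matches the part after the @ in an email address, so keep those out of your prompts or tighten the regex.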

Flow Agent is web-only at launch. The conversational creative partner inside the prompt box doesn’t ship to the mobile apps yet. If you want the full agent experience (brainstorming, batch edits, asset organisation, style enforcement), stay on desktop.

Flow Tools lets you build your own workflows in natural language. Custom image editors, video resizers, post-processing effects: describe what you want, Flow builds it. You can share tools you’ve made, and remix ones other creators have shared. Paid plans only, web only.

The pattern across all of these: Google is treating Flow as a creative OS, not just a video generator. The @mentions for characters and self, the agent inside the prompt box, and the user-buildable tools all follow the same composability idea Cursor and Claude Code brought to coding, applied to filmmaking.

Where You Can Actually Use Gemini Omni Today

Google distributed Gemini Omni Flash across three groups of surfaces simultaneously, which tells you how much weight they’re putting behind it:

Gemini app (web, Android, iOS): included with Google AI Plus, Pro, and Ultra subscriptions.
Google Flow and Google Flow Music: Google’s professional AI creative studios, also gated to AI subscribers.
YouTube Shorts and YouTube Create app: rolling out free this week to eligible users. This is the first time a frontier-grade generative video model has been distributed inside a major social platform’s native creator tools at no cost.

The subscription tiers were also restructured at I/O. The standard plan is now $20/month (Google AI Plus/Pro), AI Ultra is $100/month (developer tier, 5× higher compute limits), and AI Ultra Enterprise is $200/month (20× higher limits, plus Project Genie access). Google also replaced daily prompt caps with a “compute-used” model. Video generations consume far more of your quota than text prompts, which is something every creator planning a serious workflow needs to understand before they hit a ceiling mid-project.
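To see why the compute-used model matters for planning, it helps to run the numbers before you start a project. The sketch below does that with placeholder values only: Google has not published per-generation costs or quota sizes in anything quoted here, so `QUOTA`, `VIDEO_COST`, and the “units” themselves are made up for illustration.

```python
from math import floor

# All numbers below are made-up placeholders. Swap in real figures from your
# own account once Google exposes them.


def project_cost(shots: int, takes_per_shot: int, cost_per_generation: float) -> float:
    """Total units a project needs if every shot takes several attempts."""
    return shots * takes_per_shot * cost_per_generation


def generations_left(quota_units: float, used_units: float, cost_per_generation: float) -> int:
    """How many more generations fit in the remaining quota."""
    if cost_per_generation <= 0:
        raise ValueError("cost_per_generation must be positive")
    remaining = max(quota_units - used_units, 0)
    return floor(remaining / cost_per_generation)


QUOTA = 1000.0     # hypothetical monthly allowance, in abstract units
VIDEO_COST = 25.0  # hypothetical cost of one video generation

needed = project_cost(shots=12, takes_per_shot=4, cost_per_generation=VIDEO_COST)
print(needed, needed <= QUOTA)  # 1200.0 False
print(generations_left(QUOTA, used_units=400.0, cost_per_generation=VIDEO_COST))  # 24
```

With these invented figures, a 12-shot project at four takes per shot overshoots the allowance. The takes-per-shot multiplier is the part people forget, and it is exactly what fast iteration encourages.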

What Google Is Deliberately Holding Back

One detail that’s getting under-reported: Google is not yet shipping the ability to edit existing audio or speech in a video. You can generate avatars that speak in your own voice, but you cannot upload a video and have Omni rewrite what someone is saying.

DeepMind’s official line is that they’re still working to test this and figure out how to bring the capability to users responsibly. Translation: the deepfake-audio risk is too high to ship without further guardrails. This matters because it’s the single biggest gap between Omni’s marketing pitch and a fully end-to-end conversational video editor. For now, audio is a one-way street. Generate from scratch, don’t rewrite.

For trust, every Omni-generated video carries an invisible SynthID watermark plus C2PA Content Credentials, verifiable through the Gemini app today, with Chrome and Google Search detection coming. YouTube is also expanding its likeness detection tool to all creators 18+, so you can scan the platform for AI-generated copies of your face or voice and request takedowns.

What This Means For You As an AI Video Creator

If you’ve built your workflow around Veo 3.1, Runway, or the now-defunct Sora, the playing field shifted today. Three things to internalise right now:

Conversational editing is the new default. Single-shot prompting will feel as dated as command-line image editors did once Photoshop arrived. The skill that matters now is prompt sequencing: planning a video as a series of refinements rather than as one big generation.
Reference-driven generation beats pure text prompting. Omni’s strongest outputs combine multiple input types. Creators who learn to assemble reference packs (a character image + a motion clip + an audio cue) will produce dramatically better results than those typing paragraphs of description.
Distribution is part of the product. With Omni inside YouTube Shorts at no cost, the bar for “should I post this?” just dropped to zero for billions of users. Expect a wave of AI-edited Shorts within weeks. Originality and craft are about to matter more, not less.
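Prompt sequencing is easier to practise if you write the sequence down before you open the chat. Here is a minimal sketch of one way to do that. `Turn` and `render_plan` are hypothetical names, and apart from the camera-angle instruction borrowed from DeepMind’s violinist demo, the prompts are invented examples. Nothing here talks to Omni; it just produces a checklist to work through turn by turn.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    instruction: str        # what you type in this turn
    keeps: tuple[str, ...]  # things that must not change, to check by eye afterwards


def render_plan(base_prompt: str, turns: list[Turn]) -> str:
    """Format a base generation plus ordered refinements as a checklist."""
    lines = [f"0. generate: {base_prompt}"]
    for number, turn in enumerate(turns, start=1):
        keeps = ", ".join(turn.keeps) or "nothing specific"
        lines.append(f"{number}. edit: {turn.instruction} (keep: {keeps})")
    return "\n".join(lines)


plan = render_plan(
    "A violinist playing alone on a small stage.",
    [
        Turn(
            "Change the camera angle to be over the violinist's shoulder.",
            ("performer", "lighting"),
        ),
        Turn(
            "Move the scene to a rainy rooftop at night.",
            ("performer", "camera angle"),
        ),
    ],
)
print(plan)
```

Writing down what each turn must keep is the useful habit here. It gives you something concrete to check after every edit instead of discovering drift three turns later.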

Coming Up in This Series

This is the first post in a deep-dive series. Over the coming days we’re publishing:

A complete Gemini Omni prompt playbook with 15 working examples you can steal.
Gemini Omni vs Seedance vs Kling: a side-by-side comparison for serious creators.
A breakdown of Google’s new AI subscription tiers and the compute-used model.
Why OpenAI killed Sora, and what its absence means for the next 12 months of AI video.

If you want each one delivered the moment it goes live, join the AI Video Bootcamp community. Over 20,000 creators are already inside.

Official sources: