Google unveiled Gemini Omni, a multimodal AI model that understands images, audio, video and text and creates video.
TechCrunch reported on May 19 that Google introduced "Gemini Omni Flash" at its annual Google I/O developer conference and said the model has been applied first to the Gemini app, YouTube Shorts and the AI creation tool Flow.
Gemini Omni does not simply combine multiple inputs. It reasons across images, audio, video and text to produce consistent outputs. Google said this lets the model generate high-quality video that reflects an understanding of physics, culture, history and science.
Over the long term, Google plans to expand Gemini Omni so that it can create images from audio or generate audio from video.
This release of Gemini Omni is initially focused on video generation. Users can edit photos using natural-language commands without complex editing software. It also supports video creation using a user’s digital avatar.
To prevent deepfakes, creating an avatar requires a separate registration process. Once a user films themselves reading numbers aloud, the avatar is saved for later reuse. All videos made with Gemini Omni will include Google's digital watermark, SynthID.
Nicole Brichtova, head of product management at Google DeepMind, said Gemini Omni is not a simple update of the existing video generation model "Veo". She described it as next-generation technology that combines Gemini's intelligence with the rendering capabilities of media models. Koray Kavukcuoglu, DeepMind's chief technology officer, demonstrated that a simple prompt, "clay animation explaining protein folding," quickly produced a stop-motion-style video with voice narration.
The first model, Gemini Omni Flash, generates 10-second videos. Google explained that the limit does not reflect a constraint of the model but is meant to let more users try it first. It plans to add longer video generation soon.
Google appears to be positioning Gemini Omni Flash as a consumer tool first. It presented examples such as creating an awards-ceremony scene or a video of a trip to the moon, and removing passersby caught in the background of travel videos. It cautioned that if editing commands are not specific, the model may over-edit and change elements the user did not want altered.
Google plans to offer Gemini Omni as an API within days. It is also preparing a higher-end model, "Gemini Omni Pro," aimed at professional uses such as advertising and video production, but did not disclose a release date for it.