Google DeepMind Unveils Gemini Omni, Its Real-Time AI
The company's new omni-modal model processes text, images, audio, and live video at once, part of an intensifying race to build AI that perceives the world the way people do.
Google DeepMind on Wednesday introduced Gemini Omni, a new artificial intelligence model the company describes as its most capable yet at working across text, images, audio, and live video at the same time. The launch deepens an intensifying contest among the world's largest technology companies to build systems that can see, listen, and respond in something close to real time.
Where earlier models handled one kind of input at a time, Gemini Omni is designed to take in several streams together, the company says, holding a spoken conversation while watching a video feed or reading a document. Google framed the model as a step toward assistants that feel less like a search box and more like an attentive collaborator.
A race measured in milliseconds
Rivals have spent the past year working to shave the delay between a question and a reply, betting that responsiveness, not raw size, will decide which assistants people actually adopt. Gemini Omni's pitch leans on that same bet, with demonstrations showing the model reacting to a scene as it changes rather than after the fact.
Independent researchers cautioned that polished demonstrations rarely capture how a model behaves in messier, real-world conditions, and questions remain about cost, privacy, and the safeguards around always-on cameras and microphones. Google said Gemini Omni would roll out gradually, beginning with developers and select products before reaching a wider audience.
For now, the clearest signal is competitive. Each new release resets expectations for what an everyday assistant should be able to do, and Gemini Omni is Google's argument that the next one should be able to keep up with the world as it happens.