Google DeepMind Reveals Gemini Robotics AI Models Capable of Real-World Robot Control

Google DeepMind on Thursday introduced two new artificial intelligence (AI) models that can control robots to carry out a variety of tasks in real-world settings. Known as Gemini Robotics and Gemini Robotics-ER (embodied reasoning), these advanced vision-language models are capable of taking physical actions and exhibiting spatial intelligence. The Mountain View-based tech giant also revealed that it is collaborating with Apptronik to build humanoid robots powered by Gemini 2.0. The company is continuing to test these models to evaluate them further and understand how to improve them.
Gemini Robotics AI Models Detailed
DeepMind described the new robotics AI models in a blog post. According to Carolina Parada, Senior Director and Head of Robotics at Google DeepMind, for AI to be useful in the physical world it must demonstrate "embodied" reasoning, the ability to comprehend and react to the physical environment and to act within it to complete tasks.
Gemini Robotics, the first of the two AI models, is built on Gemini 2.0. It is an advanced vision-language-action (VLA) model that adds a new output modality, physical actions, which allows it to directly control robots.
DeepMind noted that for AI models to be useful for robots in the real world, they need three essential qualities: generality, interactivity, and dexterity. Generality refers to a model's ability to adapt to different situations. According to the company, Gemini Robotics is "adept at dealing with new objects, diverse instructions, and new environments." In internal testing, the researchers found that the model more than doubles performance on a comprehensive generalization benchmark compared with other state-of-the-art vision-language-action models.
Building on Gemini 2.0, the model's interactivity includes the ability to understand and respond to commands phrased in everyday, conversational language, as well as in different languages. According to Google, the model also continuously monitors its surroundings, detects changes to the instructions or the environment, and adjusts its actions accordingly.
Finally, DeepMind said Gemini Robotics can carry out highly complex, multi-step tasks that require precise manipulation of the physical environment. According to the researchers, the model can guide robots to fold a piece of paper or pack a snack into a bag.
The second AI model, Gemini Robotics-ER, is also a vision-language model, but it focuses on spatial reasoning. Drawing on Gemini 2.0's coding and 3D detection capabilities, the model is said to be able to work out the appropriate motions to manipulate an object in the real world. When shown a coffee mug, for instance, it could generate a command for a two-finger grasp to pick it up by the handle along a safe trajectory, according to Parada.
The AI model performs the numerous steps required to control a robot in the real world, including perception, state estimation, spatial understanding, planning, and code generation. Notably, neither of the two AI models is currently available to the public. DeepMind will likely first integrate them into humanoid robots and evaluate their capabilities before releasing the technology more widely.
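The article does not reveal DeepMind's actual interfaces, but the pipeline Parada describes (perception, state estimation, spatial understanding, and planning ending in a grasp command) can be illustrated with a minimal, hypothetical sketch. Every name below (GraspCommand, detect_object, plan_safe_trajectory, pick_up_mug) is invented for illustration and does not reflect any real Gemini Robotics API.

```python
# Hypothetical sketch of an embodied-reasoning pipeline like the one described above.
# None of these names come from Gemini Robotics; they are stand-ins for illustration.
from dataclasses import dataclass
from typing import List, Tuple

Pose = Tuple[float, float, float]  # x, y, z in the robot's frame (simplified)


@dataclass
class GraspCommand:
    """Structured output a planner might hand to a robot controller."""
    grip: str              # e.g. "two_finger_pinch"
    target: Pose           # where to close the gripper (e.g. the mug handle)
    waypoints: List[Pose]  # approach trajectory toward the target


def detect_object(image: bytes, label: str) -> Pose:
    """Perception + 3D detection stub: locate the named object in the scene."""
    # A real system would query a vision-language model here; we return a fixed pose.
    return (0.42, -0.10, 0.95)


def plan_safe_trajectory(start: Pose, goal: Pose) -> List[Pose]:
    """Planning stub: interpolate a short approach path toward the goal."""
    steps = 3
    return [
        tuple(s + (g - s) * (i + 1) / steps for s, g in zip(start, goal))
        for i in range(steps)
    ]


def pick_up_mug(image: bytes, gripper_pose: Pose) -> GraspCommand:
    """Perception -> spatial understanding -> planning -> grasp command."""
    handle = detect_object(image, "coffee mug handle")
    path = plan_safe_trajectory(gripper_pose, handle)
    return GraspCommand(grip="two_finger_pinch", target=handle, waypoints=path)


if __name__ == "__main__":
    command = pick_up_mug(image=b"", gripper_pose=(0.0, 0.0, 1.2))
    print(command)
```

The sketch only shows the shape of such a pipeline: a perception step that localizes the object, a planning step that produces a trajectory, and a structured grasp command as the final output, which in the actual system would be generated end to end by the model.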