In a cluttered open-plan office in Mountain View, California, a tall, slender wheeled robot has been busy acting as a tour guide and informal office assistant, thanks to a major language model upgrade that Google DeepMind revealed today. The robot uses the latest version of Google’s Gemini large language model both to parse commands and to find its way around.
For example, when a person says, “Find me a place to write,” the robot dutifully rolls off, leading the person to a pristine whiteboard somewhere in the building.
Gemini’s ability to handle video and text, together with its capacity to ingest large amounts of information in the form of previously recorded video tours of the office, allows the “Google Helper” robot to make sense of its surroundings and navigate correctly when given commands that require some common-sense reasoning. The robot pairs Gemini with an algorithm that generates specific actions for the robot to take, such as turning, in response to commands and what it sees in front of it.
When Gemini was introduced in December, Google DeepMind CEO Demis Hassabis told WIRED that its multimodal capabilities would likely unlock new robot abilities, adding that the company’s researchers were hard at work testing the model’s robotic potential.
In a new paper describing the project, the researchers behind the work say their robot proved to be 90 percent reliable at navigating, even when given tricky commands such as “Where did I leave my coaster?” DeepMind’s system “significantly improved the naturalness of human-robot interactions and greatly increased the usability of the robot,” the team writes.
Photo: Muinat Abdul; Google DeepMind
The demo neatly illustrates the potential of large language models to reach into the physical world and do useful work. Gemini and other chatbots mostly operate within the confines of a web browser or app, although they are increasingly able to handle visual and auditory input, as both Google and OpenAI have recently demonstrated. In May, Hassabis showed off an upgraded version of Gemini that can make sense of an office layout as seen through a smartphone camera.
Academic and industry research labs are racing to see how language models might be used to enhance robots’ abilities. The May program of the International Conference on Robotics and Automation, a popular event for robotics researchers, lists nearly two dozen papers that involve vision language models.
Investors are pouring money into startups that aim to apply advances in AI to robotics. Several of the researchers involved in the Google project have since left the company to found a startup called Physical Intelligence, which has received an initial $70 million in funding and is working to combine large language models with real-world training to give robots general problem-solving abilities. Skild AI, founded by roboticists at Carnegie Mellon University, has a similar goal. This month, it announced $300 million in funding.
Just a few years ago, a robot needed a map of its surroundings and carefully crafted commands to navigate successfully. Large language models contain useful information about the physical world, and newer versions that are trained on images and video as well as text, known as vision language models, can answer questions that require perception. Gemini lets Google’s robot parse visual instructions as well as spoken ones, for example by following a sketch on a whiteboard that shows a route to a new destination.
In their paper, the researchers say they plan to test the system on different kinds of robots. They add that Gemini should be able to make sense of more complex questions, such as “Do they have my favorite drink today?” from a user who has a lot of empty Coca-Cola cans on her desk.

