Waymo has long touted its ties to Google DeepMind and decades of artificial intelligence research as a strategic advantage over rivals in the autonomous driving space. Now the Alphabet-owned company is taking it a step further by developing a new training model for its robotaxis built on Gemini, Google’s multimodal large language model (MLLM).
Waymo today published a new research paper introducing its “End-to-End Multimodal Model for Autonomous Driving,” also known as EMMA. This new end-to-end training model processes sensor data to generate “future autonomous vehicle trajectories,” helping Waymo’s autonomous vehicles make decisions about where to go and how to avoid obstacles.
But more importantly, this is one of the first signs that a leader in autonomous driving plans to use MLLMs in its operations. It also suggests that these models can break free of their current uses as chatbots, email organizers and image generators and find application in an entirely new environment: the road. In its research paper, Waymo proposes “developing an autonomous driving system in which MLLM will be a first-class citizen.”
The End-to-End Multimodal Model for Autonomous Driving, also known as EMMA
The paper describes how, historically, autonomous driving systems have relied on specific “modules” for separate functions, including perception, mapping, prediction and planning. This approach proved useful for many years, but it has scaling problems “due to accumulated errors between modules and limited communication between modules.” Moreover, these modules may struggle to respond to “new environments” because they are “predefined” by nature, which can make it hard to adapt.
Waymo developed EMMA as a tool to help its robotaxis navigate complex environments. The company identified several situations in which the model helped its driverless cars find the right route, including encounters with various animals or road construction.
Other companies, such as Tesla, have spoken extensively about developing end-to-end models for their autonomous cars. Elon Musk says that the latest version of the Full Self-Driving system (12.5.5) uses an “end-to-end neural network” AI system that translates camera images into driving decisions.
This is a clear indication that Waymo, which has a lead over Tesla in deploying driverless vehicles on the road, is also interested in pursuing an end-to-end system. The company said its EMMA model excelled at trajectory prediction, object detection and road graph understanding.
“This suggests a promising direction for future research that could combine even more basic autonomous driving tasks in a similar scaled-up configuration,” the company said in a blog post today.
However, EMMA also has its limitations, and Waymo acknowledges that further research will be needed before the model can be put into practice. For example, EMMA couldn’t incorporate 3D sensor input from lidar or radar, which Waymo said was “computationally expensive.” It could also only process a small number of image frames at a time.
There are also risks in using MLLMs to train robotaxis that were not mentioned in the research paper. Chatbots like Gemini often hallucinate or fail at simple tasks such as reading clocks or counting objects. Waymo has very little margin for error when its autonomous vehicles are traveling at 40 miles per hour on a busy road. More research will be needed before these models can be deployed at scale, and Waymo is clear about that.
“We hope that our results will inspire further research to alleviate these issues,” writes the company’s research team, “and further develop state-of-the-art architecture for autonomous driving models.”
