Robotics AI model race pits world models against VLA

As the industry accelerates development of AI models optimized for robots, two broad camps appear to be vying for technological leadership.

According to a recent report by The Information, one camp backs vision-language-action (VLA) models, which are derived from large language models. The other backs world models, which are trained mainly on video to predict what will happen in real environments when robots take actions.

Among VLA models, Nvidia's Groot and Physical Intelligence's Pi are drawing attention.

Microsoft's physical AI robotics model Rho-alpha, unveiled in January, is also based on VLA. Microsoft aims to use Rho-alpha to help physical systems adapt more flexibly.

According to the company, robots have for decades operated in structured environments such as assembly lines, where tasks are predictable and strictly defined. With the emergence of VLA models for physical systems, robots can now be supported in autonomously perceiving, reasoning and acting alongside humans, even in complex, less structured environments.

But The Information report suggests interest in world models has been rising in Silicon Valley recently.

AI video startup Luma opened a physical AI lab in June focused on world models for robotics, and humanoid startup 1X announced it would set up its own world model research institute.

Supporters expect world models to understand physics very deeply, predict real situations such as how objects fall and break, generate simulations robots can learn from, and serve as the AI brain in robots.

Martial Hebert, dean of the School of Computer Science at Carnegie Mellon University, pointed to the fact that chatbots cannot pick up a coffee cup to highlight the limits of existing language models. "How to move a hand and physical contact with a cup are far more complex than predicting the next word," he said.

Critics, meanwhile, argue that world models still make frequent errors and therefore cannot accurately simulate the real world. Even so, as investors lower their expectations for VLA, world models are increasingly in the spotlight, The Information reported.

According to the report, VLA models such as Nvidia's Groot and Physical Intelligence's Pi have had some success by borrowing the intelligence and natural language understanding of their underlying language models. But even two years after their release, VLA models as a whole still lack the reliability needed to run robots on real production floors.

Rajat Bhageria, CEO of Chef Robotics, which supplies meal-assembly robots for industrial kitchens, said, "We have experimented with using VLA to run robots, but in the long term world models are more promising." He added, "Current VLA is too slow and unreliable. It is not yet ready for full-scale use."

Bhageria also mentioned a world model developed by autonomous driving services company Waymo using Google DeepMind's Genie 3, saying the model can simulate highly exceptional road situations such as tornadoes or elephants on the road.

Some also say it is unrealistic to view the competitive landscape for AI models for robots solely as a confrontation between world models and VLA.

Nvidia's world model Cosmos 3, for example, combines elements of world models and VLA: it analyzes text and images and also generates physically realistic video, The Information reported.

A group of robotics researchers recently said, "Robots need more than VLA and world models." The group added, "The debate over which type of model is better is missing the point," and stressed that the bigger task is finding a way to convert physical data such as internet videos into a form robots can easily learn from.

Chi-gyu Hwang (황치규) delight@d-today.co.kr
