Xpeng unveils predictive autonomous driving AI framework X-Mind

X-Mind stands out as an attempt to make the decision-making process in autonomous driving lighter and more explicit. [Photo: Xpeng]

[DigitalToday reporter Jinju Hong (홍진주)] Xpeng has unveiled X-Mind, a new in-vehicle artificial intelligence (AI) framework that predicts and reasons about future traffic conditions before an autonomous vehicle acts. Rather than simply reacting to what is directly ahead, as existing autonomous driving methods do, it simulates what may happen next, much as a human driver would, and then makes its driving decisions.

On July 1 local time, electric vehicle outlet CleanTechnica reported that Xpeng first announced X-Mind along with its World Model technology roadmap at a CVPR 2026 workshop in Denver, the United States.

At the core of the technology is a shift from the existing “Perception-Action” structure of autonomous driving systems to a structure centered on prediction and reasoning. Xpeng explained that advanced autonomous driving requires active reasoning, controllable generation and long-term prediction capability. In other words, the vehicle should not only perceive the current scene but also work out how the physical environment will change before it acts.

X-Mind is a physical AI-based model for autonomous driving announced after X-World, X-Foresight and X-Cache, which Xpeng recently unveiled. It is designed so an in-vehicle AI agent can implement a “Visual Chain of Thought” that simulates future situations in advance.

Xpeng also pointed to limitations of existing approaches. It said text-based reasoning struggles to accurately represent complex road environments and spatial information. It also said that generating full future video carries a great deal of unnecessary information, such as texture, which dilutes the semantic information needed for autonomous driving.

To address this, X-Mind does not generate full, realistic video. Instead, it first creates a “Thought Sketch” containing only key elements, including lanes, obstacles, traffic lights, driving paths and speed information. Based on this information, it runs high-speed simulations inside the model to calculate the optimal driving route.
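To make the idea concrete, the sketch below shows one way such a stripped-down scene description could be represented and used to compare candidate routes. It is a minimal illustration only: the `ThoughtSketch` class, its field names, the `pick_path` function and the 2-metre safety margin are all hypothetical and are not taken from Xpeng's announcement, and the simple clearance check stands in for whatever simulation X-Mind actually runs.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in metres; the coordinate frame is an assumption


@dataclass
class ThoughtSketch:
    """Hypothetical container for the key elements named in the article."""
    lanes: List[List[Point]]      # lane centrelines as polylines
    obstacles: List[Point]        # predicted obstacle positions
    traffic_light: str            # "red", "yellow" or "green"
    speed_mps: float              # ego speed in metres per second
    candidate_paths: List[List[Point]] = field(default_factory=list)


def min_clearance(path: List[Point], obstacles: List[Point]) -> float:
    """Smallest distance from any waypoint on the path to any obstacle."""
    if not obstacles:
        return math.inf
    return min(math.dist(p, o) for p in path for o in obstacles)


def pick_path(sketch: ThoughtSketch, margin_m: float = 2.0) -> Optional[int]:
    """Index of the candidate path with the largest clearance, or None to hold position."""
    if sketch.traffic_light == "red" or not sketch.candidate_paths:
        return None
    clearances = [min_clearance(p, sketch.obstacles) for p in sketch.candidate_paths]
    best = max(range(len(clearances)), key=clearances.__getitem__)
    # Reject even the best path if it passes too close to an obstacle.
    return best if clearances[best] >= margin_m else None


sketch = ThoughtSketch(
    lanes=[[(0.0, 0.0), (20.0, 0.0)], [(0.0, 3.5), (20.0, 3.5)]],
    obstacles=[(10.0, 0.0)],
    traffic_light="green",
    speed_mps=15.0,
    candidate_paths=[
        [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)],  # stay in lane, through the obstacle
        [(0.0, 0.0), (10.0, 3.5), (20.0, 3.5)],  # move to the adjacent lane
    ],
)
print(pick_path(sketch))  # 1
```

The point of the example is that a handful of geometric primitives is enough to rank routes, with no texture or pixel data involved.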

Xpeng said it applied deep compression autoencoder technology that compresses 12 frames of future scenes into just 96 tokens. It said this sharply reduces the computational burden that arises when processing long contexts.
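The reported figures work out to 8 tokens per frame. The short calculation below shows that arithmetic and, for scale, compares it with an uncompressed baseline. The baseline of 256 tokens per frame is an assumed number chosen purely for illustration, since the article does not report one, and the quadratic cost estimate assumes standard self-attention, which the article does not confirm for X-Mind.

```python
FRAMES = 12   # future frames, as reported
TOKENS = 96   # tokens after compression, as reported

tokens_per_frame = TOKENS / FRAMES


def compression_ratio(baseline_tokens_per_frame: int) -> float:
    """How many times shorter the sequence is than an uncompressed baseline."""
    return (FRAMES * baseline_tokens_per_frame) / TOKENS


def attention_cost_ratio(baseline_tokens_per_frame: int) -> float:
    """Relative self-attention cost, assuming cost grows with sequence length squared."""
    return compression_ratio(baseline_tokens_per_frame) ** 2


ASSUMED_BASELINE = 256  # hypothetical tokens per frame, not a reported figure

print(tokens_per_frame)                        # 8.0
print(compression_ratio(ASSUMED_BASELINE))     # 32.0
print(attention_cost_ratio(ASSUMED_BASELINE))  # 1024.0
```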

Real-time processing performance was also improved. Existing diffusion-based generative models need many repeated computation steps, but X-Mind applies a recurrent block diffusion mechanism designed to generate future scenes in a single forward pass. Xpeng said the Fréchet Inception Distance (FID) metric, which indicates image generation quality, improved significantly compared with existing methods, while inference latency increased little, if at all.
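The cost difference between multi-step sampling and single-pass generation can be illustrated with a toy example that counts how often the network is called. This does not reproduce Xpeng's recurrent block diffusion mechanism, whose details the article does not give. The `CountingDenoiser` class and both sampling functions are hypothetical stand-ins, and the "network" here merely halves its input.

```python
from typing import List


class CountingDenoiser:
    """Toy stand-in for a neural network that records how often it is evaluated."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: List[float], step: int) -> List[float]:
        self.calls += 1
        return [0.5 * v for v in x]  # toy update: shrink values toward zero


def iterative_sample(denoiser: CountingDenoiser, x: List[float], steps: int) -> List[float]:
    """Conventional sampling: one network evaluation per denoising step."""
    for step in reversed(range(steps)):
        x = denoiser(x, step)
    return x


def single_pass_sample(denoiser: CountingDenoiser, x: List[float]) -> List[float]:
    """Single-pass generation: exactly one network evaluation."""
    return denoiser(x, 0)


multi = CountingDenoiser()
iterative_sample(multi, [1.0, -1.0], steps=50)

single = CountingDenoiser()
single_pass_sample(single, [1.0, -1.0])

print(multi.calls, single.calls)  # 50 1
```

Since each call corresponds to a full pass through the model, the number of calls is a rough proxy for inference latency.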

Another feature is that the decision-making process can be checked visually. X-Mind can visualise how the vehicle internally predicted future obstacle positions and lane connectivity before deciding an action. Xpeng said it expects these functions to help not only with algorithm verification but also with securing user trust and software debugging.

Xpeng said that in its performance evaluation the system, trained on hundreds of millions of real-world driving samples, predicted obstacles and risk factors faster than existing vision-language-action (VLA) models in scenarios such as sudden braking, highway merging and complex intersections. It also said the system reduced driving trajectory errors and improved safety and traffic law compliance in complex, unexpected situations.

Xpeng also emphasised that it raised inference speed enough for X-Mind to run on in-vehicle chips. It said the computational burden is lower than that of existing approaches that use raw images or 3D Gaussian splatting, which makes use in mass-production vehicles more feasible.

Xpeng said it plans to develop X-World, X-Foresight and X-Mind into a single integrated foundation model system to strengthen active reasoning and long-term prediction. It also said it is improving VLA 2.0 performance by continuously scaling up model size, data volume and training objectives. It plans to extend the system, which covers environment understanding, reasoning, decision-making and action execution, into the broader domain of embodied intelligence, and to accelerate both development of the related technology and its application in mass production.

Jinju Hong hongjj@d-today.co.kr
