Alibaba unveils VLA model to bridge fragmented physical tasks

  • Qwen-VLA combines vision, language and motion generation in a single system for robots
  • The move marks Alibaba’s entry into embodied AI as firms race to build foundation models for the physical world

Alibaba Group’s AI model team Qwen has released its first vision-language-action, or VLA, model for embodied AI, expanding the tech giant’s AI ambitions beyond digital applications and into robotics.

The model, called Qwen-VLA, is designed to unify a range of robotic capabilities that are typically handled by separate systems, including object manipulation, dual-arm coordination, indoor navigation and visual understanding.

The launch positions Alibaba among a growing number of companies seeking to develop general-purpose AI models capable of operating in physical environments.

Built on multimodal model

Built on the Qwen multimodal foundation model, Qwen-VLA adds a diffusion-based action decoder that converts visual and language inputs into continuous motion commands.
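As a rough illustration of what a diffusion-based action decoder does, the toy sketch below starts from random noise and removes predicted noise step by step until a continuous motion command remains. It is not Alibaba's implementation. The names `toy_denoiser`, `decode_action` and `target` are hypothetical, and the trained network that would normally predict noise from visual and language features is replaced by a stand-in that is simply handed the desired command.

```python
import random


def toy_denoiser(noisy_action, target):
    """Stand-in for a learned network (assumption): returns the predicted
    noise as the offset between the noisy action and the desired command."""
    return [a - t for a, t in zip(noisy_action, target)]


def decode_action(target, steps=10, seed=0):
    """Iteratively denoise a random vector into a continuous motion command."""
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per action dimension.
    action = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        predicted_noise = toy_denoiser(action, target)
        # Remove a growing share of the predicted noise; the share is 1.0
        # on the final step, so all remaining noise is removed there.
        fraction = 1.0 / (steps - step)
        action = [a - fraction * n for a, n in zip(action, predicted_noise)]
    return action


# Hypothetical 3-value command, e.g. an x, y, z offset for a gripper.
command = decode_action([0.2, -0.1, 0.5])
print([round(value, 3) for value in command])
```

In a real system the denoiser would not see the target. It would infer the noise from camera images and the instruction, which is the part the toy function skips.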

Users can issue instructions in natural language, such as directing a dual-arm robot to place a red cup into a box, and the model determines both the robot configuration and the required action format.

Alibaba said the model achieved a 97.9% success rate on the LIBERO desktop manipulation benchmark, close to the 98.6% recorded by the specialized ABot-M0 model.

On the RoboTwin dual-arm benchmark, Qwen-VLA scored 86.1% on simple tasks and 87.2% on difficult tasks, outperforming previous task-specific systems.

In indoor navigation tests using the R2R benchmark, it achieved a 69.0% Oracle success rate, which Alibaba described as the highest among open-source models.

Object manipulation tests

In experiments using ALOHA dual-arm robots, the pre-trained model achieved an average success rate of 76.9% when tested in unfamiliar environments with different backgrounds, object colors and object placements. A baseline model trained from scratch recorded 36.2%.
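For a sense of scale, the gap between the two ALOHA figures can be worked out directly. The short sketch below uses only the two percentages reported above; the function name `compare_success_rates` is illustrative.

```python
def compare_success_rates(pretrained_pct, baseline_pct):
    """Return the gap in percentage points and the ratio between two rates."""
    gap_points = pretrained_pct - baseline_pct
    ratio = pretrained_pct / baseline_pct
    return gap_points, ratio


# Figures reported for unfamiliar environments: 76.9% vs. 36.2%.
gap, ratio = compare_success_rates(76.9, 36.2)
print(f"gap: {gap:.1f} percentage points")  # gap: 40.7 percentage points
print(f"ratio: {ratio:.2f}x")               # ratio: 2.12x
```

On these numbers, the pre-trained model succeeded a little more than twice as often as the baseline trained from scratch.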

Alibaba said the model also demonstrated the ability to manipulate previously unseen objects, including toy ducks, sunglasses and vegetables, after receiving natural-language instructions.

Architecturally, the model’s “brain” is powered by the Qwen3.5 multimodal model and handles perception and reasoning, while its “cerebellum” consists of a 1.15-billion-parameter diffusion transformer-based action decoder responsible for generating smooth motion trajectories.

Training was conducted in four stages: text-to-action pre-training to establish motion priors, multimodal pre-training for visual alignment, supervised fine-tuning on human demonstration data and, finally, reinforcement learning in simulation environments.
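The four stages can be pictured as an ordered curriculum in which each stage starts from the checkpoint the previous one produced. The sketch below is a hypothetical illustration of that ordering only. The `Stage` class, `run_curriculum` and `fake_train` are made-up names, and no real training takes place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    purpose: str


# Stage order and purposes follow the description in the article.
STAGES = (
    Stage("text_to_action_pretraining", "establish motion priors"),
    Stage("multimodal_pretraining", "visual alignment"),
    Stage("supervised_fine_tuning", "human demonstration data"),
    Stage("reinforcement_learning", "simulation environments"),
)


def run_curriculum(stages, train_fn):
    """Run stages in order, passing each checkpoint on to the next stage."""
    checkpoint = None
    history = []
    for stage in stages:
        checkpoint = train_fn(stage, checkpoint)
        history.append(checkpoint)
    return history


def fake_train(stage, previous):
    # Hypothetical placeholder: records lineage instead of training a model.
    return stage.name if previous is None else f"{previous}->{stage.name}"


for checkpoint in run_curriculum(STAGES, fake_train):
    print(checkpoint)
```

Each printed line extends the one before it, showing that later stages build on earlier ones rather than starting fresh.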

The architecture of the Qwen-VLA model

The launch reflects a broader effort to address fragmentation in the robotics industry, where systems designed for one task often cannot be easily adapted to others, resulting in high development costs and limited scalability.

Across hardware platforms

Rather than building robots itself, Alibaba is positioning Qwen-VLA as a foundation model that can be deployed across different hardware platforms.

The strategy mirrors the company’s approach in large language models, where open-source software and ecosystem partnerships have been central to adoption.

“Moving from understanding language and visual cues to carrying out physical work is a fundamental leap that will not happen overnight,” an industry analyst said on condition of anonymity. “But Qwen-VLA shows that using a unified foundation model to consolidate fragmented physical control tasks is a viable path forward.”
