The Qwen3.5-Omni model unveiled by Alibaba’s research team [Photo: Qwen blog]

[Digital Today reporter Yoonseo Lee (이윤서)] Alibaba’s AI research team Qwen·Tongyi Lab has unveiled the omni-modal model Qwen3.5-Omni, which covers text, image, audio and video understanding as well as speech generation.

On March 31 local time, online outlet Gigazine reported that Tongyi Lab said the model’s speech and video understanding performance surpasses Google’s Gemini 3.1 Pro.

Tongyi Lab emphasized “real-time response” and “long-input processing.” Qwen3.5-Omni has a maximum sequence length of 256,000 tokens, which allows inputs of up to 10 hours of audio or 400 seconds of audiovisual data at 1 frame per second. Speech recognition supports 74 languages, including 39 Chinese dialects as well as Japanese and English. Speech synthesis supports 29 languages, including seven Chinese dialects as well as Japanese and English.
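
To put those input limits in perspective, the following is a rough back-of-the-envelope calculation, not a figure published by Tongyi Lab. It assumes the 256,000 sequence length is counted in tokens and that each stated maximum input would fill the entire sequence on its own. The function name `tokens_per_second` is only illustrative.

```python
# Rough token budget per second of input.
# Assumptions (not from Tongyi Lab): the 256,000 limit is counted in tokens,
# and each maximum-length input uses the whole sequence by itself.
MAX_SEQUENCE_LENGTH = 256_000


def tokens_per_second(max_tokens: int, duration_seconds: float) -> float:
    """Average number of tokens available per second of input."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return max_tokens / duration_seconds


audio_seconds = 10 * 60 * 60  # 10 hours of audio
av_seconds = 400              # 400 seconds of audiovisual data at 1 frame per second

audio_rate = tokens_per_second(MAX_SEQUENCE_LENGTH, audio_seconds)
av_rate = tokens_per_second(MAX_SEQUENCE_LENGTH, av_seconds)

print(f"audio only:  {audio_rate:.1f} tokens per second")
print(f"audiovisual: {av_rate:.1f} tokens per second")
```

Under these assumptions, audio alone works out to about 7 tokens per second, while audiovisual input works out to 640 tokens per second, which suggests that video frames account for most of the budget.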

Training data and the internal structure were also disclosed. Tongyi Lab explained that Qwen3.5-Omni was trained on more than 100 million hours of visual and audio data. It said the model combines two mixture-of-experts structures: text generated by one structure is passed to the other, which outputs speech that matches the context. It also presented “scaling” and “native omni-modal AGI” as the model’s direction, and set out a goal of full-scale native omni-modal implementation.
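
The two-stage flow described above, in which one component generates text and a second turns that text into speech, can be illustrated with a toy sketch. This is not Qwen3.5-Omni code: the names `text_stage`, `speech_stage`, `respond` and `SpeechChunk` are hypothetical, both stages are trivial placeholders, and no mixture-of-experts routing is modeled. It shows only the hand-off of text from the first stage to the second.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechChunk:
    """Placeholder for synthesized speech (hypothetical type)."""
    text: str             # the text this speech is meant to voice
    samples: List[float]  # stand-in waveform, one dummy sample per character


def text_stage(prompt: str) -> str:
    """Stand-in for the first structure, which generates text."""
    return f"Answer to: {prompt}"


def speech_stage(text: str) -> SpeechChunk:
    """Stand-in for the second structure, which voices the text it receives."""
    samples = [0.0 for _ in text]  # dummy waveform, not real audio
    return SpeechChunk(text=text, samples=samples)


def respond(prompt: str) -> SpeechChunk:
    """Pass the first stage's text output to the second stage."""
    generated_text = text_stage(prompt)
    return speech_stage(generated_text)


chunk = respond("What happens in this video?")
print(chunk.text)
print(len(chunk.samples))
```

Because the second stage receives the text produced by the first, the speech it outputs is tied to what was actually generated, which is the property the article refers to as speech that matches the context.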

The product was unveiled as a three-model lineup rather than a single model. It comprises Qwen3.5-Omni Plus, Qwen3.5-Omni Flash and Qwen3.5-Omni Light, and can be used through offline and real-time APIs. Tongyi Lab said the Plus model showed better performance than Gemini 3.1 Pro on multiple benchmarks.

In its demos, Tongyi Lab showed video understanding alongside use as a development assistant. It released an audiovisual recognition demo that describes the events in a video as text. It also showed a workflow in which the model outputs code from a video that shows a hand-drawn blueprint while the desired functions are explained aloud. Tongyi Lab named this “Audio-Visual Vibe Coding.” For speech synthesis, it introduced the ability to adjust voice tone to produce high-quality speech.

The release is seen as a move by Alibaba to strengthen its position in the competition over omni-modal AI, which processes text, images, audio and video in an integrated way. Actual market competitiveness is likely to depend not on benchmark results but on how reliably the model delivers long-input processing, real-time responses and speech generation quality in real service environments.

1/10 Qwen3.5-Omni is here! Scaling up to a native omni-modal AGI. Meet the next generation of Qwen, designed for native text, image, audio, and video understanding, with major advances in both intelligence and real-time interaction. A standout feature: Audio-Visual Vibe… pic.twitter.com/fWWyTl9cPY

Copyright © DigitalToday. All rights reserved. Unauthorized reproduction and redistribution are prohibited.