Alibaba Group Holding on Tuesday unveiled a suite of new artificial intelligence models, including a multimodal system rivalling OpenAI’s GPT-4o and Google’s popular “Nano Banana” image editor, intensifying both domestic and international competition in the field.
Chief among the new releases was Qwen3-Omni, a flagship multimodal model akin to OpenAI’s GPT-4o, which launched in May 2024. The Alibaba model is designed to process a combination of text, audio, image and video inputs and respond with text and audio.
Qwen3-Omni is the first native end-to-end multimodal system that “unifies text, images, audio and video in one model”, the development team said on social media. Alibaba owns the Post.
The model competes with similar offerings already available outside China, including OpenAI’s GPT-4o and Google’s Gemini 2.5 Flash Image, better known as “Nano Banana” – an image editing and generation tool that has been making waves recently.
Citing benchmark tests on audio recognition and comprehension, as well as image and video understanding, the developers said two variants of Qwen3-Omni outperformed their predecessor, Qwen2.5-Omni-7B, as well as GPT-4o and Gemini 2.5 Flash.
Lin Junyang, a researcher on the Qwen team under Alibaba’s cloud unit, attributed the improvements to a series of foundational projects on audio and image processing.