Recent advances in video generation and video understanding have largely been driven by separate model families — diffusion-based generators and autoregressive language models, respectively. In this work, we present Uni-ViGU, a unified framework that bridges video generation and understanding within a single diffusion-based video generator. Our key insight is that a diffusion model, traditionally used only for pixel-level generation, can be extended to jointly produce both visual and textual outputs from a shared prompt, enabling a seamless integration of generation and understanding capabilities.
Through a carefully designed architecture and training strategy, Uni-ViGU generates high-quality videos while simultaneously producing coherent textual descriptions, all within a single denoising process. The model leverages flow matching to iteratively refine both modalities, and we demonstrate that video and text co-evolve meaningfully across denoising steps. Extensive experiments show that Uni-ViGU achieves competitive results on both video generation and understanding benchmarks, offering a new paradigm for multimodal unification.
Uni-ViGU unifies video generation and understanding through a diffusion-based architecture that jointly denoises visual and textual representations.
A single diffusion backbone processes both video and text tokens, eliminating the need for separate generation and understanding models.
The model employs flow matching for iterative denoising, enabling smooth and interpretable generation trajectories for both modalities.
Video frames and text tokens are co-denoised in a shared process, allowing them to inform and refine each other at every step.
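The co-denoising idea above can be illustrated with a minimal sketch. The snippet below is not the paper's implementation: the latent sizes, the fixed random linear "velocity field" standing in for the shared backbone, and the plain Euler integrator are all illustrative assumptions. It only shows the structure of flow-matching co-denoising, where video and text latents start from noise and are refined jointly by one velocity model.

```python
import numpy as np

# Hypothetical toy sizes; the real model uses large spatiotemporal and
# text latent spaces.
VIDEO_DIM, TEXT_DIM, STEPS = 16, 8, 10
rng = np.random.default_rng(0)

# Stand-in for the shared diffusion backbone: a fixed random linear
# velocity field over the concatenated [video | text] latent.
W = rng.normal(0, 0.1, size=(VIDEO_DIM + TEXT_DIM + 1, VIDEO_DIM + TEXT_DIM))

def velocity(z, t):
    # Condition on the scalar time t by appending it as an extra feature.
    zt = np.concatenate([z, np.full((z.shape[0], 1), t)], axis=1)
    return zt @ W

def co_denoise(batch=2):
    # Both modalities start as Gaussian noise; Euler-integrate the flow
    # from t=0 to t=1, so video and text latents are refined jointly and
    # can inform each other at every step.
    z = rng.normal(size=(batch, VIDEO_DIM + TEXT_DIM))
    dt = 1.0 / STEPS
    for i in range(STEPS):
        z = z + velocity(z, i * dt) * dt
    # Split the shared latent back into per-modality outputs.
    return z[:, :VIDEO_DIM], z[:, VIDEO_DIM:]

video_latent, text_latent = co_denoise()
print(video_latent.shape, text_latent.shape)  # (2, 16) (2, 8)
```

Because both modalities live in one latent and share one velocity model, each Euler step updates video and text together, which is the mechanism behind the co-evolving trajectories shown in the demo.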
Uni-ViGU jointly generates video and text from a single prompt through a shared denoising process. Use the step slider to inspect the denoising trajectory and observe how both modalities co-evolve.
Temporal frames sampled from the intermediate video at this step
Final generated video
@article{qin2025univigu,
  title   = {Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator},
  author  = {Qin, Luozheng and Gong, Jia and Qiao, Qian and Li, Tianjiao and Xu, Li and Pan, Haoyu and Qu, Chao and Tan, Zhiyu and Li, Hao},
  journal = {arXiv preprint arXiv:2604.08121},
  year    = {2026}
}
We thank the authors of Diffusers and Flow Matching, the two codebases this project is built upon.