Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

Luozheng Qin1,* Jia Gong1,*† Qian Qiao3,* Tianjiao Li4 Li Xu4 Haoyu Pan1 Chao Qu2,1 Zhiyu Tan2,1 Hao Li2,1†
1 Shanghai Academy of AI for Science 2 Fudan University 3 Independent Researcher 4 Singapore University of Technology and Design
* Equal contribution † Corresponding author
One unified diffusion model jointly generates video and text from a single prompt.

Abstract

Recent advances in video generation and video understanding have largely been driven by separate model families — diffusion-based generators and autoregressive language models, respectively. In this work, we present Uni-ViGU, a unified framework that bridges video generation and understanding within a single diffusion-based video generator. Our key insight is that a diffusion model, traditionally used only for pixel-level generation, can be extended to jointly produce both visual and textual outputs from a shared prompt, enabling a seamless integration of generation and understanding capabilities.

Through a carefully designed architecture and training strategy, Uni-ViGU generates high-quality videos while simultaneously producing coherent textual descriptions — all within one forward diffusion/denoising process. The model leverages flow matching to iteratively refine both modalities, and we demonstrate that video and text co-evolve meaningfully across denoising steps. Extensive experiments show that Uni-ViGU achieves competitive results on both video generation and understanding benchmarks, offering a new paradigm for multimodal unification.

Method

Uni-ViGU unifies video generation and understanding through a diffusion-based architecture that jointly denoises visual and textual representations.

Unified Architecture

A single diffusion backbone processes both video and text tokens, eliminating the need for separate generation and understanding models.
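To make the idea concrete, here is a minimal NumPy sketch of the shared-backbone pattern: video and text tokens are concatenated into one sequence and passed through a single self-attention block, so each modality attends to the other. This is an illustration of the pattern only, not the paper's actual DiT backbone; all names and dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_backbone(video_tokens, text_tokens, w_qkv, w_out):
    """One self-attention block over the concatenated token sequence.

    Because video and text tokens sit in the same sequence, they attend
    to each other and a single backbone serves both modalities.
    """
    x = np.concatenate([video_tokens, text_tokens], axis=0)  # (Nv+Nt, d)
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))
    y = (attn @ v) @ w_out
    # Split the joint output back into per-modality streams.
    n_video = video_tokens.shape[0]
    return y[:n_video], y[n_video:]

rng = np.random.default_rng(0)
d = 16
video = rng.normal(size=(8, d))   # hypothetical video latent tokens
text = rng.normal(size=(4, d))    # hypothetical text tokens
w_qkv = rng.normal(size=(d, 3 * d)) * 0.1
w_out = rng.normal(size=(d, d)) * 0.1
v_out, t_out = shared_backbone(video, text, w_qkv, w_out)
print(v_out.shape, t_out.shape)  # (8, 16) (4, 16)
```

The key design point is that no cross-modal fusion module is needed: self-attention over the joint sequence is the fusion mechanism, and the outputs are simply split back into a video stream and a text stream.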

Flow Matching

The model employs flow matching for iterative denoising, enabling smooth and interpretable generation trajectories for both modalities.
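The sampling loop behind flow matching can be sketched in a few lines: integrate the learned velocity field with fixed-step Euler from t = 1 (pure noise) down to t = 0 (data), the same time convention used in the demo below. The toy constant-velocity "model" here is a hypothetical stand-in for the learned network, chosen so the trajectory is easy to verify.

```python
import numpy as np

def euler_flow_sampler(velocity_fn, x_init, n_steps=50):
    """Integrate dx/dt = v(x, t) from t=1 (noise) to t=0 (data)
    with fixed-step Euler, exposing the full trajectory."""
    x = x_init.copy()
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    traj = [x.copy()]
    for t, t_next in zip(ts[:-1], ts[1:]):
        v = velocity_fn(x, t)
        x = x + (t_next - t) * v  # t_next < t, so the step moves toward data
        traj.append(x.copy())
    return x, traj

# For a linear path x_t = (1-t)*x0 + t*noise, the true velocity is the
# constant noise - x0; we use it directly as a toy oracle model.
rng = np.random.default_rng(1)
x0 = np.array([2.0, -1.0])  # pretend data sample
noise = rng.normal(size=2)
vel = lambda x, t: noise - x0
x_final, traj = euler_flow_sampler(vel, noise, n_steps=50)
print(np.allclose(x_final, x0))  # True: Euler is exact for a constant field
```

Keeping the whole trajectory (`traj`) is what makes the denoising process inspectable at every intermediate step, as in the demo slider below.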

Joint Denoising

Video frames and text tokens are co-denoised in a shared process, allowing them to inform and refine each other at every step.
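A minimal sketch of this co-denoising loop, under the same t = 1 → 0 convention: one joint velocity function receives both modality states and returns an update for each, so the two streams are refined in lockstep. The toy velocity below targets fixed endpoints for verifiability; in the actual model each modality's velocity would be conditioned on the other's current state.

```python
import numpy as np

def joint_denoise(video, text, joint_velocity_fn, n_steps=50):
    """Co-denoise video and text states: at every Euler step, one joint
    model call produces a velocity for each modality."""
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        v_vid, v_txt = joint_velocity_fn(video, text, t)
        dt = t_next - t
        video = video + dt * v_vid
        text = text + dt * v_txt
    return video, text

# Toy stand-in for the learned joint model. For a linear noising path,
# the marginal velocity toward a clean endpoint x0 is (x - x0) / t.
vid_target = np.full(4, 2.0)   # hypothetical clean video state
txt_target = np.full(3, -1.0)  # hypothetical clean text state

def joint_velocity(video, text, t):
    # The interface sees both states; here the endpoints are fixed for
    # the sake of an exactly checkable trajectory.
    return (video - vid_target) / t, (text - txt_target) / t

rng = np.random.default_rng(2)
video0, text0 = rng.normal(size=4), rng.normal(size=3)  # pure noise at t=1
video, text = joint_denoise(video0, text0, joint_velocity)
print(np.allclose(video, vid_target), np.allclose(text, txt_target))
```

Because the error shrinks by the factor t_next / t at each step, the residual telescopes to zero over the schedule, so both states land exactly on their targets here regardless of step count.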

Unified Generation Demo

Uni-ViGU jointly generates video and text from a single prompt through a shared denoising process. Use the step slider to inspect the denoising trajectory and observe how both modalities co-evolve.

Prompt

A golden retriever running through a sunlit meadow, wildflowers swaying in the breeze.

Denoising Trajectory

[Interactive slider: steps 0–49, from pure noise (t = 1) to converged (t = 0). The visual-state panel shows early, middle, and late temporal frames sampled from the intermediate video at the selected step; the text-state panel shows the corresponding intermediate textual output.]

Final Generated Video

[Final generated video, 832 × 480]
How to read this demo: The slider controls the denoising step in the flow-matching process. At step 0, both visual and text states begin as noise. As you advance the slider, both modalities are jointly refined through the unified diffusion process. The visual state panel shows sampled temporal frames from the intermediate video, while the text state panel shows the corresponding intermediate textual output. The final video below is the fully-denoised result, playable independently.

Citation

@article{qin2025univigu,
  title   = {Uni-ViGU: Towards Unified Video Generation and Understanding
             via A Diffusion-Based Video Generator},
  author  = {Qin, Luozheng and Gong, Jia and Qiao, Qian
             and Li, Tianjiao and Xu, Li and Pan, Haoyu and Qu, Chao
             and Tan, Zhiyu and Li, Hao},
  journal = {arXiv preprint arXiv:2604.08121},
  year    = {2026}
}

Acknowledgements

We thank the authors of Diffusers and Flow Matching, the two codebases upon which this project is built.