Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

Luozheng Qin1,* Jia Gong1,*† Qian Qiao3,* Tianjiao Li4 Li Xu4 Haoyu Pan1 Chao Qu2,1 Zhiyu Tan2,1 Hao Li2,1†
1 Shanghai Academy of AI for Science 2 Fudan University 3 Independent Researcher 4 Singapore University of Technology and Design
* Equal contribution † Corresponding author
One unified diffusion model jointly generates video and text from a single prompt.

Abstract

Recent advances in video generation and video understanding have largely been driven by separate model families — diffusion-based generators and autoregressive language models, respectively. In this work, we present Uni-ViGU, a unified framework that bridges video generation and understanding within a single diffusion-based video generator. Our key insight is that a diffusion model, traditionally used only for pixel-level generation, can be extended to jointly produce both visual and textual outputs from a shared prompt, enabling a seamless integration of generation and understanding capabilities.

Through a carefully designed architecture and training strategy, Uni-ViGU generates high-quality videos while simultaneously producing coherent textual descriptions — all within one forward diffusion/denoising process. The model leverages flow matching to iteratively refine both modalities, and we demonstrate that video and text co-evolve meaningfully across denoising steps. Extensive experiments show that Uni-ViGU achieves competitive results on both video generation and understanding benchmarks, offering a new paradigm for multimodal unification.

Method

Uni-ViGU unifies video generation and understanding through a diffusion-based architecture that jointly denoises visual and textual representations.

Unified Architecture

A single diffusion backbone processes both video and text tokens, eliminating the need for separate generation and understanding models.
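To make the idea concrete, here is a minimal NumPy sketch of the shared-backbone pattern: video and text tokens are concatenated into one sequence and passed through a single self-attention block, so each modality attends to the other. This is an illustration of the pattern only, not the paper's actual DiT backbone; all names and dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_backbone(video_tokens, text_tokens, w_qkv, w_out):
    """One self-attention block over the concatenated token sequence.

    Because video and text tokens sit in the same sequence, they attend
    to each other and a single backbone serves both modalities.
    """
    x = np.concatenate([video_tokens, text_tokens], axis=0)  # (Nv+Nt, d)
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))
    y = (attn @ v) @ w_out
    # Split the joint output back into per-modality streams.
    n_video = video_tokens.shape[0]
    return y[:n_video], y[n_video:]

rng = np.random.default_rng(0)
d = 16
video = rng.normal(size=(8, d))   # hypothetical video latent tokens
text = rng.normal(size=(4, d))    # hypothetical text tokens
w_qkv = rng.normal(size=(d, 3 * d)) * 0.1
w_out = rng.normal(size=(d, d)) * 0.1
v_out, t_out = shared_backbone(video, text, w_qkv, w_out)
print(v_out.shape, t_out.shape)  # (8, 16) (4, 16)
```

The key design point is that no cross-modal fusion module is needed: self-attention over the joint sequence is the fusion mechanism, and the outputs are simply split back into a video stream and a text stream.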

Flow Matching

The model employs flow matching for iterative denoising, enabling smooth and interpretable generation trajectories for both modalities.
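The sampling loop behind flow matching can be sketched in a few lines: integrate the learned velocity field with fixed-step Euler from t = 1 (pure noise) down to t = 0 (data), the same time convention used in the demo below. The toy constant-velocity "model" here is a hypothetical stand-in for the learned network, chosen so the trajectory is easy to verify.

```python
import numpy as np

def euler_flow_sampler(velocity_fn, x_init, n_steps=50):
    """Integrate dx/dt = v(x, t) from t=1 (noise) to t=0 (data)
    with fixed-step Euler, exposing the full trajectory."""
    x = x_init.copy()
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    traj = [x.copy()]
    for t, t_next in zip(ts[:-1], ts[1:]):
        v = velocity_fn(x, t)
        x = x + (t_next - t) * v  # t_next < t, so the step moves toward data
        traj.append(x.copy())
    return x, traj

# For a linear path x_t = (1-t)*x0 + t*noise, the true velocity is the
# constant noise - x0; we use it directly as a toy oracle model.
rng = np.random.default_rng(1)
x0 = np.array([2.0, -1.0])  # pretend data sample
noise = rng.normal(size=2)
vel = lambda x, t: noise - x0
x_final, traj = euler_flow_sampler(vel, noise, n_steps=50)
print(np.allclose(x_final, x0))  # True: Euler is exact for a constant field
```

Keeping the whole trajectory (`traj`) is what makes the denoising process inspectable at every intermediate step, as in the demo slider below.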

Joint Denoising

Video frames and text tokens are co-denoised in a shared process, allowing them to inform and refine each other at every step.
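A minimal sketch of this co-denoising loop, under the same t = 1 → 0 convention: one joint velocity function receives both modality states and returns an update for each, so the two streams are refined in lockstep. The toy velocity below targets fixed endpoints for verifiability; in the actual model each modality's velocity would be conditioned on the other's current state.

```python
import numpy as np

def joint_denoise(video, text, joint_velocity_fn, n_steps=50):
    """Co-denoise video and text states: at every Euler step, one joint
    model call produces a velocity for each modality."""
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        v_vid, v_txt = joint_velocity_fn(video, text, t)
        dt = t_next - t
        video = video + dt * v_vid
        text = text + dt * v_txt
    return video, text

# Toy stand-in for the learned joint model. For a linear noising path,
# the marginal velocity toward a clean endpoint x0 is (x - x0) / t.
vid_target = np.full(4, 2.0)   # hypothetical clean video state
txt_target = np.full(3, -1.0)  # hypothetical clean text state

def joint_velocity(video, text, t):
    # The interface sees both states; here the endpoints are fixed for
    # the sake of an exactly checkable trajectory.
    return (video - vid_target) / t, (text - txt_target) / t

rng = np.random.default_rng(2)
video0, text0 = rng.normal(size=4), rng.normal(size=3)  # pure noise at t=1
video, text = joint_denoise(video0, text0, joint_velocity)
print(np.allclose(video, vid_target), np.allclose(text, txt_target))
```

Because the error shrinks by the factor t_next / t at each step, the residual telescopes to zero over the schedule, so both states land exactly on their targets here regardless of step count.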

Unified Generation Demo

Uni-ViGU jointly generates video and text from a single prompt through a shared denoising process. Use the step slider to inspect the denoising trajectory and observe how both modalities co-evolve.

Prompt

A golden retriever running through a sunlit meadow, wildflowers swaying in the breeze.

Denoising Trajectory

[Interactive slider: steps 0–49, from pure noise (t = 1) to converged (t = 0). The visual-state panel shows early, middle, and late temporal frames sampled from the intermediate video at the selected step; the text-state panel shows the corresponding intermediate textual output.]

Final Generated Video

[Final generated video, 832 × 480]
How to read this demo: The slider controls the denoising step in the flow-matching process. At step 0, both visual and text states begin as noise. As you advance the slider, both modalities are jointly refined through the unified diffusion process. The visual state panel shows sampled temporal frames from the intermediate video, while the text state panel shows the corresponding intermediate textual output. The final video below is the fully-denoised result, playable independently.

Citation

@article{qin2025univigu,
  title   = {Uni-ViGU: Towards Unified Video Generation and Understanding
             via A Diffusion-Based Video Generator},
  author  = {Qin, Luozheng and Gong, Jia and Qiao, Qian
             and Li, Tianjiao and Xu, Li and Pan, Haoyu and Qu, Chao
             and Tan, Zhiyu and Li, Hao},
  journal = {arXiv preprint arXiv:2604.08121},
  year    = {2026}
}

Acknowledgements

We thank the authors of Diffusers and Flow Matching, the two codebases upon which this project is built.