DreamForge: Motion-Aware Autoregressive Video Generation for
Multiview Driving Scenes



1 Zhejiang University   2 Shanghai Artificial Intelligence Laboratory   3 University of Science and Technology of China
4 Technical University of Munich
Corresponding authors


Abstract

Recent advances in diffusion models have improved controllable streetscape generation and supported downstream perception and planning tasks. However, accurately modeling driving scenes and generating long videos remain challenging. To alleviate these issues, we propose DreamForge, a diffusion-based autoregressive video generation model tailored for 3D-controllable long-term generation. To enhance lane and foreground generation, we introduce perspective guidance and design object-wise position encoding (OPE) to incorporate local 3D correlations and improve foreground object modeling. We also propose motion-aware temporal attention (MTA) to capture motion cues and appearance changes in videos. By leveraging motion frames and an autoregressive generation paradigm, we can generate long videos (over 200 frames) with a model trained only on short sequences, achieving superior quality over the baseline in 16-frame video evaluations. Finally, we integrate our method with the realistic simulator DriveArena to provide more reliable open-loop and closed-loop evaluations for vision-based driving agents.
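To make the autoregressive rollout concrete, the sketch below shows how a model trained on short clips can be chained into a long video: the last few frames of each generated chunk serve as motion frames conditioning the next chunk. This is a minimal illustration of the paradigm described above; `denoise_chunk`, `conditions`, and the argument names are placeholders, not DreamForge's actual API.

```python
import torch

@torch.no_grad()
def generate_long_video(denoise_chunk, conditions, chunk_len=7,
                        num_chunks=30, num_motion_frames=2):
    """Roll out `num_chunks` short clips, chaining them via motion frames.

    `denoise_chunk` stands in for one run of the diffusion sampler on a
    short clip, conditioned on scene layouts/text and the previous motion
    frames (hypothetical signature).
    """
    video = []            # list of (T, V, C, H, W) multiview clips
    motion_frames = None  # conditioning frames carried over from the last chunk
    for i in range(num_chunks):
        chunk = denoise_chunk(conditions[i],
                              motion_frames=motion_frames,
                              length=chunk_len)        # (T, V, C, H, W)
        video.append(chunk)
        motion_frames = chunk[-num_motion_frames:]     # reuse the tail as context
    return torch.cat(video, dim=0)                     # (num_chunks * T, V, C, H, W)
```

Because each chunk only ever sees a short conditioning window, the rollout length is not bounded by the training sequence length, which is how 200+ frame videos come from a short-sequence model.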

Architecture

(a) Overall framework. During the denoising process, DreamForge leverages various conditions to enhance the modeling of driving scenes. Additionally, we introduce perspective guidance and incorporate object-wise position encoding (OPE) to improve street and foreground generation. We also implement motion-aware temporal attention (MTA) to enhance temporal coherence, supporting long-term video generation through autoregression. "P" denotes the perspective projection. (b) The overall procedure of OPE: only frustum sampling points that fall inside the 3D bounding boxes are encoded into the object position embedding. (c) The detailed architecture of MTA, which learns motion cues from motion frames, ego poses, and bidirectional feature differences.
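As a rough illustration of panel (b), the sketch below encodes only the frustum sampling points that land inside a 3D bounding box and masks everything else to zero. It assumes axis-aligned (min, max) box corners and a sinusoidal embedding; the actual OPE may differ in both respects.

```python
import math
import torch

def sinusoidal_embed(pts, num_freqs=8):
    """Encode 3D coordinates with sine/cosine features (one common choice;
    the paper's embedding may differ)."""
    freqs = (2.0 ** torch.arange(num_freqs, device=pts.device)) * math.pi
    angles = pts[..., None] * freqs                        # (..., 3, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

def object_position_embedding(frustum_pts, boxes):
    """frustum_pts: (N, D, 3) points sampled along each camera ray
    (N rays, D depth bins). boxes: (B, 2, 3) per-object (min, max) corners;
    oriented boxes would first require transforming points into each box frame.
    Returns (N, D, E) embeddings, zeroed for points outside every box."""
    pts = frustum_pts[:, :, None, :]                       # (N, D, 1, 3)
    lo, hi = boxes[:, 0], boxes[:, 1]                      # (B, 3) each
    inside = ((pts >= lo) & (pts <= hi)).all(-1).any(-1)   # (N, D): in any box?
    return sinusoidal_embed(frustum_pts) * inside[..., None].float()
```

Panel (c) can be read similarly. The sketch below is one plausible reading of MTA given its three stated inputs: temporal attention whose queries are augmented with an ego-pose embedding and bidirectional feature differences, and whose keys/values also cover the motion-frame features. All layer names and shapes here are assumptions, not the released code.

```python
import torch
import torch.nn as nn

class MotionAwareTemporalAttention(nn.Module):
    """Hypothetical MTA block: temporal self-attention conditioned on
    motion-frame features, ego poses, and bidirectional feature differences."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pose_proj = nn.Linear(12, dim)       # flattened 3x4 ego pose per frame
        self.diff_proj = nn.Linear(2 * dim, dim)  # fuse forward/backward differences

    def forward(self, feats, motion_feats, ego_poses):
        # feats: (B, T, C) temporal tokens at one spatial location
        # motion_feats: (B, M, C) tokens from the conditioning motion frames
        # ego_poses: (B, T, 12) flattened ego transforms
        fwd = torch.diff(feats, dim=1, prepend=feats[:, :1])
        bwd = torch.diff(feats.flip(1), dim=1, prepend=feats.flip(1)[:, :1]).flip(1)
        motion_cue = self.diff_proj(torch.cat([fwd, bwd], dim=-1))
        q = feats + motion_cue + self.pose_proj(ego_poses)
        kv = torch.cat([motion_feats, q], dim=1)   # also attend over motion frames
        out, _ = self.attn(q, kv, kv)
        return feats + out                         # residual connection
```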

Long Multiview Video Generation

Smooth transition between weather conditions

"Autoregressive generation with smooth weather transitions: 0s Rainy 8s Sunny 14s Night"

Various driving scenes

1. A driving scenario in which the ego vehicle makes a right turn (12 Hz, 20 s).
2. A driving scenario in which the ego vehicle goes straight (12 Hz, 20 s).

Various weather conditions

1. Sunny: "A driving scene video at boston-seaport. daytime, downtown, straight road, red building, white buses, green trees."
2. Rainy: "A driving scene video at boston-seaport. rainy, downtown, straight road, red building, white buses, green trees."
3. Night: "A driving scene video at singapore-hollandvillage. night, congestion. difficult lighting. very dark."

Integration with Traffic Manager

1. "A driving scene image at singapore. daytime, sunny, downtown."
2. "A driving scene image at singapore. daytime, sunny, downtown."