DreamForge: Motion-Aware Autoregressive Video Generation for
Multiview Driving Scenes



1 Zhejiang University, 2 Shanghai Artificial Intelligence Laboratory, 3 University of Science and Technology of China,
4 Technical University of Munich
Corresponding authors

Abstract

Recent advances in diffusion models have improved controllable streetscape generation and supported downstream perception and planning tasks. However, accurately modeling driving scenes and generating long videos remain challenging. To address these issues, we propose DreamForge, a diffusion-based autoregressive video generation model tailored for 3D-controllable long-term generation. To improve lane and foreground generation, we introduce perspective guidance and integrate object-wise position encoding, which incorporates local 3D correlation and improves foreground object modeling. We also propose motion-aware temporal attention to capture motion cues and appearance changes in videos. By leveraging motion frames and an autoregressive generation paradigm, we can generate long videos (over 200 frames) with a 7-frame model, achieving superior quality compared to the baseline in 16-frame video evaluations. Finally, we integrate our method with the realistic simulation platform DriveArena to provide more reliable open-loop and closed-loop evaluations for vision-based driving agents.
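The autoregressive paradigm described in the abstract can be illustrated with a minimal sketch. It assumes that each step produces one 7-frame clip conditioned on the last few frames already generated (the "motion frames"). The function names, the number of motion frames (2), and the stand-in generator are hypothetical and are not taken from the DreamForge code; a real generator would run the diffusion denoising loop over multiview images.

```python
import math
from typing import Callable, List

Frame = int  # placeholder type; a real frame would be a multiview image tensor


def num_steps(total_frames: int, clip_len: int = 7) -> int:
    """Number of generation steps needed to cover total_frames."""
    return math.ceil(total_frames / clip_len)


def rollout(
    generate_clip: Callable[[List[Frame], int], List[Frame]],
    total_frames: int,
    clip_len: int = 7,           # clip length of the short-clip model
    num_motion_frames: int = 2,  # assumption: not specified in the text above
) -> List[Frame]:
    """Chain short clips into a long video, one clip per step."""
    video: List[Frame] = []
    while len(video) < total_frames:
        # The most recent frames act as motion frames for the next clip.
        # The list is empty on the first step, when nothing exists yet.
        motion = video[-num_motion_frames:] if num_motion_frames else []
        clip = generate_clip(motion, clip_len)
        if len(clip) != clip_len:
            raise ValueError("generator must return exactly clip_len frames")
        video.extend(clip)
    # The last clip may overshoot the requested length, so trim it.
    return video[:total_frames]


def counting_generator(motion: List[Frame], clip_len: int) -> List[Frame]:
    """Stand-in for the diffusion model: continues counting from the last motion frame."""
    start = motion[-1] + 1 if motion else 0
    return list(range(start, start + clip_len))


# A 19 s video at 12 Hz has 12 * 19 = 228 frames.
video = rollout(counting_generator, total_frames=12 * 19)
print(len(video), num_steps(12 * 19))
```

Under these assumptions, the 19-second, 12 Hz examples shown on this page would correspond to 228 frames, produced in 33 steps of 7 frames each with the final clip trimmed.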

Architecture

(a) Overall framework. During the denoising process, DreamForge leverages various conditions to enhance the modeling of driving scenes. We introduce perspective guidance and incorporate object-wise position encoding (OPE) to improve street and foreground generation. We also implement motion-aware temporal attention (MTA) to enhance temporal coherence, supporting long-term video generation through autoregression. "P" denotes the perspective projection. (b) The overall procedure of OPE. Only the frustum sampling points that fall inside 3D bounding boxes are encoded into the object position embedding. (c) The detailed architecture of MTA, which learns motion cues from motion frames, ego poses, and bidirectional feature differences.
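The point selection described in panel (b) can be sketched as a point-in-box test: keep only the sampled 3D points that lie inside at least one object box. This is an illustrative sketch, not the authors' implementation. The box layout `(cx, cy, cz, length, width, height, yaw)`, the function names, and the use of yaw-only rotation are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
# Hypothetical box layout: center (cx, cy, cz), size (length, width, height), yaw about z.
Box = Tuple[float, float, float, float, float, float, float]


def point_in_box(p: Point, box: Box) -> bool:
    """Return True if point p lies inside the yaw-rotated 3D box."""
    cx, cy, cz, length, width, height, yaw = box
    dx, dy, dz = p[0] - cx, p[1] - cy, p[2] - cz
    # Rotate the offset by -yaw to express it in the box's own frame.
    c, s = math.cos(yaw), math.sin(yaw)
    local_x = c * dx + s * dy
    local_y = -s * dx + c * dy
    return (
        abs(local_x) <= length / 2
        and abs(local_y) <= width / 2
        and abs(dz) <= height / 2
    )


def object_point_mask(points: Sequence[Point], boxes: Sequence[Box]) -> List[bool]:
    """Mark the sampled points that fall inside any object box."""
    return [any(point_in_box(p, b) for b in boxes) for p in points]


# Four sample points and one 4 x 2 x 2 box centered 10 units ahead, yaw = 0.
points = [(10.0, 0.0, 0.0), (11.9, 0.9, 0.9), (13.0, 0.0, 0.0), (10.0, 1.5, 0.0)]
boxes = [(10.0, 0.0, 0.0, 4.0, 2.0, 2.0, 0.0)]
mask = object_point_mask(points, boxes)
print(mask)
```

In a pipeline of this kind, only the points where the mask is true would be passed to the position encoder, so the embedding carries object-local 3D information rather than the whole frustum.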

Long Multiview Video Generation

Various driving scenes

1. A case in the countryside (12 Hz, 19 s).
2. A case at an intersection (12 Hz, 19 s).

Various weather conditions

1. Sunny: "A driving scene image at boston-seaport. sunny, daytime, suburban, straight road."
2. Rainy: "A driving scene image at boston-seaport. rainy, cloudy, suburban, wet road."
3. Nighttime: "A driving scene image at singapore-hollandvillage. night, clear, suburban, streetlights."

Integration with Traffic Manager

1. "A driving scene image at singapore. daytime, sunny, downtown."
2. "A driving scene image at singapore. daytime, sunny, downtown."