DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes

Abstract

Recent advances in diffusion models have improved controllable streetscape generation and supported downstream perception and planning tasks. However, accurately modeling driving scenes and generating long videos remain challenging. To alleviate these issues, we propose DreamForge, a diffusion-based autoregressive video generation model tailored for 3D-controllable long-term generation. To enhance lane and foreground generation, we introduce perspective guidance and design object-wise position encoding to incorporate local 3D correlation and improve foreground object modeling. We also propose motion-aware temporal attention to capture motion cues and appearance changes in videos. By leveraging motion frames and an autoregressive generation paradigm, we can generate long videos (over 200 frames) using a model trained on short sequences, while achieving superior quality compared to the baseline in 16-frame video evaluations. Finally, we integrate our method with the realistic simulator DriveArena to provide more reliable open-loop and closed-loop evaluations for vision-based driving agents.
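The autoregressive paradigm described in the abstract can be illustrated with a minimal sketch. It is not the DreamForge implementation: `generate_clip`, the clip length of 16, and the use of two motion frames are all hypothetical placeholders, and frames are stand-in integers rather than multi-view images. The sketch only shows the control flow in which the last frames of the video so far condition the next short clip.

```python
from typing import Callable, List, Sequence

Frame = int  # stand-in for a multi-view frame; a real model would use image tensors


def autoregressive_rollout(
    generate_clip: Callable[[Sequence[Frame], int], List[Frame]],
    initial_frames: Sequence[Frame],
    total_frames: int,
    clip_len: int = 16,          # assumed clip length, not taken from the source
    num_motion_frames: int = 2,  # assumed number of conditioning frames
) -> List[Frame]:
    """Extend a video clip by clip, conditioning each clip on the latest frames."""
    video = list(initial_frames)
    while len(video) < total_frames:
        # The most recent frames act as "motion frames" for the next clip.
        motion_frames = video[-num_motion_frames:]
        n = min(clip_len, total_frames - len(video))
        video.extend(generate_clip(motion_frames, n))
    return video[:total_frames]


def dummy_generate_clip(motion_frames: Sequence[Frame], n: int) -> List[Frame]:
    """Hypothetical generator: continues counting from the last motion frame."""
    last = motion_frames[-1]
    return [last + i + 1 for i in range(n)]


video = autoregressive_rollout(dummy_generate_clip, [0, 1], total_frames=200)
print(len(video), video[:3], video[-1])
```

Because each call only sees a fixed number of motion frames, the generator never needs to be trained on sequences longer than one clip, which is the property the abstract relies on for generating more than 200 frames.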

Architecture

(a) Overall framework. During the denoising process, DreamForge leverages various conditions to enhance the modeling of driving scenes. We introduce perspective guidance and incorporate object-wise position encoding (OPE) to improve street and foreground generation. We also implement motion-aware temporal attention (MTA) to enhance temporal coherence, supporting long-term video generation through autoregression. "P" denotes the perspective projection. (b) The overall procedure of OPE. Only the frustum sampling points that fall inside the 3D bounding boxes are encoded into the object position embedding. (c) The detailed architecture of MTA, which learns motion cues from motion frames, ego poses, and bidirectional feature differences.
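The point-selection step in panel (b) can be sketched as follows. This is an illustrative approximation under stated assumptions, not the authors' code: it assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, samples frustum points by back-projecting a pixel at several candidate depths, and uses axis-aligned boxes in the camera frame as a simplification of oriented 3D bounding boxes. The resulting mask marks which points would be passed to a position encoder.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (min corner, max corner); axis-aligned simplification


def frustum_points(u: float, v: float, depths: Sequence[float],
                   fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Back-project pixel (u, v) to camera-frame 3D points at each candidate depth."""
    return [((u - cx) * d / fx, (v - cy) * d / fy, d) for d in depths]


def inside_box(p: Point, box: Box) -> bool:
    """True if p lies within the axis-aligned box (boundaries included)."""
    lo, hi = box
    return all(l <= c <= h for c, l, h in zip(p, lo, hi))


def object_point_mask(points: Sequence[Point], boxes: Sequence[Box]) -> List[bool]:
    """Mark the frustum points that fall inside at least one 3D box."""
    return [any(inside_box(p, b) for b in boxes) for p in points]


# Hypothetical intrinsics and a single box 4 m to 11 m in front of the camera.
points = frustum_points(50.0, 50.0, [1.0, 5.0, 10.0, 20.0],
                        fx=100.0, fy=100.0, cx=50.0, cy=50.0)
boxes = [((-1.0, -1.0, 4.0), (1.0, 1.0, 11.0))]
mask = object_point_mask(points, boxes)
print(mask)
```

Only the points where the mask is true would contribute to an object position embedding; the remaining background points are skipped, which keeps the encoding focused on local 3D structure around foreground objects.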

Use UniAD to perform planning on the keyframes (2 Hz) for the generated video.

Long Multiview Video Generation

Smooth transition between weather conditions

Various driving scenes

Various weather conditions

Integration with Traffic Manager

Visualizations of the simulation within DriveArena.

Visual comparison. Our DreamForge outperforms the baseline in foreground object generation.

Our DreamForge can adapt to the road layouts and 3D bounding boxes generated by DriveArena.
