
LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking



1 Shanghai Artificial Intelligence Laboratory, 2 Zhejiang University, 3 Technical University of Munich,
4 Tongji University, 5 University of Science and Technology of China
Corresponding author

LeapVAD in DriveArena

LeapVAD in Carla

Abstract

Autonomous driving technology has made significant progress; however, data-driven methods continue to struggle in complex scenarios due to their limited reasoning ability. Meanwhile, knowledge-driven autonomous driving systems have evolved considerably with the recent popularization of visual language models. We therefore propose LeapVAD, a novel knowledge-driven framework. Our method emulates the human attentional mechanism, selectively focusing on key traffic objects that influence driving decisions. LeapVAD simplifies environmental representation and reduces decision-making complexity by describing the attributes of these objects, such as appearance, motion, and risk. Furthermore, LeapVAD incorporates an innovative dual-process decision-making module that mimics how humans learn to drive. This module comprises an Analytic Process (System-II) that accumulates driving experience through logical reasoning without human intervention, and a Heuristic Process (System-I) that develops from this knowledge via fine-tuning and few-shot learning. LeapVAD also includes reflective mechanisms and a growing memory bank, enabling it to learn from past mistakes and continuously improve its performance in a closed-loop environment. We further train a Scene Encoder network to generate scene tokens, which provide a compact representation for the efficient retrieval of relevant driving experiences. We conducted experiments in two renowned self-driving simulators, Carla and DriveArena. Our method, trained with less data, outperforms other approaches that rely solely on camera input. Extensive ablation studies further highlight its continuous learning and transfer capabilities.
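The scene-token retrieval described above can be pictured as a nearest-neighbor search over a bank of stored embeddings. Below is a minimal Python sketch, assuming the Scene Encoder yields one fixed-size vector per scene; the class name SceneMemoryBank, the token dimensionality, and the cosine-similarity ranking are illustrative assumptions, not the paper's actual implementation.

    # Minimal sketch of scene-token-based experience retrieval.
    # All names and the 512-dim token size are assumptions for illustration.
    import numpy as np

    class SceneMemoryBank:
        def __init__(self, dim: int = 512):
            self.tokens = np.empty((0, dim), dtype=np.float32)  # stored scene tokens
            self.experiences: list[dict] = []                   # paired driving records

        def add(self, token: np.ndarray, experience: dict) -> None:
            """Store a scene token together with its driving experience."""
            self.tokens = np.vstack([self.tokens, token[None, :]])
            self.experiences.append(experience)

        def retrieve(self, query: np.ndarray, k: int = 3) -> list[dict]:
            """Return the k experiences whose scene tokens are most similar
            to the query token, ranked by cosine similarity."""
            if not self.experiences:
                return []
            sims = self.tokens @ query / (
                np.linalg.norm(self.tokens, axis=1) * np.linalg.norm(query) + 1e-8
            )
            top = np.argsort(-sims)[:k]
            return [self.experiences[i] for i in top]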

LeapVAD Architecture

The architecture of LeapVAD consists of two primary modules: scene understanding and dual-process decision-making. The scene understanding module analyzes multi-view or multi-frame images, identifying critical objects and generating a scene token that serves as a characteristic representation of the current scene. The dual-process decision-making module then uses the scene description, guided by traffic rules, to reason and make decisions. These decisions are converted into control signals that navigate the ego car in the simulator. Specifically, the Analytic Process accumulates an initial memory bank used to train the Heuristic Process, and updates it whenever the Heuristic Process encounters accidents. The Heuristic Process leverages scene tokens to efficiently retrieve the most relevant historical scenarios from this memory bank, enabling rapid and informed driving decisions, as sketched below.
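To make the interaction between the two processes concrete, here is a hedged sketch of one decision step; HeuristicProcess, AnalyticProcess, and the simulator interface are hypothetical names, and the reflection logic is a simplification of the mechanism described above.

    # Hedged sketch of the dual-process decision loop (names are assumptions).
    def drive_step(scene, memory_bank, heuristic, analytic, simulator):
        token, description = scene.encode()                 # scene understanding
        examples = memory_bank.retrieve(token, k=3)         # few-shot retrieval
        decision = heuristic.decide(description, examples)  # fast System-I decision
        outcome = simulator.apply(decision)                 # execute control signal

        if outcome.accident:
            # Slow System-II reasoning reflects on the failure and stores the
            # corrected experience, so System-I improves in the closed loop.
            corrected = analytic.reflect(description, decision, outcome)
            memory_bank.add(token, corrected)
        return decision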

VLM instruction datasets

We create a dataset for instruction tuning of the VLM, derived from DriveLM, Rank2Tell, and Carla. The dataset comprises two types of annotations: multi-view and multi-frame. Multi-view annotations include both a summary and an elaboration, while multi-frame annotations consist solely of a summary. Compared to multi-view annotations, multi-frame annotations provide additional temporal information, such as exact velocity and motion trends; illustrative samples are sketched below.
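For illustration, the two annotation types might look like the following records; all field names here are assumptions chosen for clarity, not the released dataset's actual schema.

    # Hypothetical annotation records for the two dataset types.
    multi_view_sample = {
        "type": "multi-view",
        "images": ["CAM_FRONT.jpg", "CAM_LEFT.jpg", "CAM_RIGHT.jpg"],
        "summary": "A pedestrian is crossing ahead; a parked truck blocks the right lane.",
        "elaboration": "The pedestrian at the crosswalk may enter the ego lane; "
                       "slowing down reduces collision risk.",
    }

    multi_frame_sample = {
        "type": "multi-frame",
        "frames": ["t-2.jpg", "t-1.jpg", "t.jpg"],
        # Multi-frame summaries add temporal cues such as velocity and motion trend.
        "summary": "The lead vehicle at roughly 8 m/s is decelerating and drifting left.",
    }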

Scene Encoder

Case Studies

BibTeX


      @misc{ma2025leapvadleapautonomousdriving,
          title={LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking}, 
          author={Yukai Ma and Tiantian Wei and Naiting Zhong and Jianbiao Mei and Tao Hu and Licheng Wen and Xuemeng Yang and Botian Shi and Yong Liu},
          year={2025},
          eprint={2501.08168},
          archivePrefix={arXiv},
          primaryClass={cs.AI},
          url={https://arxiv.org/abs/2501.08168}, 
      }