DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models

University of Michigan · NVIDIA
DriveCritic Overview

DriveCritic is a novel framework for context-aware, human-aligned evaluation of autonomous driving planners using Vision-Language Models.

Abstract

Benchmarking autonomous driving planners to align with human judgment remains a critical challenge, as state-of-the-art metrics like the Extended Predictive Driver Model Score (EPDMS) lack context awareness in nuanced scenarios. To address this, we introduce DriveCritic, a novel framework featuring two key contributions: the DriveCritic dataset, a curated collection of challenging scenarios in which context is critical for correct judgment, annotated with pairwise human preferences, and the DriveCritic model, a Vision-Language Model (VLM) based evaluator. Fine-tuned using a two-stage supervised and reinforcement learning pipeline, the DriveCritic model learns to adjudicate between trajectory pairs by integrating visual and symbolic context. Experiments show that DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates strong context awareness. Overall, our work provides a more reliable, human-aligned foundation for evaluating autonomous driving systems.

Overview

We identify and demonstrate the limitations of state-of-the-art rule-based metrics like EPDMS, showing that they lack context awareness and fail to align with expert human judgment in nuanced driving scenarios. Despite widespread adoption, these metrics rely on fixed, predefined rules and thresholds that struggle to capture human-like judgment in complex traffic situations.

We introduce the DriveCritic dataset, a curated dataset sampled from NAVSIM comprising 5,730 trajectory pairs for assessing driving evaluation methods. The dataset features challenging scenarios annotated with pairwise expert human preferences, organized into two diagnostic case studies: (1) Lane-Progress Trade-off scenarios where human drivers sacrifice lane keeping to maintain progress, and (2) Progress-only Contrast scenarios focusing on context-aware evaluation of driving progress.

We propose the DriveCritic model, a VLM-based evaluator that leverages powerful contextual reasoning and common-sense knowledge. The model is built on Qwen2.5-VL-7B and conditions on four inputs: stitched three-camera views, a BEV map with scene context and overlaid candidate trajectories, ego-vehicle status, and EPDMS sub-scores. It is fine-tuned via a two-stage pipeline of supervised fine-tuning (SFT) followed by reinforcement learning (DAPO), achieving 76.0% accuracy in aligning with human expert preferences — significantly outperforming all baselines.

Method

DriveCritic Dataset

The DriveCritic dataset is sampled and constructed from NAVSIM, comprising 5,730 trajectory pairs curated as a pilot benchmark to highlight the need for context-aware evaluation. We mine ambiguous scenarios in which EPDMS misjudges the human trajectory via its lane keeping (LK) and ego progress (EP) sub-scores, and formulate evaluation as a pairwise adjudication problem.

We construct two diagnostic case studies:

  • Case 1 (Lane-Progress Trade-off): Pairs where the human has LK=0 and high EP versus a vocabulary alternative with LK=1 and lower EP, along with mirror pairs to prevent degenerate rules.
  • Case 2 (Progress-only Contrast): Human trajectories with lower EP paired with vocabulary trajectories receiving notably higher EP while other sub-scores are perfect.
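The mining criteria for the two case studies can be sketched as simple filters over per-trajectory sub-scores. This is an illustrative reconstruction, not the paper's actual pipeline: the field names (`lk`, `ep`, `no_collision`, `drivable_area`, `comfort`) and the EP margin are hypothetical.

```python
# Hypothetical sketch of the pair-mining filters described above.
# Sub-score keys and the EP margin are illustrative, not the paper's schema.

def is_case1_pair(human: dict, alternative: dict) -> bool:
    """Lane-Progress Trade-off: the human sacrifices lane keeping (LK=0)
    for higher progress, versus an alternative with LK=1 and lower EP."""
    return (
        human["lk"] == 0
        and alternative["lk"] == 1
        and human["ep"] > alternative["ep"]
    )

def is_case2_pair(human: dict, alternative: dict, margin: float = 0.2) -> bool:
    """Progress-only Contrast: the alternative receives notably higher EP
    while all other sub-scores are perfect for both trajectories."""
    others_perfect = all(
        human[k] == 1 and alternative[k] == 1
        for k in ("no_collision", "drivable_area", "comfort")
    )
    return others_perfect and alternative["ep"] - human["ep"] >= margin
```

The mirror pairs mentioned for Case 1 would simply swap which trajectory plays the "human" role, so that a degenerate rule such as "always prefer LK=1" cannot fit the data.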

DriveCritic Model

The DriveCritic model leverages Qwen2.5-VL-7B as its backbone and is prompted as an expert driving evaluator. The model conditions on four inputs: (i) a stitched three-camera view (left-front, front, right-front), (ii) a BEV map with scene context where two candidate trajectories are overlaid separately, (iii) the ego-vehicle status (acceleration, velocity, driving command), and (iv) EPDMS sub-scores (Ego Progress and Lane Keeping).
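Assembling the four conditioning inputs into an evaluator prompt might look like the sketch below. The message schema follows the generic image-plus-text content-list format common to VLM chat APIs; all strings, field names, and the question wording are illustrative, not the paper's actual prompt.

```python
# Illustrative sketch of assembling the four inputs into a multimodal prompt.
# The schema and wording are assumptions, not the paper's actual prompt.

def build_evaluator_messages(stitched_view_path, bev_map_path, ego_status, sub_scores):
    """Build a chat-style message list conditioning on the four inputs."""
    context = (
        f"Ego status: accel={ego_status['accel']:.2f} m/s^2, "
        f"vel={ego_status['vel']:.2f} m/s, command={ego_status['command']}. "
        f"EPDMS sub-scores: EP(A)={sub_scores['ep_a']:.2f}, LK(A)={sub_scores['lk_a']}, "
        f"EP(B)={sub_scores['ep_b']:.2f}, LK(B)={sub_scores['lk_b']}."
    )
    return [
        {"role": "system", "content": "You are an expert driving evaluator."},
        {"role": "user", "content": [
            {"type": "image", "image": stitched_view_path},  # (i) stitched 3-camera view
            {"type": "image", "image": bev_map_path},        # (ii) BEV map with trajectories A/B
            {"type": "text", "text": context},               # (iii) ego status + (iv) sub-scores
            {"type": "text", "text": "Which candidate trajectory is better, A or B?"},
        ]},
    ]
```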

We adopt a two-stage training pipeline:

  • Stage 1 — Supervised Fine-Tuning (SFT): The base VLM is fine-tuned on 1,100 pairs with chain-of-thought reasoning traces generated by GPT-5 as a teacher model. This warm-up stage establishes the response format and grounds judgments in step-by-step reasoning.
  • Stage 2 — Reinforcement Learning (DAPO): After SFT, we further fine-tune using the RLVR paradigm with format and accuracy rewards. This stage enables the model to learn human-aligned preferences that go beyond the supervised demonstrations.
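A minimal sketch of the two reward terms used in the RL stage, assuming the model is asked to emit its verdict inside a tagged answer span; the tag format and the reward weights are illustrative assumptions, not the paper's exact recipe.

```python
import re

# Hypothetical answer-tag convention; the actual format is not specified here.
_ANSWER_RE = r"<answer>(A|B)</answer>"

def format_reward(response: str) -> float:
    """1.0 if the response contains exactly one well-formed answer tag."""
    return 1.0 if len(re.findall(_ANSWER_RE, response)) == 1 else 0.0

def accuracy_reward(response: str, human_choice: str) -> float:
    """1.0 if the extracted verdict matches the human preference label."""
    m = re.search(_ANSWER_RE, response)
    return 1.0 if m and m.group(1) == human_choice else 0.0

def total_reward(response: str, human_choice: str,
                 w_fmt: float = 0.1, w_acc: float = 1.0) -> float:
    """Weighted sum of format and accuracy rewards (weights are assumptions)."""
    return w_fmt * format_reward(response) + w_acc * accuracy_reward(response, human_choice)
```

This is the RLVR pattern: rewards are computed by verifiable checks on the output rather than by a learned reward model.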

Results

DriveCritic achieves 76.0% accuracy on the DriveCritic test set, significantly outperforming all baselines including rule-based metrics (EPDMS), general-purpose VLMs (GPT-5, OpenAI-o3, Qwen2.5-VL-7B), and a supervised pairwise classifier.

Method                 Type                   Accuracy (%)
EPDMS                  Rule-based             46.6
Qwen2.5-VL-7B          General VLM (open)     53.2
GPT-5                  General VLM (closed)   60.9
OpenAI-o3              General VLM (closed)   60.1
Supervised Classifier  Supervised             63.1
DriveCritic (Ours)     VLM + SFT + DAPO       76.0

Our ablation studies show that the two-stage training pipeline is critical: applying RL alone can reduce accuracy, while SFT provides the necessary warm-up. The full recipe (SFT + DAPO with format and accuracy rewards) achieves the best performance. The model also demonstrates high robustness to input permutations, confirming the effectiveness of the training strategy.
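One way to probe permutation robustness is to evaluate each pair in both presentation orders and measure how often the verdict stays consistent under the swap. The sketch below assumes a stand-in `evaluate(a, b)` function that returns "A" when it prefers its first argument; this is an illustrative harness, not the paper's evaluation code.

```python
def consistency_rate(pairs, evaluate):
    """Fraction of pairs whose verdict is stable when the two candidate
    trajectories are presented in swapped order.
    `evaluate(a, b)` is a stand-in evaluator returning "A" or "B"."""
    def flip(verdict):
        return "B" if verdict == "A" else "A"

    stable = sum(
        1 for a, b in pairs
        # A consistent evaluator that picks `a` from (a, b) must pick `a`
        # (i.e., answer "B" for the second slot) when shown (b, a).
        if evaluate(a, b) == flip(evaluate(b, a))
    )
    return stable / len(pairs)
```

An order-biased evaluator (e.g., one that always answers "A") scores 0.0 under this check, so the metric directly penalizes positional bias.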

BibTeX

@article{song2025drivecritic,
  title   = {DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models},
  author  = {Song, Jingyu and Li, Zhenxin and Lan, Shiyi and Sun, Xinglong and Chang, Nadine and Shen, Maying and Chen, Joshua and Skinner, Katherine A. and Alvarez, Jose M.},
  journal = {arXiv preprint arXiv:2510.13108},
  year    = {2025}
}