We identify and demonstrate the limitations of state-of-the-art rule-based metrics like EPDMS, showing that they lack context awareness and often misalign with expert human judgment in nuanced driving scenarios. Despite widespread adoption, these metrics operate on predefined fixed rules and thresholds that struggle to capture human-like judgment in complex traffic situations.
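To make the failure mode concrete, here is a toy sketch (not the actual EPDMS formula) of why a fixed-threshold rule cannot be context-aware: it scores only the measured quantity, never the reason behind it.

```python
# Toy illustration (NOT the real EPDMS formula): a fixed-threshold
# lane-keeping rule cannot distinguish an unjustified lane departure
# from one that avoids an obstacle, because it only sees the offset.
def rule_based_lane_score(lane_offset_m: float, threshold_m: float = 0.5) -> float:
    """Return 1.0 if the trajectory stays within the lane tolerance, else 0.0."""
    return 1.0 if abs(lane_offset_m) <= threshold_m else 0.0

# Both trajectories leave the lane by 1.2 m, so both score 0.0, even
# though one departure avoids a double-parked car (context the rule
# never sees) and the other is an unforced error.
justified = rule_based_lane_score(1.2)   # avoiding an obstacle
unforced = rule_based_lane_score(1.2)    # no reason to depart
```

The threshold value 0.5 m here is an arbitrary placeholder; the point is that any fixed cutoff produces identical scores for contextually different behaviors.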
We introduce the DriveCritic dataset, a curated dataset sampled from NAVSIM comprising 5,730 trajectory pairs for assessing driving evaluation methods. The dataset features challenging scenarios annotated with pairwise expert human preferences, organized into two diagnostic case studies: (1) Lane-Progress Trade-off scenarios where human drivers sacrifice lane keeping to maintain progress, and (2) Progress-only Contrast scenarios focusing on context-aware evaluation of driving progress.
We propose the DriveCritic model, a VLM-based evaluator that leverages powerful contextual reasoning and common-sense knowledge. The model is built on Qwen2.5-VL-7B and conditions on four inputs: stitched three-camera views, a BEV map with scene context and overlaid candidate trajectories, ego-vehicle status, and EPDMS sub-scores. It is fine-tuned via a two-stage pipeline of supervised fine-tuning (SFT) followed by reinforcement learning (DAPO), achieving 76.0% accuracy in aligning with human expert preferences and significantly outperforming all baselines.
The DriveCritic dataset is sampled and constructed from NAVSIM, comprising 5,730 trajectory pairs curated as a pilot benchmark to highlight the need for context-aware evaluation. We mine ambiguous scenarios where EPDMS misjudges human trajectories through the lane keeping (LK) and ego progress (EP) scores, and formulate the task as a pairwise adjudication problem.
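The mining-and-adjudication formulation can be sketched as follows. Field names, thresholds, and the `Scenario` container are illustrative assumptions, not the paper's exact data schema.

```python
from dataclasses import dataclass

# Sketch of the mining step: keep scenarios where EPDMS penalizes the
# human trajectory via the lane-keeping (LK) or ego-progress (EP)
# sub-scores, then pose evaluation as a pairwise adjudication problem.
@dataclass
class Scenario:
    human_lk: float    # EPDMS lane-keeping sub-score, human trajectory
    human_ep: float    # EPDMS ego-progress sub-score, human trajectory
    planner_lk: float  # same sub-scores for the planner trajectory
    planner_ep: float

def is_ambiguous(s: Scenario, lk_cut: float = 0.5, ep_cut: float = 0.5) -> bool:
    """True if EPDMS misjudges the human trajectory on LK or EP."""
    return s.human_lk < lk_cut or s.human_ep < ep_cut

def mine_pairs(scenarios):
    """Turn ambiguous scenarios into pairwise adjudication cases,
    asking: which of the two trajectories is actually better?"""
    return [s for s in scenarios if is_ambiguous(s)]
```

The cutoffs (0.5) are placeholders; any rule that flags scenarios where the expert human trajectory scores poorly under EPDMS serves the same purpose.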
We construct two diagnostic case studies: (1) Lane-Progress Trade-off, where human drivers sacrifice lane keeping to maintain progress, and (2) Progress-only Contrast, which focuses on context-aware evaluation of driving progress.
The DriveCritic model leverages Qwen2.5-VL-7B as its backbone and is prompted as an expert driving evaluator. The model conditions on four inputs: (i) a stitched three-camera view (left-front, front, right-front), (ii) a BEV map with scene context where two candidate trajectories are overlaid separately, (iii) the ego-vehicle status (acceleration, velocity, driving command), and (iv) EPDMS sub-scores (Ego Progress and Lane Keeping).
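Assembling the four conditioning inputs into a Qwen2.5-VL-style chat message might look like the sketch below. The file paths, sub-score keys, and prompt wording are illustrative placeholders, not the exact DriveCritic prompt.

```python
# Sketch: pack the four inputs (stitched camera view, BEV map with
# overlaid trajectories, ego status, EPDMS sub-scores) into one
# Qwen2.5-VL-style user message for pairwise adjudication.
def build_evaluator_message(camera_path, bev_path, ego_status, sub_scores):
    status_text = (
        f"Ego status: v={ego_status['velocity_mps']:.1f} m/s, "
        f"a={ego_status['accel_mps2']:.1f} m/s^2, "
        f"command={ego_status['command']}."
    )
    score_text = (
        f"EPDMS sub-scores: EP(A)={sub_scores['ep_a']:.2f}, "
        f"EP(B)={sub_scores['ep_b']:.2f}, "
        f"LK(A)={sub_scores['lk_a']:.2f}, LK(B)={sub_scores['lk_b']:.2f}."
    )
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": camera_path},  # stitched 3-camera view
            {"type": "image", "image": bev_path},     # BEV map, both trajectories
            {"type": "text", "text": status_text},
            {"type": "text", "text": score_text},
            {"type": "text", "text": "You are an expert driving evaluator. "
                                     "Which candidate trajectory, A or B, is better?"},
        ],
    }]
```

The message would then be rendered with the model's chat template and passed through the processor alongside the two images.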
We adopt a two-stage training pipeline: supervised fine-tuning (SFT) to warm up the evaluator, followed by reinforcement learning with DAPO using format and accuracy rewards.
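The RL-stage reward can be sketched as a weighted sum of a format term and an accuracy term. The answer-tag convention and the weights are assumptions for illustration, not the paper's exact reward specification.

```python
import re

# Sketch of a DAPO-style reward combining a format reward (the answer is
# wrapped in the expected tags) and an accuracy reward (the predicted
# preference matches the human label). Tag names and weights are
# illustrative assumptions.
def reward(completion: str, human_label: str) -> float:
    m = re.search(r"<answer>\s*([AB])\s*</answer>", completion)
    format_r = 1.0 if m else 0.0
    accuracy_r = 1.0 if (m and m.group(1) == human_label) else 0.0
    return 0.1 * format_r + 0.9 * accuracy_r
```

A well-formatted correct answer scores 1.0, a well-formatted wrong answer keeps only the small format bonus, and an unparseable completion scores 0.0.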
DriveCritic achieves 76.0% accuracy on the DriveCritic test set, significantly outperforming all baselines including rule-based metrics (EPDMS), general-purpose VLMs (GPT-5, OpenAI-o3, Qwen2.5-VL-7B), and a supervised pairwise classifier.
| Method | Type | Accuracy (%) |
|---|---|---|
| EPDMS | Rule-based | 46.6 |
| Qwen2.5-VL-7B | General VLM (open) | 53.2 |
| GPT-5 | General VLM (closed) | 60.9 |
| OpenAI-o3 | General VLM (closed) | 60.1 |
| Supervised Classifier | Supervised | 63.1 |
| DriveCritic (Ours) | VLM + SFT + DAPO | 76.0 |
Our ablation studies show that the two-stage training pipeline is critical: applying RL alone can reduce accuracy, while SFT provides necessary warm-up. The full recipe (SFT + DAPO with format and accuracy rewards) achieves the best performance. The model also demonstrates high robustness to input permutations, confirming the effectiveness of the training strategy.
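The permutation-robustness check can be sketched as follows: present each trajectory pair in both orders and verify the evaluator's preference flips consistently. Here `evaluate` is a hypothetical stand-in for a DriveCritic forward pass that returns "A" or "B".

```python
# Sketch of an order-invariance check: a robust pairwise evaluator
# should prefer the same underlying trajectory regardless of whether
# it is presented as candidate A or candidate B.
def is_permutation_consistent(evaluate, pair) -> bool:
    pred = evaluate(pair[0], pair[1])          # original presentation order
    pred_swapped = evaluate(pair[1], pair[0])  # swapped presentation order
    flip = {"A": "B", "B": "A"}
    return pred_swapped == flip[pred]

def permutation_robustness(evaluate, pairs) -> float:
    """Fraction of pairs whose preference is invariant to input order."""
    return sum(is_permutation_consistent(evaluate, p) for p in pairs) / len(pairs)
```

An evaluator that keys on content rather than position scores 1.0 under this check; one that, say, always answers "A" scores 0.0.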
@article{song2025drivecritic,
title = {DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models},
author = {Song, Jingyu and Li, Zhenxin and Lan, Shiyi and Sun, Xinglong and Chang, Nadine and Shen, Maying and Chen, Joshua and Skinner, Katherine A. and Alvarez, Jose M.},
journal = {arXiv preprint arXiv:2510.13108},
year = {2025}
}