The QuantiPhy Challenge — a NeurIPS 2026 Competition on Quantitative Physical Reasoning in Vision–Language Models.
QuantiPhy asks competitors to build VLMs that estimate sizes, velocities, and accelerations directly from video — across microscopic, macroscopic, and astronomical scales — and quantifies how faithfully their predictions are grounded in what the model actually sees.
Plausible-sounding answers are not the same as numerically correct ones. QuantiPhy probes whether a VLM is measuring or merely guessing.
Vision–language models are increasingly asked to reason about the physical world — estimating the speed of a falling object, the diameter of a cell, or the acceleration of a galaxy cluster. Yet recent results show that contemporary VLMs rely heavily on memorized world knowledge rather than on the actual video and text presented to them.
The QuantiPhy benchmark, introduced at CVPR 2026 and re-released here as a NeurIPS competition, exposes this gap. Across 3,355 video–question pairs drawn from simulation, laboratory, and in-the-wild sources, the top frontier model trails the human baseline by more than two points — and collapses entirely under counterfactual prompts.
The QuantiPhy Challenge invites the community to close that gap, not only on raw accuracy but on the harder question that this benchmark was built to ask: are model predictions faithful to what the model actually sees?
Push Mean Relative Accuracy on size, velocity, and acceleration estimation toward the human ceiling.
A dedicated Open-Weight Track ensures the strongest methods remain reproducible and accessible to the broader community.
Solutions must hold from microscopic to astronomical regimes — the same physical relationships, observed at vastly different magnitudes.
Competitors may enter either or both tracks. Each track has its own leaderboard, evaluation protocol, and awards.
Main Track: The headline competition. Any model is permitted (proprietary, open-weight, or hybrid), and the only objective is raw numerical accuracy on the original QuantiPhy test set.
Open-Weight Track: Same scoring rule as the Main Track, but submissions must be based on publicly available model weights and tools, supporting reproducible, community-accessible research on quantitative physical reasoning.
QuantiPhy is grounded in a multi-source video corpus designed to isolate what a model knows from what a model sees.
Simulation: Fully controlled scenes with exact ground-truth annotations for position, velocity, and acceleration; the cleanest signal for diagnostic evaluation.
Laboratory: Multi-camera, calibrated recordings of real-world physical setups, controlled enough for ground truth yet real enough to expose perception failure modes.
In the wild: Curated real-world videos with expert annotations, including footage of natural phenomena, sports, microscopy, and astronomical observation.
Questions fall into four diagnostic categories (2S, 2D, 3S, 3D), one per prior×dimensionality setting. In every category the target quantities are size, velocity, and acceleration, and final scores are macro-averaged over the four categories.
The dataset, validation split, and a CC-licensed starter sample are available on Hugging Face.
Both tracks share the original QuantiPhy metric — a threshold-averaged measure that rewards approximate correctness rather than exact matches.
A prediction ŷ for ground truth y is judged correct at confidence threshold θ when:
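Assuming the usual MRA criterion, in which the relative error must fall below 1 − θ (an assumption here; the organizers' evaluation code is authoritative), the condition can be written as:

```latex
\[
\frac{\lvert \hat{y} - y \rvert}{\lvert y \rvert} < 1 - \theta
\]
```

Under that reading, a prediction of 9 against a ground truth of 10 has relative error 0.1, so it counts as correct for every threshold below 0.90 and fails at 0.90 and 0.95.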
MRA averages the indicator over the instance set D and the threshold set C = {0.50, 0.55, …, 0.95}:
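A sketch of that average in LaTeX, assuming the correctness indicator is the common relative-error test (error below 1 − θ) and writing ŷ_i and y_i for the prediction and ground truth of instance i (the indexing is notation introduced here):

```latex
\[
\mathrm{MRA}(D) = \frac{1}{\lvert D \rvert \, \lvert C \rvert}
\sum_{i \in D} \sum_{\theta \in C}
\mathbb{1}\!\left[ \frac{\lvert \hat{y}_i - y_i \rvert}{\lvert y_i \rvert} < 1 - \theta \right]
\]
```

With the threshold set given above, |C| = 10, so each instance contributes a score in steps of 0.1 between 0 and 1.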
The final track score is the macro-average of MRA across the four QuantiPhy categories — c ∈ {2S, 2D, 3S, 3D} — giving equal weight to each prior×dimensionality setting:
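Written out, with D_c denoting the test instances in category c (a symbol introduced here for illustration), the equal-weight macro-average would take the form:

```latex
\[
\mathrm{Score} = \frac{1}{4}
\sum_{c \,\in\, \{\mathrm{2S},\, \mathrm{2D},\, \mathrm{3S},\, \mathrm{3D}\}}
\mathrm{MRA}(D_c)
\]
```

Because each category is weighted by 1/4 regardless of how many instances it contains, a category with few instances influences the score as much as a large one.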
All deadlines are 23:59 Anywhere on Earth (AoE). The schedule is subject to ratification by the NeurIPS 2026 competition chairs.
Launch the official challenge website, publish rules and documentation, release the validation set with ground-truth answers, test the CSV evaluator, and conduct an internal dry run of the real-time leaderboard.
Open registration and public submissions. Release the starting kit, baseline scripts, example CSV files, evaluation code for the validation set, and documentation.
Participants submit CSV predictions to the online leaderboard and receive real-time scores. Organizers maintain the website, answer questions, update FAQs, and monitor submission integrity.
Freeze rules and leaderboard settings. Participants submit final CSV predictions. Top teams are asked to provide code or Docker containers for reproducibility verification.
Organizers verify top submissions, run reproducibility checks when applicable, compute final scores and bootstrap confidence intervals, and finalize rankings.
Prepare the competition report, invite top teams to submit short technical summaries, finalize workshop talks/posters, and publish final leaderboard analysis.
Present challenge design, final results, analysis of model performance, and selected participant methods at the in-person NeurIPS Competition Track workshop.
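The bootstrap confidence intervals mentioned in the verification phase are not specified further. As a rough illustration only, a plain percentile bootstrap over per-instance scores might look like the sketch below; the function name, the number of resamples, and the choice of resampling over instances rather than within categories are all assumptions.

```python
import random


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-instance scores.

    ASSUMPTION: a plain percentile bootstrap over instances; the organizers'
    actual procedure (e.g. resampling within categories) is not specified.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample n scores with replacement, record the mean, repeat.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Toy per-instance scores, for illustration only.
toy_scores = [0.0, 0.5, 1.0, 1.0, 0.5, 1.0, 0.0, 0.5]
low, high = bootstrap_ci(toy_scores)
print(f"95% CI for the mean: [{low:.3f}, {high:.3f}]")
```

Participants can use an interval like this on the validation set to judge whether a small leaderboard difference is likely to be noise.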
Each submission is a .csv file matching the reference submission template.
Puyin Li, Stanford University
Tiange Xiang, Stanford University
Ella Mao, Stanford University
Shirley Wei, Stanford University
Xinye Chen, Stanford University
Shikang Liu, Stanford University
Adnan Masood, UST
Fei-Fei Li, Stanford University
Ehsan Adeli, Stanford University
Compute, prize, and infrastructure support from academic and industry partners.
Don't see your question? Email quantiphybench@gmail.com.
Yes, you may enter both tracks. The two tracks share a registration but maintain independent leaderboards. A single submission file template covers both; you opt in to each track per upload.
In the Open-Weight Track, every model invoked at inference time must be based on weights and tools that are publicly available at the time of submission.
No, the test-set answers will not be released. The numerical ground truth for the test set remains withheld so QuantiPhy can serve as a reliable long-term benchmark beyond NeurIPS. A small validation set with ground-truth labels is provided for prompt development, local evaluation, and sanity checks.
Submit a CSV file containing the numerical prediction for each test instance, in the required output unit. The evaluation server scores submissions against the protected ground truth automatically.
Closed-API models may be used during method development, provided the Open-Weight pipeline at inference time uses only publicly available weights and tools. Any closed-API outputs used during method development must be disclosed in the technical report.
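For the local evaluation on the validation set mentioned above, a minimal scoring sketch is given below. It is not the official evaluator. It assumes the common MRA rule (a prediction passes threshold θ when its relative error is below 1 − θ), and the CSV column names `id`, `prediction`, `category`, and `ground_truth` are hypothetical; the reference submission template may use different ones.

```python
import csv
import io

CATEGORIES = ("2S", "2D", "3S", "3D")
# Tolerances 1 - theta for theta in {0.50, 0.55, ..., 0.95}, built from
# integers so the comparison is not affected by float drift in 1 - theta.
TOLERANCES = [(100 - t) / 100 for t in range(50, 100, 5)]


def instance_score(pred, truth):
    """Fraction of thresholds passed (ASSUMED rule: rel. error < 1 - theta)."""
    rel_err = abs(pred - truth) / abs(truth)  # undefined for truth == 0
    return sum(rel_err < tol for tol in TOLERANCES) / len(TOLERANCES)


def score_csv(pred_file, label_file):
    """Return (macro score, per-category MRA) from two CSV file objects.

    HYPOTHETICAL columns: predictions have `id,prediction`; labels have
    `id,category,ground_truth`. The official template may differ.
    """
    preds = {r["id"]: float(r["prediction"]) for r in csv.DictReader(pred_file)}
    per_cat = {c: [] for c in CATEGORIES}
    for row in csv.DictReader(label_file):
        score = instance_score(preds[row["id"]], float(row["ground_truth"]))
        per_cat[row["category"]].append(score)
    # MRA per category, then an unweighted mean over the categories present.
    cat_mra = {c: sum(s) / len(s) for c, s in per_cat.items() if s}
    return sum(cat_mra.values()) / len(cat_mra), cat_mra


# Toy data for illustration only; these are not real QuantiPhy instances.
predictions = io.StringIO("id,prediction\na,10\nb,20\nc,5\nd,5\n")
labels = io.StringIO(
    "id,category,ground_truth\na,2S,10\nb,2D,10\nc,3S,5\nd,3D,5\n"
)
macro, by_category = score_csv(predictions, labels)
print(macro, by_category)
```

In this toy run three categories are answered exactly and one is off by a factor of two, so the macro score reflects one fully missed category out of four. A missing `id` raises a `KeyError` in this sketch; how the official server treats missing rows is not stated here.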
@article{li2025quantiphy,
  title   = {QuantiPhy: A Quantitative Benchmark Evaluating Physical
             Reasoning Abilities of Vision-Language Models},
  author  = {Li, Puyin and Xiang, Tiange and Mao, Ella and Wei, Shirley
             and Chen, Xinye and Masood, Adnan and Li, Fei-Fei and Adeli, Ehsan},
  journal = {arXiv preprint arXiv:2512.19526},
  year    = {2025}
}