Registration open · NeurIPS 2026 · Competition Track · CVPR 2026 Project

How well can vision–language models do physics, with numerical accuracy?

The QuantiPhy Challenge — a NeurIPS 2026 Competition on Quantitative Physical Reasoning in Vision–Language Models.

QuantiPhy asks competitors to build VLMs that estimate sizes, velocities, and accelerations directly from video — across microscopic, macroscopic, and astronomical scales — and quantifies how faithfully their predictions are grounded in what the model actually sees.

Submission closes
2026 · Nov 05 · 23:59 AOE


§ 01 Overview

A benchmark for physics, not plausibility.

Plausible-sounding answers are not the same as numerically correct ones. QuantiPhy probes whether a VLM is measuring or merely guessing.

Vision–language models are increasingly asked to reason about the physical world — estimating the speed of a falling object, the diameter of a cell, or the acceleration of a galaxy cluster. Yet recent results show that contemporary VLMs rely heavily on memorized world knowledge rather than on the actual video and text presented to them.

The QuantiPhy benchmark, introduced at CVPR 2026 and re-released here as a NeurIPS competition, exposes this gap. Across 3,289 video–question pairs drawn from simulation, laboratory, and in-the-wild sources, the top frontier model trails the human baseline by more than two points — and collapses entirely under counterfactual prompts.

The QuantiPhy Challenge invites the community to close that gap, not only on raw accuracy but on the harder question that this benchmark was built to ask: are model predictions faithful to what the model actually sees?

O.01 Improve numerical accuracy on quantitative physics tasks

Push Mean Relative Accuracy on size, velocity, and acceleration estimation toward the human ceiling.

O.02 Reward open, reproducible solutions

A dedicated Open-Weight Track ensures the strongest methods remain reproducible and accessible to the broader community.

O.03 Generalize across scale

Solutions must hold from microscopic to astronomical regimes: the same physical relationships, observed at vastly different magnitudes.

§ 02 Tracks

Two tracks, one benchmark.

Competitors may enter either or both tracks. Each track has its own leaderboard, evaluation protocol, and awards.

Track A MRA · Open

Main Track

The headline competition. Any model is permitted — proprietary, open-weight, or hybrid — and the only objective is raw numerical accuracy on the original QuantiPhy test set.

Metric: Mean Relative Accuracy
Models allowed: Any
Code disclosure: Top-3 only

Track B MRA · Open-weight

Open-Weight Track

Same scoring rule as the Main Track, but submissions must be based on publicly available model weights and tools — supporting reproducible, community-accessible research on quantitative physical reasoning.

Metric: Mean Relative Accuracy
Models allowed: Publicly available weights
Code disclosure: All entries

§ 03 Dataset

3,289 questions. Three observation regimes. Three scales.

QuantiPhy is grounded in a multi-source video corpus designed to isolate what a model knows from what a model sees.

3,289★ Video–question pairs
568 Unique source videos
3 Observation regimes
10^±25 Decades of physical scale

Simulation (Blender)

Fully controlled scenes with exact ground-truth annotations for position, velocity, and acceleration — the cleanest signal for diagnostic evaluation.

Laboratory

Multi-camera, calibrated recordings of real-world physical setups — controlled enough for ground truth, real enough to expose perception failure modes.

Internet (In-the-Wild)

Curated real-world videos with expert annotations, including footage from natural phenomena, sports, microscopy, and astronomical observation.

Category grid (prior × dimensionality):

Static prior: 2S (static · 2D, planar), 3S (static · 3D, depth)
Dynamic prior: 2D (dynamic · 2D, planar), 3D (dynamic · 3D, depth)

Four diagnostic categories (2S, 2D, 3S, 3D). Across all of them, the target quantities are size, velocity, and acceleration. Final scores are macro-averaged over the four categories.

★ Dataset, validation split, and a CC-licensed starter sample available on Hugging Face.

§ 04 Evaluation

Mean Relative Accuracy.

Both tracks share the original QuantiPhy metric — a threshold-averaged measure that rewards approximate correctness rather than exact matches.

A prediction ŷ for ground truth y is judged correct at confidence threshold θ when:

$$
\frac{|\hat{y} - y|}{|y|} < 1 - \theta \tag{1}
$$
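As an illustrative worked example of criterion (1), with made-up numbers that are not drawn from the benchmark, take a ground-truth speed of y = 2.0 m/s and a prediction of ŷ = 2.3 m/s:

$$
\frac{|\hat{y} - y|}{|y|} = \frac{|2.3 - 2.0|}{|2.0|} = 0.15
$$

At θ = 0.80 the tolerance is 1 − θ = 0.20, and 0.15 < 0.20, so the prediction counts as correct. At θ = 0.90 the tolerance shrinks to 0.10, and 0.15 < 0.10 fails, so the same prediction counts as incorrect. Higher thresholds therefore demand tighter relative error.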

MRA averages this indicator over a set D of instances and a set of thresholds C = {0.10, 0.20, …, 0.90, 0.95}:

$$
\mathrm{MRA}(D) = \frac{1}{|D| \cdot |C|} \sum_{i \in D} \sum_{\theta \in C} \mathbb{1}\!\left( \frac{|\hat{y}_i - y_i|}{|y_i|} < 1 - \theta \right) \tag{2}
$$
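For local experimentation, equation (2) can be sketched in a few lines of Python. This is a minimal sketch and not the official evaluator: the function names are hypothetical, the ellipsis in C is assumed to mean steps of 0.10 between 0.20 and 0.90, and every ground-truth value is assumed to be nonzero.

```python
from typing import Sequence

# Assumption: the "…" in C is read as steps of 0.10, giving ten thresholds.
THRESHOLDS = (0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95)


def relative_error(pred: float, truth: float) -> float:
    """Return |pred - truth| / |truth|. Assumes truth is nonzero."""
    return abs(pred - truth) / abs(truth)


def mean_relative_accuracy(
    preds: Sequence[float],
    truths: Sequence[float],
    thresholds: Sequence[float] = THRESHOLDS,
) -> float:
    """Fraction of (instance, threshold) pairs that satisfy criterion (1)."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        raise ValueError("at least one instance is required")
    hits = 0
    for pred, truth in zip(preds, truths):
        err = relative_error(pred, truth)
        # Count the thresholds at which this prediction is judged correct.
        hits += sum(1 for theta in thresholds if err < 1.0 - theta)
    return hits / (len(preds) * len(thresholds))


if __name__ == "__main__":
    # Illustrative values only; not benchmark data.
    print(mean_relative_accuracy([2.3, 10.0, 1.0], [2.0, 10.0, 4.0]))
```

Because the indicator is averaged over thresholds, a prediction with 15% relative error still earns partial credit (it passes the looser thresholds), while an exact match earns full credit.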

The final track score is the macro-average of MRA across the four QuantiPhy categories — c ∈ {2S, 2D, 3S, 3D} — giving equal weight to each prior×dimensionality setting:

$$
\mathrm{Score} = \frac{1}{4} \sum_{c \in \{2S,\, 2D,\, 3S,\, 3D\}} \mathrm{MRA}(D_c) \tag{3}
$$
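The macro-average in equation (3) can be sketched as follows. This is an illustration under stated assumptions rather than the official scoring code: the record layout `(category, prediction, ground_truth)` and the function names are hypothetical, and the threshold set again assumes steps of 0.10 for the elided values.

```python
from collections import defaultdict
from typing import Iterable, Sequence, Tuple

# Assumption: the "…" in C is read as steps of 0.10, giving ten thresholds.
THRESHOLDS = (0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95)
CATEGORIES = ("2S", "2D", "3S", "3D")


def mra(pairs: Sequence[Tuple[float, float]]) -> float:
    """MRA over (prediction, ground_truth) pairs; ground truth must be nonzero."""
    hits = sum(
        abs(pred - truth) / abs(truth) < 1.0 - theta
        for pred, truth in pairs
        for theta in THRESHOLDS
    )
    return hits / (len(pairs) * len(THRESHOLDS))


def track_score(records: Iterable[Tuple[str, float, float]]) -> float:
    """Macro-average of per-category MRA, giving each category equal weight."""
    by_category = defaultdict(list)
    for category, pred, truth in records:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        by_category[category].append((pred, truth))
    missing = [c for c in CATEGORIES if c not in by_category]
    if missing:
        raise ValueError(f"no instances for categories: {missing}")
    return sum(mra(by_category[c]) for c in CATEGORIES) / len(CATEGORIES)


if __name__ == "__main__":
    # Illustrative values only; not benchmark data.
    demo = [
        ("2S", 1.0, 1.0),
        ("2D", 1.0, 4.0),
        ("3S", 2.3, 2.0),
        ("3D", 10.0, 10.0),
        ("3D", 1.0, 4.0),
    ]
    print(track_score(demo))
```

Macro-averaging means a category with few instances counts as much as one with many, so a model cannot raise its score by excelling only on the largest category.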

§ 05 Timeline

Important dates.

All deadlines are 23:59 anywhere-on-Earth (AOE). Subject to ratification by the NeurIPS 2026 competition chairs.

June 15–30, 2026
Website, platform, and dry run
Launch the official challenge website, publish rules and documentation, release the validation set with ground-truth answers, test the CSV evaluator, and conduct an internal dry run of the real-time leaderboard.

Late June 2026
Official launch
Open registration and public submissions. Release the starting kit, baseline scripts, example CSV files, evaluation code for the validation set, and documentation.

July–September 2026
Development phase
Participants submit CSV predictions to the online leaderboard and receive real-time scores. Organizers maintain the website, answer questions, update FAQs, and monitor submission integrity.

Early October 2026
Final submission phase
Freeze rules and leaderboard settings. Participants submit final CSV predictions. Top teams are asked to provide code or Docker containers for reproducibility verification.

Mid–Late October 2026
Verification and final ranking
Organizers verify top submissions, run reproducibility checks when applicable, compute final scores and bootstrap confidence intervals, and finalize rankings.

November 2026
Result analysis and workshop preparation
Prepare the competition report, invite top teams to submit short technical summaries, finalize workshop talks/posters, and publish final leaderboard analysis.

December 11–12, 2026
NeurIPS Competition Track workshop
Present challenge design, final results, analysis of model performance, and selected participant methods at the in-person NeurIPS Competition Track workshop.
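The verification phase mentions bootstrap confidence intervals without specifying the procedure. One common choice is a percentile bootstrap over test instances, sketched below. Everything here is an assumption for illustration: the function name, the 1,000 resamples, the 95% level, the fixed seed, and the input, which is taken to be each instance's fraction of thresholds passed (so that its mean equals MRA).

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    per_instance_scores: Sequence[float],
    n_resamples: int = 1000,   # assumed; the organizers' setting is not stated
    level: float = 0.95,       # assumed confidence level
    seed: int = 0,             # fixed seed so the sketch is reproducible
) -> Tuple[float, float]:
    """Percentile-bootstrap interval for the mean of per-instance scores."""
    n = len(per_instance_scores)
    if n == 0:
        raise ValueError("at least one instance is required")
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample instances with replacement and record the resampled mean.
        sample = rng.choices(per_instance_scores, k=n)
        means.append(sum(sample) / n)
    means.sort()
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    # Illustrative per-instance scores only; not benchmark data.
    print(bootstrap_ci([0.8, 1.0, 0.2, 0.6, 0.9, 0.4]))
```

For the macro-averaged track score, a natural variant would resample within each of the four categories and recompute the macro-average on every resample; whether the organizers do so is not stated here.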

§ 06 Participation

How to compete.

Steps

1. Register your team (up to five members).
2. Review the QuantiPhy dataset and the starter kit.
3. Prepare a single .csv file matching the reference submission template.
4. Sign in and upload your .csv to the submission portal. Each upload is scored against the hidden test set, with up to 3 scored submissions per day.

Rules

Teams of up to five. One team per person, per track.
Submissions are made through the official evaluation server. The test split is hidden.
Track B (Open-Weight): submissions must be based on publicly available model weights and tools.
Top-3 entries publish a reproducibility package within 14 days of notification.
Organizers and direct collaborators are not eligible for prizes.

§ 07 Awards

Prizes

1st · per track · $1K

Track winner · oral presentation slot at the NeurIPS competition workshop.

2nd · per track · $500

Runner-up · spotlight slot at the workshop.

3rd · per track · $250

Bronze · poster at the workshop.

Stay tuned for exciting prizes!

§ 08 Organizers

The QuantiPhy Team

Core organizers

Puyin Li Stanford University

Tiange Xiang Stanford University

Ella Mao Stanford University

Shirley Wei Stanford University

Xinye Chen Stanford University

Shikang Liu Stanford University

Faculty advisors

Fei-Fei Li Stanford University

Ehsan Adeli Stanford University

§ 09 Sponsors

§ 10 FAQ

Frequently asked.

Don't see your question? Email quantiphybench@gmail.com.

Can I enter both tracks?

Yes. The two tracks share a registration but maintain independent leaderboards. A single submission file template covers both; you opt in to each track per upload.

Does the Open-Weight Track allow proprietary APIs as part of a pipeline?

No. Every model invoked at inference time must be based on publicly available weights and tools at the time of submission.

Will the test ground truth be released after the competition?

No. The numerical ground truth for the test set remains withheld so QuantiPhy can serve as a reliable long-term benchmark beyond NeurIPS. A small validation set with ground-truth labels is provided for prompt development, local evaluation, and sanity checks.

What format are submissions in?

A CSV file containing the numerical prediction for each test instance, in the required output unit. The evaluation server scores submissions against the protected ground truth automatically.
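A minimal sketch of producing such a file with Python's standard `csv` module is shown below. The column names `id` and `prediction` and the function name are hypothetical placeholders; the reference submission template defines the real layout. The sketch also performs no unit conversion, so values are assumed to already be in the required output unit.

```python
import csv
import io
import math
from typing import Mapping, TextIO


def write_submission(predictions: Mapping[str, float], out: TextIO) -> None:
    """Write one row per test instance, rejecting non-finite predictions."""
    writer = csv.writer(out, lineterminator="\n")
    # Hypothetical header; replace with the columns from the official template.
    writer.writerow(["id", "prediction"])
    for instance_id, value in sorted(predictions.items()):
        number = float(value)
        if not math.isfinite(number):
            raise ValueError(f"non-finite prediction for {instance_id!r}")
        # repr() keeps full float precision, including very small or large values.
        writer.writerow([instance_id, repr(number)])


if __name__ == "__main__":
    buffer = io.StringIO()
    # Illustrative ids and values only; not benchmark data.
    write_submission({"q2": 2.3, "q1": 1e-6}, buffer)
    print(buffer.getvalue())
```

Writing to a real file works the same way: open it with `open(path, "w", newline="")` and pass the handle as `out`. Checking for NaN or infinite values before upload helps avoid spending one of the limited daily scored submissions on a malformed file.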

Can I use Main Track submissions to bootstrap an Open-Weight Track submission?

You may, provided the Open-Weight pipeline at inference time uses only publicly available weights and tools. Closed-API outputs used during method development must be disclosed in the technical report.

§ 11 Citation

If you use QuantiPhy, please cite.

@article{li2025quantiphy,
  title   = {QuantiPhy: A Quantitative Benchmark Evaluating Physical
             Reasoning Abilities of Vision-Language Models},
  author  = {Li, Puyin and Xiang, Tiange and Mao, Ella and Wei, Shirley
             and Chen, Xinye and Masood, Adnan and Li, Fei-Fei and Adeli, Ehsan},
  journal = {arXiv preprint arXiv:2512.19526},
  year    = 2025
}
