NeurIPS 2026 · Competition Track · CVPR 2026 Project

How well can vision–language models do physics, with numerical accuracy?

The QuantiPhy Challenge — a NeurIPS 2026 Competition on Quantitative Physical Reasoning in Vision–Language Models.

QuantiPhy asks competitors to build VLMs that estimate sizes, velocities, and accelerations directly from video — across microscopic, macroscopic, and astronomical scales — and quantifies how faithfully their predictions are grounded in what the model actually sees.

Submission closes
2026 · Nov 05 · 23:59 AOE
§ 01  Overview

A benchmark for physics, not plausibility.

Plausible-sounding answers are not the same as numerically correct ones. QuantiPhy probes whether a VLM is measuring or merely guessing.

Vision–language models are increasingly asked to reason about the physical world — estimating the speed of a falling object, the diameter of a cell, or the acceleration of a galaxy cluster. Yet recent results show that contemporary VLMs rely heavily on memorized world knowledge rather than on the actual video and text presented to them.

The QuantiPhy benchmark, introduced at CVPR 2026 and re-released here as a NeurIPS competition, exposes this gap. Across 3,355 video–question pairs drawn from simulation, laboratory, and in-the-wild sources, the top frontier model trails the human baseline by more than two points — and collapses entirely under counterfactual prompts.

The QuantiPhy Challenge invites the community to close that gap, not only on raw accuracy but on the harder question that this benchmark was built to ask: are model predictions faithful to what the model actually sees?

  1. O.01

    Improve numerical accuracy on quantitative physics tasks

    Push Mean Relative Accuracy on size, velocity, and acceleration estimation toward the human ceiling.

  2. O.02

    Reward open, reproducible solutions

    A dedicated Open-Weight Track ensures the strongest methods remain reproducible and accessible to the broader community.

  3. O.03

    Generalize across scale

    Solutions must hold from microscopic to astronomical regimes — the same physical relationships, observed at vastly different magnitudes.

§ 02  Tracks

Two tracks, one benchmark.

Competitors may enter either or both tracks. Each track has its own leaderboard, evaluation protocol, and awards.

Track A MRA · Open

Main Track

The headline competition. Any model is permitted — proprietary, open-weight, or hybrid — and the only objective is raw numerical accuracy on the original QuantiPhy test set.

  • Metric: Mean Relative Accuracy
  • Models allowed: Any
  • Code disclosure: Top-3 only
Track B MRA · Open-weight

Open-Weight Track

Same scoring rule as the Main Track, but submissions must be based on publicly available model weights and tools — supporting reproducible, community-accessible research on quantitative physical reasoning.

  • Metric: Mean Relative Accuracy
  • Models allowed: Publicly available weights
  • Code disclosure: All entries
§ 03  Dataset

3,355 questions. Three observation regimes. Three scales.

QuantiPhy is grounded in a multi-source video corpus designed to isolate what a model knows from what a model sees.

  • 3,355 Video–question pairs
  • 569 Unique source videos
  • 3 Observation regimes
  • 10^±25 Decades of physical scale

Simulation (Blender)

Fully controlled scenes with exact ground-truth annotations for position, velocity, and acceleration — the cleanest signal for diagnostic evaluation.

Laboratory

Multi-camera, calibrated recordings of real-world physical setups — controlled enough for ground truth, real enough to expose perception failure modes.

Internet (In-the-Wild)

Curated real-world videos with expert annotations, including footage from natural phenomena, sports, microscopy, and astronomical observation.

                 2D · planar          3D · depth
Static prior     2S (static · 2D)     3S (static · 3D)
Dynamic prior    2D (dynamic · 2D)    3D (dynamic · 3D)

Four diagnostic categories (2S, 2D, 3S, 3D). Across all of them, the target quantities are size, velocity, and acceleration. Final scores are macro-averaged over the four categories.

★ The dataset, validation split, and a CC-licensed starter sample are available on Hugging Face.

§ 04  Evaluation

Mean Relative Accuracy.

Both tracks share the original QuantiPhy metric — a threshold-averaged measure that rewards approximate correctness rather than exact matches.

A prediction ŷ for ground truth y is judged correct at confidence threshold θ when:

$$\frac{|\hat{y} - y|}{y} < 1 - \theta$$

MRA averages this indicator over a set D of instances and a set of thresholds C = {0.50, 0.55, …, 0.95}:

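$$\mathrm{MRA}(D) = \frac{1}{|D| \cdot |C|} \sum_{i \in D} \sum_{\theta \in C} \mathbb{1}\!\left( \frac{|\hat{y}_i - y_i|}{y_i} < 1 - \theta \right)$$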
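As a minimal sketch of this definition, the following Python computes MRA for a small list of predictions. The function name `mean_relative_accuracy`, the integer-based construction of the thresholds, and the use of `abs(y)` in the denominator (so that signed quantities are handled) are illustrative assumptions, not the official evaluator.

```python
from typing import Sequence

# Thresholds C = {0.50, 0.55, ..., 0.95}, built from integers to avoid
# accumulating floating-point error.
THRESHOLDS = [k / 100 for k in range(50, 100, 5)]


def mean_relative_accuracy(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Fraction of (instance, threshold) pairs with relative error < 1 - theta."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and the same length")
    hits = 0
    for y_hat, y in zip(preds, targets):
        # Assumption: ground truth is non-zero; abs(y) guards signed quantities.
        rel_err = abs(y_hat - y) / abs(y)
        hits += sum(rel_err < 1 - theta for theta in THRESHOLDS)
    return hits / (len(preds) * len(THRESHOLDS))


# Relative errors 0.00, 0.08, 1.00, 0.32 pass 10, 9, 0, and 4 thresholds.
print(mean_relative_accuracy([10.0, 9.2, 20.0, 13.2], [10.0] * 4))  # 0.575
```

A prediction off by 8% still earns 0.9 of full credit, while one off by a factor of two earns nothing: the metric rewards approximate correctness in graded steps rather than all-or-nothing.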

The final track score is the macro-average of MRA across the four QuantiPhy categories — c ∈ {2S, 2D, 3S, 3D} — giving equal weight to each prior×dimensionality setting:

$$\mathrm{Score} = \frac{1}{4} \sum_{c} \mathrm{MRA}(D_c)$$
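The macro-average can be sketched in Python as follows. The record layout `(category, prediction, ground_truth)`, the helper names `mra` and `track_score`, and the choice to raise an error when a category has no records are assumptions made for illustration; the official evaluator may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CATEGORIES = ("2S", "2D", "3S", "3D")
THRESHOLDS = [k / 100 for k in range(50, 100, 5)]  # 0.50, 0.55, ..., 0.95


def mra(pairs: List[Tuple[float, float]]) -> float:
    """MRA over (prediction, ground_truth) pairs; ground truth assumed non-zero."""
    hits = sum(
        abs(y_hat - y) / abs(y) < 1 - theta
        for y_hat, y in pairs
        for theta in THRESHOLDS
    )
    return hits / (len(pairs) * len(THRESHOLDS))


def track_score(records: Iterable[Tuple[str, float, float]]) -> float:
    """Macro-average of per-category MRA over the four categories."""
    by_category: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for category, y_hat, y in records:
        by_category[category].append((y_hat, y))
    # Assumption: every category must be present; a missing one is an error here.
    missing = [c for c in CATEGORIES if c not in by_category]
    if missing:
        raise ValueError(f"no records for categories: {missing}")
    return sum(mra(by_category[c]) for c in CATEGORIES) / len(CATEGORIES)


records = [
    ("2S", 10.0, 10.0), ("2S", 5.0, 5.0), ("2S", 2.0, 2.0),  # MRA 1.0
    ("2D", 10.0, 10.0),                                      # MRA 1.0
    ("3S", 20.0, 10.0),                                      # MRA 0.0
    ("3D", 9.2, 10.0),                                       # MRA 0.9
]
print(track_score(records))  # 0.725
```

Because each category counts equally, the three easy 2S instances do not outweigh the single 3S failure; pooling all six instances instead would give 49/60, or about 0.817.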
§ 05  Timeline

Important dates.

All deadlines are 23:59 anywhere-on-Earth (AOE). Subject to ratification by the NeurIPS 2026 competition chairs.

  1. June 15–30, 2026
    Website, platform, and dry run

    Launch the official challenge website, publish rules and documentation, release the validation set with ground-truth answers, test the CSV evaluator, and conduct an internal dry run of the real-time leaderboard.

  2. Late June 2026
    Official launch

    Open registration and public submissions. Release the starting kit, baseline scripts, example CSV files, evaluation code for the validation set, and documentation.

  3. July–September 2026
    Development phase

    Participants submit CSV predictions to the online leaderboard and receive real-time scores. Organizers maintain the website, answer questions, update FAQs, and monitor submission integrity.

  4. Early October 2026
    Final submission phase

    Freeze rules and leaderboard settings. Participants submit final CSV predictions. Top teams are asked to provide code or Docker containers for reproducibility verification.

  5. Mid–Late October 2026
    Verification and final ranking

    Organizers verify top submissions, run reproducibility checks when applicable, compute final scores and bootstrap confidence intervals, and finalize rankings.

  6. November 2026
    Result analysis and workshop preparation

    Prepare the competition report, invite top teams to submit short technical summaries, finalize workshop talks/posters, and publish final leaderboard analysis.

  7. December 11–12, 2026
    NeurIPS Competition Track workshop

    Present challenge design, final results, analysis of model performance, and selected participant methods at the in-person NeurIPS Competition Track workshop.
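The verification step in the timeline mentions bootstrap confidence intervals without specifying the procedure. One common choice is a percentile bootstrap over per-instance scores; the sketch below assumes that approach, and the function name `bootstrap_ci`, the resample count, and the seed are hypothetical rather than the organizers' settings.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-instance scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample n instances with replacement and record the mean.
        sample = rng.choices(scores, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Per-instance scores here are the fraction of thresholds each prediction passes.
low, high = bootstrap_ci([1.0, 0.9, 0.0, 0.4, 0.7, 0.8, 0.2, 1.0])
print(f"95% interval for the mean: [{low:.3f}, {high:.3f}]")
```

Since the track score is a macro-average over four categories, a fuller version would presumably resample within each category and recompute the macro-average per resample; this sketch covers only a single pool of instances.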

§ 06  Participation

How to compete.

Steps

  1. Register via the registration form [To be announced soon].
  2. Review the QuantiPhy dataset and the starter kit.
  3. Prepare a single .csv file matching the reference submission template.
  4. Submit through the official evaluation server [To be announced soon].

Rules

  • Teams of up to five. One team per person, per track.
  • Submissions are made through the official evaluation server. The test split is hidden.
  • Track B (Open-Weight): submissions must be based on publicly available model weights and tools.
  • Top-3 entries publish a reproducibility package within 14 days of notification.
  • Organizers and direct collaborators are not eligible for prizes.
§ 07  Awards

Prizes

Stay tuned for exciting prizes!
§ 08  Organizers

The QuantiPhy Team

Core organizers

Faculty advisors

§ 09  Sponsors

Sponsors

Compute, prize, and infrastructure support from academic and industry partners.

§ 10  FAQ

Frequently asked.

Don't see your question? Email quantiphybench@gmail.com.

Can I enter both tracks?

Yes. The two tracks share a registration but maintain independent leaderboards. A single submission file template covers both; you opt in to each track per upload.

Does the Open-Weight Track allow proprietary APIs as part of a pipeline?

No. Every model invoked at inference time must be based on publicly available weights and tools at the time of submission.

Will the test ground truth be released after the competition?

No. The numerical ground truth for the test set remains withheld so QuantiPhy can serve as a reliable long-term benchmark beyond NeurIPS. A small validation set with ground-truth labels is provided for prompt development, local evaluation, and sanity checks.

What format are submissions in?

A CSV file containing the numerical prediction for each test instance, in the required output unit. The evaluation server scores submissions against the protected ground truth automatically.
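The exact column layout is set by the reference submission template, which is not reproduced here. As a rough sketch, assuming a hypothetical two-column layout (`question_id`, `prediction`), a predictions file could be written with Python's standard `csv` module:

```python
import csv
from typing import Mapping


def write_submission(predictions: Mapping[str, float], path: str) -> None:
    """Write one row per test instance: identifier and numerical prediction."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Hypothetical header; replace with the columns in the official template.
        writer.writerow(["question_id", "prediction"])
        for question_id, value in predictions.items():
            # repr() keeps full float precision; values must already be in
            # the required output unit for each question.
            writer.writerow([question_id, repr(float(value))])


# Example with made-up identifiers and values.
write_submission({"q0001": 1.5, "q0002": 0.003}, "submission.csv")
```

Writing bare numbers, with no unit suffix or extra text, keeps the file machine-readable for automatic scoring.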

Can I use Main Track submissions to bootstrap an Open-Weight Track submission?

You may, provided the Open-Weight pipeline at inference time uses only publicly available weights and tools. Closed-API outputs used during method development must be disclosed in the technical report.

§ 11  Citation

If you use QuantiPhy, please cite.

@article{li2025quantiphy,
  title   = {QuantiPhy: A Quantitative Benchmark Evaluating Physical
             Reasoning Abilities of Vision-Language Models},
  author  = {Li, Puyin and Xiang, Tiange and Mao, Ella and Wei, Shirley
             and Chen, Xinye and Masood, Adnan and Li, Fei-Fei and Adeli, Ehsan},
  journal = {arXiv preprint arXiv:2512.19526},
  year    = 2025
}