QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

Puyin Li🌲*, Tiange Xiang🌲*, Ella Mao🌲*,
Shirley Wei🌲, Xinye Chen🌲, Adnan Masood🌍,
Fei-Fei Li🌲†, Ehsan Adeli🌲†
🌲Stanford University, 🌍UST
* Equal first authorship † Equal last authorship
TL;DR

QuantiPhy is the first benchmark that asks vision–language models to do physics with numerical accuracy.

Across 3,300+ video–text instances, we show that today’s VLMs often sound plausible but fail quantitatively on physical reasoning tasks—they rely more on memorized world knowledge from pretraining than on the actual video and text inputs.

QuantiPhy measures the critical gap between qualitative understanding and quantitative reasoning, providing a rigorous testbed for building input-faithful, physically grounded AI.

Are you better than ChatGPT-5.1 at physical reasoning?


Prior Knowledge: The diameter of the billiard balls is 57.4 mm.

Question: What is the velocity of the orange ball at 1.00s in cm/s?

Click to view Ground Truth and ChatGPT-5.1's answer!

Introduction

Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason about physical properties quantitatively. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations.

To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video–text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness.

We further provide an in-depth analysis of key factors such as background noise, counterfactual priors, and strategic prompting, and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs as references when reasoning about kinematic properties quantitatively. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.

QuantiPhy Teaser

Figure 1. On a crowded city street, a bird's nest falls from a branch, a car rushes by, an eagle flits over a building, and a person walks in a crosswalk: the real world is full of complex physical motion. To understand and navigate this environment, generalist embodied AI systems must reason about physical properties quantitatively. Because objects obey common laws of physics, their kinematic properties (such as size, velocity, and acceleration) are interrelated. This interdependence makes it possible for visual AI to systematically reason about these properties with respect to available priors. In this work, we present QuantiPhy, the first benchmark to evaluate the reasoning ability of AI models on quantitative kinematic inference tasks.

QuantiPhy Dataset

QuantiPhy Dataset Examples

Figure 2. Example instances from QuantiPhy, illustrating the four core task combinations.

QuantiPhy Data Overview

Figure 3. Dataset statistics overview.

Dataset Overview

QuantiPhy introduces a rigorous benchmark for evaluating quantitative physical reasoning in Vision-Language Models. Unlike traditional VQA tasks that focus on qualitative descriptions, QuantiPhy challenges models to perform precise numerical inference grounded in physical laws.

  • 🆕 Novel Task: Kinematic Inference

    We formally define a task where object size, velocity, and acceleration are treated as mutually constraining quantities.

    • Input: A video clip + A single physical prior provided as text (e.g., "The car is 4 meters long" or "Gravity is 9.8 m/s²").
    • Reasoning: The model must use the provided prior to recover the pixel-to-world scale and apply kinematic equations to deduce the other unknown properties of the target object (see the worked example after this list).
    • Output: A precise numerical value (with units) for a target property (e.g., "The velocity at t=2s is 12.5 m/s").
  • 🗂️ Structured Taxonomy

    To provide fine-grained analysis of model capabilities, the benchmark is organized along two primary axes:

    • Dimensionality: 2D (Planar motion) vs. 3D (Depth-varying motion).
    • Physical Prior: Static (Size-based) vs. Dynamic (Motion-based, e.g., Velocity/Acceleration).
  • 📊 Scale & Diversity

    The dataset contains 3,355 video-question pairs derived from 569 unique videos. The data spans diverse sources (Simulation, Lab, Internet) to ensure coverage across microscopic, macroscopic, and astronomical scales.
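As a concrete illustration of the input, reasoning, and output described above (with made-up numbers, not an actual benchmark instance): if the prior states that the car is 4 m long and the car spans 200 pixels in the frame, the implied scale is 0.02 m per pixel, and a measured pixel velocity of 625 px/s at t = 2 s then gives

\[\frac{4\ \text{m}}{200\ \text{px}} = 0.02\ \text{m/px}, \qquad V_{t=2\,\text{s}}^{\text{world}} = 0.02\ \text{m/px} \times 625\ \text{px/s} = 12.5\ \text{m/s},\]

matching the example output quoted in the task definition.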

Dataset Construction Pipeline

Figure 4. QuantiPhy dataset construction pipeline.

Dataset Construction

QuantiPhy is built from three complementary data sources, each designed to probe quantitative physical reasoning under different visual conditions. For every video, we segment object trajectories and annotate size, velocity, and acceleration using source-appropriate, physically grounded procedures (a sketch of the trajectory-based computation follows the list below).

  • 🔧 Simulation (Fully Controlled)
    • Collection: Physically grounded Blender simulations with explicitly controlled object properties and motions
    • Segmentation: Automatic object-level segmentation from the simulation engine
    • Annotation: Exact ground-truth physical quantities directly obtained from simulation parameters
    • Key advantage: Enables precise causal and counterfactual interventions.
  • 🧪 Laboratory (Measured Reality)
    • Collection: Multi-camera recordings of real objects in motion under calibrated setups
    • Segmentation: Frame-level object segmentation via semi-automatic pipelines with manual verification
    • Annotation: Physical quantities computed from measured trajectories and known object dimensions
    • Key advantage: Provides real-world grounding with reliable physical measurements.
  • 🌍 Internet (In-the-Wild)
    • Collection: Curated real-world videos with visible, trackable object motion
    • Segmentation: Manual and model-assisted object segmentation under diverse visual conditions
    • Annotation: Expert annotations constrained by visual evidence and physical plausibility
    • Key advantage: Tests robustness under natural noise, clutter, and ambiguity.
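For sources where quantities are computed from measured trajectories (e.g., the laboratory recordings), velocity and acceleration can be derived by finite differencing the world-unit positions. Below is a minimal sketch of that general procedure, our own illustration under this assumption rather than the benchmark's actual annotation code:

```python
import numpy as np

def kinematics_from_trajectory(positions_m: np.ndarray, fps: float):
    """Estimate per-frame speed (m/s) and acceleration magnitude (m/s^2)
    from a (T, 2) trajectory in metres using central finite differences."""
    dt = 1.0 / fps
    velocity = np.gradient(positions_m, dt, axis=0)        # (T, 2) in m/s
    acceleration = np.gradient(velocity, dt, axis=0)       # (T, 2) in m/s^2
    return np.linalg.norm(velocity, axis=1), np.linalg.norm(acceleration, axis=1)

# Illustrative trajectory: uniform motion of 0.1 m per frame at 30 fps.
trajectory = np.stack([np.arange(10) * 0.1, np.zeros(10)], axis=1)
speed, accel = kinematics_from_trajectory(trajectory, fps=30.0)
print(speed[5], accel[5])   # ~3.0 m/s, ~0.0 m/s^2
```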

Evaluation on QuantiPhy

Evaluation Setup

QuantiPhy evaluates whether vision–language models can infer quantitative physical properties from visual evidence, rather than producing plausible numerical guesses. Each evaluation instance consists of a short video paired with a natural-language question, and models are required to output a single continuous numerical value, such as object size, velocity, or acceleration. All tasks are open-ended and avoid multiple-choice formats, preventing shortcut strategies based on memorization or priors.

We evaluate three core tasks—size, velocity, and acceleration estimation—using a unified protocol across simulated, laboratory, and in-the-wild videos. Human performance is measured under the same setup, providing a reference point for visually grounded quantitative reasoning.
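Because every answer must be a single continuous value in a requested unit, automatic scoring only needs to extract that number from the model's free-form response. Below is a minimal sketch of such a harness; the prompt template, field names, and regex are our own illustration, not the benchmark's released tooling:

```python
import re
from typing import Optional

# Hypothetical prompt layout: a video is paired with a textual prior and an
# open-ended numerical question.
PROMPT_TEMPLATE = (
    "Prior knowledge: {prior}\n"
    "Question: {question}\n"
    "Answer with a single number in the requested unit."
)

def parse_numeric_answer(response: str) -> Optional[float]:
    """Extract the last number (integer, decimal, or scientific notation)
    from a free-form model response; return None if nothing is found."""
    matches = re.findall(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?", response)
    return float(matches[-1]) if matches else None

prompt = PROMPT_TEMPLATE.format(
    prior="The diameter of the billiard balls is 57.4 mm.",
    question="What is the velocity of the orange ball at 1.00s in cm/s?",
)
print(prompt)
print(parse_numeric_answer("The orange ball moves at roughly 71.8 cm/s."))  # 71.8
```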

🧠 Explain Box: Kinematic Inference Task

Given a video, the model is provided with a single physical prior in world units for a source object (e.g., \(S^{\text{world}}\), \(V_t^{\text{world}}\), \(A_t^{\text{world}}\)).

From the corresponding pixel-space measurement, the model must infer a pixel-to-world scale \(\gamma\), such that

\[S^{\text{world}} = \gamma S^{\text{pixel}}, \quad V_t^{\text{world}} = \gamma V_t^{\text{pixel}}, \quad A_t^{\text{world}} = \gamma A_t^{\text{pixel}}.\]

Using this scale, the model then quantitatively infers a target kinematic property in world coordinates.
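A minimal sketch of this two-step computation, assuming the pixel-space measurements have already been obtained (the numbers below are illustrative placeholders; the benchmark itself supplies only the video, the textual prior, and the question):

```python
def pixel_to_world_scale(prior_world: float, prior_pixel: float) -> float:
    """Recover the pixel-to-world scale gamma from a single known quantity."""
    return prior_world / prior_pixel

def to_world(measurement_pixel: float, gamma: float) -> float:
    """Map any pixel-space kinematic measurement into world units."""
    return gamma * measurement_pixel

# Illustrative numbers: the prior says the ball is 0.0574 m in diameter,
# and it spans 40 px in the frame.
gamma = pixel_to_world_scale(prior_world=0.0574, prior_pixel=40.0)  # m per px

# The ball's centroid moves 500 px/s at the queried timestamp.
velocity_world = to_world(measurement_pixel=500.0, gamma=gamma)     # m/s
print(f"gamma = {gamma:.5f} m/px, velocity = {velocity_world * 100:.1f} cm/s")
```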

To measure performance, we use Mean Relative Accuracy (MRA), a metric designed for quantitative physical inference. For a prediction \(\hat{y}\) and ground-truth value \(y\), we compute the relative error \(|\hat{y} - y| / |y|\). A prediction is considered correct under a confidence threshold \(\theta\) if

\[\frac{|\hat{y} - y|}{|y|} < 1 - \theta,\]

where \(\theta \in \{0.5, 0.55, \ldots, 0.95\}\). The final MRA score is the average accuracy across all thresholds:

\[\text{MRA} = \frac{1}{|C|} \sum_{\theta \in C} \mathbb{1}\left(\frac{|\hat{y} - y|}{|y|} < 1 - \theta\right),\]

with \(C\) denoting the set of confidence thresholds.

By rewarding approximate correctness rather than exact numerical matches, MRA provides a smoother and more informative measure of quantitative physical reasoning under real-world visual uncertainty.
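The definition above translates directly into a few lines of code; a minimal sketch, assuming the prediction and ground truth have already been parsed as floats:

```python
import numpy as np

# Confidence thresholds from the definition above: 0.50, 0.55, ..., 0.95.
THRESHOLDS = np.linspace(0.50, 0.95, 10)

def mean_relative_accuracy(y_pred: float, y_true: float,
                           thresholds: np.ndarray = THRESHOLDS) -> float:
    """Fraction of thresholds theta at which the relative error
    |y_pred - y_true| / |y_true| stays below 1 - theta."""
    rel_err = abs(y_pred - y_true) / abs(y_true)
    return float(np.mean(rel_err < (1.0 - thresholds)))

# Example: a prediction within ~10% of the ground truth passes 8 of the
# 10 thresholds and therefore scores 0.8.
print(mean_relative_accuracy(y_pred=11.2, y_true=12.5))
```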

🧠 Explain Box: Why Mean Relative Accuracy (MRA)?

Exact numerical matching is unrealistic for physical reasoning. Small perceptual differences in video can lead to meaningful but imperfect numerical estimates, even for humans.

Mean Relative Accuracy (MRA) addresses this by:

  • Measuring relative error rather than absolute error
  • Assigning partial credit for approximately correct predictions
  • Avoiding brittle pass/fail judgments on noisy visual data

As a result, MRA better reflects how physical quantities are inferred in practice and provides a smoother, more informative signal for comparing models.

QuantiPhy Leaderboard

Models  Size  Release  2S  2D  3S  3D  Avg.
(Kinematic categories 2S/2D/3S/3D denote the four task combinations of dimensionality, 2D vs. 3D, with a Static vs. Dynamic prior; scores are MRA in %.)
Proprietary models
ChatGPT-5.1 – 2025/11 46.3 56.2 51.5 58.3 53.1
Gemini-2.5 Pro – 2025/03 44.8 57.5 42.4 53.7 49.6
Gemini-2.5 Flash – 2025/03 40.3 53.2 43.6 57.4 48.6
Grok 4.1 (Fast Reasoning) – 2025/07 39.4 49.5 42.4 48.6 45.0
ChatGPT-5 – 2025/04 36.6 35.0 25.9 33.1 32.6
Claude Sonnet 4.5 – 2025/04 19.6 23.0 19.6 29.1 22.8
Open-weight models
Qwen3-VL-Instruct-32B 32B 2025/05 35.8 51.6 43.2 53.4 46.0
InternVL-3.5-30B 30B 2025/05 36.7 45.4 38.6 42.0 40.7
Qwen3-VL-Instruct-8B 8B 2025/04 26.0 47.8 35.1 46.3 38.8
InternVL-3.5-8B 8B 2025/05 27.3 41.8 34.4 38.3 35.4
Molmo-7B 7B 2024/09 30.0 43.1 24.4 36.6 33.5
Phi-4-Multimodal-Instruct 5.6B 2025/02 33.4 42.3 25.4 28.4 32.4
Qwen3-VL-Instruct-2B 2B 2025/04 27.1 39.0 17.6 32.1 29.0
SmolVLM-Instruct 0.26B 2024/11 31.6 34.4 20.0 27.8 28.5
InternVL-3.5-2B 2B 2025/05 25.0 31.1 16.6 27.4 25.0
VILA-7B 7B 2024/01 23.0 29.8 14.4 23.0 22.6
CogVLM2 Video 12B 2024/05 19.4 28.7 12.7 27.9 22.2
Phi-3-Mini-128K-Instruct-3.8B 3.8B 2024/04 17.3 14.7 19.5 18.6 17.5
LLaVA-13B 13B 2023/10 14.4 22.1 8.0 16.5 15.2
MiniCPM-V 4.5 8B 2025/05 27.6 26.3 0.4 0.0 13.6
Fuyu-8B 8B 2023/10 9.5 14.7 9.5 16.2 12.5
Human Baseline – – 50.0 59.1 55.2 57.9 55.6

Note: Best proprietary model scores are highlighted in orange, best open-weight model scores in teal.

Dissecting Quantitative Reasoning in Vision–Language Models

How Does GPT-5.1 Approach Kinematic Inference Tasks?

Case 1 — When Visual Measurement Works

GPT-5.1 can infer kinematic quantities when visual cues are clean and well-scaled.

Given a realistic object size and clear motion across frames, the model explicitly estimates pixel displacement, maps it to real-world units, and computes speed or dimensions proportionally.

In these cases, predictions closely match ground truth, suggesting genuine use of visual evidence rather than memorized priors.

This represents GPT-5.1's best-case behavior: explicit visual measurement followed by numerical reasoning.

Case 2 — Sensitivity to Unrealistic Scale

Unrealistic physical scales can derail otherwise correct visual reasoning.

When object dimensions are inflated to implausible values (e.g., a car length of 5670 m), GPT-5.1 often performs arithmetic consistently but fails to detect scale violations.

The model propagates the incorrect assumption through its calculations, leading to extreme numerical errors in downstream kinematic estimates.

Visual reasoning remains internally consistent, but lacks physical plausibility checks.

Case 3 — Guessing Without Visual Evidence

When visual input is removed, GPT-5.1 falls back to prior-driven guesses.

With the video ablated, the model no longer has access to motion cues and instead produces values based on typical real-world expectations (e.g., "reasonable" speeds or object widths).

These guesses often deviate substantially from ground truth and show little task-specific adaptation.

Without visual grounding, numerical outputs reflect priors rather than inference.

Case 4 — Prior Dominance

For acceleration-related tasks, prior knowledge can override visual evidence.

In scenarios involving falling objects, GPT-5.1 frequently outputs values close to canonical constants (e.g., gravitational acceleration), even when the video implies different kinematic behavior.

This pattern suggests reliance on memorized physics facts instead of extracting acceleration from observed motion.

Correct-looking numbers do not necessarily imply visually grounded reasoning.

Scene Context and Relational Cues

We first examine how scene context affects quantitative reasoning by varying background complexity and the number of moving objects. Surprisingly, background complexity alone has only a mild effect on performance. While removing clutter via segmentation slightly stabilizes predictions, models often perform equally well—or even better—in visually rich scenes, likely due to the presence of implicit reference cues such as tiles, road markings, or architectural structures.

In contrast, the number of objects in a scene has a consistent and substantial impact. Across models, accuracy is higher in multi-object scenes than in single-object setups, suggesting that relational structure provides valuable comparative anchors for inferring both size and motion.

MRA Line Plot - Background Complexity and Object Count

Figure 5. Performance across background complexity and object count conditions.

💡 Takeaway: VLMs benefit less from visual simplicity than from relational and contextual cues embedded in the scene.

Input Faithfulness: Video, Priors, and Prompting

"Do VLMs actually use what we give them?"

We next investigate whether VLMs faithfully condition on the reference video and the explicit physical prior provided in the prompt. We evaluate models under three controlled settings: the default video + prior condition, a prior-only condition with the video removed, and a counterfactual prior condition where the numerical prior is systematically rescaled.

If models relied on visual measurement and algebraic use of the prior, removing the video should substantially degrade performance, while rescaling the prior should simply rescale their answers in proportion. Instead, we observe a striking pattern: many models achieve comparable accuracy even without the video, indicating that plausible guesses based on memorized world knowledge already suffice. When the prior is counterfactually rescaled, performance collapses across nearly all models, with predictions remaining anchored to typical real-world magnitudes rather than tracking the provided numerical values.
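To see why the counterfactual condition is diagnostic, note that an input-faithful reasoner derives its scale from the stated prior, so rescaling the prior by a factor k should rescale the world-unit answer by the same k. A small sketch of that sanity check (all numbers are illustrative):

```python
def faithful_prediction(prior_world: float, prior_pixel: float,
                        target_pixel: float) -> float:
    """What an input-faithful model would answer: the scale comes from the
    stated prior, the rest from pixel-space measurement."""
    gamma = prior_world / prior_pixel
    return gamma * target_pixel

# Original prior: the car is 4 m long (spans 200 px); counterfactual: 10x larger.
original = faithful_prediction(prior_world=4.0, prior_pixel=200.0, target_pixel=625.0)
rescaled = faithful_prediction(prior_world=40.0, prior_pixel=200.0, target_pixel=625.0)
print(original, rescaled)   # 12.5 vs 125.0: the answer should track the prior

# Models whose answers stay near ~12.5 m/s under the 10x prior are anchoring
# on typical real-world magnitudes instead of the provided value.
```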

Finally, we test whether structured chain-of-thought (CoT) prompting can mitigate these failures. Contrary to expectation, step-by-step decomposition does not systematically improve performance. For most models, explicitly reasoning through pixel measurement and scale estimation amplifies early numerical errors rather than correcting them.

Models Size Release Video + Prior Prior only Counterfactual CoT
Proprietary models
ChatGPT-5.1 – 2025/11 56.1 39.0 15.4 27.7
Gemini-2.5 Pro – 2025/03 60.9 46.1 29.9 49.8
Gemini-2.5 Flash – 2025/03 49.8 36.1 14.4 22.4
Grok 4.1 (Fast Reasoning) – 2025/07 47.5 44.3 31.6 39.5
ChatGPT-5 – 2025/04 34.2 50.8 29.6 53.7
Claude Sonnet 4.5 – 2025/04 25.4 16.6 11.6 25.9
Open-weight models
Qwen3-VL-Instruct-32B 32B 2025/05 50.1 37.2 34.0 23.1
InternVL-3.5-30B 30B 2025/05 45.4 – 12.1 17.6
Qwen3-VL-Instruct-8B 8B 2025/04 40.5 24.9 12.0 21.0
Qwen2.5-VL-72B-Instruct 72B 2025/01 37.0 19.3 29.7 18.2
Molmo-7B 7B 2024/09 39.8 – 14.7 15.9
Phi-4-Multimodal-Instruct 5.6B 2025/02 40.0 20.1 9.2 23.5
LLaVA-OneVision-72B 72B 2024/08 34.9 28.2 25.2 25.6
SmolVLM-Instruct 0.26B 2024/11 38.9 – 14.3 17.8
InternLM-XC2.5-8B 8B 2024/05 32.7 25.1 22.5 21.5
VILA-7B 7B 2024/01 31.8 – 14.1 10.0
CogVLM2 Video 12B 2024/05 28.5 – 9.5 26.4
Phi-3-Mini-128K-Instruct 3.8B 2024/04 11.1 10.3 8.4 7.2
LLaVA-13B 13B 2023/10 20.2 – 13.9 14.4
MiniCPM-V 4.5-8B 8B 2025/05 29.7 – 19.9 24.1
Fuyu-8B 8B 2023/10 14.3 – 9.0 21.1

Table 2. Input faithfulness analysis (MRA in %) across the video + prior, prior-only, counterfactual-prior, and chain-of-thought prompting conditions.

💡 Takeaway: Current VLMs are not input-faithful quantitative reasoners: visual evidence and explicit priors act as soft hints, while memorized world knowledge dominates final predictions.

BibTeX

@article{li2025quantiphy,
  title   = {QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models},
  author  = {Li, Puyin and Xiang, Tiange and Mao, Ella and Wei, Shirley and Chen, Xinye and Masood, Adnan and Li, Fei-Fei and Adeli, Ehsan},
  journal = {arXiv preprint arXiv:2512.19526},
  year    = {2025}
}