Benchmark

InquiTree: Evaluating AI Agents in the Scientific Inquiry Loop with Paper-Derived Research Trees

InquiTree turns scientific papers into interactive Research Trees: logical DAGs over subtopic proposal, study design, result interpretation, and belief updating. Agents are evaluated through repeated propose, observe, revise, and conclude cycles, testing whether they can choose the next scientific move, absorb feedback, detect anomalous results, and decide when to draw conclusions. The benchmark derives inquiry environments from neuroscience papers and reports diagnostic stress tests around long-horizon interaction, Fake Result detection, and temporal generalization across newer papers. Its public IT-18 subset releases open-access paper-derived configurations and logs for evaluating AI agents in scientific inquiry loops.

Jun 8, 2026

VidNum1.4K - A Comprehensive Benchmark for Video-based Numerical Reasoning

This research introduces VNum, a comprehensive VideoQA benchmark containing 1,379 human-annotated video-question pairs designed to test multi-step numerical reasoning in Vision-Language Models (VLMs). Moving beyond simple counting, VNum spans diverse real-world environments to quantify objects, actions, and events through a unique three-level hierarchy.

Apr 3, 2026