Swap the Action Interface, Gain 11 Points on Spatial Reasoning

Today's Overview

  • The interface design sets the ceiling on spatial reasoning. SpatialClaw changes nothing about the model and skips fine-tuning. It rewrites the agent's action interface into a stateful code kernel and hits 59.9% average accuracy across 20 benchmarks — 11.2 points above recent spatial agents, with consistent gains on six VLM backbones.
  • Science automation's last mile. LabVLA plugs a vision-language-action model into a real lab bench, using a simulation data engine to fill the gap left by scarce lab data, and tops every baseline on LabUtopia both in and out of distribution.
  • MaxProof trains math proving as three separate skills — generation, verification, repair — then organizes them with population-level test-time scaling. It scores 35/42 on IMO 2025 and 36/42 on USAMO 2026, both past the human gold-medal line.
  • Image models finally get interleaved generation. InterleaveThinker bolts a multi-agent pipeline plus single-step RL onto any generator, letting it think and emit text and images in turns, matching Nano Banana and GPT-5.
  • 2D supervision is enough to learn 3D motion. VideoMDM uses no 3D ground truth at all. Precise 2D poses from monocular video train a coherent 3D human motion prior that nearly matches MDM trained on full 3D data.

Featured

01 Same Perception Tools, New Interface, 11 More Points

Most spatial reasoning agents blame their ceiling on a weak VLM or a weak perception module. SpatialClaw points at a third link nobody looks at: the action interface the agent uses to call tools. The usual options both constrain the agent. Single-shot code execution forces it to write its whole strategy up front and submit without seeing intermediate results. Structured tool-call interfaces stay flexible but are hard to compose freely.

SpatialClaw uses code itself as the action interface. It keeps a stateful Python kernel preloaded with the input frames and a set of perception and geometry primitives. The VLM writes one executable unit per step and decides the next step from every text and image output so far. No training, no model changes. Across 20 spatial reasoning benchmarks it averages 59.9% accuracy, 11.2 points above recent spatial agents, and improves on six VLM backbones across two model families with no per-benchmark or per-model tuning.
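
The step-by-step loop described above can be sketched in a few lines of Python. This is a minimal illustration, not SpatialClaw's implementation: `StatefulKernel`, `scripted_policy`, `run_agent`, and the toy `detect` primitive are all hypothetical names, and a scripted function stands in for the VLM. The point it shows is that state survives between steps, so the second code unit can reuse `boxes` from the first.

```python
import contextlib
import io


class StatefulKernel:
    """Keeps one namespace alive across steps so later code can reuse earlier results."""

    def __init__(self, preloaded):
        # Preloaded names (frames, primitives) are visible to every step.
        self.namespace = dict(preloaded)

    def run(self, code):
        """Execute one code unit and return whatever it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(code, self.namespace)
            except Exception as exc:
                # Report the failure as output so the next step can react to it.
                print(f"error: {exc!r}")
        return buffer.getvalue()


def scripted_policy(history):
    """Hypothetical stand-in for the VLM: chooses the next code unit from outputs so far."""
    if len(history) == 0:
        return "boxes = detect(frames[0]); print(len(boxes))"
    if len(history) == 1:
        # Reuses `boxes` from the previous step instead of recomputing it.
        return "widths = [x1 - x0 for (x0, x1) in boxes]; print(max(widths))"
    return None  # nothing left to do


def run_agent(kernel, policy, max_steps=8):
    """Alternate between asking the policy for code and executing it in the kernel."""
    history = []
    for _ in range(max_steps):
        code = policy(history)
        if code is None:
            break
        history.append(kernel.run(code))
    return history


def detect(frame):
    """Toy perception primitive: a 'frame' here is just a list of (x0, x1) spans."""
    return list(frame)


kernel = StatefulKernel({"frames": [[(0, 4), (10, 17)]], "detect": detect})
outputs = run_agent(kernel, scripted_policy)
```

A single-shot interface would have to fit both steps into one submission without seeing the first result. Here the policy sees each output before committing to the next unit, and an exception comes back as ordinary output rather than ending the run.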

The consistent cross-backbone gain is the convincing part. It says the improvement comes from interface design, not a lucky fit with one backbone. The full paper should clarify the cost: feeding all prior outputs every step raises token spend and latency, and multi-step composition needs a recovery path when it fails.

Key takeaways:
  • An agent's ceiling depends not just on the model and tools but on the interface between them. The optimization target can shift from "find a stronger model" to "redesign the action interface."
  • Code as the action interface wins because it's stateful and adjustable step by step from intermediate results, instead of one committed strategy.
  • Consistent gains on six backbones mark this as a general interface-layer improvement worth borrowing if you build tool-using agents.


02 AI Can Write the Protocol, So Why Are Humans Still at the Bench?

Science automation keeps stalling at the last mile. AI can read the literature, propose hypotheses, and write a full experimental protocol. But uncapping bottles, pipetting, and running instruments still need a human at the bench. LabVLA aims to close that gap by plugging a vision-language-action model into a real scientific bench. The obstacle is concrete: existing VLAs train almost entirely on home and tabletop scenes, never see lab instruments, transparent liquids, or fixed-sequence protocols, and fall apart in the lab.

The team works two angles. RoboGenesis is a data engine that uses simulation to assemble atomic skills into full experimental workflows, auto-validates and filters them, then exports demonstrations for several robot embodiments — its answer to scarce lab data. Training runs in two stages: FAST action tokens first teach a Qwen3-VL-4B backbone to "understand actions," then a flow-matching action expert learns continuous control. On the LabUtopia benchmark, LabVLA posts the top average success rate among all baselines, in distribution and out.

Key takeaways:
  • The bottleneck in science automation is moving from "can think" to "can act," with VLA the interface between paper protocols and bench execution.
  • The real constraint for lab VLAs is data and embodiment diversity. A simulation data engine addresses it more directly than adding model capacity does.
  • Teams in lab automation or embodied research should track this "from paper science to bench science" path.


03 Train Math Proofs as Generate, Verify, Repair

Competition-level proofs aren't hard because a plausible-looking proof is hard to produce. They're hard because judging correctness is hard — one slippery step and the whole proof is dead. MaxProof splits the model into three skills trained separately: proof generation, proof verification, and critique-based repair that first names the error then fixes it. The verifier deliberately drives down false positives, where a wrong proof gets marked correct, adding defense in depth to the pipeline.

At test time the same model plays generator, verifier, repairer, and ranker at once, produces a batch of candidate proofs, and picks one to submit through a tournament of pairwise comparisons. It scores 35/42 on IMO 2025 and 36/42 on USAMO 2026, both past the human gold-medal line. The structure matters more to practitioners than the scores.
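
The test-time procedure can be sketched as a small search loop. This is a hedged illustration under stated assumptions, not MaxProof's code: `prove` and `select_by_tournament` are hypothetical names, the round-robin format is one possible reading of "tournament of pairwise comparisons," and the toy problem (find a list of integers that sums to a target) only stands in for a proof task so the example can run.

```python
from itertools import combinations


def select_by_tournament(candidates, prefer):
    """Round-robin: compare every pair once and return the candidate with the most wins."""
    wins = {i: 0 for i in range(len(candidates))}
    for i, j in combinations(range(len(candidates)), 2):
        winner = i if prefer(candidates[i], candidates[j]) else j
        wins[winner] += 1
    best = max(wins, key=lambda i: wins[i])  # ties go to the earliest candidate
    return candidates[best]


def prove(problem, generate, verify, repair, prefer, population=4, repair_rounds=2):
    """Generate a population, repair what the verifier rejects, then rank the survivors."""
    pool = []
    for seed in range(population):
        proof = generate(problem, seed)
        for _ in range(repair_rounds):
            ok, critique = verify(problem, proof)
            if ok:
                break
            # The critique names the error first; the repair step then uses it.
            proof = repair(problem, proof, critique)
        ok, _ = verify(problem, proof)
        if ok:
            pool.append(proof)  # only verified candidates enter the tournament
    if not pool:
        return None
    return select_by_tournament(pool, prefer)


# Toy stand-ins (assumptions for illustration): a "proof" is a list summing to the target.
def generate(target, seed):
    return [seed] * 3


def verify(target, proof):
    gap = target - sum(proof)
    return gap == 0, gap  # the critique is the missing amount


def repair(target, proof, gap):
    return proof + [gap]


def prefer(a, b):
    return len(a) <= len(b)  # shorter candidate wins the pairwise comparison


best = prove(6, generate, verify, repair, prefer)
```

In this toy run three of the four candidates need one repair, and the tournament picks the one that was right on the first try because it is shortest. With a real model, the four callables would be four prompts to the same network.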

Key takeaways:
  • Generate-verify-repair works because it trains "judging correctness" as its own skill, instead of hoping the generator polices itself.
  • Population search plus tournament selection at test time trades compute for correctness, a fit for any task with an automatic verification signal.
  • The approach ports to code, formal proofs, anything machine-checkable. Math just has the cleanest verification signal.


04 Image Models Draw One Picture. Who Teaches Them to Interleave?

Image generators are already strong at single-image generation and editing. One ability stays locked by architecture: interleaved generation, sequences where text and images alternate. That format is exactly what visual storytelling, step-by-step instruction, and embodied manipulation need. InterleaveThinker doesn't retrain the model. It wraps any existing generator in a multi-agent pipeline. A planner orders the text-and-image inputs into an execution sequence and issues step-by-step instructions. A critic checks each output for drift and rewrites the instruction to regenerate when it strays.

The hard part: one interleaved trajectory can call the generator 25-plus times, so RL over the whole trajectory is impractical. The team uses single-step RL instead, with accuracy and step-wise rewards, optimizing only the single step via GRPO to steer the full trajectory. It reaches parity with Nano Banana and GPT-5 on interleaved generation benchmarks, and unexpectedly lifts the backbone on reasoning benchmarks like WISE and RISE. The idea is solid; whether it bolts on cleanly depends on the paper's consistency results across different generators.
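
To make the single-step idea concrete, here is a minimal sketch of the group-relative advantage that GRPO uses, applied to several candidate outputs sampled for one step of a trajectory. The normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO form. Everything else is an assumption for illustration: the function names, the use of population standard deviation, and the way `step_reward` mixes the accuracy and step-wise rewards with a `step_weight` of 0.5, since the digest does not give the paper's actual reward weighting.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the group sampled for the same step."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_reward(accuracy, step_quality, step_weight=0.5):
    """Hypothetical mix of the accuracy reward and the step-wise reward."""
    return accuracy + step_weight * step_quality


# Four candidate outputs sampled for ONE step; the rest of the trajectory stays fixed.
samples = [(1.0, 0.8), (0.0, 0.4), (1.0, 0.2), (0.0, 0.6)]
rewards = [step_reward(acc, quality) for acc, quality in samples]
advantages = group_relative_advantages(rewards)
```

Because the advantages are computed inside one group of samples for one step, no rollout of the remaining 25-plus generator calls is needed to get a training signal, which is the cost the single-step formulation avoids.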

Key takeaways:
  • Interleaved generation (alternating text and images) is the key gap as image models move toward agent form, worth watching for visual-storytelling and tutorial teams.
  • This is a bolt-on pipeline, claimed to attach to any existing generator with no backbone retraining.
  • Steering a multi-step trajectory with single-step RL sidesteps the compute cost of long-trajectory optimization, a reusable trick for agentic generation.


05 Learn a 3D Motion Prior Without a Single 3D Frame

The usual recipe for 3D human motion generation needs 3D motion-capture data first — the most expensive, least scalable part. VideoMDM shows you can skip it. Precise 2D poses pulled from monocular video are enough to train a coherent 3D motion prior. The mechanism is a little backwards. An off-the-shelf 2D-to-3D lifter produces rough 3D pose sequences as a "noisy teacher." The model denoises in 3D space, projects predictions back to 2D, and gets supervised by the accurate 2D keypoints.

The authors back it with theory worth taking seriously: under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision. On HumanML3D, VideoMDM comes close to MDM trained on full 3D ground truth (FID 0.88 versus 0.54, where lower is better, a gap of 0.34). On real video datasets, human raters prefer its generated motion.
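
A small numeric sketch helps show why depth weighting matters in a reprojection loss. The digest does not give the paper's exact loss, so the weight used here, (depth / focal length) squared, is an illustrative assumption, as are the pinhole camera model and the function names. Under a pinhole projection a sideways 3D error shrinks by focal / depth when projected, so this weight converts squared image error back into squared metric error.

```python
def project(point, focal=1.0):
    """Pinhole projection of a 3D point (x, y, z) with z > 0 onto the image plane."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def depth_weighted_reprojection_loss(pred_3d, target_2d, focal=1.0):
    """Mean squared 2D error, with each joint weighted by (depth / focal) ** 2.

    The weighting is an assumption for illustration, not the paper's formula.
    """
    total = 0.0
    for (x, y, z), (u_t, v_t) in zip(pred_3d, target_2d):
        u, v = project((x, y, z), focal)
        weight = (z / focal) ** 2  # undo the 1/z shrinkage of lateral errors
        total += weight * ((u - u_t) ** 2 + (v - v_t) ** 2)
    return total / len(pred_3d)


# The same 0.2-unit sideways error, once at depth 2 and once at depth 4.
near = depth_weighted_reprojection_loss([(0.7, 0.0, 2.0)], [project((0.5, 0.0, 2.0))])
far = depth_weighted_reprojection_loss([(0.7, 0.0, 4.0)], [project((0.5, 0.0, 4.0))])
```

Both values come out at 0.2 squared, the 3D error itself, whereas an unweighted 2D loss would penalize the far joint four times less. The sketch covers only sideways errors at a known depth; errors along the viewing direction are the harder case that the paper's assumptions have to handle.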

Key takeaways:
  • 2D supervision can stand in for 3D and unlock 3D generation, sidestepping the costly motion-capture bottleneck.
  • Unlike lifting 2D to 3D only at inference time, this learns a coherent 3D motion manifold during training.
  • Teams in character animation or motion-data synthesis should watch this: data acquisition cost could drop sharply.

Also Worth Noting

06 The Bottleneck in Autonomous Research Is Environment Design, Not Agent Workflow
[Agent] EurekAgent treats optimizable metrics and the execution environment as the main battleground, echoing recent environment-engineering work.

07 The Lever for World-Action Models Is a Semantic Vision-Action Tokenizer, Not Reconstruction Fidelity
[Robotics] RepWAM wants that representation to connect future prediction with robot control.

08 Turn the Agent Harness From Hand Engineering Into a Trainable Plugin
[Agent] HarnessBridge lets the layer that connects to the environment optimize alongside the task.

09 Build a Walkable Surround World in Real Time From One Narrow Image
[Video Gen] MoVerse separates "world building" from "observation rendering."

10 Make Hidden-State Recurrent Latent Reasoning Switchable and Trainable With On-Policy RL
[Reasoning] Eases the old problem that latent CoT is hard to optimize and interpret.

11 Post-Training Quantization Compresses LLMs to Ternary Weights and Low-Bit Activations
[Efficiency] TWLA pushes deployment-grade compression to the limit.

12 Step-Level Caching for Diffusion Models Drops the Threshold Heuristic
[Efficiency] It makes budget-constrained caching decisions against final output quality directly.

13 Fix the "Bag of Words" Flaw in CLIP-Style Models
[Multimodal] Cross-modal masked composition recovers object relations and attribute binding.

14 DoorDash in Production: Multi-Agent RL Learns Three-Sided Dispatch Weights From Delayed Market Feedback
[Agent] A rare production-system case study.

15 Detect Hallucinations Under Zero-Information-Source Constraints
[Safety] No model internals, no external reference, just human-like-criterion probes.

Today's Observation

Three unrelated papers today aim at the same spot — not the model, but the layer that connects reasoning to action. SpatialClaw says what limits spatial reasoning is the agent's action interface for calling tools. RepWAM says the real lever for world-action models is the vision-action tokenizer representation. HarnessBridge says the harness between agent and environment should be trainable. Each names a different interface layer, yet they reach the same conclusion: the optimizable lever is sinking down from the policy network into the mediating representation.

Keep this separate from the recent claim that "environments are the new scaling axis." Environments are the external world the agent faces — tasks, rewards, the interactive state space. These three are about the translation layer wedged between policy and environment, or between perception and control. One makes the stage bigger. The other swaps the "transmission rod" between policy and stage from a hardwired part into a learnable, tunable one. The distinction is practical. If the bottleneck is the environment, invest in the environment engine. If it's the mediating interface, then the same backbone and the same environment can yield real gains just by making the interface trainable or more stateful — SpatialClaw's 11-point lift across six backbones is the evidence.

Something to act on: inventory the middle layers in your own agent system that you treat as fixed plumbing — tool-call protocols, observation encoding, action tokenizers, the harness. Pick the one most like a hardwired translation layer. Make it stateful or able to iterate from intermediate results first, no RL required. Run an A/B on your current backbone, see how much the interface alone yields, then decide whether to open a separate training track for it.