Move to See: Top Model Reaches the Target Just 12% of the Time

Today's Overview

Spatial intelligence shifts from passive understanding to active perception. TVR asks an agent to turn and step through a 3D room until its view matches a target photo. The strongest closed model succeeds only 12% of the time, but vision-action SFT pulls a 9B open model from single digits past 50%.
Long-context compression can preserve code reasoning. LongAttnComp fine-tunes one lightweight scoring layer, trains it once, and reuses it across three model families — matching full context on code debugging after compression.
VLMs writing code to build 3D models fail in specific ways. 3DCodeBench drops 12 VLMs into real modeling software. Most failures come from wrong API calls and disconnected geometry; multi-turn iteration with execution feedback brings them back.
The frontier in skill adaptation moves to attribution granularity. SkillAdaptor pushes failure blame down from the whole trajectory to the specific step. Backbones stay frozen, no training is needed, and every skill edit is auditable, even though the largest gains are only +1.5 to +1.8 points.

Featured

01 Hand It a Photo, Make It Walk to That View

Show a foundation model a target photo, then drop the agent into a 3D room and let it turn its head and step around until its view lines up with the picture. Humans do this without thinking. Models almost can't. This paper proposes a new task, TVR (Target Viewpoint Reproduction), that turns spatial intelligence from "passively read a given image" into "move in order to see." It ships with TVRBench, an open indoor simulation benchmark.

The numbers sting. The best open and closed models succeed only 7.8% and 12.0% of the time — nowhere near solved. The paper isolates two consistent failure points. Models handle multi-turn visual history poorly, and once reproducing a viewpoint requires the body to translate rather than just pivot in place, scores fall off a cliff. That gap exposes the missing link: mapping a spatial difference into an embodied action.

The post-training recipe is the interesting part. Vision-action SFT does the heavy lifting, taking a 9B open model from single digits to 50.8%. Multi-turn GRPO adds a small multi-room refinement on top, reaching 51.4%. CoT supervision and single-turn GRPO actually hurt closed-loop performance. On tasks that demand continuous decisions, "think it through first, then move" loses to "learn while moving."

Key takeaways:
- TVR reframes spatial intelligence from passive understanding to active perception, giving embodied and navigation teams a new axis for capability evaluation.
- Off-the-shelf foundation models are near-zero on this task: the strongest closed model hits only 12%. Don't expect to drop one in and use it.
- Vision-action SFT carries the gains while CoT and single-turn GRPO backfire, a sign that training closed-loop embodied tasks differs from static reasoning.

Source: Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

02 Slimming Long Documents Without Losing Code Reasoning

Before pushing a 100K-token input through a model, a common way to cut long-context cost is to compress part of it, then prefill. The catch: existing training-free attention compression methods do fine on simple retrieval but visibly drop on tasks that need real reasoning, like code debugging. LongAttnComp stops being fully training-free. It fine-tunes one lightweight cross-attention scoring layer to decide which tokens stay, paired with token-level chunking, top-p selection under a token budget, and positional reordering.
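The selection step described above can be sketched in a few lines. This is a minimal illustration, not LongAttnComp's implementation: the `Chunk` type, the `select_chunks` function, and the scores are hypothetical, the scores stand in for whatever the fine-tuned scoring layer would output, and "positional reordering" is interpreted here simply as putting the surviving chunks back in document order.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int      # position of the chunk in the original document
    n_tokens: int   # token count of the chunk
    score: float    # hypothetical non-negative relevance score


def select_chunks(chunks, top_p=0.9, token_budget=1000):
    """Keep the highest-scoring chunks until top_p of the total score mass
    is covered or the token budget is used up, then restore document order."""
    total = sum(c.score for c in chunks)
    if total <= 0:
        return []
    kept, mass, used = [], 0.0, 0
    for c in sorted(chunks, key=lambda c: c.score, reverse=True):
        if mass >= top_p * total:
            break  # enough score mass is already covered
        if used + c.n_tokens > token_budget:
            continue  # skip chunks that would overflow the budget
        kept.append(c)
        mass += c.score
        used += c.n_tokens
    # Assumed reading of "positional reordering": back to document order.
    return sorted(kept, key=lambda c: c.index)


chunks = [
    Chunk(index=0, n_tokens=100, score=3.0),
    Chunk(index=1, n_tokens=100, score=1.0),
    Chunk(index=2, n_tokens=100, score=5.0),
    Chunk(index=3, n_tokens=100, score=1.0),
]
kept = select_chunks(chunks, top_p=0.95, token_budget=250)
print([c.index for c in kept])
```

With a 250-token budget only two 100-token chunks fit, so the two highest-scoring chunks survive and come back in their original order. How the real method scores, chunks, and reorders tokens is up to the paper; the sketch only shows how a top-p cutoff and a hard budget interact.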

The cross-family result is the headline. Train the compressor once and it transfers directly to four target models across three different families, rather than binding to one. On InfiniteBench code debugging, post-compression accuracy matches or beats uncompressed full context, and multi-document reasoning mostly recovers the gap from the first stage. These are self-reported benchmark numbers, though. How much survives the move to your own workload needs a real test.

Key takeaways:
- To keep hard tasks like code and multi-hop reasoning through long-context compression, pure training-free attention methods fall short. You need a lightweight scoring layer fine-tuned in.
- Cross-family means the compressor trains once and reuses across multiple models, so you don't redo it per model at deploy time.
- Teams building long-document or long-conversation products should test it on their own tasks. Matching full context on a benchmark doesn't mean it matches in your setting.

Source: LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

03 Where VLMs Break When Coding 3D Models

Neural-network-generated 3D assets carry an old problem: non-deterministic, hard to edit, and still in need of rework once they reach an engine. Procedural modeling through code sidesteps all of that — the output is deterministic, the parameters are adjustable, and the engine uses it directly. The price is that you need to know 3D software APIs, parametric design, and geometric reasoning. That bar is high. 3DCodeBench systematically drops 12 vision-language models into real modeling software, asking them to turn text and images into code that generates 3D content, then judges the results with human pairwise preferences (3DCodeArena).

The capability boundary it measures is concrete. Most failures stall on wrong API calls — invoking interfaces that don't exist or don't match. Even when a render succeeds, geometric parts often sit disconnected or float apart. The good news: test-time scaling works. More thinking budget and multi-turn correction both lift results overall. That points straight at what models lack — high-quality procedural code data, and an execution environment that gives high-fidelity feedback so the model can try and fix in the loop.
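The try-and-fix loop is easy to picture in code. The sketch below is an assumption-heavy toy, not 3DCodeBench's harness: `run_with_feedback` and `fake_model` are hypothetical names, `make_cube` is a deliberately nonexistent API, and plain Python `exec` stands in for running a script inside real modeling software.

```python
import traceback


def run_with_feedback(generate, task, max_turns=3):
    """Ask `generate` for code, execute it, and feed any error back."""
    feedback = None
    for turn in range(1, max_turns + 1):
        code = generate(task, feedback)
        namespace = {}
        try:
            # Stand-in for executing the script inside the modeling tool.
            exec(code, namespace)
            return {"ok": True, "turns": turn, "namespace": namespace}
        except Exception:
            # Keep only the last traceback line as compact feedback.
            feedback = traceback.format_exc().strip().splitlines()[-1]
    return {"ok": False, "turns": max_turns, "feedback": feedback}


def fake_model(task, feedback):
    """Stand-in for a VLM: calls a nonexistent API first, then corrects it."""
    if feedback is None:
        return "mesh = make_cube(size=2)"        # hypothetical, undefined API
    return "mesh = {'type': 'cube', 'size': 2}"  # corrected after feedback


result = run_with_feedback(fake_model, "build a cube")
print(result["ok"], result["turns"])
```

The first turn fails on the undefined call, the error message goes back to the generator, and the second turn succeeds. A real environment would also need to report geometric problems such as disconnected parts, which an exception alone does not capture.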

Key takeaways:
- Procedural code for 3D takes the deterministic-and-editable route, a complementary trade-off against neural 3D generators. Worth watching for teams building engine asset pipelines.
- Today's VLM bottleneck is API knowledge and geometric connection. The problem isn't writing code, it's writing the right code.
- Multi-turn iteration plus execution feedback rescues the task. Whoever builds the high-fidelity feedback environment first takes the lead.

Source: 3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

04 Skill Adaptation Moves From Whether to How Precisely

Training-free skill adaptation has a detail no one handled seriously: when an agent fails, most methods edit skills from the whole trajectory or session-level feedback. They know the task failed but can't say which step or which skill is to blame. The edits come out unstable, sometimes broadening with each pass. SkillAdaptor pushes attribution down to the step level. It locates the first correctable error step, attaches blame to a specific candidate skill, then makes a targeted update under an explicit acceptance check — the backbone frozen throughout.
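A rough sketch of that control flow, under stated assumptions: `Step`, `first_correctable_error`, `adapt`, `propose`, and `score` are all hypothetical names, the `correctable` flag is assumed to be given rather than inferred, and the acceptance check is modeled as "keep the edit only if a validation score improves." The paper's actual attribution and acceptance rules may differ.

```python
from dataclasses import dataclass


@dataclass
class Step:
    skill: str          # skill invoked at this step
    ok: bool            # whether the step succeeded
    correctable: bool   # whether a skill edit could plausibly fix it


def first_correctable_error(trajectory):
    """Return (index, step) of the first failed step that can be fixed."""
    for i, step in enumerate(trajectory):
        if not step.ok and step.correctable:
            return i, step
    return None


def adapt(skills, trajectory, propose, score):
    """Edit only the blamed skill; keep the edit only if the score improves."""
    hit = first_correctable_error(trajectory)
    if hit is None:
        return skills, None
    index, step = hit
    candidate = dict(skills)  # copy so the original skill set is untouched
    candidate[step.skill] = propose(skills[step.skill], step)
    if score(candidate) > score(skills):  # explicit acceptance check
        return candidate, index
    return skills, None  # reject: keep the old skill set


skills = {"search": "v1", "buy": "v1"}
trajectory = [
    Step("search", ok=True, correctable=False),
    Step("search", ok=False, correctable=False),
    Step("buy", ok=False, correctable=True),
]
propose = lambda text, step: text + "+fix"
score = lambda s: sum(v.endswith("+fix") for v in s.values())
new_skills, blamed = adapt(skills, trajectory, propose, score)
print(new_skills, blamed)
```

Only the skill tied to the blamed step changes, and the returned step index is the audit trail: each accepted edit points at one concrete failure rather than at the whole session.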

Across three suites (WebShop, PinchBench, Claw-Eval) and three models, it beats both no-skill and existing skill-adaptation baselines consistently. But the largest single gain is only +1.5 to +1.8 points. The increment is small, so the score isn't the point. Auditability is — every skill edit maps to a specific failure step instead of a black-box rewrite of everything.

Key takeaways:
- The contest in skill maintenance shifts from "whether to reuse" to "attribution granularity." Step-level is steadier and more auditable than session-level feedback.
- Gains are only +1.5 to +1.8. Don't adopt it for the metric. The mechanism worth studying is pinpointing the specific failed step.
- Training-free with a frozen backbone, it plugs into OpenClaw-style harnesses at low cost. Teams turning skills into assets can use it as a reference.

Source: SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories


Also Worth Noting

Does VLM Document Understanding Transfer Across Languages? (Evaluation): HakushoBench builds a Japanese chart and table VQA benchmark from government white papers, targeting the blind spot of non-English document understanding.

Legal and Humanities Citations Hide in Footnotes (Retrieval): Existing extraction tools are built for the structured end-of-paper references of the natural sciences; FOSSIL provides a dataset and pipeline for footnote citations interwoven with commentary.

Updating Parameters Only at a Few Moments Can Still Be Near-Optimal (Training): An algorithm for linear contextual bandits under a "very few parameter updates" constraint, where observation and action selection stay online but reward feedback merges in only at select moments, close to real engineering limits (ICML accepted).

Today's Observation

TVR and 3DCodeBench look unrelated side by side — one moves a body through a 3D room to find a viewpoint, the other types code in modeling software to build assets. They land on the same line. VLMs are being pushed from "look at an image, answer a question" toward "take actions in a 3D world." TVR moves in order to see; 3DCodeBench writes code in order to build. Neither settles for passively digesting one given observation. Both make the model actively operate the space.

What makes the shift notable is that it rewrites both evaluation and capability-building at once. Success is no longer judged by whether the answer is right, but by whether the action brought the environment to the target state. TVR checks whether you moved to the spot that reproduces the photo. 3DCodeBench checks whether your code rendered an asset whose geometry connects.

If you work on embodied, 3D, or any direction where the model acts rather than answers, run your own tasks through this question this week: does your evaluation metric measure passive understanding or active operation? If it's still stuck at "read a given observation, return an answer," borrow the TVRBench or 3DCodeBench approach and add a closed-loop test that makes the model take real actions in the environment and scores it on the resulting state.
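As a starting point, a state-based closed-loop test can be very small. The sketch below is a toy under loud assumptions: a 2D pose replaces the 3D room, the policy sees poses rather than images, and the action set, stride, turn angle, and tolerances are invented for illustration. None of it reflects TVRBench's actual success criterion.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    x: float
    y: float
    heading_deg: float


def step(pose, action, stride=0.25, turn=30.0):
    """Apply one discrete action and return the new pose."""
    if action == "turn_left":
        return Pose(pose.x, pose.y, (pose.heading_deg + turn) % 360)
    if action == "turn_right":
        return Pose(pose.x, pose.y, (pose.heading_deg - turn) % 360)
    if action == "forward":
        rad = math.radians(pose.heading_deg)
        return Pose(pose.x + stride * math.cos(rad),
                    pose.y + stride * math.sin(rad),
                    pose.heading_deg)
    raise ValueError(f"unknown action: {action}")


def reached(pose, target, pos_tol=0.3, ang_tol=15.0):
    """Score the resulting state: close enough in position and heading."""
    dist = math.hypot(pose.x - target.x, pose.y - target.y)
    diff = abs(pose.heading_deg - target.heading_deg) % 360
    diff = min(diff, 360 - diff)  # shortest angular distance
    return dist <= pos_tol and diff <= ang_tol


def run_episode(policy, start, target, max_steps=20):
    """Let the policy act until it stops, then judge the final pose."""
    pose = start
    for _ in range(max_steps):
        action = policy(pose, target)
        if action == "stop":
            break
        pose = step(pose, action)
    return reached(pose, target)


def scripted(actions):
    """Build a policy that replays a fixed action list (hypothetical helper)."""
    it = iter(actions)
    return lambda pose, target: next(it, "stop")


start, target = Pose(0.0, 0.0, 0.0), Pose(0.5, 0.0, 0.0)
walks = run_episode(scripted(["forward", "forward", "stop"]), start, target)
pivots = run_episode(scripted(["turn_left", "stop"]), start, target)
print(walks, pivots)
```

The policy that translates reaches the target while the one that only pivots does not, which is the distinction a final-state metric makes visible and an answer-matching metric hides.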