Knowing When to Stop Doubles an Agent's Recall

Today's Overview

  • A reliable agent has to know when to stop. Across 13 agent systems and 28,000+ tasks, when to abstain proved harder than whether to — and the bigger, more reasoning-heavy models were sometimes more reluctant to quit. CONVOLVE distills stop rules and lifts Llama-3.3-70B's timely-abstention recall from 26.7 to 57.4.
  • Acing the exam doesn't mean it works in the clinic. Stanford ran a blind test on 620 real point-of-care questions, with 149 practicing physicians scoring by specialty. The purpose-built clinical tool OpenEvidence beat three frontier models on all five dimensions, by 25 to 39 points.
  • Long-video relighting breaks at the seams between chunks. NVIDIA's HorizonRelight passes the previous chunk's target-domain latent into the next one and trains with masked self-conditioning, fixing the lighting jumps at chunk boundaries during sliding-window inference.
  • Teaching robots from human video is about understanding interaction, not copying motion. MIT's Human2Any decomposes human demos into composable, object-interaction priors, transferring to a Franka arm and a humanoid with zero target-task robot data.

Featured

01 Agent: Benchmarks Forgot To Measure Knowing When To Stop

Everyone building agents competes on capability: search, click, run a terminal, get the task done over many turns. One case keeps getting ignored — the goal was never well-specified, or the environment simply can't deliver it. A reliable agent shouldn't keep grinding here. It should recognize that more interaction won't help, and stop.

This paper names that ability Agentic Abstention and separates it from the classic single-turn "answer or not" decision. An agent can answer, abstain, or keep exploring on every turn, and the signal that it should abstain often only surfaces after several turns of interaction. The authors evaluate across shopping, terminal, and QA environments — 13 agent systems, 28,000+ tasks. The hard part isn't whether an agent can abstain but when: some never stop when they should, others burn many wasted turns before they do. The worst case is the task that looks doable until the environment reveals it isn't, like a query with no matching product. One counterintuitive finding: larger, more reasoning-capable models are sometimes worse at stopping on time.

CONVOLVE addresses this without touching weights. It distills full interaction trajectories into reusable stop rules, and on WebShop it raises Llama-3.3-70B's timely-abstention recall from 26.7 to 57.4.

Key takeaways:

  • Knowing when not to act drives cost, latency, and user trust. Put it in your eval metrics instead of tracking task success alone.
  • Don't assume the bigger model is the safer one. On timely stopping it may be worse, so verify this separately during model selection.
  • CONVOLVE-style context engineering (injecting stop rules without retraining) is a cheap thing to try on production agents.


02 Evaluation: The Model That Aces Exams Loses In The Clinic

Physicians put millions of clinical questions to AI tools every week, but the evals almost all use exam items and hypotheticals — not what gets asked in an actual clinic. Stanford collected 620 real point-of-care questions (Real-POCQi) and had 149 practicing physicians from 36 states grade them blind, each question scored by a doctor in the matching specialty.

The comparison was head-to-head: three frontier models (Claude Opus 4.8, Gemini 3.1 Pro, GPT-5.5) against the purpose-built clinical tool OpenEvidence. Across accuracy, clinical usefulness, source quality, and two other dimensions, the specialized tool won every one, by 25 to 39 points. The paper hedges its own conclusion: this doesn't mean general models can't do the job, only that targeted engineering and customization deliver real gains in a vertical.

One detail stands out. Using an LLM as judge diverged systematically from the expert physicians. The two broadly agreed on which tool was best, but if you auto-evaluate domain outputs with a model, that bias is worth watching.
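One way to watch for that bias is to score a sample with both the model judge and human experts and compare the two directly. The sketch below is a minimal illustration, not the paper's protocol: the function name, the data layout, and every number in the example are made-up placeholders.

```python
from statistics import mean


def judge_bias(expert, judge):
    """Compare judge scores with expert scores, system by system.

    Both arguments map a system name to a list of scores on the same
    questions in the same order. This layout is an assumption made for
    illustration, not a format taken from the paper.
    """
    # Mean signed gap: positive means the judge scores higher than experts.
    gaps = {
        name: mean([j - e for j, e in zip(judge[name], expert[name])])
        for name in expert
    }
    # Check separately whether both graders rank the systems the same way.
    expert_rank = sorted(expert, key=lambda n: mean(expert[n]), reverse=True)
    judge_rank = sorted(judge, key=lambda n: mean(judge[n]), reverse=True)
    return gaps, expert_rank == judge_rank


# Placeholder scores for two hypothetical systems, two questions each.
expert_scores = {"tool_a": [80.0, 70.0], "tool_b": [50.0, 60.0]}
judge_scores = {"tool_a": [85.0, 75.0], "tool_b": [70.0, 70.0]}

gaps, same_ranking = judge_bias(expert_scores, judge_scores)
print(gaps, same_ranking)
```

In this toy data the judge agrees on the ranking but inflates the weaker system far more than the stronger one, which is the kind of systematic gap a ranking check alone would hide.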

Key takeaways:

  • In specialized verticals, exam-style benchmark scores and real-world usefulness diverge sharply. Evaluate on a query distribution close to the real one before you ship.
  • General models aren't unusable here, but they need targeted engineering and customization to close the gap with a dedicated tool.
  • LLM-as-judge drifts systematically from expert judgment on specialist output. Discount your automated eval results accordingly.


03 Video Gen: Long-Video Relighting Breaks At The Seams

Relighting demos look great until you run a clip that lasts a few minutes. Video diffusion models train on short clips, so long video gets split into chunks and processed with a sliding window. Lighting then jumps at the boundary between chunks, breaking continuity.

NVIDIA's HorizonRelight reframes the task as time-conditioned latent domain translation. The previous chunk's target-domain latent gets passed to the next chunk as a starting point, and masked self-conditioning teaches the model to continue from partially occluded context rather than painting each chunk independently. A controllable generation model first produces a single "relit anchor" frame to warm-start the process, which also gives you a prompt-driven interface for controlling the lighting.
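The inference-time half of this idea, carrying each chunk's output into the next chunk, can be sketched in a few lines. This is a toy under stated assumptions, not HorizonRelight's code: frames are plain numbers, `toy_model` is a stand-in for the diffusion model, and every name here is hypothetical. Masked self-conditioning is a training technique and is not shown.

```python
from typing import Callable, List, Optional, Sequence

Latent = List[float]
# Hypothetical model signature: (source_chunk, carried_context) -> relit_chunk
ChunkModel = Callable[[Sequence[float], Optional[Latent]], Latent]


def relight_long_video(
    frames: Sequence[float],
    model: ChunkModel,
    chunk_size: int,
    anchor: Optional[Latent] = None,
) -> Latent:
    """Process chunks in order, feeding each chunk's output to the next."""
    output: Latent = []
    context = anchor  # the first chunk is warm-started from the anchor, if any
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        relit = model(chunk, context)
        output.extend(relit)
        context = relit  # carry the target-domain result forward
    return output


def toy_model(chunk: Sequence[float], context: Optional[Latent]) -> Latent:
    """Stand-in for a real model: picks one lighting level per chunk.

    With context it continues from the last carried value. Without context
    it guesses from the chunk alone, which is what produces boundary jumps.
    """
    level = context[-1] if context else float(max(chunk))
    return [level for _ in chunk]


frames = [1, 2, 3, 9, 8, 7]

# Each chunk relit independently: the lighting level jumps at the boundary.
independent = relight_long_video(frames, lambda c, _ctx: toy_model(c, None), 3)

# Previous chunk's output carried forward: the level stays continuous.
carried = relight_long_video(frames, toy_model, 3)

# A relit anchor sets the level the whole sequence continues from.
anchored = relight_long_video(frames, toy_model, 3, anchor=[5.0])

print(independent, carried, anchored)
```

The only structural difference between the jumping and the continuous run is whether `context` is passed along, which is why the fix slots into an existing sliding-window loop.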

The paper reports far fewer boundary artifacts on real long videos and much less cross-chunk appearance drift. Work like this is only as good as the flicker and detail stability in actual footage, so the metrics aren't enough on their own.

Key takeaways:

  • The real bottleneck in video relighting isn't single-frame quality but long-range consistency, and boundary jumps are what block production use.
  • Passing target-domain latents across chunks plus masked self-conditioning is a reusable fix for sliding-window discontinuity.
  • Teams building video post-production or generation tools should track this, but wait for stability on real long footage before betting on it.


04 Robotics: Copying Human Motion Is The Wrong Goal

The biggest obstacle in teaching robots from human video isn't a data shortage. It's that a human hand and a robot arm are nothing alike. The embodiment differs, the scene differs, and the robot has physical limits a person doesn't, such as poses its joints can't reach.

MIT's Human2Any skips end-to-end policy learning. It decomposes human demos into object-interaction priors, recording only the task-relevant changes in how objects should move relative to each other and abstracting away the hand-specific detail. Those priors then compose with the robot's own feasibility reasoning and motion planning, so one body of human knowledge adapts across different embodiments and scene geometries.

The authors validate on a Franka tabletop arm and an RBY-1 humanoid mobile robot, using no target-task robot training data at all. This decompose-and-check-constraints route is more interesting than another bumped success rate. It reframes imitation as understanding the interaction, then planning your own way to reproduce it.

Key takeaways:

  • Abstracting human demos into object-interaction priors sidesteps the embodiment-mismatch problem at the core of human-to-robot transfer.
  • Composable priors plus the robot's own feasibility checks mean one body of human knowledge can be reused across arms and humanoids.
  • Transfer works with zero target-task robot data, but read the full paper and watch more scenes before judging the generalization limits.

Also Worth Noting

05 Agent: Harvard's Therapeutic-Reasoning Agent Deliberates Inside A "Biomedical Tool Universe"
It weighs contraindications, comorbidities, and medications one by one before choosing a treatment, a sample of agent tool orchestration landing in a specialist domain.
06 Safety: Aristotelian Virtue Ethics As A "Character Profile" For LLMs
Fairness, honesty, courage, and restraint are treated as describable, measurable dimensions (echoing today's observation).
07 Training: TrafficAlign Pulls LLM-Generated Traffic Scenes Back To The Real Distribution
Scenes from a pretrained model don't match reality, and this automated framework corrects them into usable autonomous-driving data.
08 Evaluation: BackTranslation2.0 Rebuilds The Metric For Sign-Language Generation
Old metrics are crude and disconnected from human judgment, so this redesigns them from linguistic motivation.
09 Image Gen: FreqOrtho-SR Uses Frequency-Guided Orthogonal Experts To Resolve A Super-Resolution Tradeoff
Real-image super-resolution struggles to balance pixel fidelity and semantic quality, reconciled by splitting work across frequencies.
10 Architecture: LogiCo Unifies Structural And Logical Anomalies In One Framework
Most anomaly detection only watches structure and misses violations of logical constraints.
11 Architecture: A Spectrum-Aware Feature-Decoupling Network Tackles Background Clutter In Small-Object Detection
Background noise across different spectra drags performance down, suppressed by separating features per spectrum.
12 Reasoning: Inverse Optimization Gives Hierarchical Decision Sub-Policies A Principled Design
It sidesteps the old instability of training hierarchical policies with pure RL.

Today's Observation

Two papers with unrelated starting points both treat not acting as a real capability worth studying. Agentic Abstention asks the engineering question: when should an agent stop calling tools and abstain? It turns knowing-when-not-to-act into a measurable reliability metric. Aristotelian Virtue Profiling comes from ethics, placing restraint alongside fairness, honesty, and courage as a dimension you can measure when profiling a model's character.

One measures capability, the other character, but they land in the same spot. We've long evaluated agents by counting what they can do, and almost no one systematically asks whether they know when not to. The gap is underrated because restraint leaves no trace in a success-rate table — it never adds points, only loses them when something goes wrong.

What to do: carve out a separate class of "should have abstained" tasks in your agent eval — unclear goals, unsolvable environments — and measure how often the agent stops when it should, instead of lumping these in as ordinary failures. You'll likely find your strongest model isn't the best one here.
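A minimal sketch of that split, assuming you log per task whether it should have ended in abstention and at which turn, if any, the agent abstained. The field names, the `turn_budget` cutoff for "on time", and the metric names are all assumptions for illustration. The paper's exact definition of timely-abstention recall may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Episode:
    """One evaluated task. All field names here are illustrative."""
    should_abstain: bool          # ground truth: goal unclear or unsolvable
    abstained_at: Optional[int]   # turn at which the agent abstained, or None
    turn_budget: int              # latest turn still counted as "on time"


def abstention_report(episodes: List[Episode]) -> Dict[str, float]:
    """Score the should-have-abstained slice separately from ordinary tasks."""
    targets = [e for e in episodes if e.should_abstain]
    normal = [e for e in episodes if not e.should_abstain]

    stopped = [e for e in targets if e.abstained_at is not None]
    timely = [e for e in stopped if e.abstained_at <= e.turn_budget]
    false_stops = [e for e in normal if e.abstained_at is not None]

    def ratio(part: list, whole: list) -> float:
        return len(part) / len(whole) if whole else 0.0

    return {
        # Stopped at all, on tasks where stopping was correct.
        "abstention_recall": ratio(stopped, targets),
        # Stopped within the turn budget, on those same tasks.
        "timely_abstention_recall": ratio(timely, targets),
        # Gave up on tasks that were actually solvable.
        "false_abstention_rate": ratio(false_stops, normal),
    }


# Made-up episodes: four should-abstain tasks, two ordinary ones.
episodes = [
    Episode(should_abstain=True, abstained_at=3, turn_budget=5),
    Episode(should_abstain=True, abstained_at=9, turn_budget=5),     # too late
    Episode(should_abstain=True, abstained_at=None, turn_budget=5),  # never stopped
    Episode(should_abstain=True, abstained_at=5, turn_budget=5),
    Episode(should_abstain=False, abstained_at=None, turn_budget=5),
    Episode(should_abstain=False, abstained_at=2, turn_budget=5),    # quit a solvable task
]
report = abstention_report(episodes)
print(report)
```

Tracking the false-abstention rate alongside recall keeps an agent that quits on everything from looking good on this slice.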