A 20B Search Agent Ties the Frontier by Offloading Its Bookkeeping

Today's Overview

Deleting stale observations to save context pays off along an inverted-U, not a straight line. A sweep across models from 4B to 284B and three retrievers shows that strong retrievers paired with mid-size models gain the most, while a model that is already strong loses accuracy when masking deletes evidence it still needs.
Move the bookkeeping out of the policy and into the environment, and a 20B searcher hits 0.730 average recall. That's 11.4 points over the next-best open searcher, with the largest gains on held-out transfer benchmarks.
Stuffing charts into a report is easy. Getting them factually right is the part nobody checks. TVIR uses 100 expert-curated multimodal research tasks and scores visual reliability and text alignment as their own dimension.
Teach a model to infer intent with zero labels. MindZero turns how well a planner can explain observed behavior into a self-supervised reward, trains with heavy reasoning, and distills to a single forward pass at deployment, beating slow model-based methods in gridworld and home scenarios.

Featured

01 When Saving Context Starts Costing Accuracy

A long-horizon search agent jams retrieval results into context on every tool call, and the window fills up fast. The cheapest fix is to wipe stale observations — old retrieved content the model no longer needs — and free up the budget. This paper's real contribution isn't the saving. It's the regime map: across models from 4B to 284B and three retrievers, masking's payoff isn't monotonic. It's an asymmetric inverted-U.

Every cell on that map behaves differently, so a blanket rule fails. Weak retrievers return noisy, low-hit evidence, so deleting it barely matters — there wasn't much worth keeping. Strong retrievers paired with mid-capability models gain the most: the retriever floods the window with good evidence, pressure peaks, and the model can't implicitly filter noise on its own, so clearing consumed observations hands the freed budget to later turns. Push the model's own capability high enough that it can tell signal from noise inside a long context, and masking backfires — it deletes the exact evidence the model meant to revisit, and accuracy drops.

The mechanism is a trade: tokens for turns. Masking removes content the model has mostly stopped attending to and rarely revisits, buying more executable tool-call rounds with the saved budget. Whether that trade pays depends on whether those extra rounds rescue tasks that would otherwise fail, not on how many tokens you saved. The two don't move together. In weak-retriever cells the saved tokens buy no useful action; in strong-model cells deleting the wrong evidence costs accuracy outright. The authors released the full scaffold and experiment trajectories, so any team building a search or research agent can place its own model size and retriever config on the map, confirm whether it sits in the gain zone or the loss zone, and decide whether to enable masking before finding out the hard way.
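To make the tokens-for-turns trade concrete, here is a minimal Python sketch. It is not the authors' released scaffold. The staleness rule (keep only the last `keep_last` observations), the whitespace token counter, and every name in it (`Turn`, `mask_stale`, `extra_turns`, `MASK_PLACEHOLDER`) are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical placeholder text; the paper's actual masking format is not given here.
MASK_PLACEHOLDER = "[observation masked]"


@dataclass
class Turn:
    action: str       # the tool call the agent issued
    observation: str  # the retrieved content the tool returned


def count_tokens(text: str) -> int:
    # Crude whitespace count standing in for a real tokenizer (assumption).
    return len(text.split())


def context_tokens(turns: list[Turn]) -> int:
    # Total tokens the transcript occupies in the context window.
    return sum(count_tokens(t.action) + count_tokens(t.observation) for t in turns)


def mask_stale(turns: list[Turn], keep_last: int) -> list[Turn]:
    # Replace every observation older than the last `keep_last` turns.
    # Actions are kept so the agent still sees what it already tried.
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    cutoff = max(len(turns) - keep_last, 0)
    return [
        Turn(t.action, MASK_PLACEHOLDER) if i < cutoff else t
        for i, t in enumerate(turns)
    ]


def extra_turns(turns: list[Turn], keep_last: int, tokens_per_turn: int) -> int:
    # How many additional tool-call rounds the freed budget could pay for.
    saved = context_tokens(turns) - context_tokens(mask_stale(turns, keep_last))
    return max(saved, 0) // tokens_per_turn
```

The sketch only measures the token side of the trade. Whether the extra rounds rescue a failing task, or whether a masked observation was one the model meant to revisit, is exactly what the regime map has to answer empirically.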

Key takeaways:
- Context masking's payoff is a conditional inverted-U, not a free optimization. Copy it blindly and you may lose accuracy.
- To decide whether to turn it on, locate your "model capability × retriever strength" on the map: strong retriever plus mid-size model wins, and leave it off once the model is already strong.
- The authors published scaffold and trajectories, so search and research teams can position their own setup directly.

Source: Masking Stale Observations Helps Search Agents -- Until It Doesn't: A Regime Map and Its Mechanism

02 What a Search Agent Should Remember, and What It Shouldn't

Training a search agent carries an overlooked burden: the model works on an ever-growing transcript, deciding what to search next while also tracking which documents it has seen, which evidence helped, and which constraints stay open. Harness-1's bet is that most of this bookkeeping is something the environment can maintain reliably, so it shouldn't sit inside the policy for RL to optimize.

The candidate pool, evidence links, verification records, and deduplicated observations all move into a stateful harness. The policy keeps only the hard semantic decisions: what to search, what to keep, what to verify, when to stop. The result is a 20B retrieval subagent scoring 0.730 average curated recall across eight benchmarks spanning web, finance, patents, and multi-hop QA — 11.4 points over the next-best open search subagent, and competitive with far larger frontier models.
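The split between environment-held state and policy decisions can be sketched roughly as follows. This is an illustrative Python toy, not Harness-1's implementation: the class `SearchHarness`, its methods, and its fields are all hypothetical names chosen to mirror the four kinds of state listed above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SearchHarness:
    """Hypothetical environment-side state the policy does not have to track."""

    candidates: dict[str, str] = field(default_factory=dict)      # doc_id -> text
    evidence: dict[str, list[str]] = field(default_factory=dict)  # claim -> doc_ids
    verified: dict[str, bool] = field(default_factory=dict)       # doc_id -> result

    def observe(self, results: dict[str, str]) -> dict[str, str]:
        # Deduplicate: record results, return only documents not seen before.
        fresh = {d: t for d, t in results.items() if d not in self.candidates}
        self.candidates.update(fresh)
        return fresh

    def keep(self, claim: str, doc_id: str) -> None:
        # The policy decides WHAT to keep; the harness stores the link.
        if doc_id not in self.candidates:
            raise KeyError(doc_id)
        links = self.evidence.setdefault(claim, [])
        if doc_id not in links:
            links.append(doc_id)

    def verify(self, doc_id: str, ok: bool) -> None:
        # The policy decides WHAT to verify; the harness keeps the record.
        if doc_id not in self.candidates:
            raise KeyError(doc_id)
        self.verified[doc_id] = ok

    def summary(self) -> str:
        # Compact state view a policy could read instead of the full transcript.
        n_links = sum(len(v) for v in self.evidence.values())
        n_ok = sum(self.verified.values())
        return f"{len(self.candidates)} candidates, {n_links} evidence links, {n_ok} verified"
```

In this arrangement the policy's action space stays at the semantic level (search, keep, verify, stop), while deduplication and record-keeping are ordinary deterministic code that needs no training.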

The biggest gains land on held-out transfer benchmarks, which suggests that RL over explicit search state learns retrieval behavior that generalizes across domains rather than overfitting the training distribution. Read it alongside the masking paper above and you get two cuts at one problem: one externalizes state to the environment, the other masks state out of the context window.

Key takeaways:
- Strip recoverable state management out of the policy and let the environment hold it, so RL concentrates on the genuinely hard search decisions.
- A 20B model beats the next-best open searcher by 11.4 points and is competitive with far larger frontier models this way, a useful architecture choice for compute-limited teams.
- The strong transfer-benchmark generalization is the signal most worth checking, but read the full paper to confirm it holds on harder real retrieval tasks.

Source: Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

03 Putting a Chart in a Report Is Easy. Getting It Right Isn't.

Anyone building research agents knows the easy part: have a model retrieve over many steps and write a multi-thousand-word report. The charts inside that report are another story — whether the chosen figure fits, whether a self-drawn chart's data is accurate, whether it lines up with the surrounding analysis. There's almost no systematic way to evaluate any of it.

TVIR fills that gap with 100 expert-curated multimodal research tasks, where each visual element has to serve a specific analytical goal instead of padding the page. The companion TVIR-Agent is a hierarchical multi-agent framework that separates outlining, image retrieval, traceable chart generation, and context-ordered writing into distinct stages, posting solid overall results across nine deep research systems. Its real contribution isn't that baseline. It's pulling "factual reliability of visual information and alignment with the text" out as its own evaluation dimension — exactly the part past long-report evals skipped.

Key takeaways:
- Research and report agent teams can add "is the figure factually reliable, does it align with the analysis" to their acceptance checklist instead of judging text alone.
- The traceable-sources design is worth borrowing; without it, an agent's charts can look right while the data is fake.
- This is an evaluation-level signal. Judge TVIR-Agent itself only after reading the comparison details in full.

Source: TVIR: Building Deep Research Agents Towards Text--Visual Interleaved Report Generation

04 Teaching a Model to Guess What You're Thinking, With No Labels

For an AI assistant to be genuinely useful, it has to infer your intent from your actions. The real obstacle isn't the algorithm — it's that real-world scenarios have no ground-truth "mental state" to learn from. MindZero's answer is unexpected: skip the labels, let the model generate a batch of hypotheses about what you're thinking, then use a planner to work backward and reward whichever hypothesis best explains the behavior you already took.

That turns intent inference into a self-supervised signal, no human annotation required. The training uses slow, heavy model-based reasoning, but the finished model internalizes the ability into a single forward pass — keeping the accuracy while running fast enough to assist in real time. In gridworld and home scenarios, it beats the original slow, expensive model-based methods on both accuracy and efficiency.
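The idea of rewarding the hypothesis that best explains observed behavior can be illustrated with a toy example. The sketch below is not MindZero's method: the one-dimensional world, the softmax planner, the `beta` parameter, and all function names are assumptions chosen to keep the example small. It scores each candidate goal by the log-likelihood of the observed actions under a planner pursuing that goal.

```python
from __future__ import annotations

import math

# Toy 1-D world: each action shifts the agent's position by one cell.
ACTIONS = {"left": -1, "right": +1}


def planner_action_probs(position: int, goal: int, beta: float = 2.0) -> dict[str, float]:
    # Toy planner (assumption): softmax preference for actions that
    # end closer to the hypothesized goal.
    scores = {a: -beta * abs((position + step) - goal) for a, step in ACTIONS.items()}
    norm = sum(math.exp(s) for s in scores.values())
    return {a: math.exp(s) / norm for a, s in scores.items()}


def explanation_reward(goal: int, start: int, actions: list[str]) -> float:
    # Log-likelihood of the observed actions if the person were pursuing `goal`.
    # Higher means the hypothesis explains the behavior better.
    position, total = start, 0.0
    for a in actions:
        total += math.log(planner_action_probs(position, goal)[a])
        position += ACTIONS[a]
    return total


def best_hypothesis(hypotheses: list[int], start: int, actions: list[str]) -> int:
    # Pick the candidate goal with the highest explanation reward.
    return max(hypotheses, key=lambda g: explanation_reward(g, start, actions))
```

No ground-truth goal label appears anywhere in the scoring, which is the point: the only supervision is the behavior itself. In a training loop, a reward of this kind would go to model-generated hypotheses, and the planner would no longer be needed once the model has been distilled to a single forward pass.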

Key takeaways:
- Rewarding the hypothesis that lets a planner best explain observed behavior sidesteps the dead end of missing mental-state labels in real settings.
- Train with heavy reasoning, deploy as a single distilled forward pass: a general recipe for shipping models that can reason but are too slow.
- Teams building real-time assistant agents should watch this unsupervised route to learning intent.

Source: MindZero: Learning Online Mental Reasoning With Zero Annotations

Also Worth Noting

Scaling Test-Time Compute for Agentic Search Hits a Calibration Trap When Correct Answers Are Sparse. AgentFineVerify breaks a question into verifiable sub-questions and checks each candidate piece by piece, moving "judging correctness" out of the policy as well: a third cut at today's theme of masking and externalizing state.

Today's Observation

Stack three of today's four search-agent papers and a pattern shows: they aren't chasing "search more accurately." They're attacking the chores piled on the policy. On an ever-lengthening transcript, the model makes semantic search decisions, keeps the books, remembers what it has seen, and judges whether its own answer is right. Each paper peels off one burden. Harness-1 externalizes recoverable state to the environment, leaving the policy only the hard search decisions. The masking paper wipes stale observations from context, but only inside a specific regime — change cells and it flips. FineVerify splits "is the answer right" into sub-questions checked one by one, also lifting it out of the policy.

Mechanically it's one thing happening on three faces: state and bookkeeping are moving out of the policy and into the harness, the context window, and the verification step. The leverage point for search agents is shifting: it sits less in the policy itself and more in how you manage what sits in the context window.

In practice: next time you build a research or search agent, don't rush to tune the policy or add RL. Audit the state first. For recoverable things like the candidate pool, evidence links, and verification records, decide which the environment should maintain reliably and which should be cleared from context once stale, rather than dumping all of it into the prompt for the model to carry. Get that right and the policy has far less, and far clearer, to learn.