256 Tokens Match Full Attention, Agents That Build Agents

Today's Overview

  • 256 Tokens Is All You Need for Near-Full Attention — FASA discovers frequency-level sparsity hidden in RoPE positional encoding and uses it for token selection, compressing KV cache to 18.9% while maintaining near-100% accuracy.
  • Multi-agent systems no longer need pre-defined sub-agents: let the system spawn specialized executors on demand. AOrchestra achieves 16.28% relative improvement across GAIA/SWE-Bench/Terminal-Bench.
  • Training SWE Agents No Longer Requires Docker — SWE-World replaces containerized execution feedback with a learned environment simulator; paired with SWE-Master's post-training framework, an open-source 32B model hits 70.8% on SWE-bench Verified.
  • Predicting agent failure doesn't necessarily prevent it. An LLM critic tanks one model's performance by 26 percentage points after deployment.

Featured

01 Efficiency RoPE Hides a Free Lunch for KV Cache

The biggest cost of long-context inference isn't compute — it's KV cache memory. Tens of gigabytes of cache put consumer hardware out of reach. Existing token pruning approaches either use static rules that risk losing critical information, or heuristic strategies with unstable results.

FASA finds a more elegant angle: within RoPE's frequency chunks, only a small subset of "dominant" chunks closely tracks full attention head behavior — and these can identify important tokens at essentially zero extra cost. On LongBench, retaining just 256 tokens reaches nearly 100% of full-KV performance; on AIME24, 18.9% of the cache yields a 2.56x speedup.

A concurrent paper, HySparse, attacks the same problem from the architecture side — using a few full-attention layers as "oracles" to guide token selection in sparse layers. In an 80B MoE model, just 5 full-attention layers enable nearly 10x KV cache compression.

Key takeaways: - RoPE itself contains exploitable sparsity signals, lighter than training a separate importance predictor - The 256-token result shows most context is highly redundant - Teams doing inference deployment should look at FASA and HySparse as complementary approaches


02 Agent No Pre-Defined Workflows — Build Agents on Demand

Multi-agent systems keep getting more complex, but there's an awkward limitation: sub-agents are typically pre-defined, and they choke on novel subtask types.

AOrchestra proposes a natural abstraction — any agent is a tuple of ⟨instruction, context, tools, model⟩. With this "recipe," the system ditches a fixed agent library. Instead, a central orchestrator constructs specialized agents at each step: which tools, which model, what context — all decided dynamically.

The payoff goes beyond flexibility. It enables Pareto-optimal performance-cost tradeoffs — lightweight models handle simple subtasks, heavy models only where needed. Across GAIA, SWE-Bench, and Terminal-Bench, AOrchestra paired with Gemini-3-Flash achieves 16.28% relative improvement over the strongest baseline.

Key takeaways: - Agent-as-tuple turns "building agents" from an engineering problem into a composition problem - Dynamic model selection inherently supports cost control - The framework is framework-agnostic, theoretically layerable on top of any agent system


03 Code Intelligence Training SWE Agents Without Docker

The single biggest infrastructure burden of training SWE agents? Docker environments. Every code fix task needs a dependency-complete, test-executable container — eye-wateringly expensive to build and maintain.

SWE-World and SWE-Master are a one-two punch from the same team. SWE-World trains an LLM-based "environment simulator" that takes an agent's action trajectory and directly predicts execution outcomes and test feedback — completely bypassing physical Docker environments. Beyond saving resources, the simulator enables test-time scaling: virtually evaluating multiple candidate solutions at inference without real execution.

SWE-Master provides the companion post-training framework covering teacher trajectory synthesis, long-horizon SFT, and RL with execution feedback. Together they push Qwen2.5-Coder-32B from 6.2% to 52% (Docker-free SFT), 55% with Docker-free RL, and 68.2% with test-time scaling. SWE-Master alone reaches 61.4% via RL, 70.8% at TTS@8.

Key takeaways: - The environment simulator dramatically lowers the infrastructure barrier, opening the door wider for the open-source community - Simulator plus TTS is a win-win: save Docker at training time, use the simulator to pick the best solution at inference time - Both projects are open-source


04 Agent Your Agent Supervisor Might Be Helping Backward

The intuition sounds reasonable: have an LLM critic monitor agent execution, predict failure, and intervene proactively to boost success rates. This paper delivers a counterintuitive answer — not necessarily, and possibly much worse.

The researchers trained a binary LLM critic with strong offline accuracy (AUROC 0.94), but in deployment it caused a 26 percentage point performance collapse on one model while barely affecting another. The reason is a disruption-recovery tradeoff: intervention can save doomed trajectories, but it also derails trajectories that look shaky mid-run but would have succeeded on their own.

The team's practical recommendation: run a 50-task pilot test before deployment to determine whether intervention actually helps or hurts.

Key takeaways: - High offline prediction accuracy does not guarantee positive online intervention — a critical blind spot in agent reliability research - The disruption-recovery tradeoff means conservative strategies (less intervention) may be safer than aggressive ones - A 50-task pilot is cheap; any team shipping agent systems should add it to their deployment checklist

256 Tokens Match Full Attention, Agents That Build Agents

Also Worth Noting

05
5 Full-Attention Layers Support a 49-Layer Model EfficiencyHySparse lets sparse layers reuse full-attention layer KV caches; an 80B MoE model achieves nearly 10x KV cache compression, outperforming both full-attention and sliding-window baselines. link
06
RL Budgets Shouldn't Be Spent Uniformly TrainingCoBA-RL dynamically allocates per-sample rollout budgets based on model capability: less compute on easy problems, more on hard ones, improving both training efficiency and generalization. link
07
Autoregressive Video KV Cache Gets a 2-Bit Fix Video GenQuant VideoGen exploits spatiotemporal video redundancy for progressive residual quantization, compressing cache 7x with under 4% latency overhead. link
08
Distillation Without Mode Collapse Image GenDP-DMD dedicates the first step to preserving diversity and subsequent steps to quality refinement; no discriminator, no perceptual loss, matching SOTA few-step text-to-image quality. link
09
VLM Spatial Reasoning: Humans 88%, Best Model 55% EvaluationSpatiaLab tests 6 categories and 30 types of spatial reasoning in real-world scenes; on open-ended answers the gap widens further, with the best model at just 41%. link
10
Google Uses Gemini Deep Think to Solve Hard Math AI for ScienceCase studies showing how researchers use iterative dialogue, problem decomposition, and cross-disciplinary knowledge transfer to tackle open problems in theoretical CS and physics with LLMs. link
11
Pretraining Without a Preset Step Count TrainingAnytime Pretraining shows that weight averaging with polynomial-decay learning rates matches cosine schedules, eliminating the need to know total training duration in advance. link
12
SWE-Master: Open Post-Training for SWE Agents Code IntelligenceCovers teacher trajectory synthesis, long-horizon SFT, RL with execution feedback, and inference framework design; Qwen2.5-Coder-32B reaches 61.4%, TTS@8 hits 70.8%. link
13
Pinterest Predicts What Users Would Search For AgentA reverse approach: instead of describing what images are, predict search queries. Deployed at scale, it drives 20% organic traffic growth. link

Today's Observation

A clear technical trend emerges today: KV cache compression is moving from point optimizations to systematic solutions. FASA attacks it from RoPE frequency structure, HySparse from architecture-level full-sparse hybridization, Quant VideoGen from quantization for video generation — three papers solving the same problem from different angles, and their methods are theoretically stackable. The other notable thread is infrastructure cost reduction for SWE agent training: SWE-World demonstrates that replacing Docker with a learned simulator is viable, which could significantly accelerate open-source SWE agent iteration. Teams working on inference deployment should watch FASA and HySparse for combination potential; teams building coding agents should look into SWE-World's simulator approach.