Agent Trajectories Let a 30B Match a 235B

Today's Overview

  • ACC Repackages Agent Tool-Use Trajectories as Long-Context QA Pairs. Qwen3-30B-A3B trained on them lifts MRCR from 50.2 to 68.3, matching Qwen3-235B-A22B, a model with roughly 7x the parameters.
  • WorldKV Moves Long-Term Memory Out of the Attention Bill. Retrieval plus per-block compression keeps the turn-around consistent, doubles throughput, no fine-tuning required.
  • High-Res DiT Inference Is Shifting to Content-Aware Scaling. SEGA weights RoPE frequency components by spectral energy, dodging the structure-vs-detail tradeoff that uniform scaling forces.
  • 80,870 Terminal Recordings Reverse-Engineered Into 1,530 Eval Tasks. TerminalWorld correlates with Terminal-Bench at only Pearson 0.20, so scores from expert-curated sets may not map to real developer work.

Featured

01 Your Agent Logs Are Already Long-Context Training Data

Long-context training has been stuck between two expensive options. Collect scarce human-written long documents, or synthesize them with rules. Neither path is cheap.

ACC's angle is unexpected. The trajectories agents produce while solving tasks are themselves natural long documents. Multi-step tool calls and per-step environment observations scatter evidence across distant context spans. That's exactly the distribution long-context training needs to handle. Standard agent SFT trains only turn-level tool selection and masks tool responses, throwing away the supervision signal spread across long context. ACC reformulates trajectories from search, SWE, and database agents into long-context QA pairs. The original question plus all tool responses and environment observations become the long context. The model trains to answer directly, not to run tools.
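The reformulation described above can be sketched in a few lines. This is a minimal illustration, not ACC's actual pipeline: the `Step` schema, the `trajectory_to_qa` helper, the output field names, and the choice to drop the tool-call text itself are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One agent step (hypothetical schema): the tool call and what came back."""
    tool_call: str
    observation: str


def trajectory_to_qa(question: str, steps: list[Step], final_answer: str) -> dict:
    """Pack a finished agent trajectory into one long-context QA sample.

    The context is the original question followed by every tool response or
    environment observation in its original order, so the evidence stays
    scattered across distant spans. Tool calls are omitted here (an
    assumption): the sample trains direct answering, not tool selection.
    """
    parts = [f"Question: {question}"]
    for i, step in enumerate(steps, start=1):
        parts.append(f"[observation {i}] {step.observation}")
    return {"context": "\n\n".join(parts), "target": final_answer}
```

A search-agent log with two steps would yield a single sample whose `context` holds the question plus both observations and whose `target` is the agent's final answer. Filtering which trajectories are worth converting is a separate step that this sketch does not attempt.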

After training, Qwen3-30B-A3B lifts MRCR from 50.2 to 68.3 and GraphWalks from 69.9 to 77.5, matching Qwen3-235B-A22B, a model with roughly 7x the parameters. General capability (GPQA, MMLU-Pro, AIME, IFEval) doesn't regress. The caveat is that trajectory quality varies. Not every agent log makes a high-value long sample, and the abstract doesn't spell out filtering criteria. Wait for the method section or open data before drawing firmer conclusions. For teams already running agents, the message is direct. Those debug-log traces are the same material someone else is paying to synthesize.

Key takeaways:
  • Teams already running agents should inventory trajectory logs as training assets, not just debug material.
  • Standard agent SFT masks tool responses and discards distant supervision. ACC reclaims that signal for training.
  • Trajectory filtering is the key reproducibility question. The abstract is silent. Hold judgment until method or data release.


02 When You Turn Around, Does the World Model Remember?

The hardest part of interactive video world models isn't single-frame quality. It's the turn-around. A player walks back, looks at where the building was, and it should still be there. A full KV cache holds that consistency, but memory and attention costs grow linearly with rollout length, so real-time throughput collapses fast. A sliding window keeps the rollout moving, but anything outside the window may as well never have existed.

WorldKV splits the forced choice into two independent problems. World Retrieval evicts KV blocks to GPU/CPU memory and brings them back into the native attention window when camera or action matches. Inside each block, World Compression prunes via key-key similarity, letting the same budget hold roughly twice the history. On Matrix-Game-2.0 and related benchmarks, fidelity stays close to full KV while throughput doubles. The whole framework needs no fine-tuning.
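To make the compression half concrete, here is a rough sketch of pruning one block by key-key similarity. It is not WorldKV's implementation: the greedy "drop one of the most similar pair" rule, the cosine metric, and the `prune_block` name are assumptions chosen to illustrate the idea that near-duplicate keys carry redundant history.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two key vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_block(keys: list[list[float]], budget: int) -> list[int]:
    """Return the indices of the keys to keep, at most `budget` of them.

    Greedy rule (an assumption, not the paper's): repeatedly find the most
    similar pair of remaining keys and drop the later one of the pair.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    kept = list(range(len(keys)))
    while len(kept) > budget:
        best_sim, drop_pos = -2.0, -1
        for a in range(len(kept)):
            for b in range(a + 1, len(kept)):
                sim = cosine(keys[kept[a]], keys[kept[b]])
                if sim > best_sim:
                    best_sim, drop_pos = sim, b
        kept.pop(drop_pos)
    return kept
```

Setting `budget` to half the block length corresponds to the "roughly twice the history" figure above. The pairwise search is quadratic per removal, which is fine for a sketch but would need a vectorized version on real KV tensors.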

For teams building interactive world models, the engineering pattern of moving long-term memory out of the attention bill is what's worth lifting. It transfers more cleanly than any single component.

Key takeaways:
  • Long-term consistency and real-time throughput can be decoupled by moving history out of the attention window. No forced choice.
  • Training-free. Drops onto existing video diffusion models with near-zero migration cost.
  • The retrieval-plus-compression combination is the engineering value. Either piece alone misses the point.


03 Uniform Scaling Wastes Frequency Information at High Resolution

DiT loses quality when generating above its training resolution. The current training-free fix combines RoPE extrapolation with attention scaling. That scaling treats every RoPE frequency component the same. Structure and detail usually trade against each other as a result.

SEGA observes that different RoPE frequency components already correspond to different image scales. The fix is to use the latent's spectral energy to guide scaling magnitude at each denoising step, making the scaling content-aware. The idea is plain but the target is clear. Replace the compromise of uniform scaling with a budget that follows content. SEGA beats existing training-free baselines across multiple target resolutions.
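The "budget that follows content" idea can be illustrated with a toy example. Everything here is an assumption for illustration and not SEGA's formula: the one-dimensional signal standing in for a latent, the mapping of spectral bands to RoPE frequency components, and the rule that a band's scale grows with its energy share while the average stays equal to the uniform scale.

```python
import cmath
import math


def band_energy_shares(signal: list[float], num_bands: int) -> list[float]:
    """Fraction of spectral energy in each of `num_bands` frequency bands.

    Uses a naive DFT over positive frequencies 1..n//2 (the DC term is skipped).
    """
    n = len(signal)
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        power.append(abs(coeff) ** 2)
    energies = [0.0] * num_bands
    for i, p in enumerate(power):
        energies[i * num_bands // len(power)] += p
    total = sum(energies)
    if total == 0.0:
        # Flat signal: fall back to an even split.
        return [1.0 / num_bands] * num_bands
    return [e / total for e in energies]


def content_aware_scales(shares: list[float], uniform_scale: float) -> list[float]:
    """Redistribute a uniform scaling budget across bands by energy share.

    Hypothetical rule: the mean of the returned scales equals `uniform_scale`,
    but bands holding more energy receive a larger share of it.
    """
    k = len(shares)
    return [1.0 + (uniform_scale - 1.0) * k * s for s in shares]
```

In a real pipeline the shares would be recomputed from the latent at each denoising step, which is what makes the scaling content-aware. Whether high-energy bands should be scaled more or less is a design choice this sketch picks arbitrarily.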

For image generation teams running high-res inference, this drops into existing DiT pipelines without retraining.

Key takeaways:
  • Training-free high-res inference is converging toward content-aware scaling.
  • SEGA weights RoPE frequency components by spectral energy, sidestepping the structure-vs-detail tradeoff.
  • Slots into existing DiT inference for direct comparison without retraining.


04 80,000 Terminal Recordings, 1,530 Reverse-Engineered Tasks

TerminalWorld reverse-engineers 1,530 verified tasks from 80,870 in-the-wild terminal recordings. Coverage spans 18 real categories, 1,280 unique commands, and workflow lengths from short daily operations up to 50+ steps. Hand-curated agent benchmarks struggle on two fronts: they drift away from the real distribution, and they can't scale.

The surprise: this auto-built benchmark correlates with expert-curated sets like Terminal-Bench at only Pearson 0.20, a weak relationship. Agent scores polished on expert sets may not transfer to actual developer scenarios. The methodology matters more than the specific numbers. Teams with raw operation logs can distill eval sets from them systematically, skipping the expensive hand-annotation path.
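For teams that want to run the same check between an internal eval and a public benchmark, the Pearson coefficient is simple to compute. The sketch below assumes one score per agent on each benchmark; the `pearson` helper and the score lists in the usage note are illustrative and are not data from TerminalWorld.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)
```

Calling `pearson(expert_scores, internal_scores)` with one entry per agent gives a value between -1 and 1. A value near 0.20 means ranking agents on one set says little about their ranking on the other.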

Key takeaways:
  • Shifting benchmarks from hand-curation to reverse-engineering real data is the viable path to scale.
  • A Pearson of 0.20 between expert-curated sets and tasks drawn from real usage means internal evals may need a different data source.
  • Even the strongest agent hits only 62.5% pass on the verified subset. Real terminal workflows are still hard.

Also Worth Noting

05 Flow Matching Belongs in DINOv2 Representation Space, Not Pixels or SD-VAE. (Architecture) Representation-space geometry is friendlier for flow matching to learn.

06 Agentic Reasoning Shouldn't Make CoT Carry Planning Implicitly. (Agent) The paper splits decisions into 3 systems so the agent explicitly chooses when to plan and when to act.

07 SAM 2 Transferred Directly to Visual Object Tracking Isn't Enough. (Multimodal) Adds motion, geometry, and semantic adapters to handle distractors, occlusion, and nonlinear motion.

08 A Multi-Agent Pipeline for Short Drama Generation From One Sentence. (Video Gen) Targets pacing, spatial consistency, and quality control as three specific pain points, not one giant prompt.

09 Taylor Series Identifies "Temporal Surprise Points" in Video for Frame Selection. (Multimodal) Training-free, aligned with predictive coding intuition.

10 Model Search Is Fundamentally Comparative. (Retrieval) Structured tables from model cards beat pure text similarity at separating candidate alternatives.

11 A Task-Adaptive Unified Framework for Fashion Image Retrieval. (Retrieval) Covers multiple query formats and search intents, directly applicable to e-commerce.

Today's Observation

ACC and TerminalWorld look like different projects. One generates training data, the other generates evaluation tasks. The underlying move is the same: recover high-value data assets from naturally produced computational trajectories. ACC treats multi-step tool-use trajectories from agent task-solving as long-context training material. TerminalWorld reverse-engineers eval tasks from 80,870 real terminal recordings. Both sidestep the two biggest cost centers of traditional NLP data prep: hand-curated long documents and hand-annotated benchmarks.

If a team already runs agents in production, or if a product captures users' command-line and tool-call traces, those logs are already a candidate training and eval asset. The remaining questions are whether anyone has noticed and how to filter useful samples. One thing to do today: list the agent or tool-call logs your team hasn't archived. Note the volume and retention. Get them into the data-infrastructure view before deciding whether to build a filtering pipeline.