Agents Start Improving Themselves, and Reaching for Fewer Tools

Today's Overview

  • A Chinese MoE puts "self-evolution" on the roadmap. MiniMax-M2 runs 230B params with only 9.8B active, built end-to-end for agent work, and its latest checkpoint can already debug its own training and rewrite its own scaffold.
  • The biggest waste in parallel reasoning is branches thinking in isolation. CPT lets thinking branches share intermediate findings in real time, training-free, and pushes the accuracy-latency curve forward on competition math.
  • RL-trained agents drift into over-calling tools. AKBE teaches a model when to look something up versus trust its own knowledge: 18% fewer tool calls, higher accuracy, 25% better tool efficiency.
  • A skill shouldn't be a throwaway script. MUSE-Autoskill gives agent skills a full lifecycle so they carry experience across tasks and fix themselves through unit tests.

Featured

01 Training: A 230B Model Starts Fixing Itself

The last two years have been a race on parameter counts and activation ratios. The hard part is different: turning a model into something that does work, not a student that answers questions. MiniMax's M2 series takes an aggressive position. Total parameters reach 229.9B, but only 9.8B are active per token, and the whole stack, from data to training, is built for agent work.
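
As a quick check on how sparse that is, the arithmetic below computes the share of parameters active per token. The two parameter counts are the ones quoted above; the function and variable names are illustrative only.

```python
# Parameter counts quoted in the text, in billions.
TOTAL_PARAMS_B = 229.9
ACTIVE_PARAMS_B = 9.8


def active_fraction(active_b: float, total_b: float) -> float:
    """Fraction of a sparse MoE's parameters that are used per token."""
    return active_b / total_b


ratio = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"active share per token: {ratio:.1%}")
print(f"total-to-active ratio:  {1 / ratio:.1f}x")
```

Roughly one parameter in 23 participates in any given token, which is the "small activation" the rest of this item refers to.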

An agent-driven data pipeline produces large volumes of verifiable coding and office traces. Each trace binds to a real, runnable workspace and an "artifact-aligned" reward. The most interesting piece is the latest M2.7 checkpoint, which takes a first step toward self-evolution: it can autonomously debug its own training and modify its own scaffold. The companion system, Forge, is an RL stack for long-horizon agent traces that cleanly decouples training, inference, and the agent, working with both white-box and black-box agents.

For teams tracking open or in-house base models, this is another data point that a small active-parameter budget can buy strong capability. The self-evolution part is the one to follow, though the real grade depends on hands-on testing.

Key takeaways:
  • 230B total params, 9.8B active, with the full pipeline built for agentic deployment rather than answering questions.
  • M2.7 can already debug its own training and rewrite its own scaffold, making self-evolution more than a slogan.
  • Forge decouples training, inference, and agent, and is the engineering base that makes the whole thing run.


02 Reasoning: Several Branches Think Together, Still Slow

Running several lines of thought in parallel and picking the best is now standard for stronger reasoning. One waste gets overlooked: the branches are sealed off from each other. Branch B cannot see an intermediate result that branch A worked hard to find, so it retraces the same path. The same finding gets rediscovered over and over, and the search gets longer for no reason.

CPT (Collaborative Parallel Thinking) takes the direct route: let the branches talk in real time. It pulls condensed intermediate information from each in-progress branch, maintains a deduplicated "information pool," then broadcasts the pool back into each branch's context. Later steps reuse what others found instead of reinventing the wheel. The whole thing is training-free, a pure inference-time framework.
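
To make the pool-and-broadcast loop concrete, here is a minimal sketch. It is not CPT's implementation: the class names (`InfoPool`, `Branch`), the `sync_round` helper, and the whitespace-and-case dedup key are all assumptions for illustration, and the step that condenses findings from a branch (a model call in a real system) is replaced by hand-written strings.

```python
from dataclasses import dataclass, field


def _normalize(finding: str) -> str:
    """Crude dedup key (an assumption): lowercase, collapse whitespace."""
    return " ".join(finding.lower().split())


@dataclass
class InfoPool:
    """Deduplicated pool of condensed findings shared by all branches."""
    findings: list[str] = field(default_factory=list)
    _seen: set[str] = field(default_factory=set)

    def add(self, finding: str) -> bool:
        """Add a finding; return False if it is empty or already pooled."""
        key = _normalize(finding)
        if not key or key in self._seen:
            return False
        self._seen.add(key)
        self.findings.append(finding.strip())
        return True

    def broadcast(self) -> str:
        """Render the pool as a note to append to a branch's context."""
        if not self.findings:
            return ""
        return "Shared findings so far:\n" + "\n".join(
            f"- {f}" for f in self.findings
        )


@dataclass
class Branch:
    name: str
    context: list[str] = field(default_factory=list)


def sync_round(branches: list[Branch],
               new_findings: dict[str, list[str]],
               pool: InfoPool) -> None:
    """One sync step: pool each branch's findings, then broadcast to all."""
    for branch in branches:
        for finding in new_findings.get(branch.name, []):
            pool.add(finding)
    note = pool.broadcast()
    if note:
        for branch in branches:
            branch.context.append(note)


# Hypothetical round: branch B repeats A's finding and adds a new one.
pool = InfoPool()
branches = [Branch("A"), Branch("B")]
sync_round(
    branches,
    {
        "A": ["x = 3 satisfies the constraint"],
        "B": ["X = 3  satisfies the constraint", "n must be even"],
    },
    pool,
)
print(pool.broadcast())
```

The duplicate from branch B is dropped, so both branches receive a two-item note. A real system would likely need a semantic notion of "same finding" rather than string normalization.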

On competition math benchmarks like HMMT and AIME, CPT pushes the accuracy-latency Pareto front outward, and it holds across rollout budgets and model sizes. For teams running inference services and trying to cut test-time cost, this is an optimization you can try without retraining.

Key takeaways:
  • The hidden cost of parallel reasoning is branches rediscovering the same information.
  • CPT uses a shared information pool so branches reuse each other's intermediate results in real time, with no training.
  • Accuracy-latency improves across the board on competition math, and it drops straight into an inference framework.


03 Agent: RL Makes Agents Over-Call Tools

RL training for agents has a counterintuitive side effect: the model starts abusing tools. It knows the answer but fires off a few extra searches anyway, and gradually loses the line between "look this up" and "my own knowledge is enough." The usual fix rewards fewer tool calls, but that coarse signal is easy to game. The model just cuts calls across the board and hacks the reward.

AKBE goes finer-grained. During training it runs two trajectories per question, one with tools and one without, compares which of them reaches the correct answer, and judges per question whether tools are needed and how few calls suffice. That gives a targeted supervision signal instead of a blunt penalty.
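
One way to picture the paired comparison is the sketch below. The summary above does not give AKBE's actual labeling rule or reward, so everything here is an assumption: the `Rollout` and `BoundaryLabel` types, the decision rule in `label_question`, and the `penalty` coefficient in `shaped_reward` are hypothetical stand-ins for a per-question signal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rollout:
    correct: bool      # did this trajectory reach the right answer?
    tool_calls: int    # 0 for the no-tool trajectory


@dataclass
class BoundaryLabel:
    tools_needed: Optional[bool]  # None when the pair is uninformative
    call_budget: Optional[int]    # calls judged sufficient for this question


def label_question(with_tools: Rollout, without_tools: Rollout) -> BoundaryLabel:
    """Hypothetical rule: derive a per-question tool-use boundary."""
    if without_tools.correct:
        # The model's own knowledge was enough, so no calls are required.
        return BoundaryLabel(tools_needed=False, call_budget=0)
    if with_tools.correct:
        # Only the tool-assisted run succeeded; its call count is the budget.
        return BoundaryLabel(tools_needed=True, call_budget=with_tools.tool_calls)
    # Both failed: nothing can be concluded about the boundary.
    return BoundaryLabel(tools_needed=None, call_budget=None)


def shaped_reward(rollout: Rollout, label: BoundaryLabel,
                  penalty: float = 0.1) -> float:
    """Correctness reward minus a penalty on calls beyond the budget only."""
    base = 1.0 if rollout.correct else 0.0
    if label.call_budget is None:
        return base
    excess = max(0, rollout.tool_calls - label.call_budget)
    return base - penalty * excess


# A question the model can answer unaided, but where it searched three times.
label = label_question(with_tools=Rollout(True, 3), without_tools=Rollout(True, 0))
print(label, shaped_reward(Rollout(True, 3), label))
```

Because the penalty applies only to calls beyond what the question was judged to need, cutting calls across the board earns nothing on questions where tools are required. That is the contrast with a blanket fewer-calls reward.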

Across seven QA benchmarks, accuracy rises by 1.85 points on average, tool calls drop 18%, and tool productivity climbs 25%, with no accuracy traded for efficiency. It plugs into different RL algorithms as-is. For teams building search or tool-use agents, this aims straight at the "should I call a tool" question.

Key takeaways:
  • Agentic RL blurs the line on when to use a tool, and reward shaping is easy to game.
  • AKBE compares with-tool and without-tool trajectories per question to draw the boundary, improving accuracy and efficiency together.
  • Plug-and-play across RL algorithms, with code released.


04 Agent: The Skill an Agent Saves Dies After One Use

Agents now lean on reusable skills to solve complex tasks. Most approaches treat a skill as an isolated, static script — built, used once, thrown away. That's neither reliable nor able to improve over time. MUSE-Autoskill treats a skill as an asset with a full lifecycle: create, remember, manage, evaluate, refine, in a closed loop.

The agent builds skills on demand, stores and reuses them across tasks, organizes and selects them efficiently, and keeps correcting them through unit tests and runtime feedback. It also gives each skill its own "skill-level memory" that stores the experience accumulated across tasks, so reuse and adaptation get more accurate over time.
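
A rough sketch of what "a skill with unit tests and its own memory" could look like follows. This is not MUSE-Autoskill's API: the `Skill` class, its fields, and the `refine` method are hypothetical, and the slugify functions exist only to give the tests something to catch.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Skill:
    """A reusable skill that carries its own tests and experience notes."""
    name: str
    fn: Callable[[str], str]
    tests: list[tuple[str, str]]                      # (input, expected output)
    memory: list[str] = field(default_factory=list)   # skill-level memory
    version: int = 1

    def passes_tests(self) -> bool:
        """Evaluate: run every unit test; any mismatch or exception fails."""
        for arg, expected in self.tests:
            try:
                if self.fn(arg) != expected:
                    return False
            except Exception:
                return False
        return True

    def refine(self, candidate: Callable[[str], str], note: str) -> bool:
        """Refine: adopt a candidate fix only if all tests pass with it."""
        previous = self.fn
        self.fn = candidate
        if self.passes_tests():
            self.version += 1
            self.memory.append(note)   # remember what was learned
            return True
        self.fn = previous             # roll back a failed refinement
        return False


def slugify_v1(s: str) -> str:
    return s.lower().replace(" ", "-")


def slugify_v2(s: str) -> str:
    return "-".join(s.lower().split())


skill = Skill(
    name="slugify",
    fn=slugify_v1,
    tests=[("Hello World", "hello-world"), ("  Two  spaces ", "two-spaces")],
)
before = skill.passes_tests()   # v1 mishandles repeated and edge spaces
accepted = skill.refine(slugify_v2, "collapse whitespace before joining")
print(before, accepted, skill.version, skill.memory)
```

The design choice worth noting is the rollback: a refinement that fails the tests never replaces the working version, so the skill can only get better or stay the same against its own test suite.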

Early SkillsBench experiments show lifecycle-managed skills do raise task success, efficiency, and reuse rate, and even transfer across agents. For anyone building long-running agent systems that need to accumulate capability, the "skills should be testable, carry experience, and evolve" idea is worth noting — though the evidence is still preliminary.

Key takeaways:
  • Treating skills as throwaway scripts is why agent capability fails to compound.
  • MUSE-Autoskill uses a create-remember-manage-evaluate-refine loop so skills are reusable, carry experience, and self-correct via unit tests.
  • SkillsBench shows early gains in cross-task reuse and cross-agent transfer.


Also Worth Noting

05 Evaluation: Benchmarks Stop Asking "Can It Replace Humans" and Start Asking "What Do People Want Agents to Do"
JobBench covers 130 real office tasks across 35 occupations, and even the strongest model, Claude Opus 4.7, reaches only 45.9%. The benchmark deliberately reframes the goal from replacement to augmentation.
06 Agent: Let a VLM Play Werewolf and Half Its Accusations Are Made Up
QUACK checks agent statements sentence by sentence against the true trajectory, and the best model still hallucinates 15.1% of spatial descriptions, with half of its accusations unsupported by evidence.
07 Evaluation: Can an Agent Remember Your Preferences? Long-Term Interaction Exposes the Gap
VitaBench 2.0 turns tasks into time-ordered user sequences with preferences buried in everyday fragments, requiring the agent to keep extracting and updating, and frontier models still fall well short.
08 Multimodal: Minute-Long Audio-Video Generation, and Nobody Tested Where It Breaks Over Time
LongAV-Compass uses 284 cases across text, image, and video conditions, comparing 11 models on 20-plus dimensions from identity consistency to narrative coherence.
09 Image Gen: Multi-View 3D Reconstruction Falls Apart on Degraded Inputs
GARD runs diffusion denoising directly in the reconstruction model's feature space, restoring geometry and high-resolution RGB images together.
10 AI for Science: Scientific Simulation Wants Fast and Accurate, and RecFM Claims 20x Speedup With Better Accuracy
Recursive flow matching uses cross-scale self-consistency to approach multi-step solvers in 2-4 steps, cutting error by over 15%.
11 Architecture: That Unremarkable Scaling Vector in the Norm Layer: Delete It and the Model Won't Train
Its parameter share is negligible, yet it improves optimization through a "self-amplifying preconditioning" effect, and the paper offers three lightweight improvements.
12 Interpretability: "LLMs Can Introspect" May Be a Premature Conclusion
A reality check argues the so-called self-state recognition looks more like generic anomaly detection and pattern matching, dropping to near-random once you control for confounds.
13 Safety: Unlearning Requests Keep Coming, and Fine-Tuning Each One Costs Too Much
ICCU leaves parameters untouched, deriving readable refusal rules from the unlearning data and applying them at inference, where the rules compose without interfering.

Today's Observation

One clear thread runs through today: agent capability work is shifting from "get a single task right" to "keep getting stronger." MiniMax-M2.7 tries to debug its own training, MUSE-Autoskill turns skills into long-lived assets that carry experience, and AKBE teaches a model to judge its own tool-use boundary. Same direction, all three: agents that keep correcting and accumulating during operation and training.

A batch of new benchmarks (JobBench, VitaBench 2.0, QUACK, MemFail) aims at the same target from the other side: persistence and truthfulness. Can it remember preferences? Are its statements backed by evidence? Where does the memory system break? Teams building long-running agents should add both "skill and memory lifecycle management" and "verifiable self-evolution" to the tracking list. Capability going up is one thing. Going up in a controlled, auditable way is another.