Today's Overview
- A Chinese MoE puts "self-evolution" on the roadmap. MiniMax-M2 runs 230B params with only 9.8B active, built end-to-end for agent work, and its latest checkpoint can already debug its own training and rewrite its own scaffold.
- The biggest waste in parallel reasoning is branches thinking in isolation. CPT lets thinking branches share intermediate findings in real time, training-free, and pushes the accuracy-latency Pareto front outward on competition math.
- RL-trained agents drift into over-calling tools. AKBE teaches a model when to look something up and when to trust its own knowledge: 18% fewer tool calls, 25% higher tool productivity, and higher accuracy.
- A skill shouldn't be a throwaway script. MUSE-Autoskill gives agent skills a full lifecycle so they carry experience across tasks and fix themselves through unit tests.
Featured
01 Training A 230B Model Starts Fixing Itself
The last two years have been a race on parameter counts and activation ratios. The hard part is different: turning a model into something that does work, not a student that answers questions. MiniMax's M2 series takes an aggressive position. Total params hit 229.9B, only 9.8B (about 4.3%) fire per token, and the whole stack, from data to training, is built for agent work.
An agent-driven data pipeline produces large volumes of verifiable coding and office traces. Each trace binds to a real, runnable workspace and an "artifact-aligned" reward. The most interesting piece is the latest M2.7 checkpoint, which takes a first step toward self-evolution: it can autonomously debug its own training and modify its own scaffold. The companion system, Forge, is an RL stack for long-horizon agent traces that cleanly decouples training, inference, and the agent, working with both white-box and black-box agents.
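The paragraph above does not spell out what "artifact-aligned" means. One plausible reading is that the reward is computed from the files a trace leaves behind in its workspace, not from the text of the answer. The sketch below illustrates that reading only. `ArtifactCheck`, `artifact_reward`, and the file names are hypothetical and are not taken from MiniMax's pipeline or from Forge.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class ArtifactCheck:
    """Hypothetical: one verifiable expectation about a file the agent should produce."""
    relative_path: str
    verify: Callable[[str], bool]  # receives the file text, returns pass/fail


def artifact_reward(workspace: Path, checks: list[ArtifactCheck]) -> float:
    """Return the fraction of checks that pass against files in the workspace."""
    if not checks:
        return 0.0
    passed = 0
    for check in checks:
        target = workspace / check.relative_path
        # A missing file counts as a failed check, not as an error.
        if target.is_file() and check.verify(target.read_text(encoding="utf-8")):
            passed += 1
    return passed / len(checks)


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def demo_reward() -> float:
    """Simulate a trace that wrote one of the two expected artifacts."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        (workspace / "report.json").write_text('{"total": 3}', encoding="utf-8")
        checks = [
            ArtifactCheck("report.json", is_valid_json),
            ArtifactCheck("summary.md", lambda text: len(text) > 0),
        ]
        return artifact_reward(workspace, checks)
```

Scoring the workspace rather than the transcript is what would make such a reward verifiable: the same checks can be rerun on any trace that claims to have finished the task.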
For teams tracking open or in-house base models, this is another data point that a small active-parameter count can still buy strong capability. The self-evolution part is the one to follow, though the real grade depends on hands-on testing.
Key takeaways:
- 230B total params, 9.8B active, with the full pipeline built for agentic deployment rather than answering questions.
- M2.7 can already debug its own training and rewrite its own scaffold, making self-evolution more than a slogan.
- Forge decouples training, inference, and agent, and is the engineering base that makes the whole thing run.
Source: The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
02 Reasoning Several Branches Think Together, Still Slow
Running several lines of thought in parallel and picking the best is now standard for stronger reasoning. One waste gets overlooked: the branches are sealed off from each other. Branch B can't see an intermediate result that branch A worked hard to find, so it reruns the same path. The same finding gets rediscovered over and over, and the search gets longer for no reason.
CPT — Collaborative Parallel Thinking — takes the direct route: let the branches talk in real time. It pulls condensed intermediate information from each in-progress branch, maintains a deduplicated "information pool," then broadcasts the pool back into each branch's context. Later steps reuse what others found instead of rebuilding the wheel. The whole thing is training-free, a pure inference-time framework.
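The pool-and-broadcast loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `InformationPool` is a hypothetical name, findings are plain strings, and deduplication is exact match after case and whitespace normalization. In CPT the condensed findings would come from the model itself, and the deduplication is presumably more semantic.

```python
from dataclasses import dataclass, field


def normalize(finding: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(finding.lower().split())


@dataclass
class InformationPool:
    """Hypothetical shared pool of intermediate findings across branches."""
    findings: list[str] = field(default_factory=list)
    _seen: set[str] = field(default_factory=set)

    def add(self, finding: str) -> bool:
        """Store a finding unless an equivalent one is already pooled."""
        key = normalize(finding)
        if not key or key in self._seen:
            return False
        self._seen.add(key)
        self.findings.append(finding.strip())
        return True

    def broadcast(self, branch_context: str) -> str:
        """Append the pooled findings to one branch's context."""
        if not self.findings:
            return branch_context
        shared = "\n".join(f"- {item}" for item in self.findings)
        return f"{branch_context}\n\nShared findings so far:\n{shared}"


pool = InformationPool()
pool.add("x must be even")            # reported by branch A
pool.add("X  must be EVEN")           # branch C rediscovers it, so it is dropped
pool.add("n = 12 fails the bound")    # reported by branch B
context = pool.broadcast("Branch D: trying induction on n.")
```

After the broadcast, branch D starts its next step already knowing both findings, which is the reuse the framework is after.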
On competition math benchmarks like HMMT and AIME, CPT pushes the accuracy-latency Pareto front outward, and it holds across rollout budgets and model sizes. For teams running inference services and trying to cut test-time cost, this is an optimization you can try without retraining.
Key takeaways:
- The hidden cost of parallel reasoning is branches rediscovering the same information.
- CPT uses a shared information pool so branches reuse each other's intermediate results in real time, with no training.
- Accuracy-latency improves across the board on competition math, and it drops straight into an inference framework.
Source: Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling
03 Agent RL Makes Agents Over-Call Tools
RL training for agents has a counterintuitive side effect: the model starts abusing tools. It knows the answer but fires off a few extra searches anyway, and gradually loses the line between "look this up" and "my own knowledge is enough." The usual fix rewards fewer tool calls, but that coarse signal is easy to game. The model just cuts calls across the board and hacks the reward.
AKBE goes finer-grained. During training it runs two trajectories per question, one with tools and one without, compares right against wrong, and judges per question whether tools are needed and how few calls suffice. That gives a targeted supervision signal instead of a blunt penalty.
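A rough sketch of that per-question comparison follows. The summary above does not give AKBE's actual reward formula, so everything here is an assumption: `Trajectory`, `Boundary`, `judge_question`, `shaped_reward`, and the `penalty` value are hypothetical names and numbers chosen only to show how a per-question boundary differs from a blanket penalty on tool calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trajectory:
    correct: bool     # did the final answer match the reference?
    tool_calls: int   # number of tool calls made along the way


@dataclass
class Boundary:
    needs_tools: Optional[bool]   # None when neither trajectory was correct
    call_budget: Optional[int]    # fewest calls known to suffice, if any


def judge_question(with_tools: Trajectory, without_tools: Trajectory) -> Boundary:
    """Compare the paired trajectories for one question."""
    if without_tools.correct:
        # The model's own knowledge was enough, so no calls are needed.
        return Boundary(needs_tools=False, call_budget=0)
    if with_tools.correct:
        # Tools made the difference; the observed call count is the budget.
        return Boundary(needs_tools=True, call_budget=with_tools.tool_calls)
    # Both failed: no evidence either way, so give no boundary signal.
    return Boundary(needs_tools=None, call_budget=None)


def shaped_reward(rollout: Trajectory, boundary: Boundary, penalty: float = 0.1) -> float:
    """Assumed shaping: penalize only calls beyond this question's budget."""
    base = 1.0 if rollout.correct else 0.0
    if boundary.call_budget is None:
        return base
    excess = max(0, rollout.tool_calls - boundary.call_budget)
    return base - penalty * excess


known = judge_question(Trajectory(True, 3), Trajectory(True, 0))
lookup = judge_question(Trajectory(True, 2), Trajectory(False, 0))
```

Under this shaping, a rollout that makes two searches on the `known` question loses reward, while the same two searches on the `lookup` question cost nothing. A model that cuts calls across the board gains nothing on questions that need a lookup.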
Across seven QA benchmarks, accuracy rises 1.85 on average, tool calls drop 18%, and tool productivity climbs 25% — no accuracy traded for efficiency. It plugs into different RL algorithms as-is. For teams building search or tool-use agents, this aims straight at the "should I call a tool" question.
Key takeaways:
- Agentic RL blurs the line on when to use a tool, and reward shaping is easy to game.
- AKBE compares with-tool and without-tool trajectories per question to draw the boundary, improving accuracy and efficiency together.
- Plug-and-play across RL algorithms, with code released.
Source: Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement
04 Agent The Skill an Agent Saves Dies After One Use
Agents now lean on reusable skills to solve complex tasks. Most approaches treat a skill as an isolated, static script — built, used once, thrown away. That's neither reliable nor able to improve over time. MUSE-Autoskill treats a skill as an asset with a full lifecycle: create, remember, manage, evaluate, refine, in a closed loop.
The agent builds skills on demand, stores and reuses them across tasks, organizes and selects them efficiently, and keeps correcting them through unit tests and runtime feedback. It also gives each skill its own "skill-level memory" that stores the experience accumulated across tasks, so reuse and adaptation get more accurate over time.
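To make the lifecycle concrete, here is a minimal sketch of a skill that carries its own unit tests and memory, with a refine step that only accepts a new version when the tests pass. `Skill`, `SkillLibrary`, and the `slugify` example are hypothetical and are not MUSE-Autoskill's API. Selection across many skills and runtime feedback are left out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Skill:
    """Hypothetical skill record: code, its unit tests, and skill-level memory."""
    name: str
    run: Callable[[str], str]
    tests: list[tuple[str, str]]                      # (input, expected output)
    memory: list[str] = field(default_factory=list)   # notes carried across tasks

    def failing_tests(self) -> list[tuple[str, str]]:
        return [(given, want) for given, want in self.tests if self.run(given) != want]


class SkillLibrary:
    def __init__(self) -> None:
        self._skills: dict[str, Skill] = {}

    def create(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def get(self, name: str) -> Skill:
        return self._skills[name]

    def evaluate(self, name: str) -> bool:
        """A skill is healthy when every one of its unit tests passes."""
        return not self._skills[name].failing_tests()

    def refine(self, name: str, new_run: Callable[[str], str], note: str) -> bool:
        """Swap in a new implementation only if it passes the existing tests."""
        old = self._skills[name]
        candidate = Skill(old.name, new_run, old.tests, old.memory + [note])
        if candidate.failing_tests():
            return False
        self._skills[name] = candidate
        return True


library = SkillLibrary()
library.create(Skill(
    name="slugify",
    run=lambda text: text.lower().replace(" ", "-"),
    tests=[("Hello World", "hello-world"), ("  Trim me ", "trim-me")],
))
healthy_before = library.evaluate("slugify")   # second test fails on padding
accepted = library.refine(
    "slugify",
    lambda text: "-".join(text.lower().split()),
    note="Inputs may carry leading or trailing spaces; split before joining.",
)
healthy_after = library.evaluate("slugify")
```

The note stored at refine time stands in for the "skill-level memory": the next task that picks up `slugify` inherits both the fixed code and the reason it was fixed.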
Early SkillsBench experiments show lifecycle-managed skills do raise task success, efficiency, and reuse rate, and even transfer across agents. For anyone building long-running agent systems that need to accumulate capability, the "skills should be testable, carry experience, and evolve" idea is worth noting — though the evidence is still preliminary.
Key takeaways:
- Treating skills as throwaway scripts is why agent capability fails to compound.
- MUSE-Autoskill uses a create-remember-manage-evaluate-refine loop so skills are reusable, carry experience, and self-correct via unit tests.
- SkillsBench shows early gains in cross-task reuse and cross-agent transfer.
Source: MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

Also Worth Noting
Today's Observation
One clear thread runs through today: agent capability work is shifting from "get a single task right" to "keep getting stronger." MiniMax-M2.7 tries to debug its own training, MUSE-Autoskill turns skills into long-lived assets that carry experience, and AKBE teaches a model to judge its own tool-use boundary. Same direction, all three: agents that keep correcting and accumulating during operation and training.
A batch of new benchmarks (JobBench, VitaBench 2.0, QUACK, MemFail) aims at the same target from the other side: persistence and truthfulness. Can the agent remember preferences? Are its statements backed by evidence? Where does the memory system break? Teams building long-running agents should add both "skill and memory lifecycle management" and "verifiable self-evolution" to the tracking list. Capability going up is one thing. Going up in a controlled, auditable way is another.