Olympiad Gold Becomes a Two-Step Recipe

Today's Overview

  • Olympiad Gold Becomes a Portable Two-Step Recipe. SU-01 combines reverse-perplexity curriculum SFT with two-stage RL. A 30B-A3B backbone clears IMO and IPhO gold. Whether the recipe ports to other backbones decides whether this matters.
  • Multi-Turn Agent Rewards Are Too Coarse to Learn From. SDAR demotes self-distillation to a gated auxiliary objective. Gains of 7–10 points over GRPO on ALFWorld, WebShop, and Search-QA.
  • AR Accuracy and Diffusion Speed in the Same Frame. Orthrus uses a dual architecture whose two views share one KV cache. The authors claim lossless inference and up to 7.8× speedup.
  • Camera-Controlled Video May Not Need a Dedicated Encoder. Warp-as-History feeds camera-induced warps as pseudo history frames. Frozen models follow trajectories zero-shot.
  • Multi-Hop RAG's Bottleneck Isn't Retrieval — It's Hidden State. PyRAG writes reasoning as executable Python. Errors get caught by the runtime, not by the model's self-check.

Featured

01 Olympiad Gold Becomes a Two-Step Recipe

Olympiad-level reasoning used to look like something only frontier labs could reproduce. SU-01 reframes it as a portable recipe, one meant to drop onto any post-trained reasoning backbone. Two moves do the work. First, a reverse-perplexity curriculum SFT (difficulty ordered by "reverse" perplexity) instills proof-search and self-check behaviors. Second, a two-stage RL amplifies them: verifiable-reward RL, then proof-level RL. Test-time scaling extracts the last bit.
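
The abstract does not say how reverse perplexity orders difficulty, so the following is only one possible reading, not the authors' method: score each trajectory by its perplexity under the base model and train on the highest-perplexity samples first, the inverse of the low-perplexity-first habit. The function names and the toy log-probabilities are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def reverse_perplexity_order(samples):
    """Sort samples from highest to lowest perplexity (assumed reading)."""
    return sorted(samples, key=lambda s: perplexity(s["logprobs"]), reverse=True)

# Toy per-token log-probs, invented for illustration only.
samples = [
    {"id": "easy", "logprobs": [-0.1, -0.2, -0.1]},
    {"id": "hard", "logprobs": [-2.0, -1.5, -2.5]},
    {"id": "medium", "logprobs": [-0.8, -1.0, -0.9]},
]

curriculum = [s["id"] for s in reverse_perplexity_order(samples)]
print(curriculum)  # ['hard', 'medium', 'easy']
```

If the paper's definition turns out to differ, only the sort key changes. The rest of an SFT data loader would consume `curriculum` in order either way.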

The authors ran it on a 30B-A3B backbone with about 340K sub-8K-token trajectories for SFT plus 200 RL steps. The resulting model handles reasoning chains over 100K tokens and reaches gold-medal level on IMO 2025, USAMO 2026, IPhO 2024, and IPhO 2025.

Discount the "simple and unified" framing. The abstract leaves three questions open: how reverse-perplexity actually orders difficulty, where those 340K high-quality trajectories came from, and whether teams without olympiad-grade annotations can reproduce it. If the recipe really transfers across backbones, training costs for reasoning models get rewritten. If it doesn't and the win is in the data, this becomes another "reproducible only with frontier-lab resources" story.

Key takeaways:
  • Olympiad gold has been packaged from a one-off capability into a two-stage recipe (curriculum SFT plus two-stage RL). Cross-backbone reproduction decides whether it generalizes.
  • Reverse-perplexity curriculum and high-quality trajectory data are the recipe's hidden costs. Don't assume cheap reproduction before you see the details.
  • A 30B-total, 3B-active MoE sustaining 100K-token reasoning chains is a useful engineering reference for teams building long-chain reasoning products.


02 Multi-Turn Agents Need More Than "You Failed"

In ALFWorld and WebShop tasks, RL hands the model one trajectory-end reward. It walked fifteen steps and missed, and nobody — least of all the model — knows which step went wrong. The natural fix is OPSD (On-Policy Self-Distillation): a teacher branch with more context gives dense per-token supervision. This works on single-turn reasoning. On multi-turn agents it breaks. Errors compound, the teacher's own signal drifts, and when the teacher vetoes a token it could mean the model lacks the skill or that the teacher didn't retrieve the right one this time. Treating every veto as a negative is wrong.
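
To make the coarseness concrete, here is a minimal sketch of GRPO-style credit assignment, assuming one scalar reward per trajectory and normalization by the group's mean and standard deviation. The rewards, step counts, and function names are invented for illustration.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: (r - mean) / (std + eps) per trajectory."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_steps(advantage, num_steps):
    """Every step in a trajectory receives the same scalar credit."""
    return [advantage] * num_steps

# Four sampled trajectories for one task: success, fail, fail, success.
rewards = [1.0, 0.0, 0.0, 1.0]
steps = [12, 15, 9, 14]

advs = grpo_advantages(rewards)
per_step = [broadcast_to_steps(a, n) for a, n in zip(advs, steps)]

# The 15-step failure gets the same negative credit on all 15 steps,
# including any steps that were actually correct.
print(per_step[1][:3])
```

Nothing in this signal separates the step that caused the failure from the fourteen around it, which is the gap dense per-token supervision tries to fill.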

SDAR demotes self-distillation from main course to gated auxiliary. RL stays the backbone. The distillation signal passes through a sigmoid gate that strengthens teacher-approved positives and softly damps teacher-vetoed negatives. On Qwen2.5 and Qwen3, gains over GRPO are +9.4% on ALFWorld, +10.2% on WebShop, and +7.0% on Search-QA. The bigger result is that naive GRPO+OPSD's training collapse never appears.
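
The summary above does not give SDAR's gate formula, so this is a hedged sketch of the general idea rather than the paper's objective. It assumes the gate is a sigmoid of the per-token gap between teacher and student log-probabilities. The function name, the `temperature` parameter, and the numbers are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_distill_signal(teacher_logps, student_logps, temperature=1.0):
    """Per-token distillation signal scaled by a sigmoid gate on the gap."""
    signal = []
    for t, s in zip(teacher_logps, student_logps):
        gap = t - s                        # > 0: teacher approves the token
        gate = sigmoid(gap / temperature)  # near 1 for approvals, near 0 for vetoes
        signal.append(gate * gap)          # vetoes are shrunk, never hard-zeroed
    return signal

# Token 0: teacher approves (gap +1). Token 1: teacher vetoes (gap -1).
signal = gated_distill_signal([-0.5, -2.0], [-1.5, -1.0])
print([round(x, 3) for x in signal])  # [0.731, -0.269]
```

With equal-sized gaps, the approved token carries roughly 2.7 times the weight of the vetoed one. In a full trainer this signal would be added to the RL loss as an auxiliary term with a small coefficient, which is also an assumption here.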

Key takeaways:
  • The pain point in multi-turn agent RL isn't a weak algorithm. Reward signals are too coarse, and dense supervision is worth the investment.
  • Single-turn distillation tricks blow up in multi-turn settings. The supervision interface needs redesign, not reuse.
  • Wins came on interactive multi-turn tasks rather than reasoning benchmarks, which says more about agent training progress. Longer-horizon stability is the next question.


03 AR Accuracy and Diffusion Speed in One Frame

Autoregressive decoding emits one token at a time — slow but accurate. Diffusion language models parallelize but lose quality. Orthrus stops choosing sides. It freezes the original LLM, adds a light trainable module, and lets two views share one KV cache. The AR head handles prefill to preserve representation fidelity. Diffusion handles parallel generation. An exact consensus mechanism keeps the two aligned, with the authors claiming lossless inference, up to 7.8× speedup, and O(1) memory overhead.
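
The summary does not describe how the consensus mechanism works. One way a parallel drafter can stay lossless is the verification step used in speculative decoding: keep drafted tokens only while they match the AR model's own greedy choice. The sketch below shows that swapped-in technique with toy stand-ins for both models. It is not Orthrus's actual algorithm, and every function here is hypothetical.

```python
def ar_next_token(prefix):
    """Toy stand-in for the frozen AR model's greedy next token."""
    return (sum(prefix) + 1) % 5

def draft_block(prefix, size):
    """Toy parallel drafter; deliberately wrong on its third token."""
    out, ctx = [], list(prefix)
    for i in range(size):
        tok = ar_next_token(ctx)
        if i == 2:
            tok = (tok + 1) % 5  # inject a mismatch
        out.append(tok)
        ctx.append(tok)
    return out

def accept_with_consensus(prefix, draft):
    """Keep draft tokens while they match the AR choice; fix the first mismatch.

    A real system would score the whole draft in one AR forward pass.
    """
    accepted, ctx = [], list(prefix)
    for tok in draft:
        target = ar_next_token(ctx)
        accepted.append(target)
        ctx.append(target)
        if tok != target:
            break
    return accepted

def generate(prefix, length, block=4):
    seq = list(prefix)
    while len(seq) - len(prefix) < length:
        seq += accept_with_consensus(seq, draft_block(seq, block))
    return seq[len(prefix):len(prefix) + length]

def greedy(prefix, length):
    seq = list(prefix)
    for _ in range(length):
        seq.append(ar_next_token(seq))
    return seq[len(prefix):]

print(generate([1, 2], 10) == greedy([1, 2], 10))  # True
```

The output matches plain greedy decoding token for token, while each verification round commits three tokens instead of one in this toy setup. Speedup in a real system depends on how often the drafter agrees with the AR view.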

The headline numbers are clean, but the full paper has to clear two questions: how heavy that "light" trainable module is in practice, and how consensus degrades on long sequences.

Key takeaways:
  • AR fidelity plus diffusion parallelism as a combination, not a choice, is the right direction in this acceleration wave.
  • Lossless plus O(1) memory is deploy-friendly on paper. Real inference loads need to confirm it.
  • Teams running high-throughput inference services should test whether Orthrus bolts onto existing Transformers.


04 Video Models Already Follow Camera Trajectories

Camera-controlled video generation has gone two ways: train a dedicated camera encoder on labeled video, or pay an inference-time optimization tax. Warp-as-History changes the framing. Feed the camera-induced image warp into the model's existing visual history channel as pseudo history frames, align the positional encoding, drop invalid tokens. Done.
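
The "drop invalid tokens" step can be illustrated with a toy example. Real camera-induced warps depend on depth and camera pose; the sketch below substitutes a plain horizontal shift and assumes a patch is dropped when any of its pixels is invalid, which the summary does not confirm. Function names and the 4×4 frame are hypothetical.

```python
def shift_warp(frame, dx, fill=None):
    """Toy warp: shift pixels right by dx; exposed pixels become `fill`."""
    width = len(frame[0])
    return [[row[x - dx] if 0 <= x - dx < width else fill for x in range(width)]
            for row in frame]

def valid_patch_tokens(warped, patch):
    """Return (row, col) patch indices whose pixels are all valid.

    Keeping the original grid indices is what lets the surviving tokens
    line up with the model's positional encoding.
    """
    h, w = len(warped), len(warped[0])
    keep = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            pixels = [warped[y][x] for y in range(py, py + patch)
                      for x in range(px, px + patch)]
            if all(p is not None for p in pixels):
                keep.append((py // patch, px // patch))
    return keep

frame = [[y * 4 + x for x in range(4)] for y in range(4)]
warped = shift_warp(frame, dx=2)
tokens = valid_patch_tokens(warped, patch=2)
print(tokens)  # [(0, 1), (1, 1)]
```

The left column of patches has no source pixels after the shift, so only the right column survives as pseudo history tokens.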

Under this interface, a frozen video model shows non-trivial zero-shot camera-following. Add LoRA fine-tuning on one annotated video and it generalizes to unseen scenes. One training video sounds aggressive — the full paper needs to show how far that generalization actually reaches when scene content shifts hard.

Key takeaways:
  • Camera control may not need a dedicated camera encoder. The model's existing history-frame path already carries the signal.
  • Zero-shot capability is there before any training. LoRA on one video is a bonus, not a gate.
  • Teams building controllable video generation can probe their backbone with this zero-training interface first, then decide whether heavier pipelines are worth it.


05 Write Multi-Hop Reasoning as Code

Multi-hop QA failures in RAG usually aren't about retrieval quality. They're about hidden state — the model's mid-chain reasoning lives inside natural language, query drift goes unnoticed, and the only thing that catches errors is the same model that made them. PyRAG rewrites the pipeline as an executable Python program. Each retrieval is a function call. Intermediate answers are explicit variables. The whole reasoning chain becomes a trace you can rerun and step through.
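
A two-hop question written in this style might look like the sketch below. The `retrieve` function, the dictionary standing in for a retriever, and the `TRACE` log are hypothetical illustrations and not PyRAG's actual interface.

```python
# Toy corpus; a real system would call a retriever here.
CORPUS = {
    "director of Inception": "Christopher Nolan",
    "birthplace of Christopher Nolan": "London",
}

TRACE = []  # every hop is recorded, so the chain can be replayed

def retrieve(query):
    """Hypothetical retrieval call; logs each query and its result."""
    answer = CORPUS[query]
    TRACE.append((query, answer))
    return answer

def answer_question():
    """Where was the director of Inception born?"""
    director = retrieve("director of Inception")        # hop 1
    birthplace = retrieve(f"birthplace of {director}")  # hop 2
    return birthplace

result = answer_question()
print(result)  # London
```

The second query is built from a named variable rather than from free text, so a drifted hop shows up in `TRACE` as a concrete string that can be inspected or stepped through in a debugger.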

Error detection moves from model self-criticism to execution feedback and compiler errors. Self-repair gets a grounded signal. On HotpotQA, MuSiQue, 2WikiMultihopQA, and two other benchmarks, the more compositional the dataset, the bigger the gain. The gains hold in both the training-free and RL settings.
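
A minimal sketch of runtime-driven error detection, assuming generated programs are executed with Python's built-in `exec` and that any exception text is fed back as the repair signal. The `run_program` helper, the toy `lookup` tool, and both program strings are hypothetical. Running untrusted generated code this way would need sandboxing in practice.

```python
import traceback

def run_program(source, env):
    """Execute generated code; return (answer, error_text)."""
    scope = dict(env)
    try:
        exec(source, scope)
        return scope.get("answer"), None
    except Exception:
        return None, traceback.format_exc(limit=1)

def lookup(query):
    """Toy retriever that raises KeyError on a drifted query."""
    table = {"capital of France": "Paris"}
    return table[query]

tools = {"lookup": lookup}
bad = 'answer = lookup("capitol of Frnace")'   # drifted query
good = 'answer = lookup("capital of France")'  # repaired program

first_result, first_error = run_program(bad, tools)
# In a real loop, first_error would be sent back to the model for repair.
final_result, final_error = run_program(good, tools)
print(final_result)  # Paris
```

The failure is reported by the interpreter as a `KeyError` naming the exact bad query, so the repair step does not rely on the model noticing its own mistake.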

For teams building multi-hop QA systems, the value isn't the benchmark number. It's swapping "prompts tuned by superstition" for code execution you can manage with normal engineering tools.

Key takeaways:
  • The real bottleneck in multi-hop RAG is invisible intermediate state, not retrieval.
  • Expressing reasoning as code means the runtime catches errors instead of the model itself.
  • Multi-hop QA teams can treat this as an engineering template that pulls them out of prompt-only work.


Also Worth Noting

06 MemEye Takes "Answers Derivable From the Caption" Seriously. (Multimodal) Builds an eval that only credits answers requiring fine-grained visual evidence, raising the bar for multimodal agent memory.
07 Survey on Multi-Agent Failure Attribution. (Agent) Errors propagate across agents and resist diagnosis. Skim if you're shipping multi-agent products.
08 Many-Shot ICL Scaling Laws Don't Hold for CoT or Reasoning Tasks. (Reasoning) Counterintuitive prompt-tuning advice for long-context reasoning.
09 Orchard: Open-Source Agent-Training Framework, Not Just Orchestration. (Agent) Fills the open-source agent training infra gap.
10 Reasoning RL Self-Improvement Moves From "Generate Data" to "Generate Environments." (Training) A concrete instance of zero-data self-evolution.
11 SFT Data Selection Has a Generalization-vs-Extrapolation Tradeoff. (Training) Explains why perplexity, length, and difficulty heuristics keep disagreeing.
12 RealICU Drops "Doctor's Historical Action" as Ground Truth. (Evaluation) Long-context ICU clinical agent benchmark and a methodological upgrade for medical AI eval.
13 VGGT-Edit: Feed-Forward 3D Scene Editing. (Architecture) Uses residual field prediction for dynamic response, relevant to 3D content tooling.
14 Video2GUI Converts Video Into GUI Interaction Trajectories. (Agent) Aimed at GUI agent pretraining and attacking GUI data scarcity directly.
15 Nexus: Time-Series Forecasting Plus Text Context in an Agentic Framework. (Agent) One engineering pattern for stitching TSFMs and LLMs together.

Today's Observation

Three papers landed on the same concrete point from three angles. SU-01 (2605.13301) uses a reverse-perplexity curriculum for olympiad models. Difficulty gets ordered in the inverse direction of the "low perplexity first" instinct from the SFT era. Many-Shot CoT-ICL (2605.13511) finds that the many-shot scaling pattern, which keeps paying off on ordinary tasks, breaks on CoT and reasoning. More demonstrations can actually hurt. Data Difficulty and the Generalization–Extrapolation Tradeoff (2605.12906) attributes the long-running fight in SFT data-selection literature, where perplexity, length, and difficulty heuristics contradict each other, to a structural tradeoff between generalization and extrapolation. Different difficulty buckets optimize different objectives by design.

Three lines, three actions: curriculum order, in-context demonstration count, data selection. They point at one thing. Reasoning models have grown their own data-side regularities, and the instruction-tuning playbook's heuristics no longer agree here. The disagreement is structural, not noise.

If you're working on reasoning fine-tuning, lifting the SFT-era data-curation playbook straight across is now a real risk. Curriculum order, demo count, and difficulty measurement all need fresh ablations against reasoning-task characteristics. Treat "low perplexity first," "more shots is better," and "pick medium difficulty" as hypotheses to test, not conclusions to encode. Validate before they enter the pipeline.