Two Loops Take SWE-bench From 43 to 64

Today's Overview

Looping for deeper compute saturates at two passes. LoopCoder-v2 turns "how many loops" into an engineering knob, and two loops push SWE-bench Verified from 43.0 to 64.4 — but three or more regress.
Make the agent ship a game you can actually play. GameCraft-Bench sets 140 tasks inside the Godot engine, scoring by running the game and replaying it rather than reading code.
Put the teacher in the prompt, not the gradient. ZPPO uses the teacher as a hint to rescue all-fail problems in small-model post-training, beating distillation and GRPO across four sizes from 0.8B to 9B.
One shared discrete tokenizer can work after all. UniAR bets against UniDDT, using bitwise quantization so understanding and generation share visual tokens, hitting dual SOTA on image generation and editing.

Featured

01 Looping for Deeper Compute Stalls at Two Passes

Looped Transformers deepen latent computation by reapplying one shared set of layers. Serial looping pays for it: every extra pass adds latency and KV-cache memory, and the cost eats the depth gains. LoopCoder-v2 routes around this with parallel looping (PLT), using cross-loop position offsets and shared-KV sliding-window attention to flatten the serial cost. "How many times to loop" becomes a design parameter you can choose deliberately.
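
To make the mechanism concrete, here is a minimal sketch of plain serial looping, the baseline that PLT tries to flatten. It is not LoopCoder-v2's implementation: `make_toy_layer`, `looped_forward`, and `effective_depth` are hypothetical names, the "layer" is a toy residual update rather than a transformer block, and `num_loops` is assumed to count passes over the shared block.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(scale: float) -> Layer:
    """Hypothetical stand-in for a transformer block: a residual tanh update."""
    def layer(x: Vector) -> Vector:
        return [xi + scale * math.tanh(xi) for xi in x]
    return layer


def looped_forward(x: Vector, shared_block: Sequence[Layer], num_loops: int) -> Vector:
    """Run the same block num_loops times; the weights are reused, not copied."""
    if num_loops < 1:
        raise ValueError("num_loops must be at least 1")
    for _ in range(num_loops):
        for layer in shared_block:
            x = layer(x)
    return x


def effective_depth(shared_block: Sequence[Layer], num_loops: int) -> int:
    """Layer applications per forward pass; serial latency grows with this."""
    return len(shared_block) * num_loops


block = [make_toy_layer(0.1), make_toy_layer(0.2)]
one_pass = looped_forward([0.5, -1.0], block, num_loops=1)
two_pass = looped_forward([0.5, -1.0], block, num_loops=2)
```

The parameter count stays at `len(block)` layers while `effective_depth` doubles, which is why serial looping buys depth at the price of latency.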

The team trained a fresh set of 7B PLT coding models on 18T tokens to test it. The result is counterintuitive. Two loops beat the no-loop baseline across the board: SWE-bench Verified climbs from 43.0 to 64.4, Multi-SWE from 14.0 to 31.0. Three or more loops regress. Diagnostics show the useful refinement concentrates in the second pass; updates after that shrink and oscillate, while the mismatch cost from position offsets stays roughly fixed. Once the gains thin out, cost takes over.

Note the gap between framing and finding. The paper is titled "Only Loop Once," but the abstract's real optimum is two loops. This test-time compute curve saturates early. Whether you can capture both the depth gains and the parallel efficiency depends on the full paper's latency and memory measurements.

Key takeaways:
- Test-time compute now has a "stack depth" path, not just "stack sampled tokens," but depth gains saturate at two passes and regress at three, so the usable range is narrow.
- If the 43.0-to-64.4 jump holds up, it's a signal worth tracking for anyone building coding agents.
- The title's "loop once" and the abstract's "two is best" disagree; weigh the real gains against the parallel cost savings only after seeing the full benchmarks.

Source: LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

02 Writing Code Is Easy, Shipping a Playable Game Is the Test

In ordinary coding tasks, if the code runs and the tests pass, you win. Game generation is different. It lives inside a game engine, where scripts, scenes, assets, rendering, and runtime interaction all have to come together into something coherent and playable. A script that doesn't throw errors is nowhere near enough.

GameCraft-Bench targets that gap. It designs 140 tasks across 15 game categories inside the Godot engine, and it doesn't score by reading code. It actually runs the agent's game, replays a recording of player input, then uses a multimodal judge to rule on whether the game is playable against a rubric. The value isn't the score. It's that "can an agent deliver a complete, running interactive system" — long hard to quantify — becomes observable and reproducible.

The results are honest. Agents often implement visible gameplay mechanics, then fall short on content completeness, on whether visual feedback actually fires, and on whether the whole thing holds together.

Key takeaways:
- Judge interactive deliverables by whether they run and play, not whether the code passes tests. That is a harder bar for anyone shipping agent output.
- Game generation is the difficulty ceiling for coding agents; in-engine multimodal coordination exposes where current models really fall down.
- Agents can build the skeleton of mechanics but can't fill in content and presentation, so "can write" still sits a gap away from "can deliver a finished product."

Source: GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

03 Where Should the Teacher Step Into Small-Model Training?

Distillation wants the small model to learn from the large one, but it's brittle when the size gap is wide. Forcing a small student to mimic a big teacher's logits crushes it onto the teacher's sharpest few modes, and out-of-distribution generalization collapses. RL takes a different route, training on the student's own rollouts to skip logit mimicry — then hits an old problem. On hard problems where every attempt fails, advantage is zero, the whole batch gets silently dropped, and the student learns nothing there.
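
The all-fail problem is easy to see in arithmetic. Below is a minimal sketch of a group-relative advantage in the commonly described GRPO form (reward minus group mean, divided by group standard deviation). The function name and the `eps` value are assumptions for illustration, not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) keeps the division defined when sigma is 0.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout on a hard problem fails: all rewards equal, so no signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout succeeds: the group now carries a usable gradient signal.
one_success = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

With identical rewards every advantage is exactly zero, so the problem contributes nothing to the update. A single success is enough to make the advantages non-zero.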

ZPPO's angle is specific. It doesn't change how you pick data; it changes where the teacher steps in, moving the teacher into the prompt instead of the policy gradient. On hard problems it builds two rewritten prompts: one mixes the teacher's correct answer with the student's wrong one and asks the student to tell them apart; the other aggregates the student's repeated failed rollouts to expose the shared failure mode. A replay buffer recycles each problem until its average accuracy clears half, then it "graduates."
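
The recycle-until-graduation loop can be sketched as a small buffer. This is an illustration under stated assumptions, not ZPPO's code: `HardProblemBuffer` is a hypothetical name, accuracy is averaged over the most recent batch of rollouts only, and "clears half" is read as strictly greater than 0.5.

```python
from collections import deque
from typing import Deque, List, Sequence


class HardProblemBuffer:
    """Recycles hard problems until the student's accuracy on them clears a threshold."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold          # assumed reading of "clears half"
        self.pending: Deque[str] = deque()  # problems still being drilled
        self.graduated: List[str] = []      # problems the student now solves

    def add(self, problem_id: str) -> None:
        """Queue a problem unless it is already tracked."""
        if problem_id not in self.pending and problem_id not in self.graduated:
            self.pending.append(problem_id)

    def report(self, problem_id: str, rollout_correct: Sequence[bool]) -> bool:
        """Record one batch of rollouts; return True if the problem graduates."""
        if not rollout_correct:
            raise ValueError("need at least one rollout")
        accuracy = sum(rollout_correct) / len(rollout_correct)
        self.pending.remove(problem_id)
        if accuracy > self.threshold:
            self.graduated.append(problem_id)
            return True
        self.pending.append(problem_id)  # recycle to the back of the queue
        return False


buffer = HardProblemBuffer()
buffer.add("p1")
first = buffer.report("p1", [False, False, False, False])  # stays in the buffer
second = buffer.report("p1", [True, True, True, False])    # 0.75 > 0.5, graduates
```

Whether the real method averages over one batch or over a longer history is not stated in the summary above, so treat the averaging window as a tunable choice.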

Across four student sizes from 0.8B to 9B with a 27B teacher, ZPPO beats distillation and GRPO on 31 benchmarks, and smaller models gain the most.

Key takeaways:
- Putting the teacher in the prompt rather than the gradient avoids the old trap of injecting teacher answers into all-fail problems and breaking on-policy training.
- Anyone doing small-model post-training should look at this "keep hard problems inside the student's reach and drill them" approach.
- The conclusions are from the abstract; the exact gains and stability need the full paper to confirm.

Source: Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

04 Decouple or Share for Unified Multimodal — Opposite Answers, Same Week

A few days ago UniDDT argued for decoupling: understanding and generation take separate visual paths, since forcing one shared path pleases neither end. UniAR bets the exact opposite. One discrete visual tokenizer serves both understanding and generation, so the model reads the visual tokens it just generated inside the same context, with no re-encoding pass.

The shared route works, and the trick is lookup-free bitwise quantization. It keeps both high-level semantics and low-level detail, and it shortens the visual sequence to speed up generation. The result is SOTA on image generation and editing, with multimodal understanding holding steady.
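
For readers unfamiliar with lookup-free quantization, here is a minimal sketch of the usual sign-based form: each latent channel becomes one bit, and the bits together form the token id, so no codebook search is needed. UniAR's exact quantizer is not described here, so the function names and the sign rule are assumptions.

```python
from typing import List, Sequence, Tuple


def bitwise_quantize(latent: Sequence[float]) -> Tuple[List[int], List[float], int]:
    """Quantize each channel to one bit by sign; pack the bits into a token id."""
    bits = [1 if v > 0 else 0 for v in latent]
    codes = [1.0 if b else -1.0 for b in bits]  # quantized latent in {-1, +1}
    token_id = 0
    for b in bits:
        token_id = (token_id << 1) | b  # first channel is the most significant bit
    return bits, codes, token_id


def token_to_codes(token_id: int, num_bits: int) -> List[float]:
    """Invert the packing: recover the {-1, +1} latent from a token id."""
    bits = [(token_id >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]
    return [1.0 if b else -1.0 for b in bits]


bits, codes, token_id = bitwise_quantize([0.3, -1.2, 0.8, -0.1])
implicit_vocab_size = 2 ** len(bits)  # 16 possible tokens from 4 channels
```

The vocabulary is implicit: `d` channels give `2 ** d` tokens without storing or searching a codebook, which is what makes large visual vocabularies cheap in this style of tokenizer.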

Two opposite technical bets in the same week mean the route to unified multimodal hasn't converged. Which wins comes down to which one scales.

Key takeaways:
- UniAR's "share one discrete tokenizer" bets against UniDDT's "decouple into dual paths"; the core route to unified multimodal is still forking.
- The key trick is bitwise quantization, compressing the visual vocabulary and sequence length so understanding and generation share one set of tokens.
- Teams working this direction should read both papers together, hold off on picking a side, and watch which route holds up at larger scale.

Source: Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Also Worth Noting

Self-Evolving Agents Should Learn How to Use Experience, Not Just Store It
AgentOPD-Evolver proposes a slow-fast on-policy self-distillation framework that separates "remember trajectories" from "learn to evolve."

Looped Architectures Move Into World Models for the First Time
Video Gen: Looped World Models reuse depth to ease the tension between long-horizon simulation needing deep compute and deep models being expensive and error-prone.

Most Transformers Make Every Layer the Same Width. This One Doesn't
Architecture: Variable-Width Transformers use an ×-shaped structure to reshape width and compute along depth.

Interactive World Models Have Shrunk Their Action Vocabulary to Navigation
Video Gen: ActWorld uses action-aware memory so you can actually pick up a plate or open a door, not just move the camera.

Non-Differentiable Top-K Routing Has Long Plagued MoE Training
Architecture: SoftMoE switches to soft differentiable routing so expert selection can be learned end-to-end (ICML).

Reasoning Models Keep Grinding After the Answer Is Out
Efficiency: Dynamically pruning rollouts from a GRPO view cuts overthinking and the dead reasoning after the answer appears (Huawei).

The "Dense Fourier Spectrum" of Transformers Doing Modular Multiplication Is an Artifact
Interpretability: Re-analyzed, it's a discrete logarithmic clock (Stanford).

When Image and Text Conflict, MLLMs Always Side With Text
Multimodal: The root cause is a late-layer text bias, and it can be corrected directly (IJCAI).

LLM Code Translation Fixates on Correctness and Ignores Runtime Efficiency
Code Intelligence: In the post-Moore era this starts to matter (ICML).

Unsupervised Dense Retrieval Catches Semantic Similarity but Misses Temporal Relevance
Retrieval: Temporal preference optimization fills the gap (ICML).

Today's Observation

Three unrelated papers today happen to push on the same spot. LoopCoder-v2 folds depth into time with looping, Looped World Models carry the same move into world models, and Variable-Width Transformers allocate width unevenly along depth. On the surface one is about efficiency, one about world models, one about architecture. All three make the same point: compute doesn't have to be spread evenly along depth, and you can allocate it according to each layer's computational role.

Looping is the "reuse depth" answer; widening is the "shape width" answer. Both are surfacing across different tasks right now. This doesn't mean "looped architectures are the future" or "static stacking is dead," and it's nowhere near an architecture revolution. More precisely, the long-default assumption — uniform-width layers, stacked all the way up — just got loosened from a few angles at once, and how far it gives is still unclear.

If you have your own training stack, pick a depth-sensitive task and run the plainest control: share the weights of a few middle layers and loop them twice, or widen only the middle band. Watch which layer the gain curve saturates at. Don't wait for a paper to settle it — only your own workload knows whether this assumption is worth touching.
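
Both controls can be written down as simple schedules before touching a training stack. The sketch below is one way to set them up: `looped_schedule` and `widened_widths` are hypothetical helpers, and the layer counts, band boundaries, and widening factor are placeholders to adapt to your own model.

```python
from typing import List


def looped_schedule(num_layers: int, band_start: int, band_end: int, loops: int = 2) -> List[int]:
    """Order in which layer indices run when the middle band is looped."""
    if not 0 <= band_start < band_end <= num_layers:
        raise ValueError("band must lie inside the layer stack")
    prefix = list(range(band_start))
    band = list(range(band_start, band_end))  # these layers share weights across loops
    suffix = list(range(band_end, num_layers))
    return prefix + band * loops + suffix


def widened_widths(num_layers: int, base_width: int, band_start: int,
                   band_end: int, factor: float) -> List[int]:
    """Per-layer hidden widths with only the middle band widened."""
    if not 0 <= band_start < band_end <= num_layers:
        raise ValueError("band must lie inside the layer stack")
    return [
        int(base_width * factor) if band_start <= i < band_end else base_width
        for i in range(num_layers)
    ]


# Placeholder configuration: a 6-layer model with layers 2 and 3 as the middle band.
loop_control = looped_schedule(num_layers=6, band_start=2, band_end=4, loops=2)
width_control = widened_widths(num_layers=6, base_width=512, band_start=2, band_end=4, factor=2.0)
```

Sweeping `band_start` and `band_end` while holding everything else fixed is the cheapest way to see where the gain curve flattens on your own workload.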