A 135M Pixel Model Beats Billion-Parameter Baselines

Today's Overview

Multi-agent errors can finally be computed, not guessed. GBC adds differentiable weights to the connections between agents, so the task loss flows back along the interaction chain and turns "which agent's prompt to fix" into token-level attribution. This only works while the collaboration structure stays differentiable, an assumption that discrete tool calls still strain.
Tokenizer-free pixel AR is closing the quality gap. PRA uses a low-dimensional intermediate and approximate rollout to suppress compounding error, hitting FID 2.58 on ImageNet 256×256 with 135M parameters. That beats the 3.60 of earlier billion-scale pixel-space AR models with roughly an order of magnitude fewer parameters.

Featured

01 Agent Tuning Is Still Guesswork When Something Breaks

Anyone who tunes multi-agent systems knows the feeling. The whole pipeline fails, but all you get is a final output — no idea whether the task decomposition was wrong or some interaction step corrupted good information upstream. So you edit prompts and swap roles on instinct. GBC (Gradient-Based Connections) wants to move this from trial-and-error toward actual localization.

It models the system as a computation graph, adds differentiable weights to the connections between agents, and lets the task loss propagate backward along the interaction chain. Each agent's contribution to downstream output becomes measurable at the token level. Which step is at fault, which prompt to fix — computed, not guessed. The authors ship an implementation called AgentChord that uses prefix gradient computation to cut overhead, and it beats strong single- and multi-agent baselines on MultiWOZ and τ-bench. They also observe that better attribution quality tracks better optimization, which suggests fine-grained credit assignment carries real signal.
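
To make the mechanism concrete, here is a toy sketch. It is not AgentChord or the paper's algorithm. Each agent's message is reduced to a single number, every connection carries a weight, and a squared-error loss is backpropagated by hand. The agent names, the scalar messages, and the `run_and_attribute` helper are all hypothetical; the real method works on token-level signals inside language models.

```python
def run_and_attribute(order, local, edges, target):
    """Forward a scalar 'message' through a weighted agent DAG, then backprop.

    order  : agent names in topological order (the last one emits the final output)
    local  : each agent's own additive contribution (toy stand-in for its text)
    edges  : {(src, dst): weight}, the differentiable connection weights
    target : desired final output
    Returns (loss, {edge: dLoss/dWeight}).
    """
    # Forward pass: an agent's output is its own contribution plus weighted inputs.
    out = {}
    for name in order:
        incoming = sum(w * out[s] for (s, d), w in edges.items() if d == name)
        out[name] = local[name] + incoming

    final = order[-1]
    loss = (out[final] - target) ** 2

    # Backward pass in reverse topological order (manual reverse-mode autodiff).
    d_out = {name: 0.0 for name in order}
    d_out[final] = 2.0 * (out[final] - target)
    grads = {}
    for name in reversed(order):
        for (s, d), w in edges.items():
            if d == name:
                grads[(s, d)] = d_out[name] * out[s]  # sensitivity of loss to this link
                d_out[s] += d_out[name] * w           # pass the gradient upstream
    return loss, grads


order = ["planner", "retriever", "writer"]
local = {"planner": 1.0, "retriever": -4.0, "writer": 0.0}  # retriever injects a bad signal
edges = {
    ("planner", "retriever"): 1.0,
    ("planner", "writer"): 1.0,
    ("retriever", "writer"): 1.0,
}

loss, grads = run_and_attribute(order, local, edges, target=2.0)
worst = max(grads, key=lambda e: abs(grads[e]))
print(loss, worst, grads[worst])  # 16.0 ('retriever', 'writer') 24.0
```

In this toy run the retriever-to-writer link gets the largest gradient magnitude (24 against 8 for the other two), and its positive sign says that shrinking that connection would lower the loss. That is the sense in which a gradient over connections can point at the step to fix.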

The catch is the premise: the whole collaboration structure has to be differentiable. Production agent systems are full of discrete tool calls, external APIs, and conditional branches where gradients don't flow. How much real-world coverage this gets depends on how the full paper handles those non-differentiable steps. For teams building orchestration, the value isn't the score. It's the attempt to give "why doesn't this agent setup work" an analyzable answer instead of another round of educated guessing.

Key takeaways:
- The core pain of multi-agent systems is not knowing which agent or which interaction step failed; GBC turns this into a computable attribution problem using differentiable connections.
- Attribution quality correlates with optimization gains, evidence that fine-grained credit assignment produces usable signal. Worth attention from orchestration teams.
- Differentiability is a hard constraint. Discrete tool calls in production may block gradients, so real deployment depends on how the paper handles non-differentiable steps.

Source: GBC: Gradient-Based Connections for Optimizing Multi-Agent Systems

02 Tokenizer-Free Image Generation Is Catching Up on Quality

Mainstream image generation trains a discrete tokenizer to compress images into tokens first — a separate component to build and maintain. Pixel-space autoregressive (AR) generation skips it, predicting images directly as a sequence of raw pixel patches: pixel-in, pixel-out, no tokenizer stage. The cost is two coupled problems. Each step has to generate a very high-dimensional output, so per-step error is large, and teacher-forcing opens a train/inference gap that lets error compound along the AR steps.
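
The compounding effect is easy to see in a toy setting. The sketch below is not an image model: it assumes a constant scalar "sequence" and a hypothetical predictor that overshoots by 10% per step, then compares per-step error under teacher forcing with error under a free-running rollout. The function names are made up for illustration.

```python
def teacher_forced_errors(model, truth):
    """Per-step error when every input is the ground-truth previous value."""
    return [abs(model(truth[t]) - truth[t + 1]) for t in range(len(truth) - 1)]


def rollout_errors(model, truth):
    """Per-step error when the model consumes its own previous prediction."""
    errors, x = [], truth[0]
    for t in range(len(truth) - 1):
        x = model(x)  # feed the prediction back in, as at inference time
        errors.append(abs(x - truth[t + 1]))
    return errors


# Hypothetical setup: the true sequence is constant, and the model
# overshoots each step by 10%.
truth = [1.0] * 11
biased_model = lambda x: 1.1 * x

tf = teacher_forced_errors(biased_model, truth)
ro = rollout_errors(biased_model, truth)
print(f"teacher-forced error at last step: {tf[-1]:.3f}")
print(f"rollout error at last step:        {ro[-1]:.3f}")
```

Under teacher forcing the error stays at about 0.1 at every step, because each input is reset to the truth. In the rollout the same 10% bias multiplies through, reaching about 1.59 after ten steps. Training only ever sees the first regime, while inference lives in the second.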

PRA predicts a low-dimensional intermediate first, then maps it back to pixel tokens with a pixel decoder. During training it constructs an input distribution close to what the model sees at inference time, approximating the feedback path of a real rollout while keeping parallel training efficient. At 135M parameters it reaches FID 2.58 on ImageNet 256×256, past the 3.60 of earlier billion-scale pixel-space AR models, with roughly an order of magnitude fewer parameters. Scaled to 511M, FID drops to 1.94. The quality gap for this route is narrowing fast.
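
The summary does not spell out how PRA builds its rollout-like training inputs, so the following is a sketch of a different, generic technique: a two-pass, scheduled-sampling-style approximation. Pass one makes teacher-forced predictions for all positions in parallel; pass two would train on those predictions as inputs instead of the ground truth. The toy model and all names are assumptions.

```python
def model(x):
    # Hypothetical one-step predictor with a 10% multiplicative bias.
    return 1.1 * x


truth = [1.0] * 6  # ground-truth sequence x_0 .. x_5

# Pass 1 (parallel): teacher-forced predictions for every position at once.
first_pass = [model(x) for x in truth[:-1]]  # predictions of x_1 .. x_5

# Pass 2 inputs: swap ground-truth inputs for the model's own first-pass
# predictions, so training sees inputs that carry the model's error.
approx_inputs = [truth[0]] + first_pass[:-1]

# Reference: the inputs a real sequential rollout would see at inference.
rollout_inputs = [truth[0]]
for _ in range(len(truth) - 2):
    rollout_inputs.append(model(rollout_inputs[-1]))


def mean_gap(a, b):
    """Mean absolute difference between two equal-length input sequences."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


gap_teacher = mean_gap(truth[:-1], rollout_inputs)
gap_approx = mean_gap(approx_inputs, rollout_inputs)
print(f"ground-truth inputs vs rollout: {gap_teacher:.3f}")
print(f"two-pass inputs vs rollout:     {gap_approx:.3f}")
```

In this toy the two-pass inputs sit closer to the true rollout inputs than the ground-truth inputs do, at the cost of one extra parallel pass rather than a step-by-step rollout. That trade, closer-to-inference inputs without giving up parallel training, is the general shape of what the paragraph above describes.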

Key takeaways:
- The tokenizer-free route removes a component you'd otherwise train and maintain separately, and PRA brings its quality to a comparable level.
- Compounding error is the core bottleneck of pixel-space AR; a low-dimensional intermediate plus approximate rollout is a pragmatic way to contain it.
- Classification probing accuracy also beats AR and diffusion baselines, hinting the same pixel representation could serve both generation and understanding. Worth a look for teams building unified models.

Source: Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation