Gated DeltaNet-2 Splits the Gate, Maestro Outscores GPT-5

Today's Overview

  • Linear Attention's Real Bottleneck Is State-Edit Granularity, Not Speed. Gated DeltaNet-2 splits the scalar gate into channel-wise erase and write gates. It tops Mamba-2, KDA, and Mamba-3 in head-to-head training, with the biggest gains on long-context retrieval.
  • Tabular Agents Enter the RL Training Era. Spreadsheet-RL builds a multi-turn sandbox and lifts Qwen3-4B's SpreadsheetBench Pass@1 from 12% to 23.4%. The doubling is real, but the absolute number still sits short of production.
  • Reasoning Doesn't Have to Be Text. LatentOmni interleaves audio-visual state inside a unified latent space instead of compressing to discrete tokens. It dodges the language-prior pull that bends CoT toward grammatical sentences.
  • A 4B Orchestrator Beats GPT-5 and Gemini-2.5-Pro on a Ten-Benchmark Average. Maestro uses outcome-based RL to schedule frozen experts. Training stability under sparse hierarchical reward, however, is something the abstract skips.

Featured

01 Linear Attention's Real Bottleneck Was Never Speed

Linear attention's path from DeltaNet to KDA to Gated DeltaNet-2 points at one thing: speed isn't the issue. The edit primitive on the recurrent state has stayed too coarse. The delta rule uses one scalar gate to control two different operations: erasing old content on the key side and writing new content on the value side. These aren't the same operation.

NVIDIA Labs splits the scalar into channel-wise erase and write gates, letting each channel decide independently how much to erase and how much to write. The form stays backward compatible. Collapse both gates to a single shared scalar and you recover KDA. Collapse the decay to a scalar as well and you get the original Gated DeltaNet. So it's a strict superset of both.
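The collapse argument is easier to follow on a toy update. The sketch below is an assumed parameterization chosen to match the description above, not the paper's exact formula: the state `S` has shape (value channels x key channels), `decay` and `erase` act per key channel, and `write` acts per value channel. The function names and the gate layout are illustrative.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # state S, shape (d_v, d_k)


def gated_delta_step(S: Matrix, k: Vector, v: Vector,
                     decay: Vector, erase: Vector, write: Vector) -> Matrix:
    """One toy state update with decoupled channel-wise gates (assumed form).

    S_new = S*diag(decay) - (S*diag(decay) @ (erase*k)) k^T + (write*v) k^T
    """
    new_S = []
    for i, row in enumerate(S):
        decayed = [s * a for s, a in zip(row, decay)]
        # Erase side: how much of the content stored along k to remove.
        read = sum(s * e * kj for s, e, kj in zip(decayed, erase, k))
        # Write side: how much of the new value to store in this value channel.
        target = write[i] * v[i]
        new_S.append([s + (target - read) * kj for s, kj in zip(decayed, k)])
    return new_S


def scalar_gate_step(S: Matrix, k: Vector, v: Vector,
                     alpha: float, beta: float) -> Matrix:
    """Collapsed case: scalar decay, one scalar shared by erase and write."""
    d_v, d_k = len(v), len(k)
    return gated_delta_step(S, k, v, [alpha] * d_k, [beta] * d_k, [beta] * d_v)


S = [[0.0, 0.0], [0.0, 0.0]]
k = [1.0, 0.0]
# Write only: erase gate closed, second value channel written at half strength.
S = gated_delta_step(S, k, [2.0, 3.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.5])
# Erase only: write gate closed, content stored along k is removed.
S_cleared = gated_delta_step(S, k, [9.0, 9.0], [1.0, 1.0], [1.0, 1.0], [0.0, 0.0])
```

With a single shared scalar, neither "write only" nor "erase only" is expressible in this toy form: one strength sets both. That is the coupling the split gates remove.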

At 1.3B parameters trained on 100B tokens, it beats Mamba-2, Gated DeltaNet, KDA, and Mamba-3 across the board. The biggest delta shows up on RULER long-context retrieval, exactly the workload that stresses the edit primitive: repeated reads and writes without scrambling existing associations. What's worth keeping here is the trajectory, not the benchmark number. Erase-write decoupling is the first step. Finer-grained gate control should follow.

Key takeaways:
  • Linear attention's bottleneck is shifting from "how fast" to "how precise the state edit is." This is a long-running thread worth tracking.
  • For teams working on long-context retrieval or agent memory, edit-primitive design will matter more than total throughput.
  • Gated DeltaNet-2 is a milestone, not an endpoint. Expect successors with finer-grained control gates.


02 Can a Small Team Actually Train a Spreadsheet Agent?

Prompt-plus-ReAct on a general LLM handles simple cells and lookups in Excel or Sheets. Walk into multi-step workflows and it collapses. Spreadsheet-RL is the first work to put serious RL training inside a real Excel environment. The setup includes Spreadsheet Gym for multi-turn rollouts, a corpus of start-goal table pairs auto-mined from forums, and a Domain-Spreadsheet dataset weighted toward finance and supply chain.
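For readers who have not worked with multi-turn RL environments, the shape of such a sandbox can be sketched in a few lines. Everything here is a hypothetical toy: `ToySpreadsheetEnv`, its cell-dictionary state, and the scripted `sum_policy` are illustrative stand-ins, not the Spreadsheet Gym API. The part it shares with the setup described above is the loop: several turns per task, an outcome reward against a goal table, and Pass@1 as the fraction of tasks solved in one rollout.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Sheet = Dict[str, float]      # cell address -> value, e.g. {"A1": 3.0}
Action = Tuple[str, float]    # (cell address, value to write)


@dataclass
class ToySpreadsheetEnv:
    """Hypothetical stand-in for a multi-turn spreadsheet sandbox."""
    start: Sheet
    goal: Sheet
    max_turns: int = 4
    sheet: Sheet = field(default_factory=dict)
    turns: int = 0

    def reset(self) -> Sheet:
        self.sheet, self.turns = dict(self.start), 0
        return dict(self.sheet)

    def step(self, action: Action) -> Tuple[Sheet, float, bool]:
        cell, value = action
        self.sheet[cell] = value
        self.turns += 1
        solved = self.sheet == self.goal
        done = solved or self.turns >= self.max_turns
        # Outcome reward only: 1.0 when the table matches the goal.
        return dict(self.sheet), (1.0 if solved else 0.0), done


def pass_at_1(envs: List[ToySpreadsheetEnv],
              policy: Callable[[Sheet], Action]) -> float:
    """Fraction of tasks solved with a single rollout per task."""
    solved = 0
    for env in envs:
        obs, reward, done = env.reset(), 0.0, False
        while not done:
            obs, reward, done = env.step(policy(obs))
        solved += int(reward == 1.0)
    return solved / len(envs)


def sum_policy(sheet: Sheet) -> Action:
    # Scripted policy that only knows one trick: write A1 + A2 into A3.
    return ("A3", sheet["A1"] + sheet["A2"])


tasks = [
    ToySpreadsheetEnv(start={"A1": 1.0, "A2": 2.0},
                      goal={"A1": 1.0, "A2": 2.0, "A3": 3.0}),
    ToySpreadsheetEnv(start={"A1": 2.0, "A2": 5.0},
                      goal={"A1": 2.0, "A2": 5.0, "A3": 10.0}),
]
score = pass_at_1(tasks, sum_policy)  # solves the sum task, fails the product task
```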

Qwen3-4B's Pass@1 on SpreadsheetBench moves from 12% to 23.4%. The domain split goes from 8.4% to 17.2%. Both are roughly a 2x gain. Absolute numbers still sit short of production. The paper itself flags that prompting is sufficient for simple operations, and that RL training shows its real value on complex tasks.
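The gap between the multiplier and the absolute score is quick to check from the reported numbers. The helper name `gain` is ours, and the inputs are the four percentages quoted above.

```python
from typing import Tuple


def gain(before: float, after: float) -> Tuple[float, float]:
    """Return (relative multiplier, absolute gain in percentage points)."""
    return round(after / before, 2), round(after - before, 1)


overall = gain(12.0, 23.4)  # SpreadsheetBench Pass@1 -> (1.95, 11.4)
domain = gain(8.4, 17.2)    # domain split            -> (2.05, 8.8)
```

A gain of about 2x in both cases still leaves roughly three of every four tasks unsolved on the first attempt.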

For application teams, the more useful question is not the SOTA number. It is whether the compute budget for training a domain-specialized agent stays accessible to mid-sized shops. That answer decides whether this pattern becomes a default.

Key takeaways:
  • Excel and Sheets agents are entering an RL training phase. Multi-step complex tasks finally have a serious training-side approach.
  • Pass@1 doubled but absolute scores sit at 17-23%. Don't extrapolate the multiplier into "production-ready."
  • The decisive question is whether mid-sized teams can afford the compute for RL on domain-specialized agents.


03 Why Does Reasoning Default to Looking Like Sentences?

CoT quietly carried in an assumption: intermediate reasoning has to be written out as text. LatentOmni questions it directly. Audio and video are continuous signals. Compress them to discrete tokens, route everything through textual CoT, and most of the spatial-temporal correspondence is lost. The reasoning path then gets pulled toward grammatically tidy sentences by the language prior.

Their move is to let reasoning interleave audio-visual states inside a unified latent space rather than running text tokens only. Omni-Sync positional encoding aligns the audio-visual timelines. On several audio-visual benchmarks it beats explicit textual CoT, and it's the strongest open-source result on this task.
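What "aligning the timelines" means can be shown with a toy. The sketch below is not the Omni-Sync scheme, whose details this summary does not cover. It only illustrates the general idea of giving audio and video frames a shared temporal position so that frames from the same moment sit together in one sequence. The function name, the frame rates, and the `step_s` bucket size are all assumptions.

```python
from typing import List, Tuple


def interleave_by_time(audio_hz: float, video_fps: float, duration_s: float,
                       step_s: float = 0.5) -> List[Tuple[str, int, int]]:
    """Merge audio and video frames into one sequence with a shared time index.

    Returns (modality, index within modality, shared temporal position).
    """
    tokens = []
    for name, rate in (("audio", audio_hz), ("video", video_fps)):
        count = int(duration_s * rate)
        tokens += [(name, i, i / rate) for i in range(count)]
    # Sort by timestamp so frames from the same moment are adjacent.
    tokens.sort(key=lambda t: (t[2], t[0]))
    # Both modalities map timestamps to the same position grid.
    return [(name, i, int(ts / step_s)) for name, i, ts in tokens]


sequence = interleave_by_time(audio_hz=4.0, video_fps=2.0, duration_s=1.0)
```

In this toy, the second video frame and the third audio frame both start at 0.5 s, so they land on the same temporal position even though their per-modality indices differ.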

The takeaway worth carrying out is not "latent CoT outperforms by X%." It's that the medium of reasoning was never required to be text. That holds with extra force when continuous signals dominate.

Key takeaways:
  • CoT assumes text is the reasoning medium. That assumption is weakest on continuous signals like audio and video.
  • Discrete tokenization burns a lot of spatial-temporal grounding. Latent-space reasoning keeps the dense signal intact.
  • For multimodal reasoning, the medium itself may be a more productive variable to vary than prompts or data.


04 A 4B Orchestrator Beats GPT-5, Sparse Reward Skipped

Maestro frames a multimodal task as a sequential decision over "which expert to call and which skill to apply." A lightweight 4B policy uses outcome-based RL to schedule a set of frozen experts. Averaged across ten benchmarks it hits 70.1%, ahead of GPT-5 at 69.3% and Gemini-2.5-Pro at 68.7%. Swap in unseen experts and it still generalizes. The numbers look strong.

What the abstract dodges is the harder question. The orchestration policy gets a task-level reward, which is very sparse across multi-step hierarchical decisions. Hierarchical RL in this setting has known training-stability issues. The abstract does not say how credit assignment and reward shaping were handled.
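Why a task-level reward is hard on a multi-step policy shows up in a plain return-to-go calculation. This is a generic illustration, not Maestro's training objective. The four-step trajectory and the reward values are made up.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go for each step of one trajectory."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


# Four orchestration steps, reward only on the final outcome.
success = discounted_returns([0.0, 0.0, 0.0, 1.0])  # [1.0, 1.0, 1.0, 1.0]
failure = discounted_returns([0.0, 0.0, 0.0, 0.0])  # [0.0, 0.0, 0.0, 0.0]
discounted = discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5)
```

Every step in a trajectory receives the same signal, whichever expert call actually decided the outcome. Discounting only reweights by distance from the end. Separating good calls from bad ones has to come from averaging over many rollouts or from extra shaping, which is where variance and stability problems tend to enter.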

Set aside the training-side doubts and the architecture itself is clean. Small model as orchestrator, frozen experts behind a registry, no retrain when the registry changes. That property is worth a lot for long-term extensibility. Teams choosing an agent framework should track this. Before reproducing, check the method section for training curves and variance. That tells you whether this is an engineering win or a real RL breakthrough.
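The registry property is easy to picture in code. The sketch below is a hypothetical minimal version: `ExpertRegistry`, `orchestrate`, and the keyword-matching `keyword_policy` are illustrative names, and the lambdas stand in for frozen expert models. In Maestro the choice would come from the learned 4B policy rather than a string match.

```python
from typing import Callable, Dict, List

Expert = Callable[[str], str]


class ExpertRegistry:
    """Hypothetical registry of frozen experts, keyed by name."""

    def __init__(self) -> None:
        self._experts: Dict[str, Expert] = {}

    def register(self, name: str, expert: Expert) -> None:
        self._experts[name] = expert

    def names(self) -> List[str]:
        return sorted(self._experts)

    def call(self, name: str, query: str) -> str:
        return self._experts[name](query)


def orchestrate(policy: Callable[[str, List[str]], str],
                registry: ExpertRegistry, task: str) -> str:
    """The policy picks an expert by name from whatever is registered."""
    choice = policy(task, registry.names())
    return registry.call(choice, task)


def keyword_policy(task: str, names: List[str]) -> str:
    # Stand-in for the learned policy: choose the expert named in the task.
    return next((n for n in names if n in task), names[0])


registry = ExpertRegistry()
registry.register("ocr", lambda q: "ocr:" + q)
registry.register("math", lambda q: "math:" + q)
first = orchestrate(keyword_policy, registry, "math 2+2")

# A new expert joins the registry; the policy code is untouched.
registry.register("chart", lambda q: "chart:" + q)
second = orchestrate(keyword_policy, registry, "chart of sales")
```

The design point is that the policy only ever sees the list of available names at call time, so adding or swapping an expert changes the registry and nothing else.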

Key takeaways:
  • A 4B orchestrator plus frozen experts beats GPT-5 and Gemini-2.5-Pro on a ten-benchmark average. "Stuff everything into one giant model" loses another round.
  • The abstract avoids hierarchical RL's training stability under task-level sparse reward. Read the method section, not just the numbers.
  • Agent framework teams should track orchestration-style designs. Don't reproduce until you've seen the training dynamics.


Also Worth Noting

05 Transit Planning via Continual Pretraining on 13M Transfer Records, No Routing Engine. Reasoning: TransitLM tests directly whether structured tasks can be served by pretraining alone instead of a specialized system. Not another RAG augmentation.
06 MLLMs Score Big Five Traits From Person Videos, Grounded in Observed Behaviors. Evaluation: Separates "perception" from "stereotyping" in the evaluation. Methodology generalizes to other subjective-judgment tasks.
07 CUSP Predicts Post-Cutoff Scientific Progress From Pre-Cutoff Knowledge. AI for Science: Cross-disciplinary event-level evaluation. Closer to the actual definition of forecasting than "can AI write a paper."
08 Sensor2Sensor Converts Dashcam Video Into the AV Fleet's Sensor Configuration. Robotics: Long-tail coverage becomes a sensor-conversion problem instead of a data-collection problem.
09 SpaceDG Adds Motion Blur, Low Light, and Compression Artifacts to Spatial Reasoning. Evaluation: Almost all current benchmarks assume clean visual input. Adding degradation will likely cut current SOTA scores meaningfully.
10 SceneAligner Extends "You Are Here" Localization to Real Raster Floorplans of Public Buildings. Robotics: Past work assumed vector floorplans and small-scale environments. This one runs in real public buildings.

Today's Observation

Two papers today that look unrelated land on the same design choice from different parts of the model. Gated DeltaNet-2 works inside the layer. Split the single scalar gate on a linear-attention recurrent state into channel-wise erase and write gates, and the granularity of state editing goes up a level. LatentOmni works between layers. Audio-visual reasoning stops collapsing into discrete text tokens and instead runs through continuous latent space, so the granularity of the reasoning medium goes up too. One changes how a single layer edits its state. The other changes what the reasoning trace is made of. Different locations entirely.

But two independent research lines hit the same wall in different places. The bandwidth of the model's internal "intermediate representations" is too narrow. One line found state erase and write tied together by a single scalar. The other found reasoning being pulled around by the language prior. The shared point is that the default granularity of intermediate steps is now the binding constraint on the next round of performance. Convergence like this across subfields is worth writing down. A one-paper benchmark bump can be a trick. Two lines hitting the same wall isn't.

Next time you design or evaluate a model, single out the "intermediate representations" in your system for a separate audit. Recurrent state. Scratchpad. CoT trace. Cross-modal alignment cache. Check whether each is still carrying critical information through a too-coarse primitive. If yes, widen the bandwidth at that layer. The return may be higher than from continuing to tune peripheral structure.