Video Models Stumble on Composite Edits, MoE Fails at the Router

Today's Overview

  • Single edits are good enough; composite instructions fall apart. CoVEBench breaks multi-point editing into 9,990 fine-grained checklist items, and models asked to change subject, motion, and camera at once routinely miss edits, wreck backgrounds, or introduce artifacts.
  • Let the model learn what to remember. MemoPilot trains "memory update" as an optimizable policy via multi-turn GRPO, leading on Elo with a frozen LLM and no weight changes — though only on competitive games so far.
  • MoE's expert specialization fails at the routing step. STAR reframes routing as structure-aware subspace learning, aligning inputs to their principal structure and moving the diagnosis from expert capacity to router perception.
  • Put a statistical guarantee on a whole reasoning chain's factuality. A conformal method treats multi-step reasoning as a dependency graph, calibrates overall uncertainty in real time, and turns hallucination control from tuning into inference with coverage guarantees.

Featured

01 Style Transfer Works; "Change Three Things at Once" Doesn't

Text-to-video editing models are already good enough for single edits. Swap a style, add an object, change a color: all fine. But real users rarely ask for just one thing. A single prompt often demands changing the subject, the motion, and the camera at once, while keeping the unrelated background and timing intact.

CoVEBench targets exactly this composite workflow: 416 source videos, 626 multi-point editing instructions, broken into 9,990 fine-grained checklist items scored one by one with an MLLM rather than smoothed over by a single global metric. The results aren't encouraging. Handling several operations at once, models frequently drop edits, violate "keep this unchanged" constraints, or introduce visible artifacts.

Look closer and the failures arrive in a consistent order. When subject, motion, and camera all have to change together, the first things to break are motion and camera, the edits that need cross-frame consistency. Swapping a static object is relatively easy. Carrying a new motion through an entire sequence while the camera also moves is where models lose track: the motion holds for only a few frames, or the subject's coherence collapses the moment the camera shifts.

For anyone building video editing products, the value here isn't the leaderboard. It's that the benchmark separates two failure types: losing track (changed the subject, forgot the motion) versus collateral damage (changed one thing, scrambled the background too). These point to different product strategies. Losing track means the model's instruction-parsing capacity is too small to hold every requested change at once, which you can ease by splitting instructions and editing in steps. Collateral damage means the model has no concept of which regions to lock, a deeper capability gap that no amount of prompt engineering recovers. Measure with this ruler and you'll learn more than from a cherry-picked single-edit demo.
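To make the two failure types concrete, here is a minimal sketch of how per-item checklist verdicts could be split into "losing track" and "collateral damage". The record fields (`kind`, `target`, `passed`) and the function name `diagnose` are illustrative assumptions, not CoVEBench's actual schema or scoring code.

```python
from collections import defaultdict

# Hypothetical checklist records; field names are illustrative, not CoVEBench's schema.
# kind = "edit" (something that must change) or "preserve" (something that must stay).
def diagnose(items):
    """Summarize checklist verdicts by failure type and by edit target."""
    if not items:
        raise ValueError("need at least one checklist item")
    # A failed "edit" item is a requested change the model dropped.
    missed = [it for it in items if it["kind"] == "edit" and not it["passed"]]
    # A failed "preserve" item is something the model changed but should not have.
    damaged = [it for it in items if it["kind"] == "preserve" and not it["passed"]]
    by_target = defaultdict(lambda: [0, 0])  # target -> [passed, total]
    for it in items:
        by_target[it["target"]][0] += int(it["passed"])
        by_target[it["target"]][1] += 1
    return {
        "pass_rate": sum(it["passed"] for it in items) / len(items),
        "losing_track": len(missed),
        "collateral_damage": len(damaged),
        "per_target": {t: p / n for t, (p, n) in by_target.items()},
    }

# Made-up verdicts for one composite instruction.
example_items = [
    {"kind": "edit", "target": "subject", "passed": True},
    {"kind": "edit", "target": "motion", "passed": False},
    {"kind": "edit", "target": "camera", "passed": False},
    {"kind": "preserve", "target": "background", "passed": True},
]
report = diagnose(example_items)
```

In this made-up case the overall pass rate is 0.5, but the breakdown shows both failures are dropped edits rather than background damage, which is the distinction a single global score would hide.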

Key takeaways:
  • Single edits are nearly usable; composite instructions are where real user demand and product gaps live.
  • CoVEBench's 9,990 checklist items diagnose failure modes that global FID-style metrics can't surface.
  • Separating "losing track" (fixable with stepwise editing) from "collateral damage" (a capability gap prompts can't fix) drives model selection directly.
  • Don't trust the single-edit demo. Test where the model starts breaking under coupled multi-point edits.


02 Can the Model Learn What to Write Into Memory?

Long-running agents accumulate experience by updating a memory blob after each interaction. The pattern is common now, but what to write and how to write it are almost always fixed by hand-designed prompt rules: people decide for the agent what to remember. MemoPilot replaces that step with an optimizable policy. Multi-turn GRPO (a reinforcement learning method) trains the "memory update" action directly, so a frozen LLM gets sharper with use without any change to its weights. Memory updating shifts from hand-written rules to a trainable policy.
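For readers unfamiliar with GRPO, its core step is scoring each sampled action relative to the other samples in its group. The sketch below shows only that group-relative advantage computation, applied to hypothetical candidate memory updates; the rewards are invented, the use of population standard deviation is a choice made here, and none of it is taken from MemoPilot's implementation.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and spread."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group: four candidate memory updates sampled for the same state,
# each scored by the outcome of the game that followed (1 = win, 0 = loss).
candidate_updates = ["note A", "note B", "note C", "note D"]
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Candidates with positive advantage would be reinforced, the rest discouraged.
ranked = sorted(zip(candidate_updates, advantages), key=lambda p: p[1], reverse=True)
```

Because the advantage is relative to the group, no separate value network is needed; only the memory-update policy receives gradient signal, which is consistent with the frozen-LLM setup described above.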

Worth a caveat: the validation tasks are rock-paper-scissors and limit hold'em poker. The Elo numbers do lead (1762 on poker, 1590 on RPS, ahead of DeepSeek-V3.2), but games offer clean feedback and clear goals. Real agent work is further from that setting, and whether this transfers to long-horizon search or coding remains to be shown in follow-up work.

Key takeaways:
  • The focus shifts from "does memory help" to "who decides what to remember": memory updating becomes a trainable policy itself.
  • A frozen LLM gets stronger at test time without weight changes, useful when fine-tuning the base model isn't an option.
  • Results hold only on competitive games for now; hold judgment before assuming transfer to real agent tasks.


03 MoE's Expert Specialization Fails at the Router

An MoE router is usually a single shallow linear projection. It never really "sees" the input's structure when it decides, so routing is unstable and expert specialization barely exists. STAR changes the angle and reframes routing as a subspace learning problem. Alongside the existing learnable router, it adds a track that uses a generalized Hebbian algorithm (GHA) to continuously follow the input's principal structure through an evolving subspace, aligning routing decisions with the input's dominant directions. Expert specialization now has a stable basis to stand on.
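The generalized Hebbian algorithm mentioned above (also known as Sanger's rule) can be sketched in a few lines. The code below tracks the top-k principal directions of a stream of inputs in pure Python. The toy data, the learning rate, and especially the `route` function are assumptions for illustration; how STAR actually couples the tracked subspace to its learnable router is not described in this summary.

```python
import random

def gha_step(W, x, lr):
    """One Sanger's-rule update of W (k rows, d columns), in place."""
    k, d = len(W), len(x)
    y = [sum(W[i][j] * x[j] for j in range(d)) for i in range(k)]
    W_old = [row[:] for row in W]  # all rows update from the same snapshot
    for i in range(k):
        for j in range(d):
            # Reconstruction of x[j] from components 0..i (lower-triangular term).
            recon = sum(y[m] * W_old[m][j] for m in range(i + 1))
            W[i][j] += lr * y[i] * (x[j] - recon)
    return y

def track_subspace(samples, k, lr=0.005, seed=0):
    """Stream samples through GHA and return the learned k x d matrix."""
    rng = random.Random(seed)
    d = len(samples[0])
    W = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(k)]
    for x in samples:
        gha_step(W, x, lr)
    return W

def make_samples(n, stds, seed=1):
    """Toy inputs: independent Gaussians, so the principal axes are the coordinate axes."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, s) for s in stds] for _ in range(n)]

def alignment(w, axis):
    """Absolute cosine between row w and a coordinate axis."""
    norm = sum(v * v for v in w) ** 0.5
    return abs(w[axis]) / norm

def route(W, x):
    """Toy routing rule (an assumption, not STAR's): pick the direction with the largest |projection|."""
    y = [sum(wi * xi for wi, xi in zip(row, x)) for row in W]
    return max(range(len(y)), key=lambda i: abs(y[i]))

samples = make_samples(4000, stds=[2.0, 1.0, 0.3])
W = track_subspace(samples, k=2)
```

On this toy stream the first row of `W` should settle near the highest-variance axis. The same update can keep running at inference time, which is one way to read the optional test-time subspace update the paper describes.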

The paper reports gains in routing quality and downstream performance across synthetic data, large-scale language, and vision tasks. An optional test-time subspace update further improves robustness when the input distribution drifts. The contribution is moving the diagnosis from expert capacity to router perception — routing quality sits upstream of expert specialization, worth remembering for anyone training MoE.

Key takeaways:
  • When MoE specialization fails, the root cause is often a router blind to input structure, not undersized experts.
  • STAR's evolving subspace aligns routing to the input's principal structure, buying more stable specialization.
  • Test-time subspace updates handle distribution drift, though the exact gain needs the full paper to confirm.


04 A Statistical Guarantee on a Whole Reasoning Chain's Factuality

Controlling hallucination today mostly means tuning by feel (adjust temperature, add prompts, post-process) without knowing how confident you actually are. This paper takes a different route. It treats multi-step reasoning as an implicit dependency graph, where each intermediate conclusion's correctness depends structurally on the ones before it, so factual uncertainty propagates along the graph instead of being tallied as independent per-step errors.

The authors use conformal prediction (a statistical method that comes with coverage guarantees) to compute the graph's overall uncertainty live during generation, stopping once it hits a threshold. That puts a user-specifiable, valid guarantee on "this chain is trustworthy." Calibrating the graph as it generates beats post-hoc pruning on downstream reasoning accuracy. For anyone putting reasoning LLMs into healthcare, finance, or legal work, this turns hallucination control from "tuned until it looks okay" into inference with a coverage guarantee — though how much risk budget it actually saves depends on the experiments and calibration cost in the full paper.
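As a rough illustration of the ingredients, the sketch below combines the standard split-conformal quantile with a toy propagation rule over a dependency graph. The quantile rule is textbook conformal prediction: for exchangeable scores, a new score falls at or below the threshold with probability at least 1 - alpha. The propagation rule (a step inherits the worst score among its ancestors) and the stopping logic are assumptions made here for illustration, not the paper's method.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[rank - 1]

def propagate(step_scores, parents):
    """Toy rule (assumption): a step is at least as uncertain as its worst ancestor.

    parents[i] lists indices of earlier steps that step i depends on.
    """
    total = []
    for i, score in enumerate(step_scores):
        inherited = max((total[p] for p in parents[i]), default=0.0)
        total.append(max(score, inherited))
    return total

def trusted_prefix(step_scores, parents, threshold):
    """Number of steps generated before propagated uncertainty exceeds the threshold."""
    for i, score in enumerate(propagate(step_scores, parents)):
        if score > threshold:
            return i
    return len(step_scores)

# Made-up calibration scores (higher = less trustworthy) and a four-step chain.
threshold = conformal_threshold(list(range(1, 11)), alpha=0.25)
kept = trusted_prefix([1, 2, 12, 3], parents=[[], [0], [1], [0]], threshold=threshold)
```

With ten calibration scores and alpha = 0.25, the threshold is the ninth smallest score, and generation here would stop before the third step. The calibration set is where the cost sits: the guarantee is only as good as the exchangeability between calibration chains and deployment chains.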

Key takeaways:
  • Treating factual uncertainty as propagating along a dependency graph, rather than accumulating per-step errors, matches real failure modes better.
  • Conformal methods give a user-specifiable coverage guarantee, turning hallucination control into quantifiable inference.
  • Worth attention for high-stakes settings, but calibration overhead and the real error-rate reduction need the full paper to confirm.

Also Worth Noting

05 Let the Query Drive State Evolution Itself (Architecture)
In linear attention the query only ever reads out, decoupled from how the state evolves; Q-Delta pulls it into the evolution, loosening the KV-association paradigm.
06 The Schema-Derived Graph Isn't the Graph a GNN Wants (Architecture)
Graphs converted straight from relational databases often don't suit relational reasoning; this asks what makes a good graph, a reminder about the graph-construction step for relational deep learning.
07 Encoder and Decoder Update Unevenly, So Unified Aggregation Breaks (Training)
In medical segmentation the encoder and decoder update very unequally; this handles federated LoRA aggregation separately by encoder-decoder structure.
08 Synthetic Data Judged on Exact Conclusion, Not Fidelity (AI for Science)
Instead of competing on fidelity to the real distribution, it requires exactly satisfying a declarative analytical conclusion with no source data, a different axis of judgment.
09 Octree-Cached Glossy Radiance, Heading for Real-Time Rendering (Image Gen)
High-frequency outgoing radiance from glossy and specular materials has been hard to model; OctaOctree organizes a neural radiosity cache with an octree.