A 7B Video Agent Beats a 72B Model by Looking Less

Today's Overview

Long video understanding doesn't need to watch every frame. OmniAgent models perception as a decision the model makes for itself; its 7B agent scores 50.5% on LVBench, ahead of Qwen2.5-VL-72B, a model roughly 10x its size.
The bottleneck in multimodal-as-policy is memory, not decisions. RNG-Bench isolates "rebuild what's no longer visible and act on it" in two games, and finds frontier models mostly fail by forgetting earlier observations.
Uniform diffusion language models finally have a serious open baseline. Sumi is the first at-scale (7B, 1.5T tokens) fully open uniform diffusion model trained from scratch, with weights, recipe, and data mix all released.
Every reasoning step from an AI scientist now leaves auditable evidence. Xcientist externalizes literature, ideas, plans, and ablations into contract-governed artifacts, and names "claim drift" as a failure you can't catch by looking at outputs alone.
User simulators are shifting the target from "match the line" to "pass as a person." Turing-RL swaps similarity matching for a Turing-test discriminator reward, and beats matching baselines in chat and forum settings.

Featured

01 Let the Model Decide Which Frames to Watch

Long video understanding has been stuck on watch-it-all: every frame gets equal treatment regardless of the question, so compute grows linearly with video length. OmniAgent reframes the problem as an observe-think-act loop under a POMDP. The model takes actions on demand, distilling only the key audio-visual cues into a running text memory. That decouples reasoning complexity from raw video length.
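
The loop can be sketched in a few lines. This is a minimal illustration, not OmniAgent's interface: the action names (`seek`, `answer`), the `policy` and `observe` callables, and the toy video are all assumptions made up for the example. What it shows is that only short text notes enter memory, so memory grows with the number of turns, not with video length.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical action format: ("seek", timestamp_seconds) or ("answer", text).
Action = Tuple[str, object]


@dataclass
class AgentState:
    question: str
    memory: List[str] = field(default_factory=list)  # running text memory


def run_episode(
    question: str,
    policy: Callable[[AgentState], Action],
    observe: Callable[[float], str],
    max_turns: int = 8,
) -> Tuple[str, int]:
    """Observe-think-act loop: the policy decides what to look at next."""
    state = AgentState(question)
    for turn in range(1, max_turns + 1):
        kind, arg = policy(state)
        if kind == "answer":
            return str(arg), turn
        # Only a short text note about the chosen clip enters memory.
        state.memory.append(observe(float(arg)))
    return "no answer", max_turns


# Toy stand-ins for the video and the model (assumptions for illustration).
def toy_observe(t: float) -> str:
    return "a red car enters" if 120 <= t < 130 else "empty street"


def toy_policy(state: AgentState) -> Action:
    if any("red car" in note for note in state.memory):
        return ("answer", "red")
    # Probe a new timestamp each turn: 0s, 60s, 120s, ...
    return ("seek", 60.0 * len(state.memory))


answer, turns = run_episode("What color is the car?", toy_policy, toy_observe)
```

Here the toy policy answers on its fourth turn after three probes, having stored three short notes instead of every frame.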

Training runs in two steps. Agentic SFT bootstraps active perception through best-of-N trajectory synthesis. Agentic RL with TAURA then uses per-turn entropy to push credit toward the steps that actually surfaced something useful.
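
To make "per-turn entropy" concrete, here is one plausible reading, offered only as a sketch. It computes the mean token entropy of each turn and splits a single trajectory-level advantage across turns in proportion to it. The proportional rule, the function names, and the toy distributions are assumptions; TAURA's actual weighting is defined in the paper and may differ, including in direction.

```python
import math
from typing import List


def turn_entropy(token_dists: List[List[float]]) -> float:
    """Mean Shannon entropy (in nats) over the tokens generated in one turn."""
    total = 0.0
    for dist in token_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0)
    return total / len(token_dists)


def per_turn_advantages(trajectory_advantage: float, entropies: List[float]) -> List[float]:
    """Split one trajectory-level advantage across turns, proportional to entropy.

    ASSUMPTION: proportional weighting is illustrative, not the paper's rule.
    """
    total = sum(entropies)
    if total == 0:
        # No entropy signal at all: fall back to an even split.
        return [trajectory_advantage / len(entropies)] * len(entropies)
    return [trajectory_advantage * h / total for h in entropies]


# Two toy turns: one fully confident, one maximally uncertain over two tokens.
confident = [[1.0, 0.0], [1.0, 0.0]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
entropies = [turn_entropy(confident), turn_entropy(uncertain)]
advantages = per_turn_advantages(1.0, entropies)
```

In this toy case all of the credit lands on the uncertain turn, since the confident turn has zero entropy.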

The 7B agent scores 50.5% on LVBench, above the 47.3% of Qwen2.5-VL-72B, a model with roughly 10x the parameters, and shows positive test-time scaling: more reasoning turns, better results. One caveat: the paper sets itself apart from interactive methods that rely on global pre-scanning, whose context cost still grows with length, but whether active perception truly saves compute or just moves it from frame processing to extra reasoning turns is something only the full paper's latency and token numbers can settle.

Key takeaways:
- Treating perception as a decision the model makes is a real path to cheaper long-video agents.
- A 7B beating a 72B says architecture matters more than parameter count here; video-RAG and long-video QA teams should track this.
- Be skeptical of the cost story: multi-turn perception may shift the bill to reasoning turns, so measure latency and tokens before you ship.

Source: Native Active Perception as Reasoning for Omni-Modal Understanding

02 Evaluation: When You Run a Multimodal Model as a Policy, It Forgets

Wire a multimodal model into a closed loop and many actions depend on observations that have already scrolled off screen. Existing benchmarks hide that ability. RNG-Bench isolates it with two games — memory matching and a first-person 3D maze — that force the model to rebuild unseen observations across multiple steps and act on them. A Memory Gap metric then separates "forgot" from "decided badly."
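
One way to picture a metric like this is to run the same episodes twice: once with the model relying on its own context, and once with the hidden observations handed back to it. The sketch below assumes the gap is a simple difference of success rates. That definition, the function names, and the toy outcomes are assumptions for illustration; RNG-Bench's exact formulation is in the paper.

```python
from typing import Dict, Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of episodes that succeeded."""
    return sum(outcomes) / len(outcomes)


def memory_gap(plain: Sequence[bool], with_oracle_memory: Sequence[bool]) -> Dict[str, float]:
    """Separate 'forgot' from 'decided badly' using two runs of the same episodes.

    plain: outcomes when the model must remember past observations itself.
    with_oracle_memory: outcomes when past observations are supplied to it.
    ASSUMPTION: a difference of success rates, chosen for illustration.
    """
    plain_rate = success_rate(plain)
    oracle_rate = success_rate(with_oracle_memory)
    return {
        "memory_gap": oracle_rate - plain_rate,  # lost to forgetting
        "decision_error": 1.0 - oracle_rate,     # lost even with perfect memory
    }


# Toy outcomes for four episodes (made-up data).
plain_runs = [True, False, False, False]
oracle_runs = [True, True, True, False]
report = memory_gap(plain_runs, oracle_runs)
```

With these made-up outcomes, half of all episodes are lost to forgetting and a quarter to bad decisions.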

The result is worth attention. Frontier MLLMs are far from saturated on the hardest setting (around 128K context and 350 images per episode), and most residual errors come from forgetting earlier observations rather than weak decisions. The limit is long-horizon memory, not reasoning.

This is trainable, not a hard ceiling. Fine-tuning Qwen3.5-9B on optimal-policy rollouts improves performance and transfers to other benchmarks without hurting general ability.

Key takeaways:
- When you run a multimodal model as an agent or policy, the real bottleneck may be holding onto what it saw, not deciding what to do.
- Memory Gap splits forgetting from poor decisions, so you can locate where the failure is.
- The memory weakness responds to targeted fine-tuning without sacrificing general ability, an actionable signal for closed-loop agent teams.

Source: Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

03 Architecture: The Missing Piece in Diffusion Language Models

The language model roadmap has open, studyable models for two routes: autoregressive (predict tokens one by one) and masked diffusion (mask a span, then fill it). Uniform diffusion — which lets any token update at any step and is in theory more flexible — had no model at scale anyone could build on. Sumi fills that gap: a 7B uniform diffusion model trained from scratch on 1.5T tokens, fully open down to weights, checkpoints, the complete recipe, and even the data mix.
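
The difference between the two diffusion routes shows up in how training text is corrupted. In the standard formulation, masked diffusion swaps tokens for a special mask symbol, while uniform diffusion swaps them for random vocabulary tokens, so nothing marks which positions are noise. The sketch below illustrates that contrast with a toy vocabulary and a fixed corruption rate; it is not Sumi's training code, and the function names are made up.

```python
import random
from typing import List

MASK = "[MASK]"


def masked_corrupt(tokens: List[str], rate: float, rng: random.Random) -> List[str]:
    """Masked diffusion: corrupted positions become an explicit mask token."""
    return [MASK if rng.random() < rate else tok for tok in tokens]


def uniform_corrupt(
    tokens: List[str], rate: float, vocab: List[str], rng: random.Random
) -> List[str]:
    """Uniform diffusion: corrupted positions become random vocabulary tokens."""
    return [rng.choice(vocab) if rng.random() < rate else tok for tok in tokens]


rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary
sentence = ["the", "cat", "sat", "on", "the", "mat"]

masked = masked_corrupt(sentence, 0.5, rng)
noisy = uniform_corrupt(sentence, 0.5, vocab, rng)
```

A denoiser for `masked` only has to fill the marked slots. A denoiser for `noisy` has to consider revising every position, which is where the added flexibility comes from.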

It matches autoregressive models on knowledge, reasoning, and code at equal token budgets, and lags on commonsense, which the team attributes to an education-heavy data mix. The point was never the leaderboard. It's that the community now has a real object for studying scaling laws, generation dynamics, and controllability in this route.

Key takeaways:
- Uniform diffusion lacked an at-scale open reference; Sumi supplies one.
- Teams studying diffusion language models now have a reproducible 7B baseline and a full data recipe.
- The commonsense weakness reads as a data-mix problem, not an architectural ceiling, which is worth confirming in follow-up work.

Source: Sumi: Open Uniform Diffusion Language Model from Scratch

04 Agent: Make Every Reasoning Step Leave a Trail

Automated science has a blind spot. The chain from "which evidence informed this" to "why design the experiment this way" to "the final conclusion" mostly lives inside model inference. You see the output, not the thinking. Xcientist externalizes that chain. Literature evidence, idea states, implementation plans, ablation records, and repair trajectories all persist as contract-governed artifacts, each mechanism traceable back to its evidence.

The paper names a concrete failure mode: claim drift. Code gets edited until the working artifact no longer supports the mechanism it originally claimed, and that drift is invisible if you only inspect the final result. Traceability is validated across three domains — memory systems, traffic forecasting, and physics-informed neural networks.
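
A rough sketch of how an externalized claim could make drift detectable: store a fingerprint of the code alongside the claim at validation time, then compare later. The `Claim` record, its field names, and the hash-based check are assumptions for illustration, not Xcientist's artifact schema. A mismatch also only says the claim needs re-validation, not that the mechanism has stopped holding.

```python
import hashlib
from dataclasses import dataclass
from typing import List


def fingerprint(source: str) -> str:
    """Stable fingerprint of a code artifact."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Claim:
    """Hypothetical artifact record linking a mechanism to its evidence."""
    mechanism: str
    evidence_ids: List[str]
    validated_code_hash: str  # fingerprint of the code when the claim was checked


def has_drifted(claim: Claim, current_source: str) -> bool:
    """True if the code changed since the claim was last validated."""
    return fingerprint(current_source) != claim.validated_code_hash


original = "def score(x):\n    return attention_pool(x)\n"
edited = "def score(x):\n    return x.mean()\n"  # mechanism quietly replaced

claim = Claim(
    mechanism="attention pooling improves recall",
    evidence_ids=["lit-012", "ablation-03"],  # made-up identifiers
    validated_code_hash=fingerprint(original),
)

drift_before_edit = has_drifted(claim, original)
drift_after_edit = has_drifted(claim, edited)
```

Looking only at the final metrics of `edited` would not reveal that the claimed mechanism is gone; the stored fingerprint does.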

One question stays open: does this harness make research reasoning more reliable, or mainly move complexity from the model into orchestration? Auditability is real value, but the full paper has to show what it costs.

Key takeaways:
- The bar for AI scientists is moving from "is the output good" to "can the reasoning be audited."
- Claim drift is a concept worth keeping: in automated pipelines, artifacts and claimed mechanisms quietly diverge.
- AI-for-science tool teams should watch this artifact-externalization approach, while weighing the orchestration overhead it adds.

Source: Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

05 Training: Stop Matching the Line, Start Passing as a Person

The standard way to train a user simulator — an LLM that mimics a real user — forces the model to match one reference reply, either by maximizing its log probability or scoring similarity. Real people say the same thing many ways. Aligning to a single answer narrows "act like a person" down to "reproduce this exact line."

Turing-RL changes the objective. An LLM judge runs a Turing-test discriminator reward, scoring whether a generated reply can be told apart from a real one given the user's history. The model learns to produce replies that are hard to distinguish, not replies that hug a ground truth. Across chat and Reddit forum discussions, it beats matching baselines on both automatic and human evaluation.
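
The shape of such a reward can be sketched as follows. The judge sees the user's history and two replies in random order, picks the one it believes is real, and the simulator is rewarded when its reply is the one picked. The binary reward, the `Judge` signature, and the word-overlap toy judge are assumptions for illustration; in the paper the judge is an LLM.

```python
import random
from typing import Callable, List

# Hypothetical judge: returns 0 or 1, the index of the reply it believes is real.
Judge = Callable[[List[str], str, str], int]


def turing_reward(
    history: List[str], generated: str, real: str, judge: Judge, rng: random.Random
) -> float:
    """1.0 if the judge mistakes the generated reply for the real one, else 0.0."""
    # Randomize positions so the judge cannot exploit ordering.
    generated_first = rng.random() < 0.5
    a, b = (generated, real) if generated_first else (real, generated)
    picked = judge(history, a, b)
    picked_generated = (picked == 0) == generated_first
    return 1.0 if picked_generated else 0.0


def toy_judge(history: List[str], reply_a: str, reply_b: str) -> int:
    """Stand-in judge: prefers the reply sharing more words with the history."""
    past = set(" ".join(history).lower().split())
    score_a = len(past & set(reply_a.lower().split()))
    score_b = len(past & set(reply_b.lower().split()))
    return 0 if score_a >= score_b else 1


history = ["honestly the new patch ruined ranked", "ranked is unplayable now"]
real_reply = "honestly ranked is dead"
rng = random.Random(0)

generic = turing_reward(history, "I am sorry to hear that.", real_reply, toy_judge, rng)
in_style = turing_reward(
    history, "honestly the patch ruined ranked for me", real_reply, toy_judge, rng
)
```

The in-style reply earns the reward without matching the real reply word for word, which is the point of dropping similarity matching. It also shows the risk: a judge with a shallow heuristic like this one is easy to game.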

The exact gains, and whether the judge can be gamed, need the full paper. But the direction of this recalibration is right for anyone building personalization evals or training agent assistants.

Key takeaways:
- User simulation really wants "act like a person," not "copy one line"; a discriminator reward fits that better than similarity matching.
- If you build personalization evals or train agents, reexamine whether your reward is being skewed by a single ground truth.
- The LLM judge is the key variable: its reliability and resistance to gaming decide how far this goes, so check the full paper.

Source: Learning User Simulators with Turing Rewards

Also Worth Noting

To Use a Video Model as a World Model, First Check Whether It Understands Physics
Video Gen: Physics-IQ pulls physical understanding out of generation quality and quantifies it on its own.

Materials Foundation Models Adapt to New Systems via Sparsity-Promoting Fine-Tuning
AI for Science: Robust, interpretable calibration for machine-learning interatomic potentials transferring to new domains, accepted at ICLR.

Home Assistants Ignore How Instructions Get More Elliptical as a Conversation Goes On
Agent: PEC-Home handles this progressive, context-accumulating ellipsis in instruction interpretation, accepted at ACL.

Idioms Transfer Poorly Across Languages Because of Non-Compositionality and Weak Surface Grounding
Evaluation: G-IdiomAlign uses English glosses as anchors for a cross-lingual alignment benchmark, accepted at ACL.

Today's Observation

Read OmniAgent (2606.19341) and RNG-Bench (2606.19338) together and you see two papers attacking the same default assumption: that a multimodal model faces a complete, currently visible state. The first rejects watch-it-all and argues for actively choosing what to look at. The second rejects full-state exposure and asks the model to rebuild observations it can no longer see. One solves "don't watch everything, learn to select." The other solves "you can't see everything, learn to reconstruct." Together they point at the same shift: when a multimodal model is deployed as a closed-loop policy, the capability frontier moves from "how accurately it sees" to "how it manages a limited observation budget under partial observability." For anyone building video agents or long-context multimodal systems, that has a direct design consequence — stop assuming everything worth seeing is sitting in the context.

One concrete thing to do: take the multimodal agent you're running, pick a few failure cases, and hand-attribute them with the RNG-Bench Memory Gap lens. Did the error come from "didn't see it," "forgot what it saw," or "saw it but decided wrong"? The three call for completely different fixes. Sort them before you invest — cheaper than reaching for a bigger model.
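
The tally itself is a few lines. The label names below are made up for this sketch, and the labels are meant to come from reading each failure by hand, not from any automatic classifier.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical label names for the three failure buckets.
LABELS = ("didnt_see", "forgot", "decided_wrong")


def triage(labels: Iterable[str]) -> List[Tuple[str, float]]:
    """Return each failure bucket with its share of cases, largest first."""
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no failure cases to triage")
    ranked = sorted(LABELS, key=lambda label: -counts[label])
    return [(label, counts[label] / total) for label in ranked]


# Hand-assigned labels for five failure cases (made-up data).
cases = ["forgot", "forgot", "didnt_see", "forgot", "decided_wrong"]
breakdown = triage(cases)
```

If the top bucket is `forgot`, as in this made-up sample, the fix is on the memory side, and a larger model with the same context handling may not help.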