A 4B Web Agent Catches Up to Closed CUAs on a Few Thousand Trajectories

Today's Overview

PEFT isn't just cheap fine-tuning — it's per-user persistent state. A framing paper recasts small adapters as local state attached to a shared trillion-parameter base, arguing along three scaling axes toward a future of "millions of personal models."
RAG crosses from text into video generation. LongLive-RAG borrows retrieval augmentation to fix identity drift in long videos, retrieving earlier trustworthy clips as anchors and ranking first on average on VBench-Long across several AR backbones.
Online RL frees open web agents from trajectory dependence. OpenWebRL trains a 4B model to trade blows with the closed computer-use agents (CUAs) from OpenAI and Gemini using just 0.4K init trajectories and 2.2K open-ended tasks, and the paper promises to open-source the data, models, and code.
Concurrent streams are an evaluation blind spot. X-Stream is the first benchmark built for multi-stream understanding, and the strongest MLLM scores only about 50% on concurrent streams.

Featured

01 Adapters Are Per-User Persistent State, Not Cheap Fine-Tuning

PEFT (parameter-efficient fine-tuning, like LoRA) has always been treated as a budget substitute for full fine-tuning — same goal, lower cost. This paper reframes it. A small adapter becomes "persistent local state" sitting on top of a strong shared base: the base handles general capability, the adapter carries one user's preferences, skills, tool habits, and memory-like updates.

The authors organize the idea along three scaling axes. Scale Up — a stronger base makes the same-sized local update more useful. Scale Down — how small can an adapter shrink while staying reliable. Scale Out — how do millions of personalized instances coexist. They also sketch an infrastructure example called MinT for managing adapter identity, versioning, provenance, evaluation, and online residency. Stitched together, the real argument is a deployment shape: one trillion-parameter base with a million distinct personal models hanging off it.
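To make the "adapter as per-user state" idea concrete, here is a minimal sketch in Python. It assumes a LoRA-style low-rank update for the size arithmetic. The `AdapterRecord` fields and the `AdapterRegistry` class are hypothetical illustrations of identity, versioning, provenance, evaluation, and residency. They are not MinT's actual schema or API, which the summary above does not describe.

```python
from dataclasses import dataclass, field


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one low-rank update B @ A (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


@dataclass(frozen=True)
class AdapterRecord:
    """Hypothetical metadata for one user's adapter (not MinT's real schema)."""
    user_id: str       # identity
    version: int       # versioning
    base_model: str    # provenance: the shared base this adapter was trained on
    num_params: int
    eval_score: float  # placeholder for whatever evaluation gates a release


@dataclass
class AdapterRegistry:
    """Tracks all versions per user and which adapters are resident in memory."""
    max_resident: int
    _versions: dict = field(default_factory=dict)  # user_id -> list of records
    _resident: list = field(default_factory=list)  # user_ids, least recent first

    def publish(self, record: AdapterRecord) -> None:
        self._versions.setdefault(record.user_id, []).append(record)

    def latest(self, user_id: str) -> AdapterRecord:
        return max(self._versions[user_id], key=lambda r: r.version)

    def load(self, user_id: str) -> AdapterRecord:
        """Mark a user's adapter resident, evicting the least recently used if full."""
        record = self.latest(user_id)
        if user_id in self._resident:
            self._resident.remove(user_id)
        elif len(self._resident) >= self.max_resident:
            self._resident.pop(0)
        self._resident.append(user_id)
        return record

    def resident(self) -> list:
        return list(self._resident)
```

For a single 4096 x 4096 projection, `lora_param_count(4096, 4096, rank=8)` gives 65,536 parameters against 16,777,216 for the full matrix, a factor of 256. That gap is why the per-user state can stay small while the base is shared. The eviction policy here is plain least-recently-used, chosen only to keep the sketch short.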

Read this as a position paper, not an empirical one. Separate the scaling laws already validated from the bets on a future shape. The "million/trillion" in the title is vision more than shipped scale.

Key takeaways:
- The real contribution is reframing adapters from "cheap fine-tuning" into "per-user persistent state." The technical details are secondary.
- If this shape holds, the hard problem in multi-tenant serving shifts from training cost to managing identity, versioning, and residency for a million adapters. That's what MinT points at.
- Take the framework, but judge the strength of the scaling laws after the full text and follow-up replication.

Source: On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters

02 RAG Crosses Into Video to Cure Long-Generation Drift

Autoregressive video generation has a structural dead end. For efficiency, most models use sliding-window attention and see only the last few frames. Once appearance errors accumulate inside that window, every later frame builds on the already-degraded trajectory, and identity drifts further with no way back.

LongLive-RAG's twist is that it doesn't patch the attention mechanism or the sampling. It ports retrieval-augmented generation — a paradigm from text LLMs — straight over. Already-generated history latents become a retrievable memory bank. Each new block triggers a lookup that finds earlier, more trustworthy clips to anchor character identity, instead of fixating on the degraded recent window. To sharpen retrieval, a Window Temporal Delta Loss suppresses locally redundant similarity so embeddings capture meaningful temporal change.
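The retrieval step can be sketched as a small content-addressable memory. This is an illustration under stated assumptions, not LongLive-RAG's implementation: the embeddings are plain Python lists, the similarity is cosine, the query vector is hypothetical, and the Window Temporal Delta Loss that trains the embeddings is not shown. The one property it tries to capture is that lookups skip the most recent window and reach back to older blocks.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class LatentMemoryBank:
    """Stores one embedding per generated block and retrieves older blocks."""

    def __init__(self, recent_window: int):
        # Number of most recent blocks to exclude from retrieval (assumed knob).
        self.recent_window = recent_window
        self.entries = []  # list of (block_index, embedding)

    def add(self, embedding):
        self.entries.append((len(self.entries), list(embedding)))

    def retrieve(self, query, top_k=1):
        """Return indices of the most similar blocks outside the recent window."""
        cutoff = max(len(self.entries) - self.recent_window, 0)
        candidates = self.entries[:cutoff]
        ranked = sorted(candidates, key=lambda e: cosine(query, e[1]), reverse=True)
        return [index for index, _ in ranked[:top_k]]
```

With four stored blocks and `recent_window=2`, only blocks 0 and 1 are candidates, even when a recent block is closer to the query. How the real system scores trustworthiness and conditions generation on the retrieved latents is left to the paper.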

The model ranks first on average on VBench-Long across several AR backbones, and the retrieval step adds little overhead. What's worth remembering isn't another long-video SOTA. It's that content-addressable memory, an old NLP trick, is spilling into a completely different modality.

Key takeaways:
- Identity drift in long video comes from the sliding window only seeing recent frames. RAG-style retrieval of history is a way around that dead end.
- Cross-modal transfer signal: video generation teams can add NLP's mature retrieval and memory mechanisms to their toolbox.
- Low overhead, attachable to many AR backbones, so the engineering bar is low. The effect still needs validation at longer durations and on diversity metrics.

Source: LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

03 What Teams Without Proprietary Trajectories Can Do to Train Web Agents

Open-source work on visual web agents has been stuck in one place. The strongest systems are closed, and the public recipes lean hard on human-curated batches of web-action trajectories for supervised training. Those high-quality demonstrations are expensive to collect, and static datasets can't cover the daily churn of real websites.

OpenWebRL routes around this by running online multi-turn RL directly on live websites. It fills in the whole pipeline: scalable real-time browser infrastructure, supervised initialization, multimodal context management, trajectory-level success/failure judging, and multi-turn policy optimization. Initialization used just 0.4K trajectories and RL training used 2.2K open-ended tasks, enough to train a 4B model to trade blows with OpenAI and Gemini's closed CUAs.
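One way to picture how a trajectory-level success/failure judgment can feed multi-turn policy optimization is the sketch below. It is an assumption for illustration, not OpenWebRL's stated algorithm: it uses a generic group-mean baseline over several rollouts of the same task and gives every turn in a rollout the same advantage. The function names are hypothetical.

```python
from statistics import mean


def trajectory_advantages(successes):
    """Turn per-rollout success flags for one task into advantages.

    Each rollout gets reward 1.0 on success and 0.0 on failure, and the
    advantage is that reward minus the mean reward of the group.
    """
    rewards = [1.0 if ok else 0.0 for ok in successes]
    baseline = mean(rewards)
    return [reward - baseline for reward in rewards]


def per_turn_advantages(num_turns, advantage):
    """Broadcast one trajectory-level advantage to every turn in the rollout."""
    return [advantage] * num_turns
```

Under this scheme a task where every rollout succeeds or every rollout fails yields all-zero advantages and contributes no gradient signal, so only tasks with mixed outcomes would drive learning. Whether the paper uses this baseline or a different estimator is something to check in the full text and code.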

The signal here isn't the score — it's that this path cuts the dependence on trajectory data sharply. The paper promises to open-source the data, models, and code. If that lands, it's a reproducible starting point for teams that want to train their own agents but can't get proprietary demonstrations. How costly and stable "online RL on live websites" actually is will take the code and full text to judge.

Key takeaways:
- The open-source bottleneck for visual web agents is trajectory-data dependence. Online RL is a practical way around it.
- 0.4K init plus 2.2K training tasks is enough, which means no need to hoard expensive human demonstrations.
- The value is full open-source reproducibility, not topping a leaderboard. The engineering bar for live-browser RL is still unproven.

Source: OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

04 Why Accuracy Drops to Half When a Model Watches Three Screens

Almost every video-understanding benchmark assumes one frame, one stream. Real scenes — live sports, autonomous driving, multi-screen collaboration — are concurrent signals that demand online cross-stream reasoning. No one has tested that ability directly.

X-Stream fills the gap. It's the first benchmark for multi-stream understanding in a streaming setting: 932 videos, 4,220 QA pairs, 11 sub-tasks spanning multi-window, multi-view, and multi-device, with a dual-verification process to stop models from bluffing off a single frame. The authors test multimodal LLMs as "multiplexers" that must multiplex several concurrent input streams.
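A minimal sketch of what "multiplexing" concurrent streams and scoring by setting could look like in a test harness follows. The data layout, the function names, and the stream names are assumptions for illustration, not X-Stream's actual format or evaluation code.

```python
import heapq
from collections import defaultdict


def multiplex(streams):
    """Merge per-stream event lists into one globally time-ordered list.

    streams maps a stream name to a list of (timestamp, payload) pairs,
    each list already sorted by timestamp.
    """
    tagged = [
        [(ts, name, payload) for ts, payload in events]
        for name, events in streams.items()
    ]
    # key= keeps comparison on timestamps only, so payloads never need ordering.
    return list(heapq.merge(*tagged, key=lambda event: event[0]))


def accuracy_by_setting(results):
    """results is a list of (setting, is_correct); returns accuracy per setting."""
    totals = defaultdict(lambda: [0, 0])  # setting -> [correct, total]
    for setting, is_correct in results:
        totals[setting][0] += int(is_correct)
        totals[setting][1] += 1
    return {setting: c / n for setting, (c, n) in totals.items()}
```

Feeding a model the merged sequence forces it to track which stream each event came from, and reporting accuracy per setting (multi-window, multi-view, multi-device) keeps a strong score in one setting from hiding a weak one in another.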

The result is blunt. The strongest current MLLM scores about 50% on concurrent streams, and its proactive response is weak. For teams building real-time multi-source applications, that number says today's models are still far from handling several feeds at once. Good single-stream scores don't carry over.

Key takeaways:
- Concurrent streams are the norm in real scenes, but existing benchmarks only test single streams, which is a systematic blind spot.
- SOTA models score about 50% on concurrent streams with weak proactivity. Strong single-stream doesn't mean usable multi-stream.
- Teams building real-time multi-source apps (surveillance, multi-screen, self-driving) can use it to measure a model's real ceiling on cross-stream reasoning.

Source: X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Also Worth Noting

First Web-Browsing Agent Benchmark Grounded in Korean (Evaluation): K-BrowseComp puts frontier models like GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1 head-to-head on a native-speaker-verified subset, pushing agent evaluation toward linguistic and cultural localization.

Testing Agents on Your Own Accounts and Local Databases (Agent): MCP-Persona uses environment simulation to evaluate an agent's real capability on personal social apps, covering a blind spot in general information-retrieval benchmarks.

Letting a VLM Tutor a Video Generation Model (Multimodal): Test-time adaptive optimization corrects the logical failures of video models that render realistically but break task rules.

A Training-Free PRM Substitute (Reasoning): Use an off-the-shelf LLM as a process scorer for chunk-level guided generation, skipping step-level annotation and reward-model training.

Fixing Distortion to Improve Visual Token Pruning (Efficiency): Eases the quadratic-complexity memory and latency bottleneck from the flood of visual tokens in MLLMs.

Novelty Signals as Training Supervision for Latent Memory (Agent): JAMEL jointly learns exploration and memory compression, solving the lack of reliable memory supervision over long trajectories.

Generating Physically Consistent, Collision-Free Interactive 3D Tabletop Scenes (Robotics): Aimed at general robot learning, handling dense object hierarchies and irregular affordances.

Locating AI-Edited Forgeries by Catching Intrinsic Energy Anomalies (Safety): Bypasses the physical-noise cues that traditional methods rely on but synthetic data lacks.

Unified Co-Design of Proteins and Small-Molecule Ligands (AI for Science): Jointly models the coupled modalities of sequence and 3D structure through intrinsic geodesic coupling.

Initial Noise Is the Overlooked Source of Mode Collapse (Image Gen): Samples initial noise from a guided-potential posterior to improve diversity, rather than intervening only mid-trajectory.

Today's Observation

Two papers today pin "personalization" to two completely different layers. Read together, they're more interesting than apart.

The PEFT paper bets at the model-weight layer: one trillion-parameter shared base, a million small personal adapters hanging off it, turning your preferences, skills, tool habits, and memory-like state into persistent local state. It bets the deployment shape shifts from "one big model serving everyone" to "one AI per user." MCP-Persona pokes at the same thing from the application and evaluation layer. Most agent benchmarks today still test general information-retrieval tools, but the scenes that are genuinely "personal" mean operating your own accounts and your own local database — exactly the blind spot in current evaluation, and the hardest part.

Put the two ends together and the conclusion isn't the tired "personalization is a trend." It's more specific: "one AI per user" is being pushed forward in earnest from both the bottom weight layer and the top application layer at once. Each end exposes an unsolved problem — at the weight layer, the adapter scaling laws are far from proven; at the application layer, agent reliability in personal-data environments hasn't been established at all. This is its own thread (the deployment shape of personalization). Don't conflate it with the other thread about agents going from one-shot tasks to getting durably better.

One concrete move: if you're building a personal-facing agent product, don't rush to pile on general tool capability. First give your agent a checkup with MCP-Persona-style scenes — operating personal accounts and local data — and watch where reliability falls off. The real bottleneck probably isn't how smart the model is. It's whether it can reliably touch your own data.