Trillion-Parameter Multimodal, 4B Agents Match 671B, PPO Exposed

Today's Overview

Baidu Releases ERNIE 5.0, a Trillion-Parameter Unified Multimodal Model, training text, image, video, and audio jointly from scratch with ultra-sparse MoE and elastic training that yields a family of deployable sub-models from one run.
Multi-agent RL enables "width scaling" where a 4B model swarm matches a 671B monolith. WideSeek-R1 improves with more parallel sub-agents, matching DeepSeek-R1 on information retrieval.
PPO's Core Mechanism Doesn't Fit LLMs: DPPO shows ratio clipping systematically over-constrains low-probability tokens and under-constrains high-probability ones in large vocabularies; direct divergence constraints stabilize training.
Multimodal PRM training data is massively redundant: 10% of the data is enough. BIS filtering achieves full-data performance with one-tenth the samples.

Featured

01 Multimodal One Training Run, a Family of Models

"Unified multimodal" has been a buzzword for years. But a trillion-parameter model that trains text, image, video, and audio from scratch under a single autoregressive objective — with a public technical report — is a first.

ERNIE 5.0 makes two design choices worth watching. Ultra-sparse MoE with modality-agnostic expert routing: regardless of whether the input is text or video frames, the same routing logic applies, letting the model learn which experts handle what. Then there's the elastic training paradigm: during a single pre-training run, the model simultaneously learns sub-models of varying depths, expert capacities, and sparsity levels. At deployment time, you pick the configuration that fits your hardware.

This is fundamentally an engineering answer to: you've trained a trillion-parameter model, now how do teams with different compute budgets actually use it? The report includes detailed expert routing visualizations and elastic training analysis.

Key takeaways: - Modality-agnostic routing avoids the complexity of separate pathways per modality — the key design decision for unified multimodal - Elastic training turns "one model" into "a model family," lowering the deployment barrier for large models - First public technical report of its kind at trillion-parameter scale

Source: ERNIE 5.0 Technical Report

02 Agent Model Too Small? Send More of Them

Deep learning's dominant narrative has been depth scaling — making a single model deeper and stronger. But when a task is inherently "wide" — say, gathering information across multiple dimensions — the bottleneck isn't individual capability. It's organizational efficiency.

WideSeek-R1 proposes a complementary axis: width scaling. A lead agent decomposes the task, multiple sub-agents execute information retrieval in parallel, and the entire system is jointly optimized through multi-agent reinforcement learning (MARL). The result: WideSeek-R1-4B hits 40.0% item F1 on the WideSearch benchmark, matching standalone DeepSeek-R1-671B — a 167x parameter gap.

Performance keeps improving as you add more parallel sub-agents, suggesting width scaling has its own scaling law.

Key takeaways: - Width scaling naturally complements depth scaling, especially for inherently parallelizable tasks like information retrieval - 4B x N matching 671B x 1 validates "small model swarms" as a viable cost-reduction path - MARL joint training is the critical ingredient — not just stitching agents together, but training them to collaborate

Source: WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning

03 Training PPO's Trust Region Is Wrong for LLMs

PPO is the de facto standard for LLM reinforcement learning, but this paper identifies a structural flaw: ratio clipping is pathological in large-vocabulary settings. The probability ratio is a per-token sample estimate, not a true policy divergence measure. Low-probability tokens have wildly volatile ratios and get over-clipped — they can't learn. High-probability tokens barely move the ratio and evade clipping entirely, risking catastrophic drift.

DPPO takes the direct approach: drop the heuristic clipping and estimate actual policy divergence (Total Variation or KL). Binary and Top-K approximations keep memory overhead manageable. Experiments show improved stability and efficiency over existing methods.

If you're running RLHF or GRPO pipelines, this is a PPO alternative worth investigating.

Key takeaways: - Ratio clipping under large vocabularies systematically misallocates constraint strength — this explains common instabilities in LLM RL training - DPPO's divergence constraint is theoretically cleaner and remains engineering-feasible through approximation - A credible alternative to PPO for RLHF and GRPO pipelines

Source: Rethinking the Trust Region in LLM Reinforcement Learning

04 Training 90% of Your PRM Data Is Wasted

Training multimodal process reward models (PRMs) typically requires large-scale Monte Carlo annotation — expensive to produce. But this paper delivers good news: most of that data is redundant. Random subsampling saturates quickly, meaning the useful information is concentrated in a small fraction of samples.

What makes a data point informative? Two factors: the mix ratio of positive and negative labels, and label reliability (average MC score of positive samples). Based on these signals, the authors propose BIS (Balanced-Information Score) filtering.

Validated on InternVL2.5-8B and Qwen2.5-VL-7B backbones, BIS matches or exceeds full-data performance using just 10% of the training set.

Key takeaways: - Smart filtering beats brute-force data scaling for PRM training - BIS requires no extra annotation — it works directly from existing MC signals - For teams training visual reasoning PRMs, this means an order-of-magnitude reduction in training cost

Source: Training Data Efficiency in Multimodal Process Reward Models

Trillion-Parameter Multimodal, 4B Agents Match 671B, PPO Exposed

Also Worth Noting

Cut 75% of Omni-Modal Tokens, Get Better Results EfficiencyOmniSIFT applies spatiotemporal pruning for video and vision-guided filtering for audio tokens; Qwen2.5-Omni-7B outperforms its full-token version on multiple tasks using just 25% of tokens. link

No Chain-of-Thought? Distill Anyway Agentpi-Distill jointly trains a privileged teacher and an unprivileged student, outperforming the standard "SFT+RL" pipeline across multiple agent benchmarks even when the closed-source teacher only provides actions. link

Humanoid Robots Learn Spatial Awareness via First-Person Vision RoboticsEgoActor uses a VLM to unify walking, head turning, and object manipulation from pure RGB input; the 8B model runs under 1-second inference latency, validated in both simulation and real hardware. link

Predict Model Misbehavior Before Training Starts SafetyData2Behavior injects mean data representations into a forward pass to predict post-finetuning bias and safety risks without actually training, using only 20% of finetuning's GPU cost. link

Add Communication Barriers, Agent Cooperation Drops 45% EvaluationSocialVeil simulates semantic ambiguity, cultural differences, and emotional interference; all four frontier LLMs degrade sharply, with repair strategies showing limited effectiveness. link

Autoregressive Protein Backbone Generation AI for SciencePAR uses coarse-to-fine multi-scale prediction to generate protein backbones, sculpting rough shapes first then refining, supporting zero-shot conditional generation and motif scaffolding. link

MoE Routing Leaks 91% of Original Text Safetyattackers can reconstruct original tokens from routing decisions alone using a transformer decoder; noise injection doesn't fully eliminate the risk, making routing decisions sensitive information in distributed inference. link

DPO's Fixed Reference Policy Degrades Over Time SafetyGAPO replaces the static reference with dynamic geometric anchors and uses adversarial perturbation to identify fragile preference signals, improving robustness under noisy labels. link

Automated 3D Character Rigging, 98-133% More Accurate Image GenSkinTokens discretizes skinning weights into token sequences; TokenRig uses an autoregressive framework to unify skeleton and skinning generation, with RL fine-tuning for generalization. link

Long Video Understanding: Let the Model Pick Its Own Frames MultimodalVideoBrain uses a dual-agent strategy for adaptive frame sampling, improving accuracy 3.5-9.0% while using 30-40% fewer frames, with cross-dataset generalization. link

Today's Observation

Two signals stand out today. First, the unified multimodal race enters the trillion-parameter era: ERNIE 5.0 is the first publicly reported trillion-parameter unified autoregressive multimodal model, while OmniSIFT tackles the inference cost side of the same problem — training and serving are advancing in parallel. Second, "small model swarms" are emerging alongside "big model solo": WideSeek-R1's width scaling, pi-Distill's privileged-information distillation, and heterogeneous small-model stacks all demonstrate that you don't always need the largest model. Teams working on multimodal deployment should look at OmniSIFT's token compression; teams building agent systems should study WideSeek-R1's MARL training framework.