Running Out of RL Training Data? Just Combine the Easy Problems

Today's Overview

  • Combine solved easy problems into new hard ones. Composition-RL turns wasted RLVR training samples into effective composite challenges, with consistent gains from 4B to 30B models.
  • 5B parameters doing the job of 80B. DeepGen 1.0 beats opponents 10x its size in both image generation and editing — code and weights fully open-sourced.
  • Students can surpass their teachers. ExOPD breaks the distillation performance ceiling through "reward extrapolation," and multi-domain expert knowledge can be merged back into a single small model.
  • 1M-token context on a single A6000D. MiniCPM-SALA's sparse + linear attention hybrid cuts long-context inference cost to a third.

Featured

01 Training Running Out of RL Data? Combine Easy Problems Into Hard Ones

RLVR (reinforcement learning with verifiable rewards) works great for training reasoning ability, but there's a practical problem: training problems are finite. As pass rates climb, a growing pile of "already solved" problems contributes nothing to learning — pure waste of compute. Previous approaches prioritize hard problems, but that leaves easy ones sitting idle.

Composition-RL takes the direct approach: automatically combine multiple easy problems into a single composite challenge. Each sub-problem is independently verifiable, so you get free new training samples out of what was essentially dead data. It delivers consistent improvements across 4B to 30B models, and a curriculum variant (gradually increasing composition depth) performs even better.

The practical bonus: this naturally supports cross-domain composition. Combine math and code problems in the same training sample, and the model's cross-domain reasoning benefits too.

Key takeaways:

  • Solves the late-stage RLVR problem where easy problems pile up and waste compute
  • Automatic composition with verifiable answers — no manual problem creation needed
  • Cross-domain composition is a free bonus, covering multiple capabilities in one training framework
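To make the mechanics concrete, here is a minimal sketch of the composition idea in plain Python. The problem pool, the `compose` helper, and the all-or-nothing reward are all hypothetical illustrations, not the paper's actual pipeline; the point is that each sub-problem keeps its own verifier, so the composite sample stays automatically checkable.

```python
# Hypothetical pool of "solved" easy problems: (prompt, verifier) pairs.
# Each verifier checks one sub-answer independently, mirroring the
# independently verifiable sub-problems described above.
easy_pool = [
    ("What is 7 * 8?", lambda a: a.strip() == "56"),
    ("What is 12 + 30?", lambda a: a.strip() == "42"),
    ("What is 100 - 1?", lambda a: a.strip() == "99"),
]

def compose(problems):
    """Fuse several easy problems into one composite prompt whose
    answer is the ordered list of sub-answers, separated by ';'."""
    prompt = "Answer each part, separated by ';':\n" + "\n".join(
        f"({i + 1}) {p}" for i, (p, _) in enumerate(problems)
    )
    verifiers = [v for _, v in problems]
    return prompt, verifiers

def composite_reward(answer, verifiers):
    """All-or-nothing verifiable reward over the sub-answers."""
    parts = answer.split(";")
    if len(parts) != len(verifiers):
        return 0.0
    return float(all(v(p) for v, p in zip(verifiers, parts)))

prompt, verifiers = compose(easy_pool[:2])
print(composite_reward("56;42", verifiers))  # both parts correct -> 1.0
print(composite_reward("56;41", verifiers))  # one wrong part -> 0.0
```

Cross-domain composition falls out for free here: nothing stops the pool from mixing a math verifier with, say, a code unit test, since the reward only needs each part's checker to run.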


02 Image Gen How Does 5B Beat 80B?

Unified image generation + editing models are the trend, but current approaches demand 10B+ parameters, making training and deployment expensive. DeepGen 1.0 has just 5B parameters yet delivers strong results in both generation and editing: 28% ahead of the 80B HunyuanImage on WISE, 37% ahead of the 27B Qwen-Image-Edit on UniREditBench.

The core design is Stacked Channel Bridging — extracting hierarchical features from multiple VLM layers, fusing them with learnable "think tokens," and feeding structured reasoning guidance to the generative backbone. Training follows three stages: alignment pretraining, joint fine-tuning, then GRPO reinforcement learning with mixed rewards. Total training data: only ~50M samples.

Code, weights, and datasets are all open-sourced. For teams that want to build unified generation models but can't afford massive compute, this is a ready-made starting point.

Key takeaways:

  • A 5B model outperforms models an order of magnitude larger in both generation and editing
  • Three-stage training + GRPO reinforcement learning is the key recipe
  • Fully open-sourced, lowering the barrier to unified multimodal generation
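The shape of the Stacked Channel Bridging idea can be sketched with NumPy. Everything here is a toy stand-in under assumed shapes: the layer count, widths, projection, and think-token count are illustrative, not DeepGen's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for hidden states from three VLM layers:
# seq_len=4 tokens, hidden=8 channels each (shapes assumed).
layer_feats = [rng.standard_normal((4, 8)) for _ in range(3)]

# Stack the hierarchical layer features along the channel axis,
# then project back down to the generative backbone's width.
stacked = np.concatenate(layer_feats, axis=-1)   # (4, 24)
W_proj = rng.standard_normal((24, 8)) * 0.1      # learnable in practice
bridged = stacked @ W_proj                       # (4, 8)

# Hypothetical learnable "think tokens" prepended as structured
# reasoning guidance for the generative backbone.
think_tokens = rng.standard_normal((2, 8))
guidance = np.concatenate([think_tokens, bridged], axis=0)
print(guidance.shape)  # (6, 8)
```

The design choice worth noting: because features come from multiple depths rather than just the final layer, the backbone sees both low-level and semantic signals in one fused conditioning sequence.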


03 Training Can a Distilled Student Surpass Its Teacher? Yes — If You Extrapolate

The ceiling of model distillation is usually the teacher's performance — students can approach it but never exceed it. G-OPD (Generalized On-Policy Distillation) reframes distillation as dense KL-constrained RL, then discovers a key lever: the reward scaling factor.

Standard distillation treats the reward signal and KL constraint equally (scaling factor = 1). Crank it above 1 — what the authors call ExOPD (reward extrapolation) — and the student breaks through the teacher's performance ceiling. In a particularly useful scenario, merging knowledge from different domain-expert models back into the original student, ExOPD enables the student to surpass every domain expert simultaneously.

For teams doing model compression or knowledge fusion, this "scaling factor > 1" trick is well worth trying.

Key takeaways:

  • Distillation = dense KL-constrained RL — this unified view opens up a new tuning space
  • Reward extrapolation (scaling factor > 1) makes student-surpasses-teacher possible
  • Merging multi-domain expert knowledge back into a small model has strong practical value
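One way to read reward extrapolation, in a toy numeric form: treat the target as a geometric mixture of student and teacher distributions. This is my reading of the scaling-factor lever, not the paper's exact objective; with alpha = 1 the target is the teacher (standard distillation), while alpha > 1 pushes the target past the teacher in the teacher-minus-student direction.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def extrapolated_target(student_logits, teacher_logits, alpha):
    """Normalized geometric mixture pi_s^(1-alpha) * pi_t^alpha.
    alpha = 1 recovers the teacher; alpha > 1 extrapolates beyond it."""
    return softmax((1 - alpha) * student_logits + alpha * teacher_logits)

student = np.array([2.0, 1.0, 0.0])  # toy logits, hypothetical
teacher = np.array([1.0, 2.5, 0.0])  # teacher favors class 1 more

p_teacher = extrapolated_target(student, teacher, alpha=1.0)
p_extrap = extrapolated_target(student, teacher, alpha=2.0)

# Extrapolation amplifies the teacher's preference beyond the teacher itself.
print(p_extrap[1] > p_teacher[1])  # True
```

In this toy, the teacher puts ~0.77 on class 1 while the alpha = 2 target puts ~0.96 there, which is the "break through the teacher ceiling" intuition in miniature.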


04 Architecture 1M Tokens on a Consumer GPU — Long Context Doesn't Require a Cluster

Full-attention 8B models already hit memory walls at 256K tokens. 1M tokens? Not happening. MiniCPM-SALA (Tsinghua) mixes attention mechanisms: 1/4 of layers use sparse attention (InfLLM-V2) to preserve precise long-range modeling, 3/4 use linear attention (Lightning Attention) to cut global computation overhead, with hybrid positional encoding handling the different attention mechanisms.

The practical part: you don't need to train from scratch. A continual training framework converts existing Transformer models into the hybrid architecture at just 25% of the from-scratch cost. On a single NVIDIA A6000D, 256K-token inference runs 3.5x faster than full attention, with support up to 1M tokens.

Key takeaways:

  • 1:3 sparse + linear attention hybrid is a cost-effective recipe for long context
  • Continual training from existing models costs only 25% of training from scratch
  • Million-token context on a single GPU lowers the deployment bar for long documents and conversations
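The 1:3 layer mix is easy to picture as a layer schedule. The exact placement of sparse layers is an assumption here (every fourth layer); the source only specifies the 1/4 vs 3/4 ratio.

```python
def hybrid_schedule(num_layers, sparse_every=4):
    """Assign attention types in a 1:3 sparse:linear pattern.
    Placing sparse attention at every fourth layer is an assumption;
    only the overall ratio comes from the description above."""
    return [
        "sparse" if i % sparse_every == 0 else "linear"
        for i in range(num_layers)
    ]

layers = hybrid_schedule(32)
print(layers.count("sparse"), layers.count("linear"))  # 8 24
```

The trade-off this schedule encodes: the few sparse (InfLLM-V2-style) layers retain precise long-range token selection, while the linear (Lightning-Attention-style) majority keeps per-token cost from growing with context length.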


Also Worth Noting

05 Robotics Robots Learn to "Imagine Before Acting"
GigaBrain-0.5M* uses imagined trajectories from a world model for RL training, improving complex tasks like laundry folding, box packing, and espresso making by ~30%. Ranks first on RoboChallenge. link
06 Reasoning Encouraging Models to "Think Longer" Improves Reasoning
Length-Incentivized Exploration uses length rewards + redundancy penalties to encourage more in-context exploration and verification. +4.4% on in-domain tasks, +2.7% out-of-domain. link
07 Reasoning Diffusion Language Models Get Their Own "Majority Voting"
dVoting exploits dLLMs' ability to generate at arbitrary positions in parallel, using consistency analysis to identify uncertain tokens and resample them. Up to 7.66% gain on GSM8K, no training required. link
08 Reasoning Models That "Draft" to Verify Their Own Reasoning
Thinking with Drafting reframes visual reasoning as "optical decompression." The model sketches logical structures in a DSL first, then self-verifies — more reliable than answering directly. link
09 Agent A Legal Reasoning Agent That Explores, Verifies, and Remembers
LawThinker's DeepVerifier runs atomic verification after every retrieval across three dimensions: knowledge accuracy, fact-law relevance, and procedural compliance. 24% better than direct reasoning. link
10 Image Gen Vector Sketches That Create Visual Illusions
Stroke of Surprise makes a drawing undergo a semantic "transformation" as strokes are added (duck becomes sheep), using dual-branch SDS and Overlay Loss. link
11 Training A Systematic Framework for Training Data Mixing
AI2's Olmix studies the design space of data mixtures, introducing a mixture reuse mechanism: when datasets update, only the affected portions are recomputed, saving 74% of compute. link
12 Safety Jailbreak Attacks Leave Detectable "Fingerprints" Inside Models
Layer-by-layer analysis of GPT-J, LLaMA, and Mistral reveals consistent structural patterns from jailbreak prompts in hidden layers. Selectively bypassing high-susceptibility layers blocks 78% of attacks. link
13 Multimodal In Multimodal RL, Only 15% of Tokens Actually "Look at the Image"
AT-RL finds that only a few "anchor tokens" in cross-modal attention connections do the visual reasoning. Selectively reinforcing these lets a 32B model surpass a 72B baseline. link
14 Multimodal When Audio and Text Conflict, Speech LLMs Trust Text 10x More Often
The ALME benchmark across 8 languages shows this isn't an audio quality issue — it's the LLM reasoning layer's preference for text representations. LoRA fine-tuning the language model layers cuts text bias in half. link
15 Evaluation GPT-4o Doesn't Actually Have Theory of Mind
Yale research finds LLMs score well on standard ToM tasks but fail on logically equivalent variants, with inconsistency between behavior prediction and mental state inference. link

Today's Observation

Three of today's featured papers attack the same problem: how to squeeze more signal from limited data in RL training. Composition-RL combines easy problems into composite challenges. ExOPD uses reward extrapolation to push distillation past the teacher ceiling. Length-Incentivized Exploration uses length rewards to encourage deeper exploration. Three paths, one goal: increasing the marginal return of every training sample. Teams doing RL post-training should consider running all three head-to-head — especially Composition-RL's cross-domain composition paired with ExOPD's multi-expert merging, which could compound into something bigger.