8% of Tokens Decide the Reasoning Gap

Today's Overview

  • "Unlearnable" Samples in RLVR. A set of hard examples never gets learned across training, even though rollouts produced correct answers. The reward curve climbs anyway — the easier subset does the work.
  • The Reasoning Advantage Is Sparse. The gap between base and reasoning models concentrates in about 8% of tokens, enriched at early planning decisions.
  • Single-Model Red-Teaming Isn't Real Protection. Query a set of frontier models concurrently and any weak link delivers harmful output. Success rates reach 100% on undefended model groups.
  • WOW-Seg Skips the Text Prompt. Meta's Mask2Token aligns masks directly to VLLM feature space. 1/8 the parameters, beats prior SOTA on LVIS.
  • 3D Reconstruction Adds Hallucination Score Maps to Diffusion Priors. HAD uses a feedforward novel-view network for cross-validation. Unreliable pixels get masked at pixel resolution.

Featured

01 The Reward Curve Climbs. The Hard Examples Never Get Learned

This ICML paper does something awkward. It tracks the "hard examples" the model initially fails on, then watches a sizable fraction never get learned across the entire training run — even when rollouts produced correct answers. The reward curve climbs anyway. The easier subset is doing the work, and hard examples get quietly abandoned.

Cross-sample gradient similarity analysis reveals the cause. Gradient directions for unlearnable samples barely overlap with the rest. Reasoning paths don't generalize. The model's internal representations simply lack the parts needed to solve these problems. Common optimization and sampling tricks all fail to recover them, and data augmentation makes no difference. Representation-layer defects don't get patched at the RL stage.
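A minimal sketch of what a cross-sample gradient similarity check could look like, assuming each sample's gradient has already been flattened into a plain list of floats. The helper names (`cosine`, `mean_similarity_to_rest`) and the toy gradient values are hypothetical illustrations, not the paper's code or data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero gradient has no direction to compare
    return dot / (norm_u * norm_v)

def mean_similarity_to_rest(grads):
    """For each sample, average its cosine similarity to every other sample."""
    result = {}
    for name, g in grads.items():
        others = [cosine(g, h) for other, h in grads.items() if other != name]
        result[name] = sum(others) / len(others) if others else 0.0
    return result

# Toy per-sample gradients (hypothetical values, 3 parameters each).
toy_grads = {
    "easy_a": [1.0, 0.9, 0.0],
    "easy_b": [0.9, 1.0, 0.1],
    "hard_x": [0.0, 0.05, 1.0],
}
scores = mean_similarity_to_rest(toy_grads)
isolated = min(scores, key=scores.get)  # sample whose gradient overlaps least
print(isolated, round(scores[isolated], 3))
```

In this toy setup the hard sample's gradient points almost orthogonally to the others, which is the kind of low overlap the analysis describes. On a real model the vectors would come from per-sample backward passes and be far larger.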

The implication for practitioners is direct. Pouring every failure case into RL burns compute while inflating metrics. Curriculum and data filtering logic needs a rethink. RLVR fixes "knows it but unreliable," not "doesn't know it at all."

Key takeaways:
  • A meaningful fraction of hard examples are unlearnable for RLVR. Reward improvements come from the easier subset.
  • The defect is at the representation layer, not the optimizer. Data augmentation doesn't help.
  • Tune RL recipes with difficulty filtering. Push unlearnable samples back to SFT or out of training.


02 Reasoning's Entire Advantage Hides in 8% of Tokens

Deploying reasoning models is expensive, but full RL training is out of reach for most teams. This ICML work runs a token-level diagnostic. About 8% of tokens in a generated response account for nearly all of the performance gap between base and reasoning models. These tokens concentrate at early planning decisions, enriched 17x over the average, and tend to appear right where the base model itself is most uncertain. The base model knows it doesn't know — it just picks the wrong direction.

The intervention follows naturally. At inference, the reasoning model takes over for a single token only at high-disagreement positions, then hands control back to the base model. This sparse delegation matches or exceeds same-size reasoning model performance under a small compute budget.
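The delegation loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's method: disagreement is measured here as total variation distance between the two next-token distributions, the threshold is arbitrary, and `toy_base` and `toy_reasoner` are hypothetical stand-ins that return hand-written probabilities.

```python
def total_variation(p, q):
    """Total variation distance between two next-token distributions (dicts)."""
    vocab = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in vocab)

def argmax_token(dist):
    return max(dist, key=dist.get)

def delegated_decode(base_model, reasoning_model, prompt, max_tokens, threshold):
    """Greedy decoding where the reasoning model picks the token only at
    positions whose disagreement exceeds `threshold` (assumed criterion)."""
    tokens = list(prompt)
    delegated_positions = []
    for _ in range(max_tokens):
        p_base = base_model(tokens)
        p_reason = reasoning_model(tokens)
        if total_variation(p_base, p_reason) > threshold:
            next_token = argmax_token(p_reason)  # single-token takeover
            delegated_positions.append(len(tokens))
        else:
            next_token = argmax_token(p_base)  # base model keeps control
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    return tokens, delegated_positions

# Hypothetical toy models: they disagree only at the first (planning) step.
def toy_base(tokens):
    if len(tokens) == 1:
        return {"guess": 0.55, "plan": 0.45}
    if len(tokens) == 2:
        return {"compute": 0.9, "<eos>": 0.1}
    return {"<eos>": 0.95, "compute": 0.05}

def toy_reasoner(tokens):
    if len(tokens) == 1:
        return {"guess": 0.05, "plan": 0.95}
    if len(tokens) == 2:
        return {"compute": 0.85, "<eos>": 0.15}
    return {"<eos>": 0.9, "compute": 0.1}

output, handoffs = delegated_decode(
    toy_base, toy_reasoner, ["<bos>"], max_tokens=5, threshold=0.3
)
print(output, handoffs)
```

Note that this sketch queries the reasoning model at every position just to measure disagreement, which would not save compute on its own. Since the flagged tokens tend to sit where the base model is most uncertain, a cheaper gate such as base-model entropy might be used to decide when to consult the reasoning model at all; that is a design guess, not something confirmed here.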

For cost-sensitive deployment teams, this is a lightweight alternative worth evaluating. The catch: validation runs on small models like Qwen3-0.6B. Scaling behavior and cross-task results need the full paper or follow-up work.

Key takeaways:
  • The reasoning advantage is sparse. About 8% of early planning tokens carry most of the gap.
  • Token-level intervention can substitute for full reasoning training at low cost.
  • Validation sits on small models. Behavior at production-scale reasoning models is open.


03 Single-Model Red-Teaming Misses the Real Threat Model

Mainstream LLM safety evaluation assumes the attacker hammers one model. This ICML paper flips to the attacker side. Real threats don't pick a single target. Query a set of frontier models concurrently and any weak link delivers the harmful output.

The jailbreak method built for this "wide net" scenario reaches 100% success rate on undefended model groups. One soft model in the set drags the entire safety boundary down to that level, no matter how hardened the others are.
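The weakest-link arithmetic is easy to check with a back-of-the-envelope calculation. The sketch below assumes attack attempts on different models succeed independently, and the per-model success rates are made-up numbers for illustration, not results from the paper.

```python
def union_attack_success(per_model_success):
    """Probability that at least one model in the set yields the harmful
    output, assuming attempts on different models are independent."""
    p_all_refuse = 1.0
    for p in per_model_success:
        p_all_refuse *= (1.0 - p)
    return 1.0 - p_all_refuse

# Hypothetical per-model jailbreak success rates.
hardened = [0.02, 0.03, 0.01]
with_soft_model = hardened + [0.90]

print(round(union_attack_success(hardened), 3))
print(round(union_attack_success(with_soft_model), 3))
```

Under these assumptions the joint success rate can never fall below the softest model's individual rate, so hardening the other models does not protect the group.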

Key takeaways:
  • Safety evaluation needs a "multi-model joint leakage" axis. Single-model red-teaming doesn't reflect real-world risk.
  • Frontier model alignment shows a weakest-link effect. Counting on competitors to compensate is a bad assumption.
  • Safety and compliance teams should treat "user can hit multiple providers in parallel" as the default threat model, not a corner case.


04 Drop the Text Bridge in Open-Vocabulary Segmentation

SAM cuts good masks but can't name them. CLIP-style models do the opposite. For the past few years, the standard repair has been a text prompt bridging the two. Meta's WOW-Seg writes "word-free" into the name. Mask2Token converts image masks directly into visual tokens aligned to the VLLM feature space, bypassing text entirely, and Cascade Attention Mask isolates information bleed across instances.

The paper claims SOTA on LVIS at 1/8 the parameters, plus a 7,662-class region recognition benchmark called RR-7K. Whether this direction is worth backing depends less on absolute SOTA and more on what the no-prompt setting gives up compared to text-input baselines. That comparison isn't in the abstract. The full paper has to settle it.

Key takeaways:
  • Open-vocabulary segmentation is starting to drop the text prompt dependency. Pure visual alignment is a real bet now.
  • Mask2Token aligns masks directly to VLLM feature space. An architecture choice worth tracking.
  • Don't fixate on SOTA when evaluating word-free methods. Look at retention rate against text-input baselines.


05 Score the Hallucinations in 3D Reconstruction

Sparse-view 3D reconstruction's hot recipe over the past year: use a diffusion model to fill in novel views, then feed those into the reconstruction pipeline. The problem is diffusion models hallucinate. Content that doesn't exist in the input views ends up baked into the final 3D asset.

HAD doesn't erase after the fact. A pretrained feedforward novel-view synthesis network cross-validates each generated image, producing a pixel-level hallucination score map. Unreliable pixels get masked during reconstruction. Augmented versions from different input views get fused for broader context. The paper is CVPR-accepted and posts SOTA on several novel-view synthesis benchmarks.
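A toy sketch of the masking idea, under loud assumptions: the score here is simply the absolute per-pixel difference between the diffusion-generated view and the feedforward network's prediction of the same view, the threshold is arbitrary, and images are tiny grayscale grids. HAD's actual scoring and fusion are not described in enough detail here to reproduce, so every function name below is hypothetical.

```python
def hallucination_scores(generated, reference):
    """Per-pixel score: absolute difference between the generated view and
    the cross-validating network's prediction (assumed scoring rule)."""
    return [[abs(g - r) for g, r in zip(g_row, r_row)]
            for g_row, r_row in zip(generated, reference)]

def reliability_mask(scores, threshold):
    """1 keeps a pixel as supervision, 0 masks it out."""
    return [[1 if s <= threshold else 0 for s in row] for row in scores]

def masked_l1(rendered, target, mask):
    """Mean L1 photometric loss over unmasked pixels only."""
    total, count = 0.0, 0
    for rend_row, tgt_row, m_row in zip(rendered, target, mask):
        for rend, tgt, m in zip(rend_row, tgt_row, m_row):
            if m:
                total += abs(rend - tgt)
                count += 1
    return total / count if count else 0.0

# Hypothetical 2x2 grayscale views with values in [0, 1].
generated = [[0.2, 0.8], [0.5, 0.9]]    # diffusion-filled novel view
reference = [[0.25, 0.1], [0.5, 0.85]]  # feedforward prediction of that view
rendered = [[0.2, 0.1], [0.4, 0.9]]     # current reconstruction's render

scores = hallucination_scores(generated, reference)
mask = reliability_mask(scores, threshold=0.2)
loss = masked_l1(rendered, generated, mask)
print(mask, round(loss, 4))
```

The top-right pixel, where the two sources disagree strongly, is excluded from the loss, so the reconstruction is not pulled toward content only the diffusion model produced.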

Key takeaways:
  • Diffusion priors for novel views are now standard in sparse reconstruction. "Hallucination pollution" is the acknowledged cost.
  • Writing hallucination detection into the pipeline is more honest than post-hoc erasure. A new baseline for 3D asset teams.
  • Pixel-level confidence generalizes beyond 3D, to anywhere generation fills missing data.

Also Worth Noting

06 D²Evo Pairs Two-Level Difficulty Estimation With "Medium Samples Drifting During Training." [Training] Read alongside today's RLVR Unlearnability paper. Together they cover both ends of curriculum recalibration: cut the unlearnable, chase the medium.

07 GUI Agent Self-Evolution Writes Past Episodes Into Retrievable Memory Instead of Context. [Agent] Sidesteps the two old problems with multi-step tasks: context window limits and static policy adaptability.

08 TRACE Does Evidence Grounding Across Multiple Videos. [Multimodal] Video agents handling long heterogeneous corpora no longer get capped by context budget. Locate and attribute evidence scattered across multiple videos.

09 Geometric Theory for SSL Projection Heads. [Architecture] Models the head as a trainable Riemannian metric. Gives an explanation for collapse and invariance observations from engineering practice.

10 PluRule: Same Content, Different Community Rules, Different Compliance Calls. [Evaluation] Pluralistic governance pushes content moderation models into compositional stress tests, not single rulebooks.

11 Modality-Missing Sentiment Analysis Drops Feature Completion for Decision Drift. [Multimodal] Modality loss and quality imbalance are the real-data norm. Generative completion has its own costs.

12 Contamination Robustness for Multi-Task Linear Regression. [Training] Theoretical, but back-solves an upper bound on outlier-task tolerance for real multi-task training.

Today's Observation

Three papers landing together (Unlearnability, Reasoning Restored, D²Evo) give a concrete convergence signal. RLVR research focus is shifting from "can we raise the score" to "what is actually moving during training." Unlearnability shows that a set of hard examples never gets learned, yet the reward curve climbs anyway. The reward source isn't what it looks like. Reasoning Restored finds the base-to-reasoning capability gap concentrates on a small number of token decisions. Most training compute is reworking parts the model already knows. D²Evo concedes that medium-difficulty samples themselves drift during training, so difficulty estimation has to drift with them.

All three point at the same thing. Reward curves, loss, and benchmark scores are coarse-grained signals. The next layer to watch is gradient direction, token position, and sample difficulty evolution.

For engineering teams, the action is concrete. Curriculum needs recalibration. Stuffing every failed sample into the training pile burns compute and inflates metrics. Run gradient similarity or simple pass-rate tracking. Strip out the samples that aren't moving. The compute saved and the metric self-deception avoided are not small numbers.
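A minimal sketch of the pass-rate tracking suggested above, assuming binary rollout rewards (1 for a verified-correct answer, 0 otherwise). The `PassRateTracker` class, its thresholds, and the sample IDs are hypothetical. The low-but-nonzero ceiling is deliberate, because the unlearnable samples described earlier can still produce occasional correct rollouts.

```python
from collections import defaultdict

class PassRateTracker:
    """Tracks per-sample pass rates across epochs to flag stalled samples."""

    def __init__(self):
        self.history = defaultdict(list)  # sample_id -> per-epoch pass rates

    def record(self, sample_id, rollout_rewards):
        """Store the fraction of correct rollouts for one sample in one epoch."""
        rate = sum(rollout_rewards) / len(rollout_rewards)
        self.history[sample_id].append(rate)

    def stalled(self, min_epochs=3, max_rate=0.25, min_gain=0.05):
        """Samples whose pass rate stays low and has not improved over the
        last `min_epochs` epochs (thresholds are illustrative defaults)."""
        flagged = []
        for sample_id, rates in self.history.items():
            if len(rates) < min_epochs:
                continue
            window = rates[-min_epochs:]
            if max(window) <= max_rate and window[-1] - window[0] < min_gain:
                flagged.append(sample_id)
        return sorted(flagged)

tracker = PassRateTracker()
epochs = [
    {"easy": [1, 0, 1, 1], "medium": [0, 0, 0, 1], "hard": [0, 0, 1, 0]},
    {"easy": [1, 1, 1, 1], "medium": [0, 1, 1, 0], "hard": [0, 0, 0, 0]},
    {"easy": [1, 1, 1, 1], "medium": [1, 1, 1, 0], "hard": [0, 1, 0, 0]},
]
for epoch in epochs:
    for sample_id, rewards in epoch.items():
        tracker.record(sample_id, rewards)

stalled_ids = tracker.stalled()
print(stalled_ids)
```

Flagged samples are candidates for routing back to SFT or dropping from the RL pool. The medium sample is left alone because its pass rate is still moving.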