Today's Overview
- "Unlearnable" Samples in RLVR. A set of hard examples never gets learned across training, even though rollouts produced correct answers. The reward curve climbs anyway — the easier subset does the work.
- The Reasoning Advantage Is Sparse. The gap between base and reasoning models concentrates in about 8% of tokens, enriched at early planning decisions.
- Single-Model Red-Teaming Isn't Real Protection. Query a set of frontier models concurrently and any weak link delivers harmful output. Success rates reach 100%.
- WOW-Seg Skips the Text Prompt. Meta's Mask2Token aligns masks directly to VLLM feature space. 1/8 the parameters, beats prior SOTA on LVIS.
- 3D Reconstruction Adds Hallucination Score Maps to Diffusion Priors. HAD uses a feedforward novel-view network for cross-validation. Unreliable pixels get masked at pixel resolution.
Featured
01 The Reward Curve Climbs. The Hard Examples Never Get Learned
This ICML paper surfaces an uncomfortable result. It tracks the "hard examples" the model initially fails on and finds that a sizable fraction never get learned across the entire training run, even when rollouts produced correct answers. The reward curve climbs anyway. The easier subset is doing the work, and hard examples get quietly abandoned.
Cross-sample gradient similarity analysis reveals the cause. Gradient directions for unlearnable samples barely overlap with the rest. Reasoning paths don't generalize. The model's internal representations simply lack the parts needed to solve these problems. Common optimization and sampling tricks all fail to recover them, and data augmentation makes no difference. Representation-layer defects don't get patched at the RL stage.
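A minimal sketch of what a cross-sample gradient similarity check could look like, not the paper's implementation. It assumes each sample's gradient has already been flattened into a plain vector. The toy vectors, the helper names, and the 0.3 cutoff are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two flat gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean_similarity_to_rest(grads, index):
    """Average cosine similarity of one sample's gradient to all the others."""
    others = [g for i, g in enumerate(grads) if i != index]
    return sum(cosine(grads[index], g) for g in others) / len(others)

# Hypothetical per-sample gradients (toy 3-dimensional vectors).
grads = [
    [1.0, 0.9, 0.0],   # learnable sample
    [0.9, 1.0, 0.1],   # learnable sample
    [1.0, 1.0, 0.0],   # learnable sample
    [0.0, 0.05, 1.0],  # candidate unlearnable sample
]

scores = [mean_similarity_to_rest(grads, i) for i in range(len(grads))]

# The 0.3 cutoff is an assumption chosen for this toy data.
isolated = [i for i, s in enumerate(scores) if s < 0.3]
print(isolated)
```

In this toy data only the last sample falls below the cutoff, which is the "barely overlaps with the rest" pattern the analysis describes.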
The implication for practitioners is direct. Pouring every failure case into RL burns compute while inflating metrics. Curriculum and data filtering logic needs a rethink. RLVR fixes "knows it but unreliable," not "doesn't know it at all."
Key takeaways:
- A meaningful fraction of hard examples are unlearnable for RLVR. Reward improvements come from the easier subset.
- The defect is at the representation layer, not the optimizer. Data augmentation doesn't help.
- Tune RL recipes with difficulty filtering. Push unlearnable samples back to SFT or out of training.
Source: The Unlearnability Phenomenon in RLVR for Language Models
02 Reasoning's Entire Advantage Hides in 8% of Tokens
Deploying reasoning models is expensive, but full RL training is out of reach for most teams. This ICML work runs a token-level diagnostic. About 8% of tokens in a generated response account for nearly all of the performance gap between base and reasoning models. These tokens concentrate at early planning decisions, enriched 17x over the average, and tend to appear right where the base model itself is most uncertain. The base model knows it doesn't know — it just picks the wrong direction.
The intervention follows naturally. At inference, the reasoning model takes over for a single token only at high-disagreement positions, then hands control back to the base model. This sparse delegation matches or exceeds same-size reasoning model performance under a small compute budget.
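The delegation loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's method. Both models are stand-in functions that return a token-to-probability dict, disagreement is measured with total variation distance, and the 0.3 threshold is made up. The summary does not say which disagreement measure the authors use.

```python
def total_variation(p, q):
    """Total variation distance between two token distributions (dicts)."""
    vocab = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in vocab)

def greedy(dist):
    """Pick the most probable token."""
    return max(dist, key=dist.get)

def sparse_delegation(base_model, reasoning_model, prompt, max_tokens, threshold):
    """Decode with the base model; delegate single tokens on high disagreement."""
    tokens = list(prompt)
    delegated = []  # positions where the reasoning model chose the token
    for _ in range(max_tokens):
        p_base = base_model(tokens)
        p_reason = reasoning_model(tokens)
        if total_variation(p_base, p_reason) > threshold:
            next_token = greedy(p_reason)
            delegated.append(len(tokens))
        else:
            next_token = greedy(p_base)
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    return tokens, delegated

# Hypothetical toy models: they disagree only at the first (planning) step.
def toy_base(tokens):
    if len(tokens) == 1:
        return {"guess": 0.4, "plan": 0.35, "<eos>": 0.25}
    if len(tokens) < 4:
        return {"step": 0.9, "<eos>": 0.1}
    return {"<eos>": 0.95, "step": 0.05}

def toy_reasoning(tokens):
    if len(tokens) == 1:
        return {"plan": 0.9, "guess": 0.05, "<eos>": 0.05}
    if len(tokens) < 4:
        return {"step": 0.85, "<eos>": 0.15}
    return {"<eos>": 0.9, "step": 0.1}

output, delegated = sparse_delegation(
    toy_base, toy_reasoning, ["<q>"], max_tokens=8, threshold=0.3
)
print(output, delegated)
```

Note that this sketch calls both models at every position, so it saves nothing by itself. A practical version would need a cheaper gate, for example the base model's own uncertainty, which is an assumption here rather than something the summary confirms.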
For cost-sensitive deployment teams, this is a lightweight alternative worth evaluating. The catch: validation runs on small models like Qwen3-0.6B. Scaling behavior and cross-task results need the full paper or follow-up work.
Key takeaways:
- The reasoning advantage is sparse. About 8% of early planning tokens carry most of the gap.
- Token-level intervention can substitute for full reasoning training at low cost.
- Validation sits on small models. Behavior at production-scale reasoning models is open.
Source: Reasoning Can Be Restored by Correcting a Few Decision Tokens
03 Single-Model Red-Teaming Misses the Real Threat Model
Mainstream LLM safety evaluation assumes the attacker hammers one model. This ICML paper flips to the attacker side. Real threats don't pick a single target. Query a set of frontier models concurrently and any weak link delivers the harmful output.
The jailbreak method built for this "wide net" scenario reaches a 100% success rate on undefended model groups. One soft model in the set drags the entire safety boundary down to its level, no matter how hardened the others are.
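The weakest-link effect is easy to see with back-of-the-envelope arithmetic. The sketch below assumes each model's outcome is independent, which real systems may not satisfy, and the per-model success rates are invented for illustration. They are not figures from the paper.

```python
def joint_leak_probability(per_model_success):
    """P(at least one model complies), assuming independent outcomes."""
    p_all_refuse = 1.0
    for p in per_model_success:
        p_all_refuse *= (1.0 - p)
    return 1.0 - p_all_refuse

# Hypothetical per-model attack success rates: three hardened, one soft.
hypothetical_rates = [0.05, 0.10, 0.02, 0.60]

joint = joint_leak_probability(hypothetical_rates)
print(round(joint, 4))
```

Under these assumptions the joint rate can never fall below the softest model's rate, and every additional provider only pushes it higher.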
Key takeaways:
- Safety evaluation needs a "multi-model joint leakage" axis. Single-model red-teaming doesn't reflect real-world risk.
- Frontier model alignment shows a weakest-link effect. Counting on competitors to compensate is a bad assumption.
- Safety and compliance teams should treat "user can hit multiple providers in parallel" as the default threat model, not a corner case.
Source: New Wide-Net-Casting Jailbreak Attacks Risk Large Models
04 Drop the Text Bridge in Open-Vocabulary Segmentation
SAM cuts good masks but can't name them. CLIP-style models do the opposite. For the past few years, the standard repair has been a text prompt bridging the two. Meta's WOW-Seg writes "word-free" into the name. Mask2Token converts image masks directly into visual tokens aligned to the VLLM feature space, bypassing text entirely, and Cascade Attention Mask keeps information from bleeding across instances.
The paper claims SOTA on LVIS at 1/8 the parameters and introduces RR-7K, a 7,662-class region recognition benchmark. Whether this direction is worth backing depends less on absolute SOTA and more on what the no-prompt setting gives up compared to text-input baselines. That comparison isn't in the abstract. The full paper has to settle it.
Key takeaways:
- Open-vocabulary segmentation is starting to drop the text prompt dependency. Pure visual alignment is a real bet now.
- Mask2Token aligns masks directly to VLLM feature space. An architecture choice worth tracking.
- Don't fixate on SOTA when evaluating word-free methods. Look at retention rate against text-input baselines.
Source: WOW-Seg: A Word-free Open World Segmentation Model
05 Score the Hallucinations in 3D Reconstruction
Sparse-view 3D reconstruction's hot recipe over the past year: use a diffusion model to fill in novel views, then feed those into the reconstruction pipeline. The problem is diffusion models hallucinate. Content that doesn't exist in the input views ends up baked into the final 3D asset.
HAD doesn't erase after the fact. A pretrained feedforward novel-view synthesis network cross-validates each generated image, producing a pixel-level hallucination score map. Unreliable pixels get masked during reconstruction. Augmented versions from different input views get fused for broader context. The paper is CVPR-accepted and posts SOTA on several novel-view synthesis benchmarks.
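To make the masking idea concrete, here is a toy sketch rather than HAD's actual scoring function. It assumes the score is a per-pixel absolute difference between the diffusion output and the feedforward network's prediction for the same view, on tiny grayscale images stored as nested lists. The function names and the 0.2 threshold are hypothetical.

```python
def hallucination_score_map(generated, reference):
    """Per-pixel absolute difference between two same-sized grayscale images."""
    return [
        [abs(g - r) for g, r in zip(g_row, r_row)]
        for g_row, r_row in zip(generated, reference)
    ]

def reliability_mask(score_map, threshold):
    """True where a pixel is trusted, False where it is treated as hallucinated."""
    return [[score <= threshold for score in row] for row in score_map]

def masked_l1_loss(rendered, target, mask):
    """Mean absolute error over trusted pixels only."""
    total, count = 0.0, 0
    for r_row, t_row, m_row in zip(rendered, target, mask):
        for r, t, keep in zip(r_row, t_row, m_row):
            if keep:
                total += abs(r - t)
                count += 1
    return total / count if count else 0.0

# Hypothetical 2x2 images with intensities in [0, 1].
generated = [[0.2, 0.8], [0.5, 0.9]]   # diffusion-generated novel view
reference = [[0.25, 0.8], [0.5, 0.1]]  # feedforward prediction of the same view
rendered = [[0.2, 0.7], [0.5, 0.0]]    # current 3D reconstruction render

scores = hallucination_score_map(generated, reference)
mask = reliability_mask(scores, threshold=0.2)  # threshold is an assumption
loss = masked_l1_loss(rendered, generated, mask)
print(mask, loss)
```

The bottom-right pixel, where the two networks disagree strongly, is excluded from the loss, so the reconstruction is not pulled toward content only the diffusion model produced.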
Key takeaways:
- Diffusion priors for novel views are now standard in sparse reconstruction. "Hallucination pollution" is the acknowledged cost.
- Writing hallucination detection into the pipeline is more honest than post-hoc erasure. A new baseline for 3D asset teams.
- Pixel-level confidence generalizes beyond 3D, to anywhere generation fills missing data.
Source: HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction

Today's Observation
Three papers landing together — Unlearnability, Reasoning Restored, D²Evo — give a concrete convergence signal. RLVR research focus is shifting from "can we raise the score" to "what is actually moving during training dynamics." Unlearnability shows that a set of hard examples never gets learned, yet the reward curve climbs anyway. The reward source isn't what it looks like. Reasoning Restored finds the base-to-reasoning capability gap concentrates on a small number of token decisions. Most training compute is reworking parts the model already knows. D²Evo concedes that medium-difficulty samples themselves drift during training, so difficulty estimation has to drift with them.
All three point at the same thing. Reward curves, loss, and benchmark scores are coarse-grained signals. The next layer to watch is gradient direction, token position, and sample difficulty evolution.
For engineering teams, the action is concrete. Curriculum needs recalibration. Stuffing every failed sample into the training pile burns compute and inflates metrics. Run gradient similarity or simple pass-rate tracking. Strip out the samples that aren't moving. The compute saved and the metric self-deception avoided are not small numbers.
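Simple pass-rate tracking could be sketched like this. It is one possible filter, not a recipe from any of the three papers. It assumes a per-sample pass rate is logged at each checkpoint, and the sample ids, `min_gain`, and `ceiling` values are hypothetical.

```python
def stalled_samples(pass_history, min_gain=0.05, ceiling=0.5):
    """Flag samples whose pass rate has not improved and is still low.

    pass_history maps a sample id to its pass rate at each checkpoint,
    ordered from earliest to latest.
    """
    stalled = []
    for sample_id, rates in pass_history.items():
        if len(rates) < 2:
            continue  # not enough checkpoints to judge movement
        gain = rates[-1] - rates[0]
        if gain < min_gain and rates[-1] < ceiling:
            stalled.append(sample_id)
    return sorted(stalled)

# Hypothetical pass rates at three checkpoints.
history = {
    "easy-01": [0.50, 0.75, 0.90],
    "hard-07": [0.00, 0.12, 0.00],   # occasional correct rollouts, no lasting gain
    "hard-12": [0.00, 0.25, 0.60],
    "solved-03": [1.00, 1.00, 1.00],
}

to_strip = stalled_samples(history)
print(to_strip)
```

The `ceiling` check keeps already-solved samples out of the stalled list. Whether flagged samples go back to SFT or out of training altogether is a separate decision.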