Unlimited RLVR Data From Web Text, FP4 Pretraining Matches BF16

Today's Overview

  • Running out of RLVR training data? Synthesize unlimited tasks from web text. Golden Goose turns unverifiable pages into verifiable reasoning tasks, revives saturated models, and sets new SOTA for 1.5B and 4B across 15 benchmarks.
  • The biggest computational waste in diffusion LLMs just got plugged — recycling discarded token representations nearly doubles AIME accuracy. Residual Context Diffusion upgrades existing dLLMs with only ~1B tokens.
  • Blackwell's FP4 pretraining finally matches BF16 accuracy — Quartet II achieves 4.2x speedup with a new unbiased quantization method; kernels are open-source.
  • Stronger reasoning models are less safe, but the fix doesn't need an external teacher. ThinkSafe unlocks the model's own latent safety knowledge to generate alignment data, outperforming GRPO at a fraction of the compute.

Featured

01 Training: RL Data Ceiling? Make More From Web Text

RLVR (Reinforcement Learning with Verifiable Rewards) is the primary tool for teaching LLMs to reason, but it has a practical ceiling: verifiable training data is finite, and models eventually saturate.

Golden Goose proposes a deceptively simple fix — take "unverifiable" internet text (science textbooks, for instance) and reformat it as multiple-choice fill-in-the-middle tasks. An LLM identifies key reasoning steps and generates plausible distractors. This pipeline produced GooseReason-0.7M, spanning math, programming, and general science.
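To make the reformatting step concrete, here is a minimal Python sketch of turning one passage into a multiple-choice fill-in-the-middle task with an exact-match reward. It is an illustration under stated assumptions, not the Golden Goose pipeline: `McqTask`, `build_task`, `reward`, and the `[MASK]` placeholder are hypothetical names, and the key span and distractors, which the paper has an LLM produce, are passed in by hand here.

```python
import random
from dataclasses import dataclass

MASK = "[MASK]"  # hypothetical placeholder token, chosen for this sketch


@dataclass
class McqTask:
    prompt: str  # masked passage followed by lettered options
    answer: str  # letter of the correct option, e.g. "B"


def build_task(passage: str, key_span: str, distractors: list[str], seed: int = 0) -> McqTask:
    """Mask one key span and offer it among distractors as a multiple-choice question.

    Assumption: key_span and distractors come from an upstream LLM step
    that is not shown here.
    """
    if key_span not in passage:
        raise ValueError("key_span must occur verbatim in the passage")
    masked = passage.replace(key_span, MASK, 1)  # hide the first occurrence only
    options = [key_span] + list(distractors)
    random.Random(seed).shuffle(options)  # seeded, so the task is reproducible
    letters = "ABCDEFGH"[: len(options)]
    option_lines = [f"{letter}. {option}" for letter, option in zip(letters, options)]
    prompt = f"{masked}\n\nWhich option fills {MASK}?\n" + "\n".join(option_lines)
    return McqTask(prompt=prompt, answer=letters[options.index(key_span)])


def reward(task: McqTask, model_output: str) -> float:
    """Verifiable reward: 1.0 only when the output is exactly the correct letter."""
    return 1.0 if model_output.strip().upper() == task.answer else 0.0
```

Because the correct option is known by construction, the reward needs no judge model, which is what makes an otherwise unverifiable passage usable for RLVR.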

Models that had plateaued on existing RLVR data started climbing again, with 1.5B and 4B-Instruct models hitting new SOTA across 15 benchmarks. The cybersecurity validation is especially telling: RLVR tasks synthesized from raw FineWeb scrapes let Qwen3-4B beat a 7B model that had undergone domain-specific pretraining and post-training.

Key takeaways:
  • Internet text is a nearly unlimited source of reasoning tasks once you know how to make it verifiable
  • Model saturation on current data doesn't mean RL training has hit its ceiling; new data sources restart growth
  • Small model + good data beats large model + generic data


02 Architecture: Diffusion LLMs Were Throwing Away Their Best Work

The appeal of diffusion language models (dLLMs) is parallel decoding, but the best current methods have a massive waste problem: each step keeps only the highest-confidence tokens and discards everything else.

Residual Context Diffusion (RCD) shows those discarded tokens are far from useless — their representations carry rich contextual information. RCD converts them into residual signals and injects them back into the next denoising step, giving the model cross-iteration memory. Training uses a decoupled two-stage pipeline that sidesteps backpropagation memory bottlenecks, and converting an existing dLLM to RCD takes only ~1 billion tokens.
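One way to picture the residual idea is the toy decoding step below. It is a sketch under a loud assumption, not RCD's actual construction: it assumes the residual for a discarded position is the probability-weighted average of token embeddings, added to the mask embedding with a scale `alpha`. The function name `next_step_inputs` and all parameters are hypothetical.

```python
def next_step_inputs(probs, embed, mask_vec, keep, alpha=0.5):
    """Build the inputs for the next denoising step from this step's predictions.

    probs:    per-position probability distributions over the vocabulary
    embed:    one embedding vector per vocabulary id
    mask_vec: embedding of the mask token
    keep:     how many highest-confidence positions to commit this step
    alpha:    residual scale (assumed here; alpha=0 is plain "discard and redo")
    """
    confidence = [max(p) for p in probs]
    ranked = sorted(range(len(probs)), key=lambda i: confidence[i], reverse=True)
    # Commit the argmax token at the `keep` most confident positions.
    committed = {i: probs[i].index(max(probs[i])) for i in ranked[:keep]}

    inputs = []
    for i, p in enumerate(probs):
        if i in committed:
            inputs.append(list(embed[committed[i]]))
            continue
        # Position is re-masked. Instead of dropping its prediction, keep a
        # soft summary: the expected token embedding under distribution p.
        soft = [
            sum(p[v] * embed[v][d] for v in range(len(embed)))
            for d in range(len(mask_vec))
        ]
        inputs.append([m + alpha * s for m, s in zip(mask_vec, soft)])
    return committed, inputs
```

With `alpha=0` every re-masked position gets the bare mask embedding again, which is the baseline behavior described above; any positive `alpha` carries information from the discarded predictions into the next step.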

The payoff: 5-10 point accuracy gains across multiple benchmarks, near-doubled accuracy on AIME, and 4-5x fewer denoising steps to reach equivalent quality.

Key takeaways:
  • The "discard and redo" strategy was leaving substantial compute on the table
  • Retaining intermediate representations as residual signals is a low-cost improvement path
  • ~1B token conversion cost means existing dLLMs can upgrade quickly


03 Efficiency: FP4 Pretraining Finally Matches BF16

NVIDIA Blackwell GPUs natively support NVFP4, theoretically enabling full 4-bit pretraining. But previous quantized training methods sacrificed precision for unbiased gradient estimation, leaving a noticeable accuracy gap versus FP16/FP8.

Quartet II closes that gap with MS-EDEN, a new unbiased quantization routine that cuts quantization error by more than 2x compared to stochastic rounding. Integrated into a fully NVFP4 scheme for linear layers, it produces consistently better gradient estimates in both forward and backward passes.
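For readers unfamiliar with the baseline, the sketch below shows plain stochastic rounding onto the FP4 (E2M1) value grid and why it is unbiased. It does not implement MS-EDEN, whose details are not given here, and it omits the per-block scaling that NVFP4 applies. The function names are hypothetical.

```python
import random

# Magnitudes representable by an E2M1 4-bit float; the sign is a separate bit.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_nearest(x: float) -> float:
    """Deterministic rounding to the closest grid value (biased for a fixed x)."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP4_GRID[-1])  # clamp to the largest representable magnitude
    return sign * min(FP4_GRID, key=lambda g: abs(g - mag))


def stochastic_round(x: float, rng: random.Random) -> float:
    """Round to one of the two neighboring grid values, at random.

    The upper neighbor is chosen with probability proportional to how close
    x is to it, so the expected result equals x (within the clamped range).
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP4_GRID[-1])
    for lo, hi in zip(FP4_GRID, FP4_GRID[1:]):
        if lo <= mag <= hi:
            p_up = (mag - lo) / (hi - lo)
            return sign * (hi if rng.random() < p_up else lo)
    return sign * mag  # not reached: mag always lies inside the grid range


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(2.4, rng) for _ in range(20000)]
    print("nearest:", round_nearest(2.4))
    print("stochastic mean:", sum(samples) / len(samples))
```

Nearest rounding always maps 2.4 to 2.0, a fixed error of 0.4, while the stochastic version averages out close to 2.4 at the cost of per-sample noise. That noise is the quantization error that the paragraph above says MS-EDEN reduces.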

Quartet II is validated on end-to-end training of a 1.9B-parameter model on 38B tokens, and its open-source Blackwell GPU kernels deliver up to 4.2x speedup over BF16.

Key takeaways:
  • FP4 pretraining has moved from "runs but loses accuracy" to "matches accuracy and runs up to 4.2x faster"
  • Blackwell GPU users now have a practical quantized training path with open-source kernels
  • Teams sensitive to pretraining cost should track this closely


04 Safety: Stronger Reasoning, Weaker Safety. Let the Model Fix Itself

Large reasoning models chase reasoning ability through RL, but the training over-optimizes for compliance — making them more willing to follow harmful requests. Existing fixes distill safety behavior from external teachers, introducing distribution shift that degrades native reasoning.

ThinkSafe's key insight: compliance suppresses safety mechanisms, but the model still retains latent knowledge of what is harmful. Lightweight refusal steering unlocks this, guiding the model to generate in-distribution safety responses as its own training data.

On DeepSeek-R1-Distill and Qwen3, ThinkSafe significantly outperforms baselines on safety while keeping reasoning intact — and costs far less than GRPO.

Key takeaways:
  • RL training creates real tension between reasoning capability and safety: pursuing one degrades the other
  • The model's safety knowledge is suppressed, not erased; refusal steering can unlock it
  • Self-generated alignment avoids distribution shift, striking a better balance than external distillation

Also Worth Noting

05 Agent: AI Draws Publication-Ready Figures, Validated on 292 NeurIPS Cases
PaperBanana uses multi-agent collaboration for academic illustrations, outperforming baselines on faithfulness, readability, and aesthetics; 137 upvotes on HF.

06 Reasoning: Compress CoT Into Images, Extract Visual Semantics
ReGuLaR uses a VAE framework to compress explicit reasoning chains into latent space via rendered images, matching or beating explicit CoT.

07 Architecture: Frequency Domain Reveals dLLM Generation Preferences
FourierSampler finds low-frequency components encode global structure while high-frequency components encode local detail; a frequency-domain sliding window yields 20% improvement on LLaDA-8B.

08 Efficiency: Unified Serving System for Speech LLMs
VoxServe decouples model architecture from system optimizations, achieving 10-20x throughput at comparable latency; code is open-source.

09 Evaluation: Predict 1000-Attack Success From 100 Samples
SABER models Best-of-N jailbreak risk with Beta distributions, cutting prediction error from 12 to 1.66 and revealing nonlinear risk amplification in seemingly safe models.

10 Reasoning: Test-Time Self-Evolution: Generate, Solve, Get Stronger
TTCS co-evolves a question synthesizer and a reasoning solver with self-consistency rewards, producing continuous gains on math reasoning.

11 Reasoning: Perplexity Can Pick the Wrong Model
Google proves theoretically that if a model confidently predicts any sequence, there must exist low-perplexity sequences it predicts incorrectly.

12 Reasoning: GRPO Is Too Soft. Binary Splits Work Better
ReNCE replaces advantage estimation with noise contrastive estimation, matching DAPO on math benchmarks without clipping or filtering tricks.

13 Efficiency: 20x Faster Recovery for LLM Serving
KevlarFlow uses decoupled initialization, dynamic traffic rerouting, and background KV cache replication to cut hardware failure recovery time by 20x.
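
A note on item 09: the Best-of-N framing has a simple closed form if per-attempt success probabilities are Beta-distributed. The sketch below is a textbook identity under stated assumptions (attempts are independent given the probability p, and p ~ Beta(a, b)); it is not necessarily the estimator SABER uses, and `best_of_n_risk` is a hypothetical name.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b), computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def best_of_n_risk(a: float, b: float, n: int) -> float:
    """Probability that at least one of n attempts succeeds when p ~ Beta(a, b).

    Uses E[(1 - p)^n] = B(a, b + n) / B(a, b), so the risk is one minus that.
    """
    return 1.0 - exp(log_beta(a, b + n) - log_beta(a, b))


def naive_risk(a: float, b: float, n: int) -> float:
    """Same quantity if every prompt shared the single mean success rate."""
    mean_p = a / (a + b)
    return 1.0 - (1.0 - mean_p) ** n


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, best_of_n_risk(0.1, 9.9, n), naive_risk(0.1, 9.9, n))
```

The two curves agree at n = 1 and then diverge, which shows why extrapolating from a single average success rate can misjudge risk at large N.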

Today's Observation

Three high-quality diffusion language model papers dropped today — RCD recycling discarded tokens, FourierSampler guiding generation in the frequency domain, and masked diffusion regularization tuning. dLLMs are moving fast from proof-of-concept to engineering viability, with practical efficiency bottlenecks being picked off one by one. Meanwhile, RLVR data expansion (Golden Goose) and FP4 pretraining (Quartet II) both point in the same direction: making it cheaper for more teams to train stronger models. If you work on reasoning model training, dLLM decoding optimization and RLVR data synthesis both belong on your tracking list.