Adaptive Decoding Hits 4.2x, Joint Training 10x Faster

Today's Overview

Fixed block size leaves speedup on the table. BlockPilot predicts the best block per input instead of using one constant, reaching 4.20x lossless speedup on Qwen3-4B at temperature T=1.
Treat "knowing what you don't know" as a training target, not a patch. Yale's RLMF uses metacognitive feedback as the RL signal so a model's stated "I'm unsure" matches its internal state, lifting faithful calibration by up to 63%.
The tokenizer and generator stop training separately. GEAR's dual-readout design trains a VQ tokenizer and AR generator end-to-end, converging up to 10x faster on ImageNet gFID.
Give "intermediate rewards" a checkup before you commit. QVal ranks dense-supervision signals by Q-value agreement without running downstream training, and across 1,200 experiments a simple prompting baseline beats most fancy methods.

Featured

01 Efficiency Fixed Block Size Wastes Speedup

Speculative decoding (a small model drafts, a large model verifies, and the result is lossless) has moved into the diffusion regime, where generating several tokens per block pushed parallelism to SOTA. These methods share one assumption: every input uses the same fixed block size and one shared decoding policy. BlockPilot argues that assumption is suboptimal. Different instances want different amounts of parallelism, so the optimal block size varies per sample. A fixed value splits some requests too aggressively and starves the parallelism of others.

The fix is light. After prefilling, BlockPilot uses the representation at that moment to predict the right block size for this input, then decodes normally without further intervention. The key observation is that these optimal values cluster near the training-time block size, so the decision space collapses to a low-dimensional problem and policy learning stays cheap.
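A minimal sketch of the idea above, not BlockPilot's actual implementation: a linear head over a pooled prefill representation scores a handful of candidate block sizes near the training-time value and picks the best one. The candidate list, the assumed training block size of 16, the name `predict_block_size`, and the toy weights are all hypothetical.

```python
import math

# Hypothetical candidates clustered around an assumed training-time block size of 16.
CANDIDATE_BLOCK_SIZES = [8, 12, 16, 20, 24]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_block_size(prefill_repr, weights, biases):
    """Choose a block size once, right after prefilling.

    prefill_repr: pooled hidden state after prefill (hypothetical feature vector).
    weights, biases: one weight row and one bias per candidate block size.
    Returns the chosen block size and the probability over all candidates.
    """
    logits = [
        sum(w * x for w, x in zip(row, prefill_repr)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CANDIDATE_BLOCK_SIZES[best], probs


# Toy usage: only the head for block size 16 responds to the first feature.
toy_weights = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
toy_biases = [0.0] * 5
block_size, probs = predict_block_size([2.0, 0.0], toy_weights, toy_biases)
print(block_size)  # prints 16
```

Because the candidate set is small, the policy is a few-way classifier evaluated once per request, which is one way to read the claim that policy learning stays cheap.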

On Qwen3-4B at temperature T=1, it hits an acceptance length of 5.92 and 4.20x speedup. The T=1 setting matters, because high-temperature sampling is usually the worst case for speculative decoding. Holding up here suggests the method is robust to sampling randomness.

Key takeaways:
- The fixed decoding hyperparameters in your pipeline may be leaving speedup on the table, and instance-adaptive prediction is a low-cost fix.
- This is a plug-and-play policy-layer change. It leaves the draft model and verification logic untouched, so integration cost is low.
- The cost of the prediction itself, and whether lossless speedup holds under real mixed workloads, needs production data to confirm.

Source: BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding

02 Safety Make Metacognition a Training Target

The standard defense against hallucination wraps the model in an extra layer: confidence thresholds, retrieval checks, second-pass review. Yale takes a different entry point. One root cause of LLM unreliability is a metacognitive deficit — confidently wrong answers, blurry knowledge boundaries, and a gap between internal uncertainty and what the model actually says.

Their RLMF (reinforcement learning with metacognitive feedback) does not recalibrate an output confidence number. It treats the quality of a model's self-judgment about its own performance as the RL signal, training the model so that a stated "I'm unsure" corresponds to its internal state. The paper reports up to a 63% gain on faithful calibration over standard RL without sacrificing accuracy. That 63% is a relative gain, so the baseline and task distribution matter before you weigh it.
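To make "self-judgment quality as the RL signal" concrete, here is a minimal sketch of one possible reward shape. It is an assumption for illustration, not the reward defined in the RLMF paper: the task outcome is combined with a Brier-style score of how well the model's stated confidence matched that outcome. The name `metacognitive_reward` and the `weight` parameter are hypothetical.

```python
def metacognitive_reward(is_correct, stated_confidence, weight=0.5):
    """Task reward plus a bonus for accurate self-judgment (hypothetical form).

    is_correct: whether the model's answer was right.
    stated_confidence: the model's own stated probability of being right, in [0, 1].
    weight: how much the self-judgment term counts relative to correctness.
    """
    if not 0.0 <= stated_confidence <= 1.0:
        raise ValueError("stated_confidence must be in [0, 1]")
    outcome = 1.0 if is_correct else 0.0
    # Brier-style term: 1.0 when the stated confidence matches the outcome exactly.
    self_judgment = 1.0 - (stated_confidence - outcome) ** 2
    return outcome + weight * self_judgment


# A confidently wrong answer earns less than an unsure wrong answer.
print(metacognitive_reward(False, 0.9) < metacognitive_reward(False, 0.2))  # prints True
```

Under this shape, a wrong answer delivered with low stated confidence still earns part of the reward, so the policy is pushed to say "I'm unsure" when it is likely to be wrong rather than only to be right more often.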

For teams that care about trustworthiness, the shift in framing is the point. Uncertainty expression moves from a bolt-on filter to a capability you can optimize end-to-end.

Key takeaways:
- A new path for hallucination control: make metacognition a trainable target rather than adding confidence filtering at the output.
- RLMF uses the quality of the model's self-judgment as the RL signal, getting past the ceiling of earlier internal-feedback methods.
- The 63% is relative to standard RL. Check the full paper's baseline and task range before judging real-world value.

Source: Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

03 Image Gen The Tokenizer and Generator Stop Training Apart

Mainstream visual generation runs in two stages: train a tokenizer for reconstruction, freeze it, then train an AR generator on its discrete codes. The problem is that the tokenizer has no idea which codes the generator finds easy to predict, so the two are misaligned by construction. GEAR trains the VQ tokenizer and AR generator end-to-end, guided by representation alignment.

The hard part is that VQ's discrete indices are non-differentiable, and straight-through gradients collapse. GEAR uses a dual-readout design. A hard one-hot branch trains the AR with next-token prediction as usual. A differentiable soft branch passes only the alignment loss back to the tokenizer, letting the AR pull the tokenizer toward a code distribution it predicts better. The alignment burden shifts from tokenizer to AR: the tokenizer's own features become less DINOv2-like while the AR's features grow more semantic. That is the opposite of the diffusion camp's recipe of making the latent itself semantic.
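The two readouts can be sketched as a forward pass for a single encoder vector. This is an illustration under assumptions, not GEAR's code: the soft branch is written here as a softmax over negative squared distances to the codebook, and the names `dual_readout` and `temperature` are hypothetical. Gradient routing is not shown. In an autograd framework, only the soft output would carry the alignment loss back to the tokenizer, while the hard index would feed the AR's next-token loss.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dual_readout(encoder_vec, codebook, temperature=1.0):
    """Return (hard_index, soft_embedding) for one encoder vector.

    hard_index: nearest codebook entry, the discrete token the AR is trained on.
    soft_embedding: softmax-weighted mix of codebook entries (assumed form),
    which stays differentiable with respect to the encoder output.
    """
    dists = [squared_distance(encoder_vec, code) for code in codebook]
    hard_index = min(range(len(dists)), key=dists.__getitem__)

    logits = [-d / temperature for d in dists]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(codebook[0])
    soft_embedding = [
        sum(w * code[i] for w, code in zip(weights, codebook)) for i in range(dim)
    ]
    return hard_index, soft_embedding


toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
index, soft = dual_readout([0.9, 0.9], toy_codebook)
print(index)  # prints 1
```

As `temperature` shrinks, the soft embedding approaches the hard branch's code, so the two readouts describe the same token at different levels of smoothness.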

Against a LlamaGen-REPA baseline, ImageNet gFID converges up to 10x faster, and the method generalizes to quantizers like LFQ and IBQ and to text-to-image.

Key takeaways:
- On the "tokenizer-generator mismatch" problem, the pixel-AR camp drops the tokenizer while GEAR makes the two learn jointly. If you build image generation, read the two paths side by side.
- Dual-readout is the key engineering trick for the non-differentiable VQ. The soft branch only feeds the tokenizer and never contaminates the AR's next-token training.
- 10x is convergence speed, not final quality. Final FID needs the full paper, but generalizing across quantizers and to text-to-image is a plus.

Source: GEAR: Guided End-to-End AutoRegression for Image Synthesis

04 Evaluation Check Your Intermediate Rewards Before Training

A long-horizon agent trajectory runs hundreds to thousands of actions, and rewarding only the final outcome is too sparse. So dense supervision — scoring intermediate steps — keeps appearing, from model confidence to self-distillation to embedding similarity. Validating any of these usually means wiring it into a full training pipeline and measuring downstream results. That is expensive, and it tangles the signal's own quality with training engineering, so different approaches can't be compared fairly.

QVal skips training entirely. Given a state-action pair, it checks how well a scoring method agrees with a strong reference policy's Q-value ranking, so you can judge whether a dense signal is trustworthy before any run starts. The authors ran a head-to-head over 21 methods, 7 families, and 6 open models. After more than 1,200 experiments, the conclusion is unflattering: a simple prompting baseline reliably beats the fancy dense-supervision methods from recent literature, and scores cluster strongly by family.
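The kind of check described above can be sketched as a pairwise ranking agreement. This is a generic concordance score chosen for illustration. The exact metric QVal uses is not given here, and the function name and toy numbers are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(signal_scores, reference_q):
    """Fraction of action pairs ordered the same way by the signal and by the reference.

    signal_scores: dense-supervision scores for candidate actions in one state.
    reference_q: Q-values for the same actions from a strong reference policy.
    Pairs that the reference scores as exact ties are skipped.
    """
    if len(signal_scores) != len(reference_q):
        raise ValueError("score lists must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(reference_q)), 2):
        q_diff = reference_q[i] - reference_q[j]
        if q_diff == 0:
            continue
        total += 1
        if (signal_scores[i] - signal_scores[j]) * q_diff > 0:
            agree += 1
    return agree / total if total else float("nan")


# Toy example: the signal ranks three actions in the same order as the reference.
print(pairwise_agreement([0.1, 0.5, 0.9], [1.0, 2.0, 3.0]))  # prints 1.0
```

Averaging such a score over many states gives a number per scoring method that can be compared across methods without any training run.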

If you design intermediate rewards for agents, this is a low-cost selection tool. It measures whether a signal aligns with Q-value ranking. Whether that alignment translates into real training value still needs the full paper's validation.

Key takeaways:
- Evaluating dense supervision doesn't require a full downstream run. Q-value agreement can screen out unreliable signals early.
- A simple prompting baseline beats most recent methods. Don't assume the complex method is stronger before you select.
- This is a training-free open testbed. Anyone designing agent rewards can plug in their own environment and try it.

Source: QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents


Also Worth Noting

Efficiency Same Target as BlockPilot: Against Fixed Policies
Stanford's LearnStop learns a hidden-state-agnostic checkpoint stopper and asks when a learned stopping rule actually beats a simple confidence or convergence threshold. Read it alongside the lead.

Agent GUI Agent Training and Eval Never Left Offline Trajectories
Standard benchmarks are far from real apps' interfaces, interaction logic, and error-state distributions, so Xiaomi's GUI-0 technical report moves the whole pipeline onto real applications.

Video Gen Video World Models Lack Memory, So Long Scenes Drift
Rule-based frame selection fails under occlusion and moving objects, so MemLearner switches to a learnable adaptive context-memory query.

Agent Text-Image Data Pipelines Crawl, Filter, Freeze, and Toss Rejects
DataEvolver uses a self-evolving multi-agent setup to recycle the signal buried in failed samples.

Image Gen Training-Free Photo Mosaics at Any Resolution
PhotoQuilt uses bootstrapped tiled denoising to satisfy both scales at once: each tile looks right alone, and the whole forms a scene.

Multimodal A Missing Domain Backbone for Sheet-Music Understanding
MuSViT is the first sheet-music vision foundation model, pretraining a ViT with MAE on 9.7 million IMSLP pages.

AI for Science Satellite-Image Synthesis Borrows the Natural-Image Recipe
Dense rasters and sparse prompts both break the vector primitives geography relies on, so TerraDiT-Ω builds unified spatial control.

Robotics End-to-End Self-Driving on Instant Sensor Data Only Reacts
PriorEye (ECCV) adds a geo-visual prior anchored to the streetscape, restoring the foresight humans get from experience.

Today's Observation

Two unrelated systems papers today are unscrewing the same bolt: the hardcoded fixed policy. BlockPilot drops a constant block size in speculative decoding and predicts per instance how much to parallelize. In the Also Worth Noting list, Stanford's LearnStop drops a fixed confidence or convergence threshold in inference early-stopping and learns a per-instance stopper. One governs how many tokens to generate at once, the other governs when to stop thinking. Two subfields that barely overlap land on the same conclusion: static heuristics waste compute per instance, staying too conservative where they should push and still spinning where they should quit.

The takeaway is concrete. Go through the one-size-fits-all hyperparameters in your serving and inference pipeline — block size, early-stop threshold, retrieval count, speculative draft length. Pick one or two high-traffic paths and measure whether the optimal value actually varies by instance. If the variance is clear, those hardcoded numbers are probably the cheapest speedup within reach, and you change the policy layer without touching the model.
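A minimal sketch of that measurement, assuming you have already benchmarked each request under several values of one hyperparameter. The data layout, the function name `fixed_vs_oracle`, and the toy numbers are hypothetical. It compares the best single fixed setting against a per-instance oracle, which is an upper bound on what any learned policy could recover.

```python
from collections import Counter


def fixed_vs_oracle(throughput):
    """Compare the best fixed setting with a per-instance oracle.

    throughput: {instance_id: {setting: measured tokens/sec}}, with the same
    settings measured for every instance (hypothetical benchmark results).
    """
    rows = list(throughput.values())
    settings = sorted(rows[0].keys())

    # Best fixed setting: the one with the highest mean throughput overall.
    mean_by_setting = {s: sum(r[s] for r in rows) / len(rows) for s in settings}
    best_fixed = max(settings, key=mean_by_setting.get)

    # Oracle: every instance gets its own best setting.
    oracle_mean = sum(max(r.values()) for r in rows) / len(rows)

    return {
        "best_fixed": best_fixed,
        "fixed_mean": mean_by_setting[best_fixed],
        "oracle_mean": oracle_mean,
        "headroom": oracle_mean / mean_by_setting[best_fixed],
        "best_setting_counts": Counter(max(r, key=r.get) for r in rows),
    }


# Toy measurements for two requests at block sizes 8 and 16.
toy = {"req-a": {8: 100.0, 16: 120.0}, "req-b": {8: 150.0, 16: 110.0}}
report = fixed_vs_oracle(toy)
print(report["best_fixed"], report["headroom"])
```

If `headroom` stays close to 1.0 and `best_setting_counts` is dominated by a single value, the fixed setting is already near optimal on that path. A larger headroom with a spread of winning settings is the signal that a per-instance policy is worth trying.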