Drop 90% of Vision Tokens, Keep the Performance

Today's Overview

Spatial relationships in image generation can now be optimized, not just hoped for. SpatialScore trains a reward model that outperforms GPT-4V on spatial evaluation, then uses it to RL-fine-tune generators. CVPR accepted, dataset open-sourced.
Masked image generation gets a 4x speedup with no quality loss. Learning feature dynamics replaces static caching, recovering semantic information that discrete sampling throws away.
VLM quantization can't be one-size-fits-all. Vision and language tokens have different distributions. MoE-style dynamic error compensation routes each token type through separate repair paths. Works from 2B to 70B, CVPR accepted.
90% of vision tokens are compressible. HiDrop finds that early layers handle feature alignment and shouldn't be pruned. A layer-aware strategy that matches each layer's actual role is the key. ICLR accepted.

Featured

01 [Image Gen] Turn "Cat Left, Dog Right" Into an Optimization Target

Text-to-image models understand semantics fine, but spatial relationships remain a coin flip. "A to the left of B" often takes multiple samples to get right. The root cause: no explicit feedback signal for spatial correctness during generation.

SpatialScore fixes this by modeling spatial accuracy as a reward signal. The authors train a dedicated reward model on 80K+ preference pairs, then use it for online RL to directly optimize the generator's spatial understanding. The reward model beats GPT-4V and other closed-source models on spatial evaluation, which suggests that a focused small model with good data can still win on narrow tasks.
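The digest does not say which loss SpatialScore uses to train the reward model. As a hedged illustration only, a common choice for learning from preference pairs is the Bradley-Terry objective, sketched below in plain Python. The function names and the example scores are hypothetical.

```python
import math

def preference_loss(score_preferred, score_rejected):
    """Bradley-Terry negative log-likelihood for one preference pair.

    Assumption: the reward model emits one scalar score per image, and the
    image with the correct spatial layout should score higher.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)), split by sign so exp() never overflows.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def batch_loss(pairs):
    """Mean loss over (preferred_score, rejected_score) tuples."""
    return sum(preference_loss(p, r) for p, r in pairs) / len(pairs)

# Hypothetical scores: the second pair is ranked the wrong way round,
# so it should be penalized more heavily than the first.
well_ranked = preference_loss(2.0, -1.0)
mis_ranked = preference_loss(-1.0, 2.0)
```

Minimizing this loss pushes the preferred image's score above the rejected one's. A scorer trained this way can then serve as the reward in the RL stage.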

CVPR accepted, with dataset and code open-sourced, so the engineering barrier to adoption is low. The bigger story is transferability: spatial reasoning is just one of many text-to-image weak spots. Text rendering, counting, and attribute binding could all follow the same "build preference data → train reward model → RL fine-tune" playbook.

Key takeaways:
- Turning spatial correctness from a sampling lottery into an optimizable reward signal is the real shift
- A dedicated reward model outperforms GPT-4V on spatial evaluation; small model + good data still works
- The same reward-modeling pattern could generalize to text rendering, counting, and attribute binding

Source: Enhancing Spatial Understanding in Image Generation via Reward Modeling

02 [Efficiency] Discrete Sampling Throws Away Semantic Information. You Can Learn It Back.

Masked image generation models (MIGM) run full bidirectional attention at every step. The rich semantics in continuous features get discarded when sampling discrete tokens. Previous speedup methods cache old features to approximate future ones, but error compounds fast at higher acceleration ratios.

MIGM-Shortcut takes a different angle: train a lightweight model that consumes both previous features and already-sampled tokens to regress the average velocity field of feature evolution. Dynamics modeling replaces static caching. On the current best MIGM (Lumina-DiMOO), it achieves 4x+ speedup with no quality drop, pushing the efficiency-quality frontier forward.
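To make the contrast with static caching concrete, here is a toy sketch. It is not the paper's model: the learned velocity predictor is replaced by a finite-difference stand-in, the feature trajectory is a made-up linear one (the easiest possible case for extrapolation), and every name is hypothetical.

```python
def static_cache(feat_t, k):
    """Baseline: reuse the stale feature unchanged k steps later."""
    return list(feat_t)

def shortcut(feat_t, avg_velocity, k):
    """Jump k steps ahead: f_{t+k} ~= f_t + k * v_avg."""
    return [f + k * v for f, v in zip(feat_t, avg_velocity)]

def finite_difference_velocity(feat_prev, feat_t):
    """Stand-in for the learned predictor: per-dimension change over one step."""
    return [b - a for a, b in zip(feat_prev, feat_t)]

def l2_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def true_feature(step):
    """Hypothetical two-dimensional feature trajectory."""
    return [0.5 * step, 1.0 - 0.1 * step]

k = 4  # number of full forward passes skipped
f_prev, f_t, f_target = true_feature(2), true_feature(3), true_feature(3 + k)
velocity = finite_difference_velocity(f_prev, f_t)

err_cache = l2_error(static_cache(f_t, k), f_target)
err_shortcut = l2_error(shortcut(f_t, velocity, k), f_target)
```

On this toy trajectory the cached feature drifts further from the target as k grows, while the velocity-based jump tracks it. Real feature evolution is not linear, which is why the paper trains a model that also conditions on the sampled tokens.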

On the same day, SenCache attacked a similar problem through sensitivity analysis: learn which computations matter rather than how they evolve. The two independent approaches both answer "what can we skip?" with completely different methodologies.

Key takeaways:
- Feature caching fails because it ignores sampling information; dynamics modeling is a more expressive alternative
- 4x speedup with no quality loss has direct implications for MIGM deployment
- Two parallel efforts on skipping redundant inference computation signal that this problem is getting concentrated attention

Source: Accelerating Masked Image Generation by Learning Latent Controlled Dynamics

03 [Efficiency] Multimodal Quantization Needs Separate Repair Paths for Vision and Language

Running VLMs on-device almost always means quantization. An underappreciated problem: image tokens and text tokens have very different numerical distributions. Standard PTQ applies uniform error compensation across all channels. That's leaving performance on the table.

Quant Experts borrows from MoE architecture. It splits important channels into "globally shared" and "token-dependent" groups. The first group gets a shared low-rank adapter for repair. The second routes dynamically to specialized expert modules based on the specific token. Vision tokens and language tokens each get their own repair path instead of compromising on a shared one.
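The routing idea can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: it assumes the repair is a low-rank correction added to the quantized layer's output, it uses a nearest-centroid router, and it omits the split of important channels into two groups. All names and matrix values are hypothetical.

```python
def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def low_rank(down, up, x):
    """Low-rank correction: up @ (down @ x)."""
    return matvec(up, matvec(down, x))

def route(x, centroids):
    """Hypothetical router: index of the centroid closest to the token."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    return dists.index(min(dists))

def compensated_forward(x, w_q, shared, experts, centroids):
    """Quantized output + shared repair + token-dependent expert repair."""
    base = matvec(w_q, x)
    shared_fix = low_rank(shared[0], shared[1], x)
    expert_down, expert_up = experts[route(x, centroids)]
    expert_fix = low_rank(expert_down, expert_up, x)
    return [b + s + e for b, s, e in zip(base, shared_fix, expert_fix)]

# Toy 2-dimensional layer with rank-1 adapters, written as (down, up) pairs.
w_q = [[1.0, 0.0], [0.0, 1.0]]
shared = ([[1.0, 1.0]], [[0.1], [0.1]])
experts = [
    ([[1.0, 0.0]], [[0.5], [0.0]]),  # expert 0, e.g. vision-like tokens
    ([[0.0, 1.0]], [[0.0], [0.5]]),  # expert 1, e.g. text-like tokens
]
centroids = [[1.0, 0.0], [0.0, 1.0]]

vision_out = compensated_forward([1.0, 0.0], w_q, shared, experts, centroids)
text_out = compensated_forward([0.0, 1.0], w_q, shared, experts, centroids)
```

Each token receives the same shared correction but a different expert correction, so the two token types no longer have to compromise on a single repair.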

Tested on VLMs from 2B to 70B parameters, the quantized models approach full-precision accuracy. CVPR accepted.

Key takeaways:
- The bottleneck in VLM quantization is cross-modal distribution mismatch, not just the precision-compression tradeoff
- MoE-style dynamic compensation beats static global strategies for heterogeneous inputs
- Validated across 2B to 70B scale, directly applicable to on-device multimodal deployment

Source: Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization

04 [Multimodal] Drop 90% of Vision Tokens and MLLMs Don't Get Worse?

Previous token pruning methods share a common mistake: treating early layers as redundant and pruning there. Early layers actually perform vision-language feature alignment. Pruning them breaks the fusion process.

HiDrop corrects this. Early layers stay untouched. Vision tokens get injected only when actual multimodal fusion begins (Late Injection). Middle layers apply a concave pyramid pruning curve. Deep layers allow early exit. This layered strategy removes roughly 90% of vision tokens with near-flat performance and a 1.72x training speedup.
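One way to picture the three stages is as a per-layer budget of vision tokens. The sketch below is an assumption-heavy reading of the description above, not HiDrop's actual schedule: the layer indices, the final keep ratio, and the power-law curve are all hypothetical, and the digest does not give the exact shape of the "concave pyramid".

```python
def vision_token_schedule(num_layers, num_tokens, inject_layer, exit_layer,
                          final_keep, power=2.0):
    """Vision tokens processed at each layer under a three-stage layout.

    Assumed stages: no vision tokens before `inject_layer` (late injection),
    a shrinking budget in the middle layers, and none from `exit_layer` on
    (early exit). With power > 1 the budget drops fast and then flattens;
    with power < 1 it drops slowly at first.
    """
    span = max(exit_layer - inject_layer - 1, 1)
    schedule = []
    for layer in range(num_layers):
        if layer < inject_layer or layer >= exit_layer:
            schedule.append(0)
            continue
        progress = (layer - inject_layer) / span  # 0.0 at injection, 1.0 at the last kept layer
        keep_ratio = final_keep + (1.0 - final_keep) * (1.0 - progress) ** power
        schedule.append(max(1, round(num_tokens * keep_ratio)))
    return schedule

def token_reduction(schedule, num_tokens):
    """Fraction of per-layer vision-token slots removed versus no pruning."""
    return 1.0 - sum(schedule) / (len(schedule) * num_tokens)

# Hypothetical 32-layer model with 576 vision tokens per image.
schedule = vision_token_schedule(32, 576, inject_layer=4, exit_layer=24,
                                 final_keep=0.05)
reduction = token_reduction(schedule, 576)
```

Tuning `inject_layer`, `exit_layer`, `final_keep`, and `power` trades accuracy against the total reduction; the values above are placeholders rather than the paper's settings.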

The engineering details matter too: persistent positional encoding and FlashAttention-compatible token selection avoid the hidden overhead that dynamic pruning usually introduces. ICLR accepted.

Key takeaways:
- Early layers do feature alignment and should not be pruned; prior methods got this wrong
- 90% of vision tokens are compressible when pruning strategy matches each layer's actual function
- Directly useful for deploying MLLMs under resource constraints

Source: HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit

Also Worth Noting

Automated Caching for Diffusion Model Inference [Efficiency]: sensitivity analysis decides which steps can be reused, turning cache strategy from hand-tuning to data-driven.

Real-Time Multimodal Interaction With Simultaneous Speech and Vision Output [Multimodal]: targets embodied agent scenarios where current systems handle only single-modality output.

NVIDIA Bridges Neural Reconstruction and Photorealistic Simulation [Robotics]: online diffusion enhancement brings neural-reconstruction-based simulators closer to real sensor quality for autonomous driving.

Zero-Shot Scene-Aware Object Replacement Without Per-Object Fine-Tuning [Image Gen]: initial noise perturbation handles object swaps while preserving scene coherence.

Static Benchmarks Can't Keep Up With Model Evolution [Evaluation]: agent-driven dynamic evaluation protocol evolves test items alongside model capabilities.

Reflective RL for Emotional Reasoning in MLLMs [Multimodal]: addresses poor generalization of SFT on affective understanding tasks through chain-of-thought reflection.

Unified Token Pruning for Visual Tracking [Efficiency]: prunes both template and search region tokens simultaneously for real-time deployment.

Interpretable Debiasing for VLMs via Reasoning Chain Intervention [Safety]: moves bias correction from black-box post-processing to transparent, auditable reasoning-level fixes.

Dynamic Retrieval and Topological Constraints for Dataset Distillation [Training]: breaks the diversity bottleneck of static anchor points, improving synthetic dataset representativeness.

First Benchmark for Small-Object Editing in Instruction-Based Models [Evaluation]: fills a blind spot in existing evaluations that miss fine-grained editing capability.

Today's Observation

"Redundancy" gets thrown around too loosely in efficiency research. Fixed-interval caching assumes steps can be skipped on a fixed schedule regardless of what they compute. Uniform pruning assumes all vision tokens are equally disposable. Several papers today independently challenge these assumptions.

SenCache finds that sensitivity varies enormously across diffusion steps; blindly caching certain critical steps causes quality cliffs. MIGM-Shortcut discovers that feature evolution is predictable, making dynamics modeling strictly better than static snapshots. HiDrop shows that early layers handle feature alignment and should never be pruned, while mid-to-deep redundancy varies nonlinearly with depth.

Three independent results point the same direction: computational redundancy in models is dynamic. It shifts with input content, timestep, and layer depth. Static strategies approximate every specific case with the average case. When models get complex enough, that approximation costs real quality.

If your inference pipeline still uses fixed-interval caching or uniform token dropping, run a profiling pass on your actual production data. Measure sensitivity distributions across steps and layers. You'll likely find that a small share of the computation drives most of the quality. Reallocating saved compute to those critical positions tends to beat uniform acceleration.
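As a rough sketch of what such a profiling pass could feed into, the snippet below compares a fixed-interval cache plan with one that spends the same recompute budget on the most sensitive steps. The sensitivity values are made up, and the helper names are hypothetical; in practice the scores would come from measuring how much the output changes when a step is served from cache.

```python
def fixed_interval_plan(num_steps, interval):
    """Recompute every `interval`-th step, reuse the cache otherwise."""
    return [step % interval == 0 for step in range(num_steps)]

def sensitivity_plan(sensitivity, budget):
    """Recompute the `budget` most sensitive steps.

    Step 0 is always recomputed because there is nothing cached before it.
    """
    ranked = sorted(range(1, len(sensitivity)),
                    key=lambda step: sensitivity[step], reverse=True)
    chosen = {0, *ranked[:max(budget - 1, 0)]}
    return [step in chosen for step in range(len(sensitivity))]

def skipped_sensitivity(sensitivity, plan):
    """Total sensitivity of the steps served from cache (lower is better)."""
    return sum(s for s, recompute in zip(sensitivity, plan) if not recompute)

# Hypothetical profiled sensitivities for an 8-step sampler.
sensitivity = [0.9, 0.8, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1]

fixed = fixed_interval_plan(len(sensitivity), interval=2)
adaptive = sensitivity_plan(sensitivity, budget=sum(fixed))

fixed_cost = skipped_sensitivity(sensitivity, fixed)
adaptive_cost = skipped_sensitivity(sensitivity, adaptive)
```

Both plans recompute the same number of steps, so any gap between `fixed_cost` and `adaptive_cost` comes purely from where the compute is spent.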