Vision Models Start Redesigning How They Output

Today's Overview

  • Why Is VLM Box Drawing So Slow? LocateAnything traced it to spitting out coordinates one at a time, then made the model emit a whole box in parallel — faster and more accurate, hitting 91 on HF that day.
  • One Embedding Model For Video, Audio, Image, And Text: Google's Gemini Embedding 2 packs them all into a single space and tops several retrieval and cross-modal leaderboards at once.
  • Spatial Foundation Models Claim To Do Everything, But SpatialBench Tested 41 Of Them And Found No All-Rounder — and data quality matters more than scale.
  • What A Diffusion Model Predicts Isn't An Arbitrary Choice: JLT uses a 130M model to show that predicting the clean image beats predicting velocity, geometrically, in latent space.

Featured

01 Multimodal The Slowest Part Of Drawing A Box Is Writing The Coordinates

Vision language models do object detection by breaking a box into four numbers (top-left x, top-left y, width, height) and emitting them token by token, like writing prose. But a box is one thing. Its four coordinates form a tightly correlated geometric structure, and forcing the model to split them apart and generate them in strict order is slow and tends to lose the box's internal consistency.

LocateAnything changes the decode. It treats a box (or a point) as a single atom and solves it in one parallel step, instead of writing coordinates token by token. A data engine that generates 138 million training samples does the other half of the work.
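To make the step-count difference concrete, here is a toy sketch. It is not LocateAnything's architecture: the hidden size, the random weights, and both decode functions are hypothetical stand-ins, and a real VLM may spend several tokens per coordinate. It only contrasts "one model step per coordinate" with "one step per box".

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 16          # hypothetical hidden size
COORDS_PER_BOX = 4   # x, y, width, height


def sequential_decode(hidden, step_weights):
    """Emit one coordinate per decode step; each step sees the earlier outputs."""
    state = hidden.copy()
    coords = []
    for w in step_weights:              # one model step per coordinate
        value = float(state @ w)
        coords.append(value)
        state = state + 0.01 * value    # toy feedback of the emitted value
    return np.array(coords), len(coords)


def parallel_decode(hidden, box_head):
    """Regress the whole box from one hidden state in a single step."""
    return hidden @ box_head, 1


# Random stand-ins for a model's hidden state and output weights (assumptions).
hidden = rng.normal(size=HIDDEN)
step_weights = rng.normal(size=(COORDS_PER_BOX, HIDDEN))
box_head = rng.normal(size=(HIDDEN, COORDS_PER_BOX))

seq_box, seq_steps = sequential_decode(hidden, step_weights)
par_box, par_steps = parallel_decode(hidden, box_head)
print(seq_steps, par_steps)  # prints: 4 1
```

With many boxes per image, the sequential step count grows with every coordinate, while the parallel variant grows only with the number of boxes.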

The result moves speed and high-precision localization forward together: decode throughput is clearly higher, and localization quality at high IoU thresholds is better. For products doing detection, grounding, or anything that needs a VLM to point at things precisely, this improves latency and accuracy at the same time. From the NVIDIA team (Jan Kautz, Andrew Tao, and others).
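For readers who don't work with detection metrics daily, IoU (intersection over union) is the overlap area of two boxes divided by the area they cover together. A minimal sketch for boxes in the (x, y, width, height) format described above; the example boxes and the threshold list are made up for illustration.

```python
def iou_xywh(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Overlap along each axis, clamped at zero when the boxes do not touch.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


ground_truth = (10, 10, 100, 100)
prediction = (15, 10, 100, 100)   # shifted 5 px to the right

score = iou_xywh(ground_truth, prediction)
passed = {t: score >= t for t in (0.5, 0.75, 0.9, 0.95)}
print(round(score, 4), passed)
```

A 5-pixel shift on a 100-pixel box still clears a 0.9 threshold but fails 0.95, which is why small coordinate inconsistencies show up mainly in the high-IoU regime.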

Key takeaways:
  • Token-by-token coordinate generation is an overlooked speed bottleneck in VLM detection.
  • Treating a box as an atomic unit solved in one parallel step buys both throughput and accuracy.
  • The 138M-sample data engine is the other half of why the method works.


02 Retrieval One Model Instead Of A Separate Retriever Per Modality

Anyone building RAG, recommendation, or search knows the pain: text has its embedding model, images have theirs, video and audio another set, and cross-modal retrieval means stitching them together. Google's answer is to handle all of it at once. Gemini Embedding 2 puts video, audio, image, and text into one representation space and accepts arbitrarily interleaved mixed inputs.
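What a shared space buys in practice: once every item is a vector in the same space, retrieval is one similarity search regardless of modality. The sketch below uses hand-written 3-dimensional vectors as stand-ins for real embeddings; it does not call Gemini Embedding 2 or any real API, and the item labels are hypothetical.

```python
import numpy as np


def normalize(v):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def top_k(query, index, k=2):
    """Return (row, score) pairs for the k most similar index rows."""
    scores = normalize(index) @ normalize(query)
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Hypothetical mixed-modality corpus; the vectors are made up for illustration.
items = ["image: cat photo", "audio: dog bark", "video: cat playing", "text: tax form"]
index = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
])
query = np.array([1.0, 0.0, 0.0])   # stand-in for an embedded text query about cats

results = top_k(query, index)
print([items[i] for i, _ in results])
```

The same `top_k` serves text-to-image, text-to-video, or audio-to-text lookups; with per-modality models, each of those pairs would need its own index and alignment step.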

It gets there through large-scale contrastive learning plus multi-task, multi-stage training. The payoff is SOTA across several key leaderboards at once — image-text retrieval, video retrieval, multilingual text, and code retrieval — beating models trained specifically for each. Zero-shot behavior is the part that saves the most effort: it works out of the box on niche domains from astronomy and biology to food and art.
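The write-up above doesn't say which contrastive objective Google used. As a point of reference, a common choice for aligning two views of the same item is the symmetric InfoNCE loss, sketched here in NumPy. The batch size, embedding width, and temperature are arbitrary assumptions.

```python
import numpy as np


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` should match row i of `b` and no other row."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)               # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))


rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 32))                        # stand-in text embeddings
image_emb = text_emb + 0.01 * rng.normal(size=(8, 32))     # well-aligned partners

aligned_loss = info_nce(text_emb, image_emb)
shuffled_loss = info_nce(text_emb, np.roll(image_emb, 1, axis=0))  # wrong pairings
print(aligned_loss < shuffled_loss)
```

The loss is low when each item's partner is its nearest neighbor in the batch and high when pairings are scrambled, which is the pressure that pulls different modalities into one space.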

For teams building retrieval or recommendation, one model might unify the whole multimodal mess.

Key takeaways:
  • A unified embedding space for video, audio, image, and text drops the cost of maintaining several models.
  • SOTA on multiple retrieval leaderboards at once, with support for interleaved mixed inputs.
  • Zero-shot usable on niche verticals, ready to drop into RAG, search, or recommendation.


03 Evaluation Spatial Models Claim To Be All-Rounders. Line Them Up And Test.

"Spatial foundation models" have been hot for two years — reconstruction, depth, pose, and 3D understanding, all from one model. But everyone reports numbers on the domain they designed or trained for. Change the viewpoint, the scene, or the input density, and nobody tests it. The "general" claim was never really checked.

SpatialBench fills the gap: 19 datasets, 546 scenes, 5 spatial domains, with deterministic sampling that pulls 41 models across 6 paradigms into one comparison. The verdict is sober: no model today is a true all-rounder. A few useful findings: full-context attention is the most accurate, but handling long sequences still needs bounded-memory strategies. On hard tasks such as embodied and first-person scenes, strict domain alignment and high data quality beat simply making the dataset bigger.
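Deterministic sampling matters because a comparison across 41 models is only fair if every model sees the same inputs. The summary above doesn't describe SpatialBench's actual procedure; one simple way to get that property is to rank items by a salted hash, as in this sketch. The function name, the salt, and the scene IDs are all hypothetical.

```python
import hashlib


def deterministic_sample(scene_ids, k, salt="bench-v1"):
    """Pick k scenes by hash rank: the same IDs and salt always give the same subset."""
    def rank(scene_id):
        return hashlib.sha256(f"{salt}:{scene_id}".encode()).hexdigest()
    return sorted(scene_ids, key=rank)[:k]


scenes = [f"scene_{i:03d}" for i in range(20)]   # hypothetical scene IDs

subset_a = deterministic_sample(scenes, k=5)
subset_b = deterministic_sample(list(reversed(scenes)), k=5)   # different input order
print(subset_a == subset_b)  # prints: True
```

Unlike a seeded random shuffle, the hash ranking does not depend on input order or on the random number generator of a particular library version.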

The team also released a large dataset, DA-Next-5M, and a strong baseline, DA-Next. For anyone selecting a spatial model or working on embodied AI and 3D, this is a survey worth consulting.

Key takeaways:
  • No spatial foundation model is an all-rounder yet; cross-viewpoint and cross-scene generalization is the real weak spot.
  • On embodied tasks, domain alignment plus data quality matters more than scaling up the dataset.
  • DA-Next-5M and the DA-Next baseline are ready to use.


04 Image Gen What A Diffusion Model Predicts Isn't A Free Choice

Training diffusion or flow models comes with a choice that looks like it doesn't matter: predict the clean image, or predict the noise or velocity? Mathematically these quantities are linearly interchangeable at a fixed timestep, so many people treat the parameterization as equivalent and pick arbitrarily.
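The "linearly interchangeable" part is easy to verify. The sketch assumes one common flow-matching convention, x_t = (1 - t) * x0 + t * eps with velocity v = eps - x0; other papers use different schedules, and the summary above doesn't state which one JLT uses. Under this convention, knowing x_t and any one of the three quantities gives the other two.

```python
import numpy as np

rng = np.random.default_rng(0)
t = 0.3                              # a fixed timestep strictly between 0 and 1

x0 = rng.normal(size=(4, 8))         # stand-in clean latents
eps = rng.normal(size=(4, 8))        # Gaussian noise

# Assumed convention: linear interpolation between data and noise.
x_t = (1 - t) * x0 + t * eps
v = eps - x0                         # velocity target under this convention

# Recover the other quantities from (x_t, v) with plain algebra.
x0_from_v = x_t - t * v
eps_from_v = x_t + (1 - t) * v

print(np.allclose(x0_from_v, x0), np.allclose(eps_from_v, eps))  # prints: True True
```

The algebra is exact, so any difference between parameterizations has to come from how prediction errors and target statistics behave during training, not from what information the targets carry.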

JLT takes the question seriously. In the VAE-compressed latent space, does the choice still matter? Using a 130M model on the same backbone and the same settings, they find a real gap. Predicting velocity inherits an isotropic variance floor and amplifies the low-variance latent directions; predicting the clean image suppresses that noise instead.
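The variance-floor intuition can be shown with two latent directions of very different scale. This is an illustration, not JLT's derivation. It assumes the velocity target is v = eps - x0 with unit-variance Gaussian noise, and the standard deviations 1.0 and 0.1 are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical anisotropic latent: one high-variance and one low-variance direction.
data_std = np.array([1.0, 0.1])
x0 = rng.normal(size=(n, 2)) * data_std
eps = rng.normal(size=(n, 2))        # isotropic unit-variance noise

v = eps - x0                         # assumed velocity target

var_clean = x0.var(axis=0)           # close to [1.0, 0.01]
var_velocity = v.var(axis=0)         # close to [2.0, 1.01]: noise adds 1 everywhere
signal_share = var_clean / var_velocity

print(np.round(var_clean, 3), np.round(var_velocity, 3), np.round(signal_share, 3))
```

In the low-variance direction, about 99% of the velocity target's variance is noise, while the clean-image target in that direction contains only the data. That is the sense in which the choice of target depends on the geometry of the latent space.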

The numbers follow: clean-image prediction reaches FID 2.50 on ImageNet, clearly better than velocity prediction. The conclusion is that the prediction target isn't an interchangeable algebraic parameter but a representation-dependent geometric choice. For anyone tuning diffusion models, that's a design instinct worth noting — though the exact payoff depends on the full ablations.

Key takeaways:
  • In latent space, predicting the clean image vs. velocity is not an equivalent choice; the former wins geometrically.
  • Velocity prediction amplifies noise in low-variance directions; clean-image prediction suppresses it.
  • A 130M model is enough to surface a clear gap, worth a try when tuning.


Also Worth Noting

05 Image Gen A 20B Model Brings Photoshop Layers Into Generation
MRT unifies text-to-layer, image-to-layer, and layer-editing-layer into one model, producing editable multi-layer transparent images, distilled to 8 steps for real time. A user study puts its image-to-layer quality above the contemporary Qwen-Image-Layered, with 10-100x faster inference and more than half the memory saved. CVPR 2026.
06 Code Intelligence Writing Correct Code Isn't The Same As Writing Correct Specs
Amazon's Verus-SpecGym tests whether LLMs can translate natural-language problems into verifiable formal specs, and even the strongest, Gemini 3.1 Pro, only hits 77.8%, often missing input assumptions and letting bad outputs through. The sharper finding: using an LLM as judge misses 26% of the errors their execution-based evaluation catches.
07 Multimodal To Make A Model "Think With Images," Force It To Actually Look
Mila found models often generate an intermediate thinking image and then ignore it. Their View Dropout hides part of the input views during training, leaving only the thinking-image tokens able to see them, forcing the model to rely on its own drawing when answering. Paired with panoramic thinking images, it generalizes best out-of-domain on cross-viewpoint spatial reasoning.