DeepSeek V4 Cuts KV to 13.5%, Video Memory Runs 10x Faster

Today's Overview

DeepSeek V4 bakes "index then attend" into the main architecture. Decoding no longer keeps the full KV cache in VRAM. A Neural Memory Indexer fetches relevant history on demand, cutting KV usage to 13.5% of the full baseline on long-context evals while downstream accuracy stays flat or ticks up 0.6 points.
Video world models move memory into latent space and skip the pixel round-trip. Mirage drops explicit RGB point clouds, runs end-to-end generation 10.57x faster on 1/55th the VRAM, and takes SOTA on WorldScore.
Reading a scene is easy. Acting in it is not. SpatialWorld puts agents in first-person environments where they operate and reason about space at once. The best model averages 17.4% success, bottlenecked on active exploration and long-horizon planning rather than single-step reasoning.
Imitation learning breaks out of distribution, but a bigger policy net isn't the fix. DARP retrieves expert demos at inference time and models the difference between query and neighbor states, beating standard behavior cloning by 15–46% across several domains.

Featured

01 DeepSeek Bets Sparse Indexing on the V4 Main Path

DeepSeek V4 doesn't add another sparse-attention variant. It moves lookahead sparse attention (LSA) into the main architecture. During decoding, the full KV cache no longer sits in VRAM. A Neural Memory Indexer predicts which history chunks will actually be needed and pulls the relevant KV into memory on demand.

The most practical part is how the indexer trains. It's a standard dual-encoder, trained independently with off-the-shelf retrieval frameworks, and never requires loading the full backbone onto the GPU. That decouples "train a good index" from "train a good model." On long-context evals, average KV cache usage drops to 13.5% of the full baseline, with downstream accuracy flat or up 0.6 points. At an extreme 500K length, VRAM overhead falls by over 90% without collapsing inference quality.
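To make "index then attend" concrete, here is a minimal sketch in plain Python. It is not DeepSeek's implementation: the chunk embeddings, the `kv_store` dictionary standing in for off-GPU storage, and every function name are assumptions made for illustration. It only shows the shape of the idea, which is to score history chunks with a separate embedding, fetch the top few, and run ordinary attention over just those.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_chunks(query_emb, chunk_embs, k):
    """Return the indices of the k chunks that score highest against the query.

    Hypothetical stand-in for the indexer: a dual-encoder would produce
    query_emb and chunk_embs; here they are just given as lists of floats.
    """
    scores = [dot(query_emb, c) for c in chunk_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def attend(query, keys, values):
    """Plain softmax attention of one query vector over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, key) * scale for key in keys]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def index_then_attend(query, query_emb, chunk_embs, kv_store, k):
    """Fetch only the selected chunks, then attend over their KV entries."""
    chosen = select_chunks(query_emb, chunk_embs, k)
    keys, values = [], []
    for idx in chosen:
        chunk_keys, chunk_values = kv_store[idx]  # stands in for an on-demand fetch
        keys.extend(chunk_keys)
        values.extend(chunk_values)
    return attend(query, keys, values), chosen
```

The indexer and the attention step share nothing but chunk indices in this sketch, which is the property that lets the two be trained separately.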

Reserve judgment here: the abstract gives report-level numbers only. Recall quality, and accuracy loss on tasks that truly depend on long-range memory, need the full paper and hands-on testing. The signal for practitioners is less the figures than DeepSeek's willingness to commit indexing to V4's main path. If it holds, the cost structure of long-context serving shifts from "stack VRAM for KV" to "spend compute on the index."
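For a rough sense of the VRAM side of that trade, the arithmetic below uses the generic KV cache size formula for a standard multi-head or grouped-query cache. The model shape is hypothetical, the source does not describe V4's actual cache layout, and applying the reported 13.5% average to this particular context length is purely illustrative.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for a dense KV cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical model shape, chosen only to make the numbers easy to read.
full = kv_cache_bytes(seq_len=131_072, n_layers=60, n_kv_heads=8, head_dim=128)
resident = 0.135 * full  # reported average KV usage, applied as an assumption

GIB = 1024 ** 3
print(f"full cache:     {full / GIB:.2f} GiB per sequence")
print(f"resident cache: {resident / GIB:.2f} GiB per sequence")
```

Under these assumptions the per-sequence cache drops from 30 GiB to about 4 GiB. What the formula does not capture is the new cost: running the indexer and fetching chunks on every decode step.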

Key takeaways:
- The long-context serving bottleneck is moving from KV cache VRAM to index recall quality, and the cost model needs a rethink.
- Decoupled training of indexer and backbone is the key engineering signal: the mechanism is reusable and can iterate on its own, not locked to one model.
- 13.5% usage and +0.6 points of accuracy are report-level figures. The real impact of recall misses on long-range tasks waits on hands-on testing.

Source: FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

02 Why World Models Shouldn't Detour Through Pixels

A video world model that pans away and back should land on the same room. That takes cross-frame 3D memory. The common approach builds an explicit point cloud in RGB pixel space: render every few frames, re-encode with a VAE, push it back into the model. It's slow, and crossing in and out of pixel space throws away features the latent already learned.

Mirage puts that memory directly in the diffusion model's latent space. Depth guidance lifts latent tokens into 3D and stores them in a persistent cache. At query time, geometric transforms synthesize new views inside latent space, so the pipeline never returns to pixels. The result: end-to-end generation 10.57x faster, VRAM down to 1/55th, and SOTA on WorldScore.
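The lift-and-reproject step can be sketched with a pinhole camera model. This is an illustration under stated assumptions, not Mirage's code: the function names, the translation-only camera motion, and the nearest-depth splatting rule are all choices made here to keep the example short.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project latent grid cell (u, v) at a given depth into camera space."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def project(point, fx, fy, cx, cy):
    """Project a camera-space point to grid coordinates; None if behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def build_cache(latents, depths, intrinsics):
    """Lift every latent token to 3D. latents[v][u] is a feature, depths[v][u] its depth."""
    cache = []
    for v, row in enumerate(latents):
        for u, feat in enumerate(row):
            cache.append((unproject(u, v, depths[v][u], *intrinsics), feat))
    return cache

def render_latent_view(cache, translation, intrinsics, width, height):
    """Splat cached features into a view whose camera moved by `translation`.

    Assumption: translation only, no rotation. The nearest point wins each cell.
    Cells that no cached point lands on stay None for the generator to fill.
    """
    tx, ty, tz = translation
    grid = [[None] * width for _ in range(height)]
    zbuf = [[float("inf")] * width for _ in range(height)]
    for (x, y, z), feat in cache:
        moved = (x - tx, y - ty, z - tz)
        uv = project(moved, *intrinsics)
        if uv is None:
            continue
        u, v = round(uv[0]), round(uv[1])
        if 0 <= u < width and 0 <= v < height and moved[2] < zbuf[v][u]:
            zbuf[v][u] = moved[2]
            grid[v][u] = feat
    return grid
```

Nothing in this loop decodes to RGB or re-encodes with a VAE. The features move between views as latent tokens, which is the round-trip the article says Mirage removes.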

This isn't another sparsification trick. The savings come from deleting the encode-render round-trip that was never necessary. For teams building controllable video generation and world models, it's a change you can borrow directly.

Key takeaways:
- Cross-frame 3D consistency isn't only a VRAM problem. The larger cost is the repeated encode-decode trip through pixel space.
- Keeping memory in latent space buys roughly 10x speed and 1/55th the VRAM at once, and quality goes up rather than down (WorldScore SOTA).
- From Microsoft, with code, 58 upvotes on HF. Worth a run if you build world models.

Source: Latent Spatial Memory for Video World Models

03 Reading a Scene Is Easy, Acting in It Is Not

Multimodal models answer spatial questions well on static VQA, but that's passive viewing — the prompt hands them the angle and the information. SpatialWorld changes the test. Agents sit in first-person environments with partial information, look and explore on their own, and express actions through a text interface.

The same models stall on interactive tasks. The strongest averages just 17.4% success, which puts a wide gap between "answers correctly" and "can do it." The breakdown is the interesting part. Many models can reason fine — they fail on active exploration and long-horizon planning, and success rate and execution efficiency don't line up.

For embodied and agent teams, that gap matters more than the score.

Key takeaways:
- Static VQA measures passive viewing and misses the spatial understanding an agent needs to actually operate.
- The main bottleneck is active exploration and long-horizon planning, not single-step reasoning.
- Embodied and agent teams can use it to pin down whether their system is stuck on "can't see" or "can't plan."

Source: SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

04 Fix Imitation Learning Without a Bigger Policy Net

Behavior cloning has an old flaw: errors compound at deployment, and the robot breaks once it hits a state training never covered. DARP doesn't make the policy net bigger. It retrieves expert demonstrations at inference time and uses them — a semi-parametric route.

It isn't plain nearest-neighbor copying. DARP explicitly models the difference vector between the query state and neighbor states. The model learns to adjust: the neighbor did this, my state is off by that much, so the action shifts accordingly. The paper reports 15–46% gains over standard behavior cloning across continuous control and robotic manipulation, with no extra data collection, online expert feedback, or task priors.
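A toy version of the difference-aware idea, as a sketch only. The paper's learned model is replaced here by a fixed linear `gain` matrix, which is an assumption for illustration, and all names are hypothetical. Each retrieved neighbor proposes its own action plus a correction computed from the difference between the query state and that neighbor's state, and the proposals are averaged.

```python
def nearest(query, demo_states, k):
    """Indices of the k demo states closest to the query (squared Euclidean)."""
    def sqdist(state):
        return sum((q - s) ** 2 for q, s in zip(query, state))
    return sorted(range(len(demo_states)), key=lambda i: sqdist(demo_states[i]))[:k]

def difference_aware_action(query, demo_states, demo_actions, gain, k=2):
    """Average of neighbor actions, each shifted by gain @ (query - neighbor_state).

    `gain` is a stand-in for whatever the policy learns about how actions
    should change with state differences; here it is a fixed matrix.
    """
    idxs = nearest(query, demo_states, k)
    action_dim = len(demo_actions[0])
    total = [0.0] * action_dim
    for i in idxs:
        diff = [q - s for q, s in zip(query, demo_states[i])]
        for d in range(action_dim):
            correction = sum(gain[d][j] * diff[j] for j in range(len(diff)))
            total[d] += demo_actions[i][d] + correction
    return [t / len(idxs) for t in total]

# Toy expert: action = 2 * state, with demos covering states 0..2 only.
demo_states = [[0.0], [1.0], [2.0]]
demo_actions = [[0.0], [2.0], [4.0]]

query = [3.0]  # outside the demonstrated range
copied = demo_actions[nearest(query, demo_states, 1)[0]]
adjusted = difference_aware_action(query, demo_states, demo_actions, gain=[[2.0]])
print(copied, adjusted)
```

In this toy setup, copying the nearest neighbor returns the action for state 2, while the difference-adjusted version recovers the expert's action for state 3. That only works because the hand-picked `gain` matches the toy expert exactly; a learned model would have no such guarantee.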

The cost is carrying the demo set for retrieval at inference. For teams building robot policies, that trade-off is worth comparing against a pure parametric approach.

Key takeaways:
- Retrieval-based imitation is a low-cost robustness patch for behavior cloning, with no need to retrain a bigger policy net.
- Difference-awareness is the key. Modeling the query-neighbor difference vector, not copying the nearest action, decides whether it generalizes out of distribution.
- The 15–46% gain looks strong, but weigh the cost of carrying the demo set for inference-time retrieval. Whether it is worth it depends on the deployment.

Source: Difference-Aware Retrieval Policies for Imitation Learning


Also Worth Noting

ToM Post-Training Hits 99%, Maybe All Shortcut. [Reasoning] The task has exploitable shortcuts, so this kind of post-training gain deserves a question mark first.

Safety Judges Are Brittle, One Perturbation Flips Them. [Safety] They're sensitive to small changes in prompt and rubric; this uses curriculum training to move the judge from reliable to expressive.

Directly Translated Benchmarks Miss Cultural Context. [Safety] Multilingual safety evals lose local context under direct translation; this does culturally adapted red-teaming for East and Southeast Asian contexts.

Differential Privacy Has Guarantees, Real Protection Is Doubtful. [Safety] Overlap in pretraining data discounts DP's privacy effect; this builds an empirical benchmark to test actual protection.

RL Reasoning for Video Grounding Often Stays Shallow. [Multimodal] Reasoning paths look sound but ring hollow; this does temporally aware reasoning optimization for sharper grounding.

3D Semantic Scene Generation Drops the Triplane. [Image Gen] No more triplanes or other heavy 3D architectures: unconditional diffusion produces editable semantic occupancy for autonomous driving.

Diffusion Both Generates and Learns Representations. [Image Gen] The link between the two abilities was never clear; this evaluates its representation space through a self-supervised lens.

AI Paper Writing Shifts From Generation to Verification. [AI for Science] This uses a deterministic integrity gate to block fabricated citations and numbers that don't match source tables.

A Bit-Exact Consistency Catalog for 84 Numeric Formats. [Efficiency] When porting models across accelerators with FP8, BF16, MXFP4, and other formats, use it as a reference to catch silent precision drift.

Wastewater Sees Flu Spread Before Clinical Reports. [AI for Science] But wastewater isn't a clean proxy for population burden; this uses Bayesian selective latent inference for wastewater-first evidence.

Today's Observation

Two papers today look unrelated: FlashMemory-DeepSeek-V4 chips at the VRAM bottleneck in ultra-long-context LLM serving, while Latent Spatial Memory chips at rendering overhead in video world models. Read only their individual pain points and you miss the deeper thing they share. Both are tearing down the same object: an explicitly stored memory.

V4 throws out the full KV cache that lives in VRAM and swaps in a Neural Memory Indexer that fetches on demand. Mirage throws out the point cloud that gets rendered and VAE-encoded over and over in RGB space and swaps in a persistent 3D cache in latent space. One complains VRAM is too expensive, the other that rendering is too slow, but both point at the same thing: storing memory explicitly outside the model. It's costly and lossy.

Their fix is also the same move. Let memory live in the latent representation the model already learned, and stop crossing into that expensive explicit space. If you run any system with memory or caching, audit it through this lens: is your memory getting encoded and decoded over and over in some explicit space, shuttled back and forth? Could it live directly in the model's latent space and skip the trip?