Adaptive DiT Patches Hit 3x Speedup, Mamba Improves by Subtraction

Today's Overview

  • Latent Diffusion's Two-Step Training Collapses Into One. Aligning encoder output noise with the diffusion schedule yields a unified objective. FID 1.4 on ImageNet-512 with lower training FLOPs.
  • DiT Denoising Doesn't Need Fine-Grained Patches at Every Step. DDiT adjusts patch size by content complexity and denoising stage. 3.5x speedup, no quality loss, no retraining.
  • A Full Validation of MoE Best Practices at Scale. Arcee Trinity combines sigmoid routing, interleaved attention, and gated attention in a 400B-parameter model trained on 17T tokens with zero loss spikes.
  • Mamba-2 Gets More Accurate by Removing Components. Systematic ablation produces a simplified variant that nearly matches softmax attention while keeping linear complexity.

Featured

01 Unify the Encoder and Diffusion, Train Once

Standard latent diffusion trains the encoder/decoder first, then the diffusion model. Two stages, two separate objectives. Unified Latents (UL) aligns the encoder's output noise level with the diffusion prior's minimum noise level. That alignment yields a single training objective: a tight upper bound on latent bitrate.
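The alignment idea can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual code: the schedule shape, function names, and the loss decomposition are assumptions; the one fixed point from the text is that the encoder's output noise level equals the diffusion prior's minimum noise level, so both stages can share a single objective.

```python
import numpy as np

def diffusion_sigmas(num_steps=1000, sigma_min=0.01, sigma_max=80.0):
    """Illustrative log-spaced noise schedule for the diffusion prior."""
    return np.exp(np.linspace(np.log(sigma_max), np.log(sigma_min), num_steps))

# Core of the alignment: the encoder emits latents whose residual noise
# matches the schedule's minimum, so there is no mismatch between the
# encoder's output distribution and the diffusion model's cleanest state.
sigmas = diffusion_sigmas()
encoder_noise_std = sigmas[-1]  # == sigma_min by construction

def unified_loss(recon_err, latent_bits):
    """Single objective (schematic): reconstruction error plus an
    upper bound on latent bitrate, replacing two-stage training."""
    return recon_err + latent_bits
```

With the noise levels aligned, there is no separate VAE-style pretraining stage: the same objective supervises both the encoder and the diffusion prior end to end.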

FID 1.4 on ImageNet-512, reconstruction quality stays high, and training FLOPs drop below equivalent models trained on Stable Diffusion latents. On video, FVD 1.3 on Kinetics-600 sets a new record.

The real value isn't any single benchmark number. Collapsing "two-step separate training" into "one-step unified training" simplifies the entire pipeline. Image and video results are already strong; if the approach generalizes further, to 3D and other modalities, the impact extends well beyond these metrics. More downstream validation is needed.

Key takeaways:

  • Unifies encoder and diffusion training into a single objective, simplifying the latent space pipeline
  • More efficient than equivalent models trained on Stable Diffusion latents
  • Sets a new video generation SOTA, but generalization to other modalities needs further validation


02 DiT Doesn't Need Fine-Grained Patches at Every Step

Early denoising steps process near-pure noise. Using the smallest patches to model global structure pixel by pixel is like examining a blank wall through a magnifying glass. DDiT takes the obvious fix: large patches early to sketch structure, small patches later to refine detail. Granularity adapts to content complexity and denoising stage.
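The policy can be sketched as a tiny lookup over denoising progress and a content-complexity score. The function, thresholds, and patch sizes below are illustrative placeholders, not DDiT's actual schedule; the point is only that token count falls quadratically with patch size.

```python
def patch_size_for_step(t, num_steps, complexity, fine=2, coarse=8):
    """
    Illustrative patch-size policy (not DDiT's real rule). t counts down
    from num_steps (pure noise) to 0 (clean image); complexity is a
    per-region score in [0, 1].
    """
    progress = t / num_steps              # 1.0 at pure noise, 0.0 at the end
    if progress > 0.5 and complexity < 0.5:
        return coarse                     # sketch global structure cheaply
    if progress > 0.5:
        return coarse // 2                # detailed region, but still early
    return fine                           # late steps: refine with fine patches

# Token count per step scales as (H/p) * (W/p), so an 8-pixel patch uses
# 1/16th the tokens of a 2-pixel patch -- the source of the speedup.
```

Because this only changes how pixels are grouped into tokens, not the weights or architecture, it can wrap an already-deployed model.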

No retraining required. Drop it into any existing DiT. FLUX.1-dev gets a 3.52x speedup; Wan 2.1, 3.2x. Generation quality and prompt adherence stay intact.

No architecture changes, no weight modifications — just a different tokenization strategy that eliminates most redundant computation. For video generation workloads, the cost savings compound fast.

Key takeaways:

  • Coarse patches early, fine patches late: allocates compute by need instead of uniformly
  • Plug-and-play on deployed DiT models, no retraining needed
  • 3x+ speedup on FLUX and Wan with no quality loss, especially valuable for video generation costs


03 MoE Best Practices, Validated at 400B Parameters

Individual MoE techniques are no longer scarce. Sigmoid routing, interleaved local/global attention, gated attention, depth-scaled sandwich norm — each backed by its own paper. Arcee Trinity's contribution is combining them all, then running the combination at three scales (6B/26B/400B total parameters, 1B to 13B active) across 17 trillion tokens. Zero loss spikes.

This isn't a single method breakthrough. It's a complete snapshot of current MoE best practices under real training conditions. Two details worth tracking: SMEBU (momentum-based expert bias updates with soft clamping) for load balancing, and the choice of Muon as optimizer.
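The load-balancing idea behind SMEBU can be sketched as follows. This is a guess at the mechanism from its name ("momentum-based expert bias updates with soft clamping"), not the report's actual update rule; `beta`, `lr`, the tanh clamp, and all shapes are assumptions.

```python
import numpy as np

def smebu_update(bias, momentum, load, target, beta=0.9, lr=0.05, clamp=1.0):
    """
    Illustrative sketch of momentum-based expert-bias load balancing.
    Overloaded experts get their routing bias pushed down, underloaded
    ones up; tanh soft-clamping keeps biases bounded without the hard
    cutoff that can stall gradient-free bias updates.
    """
    error = target - load                      # positive if expert is starved
    momentum = beta * momentum + (1 - beta) * error
    bias = bias + lr * momentum
    bias = clamp * np.tanh(bias / clamp)       # soft clamp to (-clamp, clamp)
    return bias, momentum

# Toy example: 4 experts, expert 0 overloaded, expert 3 starved.
bias, mom = np.zeros(4), np.zeros(4)
load = np.array([0.50, 0.25, 0.20, 0.05])
target = np.full(4, 0.25)
for _ in range(100):
    bias, mom = smebu_update(bias, mom, load, target)
# bias[0] ends up negative (discourage routing there), bias[3] positive
```

The appeal of bias-based balancing over auxiliary losses is that it steers the router without adding a gradient term that competes with the language-modeling objective.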

For teams making architecture decisions right now, this technical report answers the question individual papers can't: "Do these things actually work together?"

Key takeaways:

  • Sigmoid routing + interleaved attention + gated attention validated across 17T tokens at three scales, zero loss spikes
  • Not a single-method innovation — a tested combination of current MoE best practices
  • Teams choosing architectures can reference this as a verified recipe


04 Mamba Gets Better by Removing Parts

When linear attention falls short, the instinct is to add: more complex gating, more attention heads, hybrid architectures. 2Mamba2Furious goes the other direction. It systematically strips components from Mamba-2, testing what actually matters for accuracy.

The surprise: the simplified Mamba-2S performs better, not worse. Improving the A-mask and increasing hidden state order nearly closes the gap with softmax attention, while retaining linear complexity's memory efficiency on long sequences. The authors also found configurations where Mamba exceeds softmax attention accuracy.
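For readers who haven't worked with this family: the recurrence Mamba-2 builds on is a decayed linear state update. The sketch below is a generic scalar-decay recurrence, not the paper's Mamba-2S; it just shows where the two ablated knobs live — the per-step decay `a` (the A-mask) and the size of `state` (the hidden state order).

```python
import numpy as np

def scalar_decay_recurrence(x, a, B, C):
    """
    Generic Mamba-2-style linear recurrence (illustrative):
        h_t = a_t * h_{t-1} + B * x_t,   y_t = C @ h_t
    The decay sequence a_t plays the role of the A-mask; len(state) is the
    hidden state order. Memory stays O(state size) at any sequence length,
    which is the linear-complexity advantage over softmax attention.
    """
    state = np.zeros(B.shape[0])
    ys = []
    for t in range(len(x)):
        state = a[t] * state + B * x[t]
        ys.append(C @ state)
    return np.array(ys)
```

In this framing, "improving the A-mask" means choosing a better decay structure, and "increasing hidden state order" means widening `state` — both changes to the recurrence, not additions of new components.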

Key takeaways:

  • Subtracting from Mamba-2 rather than adding to it brings accuracy closer to softmax attention
  • A-mask improvements and higher hidden state order are the key factors; not every component earns its keep
  • Finding the right minimal recipe beats stacking patches on the efficiency-accuracy tradeoff


Also Worth Noting

05 A Complete Pipeline From arXiv LaTeX to Scientific Language Models. [Training] Covers metadata filtering, archive validation, and engineering details papers usually omit. A 1.36B-parameter training walkthrough. link
06 Current LLM Alignment Is Fail-Open. [Safety] Suppressing one dominant feature bypasses safety constraints. Proposes a fail-closed alternative. link
07 Optimizing Tree Expansion for Multi-Path Speculative Decoding. [Efficiency] Delayed expansion reduces wasted draft tokens. link
08 GRPO Lets Models Brute-Force With Verbose Chains. [Training] A weakly supervised variant uses fewer rollouts and controls reasoning length. link
09 When Should an Agent Ask a Human vs. Act Alone? [Agent] First dataset modeling human intervention behavior in web agents. link
10 Black-Box Safety Evaluation Has Information-Theoretic Limits. [Safety] Safety on test distributions cannot guarantee deployment safety. link
11 Can Agents Infer Preferences From Browsing History? [Agent] A personalized web agent benchmark for vague user queries. link
12 Math Formalization Scales From Single Theorems to Textbooks. [Code Intelligence] Requires handling cross-file dependencies and import management. link
13 Bending the Scaling Law Curve for Recommendation Systems. [Efficiency] Empirical evidence on long-sequence modeling and deeper architectures in production serving. link

Today's Observation

The three highest-ranked papers today do the same thing in three different parts of the pipeline: find a one-size-fits-all default and replace it with something adaptive.

UL discovers that separate latent spaces for encoder and diffusion are an unnecessary split. Align the noise levels and training becomes more efficient. DDiT discovers that minimum patch size at every denoising step is unnecessary uniformity. Adapt granularity by stage and content, and most computation vanishes. 2Mamba2Furious discovers that many Mamba-2 components are unnecessary complexity. Strip them out and accuracy goes up.

Same pattern: existing designs are full of "uniform processing" defaults that aren't optimized choices. They're whatever was convenient at implementation time. Replacing fixed granularity with adaptive granularity consistently reveals that most of the computation was redundant.

Audit the uniform-processing stages in your own pipeline: fixed token lengths, undifferentiated inference steps across stages, same batch configuration for all inputs. Ask whether each processing granularity is a validated optimum or just a default. The latter is where efficiency gains live.