dots.tts Hits 54ms First Packet, SWE Agent Self-Evolves Past 50%

Today's Overview

Open-source TTS takes the continuous-latent route, with three design choices all aimed at deployment. dots.tts is a 2B continuous autoregressive speech model, Apache 2.0, that pushes first-packet latency down to 54–85ms and reports 0.94%/1.30% CN/EN WER on Seed-TTS-Eval.
One set of weights for every camera. UniSHARP unifies perspective, wide-angle, fisheye, and panoramic cameras into a single panoramic latent space for monocular view synthesis, instead of training a separate model per camera type.
Let the coding agent write its own problems and see if it escapes its comfort zone. Socratic-SWE mines an agent's execution traces for reusable skills, generates tasks from them, and reaches 50.40% on SWE-bench Verified after three iterations.
Tabular foundation models start trimming down for deployability. TabSwift matches the heavier TabPFN v2 and TabICL with a lightweight row-wise attention backbone, adds per-layer early exit, and bets squarely on low latency.

Featured

01 Continuous-Latent TTS Built for Production, Not Leaderboards

dots.tts is a 2B text-to-speech foundation model that runs autoregression in a continuous latent space. The weights, training, and inference code are all out under Apache 2.0. What makes it worth reading isn't "another TTS" — it's three deliberate design choices.

The AudioVAE trains with a multi-objective loss, shaping the continuous speech space to be semantically structured and prediction-friendly. The flow-matching head uses full-history conditioning to fight drift, the tendency of long audio to wander further off as generation proceeds. Reward-free self-correction post-training plus CFG-aware MeanFlow distillation then push first-packet latency to 85ms for output streaming and 54ms for duplex streaming.

On Seed-TTS-Eval, CN/EN/CN-hard WER comes in at 0.94%/1.30%/6.60%, which the authors call open-source SOTA. All three choices target real deployment pain, not benchmark numbers — that's the difference from most research-grade TTS. The caveat: the abstract is still mostly self-reported novelty. Real audio quality, voice cloning, and emotional stability on the continuous-latent route need a listen to the released samples before you judge.
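
For readers who want to sanity-check WER on their own transcripts, here is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. The function name `wer` and the whitespace tokenization are assumptions for illustration. This is not the benchmark's evaluation pipeline, which has its own transcription and text normalization steps. For Chinese, the same distance is usually computed over characters rather than space-separated words.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))
```

A low WER only says the words are intelligible to a recognizer. It says nothing about prosody, timbre, or cloning fidelity, which is why listening still matters.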

Key takeaways:
- Speech foundation models have long sat behind closed doors. This gives teams building voice products a deployable open-source continuous-latent option worth testing on their own samples.
- The design choices (representation polishing, drift resistance, distillation for latency) matter more than any single metric, because each maps to a real deployment problem.
- First-packet latency of 54–85ms covers streaming interaction. Confirm quality and stability by listening to samples; don't go on WER alone.

Source: dots.tts Technical Report

02 One Set of Weights for Every Camera

Most monocular view synthesis hides an assumption: the camera is a standard pinhole lens. Real cameras span wide-angle, fisheye, and panoramic, and the usual fix is a separate model per type. UniSHARP aligns all of them into one panoramic latent space, with implicit alignment in both feature and Gaussian space, so a single representation covers fields of view from perspective to panorama.

The team built a benchmark stratified by field of view and claims a large lead over existing methods. The abstract mostly explains the alignment idea. Whether cross-camera generalization holds depends on the actual results, especially on heavily distorted views such as fisheye and panorama.

Key takeaways:
- View synthesis is breaking free of the pinhole assumption, and camera generality is becoming a selling point in itself.
- "One set of weights for every camera," if it holds, removes the cost of training a separate version per camera type.
- Cross-FoV generalization quality still depends on results. Don't judge on the alignment idea alone.

Source: UniSHARP: Universal Sharp Monocular View Synthesis

03 Can a Coding Agent Out-Train Itself?

The tightest constraint in training SWE agents is the supply of high-quality tasks. Existing synthesis leans on fixed mutations or bug injection — single-pattern, blind to real-repo complexity. Socratic-SWE swaps the data source: it treats the agent's own execution traces as raw material, distills them into structured agent skills (recurring failure modes and effective fix patterns), then uses those skills to generate targeted repair tasks in real repositories.

It doesn't let self-authored tasks run loose. Candidates must pass execution verification (each task has to run and be tested), then get scored by a solver-gradient-aligned reward, and only tasks that are both verifiable and genuinely push the model forward are kept. New models produce new traces, and the curriculum iterates. After three rounds it hits 50.40% on SWE-bench Verified, beating other self-evolving baselines at the same compute budget.
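
The two-stage filter described above can be sketched roughly as follows. Everything here is hypothetical: `CandidateTask`, the `passed_execution` flag, the `min_reward` threshold, and the task names are made up for illustration, and the `reward` field is a plain placeholder number rather than the paper's solver-gradient-aligned reward, whose computation the abstract does not spell out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateTask:
    name: str
    passed_execution: bool  # hypothetical flag: the task ran and its tests executed
    reward: float           # placeholder score, NOT the paper's actual reward


def filter_tasks(candidates: List[CandidateTask], min_reward: float) -> List[CandidateTask]:
    """Keep tasks that pass both filters, highest reward first."""
    # Filter 1: drop anything that failed execution verification.
    verified = [t for t in candidates if t.passed_execution]
    # Filter 2: drop verified tasks whose reward is below the threshold.
    useful = [t for t in verified if t.reward >= min_reward]
    return sorted(useful, key=lambda t: t.reward, reverse=True)


candidates = [
    CandidateTask("fix-off-by-one", passed_execution=True, reward=0.8),
    CandidateTask("flaky-import", passed_execution=False, reward=0.9),
    CandidateTask("rename-variable", passed_execution=True, reward=0.1),
    CandidateTask("handle-none-input", passed_execution=True, reward=0.6),
]
kept = filter_tasks(candidates, min_reward=0.5)
print([t.name for t in kept])
```

Note that the unverifiable task is dropped even though its placeholder reward is the highest: in this sketch, verification is a hard gate and the reward only ranks what survives it.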

The open question is the one practitioners care about most. Skills distilled from traces come from paths the model already walked. Whether this loop keeps surfacing new weaknesses — rather than getting fluent inside its own distribution — needs longer iteration curves to settle.

Key takeaways:
- Using execution traces as the data source gives self-synthesized tasks a supply closer to real engineering than fixed mutations.
- Execution verification plus gradient alignment is the double filter that decides whether self-authored tasks are signal or noise.
- The ceiling depends on continuing to find new weaknesses. Short-term gains are clear; whether it escapes its own distribution long-term needs more iterations to observe.

Source: Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

04 Tabular Foundation Models Start Trimming for Deployment

The tabular foundation model line, with TabPFN as the flagship, has a clever setup: no fine-tuning, just pack labeled training samples into the context and infer test labels through in-context learning. It's strong on small-to-medium datasets. The problem is that chasing accuracy stacked the architecture deeper, inference costs climbed, and deployment got harder.

TabSwift goes back to TabPFN's original simplicity. A lightweight backbone does only row-wise attention, plus two small changes: gated attention for stability and learnable register tokens for global context. It matches the heavier TabPFN v2 and TabICL while spending less at inference. It also ships per-layer early exit that adjusts inference depth per sample, aimed at latency-sensitive online serving.
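
To make per-layer early exit concrete, here is a generic sketch of the idea: after each layer an exit head produces class probabilities, and the sample stops as soon as the top probability clears a threshold. The confidence-threshold criterion, the function `predict_with_early_exit`, and the toy layers are all assumptions for illustration. The summary above does not describe how TabSwift actually decides when to exit.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_early_exit(
    x: Sequence[float],
    layers: Sequence[Callable[[Vector], Vector]],
    heads: Sequence[Callable[[Vector], Vector]],
    threshold: float = 0.9,
) -> Tuple[int, int]:
    """Return (predicted class, number of layers actually run)."""
    if not layers or len(layers) != len(heads):
        raise ValueError("need one exit head per layer, and at least one layer")
    h = list(x)
    probs: Vector = []
    depth = 0
    for depth, (layer, head) in enumerate(zip(layers, heads), start=1):
        h = layer(h)
        probs = softmax(head(h))
        if max(probs) >= threshold:
            break  # confident enough: skip the remaining layers
    return probs.index(max(probs)), depth


# Toy stand-ins: each "layer" doubles the hidden state, each head reads it as logits.
def double(h: Vector) -> Vector:
    return [2.0 * v for v in h]


def identity(h: Vector) -> Vector:
    return h


layers = [double, double, double]
heads = [identity, identity, identity]

print(predict_with_early_exit([3.0, 0.0], layers, heads))  # easy sample
print(predict_with_early_exit([0.1, 0.0], layers, heads))  # hard sample
```

In this toy setup the easy sample exits after one layer while the hard one runs all three, which is the latency trade the paragraph describes: cheap samples stop paying for depth they don't need.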

Tabular data is the most common enterprise scenario. A genuinely deployable lightweight tabular FM is more grounded than yet another general-purpose large model.

Key takeaways:
- The value of tabular FMs is shifting from accuracy-chasing to deployability, and TabSwift bets clearly on the lightweight side.
- A row-wise attention backbone matching the heavier TabPFN v2 and TabICL shows complexity doesn't always buy accuracy.
- Teams doing tabular work with tight inference latency should try it, but benchmark it at your own data scale to confirm the savings.

Source: TabSwift: An Efficient Tabular Foundation Model with Row-Wise Attention

Also Worth Noting

Same Prompt Keeps Yielding Similar Images, and You Can Restore Diversity Without Retraining
Image Gen: Tackles mode collapse in flow-based text-to-image with representation modulation, no retraining needed.
Source: Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation

VLMs Read Events but Miss Fine Motion, So Borrow From Video Diffusion
Multimodal: Injects video diffusion motion priors into VLMs to fix fine-grained motion understanding.
Source: MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

Easy Questions Shouldn't Burn as Many Tokens as Hard Ones
Reasoning: Curbs overthinking by scaling reasoning to difficulty, with difficulty modeling that evolves during training.
Source: DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

AIGC Detectors Fail on a New Generator, So This Exposes the Criteria
Safety: Builds interpretable, transferable forensic concepts to counter black-box detectors' generalization collapse.
Source: ForensicConcept: Transferable Forensic Concepts for AIGI Detection

Skip Skeletons and Pose Estimation, Learn Character Animation Straight From Driving Video
Video Gen: Avoids error propagation from pose estimation under occlusion and complex poses.
Source: Beyond Skeletons: Learning Animation Directly from Driving Videos

Unsupervised Disease Staging That Explains Its Representations and Clusters
AI for Science: Uses Huntington's disease to add the interpretability clinical use needs.
Source: Explaining Unsupervised Disease Staging in Huntington's Disease

LLM Research Watches Semantics and Spelling but Ignores Sound
Evaluation: A benchmark for Chinese phonological understanding to fill the gap.
Source: Phun-Bench: Evaluating LLMs on Phonological Understanding in Chinese

Adding Text Supervision Improves Geospatial Representations in VLMs
Multimodal: Helps overlooked dimensions like geolocation and spatial reasoning.
Source: Textual Supervision Enhances Geospatial Representations in Vision-Language Models

Cultural Alignment Always Says What to Suppress, So This Defines What Counts as Coherent
Safety: Uses Korean culture to give cultural alignment a constructive, positive definition.
Source: Korean Culture into LLM Alignment: Toward Cultural Coherence

Today's Observation

Read dots.tts, TabSwift, and UniSHARP together — speech, tables, view synthesis, three unrelated fields — and they're doing the same structural thing: porting the foundation-model template into a new modality. The point isn't "three more foundation models." It's that the ground they fight over has moved. None of the three sells on capability ceiling. They sell on generality, deployability, and low latency. dots.tts leads with open weights plus distillation, first packet at 54ms. TabSwift says outright it bets on lightweight, deployment, and latency, choosing parity over complexity. UniSHARP's pitch is simply "one set of weights for every camera type." The competition has shifted from "who's strongest" to "who's most general and actually usable." The FM playbook seems to be leaving the land-grab phase and entering the deployment phase.

If you're choosing or building a foundation model for some modality, stop staring at leaderboard scores. Pull out deployment cost, latency, and supported input types as their own columns and compare them directly. When capability converges, those columns are usually what decides whether you can actually ship.
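
One way to act on that advice is to treat the deployment columns as hard constraints and the leaderboard score as a tiebreaker band. The sketch below does this with a hypothetical `Candidate` record and `shortlist` function. The model names and every number are placeholders invented for the example, not measurements of any model mentioned here.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float               # leaderboard-style metric, higher is better
    latency_ms: float
    cost_per_1k_calls: float
    input_types: FrozenSet[str]


def shortlist(
    candidates: List[Candidate],
    max_latency_ms: float,
    required_inputs: FrozenSet[str],
    score_tolerance: float,
) -> List[Candidate]:
    """Filter on deployment columns first, then rank near-equal scores by cost."""
    feasible = [
        c for c in candidates
        if c.latency_ms <= max_latency_ms and required_inputs <= c.input_types
    ]
    if not feasible:
        return []
    best = max(c.score for c in feasible)
    # Scores within the tolerance of the best are treated as converged.
    near_best = [c for c in feasible if best - c.score <= score_tolerance]
    return sorted(near_best, key=lambda c: (c.cost_per_1k_calls, c.latency_ms))


# Placeholder rows only; none of these numbers describe a real model.
rows = [
    Candidate("model_a", 0.92, 300.0, 5.0, frozenset({"perspective", "fisheye"})),
    Candidate("model_b", 0.90, 80.0, 1.0, frozenset({"perspective", "fisheye", "panorama"})),
    Candidate("model_c", 0.91, 90.0, 2.0, frozenset({"perspective"})),
    Candidate("model_d", 0.80, 50.0, 0.5, frozenset({"perspective", "fisheye"})),
    Candidate("model_e", 0.89, 60.0, 0.8, frozenset({"perspective", "fisheye", "panorama"})),
]
picked = shortlist(rows, 100.0, frozenset({"perspective", "fisheye"}), 0.03)
print([c.name for c in picked])
```

In this made-up table the top scorer never makes the shortlist because it misses the latency budget, and the cheapest near-best option ranks first.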