Late Layers Quietly Rewrite Correct Answers for Alignment

Today's Overview

A model's final layers quietly rewrite correct answers to sound more aligned. Early layers guess, middle layers sharpen the reasoning, but late layers drag the polished prediction toward "safer, more aligned" tokens. Confident Decoding avoids the damage by decoding from a confident earlier layer. It is training-free, needs no extra memory, adds under 2% latency, and shows steady gains on hard reasoning benchmarks.
Enterprise agents finally have a benchmark built from real work. EnterpriseClawBench reconstructs 852 reproducible tasks with fixtures from actual workplace sessions. The strongest setup — Codex on GPT-5.5 — reaches only 0.663, well short of something you'd trust to ship.
What stalls terminal-agent training is the data, not the algorithm. Tmax hits 27% on Terminal-Bench 2.0 with a 9B model and a plain outcome-only recipe, matching far larger models. CLI-Universe builds verifiable tasks in parallel. Both point at the training data.
One model speaks molecules and proteins natively across sequence, structure, and language. BioMatrix hits SOTA or near-SOTA on 77 of 80 tasks, so drug and protein work no longer needs a stack of specialist models. The near-sweep claim still needs a check on how strong each task's baseline is.
Move passage compute offline and reranking becomes deployable. KaLM-Reranker-V1 decouples query and passage with "fast but not late interaction." The 0.27B Nano version trades blows with 7-12B embedding models and matches industrial rerankers on BEIR.

Featured

01 Late Layers Trade Reasoning for Alignment

Pull apart how a model predicts a single token and an odd three-stage pattern shows up. Early layers make a rough guess. Middle layers sharpen the reasoning-relevant meaning. The last few layers don't add polish — they pull the finished prediction toward more generic, more alignment-friendly tokens, and that perturbs the correct answer. We've assumed deeper representations are more reliable, so we decode from the final layer. The price of that final-layer alignment is reasoning accuracy. The paper calls it the alignment tax.

Confident Decoding targets the problem directly. When the model is already confident at a layer near the end — measured by entropy — it decodes from that layer instead. This sidesteps the late-layer perturbation and skips the remaining layers, saving latency. The authors frame "which layer" as an optimal-stopping problem and prove this conservative backtracking search filters out the noise when late-layer perturbation dominates.
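The entropy-gated layer choice described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes you already have next-token logits for each layer (for example, by applying the output head to intermediate hidden states). The names `pick_layer`, `max_entropy`, and `window`, and the rule of taking the lowest-entropy layer inside a late window, are assumptions made here for clarity.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means more confident."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pick_layer(layer_logits, max_entropy=0.5, window=3):
    """Choose which layer to decode from (hypothetical selection rule).

    Looks at the last `window` layers, finds the one with the lowest
    entropy, and uses it only if that entropy is <= `max_entropy`.
    Otherwise it conservatively falls back to the final layer.
    Ties go to the later layer.
    """
    last = len(layer_logits) - 1
    first = max(0, last - window + 1)
    best_idx = last
    best_h = entropy(softmax(layer_logits[last]))
    for idx in range(last - 1, first - 1, -1):
        h = entropy(softmax(layer_logits[idx]))
        if h < best_h:
            best_idx, best_h = idx, h
    return best_idx if best_h <= max_entropy else last


def decode_token(layer_logits, max_entropy=0.5, window=3):
    """Return (chosen_layer_index, greedy_token_id)."""
    idx = pick_layer(layer_logits, max_entropy, window)
    logits = layer_logits[idx]
    token = max(range(len(logits)), key=logits.__getitem__)
    return idx, token


# Made-up logits over a 3-token vocabulary, one row per layer.
example = [
    [1.0, 0.8, 0.6],  # early layer: rough guess
    [6.0, 0.5, 0.2],  # middle layer: confident in token 0
    [1.9, 2.0, 0.3],  # final layer: flattened, now slightly prefers token 1
]
print(decode_token(example, max_entropy=0.5, window=2))  # (1, 0)
```

In this toy case the final layer would emit token 1, while the confident earlier layer emits token 0. A real integration would also need the threshold and window tuned per model, which this sketch does not address.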

Gains hold on GPQA-Diamond, Omni-MATH, and HLE, with no extra memory and under 2% added latency. It works on both dense and MoE models. Best of all, it's training-free — you stack it on an existing aligned model without retraining.

Key takeaways:
- The "deeper is more reliable" default can hurt reasoning tasks, since alignment tuning adds perturbation in the last few layers.
- Confident Decoding is a training-free decoding change you can drop onto an already-deployed aligned model to test.
- Zero extra memory and under 2% latency make it nearly free for reasoning apps. Worth a try if you build reasoning products.

Source: Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding

02 Enterprise Agents, Measured Against Real Work

Most agent benchmarks use synthetic or rewritten tasks. They look plausible but float above reality — not quite what happens inside a company. EnterpriseClawBench goes the other way. It pulls 852 reproducible tasks from a large set of real workplace session records, keeping the fixtures, role categories, hard rules, and scoring rubrics intact.

These tasks share one shape: do work inside a workspace. The agent reads heterogeneous files, calls tools, and delivers a business artifact at the end, rather than answering a quiz. The results are sobering. The strongest setup, Codex on GPT-5.5, reaches only 0.663, a long way from trusting an agent with the job.

Because the data is internal company content, the benchmark itself isn't open. The reusable contribution is the construction and evaluation protocol. It also shows how enterprise agents should be judged: report harness and model as a pair, and track artifact delivery, visual quality, cost, time, and skill transfer rather than crushing everything into one number.
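One way to keep those dimensions separate is to store each run as a record and aggregate per harness and model pair, without folding everything into a single score. The sketch below is illustrative only: the field names (`delivered`, `rubric_score`, `cost_usd`, `minutes`) are hypothetical and are not the benchmark's actual schema, and it covers only a subset of the dimensions listed above.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunResult:
    """One agent run on one task (hypothetical schema)."""
    harness: str
    model: str
    delivered: bool      # did the run produce the required artifact?
    rubric_score: float  # rubric-based quality score in [0, 1]
    cost_usd: float
    minutes: float


def report(results):
    """Aggregate per (harness, model) pair, keeping each dimension separate."""
    groups = defaultdict(list)
    for r in results:
        groups[(r.harness, r.model)].append(r)
    return {
        pair: {
            "delivery_rate": mean(1.0 if r.delivered else 0.0 for r in runs),
            "mean_rubric": mean(r.rubric_score for r in runs),
            "mean_cost_usd": mean(r.cost_usd for r in runs),
            "mean_minutes": mean(r.minutes for r in runs),
        }
        for pair, runs in groups.items()
    }


# Placeholder values for illustration, not results from the paper.
runs = [
    RunResult("harness_a", "model_x", True, 1.0, 1.0, 10.0),
    RunResult("harness_a", "model_x", False, 0.5, 3.0, 20.0),
]
print(report(runs))
```

Keying on the pair rather than the model alone reflects the point above that the same model can behave differently under different harnesses.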

Key takeaways:
- Tasks come from real company sessions, closer to how agents actually work than synthetic benchmarks.
- The best setup scores only 0.663, so enterprise reliability is far from ready.
- Break the evaluation apart: harness × model, delivery, cost, time. A single score hides the differences that matter at deployment.

Source: EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

03 Terminal Agents Are Bottlenecked on Data

Terminal agents are the most widely deployed downstream use of language models, yet academic work on RL training for them is strangely thin. The reason isn't algorithmic difficulty. There are no good benchmarks, no data, and no reproducible simple baselines.

Tmax answers all three. It uses a data-generation taxonomy — controlling difficulty, injecting personas, diversifying verifiers — to cheaply produce many executable terminal environments. It then trains with a plain outcome-only RL recipe. A 9B model scores 27% on Terminal-Bench 2.0, beating earlier larger models, and the open dataset is over 2.5x the previous size.
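An outcome-only reward is simple to state in code: run the agent's commands in a fresh workspace, then score only the end state with a verifier. The sketch below is a generic illustration under that assumption, not Tmax's training code. The function names, the `report.txt` task, and the 0/1 reward scale are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def outcome_reward(agent_commands, verifier, timeout=10):
    """Return 1.0 if the final workspace passes `verifier`, else 0.0.

    `agent_commands` is a list of argv lists. Each call uses a fresh
    temporary directory, so episodes reset cleanly. Intermediate steps
    earn nothing on their own: only the end state is scored.
    """
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        for argv in agent_commands:
            try:
                subprocess.run(argv, cwd=workspace, timeout=timeout,
                               capture_output=True)
            except (subprocess.TimeoutExpired, OSError):
                break  # stop the episode; the verifier still judges the state
        return 1.0 if verifier(workspace) else 0.0


def report_is_done(workspace):
    """Hypothetical verifier: report.txt must exist and contain 'done'."""
    target = workspace / "report.txt"
    return target.is_file() and target.read_text().strip() == "done"


# A one-step "trajectory" that writes the expected file.
good = [[sys.executable, "-c", "open('report.txt', 'w').write('done')"]]
print(outcome_reward(good, report_is_done))
print(outcome_reward([], report_is_done))
```

Because the reward depends entirely on the verifier, a weak or brittle verifier yields a misleading training signal, which is why verifier quality matters so much in this setting.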

Read it alongside CLI-Universe, a verifiable task-synthesis engine built to fix vague synthetic instructions, shallow execution paths, and brittle tests. Both papers point at the same spot: the competition in terminal agents is shifting from the model itself to building tasks that yield reliable learning signals. If you train your own agents, that's a signal to reallocate effort — solve data and verification before you tune the algorithm.

Key takeaways:
- The real bottleneck in terminal-agent training is verifiable, high-quality data, not the RL algorithm.
- Tmax matches larger models with 9B parameters and a simple outcome-only recipe, so open baselines are nearing the frontier.
- Teams training agents should shift effort from tuning algorithms to building verifiable training tasks.

Source: Tmax: A simple recipe for terminal agents

04 One Model for Molecules and Proteins

Biological foundation models have been stuck choosing between two trade-offs. Either they fuse sequence, structure, and language under one objective but cover only a single entity type (molecules or proteins, not both), or they cover many entity types but drop explicit structure modeling or can only read without generating.

BioMatrix maps everything into one discrete token space: molecular sequences (SMILES and SELFIES), molecular structure, protein sequence, protein structure, and natural language. A decoder-only architecture — Qwen3-based at 1.7B and 4B — continues pretraining on 304B tokens. Every modality reads and writes through the same next-token prediction, with no bolted-on encoders or modality-specific output heads.

It reaches SOTA or near-SOTA on 77 of 80 tasks, spanning single-entity and cross-entity understanding and generation. For drug discovery or protein engineering teams, the value isn't a leaderboard number. One general model now handles work that used to need several specialists. The 77/80 claim still needs the full paper to confirm how strong each task's baseline really is.

Key takeaways:
- A single model natively connects molecule and protein sequence, structure, and language, cutting the cost of stitching together specialist models.
- It runs at 4B, a friendly deployment bar for small and mid-size teams.
- 77/80 SOTA is the headline, but verify the comparison baselines rather than judging on task coverage alone.

Source: BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language

05 Move Passage Compute Offline to Deploy Reranking

To deploy reranking well, you have to decouple query and passage compute and precompute the passage half offline. The standard reranker does the opposite. Encoder or decoder, it concatenates query and passage into one encoding, binding their compute together. Every incoming query reruns the candidate passages through the model, so latency and cost land on the request path.

KaLM-Reranker-V1 takes a "fast but not late interaction" route. An encoder precomputes passages — compressible to different dimensions via Matryoshka pooling — while the decoder handles only the query's instruction and intent. Cross-attention then models relevance. Passages get precomputed and cached offline, leaving only lightweight query-side work online, without collapsing into the limited expressiveness of late-interaction dot products.
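The offline/online split can be shown with a toy example. This sketch is not KaLM-Reranker-V1's architecture. It assumes a stand-in encoder built from a tiny hand-written embedding table, uses prefix truncation as a placeholder for the Matryoshka pooling mentioned above (whose details are not given here), and scores relevance with one softmax attention step from query vectors over cached passage vectors. Every name in it is hypothetical.

```python
import math

# Toy embedding table standing in for a real encoder (hypothetical).
TOY_EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0, 0.0],
    "car": [0.0, 0.0, 1.0, 0.0],
    "food": [0.0, 0.0, 0.0, 1.0],
}


def toy_encode(text):
    """Map known words to vectors; unknown words are skipped."""
    return [TOY_EMBEDDINGS[w] for w in text.lower().split() if w in TOY_EMBEDDINGS]


def truncate(vec, dim):
    """Keep the leading `dim` coordinates and re-normalise (placeholder compression)."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]


def build_cache(passages, dim):
    """Offline step: encode and compress every passage once."""
    return {pid: [truncate(v, dim) for v in toy_encode(text)]
            for pid, text in passages.items()}


def cross_attention_score(query_vecs, passage_vecs):
    """Online step: each query vector attends over the cached passage vectors."""
    if not query_vecs or not passage_vecs:
        return 0.0
    total = 0.0
    for q in query_vecs:
        sims = [sum(a * b for a, b in zip(q, p)) for p in passage_vecs]
        m = max(sims)
        weights = [math.exp(s - m) for s in sims]
        z = sum(weights)
        # Attention-weighted similarity between q and the passage vectors.
        total += sum(w * s for w, s in zip(weights, sims)) / z
    return total / len(query_vecs)


def rerank(query, cache, dim):
    """Only the query is encoded at request time; passages come from the cache."""
    q = [truncate(v, dim) for v in toy_encode(query)]
    return sorted(cache, key=lambda pid: cross_attention_score(q, cache[pid]),
                  reverse=True)


cache = build_cache({"p1": "cat food dog", "p2": "car car"}, dim=4)
print(rerank("cat food", cache, dim=4))  # ['p1', 'p2']
```

The point of the structure is that `build_cache` runs once per passage, while `rerank` touches only the query and the cached vectors on each request.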

On BEIR it matches industrial models like Qwen3-Reranker. The 0.27B Nano version trades blows with 7-12B embedding models on LMEB. It wasn't trained on much multilingual data, though, so MIRACL performance needs the full paper's breakdown before any verdict.

Key takeaways:
- RAG teams often overlook reranking latency and cost. This work hits passage precompute, exactly where you can save money.
- Passages compute offline and cache, leaving light query-side interaction online, which suits large candidate sets and high QPS.
- It competes down at 0.27B, worth a build for teams on a tight deployment budget.

Source: KaLM-Reranker-V1: Fast but Not Late Interaction for Compressed Document Reranking

Also Worth Noting

Train "Data Processing" Itself as a Skill [Training]
DataClaw0 has an agent actively trim and structure raw multimodal streams to serve post-training, rather than passively label.

Real Phones Force Two Training Environments [Agent]
Real devices are slow, stateful, side-effect-prone, and hard to reset, so the authors combine real and mock apps into a recipe for training open phone agents.

Computer-Use Skill Learning Usually Assumes a Static, Safe World [Safety]
SkillHarness handles safe skill learning under adversarial, dynamic risks like prompt injection and pop-ups.

Bring the MoE Idea to GQA Self-Attention [Architecture]
Activate attention heads by token difficulty to cut the quadratic compute of long context.

Long-Horizon Agents Lock Onto One Reading of the Evidence Too Early [Agent]
Answer-only scoring misses this "process collapse," and this paper diagnoses it directly.

Use a VLM as a Driving Brain: Does It Generalize Across Geographies? [Evaluation]
Robusto-2 uses OOD corner cases in new cities like Lima and New York as the test.

Entity-Level Membership Inference: Ask an LLM Whether an Entity Was in Training [Safety]
Directly relevant to privacy-leak and copyright-compliance risk assessment.

Today's Observation

Put terminal, phone, and enterprise agents side by side, three seemingly unrelated settings, and a non-obvious thread appears. What stalls them, and where the work goes, isn't the RL algorithm or the model. It's whether you can build real, executable, verifiable environments and task data. CLI-Universe (2606.22883) calls scarce executable training data a critical bottleneck. Tmax (2606.23321) blames hard terminal-agent training on missing data and missing reproducible baselines. PhoneBuddy (2606.23049) got pushed into real-plus-mock environments by devices that are slow, stateful, side-effect-prone, and hard to reset. EnterpriseClawBench (2606.23654) drops synthetic data entirely and rebuilds reproducible tasks with fixtures from real work sessions.

Skim the titles and you'd shrug: another batch of agent papers. The real signal underneath is that the engineering value of agents is migrating from the model and algorithm layer to the environment and verifiable-data layer. If you're training or evaluating your own agent, don't rush to tune RL or swap models. Take stock of your environments first: are the tasks actually executable, are results automatically verifiable, and can failures be reproduced and the environment reset? Those three checks decide how far you get more than the algorithm does.