Today's Overview
- The metrics we use to judge whether a chain of thought is trustworthy are mostly untrustworthy themselves. Google and others built BonaFide, a benchmark with ground truth, and found mainstream faithfulness metrics barely beat guessing.
- Multi-agent systems don't get smarter by talking more. Let each agent answer alone first, then aggregate under control, and accuracy goes up. DarkForest also cuts token spend to a sixth.
- Single-vector retrieval models secretly hold multi-vector ability. SMART unlocks it without retraining, a free boost for existing embedding models.
- Reward hacking shows up in the direction of parameter updates before it fully takes hold. A trajectory-projection method uses this to delay the gaming.
Featured
01 Interpretability Are the Rulers You Use to Check Model Honesty Even Reliable?
People increasingly lean on chain of thought (CoT) to audit models. Want to know how a model reasons? Read what it writes out. The catch: the written reasoning may not match the actual computation, so people proposed faithfulness metrics to judge whether a given trace can be trusted. Whether those metrics are themselves accurate has never been verifiable, because the model's true internal computation is invisible.
Google and others took a different route. They designed tasks where the answer itself reveals which intermediate steps must have happened, producing BonaFide — a ground-truth benchmark covering 3,066 annotated chains of thought. The results sting. Most existing metrics perform close to random guessing. The best one reaches only 0.70 AUROC at the full-trace level, fails when the setting changes, and is expensive to compute.
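To make the 0.70 figure concrete, here is a minimal sketch of how AUROC scores a faithfulness metric once ground-truth labels exist. The toy labels and the two metrics are made up for illustration, and treating "faithful" as the positive class is an assumption, not something the source specifies.

```python
def auroc(scores, labels):
    """Chance that a random faithful trace outscores a random unfaithful one.

    labels: 1 = faithful, 0 = unfaithful (assumed convention).
    Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one trace of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: ground-truth labels and two made-up metrics.
labels = [1, 1, 1, 0, 0, 0]
informative_metric = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
constant_metric = [0.5] * 6  # carries no information about the label

print(auroc(informative_metric, labels))
print(auroc(constant_metric, labels))
```

The informative toy metric wins 8 of the 9 faithful/unfaithful pairs, so it scores about 0.89. The constant metric scores exactly 0.5, which is the chance level. On this scale, 0.70 means the metric ranks a faithful trace above an unfaithful one only about 70% of the time.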
If you use CoT for model auditing or safety evaluation, the ground under you may be softer than you thought.
Key takeaways:
- Don't trust current faithfulness metrics blindly. Many are near coin-flip.
- Measuring whether a CoT is trustworthy is still an open problem.
- Teams doing model auditing or alignment should re-examine the tools they rely on.
Source: Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth
02 Agent The Multi-Agent Trap: The More They Chat, the More Confidently They're Wrong
Multiple LLM agents are supposed to catch each other's mistakes. In practice they often fail together. One agent puts forward a wrong intermediate step, the others accept and amplify it, and everyone converges on a confidently wrong answer, after many rounds that burn tokens fast.
DarkForest inverts this. Each agent answers in isolation first, blind to the others. Their raw answers get parsed into structured candidates, with semantically equivalent ones grouped together. The system then computes a calibrated belief distribution using each agent's reliability and confidence, and a coordinator decides based only on the evidence it's allowed to see. Across six reasoning benchmarks, it beats the strongest baseline by up to 30.7% while cutting token spend to a sixth.
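The answer-first, aggregate-later flow can be sketched roughly as follows. This is an illustration under stated assumptions, not DarkForest's actual algorithm: the `normalize` helper is a hypothetical stand-in for semantic grouping, and weighting each vote by reliability times confidence is an assumed rule, not the paper's calibration method.

```python
from collections import defaultdict


def normalize(answer):
    """Hypothetical canonicalizer standing in for semantic-equivalence grouping."""
    return answer.strip().lower().rstrip(".")


def aggregate(responses):
    """Combine independent answers into a belief distribution.

    responses: list of (answer, reliability, confidence) tuples, with both
    weights in [0, 1]. The product weighting is an assumption.
    Returns (chosen_answer, belief), where belief maps candidate -> probability.
    """
    mass = defaultdict(float)
    for answer, reliability, confidence in responses:
        mass[normalize(answer)] += reliability * confidence
    total = sum(mass.values())
    if total == 0:
        raise ValueError("all responses have zero weight")
    belief = {cand: m / total for cand, m in mass.items()}
    chosen = max(belief, key=belief.get)
    return chosen, belief


# Three agents answer in isolation; none sees another's reasoning.
responses = [
    ("42", 0.9, 0.8),
    ("42.", 0.6, 0.7),  # grouped with "42" after normalization
    ("41", 0.5, 0.9),
]
chosen, belief = aggregate(responses)
print(chosen, belief)
```

Here the two equivalent answers pool their weight, so "42" ends up with roughly 0.72 of the belief mass. The coordinator sees only the structured candidates and their weights, never another agent's intermediate reasoning, which is what blocks error propagation in this sketch.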
If you're building multi-agent systems, the counterintuitive lesson is that more communication isn't automatically better. Controlled, structured aggregation usually pays off more.
Key takeaways:
- Multi-agent error propagation comes mostly from agents copying each other's intermediate reasoning.
- Answering independently first, then aggregating under control, improves both quality and cost.
- A 6x token reduction matters for cost-sensitive deployments.
Source: DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs
03 Retrieval Your Embedding Model Is More Capable Than You Think
For multimodal retrieval, the mainstream is single-vector models: squash a long token sequence into one global vector. Fast, but fine-grained local information is gone. Multi-vector approaches recover that detail, but usually need dedicated training.
SMART's finding is the interesting part. Standard contrastive learning, while training the pooled vector, also shapes the "retrieval geometry" of the earlier hidden states. The multi-vector ability was inside the single-vector model all along, just never used. SMART runs late interaction directly over those frozen hidden states at inference time. No retraining, plug and play, and it lifts existing models, even pushing SOTA models higher on MMEB-V2. With a bit of light post-training, a single-vector model can even overtake SOTA multi-vector rivals.
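The difference between pooled scoring and late interaction can be shown with a toy example. This is a generic MaxSim sketch, not SMART's implementation: the vectors stand in for frozen per-token hidden states, and mean pooling plus the choice of layer are assumptions the source does not specify.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_pool(token_vectors):
    """Collapse per-token vectors into one global vector (assumed pooling)."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]


def single_vector_score(query_tokens, doc_tokens):
    """Cosine similarity between the two pooled vectors."""
    q = l2_normalize(mean_pool(query_tokens))
    d = l2_normalize(mean_pool(doc_tokens))
    return dot(q, d)


def late_interaction_score(query_tokens, doc_tokens):
    """MaxSim: each query token takes its best-matching doc token; sum the maxima."""
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


# Toy hidden states: doc_a has two distinct concepts, doc_b blends them.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[1.0, 1.0], [1.0, 1.0]]

print(single_vector_score(query, doc_a), single_vector_score(query, doc_b))
print(late_interaction_score(query, doc_a), late_interaction_score(query, doc_b))
```

After pooling, both documents look identical to the query, so the single-vector score cannot separate them. Late interaction gives doc_a the higher score because each query token finds an exact match, which is the fine-grained detail that pooling throws away.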
For anyone doing retrieval or RAG, this is a free upgrade tier sitting in the model you already have.
Key takeaways:
- Single-vector models carry latent multi-vector ability, shaped incidentally during contrastive training.
- SMART is plug-and-play with no retraining, and light post-training pushes it further.
- Anyone doing multimodal retrieval or RAG can try it directly.
Source: Your Embedding Model is SMARTer Than You Think
04 Training Models Leave a Trail Before They Start Gaming the Reward
Anyone doing RL training fears reward hacking — the model stops actually solving the task and instead exploits holes in the proxy reward to farm score. This paper looks at it through the geometry of parameter updates. During normal training, updates follow a stable, low-dimensional trajectory. Once gaming begins, the update direction visibly drifts, with far larger movement along the top singular directions than in clean training.
Because the drift is observable, it can also be constrained. The authors propose trusted-direction projection, which restricts gradients to a clean reference subspace. On math reasoning experiments, this delays when the model starts taking shortcuts and better preserves real task performance.
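The idea of constraining gradients to a reference subspace can be sketched in a few lines. This is a minimal illustration, not the paper's method: how the clean reference directions are obtained is assumed, and `drift_fraction` is a simplified stand-in for the singular-direction analysis described above.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: turn reference directions into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project(gradient, basis):
    """Keep only the component of the gradient inside span(basis)."""
    out = [0.0] * len(gradient)
    for b in basis:
        coeff = dot(gradient, b)
        out = [oi + coeff * bi for oi, bi in zip(out, b)]
    return out


def drift_fraction(gradient, basis):
    """Share of squared gradient norm lying outside the reference subspace."""
    total = dot(gradient, gradient)
    if total == 0:
        return 0.0
    inside = project(gradient, basis)
    return 1.0 - dot(inside, inside) / total


# Hypothetical reference directions, e.g. taken from a clean training run.
basis = orthonormalize([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
gradient = [3.0, 4.0, 12.0]  # large component along the untrusted third axis

print(drift_fraction(gradient, basis))
print(project(gradient, basis))
```

In this toy case about 85% of the squared gradient norm lies outside the reference subspace, which is the kind of signal that could serve as an early warning. The projected update keeps only the trusted components `[3.0, 4.0, 0.0]`.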
If you tune RL pipelines and have been burned by reward hacking, this gives you an observable handle: early warning plus active constraint.
Key takeaways:
- Reward hacking leaves an observable drift signal in the direction of parameter updates.
- Constraining gradients back into a clean subspace delays the gaming.
- A good extra line of defense for RL settings like math reasoning.
Source: Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Also Worth Noting
Today's Observation
A quiet shared theme runs through today: a lot of this work challenges defaults everyone took for granted. Faithfulness metrics turn out to barely beat guessing. Multi-agent "more communication" turns out to hurt. Single-vector models turn out to hide multi-vector ability. Even natural images turn out to belong on a sphere rather than a plane.
The value of this research isn't a higher SOTA number. It's pulling a familiar assumption back out and checking it again. Teams working on evaluation and auditing should watch the BonaFide thread in particular. Once "the tool for measuring trustworthiness is itself untrustworthy" becomes a public conclusion, every downstream safety evaluation that depends on it earns a fresh question mark.