$15 Per Paper, Healthcare Agents Cap at 28%

Today's Overview

Auto-Research Cost Curve Has Crossed. $15 produces a full paper, but frontier LLMs still fabricate results and miss errors. End-to-end autonomy still falls short of the conference acceptance bar.
OProver Pulls the Compiler Loop Into Training. Failed trajectories plus verifier-repaired proofs feed SFT directly. MiniF2F 93.3% Pass@32 puts it in the current top tier among open whole-proof provers.
CHI-Bench Tests Policy Density, Role Switching, and Mid-Task Dialogue Together. The best agent config clears only 28%. Strict pass^3 keeps everyone under 20%.
CompactAttention Targets the Chunked Prefill Gap. Demoting the 2D block-sparse mask from execution plan to KV selection signal gets 2.72x attention speedup at 128K context, with dense-equivalent accuracy.

Featured

01 $15 Buys a Paper, Not Integrity

The auto-research cost curve has crossed. $15 runs a full paper. Long-horizon agents take over literature review, code writing, drafting, even simulated critique. This survey, which covers work through April 2026, traces a second curve too: under scientific pressure, frontier LLMs still fabricate results, miss errors, and make unreliable novelty judgments.

The authors split the research lifecycle into four stages (Creation, Writing, Validation, Dissemination) and reach stage-specific conclusions. The models handle structured, retrieval-anchored, tool-mediated work solidly. Genuinely novel ideas and research-grade experiments stay brittle. Generated ideas degrade when turned into code, and research code lags noticeably behind pattern-matching benchmark performance. Worse, higher automation hides failure modes rather than removing them. End-to-end fully autonomous systems can't clear the mainstream conference acceptance bar today.

Teams thinking about putting agents into a research workflow can read this as a risk checklist. Not a technical recipe. The value is a map of which stages can run hands-off and which still need a human.

Key takeaways:
- $15 per paper is technically feasible, but the integrity layer hasn't caught up. "Can do it" and "deployable" are still apart.
- Task reliability is stage-specific. Structured, retrieval-anchored, tool-mediated steps are delegatable. Novel ideas and research-grade experiments need a human in the loop.
- More automation makes failure modes harder to spot. That's the thing to watch before slotting agents into workflows.

Source: AI for Auto-Research: Roadmap & User Guide

02 Write the Compiler Loop Into Training, Not Inference

Most agentic papers leave external feedback to inference-time scaffolds: the model generates, the verifier judges, and an outer loop decides whether to retry. OProver picks a different route. Iterative post-training pulls the loop into the training side. Each round runs an agent proving pass. Newly verified proofs are indexed into a retrieval library (cumulative 1.77M Lean statements, 6.86M compiler-verified proofs). Failed trajectories, together with their compiler feedback and repairs, become SFT data. The remaining unsolved hard cases go to RL.

The mechanism works because the verifier is strong and cheap. Lean's compiler meets both criteria. Failure signals drive training directly. The model's own recovery strategy gets explicit optimization instead of riding on inference-time scaffold logic.
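The per-round routing described above can be sketched roughly as follows. This is a minimal illustration, not OProver's actual code: the `Attempt` and `RoundOutput` types, their field names, and the `route_round` function are all hypothetical, and the prompt layout for SFT examples is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Attempt:
    """One proving trajectory. All field names are hypothetical."""
    statement: str
    proof: str
    verified: bool                         # accepted by the verifier (e.g. the Lean compiler)?
    feedback: str = ""                     # compiler error text, empty on success
    repaired_proof: Optional[str] = None   # verified repair of a failed attempt, if one exists


@dataclass
class RoundOutput:
    retrieval_library: list = field(default_factory=list)  # (statement, proof) pairs
    sft_examples: list = field(default_factory=list)       # failure-plus-feedback -> repair
    rl_queue: list = field(default_factory=list)           # statements still unsolved


def route_round(attempts: list[Attempt]) -> RoundOutput:
    """Route one round of attempts into retrieval, SFT, and RL buckets."""
    out = RoundOutput()
    solved = set()

    for a in attempts:
        if a.verified:
            # Newly verified proofs are indexed for retrieval in later rounds.
            out.retrieval_library.append((a.statement, a.proof))
            solved.add(a.statement)
        elif a.repaired_proof is not None:
            # Failed attempt + compiler feedback + repair becomes an SFT example.
            # Assumption: the prompt simply concatenates the three pieces.
            out.sft_examples.append({
                "prompt": f"{a.statement}\n{a.proof}\n{a.feedback}",
                "target": a.repaired_proof,
            })
            solved.add(a.statement)

    # Statements with no verified proof and no repair are left for RL.
    for a in attempts:
        if a.statement not in solved and a.statement not in out.rl_queue:
            out.rl_queue.append(a.statement)
    return out
```

The point of the sketch is the data flow: the verifier's verdict decides which training signal each trajectory becomes, so nothing a round produces is discarded.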

Results put it in the current top tier for open whole-proof provers (Pass@32): MiniF2F 93.3%, ProverBench 58.2%, PutnamBench 11.3%. Teams holding compilers, test suites, or simulators as hard verifiers have a useful comparison here. Rather than piling complexity into inference-time scaffolds, push the recovery behavior into the weights.

Key takeaways:
- Strong-verifier domains can pull the agent loop into training. Failed trajectories plus compiler feedback are natural SFT assets.
- Compiler-in-loop works only when the verifier is strong and cheap. Lean fits. Adapting to your own scenario starts with verifier cost.
- MiniF2F 93.3% Pass@32 puts OProver in the current top tier among open whole-proof provers.

Source: OProver: A Unified Framework for Agentic Formal Theorem Proving

03 Three Things Standard Agent Benchmarks Miss

Common agent benchmarks measure whether a single task completes. Real enterprise failures rarely sit at single-point capability. CHI-Bench stitches three current evaluation blind spots into one pipeline. Policy density: 1,290+ insurance and operations manuals queried on the fly. Multi-role composition: doctor, reviewer, and nurse roles switching inside a single task. Multilateral interaction: peer review and patient communication run as mid-task multi-turn dialogue rather than terminal output.

The pipeline runs on 20 simulated healthcare systems and 87 MCP tools, covering prior authorization, utilization management, and care management. Across 30 agent configurations, the best clears just 28%. Strict pass^3 keeps everyone under 20%. Stuffing all tasks into a single session drops the number to 3.8%.
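To see why a strict pass^3 number sits so far below a best-case pass rate, here is a small sketch. It assumes pass^3 means all three independent trials of a task must succeed, which is the usual reading of pass^k; the summary above does not spell out CHI-Bench's exact definition. The function names and the per-task counts are made up for illustration and are not CHI-Bench results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k attempts drawn from n succeeds, given c successes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k attempts drawn from n succeed, given c successes (pass^k)."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical data: four tasks, three trials each, number of successful trials per task.
successes_per_task = [3, 2, 1, 0]
n_trials = 3

lenient = mean(pass_at_k(n_trials, c, 3) for c in successes_per_task)
strict = mean(pass_hat_k(n_trials, c, 3) for c in successes_per_task)
print(f"pass@3 = {lenient:.2f}, pass^3 = {strict:.2f}")  # pass@3 = 0.75, pass^3 = 0.25
```

On the same trials, the lenient metric credits any task that succeeded once, while the strict metric credits only the task that succeeded every time. An agent that is right only some of the time looks much weaker under the second.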

For agent deployment in compliance-heavy enterprise settings, this evaluation track maps closer to real failure modes than "can it complete a single step."

Key takeaways:
- Policy-dense lookup, multi-role switching, and multi-turn mid-task dialogue together expose the current blind spot in agent benchmarks.
- Best 28%, single-session 3.8%. The gap is too large to close by model scaling.
- Compliance-heavy deployments should run this kind of long-flow evaluation first. Surface the failure modes before shipping.

Source: CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

04 Sparse Attention and Production Serving Are Drifting Apart

Chunked prefill is the default in production serving now. Most of the past year's sparse attention work still optimized for one-shot prefill workloads. The output doesn't land on a production line cleanly. Block-sparse kernels fly on long queries, but the advantage vanishes once a chunk size shreds the query. Switching to fine-grained pattern search forces repeated search cost against an accumulating KV cache.

CompactAttention treats the 2D block-sparse mask as a KV selection signal instead of an execution plan. Q-block union and within-group union compress it into a minimal block table for paged execution. Selected blocks get accessed in place. No explicit copies.
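The union step can be illustrated with a toy sketch. It rests on assumptions the summary does not confirm: that "within-group union" means a union over attention heads sharing a KV group, and that the block table is a list of physical page ids looked up through a page table. The function name, argument layout, and data are hypothetical, and this is plain Python rather than a kernel.

```python
def compact_block_table(mask, group_size, page_table):
    """Collapse a 2D block-sparse mask into one KV block table per head group.

    mask[h][q][kv] is True where head h's query block q attends KV block kv.
    group_size is the number of heads sharing one KV group (assumed meaning of "group").
    page_table[kv] maps a logical KV block index to its physical page id.
    """
    num_heads = len(mask)
    tables = []
    for start in range(0, num_heads, group_size):
        selected = set()
        for h in range(start, min(start + group_size, num_heads)):
            for q_row in mask[h]:
                # Q-block union: a KV block is kept if any query block in the chunk needs it.
                selected.update(kv for kv, keep in enumerate(q_row) if keep)
        # Within-group union happens because `selected` is shared across the group's heads.
        # The table holds page ids only, so selected blocks are read in place, not copied.
        tables.append([page_table[kv] for kv in sorted(selected)])
    return tables


# Toy chunk: 2 heads, 2 query blocks, 4 KV blocks.
mask = [
    [[True, False, False, False], [True, False, True, False]],   # head 0
    [[False, False, False, True], [True, False, False, False]],  # head 1
]
page_table = [10, 11, 12, 13]

print(compact_block_table(mask, group_size=2, page_table=page_table))  # [[10, 12, 13]]
```

KV block 1 is needed by no query block in the chunk, so it never enters the table. The mask only decides which KV blocks are read, and attention over the selected blocks then runs as ordinary paged attention.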

On LLaMA-3.1-8B-Instruct at 128K context, chunked prefill attention gets a 2.72x speedup with RULER accuracy staying at dense levels. For inference infra teams, this is a sparse approach that drops into existing serving stacks without surgery.

Key takeaways:
- Academic sparse attention optimization has drifted out of sync with the chunked prefill workload running in production serving.
- CompactAttention demotes the block-sparse mask from execution plan to KV selection signal. The maneuver bypasses the chunked-scenario double bind of short queries and accumulating KV.
- 2.72x attention speedup at 128K context with dense-level accuracy. Drops into existing serving stacks.

Source: CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

Also Worth Noting

Tool-Using Agents Tested in One Real-Work Pipeline. Agent: Real professional tasks force end-to-end failure modes out of tool-using agents.

Training-Free N-Gram Memory Module. Architecture: Plug-and-play path for MoE schemes and trainable memory embeddings.

Auto-Generated Abstract Reasoning Tasks, Formally Verifiable. Evaluation: Sidesteps human annotation cost and memorization contamination. Accuracy scoring stops getting dragged by data leakage.

SFT That Adds New Knowledge Without Losing Old Capability. Training: Distribution-aligned self-distillation without an external teacher. Post-training stops trading old capability for new.

GPU Kernel Agent With Generalization-Aware Evaluation. Code Intelligence: Pushes kernel agents from single-point capability tests to unseen-config generalization.

Expert-Guided Merging Then Quantization. Efficiency: Compresses model merging and quantization into one low-resource deployment pipeline.

Today's Observation

Today's five agentic frameworks and benchmarks (Auto-Research roadmap, OProver, CHI-Bench, TOBench, AgentKernelArena) line up along a verifier axis, with two clear camps. Hard-verifier domains have a compiler as ground truth. OProver and AgentKernelArena can pull the agent loop into the training side, getting the model's recovery strategy explicitly optimized. Policy-rich workflow domains have no formal verifier. CHI-Bench and TOBench have to use operational benchmarks to surface end-to-end pipeline failure modes.

The Auto-Research roadmap sits right between the two camps. Generation isn't validation. Fabrication and novelty judgment are the real chokepoints today.

Pulled together, the next bottleneck in agent systems isn't model capability. It's what the verification surface around the agent looks like. For practitioners: if your domain has compilers, test suites, or simulators, look at OProver's training-side loop. Use failure trajectories and verifier feedback as SFT assets. If you don't have a hard verifier, follow CHI-Bench-style operational evaluation. Surface multi-role switching, policy-dense lookup, and multi-turn mid-task interaction failure modes before you ship. At the next iteration review, audit your agent line. Which kind of verifier does it hang on? Can training take in the corresponding feedback signal? Which failure modes still need runtime scaffolding to patch?