Swap the Arm Without Retraining; VLMs See Both the Duck and the Rabbit

Today's Overview

Swap a robot arm and the whole skill set breaks — the fix is rewiring, not retraining. RECENT writes skills as executable code and locally refactors only the execution bindings that shift with body or environment, letting a small model handle grounding on-device and matching the large-model version's task performance.
Robust-U1 makes the model repair the image before answering, turning robustness into an observable intermediate. A three-stage self-recovery path handles blur, noise, and occlusion — the visual corruption that only shows up in production — at the cost of an extra reconstruction step.
VLMs actually "see" both readings of a duck-rabbit image. Probes find 72% of bistable images light up features for both interpretations on the vision side; the bottleneck for steering sits downstream in language, not in the vision tower.
Atmospheric compensation in standoff infrared imaging, long shelved, gets a set-based treatment. The work jointly inverts multiple radiance measurements of one scene as an unordered set; what transfers is the modeling stance, not the LWIR setting itself.

Featured

01 Robotics: Swap the Arm, Retrain Every Skill

Reusing robot skills runs into an awkward reality: a small change to the body or environment (a different gripper, a new table height) breaks the whole skill set that used to work. The usual move is to call in a large language model (LLM) to regenerate the skills. But in dynamic, partially observable real-robot settings, deploying an LLM is impractical, and a small language model (sLM) can't supply the reliable grounding that long-horizon control needs.

RECENT takes an engineering view instead. It writes skills as executable code, keeps the semantic intent (the control structure) fixed, and locally refactors only the execution bindings. This downgrades the problem from "relearn" to "rewire." The small model never generates a whole policy from scratch — it edits the few lines that change with body or environment. Across multiple robot bodies in dynamic environments, the paper reports RECENT as the best small-model Code-as-Policies method, matching the large-model version on task performance. That's from the abstract; the exact transfer range needs the full paper to confirm.
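
To make "rewire, don't relearn" concrete, here is a minimal sketch of a skill whose control structure stays fixed while only its execution bindings change. This is not the paper's implementation: `Bindings`, `pick_and_place`, the primitive names, and every numeric value are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Bindings:
    """Body- and environment-specific values (all hypothetical)."""
    gripper_open_width_m: float
    grasp_force_n: float
    table_height_m: float
    approach_clearance_m: float


def pick_and_place(obj_xy, target_xy, b: Bindings):
    """Fixed control structure (the semantic intent) as a list of primitives."""
    hover_z = b.table_height_m + b.approach_clearance_m
    return [
        ("open_gripper", b.gripper_open_width_m),
        ("move_to", (*obj_xy, hover_z)),
        ("move_to", (*obj_xy, b.table_height_m)),
        ("close_gripper", b.grasp_force_n),
        ("move_to", (*obj_xy, hover_z)),
        ("move_to", (*target_xy, hover_z)),
        ("move_to", (*target_xy, b.table_height_m)),
        ("open_gripper", b.gripper_open_width_m),
    ]


# Bindings for the original arm and table.
arm_a = Bindings(gripper_open_width_m=0.08, grasp_force_n=20.0,
                 table_height_m=0.75, approach_clearance_m=0.10)

# The "refactor": a new gripper and a lower table touch only the bindings.
arm_b = replace(arm_a, gripper_open_width_m=0.05, grasp_force_n=12.0,
                table_height_m=0.60)

plan_a = pick_and_place((0.3, 0.1), (0.5, -0.2), arm_a)
plan_b = pick_and_place((0.3, 0.1), (0.5, -0.2), arm_b)

# Same sequence of primitives, different grounded arguments.
same_structure = [name for name, _ in plan_a] == [name for name, _ in plan_b]
```

In this framing, the small model's job would be to propose the edit that turns `arm_a` into `arm_b`, not to regenerate `pick_and_place`. How RECENT actually represents and locates bindings needs the full paper.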

The choice of a small model over a large one is the part worth chewing on. It isn't about topping a leaderboard — it's about being deployable and iterable, running on-device with low iteration cost. For teams shipping embodied AI, "refactor, don't retrain" is a more useful frame than the "matches the large model" number.

Key takeaways:
- The real blocker for skill reuse is small body/environment differences breaking everything, not failing to learn new skills.
- Treating skills as code and refactoring only the execution bindings lets a small model do grounding on-device, sidestepping the LLM deployment problem.
- This is an engineering trade: deployable and iterable over chasing SOTA. Teams shipping real systems should borrow the frame.

Source: Efficient Skill Grounding via Code Refactoring with Small Language Models

02 Multimodal: Repair the Image First, Then Answer

Real-world images are rarely clean. Phone snaps blur, security footage carries noise, objects sit half-occluded. This visual corruption is exactly what multimodal large language models (MLLMs) never see on benchmarks and stumble on the moment they ship. Past robustness work either treated feature alignment as a black box (with no way to explain what the model is filling in) or leaned on text reasoning (which can't recover lost pixel detail).

Robust-U1 flips the order: have the model restore the corrupted image first, then reason over the completed version. The recipe runs in three stages — supervised fine-tuning learns an initial reconstruction, reinforcement learning aligns quality with both pixel-level (SSIM) and semantic-level (CLIP) rewards, and the model reasons over the original corrupted image and the restored one together.
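
One way to picture the second stage is a reward that mixes a pixel-level score with a semantic-level score. The sketch below is an assumption-heavy illustration, not the paper's reward: it uses a single-window (global) SSIM instead of the usual windowed SSIM, takes plain lists as stand-ins for CLIP embeddings, and the weight `w_pixel` is a hypothetical parameter.

```python
import math


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over flat pixel lists (simplified from windowed SSIM)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return num / den


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def reconstruction_reward(restored_px, clean_px, restored_emb, clean_emb, w_pixel=0.5):
    """Weighted mix of a pixel-level and a semantic-level score (weights are assumed)."""
    pixel = global_ssim(restored_px, clean_px)
    semantic = cosine(restored_emb, clean_emb)
    return w_pixel * pixel + (1.0 - w_pixel) * semantic


clean_px, clean_emb = [0.1, 0.5, 0.9, 0.3], [1.0, 0.0]
good = reconstruction_reward(clean_px, clean_px, clean_emb, clean_emb)
bad = reconstruction_reward([0.9, 0.1, 0.2, 0.8], clean_px, [0.0, 1.0], clean_emb)
```

A perfect reconstruction scores highest on both terms, and a reconstruction that is wrong in both pixels and meaning scores lower. Using two terms guards against a restoration that looks sharp but changes what the image depicts, or the reverse.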

Turning robustness into an observable intermediate is the value here. The cost is just as clear: an extra reconstruction step, a heavier inference chain, and the risk that a bad reconstruction misleads the downstream answer. That last part needs the paper's failure cases before you can call it.

Key takeaways:
- Visual corruption is the easiest trap on the path from demo to deployment, and clean-benchmark scores won't show it.
- The self-recovery path makes robustness an interpretable intermediate step, easier to diagnose than black-box alignment.
- The cost is an extra reconstruction and a heavier chain; a failed reconstruction can backfire, so price that in before shipping.

Source: Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

03 Interpretability: The Model Sees Two Answers, Says One

The intuition: when a VLM captions an ambiguous image as "duck," it picked duck because it never registered the rabbit. This paper pries open LLaVA's vision tower with sparse-autoencoder probes and finds otherwise. Across 69 bistable images, 72% light up features for both readings on the vision side. The model does see both answers — the commitment to duck or rabbit happens further downstream.

The intervention asymmetry is more counterintuitive. For duck-rabbit images with a clear default bias, a causal edit at CLIP layer 22 flips 33% of captions to "rabbit." For an image like the old-woman/young-woman that starts out a coin flip, no coefficient moves the needle — even though both feature sets are plainly superimposed on the vision side. Seeing and seeing-as are two different things. The ambiguous information lives in the visual representation, but the bottleneck for steering sits in language, and it doesn't pry open easily.
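
The mechanics of a coefficient-scaled edit can be sketched as additive steering: add a scaled feature direction to each token's activation. This is an assumption about the general form of such an edit, not the paper's exact procedure; `rabbit_dir` and the toy activations are hypothetical values.

```python
import math


def unit(v):
    """Normalize a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def project(h, direction):
    """Scalar projection of activation h onto the feature direction."""
    return sum(a * b for a, b in zip(h, unit(direction)))


def steer(tokens, direction, coeff):
    """Add coeff * unit(direction) to every token's activation vector."""
    d = unit(direction)
    return [[a + coeff * b for a, b in zip(h, d)] for h in tokens]


tokens = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]  # toy vision-layer activations
rabbit_dir = [1.0, 2.0, 2.0]                   # hypothetical feature direction

steered = steer(tokens, rabbit_dir, coeff=4.0)
shifts = [project(s, rabbit_dir) - project(t, rabbit_dir)
          for s, t in zip(steered, tokens)]
```

Every token's projection onto the direction moves by exactly the coefficient. That is all the edit guarantees: a shift in the vision-side representation. Whether the caption changes depends on what the language side does with it, which is the asymmetry the paper reports.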

For anyone doing controllable generation or trying to correct a model's leanings, this is a reality check. A feature probe tells you where the information is. It does not mean you can rewrite the output from there.

Key takeaways:
- A VLM's leaning on ambiguous images isn't random. It's structured and locatable, but locating it doesn't mean you can intervene.
- The vision tower encodes multiple readings at once; the real commitment point is downstream in language.
- Teams hoping to fix model bias through activation edits should first check whether the bottleneck is in a layer they can reach.

Source: Vision-Language Asymmetry in Bistable Image Captioning

04 AI for Science: Inverting Physics From Noisy Observations, Set-Style

Standoff passive LWIR (long-wave infrared) hyperspectral imaging hits a problem it can't go around: the target signal gets muddied by atmospheric absorption and emission along the way. Seeing the target means doing atmospheric compensation first, and that's been shelved for years because the modeling is hard. This work skips the popular topics and takes on the niche one. A lightweight set-based deep framework feeds in multiple radiance measurements of the same scene at different distances and jointly inverts transmittance, atmospheric path radiance, and the shared downwelling spectrum.
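
For orientation, the three quantities being inverted fit the commonly used at-sensor radiance model for passive LWIR sensing of a Lambertian surface. The summary here does not state the paper's exact formulation, so treat the following as the textbook form under that assumption, with $i$ indexing the $N$ measurements taken at different distances:

```latex
\begin{aligned}
L_i(\lambda) &= \tau_i(\lambda)\, L_{\mathrm{s}}(\lambda) + L_{\mathrm{p},i}(\lambda),
  \qquad i = 1, \dots, N \\
L_{\mathrm{s}}(\lambda) &= \varepsilon(\lambda)\, B(\lambda, T)
  + \bigl(1 - \varepsilon(\lambda)\bigr)\, L_{\mathrm{d}}(\lambda)
\end{aligned}
```

Here $L_i$ is the measured radiance, $\tau_i$ the path transmittance, $L_{\mathrm{p},i}$ the atmospheric path radiance, $L_{\mathrm{d}}$ the downwelling radiance shared by all measurements, $\varepsilon$ the target emissivity, and $B(\lambda, T)$ the Planck function at target temperature $T$. Only $\tau_i$ and $L_{\mathrm{p},i}$ vary with distance, which is why observing the same scene at several ranges helps constrain the inversion.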

The transferable part isn't LWIR. It's the modeling stance: treat "multiple noisy observations" as an unordered set and process them jointly, rather than grinding through them one at a time. The authors also probed the learned representation with a sparse autoencoder and found some latents responding to geographically coherent data subsets without any geographic supervision. Interesting, but whether it holds up needs the full paper. For now the results are validated only on MODTRAN simulation data, a step short of real deployment.
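
The "unordered set" stance can be illustrated with a permutation-invariant encoder: map each measurement to features independently, then pool with an order-independent operation. The paper uses a set-based transformer; the sketch below swaps in plain mean pooling (in the style of Deep Sets) to show the invariance property only, and `phi` is a hypothetical feature map.

```python
import itertools
import math


def phi(spectrum):
    """Per-measurement features (hypothetical): mean level and dynamic range."""
    return [sum(spectrum) / len(spectrum), max(spectrum) - min(spectrum)]


def encode_set(measurements):
    """Encode each measurement independently, then mean-pool across the set."""
    feats = [phi(m) for m in measurements]
    return [sum(col) / len(feats) for col in zip(*feats)]


# Three toy radiance spectra of the same scene.
measurements = [[1.0, 2.0, 3.0], [2.0, 2.5, 4.0], [0.5, 1.5, 1.0]]
reference = encode_set(measurements)

# The encoding is the same for every ordering of the inputs.
order_invariant = all(
    all(math.isclose(a, b) for a, b in zip(encode_set(list(p)), reference))
    for p in itertools.permutations(measurements)
)
```

Because pooling ignores order, the model cannot latch onto the sequence in which measurements happened to arrive, and the same encoder accepts any number of observations.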

Key takeaways:
- Atmospheric compensation is the hard problem in standoff infrared imaging, long shelved for modeling difficulty; this work offers a lightweight approach.
- What transfers is "set-based joint inversion of multiple noisy observations," useful anywhere you extract physical quantities from repeated measurements.
- It's validated only on simulation data, and the emergent geographic representation needs the full paper to confirm.

Source: Set-Based Transformer for Atmospheric Compensation in Standoff LWIR Hyperspectral Imaging

Also Worth Noting

Multiple Teaching Agents Each Propose a Reasonable Plan, but the Student Gets One Answer
Agent: A voting protocol coordinates multi-agent collaboration, treating disagreement as a governance problem rather than a capability gap.

A Map for Spending More Compute at Inference Time in Multimodal Models
Multimodal: A systematic survey of test-time scaling across generation and reasoning in multimodal foundation models.

Today's Observation

When the visual input is less than ideal, what is the model actually doing? Two VLM papers pry at that question from opposite ends today. Robust-U1 deals with corrupted input — blur, noise, occlusion — and asks whether the model can fill the missing content back in itself. Bistable captioning deals with input that's ambiguous by nature, like a duck-rabbit, and asks at which internal step the model locks in "duck or rabbit." One repairs, the other locates. One wants the model usable under degraded input; the other wants to see which layer pins down its leaning under ambiguous input.

Together they point at something a clean-test-set accuracy number tends to hide: VLM reliability isn't just hit rate. It's whether the model's behavior on degraded or ambiguous visuals is predictable and steerable. This isn't a new robustness trend — the two papers just happen to touch the same layer of the problem. But for anyone putting a VLM into a real product, that layer is closer to the things that break in production than any leaderboard score.

One concrete thing to do: build a separate "dirty input" eval for your VLM. Take your existing test images, blur them, add noise, occlude them, then mix in a few that are genuinely ambiguous. Look at performance and stability on that slice alone, not just overall accuracy. It'll surface the model's behavior boundary under poor visual conditions before you ship.
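
A minimal sketch of such a per-slice eval, assuming grayscale images as nested lists of floats in [0, 1]. The corruption functions are simple stand-ins for whatever image library you already use, and `predict`, the two toy images, and the corruption strengths are hypothetical. Genuinely ambiguous images have to be curated by hand and are not covered here.

```python
import random


def box_blur(img):
    """3x3 mean filter with edge clamping."""
    h, w = len(img), len(img[0])
    return [[sum(img[min(max(r + dr, 0), h - 1)][min(max(c + dc, 0), w - 1)]
                 for dr in (-1, 0, 1) for dc in (-1, 0, 1)) / 9
             for c in range(w)] for r in range(h)]


def add_noise(img, sigma, rng):
    """Additive Gaussian noise, clipped back to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in img]


def occlude(img, frac):
    """Zero out a top-left block spanning `frac` of each side."""
    bh, bw = int(len(img) * frac), int(len(img[0]) * frac)
    return [[0.0 if (r < bh and c < bw) else p for c, p in enumerate(row)]
            for r, row in enumerate(img)]


def accuracy_by_slice(predict, images, labels, corruptions):
    """Score each corruption as its own slice instead of one pooled number."""
    return {name: sum(predict(fn(img)) == y for img, y in zip(images, labels))
            / len(images)
            for name, fn in corruptions.items()}


rng = random.Random(0)  # fixed seed so the noise slice is reproducible
corruptions = {
    "clean": lambda im: im,
    "blur": box_blur,
    "noise": lambda im: add_noise(im, 0.2, rng),
    "occlusion": lambda im: occlude(im, 0.75),
}


def predict(img):
    """Toy stand-in for a real model: thresholds mean brightness."""
    flat = [p for row in img for p in row]
    return "bright" if sum(flat) / len(flat) > 0.5 else "dark"


images = [[[0.8] * 8 for _ in range(8)], [[0.2] * 8 for _ in range(8)]]
labels = ["bright", "dark"]
report = accuracy_by_slice(predict, images, labels, corruptions)
```

On this toy setup the clean and blur slices stay at 1.0 while the occlusion slice drops to 0.5, a gap that a single pooled accuracy number would partly hide. With a real model, swap `predict` for your inference call and the toy images for your test set.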