Today's Overview
- Flip LoRA into a measuring stick and the real capacity of parametric memory falls out: it follows a power law you can estimate in advance, and a token-prediction probability of 0.5 is the threshold for verbatim recall.
- Unified retrieval isn't about the interface — it's about not throwing away structure. OmniRetrieval routes queries to each source's native engine instead of crushing everything into a shared vector space, and beats single-source baselines across 309 knowledge bases.
- Play a real video backwards and you get a free counterfactual. YoCausal uses reversed clips as expectation-violating negatives, and finds 13 video diffusion models can sense the arrow of time but can't explain the causality.
- Image agents shift from rewriting prompts to writing code. GenClaw has the LLM nail down composition in SVG/HTML/Three.js as an executable sketch, then hands it to a generative model to color in — the value is control, not fidelity.
- Agent guardrails pile on "lightweight" and "real-time," but the real novelty hides in the taxonomy. AgentDoG 1.5's substantive contribution is an updated open-world agent risk taxonomy; discount the "1k samples matches closed-source" claim, and verify it yourself since the model and dataset are open.
Featured
01 Flip LoRA Into a Ruler and Measure What Models Remember
Patch knowledge into a model with LoRA and you're mostly guessing — how big to set the rank, when you've run out of room. This work inverts the tool: treat LoRA as a controllable memory probe and measure how much parametric memory a model can actually hold. The answer is a memory law. Loss reduction ΔL tracks a stable power law against effective parameter count and sequence length, so capacity isn't mysticism — you can estimate it ahead of time.
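To make "estimate it ahead of time" concrete, here is a minimal sketch of fitting a power law in log-log space. It assumes a single-variable form, ΔL = a · N^α at a fixed sequence length, which is a simplification for illustration and not the paper's exact law. The function names and the synthetic numbers are made up.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**alpha, done as a line fit on logs."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    alpha = sxy / sxx                      # slope of the log-log line
    a = math.exp(mean_y - alpha * mean_x)  # intercept, mapped back from logs
    return a, alpha

def predict(a, alpha, x):
    """Evaluate the fitted power law at a new x."""
    return a * x ** alpha

# Synthetic measurements (illustrative, not from the paper):
# loss reduction generated from 0.02 * N**0.5 at four effective parameter counts.
effective_params = [1e5, 4e5, 1.6e6, 6.4e6]
delta_loss = [0.02 * n ** 0.5 for n in effective_params]

a, alpha = fit_power_law(effective_params, delta_loss)
forecast = predict(a, alpha, 2.56e7)  # extrapolate to a larger adapter
print(f"a={a:.4f}, alpha={alpha:.3f}, forecast={forecast:.2f}")
```

With real measurements you would fit on a few small ranks and compare the forecast against one held-out larger rank before trusting it.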
The sharper finding sits at the token level: a clean phase transition. Under greedy decoding, once the predicted probability of the target token clears 0.5, the model recites that token verbatim, because no other token can then outrank it. They build MemFT on this, tilting the training budget dynamically toward tokens still under the 0.5 line. In their tests MemFT lifts verbatim recall of target facts at the same cost: compute that was wasted on already-memorized tokens goes to the ones still short of the threshold.
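The threshold logic is easy to sketch. The block below checks which target tokens are guaranteed under greedy decoding and then shifts loss weight to the rest. The even-split weighting is a hypothetical stand-in, not MemFT's actual schedule, and both function names are invented for illustration.

```python
def guaranteed_verbatim(target_probs, threshold=0.5):
    """A target token with probability above 0.5 must be the argmax,
    so greedy decoding is certain to emit it."""
    return [p > threshold for p in target_probs]

def reallocate_budget(target_probs, threshold=0.5):
    """Hypothetical weighting: spread the loss weight evenly over tokens
    that have not cleared the threshold, and give the rest zero weight."""
    pending = [p <= threshold for p in target_probs]
    count = sum(pending)
    if count == 0:
        return [0.0] * len(target_probs)  # everything is already memorized
    return [1.0 / count if flag else 0.0 for flag in pending]

# Made-up per-token probabilities the model assigns to the correct next token.
probs = [0.9, 0.3, 0.55, 0.1]
memorized = guaranteed_verbatim(probs)
weights = reallocate_budget(probs)
print(memorized, weights)
```

Tokens at or below 0.5 can still win the argmax when the remaining probability mass is spread thin, so the check is a guarantee of recall, not a complete test for it.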
The power law has an edge. As the facts you want to store approach the capacity ceiling for a given rank, ΔL bends off the line and saturates, and that knee is exactly the signal to stop stacking rank and switch to full fine-tuning. Whether the law holds at larger ranks and longer sequences is something only the paper's extrapolation experiments can confirm. Either way, a framework that quantifies memory capacity beats one more fine-tuning trick.
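One way to spot that knee in practice, sketched under loud assumptions: fit the power law on the first few measurements, then flag the first point that falls more than a tolerance below the fitted line. The x-axis (number of stored facts at a fixed rank), the 10% tolerance, and the data are all hypothetical choices, not the paper's procedure.

```python
import math

def find_knee(xs, ys, fit_points=3, tolerance=0.10):
    """Index of the first point falling more than `tolerance` below the
    power law fitted on the first `fit_points` points, or None."""
    log_x = [math.log(x) for x in xs[:fit_points]]
    log_y = [math.log(y) for y in ys[:fit_points]]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
             / sum((u - mean_x) ** 2 for u in log_x))
    intercept = mean_y - slope * mean_x
    for i in range(fit_points, len(xs)):
        predicted = math.exp(intercept + slope * math.log(xs[i]))
        if ys[i] < (1 - tolerance) * predicted:
            return i  # observed gain has dropped off the fitted line
    return None

# Made-up curve: follows x**0.5 at first, then flattens near the ceiling.
facts = [100, 200, 400, 800, 1600, 3200]
gain = [10.0, 200 ** 0.5, 20.0, 800 ** 0.5, 38.0, 40.0]
knee_index = find_knee(facts, gain)
print(knee_index)
```

The tolerance trades early warning against false alarms from noisy measurements, so it is worth tuning on your own runs.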
Key takeaways:
- Memory capacity follows a power law you can estimate up front, so rank and how many facts fit are no longer guesswork.
- A token-prediction probability of 0.5 is the threshold for verbatim recall, a hard signal for whether something stuck.
- When new facts exceed LoRA's capacity ceiling, switch to full fine-tuning rather than stacking more rank.
Source: How LoRA Remembers? A Parametric Memory Law for LLM Finetuning
02 The Hard Part of Unified Retrieval Is Not Losing the Structure
Teams doing RAG usually keep several retrievers around: text goes to a vector store, tables to SQL, knowledge graphs to yet another query language. Ask one question across all of them and you're stitching it together with glue code. OmniRetrieval adds a single entry point. Feed it natural language and it decides which sources to hit, translates the query into each source's native query language, and hands it to that source's own execution engine.
It deliberately avoids the shortcut of pressing every source into one shared vector space, which would erase the schema, ontology, and composition operators that make structured data valuable in the first place. Across a benchmark of 13 datasets and 309 knowledge bases, it beats single-source baselines, which suggests the "route plus native execution" architecture holds up.
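The "route plus native execution" shape can be sketched in a few lines. This toy version is not OmniRetrieval's implementation: the regex rules stand in for whatever learned router the paper uses, and the table, the triples, and the `answer` function are invented for illustration. The point is only that each question runs on its source's own engine (SQL for the table, a relation lookup for the graph) with nothing embedded into a shared space.

```python
import re
import sqlite3

# Source 1: a relational table, queried in its native language (SQL).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE papers (title TEXT, year INTEGER)")
db.executemany("INSERT INTO papers VALUES (?, ?)",
               [("Paper A", 2023), ("Paper B", 2025)])

# Source 2: a tiny knowledge graph stored as (subject, relation, object) triples.
triples = [("Paper A", "cites", "Paper B"), ("Paper B", "cites", "Paper C")]

def answer(query):
    """Route a question to one source and run it on that source's own engine.
    The regex rules are a hypothetical stand-in for a real router."""
    match = re.search(r"published since (\d{4})", query)
    if match:
        rows = db.execute(
            "SELECT title FROM papers WHERE year >= ? ORDER BY title",
            (int(match.group(1)),),
        )
        return "sql", [row[0] for row in rows]
    match = re.search(r"what does (.+) cite", query)
    if match:
        subject = match.group(1)
        return "graph", [o for s, r, o in triples
                         if s == subject and r == "cites"]
    return "none", []  # no route matched; a real system needs a fallback

print(answer("papers published since 2024"))
print(answer("what does Paper A cite"))
```

The `"none"` branch is where a wrong or missing routing decision surfaces, which is the failure case worth probing in any system built this way.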
The abstract only gives the qualitative result. How far it leads on each source type, and what happens when the routing guesses wrong, are questions for the full paper.
Key takeaways:
- The right answer for cross-source retrieval is routing to each source's native engine, not a unified embedding. Worth borrowing if you build retrieval systems.
- Preserving schema and graph structure extracts more value from each source than forcing everything into the same shape.
- 309 knowledge bases is broad enough to argue generalization, but per-source margins and routing-failure fallbacks still need the full paper.
Source: OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources
03 Run the Video Backwards and the "World Models" Fall Apart
The simplest causality test turns out to be playing a real video in reverse. Water flying out of a glass and back into the bottle is physically absurd, which makes reversed clips a free supply of expectation-violating negatives. YoCausal builds a two-stage benchmark on this. First, denoising loss measures whether a model can sense the arrow of time (RSI). Then a VLM splits the data into true causality versus mere temporal correlation, separating memorized statistics from real causal reasoning (CCI).
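The first stage is simple enough to sketch. Below, a clip is a list of per-frame brightness values, and `toy_denoising_loss` is a stand-in for a real video diffusion model's denoising loss. The score is the fraction of clips where the reversed version gets a higher loss. This is an illustrative proxy under those assumptions, not the paper's RSI definition, and every name here is hypothetical.

```python
def reverse_clip(frames):
    """Return the frames in reverse temporal order (the counterfactual negative)."""
    return frames[::-1]

def toy_denoising_loss(frames):
    """Stand-in for a model's loss: this toy 'model' has only ever seen
    brightness rise over time, so it penalizes every drop between frames."""
    return sum(max(0.0, prev - nxt) for prev, nxt in zip(frames, frames[1:]))

def direction_sensitivity(clips, loss_fn):
    """Fraction of clips whose reversed version gets a strictly higher loss."""
    higher = 0
    for clip in clips:
        if loss_fn(reverse_clip(clip)) > loss_fn(clip):
            higher += 1
    return higher / len(clips)

# Two clips that brighten over time and one static clip with no time direction.
clips = [[0.1, 0.4, 0.9], [0.2, 0.3, 0.5], [0.5, 0.5, 0.5]]
score = direction_sensitivity(clips, toy_denoising_loss)
print(round(score, 3))
```

The static clip is the useful control: a clip with no arrow of time should not count toward the score, and here it does not.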
The surprise: across 13 mainstream video diffusion models, sensing time direction does not mean understanding cause. A model knows the clip is reversed but can't say why it shouldn't happen, leaving it far short of human-level causal reasoning. Generating something realistic and understanding the physics behind it are two different things, and clearing the first bar says nothing about the second.
Key takeaways:
- Reversed real video as a counterfactual negative is a near-zero-cost, infinitely scalable evaluation idea, worth borrowing for world-model work.
- Sensing the arrow of time is not understanding causality; don't take realistic generation as proof a model grasps physics.
- Current video diffusion models are visibly far from genuine world models, and the marketing should be discounted accordingly.
Source: YoCausal: How Far is Video Generation from World Model? A Causality Perspective
04 When the Image Agent Writes Code Instead of Rewriting Prompts
Image-generation agents hit an awkward wall: they understand the brief and can call tools, but they have no direct control over the canvas. The black-box model gives what it gives, and the agent can only rewrite the prompt and regenerate — a client who keeps rephrasing the same request. GenClaw splits the labor differently. The LLM first draws composition, layout, and geometry as executable code (SVG, HTML, Three.js), pinning down position and proportion exactly, then hands it to an image model to fill in material and lighting.
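A minimal sketch of what "composition as executable code" can look like, assuming the agent emits SVG. The `layout_sketch` helper, the element labels, and the coordinates are invented for illustration and are not GenClaw's actual output format. What it shows is that position and proportion become exact, inspectable numbers before any pixels are generated.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

def layout_sketch(width, height, boxes):
    """Build an SVG wireframe: one labeled rectangle per scene element.
    `boxes` maps a label to (x, y, width, height) in canvas pixels."""
    svg = ET.Element("svg", {
        "xmlns": SVG_NS,
        "width": str(width),
        "height": str(height),
        "viewBox": f"0 0 {width} {height}",
    })
    for label, (x, y, w, h) in boxes.items():
        rect = ET.SubElement(svg, "rect", {
            "x": str(x), "y": str(y), "width": str(w), "height": str(h),
            "fill": "none", "stroke": "black",
        })
        ET.SubElement(rect, "title").text = label  # keeps the element's name
    return ET.tostring(svg, encoding="unicode")

# Hypothetical brief: a house in the lower left, the sun in the upper right.
boxes = {"sun": (400, 40, 80, 80), "house": (100, 200, 200, 160)}
sketch = layout_sketch(512, 512, boxes)
print(sketch)
```

Because the sketch is plain markup, an agent can fix a layout error by changing one number instead of rephrasing a prompt and regenerating.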
That mirrors how a human painter works (conceive, sketch, then color), with code as a controllable canvas between language reasoning and pixel synthesis. The value here is control and interpretability, not image quality. How good an agent can get at sketching with code, and how far the final image drifts from the code draft, are questions only the full paper and real tests can settle.
Key takeaways:
- The bottleneck in agentic image generation isn't quality. It's the lack of direct control over the canvas, and code-as-sketch is one way to open the black box.
- Teams doing precise layout or geometric composition should watch the "code as intermediate canvas" approach.
- This is early work in a new direction; whether the control gains outweigh the ceiling on code-sketching ability is unproven.
Source: GenClaw: Code-Driven Agentic Image Generation
05 Agent Guardrails Stack Buzzwords, but the Real Work Is in the Taxonomy
The abstract reads smoothly, stringing together lightweight, scalable, and real-time, but only one contribution carries real weight: an updated taxonomy of agent safety risks, built to cover the new failure modes that arise when open-world agents like Codex and OpenClaw execute across environments. The rest of the pitch (0.8B-to-8B small models, matching GPT-5.4-class closed-source models with roughly 1k samples, Docker deployment overhead two orders of magnitude lower) looks good on paper. But a claim like "1k samples matches top closed-source" usually depends heavily on how the eval set is built, and such results tend to confirm themselves inside the scenarios the authors' own taxonomy defines.
What's actually useful to practitioners is which new attack surfaces it folds in (privilege escalation when an agent calls tools or executes across environments), because that understanding transfers straight into your own defenses. The leaderboard score doesn't. The model and dataset are open, so rather than trust the marketing language, read the new taxonomy and check it yourself.
Key takeaways:
- The substantive contribution is the updated agent risk taxonomy; discount the small-model and sample-efficiency numbers.
- "1k samples matches closed-source" depends heavily on eval construction and tends to confirm itself inside the scenarios the authors' own taxonomy defines.
- The model and dataset are open, so teams building agent defenses should read the risk taxonomy directly rather than trust the benchmark.
Source: AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Today's Observation
The "world model" label is getting slapped on video diffusion models in bulk, and today's papers happen to put both halves of that upgrade on the table at once. minWM is heads-down building the engineering base — making video models actually run in real time and stay interactive. YoCausal uses reversed real video, a near-zero-cost counterfactual, to poke at the gap between realistic generation and causal understanding. Read together, the signal isn't "world models are hot again." It's that one side is racing to deploy video models as world models while the other is quietly readying the tools to test whether the label holds. Capability and verifiability showed up in sync this time, which is rare.
If you're betting on "video as world model," don't let a smooth demo set the pace. Put causality benchmarks like YoCausal on your acceptance checklist, and confirm your model actually understands physics rather than just memorizing temporal statistics.