A pattern is becoming hard to ignore. Each issue has surfaced a different layer of the Physical AI stack — voice interfaces (Issue #002), authentication and cloning (Issue #003) — and each time, the failure mode is the same: a component that was assumed to work correctly in production does not.
This week, the failures go deeper. VLA models have architectural shortcuts that cause them to overlook the exact visual details that matter for safe actuation. Agent memory systems are structured too flatly to reuse experience efficiently across related tasks. RAG — the retrieval architecture powering most document AI in production — provides no meaningful improvement for the document reasoning tasks that matter most. The assumption in each case is not wrong in theory. It is wrong in deployment.
No new industry moves this week that have not already been covered. Four signals, all research.
🔴 SIGNAL #1 — PHYSICAL AI · CRITICAL
Your Robot Cannot Reliably See What It Is Looking At
arXiv:2603.28740 · FocusVLA · 2026-03-30 · cs.RO · cs.AI
Vision-Language-Action (VLA) models are the dominant architecture for robots that reason about their environment before acting. They are also broken at the visual attention layer in ways that production deployments have not accounted for.
FocusVLA identifies three specific failure modes in auto-regressive VLA policies. First: architectural bias — the model structure creates shortcut pathways that allow the policy to skip visual detail in favor of statistically likely actions. Second: token flooding — excessive visual tokens make it computationally easier for the model to ignore specific regions than to attend to them. Third: task-irrelevant noise — visual inputs that are present but unrelated to the current task contaminate attention, pulling the model away from the regions that determine safe actuation.
The fix: Modality Cascaded Attention eliminates the architectural shortcuts by enforcing attention paths that cannot bypass visual input. Focus Attention dynamically selects task-relevant patches, concentrating computation where it matters. Combined, the two mechanisms deliver a 30%+ performance improvement on robotic benchmarks.
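The dynamic patch-selection idea can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the scoring rule (cosine similarity between patch features and a task embedding) and the `keep_ratio` parameter are assumptions made for the sketch.

```python
import numpy as np

def focus_attention(patch_feats, task_emb, keep_ratio=0.25):
    """Toy sketch of dynamic patch selection: score each visual patch
    against the task embedding and keep only the top fraction.
    The scoring rule and names here are illustrative, not FocusVLA's."""
    # Relevance score: cosine similarity between each patch and the task.
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = task_emb / np.linalg.norm(task_emb)
    scores = p @ t
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(scores)[-k:]   # indices of the most task-relevant patches
    return patch_feats[keep], np.sort(keep)

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))   # 16 visual patches, 8-dim features each
task = rng.normal(size=8)            # task embedding
selected, idx = focus_attention(patches, task)
print(selected.shape)                # (4, 8): 25% of 16 patches survive
```

The point of the mechanism is that downstream attention runs only over `selected`, so computation concentrates on the regions that determine the action rather than being flooded by every patch in the frame.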
The safety implication is direct and significant. A 30%+ gap between a VLA model's actual attention behavior and its intended attention behavior is not a performance optimization opportunity. It is an undocumented failure mode in every VLA-powered deployment operating without these corrections. Warehouse robotics, surgical assistance, home care robots — each one is currently attending to the scene in ways that differ materially from what was tested and assumed.
What to do now:
Treat all VLA deployments as operating with broken visual attention until evaluated against FocusVLA's three failure modes;
Audit for shortcut pathway artifacts — policies that perform well on benchmark distributions may be exploiting statistical regularities rather than scene understanding;
Implement Modality Cascaded Attention to close the architectural gap;
Add Focus Attention for dynamic patch selection in cluttered or high-noise environments;
Re-validate sim-to-real transfer assumptions: synthetic training environments may not expose these attention failures at the severity they manifest in real deployment
Source: https://arxiv.org/abs/2603.28740
🟠 SIGNAL #2 — AI AGENTS · HIGH
Agents Are Forgetting What They Already Know How to Do
arXiv:2603.28716 · D2Skill · 2026-03-30 · cs.AI
Agentic RL systems in production face a specific inefficiency that accumulates over time: they re-solve problems they have already solved. Experience is stored in flat, undifferentiated memory structures that do not distinguish between high-level strategic knowledge and fine-grained execution patterns. When a new task arrives, the agent cannot efficiently retrieve the relevant prior experience, and so it does not benefit from it.
D2Skill addresses this with a dual-granularity skill bank. Task skills capture high-level goal structures for strategic guidance. Step skills capture fine-grained decision patterns for execution support. The two layers are maintained separately, trained jointly through paired rollouts, and retrieved through a utility-aware mechanism that uses performance gap signals to identify which stored experience is worth pulling.
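The dual-granularity structure can be made concrete with a minimal sketch. Everything below is illustrative: the field names, the utility rule (a running performance-gap estimate), and the substring-match retrieval are assumptions for the sketch, not D2Skill's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    key: str        # task signature (task skill) or state pattern (step skill)
    advice: str     # stored strategic or execution guidance
    utility: float  # performance-gap estimate: gain observed when retrieved

@dataclass
class DualSkillBank:
    """Illustrative dual-granularity skill bank: the two layers are kept
    separate and retrieved by utility. A sketch, not D2Skill's implementation."""
    task_skills: list = field(default_factory=list)
    step_skills: list = field(default_factory=list)

    def retrieve(self, query: str, level: str, top_k: int = 1):
        pool = self.task_skills if level == "task" else self.step_skills
        # Utility-aware retrieval: among matching skills, prefer those whose
        # retrieval has historically closed the largest performance gap.
        hits = [s for s in pool if s.key in query]
        return sorted(hits, key=lambda s: s.utility, reverse=True)[:top_k]

    def prune(self, min_utility: float = 0.0):
        # Drop stale or low-value experience so it cannot contaminate retrieval.
        self.task_skills = [s for s in self.task_skills if s.utility > min_utility]
        self.step_skills = [s for s in self.step_skills if s.utility > min_utility]

bank = DualSkillBank()
bank.task_skills.append(Skill("clean mug", "locate sink before grasping", 0.4))
bank.step_skills.append(Skill("grasp", "approach handle from the side", 0.7))
bank.step_skills.append(Skill("grasp", "close gripper fully", -0.2))
bank.prune()                                      # negative-utility skill removed
best = bank.retrieve("grasp the mug", level="step")
print(best[0].advice)                             # "approach handle from the side"
```

The design point is the separation itself: a strategic query hits the task layer, an execution query hits the step layer, and neither pool dilutes the other.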
The gains are consistent and substantial: a 10 to 20 point success-rate improvement on the ALFWorld and WebShop benchmarks. The improvement holds across task types and difficulty levels because it addresses a structural deficit rather than benchmark-specific tuning.
For Physical AI: multi-step task automation, web-based agent pipelines, and any agentic system operating across varied but structurally similar tasks all benefit directly. The key insight is not the specific architecture but the dual-granularity principle — collapsing all experience into a single memory layer wastes the structure already present in what the agent has learned.
What to evaluate:
Audit current agentic RL skill storage: if experience is stored in a single undifferentiated pool, the D2Skill gap applies to your deployment;
Implement dual-granularity separation between strategic task-level skills and execution-level step skills;
Add utility-aware retrieval with pruning — stale or low-value experience should not contaminate retrieval quality;
Evaluate reflection-based skill expansion for agents operating in evolving environments where new task variants are introduced incrementally
Source: https://arxiv.org/abs/2603.28716
🟠 SIGNAL #3 — AI AGENTS · HIGH
RAG Does Not Work for Real Document Reasoning — ScholScan Makes This Precise
arXiv:2603.28651 · ScholScan · ICLR 2026 · 2026-03-27 · cs.AI · cs.CL
Retrieval-Augmented Generation is the dominant architecture for document-based AI in enterprise deployment. Research assistance, compliance verification, legal analysis, technical documentation review — most production implementations assume that RAG meaningfully improves performance on document reasoning. ScholScan, accepted at ICLR 2026, measures this assumption directly and finds it does not hold for the task class that matters most.
The benchmark introduces a scan-oriented setting: questions that require locating, integrating, and verifying information across an entire document — the kind of reasoning a skilled analyst performs when reviewing a paper for specific evidence, methodological errors, or cross-section consistency. 1,800 questions across 9 error categories, 13 natural science domains, 715 papers. RAG methods yield no significant improvements over baseline. The failure is not marginal — it is systematic, and it reflects a genuine architectural mismatch.
The finding identifies what scan-oriented reasoning actually requires: full-document cross-checking capability, evidence localization across non-contiguous sections, and reasoning trace annotation that connects claims to the specific locations in the document that support or contradict them. RAG's retrieval-first architecture cannot do this because the bottleneck is not retrieval of the right passage — it is the ability to hold and reason over the document's full structure simultaneously.
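What evidence localization and reasoning trace annotation mean in practice can be sketched as a data shape. The schema below is hypothetical, not ScholScan's: the field names and the "supports/contradicts" relation are assumptions chosen to illustrate the auditability requirement.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    section: str   # where in the document the support lives
    span: tuple    # (start_char, end_char) inside that section
    relation: str  # "supports" or "contradicts"

@dataclass
class AnnotatedAnswer:
    """Hypothetical shape for an evidence-localized answer; the fields are
    illustrative, not ScholScan's schema."""
    question: str
    answer: str
    evidence: list  # every claim must point back into the document

def is_auditable(ans: AnnotatedAnswer) -> bool:
    # A scan-oriented answer with no document locations cannot be verified.
    return len(ans.evidence) > 0

ans = AnnotatedAnswer(
    question="Does the reported sample size match the methods section?",
    answer="No: methods states n=40, the results table reports n=38.",
    evidence=[
        Evidence("Methods", (120, 180), "supports"),
        Evidence("Results", (45, 90), "contradicts"),
    ],
)
print(is_auditable(ans))  # True
```

Note what a retrieval-first pipeline cannot easily produce here: the two evidence spans live in non-contiguous sections, and the answer is the relationship between them, not the content of either passage alone.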
For Physical AI teams using document AI in regulated contexts — compliance review, clinical documentation, safety audit verification — the ScholScan finding is directly operational. A system that appears to perform well on extracted passage questions may fail entirely on the integrated document reasoning that regulatory review requires.
What to address:
Audit deployed document AI against ScholScan's scan-oriented task categories — extracted passage performance does not predict full-document reasoning performance;
RAG augmentation alone is insufficient for compliance and verification use cases: implement full-document context handling;
Add evidence localization capability — the system should identify not just the answer but the document location that supports it;
For high-stakes document reasoning, implement reasoning trace annotation: the chain from claim to source must be auditable
Source: https://arxiv.org/abs/2603.28651
🔵 SIGNAL #4 — INFRASTRUCTURE · MEDIUM
Sovereign Communication Infrastructure for AI Deployments Operating Outside Centralized Control
arXiv:2603.28727 · BitSov · BlockArch 2026 · 2026-03-30
Physical AI deployments in healthcare, elder care, and critical infrastructure increasingly face a question that was not part of the original design surface: what happens when the communication and authentication infrastructure they depend on is unavailable, censored, or compromised at the platform level?
BitSov proposes a composable 8-layer protocol stack built entirely on decentralized infrastructure — Bitcoin, Lightning Network, decentralized storage, federated messaging, and mesh connectivity. Payment-gated messaging deters spam economically, eliminating reliance on centralized moderation. Timechain-locked contracts anchor subscription and authentication events to Bitcoin block height, creating a verifiable audit trail that does not depend on any single platform's continued availability or good faith.
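The timechain-anchoring idea reduces to a simple tamper-evidence pattern. The sketch below shows only the record shape, under stated assumptions: in a real deployment the digest would be committed into a Bitcoin transaction at the given block height, which this toy omits entirely.

```python
import hashlib

def anchor_event(event: str, block_height: int) -> dict:
    """Toy timechain-anchored audit record. A real system would commit the
    digest on-chain at this block height; here we only show the
    tamper-evident record shape (an assumption, not BitSov's protocol)."""
    digest = hashlib.sha256(f"{block_height}:{event}".encode()).hexdigest()
    return {"event": event, "block_height": block_height, "digest": digest}

def verify(record: dict) -> bool:
    # Recompute the digest; any edit to the event or height breaks the match.
    expected = hashlib.sha256(
        f"{record['block_height']}:{record['event']}".encode()
    ).hexdigest()
    return record["digest"] == expected

rec = anchor_event("subscription renewed for device A17", 884_000)
print(verify(rec))                      # True
rec["event"] = "subscription cancelled"
print(verify(rec))                      # False: tampering is detectable
```

The on-chain commitment is what turns "detectable by whoever holds the digest" into "detectable by anyone," without trusting the platform that emitted the log.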
The relevance for Physical AI is not primarily technical — it is structural. Deployments that depend on cloud APIs, centralized authentication services, or platform-controlled communication channels carry a category of infrastructure risk that is not captured by standard security reviews. A companion robot that authenticates through a cloud service that is unavailable during a network outage is not secure; it is simply offline. BitSov-style architecture makes this dependency explicit and provides a concrete alternative.
This is a MEDIUM signal: the architecture is proposed, not production-validated at scale. But the design surface it addresses — censorship-resistant, economically spam-deterred, platform-independent communication for AI agents — is increasingly relevant as Physical AI deployments extend into jurisdictions and contexts where platform reliability cannot be assumed.
What to consider:
Map all infrastructure dependencies in current Physical AI deployments: cloud authentication, communication, storage — which of these can become single points of failure?
Evaluate whether deployments in regulated or high-availability contexts require platform-independent fallback communication
Review subscription and audit trail architecture: timechain-anchored records provide tamper-evident verification that centralized logs do not
Source: https://arxiv.org/abs/2603.28727
⚠️ REGULATORY WATCH — EU AI ACT
Article 5 Enforcement Deadline: August 2, 2026
124 days remaining.
Signal #1 adds a new dimension to Article 5 compliance for Physical AI teams. FocusVLA establishes that VLA models operating in production may be attending to scenes in ways that materially diverge from their evaluated behavior. A robot in a care setting that responds to statistically likely action patterns rather than actual scene content is precisely the kind of system Article 9's risk management documentation is designed to cover — and precisely the kind of gap that internal evaluations, run on benchmark distributions, will not surface.
The practical implication: if your Article 9 documentation describes behavioral validation procedures that were run before FocusVLA's findings were available, those procedures may not cover the visual attention failure mode. This is worth reviewing before the August deadline, not after it.
On Signal #3: document AI used for compliance verification, audit reporting, or regulatory documentation review falls within Article 9's high-risk classification in most EU member state implementations. A system whose document reasoning architecture is known to fail on full-document scan tasks — and that is deployed for regulatory purposes — requires updated risk documentation. ScholScan makes this failure mode concrete enough to reference directly in a risk assessment.
That is Issue #004.
Across four issues, the signals form a consistent picture: the gap between evaluated performance and deployment reality runs through every layer of the Physical AI stack — voice recognition, authentication, visual attention, agent memory, document reasoning. Each week adds specificity. FocusVLA gives a number: 30%. D2Skill gives a number: 10 to 20 points. ScholScan gives a category: RAG cannot do what it was assumed to do for full-document tasks.
Specificity is what makes the gap actionable. We will keep surfacing it.
No financial relationship with any AI company, hardware manufacturer, or standards body. We don't certify. We don't consult. We watch.
Credentialed press at HumanX 2026.

