This week's leading signal is not theoretical. Researchers evaluated seven production ASR systems across four languages and found a consistent pattern: under degraded inputs, models hallucinate plausible but unspoken content — and voice agents execute on it. If you have a voice interface in production, this affects you.
Also this week: a framework that fixes a fundamental contradiction inside multimodal AI, a breakthrough for surgical and home-assistance robotics, a gap in educational AI that classroom deployments can't ignore, and three industry moves that shift the competitive and regulatory landscape.
🔴 SIGNAL #1 — PHYSICAL AI · CRITICAL
Your Voice Agent Is Executing Commands You Never Spoke
arXiv:2603.25727 · WildASR Benchmark · 2026-03-26 · cs.AI
ASR systems achieve near-human accuracy on curated benchmarks. They fail in production — systematically, and in ways that benchmarks do not measure.
WildASR is a multilingual diagnostic benchmark sourced entirely from real human speech. It evaluates seven widely used ASR systems across four languages, factorizing robustness along three axes: environmental degradation, demographic shift, and linguistic diversity.
The core finding: under partial or degraded inputs, models hallucinate plausible but unspoken content. Robustness does not transfer across languages. Performance degrades severely and unevenly across conditions — and not in ways that correlate with benchmark scores.
For Physical AI: voice interfaces sit at the center of companion robots, smart home devices, elder care systems, and medical AI. A hallucinated command in these contexts does not produce a wrong answer in a chat window. It triggers an action. Unauthorized transactions. Restricted data access. Physical actuation.
What to do now:
Evaluate your ASR systems against WildASR's three robustness axes — environmental degradation, demographic shift, linguistic diversity
Implement confidence thresholds and cross-validation before executing critical or irreversible commands (a minimal gating sketch follows this list)
If you serve multilingual users, audit cross-language performance separately — robustness does not transfer
Integrate factor-isolated testing into your continuous evaluation pipeline
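One way to make the confidence-threshold item concrete. This is a minimal sketch, assuming an ASR client that reports a transcript and a confidence score; the intent names, thresholds, and second-pass setup are placeholders, not anything specified by the WildASR paper.

```python
# Hypothetical gating layer between ASR output and command execution.
# Assumes an ASR client that returns a transcript plus a confidence score;
# the intent names, thresholds, and second-pass setup are illustrative only.
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.90                      # below this, never auto-execute
CRITICAL_INTENTS = {"unlock_door", "transfer_funds", "dispense_medication"}

@dataclass
class AsrResult:
    transcript: str
    confidence: float                        # 0.0-1.0, as reported by the ASR system

def should_execute(primary: AsrResult, secondary: AsrResult, intent: str) -> bool:
    """Gate execution on confidence plus cross-validation between two ASR passes."""
    if primary.confidence < CONFIDENCE_FLOOR:
        return False                         # low confidence: ask the user to confirm
    if intent in CRITICAL_INTENTS:
        # Cross-validate critical or irreversible commands against a second pass
        # (a different model, or the same audio re-decoded); require agreement.
        return primary.transcript.strip().lower() == secondary.transcript.strip().lower()
    return True

# A critical intent with disagreeing transcripts is held for user confirmation.
a = AsrResult("unlock the front door", 0.97)
b = AsrResult("unlock the fridge door", 0.95)
print(should_execute(a, b, "unlock_door"))   # False -> do not actuate, confirm first
```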
Source: https://arxiv.org/abs/2603.25727
🟠 SIGNAL #2 — AI AGENTS · HIGH
Multimodal AI Models Contradict Themselves. RC2 Fixes It.
arXiv:2603.25720 · RC2 Framework · 2026-03-26 · cs.AI
Current multimodal models violate a basic requirement of reliable reasoning: they produce contradictory outputs for the same concept depending on which modality they process it through. A visual representation of a finding and a textual description of the same finding can yield conflicting predictions from the same model.
Standard voting mechanisms used to reconcile disagreements between modalities can amplify systematic biases rather than resolve them.
RC2 (Cycle-Consistent Reinforcement Learning) treats cross-modal inconsistency as a learning signal rather than masking it. The model is required to perform backward inference, switch modalities, and reliably reconstruct the answer through forward inference. The cyclic constraint forces internal representation alignment — no labeled data required. Accuracy improvement: up to 7.6 points across benchmarks.
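As a rough illustration of that cyclic constraint (not the paper's actual training loop or reward), here is a toy sketch. The forward and backward hooks, and the stubs that stand in for them, are assumptions about how your own model would be wrapped.

```python
# Toy sketch of a cycle-consistency signal: answer in modality A, backward-infer
# the input into modality B, forward-infer again, and reward agreement.
# The forward/backward hooks and stubs below are placeholders for a real model;
# none of these names come from the RC2 paper.

def cycle_consistency_reward(forward, backward, modality_a, modality_b, x) -> float:
    answer = forward(modality_a, x)               # forward inference in modality A
    x_other = backward(modality_b, answer)        # backward inference into modality B
    answer_cycled = forward(modality_b, x_other)  # forward inference in modality B
    # Agreement across the cycle is the label-free learning signal.
    return 1.0 if answer == answer_cycled else 0.0

# Stand-in stubs so the sketch executes end to end; replace with real model calls.
def stub_forward(modality, x):
    return "pneumothorax" if "collapsed lung" in str(x) else "unknown"

def stub_backward(modality, answer):
    return f"<{modality} rendering of: collapsed lung>" if answer == "pneumothorax" else ""

print(cycle_consistency_reward(stub_forward, stub_backward,
                               "text", "image", "report notes a collapsed lung"))  # 1.0
```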
For Physical AI: medical diagnosis AI, legal analysis agents, and financial decision systems processing both text and visual inputs are directly exposed to cross-modal inconsistency errors. RC2-style cycle consistency enforcement is worth treating as a production requirement for these systems.
What to do:
Audit deployed multimodal agents for cross-modal consistency — this failure mode is rarely tested
Evaluate RC2-style cycle consistency training for high-stakes applications
Add automatic detection of modality-specific disagreements with escalation to human review (sketched below)
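A minimal sketch of that last item, assuming the deployed agent can be queried once per modality and returns a label with a confidence score; the escalation log stands in for whatever review queue you already run.

```python
# Hypothetical cross-modal disagreement check for a deployed multimodal agent.
# Assumes per-modality queries that return (label, confidence); the escalation
# log is a stand-in for your own human-review queue.

ESCALATION_LOG: list = []

def reconcile(text_pred, image_pred, case_id):
    """Act on a prediction only when modalities agree; otherwise escalate."""
    text_label, _ = text_pred
    image_label, _ = image_pred
    if text_label == image_label:
        return text_label                    # modalities agree: safe to act on
    # Disagreement is not voted away; it is logged and routed to human review.
    ESCALATION_LOG.append({"case": case_id, "text": text_pred, "image": image_pred})
    return None

print(reconcile(("benign", 0.81), ("malignant", 0.78), "scan-042"))  # None -> human review
```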
Source: https://arxiv.org/abs/2603.25720
🟠 SIGNAL #3 — PHYSICAL AI · HIGH
SoftMimicGen Breaks the Data Bottleneck for Deformable Robot Manipulation
arXiv:2603.25725 · SoftMimicGen · 2026-03-26 · cs.RO · cs.AI
The single largest obstacle to deploying robots in home assistance, warehouse automation, and surgical settings has been data: deformable objects — soft toys, ropes, surgical drapes, towels — resist the simulation frameworks that work for rigid manipulation. Real-world demonstration collection is slow, expensive, and does not scale.
SoftMimicGen is an automated data generation pipeline that resolves this bottleneck. It covers four robot embodiments: single-arm, bimanual, humanoid, and surgical. Tasks span high-precision threading, dynamic swinging, folding, and pick-and-place across soft materials. Synthetic simulation data reduces real-world data requirements by 10x or more.
The deployment implication is direct: teams building robots for care settings, home assistance, or surgical assistance can now train capable deformable manipulation policies without the data collection overhead that previously blocked deployment.
⚠️ Safety note: simulation-to-reality transfer validation is required before production use. Synthetic data does not capture all failure modes of real-world deformable contact. Implement phased rollout with automatic performance monitoring against real-world success metrics.
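A sketch of what that phased rollout can look like in practice, assuming you log a per-episode success flag for the deployed policy; the stage names, sample sizes, and thresholds are illustrative, not values from the paper.

```python
# Illustrative phased-rollout gate for a sim-trained deformable-manipulation policy.
# Assumes a boolean success flag is logged per real-world episode; the stages,
# sample sizes, and thresholds are placeholders to be set per deployment.

ROLLOUT_STAGES = [
    {"name": "shadow",     "min_episodes": 200, "min_success": 0.95},  # policy observed, not acting
    {"name": "supervised", "min_episodes": 500, "min_success": 0.92},  # human can intervene
    {"name": "autonomous", "min_episodes": 0,   "min_success": 0.0},   # full deployment
]

def next_stage(current: str, episode_successes: list) -> str:
    """Advance one stage only when real-world success clears the current gate."""
    idx = next(i for i, s in enumerate(ROLLOUT_STAGES) if s["name"] == current)
    if idx == len(ROLLOUT_STAGES) - 1:
        return current                                   # already fully deployed
    gate = ROLLOUT_STAGES[idx]
    n = len(episode_successes)
    rate = sum(episode_successes) / n if n else 0.0
    if n >= gate["min_episodes"] and rate >= gate["min_success"]:
        return ROLLOUT_STAGES[idx + 1]["name"]
    return current                                       # hold: sim-to-real gap not closed yet

print(next_stage("shadow", [True] * 195 + [False] * 5))  # 'supervised' (97.5% over 200 episodes)
```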
Source: https://arxiv.org/abs/2603.25725
🔵 SIGNAL #4 — PHYSICAL AI · MEDIUM
Educational AI Has an Assessment Problem — And It Is Not the One You Expect
arXiv:2603.25633 · 2026-03-26 · cs.AI
The finding is specific and measurable: LLMs assess student work on mathematics problems they can themselves solve correctly with substantially higher accuracy than on problems they get wrong. Assessment accuracy is directly tied to problem-solving performance. Assessment remains harder than direct problem solving.
This creates a systematic blind spot. The problems a model struggles to assess accurately are precisely those where student errors are most likely to occur — and where reliable step-level feedback matters most.
An agent that achieves 90% on math problem-solving benchmarks is not a 90% reliable assessor of student work. Incorrect assessment that validates a student's wrong approach — or flags a correct one as wrong — produces measurable learning harm over time.
What to audit:
Measure the gap between assessment accuracy and problem-solving performance for your deployed educational agent (a minimal sketch follows this list)
Test assessment reliability specifically on items the model cannot solve — this is the high-risk zone
Implement step-level confidence scoring with human review escalation for edge cases
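A minimal sketch of the first audit item above, assuming an evaluation log that records, per item, whether the model solved the problem and whether its assessment of a student solution was correct; the records shown are placeholders for your own eval data.

```python
# Sketch of the assessment-vs-solving gap audit. Each record notes whether the
# model solved the item and whether its judgment of a student solution was
# correct; the rows below are placeholders, not results from the paper.

records = [
    {"solved": True,  "assessment_correct": True},
    {"solved": True,  "assessment_correct": True},
    {"solved": True,  "assessment_correct": False},
    {"solved": False, "assessment_correct": True},
    {"solved": False, "assessment_correct": False},
    {"solved": False, "assessment_correct": False},
]

def assessment_accuracy(rows):
    return sum(r["assessment_correct"] for r in rows) / len(rows) if rows else float("nan")

solved   = [r for r in records if r["solved"]]
unsolved = [r for r in records if not r["solved"]]

print(f"assessment accuracy, problems the model solves:       {assessment_accuracy(solved):.2f}")
print(f"assessment accuracy, problems the model cannot solve: {assessment_accuracy(unsolved):.2f}")
# The second number is the high-risk zone flagged above; track the gap over time.
```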
Source: https://arxiv.org/abs/2603.25633
📡 INDUSTRY MOVES — MARCH 2026
Three developments this week that shift the landscape for Physical AI teams.
OpenAI Launches a Dedicated Safety Bug Bounty
openai.com/index/safety-bug-bounty/ · 2026-03-25
Traditional software bug bounty programs target infrastructure vulnerabilities. OpenAI's safety-specific program addresses a harder-to-define surface: prompt injection, jailbreaks, unsafe output generation, and agent manipulation — the attack classes that matter for Physical AI deployment.
The significance is not just the program itself. It is the acknowledgment that internal red teams cannot independently surface all relevant vulnerabilities at scale, and that the external security research community's contribution is necessary, not optional.
For teams without equivalent external reporting channels: the competitive security posture gap is widening. Physical AI-specific attack surfaces — voice injection, vision manipulation, guidance injection — warrant dedicated scope definition in any disclosure program.
OpenAI Details Its Model Spec Approach
openai.com/index/our-approach-to-the-model-spec/ · 2026-03-25
The Model Spec gives the industry a structured, published framework for defining AI behavior boundaries: a second major specification paradigm alongside Constitutional AI-style approaches.
The compliance implication: as the EU AI Act and other frameworks mature, explicit behavioral specifications are becoming required documentation. Teams operating cross-vendor Physical AI pipelines may face behavioral specification conflicts that neither vendor's spec anticipates. Physical AI action spaces — voice, motion, sensor fusion — extend beyond what conversational AI specs were designed to cover. Document the gaps now, before regulatory review requires it.
Google DeepMind: Gemini 3.1 Flash Live and an AGI Progress Framework
deepmind.google · March 2026
Two separate but related moves. Gemini 3.1 Flash Live advances natural audio AI in ways that raise the baseline for voice-based Physical AI applications — and that warrant a re-audit of identity leakage protections in healthcare and elder care deployments. Higher-quality audio capture encodes more identity information, not less. Previous privacy analyses may no longer hold at the new capability level.
Separately: DeepMind's proposed cognitive framework for measuring AGI progress is positioned as a candidate industry standard for capability assessment. Safety frameworks will increasingly need to demonstrate effectiveness against standardized benchmarks. Early alignment with emerging measurement standards positions teams before procurement and regulatory adoption accelerates.
⚠️ REGULATORY WATCH — EU AI ACT
Article 5 Enforcement Deadline: August 2, 2026
127 days remaining.
No change to the Article 5 picture this week. The clock is running. If you started your system inventory and prohibited practices review after Issue #001, you have used one week of your buffer. If you have not started: the four actions from last week stand.
Complete AI system inventory and intended use documentation
Conduct Article 5 prohibited practices review against each system
Begin Article 9 risk management documentation for high-risk systems
Establish post-market monitoring and incident reporting procedures
The companion robot and educational AI exposure noted in Issue #001 is worth re-reading alongside this week's Signal #4. An educational AI that assesses student work incorrectly — and does so systematically — sits at the intersection of Article 5 (behavior distortion in vulnerable groups) and Article 9 (high-risk system risk management). The research this week makes that exposure more concrete.
The pattern across this week's signals: benchmark accuracy does not equal production reliability. ASR hallucinates under real-world conditions. Multimodal models contradict themselves under real-world inputs. Educational AI assesses unreliably on the exact items where students need reliable assessment most. The gap between what passes evaluation and what fails in deployment is where Physical AI safety incidents originate.
We will keep watching it.
No financial relationship with any AI company, hardware manufacturer, or standards body. We don't certify. We don't consult. We watch. Credentialed press at HumanX 2026.
→ sentinelbase.ai · [email protected]
