by Dr. Ajaz Hussain & Mr. Ram Balani


The Deployment of Elsa at the FDA: Regulatory Concerns, Potential Consequences, and a Path Forward

A critical look at FDA’s AI rollout, its pitfalls, and a SMART framework for responsible regulatory AI in life sciences.

In June 2025, the U.S. Food and Drug Administration (FDA) launched Elsa, a generative AI copilot designed to streamline regulatory workflows. While initially celebrated, Elsa’s deployment quickly revealed critical flaws—hallucinated citations, omitted safety data, and governance gaps. This article examines Elsa’s shortcomings, contextualizes them within broader AI governance trends, and proposes a SMART (Specific, Measurable, Achievable, Relevant, Time-bound) framework to guide responsible AI integration in life sciences. Drawing on insights from “Illuminating the Dark Side of Pharma Algorithms”, we offer a path toward transparent, auditable, and ethically grounded regulatory AI.

 

Introduction: A Promising Launch, A Troubled Trajectory

On June 2, 2025, the FDA issued a press release announcing the agency-wide launch of Elsa, highlighting that the tool went live "ahead of schedule and under budget" after a successful pilot [1]. Built within a secure GovCloud environment, Elsa was expected to assist with summarizing adverse-event reports, comparing labels, and drafting code—freeing reviewers to focus on high-value scientific judgments. Yet by late July, multiple media outlets reported staff frustration with fabricated citations and missing safety data [2–6].

 

When AI Undermines Regulatory Fidelity: Failure Modes

Concerns about Elsa’s reliability emerged almost immediately, particularly around hallucinated citations. Similar challenges have been observed in broader LLM evaluations, where AI-generated content confers undue credibility to unsupported claims, distorting benefit–risk assessments before the errors are detected [7, 8]. While a formal FDA review is pending, our analysis draws on these early concerns to highlight critical fault lines between technological enthusiasm and regulatory reliability. We outline below the potential failure modes reported.

Table 1. FDA Elsa flaws or missteps and potential failure modes reported in the media

Patient safety is the cornerstone of FDA regulatory review. Ironically, Elsa, a tool intended to accelerate and improve regulatory processes, may have introduced new risks related to trust, transparency, and operational efficiency. Notably, Elsa was launched on the assumption of satisfactory accuracy on legacy benchmarks and of responsible human oversight. Its rollout, absent rigorous testing or transparency, may be a “canary in the coal mine,” signaling that even elite public health institutions are vulnerable to algorithmic drift and degraded decision-making.

 

Is U.S. AI Governance Too Lax for Public Health?

Unlike the EU AI Act, which entered into force in 2024 and imposes legally binding obligations on high-risk systems on a phased timeline [9], U.S. policy emphasizes advancing AI innovation to support economic competitiveness and national security [10]. Absent the kind of safeguards promulgated in the EU framework, there is growing concern that high-risk AI tools can be deployed without rigorous testing, transparency, or accountability.

 

Good Linguistic Practices (GLP): Language as a Regulatory Tool

In pharmaceutical regulation, ambiguous or misleading language can erode trust and compromise compliance. We advocate extending GLP beyond the laboratory to language itself: Good Linguistic Practices (GLP), the structured use of clear, consistent, and contextually anchored language in AI-assisted regulatory workflows [8]. This includes not only improved writing but also traceability, validation, and alignment with existing legal and regulatory standards. GLP is not just about clarity; it is about accountability.

 

The Dueling Banjo Methodology: Human + AI Collaboration

To safeguard integrity in regulatory AI, we recommend the Dueling Banjo Methodology—a structured, iterative model that reinforces human-AI co-development:

1. Initial AI Drafting: Guided RAG prompting steers AI-generated responses; for example, an FDA Copilot or genAI assistant grounded in content from www.fda.gov and www.ecfr.gov.

2. Expert Review & Refinement: Domain experts assess factual accuracy and contextual relevance, with on-demand, human-in-the-loop full-text search of published FDA.gov regulations (guidance PDFs and 21 CFR sections) available from the same application UI, so that LLM bias or errors can be caught before they propagate.

3. AI Re-iteration: The draft is revised using expert feedback in a back-and-forth human-machine interplay with the genAI system, with fact-checking invoked whenever desired.

4. Final Verification: Outputs are audited against compliance benchmarks, for example by submitting the same prompts and instructions to a second foundation model alongside the Anthropic Claude model used by the FDA’s Elsa.

This cycle continues until the AI-generated outputs meet quality, transparency, and regulatory readiness standards. Importantly, human reviewers must be trained not only in the regulatory domain but also in AI prompt engineering and validation workflows.
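To make the loop concrete, the following is a minimal Python sketch. The helper functions draft_with_rag, collect_expert_feedback, and audit_against_benchmarks are hypothetical placeholders standing in for an organization's own RAG pipeline, reviewer interface, and cross-model audit; they do not reflect Elsa's or eSTARHelper's actual implementation.

```python
# Minimal sketch of the Dueling Banjo loop (illustrative only).
# draft_with_rag(), collect_expert_feedback(), and audit_against_benchmarks()
# are hypothetical placeholders, not Elsa or eSTARHelper internals.

from dataclasses import dataclass, field

@dataclass
class Draft:
    text: str
    citations: list[str] = field(default_factory=list)

def draft_with_rag(prompt: str, model: str) -> Draft:
    # Step 1: RAG-grounded drafting against fda.gov / ecfr.gov content (placeholder).
    return Draft(text=f"[{model} draft for: {prompt}]")

def collect_expert_feedback(draft: Draft) -> list[str]:
    # Step 2: human reviewer annotations; an empty list means "approved" (placeholder).
    return []

def audit_against_benchmarks(draft: Draft, second_model: str) -> bool:
    # Step 4: re-run the same prompt on a second foundation model and compare (placeholder).
    return True

def dueling_banjo(prompt: str, primary: str, secondary: str, max_rounds: int = 5) -> Draft:
    draft = draft_with_rag(prompt, primary)              # Step 1: initial AI drafting
    for _ in range(max_rounds):
        feedback = collect_expert_feedback(draft)        # Step 2: expert review and refinement
        if not feedback:
            break
        # Step 3: AI re-iteration, folding reviewer feedback back into the prompt
        draft = draft_with_rag(prompt + "\nReviewer feedback:\n" + "\n".join(feedback), primary)
    if not audit_against_benchmarks(draft, secondary):   # Step 4: final cross-model verification
        raise RuntimeError("Draft failed compliance audit; escalate to human reviewers")
    return draft

if __name__ == "__main__":
    result = dueling_banjo("Summarize 21 CFR 820.30 design control requirements",
                           primary="model-a", secondary="model-b")
    print(result.text)
```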

 

Model Context Protocol (MCP): A Regulatory Backbone

Figure 1 illustrates a modern architecture integrating SharePoint Online (hosting FDA documents), eSTARHelper/FDA Copilot (powered by ChatGPT-4.0), and Microsoft Foundry indexed content, all structured via the Model Context Protocol (MCP). 

Figure 1. eSTARHelper’s current and future FDA Copilot tech stack: man-machine collaboration with a human in the loop, supported by autonomous agentic AI (a) tools, (b) resources, and (c) prompts exposed via the MCP protocol

MCP is an open-source standard that enables LLMs to interact predictably with structured data, databases, and tools—akin to a USB-C port for AI. It supports plug-and-play access to validated resources, dynamic context injection, and real-time data verification. By incorporating MCP with tools like SmartSearch+, LLM-based assistants like Elsa can reliably reference FDA guidance documents and 21 CFR regulations issued by CDER, CBER, and CDRH.
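As an illustration, here is a minimal sketch of an MCP server exposing validated FDA reference content, assuming the FastMCP interface from the open-source MCP Python SDK. The guidance index and CFR lookup are hypothetical placeholders (for example, a SmartSearch+-style validated index), not an actual Elsa or FDA integration.

```python
# Minimal sketch of exposing validated FDA reference content through MCP.
# Assumes the FastMCP interface from the open-source MCP Python SDK;
# search_guidance_index() and the CFR text store are hypothetical placeholders.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("fda-reference-server")

def search_guidance_index(query: str) -> list[dict]:
    # Placeholder: query a validated index of FDA guidance documents and 21 CFR text.
    return [{"title": "Example guidance", "citation": "21 CFR 820.30",
             "url": "https://www.fda.gov/"}]

@mcp.tool()
def find_guidance(query: str) -> list[dict]:
    """Return guidance documents and CFR sections matching the query, with citations."""
    return search_guidance_index(query)

@mcp.resource("cfr://{part}/{section}")
def get_cfr_section(part: str, section: str) -> str:
    """Serve the text of a 21 CFR section from a validated local copy (placeholder)."""
    return f"[validated text of 21 CFR {part}.{section} would be returned here]"

if __name__ == "__main__":
    mcp.run()  # serves the tool and resource to any MCP-capable LLM client
```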

While Retrieval-Augmented Generation (RAG) grounds LLM responses in external data, it is limited to passive lookups. Function calling allows LLMs such as GPT-4 to invoke external APIs or software tools, executing specific tasks or accessing live data, but it lacks robust context management. MCP bridges both, enabling structured reasoning, tool invocation, and full traceability [11], attributes essential to the regulatory mission of the FDA. In regulatory contexts, this capability allows LLMs to query validated databases or execute rule-based assessments, ensuring responses are grounded in structured logic, not just language generation.
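For contrast, the following is a minimal function-calling sketch using the OpenAI Python SDK. The model name and the check_cfr_citation tool are purely illustrative of a rule-based assessment, not part of Elsa or the eSTARHelper Copilot.

```python
# Minimal sketch of function calling for a rule-based regulatory check.
# Assumes the OpenAI Python SDK (>= 1.0); the model name and the
# check_cfr_citation tool are illustrative assumptions only.

import json
import re
from openai import OpenAI

client = OpenAI()

def check_cfr_citation(citation: str) -> dict:
    # Rule-based check: does the string look like a well-formed 21 CFR citation?
    ok = bool(re.fullmatch(r"21 CFR \d+(\.\d+)?", citation.strip()))
    return {"citation": citation, "well_formed": ok}

tools = [{
    "type": "function",
    "function": {
        "name": "check_cfr_citation",
        "description": "Validate that a citation is a well-formed 21 CFR reference.",
        "parameters": {
            "type": "object",
            "properties": {"citation": {"type": "string"}},
            "required": ["citation"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Is '21 CFR 820.30' a valid design-controls citation?"}],
    tools=tools,
)

# If the model chose to call the tool, execute it locally and inspect the result.
for call in response.choices[0].message.tool_calls or []:
    args = json.loads(call.function.arguments)
    print(check_cfr_citation(**args))
```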

 

From Blind Spots to SMART Governance

Building on “Illuminating the Dark Side of Pharma Algorithms” [12], we revisit the SMART framework to address algorithmic opacity and ethical blind spots, a framework that transforms abstract principles into concrete safeguards.

 

Specificity begins with clearly defining the scope of AI tools and their decision boundaries. Generative models must not encroach on scientific judgment or regulatory interpretation without explicit constraints. By delineating what AI can—and cannot—do, agencies can prevent overreach and preserve domain expertise.

Measurability ensures that AI outputs are benchmarked against validated datasets and compliance standards. This allows reviewers to detect hallucinations, omissions, or drift early in the process, before flawed outputs influence regulatory decisions. Metrics must be tied to real-world performance, not just internal model confidence; a minimal sketch of such a check appears at the end of this section.

Achievability requires aligning AI capabilities with the actual workflows and training of human reviewers. Tools like Elsa should augment—not overwhelm—existing processes. Unrealistic expectations or poorly integrated systems risk misuse, frustration, and erosion of trust.

Relevance emphasizes that AI should support, not supplant, human expertise. In high-stakes domains like pharmaceutical regulation, contextual nuance and ethical judgment are irreplaceable. AI must be designed to enhance human decision-making, not automate it away.

Time-boundedness introduces regular audit cycles and bias surveillance protocols. AI systems evolve, and without ongoing monitoring, even well-trained models can degrade or drift. Scheduled reviews ensure that performance remains aligned with regulatory standards and public health priorities.

Together, these principles form a resilient backbone for regulatory AI. They are further strengthened by practices highlighted in “Illuminating the Dark Side of Pharma Algorithms”, including ALCOA+ rigor for data provenance, explainability by design, and cross-functional governance boards. By embedding SMART governance into the DNA of AI deployment, life sciences organizations can move from reactive oversight to proactive stewardship—transforming cautionary tales into benchmarks of trust.
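To make the Measurability principle concrete, the following is a minimal sketch of a pre-review check that flags citations not found in a validated reference set, the kind of hallucination screen Elsa's reviewers reportedly had to perform by hand. The reference names and the bracket-based extraction pattern are illustrative assumptions, not a description of Elsa's internals.

```python
# Minimal sketch of a Measurability check: verify that every reference cited in an
# AI-generated summary exists in a validated bibliography before human review.
# The reference set and bracket-citation pattern are illustrative placeholders.

import re

VALIDATED_REFERENCES = {
    "Smith 2023",
    "Example Guidance 2022",
}

def extract_citations(text: str) -> list[str]:
    # Pull bracketed citation strings such as [Smith 2023] from generated text.
    return re.findall(r"\[([^\[\]]+)\]", text)

def flag_unsupported_citations(generated_text: str) -> list[str]:
    # Return citations that do not appear in the validated reference set.
    return [c for c in extract_citations(generated_text) if c not in VALIDATED_REFERENCES]

draft = "Design history file requirements are discussed in [Smith 2023] and [Jones 2021]."
print(flag_unsupported_citations(draft))  # ['Jones 2021'] -> possible hallucinated citation
```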

 

Conclusion

The path forward is not just technical—it’s ethical, linguistic, and institutional. Elsa may have stumbled, but with the right guardrails, the next generation of regulatory AI can walk with confidence.

By employing a human-in-the-loop approach, which we describe as the Dueling Banjo Methodology and which eSTARHelper implements in its genAI FDA Copilot and SmartSearch+, pharmaceutical professionals can explore the utility of LLMs within the realm of regulatory science. Such an approach can be strengthened further by leveraging standards such as MCP, transforming what is currently a cautionary example into a benchmark for digital trust and leadership in public health.

 

References

1. FDA Launches Agency-Wide AI Tool to Optimize Performance for the American People. FDA. (Accessed 07/31/2025).

2. FDA’s Artificial Intelligence Is Supposed to Revolutionize Drug Approvals. It’s Making Up Nonexistent Studies. CNN Politics. (Accessed 07/31/2025).

3. FDA's Elsa AI Hallucinates Studies, Says Employees. Tech Times. (Accessed 07/31/2025).

4. FDA’s Elsa AI Tool Raises Accuracy and Oversight Concerns. Applied Clinical Trials. (Accessed 07/31/2025).

5. FDA's New Drug Approval AI Is Generating Fake Studies: Report. Gizmodo. (Accessed 07/31/2025).

6. FDA's AI Rollout Raises Questions Around Readiness, Legality. BioSpace. (Accessed 07/31/2025).

7. Enhancing Regulatory Communication and Decision-Making with an AI/ML FDA Copilot and SmartSearch+. Pharma Now. (Accessed 07/31/2025).

8. Moving Towards Good Linguistic Practices in AI-Assisted Regulatory Communications. Pharma Now. (Accessed 07/31/2025).

9. AI Act—Shaping Europe's Digital Future. European Union. (Accessed 07/31/2025).

10. Removing Barriers to American Leadership in Artificial Intelligence. The White House. (Accessed 07/31/2025).

11. Enhancing Clinical Decision Support and EHR Insights through LLMs and the Model Context Protocol: An Open-Source MCP-FHIR Framework. arXiv:2506.13800 [cs.SE], June 2025. (Accessed 07/31/2025).

12. Illuminating The Dark Side of Pharma Algorithms. Pharma Now. (Accessed 07/31/2025).

Author Profile

Dr. Ajaz Hussain & Mr. Ram Balani

Industry Experts
