19 January 2026
The publication of the joint EMA–FDA Guiding Principles of Good AI Practice in Drug Development is an important signal: regulators recognize that artificial intelligence (AI) is now embedded across the product development lifecycle, and that global alignment is both necessary and feasible. These principles, published in January 2026, build on EMA’s earlier holistic and exploratory AI Reflection Paper (2024) and FDA’s ongoing technical workstream on AI use in regulatory decision-making.1,2,3
The new EMA–FDA document is intentionally high-level and principles-based. That is understandable, but it also creates a practical implementation gap: without clear operational expectations, organizations may interpret “risk-based,” “fit-for-purpose,” and “governance” in divergent ways and still claim compliance.
How, then, can organizations bridge the gap between non-prescriptive regulatory principles and an auditable validation approach that does not suffocate innovation?
The core argument is simple – regulators should avoid prescribing methods, but they can define the minimum level of specificity needed to make AI assurance testable, comparable, and enforceable.
Here are nine key observations and recommendations on turning high-level principles into auditable, low-burden controls.
The principles refer repeatedly to “risk-based approaches,” “context of use,” and “model risk.” In practice, AI spans very different system types, each of which calls for a different assurance strategy.
A validation approach that works for a fixed, reproducible pipeline (e.g., classical machine learning running on controlled data with fixed features and deterministic inference) does not map cleanly to modern generative AI (GenAI) systems where:
We can, for example, classify systems by assurance-relevant properties that remain stable across generations of technology:
This kind of taxonomy is non-prescriptive, yet it forces sponsors to adopt different qualification and change-control expectations depending on the system class. A similar approach has already been adopted by FDA in the Good Machine Learning Practice (GMLP) framework and Software as a Medical Device (SaMD) categorization, where systems are not classified by algorithm, but by their control boundaries, update mechanisms, and risk implications.4
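As a purely illustrative sketch, a sponsor might encode such a taxonomy as machine-readable, assurance-relevant properties. The class names, attributes, and example systems below are assumptions chosen for illustration, not categories defined by the EMA–FDA principles.

```python
# Hypothetical sketch of an assurance-relevant system taxonomy.
# Class names and properties are illustrative assumptions, not regulatory categories.
from dataclasses import dataclass
from enum import Enum


class UpdateMechanism(Enum):
    FROZEN = "frozen"                  # model weights fixed at qualification
    SPONSOR_CONTROLLED = "sponsor"     # retraining under sponsor change control
    VENDOR_CONTROLLED = "vendor"       # external provider updates the model


@dataclass(frozen=True)
class SystemClass:
    name: str
    deterministic_inference: bool      # same input -> same output?
    update_mechanism: UpdateMechanism
    internals_inspectable: bool        # can the sponsor audit weights / training data?

    def requires_runtime_monitoring(self) -> bool:
        """Non-deterministic or vendor-updated systems need continuous controls."""
        return (not self.deterministic_inference
                or self.update_mechanism is UpdateMechanism.VENDOR_CONTROLLED)


CLASSICAL_ML = SystemClass("fixed classical ML pipeline", True,
                           UpdateMechanism.SPONSOR_CONTROLLED, True)
HOSTED_GENAI = SystemClass("hosted generative AI service", False,
                           UpdateMechanism.VENDOR_CONTROLLED, False)
```

The point of such an encoding is that qualification and change-control expectations can be attached to the properties rather than to any specific algorithm.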
Principle 9 rightly recognizes that AI systems evolve over time and that validation cannot be a one-time exercise. Qualification must therefore include controls for drift, corpus evolution, and runtime risk, treating the model not as a static tool but as a living system.
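As one hedged illustration of such a runtime control, a sponsor might monitor a score or input distribution against its qualification baseline with the population stability index (PSI); the 0.2 alert threshold below is a common heuristic assumed for the example, not a regulatory value.

```python
# Minimal sketch of a runtime drift control: population stability index (PSI)
# comparing a monitored distribution against the qualification baseline.
import numpy as np


def population_stability_index(baseline: np.ndarray,
                               current: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between a baseline sample (from qualification) and current production data."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the proportions to avoid division by zero / log(0).
    base_pct = np.clip(base_pct, 1e-6, None)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))


baseline_scores = np.random.default_rng(0).normal(0.7, 0.1, 5_000)
current_scores = np.random.default_rng(1).normal(0.65, 0.12, 5_000)
psi = population_stability_index(baseline_scores, current_scores)
if psi > 0.2:  # assumed alert threshold triggering change-control review
    print(f"Drift alert: PSI={psi:.3f} exceeds the validated operating envelope")
```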
What “risk” means for GenAI is not the same as for classical predictive models:
For GenAI, the regulatory-relevant risks often include:
The point is not to prescribe specific metrics (e.g., precision or calibration scores), but to define the domains of risk that must be addressed and evidenced. This is the minimum necessary specificity to make “risk-based” meaningful for non-deterministic systems. FDA’s draft guidance already points to the importance of evidence transparency, traceability, and auditability for AI models used in drug evaluation workflows, though detailed risk vocabularies remain undeveloped.3
Modern AI deployments frequently operate under a paradigm that traditional validation assumptions do not cover well. For example:
This is not inherently incompatible with compliance, but it shifts the control boundary. Sponsors need a structured way to demonstrate assurance when internals are inaccessible. EMA’s AI Reflection Paper acknowledges this reality, noting that sponsors may need to validate AI-supported processes where full insight into the model architecture or training data is not available, especially in cloud-hosted contexts.2
A practical answer is to treat the external foundation model as a “black-box component,” while building validated controls around it:
Regulators do not need to mandate specific vendor requirements, but they can state that external dependence increases the need for interface-level validation and compensating controls. This paradigm shift is not unprecedented: regulated cloud services, interactive response technology (IRT) systems, and electronic trial master files (eTMFs) already operate under shared-responsibility models. GenAI should follow similar principles.
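A minimal sketch of what interface-level validation around a black-box model could look like is shown below; `call_external_model`, the closed decision set, and the field names are hypothetical placeholders, and the checks (not the vendor call) are the validated surface the sponsor owns.

```python
# Illustrative sketch of interface-level controls around an external foundation model.
import json

ALLOWED_DECISIONS = {"relevant", "not_relevant", "needs_human_review"}


def call_external_model(prompt: str) -> str:
    raise NotImplementedError("placeholder for the vendor API client")


def controlled_triage(prompt: str) -> dict:
    """Run the black-box model, then enforce interface-level acceptance checks."""
    raw = call_external_model(prompt)
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        return {"decision": "needs_human_review", "reason": "unparseable output"}

    # Compensating controls: closed decision set and mandatory evidence pointers.
    if result.get("decision") not in ALLOWED_DECISIONS:
        return {"decision": "needs_human_review", "reason": "decision outside closed set"}
    if not result.get("evidence_ids"):
        return {"decision": "needs_human_review", "reason": "missing evidence pointers"}
    return result
```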
A major source of “validation impossibility” for GenAI is unstructured text. Evaluating the quality of free-form summaries or narratives at scale is difficult, subjective, and often non-repeatable. This is where a practical design decision can dramatically improve auditability.
Use large language models (LLMs) primarily at the “reasoning layer,” but constrain outputs
If the LLM is asked to produce structured JavaScript Object Notation (JSON) output with closed sets (e.g., enums, booleans, bounded reason codes, evidence pointers), you gain:
This does not make the underlying model deterministic, but it enables system-level predictability and controlled decision-support.
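For illustration, such a constrained output contract might be expressed as a JSON Schema that every model response must satisfy before entering the regulated workflow; the field names and enum values below are assumptions, not recommended content.

```python
# Sketch of a constrained output contract: the LLM must return JSON matching a
# closed schema rather than free text. Field names and enum values are illustrative.
OUTPUT_SCHEMA = {
    "type": "object",
    "required": ["case_relevant", "reason_code", "evidence_ids"],
    "properties": {
        "case_relevant": {"type": "boolean"},
        "reason_code": {"enum": ["adverse_event", "off_label_use",
                                 "no_safety_signal", "insufficient_information"]},
        "evidence_ids": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    },
    "additionalProperties": False,
}

# With a JSON Schema validator (e.g., the `jsonschema` package) the contract
# becomes an executable acceptance test for every model response:
#   jsonschema.validate(instance=model_output, schema=OUTPUT_SCHEMA)
```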
Important nuance: binary output is not automatically stable
Even true/false decisions can flip under small input perturbations or changes in retrieval components or provider model versions. Therefore, qualification must focus on stability within a validated operating envelope, not only on whether the output format is measurable.
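A sponsor could evidence this with a simple flip-rate check over in-envelope perturbations. The sketch below assumes hypothetical `classify` and `perturb` hooks into the sponsor’s own pipeline; any acceptance threshold would be the sponsor’s to justify, not a regulatory value.

```python
# Minimal sketch of an output-stability check: measure how often a binary decision
# flips when the input is perturbed within the validated operating envelope.
from typing import Callable, Sequence


def flip_rate(classify: Callable[[str], bool],
              perturb: Callable[[str], Sequence[str]],
              cases: Sequence[str]) -> float:
    """Fraction of cases whose decision changes under any in-envelope perturbation."""
    flipped = 0
    for text in cases:
        baseline = classify(text)
        if any(classify(variant) != baseline for variant in perturb(text)):
            flipped += 1
    return flipped / len(cases)

# A qualification protocol might then assert, for example, flip_rate(...) <= 0.02
# on a held-out challenge set, with the threshold justified by the sponsor.
```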
The most pragmatic regulatory contribution would be guidance on:
Why design space matters: for GenAI, exhaustive testing of the input space is impossible. Qualification must therefore be:
What a regulator can specify without becoming prescriptive
Regulators may define what sponsors must document:
Such documentation is not method-prescriptive, but auditable and harmonizable.
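As a non-prescriptive illustration, the documented operating envelope could itself be captured as a machine-readable artifact; the fields and values below are assumptions, not required content.

```python
# Illustrative sketch of an operating-envelope record; field names and values
# are assumptions a sponsor would define and justify.
from dataclasses import dataclass


@dataclass
class OperatingEnvelope:
    intended_use: str
    input_domains: list[str]             # document types / languages in scope
    performance_floor: dict[str, float]  # metric name -> minimum acceptable value
    monitoring: dict[str, float]         # runtime indicator -> alert threshold
    change_triggers: list[str]           # events that force requalification


LITERATURE_TRIAGE = OperatingEnvelope(
    intended_use="first-pass literature triage with human review of all exclusions",
    input_domains=["English-language abstracts", "PubMed-indexed journals"],
    performance_floor={"recall_adverse_events": 0.98},
    monitoring={"weekly_flip_rate": 0.02, "retrieval_recall_at_10": 0.95},
    change_triggers=["provider model version change", "new therapeutic area",
                     "corpus source added or removed"],
)
```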
Explainability is essential, but “raw chain-of-thought” or “thinking output logs” are not a robust basis for regulated assurance because they are not guaranteed to be faithful explanations of model behavior, are hard to standardize and evaluate, and can introduce data handling and privacy risks.
A stronger, regulator-aligned approach would define explainability as:
If a narrative explanation is provided, it should be an auxiliary reviewer aid, not the primary validation artifact. This aligns with recent stakeholder feedback to EMA, which reframes explainability as traceability to structured inputs and sources rather than introspection into opaque reasoning chains.5
While transparency is important, narrative ‘reasoning logs’ are often non-deterministic and may not faithfully reflect the model’s internal mechanisms – a known challenge in explainable AI research.
In pharmacovigilance, regulatory intelligence, and literature triage, retrieval quality is a first-order determinant of correctness. If the system relies on summaries of product characteristics (SmPCs), internal knowledge bases, training content, or a brokered evidence layer, then retrieval failures become a direct safety and compliance risk. EMA has previously emphasized the need for traceable and verifiable evidence paths in AI-generated content, especially when used in regulatory literature screening and pharmacovigilance.2
Therefore, retrieval effectiveness should be:
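As one illustration of how retrieval effectiveness could be evidenced (not a prescribed metric), a sponsor might periodically re-measure recall@k against a labelled query set; `retrieve` below is a hypothetical hook into the sponsor’s own retrieval component.

```python
# Sketch of a retrieval-effectiveness check: recall@k over a labelled query set.
from typing import Callable, Mapping, Sequence


def recall_at_k(retrieve: Callable[[str], Sequence[str]],
                labelled_queries: Mapping[str, set[str]],
                k: int = 10) -> float:
    """Average fraction of known-relevant documents found in the top-k results."""
    scores = []
    for query, relevant_ids in labelled_queries.items():
        top_k = set(retrieve(query)[:k])
        scores.append(len(top_k & relevant_ids) / len(relevant_ids))
    return sum(scores) / len(scores)
```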
One valuable way to reduce regulatory burden while maintaining safety would be to separate:
Regulated processing zone (controlled decision-support)
Non-regulated or ancillary drafting zone (supporting convenience)
Free-text summaries and generated narratives can be used, but they are:
This boundary would preserve innovation while keeping regulated processes auditable and defensible.
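A minimal sketch of how this boundary could be enforced in software is shown below; the artifact names and fields are illustrative assumptions rather than a recommended design.

```python
# Illustrative separation of the regulated processing zone from ancillary drafting:
# structured decisions enter the validated record, free-text narrative is tagged
# as a non-authoritative reviewer aid.
from dataclasses import dataclass


@dataclass
class TriageDecision:          # regulated zone: validated, auditable
    case_id: str
    decision: str              # value from a closed set
    evidence_ids: list[str]


@dataclass
class DraftNarrative:          # ancillary zone: convenience output only
    case_id: str
    text: str
    authoritative: bool = False   # never cited as the basis for the decision


def route(case_id: str, structured: dict, narrative: str):
    """Split one model response into regulated and ancillary artifacts."""
    decision = TriageDecision(case_id, structured["decision"], structured["evidence_ids"])
    draft = DraftNarrative(case_id, narrative)
    return decision, draft
```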
An optional Implementation Reference Layer could support harmonization by defining the minimum structure of assurance vocabularies, risk domains, and change triggers without constraining technology choices or innovation velocity.
To make the principles effective without constraining methods, EMA/FDA could publish an optional annex containing:
This would prevent checkbox compliance, enable consistent audits, and create a shared foundation for consensus standards while still allowing technical innovation. This annex would follow the precedent of Good Machine Learning Practice and SaMD guidance documents, which outline structured expectations for evidence and lifecycle control without enforcing specific algorithms or metrics.4
The EMA–FDA principles are a strong first step. The next step should not be heavy-handed prescription of algorithms, metrics, or fixed thresholds. Instead, regulators should provide minimum necessary specificity: a shared vocabulary of risk domains for GenAI, and a clear expectation that sponsors define (and evidence) a validated operating envelope with acceptable ranges, monitoring, and change control.
That is the practical middle ground: operationalizable assurance without innovation suffocation. It promotes both regulatory trust and innovation resilience, ensuring that GenAI deployments in medicine support patient safety, data integrity, lifecycle governance, and globally aligned oversight.
If you are developing, selecting, or validating AI approaches and need support, our team is here. Contact us to begin the discussion.
All concepts, analyses, and viewpoints expressed in this commentary reflect the authors’ own expertise and perspectives. The authorship, structure, and content direction are original. Writing support—including phrasing refinement, clarity enhancement, and formatting—was provided with the assistance of a large language model (LLM) to support efficient drafting and communication.