The joint EMA–FDA Guiding Principles of Good AI Practice in Drug Development are an important signal – regulators recognize that artificial intelligence (AI) is now embedded across the product development lifecycle and that global alignment is both necessary and feasible. These principles, published in January 2026, build on EMA’s earlier holistic and exploratory AI Reflection Paper (2024) and FDA’s ongoing technical workstream on AI use in regulatory decision-making.1,2,3

The new EMA–FDA document is intentionally high-level and principles-based, which is understandable but also creates a practical implementation gap. Without clear operational expectations, organizations may interpret “risk-based,” “fit-for-purpose,” and “governance” in divergent ways and still claim compliance.

How, then, can organizations bridge the gap between non-prescriptive regulatory principles and an auditable validation approach that does not suffocate innovation?

The core argument is simple – regulators should avoid prescribing methods, but they can define the minimum level of specificity needed to make AI assurance testable, comparable, and enforceable.  

Here are nine key observations and recommendations on turning high-level principles into auditable, low-burden controls. 

1. How to Define “Model/System Risk” and Adapt a Validation Approach Accordingly

The principles refer repeatedly to “risk-based approaches,” “context of use,” and “model risk.” In practice, however, AI spans very different system types that demand fundamentally different assurance strategies.

A validation approach that works for a fixed, reproducible pipeline (e.g., classical machine learning running on controlled data with fixed features and deterministic inference) does not map cleanly to modern generative AI (GenAI) systems where:

  • Model may be delivered as a third-party service 
  • Inference may be stochastic (sampling variability) 
  • Output may be unstructured natural language 
  • System may use retrieval (RAG), tool calls, or agentic workflows 
  • Upstream data and downstream contexts evolve continuously 

We can, for example, classify systems by assurance-relevant properties that remain stable across generations of technology:

  • Replayability: can you reconstruct and re-run the same request path end-to-end? 
  • Controllability: can you freeze versions/configuration (e.g., model, prompts/templates, retrieval corpus snapshots, decoding parameters)? 
  • External dependency: is behavior supplier-managed, or under sponsor control? 
  • Adaptivity: does behavior change as data/corpus/tools evolve (even without code changes)? 

This kind of taxonomy is non-prescriptive, yet it forces sponsors to adopt different qualification and change-control expectations depending on the system class. A similar approach has already been adopted by FDA in the Good Machine Learning Practice (GMLP) framework and Software as a Medical Device (SaMD) categorization, where systems are classified not by algorithm but by their control boundaries, update mechanisms, and risk implications.4
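
As a minimal, hypothetical illustration (the class names, property flags, and mapping rules below are assumptions for the sketch, not terms from the EMA–FDA principles), a sponsor could record the four assurance-relevant properties for each system and derive the applicable qualification track:

  from dataclasses import dataclass
  from enum import Enum

  class AssuranceTrack(Enum):
      # Hypothetical qualification tracks; the names are illustrative, not regulatory categories.
      STATIC_PIPELINE = "classical, replayable pipeline"
      CONTROLLED_GENAI = "frozen GenAI configuration under sponsor control"
      SUPPLIER_MANAGED = "externally hosted model requiring interface-level controls"
      ADAPTIVE_SYSTEM = "continuously evolving system requiring runtime assurance"

  @dataclass
  class SystemProfile:
      replayable: bool        # can the exact request path be reconstructed and re-run end-to-end?
      controllable: bool      # can model, prompts, corpus snapshots, and decoding be frozen?
      supplier_managed: bool  # is behavior governed by a third-party provider?
      adaptive: bool          # does behavior change as data/corpus/tools evolve, without code changes?

  def assurance_track(p: SystemProfile) -> AssuranceTrack:
      """Map assurance-relevant properties to a qualification track (illustrative rules only)."""
      if p.adaptive:
          return AssuranceTrack.ADAPTIVE_SYSTEM
      if p.supplier_managed:
          return AssuranceTrack.SUPPLIER_MANAGED
      if p.replayable and p.controllable:
          return AssuranceTrack.STATIC_PIPELINE
      return AssuranceTrack.CONTROLLED_GENAI

  # Example: a retrieval-augmented service consumed via a vendor API with a weekly-refreshed corpus.
  profile = SystemProfile(replayable=False, controllable=False, supplier_managed=True, adaptive=True)
  print(assurance_track(profile).value)  # -> continuously evolving system requiring runtime assurance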

Principle 9 rightly recognizes that AI systems evolve over time and that validation cannot be a one-time exercise. Qualification must therefore include controls for drift, corpus evolution, and runtime risk – treating the model not as a static tool but as a living system.

2. How to Define a Risk Vocabulary for Generative AI

What “risk” means for GenAI is not the same as for classical predictive models. The regulatory-relevant risks often include:

  • Stability / Predictability at the decision level (not necessarily identical text outputs) 
  • Traceability (inputs → retrieved evidence → output → human decision) 
  • Hallucination / unsupported claims 
  • Bias and representativeness (where population, location, language, or source types matter) 
  • Robustness (prompt injection, malformed inputs, adversarial content, tool misuse) 
  • Observability (ability to monitor drift, detect anomalies, investigate incidents) 

Regulators do not need to prescribe specific metrics (e.g., precision or calibration scores), but they do need to define the domains of risk that must be addressed and evidenced. This is the minimum necessary specificity to make “risk-based” meaningful for non-deterministic systems. FDA’s draft guidance already points to the importance of evidence transparency, traceability, and auditability for AI models used in drug evaluation workflows, though detailed risk vocabularies remain undeveloped.3
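
To make that vocabulary concrete, the sketch below shows one possible machine-readable form; the domain names follow the bullets above, while the evidence examples are assumptions added for illustration rather than regulatory expectations:

  from enum import Enum

  class GenAIRiskDomain(Enum):
      STABILITY = "stability/predictability at the decision level"
      TRACEABILITY = "inputs -> retrieved evidence -> output -> human decision"
      HALLUCINATION = "hallucination / unsupported claims"
      BIAS = "bias and representativeness"
      ROBUSTNESS = "prompt injection, malformed inputs, adversarial content, tool misuse"
      OBSERVABILITY = "drift monitoring, anomaly detection, incident investigation"

  # Illustrative mapping from each risk domain to the kind of evidence a sponsor might file.
  EVIDENCE_EXAMPLES = {
      GenAIRiskDomain.STABILITY: "decision flip-rate under controlled input perturbations",
      GenAIRiskDomain.TRACEABILITY: "end-to-end decision trace records",
      GenAIRiskDomain.HALLUCINATION: "evidence-grounding checks against retrieved sources",
      GenAIRiskDomain.BIAS: "stratified performance across languages, sources, and populations",
      GenAIRiskDomain.ROBUSTNESS: "adversarial and malformed-input test suites",
      GenAIRiskDomain.OBSERVABILITY: "monitoring signals, alert thresholds, incident logs",
  }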

3. How to Deal with External Dependencies and Bridge the Practical Reality: “Model Not Owned / Not Hosted / Consumed as a Service”

Modern AI deployments frequently reflect a paradigm shift that traditional validation assumptions do not cover well. For example:

  • Sponsor may not own the model (or even have full insight into training data) 
  • Sponsor may not host the model; inference may be consumed as a shared, multi-tenant cloud service 
  • Sponsor consumes GenAI like any third-party application programming interface (API), not like an in-house validated component 

This is not inherently incompatible with compliance, but it shifts the control boundary. Sponsors need a structured way to demonstrate assurance when internals are inaccessible. EMA’s AI reflection paper acknowledges this reality, noting sponsors may need to validate AI-supported processes where full insight into the model architecture or training data is not available, especially in cloud-hosted contexts.2 

A practical answer is to treat the external foundation model as a “black-box component,” while building validated controls around it (a minimal sketch follows this list):

  • Input constraints and preprocessing 
  • Output schema enforcement and post-processing 
  • Retrieval and evidence constraints 
  • Monitoring, drift detection, and incident handling 
  • Supplier qualification (quality, change notifications, service-level agreements) 
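
The sketch below illustrates what such interface-level controls might look like in code; the provider call, schema fields, and input limit are hypothetical stand-ins used only to show the control boundary around a black-box model:

  import json

  ALLOWED_DECISIONS = {"relevant", "not_relevant", "needs_human_review"}
  MAX_INPUT_CHARS = 20_000  # illustrative input constraint for the validated envelope

  def call_external_model(prompt: str) -> str:
      """Stand-in for the supplier-hosted GenAI service (the black-box component).
      In production this would be the third-party API call; here it returns a canned reply."""
      return '{"decision": "relevant", "evidence": ["doc-123: section 4.8"]}'

  def log_for_monitoring(record: dict) -> None:
      """Illustrative stub: a real system would persist records for drift and incident review."""
      pass

  def classify_with_controls(document_text: str) -> dict:
      # 1. Input constraints and preprocessing (sponsor-controlled, validated).
      if not document_text or len(document_text) > MAX_INPUT_CHARS:
          return {"decision": "needs_human_review", "reason": "input_out_of_envelope"}

      # 2. Inference through the supplier interface (internals not accessible to the sponsor).
      raw = call_external_model(f"Classify the document and return JSON only.\n\n{document_text}")

      # 3. Output schema enforcement and post-processing (sponsor-controlled, validated).
      try:
          parsed = json.loads(raw)
      except json.JSONDecodeError:
          return {"decision": "needs_human_review", "reason": "unparseable_output"}
      if parsed.get("decision") not in ALLOWED_DECISIONS:
          return {"decision": "needs_human_review", "reason": "schema_violation"}

      # 4. Monitoring hook feeding drift detection and incident handling.
      log_for_monitoring(parsed)
      return parsed

  print(classify_with_controls("Spontaneous report describing a suspected adverse reaction."))
  # -> {'decision': 'relevant', 'evidence': ['doc-123: section 4.8']}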

Regulators do not need to mandate specific vendor requirements, but they can state that external dependence increases the need for interface-level validation and compensating controls. This paradigm shift is not unprecedented – regulated cloud services, interactive response technology (IRT) systems, and electronic trial master files (eTMFs) already operate under shared-responsibility models. GenAI should follow similar principles.

4. How to Constrain LLM Outputs to Discrete, Measurable Artifacts

A major source of “validation impossibility” for GenAI is unstructured text. Evaluating the quality of free-form summaries or narratives at scale is difficult, subjective, and often non-repeatable. This is where a practical design decision can dramatically improve auditability. 

Use large language models (LLMs) primarily at the “reasoning layer,” but constrain outputs 

If the LLM is asked to produce a structured JavaScript Object Notation (JSON) output with closed sets (e.g., enums, booleans, bounded reason codes, evidence pointers), you gain:

  • Objective evaluation (0/1 correctness, confusion matrices, stability checks) 
  • Repeatable monitoring (drift and anomaly signals) 
  • Schema validation (a defined failure mode) 
  • Controlled downstream use (safe integration into workflows) 

This does not make the underlying model deterministic, but it enables system-level predictability and controlled decision-support. 
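
A minimal sketch of such a constrained output contract, using only the Python standard library, is shown below; the field names and closed sets are hypothetical examples for a literature-triage decision:

  import json

  # Hypothetical closed sets for a triage decision.
  DECISIONS = {"include", "exclude", "escalate"}
  REASON_CODES = {"on_topic", "off_topic", "population_mismatch", "insufficient_evidence"}

  def validate_output(raw: str) -> dict:
      """Parse and validate an LLM response against a closed schema.
      Any violation is a defined failure mode rather than a silent free-text result."""
      record = json.loads(raw)  # malformed JSON raises here (defined failure mode)
      assert set(record) == {"decision", "reason_code", "evidence_ids", "confident"}, "unexpected fields"
      assert record["decision"] in DECISIONS, "decision outside closed set"
      assert record["reason_code"] in REASON_CODES, "reason code outside closed set"
      assert isinstance(record["confident"], bool), "confidence flag must be boolean"
      assert isinstance(record["evidence_ids"], list) and record["evidence_ids"], "evidence pointers required"
      return record

  # Because outputs are discrete, correctness can be scored 0/1 against a labelled set,
  # enabling confusion matrices, stability checks, and drift monitoring downstream.
  example = '{"decision": "include", "reason_code": "on_topic", "evidence_ids": ["PMID:12345"], "confident": true}'
  print(validate_output(example)["decision"])  # -> include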

Important nuance: binary output is not automatically stable 

Even true/false decisions can flip under small input perturbations or changes in retrieval/provider model versions. Therefore, qualification must focus on stability within a validated operating envelope, not only on whether the output format is measurable. 
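
One way to evidence that kind of stability is a perturbation test over the qualification set; the sketch below assumes a hypothetical classify() function that returns a discrete decision and measures how often the decision flips under harmless, meaning-preserving input variations (the threshold mentioned in the comment is illustrative only):

  from collections import Counter

  def perturbations(text: str) -> list[str]:
      """Illustrative, meaning-preserving input variations (whitespace, punctuation, casing)."""
      return [text, text.strip() + " ", text.replace("  ", " "), text.rstrip("."), text.lower()]

  def decision_flip_rate(classify, samples: list[str]) -> float:
      """Fraction of samples whose decision is not stable across the perturbations."""
      flips = 0
      for text in samples:
          decisions = Counter(classify(variant) for variant in perturbations(text))
          if len(decisions) > 1:  # the decision changed for at least one variant
              flips += 1
      return flips / len(samples)

  # Acceptance expressed as a range within the validated operating envelope, e.g.
  # decision_flip_rate(classify, qualification_set) <= 0.02, re-checked after any material change.
  toy_classify = lambda t: "include" if "adverse" in t.lower() else "exclude"  # stands in for the system under test
  print(decision_flip_rate(toy_classify, ["Adverse event reported.", "Routine visit, no findings."]))  # -> 0.0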

5. How to Scope Design Space and Acceptance Range

The most pragmatic addition to the principles would be guidance on:

  • How to define the validated operating envelope (design space) 
  • How to set acceptance criteria as ranges rather than single point “expected outputs” 

Why the design space matters: for GenAI, exhaustive testing of the input space is impossible, so qualification must be:

  • Bounded (context-of-use constraints) 
  • Stratified (define relevant operating conditions) 
  • Risk-proportionate (higher stakes → tighter controls) 
  • Continuously assured (monitor and re-evaluate over time) 

What a regulator can specify without becoming prescriptive 

Regulators may define what sponsors must document: 

  • Design space dimensions (language, source types, document formats, OCR quality bands, decision criticality, etc.) 
  • Acceptance region elements (minimum performance, stability bounds, error asymmetry, human review requirements) 
  • Out-of-envelope behavior (block/flag/escalate rules) 
  • Change triggers (what constitutes material change requiring re-qualification) 

Such documentation is not method-prescriptive, but auditable and harmonizable. 
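
A hedged sketch of what such documentation could look like in machine-readable form follows; every dimension, bound, and trigger below is an invented example used only to show the structure, not a recommended value:

  # Illustrative declaration of a validated operating envelope for a document-triage use case.
  OPERATING_ENVELOPE = {
      "design_space": {
          "languages": ["en", "fr"],
          "source_types": ["journal_article", "smpc", "internal_sop"],
          "document_formats": ["pdf_text_layer", "docx"],
          "ocr_quality_band": "confidence >= 0.90",
          "decision_criticality": "medium",
      },
      "acceptance_region": {
          "recall_min": 0.97,              # error asymmetry: missed signals weigh more than noise
          "precision_min": 0.80,
          "decision_flip_rate_max": 0.02,  # stability bound within the envelope
          "human_review": "all 'escalate' outputs plus a random sample",
      },
      "out_of_envelope_behavior": {
          "unsupported_language": "block and route to manual processing",
          "low_ocr_quality": "flag for human review",
          "schema_violation": "escalate",
      },
      "change_triggers": [
          "provider model version change",
          "prompt/template revision",
          "retrieval corpus refresh beyond agreed scope",
          "sustained monitoring drift beyond stability bounds",
      ],
  }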

6. How to Ensure Explainability by Focusing on Traceability and Evidence Rather Than “Reasoning Logs”

Explainability is essential, but raw “chain-of-thought” or “thinking output” logs are not a robust basis for regulated assurance: they are not guaranteed to be faithful explanations of model behavior, they are hard to standardize and evaluate, and they can introduce data-handling and privacy risks.

A stronger, regulator-aligned approach would define explainability as: 

  • Decision traceability (inputs → evidence → output → human decision) 
  • Evidence grounding (citations/links to supporting sources) 
  • Structured justification (bounded reason codes tied to evidence) 

If a narrative explanation is provided, it should be an auxiliary reviewer aid, not the primary validation artifact. This aligns with recent stakeholder feedback to EMA, where explainability is reframed as traceability to structured inputs and sources, not introspection into opaque reasoning chains.5

While transparency is important, narrative ‘reasoning logs’ are often non-deterministic and may not faithfully reflect the model’s internal mechanisms – a known challenge in explainable AI research. 
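
A minimal sketch of a decision trace record along these lines is shown below; the fields are illustrative assumptions about what “inputs → evidence → output → human decision” could capture in practice:

  from dataclasses import dataclass, field
  from datetime import datetime, timezone

  @dataclass
  class DecisionTrace:
      """One auditable record per AI-supported decision (illustrative fields)."""
      input_ref: str                 # pointer/hash of the source document, not the raw content
      retrieved_evidence: list[str]  # citations or IDs of the sources actually used
      model_config_id: str           # frozen model/prompt/corpus configuration identifier
      structured_output: dict        # bounded decision and reason codes, no free narrative
      human_decision: str            # accept / override / escalate
      reviewer_id: str
      timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

  trace = DecisionTrace(
      input_ref="sha256-of-source-document",
      retrieved_evidence=["SmPC:product-X:section-4.8", "PMID:12345"],
      model_config_id="triage-v3.2/prompt-7/corpus-2026-01",
      structured_output={"decision": "include", "reason_code": "on_topic"},
      human_decision="accept",
      reviewer_id="pv-reviewer-042",
  )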

7. How to Ensure Retrieval Effectiveness

In pharmacovigilance, regulatory intelligence, and literature triage, retrieval quality is a first-order determinant of correctness. If the system relies on summaries of product characteristics (SmPCs), internal knowledge bases, training content, or a brokered evidence layer, then retrieval failures become a direct safety and compliance risk. EMA has previously emphasized the need for traceable and verifiable evidence paths in AI-generated content, especially when used in regulatory literature screening and pharmacovigilance.2

Therefore, retrieval effectiveness should be: 

  • A named pillar (or explicit sub-domain) 
  • Evaluated pre-deployment against curated reference queries (see the sketch after this list) 
  • Monitored continuously (coverage, freshness, precision, evidence mismatch) 
  • Governed under change control when corpora are updated 
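
The sketch below shows one way pre-deployment retrieval evaluation could be scored against a curated gold set; the document identifiers, query, and metric choice are hypothetical, and the same metrics (plus evidence-mismatch rate and corpus freshness) could feed continuous monitoring:

  def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
      """Precision@k of the retrieved list and recall against a curated gold set of relevant source IDs."""
      top_k = retrieved[:k]
      hits = sum(1 for doc_id in top_k if doc_id in relevant)
      precision = hits / k if k else 0.0
      recall = hits / len(relevant) if relevant else 0.0
      return precision, recall

  # Example against a hypothetical gold annotation for one query.
  gold = {"PMID:111", "PMID:222", "SmPC:product-X"}
  retrieved = ["PMID:111", "PMID:999", "SmPC:product-X", "PMID:222", "PMID:333"]
  print(precision_recall_at_k(retrieved, gold, k=5))  # -> (0.6, 1.0)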

8. A Practical “Two-Zone Model” to Avoid Suffocating Innovation

One valuable way to reduce regulatory burden while maintaining safety would be to separate: 

Regulated processing zone (controlled decision-support) 

  • LLM outputs must be structured, measurable, and schema-validated 
  • Acceptance criteria are defined within the operating envelope 
  • Monitoring and change control are mandatory 
  • Human accountability is explicit 

Non-regulated or ancillary drafting zone (supporting convenience uses) 

Free-text summaries and generated narratives can be used, but they: 

  • Are clearly non-authoritative 
  • Require human verification 
  • Must not trigger automated regulated actions 

This boundary would preserve innovation while keeping regulated processes auditable and defensible. 
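
As an illustration of how the boundary could be enforced in software (the zone names and routing rule are hypothetical), outputs can be routed by zone so that only structured, schema-validated artifacts ever feed regulated actions:

  from enum import Enum

  class Zone(Enum):
      REGULATED = "regulated processing (controlled decision-support)"
      ANCILLARY = "ancillary drafting (non-authoritative convenience)"

  def route_output(artifact: dict) -> Zone:
      """Only structured, schema-validated artifacts may enter the regulated zone;
      anything free-text stays in the ancillary zone and requires human verification."""
      is_structured = artifact.get("format") == "structured_json" and artifact.get("schema_valid") is True
      return Zone.REGULATED if is_structured else Zone.ANCILLARY

  def may_trigger_automated_action(artifact: dict) -> bool:
      # Free-text narratives must never trigger automated regulated actions.
      return route_output(artifact) is Zone.REGULATED

  draft_summary = {"format": "free_text", "schema_valid": False}
  triage_record = {"format": "structured_json", "schema_valid": True}
  print(may_trigger_automated_action(draft_summary), may_trigger_automated_action(triage_record))  # -> False True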

9. Implementation Reference Layer (Non-Prescriptive Annex)

An optional Implementation Reference Layer could support harmonization by defining the minimum structure of assurance vocabularies, risk domains, and change triggers without constraining technology choices or innovation velocity. 

To make the principles effective without constraining methods, EMA/FDA could publish an optional annex containing: 

  • Assurance-relevant taxonomy (replayability/controllability/external dependency/adaptivity) 
  • GenAI risk vocabulary (stability, hallucination, bias, robustness, retrieval, observability, traceability) 
  • Minimum evidence categories proportional to risk 
  • Change-control categories and triggers (without mandated tools) 
  • Monitoring expectations (signals and governance cadence, not universal thresholds) 

This would prevent checkbox compliance, enable consistent audits, and create a shared foundation for consensus standards while still allowing technical innovation. This annex would follow the precedent of Good Machine Learning Practice and SaMD guidance documents, which outline structured expectations for evidence and lifecycle control without enforcing specific algorithms or metrics.4 

Conclusion

The EMA–FDA principles are a strong first step. The next step should not be heavy-handed prescription of algorithms, metrics, or fixed thresholds. Instead, regulators should provide the minimum necessary specificity: a shared vocabulary of risk domains for GenAI, and a clear expectation that sponsors define (and evidence) a validated operating envelope with acceptance ranges, monitoring, and change control.

That is the practical middle ground: operationalizable assurance without innovation suffocation. It promotes both regulatory trust and innovation resilience, ensuring GenAI deployments in medicine support patient safety, data integrity, and globally aligned lifecycle oversight.

If you are developing, selecting, or validating AI approaches and need support, our team is here. Contact us to begin the discussion.

This article was written by:

Melissa Bou Jaoudeh
Innovative Product Development Officer
Florian Pereme
Digital Innovation Lead
Gabriele Piaton
Research & Innovation Director

References

  1. EMA–FDA (2026). Guiding Principles of Good AI Practice in Drug Development. 
  2. EMA (2024). Reflection Paper on the Use of Artificial Intelligence in the Medicinal Product Lifecycle. 
  3. FDA (2025). Considerations for the Use of Artificial Intelligence to Support Regulatory Decision Making for Drugs and Biological Products (Draft Guidance). 
  4. FDA (2021). Good Machine Learning Practice (GMLP) for Medical Device Development: Guiding Principles. 
  5. EFPIA (2023). AI Across the Medicines Lifecycle – Reflections and Recommendations for Trustworthy Use in the EU. 

Disclaimer

All concepts, analyses, and viewpoints expressed in this commentary reflect the authors’ own expertise and perspectives. The authorship, structure, and content direction are original. Writing support—including phrasing refinement, clarity enhancement, and formatting—was provided with the assistance of a large language model (LLM) to support efficient drafting and communication. 

