by Vaibhavi M.
EU Annex 22 Explained: What Pharma Manufacturers Need to Know About AI in GMP
A practical breakdown of the European Commission's new guidance on Artificial Intelligence in pharmaceutical manufacturing

The European Commission has proposed a new annex to the Good Manufacturing Practice (GMP) rulebook: Annex 22 on Artificial Intelligence. This is a landmark development for the pharmaceutical industry. For the first time, there is a dedicated regulatory framework that specifies how AI models should be designed, tested, validated, and monitored when used in critical GMP environments.
This guide walks you through everything covered in the draft Annex 22 guideline, published by the European Commission's Directorate-General for Health and Food Safety, in plain language so that your team, whether in QA, IT, data science, or manufacturing, can understand what it means in practice.
What Is EU Annex 22?
EU Annex 22 is a proposed new annex to EudraLex Volume 4, the EU GMP guidelines. It provides supplementary guidance to Annex 11 (which covers computerised systems), specifically for situations where AI models are embedded in those systems.
The annex covers AI models used in critical applications, those that have a direct impact on:
- Patient safety
- Product quality
- Data integrity
Think of AI models used to detect visual defects in tablets, classify raw material batches, or predict process outcomes. These are the types of applications this annex is designed to govern.
Key point: Annex 22 applies to machine learning models that have learned their behaviour from training data. It does not apply to explicitly programmed rule-based systems.
Annex 22 builds on Annex 11, but they are not the same rulebook. Know exactly where one ends and the other begins before you map your compliance gap.
What Annex 22 Does NOT Cover
Before diving deeper, it is important to clarify the boundaries of this guideline. The following are explicitly out of scope:
- Dynamic models — models that continuously learn and adapt during operation. These should not be used in critical GMP applications.
- Probabilistic models — models that may give different outputs even with identical inputs. These, too, should not be used in critical GMP settings.
- Generative AI and Large Language Models (LLMs) — tools like ChatGPT or similar models. These are excluded from critical GMP use.
If generative AI is used in non-critical GMP applications (those not affecting patient safety, quality, or data integrity), a qualified human must always review and be accountable for the output; this is known as the human-in-the-loop (HITL) principle.
The Core Principles: Who Is Responsible?
Annex 22 sets out three foundational principles right at the start.
1. People and collaboration: Implementing an AI model in a GMP setting is not a solo IT project. It requires close cooperation between process subject matter experts (SMEs), Quality Assurance (QA), data scientists, IT teams, and external consultants, where needed. Everyone involved must have adequate qualifications, clearly defined responsibilities, and appropriate system access.
2. Documentation: All activities, from algorithm selection to model testing, must be fully documented. This applies whether the model is developed in-house or supplied by a third-party vendor. The regulated company is always responsible for reviewing this documentation.
3. Quality Risk Management: Everything in Annex 22 should be implemented using a risk-based approach, always keeping in focus the impact on patient safety, product quality, and data integrity.
Intended Use: Define It Before You Build It
One of the most critical steps in Annex 22 is properly defining the intended use of the AI model. This must be done in detail and documented before acceptance testing begins.
The intended use description should include:
- The specific task the model is designed to assist or automate
- A thorough description of the input data the model will use, including all common and rare variations (the input sample space)
- Any known limitations or potential sources of error or bias in the input data
The input sample space may also be divided into subgroups. For example, a visual inspection model might have subgroups based on defect type or severity, geographic site, or specific product characteristics. This segmentation allows for more granular testing and performance assessment.
Where a human operator is involved in the decision-making process (HITL), the intended use must also describe what the operator's responsibility is, and operator performance must be monitored like any other manual process.
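One way to keep these elements together is to capture them in a structured record that can be reviewed and versioned alongside the model. The sketch below is only an illustration: the class `IntendedUse`, its field names, and the tablet-inspection example values are assumptions made for this article, not terms or templates taken from Annex 22.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class IntendedUse:
    """Illustrative intended-use record; field names are assumptions, not Annex 22 terms."""

    task: str                                  # what the model assists or automates
    input_description: str                     # the input sample space, incl. rare variations
    known_limitations: Tuple[str, ...] = ()    # known sources of error or bias
    subgroups: Tuple[str, ...] = ()            # segments to be tested separately
    hitl_responsibility: Optional[str] = None  # operator's role, if HITL applies

    def missing_fields(self) -> List[str]:
        """List the mandatory descriptions that are still blank."""
        required = {"task": self.task, "input_description": self.input_description}
        return [name for name, text in required.items() if not text.strip()]


# Hypothetical example for a visual inspection model.
tablet_inspection = IntendedUse(
    task="Flag tablets with visible surface defects for rejection",
    input_description="Top-view greyscale images of coated tablets, including rare chipped edges",
    known_limitations=("Glare from worn lighting may hide hairline cracks",),
    subgroups=("chipped", "cracked", "discoloured", "no defect"),
    hitl_responsibility="Operator reviews every image the model marks as undecided",
)

print(tablet_inspection.missing_fields())
```

A frozen dataclass is used here so that the record cannot be silently modified after approval; any change would require creating a new, reviewable version.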
Acceptance Criteria: Setting the Performance Bar
AI models in GMP settings must meet clearly defined performance standards before they are approved for use. Annex 22 requires the following:
Test metrics must be appropriate to the model's task. For a classification model (e.g., accept/reject decisions), relevant metrics may include:
| Metric | What It Measures |
|---|---|
| Sensitivity (Recall) | Ability to correctly identify true positives |
| Specificity | Ability to correctly identify true negatives |
| Accuracy | Overall correct prediction rate |
| Precision | Proportion of positive predictions that are correct |
| F1 Score | Harmonic mean of precision and recall |
| Confusion Matrix | Full picture of prediction outcomes across all categories |
Acceptance criteria must be pre-defined and documented before testing starts. A process SME must sign off on these criteria. Critically, the AI model's performance must be at least as good as the process it replaces. This means you need a baseline; knowing how well your existing manual or automated process performs is a prerequisite.
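As a rough illustration of how the metrics above relate to each other, the following sketch derives them from the four cells of a binary confusion matrix and compares them with pre-defined acceptance criteria. The counts and the threshold values are invented for the example; in practice the criteria would come from the documented baseline of the process being replaced.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common metrics from binary confusion-matrix counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of raising when a category is empty.
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


def check_acceptance(metrics: dict, criteria: dict) -> dict:
    """Return pass/fail per metric; criteria are minimum acceptable values."""
    return {name: metrics[name] >= minimum for name, minimum in criteria.items()}


# Hypothetical test result: "positive" means a defective unit.
metrics = classification_metrics(tp=95, fp=20, tn=880, fn=5)

# Hypothetical criteria, fixed before testing from the baseline process.
criteria = {"sensitivity": 0.93, "specificity": 0.97}

print(metrics)
print(check_acceptance(metrics, criteria))
```

Note that accuracy alone can look high even when sensitivity is poor if defects are rare, which is one reason to set criteria on several metrics rather than one.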
Test Data: Quality In, Quality Out
The quality of test data is fundamental to the validity of any AI model. Annex 22 is very specific about what good test data looks like.
Test data must be:
- Representative — covering the full intended input sample space, including all subgroups
- Statistically sufficient — large enough that the chosen metrics can be calculated with adequate statistical confidence
- Correctly labelled — verified through independent review by multiple experts, validated equipment, or laboratory tests
- Pre-processed consistently — any transformation, normalisation, or standardisation must be pre-specified with a documented rationale.
Any exclusion or cleaning of test data must be fully documented and justified. Generating synthetic test data using generative AI is not recommended, and if done, requires strong justification.
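To make "statistically sufficient" more concrete, one option is to report a confidence interval around each metric rather than a single number. The sketch below uses the Wilson score interval for a proportion, which is a common statistical choice but is an assumption here: the guideline as summarised in this article does not prescribe a particular method. The sample counts are invented.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, trials: int, confidence: float = 0.95):
    """Wilson score interval for a proportion such as sensitivity."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / trials
    denominator = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denominator
    half_width = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denominator
    return centre - half_width, centre + half_width


# Same observed sensitivity (95 %), different numbers of defective test samples.
small_low, small_high = wilson_interval(95, 100)
large_low, large_high = wilson_interval(950, 1000)

print(f"n=100:  {small_low:.3f} to {small_high:.3f}")
print(f"n=1000: {large_low:.3f} to {large_high:.3f}")
```

With 100 samples the lower bound sits several percentage points below the observed 95 %, while 1,000 samples narrow the interval considerably. If the acceptance criterion has to be met by the lower bound rather than the point estimate, this directly drives how much test data is needed.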
Test Data Independence: Keeping the Walls Up
This section of Annex 22 addresses one of the most common pitfalls in AI development: data leakage, where test data is accidentally exposed during training or validation, leading to overly optimistic performance results.
Requirements include:
- Test data must never be used during model development, training, or validation
- If data is split from a larger pool before training, employees involved in training must never have had access to the test data.
- Test data must be protected by access controls and audit trails
- There must be no copies of test data outside the controlled repository
- Staff who have had access to test data cannot be involved in training the same model, unless they work under the four-eyes principle
A record must be kept of which data was used for testing, when it was used, and how many times.
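Two of these controls lend themselves to simple automation: checking that no test record also appears in the training set, and logging each use of the test data. The following is a minimal sketch under stated assumptions. The function and class names (`find_leaks`, `UsageLog`) are hypothetical, records are treated as raw bytes, and only exact duplicates are detected; near-duplicates such as re-cropped images would need a different technique.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(record: bytes) -> str:
    """SHA-256 digest used to compare records without exposing their content."""
    return hashlib.sha256(record).hexdigest()


def find_leaks(train_records, test_records):
    """Return indices of test records that also occur in the training data."""
    train_hashes = {fingerprint(r) for r in train_records}
    return [i for i, r in enumerate(test_records) if fingerprint(r) in train_hashes]


class UsageLog:
    """Hypothetical record of which test set was used, when, and how often."""

    def __init__(self):
        self.entries = []

    def record_use(self, dataset_id: str, purpose: str) -> None:
        timestamp = datetime.now(timezone.utc).isoformat()
        self.entries.append({"dataset": dataset_id, "purpose": purpose, "at": timestamp})

    def times_used(self, dataset_id: str) -> int:
        return sum(1 for entry in self.entries if entry["dataset"] == dataset_id)


# Invented example data: one test image is an exact copy of a training image.
train = [b"image-001", b"image-002", b"image-003"]
test = [b"image-101", b"image-002", b"image-102"]

leaks = find_leaks(train, test)
log = UsageLog()
log.record_use("test-set-A", "acceptance test run 1")
log.record_use("test-set-A", "acceptance test run 2")

print(leaks, log.times_used("test-set-A"))
```

In a real system the log would live inside the controlled repository with its own audit trail, rather than in memory as shown here.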
Test Execution: Proving the Model Works
Before testing begins, a formal test plan must be prepared and approved. It should contain:
- A summary of the intended use
- Pre-defined metrics and acceptance criteria
- A reference to the test data
- A step-by-step test script
- A description of how to calculate each test metric
The test itself must demonstrate that the model generalises well, meaning it performs reliably with new, unseen data, not just the data it was trained on. Both overfitting (the model is over-tuned to the training data and performs worse on new data) and underfitting (the model is too simple to capture the underlying pattern and performs poorly on both) must be detected and addressed.
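A simple first check for both problems is to compare the same metric on the training data and on the independent test data. The sketch below shows that comparison; the function name `diagnose_fit` and the default thresholds (`min_score`, `max_gap`) are illustrative assumptions and would need to be justified and pre-defined in the test plan.

```python
def diagnose_fit(train_score: float, test_score: float,
                 min_score: float = 0.90, max_gap: float = 0.05) -> str:
    """Classify a train/test score pair using illustrative thresholds."""
    if train_score < min_score and test_score < min_score:
        # Poor even on data the model has seen: likely too simple.
        return "underfitting"
    if train_score - test_score > max_gap:
        # Much better on seen data than on unseen data.
        return "overfitting"
    if test_score < min_score:
        # Small gap, but the unseen-data score misses the criterion.
        return "below acceptance criterion"
    return "acceptable"


print(diagnose_fit(0.99, 0.85))  # large gap between training and test
print(diagnose_fit(0.80, 0.79))  # low on both
print(diagnose_fit(0.96, 0.95))  # high on both, small gap
```

A check like this does not replace a review of the full test results, but it gives a reproducible rule that can be written into the test script ahead of time.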
Any deviation from the test plan, failure to meet acceptance criteria, or omission of test data must be investigated and fully documented.
All test documentation, including test data, physical test objects where relevant, and audit trail records, must be retained in accordance with standard GMP documentation requirements.
Explainability: Opening the Black Box
Explainability is one of the more technically demanding requirements in Annex 22. For AI models used in critical GMP applications, the system must be able to capture and record which features in the input data contributed to a particular output (such as a rejection decision).
Acceptable techniques include:
- SHAP (Shapley Additive Explanations) — assigns each input feature a contribution score for a particular prediction
- LIME (Local Interpretable Model-Agnostic Explanations) — approximates a complex model locally to explain individual predictions
- Heat maps — visually highlight areas of an image that influenced a classification decision
Reviewing these features must be part of the approval process for test results. The goal is to ensure the model makes decisions based on relevant and appropriate features, not on spurious correlations.
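To show what a per-feature contribution score means, the sketch below computes exact Shapley values for a tiny model by enumerating every feature subset. It is not the SHAP library, which uses faster approximations; it is a brute-force illustration that is only practical for a handful of features. The toy scoring model, the feature meanings, and the choice of an all-zero baseline (features outside a subset are replaced by baseline values) are assumptions made for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    n = len(instance)

    def value(coalition):
        # Model output when only the features in `coalition` are "present".
        mixed = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of subsets of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        contributions.append(total)
    return contributions


def defect_score(features):
    """Toy model: edge roughness and dark-spot area matter, a timestamp does not."""
    roughness, spot_area, timestamp = features
    return 0.6 * roughness + 0.3 * spot_area + 0.0 * timestamp


instance = [1.0, 0.5, 9.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(defect_score, instance, baseline)
print(phi)
```

Here the timestamp feature receives a contribution of zero, which is what a reviewer would hope to see; a large contribution from an irrelevant feature would point to a spurious correlation. The contributions also sum to the difference between the model output for the instance and for the baseline.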
Confidence Scores: When to Say "I'm Not Sure"
AI models should not be forced to make a binary decision when the evidence is ambiguous. Annex 22 requires that systems log the confidence score for each prediction or classification.
Where a model's confidence score is very low, the system should flag the outcome as "undecided" rather than committing to a potentially unreliable prediction. This prevents incorrect classifications from slipping through the process undetected.
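The logic described above can be sketched in a few lines for a binary accept/reject model. The function name `decide`, the definition of confidence as the larger of the two class probabilities, and the 0.8 cut-off are all assumptions for illustration; the appropriate threshold would be set and justified as part of validation.

```python
def decide(probability_reject: float, min_confidence: float = 0.8) -> dict:
    """Turn a predicted reject probability into a logged decision."""
    if not 0.0 <= probability_reject <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Confidence is the probability of whichever class is more likely.
    confidence = max(probability_reject, 1.0 - probability_reject)
    if confidence < min_confidence:
        label = "undecided"   # route to human review instead of guessing
    elif probability_reject >= 0.5:
        label = "reject"
    else:
        label = "accept"
    # Returning both values lets the caller log the score with the outcome.
    return {"label": label, "confidence": confidence}


for p in (0.97, 0.55, 0.03):
    print(p, decide(p))
```

Returning the confidence together with the label means every outcome, including "undecided", carries the score that has to be logged.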
Operation and Ongoing Monitoring
Once a model is approved and deployed, it must be maintained under strict controls throughout its operational life.
Annex 22 Operational Checklist
- Place the model, the system it runs on, and the process it supports under change control before deployment
- Evaluate any change to the model, system, or process, including changes to physical input objects, to determine if retesting is needed.
- Implement configuration control to detect any unauthorised changes to the model after deployment
- Monitor model performance metrics regularly to detect deterioration (e.g., changes in lighting conditions for visual inspection models)
- Monitor whether input data remains within the model's intended sample space, and track any drift using defined metrics
- Where HITL is used with reduced testing, keep records of all human review decisions and apply a consistent review procedure.
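Two items on this checklist, configuration control and input-space monitoring, can be illustrated with a short sketch. It assumes the approved model is available as bytes and that one numeric input property (here, a hypothetical image brightness value) has a validated range; the function names, the range, and the 2 % alert level are invented for the example.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Digest recorded at approval time and re-checked during operation."""
    return hashlib.sha256(data).hexdigest()


def model_is_unchanged(model_bytes: bytes, approved_digest: str) -> bool:
    """True if the deployed model matches the approved version byte for byte."""
    return sha256_digest(model_bytes) == approved_digest


def out_of_range_fraction(values, low: float, high: float) -> float:
    """Share of inputs that fall outside the validated range."""
    if not values:
        return 0.0
    outside = sum(1 for v in values if v < low or v > high)
    return outside / len(values)


def drift_alert(values, low: float, high: float, max_fraction: float = 0.02) -> bool:
    """Raise an alert when too many inputs leave the intended sample space."""
    return out_of_range_fraction(values, low, high) > max_fraction


# Configuration control: compare against the digest stored at approval.
approved = sha256_digest(b"model-weights-v1")
print(model_is_unchanged(b"model-weights-v1", approved))
print(model_is_unchanged(b"model-weights-v1-modified", approved))

# Input monitoring: hypothetical brightness readings, validated range 0.3 to 0.8.
brightness = [0.52, 0.48, 0.55, 0.91, 0.50]
print(out_of_range_fraction(brightness, 0.3, 0.8), drift_alert(brightness, 0.3, 0.8))
```

A range check is the simplest possible drift signal. It catches inputs that leave the validated space but not gradual shifts within it, for which a distribution-based metric would be needed.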
Summary: What Annex 22 Means for Your Organisation
EU Annex 22 marks a significant step forward in how regulators expect AI to be governed in pharmaceutical manufacturing. Rather than applying general computer system validation principles loosely, companies now have a specific, structured framework to follow.
The key message is clear: AI in GMP is not just an IT decision. It is a quality and compliance responsibility that spans teams, disciplines, and the full lifecycle of the model, from design and training to deployment and ongoing monitoring.
Organisations that start building these practices now (clear intended use definitions, independent test data management, explainability mechanisms, and robust operational monitoring) will be far better positioned when Annex 22 moves from consultation to enforcement.
Frequently Asked Questions
1. What is EU Annex 22 in pharma?
EU Annex 22 is a new GMP guideline from the European Commission that sets out specific requirements for the use of Artificial Intelligence (AI) models in pharmaceutical manufacturing systems where patient safety, product quality, or data integrity is at stake.
2. Does EU Annex 22 apply to ChatGPT or generative AI?
No. Annex 22 explicitly excludes generative AI and Large Language Models from critical GMP applications. If used in non-critical settings, a qualified human must always review the output.
3. What types of AI models does Annex 22 cover?
Annex 22 covers static, deterministic machine learning models — those that do not change after training and give the same output for the same input every time.
4. What is the human-in-the-loop (HITL) requirement in Annex 22?
HITL means a trained human operator is responsible for reviewing and approving the AI model's output before it influences a GMP decision. When HITL is applied, testing requirements for the model may be reduced, but the operator's performance must be monitored.
5. When will EU Annex 22 come into effect?
Annex 22 is currently in public consultation. The final implementation date has not yet been confirmed, but organisations are advised to start aligning their AI governance practices with the draft guideline now.




