Day 25 of 30

Continuous Monitoring of Deployed AI Systems

⏱ 18 min 📊 Medium AIGP Certification Prep

Welcome to Domain IV — the domain most candidates under-prepare for. Post-deployment governance is where governance meets reality. An AI system that passes all pre-deployment tests can still fail in production. Monitoring catches what testing misses.

[Figure] A production AI monitoring dashboard tracking performance metrics, drift indicators, and fairness scores in real time, enabling early intervention before harm occurs.

Performance Monitoring

Key performance metrics for deployed AI:

Accuracy metrics — Is the model performing as expected? Track overall accuracy, precision, recall, and F1 score against baseline values established during testing.

Latency — How fast does the model respond? Performance degradation can indicate infrastructure issues or model complexity problems.

Throughput — How many decisions is the model processing? Unexpected changes (spikes or drops) may signal issues.

Error rates — Track error types (false positives, false negatives) and their distribution. A sudden increase in a specific error type may indicate model degradation.

Business outcome metrics — Connect AI performance to business outcomes. If the AI is approving loans, track default rates. If it's screening resumes, track hiring success rates.
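The accuracy-related metrics above can be computed from a window of labeled production decisions and compared to testing-time baselines. A minimal sketch, in pure Python; the function names and the 0.05 tolerance are illustrative choices, not from any specific monitoring library:

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def degraded_metrics(live, baseline, tolerance=0.05):
    """Return the metrics that have fallen more than `tolerance` below
    their baseline values established during pre-deployment testing."""
    return {k: live[k] for k in baseline if baseline[k] - live[k] > tolerance}
```

Running this on each monitoring window and alerting on any non-empty `degraded_metrics` result gives a simple, auditable link between production performance and the documented baseline.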

Data Drift and Concept Drift

Drift is the silent killer of AI systems. The model doesn't change — the world does.

Data drift (covariate shift) — The input data distribution changes from what the model was trained on. Example: A fraud detection model trained on in-store transaction patterns encounters a surge in mobile payments.

Concept drift — The relationship between inputs and outputs changes. Example: Customer behavior patterns that predicted churn in 2023 no longer predict churn in 2026 due to market changes.

Detection methods:

- Statistical tests comparing production data distributions to training data distributions

- Monitoring prediction confidence scores — declining confidence may indicate drift

- Tracking performance metrics over time — gradual degradation suggests concept drift

- Periodic re-evaluation on labeled production data
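One common statistical test for the first detection method is the Population Stability Index (PSI), which compares the binned distribution of a production feature against the training-time distribution. A self-contained sketch; the bin count and the conventional rule-of-thumb thresholds (below 0.1 stable, 0.1 to 0.25 moderate shift, above 0.25 significant shift) are widely used conventions, not a formal standard:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a training-time sample (expected)
    and a production sample (actual). Bin edges come from the expected data."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = sum(1 for e in edges if v >= e)  # which bin v falls into
            counts[idx] += 1
        # small floor avoids log(0) when a bin is empty
        return [max(c / len(values), 1e-4) for c in counts]

    exp_frac = bin_fractions(expected)
    act_frac = bin_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))
```

A PSI computed per feature per monitoring window turns "the input distribution changed" into a number that a governance threshold can act on.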

Governance response to drift:

- Define drift thresholds that trigger alerts

- Establish escalation procedures for significant drift

- Define retraining criteria and approval processes

- Document all drift events and responses
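The governance responses above can be encoded as an explicit, documented policy rather than ad-hoc judgment. An illustrative sketch; the threshold values and action strings are assumptions an organization would define in its own governance plan:

```python
def drift_response(drift_score, alert_threshold=0.1, retrain_threshold=0.25):
    """Map a drift score (e.g., a PSI value) to a documented governance action.
    Thresholds here are placeholders for organization-specific risk tolerances."""
    if drift_score >= retrain_threshold:
        return "escalate: initiate retraining criteria and approval process"
    if drift_score >= alert_threshold:
        return "alert: investigate drift and document findings"
    return "log: within tolerance"
```

Keeping the policy in code (or configuration) makes every drift event and its response reproducible and auditable.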

Knowledge Check
A credit scoring model's accuracy gradually decreases over 6 months even though the model hasn't been modified. The most likely explanation is:
Gradual performance degradation in an unchanged model is the hallmark of concept drift. The world has changed (economic conditions, consumer behavior, market dynamics) while the model remains static. Software bugs would cause sudden changes. Training data issues and overfitting would have been apparent from the start.

Fairness Monitoring

Pre-deployment bias testing is necessary but not sufficient. Fairness must be monitored continuously:

Why production fairness differs from test fairness:

- Production data may have different demographic distributions

- Data drift may affect demographic groups unequally

- Real-world feedback loops can amplify initial biases over time

- User behavior may interact with the AI in unexpected ways

Fairness monitoring checklist:

- Track the same fairness metrics used in pre-deployment testing

- Monitor outcomes by demographic group on an ongoing basis

- Set alert thresholds for disparities exceeding defined tolerance levels

- Conduct periodic fairness audits on production data (not just training data)

- Document all fairness monitoring results and any corrective actions
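Monitoring outcomes by demographic group can be as simple as tracking positive-outcome rates per group and flagging any group that falls below a defined fraction of the best-performing group. A minimal sketch; the 0.8 default reflects the four-fifths rule, which is one common tolerance convention, not a universal requirement:

```python
def group_rates(records):
    """Positive-outcome rate per demographic group.
    `records` is a list of (group, outcome) pairs with outcome in {0, 1}."""
    totals, positives = {}, {}
    for group, outcome in records:
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + outcome
    return {g: positives[g] / totals[g] for g in totals}

def disparity_alert(rates, min_ratio=0.8):
    """Flag groups whose rate falls below `min_ratio` of the highest-rate
    group. 0.8 mirrors the four-fifths rule, a common but not universal bar."""
    best = max(rates.values())
    return [g for g, r in rates.items() if r < min_ratio * best]
```

Run on each monitoring window, a non-empty alert list would trigger the fairness audit and corrective-action steps in the checklist above.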

Knowledge Check
An AI hiring tool passed all fairness audits during testing. After 6 months in production, analysis reveals that approval rates for one demographic group have dropped significantly. What is the MOST likely governance failure?
Pre-deployment testing can't predict how an AI will perform as production data evolves. Without continuous fairness monitoring, disparities that emerge over time go undetected. The governance failure is the absence of ongoing monitoring, not necessarily the pre-deployment testing.

Monitoring Dashboards and Alerting

Effective monitoring requires structured alerting:

Green/Yellow/Red alert system:

- Green — All metrics within normal operating parameters

- Yellow — One or more metrics approaching threshold values. Trigger investigation.

- Red — Metrics exceed defined thresholds. Trigger escalation and potential system intervention.

Alert configuration:

- Define thresholds for each metric based on risk tolerance

- Differentiate between gradual degradation and sudden changes

- Route alerts to appropriate stakeholders based on severity

- Avoid alert fatigue by tuning thresholds carefully

- Document all alert events, investigations, and outcomes
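The Green/Yellow/Red scheme can be expressed as a small classification function with per-metric thresholds. A sketch under assumed threshold values; real deployments would load these from the governance-approved alert configuration:

```python
def alert_status(value, yellow, red, higher_is_worse=True):
    """Classify a monitored metric into green/yellow/red bands.
    For metrics where lower values are worse (e.g., accuracy),
    pass higher_is_worse=False and thresholds on the same scale."""
    if not higher_is_worse:
        value, yellow, red = -value, -yellow, -red
    if value >= red:
        return "red"
    if value >= yellow:
        return "yellow"
    return "green"
```

For example, a false positive rate of 0.07 against a yellow threshold of 0.05 and a red threshold of 0.10 yields "yellow", which per the scheme above triggers investigation rather than escalation.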

Final Check
An AI content moderation system triggers a yellow alert: the false positive rate (incorrectly flagging safe content) has increased by 15% over the past month. What is the MOST appropriate governance response?
Yellow alerts warrant investigation, not immediate shutdown or dismissal. The governance response is to understand why the false positive rate is increasing (data drift? content pattern changes?), assess the trajectory, and prepare mitigation. Loosening the alert threshold to silence the alert just masks the signal.
🎯
Day 25 Complete
"Monitoring catches what testing misses. Track performance metrics, data drift, concept drift, and fairness continuously. Define Green/Yellow/Red alert thresholds and escalation procedures before deployment."
Next Lesson
Human Oversight Models for AI Systems