Day 25 of 30

Continuous Monitoring of Deployed AI Systems

⏱ 18 min 📊 Medium AIGP Certification Prep

Welcome to Domain IV — the domain most candidates under-prepare for. Post-deployment governance is where governance meets reality. An AI system that passes all pre-deployment tests can still fail in production. Monitoring catches what testing misses.

[Figure] A production AI monitoring dashboard tracking performance metrics, drift indicators, and fairness scores in real time, enabling early intervention before harm occurs.

Performance Monitoring

Key performance metrics for deployed AI:

Accuracy metrics — Is the model performing as expected? Track overall accuracy, precision, recall, and F1 score against baseline values established during testing.

Latency — How fast does the model respond? Performance degradation can indicate infrastructure issues or model complexity problems.

Throughput — How many decisions is the model processing? Unexpected changes (spikes or drops) may signal issues.

Error rates — Track error types (false positives, false negatives) and their distribution. A sudden increase in a specific error type may indicate model degradation.

Business outcome metrics — Connect AI performance to business outcomes. If the AI is approving loans, track default rates. If it's screening resumes, track hiring success rates.
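The accuracy-related metrics above can be computed from a window of labeled production decisions and compared to testing-time baselines. A minimal sketch, in pure Python; the function names and the 0.05 tolerance are illustrative choices, not from any specific monitoring library:

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def degraded_metrics(live, baseline, tolerance=0.05):
    """Return the metrics that have fallen more than `tolerance` below
    their baseline values established during pre-deployment testing."""
    return {k: live[k] for k in baseline if baseline[k] - live[k] > tolerance}
```

Running this on each monitoring window and alerting on any non-empty `degraded_metrics` result gives a simple, auditable link between production performance and the documented baseline.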

Data Drift and Concept Drift

Drift is the silent killer of AI systems. The model doesn't change — the world does.

Data drift (covariate shift) — The input data distribution changes from what the model was trained on. Example: A fraud detection model trained on in-store transaction patterns encounters a surge in mobile payments.

Concept drift — The relationship between inputs and outputs changes. Example: Customer behavior patterns that predicted churn in 2023 no longer predict churn in 2026 due to market changes.

Detection methods:

- Statistical tests comparing production data distributions to training data distributions

- Monitoring prediction confidence scores — declining confidence may indicate drift

- Tracking performance metrics over time — gradual degradation suggests concept drift

- Periodic re-evaluation on labeled production data
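One common statistical test for the first detection method is the Population Stability Index (PSI), which compares the binned distribution of a production feature against the training-time distribution. A self-contained sketch; the bin count and the conventional rule-of-thumb thresholds (below 0.1 stable, 0.1 to 0.25 moderate shift, above 0.25 significant shift) are widely used conventions, not a formal standard:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a training-time sample (expected)
    and a production sample (actual). Bin edges come from the expected data."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = sum(1 for e in edges if v >= e)  # which bin v falls into
            counts[idx] += 1
        # small floor avoids log(0) when a bin is empty
        return [max(c / len(values), 1e-4) for c in counts]

    exp_frac = bin_fractions(expected)
    act_frac = bin_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))
```

A PSI computed per feature per monitoring window turns "the input distribution changed" into a number that a governance threshold can act on.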

Governance response to drift:

- Define drift thresholds that trigger alerts

- Establish escalation procedures for significant drift

- Define retraining criteria and approval processes

- Document all drift events and responses
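The governance responses above can be encoded as an explicit, documented policy rather than ad-hoc judgment. An illustrative sketch; the threshold values and action strings are assumptions an organization would define in its own governance plan:

```python
def drift_response(drift_score, alert_threshold=0.1, retrain_threshold=0.25):
    """Map a drift score (e.g., a PSI value) to a documented governance action.
    Thresholds here are placeholders for organization-specific risk tolerances."""
    if drift_score >= retrain_threshold:
        return "escalate: initiate retraining criteria and approval process"
    if drift_score >= alert_threshold:
        return "alert: investigate drift and document findings"
    return "log: within tolerance"
```

Keeping the policy in code (or configuration) makes every drift event and its response reproducible and auditable.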

Knowledge Check
A credit scoring model's accuracy gradually decreases over 6 months even though the model hasn't been modified. The most likely explanation is:
Gradual performance degradation in an unchanged model is the hallmark of concept drift. The world has changed (economic conditions, consumer behavior, market dynamics) while the model remains static. Software bugs would cause sudden changes. Training data issues and overfitting would have been apparent from the start.

Fairness Monitoring

Pre-deployment bias testing is necessary but not sufficient. Fairness must be monitored continuously:

Why production fairness differs from test fairness:

- Production data may have different demographic distributions

- Data drift may affect demographic groups unequally

- Real-world feedback loops can amplify initial biases over time

- User behavior may interact with the AI in unexpected ways

Fairness monitoring checklist:

- Track the same fairness metrics used in pre-deployment testing

- Monitor outcomes by demographic group on an ongoing basis

- Set alert thresholds for disparities exceeding defined tolerance levels

- Conduct periodic fairness audits on production data (not just training data)

- Document all fairness monitoring results and any corrective actions
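Monitoring outcomes by demographic group can be as simple as tracking positive-outcome rates per group and flagging any group that falls below a defined fraction of the best-performing group. A minimal sketch; the 0.8 default reflects the four-fifths rule, which is one common tolerance convention, not a universal requirement:

```python
def group_rates(records):
    """Positive-outcome rate per demographic group.
    `records` is a list of (group, outcome) pairs with outcome in {0, 1}."""
    totals, positives = {}, {}
    for group, outcome in records:
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + outcome
    return {g: positives[g] / totals[g] for g in totals}

def disparity_alert(rates, min_ratio=0.8):
    """Flag groups whose rate falls below `min_ratio` of the highest-rate
    group. 0.8 mirrors the four-fifths rule, a common but not universal bar."""
    best = max(rates.values())
    return [g for g, r in rates.items() if r < min_ratio * best]
```

Run on each monitoring window, a non-empty alert list would trigger the fairness audit and corrective-action steps in the checklist above.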

Knowledge Check
An AI hiring tool passed all fairness audits during testing. After 6 months in production, analysis reveals that approval rates for one demographic group have dropped significantly. What is the MOST likely governance failure?
Pre-deployment testing can't predict how an AI will perform as production data evolves. Without continuous fairness monitoring, disparities that emerge over time go undetected. The governance failure is the absence of ongoing monitoring, not necessarily the pre-deployment testing.

Monitoring Dashboards and Alerting

Effective monitoring requires structured alerting:

Green/Yellow/Red alert system:

- Green — All metrics within normal operating parameters

- Yellow — One or more metrics approaching threshold values. Trigger investigation.

- Red — Metrics exceed defined thresholds. Trigger escalation and potential system intervention.

Alert configuration:

- Define thresholds for each metric based on risk tolerance

- Differentiate between gradual degradation and sudden changes

- Route alerts to appropriate stakeholders based on severity

- Avoid alert fatigue by tuning thresholds carefully

- Document all alert events, investigations, and outcomes
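The Green/Yellow/Red scheme can be expressed as a small classification function with per-metric thresholds. A sketch under assumed threshold values; real deployments would load these from the governance-approved alert configuration:

```python
def alert_status(value, yellow, red, higher_is_worse=True):
    """Classify a monitored metric into green/yellow/red bands.
    For metrics where lower values are worse (e.g., accuracy),
    pass higher_is_worse=False and thresholds on the same scale."""
    if not higher_is_worse:
        value, yellow, red = -value, -yellow, -red
    if value >= red:
        return "red"
    if value >= yellow:
        return "yellow"
    return "green"
```

For example, a false positive rate of 0.07 against a yellow threshold of 0.05 and a red threshold of 0.10 yields "yellow", which per the scheme above triggers investigation rather than escalation.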

Final Check
An AI content moderation system triggers a yellow alert: the false positive rate (incorrectly flagging safe content) has increased by 15% over the past month. What is the MOST appropriate governance response?
Yellow alerts warrant investigation, not immediate shutdown or dismissal. The governance response is to understand why the false positive rate is increasing (data drift? content pattern changes?), assess the trajectory, and prepare mitigation. Loosening the alert threshold to silence the alert just masks the signal.
🎯
Day 25 Complete
"Monitoring catches what testing misses. Track performance metrics, data drift, concept drift, and fairness continuously. Define Green/Yellow/Red alert thresholds and escalation procedures before deployment."
Next Lesson
Human Oversight Models for AI Systems