Data is the fuel for AI and the primary attack surface. Today we cover data controls across the AI data lifecycle — from training through inference — with AI-specific classification and governance requirements.
Training data requires controls beyond traditional data security:
Access controls — Who can read, modify, add, or delete training data? Apply least privilege. Data scientists may need read access; only authorized pipelines should modify training datasets.
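A default-deny permission check captures the idea. This is a minimal sketch, assuming hypothetical role names; a real deployment would express this in your platform's IAM policies rather than application code:

```python
# Least-privilege check for training-data actions (default deny).
# Role names and the permission map are illustrative assumptions.
ROLE_PERMISSIONS = {
    "data_scientist": {"read"},
    "training_pipeline": {"read", "modify", "add"},
    "data_steward": {"read", "add", "delete"},
}

def is_allowed(role: str, action: str) -> bool:
    """Grant an action only if the role explicitly lists it."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Note the default: an unknown role gets an empty permission set, so anything not explicitly granted is denied.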
Encryption — Encrypt training data at rest and in transit. This includes intermediate datasets, feature stores, and data lakes used for training.
Anonymization and pseudonymization — Apply before training when possible. However, understand that anonymized data used for training may still allow models to memorize and regenerate identifying information.
Quality gates — Automated checks before training data enters the pipeline: completeness checks, format validation, outlier detection, and bias screening. Low-quality data produces low-quality models.
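As a sketch of such a gate, assuming tabular records as Python dicts with hypothetical field names and a 3-sigma outlier rule; bias screening is omitted because it needs domain-specific criteria:

```python
# Quality gate: completeness, format, and outlier checks before data
# enters the training pipeline. Fields and thresholds are assumptions.
import statistics

REQUIRED_FIELDS = {"id", "value"}

def quality_gate(records):
    """Return a list of (record index, reason) failures; empty means pass."""
    failures = []
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            failures.append((i, f"missing fields: {sorted(missing)}"))
        elif not isinstance(rec["value"], (int, float)):
            failures.append((i, "value is not numeric"))
    values = [r["value"] for r in records
              if isinstance(r.get("value"), (int, float))]
    if len(values) >= 3:
        mean, stdev = statistics.mean(values), statistics.stdev(values)
        for i, rec in enumerate(records):
            v = rec.get("value")
            if isinstance(v, (int, float)) and stdev and abs(v - mean) > 3 * stdev:
                failures.append((i, "outlier (>3 sigma)"))
    return failures
```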
Provenance tracking — Document where each dataset came from, when it was collected, who processed it, and what transformations were applied. This is your data audit trail.
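A provenance record can be as simple as a structured object per dataset. A minimal sketch, with illustrative field names:

```python
# Provenance record for one dataset: source, collection date, processor,
# and an append-only list of transformations. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    dataset: str
    source: str
    collected_at: str        # ISO 8601 date
    processed_by: str
    transformations: list = field(default_factory=list)

    def add_transformation(self, step: str):
        """Append a processing step to the dataset's audit trail."""
        self.transformations.append(step)
```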
Version control — Version training datasets alongside model versions. When you retrain, you need to know exactly what data was used.
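One way to tie data to model versions is to fingerprint the exact dataset bytes and store the hash in a training manifest. A sketch, assuming JSON-serializable rows; the manifest layout is a made-up example:

```python
# Fingerprint a dataset so the exact data behind a model version can be
# verified later. Canonical JSON + SHA-256; manifest fields are assumptions.
import hashlib
import json

def dataset_fingerprint(rows) -> str:
    """Hash a canonical (sorted-key) JSON serialization of the dataset."""
    canonical = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def training_manifest(model_version: str, rows) -> dict:
    """Record which data produced which model version."""
    return {"model_version": model_version,
            "dataset_sha256": dataset_fingerprint(rows)}
```

Any change to the data changes the hash, so a retrained model can never silently reuse a manifest from different data.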
Production inference introduces different control requirements:
Input validation — Validate inputs before they reach the model. Check for format compliance, range validation, and anomaly detection. Malformed or adversarial inputs should be filtered before inference.
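A minimal pre-inference check might look like this; the schema, field names, and ranges are all hypothetical:

```python
# Pre-inference input validation: format compliance and range checks.
# The payload schema and limits are illustrative assumptions.
def validate_input(payload: dict) -> list:
    """Return a list of validation errors; empty list means proceed."""
    errors = []
    if not isinstance(payload.get("text"), str):
        errors.append("text must be a string")
    elif len(payload["text"]) > 10_000:
        errors.append("text exceeds maximum length")
    age = payload.get("age")
    if age is not None and not (0 <= age <= 130):
        errors.append("age out of range")
    return errors
```

Inputs that fail any check are rejected before they ever reach the model.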
Output filtering — Review model outputs before they reach users or downstream systems. For generative AI: content filtering for harmful, biased, or inappropriate content. For decision AI: confidence thresholds and sanity checks.
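For the decision-AI case, a confidence gate can be sketched in a few lines; the 0.8 threshold and the escalation action are assumptions, not a recommendation:

```python
# Output gate for decision AI: pass confident predictions through,
# escalate the rest to human review. Threshold value is an assumption.
def gate_prediction(label: str, confidence: float, threshold: float = 0.8) -> dict:
    """Accept the label only above the confidence threshold."""
    if confidence >= threshold:
        return {"action": "accept", "label": label}
    return {"action": "review", "label": None}
```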
PII redaction — If models process personal data during inference, implement controls to prevent PII from appearing in outputs, logs, or monitoring data where it shouldn't be.
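As a toy illustration of redaction before logging, here is a regex pass for emails and US-style SSNs. Real deployments need far broader pattern coverage, and typically NER-based detection as well:

```python
# Regex-based PII redaction for log and output sanitization.
# Only two patterns here; real coverage must be much broader.
import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def redact(text: str) -> str:
    """Replace matched PII spans with placeholder tokens."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```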
Rate limiting — Protect model APIs from abuse. Rate limiting prevents both denial-of-service attacks and systematic model extraction attempts.
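A classic implementation is a token bucket per API key. A minimal single-bucket sketch, with illustrative capacity and refill values:

```python
# Token-bucket rate limiter: each request spends one token; tokens
# refill over time up to a fixed capacity. Parameters are illustrative.
import time

class TokenBucket:
    def __init__(self, capacity: int = 10, refill_per_sec: float = 1.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In production you would keep one bucket per API key (or per user) so one client cannot exhaust capacity for everyone.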
Logging — Log inputs, outputs, and metadata for audit and monitoring. Balance logging needs with privacy requirements — you may need to log enough for security monitoring without retaining raw PII.
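One common way to strike that balance is to log a stable pseudonym instead of the raw identifier, so events remain correlatable for security monitoring without retaining PII. A sketch using keyed hashing; secret management is out of scope here and the key is a placeholder:

```python
# Privacy-preserving logging: replace the raw identifier with an HMAC
# pseudonym that is stable per user but not reversible without the key.
import hashlib
import hmac
import json

LOG_KEY = b"rotate-me"  # placeholder; store and rotate via a secrets manager

def log_event(user_email: str, event: str) -> str:
    """Return a JSON log line with the email replaced by a pseudonym."""
    pseudonym = hmac.new(LOG_KEY, user_email.encode(),
                         hashlib.sha256).hexdigest()[:16]
    return json.dumps({"user": pseudonym, "event": event})
```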
Extend your data classification scheme with AI-specific categories:
Training data — Classified based on the sensitivity of the source data. Customer PII used for training carries the same classification as the original PII, regardless of whether it's been transformed.
Validation and test data — Often derived from production data. Same classification considerations as training data, plus the requirement that it remain independent of the training data.
Model weights and parameters — These are derived artifacts that encode information from training data. A model trained on confidential data should be classified at least at the confidential level. Model weights are intellectual property and a security-relevant asset.
Feature data — Intermediate data produced during feature engineering. Classification depends on source data and whether the features could be used to reconstruct sensitive information.
Production inference data — Inputs and outputs during production use. Classification depends on content — inputs may contain PII, outputs may contain sensitive business decisions.
Metadata and logs — Usage logs, performance metrics, and monitoring data. May contain indirect PII (usage patterns, timing information) even when individual records are anonymized.
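A rule that runs through several of these categories is that a derived artifact inherits the most sensitive classification among its sources. A sketch, with illustrative level names:

```python
# Classification propagation: features, model weights, and other derived
# artifacts inherit the highest source classification. Labels are illustrative.
LEVELS = ["public", "internal", "confidential", "restricted"]

def derived_classification(source_levels):
    """Return the most sensitive level among the input sources."""
    return max(source_levels, key=LEVELS.index)
```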
AI creates complex retention and deletion challenges:
Regulatory requirements — GDPR right to erasure, CCPA deletion rights, and sector-specific retention rules all apply to AI training data. But what about the model trained on that data?
The model retention paradox — If you train a model on personal data and then delete the personal data, the model still contains learned patterns from that data. Is the model "processing" the deleted data? Legal interpretations vary, but the trend is toward treating models as data processing artifacts.
Practical approach:
- Define retention periods for each data category
- Implement automated deletion for expired data
- Document the relationship between training data and model versions
- Plan for model retraining when underlying data must be deleted
- Maintain audit trails for all deletion actions
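The first two steps can be sketched as an automated retention sweep; the category names and retention periods below are assumptions, not regulatory guidance:

```python
# Retention sweep: flag records whose category's retention period has
# expired. Categories and day counts are illustrative assumptions.
from datetime import date, timedelta

RETENTION_DAYS = {"training_data": 730, "inference_logs": 90}

def expired(records, today: date):
    """Return ids of records older than their category's retention period."""
    out = []
    for rec in records:
        limit = timedelta(days=RETENTION_DAYS[rec["category"]])
        if today - rec["created"] > limit:
            out.append(rec["id"])
    return out
```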
Lineage and audit trails — You must be able to answer: what data was used to train which model version, when was the data collected, and who approved its use? Data lineage is a governance requirement, not just a nice-to-have.
Poor data quality isn't just a performance issue — it's a security issue:
Bias from data — Unrepresentative data creates biased models. Data quality controls must include diversity and representation checks.
Poisoning vulnerability — Weak quality controls make poisoning easier. If anyone can contribute data without validation, an attacker can inject malicious training examples.
Decision integrity — AI decisions are only as good as the data they're based on. A credit model trained on inaccurate financial data makes inaccurate credit decisions.
Monitoring data quality — Implement ongoing data quality monitoring: completeness, accuracy, consistency, timeliness, and representativeness. Degradation in any dimension should trigger investigation.
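As a sketch of one dimension, here is a completeness monitor that alerts when the fraction of fully populated records drops below a baseline; the field names, baseline, and tolerance are all assumptions:

```python
# Completeness monitoring: track the fraction of fully populated records
# and flag drops below baseline. Fields and thresholds are assumptions.
def completeness(records, required=("id", "value")) -> float:
    """Fraction of records containing all required fields."""
    if not records:
        return 0.0
    ok = sum(1 for r in records if all(f in r for f in required))
    return ok / len(records)

def needs_investigation(current: float, baseline: float = 0.99,
                        tolerance: float = 0.02) -> bool:
    """True when completeness has degraded beyond tolerance."""
    return current < baseline - tolerance
```

The same pattern (compute a metric, compare against a baseline with tolerance) extends to accuracy, consistency, timeliness, and representativeness.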
Think of data quality as a security control, not just a data management practice.