Data is the fuel for AI and the primary attack surface. Today we cover data controls across the AI data lifecycle — from training through inference — with AI-specific classification and governance requirements.
Training data requires controls beyond traditional data security:
Access controls — Who can read, modify, add, or delete training data? Apply least privilege. Data scientists may need read access; only authorized pipelines should modify training datasets.
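A default-deny permission check captures the idea. This is a minimal sketch, assuming hypothetical role names; a real deployment would express this in your platform's IAM policies rather than application code:

```python
# Least-privilege check for training-data actions (default deny).
# Role names and the permission map are illustrative assumptions.
ROLE_PERMISSIONS = {
    "data_scientist": {"read"},
    "training_pipeline": {"read", "modify", "add"},
    "data_steward": {"read", "add", "delete"},
}

def is_allowed(role: str, action: str) -> bool:
    """Grant an action only if the role explicitly lists it."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Note the default: an unknown role gets an empty permission set, so anything not explicitly granted is denied.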
Encryption — Encrypt training data at rest and in transit. This includes intermediate datasets, feature stores, and data lakes used for training.
Anonymization and pseudonymization — Apply before training when possible. However, understand that anonymized data used for training may still allow models to memorize and regenerate identifying information.
Quality gates — Automated checks before training data enters the pipeline: completeness checks, format validation, outlier detection, and bias screening. Low-quality data produces low-quality models.
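As a sketch of such a gate, assuming tabular records as Python dicts with hypothetical field names and a 3-sigma outlier rule; bias screening is omitted because it needs domain-specific criteria:

```python
# Quality gate: completeness, format, and outlier checks before data
# enters the training pipeline. Fields and thresholds are assumptions.
import statistics

REQUIRED_FIELDS = {"id", "value"}

def quality_gate(records):
    """Return a list of (record index, reason) failures; empty means pass."""
    failures = []
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            failures.append((i, f"missing fields: {sorted(missing)}"))
        elif not isinstance(rec["value"], (int, float)):
            failures.append((i, "value is not numeric"))
    values = [r["value"] for r in records
              if isinstance(r.get("value"), (int, float))]
    if len(values) >= 3:
        mean, stdev = statistics.mean(values), statistics.stdev(values)
        for i, rec in enumerate(records):
            v = rec.get("value")
            if isinstance(v, (int, float)) and stdev and abs(v - mean) > 3 * stdev:
                failures.append((i, "outlier (>3 sigma)"))
    return failures
```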
Provenance tracking — Document where each dataset came from, when it was collected, who processed it, and what transformations were applied. This is your data audit trail.
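A provenance record can be as simple as a structured object per dataset. A minimal sketch, with illustrative field names:

```python
# Provenance record for one dataset: source, collection date, processor,
# and an append-only list of transformations. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    dataset: str
    source: str
    collected_at: str        # ISO 8601 date
    processed_by: str
    transformations: list = field(default_factory=list)

    def add_transformation(self, step: str):
        """Append a processing step to the dataset's audit trail."""
        self.transformations.append(step)
```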
Version control — Version training datasets alongside model versions. When you retrain, you need to know exactly what data was used.
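One way to tie data to model versions is to fingerprint the exact dataset bytes and store the hash in a training manifest. A sketch, assuming JSON-serializable rows; the manifest layout is a made-up example:

```python
# Fingerprint a dataset so the exact data behind a model version can be
# verified later. Canonical JSON + SHA-256; manifest fields are assumptions.
import hashlib
import json

def dataset_fingerprint(rows) -> str:
    """Hash a canonical (sorted-key) JSON serialization of the dataset."""
    canonical = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def training_manifest(model_version: str, rows) -> dict:
    """Record which data produced which model version."""
    return {"model_version": model_version,
            "dataset_sha256": dataset_fingerprint(rows)}
```

Any change to the data changes the hash, so a retrained model can never silently reuse a manifest from different data.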
Production inference introduces different control requirements:
Input validation — Validate inputs before they reach the model. Check for format compliance, range validation, and anomaly detection. Malformed or adversarial inputs should be filtered before inference.
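A minimal pre-inference check might look like this; the schema, field names, and ranges are all hypothetical:

```python
# Pre-inference input validation: format compliance and range checks.
# The payload schema and limits are illustrative assumptions.
def validate_input(payload: dict) -> list:
    """Return a list of validation errors; empty list means proceed."""
    errors = []
    if not isinstance(payload.get("text"), str):
        errors.append("text must be a string")
    elif len(payload["text"]) > 10_000:
        errors.append("text exceeds maximum length")
    age = payload.get("age")
    if age is not None and not (0 <= age <= 130):
        errors.append("age out of range")
    return errors
```

Inputs that fail any check are rejected before they ever reach the model.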
Output filtering — Review model outputs before they reach users or downstream systems. For generative AI: content filtering for harmful, biased, or inappropriate content. For decision AI: confidence thresholds and sanity checks.
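For the decision-AI case, a confidence gate can be sketched in a few lines; the 0.8 threshold and the escalation action are assumptions, not a recommendation:

```python
# Output gate for decision AI: pass confident predictions through,
# escalate the rest to human review. Threshold value is an assumption.
def gate_prediction(label: str, confidence: float, threshold: float = 0.8) -> dict:
    """Accept the label only above the confidence threshold."""
    if confidence >= threshold:
        return {"action": "accept", "label": label}
    return {"action": "review", "label": None}
```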
PII redaction — If models process personal data during inference, implement controls to prevent PII from appearing in outputs, logs, or monitoring data where it shouldn't be.
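As a toy illustration of redaction before logging, here is a regex pass for emails and US-style SSNs. Real deployments need far broader pattern coverage, and typically NER-based detection as well:

```python
# Regex-based PII redaction for log and output sanitization.
# Only two patterns here; real coverage must be much broader.
import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def redact(text: str) -> str:
    """Replace matched PII spans with placeholder tokens."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```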
Rate limiting — Protect model APIs from abuse. Rate limiting prevents both denial-of-service attacks and systematic model extraction attempts.
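A classic implementation is a token bucket per API key. A minimal single-bucket sketch, with illustrative capacity and refill values:

```python
# Token-bucket rate limiter: each request spends one token; tokens
# refill over time up to a fixed capacity. Parameters are illustrative.
import time

class TokenBucket:
    def __init__(self, capacity: int = 10, refill_per_sec: float = 1.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In production you would keep one bucket per API key (or per user) so one client cannot exhaust capacity for everyone.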
Logging — Log inputs, outputs, and metadata for audit and monitoring. Balance logging needs with privacy requirements — you may need to log enough for security monitoring without retaining raw PII.
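One common way to strike that balance is to log a stable pseudonym instead of the raw identifier, so events remain correlatable for security monitoring without retaining PII. A sketch using keyed hashing; secret management is out of scope here and the key is a placeholder:

```python
# Privacy-preserving logging: replace the raw identifier with an HMAC
# pseudonym that is stable per user but not reversible without the key.
import hashlib
import hmac
import json

LOG_KEY = b"rotate-me"  # placeholder; store and rotate via a secrets manager

def log_event(user_email: str, event: str) -> str:
    """Return a JSON log line with the email replaced by a pseudonym."""
    pseudonym = hmac.new(LOG_KEY, user_email.encode(),
                         hashlib.sha256).hexdigest()[:16]
    return json.dumps({"user": pseudonym, "event": event})
```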
Extend your data classification scheme with AI-specific categories:
Training data — Classified based on the sensitivity of the source data. Customer PII used for training carries the same classification as the original PII, regardless of whether it's been transformed.
Validation and test data — Often derived from production data. Same classification considerations as training data, plus the requirement that it remain independent of the training data.
Model weights and parameters — These are derived artifacts that encode information from training data. A model trained on confidential data should be classified at least at the confidential level. Model weights are intellectual property and a security-relevant asset.
Feature data — Intermediate data produced during feature engineering. Classification depends on source data and whether the features could be used to reconstruct sensitive information.
Production inference data — Inputs and outputs during production use. Classification depends on content — inputs may contain PII, outputs may contain sensitive business decisions.
Metadata and logs — Usage logs, performance metrics, and monitoring data. May contain indirect PII (usage patterns, timing information) even when individual records are anonymized.
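A rule that runs through several of these categories is that a derived artifact inherits the most sensitive classification among its sources. A sketch, with illustrative level names:

```python
# Classification propagation: features, model weights, and other derived
# artifacts inherit the highest source classification. Labels are illustrative.
LEVELS = ["public", "internal", "confidential", "restricted"]

def derived_classification(source_levels):
    """Return the most sensitive level among the input sources."""
    return max(source_levels, key=LEVELS.index)
```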
AI creates complex retention and deletion challenges:
Regulatory requirements — GDPR right to erasure, CCPA deletion rights, and sector-specific retention rules all apply to AI training data. But what about the model trained on that data?
The model retention paradox — If you train a model on personal data and then delete the personal data, the model still contains learned patterns from that data. Is the model "processing" the deleted data? Legal interpretations vary, but the trend is toward treating models as data processing artifacts.
Practical approach:
- Define retention periods for each data category
- Implement automated deletion for expired data
- Document the relationship between training data and model versions
- Plan for model retraining when underlying data must be deleted
- Maintain audit trails for all deletion actions
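The first two steps can be sketched as an automated retention sweep; the category names and retention periods below are assumptions, not regulatory guidance:

```python
# Retention sweep: flag records whose category's retention period has
# expired. Categories and day counts are illustrative assumptions.
from datetime import date, timedelta

RETENTION_DAYS = {"training_data": 730, "inference_logs": 90}

def expired(records, today: date):
    """Return ids of records older than their category's retention period."""
    out = []
    for rec in records:
        limit = timedelta(days=RETENTION_DAYS[rec["category"]])
        if today - rec["created"] > limit:
            out.append(rec["id"])
    return out
```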
Lineage and audit trails — You must be able to answer: what data was used to train which model version, when was the data collected, and who approved its use? Data lineage is a governance requirement, not just a nice-to-have.
Poor data quality isn't just a performance issue — it's a security issue:
Bias from data — Unrepresentative data creates biased models. Data quality controls must include diversity and representation checks.
Poisoning vulnerability — Weak quality controls make poisoning easier. If anyone can contribute data without validation, an attacker can inject malicious training examples.
Decision integrity — AI decisions are only as good as the data they're based on. A credit model trained on inaccurate financial data makes inaccurate credit decisions.
Monitoring data quality — Implement ongoing data quality monitoring: completeness, accuracy, consistency, timeliness, and representativeness. Degradation in any dimension should trigger investigation.
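As a sketch of one dimension, here is a completeness monitor that alerts when the fraction of fully populated records drops below a baseline; the field names, baseline, and tolerance are all assumptions:

```python
# Completeness monitoring: track the fraction of fully populated records
# and flag drops below baseline. Fields and thresholds are assumptions.
def completeness(records, required=("id", "value")) -> float:
    """Fraction of records containing all required fields."""
    if not records:
        return 0.0
    ok = sum(1 for r in records if all(f in r for f in required))
    return ok / len(records)

def needs_investigation(current: float, baseline: float = 0.99,
                        tolerance: float = 0.02) -> bool:
    """True when completeness has degraded beyond tolerance."""
    return current < baseline - tolerance
```

The same pattern (compute a metric, compare against a baseline with tolerance) extends to accuracy, consistency, timeliness, and representativeness.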
Think of data quality as a security control, not just a data management practice.