Data is the foundation of every AI system. Bad data governance doesn't just produce bad models — it produces discriminatory, non-compliant, and potentially dangerous models. Today you'll learn the governance controls that must be applied to AI development data.
Traditional data quality (accuracy, completeness, timeliness) is necessary but not sufficient for AI. Add these AI-specific dimensions:
Representativeness — Does the data adequately represent all groups the AI will affect? If a facial recognition system is trained primarily on lighter-skinned faces, it will perform poorly on darker-skinned faces. This isn't just a technical problem — it's a governance failure.
Label accuracy — For supervised learning, labels define truth. Inaccurate or inconsistent labels directly degrade model quality. Governance must ensure labeling guidelines, quality assurance, and inter-rater reliability checks.
Temporal relevance — Is the data current enough for the intended use? A credit scoring model trained on pre-pandemic data may not reflect current economic conditions.
Distributional alignment — Does the training data distribution match the deployment environment? A model trained on US data deployed in EU markets may produce unreliable results.
Sufficiency — Is there enough data to train a reliable model? Insufficient data, especially for minority classes, leads to unreliable predictions for those groups.
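Several of these dimensions, notably representativeness and sufficiency, can be screened automatically before training begins. The sketch below is illustrative only: the field name, threshold values, and report structure are assumptions, not part of any standard, and real audits would use benchmarks appropriate to the deployment population.

```python
from collections import Counter

def representation_report(records, group_field, min_share=0.05, min_count=100):
    """Flag groups that are under-represented or too small to model reliably.

    records     -- list of dicts, one per training example
    group_field -- the demographic/category field to audit (assumed name)
    min_share   -- assumed floor on a group's share of the dataset
    min_count   -- assumed floor on absolute examples per group
    """
    counts = Counter(r[group_field] for r in records)
    total = sum(counts.values())
    report = {}
    for group, n in counts.items():
        report[group] = {
            "count": n,
            "share": n / total,
            "under_represented": n / total < min_share,
            "insufficient": n < min_count,
        }
    return report

# Toy dataset: one group is both under-represented and too small.
data = [{"skin_tone": "lighter"}] * 950 + [{"skin_tone": "darker"}] * 50
report = representation_report(data, "skin_tone", min_share=0.10, min_count=100)
```

A failed check here is a governance gate, not just a warning: development should pause until the gap is documented and either mitigated or formally accepted.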
Bias can enter data at multiple points. Governance requires systematic detection:
Historical bias — Data reflecting past discrimination (e.g., historical hiring data in industries that excluded certain groups).
Selection bias — Non-random sampling that overrepresents or underrepresents certain populations.
Measurement bias — Inconsistent data collection methods across groups (e.g., different diagnostic criteria applied to different demographics).
Label bias — Annotators' subjective judgments reflecting personal or cultural biases.
Aggregation bias — Combining data from different contexts without accounting for population differences.
Governance response: Require demographic parity analysis of training datasets before model development begins. Document any identified biases and the mitigation strategies employed.
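A demographic parity analysis of training labels can be as simple as comparing positive-label rates across groups. The sketch below uses hypothetical field names (`group`, `hired`) and toy data; in practice the acceptable gap is a policy decision, not a fixed number.

```python
from collections import defaultdict

def demographic_parity_gap(examples, group_field, label_field, positive=1):
    """Compare positive-label rates across groups in a labeled dataset.

    Returns per-group positive rates and the maximum pairwise gap.
    A large gap in the training labels themselves can signal historical
    or label bias that must be documented before model development begins.
    """
    pos = defaultdict(int)
    tot = defaultdict(int)
    for ex in examples:
        g = ex[group_field]
        tot[g] += 1
        if ex[label_field] == positive:
            pos[g] += 1
    rates = {g: pos[g] / tot[g] for g in tot}
    gap = max(rates.values()) - min(rates.values())
    return rates, gap

# Toy hiring dataset in which historical labels favor group "A".
data = ([{"group": "A", "hired": 1}] * 60 + [{"group": "A", "hired": 0}] * 40
        + [{"group": "B", "hired": 1}] * 30 + [{"group": "B", "hired": 0}] * 70)
rates, gap = demographic_parity_gap(data, "group", "hired")
# rates["A"] == 0.6, rates["B"] == 0.3
```

Label-rate parity is only one lens; it detects historical and label bias but says nothing about measurement or aggregation bias, which require reviewing how the data was collected and combined.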
Data labeling (annotation) is where human judgment enters the AI pipeline. Governance controls include:
Annotator guidelines — Clear, detailed instructions for labeling decisions. Reduce ambiguity to improve consistency.
Quality assurance — Double-labeling (two annotators label the same data independently), spot-checking, and regular accuracy reviews.
Inter-rater reliability — Statistical measures (Cohen's kappa, Fleiss' kappa) of agreement between annotators. Low reliability indicates unclear guidelines or subjective labeling.
Annotator demographics — The composition of the annotator team can introduce bias. A monolingual team labeling sentiment in multilingual data is likely to produce biased labels.
Working conditions — Ethical treatment of annotators, especially for content moderation and sensitive data. This is both an ethical and quality concern — fatigued or distressed annotators produce lower-quality labels.
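For two annotators, Cohen's kappa is the standard agreement measure, correcting raw agreement for chance. A minimal pure-Python sketch follows (the sentiment labels are toy data; in production, `sklearn.metrics.cohen_kappa_score` computes the same statistic):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance given each annotator's own
    label frequencies. Assumes p_e < 1 (annotators use more than one label).
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Two annotators agree on 8 of 10 sentiment labels.
a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
b = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos", "pos", "neg"]
kappa = cohens_kappa(a, b)
```

Interpretation thresholds vary by domain, but a kappa well below the level raw agreement suggests (here 0.6 versus 80% raw agreement) is exactly the signal that guidelines are ambiguous or the labeling task is subjective.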
Two widely recognized frameworks for AI data documentation:
Datasheets for Datasets (Gebru et al., 2021) — A structured documentation template covering:
- Motivation: Why was the dataset created?
- Composition: What's in the dataset? Demographics?
- Collection: How was the data collected? By whom?
- Preprocessing: What cleaning or transformation was applied?
- Uses: What is the dataset intended for? What should it NOT be used for?
- Distribution: How is the dataset shared?
- Maintenance: Who maintains the dataset? How are errors corrected?
Data Cards — A similar concept used by organizations like Google, providing a summary of dataset characteristics, intended uses, and limitations.
These documentation artifacts serve governance purposes: they create accountability, enable auditing, and inform downstream users about data limitations.