The Core Challenge

AI systems are only as good as their data. Quality issues that might be minor annoyances in traditional analytics become critical vulnerabilities in AI: corrupted training data can permanently embed incorrect behaviour, and data quality degrades quietly over time.

Key Concepts

Data provenance: Documentation of where data comes from, how it was collected, and what processing it underwent.
Data lineage: Tracing data from its source through all transformations to its use in AI systems.
Data drift: Changes in data characteristics over time that can affect AI performance.
Data versioning: Maintaining a history of datasets alongside model versions for reproducibility.
Data quality dimensions: Completeness, accuracy, consistency, timeliness, and validity (a minimal automated check is sketched below).
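
To make the quality dimensions concrete, here is a minimal sketch of what an automated check might look like, assuming tabular data in a pandas DataFrame; the column names (customer_id, age, signup_date) and the 30-day freshness window are illustrative assumptions, not prescriptions.

```python
# Minimal sketch of automated checks for the quality dimensions above.
# Column names and the 30-day freshness window are illustrative assumptions.
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Summarise a few quality dimensions for one table."""
    now = pd.Timestamp.now()
    return {
        # Completeness: share of non-null values per column
        "completeness": df.notna().mean().round(3).to_dict(),
        # Validity: values fall inside an expected range
        "age_valid": bool(df["age"].between(0, 120).all()),
        # Consistency: identifiers are unique
        "ids_unique": bool(df["customer_id"].is_unique),
        # Timeliness: newest record is no older than 30 days
        "fresh": bool(now - df["signup_date"].max() <= pd.Timedelta(days=30)),
    }

df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 51, 29],
    "signup_date": pd.to_datetime(["2024-05-01", "2024-05-10", "2024-05-20"]),
})
print(quality_report(df))
```

In practice, checks like these would run inside the pipeline and fail the run or raise an alert when a dimension falls below an agreed threshold.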

Warning Signs

Watch for these indicators of data problems:

  • Training data sources are undocumented or unknown
  • Data quality validation is manual or ad hoc rather than automated
  • No monitoring exists for data drift over time (a basic drift check is sketched after this list)
  • Heavy dependence on external data sources with limited visibility
  • When AI fails, tracing the issue to data problems is difficult or impossible
  • Data from different sources is combined without consistency validation
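
Several of these warning signs disappear once checks are automated. As one illustration of drift monitoring, the sketch below compares a live feature distribution against the training-time reference using a two-sample Kolmogorov-Smirnov test from scipy; the feature name and the 0.05 significance threshold are assumptions for the example.

```python
# Sketch of a basic drift check: compare each live feature's distribution
# against the training-time reference with a two-sample KS test.
# The feature name and the 0.05 threshold are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(reference: dict, live: dict, alpha: float = 0.05) -> list:
    """Return names of features whose live distribution differs
    significantly from the training-time reference distribution."""
    flagged = []
    for name, ref_values in reference.items():
        _, p_value = ks_2samp(ref_values, live[name])
        if p_value < alpha:
            flagged.append(name)
    return flagged

rng = np.random.default_rng(0)
reference = {"income": rng.normal(50_000, 10_000, 5_000)}
live = {"income": rng.normal(58_000, 10_000, 5_000)}  # upstream shift
print(drifted_features(reference, live))              # -> ['income']
```

A flagged feature is a prompt for investigation, not proof of a problem; tests and thresholds should be chosen per feature and reviewed over time.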

Questions to Ask in AI Project Reviews

  • "Can we trace the lineage of data through the entire pipeline?"
  • "What data quality validation occurs at each stage?"
  • "What monitoring exists for data drift after deployment?"

Questions to Ask in Governance Discussions

  • "How dependent are we on external data sources? What due diligence was done?"
  • "What happens if our primary data source becomes unavailable or corrupted?"
  • "How would we detect if data quality was degrading?"

Questions to Ask in Strategy Sessions

  • "What's our overall data quality posture for AI systems?"
  • "Where are our most significant data risks?"
  • "Are we investing appropriately in data infrastructure?"

Reflection Prompts

  1. Your awareness: For AI systems in your area, do you know where the data comes from and how good it is?
  2. Your dependencies: What external data sources are you relying on? What happens if they fail?
  3. Your oversight: How would you know if data quality problems were affecting AI performance?

Good Practice Checklist

  • Data provenance is documented for all AI training data
  • Automated pipelines validate data quality at every stage
  • Data drift is monitored continuously
  • Data versions are maintained alongside model versions (see the manifest sketch after this checklist)
  • External data sources are vetted and have contingency plans
  • Data issues can be traced and root-caused effectively
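
For the versioning item, here is a minimal sketch of what "data versions alongside model versions" can mean in practice: fingerprint the training dataset and store the hash next to the model version so the exact inputs of a past run can be verified later. The file names and manifest layout are assumptions for illustration.

```python
# Sketch of pinning a dataset fingerprint to a model version so a past
# training run can be verified later. File names and the manifest layout
# are illustrative assumptions.
import hashlib
import json
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(data_path: Path, model_version: str, out_path: Path) -> None:
    manifest = {
        "model_version": model_version,
        "dataset_path": str(data_path),
        "dataset_sha256": file_sha256(data_path),
    }
    out_path.write_text(json.dumps(manifest, indent=2))

# Example (paths are hypothetical):
# write_manifest(Path("train.csv"), "model-2024-05", Path("manifest_2024-05.json"))
```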

Quick Reference

Element | Question to Ask | Red Flag
Provenance | Where does data come from? | Unknown or undocumented
Quality | What validation exists? | Manual or ad hoc
Drift | How are changes detected? | No monitoring
Versioning | Can we reproduce past states? | No history maintained
Dependencies | What's our exposure to external sources? | Unknown or unmanaged

The Data-AI Connection

Training data shapes behaviour. AI systems learn from examples. If examples are biased, incomplete, or wrong, the system will reflect that. This isn't a bug that can be simply fixed—it's fundamental to how AI works.

Quality compounds. Poor data quality at the start produces poor AI outputs. The problem magnifies through the pipeline.

Drift happens. The world changes. Data that was representative a year ago may not be representative now, and AI trained on old patterns may fail in new situations.

Dependencies create risk. If you depend on external data sources, you inherit their quality problems. You need visibility and contingencies.
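
As a small illustration of "visibility and contingencies" for external sources, the sketch below shows a guard that refuses to ingest a vendor feed whose schema has changed, and that can verify the checksum of a previously vetted delivery to detect corruption; the expected column names and the CSV format are assumptions for the example.

```python
# Sketch of a guard on an external feed: refuse to ingest the file when its
# schema differs from what was agreed, and optionally verify the checksum of
# a previously vetted delivery to detect corruption. The expected column
# names and CSV format are illustrative assumptions.
import csv
import hashlib
from pathlib import Path

EXPECTED_COLUMNS = ["account_id", "balance", "as_of_date"]

def file_checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def validate_feed(path: Path, vetted_checksum: str | None = None) -> None:
    """Raise if the feed's header or contents are not what we expect."""
    with path.open(newline="") as handle:
        header = next(csv.reader(handle))
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"Upstream schema changed: {header}")
    if vetted_checksum is not None and file_checksum(path) != vetted_checksum:
        raise ValueError("Feed contents differ from the vetted delivery")

# Example (file and checksum are hypothetical):
# validate_feed(Path("vendor_feed.csv"), vetted_checksum="9f2c...")
```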