The Core Challenge
AI systems are only as good as their data. Quality issues that would be minor annoyances in traditional analytics become critical vulnerabilities in AI: corrupted training data can permanently embed incorrect behaviour, and data quality degrades quietly over time.
Key Concepts
| Concept | Definition |
|---|---|
| Data provenance | Documentation of where data comes from, how it was collected, and what processing it underwent. |
| Data lineage | Tracing data from source through all transformations to its use in AI systems. |
| Data drift | Changes in data characteristics over time that can affect AI performance. |
| Data versioning | Maintaining history of datasets alongside model versions for reproducibility. |
| Data quality dimensions | Completeness, accuracy, consistency, timeliness, and validity. |
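The quality dimensions above can be made concrete as automated checks. The sketch below is a minimal, hypothetical example; the field names (`age`, `total`, `items`, `updated_at`) and thresholds are illustrative assumptions, not a standard schema. Accuracy is the one dimension that usually requires comparison against trusted ground truth, so it is left as a placeholder.

```python
# Hypothetical sketch: scoring a batch of records against the five
# data quality dimensions. Field names and thresholds are illustrative.
from datetime import datetime, timedelta, timezone

def quality_report(records, required_fields, max_age_days=30):
    """Return per-dimension issue counts for a list of dict records."""
    issues = {"completeness": 0, "accuracy": 0, "consistency": 0,
              "timeliness": 0, "validity": 0}
    now = datetime.now(timezone.utc)
    for r in records:
        # Completeness: every required field is present and non-empty.
        if any(r.get(f) in (None, "") for f in required_fields):
            issues["completeness"] += 1
        # Validity: values fall within an allowed range.
        age = r.get("age")
        if age is not None and not (0 <= age <= 120):
            issues["validity"] += 1
        # Consistency: derived fields agree with their source fields.
        if r.get("total") is not None and r.get("items") is not None:
            if r["total"] != sum(r["items"]):
                issues["consistency"] += 1
        # Timeliness: the record was updated recently enough.
        ts = r.get("updated_at")
        if ts is not None and now - ts > timedelta(days=max_age_days):
            issues["timeliness"] += 1
        # Accuracy typically needs a trusted reference to compare against,
        # so it is left as a placeholder in this sketch.
    return issues
```

In practice these checks would run inside the data pipeline, with the counts exported to monitoring rather than inspected by hand.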
Warning Signs
Watch for these indicators of data problems:
- Training data sources are undocumented or unknown
- Data quality validation is manual or ad hoc rather than automated
- No monitoring exists for data drift over time
- Heavy dependence on external data sources with limited visibility
- When AI fails, tracing the issue to data problems is difficult or impossible
- Data from different sources is combined without consistency validation
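Several of these warning signs point at the same gap: validation that is manual or ad hoc rather than automated. One remedy is an ingest gate that rejects a bad batch before it reaches training. The sketch below is a simplified illustration; the schema format and the null-rate threshold are assumptions for the example.

```python
# Hypothetical sketch: an automated ingest gate that rejects a batch when
# basic checks fail, instead of relying on ad hoc manual review.
# The schema format and null-rate threshold are illustrative assumptions.

def validate_batch(rows, schema, max_null_rate=0.05):
    """Raise ValueError if the batch fails schema or null-rate checks."""
    errors = []
    for field, expected_type in schema.items():
        nulls = sum(1 for r in rows if r.get(field) is None)
        if rows and nulls / len(rows) > max_null_rate:
            errors.append(f"{field}: null rate {nulls / len(rows):.0%} over limit")
        bad_type = [r for r in rows
                    if r.get(field) is not None
                    and not isinstance(r[field], expected_type)]
        if bad_type:
            errors.append(f"{field}: {len(bad_type)} values of wrong type")
    if errors:
        # Failing loudly here keeps bad data out of training and leaves
        # an auditable record of why the batch was rejected.
        raise ValueError("batch rejected: " + "; ".join(errors))
    return len(rows)  # batch accepted
```

The same pattern applies when combining sources: run the gate per source before the merge, so an inconsistency can be traced back to the feed that introduced it.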
Questions to Ask in AI Project Reviews
- "Can we trace the lineage of data through the entire pipeline?"
- "What data quality validation occurs at each stage?"
- "What monitoring exists for data drift after deployment?"
Questions to Ask in Governance Discussions
- "How dependent are we on external data sources? What due diligence was done?"
- "What happens if our primary data source becomes unavailable or corrupted?"
- "How would we detect if data quality was degrading?"
Questions to Ask in Strategy Sessions
- "What's our overall data quality posture for AI systems?"
- "Where are our most significant data risks?"
- "Are we investing appropriately in data infrastructure?"
Reflection Prompts
- Your awareness: For AI systems in your area, do you know where the data comes from and how good it is?
- Your dependencies: What external data sources are you relying on? What happens if they fail?
- Your oversight: How would you know if data quality problems were affecting AI performance?
Good Practice Checklist
- Data provenance is documented for all AI training data
- Automated pipelines validate data quality at every stage
- Data drift is monitored continuously
- Data versions are maintained alongside model versions
- External data sources are vetted and have contingency plans
- Data issues can be traced and root-caused effectively
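The versioning and traceability items on this checklist can be supported with something as simple as a content fingerprint recorded next to each model version. The sketch below assumes file-based datasets and an invented manifest format; it is one possible shape, not a prescribed tool.

```python
# Hypothetical sketch: pinning dataset versions next to a model version so
# a past training run can be reproduced. The manifest format is invented
# for illustration.
import hashlib
import json

def dataset_fingerprint(path):
    """Content hash of a dataset file; changes whenever the data changes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(model_version, dataset_paths, out_path="manifest.json"):
    """Record which exact dataset contents trained this model version."""
    manifest = {
        "model_version": model_version,
        "datasets": {p: dataset_fingerprint(p) for p in dataset_paths},
    }
    with open(out_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

When an AI failure needs root-causing, the manifest answers the first question: did this model see the data we think it saw?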
Quick Reference
| Element | Question to Ask | Red Flag |
|---|---|---|
| Provenance | Where does data come from? | Unknown or undocumented |
| Quality | What validation exists? | Manual or ad hoc |
| Drift | How are changes detected? | No monitoring |
| Versioning | Can we reproduce past states? | No history maintained |
| Dependencies | What's our exposure to external sources? | Unknown or unmanaged |
The Data-AI Connection
Training data shapes behaviour. AI systems learn from examples. If the examples are biased, incomplete, or wrong, the system will reflect that. This isn't a bug that can simply be patched; it's fundamental to how AI works.
Quality compounds. Poor data quality at the start produces poor AI outputs. The problem magnifies through the pipeline.
Drift happens. The world changes. Data that was representative a year ago may not be representative now. AI trained on old patterns may fail on new situations.
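Detecting that "the world has changed" can be automated. A common approach is to compare a feature's current distribution against its training-time reference; the sketch below uses the population stability index (PSI) with hand-rolled binning. The bin edges and the 0.2 alert level are conventional but still assumptions to tune per feature.

```python
# Hypothetical sketch: flagging data drift by comparing a feature's current
# distribution against a training-time reference using the population
# stability index (PSI). Bin edges and alert threshold are assumptions.
import math

def psi(reference, current, edges):
    """PSI between two samples over shared bins; > 0.2 is a common alarm level."""
    def proportions(sample):
        counts = [0] * (len(edges) - 1)
        for x in sample:
            for i in range(len(edges) - 1):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        total = max(sum(counts), 1)
        # Small floor avoids log(0) for empty bins.
        return [max(c / total, 1e-6) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Run per feature on a schedule after deployment; a PSI creeping upward is exactly the "representative a year ago, not now" failure described above, caught before it shows up as model errors.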
Dependencies create risk. If you depend on external data sources, you inherit their quality problems. You need visibility and contingencies.