The Core Challenge

AI systems are only as good as their data. Quality issues that might be minor annoyances in traditional analytics become critical vulnerabilities in AI: corrupted training data can permanently embed incorrect behaviour, and data quality degrades quietly over time.

Key Concepts

Data provenance: Documentation of where data comes from, how it was collected, and what processing it underwent.
Data lineage: Tracing data from its source through all transformations to its use in AI systems.
Data drift: Changes in data characteristics over time that can affect AI performance.
Data versioning: Maintaining a history of datasets alongside model versions for reproducibility.
Data quality dimensions: Completeness, accuracy, consistency, timeliness, and validity (a minimal automated check is sketched below).
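
To make the quality dimensions concrete, here is a minimal sketch of what an automated check might look like, assuming tabular data in a pandas DataFrame; the column names (customer_id, age, signup_date) and the 30-day freshness window are illustrative assumptions, not prescriptions.

```python
# Minimal sketch of automated checks for the quality dimensions above.
# Column names and the 30-day freshness window are illustrative assumptions.
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Summarise a few quality dimensions for one table."""
    now = pd.Timestamp.now()
    return {
        # Completeness: share of non-null values per column
        "completeness": df.notna().mean().round(3).to_dict(),
        # Validity: values fall inside an expected range
        "age_valid": bool(df["age"].between(0, 120).all()),
        # Consistency: identifiers are unique
        "ids_unique": bool(df["customer_id"].is_unique),
        # Timeliness: newest record is no older than 30 days
        "fresh": bool(now - df["signup_date"].max() <= pd.Timedelta(days=30)),
    }

df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 51, 29],
    "signup_date": pd.to_datetime(["2024-05-01", "2024-05-10", "2024-05-20"]),
})
print(quality_report(df))
```

In practice, checks like these would run inside the pipeline and fail the run or raise an alert when a dimension falls below an agreed threshold.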

Warning Signs

Watch for these indicators of data problems:

  • Training data sources are undocumented or unknown
  • Data quality validation is manual or ad hoc rather than automated
  • No monitoring exists for data drift over time (a basic drift check is sketched after this list)
  • Heavy dependence on external data sources with limited visibility
  • When AI fails, tracing the issue to data problems is difficult or impossible
  • Data from different sources is combined without consistency validation
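
Several of these warning signs disappear once checks are automated. As one illustration of drift monitoring, the sketch below compares a live feature distribution against the training-time reference using a two-sample Kolmogorov-Smirnov test from scipy; the feature name and the 0.05 significance threshold are assumptions for the example.

```python
# Sketch of a basic drift check: compare each live feature's distribution
# against the training-time reference with a two-sample KS test.
# The feature name and the 0.05 threshold are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(reference: dict, live: dict, alpha: float = 0.05) -> list:
    """Return names of features whose live distribution differs
    significantly from the training-time reference distribution."""
    flagged = []
    for name, ref_values in reference.items():
        _, p_value = ks_2samp(ref_values, live[name])
        if p_value < alpha:
            flagged.append(name)
    return flagged

rng = np.random.default_rng(0)
reference = {"income": rng.normal(50_000, 10_000, 5_000)}
live = {"income": rng.normal(58_000, 10_000, 5_000)}  # upstream shift
print(drifted_features(reference, live))              # -> ['income']
```

A flagged feature is a prompt for investigation, not proof of a problem; tests and thresholds should be chosen per feature and reviewed over time.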

Questions to Ask in AI Project Reviews

  • "Can we trace the lineage of data through the entire pipeline?"
  • "What data quality validation occurs at each stage?"
  • "What monitoring exists for data drift after deployment?"

Questions to Ask in Governance Discussions

  • "How dependent are we on external data sources? What due diligence was done?"
  • "What happens if our primary data source becomes unavailable or corrupted?"
  • "How would we detect if data quality was degrading?"

Questions to Ask in Strategy Sessions

  • "What's our overall data quality posture for AI systems?"
  • "Where are our most significant data risks?"
  • "Are we investing appropriately in data infrastructure?"

Reflection Prompts

  1. Your awareness: For AI systems in your area, do you know where the data comes from and how good it is?
  2. Your dependencies: What external data sources are you relying on? What happens if they fail?
  3. Your oversight: How would you know if data quality problems were affecting AI performance?

Good Practice Checklist

  • Data provenance is documented for all AI training data
  • Automated pipelines validate data quality at every stage
  • Data drift is monitored continuously
  • Data versions are maintained alongside model versions (see the manifest sketch after this checklist)
  • External data sources are vetted and have contingency plans
  • Data issues can be traced and root-caused effectively
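
For the versioning item, here is a minimal sketch of what "data versions alongside model versions" can mean in practice: fingerprint the training dataset and store the hash next to the model version so the exact inputs of a past run can be verified later. The file names and manifest layout are assumptions for illustration.

```python
# Sketch of pinning a dataset fingerprint to a model version so a past
# training run can be verified later. File names and the manifest layout
# are illustrative assumptions.
import hashlib
import json
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(data_path: Path, model_version: str, out_path: Path) -> None:
    manifest = {
        "model_version": model_version,
        "dataset_path": str(data_path),
        "dataset_sha256": file_sha256(data_path),
    }
    out_path.write_text(json.dumps(manifest, indent=2))

# Example (paths are hypothetical):
# write_manifest(Path("train.csv"), "model-2024-05", Path("manifest_2024-05.json"))
```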

Quick Reference

Element | Question to Ask | Red Flag
Provenance | Where does data come from? | Unknown or undocumented
Quality | What validation exists? | Manual or ad hoc
Drift | How are changes detected? | No monitoring
Versioning | Can we reproduce past states? | No history maintained
Dependencies | What's our exposure to external sources? | Unknown or unmanaged

The Data-AI Connection

Training data shapes behaviour. AI systems learn from examples. If examples are biased, incomplete, or wrong, the system will reflect that. This isn't a bug that can be simply fixed—it's fundamental to how AI works.

Quality compounds. Poor data quality at the start produces poor AI outputs. The problem magnifies through the pipeline.

Drift happens. The world changes. Data that was representative a year ago may not be representative now, and AI trained on old patterns may fail in new situations.

Dependencies create risk. If you depend on external data sources, you inherit their quality problems. You need visibility and contingencies.
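
As a small illustration of "visibility and contingencies" for external sources, the sketch below shows a guard that refuses to ingest a vendor feed whose schema has changed, and that can verify the checksum of a previously vetted delivery to detect corruption; the expected column names and the CSV format are assumptions for the example.

```python
# Sketch of a guard on an external feed: refuse to ingest the file when its
# schema differs from what was agreed, and optionally verify the checksum of
# a previously vetted delivery to detect corruption. The expected column
# names and CSV format are illustrative assumptions.
import csv
import hashlib
from pathlib import Path

EXPECTED_COLUMNS = ["account_id", "balance", "as_of_date"]

def file_checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def validate_feed(path: Path, vetted_checksum: str | None = None) -> None:
    """Raise if the feed's header or contents are not what we expect."""
    with path.open(newline="") as handle:
        header = next(csv.reader(handle))
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"Upstream schema changed: {header}")
    if vetted_checksum is not None and file_checksum(path) != vetted_checksum:
        raise ValueError("Feed contents differ from the vetted delivery")

# Example (file and checksum are hypothetical):
# validate_feed(Path("vendor_feed.csv"), vetted_checksum="9f2c...")
```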