The Core Challenge
AI reliability differs fundamentally from traditional software reliability. AI systems can perform well on average while failing badly under specific conditions, for specific populations, or on edge cases. They can degrade silently and produce confident-sounding wrong outputs.
Key Concepts
| Concept | Definition |
|---|---|
| Distributional shift | When real-world data differs from training data, causing unpredictable performance degradation. |
| Edge cases | Unusual situations not well-represented in training data where AI may fail unexpectedly. |
| Uncertainty quantification | AI systems expressing confidence levels rather than just single-point predictions. |
| Model drift | Gradual performance degradation over time as the world changes from training conditions. |
| Graceful degradation | System design ensuring that AI failures don't cascade into catastrophic outcomes. |
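To make distributional shift and model drift concrete, one common technique is to compare how an input feature is distributed in a training-time reference sample versus a recent production sample. The sketch below uses a two-sample Kolmogorov-Smirnov test for that comparison; the arrays, the significance threshold, and the single-feature scope are illustrative assumptions, not a complete drift-monitoring setup.

```python
import numpy as np
from scipy.stats import ks_2samp

# Illustrative significance level for flagging a shifted feature.
P_VALUE_THRESHOLD = 0.01

def check_feature_shift(reference: np.ndarray, production: np.ndarray) -> dict:
    """Compare one numeric feature between a training-time reference sample
    and a recent production sample with a two-sample Kolmogorov-Smirnov test."""
    result = ks_2samp(reference, production)
    return {
        "ks_statistic": result.statistic,
        "p_value": result.pvalue,
        "shift_flagged": result.pvalue < P_VALUE_THRESHOLD,
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: the production mean has drifted since training.
    reference_sample = rng.normal(loc=0.0, scale=1.0, size=5_000)
    production_sample = rng.normal(loc=0.4, scale=1.0, size=5_000)
    print(check_feature_shift(reference_sample, production_sample))
```

A flagged shift is a prompt to investigate, not proof that accuracy has degraded; in practice teams run a check like this across many features alongside outcome-based metrics.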
Warning Signs
Watch for these indicators of reliability problems:
- Performance is only measured as overall accuracy, not across conditions (see the sliced-evaluation sketch after this list)
- Edge cases and failure modes aren't systematically tested
- AI systems produce outputs without confidence indicators
- No monitoring exists to detect performance degradation in production
- When AI fails, there's no fallback—systems depend entirely on AI working
- Performance SLAs don't exist or aren't enforced
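The first warning sign, measuring only overall accuracy, is often the cheapest to fix. The sketch below shows sliced (disaggregated) evaluation, assuming each evaluation record carries a segment attribute; the segments, predictions, and labels are hypothetical.

```python
from collections import defaultdict

def accuracy_by_segment(records):
    """Compute overall accuracy and per-segment accuracy.

    Each record is a (segment, prediction, label) tuple; segments could be
    regions, device types, customer tiers, or any condition that matters.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for segment, prediction, label in records:
        total[segment] += 1
        correct[segment] += int(prediction == label)

    overall = sum(correct.values()) / sum(total.values())
    per_segment = {seg: correct[seg] / total[seg] for seg in total}
    return overall, per_segment

if __name__ == "__main__":
    # Hypothetical evaluation set: strong on segment A, weak on segment B.
    records = (
        [("A", 1, 1)] * 95 + [("A", 0, 1)] * 5 +    # 95% accuracy on A
        [("B", 1, 1)] * 60 + [("B", 0, 1)] * 40     # 60% accuracy on B
    )
    overall, per_segment = accuracy_by_segment(records)
    print(f"overall: {overall:.2%}")   # 77.50%: looks tolerable in aggregate
    print(per_segment)                 # ...but segment B sits at 60%
```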
Questions to Ask in AI Project Reviews
- "What's the performance variation across different conditions and populations?"
- "What edge cases have been tested? What failure modes have been identified?"
- "Does the system express confidence levels? What happens when confidence is low?"
Questions to Ask in Governance Discussions
- "How would we know if this system started degrading in production?"
- "What fallback exists when AI fails or underperforms?"
- "What are the performance SLAs, and who is accountable for meeting them?"
Questions to Ask in Strategy Sessions
- "How reliable do our AI systems need to be for their use cases?"
- "What's the cost of AI unreliability in our context—operational, reputational, human?"
- "Are we investing appropriately in AI reliability given the stakes?"
Reflection Prompts
- Your confidence: Do you actually know how reliably AI systems in your area perform? Or do you assume they work?
- Your exposure: If an AI system produced wrong outputs consistently for some period, how would you know? What would the impact be?
- Your standards: What level of AI reliability should you be demanding? Is current practice meeting that?
Good Practice Checklist
- Testing goes beyond average accuracy to examine conditions and populations
- Edge cases and failure modes are systematically identified and tested
- Systems express confidence, and low-confidence outputs trigger review (see the routing sketch after this checklist)
- Continuous monitoring detects performance degradation
- Graceful fallbacks exist when AI underperforms
- Clear SLAs define acceptable performance with accountability
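To illustrate the confidence and fallback items above, here is a minimal routing sketch: outputs below a confidence threshold are not acted on automatically but sent to a fallback path such as a human review queue. The threshold value, the `toy_model` stub, and the `Decision` structure are assumptions for illustration; in practice raw model scores also need calibration checks before being treated as confidence.

```python
from dataclasses import dataclass
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.85  # illustrative; set from validation data and risk appetite

@dataclass
class Decision:
    label: Optional[str]   # None means no automated decision was taken
    confidence: float
    route: str             # "auto" or "human_review"

def route_prediction(predict: Callable[[dict], tuple[str, float]], record: dict) -> Decision:
    """Act automatically only when the model is confident enough;
    otherwise fall back to human review."""
    label, confidence = predict(record)
    if confidence >= CONFIDENCE_THRESHOLD:
        return Decision(label=label, confidence=confidence, route="auto")
    return Decision(label=None, confidence=confidence, route="human_review")

if __name__ == "__main__":
    # Hypothetical model stub returning (label, confidence).
    def toy_model(record: dict) -> tuple[str, float]:
        return ("approve", 0.62) if record.get("ambiguous") else ("approve", 0.97)

    print(route_prediction(toy_model, {"ambiguous": False}))  # routed to auto
    print(route_prediction(toy_model, {"ambiguous": True}))   # routed to human review
```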
Quick Reference
| Element | Question to Ask | Red Flag |
|---|---|---|
| Testing | What conditions were tested? | Only average performance |
| Edge cases | What failure modes were found? | Not systematically examined |
| Confidence | Does the system know when it's uncertain? | Only single outputs |
| Monitoring | How is production performance tracked? | Set and forget |
| Fallback | What happens when AI fails? | Complete dependence on AI |
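For the Monitoring row above (and the SLA item in the checklist), here is a minimal sketch of outcome-based tracking: keep a rolling window of recent predictions paired with their eventual ground-truth outcomes and flag when rolling accuracy drops below an agreed floor. The window size, SLA floor, and alert logic are illustrative assumptions; in a real deployment this would feed whatever monitoring and alerting stack you already operate.

```python
from collections import deque

class RollingAccuracyMonitor:
    """Track accuracy over the most recent N labelled outcomes and
    flag when it falls below an agreed performance floor (SLA)."""

    def __init__(self, window_size: int = 1000, sla_floor: float = 0.90):
        self.outcomes = deque(maxlen=window_size)
        self.sla_floor = sla_floor

    def record(self, prediction, actual) -> None:
        self.outcomes.append(prediction == actual)

    @property
    def rolling_accuracy(self) -> float:
        if not self.outcomes:
            return float("nan")
        return sum(self.outcomes) / len(self.outcomes)

    def breaches_sla(self) -> bool:
        # Require a reasonably full window before alerting, to avoid
        # noisy alarms on a handful of early outcomes.
        return len(self.outcomes) >= 100 and self.rolling_accuracy < self.sla_floor

if __name__ == "__main__":
    monitor = RollingAccuracyMonitor(window_size=500, sla_floor=0.90)
    # Hypothetical stream: performance quietly slips after deployment.
    for i in range(500):
        monitor.record(prediction=1, actual=1 if i % 10 else 0)   # ~90% correct
    for i in range(300):
        monitor.record(prediction=1, actual=1 if i % 4 else 0)    # ~75% correct
    print(f"rolling accuracy: {monitor.rolling_accuracy:.2%}")
    print(f"SLA breached: {monitor.breaches_sla()}")
```

This kind of check only works where ground truth eventually arrives; where it doesn't, teams lean more heavily on input-shift checks like the KS-test sketch earlier.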
Why AI Reliability Is Different
Traditional software: Same input produces same output. Bugs are deterministic. When something fails, it usually fails obviously.
AI systems:
- Similar inputs can produce different outputs
- Failures can be silent—plausible-looking wrong answers
- Performance depends on similarity to training data
- Behaviour can change over time without code changes
- Testing average performance doesn't reveal all failure modes