The Core Challenge
AI reliability differs fundamentally from traditional software reliability. AI systems can perform well on average while failing badly under specific conditions, for specific populations, or on edge cases. They can degrade silently and produce confident-sounding wrong outputs.
Key Concepts
| Concept | Definition |
|---|---|
| Distributional shift | When real-world data differs from training data, causing unpredictable performance degradation. |
| Edge cases | Unusual situations not well-represented in training data where AI may fail unexpectedly. |
| Uncertainty quantification | AI systems expressing confidence levels rather than just single-point predictions. |
| Model drift | Gradual performance degradation over time as the world changes from training conditions. |
| Graceful degradation | System design ensuring that AI failures don't cascade into catastrophic outcomes. |
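To make distributional shift and model drift concrete, one common technique is to compare how an input feature is distributed in a training-time reference sample versus a recent production sample. The sketch below uses a two-sample Kolmogorov-Smirnov test for that comparison; the arrays, the significance threshold, and the single-feature scope are illustrative assumptions, not a complete drift-monitoring setup.

```python
import numpy as np
from scipy.stats import ks_2samp

# Illustrative significance level for flagging a shifted feature.
P_VALUE_THRESHOLD = 0.01

def check_feature_shift(reference: np.ndarray, production: np.ndarray) -> dict:
    """Compare one numeric feature between a training-time reference sample
    and a recent production sample with a two-sample Kolmogorov-Smirnov test."""
    result = ks_2samp(reference, production)
    return {
        "ks_statistic": result.statistic,
        "p_value": result.pvalue,
        "shift_flagged": result.pvalue < P_VALUE_THRESHOLD,
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: the production mean has drifted since training.
    reference_sample = rng.normal(loc=0.0, scale=1.0, size=5_000)
    production_sample = rng.normal(loc=0.4, scale=1.0, size=5_000)
    print(check_feature_shift(reference_sample, production_sample))
```

A flagged shift is a prompt to investigate, not proof that accuracy has degraded; in practice teams run a check like this across many features alongside outcome-based metrics.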
Warning Signs
Watch for these indicators of reliability problems:
- Performance is only measured as overall accuracy, not across conditions (see the sliced-evaluation sketch after this list)
- Edge cases and failure modes aren't systematically tested
- AI systems produce outputs without confidence indicators
- No monitoring exists to detect performance degradation in production
- When AI fails, there's no fallback—systems depend entirely on AI working
- Performance SLAs don't exist or aren't enforced
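The first warning sign, measuring only overall accuracy, is often the cheapest to fix. The sketch below shows sliced (disaggregated) evaluation, assuming each evaluation record carries a segment attribute; the segments, predictions, and labels are hypothetical.

```python
from collections import defaultdict

def accuracy_by_segment(records):
    """Compute overall accuracy and per-segment accuracy.

    Each record is a (segment, prediction, label) tuple; segments could be
    regions, device types, customer tiers, or any condition that matters.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for segment, prediction, label in records:
        total[segment] += 1
        correct[segment] += int(prediction == label)

    overall = sum(correct.values()) / sum(total.values())
    per_segment = {seg: correct[seg] / total[seg] for seg in total}
    return overall, per_segment

if __name__ == "__main__":
    # Hypothetical evaluation set: strong on segment A, weak on segment B.
    records = (
        [("A", 1, 1)] * 95 + [("A", 0, 1)] * 5 +    # 95% accuracy on A
        [("B", 1, 1)] * 60 + [("B", 0, 1)] * 40     # 60% accuracy on B
    )
    overall, per_segment = accuracy_by_segment(records)
    print(f"overall: {overall:.2%}")   # 77.50%: looks tolerable in aggregate
    print(per_segment)                 # ...but segment B sits at 60%
```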
Questions to Ask in AI Project Reviews
- "What's the performance variation across different conditions and populations?"
- "What edge cases have been tested? What failure modes have been identified?"
- "Does the system express confidence levels? What happens when confidence is low?"
Questions to Ask in Governance Discussions
- "How would we know if this system started degrading in production?"
- "What fallback exists when AI fails or underperforms?"
- "What are the performance SLAs, and who is accountable for meeting them?"
Questions to Ask in Strategy Sessions
- "How reliable do our AI systems need to be for their use cases?"
- "What's the cost of AI unreliability in our context—operational, reputational, human?"
- "Are we investing appropriately in AI reliability given the stakes?"
Reflection Prompts
- Your confidence: Do you actually know how reliably AI systems in your area perform? Or do you assume they work?
- Your exposure: If an AI system produced wrong outputs consistently for some period, how would you know? What would the impact be?
- Your standards: What level of AI reliability should you be demanding? Is current practice meeting that?
Good Practice Checklist
- Testing goes beyond average accuracy to examine conditions and populations
- Edge cases and failure modes are systematically identified and tested
- Systems express confidence, and low-confidence outputs trigger review (see the routing sketch after this checklist)
- Continuous monitoring detects performance degradation
- Graceful fallbacks exist when AI underperforms
- Clear SLAs define acceptable performance with accountability
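To illustrate the confidence and fallback items above, here is a minimal routing sketch: outputs below a confidence threshold are not acted on automatically but sent to a fallback path such as a human review queue. The threshold value, the `toy_model` stub, and the `Decision` structure are assumptions for illustration; in practice raw model scores also need calibration checks before being treated as confidence.

```python
from dataclasses import dataclass
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.85  # illustrative; set from validation data and risk appetite

@dataclass
class Decision:
    label: Optional[str]   # None means no automated decision was taken
    confidence: float
    route: str             # "auto" or "human_review"

def route_prediction(predict: Callable[[dict], tuple[str, float]], record: dict) -> Decision:
    """Act automatically only when the model is confident enough;
    otherwise fall back to human review."""
    label, confidence = predict(record)
    if confidence >= CONFIDENCE_THRESHOLD:
        return Decision(label=label, confidence=confidence, route="auto")
    return Decision(label=None, confidence=confidence, route="human_review")

if __name__ == "__main__":
    # Hypothetical model stub returning (label, confidence).
    def toy_model(record: dict) -> tuple[str, float]:
        return ("approve", 0.62) if record.get("ambiguous") else ("approve", 0.97)

    print(route_prediction(toy_model, {"ambiguous": False}))  # routed to auto
    print(route_prediction(toy_model, {"ambiguous": True}))   # routed to human review
```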
Quick Reference
| Element | Question to Ask | Red Flag |
|---|---|---|
| Testing | What conditions were tested? | Only average performance |
| Edge cases | What failure modes were found? | Not systematically examined |
| Confidence | Does the system know when it's uncertain? | Only single outputs |
| Monitoring | How is production performance tracked? | Set and forget |
| Fallback | What happens when AI fails? | Complete dependence on AI |
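For the Monitoring row above (and the SLA item in the checklist), here is a minimal sketch of outcome-based tracking: keep a rolling window of recent predictions paired with their eventual ground-truth outcomes and flag when rolling accuracy drops below an agreed floor. The window size, SLA floor, and alert logic are illustrative assumptions; in a real deployment this would feed whatever monitoring and alerting stack you already operate.

```python
from collections import deque

class RollingAccuracyMonitor:
    """Track accuracy over the most recent N labelled outcomes and
    flag when it falls below an agreed performance floor (SLA)."""

    def __init__(self, window_size: int = 1000, sla_floor: float = 0.90):
        self.outcomes = deque(maxlen=window_size)
        self.sla_floor = sla_floor

    def record(self, prediction, actual) -> None:
        self.outcomes.append(prediction == actual)

    @property
    def rolling_accuracy(self) -> float:
        if not self.outcomes:
            return float("nan")
        return sum(self.outcomes) / len(self.outcomes)

    def breaches_sla(self) -> bool:
        # Require a reasonably full window before alerting, to avoid
        # noisy alarms on a handful of early outcomes.
        return len(self.outcomes) >= 100 and self.rolling_accuracy < self.sla_floor

if __name__ == "__main__":
    monitor = RollingAccuracyMonitor(window_size=500, sla_floor=0.90)
    # Hypothetical stream: performance quietly slips after deployment.
    for i in range(500):
        monitor.record(prediction=1, actual=1 if i % 10 else 0)   # ~90% correct
    for i in range(300):
        monitor.record(prediction=1, actual=1 if i % 4 else 0)    # ~75% correct
    print(f"rolling accuracy: {monitor.rolling_accuracy:.2%}")
    print(f"SLA breached: {monitor.breaches_sla()}")
```

This kind of check only works where ground truth eventually arrives; where it doesn't, teams lean more heavily on input-shift checks like the KS-test sketch earlier.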
Why AI Reliability Is Different
Traditional software: Same input produces same output. Bugs are deterministic. When something fails, it usually fails obviously.
AI systems:
- Similar inputs can produce different outputs
- Failures can be silent—plausible-looking wrong answers
- Performance depends on similarity to training data
- Behaviour can change over time without code changes
- Testing average performance doesn't reveal all failure modes