The Core Challenge
AI systems create new categories of operational risk. When AI is embedded in critical processes, a failure can cascade across dependent systems. Concentration among a small number of providers creates systemic risk. Recovery is complicated by the difficulty of diagnosing what went wrong: the fault may sit in a model, in its data, or in an upstream provider.
Key Concepts
| Concept | Definition |
|---|---|
| Dependency mapping | Comprehensive documentation of all systems, services, and data sources an AI system relies on. |
| Single points of failure | Components whose failure would cause complete system failure with no fallback. |
| Vendor concentration risk | Over-reliance on one or a few AI providers, creating vulnerability to their failures or decisions. |
| Chaos engineering | Deliberately introducing failures to test and improve system resilience. |
| Recovery time objective (RTO) | The maximum acceptable time to restore service after failure. |
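A dependency map does not need to be elaborate to be useful. The sketch below is illustrative only: the system and service names are hypothetical, and the structure is one plausible shape for such a map. It records each AI system's dependencies and flags single points of failure, meaning dependencies with no registered fallback.

```python
# Minimal dependency map: each AI system lists what it relies on,
# and each dependency records whether a fallback exists.
# All system and service names here are hypothetical examples.

DEPENDENCIES = {
    "fraud-scoring-model": [
        {"name": "feature-store", "fallback": None},
        {"name": "llm-provider-a", "fallback": "llm-provider-b"},
        {"name": "customer-db-replica", "fallback": "customer-db-primary"},
    ],
    "support-chat-assistant": [
        {"name": "llm-provider-a", "fallback": None},
        {"name": "knowledge-base-index", "fallback": None},
    ],
}

def single_points_of_failure(deps: dict) -> dict[str, list[str]]:
    """Return, per system, the dependencies that have no fallback."""
    return {
        system: [d["name"] for d in items if d["fallback"] is None]
        for system, items in deps.items()
    }

if __name__ == "__main__":
    for system, spofs in single_points_of_failure(DEPENDENCIES).items():
        if spofs:
            print(f"{system}: no fallback for {', '.join(spofs)}")
```

Run against the example data, this reports that `fraud-scoring-model` has no fallback for `feature-store`, and that `support-chat-assistant` depends entirely on `llm-provider-a` and its knowledge-base index. Even a map this simple answers the first question a reviewer should ask.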
Warning Signs
Watch for these indicators of operational vulnerability:
- AI system dependencies are not comprehensively documented
- Single points of failure exist without redundancy or fallback
- Heavy reliance on one AI provider with no alternatives evaluated (a failover sketch follows this list)
- Resilience testing is theoretical: systems aren't actually tested for failure
- Incident response doesn't include AI-specific runbooks
- Recovery capabilities are assumed but not verified
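One concrete mitigation for vendor concentration is an ordered failover across providers. The sketch below is a minimal illustration under stated assumptions: `call_provider_a` and `call_provider_b` are hypothetical stand-ins for real client calls, and a production version would add timeouts, retry budgets, and circuit breaking.

```python
import logging

logger = logging.getLogger("ai-failover")

class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider has failed."""

def call_provider_a(prompt: str) -> str:
    # Placeholder for a real client call; simulates an outage here.
    raise TimeoutError("provider A is down")

def call_provider_b(prompt: str) -> str:
    # Placeholder for the alternate provider's client call.
    return f"response from provider B for: {prompt}"

# Ordered by preference; each entry is a callable taking the same input.
PROVIDERS = [("provider-a", call_provider_a), ("provider-b", call_provider_b)]

def complete(prompt: str) -> str:
    """Try each provider in order, falling back to the next on failure."""
    for name, call in PROVIDERS:
        try:
            return call(prompt)
        except Exception:
            logger.warning("provider %s failed, trying next", name)
    raise AllProvidersFailed("no provider could handle the request")

print(complete("summarise this incident report"))
```

Keeping providers behind one callable interface is what makes the fallback cheap to add; the harder work is verifying that the alternate provider's outputs are acceptable for the use case before you need it.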
Questions to Ask in AI Project Reviews
- "What are the single points of failure? What happens if [component] goes down?"
- "What's the recovery time objective? Has it been tested?"
- "What dependencies exist that we don't control?"
Questions to Ask in Governance Discussions
- "How comprehensively are AI system dependencies documented?"
- "What resilience testing occurs? When did we last test recovery?"
- "Do incident response teams have skills to diagnose AI-specific failures?"
Questions to Ask in Strategy Sessions
- "What's our exposure to key AI providers? What contingency plans exist?"
- "Are we meeting regulatory expectations on operational resilience?"
- "What would a major AI provider outage mean for our operations?"
Reflection Prompts
- Your visibility: Do you actually know what your AI systems depend on? Could you produce a dependency map?
- Your confidence: If a critical AI system failed tomorrow, how confident are you that recovery would work? Has it been tested?
- Your exposure: How concentrated are your AI dependencies? What's the plan if a key provider fails?
Good Practice Checklist
- Dependencies are comprehensively documented and maintained
- Single points of failure are identified and addressed
- Vendor concentration is understood and actively managed
- Resilience is tested through deliberate failure injection
- Incident response includes AI-specific procedures
- Recovery capabilities are verified, not just planned
Quick Reference
| Element | Question to Ask | Red Flag |
|---|---|---|
| Dependencies | What does this system rely on? | Incomplete or outdated documentation |
| Redundancy | What happens if [X] fails? | No fallback exists |
| Concentration | How dependent are we on [provider]? | Single provider dominance |
| Testing | When was resilience last tested? | Never actually tested |
| Recovery | Can we meet our RTO? | Assumed but unverified |
The Regulatory Context
Financial services: The Bank of England and FCA have operational resilience requirements. Critical third-party providers will face oversight, and AI providers are likely to be brought in scope as these frameworks evolve.
Healthcare: NHS operational resilience requirements apply to AI systems in clinical use.
Cross-sector: The government's approach to critical national infrastructure increasingly considers AI dependencies.
The implication: Operational resilience isn't just good practice; it's increasingly a regulatory expectation. Building resilience now positions you for compliance.
Testing Resilience
- Tabletop exercises: Walk through failure scenarios with relevant teams. What would happen? Who would do what?
- Chaos engineering: Deliberately inject failures in controlled conditions. Does the system behave as expected?
- Recovery drills: Actually execute recovery procedures. Does the team know what to do? Do the procedures work? (A minimal drill sketch follows this list.)
- Dependency audits: Regularly review and validate dependency documentation. Has anything changed?
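As a minimal illustration of failure injection and recovery verification, the drill below kills a dependency, runs a recovery procedure, and checks the measured recovery time against the RTO. Everything in it is hypothetical: `FakeService` stands in for a real dependency, and the 60-second RTO is an example value. A real drill injects failure into actual infrastructure and executes the documented runbook, but the shape is the same.

```python
import time

RTO_SECONDS = 60  # example RTO for this hypothetical system

class FakeService:
    """Stand-in for a real dependency; a real drill targets real infrastructure."""
    def __init__(self):
        self.healthy = True

    def kill(self):
        self.healthy = False

    def recover(self):
        time.sleep(2)  # simulates executing the documented recovery procedure
        self.healthy = True

def recovery_drill(service: FakeService) -> float:
    """Inject a failure, execute recovery, and return the time to restore service."""
    service.kill()
    start = time.monotonic()
    service.recover()  # in practice: run the actual runbook steps
    assert service.healthy, "recovery procedure did not restore service"
    return time.monotonic() - start

elapsed = recovery_drill(FakeService())
verdict = "within" if elapsed <= RTO_SECONDS else "exceeds"
print(f"recovery took {elapsed:.1f}s, {verdict} the {RTO_SECONDS}s RTO")
```

The assertion and the timing comparison are the point: a drill that never verifies restored health, and a recovery time that is never measured against the RTO, leave you with the "assumed but unverified" red flag from the Quick Reference table.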