The Core Challenge

AI systems create new categories of operational risk. When AI is embedded in critical processes, failure can cascade across dependent systems. Concentration among a small number of providers creates systemic risk. Recovery is complicated by the difficulty of understanding what went wrong.

Key Concepts

Dependency mapping: Comprehensive documentation of all systems, services, and data sources an AI system relies on (a minimal sketch follows this list).
Single points of failure: Components whose failure would cause complete system failure, with no fallback.
Vendor concentration risk: Over-reliance on one or a few AI providers, creating vulnerability to their failures or decisions.
Chaos engineering: Deliberately introducing failures to test and improve system resilience.
Recovery time objective (RTO): The maximum acceptable time to restore service after a failure.
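To make dependency mapping concrete, one option is to keep the map in a machine-readable form, so that single points of failure and vendor concentration can be queried rather than rediscovered during an incident. The sketch below is a minimal illustration only, not a prescribed format; the system name, dependency names, and field choices (fallback, rto_minutes) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Dependency:
    """One system, service, or data source that an AI system relies on."""

    name: str                       # e.g. "acme-llm-api" (hypothetical)
    kind: str                       # "model-provider" | "data-feed" | "internal-service"
    owner: str                      # vendor or team accountable for it
    fallback: str | None = None     # None means this is a single point of failure
    rto_minutes: int | None = None  # recovery time objective agreed for this dependency


@dataclass
class DependencyMap:
    system: str
    dependencies: list[Dependency] = field(default_factory=list)

    def single_points_of_failure(self) -> list[Dependency]:
        """Dependencies with no documented fallback."""
        return [d for d in self.dependencies if d.fallback is None]

    def provider_concentration(self) -> dict[str, int]:
        """How many dependencies each owner or vendor accounts for."""
        counts: dict[str, int] = {}
        for d in self.dependencies:
            counts[d.owner] = counts.get(d.owner, 0) + 1
        return counts


# Illustrative example: a claims-triage system with three dependencies.
claims_triage = DependencyMap(
    system="claims-triage",
    dependencies=[
        Dependency("acme-llm-api", "model-provider", "Acme AI", fallback=None, rto_minutes=60),
        Dependency("policy-db", "internal-service", "Data Platform", fallback="nightly replica"),
        Dependency("fraud-scores-feed", "data-feed", "Acme AI", fallback="cached scores"),
    ],
)

print(claims_triage.single_points_of_failure())  # the LLM API has no fallback
print(claims_triage.provider_concentration())    # {'Acme AI': 2, 'Data Platform': 1}
```

Even this small amount of structure makes the later questions ("what happens if X goes down?", "how dependent are we on this provider?") answerable from the map itself.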

Warning Signs

Watch for these indicators of operational vulnerability:

  • AI system dependencies are not comprehensively documented
  • Single points of failure exist without redundancy or fallback
  • Heavy reliance on one AI provider with no alternatives evaluated
  • Resilience testing is theoretical—systems aren't actually tested for failure
  • Incident response doesn't include AI-specific runbooks
  • Recovery capabilities are assumed but not verified

Questions to Ask in AI Project Reviews

  • "What are the single points of failure? What happens if [component] goes down?"
  • "What's the recovery time objective? Has it been tested?"
  • "What dependencies exist that we don't control?"

Questions to Ask in Governance Discussions

  • "How comprehensively are AI system dependencies documented?"
  • "What resilience testing occurs? When did we last test recovery?"
  • "Do incident response teams have skills to diagnose AI-specific failures?"

Questions to Ask in Strategy Sessions

  • "What's our exposure to key AI providers? What contingency plans exist?"
  • "Are we meeting regulatory expectations on operational resilience?"
  • "What would a major AI provider outage mean for our operations?"

Reflection Prompts

  1. Your visibility: Do you actually know what your AI systems depend on? Could you produce a dependency map?
  2. Your confidence: If a critical AI system failed tomorrow, how confident are you that recovery would work? Has it been tested?
  3. Your exposure: How concentrated are your AI dependencies? What's the plan if a key provider fails?

Good Practice Checklist

  • Dependencies are comprehensively documented and maintained
  • Single points of failure are identified and addressed (a failover sketch follows this checklist)
  • Vendor concentration is understood and actively managed
  • Resilience is tested through deliberate failure injection
  • Incident response includes AI-specific procedures
  • Recovery capabilities are verified, not just planned
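Addressing single points of failure and vendor concentration ultimately means having a second path that is actually exercised. The sketch below shows one common pattern, assumed rather than prescribed: try a primary provider, fail over to a secondary, and fall back to a degraded local path (for example, routing the case to a human queue) if both are unavailable. The provider callables and names are hypothetical; a production version would also need retries, circuit breakers, and alerting.

```python
import logging
from collections.abc import Callable

logger = logging.getLogger("ai_failover")

# A provider call takes a prompt and returns a completion, raising on failure.
ProviderCall = Callable[[str], str]


def call_with_failover(
    prompt: str,
    primary: ProviderCall,
    secondary: ProviderCall,
    degraded_fallback: ProviderCall,
) -> str:
    """Try the primary provider, then the secondary, then a degraded local path.

    The degraded path keeps the business process running (for example, by
    routing the case to a human review queue) even if every provider is down.
    """
    for name, provider in (("primary", primary), ("secondary", secondary)):
        try:
            return provider(prompt)
        except Exception as exc:  # in practice, catch the providers' specific errors
            logger.warning("%s provider failed: %s", name, exc)
    logger.error("all providers failed; using degraded fallback")
    return degraded_fallback(prompt)
```

The point for governance is less the code than the questions it forces: is there a secondary at all, has the failover path been executed recently, and is the degraded mode acceptable to the business?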

Quick Reference

Element | Question to Ask | Red Flag
Dependencies | What does this system rely on? | Incomplete or outdated documentation
Redundancy | What happens if [X] fails? | No fallback exists
Concentration | How dependent are we on [provider]? | Single provider dominance
Testing | When was resilience last tested? | Never actually tested
Recovery | Can we meet our RTO? | Assumed but unverified

The Regulatory Context

Financial services: The Bank of England and the FCA have operational resilience requirements, and critical third-party providers will face oversight. AI providers are likely to be included as these frameworks evolve.

Healthcare: NHS operational resilience requirements apply to AI systems in clinical use.

Cross-sector: The government's approach to critical national infrastructure increasingly considers AI dependencies.

The implication: Operational resilience isn't just good practice—it's increasingly a regulatory expectation. Building resilience now positions you for compliance.

Testing Resilience

  • Tabletop exercises: Walk through failure scenarios with relevant teams. What would happen? Who would do what?
  • Chaos engineering: Deliberately inject failures in controlled conditions. Does the system behave as expected? (A minimal test sketch follows this list.)
  • Recovery drills: Actually execute recovery procedures. Does the team know what to do? Do the procedures work?
  • Dependency audits: Regularly review and validate dependency documentation. Has anything changed?
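Failure injection does not require heavyweight tooling to start delivering value. The sketch below is a unit-level stand-in for the idea, using an assumed pytest setup and an illustrative RTO: a test double simulates a provider outage, and the tests check that the outage is surfaced (so a fallback path can engage) and that recovery completes within the stated objective. Real chaos engineering injects failures into running environments under controlled conditions, but even tests like these move recovery from "assumed" to "verified".

```python
import time

import pytest  # assumed test runner; any framework works

RTO_SECONDS = 300  # illustrative recovery time objective


class FlakyProvider:
    """Test double for an AI provider that fails for a fixed window, then recovers."""

    def __init__(self, outage_seconds: float):
        self._recover_at = time.monotonic() + outage_seconds

    def complete(self, prompt: str) -> str:
        if time.monotonic() < self._recover_at:
            raise ConnectionError("injected outage")
        return "ok"


def test_outage_is_surfaced_so_fallback_can_engage():
    provider = FlakyProvider(outage_seconds=5)
    # During the injected outage the call must fail loudly; silent wrong answers
    # would be worse than an error the caller can route around.
    with pytest.raises(ConnectionError):
        provider.complete("health check")


def test_recovery_completes_within_rto():
    provider = FlakyProvider(outage_seconds=1)
    start = time.monotonic()
    while True:
        try:
            provider.complete("health check")
            break  # recovered
        except ConnectionError:
            if time.monotonic() - start > RTO_SECONDS:
                pytest.fail("service did not recover within the RTO")
            time.sleep(0.2)
```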