The Core Challenge

AI systems create new categories of operational risk. When AI is embedded in critical processes, failure can cascade across dependent systems. Concentration among a small number of providers creates systemic risk. Recovery is complicated by the difficulty of understanding what went wrong.

Key Concepts

Dependency mapping: Comprehensive documentation of all systems, services, and data sources an AI system relies on (a minimal sketch follows this list).
Single points of failure: Components whose failure would cause complete system failure, with no fallback.
Vendor concentration risk: Over-reliance on one or a few AI providers, creating vulnerability to their failures or decisions.
Chaos engineering: Deliberately introducing failures to test and improve system resilience.
Recovery time objective (RTO): The maximum acceptable time to restore service after a failure.
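To make dependency mapping concrete, one option is to keep the map in a machine-readable form, so that single points of failure and vendor concentration can be queried rather than rediscovered during an incident. The sketch below is a minimal illustration only, not a prescribed format; the system name, dependency names, and field choices (fallback, rto_minutes) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Dependency:
    """One system, service, or data source that an AI system relies on."""

    name: str                       # e.g. "acme-llm-api" (hypothetical)
    kind: str                       # "model-provider" | "data-feed" | "internal-service"
    owner: str                      # vendor or team accountable for it
    fallback: str | None = None     # None means this is a single point of failure
    rto_minutes: int | None = None  # recovery time objective agreed for this dependency


@dataclass
class DependencyMap:
    system: str
    dependencies: list[Dependency] = field(default_factory=list)

    def single_points_of_failure(self) -> list[Dependency]:
        """Dependencies with no documented fallback."""
        return [d for d in self.dependencies if d.fallback is None]

    def provider_concentration(self) -> dict[str, int]:
        """How many dependencies each owner or vendor accounts for."""
        counts: dict[str, int] = {}
        for d in self.dependencies:
            counts[d.owner] = counts.get(d.owner, 0) + 1
        return counts


# Illustrative example: a claims-triage system with three dependencies.
claims_triage = DependencyMap(
    system="claims-triage",
    dependencies=[
        Dependency("acme-llm-api", "model-provider", "Acme AI", fallback=None, rto_minutes=60),
        Dependency("policy-db", "internal-service", "Data Platform", fallback="nightly replica"),
        Dependency("fraud-scores-feed", "data-feed", "Acme AI", fallback="cached scores"),
    ],
)

print(claims_triage.single_points_of_failure())  # the LLM API has no fallback
print(claims_triage.provider_concentration())    # {'Acme AI': 2, 'Data Platform': 1}
```

Even this small amount of structure makes the later questions ("what happens if X goes down?", "how dependent are we on this provider?") answerable from the map itself.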

Warning Signs

Watch for these indicators of operational vulnerability:

  • AI system dependencies are not comprehensively documented
  • Single points of failure exist without redundancy or fallback
  • Heavy reliance on one AI provider with no alternatives evaluated
  • Resilience testing is theoretical—systems aren't actually tested for failure
  • Incident response doesn't include AI-specific runbooks
  • Recovery capabilities are assumed but not verified

Questions to Ask in AI Project Reviews

  • "What are the single points of failure? What happens if [component] goes down?"
  • "What's the recovery time objective? Has it been tested?"
  • "What dependencies exist that we don't control?"

Questions to Ask in Governance Discussions

  • "How comprehensively are AI system dependencies documented?"
  • "What resilience testing occurs? When did we last test recovery?"
  • "Do incident response teams have skills to diagnose AI-specific failures?"

Questions to Ask in Strategy Sessions

  • "What's our exposure to key AI providers? What contingency plans exist?"
  • "Are we meeting regulatory expectations on operational resilience?"
  • "What would a major AI provider outage mean for our operations?"

Reflection Prompts

  1. Your visibility: Do you actually know what your AI systems depend on? Could you produce a dependency map?
  2. Your confidence: If a critical AI system failed tomorrow, how confident are you that recovery would work? Has it been tested?
  3. Your exposure: How concentrated are your AI dependencies? What's the plan if a key provider fails?

Good Practice Checklist

  • Dependencies are comprehensively documented and maintained
  • Single points of failure are identified and addressed (a failover sketch follows this checklist)
  • Vendor concentration is understood and actively managed
  • Resilience is tested through deliberate failure injection
  • Incident response includes AI-specific procedures
  • Recovery capabilities are verified, not just planned
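Addressing single points of failure and vendor concentration ultimately means having a second path that is actually exercised. The sketch below shows one common pattern, assumed rather than prescribed: try a primary provider, fail over to a secondary, and fall back to a degraded local path (for example, routing the case to a human queue) if both are unavailable. The provider callables and names are hypothetical; a production version would also need retries, circuit breakers, and alerting.

```python
import logging
from collections.abc import Callable

logger = logging.getLogger("ai_failover")

# A provider call takes a prompt and returns a completion, raising on failure.
ProviderCall = Callable[[str], str]


def call_with_failover(
    prompt: str,
    primary: ProviderCall,
    secondary: ProviderCall,
    degraded_fallback: ProviderCall,
) -> str:
    """Try the primary provider, then the secondary, then a degraded local path.

    The degraded path keeps the business process running (for example, by
    routing the case to a human review queue) even if every provider is down.
    """
    for name, provider in (("primary", primary), ("secondary", secondary)):
        try:
            return provider(prompt)
        except Exception as exc:  # in practice, catch the providers' specific errors
            logger.warning("%s provider failed: %s", name, exc)
    logger.error("all providers failed; using degraded fallback")
    return degraded_fallback(prompt)
```

The point for governance is less the code than the questions it forces: is there a secondary at all, has the failover path been executed recently, and is the degraded mode acceptable to the business?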

Quick Reference

Element | Question to Ask | Red Flag
Dependencies | What does this system rely on? | Incomplete or outdated documentation
Redundancy | What happens if [X] fails? | No fallback exists
Concentration | How dependent are we on [provider]? | Single provider dominance
Testing | When was resilience last tested? | Never actually tested
Recovery | Can we meet our RTO? | Assumed but unverified

The Regulatory Context

Financial services: The Bank of England and the FCA have operational resilience requirements, and critical third-party providers will face oversight. AI providers are likely to be included as these frameworks evolve.

Healthcare: NHS operational resilience requirements apply to AI systems in clinical use.

Cross-sector: The government's approach to critical national infrastructure increasingly considers AI dependencies.

The implication: Operational resilience isn't just good practice—it's increasingly a regulatory expectation. Building resilience now positions you for compliance.

Testing Resilience

  • Tabletop exercises: Walk through failure scenarios with relevant teams. What would happen? Who would do what?
  • Chaos engineering: Deliberately inject failures in controlled conditions. Does the system behave as expected? (A minimal test sketch follows this list.)
  • Recovery drills: Actually execute recovery procedures. Does the team know what to do? Do the procedures work?
  • Dependency audits: Regularly review and validate dependency documentation. Has anything changed?
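Failure injection does not require heavyweight tooling to start delivering value. The sketch below is a unit-level stand-in for the idea, using an assumed pytest setup and an illustrative RTO: a test double simulates a provider outage, and the tests check that the outage is surfaced (so a fallback path can engage) and that recovery completes within the stated objective. Real chaos engineering injects failures into running environments under controlled conditions, but even tests like these move recovery from "assumed" to "verified".

```python
import time

import pytest  # assumed test runner; any framework works

RTO_SECONDS = 300  # illustrative recovery time objective


class FlakyProvider:
    """Test double for an AI provider that fails for a fixed window, then recovers."""

    def __init__(self, outage_seconds: float):
        self._recover_at = time.monotonic() + outage_seconds

    def complete(self, prompt: str) -> str:
        if time.monotonic() < self._recover_at:
            raise ConnectionError("injected outage")
        return "ok"


def test_outage_is_surfaced_so_fallback_can_engage():
    provider = FlakyProvider(outage_seconds=5)
    # During the injected outage the call must fail loudly; silent wrong answers
    # would be worse than an error the caller can route around.
    with pytest.raises(ConnectionError):
        provider.complete("health check")


def test_recovery_completes_within_rto():
    provider = FlakyProvider(outage_seconds=1)
    start = time.monotonic()
    while True:
        try:
            provider.complete("health check")
            break  # recovered
        except ConnectionError:
            if time.monotonic() - start > RTO_SECONDS:
                pytest.fail("service did not recover within the RTO")
            time.sleep(0.2)
```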