Resilience is circular. Security is regenerative.
What makes something reliable, really? That it’s always there? Or always there when you need it? That it always works? Or that it does the work you actually want?
In tech, we love our metrics. We track Unplanned Unavailability (UU), Recovery Time Objective (RTO), Recovery Point Objective (RPO), Mean Time Between Failures (MTBF), Mean Time To Recovery (MTTR) – the whole alphabet soup – and strive for ever-better numbers. There’s nothing inherently wrong with these metrics… except that they are divorced from business outcomes.
Let me explain: Think about your car for a moment. You can know its maintenance history, its fuel efficiency, the precise air pressure in each tire, even look up failure statistics for its specific model year. You can measure all these things diligently and accurately. But all that work will tell you nothing about the outcome you actually need: getting to work on time this morning.
Now, scale that up. Imagine instead of just your car, you're managing a global fleet: thousands of vehicles, different makes and models, driving different distances over varied roads, subject to local laws, with drivers of varying experience, all trying to get people where they need to go, safely and on time, every single day. How would you approach that reliability challenge?
Would you centrally track the mean time between failure of every alternator, tire brand, and windshield wiper? Write a comprehensive plan for each driver covering every possible contingency: traffic jam, unexpected detour, engine rattle, animal crossing? Require daily reports on fuel levels and tire pressure? Does that sound remotely manageable, scalable, or even helpful in achieving the actual goal: reliably getting to work in the morning? The complexity explodes, the overhead becomes astronomical, and you'd drown in data that likely doesn't correlate directly with successful trips.
This hypothetical fleet isn't so far removed from the challenge many of us face today. In my current role, I’m responsible for the reliability of my company’s physical compute, network, and storage infrastructure that powers our internal IaaS platform – the bedrock for everything. And demand is exploding. That infrastructure footprint is set to triple, maybe quintuple, across new regions in just a few short years. Our specialized reliability teams? They remain lean and sharp, but they are decidedly not scaling at the same rate.
Applying that traditional, component-focused, hyper-detailed planning approach here leads us straight off a cliff at hyperscale. The sheer overhead required to meticulously track, document, and update these plans across a global footprint would be prohibitively expensive and operationally paralyzing. Not to mention that these documents would likely be outdated before they're even finalized. It wouldn’t just be operationally difficult; it would be fiscally irresponsible, consuming vast resources for activities with little demonstrable return on investment.
Beyond the unsustainable economics, this traditional approach suffers from a fundamental conceptual flaw: the immense effort often creates a dangerous false sense of security. It treats our systems as merely complicated – like an intricate watch where detailed blueprints might hold true.
But today's large-scale infrastructure is profoundly complex, more like the weather, full of unpredictable interactions and emergent behaviors. Rigid, top-down plans born from a Safety-I mindset are inherently fragile in such environments. They inevitably fail to account for unforeseen adaptations or cascading effects, forcing us to spend our time and money meticulously polishing a map that no longer reflects the rapidly changing terrain – a map that likely won't guide us effectively when disruption actually hits.
So, let’s step back and return to the question we started with: What does reliability mean? Does all this comprehensive component-level planning, data collection, and measurement tell us anything about business outcomes?
Does your product owner celebrate your data center achieving another nine of uptime on a report? Or do they celebrate when a user is able to seamlessly share a New Year's video with their distant but dear friends?
No. The goal isn't perfecting infrastructure; it's the reliable delivery of the business service running on that infrastructure. This disconnect, amplified by scale, complexity, and cost, demands a fundamental pivot. We need to stop looking down at the components and start looking up at the purpose. We need a North Star to guide our efforts.
So, let’s pivot.
Let’s rethink our starting point. Instead of focusing first on the myriad components and hoping reliability emerges, what if we begin with the desired business outcome and work backwards? This immediately forces us to confront a fundamental question, one I've wrestled with since joining the Continuity and Reliability team: What is the actual unit of reliability we should be managing?
Is it the individual server, switch, or storage array? Or is it something bigger?
Working backwards from the outcome makes the answer clear: the unit of concern must be the end-to-end service that delivers the business value.
We stop asking "Is this specific server reliable?" and start asking, "Is the video sharing service reliable?" If the goal is enabling that user to seamlessly share their New Year's video, what truly needs to work? It’s the entire chain: the application logic, the databases, the IaaS platform APIs, the network paths, the storage systems, the compute resources – all working together in service of that specific outcome.
Now, the reliability metrics of an individual component become necessary inputs, but they are no longer the goal. The focus elevates naturally to the service itself. We move from managing parts to ensuring pathways.
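To make that shift concrete, here is a minimal sketch (in Python, with hypothetical component names and availability numbers) of how individual component metrics become inputs to a single service-level estimate for a serial dependency chain:

```python
# Minimal sketch, hypothetical numbers: component availabilities become
# inputs to a service-level estimate instead of being goals in themselves.
# For a serial chain, the service is only up when every dependency is up.

component_availability = {
    "application": 0.9995,
    "database":    0.9999,
    "iaas_api":    0.9999,
    "network":     0.99995,
    "storage":     0.9999,
    "compute":     0.9995,
}

service_availability = 1.0
for name, a in component_availability.items():
    service_availability *= a  # serial dependencies: all must be up

downtime_min_per_year = (1 - service_availability) * 365 * 24 * 60
print(f"Estimated service availability: {service_availability:.5f}")
print(f"Expected downtime: roughly {downtime_min_per_year:.0f} minutes per year")
```

The point isn't the arithmetic; it's that the component numbers only matter insofar as they roll up into the service-level figure the business actually feels.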
So, what constitutes a 'service' when we talk about reliability this way? From my perspective on the team responsible for the physical foundation, for instance, it’s not just one piece of hardware. It’s the complete, end-to-end chain of infrastructure components, operational processes, and logistical support required to deliver a specific capability to our internal customers (like application teams or data scientists) via our internal IaaS platform. We have to trace the dependencies from the abstract service consumed by our users all the way down to the concrete: power, cooling, and even the supply chain feeding the data center.
Consider GPU compute for AI training as an example. The outcome the business expects is that teams can reliably schedule and execute large-scale AI training workloads, accessing the necessary GPU resources with predictable performance and availability. The physical technology stack that delivers this service encompasses:
racks of specialized GPU servers;
high-bandwidth, low-latency network connecting them;
sensors, tooling, and networks for monitoring and managing the equipment in the racks;
power conditioning and delivery;
battery backup;
air or liquid cooling systems;
repair and maintenance teams, and spare equipment;
logistics inside the data center to move, install, test, and decommission racks;
construction and maintenance of facilities to house and power all the above, including amenities for humans like kitchens, bathrooms, and meeting space;
logistics to deliver equipment to the data center and remove waste;
warehouses to store spares and equipment when new or not in use;
teams to design, develop, source, and build all of the above;
physical and cybersecurity to protect all of this;
legal and administrative teams to protect and manage it; and
leaders to conceive, direct, and organize the work.
That intricate web – stretching from silicon chips and power grids all the way through global logistics, specialized human expertise, and even foundational corporate support – represents the true scope required to reliably deliver just one type of infrastructure service like GPU Compute. It underscores that our 'unit of reliability' is this entire complex system working in concert towards a specific outcome.
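One way to keep that scope visible without drowning in detail is to record the chain as an explicit dependency graph and trace it from the business outcome down to the concrete. A toy sketch, with hypothetical names rather than a real inventory:

```python
# Toy dependency graph (hypothetical names): trace from the business outcome
# down to concrete infrastructure so the full "unit of reliability" is visible.
DEPENDS_ON = {
    "gpu_training_service": ["gpu_racks", "cluster_network", "monitoring"],
    "gpu_racks":            ["power_delivery", "cooling", "spares_and_repair"],
    "cluster_network":      ["power_delivery"],
    "power_delivery":       ["battery_backup", "facility"],
    "cooling":              ["facility"],
    "spares_and_repair":    ["warehouse_logistics"],
}

def trace(service: str, depth: int = 0) -> None:
    """Print the full dependency chain beneath a service, leaves included."""
    print("  " * depth + service)
    for dep in DEPENDS_ON.get(service, []):
        trace(dep, depth + 1)

trace("gpu_training_service")
```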
Now, multiply that inherent complexity across all the distinct services our IaaS platform offers: high-performance storage, standard virtual machines, managed databases, cold storage archives, content delivery networks, and more. Each has its own unique technology chain and operational dependencies.
It becomes immediately obvious that business continuity planning cannot realistically happen at the level of individual components, nor can we sustain bespoke, hyper-detailed reliability plans for every single one of these diverse services. The sheer complexity revealed by tracing even one service end-to-end demands a higher level of abstraction. We need a way to simplify without losing sight of what matters.
Furthermore, just as the underlying chains are complex and unique, the business needs for each of these services also differ significantly. This brings us back to the crucial concept of tolerance for disruption. The tolerance for an issue impacting the GPU Training service might be measured in hours and impact future product deployment schedules, while the tolerance for the 'Flash Storage Service' underpinning customer transactions might be measured in seconds and directly impact revenue.
The recognition that different services have different tolerances, combined with the complexity of managing each service's unique chain individually, naturally leads us to group services based on their required resilience profiles. We need tiers.
For example, let’s sketch out a three-tiered system:
Tier 0: The 'Crown Jewels' – Services demanding the highest resilience due to minimal tolerance for disruption, often linked to immediate revenue, customer transactions, or critical safety functions.
Tier 1: The 'Vital Organs' – Important services requiring high resilience, where disruptions are costly or painful, but the tolerance might be slightly higher (e.g., recovery within hours vs. minutes, some planned downtime acceptable).
Tier 2 (and perhaps lower): ‘The Dev Playground’ – Services supporting the business that can operate effectively with standard resilience, tolerating longer recovery times or more flexible maintenance schedules without critical impact.
Now, let’s overlay this tiered service concept onto the physical reality: a single data center facility rarely houses just one type of service or tier. Critical Tier 0 applications often co-exist alongside Tier 1 business systems and Tier 2 development environments, sometimes sharing the same floor space, power/cooling, network, and support infrastructure.
This reality instantly highlights the inadequacy of traditional, monolithic, facility-wide uptime targets. Instead, it demands that we apply reliability standards and design principles differentially based on the service tier, even within these shared, mixed environments. How do we achieve that crucial differentiation and ensure the Tier 0 services get the resilience they need without incurring the unnecessary cost of gold-plating everything?
This is where Service Level Objectives (SLOs) become essential. Instead of dictating how every component must perform everywhere, we focus on defining desired outcomes using clear SLOs tailored for each tier's tolerance profile. These SLOs become the shared language of reliability, specifying measurable targets for things like availability, recovery speed (RTO), and data freshness (RPO).
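As an illustration only (the targets below are hypothetical placeholders, not real SLOs), a tiered SLO catalog can be a small, declarative structure that every team reads the same way:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class TierSLO:
    """Outcome targets for a service tier; values here are illustrative placeholders."""
    availability: float  # fraction of time the service must be usable
    rto: timedelta       # how quickly the service must recover
    rpo: timedelta       # how much data loss is tolerable

TIER_SLOS = {
    "tier0": TierSLO(availability=0.9999, rto=timedelta(minutes=5), rpo=timedelta(seconds=30)),
    "tier1": TierSLO(availability=0.999,  rto=timedelta(hours=4),   rpo=timedelta(minutes=15)),
    "tier2": TierSLO(availability=0.99,   rto=timedelta(hours=24),  rpo=timedelta(hours=4)),
}

def meets_slo(tier: str, measured_availability: float) -> bool:
    """Did a service meet its tier's availability target over the measured window?"""
    return measured_availability >= TIER_SLOS[tier].availability
```

Everything downstream, from redundancy design to maintenance windows, can then be argued against these few numbers instead of against component habits.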
These SLOs then directly guide our decisions on design, operational focus, and investment. Think of it like defining different levels of resilience investment appropriate for each tier's required tolerance and potential impact of failure.
Tier 0 services warrant a platinum-level investment – ensuring maximum redundancy, dedicated monitoring, potentially geographically distributed failover, and the fastest possible recovery capabilities, because the cost of failure is immense.
Tier 1 might receive a gold-level investment – ensuring robust resilience through redundancy and solid recovery processes, but perhaps with slightly less aggressive (and less costly) recovery time targets than Tier 0.
Tier 2 receives a standard, fit-for-purpose investment – providing solid, functional reliability while allowing for significant cost efficiencies by avoiding over-engineering. This means accepting standard hardware configurations, potentially longer maintenance windows, or less instantaneous recovery mechanisms.
This tiered investment approach allows us to make conscious, defensible choices – allocating budget for N+2 redundancy or active-active setups for Tier 0 within a facility, while perhaps using standard N+1 or accepting planned downtime for Tier 2 infrastructure located just racks away. It aligns spending and effort directly with business criticality, even in shared environments.
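To see why the extra redundancy is worth paying for only where the tolerance demands it, here is a rough sketch of the math, assuming independent unit failures and ignoring common-mode events (real facilities are less forgiving):

```python
from math import comb

def capacity_availability(n_required: int, k_spare: int, p_unit: float) -> float:
    """Probability that at least n_required of n_required + k_spare independent
    units are up, each available with probability p_unit (simplified model:
    independence assumed, no common-mode failures)."""
    total = n_required + k_spare
    return sum(
        comb(total, up) * p_unit**up * (1 - p_unit)**(total - up)
        for up in range(n_required, total + 1)
    )

# Hypothetical example: 4 units required, each 99% available.
for spares in (1, 2):  # N+1 vs N+2
    print(f"N+{spares}: {capacity_availability(4, spares, 0.99):.6f}")
```

With four units each 99% available, this toy model lands at roughly three nines for N+1 and nearly five for N+2; whether closing that gap is worth the capital depends entirely on the tier sharing the floor.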
Taking a step back, what we’ve done is define an approach to reliability at hyperscale that solves the core challenges of traditional business continuity. This approach, which we can call Service-Tier Aligned Reliability (STAR), works because it’s not about controlling every detail—it’s about aligning everyone to what matters.
A North Star for Scale
STAR gives lean reliability teams a lever. Instead of tracking thousands of components, they govern a handful of service tiers. That’s direction, not overload.
A Shared Language for Matrixed Teams
In complex orgs, everyone has their own map. STAR gives us a common one. A Tier 0 SLO means the same thing to network, storage, and facilities—cutting through silos with shared urgency and purpose.
Empowerment Instead of Rigidity
STAR doesn’t prescribe how to fix problems—it defines what good looks like. It trusts the teams closest to the work to meet the SLO however makes sense in their context. Resilience through clarity, not control.
Efficiency Through Intentional Investment
Trying to gold-plate everything is wasteful. STAR helps us target time, money, and effort where they matter most—based on business impact, not habit or fear.
Thinking in terms of service tiers and outcome-focused SLOs isn't just a different methodology; it's a more honest and effective way to answer that core question: "Will the service reliably deliver the needed business outcome?"
It provides clarity across diverse teams – including facilities, logistics, and security – whose work underpins the entire physical environment. It provides the focus needed for efficient and defensible investment of limited resources, and the scalability needed to manage ever-growing infrastructure complexity with lean operational teams.
Implementing this shift requires commitment – building shared understanding across business units so that everyone, from engineers to maintenance crews, understands how their work contributes to meeting the SLOs of the services being hosted. It involves developing new skills in defining meaningful SLOs derived from business needs, and fostering a culture that trusts empowered teams to meet outcome-based goals rather than just follow rigid procedures. But the alternative – clinging to outdated maps detailing components while the terrain of complex services transforms beneath us – is no longer viable.
Perfect plans won’t get us where the business needs us to go. Shared purpose will. It’s time to stop gold-plating everything and start aligning around what matters. That’s the only reliability that scales.
April 20th, 2025
In my last post, I argued that policy is like nuclear reactor control code: clear, structured, and critical to managing the immense power of data. But having good code isn’t enough if your mental model of the system is wrong.
Most organizations still approach reliability the way early industrial systems did: assume failure is the result of human error, isolate the cause, and clamp down on variation. This is the Safety-I mindset—and in today’s complex environments, it builds brittle systems and burns out the people trying to keep them running.
There’s a better way. It’s called Safety-II, and it starts with a simple question: instead of only asking what went wrong, why not ask what went right?
Born from early 20th-century manufacturing, Safety-I focuses on preventing bad outcomes. Its toolkit includes:
Root cause analysis
Human error reduction
Compliance enforcement
The core assumption? That bad outcomes happen when something (or someone) fails. So Safety-I tries to eliminate variability and enforce strict procedures. That’s fine in a controlled, predictable environment.
But today’s systems aren’t just complicated—they’re complex.
A complicated system is like a Swiss watch: intricate, but ultimately predictable. A complex system is more like the weather. Small changes ripple unpredictably, and outcomes aren’t deterministic. You can’t just debug your way to certainty.
In these systems, people aren’t the problem—they’re the reason things don’t fall apart. Every day, operators adapt, troubleshoot, and improvise to keep services running. And yet, Safety-I often punishes those same behaviors when they deviate from the script.
That’s not just counterproductive—it’s dangerous.
Safety-II flips the lens. It defines safety as ensuring that as many things as possible go right, even under changing conditions. It focuses on:
Learning from success, not just failure
Observing everyday work, not just incidents
Embracing adaptation as a strength
This mindset came into focus in the mid-2010s, thanks to Erik Hollnagel’s work in healthcare. While studying the UK’s National Health Service (NHS), he noticed that most improvements came not from dissecting failures, but from understanding how frontline workers navigated complexity in real time, adapting to adverse conditions by deviating from standard procedures.
Think of a nurse skipping a required form to save a life, or a technician hotwiring a workaround when a system fails. In Safety-I, those actions are noncompliant and to be avoided. In Safety-II, they’re signals of resilience, and potential design improvements.
If you’ve made it this far, you might think I’m anti-standard. After all, standards often come with rigid checklists, procedural mandates, and no room for context.
But I love standards—when they’re done right.
A Safety-I-style standard says: "Here’s exactly how to do the thing. No deviations allowed."
A Safety-II-style standard says: "Here’s the outcome we expect. Here’s what’s worked before. If you’ve got a better way, show us."
Done right, standards provide shared clarity—not centralized control. They help teams understand what success looks like and give them the tools (and permission) to get there however makes the most sense for their environment.
That’s the kind of standard that scales. That’s the kind of standard that survives contact with the real world.
Let’s say you’re managing infrastructure for a global company. A Safety-I standard might mandate a specific way to test backup power. It’ll be rigid, overly detailed, and fragile across sites.
A Safety-II standard, by contrast, will define what a successful failover looks like, highlight techniques that have worked, and let site teams determine the best method. That flexibility isn’t a weakness—it’s resilience engineering in action.
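As a hypothetical sketch rather than an actual standard, the outcome half of such a standard can be expressed as a check that any site can satisfy with whatever failover method fits its environment:

```python
from dataclasses import dataclass

@dataclass
class FailoverResult:
    """Observed outcome of a site's backup-power test, however it was performed."""
    seconds_to_stable_power: float
    critical_load_dropped: bool
    notes: str  # how the site ran the test; captured for shared learning

def meets_failover_outcome(result: FailoverResult,
                           max_seconds: float = 30.0) -> bool:
    """Outcome-based check: stable backup power within the target window and
    no dropped critical load. The method is left to the site team."""
    return (result.seconds_to_stable_power <= max_seconds
            and not result.critical_load_dropped)

# A site might record:
print(meets_failover_outcome(
    FailoverResult(seconds_to_stable_power=12.4,
                   critical_load_dropped=False,
                   notes="Monthly generator start with live transfer switch")))
```

The notes field matters as much as the pass/fail result: it is how successful adaptations get shared, which is the Safety-II half of the bargain.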
As organizations scale, central teams simply can’t predict every scenario. Safety-I collapses under that weight. Safety-II equips teams closest to the risk with ownership, context, and trust.
To move fast and stay resilient, we don’t need more rules—we need more clarity. Safety-II helps us write standards that:
Empower, rather than constrain
Learn from adaptation, not just error
Align on outcomes, not just procedures
It’s the difference between traffic lights and roadblocks. One keeps things flowing. The other stops everything cold.
Let’s stop building roadblocks. Let’s build smarter lights.
April 16th, 2025
Some folks say data is the lifeblood of our business, but that metaphor falls short. After 20 years in this space, I can tell you data is more like the nuclear core that powers our business: potent yet dangerous, it can propel us to new heights or blow up in our faces. It's why I've come to love what most people hate: policy. Done right, policy acts as the reactor control system for our "nuclear engine," ensuring we harness its power safely and efficiently. However, creating effective policies isn't easy. Today, I want to highlight some techniques I've discovered along my journey—sometimes the hard way. I think of them as the advanced technology that optimizes and controls the nuclear engine of our data.
Think of writing a policy like writing code. Well-structured code is easy to interpret, and the same goes for policy. Structure is the “X marks the spot” of good policy – not every map uses it, but when one does, it saves everyone a lot of time and effort. Consistent style helps readers know what to expect and where to find information. For example, if each policy begins with a purpose followed by a background section, readers will know not to look for background information in the purpose section, reducing confusion (and complaints!). Like good code, a set of well-crafted policies follows a consistent, logical structure.
Consistency is boring, right? Well, in policy, it’s crucial for interpretability. When policies follow a consistent format and use the same terminology throughout, users can more easily understand and apply them. For example, imagine a policy on data security. If one section refers to "sensitive data" while another says "confidential information," readers might be confused about whether these terms mean the same thing. Using consistent terminology avoids this confusion. Establishing a style guide is like setting coding standards, helping maintain uniformity and clarity across documents.
I know, nobody likes reading definitions sections. They can feel like speed bumps in your flow. But trust me, they’re necessary. Without them, users will be lost, constantly wondering what certain terms mean. By putting a clearly-sourced terms section at the beginning, you’re giving everyone a legend to follow before they dig into the text. Think of it like defining constants and variables with meaningful comments at the start of your code: it makes everything easier to interpret, maintain, and debug.
Look, we all learned in school how to plumb the thesaurus for those fancy 5-cent words to spice up our writing. That might have earned you extra points on a quiz, but in policy writing, fancy words multiply complexity, like bespoke bits of code: they introduce ambiguity and make things harder to understand. Words like "significant," "reasonable," or "appropriate" are subjective and open to interpretation, much like unclear variable names in code. Instead, use precise, quantifiable terms that leave little room for misinterpretation. "importantFrodoPotatoes" might be cute and make sense at the time, but it's no substitute for just saying what you mean.
Let's talk about lists. It’s tempting to think listing everything out in a policy makes it clearer, right? Wrong. Lists can be little hobgoblins that sneak in confusion and loopholes. While they might seem to provide clarity, they often create gaps and chances for teams to wriggle or fall out of compliance. For example, what's the value in listing apples, oranges, bananas, and raspberries when you could just say "fruits"? Not only does this simplify the policy, but it also stops the inevitable question about whether plums are included. Long lists also make readers' eyes glaze over. So, it's almost always preferable to consolidate the list into a broader category. This approach makes the policy clearer and reduces misunderstandings, much like using well-defined modular functions in code instead of overly specific, convoluted ones.
You know those moments when you realize someone else has already done all the hard work for you? That's what published standards are like. They’re your best friend in policy writing. They provide a common language that everyone in the field can understand. Using these standards saves you from reinventing the wheel and ensures that your policies are more easily understood by new hires, regulators, and auditors.
Here's the deal: you’ve got to know your boundaries. Clearly defining the scope of your policy is critical. Think of it like coloring within the lines. If your policy is about network access controls, don’t start talking about budget approvals. That’s not your lane. Keeping the scope focused ensures clarity and prevents mission creep. You don’t want to dictate how another team designs or implements their particular module. By defining precise interfaces and responsibilities, each team can innovate and optimize their module while ensuring seamless integration and communication.
Let's be honest, nobody likes being micromanaged. Policies should define the outcomes you want, not prescribe the exact steps to get there. Instead, explain what success looks like and how you'll measure it, then let the teams on the ground figure out the best way to make it happen. For example, instead of specifying in a policy that seven FTEs are needed to handle incoming requests, outline the service levels or response times you expect. By focusing on the desired outcomes and performance metrics, you empower teams to leverage their expertise and creativity to find the most efficient and effective solutions.
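As a sketch of the difference (the thresholds and names here are invented), the outcome version of that staffing policy reduces to a few measurable targets that any team structure can satisfy:

```python
# Hypothetical sketch: the policy states outcomes, not staffing.
# "Requests are acknowledged within 1 business hour and resolved within
#  2 business days, 95% of the time" is checkable; "seven FTEs" is not.

REQUEST_SLO = {
    "ack_within_hours":    1,
    "resolve_within_days": 2,
    "target_percentile":   0.95,
}

def policy_met(ack_hours: list[float], resolve_days: list[float]) -> bool:
    """Check measured handling times against the outcomes the policy defines."""
    def fraction_within(values, limit):
        return sum(v <= limit for v in values) / len(values)
    return (fraction_within(ack_hours, REQUEST_SLO["ack_within_hours"]) >= REQUEST_SLO["target_percentile"]
            and fraction_within(resolve_days, REQUEST_SLO["resolve_within_days"]) >= REQUEST_SLO["target_percentile"])
```

The team can staff, automate, or reorganize however they like; the policy only cares whether the outcomes hold.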
If the code controlling our data-powered nuclear engine is a tangled mess of inconsistencies, ambiguous terms, and endless lists, it’s a disaster waiting to happen, right? Our policies need to be as well-crafted as high-quality code to harness all this power safely. Clear structure, consistent terms, and precise scopes are essential. Using published standards and focusing on outcomes allows us to harness the immense power of our data safely and efficiently. Good policy is like good code: clear, consistent, and capable of handling complexity.
So, join me in becoming an advocate for clear, concise policy on your team. When you see a confusing or inconsistent policy, speak up and offer to help revise it. Together, we can make policies a tool for superpowering the business, not a barrier to getting things done.
May 30, 2024