Chaos Engineering Practices
What is chaos engineering and how would you implement it safely in a production environment?
Chaos engineering is the practice of intentionally injecting failures into a system to test its resilience. A safe implementation follows these steps:
- Start with a hypothesis about how the system should behave under failure.
- Define a minimal blast radius for the experiment.
- Begin in staging, using tools like Chaos Monkey or Litmus.
- Inject failures: pod terminations, network latency, resource exhaustion.
- Monitor the golden signals throughout the experiment.
- Automate rollback on any unexpected impact.
- Gradually expand to production during low-traffic periods, with the team actively monitoring.
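The "automate rollback on unexpected impact" step can be sketched as a simple guard loop: inject the failure, poll a golden signal, and abort early if it crosses a threshold. This is a minimal illustration, not a production controller; the callables, the 5% threshold, and the polling cadence are all hypothetical.

```python
import time

ERROR_RATE_THRESHOLD = 0.05  # hypothetical abort threshold: 5% error rate


def run_experiment(inject, rollback, get_error_rate, duration_s=30, poll_s=5):
    """Run a chaos experiment with an automated abort guard.

    `inject`, `rollback`, and `get_error_rate` are injected callables, so the
    guard logic can be exercised without touching real infrastructure.
    Returns "completed" if the experiment ran its full duration, or
    "aborted" if the error-rate signal crossed the threshold.
    """
    inject()
    try:
        elapsed = 0.0
        while elapsed < duration_s:
            if get_error_rate() > ERROR_RATE_THRESHOLD:
                return "aborted"  # hypothesis violated: stop early
            time.sleep(poll_s)
            elapsed += poll_s
        return "completed"
    finally:
        rollback()  # always restore steady state, even on abort
```

The key design choice is that rollback runs in `finally`: whether the experiment completes, aborts, or raises, the system is returned to its steady state.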
Chaos engineering's goal is finding weaknesses before they find you. It builds confidence in system resilience and uncovers hidden dependencies. The key is controlled, observable experiments with safety measures in place, not random destruction. Netflix pioneered the practice with Chaos Monkey, and it is now standard at large-scale organizations.
Litmus ChaosEngine example
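A minimal ChaosEngine manifest for a pod-delete experiment might look like the sketch below. The name, namespace, label selector, and service account are placeholders; the field names follow the Litmus `litmuschaos.io/v1alpha1` CRD.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: my-app-chaos          # placeholder name
  namespace: staging
spec:
  appinfo:
    appns: staging
    applabel: app=my-app      # placeholder label selecting the target workload
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"     # total seconds of chaos
            - name: CHAOS_INTERVAL
              value: "10"     # seconds between pod deletions
            - name: FORCE
              value: "false"  # graceful pod termination
```

Keeping `TOTAL_CHAOS_DURATION` short and targeting a narrow label selector is how the blast radius stays small.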
Manual chaos injection
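Without a chaos framework, the same failure modes can be injected by hand. The namespace, label, and interface below are placeholders; run these only against a system you are allowed to break, with monitoring already in place.

```shell
# Terminate a random pod matching a label (the 'staging' namespace and
# 'app=my-app' label are placeholders)
kubectl delete -n staging \
  "$(kubectl get pods -n staging -l app=my-app -o name | shuf -n 1)"

# Add 200ms of network latency on an interface (requires root; 'eth0' is a
# placeholder)
tc qdisc add dev eth0 root netem delay 200ms

# ...and remove it when the experiment ends
tc qdisc del dev eth0 root netem

# Exhaust CPU on the host for 60 seconds
stress-ng --cpu 4 --timeout 60s
```

Manual injection is fine for a first experiment, but frameworks like Litmus add the scheduling, scoping, and automated cleanup that manual commands lack.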
- Running chaos experiments without proper monitoring in place
- Starting with production before validating in staging
- No automated rollback mechanism when experiments go wrong
- How do you define and control the blast radius of chaos experiments?
- What metrics should you monitor during chaos experiments?
- How do you convince leadership that intentionally breaking production is valuable?
More SRE interview questions
SLO vs SLI vs SLA Differences
Your team just launched a new API service. Your manager asks you to set up SLOs for it. Can you walk me through what SLOs, SLIs, and SLAs are, and how they relate to each other?
junior
Also worth your time on this topic
SLOs, SLIs, and Error Budgets: A Practical Implementation Guide
A step-by-step checklist for defining service level objectives, picking the right service level indicators, and using error budgets to make better decisions about reliability vs. feature velocity.
45-90 minutes
When One Data Center Room Got Hot: AWS US-EAST-1, Coinbase, and the DR Drill That Was Not
On May 7, 2026, cooling failed in a single hall of one US-EAST-1 data center. Coinbase, FanDuel, and CME Group went down for hours, and Coinbase publicly confirmed their backup systems did not work as expected. Here is what happened, the multi-AZ checklist that would have caught it, and the AWS Fault Injection Simulator commands to run the drill before the next thermal event.