You can't afford to wait for your customers and their request patterns to battle-test your resource planning. Make unpredictable service-to-service transactions predictable: unleash rare real-world events such as load balancer unavailability and server downtime on a fraction of your fleet to identify weaknesses in production. With a thorough empirical approach, continuous testing goes beyond your CI/CD pipelines and establishes confidence in both deployments and running code under real-world conditions.
Measure your system's core metrics during uninterrupted operation so you can spot unhealthy patterns instantly during continuous chaos testing. Baselines for throughput, error rates and latency percentiles are the deciding factors for whether your system still works during chaos.
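To make the baseline idea concrete, here is a minimal Python sketch of a steady-state check. The `Snapshot` shape, the `is_healthy` helper and the 20% tolerance are illustrative assumptions, not part of any particular tool; real thresholds depend on your services.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Snapshot:
    """Core metrics over one observation window (hypothetical shape)."""
    throughput_rps: float
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


def p99(latencies_ms):
    """99th percentile of raw latency samples."""
    # quantiles(n=100) returns 99 cut points; the last one is the p99.
    return quantiles(latencies_ms, n=100)[-1]


def is_healthy(baseline, current, tolerance=0.2):
    """True if `current` stays within `tolerance` of the baseline.

    The relative tolerance is an assumption for illustration: throughput
    may drop by at most that share, error rate and p99 latency may rise
    by at most that share.
    """
    return (
        current.throughput_rps >= baseline.throughput_rps * (1 - tolerance)
        and current.error_rate <= baseline.error_rate * (1 + tolerance)
        and current.p99_latency_ms <= baseline.p99_latency_ms * (1 + tolerance)
    )
```

Record the baseline during normal operation, then call `is_healthy(baseline, current)` on every window while an experiment runs. A purely relative tolerance is strict when the baseline error rate is close to zero, so an absolute floor may suit some services better.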
Using probable events as disruptive tests ensures readiness throughout your system. Collected, prioritized and estimated in frequency, these events should be the building blocks of your automated chaos testing and reveal how your services behave when rare events occur. Non-failure events such as traffic spikes play their part just as much as garbage responses and "hardware" failures.
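One way to picture such an event collection is a small catalog that an automated run draws from. The sketch below is only an illustration: the event names, the frequency estimates and the weighting formula (estimated frequency divided by priority rank) are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosEvent:
    name: str
    kind: str                # "non-failure", "garbage-response" or "hardware"
    weekly_frequency: float  # estimated real-world occurrences per week
    priority: int            # 1 = most important


# Hypothetical catalog; the entries and numbers are made up for illustration.
CATALOG = [
    ChaosEvent("traffic_spike", "non-failure", 3.0, 2),
    ChaosEvent("malformed_upstream_response", "garbage-response", 1.0, 1),
    ChaosEvent("load_balancer_unavailable", "hardware", 0.2, 1),
    ChaosEvent("server_down", "hardware", 0.5, 2),
]


def pick_event(catalog, rng):
    """Draw one event, favoring frequent and high-priority entries."""
    # Assumed weighting: more frequent events and lower priority ranks
    # (rank 1 is the most important) get a larger share of the draws.
    weights = [event.weekly_frequency / event.priority for event in catalog]
    return rng.choices(catalog, weights=weights, k=1)[0]
```

Passing a seeded generator, for example `pick_event(CATALOG, random.Random(42))`, keeps a run reproducible.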
Rather than painstakingly recreating the background noise of real traffic patterns on your test systems, do the right thing: use your production environment. After a quick ramp-up phase that establishes initial resiliency in your services, the real experiments need to be run against the real deal so that outages happen in a controlled and observable way, not in chaos.
Reproducibility and hands-off execution are key to building healthy patterns in your tests and experiments. Both the orchestration of chaos events and the analysis of logs and metrics need to be as automated as possible so you and your teams can focus on the important stuff: writing software that works.
Negative customer experiences are a short-term trade-off that minimizes the impact of unplanned failure events in the future. Even so, customer churn needs to be a key concern in everything you do, and it can be avoided through thorough planning and intelligent application of your experiments.
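Limiting customer impact usually comes down to two safeguards: touch only a small share of the fleet and stop as soon as customers start to notice. The Python sketch below shows one possible shape for this. The `inject`, `revert` and `error_rate` callables are hypothetical hooks you would supply, and the threshold is an assumption.

```python
import random


def pick_targets(hosts, fraction, rng):
    """Choose a small random subset of the fleet (at least one host)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = min(len(hosts), max(1, int(len(hosts) * fraction)))
    return rng.sample(hosts, count)


def run_experiment(targets, inject, revert, error_rate, max_error_rate):
    """Inject a fault host by host and abort once the error rate is too high.

    `inject`, `revert` and `error_rate` are caller-supplied hooks
    (hypothetical). Every host that received the fault is reverted when
    the run ends, whether it completed, aborted or raised.
    Returns True if the experiment was aborted early.
    """
    injected = []
    aborted = False
    try:
        for host in targets:
            inject(host)
            injected.append(host)
            if error_rate() > max_error_rate:
                aborted = True
                break
    finally:
        for host in reversed(injected):
            revert(host)
    return aborted
```

Injecting host by host keeps the affected share as small as possible at the moment the guard trips. A real setup would also wait for metrics to settle between steps.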
Building a strong foundation for more than just one infrastructure provider is a key concern for us. We are therefore constantly expanding our expertise and tooling to support your stack, no matter where it runs.