Distributed systems are difficult.
Uncover their flaws before they uncover your business.
Start leveraging your army of chaos monkeys.

This is the age of chaos engineering.

By systematically uncovering structural weaknesses in your architecture, you establish confidence in your running systems.

You can't afford to wait for your customers and their request patterns to battle-test your resource planning. Make unpredictable service-to-service transactions predictable, and unleash rare real-world events like load balancer unavailability and server downtimes on a fraction of your fleet to identify weaknesses in production. With a thorough empirical approach, continuous testing extends beyond your CI/CD pipelines, and confidence in your deployments as well as your running code is established under real-world conditions.

What we do and how we help you

Establish observability and steady-state baselines.

Measure your system's core metrics during uninterrupted operations so you can spot unhealthy patterns instantly during continuous chaos testing. Baselines for throughput, error rates, and latency percentiles act as the determining factors during chaos: they tell you whether your system still works or not.
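As a rough illustration, a steady-state check can be as small as the Python sketch below - the metric names, thresholds, and the fetch_metrics() stub are placeholder assumptions, not tied to any particular monitoring stack.

```python
# Minimal sketch: compare live metrics against a steady-state baseline.
# Metric names, thresholds, and fetch_metrics() are illustrative placeholders.

from dataclasses import dataclass


@dataclass
class Baseline:
    throughput_rps_min: float   # requests per second under normal load
    error_rate_max: float       # fraction of failed requests
    latency_p99_ms_max: float   # 99th percentile latency in milliseconds


def fetch_metrics() -> dict:
    """Stand-in for a query against your metrics backend."""
    return {"throughput_rps": 1180.0, "error_rate": 0.004, "latency_p99_ms": 310.0}


def within_steady_state(baseline: Baseline, metrics: dict) -> bool:
    """True if the system still looks healthy while chaos is being injected."""
    return (
        metrics["throughput_rps"] >= baseline.throughput_rps_min
        and metrics["error_rate"] <= baseline.error_rate_max
        and metrics["latency_p99_ms"] <= baseline.latency_p99_ms_max
    )


if __name__ == "__main__":
    baseline = Baseline(throughput_rps_min=1000.0,
                        error_rate_max=0.01,
                        latency_p99_ms_max=350.0)
    print("steady state:", within_steady_state(baseline, fetch_metrics()))
```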

Disrupt operations often to observe application behaviour.

Using probable events as disruptive tests ensures readiness throughout your system. Collected, prioritized, and estimated in frequency, these events act as the building blocks of your automated chaos testing and determine your services' behaviour under rare conditions. Non-failure events such as traffic spikes play their part, as do garbage responses and "hardware" failures.
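A prioritized event catalog doesn't need heavyweight tooling to start with - the Python sketch below weights hypothetical events by estimated frequency; the event names and numbers are illustrative assumptions, not your production data.

```python
# Minimal sketch: a catalog of disruptive events, prioritized by estimated
# real-world frequency. Names and weights are illustrative assumptions.

import random

EVENT_CATALOG = {
    # event name: estimated relative frequency (higher = more common)
    "traffic_spike": 0.40,         # non-failure event: sudden load increase
    "garbage_response": 0.25,      # a dependency returns malformed payloads
    "instance_termination": 0.20,  # a server disappears mid-request
    "load_balancer_outage": 0.10,  # an entire ingress path goes dark
    "disk_full": 0.05,             # "hardware" failure on a single node
}


def pick_next_event(catalog: dict[str, float]) -> str:
    """Choose the next experiment, weighted by how often it happens for real."""
    events, weights = zip(*catalog.items())
    return random.choices(events, weights=weights, k=1)[0]


if __name__ == "__main__":
    for _ in range(5):
        print("next chaos experiment:", pick_next_event(EVENT_CATALOG))
```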

Use your production systems - testing just isn't the same.

Rather than painstakingly recreating background noise in traffic patterns on your test systems, you should do the right thing: use your production environment. After a quick ramp-up phase that establishes initial resiliency in your services, the real experiments need to be run against the real deal, so outages happen in a controlled and observable way, not in chaos.
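"Controlled and observable" starts with a small blast radius - the Python sketch below targets only a fraction of a hypothetical fleet; the 5% cap and the host names are placeholder assumptions, not a recommendation for your environment.

```python
# Minimal sketch: limit a production experiment to a small slice of the fleet.
# The fraction and host names are illustrative placeholders.

import random


def pick_blast_radius(fleet: list[str], fraction: float = 0.05, minimum: int = 1) -> list[str]:
    """Select a small, random subset of hosts to target in production."""
    count = max(minimum, int(len(fleet) * fraction))
    return random.sample(fleet, min(count, len(fleet)))


if __name__ == "__main__":
    fleet = [f"web-{i:02d}.prod.example.com" for i in range(40)]
    targets = pick_blast_radius(fleet, fraction=0.05)
    print("targets for this experiment:", targets)
```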

Automate all procedures and run them continuously.

Reproducibility and hands-off execution are key factors in building healthy patterns into your tests and experiments. Both the orchestration of chaos events and the analysis of logs and metrics need to be as automated as possible, so you and your teams can focus on the important stuff: writing software that works.
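The core of a hands-off run fits in a short loop - in the Python sketch below, inject(), revert(), and check_steady_state() are hypothetical hooks into your own tooling; only the inject-observe-abort-revert pattern is the point.

```python
# Minimal sketch: an automated experiment loop that injects a fault, watches
# the steady state, and always cleans up. The three hook functions are
# hypothetical stand-ins for your own orchestration and metrics tooling.

import time


def inject(event: str) -> None:
    print(f"injecting {event}")       # e.g. kill a process, drop packets


def revert(event: str) -> None:
    print(f"reverting {event}")       # restore normal operations


def check_steady_state() -> bool:
    return True                       # replace with real metric checks


def run_experiment(event: str, duration_s: int = 60, poll_s: int = 10) -> bool:
    """Inject a fault, watch the metrics, and abort the moment the baseline breaks."""
    inject(event)
    try:
        elapsed = 0
        while elapsed < duration_s:
            if not check_steady_state():
                print(f"{event}: steady state violated, aborting")
                return False
            time.sleep(poll_s)
            elapsed += poll_s
        print(f"{event}: system held its baseline")
        return True
    finally:
        revert(event)                 # always clean up, pass or fail


if __name__ == "__main__":
    run_experiment("instance_termination", duration_s=30, poll_s=5)
```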

Assist all business units in adapting to the challenge without customer churn.

Negative customer experiences are a short-term trade-off that minimizes the impact of unplanned failure events in the future. Even so, customer churn needs to be a key concern in all your measures, and it can be avoided through thorough planning and intelligent application of your experiments.

We're agnostic and adaptive

Building a strong foundation for more than just one infrastructure provider is a key concern for us. We are therefore constantly expanding our expertise and tooling to support your stack - no matter where it runs.

Google · Azure · AWS · Kubernetes

Yes, you should pay us to break your things.

  • Identifying the weak spots in your system helps you build a roadmap to mitigation

  • Know about an outage before it occurs - failure patterns guide your monitoring and alerting

  • Increased technical operations excellence in your engineering reduces both known and unknown unknowns

Get in touch so we can mess with your services.

Wear the brand - buy our swag!