Chaos Engineering in practice
Friday afternoon. The whole team is in a great mood. The weekend is about to start. Hooray. See you Monday, suckers!
* Two hours later, at the service desk *
The phone rings. “A system is failing”, they hear. “We’ll get in touch with whoever is responsible”, they say. “Let’s call the owners asap”, they act.
The business is getting anxious. A fix is applied soon after. Finally the weekend can start for real. Or so they hope.
* Monday morning *
A postmortem session is about to start. The root cause needs to be found. Budget for structural solutions to prevent this problem from occurring again is suddenly, non-negotiably, available. A mind wanders. Why didn’t we spend this money in the first place, before this could happen at the most impossible moment?
What is this article about?
This article is about the practice of Chaos Engineering. We’ll try to explain why this practice is a non-negotiable part of software delivery within squads at our company. Software will be scored on its ability to withstand chaos. Delivery squads will be scored on the effort they put into testing that their software withstands chaos. This article does not only address the technical crowd. We keep things abstract with minimal technical detail to keep it lucid for everybody.
Chaos engineering: a proactive, empirical discipline that builds confidence in the operational behavior and resilience of a complex distributed system by running chaos experiments designed to surface new weaknesses.
So many and so micro, these services of ours
The software systems we build today are more complex than before. Microservice architectures are becoming widespread. We accept this extra complexity because the architecture enables delivery teams to increase development velocity. Functionality is contained in small components. Ownership is clearer. Affiliation with the bigger purpose increases. We are all helping to enable big business goals through small software systems.
A microservice architecture, however, has many moving parts. The distributed components making up a system often interact in unplanned, unanticipated, and uncoordinated ways. Or they simply evolve too quickly. So much so that it becomes difficult to grasp all these patterns of interaction and change.
Chaos Engineering, a solution
This is where Chaos Engineering comes in. The ambition is to find weaknesses in these unexpected patterns of behavior of our software systems before they become loudly and clearly visible in production. We will proactively try to make our systems more resilient by exploring the unknown. We’ll try to increase confidence in “what we are running in production” through empirical experimentation. We’ll contain future outages today. The blast radius of tomorrow’s real-world events gets limited right now.
This won’t happen overnight, though. We won’t be taking down critical production servers just to confirm what we already know: that we would have a big outage costing a lot of money. Known weaknesses are nothing more than a temporary embarrassment for the owners. Acting on them is only a matter of budget priorities. The purpose of chaos engineering is to uncover new weaknesses. We will not be looking for a broken part; we will try to understand how unexpected interactions between distributed components could result in a chaotic state. How close are we to the edge of chaos?
Domains of failure
We all agree that whenever a system is alive long enough, it will become subject to unpredictable events and conditions. Even combinations of unwanted events. Microservices allow us to keep the potential failure domain small. Individual components can fail individually, and each component, owning a small part of the functionality, can define its own fallback. We often make semi-official declarations about which resources are shared and which services are not critical for normal operation, while remaining unaware of another painful reality.
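To make the fallback idea concrete, here is a minimal sketch. It assumes a hypothetical recommendations service and the Python requests library; the endpoint, timeout, and fallback payload are illustrative, not our actual setup.

```python
# Minimal sketch of a per-component fallback (hypothetical service and endpoint).
import requests

FALLBACK_RECOMMENDATIONS = [{"id": "editor-picks", "source": "fallback"}]

def fetch_recommendations(user_id: str) -> list:
    """Call the recommendations service; degrade gracefully if it misbehaves."""
    try:
        response = requests.get(
            f"https://recommendations.internal/users/{user_id}",
            timeout=0.5,  # fail fast instead of dragging the caller down with us
        )
        response.raise_for_status()
        return response.json()
    except requests.RequestException:
        # This component fails individually; the page around it still renders.
        return FALLBACK_RECOMMENDATIONS
```

A chaos experiment then asks whether that fallback actually kicks in when the dependency misbehaves, instead of assuming it does.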
Test in production, where do you get the nerve?
This also explains the need for running experiments right in the production environment. Claims about a production-like setup called staging or acceptance are not always truthful. The input data is by definition not production data, and the setup is somewhat and somehow different, whether we like it or not. Many stateful services such as databases, message queues, and caches are configured differently in production. And even if you think you have an exact copy of your production environment, the systems of other people that you depend on may not be such faithful copies of their production systems. Also, we are not trying to induce chaos in a production network of pacemakers that people’s lives depend on. We are a media company bringing news to the people. And soon with even fewer problems in production, because we already fixed future problems before they could hit production in an uncontrolled fashion.
Known weaknesses are nothing more than a temporary embarrassment for the owners.
Steady State, as in being stable
Let’s get practical. How will we try to get away from this edge of chaos? First off, we need to know what characterizes our system as stable. In other words: what makes up a steady state? What is considered “normal” operation?
The main question to answer is: “In what state must our system be to keep customer reception positive?”. Your system is only as good as your clients consider it to be, whether you are a consumer-facing product or a mid-tier generic service.
There are two classes of metrics that determine your steady state: business metrics and system metrics. A business metric could sound like: “How is our number of paying customers evolving?”, while a system metric could be: “What is our 99th-percentile service response time?”. A business metric is typically the more difficult one to define. It is, however, the most interesting one, because you are measuring your distance to chaos in units of favorable business outcomes. That makes for a strong signal to the business to increase awareness and budget to get things straightened out.
Both kinds of metric are more useful if they can be monitored with low latency. Anomalies that show up fast are easier to work with. That being said, a consistent pattern of behavior during normal operation in the past can be used as a baseline to detect today’s anomalies against. Think: “How many consumers of content were there ‘typically’ during lunch hours?”. Without knowing what makes your system steady, it is not possible to form meaningful hypotheses about its behavior. You should know what to look for during experiments. What is changing, and within which expected thresholds? Which deviation in a metric is problematic, and which one is not?
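As an illustration, a steady-state check can be as simple as comparing today’s value against a historical baseline. This is a minimal sketch; the lunch-hour metric and the 10% tolerance are illustrative assumptions, not our actual thresholds.

```python
# Minimal sketch of a steady-state check against a historical baseline.
from statistics import mean

def is_steady(current_value: float, historical_samples: list[float],
              tolerance: float = 0.10) -> bool:
    """True if today's value stays within `tolerance` of the historical mean."""
    baseline = mean(historical_samples)
    deviation = abs(current_value - baseline) / baseline
    return deviation <= tolerance

# Example: content views during lunch hours on comparable past days.
lunch_hour_views = [10_400, 9_950, 10_150, 10_600]
print(is_steady(current_value=7_800, historical_samples=lunch_hour_views))  # False: investigate
```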
The Experiment
1. Hypothesis, and other assumptions
Having determined your steady state, you can start to formulate a hypothesis about your system. Some of the most interesting ones today are about recent problems. It would be neat to build the hypothesis around exactly such a recent problem and try to disprove it by experimentation. Try to re-enact the outage in a contained and controlled environment. Other examples could be: “If the author service becomes latent, then the content edge gateway will not start to respond ‘slow’”. Or: “If AD.nl’s frontend servers go down, the CDN will serve stale responses for known content”.
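It helps to write the hypothesis down as something a script can check later. A minimal sketch, assuming hypothetical metric names and thresholds:

```python
# Minimal sketch: a hypothesis as data a script can verify after the experiment.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    description: str   # human-readable statement of the expectation
    metric: str        # the steady-state metric we will watch
    threshold: float   # the value the metric must stay below

def holds(hypothesis: Hypothesis, observed_value: float) -> bool:
    """True if the steady state survived the injected failure."""
    return observed_value < hypothesis.threshold

gateway_latency = Hypothesis(
    description="If the author service becomes latent, the content edge gateway stays fast.",
    metric="edge_gateway_p99_response_ms",  # hypothetical metric name
    threshold=300.0,                        # illustrative ceiling in milliseconds
)
```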
2. Scope, and other limitations
Next, determine the scope and duration of the experiment you’re going to run. Try to run the experiment as close to production as you can, but minimize the impact. Do not start your testing straight up in production. Go for the smallest possible test you can run and learn from. Also figure out how to terminate the experiment early. Having a big red “stop” button is the best thing ever when things go completely bananas.
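That stop button can be as unglamorous as a set of guard-rail conditions checked while the experiment runs. A minimal sketch, with illustrative metric names and limits:

```python
# Minimal sketch of a "big red stop button": guard-rail limits that abort the experiment.
ABORT_CONDITIONS = {
    "edge_gateway_p99_response_ms": 500.0,  # hard ceiling, above the hypothesized threshold
    "checkout_error_rate": 0.02,            # 2% errors and we pull the plug
}

def should_abort(live_metrics: dict[str, float]) -> bool:
    """Abort as soon as any guard-rail metric blows past its limit."""
    return any(
        live_metrics.get(metric, 0.0) > limit
        for metric, limit in ABORT_CONDITIONS.items()
    )
```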
3 & 4. Metrics, the things that measure happiness
Up next is choosing the metrics you’re going to watch. Make sure you have a good understanding of which thresholds and values are considered “ok”, and which ones aren’t.
As a following step, you should notify the organization. Explain to all parties involved what you’re going to do and why you’re going to do it. Certainly when you’re just getting started with chaos engineering, this is a mandatory step in the process. By the time experimentation in production is widely accepted, we will have become very mature as an organization, and our systems will probably be more resilient than ever before.
5. Run the experiment, already
Now we run the experiment. Monitor your metrics while the experiment is active. When the experiment is finished, we analyze the results. Did anything happen out of the ordinary? Were our hypothesized thresholds respected? What was the impact of the injected event? Is our hypothesis confirmed or refuted?
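Tying the pieces together, a run can look like the sketch below. inject_latency, read_metric, and stop_injection are hypothetical hooks into whatever fault-injection and monitoring tooling you use; the thresholds and duration are illustrative.

```python
# Minimal sketch of an experiment run: inject, watch, respect the stop button, analyze.
import time

THRESHOLD_MS = 300.0    # hypothesized ceiling for the watched metric
ABORT_LIMIT_MS = 500.0  # the big red stop button
DURATION_S = 600        # keep the experiment short and bounded

def run_experiment(inject_latency, read_metric, stop_injection) -> bool:
    samples = []
    inject_latency()  # e.g. add artificial delay to the author service
    try:
        start = time.time()
        while time.time() - start < DURATION_S:
            value = read_metric("edge_gateway_p99_response_ms")
            samples.append(value)
            if value > ABORT_LIMIT_MS:  # things went bananas: abort early
                break
            time.sleep(10)
    finally:
        stop_injection()  # always clean up the injected fault
    # The hypothesis holds if the metric never breached its hypothesized ceiling.
    return bool(samples) and max(samples) <= THRESHOLD_MS
```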
6, 7, & 8. Experiment aftercare, and more
If everything went just fine, we could increase the scope of the test (e.g. route more production traffic to the systems in the experimental group) and go again, learning more about the behavior of our system with a more daring test. If satisfied, we should think about automating the experiment that we just pulled off with, probably, a lot of manual labor. There is an old saying in the rather new field of chaos engineering: “if experimentation is not automated, it is obsolescent”. Doing things manually is a good first step, but it is time-consuming, and confidence in the results of our tests will decay over time. Things change. Systems change. And often very fast. We could even run experiments after each change, as part of canary releases, and increase confidence when deploying changed versions to production. The bottom line is that we cannot keep borrowing development time on a regular basis for manual chaos testing.
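In its simplest form, “automated” can mean re-running the experiment on a fixed cadence (or triggering the same call from a canary release pipeline) and alerting the owning squad when the hypothesis stops holding. In this sketch, run_experiment stands in for the runner above and notify_squad is a hypothetical alerting hook, not an existing API.

```python
# Minimal sketch of keeping an experiment automated on a fixed cadence.
import time

def keep_experimenting(run_experiment, notify_squad, interval_s: int = 7 * 24 * 3600):
    """Re-run the experiment (weekly by default) and page the squad on regressions."""
    while True:
        hypothesis_holds = run_experiment()
        if not hypothesis_holds:
            # The system drifted; yesterday's confidence has expired.
            notify_squad("Chaos experiment failed: hypothesis no longer holds.")
        time.sleep(interval_s)  # a CI or canary-release trigger works just as well
```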
Closing notes
Chaos engineering as such is still a very new subject. The practices are not necessarily new, but they are not always known as chaos engineering. We started investing in chaos engineering by assigning a pair of ninjas/experts to these practices. They will start experimentation based on recent and anticipated issues. They will join postmortem sessions to give ad-hoc feedback. They will help spread the definition and practices of chaos engineering, keep in sync with what is now known as chaos engineering, spread that knowledge, and help teams implement the practices and get better.
….
Hope you learned a thing or two. Many of these insights are based on, but not limited to, the sources listed in the postscriptum below.
— Glenn Heylen, Chaos Ninja @ De Persgroep Publishing
References
1. Principles of Chaos Engineering, http://www.principlesofchaos.com
2. Ali Basiri, Niosha Behnam, Ruud de Rooij, Lorin Hochstein, Luke Kosewski, Justin Reynolds, Casey Rosenthal, “Chaos Engineering”, IEEE Software, vol. 33, no. 3, pp. 35-41, May/June 2016, DOI: 10.1109/MS.2016.60
3. D. D. Woods, “STELLA: Report from the SNAFUcatchers Workshop on Coping With Complexity”, March 14–16, 2017
4. Jonny LeRoy, “Reliability under abnormal conditions, Part One”, ThoughtWorks, November 5, 2017
5. Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, “Chaos Engineering”, O’Reilly, September 26, 2017
By the way, we’re hiring.