Have you heard the urban legend about the Gremlins yet? It is a story about two men speaking gibberish, one of whom has huge feet and blue hair. The legend says these two men feed on low-quality systems that show no resilience.
They wreak havoc upon these systems to make them more resilient. You’ll spot them injecting small doses of failure into systems to expose their weaknesses and make them stronger.
The story goes that many men and women made claims about their systems that did not match the actual behavior in production, and that they had to bear the consequences. Many believed the Gremlins were outlaws who should be locked up. The world would learn eventually, though. They exist with a clear purpose. A noble purpose. It was just a matter of time.
Well, the urban legend is true. Those two men are Kyle Hultman and Casey Antocci. Two exotic creatures sent by Gremlin to aid us in our quest for technical excellence. To aid us in our mission to build and deliver more resilient systems. Systems with high availability and minimal unexpected customer impact.
Gremlin: Chaos Engineering
The resilience of our “product-in-the-making”
In February 2019, Kyle and Casey came on site to run a GameDay to test the resilience of a new platform (work in progress). After an introduction to De Persgroep/Medialaan and the architecture of the platform, we talked with each squad involved in the project to cover which parts of the architecture they implemented and how they did it. We identified the tier 1 services and their most likely paths to failure. Experiments were defined and prepared to test those paths to failure.
“On a personal note, the teams and the platform performed extremely well for a first GameDay event. It is my assessment that with continued chaos testing, this platform will achieve the highest level of excellence and resilience observable, within the constraints of the budget and the target audience. The collaboration observed between teams, the willingness to implement and change their platforms to conform to best practices, puts these teams in a great position to succeed if given the resources to continue with chaos engineering.”
Kyle
GameDay: the test
So, how well did the squads deliver? How resilient will their software be under real-life production conditions? We tested this during the GameDay. We got into a war room in the morning. Each squad involved was represented.
We ran the experiments one by one. We validated some claims and refuted others. Pizza was delivered to fuel the troops. By the end of the day we had a 30-page document filled with results and findings. How the hell can service X impact service Y and make customer-facing frontend Z go bananas?
People were surprised and excited. Technical curiosity brought the excitement. The surprises were less welcome. Still, teams were happy to be surprised during a GameDay rather than during launch. We actually saved some error budget and spared some money that would otherwise have been burned later in ’19 on fixing incidents.
To give you an idea: there were at least 35 chaos findings that were unacceptable before going live with the platform. Minor packet loss that brings down a whole tier 1 service, resulting in complete downtime of the product? Minor datastore latency bringing down half of the features of the product?
Knowing that these minor failure modes are rather common when running your software in the cloud, it’s a rather cheap upfront investment to keep them from happening in real production, right?
This is just the beginning, however.
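To make the datastore latency case a bit more tangible, here is a minimal sketch in Python of what such an experiment looks like in spirit. It is not the Gremlin tooling and not our platform code: `read_from_datastore`, `get_feature`, `run_experiment`, the `"homepage"` key and the latency numbers are all made up for illustration. The idea is simply to inject a delay into a dependency and check whether the caller falls back gracefully instead of hanging.

```python
import concurrent.futures
import time


def read_from_datastore(key, injected_latency_s=0.0):
    """Hypothetical datastore read; the sleep stands in for injected latency."""
    time.sleep(injected_latency_s)
    return f"value-for-{key}"


def get_feature(key, injected_latency_s, timeout_s, fallback):
    """Read with a timeout and return a fallback instead of hanging the caller."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(read_from_datastore, key, injected_latency_s)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        # Do not wait for the slow call; the worker finishes in the background.
        executor.shutdown(wait=False)


def run_experiment(injected_latency_s, timeout_s=0.2):
    """Inject latency and report whether the caller degraded gracefully."""
    fallback = "cached-homepage"  # assumed fallback value, for illustration only
    value = get_feature("homepage", injected_latency_s, timeout_s, fallback)
    return {"value": value, "degraded": value == fallback}


if __name__ == "__main__":
    print(run_experiment(injected_latency_s=0.0))  # healthy datastore
    print(run_experiment(injected_latency_s=0.5))  # slow datastore
```

A service without the timeout and fallback would simply wait on the slow read, which is the kind of behavior a GameDay is meant to surface before launch.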
Chaos Engineering at De Persgroep IT
We bought tooling to make this failure injection as easy as one, two, three. Chaos Monkey on steroids is what we now have available to us. It’s going to be a rather slow process to get adoption throughout our whole company, but expect to see more chaos engineering happening in ’19. Each squad of De Persgroep/Medialaan will be tracked through an adoption/maturity level, and there will be application scorecards to track progress.
Dear colleagues and friends: remember why we invest in these practices of chaos engineering:
- We need more resilient systems.
- Software has become more complex over the years.
- Many distributed components and a lot of inter-communication between them.
- Fixing incidents is a lot more expensive than inducing them upfront through practices of chaos engineering.
- And you can’t even put a price on publicly losing face because of downtime.
Want to learn more?
Check out Kyle’s talk, Chaos Engineering: from buzzword to GameDay!