Dancing With the Chaos Monkey

[Image: Chaos Monkey. Caption: This monkey may look angry, but he's got a super-cool name!]

Fail fast, sure, but proactively shooting my products in the head!?!


In 2010 Netflix moved their entire stack over to Amazon Web Services. They were one of AWS’s first really big customers, and I bet there still aren’t too many bigger. In December of that year they blogged about the trials and tribulations of this move and made a rather startling confession about what they called the Chaos Monkey. Yeah, strange name!

One of the first bits of code they wrote for the move was a little piece of heaven they called the Chaos Monkey. This code’s sole job was to kill off servers or services within their stack at random times. The theory was that every portion of their implementation should be able to stand on its own at some level of service. To quote:

If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.
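To make that idea concrete, here is a minimal sketch of a monkey and a page that degrades gracefully when the monkey strikes. This is not Netflix's code. Every name in it (Service, Recommendations, ChaosMonkey, home_page, POPULAR_TITLES) is made up for illustration, and "killing" a service here just flips a flag in memory rather than terminating a real server.

```python
import random


class Service:
    """Hypothetical stand-in for one server or service in the stack."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def call(self, *args):
        # A dead service fails loudly, the way a killed server would.
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.handle(*args)

    def handle(self, *args):
        raise NotImplementedError


class Recommendations(Service):
    """Hypothetical recommendations service."""

    def handle(self, user):
        return [f"personalized pick for {user}"]


class ChaosMonkey:
    """Kills one running service, chosen at random, each time it is unleashed."""

    def __init__(self, services, rng=None):
        self.services = services
        self.rng = rng or random.Random()

    def unleash(self):
        running = [s for s in self.services if s.alive]
        if not running:
            return None  # nothing left to kill
        victim = self.rng.choice(running)
        victim.alive = False
        return victim


# Assumed fallback data; a real system would keep this somewhere safer.
POPULAR_TITLES = ["popular title 1", "popular title 2"]


def home_page(user, recommendations):
    """Degrade to popular titles when recommendations fail, but still respond."""
    try:
        return recommendations.call(user)
    except ConnectionError:
        return POPULAR_TITLES


if __name__ == "__main__":
    recs = Recommendations("recommendations")
    monkey = ChaosMonkey([recs])
    print(home_page("alice", recs))  # personalized while the service is up
    monkey.unleash()                 # only one service, so it is the victim
    print(home_page("alice", recs))  # falls back to popular titles
```

The point of the sketch is the try/except in home_page: the caller assumes the dependency will die at some point and already knows what to serve when it does.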

I found this tidbit through a blog entry written by Jeff Atwood, co-founder of Stack Exchange. It appears that Jeff has had some experience dancing with the chaos monkey himself, and has several good tips. But curiously the monkey goes way back, back to the very early ages of the information revolution. All the way back to 1983 (I had just graduated high school the year before!) when Apple's Macintosh developers needed a way to stress test MacPaint. This app was so memory intensive, and the hardware so memory limited, that they often ran at the very limits of what the Macintosh system could handle. So they coded in a feature where the application could work itself, at a rather furious pace, until memory failure or other disaster struck. Then they would patch that hole as best as they could. Very interesting.

At its heart the monkey is a brilliant concept: Fear of failure? Hell no, we fail daily. We embrace it. We LOVE the failure. It only makes us stronger. But in its execution it is very subtle. The obvious effect of a monkey is that your teams take a completely different design approach. If you build it believing it WILL fail then you build a little differently. Jeff wrote about this in his blog entry.

But I come at this from a different point of view. When you aren’t dancing with your own monkey you are at the whims of fate; failures come randomly and hopefully seldom. You would think this would be a good thing, right? It isn’t. This means that when your systems fail you may or may not have teams on hand who are experienced with how they should recover. You haven’t practiced failure, therefore recovery becomes uncertain at best, disastrous at worst.

I had the pleasure of spending eight and a half years in the US Navy, or Uncle Sam’s Canoe Club as we liked to call it. Every single system on the ship had a playbook for what happened if/when that system took damage. Scenarios ranged from lowered response times to outright failure of the system to complete loss of the ship. This is no different from almost every IT department’s playbook, but here’s the kicker… we practiced our playbook. Extensively, repeatedly, often and at all hours of the night. When failure did happen we all knew, without hesitation, what we had to do, how we had to handle it. And I just don’t see that in today’s corporate world.

In every shop I’ve ever worked at they avoided service failure at all costs. They worked overtime to keep services up. And that is as it should be. But the monkey forces a perspective where failure absolutely will walk in and invite you to dance. Take that point of view and then throw your monkey into your production stack like Netflix did and you get something amazing. Now the dances have an actual cost in terms of your customers’ experience. There is no practice, there is only do. And that, ladies and gentlemen, is the very heart of agility.

My current engagement is a data warehouse, and we use AWS as well. We’re just starting to plan another release cycle and I’m going to start talking with people about building a monkey. Maybe I can expense a pet monkey as a mascot? Who knows!

Monday, 25 October 2021