Netflix’s Chaos Engineering Should Be Mandatory—Everywhere


Most enterprises hire people to fix things. Netflix hires people to break things. 

Over and over. And over.

Rather than look at Netflix as some bizarre Silicon Valley curiosity, relevant only to those who live between 280 and 101, we should instead embrace Netflix’s culture of “chaos engineering” throughout organizations of all shapes and sizes. The world is moving to the cloud, and the cloud will break. It’s far better for you to break your cloud applications than to have your customers discover the flaws.

Doubling Down On Chaos

Just how does Netflix operate? 

For one thing, Netflix tends to engineer in the open. In fact, much of Netflix’s influence on the industry has less to do with the quality of its software, which has generally been good, than with its willingness to share liberally. Netflix’s GitHub page is filled with an array of interesting open-source projects. The company also regularly speaks at industry events and stages its own. The Netflix tech blog is highly informative too, explaining, for instance, how the company uses Hadoop.

See also: Chaos Monkey—How Netflix Uses Random Failure To Ensure Success

But Netflix’s flirtation with cloud chaos might be the most interesting of all. Since at least 2010, Netflix has deliberately set out to introduce failure into its cloud systems running on Amazon Web Services using its “Simian Army.” While the “monkeys” do different things, one key member is Chaos Monkey, open sourced in 2012.

This particular primate runs in the Amazon cloud, where it seeks out what Amazon calls “auto scaling groups” (pools of virtual servers that add capacity as demand rises and release it as demand ebbs) and then randomly crashes virtual servers running within those groups. Presto! It’s a great test of how fault-tolerant your application is.
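To make the idea concrete, here is a minimal sketch of the random-termination loop in Python. It is not Netflix's Chaos Monkey code and it does not call any AWS API: the `AutoScalingGroup` class, the `unleash_monkey` function, and the instance IDs are all hypothetical, in-memory stand-ins chosen for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class AutoScalingGroup:
    """Hypothetical in-memory stand-in for an auto scaling group."""
    name: str
    instances: list = field(default_factory=list)


def unleash_monkey(groups, probability, rng):
    """Terminate at most one randomly chosen instance per group.

    Each non-empty group is hit with the given probability. Returns the
    (group name, instance id) pairs that were terminated.
    """
    terminated = []
    for group in groups:
        if group.instances and rng.random() < probability:
            victim = rng.choice(group.instances)
            group.instances.remove(victim)  # simulated crash, nothing real is touched
            terminated.append((group.name, victim))
    return terminated


if __name__ == "__main__":
    groups = [
        AutoScalingGroup("api", ["i-a1", "i-a2", "i-a3"]),
        AutoScalingGroup("web", ["i-w1", "i-w2"]),
    ]
    for name, instance in unleash_monkey(groups, 0.5, random.Random()):
        print(f"terminated {instance} in {name}")
```

The point of the exercise is what happens next: if the application keeps serving requests after an instance disappears, it passed; if not, you found the flaw before your customers did.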

Now Netflix is taking this to a new level by institutionalizing chaos engineering, with an entire army of engineers dedicated to trying to break Netflix. As “chaos commander” Bruce Wong explains:

Our philosophy remains unchanged around injecting failure into production to ensure our systems are fault-tolerant. We are constantly testing our ability to survive “once in a blue moon” failures. In a sign of our commitment to this very philosophy, we want to double down on chaos aka failure-injection. We strive to mirror the failure modes that are possible in our production environment and simulate these under controlled circumstances. Our engineers are expected to write services that can withstand failures and gracefully degrade whenever necessary. By continuing to run these simulations, we are able to evaluate and improve such vulnerabilities in our ecosystem.

This is smart. After all, as Viktor Klang, chief architect at Typesafe, suggests, “Resilience has to be designed. Has to be tested.” It’s not something that happens around a table as a slew of exceptional engineers architect the perfect system. Perfection comes through repeatedly trying to break the system.

In this way, it’s not unlike Karl Popper’s definition of science: “the scientific status of a theory is its falsifiability, or refutability, or testability.” In the case of systems, the true test of their health is whether they can withstand random, constant attempts to break (or “falsify”) them.

This is what Netflix does. And as more enterprises move to the cloud, it’s precisely what mainstream companies should be doing, too.

Introducing Your Company To Chaos

Most enterprises don’t aspire to chaos. And yet whether running applications in data centers or public clouds, chaos happens. So why not embrace it?

This will become particularly important given the shift to public cloud environments. As Red Hat’s Dave Neary suggests, “To move applications to cloud, [you must] move resiliency from platform to application.” In other words, rather than counting on pristine cloud infrastructure, you assume it will break and push resiliency into the application.
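What does resiliency in the application look like? One common pattern is to retry a failing dependency a few times and then degrade gracefully to a simpler answer. The Python sketch below illustrates that pattern under stated assumptions: `call_with_fallback`, `FlakyService`, and the recommendation strings are hypothetical names invented for this example, not part of any Netflix or Red Hat library.

```python
def call_with_fallback(primary, fallback, attempts=3):
    """Try `primary` up to `attempts` times, then degrade to `fallback`."""
    for _ in range(attempts):
        try:
            return primary()
        except ConnectionError:
            continue  # assume the failure is transient and try again
    return fallback()


class FlakyService:
    """Hypothetical dependency that fails a fixed number of times first."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def recommendations(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("injected failure")
        return ["personalized picks"]


if __name__ == "__main__":
    service = FlakyService(failures=5)
    # The dependency stays down for all three attempts, so the caller
    # degrades to a generic list instead of returning an error.
    print(call_with_fallback(service.recommendations, lambda: ["popular titles"]))
```

The user gets a less tailored answer rather than an error page, which is the kind of graceful degradation that failure injection is meant to verify.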

See also: DevOps—The Future Of DIY IT

As but one indication of the push into public cloud, Gartner analyst Lydia Leong indicates she’s seeing a significant shift to enterprises skipping hybrid clouds and going “all in” on public cloud.

As this happens, enterprises are going to need to get comfortable with chaos. Call it DevOps or whatever you want; there’s a cultural mind shift necessary to effectively architect applications for the cloud, just as there was to embrace open-source development. 

Empowering Developers To Create Chaos

Already we’re seeing this as developers, tasked by lines of business to “get stuff done,” bypass traditional purchasing and vendor channels, as Forrester analyst Jeffrey Hammond details:

Traditional software companies are essentially creating the last generation of fine sailing ships just as the age of steam power takes over, with steam power being the social developer model. These developers are able to make informed decisions through the web, then buy when they want to.

This is translating into increased developer-led purchases of cloud services and associated hardware and software. Those have grown from 8% of all purchases in 2008 to 10.4% in 2015, even as IT-only purchases are dropping from 23.7% in 2013 to 21.6% in 2015, according to Forrester. It’s not happening overnight, but it’s happening, and cloud and open source are the two primary reasons for developers’ newfound independence.

The next step is to institutionalize chaos, perhaps by embracing Netflix’s open-source Simian Army. But really it’s not so much a matter of technology as it is of culture. Telling your developers to expect and foster failure as a way to drive resilience into your cloud systems is a big step on the path to engineering in the 21st century. Time to get started.

Lead photo by Kevin Dooley
