How Netflix forces itself to prepare for site emergencies and outages.
Buried deep in a technical post on Netflix's use of Amazon Web Services is this gem of a quote:
One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
Netflix is saying that it built a system to intentionally break things at random, so that its engineers get good at graceful degradation and at responding to emergencies. Gutsy move, but here's what they gain from it:
We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.
If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.
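To make the two quoted ideas concrete, here is a minimal Python sketch of a caller that tolerates a dead dependency, plus a toy "chaos monkey" that kills services at random. It is not Netflix's code. Every name in it (RecommendationService, ServiceDown, chaos_monkey, titles_for_home_page, POPULAR_TITLES) is hypothetical and invented for illustration, and a real system would kill actual instances rather than flip a flag.

```python
import random

# Hypothetical placeholder data standing in for a "popular titles" list.
POPULAR_TITLES = ["Popular A", "Popular B", "Popular C"]


class ServiceDown(Exception):
    """Raised when a dependency is unavailable."""


class RecommendationService:
    """Hypothetical stand-in for a personalized recommendations backend."""

    def __init__(self):
        self.alive = True

    def picks_for(self, user_id):
        if not self.alive:
            raise ServiceDown("recommendations unavailable")
        return [f"Pick for {user_id} #{i}" for i in range(1, 4)]


def chaos_monkey(services, rng, kill_probability=0.5):
    """Randomly mark services as dead and return the ones that were killed."""
    killed = []
    for service in services:
        if rng.random() < kill_probability:
            service.alive = False
            killed.append(service)
    return killed


def titles_for_home_page(user_id, recommendations):
    """Always respond: fall back to popular titles if personalization fails."""
    try:
        return recommendations.picks_for(user_id)
    except ServiceDown:
        # Degraded but still useful response.
        return POPULAR_TITLES


if __name__ == "__main__":
    recs = RecommendationService()
    chaos_monkey([recs], random.Random(), kill_probability=0.5)
    # Prints either personalized picks or the popular fallback, never an error.
    print(titles_for_home_page("alice", recs))
```

The point of running something like chaos_monkey routinely is that the fallback branch gets exercised all the time, instead of only during a real outage.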
via 5 Lessons We’ve Learned Using AWS